Text preprocessing is one of the most important steps in building a good natural language processing model. In this post, we are going to talk about some of the essential text preprocessing steps that are used in almost all natural language processing tasks. By the end of this post, you will have learned the following text preprocessing techniques:
- Remove HTML tags from the text data.
- Remove URLs from the text data.
- Remove punctuation and special characters such as #, ;, and "" from the dataset.
- Remove stopwords such as is, am, are, and the.
- Expand short forms (contractions) into full words, e.g. don't -> do not.
Here we are importing all the required libraries to preprocess the data.
Most of the time, URLs carry little useful information for text-based modeling, so simply removing them is usually helpful.
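A minimal sketch of URL removal using only the standard `re` module. The pattern, the function name `remove_urls`, and the sample sentence are assumptions made for illustration; they are not taken from the original post.

```python
import re

# Assumed pattern: anything that starts with http://, https:// or www.
# and runs up to the next whitespace character.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def remove_urls(text: str) -> str:
    """Delete URLs, then collapse the extra whitespace they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())


# Hypothetical sample input.
print(remove_urls("Read more at https://example.com/nlp about text cleaning"))
```

Because the pattern stops only at whitespace, punctuation glued to the end of a URL (for example a trailing full stop) is removed together with it. That is usually acceptable for this kind of cleaning, but it is a simplification.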
HTML tags act as noise in text data. Removing them from the dataset may therefore give better results than leaving them in.
Output: "Machine Learning is a Fun"
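One way to strip HTML tags, sketched with the standard library's `html.parser` so that no extra package is needed. The class and function names are hypothetical, and the sample input is an assumption chosen so that the result matches the output quoted above.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text that appears between tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def remove_html_tags(text: str) -> str:
    """Return the text with tags dropped and whitespace normalised."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


# Hypothetical sample input.
print(remove_html_tags("<p>Machine Learning is a <b>Fun</b></p>"))
```

This sketch keeps the text inside `<script>` and `<style>` elements, so pages that contain them would need an extra filtering step.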
Punctuation and special symbols generally do not contain much information, so we can remove them from the dataset for the final modeling.
Output: "Machine learning is fun"
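A short sketch of punctuation removal with `str.translate` and `string.punctuation`. The function name and the sample input are assumptions; the input was picked to reproduce the output quoted above.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text: str) -> str:
    """Drop ASCII punctuation and collapse any repeated spaces."""
    return " ".join(text.translate(_PUNCT_TABLE).split())


# Hypothetical sample input.
print(remove_punctuation('#Machine learning; is "fun"!'))
```

`string.punctuation` covers ASCII symbols only, so typographic quotes and similar characters are left in place. It also includes the apostrophe, which turns don't into dont, so short forms are best expanded before this step runs.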
In this step, we manually provide all the stopwords that we want to remove from our text data. The set can be changed according to our needs.
Output: "Machine Learning Fun, enjoy earn well."
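A minimal sketch of stopword removal with a hand-written set. The contents of `STOPWORDS`, the function name, and the sample sentence are assumptions for illustration; the sentence was chosen so that the result matches the output quoted above.

```python
# Assumed stopword set; extend or shrink it to suit the task.
STOPWORDS = {"is", "am", "are", "the", "a", "an", "and", "so", "you", "can", "it"}


def remove_stopwords(text: str, stopwords=STOPWORDS) -> str:
    """Keep only the whitespace-separated words that are not stopwords."""
    kept = [word for word in text.split() if word.lower() not in stopwords]
    return " ".join(kept)


# Hypothetical sample input.
print(remove_stopwords("Machine Learning is Fun, so you can enjoy it and earn well."))
```

Words are compared in lower case, so "Is" and "is" are treated alike. A stopword with punctuation attached (for example "is,") does not match, which is one reason to remove punctuation before this step.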
In this step, we will convert words like don't to do not, haven't to have not, etc.
Output: "I have not learn machine learning yet."
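A small sketch of expanding short forms with a lookup table and one regular expression. The `CONTRACTIONS` mapping is a deliberately tiny, assumed sample, and the function name and input sentence are hypothetical; the input reproduces the output quoted above.

```python
import re

# Assumed, deliberately small mapping; a real project would need a longer list.
CONTRACTIONS = {
    "don't": "do not",
    "doesn't": "does not",
    "haven't": "have not",
    "isn't": "is not",
    "can't": "cannot",
    "won't": "will not",
}

# One alternation over all keys, matched on word boundaries, case-insensitively.
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Replace each known short form with its full form."""
    text = text.replace("’", "'")  # normalise typographic apostrophes
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


# Hypothetical sample input.
print(expand_contractions("I haven't learn machine learning yet."))
```

The replacement is always lower case, so "Don't" becomes "do not". That is harmless if the text is lower-cased later, but it is worth knowing when case matters.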
Finally, define the set of stopwords and combine the steps above into a single method that takes a raw sentence and returns the cleaned text as a sentence.
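The steps can be chained into one function, sketched below under several assumptions: the name `clean_text`, the order of the steps, the short `STOPWORDS` and `CONTRACTIONS` samples, and the regex-based tag stripping are all illustrative choices rather than the original author's code.

```python
import re
import string
from html import unescape

# Assumed sample sets; replace them with the ones that suit your data.
STOPWORDS = {"is", "am", "are", "the", "a", "an", "and"}
CONTRACTIONS = {"don't": "do not", "haven't": "have not", "can't": "cannot"}

# Simplified tag pattern: it would also eat literal text such as "a < b > c".
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(raw: str, stopwords=STOPWORDS) -> str:
    """Apply all cleaning steps to one raw sentence and return a sentence."""
    text = unescape(TAG_RE.sub(" ", raw))      # 1. strip HTML tags and entities
    text = URL_RE.sub(" ", text)               # 2. strip URLs
    text = text.replace("’", "'")              # normalise apostrophes
    text = CONTRACTION_RE.sub(                 # 3. expand short forms
        lambda m: CONTRACTIONS[m.group(0).lower()], text
    )
    text = text.translate(PUNCT_TABLE)         # 4. strip ASCII punctuation
    words = [w for w in text.split() if w.lower() not in stopwords]  # 5. stopwords
    return " ".join(words)


# Hypothetical raw sentence.
raw = "<p>I haven't read the guide at https://example.com/nlp yet, but #NLP is fun!</p>"
print(clean_text(raw))
```

The order matters: short forms are expanded before punctuation is stripped so that the apostrophe is still present, and stopwords are removed last so that attached punctuation no longer hides them.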
Exercise that you can do:
- Download the dataset from Kaggle: https://www.kaggle.com/taniaj/australian-election-2019-tweets
- Perform all the above steps on this dataset.
- After performing the above steps, leave a comment in the comment section and let us know about your analysis.
If you are stuck on any of the text preprocessing steps, just comment below and we will add an explanation as soon as possible.
This article is contributed by Karan Kumar Rajput.