Text Data Pre-Processing using Python

Important Part of Any Natural Language Processing Project

Text pre-processing is one of the important steps in building any good natural language processing model. In this post, we are going to talk about some of the essential text pre-processing steps that we generally use in almost all natural language processing tasks. By the end of this post, you will have learned the following text pre-processing techniques:

  1. Removing HTML tags from the text data.
  2. Removing URLs from the text data.
  3. Removing punctuation and special characters such as #, ;, "", etc. from the dataset.
  4. Removing stopwords such as is, am, are, the, etc.
  5. Changing short words to full words, like don't -> do not.

Import All the Required Libraries:

Here we are importing all the required libraries to preprocess the data.
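
The original import list is not shown here, so as an assumption the sketches in this post rely only on Python's standard library. A minimal set of imports could look like this:

```python
import re                  # regular expressions for URLs, tags, and punctuation
from html import unescape  # turns entities such as &amp; back into characters
```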

Removing the URL from the text data:

Most of the time, URLs carry little information for text-based modeling, so simply removing them is useful.
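
One possible way to do this is with a regular expression. The sketch below assumes that a URL is anything starting with http://, https://, or www. and running up to the next whitespace; the name `remove_urls` and the pattern are illustrative choices, not the only way to match URLs.

```python
import re

# Assumed pattern: http://, https://, or www. followed by non-space characters.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def remove_urls(text: str) -> str:
    """Replace every URL with a space, then collapse repeated whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_urls).strip()

print(remove_urls("Machine learning is fun https://example.com/ml"))
# Machine learning is fun
```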

Removing the HTML tags from the text data:

HTML tags act as noise in the text data, so removing them from the dataset may give better results than leaving them in.
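
A minimal sketch of tag removal using only the standard library is shown here. It assumes the markup is simple enough for a regular expression; for messy real-world HTML, a proper parser such as BeautifulSoup is usually more robust. The function name `remove_html_tags` is a hypothetical choice.

```python
import re
from html import unescape

# Assumed pattern: anything between < and > counts as a tag.
TAG_PATTERN = re.compile(r"<[^>]+>")

def remove_html_tags(text: str) -> str:
    """Strip tags, decode entities like &amp;, and tidy up whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    decoded = unescape(without_tags)
    return re.sub(r"\s+", " ", decoded).strip()

print(remove_html_tags("<h1>Machine Learning is a <b>Fun</b></h1>"))
```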

Output: "Machine Learning is a Fun"

Remove Punctuation and Special Character:

Punctuation and special symbols generally do not contain much information, so we can remove them from the dataset for the final modeling.
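
As a rough sketch, we can keep only letters, digits, and whitespace and replace everything else with a space. This assumes English (ASCII) text, so accented or non-Latin letters would also be dropped. Because apostrophes are removed too, it is safer to expand words like don't before running this step. The name `remove_punctuation` is illustrative.

```python
import re

def remove_punctuation(text: str) -> str:
    """Keep ASCII letters, digits, and whitespace; drop everything else."""
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Collapse the gaps left behind by the removed characters.
    return re.sub(r"\s+", " ", cleaned).strip()

print(remove_punctuation('Machine learning is fun!!! #;"'))
```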

Output: "Machine learning is fun"

Remove Stopwords:

Here the stopwords that we want to remove from our text data are provided manually as a set, so we can change them according to our needs.
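
A small sketch of this idea follows. The stopword set here is only an assumed sample, and `remove_stopwords` is a hypothetical name; words are split on whitespace and compared in lowercase, so punctuation attached to a word (like "Fun,") stays in place.

```python
# Assumed sample list; extend or shrink it for your own data.
STOPWORDS = {"is", "am", "are", "the", "a", "an", "and", "we", "can"}

def remove_stopwords(text: str, stopwords=STOPWORDS) -> str:
    """Drop every whitespace-separated word found in the stopword set."""
    kept = [word for word in text.split() if word.lower() not in stopwords]
    return " ".join(kept)

print(remove_stopwords("Machine Learning is Fun, we can enjoy and earn well."))
```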

Output: "Machine Learning Fun, enjoy earn well."

Changing Short Words to Full Words:

In this step, we will convert words like don't to do not, haven't to have not, etc.
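
One way to sketch this is with a short list of regular-expression rules applied in order, special cases first. The rule list below is an assumption and far from complete; for example, it leaves 's alone because that can mean either "is" or a possessive, and it only handles lowercase forms.

```python
import re

# Assumed rules; order matters, so irregular forms come before the generic n't.
CONTRACTION_RULES = [
    (r"\bwon't\b", "will not"),
    (r"\bcan't\b", "cannot"),
    (r"n't\b", " not"),
    (r"'re\b", " are"),
    (r"'ll\b", " will"),
    (r"'ve\b", " have"),
    (r"'m\b", " am"),
]

def expand_contractions(text: str) -> str:
    """Rewrite short forms such as don't as their full forms."""
    text = text.replace("\u2019", "'")  # normalize curly apostrophes first
    for pattern, replacement in CONTRACTION_RULES:
        text = re.sub(pattern, replacement, text)
    return text

print(expand_contractions("I haven't learn machine learning yet."))
```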

Output: "I have not learn machine learning yet."

All the Steps Combined:

All of the above steps can be combined into a single method. Define the set of stopwords, pass in a raw sentence, and the method returns the cleaned text in the form of a sentence.
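
The following is a minimal sketch of such a combined method, under the same assumptions as the individual steps (regex-based tag and URL removal, a hand-written stopword set, a small contraction list, ASCII-only text). The name `clean_text` and the order of the steps are illustrative choices; contractions are expanded before punctuation is removed so that the apostrophes are still there to match.

```python
import re
from html import unescape

# Assumed sample stopwords; replace with your own set.
STOPWORDS = {"is", "am", "are", "the", "a", "an", "and"}

# Assumed, incomplete contraction rules (irregular forms first).
CONTRACTION_RULES = [
    (r"\bwon't\b", "will not"),
    (r"\bcan't\b", "cannot"),
    (r"n't\b", " not"),
    (r"'re\b", " are"),
    (r"'ve\b", " have"),
]

def clean_text(raw: str, stopwords=STOPWORDS) -> str:
    text = re.sub(r"<[^>]+>", " ", raw)                   # 1. HTML tags
    text = unescape(text)                                 #    and entities
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text)   # 2. URLs
    text = text.replace("\u2019", "'")                    # 3. short words
    for pattern, replacement in CONTRACTION_RULES:
        text = re.sub(pattern, replacement, text)
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)           # 4. punctuation
    words = [w for w in text.split() if w.lower() not in stopwords]  # 5. stopwords
    return " ".join(words)

print(clean_text("<p>I haven't learned the basics yet, see https://example.com/ml !</p>"))
# I have not learned basics yet see
```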

Exercise that you can do:
  1. Download the dataset from Kaggle: https://www.kaggle.com/taniaj/australian-election-2019-tweets
  2. Perform all the above steps on this dataset.
  3. After performing the above steps just comment in the comment section and let us know about the analysis.

If you are stuck on any of the text pre-processing steps, just comment below and we will add it as soon as possible.

This article is contributed by Karan Kumar Rajput.