One of the most common tasks in Natural Language Processing (NLP) is to clean text data. In order to maximize your results, it's important to distill you text to the most important root words in the corpus. This post will show how I typically accomplish this. The following are general steps in text preprocessing:
- Tokenization: Tokenization breaks the text into smaller units vs. large chunks of text. We understand these units as words or sentences, but a machine cannot until they’re separated. Special care has to be taken when breaking down terms so that logical units are created. Most software packages handle edge cases (U.S. broke into the US and not U and S), but it’s always essential to ensure it’s done correctly.
- Cleaning: The cleaning process is critical to removing text and characters that are not important to the analysis. Text such as URLs, noncritical items such as hyphens or special characters, web scraping, HTML, and CSS information are discarded.
- Removing Stop Words: Next is the process of removing stop words. Stop words are common words that appear but do not add any understanding. Words such as “a” and “the” are examples. These words also appear very frequently, become dominant in your analysis, and obscure the meaningful words.:
- Spelling: Spelling errors can also be corrected during the analysis. Depending on the medium of communication, there might be more or fewer errors. Official corporate or education documents most likely contain fewer errors, where social media posts or more informal communications like email can have more. Depending on the desired outcome, correcting spelling errors or not is a critical step.
- Stemming and Lemmatization: Stemming is the process of removing characters from the beginning or end of a word to reduce it to their stem. An example of stemming would be to reduce “runs” to “run” as the base word dropping the “s,” where “ran” would not be in the same stem. However, Lemmatization would classify “ran” in the same lemma.
The following is a script that I’ve been using to clean a majority of my text data.
import pandas as pd import re import string from bs4 import BeautifulSoup import nltk from nltk.stem import PorterStemmer from nltk.stem.wordnet import WordNetLemmatizer import spacy
Removing HTML is optional and depending on what your data source is. I’ve found beautiful soup is the best way to clean this versus RegEx.
def clean_html(html): # parse html content soup = BeautifulSoup(html, "html.parser") for data in soup(['style', 'script', 'code', 'a']): # Remove tags data.decompose() # return data by retrieving the tag content return ' '.join(soup.stripped_strings)
Cleaning the Rest
Now the workhorse.
- Make the text lowercase. As you probably know, NLP is case-sensitive.
- Remove line breaks. Again, depending on your source, you might have encoded line breaks.
- Remove punctuation. This is using the string library. Other punctuation can be added as needed.
- Remove stop words using the NLTK library. There is a list in the next line to add additional stop words to the function as needed. These might be noisy domain words or anything else that makes the contextless clear.
- Removing numbers. Optional depending on your data.
- Stemming or Lemmatization. This process is an argument in the function. You can choose either one via with
Lem. The default is to use none.
# Load spacy nlp = spacy.load('en_core_web_sm') def clean_string(text, stem="None"): final_string = "" # Make lower text = text.lower() # Remove line breaks text = re.sub(r'\n', '', text) # Remove puncuation translator = str.maketrans('', '', string.punctuation) text = text.translate(translator) # Remove stop words text = text.split() useless_words = nltk.corpus.stopwords.words("english") useless_words = useless_words + ['hi', 'im'] text_filtered = [word for word in text if not word in useless_words] # Remove numbers text_filtered = [re.sub(r'\w*\d\w*', '', w) for w in text_filtered] # Stem or Lemmatize if stem == 'Stem': stemmer = PorterStemmer() text_stemmed = [stemmer.stem(y) for y in text_filtered] elif stem == 'Lem': lem = WordNetLemmatizer() text_stemmed = [lem.lemmatize(y) for y in text_filtered] elif stem == 'Spacy': text_filtered = nlp(' '.join(text_filtered)) text_stemmed = [y.lemma_ for y in text_filtered] else: text_stemmed = text_filtered final_string = ' '.join(text_stemmed) return final_string
To apply this to a standard data frame, use
apply function from Pandas like below. Let's take a look at the starting text:
<p><a href="https://forge.autodesk.com/en/docs/data/v2/tutorials/download-file/#step-6-download-the-item" rel="nofollow noreferrer">https://forge.autodesk.com/en/docs/data/v2/tutorials/download-file/#step-6-download-the-item</a></p>\n\n<p>I have followed the tutorial and have successfully obtained the contents of the file, but where is the file being downloaded. In addition, how do I specify the location of where I want to download the file?</p>\n\n<p>Result on Postman\n<a href="https://i.stack.imgur.com/VrdqP.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/VrdqP.png" alt="enter image description here"></a></p>\n
Let's start by cleaning the HTML.
# To remove HTML first and apply it directly to the source text column. df['body'] = df['body'].apply(lambda x: clean_html(x))
After applying the function to clean HTML, this is the result:
I have followed the tutorial and have successfully obtained the contents of the file, but where is the file being downloaded. In addition, how do I specify the location of where I want to download the file? Result on Postman
Next let's apply the
# Next apply the clean_string function to the text df['body_clean'] = df['body'].apply(lambda x: clean_string(x, stem='Stem'))
And the final resulting text:
follow tutori success obtain content file file download addit specifi locat want download file result postman
Fully clean and ready to use in your NLP project.
Note: I often create a new column like above,
body_clean, so I preserve the original in case punctuation is needed.
And that’s about it. The order in the above function does matter. You should complete certain steps before others, such as making lowercase first. The function contains one RegEx example for removing numbers; a solid utility function that you can adjust to remove other items from the text using RegEx.
Spacy vs. NLKT Lemmatization
The above function contains two different ways to Lemmatize your text. The NLTK
WordNetLemmatizer requires a Part of Speech (POS) argument (
verb) and therefore either requires multiple passes to get each word or will only capture one POS. The alternative is to use
Spacy which will automatically lemmatize each word and determine which POS it belongs to. The issue is that Spacy's performance will be signficantly slower than NLTK.
If you liked what you read, subscribe to my newsletter and you will get my cheat sheet on Python, Machine Learning (ML), Natural Language Processing (NLP), SQL, and more. You will receive an email each time a new article is posted.