Text Data Exploration

The cleaning step

Antoine Ghilissen



For this article, I wanted to cover a few key concepts unique to handling text data.

I say unique to NLP because, unlike the pre-processing of visual or audio data, the pre-processing of text data requires a few more steps. Even if the data is valid (by valid, I mean actual text characters), it might not be in a language you understand (e.g. English), it might not use the Latin alphabet, or it might not be encoded properly (in UTF-8, for example). If you are dealing with web-scraped data, most of the time you'll encounter <html> or other formatting tags that need to be addressed, as they are not part of the content per se. That is not to say this information is irrelevant, as it can reveal useful patterns, but if it is treated like any other word, it will most likely generate noise. So it is important to pick these up early.
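Here is a minimal sketch of two of these early checks, assuming the raw input arrives as bytes. The helper names is_valid_utf8 and non_ascii_ratio are made up for illustration, and a high share of non-ASCII characters is only a rough hint that a document may not be plain English.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def non_ascii_ratio(text: str) -> float:
    """Share of characters outside the ASCII range (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ord(ch) > 127 for ch in text) / len(text)


print(is_valid_utf8("café".encode("utf-8")))    # bytes produced as UTF-8
print(is_valid_utf8("café".encode("latin-1")))  # same word, different encoding
print(non_ascii_ratio("hello"))
```

Documents that fail the first check, or score high on the second, are good candidates for a closer manual look before any modelling.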

Unfortunately, one of the solutions for this kind of data processing is regular expressions, or regex... I would like to avoid them as much as possible, as they hurt my brain, but I have to admit there are a few easy-to-remember patterns that help greatly during both the cleaning and the tokenization processes.
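To give an idea of what such patterns can look like, here is a small sketch using Python's re module. The three patterns and the sample string are my own picks for illustration, not an exhaustive list. Note that the tag pattern is naive and would also swallow text such as "a < b and c > d".

```python
import re

TAG = re.compile(r"<[^>]+>")      # anything that looks like <tag ...>
WHITESPACE = re.compile(r"\s+")   # runs of spaces, tabs and newlines
WORD = re.compile(r"\w+")         # runs of letters, digits and underscores

raw = "<p>Hello,   world!</p>\r\n<br/>How are you?"

no_tags = TAG.sub(" ", raw)                        # cleaning: drop the markup
clean = WHITESPACE.sub(" ", no_tags).strip()       # cleaning: normalise spacing
tokens = WORD.findall(clean.lower())               # tokenization: keep the words

print(clean)
print(tokens)
```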


Before we start, let’s talk about the lingo:

  • Corpus — It is a collection of (similar) documents.
  • Document — It is a collection of sentences that have the same context. It can be a review, a paragraph, a log file, etc.
  • NLP — Natural Language Processing. It is an area of AI which deals with interpreting human language. In our context, it refers to the analysis of text data by a computer.
  • NLTK — Natural Language Toolkit. It is a widely used Python library for NLP.

The dataset

To support this article, I am going to use a dataset from Kaggle: 60k Stack Overflow Questions with Quality Rating. Ultimately I’ll probably try to classify them but as I’ve found this dataset today, it is exploration time!

Put your hat on!

The dataset covers 60,000 questions asked on Stack Overflow between 2016 and 2020. To link this back to the lingo, this is our corpus, and each individual question is a document.

Looking at the Tasks tab, we are challenged to classify SO questions based on the text quality, with a hint about the last column being added to ease a supervised classification.


It comes as a file named data.csv containing 60,000 rows and 5 columns (+ 1 Id column which will be our index). When displayed, it looks like this:

import pandas as pd
# 'Id' becomes the index, leaving the 5 data columns
df = pd.read_csv('data.csv', index_col='Id')
df.head()  # display the first few rows

As with most data, displaying the raw data is quite a revealing step. I think it is even more important with text data, because not only do we need to understand the metadata:

  • are there any NaN values? (df.describe() has already told me there are no gaps in the data)
  • are there any out-of-place values? (like a 0 meaning no record, for example)
  • etc.

But we also need to analyse and understand the actual values in each field. As we can see in the table above, the following columns are available:

  • Title which seems to only contain text data. It appears to be the subject of the post, in English.
  • Body which seems to contain text data but in various "formats/languages". We can see the <p> HTML tag and the \r\n sequence, which is probably the "code" cell formatting.
  • Tags which seems to contain the various tags one can flag a question with.
  • CreationDate which seems to be the posting date. The Kaggle description mentions editing of the post as a metric for post quality. This might be worth keeping in mind when dealing with the data.
  • Y which seems to be the manually added field, the one with the labels (for supervised classification).
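The checks described above can be sketched as follows. The tiny DataFrame is a made-up stand-in for the real data (including the label values), so the numbers it prints say nothing about the actual dataset.

```python
import pandas as pd

# Made-up stand-in for the real DataFrame, with one missing Body on purpose
sample = pd.DataFrame(
    {
        "Title": ["How to sort a list?", "Segfault in C"],
        "Body": ["<p>I tried sorted()</p>", None],
        "Y": ["HQ", "LQ_CLOSE"],
    }
)

missing_per_column = sample.isna().sum()                     # NaN count per column
empty_strings = (sample["Title"].str.strip() == "").sum()    # blank placeholder values
label_counts = sample["Y"].value_counts()                    # distribution of the labels

print(missing_per_column)
print(empty_strings)
print(label_counts)
```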

Text cleaning

The dataset seems to be quite neat; the only field really needing work is Body. The date will be parsed as a date later on, the tags can easily be extracted (who said word clouds? :angel:) and Y should probably be one-hot encoded for the classification step.
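As a rough sketch of those three side tasks, here is how they could look in pandas. I am assuming the tags are stored as a string of <tag> items and that the dates are plain timestamps. The sample rows and label values are invented for the example.

```python
import pandas as pd

# Invented rows; the Tags and CreationDate formats are assumptions
sample = pd.DataFrame(
    {
        "Tags": ["<python><pandas>", "<c>"],
        "CreationDate": ["2016-01-01 00:21:59", "2020-02-29 17:55:52"],
        "Y": ["HQ", "LQ_CLOSE"],
    }
)

# Parse the posting date into a proper datetime column
sample["CreationDate"] = pd.to_datetime(sample["CreationDate"])

# Extract every tag name found between angle brackets into a list
sample["TagList"] = sample["Tags"].str.findall(r"<([^>]+)>")

# One indicator column per label value
labels = pd.get_dummies(sample["Y"], prefix="Y")

print(sample[["CreationDate", "TagList"]])
print(labels)
```

The TagList column can then be flattened and counted, which is all a word cloud really needs.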

In the Body column, we can see that the <html> tags are enclosed in the \r\n formatting (or, when we are lucky, the code is placed within <code> tags), with the exception of the <p> tag, which delimits the Body field in most cases. We also notice that some characters are escaped: \'.

Because I am more interested in the human factor (but also because I am not qualified to review 60k snippets of code), I am going to replace the code cells with a tag that won't affect the structure of the text, without destroying the surrounding information. Maybe submitting some code is a sign of quality on SO, who knows?

There are a few ways to do this. One of them is to build a cleaning function and use the .apply(lambda x: cleaning_function(x)) method on the column to clean. This approach is better suited to more complex cleaning.
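A minimal sketch of that first approach, with a hypothetical cleaning_function that unescapes quotes and collapses whitespace. The function can be passed to .apply() directly, which is equivalent to wrapping it in a lambda.

```python
import pandas as pd


def cleaning_function(text: str) -> str:
    """Hypothetical cleaner: unescape quotes, then collapse all whitespace."""
    text = text.replace("\\'", "'")   # backslash + quote becomes a plain quote
    return " ".join(text.split())     # split() also swallows \r\n and tabs


bodies = pd.Series(["  I can\\'t   run\r\nthis  ", "Plain text"])
cleaned = bodies.apply(cleaning_function)

print(cleaned.tolist())
```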

Another way to clean text is to use the .replace() method on a pandas Series, which is straightforward and supports regex (when regex=True is passed).
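One detail that is easy to miss, sketched here on two made-up strings: without regex=True, Series.replace() only swaps cells whose whole value matches, so it will not touch a tag sitting inside a longer text. Series.str.replace() always works on substrings.

```python
import pandas as pd

bodies = pd.Series(["<p>Hello</p>", "a   b"])

# Without regex=True, only cells exactly equal to "<p>" would be replaced,
# so nothing changes here.
unchanged = bodies.replace("<p>", "")

# With regex=True, the pattern is applied inside each string.
no_tags = bodies.replace(r"</?p>", "", regex=True)

# Series.str.replace is the string-specific alternative.
single_spaces = no_tags.str.replace(r" +", " ", regex=True)

print(unchanged.tolist())
print(single_spaces.tolist())
```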

from bs4 import BeautifulSoup
import re

# This function addresses the \r\n blocks and converts them to CODE
def regex(text):
    # Drop anything enclosed between two tab characters
    pattern = r'\t(.*?)\t'
    text = re.sub(pattern, " ", text)
    # Replace each line enclosed between two \r\n sequences with a placeholder
    pattern = r'\r\n(.*?)\r\n'
    return re.sub(pattern, " CODE ", text)

# The advantage of using BeautifulSoup is that all the <html> tags are parsed and disappear.
def code_block(field):
    soup = BeautifulSoup(field, 'html.parser')
    # Assumption: each <code> element is swapped for the CODE placeholder
    for f in soup.find_all('code'):
        f.replace_with(' CODE ')
    return soup.get_text().replace('\n', ' ')

# These are reminders, to ensure we don't forget
# (.str.replace works on substrings; a plain Series.replace only matches whole cells)
# df['Body'] = df['Body'].str.replace('<br/>\r\n', '.', regex=False)
# df['Body'] = df['Body'].str.replace('**', ' ', regex=False)
# df['Body'] = df['Body'].str.replace("\\'", "'", regex=False)
# df['Body'] = df['Body'].str.replace('<p>', '.', regex=False)
# df['Body'] = df['Body'].str.replace('\n', ' ', regex=False)
# df['Body'] = df['Body'].str.replace(r' +', ' ', regex=True)

df['Body'] = df['Body'].apply(regex)

df['Body'] = df['Body'].apply(code_block)



As we can see, our Body field has been sanitized and, although there are repetitions in the CODE tag, we have managed to significantly reduce the noise generated by the various code snippets.
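If the repeated CODE tags turn out to be a nuisance, one possible follow-up is to collapse consecutive placeholders into a single one. This is only a sketch: the collapse_code_tags helper and the sample sentence are hypothetical, and whether repetition carries a useful signal (longer snippets, for instance) is an open question.

```python
import re

# One CODE followed by at least one more CODE, separated only by whitespace
REPEATED_CODE = re.compile(r"\bCODE\b(?:\s+CODE\b)+")


def collapse_code_tags(text: str) -> str:
    """Replace each run of consecutive CODE placeholders with a single CODE."""
    return REPEATED_CODE.sub("CODE", text)


sample = "My loop fails CODE  CODE CODE any idea why? CODE"
print(collapse_code_tags(sample))
```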

The next step is to analyse these words to see what makes a great question on SO, but that will be the subject of my next article.

Thanks for reading!