Text Data Exploration


I say unique to NLP because, unlike the pre-processing of visual or audio data, the pre-processing of text data requires a few more steps. Even if the data is valid (by valid, I mean actual text characters), it might not be in a language you understand (i.e. English), it might not be in the Latin alphabet, or it might not be encoded properly (in UTF-8, for example). If you are dealing with web-scraped data, most of the time you'll encounter <html> or other formatting tags that will need to be addressed, as they are not part of the content per se. That is not to say this information is irrelevant; it can reveal useful patterns, but if treated like any other word it will most likely generate noise. So it is important to pick those up early enough.

Unfortunately, one of the solutions for this kind of data processing is Regular Expressions, or regex... I try to avoid them as much as possible as they hurt my brain, but I have to admit there are a few easy-to-remember patterns that will help greatly during the cleaning process and also during the tokenization process.
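To give a feel for the kind of patterns I mean, here is a short sketch; the patterns and the sample string are my own illustrations, not taken from the dataset:

```python
import re

# Hypothetical sample in the spirit of scraped text
text = "<p>Hello   world!</p>\r\nprint('hi')\r\n"

no_tags = re.sub(r"<[^>]+>", " ", text)   # strip anything that looks like a tag
one_space = re.sub(r"\s+", " ", no_tags)  # collapse runs of whitespace
print(one_space.strip())                  # → Hello world! print('hi')
```

Two patterns like these already cover a lot of the cleaning we will do below.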


The lingo

  • Corpus — a collection of (similar) documents.
  • Document — a collection of sentences that share the same context. It can be a review, a paragraph, a log file, etc.
  • NLP — Natural Language Processing. It is an area of AI which deals with interpreting human language. In our context, it refers to the analysis of text data by a computer.
  • NLTK — Natural Language Tool Kit. It is a very powerful library for NLP.

The dataset

Put your hat on!

The dataset covers 60,000 questions asked on StackOverflow between 2016 and 2020. To link this back to the lingo, this is our corpus, and a specific question is going to be called a document.

Looking at the Tasks tab, we are challenged to classify SO questions based on text quality, with a hint that the last column was added to ease supervised classification.


import pandas as pd
df = pd.read_csv('data.csv', index_col='Id')

As with most data, displaying the raw data is quite a revealing step. I think it is even more important with text data, as not only do we need to understand the metadata:

  • are there any NaN values? (df.describe() has already told me there is no gap in the data)
  • any out-of-place values? (like a 0 meaning no records, for example)
  • etc.

but we also need to analyse and understand the actual values of the fields. As we can see in the table above, the following columns are available:

  • Title which seems to only contain text data. It appears to be the subject of the post, in English.
  • Body which seems to contain text data but in various "formats/languages". We can see the <p> html tag and the \r\n tag, which is probably the "code" cell formatting.
  • Tags which seems to contain the various tags one can flag a question with.
  • CreationDate which seems to be the posting date. The Kaggle description mentions editing of the post as a metric for post quality. This might be worth keeping in mind when dealing with the data.
  • Y which seems to be the manually added field, the one with the labels (for supervised classification).
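The checks above can be sketched on a miniature stand-in frame. The rows and labels here are made up for illustration; on the real data, run the same calls on the df loaded earlier:

```python
import pandas as pd

# Tiny hypothetical frame mimicking the columns described above
df = pd.DataFrame({
    "Title": ["How to parse JSON?", "help pls!!!"],
    "Body": ["<p>Some text</p>", "<p>More text</p>\r\ncode\r\n"],
    "Y": ["HQ", "LQ"],  # made-up label values
})

print(df.isna().sum())          # any NaN values per column?
print(df["Y"].value_counts())   # class balance of the target
```

Checking the class balance of Y early tells us whether the classifier will need any rebalancing later on.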

Text cleaning

In the Body column, we can see the <html> tags are enclosed in the formatting \r\n (or, when we are lucky, the code is placed within <code> tags), with the exception of the <p> tag, which delimits the Body field in most cases. We also notice some characters are escaped: \'.

Because I am more interested in the human factor (but also because I am not qualified to review 60k snippets of code), I am going to replace the code cells with a tag that won't affect the structure of the text, without destroying the surrounding information. Maybe submitting some code is a sign of quality on SO, who knows?

There are a few ways to do this. One of them is to build a cleaning function and use the .apply(lambda x: cleaning_function(x)) method on the column to clean. It is well suited for more complex cleaning.

Another way to clean text is to use the .replace() method on pandas Series, which is straightforward and supports regex.
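A minimal sketch of that second approach (the sample strings are hypothetical):

```python
import pandas as pd

s = pd.Series(["<p>Hello</p>\r\nsome code\r\n", "plain text"])

# With regex=True, .replace() does pattern substitution inside each string
cleaned = s.replace(r"\r\n(.*?)\r\n", " CODE ", regex=True)
print(cleaned[0])  # → "<p>Hello</p> CODE "
```

Without regex=True, Series.replace only swaps exact, whole-cell matches, which is a classic gotcha.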

from bs4 import BeautifulSoup
import re

# This function addresses the \r\n blocks and converts them to CODE
def regex(text):
    text = re.sub(r"\t(.*?)\t", " ", text)
    return re.sub(r"\r\n(.*?)\r\n", " CODE ", text)

# The advantage of using BeautifulSoup is that all the <html> tags are parsed and disappear.
def code_block(field):
    soup = BeautifulSoup(field, "html.parser")
    # Replace each <code> block with the CODE placeholder before extracting text
    for code in soup.find_all('code'):
        code.replace_with(' CODE ')
    return soup.get_text().replace('\n', ' ')

# These are reminders, to ensure we don't forget
# df['Body'].replace('<br/>\r\n', '.', regex=True, inplace=True)
# df['Body'].replace('**', ' ', inplace=True)
# df['Body'].replace("\'", "'", inplace=True)
# df['Body'].replace('<p>', '.', inplace=True)
# df['Body'].replace('\n', ' ', inplace=True)
# df['Body'].replace(r'\ +', ' ', regex=True, inplace=True)

df['Body'] = df['Body'].apply(regex)
df['Body'] = df['Body'].apply(code_block)



As we can see, our Body field has been sanitized and, although there are repetitions of the CODE tag, we have managed to significantly reduce the noise generated by the various code snippets.
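One quick way to measure how much code the questions carry, sketched here on a stand-in Series (the real numbers come from the actual df['Body']):

```python
import pandas as pd

body = pd.Series(["some text CODE more text CODE", "no code here"])

print(body.str.count("CODE"))               # CODE placeholders per document
print((body.str.count("CODE") > 0).mean())  # share of posts containing code
```

That second number could feed straight into the "is code a sign of quality?" question raised earlier.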

The next step is to analyse these words to see what makes a great question on SO. But that is going to be my next article.

Thanks for reading!

