Photo by Glen Noble on Unsplash

Tag analysis (Stack Overflow questions)

Antoine Ghilissen
Sep 9, 2020

This article follows on from the analysis started here.

We are going to have a look at the tags used in our 60,000 questions from StackOverflow with Quality Rating. It should give us a better understanding of the situation and, with a bit of work, we might already be able to spot some trends.

Introduction

In this article, we want to do a few things using the Tags field. We want to see what the bulk of the questions are about, and also whether some tag combinations are common. All of this will eventually be compared against the quality of the posts to try and identify trends.

To that end, we are going to use lambda functions, build cleaning functions, build a bag of words, create a wordcloud and use nltk's FreqDist.

Imports and cleaning functions

The cleaning functions are nothing too fancy, but the one we are going to use for our wordclouds (wc) is a little more invasive: it strips hyphens and rewrites '.', 'c++' and 'c#' to try and get rid of some noise when the tags are parsed as words.

from nltk import FreqDist
from wordcloud import WordCloud
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (30, 30)


def wc(text):
    """
    Cleaning function to be used with our first wordcloud
    """
    if text:
        tags = text.replace('><', ' ')
        tags = tags.replace('-', '')
        tags = tags.replace('.', 'DOT')
        tags = tags.replace('c++', 'Cpp')
        tags = tags.replace('c#', 'Csharp')
        tags = tags.replace('>', '')
        return tags.replace('<', '')
    else:
        return 'None'


def clean_tags(text):
    """
    Cleaning function for tags
    """
    if text:
        tags = text.replace('><', ' ')
        tags = tags.replace('>', '')
        return tags.replace('<', '')
    else:
        return 'None'
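To make the difference between the two functions concrete, here is a small sketch that runs both on one sample value. The sample string is hypothetical, and it assumes the raw Tags field looks like `<tag1><tag2>`, which is what the `replace('><', ' ')` calls above suggest. The functions are restated in condensed form so the snippet runs on its own.

```python
def wc(text):
    """Wordcloud-oriented cleaning, same logic as the function above."""
    if not text:
        return 'None'
    tags = text.replace('><', ' ')
    tags = tags.replace('-', '')
    tags = tags.replace('.', 'DOT')
    tags = tags.replace('c++', 'Cpp')
    tags = tags.replace('c#', 'Csharp')
    return tags.replace('>', '').replace('<', '')


def clean_tags(text):
    """Lighter cleaning that keeps the original tag spelling."""
    if not text:
        return 'None'
    return text.replace('><', ' ').replace('>', '').replace('<', '')


# Hypothetical raw value, assuming the '<tag1><tag2>' layout
sample = '<c++><python-3.x><node.js>'

wc_result = wc(sample)
clean_result = clean_tags(sample)

print(wc_result)     # Cpp python3DOTx nodeDOTjs
print(clean_result)  # c++ python-3.x node.js
```

The wc version produces tokens made only of word characters, which matters once the wordcloud splits the text with a `\w+` pattern; clean_tags keeps the tags readable for the bag of words.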

Wordclouds

WordCloud.generate() needs a document of space-separated words. We are going to create a list of words then use the ' '.join() method to build that document.

tags = [tag for i in df['Tags'].apply(lambda x: wc(x)) for tag in i.split()]

wordcloud = WordCloud(width=3000,
                      height=2000,
                      regexp=r'\w+',  # Allows C, R to be parsed as words
                      background_color='white')

wordcloud.generate(" ".join(tags))

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

[Figure: wordcloud of all tags]

list(wordcloud.words_)[:20]

['java',
'python',
'html css',
'Cpp',
'java android',
'javascript jquery',
'Csharp',
'c',
'python python3DOTx',
'javascript html',
'ios swift',
'javascript',
'php mysql',
'jquery html',
'sql sqlserver',
'angular',
'php html',
'javascript php',
'javascript reactjs',
'php']

We can see that although our list only contained single words, WordCloud recognises some commonly paired ones.
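The pairs come from WordCloud's `collocations` option, which is on by default and also counts two-word combinations; passing `collocations=False` should restrict the cloud to single words. The sketch below is only a rough standard-library illustration of why adjacent tags end up paired once everything is joined into one document. It is not wordcloud's actual scoring, and the sample questions are made up.

```python
from collections import Counter

# Hypothetical cleaned tag strings, one per question
questions = ['java android', 'java android json', 'html css', 'python', 'java']

# Same flattening as above: a list of single tags joined into one document
tags = [tag for q in questions for tag in q.split()]
document = ' '.join(tags)

# Count adjacent word pairs in the joined document
words = document.split()
bigrams = Counter(zip(words, words[1:]))

top_pair, top_count = bigrams.most_common(1)[0]
print(top_pair, top_count)  # ('java', 'android') 2
```

Note that some pairs straddle two different questions (here `('json', 'html')`), because the question boundaries are lost in the join. Frequent pairs are still dominated by tags that genuinely appear together.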

A few things worth noticing as well:

  • java and python seem to be the most asked-about languages, followed by html css.
  • javascript seems to be quite high up as well but has been paired with other technologies like html, jquery, etc.
  • Talking about pairs, we can already spot a few meaningful ones such as java android, and less valuable ones like python python-3.x.

Let’s look at the actual counts to see in which proportion those languages are talked about.

FreqDist(tags).plot(50)
plt.show()
[Figure: frequency plot of the 50 most common tags]

The counts confirm what the pairs were hiding: once its occurrences are added up, javascript is the most asked-about language. python then overtakes java by a couple of hundred questions.
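Reading exact numbers off a plot is awkward. Since nltk's FreqDist is a subclass of collections.Counter, `most_common()` is available on it too. The sketch below uses a plain Counter and a made-up tag list standing in for `tags`, so it runs without nltk.

```python
from collections import Counter

# Hypothetical flattened tag list, standing in for `tags` above
tags = ['javascript', 'jquery', 'python', 'javascript',
        'html', 'java', 'python', 'javascript']

counts = Counter(tags)

# The two most frequent tags with their counts
top2 = counts.most_common(2)
for tag, n in top2:
    print(f'{tag:<12}{n}')
```

On the real data, `FreqDist(tags).most_common(10)` should give the same kind of list, which makes differences such as python versus java easy to quantify.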

Let’s double check this by reducing the list of tags to the first one only. We will assume programmers know how to tag a post, with the first tag being the core of the question, followed by other tags giving more context about the question.

first_tags = df['Tags'].apply(lambda x: wc(x)).apply(lambda x: x.split()[0])

wordcloud = WordCloud(width=3000,
                      height=2000,
                      regexp=r'\w+',  # Allows C, R to be parsed as words
                      background_color='white')

wordcloud.generate(" ".join(first_tags))

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

[Figure: wordcloud of first tags only]

list(wordcloud.words_)[:20]

['javascript',
'python',
'java',
'Csharp',
'php',
'android',
'Cpp',
'io',
'c',
'html',
'sql',
'angular',
'mysql',
'swift',
'nodeDOTjs',
'reactjs',
'linux',
'css',
'docker',
'git']

FreqDist(first_tags).plot(50)
plt.show()

[Figure: frequency plot of the 50 most common first tags]

So most questions on SO are about javascript, python, java and c#.

Let’s have a deeper look at the top languages by taking a bag of words approach.

Javascript

bag = {}
for tag in df['Tags'].apply(lambda x: clean_tags(x)):
    # Get the previous entry, or 0 if not yet documented; add 1
    bag[tag] = bag.get(tag, 0) + 1

for words in sorted(bag, key=bag.get, reverse=True):
    if 'javascript' in words:
        print(words, bag[words])

javascript 879
javascript jquery 348
javascript html 260
javascript jquery html 156
javascript html css 148
javascript jquery html css 116
javascript arrays 114
javascript reactjs 113
javascript regex 111
javascript php 55
javascript node.js 54
javascript angularjs 52
javascript json 50
javascript php html 48
javascript ecmascript-6 45
javascript angular 34
javascript typescript 34
javascript css 33
javascript jquery css 28
javascript vue.js 26
javascript arrays json 25
javascript jquery ajax 23
javascript angular typescript 23
javascript php jquery 20
javascript lodash 20
javascript jquery json 20
javascript php jquery html 17
javascript arrays sorting 17
javascript date 16
javascript php jquery ajax 16
javascript reactjs react-hooks 16
javascript reactjs react-native 15
javascript webpack 15
javascript vue.js vuejs2 15
javascript php html css 14
javascript momentjs 14
javascript arrays object 14
javascript google-maps 13
javascript object 13
javascript react-native 13
javascript reactjs redux 12
javascript if-statement 11
javascript reactjs react-router 11
javascript function 11
javascript html reactjs 11
javascript node.js express 10
javascript ajax 10
javascript promise 10
javascript html angularjs 10
javascript jquery regex 10

Javascript is usually associated with jquery, html and css. reactjs and regex are also worth noticing but if you were thinking about a webdev career, I guess now you know where to start 😄

Python

for words in sorted(bag, key=bag.get, reverse=True):
    if 'python' in words:
        print(words, bag[words])

python 1068
python python-3.x 375
python pandas 155
python regex 98
python list 86
python python-2.7 77
python-3.x 72
python django 53
python dictionary 44
python numpy 43
python pandas dataframe 41
python matplotlib 40
python tkinter 37
python string 31
python tensorflow 29
python list dictionary 27
python python-2.7 python-3.x 24
python arrays numpy 23
python function 20
python python-3.x python-2.7 20
python opencv 19
python json 19
python csv 18
python python-3.x tkinter 18
python python-3.x list 18
python flask 16
python arrays 16
python selenium 16
python pandas numpy 15
python pytest 13
python if-statement 13
python pygame 13
python regex python-3.x 13
python python-3.x dictionary 13
python algorithm 12
python syntax 12
python file 12
python linux 12
python pycharm 11
python scikit-learn 11
python string list 11
python turtle-graphics 11
python list list-comprehension 10
python beautifulsoup 10
python anaconda 10
python airflow 10

We can see pandas, regex, lists and dictionaries are troubling pythonistas. django and tkinter are also up there, but many of the combinations (pandas, numpy, matplotlib, tensorflow) suggest that a large share of the people asking Python questions on Stack Overflow are working on data analysis / data science projects.
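One detail in the listing above: `python python-2.7 python-3.x` (24) and `python python-3.x python-2.7` (20) are counted separately, because the bag is keyed on the tag string in its original order. If order does not matter for the question at hand, one option is to sort each question's tags before counting. This is a sketch on made-up data, not part of the original analysis.

```python
from collections import Counter

# Hypothetical cleaned tag strings; the first two differ only in tag order
cleaned = [
    'python python-2.7 python-3.x',
    'python python-3.x python-2.7',
    'python pandas',
]

# Order-sensitive bag, equivalent to the dict-based loop above
ordered_bag = Counter(cleaned)

# Order-insensitive bag: sort each question's tags before counting
unordered_bag = Counter(' '.join(sorted(c.split())) for c in cleaned)

print(ordered_bag)
print(unordered_bag)
```

The trade-off is that sorting throws away the position of the first tag, which the earlier section treated as the core topic of the question.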

Java

for words in sorted(bag, key=bag.get, reverse=True):
    if 'java' in words and 'javascript' not in words:
        print(words, bag[words])

java 1013
java android 339
java arrays 100
java string 60
java regex 55
java android android-studio 45
java arraylist 34
java eclipse 28
java swing 24
java android xml 22
java java-8 20
java javafx 20
java collections 20
java java-8 java-stream 19
java android json 19
java for-loop 17
java android sqlite 17
java selenium 17
java algorithm 15
java spring spring-boot 15
java spring 15
java recursion 15
java generics 15
java json 15
java multithreading 15
java java.util.scanner 14
java if-statement 14
java android firebase firebase-realtime-database 14
java arrays arraylist 13
java android android-layout 13
java methods 12
java oop 12
java android nullpointerexception 12
java file 12
java arrays string 12
java android kotlin 12
java android android-fragments 11
java exception 11
java mysql 11
java nullpointerexception 11
java loops 11
java java-stream 11
java random 11
java android sqlite android-sqlite 10
java android-studio 10
java date 10
java hashmap 10
java c++ 10
java linked-list 10

On SO, the java community seems to be heavily focused on android development, with java-8 also showing up. Most of the other questions are about the fundamentals of Java: arrays, string, arraylist, for-loop.
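The filter used for Java relies on substring matching: `'java' in words` also matches `javascript`, and excluding `javascript` drops any question tagged with both languages. A possible alternative is to match whole tags. The `has_tag` helper and the sample combinations below are hypothetical, for illustration only.

```python
def has_tag(cleaned, tag):
    """Return True if `tag` is exactly one of the space-separated tags."""
    return tag in cleaned.split()


# Hypothetical cleaned tag combinations
combos = ['java android', 'javascript jquery', 'java javascript', 'java-8']

# Substring filter, as used above
substring_hits = [c for c in combos
                  if 'java' in c and 'javascript' not in c]

# Whole-tag filter
exact_hits = [c for c in combos if has_tag(c, 'java')]

print(substring_hits)  # ['java android', 'java-8']
print(exact_hits)      # ['java android', 'java javascript']
```

The substring version keeps `java-8` (not tagged java) and loses `java javascript`; the whole-tag version does the opposite. The same applies to the Python listing, where `python-3.x` on its own matched `'python' in words`.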

Quality

Is there any link between the topic of a question and its quality?

Javascript

df[df['Tags'].str.contains('javascript')]['Y'].value_counts()/71.13

LQ_CLOSE    41.56%
LQ_EDIT     31.22%
HQ          27.22%

Python

df[df['Tags'].str.contains('python')]['Y'].value_counts()/71.62

LQ_CLOSE    37.04%
LQ_EDIT     33.08%
HQ          29.88%

Java

# Combine both conditions into a single boolean mask instead of chaining two filters
df[df['Tags'].str.contains('java') & ~df['Tags'].str.contains('javascript')]['Y'].value_counts()/62.94

LQ_CLOSE    44.58%
LQ_EDIT     38.16%
HQ          17.25%

No clear trend appears here. Python seems to have slightly more high-quality questions than Javascript, but in both cases the bulk of the questions are low quality (LQ_CLOSE or LQ_EDIT). Interestingly, Java has significantly fewer high-quality questions.
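The divisors used above (71.13, 71.62, 62.94) are presumably one hundredth of the number of matching rows, so they have to be recomputed for every filter. A sketch of an alternative is to let pandas compute the shares with `value_counts(normalize=True)`. The DataFrame below is a toy stand-in; the column names follow the article, and matching on `<java>` with `regex=False` assumes the raw Tags layout is `<tag1><tag2>`.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; column names follow the article
df = pd.DataFrame({
    'Tags': ['<java><android>', '<javascript>', '<java>',
             '<python>', '<java><string>'],
    'Y': ['LQ_CLOSE', 'HQ', 'LQ_EDIT', 'HQ', 'LQ_CLOSE'],
})

# Literal match on the whole tag, so '<javascript>' is not picked up
is_java = df['Tags'].str.contains('<java>', regex=False)

# Share of each quality class among the matching rows, in percent
shares = df.loc[is_java, 'Y'].value_counts(normalize=True).mul(100).round(2)
print(shares)
```

With this form the same line works for any tag without knowing the group size in advance.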

Conclusion

There doesn’t seem to be any obvious correlation between the quality of the questions and the tags used. It might be worth digging a little more to identify tag combinations that could be more “high value”, but they would only explain a very small fraction of the questions posted, so it is probably not worth spending more time on this type of analysis.

We have learnt a few things about Stack Overflow though:

  • Javascript is the most asked-about language, followed by Python then Java.
  • Javascript, jquery and html seem to be the most talked-about topics.
  • Python’s community is mainly developers and data scientists/analysts.
  • Java seems to attract a lot of low-quality questions, and android seems to be one of the most popular Java applications.
