I am thankful for Stack Overflow

You’ve come through for me in my dark moments of swearing at Tetris, forgetting any and all JavaScript syntax, and more. I dedicate this lesson to all the Stack Overflow contributors who didn’t make me feel like a moron, and to all the new students fixing their bugs one at a time with 26 tabs open.

Fake book covers from the Practical Developer, mimicking O’Reilly textbooks

Stack Exchange Queries

One of the coolest things about learning Data Science and statistics is that we can apply our methods to any dataset that interests us. This is an excellent opportunity for creativity and expression. But it can be difficult to know where to begin, especially because our questions might be too advanced or the data too hard to get. These lessons approach this problem by analyzing data from Software Engineering research and guiding the learner through the findings. We also use sources like Stack Overflow, GitHub, and Eclipse (Java IDE) Bug Reports.

For this lesson, it’s all about Stack Overflow tags. Stack Overflow is a forum where developers can ask and answer questions about code. There are other features to the site, like gaining reputation and earning badges, but above all else the site markets itself as “ask questions, get answers, no distractions”. There is certainly a culture to be reckoned with, as some people can be unkind to beginners on the site. But never fear, you are a coder and you belong.

In this lesson we use the Stack Exchange API to get data from Stack Overflow. We use this query, which returns its results as a .csv file. We can download that file to our local machine and work with the data from there.

Load Your Libraries
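The chunk that loads the libraries is not shown here. Judging from the functions used later in the lesson (%>%, group_by, ggplot, ggtitle, kable_styling), it probably loads something like the packages below. The exact package list is an assumption, and the sketch checks what is installed before loading anything.

```r
# Assumed package list, inferred from the functions used later in the lesson
pkgs <- c("dplyr", "ggplot2", "knitr", "kableExtra")

# Check which of these packages are installed before trying to load them
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
}

# Load whichever packages are available
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```

Any package reported as missing can be installed with install.packages() and the chunk run again.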


Load the Data

The data already exists in the data/stackoverflow folder. If you download any of your own data from Stack Exchange, put that csv file in the same folder so that the relative path in this R code can find it.

data <- read.csv("../data/stackoverflow/top10tags.csv")

#head(data, 13) shows the top 13 rows; putting it inside a "kable" displays it nicely in RMarkdown
kable(head(data, 13)) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Month                TagName     X
2008-07-01 00:00:00  c#          3
2008-07-01 00:00:00  html        1
2008-08-01 00:00:00  android     4
2008-08-01 00:00:00  c#          513
2008-08-01 00:00:00  c++         164
2008-08-01 00:00:00  html        111
2008-08-01 00:00:00  ios         9
2008-08-01 00:00:00  java        220
2008-08-01 00:00:00  javascript  161
2008-08-01 00:00:00  jquery      28
2008-08-01 00:00:00  php         162
2008-08-01 00:00:00  python      124
2008-09-01 00:00:00  android     9
#you can also use the View command to inspect the entire data frame
View(data)
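The Month column comes in as text such as "2008-07-01 00:00:00". If you want R to treat it as a date, for example so that a time axis is spaced correctly, one option is to keep the date part and parse it. This is a minimal sketch on three hand-typed values, not part of the original lesson code; the names month_text and month_date are made up for the example.

```r
# Three Month values in the same text format as the table above
month_text <- c("2008-07-01 00:00:00", "2008-08-01 00:00:00", "2008-09-01 00:00:00")

# Keep the first 10 characters ("YYYY-MM-DD") and parse them as dates
month_date <- as.Date(substr(month_text, 1, 10), format = "%Y-%m-%d")

print(class(month_date))
print(diff(month_date))  # gaps between consecutive months, in days
```

On the full data frame the same idea would be data$Month <- as.Date(substr(as.character(data$Month), 1, 10)).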

Summarize the Variables

Here we have the top 10 tags used on Stack Overflow over time, with the number of questions containing each tag per month since 2008. Let’s summarize a bit more of what we have by going over each variable and making some tables. This is just a nice way to get yourself situated with the data.
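The outputs in this section look like the results of table(), summary(), and tail(), but the calls themselves are not shown. The following is a sketch under that assumption. It runs on the first 13 rows of the data, typed in by hand so that it works without the csv file; sample_rows and the other variable names are made up for the example.

```r
# First 13 rows of the lesson's data, typed in by hand so the sketch runs on its own
sample_rows <- data.frame(
  Month = c(rep("2008-07-01 00:00:00", 2),
            rep("2008-08-01 00:00:00", 10),
            "2008-09-01 00:00:00"),
  TagName = c("c#", "html", "android", "c#", "c++", "html", "ios", "java",
              "javascript", "jquery", "php", "python", "android"),
  X = c(3, 1, 4, 513, 164, 111, 9, 220, 161, 28, 162, 124, 9),
  stringsAsFactors = FALSE
)

tag_counts <- table(sample_rows$TagName)  # number of rows per tag
x_summary <- summary(sample_rows$X)       # min, quartiles, mean, max of the counts
last_rows <- tail(sample_rows, 3)         # last 3 rows, keeping their row numbers

print(tag_counts)
print(x_summary)
print(last_rows)
```

With the full dataset, replace sample_rows with data to get the per-tag row counts, the spread of X, and the most recent rows.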

##    android         c#        c++       html        ios       java 
##        133        134        133        134        133        133 
## javascript     jquery        php     python 
##        133        133        133        133
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1    4182    8364    8675   12476   24202
#tail(data, 13) shows the last 13 rows, again inside a "kable"
kable(tail(data, 13)) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
      Month                TagName     X
1320  2019-07-01 00:00:00  jquery      4129
1321  2019-07-01 00:00:00  php         8429
1322  2019-07-01 00:00:00  python      23483
1323  2019-08-01 00:00:00  android     5262
1324  2019-08-01 00:00:00  c#          5679
1325  2019-08-01 00:00:00  c++         2460
1326  2019-08-01 00:00:00  html        4397
1327  2019-08-01 00:00:00  ios         1820
1328  2019-08-01 00:00:00  java        7089
1329  2019-08-01 00:00:00  javascript  10873
1330  2019-08-01 00:00:00  jquery      2119
1331  2019-08-01 00:00:00  php         4748
1332  2019-08-01 00:00:00  python      13024

Total Tags and dplyr Intro

We’ve explored the data a little bit, but there are important insights to be had from summarizing that data into totals. We do that using dplyr, part of the tidyverse. It is a package built around verb-driven data manipulation. Basically, that means its functions have slightly more intuitive names, like group_by() and arrange(), for the things you can do to your data. In the example below, we collapse over Month and add up the total number of questions with each tag, across all time values.

We still want to separate the top 10 tags from one another. Let’s look at what happens if we don’t include group_by(TagName) in our code:

not_so_helpful <- data %>%
  summarise(sum = sum(X))
not_so_helpful
##        sum
## 1 11555029

It gives us only one number back: the total number of questions in our dataset as a whole, across all tags and all values of Month. But we would still like to know about the individual tags, so we use the following dplyr code, which includes a step where we arrange() the data in descending order. Next we create a ggplot bar plot to show the total questions for the top 10 Stack Overflow tags.
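To see the difference on a small scale, here is a sketch that runs both versions on the first 13 rows of the data, typed in by hand so it works without the csv file. The names sample_rows, overall, and by_tag are made up for the example, and it assumes dplyr is installed.

```r
library(dplyr)

# First 13 rows of the lesson's data, typed in by hand
sample_rows <- data.frame(
  Month = c(rep("2008-07-01 00:00:00", 2),
            rep("2008-08-01 00:00:00", 10),
            "2008-09-01 00:00:00"),
  TagName = c("c#", "html", "android", "c#", "c++", "html", "ios", "java",
              "javascript", "jquery", "php", "python", "android"),
  X = c(3, 1, 4, 513, 164, 111, 9, 220, 161, 28, 162, 124, 9),
  stringsAsFactors = FALSE
)

# Without group_by(): a single row holding the grand total
overall <- sample_rows %>%
  summarise(sum = sum(X))

# With group_by(): one row per tag, largest total first
by_tag <- sample_rows %>%
  group_by(TagName) %>%
  summarise(sum = sum(X)) %>%
  arrange(desc(sum))

print(overall)
print(by_tag)
```

On these 13 rows the grand total is 1509, and the grouped version returns ten rows with c# first at 516.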

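#summarizing up the totals, collapsing over Month
summary <- data %>%
  group_by(TagName) %>%
  summarise(sum = sum(X)) %>%   #total questions per tag
  arrange(desc(sum))            #largest total first

kable(summary) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))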
TagName     sum
javascript  1856183
java        1577823
c#          1336488
php         1301057
python      1228792
android     1214594
jquery      962366
html        842133
c++         629479
ios         606114
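#reconstructed sketch: the original plotting lines did not survive, so the geom and aesthetics are assumptions
ggplot(summary, aes(x = TagName, y = sum)) + geom_col() +
  ggtitle("Total Questions With Top 10 Stack Overflow Tags")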

Tags Over Time By Month

Let’s reintroduce the Month variable and take a look at how the tags differ from month to month, over time. Note that this is not a cumulative total, but the number of questions recorded for that tag at each value of Month. Therefore, the curves will not be smooth or always increasing.

NOTE: It is worth checking whether the query counts every question asked during each month, or only those asked on the first day of the month. If it is the latter, these numbers are a sample rather than true monthly totals.
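To make the distinction between per-period counts and a running total concrete, here is a small sketch in base R. It uses the first 13 rows of the data, typed in by hand, and adds a hypothetical running_total column that accumulates X within each tag.

```r
# First 13 rows of the lesson's data, already ordered by Month
sample_rows <- data.frame(
  Month = c(rep("2008-07-01 00:00:00", 2),
            rep("2008-08-01 00:00:00", 10),
            "2008-09-01 00:00:00"),
  TagName = c("c#", "html", "android", "c#", "c++", "html", "ios", "java",
              "javascript", "jquery", "php", "python", "android"),
  X = c(3, 1, 4, 513, 164, 111, 9, 220, 161, 28, 162, 124, 9),
  stringsAsFactors = FALSE
)

# ave() applies cumsum() within each tag and keeps the original row order
sample_rows$running_total <- ave(sample_rows$X, sample_rows$TagName, FUN = cumsum)

# X is the count for that row; running_total only ever goes up
print(sample_rows[sample_rows$TagName %in% c("c#", "android"), ])
```

For c#, X goes 3 then 513 while the running total goes 3 then 516. The plot in this section uses X, which is why its curves can go down as well as up. The dplyr version of the same idea would be group_by(TagName) followed by mutate() with cumsum(X).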

plt = ggplot(data,aes(Month,X,group=TagName,color=TagName))+
  geom_line()+  #assumed geom; the original layer line did not survive
  ggtitle("Top 10 Stack Overflow Tags Over Time")
plt