Here is a blog post that I wrote about this lesson, about playing with data, and about how to infuse joy into learning statistics.

Curiosity killed the cat, but satisfaction brought it back

Sometimes the most intricate problems arise from simple musings among friends. This is one of the beautiful things about problem solving, programming, and statistics. When you are curious and excited enough about a problem, you may find yourself exploring every angle just to prove your point. (Be careful, though: don’t forgo sleep. We cover the detrimental effects of missing sleep on your coding performance in another lesson.) While it is very important to tackle real problems that genuinely affect people’s lives, first we have to learn to wrangle statistics and ask the right kinds of questions. Learning this skill will help you ask critical questions when it really counts, like when building software for hospitals, news sites, financial institutions, or maybe the next rocket to Mars. One of the best ways to teach these skills is to lock on to a personally meaningful but low-stakes question, and answer it.

Playing with Data

In this lesson, you will learn to meaningfully play with data. So often in our academic careers we are expected to follow along perfectly and do things the “right” way, in order to build our skills and become essential, productive members of a team/company/society. We can get through our math classes by “plugging it in” to equations, get through English by practicing flash cards, and get through History by memorizing dates. But we can gain a lot by expressing ourselves creatively in our learning. So this lesson will demonstrate how to query archived GitHub data, generate plots of interest, and answer our own curiosities (sometimes frivolous, but who cares?). You will have several chances to express yourself creatively in this lesson, while still being guided by examples.

GHTorrent and bigrquery

We need to begin by getting data. Luckily, the GHTorrent project has been archiving data from the GitHub API for the past several years, and we can get access to commits, users, forks, repos, locations (see GitHub in the glossary), and more. While there are several ways to load this data into R, we are going to rely on bigrquery to interface with Google BigQuery and the public GHTorrent data on the Google Cloud. The rest of this section walks through the pieces you need to put in place.

The data is held on the Google Cloud, where we can query it with SQL; the next piece is understanding how to do that from our local machines, in R. We will use the bigrquery package (there is a tutorial here). bigrquery lets us link our R code to our project on the cloud, and you will be asked to authenticate. For the record, I did not know what the “cloud” really meant for a while, and was too afraid to ask.
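
If you do have your own Google Cloud project, the flow looks roughly like the sketch below. Note that this uses bigrquery’s newer bq_*() functions rather than the query_exec() calls shown later in the lesson, and “your-project-id” and the trivial query are placeholders, not values from this lesson.

#- # a minimal sketch of authenticating and running a query with bigrquery
#- # (not run here; requires your own Google Cloud project and credentials)
#- library(bigrquery)
#- bq_auth()                                      # interactive, browser-based authentication
#- tb <- bq_project_query("your-project-id", "SELECT 1 AS sanity_check")
#- df <- bq_table_download(tb)                    # pull the result back as a data frame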

Data Files for this Lesson

For this lesson, we stored the results of all of the queries in data files that you can load from the ghtorrentdata/ directory. Accessing the data through BigQuery required a credit card, which is unreasonable to expect from anyone just trying to use these lessons to learn. If you do want to perform different queries of your own, there are several ways to do that, and I have included how you could do it using BigQuery and bigrquery (that’s the R package, see the r in there?). Any time you see query_exec(), it will be commented out, and you will instead use the data we provided from the result of that query as of September 2019. That way, you can see how the query would be executed, but you don’t need to fiddle with APIs to follow along in this lesson.

library(bigrquery)   # interface to Google BigQuery
library(ggplot2)     # plotting
library(dplyr)       # data wrangling for the summaries below
library(knitr)       # kable() tables
library(kableExtra)  # kable_styling() for the tables
#- bq_auth(path="../../servicetoken.json") # you will need to get your own service token from your account. Don't share this!
#- project <- "gitstats" # your project name here

Alan Turing’s Birthday Examples (learn some SQL queries)

Here are some examples of using bigrquery to access GHTorrent data. We take a look at Alan Turing’s birthday (June 23) for the year 2016. This should help you get the hang of some SQL commands and collecting data. I chose Alan Turing because I’m an Artificial Intelligence nerd, and once even had a hamster named Turing. You can plug in your own birthday, or the birthday of someone else you admire (a small sketch for swapping in your own date follows the query below). The point is to start playing.

There is a lot to unpack in your first SQL command, so here is a small recipe for reading the examples: the bracketed names like [ghtorrent-bq.ght.commits] are GHTorrent tables hosted on BigQuery; the inner SELECTs pull out just the commits from a single day, the projects with a known language, and the users; the joins stitch those pieces together by their ids; and the outer SELECT counts commits per language, with group by and order by doing the aggregating and sorting.

#- # walking through an SQL example
#- # language with most commits on Alan Turing's birthday in 2016 
#- sql_birthday <- "SELECT  p.language as language, count(c.id) as num_commits
#- from [ghtorrent-bq.ght.project_commits] pc join
#-      (SELECT id, author_id, created_at FROM [ghtorrent-bq.ght.commits] WHERE
#-      date(created_at) = date('2016-06-23') )c on pc.commit_id = c.id join
#-      (SELECT id, language, description
#-      FROM [ghtorrent-bq.ght.projects] WHERE language != 'null')p on p.id = pc.project_id join
#-      (SELECT login,  id
#-      FROM [ghtorrent-bq.ght.users]) u on c.author_id = u.id
#- group by language
#- order by num_commits desc;"
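
To plug in a different date, one simple approach is to substitute the hard-coded date in the query string. A small sketch, assuming you have uncommented and defined sql_birthday above:

#- # swap in any date you like; December 10 is Ada Lovelace's birthday, for example
#- my_date <- "2016-12-10"
#- sql_my_date <- gsub("2016-06-23", my_date, sql_birthday, fixed = TRUE)
#- # sql_my_date can then be passed to query_exec() exactly like sql_birthday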

Execute the Query

We use query_exec() from bigrquery to run the query against the GHTorrent dataset hosted on BigQuery, and we get back a data frame, stored in bday_commits.

#- # executing the query if you aren't using data provided
#- bday_commits  <- query_exec(sql_birthday, project = project, useLegacySql = FALSE) #remember we defined 'project' up above

# reading in data for the above query that we stored earlier
bday_commits <- read.csv("ghtorrentdata/bdaycommits.csv")

kable(head(bday_commits)) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
language    num_commits
JavaScript        45497
Java              34379
Python            31760
HTML              23366
C++               19956
C                 17601

Visualize the Data

The following is a ggplot of the top 7 languages used on Alan Turing’s birthday, by number of commits to each.

plt <- ggplot(bday_commits[1:7, ], aes(language, num_commits, fill = language)) +
  geom_bar(stat = "identity") +
  xlab("Language") +
  ylab("Number of Commits") +
  ggtitle("GitHub Commits by Language on Alan Turing's Birthday") +
  theme_bw()
plt

ggsave("bdaycommits.png",plt)
## Saving 7 x 5 in image

I can’t imagine that Alan Turing would have been the biggest JavaScript fan. Let’s take a look only at projects whose description includes “AI”. Here, Python emerges with the most commits for the day.

#- # languages with the most commits to 'AI' projects on Alan Turing's birthday in 2016
#- sql_example2 <- "SELECT p.description as description, p.language as language, count(c.id) as num_commits
#- from [ghtorrent-bq.ght.project_commits] pc join
#-      (SELECT id, author_id, created_at FROM [ghtorrent-bq.ght.commits] WHERE
#-      date(created_at) = date('2016-06-23') )c on pc.commit_id = c.id join
#-      (SELECT id, language, description
#-      FROM [ghtorrent-bq.ght.projects] WHERE language != 'null' and description LIKE '%AI%')p on p.id = pc.project_id join
#-      (SELECT login,  id
#-      FROM [ghtorrent-bq.ght.users]) u on c.author_id = u.id
#- group by description,language
#- order by num_commits desc;"

#- # executing the query if you aren't using data provided
#- example2  <- query_exec(sql_example2, project = project, useLegacySql = FALSE)

# reading in data for the above query that we stored earlier
example2 <- read.csv("ghtorrentdata/example2.csv")

plt <- ggplot(example2[1:6, ], aes(language, num_commits, fill = language)) +
  geom_bar(stat = "identity") +
  ggtitle("Commits to 'AI' Projects on Alan Turing's Birthday", subtitle = "Who would even want to know this?") +
  theme_bw()
plt

python_desc <- example2[example2$language == 'Python', ]
python_desc <- python_desc[order(-python_desc$num_commits), ]

kable(python_desc[1:10, ]) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
    language  num_commits  description
1   Python             42  Record for playing with OpenAI Gym
5   Python             14  rllab is a framework for developing and evaluating reinforcement learning algorithms, fully compatible with OpenAI Gym.
8   Python              8  AI of bombliss(almost tetris) on NES
9   Python              8  :alien: An AI project. :alien:
16  Python              6  AI for 2048 using simple reinforcement learning
27  Python              4  SIMA and RAIN comparison
35  Python              3  The OAI Harvest module handles metadata gathering between OAI-PMH v.2.0 compliant repositories.
37  Python              3  This depot is used for learning of AI for robotics in Udacity
46  Python              3  Website version of the AIS Network Mapper project
53  Python              2  Gen-6 Pokemon Battling Sim and AI

Language Growth over Time

Let’s investigate another question. Do you know anyone who is obsessed with a certain language? Maybe they can’t stop talking about how Julia (the programming language) is gonna rule the world, or how Golang is the future because of all that multithreading (I’ve never used Golang). Or maybe you’re a straight-up fan of some language and you want to make sure that other people know how awesome it is (Scala, anyone?). This is a fun exercise in seeing how languages grow over time, by the number of commits to their projects on GitHub. In this example, we compare commits to Scala, Python, Julia, and Ruby projects between January 2012 and September 2016:

#- language_sql <- "SELECT  p.language as language, date(created_at) as day,count(c.id) as num_commits
#- from [ghtorrent-bq.ght.project_commits] pc join
#-      (SELECT id, author_id, created_at FROM [ghtorrent-bq.ght.commits] WHERE
#-      date(created_at) between date('2012-01-01')
#-                           and  date('2016-09-05') )c on pc.commit_id = c.id join
#-      (SELECT id, language, description
#-      FROM [ghtorrent-bq.ght.projects] WHERE language == 'Scala' || language == 'Python' || language == 'Julia' || language=='Ruby')p on p.id = pc.project_id join
#-      (SELECT login,  id
#-      FROM [ghtorrent-bq.ght.users]) u on c.author_id = u.id
#- group by language,day
#- order by num_commits desc;"
#- # executing the query if you aren't using data provided
#- lang_growth  <- query_exec(language_sql, project = project, useLegacySql = FALSE)

# reading in data for the above query that we stored earlier
lang_growth <- read.csv("ghtorrentdata/lang_growth.csv")

kable(head(lang_growth)) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
language  day          num_commits
Ruby      2014-06-26       1166068
Ruby      2014-10-21       1022681
Python    2014-11-11        881825
Ruby      2014-10-15        823475
Python    2014-11-18        790873
Ruby      2014-12-10        787341

# keep only the top 3 languages by total number of commits
summary <- lang_growth %>%
  group_by(language) %>%
  summarise(total_num_commits = sum(num_commits)) %>%
  arrange(-total_num_commits) %>%
  head(3)

top_langs <- lang_growth[lang_growth$language %in% summary$language, ]

plt <- ggplot(top_langs, aes(day, num_commits, color = language, group = language)) +
  geom_line() +
  theme_bw() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
plt
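
The x-axis labels are blanked out above because day comes back from read.csv as plain text. If you would like a readable time axis, a small optional tweak (a sketch, assuming day is stored as 'YYYY-MM-DD' strings) is to convert it to a real Date first:

# treat day as a real date so ggplot can draw a labeled time axis
top_langs$day <- as.Date(top_langs$day)

plt <- ggplot(top_langs, aes(day, num_commits, color = language)) +
  geom_line() +
  xlab("Day") +
  ylab("Number of Commits") +
  theme_bw()
plt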