Have You Ever Wondered…

We come across statistical questions all the time, even if we don’t immediately realize it. They range from “whoa, how did Amazon know I was looking for a collection of rubber ducks that look like cowboys?” to more applied questions like “When will this server run out of storage space?” or “When will this project be completed?” Fortunately, we live in a data-driven society; with access to billions of data sources from all over the world, we should be able to answer anything, right? Data comes in all kinds of forms: heart rate readings every second from your fitness watch, location check-ins from that pizza place you went to last week, transactions, clicks, views, posts, tags… It’s all digital data that can tell us something about the world. But making sense of it all requires statistical thinking and data-wrangling skills.

How many repositories are there on GitHub?

Consider this question. The immediate response is to Google it; voilà! You have an answer (it seems to say there are 100 million repositories). But how would you actually arrive at that answer? And how would you know you were correct? As with a seemingly straightforward programming task, a simple-sounding question can devolve into a pile of unforeseen caveats, constraints, and workarounds.

Our Data (GHTorrent)

One important thing to understand early in your programming career is how APIs give you access to data. A lot of the time you’ll get data as a .csv or .txt file and can run everything on your local machine. But when working with really large datasets, there’s just no way to run it all on your own laptop. If you wanted a program to count the number of repositories on GitHub, you might consider downloading all of them and incrementing a counter for each new one. That’s terabytes of data, so it’s a no-go. For our question, we are instead using something called GHTorrent (Gousios 2013). GitHub itself keeps track of all the behavior happening across the site: commits, pushes, pull requests, new repositories, issues. You can access those things through the GitHub API when you have a query, and GHTorrent holds on to all of that data through the years, so we can get a good picture of how fast GitHub is growing over time.

Below, we query the ghtorrent project on Google BigQuery, where id is a project (repository) id with a created_at value. Grouping by day, we see the number of projects created each day. We also have access to other information about those repositories, like commits, users, language, description, and more. For now, let’s look at the simplest count.
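As an aside, if you just wanted a quick number straight from the GitHub API, a minimal sketch might look like the one below. The httr package, the search endpoint, and the query string are assumptions not used elsewhere in this lesson, and the search API caps and paginates results, so treat the number it returns with skepticism.

# hedged sketch: asking GitHub's search API directly for a repository count
# (assumes the httr package and the public /search/repositories endpoint; not run in this lesson)
library(httr)

resp <- GET("https://api.github.com/search/repositories",
            query = list(q = "created:>=2008-01-01"))
# total_count is GitHub's own estimate of matching repositories
content(resp)$total_count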

Data Files for this Lesson

For this lesson, we have stored the results of all the queries in data files that you can load from the ghtorrentdata/ directory. Accessing the data required a credit card, which is unreasonable to expect from anyone just trying to use these lessons to learn. If you do want to perform different queries of your own, there are several ways to do that; I have included how you could do it using BigQuery and bigrquery (that’s the R package, see the r in there?). Any time you see query_exec(), it will be commented out, and you will instead use the data we provide, which is the result of that query as of September 2019. That way, you can see how the query would be executed, but you don’t need to fiddle with APIs to follow along in this lesson.

library(bigrquery)
library(ggplot2)
#- set_service_token("../../servicetoken.json") # you will need to get your own service token from your account. Don't share this!
#- project <- "gitstats" #your project name here


#- # an sql query string
#- sql <- "select count(id) as num_projects,date(created_at) as day
#-                 FROM [ghtorrent-bq.ght.projects]
#-                 group by day"


#- # executing the query if you aren't using data provided
#- data  <- query_exec(sql, project = project, useLegacySql = FALSE)

# reading in data for the above query that we stored earlier
data <- read.csv("ghtorrentdata/data1.csv")

data <- data[data$day != "2016-05-01", ] # drop one unreasonably extreme outlier day

# number of projects created each day (it's going up, but it's not the running sum; don't mix them up)
plt <- ggplot(data, aes(day, num_projects)) +
  geom_point() +
  theme_bw() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
plt

data <- data[order(data$day), ]
data$n <- seq.int(nrow(data)) # add a numeric index for the days, so they're numeric rather than dates

# what we care about is the total number of projects over time
data$num_projects_sum <- cumsum(data$num_projects)


TOTAL <- sum(data$num_projects) #here's our first answer
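Since the running total is what we will model in the next sections, here is a minimal sketch of plotting it, reusing the same data frame built above:

# the cumulative number of projects over time
ggplot(data, aes(n, num_projects_sum)) +
  geom_line() +
  theme_bw()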

Different Strokes for Different Folks

Could it be that our final result changes depending on what we investigate and take into account? Does a repository count if it’s not code? What if it hasn’t been committed to in over a year? Do we care about forked repositories? It depends on what we care about. Do we care about active code on GitHub, or do we care about the sheer amount of storage space being taken up on the site? If the latter, do we need information about the size of the repository, maybe even the lines of code? If the former, do we need to look at which files are getting committed to, and which ones aren’t being touched? Let’s keep track of some of our findings in a table (one concrete variation of the query is sketched right after it).

library(knitr)
library(kableExtra)
library(dplyr)

Final_Result <- c(TOTAL)
Description <- c("simplest raw sum")
different_results <- data.frame(Final_Result, Description)
kable(different_results, caption = "different results obtained from different setups of the problem") %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE)
different results obtained from different setups of the problem

Final_Result  Description
37473073      simplest raw sum
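As one concrete example of a different setup, here is a hedged sketch of how the query could change if we decided that forked repositories shouldn’t count. It assumes the GHTorrent projects table exposes a forked_from column; check the schema of your own snapshot before relying on it. Like the earlier query, it is commented out because we aren’t executing queries in this lesson.

#- # a variation: only count repositories that are not forks
#- # (assumes the projects table has a forked_from column; a deleted flag could be filtered similarly)
#- sql_no_forks <- "select count(id) as num_projects, date(created_at) as day
#-                  FROM [ghtorrent-bq.ght.projects]
#-                  where forked_from is null
#-                  group by day"
#- data_no_forks <- query_exec(sql_no_forks, project = project, useLegacySql = FALSE)

Whatever total that variation produced could then be added as another row of the results table above.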

Training Data and Testing Data

You may have heard about models being “trained”. Sometimes this conjures up images of a baby robot learning its ABCs, but training a model basically means building an equation or program from some data so that it can reliably generalize to new data. We try to find an equation that “fits” the data while also being flexible enough to generalize into the future. If you’ve heard the terms “trend line” or “line of best fit”, this is the concept we will start with when discussing training and test data.

On our training data, we find a good-fitting trend line. To evaluate how good that trend line is, we compare how it performs on our test data. The benefit of reserving test data is that we can evaluate our model immediately, instead of simply waiting for it to be deployed in the world and either failing or succeeding (trust me, that’s not a good idea). So we hold on to some data where we know the true values, and compare those values to what our trained model would predict.

Below, we split our data into a training set and a test set, reserving the most recent third of the data (roughly a 70/30 split) for testing. With some data, you would sample randomly in order to get a wide range of test data from across the entire set. However, because our model is over time, we will reserve the last chunk of true data in order to evaluate how effective our models are. You do not always want to just grab the last bit of data, as this can be biased for some problems where time is not the independent variable. (Imagine you were trying to fit a model to determine the genre of a movie, and your test set was accidentally all holiday movies because you pulled from December.) In our case, we will grab the most recent data.

third <- length(data$n) %/% 3

test <- tail(data, third)       # most recent third of days
train <- head(data, third * 2)  # everything before that

ggplot(train, aes(n, num_projects_sum)) +
  geom_point(size = .2, alpha = .2) +
  ggtitle("Training Data") +
  theme_bw()

ggplot(test, aes(n, num_projects_sum)) +
  geom_point(size = .2, alpha = .2) +
  ggtitle("Testing Data") +
  theme_bw()
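To make the train/test idea concrete, here is a minimal sketch of fitting a simple straight-line trend to the training data and checking it against the test data. The straight line is just an illustration, not the model this lesson ultimately settles on, and RMSE is one common error measure among many.

# fit a simple linear trend on the training data
fit <- lm(num_projects_sum ~ n, data = train)

# predict the held-out days and measure how far off we are
preds <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$num_projects_sum - preds)^2))
rmse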