Learning to Code

This is a survey of over 15,000 respondents about their experiences of learning to code. You can read more about their goals, process, and survey here. You can learn about the data here. The “Learn to Code!” buzz is not the same everywhere in the world, but it is clear that these skills are needed across almost every domain. But who is trying to learn? Who is failing? And how might we intervene? You are a learner too, so it will be fun to compare your own answers to the survey as we go along. Today’s lesson is a warmup in R, statistics, and handling data.

Load Your Libraries

A library is a collection of other programs and functions written to support a specific task. Basically, it saves you the time of writing these things by hand, because someone has already done it. We load these libraries so we can use their features in our code.

library(ggplot2)
library(dplyr)
library(kableExtra)

Structure of the Data

Here we read in a .csv file using the read.csv function and take a look at the dimensions of the dataframe. If you’ve used spreadsheets before, a dataframe is similar: rows and columns of values, but represented in R in a way that lets us manipulate the data freely and run code on it. In this little exercise we also explore the is.na() function, which tells us which values are missing. Missing data is often represented as NA.

data <- read.csv("../data/newcoders.csv")
#head(data)

# how many people are in this dataset?
nrow(data)
## [1] 15620
# how many columns/questions?
ncol(data)
## [1] 113
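#ratio of non-missing to missing cells: there are about half as many observed values as NAs,
#so roughly two-thirds of all cells are NA. It's not a problem, but something to be aware of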
sum(!is.na(data))/sum(is.na(data))
## [1] 0.4905222
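
If the share of missing cells is the number you are after, rather than the ratio of observed to missing, taking the mean of is.na() gives it directly. Here is a minimal sketch on a made-up dataframe; `toy` and its values are hypothetical and are not taken from the survey.

```r
# Hypothetical toy dataframe, not the survey data
toy <- data.frame(
  Age    = c(28, NA, 19, NA),
  Income = c(NA, NA, 40000, NA)
)

# is.na() returns a logical matrix; TRUE counts as 1, so mean() is the share of NA cells
na_share <- mean(is.na(toy))

# Ratio of observed cells to missing cells (the quantity computed for the survey above)
obs_to_na <- sum(!is.na(toy)) / sum(is.na(toy))

# Share of missing values within each column
na_by_col <- colMeans(is.na(toy))

print(na_share)
print(obs_to_na)
print(na_by_col)
```

The per-column version is often the more useful one, because it shows which questions most respondents skipped.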

Useful Things To Do With Variables

We can explore properties of our data using typeof, summary, and str. A data type is a kind of data item, and it constrains what actions can be performed on that item. For instance, we might have integers (whole numbers), floats (numbers with decimal points), character types (sequences of letters or digits that are treated like words), or factors (categories). We could treat Age as a factor (category), but then we could not get the average age, because you cannot take the mean of a category.

#get the data type
typeof(data$Age)
## [1] "integer"
#get some descriptive statistics on the column
summary(data$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   10.00   23.00   27.00   29.18   33.00   86.00    2007
#get the structure: datatype, range, and head
str(data$Age)
##  int [1:15620] 28 22 19 26 20 34 23 35 33 33 ...
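
To make the point about factors concrete, here is a small sketch on a hypothetical vector of ages (the values are made up for illustration). It shows that the mean works on integers, fails on a factor, and how to convert a factor back to numbers.

```r
# Hypothetical ages, not taken from the survey
age <- c(28L, 22L, 19L, 26L)

typeof(age)            # storage type of the vector
mean_int <- mean(age)  # the mean is defined for numbers

# Treat the same values as categories
age_factor <- factor(age)

# mean() on a factor returns NA (with a warning): categories have no average
mean_fac <- suppressWarnings(mean(age_factor))

# To recover the numbers, convert to character first;
# as.numeric(age_factor) alone would return the internal level codes
age_back <- as.numeric(as.character(age_factor))

print(mean_int)
print(mean_fac)
print(age_back)
```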

Measures of Centrality

We use the idea of centrality to say something about the population. Sometimes the middle is meaningful. If most values occur around a certain point, maybe we can use that information to infer things about our sample. However, it’s important to keep in mind that sometimes those measures can give different results. Here we explore mean, median, and mode. While these are usually where we begin with statistics, be aware that taking the average of a group is not always the right way to go, especially when the distribution has more than one peak.


#R has no built-in function for the statistical mode (mode() reports the storage mode instead).
#This helper finds the highest count among the non-missing values and returns that value
getmode <- function(v) {
   uniqv <- unique(na.omit(v))
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

#mean, median, and mode
mean_age <- mean(data$Age, na.rm = TRUE)
med_age <- median(data$Age, na.rm = TRUE)
mode_age <- getmode(data$Age)

ggplot(data,aes(Age))+
  geom_histogram()+
  geom_vline(xintercept = mean_age,linetype="dashed",color="red")+
  geom_vline(xintercept = med_age,linetype="dashed",color="blue")+
  geom_vline(xintercept = mode_age,linetype="dashed",color="green")+
  theme_bw()
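
To see how far apart the three measures can land, here is a sketch on a small, made-up, right-skewed sample. The vector `toy_ages` is hypothetical, and the helper is the same getmode function defined above, repeated so the snippet runs on its own.

```r
# Same helper as above, repeated so this snippet is self-contained
getmode <- function(v) {
  uniqv <- unique(na.omit(v))
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Hypothetical ages: most people are around 20, a few are much older
toy_ages <- c(20, 20, 20, 21, 22, 40, 41, 65)

toy_mean   <- mean(toy_ages)    # pulled upward by the few large values
toy_median <- median(toy_ages)  # middle of the sorted values
toy_mode   <- getmode(toy_ages) # most frequent value

print(c(mean = toy_mean, median = toy_median, mode = toy_mode))
```

A few older respondents are enough to pull the mean well above the median and mode, which is why it is worth looking at all three alongside the histogram.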

Factors and Outcomes We Care About

When developing a research question, we often start with a vague idea of what we’d like to know about.

“I want to know about people who learn to code in coding bootcamps”

The next steps involve questioning every inch of this idea until you have a fully formed, answerable research question. What do you want to know? Any bootcamps? Which people? What are they learning? Where? Do you want to know about their experience during the bootcamp or after it? There is so much to know, and this process can quickly become overwhelming. I remember being so frustrated during the first quarter of my PhD every time someone would poke at my barely formed research question, saying “yeah but what is learning exactly?”

This lesson will put some of this “narrowing down” into practice. We have a large dataset, with over 15,000 respondents and over 100 questions, but we will systematically choose our focus to make sense of a research question. By the end of the lesson, you may want to investigate something else in this dataset, and you’ll have the tools to do that.

Outcomes

Let’s begin with identifying the outcomes we want to know about. These may also be referred to as response variables or dependent variables. They may also simply be represented by the variable y. So, we have a dataset of people who went through coding bootcamps. Maybe we want to know about their success in the bootcamp (whether they finished or not), their job prospects, and their income. We are also interested in their Employment field, which was answered categorically.

Here we can see the summaries of the relevant outcome variables. Note that both BootcampPostSalary and Income (in the past year) were recorded. I’m assuming that these differ because some people got salary boosts for participating in bootcamps, were offered short contracting jobs, or had other circumstances that mean the post-bootcamp salary does not match their Income at the time of the survey. More people answered the Income question, so we will use that measure primarily.

summary(data$BootcampFinish)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   1.000   0.689   1.000   1.000   14687
summary(data$BootcampFullJobAfter)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   1.000   0.584   1.000   1.000   14985
summary(data$BootcampPostSalary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   50000   60000   63741   77000  200000   15290
summary(data$Income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   20000   37000   44930   60000  200000    8291
kable(summary(data$EmploymentField))%>%
  kable_styling(bootstrap_options = c("striped", "hover"))
x
architecture or physical engineering 150
arts, entertainment, sports, or media 416
construction and extraction 118
education 610
farming, fishing, and forestry 19
finance 274
food and beverage 279
health care 264
law enforcement and fire and rescue 29
legal 68
office and administrative support 414
sales 335
software development 134
software development and IT 4349
transportation 149
NA’s 8012

Factors

The factors are variables that may be affecting the outcomes we care about. These may also be referred to as independent variables or represented by the variable x (sometimes with numbered subscripts when there are multiple independent variables involved). Often these are demographic variables or things like time. Here, I select a few interesting factors: Age, Gender, money invested in learning, education (including school major), software development experience, and time spent practicing.

summary(data$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   10.00   23.00   27.00   29.18   33.00   86.00    2007
summary(data$Gender)
##     agender      female genderqueer        male       trans        NA's 
##          38        2840          66       10766          36        1874
summary(data$MoneyForLearning)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       0      30    1108     300  180000     941
summary(data$MonthsProgramming)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    3.00    9.00   24.43   24.00  720.00     606
summary(data$HoursLearning)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.00   10.00   15.32   20.00  100.00     678
kable(summary(data$SchoolDegree))%>%
  kable_styling(bootstrap_options = c("striped", "hover"))
x
associate’s degree 649
bachelor’s degree 5644
high school diploma or equivalent (GED) 1356
master’s degree (non-professional) 1445
no high school (secondary school) 258
Ph.D. 160
professional degree (MBA, MD, JD, etc.) 692
some college credit, no degree 2268
some high school 764
trade, technical, or vocational training 443
NA’s 1941
majors <- data[!is.na(data$SchoolMajor),] %>%
  count(SchoolMajor)%>%
  arrange(desc(n)) 
kable(head(majors,10))%>%
  kable_styling(bootstrap_options = c("striped", "hover"))
SchoolMajor n
Computer Science 1387
Information Technology 408
Business Administration 284
Economics 252
Electrical Engineering 220
English 204
Psychology 187
Electrical and Electronics Engineering 164
Software Engineering 159
Liberal Arts 157
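
The count tables above rely on text columns such as Gender and SchoolDegree being stored as factors. Since R 4.0.0, read.csv() defaults to stringsAsFactors = FALSE, so on a recent version of R those columns would be read as character unless you pass stringsAsFactors = TRUE or convert them yourself. The sketch below shows the difference on a hypothetical column; `toy` and its values are made up.

```r
# Hypothetical responses, not the survey data
toy <- data.frame(
  Gender = c("female", "male", "male", NA),
  stringsAsFactors = FALSE
)

# On a character column, summary() reports only length, class, and mode
print(summary(toy$Gender))

# Converting to a factor restores per-category counts, with NA's listed separately
toy$Gender <- factor(toy$Gender)
gender_counts <- summary(toy$Gender)
print(gender_counts)

# levels() lists the categories (it returns NULL for a character vector)
gender_levels <- levels(toy$Gender)
print(gender_levels)
```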

Descriptive vs. Inferential Statistics

A good place to start is the difference between descriptive and inferential statistics. Descriptive statistics summarize and describe the data without drawing conclusions about the larger population. This includes things like measures of centrality, ranges, counts, spreads, and more. Inferential statistics use the sample data to draw conclusions about hypotheses regarding the general population that the sample may have come from. This includes hypothesis testing and modeling. Inferential statistics is also where you would see things like effect sizes and significance (or p) values.

The way that I think about hypothesis testing is that we are looking for evidence for whether or not there is a true difference between groups. A true difference would mean a difference in the means or distributions that could reliably generalize if we got more people in the sample. We can never prove anything with statistics, but we can collect evidence. And that is powerful.
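
As a rough illustration of the distinction, the sketch below simulates incomes for two hypothetical groups, describes each sample, and then runs a two-sample t-test as one example of an inferential step. The group names, sizes, and distribution parameters are all assumptions chosen for illustration, not values from the survey.

```r
set.seed(42)  # make the simulated draws reproducible

# Simulated incomes for two hypothetical groups (not survey data)
group_a <- rnorm(50, mean = 40000, sd = 10000)
group_b <- rnorm(50, mean = 50000, sd = 10000)

# Descriptive: summarize the samples we actually have
print(mean(group_a))
print(mean(group_b))

# Inferential: Welch two-sample t-test (the default for t.test with two samples),
# asking whether the underlying population means differ
result <- t.test(group_a, group_b)

print(result$p.value)   # evidence against "no difference in means"
print(result$conf.int)  # 95% confidence interval for the difference in means
```

The sample means only describe these 100 simulated people. The p-value and confidence interval are statements about the populations the samples were drawn from.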

Descriptive

We started looking at descriptive statistics when we used the summary() command in R. Another tool for descriptive statistics is data visualization, including histograms. Below, we have a histogram for one of our outcome variables, Income.

ggplot(data,aes(Income))+
  geom_histogram(bins=40)+
  theme_bw()

Let’s refer to one of our factors and take a look at a visualization of Income. I’ve selected SchoolDegree to see if the type of degree you have interacts with income in any way. What do you think?

levels(data$SchoolDegree)
##  [1] "associate's degree"                      
##  [2] "bachelor's degree"                       
##  [3] "high school diploma or equivalent (GED)" 
##  [4] "master's degree (non-professional)"      
##  [5] "no high school (secondary school)"       
##  [6] "Ph.D."                                   
##  [7] "professional degree (MBA, MD, JD, etc.)" 
##  [8] "some college credit, no degree"          
##  [9] "some high school"                        
## [10] "trade, technical, or vocational training"
ggplot(data[!is.na(data$SchoolDegree),],aes(Income,fill=SchoolDegree))+
  geom_histogram(position="dodge")+
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 6368 rows containing non-finite values (stat_bin).

The above graph is a bit difficult to interpret. Instead, let’s look at a more isolated comparison. Let’s isolate the groups we are curious about using dplyr and the filter() command in order to get a clearer visualization of those groups.

#keep only the three degree groups we want to compare
ba_and_ma <- data %>%
  filter(SchoolDegree %in% c("bachelor's degree",
                             "master's degree (non-professional)",
                             "trade, technical, or vocational training"))



ggplot(ba_and_ma,aes(Income,fill=SchoolDegree))+
  geom_histogram(position="dodge")+
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 3002 rows containing non-finite values (stat_bin).
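
A numeric summary can complement the histogram. The sketch below combines filter() with group_by() and summarise() from dplyr to get the number of answers and the median income per degree group. It runs on a small made-up dataframe: `toy` and its values are hypothetical, and only the column names mirror the survey.

```r
library(dplyr)

# Hypothetical toy data, not the survey responses
toy <- data.frame(
  SchoolDegree = c("bachelor's degree", "bachelor's degree",
                   "master's degree (non-professional)",
                   "master's degree (non-professional)", "Ph.D."),
  Income = c(30000, 50000, 60000, NA, 80000)
)

degree_summary <- toy %>%
  # keep only the groups we want to compare
  filter(SchoolDegree %in% c("bachelor's degree",
                             "master's degree (non-professional)")) %>%
  # one row of output per degree
  group_by(SchoolDegree) %>%
  summarise(
    n_answered    = sum(!is.na(Income)),         # respondents who reported an income
    median_income = median(Income, na.rm = TRUE) # median of the reported incomes
  )

print(degree_summary)
```

Reporting the count next to the median is a useful habit here, since many respondents left Income blank.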