What is “Evidence”?

You’re about to learn a term that will hopefully not haunt you the way it haunted me: epistemology. After hearing it over and over in my first-year PhD courses, I spent an entire year using the term to the point of irony. Epistemology is the study of knowing. I won’t go too far into it, since it is literally a PhD’s worth of study, but in short it is “how we know what we think we know”:

the theory of knowledge, especially with regard to its methods, validity, and scope; the investigation of what distinguishes justified belief from opinion.

Miraculously, we can go through years of school and never question how we know anything. We learn from history books, practice math formulas, create artwork, and maybe conduct some neat experiments. All of that knowledge had to come from somewhere, and some of it is wrong.

Part of being a digital citizen, software engineer, and scholar is learning how to defend what you think you know. Whether that’s in political arguments or convincing someone why pickles are a polarizing issue, you will be caught over and over again in some form of argumentation. In software engineering, opinions abound: on the best language, the best practice, the hottest new package, the best workflow, and so on.

Through these lessons, you will develop your own ability to question how we know what we know, walking through how we can use data and statistics to draw conclusions about the software industry. Some people believe that statistics are the ground truth, while others believe that numbers could never capture the nuance of a problem. Both of these views are dangerous. We are about to embark on a journey into Statistical Wonderland, where some things are nonsense, some things are useful, some things are wrong, and some things are awesome. The goal of this curriculum is to give you a toolbox to find your own answers: to learn to read and dissect academic findings, and to apply the useful results to your own practice of software engineering.

Surveying a Population

Sometimes we don’t know anything about anything; we all have to start somewhere. The paper “Analyze This!” approaches an unexplored research area with a certain elegance. The authors asked 1,500 Microsoft engineers the following:

Please list up to five questions you would like a team of data scientists who specialize in studying how software is developed to answer

They then asked a new sample of 2500 Microsoft engineers to prioritize the questions.

So yeah, they just asked. If you want to know which things are most important for data scientists to work on, just ask. This is an incredible, high-powered starting point for distilling what matters most to the stakeholders involved. Let’s take a look at how this sample prioritized things, while also getting a lesson in R.

Loading Libraries

Libraries may also be referred to as packages. They are collections of functions, written by someone else, that are not part of the base programming language but serve a specific purpose. For instance, R has a built-in plot function, but we use ggplot2 because it offers better graphics and more customization. A library is like an “add-on” or “expansion pack” for a programming language.
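As a sketch, here is how the libraries used throughout this lesson would be loaded, assuming they have already been installed once with install.packages():

```r
# install.packages(c("readxl", "dplyr", "ggplot2", "knitr", "kableExtra", "cowplot"))
library(readxl)     # read Excel spreadsheets into dataframes
library(dplyr)      # data-manipulation verbs (group_by, summarise) and the %>% pipe
library(ggplot2)    # plotting
library(knitr)      # kable() for rendering tables
library(kableExtra) # kable_styling() for nicer table output
library(cowplot)    # arranging multiple ggplots in a grid
```

install.packages() downloads a package once; library() loads it into each new R session.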


Reading in the Data

Here we have an Excel spreadsheet, but many data files come as Comma-Separated Values (.csv). We are using the readxl library to read the spreadsheet into a dataframe that R can work with. Below I have printed the head and tail of this dataframe, but you should run the View(data) command in R to see the entire thing; you can run that command from the console after highlighting it in your R code.

data <- read_excel("../data/145Questions.xlsx", skip=3) # skip the first three rows because they contain some copyright text from Microsoft that reads in a little weird

Renaming Columns

The columns had names that weren’t conducive to the code we will write, so here’s how to rename the columns of a dataframe. Note that with this technique you must supply all of the column names, in order.

colnames(data) <- c("QuestionID", "Category", "Question", "Essential", "Worthwhile", "Unimportant", "Unwise", "Don't Know", "Distribution", "EssentialPercent", "WorthwhilePercent", "UnwisePercent", "EssentialRank", "WorthwhileRank", "UnwiseRank")
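If you only need to rename one or two columns, dplyr’s rename() is an alternative that leaves the rest untouched. A small sketch (the new name DontKnow is just an illustration, not part of the original code):

```r
library(dplyr)
# rename a single awkward column; backticks let us refer to a name with spaces
data <- data %>% rename(DontKnow = `Don't Know`)
```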

kable(head(data[1:8])) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| QuestionID | Category | Question | Essential | Worthwhile | Unimportant | Unwise | Don’t Know |
|---|---|---|---|---|---|---|---|
| 1 | Bug measurement | What is the impact and/or cost of finding bugs at a certain stage in the development cycle? | 50 | 52 | 11 | 3 | 1 |
| 2 | Bug measurement | What kinds of mistakes do developers make in their software? Which ones are the most common? | 48 | 65 | 2 | 0 | 1 |
| 3 | Bug measurement | In what places in their software code do developers make the most mistakes? | 41 | 69 | 7 | 0 | 0 |
| 4 | Bug measurement | What kinds of mistakes are caught by static analysis? | 24 | 59 | 24 | 2 | 4 |
| 5 | Bug measurement | How many new bugs are introduced for every bug that is fixed? | 30 | 66 | 13 | 6 | 1 |
| 6 | Bug measurement | Is the number of bugs a good measure of developer effectiveness? | 19 | 44 | 33 | 20 | 0 |
kable(tail(data[1:8])) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| QuestionID | Category | Question | Essential | Worthwhile | Unimportant | Unwise | Don’t Know |
|---|---|---|---|---|---|---|---|
| 140 | Testing practices | How do we measure test coverage with unit tests? | 21 | 52 | 38 | 1 | 0 |
| 141 | Testing practices | How can we create and run unit tests whose code and test inputs can be shared across teams? | 27 | 61 | 22 | 2 | 0 |
| 142 | Testing practices | Should we do Test-Driven Development? | 32 | 48 | 28 | 2 | 2 |
| 143 | Testing practices | What are benefits of Test-Driven Development for Microsoft? | 22 | 61 | 21 | 2 | 4 |
| 144 | Testing practices | How should we do Test-Driven Development while prototyping? | 26 | 43 | 34 | 4 | 2 |
| 145 | Testing practices | When do I maintain or update a test vs. remove it? | 21 | 56 | 31 | 3 | 1 |

Descriptive Statistics

Whenever we have data, we first report on descriptive statistics, which are things like the average values, the ranges of those values (highest and lowest), and other facts about the data that was collected. Descriptive statistics are contrasted with inferential statistics, which use statistical tests to draw conclusions about differences between groups, or fits of models, or other things that we can use to make sense of various phenomena. Let’s stick with descriptive statistics for now.
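Base R can report simple descriptive statistics directly. A quick sketch, assuming the dataframe from the Reading in the Data step is loaded:

```r
mean(data$Essential)    # average number of "Essential" votes per question
range(data$Essential)   # lowest and highest vote counts
summary(data$Essential) # min, quartiles, median, mean, and max in one call
```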

Note that the data we have access to is already in aggregate form (we don’t have access to each of the 2,500 individual Microsoft engineers’ ratings, due to privacy). I will demonstrate how to compute aggregate statistics across the categories with dplyr; this would work even if the data hadn’t already been aggregated:

aggregate <- data %>% # this symbol is referred to as a "pipe". it "pipes" the data into the other function calls
              group_by(Category) %>%
              summarise(AverageEssential = mean(Essential))

kable(aggregate) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Category | AverageEssential |
|---|---|
| Bug measurement | 36.28571 |
| Customers and requirements | 45.88889 |
| Development Best Practices | 24.88889 |
| Development Practices | 31.28571 |
| Evaluating quality | 33.93750 |
| Productivity | 27.92308 |
| Reuse and Shared Components | 42.66667 |
| Services | 36.62500 |
| Software development process | 30.78571 |
| Software lifecycle; Time allocation | 35.00000 |
| Teams and collaboration | 34.63636 |
| Testing practices | 26.35000 |
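To rank the categories by that average, we could sort the aggregated table. A sketch, assuming dplyr is loaded:

```r
# categories with the highest average "Essential" votes first
aggregate %>%
  arrange(desc(AverageEssential)) %>%
  head(3)
```

Given the averages above, Customers and requirements, Reuse and Shared Components, and Services would come out on top.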

First Visualization

We are going to make some plots. You may or may not have learned about box-and-whisker plots before, but if you haven’t, there are several pieces to these plots that help us visualize descriptive statistics. The bar in the middle of the box is the median, the edges of the box mark the first and third quartiles (the inter-quartile range), the whiskers extend to the most extreme values within 1.5 times the inter-quartile range, and any points beyond the whiskers are outliers, meaning they sit well above or below the rest of the data. We can’t actually learn too much from these plots, as they simply show us, across all categories, how willing the participants were to assign a certain label to a question. We can observe that people seemed more willing to label something “Worthwhile” than “Unwise” or “Essential”. This makes sense, because it’s easier to apply a less extreme judgment: “Unwise” seriously suggests that something should not be done, and “Essential” makes a serious judgment call on the value of something. “Worthwhile” is more relaxed, and it seems that more people were willing to use that label instead of strongly committing on most of the topics. It’s lucky for us that everything wasn’t labeled “Essential”, or we wouldn’t have any idea of where to start researching.

Note that I’m using packages called ggplot2 and cowplot to make these graphs. ggplot2 is a cornerstone of data visualization in R, whereas cowplot simply allows me to line the plots up horizontally. I’ve used theme(axis.text.x=element_blank()) to remove the x-axis labels, and theme_bw(), which stands for “theme black and white”, because it looks better to me. You should play around with your own preferences.

essential <- ggplot(data, aes(y=Essential)) +
  geom_boxplot() + theme_bw() + theme(axis.text.x=element_blank())

worthwhile <- ggplot(data, aes(y=Worthwhile)) +
  geom_boxplot() + theme_bw() + theme(axis.text.x=element_blank())

unimportant <- ggplot(data, aes(y=Unimportant)) +
  geom_boxplot() + theme_bw() + theme(axis.text.x=element_blank())

unwise <- ggplot(data, aes(y=Unwise)) +
  geom_boxplot() + theme_bw() + theme(axis.text.x=element_blank())

# lining up all the boxplots horizontally using cowplot
cowplot::plot_grid(essential, worthwhile, unimportant, unwise,
                   ncol = 4, rel_heights = c(1, 1),
                   align = 'h', axis = 'lr')

Distribution Across Questions

# How much was labeled Essential, Worthwhile, Unimportant, Unwise across the different categories?
# One way to show this: a 100%-stacked bar per category (pivot_longer is from tidyr)
data %>%
  pivot_longer(c(Essential, Worthwhile, Unimportant, Unwise),
               names_to = "Label", values_to = "Votes") %>%
  ggplot(aes(x = Category, y = Votes, fill = Label)) +
  geom_col(position = "fill") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + # tilt long category names
  xlab("Distributed Priority Percentage Across Questions")

Most Essential Question