In my experience, the world of software is an opinionated one. It seems like everyone has special tools, techniques, practices, and packages that they swear by. In this lesson, we will investigate one of these classic claims: which is better, test-first or test-last development? We will also explore what it looks like when people misattribute benefits to a specific practice when, in reality, the cause is entirely different. The study we are using is: A Dissection of the Test-Driven Development Process: Does It Really Matter to Test-First or to Test-Last? (Fucci et al. 2016). Below is an excellent quote from the paper, highlighting how statistics can separate the relevant findings from other distracting factors:

“This information can liberate developers and organizations who are interested in adopting TDD, and trainers who teach it, from process dogma based on pure intuition, allowing them to focus on aspects that matter most in terms of bottom line.”

[Figure: “Design Thinking” model emphasizes the iterative dynamic of creating something useful]

Test-Driven Development Process

The Study

Modeling Recipe

Here is a little recipe for modeling:

  1. Define the outcome variable(s) that you care about

  2. Define the factors that might affect (1); be willing to iterate!

  3. Determine which kind of model can explain the data we see: lm, glm, chisq, anova, etc.

  4. Keep it as simple as possible, and try to include as few features and interactions as you can. Simpler is better, as long as you are not underfitting.

  5. Determine an appropriate metric to evaluate the model: rmse, R^2, AIC, BIC, etc.

  6. Compare models to find the one with a nice balance between simplicity and goodness of fit (see the short sketch after this list).

  7. Continue to ask yourself: does this model actually make sense in the world? Even if it fits well mathematically, is it plausible?

  8. If so, then you have a very reasonable model! If you can, test it on new data, or perform an experiment that confirms your hypothesis.

  9. Without experimentation, you cannot know anything about causality. Instead, your model provides a representation of the phenomenon and the factors involved.
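
To make steps 3 through 6 concrete, here is a minimal sketch using R's built-in mtcars data (purely illustrative, not the study's dataset): fit two candidate linear models, look at R^2, AIC, and BIC, and keep the extra predictor only if it clearly earns its place.

# a minimal sketch of steps 3-6, using R's built-in mtcars data
# (illustrative only -- not the study's dataset)
m1 <- lm(mpg ~ wt, data = mtcars)       # one predictor (step 4: keep it simple)
m2 <- lm(mpg ~ wt + hp, data = mtcars)  # two predictors

# step 5: pick metrics -- R^2 from the summaries, plus AIC and BIC
summary(m1)$r.squared
summary(m2)$r.squared
AIC(m1, m2)
BIC(m1, m2)

# step 6: here the lower AIC/BIC favors m2, but only keep the extra
# term if it also makes sense in the world (step 7)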

Outcomes We Care About

Which subset of the factors best explains the variability in external quality?

So basically this question is asking “what combo of variables contributes to making stuff good?”

With any measure, we have to operationalize it in some way. “Code Quality” is defined as:

$$QLTY = \frac{\sum_{i=1}^{\#TUS} QLTY_i}{\#TUS}$$

with $QLTY_i$ defined as:

$$QLTY_i = \frac{\#ASSERT_i(PASS)}{\#ASSERT_i(ALL)}$$

Don’t panic. I’ve actually wanted to do a research study where I use eye-tracking technology to watch how learners react to equations in an academic paper. If you’re anything like me, your eyes glaze right over the equation, and then you curse yourself for not understanding it instantly without really reading it. When you dig into these equations, you see that they are a fancy way of counting stuff. #TUS is the number of “Tackled User Stories”, or “how many problems were attempted”. For each of those stories, count up the proportion of passing assert statements, and then take the average over all the stories attempted. This way, we are simply measuring functional correctness, with no accounting for things like style or readability.
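
As a quick sanity check on the formula, here is a small sketch with made-up assert counts for three tackled user stories (the numbers are hypothetical, not taken from the study):

# hypothetical assert counts for three tackled user stories
passing <- c(8, 5, 10)    # passing asserts per story
total   <- c(10, 10, 10)  # total asserts per story

# QLTY_i: proportion of passing asserts for each story
qlty_i <- passing / total

# QLTY: average over the tackled stories, scaled to a percentage
mean(qlty_i) * 100
## [1] 76.66667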

Which subset of the factors best explains the variability in developer productivity?

And this question is asking “what combo of variables contributes to developers producing more?” (though you may remember from Analyze This! that this may be unwise to measure!)

Productivity is defined as:

$$PROD = \frac{OUTPUT}{TIME}$$

with OUTPUT as the total number of passing assert statements, and TIME measured from the time the task is opened until it is closed.
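
For instance, a hypothetical developer with 30 passing asserts on a task that took 150 minutes would score (assuming TIME is recorded in minutes):

output <- 30    # total passing assert statements
time   <- 150   # minutes from when the task was opened until it was closed
output / time   # PROD
## [1] 0.2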

Factors We Measure

The factor definitions are taken directly from Table 2 of the paper; in the dataset they appear as the columns GRA (granularity), UNI (uniformity), SEQ (sequencing), and REF (refactoring).

# load the study's dataset (semicolon-separated values)
data <- read.csv("../data/dissectionTDD/dataset.csv", sep = ";")
head(data)
##   ID  QLTY PROD   GRA  UNI   SEQ   REF
## 1  1 77.27 0.07  0.87 0.29  0.00 45.45
## 2  2 58.71 0.21 31.07 3.26 50.00 50.00
## 3  3 85.42 0.32  1.66 1.78 28.12  6.25
## 4  4 84.52 0.34  1.05 0.91  8.62 50.00
## 5  5 45.91 0.17  7.30 6.40 20.00 26.66
## 6  6 77.95 0.37  8.02 8.36  0.00 44.44

Descriptive Statistics

There is a difference between descriptive statistics and inferential statistics. Descriptive statistics describe properties of the data, such as means, ranges, and the normality of the variables of interest. Inferential statistics draw the actual conclusions about the data, reporting on correlations, hypothesis tests, and [estimation of parameters](../glossary.html#parameter). Inferential statistics help us generalize to a larger population, going beyond the descriptive statistics of the immediate sample.
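
To illustrate the distinction with the data frame loaded above: summarizing QLTY is descriptive, while a correlation test between SEQ and QLTY is inferential (this particular test is just an example, not an analysis from the paper).

# descriptive: what does the sample itself look like?
summary(data$QLTY)
sd(data$QLTY)

# inferential: reach beyond the sample, e.g. test whether sequencing
# and external quality are correlated in the population
cor.test(data$SEQ, data$QLTY)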

library(ggplot2)

# small function to generate the default ggplot colors
gg_color_hue <- function(n) {
  hues <- seq(15, 375, length = n + 1)
  hcl(h = hues, l = 65, c = 100)[1:n]
}

titles <- c("Quality", "Productivity", "Granularity", "Uniformity", "Sequencing", "Refactoring")
# get a color for each variable
cols <- gg_color_hue(length(titles))

# loop over columns 2-7 to create six density plots and check normality
for (i in 2:7) {
  x <- data[[i]]
  plt <- ggplot(data, aes(x)) +
    ggtitle(paste("Histogram and Density for", titles[i - 1])) +
    geom_histogram(aes(y = ..density..), bins = 25, color = "black", fill = cols[i - 1]) +
    geom_density(aes(y = ..density..), color = "black", fill = "black", alpha = .2) +
    xlab(titles[i - 1]) +
    theme_bw()

  print(plt)
  print(shapiro.test(x))
}

## 
##  Shapiro-Wilk normality test
## 
## data:  x
## W = 0.97198, p-value = 0.0695

## 
##  Shapiro-Wilk normality test
## 
## data:  x
## W = 0.80645, p-value = 5.326e-09