So far, we have been using frequentist statistics, relying on hypothesis testing and p-values. Frequentist statistics were particularly popular when computational resources were scarce, but that is no longer the case, so it is time to reexamine the benefits of applying Bayesian statistics to empirical software engineering work. If you’ve been lucky enough to never have been caught in the crossfire between a Bayesian and a Frequentist, then hopefully this lesson will go smoothly for you. But if you’ve firmly planted your flag on either side of the “debate”, take a deep breath. This lesson merely demonstrates some of the benefits of using a Bayesian approach, while also pointing out that both approaches often lead to functionally the same course of action. Hopefully, by the end, you’ll feel much more prepared to chime in if a raging Frequentist and an exasperated Bayesian walk into a bar.

~

Bayesian Statistics in Software Engineering: Practical Guide and Case Studies

Researcher-Centered Design of Statistics: Why Bayesian Statistics Better Fit the Culture and Incentives of HCI

Frequentist Statistics

Let’s try to concisely summarize what we have been doing so far as we obtain p-values and draw conclusions about hypotheses. Using frequentist methods, we have been comparing groups in our data to assess whether those groups could plausibly have been drawn from the same underlying data-generating process. In basic terms, we have been figuring out whether the groups are truly different or not. Because we live in a messy world, even samples from the exact same source will differ a little bit. Imagine taking a sample of algae from a pond, or collecting everyone’s feelings on a random Monday: if you were to repeat the same sampling, under the same circumstances, you’d still get slightly different results. The point of hypothesis testing is to make claims about the underlying populations, not just about the particular samples we happened to collect. A small p-value (conventionally less than .05) indicates that data as extreme as ours would be unlikely if the two groups really did come from the same process, so we treat it as evidence of a difference; a larger p-value indicates that the observed difference is entirely plausible under the “no difference” assumption, so we have no such evidence. (Note that the p-value is not the probability that the two groups are actually the same; making that kind of direct probability statement is exactly what a Bayesian analysis offers.) We have seen this in several of our lessons so far, and you will come across these methods in most empirical software engineering papers.
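
To see this concretely, here is a small illustrative sketch (not data from any study): we draw two samples from the exact same normal distribution and compare them with a t-test. Because both samples come from the same data-generating process, the test will usually report a large p-value, although any single run can still land below .05 purely by chance.

# Illustrative sketch only: two samples drawn from the exact same process
set.seed(42)
group_a <- rnorm(30, mean = 5, sd = 1)
group_b <- rnorm(30, mean = 5, sd = 1)
# The samples differ a little just by chance; the test usually reports a
# large p-value, i.e., no evidence that they come from different processes.
t.test(group_a, group_b)

If you remove the set.seed() call and rerun the comparison a few times, you can watch the p-value bounce around; that sampling variability is exactly what hypothesis testing is trying to account for.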

Bayesian Statistics

Original Study

In the original study, the authors compare the Agile and Structured groups with a Mann-Whitney U test (the Wilcoxon rank-sum test). Below, we load their survey data, plot the distribution of the outcome for each group, and run Shapiro-Wilk tests; both groups depart noticeably from normality, which is why a non-parametric test such as the U test is a reasonable choice here.

library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
# Survey responses from the original study
data <- read.csv("../data/agile/survey.csv")
head(data)
##         type group outcome stakeholders       time
## 1 Structured     0       9            9 Not at all
## 2      Agile     0       9            9   A little
## 3      Agile     0       7            8 Not at all
## 4      Agile     0      10           10 Not at all
## 5 Structured     0      10           10   A little
## 6 Structured     0       7            8   A little
plt <- ggplot(data[data$type=="Structured",],aes(outcome)) + 
  geom_histogram(aes(y = ..density..), bins=25,color="black")+
  geom_density(aes(y = ..density..),color="black",fill="black", alpha=.2,stat = 'density')+
  theme_bw()
  
plt

shapiro.test(data$outcome[data$type=="Structured"])
## 
##  Shapiro-Wilk normality test
## 
## data:  data$outcome[data$type == "Structured"]
## W = 0.86996, p-value = 0.01777
plt <- ggplot(data[data$type=="Agile",],aes(outcome)) + 
  geom_histogram(aes(y = ..density..), bins=25,color="black")+
  geom_density(aes(y = ..density..),color="black",fill="black", alpha=.2,stat = 'density')+
  theme_bw()
  
plt

shapiro.test(data$outcome[data$type=="Agile"])
## 
##  Shapiro-Wilk normality test
## 
## data:  data$outcome[data$type == "Agile"]
## W = 0.86798, p-value = 0.001824
# The Mann-Whitney U test (Wilcoxon rank-sum) reported in the original study
wilcox.test(data$outcome[data$type=="Agile"],data$outcome[data$type=="Structured"])
## Warning in wilcox.test.default(data$outcome[data$type == "Agile"],
## data$outcome[data$type == : cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  data$outcome[data$type == "Agile"] and data$outcome[data$type == "Structured"]
## W = 236, p-value = 0.5792
## alternative hypothesis: true location shift is not equal to 0
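
As a preview of the Bayesian alternative, here is a minimal sketch (not the original authors' analysis) of what the same comparison could look like. It assumes the rstanarm package is installed, treats the 0-10 outcome as continuous for simplicity, and uses rstanarm's default weakly informative priors.

# Minimal Bayesian sketch (illustration only; assumes rstanarm is installed):
# model the outcome as a function of group and inspect the posterior directly.
library(rstanarm)
fit <- stan_glm(outcome ~ type, data = data, family = gaussian(), refresh = 0)
# 95% posterior interval for the Structured-minus-Agile difference in outcome
posterior_interval(fit, pars = "typeStructured", prob = 0.95)

Rather than asking how surprising the data would be under “no difference”, the posterior interval tells us directly which values of the group difference are plausible given the data and the priors; if it comfortably straddles zero, the Bayesian and frequentist analyses point to the same practical conclusion for this survey.
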
# The larger-scale study they also ran; we will use this data for the Bayesian analysis
itp <- read.csv("../data/agile/itproj.csv")
head(itp)
##   type group percent    success  challenge    failure           time
## 1    H     0   1-10%       None     81-90%     11-20%        Neutral
## 2    H     0   1-10% Don't Know Don't Know Don't Know Not Applicable
## 3    H     0  31-40%     11-20%     81-90%       None    Ineffective
## 4    H     0  81-90%     71-80%     71-80%     21-30% Not Applicable
## 5    H     0  61-70% Don't Know Don't Know Don't Know        Neutral
## 6    H     0   1-10%      1-10%     81-90%      1-10%      Effective
##              ROI   stakeholders          quality
## 1        Neutral        Neutral        Effective
## 2 Not Applicable Not Applicable   Not Applicable
## 3        Neutral        Neutral Very Ineffective
## 4 Not Applicable Not Applicable   Not Applicable
## 5        Neutral Very Effective   Very Effective
## 6        Neutral    Ineffective Very Ineffective

Mathematical Peacocking: Stop It!