Someone wise once asked me:

“How do you choose a heart surgeon?”

“I don’t know, how?”

“You look for the surgeon with the highest failure rate who is still accepting patients.”

“And why on earth would you want that?”

“Because it means that they’re getting all the hardest cases and people still trust them to do the surgery.”

This little exchange is surprisingly relevant to how we measure developer performance. Bear with me. If I were to tell you that the best developer at my company fixed the fewest bugs, what would you think? Perhaps they are fixing the really tough ones. You might also guess that they simply don’t make mistakes, but that is just untrue. We all make mistakes in our code and spend a lot of time debugging (novices more than experts, although the problems also get harder with experience, so there’s a lot at play!). Imagine that at the end of the workweek, everyone has to report how many bugs they fixed. One person reports fixing 78 and another person reports fixing 2. Who is the better developer?

Maybe you have some theories, or some immediate reactions. Perhaps you have some explanations of why that’s not enough information. This is great practice for figuring out how to measure a concept like performance.

One way to measure performance is to look at “number of faults found” in software testing. Finding and fixing faults may be a reasonable way to determine whether developers can identify a problem and implement solutions that help software behave as expected. Fair enough, right? Hopefully you’re learning that when we measure anything, we have to operationalize it in a sensible way, taking prior research into account and interpreting results with our nuanced operationalization in mind. “Performance” is not necessarily captured entirely by “number of faults found”, but it is totally reasonable to start somewhere. We take a look at data provided in Iivonen’s Identifying and Characterizing Highly Performing Testers–A Case Study in Three Software Product Companies, pg. 52.

How do developers differ in their measurable output performance?

library(knitr)       # kable()
library(kableExtra)  # kable_styling()

data <- read.csv("performance_dev.csv")

# Check that the resolution percentages sum to 100; the off-by-ones are probably due to rounding
data$Totals <- data$Fixed + data$No.Fix + data$Duplicate + data$Cannot.reproduce

kable_styling(kable(data), bootstrap_options = c("striped", "hover"))
Tester  Defects  Extra.Hot  Hot  Normal  Status.Open  Fixed  No.Fix  Duplicate  Cannot.reproduce  Totals
A            74          4    1      95           12     62      26         12                 0     100
B            73          0   56      44           15     87       6          2                 5     100
C            70          0   29      71           36     71      24          0                 4      99
D            51          0   27      73           33     85       6          0                 9     100
E            50          2   16      82           30     89       9          0                 3     101
F            18          0   22      78           22     64      14          0                21      99
G            17         18   18      65           18     71      14          0                14      99
H            17         53   18      29            6     94       0          0                 6     100
I            12          8   17      75           42    100       0          0                 0     100
J             2          0    0     100            0     50      50          0                 0     100
K            80         21   59      20           13     90       7          0                 3     100
L            55          0   29      71           27     80      15          0                 5     100
M            48         13   21      67           19     97       3          0                 0     100
N            48         17   38      42            8     89       7          0                 5     101

Note that every column except Tester and Defects is a percentage; Defects is a raw count.

The Measure Matters

Consider the developer with the fewest defects and the developer with the most. Let’s play with personas a little bit:

Jesse Jordan (Tester J) found 2 defects, each self-reported as “Normal” faults. They fixed one of them, but couldn’t fix the other. The paper points out that Jesse actually resigned during the data collection period, and this is only a partial time series. We don’t know what actually happened to Jesse, or why they resigned.

##    Tester Defects Extra.Hot Hot Normal Status.Open Fixed No.Fix Duplicate
## 10      J       2         0   0    100           0    50     50         0
##    Cannot.reproduce Totals
## 10                0    100

Kendall Kennedy (Tester K) found 80 defects. They’re still working on 13% of the faults; of the finished ones, they fixed 90% and weren’t able to fix 7%, and the remaining 3% were not reproducible at all. They labeled the faults as primarily “Hot” (59%), with the rest split between “Extra Hot” (21%) and “Normal” (20%).

##    Tester Defects Extra.Hot Hot Normal Status.Open Fixed No.Fix Duplicate
## 11      K      80        21  59     20          13    90      7         0
##    Cannot.reproduce Totals
## 11                3    100
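
To get a feel for what these percentages mean in raw numbers, the sketch below converts them back to approximate counts. It assumes, following the reading in the personas, that the resolution percentages (Fixed, No.Fix) describe only the closed defects while Status.Open is a share of all defects; the paper may define these differently. The helper name approx_counts is made up for this illustration.

```r
# Approximate raw counts from the percentage columns.
# Assumption: resolution percentages apply to closed defects only.
approx_counts <- function(defects, status_open, fixed, no_fix) {
  closed <- defects * (1 - status_open / 100)
  c(open   = defects - closed,
    closed = closed,
    fixed  = closed * fixed / 100,
    no_fix = closed * no_fix / 100)
}

# Values typed in from the rows for Tester K and Tester J.
kendall <- approx_counts(defects = 80, status_open = 13, fixed = 90, no_fix = 7)
jesse   <- approx_counts(defects = 2,  status_open = 0,  fixed = 50, no_fix = 50)

round(kendall, 1)
round(jesse, 1)
```

Under that assumption, Kendall fixed roughly 63 defects to Jesse's 1, which is a very different picture from comparing "90% fixed" with "50% fixed".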

Do you think we can tell who is performing better from that measure? Surely there is more to the story than just number of defects found.
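
One quick way to see how much the choice of measure matters is to rank every tester twice, once by defects found and once by percentage fixed, and compare. This is only a sketch: the values are retyped from the table, and the column names rank_defects and rank_fixed are made up for the example.

```r
# Tester, Defects (count) and Fixed (percent), copied from the table.
perf <- data.frame(
  Tester  = LETTERS[1:14],
  Defects = c(74, 73, 70, 51, 50, 18, 17, 17, 12, 2, 80, 55, 48, 48),
  Fixed   = c(62, 87, 71, 85, 89, 64, 71, 94, 100, 50, 90, 80, 97, 89)
)

# Rank 1 = highest value on that measure; ties share the better rank.
perf$rank_defects <- rank(-perf$Defects, ties.method = "min")
perf$rank_fixed   <- rank(-perf$Fixed,   ties.method = "min")

# Sort by defects found and read across to see where the rankings disagree.
perf[order(perf$rank_defects), ]
```

Tester A, for example, is second for defects found but second to last for percentage fixed, while Tester I is the reverse: near the bottom for defects found and at the top for percentage fixed.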

These lessons are here not only to provide some statistical skills but also to get you thinking about how to measure things. Towards the end of this lesson, we will also try to devise a system for grading group work. Keep that in mind as we operationalize what it means to measure performance.

Weighted Features

Each of the columns represents a feature of interest. If you’ve heard the term “feature engineering” when reading about Big Data or Machine Learning, it’s talking about choosing which factors to pay attention to. Sometimes we get so nervous about buzzwords like “Machine Learning” that we never get the chance to break down what it all really means. Well, this problem is an introduction to what it might mean to select features and combine them to predict an outcome. Whether it’s determining what series you will watch on Netflix, or how well you are performing at your job, this kind of problem persists across all disciplines.

It might occur to you that it’s simply not fair to use such a measure to determine someone’s productivity. Most of the time, you’d probably be correct. Our data sources, the people designing the measures, and the people interpreting the results are ill-equipped to account for the enormous diversity in the world. Hopefully, now that you’re armed with more statistical know-how, you’ll be better able to advocate for yourself against harmful models.

While it may not be ideal to use a blanket measure for someone’s productivity, it might still be useful to implement different support systems for workers in a company; if someone is struggling to feel productive and accomplished, perhaps we could catch it with some metrics and intervene to help them be the most fulfilled version of themselves.

We discussed that “number of faults found” isn’t the best measure of performance in isolation, and we talked about how you need to include other features in combination. But should every feature be weighted equally? Or do some things matter more than others? The performance data even includes the developers’ own ratings of how serious their faults were. Let’s look at how Jesse and Kendall rated everything.

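min_and_max <- data[data$Defects==max(data$Defects) | data$Defects==min(data$Defects) ,] #an overly verbose way to do this

library(ggplot2)
# Assumed reconstruction: the opening ggplot() call is missing from the source
ggplot(data, aes(x = Defects, y = Fixed)) +
  geom_point() +
  xlab("Number of Defects Found")+
  ylab("Percentage Fixed")+
  ggtitle("Defects Found by Percentage Fixed")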

cor(data$Defects,data$Fixed) #nada
## [1] 0.1754979
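
A correlation of about 0.18 from only 14 testers is weak evidence either way. One way to check, sketched here with base R's cor.test() on the same two columns (retyped by hand from the table), is to look at the confidence interval around the estimate.

```r
defects <- c(74, 73, 70, 51, 50, 18, 17, 17, 12, 2, 80, 55, 48, 48)
fixed   <- c(62, 87, 71, 85, 89, 64, 71, 94, 100, 50, 90, 80, 97, 89)

# Pearson correlation with a 95% confidence interval and a p-value.
result <- cor.test(defects, fixed)

result$estimate  # same value as cor(defects, fixed)
result$conf.int  # lower and upper bounds of the 95% interval
result$p.value
```

With this little data the interval is wide and includes zero, so "nada" is a fair summary: these numbers cannot distinguish a weak positive relationship from no relationship at all.
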
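#I want the distribution of Extra Hot, Hot, Normal for each tester
library(tidyr); library(ggplot2) # gather() comes from tidyr
hot <- gather(data, Heat, Value, c("Extra.Hot","Hot","Normal"), factor_key=TRUE)

# Assumed reconstruction: the opening ggplot() call is missing from the source
ggplot(hot, aes(x = Value)) +
  geom_histogram(aes(y = ..density..), color="black")+
  geom_density(aes(y = ..density..),color="black",fill="black", alpha=.2,stat = 'density')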
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
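
To make the weighting question concrete, here is a minimal sketch of a severity-weighted defect score. The weights (3 for Extra Hot, 2 for Hot, 1 for Normal) and the column name weighted_defects are hypothetical choices for illustration, not something taken from the paper. Changing them changes the ranking, which is exactly the point.

```r
# Severity ratings (percent of each tester's defects), copied from the table.
sev <- data.frame(
  Tester    = LETTERS[1:14],
  Defects   = c(74, 73, 70, 51, 50, 18, 17, 17, 12, 2, 80, 55, 48, 48),
  Extra.Hot = c(4, 0, 0, 0, 2, 0, 18, 53, 8, 0, 21, 0, 13, 17),
  Hot       = c(1, 56, 29, 27, 16, 22, 18, 18, 17, 0, 59, 29, 21, 38),
  Normal    = c(95, 44, 71, 73, 82, 78, 65, 29, 75, 100, 20, 71, 67, 42)
)

# Hypothetical weights: a more serious fault counts for more.
weights <- c(Extra.Hot = 3, Hot = 2, Normal = 1)

# Average weight per defect, then scale by the number of defects found.
pct <- as.matrix(sev[, names(weights)]) / 100
sev$weighted_defects <- sev$Defects * as.vector(pct %*% weights)

sev[order(-sev$weighted_defects), c("Tester", "Defects", "weighted_defects")]
```

With these particular weights, Tester B overtakes Tester A even though A found one more defect, because A rated almost everything as Normal. Keep in mind that the ratings are self-reported, so a score like this also rewards anyone who labels their faults generously.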