We have talked a bit about how important it is to properly **operationalize** what we **measure**. Not to add more to your plate, but even if we get the right *measurement* and do the correct *statistics*, the *reliability of measurement* also matters when you’re studying people. This lesson will highlight both a software engineering question (predicting when a project will be done) and how to think about the reliability of what you measure. After walking through the *Need for Sleep* lesson, you’re surely realizing how careful we have to be with our experimental setup. You can imagine that the best way to know if something is “true” is to test a concept over and over again, under lots of different circumstances, and see if your results hold. Imagine testing participants on multiple occasions, not only to test your **hypothesis**, but also to see how steadily it holds over time. All of that is ideal, of course. Most of the time, there are limited resources, limited participants, limited time, and a rush towards a deadline. But convenience should never get in the way of good science. So let’s take a look at how unreliable measurements might be.

In the following study, *Inconsistency of expert judgment-based estimates of software development effort*, expert participants estimated the effort (in work-hours) required for sixty software projects. They unknowingly estimated six of those projects *twice*, which helps answer questions about the reliability of professional effort estimates. The paper tries to answer the following research question:

How consistent are software professionals’ expert judgment-based effort estimates?

This lesson is going to explore how we estimate how long software will take to finish (or your homework, for that matter). We will explore several statistical and scientific concepts:

- reliability of measurement
- generalization of results
- features that could matter
- data wrangling
- accuracy

```
source("../data/ESEUR_config.r")
library("plyr")
library("dplyr")
library("reshape2")
library("ggplot2")
```

Whatever your problem, you will come across the need to wrangle your data into different shapes. At first, it can seem either pointless or too complicated to be worth it. I remember first hearing the terms “melt” or “longform data” and wondering if I could just avoid ever doing that. Turns out, everyone was right and it’s way better to wrap your head around a good `melt` command and move on with your life. Let’s get a quick briefing on **long form**, **wide form**, **reshaping**, and **melting** data.

`incon=read.csv(paste0(ESEUR_dir, "../data/estimation/inconsistency_est.csv.xz"), as.is=TRUE)`

This data format “stretches” the results by each Task across several columns. Each task has its own column. I’ve even demonstrated the *even wider* version of the data, where the **Order** variable is also “stretched” across the columns, instead of being in long form. What we are left with is simply the Subject ID number and then a separate column for each subject’s ratings for each Task in each Order.

`head(incon)`

```
## Order Subject T1 T2 T3 T4 T5 T6
## 1 1 1 32 8.0 32 4.0 16 6
## 2 2 1 30 8.0 28 4.0 40 10
## 3 1 2 6 6.0 7 1.0 10 4
## 4 2 2 13 2.5 11 2.5 15 3
## 5 1 3 5 5.0 4 2.0 6 1
## 6 2 3 5 2.0 5 2.0 16 3
```

```
wider <-reshape(incon, direction="wide", idvar=c("Subject"), timevar="Order")
head(wider)
```

```
## Subject T1.1 T2.1 T3.1 T4.1 T5.1 T6.1 T1.2 T2.2 T3.2 T4.2 T5.2 T6.2
## 1 1 32.0 8 32 4 16 6.0 30 8.0 28 4.0 40.0 10.0
## 3 2 6.0 6 7 1 10 4.0 13 2.5 11 2.5 15.0 3.0
## 5 3 5.0 5 4 2 6 1.0 5 2.0 5 2.0 16.0 3.0
## 7 4 7.5 5 7 2 7 1.5 7 6.0 4 4.0 5.5 1.5
## 9 5 8.0 4 16 3 80 1.0 25 6.0 8 2.0 30.0 1.0
## 11 6 7.0 6 7 2 40 3.0 20 8.0 10 1.0 40.0 1.0
```

Personally, I never understood the “melt” terminology, but basically we will take the wide form data and transform it to long form. I guess it kind of goes from being stretched to dripping down and melting together? Whatever helps you to think about it, here’s an example of “melting” down `tasks`.

```
tasks=melt(incon, measure.vars=paste0("T", 1:6),
variable.name="Task", value.name="Estimate")
```

Here you can see that `tasks` has been melted down so that each Task (T1, T2, T3..) is now a level of the factor `Task` in one column. Notice also that each `Subject` appears *twice* for every task, once per `Order`; this is how long form represents the *repeated measure* for each task.

`head(tasks)`

```
## Order Subject Task Estimate
## 1 1 1 T1 32
## 2 2 1 T1 30
## 3 1 2 T1 6
## 4 2 2 T1 13
## 5 1 3 T1 5
## 6 2 3 T1 5
```
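As a quick sanity check on the melt, here is a minimal sketch that rebuilds only the six rows printed by `head(incon)` above (not the full dataset) and confirms the two things just described: every wide row turns into six long rows, and every Subject/Task pair shows up twice. The names `incon_head` and `tasks_head` are made up for this illustration.

```r
library("reshape2")

# Only the six rows shown by head(incon) above, not the full dataset
incon_head <- data.frame(
  Order   = c(1, 2, 1, 2, 1, 2),
  Subject = c(1, 1, 2, 2, 3, 3),
  T1 = c(32, 30, 6, 13, 5, 5),
  T2 = c(8, 8, 6, 2.5, 5, 2),
  T3 = c(32, 28, 7, 11, 4, 5),
  T4 = c(4, 4, 1, 2.5, 2, 2),
  T5 = c(16, 40, 10, 15, 6, 16),
  T6 = c(6, 10, 4, 3, 1, 3)
)

# Same melt call as in the lesson, applied to the small sample
tasks_head <- melt(incon_head, measure.vars = paste0("T", 1:6),
                   variable.name = "Task", value.name = "Estimate")

# Each wide row becomes six long rows, one per task
nrow(tasks_head) == nrow(incon_head) * 6

# Each Subject/Task pair appears twice, once per Order
table(tasks_head$Subject, tasks_head$Task)
```

If either check fails on your own data, that is usually a sign that a row is missing or that an id column ended up in `measure.vars` by mistake.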

Yet another reshape will allow us to manipulate the data by whichever variable we choose. In this case, I would like to plot the *First Estimate* against the *Second Estimate*. To this day, it still helps me to draw out the exact kind of dataframe I need in order to make the comparisons or plots I have visualized in my mind. I realized that I had **Order** recorded, but those two groups of **Estimate**s were not easily comparable. I needed to manipulate my data further into a mix of long and wide form that is more amenable to `ggplot`.

```
# reshape data
tasks <-reshape(tasks, direction="wide", idvar=c("Subject", "Task"), timevar="Order")
head(tasks)
```

```
## Subject Task Estimate.1 Estimate.2
## 1 1 T1 32.0 30
## 3 2 T1 6.0 13
## 5 3 T1 5.0 5
## 7 4 T1 7.5 7
## 9 5 T1 8.0 25
## 11 6 T1 7.0 20
```
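Since `reshape2` is already loaded, the same long-to-wide step could probably also be written with `dcast`. This is only a sketch of an alternative, not what the lesson uses, and it runs on just the six long-form rows printed by the earlier `head(tasks)` output. The names `long_head` and `wide_head` are made up for the example.

```r
library("reshape2")

# The six long-form rows printed by head(tasks) after the melt, not the full dataset
long_head <- data.frame(
  Order    = c(1, 2, 1, 2, 1, 2),
  Subject  = c(1, 1, 2, 2, 3, 3),
  Task     = "T1",
  Estimate = c(32, 30, 6, 13, 5, 5)
)

# One row per Subject/Task, one column per Order
wide_head <- dcast(long_head, Subject + Task ~ Order, value.var = "Estimate")

# dcast names the new columns "1" and "2"; rename them to match the lesson
names(wide_head)[3:4] <- c("Estimate.1", "Estimate.2")
wide_head
```

The formula reads as "rows identified by Subject and Task, columns spread by Order", which some people find easier to remember than the `idvar`/`timevar` arguments of `reshape`.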

I’m truly not exaggerating when I say I once tried to avoid any and all reshaping/reformatting of my data. It was a difficult concept to wrap my head around, and I hoped I could just format my data in one way and stick to it. Turns out that it’s much easier to commit to learning these tools and apply them to make your life easier when answering statistical questions. I’m also not exaggerating when I say I’ve come across almost every configuration of data you can imagine: *long form, wide form, half-melted-who-knows-what form*. If I can swallow my pride and learn a `melt` command, I believe you can too.

Here we include a **reference line**. This is the *y==x* line, and it describes a hypothetical scenario where each participant rates each software task with perfect reliability. On the x axis we have their first estimate for each task, and on the y axis we see the second estimate. If they were perfectly reliable, every point would fall on the *y==x* line. The less reliable they are, the further the points will deviate from it. The following plot also uses a **log scale**, a common technique for visualizing data whose spread would otherwise be difficult to interpret. Try it without the log transformation and see which plot is more interpretable…

```
plt = ggplot(tasks,aes(Estimate.1,Estimate.2,color=Task))+
geom_point()+
theme_bw()+
ggtitle("Repeated estimates of effort for six software tasks")+
xlab("First Estimate")+
ylab("Second Estimate")+
geom_abline(slope=1,linetype="dashed")+
scale_x_continuous(trans='log10') +
scale_y_continuous(trans='log10')
plt
```

`#ggsave("reliability.png",plt)`
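To make the "try it without the log transformation" comparison easy, here is a rough sketch of a small helper that draws the same scatter plot with the log scale switched on or off. `plot_estimates` is a name I made up, and the demo data is only the six T1 rows printed by `head(tasks)` above; swap in the full `tasks` data frame to see the real difference.

```r
library("ggplot2")

# Hypothetical helper: same plot as the lesson, with an optional log scale
plot_estimates <- function(df, log_scale = TRUE) {
  p <- ggplot(df, aes(Estimate.1, Estimate.2, color = Task)) +
    geom_point() +
    theme_bw() +
    xlab("First Estimate") +
    ylab("Second Estimate") +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed")
  if (log_scale) {
    p <- p + scale_x_log10() + scale_y_log10()
  }
  p
}

# Only the six T1 rows shown by head(tasks) above, not the full dataset
t1_head <- data.frame(
  Subject    = 1:6,
  Task       = "T1",
  Estimate.1 = c(32, 6, 5, 7.5, 8, 7),
  Estimate.2 = c(30, 13, 5, 7, 25, 20)
)

p_log    <- plot_estimates(t1_head, log_scale = TRUE)
p_linear <- plot_estimates(t1_head, log_scale = FALSE)
p_linear
```

On a linear scale a handful of large estimates tends to stretch the axes, so most points end up crowded near the origin; the log scale spreads them out.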

We have explored the idea of looking at the strength of a linear relationship through [correlation](../glossary.html#correlation). It might seem intuitive that the correlation between the **First Estimate** and **Second Estimate** could tell us something about rater accuracy. Let’s look into this correlation. First of all, we test the normality of our responses. Neither distribution is normal, meaning that we need to employ a rank-based (nonparametric) method when looking at correlation. Here we use **Kendall’s Tau**. Our result is `.69`, which indicates a moderately strong correlation between the first and second estimate. So, they must be pretty close then, right? If they’re strongly correlated, it’s gotta be that they’re related enough to be accurate.

`shapiro.test(tasks$Estimate.1)`

```
##
## Shapiro-Wilk normality test
##
## data: tasks$Estimate.1
## W = 0.55771, p-value = 4.725e-10
```

`shapiro.test(tasks$Estimate.2)`

```
##
## Shapiro-Wilk normality test
##
## data: tasks$Estimate.2
## W = 0.77446, p-value = 1.32e-06
```

```
# Correlation is a misleading method of comparing accuracy
cor.test(tasks$Estimate.1, tasks$Estimate.2,method="kendall")
```

```
## Warning in cor.test.default(tasks$Estimate.1, tasks$Estimate.2, method =
## "kendall"): Cannot compute exact p-value with ties
```

```
##
## Kendall's rank correlation tau
##
## data: tasks$Estimate.1 and tasks$Estimate.2
## z = 6.1993, p-value = 5.671e-10
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.69277
```
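To see why the code comment calls correlation misleading here, consider a deliberately extreme toy case. The numbers below are invented for illustration and are not from the study: a rater whose second estimate is always exactly double the first.

```r
# Invented numbers, not study data: the second estimate is always double the first
first  <- c(2, 4, 8, 16, 32)
second <- 2 * first

# Rank correlation only looks at ordering, so it is perfect
tau <- cor(first, second, method = "kendall")
tau            # 1

# ...yet every second estimate is 100% larger than the first
mean_pct_gap <- mean(abs(second - first) / first) * 100
mean_pct_gap   # 100
```

Kendall’s tau rewards estimates that keep the same ordering, not estimates that agree in value, so a high tau on its own says little about how close the two estimates are.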

```
# Percentage difference?
# This is one of those awkward cases... TODO: switch brain
# library("boot")
```
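One possible way to follow up on the percentage-difference idea in the comment above is sketched below. The definition used here (absolute change relative to the first estimate) is my assumption, not the lesson’s or the paper’s method, and the data is only the six T1 rows printed by `head(tasks)` earlier. `PctDiff` is a made-up column name.

```r
# Only the six T1 rows shown by head(tasks) above, not the full dataset
t1_head <- data.frame(
  Subject    = 1:6,
  Estimate.1 = c(32, 6, 5, 7.5, 8, 7),
  Estimate.2 = c(30, 13, 5, 7, 25, 20)
)

# Assumed definition: absolute change relative to the first estimate, in percent
t1_head$PctDiff <- 100 * abs(t1_head$Estimate.2 - t1_head$Estimate.1) /
  t1_head$Estimate.1

t1_head

# The median is less sensitive than the mean to a few very large disagreements
median(t1_head$PctDiff)
```

In this small sample, Subject 3 gave the same estimate both times (0%), while Subject 5 went from 8 to 25 work-hours (212.5%). Dividing by the first estimate is only one choice; dividing by the mean of the two estimates would treat both occasions symmetrically.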