# Hypothesis Testing

## Checking hypotheses via simulation

Terms defined: *p* hacking, *p* value, YAML, alternative hypothesis, automatic variable, character encoding, convergence, de-anonymization, magic number, minification, null hypothesis, one-sided test, pattern rule, personal identifying information, phony target, pre-registration, random number generator, seed, two-sided test

• Problem: how to estimate development effort or likely fault rates
• Intuitively seems like this should depend on some measurable properties of the software
• Lines of code (LoC) is easy to count but seems like a pretty coarse measure
• How can we count other things?
• How can we tell whether a hypothesis is true or not?
• Steps are:
• Collect data
• Formulate testable hypotheses
• Calculate likelihood of getting the answer we observe
• Decide whether that answer is so unlikely that we probably didn’t get it by chance

## How can we get data on file sizes?

• Start by asking whether Python and JavaScript files are different, i.e., how our measurements depend on source language
• Parsing arguments and writing CSV is normal by now
• Use os.walk to recurse through directories
• After the first failure because of bad character encoding, add error handling
• Then capture stderr to a log file for later inspection
• Then realize that storing (filename, length, count):
1. Is too big for GitHub unless we compress it
2. Contains a lot of redundancy
• Instead of compressing the file, create two files:
• One stores (filename, unique ID)
• The other stores (unique ID, length, count)
• So the largest field (filename) only appears once
• And gives us a natural way to record files we couldn’t read because of character encoding issues (NA for file ID)
• This technique is also useful when storing personal identifying information (PID)
• Shareable data is keyed
• Personal data is not shared
• Beware of de-anonymization
• Main driver is straightforward
def main():
'''
Main driver.
'''
args = parse_args()
counts = lengths(args)
report_filenames(args.output, counts)
report_counts(args.output, counts)

• Parse arguments
• Root directory
• Filename extension
• Stem of output files (because there are two)
• E.g., processing data/python will produce data/python-filenames.csv and data/python-counts.csv
def parse_args():
'''
Handle command-line arguments.
'''
parser = argparse.ArgumentParser()
parser.add_argument('--output', type=str, help='stem of output files')
return parser.parse_args()

• Find files and record lengths
• A dictionary mapping filenames to Counter objects
• Or to None if the file couldn’t be read
def lengths(args):
'''
Find files and count line lengths.
'''
counts = {}
for (curr_dir, sub_dirs, files) in os.walk(args.root):
for filename in [x for x in files if x.endswith(args.ext)]:
path = os.path.join(curr_dir, filename)
try:
counts[path] = Counter()
counts[path][len(x)] += 1
except Exception as e:
file=sys.stderr)
counts[path] = None
return counts

• Write the filenames or NA
• Uses X if test else Y, which is frequently confusing
def report_filenames(output_stem, counts):
'''
Report filename-to-ID as CSV with 'NA' for errored files.
'''
with open(f'{output_stem}-filenames.csv', 'w') as writer:
writer = csv.writer(writer, lineterminator='\n')
writer.writerow(['Filename', 'FileID'])
for (file_id, filename) in enumerate(sorted(counts.keys())):
writer.writerow([filename, file_id if counts[filename] else 'NA'])

• And the counts where available
def report_counts(output_stem, counts):
'''
Report file ID-to-count as CSV for non-errored files.
'''
with open(f'{output_stem}-counts.csv', 'w') as writer:
writer = csv.writer(writer, lineterminator='\n')
writer.writerow(['FileID', 'Length', 'Count'])
for (file_id, filename) in enumerate(sorted(counts.keys())):
if counts[filename]:
for (length, freq) in counts[filename].most_common():
writer.writerow([file_id, length, freq])

• python bin/file-size.py --root /anaconda3 --ext .py --output data/python
• Get some errors
• Store in log/file-size-python-errors.txt for later reference
Failed to read /anaconda3/pkgs/xlwt-1.3.0-py37_0/lib/python3.7/site-packages/xlwt/BIFFRecords.py: 'utf-8' codec can't decode byte 0x93 in position 68384: invalid start byte
Failed to read /anaconda3/pkgs/xlwt-1.3.0-py37_0/lib/python3.7/site-packages/xlwt/UnicodeUtils.py: 'utf-8' codec can't decode byte 0xb7 in position 1950: invalid start byte
Failed to read /anaconda3/pkgs/pylint-2.3.1-py37_0/lib/python3.7/site-packages/pylint/test/functional/implicit_str_concat_in_sequence_latin1.py: 'utf-8' codec can't decode byte 0xe9 in position 96: invalid continuation byte
Failed to read /anaconda3/pkgs/joblib-0.13.2-py37_0/lib/python3.7/site-packages/joblib/test/test_func_inspect_special_encoding.py: 'utf-8' codec can't decode byte 0xa4 in position 64: invalid start byte
...

• Also run python bin/file-sizes.py ~/blocks/node_modules
• blocks is a JavaScript project we just have to have lying around
• node_modules is where the project’s JavaScript packages are stored
• Create scatter plots of to show frequency of line lengths
• Add a pattern rule to Makefile
• Assign the list of CSV files to a variable
• Transform it into a list of figures
• Create a phony target that depends on all of those figure files
• Use automatic variables in the pattern rule
DATA=$(wildcard data/*.csv) FIGURES=$(patsubst data/%.csv,%.svg,${DATA}) # ... figures :${FIGURES}

%-counts-all.svg : data/%-counts.csv
@mkdir -p figures
python bin/length-frequency-plot.py --data $< --fig$@ --logx

• Create log-log histograms for frequency of line lengths in JavaScript and Python
• Exclude lines of length 1
• These only contain the newline character, i.e., they’re empty
• Exclude lines longer than 200 characters
• Possibly a character encoding issue
• Or minification of JavaScript
• These curves look different, but are they?

## How can we tell if two populations are different?

• Null hypothesis: there is no significant difference between these curves
• Alternative hypothesis: there is a significant difference between these curves
• Reject the null hypothesis if the probability of getting this data if the null hypothesis is true is less than a chosen probability called a $$p$$ value
• Typically use $$p = 0.05$$, i.e., less than 1 chance in 20 of getting the data if there isn’t actually a difference
• The lower the $$p$$ value, the higher our confidence

### Kick the ball then move goal

$$p$$ values can be mis-used in several ways. The most obvious is to choose a $$p$$ value after the fact in order to get a significant result: if you ever see reports that mix several different $$p$$ values or use odd numbers like 0.073, this is probably what’s going on.

The second form of abuse, called p hacking, is to re-analyze the data over and over until a “significant” result emerges. Consider: if the odds of getting a false positive for one analysis are 0.05, then the odds of getting a true negative are 0.95. The odds of getting two true negatives in a row are therefore $$0.95^2$$, which is 0.9025. If we keep going, the odds of none of our analyses meeting this threshold are 50/50 when we do 14 analyses. One sign that people are p hacking is that they find niche results like, “This treatment was effective for left-handed non-smokers between the ages of 45 and 55.” The best way to safeguard against p hacking is to pre-register studies, i.e., to declare before collecting data what analyses are going to be done and how.

• But how do we measure the likelihood?
• One approach is to use simulation
• Combine data sets
• Split into two pieces at random (where each piece is the same size as one of the originals)
• See how far apart the means are
• Repeat a few thousand times
• See how often we get the difference we actually see
• Decide if we believe the difference is likely
#!/usr/bin/env python

import sys
import argparse
import numpy as np
import pandas as pd

def main():
'''
Sample two datasets to calculate odds of means being distant.
'''
# ...parse arguments...
# ...read data and calculate actual means and difference...
# ...repeatedly sample and calculate difference...
# ...report...

• Arguments are pretty simple
• Two files
• Cut the low and high values as we did for plotting
• Number of trials
• Seed for random number generator so we can reproduce results
    # parse arguments
parser = argparse.ArgumentParser()
parser.add_argument('--trials', type=int, help='number of trials (>0)')
args = parser.parse_args()
np.random.seed(args.seed)

• Read the data and calculate actual statistics
    # read data and calculate actual means and difference
mean_left = data_left.mean()
mean_right = data_right.mean()
actual_diff = mean_left - mean_right

• And:
def read_data(filename, low, high):
'''
Read data and remove lines of length 1 and very long lines.
'''
if (high != None):
data = data[data.Length <= high]
if (low != None):
data = data[data.Length >= low]
return data.Length.repeat(repeats=data.Count)

• For each simulation:
• Shuffle the index
• Cut the data in two
• Calculate means for the sections
• Record difference in the means
• Afterward:
• Calculate the fraction of differences that are less than or equal to the actual difference
• For comparison, also report the average difference in the means
    # repeatedly sample and calculate difference
combined = data_left.append(data_right).reset_index(drop=True)
split = len(data_left)
sample_diffs = []
for t in range(args.trials):
shuffle = np.random.permutation(len(combined))
sample_left = combined[shuffle[:split]]
sample_right = combined[shuffle[split:]]
sample_diffs.append(sample_left.mean() - sample_right.mean())

sample_diff_mean = sum(sample_diffs) / args.trials
success_frac = sum([x <= actual_diff for x in sample_diffs]) / args.trials

• Write results as YAML
• A long two-column CSV with (title, value) pairs would work
• As would a wide CSV with one row of titles and one row of values
• Either way, must write parameter values to ensure reproducibility
    # report
print(f'parameters:')
print(f'- left: "{args.left}"')
print(f'- right: "{args.right}"')
print(f'- low: {"null" if (args.low is None) else args.low}')
print(f'- high: {"null" if (args.high is None) else args.high}')
print(f'- trials: {args.trials}')
print(f'- seed: {args.seed}')
print(f'results:')
print(f'- mean_left: {mean_left}')
print(f'- mean_right: {mean_right}')
print(f'- actual_diff: {actual_diff}')
print(f'- sample_diff: {sample_diff_mean}')
print(f'- successes: {success_frac}')


## How can we test our approach?

• We’re showing completed working code, but we didn’t get it right the first time
• Create two “populations” and see if we get a plausible answer
• [1, 1] vs. [10, 10]
python bin/simulate.py --left test/test-a.csv --right test/test-b.csv --trials 10000 --seed 1234567

parameters:
- left: "test/test-a.csv"
- right: "test/test-b.csv"
- low: null
- high: null
- trials: 10000
- seed: 1234567
results:
- mean_left: 1.0
- mean_right: 10.0
- actual_diff: -9.0
- sample_diff: -0.0126
- successes: 0.1661

• Is that right?
• 8/24 permutations of the four values are a “perfect split”
• But only half of those have the low values in the lower half
• So we expect 4/24 successes, which is 0.1666… and we actually got 0.1661
• Our simulation reports that the expected difference is 0.0 and the simulated difference is -0.0126
• Seems to be doing what we want
• We should test with unequal counts and numbers of samples
• Left as an exercise
• And note that this doesn’t test if the means are different, only if the first is less than the second
• A one-sided test
• If this says “yes” and we reverse the order of the datasets, the answer will be “no”
• Convert this into a two-sided test in the exercises
python bin/simulate.py --left data/javascript-counts.csv --right data/python-counts.csv --trials 5000 --seed 57622

parameters:
- left: "data/javascript-counts.csv"
- right: "data/python-counts.csv"
- low: null
- high: null
- trials: 5000
- seed: 57622
results:
- mean_left: 38.168300030969725
- mean_right: 36.41329390723824
- actual_diff: 1.7550061237314836
- sample_diff: 0.00398256511711528
- successes: 1.0

• None of the 5000 random trials had a difference as big as what we actually see
• So it seems pretty certain that the difference is more than just chance
• But it takes about two hours to run 5000 simulations on our real data
• We should add a --verbose flag to report progress
• Or print the stats every N iterations so we can check convergence interactively and decide when to cut things off
• And we should find a better way to answer the question
• Take less time
• Tell us how confident we can be in the answer
• Because right now 5000 is a magic number
• Store large datasets by compressing them or by replacing redundant information with keys into a second table.
• Reject the null hypothesis that two samples come from the same population if simulation shows it is highly unlikely.
• Specify how unlikely ‘unlikely’ is by choosing a confidence interval based on a significance threshold.
• Reject the null hypothesis that two samples come from the same population if a t-test tells us it is highly unlikely.

## Exercises

### Recording Errors

Modify bin/file-size.py so that it creates a third file -error.txt with the error messages.

### A Two-Sided Test

Modify the simulation program so that it does a two-sided test.

### Testing the Simulation Program

Test the simulation program with more interesting datasets (different counts, different numbers of observations, etc.).

### Reporting Periodically

Modfiy the simulation program so that it can print results to a file periodically.

### Keeping All the Data

How will the answer differ if we don’t drop lines of length 1 and very long lines? Why?