Terms defined: *t* distribution, *t* test, Bessel correction, Bonferroni correction, Central Limit Theorem, Z test, alternative hypothesis, confidence interval, cumulative distribution function, degrees of freedom, false negative, false positive, normal distribution, null hypothesis, probability density function, probability mass function, sample variance, standard normal distribution, standard uniform distribution, uniform distribution
 Problem: how can we do hypothesis testing
 More quickly (five hours of simulation to answer one question is a lot)
 And more confidently (is 5000 simulations enough? Would 100 work? Do we need a million?)
 Solution: use statistics
 Make some very general assumptions about our data
 Calculate an answer based on rules that hold for large datasets
What is the law of large numbers?
 Function describing probabilities of discrete events is called the probability mass function
 When describing continuous events, use:
 Cumulative distribution function \(F(x) = P(X \leq x)\)
 Probability density function \(f(x) = dF/dx\)
 So \(P(a \lt X \lt b) = \int_{a}^{b} f(x) dx\)
 Require \(\int_{-\infty}^{\infty} f(x) dx = 1\)
 I.e., something has to happen
 And notice \(P(x) = P(x \leq X \leq x) = \int_{x}^{x} f(x) dx = 0\)
 I.e., probability of any specific exact value is 0
 So always talk about ranges
 \[\mu = \int_{-\infty}^{\infty} x f(x) dx\]
 \[\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x) dx = \int_{-\infty}^{\infty} x^2 f(x) dx - \mu^2\]
 For example, uniform distribution has equal probability over a finite range \([a \ldots b]\)
 \[f(x) = \frac{1}{b - a}\]
 \[P(a \leq t \leq X \leq t+h \leq b) = \frac{h}{b - a}\]
 I.e., probability is proportional to fraction of range
 Standard uniform distribution has range \([0 \ldots 1]\)
 \[\mu = \frac{1}{2}\]
 \[\sigma^2 = \int_{0}^1 x^2 dx - (\frac{1}{2})^2 = \frac{1}{12}\]
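These two formulas are easy to check numerically. A minimal sketch (using NumPy, which the lecture's other examples assume is available): draw a large sample from the standard uniform distribution and compare the sample mean and variance to \(\frac{1}{2}\) and \(\frac{1}{12}\).

```python
import numpy as np

# Draw many values from the standard uniform distribution and check that
# the sample mean and variance approach the theoretical 1/2 and 1/12.
rng = np.random.default_rng(12345)
samples = rng.uniform(0.0, 1.0, size=1_000_000)

print(f"sample mean:     {samples.mean():.4f} (expected {1/2:.4f})")
print(f"sample variance: {samples.var():.4f} (expected {1/12:.4f})")
```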
What is the normal distribution and why do we care?
 In its full glory, the normal distribution has probability density function \[f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x - \mu)^2}{2 \sigma^2}}\]
 There is no closed formula for the integral \(F(x)\)
 But as the notation suggests, mean is \(\mu\) and variance is \(\sigma^2\)
 The standard normal distribution \(Z\) has mean \(\mu = 0\) and standard deviation \(\sigma = 1\)
 Reconstruct arbitrary distribution \(X = \mu + \sigma Z\)
 Central Limit Theorem
 Let \(S_n = X_1 + X_2 + \ldots + X_n\) be the sum of \(n\) independent random variables, all with mean \(\mu\) and standard deviation \(\sigma\)
 Can be drawn from (almost) any distribution
 As \(n \rightarrow \infty\), \(\frac{S_n - n\mu}{\sigma \sqrt{n}}\) converges on a standard normal random variable
 I.e., the distribution of our estimates of the mean is normal regardless of the underlying distribution
 Rate of convergence is \(\frac{1}{\sqrt{n}}\)
 I.e., to double the precision, quadruple the sample size

Heuristic: for \(n \gt 30\), a normal distribution is an accurate approximation to \(S_n\)
 Sample mean \(\bar{X}\) estimates the population mean
 Variance of \(\bar{X}\) is \(\frac{\sigma^2}{n}\)
 Distribution of sample means is normal, i.e. \(\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}\) is standard normal as \(n \rightarrow \infty\)
 Regardless of the underlying distribution of \(X\)
 FIXME: add program to sample various uniform distributions and see how the sampling distribution of the mean converges on a normal distribution
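A minimal sketch of such a program (the sample sizes and the one-sigma check are our choices, not from the lecture): draw repeated samples from a decidedly non-normal uniform distribution, standardize the sample means, and check how closely they behave like a standard normal variable, for which about 68.3% of values lie within one standard deviation of the mean.

```python
import numpy as np

# Draw repeated samples from U(0, 1), standardize the sample means, and
# watch them converge on a standard normal distribution as n grows.
rng = np.random.default_rng(12345)
mu, sigma = 0.5, np.sqrt(1 / 12)          # mean and std dev of U(0, 1)

for n in (2, 10, 30, 100):
    # 10,000 samples of size n, reduced to their means
    means = rng.uniform(0.0, 1.0, size=(10_000, n)).mean(axis=1)
    z = (means - mu) / (sigma / np.sqrt(n))
    # for a standard normal, about 68.3% of values lie within one sigma
    within_one = np.mean(np.abs(z) < 1.0)
    print(f"n = {n:3d}: fraction within one sigma = {within_one:.3f}")
```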
How can we use this to quantify confidence?
 A confidence interval is an interval \([a \ldots b]\) that has some probability \(p\) of containing the actual value of a statistic
 E.g., “There is a 90% probability that the actual mean of this population lies between 2.5 and 3.5”
 Larger intervals have a higher probability but are less precise
 If there are more than 30 samples or the standard deviation \(\sigma\) is known, use a Z-test:
 Choose a confidence level \(C\) (typically 95%)
 Find the value \(z^{\star}\) such that \(P(x \geq z^{\star}) = \frac{1 - C}{2}\) in a standard normal distribution
 Divide by 2 because the normal curve has two symmetric tails
 Calculate the sample mean \(\bar{X}\)
 Interval is \(\bar{X} \pm z^{\star}\frac{\sigma}{\sqrt{n}}\)
 FIXME: example
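A sketch of the worked example the FIXME calls for (the data and the known \(\sigma = 2.0\) are made up for illustration), following the steps above:

```python
import numpy as np
from scipy import stats

# Hypothetical data: 50 measurements from a population whose standard
# deviation sigma = 2.0 is assumed known.
rng = np.random.default_rng(12345)
sigma = 2.0
data = rng.normal(loc=10.0, scale=sigma, size=50)

C = 0.95
z_star = stats.norm.ppf(1 - (1 - C) / 2)     # about 1.96 for C = 0.95
x_bar = data.mean()
margin = z_star * sigma / np.sqrt(len(data))
print(f"95% CI: {x_bar - margin:.2f} .. {x_bar + margin:.2f}")
```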
Student’s t-distribution
 Usually don’t know the distribution’s variance
 The sample variance is \(s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2\)
 Using \(n1\) instead of \(n\) ensures that \(s^2\) is unbiased (the Bessel correction)
 See proof
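Instead of (or alongside) the proof, the bias is easy to see by simulation. A sketch (the population and sample size are our choices): average many small-sample variances computed both ways and compare to the true variance.

```python
import numpy as np

# Compute the variance of many small samples from U(0, 1) two ways:
# dividing by n underestimates the true variance 1/12, while dividing
# by n-1 (the Bessel correction) is unbiased.
rng = np.random.default_rng(12345)
true_var = 1 / 12                                # variance of U(0, 1)
samples = rng.uniform(0.0, 1.0, size=(100_000, 5))

biased = samples.var(axis=1, ddof=0).mean()      # divide by n
unbiased = samples.var(axis=1, ddof=1).mean()    # divide by n-1

print(f"true variance: {true_var:.4f}")
print(f"divide by n:   {biased:.4f}")
print(f"divide by n-1: {unbiased:.4f}")
```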
 Student’s t-distribution is used to estimate the mean of a normally distributed population when the sample size is small (e.g., less than 30) and the variance is unknown
 The name comes from “Student”, the pseudonym used by the mathematician who first used it this way
 The variable \(\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}\) has a standard normal distribution
 However, the variable \(\frac{\bar{X} - \mu}{s / \sqrt{n}}\) has a t-distribution with \(n-1\) degrees of freedom
 \(n-1\) because there’s a step in the calculation that normalizes the \(n\) values to unit length
 Once \(n1\) are known, the value of the \(n^{th}\) is fixed
 The exact formula for the t-distribution is a little bit scary.
 The PDF’s shape resembles that of a normal distribution with mean 0 and variance 1, but is slightly lower and wider.
 The two become closer as the degrees of freedom \(\nu\) gets larger.
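The narrowing gap is easy to see numerically. A small sketch using SciPy: the t-distribution's peak at 0 rises toward the standard normal's peak \(\frac{1}{\sqrt{2\pi}} \approx 0.399\) as \(\nu\) grows.

```python
from scipy import stats

# The t-distribution's PDF is lower at the center (and heavier in the
# tails) than the standard normal, and approaches it as the degrees of
# freedom grow.
for nu in (1, 5, 30):
    print(f"nu = {nu:2d}: peak = {stats.t.pdf(0.0, df=nu):.4f}")
print(f"normal:  peak = {stats.norm.pdf(0.0):.4f}")
```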
 A t-test follows the same steps as a Z-test:
 Choose a confidence level \(C\)
 Find a value \(t^{\star}\) such that \(P(x \geq t^{\star}) = \frac{1 - C}{2}\) in a Student’s t-distribution with \(n-1\) degrees of freedom
 Estimate the standard deviation \(s\)
 Interval is \(\bar{X} \pm t^{\star}\frac{s}{\sqrt{n}}\)
 FIXME: example
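A sketch of the worked example the FIXME calls for (the eight data values are made up for illustration), following the steps above with the sample standard deviation in place of \(\sigma\):

```python
import numpy as np
from scipy import stats

# Hypothetical small sample: too few values for a Z-test, and the
# population variance is unknown, so use Student's t-distribution.
data = np.array([5.2, 4.8, 6.1, 5.5, 4.9, 5.7, 5.3, 5.0])
n = len(data)

C = 0.95
t_star = stats.t.ppf(1 - (1 - C) / 2, df=n - 1)
x_bar = data.mean()
s = data.std(ddof=1)                   # sample std dev (divide by n-1)
margin = t_star * s / np.sqrt(n)
print(f"95% CI: {x_bar - margin:.2f} .. {x_bar + margin:.2f}")
```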
How can we compare the means of two datasets?
 What is the probability of seeing this difference between two datasets?
 The null hypothesis \(H_0\) is that the samples come from a single population and the observed difference is purely due to chance
 The alternative hypothesis \(H_A\) is that the samples come from two different populations
 False positive: decide that the difference is not purely random when it is
 False negative: decide the difference is purely random when it isn’t
 Also called Type I and Type II errors (but see https://twitter.com/neilccbrown/status/1202595479890124801)
 Adapt the simulation program (keep a subset of the command-line parameters)
from scipy.stats import ttest_ind

def main():
    # ...parse arguments...
    # ...read data and calculate actual means and difference...
    # test and report
    result = ttest_ind(data_left, data_right)
    print(result)
 Run
python bin/ttest.py --left ../hypothesis-testing/data/javascript-counts.csv --right ../hypothesis-testing/data/python-counts.csv --low 1 --high 200
Ttest_indResult(statistic=269.67014904687954, pvalue=0.0)
 The \(p\) value is so small that the computer can’t distinguish it from zero, which means the chance of getting this difference by randomly splitting a single population is vanishingly small
 Look at the hours worked per day in 2019
 Data is (date, hours) pairs taken from a spreadsheet
 There are a lot of spreadsheets in data science
 Split into weekday and weekend subsets and visualize
 Note that hours are never actually negative, but the curve is drawn that way
 They certainly seem different
 And a ttest confirms it
 The odds are large enough this time to be printable…
python bin/weekends.py --data data/programmer-hours.csv
weekday mean 6.804375000071998
weekend mean 3.232482993312492
Ttest_indResult(statistic=12.815512046971827, pvalue=6.936182610195961e-31)
How can this approach be misleading?
 FIXME: Bonferroni correction
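A minimal sketch of the Bonferroni correction the FIXME refers to (the p-values and significance level are made up for illustration): when running \(m\) tests at once, divide the significance threshold by \(m\) so the chance of at least one false positive stays below the original level.

```python
# Running m tests at significance level alpha gives roughly m * alpha
# chances of a false positive, so test each one at alpha / m instead.
p_values = [0.04, 0.01, 0.03, 0.20]
alpha = 0.05
threshold = alpha / len(p_values)      # 0.0125 instead of 0.05

for p in p_values:
    verdict = "significant" if p < threshold else "not significant"
    print(f"p = {p:.2f}: {verdict}")
```

Note that p = 0.04 would pass an uncorrected test at the 0.05 level but fails the corrected one.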
 The Central Limit Theorem states that the distribution of the sample mean approaches a normal distribution as the sample size grows.
 Use a Z-test or t-test to determine whether two populations are the same or different.