# The Law of Large Numbers

## Checking hypotheses using statistics

Terms defined: *t* distribution, *t* test, Bessel correction, Bonferroni correction, Central Limit Theorem, Z test, alternative hypothesis, confidence interval, cumulative distribution function, degrees of freedom, false negative, false positive, normal distribution, null hypothesis, probability density function, probability mass function, sample variance, standard normal distribution, standard uniform distribution, uniform distribution

• Problem: how can we do hypothesis testing
• More quickly (five hours of simulation to answer one question is a lot)
• And more confidently (is 5000 simulations enough? Would 100 work? Do we need a million?)
• Solution: use statistics
• Make some very general assumptions about our data
• Calculate an answer based on rules that hold for large datasets

## What is the law of large numbers?

• Function describing probabilities of discrete events is called the probability mass function
• When describing continuous events, use:
• Cumulative distribution function $$F(x) = P(X \leq x)$$
• Probability density function $$f(x) = dF/dx$$
• So $$P(a \lt X \lt B) = \int_{a}^{b} f(x) dx$$
• Require $$\int_{-\infty}^{\infty} f(x) dx = 1$$
• I.e., something has to happen
• And notice $$P(x) = P(x \leq X \leq x) = \int_{x}^{x} f(x) dx = 0$$
• I.e., probability of any specific exact value is 0
• So always talk about ranges
• Mean is $$\mu = \int_{-\infty}^{\infty} x f(x) dx$$
• Variance is $$\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x) dx = \int_{-\infty}^{\infty} x^2 f(x) dx - \mu^2$$
• Normally use standard deviation $$\sigma$$ because it has the same units as the data
• Saves us from trying to figure out what a squared price is…
• Example: uniform distribution has equal probability over a finite range $$[a \ldots b]$$
• $f(x) = \frac{1}{b - a}$
• $P(a \leq t \leq X \leq t+h \leq b) = \frac{h}{b - a}$
• I.e., probability is proportional to fraction of range
• Standard uniform distribution has range $$[0 \ldots 1]$$
• $\mu = \frac{1}{2}$
• $\sigma^2 = \int_{0}^1 x^2 dx - (\frac{1}{2})^2 = \frac{1}{12}$

## What is the normal distribution and why do we care?

• In its full glory, normal distribution has
\begin{align*} f(x) & = & \frac{1}{\sigma \sqrt{2 \pi}} e^{- \frac{(x - \mu)^2}{2 \sigma^2}} \end{align*}
• There is no closed formula for the integral $$F(x)$$
• But as the notation suggests, mean is $$\mu$$ and variance is $$\sigma^2$$
• The standard normal distribution $$Z$$ has mean $$\mu = 0$$ and standard deviation $$\sigma = 1$$
• Easy to move back and forth between this and arbitrary distribution $$X = \mu + \sigma Z$$
• Central Limit Theorem
• Let $$S_n = X_1 + X_2 + \ldots + X_n$$ be the sum of $$n$$ independent random variables, all with mean $$\mu$$ and standard deviation $$\sigma$$
• Can be drawn from (almost) any distribution
• As $$n \rightarrow \infty$$, $$\frac{S_n - n\mu}{\sigma \sqrt{n}}$$ converges on a standard normal random variable
• I.e., the distribution of our estimates of the mean is normal regardless of the underlying distribution
• Rate of convergence is $$\frac{1}{\sqrt{n}}$$
• I.e., to double the precision, quadruple the sample size
• Heuristic: for $$n \gt 30$$, $$S_n$$ is distributed normally

• Sample mean $$\bar{X}$$ estimates the population mean
• Variance of $$\bar{X}$$ is $$\frac{\sigma^2}{n}$$
• Distribution of sample means is normal, i.e. $$\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$ is standard normal as $$n \rightarrow \infty$$
• Regardless of the underlying distribution of $$X$$
• FIXME: add program to sample various uniform distributions and see how the sampling converges on a uniform distribution

## How can we use this to quantify confidence?

• A confidence interval is an interval $$[a \ldots b]$$ that has some probability $$p$$ of containing the actual value of a statistic
• E.g., “There is a 90% probability that the actual mean of this population lies between 2.5 and 3.5”
• Larger intervals are less precise but have a higher probability
• If there are more than 30 samples or the standard deviation $$\sigma$$ is known, use a Z-test:
1. Choose a confidence level $$C$$ (typically 95%)
2. Find the value $$z^{\star}$$ such that $$P(x \leq z^{\star}) \leq \frac{1 - C}{2}$$ in a standard normal distribution
• Divide by 2 because the normal curve has two symmetric tails
3. Calculate the sample mean $$\bar{X}$$
4. Interval is $$\bar{X} \pm z^{\star}\frac{\sigma}{\sqrt{n}}$$
• FIXME: example

## Student’s t-distribution

• Usually don’t know the distribution’s variance
• The sample variance is:
\begin{align*} s^2 & = & \frac{1}{n-1} \sum_{i=1}^{n}(X_i - \bar{X})^2 \\ & = & \frac{\sum X_i^2 - n\bar{X}^2}{n - 1} \end{align*}
• Using $$n-1$$ instead of $$n$$ ensures that $$s^2$$ is unbiased (the Bessel correction)
• Student’s t-distribution is used to estimate the mean of a normally distributed population when the sample size is small (e.g., less 30) and the variance is unknown
• Named comes from a pseudonym used by the mathematician who first used it this way
• The variable $$\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$ has a standard normal distribution
• However, the variable $$\frac{\bar{X} - \mu}{s / \sqrt{n}}$$ has a t-distribution with $$n-1$$ degrees of freedom
• Called degrees of freedom because once $$n-1$$ values are known, the value of the $$n^{th}$$ is fixed
• $$n-1$$ because there’s a step in the calculation that normalizes the $$n$$ values to unit length
• The exact formula for the t-distribution is a little bit scary.
• The PDF’s shape resembles that of a normal distribution with mean 0 and variance 1, but is slightly lower and wider.
• The two become closer as the degrees of freedom $$\nu$$ gets larger.
• A t-test follows the same steps as a Z-test:
1. Choose a confidence level $$C$$
2. Find a value $$t^{\star}$$ such that $$P(x \leq t^{\star}) \leq \frac{1 - C}{2}$$ in a Student’s t-distribution with $$n-1$$ degrees of freedom
3. Estimate the standard deviation $$s$$
4. Interval is $$\bar{X} \pm t^{\star}\frac{s}{\sqrt{n}}$$
• FIXME: example

## How can we compare the means of two datasets?

• What is the probability of seeing this difference between two datasets?
• The null hypothesis $$H_0$$ is that the samples come from a single population and the observed difference is purely due to chance
• The alternative hypothesis $$H_A$$ is that the samples come from two difference populations
• False positive: decide that the difference is not purely random when it is
• False negative: decide the difference is purely random when it isn’t
• Also called Type I and Type II errors (but see https://twitter.com/neilccbrown/status/1202595479890124801)
• Adapt the simulation program (keep a subset of the command-line parameters)
from scipy.stats import ttest_ind

def main():
# ...parse arguments...

# ...read data and calculate actual means and difference...

# test and report
result = ttest_ind(data_left, data_right)

• Run
python bin/t-test.py --left ../hypothesis-testing/data/javascript-counts.csv --right ../hypothesis-testing/data/python-counts.csv --low 1 --high 200

Ttest_indResult(statistic=-269.67014904687954, pvalue=0.0)

• The $$p$$ value is so small that the computer can’t distinguish it from zero
• Which means the chances of getting this difference by randomly splitting a single population is vanishingly small

• Look at the hours worked per day in 2019
• Data is (date, hours) pairs taken from a spreadsheet
• There are a lot of spreadsheets in data science
• Split into weekday and weekend subsets and visualize
• Note that hours are never actually negative, but the curve is drawn that way
• They certainly seem different
• And a t-test confirms it
• The odds are large enough this time to be printable…
python bin/weekends.py --data data/programmer-hours.csv

weekday mean 6.804375000071998
weekend mean 3.232482993312492
Ttest_indResult(statistic=12.815512046971827, pvalue=6.936182610195961e-31)


### Higher standards

• Recall discussion of $$p$$ hacking from
• If we analyze the data enough different ways, one of them will be “significant”
• Use the Bonferroni correction
• The more tests we do, the more stringest our significance criteria must be
• The Central Limit Theorem states that the distribution of the mean of a sample has a normal distribution.
• Use a Z-test or t-test to determine whether two populations are the same or different.