A little theory can go a long way.
Basic probability
 Assume a set of events \(E\) and a probability function \(P\) with \(0 \leq P(x) \leq 1\)
 If \(A\) and \(B\) are disjoint sets of events from \(E\) then:
 \(P(E) = 1\) (i.e., something must happen)
 \(P(A \cup B) = P(A) + P(B)\) (i.e., probability of the union of disjoint sets is the sum of their probabilities)
 A few consequences:
 If \(A\) and \(B\) are not disjoint, then \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
 \[P(\neg A) = 1 - P(A)\]
 If \(A\) and \(B\) are independent, \(P(A \cap B) = P(A)P(B)\)
 The probability of \(n\) equally likely disjoint events \(x_i\) is \(nP(x_i)\)
Combinatorics
 There are \(n^{k}\) ways to sample \(k\) distinguishable items from \(n\) with replacement
 There are \(P(n, k) = \frac{n!}{(n - k)!}\) ways to sample \(k\) distinguishable items from \(n\) without replacement
 If the items are indistinguishable, the number of selections is
\(C(n, k) = \binom{n}{k} = \frac{P(n, k)}{k!} = \frac{n!}{k!(n - k)!}\)
 Divide by \(k!\) because the indistinguishable items can be rearranged that many ways
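These counting formulas can be checked directly with Python's `math` module; the values of \(n\) and \(k\) below are illustrative:

```python
import math

# Sampling k = 3 items from n = 5 (illustrative numbers).
n, k = 5, 3

with_replacement = n ** k              # n^k ordered samples with replacement
without_replacement = math.perm(n, k)  # P(n, k) = n! / (n - k)!
indistinguishable = math.comb(n, k)    # C(n, k) = P(n, k) / k!

print(with_replacement)     # 125
print(without_replacement)  # 60
print(indistinguishable)    # 10
```

Note that `math.comb(n, k)` equals `math.perm(n, k)` divided by \(k!\), exactly as in the formula above.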
Conditional probability
 The conditional probability \(P(A \mid B)\) of \(A\) given \(B\) is the probability of \(A\) occurring given that \(B\) is known to occur
 \[P(A \mid B) = \frac{P(A \cap B)}{P(B)}\]
Bayes Rule
 Since \(P(A \cap B) = P(A \mid B)P(B) = P(B \mid A)P(A)\), we have Bayes’ Rule \(P(B \mid A) = \frac{P(A \mid B)P(B)}{P(A)}\)
 Special case: since \(P(A) = P(A \mid B)P(B) + P(A \mid \neg B)P(\neg B)\) (the law of total probability), \(P(B \mid A) = \frac{P(A \mid B)P(B)}{P(A \mid B)P(B) + P(A \mid \neg B)P(\neg B)}\)
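A worked example of the special case in Python, using entirely hypothetical numbers for a diagnostic test (the rates below are made up for illustration):

```python
# Bayes' Rule with the expanded (total-probability) denominator.
# Hypothetical test: P(pos | disease) = 0.99, P(pos | no disease) = 0.05,
# base rate P(disease) = 0.01.
p_b = 0.01              # P(B): prior probability of the condition
p_a_given_b = 0.99      # P(A | B): true positive rate
p_a_given_not_b = 0.05  # P(A | not B): false positive rate

p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)  # total probability P(A)
p_b_given_a = p_a_given_b * p_b / p_a                  # Bayes' Rule

print(round(p_b_given_a, 3))  # 0.167
```

Even with a very accurate test, the posterior is only about 17% because the condition is rare.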
Mean and variance
 The mean \(\mu\) or expected value \(E(X)\) is the weighted sum of possible outcomes \(\sum_{x} xP(x)\)
 Implies \(E(aX + bY + c) = aE(X) + bE(Y) + c\)
 The variance \(\sigma^2\) is the expected value of the square of the difference between values and the mean
\(\sum_{x} (x - \mu)^2P(x)\)
 Squaring guarantees that values are positive
 And gives extra weight to outliers
 But the units are weird: “bugs squared”
 \(\sigma^{2}_{A + B} = \sigma^2_A + \sigma^2_B\) if \(A\) and \(B\) are independent
 The standard deviation \(\sigma\) is the square root of the variance
 Same units as original variable
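A small check of these definitions in Python, using a hypothetical three-outcome distribution:

```python
import math

# Hypothetical discrete distribution: (outcome, probability) pairs.
outcomes = [(0, 0.2), (1, 0.5), (2, 0.3)]

mu = sum(x * p for x, p in outcomes)               # mean: sum of x P(x)
var = sum((x - mu) ** 2 * p for x, p in outcomes)  # variance: E((x - mu)^2)
sigma = math.sqrt(var)                             # std. dev.: same units as x

print(round(mu, 6))   # 1.1
print(round(var, 6))  # 0.49
```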
Covariance and correlation
 Covariance \(\sigma_{XY}\) of \(X\) and \(Y\) is \(E((X - \mu_{X})(Y - \mu_{Y}))\)
 If \(X\) and \(Y\) are both above or below their means at the same time, \(\sigma_{XY}\) will be positive
 If \(X\) is above when \(Y\) is below and vice versa, \(\sigma_{XY}\) will be negative
 If there is no relation, \(\sigma_{XY}\) will be zero
 Pearson’s correlation coefficient \(r_{XY}\) is covariance normalized by standard deviations
 \[r_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}\]
 Always lies in \([-1 \ldots 1]\)
 Chebyshev’s Inequality: \(P(\mid X - \mu \mid \gt \epsilon) \leq (\frac{\sigma}{\epsilon})^2\)
 I.e., the probability of a value being more than \(\epsilon\) away from the mean is bounded by the square of the ratio between the standard deviation and \(\epsilon\)
 See proof
Bernoulli distribution
 A Bernoulli distribution is
a random variable with just two values 1 and 0 (sometimes called success and failure)
 Named after Jacob Bernoulli, the mathematician who first described it
 Probability of success is \(p\)
 \[\mu = 0(1 - p) + 1(p) = p\]
 \[\sigma^2 = \sum (x - \mu)^2 P(x) = (0 - p)^2 (1 - p) + (1 - p)^2 p = p(1 - p)\]
Binomial distribution
 A binomial distribution is
the number of successes in \(n\) trials of a Bernoulli variable with probability \(p\)
 Name means “two numbers” (referring to \(n\) and \(k\))
 Probability of exactly \(x\) successes in \(n\) trials is \(\binom{n}{x} p^x (1-p)^{n-x}\)
 Number of different arrangements of that many successes and that many failures
 \(\mu = np\) (because trials are independent)
 \(\sigma^2 = np(1-p)\) (because \(\sigma^{2}_{A + B} = \sigma^2_A + \sigma^2_B\) if \(A\) and \(B\) are independent)
 Probability of up to \(x\) successes is complicated to calculate, but we can use approximations (FIXME)
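A sketch in Python confirming that the pmf sums to 1 and that the mean and variance match \(np\) and \(np(1-p)\), for hypothetical \(n\) and \(p\):

```python
import math

# Hypothetical binomial parameters.
n, p = 10, 0.3

def binom_pmf(x: int) -> float:
    # C(n, x) p^x (1 - p)^(n - x)
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

total = sum(binom_pmf(x) for x in range(n + 1))
mean = sum(x * binom_pmf(x) for x in range(n + 1))
var = sum((x - mean) ** 2 * binom_pmf(x) for x in range(n + 1))

print(round(total, 6))  # 1.0
print(round(mean, 6))   # 3.0  (= np)
print(round(var, 6))    # 2.1  (= np(1-p))
```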
Geometric distribution
 A geometric distribution is
the number of Bernoulli trials needed to get the first success
 Potentially requires an unbounded number of trials
 \(P(x) = (1 - p)^{x - 1}p\) (i.e., \(x-1\) failures followed by 1 success)
 To find the mean, use the fact that \(\sum_{i=0}^{\infty} x^i = \frac{1}{1  x}\) for \(0 \lt x \lt 1\)
 A geometric series, which gives the distribution its name
 \[\mu = 1/p\]
 The variance takes a little more work, but is \(\sigma^2 = \frac{1-p}{p^2}\)
Negative binomial distribution
 The binomial distribution describes the number of successes in a fixed number of trials
 The negative binomial distribution is
the number of trials required to achieve a certain number of successes
 “Negative” in the sense of opposite: there is nothing negative in the values

\[P(x) = \binom{x - 1}{k - 1} (1 - p)^{x - k} p^k\]
 Reading right to left, this is \(k\) successes, \(x-k\) failures, and the number of possible rearrangements with the last trial being a success (which is why the \(-1\)s)
 \[\mu = \frac{k}{p}\]
 \[\sigma^2 = \frac{k(1p)}{p^2}\]
 With \(k=1\), the negative binomial is just the geometric distribution (number of trials to first success)
Poisson distribution
 Number of events occurring within a fixed period has a Poisson distribution
 Assuming events never occur simultaneously
 \[P(x) = e^{-\lambda}\frac{\lambda^x}{x!}\]
 \[\mu = \lambda\]
 \[\sigma^2 = \lambda\]
 Poisson is a special case of binomial where the number of trials is very large
and the probability of success in any trial is small
 See proof
 For \(n \geq 30\) and \(p \leq 0.05\), Poisson is a good approximation of binomial
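A quick comparison of the two pmfs in that regime, with hypothetical \(n = 100\) and \(p = 0.03\):

```python
import math

# Compare binomial(n, p) with Poisson(lambda = np) where n is large
# and p is small (hypothetical n = 100, p = 0.03).
n, p = 100, 0.03
lam = n * p

for x in range(6):
    binom = math.comb(n, x) * p ** x * (1 - p) ** (n - x)
    pois = math.exp(-lam) * lam ** x / math.factorial(x)
    # The two probabilities agree to about two decimal places.
    print(x, round(binom, 4), round(pois, 4))
```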
Probability density and cumulative distribution
 Function describing probabilities of discrete events is called the probability mass function
 When describing continuous events, use:
 Cumulative distribution function \(F(x) = P(X \leq x)\)
 Probability density function \(f(x) = dF/dx\)
 So \(P(a \lt X \lt b) = \int_{a}^{b} f(x) dx\)
 Require \(\int_{-\infty}^{\infty} f(x) dx = 1\)
 And notice \(P(x) = P(x \leq X \leq x) = \int_{x}^{x} f(x) dx = 0\)
 I.e., probability of any specific exact value is 0
 \[\mu = \int_{-\infty}^{\infty} x f(x) dx\]
 \[\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x) dx = \int_{-\infty}^{\infty} x^2 f(x) dx - \mu^2\]
Uniform distribution
 Uniform distribution has equal probability over a finite range \([a \ldots b]\)
 \[f(x) = \frac{1}{b  a}\]

\[P(a \leq t \leq X \leq t+h \leq b) = \frac{h}{b  a}\]
 I.e., probability is proportional to fraction of range
 Standard uniform distribution has range \([0 \ldots 1]\)
 \[\mu = \frac{1}{2}\]
 \[\sigma^2 = \int_{0}^1 x^2 dx - (\frac{1}{2})^2 = \frac{1}{12}\]
Exponential distribution
 Assume the waiting time for the next rare event is independent of the time waited so far
 I.e., \(P(X \geq b+a \mid X \geq a) = P(X \geq b)\)
 Can show that \(f(x) = \lambda e^{-\lambda x}\) is the unique solution, so \(F(x) = 1 - e^{-\lambda x}\)
 \[\mu = \frac{1}{\lambda}\]
 \[\sigma^2 = \frac{1}{\lambda^2}\]
 The parameter \(\lambda\) is the frequency
 If \(\lambda = 2\) then we expect 2 events per unit time and \(\mu = \frac{1}{\lambda} = 0.5\) units of time between events
 The Poisson distribution with parameter \(\lambda t\) gives the number of events in time \(t\) (discrete)
 See proof
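A simulation sketch: Python's `random.expovariate` draws exponential waiting times, and the sample mean should land near \(1/\lambda\) (the rate and sample size here are arbitrary):

```python
import random
import statistics

# Simulate exponential waiting times with rate lambda = 2; the sample
# mean should be close to 1 / lambda = 0.5.
random.seed(1)  # fixed seed for reproducibility
lam = 2.0
samples = [random.expovariate(lam) for _ in range(100_000)]

print(round(statistics.mean(samples), 2))  # close to 0.5
```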
Gamma distribution
 If there are \(\alpha\) independent sequential steps,
each taking an exponentially distributed time with rate \(\lambda\),
then the total time has a Gamma distribution Gamma(\(\alpha\), \(\lambda\))
 Name comes from the gamma function, which extends factorial to complex numbers

\[f(x) = \frac{\lambda^\alpha}{\Gamma (\alpha)} x^{\alpha - 1} e^{-\lambda x}\]
 \(\alpha\) is sometimes called the shape parameter
 Gamma(1, \(\lambda\)) is an exponential distribution with rate \(\lambda\)
 \[\mu = \frac{\alpha}{\lambda}\]
 \[\sigma^2 = \frac{\alpha}{\lambda^2}\]
 For specific values, use library or lookup tables
 FIXME: relationship between Gamma and Poisson
Normal distribution
 In its full glory, normal distribution
has \(f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x - \mu)^2}{2 \sigma^2}}\)
 There is no closed form for the integral \(F(x)\)
 But as the notation suggests, the mean is \(\mu\) and the variance is \(\sigma^2\)
 The standard normal distribution \(Z\) has mean \(\mu = 0\) and standard deviation \(\sigma = 1\)
 Reconstruct arbitrary distribution \(X = \mu + \sigma Z\)
Central Limit Theorem
 Let \(S_n = X_1 + X_2 + \ldots + X_n\) be the sum of \(n\) independent random variables, all with mean \(\mu\) and standard deviation \(\sigma\)
 As \(n \rightarrow \infty\), \(\frac{S_n - n\mu}{\sigma \sqrt{n}}\) converges on a standard normal random variable
 Rate of convergence is \(\frac{1}{\sqrt{n}}\)
 Heuristic: for \(n \gt 30\), a normal distribution is an accurate approximation to \(S_n\)
 In particular, binomial(\(n\), \(p\)) \(\approx\) normal(\(\mu = np\), \(\sigma = \sqrt{n p (1 - p)}\) )
 But use a continuity correction: adjust interval by 0.5 units to allow for difference between discrete and continuous
 E.g., \(P(X = x) = P(x - 0.5 \lt X \lt x + 0.5)\) (because the probability of a point in a continuous distribution is zero)
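A sketch of the approximation with the continuity correction, using hypothetical binomial parameters and Python's `statistics.NormalDist`:

```python
import math
from statistics import NormalDist

# Normal approximation to binomial(n = 50, p = 0.4): P(X <= 20)
# with a continuity correction vs. the exact binomial sum.
n, p = 50, 0.4
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

exact = sum(math.comb(n, x) * p ** x * (1 - p) ** (n - x)
            for x in range(21))
approx = NormalDist(mu, sigma).cdf(20 + 0.5)  # continuity correction

# The two values agree closely.
print(round(exact, 4), round(approx, 4))
```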
Sampling
 A population has a parameter; a sample has a statistic
 Sample mean \(\bar{X}\) estimates the population mean \(\mu = E(X)\)
 Variance of \(\bar{X}\) is \(\frac{\sigma^2}{n}\)
 Distribution of sample means is normal, i.e. \(\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}\) is standard normal as \(n \rightarrow \infty\)
 Regardless of the underlying distribution of \(X\)
 The median is the central value such that \(P(X \lt M) \leq 0.5\) and \(P(X \gt M) \leq 0.5\)
 If the distribution is skewed to the right, \(M \lt \mu\)
 If the distribution is skewed to the left, \(M \gt \mu\)
 A quartile is a value that divides the sample into quarters
 E.g., first quartile \(Q_1\) splits values 25%/75%, third quartile \(Q_3\) splits values 75%/25%
 Second quartile is the same as the median
 The interquartile range IQR is \(Q_3 - Q_1\)
 Anything more than 1.5 IQR below \(Q_1\) or above \(Q_3\) is considered an outlier
 Because if the data is normal, only 0.7% of it should lie outside these bounds
 The sample variance is \(s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2\)
 Using \(n1\) instead of \(n\) ensures that \(s^2\) is unbiased (the Bessel correction)
 See proof
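Python's `statistics` module implements both versions of the variance, which makes the Bessel correction easy to see on a small hypothetical sample:

```python
import statistics

# Hypothetical sample with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

s2 = statistics.variance(data)    # sample variance: divides by n - 1
pop = statistics.pvariance(data)  # population variance: divides by n

print(s2, pop)  # sample variance 32/7 (about 4.571), population variance 4
```

The \(n-1\) version is always slightly larger, compensating for the fact that deviations are measured from the sample mean rather than the true mean.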
Parameter estimation
 The \(k^{th}\) population moment is \(E(X^k)\)
 The \(k^{th}\) sample moment is \(\frac{1}{n}\sum_{i=1}^{n} X_i^k\)
 The first sample moment is the sample mean
 The population and sample central moments are \(E((X - \mu)^k)\) and \(\frac{1}{n}\sum(X_i - \bar{X})^k\) respectively
 I.e., the moments after shifting the data to a mean of zero
 Parameter estimation via the method of moments: find parameter values to match sample moments
 Parameter estimation via maximum likelihood: choose parameter values to maximize the likelihood of the observed sample
 Example: estimating \(\lambda\) for the Poisson distribution
 PMF is \(P(x) = e^{ \lambda} \frac{\lambda^x}{x!}\)
 Taking logarithms, \(\ln{P(x)} = -\lambda + x \ln{\lambda} - \ln{x!}\)
 So we need to maximize \(\ln{P(X)} = \sum{(-\lambda + X_i \ln{\lambda})} + C = -n \lambda + \ln{\lambda}\sum{X_i} + C\) where \(C = -\sum{\ln{X_i!}}\) is a constant that does not depend on \(\lambda\)
 Differentiating with respect to \(\lambda\) and setting equal to zero gives \(\frac{\partial}{\partial{\lambda}}\ln{P(X)} = -n + \frac{1}{\lambda}\sum{X_i} = 0\)
 Only solution is \(\lambda = \frac{1}{n}\sum{X_i} = \bar{X}\)
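A numeric sketch of this result: evaluate the Poisson log-likelihood of a small hypothetical sample and confirm that \(\lambda = \bar{X}\) beats nearby values:

```python
import math

# Hypothetical Poisson-like counts.
data = [2, 3, 1, 4, 2, 3]
xbar = sum(data) / len(data)  # sample mean

def log_likelihood(lam: float) -> float:
    # sum over the sample of ln P(x) = -lambda + x ln(lambda) - ln(x!)
    return sum(-lam + x * math.log(lam) - math.log(math.factorial(x))
               for x in data)

# The log-likelihood at the sample mean beats nearby values of lambda.
print(xbar)                                                    # 2.5
print(log_likelihood(xbar) > log_likelihood(xbar - 0.1))       # True
print(log_likelihood(xbar) > log_likelihood(xbar + 0.1))       # True
```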
Student’s t-distribution
 Student’s t-distribution is used to estimate the mean of a normally distributed population
when the sample size is small (e.g., fewer than 30) and the variance is unknown
 Name comes from a pseudonym used by the mathematician who first used it this way
 If \(X\) is normally distributed with mean \(\mu\) and variance \(\sigma^2\), then \(\bar{X} = \frac{1}{n}\sum{X_i}\) is the sample mean and \(s^2 = \frac{1}{n-1}\sum{(X_i - \bar{X})^2}\) is the Bessel-corrected sample variance
 The variable \(\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}\) has a standard normal distribution
 However, the variable \(\frac{\bar{X} - \mu}{s / \sqrt{n}}\) has a t-distribution
with \(n-1\) degrees of freedom
 \(n1\) because there’s a step in the calculation that normalizes the \(n\) values to unit length
 Once \(n1\) are known, the value of the \(n^{th}\) is fixed
 The exact formula for the t-distribution is a little bit scary.
 The PDF’s shape resembles that of a normal distribution with mean 0 and variance 1, but is slightly lower and wider.
 The two become closer as the degrees of freedom \(\nu\) gets larger.
Confidence intervals
 A confidence interval is an interval \([a \ldots b]\)
that has some probability \(p\) of containing the actual value of a statistic
 E.g., “There is a 90% probability that the actual mean of this population lies between 2.5 and 3.5”
 Larger intervals have a higher probability but are less precise
 If there are more than 30 samples or the standard deviation \(\sigma\) is known, use a z-test:
 Choose a confidence level \(C\) (typically 95%)
 Find the value \(z^{\star}\) such that \(P(Z \gt z^{\star}) = \frac{1 - C}{2}\)
in a standard normal distribution
 Divide by 2 because the normal curve has two symmetric tails
 Calculate the sample mean \(\bar{X}\)
 Interval is \(\bar{X} \pm z^{\star}\frac{\sigma}{\sqrt{n}}\)
 If there are fewer than 30 samples or the standard deviation isn’t known, use a t-test:
 Choose a confidence level \(C\)
 Find a value \(t^{\star}\) such that \(P(T \gt t^{\star}) = \frac{1 - C}{2}\) in a Student’s t-distribution with \(n-1\) degrees of freedom
 Estimate the standard deviation \(s\)
 Interval is \(\bar{X} \pm t^{\star}\frac{s}{\sqrt{n}}\)
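A sketch of the z-interval recipe using `statistics.NormalDist`; the data and the "known" \(\sigma\) are hypothetical:

```python
import math
from statistics import NormalDist, mean

# 95% z-interval for the mean, assuming the standard deviation is known
# (hypothetical data and hypothetical sigma = 2.0).
data = [4.8, 5.1, 5.3, 4.9, 5.6, 5.0, 5.2, 4.7]
sigma = 2.0
C = 0.95

z_star = NormalDist().inv_cdf(1 - (1 - C) / 2)  # about 1.96 for C = 0.95
xbar = mean(data)
half_width = z_star * sigma / math.sqrt(len(data))

print(round(xbar - half_width, 2), round(xbar + half_width, 2))  # 3.69 6.46
```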
Hypothesis testing
 What is the probability of seeing this difference between two datasets?
 The null hypothesis \(H_0\) is that the samples come from a single population and the observed difference is purely due to chance
 The alternative hypothesis \(H_A\) is that the samples come from two different populations
 False positive: decide that the difference is not purely random when it is
 False negative: decide the difference is purely random when it isn’t
 Example: if a coin comes up heads 9 times out of 10, what are the odds it is actually fair?
 Probability if the coin is fair is \(\binom{10}{9} \cdot 0.5^9 \cdot (1-0.5)^1 = 10 \cdot 0.00195 \cdot 0.5 = 0.00977\)
 I.e., less than 1% chance of seeing this result if the coin is fair
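The same arithmetic in Python, along with the tail probability of nine *or more* heads (the usual one-sided p-value):

```python
import math

def binom_pmf(n: int, x: int, p: float) -> float:
    # C(n, x) p^x (1 - p)^(n - x)
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

exactly_9 = binom_pmf(10, 9, 0.5)
at_least_9 = binom_pmf(10, 9, 0.5) + binom_pmf(10, 10, 0.5)

print(round(exactly_9, 5))   # 0.00977
print(round(at_least_9, 5))  # 0.01074
```

Either way, the result is very unlikely under the null hypothesis of a fair coin.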
Prediction
 Accuracy is the fraction of correct predictions (true positive + true negative)
 Not useful if there are only a few defective items, since “all good” will have high accuracy
 Precision is the fraction of positives that are actually positive, i.e. true positive / (true positive + false positive)
 Recall is the fraction of positives the method can actually identify, i.e., true positive / (true positive + false negative)
 Perfect precision and perfect recall mean all items are classified correctly
 But increasing precision often reduces recall and vice versa
 The F-measure is the harmonic mean of precision and recall
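These definitions in code, with hypothetical confusion-matrix counts:

```python
# Hypothetical confusion-matrix counts.
tp, fp, fn, tn = 8, 2, 4, 86

precision = tp / (tp + fp)  # fraction of predicted positives that are real
recall = tp / (tp + fn)     # fraction of real positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```

Note that accuracy here would be \((8 + 86)/100 = 0.94\) even though a third of the real positives are missed, which is why precision and recall are reported separately.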
Spearman’s rank correlation
 Spearman’s rank correlation measures the rank correlation between two variables
 Rather than the values, measure how well the sorted order of items matches
 Given two random variables \(X_i\) and \(Y_i\), sort items and assign ranks \(r_{X_i}\) and \(r_{Y_i}\) and calculate the correlation coefficient of the ranks
 If ranks are distinct integers, can use the simplified formula \(\rho = 1 - \frac{6 \sum{d_i^2}}{n (n^2 - 1)}\) where \(d_i = r_{X_i} - r_{Y_i}\) is the difference between the ranks of observation \(i\)
 Standard error is \(\sigma = \frac{0.6325}{\sqrt{n - 1}}\)
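A sketch of the simplified formula on hypothetical data with distinct ranks:

```python
# Spearman's rank correlation via the simplified formula (no ties),
# for small hypothetical paired data.
xs = [10, 20, 30, 40, 50]
ys = [1, 3, 2, 5, 4]

def ranks(values):
    # rank 1 for the smallest value (assumes all values are distinct)
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

rx, ry = ranks(xs), ranks(ys)
n = len(xs)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(rho)  # 0.8
```

The sorted orders mostly agree (two adjacent swaps), so \(\rho\) is high but not 1.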
Proofs
These proofs are included primarily to help readers understand and remember a few key relationships.
Chebyshev’s Inequality
\[P(\mid X - \mu \mid \gt \epsilon) \leq (\frac{\sigma}{\epsilon})^2\] \[\begin{align*} \sigma^2 & = & \sum_x (x - \mu)^2 P(X) \\ & \geq & \sum_{x : \mid x - \mu \mid \gt \epsilon} (x - \mu)^2 P(X) \\ & \geq & \sum_{x : \mid x - \mu \mid \gt \epsilon} \epsilon^2 P(X) \\ & = & \epsilon^2 \sum_{x : \mid x - \mu \mid \gt \epsilon} P(X) \\ & = & \epsilon^2 P(\mid X - \mu \mid \gt \epsilon) \end{align*}\]
Poisson as the limit of the binomial distribution
 For the binomial distribution: \[P(x) = \binom{n}{x} p^x (1 - p)^{n - x}\]
 Let \(\lambda = np\) be the success rate (number of trials times probability per trial)
 Then \(p = \frac{\lambda}{n}\) and: \[P(x) = \frac{n!}{x!(n - x)!} \left(\frac{\lambda}{n}\right)^x \left(1 - \frac{\lambda}{n}\right)^{n - x}\]
 But: \[\lim_{n \rightarrow \infty} \frac{n!}{(n - x)! \, n^x} = 1\]
 And since \(e = \lim_{x \rightarrow \infty} (1 + \frac{1}{x})^x\) (so \((1 - \frac{1}{x})^x \rightarrow e^{-1}\)), if \(\theta = \frac{n}{\lambda}\) then: \[\lim_{n \rightarrow \infty} \left(1 - \frac{\lambda}{n}\right)^n = \lim_{\theta \rightarrow \infty} \left(\left(1 - \frac{1}{\theta}\right)^{\theta}\right)^{\lambda} = e^{-\lambda}\]
 Finally: \[\lim_{n \rightarrow \infty} \left(1 - \frac{\lambda}{n}\right)^{-x} = 1\]
 Multiplying these all together gives the formula for the Poisson distribution, \(P(x) = e^{-\lambda}\frac{\lambda^x}{x!}\)
Relationship between Poisson and exponential distributions
 Let \(N_t\) be the number of events in time \(t\)
 And \(X_t\) be the time for one additional event if there was an event at time \(t\)
 If \(X_t \gt x\) then \(N_t = N_{t+x}\)
 Since total probability is always 1, \(P(X_t \leq x) = 1  P(X_t \gt x)\)
 So \(P(X_t \leq x) = 1  P(N_{t+x}  N_t = 0) = P(N_x = 0)\) (because the process is memoryless)
 But \(P(N_x = 0) = \frac{(\lambda x)^0}{0!}e^{-\lambda x} = e^{-\lambda x}\)
 So \(P(X_t \leq x) = 1 - e^{-\lambda x}\)
Bessel correction
 For a sample \(a\), \(X\) deviates from sample mean \(\bar{X}\) with variance \(\sigma_{a}^2\)
 But the sample mean \(\bar{X}\) deviates from the population mean \(\mu\) with variance \(\sigma^2 / n\)
 So the population variance \(\sigma^2 = \sigma_{a}^2 + \sigma^2 / n\)
 Rearranging gives \(\sigma^2 = \frac{n}{n-1} \sigma_{a}^2\)
 But \(\sigma_{a}^2 = \frac{\sum (X_i - \bar{X})^2}{n}\), so \(\sigma^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}\)
Student’s t-distribution
The PDF of Student’s t-distribution with \(\nu\) degrees of freedom is:
\[\begin{align*} f(t) & = & \frac{\Gamma(\frac{\nu+1}{2})}{\sqrt{\nu\pi} \Gamma(\frac{\nu}{2})} (1+\frac{t^2}{\nu})^{-\frac{\nu+1}{2}} \end{align*}\]where \(\Gamma\) is the gamma function. For positive even integer values of \(\nu\), the first term is:
\[\begin{align*} \frac{\Gamma(\frac{\nu+1}{2})} {\sqrt{\nu\pi}\,\Gamma(\frac{\nu}{2})} & = & \frac{(\nu - 1)(\nu - 3)\cdots 5 \cdot 3}{2 \sqrt{\nu}(\nu - 2)(\nu - 4)\cdots 4 \cdot 2} \end{align*}\]For positive odd integer values of \(\nu\), the first term is:
\[\begin{align*} \frac{\Gamma(\frac{\nu+1}{2})} {\sqrt{\nu\pi}\,\Gamma(\frac{\nu}{2})} & = & \frac{(\nu - 1)(\nu - 3)\cdots 4 \cdot 2}{\pi \sqrt{\nu}(\nu - 2)(\nu - 4)\cdots 5 \cdot 3} \end{align*}\]