Data Science for Software Engineers

code + statistics = useful insights

A little theory can go a long way.

Basic probability

Combinatorics

Conditional probability

Bayes’ Rule

Mean and variance

Covariance and correlation

Bernoulli distribution

Binomial distribution

Geometric distribution

Negative binomial distribution

Poisson distribution

Probability density and cumulative distribution

Uniform distribution

Exponential distribution

Gamma distribution

Normal distribution

Central Limit Theorem

Sampling

\[\begin{align*} s^2 &= \frac{1}{n-1} \sum_{i=1}^{n}(X_i - \bar{X})^2 \\ &= \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n - 1} \end{align*}\]
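The two forms of the sample variance above can be checked against each other numerically. The following is a minimal sketch using only the Python standard library; the function names and the sample data are illustrative assumptions, not part of the original text.

```python
import statistics


def sample_variance_definition(xs):
    """Sample variance as the sum of squared deviations divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_shortcut(xs):
    """Sample variance via the shortcut (sum of squares minus n * mean^2)."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean ** 2) / (n - 1)


# Hypothetical data, chosen only so the arithmetic is easy to follow by hand.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

by_definition = sample_variance_definition(data)
by_shortcut = sample_variance_shortcut(data)
by_library = statistics.variance(data)  # also divides by n - 1

print(by_definition, by_shortcut, by_library)
```

The shortcut form needs only one pass over the data, but it subtracts two large numbers and can lose precision when the mean is large relative to the spread, so the definitional form is usually the safer choice in floating point.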

Parameter estimation

Student’s t-distribution

Confidence intervals

Hypothesis testing

Prediction

Spearman’s rank correlation

Proofs

These proofs are included primarily to help readers understand and remember a few key relationships.

Chebyshev’s Inequality

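For a random variable \(X\) with mean \(\mu\) and variance \(\sigma^2\), and any \(\epsilon \gt 0\):

\[P(\lvert X - \mu \rvert \gt \epsilon) \leq \left(\frac{\sigma}{\epsilon}\right)^2\]

For the discrete case, start from the definition of the variance, keep only the terms farther than \(\epsilon\) from the mean, and bound each remaining squared deviation from below by \(\epsilon^2\):

\[\begin{align*} \sigma^2 &= \sum_x (x - \mu)^2 P(X = x) \\ &\geq \sum_{x : \lvert x - \mu \rvert \gt \epsilon} (x - \mu)^2 P(X = x) \\ &\geq \sum_{x : \lvert x - \mu \rvert \gt \epsilon} \epsilon^2 P(X = x) \\ &= \epsilon^2 \sum_{x : \lvert x - \mu \rvert \gt \epsilon} P(X = x) \\ &= \epsilon^2 P(\lvert X - \mu \rvert \gt \epsilon) \end{align*}\]

Dividing both sides by \(\epsilon^2\) gives the inequality.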
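As a quick sanity check of Chebyshev’s inequality, the sketch below computes the exact tail probability and the bound \((\sigma / \epsilon)^2\) for a small discrete distribution. The fair six-sided die, the choice \(\epsilon = 2\), and the helper name `chebyshev_check` are assumptions made for illustration.

```python
from fractions import Fraction


def chebyshev_check(pmf, eps):
    """Return (tail probability, Chebyshev bound) for a discrete pmf.

    pmf maps each outcome x to P(X = x); eps is the distance from the mean.
    """
    mu = sum(x * p for x, p in pmf.items())
    var = sum((x - mu) ** 2 * p for x, p in pmf.items())
    # Exact probability of landing strictly farther than eps from the mean.
    tail = sum(p for x, p in pmf.items() if abs(x - mu) > eps)
    bound = var / eps ** 2
    return tail, bound


# Hypothetical example: a fair six-sided die, using exact rational arithmetic.
die = {face: Fraction(1, 6) for face in range(1, 7)}
tail, bound = chebyshev_check(die, Fraction(2))

print(tail, bound)  # the tail probability never exceeds the bound
```

For the die, only the faces 1 and 6 lie more than 2 away from the mean of 3.5, so the bound is loose here. That is typical: Chebyshev’s inequality holds for any distribution with finite variance, at the cost of being conservative.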

Poisson as the limit of the binomial distribution

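Start from the binomial probability of \(k\) successes in \(n\) trials:

\[\begin{align*} P(X = k) &= \binom{n}{k} p^k (1 - p)^{n - k} \end{align*}\]

Hold the expected number of successes \(\lambda = np\) fixed, so that \(p = \frac{\lambda}{n}\):

\[\begin{align*} P(X = k) &= \frac{n!}{k!(n - k)!} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n - k} \\ &= \left(\frac{\lambda^k}{k!}\right) \frac{n!}{(n - k)!} \frac{1}{n^k} \left(1 - \frac{\lambda}{n}\right)^n \left(1 - \frac{\lambda}{n}\right)^{- k} \end{align*}\]

Now take the limit of each factor that depends on \(n\), with \(k\) fixed. The first is a product of \(k\) ratios, each of which tends to 1:

\[\begin{align*} \lim_{n \rightarrow \infty} \frac{n!}{(n - k)!} \frac{1}{n^k} &= \lim_{n \rightarrow \infty} \frac{n(n - 1)(n - 2)\ldots(n - k + 1)}{n^k} \\ &= \lim_{n \rightarrow \infty} \frac{n}{n} \cdot \frac{n - 1}{n} \cdots \frac{n - k + 1}{n} \\ &= 1 \end{align*}\]

For the second, substitute \(\theta = -\frac{n}{\lambda}\), so that \(-\frac{\lambda}{n} = \frac{1}{\theta}\) and \(n = -\lambda\theta\):

\[\begin{align*} \lim_{n \rightarrow \infty} \left(1 - \frac{\lambda}{n}\right)^n &= \lim_{\theta \rightarrow -\infty} \left(1 + \frac{1}{\theta}\right)^{-\lambda\theta} \\ &= \left[\lim_{\theta \rightarrow -\infty} \left(1 + \frac{1}{\theta}\right)^{\theta}\right]^{-\lambda} \\ &= e^{- \lambda} \end{align*}\]

The third tends to 1 because the exponent \(-k\) is fixed while the base tends to 1:

\[\begin{align*} \lim_{n \rightarrow \infty} \left(1 - \frac{\lambda}{n}\right)^{-k} &= 1 \end{align*}\]

Combining the three limits gives the Poisson probability mass function:

\[\begin{align*} \lim_{n \rightarrow \infty} P(X = k) &= \frac{\lambda^k e^{-\lambda}}{k!} \end{align*}\]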
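The convergence can also be observed numerically. The sketch below holds \(\lambda = np\) fixed and compares the binomial probability with the Poisson probability as \(n\) grows. The values \(\lambda = 3\) and \(k = 2\), and the function names, are arbitrary choices for illustration.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for a binomial distribution with n trials and success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


lam, k = 3.0, 2  # hypothetical values, chosen only for the demonstration
trial_counts = (10, 100, 1000, 10000)

# Absolute gap between the two probabilities, with p = lam / n in each case.
errors = [
    abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam))
    for n in trial_counts
]

for n, err in zip(trial_counts, errors):
    print(f"n = {n:>5}: |binomial - Poisson| = {err:.2e}")
```

In this example the gap shrinks roughly in proportion to \(1/n\), which is consistent with the common practice of using the Poisson approximation when \(n\) is large and \(p\) is small.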

Relationship between Poisson and exponential distributions

Bessel correction

Student’s t-distribution

The PDF of Student’s t-distribution with \(\nu\) degrees of freedom is:

\[\begin{align*} f(t) &= \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1+\frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}} \end{align*}\]

where \(\Gamma\) is the gamma function. The leading factor does not depend on \(t\); it is the normalizing constant that makes the density integrate to 1. For positive even integer values of \(\nu\), it reduces to:

\[\begin{align*} \frac{\Gamma\left(\frac{\nu+1}{2}\right)} {\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} &= \frac{(\nu -1)(\nu -3)\cdots 5 \cdot 3}{2 \sqrt{\nu}\,(\nu -2)(\nu -4)\cdots 4 \cdot 2} \end{align*}\]

For positive odd integer values of \(\nu\), it reduces to:

\[\begin{align*} \frac{\Gamma\left(\frac{\nu+1}{2}\right)} {\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} &= \frac{(\nu -1)(\nu -3)\cdots 4 \cdot 2}{\pi \sqrt{\nu}\,(\nu -2)(\nu -4)\cdots 5 \cdot 3} \end{align*}\]

In both cases an empty product (which occurs for the smallest values of \(\nu\)) is taken to be 1.
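The product forms can be compared with the gamma-function form directly. The following is a minimal sketch using Python’s `math` module; the function names are illustrative, and it assumes that an empty product counts as 1 (which is how `math.prod` treats an empty range).

```python
import math


def t_constant_gamma(nu):
    """Leading factor of the t-distribution PDF, from the gamma function."""
    return math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))


def t_constant_product(nu):
    """Leading factor for a positive integer nu, from the product formulas."""
    # (nu - 1)(nu - 3)... stepping down by 2 and stopping at 3 (even nu) or 2 (odd nu).
    numerator = math.prod(range(nu - 1, 1, -2))
    # (nu - 2)(nu - 4)... stepping down by 2 and stopping at 2 (even nu) or 3 (odd nu).
    denominator = math.prod(range(nu - 2, 1, -2))
    # The even case carries a factor of 2, the odd case a factor of pi.
    leading = 2 if nu % 2 == 0 else math.pi
    return numerator / (leading * math.sqrt(nu) * denominator)


for nu in range(1, 9):
    print(nu, t_constant_gamma(nu), t_constant_product(nu))
```

For \(\nu = 1\) both forms give \(1/\pi\), the normalizing constant of the Cauchy distribution, and as \(\nu\) grows the values approach \(1/\sqrt{2\pi}\), the constant of the standard normal density.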