
Probability & Statistics

Distributions, expectations, Bayes theorem, MLE


1. Sample Space, Events, Probability

Sample space $\Omega$ — the set of all possible outcomes of an experiment.

Event $A \subseteq \Omega$ — a subset of outcomes we care about.

Probability measure $P$ assigns a number in $[0,1]$ to each event, satisfying:

  1. $P(\Omega) = 1$
  2. $P(A) \geq 0$ for all $A$
  3. For mutually exclusive events: $P(A \cup B) = P(A) + P(B)$

Useful consequences:

$$P(\emptyset) = 0 \qquad P(A^c) = 1 - P(A) \qquad P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

Example: Roll a fair die. $\Omega = \{1,2,3,4,5,6\}$. Event $A$ = "even" = $\{2,4,6\}$. $P(A) = 3/6 = 0.5$.
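These rules are easy to sanity-check by enumerating a small sample space — a minimal sketch using the die roll above (the event names are just for illustration):

```python
# Enumerate the die's sample space; events are subsets of it
omega = {1, 2, 3, 4, 5, 6}
P = lambda event: len(event) / len(omega)   # uniform probability measure

A = {2, 4, 6}        # "even"
B = {1, 2, 3}        # "at most 3"

assert P(A) == 0.5
assert P(omega - A) == 1 - P(A)              # complement rule
assert P(A | B) == P(A) + P(B) - P(A & B)    # inclusion-exclusion
```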


2. Conditional Probability

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \qquad \text{provided } P(B) > 0$$

"The probability of $A$ given that we know $B$ has occurred."

Example: A patient tests positive for a disease. What is the probability they actually have it? That's $P(\text{disease} \mid \text{positive})$ — not the same as $P(\text{positive} \mid \text{disease})$. Confusing these is a classic error (see Bayes' theorem below).

Multiplication rule

$$P(A \cap B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$$

Independence

$A$ and $B$ are independent if knowing $B$ tells you nothing about $A$:

$$P(A \mid B) = P(A) \qquad \Leftrightarrow \qquad P(A \cap B) = P(A)\,P(B)$$
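As a quick numerical check (a sketch, assuming two fair dice), the product rule holds exactly on the joint sample space:

```python
from itertools import product

# Joint sample space of two fair dice: 36 equally likely outcomes
outcomes = list(product(range(1, 7), repeat=2))
P = lambda event: sum(1 for o in outcomes if event(o)) / len(outcomes)

A = lambda o: o[0] % 2 == 0    # first die even
B = lambda o: o[1] >= 5        # second die is 5 or 6

# Independent events: P(A ∩ B) = P(A) P(B)
assert abs(P(lambda o: A(o) and B(o)) - P(A) * P(B)) < 1e-12
```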

[!WARNING] Independence is an assumption, not something you observe directly. The i.i.d. assumption in ML (samples are independent and identically distributed) is an assumption about how data was collected — it's often approximately true but rarely exactly true.


3. Bayes' Theorem

$$\boxed{P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}}$$

In ML language: posterior = (likelihood × prior) / evidence.

$$P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{P(\mathcal{D})}$$

  • $P(\theta)$ — prior: what we believe about parameters before seeing data
  • $P(\mathcal{D} \mid \theta)$ — likelihood: how probable the data is under these parameters
  • $P(\mathcal{D})$ — evidence (normalizing constant, often intractable)
  • $P(\theta \mid \mathcal{D})$ — posterior: updated belief after seeing data

Law of Total Probability

If $B_1, \ldots, B_k$ partition $\Omega$ (mutually exclusive, exhaustive):

$$P(A) = \sum_{i=1}^k P(A \mid B_i)\, P(B_i)$$

Used to compute $P(\mathcal{D})$ in the denominator of Bayes' theorem.

Classic example — Medical test

A disease affects 1% of the population. A test is 95% sensitive (true positive rate) and 95% specific (true negative rate). You test positive. What's the probability you have the disease?

  • $P(\text{disease}) = 0.01$ (prior)
  • $P(\text{pos} \mid \text{disease}) = 0.95$ (sensitivity)
  • $P(\text{pos} \mid \text{no disease}) = 0.05$ (false positive rate)
  • $P(\text{pos}) = 0.95 \times 0.01 + 0.05 \times 0.99 = 0.0095 + 0.0495 = 0.059$

$$P(\text{disease} \mid \text{pos}) = \frac{0.95 \times 0.01}{0.059} \approx \mathbf{16\%}$$

Despite a 95% accurate test, a positive result only means 16% chance of disease. This is why rare-disease screening is hard — base rates dominate.
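The arithmetic above can be reproduced in a few lines (variable names are illustrative):

```python
# Posterior probability of disease given a positive test (values from the example)
prior = 0.01    # P(disease)
sens  = 0.95    # P(pos | disease), sensitivity
fpr   = 0.05    # P(pos | no disease) = 1 - specificity

evidence = sens * prior + fpr * (1 - prior)   # law of total probability
posterior = sens * prior / evidence

print(round(posterior, 3))   # ≈ 0.161
```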


4. Random Variables

A random variable $X$ is a function $X: \Omega \to \mathbb{R}$ mapping outcomes to numbers.

Discrete random variable

Takes values in a countable set. Described by a probability mass function (PMF):

$$P(X = x) = p(x) \qquad \sum_x p(x) = 1$$

Example — Bernoulli($p$):

$$P(X = 1) = p \qquad P(X = 0) = 1-p$$

Used for: binary outcomes (spam/not spam, click/no click).

Example — Binomial($n, p$):

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

Number of successes in $n$ independent Bernoulli trials. If I send 100 emails and each has a 3% click rate, the number of clicks is Binomial(100, 0.03).
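The email example can be explored with `scipy.stats.binom` (a small sketch):

```python
from scipy import stats

# Number of clicks from 100 emails with a 3% click rate
clicks = stats.binom(n=100, p=0.03)

print(clicks.mean())    # expected number of clicks, n*p = 3
print(clicks.pmf(0))    # chance of zero clicks, 0.97**100 ≈ 0.048
print(clicks.cdf(5))    # P(at most 5 clicks)
```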

Continuous random variable

Takes values in $\mathbb{R}$ (or an interval). Described by a probability density function (PDF) $f(x)$:

$$P(a \leq X \leq b) = \int_a^b f(x)\, dx \qquad \int_{-\infty}^\infty f(x)\, dx = 1$$

Note: $P(X = x) = 0$ for any single point — probability lives in intervals.

Cumulative distribution function (CDF):

$$F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\, dt \qquad f(x) = F'(x)$$


5. Expectation

The expected value (mean) of $X$:

$$\mathbb{E}[X] = \begin{cases} \sum_x x\, p(x) & \text{discrete} \\ \int_{-\infty}^\infty x\, f(x)\, dx & \text{continuous} \end{cases}$$

Think of it as a weighted average of all possible values, weighted by their probability.

Linearity of expectation — always true, regardless of dependence:

$$\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]$$

Example: Roll two dice, $S$ = sum. $\mathbb{E}[S] = \mathbb{E}[D_1] + \mathbb{E}[D_2] = 3.5 + 3.5 = 7$. No need to enumerate all 36 outcomes.
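The two-dice claim is quick to verify by brute force (a sketch):

```python
from itertools import product

# Expected sum of two dice, computed by enumerating all 36 outcomes
outcomes = list(product(range(1, 7), repeat=2))
E_sum = sum(a + b for a, b in outcomes) / len(outcomes)

E_single = sum(range(1, 7)) / 6      # 3.5
assert E_sum == E_single + E_single  # linearity gives 7 without enumeration
```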

Expectation of a function:

$$\mathbb{E}[g(X)] = \sum_x g(x)\, p(x)$$

[!WARNING] $\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])$ in general. This is Jensen's inequality: for convex $g$, $\mathbb{E}[g(X)] \geq g(\mathbb{E}[X])$. Relevant in variational inference and log-likelihood bounds.


6. Variance and Standard Deviation

$$\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

The second form is often easier to compute. Standard deviation: $\sigma = \sqrt{\text{Var}(X)}$.

Properties:

  • $\text{Var}(aX + b) = a^2\, \text{Var}(X)$
  • $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X,Y)$
  • If $X, Y$ are independent: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$

Covariance:

$$\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$$

Correlation (normalized covariance, scale-free):

$$\rho(X,Y) = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y} \in [-1, 1]$$

[!NOTE] $\rho = 0$ means uncorrelated, not independent. Independence implies uncorrelated, but not vice versa. If $X \sim \mathcal{N}(0,1)$ and $Y = X^2$, then $\text{Cov}(X,Y) = 0$ but they are clearly dependent.
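The $Y = X^2$ example can be checked numerically (a sketch; the sample size and seed are arbitrary):

```python
import numpy as np

# X ~ N(0,1), Y = X^2: dependent, but uncorrelated in the population
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x ** 2

rho = np.corrcoef(x, y)[0, 1]
print(rho)   # close to 0 — yet Y is a deterministic function of X
```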


7. Common Distributions

Uniform — $\text{Uniform}(a, b)$

$$f(x) = \frac{1}{b-a} \qquad x \in [a,b]$$

$$\mathbb{E}[X] = \frac{a+b}{2} \qquad \text{Var}(X) = \frac{(b-a)^2}{12}$$

Used for: random initialization, random search, Monte Carlo sampling.


Gaussian (Normal) — $\mathcal{N}(\mu, \sigma^2)$

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

$$\mathbb{E}[X] = \mu \qquad \text{Var}(X) = \sigma^2$$

The standard normal is $\mathcal{N}(0,1)$, denoted $Z$. If $X \sim \mathcal{N}(\mu, \sigma^2)$ then $Z = (X-\mu)/\sigma \sim \mathcal{N}(0,1)$.

The 68-95-99.7 rule:

  • $P(\mu - \sigma \leq X \leq \mu + \sigma) \approx 68\%$
  • $P(\mu - 2\sigma \leq X \leq \mu + 2\sigma) \approx 95\%$
  • $P(\mu - 3\sigma \leq X \leq \mu + 3\sigma) \approx 99.7\%$
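The rule follows directly from the standard normal CDF, e.g. with `scipy.stats.norm`:

```python
from scipy import stats

Z = stats.norm(0, 1)
for k in (1, 2, 3):
    # P(μ - kσ ≤ X ≤ μ + kσ) is the same for every Gaussian after standardizing
    print(k, round(Z.cdf(k) - Z.cdf(-k), 4))
# 1 → 0.6827, 2 → 0.9545, 3 → 0.9973
```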

Why Gaussian dominates ML:

  • Central Limit Theorem — sum of many i.i.d. variables converges to Gaussian
  • Mathematically convenient — closed under linear transformations
  • Maximum entropy distribution given fixed mean and variance
  • Assumed in linear regression (residuals $\sim \mathcal{N}(0, \sigma^2)$)

Multivariate Gaussian $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$:

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$

  • $\boldsymbol{\mu} \in \mathbb{R}^d$ — mean vector
  • $\Sigma \in \mathbb{R}^{d \times d}$ — covariance matrix (symmetric, PSD)
  • The exponent $(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})$ is the squared Mahalanobis distance — it measures distance in units of standard deviations, accounting for correlations

Bernoulli — $\text{Bern}(p)$

$$P(X=1) = p \qquad P(X=0) = 1-p$$

$$\mathbb{E}[X] = p \qquad \text{Var}(X) = p(1-p)$$

Used as the output distribution for binary classification. The sigmoid function $\sigma(z) = 1/(1+e^{-z})$ maps logistic regression output to a valid $p$.


Categorical — $\text{Cat}(\boldsymbol{\pi})$

Generalization of Bernoulli to $K$ classes. $P(X = k) = \pi_k$, with $\sum_k \pi_k = 1$.

The softmax maps logit vectors to valid $\boldsymbol{\pi}$:

$$\pi_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}$$


Poisson — $\text{Pois}(\lambda)$

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \qquad k = 0, 1, 2, \ldots$$

$$\mathbb{E}[X] = \lambda \qquad \text{Var}(X) = \lambda$$

Used for: count data (number of events in a fixed time window). Emails per hour, bugs per 1000 lines of code, word counts in documents.


Exponential — $\text{Exp}(\lambda)$

$$f(x) = \lambda e^{-\lambda x} \qquad x \geq 0$$

$$\mathbb{E}[X] = 1/\lambda \qquad \text{Var}(X) = 1/\lambda^2$$

Models waiting times between Poisson events. Memoryless: $P(X > s+t \mid X > s) = P(X > t)$.


Beta — $\text{Beta}(\alpha, \beta)$

$$f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)} \qquad x \in [0,1]$$

$$\mathbb{E}[X] = \frac{\alpha}{\alpha+\beta} \qquad \text{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$

Lives on $[0,1]$, used as a prior for probabilities (conjugate prior to Bernoulli/Binomial). Intuition: $\alpha - 1$ = number of observed successes, $\beta - 1$ = failures.
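The pseudo-count intuition is what makes the conjugate update so convenient — a sketch (the counts below are made up for illustration):

```python
from scipy import stats

# Conjugate update: Beta(α, β) prior + k successes in n Bernoulli trials
# → Beta(α + k, β + n - k) posterior
alpha, beta = 2, 2     # weak prior centered at 0.5
k, n = 17, 20          # observed: 17 successes out of 20 trials

posterior = stats.beta(alpha + k, beta + n - k)
print(posterior.mean())   # (α + k) / (α + β + n) = 19/24 ≈ 0.79
```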


8. Maximum Likelihood Estimation (MLE)

Given data $\mathcal{D} = \{x^{(1)}, \ldots, x^{(n)}\}$ drawn i.i.d. from $p(x;\theta)$, the likelihood is:

$$L(\theta) = \prod_{i=1}^n p(x^{(i)}; \theta)$$

MLE finds the parameter that makes the observed data most probable:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta L(\theta) = \arg\max_\theta \sum_{i=1}^n \log p(x^{(i)}; \theta)$$

The log-likelihood $\ell(\theta) = \log L(\theta)$ is used because:

  • Products become sums (easier)
  • More numerically stable (no underflow)
  • Maximizing log is equivalent to maximizing the original (log is monotone)

Example — MLE for Gaussian

Given $x^{(1)}, \ldots, x^{(n)} \sim \mathcal{N}(\mu, \sigma^2)$:

$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x^{(i)} - \mu)^2$$

Setting $\partial \ell / \partial \mu = 0$:

$$\hat{\mu}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n x^{(i)} = \bar{x}$$

Setting $\partial \ell / \partial \sigma^2 = 0$:

$$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (x^{(i)} - \bar{x})^2$$

[!NOTE] This gives $1/n$, not $1/(n-1)$. The MLE estimate of variance is biased (it underestimates). The unbiased sample variance uses $n-1$ (Bessel's correction). In practice the difference is negligible for large $n$.
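A quick empirical check of the $1/n$ vs $1/(n-1)$ distinction (a sketch; NumPy's default `ddof=0` matches the MLE):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)

mu_mle = x.mean()
var_mle = ((x - mu_mle) ** 2).mean()   # the 1/n (biased) MLE

assert np.isclose(var_mle, x.var(ddof=0))   # NumPy's default is the MLE
assert var_mle < x.var(ddof=1)              # Bessel-corrected estimate is larger
```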

MLE ↔ Loss functions

| Model assumption | Negative log-likelihood | Standard name |
|---|---|---|
| $y \mid x \sim \mathcal{N}(\hat{y}, \sigma^2)$ | $\lVert y - \hat{y} \rVert^2$ | MSE loss |
| $y \mid x \sim \text{Bern}(\hat{p})$ | $-y\log\hat{p} - (1-y)\log(1-\hat{p})$ | Binary cross-entropy |
| $y \mid x \sim \text{Cat}(\boldsymbol{\pi})$ | $-\sum_k y_k \log \pi_k$ | Cross-entropy |
| $y \mid x \sim \text{Laplace}(\hat{y}, b)$ | $\lVert y - \hat{y} \rVert_1$ | MAE loss |

This is the key insight: choosing a loss function is equivalent to choosing a probabilistic model for your noise. MSE assumes Gaussian residuals. MAE assumes Laplacian (heavier tails, more robust to outliers).
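For instance, the Gaussian row can be verified directly: with fixed $\sigma$, the negative log-likelihood is an affine function of the squared error, so both have the same minimizer (a sketch with arbitrary synthetic values):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=100)        # targets
y_hat = rng.normal(size=100)    # predictions
sigma = 1.0

# Gaussian NLL: n/2 * log(2πσ²) + Σ(y - ŷ)² / (2σ²)
nll = 0.5 * len(y) * np.log(2 * np.pi * sigma**2) \
      + ((y - y_hat) ** 2).sum() / (2 * sigma**2)
mse = ((y - y_hat) ** 2).mean()

# NLL = constant + (n/2) * MSE — same minimizer as MSE
assert np.isclose(nll, 0.5 * len(y) * np.log(2 * np.pi) + 0.5 * len(y) * mse)
```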


9. Maximum A Posteriori (MAP)

MAP includes a prior over parameters:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \log p(\theta \mid \mathcal{D}) = \arg\max_\theta \left[\log p(\mathcal{D} \mid \theta) + \log p(\theta)\right]$$

MAP ↔ Regularization

| Prior on $\theta$ | Effect on objective | Standard name |
|---|---|---|
| $\theta_j \sim \mathcal{N}(0, 1/\lambda)$ | adds $\lambda \lVert \theta \rVert_2^2$ | L2 regularization / Ridge |
| $\theta_j \sim \text{Laplace}(0, 1/\lambda)$ | adds $\lambda \lVert \theta \rVert_1$ | L1 regularization / Lasso |

L2 (Gaussian prior): Penalizes large weights, shrinks all weights toward zero. Prefers many small weights. Solution is always unique and dense.

L1 (Laplace prior): Produces sparse solutions — many weights become exactly zero. Useful for feature selection.

```python
from sklearn.linear_model import Ridge, Lasso
ridge = Ridge(alpha=1.0)      # L2, alpha = λ
lasso = Lasso(alpha=0.1)      # L1
```

10. Information Theory

Entropy

The Shannon entropy of a discrete distribution PP:

$$H(P) = -\sum_x p(x) \log p(x) = \mathbb{E}[-\log p(X)]$$

Measures uncertainty or average surprise. Units: bits (log base 2) or nats (natural log).

  • Uniform distribution over $K$ classes: $H = \log K$ (maximum entropy)
  • Point mass (certain outcome): $H = 0$ (minimum entropy)

Example: Fair coin: $H = -0.5\log_2 0.5 - 0.5\log_2 0.5 = 1$ bit. Biased coin with $p=0.9$: $H \approx 0.47$ bits — less uncertainty.
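The coin entropies can be computed directly (a minimal sketch in base 2):

```python
import numpy as np

def H(p):
    """Shannon entropy in bits of a Bernoulli(p) variable."""
    p = float(p)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

print(H(0.5))   # 1.0 bit — maximal uncertainty
print(H(0.9))   # ≈ 0.469 bits — a biased coin is less surprising
```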

KL Divergence

Measures how different distribution QQ is from reference PP:

$$D_{\text{KL}}(P \| Q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_P\!\left[\log\frac{p(X)}{q(X)}\right]$$

Properties:

  • $D_{\text{KL}}(P \| Q) \geq 0$ always (Gibbs' inequality)
  • $D_{\text{KL}}(P \| Q) = 0 \iff P = Q$
  • Not symmetric: $D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)$

[!NOTE] $D_{\text{KL}}(P \| Q)$ is "forward KL" — it's zero-avoiding (forces $q>0$ wherever $p>0$, so the approximation covers all modes). $D_{\text{KL}}(Q \| P)$ is "reverse KL" — zero-forcing (allows $q=0$ where $p>0$, causing mode-seeking behavior). Both appear in variational inference and generative models.

Cross-Entropy

$$H(P, Q) = -\sum_x p(x) \log q(x) = H(P) + D_{\text{KL}}(P \| Q)$$

When $P$ is the true distribution (one-hot labels) and $Q$ is the model output (softmax probabilities), minimizing cross-entropy = minimizing KL divergence from true to predicted = MLE.

```python
import torch.nn.functional as F
loss = F.cross_entropy(logits, targets)  # standard for classification
```

Mutual Information

How much does knowing XX tell you about YY?

$$I(X;Y) = D_{\text{KL}}(P_{XY} \| P_X P_Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$

$I(X;Y) = 0$ iff $X$ and $Y$ are independent. Used in feature selection (pick features that maximize mutual information with the label).
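A sketch of computing $I(X;Y)$ from a joint probability table (the helper function and example tables are illustrative):

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in nats from a joint probability table P(X=i, Y=j)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal of X
    py = joint.sum(axis=0, keepdims=True)   # marginal of Y
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())

# Independent variables → zero mutual information
indep = np.outer([0.3, 0.7], [0.5, 0.5])
assert abs(mutual_information(indep)) < 1e-12

# Perfectly coupled fair bits → I = H(X) = log 2
coupled = np.array([[0.5, 0.0], [0.0, 0.5]])
assert abs(mutual_information(coupled) - np.log(2)) < 1e-12
```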


11. Estimators and Their Properties

An estimator $\hat{\theta}$ is a function of data used to estimate a parameter $\theta$.

Bias: $\text{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta$

Unbiased: $\mathbb{E}[\hat{\theta}] = \theta$. The sample mean $\bar{x}$ is unbiased for $\mu$. The sample variance with $n-1$ is unbiased for $\sigma^2$.

Variance of the estimator: $\text{Var}(\hat{\theta}) = \mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\right]$

Mean squared error: $\text{MSE}(\hat{\theta}) = \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta})$

This is the bias-variance tradeoff in disguise. A slightly biased estimator with much lower variance can have lower MSE — sometimes paying a little bias to get a lot of variance reduction is worth it. Ridge regression does exactly this.

Consistency: $\hat{\theta}_n \xrightarrow{p} \theta$ as $n \to \infty$. MLE estimators are typically consistent.


12. Central Limit Theorem

Let $X_1, \ldots, X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. Then:

$$\sqrt{n}\left(\bar{X}_n - \mu\right) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$$

Or equivalently:

$$\bar{X}_n \approx \mathcal{N}\!\left(\mu,\, \frac{\sigma^2}{n}\right)$$

Why it matters: No matter what distribution your data comes from, the sample mean is approximately normally distributed for large $n$. This is why Gaussian assumptions work so well in practice.

Standard error of the mean: $\text{SE} = \sigma/\sqrt{n}$. Collecting 4× more data halves the standard error.
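A simulation sketch (the exponential distribution and sample sizes are arbitrary choices): sample means of a skewed distribution are already close to Gaussian at $n = 100$, with standard error $\sigma/\sqrt{n}$:

```python
import numpy as np

# 50,000 replications of the mean of n=100 exponential draws (μ = σ = 1)
rng = np.random.default_rng(7)
n = 100
means = rng.exponential(scale=1.0, size=(50_000, n)).mean(axis=1)

print(means.mean())   # ≈ μ = 1
print(means.std())    # ≈ σ/√n = 0.1
```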


13. Hypothesis Testing (just enough for ML)

A p-value is $P(\text{data this extreme or more} \mid H_0)$. It is not $P(H_0 \mid \text{data})$.

Common tests in ML practice:

| Test | When to use |
|---|---|
| t-test | Compare means of two models/groups |
| Paired t-test | Compare two models on the same test sets |
| McNemar's test | Compare two classifiers on the same examples |
| Kolmogorov-Smirnov | Check if a sample came from a distribution |
```python
from scipy import stats

# Are model A and model B accuracies significantly different?
t_stat, p_value = stats.ttest_rel(accuracies_A, accuracies_B)
print(f"p = {p_value:.4f}")  # if p < 0.05, difference is "significant"
```

[!WARNING] Statistical significance is not practical significance. With enough data, tiny useless differences become "significant" (p<0.05p < 0.05). Always report effect size alongside p-values.


14. Covariance Matrix in Detail

For a random vector $\mathbf{x} \in \mathbb{R}^d$:

$$\Sigma = \text{Cov}(\mathbf{x}) = \mathbb{E}[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T]$$

$$\Sigma_{ij} = \text{Cov}(x_i, x_j)$$

Properties: symmetric, positive semi-definite, diagonal entries are variances.

Sample covariance from data matrix $X \in \mathbb{R}^{n \times d}$ (zero-mean):

$$\hat{\Sigma} = \frac{1}{n-1} X^T X$$

The correlation matrix is the normalized version:

$$R_{ij} = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii} \Sigma_{jj}}}$$

```python
# Empirical covariance
X_centered = X - X.mean(axis=0)
Sigma = (X_centered.T @ X_centered) / (n - 1)

# Or just:
Sigma = np.cov(X.T)          # X is n×d, np.cov wants d×n
corr  = np.corrcoef(X.T)     # correlation matrix
```

Geometric interpretation: The covariance matrix defines an ellipse (in 2D) or ellipsoid (in $d$ dimensions). Its eigenvectors are the axes of the ellipsoid; its eigenvalues are the squared semi-axis lengths. PCA finds these axes.
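A sketch connecting this to code (the covariance matrix below is an arbitrary example): `np.linalg.eigh` recovers the principal axes from sample data:

```python
import numpy as np

# Correlated 2-D Gaussian data; the covariance eigenvectors are the PCA axes
rng = np.random.default_rng(0)
true_cov = np.array([[3.0, 1.5], [1.5, 1.0]])
X = rng.multivariate_normal([0, 0], true_cov, size=50_000)

Sigma = np.cov(X.T)
eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending eigenvalues

# Eigenvalues = variances along the principal axes
print(eigvals)   # ≈ [0.20, 3.80] for this Σ
# Eigenvectors form an orthonormal basis
print(eigvecs.T @ eigvecs)
```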


15. Quick Reference

```python
import numpy as np
from scipy import stats

# ── Descriptive ───────────────────────────────
np.mean(x)
np.median(x)
np.std(x)              # std dev (ddof=0 by default → biased)
np.std(x, ddof=1)      # unbiased sample std
np.var(x, ddof=1)
np.percentile(x, [25, 50, 75])

# ── Distributions ─────────────────────────────
norm  = stats.norm(loc=0, scale=1)
norm.pdf(x)            # density
norm.cdf(x)            # P(X ≤ x)
norm.ppf(0.975)        # quantile (inverse CDF) → 1.96

# Sample from distributions
np.random.normal(mu, sigma, size=1000)
np.random.binomial(n=10, p=0.3, size=1000)
np.random.poisson(lam=5, size=1000)
np.random.beta(a=2, b=5, size=1000)

# ── MLE fits ──────────────────────────────────
mu_hat, sigma_hat = stats.norm.fit(data)
a_hat, b_hat, loc, scale = stats.beta.fit(data, floc=0, fscale=1)

# ── Entropy / Information ─────────────────────
from scipy.stats import entropy
H = entropy([0.5, 0.5])               # Shannon entropy (nats)
H = entropy([0.5, 0.5], base=2)       # bits
KL = entropy(p, q)                    # KL divergence D_KL(p||q)

# ── Correlation ───────────────────────────────
np.corrcoef(x, y)[0,1]               # Pearson correlation
stats.pearsonr(x, y)                 # + p-value
stats.spearmanr(x, y)                # rank correlation

# ── Tests ─────────────────────────────────────
stats.ttest_ind(a, b)                # independent t-test
stats.ttest_rel(a, b)                # paired t-test
stats.kstest(x, 'norm')              # normality test
stats.chi2_contingency(table)        # chi-squared test
```