Example: Roll a fair die. Ω={1,2,3,4,5,6}. Event A = "even" = {2,4,6}. P(A)=3/6=0.5.
2. Conditional Probability
P(A∣B) = P(A∩B) / P(B), provided P(B) > 0
"The probability of A given that we know B has occurred."
Example: A patient tests positive for a disease. What is the probability they actually have it? That's P(disease∣positive) — not the same as P(positive∣disease). Confusing these is a classic error (see Bayes' theorem below).
Multiplication rule
P(A∩B)=P(A∣B)P(B)=P(B∣A)P(A)
Independence
A and B are independent if knowing B tells you nothing about A:
P(A∣B)=P(A)⇔P(A∩B)=P(A)P(B)
[!WARNING]
Independence is an assumption, not something you observe directly. The i.i.d. assumption in ML (samples are independent and identically distributed) is an assumption about how data was collected — it's often approximately true but rarely exactly true.
3. Bayes' Theorem
P(A∣B) = P(B∣A)P(A) / P(B)
In ML language: posterior = (likelihood × prior) / evidence.
P(θ∣D) = P(D∣θ)P(θ) / P(D)
P(θ) — prior: what we believe about parameters before seeing data
P(D∣θ) — likelihood: how probable is the data under these parameters
P(D) — evidence (normalizing constant, often intractable)
P(θ∣D) — posterior: updated belief after seeing data
Law of Total Probability
If B1,…,Bk partition Ω (mutually exclusive, exhaustive):
P(A) = ∑_{i=1}^{k} P(A∣Bi) P(Bi)
Used to compute P(D) in the denominator of Bayes.
Classic example — Medical test
A disease affects 1% of the population. A test is 95% sensitive (true positive rate) and 95% specific (true negative rate). You test positive. What's the probability you have the disease?
P(disease)=0.01 (prior)
P(pos∣disease)=0.95 (sensitivity)
P(pos∣no disease)=0.05 (false positive rate)
P(pos)=0.95×0.01+0.05×0.99=0.0095+0.0495=0.059
P(disease∣pos) = (0.95 × 0.01) / 0.059 ≈ 0.161 ≈ 16%
Despite a 95% accurate test, a positive result only means 16% chance of disease. This is why rare-disease screening is hard — base rates dominate.
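The arithmetic above can be checked with a short script (plain Python; the numbers are the ones from the example):

```python
prior = 0.01        # P(disease)
sensitivity = 0.95  # P(pos | disease)
false_pos = 0.05    # P(pos | no disease) = 1 - specificity

p_pos = sensitivity * prior + false_pos * (1 - prior)  # law of total probability
posterior = sensitivity * prior / p_pos                # Bayes' theorem
print(p_pos, posterior)  # 0.059, ≈ 0.161
```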
4. Random Variables
A random variable X is a function X:Ω→R mapping outcomes to numbers.
Discrete random variable
Takes values in a countable set. Described by a probability mass function (PMF):
P(X=x) = p(x), with ∑_x p(x) = 1
Example — Bernoulli(p):
P(X=1) = p, P(X=0) = 1−p
Used for: binary outcomes (spam/not spam, click/no click).
Example — Binomial(n,p):
P(X=k) = C(n,k) p^k (1−p)^{n−k}, where C(n,k) = n!/(k!(n−k)!) is the binomial coefficient
Number of successes in n independent Bernoulli trials. If I send 100 emails and each has a 3% click rate, the number of clicks is Binomial(100, 0.03).
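The email example as a quick scipy.stats check:

```python
from scipy import stats

# 100 emails, each clicked independently with probability 0.03
clicks = stats.binom(n=100, p=0.03)
print(clicks.mean(), clicks.var())  # np = 3.0, np(1-p) = 2.91
print(clicks.pmf(0))                # P(zero clicks) = 0.97**100 ≈ 0.048
```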
Continuous random variable
Takes values in R (or an interval). Described by a probability density function (PDF) f(x):
P(a ≤ X ≤ b) = ∫_a^b f(x) dx, with ∫_{−∞}^{∞} f(x) dx = 1
Note: P(X=x)=0 for any single point — probability lives in intervals.
Cumulative distribution function (CDF):
F(x) = P(X ≤ x) = ∫_{−∞}^x f(t) dt, and f(x) = F′(x)
5. Expectation
The expected value (mean) of X:
E[X] = ∑_x x p(x) (discrete), E[X] = ∫_{−∞}^{∞} x f(x) dx (continuous)
Think of it as a weighted average of all possible values, weighted by their probability.
Linearity of expectation — always true, regardless of dependence:
E[aX+bY]=aE[X]+bE[Y]
Example: Roll two dice, S = sum. E[S]=E[D1]+E[D2]=3.5+3.5=7. No need to enumerate all 36 outcomes.
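The dice example can be sanity-checked by simulation (a minimal numpy sketch; the seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d1 = rng.integers(1, 7, size=100_000)  # die 1: uniform on {1, ..., 6}
d2 = rng.integers(1, 7, size=100_000)  # die 2
print((d1 + d2).mean())  # ≈ E[D1] + E[D2] = 7
```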
Expectation of a function:
E[g(X)] = ∑_x g(x) p(x)
[!WARNING]
E[g(X)] ≠ g(E[X]) in general. By Jensen's inequality, for convex g, E[g(X)] ≥ g(E[X]). Relevant in variational inference, log-likelihood bounds.
6. Variance and Standard Deviation
Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²
The second form is often easier to compute. Standard deviation: σ = √Var(X).
Properties:
Var(aX+b) = a²Var(X)
Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y)
If X,Y independent: Var(X+Y)=Var(X)+Var(Y)
Covariance:
Cov(X,Y)=E[(X−E[X])(Y−E[Y])]=E[XY]−E[X]E[Y]
Correlation (normalized covariance, scale-free):
ρ(X,Y) = Cov(X,Y) / (σ_X σ_Y) ∈ [−1,1]
[!NOTE]
ρ=0 means uncorrelated, not independent. Independence implies uncorrelated, but not vice versa. If X∼N(0,1) and Y=X2, then Cov(X,Y)=0 but they are clearly dependent.
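The X ∼ N(0,1), Y = X² case is easy to verify numerically (a numpy sketch with an arbitrary seed):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x**2  # Y is a deterministic function of X, so clearly dependent

# Cov(X, X^2) = E[X^3] = 0 for a symmetric distribution
print(np.cov(x, y)[0, 1])        # ≈ 0
print(np.corrcoef(x, y)[0, 1])   # ≈ 0, yet X and Y are not independent
```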
7. Common Distributions
Uniform — Uniform(a,b)
f(x) = 1/(b−a) for x ∈ [a,b]
E[X] = (a+b)/2, Var(X) = (b−a)²/12
Used for: random initialization, random search, Monte Carlo sampling.
Gaussian (Normal) — N(μ,σ2)
f(x) = (1/(σ√(2π))) exp(−(x−μ)²/(2σ²))
E[X] = μ, Var(X) = σ²
The standard normal is N(0,1), denoted Z. If X ∼ N(μ,σ²) then Z = (X−μ)/σ ∼ N(0,1).
The 68-95-99.7 rule:
P(μ−σ≤X≤μ+σ)≈68%
P(μ−2σ≤X≤μ+2σ)≈95%
P(μ−3σ≤X≤μ+3σ)≈99.7%
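The 68-95-99.7 rule follows directly from the standard normal CDF (a quick check with scipy.stats):

```python
from scipy import stats

z = stats.norm(0, 1)
for k in (1, 2, 3):
    # P(μ - kσ ≤ X ≤ μ + kσ) for any Gaussian equals P(-k ≤ Z ≤ k)
    print(k, z.cdf(k) - z.cdf(-k))
```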
Why Gaussian dominates ML:
Central Limit Theorem — sum of many i.i.d. variables converges to Gaussian
Mathematically convenient — closed under linear transformations
Maximum entropy distribution given fixed mean and variance
Assumed in linear regression (residuals ∼N(0,σ2))
Multivariate Gaussian N(μ,Σ):
f(x) = (2π)^{−d/2} |Σ|^{−1/2} exp(−(1/2)(x−μ)^T Σ^{−1} (x−μ))
μ∈Rd — mean vector
Σ∈Rd×d — covariance matrix (symmetric, PSD)
The exponent (x−μ)^T Σ^{−1} (x−μ) is the squared Mahalanobis distance: it measures distance in units of standard deviations, accounting for correlations
Bernoulli — Bern(p)
P(X=1) = p, P(X=0) = 1−p, E[X] = p, Var(X) = p(1−p)
Used as the output distribution for binary classification. The sigmoid function σ(z)=1/(1+e−z) maps logistic regression output to a valid p.
Categorical — Cat(π)
Generalization of Bernoulli to K classes. P(X=k)=πk, ∑kπk=1.
The softmax maps logit vectors to valid π:
πk = e^{zk} / ∑_{j=1}^{K} e^{zj}
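A minimal softmax sketch (the max-subtraction trick is a standard numerical-stability measure, not part of the formula):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtracting max(z) avoids overflow, result unchanged
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())  # a valid π: nonnegative, sums to 1
```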
Poisson — Pois(λ)
P(X=k) = λ^k e^{−λ} / k!, for k = 0, 1, 2, …
E[X] = λ, Var(X) = λ
Used for: count data (number of events in a fixed time window). Emails per hour, bugs per 1000 lines of code, word counts in documents.
Exponential — Exp(λ)
f(x) = λe^{−λx} for x ≥ 0, E[X] = 1/λ, Var(X) = 1/λ²
Models waiting times between Poisson events. Memoryless: P(X>s+t∣X>s)=P(X>t).
Beta — Beta(α,β)
f(x) = x^{α−1}(1−x)^{β−1} / B(α,β) for x ∈ [0,1]
E[X] = α/(α+β), Var(X) = αβ / ((α+β)²(α+β+1))
Lives on [0,1], used as a prior for probabilities (conjugate prior to Bernoulli/Binomial). Intuition: α−1 = number of observed successes, β−1 = failures.
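A sketch of the conjugacy claim: with a Beta(α, β) prior and Bernoulli data, the posterior is Beta(α + successes, β + failures). The prior and counts below are made up for illustration.

```python
from scipy import stats

a, b = 2, 2                 # hypothetical prior Beta(2, 2)
successes, failures = 7, 3  # hypothetical data: 7 heads in 10 flips
posterior = stats.beta(a + successes, b + failures)  # Beta(9, 5)
print(posterior.mean())     # (a + successes) / (a + b + successes + failures) = 9/14
```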
8. Maximum Likelihood Estimation (MLE)
Given data D={x(1),…,x(n)} drawn i.i.d. from p(x;θ), the likelihood is:
L(θ) = ∏_{i=1}^{n} p(x^{(i)}; θ)
MLE finds the parameter that makes the observed data most probable:
θ̂_MLE = argmax_θ L(θ) = argmax_θ ∑_{i=1}^{n} log p(x^{(i)}; θ)
The log-likelihood ℓ(θ) = log L(θ) is used because:
Products become sums (easier)
More numerically stable (no underflow)
Maximizing log is equivalent to maximizing the original (log is monotone)
Example — MLE for Gaussian
Given x(1),…,x(n)∼N(μ,σ2):
ℓ(μ,σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) ∑_{i=1}^{n} (x^{(i)} − μ)²
Setting ∂ℓ/∂μ=0:
μ̂_MLE = (1/n) ∑_{i=1}^{n} x^{(i)} = x̄
Setting ∂ℓ/∂σ2=0:
σ̂²_MLE = (1/n) ∑_{i=1}^{n} (x^{(i)} − x̄)²
[!NOTE]
This gives 1/n not 1/(n−1). The MLE estimate of variance is biased (underestimates). The unbiased sample variance uses n−1 (Bessel's correction). In practice the difference is negligible for large n.
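The closed-form MLE can be checked against simulated data (numpy sketch; the true parameters μ = 5, σ² = 4 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=100_000)  # true μ = 5, σ² = 4

mu_mle = x.mean()                      # (1/n) Σ x_i
var_mle = ((x - mu_mle) ** 2).mean()   # (1/n) Σ (x_i - x̄)², the biased MLE
var_unbiased = x.var(ddof=1)           # 1/(n-1) version (Bessel's correction)
print(mu_mle, var_mle, var_unbiased)
```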
MLE ↔ Loss functions
| Model assumption | Negative log-likelihood | Standard name |
| --- | --- | --- |
| y∣x ∼ N(ŷ, σ²) | ∥y−ŷ∥² | MSE loss |
| y∣x ∼ Bern(p̂) | −y log p̂ − (1−y) log(1−p̂) | Binary cross-entropy |
| y∣x ∼ Cat(π) | −∑_k yk log πk | Cross-entropy |
| y∣x ∼ Laplace(ŷ, b) | ∥y−ŷ∥₁ | MAE loss |
This is the key insight: choosing a loss function is equivalent to choosing a probabilistic model for your noise. MSE assumes Gaussian residuals. MAE assumes Laplacian (heavier tails, more robust to outliers).
Entropy
H(X) = −∑_x p(x) log p(x)
Measures uncertainty or average surprise. Units: bits (log base 2) or nats (natural log).
Uniform distribution over K classes: H=logK (maximum entropy)
Point mass (certain outcome): H=0 (minimum entropy)
Example: Fair coin: H = −0.5 log₂ 0.5 − 0.5 log₂ 0.5 = 1 bit. Biased coin with p = 0.9: H ≈ 0.47 bits, less uncertainty.
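The coin examples can be computed with a small helper (numpy sketch; entropy_bits is a hypothetical name):

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # convention: 0 log 0 = 0
    return -(p * np.log2(p)).sum()

print(entropy_bits([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy_bits([0.9, 0.1]))  # biased coin: ≈ 0.47 bits
```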
KL Divergence
Measures how different distribution Q is from reference P:
DKL(P∥Q) = ∑_x p(x) log(p(x)/q(x)) = E_P[log(p(X)/q(X))]
Properties:
DKL(P∥Q)≥0 always (Gibbs inequality)
DKL(P∥Q)=0⟺P=Q
Not symmetric: DKL(P∥Q) ≠ DKL(Q∥P) in general
[!NOTE]
DKL(P∥Q) is "forward KL" — it's zero-avoiding (forces q>0 wherever p>0, so the approximation covers all modes). DKL(Q∥P) is "reverse KL" — zero-forcing (allows q=0 where p>0, causes mode-seeking behavior). Both appear in variational inference and generative models.
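The asymmetry is easy to see numerically (a discrete sketch; assumes both distributions have full support so there is no division by zero):

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return (p * np.log(p / q)).sum()  # in nats; assumes p, q > 0 everywhere

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q), kl(q, p))  # two different nonnegative numbers
```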
Cross-Entropy
H(P,Q) = −∑_x p(x) log q(x) = H(P) + DKL(P∥Q)
When P is the true distribution (one-hot labels) and Q is the model output (softmax probabilities), minimizing cross-entropy = minimizing KL divergence from true to predicted = MLE.
```python
import torch.nn.functional as F

# cross_entropy expects raw logits; it applies log-softmax internally
loss = F.cross_entropy(logits, targets)  # standard for classification
```
Mutual Information
How much does knowing X tell you about Y?
I(X;Y) = DKL(P_{XY} ∥ P_X P_Y) = H(X) − H(X∣Y) = H(Y) − H(Y∣X)
I(X;Y)=0 iff X and Y are independent. Used in feature selection (pick features that maximize mutual information with the label).
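For a small discrete joint distribution, I(X;Y) can be computed directly from its definition as a KL divergence (numpy sketch with a made-up 2×2 joint):

```python
import numpy as np

pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])          # hypothetical joint P(X, Y)
px = pxy.sum(axis=1, keepdims=True)   # marginal P(X)
py = pxy.sum(axis=0, keepdims=True)   # marginal P(Y)

mi = (pxy * np.log2(pxy / (px * py))).sum()  # I(X;Y) in bits
print(mi)  # > 0: X and Y are dependent

indep = px * py                       # a product joint gives I = 0
print((indep * np.log2(indep / (px * py))).sum())
```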
11. Estimators and Their Properties
An estimator θ̂ is a function of data used to estimate a parameter θ.
Bias: Bias(θ̂) = E[θ̂] − θ
Unbiased: E[θ^]=θ. Sample mean xˉ is unbiased for μ. Sample variance with n−1 is unbiased for σ2.
Variance of the estimator: Var(θ̂) = E[(θ̂ − E[θ̂])²]
Mean Squared Error: MSE(θ̂) = Bias(θ̂)² + Var(θ̂)
This is the bias-variance tradeoff in disguise. A slightly biased estimator with much lower variance can have lower MSE — sometimes paying a little bias to get a lot of variance reduction is worth it. Ridge regression does exactly this.
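A simulation of that tradeoff, using the two variance estimators from the MLE section (numpy sketch; the true variance and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0
samples = rng.normal(0.0, 2.0, size=(200_000, 10))  # many datasets of n = 10

v_biased = samples.var(axis=1, ddof=0)    # MLE: divides by n, biased low
v_unbiased = samples.var(axis=1, ddof=1)  # Bessel: divides by n - 1, unbiased

for name, v in (("1/n", v_biased), ("1/(n-1)", v_unbiased)):
    bias = v.mean() - true_var
    mse = ((v - true_var) ** 2).mean()
    print(name, bias, mse)  # the biased estimator has the lower MSE here
```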
Consistency: θ̂_n → θ in probability as n → ∞. MLE estimators are typically consistent.
12. Central Limit Theorem
Let X1,…,Xn be i.i.d. with mean μ and variance σ2. Then:
√n (X̄_n − μ) → N(0, σ²) in distribution
Or equivalently:
X̄_n ≈ N(μ, σ²/n)
Why it matters: No matter what distribution your data comes from, the sample mean is approximately normally distributed for large n. This is why Gaussian assumptions work so well in practice.
Standard error of the mean: SE = σ/√n. Collecting 4× more data halves the standard error.
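A CLT sketch with deliberately non-Gaussian data (the exponential is heavily skewed; seed and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# 10,000 datasets of n = 100 draws from Exp(1): μ = 1, σ = 1
means = rng.exponential(scale=1.0, size=(10_000, 100)).mean(axis=1)
print(means.mean())  # ≈ μ = 1
print(means.std())   # ≈ σ/√n = 0.1
```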
13. Hypothesis Testing (just enough for ML)
A p-value is P(data this extreme or more ∣ H0). It is not P(H0 ∣ data).
Common tests in ML practice:
| Test | When to use |
| --- | --- |
| t-test | Compare means of two models/groups |
| Paired t-test | Compare two models on the same test sets |
| McNemar's test | Compare two classifiers on the same examples |
| Kolmogorov-Smirnov | Check if a sample came from a given distribution |
```python
from scipy import stats

# Are model A and model B accuracies significantly different?
t_stat, p_value = stats.ttest_rel(accuracies_A, accuracies_B)
print(f"p = {p_value:.4f}")  # if p < 0.05, difference is "significant"
```
[!WARNING]
Statistical significance is not practical significance. With enough data, tiny useless differences become "significant" (p<0.05). Always report effect size alongside p-values.
14. Covariance Matrix in Detail
For a random vector x∈Rd:
Σ=Cov(x)=E[(x−μ)(x−μ)T]
Σij=Cov(xi,xj)
Properties: symmetric, positive semi-definite, diagonal entries are variances.
Sample covariance from data matrix X∈Rn×d (zero-mean):
Σ̂ = (1/(n−1)) X^T X
The correlation matrix is the normalized version:
Rij = Σij / √(Σii Σjj)
```python
import numpy as np

# Empirical covariance
n = X.shape[0]
X_centered = X - X.mean(axis=0)
Sigma = (X_centered.T @ X_centered) / (n - 1)

# Or just:
Sigma = np.cov(X.T)       # X is n×d, np.cov wants d×n
corr = np.corrcoef(X.T)   # correlation matrix
```
Geometric interpretation: The covariance matrix defines an ellipse (in 2D) or ellipsoid (in dD). Its eigenvectors are the axes of the ellipsoid; eigenvalues are the squared semi-axis lengths. PCA finds these axes.
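A sketch of that geometric picture: eigendecompose an empirical covariance and read off the PCA axes (numpy; the true covariance below is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3, 1], [1, 1]], size=50_000)

Sigma = np.cov(X.T)
eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
print(eigvals)         # variances along the principal axes (≈ 2 ∓ √2 here)
print(eigvecs[:, -1])  # first principal component: direction of largest variance
```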