Information Theory

Entropy, KL divergence, cross-entropy — used in loss functions

1. Why Information Theory Matters for ML

Information theory, invented by Claude Shannon in 1948, answers one question: how do we quantify information, uncertainty, and communication? It turns out this is exactly what ML needs:

  • Loss functions (cross-entropy, KL divergence) are information-theoretic quantities
  • Regularization has an information-theoretic interpretation
  • Generalization bounds use mutual information
  • Compression and representation learning are the same problem
  • The entire field of variational inference is applied information theory

The core idea: rare events carry more information than common ones. "The sun rose today" tells you nothing. "It snowed in July in Bologna" tells you a lot.


2. Self-Information (Surprisal)

The information content of a single outcome $x$:

$$I(x) = -\log p(x)$$

Units depend on the base:

  • Base 2 → bits (most common in theory)
  • Base $e$ → nats (most common in ML, since $\nabla \log$ is cleaner)
  • Base 10 → hartleys (rare)

Properties:

  • $p(x) = 1 \Rightarrow I(x) = 0$ — certain events carry zero information
  • $p(x) \to 0 \Rightarrow I(x) \to \infty$ — impossible events are infinitely surprising
  • Additivity: if $x$ and $y$ are independent, $I(x, y) = I(x) + I(y)$

This additivity property is the key motivation. We want information to add up for independent events, and $-\log(p(x)p(y)) = -\log p(x) - \log p(y)$.

Example: A fair coin. $I(\text{heads}) = -\log_2(0.5) = 1$ bit. A biased coin with $p = 0.01$: $I(\text{heads}) = -\log_2(0.01) \approx 6.64$ bits — landing heads is very surprising.

Example: In a vocabulary of 50,000 words, the common word "the" has probability $\approx 0.07$, giving $I \approx 3.8$ bits. A rare word like "ephemeral" at $p \approx 0.00001$ gives $I \approx 16.6$ bits.
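
As a quick sketch, the surprisal values above can be computed directly (base-2 logs; the probabilities are the illustrative ones from the examples):

```python
import numpy as np

def surprisal(p, base=2):
    """Self-information -log_base p(x)."""
    return -np.log(p) / np.log(base)

print(surprisal(0.5))    # fair coin: 1.0 bit
print(surprisal(0.01))   # biased coin: ~6.64 bits
print(surprisal(0.07))   # "the": ~3.84 bits
```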


3. Shannon Entropy

The entropy of a discrete distribution $P$ over $\mathcal{X}$:

$$H(X) = \mathbb{E}[I(X)] = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$

It is the expected surprisal — the average amount of information per outcome. Also interpretable as the minimum average number of bits needed to encode outcomes from $P$.

Convention: $0 \log 0 = 0$ (by continuity, since $x \log x \to 0$ as $x \to 0$).

Properties

Non-negativity: $H(X) \geq 0$ with equality iff $X$ is deterministic (all probability on one outcome).

Maximum entropy: $H(X) \leq \log |\mathcal{X}|$ with equality iff $P$ is uniform. Uniform = maximum uncertainty = maximum entropy.

Concavity: $H$ is a concave function of the probability vector $\mathbf{p}$.

Example — Binary entropy function

For a Bernoulli($p$) variable:

$$H_b(p) = -p\log_2 p - (1-p)\log_2(1-p)$$

| $p$ | $H_b(p)$ (bits) | Interpretation |
|-----|-----------------|----------------|
| 0.0 | 0 | Certain — no information |
| 0.1 | 0.469 | Mostly predictable |
| 0.3 | 0.881 | Some uncertainty |
| 0.5 | 1.0 | Maximum uncertainty for binary |
| 0.7 | 0.881 | Symmetric with 0.3 |
| 1.0 | 0 | Certain — no information |

The function is symmetric around $p = 0.5$ and peaks at exactly 1 bit.
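
A short numeric check of the table (a sketch using numpy):

```python
import numpy as np

def h_binary(p):
    """Binary entropy in bits, with the 0 log 0 = 0 convention."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    m = (p > 0) & (p < 1)
    out[m] = -p[m] * np.log2(p[m]) - (1 - p[m]) * np.log2(1 - p[m])
    return out

vals = h_binary([0.0, 0.1, 0.3, 0.5, 0.7, 1.0])
print(vals)   # [0, 0.469, 0.881, 1.0, 0.881, 0]: symmetric, peaks at p = 0.5
```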

Entropy of common distributions

| Distribution | Entropy |
|--------------|---------|
| Bernoulli($p$) | $-p\log p - (1-p)\log(1-p)$ |
| Categorical($\boldsymbol{\pi}$, $K$ classes) | $-\sum_{k=1}^K \pi_k \log \pi_k$ |
| Uniform($K$ outcomes) | $\log K$ |
| $\mathcal{N}(\mu, \sigma^2)$ | $\frac{1}{2}\log(2\pi e \sigma^2)$ |
| $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ in $\mathbb{R}^d$ | $\frac{1}{2}\log\left[(2\pi e)^d \det\Sigma\right]$ |
| Exponential($\lambda$) | $1 - \log \lambda$ |

[!MATH] The Gaussian $\mathcal{N}(\mu, \sigma^2)$ has the maximum entropy of all distributions with fixed mean $\mu$ and variance $\sigma^2$. This is a deep result — if all you know about a distribution is its mean and variance, the least-committal (most honest) choice is Gaussian.


4. Differential Entropy

For a continuous random variable with density $f(x)$:

$$h(X) = -\int_{-\infty}^\infty f(x) \log f(x)\, dx$$

[!WARNING] Differential entropy is not the continuous analog of discrete entropy in every sense. It can be negative. For example, $\text{Uniform}(0, a)$ has $h = \log a$, which is negative when $a < 1$. It is not invariant to reparameterization. What matters is differences of differential entropy (like KL divergence), which are well-behaved.

It still serves the same intuitive role — it measures the spread/uncertainty of a continuous distribution — and most formulas involving entropy still hold when you use $h$.
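
A sketch of both points: a Monte Carlo estimate of $h$ for a standard Gaussian against the closed form from the table above, and the sign flip for a narrow uniform:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
x = rng.normal(0, sigma, 500_000)

# Monte Carlo estimate of h = E[-log f(X)] for the Gaussian density f
neg_log_f = 0.5 * np.log(2 * np.pi * sigma**2) + x**2 / (2 * sigma**2)
h_mc = neg_log_f.mean()
h_exact = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(h_mc, h_exact)              # both ≈ 1.419 nats

# Uniform(0, a) has h = log(a): negative once a < 1
for a in (2.0, 0.5):
    print(a, np.log(a))
```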


5. Joint and Conditional Entropy

Joint entropy

$$H(X, Y) = -\sum_{x,y} p(x,y) \log p(x,y)$$

The uncertainty in the joint outcome. Always:

$$H(X, Y) \leq H(X) + H(Y)$$

with equality iff $X$ and $Y$ are independent. Knowing $Y$ can only reduce or leave unchanged the uncertainty in $X$.

Conditional entropy

The uncertainty in $X$ given that we know $Y$:

$$H(X \mid Y) = \sum_y p(y)\, H(X \mid Y = y) = -\sum_{x,y} p(x,y) \log p(x \mid y)$$

Also written as:

$$H(X \mid Y) = H(X, Y) - H(Y)$$

Conditioning reduces entropy: $H(X \mid Y) \leq H(X)$ with equality iff $X \perp Y$.

Chain rule for entropy

$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1})$$

The joint uncertainty = sum of conditional uncertainties added one variable at a time.
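
These identities are easy to verify numerically on a small joint table (the joint distribution below is made up for illustration):

```python
import numpy as np

def H(p):
    """Entropy in bits of a probability array (zeros dropped)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A made-up joint distribution p(x, y) over 2 x 3 outcomes
pxy = np.array([[0.2, 0.1, 0.1],
                [0.1, 0.3, 0.2]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

# H(Y|X) computed from the conditionals, then compared with the chain rule
H_cond = sum(px[i] * H(pxy[i] / px[i]) for i in range(len(px)))
print(np.isclose(H_cond, H(pxy) - H(px)))   # True: H(X,Y) = H(X) + H(Y|X)
print(H_cond <= H(py))                      # True: conditioning reduces entropy
print(H(pxy) <= H(px) + H(py))              # True: subadditivity
```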


6. KL Divergence

The Kullback-Leibler divergence from $Q$ to $P$:

$$D_{\text{KL}}(P \| Q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_P\!\left[\log \frac{p(X)}{q(X)}\right]$$

For continuous distributions:

$$D_{\text{KL}}(P \| Q) = \int f_P(x) \log \frac{f_P(x)}{f_Q(x)}\, dx$$

Read: "KL divergence from $Q$ to $P$" or "KL of $P$ with respect to $Q$."

Properties

Non-negativity (Gibbs inequality): $D_{\text{KL}}(P \| Q) \geq 0$ with equality iff $P = Q$.

Proof sketch: follows from the concavity of $\log$ and Jensen's inequality:

$$-D_{\text{KL}}(P \| Q) = \mathbb{E}_P\!\left[\log\frac{q(X)}{p(X)}\right] \leq \log \mathbb{E}_P\!\left[\frac{q(X)}{p(X)}\right] = \log 1 = 0$$

Asymmetry: $D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)$ in general.

This is not a distance in the mathematical sense — it violates symmetry and the triangle inequality.

Not defined if $Q = 0$ where $P > 0$: If $q(x) = 0$ but $p(x) > 0$, then $D_{\text{KL}} = \infty$. You are trying to encode events under $Q$ that $Q$ thinks are impossible. This forces approximate distributions to have full support.

Forward vs Reverse KL

This asymmetry has deep practical consequences:

Forward KL — $D_{\text{KL}}(P \| Q)$, minimized w.r.t. $Q$:

$$\min_Q D_{\text{KL}}(P \| Q) = \min_Q \mathbb{E}_P\!\left[\log\frac{p(X)}{q(X)}\right]$$

Called moment-matching or mean-seeking. $Q$ must cover all regions where $P$ has mass. If $P$ is multimodal, $Q$ spreads to cover all modes — resulting in a wide, blurry approximation.

Reverse KL — $D_{\text{KL}}(Q \| P)$, minimized w.r.t. $Q$:

$$\min_Q D_{\text{KL}}(Q \| P) = \min_Q \mathbb{E}_Q\!\left[\log\frac{q(X)}{p(X)}\right]$$

Called information projection or mode-seeking. $Q$ concentrates on one mode of $P$ rather than covering all of them. If $P$ is multimodal, $Q$ collapses to a single peak.

[!NOTE] Variational inference (VI) typically minimizes reverse KL $D_{\text{KL}}(Q \| P)$ because we can compute $\mathbb{E}_Q[\cdot]$ by sampling from $Q$ (which we control), but we can't sample from $P$ (which is the posterior we're trying to approximate). This is why VI tends to underestimate posterior variance.
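
The mean-seeking vs mode-seeking behavior can be seen with a crude grid search over single-Gaussian approximations $Q$ to a discretized bimodal $P$ (the mixture parameters and grids below are chosen purely for illustration):

```python
import numpy as np

x = np.linspace(-10, 10, 2001)

def normal(x, mu, sig):
    return np.exp(-(x - mu)**2 / (2 * sig**2)) / (sig * np.sqrt(2 * np.pi))

# Discretized bimodal target P: equal mixture of N(-3, 0.7^2) and N(3, 0.7^2)
p = 0.5 * normal(x, -3, 0.7) + 0.5 * normal(x, 3, 0.7)
p /= p.sum()

def kl(a, b):
    m = a > 0
    b = np.maximum(b, 1e-300)   # avoid log(0); missing mass is heavily penalized
    return np.sum(a[m] * np.log(a[m] / b[m]))

best_fwd = best_rev = (np.inf, 0.0, 1.0)
for mu in np.linspace(-5, 5, 101):
    for sig in np.linspace(0.3, 5, 48):
        q = normal(x, mu, sig)
        q /= q.sum()
        best_fwd = min(best_fwd, (kl(p, q), mu, sig))
        best_rev = min(best_rev, (kl(q, p), mu, sig))

print("forward KL argmin:", best_fwd[1:])   # mu ≈ 0, large sigma: covers both modes
print("reverse KL argmin:", best_rev[1:])   # mu ≈ ±3, small sigma: picks one mode
```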

KL for Gaussians

Closed form for $P = \mathcal{N}(\mu_1, \sigma_1^2)$ and $Q = \mathcal{N}(\mu_2, \sigma_2^2)$:

$$D_{\text{KL}}(P \| Q) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

For multivariate $P = \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1)$, $Q = \mathcal{N}(\boldsymbol{\mu}_2, \Sigma_2)$:

$$D_{\text{KL}}(P \| Q) = \frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} - d + \text{tr}(\Sigma_2^{-1}\Sigma_1) + (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \Sigma_2^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)\right]$$

This formula appears in the VAE loss, in Gaussian process inference, and in natural gradient methods.

```python
from scipy.special import rel_entr
import numpy as np

# KL divergence for discrete distributions
def kl_divergence(p, q):
    p, q = np.array(p), np.array(q)
    return np.sum(rel_entr(p, q))  # handles 0*log(0)=0, inf where q=0 and p>0

p = [0.4, 0.4, 0.2]
q = [0.33, 0.33, 0.34]
print(kl_divergence(p, q))   # → 0.0478 nats

# KL for Gaussians
def kl_gaussian(mu1, sig1, mu2, sig2):
    return (np.log(sig2/sig1)
            + (sig1**2 + (mu1-mu2)**2) / (2*sig2**2)
            - 0.5)

print(kl_gaussian(0, 1, 0, 2))  # N(0,1) || N(0,4) → 0.318 nats
print(kl_gaussian(0, 2, 0, 1))  # N(0,4) || N(0,1) → 0.807 nats — asymmetric!
```

7. Cross-Entropy

$$H(P, Q) = -\sum_x p(x) \log q(x) = H(P) + D_{\text{KL}}(P \| Q)$$

Interpretation: the average number of bits needed to encode events from $P$ using a code optimized for $Q$.

Since $H(P)$ is fixed when $P$ is fixed (true labels), minimizing cross-entropy is equivalent to minimizing $D_{\text{KL}}(P \| Q)$.

Cross-entropy as a loss function

In classification with $K$ classes:

$$\mathcal{L} = -\sum_{k=1}^K y_k \log \hat{p}_k$$

For one-hot labels (only one $y_k = 1$):

$$\mathcal{L} = -\log \hat{p}_{y}$$

This is the negative log-probability of the correct class — just a single term survives. The loss is small when the model assigns high probability to the right class, large when it doesn't.

For binary classification (sigmoid output):

$$\mathcal{L} = -y\log\hat{p} - (1-y)\log(1-\hat{p})$$

[!MATH] Why not MSE for classification? MSE assumes Gaussian noise around the true label. For a binary label, Gaussian doesn't make sense — the output is a probability in $[0,1]$. Cross-entropy corresponds to assuming Bernoulli noise, which is the right model. Cross-entropy also penalizes confident wrong predictions much more severely than MSE does: if $y=1$ and $\hat{p}=0.01$, cross-entropy gives $-\log(0.01) = 4.6$, while MSE gives $(1-0.01)^2 = 0.98$. Cross-entropy creates stronger gradient signals for badly wrong predictions.
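
The numbers in the callout, plus the gradients with respect to $\hat{p}$, as a quick sketch:

```python
import numpy as np

# Confidently wrong prediction: y = 1, p_hat = 0.01
y, p_hat = 1.0, 0.01

ce  = -np.log(p_hat)          # ≈ 4.61
mse = (y - p_hat)**2          # ≈ 0.98
print(ce, mse)

# Gradients w.r.t. p_hat: the CE gradient blows up as p_hat -> 0
grad_ce  = -1.0 / p_hat           # d/dp[-log p]    = -100
grad_mse = -2.0 * (y - p_hat)     # d/dp[(y - p)^2] ≈ -1.98
print(grad_ce, grad_mse)
```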

Label smoothing

Hard one-hot targets have infinite cross-entropy loss if the model ever assigns zero probability to the correct class. Label smoothing replaces one-hot with a soft target:

$$\tilde{y}_k = (1 - \epsilon) y_k + \frac{\epsilon}{K}$$

Typically $\epsilon = 0.1$. This prevents overconfidence and improves calibration.
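
A minimal sketch of label smoothing (the class count and probabilities below are illustrative):

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    """(n, K) one-hot targets -> soft targets."""
    K = y_onehot.shape[1]
    return (1 - eps) * y_onehot + eps / K

y = np.eye(4)[[2]]                    # one-hot for class 2, K = 4
print(smooth_labels(y))               # [[0.025 0.025 0.925 0.025]]

def ce(targets, probs):
    return -np.sum(targets * np.log(probs), axis=1)

probs = np.array([[0.01, 0.01, 0.97, 0.01]])
print(ce(y, probs))                   # hard target: only the true class matters
print(ce(smooth_labels(y), probs))    # soft target never demands p = 1 exactly
```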


8. Mutual Information

How much does knowing $X$ reduce uncertainty about $Y$?

$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$

Equivalently:

$$I(X; Y) = D_{\text{KL}}(P_{XY} \| P_X P_Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$$

Properties:

  • $I(X;Y) \geq 0$ — knowing $Y$ can only help or be useless, never hurt
  • $I(X;Y) = 0 \iff X \perp Y$
  • $I(X;Y) = I(Y;X)$ — symmetric (unlike KL)
  • $I(X;X) = H(X)$ — a variable tells you everything about itself

Information diagram (Venn diagram of entropies)

   ┌─────────────────────────────────┐
   │         H(X,Y)                 │
   │  ┌──────────────┐              │
   │  │     H(X)     │  H(Y|X)      │
   │  │  H(X|Y) ┌────┼────┐         │
   │  │         │I(X;Y)   │         │
   │  └─────────┼────┘    │         │
   │            │   H(Y)  │         │
   │            └─────────┘         │
   └─────────────────────────────────┘

$$H(X,Y) = H(X) + H(Y) - I(X;Y)$$

$$I(X;Y) = H(X) + H(Y) - H(X,Y)$$

Conditional mutual information

$$I(X;Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z)$$

Measures how much $Y$ tells us about $X$ when we already know $Z$.

Data processing inequality: If $X \to Y \to Z$ is a Markov chain ($Z$ depends on $X$ only through $Y$):

$$I(X; Z) \leq I(X; Y)$$

Processing data can only destroy information, never create it. This is why features should be computed carefully — every deterministic transformation of the input can only lose information about the target.
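
A small check of the data processing inequality using binary symmetric channels (the flip probabilities 0.1 and 0.2 are arbitrary illustrative choices):

```python
import numpy as np

def mi(pxy):
    """Mutual information in nats from a joint probability table."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    m = pxy > 0
    return np.sum(pxy[m] * np.log((pxy / (px * py))[m]))

def bsc(eps):
    """Binary symmetric channel: p(y|x), rows indexed by x."""
    return np.array([[1 - eps, eps], [eps, 1 - eps]])

px = np.array([0.5, 0.5])
A, B = bsc(0.1), bsc(0.2)

pxy = px[:, None] * A          # joint p(x, y)
pxz = px[:, None] * (A @ B)    # joint p(x, z): Z sees X only through Y

print(mi(pxy), mi(pxz))        # I(X;Y) > I(X;Z): processing lost information
```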

Mutual information in ML

Feature selection: Pick the feature $X_j$ that maximizes $I(X_j; Y)$ where $Y$ is the label.

Representation learning: Learn a representation $Z = f(X)$ that maximizes $I(Z; Y)$ (retains task-relevant info) while minimizing $I(Z; X)$ (discards irrelevant info). This is the Information Bottleneck principle.

InfoNCE / contrastive learning: Modern self-supervised methods like SimCLR and CLIP maximize a lower bound on mutual information between views of the same data:

$$I(X; Y) \geq \mathbb{E}\left[\log \frac{e^{f(x)^T f(y)}}{\frac{1}{N}\sum_{j=1}^N e^{f(x)^T f(y_j)}}\right]$$

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Mutual information between each feature and the target
mi_scores = mutual_info_classif(X_train, y_train)
top_features = np.argsort(mi_scores)[::-1][:10]
```

9. Jensen-Shannon Divergence

A symmetric, bounded version of KL divergence:

$$\text{JSD}(P \| Q) = \frac{1}{2} D_{\text{KL}}(P \| M) + \frac{1}{2} D_{\text{KL}}(Q \| M), \qquad M = \frac{P+Q}{2}$$

Properties:

  • $0 \leq \text{JSD}(P \| Q) \leq \log 2$
  • Symmetric: $\text{JSD}(P \| Q) = \text{JSD}(Q \| P)$
  • $\sqrt{\text{JSD}}$ is an actual metric (triangle inequality holds)
  • Equal to 0 iff $P = Q$, equal to $\log 2$ iff $P$ and $Q$ have disjoint supports

Used in the original GAN formulation — with an optimal discriminator, the generator implicitly minimizes JSD between the real and generated distributions:

$$\min_G \max_D V(D,G) = 2\,\text{JSD}(P_{\text{data}} \| P_G) - \log 4$$


10. Fisher Information

The Fisher information measures how much information the observable data $X$ carries about a parameter $\theta$:

$$\mathcal{I}(\theta) = \mathbb{E}\!\left[\left(\frac{\partial}{\partial\theta} \log p(X;\theta)\right)^2\right] = -\mathbb{E}\!\left[\frac{\partial^2}{\partial\theta^2} \log p(X;\theta)\right]$$

The term $\frac{\partial}{\partial\theta} \log p(X;\theta)$ is called the score function. Its expectation is zero; its variance is the Fisher information.

Cramér-Rao bound

No unbiased estimator can have variance smaller than:

$$\text{Var}(\hat{\theta}) \geq \frac{1}{n\, \mathcal{I}(\theta)}$$

This is the fundamental limit on estimation accuracy. MLE is asymptotically efficient — it achieves this bound for large $n$.

Fisher information matrix (multiparameter)

$$\mathcal{I}(\boldsymbol{\theta})_{ij} = -\mathbb{E}\!\left[\frac{\partial^2 \log p(X;\boldsymbol{\theta})}{\partial\theta_i \partial\theta_j}\right]$$

This is the negative expected Hessian of the log-likelihood. It defines the natural gradient:

$$\tilde{\nabla}_\theta \mathcal{L} = \mathcal{I}(\theta)^{-1} \nabla_\theta \mathcal{L}$$

Natural gradient descent accounts for the curvature of the parameter space, leading to faster convergence. Used in KFAC, Adam (approximately), and trust region methods.

Fisher information and KL divergence

The Fisher information is the local curvature of KL divergence around $\theta$:

$$D_{\text{KL}}(P_{\theta+\epsilon} \| P_\theta) \approx \frac{1}{2} \epsilon^T \mathcal{I}(\theta)\, \epsilon$$

This makes it the right measure of how fast the distribution changes as θ\theta moves.
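
A one-parameter check of this quadratic approximation, using the Bernoulli family, where $\mathcal{I}(\theta) = 1/(\theta(1-\theta))$:

```python
import numpy as np

def kl_bern(a, b):
    """KL between Bernoulli(a) and Bernoulli(b), in nats."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

theta, eps = 0.3, 1e-3
fisher = 1 / (theta * (1 - theta))   # Fisher information of Bernoulli(theta)

exact  = kl_bern(theta + eps, theta)
approx = 0.5 * eps**2 * fisher
print(exact, approx)                  # nearly identical for small eps
```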


11. Minimum Description Length (MDL)

Shannon's source coding theorem: The minimum average code length for outcomes from $P$ is exactly $H(P)$ bits. You cannot compress below entropy.

More specifically, the optimal code assigns $-\log_2 p(x)$ bits to outcome $x$. Frequent events → short codes. Rare events → long codes. This is the idea behind Huffman coding and arithmetic coding.
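
A sketch with a dyadic source, where the ideal code lengths $-\log_2 p(x)$ are integers and the expected length hits the entropy exactly; coding with the wrong model $Q$ costs $H(P) + D_{\text{KL}}(P \| Q)$ bits:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # source distribution
q = np.array([0.25, 0.25, 0.25, 0.25])    # wrong model

H_p    = -np.sum(p * np.log2(p))          # entropy: 1.75 bits
code_p = -np.log2(p)                      # ideal lengths: 1, 2, 3, 3 bits
print(np.sum(p * code_p))                 # 1.75: achieves the entropy

code_q = -np.log2(q)                      # code built for q: 2 bits per symbol
print(np.sum(p * code_q))                 # 2.0 = H(P) + KL(P || Q) in bits
```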

Kolmogorov complexity $K(x)$ — the length of the shortest program that outputs $x$. Unlike Shannon entropy (which is a property of a distribution), Kolmogorov complexity is a property of an individual string. It is uncomputable in general, but serves as a theoretical ideal.

MDL principle for model selection: Choose the model $M$ that minimizes:

$$\text{MDL} = \underbrace{L(M)}_{\text{model description}} + \underbrace{L(\mathcal{D} \mid M)}_{\text{data given model}}$$

This is a formalization of Occam's razor. Simpler models get shorter descriptions; if a complex model doesn't compress the data much better, it loses. MDL is closely related to Bayesian model comparison and regularization.


12. Entropy and Maximum Entropy Principle

The Maximum Entropy Principle (Jaynes): given constraints on a distribution (e.g., fixed mean and variance), choose the distribution that maximizes entropy. This is the "least biased" choice — you add no information beyond what the constraints impose.

| Constraints | Maximum entropy distribution |
|-------------|------------------------------|
| None (finite $\mathcal{X}$) | Uniform |
| Fixed mean $\mu$ on $[0,\infty)$ | Exponential($1/\mu$) |
| Fixed mean $\mu$, variance $\sigma^2$ on $\mathbb{R}$ | $\mathcal{N}(\mu, \sigma^2)$ |
| Fixed mean on simplex | Categorical (uniform) |
| Fixed mean vector $\boldsymbol{\mu}$, covariance $\Sigma$ on $\mathbb{R}^d$ | $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ |

This is why the Gaussian distribution appears everywhere — it is the distribution that makes the fewest assumptions given a known mean and variance.

Boltzmann / Gibbs distribution — the MaxEnt solution to "fixed expected energy $\mathbb{E}[E(x)] = U$":

$$p(x) = \frac{1}{Z} e^{-E(x)/T}, \qquad Z = \sum_x e^{-E(x)/T}$$

This is the softmax when $T=1$ and $E(x) = -z_x$ (logits). Energy-based models use this directly.
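
A sketch of the softmax/Boltzmann correspondence and the role of $T$ (the logits are illustrative):

```python
import numpy as np

def boltzmann(energies, T=1.0):
    z = np.exp(-np.asarray(energies) / T)
    return z / z.sum()

logits = np.array([2.0, 1.0, 0.1])
softmax = np.exp(logits) / np.exp(logits).sum()
print(np.allclose(boltzmann(-logits, T=1.0), softmax))   # True

# Temperature controls sharpness: entropy of the distribution rises with T
for T in (0.5, 1.0, 5.0):
    probs = boltzmann(-logits, T)
    print(T, probs.round(3), -np.sum(probs * np.log(probs)))
```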


13. Variational Inference and the ELBO

Given intractable posterior $p(\theta \mid \mathcal{D})$, approximate it with a tractable $q(\theta)$ by minimizing:

$$D_{\text{KL}}(q \| p) = D_{\text{KL}}(q(\theta) \,\|\, p(\theta \mid \mathcal{D}))$$

Expanding:

$$D_{\text{KL}}(q \| p) = \mathbb{E}_q[\log q(\theta)] - \mathbb{E}_q[\log p(\theta, \mathcal{D})] + \log p(\mathcal{D})$$

Since $\log p(\mathcal{D})$ is a constant w.r.t. $q$, minimizing KL is equivalent to maximizing the Evidence Lower BOund (ELBO):

$$\mathcal{L}(q) = \mathbb{E}_q[\log p(\theta, \mathcal{D})] - \mathbb{E}_q[\log q(\theta)] = \underbrace{\mathbb{E}_q[\log p(\mathcal{D} \mid \theta)]}_{\text{reconstruction}} - \underbrace{D_{\text{KL}}(q(\theta) \| p(\theta))}_{\text{regularization}}$$

The ELBO is a lower bound on $\log p(\mathcal{D})$:

$$\log p(\mathcal{D}) = \mathcal{L}(q) + D_{\text{KL}}(q \| p) \geq \mathcal{L}(q)$$

VAE loss is exactly the ELBO with $q = q_\phi(\mathbf{z} \mid \mathbf{x})$ (encoder) and $p = p_\theta(\mathbf{x} \mid \mathbf{z})$ (decoder):

$$\mathcal{L}_{\text{VAE}} = \underbrace{\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x} \mid \mathbf{z})]}_{\text{reconstruction loss}} - \underbrace{D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))}_{\text{KL to prior}}$$

With Gaussian encoder $q_\phi = \mathcal{N}(\boldsymbol{\mu}_\phi, \boldsymbol{\sigma}^2_\phi)$ and prior $p(\mathbf{z}) = \mathcal{N}(0, I)$:

$$D_{\text{KL}} = \frac{1}{2}\sum_{j=1}^d \left(\mu_j^2 + \sigma_j^2 - \log\sigma_j^2 - 1\right)$$

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, log_var):
    # Reconstruction: binary cross-entropy
    recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
    # KL to standard Gaussian — closed form
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```

14. Channel Capacity and Mutual Information

Shannon's channel capacity — the maximum information rate through a noisy channel:

$$C = \max_{P_X} I(X; Y) \quad \text{bits per channel use}$$

For an additive white Gaussian noise (AWGN) channel with signal power $P$ and noise power $N$:

$$C = \frac{1}{2}\log_2\!\left(1 + \frac{P}{N}\right) \quad \text{bits per channel use}$$

This is the Shannon-Hartley theorem — the foundation of all of wireless communication theory.

Shannon's channel coding theorem: Reliable communication at rate $R < C$ is possible with arbitrarily small error probability. Above $C$, it is impossible. This bound is achievable (modern LDPC and turbo codes come close).
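
As an illustration not spelled out above: the binary symmetric channel with flip probability $\epsilon$ has capacity $C = 1 - H_b(\epsilon)$ bits, and a grid search over input distributions recovers it:

```python
import numpy as np

# Binary symmetric channel, flip probability eps; rows of the matrix are p(y|x)
eps = 0.1
channel = np.array([[1 - eps, eps], [eps, 1 - eps]])

def mi_bits(px):
    pxy = px[:, None] * channel
    mx = pxy.sum(axis=1, keepdims=True)
    my = pxy.sum(axis=0, keepdims=True)
    m = pxy > 0
    return np.sum(pxy[m] * np.log2((pxy / (mx * my))[m]))

qs = np.linspace(0.001, 0.999, 999)
caps = [mi_bits(np.array([q, 1 - q])) for q in qs]
print(qs[np.argmax(caps)], max(caps))     # maximum at the uniform input, q = 0.5

hb = -eps * np.log2(eps) - (1 - eps) * np.log2(1 - eps)
print(1 - hb)                             # closed form C = 1 - H_b(eps), matches
```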


15. Perplexity

Perplexity is the standard evaluation metric for language models:

$$\text{PPL}(P, Q) = 2^{H(P,Q)} = 2^{-\frac{1}{n}\sum_{i=1}^n \log_2 q(x_i)}$$

Or in nats:

$$\text{PPL} = e^{-\frac{1}{n}\sum_{i=1}^n \log q(x_i)}$$

Interpretation: perplexity of $k$ means the model is as confused as if it had to choose uniformly among $k$ equally likely options at each step. Lower is better. A unigram model of English might have perplexity ~1000. A good LLM might have perplexity ~10.

$$\text{PPL} = e^{H(P,Q)} = e^{H(P) + D_{\text{KL}}(P \| Q)}$$

So perplexity is minimized when $Q = P$ (the model matches the true distribution), at which point $\text{PPL} = e^{H(P)}$ — the irreducible perplexity due to the true entropy of language.
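
A sketch confirming the "uniform among $k$ options" reading (the token probabilities are illustrative):

```python
import numpy as np

def perplexity(log_probs):
    """log_probs: per-token log q(x_i) under the model, in nats."""
    return np.exp(-np.mean(log_probs))

# A uniform model over k options assigns log(1/k) to every token, so PPL = k
k = 50
print(perplexity(np.full(100, np.log(1 / k))))   # → 50 (up to float error)

# Mixed confidence: PPL is the inverse geometric mean of assigned probabilities
print(perplexity(np.log([0.5, 0.1, 0.9, 0.25])))
```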


16. Quick Reference

```python
import numpy as np
from scipy.stats import entropy
from scipy.special import rel_entr

# ── Self-information ────────────────────────────────────
def surprisal(p, base=2):
    return -np.log(p) / np.log(base)

# ── Entropy ─────────────────────────────────────────────
def H(p, base=None):
    p = np.array(p, dtype=float)
    return entropy(p, base=base)   # base=None → nats, base=2 → bits

H([0.5, 0.5], base=2)             # → 1.0 bit
H([0.9, 0.1], base=2)             # → 0.469 bits
H([0.25]*4,   base=2)             # → 2.0 bits (uniform over 4)

# Gaussian differential entropy
def gaussian_entropy(sigma, base=None):
    h = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
    return h if base is None else h / np.log(base)

# ── KL divergence ────────────────────────────────────────
def kl(p, q):
    # rel_entr(p,q) = p*log(p/q), handles 0*log0=0
    return np.sum(rel_entr(p, q))   # nats

kl([0.4, 0.4, 0.2], [0.33, 0.33, 0.34])   # → 0.048

# KL between two Gaussians
def kl_gaussians(mu1, s1, mu2, s2):
    return (np.log(s2/s1)
            + (s1**2 + (mu1-mu2)**2) / (2*s2**2)
            - 0.5)

# ── Cross-entropy ────────────────────────────────────────
def cross_entropy(p, q):
    return H(p) + kl(p, q)

# In PyTorch (standard):
# loss = F.cross_entropy(logits, targets)        # multiclass
# loss = F.binary_cross_entropy_with_logits(...) # binary

# ── Jensen-Shannon divergence ────────────────────────────
def jsd(p, q):
    p, q = np.array(p), np.array(q)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

jsd([0.5, 0.5], [0.9, 0.1])   # → 0.102 nats

# ── Mutual information (discrete) ───────────────────────
def mutual_info(pxy):
    """pxy: 2D array of joint probabilities"""
    pxy = np.array(pxy)
    px  = pxy.sum(axis=1, keepdims=True)
    py  = pxy.sum(axis=0, keepdims=True)
    return kl(pxy.ravel(), (px * py).ravel())

# sklearn version (for continuous features):
# from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
# mi = mutual_info_classif(X, y)

# ── Perplexity ───────────────────────────────────────────
def perplexity(log_probs):
    """log_probs: array of log P(x_i) under model"""
    return np.exp(-np.mean(log_probs))

# ── Binary entropy function ──────────────────────────────
def h_binary(p):
    p = np.clip(p, 1e-12, 1-1e-12)
    return -p*np.log2(p) - (1-p)*np.log2(1-p)
```

17. Summary of Key Relationships

$$H(X,Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$$

$$I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X \mid Y)$$

$$H(P,Q) = H(P) + D_{\text{KL}}(P \| Q)$$

$$\log p(\mathcal{D}) = \text{ELBO}(q) + D_{\text{KL}}(q \,\|\, p(\cdot \mid \mathcal{D}))$$

$$\text{PPL} = e^{H(P,Q)}$$

$$\text{minimizing cross-entropy} \equiv \text{minimizing KL} \equiv \text{MLE}$$

$$\text{L2 regularization} \equiv \text{Gaussian prior} \equiv \text{MAP with } \mathcal{N}(0, 1/\lambda)$$

$$\text{L1 regularization} \equiv \text{Laplace prior} \equiv \text{MAP with Laplace}(0, 1/\lambda)$$