Entropy, KL divergence, cross-entropy — used in loss functions
Information theory, invented by Claude Shannon in 1948, answers one question: how do we quantify information, uncertainty, and communication? It turns out this is exactly what ML needs:
The core idea: rare events carry more information than common ones. "The sun rose today" tells you nothing. "It snowed in July in Bologna" tells you a lot.
The information content (surprisal) of a single outcome $x$ with probability $p(x)$:

$$I(x) = -\log p(x)$$
Units depend on the base of the logarithm: $\log_2$ gives bits, $\ln$ gives nats.
Properties:
This additivity property is the key motivation. We want information to add up for independent events, and indeed $I(x, y) = -\log p(x)p(y) = I(x) + I(y)$ when $x$ and $y$ are independent.
Example: A fair coin: $I(\text{heads}) = -\log_2 0.5 = 1$ bit. A biased coin with $p(\text{heads}) = 0.01$: $I(\text{heads}) = -\log_2 0.01 \approx 6.6$ bits; landing heads is very surprising.
Example: In a vocabulary of 50,000 words, the common word "the" has probability $\approx 0.05$, giving $-\log_2 0.05 \approx 4.3$ bits. A rare word like "ephemeral" at $p \approx 10^{-6}$ gives $\approx 20$ bits.
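A quick numeric check of surprisal values (a sketch; the word probabilities are illustrative, not corpus statistics):

```python
import numpy as np

def surprisal_bits(p):
    """Self-information -log2 p(x), in bits."""
    return -np.log2(p)

print(surprisal_bits(0.5))    # fair coin flip: 1 bit
print(surprisal_bits(0.05))   # a very common word: ~4.3 bits
print(surprisal_bits(1e-6))   # a rare word: ~19.9 bits
```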
The entropy of a discrete distribution $p$ over outcomes $x \in \mathcal{X}$:

$$H(p) = -\sum_{x} p(x) \log p(x) = \mathbb{E}_{x \sim p}\left[-\log p(x)\right]$$
It is the expected surprisal — the average amount of information per outcome. Also interpretable as the minimum average number of bits needed to encode outcomes from $p$.
Convention: $0 \log 0 = 0$ (by continuity, since $p \log p \to 0$ as $p \to 0^+$).
Non-negativity: $H(p) \ge 0$, with equality iff $p$ is deterministic (all probability on one outcome).
Maximum entropy: $H(p) \le \log |\mathcal{X}|$, with equality iff $p$ is uniform. Uniform = maximum uncertainty = maximum entropy.
Concavity: $H$ is a concave function of the probability vector $(p_1, \dots, p_n)$.
For a Bernoulli($p$) variable:

$$H(p) = -p \log_2 p - (1-p) \log_2 (1-p)$$
| $p$ | $H(p)$ (bits) | Interpretation |
|---|---|---|
| 0.0 | 0 | Certain — no information |
| 0.1 | 0.469 | Mostly predictable |
| 0.3 | 0.881 | Some uncertainty |
| 0.5 | 1.0 | Maximum uncertainty for binary |
| 0.7 | 0.881 | Symmetric with 0.3 |
| 1.0 | 0 | Certain — no information |
The function is symmetric around $p = 0.5$ and peaks there at exactly 1 bit.
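The table above can be reproduced directly from the binary entropy formula (a minimal sketch; the clip handles the $0 \log 0 = 0$ convention):

```python
import numpy as np

def h_bin(p):
    """Binary entropy in bits, with the convention 0*log(0) = 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

for p in [0.0, 0.1, 0.3, 0.5, 0.7, 1.0]:
    print(f"{p:.1f}  {h_bin(p):.3f}")   # matches the table: 0, 0.469, 0.881, 1.0, ...
```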
| Distribution | Entropy |
|---|---|
| Bernoulli($p$) | $-p \log p - (1-p)\log(1-p)$ |
| Categorical($p_1, \dots, p_K$, $K$ classes) | $-\sum_{k=1}^{K} p_k \log p_k$ |
| Uniform ($n$ outcomes) | $\log n$ |
| $\mathcal{N}(\mu, \sigma^2)$ in $\mathbb{R}$ | $\tfrac{1}{2}\log(2\pi e \sigma^2)$ |
| Exponential($\lambda$) | $1 - \log \lambda$ |
[!MATH] The Gaussian has the maximum entropy of all distributions with fixed mean $\mu$ and variance $\sigma^2$. This is a deep result — if all you know about a distribution is its mean and variance, the least-committal (most honest) choice is Gaussian.
For a continuous random variable $X$ with density $p(x)$:

$$h(X) = -\int p(x) \log p(x)\, dx$$
[!WARNING] Differential entropy is not the continuous analog of discrete entropy in every sense. It can be negative. For example, $\mathrm{Uniform}(0, a)$ has $h = \log a$, which is negative when $a < 1$. It is not invariant to reparameterization. What matters is differences of differential entropy (as in KL divergence), which are well-behaved.
It still serves the same intuitive role — measures spread/uncertainty of a continuous distribution — and most formulas involving entropy still hold when you use $h$ in place of $H$.
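The warning above is easy to verify numerically: the differential entropy of a narrow uniform distribution is negative (a minimal sketch using the closed forms stated above):

```python
import numpy as np

def uniform_entropy(a):
    """Differential entropy of Uniform(0, a): log(a), in nats."""
    return np.log(a)

def gaussian_entropy(sigma):
    """Differential entropy of N(mu, sigma^2): 0.5 * log(2*pi*e*sigma^2), in nats."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

print(uniform_entropy(2.0))    # positive: ~0.693 nats
print(uniform_entropy(0.5))    # negative: ~-0.693 nats
print(gaussian_entropy(0.1))   # also negative for a very narrow Gaussian
```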
The joint entropy $H(X, Y) = -\sum_{x,y} p(x, y) \log p(x, y)$ measures the uncertainty in the joint outcome. Always:

$$H(X, Y) \le H(X) + H(Y)$$
with equality iff $X$ and $Y$ are independent. Knowing $Y$ can only reduce or leave unchanged the uncertainty in $X$.
The conditional entropy: the uncertainty in $Y$ given that we know $X$:

$$H(Y \mid X) = -\sum_{x,y} p(x, y) \log p(y \mid x)$$
Also written as:

$$H(Y \mid X) = H(X, Y) - H(X)$$
Conditioning reduces entropy: $H(Y \mid X) \le H(Y)$, with equality iff $X \perp Y$.
Chain rule:

$$H(X_1, \dots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \dots, X_{i-1})$$

The joint uncertainty is the sum of conditional uncertainties, added one variable at a time.
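The identity $H(X, Y) = H(X) + H(Y \mid X)$ can be checked on a small joint table (a sketch with illustrative numbers):

```python
import numpy as np

# Joint distribution p(x, y) over a 2x2 outcome space (illustrative numbers)
pxy = np.array([[0.3, 0.2],
                [0.1, 0.4]])

def H(p):
    """Entropy in bits, ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

Hxy = H(pxy.ravel())                      # joint entropy H(X, Y)
px = pxy.sum(axis=1)                      # marginal p(x)
Hx = H(px)
# H(Y|X) computed directly as sum_x p(x) * H(Y | X=x)
Hy_given_x = sum(px[i] * H(pxy[i] / px[i]) for i in range(2))
print(Hxy, Hx + Hy_given_x)               # equal: H(X,Y) = H(X) + H(Y|X)
```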
The Kullback-Leibler divergence between $p$ and $q$:

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$
For continuous distributions:

$$D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx$$
Read $D_{\mathrm{KL}}(p \,\|\, q)$ as "the KL divergence from $q$ to $p$" or "the KL of $p$ with respect to $q$."
Non-negativity (Gibbs' inequality): $D_{\mathrm{KL}}(p \,\|\, q) \ge 0$, with equality iff $p = q$.
Proof sketch: follows from the concavity of $\log$ and Jensen's inequality:

$$-D_{\mathrm{KL}}(p \,\|\, q) = \mathbb{E}_p\left[\log \frac{q(x)}{p(x)}\right] \le \log \mathbb{E}_p\left[\frac{q(x)}{p(x)}\right] = \log 1 = 0$$
Asymmetry: $D_{\mathrm{KL}}(p \,\|\, q) \ne D_{\mathrm{KL}}(q \,\|\, p)$ in general.
This is not a distance in the mathematical sense — it violates symmetry and the triangle inequality.
Infinite if $q(x) = 0$ where $p(x) > 0$: the term $p(x) \log \frac{p(x)}{q(x)}$ blows up, so $D_{\mathrm{KL}}(p \,\|\, q) = \infty$. You are trying to encode events under $q$ that $q$ thinks are impossible. This forces approximate distributions to have full support.
This asymmetry has deep practical consequences:
Forward KL — $D_{\mathrm{KL}}(p \,\|\, q)$, minimized w.r.t. $q$:
Called moment-matching or mean-seeking. $q$ must cover all regions where $p$ has mass. If $p$ is multimodal, $q$ spreads to cover all modes — resulting in a wide, blurry approximation.
Reverse KL — $D_{\mathrm{KL}}(q \,\|\, p)$, minimized w.r.t. $q$:
Called information projection or mode-seeking. $q$ concentrates on one mode of $p$ rather than covering all of them. If $p$ is multimodal, $q$ collapses to a single peak.
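The mode-covering vs mode-seeking behavior shows up even in a tiny discrete example (a sketch; the distributions are illustrative): a bimodal $p$, one "wide" candidate covering both modes, one "narrow" candidate sitting on a single mode.

```python
import numpy as np
from scipy.special import rel_entr

def kl(p, q):
    return np.sum(rel_entr(p, q))  # nats

# A bimodal "true" distribution over 4 states (modes at states 0 and 3)
p = np.array([0.49, 0.005, 0.005, 0.49])
q_wide   = np.array([0.25, 0.25, 0.25, 0.25])    # covers both modes
q_narrow = np.array([0.98, 0.01, 0.005, 0.005])  # sits on one mode

print(kl(p, q_wide), kl(p, q_narrow))    # forward KL prefers the wide q
print(kl(q_wide, p), kl(q_narrow, p))    # reverse KL prefers the narrow q
```

Forward KL punishes `q_narrow` for ignoring the second mode; reverse KL punishes `q_wide` for putting mass where $p$ is nearly zero.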
[!NOTE] Variational inference (VI) typically minimizes reverse KL because we can estimate $D_{\mathrm{KL}}(q \,\|\, p)$ (up to a constant) by sampling from $q$ (which we control), but we can't sample from $p$ (the posterior we're trying to approximate). This is why VI tends to underestimate posterior variance.
Closed form for $p = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2)$:

$$D_{\mathrm{KL}}(p \,\|\, q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$
For multivariate $\mathcal{N}(\mu_1, \Sigma_1)$, $\mathcal{N}(\mu_2, \Sigma_2)$ in $\mathbb{R}^d$:

$$D_{\mathrm{KL}} = \frac{1}{2}\left[\operatorname{tr}\!\left(\Sigma_2^{-1}\Sigma_1\right) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) - d + \log\frac{\det \Sigma_2}{\det \Sigma_1}\right]$$
This formula appears in the VAE loss, in Gaussian process inference, and in natural gradient methods.
```python
from scipy.special import rel_entr
import numpy as np

# KL divergence for discrete distributions
def kl_divergence(p, q):
    p, q = np.array(p), np.array(q)
    return np.sum(rel_entr(p, q))  # handles 0*log(0)=0, inf where q=0 and p>0

p = [0.4, 0.4, 0.2]
q = [0.33, 0.33, 0.34]
print(kl_divergence(p, q))  # → 0.0478 nats

# KL for Gaussians
def kl_gaussian(mu1, sig1, mu2, sig2):
    return (np.log(sig2/sig1)
            + (sig1**2 + (mu1-mu2)**2) / (2*sig2**2)
            - 0.5)

print(kl_gaussian(0, 1, 0, 2))  # N(0,1) || N(0,4) → 0.318 nats
print(kl_gaussian(0, 2, 0, 1))  # N(0,4) || N(0,1) → 0.807 nats — asymmetric!
```
$$H(p, q) = -\sum_x p(x) \log q(x) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$$

Interpretation: the average number of bits needed to encode events from $p$ using a code optimized for $q$.
Since $H(p)$ is fixed when $p$ is fixed (true labels), minimizing cross-entropy is equivalent to minimizing the KL divergence $D_{\mathrm{KL}}(p \,\|\, q)$ between the data distribution $p$ and the model $q$.
In classification with $C$ classes:

$$\mathcal{L} = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$
For one-hot labels (only one $y_c = 1$):

$$\mathcal{L} = -\log \hat{y}_{c^{*}}$$

where $c^{*}$ is the correct class.
This is the negative log-probability of the correct class — just a single term survives. The loss is small when the model assigns high probability to the right class, large when it doesn't.
For binary classification (sigmoid output):

$$\mathcal{L} = -\left[y \log \hat{y} + (1-y) \log(1 - \hat{y})\right]$$
[!MATH] Why not MSE for classification? MSE assumes Gaussian noise around the true label. For a binary label, Gaussian doesn't make sense — the output is a probability in $[0, 1]$. Cross-entropy corresponds to assuming Bernoulli noise, which is the right model. Cross-entropy also penalizes confident wrong predictions much more severely than MSE does: if $y = 1$ and $\hat{y} = 0.01$, cross-entropy gives $-\ln 0.01 \approx 4.6$, while MSE gives $(1 - 0.01)^2 \approx 0.98$. Cross-entropy creates stronger gradient signals for badly wrong predictions.
Hard one-hot targets have infinite cross-entropy loss if the model ever assigns zero probability to the correct class. Label smoothing replaces one-hot with a soft target:

$$y_c^{\mathrm{LS}} = (1 - \epsilon)\, y_c + \frac{\epsilon}{C}$$
Typically $\epsilon = 0.1$. This prevents overconfidence and improves calibration.
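The smoothing formula above is a one-liner (a minimal sketch; `eps=0.1` as in the text):

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    """Replace one-hot targets with (1 - eps) * y + eps / C."""
    C = y_onehot.shape[-1]
    return (1 - eps) * y_onehot + eps / C

y = np.array([0., 0., 1., 0.])
print(smooth_labels(y))  # → [0.025 0.025 0.925 0.025], still sums to 1
```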
How much does knowing $Y$ reduce uncertainty about $X$?

$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$
Equivalently:

$$I(X; Y) = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\, p(y)\big)$$
Properties: $I(X; Y) = I(Y; X)$ (symmetric); $I(X; Y) \ge 0$, with equality iff $X \perp Y$; $I(X; X) = H(X)$.
┌─────────────────────────────────┐
│ H(X,Y) │
│ ┌──────────────┐ │
│ │ H(X) │ H(Y|X) │
│ │ H(X|Y) ┌────┼────┐ │
│ │ │I(X;Y) │ │
│ └─────────┼────┘ │ │
│ │ H(Y) │ │
│ └─────────┘ │
└─────────────────────────────────┘
Conditional mutual information $I(X; Y \mid Z)$ measures how much $Y$ tells us about $X$ when we already know $Z$.
Data processing inequality: if $X \to Y \to Z$ is a Markov chain ($Z$ depends on $X$ only through $Y$):

$$I(X; Z) \le I(X; Y)$$
Processing data can only destroy information, never create it. This is why features should be computed carefully — every deterministic transformation of the input can only lose information about the target.
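The inequality can be checked numerically on a toy Markov chain (a sketch; both arrows are binary symmetric channels with an illustrative flip probability):

```python
import numpy as np
from scipy.special import rel_entr

def mi(pxy):
    """Mutual information I(X;Y) in nats from a joint probability table."""
    pxy = np.asarray(pxy)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    return np.sum(rel_entr(pxy, px * py))

# Markov chain X -> Y -> Z: each arrow flips the bit with probability 0.1
flip = 0.1
channel = np.array([[1 - flip, flip],
                    [flip, 1 - flip]])   # channel[a, b] = p(out=b | in=a)
px = np.array([0.5, 0.5])
pxy = px[:, None] * channel              # joint p(x, y)
pxz = pxy @ channel                      # joint p(x, z) = sum_y p(x,y) p(z|y)
print(mi(pxy), mi(pxz))                  # I(X;Z) < I(X;Y): processing lost info
```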
Feature selection: pick the feature $X_j$ that maximizes $I(X_j; Y)$, where $Y$ is the label.
Representation learning: learn a representation $Z$ that maximizes $I(Z; Y)$ (retains task-relevant info) while minimizing $I(Z; X)$ (discards irrelevant input detail). This is the Information Bottleneck principle.
InfoNCE / contrastive learning: modern self-supervised methods like SimCLR and CLIP maximize a lower bound on the mutual information between views of the same data:

$$I(x; y) \ge \log N - \mathcal{L}_{\mathrm{InfoNCE}}$$

where $N$ is the number of samples in the contrastive batch.
```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Mutual information between each feature and the target
mi_scores = mutual_info_classif(X_train, y_train)
top_features = np.argsort(mi_scores)[::-1][:10]  # ten most informative features
```
A symmetric, bounded version of KL divergence:

$$\mathrm{JSD}(p \,\|\, q) = \tfrac{1}{2} D_{\mathrm{KL}}(p \,\|\, m) + \tfrac{1}{2} D_{\mathrm{KL}}(q \,\|\, m), \qquad m = \tfrac{1}{2}(p + q)$$
Properties: symmetric; always finite; bounded by $\log 2$ nats (1 bit); $\sqrt{\mathrm{JSD}}$ is a metric.
Used in the original GAN formulation — with an optimal discriminator, the generator implicitly minimizes the JSD between the real and generated distributions:

$$\min_G \; 2\, \mathrm{JSD}(p_{\mathrm{data}} \,\|\, p_G) - \log 4$$
The Fisher information measures how much information the observable data carries about a parameter $\theta$:

$$\mathcal{I}(\theta) = \mathbb{E}_{x \sim p_\theta}\left[\left(\frac{\partial}{\partial \theta} \log p_\theta(x)\right)^{2}\right]$$
The term $\frac{\partial}{\partial \theta} \log p_\theta(x)$ is called the score function. Its expectation is zero; its variance is the Fisher information.
Cramér-Rao bound: no unbiased estimator $\hat{\theta}$ from $n$ i.i.d. samples can have variance smaller than:

$$\operatorname{Var}(\hat{\theta}) \ge \frac{1}{n\, \mathcal{I}(\theta)}$$
This is the fundamental limit on estimation accuracy. MLE is asymptotically efficient — it achieves this bound for large $n$.
Under regularity conditions, $\mathcal{I}(\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log p_\theta(x)\right]$: the negative expected Hessian of the log-likelihood. It defines the natural gradient:

$$\tilde{\nabla}_\theta \mathcal{L} = \mathcal{I}(\theta)^{-1} \nabla_\theta \mathcal{L}$$
Natural gradient descent accounts for the curvature of the parameter space, leading to faster convergence. Used in KFAC, Adam (approximately), and trust region methods.
The Fisher information is the local curvature of the KL divergence around $\theta$:

$$D_{\mathrm{KL}}\big(p_\theta \,\|\, p_{\theta + d\theta}\big) \approx \tfrac{1}{2}\, d\theta^\top\, \mathcal{I}(\theta)\, d\theta$$
This makes it the right measure of how fast the distribution changes as $\theta$ moves.
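The "variance of the score" definition can be verified by Monte Carlo for a Bernoulli($\theta$), whose analytic Fisher information is $\frac{1}{\theta(1-\theta)}$ (a sketch; sample size and seed are arbitrary choices):

```python
import numpy as np

def fisher_bernoulli(theta, n=200_000, seed=0):
    """Monte Carlo estimate of the Fisher information of Bernoulli(theta)."""
    rng = np.random.default_rng(seed)
    x = rng.random(n) < theta
    # score = d/dtheta log p(x; theta) = x/theta - (1-x)/(1-theta)
    score = np.where(x, 1 / theta, -1 / (1 - theta))
    return np.var(score)   # variance of the score = Fisher information

theta = 0.3
print(fisher_bernoulli(theta))   # ≈ 1/(0.3 * 0.7) ≈ 4.76
```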
Shannon's source coding theorem: the minimum average code length for outcomes from $p$ is $H(p)$ bits per symbol (in the limit of long sequences). You cannot compress below entropy.
More specifically, the optimal code assigns $\approx -\log_2 p(x)$ bits to outcome $x$. Frequent events → short codes. Rare events → long codes. This is the idea behind Huffman coding and arithmetic coding.
Kolmogorov complexity — the length of the shortest program that outputs $x$. Unlike Shannon entropy (which is a property of a distribution), Kolmogorov complexity is a property of an individual string $x$. It is uncomputable in general, but serves as a theoretical ideal.
MDL principle for model selection: choose the model that minimizes:

$$L(\text{model}) + L(\text{data} \mid \text{model})$$
This is a formalization of Occam's razor. Simpler models get shorter descriptions; if a complex model doesn't compress the data much better, it loses. MDL is closely related to Bayesian model comparison and regularization.
The Maximum Entropy Principle (Jaynes): given constraints on a distribution (e.g., fixed mean and variance), choose the distribution that maximizes entropy. This is the "least biased" choice — you add no information beyond what the constraints impose.
| Constraints | Maximum entropy distribution |
|---|---|
| None (finite support $\mathcal{X}$) | Uniform |
| Fixed mean $\mu$ on $[0, \infty)$ | Exponential($\lambda = 1/\mu$) |
| Fixed mean $\mu$, variance $\sigma^2$ on $\mathbb{R}$ | $\mathcal{N}(\mu, \sigma^2)$ |
| None, on the probability simplex | Categorical (uniform) |
| Fixed mean vector $\mu$, covariance $\Sigma$ on $\mathbb{R}^d$ | $\mathcal{N}(\mu, \Sigma)$ |
This is why the Gaussian distribution appears everywhere — it is the distribution that makes the fewest assumptions given a known mean and variance.
Boltzmann / Gibbs distribution — the MaxEnt solution to "fixed expected energy $\mathbb{E}[E(x)]$":

$$p(x) = \frac{e^{-E(x)/T}}{Z}, \qquad Z = \sum_x e^{-E(x)/T}$$
This is the softmax when $T = 1$ and $E(x) = -z_x$ (the logits). Energy-based models use this directly.
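The softmax-as-Boltzmann identity is easy to demonstrate, along with the role of the temperature (a minimal sketch; the logits are illustrative):

```python
import numpy as np

def boltzmann(energies, T=1.0):
    """Gibbs distribution p(x) ∝ exp(-E(x)/T)."""
    w = np.exp(-np.asarray(energies) / T)
    return w / w.sum()

logits = np.array([2.0, 1.0, 0.1])
# softmax(logits) = Boltzmann distribution with E(x) = -logit(x) and T = 1
p = boltzmann(-logits)
print(p)                           # ≈ [0.659, 0.242, 0.099]
print(boltzmann(-logits, T=0.1))   # low temperature sharpens toward the argmax
print(boltzmann(-logits, T=10.0))  # high temperature flattens toward uniform
```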
Given an intractable posterior $p(z \mid x)$, approximate it with a tractable $q_\phi(z)$ by minimizing:

$$D_{\mathrm{KL}}\big(q_\phi(z) \,\|\, p(z \mid x)\big)$$
Expanding:

$$D_{\mathrm{KL}}\big(q(z) \,\|\, p(z \mid x)\big) = \mathbb{E}_{q}\big[\log q(z) - \log p(x, z)\big] + \log p(x)$$
Since $\log p(x)$ is a constant w.r.t. $q$, minimizing KL is equivalent to maximizing the Evidence Lower BOund (ELBO):

$$\mathcal{L}(q) = \mathbb{E}_{q}\big[\log p(x, z) - \log q(z)\big]$$
The ELBO is a lower bound on $\log p(x)$:

$$\log p(x) = \mathcal{L}(q) + D_{\mathrm{KL}}\big(q(z) \,\|\, p(z \mid x)\big) \ge \mathcal{L}(q)$$
The VAE loss is exactly the negative ELBO with $q_\phi(z \mid x)$ (encoder) and $p_\theta(x \mid z)$ (decoder):

$$\mathcal{L}(\theta, \phi) = -\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$
With a Gaussian encoder $q_\phi(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and prior $p(z) = \mathcal{N}(0, I)$:

$$D_{\mathrm{KL}} = -\frac{1}{2} \sum_j \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$
```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, log_var):
    # Reconstruction: binary cross-entropy
    recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
    # KL to standard Gaussian — closed form
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```
Shannon's channel capacity — the maximum information rate through a noisy channel:

$$C = \max_{p(x)} I(X; Y)$$
For an additive white Gaussian noise (AWGN) channel with bandwidth $B$, signal power $S$, and noise power $N$:

$$C = B \log_2\left(1 + \frac{S}{N}\right) \ \text{bits per second}$$
This is the Shannon-Hartley theorem — the foundation of all of wireless communication theory.
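The Shannon-Hartley formula plugs in directly (a sketch; the bandwidth and SNR values are hypothetical, chosen only for illustration):

```python
import numpy as np

def awgn_capacity(bandwidth_hz, snr):
    """Shannon-Hartley: C = B * log2(1 + S/N), in bits per second."""
    return bandwidth_hz * np.log2(1 + snr)

# e.g. a 1 MHz channel at SNR = 1000 (30 dB)
print(awgn_capacity(1e6, 1000))  # ≈ 9.97e6 bits/s
```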
Shannon's channel coding theorem: reliable communication at any rate $R < C$ is possible with arbitrarily small error probability. Above $C$, it is impossible. The bound is essentially achievable in practice (modern LDPC and turbo codes come close).
Perplexity is the standard evaluation metric for language models:

$$\mathrm{PPL} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 q(x_i \mid x_{<i})}$$
Or in nats:

$$\mathrm{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \ln q(x_i \mid x_{<i})\right)$$
Interpretation: perplexity of $k$ means the model is as confused as if it had to choose uniformly among $k$ equally likely options at each step. Lower is better. A unigram model of English might have perplexity ~1000. A good LLM might have perplexity ~10.
Perplexity is $2$ raised to the cross-entropy $H(p, q)$ in bits per token, so it is minimized when $q = p$ (the model matches the true distribution), at which point $\mathrm{PPL} = 2^{H(p)}$ — the irreducible perplexity due to the true entropy of language.
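The "uniform among $k$ options" interpretation can be confirmed directly: a model assigning probability $1/k$ to every token has perplexity exactly $k$ (a minimal sketch):

```python
import numpy as np

def ppl(log_probs):
    """Perplexity from per-token log-probabilities (in nats)."""
    return np.exp(-np.mean(log_probs))

# A model that assigns uniform probability 1/k to every token:
k = 50
log_probs = np.full(100, np.log(1 / k))
print(ppl(log_probs))  # → 50.0 (up to floating point)
```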
```python
import numpy as np
from scipy.stats import entropy
from scipy.special import rel_entr

# ── Self-information ────────────────────────────────────
def surprisal(p, base=2):
    return -np.log2(p) if base == 2 else -np.log(p)

# ── Entropy ─────────────────────────────────────────────
def H(p, base=None):
    p = np.array(p, dtype=float)
    return entropy(p, base=base)  # base=None → nats, base=2 → bits

H([0.5, 0.5], base=2)   # → 1.0 bit
H([0.9, 0.1], base=2)   # → 0.469 bits
H([0.25]*4, base=2)     # → 2.0 bits (uniform over 4)

# Gaussian differential entropy
def gaussian_entropy(sigma, base=None):
    h = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
    return h if base is None else h / np.log(base)

# ── KL divergence ────────────────────────────────────────
def kl(p, q):
    # rel_entr(p,q) = p*log(p/q), handles 0*log0=0
    return np.sum(rel_entr(p, q))  # nats

kl([0.4, 0.4, 0.2], [0.33, 0.33, 0.34])  # → 0.048

# KL between two Gaussians
def kl_gaussians(mu1, s1, mu2, s2):
    return (np.log(s2/s1)
            + (s1**2 + (mu1-mu2)**2) / (2*s2**2)
            - 0.5)

# ── Cross-entropy ────────────────────────────────────────
def cross_entropy(p, q):
    return H(p) + kl(p, q)

# In PyTorch (standard):
# loss = F.cross_entropy(logits, targets)            # multiclass
# loss = F.binary_cross_entropy_with_logits(...)     # binary

# ── Jensen-Shannon divergence ────────────────────────────
def jsd(p, q):
    p, q = np.array(p), np.array(q)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

jsd([0.5, 0.5], [0.9, 0.1])  # → 0.102 nats

# ── Mutual information (discrete) ───────────────────────
def mutual_info(pxy):
    """pxy: 2D array of joint probabilities"""
    pxy = np.array(pxy)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    return kl(pxy.ravel(), (px * py).ravel())

# sklearn version (for continuous features):
# from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
# mi = mutual_info_classif(X, y)

# ── Perplexity ───────────────────────────────────────────
def perplexity(log_probs):
    """log_probs: array of log P(x_i) under model"""
    return np.exp(-np.mean(log_probs))

# ── Binary entropy function ──────────────────────────────
def h_binary(p):
    p = np.clip(p, 1e-12, 1-1e-12)
    return -p*np.log2(p) - (1-p)*np.log2(1-p)
```