Entropy, KL divergence, cross-entropy — used in loss functions
Information theory, invented by Claude Shannon in 1948, answers one question: how do we quantify information, uncertainty, and communication? It turns out this is exactly what ML needs:
The core idea: rare events carry more information than common ones. "The sun rose today" tells you nothing. "It snowed in July in Bologna" tells you a lot.
The information content (surprisal) of a single outcome $x$ with probability $p(x)$:

$$I(x) = -\log p(x)$$
Units depend on the base of the logarithm: $\log_2$ gives bits, $\ln$ gives nats.
Properties:
This additivity property is the key motivation. We want information to add up for independent events, and indeed $I(x, y) = -\log p(x)p(y) = I(x) + I(y)$ when $x$ and $y$ are independent.
Example: A fair coin: $I(\text{heads}) = -\log_2 0.5 = 1$ bit. A biased coin with $p(\text{heads}) = 0.01$: $I(\text{heads}) = -\log_2 0.01 \approx 6.6$ bits; landing heads is very surprising.
Example: In a vocabulary of 50,000 words, the common word "the" has probability $\approx 0.05$, giving $-\log_2 0.05 \approx 4.3$ bits. A rare word like "ephemeral" at $p \approx 10^{-6}$ gives $\approx 20$ bits.
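A quick numeric check of surprisal values (a sketch; the word probabilities are illustrative, not corpus statistics):

```python
import numpy as np

def surprisal_bits(p):
    """Self-information -log2 p(x), in bits."""
    return -np.log2(p)

print(surprisal_bits(0.5))    # fair coin flip: 1 bit
print(surprisal_bits(0.05))   # a very common word: ~4.3 bits
print(surprisal_bits(1e-6))   # a rare word: ~19.9 bits
```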
The entropy of a discrete distribution $p$ over outcomes $x \in \mathcal{X}$:

$$H(p) = -\sum_{x} p(x) \log p(x) = \mathbb{E}_{x \sim p}\left[-\log p(x)\right]$$
It is the expected surprisal — the average amount of information per outcome. Also interpretable as the minimum average number of bits needed to encode outcomes from $p$.
Convention: $0 \log 0 = 0$ (by continuity, since $p \log p \to 0$ as $p \to 0^+$).
Non-negativity: $H(p) \ge 0$, with equality iff $p$ is deterministic (all probability on one outcome).
Maximum entropy: $H(p) \le \log |\mathcal{X}|$, with equality iff $p$ is uniform. Uniform = maximum uncertainty = maximum entropy.
Concavity: $H$ is a concave function of the probability vector $(p_1, \dots, p_n)$.
For a Bernoulli($p$) variable:

$$H(p) = -p \log_2 p - (1-p) \log_2 (1-p)$$
| $p$ | $H(p)$ (bits) | Interpretation |
|---|---|---|
| 0.0 | 0 | Certain — no information |
| 0.1 | 0.469 | Mostly predictable |
| 0.3 | 0.881 | Some uncertainty |
| 0.5 | 1.0 | Maximum uncertainty for binary |
| 0.7 | 0.881 | Symmetric with 0.3 |
| 1.0 | 0 | Certain — no information |
The function is symmetric around $p = 0.5$ and peaks there at exactly 1 bit.
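The table above can be reproduced directly from the binary entropy formula (a minimal sketch; the clip handles the $0 \log 0 = 0$ convention):

```python
import numpy as np

def h_bin(p):
    """Binary entropy in bits, with the convention 0*log(0) = 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

for p in [0.0, 0.1, 0.3, 0.5, 0.7, 1.0]:
    print(f"{p:.1f}  {h_bin(p):.3f}")   # matches the table: 0, 0.469, 0.881, 1.0, ...
```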
| Distribution | Entropy |
|---|---|
| Bernoulli($p$) | $-p \log p - (1-p)\log(1-p)$ |
| Categorical($p_1, \dots, p_K$, $K$ classes) | $-\sum_{k=1}^{K} p_k \log p_k$ |
| Uniform ($n$ outcomes) | $\log n$ |
| $\mathcal{N}(\mu, \sigma^2)$ in $\mathbb{R}$ | $\tfrac{1}{2}\log(2\pi e \sigma^2)$ |
| Exponential($\lambda$) | $1 - \log \lambda$ |
[!MATH] The Gaussian has the maximum entropy of all distributions with fixed mean $\mu$ and variance $\sigma^2$. This is a deep result — if all you know about a distribution is its mean and variance, the least-committal (most honest) choice is Gaussian.
For a continuous random variable $X$ with density $p(x)$:

$$h(X) = -\int p(x) \log p(x)\, dx$$
[!WARNING] Differential entropy is not the continuous analog of discrete entropy in every sense. It can be negative. For example, $\mathrm{Uniform}(0, a)$ has $h = \log a$, which is negative when $a < 1$. It is not invariant to reparameterization. What matters is differences of differential entropy (as in KL divergence), which are well-behaved.
It still serves the same intuitive role — measures spread/uncertainty of a continuous distribution — and most formulas involving entropy still hold when you use $h$ in place of $H$.
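The warning above is easy to verify numerically: the differential entropy of a narrow uniform distribution is negative (a minimal sketch using the closed forms stated above):

```python
import numpy as np

def uniform_entropy(a):
    """Differential entropy of Uniform(0, a): log(a), in nats."""
    return np.log(a)

def gaussian_entropy(sigma):
    """Differential entropy of N(mu, sigma^2): 0.5 * log(2*pi*e*sigma^2), in nats."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

print(uniform_entropy(2.0))    # positive: ~0.693 nats
print(uniform_entropy(0.5))    # negative: ~-0.693 nats
print(gaussian_entropy(0.1))   # also negative for a very narrow Gaussian
```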
The joint entropy $H(X, Y) = -\sum_{x,y} p(x, y) \log p(x, y)$ measures the uncertainty in the joint outcome. Always:

$$H(X, Y) \le H(X) + H(Y)$$
with equality iff $X$ and $Y$ are independent. Knowing $Y$ can only reduce or leave unchanged the uncertainty in $X$.
The conditional entropy: the uncertainty in $Y$ given that we know $X$:

$$H(Y \mid X) = -\sum_{x,y} p(x, y) \log p(y \mid x)$$
Also written as:

$$H(Y \mid X) = H(X, Y) - H(X)$$
Conditioning reduces entropy: $H(Y \mid X) \le H(Y)$, with equality iff $X \perp Y$.
Chain rule:

$$H(X_1, \dots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \dots, X_{i-1})$$

The joint uncertainty is the sum of conditional uncertainties, added one variable at a time.
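The identity $H(X, Y) = H(X) + H(Y \mid X)$ can be checked on a small joint table (a sketch with illustrative numbers):

```python
import numpy as np

# Joint distribution p(x, y) over a 2x2 outcome space (illustrative numbers)
pxy = np.array([[0.3, 0.2],
                [0.1, 0.4]])

def H(p):
    """Entropy in bits, ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

Hxy = H(pxy.ravel())                      # joint entropy H(X, Y)
px = pxy.sum(axis=1)                      # marginal p(x)
Hx = H(px)
# H(Y|X) computed directly as sum_x p(x) * H(Y | X=x)
Hy_given_x = sum(px[i] * H(pxy[i] / px[i]) for i in range(2))
print(Hxy, Hx + Hy_given_x)               # equal: H(X,Y) = H(X) + H(Y|X)
```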
The Kullback-Leibler divergence between $p$ and $q$:

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$
For continuous distributions:

$$D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx$$
Read $D_{\mathrm{KL}}(p \,\|\, q)$ as "the KL divergence from $q$ to $p$" or "the KL of $p$ with respect to $q$."
Non-negativity (Gibbs' inequality): $D_{\mathrm{KL}}(p \,\|\, q) \ge 0$, with equality iff $p = q$.
Proof sketch: follows from the concavity of $\log$ and Jensen's inequality:

$$-D_{\mathrm{KL}}(p \,\|\, q) = \mathbb{E}_p\left[\log \frac{q(x)}{p(x)}\right] \le \log \mathbb{E}_p\left[\frac{q(x)}{p(x)}\right] = \log 1 = 0$$
Asymmetry: $D_{\mathrm{KL}}(p \,\|\, q) \ne D_{\mathrm{KL}}(q \,\|\, p)$ in general.
This is not a distance in the mathematical sense — it violates symmetry and the triangle inequality.
Infinite if $q(x) = 0$ where $p(x) > 0$: the term $p(x) \log \frac{p(x)}{q(x)}$ blows up, so $D_{\mathrm{KL}}(p \,\|\, q) = \infty$. You are trying to encode events under $q$ that $q$ thinks are impossible. This forces approximate distributions to have full support.
This asymmetry has deep practical consequences:
Forward KL — $D_{\mathrm{KL}}(p \,\|\, q)$, minimized w.r.t. $q$:
Called moment-matching or mean-seeking. $q$ must cover all regions where $p$ has mass. If $p$ is multimodal, $q$ spreads to cover all modes — resulting in a wide, blurry approximation.
Reverse KL — $D_{\mathrm{KL}}(q \,\|\, p)$, minimized w.r.t. $q$:
Called information projection or mode-seeking. $q$ concentrates on one mode of $p$ rather than covering all of them. If $p$ is multimodal, $q$ collapses to a single peak.
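The mode-covering vs mode-seeking behavior shows up even in a tiny discrete example (a sketch; the distributions are illustrative): a bimodal $p$, one "wide" candidate covering both modes, one "narrow" candidate sitting on a single mode.

```python
import numpy as np
from scipy.special import rel_entr

def kl(p, q):
    return np.sum(rel_entr(p, q))  # nats

# A bimodal "true" distribution over 4 states (modes at states 0 and 3)
p = np.array([0.49, 0.005, 0.005, 0.49])
q_wide   = np.array([0.25, 0.25, 0.25, 0.25])    # covers both modes
q_narrow = np.array([0.98, 0.01, 0.005, 0.005])  # sits on one mode

print(kl(p, q_wide), kl(p, q_narrow))    # forward KL prefers the wide q
print(kl(q_wide, p), kl(q_narrow, p))    # reverse KL prefers the narrow q
```

Forward KL punishes `q_narrow` for ignoring the second mode; reverse KL punishes `q_wide` for putting mass where $p$ is nearly zero.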
[!NOTE] Variational inference (VI) typically minimizes reverse KL because we can estimate $D_{\mathrm{KL}}(q \,\|\, p)$ (up to a constant) by sampling from $q$ (which we control), but we can't sample from $p$ (the posterior we're trying to approximate). This is why VI tends to underestimate posterior variance.
Closed form for $p = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2)$:

$$D_{\mathrm{KL}}(p \,\|\, q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$
For multivariate $\mathcal{N}(\mu_1, \Sigma_1)$, $\mathcal{N}(\mu_2, \Sigma_2)$ in $\mathbb{R}^d$:

$$D_{\mathrm{KL}} = \frac{1}{2}\left[\operatorname{tr}\!\left(\Sigma_2^{-1}\Sigma_1\right) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) - d + \log\frac{\det \Sigma_2}{\det \Sigma_1}\right]$$
This formula appears in the VAE loss, in Gaussian process inference, and in natural gradient methods.
```python
from scipy.special import rel_entr
import numpy as np

# KL divergence for discrete distributions
def kl_divergence(p, q):
    p, q = np.array(p), np.array(q)
    return np.sum(rel_entr(p, q))  # handles 0*log(0)=0, inf where q=0 and p>0

p = [0.4, 0.4, 0.2]
q = [0.33, 0.33, 0.34]
print(kl_divergence(p, q))  # → 0.0478 nats

# KL for Gaussians
def kl_gaussian(mu1, sig1, mu2, sig2):
    return (np.log(sig2/sig1)
            + (sig1**2 + (mu1-mu2)**2) / (2*sig2**2)
            - 0.5)

print(kl_gaussian(0, 1, 0, 2))  # N(0,1) || N(0,4) → 0.318 nats
print(kl_gaussian(0, 2, 0, 1))  # N(0,4) || N(0,1) → 0.807 nats — asymmetric!
```
$$H(p, q) = -\sum_x p(x) \log q(x) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$$

Interpretation: the average number of bits needed to encode events from $p$ using a code optimized for $q$.
Since $H(p)$ is fixed when $p$ is fixed (true labels), minimizing cross-entropy is equivalent to minimizing the KL divergence $D_{\mathrm{KL}}(p \,\|\, q)$ between the data distribution $p$ and the model $q$.
In classification with $C$ classes:

$$\mathcal{L} = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$
For one-hot labels (only one $y_c = 1$):

$$\mathcal{L} = -\log \hat{y}_{c^{*}}$$

where $c^{*}$ is the correct class.
This is the negative log-probability of the correct class — just a single term survives. The loss is small when the model assigns high probability to the right class, large when it doesn't.
For binary classification (sigmoid output):

$$\mathcal{L} = -\left[y \log \hat{y} + (1-y) \log(1 - \hat{y})\right]$$
[!MATH] Why not MSE for classification? MSE assumes Gaussian noise around the true label. For a binary label, Gaussian doesn't make sense — the output is a probability in $[0, 1]$. Cross-entropy corresponds to assuming Bernoulli noise, which is the right model. Cross-entropy also penalizes confident wrong predictions much more severely than MSE does: if $y = 1$ and $\hat{y} = 0.01$, cross-entropy gives $-\ln 0.01 \approx 4.6$, while MSE gives $(1 - 0.01)^2 \approx 0.98$. Cross-entropy creates stronger gradient signals for badly wrong predictions.
Hard one-hot targets have infinite cross-entropy loss if the model ever assigns zero probability to the correct class. Label smoothing replaces one-hot with a soft target:

$$y_c^{\mathrm{LS}} = (1 - \epsilon)\, y_c + \frac{\epsilon}{C}$$
Typically $\epsilon = 0.1$. This prevents overconfidence and improves calibration.
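The smoothing formula above is a one-liner (a minimal sketch; `eps=0.1` as in the text):

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    """Replace one-hot targets with (1 - eps) * y + eps / C."""
    C = y_onehot.shape[-1]
    return (1 - eps) * y_onehot + eps / C

y = np.array([0., 0., 1., 0.])
print(smooth_labels(y))  # → [0.025 0.025 0.925 0.025], still sums to 1
```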
How much does knowing $Y$ reduce uncertainty about $X$?

$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$
Equivalently:

$$I(X; Y) = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\, p(y)\big)$$
Properties: $I(X; Y) = I(Y; X)$ (symmetric); $I(X; Y) \ge 0$, with equality iff $X \perp Y$; $I(X; X) = H(X)$.
┌─────────────────────────────────┐
│ H(X,Y) │
│ ┌──────────────┐ │
│ │ H(X) │ H(Y|X) │
│ │ H(X|Y) ┌────┼────┐ │
│ │ │I(X;Y) │ │
│ └─────────┼────┘ │ │
│ │ H(Y) │ │
│ └─────────┘ │
└─────────────────────────────────┘
Conditional mutual information $I(X; Y \mid Z)$ measures how much $Y$ tells us about $X$ when we already know $Z$.
Data processing inequality: if $X \to Y \to Z$ is a Markov chain ($Z$ depends on $X$ only through $Y$):

$$I(X; Z) \le I(X; Y)$$
Processing data can only destroy information, never create it. This is why features should be computed carefully — every deterministic transformation of the input can only lose information about the target.
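The inequality can be checked numerically on a toy Markov chain (a sketch; both arrows are binary symmetric channels with an illustrative flip probability):

```python
import numpy as np
from scipy.special import rel_entr

def mi(pxy):
    """Mutual information I(X;Y) in nats from a joint probability table."""
    pxy = np.asarray(pxy)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    return np.sum(rel_entr(pxy, px * py))

# Markov chain X -> Y -> Z: each arrow flips the bit with probability 0.1
flip = 0.1
channel = np.array([[1 - flip, flip],
                    [flip, 1 - flip]])   # channel[a, b] = p(out=b | in=a)
px = np.array([0.5, 0.5])
pxy = px[:, None] * channel              # joint p(x, y)
pxz = pxy @ channel                      # joint p(x, z) = sum_y p(x,y) p(z|y)
print(mi(pxy), mi(pxz))                  # I(X;Z) < I(X;Y): processing lost info
```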
Feature selection: pick the feature $X_j$ that maximizes $I(X_j; Y)$, where $Y$ is the label.
Representation learning: learn a representation $Z$ that maximizes $I(Z; Y)$ (retains task-relevant info) while minimizing $I(Z; X)$ (discards irrelevant input detail). This is the Information Bottleneck principle.
InfoNCE / contrastive learning: modern self-supervised methods like SimCLR and CLIP maximize a lower bound on the mutual information between views of the same data:

$$I(x; y) \ge \log N - \mathcal{L}_{\mathrm{InfoNCE}}$$

where $N$ is the number of samples in the contrastive batch.
```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Mutual information between each feature and the target
mi_scores = mutual_info_classif(X_train, y_train)
top_features = np.argsort(mi_scores)[::-1][:10]  # ten most informative features
```
A symmetric, bounded version of KL divergence:

$$\mathrm{JSD}(p \,\|\, q) = \tfrac{1}{2} D_{\mathrm{KL}}(p \,\|\, m) + \tfrac{1}{2} D_{\mathrm{KL}}(q \,\|\, m), \qquad m = \tfrac{1}{2}(p + q)$$
Properties: symmetric; always finite; bounded by $\log 2$ nats (1 bit); $\sqrt{\mathrm{JSD}}$ is a metric.
Used in the original GAN formulation — with an optimal discriminator, the generator implicitly minimizes the JSD between the real and generated distributions:

$$\min_G \; 2\, \mathrm{JSD}(p_{\mathrm{data}} \,\|\, p_G) - \log 4$$
The Fisher information measures how much information the observable data carries about a parameter $\theta$:

$$\mathcal{I}(\theta) = \mathbb{E}_{x \sim p_\theta}\left[\left(\frac{\partial}{\partial \theta} \log p_\theta(x)\right)^{2}\right]$$
The term $\frac{\partial}{\partial \theta} \log p_\theta(x)$ is called the score function. Its expectation is zero; its variance is the Fisher information.
Cramér-Rao bound: no unbiased estimator $\hat{\theta}$ from $n$ i.i.d. samples can have variance smaller than:

$$\operatorname{Var}(\hat{\theta}) \ge \frac{1}{n\, \mathcal{I}(\theta)}$$
This is the fundamental limit on estimation accuracy. MLE is asymptotically efficient — it achieves this bound for large $n$.
Under regularity conditions, $\mathcal{I}(\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log p_\theta(x)\right]$: the negative expected Hessian of the log-likelihood. It defines the natural gradient:

$$\tilde{\nabla}_\theta \mathcal{L} = \mathcal{I}(\theta)^{-1} \nabla_\theta \mathcal{L}$$
Natural gradient descent accounts for the curvature of the parameter space, leading to faster convergence. Used in KFAC, Adam (approximately), and trust region methods.
The Fisher information is the local curvature of the KL divergence around $\theta$:

$$D_{\mathrm{KL}}\big(p_\theta \,\|\, p_{\theta + d\theta}\big) \approx \tfrac{1}{2}\, d\theta^\top\, \mathcal{I}(\theta)\, d\theta$$
This makes it the right measure of how fast the distribution changes as $\theta$ moves.
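The "variance of the score" definition can be verified by Monte Carlo for a Bernoulli($\theta$), whose analytic Fisher information is $\frac{1}{\theta(1-\theta)}$ (a sketch; sample size and seed are arbitrary choices):

```python
import numpy as np

def fisher_bernoulli(theta, n=200_000, seed=0):
    """Monte Carlo estimate of the Fisher information of Bernoulli(theta)."""
    rng = np.random.default_rng(seed)
    x = rng.random(n) < theta
    # score = d/dtheta log p(x; theta) = x/theta - (1-x)/(1-theta)
    score = np.where(x, 1 / theta, -1 / (1 - theta))
    return np.var(score)   # variance of the score = Fisher information

theta = 0.3
print(fisher_bernoulli(theta))   # ≈ 1/(0.3 * 0.7) ≈ 4.76
```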
Shannon's source coding theorem: the minimum average code length for outcomes from $p$ is $H(p)$ bits per symbol (in the limit of long sequences). You cannot compress below entropy.
More specifically, the optimal code assigns $\approx -\log_2 p(x)$ bits to outcome $x$. Frequent events → short codes. Rare events → long codes. This is the idea behind Huffman coding and arithmetic coding.
Kolmogorov complexity — the length of the shortest program that outputs $x$. Unlike Shannon entropy (which is a property of a distribution), Kolmogorov complexity is a property of an individual string $x$. It is uncomputable in general, but serves as a theoretical ideal.
MDL principle for model selection: choose the model that minimizes:

$$L(\text{model}) + L(\text{data} \mid \text{model})$$
This is a formalization of Occam's razor. Simpler models get shorter descriptions; if a complex model doesn't compress the data much better, it loses. MDL is closely related to Bayesian model comparison and regularization.
The Maximum Entropy Principle (Jaynes): given constraints on a distribution (e.g., fixed mean and variance), choose the distribution that maximizes entropy. This is the "least biased" choice — you add no information beyond what the constraints impose.
| Constraints | Maximum entropy distribution |
|---|---|
| None (finite support $\mathcal{X}$) | Uniform |
| Fixed mean $\mu$ on $[0, \infty)$ | Exponential($\lambda = 1/\mu$) |
| Fixed mean $\mu$, variance $\sigma^2$ on $\mathbb{R}$ | $\mathcal{N}(\mu, \sigma^2)$ |
| None, on the probability simplex | Categorical (uniform) |
| Fixed mean vector $\mu$, covariance $\Sigma$ on $\mathbb{R}^d$ | $\mathcal{N}(\mu, \Sigma)$ |
This is why the Gaussian distribution appears everywhere — it is the distribution that makes the fewest assumptions given a known mean and variance.
Boltzmann / Gibbs distribution — the MaxEnt solution to "fixed expected energy $\mathbb{E}[E(x)]$":

$$p(x) = \frac{e^{-E(x)/T}}{Z}, \qquad Z = \sum_x e^{-E(x)/T}$$
This is the softmax when $T = 1$ and $E(x) = -z_x$ (the logits). Energy-based models use this directly.
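The softmax-as-Boltzmann identity is easy to demonstrate, along with the role of the temperature (a minimal sketch; the logits are illustrative):

```python
import numpy as np

def boltzmann(energies, T=1.0):
    """Gibbs distribution p(x) ∝ exp(-E(x)/T)."""
    w = np.exp(-np.asarray(energies) / T)
    return w / w.sum()

logits = np.array([2.0, 1.0, 0.1])
# softmax(logits) = Boltzmann distribution with E(x) = -logit(x) and T = 1
p = boltzmann(-logits)
print(p)                           # ≈ [0.659, 0.242, 0.099]
print(boltzmann(-logits, T=0.1))   # low temperature sharpens toward the argmax
print(boltzmann(-logits, T=10.0))  # high temperature flattens toward uniform
```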
Given an intractable posterior $p(z \mid x)$, approximate it with a tractable $q_\phi(z)$ by minimizing:

$$D_{\mathrm{KL}}\big(q_\phi(z) \,\|\, p(z \mid x)\big)$$
Expanding:

$$D_{\mathrm{KL}}\big(q(z) \,\|\, p(z \mid x)\big) = \mathbb{E}_{q}\big[\log q(z) - \log p(x, z)\big] + \log p(x)$$
Since $\log p(x)$ is a constant w.r.t. $q$, minimizing KL is equivalent to maximizing the Evidence Lower BOund (ELBO):

$$\mathcal{L}(q) = \mathbb{E}_{q}\big[\log p(x, z) - \log q(z)\big]$$
The ELBO is a lower bound on $\log p(x)$:

$$\log p(x) = \mathcal{L}(q) + D_{\mathrm{KL}}\big(q(z) \,\|\, p(z \mid x)\big) \ge \mathcal{L}(q)$$
The VAE loss is exactly the negative ELBO with $q_\phi(z \mid x)$ (encoder) and $p_\theta(x \mid z)$ (decoder):

$$\mathcal{L}(\theta, \phi) = -\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$
With a Gaussian encoder $q_\phi(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and prior $p(z) = \mathcal{N}(0, I)$:

$$D_{\mathrm{KL}} = -\frac{1}{2} \sum_j \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$
```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, log_var):
    # Reconstruction: binary cross-entropy
    recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
    # KL to standard Gaussian — closed form
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```
Shannon's channel capacity — the maximum information rate through a noisy channel:

$$C = \max_{p(x)} I(X; Y)$$
For an additive white Gaussian noise (AWGN) channel with bandwidth $B$, signal power $S$, and noise power $N$:

$$C = B \log_2\left(1 + \frac{S}{N}\right) \ \text{bits per second}$$
This is the Shannon-Hartley theorem — the foundation of all of wireless communication theory.
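The Shannon-Hartley formula plugs in directly (a sketch; the bandwidth and SNR values are hypothetical, chosen only for illustration):

```python
import numpy as np

def awgn_capacity(bandwidth_hz, snr):
    """Shannon-Hartley: C = B * log2(1 + S/N), in bits per second."""
    return bandwidth_hz * np.log2(1 + snr)

# e.g. a 1 MHz channel at SNR = 1000 (30 dB)
print(awgn_capacity(1e6, 1000))  # ≈ 9.97e6 bits/s
```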
Shannon's channel coding theorem: reliable communication at any rate $R < C$ is possible with arbitrarily small error probability. Above $C$, it is impossible. The bound is essentially achievable in practice (modern LDPC and turbo codes come close).
Perplexity is the standard evaluation metric for language models:

$$\mathrm{PPL} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 q(x_i \mid x_{<i})}$$
Or in nats:

$$\mathrm{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \ln q(x_i \mid x_{<i})\right)$$
Interpretation: perplexity of $k$ means the model is as confused as if it had to choose uniformly among $k$ equally likely options at each step. Lower is better. A unigram model of English might have perplexity ~1000. A good LLM might have perplexity ~10.
Perplexity is $2$ raised to the cross-entropy $H(p, q)$ in bits per token, so it is minimized when $q = p$ (the model matches the true distribution), at which point $\mathrm{PPL} = 2^{H(p)}$ — the irreducible perplexity due to the true entropy of language.
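The "uniform among $k$ options" interpretation can be confirmed directly: a model assigning probability $1/k$ to every token has perplexity exactly $k$ (a minimal sketch):

```python
import numpy as np

def ppl(log_probs):
    """Perplexity from per-token log-probabilities (in nats)."""
    return np.exp(-np.mean(log_probs))

# A model that assigns uniform probability 1/k to every token:
k = 50
log_probs = np.full(100, np.log(1 / k))
print(ppl(log_probs))  # → 50.0 (up to floating point)
```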
```python
import numpy as np
from scipy.stats import entropy
from scipy.special import rel_entr

# ── Self-information ────────────────────────────────────
def surprisal(p, base=2):
    return -np.log2(p) if base == 2 else -np.log(p)

# ── Entropy ─────────────────────────────────────────────
def H(p, base=None):
    p = np.array(p, dtype=float)
    return entropy(p, base=base)  # base=None → nats, base=2 → bits

H([0.5, 0.5], base=2)   # → 1.0 bit
H([0.9, 0.1], base=2)   # → 0.469 bits
H([0.25]*4, base=2)     # → 2.0 bits (uniform over 4)

# Gaussian differential entropy
def gaussian_entropy(sigma, base=None):
    h = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
    return h if base is None else h / np.log(base)

# ── KL divergence ────────────────────────────────────────
def kl(p, q):
    # rel_entr(p,q) = p*log(p/q), handles 0*log0=0
    return np.sum(rel_entr(p, q))  # nats

kl([0.4, 0.4, 0.2], [0.33, 0.33, 0.34])  # → 0.048

# KL between two Gaussians
def kl_gaussians(mu1, s1, mu2, s2):
    return (np.log(s2/s1)
            + (s1**2 + (mu1-mu2)**2) / (2*s2**2)
            - 0.5)

# ── Cross-entropy ────────────────────────────────────────
def cross_entropy(p, q):
    return H(p) + kl(p, q)

# In PyTorch (standard):
# loss = F.cross_entropy(logits, targets)            # multiclass
# loss = F.binary_cross_entropy_with_logits(...)     # binary

# ── Jensen-Shannon divergence ────────────────────────────
def jsd(p, q):
    p, q = np.array(p), np.array(q)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

jsd([0.5, 0.5], [0.9, 0.1])  # → 0.102 nats

# ── Mutual information (discrete) ───────────────────────
def mutual_info(pxy):
    """pxy: 2D array of joint probabilities"""
    pxy = np.array(pxy)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    return kl(pxy.ravel(), (px * py).ravel())

# sklearn version (for continuous features):
# from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
# mi = mutual_info_classif(X, y)

# ── Perplexity ───────────────────────────────────────────
def perplexity(log_probs):
    """log_probs: array of log P(x_i) under model"""
    return np.exp(-np.mean(log_probs))

# ── Binary entropy function ──────────────────────────────
def h_binary(p):
    p = np.clip(p, 1e-12, 1-1e-12)
    return -p*np.log2(p) - (1-p)*np.log2(1-p)
```