Mathematics

Linear Algebra Essentials

Vectors, matrices, dot product, eigenvalues — everything used in ML derivations


1. Scalars, Vectors, Matrices, Tensors

Scalar — a single number. $x \in \mathbb{R}$

Vector — an ordered list of $n$ numbers. $\mathbf{x} \in \mathbb{R}^n$

$$\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$

By convention vectors are column vectors. A row vector is written $\mathbf{x}^T$.

Matrix — a rectangular array of numbers. $A \in \mathbb{R}^{m \times n}$ has $m$ rows and $n$ columns.

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$

$A_{ij}$ or $a_{ij}$ denotes the element at row $i$, column $j$.

Tensor — a generalization to more than two dimensions. A 3D tensor $\mathcal{X} \in \mathbb{R}^{m \times n \times p}$ is indexed by $\mathcal{X}_{ijk}$. Tensors are used throughout deep learning (e.g. a batch of images: batch × height × width × channels).
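These four objects map directly onto NumPy array shapes; a minimal sketch (the values and names here are purely illustrative):

```python
import numpy as np

scalar = 3.5                                  # a single number
vector = np.array([1.0, 2.0, 3.0])            # shape (3,): a vector in R^3
matrix = np.arange(6.0).reshape(2, 3)         # shape (2, 3): 2 rows, 3 columns
tensor = np.zeros((4, 28, 28, 3))             # batch x height x width x channels

print(vector.shape, matrix.shape, tensor.ndim)  # (3,) (2, 3) 4
```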


2. Vector Operations

Addition and scalar multiplication

$$\mathbf{x} + \mathbf{y} = \begin{pmatrix} x_1 + y_1 \\ \vdots \\ x_n + y_n \end{pmatrix} \qquad \alpha \mathbf{x} = \begin{pmatrix} \alpha x_1 \\ \vdots \\ \alpha x_n \end{pmatrix}$$

Dot product (inner product)

$$\mathbf{x} \cdot \mathbf{y} = \mathbf{x}^T \mathbf{y} = \sum_{i=1}^n x_i y_i$$

Geometric interpretation:

$$\mathbf{x}^T \mathbf{y} = \|\mathbf{x}\| \|\mathbf{y}\| \cos\theta$$

where $\theta$ is the angle between the two vectors.

[!NOTE] If $\mathbf{x}^T \mathbf{y} = 0$, the vectors are orthogonal (perpendicular). This is fundamental — orthogonality appears in PCA, QR decomposition, attention, everywhere.
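Both views of the dot product can be checked numerically; a quick sketch (the example vectors are chosen so the product comes out zero):

```python
import numpy as np

x = np.array([1.0, 0.0, 2.0])
y = np.array([0.0, 3.0, 0.0])

dot = x @ y                                                # sum of elementwise products
cos_theta = dot / (np.linalg.norm(x) * np.linalg.norm(y))  # geometric form

print(dot, cos_theta)  # 0.0 0.0 -> orthogonal, angle is 90 degrees
```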

Norms — measuring vector size

The $L^p$ norm:

$$\|\mathbf{x}\|_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}$$

| Norm | Formula | Name | Used for |
| --- | --- | --- | --- |
| $\lVert \mathbf{x} \rVert_1$ | $\sum_i \lvert x_i \rvert$ | Manhattan / taxicab | L1 regularization (Lasso) |
| $\lVert \mathbf{x} \rVert_2$ | $\sqrt{\sum_i x_i^2}$ | Euclidean | Default distance, L2 regularization |
| $\lVert \mathbf{x} \rVert_\infty$ | $\max_i \lvert x_i \rvert$ | Max norm | Gradient clipping |

The squared $L^2$ norm is especially convenient:

$$\|\mathbf{x}\|_2^2 = \mathbf{x}^T\mathbf{x} = \sum_{i=1}^n x_i^2$$

Outer product

For $\mathbf{x} \in \mathbb{R}^m$ and $\mathbf{y} \in \mathbb{R}^n$:

$$\mathbf{x}\mathbf{y}^T = \begin{pmatrix} x_1 y_1 & x_1 y_2 & \cdots \\ x_2 y_1 & x_2 y_2 & \cdots \\ \vdots & & \ddots \end{pmatrix} \in \mathbb{R}^{m \times n}$$

The dot product contracts two vectors to a scalar. The outer product expands two vectors into a matrix.
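A small NumPy illustration of that contraction/expansion contrast (example vectors only):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # in R^3
y = np.array([10.0, 20.0])      # in R^2

inner = x @ x                   # contracts to a scalar
outer = np.outer(x, y)          # expands to a 3x2 matrix

print(inner)        # 14.0
print(outer.shape)  # (3, 2)
print(outer[1, 1])  # x[1] * y[1] = 40.0
```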


3. Matrix Operations

Transpose

$$(A^T)_{ij} = A_{ji}$$

Key identities:

  • $(AB)^T = B^T A^T$
  • $(A^T)^T = A$
  • $\mathbf{x}^T \mathbf{y} = \mathbf{y}^T \mathbf{x}$ (a scalar, so it equals its own transpose)
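The identities are easy to confirm on random matrices; a sketch with arbitrary shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))

print(np.allclose((A @ B).T, B.T @ A.T))  # True: (AB)^T = B^T A^T
print(np.allclose(A.T.T, A))              # True: (A^T)^T = A
```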

Matrix multiplication

$C = AB$ where $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$, $C \in \mathbb{R}^{m \times n}$:

$$C_{ij} = \sum_{l=1}^k A_{il} B_{lj} = \text{(row } i \text{ of } A) \cdot \text{(column } j \text{ of } B)$$

[!WARNING] Matrix multiplication is not commutative: $AB \neq BA$ in general. It is associative: $(AB)C = A(BC)$.

Matrix-vector product $A\mathbf{x}$: think of it as taking a linear combination of the columns of $A$, weighted by the entries of $\mathbf{x}$:

$$A\mathbf{x} = x_1 \mathbf{a}_1 + x_2 \mathbf{a}_2 + \cdots + x_n \mathbf{a}_n$$
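A quick sketch checking the column-combination view against the direct product (example values only):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
x = np.array([10.0, -1.0])

direct = A @ x
as_columns = x[0] * A[:, 0] + x[1] * A[:, 1]  # linear combination of the columns

print(direct)                           # [ 8. 26. 44.]
print(np.allclose(direct, as_columns))  # True
```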

Hadamard product (elementwise)

$$C = A \odot B \qquad C_{ij} = A_{ij} B_{ij}$$

Used in neural network gates (LSTM, attention masks). Not the same as matrix multiplication.
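The contrast is easy to see on a 2×2 example (values are illustrative):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[10, 20],
              [30, 40]])

hadamard = A * B   # elementwise: [[10, 40], [90, 160]]
matmul = A @ B     # matrix product: [[70, 100], [150, 220]]

print(hadamard[0, 0], matmul[0, 0])  # 10 70
```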

Identity matrix

$$I = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} \qquad AI = IA = A$$

Inverse

For square $A \in \mathbb{R}^{n \times n}$, if it exists:

$$A^{-1} A = A A^{-1} = I$$

Properties:

  • $(AB)^{-1} = B^{-1} A^{-1}$
  • $(A^T)^{-1} = (A^{-1})^T$
  • A matrix is invertible iff its determinant is nonzero, iff its columns are linearly independent

[!WARNING] Never compute $A^{-1}$ explicitly if you can avoid it. Solving $A\mathbf{x} = \mathbf{b}$ via np.linalg.solve(A, b) is numerically more stable and faster than np.linalg.inv(A) @ b.
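A sketch of the comparison the warning describes, on a random test system; both routes agree here, but solve factorizes $A$ directly and carries the better error bound:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 100))
b = rng.standard_normal(100)

x_solve = np.linalg.solve(A, b)   # factorizes A, never forms A^{-1}
x_inv = np.linalg.inv(A) @ b      # explicit inverse: slower, less stable

print(np.allclose(x_solve, x_inv))  # True (to working precision)
print(np.allclose(A @ x_solve, b))  # True: the solution satisfies Ax = b
```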


4. Special Matrices

Symmetric

$$A = A^T \qquad (A_{ij} = A_{ji})$$

The covariance matrix $\Sigma$ is always symmetric. All eigenvalues of a symmetric matrix are real.

Diagonal

$$D = \text{diag}(d_1, d_2, \ldots, d_n) \qquad D_{ij} = 0 \text{ if } i \neq j$$

Efficient: $D\mathbf{x}$ just scales each entry. If all $d_i \neq 0$, then $D^{-1} = \text{diag}(1/d_1, \ldots, 1/d_n)$.

Orthogonal

$$Q^T Q = I \quad \Rightarrow \quad Q^{-1} = Q^T$$

Columns of QQ are orthonormal (unit length, mutually orthogonal). Orthogonal matrices preserve lengths and angles — they represent rotations and reflections. Used in PCA, QR decomposition.
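One convenient way to get an orthogonal matrix in practice is the Q factor of a QR decomposition; a sketch checking the defining properties on a random input:

```python
import numpy as np

rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # Q factor is orthogonal

x = rng.standard_normal(4)

print(np.allclose(Q.T @ Q, np.eye(4)))                        # True: Q^T Q = I
print(np.allclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))  # True: lengths preserved
```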

Positive Semi-Definite (PSD)

A symmetric matrix $A$ is PSD if:

$$\mathbf{x}^T A \mathbf{x} \geq 0 \quad \forall \mathbf{x}$$

Written $A \succeq 0$. Covariance matrices are always PSD. All eigenvalues of a PSD matrix are $\geq 0$.
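Both characterizations (nonnegative quadratic form, nonnegative eigenvalues) can be checked on a covariance matrix; a sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 5))
Sigma = np.cov(X, rowvar=False)       # covariance matrix: symmetric PSD

eigenvalues = np.linalg.eigvalsh(Sigma)
x = rng.standard_normal(5)

print(np.all(eigenvalues >= -1e-12))  # True: eigenvalues >= 0 (up to roundoff)
print(x @ Sigma @ x >= 0)             # True: the quadratic form is nonnegative
```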


5. Linear Independence, Span, Rank

A set of vectors $\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$ is linearly independent if:

$$\sum_{i=1}^k \alpha_i \mathbf{v}_i = \mathbf{0} \implies \alpha_1 = \alpha_2 = \cdots = \alpha_k = 0$$

i.e. no vector in the set can be written as a linear combination of the others.

Span — the set of all vectors reachable as linear combinations of $\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$.

Rank of a matrix $A$ — the number of linearly independent columns (= number of linearly independent rows):

$$\text{rank}(A) \leq \min(m, n)$$

  • Full rank: $\text{rank}(A) = \min(m,n)$ — no redundant columns/rows
  • Rank deficient: some columns are linear combinations of others — $A^T A$ is not invertible, and the normal equation has no unique solution
```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(np.linalg.matrix_rank(A))  # → 2 (rank deficient: row 3 = 2·row 2 − row 1)
```

6. Eigenvalues and Eigenvectors

For square $A \in \mathbb{R}^{n \times n}$:

$$A\mathbf{v} = \lambda \mathbf{v}$$

$\lambda$ is an eigenvalue, $\mathbf{v} \neq \mathbf{0}$ is the corresponding eigenvector. (For a general real matrix the eigenvalues may be complex; for symmetric matrices they are real.)

Interpretation: $\mathbf{v}$ is a direction that $A$ only scales (by $\lambda$) — it does not rotate.

Finding eigenvalues

$$\det(A - \lambda I) = 0$$

This is the characteristic polynomial, of degree $n$. For large matrices it is solved numerically.
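A sketch verifying both the eigenvector equation and the characteristic-polynomial condition on a small symmetric matrix (values chosen for illustration):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals, vecs = np.linalg.eig(A)   # eigenvalues are 3 and 1 (in some order)

v0 = vecs[:, 0]
print(np.allclose(A @ v0, vals[0] * v0))  # True: A v = lambda v

# det(A - lambda I) vanishes at each eigenvalue
print(np.allclose([np.linalg.det(A - lam * np.eye(2)) for lam in vals], 0))  # True
```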

Eigendecomposition

If $A$ has $n$ linearly independent eigenvectors:

$$A = V \Lambda V^{-1}$$

where $V$ has the eigenvectors as columns and $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n)$.

For symmetric $A$ (which includes all covariance matrices):

$$A = Q \Lambda Q^T$$

with QQ orthogonal. This is the spectral theorem. All eigenvalues are real.

```python
import numpy as np

X = np.random.randn(100, 3)           # example data; any symmetric matrix works
cov_matrix = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)  # eigh for symmetric matrices
# eigenvalues[i] corresponds to eigenvectors[:, i]
```

[!MATH] Why eigenvalues matter for ML:

  • PCA — eigenvectors of $X^T X$ are the principal components; eigenvalues measure how much variance each direction captures
  • Optimization — the curvature of the loss surface is described by the Hessian matrix; its eigenvalues tell you whether you're in a saddle point, local min, or flat region
  • Graph neural networks — spectral convolutions are defined via eigendecomposition of the graph Laplacian

7. Singular Value Decomposition (SVD)

For any matrix $A \in \mathbb{R}^{m \times n}$ (need not be square):

$$A = U \Sigma V^T$$

where:

  • $U \in \mathbb{R}^{m \times m}$ — orthogonal, columns are left singular vectors
  • $\Sigma \in \mathbb{R}^{m \times n}$ — diagonal with singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$
  • $V \in \mathbb{R}^{n \times n}$ — orthogonal, columns are right singular vectors

Relationship to eigendecomposition

$$A^T A = V \Sigma^T \Sigma V^T \qquad \text{(eigendecomposition of } A^T A\text{)}$$

$$\sigma_i = \sqrt{\lambda_i(A^T A)}$$
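This relationship is easy to verify numerically; a sketch on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 4))

s = np.linalg.svd(A, compute_uv=False)   # singular values, descending
lam = np.linalg.eigvalsh(A.T @ A)[::-1]  # eigenvalues of A^T A, descending

print(np.allclose(s, np.sqrt(lam)))      # True: sigma_i = sqrt(lambda_i(A^T A))
```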

Low-rank approximation (Eckart–Young theorem)

The best rank-$k$ approximation to $A$ (in Frobenius norm) is:

$$A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^T$$

Keep only the top kk singular values and vectors. This is the foundation of PCA, matrix factorization, image compression, and word embeddings.

```python
import numpy as np

A = np.random.randn(100, 50)                      # example data matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # economy SVD
k = 10
A_approx = (U[:, :k] * s[:k]) @ Vt[:k, :]         # best rank-k approximation
```

8. The Data Matrix Convention

In ML, we stack nn examples as rows:

$$X = \begin{pmatrix} \mathbf{x}^{(1)T} \\ \mathbf{x}^{(2)T} \\ \vdots \\ \mathbf{x}^{(n)T} \end{pmatrix} \in \mathbb{R}^{n \times d}$$

$n$ = number of samples, $d$ = number of features.

Key products to recognize instantly:

| Expression | Shape | Meaning |
| --- | --- | --- |
| $X^T X$ | $d \times d$ | Feature covariance (up to $1/n$) |
| $X X^T$ | $n \times n$ | Sample similarity / Gram matrix |
| $X \boldsymbol{\theta}$ | $n \times 1$ | Predictions for all samples at once |
| $X^T (y - X\boldsymbol{\theta})$ | $d \times 1$ | Gradient of the MSE loss (up to a constant factor) |

9. Key Calculus on Vectors and Matrices

These derivatives appear constantly in deriving ML algorithms.

Gradient of a linear function

$$\frac{\partial}{\partial \mathbf{x}} (\mathbf{a}^T \mathbf{x}) = \mathbf{a}$$

Gradient of a quadratic form

$$\frac{\partial}{\partial \mathbf{x}} (\mathbf{x}^T A \mathbf{x}) = (A + A^T)\mathbf{x} = 2A\mathbf{x} \quad \text{if } A \text{ symmetric}$$

Gradient of MSE loss

$$\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{n}\|y - X\boldsymbol{\theta}\|^2 = \frac{1}{n}(y - X\boldsymbol{\theta})^T(y - X\boldsymbol{\theta})$$

$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\theta}} = -\frac{2}{n} X^T (y - X\boldsymbol{\theta})$$

Setting to zero → Normal Equation:

$$\boldsymbol{\theta}^* = (X^T X)^{-1} X^T y$$
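A sketch checking the derivation end to end on random data: the normal-equation solution matches np.linalg.lstsq, and the gradient vanishes there (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 50, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

theta = np.linalg.solve(X.T @ X, X.T @ y)            # normal equation (solve, not inv)
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # preferred in practice

grad = -2 / n * X.T @ (y - X @ theta)                # MSE gradient at the optimum

print(np.allclose(theta, theta_lstsq))  # True
print(np.allclose(grad, 0))             # True: gradient is zero at theta*
```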

Chain rule (vector form)

$$\frac{\partial}{\partial \mathbf{x}} f(g(\mathbf{x})) = \left(\frac{\partial g}{\partial \mathbf{x}}\right)^T \frac{\partial f}{\partial g}$$

This is the foundation of backpropagation — just repeated application of the vector chain rule through each layer.


10. Trace and Frobenius Norm

Trace — sum of diagonal elements of a square matrix:

$$\text{tr}(A) = \sum_{i=1}^n A_{ii} = \sum_{i=1}^n \lambda_i$$

Useful identity: $\mathbf{x}^T A \mathbf{x} = \text{tr}(\mathbf{x}^T A \mathbf{x}) = \text{tr}(A \mathbf{x}\mathbf{x}^T)$

Frobenius norm — the matrix equivalent of the L2L^2 vector norm:

$$\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\text{tr}(A^T A)} = \sqrt{\sum_i \sigma_i^2}$$

Used in regularization (e.g. weight decay in neural networks penalizes WF2\|W\|_F^2).
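Both the trace identity and the three Frobenius-norm formulas can be confirmed numerically; a sketch on random inputs:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 4))
x = rng.standard_normal(4)

# Cyclic trace identity: x^T A x = tr(A x x^T)
print(np.allclose(x @ A @ x, np.trace(A @ np.outer(x, x))))  # True

# Three equivalent expressions for the Frobenius norm
fro = np.sqrt((A ** 2).sum())
print(np.allclose(fro, np.sqrt(np.trace(A.T @ A))))                                # True
print(np.allclose(fro, np.sqrt((np.linalg.svd(A, compute_uv=False) ** 2).sum())))  # True
```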


11. Quick Reference

```python
import numpy as np

# Dot product
np.dot(a, b)          # or a @ b for vectors

# Matrix multiply
A @ B

# Transpose
A.T

# Inverse
np.linalg.inv(A)

# Solve Ax = b  (prefer over inv)
np.linalg.solve(A, b)

# Eigenvalues (general)
vals, vecs = np.linalg.eig(A)

# Eigenvalues (symmetric — more stable)
vals, vecs = np.linalg.eigh(A)   # eigenvalues returned in ascending order

# SVD
U, s, Vt = np.linalg.svd(A)
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # economy SVD

# Norms
np.linalg.norm(x)         # L2 by default
np.linalg.norm(x, ord=1)  # L1
np.linalg.norm(A, 'fro')  # Frobenius

# Rank
np.linalg.matrix_rank(A)

# Determinant
np.linalg.det(A)

# Trace
np.trace(A)
```