$A_{ij}$ or $a_{ij}$ denotes the element at row $i$, column $j$.

**Tensor**: a generalization to more than two dimensions. A 3D tensor $X \in \mathbb{R}^{m \times n \times p}$ is indexed by $X_{ijk}$. Used in deep learning (e.g. a batch of images: batch × height × width × channels).
## 2. Vector Operations
### Addition and scalar multiplication

$$
x + y = \begin{bmatrix} x_1 + y_1 \\ \vdots \\ x_n + y_n \end{bmatrix},
\qquad
\alpha x = \begin{bmatrix} \alpha x_1 \\ \vdots \\ \alpha x_n \end{bmatrix}
$$
### Dot product (inner product)

$$
x \cdot y = x^T y = \sum_{i=1}^{n} x_i y_i
$$

Geometric interpretation:

$$
x^T y = \|x\| \, \|y\| \cos\theta
$$

where $\theta$ is the angle between the two vectors.
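For example, the angle between two vectors follows directly from this identity. A minimal sketch with arbitrary example vectors:

```python
import numpy as np

x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])

# Rearranging x^T y = ||x|| ||y|| cos(theta) to solve for theta
cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
theta = np.degrees(np.arccos(cos_theta))
print(round(theta, 6))  # → 45.0 (angle in degrees)
```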
> [!NOTE]
> If $x^T y = 0$, the vectors are **orthogonal** (perpendicular). This is fundamental: orthogonality appears in PCA, QR decomposition, attention, everywhere.
### Norms: measuring vector size

The $L_p$ norm:

$$
\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}
$$
| Norm | Formula | Name | Used for |
|------|---------|------|----------|
| $\lVert x \rVert_1$ | $\sum_i \lvert x_i \rvert$ | Manhattan / taxicab | L1 regularization (Lasso) |
| $\lVert x \rVert_2$ | $\sqrt{\sum_i x_i^2}$ | Euclidean | Default distance, L2 reg |
| $\lVert x \rVert_\infty$ | $\max_i \lvert x_i \rvert$ | Max norm | Gradient clipping |
The squared L2 norm is especially convenient:

$$
\|x\|_2^2 = x^T x = \sum_{i=1}^{n} x_i^2
$$
### Outer product

$$
x y^T = \begin{bmatrix}
x_1 y_1 & x_1 y_2 & \cdots & x_1 y_n \\
x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\
\vdots & \vdots & \ddots & \vdots \\
x_m y_1 & x_m y_2 & \cdots & x_m y_n
\end{bmatrix} \in \mathbb{R}^{m \times n}
$$
The dot product contracts two vectors to a scalar. The outer product expands two vectors into a matrix.
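The contrast is easy to see in NumPy. A small sketch with arbitrary example vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # x in R^3
y = np.array([4.0, 5.0])        # y in R^2

# Outer product: expands x and y into a 3 x 2 matrix with entries x[i] * y[j]
M = np.outer(x, y)              # same as x[:, None] * y[None, :]
print(M.shape)   # → (3, 2)
print(M[1, 0])   # → 8.0  (x[1] * y[0] = 2 * 4)
```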
## 3. Matrix Operations
### Transpose

$$
(A^T)_{ij} = A_{ji}
$$

Key identities:

- $(AB)^T = B^T A^T$
- $(A^T)^T = A$
- $x^T y = y^T x$ (a scalar, so it equals its own transpose)
### Matrix multiplication

$C = AB$ where $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$, $C \in \mathbb{R}^{m \times n}$:

$$
C_{ij} = \sum_{l=1}^{k} A_{il} B_{lj} = (\text{row } i \text{ of } A) \cdot (\text{column } j \text{ of } B)
$$
> [!WARNING]
> Matrix multiplication is **not commutative**: $AB \neq BA$ in general. It **is associative**: $(AB)C = A(BC)$.
**Matrix-vector product** $Ax$: think of it as taking a linear combination of the columns of $A$, weighted by the entries of $x$:

$$
Ax = x_1 a_1 + x_2 a_2 + \cdots + x_n a_n
$$
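A sketch of this column view, using an arbitrary example matrix:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
x = np.array([10.0, 0.1])

# A @ x equals the columns of A combined with weights x[0], x[1]
combo = x[0] * A[:, 0] + x[1] * A[:, 1]
print(np.allclose(A @ x, combo))  # → True
```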
### Hadamard product (elementwise)

$$
C = A \odot B, \qquad C_{ij} = A_{ij} B_{ij}
$$

Used in neural network gates (LSTM, attention masks). Not the same as matrix multiplication.
### Identity matrix

$$
I = \begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{bmatrix},
\qquad
AI = IA = A
$$
### Inverse

For square $A \in \mathbb{R}^{n \times n}$, if it exists:

$$
A^{-1} A = A A^{-1} = I
$$

Properties:

- $(AB)^{-1} = B^{-1} A^{-1}$
- $(A^T)^{-1} = (A^{-1})^T$
- A matrix is invertible iff its determinant is nonzero iff its columns are linearly independent
> [!WARNING]
> Never compute $A^{-1}$ explicitly if you can avoid it. Solving $Ax = b$ via `np.linalg.solve(A, b)` is numerically more stable and faster than `np.linalg.inv(A) @ b`.
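A minimal comparison, assuming a well-conditioned random system:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
b = rng.standard_normal(200)

x_solve = np.linalg.solve(A, b)   # LU factorization, no explicit inverse
x_inv = np.linalg.inv(A) @ b      # forms A^-1: slower, larger error bound

print(np.allclose(x_solve, x_inv))       # both solve the system here
print(np.linalg.norm(A @ x_solve - b))   # residual of the direct solve
```

On ill-conditioned matrices the gap between the two approaches grows much larger.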
## 4. Special Matrices
### Symmetric

$$
A = A^T \qquad (A_{ij} = A_{ji})
$$

The covariance matrix $\Sigma$ is always symmetric. All eigenvalues of a symmetric matrix are real.
### Diagonal

$$
D = \mathrm{diag}(d_1, d_2, \ldots, d_n), \qquad D_{ij} = 0 \text{ if } i \neq j
$$

Efficient: $Dx$ just scales each entry. $D^{-1} = \mathrm{diag}(1/d_1, \ldots, 1/d_n)$.
### Orthogonal

$$
Q^T Q = I \quad \Rightarrow \quad Q^{-1} = Q^T
$$

Columns of $Q$ are orthonormal (unit length, mutually orthogonal). Orthogonal matrices preserve lengths and angles; they represent rotations and reflections. Used in PCA and QR decomposition.
### Positive Semi-Definite (PSD)

A symmetric matrix $A$ is PSD if:

$$
x^T A x \geq 0 \quad \forall x
$$

Written $A \succeq 0$. Covariance matrices are always PSD. All eigenvalues of a PSD matrix are $\geq 0$.
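A quick check that a sample covariance matrix is PSD, using synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 4))
Sigma = np.cov(X, rowvar=False)          # 4 x 4 covariance matrix

eigenvalues = np.linalg.eigvalsh(Sigma)  # eigvalsh: for symmetric input
print(np.all(eigenvalues >= -1e-10))     # → True: all eigenvalues >= 0 (up to roundoff)
```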
## 5. Linear Independence, Span, Rank
A set of vectors $\{v_1, \ldots, v_k\}$ is **linearly independent** if:

$$
\sum_{i=1}^{k} \alpha_i v_i = 0 \implies \alpha_1 = \alpha_2 = \cdots = \alpha_k = 0
$$

i.e. no vector in the set can be written as a combination of the others.
**Span**: the set of all vectors reachable as linear combinations of $\{v_1, \ldots, v_k\}$.

**Rank** of a matrix $A$: the number of linearly independent columns (= the number of linearly independent rows):

$$
\mathrm{rank}(A) \leq \min(m, n)
$$

- **Full rank**: $\mathrm{rank}(A) = \min(m, n)$, i.e. no redundant columns/rows
- **Rank deficient**: some columns are linear combinations of others; $A^T A$ is not invertible and the normal equation has no unique solution
```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(np.linalg.matrix_rank(A))  # → 2 (rank deficient: row 3 = 2 * row 2 - row 1)
```
## 6. Eigenvalues and Eigenvectors
For square $A \in \mathbb{R}^{n \times n}$:

$$
Av = \lambda v
$$

$\lambda$ is an **eigenvalue** (real for symmetric $A$, possibly complex in general), $v \neq 0$ is the corresponding **eigenvector**.

Interpretation: $v$ is a direction that $A$ only scales (by $\lambda$); it does not rotate.
### Finding eigenvalues

$$
\det(A - \lambda I) = 0
$$

This is the **characteristic polynomial**, of degree $n$. For large matrices it is solved numerically.
### Eigendecomposition

If $A$ has $n$ linearly independent eigenvectors:

$$
A = V \Lambda V^{-1}
$$

where $V$ has the eigenvectors as columns and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$.

For **symmetric** $A$ (which includes all covariance matrices):

$$
A = Q \Lambda Q^T
$$

with $Q$ orthogonal. This is the **spectral theorem**. All eigenvalues are real.
```python
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)  # eigh for symmetric matrices
# eigenvalues[i] corresponds to eigenvectors[:, i]
```
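As a sanity check of the spectral theorem, a small symmetric example (the matrix here is arbitrary):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])               # symmetric

lam, Q = np.linalg.eigh(A)               # lam real, ascending; Q orthogonal
print(lam)                               # → [1. 3.]
print(np.allclose(Q @ np.diag(lam) @ Q.T, A))  # → True: A = Q Λ Q^T
print(np.allclose(Q.T @ Q, np.eye(2)))         # → True: Q^T Q = I
```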
> [!NOTE]
> **Why eigenvalues matter for ML:**
>
> - **PCA**: eigenvectors of $X^T X$ are the principal components; eigenvalues measure how much variance each direction captures
> - **Optimization**: the curvature of the loss surface is described by the Hessian matrix; its eigenvalues tell you whether you're at a saddle point, local min, or flat region
> - **Graph neural networks**: spectral convolutions are defined via eigendecomposition of the graph Laplacian
## 7. Singular Value Decomposition (SVD)
For **any** matrix $A \in \mathbb{R}^{m \times n}$ (it need not be square):

$$
A = U \Sigma V^T
$$

where:

- $U \in \mathbb{R}^{m \times m}$: orthogonal, columns are **left singular vectors**
- $\Sigma \in \mathbb{R}^{m \times n}$: diagonal, with **singular values** $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$
- $V \in \mathbb{R}^{n \times n}$: orthogonal, columns are **right singular vectors**
### Relationship to eigendecomposition

$$
A^T A = V \Sigma^T \Sigma V^T \quad (\text{eigendecomposition of } A^T A),
\qquad
\sigma_i = \sqrt{\lambda_i(A^T A)}
$$
### Low-rank approximation (Eckart–Young theorem)

The best rank-$k$ approximation to $A$ (in Frobenius norm) is:

$$
A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T
$$

Keep only the top $k$ singular values and vectors. This is the foundation of PCA, matrix factorization, image compression, and word embeddings.
```python
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # economy SVD
k = 10
A_approx = (U[:, :k] * s[:k]) @ Vt[:k, :]  # best rank-k approximation of A
```
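The Eckart–Young error formula can also be verified numerically; the truncation error equals the norm of the discarded singular values (random example matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 30))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

# ||A - A_k||_F = sqrt(sigma_{k+1}^2 + ... + sigma_r^2)
err = np.linalg.norm(A - A_k, 'fro')
print(np.isclose(err, np.sqrt(np.sum(s[k:] ** 2))))  # → True
```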
## 8. The Data Matrix Convention
In ML, we stack $n$ examples as rows:

$$
X = \begin{bmatrix}
\text{---} \; x^{(1)T} \; \text{---} \\
\text{---} \; x^{(2)T} \; \text{---} \\
\vdots \\
\text{---} \; x^{(n)T} \; \text{---}
\end{bmatrix} \in \mathbb{R}^{n \times d}
$$

$n$ = number of samples, $d$ = number of features.
Key products to recognize instantly:

| Expression | Shape | Meaning |
|------------|-------|---------|
| $X^T X$ | $d \times d$ | Feature covariance (up to $1/n$) |
| $X X^T$ | $n \times n$ | Sample similarity / Gram matrix |
| $X\theta$ | $n \times 1$ | Predictions for all samples at once |
| $X^T(y - X\theta)$ | $d \times 1$ | Gradient of the MSE loss (up to $-2/n$) |
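These shapes can be confirmed directly (synthetic $X$, $y$, $\theta$; NumPy returns 1-D arrays where the table writes column vectors):

```python
import numpy as np

n, d = 100, 3                        # n samples, d features
rng = np.random.default_rng(3)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
theta = rng.standard_normal(d)

print((X.T @ X).shape)                # → (3, 3)      feature covariance (up to 1/n)
print((X @ X.T).shape)                # → (100, 100)  Gram matrix
print((X @ theta).shape)              # → (100,)      predictions for all samples
print((X.T @ (y - X @ theta)).shape)  # → (3,)        gradient direction of the MSE loss
```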
## 9. Key Calculus on Vectors and Matrices
These derivatives appear constantly in deriving ML algorithms.
### Gradient of a linear function

$$
\frac{\partial}{\partial x}(a^T x) = a
$$

### Gradient of a quadratic form

$$
\frac{\partial}{\partial x}(x^T A x) = (A + A^T)x = 2Ax \quad \text{if } A \text{ symmetric}
$$
### Gradient of MSE loss

$$
L(\theta) = \frac{1}{n}\|y - X\theta\|^2 = \frac{1}{n}(y - X\theta)^T(y - X\theta)
$$

$$
\frac{\partial L}{\partial \theta} = -\frac{2}{n} X^T (y - X\theta)
$$

Setting this to zero gives the **Normal Equation**:

$$
\theta^* = (X^T X)^{-1} X^T y
$$
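A sketch comparing the normal equation against `np.linalg.lstsq` on synthetic data; for ill-conditioned $X$ the normal equation loses accuracy and the SVD-based least squares is preferred:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 3))
y = rng.standard_normal(100)

# Normal equation (solve, not inv; fine for small well-conditioned problems)
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least squares via SVD: avoids forming X^T X at all
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_normal, theta_lstsq))  # → True
```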
### Chain rule (vector form)

$$
\frac{\partial}{\partial x} f(g(x)) = \left(\frac{\partial g}{\partial x}\right)^T \frac{\partial f}{\partial g}
$$

This is the foundation of backpropagation, which is just repeated application of the vector chain rule through each layer.
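The vector chain rule can be checked with finite differences. Here $g(x) = Wx$ and $f(u) = \|u\|^2$, so $\partial g / \partial x = W$ and $\partial f / \partial g = 2Wx$ (arbitrary random example):

```python
import numpy as np

rng = np.random.default_rng(5)
W = rng.standard_normal((4, 3))
x = rng.standard_normal(3)

# Chain rule: grad of f(g(x)) = (dg/dx)^T df/dg = W^T (2 W x)
grad = W.T @ (2 * (W @ x))

# Central finite-difference check of each component
eps = 1e-6
f = lambda v: np.sum((W @ v) ** 2)
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
               for e in np.eye(3)])
print(np.allclose(grad, fd, atol=1e-4))  # → True
```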
## 10. Trace and Frobenius Norm
**Trace**: the sum of the diagonal elements of a square matrix:

$$
\mathrm{tr}(A) = \sum_{i=1}^{n} A_{ii} = \sum_{i=1}^{n} \lambda_i
$$

Useful identity: $x^T A x = \mathrm{tr}(x^T A x) = \mathrm{tr}(A x x^T)$ (a scalar equals its own trace, and the trace is invariant under cyclic permutation).
**Frobenius norm**: the matrix equivalent of the L2 vector norm:

$$
\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\mathrm{tr}(A^T A)} = \sqrt{\sum_i \sigma_i^2}
$$

Used in regularization (e.g. weight decay in neural networks penalizes $\|W\|_F^2$).
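Both identities check out numerically (random example matrix and vector):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 4))
x = rng.standard_normal(4)

# Cyclic property of the trace: x^T A x = tr(A x x^T)
print(np.isclose(x @ A @ x, np.trace(A @ np.outer(x, x))))  # → True

# Frobenius norm three ways: entries, trace, singular values
s = np.linalg.svd(A, compute_uv=False)
fro = np.linalg.norm(A, 'fro')
print(np.isclose(fro, np.sqrt(np.trace(A.T @ A))))  # → True
print(np.isclose(fro, np.sqrt(np.sum(s ** 2))))     # → True
```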
## 11. Quick Reference
```python
import numpy as np

# Dot product
np.dot(a, b)               # or a @ b for vectors

# Matrix multiply
A @ B

# Transpose
A.T

# Inverse
np.linalg.inv(A)

# Solve Ax = b (prefer over inv)
np.linalg.solve(A, b)

# Eigenvalues (general)
vals, vecs = np.linalg.eig(A)

# Eigenvalues (symmetric, more stable)
vals, vecs = np.linalg.eigh(A)   # eigenvalues returned in ascending order

# SVD
U, s, Vt = np.linalg.svd(A)
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # economy SVD

# Norms
np.linalg.norm(x)           # L2 by default
np.linalg.norm(x, ord=1)    # L1
np.linalg.norm(A, 'fro')    # Frobenius

# Rank
np.linalg.matrix_rank(A)

# Determinant
np.linalg.det(A)

# Trace
np.trace(A)
```