Lecture · L03

Multivariate linear regression


The Limitation We Are About to Remove

Everything in Lecture 1 and the gradient descent section assumed a single input feature: house size predicts house price. The hypothesis was $h_\theta(x) = \theta_0 + \theta_1 x$, with exactly two parameters to learn.

This is useful for building intuition, but it is not how real problems look. A house has a size, a number of bathrooms, a number of floors, an age, a neighbourhood, a distance to the city centre. A tumour has a size, a cell uniformity score, a cell thickness measure, a clump thickness. Any real dataset has many features, often hundreds or thousands. We need the machinery to handle all of them at once.

The generalisation is called multivariate linear regression — or equivalently, linear regression with multiple features. The ideas are identical to the univariate case; what changes is the notation, and the fact that the parameters are now a vector rather than a pair of numbers.


New Notation

Before writing the hypothesis, we need to extend the notation from Lecture 1 to handle multiple features.

| Symbol | Meaning |
| --- | --- |
| $m$ | Number of training examples (rows in the dataset) |
| $n$ | Number of features (columns, not counting the output $y$) |
| $x^{(i)}$ | The full feature vector for the $i$-th training example: a vector of $n$ numbers |
| $x^{(i)}_j$ | The value of feature $j$ in the $i$-th training example |
| $y^{(i)}$ | The output for the $i$-th training example |

So $x^{(i)}$ is no longer a single number: it is a column vector with $n$ entries. For example, if $x^{(4)}$ is the fourth house in a dataset with four features (size, bathrooms, floors, age), then:

$$x^{(4)} = \begin{bmatrix} 852 \\ 2 \\ 1 \\ \vdots \end{bmatrix} \in \mathbb{R}^n$$

And $x^{(4)}_2 = 2$ would be the number of bathrooms of that fourth house. The index in parentheses selects the example; the subscript selects the feature within that example.

[!WARNING] Keep the two indices straight: the superscript $(i)$ always refers to the training example (row), and the subscript $j$ always refers to the feature (column). $x^{(4)}_2$ is feature 2 of example 4, not $x_2$ raised to the fourth power.
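In code, the two indices map directly onto row and column lookups. A minimal sketch with NumPy; apart from the 852 and the 2 from the example above, the numbers are made up:

```python
import numpy as np

# A toy dataset laid out like the notation table: m = 4 examples (rows),
# n = 4 features (columns). Hypothetical column order: size, bathrooms, floors, age.
X = np.array([
    [2104, 5, 1, 45],
    [1416, 3, 2, 40],
    [1534, 3, 2, 30],
    [ 852, 2, 1, 36],
])

m, n = X.shape       # m = 4 training examples, n = 4 features

x_4 = X[3]           # x^{(4)}: the full feature vector of example 4 (0-indexed row 3)
x_4_2 = X[3, 1]      # x^{(4)}_2: feature 2 of example 4 (0-indexed column 1)
```

The superscript picks a row, the subscript picks a column within that row; the only trap is the shift from the lecture's 1-based indices to NumPy's 0-based ones.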


The Multivariate Hypothesis

In the univariate case, the hypothesis was $h_\theta(x) = \theta_0 + \theta_1 x$. The natural extension to $n$ features is to give each feature its own parameter:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$

Each $\theta_j$ is the weight on feature $j$: it tells the model how much feature $j$ contributes to the prediction. If you are predicting house price and $x_1$ is size in square metres, then $\theta_1$ is roughly "how many euros one extra square metre adds to the price, holding everything else equal."

Now we have $n + 1$ parameters: $\theta_0$ (the intercept) and $\theta_1$ through $\theta_n$ (one weight per feature). Writing them out each time is cumbersome. We need a compact notation.


Vectorising the Hypothesis

The key notational trick that makes everything clean is to introduce a zeroth feature: define $x_0 = 1$ for every single training example. This adds no information; it is just a constant 1 appended to every feature vector.

Why? Because it lets us absorb the intercept $\theta_0$ into the same structure as all the other parameters. The feature vector for any example becomes:

$$x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^{n+1} \qquad \text{where } x_0 = 1$$

And the parameter vector is:

$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} \in \mathbb{R}^{n+1}$$

Now the hypothesis is simply a dot product: the transpose of $\theta$ multiplied by $x$:

$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n = \theta^T x$$

This is the vectorised form of the multivariate hypothesis. Every time you see $h_\theta(x) = \theta^T x$, it means exactly the same thing as the long sum, just compressed into two symbols. This notation is what everyone uses in papers, textbooks, and code. Get comfortable with it.
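As a sanity check that $\theta^T x$ really is the whole hypothesis, here is the dot product in NumPy. The numbers for $\theta$ and $x$ are made up for illustration:

```python
import numpy as np

def h(theta, x):
    """Hypothesis h_theta(x) = theta^T x, where x already includes x0 = 1."""
    return theta @ x

# Hypothetical parameters and one feature vector with x0 = 1 prepended.
theta = np.array([80.0, 0.1, 10.0])   # [theta0, theta1, theta2]
x = np.array([1.0, 852.0, 2.0])       # [x0, x1, x2]

prediction = h(theta, x)              # 80 + 0.1*852 + 10*2 = 185.2
```

One `@` replaces the whole $\theta_0 x_0 + \cdots + \theta_n x_n$ sum, for any $n$.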

The cost function extends in the obvious way; the form is identical to before, just with the new $h$:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Note we now write $J(\theta)$ rather than $J(\theta_0, \theta_1)$: $\theta$ is the whole parameter vector, not just a pair. The goal is still $\min_\theta J(\theta)$.
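The cost function translates directly into vectorised code. A sketch, assuming a design matrix `X` whose first column is the $x_0 = 1$ feature:

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = (1/2m) * sum of squared errors; X includes the x0 = 1 column."""
    m = len(y)
    errors = X @ theta - y              # h_theta(x^(i)) - y^(i) for all i at once
    return (errors @ errors) / (2 * m)

# Tiny hypothetical dataset where y = x1 exactly, so theta = [0, 1] has zero cost.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

J_perfect = cost(np.array([0.0, 1.0]), X, y)   # 0.0
```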


Gradient Descent for Multiple Features

With the vectorised notation in place, the gradient descent update rule is remarkably clean. We now need to update every $\theta_j$ for $j = 0, 1, \ldots, n$ at every iteration, simultaneously.

The general update rule is:

Repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \qquad \text{simultaneously for all } j = 0, 1, \ldots, n$$

Working out the partial derivative (the same chain rule as before: the inner term $h_\theta(x^{(i)}) - y^{(i)}$ differentiates with respect to $\theta_j$ to give $x^{(i)}_j$) gives:

$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}_j$$

Substituting back, the concrete update is:

$$\theta_j := \theta_j - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}_j$$

For $j = 0$ this simplifies slightly because $x^{(i)}_0 = 1$ for every example, so the $x^{(i)}_0$ factor disappears and you just sum the raw errors. The univariate updates from the previous section are the special case $n = 1$ of this more general formula.

The structure is the same no matter how many features you have: the sum is always over all $m$ examples, and the gradient for each parameter picks up the appropriate feature value as a weight. Scale this to $n = 10{,}000$ features and the formula looks exactly the same; only the size of the vectors changes.
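Stacking the partial derivatives over all $j$ gives the gradient vector $\frac{1}{m} X^T (X\theta - y)$, so the simultaneous update is one matrix expression. A sketch on a hypothetical dataset generated from $y = 1 + 2x_1$, so $\theta$ should approach $[1, 2]$:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent on J(theta); X must include the x0 = 1 column."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # all n+1 partial derivatives at once
        theta = theta - alpha * grad       # simultaneous update of every theta_j
    return theta

# Hypothetical data: y = 1 + 2*x1 exactly.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = gradient_descent(X, y)
```

Because the whole update is two matrix operations, the code is identical whether $n$ is 1 or 10,000; only the array shapes change.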


Feature Scaling

Now that we have multiple features, a practical problem arises. Consider predicting house price with two features:

  • $x_1$ = size in square metres, ranging from $50$ to $2000$
  • $x_2$ = number of bathrooms, ranging from $1$ to $5$

These two features differ by nearly three orders of magnitude in scale. The contour plot of $J(\theta_1, \theta_2)$ becomes a very elongated ellipse, nearly a line, rather than the circular bowl we saw in the one-feature case.

What does this mean for gradient descent? The gradient becomes dominated by the feature with the larger range. The algorithm takes large steps along the $\theta_1$ direction and tiny ones along $\theta_2$, causing it to zigzag down the long narrow valley of the cost surface rather than heading straight to the minimum. It can take many more iterations than necessary, sometimes orders of magnitude more.

The fix is feature scaling: rescale the features so they all live on approximately the same range. The simplest approach is to divide each feature by its maximum value:

$$x_1 \leftarrow \frac{x_1}{2000}, \qquad x_2 \leftarrow \frac{x_2}{5}$$

Now both features range from 0 to 1. The contour plot becomes much more circular, gradient descent can take a much more direct path to the minimum, and convergence is dramatically faster.
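A minimal sketch of divide-by-maximum scaling, using made-up values for the two features above:

```python
import numpy as np

# Hypothetical raw features: size in square metres and number of bathrooms.
size = np.array([50.0, 800.0, 2000.0])
baths = np.array([1.0, 3.0, 5.0])

# Divide each feature by its maximum so both land in (0, 1].
size_scaled = size / size.max()
baths_scaled = baths / baths.max()
```

After this, a unit step in either scaled feature means "a full range of that feature", so the cost surface contours are far closer to circular.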

The practical target range is approximately $[-1, +1]$ or $[0, 1]$. Some guidelines on when to bother:

  • Features in $[0, 3]$ or $[-2, 0.5]$? Close enough; scaling is probably not needed.
  • Features in $[-100, +100]$? Yes, rescale.
  • Features in $[-0.0001, +0.0001]$? Also yes: they are within $[-1, 1]$ in absolute terms, but the relative scale between different features may still be huge.

The point is not the absolute values but their similarity across features and their proximity to a reasonable range.


Mean Normalisation

Alongside feature scaling, practitioners often also apply mean normalisation: shifting features so they have approximately zero mean. This is done by replacing $x_j$ with:

$$x_j \leftarrow \frac{x_j - \mu_j}{S_j}$$

where $\mu_j$ is the mean of feature $j$ across the training set, and $S_j$ is either the range (max minus min) or the standard deviation. The $x_0 = 1$ feature is never normalised.

The combined effect of feature scaling and mean normalisation is that each feature ends up centred around zero with a similar spread. Gradient descent on such data behaves much more predictably and converges in far fewer iterations.
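Both steps combine into one helper. A sketch using the standard deviation as $S_j$ (the range variant would swap in max minus min); the input holds only the real features, since $x_0 = 1$ is never normalised:

```python
import numpy as np

def mean_normalise(X):
    """Replace each column x_j with (x_j - mu_j) / s_j, with s_j the std deviation.
    X contains only the real features, not the x0 = 1 column."""
    mu = X.mean(axis=0)        # mu_j per feature
    s = X.std(axis=0)          # S_j per feature
    return (X - mu) / s, mu, s

# Hypothetical raw features: size and bathrooms for three houses.
X = np.array([[2104.0, 5.0], [1416.0, 3.0], [852.0, 2.0]])
X_norm, mu, s = mean_normalise(X)   # each column now has mean 0 and spread 1
```

Returning `mu` and `s` matters in practice: new inputs at prediction time must be transformed with the training-set statistics, not their own.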

[!NOTE] Feature scaling and mean normalisation are part of what is called data pre-processing or data preparation in the broader ML pipeline. We encounter them here in the context of making gradient descent behave well, but their importance goes beyond just convergence speed — they also affect regularisation, distance-based algorithms, and neural network training. We will return to data preparation in more depth later in the course.


Choosing the Learning Rate

The learning rate $\alpha$ is a hyperparameter: you set it before training, and gradient descent itself does not learn it. Choosing it well is a practical skill, and there is a concrete debugging workflow.

The debugging plot: after running gradient descent, plot $J(\theta)$ as a function of the iteration number. For a correct implementation with an appropriate $\alpha$:

  • $J$ should decrease after every single iteration, without exception
  • The curve should flatten out and approach a horizontal asymptote, indicating convergence

If $J$ is increasing, or oscillating up and down, $\alpha$ is almost certainly too large: the updates are overshooting. Reduce $\alpha$ by a factor of 3 or 10 and try again.

If $J$ is decreasing but very slowly, $\alpha$ might be too small. Increase it.

Practical search strategies. A useful approach is to try learning rates spaced by a factor of approximately 3:

$$\ldots,\ 0.001,\ 0.003,\ 0.01,\ 0.03,\ 0.1,\ 0.3,\ 1,\ \ldots$$

Run a short trial at each value, plot the loss curves, and pick the largest $\alpha$ that still shows a smooth, monotone decrease. You are looking for the fastest convergence without divergence.
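The sweep can be automated: run a short trial per candidate $\alpha$, record the $J$ history, and flag monotone decrease. A sketch on a small hypothetical dataset (in practice you would plot the histories rather than only flag them):

```python
import numpy as np

def trial(X, y, alpha, iters=50):
    """Short gradient-descent run; returns the history of J(theta) per iteration."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(iters):
        theta = theta - alpha * X.T @ (X @ theta - y) / m
        e = X @ theta - y
        history.append((e @ e) / (2 * m))
    return history

# Hypothetical data: y = 1 + 2*x1 exactly.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

# Candidates spaced by a factor of ~3; keep final cost and a monotonicity flag.
results = {}
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    J = trial(X, y, alpha)
    monotone = all(b <= a for a, b in zip(J, J[1:]))
    results[alpha] = (J[-1], monotone)
```

On this tiny, well-scaled problem every candidate decreases monotonically, and the larger step sizes simply get further in the same 50 iterations; on real data the largest candidates would show the tell-tale increasing or oscillating curves.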

[!MATH] It can be shown (under mild conditions on $J$) that if $\alpha$ is sufficiently small, $J(\theta)$ will decrease on every iteration. For linear regression this holds for any $\alpha < \frac{1}{\|X\|^2 / m}$, where $\|X\|$ is a norm of the data matrix. In practice this bound is not useful directly, but it confirms that making $\alpha$ small enough always works. The art is making it small enough to converge while large enough to be practical.


Feature Engineering and Polynomial Regression

Having established the multivariate framework, a natural question arises: do we have to use the raw features as they appear in the dataset? The answer is no — and this turns out to be one of the most important degrees of freedom in building ML models.

Creating new features. Suppose you are predicting house price and you have the façade width $x_1$ and the depth $x_2$ of a plot. You could feed both to the model as separate features. But you might notice that what actually predicts price better is the total area, and $\text{area} = x_1 \times x_2$. You can simply define a new feature $x_3 = x_1 \cdot x_2$ and feed that to the model, instead of or alongside the originals.

This is not "inventing data." You are not adding information that was not there — you are choosing how to present the information to the learning algorithm. A useful new feature can dramatically improve model performance, while a poorly chosen feature adds noise. Whether a combination of features is meaningful depends on domain knowledge: if you understand the problem, you can often design features that make the model's job much easier.
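A sketch of the width-times-depth example with made-up numbers; the engineered column sits alongside the originals like any other feature:

```python
import numpy as np

# Hypothetical plot data: facade width and depth in metres.
width = np.array([10.0, 8.0, 12.0])
depth = np.array([30.0, 25.0, 20.0])

# Engineered feature: total area, x3 = x1 * x2.
area = width * depth

# Design columns: the model sees area as just one more feature.
X = np.column_stack([width, depth, area])
```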

Polynomial regression. Sometimes the relationship between the input and output is clearly not linear — a scatter plot might show a curve. With the multivariate framework, we can handle this without leaving linear regression at all.

Suppose the housing price data looks like it follows a curve. Instead of $h_\theta(x) = \theta_0 + \theta_1 x$, we might try:

$$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$$

or even a cubic:

$$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$$

From the model's perspective these are just multivariate linear regression problems. If we define $x_1 = x$, $x_2 = x^2$, $x_3 = x^3$, then the cubic hypothesis is $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 = \theta^T x$, exactly the form we already know how to train.

The model is still linear in the parameters $\theta$: gradient descent and the cost function are unchanged. The nonlinearity lives entirely in the feature transformation, not in the model itself. This is a subtle but important distinction.

One could also try a square-root feature, $x_2 = \sqrt{x}$, if the data suggests the curve flattens out at large values. The choice of which polynomial (or other nonlinear transformation) to use is guided by looking at the data and by domain intuition.

[!NOTE] If you use polynomial features, feature scaling becomes even more important. If $x$ ranges from 1 to 1000, then $x^2$ ranges from 1 to $10^6$ and $x^3$ from 1 to $10^9$. Without scaling, the different powers will be on wildly different scales and gradient descent will struggle. Always scale after constructing polynomial features.
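The expansion itself is a one-liner. A sketch for a single raw feature, with the scaling caveat noted in the docstring:

```python
import numpy as np

def poly_features(x, degree):
    """Expand one raw feature x into columns [x, x^2, ..., x^degree].
    Remember to feature-scale the result: the columns live on wildly
    different ranges once degree > 1."""
    return np.column_stack([x ** d for d in range(1, degree + 1)])

x = np.array([1.0, 2.0, 3.0])
X_poly = poly_features(x, 3)   # columns: x, x^2, x^3
```

`X_poly` then feeds straight into the same cost function and gradient descent updates as any other multivariate design matrix.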


The Normal Equation: An Analytic Alternative

Gradient descent is an iterative algorithm: it takes many steps to converge. It turns out that for linear regression specifically, there is an analytic method that finds the optimal $\theta$ in a single step, with no iterations at all: the normal equation.

The derivation involves calculus on matrices: you take the gradient of $J(\theta)$ with respect to $\theta$ (now a vector), set it to zero, and solve. The result is:

$$\theta = (X^T X)^{-1} X^T y$$

where $X$ is the design matrix, an $m \times (n+1)$ matrix whose $i$-th row is the transposed feature vector $(x^{(i)})^T$, and $y$ is the $m$-dimensional vector of training outputs. You compute this matrix expression once and you have the exact optimal $\theta$.

This is genuinely remarkable: no learning rate to tune, no iterations to run, no convergence monitoring needed.
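A sketch of the one-shot computation; solving the linear system $X^T X \, \theta = X^T y$ with `np.linalg.solve` is numerically preferable to forming the inverse explicitly. The data is hypothetical, generated from $y = 1 + 2x_1$ so the exact answer is known:

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y; X includes the x0 = 1 column.
    Solving the system beats computing the matrix inverse explicitly."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical data: y = 1 + 2*x1 exactly.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = normal_equation(X, y)   # recovers [1, 2] up to floating-point error
```

No learning rate, no iteration count, no scaling: one call and $\theta$ is exact, subject to the cost and invertibility caveats discussed next.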

So why did we introduce gradient descent at all?

Gradient Descent vs Normal Equation

| | Gradient Descent | Normal Equation |
| --- | --- | --- |
| Learning rate | Must be chosen carefully | Not needed |
| Iterations | Many (depends on $\alpha$ and the data) | None: one computation |
| Feature scaling | Recommended | Irrelevant |
| Scales to large $n$ | Yes: cost $\sim O(kn^2)$ | No: cost $\sim O(n^3)$ |
| Works beyond linear regression | Yes: general purpose | No |

The critical entry is the computational cost. The normal equation requires computing $(X^T X)^{-1}$, a matrix inversion of an $(n+1) \times (n+1)$ matrix. Matrix inversion scales as $O(n^3)$. For $n = 100$ features this is trivial. For $n = 10{,}000$ it starts to be expensive. For $n = 10^6$, which is common in text and genomics applications, it becomes completely intractable.

Gradient descent, by contrast, scales much more favourably. Each iteration costs $O(mn)$, one pass over the data, and even with many iterations the total cost is manageable for large $n$.

The practical guideline that emerges from experience: if $n$ is up to roughly $10^4$, the normal equation is fine and convenient. Beyond that, gradient descent is the preferred approach.

There is another reason to prefer gradient descent that becomes apparent later in the course: the normal equation is specific to linear regression. The moment you move to logistic regression, neural networks, or any other model, there is no closed-form solution — gradient descent (or a variant of it) is the only option. Learning it well for linear regression builds the habit and intuition you will need for everything that follows.

[!NOTE] The normal equation can also fail if $X^T X$ is not invertible, which happens when you have redundant features (linearly dependent columns) or more features than training examples ($n \geq m$). In those cases the matrix has no inverse. Gradient descent handles these situations more gracefully, though they are symptoms of a deeper problem with the dataset that should be addressed regardless.


Summary

Multivariate linear regression is the proper general form of the algorithm introduced in Lecture 1. The key additions:

We extended the hypothesis to $h_\theta(x) = \theta^T x$ by introducing the zeroth feature $x_0 = 1$ and collecting all parameters into a vector $\theta \in \mathbb{R}^{n+1}$. The cost function is structurally identical to before. Gradient descent updates all $n+1$ parameters simultaneously at each iteration, with the update for $\theta_j$ picking up $x^{(i)}_j$ as a weighting factor.

We also introduced three practical ideas that make this work in practice: feature scaling and mean normalisation (to improve convergence speed), feature engineering and polynomial regression (to fit nonlinear patterns using the same linear framework), and the normal equation (a one-shot analytic alternative that works well when $n$ is small).

The next part of the course introduces a fundamentally different kind of problem — not predicting a continuous number, but predicting a discrete class. That requires a new type of hypothesis and a new cost function: logistic regression.