Everything in Lecture 1 and the gradient descent section assumed a single input feature: house size predicts house price. The hypothesis was $h_\theta(x) = \theta_0 + \theta_1 x$, with exactly two parameters to learn.
This is useful for building intuition, but it is not how real problems look. A house has a size, a number of bathrooms, a number of floors, an age, a neighbourhood, a distance to the city centre. A tumour has a size, a cell uniformity score, a cell thickness measure, a clump thickness. Any real dataset has many features, often hundreds or thousands. We need the machinery to handle all of them at once.
The generalisation is called multivariate linear regression — or equivalently, linear regression with multiple features. The ideas are identical to the univariate case; what changes is the notation, and the fact that the parameters are now a vector rather than a pair of numbers.
Before writing the hypothesis, we need to extend the notation from Lecture 1 to handle multiple features.
| Symbol | Meaning |
|---|---|
| $m$ | Number of training examples (rows in the dataset) |
| $n$ | Number of features (columns, not counting the output $y$) |
| $x^{(i)}$ | The full feature vector for the $i$-th training example — a vector of $n$ numbers |
| $x^{(i)}_j$ | The value of feature $j$ in the $i$-th training example |
| $y^{(i)}$ | The output for the $i$-th training example |
So $x^{(i)}$ is no longer a single number — it is a column vector with $n$ entries. For example, if $x^{(4)}$ is the fourth house in a dataset with four features (size, bathrooms, floors, age), then:

$$x^{(4)} = \begin{bmatrix} x^{(4)}_1 \\ x^{(4)}_2 \\ x^{(4)}_3 \\ x^{(4)}_4 \end{bmatrix} = \begin{bmatrix} \text{size} \\ \text{bathrooms} \\ \text{floors} \\ \text{age} \end{bmatrix}$$

And $x^{(4)}_2$ would be the number of bathrooms of that fourth house. The index $(i)$ in parentheses selects the example; the subscript $j$ selects the feature within that example.
[!WARNING] Keep the two indices straight: the superscript $(i)$ always refers to the training example (row), and the subscript $j$ always refers to the feature (column). $x^{(4)}_2$ is feature 2 of example 4 — not the second power of the fourth example.
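In code, a dataset is usually stored as an $m \times n$ array whose rows are examples and whose columns are features, so the two indices map directly onto array indices (minus one, because NumPy is zero-based). A minimal sketch with made-up numbers:

```python
import numpy as np

# Hypothetical dataset: 4 houses, 4 features (size, bathrooms, floors, age).
# All values are invented for illustration.
X = np.array([
    [210.0, 3, 2, 40],
    [160.0, 2, 1, 15],
    [240.0, 4, 2, 30],
    [ 85.0, 2, 1, 36],
])

# x^(4) is the fourth example: row index 3 with zero-based indexing.
x4 = X[3]

# x^(4)_2 is feature 2 of example 4: the number of bathrooms of that house.
x4_2 = X[3, 1]
print(x4_2)
```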
In the univariate case, the hypothesis was $h_\theta(x) = \theta_0 + \theta_1 x$. The natural extension to $n$ features is to give each feature its own parameter:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$
Each $\theta_j$ is the weight on feature $j$ — it tells the model how much feature $j$ contributes to the prediction. If you are predicting house price and $x_1$ is size in square metres, then $\theta_1$ is roughly "how many euros does one extra square metre add to the price, holding everything else equal."
Now we have $n + 1$ parameters: $\theta_0$ (the intercept) and $\theta_1$ through $\theta_n$ (one weight per feature). Writing them out each time is cumbersome. We need a compact notation.
The key notational trick that makes everything clean is to introduce a zeroth feature: define $x_0^{(i)} = 1$ for every single training example. This is not adding any information — it is just a constant 1 appended to every feature vector.
Why? Because it lets us absorb the intercept into the same structure as all other parameters. Now the feature vector for any example becomes:

$$x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^{n+1}$$

And the parameter vector is:

$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} \in \mathbb{R}^{n+1}$$
Now the hypothesis is simply a dot product — the transpose of $\theta$ multiplied by $x$:

$$h_\theta(x) = \theta^\top x = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$$
This is the vectorised form of the multivariate hypothesis. Every time you see $\theta^\top x$, it means exactly the same thing as the long sum — it is just compressed into two symbols. This notation is what everyone uses in papers, textbooks, and code. Get comfortable with it.
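The compression carries over directly to code: the whole sum is one dot product. A minimal sketch with hypothetical parameter and feature values:

```python
import numpy as np

# Hypothetical parameters: intercept, weight on size, weight on bathrooms.
theta = np.array([50.0, 0.5, 10.0])

# One feature vector with the constant x_0 = 1 prepended,
# then size = 85 and bathrooms = 2 (invented values).
x = np.array([1.0, 85.0, 2.0])

# theta^T x: the long sum theta_0*x_0 + theta_1*x_1 + theta_2*x_2,
# compressed into a single dot product.
prediction = theta @ x
print(prediction)  # 50 + 0.5*85 + 10*2 = 112.5
```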
The cost function extends in the obvious way — the form is identical to before, just with the new $h_\theta$:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Note we now write $J(\theta)$ rather than $J(\theta_0, \theta_1)$ — $\theta$ is the whole parameter vector, not just a pair. The goal is still $\min_\theta J(\theta)$.
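Stacking all examples into a design matrix lets the cost be computed without any explicit loop. A minimal sketch, with a tiny invented dataset:

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = (1/2m) * sum of squared errors, fully vectorised.

    X is the m x (n+1) design matrix with the x_0 = 1 column included.
    """
    m = len(y)
    errors = X @ theta - y          # h_theta(x^(i)) - y^(i) for all i at once
    return (errors @ errors) / (2 * m)

# Tiny hypothetical dataset: x_0 = 1 already prepended, and y = x exactly.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

print(cost(np.array([0.0, 1.0]), X, y))  # perfect fit -> 0.0
```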
With the vectorised notation in place, the gradient descent update rule is remarkably clean. We now need to update every $\theta_j$ for $j = 0, 1, \dots, n$ at every iteration, simultaneously.
The general update rule is:

Repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \qquad \text{(for all } j = 0, \dots, n \text{ simultaneously)}$$
Working out the partial derivative — same chain rule as before, the inner term differentiates with respect to $\theta_j$ to give $x_j^{(i)}$ — gives:

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Substituting back, the concrete update is:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
For $j = 0$ this simplifies slightly because $x_0^{(i)} = 1$ for every example, so the factor disappears and you just sum the raw errors. The univariate updates from the previous section are the special case $n = 1$ of this more general formula.
The structure is the same no matter how many features you have — the sum is always over all examples, and the gradient for each parameter picks up the appropriate feature value as a weight. Scale this to $n$ features and the formula looks exactly the same; only the size of the vectors changes.
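The full algorithm fits in a few lines once vectorised: the gradient for all parameters at once is $\frac{1}{m} X^\top (X\theta - y)$. A minimal sketch on an invented dataset that follows $y = 1 + 2x$ exactly:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent; X must include the x_0 = 1 column."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(iters):
        errors = X @ theta - y       # shape (m,): h_theta(x^(i)) - y^(i)
        grad = X.T @ errors / m      # shape (n+1,): theta_j picks up x_j^(i)
        theta -= alpha * grad        # all parameters updated simultaneously
    return theta

# Hypothetical data generated from y = 1 + 2x, no noise.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

print(gradient_descent(X, y))        # converges close to [1., 2.]
```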
Now that we have multiple features, a practical problem arises. Consider predicting house price with two features:

- $x_1$: the size in square metres — values in the hundreds or low thousands
- $x_2$: the number of bathrooms — values between 1 and 5
These two features differ by nearly three orders of magnitude in scale. The contour plot of $J(\theta)$ becomes a very elongated ellipse — nearly a line — rather than the circular bowl we saw in the one-feature case.
What does this mean for gradient descent? The gradient direction becomes dominated by the feature with the larger range. The algorithm takes large steps along the $\theta_1$ direction and tiny ones along $\theta_2$, causing it to zigzag down the long narrow valley of the cost surface rather than going straight to the minimum. It can take many more iterations than necessary — sometimes orders of magnitude more.
The fix is feature scaling: rescale the features so they all live on approximately the same range. The simplest approach is to divide each feature by its maximum value:

$$x_1 := \frac{x_1}{\max_i x_1^{(i)}}, \qquad x_2 := \frac{x_2}{\max_i x_2^{(i)}}$$
Now both features range from 0 to 1. The contour plot becomes much more circular, gradient descent can take a much more direct path to the minimum, and convergence is dramatically faster.
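Divide-by-max scaling is a one-liner per feature. A minimal sketch with invented values for the two features above:

```python
import numpy as np

# Hypothetical raw features: size in square metres and number of bathrooms.
size = np.array([850.0, 1400.0, 2000.0, 600.0])
baths = np.array([2.0, 3.0, 5.0, 1.0])

# Divide each feature by its maximum so both land in (0, 1].
size_scaled = size / size.max()
baths_scaled = baths / baths.max()

print(size_scaled.max(), baths_scaled.max())  # both 1.0
```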
The practical target range is approximately $-1 \le x_j \le 1$ or $0 \le x_j \le 1$. Some guidelines on when to bother:

- A feature ranging from $-3$ to $3$, or from $-\tfrac{1}{3}$ to $\tfrac{1}{3}$, is close enough — leave it alone.
- A feature ranging from $-100$ to $100$, or from $-0.0001$ to $0.0001$, is far enough off that rescaling is worthwhile.
The point is not the absolute values but their similarity across features and their proximity to a reasonable range.
Alongside feature scaling, practitioners often also apply mean normalisation: shifting features so they have approximately zero mean. This is done by replacing $x_j$ with:

$$x_j := \frac{x_j - \mu_j}{s_j}$$

where $\mu_j$ is the mean of feature $j$ across the training set, and $s_j$ is either the range (max minus min) or the standard deviation. The feature $x_0 = 1$ is never normalised.
The combined effect of feature scaling and mean normalisation is that each feature ends up centred around zero with a similar spread. Gradient descent on such data behaves much more predictably and converges in far fewer iterations.
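Both steps combine into one transformation per column. A minimal sketch using the standard deviation as $s_j$ (the range would work just as well), on hypothetical data:

```python
import numpy as np

def normalise(X):
    """Mean-normalise and scale every column of X.

    The caller should exclude the x_0 = 1 column, which is never normalised.
    Returns mu and sigma so the same transform can be applied to new data.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Hypothetical raw features: size in square metres, number of bathrooms.
X = np.array([[850.0, 2.0], [1400.0, 3.0], [2000.0, 5.0], [600.0, 1.0]])
X_norm, mu, sigma = normalise(X)

print(X_norm.mean(axis=0))  # each column now centred on (numerically) zero
```

Returning `mu` and `sigma` matters in practice: predictions on new examples must use the training-set statistics, not statistics recomputed on the new data.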
[!NOTE] Feature scaling and mean normalisation are part of what is called data pre-processing or data preparation in the broader ML pipeline. We encounter them here in the context of making gradient descent behave well, but their importance goes beyond just convergence speed — they also affect regularisation, distance-based algorithms, and neural network training. We will return to data preparation in more depth later in the course.
The learning rate $\alpha$ is a hyperparameter — you set it before training, and gradient descent itself does not learn it. Choosing it well is a practical skill, and there is a concrete debugging workflow.
The debugging plot: after running gradient descent, plot $J(\theta)$ as a function of iteration number. For a correct implementation with an appropriate $\alpha$, the curve decreases on every iteration and flattens out as the algorithm converges.
If $J(\theta)$ is increasing, or oscillating up and down, $\alpha$ is almost certainly too large — the updates are overshooting. Reduce $\alpha$ by a factor of 3 or 10 and try again.

If $J(\theta)$ is decreasing but very slowly, $\alpha$ might be too small. Increase it.
Practical search strategies. A useful approach is to try learning rates spaced by a factor of approximately 3:

$$\dots,\ 0.001,\ 0.003,\ 0.01,\ 0.03,\ 0.1,\ 0.3,\ 1,\ \dots$$

Run a short trial at each value, plot the loss curves, and pick the largest $\alpha$ that still shows smooth monotone decrease. You are looking for the fastest convergence without divergence.
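The sweep is easy to automate: run a few dozen iterations per candidate and compare the cost curves. A minimal sketch on an invented dataset:

```python
import numpy as np

def trial_run(X, y, alpha, iters=50):
    """Short gradient-descent trial; returns the cost at each iteration."""
    m, n = X.shape
    theta = np.zeros(n)
    costs = []
    for _ in range(iters):
        errors = X @ theta - y
        costs.append((errors @ errors) / (2 * m))  # cost before this update
        theta -= alpha * (X.T @ errors) / m
    return costs

# Hypothetical data generated from y = 1 + 2x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

# Candidates spaced by roughly a factor of 3; compare the final costs.
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    print(alpha, trial_run(X, y, alpha)[-1])
```

In a notebook you would plot each `costs` list rather than printing the last value, but the selection rule is the same: the largest $\alpha$ whose curve still decreases smoothly.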
[!MATH] It can be shown (under mild conditions on $J$) that if $\alpha$ is sufficiently small, $J(\theta)$ will decrease on every iteration. For linear regression this holds for any $\alpha < 2/L$, where $L$ is the largest eigenvalue of $\frac{1}{m} X^\top X$ — a norm of the data matrix. In practice this bound is not useful directly, but it confirms that making $\alpha$ small enough always works. The art is making it small enough to converge while large enough to be practical.
Having established the multivariate framework, a natural question arises: do we have to use the raw features as they appear in the dataset? The answer is no — and this turns out to be one of the most important degrees of freedom in building ML models.
Creating new features. Suppose you are predicting house price and you have the façade width $x_1$ and the depth $x_2$ of a plot. You could feed both to the model as separate features. But you might notice that what actually predicts price better is the total area — width times depth. You can simply define a new feature $x_3 = x_1 \cdot x_2$ and feed that to the model instead of or alongside the originals.
This is not "inventing data." You are not adding information that was not there — you are choosing how to present the information to the learning algorithm. A useful new feature can dramatically improve model performance, while a poorly chosen feature adds noise. Whether a combination of features is meaningful depends on domain knowledge: if you understand the problem, you can often design features that make the model's job much easier.
Polynomial regression. Sometimes the relationship between the input and output is clearly not linear — a scatter plot might show a curve. With the multivariate framework, we can handle this without leaving linear regression at all.
Suppose the housing price data looks like it follows a curve. Instead of $h_\theta(x) = \theta_0 + \theta_1 x$, we might try a quadratic:

$$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$$

or even a cubic:

$$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$$

From the model's perspective these are just multivariate linear regression problems. If we define $x_1 = x$, $x_2 = x^2$, $x_3 = x^3$, then the cubic hypothesis is $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3$ — exactly the form we already know how to train.
The model is still linear in the parameters — gradient descent and the cost function are unchanged. The nonlinearity lives entirely in the feature transformation, not in the model itself. This is a subtle but important distinction.
One could also try a square root feature — $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 \sqrt{x}$ — if the data suggests the curve flattens out at large values. The choice of which polynomial (or other nonlinear transformation) to use is guided by looking at the data and by domain intuition.
[!NOTE] If you use polynomial features, feature scaling becomes even more important. If $x$ ranges from 1 to 1000, then $x^2$ ranges from 1 to 1,000,000 and $x^3$ from 1 to $10^9$. Without scaling, the different powers will be on wildly different scales and gradient descent will struggle. Always scale after constructing polynomial features.
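A quick numerical check of those ranges, with the powers built by hand (a minimal sketch; libraries like scikit-learn offer ready-made polynomial feature builders):

```python
import numpy as np

# A single raw input spanning 1 to 1000, as in the note above.
x = np.linspace(1.0, 1000.0, 200)

# Stack the polynomial features: x_1 = x, x_2 = x^2, x_3 = x^3.
X_poly = np.column_stack([x, x**2, x**3])

# Without scaling, the column maxima sit at 1e3, 1e6, and 1e9.
print(X_poly.max(axis=0))

# Scale each column by its maximum before running gradient descent.
X_scaled = X_poly / X_poly.max(axis=0)
print(X_scaled.max(axis=0))  # all columns now peak at 1.0
```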
Gradient descent is an iterative algorithm — it takes many steps to converge. It turns out that for linear regression specifically, there is an analytic method that finds the optimal $\theta$ in a single step, with no iterations at all: the normal equation.

The derivation involves calculus on matrices — you take the gradient of $J(\theta)$ with respect to $\theta$ (now a vector), set it to zero, and solve. The result is:

$$\theta = (X^\top X)^{-1} X^\top y$$

where $X$ is the design matrix — an $m \times (n+1)$ matrix whose $i$-th row is the feature vector $(x^{(i)})^\top$ — and $y$ is the $m$-dimensional vector of training outputs. You compute this matrix expression once and you have the exact optimal $\theta$.
This is genuinely remarkable: no learning rate to tune, no iterations to run, no convergence monitoring needed.
So why did we introduce gradient descent at all?
| | Gradient Descent | Normal Equation |
|---|---|---|
| Learning rate $\alpha$ | Must be chosen carefully | Not needed |
| Iterations | Many — depends on $\alpha$ and the data | None — one computation |
| Feature scaling | Recommended | Irrelevant |
| Scales to large $n$ | Yes — cost $O(mn)$ per iteration | No — cost $O(n^3)$ |
| Works beyond linear regression | Yes — general purpose | No |
The critical entry is the computational cost. The normal equation requires computing $(X^\top X)^{-1}$ — the inversion of an $(n+1) \times (n+1)$ matrix. Matrix inversion scales as $O(n^3)$. For $n$ in the hundreds this is trivial. For $n$ around $10{,}000$ it starts to be expensive. For $n$ in the millions — which is common in text and genomics applications — it becomes completely intractable.
Gradient descent, by contrast, scales much more favourably. Each iteration costs $O(mn)$ — one pass over the data — and even with many iterations the total cost is manageable for large $n$.
The practical guideline that emerges from experience: if $n$ is up to roughly $10^4$, the normal equation is fine and convenient. Beyond that, gradient descent is the preferred approach.
There is another reason to prefer gradient descent that becomes apparent later in the course: the normal equation is specific to linear regression. The moment you move to logistic regression, neural networks, or any other model, there is no closed-form solution — gradient descent (or a variant of it) is the only option. Learning it well for linear regression builds the habit and intuition you will need for everything that follows.
[!NOTE] The normal equation can also fail if $X^\top X$ is not invertible — which happens when you have redundant features (linearly dependent columns) or more features than training examples ($n > m$). In those cases the matrix has no inverse. Gradient descent handles these situations more gracefully, though they are symptoms of a deeper problem with the dataset that should be addressed regardless.
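Numerical libraries sidestep the failure with the pseudoinverse, which returns a sensible (minimum-norm) least-squares solution even when $X^\top X$ is singular. A minimal sketch with a deliberately duplicated feature:

```python
import numpy as np

# Design matrix with a redundant feature: the last two columns are identical,
# so X^T X is singular and np.linalg.inv(X.T @ X) would raise an error.
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 5.0, 5.0]])
y = np.array([5.0, 7.0, 11.0])   # consistent with y = 1 + 2 * (column 2)

# The pseudoinverse still returns the minimum-norm least-squares solution.
theta = np.linalg.pinv(X) @ y
print(X @ theta)  # reproduces y despite the singular X^T X
```

The deeper fix, as the note says, is to drop the redundant column (or gather more examples when $n > m$) rather than rely on the pseudoinverse.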
Multivariate linear regression is the proper general form of the algorithm introduced in Lecture 1. The key additions:
We extended the hypothesis to $h_\theta(x) = \theta^\top x$ by introducing the zeroth feature $x_0 = 1$ and collecting all parameters into a vector $\theta \in \mathbb{R}^{n+1}$. The cost function $J(\theta)$ is structurally identical to before. Gradient descent updates all parameters simultaneously at each iteration, with the update for $\theta_j$ picking up $x_j^{(i)}$ as a weighting factor.
We also introduced three practical ideas that matter for making this work in practice: feature scaling and mean normalisation (to improve convergence speed), feature engineering and polynomial regression (to fit nonlinear patterns using the same linear framework), and the normal equation (a one-shot analytic alternative that works well when $n$ is small).
The next part of the course introduces a fundamentally different kind of problem — not predicting a continuous number, but predicting a discrete class. That requires a new type of hypothesis and a new cost function: logistic regression.