At the end of Lecture 1 we had a complete problem statement. We have a training set, we chose a hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$, and we defined the cost function:

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

The goal is $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$.
What we did not answer is: how do we actually find that minimum? For the simple case of univariate linear regression you could solve it analytically — set the derivatives to zero and solve the resulting system. But that only works for linear regression. The moment the model gets more complex — more parameters, nonlinear hypotheses, deep networks — closed-form solutions become impossible. We need a general method that works regardless of the shape or complexity of $J$.
That method is gradient descent.
Before writing any formula, build the picture. Imagine you are standing somewhere on a hilly landscape, and your goal is to reach the lowest valley. You cannot see the whole landscape — you can only feel the slope of the ground directly beneath your feet. What do you do?
You look around 360 degrees, find the direction that goes downhill most steeply, take a small step in that direction, then stop and repeat. Look around again, find the steepest descent, take another small step. Repeat until you are at a point where every direction around you goes uphill — you are at a local minimum.
This is exactly what gradient descent does in parameter space. The "landscape" is the surface of $J(\theta_0, \theta_1)$. The "position" is the current pair $(\theta_0, \theta_1)$. The "steepest downhill direction" is the negative of the gradient — the direction in which $J$ decreases fastest. The "small step" is controlled by a number called the learning rate $\alpha$.
The algorithm starts by initialising $\theta_0$ and $\theta_1$ to some values — typically both zero, though any starting point works for the convex cost surfaces we will see in linear regression.
Then it repeatedly applies this update rule until convergence:

$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \qquad \text{for } j = 0 \text{ and } j = 1$$

Read this carefully. The symbol $\leftarrow$ means assignment — we are replacing the old value of $\theta_j$ with a new one. The new value is the old value minus a small step in the direction of the partial derivative.
Why subtract rather than add? Because the partial derivative points in the direction of steepest increase in $J$. We want to decrease $J$, so we go in the opposite direction — hence the minus sign.
Breaking the update rule into its pieces:

- $\theta_j$: the parameter being updated (here $j$ is 0 or 1).
- $\alpha$: the learning rate, which scales the size of each step.
- $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$: the partial derivative of the cost with respect to $\theta_j$, i.e. the slope of the cost surface in that direction.
Each iteration moves to a point of lower cost. Over many iterations, the parameters walk down the cost surface toward the minimum.
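As a minimal sketch of the update rule in action (the function and its values are mine, not from the lecture): gradient descent on the one-dimensional cost $f(\theta) = \theta^2$, whose derivative is $2\theta$ and whose minimum sits at $\theta = 0$.

```python
# Minimise f(theta) = theta^2 by repeatedly stepping against the derivative.
# f'(theta) = 2 * theta points uphill, so we subtract it.

def gradient_descent_1d(theta=5.0, alpha=0.1, iterations=100):
    for _ in range(iterations):
        gradient = 2 * theta              # slope at the current position
        theta = theta - alpha * gradient  # small step in the downhill direction
    return theta

theta_final = gradient_descent_1d()
# theta_final ends up very close to 0, the minimum of f.
```

Each pass shrinks $\theta$ by a constant factor $(1 - 2\alpha)$, which is the walk-downhill behaviour the text describes.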
There is one implementation detail that is easy to get wrong but matters: you must update both parameters simultaneously, not sequentially.
This means you compute both new values using the old values of $\theta_0$ and $\theta_1$, then assign both at once:

$$\begin{aligned}
\text{temp}_0 &\leftarrow \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) \\
\text{temp}_1 &\leftarrow \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) \\
\theta_0 &\leftarrow \text{temp}_0 \\
\theta_1 &\leftarrow \text{temp}_1
\end{aligned}$$
Why does this matter? If you updated $\theta_0$ first, then used the already-updated $\theta_0$ to compute the derivative for $\theta_1$, you would be computing the gradient at a different point than intended — $\theta_0$ has already moved. The update for $\theta_1$ would be using mixed information from two different positions on the cost surface. This is technically a different algorithm and it can behave erratically.
[!WARNING] Always compute all partial derivatives using the current parameter values before updating any of them. This applies to all parameters simultaneously, even when you have millions of them in a neural network.
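A sketch of the simultaneous update in code (the gradient functions here are for the toy cost $J = \theta_0^2 + \theta_1^2$, chosen by me for illustration, not taken from the lecture):

```python
# Stand-ins for dJ/dtheta0 and dJ/dtheta1 of the toy cost
# J(theta0, theta1) = theta0^2 + theta1^2.

def grad0(theta0, theta1):
    return 2 * theta0

def grad1(theta0, theta1):
    return 2 * theta1

def update_simultaneous(theta0, theta1, alpha):
    # Both derivatives are evaluated at the SAME (old) point...
    temp0 = theta0 - alpha * grad0(theta0, theta1)
    temp1 = theta1 - alpha * grad1(theta0, theta1)
    # ...and only then are both parameters overwritten.
    return temp0, temp1
```

Swapping the two temporaries for in-place assignments would evaluate `grad1` at a point that `theta0` has already left, which is exactly the mistake the warning describes.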
The learning rate $\alpha$ is a hyperparameter — a number you choose before training, not something the algorithm learns from data. It controls how large a step to take in the downhill direction at each iteration.
Getting it right matters more than it might seem. Consider what happens at the two extremes:
If $\alpha$ is too small: the steps are tiny. The algorithm will eventually reach the minimum, but it will take an enormous number of iterations to get there — practically, training becomes too slow to be useful.
If $\alpha$ is too large: the steps are so big that you overshoot the minimum. Instead of settling into the valley, the algorithm leaps over it to the other side, where the slope pushes it back — but again too far. The parameters oscillate back and forth, and $J$ can actually increase over iterations rather than decrease. In the worst case the algorithm diverges entirely.
The sweet spot is somewhere in between — large enough to converge in a reasonable number of iterations, small enough not to overshoot. In practice you try a few values and watch the loss curve.
There is also something worth noticing: even with a fixed $\alpha$, the step size naturally decreases as you approach the minimum. This is because the slope — the partial derivative — gets smaller as you get closer to the bottom of the bowl. Near the minimum the gradient is nearly zero, so the update becomes nearly zero even with the same $\alpha$. The algorithm naturally slows down and settles in without any manual adjustment.
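This self-slowing behaviour is easy to observe numerically. A small sketch (again on the toy cost $f(\theta) = \theta^2$, my choice, not the lecture's): record the magnitude $\alpha \cdot |f'(\theta)|$ of each update and watch it shrink even though $\alpha$ never changes.

```python
# With a FIXED learning rate, the update magnitude alpha * |gradient|
# still shrinks as theta approaches the minimum of f(theta) = theta^2.

def step_sizes(theta=5.0, alpha=0.1, iterations=5):
    sizes = []
    for _ in range(iterations):
        gradient = 2 * theta
        sizes.append(abs(alpha * gradient))  # magnitude of this update
        theta = theta - alpha * gradient
    return sizes

sizes = step_sizes()
# Each step is strictly smaller than the previous one, with alpha unchanged,
# because the gradient itself shrinks near the bottom of the bowl.
```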
[!NOTE] If gradient descent is working correctly, $J$ should decrease on every single iteration. If you ever see $J$ increase, something is wrong — most likely $\alpha$ is too large, or there is a bug in the derivative computation.
So far we have said "compute the partial derivative of $J$ with respect to $\theta_j$" without actually doing it. Let us do it now for the linear regression cost function. This is the calculus step — worth following once carefully so it is never a black box.
Recall:

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2$$

We need $\frac{\partial J}{\partial \theta_0}$ and $\frac{\partial J}{\partial \theta_1}$ separately.
Differentiate using the chain rule. The square gives a factor of 2, and the inner term differentiates with respect to $\theta_0$ to give 1. The 2 from the chain rule cancels with the 2 in $\frac{1}{2m}$:

$$\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$
Same process, but now the inner term differentiates with respect to $\theta_1$ to give $x^{(i)}$ — it comes along as a factor:

$$\frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$$
Notice the only difference: the $\theta_1$ derivative picks up an extra $x^{(i)}$ inside the sum. This makes intuitive sense — $\theta_1$ is the slope, and its effect on the prediction scales with the size of the input $x^{(i)}$.
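A useful habit when deriving gradients by hand is to sanity-check them numerically. A sketch (the tiny dataset and all names are mine): compare each analytic derivative against a symmetric finite-difference quotient of $J$.

```python
# Check the analytic derivatives of J against finite differences
# on a tiny made-up dataset of (x, y) pairs.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
m = len(xs)

def J(t0, t1):
    return sum((t0 + t1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def dJ_dt0(t0, t1):
    # (1/m) * sum of prediction errors
    return sum(t0 + t1 * x - y for x, y in zip(xs, ys)) / m

def dJ_dt1(t0, t1):
    # same sum, with each error weighted by its input x
    return sum((t0 + t1 * x - y) * x for x, y in zip(xs, ys)) / m

# Symmetric difference quotient: (J(t + eps) - J(t - eps)) / (2 * eps).
eps = 1e-6
t0, t1 = 0.5, 0.5
num0 = (J(t0 + eps, t1) - J(t0 - eps, t1)) / (2 * eps)
num1 = (J(t0, t1 + eps) - J(t0, t1 - eps)) / (2 * eps)
assert abs(num0 - dJ_dt0(t0, t1)) < 1e-6
assert abs(num1 - dJ_dt1(t0, t1)) < 1e-6
```

If the analytic and numerical values disagree, the hand derivation (or its implementation) has a bug — the same check scales to models where the calculus is harder to eyeball.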
Substituting these derivatives back into the update rule, the complete gradient descent algorithm for linear regression is:
Repeat until convergence:

$$\begin{aligned}
\text{temp}_0 &\leftarrow \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \\
\text{temp}_1 &\leftarrow \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)} \\
\theta_0 &\leftarrow \text{temp}_0, \qquad \theta_1 \leftarrow \text{temp}_1
\end{aligned}$$
This is the complete, unambiguous algorithm. Every iteration: compute both sums using the current $\theta_0$ and $\theta_1$, then update both simultaneously.
Reading the update for $\theta_0$: if the model is consistently predicting too high, the sum of $\left( h_\theta(x^{(i)}) - y^{(i)} \right)$ will be positive, so $\theta_0$ decreases — the intercept pulls down. If the model is consistently too low, the sum is negative and $\theta_0$ increases. The update automatically corrects in the right direction.
The same logic applies to $\theta_1$, but weighted by $x^{(i)}$ — examples with larger input values have a greater influence on how the slope is adjusted.
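Putting the two update rules into a runnable loop (the function name, hyperparameter values, and dataset are mine, chosen so the true line $y = 1 + 2x$ is recoverable):

```python
# Batch gradient descent for univariate linear regression,
# fitting h(x) = theta0 + theta1 * x to a small made-up dataset.

def fit(xs, ys, alpha=0.05, iterations=5000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                            # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/dtheta1
        # Simultaneous update: both gradients were computed from the
        # old parameters before either is overwritten.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Data generated from y = 1 + 2x, so the fit should recover those values.
theta0, theta1 = fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

Note the tuple assignment on the update line: in Python the right-hand side is evaluated in full before either name is rebound, which gives the simultaneous update for free.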
The version of gradient descent described here is specifically called batch gradient descent. The word "batch" refers to the fact that at every single update step, we compute the sum over all training examples.
Every parameter update looks at the entire training set. This is why the derivatives have $\frac{1}{m} \sum_{i=1}^{m}$ in them — you are averaging the gradient signal over every example before taking a step.
This is the most stable form of gradient descent — the gradient estimate is exact (no approximation). The cost: if $m$ is very large — millions of training examples — computing the full sum at every step becomes expensive. This motivates alternative variants like stochastic gradient descent (update using one example at a time) and mini-batch gradient descent (update using a small random subset), which we will encounter later in the course.
[!NOTE] For the rest of this lecture and the next few, "gradient descent" means batch gradient descent unless explicitly stated otherwise. It is the right starting point for understanding — the variants are optimisations of the same underlying idea.
How do we know when to stop? The algorithm runs for some number of iterations, and ideally $J$ is decreasing each time. There are two practical stopping criteria:
Fixed number of iterations — simply run for 1000 iterations, or 10,000. Simple and predictable, but you might stop too early or waste time running after convergence.
Convergence threshold — stop when the decrease in $J$ from one iteration to the next falls below some small number $\epsilon$, say $10^{-3}$. If $J$ is barely changing, you are near the minimum and further iterations are not useful.
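The threshold criterion can be sketched as a loop that watches the change in the cost (the toy cost, function name, and the particular $\epsilon$ here are mine, for illustration only):

```python
# Stop when the per-iteration decrease in J falls below epsilon.
# The cost here is the toy J(theta) = theta^2; the epsilon value is
# illustrative, not a recommended default.

def minimise_until_converged(alpha=0.1, epsilon=1e-9, max_iterations=100000):
    theta = 5.0
    J = theta ** 2
    for i in range(max_iterations):
        theta -= alpha * 2 * theta   # gradient step on J(theta) = theta^2
        J_new = theta ** 2
        if J - J_new < epsilon:      # barely improving any more: stop
            return theta, i + 1
        J = J_new
    return theta, max_iterations

theta, iterations = minimise_until_converged()
# Converges long before the iteration cap is reached.
```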
In practice you plot $J$ versus the number of iterations — the loss curve. For a working implementation with a good $\alpha$, this curve should decrease smoothly and flatten out. The shape tells you a lot: a curve that has flattened means you have converged; a curve still clearly falling means more iterations would help; a curve that rises or oscillates means $\alpha$ is too large or the gradient computation is buggy; an extremely slow, nearly linear decline suggests $\alpha$ is too small.
This is worth stating explicitly rather than leaving as a vague reassurance. For linear regression specifically, the cost function $J(\theta_0, \theta_1)$ is a convex paraboloid — a bowl with no bumps, ridges, or secondary valleys. It has exactly one global minimum.
This means that gradient descent, regardless of where $\theta_0$ and $\theta_1$ are initialised, will always descend toward the same unique minimum. There is no risk of getting stuck in a local minimum — there are none. The algorithm is guaranteed to converge to the globally optimal parameters, provided $\alpha$ is small enough.
[!MATH] A function $f$ is convex if the line segment between any two points on its graph lies above or on the graph. Equivalently, all eigenvalues of the Hessian (the matrix of second derivatives) are non-negative everywhere. For $J$ with a linear hypothesis $h_\theta$, this is straightforwardly satisfied — the Hessian is $\frac{1}{m} X^\top X$, where $X$ is the data matrix, which is positive semi-definite by construction. This convexity guarantee is what makes linear regression so tractable, and it is precisely what deep neural networks sacrifice in exchange for expressive power.
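The eigenvalue condition is easy to verify numerically for any concrete dataset. A sketch (the inputs and variable names are mine; it assumes NumPy is available):

```python
import numpy as np

# For linear regression, J(theta) = (1/2m) * ||X @ theta - y||^2 has
# Hessian (1/m) * X^T X, which is positive semi-definite for any data.
x = np.array([0.0, 1.0, 2.0, 3.0])         # arbitrary inputs
X = np.column_stack([np.ones_like(x), x])  # data matrix with a bias column
m = len(x)

hessian = (X.T @ X) / m
eigenvalues = np.linalg.eigvalsh(hessian)  # eigenvalues of a symmetric matrix
# All eigenvalues are non-negative (up to floating-point noise), so J is convex.
assert np.all(eigenvalues >= -1e-12)
```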
It is useful to see the complete picture in one place, with each piece connected back to its motivation:
Initialise $\theta_0 = 0$, $\theta_1 = 0$ (or any values — for convex $J$ it does not matter).

Repeat until convergence:

$$\begin{aligned}
\text{temp}_0 &\leftarrow \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \\
\text{temp}_1 &\leftarrow \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)} \\
\theta_0 &\leftarrow \text{temp}_0, \qquad \theta_1 \leftarrow \text{temp}_1
\end{aligned}$$
Each repetition of this loop is called an iteration — or an epoch, in the batch setting. After enough iterations, $\theta_0$ and $\theta_1$ converge to the values that minimise $J$ — the parameters of the best-fit line through the training data.
It helps to hold both spaces in mind at once while the algorithm runs.
In parameter space ($\theta_0$ vs $\theta_1$), you are watching a point move across the contour plot from Lecture 1. Each iteration moves the point toward the centre — the minimum of $J$. The step direction is always perpendicular to the contour lines (that is what "steepest descent" means geometrically), and the step size is $\alpha$ times the gradient magnitude.
In data space ($x$ vs $y$), you are watching a line rotate and shift. As $\theta_0$ and $\theta_1$ change, the hypothesis line adjusts — gradually tilting and translating until it fits through the data as well as possible.
The two views are two descriptions of the same process. One is the mechanism; the other is the effect.
Gradient descent is the engine that powers nearly all of modern ML. The specific formulas for linear regression are just one instance of a completely general idea:
The three things to internalise from this lecture:

1. The update rule $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J$: repeatedly stepping opposite the gradient walks the parameters downhill on the cost surface.
2. All parameters must be updated simultaneously, using derivatives computed at the current values.
3. The learning rate $\alpha$ is a trade-off: too small is slow, too large oscillates or diverges, and near the minimum the steps shrink on their own.
The next natural question: what happens when $h_\theta$ is not a simple line? What if we have many input features, not just one $x$? That is Lecture 3 — Multivariate Linear Regression.