Lecture · L02

Gradient Descent


Where We Left Off

At the end of Lecture 1 we had a complete problem statement. We have a training set, we chose a hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$, and we defined the cost function:

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

The goal is $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$.

What we did not answer is: how do we actually find that minimum? For the simple case of univariate linear regression you could solve it analytically — set the derivatives to zero and solve the system. But that only works for linear regression. The moment the model gets more complex — more parameters, nonlinear hypotheses, deep networks — closed-form solutions are generally unavailable. We need a general method that works regardless of the shape or complexity of $J$.

That method is gradient descent.


The Core Intuition

Before writing any formula, build the picture. Imagine you are standing somewhere on a hilly landscape, and your goal is to reach the lowest valley. You cannot see the whole landscape — you can only feel the slope of the ground directly beneath your feet. What do you do?

You look around 360 degrees, find the direction that goes downhill most steeply, take a small step in that direction, then stop and repeat. Look around again, find the steepest descent, take another small step. Repeat until you are at a point where every direction around you goes uphill — you are at a local minimum.

This is exactly what gradient descent does in parameter space. The "landscape" is the surface of $J(\theta_0, \theta_1)$. The "position" is the current pair $(\theta_0, \theta_1)$. The "steepest downhill direction" is the negative of the gradient — the direction in which $J$ decreases fastest. The "small step" is controlled by a number called the learning rate $\alpha$.


The Gradient Descent Algorithm

The algorithm starts by initialising $\theta_0$ and $\theta_1$ to some values — typically both zero, though any starting point works for the convex cost surfaces we will see in linear regression.

Then it repeatedly applies this update rule until convergence:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \qquad \text{for } j = 0, 1$$

Read this carefully. The $:=$ symbol means assignment — we are replacing the old value of $\theta_j$ with a new one. The new value is the old value minus a small step in the direction of the partial derivative.

Why subtract rather than add? Because the partial derivative $\frac{\partial J}{\partial \theta_j}$ points in the direction of steepest increase in $J$. We want to decrease $J$, so we go in the opposite direction — hence the minus sign.

Breaking the update rule into its pieces:

  • $\theta_j$ — the parameter we are currently updating ($\theta_0$ or $\theta_1$)
  • $\alpha$ — the learning rate: a small positive number controlling the step size
  • $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ — the partial derivative of the cost with respect to $\theta_j$: tells us the slope of $J$ in the $\theta_j$ direction at our current position

Each iteration moves $(\theta_0, \theta_1)$ to a point of lower cost. Over many iterations, the parameters walk down the cost surface toward the minimum.
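To make the update rule concrete, here is a minimal sketch on a toy one-parameter cost (not from the lecture): $J(\theta) = (\theta - 3)^2$, whose derivative is $2(\theta - 3)$, so the minimum sits at $\theta = 3$.

```python
# Toy illustration: minimise J(theta) = (theta - 3)^2 by gradient descent.
def dJ(theta):
    # Derivative of the toy cost: dJ/dtheta = 2 * (theta - 3)
    return 2.0 * (theta - 3.0)

alpha = 0.1   # learning rate: a small positive step size
theta = 0.0   # arbitrary starting point

# Repeatedly apply theta := theta - alpha * dJ/dtheta
for _ in range(200):
    theta = theta - alpha * dJ(theta)

# theta is now very close to the minimiser, 3.0
```

The loop is the entire algorithm: evaluate the slope at the current position, step against it, repeat.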


Simultaneous Update — A Critical Detail

There is one implementation detail that is easy to get wrong but matters: you must update both parameters simultaneously, not sequentially.

This means you compute both new values using the old values of $\theta_0$ and $\theta_1$, then assign both at once:

$$\text{temp}_0 := \theta_0 - \alpha \frac{\partial J}{\partial \theta_0}$$

$$\text{temp}_1 := \theta_1 - \alpha \frac{\partial J}{\partial \theta_1}$$

$$\theta_0 := \text{temp}_0$$

$$\theta_1 := \text{temp}_1$$

Why does this matter? If you updated $\theta_0$ first, then used the already-updated $\theta_0$ to compute the derivative for $\theta_1$, you would be computing the gradient at a different point than intended — $\theta_0$ has already moved. The update for $\theta_1$ would be using mixed information from two different positions on the cost surface. This is technically a different algorithm and it can behave erratically.
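The difference is easy to see in code. The sketch below uses a made-up convex bowl $J(\theta_0, \theta_1) = \theta_0^2 + \theta_0\theta_1 + \theta_1^2$ (chosen only so the gradients are simple) and compares one correct simultaneous step against the buggy sequential version:

```python
# Illustrative toy cost J(t0, t1) = t0^2 + t0*t1 + t1^2 (not from the lecture).
def grad(t0, t1):
    # Partial derivatives of the toy J: (dJ/dt0, dJ/dt1)
    return 2 * t0 + t1, t0 + 2 * t1

alpha = 0.1

# Correct: both derivatives evaluated at the SAME current point.
theta0, theta1 = 1.0, 1.0
d0, d1 = grad(theta0, theta1)
temp0 = theta0 - alpha * d0
temp1 = theta1 - alpha * d1
theta0, theta1 = temp0, temp1       # both end up at 0.7

# Buggy: theta0 moves first, so theta1's derivative is evaluated at a
# mixed point on the surface and the result differs (0.73, not 0.7).
b0, b1 = 1.0, 1.0
b0 = b0 - alpha * grad(b0, b1)[0]
b1 = b1 - alpha * grad(b0, b1)[1]
```

One step is enough to see the two versions diverge; over many iterations the discrepancy compounds.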

[!WARNING] Always compute all partial derivatives using the current parameter values before updating any of them. This applies to all parameters simultaneously, even when you have millions of them in a neural network.


The Learning Rate $\alpha$

The learning rate $\alpha$ is a hyperparameter — a number you choose before training, not something the algorithm learns from data. It controls how large a step to take in the downhill direction at each iteration.

Getting it right matters more than it might seem. Consider what happens at the two extremes:

If $\alpha$ is too small: the steps are tiny. The algorithm will eventually reach the minimum, but it will take an enormous number of iterations to get there — practically, training becomes too slow to be useful.

If $\alpha$ is too large: the steps are so big that you overshoot the minimum. Instead of settling into the valley, the algorithm leaps over it to the other side, where the slope pushes it back — but again too far. The parameters oscillate back and forth, and $J$ can actually increase over iterations rather than decrease. In the worst case the algorithm diverges entirely.

The sweet spot is somewhere in between — large enough to converge in a reasonable number of iterations, small enough not to overshoot. In practice you try a few values and watch the loss curve.

There is also something worth noticing: even with a fixed $\alpha$, the step size naturally decreases as you approach the minimum. This is because the slope — the partial derivative — gets smaller as you get closer to the bottom of the bowl. Near the minimum the gradient is nearly zero, so the update $\alpha \cdot \frac{\partial J}{\partial \theta_j}$ becomes nearly zero even with the same $\alpha$. The algorithm naturally slows down and settles in without any manual adjustment.

[!NOTE] If gradient descent is working correctly, $J(\theta_0, \theta_1)$ should decrease on every single iteration. If you ever see $J$ increase, something is wrong — most likely $\alpha$ is too large, or there is a bug in the derivative computation.
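All three regimes can be demonstrated on the simplest possible cost. The sketch below (illustrative values, not from the lecture) runs gradient descent on $J(\theta) = \theta^2$, where $dJ/d\theta = 2\theta$, from $\theta = 1$ with three learning rates:

```python
# Gradient descent on J(theta) = theta^2 with three learning rates.
def run(alpha, steps=50):
    theta = 1.0
    for _ in range(steps):
        theta = theta - alpha * 2 * theta   # theta := theta - alpha * dJ/dtheta
    return theta

small = run(0.001)   # too small: after 50 steps theta has barely moved
good  = run(0.1)     # reasonable: theta shrinks toward the minimum at 0
huge  = run(1.1)     # too large: every step overshoots further; |theta| explodes
```

Each step multiplies $\theta$ by $(1 - 2\alpha)$, so the three regimes correspond to that factor being just under 1, comfortably inside $(-1, 1)$, or outside it.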


Computing the Partial Derivatives

So far we have said "compute the partial derivative of $J$ with respect to $\theta_j$" without actually doing it. Let us do it now for the linear regression cost function. This is the calculus step — worth following once carefully so it is never a black box.

Recall:

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\right)^2$$

We need $\frac{\partial J}{\partial \theta_0}$ and $\frac{\partial J}{\partial \theta_1}$ separately.

Derivative with respect to $\theta_0$

Differentiate using the chain rule. The square gives a factor of 2, the inner term $(\theta_0 + \theta_1 x^{(i)} - y^{(i)})$ differentiates with respect to $\theta_0$ to give 1. The 2 from the chain rule cancels with the $\frac{1}{2}$:

$$\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)$$

Derivative with respect to $\theta_1$

Same process, but now the inner term differentiates with respect to $\theta_1$ to give $x^{(i)}$ — it comes along as a factor:

$$\frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$$

Notice the only difference: the $\theta_1$ derivative picks up an extra $x^{(i)}$ inside the sum. This makes intuitive sense — $\theta_1$ is the slope, and its effect on the prediction scales with the size of the input $x^{(i)}$.
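The two derivative formulas translate directly into a few lines of vectorised NumPy. The data and parameter values below are made up purely for illustration:

```python
import numpy as np

# Toy data and an arbitrary current parameter setting (illustrative only).
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
theta0, theta1 = 0.0, 0.0
m = len(x)

errors = theta0 + theta1 * x - y        # h_theta(x^(i)) - y^(i), all i at once
dJ_dtheta0 = errors.sum() / m           # (1/m) * sum of errors
dJ_dtheta1 = (errors * x).sum() / m     # (1/m) * sum of errors, weighted by x^(i)
```

The single extra factor of `x` in the second line mirrors the single extra $x^{(i)}$ in the formula.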


Gradient Descent for Linear Regression

Substituting these derivatives back into the update rule, the complete gradient descent algorithm for linear regression is:

Repeat until convergence:

$$\theta_0 := \theta_0 - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)$$

$$\theta_1 := \theta_1 - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$$

This is the complete, unambiguous algorithm. Every iteration: compute both sums using the current $\theta_0$ and $\theta_1$, then update both simultaneously.

Reading the update for $\theta_0$: if the model is consistently predicting too high, the sum of $(h - y)$ will be positive, so $\theta_0$ decreases — the intercept pulls down. If the model is consistently too low, the sum is negative and $\theta_0$ increases. The update automatically corrects in the right direction.

The same logic applies to $\theta_1$, but weighted by $x^{(i)}$ — examples with larger input values have a greater influence on how the slope is adjusted.
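Putting the two update rules in a loop gives a runnable sketch of the whole algorithm. The toy data below is generated from $y = 2x + 1$, so the loop should recover $\theta_0 \approx 1$ and $\theta_1 \approx 2$ (the data, learning rate, and iteration count are all illustrative choices):

```python
import numpy as np

# Toy training set generated from y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
m = len(x)

theta0, theta1 = 0.0, 0.0
alpha = 0.05

for _ in range(5000):
    errors = theta0 + theta1 * x - y      # prediction errors at the current point
    grad0 = errors.sum() / m              # dJ/dtheta0
    grad1 = (errors * x).sum() / m        # dJ/dtheta1, same (theta0, theta1)
    # Simultaneous update: both new values computed before either is assigned.
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
```

Note how the tuple assignment on the last line enforces the simultaneous-update rule for free: both right-hand sides are evaluated before either parameter changes.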


Batch Gradient Descent

The version of gradient descent described here is specifically called batch gradient descent. The word "batch" refers to the fact that at every single update step, we compute the sum $\sum_{i=1}^m$ over all $m$ training examples.

Every parameter update looks at the entire training set. This is why the derivatives have $\frac{1}{m}\sum_{i=1}^{m}$ in them — you are averaging the gradient signal over every example before taking a step.

This is the most stable form of gradient descent — the gradient estimate is exact (no approximation). The cost: if $m$ is very large — millions of training examples — computing the full sum at every step becomes expensive. This motivates alternative variants like stochastic gradient descent (update using one example at a time) and mini-batch gradient descent (update using a small random subset), which we will encounter later in the course.
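As a preview of that contrast (the variants themselves come later in the course), here is an illustrative sketch of one batch step next to one stochastic step, on made-up data:

```python
import numpy as np

# Made-up training set from y = 2x + 1, with m = 1000 examples.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=1000)
y = 2.0 * x + 1.0
theta0, theta1, alpha = 0.0, 0.0, 0.01

# One BATCH step: the gradient is averaged over the ENTIRE training set.
errors = theta0 + theta1 * x - y
theta0 -= alpha * errors.mean()
theta1 -= alpha * (errors * x).mean()

# One STOCHASTIC step: a single random example stands in for the whole sum.
# Much cheaper per step, but a noisy estimate of the true gradient.
i = rng.integers(len(x))
err_i = theta0 + theta1 * x[i] - y[i]
theta0 -= alpha * err_i
theta1 -= alpha * err_i * x[i]
```

The batch step touches all 1000 examples; the stochastic step touches one. Both move the parameters downhill on average, which is why the cheap variant still works.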

[!NOTE] For the rest of this lecture and the next few, "gradient descent" means batch gradient descent unless explicitly stated otherwise. It is the right starting point for understanding — the variants are optimisations of the same underlying idea.


Convergence

How do we know when to stop? The algorithm runs for some number of iterations, and ideally $J$ is decreasing each time. There are two practical stopping criteria:

Fixed number of iterations — simply run for 1000 iterations, or 10,000. Simple and predictable, but you might stop too early or waste time running after convergence.

Convergence threshold — stop when the decrease in $J$ from one iteration to the next falls below some small number $\epsilon$, say $10^{-3}$. If $J$ is barely changing, you are near the minimum and further iterations are not useful.

In practice you plot $J$ versus the number of iterations — the loss curve. For a working implementation with a good $\alpha$, this curve should decrease smoothly and flatten out. The shape tells you a lot:

  • Smooth monotone decrease → everything is working, $\alpha$ is reasonable
  • Oscillating or increasing → $\alpha$ is too large
  • Decreasing but very slowly → $\alpha$ might be too small, or you simply need more iterations
  • Flatlines immediately → $\alpha$ is far too small, or the gradient is almost zero everywhere (rare for linear regression)
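The convergence-threshold criterion is a few lines of code. This sketch combines it with a fixed-iteration safety cap; the data, $\alpha$, and $\epsilon$ values are illustrative:

```python
import numpy as np

# Toy training set generated from y = 3x - 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 3.0 * x - 1.0
m = len(x)

def cost(t0, t1):
    # J(theta0, theta1) = (1/2m) * sum of squared errors
    return ((t0 + t1 * x - y) ** 2).sum() / (2 * m)

theta0, theta1 = 0.0, 0.0
alpha, eps = 0.05, 1e-9
J_prev = cost(theta0, theta1)
iterations = 0

while iterations < 100_000:                 # safety cap on total iterations
    errors = theta0 + theta1 * x - y
    theta0, theta1 = (theta0 - alpha * errors.sum() / m,
                      theta1 - alpha * (errors * x).sum() / m)
    iterations += 1
    J = cost(theta0, theta1)
    if J_prev - J < eps:                    # J barely changed: declare convergence
        break
    J_prev = J
```

Tracking `J` each iteration also gives you the loss curve for free — append it to a list and plot it to diagnose the shapes listed above.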

Why This Works for Linear Regression

This is worth stating explicitly rather than leaving as a vague reassurance. For linear regression specifically, the cost function $J(\theta_0, \theta_1)$ is a convex paraboloid — a bowl with no bumps, ridges, or secondary valleys. It has exactly one global minimum.

This means that gradient descent, regardless of where $\theta_0$ and $\theta_1$ are initialised, will always descend toward the same unique minimum. There is no risk of getting stuck in a local minimum — there are none. The algorithm is guaranteed to converge to the globally optimal parameters, provided $\alpha$ is small enough.

[!MATH] A function is convex if the line segment between any two points on its graph lies above or on the graph. Equivalently, all eigenvalues of the Hessian (the matrix of second derivatives) are non-negative everywhere. For $J(\theta_0, \theta_1) = \frac{1}{2m}\sum_i \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$ with a linear $h$, this is straightforwardly satisfied — the Hessian is $\frac{1}{m} X^T X$ where $X$ is the data matrix, which is positive semi-definite by construction. This convexity guarantee is what makes linear regression so tractable, and it is precisely what deep neural networks sacrifice in exchange for expressive power.
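The Hessian claim is easy to check numerically on a toy data matrix (the values below are made up for illustration):

```python
import numpy as np

# Toy inputs; X stacks a bias column of ones next to x, one row per example.
x = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])   # rows are [1, x^(i)]
m = len(x)

H = X.T @ X / m                             # Hessian of J (constant in theta)
eigenvalues = np.linalg.eigvalsh(H)         # real eigenvalues of the symmetric H

# All eigenvalues >= 0 confirms convexity; strictly > 0 here (the x values
# are distinct, so X has full column rank) confirms a unique minimum.
```

Because the hypothesis is linear, `H` does not depend on $\theta$ at all: the bowl has the same curvature everywhere, which is why one fixed $\alpha$ can work for the whole descent.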


Putting It All Together: The Full Algorithm

It is useful to see the complete picture in one place, with each piece connected back to its motivation:

Initialise $\theta_0 = 0$, $\theta_1 = 0$ (or any values — for convex $J$ it does not matter).

Repeat until convergence:

  1. Compute the prediction error for every training example: $h_\theta(x^{(i)}) - y^{(i)}$
  2. Average those errors (with the appropriate $x^{(i)}$ weighting for $\theta_1$) to get the gradient
  3. Move both parameters one small step $\alpha$ in the downhill direction, simultaneously

Each repetition of this loop is called an iteration or an epoch in the batch setting. After enough iterations, $\theta_0$ and $\theta_1$ converge to the values that minimise $J$ — the parameters of the best-fit line through the training data.


Geometric Picture: What Is Actually Happening

It helps to hold both spaces in mind at once while the algorithm runs.

In parameter space ($\theta_0$ vs $\theta_1$), you are watching a point move across the contour plot from Lecture 1. Each iteration moves the point toward the centre — the minimum of $J$. The step direction is always perpendicular to the contour lines (that is what "steepest descent" means geometrically), and the step size is $\alpha$ times the gradient magnitude.

In data space ($x$ vs $y$), you are watching a line rotate and shift. As $\theta_0$ and $\theta_1$ change, the hypothesis line adjusts — gradually tilting and translating until it fits through the data as well as possible.

The two views are two descriptions of the same process. One is the mechanism; the other is the effect.


Summary

Gradient descent is the engine that powers nearly all of modern ML. The specific formulas for linear regression are just one instance of a completely general idea:

  1. Compute how $J$ changes with respect to each parameter — the partial derivatives
  2. Update each parameter by a small step in the direction that decreases $J$
  3. Repeat until the parameters stop changing meaningfully

The three things to internalise from this lecture:

  • The update rule $\theta_j := \theta_j - \alpha \frac{\partial J}{\partial \theta_j}$ is the complete algorithm
  • The learning rate $\alpha$ controls stability and speed — too large diverges, too small is impractical
  • For linear regression the cost surface is convex, so convergence to the global minimum is guaranteed

The next natural question: what happens when $h$ is not a simple line? What if we have many input features, not just one $x$? That is Lecture 3 — Multivariate Linear Regression.