There is no single universally accepted definition of ML. The two most cited:
Arthur Samuel (1959):
"ML is the field of study that gives computers the ability to learn without being explicitly programmed."
Tom Mitchell (1997):
"A machine is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at T, as measured by P, improves with experience E."
(Tom has apparently been asked whether he just wrote this definition to rhyme.)
Mitchell's is the most rigorous. To make it concrete, apply it to the checkers example: T is playing checkers, P is the fraction of games won, and E is the experience of playing many games.
And to spam filtering: T is classifying emails as spam or not spam, P is the fraction of emails classified correctly, and E is watching the user flag emails as spam.
[!NOTE] What "AI" actually means in this course (skipping the philosophy): most AI research today is not trying to recreate general intelligence. It is automating task execution at a scale and speed not humanly possible. Facial recognition is a good example — not smarter than a human in any general sense, but it can match hundreds of millions of faces in seconds. The keyword is integration of AI with human intelligence, not replacement.
The math behind ML is decades old. So why the explosion only recently? Three things converged: far more data, far more compute (GPUs in particular), and algorithmic advances that made training deep networks practical.
The canonical turning point: AlexNet (2012) — a deep network that demolished the competition on ImageNet and made the entire community realise the era of deep learning had arrived.
| Type | What you give the model | What it does | Example |
|---|---|---|---|
| Supervised | Labeled $(x, y)$ pairs | Learns a mapping $x \to y$ | Spam filter, tumour classification |
| Unsupervised | Only $x$, no labels | Finds structure in data | Clustering patients by gene expression |
| Reinforcement | Agent + reward signal | Learns by trial and error | Robotic goalkeeper |
The course covers primarily supervised learning, some unsupervised learning, and, as for reinforcement learning, "we will see."
In supervised learning, we already know the correct output for each training example. We give the algorithm both the inputs and the right answers — we are supervising it.
There are two sub-types and it is worth being precise about which is which, because the math differs:
Regression — the output is continuous, something you could place on a number line. House prices are the classic example: you input size, location, number of rooms, and you want a predicted price in euros. The output could be any real number.

Classification — the output is discrete, one of a fixed set of categories. Is this tumour malignant or benign? The output is 0 or 1. There is no "halfway between malignant and benign" — it's one or the other.

In the tumour example: if you only have tumour size, the algorithm draws a 1D threshold — everything above some threshold size is malignant, below it is benign. If you add patient age, now you have a 2D plane and the boundary becomes a curve. If you add thickness, cell uniformity, cell shape, the algorithm can use all of them simultaneously to draw a much more nuanced boundary.
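A sketch of the 1D threshold case in Python. The tumour sizes and labels are invented for illustration; candidate thresholds are tried at the midpoints between consecutive sizes:

```python
# Invented data: tumour sizes (cm) and labels (0 = benign, 1 = malignant).
sizes  = [1.2, 1.8, 2.5, 3.1, 3.8, 4.4]
labels = [0,   0,   0,   1,   1,   1]

def accuracy(threshold):
    """Fraction of examples correctly classified by 'size > threshold => malignant'."""
    preds = [1 if s > threshold else 0 for s in sizes]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Try candidate thresholds halfway between consecutive sizes.
candidates = [(a + b) / 2 for a, b in zip(sizes, sizes[1:])]
best = max(candidates, key=accuracy)
print(best, accuracy(best))  # the midpoint 2.8 separates the two classes perfectly
```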
This is the real power of ML — it scales to as many features as you can give it. But there is a catch: more features require more data to learn from reliably. With 10 features and only 20 examples, the model has too much freedom and not enough signal. The features and the data have to grow together.
The difference from supervised learning is simple but profound: you have no labels. Nobody has told the algorithm what the right answer is. You just hand it the data and ask "can you find any structure in here?"
Clustering is the most common type — grouping similar points together. Some examples of where this is actually used:
The key point about unsupervised learning is this: you don't know what the categories mean until after you find them. The algorithm gives you "Group 1" and "Group 2". What those groups represent biologically, commercially, or physically — that interpretation belongs to a human with domain knowledge.

Non-clustering is a separate category — finding structure that isn't about grouping. The cocktail party problem is the canonical example: two people speaking simultaneously in a room, two microphones placed at different positions, and the goal is to separate the two voices from the two mixed recordings. Humans do this effortlessly and unconsciously. For machines it is genuinely hard, and it is an unsupervised problem because there are no labels — nobody has pre-tagged which audio belongs to which speaker.
Before diving into the math, it is worth having a clear picture of the overall workflow, because all the notation and formulas that follow are just precise ways of describing steps in this pipeline.
The idea is simple. You start with a training set — a collection of examples where you know both the input and the correct output. You feed that to a learning algorithm, whose job is to produce a hypothesis $h$. The hypothesis is just a function — it takes a new input $x$ and produces a predicted output $\hat{y}$.
Training set → Learning Algorithm → Hypothesis h
Then at inference time — when you actually want to use the model:
New x → h → Predicted ŷ
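A minimal sketch of that pipeline in Python, using NumPy's least-squares line fit as the "learning algorithm". The housing numbers are invented:

```python
import numpy as np

# Invented training set: sizes in sqft, prices in euros.
X = np.array([1000, 1500, 2000, 2500], dtype=float)
y = np.array([200_000, 290_000, 410_000, 500_000], dtype=float)

# "Learning algorithm": least-squares fit of a straight line.
theta1, theta0 = np.polyfit(X, y, deg=1)  # returns slope first, then intercept

# The resulting hypothesis h.
def h(x):
    return theta0 + theta1 * x

# Inference time: new x in, predicted y-hat out.
print(h(1800))
```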
Everything in the rest of this course is about: what form should $h$ take? How do we measure if $h$ is good? How do we find the best $h$?
You will see this notation in every ML paper, every lecture, every textbook. It is worth committing to memory now so it never slows you down later.
| Symbol | Meaning |
|---|---|
| $m$ | Number of training examples |
| $x$ | Input variable / feature |
| $y$ | Output variable / target |
| $(x^{(i)}, y^{(i)})$ | The $i$-th training example |
| $h$ | The hypothesis — the function the model learns |
| $\theta$ | Parameters of the model |
[!WARNING] $x^{(i)}$ means the $i$-th example — it is not $x$ raised to the power $i$. The superscript in parentheses is always an index into the training set.
For the housing dataset: $x^{(1)}$ might be 2104 sqft, with $y^{(1)}$ the corresponding price in euros. $(x^{(2)}, y^{(2)})$ is the second house. $m$ is however many rows you have in total.
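In code, the notation maps onto ordinary indexing. A sketch with invented prices — note that the math convention here is 1-indexed while Python is 0-indexed:

```python
# Training set as parallel lists; sizes and prices are invented for illustration.
x = [2104, 1416, 1534, 852]                # input variable (size, sqft)
y = [460_000, 232_000, 315_000, 178_000]   # output variable (price, euros)

m = len(x)  # number of training examples

# Math is 1-indexed, Python is 0-indexed, so x^(2) in the notes is x[1] here.
second_example = (x[1], y[1])  # (x^(2), y^(2)): the second house
print(m, second_example)
```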
Now that we have notation, we can write down the simplest possible hypothesis. For the housing price example — one input variable (size), one output (price) — we guess that the relationship is approximately linear:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

This is a straight line. $\theta_0$ is the intercept (where the line crosses the $y$-axis) and $\theta_1$ is the slope (how much price increases per unit of size). The $\theta$ values are the parameters — the numbers the algorithm will learn from data.
This specific model has a name: univariate linear regression.
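As a function, the hypothesis is one line of code. The parameter values below are invented for illustration:

```python
def h(x, theta0, theta1):
    """Univariate linear hypothesis: h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Invented parameters: 50,000-euro intercept, 120 euros per sqft of slope.
print(h(2000, 50_000, 120))  # -> 290000
```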
The question this immediately raises is: there are infinitely many possible lines. Which one is best? That is precisely what the cost function answers.
To choose the best and , we need a way to measure how good or bad any particular choice is. The natural idea: a good hypothesis is one where the predictions are close to the true values across all training examples.
The gap between prediction and truth on a single example is the error:

$$h_\theta(x^{(i)}) - y^{(i)}$$
We want this to be small for every example. But how do we turn individual errors into one single number we can minimise? We need to aggregate them somehow.
The naive idea — just sum the raw errors — has a fatal flaw: a prediction that is too high on one example and too low on another would sum to zero, looking perfect when it is clearly not. The errors cancel each other out.
The fix: square each error before summing. Squaring makes every term positive, so nothing can cancel. It also penalises large errors disproportionately — being off by 10 costs 100, while being off by 1 costs only 1. The model is pushed hard to avoid big mistakes.
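A two-line check of the cancellation problem and the squared fix:

```python
# Why raw errors are a bad aggregate: +10 and -10 cancel to zero.
errors = [10, -10]

raw_sum = sum(errors)                     # 0: looks perfect, clearly is not
squared_sum = sum(e**2 for e in errors)   # 200: large errors cannot hide

print(raw_sum, squared_sum)
```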
Putting this together, the squared error cost function is:

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
The $\frac{1}{m}$ averages over all examples so $J$ doesn't artificially grow just because you have a larger dataset. The $\frac{1}{2}$ is pure mathematical convenience — when you differentiate later, the exponent 2 comes down and cancels with it, making the gradient formula cleaner. It has no effect on which $\theta$ minimises $J$.
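The cost function translates directly to code. A sketch on toy data lying exactly on $y = 2x$, so the true parameters give zero cost:

```python
def J(theta0, theta1, xs, ys):
    """Squared error cost: (1 / 2m) * sum of (h(x_i) - y_i)^2."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1, 2, 3], [2, 4, 6]   # toy data on the line y = 2x

print(J(0, 2, xs, ys))   # perfect fit -> 0.0
print(J(0, 1, xs, ys))   # worse fit, higher cost
```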
The goal is now precisely stated:

$$\min_{\theta_0,\, \theta_1} \; J(\theta_0, \theta_1)$$
This is the conceptual point, and it is worth pausing on.
$h_\theta(x)$ and $J(\theta_0, \theta_1)$ are different kinds of objects living in different spaces.
$h_\theta(x)$ lives in data space. Given fixed $\theta$ values, $h$ is a function of $x$ — it draws a line through your data plot. Change $\theta$ and the line moves.
$J(\theta_0, \theta_1)$ lives in parameter space. Given fixed data, $J$ is a function of $\theta_0$ and $\theta_1$ — it takes a point in the $(\theta_0, \theta_1)$ plane and returns a single number representing how bad that choice of parameters is. Every possible line you could draw through the data corresponds to exactly one point on the $J$ surface.
Minimising means finding the point in parameter space where the surface is lowest — which corresponds to the line in data space that best fits the training data.
To see this clearly, simplify temporarily. Set $\theta_0 = 0$, forcing the line through the origin: $h_\theta(x) = \theta_1 x$. Now $J$ depends on only one number, $\theta_1$, and we can plot it.
Take three training examples: $(1, 1)$, $(2, 2)$, $(3, 3)$ — points that lie perfectly on the line $y = x$.

| $\theta_1$ | What the line looks like | $J(\theta_1)$ |
|---|---|---|
| 1.0 | Passes through all three points | 0 |
| 0.5 | Undershoots everything | ≈ 0.58 |
| 0.0 | Flat line at zero | ≈ 2.33 |
| 2.0 | Overshoots everything | ≈ 2.33 |
Plot these points and you get a parabola — $J$ is zero at $\theta_1 = 1$ (perfect fit) and grows as $\theta_1$ moves away from 1 in either direction. The minimum of the parabola corresponds exactly to the best-fit line. This is not a coincidence — it is always the case.
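The cost values can be checked numerically. A sketch assuming the three training points are $(1, 1)$, $(2, 2)$, $(3, 3)$:

```python
def J(theta1, data):
    """Cost for the origin-constrained line h(x) = theta1 * x."""
    m = len(data)
    return sum((theta1 * x - y) ** 2 for x, y in data) / (2 * m)

data = [(1, 1), (2, 2), (3, 3)]  # assumed training points, lying on y = x

for t in [0.0, 0.5, 1.0, 2.0]:
    print(t, round(J(t, data), 2))  # J(1.0) is 0; J(0.0) and J(2.0) are both ~2.33
```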
Restore $\theta_0$ and now $J(\theta_0, \theta_1)$ is a surface in three dimensions — a bowl shape (technically a paraboloid). We cannot easily see a 3D surface, so we use a contour plot: slice the bowl horizontally at different heights and project the resulting ovals down onto the $(\theta_0, \theta_1)$ plane.
Every oval is a set of $(\theta_0, \theta_1)$ pairs that all give the same cost $J$. The outer ovals are high cost — bad fits. The inner ovals are lower cost — better fits. The very centre of the innermost oval is the minimum — the globally best $\theta_0$ and $\theta_1$.
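Since a contour plot is hard to reproduce in text, a brute-force grid search over the surface makes the same point: there is a single lowest spot. The data here are invented, lying exactly on $y = 2x + 1$:

```python
# Invented data lying exactly on the line y = 2x + 1.
data = [(0, 1), (1, 3), (2, 5)]

def J(t0, t1):
    """Squared error cost for h(x) = t0 + t1 * x."""
    return sum((t0 + t1 * x - y) ** 2 for x, y in data) / (2 * len(data))

# Evaluate J on a grid of (theta0, theta1) pairs and take the lowest point.
grid = [i / 10 for i in range(-30, 31)]  # -3.0 to 3.0 in steps of 0.1
best = min(((t0, t1) for t0 in grid for t1 in grid), key=lambda p: J(*p))
print(best)  # -> (1.0, 2.0): the true intercept and slope
```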
[!MATH] For univariate linear regression, the cost surface is a convex paraboloid — it has exactly one global minimum and no local minima at all. Any method that walks downhill will always reach the same optimal solution regardless of where it starts. This is a special property of linear regression. Deep neural networks have highly non-convex loss surfaces with many local minima and saddle points — far harder to optimise.
The full pipeline introduced in this lecture, now with precise meaning attached to every step:

1. Collect a training set of $(x^{(i)}, y^{(i)})$ pairs.
2. Choose a form for the hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$.
3. Define a cost function $J(\theta_0, \theta_1)$ measuring how badly $h$ fits the training data.
4. Minimise $J$ over the parameters.

Step 4 is the one we have not answered yet: we know what to minimise, but not how. That is the subject of Lecture 2 — Gradient Descent.