Lecture · L01

Definitions of Machine Learning


Definitions (brief — just vocabulary)

There is no single universally accepted definition of ML. The two most cited:

Arthur Samuel (1959):

"ML is the field of study that gives computers the ability to learn without being explicitly programmed."

Tom Mitchell (1997):

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

(Tom has apparently been asked whether he just wrote this definition to rhyme.)

Mitchell's is the most rigorous. To make it concrete, apply it to the checkers example:

  • T = playing checkers
  • E = playing many games against itself
  • P = probability of winning the next game

And to spam filtering:

  • T = classifying emails as spam or not
  • E = watching you label emails as spam/not spam
  • P = fraction of emails correctly classified
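Mitchell's P for the spam example is nothing more exotic than an accuracy computation. A minimal sketch in Python — the labels and predictions below are made-up illustrative data, not from any real filter:

```python
# Performance measure P for the spam filter: fraction of emails
# classified correctly. The labels here are illustrative only.
true_labels = ["spam", "not spam", "spam", "not spam", "not spam"]
predictions = ["spam", "not spam", "not spam", "not spam", "not spam"]

correct = sum(t == p for t, p in zip(true_labels, predictions))
P = correct / len(true_labels)  # 4 of 5 correct -> 0.8
```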

[!NOTE] What "AI" actually means in this course (skipping the philosophy): most AI research today is not trying to recreate general intelligence. It is automating task execution at a scale and speed not humanly possible. Facial recognition is a good example — not smarter than a human in any general sense, but it can match hundreds of millions of faces in seconds. The keyword is integration of AI with human intelligence, not replacement.


Why Now? (brief)

The math behind ML is decades old. So why the explosion only recently? Three things converged:

  1. Big Data — internet-scale datasets that simply didn't exist before
  2. Hardware — GPUs that can do the massive parallel computation ML requires
  3. Cloud computing — access to that compute without owning a supercomputer (though you do pay for it)

The canonical turning point: AlexNet (2012) — a deep network that demolished the competition on ImageNet and made the entire community realise the era of deep learning had arrived.


Types of ML (brief — dedicated lectures later)

| Type | What you give the model | What it does | Example |
| --- | --- | --- | --- |
| Supervised | Labeled $(x, y)$ pairs | Learns mapping $x \to y$ | Spam filter, tumour classification |
| Unsupervised | Only $x$, no labels | Finds structure in data | Clustering patients by gene expression |
| Reinforcement | Agent + reward signal | Learns by trial and error | Robotic goalkeeper |

The course is primarily supervised learning, with some unsupervised learning; as for reinforcement learning — "we will see."


Supervised Learning in Detail

In supervised learning, we already know the correct output for each training example. We give the algorithm both the inputs and the right answers — we are supervising it.

There are two sub-types and it is worth being precise about which is which, because the math differs:

Regression — the output is continuous, something you could place on a number line. House prices are the classic example: you input size, location, number of rooms, and you want a predicted price in euros. The output could be any real number.

Linear Regression

Classification — the output is discrete, one of a fixed set of categories. Is this tumour malignant or benign? The output is 0 or 1. There is no "halfway between malignant and benign" — it's one or the other.


Why more features help — and when they don't

In the tumour example: if you only have tumour size, the algorithm draws a 1D threshold — everything above size $X$ is malignant, below is benign. If you add patient age, now you have a 2D plane and the boundary becomes a curve. If you add thickness, cell uniformity, cell shape, the algorithm can use all of them simultaneously to draw a much more nuanced boundary.

This is the real power of ML — it scales to as many features as you can give it. But there is a catch: more features require more data to learn from reliably. With 10 features and only 20 examples, the model has too much freedom and not enough signal. The features and the data have to grow together.
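The catch can be made concrete with a toy sketch, using made-up numbers: a line has two parameters, so with only two training examples it fits any outputs exactly — signal or pure noise — and a perfect training fit then tells you nothing.

```python
# With as many parameters as examples, a perfect fit is guaranteed
# regardless of whether the data contains any real signal.
def fit_line_through(p1, p2):
    """Exactly solve theta_0 + theta_1 * x through two points."""
    (x1, y1), (x2, y2) = p1, p2
    theta_1 = (y2 - y1) / (x2 - x1)
    theta_0 = y1 - theta_1 * x1
    return theta_0, theta_1

# Two arbitrary "noisy" points with no underlying pattern:
theta_0, theta_1 = fit_line_through((1.0, 7.3), (2.0, -4.1))

# Training error is exactly zero on both points...
assert abs(theta_0 + theta_1 * 1.0 - 7.3) < 1e-9
assert abs(theta_0 + theta_1 * 2.0 - (-4.1)) < 1e-9
# ...so zero training error proves nothing when parameters ~ examples.
```

The same effect scales up: 10 features and 20 examples leave almost as much freedom, which is why features and data have to grow together.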


Unsupervised Learning in Detail

The difference from supervised learning is simple but profound: you have no labels. Nobody has told the algorithm what the right answer is. You just hand it the data and ask "can you find any structure in here?"

Clustering is the most common type — grouping similar points together. Some examples of where this is actually used:

  • Google News automatically groups articles about the same story, without anyone manually tagging them
  • Given DNA microarray data measuring gene expression across patients, a clustering algorithm groups patients into categories — without knowing anything about their age, gender, or diagnosis. The algorithm finds the groups; a biologist then figures out what those groups mean
  • Market segmentation: hand a clustering algorithm your customer database and it will find natural groupings — the algorithm discovers the segments, you then name them

The key point about unsupervised learning is this: you don't know what the categories mean until after you find them. The algorithm gives you "Group 1" and "Group 2". What those groups represent biologically, commercially, or physically — that interpretation belongs to a human with domain knowledge.

Clustering

Non-clustering is a separate category — finding structure that isn't about grouping. The cocktail party problem is the canonical example: two people speaking simultaneously in a room, two microphones placed at different positions, and the goal is to separate the two voices from the two mixed recordings. Humans do this effortlessly and unconsciously. For machines it is genuinely hard, and it is an unsupervised problem because there are no labels — nobody has pre-tagged which audio belongs to which speaker.


The ML Modelling Pipeline

Before diving into the math, it is worth having a clear picture of the overall workflow, because all the notation and formulas that follow are just precise ways of describing steps in this pipeline.

The idea is simple. You start with a training set — a collection of examples where you know both the input and the correct output. You feed that to a learning algorithm, whose job is to produce a hypothesis $h$. The hypothesis is just a function — it takes a new input $x$ and produces a predicted output $\hat{y}$.

Training set → Learning Algorithm → Hypothesis h

Then at inference time — when you actually want to use the model:

New x → h → Predicted ŷ

Everything in the rest of this course is about: what form should $h$ take? How do we measure if $h$ is good? How do we find the best $h$?
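The pipeline can be sketched in a few lines of Python. The "learning algorithm" here is a deliberately trivial placeholder (predict the mean of the training outputs) and the housing numbers are illustrative — real algorithms come later, but the shape of the workflow is exactly this:

```python
# Training set: (input, correct output) pairs -- sizes and prices (illustrative).
training_set = [(2104, 460_000), (1416, 232_000), (1534, 315_000)]

def learning_algorithm(data):
    """Placeholder learner: returns a hypothesis h that always
    predicts the mean training price, ignoring x entirely."""
    mean_y = sum(y for _, y in data) / len(data)
    def h(x):
        return mean_y
    return h

h = learning_algorithm(training_set)   # Training set -> Learning Algorithm -> h
prediction = h(1800)                   # New x -> h -> predicted y-hat
```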


Notation

You will see this notation in every ML paper, every lecture, every textbook. It is worth committing to memory now so it never slows you down later.

| Symbol | Meaning |
| --- | --- |
| $m$ | Number of training examples |
| $x$ | Input variable / feature |
| $y$ | Output variable / target |
| $(x^{(i)}, y^{(i)})$ | The $i$-th training example |
| $h_\theta(x)$ | The hypothesis — the function the model learns |
| $\theta_0, \theta_1$ | Parameters of the model |

[!WARNING] $(x^{(i)}, y^{(i)})$ means the $i$-th example — it is not $x$ to the power of $i$. The superscript in parentheses is always an index into the training set.

For the housing dataset: $x^{(1)}$ might be 2104 sqft with $y^{(1)} = 460{,}000$ euros. $x^{(2)}$ is the second house. $m$ is however many rows you have in total.
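In code, the superscript-in-parentheses is just a row index (1-based in the math, 0-based in Python). With hypothetical housing rows:

```python
# Hypothetical housing data: each row is one training example (x^(i), y^(i)).
data = [(2104, 460_000), (1416, 232_000), (1534, 315_000)]

m = len(data)       # m = number of training examples
x1, y1 = data[0]    # the math's (x^(1), y^(1)) is Python's data[0]
```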


The Hypothesis

Now that we have notation, we can write down the simplest possible hypothesis. For the housing price example — one input variable (size), one output (price) — we guess that the relationship is approximately linear:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

This is a straight line. $\theta_0$ is the intercept (where the line crosses the $y$-axis) and $\theta_1$ is the slope (how much price increases per unit of size). The $\theta$ values are the parameters — the numbers the algorithm will learn from data.

This specific model has a name: univariate linear regression.

  • Univariate — one input variable $x$
  • Linear — the hypothesis is a linear function of $x$
  • Regression — the output is continuous
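The hypothesis is a single line of code. The parameter values below are arbitrary placeholders, not learned values:

```python
def h(x, theta_0, theta_1):
    """Univariate linear hypothesis: h_theta(x) = theta_0 + theta_1 * x."""
    return theta_0 + theta_1 * x

# With arbitrary (not yet learned) parameters:
price = h(2104, 50_000, 150.0)  # intercept 50,000, slope 150 per sqft
```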

The question this immediately raises is: there are infinitely many possible lines. Which one is best? That is precisely what the cost function answers.


The Cost Function $J$

To choose the best $\theta_0$ and $\theta_1$, we need a way to measure how good or bad any particular choice is. The natural idea: a good hypothesis is one where the predictions $h_\theta(x^{(i)})$ are close to the true values $y^{(i)}$ across all $m$ training examples.

The gap between prediction and truth on a single example is the error:

$$h_\theta(x^{(i)}) - y^{(i)}$$

We want this to be small for every example. But how do we turn $m$ individual errors into one single number we can minimise? We need to aggregate them somehow.

The naive idea — just sum the raw errors — has a fatal flaw: a prediction that is $+5$ too high on one example and $-5$ too low on another would sum to zero, looking perfect when it is clearly not. The errors cancel each other out.

The fix: square each error before summing. Squaring makes every term positive, so nothing can cancel. It also penalises large errors disproportionately — being off by 10 costs 100, while being off by 1 costs only 1. The model is pushed hard to avoid big mistakes.

Putting this together, the squared error cost function is:

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

The $\frac{1}{m}$ averages over all examples so $J$ doesn't artificially grow just because you have a larger dataset. The $\frac{1}{2}$ is pure mathematical convenience — when you differentiate later, the exponent 2 comes down and cancels with it, making the gradient formula cleaner. It has no effect on which $\theta$ minimises $J$.
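The cost function translates directly into code. This sketch also demonstrates the cancellation problem on two made-up data points where the raw errors sum to zero while the squared cost does not:

```python
def J(theta_0, theta_1, data):
    """Squared error cost: (1 / 2m) * sum of squared prediction errors."""
    m = len(data)
    total = sum((theta_0 + theta_1 * x - y) ** 2 for x, y in data)
    return total / (2 * m)

# Made-up data where raw errors cancel for theta = (0, 1):
data = [(1, 6), (2, -3)]                   # errors: 1 - 6 = -5 and 2 - (-3) = +5
raw = sum((0 + 1 * x - y) for x, y in data)
assert raw == 0                            # looks "perfect" -- clearly wrong
assert J(0, 1, data) == 12.5               # squared cost exposes it: (25 + 25) / 4
```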

The goal is now precisely stated:

$$\min_{\theta_0,\, \theta_1}\ J(\theta_0, \theta_1)$$

Two completely different spaces

This is the conceptual point, and it is worth pausing on.

$h_\theta(x)$ and $J(\theta_0, \theta_1)$ are different kinds of objects living in different spaces.

$h$ lives in data space. Given fixed $\theta$ values, $h$ is a function of $x$ — it draws a line through your data plot. Change $\theta$ and the line moves.

$J$ lives in parameter space. Given fixed data, $J$ is a function of $\theta_0$ and $\theta_1$ — it takes a point in the $(\theta_0, \theta_1)$ plane and returns a single number representing how bad that choice of parameters is. Every possible line you could draw through the data corresponds to exactly one point on the $J$ surface.

Minimising $J$ means finding the point in parameter space where the surface is lowest — which corresponds to the line in data space that best fits the training data.

Building intuition: one parameter first

To see this clearly, simplify temporarily. Set $\theta_0 = 0$, forcing the line through the origin: $h_\theta(x) = \theta_1 x$. Now $J$ depends on only one number, $\theta_1$, and we can plot it.

Take three training examples: $(1,1)$, $(2,2)$, $(3,3)$ — points that lie perfectly on $y = x$.

| $\theta_1$ | What the line looks like | $J(\theta_1)$ |
| --- | --- | --- |
| 1.0 | Passes through all three points | 0 |
| 0.5 | Undershoots everything | $\approx 0.58$ |
| 0.0 | Flat line at zero | $\approx 2.33$ |
| 2.0 | Overshoots everything | $\approx 2.33$ |

Plot these points and you get a parabola — $J$ is zero at $\theta_1 = 1$ (perfect fit) and grows as $\theta_1$ moves away from 1 in either direction. The minimum of the parabola corresponds exactly to the best-fit line. This is not a coincidence — it is always the case.
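The table's values can be checked directly, using the same three points and the lecture's $J$ with $\theta_0$ fixed at 0:

```python
# Verify the one-parameter cost values for the points (1,1), (2,2), (3,3).
data = [(1, 1), (2, 2), (3, 3)]

def J(theta_1):
    m = len(data)
    return sum((theta_1 * x - y) ** 2 for x, y in data) / (2 * m)

assert J(1.0) == 0.0                  # perfect fit
assert abs(J(0.5) - 0.5833) < 1e-3    # undershoot, ~0.58
assert abs(J(0.0) - 2.3333) < 1e-3    # flat line, ~2.33
assert J(0.0) == J(2.0)               # overshoot costs the same, by symmetry
```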

With both parameters: the contour plot

Restore $\theta_0$ and now $J(\theta_0, \theta_1)$ is a surface in three dimensions — a bowl shape (technically a paraboloid). We cannot easily see a 3D surface, so we use a contour plot: slice the bowl horizontally at different heights and project the resulting ovals down onto the $(\theta_0, \theta_1)$ plane.

Every oval is a set of $(\theta_0, \theta_1)$ pairs that all give the same cost $J$. The outer ovals are high cost — bad fits. The inner ovals are lower cost — better fits. The very centre of the innermost oval is the minimum — the globally best $\theta_0$ and $\theta_1$.

[!MATH] For univariate linear regression, the cost surface is a convex paraboloid — it has exactly one global minimum and no local minima at all. Any method that walks downhill will always reach the same optimal solution regardless of where it starts. This is a special property of linear regression. Deep neural networks have highly non-convex loss surfaces with many local minima and saddle points — far harder to optimise.
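A crude way to see the bowl, assuming nothing beyond the definitions above: evaluate $J$ on a grid of $(\theta_0, \theta_1)$ values and locate the lowest grid point. For the three points lying exactly on $y = x$, that minimum sits at $(0, 1)$:

```python
# Evaluate the cost surface on a coarse grid and locate its minimum.
data = [(1, 1), (2, 2), (3, 3)]

def J(theta_0, theta_1):
    m = len(data)
    return sum((theta_0 + theta_1 * x - y) ** 2 for x, y in data) / (2 * m)

# Grid over theta_0, theta_1 in [-2, 2] with step 0.1:
grid = [(t0 / 10, t1 / 10) for t0 in range(-20, 21) for t1 in range(-20, 21)]
best = min(grid, key=lambda t: J(*t))
assert best == (0.0, 1.0)  # the unique bottom of the bowl: J(0, 1) = 0
```

Gradient descent (next lecture) replaces this brute-force scan with a walk downhill — which, on a convex surface like this one, always ends at the same point.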


Summary

The full pipeline introduced in this lecture, now with precise meaning attached to every step:

  1. Collect a training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^m$
  2. Choose a hypothesis family — for one variable, $h_\theta(x) = \theta_0 + \theta_1 x$
  3. Define a cost function $J(\theta_0, \theta_1)$ that quantifies how wrong $h$ is across all training examples
  4. Minimise $J$ to find the $\theta$ values that make $h$ fit the data as well as possible

Step 4 is the one we have not answered yet: we know what to minimise, but not how. That is the subject of Lecture 2 — Gradient Descent.