Lecture · L04

Logistic Regression


A New Kind of Problem

Everything so far has been about predicting a real-valued output. Given house features, predict the price — a number that can be anything from zero to millions. The output $y$ lives on a continuous scale.

Classification is different. The output $y$ can only take a small number of discrete values. Some examples from the lecture:

  • Email: is this message spam ($y=1$) or not spam ($y=0$)?
  • Oncology: is this tumour malignant ($y=1$) or benign ($y=0$)?
  • Finance: is this transaction fraudulent ($y=1$) or legitimate ($y=0$)?

What all of these share is that $y \in \{0, 1\}$ — two possible outcomes, nothing in between, nothing outside. This is the binary classification problem. The value 0 is called the negative class and 1 is called the positive class. The assignment is somewhat arbitrary — by convention, the positive class usually represents the presence of something (spam, malignancy, fraud) and the negative class its absence, but what matters is consistency, not which label you attach to which outcome.


Why Not Just Use Linear Regression?

The most natural first attempt: take linear regression as we know it and apply it to a classification problem. Predict a number, and then threshold it — if the prediction is $\geq 0.5$, say $y = 1$; if it is $< 0.5$, say $y = 0$.

This can appear to work on simple examples. If you fit a line to a dataset of tumour sizes labelled 0 or 1, the line might cross 0.5 at a reasonable decision point. But the approach is fundamentally fragile.

The problem becomes clear the moment you add new data far to the right of the existing cluster — a very large tumour, for instance. The best-fit line tilts to accommodate the new point, and its crossing of 0.5 shifts. Cases that were correctly classified as malignant are now below the threshold and get classified as benign. The model got worse by seeing more data, which is exactly the wrong behaviour for a learning algorithm.

There is also a conceptual mismatch: linear regression can output any real number, including values well below 0 and well above 1. But for a binary classification problem the output should represent something like a probability — and probabilities must live in $[0, 1]$. A hypothesis that predicts $h_\theta(x) = 3.7$ or $h_\theta(x) = -0.2$ for a binary problem is outputting nonsense.

These two failures together — sensitivity to outliers and unconstrained output range — motivate a different approach.


The Logistic Function

What we need is a hypothesis that always outputs a value between 0 and 1, regardless of the input. The function that achieves this is the sigmoid function, also called the logistic function:

$$g(z) = \frac{1}{1 + e^{-z}}$$

Its properties are exactly what we need. As $z \to +\infty$, $e^{-z} \to 0$ and $g(z) \to 1$. As $z \to -\infty$, $e^{-z} \to \infty$ and $g(z) \to 0$. At $z = 0$, $g(0) = \frac{1}{1+1} = 0.5$. The function is smooth, differentiable everywhere, and always strictly between 0 and 1. Its shape is an S-curve — flat near the extremes, steep in the middle.
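These properties are easy to check numerically. A minimal sketch in Python (using NumPy, an assumption of this sketch rather than anything from the lecture):

```python
import numpy as np

def sigmoid(z):
    """The logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5 exactly: the midpoint of the S-curve
print(sigmoid(10.0))   # ~0.99995: approaches 1 as z grows
print(sigmoid(-10.0))  # ~4.5e-5: approaches 0 as z shrinks
```

A useful consequence of the shape, visible here, is the symmetry $g(-z) = 1 - g(z)$.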


The Logistic Regression Hypothesis

The logistic regression hypothesis is formed by plugging the linear combination $\theta^T x$ — the same expression we used in linear regression — into the sigmoid function:

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

The inner part $\theta^T x$ can take any real value, just as before. But wrapping it in $g(\cdot)$ squashes the output into $(0, 1)$. No matter what the features say, the output of the hypothesis is always a valid number between 0 and 1.

How to interpret this output. The number $h_\theta(x)$ is interpreted as the estimated probability that $y = 1$ given the input $x$, under the parameters $\theta$:

$$h_\theta(x) = P(y = 1 \mid x;\, \theta)$$

If $h_\theta(x) = 0.7$ for some patient's tumour data, the model is saying: given these features, there is a 70% probability that this tumour is malignant. Since the probabilities of the two classes must sum to 1:

$$P(y = 0 \mid x;\, \theta) = 1 - h_\theta(x)$$

This probabilistic interpretation is principled and useful — it tells you not just the predicted class but also the model's confidence.

Making a prediction. To get a hard 0/1 prediction, we threshold at 0.5:

$$\hat{y} = \begin{cases} 1 & \text{if } h_\theta(x) \geq 0.5 \\ 0 & \text{if } h_\theta(x) < 0.5 \end{cases}$$

Since $g(z) \geq 0.5$ if and only if $z \geq 0$, and $z = \theta^T x$, this threshold rule is equivalent to:

$$\hat{y} = 1 \iff \theta^T x \geq 0$$
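A quick sketch makes the equivalence concrete. The parameter values below are hand-picked for illustration, not fitted to any data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    """Hard 0/1 prediction by thresholding h_theta(x) at 0.5."""
    return 1 if sigmoid(theta @ x) >= 0.5 else 0

theta = np.array([-3.0, 1.0, 1.0])  # theta_0, theta_1, theta_2 (made up)

for x in (np.array([1.0, 2.0, 2.0]),   # theta^T x = 1    -> predict 1
          np.array([1.0, 0.5, 1.0])):  # theta^T x = -1.5 -> predict 0
    # Thresholding h at 0.5 is the same test as the sign of theta^T x:
    assert predict(theta, x) == (1 if theta @ x >= 0 else 0)
    print(predict(theta, x))
```

Note the leading 1.0 in each feature vector: it plays the role of $x_0 = 1$, the intercept term.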


The Decision Boundary

The decision boundary is the set of all points $x$ in feature space where the model is exactly on the fence — where $h_\theta(x) = 0.5$, or equivalently where $\theta^T x = 0$. It is the border separating the region where the model predicts $y = 1$ from the region where it predicts $y = 0$.

For a simple two-feature problem, $\theta^T x = 0$ defines a line in the $x_1$-$x_2$ plane. On one side the model says positive; on the other, negative.

Crucial point: the decision boundary is a property of the hypothesis — specifically, of the parameters $\theta$. It is not a property of the training data. The training data is used to learn the parameters, but once the parameters are fixed, the decision boundary is determined entirely by $\theta$, and it would apply equally to any dataset.

Linear Decision Boundaries

With the standard hypothesis $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$, the decision boundary is the line $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$. This is a straight line in feature space — hence a "linear" decision boundary.

Non-Linear Decision Boundaries

What if the data is not linearly separable? What if the positive class forms a cluster in the middle, surrounded by the negative class, and no straight line can separate them?

We use the same idea as polynomial regression: add higher-order features. If we define $h_\theta(x) = g(\theta_0 + \theta_1 x_1^2 + \theta_2 x_2^2)$, the decision boundary is $\theta_0 + \theta_1 x_1^2 + \theta_2 x_2^2 = 0$ — which is a circle in the $x_1$-$x_2$ plane. With more complex polynomial features, you can produce ellipses, parabolas, and increasingly complex curves.

The model is still logistic regression — the same algorithm, the same sigmoid function, the same training procedure. The nonlinearity comes from the feature engineering, not from changing anything fundamental about the algorithm. This is the same trick we saw with polynomial regression: keep the model structure linear in $\theta$, let the features carry the nonlinearity.
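A tiny sketch with hand-picked (not learned) parameters makes the circular boundary concrete; all values here are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hand-picked parameters: theta^T (1, x1^2, x2^2) = 1 - x1^2 - x2^2,
# so the decision boundary is the unit circle x1^2 + x2^2 = 1,
# with the positive class predicted inside it.
theta = np.array([1.0, -1.0, -1.0])

def h(x1, x2):
    features = np.array([1.0, x1**2, x2**2])  # engineered squared features
    return sigmoid(theta @ features)

print(h(0.0, 0.0))  # inside the circle: > 0.5, predict y = 1
print(h(2.0, 2.0))  # outside: < 0.5, predict y = 0
print(h(1.0, 0.0))  # exactly on the boundary: 0.5
```

The model itself is still linear in $\theta$; only the features were squared.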

[!NOTE] By choosing richer feature spaces, logistic regression can learn surprisingly complex decision boundaries. The practical question of how complex to make the feature space — and how to avoid overfitting when you do — is the subject of regularisation, which comes later in the course.


The Cost Function

We now need to fit the parameters $\theta$ — we need a cost function to minimise. The temptation is to reuse the squared error cost from linear regression:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

but with the new sigmoid hypothesis. This does not work. The reason is subtle but important.

With a linear $h$, $J(\theta)$ is a convex bowl — gradient descent finds the global minimum reliably. But with the sigmoid function inside $h$, the composition of the sigmoid with the squared error produces a non-convex function of $\theta$. The cost surface has many local minima, hills, and plateaus. Gradient descent can get stuck at a suboptimal local minimum and never find the global one. The algorithm would be fundamentally unreliable.

We need a different cost function — one that is convex when combined with the logistic hypothesis. We define the cost for a single training example as:

$$\text{cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

Understanding why these two cases are the right choice requires looking at each separately.

When $y = 1$: the cost is $-\log(h_\theta(x))$. If $h_\theta(x) = 1$ (the model is perfectly confident and correct), $-\log(1) = 0$ — zero cost. If $h_\theta(x) \to 0$ (the model is completely wrong, predicting near-zero probability for a case that is actually positive), $-\log(h_\theta(x)) \to \infty$ — unbounded cost. The function penalises wrong predictions with increasing severity as the prediction diverges from the truth, and the penalty grows without bound as the model becomes completely confident in the wrong direction.

When $y = 0$: the cost is $-\log(1 - h_\theta(x))$. If $h_\theta(x) = 0$ (correctly predicting zero probability for a negative case), $-\log(1) = 0$ — zero cost. If $h_\theta(x) \to 1$ (predicting near-certainty for a case that is actually negative), $-\log(1 - h_\theta(x)) \to \infty$ — unbounded cost again. The same structure: correct confident predictions cost nothing; wrong confident predictions cost arbitrarily much.

This cost function is called the binary cross-entropy or log loss. Composed with the sigmoid hypothesis, the logarithm is what gives the cost the right shape — the convexity that makes gradient descent reliable.


The Simplified Cost Function

Writing the cost as a two-case expression is correct but awkward to work with. There is a cleaner way to write exactly the same thing as a single formula:

$$\text{cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$$

Verify that this is identical to the two-case version. When $y = 1$: the second term vanishes (since $1 - y = 0$) and you are left with $-\log(h_\theta(x))$. When $y = 0$: the first term vanishes (since $y = 0$) and you are left with $-\log(1 - h_\theta(x))$. Same function, one formula.
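The same verification can be done numerically. A minimal sketch (function names are my own):

```python
import math

def cost_two_case(h, y):
    """The original two-case definition of the per-example cost."""
    return -math.log(h) if y == 1 else -math.log(1 - h)

def cost_single(h, y):
    """The single-formula version: -y log(h) - (1 - y) log(1 - h)."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

# The two definitions agree for both labels and any prediction in (0, 1)
for h in (0.01, 0.3, 0.5, 0.7, 0.99):
    for y in (0, 1):
        assert abs(cost_two_case(h, y) - cost_single(h, y)) < 1e-12

print(cost_single(0.9, 1))  # small cost: confident and correct
print(cost_single(0.9, 0))  # large cost: confident and wrong
```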

Summing over all $m$ training examples, the full cost function for logistic regression is:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(h_\theta(x^{(i)})\right) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

This is the standard logistic regression cost function. It is convex in $\theta$ — provable using the theory of convex functions, but we take that as a given here. The implication is that gradient descent will reliably find the global minimum, just as it does for linear regression.
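In vectorized form the full cost is a few lines. The sketch below assumes a design matrix `X` whose first column is all ones (the intercept feature); the dataset is made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta): average cross-entropy over the m training examples.
    X is m x (n+1) with a leading column of ones; y holds 0/1 labels."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

# Tiny made-up dataset
X = np.array([[1.0, 0.5], [1.0, 2.3], [1.0, -1.2]])
y = np.array([1.0, 1.0, 0.0])

# With theta = 0 the model predicts 0.5 everywhere, so J = -log(0.5) = log 2
print(cost(np.zeros(2), X, y))  # ~0.6931
```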

[!MATH] This cost function has a deeper justification beyond its convenient shape. It can be derived from the principle of maximum likelihood estimation: given the probabilistic model $P(y=1 \mid x;\theta) = h_\theta(x)$, the average log-likelihood of the training data is exactly $\frac{1}{m}\sum_i \left[ y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta(x^{(i)})) \right]$. Minimising $J(\theta)$ is equivalent to maximising this log-likelihood — finding the parameters that make the observed training data most probable under the model. This statistical grounding is why this particular cost function is used universally for logistic regression, and not some arbitrary alternative.

Compare the two cost functions side by side:

$$J_{\text{linear}}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

$$J_{\text{logistic}}(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

Different expressions, same underlying goal: a scalar measure of how badly the current parameters fit the training data, to be minimised.


Gradient Descent for Logistic Regression

With the cost function defined, we apply gradient descent to minimise it. The update rule is the same general form as always:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \qquad \text{simultaneously for all } j$$

Computing the partial derivative of the logistic cost with respect to $\theta_j$ requires the chain rule applied to the log and to the sigmoid — the calculation is a few lines of calculus. The result, after simplification, is:

$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}_j$$

The gradient descent update is therefore:

Repeat until convergence:

$$\theta_j := \theta_j - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}_j \qquad \text{for all } j = 0, 1, \ldots, n$$

Look at this carefully. It is identical in form to the gradient descent update for linear regression. The same expression $\frac{1}{m}\sum_i (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_j$ appears in both.

Does this mean linear regression and logistic regression are the same algorithm? No. The difference is hidden inside $h_\theta(x)$. For linear regression, $h_\theta(x) = \theta^T x$ — a linear function. For logistic regression, $h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$ — the sigmoid of a linear function. The update rules look the same on paper, but they are computing different things because the hypothesis is different.

This is actually a nice result — it means the same gradient descent code can handle both algorithms if the hypothesis function is parameterised correctly.
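A minimal batch implementation, sketched in Python with NumPy. The toy data, learning rate, and iteration count are my own choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=5000):
    """Batch gradient descent for logistic regression.
    X is m x (n+1) with a leading column of ones; y holds 0/1 labels."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)        # hypothesis on all m examples at once
        grad = X.T @ (h - y) / m      # (1/m) sum (h - y) x_j, for every j
        theta -= alpha * grad         # simultaneous update of all parameters
    return theta

# Toy separable data: label is 1 exactly when x1 > 1
X = np.array([[1.0, 0.0], [1.0, 0.5], [1.0, 1.5], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
print((sigmoid(X @ theta) >= 0.5).astype(int))  # [0 0 1 1]
```

Swapping `sigmoid(X @ theta)` for `X @ theta` turns this into the linear regression update, which is exactly the point made above.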

[!NOTE] Everything we said about gradient descent for linear regression applies here too. Update all parameters simultaneously. Monitor $J(\theta)$ over iterations — it should decrease monotonically. Apply feature scaling if features are on different scales — it speeds up convergence for logistic regression exactly as it does for linear regression. Choose $\alpha$ using the same factor-of-3 search strategy.


Multi-Class Classification

So far we have only handled binary classification — two classes, $y \in \{0, 1\}$. Many real problems have more than two classes: weather forecasting with four outcomes (sunny, rainy, snowy, windy), email foldering with five categories (home, work, family, hobby, spam), medical diagnosis with several possible conditions.

The approach for multi-class classification is called one-vs-all (also written one-vs-rest). The idea is simple and elegant: reduce the multi-class problem to a set of binary problems, one for each class.

Suppose you have $K$ classes. For each class $k \in \{1, 2, \ldots, K\}$:

  • Relabel the training data: examples belonging to class $k$ get label 1 (positive), all other examples get label 0 (negative)
  • Train a standard binary logistic regression classifier $h^{(k)}_\theta(x)$ on this relabelled data
  • This classifier learns to answer the question: "Is this example class $k$, or not?"

After training, you have $K$ separate classifiers, each specialised to detect one class versus everything else.

Making predictions on a new example $x$: run all $K$ classifiers and pick the class whose classifier outputs the highest probability:

$$\hat{y} = \arg\max_k\ h^{(k)}_\theta(x)$$

The class whose classifier is most confident that the example belongs to it wins.

The geometric picture: each binary classifier learns its own decision boundary, separating one class from the rest of the space. Together the $K$ boundaries partition the feature space into regions, one per class. A new point falls into the region where the corresponding classifier fires most strongly.
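The whole recipe fits in a short sketch. Everything below (dataset, learning rate, iteration count, function names) is made up for illustration, reusing the batch gradient descent update from earlier:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, alpha=0.5, iters=3000):
    """One binary logistic regression classifier, fit by gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * X.T @ (sigmoid(X @ theta) - y) / len(y)
    return theta

def one_vs_all(X, y, K):
    """Train K classifiers; classifier k sees the relabelled problem (y == k)."""
    return [train_binary(X, (y == k).astype(float)) for k in range(K)]

def predict(thetas, x):
    """Pick the class whose classifier is most confident."""
    return int(np.argmax([sigmoid(theta @ x) for theta in thetas]))

# Three well-separated clusters (intercept feature in column 0)
X = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 1.0],   # class 0
              [1.0, 4.0, 0.0], [1.0, 4.0, 1.0],   # class 1
              [1.0, 2.0, 4.0], [1.0, 2.0, 5.0]])  # class 2
y = np.array([0, 0, 1, 1, 2, 2])

thetas = one_vs_all(X, y, K=3)
print([predict(thetas, x) for x in X])  # recovers [0, 0, 1, 1, 2, 2]
```

Note that the three classifiers are trained independently of one another, which is exactly the limitation discussed below.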

[!NOTE] One-vs-all is a meta-algorithm — a way to extend any binary classifier to multiple classes. It works with logistic regression but also with SVMs and other models. Its main limitation is that the classifiers are trained independently: classifier 1 does not know what classifiers 2 and 3 are learning. More sophisticated multi-class approaches train jointly, but one-vs-all is the right starting point and performs well in practice.


Summary

Logistic regression is the foundational classification algorithm, and probably the most widely used single ML model in practice.

The key ideas in order:

Linear regression fails for classification because its output is unbounded and sensitive to outliers. We need a hypothesis that outputs values in $[0, 1]$.

The sigmoid function $g(z) = \frac{1}{1 + e^{-z}}$ achieves this. The logistic regression hypothesis is $h_\theta(x) = g(\theta^T x)$, interpreted as the probability that $y = 1$.

The decision boundary is the set of points where $\theta^T x = 0$. It separates the predicted positive and negative regions. With polynomial features it can be nonlinear.

The squared error cost function is non-convex when combined with the sigmoid — gradient descent would not reliably find the global minimum. The correct cost function is binary cross-entropy (log loss), which is convex and has a maximum likelihood justification.

The gradient descent update for logistic regression has the same algebraic form as for linear regression — only the hypothesis $h_\theta(x)$ differs.

Multi-class classification with $K > 2$ classes is handled by one-vs-all: train $K$ binary classifiers, one per class, and at prediction time pick the class with highest confidence.

Logistic regression is not the final word in classification — neural networks, SVMs, and tree-based methods all extend and improve upon it in different ways. But it is the starting point every practitioner needs to understand deeply, because the ideas here recur constantly: sigmoid outputs, cross-entropy loss, and probabilistic interpretation are all central to deep learning as well.