Classification in a Nutshell

Classification is the art of drawing boundaries. You take messy, high-dimensional data and force it into neat categories.

In the wild, this could be spam vs. not-spam, cat vs. dog, tumor vs. healthy tissue.

In textbooks, it’s often MNIST, the 70,000-image dataset of handwritten digits that’s become the “Hello World” of machine learning.

MNIST looks simple, but it hides the essence of classification:

  • Inputs: images, each a 28×28 grid of pixels → vectors in $\mathbb{R}^{784}$.
  • Outputs: 10 possible digits (0–9).
  • Goal: learn a function $f: \mathbb{R}^{784} \to \{0, 1, \dots, 9\}$.

That’s it. Strip away the hype, and classification is about learning the function that maps features to labels.
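A minimal sketch of that setup in NumPy, with a random array standing in for one MNIST image (the shapes are the point here, not the data):

```python
import numpy as np

# A stand-in for one MNIST image: a 28x28 grid of grayscale pixel intensities.
image = np.random.rand(28, 28)

# Flatten the grid into a vector in R^784 -- the input the classifier f actually sees.
x = image.reshape(-1)
print(x.shape)  # (784,)

# The classifier's job: map that 784-dimensional vector to one of these labels.
labels = list(range(10))  # 0-9
```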


Step 1: The Idea of Decision Boundaries

Think of classification as drawing walls in a huge room. Each wall splits the space into regions: “everything on this side is a 3, everything on that side is a 7.”

Mathematically, the simplest wall is linear:

$$f(\mathbf{x}) = \text{sign}(\mathbf{w}^\top \mathbf{x} + b)$$

where $\mathbf{x}$ is your input (a flattened image), $\mathbf{w}$ is a weight vector, and $b$ is a bias.

If $f(\mathbf{x}) = +1$, you say “class A.” If it’s $-1$, you say “class B.”

This is binary classification. MNIST is harder. It’s 10-way classification. But the principle holds: learn a set of boundaries that carve up the space of digits.
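As a rough sketch, here is that linear rule in NumPy, with made-up weights and a 2D input so the “wall” is easy to picture (MNIST would use 784 dimensions, and the weights would come from training):

```python
import numpy as np

def linear_classify(x, w, b):
    """Binary decision rule: +1 ("class A") on one side of the wall, -1 ("class B") on the other."""
    return np.sign(w @ x + b)

w = np.array([1.0, -2.0])   # orientation of the wall
b = 0.5                     # how far the wall is shifted from the origin
x = np.array([3.0, 1.0])    # one input point

print(linear_classify(x, w, b))  # 1.0 -> "class A"
```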


Step 2: Probabilities, Not Just Boundaries

Hard decisions are brittle. Instead of only predicting “3” or “7,” we want a probability distribution over all 10 classes.

Enter softmax regression (multi-class logistic regression):

$$P(y = k \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_k^\top \mathbf{x} + b_k)}{\sum_{j=0}^{9} \exp(\mathbf{w}_j^\top \mathbf{x} + b_j)}$$

Each class $k$ gets a score. Exponentiate, normalize, and you’ve got probabilities.
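A small sketch of that exponentiate-and-normalize step, using arbitrary scores (logits) for the 10 digit classes:

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; it cancels out in the ratio.
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

# Hypothetical scores w_k^T x + b_k for the 10 classes of one image.
scores = np.array([1.2, 0.3, 0.1, 4.5, 0.0, 0.2, 0.7, 2.1, 0.4, 0.1])
probs = softmax(scores)

print(probs.sum())     # 1.0 -- a valid probability distribution
print(probs.argmax())  # 3   -- the most likely digit under these scores
```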


Step 3: Learning from Mistakes

How do we tune those weights? By minimizing a loss. The gold standard is cross-entropy loss:

$$\mathcal{L} = -\sum_{i=1}^{N} \log P(y^{(i)} \mid \mathbf{x}^{(i)})$$

where $(\mathbf{x}^{(i)}, y^{(i)})$ are your training examples.

The loss punishes confident wrong predictions and rewards confident correct ones.
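To see that concretely, here is a sketch comparing the per-example loss for a confident correct prediction versus a confident wrong one (made-up probabilities over 5 classes for brevity):

```python
import numpy as np

def cross_entropy(probs, true_label):
    # Per-example loss: negative log-probability assigned to the true class.
    return -np.log(probs[true_label])

confident_right = np.array([0.01, 0.01, 0.95, 0.01, 0.02])  # true label is 2
confident_wrong = np.array([0.95, 0.01, 0.01, 0.01, 0.02])  # true label is still 2

print(cross_entropy(confident_right, 2))  # ~0.05 -- small loss
print(cross_entropy(confident_wrong, 2))  # ~4.6  -- large loss
```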

Optimization is just gradient descent:

$$\mathbf{w} \gets \mathbf{w} - \eta \, \nabla_{\mathbf{w}} \mathcal{L}$$

with $\eta$ as the learning rate.
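Putting the loss and the update rule together, here is a sketch of one gradient descent step for softmax regression on a random toy batch (random data, so the numbers are meaningless; the shapes and the update are the point):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch: 32 "images" flattened to 784 features, with random labels 0-9.
X = rng.random((32, 784))
y = rng.integers(0, 10, size=32)

W = np.zeros((784, 10))   # one weight column per class
b = np.zeros(10)
eta = 0.1                 # learning rate

# Forward pass: scores -> softmax probabilities.
scores = X @ W + b
scores -= scores.max(axis=1, keepdims=True)  # numerical stability
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Gradient of the cross-entropy loss for softmax regression:
# (probabilities minus one-hot labels), averaged over the batch.
one_hot = np.eye(10)[y]
grad_W = X.T @ (probs - one_hot) / len(y)
grad_b = (probs - one_hot).mean(axis=0)

# One step downhill.
W -= eta * grad_W
b -= eta * grad_b
```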


Step 4: Why Neural Nets Beat Logistic Regression

On MNIST, a plain softmax regression gets ~92% accuracy. Not bad.

But if you stack layers of nonlinear functions:

$$h = \sigma(W_1 \mathbf{x} + b_1), \quad \hat{y} = \text{softmax}(W_2 h + b_2)$$

you unlock much richer decision boundaries.

Neural nets can bend and twist the “walls” in ways linear models never can.

Convolutional neural nets (CNNs) go further by exploiting image structure. That’s how they push MNIST accuracy past 99%.
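As a sketch, here is the forward pass from that formula in NumPy: one hidden layer of 128 units with a sigmoid nonlinearity, then softmax over the 10 digits. The weights are random stand-ins for trained parameters, so the output only shows the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()  # numerical stability
    return np.exp(z) / np.exp(z).sum()

# Random weights standing in for trained parameters.
W1, b1 = rng.normal(0, 0.01, (128, 784)), np.zeros(128)
W2, b2 = rng.normal(0, 0.01, (10, 128)), np.zeros(10)

x = rng.random(784)               # one flattened "image"
h = sigmoid(W1 @ x + b1)          # hidden layer: bends the space nonlinearly
y_hat = softmax(W2 @ h + b2)      # output: distribution over the 10 digits

print(y_hat.argmax(), y_hat.sum())  # predicted digit, probabilities sum to 1
```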


Step 5: What MNIST Actually Teaches You

MNIST isn’t about handwritten digits. It’s a sandbox to learn the deep truths of classification:

  • Every problem is about separating regions in feature space.
  • Probabilities > hard labels.
  • Losses tell you how “wrong” you are.
  • Optimization is just moving weights downhill.
  • Deeper models = more flexible boundaries.

Once you grasp these, you can swap MNIST for anything: medical scans, stock movements, audio signals. The math doesn’t change.


In a Nutshell

Classification = boundaries, probabilities, and losses.

MNIST is just the training wheels.

The real game is scaling this logic to data messier than digits scribbled on paper.
