Knowledge Distillation: Fit a Big Model's Smarts Into a Small One

#ai #llm #machinelearning #beginners

🎓 Distill a model yourself (real softmax + real gradient descent): https://dev48v.infy.uk/ai/days/day23-distillation.html

The problem: big models are expensive to serve

A state-of-the-art model can have billions of parameters, and every single prediction runs all of them. That means high latency, big GPU bills, and no chance of running in a browser or on a phone. But you rarely need the very best model — you need something 90% as good that answers in a fraction of the time and cost. Distillation is how you get there.

The trick: learn from soft labels, not hard ones

Here's the insight that makes it work. A normal training label is "hard": for an image of a dog, the target is dog = 1, wolf = 0, cat = 0. That one-hot vector tells the model the right answer but throws away something important — that a dog looks far more like a wolf than like a cat.

Now look at what a big teacher model actually outputs. Run its scores through a softmax and you get a whole probability distribution:

dog  0.70
wolf 0.25
cat  0.05

Even though the top answer is "dog", the runner-up probabilities carry information: dog and wolf are cousins, cat is a stranger. Geoffrey Hinton's team called this "dark knowledge": the structure hidden in the wrong-class probabilities. It's often the most valuable part of the signal, and hard labels erase it completely.

Temperature: a magnifying glass on the teacher's doubt

There's a catch: a confident teacher's softmax is spiky — something like (0.97, 0.02, 0.01) — so the dark knowledge is squashed into tiny numbers the student can barely learn from. The fix is a temperature knob. Before the softmax, divide the logits by T:

softmax(logits / T)

With T = 1 you get the normal spiky output. Crank T up and the distribution flattens, lifting those small wrong-class probabilities into a usable range. In the demo, the teacher's distribution goes from dog 85.8% / wolf 11.6% / cat 2.6% at T=1 to dog 39.6% / wolf 32.4% / cat 27.9% at T=10, and the entropy more than doubles. Same knowledge, now visible. The same T is applied to the student's logits during training so the two distributions are compared fairly; once training is done, the student goes back to T = 1 for real predictions.
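
Here is a minimal sketch of that temperature knob in plain Python. The logits (3.5, 1.5, 0.0) are an assumption: the demo's raw scores aren't given here, and these were picked only because they reproduce the percentages quoted above. The `softmax` and `entropy` helpers are illustrative names, not the demo's code.

```python
import math

def softmax(logits, T=1.0):
    """Softmax of logits / T, shifted by the max for numerical stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Assumed teacher logits for (dog, wolf, cat), chosen to match the quoted numbers.
teacher_logits = [3.5, 1.5, 0.0]

for T in (1.0, 10.0):
    probs = softmax(teacher_logits, T)
    percents = [round(100 * p, 1) for p in probs]
    print(f"T={T}: {percents} entropy={entropy(probs):.2f} nats")
# T=1.0  gives [85.8, 11.6, 2.6]
# T=10.0 gives [39.6, 32.4, 27.9]
```

With these logits the entropy goes from about 0.48 nats at T=1 to about 1.09 nats at T=10, which is the "more than doubles" effect. Note that the ranking dog > wolf > cat never changes; only the contrast does.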

The distillation loss

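The student is trained so that its softened distribution matches the teacher's. The natural measure is KL divergence, with both distributions computed at the same temperature T. The soft term is multiplied by T² because its gradients shrink roughly as 1/T², so the scaling keeps it in balance with the hard term when you change T. It is usually blended with a little ordinary cross-entropy on the true label so the student stays anchored to ground truth: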

loss = α · KL(teacher ‖ student) · T²  +  (1 − α) · cross_entropy(hard_label, student)

The soft term makes the student mimic the teacher; the hard term keeps it honest if the teacher is ever wrong.
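
The formula above can be sketched for a single example like this. It is a rough illustration, not a reference implementation: the defaults T = 4 and α = 0.9 are assumed values, the function names are made up for this post, and computing the hard-label term at T = 1 is a common convention that the formula itself doesn't spell out.

```python
import math

def softmax(logits, T=1.0):
    """Softmax of logits / T, shifted by the max for numerical stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, hard_label, T=4.0, alpha=0.9):
    """alpha * KL(teacher || student) * T^2 + (1 - alpha) * cross_entropy."""
    # Soft term: both distributions are softened with the same temperature.
    p_teacher = softmax(teacher_logits, T)
    q_student = softmax(student_logits, T)
    soft_term = kl_divergence(p_teacher, q_student) * T * T
    # Hard term: ordinary cross-entropy on the true class, at T = 1 (assumption).
    hard_term = -math.log(softmax(student_logits)[hard_label])
    return alpha * soft_term + (1 - alpha) * hard_term

teacher = [3.5, 1.5, 0.0]            # assumed logits for (dog, wolf, cat)
good_student = [3.0, 1.2, 0.1]       # hypothetical student that ranks like the teacher
confused_student = [0.0, 1.5, 3.5]   # hypothetical student that thinks "cat"

print(distillation_loss(teacher, good_student, hard_label=0))
print(distillation_loss(teacher, confused_student, hard_label=0))
```

The second loss should come out much larger than the first, because the confused student disagrees with both the teacher's distribution and the true label. Setting `alpha=1.0` gives pure imitation of the teacher, and `alpha=0.0` falls back to ordinary supervised training.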

Why soft targets teach more

A hard label only names the correct class. For K classes that is at most log2(K) bits per example, and it says nothing about how the wrong classes relate to each other. A soft target gives a whole vector of graded feedback across every class, which is a much denser learning signal. In practice the student often converges faster, needs less data, and learns smoother decision boundaries.
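
One way to see the difference is to look at the gradient each target sends back. For softmax with cross-entropy at T = 1, the gradient with respect to the student's logits is simply (student probabilities − target). The sketch below uses the teacher distribution from the dog/wolf/cat example and a made-up student that gets the top answer right but ranks cat above wolf; the function name and the student's numbers are hypothetical.

```python
def logit_gradient(student_probs, target):
    """Gradient of cross-entropy w.r.t. the student's logits (softmax, T = 1)."""
    return [q - t for q, t in zip(student_probs, target)]

student = [0.70, 0.05, 0.25]   # hypothetical student: dog on top, but cat above wolf
hard = [1.0, 0.0, 0.0]         # one-hot label for "dog"
soft = [0.70, 0.25, 0.05]      # teacher's soft target from the example above

grad_hard = logit_gradient(student, hard)
grad_soft = logit_gradient(student, soft)

print([round(g, 2) for g in grad_hard])  # [-0.3, 0.05, 0.25]
print([round(g, 2) for g in grad_soft])  # [0.0, -0.2, 0.2]
```

Gradient descent moves each logit in the opposite direction of its gradient. The hard label just says "more dog, less of everything else". The soft target leaves dog alone and says "raise wolf, lower cat", which is exactly the similarity information the one-hot label threw away.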

The demo makes this concrete with a real tiny softmax classifier trained by actual gradient descent on 2-D points. Train it on the teacher's soft targets and it hits 100% accuracy and its full distribution lines up with the teacher's (100% agreement — it inherited the dark knowledge). Train it on hard labels only and it still gets 100% accuracy, but its agreement with the teacher sits at just 71% — it learned the answer, not the nuance. Same tiny model, more knowledge transferred, for free.
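
If you want to try the same experiment offline, here is a rough sketch in plain Python. It is not the demo's code. The teacher weights, the 40 random 2-D points, T = 2, the learning rate, and the step count are all assumptions made for illustration, so don't expect the exact 100% / 71% figures above. It measures how closely each student matches the teacher with mean KL divergence at T = 1.

```python
import math
import random

def softmax(logits, T=1.0):
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def logits_for(weights, x):
    # One row per class: [w_x, w_y, bias].
    return [w[0] * x[0] + w[1] * x[1] + w[2] for w in weights]

def train_student(points, targets, T=1.0, lr=0.5, steps=1000):
    """Full-batch gradient descent on T^2 * cross_entropy(target, softmax(z / T))."""
    weights = [[0.0, 0.0, 0.0] for _ in range(3)]
    for _ in range(steps):
        grads = [[0.0, 0.0, 0.0] for _ in range(3)]
        for x, target in zip(points, targets):
            probs = softmax(logits_for(weights, x), T)
            for k in range(3):
                # d(loss) / d(logit_k) = T * (student_prob_k - target_k)
                g = T * (probs[k] - target[k])
                grads[k][0] += g * x[0]
                grads[k][1] += g * x[1]
                grads[k][2] += g
        for k in range(3):
            for j in range(3):
                weights[k][j] -= lr * grads[k][j] / len(points)
    return weights

def mean_kl_to_teacher(student, teacher, points):
    """Average KL(teacher || student) over the points, both at T = 1."""
    total = 0.0
    for x in points:
        p = softmax(logits_for(teacher, x))
        q = softmax(logits_for(student, x))
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return total / len(points)

rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(40)]

# Hypothetical "teacher": a fixed linear softmax model over three classes.
teacher = [[2.0, 0.0, 0.0], [-1.0, 1.5, 0.0], [-1.0, -1.5, 0.0]]

T = 2.0
soft_targets = [softmax(logits_for(teacher, x), T) for x in points]
hard_targets = []
for x in points:
    z = logits_for(teacher, x)
    best = z.index(max(z))
    hard_targets.append([1.0 if k == best else 0.0 for k in range(3)])

soft_student = train_student(points, soft_targets, T=T)
hard_student = train_student(points, hard_targets, T=1.0)

kl_soft = mean_kl_to_teacher(soft_student, teacher, points)
kl_hard = mean_kl_to_teacher(hard_student, teacher, points)
print("KL to teacher, soft targets:", kl_soft)
print("KL to teacher, hard labels: ", kl_hard)
```

Under these assumptions the soft-target student should end up with a KL to the teacher very close to zero, while the hard-label student ends up noticeably further away: it learns where the boundaries are, but with its own (usually much spikier) confidence. Because the student here has the same shape as the teacher, this is an easy case; a genuinely smaller student would only approximate the teacher.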

The payoff, and the limits

A well-distilled student keeps most of the teacher's accuracy while being much smaller and faster. DistilBERT famously kept about 97% of BERT's language-understanding performance while being ~40% smaller and ~60% faster. TinyBERT and DistilGPT2 were built on the same idea, and many mobile vision models and the "mini" / "flash" tiers of frontier LLMs are widely reported to lean on it too.

But be honest about the limits: a student rarely beats its teacher, it inherits the teacher's biases and mistakes, and if you shrink it too far quality collapses. Distillation compresses knowledge; it doesn't create it.

🔨 The whole loop — teacher soft labels, temperature slider, and a student trained by real gradient descent — runs on the page: https://dev48v.infy.uk/ai/days/day23-distillation.html

Part of AIFromZero. 🌐 https://dev48v.infy.uk