The Greed of Low Bias and Low Variance

#beginners #datascience #machinelearning #ai

Every model is a little bit greedy

I'll start with an example of two archers. The first one always aims slightly left of the bullseye.
Every single arrow, same mistake, same direction. The second one aims dead
center on average, but his hand shakes, so the arrows land all over the board.

Neither archer is "bad". They're both being greedy in their own way. The first archer is greedy for consistency. He has locked onto one strategy and won't budge, even though it's wrong. He wants the same result for the same input, even when that result is off target. The second archer is greedy for flexibility. He reacts to every gust of wind, every twitch, every tiny signal, even the ones that don't matter.

That's bias and variance. The maddening part of machine learning is that you
can't fix both at once. Reduce one, and the other usually grows. Once you
understand why it happens, and how we actually measure it, it stops being a
vague textbook warning and starts feeling like common sense.

Bias: the model that won't change its mind

Bias is systematic error. It's what happens when your model is too simple to capture the real pattern in the data. Think of a straight line trying to fit a curve. No matter how much data you throw at it, it keeps making the same kind of mistake, because the shape it's allowed to take is fundamentally wrong.
High-bias models underfit: they're stable, but stably wrong.
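
Here is a minimal sketch of that straight-line-on-a-curve situation. Everything in it is an assumption made for illustration: the "real pattern" is y = x² on [-1, 1], the noise level is 0.1, and `fit_line` and `miss_at_zero` are hypothetical helpers. It fits a least-squares line to more and more data and checks how far the line's prediction at x = 0 is from the true value of 0.

```python
import random

def fit_line(xs, ys):
    """Closed-form least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def miss_at_zero(n, rng):
    """Fit a line to n noisy samples of y = x**2 and return its error at x = 0."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [x ** 2 + rng.gauss(0, 0.1) for x in xs]
    intercept, _slope = fit_line(xs, ys)
    # The line predicts `intercept` at x = 0, and the true value there is 0.
    return intercept

rng = random.Random(0)
misses = {n: miss_at_zero(n, rng) for n in (100, 1_000, 10_000)}
for n, miss in misses.items():
    print(f"n={n:>6}  miss at x=0: {miss:+.3f}")
```

In this setup the best possible line is flat at about 1/3, so the miss should hover around that value however large n gets. More data sharpens the estimate of the wrong shape. It doesn't change the shape.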

Variance: the model that overreacts

Variance is instability. It's what happens when your model is so flexible that it doesn't just learn the underlying pattern, it learns the noise too. Train it on one sample of data and it draws one curve. Train it on a slightly different sample and it draws a wildly different curve. High-variance models overfit: they're accurate on the data they've seen, but unreliable everywhere else.
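
To make "a slightly different sample, a wildly different curve" concrete, here is a rough sketch under assumed conditions: the data is noisy y = x², the flexible model is 1-nearest-neighbour (it copies the label of the closest training point), and the rigid model always predicts the average label. Both models and the helper names are stand-ins picked for illustration.

```python
import random

def true_fn(x):
    return x ** 2  # assumed ground truth for this sketch

def sample(n, rng):
    """Draw n noisy (x, y) pairs from the assumed true function."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, true_fn(x) + rng.gauss(0, 0.3)) for x in xs]

def predict_1nn(train, x):
    """Flexible model: copy the label of the closest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def predict_mean(train, x):
    """Rigid model: ignore x and always predict the average label."""
    return sum(y for _, y in train) / len(train)

rng = random.Random(1)
sample_a, sample_b = sample(200, rng), sample(200, rng)
grid = [i / 50 - 1 for i in range(101)]  # 101 points across [-1, 1]

def avg_disagreement(predict):
    """Mean absolute gap between the curves learned from the two samples."""
    gaps = [abs(predict(sample_a, x) - predict(sample_b, x)) for x in grid]
    return sum(gaps) / len(gaps)

gap_1nn = avg_disagreement(predict_1nn)
gap_mean = avg_disagreement(predict_mean)
print(f"1-NN curves differ by  {gap_1nn:.3f} on average")
print(f"mean predictors differ by {gap_mean:.3f} on average")
```

Both samples come from the same process, yet the 1-NN curves should disagree far more than the two averages do, because 1-NN memorises the noise in whichever sample it saw.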

So how do you actually measure variance?

"How much does my model swing around" sounds vague until you turn it into a
concrete procedure. The trick: fix the test point, vary the training data.

Take your training set and resample it. The standard way is bootstrapping, where you randomly draw examples with replacement to create a new training set of the same size. Do this B = 100 times and you get 100 slightly different training sets, all pulled from the same pool.
Train your model fresh on each of those 100 training sets. Same algorithm, same hyperparameters. The only thing that changes is which examples it saw.
Now take one single test point, something none of those models trained on, and ask all 100 models to predict it.
You now have 100 different predictions for the exact same input. The spread of those 100 numbers is your variance.
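
The four steps above can be sketched in a few lines. This is an illustration under assumptions, not a definitive recipe: the data is synthetic (noisy y = x²), the model is a 1-nearest-neighbour stand-in, and `bootstrap_variance` is a hypothetical helper name. Swap in your own training and prediction code where `predict` is called.

```python
import random
import statistics

def predict_1nn(train, x):
    """Stand-in model (an assumption): 1-nearest-neighbour regression."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def bootstrap_variance(train, predict, x_test, B=100, seed=0):
    """Return the spread of B predictions at one fixed test point."""
    rng = random.Random(seed)
    preds = []
    for _ in range(B):
        # Step 1: draw len(train) examples with replacement.
        resample = rng.choices(train, k=len(train))
        # Steps 2 and 3: "train" on the resample and predict the same test point.
        # For 1-NN, training is just storing the resample.
        preds.append(predict(resample, x_test))
    # Step 4: variance = mean squared deviation from the average prediction.
    return statistics.pvariance(preds), preds

# Hypothetical training data: 50 noisy samples of y = x**2.
data_rng = random.Random(42)
train = []
for _ in range(50):
    x = data_rng.uniform(-1, 1)
    train.append((x, x ** 2 + data_rng.gauss(0, 0.3)))

# x = 0.5 is the fixed test point; it is not one of the training inputs.
variance, preds = bootstrap_variance(train, predict_1nn, x_test=0.5)
print(f"{len(preds)} predictions at x=0.5, variance = {variance:.4f}")
```

The test point is passed in once and never changes inside the loop. Only the resampled training set differs between iterations.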

Note what stayed constant: the test point. If you let the test data change too, like in plain k-fold cross-validation, you're no longer isolating variance.
You're mixing it with "this fold happened to be harder." Keeping the test point fixed is what makes the isolation real.

Why do we square the deviations?

Once you have your 100 predictions and their average, why not measure "how far
off" each prediction is with plain distance instead of squaring it?

The first problem is cancellation. A prediction that's +5 above the average
and one that's -5 below cancel out to zero when summed, even though both
represent real instability. Squaring makes every deviation positive, so errors in opposite directions can't hide each other.

The second is that squaring punishes big swings harder. A model that's
occasionally wildly wrong is more dangerous than one that's consistently a little off. Squaring grows faster than the deviation itself, so an error of 10 contributes 100 times as much as an error of 1, not just 10 times as much. That's by design. You want the measurement to be sensitive to extreme instability, not just average shakiness.
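
Both points are easy to check with a handful of made-up numbers. The prediction lists below are hypothetical and chosen so the arithmetic is easy to follow by hand.

```python
def mean_abs_dev(values):
    """Average plain distance from the mean."""
    m = sum(values) / len(values)
    return sum(abs(v - m) for v in values) / len(values)

def variance(values):
    """Average squared distance from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Cancellation: deviations of +5 and -5 sum to zero.
predictions = [15.0, 5.0, 10.0, 10.0]
mean = sum(predictions) / len(predictions)       # 10.0
signed_sum = sum(p - mean for p in predictions)  # 0.0, the instability is hidden
squared = variance(predictions)                  # 12.5, the instability shows up

# Big swings: same average distance from the mean, different variance.
steady = [9.0, 11.0, 9.0, 11.0]   # always off by 1
spiky = [10.0, 10.0, 8.0, 12.0]   # usually exact, sometimes off by 2

print(signed_sum, squared)
print(mean_abs_dev(steady), mean_abs_dev(spiky))  # 1.0 and 1.0
print(variance(steady), variance(spiky))          # 1.0 and 2.0
```

Plain distance rates the steady and spiky models as equally shaky. Squaring rates the spiky one as twice as unstable.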

Third, it matches the actual statistical definition. Variance, in statistics, is the expected squared deviation from the mean. This isn't something machine learning invented. It's the same definition you'd use to describe how spread out any dataset is, so using it here isn't a choice, it's just staying
consistent.

And fourth, the squared function is smooth and differentiable everywhere (the absolute value has a kink at zero), which
matters once you start optimizing models. This is also why the famous identity
Expected Squared Error = Bias² + Variance + Irreducible Noise uses squared bias. It is a decomposition of squared error, and it only works out cleanly because every term in it is a squared quantity.
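
The identity can be checked numerically in a simulation. The sketch below rests on assumptions of my own: the true function is y = x², the noise standard deviation is 0.3, the model is a least-squares line, and the test point is x = 0. Because the data generator is known here, it draws fresh training sets instead of bootstrap resamples.

```python
import random
import statistics

def fit_line(xs, ys):
    """Closed-form least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def true_fn(x):
    return x ** 2  # assumed ground truth

NOISE_SD = 0.3  # assumed noise level
X0 = 0.0        # the fixed test point
rng = random.Random(7)

preds, sq_errors = [], []
for _ in range(2000):
    xs = [rng.uniform(-1, 1) for _ in range(30)]
    ys = [true_fn(x) + rng.gauss(0, NOISE_SD) for x in xs]
    intercept, slope = fit_line(xs, ys)
    pred = intercept + slope * X0
    y0 = true_fn(X0) + rng.gauss(0, NOISE_SD)  # fresh noisy label at X0
    preds.append(pred)
    sq_errors.append((y0 - pred) ** 2)

bias_sq = (sum(preds) / len(preds) - true_fn(X0)) ** 2
variance = statistics.pvariance(preds)
noise = NOISE_SD ** 2
total = sum(sq_errors) / len(sq_errors)

print(f"bias^2 + variance + noise = {bias_sq + variance + noise:.4f}")
print(f"measured squared error    = {total:.4f}")
```

The two printed numbers should agree up to simulation noise. For this rigid model, squared bias should be the dominant term.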

The actual goal

You're not trying to make bias zero or variance zero. That target doesn't exist outside of toy examples with infinite noiseless data. You're looking for the point where their combined greed costs you the least: a model flexible enough to be roughly right, but stable enough not to chase noise.

That means giving up a little of each archer's greed. Add just enough
flexibility to fix the systematic miss, and just enough constraint to steady the shaky hand. Regularization, more training data, and ensembling are all different tools for the same goal: tell your model to be less greedy.
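
As one illustration of "steadying the shaky hand", here is a rough sketch of ensembling. The setup is assumed, not prescribed: noisy y = x² data, a jumpy 1-nearest-neighbour model, and a hypothetical `predict_bagged` helper that averages 25 copies of that model, each trained on a bootstrap resample. It compares how much a single model and the ensemble swing across 300 fresh training sets.

```python
import random
import statistics

def true_fn(x):
    return x ** 2  # assumed ground truth

def sample(n, rng):
    """Draw n noisy (x, y) pairs from the assumed true function."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, true_fn(x) + rng.gauss(0, 0.3)) for x in xs]

def predict_1nn(train, x):
    """Jumpy model: copy the label of the closest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def predict_bagged(train, x, rng, n_models=25):
    """Ensemble: average 1-NN predictions over bootstrap resamples of train."""
    votes = [predict_1nn(rng.choices(train, k=len(train)), x)
             for _ in range(n_models)]
    return sum(votes) / n_models

rng = random.Random(3)
X0 = 0.5  # fixed test point
single, bagged = [], []
for _ in range(300):
    train = sample(40, rng)
    single.append(predict_1nn(train, X0))
    bagged.append(predict_bagged(train, X0, rng))

var_single = statistics.pvariance(single)
var_bagged = statistics.pvariance(bagged)
print(f"variance of a single 1-NN model: {var_single:.4f}")
print(f"variance of the bagged ensemble: {var_bagged:.4f}")
```

The ensemble's predictions should swing noticeably less than the single model's, because averaging over resamples blends several nearby labels instead of trusting one.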