Overfitting Made Visible: Train/Test and the U-Shaped Error Curve

#machinelearning #ai #beginners #datascience

A model that aces the data it studied is worthless if it flops on new data — like a student who memorised the answer key. That's overfitting, and the train/test split is how you catch it. Here's the U-shaped curve, made interactive.

🎯 Drag the complexity slider: https://dev48v.infy.uk/ml/day8-overfitting.html

The discipline

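// split, shuffle, fit and error are placeholders for your own helpers
const [train, test] = split(shuffle(data), 0.8);  // lock the test set away
const model = fit(train);                          // learn ONLY on train
const trainErr = error(model, train);              // flattering: the model has seen these points
const testErr = error(model, test);                // the honest number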
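The snippet above uses split and shuffle without defining them. Here is a minimal sketch of how they might look, assuming a Fisher-Yates shuffle and a plain cut at the 80% mark. These bodies are an assumption for illustration, not the demo's actual code, and the toy data array is made up. fit and error are left out because they depend on the model you choose.

```javascript
// Fisher-Yates shuffle: returns a shuffled copy and leaves the input untouched.
// rng is injectable so you can pass a seeded generator for reproducible splits.
function shuffle(items, rng = Math.random) {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1)); // random index in 0..i
    const tmp = out[i];
    out[i] = out[j];
    out[j] = tmp;
  }
  return out;
}

// Split into [train, test]; trainFraction = 0.8 keeps 80% for training.
function split(items, trainFraction) {
  const cut = Math.round(items.length * trainFraction);
  return [items.slice(0, cut), items.slice(cut)];
}

// Made-up toy rows, only here to exercise the helpers.
const data = Array.from({ length: 10 }, (_, i) => ({ x: i, y: 2 * i + 1 }));
const [train, test] = split(shuffle(data), 0.8);
console.log(train.length, test.length);
```

Shuffling before the cut matters: if the rows arrive sorted (by date, by label), slicing without a shuffle would hand the model a test set that looks nothing like the training set.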

Why train error lies

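Add complexity (higher polynomial degree, more parameters) and training error keeps falling. For a least-squares fit it can never go up, because every simpler curve is still available to the more flexible model. Eventually it hits zero: a polynomial of degree n - 1 can pass exactly through n points with distinct x values, noise included. So a training score on its own can never tell you whether a model is any good.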
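To see the training score lie, here is a small sketch that pits a straight line against the polynomial that passes through every training point. The five training points and two test points are made up (roughly y = x plus a little noise), and fitLine and fitInterpolant are hypothetical helper names, not anything from the demo.

```javascript
// Made-up toy data: roughly y = x with hand-written noise on the training points.
const train = [[0, 0.1], [1, 0.9], [2, 2.2], [3, 2.8], [4, 4.1]];
const test = [[0.5, 0.5], [3.5, 3.5]];

// Mean squared error of a predictor f over [x, y] pairs.
const mse = (f, pts) =>
  pts.reduce((sum, [x, y]) => sum + (f(x) - y) ** 2, 0) / pts.length;

// Simple model: least-squares straight line (2 parameters).
function fitLine(pts) {
  const n = pts.length;
  const mx = pts.reduce((s, [x]) => s + x, 0) / n;
  const my = pts.reduce((s, [, y]) => s + y, 0) / n;
  const sxy = pts.reduce((s, [x, y]) => s + (x - mx) * (y - my), 0);
  const sxx = pts.reduce((s, [x]) => s + (x - mx) ** 2, 0);
  const slope = sxy / sxx;
  return (x) => my + slope * (x - mx);
}

// Maximally flexible model: the degree n-1 polynomial through all n points,
// evaluated in Lagrange form.
function fitInterpolant(pts) {
  return (x) =>
    pts.reduce((sum, [xi, yi], i) => {
      let basis = 1;
      pts.forEach(([xj], j) => {
        if (j !== i) basis *= (x - xj) / (xi - xj);
      });
      return sum + yi * basis;
    }, 0);
}

const line = fitLine(train);
const wiggly = fitInterpolant(train);
console.log("line:   train", mse(line, train), "test", mse(line, test));
console.log("wiggly: train", mse(wiggly, train), "test", mse(wiggly, test));
```

On this toy data the interpolant scores a perfect zero on train but about 0.08 on test, while the line scores about 0.02 on train and under 0.001 on test. Ranked by training error the wiggly curve wins; ranked by test error it loses badly.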

Test error is U-shaped

Too simple → both errors high (underfit).
Just right → both low (the sweet spot).
Too complex → train error tiny, test error shoots up (overfit — memorising noise).

Slide the demo's degree to 11 and watch the curve wiggle through every training point while test error explodes. The telltale sign is the GAP between low train and high test error.
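If you want to reproduce a sweep like this outside the demo, the sketch below fits polynomials of increasing degree and prints train and test error side by side. Everything in it is an assumption for illustration: the data is a made-up sine curve with hand-written noise, and fitPoly solves the normal equations with Gaussian elimination, which is fine for the small degrees used here but numerically fragile for large ones. It is not the demo's implementation.

```javascript
// Least-squares polynomial fit; returns coefficients c[0..degree].
function fitPoly(pts, degree) {
  const m = degree + 1;
  // Augmented normal equations [X^T X | X^T y], with X[i][j] = x_i ** j.
  const A = Array.from({ length: m }, () => new Array(m + 1).fill(0));
  for (const [x, y] of pts) {
    for (let r = 0; r < m; r++) {
      for (let c = 0; c < m; c++) A[r][c] += x ** (r + c);
      A[r][m] += x ** r * y;
    }
  }
  // Gaussian elimination with partial pivoting.
  for (let col = 0; col < m; col++) {
    let piv = col;
    for (let r = col + 1; r < m; r++) {
      if (Math.abs(A[r][col]) > Math.abs(A[piv][col])) piv = r;
    }
    const tmp = A[col];
    A[col] = A[piv];
    A[piv] = tmp;
    for (let r = col + 1; r < m; r++) {
      const f = A[r][col] / A[col][col];
      for (let c = col; c <= m; c++) A[r][c] -= f * A[col][c];
    }
  }
  // Back substitution.
  const coef = new Array(m).fill(0);
  for (let r = m - 1; r >= 0; r--) {
    let s = A[r][m];
    for (let c = r + 1; c < m; c++) s -= A[r][c] * coef[c];
    coef[r] = s / A[r][r];
  }
  return coef;
}

// Horner evaluation and mean squared error.
const evalPoly = (coef, x) => coef.reduceRight((acc, c) => acc * x + c, 0);
const mse = (coef, pts) =>
  pts.reduce((s, [x, y]) => s + (evalPoly(coef, x) - y) ** 2, 0) / pts.length;

// Made-up data: 8 noisy training points on sin(pi * x), 7 clean test points in between.
const noise = [0.12, -0.18, 0.05, 0.2, -0.1, -0.15, 0.17, -0.06];
const train = noise.map((e, i) => {
  const x = -1 + (2 * i) / 7;
  return [x, Math.sin(Math.PI * x) + e];
});
const test = Array.from({ length: 7 }, (_, i) => {
  const x = -1 + (2 * i + 1) / 7;
  return [x, Math.sin(Math.PI * x)];
});

// Sweep complexity from degree 0 up to the degree that interpolates the training set.
const rows = [];
for (let degree = 0; degree < train.length; degree++) {
  const coef = fitPoly(train, degree);
  rows.push({ degree, trainErr: mse(coef, train), testErr: mse(coef, test) });
}
console.table(rows);
```

The trainErr column can only go down, and it reaches roughly zero at degree 7, where the polynomial has 8 coefficients for 8 points. On noisy data like this the testErr column typically dips and then climbs again. The degree at the bottom of that dip is the one to keep.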

Fighting it (bias-variance)

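Underfitting is high bias (the model is too rigid to capture the pattern); overfitting is high variance (the model changes wildly with the particular sample it saw). The usual tools against overfitting: simpler model · more data · regularization (penalize big weights) · early stopping · cross-validation to pick complexity. Managing this tradeoff is a large part of practical ML.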
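As one concrete case, here is a rough sketch of using k-fold cross-validation to pick between two levels of complexity. The crossValidate helper, the two candidate models, and the toy data are all hypothetical names and values invented for this example. It assumes the rows are already shuffled, since folds are assigned by index.

```javascript
// k-fold cross-validation: average held-out MSE of `fit` over k folds.
// fit(trainPts) must return a predictor function x => yHat.
// Folds are assigned by index (i % k), so shuffle real data first.
function crossValidate(pts, k, fit) {
  let total = 0;
  for (let fold = 0; fold < k; fold++) {
    const held = pts.filter((_, i) => i % k === fold);
    const rest = pts.filter((_, i) => i % k !== fold);
    const predict = fit(rest); // the held-out fold is never seen during fitting
    total += held.reduce((s, [x, y]) => s + (predict(x) - y) ** 2, 0) / held.length;
  }
  return total / k;
}

// Two hypothetical candidates of different complexity.
const candidates = {
  // 1 parameter: always predict the training mean.
  constant: (pts) => {
    const my = pts.reduce((s, [, y]) => s + y, 0) / pts.length;
    return () => my;
  },
  // 2 parameters: least-squares straight line.
  line: (pts) => {
    const n = pts.length;
    const mx = pts.reduce((s, [x]) => s + x, 0) / n;
    const my = pts.reduce((s, [, y]) => s + y, 0) / n;
    const sxy = pts.reduce((s, [x, y]) => s + (x - mx) * (y - my), 0);
    const sxx = pts.reduce((s, [x]) => s + (x - mx) ** 2, 0);
    return (x) => my + (sxy / sxx) * (x - mx);
  },
};

// Made-up data: y = 2x + 1 plus a small deterministic wobble.
const data = Array.from({ length: 10 }, (_, i) => [i, 2 * i + 1 + (((i * 7) % 5) - 2) / 10]);

const scores = Object.fromEntries(
  Object.entries(candidates).map(([name, fit]) => [name, crossValidate(data, 5, fit)])
);
const best = Object.keys(scores).reduce((a, b) => (scores[a] <= scores[b] ? a : b));
console.log(scores, "->", best);
```

Because every point takes a turn in the held-out fold, the choice between candidates is made without touching the locked-away test set, which stays reserved for the final honest number.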