A model that aces the data it studied is worthless if it flops on new data — like a student who memorised the answer key. That's overfitting, and the train/test split is how you catch it. Here's the U-shaped curve, made interactive.
🎯 Drag the complexity slider: https://dev48v.infy.uk/ml/day8-overfitting.html
The discipline
const [train, test] = split(shuffle(data), 0.8); // lock the test set away
const model = fit(train); // learn ONLY on train
const testErr = error(model, test); // the honest number
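The snippet above leaves `shuffle` and `split` undefined. Here is one minimal sketch of how they might look, assuming `data` is a plain array. The seeded generator (`mulberry32`), the seed `42`, and the names `trainSet` and `testSet` are illustrative choices, not part of the original demo.

```javascript
// Small seeded PRNG (mulberry32-style) so the shuffle is reproducible.
// Assumption: any uniform generator returning values in [0, 1) would do.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle on a copy, so the caller's array is left untouched.
function shuffle(items, rand = Math.random) {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = out[i];
    out[i] = out[j];
    out[j] = tmp;
  }
  return out;
}

// Cut the array once: the first part is for training, the rest is held out.
function split(items, trainFraction) {
  const cut = Math.floor(items.length * trainFraction);
  return [items.slice(0, cut), items.slice(cut)];
}

// Toy data: the integers 0..9 stand in for real examples.
const data = Array.from({ length: 10 }, (_, i) => i);
const [trainSet, testSet] = split(shuffle(data, mulberry32(42)), 0.8);
console.log('train size:', trainSet.length, 'test size:', testSet.length);
```

Shuffling before cutting matters when the data arrives in some order (by date, by class), because a plain cut would otherwise put a different kind of example in the test set than in the training set.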
Why train error lies
Add complexity (higher polynomial degree, more parameters) and training error never gets worse: a more flexible model can always reproduce whatever the simpler one fit. Push far enough and it hits zero, when the curve threads through every point, noise included. So a training score alone can never tell you whether a model is any good.
Test error is U-shaped
- Too simple → both errors high (underfit).
- Just right → both low (the sweet spot).
- Too complex → train error tiny, test error shoots up (overfit — memorising noise).
Slide the demo's degree to 11 and watch the curve wiggle through every training point while test error explodes. The telltale sign is the GAP between low train and high test error.
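To see the gap outside the demo, the sketch below fits polynomials of increasing degree to a tiny made-up dataset and prints train error, test error, and the gap for each. Everything here is an assumption for illustration: the data (`sin(3x)` plus hand-written noise), the helper names (`solve`, `fitPoly`, `mse`), and the normal-equations solver, which is only reasonable at small degrees like these.

```javascript
// Solve the linear system A x = b by Gaussian elimination with partial pivoting.
function solve(A, b) {
  const n = b.length;
  const M = A.map((row, i) => [...row, b[i]]); // augmented matrix [A | b]
  for (let col = 0; col < n; col++) {
    let pivot = col;
    for (let r = col + 1; r < n; r++) {
      if (Math.abs(M[r][col]) > Math.abs(M[pivot][col])) pivot = r;
    }
    const tmp = M[col];
    M[col] = M[pivot];
    M[pivot] = tmp;
    for (let r = col + 1; r < n; r++) {
      const f = M[r][col] / M[col][col];
      for (let c = col; c <= n; c++) M[r][c] -= f * M[col][c];
    }
  }
  const x = new Array(n).fill(0);
  for (let i = n - 1; i >= 0; i--) {
    let s = M[i][n];
    for (let c = i + 1; c < n; c++) s -= M[i][c] * x[c];
    x[i] = s / M[i][i];
  }
  return x;
}

// Least-squares polynomial fit via the normal equations.
function fitPoly(points, degree) {
  const n = degree + 1;
  const A = Array.from({ length: n }, () => new Array(n).fill(0));
  const b = new Array(n).fill(0);
  for (const [x, y] of points) {
    for (let i = 0; i < n; i++) {
      b[i] += y * x ** i;
      for (let j = 0; j < n; j++) A[i][j] += x ** (i + j);
    }
  }
  return solve(A, b);
}

// Mean squared error of polynomial weights w on a set of [x, y] points.
function mse(w, points) {
  const predict = (x) => w.reduce((sum, wi, i) => sum + wi * x ** i, 0);
  return points.reduce((sum, [x, y]) => sum + (predict(x) - y) ** 2, 0) / points.length;
}

// Made-up data: y = sin(3x) plus fixed "noise", so every run gives the same numbers.
const trainNoise = [0.1, -0.15, 0.2, -0.05, 0.12, -0.2, 0.08, -0.1];
const testNoise = [-0.08, 0.14, -0.11, 0.06, -0.17, 0.09, 0.13];
const trainSet = trainNoise.map((e, i) => {
  const x = -1 + (2 * i) / 7;
  return [x, Math.sin(3 * x) + e];
});
const testSet = testNoise.map((e, i) => {
  const x = -1 + (2 * i + 1) / 7; // halfway between neighbouring training x values
  return [x, Math.sin(3 * x) + e];
});

// Degree 7 has 8 coefficients for 8 training points, so it can hit every one.
const results = [0, 1, 3, 5, 7].map((degree) => {
  const w = fitPoly(trainSet, degree);
  return { degree, trainErr: mse(w, trainSet), testErr: mse(w, testSet) };
});
for (const r of results) {
  const gap = r.testErr - r.trainErr;
  console.log(`degree ${r.degree}: train ${r.trainErr.toFixed(4)}, test ${r.testErr.toFixed(4)}, gap ${gap.toFixed(4)}`);
}
```

The train column should shrink toward zero as the degree climbs. The test column is the one to read, and its distance from the train column is the gap described above.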
Fighting it (bias-variance)
Simpler model · more data · regularization (penalize big weights) · early stopping · cross-validation to pick complexity. Managing the bias-variance tradeoff between underfitting and overfitting is most of practical ML.
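Of the remedies above, cross-validation is the easiest to sketch: rotate which slice of the training data is held out, average the validation error, and keep the complexity setting that scores best. The example below is a rough illustration under stated assumptions. It swaps the polynomial for k-nearest-neighbour regression (smaller `k` means a more flexible model) purely to keep the code short, and the data and helper names (`knnPredict`, `makeFolds`, `crossValidate`) are invented for this sketch.

```javascript
// k-nearest-neighbour regression in 1-D: predict the mean y of the k closest points.
// Small k is flexible (can memorise noise); large k is smooth (can underfit).
function knnPredict(trainPoints, k, x) {
  const nearest = [...trainPoints]
    .sort((a, b) => Math.abs(a[0] - x) - Math.abs(b[0] - x))
    .slice(0, k);
  return nearest.reduce((sum, p) => sum + p[1], 0) / nearest.length;
}

// Mean squared error of the k-NN model on evalPoints.
function mse(trainPoints, k, evalPoints) {
  const total = evalPoints.reduce(
    (sum, [x, y]) => sum + (knnPredict(trainPoints, k, x) - y) ** 2,
    0
  );
  return total / evalPoints.length;
}

// Deal indices 0..n-1 into `folds` interleaved groups.
function makeFolds(n, folds) {
  const groups = Array.from({ length: folds }, () => []);
  for (let i = 0; i < n; i++) groups[i % folds].push(i);
  return groups;
}

// Average validation error for a given k, holding out each fold in turn.
function crossValidate(points, k, folds) {
  let total = 0;
  for (const held of makeFolds(points.length, folds)) {
    const heldSet = new Set(held);
    const trainPart = points.filter((_, i) => !heldSet.has(i));
    const validPart = held.map((i) => points[i]);
    total += mse(trainPart, k, validPart);
  }
  return total / folds;
}

// Made-up training data: one period of a sine wave plus a fixed noise pattern.
const points = Array.from({ length: 20 }, (_, i) => {
  const x = i / 19;
  const noise = (((i * 7) % 5) - 2) * 0.1;
  return [x, Math.sin(2 * Math.PI * x) + noise];
});

const candidates = [1, 3, 5, 9];
const scores = candidates.map((k) => ({ k, cvErr: crossValidate(points, k, 5) }));
const best = scores.reduce((a, b) => (b.cvErr < a.cvErr ? b : a));
for (const s of scores) console.log(`k = ${s.k}: cv error ${s.cvErr.toFixed(4)}`);
console.log('chosen k =', best.k);
```

Only training data goes through the folds, so the test set stays locked away until the complexity has been chosen.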
The takeaway
Split the data, trust the test set, sit at the bottom of the U. See it.