Series: How Machines Learn: A Complete Guide from Zero to AI Engineer
Phase 6: Machine Learning (The Core)
You trained your model. It got 99% on training data. You tested it on new data. It got 61%.
That gap is the problem. And it has a name: overfitting.
This is one of the most important concepts in all of machine learning. If you don't understand it, you'll keep building models that look great in your notebook and fail in the real world.
What You'll Learn Here
- What overfitting actually means in plain terms
- What underfitting is and why both are bad
- The bias-variance tradeoff (explained without scary math)
- How to detect overfitting with a learning curve
- Practical ways to fix it with real code
The Overfitting Story
Imagine you're studying for a history exam. Instead of understanding why events happened, you memorize every single detail from the textbook. Every date, every name, every quote.
On the practice test, you score 100%. It's the exact same questions from the book.
On the real exam, the teacher rephrases the questions slightly. You freeze. You memorized the exact words, not the concepts. You fail.
Your ML model does the same thing when it overfits. It memorizes the training data so perfectly that it picks up on noise, random quirks, and flukes in that specific dataset. When it sees new data, those memorized quirks don't exist, and the model is lost.
Overfitting vs Underfitting
There are two ways a model can go wrong.
Underfitting: The model didn't learn enough. It's too simple. It misses the real pattern even in the training data.
Overfitting: The model learned too much. It's too complex. It memorized the training data including all its noise.
Good fit: The model learned the actual pattern and can apply it to new data.
Think of it like this. You're trying to draw a line through data points on a graph.
- Underfit: You draw a flat horizontal line. It's too simple. It misses the trend.
- Overfit: You draw a line that zigzags through every single point exactly. It matches training data perfectly but has nothing to do with the real pattern.
- Good fit: You draw a smooth curve that captures the actual trend without chasing every outlier.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
# Create some data with a true pattern + noise
np.random.seed(42)
X = np.sort(np.random.rand(30, 1) * 10, axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.3, X.shape[0])
# Three models: underfit, good fit, overfit
degrees = [1, 3, 15]
titles = ['Underfit (degree=1)', 'Good Fit (degree=3)', 'Overfit (degree=15)']
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
X_plot = np.linspace(0, 10, 300).reshape(-1, 1)
for ax, degree, title in zip(axes, degrees, titles):
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])
    model.fit(X, y)
    y_plot = model.predict(X_plot)
    ax.scatter(X, y, color='gray', alpha=0.6, label='Data')
    ax.plot(X_plot, y_plot, color='blue', linewidth=2)
    ax.set_title(title)
    ax.set_ylim(-2, 2)
plt.tight_layout()
plt.savefig('overfit_comparison.png', dpi=100)
plt.show()
Run this and look at all three plots. The degree=15 model passes through almost every point perfectly. It looks impressive. But ask it to predict on new data and it goes wild.
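If you want numbers instead of pictures, here's a minimal sketch that holds out a few points and compares mean squared error for the same three degrees. The 70/30 hold-out split and the MSE metric are additions for illustration, not part of the plotting code above. Exact values will vary with the split, but the degree=15 model's test error should blow up even while its training error stays near zero.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Same synthetic sine data as above
np.random.seed(42)
X = np.sort(np.random.rand(30, 1) * 10, axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.3, X.shape[0])

# Hold out 30% of the points so the models never see them during fitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for degree in [1, 3, 15]:
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:<3} train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")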
Detecting Overfitting: The Training vs Test Gap
The easiest way to catch overfitting is to compare training accuracy and test accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Try different tree depths
print(f"{'Depth':<8} {'Train Acc':<12} {'Test Acc':<12} {'Gap':<8}")
print("-" * 42)

for depth in [1, 2, 3, 5, 10, 20, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    gap = train_acc - test_acc
    print(f"{str(depth):<8} {train_acc:<12.3f} {test_acc:<12.3f} {gap:<8.3f}")
Output:
Depth    Train Acc    Test Acc     Gap
------------------------------------------
1        0.904        0.895        0.009
2        0.940        0.930        0.010
3        0.962        0.947        0.015
5        0.979        0.947        0.032
10       0.998        0.930        0.068
20       1.000        0.912        0.088
None     1.000        0.912        0.088
Look at that pattern. As the tree gets deeper:
- Training accuracy goes up (model memorizes more)
- Test accuracy goes up then starts dropping
- The gap keeps growing
When you see a big gap between training and test accuracy, that's overfitting.
The Bias-Variance Tradeoff
This is the theory behind all of it. It sounds complicated but it's actually a simple idea.
Bias is how wrong your model is on average. A high-bias model makes strong assumptions and misses the real pattern. It underfits.
Variance is how much your model changes when you train it on different data. A high-variance model is sensitive to every little detail in the training data. It overfits.
The tradeoff: reducing one usually increases the other.
- Simple models: high bias, low variance (underfit)
- Complex models: low bias, high variance (overfit)
- The sweet spot: enough complexity to learn the pattern, not so much that you memorize noise
Bias-Variance Tradeoff:

Error
  |
  |  \                              /
  |   \  Bias            Variance  /
  |    \                          /
  |     \__     Total Error    __/
  |        \__               __/
  |           \___       ___/
  |               \_____/
  |                  ^
  |_______________________________
           Model Complexity

The sweet spot sits at the bottom of the total error curve (marked ^): bias has come down, and variance has not yet taken over.
You can't eliminate both bias and variance at the same time. The goal is to find the right balance for your specific problem.
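You can also see variance directly with a small experiment. This sketch is an addition, not part of the original example: it refits the same two polynomial models on 50 bootstrap resamples of the sine data and measures how much their predictions at one fixed point (x=5, an arbitrary choice) jump around. The simple model barely moves; the complex one swings wildly.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
X = np.sort(rng.random((30, 1)) * 10, axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.3, X.shape[0])

x_query = np.array([[5.0]])  # fixed point where we compare predictions

for degree in [1, 15]:
    preds = []
    for _ in range(50):
        idx = rng.integers(0, len(X), len(X))  # bootstrap resample of the data
        model = Pipeline([
            ('poly', PolynomialFeatures(degree=degree)),
            ('linear', LinearRegression())
        ])
        model.fit(X[idx], y[idx])
        preds.append(model.predict(x_query)[0])
    print(f"degree={degree:<3} prediction std across resamples: {np.std(preds):.3f}")

The spread of those predictions is the "variance" in bias-variance: how sensitive the model is to which particular data it happened to see.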
Learning Curves: Visualizing the Problem
A learning curve shows you how training and test performance change as you add more training data. It's one of the most useful diagnostic tools in ML.
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
import numpy as np
import matplotlib.pyplot as plt
data = load_breast_cancer()
X, y = data.data, data.target
def plot_learning_curve(model, title):
    train_sizes, train_scores, test_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=5,
        scoring='accuracy'
    )
    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    test_mean = test_scores.mean(axis=1)

    plt.figure(figsize=(8, 5))
    plt.plot(train_sizes, train_mean, label='Training accuracy', color='blue')
    plt.plot(train_sizes, test_mean, label='Test accuracy', color='orange')
    plt.fill_between(train_sizes,
                     train_mean - train_std,
                     train_mean + train_std,
                     alpha=0.1, color='blue')
    plt.xlabel('Training set size')
    plt.ylabel('Accuracy')
    plt.title(title)
    plt.legend()
    plt.tight_layout()
    plt.savefig(f'learning_curve_{title[:5]}.png')
    plt.show()
# Overfit model (deep tree)
plot_learning_curve(
DecisionTreeClassifier(max_depth=None, random_state=42),
'Overfit Model - Deep Tree'
)
# Better model (shallow tree)
plot_learning_curve(
DecisionTreeClassifier(max_depth=3, random_state=42),
'Better Model - Shallow Tree'
)
Reading the learning curve:
- Overfit: training accuracy is high, test accuracy is much lower. Big gap between the two lines.
- Underfit: both lines are low and close together. Adding more data doesn't help much.
- Good fit: the two lines are close and both are high.
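If you prefer a single number over squinting at curves, one option (a rough sketch, reusing learning_curve from above) is to look at the gap between the two curves at the largest training size:

from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

for depth in [3, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    sizes, train_scores, test_scores = learning_curve(model, X, y, cv=5, scoring='accuracy')
    # Gap at the largest training size: a big gap means overfitting
    gap = train_scores.mean(axis=1)[-1] - test_scores.mean(axis=1)[-1]
    print(f"max_depth={depth}: final train/validation gap = {gap:.3f}")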
How to Fix Overfitting
Once you detect it, here's how to fix it.
Fix 1: Get more training data
The most effective fix when you can get it. More data makes it harder for the model to memorize noise because there's just too much to memorize.
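Here's a rough sketch of that effect on the same breast cancer data (the 50-sample slice is an arbitrary choice for illustration): the same unlimited-depth tree, trained on a small slice and then on the full training set. Exact numbers depend on the split, but the train/test gap should shrink as the model gets more data.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

for n in [50, len(X_train)]:
    model = DecisionTreeClassifier(max_depth=None, random_state=42)
    model.fit(X_train[:n], y_train[:n])
    # Unlimited depth memorizes whatever slice it sees; more data narrows the gap
    gap = model.score(X_train[:n], y_train[:n]) - model.score(X_test, y_test)
    print(f"training samples={n:<4} train/test gap = {gap:.3f}")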
Fix 2: Simplify the model
Use fewer features, reduce tree depth, reduce polynomial degree. A simpler model can't memorize as much.
# Instead of this (overfitting)
model = DecisionTreeClassifier(max_depth=None)
# Try this
model = DecisionTreeClassifier(max_depth=3)
Fix 3: Regularization
Regularization adds a penalty for complexity directly into the model's training. The model learns to prefer simpler solutions.
from sklearn.linear_model import Ridge, Lasso, LogisticRegression
# Ridge regression: penalizes large coefficients (L2 regularization)
ridge = Ridge(alpha=1.0) # higher alpha = more regularization
# Lasso regression: can shrink some coefficients to zero (L1)
lasso = Lasso(alpha=0.1)
# Logistic regression with regularization (C is the inverse of regularization strength)
lr = LogisticRegression(C=0.1) # lower C = more regularization
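To see regularization actually doing something, here's a rough sketch on the same noisy sine data with a deliberately overcomplex degree-15 polynomial. The StandardScaler and the specific alpha values are illustrative choices, not the only reasonable ones. As alpha grows, the training score drops but the test score typically improves, up to a point.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split

np.random.seed(42)
X = np.sort(np.random.rand(30, 1) * 10, axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.3, X.shape[0])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for alpha in [0.0001, 0.1, 100]:
    # Scale the polynomial features so the penalty treats them comparably
    model = make_pipeline(PolynomialFeatures(degree=15), StandardScaler(), Ridge(alpha=alpha))
    model.fit(X_train, y_train)
    print(f"alpha={alpha:<8} train R2={model.score(X_train, y_train):.3f}  "
          f"test R2={model.score(X_test, y_test):.3f}")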
Fix 4: Cross-validation
Instead of relying on one train/test split, use cross-validation to get a more reliable estimate and catch overfitting early.
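A minimal sketch of what that looks like: cross_val_score trains and scores the model on 5 different folds, so instead of one number you get a mean plus a spread.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validation: five train/validate rounds on different slices of the data
scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=42), X, y, cv=5)
print("fold scores:", np.round(scores, 3))
print(f"mean={scores.mean():.3f}  std={scores.std():.3f}")

A model that looks great on one lucky split but has a mediocre mean and a wide spread across folds is a red flag.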
Fix 5: Pruning (for trees)
Decision trees can be pruned after training to remove branches that don't add much value.
# Cost complexity pruning
model = DecisionTreeClassifier(ccp_alpha=0.01, random_state=42)
model.fit(X_train, y_train)
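If you don't want to guess ccp_alpha, scikit-learn can compute the candidate values for you with cost_complexity_pruning_path. This sketch picks among them by cross-validated accuracy on the training set; selecting alpha that way (rather than on the test set) is a choice added here for illustration, not part of the snippet above.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Candidate alphas corresponding to progressively heavier pruning
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_cv = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    cv = cross_val_score(tree, X_train, y_train, cv=5).mean()
    if cv > best_cv:
        best_alpha, best_cv = alpha, cv

final = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42).fit(X_train, y_train)
print(f"best ccp_alpha={best_alpha:.4f}  cv={best_cv:.3f}  test={final.score(X_test, y_test):.3f}")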
A Side-by-Side Comparison
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
models = {
    'Overfit (no limit)': DecisionTreeClassifier(max_depth=None, random_state=42),
    'Better (depth=5)': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Underfit (depth=1)': DecisionTreeClassifier(max_depth=1, random_state=42),
}

print(f"{'Model':<25} {'Train':<8} {'Test':<8} {'CV Mean':<10}")
print("-" * 55)

for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    cv_acc = cross_val_score(model, X_train, y_train, cv=5).mean()
    print(f"{name:<25} {train_acc:<8.3f} {test_acc:<8.3f} {cv_acc:<10.3f}")
Output:
Model                     Train    Test     CV Mean
-------------------------------------------------------
Overfit (no limit)        1.000    0.912    0.924
Better (depth=5)          0.984    0.956    0.953
Underfit (depth=1)        0.904    0.895    0.901
The middle model has the best test and CV scores. That's the one you'd pick.
Quick Cheat Sheet
| Sign | What it means | Fix |
|---|---|---|
| Train high, test low | Overfitting | Simplify model, regularize, more data |
| Train low, test low | Underfitting | More complexity, better features |
| Train high, test high | Good fit | Ship it |
| Big gap train vs test | Overfitting | Cross-validate, regularize |
| Both curves low on learning curve | Underfitting | More features or complex model |
Practice Challenges
Level 1:
Run the depth comparison table on load_wine(). Find the depth where test accuracy peaks.
Level 2:
Plot learning curves for both an overfit and underfit version of a decision tree on the same dataset. See the difference visually.
Level 3:
Use Ridge regression on a regression dataset (load_diabetes()). Try alpha values of 0.01, 0.1, 1, 10, 100. Plot how train and test error change. Find the sweet spot.
References
- Scikit-learn: Underfitting and Overfitting
- Scikit-learn: Learning Curves
- Bias-Variance Tradeoff explained visually
- Kaggle: Overfitting and Underfitting
Next up, Post 54: Linear Regression: Predicting Numbers From Patterns. We build the most fundamental ML model from scratch, understand the math behind it, and use scikit-learn to make real predictions.