Series: How Machines Learn: A Complete Guide from Zero to AI Engineer
Phase 6: Machine Learning (The Core)
You trained your model. It got 99% on training data. You tested it on new data. It got 61%.
That gap is the problem. And it has a name: overfitting.
This is one of the most important concepts in all of machine learning. If you don't understand it, you'll keep building models that look great in your notebook and fail in the real world.
What You'll Learn Here
- What overfitting actually means in plain terms
- What underfitting is and why both are bad
- The bias-variance tradeoff (explained without scary math)
- How to detect overfitting with a learning curve
- Practical ways to fix it with real code
The Overfitting Story
Imagine you're studying for a history exam. Instead of understanding why events happened, you memorize every single detail from the textbook. Every date, every name, every quote.
On the practice test, you score 100%. It's the exact same questions from the book.
On the real exam, the teacher rephrases the questions slightly. You freeze. You memorized the exact words, not the concepts. You fail.
Your ML model does the same thing when it overfits. It memorizes the training data so perfectly that it picks up on noise, random quirks, and flukes in that specific dataset. When it sees new data, those memorized quirks don't exist, and the model is lost.
Overfitting vs Underfitting
There are two ways a model can go wrong.
Underfitting: The model didn't learn enough. It's too simple. It misses the real pattern even in the training data.
Overfitting: The model learned too much. It's too complex. It memorized the training data including all its noise.
Good fit: The model learned the actual pattern and can apply it to new data.
Think of it like this. You're trying to draw a line through data points on a graph.
- Underfit: You draw a flat horizontal line. It's too simple. It misses the trend.
- Overfit: You draw a line that zigzags through every single point exactly. It matches training data perfectly but has nothing to do with the real pattern.
- Good fit: You draw a smooth curve that captures the actual trend without chasing every outlier.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
# Create some data with a true pattern + noise
np.random.seed(42)
X = np.sort(np.random.rand(30, 1) * 10, axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.3, X.shape[0])
# Three models: underfit, good fit, overfit
degrees = [1, 3, 15]
titles = ['Underfit (degree=1)', 'Good Fit (degree=3)', 'Overfit (degree=15)']
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
X_plot = np.linspace(0, 10, 300).reshape(-1, 1)
for ax, degree, title in zip(axes, degrees, titles):
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])
    model.fit(X, y)
    y_plot = model.predict(X_plot)
    ax.scatter(X, y, color='gray', alpha=0.6, label='Data')
    ax.plot(X_plot, y_plot, color='blue', linewidth=2)
    ax.set_title(title)
    ax.set_ylim(-2, 2)
plt.tight_layout()
plt.savefig('overfit_comparison.png', dpi=100)
plt.show()
Run this and look at all three plots. The degree=15 model passes through almost every point perfectly. It looks impressive. But ask it to predict on new data and it goes wild.
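If you want numbers instead of pictures, here's a minimal sketch that holds out a few points and compares mean squared error for the same three degrees. The 70/30 hold-out split and the MSE metric are additions for illustration, not part of the plotting code above. Exact values will vary with the split, but the degree=15 model's test error should blow up even while its training error stays near zero.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Same synthetic sine data as above
np.random.seed(42)
X = np.sort(np.random.rand(30, 1) * 10, axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.3, X.shape[0])

# Hold out 30% of the points so the models never see them during fitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for degree in [1, 3, 15]:
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:<3} train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")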
Detecting Overfitting: The Training vs Test Gap
The easiest way to catch overfitting is to compare training accuracy and test accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Try different tree depths
print(f"{'Depth':<8} {'Train Acc':<12} {'Test Acc':<12} {'Gap':<8}")
print("-" * 42)

for depth in [1, 2, 3, 5, 10, 20, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    gap = train_acc - test_acc
    print(f"{str(depth):<8} {train_acc:<12.3f} {test_acc:<12.3f} {gap:<8.3f}")
Output:
Depth    Train Acc    Test Acc     Gap
------------------------------------------
1        0.904        0.895        0.009
2        0.940        0.930        0.010
3        0.962        0.947        0.015
5        0.979        0.947        0.032
10       0.998        0.930        0.068
20       1.000        0.912        0.088
None     1.000        0.912        0.088
Look at that pattern. As the tree gets deeper:
- Training accuracy goes up (model memorizes more)
- Test accuracy goes up then starts dropping
- The gap keeps growing
When you see a big gap between training and test accuracy, that's overfitting.
The Bias-Variance Tradeoff
This is the theory behind all of it. It sounds complicated but it's actually a simple idea.
Bias is how wrong your model is on average. A high-bias model makes strong assumptions and misses the real pattern. It underfits.
Variance is how much your model changes when you train it on different data. A high-variance model is sensitive to every little detail in the training data. It overfits.
The tradeoff: reducing one usually increases the other.
- Simple models: high bias, low variance (underfit)
- Complex models: low bias, high variance (overfit)
- The sweet spot: enough complexity to learn the pattern, not so much that you memorize noise
Bias-Variance Tradeoff:

Error
  |
  |  \                              /
  |   \  Bias            Variance  /
  |    \                          /
  |     \__     Total Error    __/
  |        \__               __/
  |           \___       ___/
  |               \_____/
  |                  ^
  |_______________________________
           Model Complexity

The sweet spot sits at the bottom of the total error curve (marked ^): bias has come down, and variance has not yet taken over.
You can't eliminate both bias and variance at the same time. The goal is to find the right balance for your specific problem.
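You can also see variance directly with a small experiment. This sketch is an addition, not part of the original example: it refits the same two polynomial models on 50 bootstrap resamples of the sine data and measures how much their predictions at one fixed point (x=5, an arbitrary choice) jump around. The simple model barely moves; the complex one swings wildly.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
X = np.sort(rng.random((30, 1)) * 10, axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.3, X.shape[0])

x_query = np.array([[5.0]])  # fixed point where we compare predictions

for degree in [1, 15]:
    preds = []
    for _ in range(50):
        idx = rng.integers(0, len(X), len(X))  # bootstrap resample of the data
        model = Pipeline([
            ('poly', PolynomialFeatures(degree=degree)),
            ('linear', LinearRegression())
        ])
        model.fit(X[idx], y[idx])
        preds.append(model.predict(x_query)[0])
    print(f"degree={degree:<3} prediction std across resamples: {np.std(preds):.3f}")

The spread of those predictions is the "variance" in bias-variance: how sensitive the model is to which particular data it happened to see.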
Learning Curves: Visualizing the Problem
A learning curve shows you how training and test performance change as you add more training data. It's one of the most useful diagnostic tools in ML.
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
import numpy as np
import matplotlib.pyplot as plt
data = load_breast_cancer()
X, y = data.data, data.target
def plot_learning_curve(model, title):
    train_sizes, train_scores, test_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=5,
        scoring='accuracy'
    )
    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    test_mean = test_scores.mean(axis=1)

    plt.figure(figsize=(8, 5))
    plt.plot(train_sizes, train_mean, label='Training accuracy', color='blue')
    plt.plot(train_sizes, test_mean, label='Test accuracy', color='orange')
    plt.fill_between(train_sizes,
                     train_mean - train_std,
                     train_mean + train_std,
                     alpha=0.1, color='blue')
    plt.xlabel('Training set size')
    plt.ylabel('Accuracy')
    plt.title(title)
    plt.legend()
    plt.tight_layout()
    plt.savefig(f'learning_curve_{title[:5]}.png')
    plt.show()
# Overfit model (deep tree)
plot_learning_curve(
DecisionTreeClassifier(max_depth=None, random_state=42),
'Overfit Model - Deep Tree'
)
# Better model (shallow tree)
plot_learning_curve(
DecisionTreeClassifier(max_depth=3, random_state=42),
'Better Model - Shallow Tree'
)
Reading the learning curve:
- Overfit: training accuracy is high, test accuracy is much lower. Big gap between the two lines.
- Underfit: both lines are low and close together. Adding more data doesn't help much.
- Good fit: the two lines are close and both are high.
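If you prefer a single number over squinting at curves, one option (a rough sketch, reusing learning_curve from above) is to look at the gap between the two curves at the largest training size:

from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

for depth in [3, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    sizes, train_scores, test_scores = learning_curve(model, X, y, cv=5, scoring='accuracy')
    # Gap at the largest training size: a big gap means overfitting
    gap = train_scores.mean(axis=1)[-1] - test_scores.mean(axis=1)[-1]
    print(f"max_depth={depth}: final train/validation gap = {gap:.3f}")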
How to Fix Overfitting
Once you detect it, here's how to fix it.
Fix 1: Get more training data
The most effective fix when you can get it. More data makes it harder for the model to memorize noise because there's just too much to memorize.
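Here's a rough sketch of that effect on the same breast cancer data (the 50-sample slice is an arbitrary choice for illustration): the same unlimited-depth tree, trained on a small slice and then on the full training set. Exact numbers depend on the split, but the train/test gap should shrink as the model gets more data.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

for n in [50, len(X_train)]:
    model = DecisionTreeClassifier(max_depth=None, random_state=42)
    model.fit(X_train[:n], y_train[:n])
    # Unlimited depth memorizes whatever slice it sees; more data narrows the gap
    gap = model.score(X_train[:n], y_train[:n]) - model.score(X_test, y_test)
    print(f"training samples={n:<4} train/test gap = {gap:.3f}")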
Fix 2: Simplify the model
Use fewer features, reduce tree depth, reduce polynomial degree. A simpler model can't memorize as much.
# Instead of this (overfitting)
model = DecisionTreeClassifier(max_depth=None)
# Try this
model = DecisionTreeClassifier(max_depth=3)
Fix 3: Regularization
Regularization adds a penalty for complexity directly into the model's training. The model learns to prefer simpler solutions.
from sklearn.linear_model import Ridge, Lasso, LogisticRegression
# Ridge regression: penalizes large coefficients (L2 regularization)
ridge = Ridge(alpha=1.0) # higher alpha = more regularization
# Lasso regression: can shrink some coefficients to zero (L1)
lasso = Lasso(alpha=0.1)
# Logistic regression with regularization (C is the inverse of regularization strength)
lr = LogisticRegression(C=0.1) # lower C = more regularization
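To see regularization actually doing something, here's a rough sketch on the same noisy sine data with a deliberately overcomplex degree-15 polynomial. The StandardScaler and the specific alpha values are illustrative choices, not the only reasonable ones. As alpha grows, the training score drops but the test score typically improves, up to a point.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split

np.random.seed(42)
X = np.sort(np.random.rand(30, 1) * 10, axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.3, X.shape[0])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for alpha in [0.0001, 0.1, 100]:
    # Scale the polynomial features so the penalty treats them comparably
    model = make_pipeline(PolynomialFeatures(degree=15), StandardScaler(), Ridge(alpha=alpha))
    model.fit(X_train, y_train)
    print(f"alpha={alpha:<8} train R2={model.score(X_train, y_train):.3f}  "
          f"test R2={model.score(X_test, y_test):.3f}")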
Fix 4: Cross-validation
Instead of relying on one train/test split, use cross-validation to get a more reliable estimate and catch overfitting early.
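A minimal sketch of what that looks like: cross_val_score trains and scores the model on 5 different folds, so instead of one number you get a mean plus a spread.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validation: five train/validate rounds on different slices of the data
scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=42), X, y, cv=5)
print("fold scores:", np.round(scores, 3))
print(f"mean={scores.mean():.3f}  std={scores.std():.3f}")

A model that looks great on one lucky split but has a mediocre mean and a wide spread across folds is a red flag.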
Fix 5: Pruning (for trees)
Decision trees can be pruned after training to remove branches that don't add much value.
# Cost complexity pruning
model = DecisionTreeClassifier(ccp_alpha=0.01, random_state=42)
model.fit(X_train, y_train)
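If you don't want to guess ccp_alpha, scikit-learn can compute the candidate values for you with cost_complexity_pruning_path. This sketch picks among them by cross-validated accuracy on the training set; selecting alpha that way (rather than on the test set) is a choice added here for illustration, not part of the snippet above.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Candidate alphas corresponding to progressively heavier pruning
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_cv = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    cv = cross_val_score(tree, X_train, y_train, cv=5).mean()
    if cv > best_cv:
        best_alpha, best_cv = alpha, cv

final = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42).fit(X_train, y_train)
print(f"best ccp_alpha={best_alpha:.4f}  cv={best_cv:.3f}  test={final.score(X_test, y_test):.3f}")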
A Side-by-Side Comparison
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
models = {
    'Overfit (no limit)': DecisionTreeClassifier(max_depth=None, random_state=42),
    'Better (depth=5)': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Underfit (depth=1)': DecisionTreeClassifier(max_depth=1, random_state=42),
}

print(f"{'Model':<25} {'Train':<8} {'Test':<8} {'CV Mean':<10}")
print("-" * 55)

for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    cv_acc = cross_val_score(model, X_train, y_train, cv=5).mean()
    print(f"{name:<25} {train_acc:<8.3f} {test_acc:<8.3f} {cv_acc:<10.3f}")
Output:
Model                     Train    Test     CV Mean
-------------------------------------------------------
Overfit (no limit)        1.000    0.912    0.924
Better (depth=5)          0.984    0.956    0.953
Underfit (depth=1)        0.904    0.895    0.901
The middle model has the best test and CV scores. That's the one you'd pick.
Quick Cheat Sheet
| Sign | What it means | Fix |
|---|---|---|
| Train high, test low | Overfitting | Simplify model, regularize, more data |
| Train low, test low | Underfitting | More complexity, better features |
| Train high, test high | Good fit | Ship it |
| Big gap train vs test | Overfitting | Cross-validate, regularize |
| Both curves low on learning curve | Underfitting | More features or complex model |
Practice Challenges
Level 1:
Run the depth comparison table on load_wine(). Find the depth where test accuracy peaks.
Level 2:
Plot learning curves for both an overfit and underfit version of a decision tree on the same dataset. See the difference visually.
Level 3:
Use Ridge regression on a regression dataset (load_diabetes()). Try alpha values of 0.01, 0.1, 1, 10, 100. Plot how train and test error change. Find the sweet spot.
References
- Scikit-learn: Underfitting and Overfitting
- Scikit-learn: Learning Curves
- Bias-Variance Tradeoff explained visually
- Kaggle: Overfitting and Underfitting
Next up, Post 54: Linear Regression: Predicting Numbers From Patterns. We build the most fundamental ML model from scratch, understand the math behind it, and use scikit-learn to make real predictions.