Phylis Jepchumba, MSc

Posted on Jun 5

Understanding Underfitting and Overfitting: An Introduction

#beginners #datascience #machinelearning #tutorial

Have you ever trained a model that performed beautifully on your training data but fell apart the moment it saw new data? Or perhaps you built something so simple it couldn't even learn the training data properly? These are the classic traps of overfitting and underfitting — and every machine learning practitioner runs into them.

In this article, we'll cover what they are, how to detect them, how to fix them, and where the bias-variance tradeoff ties it all together — with real-world examples and code throughout.

What is Model Fitting?

Model fitting is the process of training a predictive model on a dataset to find the optimal parameters that best capture the underlying patterns in the data.

The goal is simple: the model should generalize well to unseen data — not just memorize the training examples.

There are three possible outcomes when fitting a model:

Outcome	Description
Good fit	Captures underlying patterns, generalizes well
Underfitting	Too simple, misses patterns even in training data
Overfitting	Too complex, memorizes noise, fails on new data

What is Underfitting?

Underfitting occurs when a model is too simple to capture the underlying patterns in the data. It performs poorly on both the training set and on new, unseen data.

Think of it like this: imagine asking a child to predict house prices and they only use the rule "all houses cost $100,000." That model ignores all relevant features (size, location, age) and will be wrong almost every time.
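The constant-price rule can be tried out directly with scikit-learn's DummyRegressor, which ignores its inputs entirely. The sketch below uses invented house data (the sizes, prices, and noise level are assumptions made up for illustration) and compares the constant rule with a model that actually uses one feature.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical data: price grows with size, plus noise (all values are invented)
rng = np.random.default_rng(0)
size_sqft = rng.uniform(500, 3500, 300).reshape(-1, 1)
price = 50_000 + 120 * size_sqft.ravel() + rng.normal(0, 20_000, 300)

# "All houses cost $100,000": a constant prediction that ignores every feature
constant_rule = DummyRegressor(strategy="constant", constant=100_000)
constant_rule.fit(size_sqft, price)

# A model that actually uses the size feature
linear = LinearRegression().fit(size_sqft, price)

r2_constant = r2_score(price, constant_rule.predict(size_sqft))
r2_linear = r2_score(price, linear.predict(size_sqft))

print(f"Constant rule R^2: {r2_constant:.3f}")  # a constant can never score above 0
print(f"Linear model R^2:  {r2_linear:.3f}")
```

A constant prediction can never achieve a positive R² on the data it is scored against, which makes it a handy baseline: any model that cannot clearly beat it is underfitting.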

Why Does Underfitting Occur?

Model is too simple: A linear model trying to fit a curved, nonlinear relationship
Too few features: Important variables are left out
Too much regularization: Penalizing complexity so heavily that the model can't learn anything meaningful
Insufficient training: The model hasn't been trained long enough

Real-World Example

Suppose you're predicting whether an email is spam. If you only use the feature "email length" and ignore word content, sender, and links, your model will underfit — it simply doesn't have enough signal to make good predictions.

Detecting Underfitting

A model that underfits will show high error on both training and validation data.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import numpy as np

# Generate non-linear data
np.random.seed(42)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # fixed seed for a reproducible split

# Underfit model: linear model on non-linear data
model = LinearRegression()
model.fit(X_train, y_train)

train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))

print(f"Train MSE: {train_error:.4f}")  # High
print(f"Test MSE:  {test_error:.4f}")   # Also high → underfitting

How to Fix Underfitting

1. Use a more complex model

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Upgrade to polynomial regression
poly_model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
poly_model.fit(X_train, y_train)

train_error = mean_squared_error(y_train, poly_model.predict(X_train))
test_error = mean_squared_error(y_test, poly_model.predict(X_test))

print(f"Train MSE: {train_error:.4f}")  # Lower
print(f"Test MSE:  {test_error:.4f}")   # Also lower → better fit

2. Add more relevant features

import pandas as pd

# Before: only one feature
df_underfit = pd.DataFrame({'email_length': [120, 300, 50]})

# After: add meaningful features
df_better = pd.DataFrame({
    'email_length': [120, 300, 50],
    'num_links': [5, 0, 12],
    'contains_free': [1, 0, 1],
    'sender_known': [0, 1, 0]
})
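To see what the extra columns buy you, here is a rough sketch that trains the same classifier on one feature and then on all four. The data is synthetic: every distribution below is an invented assumption, chosen so that email length carries little signal while the other features carry more. The helper name holdout_accuracy is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic emails: every distribution here is an invented assumption
rng = np.random.default_rng(0)
n = 2000
is_spam = rng.integers(0, 2, n)
emails = pd.DataFrame({
    "email_length": rng.normal(200 + 10 * is_spam, 80),      # weak signal
    "num_links": rng.poisson(1 + 4 * is_spam),                # strong signal
    "contains_free": rng.binomial(1, 0.1 + 0.6 * is_spam),
    "sender_known": rng.binomial(1, 0.7 - 0.5 * is_spam),
})

X_tr, X_te, y_tr, y_te = train_test_split(
    emails, is_spam, test_size=0.25, random_state=0
)

def holdout_accuracy(columns):
    """Fit a scaled logistic regression on the given columns; return test accuracy."""
    clf = make_pipeline(StandardScaler(), LogisticRegression())
    clf.fit(X_tr[columns], y_tr)
    return clf.score(X_te[columns], y_te)

acc_one = holdout_accuracy(["email_length"])
acc_all = holdout_accuracy(list(emails.columns))

print(f"email_length only: {acc_one:.3f}")
print(f"all four features: {acc_all:.3f}")
```

The model and training procedure are identical in both runs; only the available signal changes, which is why adding relevant features is often the cheapest fix for underfitting.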

3. Reduce regularization strength

from sklearn.linear_model import Ridge

# Too much regularization → underfitting
model_overreg = Ridge(alpha=1000)

# Reduced regularization → better balance
model_balanced = Ridge(alpha=0.1)
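The effect of the two alpha values can be checked on a small synthetic regression problem. This is only a sketch: the dataset sizes and noise level are illustrative assumptions, and the right alpha for real data has to be found by validation.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression problem (sizes and noise level are illustrative assumptions)
X_reg, y_reg = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_reg, y_reg, test_size=0.25, random_state=0
)

scores = {}
for alpha in [1000, 0.1]:
    ridge = Ridge(alpha=alpha).fit(X_tr, y_tr)
    # .score returns R^2; store (train, test) for each alpha
    scores[alpha] = (ridge.score(X_tr, y_tr), ridge.score(X_te, y_te))
    print(f"alpha={alpha:>6}: train R^2={scores[alpha][0]:.3f}, "
          f"test R^2={scores[alpha][1]:.3f}")
```

With the heavy penalty, both scores should come out low together, which is the underfitting signature described earlier.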

What is Overfitting?

Overfitting occurs when a model learns the training data too well — including its noise and random fluctuations — rather than the true underlying pattern. It performs great on training data but poorly on new data.

Think of a student who memorizes every answer in a practice exam word-for-word, but can't answer anything when the wording changes slightly.

Why Does Overfitting Occur?

Model is too complex: Too many parameters relative to training data
Too little training data: The model memorizes rather than generalizes
Noisy data: Random patterns in the data get learned as if they're real
Training too long: The model starts fitting noise over time

Real-World Example

You're building a fraud detection model. If it memorizes the specific fraudulent transactions in your training set (exact amounts, timestamps, merchant IDs) instead of learning general patterns, it will score near-perfectly on training data but misjudge new transactions: it flags legitimate ones that happen to resemble memorized cases and misses fraud that looks even slightly different.

Detecting Overfitting

An overfit model shows low training error but high validation error — a clear gap between the two.

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # fixed seed for a reproducible split

# Overfit model: very deep decision tree
model = DecisionTreeClassifier(max_depth=None)  # No limit = memorizes everything
model.fit(X_train, y_train)

print(f"Train Accuracy: {accuracy_score(y_train, model.predict(X_train)):.4f}")  # Near 1.0
print(f"Test Accuracy:  {accuracy_score(y_test, model.predict(X_test)):.4f}")    # Much lower

Plotting learning curves is one of the best visual tools:

import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=None),
    X, y, cv=5, scoring='accuracy',
    train_sizes=np.linspace(0.1, 1.0, 10)
)

plt.plot(train_sizes, train_scores.mean(axis=1), label='Training Accuracy')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation Accuracy')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.title('Learning Curve — Detecting Overfitting')
plt.legend()
plt.show()
# A large gap between the two lines = overfitting

How to Fix Overfitting

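1. Use Cross-Validation

Cross-validation does not shrink the train/validation gap by itself. It gives a more reliable estimate of generalization than a single split, which lets you choose settings (such as max_depth) that do not overfit.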

from sklearn.model_selection import cross_val_score

model = DecisionTreeClassifier(max_depth=5)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.4f} ± {scores.std():.4f}")
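One way to turn cross-validation into an actual remedy is to let it choose the model complexity. The sketch below uses GridSearchCV to pick max_depth; the candidate depths are arbitrary assumptions, and the data is regenerated so the block runs by itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_clf, y_clf = make_classification(n_samples=500, n_features=20, random_state=42)

# Candidate depths are an arbitrary illustrative grid
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 5, 8, None]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_clf, y_clf)

print(f"Best max_depth: {search.best_params_['max_depth']}")
print(f"Best mean CV accuracy: {search.best_score_:.4f}")
```

Because each candidate is scored on folds it was not trained on, a depth that merely memorizes the training folds gets no credit for doing so.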

2. Apply Regularization (L1 / L2)

from sklearn.linear_model import Lasso, Ridge, LogisticRegression

# L1 (Lasso) — drives some feature weights to zero
lasso = Lasso(alpha=0.1)

# L2 (Ridge) — shrinks all weights, prevents large coefficients
ridge = Ridge(alpha=1.0)

# Logistic Regression with L2 regularization
lr = LogisticRegression(C=0.1)  # L2 is the default penalty; lower C = more regularization
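The difference between the two penalties shows up in the fitted coefficients. As a small sketch, assume a synthetic problem where only 5 of 20 features matter (the sizes, noise level, and alpha values are illustrative choices), then count how many coefficients each model sets exactly to zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, only 5 of which influence the target (an illustrative assumption)
X_sparse, y_sparse = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0
)

lasso = Lasso(alpha=1.0).fit(X_sparse, y_sparse)
ridge = Ridge(alpha=1.0).fit(X_sparse, y_sparse)

# Count coefficients that are exactly zero
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print(f"Lasso zeroed {n_zero_lasso} of {lasso.coef_.size} coefficients")
print(f"Ridge zeroed {n_zero_ridge} of {ridge.coef_.size} coefficients")
```

Lasso effectively performs feature selection, while Ridge keeps every feature but with smaller weights.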

3. Limit Model Complexity

# Constrain tree depth instead of letting it grow freely
from sklearn.tree import DecisionTreeClassifier

good_model = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10)
good_model.fit(X_train, y_train)

print(f"Train: {accuracy_score(y_train, good_model.predict(X_train)):.4f}")
print(f"Test:  {accuracy_score(y_test, good_model.predict(X_test)):.4f}")
# Gap is now much smaller

4. Data Augmentation (Image Example)

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True,
    zoom_range=0.15
)
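# Artificially increases training diversity, reducing overfitting.
# Note: ImageDataGenerator is deprecated in recent TensorFlow releases in favor of
# Keras preprocessing layers such as RandomFlip, RandomRotation, and RandomZoom.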
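To make concrete what augmentation does to the data, here is a NumPy-only sketch of two of the transformations (horizontal flip and small shifts). The function name augment and the max_shift parameter are hypothetical, and this is a simplification of what a real image pipeline does, not a stand-in for it.

```python
import numpy as np

def augment(image, rng, max_shift=2):
    """Return a randomly flipped and shifted copy of a 2-D image array."""
    out = image.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)  # horizontal flip
    dx = int(rng.integers(-max_shift, max_shift + 1))
    dy = int(rng.integers(-max_shift, max_shift + 1))
    # np.roll wraps pixels around the border; real pipelines usually pad or crop
    return np.roll(out, shift=(dy, dx), axis=(0, 1))

rng = np.random.default_rng(0)
image = np.arange(64, dtype=float).reshape(8, 8)  # stand-in for a real image
augmented = [augment(image, rng) for _ in range(5)]

print(f"Generated {len(augmented)} variants of shape {augmented[0].shape}")
```

Each variant keeps the same label as the original, so the model sees more varied examples of the same class without any new data being collected.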

5. Dropout (Neural Networks)

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.4),   # Drop 40% of neurons during training
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
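What a Dropout layer does to the activations can be sketched in a few lines of NumPy. This assumes the common "inverted dropout" formulation, where surviving units are scaled by 1 / (1 - rate) during training so the expected activation stays the same; the function below is an illustration, not Keras's implementation.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations  # inference: pass through unchanged
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((1000, 128))

train_out = dropout(activations, rate=0.4, rng=rng, training=True)
eval_out = dropout(activations, rate=0.4, rng=rng, training=False)

dropped_fraction = float(np.mean(train_out == 0))
print(f"Fraction dropped in training: {dropped_fraction:.3f}")
print(f"Mean activation in training:  {train_out.mean():.3f}")
print(f"Mean activation at inference: {eval_out.mean():.3f}")
```

Because a different random subset of units is silenced on every batch, no single unit can be relied on to memorize a training example, which is where the regularizing effect comes from.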

6. Early Stopping

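early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=5,             # Stop if val_loss doesn't improve for 5 epochs
    restore_best_weights=True
)

# The model must be compiled before calling fit
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Monitor a validation split carved from the training data,
# so the test set stays untouched for the final evaluation
model.fit(X_train, y_train,
          validation_split=0.2,
          epochs=200,
          callbacks=[early_stop])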
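The patience rule is easier to follow when written out by hand. The function below is a simplified sketch of that logic in plain Python (the name early_stopping_epoch and the loss values are made up for illustration); it is not the Keras implementation.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (best_epoch, stopped_epoch) for a sequence of validation losses.

    stopped_epoch is None if training ran through every epoch.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, None

# Made-up validation losses: improvement stalls after epoch 2
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.64, 0.65, 0.50]
best_epoch, stopped_epoch = early_stopping_epoch(val_losses, patience=5)
print(best_epoch, stopped_epoch)  # 2 7
```

Training halts at epoch 7 and never reaches the lower loss at epoch 8, which shows the cost of setting patience too low. Restoring the best weights corresponds to rolling back to epoch 2.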

The Bias-Variance Tradeoff

To truly understand underfitting and overfitting, you need to understand the bias-variance tradeoff — one of the most fundamental concepts in machine learning.

The total prediction error of a model can be broken down as:

Total Error = Bias² + Variance + Irreducible Noise

Term	What it means	Connection
Bias	Error from wrong assumptions; model misses patterns	High bias → underfitting
Variance	Sensitivity to fluctuations in training data	High variance → overfitting
Irreducible noise	Noise inherent in the data; can't be reduced	Always present
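Bias and variance can be estimated empirically when the true function is known. The sketch below assumes a sine target with Gaussian noise and repeatedly refits polynomial models on fresh noisy samples; the degrees, sample size, and number of repeats are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30).reshape(-1, 1)
true_f = np.sin(2 * np.pi * x).ravel()  # noise-free target (known only in simulation)
noise_sd = 0.2
n_repeats = 200

results = {}
for degree in [1, 4, 9]:
    preds = np.empty((n_repeats, len(x)))
    for r in range(n_repeats):
        # Fresh noisy training set drawn from the same underlying function
        y_noisy = true_f + rng.normal(0, noise_sd, len(x))
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds[r] = model.fit(x, y_noisy).predict(x)
    # Bias^2: how far the average prediction is from the truth
    bias_sq = float(np.mean((preds.mean(axis=0) - true_f) ** 2))
    # Variance: how much predictions move between training sets
    variance = float(np.mean(preds.var(axis=0)))
    results[degree] = (bias_sq, variance)
    print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

The straight line should show the largest bias and the smallest variance, and the degree-9 polynomial the reverse, matching the connections in the table above.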

The Tradeoff in Practice

Simple model  →  High Bias, Low Variance  →  Underfitting
Complex model →  Low Bias, High Variance  →  Overfitting
Optimal model →  Balanced Bias & Variance →  Good generalization

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

np.random.seed(42)
X = np.linspace(0, 1, 100).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + np.random.normal(0, 0.2, 100)

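# Shuffle before splitting: X is sorted, so slicing X[:70] / X[70:] would
# test pure extrapolation beyond the training range
idx = np.random.permutation(len(X))
train_idx, test_idx = idx[:70], idx[70:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]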

degrees = [1, 3, 5, 10, 20]
train_errors, test_errors = [], []

for d in degrees:
    model = make_pipeline(PolynomialFeatures(d), LinearRegression())
    model.fit(X_train, y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(X_train)))
    test_errors.append(mean_squared_error(y_test, model.predict(X_test)))

plt.plot(degrees, train_errors, marker='o', label='Training Error')
plt.plot(degrees, test_errors, label='Test Error')
plt.xlabel('Model Complexity (Polynomial Degree)')
plt.ylabel('Mean Squared Error')
plt.title('Bias-Variance Tradeoff')
plt.legend()
plt.show()
# Sweet spot is where test error is lowest

The goal is to find the sweet spot — a model complex enough to capture real patterns but not so complex it learns the noise.
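Finding the sweet spot can be automated by scoring each complexity level with cross-validation. The sketch below uses scikit-learn's validation_curve on the same kind of noisy sine data; the variable names, fold count, and seeds are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X_curve = np.linspace(0, 1, 100).reshape(-1, 1)
y_curve = np.sin(2 * np.pi * X_curve).ravel() + rng.normal(0, 0.2, 100)

degrees = [1, 3, 5, 10, 20]
# Shuffled folds: X is sorted, so contiguous folds would test extrapolation
cv = KFold(n_splits=5, shuffle=True, random_state=0)

train_scores, val_scores = validation_curve(
    make_pipeline(PolynomialFeatures(), LinearRegression()),
    X_curve, y_curve,
    param_name="polynomialfeatures__degree",  # step name assigned by make_pipeline
    param_range=degrees,
    cv=cv,
    scoring="neg_mean_squared_error",
)

# Scores are negated MSE, so flip the sign before comparing
val_mse = -val_scores.mean(axis=1)
best_degree = degrees[int(np.argmin(val_mse))]

for d, mse in zip(degrees, val_mse):
    print(f"degree={d:>2}: validation MSE={mse:.4f}")
print(f"Lowest validation error at degree {best_degree}")
```

Choosing the degree by validation error rather than training error is what keeps the selection from drifting toward the most complex model.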

Quick Reference: Underfitting vs Overfitting

Aspect	Underfitting	Overfitting
Also called	High bias	High variance
Training error	High	Low
Validation error	High	High
Model complexity	Too simple	Too complex
Fix	More complexity, more features	Regularization, more data, dropout
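The table can be condensed into a rough rule-of-thumb check. The function below is a hypothetical helper, and its thresholds (min_score and max_gap) are arbitrary illustrative values, not standards; sensible cutoffs depend on the problem and the metric.

```python
def diagnose_fit(train_score, val_score, min_score=0.8, max_gap=0.1):
    """Rule-of-thumb diagnosis from accuracy-like scores (higher is better).

    min_score and max_gap are arbitrary illustrative thresholds.
    """
    if train_score < min_score:
        return "underfitting"  # poor even on the data it was trained on
    if train_score - val_score > max_gap:
        return "overfitting"   # strong on training data, much weaker on validation
    return "good fit"

print(diagnose_fit(0.62, 0.60))  # underfitting
print(diagnose_fit(0.99, 0.78))  # overfitting
print(diagnose_fit(0.91, 0.88))  # good fit
```

The order of the checks matters: a model that is weak on its own training data is treated as underfitting regardless of the gap.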

Conclusion

Getting model fitting right is at the heart of machine learning. The key takeaways:

Underfitting = model too simple → increase complexity or add features
Overfitting = model too complex → regularize, add data, or simplify
Bias-variance tradeoff = the fundamental tension between the two
Always evaluate on a held-out validation set: training accuracy alone is a poor guide to generalization

The sweet spot between underfitting and overfitting is where the most useful, reliable models live. With the detection techniques and fixes in this article, you have everything you need to find it.
