DEV Community

Phylis Jepchumba, MSc

Understanding Underfitting and Overfitting: An Introduction

Have you ever trained a model that performed beautifully on your training data but fell apart the moment it saw new data? Or perhaps you built something so simple it couldn't even learn the training data properly? These are the classic traps of overfitting and underfitting — and every machine learning practitioner runs into them.

In this article, we'll cover what they are, how to detect them, how to fix them, and where the bias-variance tradeoff ties it all together — with real-world examples and code throughout.


What is Model Fitting?

Model fitting is the process of training a predictive model on a dataset to find the optimal parameters that best capture the underlying patterns in the data.

The goal is simple: the model should generalize well to unseen data — not just memorize the training examples.

There are three possible outcomes when fitting a model:

| Outcome | Description |
| --- | --- |
| Good fit | Captures underlying patterns, generalizes well |
| Underfitting | Too simple, misses patterns even in training data |
| Overfitting | Too complex, memorizes noise, fails on new data |

What is Underfitting?

Underfitting occurs when a model is too simple to capture the underlying patterns in the data. It performs poorly on both the training set and on new, unseen data.

Think of it like this: imagine asking a child to predict house prices and they only use the rule "all houses cost $100,000." That model ignores all relevant features (size, location, age) and will be wrong almost every time.

Why Does Underfitting Occur?

  • Model is too simple: A linear model trying to fit a curved, nonlinear relationship
  • Too few features: Important variables are left out
  • Too much regularization: Penalizing complexity so heavily that the model can't learn anything meaningful
  • Insufficient training: The model hasn't been trained long enough

Real-World Example

Suppose you're predicting whether an email is spam. If you only use the feature "email length" and ignore word content, sender, and links, your model will underfit — it simply doesn't have enough signal to make good predictions.

Detecting Underfitting

A model that underfits will show high error on both training and validation data.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import numpy as np

# Generate non-linear data
np.random.seed(42)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Underfit model: linear model on non-linear data
model = LinearRegression()
model.fit(X_train, y_train)

train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))

print(f"Train MSE: {train_error:.4f}")  # High
print(f"Test MSE:  {test_error:.4f}")   # Also high → underfitting

How to Fix Underfitting

1. Use a more complex model

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Upgrade to polynomial regression
poly_model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
poly_model.fit(X_train, y_train)

train_error = mean_squared_error(y_train, poly_model.predict(X_train))
test_error = mean_squared_error(y_test, poly_model.predict(X_test))

print(f"Train MSE: {train_error:.4f}")  # Lower
print(f"Test MSE:  {test_error:.4f}")   # Also lower → better fit

2. Add more relevant features

import pandas as pd

# Before: only one feature
df_underfit = pd.DataFrame({'email_length': [120, 300, 50]})

# After: add meaningful features
df_better = pd.DataFrame({
    'email_length': [120, 300, 50],
    'num_links': [5, 0, 12],
    'contains_free': [1, 0, 1],
    'sender_known': [0, 1, 0]
})
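
To see the effect on a fitted model rather than only on a DataFrame, here is a minimal sketch on synthetic data. The dataset, the use of `LogisticRegression`, and the column layout (informative columns first, obtained with `shuffle=False`) are assumptions made for illustration; this is not the spam data described above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): with shuffle=False the 5 informative
# columns come first and the remaining 15 columns are pure noise.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, shuffle=False, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Underfit: a single uninformative column (the last one)
weak = LogisticRegression(max_iter=1000).fit(X_tr[:, [-1]], y_tr)
acc_weak = weak.score(X_te[:, [-1]], y_te)

# Better: all columns, including the informative ones
full = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc_full = full.score(X_te, y_te)

print(f"One noise feature: {acc_weak:.4f}")
print(f"All features:      {acc_full:.4f}")
```

The model given only a noise column has almost nothing to learn from, so its test accuracy should sit near chance, while the model with the informative columns should do clearly better.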

3. Reduce regularization strength

from sklearn.linear_model import Ridge

# Too much regularization → underfitting
model_overreg = Ridge(alpha=1000)

# Reduced regularization → better balance
model_balanced = Ridge(alpha=0.1)
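
The two `Ridge` objects above are only constructed, never fitted. As a rough sketch of what the penalty does in practice, the snippet below fits both settings to noisy sine data like that used earlier. The degree-5 polynomial pipeline, the `StandardScaler` step, and the helper name `ridge_train_mse` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Noisy sine data, similar to the underfitting example
rng = np.random.default_rng(42)
X_demo = np.linspace(0, 10, 200).reshape(-1, 1)
y_demo = np.sin(X_demo).ravel() + rng.normal(0, 0.1, 200)

def ridge_train_mse(alpha):
    """Fit a degree-5 polynomial ridge model and return its training MSE."""
    model = make_pipeline(
        PolynomialFeatures(degree=5, include_bias=False),
        StandardScaler(),          # put all polynomial terms on one scale
        Ridge(alpha=alpha),
    )
    model.fit(X_demo, y_demo)
    return mean_squared_error(y_demo, model.predict(X_demo))

mse_strong = ridge_train_mse(1000)   # heavy penalty
mse_weak = ridge_train_mse(0.1)      # light penalty

print(f"Train MSE, alpha=1000: {mse_strong:.4f}")
print(f"Train MSE, alpha=0.1:  {mse_weak:.4f}")
```

For ridge regression the training error can only grow as `alpha` grows, so the heavily penalized model fits the training data worse. When that training error is already too high, lowering `alpha` is one of the first things to try.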

What is Overfitting?

Overfitting occurs when a model learns the training data too well — including its noise and random fluctuations — rather than the true underlying pattern. It performs great on training data but poorly on new data.

Think of a student who memorizes every answer in a practice exam word-for-word, but can't answer anything when the wording changes slightly.

Why Does Overfitting Occur?

  • Model is too complex: Too many parameters relative to training data
  • Too little training data: The model memorizes rather than generalizes
  • Noisy data: Random patterns in the data get learned as if they're real
  • Training too long: The model starts fitting noise over time

Real-World Example

You're building a fraud detection model. If your model memorizes every specific transaction in your training set (exact amounts, timestamps, merchant IDs), it will misclassify transactions it hasn't seen before: legitimate ones that happen to resemble memorized fraud cases get flagged, while new fraud patterns slip through.

Detecting Overfitting

An overfit model shows low training error but high validation error — a clear gap between the two.

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Overfit model: very deep decision tree
model = DecisionTreeClassifier(max_depth=None)  # No limit = memorizes everything
model.fit(X_train, y_train)

print(f"Train Accuracy: {accuracy_score(y_train, model.predict(X_train)):.4f}")  # Near 1.0
print(f"Test Accuracy:  {accuracy_score(y_test, model.predict(X_test)):.4f}")    # Much lower

A learning curve plots training and validation scores against training set size, which makes the gap easy to see:

import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=None),
    X, y, cv=5, scoring='accuracy',
    train_sizes=np.linspace(0.1, 1.0, 10)
)

plt.plot(train_sizes, train_scores.mean(axis=1), label='Training Accuracy')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation Accuracy')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.title('Learning Curve — Detecting Overfitting')
plt.legend()
plt.show()
# A large gap between the two lines = overfitting

How to Fix Overfitting

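1. Use Cross-Validation

Cross-validation does not reduce overfitting by itself. It gives a more reliable estimate of generalization than a single split, so you can pick the model or hyperparameters that hold up on unseen data.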

from sklearn.model_selection import cross_val_score

model = DecisionTreeClassifier(max_depth=5)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.4f} ± {scores.std():.4f}")
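
Cross-validation is most useful when it drives a choice. As a sketch of that idea, the snippet below uses scikit-learn's `GridSearchCV` to compare several tree depths by their cross-validated accuracy. The grid of candidate depths is an arbitrary assumption, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=42)

# Candidate depths are an arbitrary illustrative grid; None means "no limit"
param_grid = {"max_depth": [2, 3, 5, 8, None]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid, cv=5, scoring="accuracy",
)
search.fit(X_demo, y_demo)

best_depth = search.best_params_["max_depth"]
print(f"Best max_depth: {best_depth}")
print(f"Mean CV accuracy at best depth: {search.best_score_:.4f}")
```

Because every candidate is scored on folds it was not trained on, a depth that merely memorizes the training data gets no credit for doing so.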

2. Apply Regularization (L1 / L2)

from sklearn.linear_model import Lasso, Ridge, LogisticRegression

# L1 (Lasso) — drives some feature weights to zero
lasso = Lasso(alpha=0.1)

# L2 (Ridge) — shrinks all weights, prevents large coefficients
ridge = Ridge(alpha=1.0)

# Logistic Regression with L2 regularization
lr = LogisticRegression(C=0.1, penalty='l2')  # Lower C = more regularization
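
The difference between the two penalties is easiest to see in the fitted coefficients. The sketch below assumes a synthetic regression problem in which only 5 of 20 features matter; the data and the `alpha` values are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption): 20 features, only 5 of which drive the target
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

lasso = Lasso(alpha=5.0).fit(X_demo, y_demo)
ridge = Ridge(alpha=5.0).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print(f"Lasso coefficients set to zero: {n_zero_lasso} of {lasso.coef_.size}")
print(f"Ridge coefficients set to zero: {n_zero_ridge} of {ridge.coef_.size}")
```

Lasso removes features outright, which doubles as feature selection. Ridge keeps every feature but shrinks its weight.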

3. Limit Model Complexity

# Constrain tree depth instead of letting it grow freely
from sklearn.tree import DecisionTreeClassifier

good_model = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10)
good_model.fit(X_train, y_train)

print(f"Train: {accuracy_score(y_train, good_model.predict(X_train)):.4f}")
print(f"Test:  {accuracy_score(y_test, good_model.predict(X_test)):.4f}")
# Gap is now much smaller

4. Data Augmentation (Image Example)

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True,
    zoom_range=0.15
)
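# Artificially increases training diversity, reducing overfitting
# Note: ImageDataGenerator is deprecated in recent TensorFlow releases; Keras
# preprocessing layers (e.g. RandomFlip, RandomRotation) are the recommended replacement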
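
To show what augmentation does to an individual image without depending on TensorFlow, here is a small NumPy sketch. The `augment` helper, the 8×8 stand-in image, and the wrap-around shift are simplifying assumptions; this does not reproduce what `ImageDataGenerator` does internally.

```python
import numpy as np

def augment(image, rng, max_shift=2):
    """Return a randomly flipped and shifted copy of a 2-D image array."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]                      # horizontal flip
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    # np.roll wraps pixels around the border; real pipelines usually pad instead
    return np.roll(out, shift=(int(dy), int(dx)), axis=(0, 1))

rng = np.random.default_rng(0)
image = np.arange(64, dtype=np.float32).reshape(8, 8)   # stand-in "image"
batch = np.stack([augment(image, rng) for _ in range(4)])
print(batch.shape)  # (4, 8, 8)
```

Each call produces a slightly different view of the same image, so the model sees more variety than the raw dataset contains.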

5. Dropout (Neural Networks)

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.4),   # Drop 40% of neurons during training
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
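
What a dropout layer does to its inputs can be sketched in a few lines of NumPy. This is a simplified stand-in for the Keras layer, written as "inverted dropout": units are zeroed at random during training and the survivors are scaled up so the expected activation stays the same. The function name and signature are illustrative.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random fraction of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations                      # inference: pass through unchanged
    keep_mask = rng.random(activations.shape) >= rate
    # Scale survivors by 1 / (1 - rate) so the expected activation is unchanged
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
acts = np.ones((1000, 128))
dropped = dropout(acts, rate=0.4, rng=rng)

zero_fraction = float(np.mean(dropped == 0))
print(f"Fraction zeroed: {zero_fraction:.3f}")
print(f"Mean activation: {dropped.mean():.3f}")
```

Because a different random subset of units is silenced on every training step, the network cannot rely on any single unit, which discourages memorization.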

6. Early Stopping

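# Compile first; Keras raises an error if fit() is called on an uncompiled model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=5,             # Stop if val_loss doesn't improve for 5 epochs
    restore_best_weights=True
)

# Hold out part of the training data for validation; keep the test set untouched
model.fit(X_train, y_train,
          validation_split=0.2,
          epochs=200,
          callbacks=[early_stop])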
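
The patience rule itself is simple enough to write out by hand. The sketch below applies it to a list of validation losses; the function name and the loss values are made up for illustration, and the real callback additionally handles options such as `min_delta` and weight restoration.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch     # improvement: reset the wait
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch                # waited `patience` epochs
    return best_epoch, len(val_losses) - 1          # never triggered

# Made-up validation losses, for illustration only
losses = [1.00, 0.80, 0.70, 0.75, 0.72, 0.90, 0.95]
best_epoch, stop_epoch = early_stopping_epoch(losses, patience=2)
print(best_epoch, stop_epoch)  # 2 4
```

Training stops at epoch 4 because the loss has not improved on epoch 2 for two epochs in a row. With `restore_best_weights=True`, Keras would then roll the model back to the weights from the best epoch.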

The Bias-Variance Tradeoff

To truly understand underfitting and overfitting, you need to understand the bias-variance tradeoff — one of the most fundamental concepts in machine learning.

For squared-error loss, the expected prediction error of a model on unseen data can be broken down as:

Total Error = Bias² + Variance + Irreducible Noise

| Term | What it means | Connection |
| --- | --- | --- |
| Bias | Error from wrong assumptions; model misses patterns | High bias → underfitting |
| Variance | Sensitivity to fluctuations in training data | High variance → overfitting |
| Irreducible noise | Noise inherent in the data; can't be reduced | Always present |
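
Bias and variance can be estimated empirically when the true function is known. The sketch below is a small simulation under assumed conditions: a sine target, Gaussian noise with standard deviation 0.2, and polynomial fits via `np.polyfit`. It repeatedly redraws the noise, refits, and measures how far the average prediction is from the truth (bias²) and how much predictions scatter between fits (variance).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
f_true = np.sin(2 * np.pi * x)       # assumed "true" function
noise_sd = 0.2
n_repeats = 200

def bias2_and_variance(degree):
    """Estimate average bias^2 and variance of a polynomial fit of given degree."""
    preds = np.empty((n_repeats, x.size))
    for i in range(n_repeats):
        y_noisy = f_true + rng.normal(0, noise_sd, x.size)   # fresh training set
        coefs = np.polyfit(x, y_noisy, degree)
        preds[i] = np.polyval(coefs, x)
    bias2 = np.mean((preds.mean(axis=0) - f_true) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

results = {d: bias2_and_variance(d) for d in (1, 3, 9)}
for d, (b2, var) in results.items():
    print(f"degree {d}: bias^2 = {b2:.4f}, variance = {var:.4f}")
```

The straight line (degree 1) should show the largest bias and the smallest variance, and the degree-9 polynomial the reverse, which is the pattern the table describes.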

The Tradeoff in Practice

Simple model  →  High Bias, Low Variance  →  Underfitting
Complex model →  Low Bias, High Variance  →  Overfitting
Optimal model →  Balanced Bias & Variance →  Good generalization
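
The script below fits polynomials of increasing degree to noisy sine data and records the training and test error for each:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

np.random.seed(42)
X = np.linspace(0, 1, 100).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + np.random.normal(0, 0.2, 100)

# X is sorted, so slicing off the last 30 points would test pure extrapolation.
# A shuffled split keeps train and test in the same input range.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)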

degrees = [1, 3, 5, 10, 20]
train_errors, test_errors = [], []

for d in degrees:
    model = make_pipeline(PolynomialFeatures(d), LinearRegression())
    model.fit(X_train, y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(X_train)))
    test_errors.append(mean_squared_error(y_test, model.predict(X_test)))

plt.plot(degrees, train_errors, label='Training Error')
plt.plot(degrees, test_errors, label='Test Error')
plt.xlabel('Model Complexity (Polynomial Degree)')
plt.ylabel('Mean Squared Error')
plt.title('Bias-Variance Tradeoff')
plt.legend()
plt.show()
# Sweet spot is where test error is lowest

The goal is to find the sweet spot — a model complex enough to capture real patterns but not so complex it learns the noise.


Quick Reference: Underfitting vs Overfitting

| | Underfitting | Overfitting |
| --- | --- | --- |
| Also called | High bias | High variance |
| Training error | High | Low |
| Validation error | High | High |
| Model complexity | Too simple | Too complex |
| Fix | More complexity, more features | Regularization, more data, dropout |
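
The table can be turned into a rough rule of thumb in code. The helper below is a hypothetical sketch: the function name and both thresholds (`target_error`, `max_gap`) are assumptions that need tuning to your metric and problem.

```python
def diagnose_fit(train_error, val_error, target_error=0.1, max_gap=0.05):
    """Label a model from its training and validation error.

    Both thresholds are hypothetical defaults; pick values that suit your metric.
    """
    if train_error > target_error:
        return "underfitting"        # cannot even fit the training data
    if val_error - train_error > max_gap:
        return "overfitting"         # fits training data but does not generalize
    return "good fit"

print(diagnose_fit(train_error=0.30, val_error=0.32))  # underfitting
print(diagnose_fit(train_error=0.01, val_error=0.25))  # overfitting
print(diagnose_fit(train_error=0.05, val_error=0.07))  # good fit
```

Training error is checked first because a model that cannot fit its own training data is underfitting regardless of how small the gap is.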

Conclusion

Getting model fitting right is at the heart of machine learning. The key takeaways:

  • Underfitting = model too simple → increase complexity or add features
  • Overfitting = model too complex → regularize, add data, or simplify
  • Bias-variance tradeoff = the fundamental tension between the two
  • Always evaluate on a held-out validation set: training accuracy alone says little about how well the model generalizes

The sweet spot between underfitting and overfitting is where the most useful, reliable models live. With the detection techniques and fixes in this article, you have everything you need to find it.


If you found this helpful, drop a ❤️ and feel free to share! Questions or ideas? Leave a comment below.
