Have you ever trained a model that performed beautifully on your training data but fell apart the moment it saw new data? Or perhaps you built something so simple it couldn't even learn the training data properly? These are the classic traps of overfitting and underfitting — and every machine learning practitioner runs into them.
In this article, we'll cover what they are, how to detect them, how to fix them, and where the bias-variance tradeoff ties it all together — with real-world examples and code throughout.
What is Model Fitting?
Model fitting is the process of training a predictive model on a dataset to find the parameters that best capture the underlying patterns in the data.
The goal is simple: the model should generalize well to unseen data — not just memorize the training examples.
There are three possible outcomes when fitting a model:
| Outcome | Description |
|---|---|
| Good fit | Captures underlying patterns, generalizes well |
| Underfitting | Too simple, misses patterns even in training data |
| Overfitting | Too complex, memorizes noise, fails on new data |
What is Underfitting?
Underfitting occurs when a model is too simple to capture the underlying patterns in the data. It performs poorly on both the training set and on new, unseen data.
Think of it like this: imagine asking a child to predict house prices and they only use the rule "all houses cost $100,000." That model ignores all relevant features (size, location, age) and will be wrong almost every time.
Why Does Underfitting Occur?
- Model is too simple: A linear model trying to fit a curved, nonlinear relationship
- Too few features: Important variables are left out
- Too much regularization: Penalizing complexity so heavily that the model can't learn anything meaningful
- Insufficient training: The model hasn't been trained long enough
Real-World Example
Suppose you're predicting whether an email is spam. If you only use the feature "email length" and ignore word content, sender, and links, your model will underfit — it simply doesn't have enough signal to make good predictions.
Detecting Underfitting
A model that underfits will show high error on both training and validation data.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import numpy as np
# Generate non-linear data
np.random.seed(42)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Underfit model: linear model on non-linear data
model = LinearRegression()
model.fit(X_train, y_train)
train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))
print(f"Train MSE: {train_error:.4f}") # High
print(f"Test MSE: {test_error:.4f}") # Also high → underfitting
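How high is "high"? One rough way to judge is to compare the model against a baseline that always predicts the training mean, much like the "all houses cost $100,000" rule from earlier. The sketch below regenerates the same kind of sine data and uses scikit-learn's `DummyRegressor`; the seeds and the `noise_floor` variable are my own additions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same kind of data as above: a noisy sine wave
rng = np.random.default_rng(42)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: always predict the mean of the training targets
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

baseline_mse = mean_squared_error(y_test, baseline.predict(X_test))
linear_mse = mean_squared_error(y_test, linear.predict(X_test))
noise_floor = 0.1 ** 2  # variance of the noise we added; unknown on real data

print(f"Baseline test MSE: {baseline_mse:.4f}")
print(f"Linear test MSE:   {linear_mse:.4f}")
print(f"Noise floor:       {noise_floor:.4f}")
```

If a model's error sits close to the baseline and far above the noise level you expect, that is a strong hint it is underfitting.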
How to Fix Underfitting
1. Use a more complex model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# Upgrade to polynomial regression
poly_model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
poly_model.fit(X_train, y_train)
train_error = mean_squared_error(y_train, poly_model.predict(X_train))
test_error = mean_squared_error(y_test, poly_model.predict(X_test))
print(f"Train MSE: {train_error:.4f}") # Lower
print(f"Test MSE: {test_error:.4f}") # Also lower → better fit
2. Add more relevant features
import pandas as pd
# Before: only one feature
df_underfit = pd.DataFrame({'email_length': [120, 300, 50]})
# After: add meaningful features
df_better = pd.DataFrame({
    'email_length': [120, 300, 50],
    'num_links': [5, 0, 12],
    'contains_free': [1, 0, 1],
    'sender_known': [0, 1, 0]
})
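To see the effect of extra features on an actual score, here is a small sketch on synthetic data. The four random columns are a stand-in for features like the ones above, not real email data, and the labeling rule is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a spam dataset: four features that each carry
# part of the signal (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (X.sum(axis=1) + rng.normal(0, 0.5, 1000) > 0).astype(int)

clf = LogisticRegression()

# A model that sees only the first feature vs. one that sees all four
acc_one = cross_val_score(clf, X[:, :1], y, cv=5, scoring="accuracy").mean()
acc_all = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

print(f"One feature:  {acc_one:.3f}")
print(f"All features: {acc_all:.3f}")
```

The single-feature model is capped by how much signal that one column carries, no matter how long it trains.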
3. Reduce regularization strength
from sklearn.linear_model import Ridge
# Too much regularization → underfitting
model_overreg = Ridge(alpha=1000)
# Reduced regularization → better balance
model_balanced = Ridge(alpha=0.1)
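The two models above are only instantiated, so here is a hedged sketch that fits Ridge at several strengths and prints the training error. The polynomial pipeline, the `StandardScaler` step, and the list of `alpha` values are my assumptions; the point is only that training error rises as `alpha` grows.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

train_mse = {}
for alpha in [1000, 10, 0.1]:
    model = make_pipeline(
        PolynomialFeatures(degree=5, include_bias=False),
        StandardScaler(),      # put the polynomial terms on a comparable scale
        Ridge(alpha=alpha),
    )
    model.fit(X, y)
    train_mse[alpha] = mean_squared_error(y, model.predict(X))
    print(f"alpha={alpha:>6}: train MSE = {train_mse[alpha]:.4f}")
```

When the training error itself is high at a large `alpha`, lowering it is usually the first thing to try.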
What is Overfitting?
Overfitting occurs when a model learns the training data too well — including its noise and random fluctuations — rather than the true underlying pattern. It performs great on training data but poorly on new data.
Think of a student who memorizes every answer in a practice exam word-for-word, but can't answer anything when the wording changes slightly.
Why Does Overfitting Occur?
- Model is too complex: Too many parameters relative to training data
- Too little training data: The model memorizes rather than generalizes
- Noisy data: Random patterns in the data get learned as if they're real
- Training too long: The model starts fitting noise over time
Real-World Example
You're building a fraud detection model. If it memorizes the specifics of every transaction in the training set (exact amounts, timestamps, merchant IDs), it learns rules that only match those records. On new data it fails in both directions: legitimate transactions that happen to resemble a memorized fraud case get flagged, while new fraud patterns slip through because they don't match anything it saw during training.
Detecting Overfitting
An overfit model shows low training error but high validation error — a clear gap between the two.
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Overfit model: very deep decision tree
model = DecisionTreeClassifier(max_depth=None) # No limit = memorizes everything
model.fit(X_train, y_train)
print(f"Train Accuracy: {accuracy_score(y_train, model.predict(X_train)):.4f}") # Near 1.0
print(f"Test Accuracy: {accuracy_score(y_test, model.predict(X_test)):.4f}") # Much lower
Plotting learning curves is one of the best visual tools:
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=None),
    X, y, cv=5, scoring='accuracy',
    train_sizes=np.linspace(0.1, 1.0, 10)
)
plt.plot(train_sizes, train_scores.mean(axis=1), label='Training Accuracy')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation Accuracy')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.title('Learning Curve — Detecting Overfitting')
plt.legend()
plt.show()
# A large gap between the two lines = overfitting
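A complementary view is to hold the data fixed and vary the model's complexity instead. The sketch below uses scikit-learn's `validation_curve` on the same kind of synthetic data; the list of depths is an arbitrary choice of mine.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

depths = [1, 2, 3, 5, 8, 12, 20]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42),
    X, y,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="accuracy",
)

# Average over the 5 folds and print the train/validation gap per depth
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for depth, tr, va in zip(depths, train_mean, val_mean):
    print(f"max_depth={depth:>2}  train={tr:.3f}  val={va:.3f}  gap={tr - va:.3f}")
```

A training score that keeps climbing while the validation score stalls or drops marks the depth at which the tree starts to overfit.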
How to Fix Overfitting
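1. Use Cross-Validation
Cross-validation doesn't shrink the gap by itself. It gives you a more reliable estimate of how the model generalizes, which you then use to choose a simpler model or stronger regularization.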
from sklearn.model_selection import cross_val_score
model = DecisionTreeClassifier(max_depth=5)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.4f} ± {scores.std():.4f}")
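Cross-validation pays off most when it drives a choice. As a sketch, `GridSearchCV` can pick the tree settings with the best cross-validated accuracy; the grid values below are assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Candidate complexity settings (illustrative values)
param_grid = {"max_depth": [2, 3, 5, 8, None], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("Best params:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

Keep a separate test set aside if you want an unbiased final number, since the best CV score is slightly optimistic after tuning.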
2. Apply Regularization (L1 / L2)
from sklearn.linear_model import Lasso, Ridge, LogisticRegression
# L1 (Lasso) — drives some feature weights to zero
lasso = Lasso(alpha=0.1)
# L2 (Ridge) — shrinks all weights, prevents large coefficients
ridge = Ridge(alpha=1.0)
# Logistic Regression with L2 regularization
lr = LogisticRegression(C=0.1, penalty='l2') # Lower C = more regularization
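To make the L1 vs. L2 difference concrete, this sketch fits both on synthetic regression data where only 5 of 20 features matter and counts the coefficients that end up exactly zero. The dataset and `alpha` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, only 5 of which actually influence the target
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))
print(f"Lasso: {lasso_zeros} of 20 coefficients are exactly zero")
print(f"Ridge: {ridge_zeros} of 20 coefficients are exactly zero")
```

Lasso acts as a built-in feature selector, while Ridge keeps every feature but with smaller weights.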
3. Limit Model Complexity
# Constrain tree depth instead of letting it grow freely
from sklearn.tree import DecisionTreeClassifier
good_model = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10)
good_model.fit(X_train, y_train)
print(f"Train: {accuracy_score(y_train, good_model.predict(X_train)):.4f}")
print(f"Test: {accuracy_score(y_test, good_model.predict(X_test)):.4f}")
# Gap is now much smaller
4. Data Augmentation (Image Example)
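# Note: ImageDataGenerator is deprecated in recent TensorFlow releases; the docs
# recommend Keras preprocessing layers (e.g. RandomFlip, RandomRotation) for new code
from tensorflow.keras.preprocessing.image import ImageDataGenerator
datagen = ImageDataGenerator(
    rotation_range=20,        # rotate by up to 20 degrees either way
    width_shift_range=0.2,    # shift horizontally by up to 20% of the width
    height_shift_range=0.2,   # shift vertically by up to 20% of the height
    horizontal_flip=True,     # randomly mirror images left to right
    zoom_range=0.15           # zoom in or out by up to 15%
)
# Artificially increases training diversity, reducing overfitting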
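To show what such transformations do to the pixels, here is a NumPy-only sketch. It is not the Keras implementation: `augment` is a hypothetical helper, and its shift wraps around the edge instead of filling in new pixels.

```python
import numpy as np

def augment(image, rng, max_shift=2):
    """Return a randomly flipped and shifted copy of a 2-D image array."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]                       # horizontal flip
    dx = int(rng.integers(-max_shift, max_shift + 1))
    out = np.roll(out, dx, axis=1)               # horizontal shift (wraps around)
    return out

rng = np.random.default_rng(0)
image = np.arange(64, dtype=float).reshape(8, 8)  # stand-in for a real image

# One source image becomes 16 slightly different training examples
batch = np.stack([augment(image, rng) for _ in range(16)])
print(batch.shape)
```

Each copy keeps the same content and label, so the model sees more variety without any new labeled data.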
5. Dropout (Neural Networks)
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.4),  # Randomly zero 40% of the previous layer's outputs during training
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.3),  # Dropout is inactive at inference time
    tf.keras.layers.Dense(1, activation='sigmoid')
])
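Under the hood, a dropout layer is a random mask plus a rescale. The NumPy sketch below shows the "inverted dropout" scheme that Keras uses, where surviving values are scaled up by `1 / (1 - rate)`; the `dropout` function itself is a hypothetical helper, not a library call.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of the values and
    rescale the survivors so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return activations                      # no-op at inference time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((1000, 128))

train_out = dropout(a, rate=0.4, rng=rng, training=True)
eval_out = dropout(a, rate=0.4, rng=rng, training=False)

print(f"Fraction zeroed in training:  {(train_out == 0).mean():.3f}")
print(f"Mean activation in training:  {train_out.mean():.3f}")
print(f"Mean activation at inference: {eval_out.mean():.3f}")
```

Because a different random subset is dropped at every step, no single neuron can be relied on to memorize a training example.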
6. Early Stopping
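# The model must be compiled before it can be trained
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=5,                  # Stop if val_loss doesn't improve for 5 epochs
    restore_best_weights=True    # Roll back to the weights from the best epoch
)
# Validate on a slice of the training data; keep the test set untouched
# so it still gives an unbiased final estimate
model.fit(X_train, y_train,
          validation_split=0.2,
          epochs=200,
          callbacks=[early_stop])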
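The patience logic is simple enough to write out by hand. This framework-free sketch mirrors the idea behind the callback; `early_stopping_epoch` is a hypothetical helper and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve by more than
    `min_delta` for `patience` consecutive epochs. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0   # new best: reset counter
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch                   # out of patience
    return len(val_losses) - 1, best_epoch                 # never triggered

# Made-up validation losses: improve, then drift upward as overfitting sets in
losses = [0.90, 0.70, 0.55, 0.50, 0.48, 0.49, 0.50, 0.52, 0.55, 0.60, 0.66]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(f"Stopped at epoch {stop_epoch}; best weights came from epoch {best_epoch}")
```

Restoring the weights from `best_epoch` instead of the final epoch is what `restore_best_weights=True` does in the Keras callback.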
The Bias-Variance Tradeoff
To truly understand underfitting and overfitting, you need to understand the bias-variance tradeoff — one of the most fundamental concepts in machine learning.
For squared-error loss, the expected prediction error of a model can be broken down as:
Total Error = Bias² + Variance + Irreducible Noise
| Term | What it means | Connection |
|---|---|---|
| Bias | Error from wrong assumptions; model misses patterns | High bias → underfitting |
| Variance | Sensitivity to fluctuations in training data | High variance → overfitting |
| Irreducible noise | Noise inherent in the data; can't be reduced | Always present |
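On synthetic data, where the true function is known, bias and variance can be estimated directly by refitting the same model on many fresh noisy samples. The sketch below does this for polynomial fits to a sine curve. The function `bias_variance`, the noise level, and the fixed grid of inputs are all assumptions of mine, and none of this is possible on real data where the true function is unknown.

```python
import numpy as np
from numpy.polynomial import Polynomial

def bias_variance(degree, n_trials=200, n_points=30, noise_std=0.3, seed=0):
    """Estimate bias^2 and variance of a polynomial fit by refitting it
    on many independently drawn noisy training sets."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 1, n_points)
    f_true = np.sin(2 * np.pi * x)            # known only because data is synthetic

    preds = np.empty((n_trials, n_points))
    for t in range(n_trials):
        y = f_true + rng.normal(0, noise_std, n_points)   # fresh noisy sample
        preds[t] = Polynomial.fit(x, y, degree)(x)        # fit, then predict on x

    bias_sq = np.mean((preds.mean(axis=0) - f_true) ** 2)  # average fit vs. truth
    variance = np.mean(preds.var(axis=0))                  # spread across refits
    return bias_sq, variance

for degree in [1, 3, 9]:
    b2, var = bias_variance(degree)
    print(f"degree={degree}: bias^2={b2:.4f}  variance={var:.4f}")
```

The straight line should show the largest bias term and the degree-9 fit the largest variance term, matching the two rows of the table.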
The Tradeoff in Practice
Simple model → High Bias, Low Variance → Underfitting
Complex model → Low Bias, High Variance → Overfitting
Optimal model → Balanced Bias & Variance → Good generalization
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
np.random.seed(42)
X = np.linspace(0, 1, 100).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + np.random.normal(0, 0.2, 100)
# Split at random: X is sorted, so slicing off the last 30 points would
# test extrapolation beyond the training range rather than generalization
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
degrees = [1, 3, 5, 10, 20]
train_errors, test_errors = [], []
for d in degrees:
    model = make_pipeline(PolynomialFeatures(d), LinearRegression())
    model.fit(X_train, y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(X_train)))
    test_errors.append(mean_squared_error(y_test, model.predict(X_test)))
plt.plot(degrees, train_errors, marker='o', label='Training Error')
plt.plot(degrees, test_errors, marker='o', label='Test Error')
plt.xlabel('Model Complexity (Polynomial Degree)')
plt.ylabel('Mean Squared Error')
plt.title('Bias-Variance Tradeoff')
plt.legend()
plt.show()
# Training error keeps falling as the degree grows;
# the sweet spot is where test error is lowest
The goal is to find the sweet spot — a model complex enough to capture real patterns but not so complex it learns the noise.
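One hedged way to locate that sweet spot without touching the test set is to let cross-validation choose the degree. The candidate degrees and fold settings below are assumptions. The folds are shuffled because the inputs are sorted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = np.linspace(0, 1, 100).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 100)

# Shuffle the folds: X is sorted, so unshuffled folds would test extrapolation
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_mse = {}
for degree in [1, 3, 5, 10, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()          # scores are negated MSE
    print(f"degree={degree:>2}: CV MSE = {cv_mse[degree]:.4f}")

best_degree = min(cv_mse, key=cv_mse.get)
print("Selected degree:", best_degree)
```

The selected degree depends on the noise level and sample size, so treat it as an estimate and not a fixed property of the problem.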
Quick Reference: Underfitting vs Overfitting
| | Underfitting | Overfitting |
|---|---|---|
| Also called | High bias | High variance |
| Training error | High | Low |
| Validation error | High | High |
| Model complexity | Too simple | Too complex |
| Fix | More complexity, more features | Regularization, more data, dropout |
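The table can be turned into a rough rule of thumb in code. In this sketch, `diagnose_fit` is a hypothetical helper, and both thresholds are values you would have to choose for your own problem; there are no universal cutoffs.

```python
def diagnose_fit(train_error, val_error, acceptable_error, max_gap):
    """Classify a model from its training and validation error.

    `acceptable_error` and `max_gap` are problem-specific thresholds
    chosen by the user.
    """
    if train_error > acceptable_error:
        return "underfitting"      # can't even fit the training data
    if val_error - train_error > max_gap:
        return "overfitting"       # fits training data, fails to generalize
    return "good fit"

# Example error values (made up for illustration)
print(diagnose_fit(0.45, 0.47, acceptable_error=0.10, max_gap=0.05))
print(diagnose_fit(0.01, 0.30, acceptable_error=0.10, max_gap=0.05))
print(diagnose_fit(0.04, 0.06, acceptable_error=0.10, max_gap=0.05))
```

Checking the training error first matters: a model with high error everywhere is underfitting even if its train/validation gap is small.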
Conclusion
Getting model fitting right is at the heart of machine learning. The key takeaways:
- Underfitting = model too simple → increase complexity or add features
- Overfitting = model too complex → regularize, add data, or simplify
- Bias-variance tradeoff = the fundamental tension between the two
- Always evaluate on a held-out validation set: training accuracy alone can't tell you how well the model generalizes
The sweet spot between underfitting and overfitting is where the most useful, reliable models live. With the detection techniques and fixes in this article, you have everything you need to find it.
If you found this helpful, drop a ❤️ and feel free to share! Questions or ideas? Leave a comment below.