The One-Line Summary: Cross-validation tests your model multiple times on different data splits, giving you a reliable performance estimate instead of a lucky (or unlucky) guess.
The Restaurant Critic's Dilemma
Imagine you're a restaurant critic.
You visit a new Italian place. You order the lasagna. It's... okay. Kind of bland. Maybe a 6/10.
You write your review: "Mediocre Italian. Skip it."
But here's what you didn't know:
- The head chef was sick that night
- You ordered their weakest dish
- The ingredients were from a bad batch
- It was their first week open
A month later, you hear everyone raving about this restaurant. You go back. You try different dishes. Different nights. The pasta is incredible. The risotto is life-changing. The tiramisu makes you weep with joy.
Your first visit was a fluke. One data point. One unlucky sample.
Now imagine a different critic.
She visits the restaurant five times. Different nights. Different dishes. Different occasions.
- Visit 1: Lasagna — 6/10
- Visit 2: Risotto — 9/10
- Visit 3: Seafood pasta — 8/10
- Visit 4: Osso buco — 9/10
- Visit 5: Pizza — 7/10
Average: 7.8/10
She writes: "Excellent Italian with one weak spot. The lasagna needs work, but everything else shines."
That's a fair review. That's a reliable assessment.
The first critic did a single train-test split.
The second critic did cross-validation.
And this difference? It's the line between trusting your model and actually knowing how good it is.
The Problem with a Single Split
Let me show you why a single train-test split is dangerous.
The Standard Approach
from sklearn.model_selection import train_test_split
# Assumes X, y and an untrained model are already defined
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(f"Model accuracy: {score:.1%}")  # 87%... or is it?
You get one number: 87%.
But here's the question nobody asks:
What if you got lucky?
What if, by pure chance, your test set happened to contain "easy" examples? What if your train set happened to contain the most informative samples?
Or what if you got unlucky?
What if your test set had all the weird outliers? What if your best training examples ended up in the test set?
The Horrifying Experiment
Watch this:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Generate dataset
X, y = make_classification(n_samples=200, n_features=20, random_state=42)
# Do 10 different random splits
scores = []
for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=i  # Different split each time
    )
    model = LogisticRegression()
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    scores.append(score)
    print(f"Split {i+1}: {score:.1%}")
print(f"\nRange: {min(scores):.1%} to {max(scores):.1%}")
print(f"Difference: {(max(scores) - min(scores)):.1%}")
Output:
Split 1: 82.5%
Split 2: 90.0%
Split 3: 85.0%
Split 4: 77.5%
Split 5: 87.5%
Split 6: 92.5%
Split 7: 80.0%
Split 8: 85.0%
Split 9: 87.5%
Split 10: 82.5%
Range: 77.5% to 92.5%
Difference: 15.0%
The SAME model. The SAME data. Scores ranging from 77.5% to 92.5%.
If you happened to pick split 4, you'd think your model is mediocre.
If you happened to pick split 6, you'd think your model is excellent.
Neither is the truth. Both are flukes.
Enter Cross-Validation
Cross-validation solves this by letting every part of your data take a turn as the test set.
The Core Idea
Instead of one split, do multiple splits. Each time, a different portion is the test set. Average the results.
Traditional:
┌──────────────────────────────┬─────────────┐
│     Training Data (80%)      │ Test (20%)  │
└──────────────────────────────┴─────────────┘
→ One score

Cross-Validation (5-fold):
┌──────┬──────┬──────┬──────┬──────┐
│ Test │      │      │      │      │  → Score 1
├──────┼──────┼──────┼──────┼──────┤
│      │ Test │      │      │      │  → Score 2
├──────┼──────┼──────┼──────┼──────┤
│      │      │ Test │      │      │  → Score 3
├──────┼──────┼──────┼──────┼──────┤
│      │      │      │ Test │      │  → Score 4
├──────┼──────┼──────┼──────┼──────┤
│      │      │      │      │ Test │  → Score 5
└──────┴──────┴──────┴──────┴──────┘
→ Average of 5 scores
Every data point gets to be in the test set exactly once. Every data point gets to be in the training set multiple times.
No more lucky or unlucky splits. You test on EVERYTHING.
K-Fold Cross-Validation
The most common type. Here's how it works:
Step by Step
Step 1: Divide your data into K equal parts (called "folds")
Data: [████████████████████████████████████████]
↓
K = 5 folds:
[████████] [████████] [████████] [████████] [████████]
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5
Step 2: For each fold:
- Use that fold as the test set
- Use all other folds as training set
- Train the model
- Evaluate on the test fold
- Record the score
Step 3: Average all K scores
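Here's what those three steps look like from scratch, as a minimal sketch. It assumes X and y are NumPy arrays and uses a plain logistic regression as a stand-in model; in practice you'd let scikit-learn do this for you, as we'll see shortly.
import numpy as np
from sklearn.linear_model import LogisticRegression

def manual_k_fold(X, y, k=5, seed=42):
    """Minimal K-fold: shuffle the indices, carve them into k folds, rotate the test fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)   # Step 1: k roughly equal folds
    scores = []
    for i in range(k):                                   # Step 2: each fold takes a turn as the test set
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return np.mean(scores), np.std(scores)               # Step 3: average (and spread) of the k scores

# mean, std = manual_k_fold(X, y, k=5)   # hypothetical X, y
# print(f"{mean:.1%} ± {2 * std:.1%}")
You'd never hand-roll this in a real project; cross_val_score (coming up) does the same thing in one line. But seeing the loop makes it obvious what that one line is doing.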
Visual Walkthrough (5-Fold)
Each round, the shaded fold is the test set and the rest are training:

Round 1:  [▓▓ Test ▓▓] [█ Train ██] [█ Train ██] [█ Train ██] [█ Train ██]   Score: 85%
Round 2:  [█ Train ██] [▓▓ Test ▓▓] [█ Train ██] [█ Train ██] [█ Train ██]   Score: 88%
Round 3:  [█ Train ██] [█ Train ██] [▓▓ Test ▓▓] [█ Train ██] [█ Train ██]   Score: 82%
Round 4:  [█ Train ██] [█ Train ██] [█ Train ██] [▓▓ Test ▓▓] [█ Train ██]   Score: 87%
Round 5:  [█ Train ██] [█ Train ██] [█ Train ██] [█ Train ██] [▓▓ Test ▓▓]   Score: 84%

Final: Average = (85 + 88 + 82 + 87 + 84) / 5 = 85.2% ± 2.1%
Now you know your model is about 85%, not "somewhere between 77% and 92%."
Cross-Validation in Code
It's surprisingly simple.
Basic K-Fold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f"Scores: {scores}")
print(f"Mean: {scores.mean():.3f}")
print(f"Std: {scores.std():.3f}")
print(f"Result: {scores.mean():.1%} ± {scores.std()*2:.1%}")
Output:
Scores: [0.85 0.875 0.825 0.85 0.875]
Mean: 0.855
Std: 0.019
Result: 85.5% ± 3.7%
One line of code. Five evaluations. A reliable estimate.
More Control with KFold
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    model = LogisticRegression()
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    scores.append(score)
    print(f"Fold {fold+1}: {score:.1%}")
print(f"\nMean: {np.mean(scores):.1%} ± {np.std(scores)*2:.1%}")
Types of Cross-Validation
Different situations call for different approaches.
K-Fold (Standard)
Best for: Most situations
K: Usually 5 or 10
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
Stratified K-Fold (For Classification)
The problem: Regular K-Fold might put all examples of one class in the same fold.
The solution: Stratified K-Fold ensures each fold has the same proportion of classes.
Original data: 90% class A, 10% class B
Regular K-Fold might create:
Fold 1: 95% A, 5% B ← Unbalanced!
Fold 2: 80% A, 20% B ← Different distribution!
Stratified K-Fold guarantees:
Fold 1: 90% A, 10% B ← Same as original
Fold 2: 90% A, 10% B ← Same as original
Fold 3: 90% A, 10% B ← Same as original
from sklearn.model_selection import StratifiedKFold
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Use with cross_val_score
scores = cross_val_score(model, X, y, cv=skfold)
Rule: Always use Stratified K-Fold for classification.
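If you want to see the difference yourself, here's a small sketch using a hypothetical imbalanced toy dataset. It prints how much of each test fold belongs to the rare class under both splitters:
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

for name, cv in [("KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]:
    print(name)
    for fold, (_, test_idx) in enumerate(cv.split(X, y), start=1):
        print(f"  Fold {fold}: {y[test_idx].mean():.0%} of test samples are class 1")
With plain KFold you should see the rare-class share wobble from fold to fold, while StratifiedKFold keeps it pinned near 10%.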
Leave-One-Out (LOO)
The idea: Each fold contains just ONE sample.
If you have 100 samples, you do 100 rounds of training. Each round, train on 99 samples, test on 1.
Data: [●] [●] [●] [●] [●] ... 100 samples
Round 1: Test:[●] Train:[● ● ● ● ● ... 99 samples]
Round 2: Train:[●] Test:[●] Train:[● ● ● ... 98 samples]
Round 3: Train:[● ●] Test:[●] Train:[● ● ... 97 samples]
...
Round 100: Train:[● ● ● ● ● ... 99 samples] Test:[●]
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)
Pros: Maximum use of data, lowest bias
Cons: Extremely slow (N rounds!), high variance
Use when: You have very small datasets (< 100 samples)
Time Series Split
The problem: For time series data, you can't use future data to predict the past. Regular K-Fold might leak future information.
The solution: Always train on past, test on future.
Regular K-Fold (WRONG for time series):
[Test] [Train] [Train] [Train] [Train]
↑ Testing on the past while training on the future. Data leakage!
Time Series Split (CORRECT):
Round 1: [Train] [Test]
Round 2: [Train] [Train] [Test]
Round 3: [Train] [Train] [Train] [Test]
Round 4: [Train] [Train] [Train] [Train] [Test]
Always: Past → Future
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=tscv)
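To convince yourself that every round trains on the past and tests on the future, print the index ranges. A quick sketch with a hypothetical series of 12 observations in time order:
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(12).reshape(-1, 1)  # 12 observations, oldest first

for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=3).split(X_demo), start=1):
    print(f"Round {fold}: train on indices {train_idx.min()}-{train_idx.max()}, "
          f"test on indices {test_idx.min()}-{test_idx.max()}")
Every test range starts right after its training range ends, so no future information leaks backwards.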
Group K-Fold
The problem: Sometimes samples belong to groups (e.g., multiple readings from same patient). If the same patient appears in both train and test, you're cheating.
The solution: Ensure all samples from one group stay together.
Data from 3 patients:
Patient A: [● ● ●]
Patient B: [● ● ● ●]
Patient C: [● ●]
WRONG (Regular K-Fold):
Train: [A● A● B● B● C●]
Test: [A● B● B● C●] ← Same patients in both!
RIGHT (Group K-Fold):
Train: [A● A● A● B● B● B● B●]
Test: [C● C●] ← Different patients!
from sklearn.model_selection import GroupKFold
groups = [1, 1, 1, 2, 2, 2, 2, 3, 3] # Patient IDs
gkf = GroupKFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups), start=1):
    # Each patient appears in either train or test, never both
    test_patients = sorted({groups[i] for i in test_idx})
    print(f"Fold {fold}: test patients {test_patients}")
Choosing K (Number of Folds)
How many folds should you use?
K = 5 (Default)
Pros: Fast, good balance
Cons: 20% test data each round
When: Most situations, default choice
K = 10 (More Reliable)
Pros: More reliable estimate, more training data per fold
Cons: Slower
When: When you need higher confidence
K = N (Leave-One-Out)
Pros: Maximum training data, minimum bias
Cons: Very slow, high variance
When: Very small datasets
Rule of Thumb
| Dataset Size | Recommended K |
|---|---|
| < 100 samples | Leave-One-Out or 10-fold |
| 100 - 1,000 | 10-fold |
| 1,000 - 10,000 | 5-fold |
| > 10,000 | 5-fold or even 3-fold |
Larger dataset → Fewer folds needed (each fold is already representative)
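If you're unsure, you can measure the trade-off on your own data. A quick sketch, using a hypothetical dataset and model (swap in your own):
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
model = LogisticRegression(max_iter=1000)

for k in (3, 5, 10):
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=k)
    elapsed = time.perf_counter() - start
    print(f"K={k:>2}: {scores.mean():.1%} ± {2 * scores.std():.1%}  ({elapsed:.2f}s)")
On most datasets the mean barely moves while the runtime grows with K, which is why 5 or 10 is usually enough.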
What Cross-Validation Tells You
The output of cross-validation is K scores, which you summarize with two numbers:
scores = cross_val_score(model, X, y, cv=5)
mean = scores.mean() # Expected performance
std = scores.std() # Variance in performance
The Mean
What it represents: Your model's expected performance on unseen data.
How to use it: Compare models. Select hyperparameters.
The Standard Deviation
What it represents: How much performance varies depending on the data.
How to use it: Assess reliability.
Model A: 85% ± 2% ← Consistent, reliable
Model B: 85% ± 15% ← Inconsistent, risky
Same mean, very different reliability!
Reading the Results
Score: 85.0% ± 4.0%
This means: "On new data, expect around 85% accuracy, typically within the range of 81% to 89%."
High mean, low std: Great model! Reliable.
High mean, high std: Good but inconsistent. Might fail on some data.
Low mean, low std: Consistently bad. Need a better model.
Low mean, high std: Unreliable AND bad. Something is wrong.
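To put the mean and the spread to work when comparing candidates, a short sketch like this (with two hypothetical models on a toy dataset) makes the comparison concrete:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
}

for name, candidate in candidates.items():
    scores = cross_val_score(candidate, X, y, cv=5)
    print(f"{name:>18}: {scores.mean():.1%} ± {2 * scores.std():.1%}")
Pick the higher mean; between similar means, pick the smaller spread.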
Cross-Validation for Hyperparameter Tuning
This is where cross-validation really shines.
The Problem
You want to find the best hyperparameters (e.g., regularization strength α).
Bad approach:
- Split data into train/test
- Try different α values
- Pick the α with best test score
- Report that test score
Why it's bad: You're using the test set to make decisions! The test score is now optimistic.
Good approach:
- For each α value, do cross-validation
- Pick the α with best average CV score
- Train final model on ALL training data with that α
- Evaluate ONCE on held-out test set
Grid Search with Cross-Validation
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
# Define hyperparameter grid
param_grid = {
'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
}
# Grid search with 5-fold CV
grid_search = GridSearchCV(
Ridge(),
param_grid,
cv=5,
scoring='r2',
return_train_score=True
)
grid_search.fit(X_train, y_train)
print(f"Best alpha: {grid_search.best_params_['alpha']}")
print(f"Best CV score: {grid_search.best_score_:.3f}")
# Final evaluation on held-out test set
final_score = grid_search.score(X_test, y_test)
print(f"Test score: {final_score:.3f}")
Output:
Best alpha: 1.0
Best CV score: 0.847
Test score: 0.831
Notice: CV score (0.847) is slightly higher than test score (0.831). This is normal — CV score is slightly optimistic because you chose the best α based on it.
Nested Cross-Validation (Advanced)
For the most rigorous evaluation:
Outer loop: Evaluate model performance
Inner loop: Select hyperparameters
┌─────────────────────────────────────────────────┐
│ OUTER LOOP (5 folds for evaluation) │
│ ┌─────────────────────────────────────────────┐ │
│ │ INNER LOOP (5 folds for tuning) │ │
│ │ │ │
│ │ For each outer fold: │ │
│ │ 1. Hold out outer test fold │ │
│ │ 2. On remaining data, do inner CV │ │
│ │ 3. Find best hyperparameters │ │
│ │ 4. Train on all inner data │ │
│ │ 5. Evaluate on outer test fold │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ Final: Average of 5 outer test scores │
└─────────────────────────────────────────────────┘
from sklearn.model_selection import cross_val_score, GridSearchCV
# Inner CV for hyperparameter tuning
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
# GridSearchCV does the inner loop
grid_search = GridSearchCV(
Ridge(),
{'alpha': [0.01, 0.1, 1.0, 10.0]},
cv=inner_cv
)
# Outer CV for evaluation
outer_cv = KFold(n_splits=5, shuffle=True, random_state=43)
# Nested CV
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv)
print(f"Nested CV: {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")
This gives you the most unbiased estimate of how your entire pipeline (including tuning) will perform on new data.
Common Mistakes
Mistake 1: Data Leakage Through Preprocessing
# WRONG: Fit scaler on ALL data, then cross-validate
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # The scaler has now seen the test folds!
scores = cross_val_score(model, X_scaled, y, cv=5)
# RIGHT: Include preprocessing in the pipeline
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
scores = cross_val_score(pipeline, X, y, cv=5)  # Scaler is fit only on the training folds
Mistake 2: Using Regular K-Fold for Time Series
# WRONG: Regular K-Fold leaks future information
scores = cross_val_score(model, X_time_series, y, cv=5)
# RIGHT: Time series split
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X_time_series, y, cv=tscv)
Mistake 3: Not Stratifying for Imbalanced Classification
# WRONG: An explicit plain KFold ignores the class balance
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(model, X, y, cv=kf)
# RIGHT: Stratified K-Fold preserves class ratios in every fold
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)
# Note: with an integer cv and a classifier, cross_val_score already stratifies by default,
# but being explicit protects you when you shuffle or swap in another splitter.
Mistake 4: Reporting CV Score as Final Performance
# WRONG: "My model achieves 87% accuracy" (from CV)
scores = cross_val_score(model, X, y, cv=5)
print(f"Model accuracy: {scores.mean():.1%}") # This is not test performance!
# RIGHT: Hold out a true test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Use CV only on X_train for model selection
# Report X_test score as final performance
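Here's a minimal sketch of that workflow, assuming X and y are already loaded and using a generic classifier as a placeholder:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)       # model selection happens here
print(f"CV estimate: {cv_scores.mean():.1%} ± {2 * cv_scores.std():.1%}")

model.fit(X_train, y_train)                                      # refit on all training data
print(f"Final test score: {model.score(X_test, y_test):.1%}")    # report this once, at the end
The CV number guides your choices; the single test number is what you report.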
Cross-Validation Cheat Sheet
| Situation | Use This |
|---|---|
| General purpose | 5-fold or 10-fold |
| Classification | StratifiedKFold |
| Time series | TimeSeriesSplit |
| Grouped data | GroupKFold |
| Very small data | LeaveOneOut |
| Hyperparameter tuning | GridSearchCV with CV |
| Rigorous evaluation | Nested CV |
The Complete Cross-Validation Toolkit
# === Basic K-Fold ===
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
# === Stratified K-Fold (Classification) ===
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)
# === Time Series Split ===
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=tscv)
# === Group K-Fold ===
from sklearn.model_selection import GroupKFold
gkf = GroupKFold(n_splits=5)
# scores = cross_val_score(model, X, y, cv=gkf, groups=groups)
# === Leave-One-Out ===
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)
# === Grid Search with CV ===
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X, y)
# === Nested CV ===
from sklearn.model_selection import cross_val_score
nested_scores = cross_val_score(grid_search, X, y, cv=5)
Key Takeaways
Single train-test split is unreliable — You might get lucky or unlucky
Cross-validation tests on every portion — No more flukes
K-Fold splits data into K parts — Each takes a turn as test set
Stratified K-Fold for classification — Preserves class ratios
TimeSeriesSplit for temporal data — Respects time order
CV gives mean AND std — Both matter for reliability
Use CV for hyperparameter tuning — GridSearchCV does this automatically
Include preprocessing in pipeline — Avoid data leakage
The Restaurant Critic Analogy Revisited
| Critic Approach | ML Equivalent | Result |
|---|---|---|
| One visit, one dish | Single train-test split | Unreliable, might be a fluke |
| Five visits, multiple dishes | 5-fold cross-validation | Reliable average rating |
| Every dish, every night | Leave-one-out | Maximum reliability (but exhausting) |
The One-Sentence Summary
Cross-validation is visiting the restaurant five times instead of once — because one visit might catch the chef on a bad day.
What's Next?
Now that you understand cross-validation, you're ready for:
- Hyperparameter Tuning — Grid search, random search, Bayesian optimization
- Model Selection — Using CV to choose between algorithms
- Feature Selection — Using CV to find important features
- Ensemble Methods — Combining multiple cross-validated models
Follow me for the next article in this series!
Let's Connect!
If this made cross-validation finally click, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
What's your go-to CV strategy? 5-fold? 10-fold? I'm curious!
The difference between "my model gets 87% accuracy" and "my model reliably gets 85-89% accuracy" is cross-validation. One is a guess. The other is knowledge.
Share this with someone who's still using a single train-test split and wondering why their results change every time they run the code.
Happy learning!