Sachin Kr. Rajput

Cross-Validation: Why Testing Your Model Once Is Like Judging a Restaurant by a Single Bite

The One-Line Summary: Cross-validation tests your model multiple times on different data splits, giving you a reliable performance estimate instead of a lucky (or unlucky) guess.


The Restaurant Critic's Dilemma

Imagine you're a restaurant critic.

You visit a new Italian place. You order the lasagna. It's... okay. Kind of bland. Maybe a 6/10.

You write your review: "Mediocre Italian. Skip it."

But here's what you didn't know:

  • The head chef was sick that night
  • You ordered their weakest dish
  • The ingredients were from a bad batch
  • It was their first week open

A month later, you hear everyone raving about this restaurant. You go back. You try different dishes. Different nights. The pasta is incredible. The risotto is life-changing. The tiramisu makes you weep with joy.

Your first visit was a fluke. One data point. One unlucky sample.


Now imagine a different critic.

She visits the restaurant five times. Different nights. Different dishes. Different occasions.

  • Visit 1: Lasagna — 6/10
  • Visit 2: Risotto — 9/10
  • Visit 3: Seafood pasta — 8/10
  • Visit 4: Osso buco — 9/10
  • Visit 5: Pizza — 7/10

Average: 7.8/10

She writes: "Excellent Italian with one weak spot. The lasagna needs work, but everything else shines."

That's a fair review. That's a reliable assessment.


The first critic did a single train-test split.

The second critic did cross-validation.

And this difference? It's the line between trusting your model and actually knowing how good it is.


The Problem with a Single Split

Let me show you why a single train-test split is dangerous.

The Standard Approach

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)

print(f"Model accuracy: {score:.1%}")  # 87%... or is it?

You get one number: 87%.

But here's the question nobody asks:

What if you got lucky?

What if, by pure chance, your test set happened to contain "easy" examples? What if your train set happened to contain the most informative samples?

Or what if you got unlucky?

What if your test set had all the weird outliers? What if your best training examples ended up in the test set?


The Horrifying Experiment

Watch this:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Generate dataset
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Do 10 different random splits
scores = []
for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=i  # Different split each time
    )
    model = LogisticRegression()
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    scores.append(score)
    print(f"Split {i+1}: {score:.1%}")

print(f"\nRange: {min(scores):.1%} to {max(scores):.1%}")
print(f"Difference: {(max(scores) - min(scores)):.1%}")

Output:

Split 1: 82.5%
Split 2: 90.0%
Split 3: 85.0%
Split 4: 77.5%
Split 5: 87.5%
Split 6: 92.5%
Split 7: 80.0%
Split 8: 85.0%
Split 9: 87.5%
Split 10: 82.5%

Range: 77.5% to 92.5%
Difference: 15.0%

The SAME model. The SAME data. Scores ranging from 77.5% to 92.5%.

If you happened to pick split 4, you'd think your model is mediocre.
If you happened to pick split 6, you'd think your model is excellent.

Neither is the truth. Both are flukes.


Enter Cross-Validation

Cross-validation solves this by testing on every possible portion of your data.

The Core Idea

Instead of one split, do multiple splits. Each time, a different portion is the test set. Average the results.

Traditional:
┌────────────────────────────────────────┐
│  Training Data (80%)    │  Test (20%)  │
└────────────────────────────────────────┘
                          → One score

Cross-Validation (5-fold):
┌────────────────────────────────────────┐
│ Test │                                  │  → Score 1
├──────┼──────────────────────────────────┤
│      │ Test │                           │  → Score 2
├──────┴──────┼───────────────────────────┤
│             │ Test │                    │  → Score 3
├─────────────┴──────┼────────────────────┤
│                    │ Test │             │  → Score 4
├────────────────────┴──────┼─────────────┤
│                           │    Test     │  → Score 5
└───────────────────────────┴─────────────┘
                          → Average of 5 scores

Every data point gets to be in the test set exactly once. Every data point gets to be in the training set multiple times.

No more lucky or unlucky splits. You test on EVERYTHING.
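
To see the contrast in practice, here's a minimal sketch that reuses the same kind of synthetic data as the "horrifying experiment" above, but replaces the ten random splits with a single 5-fold cross-validation call (exact numbers will vary slightly; the point is one averaged estimate instead of a 15-point spread):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Same synthetic dataset as the experiment above
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# One call, five train/test rounds, every sample tested exactly once
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"Fold scores: {np.round(scores, 3)}")
print(f"Estimate: {scores.mean():.1%} ± {scores.std():.1%}")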


K-Fold Cross-Validation

The most common type. Here's how it works:

Step by Step

Step 1: Divide your data into K equal parts (called "folds")

Data: [████████████████████████████████████████]
                        ↓
K = 5 folds:
      [████████] [████████] [████████] [████████] [████████]
       Fold 1     Fold 2     Fold 3     Fold 4     Fold 5

Step 2: For each fold:

  • Use that fold as the test set
  • Use all other folds as training set
  • Train the model
  • Evaluate on the test fold
  • Record the score

Step 3: Average all K scores
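
Before handing the bookkeeping to scikit-learn, it can help to see those three steps written out by hand. Here's a minimal sketch using plain NumPy index splitting; it assumes you already have X and y arrays and a scikit-learn style model with fit and score methods:

import numpy as np

def manual_k_fold(model, X, y, k=5, seed=42):
    # Step 1: shuffle the row indices and cut them into K roughly equal folds
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)

    scores = []
    for i in range(k):
        # Step 2: fold i is the test set, all other folds are the training set
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))

    # Step 3: average the K scores
    return np.mean(scores), np.std(scores)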


Visual Walkthrough (5-Fold)

Round 1:
Train:            [████████] [████████] [████████] [████████]
Test:  [▓▓▓▓▓▓▓▓]
Score: 85%

Round 2:
Train: [████████]            [████████] [████████] [████████]
Test:             [▓▓▓▓▓▓▓▓]
Score: 88%

Round 3:
Train: [████████] [████████]            [████████] [████████]
Test:                        [▓▓▓▓▓▓▓▓]
Score: 82%

Round 4:
Train: [████████] [████████] [████████]            [████████]
Test:                                   [▓▓▓▓▓▓▓▓]
Score: 87%

Round 5:
Train: [████████] [████████] [████████] [████████]
Test:                                              [▓▓▓▓▓▓▓▓]
Score: 84%

Final: Average = (85 + 88 + 82 + 87 + 84) / 5 = 85.2% ± 2.1%

Now you know your model is about 85%, not "somewhere between 77% and 92%."


Cross-Validation in Code

It's surprisingly simple.

Basic K-Fold

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)

print(f"Scores: {scores}")
print(f"Mean: {scores.mean():.3f}")
print(f"Std: {scores.std():.3f}")
print(f"Result: {scores.mean():.1%} ± {scores.std()*2:.1%}")

Output:

Scores: [0.85 0.875 0.825 0.85 0.875]
Mean: 0.855
Std: 0.020
Result: 85.5% ± 4.0%

One line of code. Five evaluations. A reliable estimate.


More Control with KFold

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    model = LogisticRegression()
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    scores.append(score)

    print(f"Fold {fold+1}: {score:.1%}")

print(f"\nMean: {np.mean(scores):.1%} ± {np.std(scores)*2:.1%}")

Types of Cross-Validation

Different situations call for different approaches.

K-Fold (Standard)

Best for: Most situations
K: Usually 5 or 10
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

Stratified K-Fold (For Classification)

The problem: Regular K-Fold might put all examples of one class in the same fold.

The solution: Stratified K-Fold ensures each fold has the same proportion of classes.

Original data: 90% class A, 10% class B

Regular K-Fold might create:
Fold 1: 95% A, 5% B   ← Unbalanced!
Fold 2: 80% A, 20% B  ← Different distribution!

Stratified K-Fold guarantees:
Fold 1: 90% A, 10% B  ← Same as original
Fold 2: 90% A, 10% B  ← Same as original
Fold 3: 90% A, 10% B  ← Same as original
from sklearn.model_selection import StratifiedKFold

skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Use with cross_val_score
scores = cross_val_score(model, X, y, cv=skfold)

Rule: Always use Stratified K-Fold for classification.
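
If you want to see that guarantee with your own eyes, here's a small sketch with a deliberately imbalanced, sorted toy label array. It prints the fraction of the minority class in each test fold for both splitters:

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels: 90 samples of class 0, then 10 of class 1 (sorted = worst case for plain K-Fold)
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features don't matter for how the folds are cut

for name, splitter in [("KFold", KFold(n_splits=5)),
                       ("StratifiedKFold", StratifiedKFold(n_splits=5))]:
    ratios = [y[test_idx].mean() for _, test_idx in splitter.split(X, y)]
    print(f"{name}: fraction of class 1 per test fold = {np.round(ratios, 2)}")

With plain K-Fold on this sorted data, four of the five test folds contain no class-1 samples at all; StratifiedKFold keeps every fold at roughly 10%.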


Leave-One-Out (LOO)

The idea: Each fold contains just ONE sample.

If you have 100 samples, you do 100 rounds of training. Each round, train on 99 samples, test on 1.

Data: [●] [●] [●] [●] [●] ... 100 samples

Round 1:   Test:[●]  Train:[● ● ● ● ● ... 99 samples]
Round 2:   Train:[●] Test:[●]  Train:[● ● ● ... 98 samples]
Round 3:   Train:[● ●] Test:[●]  Train:[● ● ... 97 samples]
...
Round 100: Train:[● ● ● ● ● ... 99 samples] Test:[●]
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)

Pros: Maximum use of data, lowest bias
Cons: Extremely slow (N rounds!), high variance

Use when: You have very small datasets (< 100 samples)
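
The cost of LOO is easy to underestimate: N samples means N separate model fits. A quick sketch (the array here is just a placeholder so the splitters have something to count):

import numpy as np
from sklearn.model_selection import LeaveOneOut, KFold

X = np.zeros((200, 5))  # 200 samples of anything

print("LOO fits:   ", LeaveOneOut().get_n_splits(X))      # 200, one fit per sample
print("5-fold fits:", KFold(n_splits=5).get_n_splits(X))   # 5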


Time Series Split

The problem: For time series data, you can't use future data to predict the past. Regular K-Fold might leak future information.

The solution: Always train on past, test on future.

Regular K-Fold (WRONG for time series):
[Test] [Train] [Train] [Train] [Train]
  ↑ The model is trained on the future and tested on the past. Data leakage!

Time Series Split (CORRECT):
Round 1: [Train] [Test]
Round 2: [Train] [Train] [Test]
Round 3: [Train] [Train] [Train] [Test]
Round 4: [Train] [Train] [Train] [Train] [Test]

Always: Past → Future
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=tscv)
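
If you print the splits, you can watch the "past → future" pattern directly. A quick sketch on a toy array where the row index stands in for time:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered samples

for i, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=4).split(X), start=1):
    print(f"Round {i}: train on t <= {train_idx.max()}, test on t = {test_idx.min()}..{test_idx.max()}")

Every round trains strictly on earlier time steps than the ones it tests on.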

Group K-Fold

The problem: Sometimes samples belong to groups (e.g., multiple readings from same patient). If the same patient appears in both train and test, you're cheating.

The solution: Ensure all samples from one group stay together.

Data from 3 patients:
Patient A: [● ● ●]
Patient B: [● ● ● ●]
Patient C: [● ●]

WRONG (Regular K-Fold):
Train: [A● A● B● B● C●]
Test:  [A● B● B● C●]    ← Same patients in both!

RIGHT (Group K-Fold):
Train: [A● A● A● B● B● B● B●]
Test:  [C● C●]          ← Different patients!
from sklearn.model_selection import GroupKFold
import numpy as np

groups = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3])  # Patient IDs (assumes X and y have one row per entry here)
gkf = GroupKFold(n_splits=3)

for train_idx, test_idx in gkf.split(X, y, groups):
    # Each patient only appears in train OR test, never both
    print("Train patients:", sorted(set(groups[train_idx])), "| Test patients:", sorted(set(groups[test_idx])))

Choosing K (Number of Folds)

How many folds should you use?

K = 5 (Default)

Pros: Fast, good balance between bias and variance
Cons: Only 80% of the data is available for training in each round
When: Most situations, the default choice

K = 10 (More Reliable)

Pros: More reliable estimate, more training data per fold
Cons: Slower
When: When you need higher confidence

K = N (Leave-One-Out)

Pros: Maximum training data, minimum bias
Cons: Very slow, high variance
When: Very small datasets

Rule of Thumb

Dataset Size       Recommended K
< 100 samples      Leave-One-Out or 10-fold
100 - 1,000        10-fold
1,000 - 10,000     5-fold
> 10,000           5-fold or even 3-fold

Larger dataset → Fewer folds needed (each fold is already representative)
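
If you're unsure, it's cheap to just try a few values of K and compare. A hedged sketch on a synthetic dataset (with your own data, swap in your X, y, and model):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
model = LogisticRegression(max_iter=1000)

for k in (3, 5, 10):
    scores = cross_val_score(model, X, y, cv=k)
    print(f"K={k:2d}: {scores.mean():.1%} ± {scores.std():.1%}  ({k} model fits)")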


What Cross-Validation Tells You

The output of cross-validation is two numbers:

scores = cross_val_score(model, X, y, cv=5)
mean = scores.mean()   # Expected performance
std = scores.std()     # Spread (variability) in performance

The Mean

What it represents: Your model's expected performance on unseen data.

How to use it: Compare models. Select hyperparameters.

The Standard Deviation

What it represents: How much performance varies depending on the data.

How to use it: Assess reliability.

Model A: 85% ± 2%   ← Consistent, reliable
Model B: 85% ± 15%  ← Inconsistent, risky

Same mean, very different reliability!

Reading the Results

Score: 85.0% ± 4.0%

This means: "On new data, expect around 85% accuracy, typically within the range of 81% to 89%."

High mean, low std: Great model! Reliable.
High mean, high std: Good but inconsistent. Might fail on some data.
Low mean, low std: Consistently bad. Need a better model.
Low mean, high std: Unreliable AND bad. Something is wrong.
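
The arithmetic behind a line like "85.0% ± 4.0%" is just the mean and (twice the) standard deviation of the fold scores. Here's a tiny sketch that turns a score array into that kind of sentence (the 5% threshold for flagging high variance is an arbitrary illustration, not a standard rule):

import numpy as np

scores = np.array([0.85, 0.89, 0.81, 0.87, 0.83])  # example fold scores
mean, std = scores.mean(), scores.std()

print(f"Score: {mean:.1%} ± {2 * std:.1%}")
print(f"Typical range: {mean - 2 * std:.1%} to {mean + 2 * std:.1%}")
if std > 0.05:
    print("High variance: performance depends heavily on which data the model sees.")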


Cross-Validation for Hyperparameter Tuning

This is where cross-validation really shines.

The Problem

You want to find the best hyperparameters (e.g., regularization strength α).

Bad approach:

  1. Split data into train/test
  2. Try different α values
  3. Pick the α with best test score
  4. Report that test score

Why it's bad: You're using the test set to make decisions! The test score is now optimistic.

Good approach:

  1. For each α value, do cross-validation
  2. Pick the α with best average CV score
  3. Train final model on ALL training data with that α
  4. Evaluate ONCE on held-out test set

Grid Search with Cross-Validation

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import Ridge

# Hold out a final test set first; it is touched only once, at the very end
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define hyperparameter grid
param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
}

# Grid search with 5-fold CV
grid_search = GridSearchCV(
    Ridge(),
    param_grid,
    cv=5,
    scoring='r2',
    return_train_score=True
)

grid_search.fit(X_train, y_train)

print(f"Best alpha: {grid_search.best_params_['alpha']}")
print(f"Best CV score: {grid_search.best_score_:.3f}")

# Final evaluation on held-out test set
final_score = grid_search.score(X_test, y_test)
print(f"Test score: {final_score:.3f}")

Output:

Best alpha: 1.0
Best CV score: 0.847
Test score: 0.831

Notice: CV score (0.847) is slightly higher than test score (0.831). This is normal — CV score is slightly optimistic because you chose the best α based on it.


Nested Cross-Validation (Advanced)

For the most rigorous evaluation:

Outer loop: Evaluate model performance
  Inner loop: Select hyperparameters

┌─────────────────────────────────────────────────┐
│ OUTER LOOP (5 folds for evaluation)             │
│ ┌─────────────────────────────────────────────┐ │
│ │ INNER LOOP (5 folds for tuning)             │ │
│ │                                             │ │
│ │ For each outer fold:                        │ │
│ │   1. Hold out outer test fold               │ │
│ │   2. On remaining data, do inner CV         │ │
│ │   3. Find best hyperparameters              │ │
│ │   4. Train on all inner data                │ │
│ │   5. Evaluate on outer test fold            │ │
│ └─────────────────────────────────────────────┘ │
│                                                 │
│ Final: Average of 5 outer test scores           │
└─────────────────────────────────────────────────┘
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold
from sklearn.linear_model import Ridge

# Inner CV for hyperparameter tuning
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# GridSearchCV does the inner loop
grid_search = GridSearchCV(
    Ridge(),
    {'alpha': [0.01, 0.1, 1.0, 10.0]},
    cv=inner_cv
)

# Outer CV for evaluation
outer_cv = KFold(n_splits=5, shuffle=True, random_state=43)

# Nested CV
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv)

print(f"Nested CV: {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")

This gives you the most unbiased estimate of how your entire pipeline (including tuning) will perform on new data.


Common Mistakes

Mistake 1: Data Leakage Through Preprocessing

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# WRONG: Fit scaler on ALL data, then cross-validate
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # The scaler has already seen the test folds!
scores = cross_val_score(model, X_scaled, y, cv=5)

# RIGHT: Include preprocessing in the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
scores = cross_val_score(pipeline, X, y, cv=5)  # The scaler is fit only on the training folds

Mistake 2: Using Regular K-Fold for Time Series

# WRONG: Regular K-Fold leaks future information
scores = cross_val_score(model, X_time_series, y, cv=5)

# RIGHT: Time series split
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X_time_series, y, cv=tscv)

Mistake 3: Not Stratifying for Imbalanced Classification

# RISKY: relying on the integer-cv default with imbalanced classes
# (for classifiers, scikit-learn already stratifies here, but it does not shuffle)
scores = cross_val_score(model, X, y, cv=5)

# RIGHT: Be explicit. Stratified K-Fold with shuffling preserves class ratios in every fold
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)

Mistake 4: Reporting CV Score as Final Performance

# WRONG: "My model achieves 87% accuracy" (from CV)
scores = cross_val_score(model, X, y, cv=5)
print(f"Model accuracy: {scores.mean():.1%}")  # This is not test performance!

# RIGHT: Hold out a true test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Use CV only on X_train for model selection
# Report X_test score as final performance

Cross-Validation Cheat Sheet

Situation               Use This
General purpose         5-fold or 10-fold
Classification          StratifiedKFold
Time series             TimeSeriesSplit
Grouped data            GroupKFold
Very small data         LeaveOneOut
Hyperparameter tuning   GridSearchCV with CV
Rigorous evaluation     Nested CV

The Complete Cross-Validation Toolkit

# === Basic K-Fold ===
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)

# === Stratified K-Fold (Classification) ===
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)

# === Time Series Split ===
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=tscv)

# === Group K-Fold ===
from sklearn.model_selection import GroupKFold
gkf = GroupKFold(n_splits=5)
# scores = cross_val_score(model, X, y, cv=gkf, groups=groups)

# === Leave-One-Out ===
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)

# === Grid Search with CV ===
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X, y)

# === Nested CV ===
from sklearn.model_selection import cross_val_score
nested_scores = cross_val_score(grid, X, y, cv=5)  # 'grid' is the GridSearchCV object defined above

Key Takeaways

  1. Single train-test split is unreliable — You might get lucky or unlucky

  2. Cross-validation tests on every portion — No more flukes

  3. K-Fold splits data into K parts — Each takes a turn as test set

  4. Stratified K-Fold for classification — Preserves class ratios

  5. TimeSeriesSplit for temporal data — Respects time order

  6. CV gives mean AND std — Both matter for reliability

  7. Use CV for hyperparameter tuning — GridSearchCV does this automatically

  8. Include preprocessing in pipeline — Avoid data leakage


The Restaurant Critic Analogy Revisited

Critic Approach                ML Equivalent              Result
One visit, one dish            Single train-test split    Unreliable, might be a fluke
Five visits, multiple dishes   5-fold cross-validation    Reliable average rating
Every dish, every night        Leave-one-out              Exhaustive coverage (but exhausting)

The One-Sentence Summary

Cross-validation is visiting the restaurant five times instead of once — because one visit might catch the chef on a bad day.


What's Next?

Now that you understand cross-validation, you're ready for:

  • Hyperparameter Tuning — Grid search, random search, Bayesian optimization
  • Model Selection — Using CV to choose between algorithms
  • Feature Selection — Using CV to find important features
  • Ensemble Methods — Combining multiple cross-validated models

Follow me for the next article in this series!


Let's Connect!

If this made cross-validation finally click, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

What's your go-to CV strategy? 5-fold? 10-fold? I'm curious!


The difference between "my model gets 87% accuracy" and "my model reliably gets 85-89% accuracy" is cross-validation. One is a guess. The other is knowledge.


Share this with someone who's still using a single train-test split and wondering why their results change every time they run the code.

Happy learning!
