Sachin Kr. Rajput

Cross-Validation: Why Testing Your Model Once Is Like Judging a Restaurant by a Single Bite

The One-Line Summary: Cross-validation tests your model multiple times on different data splits, giving you a reliable performance estimate instead of a lucky (or unlucky) guess.


The Restaurant Critic's Dilemma

Imagine you're a restaurant critic.

You visit a new Italian place. You order the lasagna. It's... okay. Kind of bland. Maybe a 6/10.

You write your review: "Mediocre Italian. Skip it."

But here's what you didn't know:

  • The head chef was sick that night
  • You ordered their weakest dish
  • The ingredients were from a bad batch
  • It was their first week open

A month later, you hear everyone raving about this restaurant. You go back. You try different dishes. Different nights. The pasta is incredible. The risotto is life-changing. The tiramisu makes you weep with joy.

Your first visit was a fluke. One data point. One unlucky sample.


Now imagine a different critic.

She visits the restaurant five times. Different nights. Different dishes. Different occasions.

  • Visit 1: Lasagna — 6/10
  • Visit 2: Risotto — 9/10
  • Visit 3: Seafood pasta — 8/10
  • Visit 4: Osso buco — 9/10
  • Visit 5: Pizza — 7/10

Average: 7.8/10

She writes: "Excellent Italian with one weak spot. The lasagna needs work, but everything else shines."

That's a fair review. That's a reliable assessment.


The first critic did a single train-test split.

The second critic did cross-validation.

And this difference? It's the line between trusting your model and actually knowing how good it is.


The Problem with a Single Split

Let me show you why a single train-test split is dangerous.

The Standard Approach

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)

print(f"Model accuracy: {score:.1%}")  # 87%... or is it?

You get one number: 87%.

But here's the question nobody asks:

What if you got lucky?

What if, by pure chance, your test set happened to contain "easy" examples? What if your train set happened to contain the most informative samples?

Or what if you got unlucky?

What if your test set had all the weird outliers? What if your best training examples ended up in the test set?


The Horrifying Experiment

Watch this:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Generate dataset
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Do 10 different random splits
scores = []
for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=i  # Different split each time
    )
    model = LogisticRegression()
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    scores.append(score)
    print(f"Split {i+1}: {score:.1%}")

print(f"\nRange: {min(scores):.1%} to {max(scores):.1%}")
print(f"Difference: {(max(scores) - min(scores)):.1%}")

Output:

Split 1: 82.5%
Split 2: 90.0%
Split 3: 85.0%
Split 4: 77.5%
Split 5: 87.5%
Split 6: 92.5%
Split 7: 80.0%
Split 8: 85.0%
Split 9: 87.5%
Split 10: 82.5%

Range: 77.5% to 92.5%
Difference: 15.0%

The SAME model. The SAME data. Scores ranging from 77.5% to 92.5%.

If you happened to pick split 4, you'd think your model is mediocre.
If you happened to pick split 6, you'd think your model is excellent.

Neither is the truth. Both are flukes.


Enter Cross-Validation

Cross-validation solves this by testing on every possible portion of your data.

The Core Idea

Instead of one split, do multiple splits. Each time, a different portion is the test set. Average the results.

Traditional:
┌────────────────────────────────────────┐
│  Training Data (80%)    │  Test (20%)  │
└────────────────────────────────────────┘
                          → One score

Cross-Validation (5-fold):
┌────────────────────────────────────────┐
│ Test │                                  │  → Score 1
├──────┼──────────────────────────────────┤
│      │ Test │                           │  → Score 2
├──────┴──────┼───────────────────────────┤
│             │ Test │                    │  → Score 3
├─────────────┴──────┼────────────────────┤
│                    │ Test │             │  → Score 4
├────────────────────┴──────┼─────────────┤
│                           │    Test     │  → Score 5
└───────────────────────────┴─────────────┘
                          → Average of 5 scores

Every data point gets to be in the test set exactly once. Every data point gets to be in the training set multiple times.

No more lucky or unlucky splits. You test on EVERYTHING.
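
To see the contrast in practice, here's a minimal sketch that reuses the same kind of synthetic data as the "horrifying experiment" above, but replaces the ten random splits with a single 5-fold cross-validation call (exact numbers will vary slightly; the point is one averaged estimate instead of a 15-point spread):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Same synthetic dataset as the experiment above
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# One call, five train/test rounds, every sample tested exactly once
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"Fold scores: {np.round(scores, 3)}")
print(f"Estimate: {scores.mean():.1%} ± {scores.std():.1%}")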


K-Fold Cross-Validation

The most common type. Here's how it works:

Step by Step

Step 1: Divide your data into K equal parts (called "folds")

Data: [████████████████████████████████████████]
                        ↓
K = 5 folds:
      [████████] [████████] [████████] [████████] [████████]
       Fold 1     Fold 2     Fold 3     Fold 4     Fold 5

Step 2: For each fold:

  • Use that fold as the test set
  • Use all other folds as training set
  • Train the model
  • Evaluate on the test fold
  • Record the score

Step 3: Average all K scores
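
Before handing the bookkeeping to scikit-learn, it can help to see those three steps written out by hand. Here's a minimal sketch using plain NumPy index splitting; it assumes you already have X and y arrays and a scikit-learn style model with fit and score methods:

import numpy as np

def manual_k_fold(model, X, y, k=5, seed=42):
    # Step 1: shuffle the row indices and cut them into K roughly equal folds
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)

    scores = []
    for i in range(k):
        # Step 2: fold i is the test set, all other folds are the training set
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))

    # Step 3: average the K scores
    return np.mean(scores), np.std(scores)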


Visual Walkthrough (5-Fold)

Round 1:
Train:            [████████] [████████] [████████] [████████]
Test:  [▓▓▓▓▓▓▓▓]
Score: 85%

Round 2:
Train: [████████]            [████████] [████████] [████████]
Test:             [▓▓▓▓▓▓▓▓]
Score: 88%

Round 3:
Train: [████████] [████████]            [████████] [████████]
Test:                        [▓▓▓▓▓▓▓▓]
Score: 82%

Round 4:
Train: [████████] [████████] [████████]            [████████]
Test:                                   [▓▓▓▓▓▓▓▓]
Score: 87%

Round 5:
Train: [████████] [████████] [████████] [████████]
Test:                                              [▓▓▓▓▓▓▓▓]
Score: 84%

Final: Average = (85 + 88 + 82 + 87 + 84) / 5 = 85.2% ± 2.1%

Now you know your model is about 85%, not "somewhere between 77% and 92%."


Cross-Validation in Code

It's surprisingly simple.

Basic K-Fold

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)

print(f"Scores: {scores}")
print(f"Mean: {scores.mean():.3f}")
print(f"Std: {scores.std():.3f}")
print(f"Result: {scores.mean():.1%} ± {scores.std()*2:.1%}")

Output:

Scores: [0.85 0.875 0.825 0.85 0.875]
Mean: 0.855
Std: 0.020
Result: 85.5% ± 4.0%

One line of code. Five evaluations. A reliable estimate.


More Control with KFold

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    model = LogisticRegression()
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    scores.append(score)

    print(f"Fold {fold+1}: {score:.1%}")

print(f"\nMean: {np.mean(scores):.1%} ± {np.std(scores)*2:.1%}")

Types of Cross-Validation

Different situations call for different approaches.

K-Fold (Standard)

Best for: Most situations
K: Usually 5 or 10
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

Stratified K-Fold (For Classification)

The problem: Regular K-Fold might put all examples of one class in the same fold.

The solution: Stratified K-Fold ensures each fold has the same proportion of classes.

Original data: 90% class A, 10% class B

Regular K-Fold might create:
Fold 1: 95% A, 5% B   ← Unbalanced!
Fold 2: 80% A, 20% B  ← Different distribution!

Stratified K-Fold guarantees:
Fold 1: 90% A, 10% B  ← Same as original
Fold 2: 90% A, 10% B  ← Same as original
Fold 3: 90% A, 10% B  ← Same as original
from sklearn.model_selection import StratifiedKFold

skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Use with cross_val_score
scores = cross_val_score(model, X, y, cv=skfold)

Rule: Always use Stratified K-Fold for classification.
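
If you want to see that guarantee with your own eyes, here's a small sketch with a deliberately imbalanced, sorted toy label array. It prints the fraction of the minority class in each test fold for both splitters:

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels: 90 samples of class 0, then 10 of class 1 (sorted = worst case for plain K-Fold)
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features don't matter for how the folds are cut

for name, splitter in [("KFold", KFold(n_splits=5)),
                       ("StratifiedKFold", StratifiedKFold(n_splits=5))]:
    ratios = [y[test_idx].mean() for _, test_idx in splitter.split(X, y)]
    print(f"{name}: fraction of class 1 per test fold = {np.round(ratios, 2)}")

With plain K-Fold on this sorted data, four of the five test folds contain no class-1 samples at all; StratifiedKFold keeps every fold at roughly 10%.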


Leave-One-Out (LOO)

The idea: Each fold contains just ONE sample.

If you have 100 samples, you do 100 rounds of training. Each round, train on 99 samples, test on 1.

Data: [●] [●] [●] [●] [●] ... 100 samples

Round 1:   Test:[●]  Train:[● ● ● ● ● ... 99 samples]
Round 2:   Train:[●] Test:[●]  Train:[● ● ● ... 98 samples]
Round 3:   Train:[● ●] Test:[●]  Train:[● ● ... 97 samples]
...
Round 100: Train:[● ● ● ● ● ... 99 samples] Test:[●]
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)

Pros: Maximum use of data, lowest bias
Cons: Extremely slow (N rounds!), high variance

Use when: You have very small datasets (< 100 samples)
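
The cost of LOO is easy to underestimate: N samples means N separate model fits. A quick sketch (the array here is just a placeholder so the splitters have something to count):

import numpy as np
from sklearn.model_selection import LeaveOneOut, KFold

X = np.zeros((200, 5))  # 200 samples of anything

print("LOO fits:   ", LeaveOneOut().get_n_splits(X))      # 200, one fit per sample
print("5-fold fits:", KFold(n_splits=5).get_n_splits(X))   # 5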


Time Series Split

The problem: For time series data, you can't use future data to predict the past. Regular K-Fold might leak future information.

The solution: Always train on past, test on future.

Regular K-Fold (WRONG for time series):
[Test] [Train] [Train] [Train] [Train]
  ↑ The model is trained on the future and tested on the past. Data leakage!

Time Series Split (CORRECT):
Round 1: [Train] [Test]
Round 2: [Train] [Train] [Test]
Round 3: [Train] [Train] [Train] [Test]
Round 4: [Train] [Train] [Train] [Train] [Test]

Always: Past → Future
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=tscv)
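
If you print the splits, you can watch the "past → future" pattern directly. A quick sketch on a toy array where the row index stands in for time:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered samples

for i, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=4).split(X), start=1):
    print(f"Round {i}: train on t <= {train_idx.max()}, test on t = {test_idx.min()}..{test_idx.max()}")

Every round trains strictly on earlier time steps than the ones it tests on.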

Group K-Fold

The problem: Sometimes samples belong to groups (e.g., multiple readings from same patient). If the same patient appears in both train and test, you're cheating.

The solution: Ensure all samples from one group stay together.

Data from 3 patients:
Patient A: [● ● ●]
Patient B: [● ● ● ●]
Patient C: [● ●]

WRONG (Regular K-Fold):
Train: [A● A● B● B● C●]
Test:  [A● B● B● C●]    ← Same patients in both!

RIGHT (Group K-Fold):
Train: [A● A● A● B● B● B● B●]
Test:  [C● C●]          ← Different patients!
from sklearn.model_selection import GroupKFold
import numpy as np

groups = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3])  # Patient IDs (assumes X and y have one row per entry here)
gkf = GroupKFold(n_splits=3)

for train_idx, test_idx in gkf.split(X, y, groups):
    # Each patient only appears in train OR test, never both
    print("Train patients:", sorted(set(groups[train_idx])), "| Test patients:", sorted(set(groups[test_idx])))

Choosing K (Number of Folds)

How many folds should you use?

K = 5 (Default)

Pros: Fast, good balance between bias and variance
Cons: Only 80% of the data is available for training in each round
When: Most situations, the default choice

K = 10 (More Reliable)

Pros: More reliable estimate, more training data per fold
Cons: Slower
When: When you need higher confidence

K = N (Leave-One-Out)

Pros: Maximum training data, minimum bias
Cons: Very slow, high variance
When: Very small datasets

Rule of Thumb

Dataset Size       Recommended K
< 100 samples      Leave-One-Out or 10-fold
100 - 1,000        10-fold
1,000 - 10,000     5-fold
> 10,000           5-fold or even 3-fold

Larger dataset → Fewer folds needed (each fold is already representative)
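
If you're unsure, it's cheap to just try a few values of K and compare. A hedged sketch on a synthetic dataset (with your own data, swap in your X, y, and model):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
model = LogisticRegression(max_iter=1000)

for k in (3, 5, 10):
    scores = cross_val_score(model, X, y, cv=k)
    print(f"K={k:2d}: {scores.mean():.1%} ± {scores.std():.1%}  ({k} model fits)")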


What Cross-Validation Tells You

The output of cross-validation is two numbers:

scores = cross_val_score(model, X, y, cv=5)
mean = scores.mean()   # Expected performance
std = scores.std()     # Spread (variability) in performance

The Mean

What it represents: Your model's expected performance on unseen data.

How to use it: Compare models. Select hyperparameters.

The Standard Deviation

What it represents: How much performance varies depending on the data.

How to use it: Assess reliability.

Model A: 85% ± 2%   ← Consistent, reliable
Model B: 85% ± 15%  ← Inconsistent, risky

Same mean, very different reliability!

Reading the Results

Score: 85.0% ± 4.0%

This means: "On new data, expect around 85% accuracy, typically within the range of 81% to 89%."

High mean, low std: Great model! Reliable.
High mean, high std: Good but inconsistent. Might fail on some data.
Low mean, low std: Consistently bad. Need a better model.
Low mean, high std: Unreliable AND bad. Something is wrong.
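
The arithmetic behind a line like "85.0% ± 4.0%" is just the mean and (twice the) standard deviation of the fold scores. Here's a tiny sketch that turns a score array into that kind of sentence (the 5% threshold for flagging high variance is an arbitrary illustration, not a standard rule):

import numpy as np

scores = np.array([0.85, 0.89, 0.81, 0.87, 0.83])  # example fold scores
mean, std = scores.mean(), scores.std()

print(f"Score: {mean:.1%} ± {2 * std:.1%}")
print(f"Typical range: {mean - 2 * std:.1%} to {mean + 2 * std:.1%}")
if std > 0.05:
    print("High variance: performance depends heavily on which data the model sees.")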


Cross-Validation for Hyperparameter Tuning

This is where cross-validation really shines.

The Problem

You want to find the best hyperparameters (e.g., regularization strength α).

Bad approach:

  1. Split data into train/test
  2. Try different α values
  3. Pick the α with best test score
  4. Report that test score

Why it's bad: You're using the test set to make decisions! The test score is now optimistic.

Good approach:

  1. For each α value, do cross-validation
  2. Pick the α with best average CV score
  3. Train final model on ALL training data with that α
  4. Evaluate ONCE on held-out test set

Grid Search with Cross-Validation

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import Ridge

# Hold out a final test set first; it is touched only once, at the very end
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define hyperparameter grid
param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
}

# Grid search with 5-fold CV
grid_search = GridSearchCV(
    Ridge(),
    param_grid,
    cv=5,
    scoring='r2',
    return_train_score=True
)

grid_search.fit(X_train, y_train)

print(f"Best alpha: {grid_search.best_params_['alpha']}")
print(f"Best CV score: {grid_search.best_score_:.3f}")

# Final evaluation on held-out test set
final_score = grid_search.score(X_test, y_test)
print(f"Test score: {final_score:.3f}")

Output:

Best alpha: 1.0
Best CV score: 0.847
Test score: 0.831

Notice: CV score (0.847) is slightly higher than test score (0.831). This is normal — CV score is slightly optimistic because you chose the best α based on it.


Nested Cross-Validation (Advanced)

For the most rigorous evaluation:

Outer loop: Evaluate model performance
  Inner loop: Select hyperparameters

┌─────────────────────────────────────────────────┐
│ OUTER LOOP (5 folds for evaluation)             │
│ ┌─────────────────────────────────────────────┐ │
│ │ INNER LOOP (5 folds for tuning)             │ │
│ │                                             │ │
│ │ For each outer fold:                        │ │
│ │   1. Hold out outer test fold               │ │
│ │   2. On remaining data, do inner CV         │ │
│ │   3. Find best hyperparameters              │ │
│ │   4. Train on all inner data                │ │
│ │   5. Evaluate on outer test fold            │ │
│ └─────────────────────────────────────────────┘ │
│                                                 │
│ Final: Average of 5 outer test scores           │
└─────────────────────────────────────────────────┘
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold
from sklearn.linear_model import Ridge

# Inner CV for hyperparameter tuning
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# GridSearchCV does the inner loop
grid_search = GridSearchCV(
    Ridge(),
    {'alpha': [0.01, 0.1, 1.0, 10.0]},
    cv=inner_cv
)

# Outer CV for evaluation
outer_cv = KFold(n_splits=5, shuffle=True, random_state=43)

# Nested CV
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv)

print(f"Nested CV: {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")

This gives you the most unbiased estimate of how your entire pipeline (including tuning) will perform on new data.


Common Mistakes

Mistake 1: Data Leakage Through Preprocessing

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# WRONG: Fit scaler on ALL data, then cross-validate
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # The scaler has already seen the test folds!
scores = cross_val_score(model, X_scaled, y, cv=5)

# RIGHT: Include preprocessing in the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
scores = cross_val_score(pipeline, X, y, cv=5)  # The scaler is fit only on the training folds

Mistake 2: Using Regular K-Fold for Time Series

# WRONG: Regular K-Fold leaks future information
scores = cross_val_score(model, X_time_series, y, cv=5)

# RIGHT: Time series split
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X_time_series, y, cv=tscv)

Mistake 3: Not Stratifying for Imbalanced Classification

# RISKY: relying on the integer-cv default with imbalanced classes
# (for classifiers, scikit-learn already stratifies here, but it does not shuffle)
scores = cross_val_score(model, X, y, cv=5)

# RIGHT: Be explicit. Stratified K-Fold with shuffling preserves class ratios in every fold
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)

Mistake 4: Reporting CV Score as Final Performance

# WRONG: "My model achieves 87% accuracy" (from CV)
scores = cross_val_score(model, X, y, cv=5)
print(f"Model accuracy: {scores.mean():.1%}")  # This is not test performance!

# RIGHT: Hold out a true test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Use CV only on X_train for model selection
# Report X_test score as final performance

Cross-Validation Cheat Sheet

Situation               Use This
General purpose         5-fold or 10-fold
Classification          StratifiedKFold
Time series             TimeSeriesSplit
Grouped data            GroupKFold
Very small data         LeaveOneOut
Hyperparameter tuning   GridSearchCV with CV
Rigorous evaluation     Nested CV

The Complete Cross-Validation Toolkit

# === Basic K-Fold ===
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)

# === Stratified K-Fold (Classification) ===
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)

# === Time Series Split ===
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=tscv)

# === Group K-Fold ===
from sklearn.model_selection import GroupKFold
gkf = GroupKFold(n_splits=5)
# scores = cross_val_score(model, X, y, cv=gkf, groups=groups)

# === Leave-One-Out ===
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)

# === Grid Search with CV ===
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X, y)

# === Nested CV ===
from sklearn.model_selection import cross_val_score
nested_scores = cross_val_score(grid, X, y, cv=5)  # 'grid' is the GridSearchCV object defined above

Key Takeaways

  1. Single train-test split is unreliable — You might get lucky or unlucky

  2. Cross-validation tests on every portion — No more flukes

  3. K-Fold splits data into K parts — Each takes a turn as test set

  4. Stratified K-Fold for classification — Preserves class ratios

  5. TimeSeriesSplit for temporal data — Respects time order

  6. CV gives mean AND std — Both matter for reliability

  7. Use CV for hyperparameter tuning — GridSearchCV does this automatically

  8. Include preprocessing in pipeline — Avoid data leakage


The Restaurant Critic Analogy Revisited

Critic Approach                ML Equivalent              Result
One visit, one dish            Single train-test split    Unreliable, might be a fluke
Five visits, multiple dishes   5-fold cross-validation    Reliable average rating
Every dish, every night        Leave-one-out              Exhaustive coverage (but exhausting)

The One-Sentence Summary

Cross-validation is visiting the restaurant five times instead of once — because one visit might catch the chef on a bad day.


What's Next?

Now that you understand cross-validation, you're ready for:

  • Hyperparameter Tuning — Grid search, random search, Bayesian optimization
  • Model Selection — Using CV to choose between algorithms
  • Feature Selection — Using CV to find important features
  • Ensemble Methods — Combining multiple cross-validated models

Follow me for the next article in this series!


Let's Connect!

If this made cross-validation finally click, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

What's your go-to CV strategy? 5-fold? 10-fold? I'm curious!


The difference between "my model gets 87% accuracy" and "my model reliably gets 85-89% accuracy" is cross-validation. One is a guess. The other is knowledge.


Share this with someone who's still using a single train-test split and wondering why their results change every time they run the code.

Happy learning!
