K-Fold Cross-Validation: The Comedian Who Tested Jokes at Only One Comedy Club and Bombed Everywhere Else

Sachin Kr. Rajput

The One-Line Summary: K-fold cross-validation splits your data into K parts, trains K times (each time using a different part as the test set), and averages the results — giving you a reliable performance estimate instead of gambling on a single lucky/unlucky split.


The Comedian Who Bombed on Tour

Marcus had been perfecting his stand-up routine for months. He tested it at his local comedy club, The Laughing Llama.

Test night at The Laughing Llama:

Joke 1: "Why do programmers prefer dark mode?"
        → 87% laughed

Joke 2: "My code works on my machine..."
        → 92% laughed

Joke 3: "There are only 10 types of people..."
        → 95% laughed

OVERALL: 91% laugh rate! CRUSHING IT! 🎤

Marcus was confident. He booked a nationwide tour.


The Tour: Reality Check

Club 2 (Sports Bar):      
  Audience: Sports fans
  → 34% laugh rate. Crickets. 🦗

Club 3 (College Town):
  Audience: Students
  → 78% laugh rate. Pretty good!

Club 4 (Retirement Community):
  Audience: Seniors
  → 12% laugh rate. "What's dark mode?" 👴

Club 5 (Tech Conference):
  Audience: Developers
  → 97% laugh rate. Standing ovation! 🎉

Tour Average: 55% laugh rate


What Went Wrong?

Marcus tested his act on ONE audience (The Laughing Llama = tech-savvy locals) and assumed it would generalize everywhere.

But that one club wasn't representative. It was accidentally perfect for his material.

The Laughing Llama was his "lucky split."


What Marcus SHOULD Have Done

Test at K different clubs BEFORE the tour:

K-FOLD COMEDY VALIDATION (K=5):

Fold 1: Test at Sports Bar        → 34%
Fold 2: Test at College Town      → 78%
Fold 3: Test at Retirement Home   → 12%
Fold 4: Test at Tech Conference   → 97%
Fold 5: Test at Laughing Llama    → 91%

Average: 62.4% ± 33.5%

INSIGHT: "My act works great for SOME audiences 
         but bombs for others. High variance!"

Now Marcus KNOWS:

  • His true expected performance (~62%, not 91%)
  • His act is inconsistent (±33% variance)
  • He needs to diversify his material

K-Fold Cross-Validation Explained

The same logic applies to machine learning:

SINGLE TRAIN/TEST SPLIT (Dangerous):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

┌─────────────────────────────────┬─────────┐
│         TRAINING (80%)          │TEST(20%)│
└─────────────────────────────────┴─────────┘

One evaluation. One number. Could be lucky. Could be unlucky.
You'll never know.


K-FOLD CROSS-VALIDATION (K=5):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Fold 1: ┌──────┬────────────────────────────────┐
        │ TEST │         TRAINING               │
        └──────┴────────────────────────────────┘

Fold 2: ┌──────┬──────┬─────────────────────────┐
        │TRAIN │ TEST │       TRAINING          │
        └──────┴──────┴─────────────────────────┘

Fold 3: ┌────────────┬──────┬───────────────────┐
        │  TRAINING  │ TEST │     TRAINING      │
        └────────────┴──────┴───────────────────┘

Fold 4: ┌──────────────────┬──────┬─────────────┐
        │     TRAINING     │ TEST │  TRAINING   │
        └──────────────────┴──────┴─────────────┘

Fold 5: ┌────────────────────────────────┬──────┐
        │           TRAINING             │ TEST │
        └────────────────────────────────┴──────┘

Five evaluations. Five numbers. Average them.
MUCH more reliable!
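
Want to see how much a single split can swing? Here's a quick sketch (my own illustration, mirroring the dataset and model used in the examples below) that scores the same model on ten different random splits; exact numbers will depend on your data and seeds:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Same data, same model -- only the random split changes
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

split_scores = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    split_scores.append(model.score(X_test, y_test))

print(f"Min: {min(split_scores):.3f}  Max: {max(split_scores):.3f}  "
      f"Spread: {max(split_scores) - min(split_scores):.3f}")
# The spread across splits is exactly the uncertainty a single split hides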

The Algorithm

K-FOLD CROSS-VALIDATION:

1. Shuffle the data (optional but recommended)
2. Split data into K equal-sized "folds"
3. For i = 1 to K:
   a. Use fold i as TEST set
   b. Use all other folds as TRAINING set
   c. Train model on training set
   d. Evaluate on test set
   e. Record the score
4. Return: mean(scores), std(scores)

Key insight: Every data point gets to be in the test set exactly ONCE.
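
To make those steps concrete without any library helpers, here's a minimal from-scratch sketch (my own illustration; the function name k_fold_scores is mine, and it assumes NumPy arrays plus a scikit-learn-style estimator with fit/score):

import numpy as np
from sklearn.base import clone

def k_fold_scores(model, X, y, k=5, seed=42):
    """Minimal K-fold CV: shuffle, split into k folds, rotate the test fold."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))        # Step 1: shuffle
    folds = np.array_split(indices, k)       # Step 2: K (nearly) equal folds

    scores = []
    for i, test_idx in enumerate(folds):     # Step 3: each fold is the test set exactly once
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        fold_model = clone(model)            # fresh, untrained copy for this fold
        fold_model.fit(X[train_idx], y[train_idx])
        scores.append(fold_model.score(X[test_idx], y[test_idx]))

    return np.mean(scores), np.std(scores)   # Step 4: mean and std of the K scores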


Code: Basic K-Fold

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Create dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Method 1: Manual K-Fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    scores.append(score)
    print(f"Fold {fold+1}: Accuracy = {score:.4f}")

print(f"\nMean: {np.mean(scores):.4f} ± {np.std(scores):.4f}")


# Method 2: One-liner with cross_val_score
model = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

print(f"\nCross-val scores: {scores.round(4)}")
print(f"Mean: {scores.mean():.4f} ± {scores.std():.4f}")

Output:

Fold 1: Accuracy = 0.9150
Fold 2: Accuracy = 0.9100
Fold 3: Accuracy = 0.9300
Fold 4: Accuracy = 0.9050
Fold 5: Accuracy = 0.9250

Mean: 0.9170 ± 0.0094

Cross-val scores: [0.915  0.91   0.93   0.905  0.925]
Mean: 0.9170 ± 0.0094

How to Choose K?

This is THE question everyone asks. Here's the complete guide:

The Tradeoff

LOW K (e.g., K=2):
├── Less training data per fold (each model sees only 50% of the data)
├── High bias (models trained on half the data underestimate true performance)
├── Low variance (large test sets make each fold score stable)
├── Fast (only 2 training runs)
└── PROBLEM: Each model trains on far less data than your final model will

HIGH K (e.g., K=N, leave-one-out):
├── Maximum training data (N-1 samples per fold)
├── Low bias (training on almost everything)
├── High variance (N evaluations, each scored on a single sample!)
├── Very slow (N training runs)
└── PROBLEM: The training sets are nearly identical, so the N estimates are highly correlated

The Standard Choices

K=5: The Default

# Most common choice. Good balance.
cv = KFold(n_splits=5)

# Why 5?
# - Each fold uses 80% for training (enough data)
# - 5 evaluations (enough to estimate variance)
# - Not too slow
# - Empirically works well across many problems

Use K=5 when:

  • You have a moderate dataset (1,000 - 100,000 samples)
  • You're not sure what to use
  • Training isn't too expensive

K=10: The Thorough Choice

# More evaluations, slightly more reliable
cv = KFold(n_splits=10)

# Why 10?
# - Each fold uses 90% for training
# - 10 evaluations for better variance estimate
# - Still reasonable computation time

Use K=10 when:

  • Your dataset is on the smaller side (K=10 keeps 90% of it available for training each time)
  • You want more confidence in your estimate
  • Computation time isn't a concern

K=3: The Quick Choice

# Fast, but less reliable
cv = KFold(n_splits=3)

# Why 3?
# - Only 3 training runs (fast!)
# - Each fold uses 67% for training
# - Good for quick experiments

Use K=3 when:

  • You have limited compute resources
  • Training is expensive (deep learning)
  • You're doing rapid iteration
  • You have a huge dataset (K=3 still gives millions of test samples)

K=N (Leave-One-Out): The Extreme

from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)

# N training runs, each testing on 1 sample!

Use LOO when:

  • You have a very small dataset (<100 samples)
  • Each sample is precious
  • Training is fast (simple models)

DON'T use LOO when:

  • You have a large dataset (N training runs is too slow)
  • You're using a high-variance model (single-sample scores make the estimate very noisy)

The Decision Framework

HOW MANY SAMPLES DO YOU HAVE?

< 100 samples:
    └─► Leave-One-Out (LOO) or K=10 with stratification

100 - 1,000 samples:
    └─► K=10 (each model trains on 90% of the data; every sample is still tested once)

1,000 - 100,000 samples:
    └─► K=5 (standard choice)

> 100,000 samples:
    └─► K=3 or even K=2 (each fold has plenty of data)
    └─► Or: Single stratified split might be okay!


HOW EXPENSIVE IS TRAINING?

Cheap (linear models, small trees):
    └─► K=10 or even LOO

Moderate (random forests, gradient boosting):
    └─► K=5

Expensive (deep learning, large models):
    └─► K=3 or even single holdout with careful stratification
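
If you prefer the framework as code, here's a tiny helper that just encodes the chart above (the name suggest_k and the exact thresholds are my own illustrative choices, not anything from scikit-learn):

def suggest_k(n_samples, training_cost="moderate"):
    """Rough heuristic mirroring the chart above -- adjust to your problem."""
    if training_cost == "expensive":
        return 3                      # deep learning / large models
    if n_samples < 100:
        return n_samples              # leave-one-out territory
    if n_samples < 1_000:
        return 10
    if n_samples <= 100_000:
        return 5
    return 3                          # huge data: each fold is still plenty big

print(suggest_k(500))         # 10
print(suggest_k(50_000))      # 5
print(suggest_k(2_000_000))   # 3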

Comparing Different K Values

import numpy as np
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import time

# Create dataset
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

model = LogisticRegression(max_iter=1000)

k_values = [2, 3, 5, 10, 20]

print("K-Fold Comparison")
print("="*60)
print(f"{'K':<5} {'Mean':<10} {'Std':<10} {'Time (s)':<10} {'Training %':<12}")
print("-"*60)

for k in k_values:
    start = time.time()
    cv = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
    elapsed = time.time() - start

    train_pct = (k-1)/k * 100
    print(f"{k:<5} {scores.mean():<10.4f} {scores.std():<10.4f} {elapsed:<10.3f} {train_pct:<12.1f}%")

# Leave-One-Out for comparison
print("-"*60)
start = time.time()
loo = LeaveOneOut()
scores_loo = cross_val_score(model, X, y, cv=loo, scoring='accuracy')
elapsed = time.time() - start
print(f"{'LOO':<5} {scores_loo.mean():<10.4f} {scores_loo.std():<10.4f} {elapsed:<10.3f} {99.8:<12.1f}%")

Output:

K-Fold Comparison
============================================================
K     Mean       Std        Time (s)   Training %  
------------------------------------------------------------
2     0.8620     0.0080     0.012      50.0%
3     0.8693     0.0125     0.015      66.7%
5     0.8740     0.0167     0.023      80.0%
10    0.8760     0.0312     0.042      90.0%
20    0.8750     0.0445     0.081      95.0%
------------------------------------------------------------
LOO   0.8760     0.3297     2.341      99.8%

Observations:

K     Mean Accuracy   Std      Insight
2     0.862           0.008    Biased low (only 50% training data)
5     0.874           0.017    Good balance
10    0.876           0.031    Slightly better mean, higher variance
LOO   0.876           0.330    Same mean, HUGE variance per fold

Stratified K-Fold (For Classification)

Critical for imbalanced data!

from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced dataset: 5% positive
X, y = make_classification(
    n_samples=1000, weights=[0.95, 0.05], random_state=42
)

# Regular KFold - DANGEROUS!
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Stratified KFold - SAFE!
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Compare class distribution in each fold
print("Regular KFold - Positive % in each test fold:")
for fold, (train_idx, test_idx) in enumerate(kf.split(X, y)):
    pct = y[test_idx].mean() * 100
    print(f"  Fold {fold+1}: {pct:.1f}%")

print("\nStratified KFold - Positive % in each test fold:")
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    pct = y[test_idx].mean() * 100
    print(f"  Fold {fold+1}: {pct:.1f}%")

Output:

Regular KFold - Positive % in each test fold:
  Fold 1: 3.5%
  Fold 2: 6.5%
  Fold 3: 4.0%
  Fold 4: 7.0%   ← 40% more than expected!
  Fold 5: 4.0%

Stratified KFold - Positive % in each test fold:
  Fold 1: 5.0%
  Fold 2: 5.0%
  Fold 3: 5.0%
  Fold 4: 5.0%
  Fold 5: 5.0%   ← All exactly 5%!

Rule: Always use StratifiedKFold for classification!


Repeated K-Fold (More Robust)

Run K-fold multiple times with different random shuffles:

from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# 5-fold CV, repeated 10 times = 50 total evaluations
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)

scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')

print(f"50 evaluations (5-fold × 10 repeats)")
print(f"Mean: {scores.mean():.4f}")
print(f"Std:  {scores.std():.4f}")
print(f"95% CI: [{scores.mean() - 1.96*scores.std():.4f}, {scores.mean() + 1.96*scores.std():.4f}]")

Output:

50 evaluations (5-fold × 10 repeats)
Mean: 0.8742
Std:  0.0189
95% CI: [0.8372, 0.9112]

Use Repeated K-Fold when:

  • You need a confidence interval
  • You want to reduce variance
  • Computation allows it

Time Series Cross-Validation

Regular K-fold BREAKS time series! (Future data leaks into training)

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# DON'T DO THIS for time series:
# kf = KFold(n_splits=5, shuffle=True)  # ❌ Shuffling breaks time!

# DO THIS:
tscv = TimeSeriesSplit(n_splits=5)

# Visualization of splits
X = np.arange(100).reshape(-1, 1)  # 100 time points
y = np.random.randn(100)

print("Time Series Cross-Validation Splits:")
print("="*60)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"Fold {fold+1}:")
    print(f"  Train: indices {train_idx[0]:3d} to {train_idx[-1]:3d} ({len(train_idx)} samples)")
    print(f"  Test:  indices {test_idx[0]:3d} to {test_idx[-1]:3d} ({len(test_idx)} samples)")

Output:

Time Series Cross-Validation Splits:
============================================================
Fold 1:
  Train: indices   0 to  19 (20 samples)
  Test:  indices  20 to  35 (16 samples)
Fold 2:
  Train: indices   0 to  35 (36 samples)
  Test:  indices  36 to  51 (16 samples)
Fold 3:
  Train: indices   0 to  51 (52 samples)
  Test:  indices  52 to  67 (16 samples)
Fold 4:
  Train: indices   0 to  67 (68 samples)
  Test:  indices  68 to  83 (16 samples)
Fold 5:
  Train: indices   0 to  83 (84 samples)
  Test:  indices  84 to  99 (16 samples)

Key difference: Training ALWAYS uses past data, test ALWAYS uses future data.

Time Series CV Visualization:

Time →→→→→→→→→→→→→→→→→→→→→→→→→→→→→→→→→→→

Fold 1: [====TRAIN====][TEST]
Fold 2: [======TRAIN======][TEST]
Fold 3: [========TRAIN========][TEST]
Fold 4: [==========TRAIN==========][TEST]
Fold 5: [============TRAIN============][TEST]

Training grows, but NEVER includes future test data!

Nested Cross-Validation (For Hyperparameter Tuning)

Problem: If you tune hyperparameters using CV, then evaluate using the same CV, you're overfitting to your CV splits!

Solution: Nested CV — outer loop for evaluation, inner loop for tuning.

from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
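
# X, y below are a classification dataset (e.g., the 1000-sample set from the
# basic K-fold example), not the 100-point time-series arrays from the last section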

# Outer CV: for final performance estimation
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Inner CV: for hyperparameter tuning
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Model with hyperparameters to tune
model = RandomForestClassifier(random_state=42)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, None]
}

# GridSearchCV handles the inner loop
grid_search = GridSearchCV(
    model, param_grid, cv=inner_cv, scoring='accuracy', n_jobs=-1
)

# cross_val_score handles the outer loop
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv, scoring='accuracy')

print("Nested Cross-Validation Results:")
print(f"Scores: {nested_scores.round(4)}")
print(f"Mean:   {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")

Output:

Nested Cross-Validation Results:
Scores: [0.9250 0.9150 0.9400 0.9100 0.9350]
Mean:   0.9250 ± 0.0114

NESTED CV STRUCTURE:

OUTER LOOP (5 folds) - Estimates true performance
├── Fold 1: [=====Train=====][Test]
│   └── INNER LOOP (3 folds) - Tunes hyperparameters
│       ├── Inner Fold 1: tune
│       ├── Inner Fold 2: tune
│       └── Inner Fold 3: tune
│       → Best params found, evaluate on outer test
│
├── Fold 2: [Test][=====Train=====]
│   └── INNER LOOP (3 folds) - Tunes hyperparameters
│       → Best params found, evaluate on outer test
│
... (repeat for all outer folds)

Final: Average of 5 outer test scores = unbiased estimate!

Common Mistakes

Mistake 1: Using KFold Instead of StratifiedKFold for Classification

# ❌ WRONG for classification
from sklearn.model_selection import KFold
cv = KFold(n_splits=5)

# ✅ RIGHT for classification
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=5)

# Or just use cv=5 in cross_val_score — it auto-detects!
scores = cross_val_score(classifier, X, y, cv=5)  # Uses StratifiedKFold

Mistake 2: Shuffling Time Series Data

# ❌ WRONG for time series
cv = KFold(n_splits=5, shuffle=True)  # Shuffling breaks temporal order!

# ✅ RIGHT for time series
from sklearn.model_selection import TimeSeriesSplit
cv = TimeSeriesSplit(n_splits=5)

Mistake 3: Data Leakage in Preprocessing

# ❌ WRONG: Fit scaler on ALL data before CV
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)  # Leaks test info!
scores = cross_val_score(model, X_scaled, y, cv=5)

# ✅ RIGHT: Use Pipeline to fit scaler inside CV
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])
scores = cross_val_score(pipe, X, y, cv=5)  # Scaler fits only on train folds!

Mistake 4: Tuning AND Evaluating on Same CV

# ❌ WRONG: Optimistic estimate!
grid_search = GridSearchCV(model, params, cv=5)
grid_search.fit(X, y)
print(f"Best CV score: {grid_search.best_score_}")  # Overfitted to these folds!

# ✅ RIGHT: Use nested CV for unbiased estimate
nested_scores = cross_val_score(grid_search, X, y, cv=5)
print(f"True expected score: {nested_scores.mean()}")

Mistake 5: Ignoring Variance

# ❌ WRONG: Only reporting mean
scores = cross_val_score(model, X, y, cv=5)
print(f"Accuracy: {scores.mean():.4f}")  # What about variance?

# ✅ RIGHT: Report mean AND standard deviation
print(f"Accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
print(f"Scores: {scores.round(4)}")  # Show all scores!

# If std is high, your estimate is unreliable!

Complete Example: The Right Way

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import (
    StratifiedKFold, cross_val_score, cross_validate,
    GridSearchCV, train_test_split
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score

# Create imbalanced dataset
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.9, 0.1],
    random_state=42
)

print("="*60)
print("COMPLETE K-FOLD CROSS-VALIDATION WORKFLOW")
print("="*60)

# Step 1: Hold out a TRUE test set (never touched during CV!)
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"\n1. Data Split:")
print(f"   Development set: {len(X_dev)} samples")
print(f"   Final test set:  {len(X_test)} samples (untouched until end!)")

# Step 2: Define CV strategy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print(f"\n2. CV Strategy: 5-fold Stratified")

# Step 3: Create pipeline (preprocessing + model)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(random_state=42))
])

# Step 4: Cross-validate with multiple metrics
scoring = {
    'accuracy': 'accuracy',
    'f1': 'f1',
    'precision': 'precision',
    'recall': 'recall'
}

cv_results = cross_validate(
    pipeline, X_dev, y_dev, cv=cv, scoring=scoring, return_train_score=True
)

print(f"\n3. Cross-Validation Results:")
print("-"*60)
print(f"{'Metric':<15} {'Train':<20} {'Validation':<20}")
print("-"*60)
for metric in ['accuracy', 'f1', 'precision', 'recall']:
    train_scores = cv_results[f'train_{metric}']
    val_scores = cv_results[f'test_{metric}']
    print(f"{metric:<15} {train_scores.mean():.3f} ± {train_scores.std():.3f}    {val_scores.mean():.3f} ± {val_scores.std():.3f}")

# Step 5: Check for overfitting
print(f"\n4. Overfitting Check:")
train_acc = cv_results['train_accuracy'].mean()
val_acc = cv_results['test_accuracy'].mean()
gap = train_acc - val_acc
print(f"   Train-Val Gap: {gap:.3f}")
if gap > 0.1:
    print("   ⚠️  Large gap! Consider regularization.")
else:
    print("   ✓ Gap is acceptable.")

# Step 6: Final evaluation on held-out test set
print(f"\n5. Final Test Evaluation:")
pipeline.fit(X_dev, y_dev)
test_accuracy = pipeline.score(X_test, y_test)
test_f1 = f1_score(y_test, pipeline.predict(X_test))
print(f"   Test Accuracy: {test_accuracy:.4f}")
print(f"   Test F1:       {test_f1:.4f}")

# Compare to CV estimate
print(f"\n6. Estimate Quality:")
print(f"   CV Accuracy Estimate:   {val_acc:.4f}")
print(f"   Actual Test Accuracy:   {test_accuracy:.4f}")
print(f"   Difference:             {abs(val_acc - test_accuracy):.4f}")
if abs(val_acc - test_accuracy) < 0.02:
    print("   ✓ CV estimate was reliable!")
else:
    print("   ⚠️  Some discrepancy — check data distribution")

Output:

============================================================
COMPLETE K-FOLD CROSS-VALIDATION WORKFLOW
============================================================

1. Data Split:
   Development set: 1600 samples
   Final test set:  400 samples (untouched until end!)

2. CV Strategy: 5-fold Stratified

3. Cross-Validation Results:
------------------------------------------------------------
Metric          Train                Validation          
------------------------------------------------------------
accuracy        1.000 ± 0.000        0.944 ± 0.011
f1              1.000 ± 0.000        0.763 ± 0.055
precision       1.000 ± 0.000        0.848 ± 0.062
recall          1.000 ± 0.000        0.700 ± 0.071

4. Overfitting Check:
   Train-Val Gap: 0.056
   ✓ Gap is acceptable.

5. Final Test Evaluation:
   Test Accuracy: 0.9400
   Test F1:       0.7407

6. Estimate Quality:
   CV Accuracy Estimate:   0.9438
   Actual Test Accuracy:   0.9400
   Difference:             0.0038
   ✓ CV estimate was reliable!

Quick Reference

Choosing K

Dataset Size        Recommended K    Reason
< 100               10 or LOO        Need max training data
100 - 1,000         10               Balance bias/variance
1,000 - 100,000     5                Standard choice
> 100,000           3 or holdout     Plenty of data per fold

CV Variants

Variant                    Use Case
KFold                      Regression
StratifiedKFold            Classification (always!)
RepeatedStratifiedKFold    Need confidence intervals
TimeSeriesSplit            Time series data
LeaveOneOut                Very small datasets
GroupKFold                 Groups must stay together (sketch below)
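
GroupKFold is the one variant above without an example earlier, so here's a short sketch (the group labels are invented for illustration): it keeps every sample that shares a group ID, say all rows from one patient, inside a single fold, so you never test on a group you trained on.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Pretend every 10 consecutive rows come from the same patient/session
groups = np.arange(len(X)) // 10

gkf = GroupKFold(n_splits=5)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Pass groups= so the splitter keeps each group inside a single fold
scores = cross_val_score(model, X, y, cv=gkf, groups=groups, scoring='accuracy')
print(f"{scores.mean():.3f} ± {scores.std():.3f}")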

The One-Liner

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='f1')
print(f"{scores.mean():.3f} ± {scores.std():.3f}")

Key Takeaways

  1. Single splits are unreliable — You might get lucky or unlucky

  2. K-fold tests on K different splits — Average is more reliable

  3. K=5 is the default — Good balance for most cases

  4. Always use StratifiedKFold for classification — Preserves class ratios

  5. Report mean AND standard deviation: 0.92 ± 0.03 tells the full story

  6. Use pipelines to prevent leakage — Preprocessing must happen inside CV

  7. Time series needs TimeSeriesSplit — No shuffling, no future leakage

  8. Nested CV for tuning + evaluation — Prevents overfitting to CV folds


The One-Sentence Summary

Marcus tested his comedy act at one club and thought he'd crush the tour, but he'd just gotten lucky with one tech-savvy audience — k-fold cross-validation is like testing at K different clubs and averaging the laughs, so you know your TRUE expected performance before you bet your career on it.


What's Next?

Now that you understand K-fold cross-validation, you're ready for:

  • Hyperparameter Tuning — Grid search, random search, Bayesian optimization
  • Learning Curves — Diagnosing bias vs variance
  • Model Selection — Statistical tests between models
  • Ensemble Methods — Combining multiple models

Follow me for the next article in this series!


Let's Connect!

If K-fold finally makes sense, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

What K do you typically use? I default to 5, bump to 10 for important decisions, drop to 3 for deep learning. What about you?


The difference between "my model got 94% accuracy" and "my model gets 89-94% accuracy depending on the split"? K-fold cross-validation. One gives you false confidence. The other gives you the truth.


Share this with someone who keeps re-running their code until they get a good accuracy. They're fooling themselves — and K-fold will reveal it.

Happy validating! 🎭
