K-Fold Cross-Validation: The Comedian Who Tested Jokes at Only One Comedy Club and Bombed Everywhere Else

Sachin Kr. Rajput

The One-Line Summary: K-fold cross-validation splits your data into K parts, trains K times (each time using a different part as the test set), and averages the results — giving you a reliable performance estimate instead of gambling on a single lucky/unlucky split.


The Comedian Who Bombed on Tour

Marcus had been perfecting his stand-up routine for months. He tested it at his local comedy club, The Laughing Llama.

Test night at The Laughing Llama:

Joke 1: "Why do programmers prefer dark mode?"
        → 87% laughed

Joke 2: "My code works on my machine..."
        → 92% laughed

Joke 3: "There are only 10 types of people..."
        → 95% laughed

OVERALL: 91% laugh rate! CRUSHING IT! 🎤

Marcus was confident. He booked a nationwide tour.


The Tour: Reality Check

Club 2 (Sports Bar):      
  Audience: Sports fans
  → 34% laugh rate. Crickets. 🦗

Club 3 (College Town):
  Audience: Students
  → 78% laugh rate. Pretty good!

Club 4 (Retirement Community):
  Audience: Seniors
  → 12% laugh rate. "What's dark mode?" 👴

Club 5 (Tech Conference):
  Audience: Developers
  → 97% laugh rate. Standing ovation! 🎉

Tour Average: 55% laugh rate


What Went Wrong?

Marcus tested his act on ONE audience (The Laughing Llama = tech-savvy locals) and assumed it would generalize everywhere.

But that one club wasn't representative. It was accidentally perfect for his material.

The Laughing Llama was his "lucky split."


What Marcus SHOULD Have Done

Test at K different clubs BEFORE the tour:

K-FOLD COMEDY VALIDATION (K=5):

Fold 1: Test at Sports Bar        → 34%
Fold 2: Test at College Town      → 78%
Fold 3: Test at Retirement Home   → 12%
Fold 4: Test at Tech Conference   → 97%
Fold 5: Test at Laughing Llama    → 91%

Average: 62.4% ± 33.5%

INSIGHT: "My act works great for SOME audiences 
         but bombs for others. High variance!"

Now Marcus KNOWS:

  • His true expected performance (~62%, not 91%)
  • His act is inconsistent (±33% variance)
  • He needs to diversify his material

K-Fold Cross-Validation Explained

The same logic applies to machine learning:

SINGLE TRAIN/TEST SPLIT (Dangerous):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

┌─────────────────────────────────┬─────────┐
│         TRAINING (80%)          │TEST(20%)│
└─────────────────────────────────┴─────────┘

One evaluation. One number. Could be lucky. Could be unlucky.
You'll never know.


K-FOLD CROSS-VALIDATION (K=5):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Fold 1: ┌──────┬────────────────────────────────┐
        │ TEST │         TRAINING               │
        └──────┴────────────────────────────────┘

Fold 2: ┌──────┬──────┬─────────────────────────┐
        │TRAIN │ TEST │       TRAINING          │
        └──────┴──────┴─────────────────────────┘

Fold 3: ┌────────────┬──────┬───────────────────┐
        │  TRAINING  │ TEST │     TRAINING      │
        └────────────┴──────┴───────────────────┘

Fold 4: ┌──────────────────┬──────┬─────────────┐
        │     TRAINING     │ TEST │  TRAINING   │
        └──────────────────┴──────┴─────────────┘

Fold 5: ┌────────────────────────────────┬──────┐
        │           TRAINING             │ TEST │
        └────────────────────────────────┴──────┘

Five evaluations. Five numbers. Average them.
MUCH more reliable!
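
Want to see how much a single split can swing? Here's a quick sketch (my own illustration, mirroring the dataset and model used in the examples below) that scores the same model on ten different random splits; exact numbers will depend on your data and seeds:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Same data, same model -- only the random split changes
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

split_scores = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    split_scores.append(model.score(X_test, y_test))

print(f"Min: {min(split_scores):.3f}  Max: {max(split_scores):.3f}  "
      f"Spread: {max(split_scores) - min(split_scores):.3f}")
# The spread across splits is exactly the uncertainty a single split hides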

The Algorithm

K-FOLD CROSS-VALIDATION:

1. Shuffle the data (optional but recommended)
2. Split data into K equal-sized "folds"
3. For i = 1 to K:
   a. Use fold i as TEST set
   b. Use all other folds as TRAINING set
   c. Train model on training set
   d. Evaluate on test set
   e. Record the score
4. Return: mean(scores), std(scores)

Key insight: Every data point gets to be in the test set exactly ONCE.
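
To make those steps concrete without any library helpers, here's a minimal from-scratch sketch (my own illustration; the function name k_fold_scores is mine, and it assumes NumPy arrays plus a scikit-learn-style estimator with fit/score):

import numpy as np
from sklearn.base import clone

def k_fold_scores(model, X, y, k=5, seed=42):
    """Minimal K-fold CV: shuffle, split into k folds, rotate the test fold."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))        # Step 1: shuffle
    folds = np.array_split(indices, k)       # Step 2: K (nearly) equal folds

    scores = []
    for i, test_idx in enumerate(folds):     # Step 3: each fold is the test set exactly once
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        fold_model = clone(model)            # fresh, untrained copy for this fold
        fold_model.fit(X[train_idx], y[train_idx])
        scores.append(fold_model.score(X[test_idx], y[test_idx]))

    return np.mean(scores), np.std(scores)   # Step 4: mean and std of the K scores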


Code: Basic K-Fold

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Create dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Method 1: Manual K-Fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    scores.append(score)
    print(f"Fold {fold+1}: Accuracy = {score:.4f}")

print(f"\nMean: {np.mean(scores):.4f} ± {np.std(scores):.4f}")


# Method 2: One-liner with cross_val_score
model = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

print(f"\nCross-val scores: {scores.round(4)}")
print(f"Mean: {scores.mean():.4f} ± {scores.std():.4f}")

Output:

Fold 1: Accuracy = 0.9150
Fold 2: Accuracy = 0.9100
Fold 3: Accuracy = 0.9300
Fold 4: Accuracy = 0.9050
Fold 5: Accuracy = 0.9250

Mean: 0.9170 ± 0.0094

Cross-val scores: [0.915  0.91   0.93   0.905  0.925]
Mean: 0.9170 ± 0.0094

How to Choose K?

This is THE question everyone asks. Here's the complete guide:

The Tradeoff

LOW K (e.g., K=2):
├── Less training data per fold (each model sees only 50% of the data)
├── High bias (models trained on half the data underestimate true performance)
├── Low variance (large test sets make each fold score stable)
├── Fast (only 2 training runs)
└── PROBLEM: Each model trains on far less data than your final model will

HIGH K (e.g., K=N, leave-one-out):
├── Maximum training data (N-1 samples per fold)
├── Low bias (training on almost everything)
├── High variance (N evaluations, each scored on a single sample!)
├── Very slow (N training runs)
└── PROBLEM: The training sets are nearly identical, so the N estimates are highly correlated

The Standard Choices

K=5: The Default

# Most common choice. Good balance.
cv = KFold(n_splits=5)

# Why 5?
# - Each fold uses 80% for training (enough data)
# - 5 evaluations (enough to estimate variance)
# - Not too slow
# - Empirically works well across many problems

Use K=5 when:

  • You have a moderate dataset (1,000 - 100,000 samples)
  • You're not sure what to use
  • Training isn't too expensive

K=10: The Thorough Choice

# More evaluations, slightly more reliable
cv = KFold(n_splits=10)

# Why 10?
# - Each fold uses 90% for training
# - 10 evaluations for better variance estimate
# - Still reasonable computation time

Use K=10 when:

  • Your dataset is on the smaller side (K=10 keeps 90% of it available for training each time)
  • You want more confidence in your estimate
  • Computation time isn't a concern

K=3: The Quick Choice

# Fast, but less reliable
cv = KFold(n_splits=3)

# Why 3?
# - Only 3 training runs (fast!)
# - Each fold uses 67% for training
# - Good for quick experiments

Use K=3 when:

  • You have limited compute resources
  • Training is expensive (deep learning)
  • You're doing rapid iteration
  • You have a huge dataset (K=3 still gives millions of test samples)

K=N (Leave-One-Out): The Extreme

from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)

# N training runs, each testing on 1 sample!

Use LOO when:

  • You have a very small dataset (<100 samples)
  • Each sample is precious
  • Training is fast (simple models)

DON'T use LOO when:

  • You have a large dataset (N training runs is too slow)
  • You're using a high-variance model (single-sample scores make the estimate very noisy)

The Decision Framework

HOW MANY SAMPLES DO YOU HAVE?

< 100 samples:
    └─► Leave-One-Out (LOO) or K=10 with stratification

100 - 1,000 samples:
    └─► K=10 (each model trains on 90% of the data; every sample is still tested once)

1,000 - 100,000 samples:
    └─► K=5 (standard choice)

> 100,000 samples:
    └─► K=3 or even K=2 (each fold has plenty of data)
    └─► Or: Single stratified split might be okay!


HOW EXPENSIVE IS TRAINING?

Cheap (linear models, small trees):
    └─► K=10 or even LOO

Moderate (random forests, gradient boosting):
    └─► K=5

Expensive (deep learning, large models):
    └─► K=3 or even single holdout with careful stratification
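
If you prefer the framework as code, here's a tiny helper that just encodes the chart above (the name suggest_k and the exact thresholds are my own illustrative choices, not anything from scikit-learn):

def suggest_k(n_samples, training_cost="moderate"):
    """Rough heuristic mirroring the chart above -- adjust to your problem."""
    if training_cost == "expensive":
        return 3                      # deep learning / large models
    if n_samples < 100:
        return n_samples              # leave-one-out territory
    if n_samples < 1_000:
        return 10
    if n_samples <= 100_000:
        return 5
    return 3                          # huge data: each fold is still plenty big

print(suggest_k(500))         # 10
print(suggest_k(50_000))      # 5
print(suggest_k(2_000_000))   # 3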

Comparing Different K Values

import numpy as np
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import time

# Create dataset
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

model = LogisticRegression(max_iter=1000)

k_values = [2, 3, 5, 10, 20]

print("K-Fold Comparison")
print("="*60)
print(f"{'K':<5} {'Mean':<10} {'Std':<10} {'Time (s)':<10} {'Training %':<12}")
print("-"*60)

for k in k_values:
    start = time.time()
    cv = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
    elapsed = time.time() - start

    train_pct = (k-1)/k * 100
    print(f"{k:<5} {scores.mean():<10.4f} {scores.std():<10.4f} {elapsed:<10.3f} {train_pct:<12.1f}%")

# Leave-One-Out for comparison
print("-"*60)
start = time.time()
loo = LeaveOneOut()
scores_loo = cross_val_score(model, X, y, cv=loo, scoring='accuracy')
elapsed = time.time() - start
print(f"{'LOO':<5} {scores_loo.mean():<10.4f} {scores_loo.std():<10.4f} {elapsed:<10.3f} {99.8:<12.1f}%")

Output:

K-Fold Comparison
============================================================
K     Mean       Std        Time (s)   Training %  
------------------------------------------------------------
2     0.8620     0.0080     0.012      50.0%
3     0.8693     0.0125     0.015      66.7%
5     0.8740     0.0167     0.023      80.0%
10    0.8760     0.0312     0.042      90.0%
20    0.8750     0.0445     0.081      95.0%
------------------------------------------------------------
LOO   0.8760     0.3297     2.341      99.8%

Observations:

K     Mean Accuracy   Std      Insight
2     0.862           0.008    Biased low (only 50% training data)
5     0.874           0.017    Good balance
10    0.876           0.031    Slightly better mean, higher variance
LOO   0.876           0.330    Same mean, HUGE variance per fold

Stratified K-Fold (For Classification)

Critical for imbalanced data!

from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced dataset: 5% positive
X, y = make_classification(
    n_samples=1000, weights=[0.95, 0.05], random_state=42
)

# Regular KFold - DANGEROUS!
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Stratified KFold - SAFE!
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Compare class distribution in each fold
print("Regular KFold - Positive % in each test fold:")
for fold, (train_idx, test_idx) in enumerate(kf.split(X, y)):
    pct = y[test_idx].mean() * 100
    print(f"  Fold {fold+1}: {pct:.1f}%")

print("\nStratified KFold - Positive % in each test fold:")
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    pct = y[test_idx].mean() * 100
    print(f"  Fold {fold+1}: {pct:.1f}%")

Output:

Regular KFold - Positive % in each test fold:
  Fold 1: 3.5%
  Fold 2: 6.5%
  Fold 3: 4.0%
  Fold 4: 7.0%   ← 40% more than expected!
  Fold 5: 4.0%

Stratified KFold - Positive % in each test fold:
  Fold 1: 5.0%
  Fold 2: 5.0%
  Fold 3: 5.0%
  Fold 4: 5.0%
  Fold 5: 5.0%   ← All exactly 5%!

Rule: Always use StratifiedKFold for classification!


Repeated K-Fold (More Robust)

Run K-fold multiple times with different random shuffles:

from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# 5-fold CV, repeated 10 times = 50 total evaluations
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)

scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')

print(f"50 evaluations (5-fold × 10 repeats)")
print(f"Mean: {scores.mean():.4f}")
print(f"Std:  {scores.std():.4f}")
print(f"95% CI: [{scores.mean() - 1.96*scores.std():.4f}, {scores.mean() + 1.96*scores.std():.4f}]")

Output:

50 evaluations (5-fold × 10 repeats)
Mean: 0.8742
Std:  0.0189
95% CI: [0.8372, 0.9112]

Use Repeated K-Fold when:

  • You need a confidence interval
  • You want to reduce variance
  • Computation allows it

Time Series Cross-Validation

Regular K-fold BREAKS time series! (Future data leaks into training)

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# DON'T DO THIS for time series:
# kf = KFold(n_splits=5, shuffle=True)  # ❌ Shuffling breaks time!

# DO THIS:
tscv = TimeSeriesSplit(n_splits=5)

# Visualization of splits
X = np.arange(100).reshape(-1, 1)  # 100 time points
y = np.random.randn(100)

print("Time Series Cross-Validation Splits:")
print("="*60)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"Fold {fold+1}:")
    print(f"  Train: indices {train_idx[0]:3d} to {train_idx[-1]:3d} ({len(train_idx)} samples)")
    print(f"  Test:  indices {test_idx[0]:3d} to {test_idx[-1]:3d} ({len(test_idx)} samples)")

Output:

Time Series Cross-Validation Splits:
============================================================
Fold 1:
  Train: indices   0 to  19 (20 samples)
  Test:  indices  20 to  35 (16 samples)
Fold 2:
  Train: indices   0 to  35 (36 samples)
  Test:  indices  36 to  51 (16 samples)
Fold 3:
  Train: indices   0 to  51 (52 samples)
  Test:  indices  52 to  67 (16 samples)
Fold 4:
  Train: indices   0 to  67 (68 samples)
  Test:  indices  68 to  83 (16 samples)
Fold 5:
  Train: indices   0 to  83 (84 samples)
  Test:  indices  84 to  99 (16 samples)

Key difference: Training ALWAYS uses past data, test ALWAYS uses future data.

Time Series CV Visualization:

Time →→→→→→→→→→→→→→→→→→→→→→→→→→→→→→→→→→→

Fold 1: [====TRAIN====][TEST]
Fold 2: [======TRAIN======][TEST]
Fold 3: [========TRAIN========][TEST]
Fold 4: [==========TRAIN==========][TEST]
Fold 5: [============TRAIN============][TEST]

Training grows, but NEVER includes future test data!

Nested Cross-Validation (For Hyperparameter Tuning)

Problem: If you tune hyperparameters using CV, then evaluate using the same CV, you're overfitting to your CV splits!

Solution: Nested CV — outer loop for evaluation, inner loop for tuning.

from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
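
# X, y below are a classification dataset (e.g., the 1000-sample set from the
# basic K-fold example), not the 100-point time-series arrays from the last section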

# Outer CV: for final performance estimation
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Inner CV: for hyperparameter tuning
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Model with hyperparameters to tune
model = RandomForestClassifier(random_state=42)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, None]
}

# GridSearchCV handles the inner loop
grid_search = GridSearchCV(
    model, param_grid, cv=inner_cv, scoring='accuracy', n_jobs=-1
)

# cross_val_score handles the outer loop
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv, scoring='accuracy')

print("Nested Cross-Validation Results:")
print(f"Scores: {nested_scores.round(4)}")
print(f"Mean:   {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")

Output:

Nested Cross-Validation Results:
Scores: [0.9250 0.9150 0.9400 0.9100 0.9350]
Mean:   0.9250 ± 0.0114

NESTED CV STRUCTURE:

OUTER LOOP (5 folds) - Estimates true performance
├── Fold 1: [=====Train=====][Test]
│   └── INNER LOOP (3 folds) - Tunes hyperparameters
│       ├── Inner Fold 1: tune
│       ├── Inner Fold 2: tune
│       └── Inner Fold 3: tune
│       → Best params found, evaluate on outer test
│
├── Fold 2: [Test][=====Train=====]
│   └── INNER LOOP (3 folds) - Tunes hyperparameters
│       → Best params found, evaluate on outer test
│
... (repeat for all outer folds)

Final: Average of 5 outer test scores = unbiased estimate!

Common Mistakes

Mistake 1: Using KFold Instead of StratifiedKFold for Classification

# ❌ WRONG for classification
from sklearn.model_selection import KFold
cv = KFold(n_splits=5)

# ✅ RIGHT for classification
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=5)

# Or just use cv=5 in cross_val_score — it auto-detects!
scores = cross_val_score(classifier, X, y, cv=5)  # Uses StratifiedKFold

Mistake 2: Shuffling Time Series Data

# ❌ WRONG for time series
cv = KFold(n_splits=5, shuffle=True)  # Shuffling breaks temporal order!

# ✅ RIGHT for time series
from sklearn.model_selection import TimeSeriesSplit
cv = TimeSeriesSplit(n_splits=5)

Mistake 3: Data Leakage in Preprocessing

# ❌ WRONG: Fit scaler on ALL data before CV
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)  # Leaks test info!
scores = cross_val_score(model, X_scaled, y, cv=5)

# ✅ RIGHT: Use Pipeline to fit scaler inside CV
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])
scores = cross_val_score(pipe, X, y, cv=5)  # Scaler fits only on train folds!

Mistake 4: Tuning AND Evaluating on Same CV

# ❌ WRONG: Optimistic estimate!
grid_search = GridSearchCV(model, params, cv=5)
grid_search.fit(X, y)
print(f"Best CV score: {grid_search.best_score_}")  # Overfitted to these folds!

# ✅ RIGHT: Use nested CV for unbiased estimate
nested_scores = cross_val_score(grid_search, X, y, cv=5)
print(f"True expected score: {nested_scores.mean()}")

Mistake 5: Ignoring Variance

# ❌ WRONG: Only reporting mean
scores = cross_val_score(model, X, y, cv=5)
print(f"Accuracy: {scores.mean():.4f}")  # What about variance?

# ✅ RIGHT: Report mean AND standard deviation
print(f"Accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
print(f"Scores: {scores.round(4)}")  # Show all scores!

# If std is high, your estimate is unreliable!

Complete Example: The Right Way

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import (
    StratifiedKFold, cross_val_score, cross_validate,
    GridSearchCV, train_test_split
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score

# Create imbalanced dataset
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.9, 0.1],
    random_state=42
)

print("="*60)
print("COMPLETE K-FOLD CROSS-VALIDATION WORKFLOW")
print("="*60)

# Step 1: Hold out a TRUE test set (never touched during CV!)
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"\n1. Data Split:")
print(f"   Development set: {len(X_dev)} samples")
print(f"   Final test set:  {len(X_test)} samples (untouched until end!)")

# Step 2: Define CV strategy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print(f"\n2. CV Strategy: 5-fold Stratified")

# Step 3: Create pipeline (preprocessing + model)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(random_state=42))
])

# Step 4: Cross-validate with multiple metrics
scoring = {
    'accuracy': 'accuracy',
    'f1': 'f1',
    'precision': 'precision',
    'recall': 'recall'
}

cv_results = cross_validate(
    pipeline, X_dev, y_dev, cv=cv, scoring=scoring, return_train_score=True
)

print(f"\n3. Cross-Validation Results:")
print("-"*60)
print(f"{'Metric':<15} {'Train':<20} {'Validation':<20}")
print("-"*60)
for metric in ['accuracy', 'f1', 'precision', 'recall']:
    train_scores = cv_results[f'train_{metric}']
    val_scores = cv_results[f'test_{metric}']
    print(f"{metric:<15} {train_scores.mean():.3f} ± {train_scores.std():.3f}    {val_scores.mean():.3f} ± {val_scores.std():.3f}")

# Step 5: Check for overfitting
print(f"\n4. Overfitting Check:")
train_acc = cv_results['train_accuracy'].mean()
val_acc = cv_results['test_accuracy'].mean()
gap = train_acc - val_acc
print(f"   Train-Val Gap: {gap:.3f}")
if gap > 0.1:
    print("   ⚠️  Large gap! Consider regularization.")
else:
    print("   ✓ Gap is acceptable.")

# Step 6: Final evaluation on held-out test set
print(f"\n5. Final Test Evaluation:")
pipeline.fit(X_dev, y_dev)
test_accuracy = pipeline.score(X_test, y_test)
test_f1 = f1_score(y_test, pipeline.predict(X_test))
print(f"   Test Accuracy: {test_accuracy:.4f}")
print(f"   Test F1:       {test_f1:.4f}")

# Compare to CV estimate
print(f"\n6. Estimate Quality:")
print(f"   CV Accuracy Estimate:   {val_acc:.4f}")
print(f"   Actual Test Accuracy:   {test_accuracy:.4f}")
print(f"   Difference:             {abs(val_acc - test_accuracy):.4f}")
if abs(val_acc - test_accuracy) < 0.02:
    print("   ✓ CV estimate was reliable!")
else:
    print("   ⚠️  Some discrepancy — check data distribution")

Output:

============================================================
COMPLETE K-FOLD CROSS-VALIDATION WORKFLOW
============================================================

1. Data Split:
   Development set: 1600 samples
   Final test set:  400 samples (untouched until end!)

2. CV Strategy: 5-fold Stratified

3. Cross-Validation Results:
------------------------------------------------------------
Metric          Train                Validation          
------------------------------------------------------------
accuracy        1.000 ± 0.000        0.944 ± 0.011
f1              1.000 ± 0.000        0.763 ± 0.055
precision       1.000 ± 0.000        0.848 ± 0.062
recall          1.000 ± 0.000        0.700 ± 0.071

4. Overfitting Check:
   Train-Val Gap: 0.056
   ✓ Gap is acceptable.

5. Final Test Evaluation:
   Test Accuracy: 0.9400
   Test F1:       0.7407

6. Estimate Quality:
   CV Accuracy Estimate:   0.9438
   Actual Test Accuracy:   0.9400
   Difference:             0.0038
   ✓ CV estimate was reliable!

Quick Reference

Choosing K

Dataset Size        Recommended K    Reason
< 100               10 or LOO        Need max training data
100 - 1,000         10               Balance bias/variance
1,000 - 100,000     5                Standard choice
> 100,000           3 or holdout     Plenty of data per fold

CV Variants

Variant                    Use Case
KFold                      Regression
StratifiedKFold            Classification (always!)
RepeatedStratifiedKFold    Need confidence intervals
TimeSeriesSplit            Time series data
LeaveOneOut                Very small datasets
GroupKFold                 Groups must stay together (sketch below)
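
GroupKFold is the one variant above without an example earlier, so here's a short sketch (the group labels are invented for illustration): it keeps every sample that shares a group ID, say all rows from one patient, inside a single fold, so you never test on a group you trained on.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Pretend every 10 consecutive rows come from the same patient/session
groups = np.arange(len(X)) // 10

gkf = GroupKFold(n_splits=5)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Pass groups= so the splitter keeps each group inside a single fold
scores = cross_val_score(model, X, y, cv=gkf, groups=groups, scoring='accuracy')
print(f"{scores.mean():.3f} ± {scores.std():.3f}")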

The One-Liner

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='f1')
print(f"{scores.mean():.3f} ± {scores.std():.3f}")

Key Takeaways

  1. Single splits are unreliable — You might get lucky or unlucky

  2. K-fold tests on K different splits — Average is more reliable

  3. K=5 is the default — Good balance for most cases

  4. Always use StratifiedKFold for classification — Preserves class ratios

  5. Report mean AND standard deviation: 0.92 ± 0.03 tells the full story

  6. Use pipelines to prevent leakage — Preprocessing must happen inside CV

  7. Time series needs TimeSeriesSplit — No shuffling, no future leakage

  8. Nested CV for tuning + evaluation — Prevents overfitting to CV folds


The One-Sentence Summary

Marcus tested his comedy act at one club and thought he'd crush the tour, but he'd just gotten lucky with one tech-savvy audience — k-fold cross-validation is like testing at K different clubs and averaging the laughs, so you know your TRUE expected performance before you bet your career on it.


What's Next?

Now that you understand K-fold cross-validation, you're ready for:

  • Hyperparameter Tuning — Grid search, random search, Bayesian optimization
  • Learning Curves — Diagnosing bias vs variance
  • Model Selection — Statistical tests between models
  • Ensemble Methods — Combining multiple models

Follow me for the next article in this series!


Let's Connect!

If K-fold finally makes sense, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

What K do you typically use? I default to 5, bump to 10 for important decisions, drop to 3 for deep learning. What about you?


The difference between "my model got 94% accuracy" and "my model gets 89-94% accuracy depending on the split"? K-fold cross-validation. One gives you false confidence. The other gives you the truth.


Share this with someone who keeps re-running their code until they get a good accuracy. They're fooling themselves — and K-fold will reveal it.

Happy validating! 🎭
