The One-Line Summary: Training data is for learning, validation data is for tuning, and test data is for the final honest evaluation. Use training to grade your model and you're letting students grade their own exams. Use validation too many times and it becomes another training set.
The MasterChef Disaster
Chef Marcus was competing in the biggest cooking competition of his life.
For three months, he practiced. Every night, he'd cook his signature dish, taste it, adjust the seasoning, taste again, adjust the temperature, taste again, tweak the plating.
By competition day, he'd tasted that dish 347 times.
His internal rating: Perfect. 10/10. Flawless.
The judges took one bite.
"It's... overwhelming. The salt is too aggressive. The sauce is trying too hard."
Marcus was stunned. How could his perfect dish fail?
Here's what happened:
Marcus had been optimizing for his own taste buds.
After 347 tastings, he'd unconsciously adjusted everything to what HE liked. More salt (he loved salt). Bolder sauce (he craved intensity). He wasn't making a great dish — he was making a Marcus-flavored dish.
The judges had never tasted it before. They experienced it fresh. And fresh, it was overwhelming.
This is what happens when you evaluate your model on training data.
The model has "tasted" that data hundreds of times. It has adjusted itself to perfectly match those exact examples. It's not learning general patterns — it's memorizing Marcus's preferences.
Then it meets new data (the judges). And it fails.
The Three Audiences Every Chef Needs
To win the competition, Marcus needed THREE different audiences:
Audience 1: Himself (Training Set)
Purpose: Practice and learn
Frequency: Unlimited tastings
Feedback: Immediate, continuous
Can adjust based on feedback? YES — that's the whole point!
"Too bland → add salt"
"Sauce too thin → reduce longer"
"Meat overcooked → lower temperature"
This is where learning happens. Marcus experiments, fails, adjusts, fails again, adjusts more. The dish evolves.
But he can't trust his own rating. He's tasted it too many times. He's biased.
Audience 2: Trusted Friends (Validation Set)
Purpose: Get outside feedback to tune the dish
Frequency: A few times during preparation
Feedback: Independent opinions
Can adjust based on feedback? YES — but carefully!
Friend: "Needs more acidity"
Marcus: *adds lemon*
Friend: "Better! But maybe slightly less salt now"
Marcus: *reduces salt*
Friends haven't tasted it 347 times. They're fresher. Their feedback helps Marcus escape his own biases.
But there's a danger: If Marcus keeps adjusting based on these specific friends' preferences, eventually he's just optimizing for THEM. They become another version of "himself."
Audience 3: The Competition Judges (Test Set)
Purpose: Final, honest evaluation
Frequency: ONCE. That's it.
Feedback: The real score
Can adjust based on feedback? NO — it's too late!
Judge: "8.5/10"
Marcus: "Can I adjust and try again?"
Judge: "No. Next contestant."
The judges are completely fresh. They've never seen the dish. They have no history with Marcus. This is the TRUE test of whether the dish is good.
You only get ONE shot. If Marcus kept re-entering the competition with adjusted dishes until he won, he'd just be memorizing what those specific judges like — not making a universally great dish.
Translating to Machine Learning
| Cooking Competition | Machine Learning |
|---|---|
| Marcus tasting his own dish | Evaluating on training data |
| Friends giving feedback | Evaluating on validation data |
| Competition judges | Evaluating on test data |
| Marcus adjusting seasoning | Model learning weights |
| Adjusting based on friends | Tuning hyperparameters |
| Final score from judges | Reported model performance |
Why You Need All Three
Let me prove each one is necessary.
Why Training Data Isn't Enough for Evaluation
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Create dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
# Train a model
model = DecisionTreeClassifier(max_depth=None) # No limits = will memorize!
model.fit(X, y)
# Evaluate on TRAINING data (what Marcus did)
train_accuracy = model.score(X, y)
print(f"Training accuracy: {train_accuracy:.1%}")
# But wait... let's check on fresh data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"\nTraining accuracy: {train_acc:.1%}")
print(f"Test accuracy: {test_acc:.1%}")
print(f"\nThe model LIED by {train_acc - test_acc:.1%}!")
Output:
Training accuracy: 100.0%
Training accuracy: 100.0%
Test accuracy: 88.5%
The model LIED by 11.5%!
100% on training, but only 88.5% on new data!
The model memorized the training data. It's Marcus thinking his dish is "perfect" because he's tasted it 347 times.
Why Two Sets Aren't Enough (Train + Test)
"Okay," you say, "I'll just use a test set!"
But what happens when you tune your model?
# Scenario: Tuning hyperparameters using test set
from sklearn.ensemble import RandomForestClassifier
# Split into train and test only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Try different hyperparameters, check on test set each time
best_score = 0
best_params = {}
for n_estimators in [10, 50, 100, 200]:
    for max_depth in [3, 5, 10, 20, None]:
        model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
        model.fit(X_train, y_train)
        # ❌ WRONG: Using test set for tuning decisions!
        test_score = model.score(X_test, y_test)
        if test_score > best_score:
            best_score = test_score
            best_params = {'n_estimators': n_estimators, 'max_depth': max_depth}
print(f"Best test score: {best_score:.1%}")
print(f"Best params: {best_params}")
# Problem: This "test score" is now OPTIMISTIC!
# We've tuned specifically to do well on THIS test set.
# On truly new data, we'll likely do worse.
What happened?
By checking the test score 20 times and picking the best, we've leaked information from the test set into our model selection process.
The test set is no longer "fresh." It's now just another form of training data — we've optimized for it.
This is like Marcus calling the judges before the competition: "Hey, do you prefer more salt or less?" It's cheating!
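You can actually measure this leak. Here's a minimal sketch (reusing X, y and the same parameter grid from above; the extra "fresh" holdout and its variable names are just for illustration) that compares the score we selected on against the same model's score on data that never influenced any decision:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Carve out a "truly fresh" holdout that plays no part in tuning
X_rest, X_fresh, y_rest, y_fresh = train_test_split(X, y, test_size=0.2, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_rest, y_rest, test_size=0.25, random_state=7)
best_score, best_model = 0, None
for n_estimators in [10, 50, 100, 200]:
    for max_depth in [3, 5, 10, 20, None]:
        m = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
        m.fit(X_tr, y_tr)
        score = m.score(X_te, y_te)  # selecting on this set, 20 times
        if score > best_score:
            best_score, best_model = score, m
print(f"Score we selected on:      {best_score:.1%}")
print(f"Score on truly fresh data: {best_model.score(X_fresh, y_fresh):.1%}")
On a dataset this small the gap is noisy and sometimes tiny, but on average the second number comes out lower. That average optimism is exactly the information we leaked by reusing the test set for selection.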
The Three-Way Split: The Honest Approach
import numpy as np
from sklearn.model_selection import train_test_split
# Split into THREE sets
# First split: separate test set (final evaluation)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Second split: separate validation from training
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
# 0.25 of 0.8 = 0.2, so we get 60/20/20 split
print(f"Training set: {len(X_train)} samples ({len(X_train)/len(X):.0%})")
print(f"Validation set: {len(X_val)} samples ({len(X_val)/len(X):.0%})")
print(f"Test set: {len(X_test)} samples ({len(X_test)/len(X):.0%})")
Output:
Training set: 600 samples (60%)
Validation set: 200 samples (20%)
Test set: 200 samples (20%)
Now the workflow is honest:
from sklearn.ensemble import RandomForestClassifier
# Step 1: Tune hyperparameters using VALIDATION set
best_val_score = 0
best_params = {}
for n_estimators in [10, 50, 100, 200]:
    for max_depth in [3, 5, 10, 20, None]:
        model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
        model.fit(X_train, y_train)
        # ✅ RIGHT: Use VALIDATION set for tuning!
        val_score = model.score(X_val, y_val)
        if val_score > best_val_score:
            best_val_score = val_score
            best_params = {'n_estimators': n_estimators, 'max_depth': max_depth}
print(f"Best validation score: {best_val_score:.1%}")
print(f"Best params: {best_params}")
# Step 2: Train final model with best params
final_model = RandomForestClassifier(**best_params, random_state=42)
final_model.fit(X_train, y_train)
# Step 3: Evaluate ONCE on test set (the judges!)
test_score = final_model.score(X_test, y_test)
print(f"\n🎯 Final test score: {test_score:.1%}")
print("This is the honest estimate of real-world performance!")
Output:
Best validation score: 91.5%
Best params: {'n_estimators': 100, 'max_depth': 10}
🎯 Final test score: 90.0%
This is the honest estimate of real-world performance!
Notice: Test score (90.0%) is close to validation score (91.5%). That's a good sign — our validation set gave us an honest estimate!
The Visual Guide
YOUR DATA
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
█████████████████████████████████████████████████████████████
AFTER SPLITTING
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
████████████████████████████████████ ████████████ ████████████
           TRAINING (60%)            VALID (20%)   TEST (20%)
                 │                        │            │
                 │                        │            │
                 ▼                        ▼            ▼
          Learn patterns            Tune & select Final grade
          Adjust weights           hyperparameters ONCE. NO PEEKING.
          Many iterations           Multiple times Report this score!
The Rules
Rule 1: Test Set Is Sacred
# ❌ NEVER do this
for epoch in range(100):
    model.fit(X_train, y_train)
    test_loss = model.evaluate(X_test, y_test)  # Peeking!
    if test_loss < best:
        save_model()  # Selecting based on test!

# ✅ Do this instead
for epoch in range(100):
    model.fit(X_train, y_train)
    val_loss = model.evaluate(X_val, y_val)  # Validation only!
    if val_loss < best:
        save_model()

# Only at the very end:
final_score = model.evaluate(X_test, y_test)
Rule 2: Validation Can Be Reused (Carefully)
Unlike test, you CAN look at validation scores multiple times. That's its purpose!
But be aware: excessive tuning on validation can cause validation set overfitting.
# This is okay:
for params in parameter_grid:
    model = train(params)
    val_score = evaluate(model, X_val)  # ✓
# But if you do this 10,000 times, you might overfit to validation too!
# Solution: Use cross-validation for more robust estimates
Rule 3: Keep Proportions Reasonable
Common splits:
├── 60% train / 20% validation / 20% test (balanced)
├── 70% train / 15% validation / 15% test (more training)
├── 80% train / 10% validation / 10% test (lots of data)
└── 50% train / 25% validation / 25% test (small dataset, need reliable estimates)
Rule of thumb:
├── Test set: At least 1,000 samples if possible
├── Validation set: At least 1,000 samples if possible
└── Smaller datasets: Use cross-validation instead!
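If you want a ratio other than 60/20/20, the only subtlety is that the second split's test_size is relative to what's left after the first split. A small sketch (the 70/15/15 numbers are just an example):
from sklearn.model_selection import train_test_split
train_frac, val_frac, test_frac = 0.70, 0.15, 0.15  # example ratios

# First split peels off the test set...
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=test_frac, random_state=42)

# ...then the validation fraction is taken relative to the remainder:
# 0.15 / (1 - 0.15) ≈ 0.176 of X_temp equals 15% of the full dataset
val_size = val_frac / (1 - test_frac)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=val_size, random_state=42)

print(len(X_train) / len(X), len(X_val) / len(X), len(X_test) / len(X))  # ≈ 0.70, 0.15, 0.15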
Rule 4: Stratify for Classification
Preserve class proportions in all splits!
import pandas as pd
from sklearn.model_selection import train_test_split
# ❌ WRONG: Random split might give unbalanced classes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# ✅ RIGHT: Stratified split preserves class ratios
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
# Verify
print("Full data class distribution:")
print(pd.Series(y).value_counts(normalize=True))
print("\nTest set class distribution:")
print(pd.Series(y_test).value_counts(normalize=True))
# They should match!
Rule 5: Time Series Needs Special Handling
For time-based data, you can't randomly split — that's data leakage!
# ❌ WRONG: Random split on time series
X_train, X_test = train_test_split(stock_data)
# Training might include 2024 data, test might include 2020!
# ✅ RIGHT: Chronological split
train = data[data['date'] < '2023-01-01']
val = data[(data['date'] >= '2023-01-01') & (data['date'] < '2024-01-01')]
test = data[data['date'] >= '2024-01-01']
# Timeline:
# [---- TRAIN ----][-- VAL --][-- TEST --]
# 2015             2023       2024    2025
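Cross-validation (covered below) has a time-series-aware variant too: scikit-learn's TimeSeriesSplit builds expanding-window folds that always train on the past and validate on the future. A minimal sketch, assuming X_ts is a hypothetical feature array already sorted oldest-to-newest:
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X_ts)):
    # In every fold, the training indices come strictly before the validation indices
    print(f"Fold {fold}: train rows 0..{train_idx[-1]}, validate rows {val_idx[0]}..{val_idx[-1]}")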
Cross-Validation: When Three Splits Aren't Enough
With small datasets, a single validation split might be unreliable. Enter k-fold cross-validation:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# Instead of one validation split, use 5 different ones!
model = RandomForestClassifier(random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.1%} ± {scores.std():.1%}")
How it works:
5-Fold Cross-Validation:
Fold 1: [VAL][TRAIN][TRAIN][TRAIN][TRAIN] → Score 1
Fold 2: [TRAIN][VAL][TRAIN][TRAIN][TRAIN] → Score 2
Fold 3: [TRAIN][TRAIN][VAL][TRAIN][TRAIN] → Score 3
Fold 4: [TRAIN][TRAIN][TRAIN][VAL][TRAIN] → Score 4
Fold 5: [TRAIN][TRAIN][TRAIN][TRAIN][VAL] → Score 5
Final estimate = Average of all 5 scores
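Under the hood, cross_val_score is doing roughly this (a sketch with KFold; the real function also clones the estimator and supports custom scorers, so treat it as an illustration rather than the exact implementation):
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, val_idx in kf.split(X_train):
    m = RandomForestClassifier(random_state=42)
    m.fit(X_train[train_idx], y_train[train_idx])                    # train on 4 folds
    fold_scores.append(m.score(X_train[val_idx], y_train[val_idx]))  # validate on the 5th
print(f"Manual CV mean: {np.mean(fold_scores):.1%}")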
Still keep a separate test set! Cross-validation replaces the single validation split, not the test set.
# Complete workflow with CV
from sklearn.model_selection import cross_val_score, GridSearchCV
# 1. Hold out test set
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
# 2. Use cross-validation for hyperparameter tuning
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, None]}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,  # 5-fold cross-validation
    scoring='accuracy'
)
grid_search.fit(X_trainval, y_trainval)
print(f"Best CV score: {grid_search.best_score_:.1%}")
print(f"Best params: {grid_search.best_params_}")
# 3. Final evaluation on test set (ONCE!)
test_score = grid_search.score(X_test, y_test)
print(f"\n🎯 Final test score: {test_score:.1%}")
The Complete Picture
YOUR WORKFLOW
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                            ┌─────────────┐
                            │  ALL DATA   │
                            └──────┬──────┘
                                   │
                  ┌────────────────┴────────────────┐
                  │                                 │
                  ▼                                 ▼
          ┌───────────────┐                 ┌──────────────┐
          │  TRAIN + VAL  │                 │     TEST     │
          │     (80%)     │                 │    (20%)     │
          └───────┬───────┘                 └──────────────┘
                  │                                 │
                  │                                 │
        ┌─────────┼─────────┐                       │
        │         │         │                       │
        ▼         ▼         ▼                       │
    ┌───────┐ ┌───────┐ ┌───────┐                   │
    │Fold 1 │ │Fold 2 │ │Fold 3 │ ...               │
    └───┬───┘ └───┬───┘ └───┬───┘                   │
        │         │         │                       │
        └─────────┼─────────┘                       │
                  │                                 │
                  ▼                                 │
         ┌─────────────────┐                        │
         │ Cross-Validation│                        │
         │     Scores      │                        │
         │  (for tuning)   │                        │
         └────────┬────────┘                        │
                  │                                 │
                  │  Select best model              │
                  │                                 │
                  ▼                                 ▼
         ┌─────────────────┐              ┌──────────────────┐
         │   BEST MODEL    │ ───────────▶ │ FINAL EVALUATION │
         │  (from tuning)  │   Evaluate   │  (Report this!)  │
         └─────────────────┘     ONCE     └──────────────────┘
Common Mistakes
Mistake 1: Evaluating on Training Data
# ❌ WRONG: This tells you NOTHING useful
model.fit(X, y)
print(f"Accuracy: {model.score(X, y)}") # Meaningless!
# ✅ RIGHT: Evaluate on held-out data
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test)}")
Mistake 2: Tuning on Test Data
# ❌ WRONG: Test set is now contaminated
for params in grid:
    model = Model(params)
    model.fit(X_train, y_train)
    if model.score(X_test, y_test) > best:  # NO!
        save(model)

# ✅ RIGHT: Use validation for tuning
for params in grid:
    model = Model(params)
    model.fit(X_train, y_train)
    if model.score(X_val, y_val) > best:  # YES!
        save(model)
# Then evaluate on test ONCE at the end
Mistake 3: Peeking at Test Multiple Times
# ❌ WRONG: "Let me just check if this helps..."
model_v1 = train(config_1)
print(model_v1.score(X_test, y_test)) # Peek 1
model_v2 = train(config_2)
print(model_v2.score(X_test, y_test)) # Peek 2
model_v3 = train(config_3)
print(model_v3.score(X_test, y_test)) # Peek 3
# You've now tuned to the test set!
# ✅ RIGHT: Use validation for all comparisons
# Test only at the very end, ONCE
Mistake 4: Preprocessing Before Splitting
# ❌ WRONG: Data leakage!
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Learns mean/std from ALL data, including future test rows
X_train, X_test = train_test_split(X_scaled, random_state=42)

# ✅ RIGHT: Split first, then preprocess
X_train, X_test = train_test_split(X, random_state=42)
scaler = StandardScaler()
scaler.fit(X_train)  # Learn from train only!
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
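An easy way to make this mistake structurally impossible, especially once cross-validation enters the picture, is to wrap the preprocessing and the model in a single Pipeline; the scaler is then re-fit on the training portion of every fold automatically. A minimal sketch (using logistic regression here, since scaling actually matters for it):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# Inside each CV fold, the scaler only ever sees that fold's training rows
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(f"Leak-free CV mean: {scores.mean():.1%}")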
Mistake 5: Random Split on Time Series
# ❌ WRONG: Future predicting past!
X_train, X_test = train_test_split(time_series_data, shuffle=True)
# ✅ RIGHT: Chronological split
X_train = data[data['date'] < cutoff_date]
X_test = data[data['date'] >= cutoff_date]
Quick Reference
| Set | Purpose | How Often to Use | Can Adjust Based On? |
|---|---|---|---|
| Training | Learn patterns | Every epoch/iteration | Yes — that's learning! |
| Validation | Tune hyperparameters | Multiple times | Yes — that's tuning! |
| Test | Final evaluation | ONCE | NO — too late! |
Key Takeaways
- Training data is for learning — The model adjusts to fit these patterns
- Validation data is for tuning — Use it to select hyperparameters and architectures
- Test data is for final evaluation — Touch it ONCE, report that score
- Training accuracy is meaningless — It only shows memorization ability
- Tuning on test = cheating — You're optimizing for that specific data
- Stratify for classification — Keep class proportions consistent
- Time series needs chronological splits — No shuffling!
- Use cross-validation for small datasets — More reliable than a single split
The One-Sentence Summary
Training data is Marcus tasting his own dish 347 times, validation data is his friends giving feedback, and test data is the competition judges he's never met — only one of these tells you if the dish is actually good.
What's Next?
Now that you understand the three-way split, you're ready for:
- Cross-Validation Deep Dive — K-fold, stratified, time series CV
- Overfitting vs Underfitting — Diagnosing model problems
- Learning Curves — Visualizing the train-validation gap
- Evaluation Metrics — Beyond accuracy
Follow me for the next article in this series!
Let's Connect!
If this finally explained why you need three datasets, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
Have you been burned by evaluating on training data? We've all been Marcus at some point. Share your stories!
The difference between a model that reports 99% accuracy and one that actually works in production? Understanding that the chef can't grade their own dish. Training is practice. Validation is dress rehearsal. Test is opening night — and you only get one shot.
Share this with someone who keeps checking test accuracy during training. They're tasting their own food 347 times and wondering why the judges disagree.
Happy splitting! 🍳