Sachin Kr. Rajput

Train/Validation/Test Split: Why Your Model Needs Practice, Dress Rehearsals, AND Opening Night

The One-Line Summary: Training data is for learning, validation data is for tuning, and test data is for the final honest evaluation. Grade your model on its training data and you're letting students grade their own exams. Lean on the validation set too many times and it becomes just another training set.


The MasterChef Disaster

Chef Marcus was competing in the biggest cooking competition of his life.

For three months, he practiced. Every night, he'd cook his signature dish, taste it, adjust the seasoning, taste again, adjust the temperature, taste again, tweak the plating.

By competition day, he'd tasted that dish 347 times.

His internal rating: Perfect. 10/10. Flawless.


The judges took one bite.

"It's... overwhelming. The salt is too aggressive. The sauce is trying too hard."

Marcus was stunned. How could his perfect dish fail?


Here's what happened:

Marcus had been optimizing for his own taste buds.

After 347 tastings, he'd unconsciously adjusted everything to what HE liked. More salt (he loved salt). Bolder sauce (he craved intensity). He wasn't making a great dish — he was making a Marcus-flavored dish.

The judges had never tasted it before. They experienced it fresh. And fresh, it was overwhelming.


This is what happens when you evaluate your model on training data.

The model has "tasted" that data hundreds of times. It has adjusted itself to perfectly match those exact examples. It's not learning general patterns — it's memorizing Marcus's preferences.

Then it meets new data (the judges). And it fails.


The Three Audiences Every Chef Needs

To win the competition, Marcus needed THREE different audiences:

Audience 1: Himself (Training Set)

Purpose: Practice and learn
Frequency: Unlimited tastings
Feedback: Immediate, continuous
Can adjust based on feedback? YES — that's the whole point!

"Too bland → add salt"
"Sauce too thin → reduce longer"
"Meat overcooked → lower temperature"

This is where learning happens. Marcus experiments, fails, adjusts, fails again, adjusts more. The dish evolves.

But he can't trust his own rating. He's tasted it too many times. He's biased.


Audience 2: Trusted Friends (Validation Set)

Purpose: Get outside feedback to tune the dish
Frequency: A few times during preparation
Feedback: Independent opinions
Can adjust based on feedback? YES — but carefully!

Friend: "Needs more acidity"
Marcus: *adds lemon*
Friend: "Better! But maybe slightly less salt now"
Marcus: *reduces salt*

Friends haven't tasted it 347 times. They're fresher. Their feedback helps Marcus escape his own biases.

But there's a danger: If Marcus keeps adjusting based on these specific friends' preferences, eventually he's just optimizing for THEM. They become another version of "himself."


Audience 3: The Competition Judges (Test Set)

Purpose: Final, honest evaluation
Frequency: ONCE. That's it.
Feedback: The real score
Can adjust based on feedback? NO — it's too late!

Judge: "8.5/10"
Marcus: "Can I adjust and try again?"
Judge: "No. Next contestant."

The judges are completely fresh. They've never seen the dish. They have no history with Marcus. This is the TRUE test of whether the dish is good.

You only get ONE shot. If Marcus kept re-entering the competition with adjusted dishes until he won, he'd just be memorizing what those specific judges like — not making a universally great dish.


Translating to Machine Learning

Cooking Competition              Machine Learning
─────────────────────────────    ─────────────────────────────
Marcus tasting his own dish      Evaluating on training data
Friends giving feedback          Evaluating on validation data
Competition judges               Evaluating on test data
Marcus adjusting seasoning       Model learning weights
Adjusting based on friends       Tuning hyperparameters
Final score from judges          Reported model performance

Why You Need All Three

Let me prove each one is necessary.

Why Training Data Isn't Enough for Evaluation

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Train a model
model = DecisionTreeClassifier(max_depth=None)  # No limits = will memorize!
model.fit(X, y)

# Evaluate on TRAINING data (what Marcus did)
train_accuracy = model.score(X, y)
print(f"Training accuracy: {train_accuracy:.1%}")

# But wait... let's check on fresh data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

print(f"\nTraining accuracy: {train_acc:.1%}")
print(f"Test accuracy: {test_acc:.1%}")
print(f"\nThe model LIED by {train_acc - test_acc:.1%}!")

Output:

Training accuracy: 100.0%

Training accuracy: 100.0%
Test accuracy: 88.5%

The model LIED by 11.5%!

100% on training, but only 88.5% on new data!

The model memorized the training data. It's Marcus thinking his dish is "perfect" because he's tasted it 347 times.


Why Two Sets Aren't Enough (Train + Test)

"Okay," you say, "I'll just use a test set!"

But what happens when you tune your model?

# Scenario: Tuning hyperparameters using test set

from sklearn.ensemble import RandomForestClassifier

# Split into train and test only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Try different hyperparameters, check on test set each time
best_score = 0
best_params = {}

for n_estimators in [10, 50, 100, 200]:
    for max_depth in [3, 5, 10, 20, None]:
        model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
        model.fit(X_train, y_train)

        # ❌ WRONG: Using test set for tuning decisions!
        test_score = model.score(X_test, y_test)

        if test_score > best_score:
            best_score = test_score
            best_params = {'n_estimators': n_estimators, 'max_depth': max_depth}

print(f"Best test score: {best_score:.1%}")
print(f"Best params: {best_params}")

# Problem: This "test score" is now OPTIMISTIC!
# We've tuned specifically to do well on THIS test set.
# On truly new data, we'll likely do worse.

What happened?

By checking the test score 20 times and picking the best, we've leaked information from the test set into our model selection process.

The test set is no longer "fresh." It's now just another form of training data — we've optimized for it.

This is like Marcus calling the judges before the competition: "Hey, do you prefer more salt or less?" It's cheating!


The Three-Way Split: The Honest Approach

import numpy as np
from sklearn.model_selection import train_test_split

# Split into THREE sets
# First split: separate test set (final evaluation)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Second split: separate validation from training
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
# 0.25 of the remaining 80% = 20% of the total, so we get a 60/20/20 split

print(f"Training set:   {len(X_train)} samples ({len(X_train)/len(X):.0%})")
print(f"Validation set: {len(X_val)} samples ({len(X_val)/len(X):.0%})")
print(f"Test set:       {len(X_test)} samples ({len(X_test)/len(X):.0%})")

Output:

Training set:   600 samples (60%)
Validation set: 200 samples (20%)
Test set:       200 samples (20%)

Now the workflow is honest:

from sklearn.ensemble import RandomForestClassifier

# Step 1: Tune hyperparameters using VALIDATION set
best_val_score = 0
best_params = {}

for n_estimators in [10, 50, 100, 200]:
    for max_depth in [3, 5, 10, 20, None]:
        model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
        model.fit(X_train, y_train)

        # ✅ RIGHT: Use VALIDATION set for tuning!
        val_score = model.score(X_val, y_val)

        if val_score > best_val_score:
            best_val_score = val_score
            best_params = {'n_estimators': n_estimators, 'max_depth': max_depth}

print(f"Best validation score: {best_val_score:.1%}")
print(f"Best params: {best_params}")

# Step 2: Train final model with best params
final_model = RandomForestClassifier(**best_params, random_state=42)
final_model.fit(X_train, y_train)

# Step 3: Evaluate ONCE on test set (the judges!)
test_score = final_model.score(X_test, y_test)
print(f"\n🎯 Final test score: {test_score:.1%}")
print("This is the honest estimate of real-world performance!")

Output:

Best validation score: 91.5%
Best params: {'n_estimators': 100, 'max_depth': 10}

🎯 Final test score: 90.0%
This is the honest estimate of real-world performance!

Notice: Test score (90.0%) is close to validation score (91.5%). That's a good sign — our validation set gave us an honest estimate!


The Visual Guide

YOUR DATA
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
█████████████████████████████████████████████████████████████


AFTER SPLITTING
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

████████████████████████████████████ ████████████ ████████████
        TRAINING (60%)               VALID (20%)   TEST (20%)
              │                          │              │
              │                          │              │
              ▼                          ▼              ▼
         Learn patterns           Tune & select      Final grade
         Adjust weights          hyperparameters    ONCE. NO PEEKING.
         Many iterations         Multiple times      Report this score!

The Rules

Rule 1: Test Set Is Sacred

# ❌ NEVER do this
best = float("inf")
for epoch in range(100):
    model.fit(X_train, y_train)
    test_loss = model.evaluate(X_test, y_test)  # Peeking!
    if test_loss < best:
        best = test_loss
        save_model()  # Selecting based on test!

# ✅ Do this instead
best = float("inf")
for epoch in range(100):
    model.fit(X_train, y_train)
    val_loss = model.evaluate(X_val, y_val)  # Validation only!
    if val_loss < best:
        best = val_loss
        save_model()

# Only at the very end:
final_score = model.evaluate(X_test, y_test)

Rule 2: Validation Can Be Reused (Carefully)

Unlike the test set, you CAN look at validation scores multiple times. That's the whole point of having one!

But be aware: tune against it too aggressively and you start overfitting to the validation set as well.

# This is okay:
for params in parameter_grid:
    model = train(params)
    val_score = evaluate(model, X_val)  # ✓

# But if you do this 10,000 times, you might overfit to validation too!
# Solution: Use cross-validation for more robust estimates

Rule 3: Keep Proportions Reasonable

Common splits:
├── 60% train / 20% validation / 20% test (balanced)
├── 70% train / 15% validation / 15% test (more training)
├── 80% train / 10% validation / 10% test (lots of data)
└── 50% train / 25% validation / 25% test (small dataset, need reliable estimates)

Rule of thumb:
├── Test set: At least 1,000 samples if possible
├── Validation set: At least 1,000 samples if possible
└── Smaller datasets: Use cross-validation instead!
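If you want the split fractions as explicit knobs, here is a minimal helper sketch built on two calls to train_test_split. The function name three_way_split and its defaults are mine, not part of scikit-learn; the stratify flag is explained in Rule 4 below.

from sklearn.model_selection import train_test_split

def three_way_split(X, y, val_size=0.2, test_size=0.2, stratify=False, random_state=42):
    """Split into train/validation/test, with sizes given as fractions of the FULL dataset."""
    strat = y if stratify else None
    # Carve off the test set first
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, stratify=strat, random_state=random_state
    )
    # val_size is a fraction of the whole dataset, so rescale it relative to what's left
    strat_rest = y_rest if stratify else None
    relative_val = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=relative_val, stratify=strat_rest, random_state=random_state
    )
    return X_train, X_val, X_test, y_train, y_val, y_test

# 60/20/20 by default; change val_size/test_size for the other ratios above
X_train, X_val, X_test, y_train, y_val, y_test = three_way_split(X, y)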

Rule 4: Stratify for Classification

Preserve class proportions in all splits!

import pandas as pd
from sklearn.model_selection import train_test_split

# ❌ WRONG: Random split might give unbalanced classes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# ✅ RIGHT: Stratified split preserves class ratios
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

# Verify
print("Full data class distribution:")
print(pd.Series(y).value_counts(normalize=True))

print("\nTest set class distribution:")
print(pd.Series(y_test).value_counts(normalize=True))

# They should match!

Rule 5: Time Series Needs Special Handling

For time-based data, you can't randomly split — that's data leakage!

# ❌ WRONG: Random split on time series
X_train, X_test = train_test_split(stock_data)
# Training might include 2024 data, test might include 2020!

# ✅ RIGHT: Chronological split
train = data[data['date'] < '2023-01-01']
val = data[(data['date'] >= '2023-01-01') & (data['date'] < '2024-01-01')]
test = data[data['date'] >= '2024-01-01']

# Timeline:
# [---- TRAIN ----][-- VAL --][-- TEST --]
# 2015           2023       2024        2025
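If you also want cross-validation on time-ordered data, scikit-learn's TimeSeriesSplit builds expanding-window folds where the validation slice always comes after the training slice. A minimal sketch, assuming your feature matrix X_sorted is already sorted by date (that name is mine, for illustration):

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)

for fold, (train_idx, val_idx) in enumerate(tscv.split(X_sorted), start=1):
    # Each fold trains on an earlier window and validates on the period right after it
    print(f"Fold {fold}: train rows {train_idx[0]}-{train_idx[-1]}, "
          f"validate rows {val_idx[0]}-{val_idx[-1]}")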

Cross-Validation: When Three Splits Aren't Enough

With small datasets, a single validation split might be unreliable. Enter k-fold cross-validation:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

model = RandomForestClassifier(random_state=42)

# Instead of one validation split, use 5 different ones!
scores = cross_val_score(model, X_train, y_train, cv=5)

print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.1%} ± {scores.std():.1%}")

How it works:

5-Fold Cross-Validation:

Fold 1: [VAL][TRAIN][TRAIN][TRAIN][TRAIN] → Score 1
Fold 2: [TRAIN][VAL][TRAIN][TRAIN][TRAIN] → Score 2
Fold 3: [TRAIN][TRAIN][VAL][TRAIN][TRAIN] → Score 3
Fold 4: [TRAIN][TRAIN][TRAIN][VAL][TRAIN] → Score 4
Fold 5: [TRAIN][TRAIN][TRAIN][TRAIN][VAL] → Score 5

Final estimate = Average of all 5 scores
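To make that diagram concrete, here is roughly what cross_val_score does under the hood, sketched with scikit-learn's KFold and the variable names from the earlier examples:

from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for fold, (train_idx, val_idx) in enumerate(kf.split(X_train), start=1):
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train[train_idx], y_train[train_idx])        # train on 4 folds
    score = model.score(X_train[val_idx], y_train[val_idx])  # validate on the held-out fold
    fold_scores.append(score)
    print(f"Fold {fold}: {score:.1%}")

print(f"Mean: {sum(fold_scores) / len(fold_scores):.1%}")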

Still keep a separate test set! Cross-validation replaces the single validation split, not the test set.

# Complete workflow with CV
from sklearn.model_selection import cross_val_score, GridSearchCV

# 1. Hold out test set
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

# 2. Use cross-validation for hyperparameter tuning
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, None]}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,  # 5-fold cross-validation
    scoring='accuracy'
)
grid_search.fit(X_trainval, y_trainval)

print(f"Best CV score: {grid_search.best_score_:.1%}")
print(f"Best params: {grid_search.best_params_}")

# 3. Final evaluation on test set (ONCE!)
test_score = grid_search.score(X_test, y_test)
print(f"\n🎯 Final test score: {test_score:.1%}")

The Complete Picture

                                YOUR WORKFLOW
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

                              ┌─────────────┐
                              │  ALL DATA   │
                              └──────┬──────┘
                                     │
                    ┌────────────────┴────────────────┐
                    │                                 │
                    ▼                                 ▼
           ┌───────────────┐                 ┌──────────────┐
           │  TRAIN + VAL  │                 │     TEST     │
           │    (80%)      │                 │    (20%)     │
           └───────┬───────┘                 └──────────────┘
                   │                                 │
                   │                                 │
     ┌─────────────┼─────────────┐                   │
     │             │             │                   │
     ▼             ▼             ▼                   │
 ┌───────┐    ┌───────┐    ┌───────┐                 │
 │Fold 1 │    │Fold 2 │    │Fold 3 │   ...           │
 └───┬───┘    └───┬───┘    └───┬───┘                 │
     │            │            │                     │
     └────────────┼────────────┘                     │
                  │                                  │
                  ▼                                  │
         ┌─────────────────┐                         │
         │ Cross-Validation│                         │
         │     Scores      │                         │
         │  (for tuning)   │                         │
         └────────┬────────┘                         │
                  │                                  │
                  │    Select best model             │
                  │                                  │
                  ▼                                  ▼
         ┌─────────────────┐                ┌──────────────────┐
         │   BEST MODEL    │ ──────────────▶│  FINAL EVALUATION │
         │  (from tuning)  │   Evaluate     │   (Report this!)  │
         └─────────────────┘   ONCE         └──────────────────┘

Common Mistakes

Mistake 1: Evaluating on Training Data

# ❌ WRONG: This tells you NOTHING useful
model.fit(X, y)
print(f"Accuracy: {model.score(X, y)}")  # Meaningless!

# ✅ RIGHT: Evaluate on held-out data
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test)}")

Mistake 2: Tuning on Test Data

# ❌ WRONG: Test set is now contaminated
for params in grid:
    model = Model(params)
    model.fit(X_train, y_train)
    if model.score(X_test, y_test) > best:  # NO!
        save(model)

# ✅ RIGHT: Use validation for tuning
for params in grid:
    model = Model(params)
    model.fit(X_train, y_train)
    if model.score(X_val, y_val) > best:  # YES!
        save(model)
# Then evaluate on test ONCE at the end

Mistake 3: Peeking at Test Multiple Times

# ❌ WRONG: "Let me just check if this helps..."
model_v1 = train(config_1)
print(model_v1.score(X_test, y_test))  # Peek 1

model_v2 = train(config_2)
print(model_v2.score(X_test, y_test))  # Peek 2

model_v3 = train(config_3)
print(model_v3.score(X_test, y_test))  # Peek 3

# You've now tuned to the test set!

# ✅ RIGHT: Use validation for all comparisons
# Test only at the very end, ONCE

Mistake 4: Preprocessing Before Splitting

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# ❌ WRONG: Data leakage!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Scaler learns mean/std from ALL data, test included
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)

# ✅ RIGHT: Split first, then preprocess
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
scaler.fit(X_train)  # Learn scaling parameters from train only!
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
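An easy way to make this mistake hard to commit is to wrap the preprocessing and the model in a scikit-learn Pipeline: the scaler is then re-fit on the training portion of every split, including inside cross-validation. A minimal sketch (LogisticRegression is just an example estimator that actually benefits from scaling):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The pipeline fits the scaler ONLY on the training rows of each fold,
# then applies that same transformation to the validation rows
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"Leakage-safe CV accuracy: {scores.mean():.1%}")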

Mistake 5: Random Split on Time Series

# ❌ WRONG: Future predicting past!
X_train, X_test = train_test_split(time_series_data, shuffle=True)

# ✅ RIGHT: Chronological split
X_train = data[data['date'] < cutoff_date]
X_test = data[data['date'] >= cutoff_date]

Quick Reference

Set          Purpose                How often to use        Can adjust based on it?
──────────   ────────────────────   ─────────────────────   ───────────────────────
Training     Learn patterns         Every epoch/iteration   Yes — that's learning!
Validation   Tune hyperparameters   Multiple times          Yes — that's tuning!
Test         Final evaluation       ONCE                    NO — too late!

Key Takeaways

  1. Training data is for learning — The model adjusts to fit these patterns

  2. Validation data is for tuning — Use it to select hyperparameters and architectures

  3. Test data is for final evaluation — Touch it ONCE, report that score

  4. Training accuracy is meaningless — It only shows memorization ability

  5. Tuning on test = cheating — You're optimizing for that specific data

  6. Stratify for classification — Keep class proportions consistent

  7. Time series needs chronological splits — No shuffling!

  8. Use cross-validation for small datasets — More reliable than a single split


The One-Sentence Summary

Training data is Marcus tasting his own dish 347 times, validation data is his friends giving feedback, and test data is the competition judges he's never met — only one of these tells you if the dish is actually good.


What's Next?

Now that you understand the three-way split, you're ready for:

  • Cross-Validation Deep Dive — K-fold, stratified, time series CV
  • Overfitting vs Underfitting — Diagnosing model problems
  • Learning Curves — Visualizing the train-validation gap
  • Evaluation Metrics — Beyond accuracy

Follow me for the next article in this series!


Let's Connect!

If this finally explained why you need three datasets, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

Have you been burned by evaluating on training data? We've all been Marcus at some point. Share your stories!


The difference between a model that reports 99% accuracy and one that actually works in production? Understanding that the chef can't grade their own dish. Training is practice. Validation is dress rehearsal. Test is opening night — and you only get one shot.


Share this with someone who keeps checking test accuracy during training. They're tasting their own food 347 times and wondering why the judges disagree.

Happy splitting! 🍳
