Sachin Kr. Rajput

Validation Set vs Test Set: The Student Who Aced Every Practice SAT But Bombed the Real One

The One-Line Summary: The validation set is for tuning and decision-making during development (you CAN look at it repeatedly). The test set is for final evaluation only (you touch it ONCE, at the very end). Mixing them up is like practicing on a leaked copy of the real SAT and expecting that inflated score to hold when the questions change.


The Tale of Two Students

Sarah and Mike are both preparing for the SAT.


Sarah's Strategy: "Practice Smart"

SARAH'S STUDY PLAN:

1. Learn the material (TRAINING)
   - Read textbooks
   - Do practice problems
   - Learn strategies

2. Take practice SATs (VALIDATION)
   - Take practice test #1 → Score: 1180
   - Analyze mistakes: "Weak on geometry"
   - Study geometry more

   - Take practice test #2 → Score: 1280
   - Analyze mistakes: "Time management issues"
   - Practice speed

   - Take practice test #3 → Score: 1350
   - "Getting better!"

3. Take the REAL SAT (TEST)
   - First and only time seeing these questions
   - Final score: 1340

✓ Practice score (1350) matched real score (1340)
✓ Sarah knew what to expect

Mike's Strategy: "Optimize for the Test"

MIKE'S STUDY PLAN:

1. Learn the material (TRAINING)
   - Same as Sarah

2. Get the real SAT somehow (this SHOULD be his TEST, but Mike treats it as VALIDATION)
   - "I found a leaked copy of the actual exam!"

   - Attempt #1 → Score: 1150
   - Study the specific questions he missed

   - Attempt #2 → Score: 1320
   - Memorize the tricky ones

   - Attempt #3 → Score: 1480
   - "I'm going to crush this!"

3. Take the REAL SAT (Same questions he practiced!)
   - Wait... the exam was updated!
   - Different questions than what he memorized
   - Final score: 1090

✗ Practice score (1480) DIDN'T match real score (1090)
✗ Mike optimized for specific questions, not general knowledge

What Went Wrong?

Mike used his TEST set as a VALIDATION set.

He made decisions based on the test questions:

  • "I'll study this specific type of problem" (because it's on the test)
  • "I'll memorize this formula" (because this exact problem appears)
  • "I'll skip that topic" (because it's not on the test)

His "95th percentile" score was an illusion. He overfit to the test.


The Three Datasets Defined

┌─────────────────────────────────────────────────────────────────┐
│                        YOUR FULL DATA                           │
├─────────────────────┬──────────────────┬───────────────────────┤
│                     │                  │                       │
│   TRAINING SET      │  VALIDATION SET  │     TEST SET          │
│      (60-70%)       │    (15-20%)      │     (15-20%)          │
│                     │                  │                       │
├─────────────────────┼──────────────────┼───────────────────────┤
│                     │                  │                       │
│  Purpose:           │  Purpose:        │  Purpose:             │
│  LEARN patterns     │  TUNE & DECIDE   │  FINAL EVALUATION     │
│                     │                  │                       │
│  Used for:          │  Used for:       │  Used for:            │
│  • Training model   │  • Hyperparameter│  • Unbiased estimate  │
│  • Fitting weights  │    tuning        │    of real-world      │
│  • Learning         │  • Model select. │    performance        │
│                     │  • Early stopping│                       │
│                     │  • Architecture  │                       │
│                     │                  │                       │
│  How often used:    │  How often used: │  How often used:      │
│  Every epoch/iter   │  Many times      │  ONCE (at the end!)   │
│                     │                  │                       │
│  Can influence      │  Can influence   │  CANNOT influence     │
│  model? YES         │  model? YES      │  model? NO            │
│                     │                  │                       │
└─────────────────────┴──────────────────┴───────────────────────┘
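
If you'd rather see those percentages as code, here's a minimal sketch of a reusable three-way split. It assumes X and y are your features and labels, and train_val_test_split is just a hypothetical helper name built on scikit-learn's train_test_split:

from sklearn.model_selection import train_test_split

def train_val_test_split(X, y, val_size=0.2, test_size=0.2, random_state=42):
    """Split classification data into train/validation/test (default 60/20/20)."""
    # Carve off the test set first so it can be locked away immediately
    X_dev, X_test, y_dev, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )
    # val_size is a fraction of the WHOLE dataset, so rescale it to a
    # fraction of the remaining development data (0.2 / 0.8 = 0.25 here)
    relative_val = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_dev, y_dev, test_size=relative_val, stratify=y_dev, random_state=random_state
    )
    return X_train, X_val, X_test, y_train, y_val, y_test

X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(X, y)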

The Key Differences

Validation Set: Your Study Buddy

THE VALIDATION SET IS FOR:

✓ Hyperparameter tuning
  "Learning rate 0.01 gives val_acc=0.89, learning rate 0.001 gives val_acc=0.92"
  → Choose 0.001

✓ Model selection
  "Random Forest: val_acc=0.88, XGBoost: val_acc=0.91, Neural Net: val_acc=0.87"
  → Choose XGBoost

✓ Early stopping
  "Epoch 10: val_loss=0.32, Epoch 11: val_loss=0.31, Epoch 12: val_loss=0.33"
  → Stop at epoch 11

✓ Architecture decisions
  "3 layers: val_acc=0.85, 5 layers: val_acc=0.89, 7 layers: val_acc=0.88"
  → Choose 5 layers

✓ Feature selection
  "With feature X: val_acc=0.90, Without feature X: val_acc=0.87"
  → Keep feature X

YOU CAN LOOK AT IT MANY TIMES.
YOUR DECISIONS ARE INFLUENCED BY IT.
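
One of those uses, early stopping, is worth seeing in code. Here's a minimal patience-based sketch, where train_one_epoch() and evaluate() are hypothetical placeholders for your own training and evaluation functions:

best_val_loss = float("inf")
patience, epochs_without_improvement = 3, 0

for epoch in range(1000):
    train_one_epoch()                  # update the model on the TRAINING set
    val_loss = evaluate(X_val, y_val)  # monitor the VALIDATION set, never test
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Stopping at epoch {epoch}: no val improvement in {patience} epochs")
            break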

Test Set: The Final Exam

THE TEST SET IS FOR:

✓ Final performance estimate
  "After all tuning, what accuracy should we expect in production?"

✓ Reporting results
  "Our model achieves 91% accuracy on held-out test data."

✓ Sanity check before deployment
  "Does test performance match validation performance?"

✗ NOT for tuning
✗ NOT for model selection  
✗ NOT for any decisions

YOU LOOK AT IT ONCE.
AFTER YOU LOOK, YOU'RE DONE.
NO GOING BACK TO TUNE.
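
A practical way to enforce the "look once" rule is to physically separate the test split the moment you create it. A minimal sketch, assuming NumPy arrays and a hypothetical test_set.npz file:

import numpy as np

# Save the test split to its own file right after splitting, then delete
# the in-memory copies so everyday experiments literally cannot touch it
np.savez("test_set.npz", X_test=X_test, y_test=y_test)
del X_test, y_test

# ... weeks of training, tuning, and model selection on train/val only ...

# Load it back exactly once, at the very end, for the final evaluation
held_out = np.load("test_set.npz")
final_score = best_model.score(held_out["X_test"], held_out["y_test"])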

The Danger of Test Set Leakage

When you use the test set for decisions, you're doing what Mike did:

# ❌ THE WRONG WAY (Test set leakage)

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

# Load data (only two splits, so "test" ends up doing validation's job)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Try different models
for model in [RandomForestClassifier(), XGBClassifier(), MLPClassifier()]:
    model.fit(X_train, y_train)
    test_score = model.score(X_test, y_test)  # 👈 PEEKING at test!
    print(f"{type(model).__name__}: {test_score:.4f}")

# "XGBoost had the best TEST score, let's use that!"
# This decision was INFLUENCED by the test set
# Your reported "test accuracy" is now OPTIMISTIC

# Try different hyperparameters
for lr in [0.001, 0.01, 0.1]:
    model = XGBClassifier(learning_rate=lr)
    model.fit(X_train, y_train)
    test_score = model.score(X_test, y_test)  # 👈 PEEKING again!
    print(f"lr={lr}: {test_score:.4f}")

# "lr=0.01 had the best TEST score!"
# MORE leakage! Every peek contaminates your estimate.

Result: Your reported "test accuracy" of 94% is a lie. Real-world performance will be lower.


# ✅ THE RIGHT WAY (Test set protected)

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

# Load data - split into THREE sets
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25)
# Result: 60% train, 20% val, 20% test

# Try different models using the VALIDATION set
for model in [RandomForestClassifier(), XGBClassifier(), MLPClassifier()]:
    model.fit(X_train, y_train)
    val_score = model.score(X_val, y_val)  # 👈 Using VALIDATION
    print(f"{type(model).__name__}: {val_score:.4f}")

# "XGBoost had the best VALIDATION score, let's use that!"
best_model = XGBClassifier()

# Tune hyperparameters using the VALIDATION set
for lr in [0.001, 0.01, 0.1]:
    model = XGBClassifier(learning_rate=lr)
    model.fit(X_train, y_train)
    val_score = model.score(X_val, y_val)  # 👈 Using VALIDATION
    print(f"lr={lr}: {val_score:.4f}")

# "lr=0.01 had the best VALIDATION score!"
best_model = XGBClassifier(learning_rate=0.01)

# FINAL: Train on train+val, evaluate on test ONCE
X_dev = np.vstack([X_train, X_val])
y_dev = np.hstack([y_train, y_val])
best_model.fit(X_dev, y_dev)

test_score = best_model.score(X_test, y_test)  # 👈 FIRST and ONLY peek!
print(f"Final test score: {test_score:.4f}")
# This is your TRUE expected performance

Visual: The Information Flow

                    VALIDATION SET                    TEST SET
                    ──────────────                    ────────
                          │                               │
                          ▼                               │
               ┌─────────────────────┐                   │
               │ Can I see results?  │                   │
               │        YES          │                   │
               └─────────────────────┘                   │
                          │                               │
                          ▼                               │
               ┌─────────────────────┐                   │
               │ Can I make changes  │                   │
               │ based on results?   │                   │
               │        YES          │                   │
               └─────────────────────┘                   │
                          │                               │
                          ▼                               ▼
               ┌─────────────────────┐       ┌─────────────────────┐
               │ Loop back and try   │       │ Can I see results?  │
               │ something new?      │       │        YES          │
               │        YES          │       │   (but only ONCE)   │
               └─────────────────────┘       └─────────────────────┘
                          │                               │
                          │                               ▼
                          │                  ┌─────────────────────┐
                          │                  │ Can I make changes? │
                          │                  │         NO!         │
                          │                  │    GAME OVER        │
                          │                  └─────────────────────┘
                          │                               │
                          ▼                               ▼
               ┌─────────────────────┐       ┌─────────────────────┐
               │   DECISIONS MADE:   │       │   DECISION MADE:    │
               │ • Best model        │       │ • Deploy or not     │
               │ • Best hyperparams  │       │   (nothing else!)   │
               │ • Best features     │       │                     │
               │ • When to stop      │       │                     │
               └─────────────────────┘       └─────────────────────┘

Why Validation Performance ≠ Test Performance

Even with proper separation, validation and test scores differ. Here's why:

Reason 1: Validation Overfitting

# You tried 100 hyperparameter combinations
# You picked the ONE with best validation score
# By random chance, some combo looked better than it truly is

# It's like flipping 100 coins and reporting the best streak
# "I got 8 heads in a row!" — not skill, just variance
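
The coin analogy is easy to simulate. A small sketch: 100 hyperparameter combos that are all secretly identical (true accuracy 0.85), each scored on a 500-sample validation set, with the best one reported:

import numpy as np

rng = np.random.default_rng(0)
true_accuracy = 0.85          # every "combo" is secretly identical
n_combos, n_val = 100, 500    # 100 configs, 500 validation samples each

# Each validation score is the true accuracy plus binomial sampling noise
val_scores = rng.binomial(n_val, true_accuracy, size=n_combos) / n_val

print(f"True accuracy:         {true_accuracy:.3f}")
print(f"Average val score:     {val_scores.mean():.3f}")  # unbiased, near 0.85
print(f"Best-of-100 val score: {val_scores.max():.3f}")   # noticeably higher
# The winner's edge is pure noise, just like the best streak of 100 coins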

Reason 2: Distribution Differences

# Even with random splits, val and test have slight differences
# Your tuning optimized for val's specific quirks

# Example: Val set has 52% class A, Test has 48%
# Model tuned slightly toward val's distribution
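
This is easy to check before you start tuning: compare the class balance of your splits. A minimal sketch assuming binary 0/1 labels stored in NumPy arrays:

import numpy as np

def class_balance(y):
    """Fraction of samples in class 1 (binary 0/1 labels assumed)."""
    return np.mean(y == 1)

print(f"Train: {class_balance(y_train):.1%} class 1")
print(f"Val:   {class_balance(y_val):.1%} class 1")
print(f"Test:  {class_balance(y_test):.1%} class 1")
# Small gaps (e.g. 52% vs 48%) are normal with random splits;
# passing stratify=y to train_test_split keeps them nearly identical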

Reason 3: Multiple Comparisons

# Every decision based on validation is a "peek"
# Model A vs B? → Peek 1
# Layers 3 vs 5 vs 7? → Peek 2
# Learning rate 0.001 vs 0.01? → Peek 3
# Feature X included or not? → Peek 4

# Each peek leaks a tiny bit of validation info into your model
# 50 decisions = 50 tiny leaks = noticeable optimism
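
Reusing the coin-flip setup from Reason 1, a quick sweep over the number of peeks shows how the optimism compounds:

import numpy as np

rng = np.random.default_rng(0)
true_accuracy, n_val = 0.85, 500  # every option is equally good

for k in [1, 5, 20, 50, 200]:
    # Simulate 10,000 projects, each keeping the best-looking of k options
    best = rng.binomial(n_val, true_accuracy, size=(10_000, k)).max(axis=1) / n_val
    print(f"{k:>3} peeks -> best val score ≈ {best.mean():.3f} "
          f"(optimism {best.mean() - true_accuracy:+.3f})")
# One peek is unbiased; many peeks inflate the estimate by a few points
# of accuracy that the test set will not reproduce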

This is why you need a separate test set — to measure AFTER all the peeking is done.


The Workflow: Step by Step

import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report

# ============================================================
# STEP 1: Create the three splits
# ============================================================
# First, separate test set (this will be LOCKED AWAY)
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Then, split development into train and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, stratify=y_dev, random_state=42
)

print("Dataset splits:")
print(f"  Training:   {len(X_train)} samples ({len(X_train)/len(X)*100:.0f}%)")
print(f"  Validation: {len(X_val)} samples ({len(X_val)/len(X)*100:.0f}%)")
print(f"  Test:       {len(X_test)} samples ({len(X_test)/len(X)*100:.0f}%)")
print(f"\n🔒 Test set is now LOCKED. Do not touch until the end!")

# ============================================================
# STEP 2: Model selection using VALIDATION set
# ============================================================
print("\n" + "="*60)
print("STEP 2: Model Selection (using validation set)")
print("="*60)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42)
}

val_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_score = model.score(X_val, y_val)
    val_scores[name] = val_score
    print(f"  {name}: val_accuracy = {val_score:.4f}")

best_model_name = max(val_scores, key=val_scores.get)
print(f"\n✓ Best model: {best_model_name}")

# ============================================================
# STEP 3: Hyperparameter tuning using VALIDATION set
# ============================================================
print("\n" + "="*60)
print("STEP 3: Hyperparameter Tuning (using validation set)")
print("="*60)

# Using cross-validation on training set for more robust tuning
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,  # 5-fold CV on training set
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)

print(f"  Best params: {grid_search.best_params_}")
print(f"  Best CV score: {grid_search.best_score_:.4f}")

# Validate on validation set
val_score_tuned = grid_search.score(X_val, y_val)
print(f"  Validation score: {val_score_tuned:.4f}")

# ============================================================
# STEP 4: Final training on ALL development data
# ============================================================
print("\n" + "="*60)
print("STEP 4: Final Training")
print("="*60)

final_model = RandomForestClassifier(
    **grid_search.best_params_,
    random_state=42
)
final_model.fit(X_dev, y_dev)  # Train on train + val combined
print("  ✓ Model trained on full development set (train + val)")

# ============================================================
# STEP 5: FINAL evaluation on TEST set (ONCE!)
# ============================================================
print("\n" + "="*60)
print("STEP 5: Final Test Evaluation (ONE TIME ONLY!)")
print("="*60)
print("🔓 Unlocking test set...")

y_pred = final_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
test_f1 = f1_score(y_test, y_pred, average='weighted')

print(f"\n  FINAL TEST RESULTS:")
print(f"  Accuracy: {test_accuracy:.4f}")
print(f"  F1 Score: {test_f1:.4f}")

print(f"\n  Validation estimate was: {val_score_tuned:.4f}")
print(f"  Test result is:          {test_accuracy:.4f}")
print(f"  Difference:              {val_score_tuned - test_accuracy:+.4f}")

if abs(val_score_tuned - test_accuracy) < 0.02:
    print("\n  ✓ Validation estimate was reliable!")
else:
    print("\n  ⚠️ Gap detected — possible validation overfitting")

print("\n🔒 Test set used. No more changes allowed!")

When Can You Skip the Validation Set?

Option 1: Cross-Validation as Validation

Instead of a held-out validation set, use K-fold CV on training data:

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

# Split only into dev and test
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2)

# Use cross-validation for all tuning decisions
# ('model' and 'param_grid' are whatever estimator and grid you're tuning)
scores = cross_val_score(model, X_dev, y_dev, cv=5)
print(f"CV score: {scores.mean():.4f} ± {scores.std():.4f}")

# Grid search also uses CV internally
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_dev, y_dev)

# Final test evaluation (still only ONCE!)
final_score = grid_search.score(X_test, y_test)
print(f"Final test score: {final_score:.4f}")

Advantage: More training data (no separate validation holdout)
Disadvantage: Slower (K training runs per evaluation)


Option 2: Very Large Datasets

With millions of samples, a single validation set is fine:

# 10 million samples
# 70/15/15 split still gives:
# - 7M training samples
# - 1.5M validation samples (plenty!)
# - 1.5M test samples (plenty!)

# Random variation is tiny with 1.5M samples
# Single validation set is statistically robust
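
The "statistically robust" claim can be made concrete: the standard error of an accuracy estimate shrinks with the square root of the evaluation set size. A quick sketch, assuming a true accuracy of 90%:

import numpy as np

def accuracy_std_error(p, n):
    """Standard error of an accuracy estimate p measured on n samples."""
    return np.sqrt(p * (1 - p) / n)

p = 0.90
for n in [1_500, 15_000, 1_500_000]:
    half_width = 1.96 * accuracy_std_error(p, n)
    print(f"n = {n:>9,}: accuracy 90% ± {half_width:.2%} (95% interval)")
# With 1,500 validation samples the interval is roughly ±1.5 points;
# with 1.5M samples it shrinks to about ±0.05 points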

The Mistakes That Kill Models

Mistake 1: "I'll Just Peek at Test Once"

# ❌ The slippery slope

# "Let me just check test score to see if I'm on the right track"
test_score = model.score(X_test, y_test)  # Peek 1
# "Hmm, not great. Let me tune more..."

# (tunes hyperparameters)

# "Let me check if that helped"
test_score = model.score(X_test, y_test)  # Peek 2
# "Better! But maybe I can do more..."

# (changes architecture)

# "One more check..."
test_score = model.score(X_test, y_test)  # Peek 3

# You've now made 3 decisions based on test set
# Your "test score" is no longer unbiased

Mistake 2: "Validation and Test Are the Same Thing"

# ❌ WRONG
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Using "test" for everything
for params in param_grid:
    model.set_params(**params)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)  # This is validation, not test!

# "My test accuracy is 94%!"  
# No, your VALIDATION accuracy is 94%
# You have NO IDEA what your test accuracy is

Mistake 3: "I'll Retrain If Test Score Is Bad"

# ❌ The moment you do this, test becomes validation

test_score = model.score(X_test, y_test)
# "85%? I expected 90%. Let me try a different model."

# STOP! You just used test set to make a decision
# Now you need a NEW test set for unbiased evaluation
# But you probably don't have one...

Mistake 4: "Test Set for Early Stopping"

# ❌ WRONG - Test set leakage!
for epoch in range(1000):
    train_one_epoch()
    test_loss = evaluate(X_test, y_test)
    if test_loss > best_test_loss:
        early_stop()  # Decision based on test!

# ✅ RIGHT - Use validation for early stopping
for epoch in range(1000):
    train_one_epoch()
    val_loss = evaluate(X_val, y_val)  # Validation!
    if val_loss > best_val_loss:
        early_stop()

Quick Comparison Table

Aspect               Validation Set             Test Set
Purpose              Tune, decide, iterate      Final evaluation
When used            During development         After development complete
How often            Many times                 ONCE
Influences model?    YES (that's the point)     NO (must not)
If score is bad?     Go back and improve        Too late, ship it or restart
Alternative          K-fold cross-validation    None (need held-out data)
Typical size         15-20%                     15-20%

The Golden Rules

RULE 1: Validation is for LEARNING what works
        Test is for MEASURING final performance

RULE 2: Every peek at validation → may change your decisions
        Every peek at test → contaminates your estimate

RULE 3: Validation score will be OPTIMISTIC
        (you cherry-picked what worked on it)
        Test score is your REALITY CHECK

RULE 4: If you touch test more than once → it's now validation
        And you need a NEW test set

RULE 5: When in doubt, don't look at test
        You can always look later
        You can never UN-look

Key Takeaways

  1. Validation = for tuning, Test = for final evaluation — Different purposes!

  2. Validation can be used repeatedly — That's how you improve

  3. Test should be used ONCE — After all decisions are made

  4. Using test for decisions = leakage — Your test score becomes optimistic

  5. Validation score > Test score (usually) — Because you optimized for validation

  6. Cross-validation can replace validation set — More data efficient

  7. Lock your test set away — Pretend it doesn't exist until the end

  8. Once you peek at test, you're done — No going back to tune


The One-Sentence Summary

Sarah used practice SATs (validation) to find her weaknesses and improve, saving the real SAT (test) for final measurement — Mike used a leaked copy of the real SAT to practice and got a false 1480 that crashed to 1090 on the actual (different) exam, because optimizing for specific test questions isn't the same as actually understanding the material.


What's Next?

Now that you understand validation vs test sets, you're ready for:

  • Data Leakage — When information from the future sneaks into training
  • Nested Cross-Validation — Unbiased evaluation when tuning
  • Time-Based Splits — When random splits don't work
  • Production Monitoring — When test set isn't enough

Follow me for the next article in this series!


Let's Connect!

If the validation vs test distinction finally clicked, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

Have you ever accidentally leaked test data? I once tuned on "test" for a week before realizing my mistake. Had to re-run everything with proper splits! 😅


The difference between "this model should get 94% in production" and "this model got 94% on questions I practiced on"? Understanding that validation is for learning and test is for measuring. One prepares you for reality. The other gives you a false sense of confidence.


Share this with someone who uses "validation" and "test" interchangeably. They're playing a dangerous game with their model's reliability.

Happy validating! 📝
