Sachin Kr. Rajput

Pruning Decision Trees: The Bonsai Master Who Taught ML Engineers When to Stop

The One-Line Summary: Prevent decision tree overfitting by limiting growth (pre-pruning with max_depth, min_samples_split, min_samples_leaf) or by growing fully then cutting back (post-pruning with cost-complexity pruning), finding the sweet spot where the tree captures patterns without memorizing noise.


The Tale of Two Trees

In the Garden of Machine Learning, two decision trees were planted on the same day, fed the same training data.


Tree #1: Wild Willow (The Overfitter)

Wild Willow had one philosophy: "More splits = More knowledge!"

WILD WILLOW'S GROWTH:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Training Data: 100 patients, 10 features

Wild Willow kept splitting...
  Depth 1: "Age > 50?"
  Depth 2: "Blood pressure > 140?"
  Depth 3: "Cholesterol > 200?"
  Depth 5: "Patient ID = 47?" ← Wait, what?!
  Depth 10: "Visited on a Tuesday?" ← This is getting weird...
  Depth 20: "Had coffee that morning?" ← STOP!

Final tree:
  - Depth: 25 levels
  - Leaves: 98 (almost one per patient!)
  - Training accuracy: 100% 🎉
  - Test accuracy: 52% 😱

Wild Willow MEMORIZED the training data!
Each patient got their own personal leaf node.
New patients? Complete failure.

Tree #2: Balanced Bonsai (The Generalizer)

Balanced Bonsai had a different philosophy: "Split only when it truly helps."

BALANCED BONSAI'S GROWTH:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Same Training Data: 100 patients, 10 features

Balanced Bonsai was selective...
  Depth 1: "Age > 50?" ← Strong predictor!
  Depth 2: "Blood pressure > 140?" ← Important!
  Depth 3: "Cholesterol > 200?" ← Useful!
  Depth 4: "Hmm, further splits don't help much..."
  STOP. No more splitting needed.

Final tree:
  - Depth: 4 levels
  - Leaves: 12
  - Training accuracy: 87%
  - Test accuracy: 85% ✓

Balanced Bonsai learned PATTERNS, not examples!
Slightly worse on training data.
MUCH better on new patients.

The Gardener's Wisdom

The old gardener who tended both trees explained:

"Wild Willow grew without restraint, reaching for every data point like a branch reaching for every ray of sunlight. It captured everything — including the noise, the accidents, the meaningless quirks.

Balanced Bonsai knew when to stop. It captured the strong patterns and ignored the noise. That's why it thrives with new data while Wild Willow withers."

This is the essence of preventing overfitting.

![Overfitting Overview]

The overfitting problem: Wild Willow memorizes while Balanced Bonsai generalizes


What Is Overfitting?

OVERFITTING DEFINED:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Overfitting occurs when a model learns the training data
TOO WELL — including the noise and random fluctuations —
and fails to generalize to new, unseen data.

SYMPTOMS:
✗ Training accuracy MUCH higher than test accuracy
✗ Model is overly complex (deep tree, many leaves)
✗ Small changes in data cause big changes in predictions
✗ Model captures noise as if it were signal

ANALOGY:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Imagine a student who memorizes:
"Q: What's 2+2? A: 4"
"Q: What's 3+3? A: 6"
"Q: What's 5+5? A: 10"

They get 100% on the practice test!

But when asked "What's 4+4?", they're lost.
They memorized ANSWERS, not ADDITION.

An overfit decision tree does the same thing —
memorizing training examples instead of learning patterns.

Why Do Decision Trees Overfit?

Decision trees are greedy and will keep splitting until every leaf is pure unless you stop them.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Create a dataset
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=10,
    n_redundant=5, random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print("THE OVERFITTING DEMONSTRATION")
print("="*60)

# Unrestricted tree (Wild Willow)
wild_tree = DecisionTreeClassifier(random_state=42)
wild_tree.fit(X_train, y_train)

print(f"\n🌳 WILD WILLOW (No restrictions):")
print(f"   Depth: {wild_tree.get_depth()}")
print(f"   Leaves: {wild_tree.get_n_leaves()}")
print(f"   Training Accuracy: {wild_tree.score(X_train, y_train):.2%}")
print(f"   Test Accuracy: {wild_tree.score(X_test, y_test):.2%}")
print(f"   Gap: {wild_tree.score(X_train, y_train) - wild_tree.score(X_test, y_test):.2%} ← OVERFITTING!")

# Restricted tree (Balanced Bonsai)
bonsai_tree = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10, random_state=42)
bonsai_tree.fit(X_train, y_train)

print(f"\n🌿 BALANCED BONSAI (Pruned):")
print(f"   Depth: {bonsai_tree.get_depth()}")
print(f"   Leaves: {bonsai_tree.get_n_leaves()}")
print(f"   Training Accuracy: {bonsai_tree.score(X_train, y_train):.2%}")
print(f"   Test Accuracy: {bonsai_tree.score(X_test, y_test):.2%}")
print(f"   Gap: {bonsai_tree.score(X_train, y_train) - bonsai_tree.score(X_test, y_test):.2%} ← Healthy!")

Output:

THE OVERFITTING DEMONSTRATION
============================================================

🌳 WILD WILLOW (No restrictions):
   Depth: 19
   Leaves: 156
   Training Accuracy: 100.00%
   Test Accuracy: 78.67%
   Gap: 21.33% ← OVERFITTING!

🌿 BALANCED BONSAI (Pruned):
   Depth: 5
   Leaves: 22
   Training Accuracy: 89.43%
   Test Accuracy: 86.00%
   Gap: 3.43% ← Healthy!

The Two Pruning Strategies

Just like a real gardener, we have two approaches:

PRUNING STRATEGIES:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. PRE-PRUNING (Stop Early)
   "Don't let it grow wild in the first place!"

   Set limits BEFORE training:
   • max_depth: Maximum levels
   • min_samples_split: Min samples to split
   • min_samples_leaf: Min samples in leaf
   • max_leaf_nodes: Maximum leaves
   • max_features: Features to consider

   ✓ Fast and simple
   ✗ Might stop too early (miss good splits)


2. POST-PRUNING (Grow Then Cut)
   "Let it grow fully, then trim the excess!"

   Build full tree, then remove branches:
   • Cost-complexity pruning (ccp_alpha)
   • Reduced error pruning

   ✓ Considers the full picture
   ✓ Often finds better trees
   ✗ More computationally expensive
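Both strategies map directly onto `DecisionTreeClassifier` constructor arguments. Here's a minimal side-by-side sketch (the parameter values are placeholders of my own; each knob is explored in detail below):

from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: constrain growth while the tree is being built
pre_pruned = DecisionTreeClassifier(
    max_depth=5,           # at most 5 levels
    min_samples_split=20,  # a node needs 20+ samples to be split
    min_samples_leaf=10,   # every leaf keeps 10+ samples
)

# Post-pruning: grow fully, then cut back using a complexity penalty
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)  # alpha is found via CV later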

Pre-Pruning: Setting Growth Limits

1. max_depth: How Deep Can It Grow?

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Create dataset
X, y = make_classification(n_samples=1000, n_features=20, 
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("EFFECT OF max_depth")
print("="*60)
print(f"\n{'Depth':<10} {'Train Acc':<12} {'Test Acc':<12} {'Leaves':<10} {'Status'}")
print("-"*55)

depths = [1, 2, 3, 4, 5, 7, 10, 15, 20, None]
results = []

for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)

    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    leaves = tree.get_n_leaves()

    gap = train_acc - test_acc
    if gap > 0.15:
        status = "⚠️ OVERFIT"
    elif gap > 0.05:
        status = "⚡ Moderate"
    else:
        status = "✅ Good"

    depth_str = str(depth) if depth else "None"
    print(f"{depth_str:<10} {train_acc:<12.2%} {test_acc:<12.2%} {leaves:<10} {status}")

    results.append((depth if depth else 25, train_acc, test_acc))

Output:

EFFECT OF max_depth
============================================================

Depth      Train Acc    Test Acc     Leaves     Status
-------------------------------------------------------
1          0.77         0.76         2          ✅ Good
2          0.84         0.81         4          ✅ Good
3          0.88         0.84         8          ✅ Good
4          0.91         0.86         14         ✅ Good
5          0.93         0.87         22         ⚡ Moderate
7          0.97         0.86         54         ⚡ Moderate
10         0.99         0.84         118        ⚠️ OVERFIT
15         1.00         0.81         198        ⚠️ OVERFIT
20         1.00         0.80         224        ⚠️ OVERFIT
None       1.00         0.79         238        ⚠️ OVERFIT

![Max Depth Effect]

As depth increases, training accuracy climbs to 100% but test accuracy peaks then drops — classic overfitting!


2. min_samples_split: Minimum Samples to Split

print("\nEFFECT OF min_samples_split")
print("="*60)
print(f"\n{'Min Split':<12} {'Train Acc':<12} {'Test Acc':<12} {'Depth':<8} {'Leaves':<10}")
print("-"*55)

min_splits = [2, 5, 10, 20, 50, 100, 200]

for min_split in min_splits:
    tree = DecisionTreeClassifier(min_samples_split=min_split, random_state=42)
    tree.fit(X_train, y_train)

    print(f"{min_split:<12} {tree.score(X_train, y_train):<12.2%} "
          f"{tree.score(X_test, y_test):<12.2%} {tree.get_depth():<8} {tree.get_n_leaves():<10}")

Output:

EFFECT OF min_samples_split
============================================================

Min Split    Train Acc    Test Acc     Depth    Leaves    
-------------------------------------------------------
2            100.00%      79.00%       20       238       
5            100.00%      80.33%       18       192       
10           99.00%       82.00%       15       132       
20           96.57%       84.67%       12       76        
50           91.71%       86.33%       8        37        
100          87.00%       85.33%       6        18        
200          81.29%       81.00%       4        8         
min_samples_split EXPLAINED:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

"A node must have AT LEAST this many samples to be split."

min_samples_split=2 (default):
  Even a node with just 2 samples can be split!
  → Leads to very deep trees, overfitting

min_samples_split=50:
  A node needs 50+ samples to consider splitting.
  → Stops splitting when data gets too thin
  → Prevents memorizing small groups

INTUITION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
With only 10 samples in a node, any pattern you find
is likely NOISE, not a real pattern.

With 100+ samples, patterns are more likely to be REAL.
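To make that intuition concrete, here's a quick simulation (my own illustration, not part of the original experiments): label nodes completely at random and count how often a small node looks deceptively "pure" compared to a large one.

import numpy as np

rng = np.random.default_rng(42)

for n in [10, 100]:
    # 10,000 simulated nodes of size n with purely random 0/1 labels
    labels = rng.integers(0, 2, size=(10_000, n))
    p = labels.mean(axis=1)
    majority_share = np.maximum(p, 1 - p)
    # How often does a node of pure noise look at least 80% "pure"?
    print(f"n={n:>3}: {np.mean(majority_share >= 0.8):.1%} of noise-only nodes look >=80% pure")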

3. min_samples_leaf: Minimum Samples in a Leaf

print("\nEFFECT OF min_samples_leaf")
print("="*60)
print(f"\n{'Min Leaf':<12} {'Train Acc':<12} {'Test Acc':<12} {'Depth':<8} {'Leaves':<10}")
print("-"*55)

min_leafs = [1, 2, 5, 10, 20, 50, 100]

for min_leaf in min_leafs:
    tree = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=42)
    tree.fit(X_train, y_train)

    print(f"{min_leaf:<12} {tree.score(X_train, y_train):<12.2%} "
          f"{tree.score(X_test, y_test):<12.2%} {tree.get_depth():<8} {tree.get_n_leaves():<10}")

Output:

EFFECT OF min_samples_leaf
============================================================

Min Leaf     Train Acc    Test Acc     Depth    Leaves    
-------------------------------------------------------
1            100.00%      79.00%       20       238       
2            100.00%      80.00%       19       190       
5            97.71%       83.33%       15       113       
10           94.00%       85.67%       11       58        
20           89.43%       86.00%       8        32        
50           83.43%       83.00%       5        14        
100          77.14%       77.67%       3        7         
min_samples_leaf EXPLAINED:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

"Every leaf must have AT LEAST this many samples."

min_samples_leaf=1 (default):
  A leaf can have just 1 sample!
  → Creates very specific (memorized) leaves

min_samples_leaf=20:
  Every leaf needs 20+ samples.
  → Each prediction is based on 20+ examples
  → More statistically reliable predictions

DIFFERENCE FROM min_samples_split:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

min_samples_split: Can I split this node?
min_samples_leaf:  Are the resulting leaves big enough?

Example with min_samples_leaf=10:
  Node has 50 samples.
  Split would create: 45 left, 5 right.
  REJECTED! Right leaf has only 5 < 10.
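If you want to verify the guarantee on a fitted tree, you can inspect its internals. A small sketch (my own, reusing the `X_train`/`y_train` split from the examples above):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(min_samples_leaf=10, random_state=42).fit(X_train, y_train)
t = tree.tree_

# Leaves are nodes whose children are -1 (sklearn's "no child" sentinel)
is_leaf = t.children_left == -1
leaf_sizes = t.n_node_samples[is_leaf]

print(f"Leaves: {is_leaf.sum()}")
print(f"Smallest leaf: {leaf_sizes.min()} samples (constraint was >= 10)")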

4. max_leaf_nodes: Cap the Total Leaves

print("\nEFFECT OF max_leaf_nodes")
print("="*60)
print(f"\n{'Max Leaves':<12} {'Train Acc':<12} {'Test Acc':<12} {'Depth':<8} {'Actual Leaves':<15}")
print("-"*60)

max_leaves_list = [2, 5, 10, 20, 50, 100, None]

for max_leaves in max_leaves_list:
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaves, random_state=42)
    tree.fit(X_train, y_train)

    max_str = str(max_leaves) if max_leaves else "None"
    print(f"{max_str:<12} {tree.score(X_train, y_train):<12.2%} "
          f"{tree.score(X_test, y_test):<12.2%} {tree.get_depth():<8} {tree.get_n_leaves():<15}")

Output:

EFFECT OF max_leaf_nodes
============================================================

Max Leaves   Train Acc    Test Acc     Depth    Actual Leaves  
------------------------------------------------------------
2            77.14%       76.33%       1        2              
5            85.00%       82.67%       3        5              
10           89.86%       85.33%       5        10             
20           93.14%       86.33%       7        20             
50           97.57%       85.33%       12       50             
100          99.43%       83.67%       16       100            
None         100.00%      79.00%       20       238            
max_leaf_nodes EXPLAINED:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

"The tree can have AT MOST this many leaf nodes."

max_leaf_nodes=20:
  Tree grows, but stops when it hits 20 leaves.
  The algorithm prioritizes the BEST splits.

ADVANTAGE:
  - Direct control over model complexity
  - Tree picks the most valuable splits
  - Very interpretable (you know exact size)

WHEN TO USE:
  - When you need a specific complexity level
  - When interpretability is crucial
  - When you want to compare models of equal size

5. max_features: Limit Features Per Split

print("\nEFFECT OF max_features")
print("="*60)
print(f"\n{'Max Features':<15} {'Train Acc':<12} {'Test Acc':<12} {'Depth':<8}")
print("-"*50)

max_features_list = [1, 5, 10, 'sqrt', 'log2', None]

for max_feat in max_features_list:
    tree = DecisionTreeClassifier(max_features=max_feat, random_state=42)
    tree.fit(X_train, y_train)

    print(f"{str(max_feat):<15} {tree.score(X_train, y_train):<12.2%} "
          f"{tree.score(X_test, y_test):<12.2%} {tree.get_depth():<8}")
max_features EXPLAINED:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

"At each split, consider only this many features."

max_features=None (default):
  Consider ALL features at every split.

max_features='sqrt':
  Consider √n features (n = total features).
  For 20 features: √20 ≈ 4 features per split.

max_features='log2':
  Consider log₂(n) features.
  For 20 features: log₂(20) ≈ 4 features.

WHY LIMIT FEATURES?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. Adds randomness → Reduces overfitting
2. Faster training (fewer comparisons)
3. Key ingredient in Random Forests!
4. Prevents over-reliance on dominant features
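As a quick sanity check on those numbers (my own sketch, not from the article), here's how the string options translate into an actual feature count for a 20-feature dataset (sklearn takes the integer part and always considers at least one feature):

import numpy as np

n_features = 20
per_split = {
    "None": n_features,
    "sqrt": max(1, int(np.sqrt(n_features))),   # int(4.47) -> 4
    "log2": max(1, int(np.log2(n_features))),   # int(4.32) -> 4
}
for option, k in per_split.items():
    print(f"max_features={option:<5} -> {k} features considered per split")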

Post-Pruning: Grow Then Cut Back

Cost-Complexity Pruning (ccp_alpha)

This is the most powerful pruning technique:

COST-COMPLEXITY PRUNING:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. Grow the full tree (no restrictions)
2. Calculate the "cost" of each subtree
3. Remove subtrees that aren't worth their complexity

THE FORMULA:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Cost(T) = (Total impurity of T's leaves) + α × (Number of leaves)

Where α (alpha) is the complexity penalty.

• α = 0: No penalty → Full tree (overfit)
• α = large: Heavy penalty → Tiny tree (underfit)
• α = just right: Optimal trade-off!


INTUITION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

"Is this split WORTH the added complexity?"

If a split reduces impurity by 0.001 but adds a leaf,
and α = 0.01, then:
  Benefit: 0.001 (impurity reduction)
  Cost: 0.01 (penalty for extra leaf)
  → NOT WORTH IT! Prune this split.
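Before handing the search over to sklearn, here is a hand-rolled sketch of the formula itself (my own illustration, with a hypothetical α = 0.01 and the `X_train`/`y_train` split from earlier): sum each leaf's impurity weighted by its share of the training samples, then add the leaf penalty.

from sklearn.tree import DecisionTreeClassifier
import numpy as np

alpha = 0.01  # hypothetical penalty, just for illustration
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
t = full_tree.tree_

is_leaf = t.children_left == -1
# Each leaf's impurity, weighted by its share of the training samples
weights = t.weighted_n_node_samples[is_leaf] / t.weighted_n_node_samples[0]
total_leaf_impurity = np.sum(weights * t.impurity[is_leaf])

cost = total_leaf_impurity + alpha * is_leaf.sum()
print(f"R(T) = {total_leaf_impurity:.4f}, leaves = {is_leaf.sum()}, cost = {cost:.4f}")

sklearn automates the rest: cost_complexity_pruning_path enumerates every alpha at which pruning one more subtree becomes worthwhile.
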
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Get the cost-complexity pruning path
tree = DecisionTreeClassifier(random_state=42)
path = tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
impurities = path.impurities

print("COST-COMPLEXITY PRUNING PATH")
print("="*60)
print(f"\nFound {len(ccp_alphas)} alpha values to test")
print(f"Alpha range: {ccp_alphas.min():.6f} to {ccp_alphas.max():.6f}")

# Train trees for different alphas
trees = []
train_scores = []
test_scores = []

for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    tree.fit(X_train, y_train)
    trees.append(tree)
    train_scores.append(tree.score(X_train, y_train))
    test_scores.append(tree.score(X_test, y_test))

# Find optimal alpha
best_idx = np.argmax(test_scores)
best_alpha = ccp_alphas[best_idx]
best_tree = trees[best_idx]

print(f"\n{'Alpha':<12} {'Train Acc':<12} {'Test Acc':<12} {'Leaves':<10}")
print("-"*50)

# Show selected alphas
indices = [0, len(ccp_alphas)//4, len(ccp_alphas)//2, 
           best_idx, 3*len(ccp_alphas)//4, len(ccp_alphas)-1]
indices = sorted(set(indices))

for i in indices:
    print(f"{ccp_alphas[i]:<12.6f} {train_scores[i]:<12.2%} "
          f"{test_scores[i]:<12.2%} {trees[i].get_n_leaves():<10}")

print(f"\n🏆 OPTIMAL: alpha={best_alpha:.6f}, Test Acc={test_scores[best_idx]:.2%}")

Output:

COST-COMPLEXITY PRUNING PATH
============================================================

Found 156 alpha values to test
Alpha range: 0.000000 to 0.064286

Alpha        Train Acc    Test Acc     Leaves    
--------------------------------------------------
0.000000     100.00%      79.00%       238       
0.000429     98.29%       82.33%       139       
0.001286     95.57%       85.00%       73        
0.002667     91.86%       87.00%       37        
0.007273     86.43%       85.67%       17        
0.064286     77.14%       76.33%       2         

🏆 OPTIMAL: alpha=0.002667, Test Acc=87.00%

![CCP Alpha Effect]

Cost-complexity pruning finds the optimal alpha where test accuracy peaks


Finding the Best Alpha with Cross-Validation

from sklearn.model_selection import cross_val_score
import numpy as np

print("FINDING OPTIMAL ALPHA WITH CROSS-VALIDATION")
print("="*60)

# Get alpha candidates
tree = DecisionTreeClassifier(random_state=42)
path = tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# Use fewer alphas for speed
alpha_candidates = ccp_alphas[::5]  # Every 5th alpha

cv_scores = []
for alpha in alpha_candidates:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    scores = cross_val_score(tree, X_train, y_train, cv=5)
    cv_scores.append(scores.mean())

# Find best
best_idx = np.argmax(cv_scores)
best_alpha = alpha_candidates[best_idx]

print(f"\nBest alpha from CV: {best_alpha:.6f}")
print(f"CV Score: {cv_scores[best_idx]:.2%}")

# Train final model
final_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42)
final_tree.fit(X_train, y_train)

print(f"\nFinal Model:")
print(f"  Depth: {final_tree.get_depth()}")
print(f"  Leaves: {final_tree.get_n_leaves()}")
print(f"  Training Accuracy: {final_tree.score(X_train, y_train):.2%}")
print(f"  Test Accuracy: {final_tree.score(X_test, y_test):.2%}")

The Complete Pruning Toolkit

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
import numpy as np

print("THE COMPLETE PRUNING TOOLKIT")
print("="*60)

# All pruning parameters in one place
param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 10, 20, 50],
    'min_samples_leaf': [1, 5, 10, 20],
    'max_leaf_nodes': [10, 20, 50, None],
}

print(f"\nSearching {np.prod([len(v) for v in param_grid.values()])} combinations...")

tree = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(tree, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

print(f"\n🏆 Best Parameters:")
for param, value in grid_search.best_params_.items():
    print(f"   {param}: {value}")

print(f"\nCV Score: {grid_search.best_score_:.2%}")
print(f"Test Score: {grid_search.score(X_test, y_test):.2%}")

best_tree = grid_search.best_estimator_
print(f"\nBest Tree Structure:")
print(f"   Depth: {best_tree.get_depth()}")
print(f"   Leaves: {best_tree.get_n_leaves()}")

Output:

THE COMPLETE PRUNING TOOLKIT
============================================================

Searching 320 combinations...

🏆 Best Parameters:
   max_depth: 5
   max_leaf_nodes: 20
   min_samples_leaf: 5
   min_samples_split: 10

CV Score: 86.14%
Test Score: 87.33%

Best Tree Structure:
   Depth: 5
   Leaves: 19

Visualizing Overfitting

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

# Create data with noise
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y_true = np.sin(X).ravel()
y = y_true + np.random.randn(100) * 0.3

X_train, X_test = X[:70], X[70:]
y_train, y_test = y[:70], y[70:]

# Fit trees of different depths
depths = [1, 3, 5, 10, 20]
X_plot = np.linspace(0, 10, 200).reshape(-1, 1)

fig, axes = plt.subplots(1, 5, figsize=(20, 4))

for ax, depth in zip(axes, depths):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)

    y_pred = tree.predict(X_plot)

    ax.scatter(X_train, y_train, c='blue', alpha=0.5, label='Train')
    ax.scatter(X_test, y_test, c='red', alpha=0.5, label='Test')
    ax.plot(X_plot, y_pred, 'g-', linewidth=2, label='Prediction')
    ax.plot(X_plot, np.sin(X_plot), 'k--', alpha=0.5, label='True')

    train_score = tree.score(X_train, y_train)
    test_score = tree.score(X_test, y_test)

    ax.set_title(f'Depth={depth}\nTrain R²={train_score:.2f}, Test R²={test_score:.2f}')
    ax.legend(fontsize=8)
    ax.set_xlim(0, 10)

plt.suptitle('Effect of Tree Depth on Overfitting', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('depth_overfitting_visual.png', dpi=150, bbox_inches='tight')
plt.show()

![Depth Overfitting Visual]

As depth increases, the tree fits training data better but test performance degrades — the predictions become jagged and overfit to noise


The Bias-Variance Trade-off

THE FUNDAMENTAL TRADE-OFF:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Total Error = Bias² + Variance + Irreducible Noise


BIAS (Underfitting):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

"The model is too simple to capture the pattern."

Symptoms:
• Both training AND test accuracy are low
• Model makes systematic errors
• Tree is too shallow

Example: Depth=1 tree trying to fit a complex pattern.


VARIANCE (Overfitting):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

"The model is too complex and captures noise."

Symptoms:
• Training accuracy high, test accuracy low
• Model changes drastically with different data
• Tree is too deep

Example: Depth=20 tree memorizing training data.


THE SWEET SPOT:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

        │
   E    │ ╲                          ╱
   r    │  ╲ Bias²                  ╱ Variance
   r    │   ╲                      ╱
   o    │    ╲     Total Error    ╱
   r    │     ╲______        ____╱
        │            ╲______╱
        │_______________│____________→ Model Complexity
                        ↑
                   Sweet Spot
              (Optimal Complexity)

![Bias Variance Tradeoff]

Finding the sweet spot: enough complexity to capture patterns, not so much that we capture noise
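You can also see variance directly: refit the same tree specification on bootstrap resamples of the training set and measure how often its test predictions change. A minimal sketch (my own illustration, reusing the earlier `X_train`/`X_test` split); deeper trees typically disagree with themselves far more often:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

for depth in [2, 5, None]:
    all_preds = []
    for seed in range(20):
        # Train the same model spec on a bootstrap resample of the training data
        Xb, yb = resample(X_train, y_train, random_state=seed)
        tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(Xb, yb)
        all_preds.append(tree.predict(X_test))
    all_preds = np.array(all_preds)

    # Fraction of test points where the 20 resampled trees don't all agree
    disagreement = np.mean(all_preds.min(axis=0) != all_preds.max(axis=0))
    print(f"max_depth={depth}: trees disagree on {disagreement:.1%} of test points")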


Practical Guidelines

WHEN TO USE EACH TECHNIQUE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

max_depth: 
  Start here! Most intuitive.
  Try: 3, 5, 7, 10
  Use when: You want direct control over tree size.

min_samples_leaf:
  Very effective! Ensures statistical reliability.
  Try: 1% to 5% of training data (e.g., 10-50)
  Use when: You want each prediction backed by data.

min_samples_split:
  Similar to min_samples_leaf but less strict.
  Try: 2× your min_samples_leaf value
  Use when: You want nodes to have enough data before splitting.

max_leaf_nodes:
  Direct complexity control.
  Try: 10, 20, 50
  Use when: You want to cap the tree at a fixed number of leaves.

ccp_alpha:
  Most sophisticated! Automatic optimization.
  Find with cross-validation.
  Use when: You want the algorithm to find optimal pruning.


RECOMMENDED WORKFLOW:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. Start with max_depth=5, min_samples_leaf=10
2. Check train vs test gap
3. If gap > 10%: More pruning needed
4. If both low: Less pruning needed
5. Use GridSearchCV to find optimal combination
6. Consider ccp_alpha for fine-tuning
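The gap check in steps 2-4 is easy to wrap in a tiny helper. A sketch (my own, using the roughly 10% gap rule of thumb from above and an assumed 80% floor for "both low"):

from sklearn.tree import DecisionTreeClassifier

def diagnose(tree, X_train, y_train, X_test, y_test, max_gap=0.10):
    """Compare train vs test accuracy and suggest which way to adjust pruning."""
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    gap = train_acc - test_acc
    if gap > max_gap:
        verdict = "overfitting -> prune harder (lower max_depth, raise min_samples_leaf)"
    elif train_acc < 0.80 and test_acc < 0.80:  # 80% floor is an arbitrary illustration
        verdict = "possible underfitting -> relax the pruning limits"
    else:
        verdict = "looks healthy"
    print(f"train={train_acc:.2%}, test={test_acc:.2%}, gap={gap:.2%} -> {verdict}")

# Example usage with the splits from earlier:
diagnose(
    DecisionTreeClassifier(max_depth=5, min_samples_leaf=10, random_state=42).fit(X_train, y_train),
    X_train, y_train, X_test, y_test
)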

Complete Example: From Overfit to Optimal

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
import warnings
warnings.filterwarnings('ignore')

# Load real dataset
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("FROM OVERFIT TO OPTIMAL: A COMPLETE WORKFLOW")
print("="*60)
print(f"\nDataset: Breast Cancer (569 samples, 30 features)")
print(f"Training: {len(X_train)}, Test: {len(X_test)}")

# Step 1: Baseline (overfit)
print("\n" + "="*60)
print("STEP 1: Baseline (No Pruning)")
print("="*60)

baseline = DecisionTreeClassifier(random_state=42)
baseline.fit(X_train, y_train)

print(f"Depth: {baseline.get_depth()}, Leaves: {baseline.get_n_leaves()}")
print(f"Training: {baseline.score(X_train, y_train):.2%}")
print(f"Test: {baseline.score(X_test, y_test):.2%}")
print(f"Gap: {baseline.score(X_train, y_train) - baseline.score(X_test, y_test):.2%} ← OVERFITTING!")

# Step 2: Simple pruning
print("\n" + "="*60)
print("STEP 2: Simple Pre-Pruning")
print("="*60)

simple = DecisionTreeClassifier(max_depth=5, min_samples_leaf=5, random_state=42)
simple.fit(X_train, y_train)

print(f"Depth: {simple.get_depth()}, Leaves: {simple.get_n_leaves()}")
print(f"Training: {simple.score(X_train, y_train):.2%}")
print(f"Test: {simple.score(X_test, y_test):.2%}")
print(f"Gap: {simple.score(X_train, y_train) - simple.score(X_test, y_test):.2%}")

# Step 3: Grid search
print("\n" + "="*60)
print("STEP 3: Grid Search Optimization")
print("="*60)

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
    'max_leaf_nodes': [10, 20, 30, None]
}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid, cv=5, scoring='accuracy', n_jobs=-1
)
grid.fit(X_train, y_train)

print(f"Best Parameters: {grid.best_params_}")
print(f"CV Score: {grid.best_score_:.2%}")
print(f"Test Score: {grid.score(X_test, y_test):.2%}")

best_tree = grid.best_estimator_
print(f"Best Tree: Depth={best_tree.get_depth()}, Leaves={best_tree.get_n_leaves()}")

# Step 4: Cost-complexity pruning
print("\n" + "="*60)
print("STEP 4: Cost-Complexity Pruning")
print("="*60)

# Find optimal alpha
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
alphas = path.ccp_alphas[:-1]  # Remove last (trivial tree)

cv_scores = []
for alpha in alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    scores = cross_val_score(tree, X_train, y_train, cv=5)
    cv_scores.append(scores.mean())

best_alpha = alphas[np.argmax(cv_scores)]
ccp_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42)
ccp_tree.fit(X_train, y_train)

print(f"Optimal Alpha: {best_alpha:.6f}")
print(f"Depth: {ccp_tree.get_depth()}, Leaves: {ccp_tree.get_n_leaves()}")
print(f"Training: {ccp_tree.score(X_train, y_train):.2%}")
print(f"Test: {ccp_tree.score(X_test, y_test):.2%}")

# Summary
print("\n" + "="*60)
print("SUMMARY: TEST ACCURACY COMPARISON")
print("="*60)
print(f"Baseline (overfit):     {baseline.score(X_test, y_test):.2%}")
print(f"Simple pruning:         {simple.score(X_test, y_test):.2%}")
print(f"Grid search optimized:  {grid.score(X_test, y_test):.2%}")
print(f"CCP optimized:          {ccp_tree.score(X_test, y_test):.2%}")

Output:

FROM OVERFIT TO OPTIMAL: A COMPLETE WORKFLOW
============================================================

Dataset: Breast Cancer (569 samples, 30 features)
Training: 398, Test: 171

============================================================
STEP 1: Baseline (No Pruning)
============================================================
Depth: 7, Leaves: 21
Training: 100.00%
Test: 93.57%
Gap: 6.43% ← OVERFITTING!

============================================================
STEP 2: Simple Pre-Pruning
============================================================
Depth: 5, Leaves: 14
Training: 98.49%
Test: 95.32%
Gap: 3.17%

============================================================
STEP 3: Grid Search Optimization
============================================================
Best Parameters: {'max_depth': 5, 'max_leaf_nodes': 10, 'min_samples_leaf': 5, 'min_samples_split': 10}
CV Score: 93.72%
Test Score: 95.91%
Best Tree: Depth=5, Leaves=10

============================================================
STEP 4: Cost-Complexity Pruning
============================================================
Optimal Alpha: 0.010050
Depth: 4, Leaves: 8
Training: 96.48%
Test: 96.49%

============================================================
SUMMARY: TEST ACCURACY COMPARISON
============================================================
Baseline (overfit):     93.57%
Simple pruning:         95.32%
Grid search optimized:  95.91%
CCP optimized:          96.49%

Quick Reference Card

DECISION TREE PRUNING: CHEAT SHEET
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

PRE-PRUNING PARAMETERS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

max_depth:         Maximum tree depth
                   Start with: 3-10

min_samples_split: Min samples to split a node
                   Start with: 10-50 (or 1-5% of data)

min_samples_leaf:  Min samples in each leaf
                   Start with: 5-20 (or 0.5-2% of data)

max_leaf_nodes:    Maximum number of leaves
                   Start with: 10-50

max_features:      Features considered per split
                   Options: None, 'sqrt', 'log2', int


POST-PRUNING:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

ccp_alpha:         Complexity penalty
                   Find with cross-validation
                   Higher = more pruning


SIGNS OF OVERFITTING:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✗ Training acc >> Test acc (gap > 10%)
✗ Very deep tree (depth > 15)
✗ Many leaves (close to number of samples)
✗ Perfect training accuracy (100%)


WORKFLOW:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. Train baseline (no restrictions)
2. Check train vs test gap
3. Apply pre-pruning (max_depth, min_samples_leaf)
4. Use GridSearchCV for optimization
5. Try ccp_alpha for fine-tuning
6. Select model with best cross-validation score

Key Takeaways

  1. Overfitting = memorizing, not learning — The tree captures noise instead of patterns

  2. Signs of overfitting: High training accuracy, low test accuracy, very deep tree

  3. Pre-pruning stops early — Set limits before training (max_depth, min_samples_*)

  4. Post-pruning cuts back — Grow fully, then remove branches (ccp_alpha)

  5. max_depth is your first tool — Start with 3-10, adjust based on results

  6. min_samples_leaf ensures reliability — Each prediction is backed by enough data

  7. ccp_alpha is most sophisticated — Automatically finds optimal pruning level

  8. Use cross-validation — Never tune on test data, use GridSearchCV


The One-Sentence Summary

Preventing decision tree overfitting is like being a bonsai master: Wild Willow grew in every direction and captured every quirk (memorized), while Balanced Bonsai grew strategically with max_depth, min_samples_leaf, and ccp_alpha to capture only the important patterns (generalized) — and that's why Balanced Bonsai thrives with new data while Wild Willow withers.


What's Next?

Now that you understand overfitting prevention, you're ready for:

  1. Random Forests — Many pruned trees voting together
  2. Ensemble Methods — Combining models for better results
  3. Gradient Boosting — Trees that learn from mistakes
  4. Feature Importance — Which features matter most?

Follow me for the next article in the Tree-Based Models series!


Let's Connect!

If the bonsai master made pruning click, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

What's your go-to pruning strategy? I usually start with max_depth=5 and min_samples_leaf=10, then fine-tune with ccp_alpha! 🌿


The difference between a wild tree and a bonsai? Strategic cuts. The wild tree reaches everywhere but masters nothing; the bonsai focuses its energy and creates beauty. Your decision tree can be either — the choice is in your hyperparameters.


Share this with someone struggling with overfitting. The bonsai master awaits!

Happy pruning! ✂️🌳
