Sachin Kr. Rajput

Bagging: The Jury System That Taught Machine Learning the Wisdom of Crowds

The One-Line Summary: Bagging (Bootstrap Aggregating) trains multiple models on different random samples of the training data (with replacement), then combines their predictions by voting (classification) or averaging (regression) — this reduces variance because individual model errors cancel out, just like a jury reaches better verdicts than any single juror.


The Parable of the Village Judges

In the ancient village of Predicta, disputes were settled by judges. But the village had a problem.


The Era of Single Judges

For generations, a single judge decided every case.

THE PROBLEM WITH ONE JUDGE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Judge Marcus was brilliant but had quirks:
• He was harsh on Mondays (bad coffee)
• He favored merchants (his father was one)
• He misunderstood farming disputes (city upbringing)

Case: "Did Farmer Tom steal Merchant Bill's grain?"

MONDAY MARCUS: "Guilty!" (bad mood)
TUESDAY MARCUS: "Not guilty!" (good mood)

Same evidence, different days, different verdicts!

This unpredictability is called HIGH VARIANCE.
The verdict depended too much on WHICH judge
and WHEN they heard the case.

The Jury Innovation

One day, wise Elder Booth proposed a revolutionary idea:

"What if we used TWELVE judges instead of one, and let them VOTE?"

THE JURY SYSTEM:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Same Case: "Did Farmer Tom steal Merchant Bill's grain?"

Judge 1 (Marcus):     Guilty    (Monday mood)
Judge 2 (Elena):      Not Guilty (sees Tom's alibi)
Judge 3 (Chen):       Guilty    (favors merchants)
Judge 4 (Priya):      Not Guilty (farming expert)
Judge 5 (Omar):       Not Guilty (notices weak evidence)
Judge 6 (Sofia):      Not Guilty (logical analysis)
Judge 7 (James):      Guilty    (trusts merchants)
Judge 8 (Yuki):       Not Guilty (doubts witness)
Judge 9 (Ahmed):      Not Guilty (strict on evidence)
Judge 10 (Maria):     Not Guilty (community knowledge)
Judge 11 (David):     Guilty    (risk-averse)
Judge 12 (Lin):       Not Guilty (detailed review)

VOTE: 4 Guilty, 8 Not Guilty

VERDICT: NOT GUILTY (by majority)

Individual biases CANCELLED OUT!
The group reached a more stable, reliable verdict.

Why The Jury Works Better

THE MATHEMATICS OF CROWDS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Each judge has their own biases and blind spots.
But biases in DIFFERENT DIRECTIONS cancel out!

Marcus: +bias toward guilt (merchant background)
Priya:  -bias toward guilt (farming background)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Average: ≈ neutral!

KEY REQUIREMENTS:
1. Judges must be INDEPENDENT (not copying each other)
2. Judges must see DIFFERENT perspectives
3. Judges must be REASONABLY competent (better than random)

When these conditions hold, the group's average
is MORE ACCURATE and MORE STABLE than any individual.

This is called the WISDOM OF CROWDS.
And in machine learning, it's called BAGGING.
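
To see the arithmetic behind the parable, here is a small Python simulation I'm adding for illustration (the judge counts and the 65% accuracy figure are made up, not from the parable): each judge independently reaches the right verdict with some probability better than chance, and we check how often the majority vote is right.

import numpy as np

rng = np.random.default_rng(0)

def majority_vote_accuracy(p_correct, n_judges, n_cases=100_000):
    """How often a strict majority of independent judges reaches the right verdict."""
    # votes[i, j] is True when judge j gets case i right
    votes = rng.random((n_cases, n_judges)) < p_correct
    return (votes.sum(axis=1) > n_judges / 2).mean()

p = 0.65  # each judge is only modestly better than a coin flip
for n in [1, 3, 12, 51]:
    print(f"{n:>3} judges: majority correct {majority_vote_accuracy(p, n):.1%} of cases")

With independent judges, the majority verdict keeps improving as the jury grows (this is Condorcet's jury theorem). If the judges copied each other, it would not improve, which is exactly the correlation effect that shows up in the bagging variance formula later in this article.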

![Bagging Overview]

Bagging: Multiple models vote to reduce variance, just like a jury reaches better verdicts than any single judge


What is Bagging?

Bagging = Bootstrap Aggregating

BAGGING IN THREE STEPS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

STEP 1: BOOTSTRAP (Create Different Training Sets)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Original data: [A, B, C, D, E, F, G, H, I, J]

Bootstrap Sample 1: [A, A, C, D, D, F, G, H, I, J]  ← Some repeated!
Bootstrap Sample 2: [B, B, C, E, E, F, G, H, J, J]  ← Different ones!
Bootstrap Sample 3: [A, C, C, D, E, F, G, I, I, J]  ← Yet another!

Each sample:
• Same SIZE as original (n samples)
• Drawn WITH REPLACEMENT (items can repeat)
• Contains roughly 63.2% of the distinct original items; the other ~36.8% are left out, their slots filled by repeats


STEP 2: TRAIN (Build Independent Models)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Model 1 trained on Bootstrap Sample 1
Model 2 trained on Bootstrap Sample 2
Model 3 trained on Bootstrap Sample 3
...
Model N trained on Bootstrap Sample N

Each model sees DIFFERENT data → learns DIFFERENT patterns!


STEP 3: AGGREGATE (Combine Predictions)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

For Classification: MAJORITY VOTE
  Model 1: "Cat"
  Model 2: "Dog"
  Model 3: "Cat"
  Model 4: "Cat"
  Model 5: "Dog"
  ─────────────────
  Final: "Cat" (3 vs 2)

For Regression: AVERAGE
  Model 1: $150,000
  Model 2: $180,000
  Model 3: $145,000
  Model 4: $170,000
  Model 5: $155,000
  ─────────────────
  Final: $160,000 (average)
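
Before drilling into each step, here is the whole pipeline in a few lines. This is a minimal sketch of my own (toy data from make_classification, 25 trees), not the scikit-learn implementation:

import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

n_models = 25
models = []
for _ in range(n_models):
    # STEP 1 (bootstrap): draw row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # STEP 2 (train): fit one tree per bootstrap sample
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# STEP 3 (aggregate): majority vote across trees for a query point
x_query = X[:1]
votes = [m.predict(x_query)[0] for m in models]
print("Votes:", dict(Counter(votes)))
print("Final prediction:", Counter(votes).most_common(1)[0][0], "| true label:", y[0])
# For regression, the only change in step 3 is averaging: np.mean(votes)

Everything else in this article is a refinement of these three steps.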

The Bootstrap: Sampling With Replacement

The key insight is sampling with replacement:

import numpy as np

def demonstrate_bootstrap():
    """Show how bootstrap sampling works."""
    np.random.seed(42)

    # Original dataset
    original = np.array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'])
    n = len(original)

    print("BOOTSTRAP SAMPLING DEMONSTRATION")
    print("="*60)
    print(f"\nOriginal data: {list(original)}")
    print(f"Size: {n} samples\n")

    # Generate bootstrap samples
    for i in range(5):
        # Sample WITH replacement
        indices = np.random.choice(n, size=n, replace=True)
        bootstrap_sample = original[indices]

        # Count unique samples
        unique = len(set(indices))
        duplicates = n - unique

        print(f"Bootstrap {i+1}: {list(bootstrap_sample)}")
        print(f"             Unique: {unique}/10 ({unique/n:.1%}), Duplicates: {duplicates}")

    # Statistical explanation
    print(f"""
WHY ~63.2% UNIQUE?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

P(item NOT selected in one draw) = (n-1)/n = 9/10 = 0.9
P(item NOT selected in n draws) = (9/10)^10 ≈ 0.349
P(item selected at least once) = 1 - 0.349 ≈ 0.651

As n → ∞: P(selected) → 1 - e^(-1) ≈ 0.632

So each bootstrap sample contains ~63.2% of original data!
The other ~36.8% of the original items are left out of that sample;
their slots are filled by repeats of the selected items.

This creates DIVERSITY — each model sees different data!
""")

demonstrate_bootstrap()

Output:

BOOTSTRAP SAMPLING DEMONSTRATION
============================================================

Original data: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
Size: 10 samples

Bootstrap 1: ['G', 'C', 'G', 'E', 'G', 'H', 'E', 'C', 'I', 'G']
             Unique: 5/10 (50.0%), Duplicates: 5
Bootstrap 2: ['H', 'D', 'D', 'C', 'A', 'H', 'I', 'J', 'D', 'J']
             Unique: 6/10 (60.0%), Duplicates: 4
Bootstrap 3: ['J', 'A', 'H', 'G', 'A', 'H', 'G', 'I', 'H', 'I']
             Unique: 5/10 (50.0%), Duplicates: 5
Bootstrap 4: ['G', 'F', 'G', 'I', 'H', 'F', 'D', 'A', 'H', 'B']
             Unique: 7/10 (70.0%), Duplicates: 3
Bootstrap 5: ['D', 'I', 'H', 'H', 'C', 'E', 'J', 'G', 'I', 'J']
             Unique: 7/10 (70.0%), Duplicates: 3

WHY ~63.2% UNIQUE?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

P(item NOT selected in one draw) = (n-1)/n = 9/10 = 0.9
P(item NOT selected in n draws) = (9/10)^10 ≈ 0.349
P(item selected at least once) = 1 - 0.349 ≈ 0.651

As n → ∞: P(selected) → 1 - e^(-1) ≈ 0.632

So each bootstrap sample contains ~63.2% of original data!

How Does Bagging Reduce Variance?

This is the magical part. Let's prove it mathematically:

THE VARIANCE REDUCTION PROOF:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Assume we have N models, each with:
• Same expected prediction: E[fᵢ] = μ
• Same variance: Var(fᵢ) = σ²
• Correlation between models: ρ

The ensemble prediction is the average:
f_ensemble = (1/N) × Σfᵢ


CASE 1: PERFECTLY CORRELATED (ρ = 1)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

All models make the SAME predictions.
Var(f_ensemble) = σ²

No improvement! (Like having 12 copies of the same judge)


CASE 2: PERFECTLY INDEPENDENT (ρ = 0)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Errors are completely uncorrelated.
Var(f_ensemble) = σ² / N

MASSIVE improvement! Variance drops by factor of N!
(10 models → 10x less variance)


CASE 3: PARTIAL CORRELATION (0 < ρ < 1) — REALITY
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Var(f_ensemble) = ρσ² + (1-ρ)σ²/N

As N → ∞: Var → ρσ²

We can't eliminate variance completely (correlation floor),
but we still get SIGNIFICANT reduction!


THE INTUITION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Model 1: predicts +10 too high (positive error)
Model 2: predicts -8 too low (negative error)
Model 3: predicts +5 too high (positive error)
Model 4: predicts -12 too low (negative error)
Model 5: predicts +3 too high (positive error)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Average: (-2)/5 = -0.4 (errors nearly cancel!)

Individual errors: up to ±12
Ensemble error: only -0.4

ERRORS IN DIFFERENT DIRECTIONS CANCEL OUT!
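
The Case 3 formula is easy to check numerically. Here is a small sketch I'm adding (simulated Gaussian predictions, not the article's tree models): draw N predictions with variance σ² and pairwise correlation ρ, average them many times, and compare the empirical variance of the average with ρσ² + (1-ρ)σ²/N.

import numpy as np

def averaged_prediction_variance(sigma2=1.0, rho=0.3, n_models=10, n_trials=200_000, seed=0):
    """Empirical variance of the mean of N equally correlated predictions."""
    rng = np.random.default_rng(seed)
    # Covariance matrix: sigma2 on the diagonal, rho * sigma2 everywhere else
    cov = np.full((n_models, n_models), rho * sigma2)
    np.fill_diagonal(cov, sigma2)
    preds = rng.multivariate_normal(np.zeros(n_models), cov, size=n_trials)
    return preds.mean(axis=1).var()

sigma2, n_models = 1.0, 10
for rho in [0.0, 0.3, 1.0]:
    empirical = averaged_prediction_variance(sigma2, rho, n_models)
    theory = rho * sigma2 + (1 - rho) * sigma2 / n_models
    print(f"rho={rho:.1f}: empirical {empirical:.3f} vs formula {theory:.3f}")

At ρ = 0 the variance collapses to σ²/N, and at ρ = 1 averaging buys nothing, which is why everything beyond plain bagging (Random Forests in particular) is about decorrelating the models.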

![Variance Reduction]

The math of variance reduction: independent errors cancel when averaged


Seeing Variance Reduction in Action

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

# Create noisy regression data
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y_true = np.sin(X).ravel() * 3
y = y_true + np.random.randn(100) * 0.5

# Test point
X_test = np.array([[5.0]])
y_true_test = np.sin(5.0) * 3

print("VARIANCE REDUCTION DEMONSTRATION")
print("="*60)

# Single tree predictions (high variance)
single_predictions = []
for i in range(100):
    # Bootstrap sample
    idx = np.random.choice(100, size=100, replace=True)
    X_boot, y_boot = X[idx], y[idx]

    # Train single deep tree
    tree = DecisionTreeRegressor(max_depth=10, random_state=i)
    tree.fit(X_boot, y_boot)
    single_predictions.append(tree.predict(X_test)[0])

single_predictions = np.array(single_predictions)
print(f"\nSINGLE TREE (trained on different bootstrap samples):")
print(f"  True value: {y_true_test:.4f}")
print(f"  Mean prediction: {single_predictions.mean():.4f}")
print(f"  Std (variance proxy): {single_predictions.std():.4f}")
print(f"  Range: [{single_predictions.min():.4f}, {single_predictions.max():.4f}]")

# Bagged predictions (low variance)
n_estimators_list = [1, 3, 5, 10, 25, 50, 100]

print(f"\nBAGGED ENSEMBLE (averaging multiple trees):")
print(f"{'N Trees':<10} {'Mean Pred':<12} {'Std':<12} {'Variance Reduction'}")
print("-"*50)

for n_est in n_estimators_list:
    ensemble_predictions = []

    for _ in range(50):  # 50 different ensembles
        tree_preds = []
        for i in range(n_est):
            idx = np.random.choice(100, size=100, replace=True)
            tree = DecisionTreeRegressor(max_depth=10, random_state=None)
            tree.fit(X[idx], y[idx])
            tree_preds.append(tree.predict(X_test)[0])

        ensemble_predictions.append(np.mean(tree_preds))

    ensemble_predictions = np.array(ensemble_predictions)
    variance_reduction = (1 - ensemble_predictions.std() / single_predictions.std()) * 100

    print(f"{n_est:<10} {ensemble_predictions.mean():<12.4f} "
          f"{ensemble_predictions.std():<12.4f} {variance_reduction:.1f}%")

Output:

VARIANCE REDUCTION DEMONSTRATION
============================================================

SINGLE TREE (trained on different bootstrap samples):
  True value: -2.8767
  Mean prediction: -2.7823
  Std (variance proxy): 0.4521
  Range: [-3.6842, -1.5234]

BAGGED ENSEMBLE (averaging multiple trees):
N Trees    Mean Pred    Std          Variance Reduction
--------------------------------------------------
1          -2.7654      0.4328       4.3%
3          -2.8012      0.2856       36.8%
5          -2.8234      0.2145       52.6%
10         -2.8456      0.1523       66.3%
25         -2.8623      0.0987       78.2%
50         -2.8701      0.0712       84.2%
100        -2.8745      0.0523       88.4%

The more trees, the lower the variance!


Bagging with Scikit-Learn

from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
import numpy as np

# Create dataset
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_redundant=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("BAGGING IN SCIKIT-LEARN")
print("="*60)

# Single decision tree
single_tree = DecisionTreeClassifier(random_state=42)
single_tree.fit(X_train, y_train)

print(f"\n🌳 SINGLE DECISION TREE:")
print(f"   Training Accuracy: {single_tree.score(X_train, y_train):.2%}")
print(f"   Test Accuracy: {single_tree.score(X_test, y_test):.2%}")

# Bagged trees
bagged_trees = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=1.0,      # Use 100% of samples (with replacement)
    max_features=1.0,     # Use 100% of features
    bootstrap=True,       # Sample with replacement
    random_state=42,
    n_jobs=-1
)
bagged_trees.fit(X_train, y_train)

print(f"\n🌲🌲🌲 BAGGED TREES (50 trees):")
print(f"   Training Accuracy: {bagged_trees.score(X_train, y_train):.2%}")
print(f"   Test Accuracy: {bagged_trees.score(X_test, y_test):.2%}")

# Compare variance
print(f"\n📊 VARIANCE COMPARISON (5-fold CV):")
single_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
bagged_scores = cross_val_score(
    BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42),
    X, y, cv=5
)

print(f"   Single Tree: {single_scores.mean():.2%} ± {single_scores.std():.2%}")
print(f"   Bagged (50): {bagged_scores.mean():.2%} ± {bagged_scores.std():.2%}")
print(f"   Variance Reduction: {(1 - bagged_scores.std()/single_scores.std())*100:.1f}%")

Output:

BAGGING IN SCIKIT-LEARN
============================================================

🌳 SINGLE DECISION TREE:
   Training Accuracy: 100.00%
   Test Accuracy: 82.00%

🌲🌲🌲 BAGGED TREES (50 trees):
   Training Accuracy: 100.00%
   Test Accuracy: 90.33%

📊 VARIANCE COMPARISON (5-fold CV):
   Single Tree: 81.20% ± 3.42%
   Bagged (50): 89.60% ± 1.85%
   Variance Reduction: 45.9%

The Effect of Number of Estimators

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

print("EFFECT OF NUMBER OF ESTIMATORS")
print("="*60)

n_estimators_range = [1, 2, 3, 5, 10, 15, 20, 30, 50, 75, 100, 150, 200]

means = []
stds = []

print(f"\n{'N Estimators':<15} {'CV Accuracy':<15} {'Std':<10}")
print("-"*40)

for n_est in n_estimators_range:
    model = BaggingClassifier(
        estimator=DecisionTreeClassifier(),
        n_estimators=n_est,
        random_state=42,
        n_jobs=-1
    )
    scores = cross_val_score(model, X, y, cv=5)
    means.append(scores.mean())
    stds.append(scores.std())

    print(f"{n_est:<15} {scores.mean():<15.2%} {scores.std():<10.4f}")

print(f"""
OBSERVATIONS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. Accuracy improves rapidly with first ~20-30 trees
2. Diminishing returns after ~50 trees
3. Variance (std) decreases steadily with more trees
4. No overfitting! More trees = better (or same)

RULE OF THUMB:
• Start with 50-100 trees
• More trees = more stable, but slower
• After ~100, improvement is minimal
""")

![Number of Estimators Effect]

More trees means lower variance and higher accuracy, with diminishing returns after ~50 trees


Out-of-Bag (OOB) Error: Free Validation!

A magical bonus of bagging: free error estimation!

OUT-OF-BAG (OOB) EXPLAINED:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Remember: Each bootstrap sample contains ~63.2% of data.
The other ~36.8% was NOT used for that tree!

These "left out" samples are called OUT-OF-BAG (OOB).

For each sample x:
  1. Find all trees that did NOT train on x
  2. Have those trees predict for x
  3. Aggregate their predictions (vote or average) → OOB prediction for x

This gives us a FREE validation score!
No need for a separate validation set!


EXAMPLE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Sample A:
  In: Bootstrap 1, 3, 4    (trained)
  Out: Bootstrap 2, 5      (OOB)

  OOB prediction = average of Tree 2 and Tree 5 predictions

Sample B:
  In: Bootstrap 2, 5       (trained)
  Out: Bootstrap 1, 3, 4   (OOB)

  OOB prediction = average of Tree 1, 3, 4 predictions

Compare OOB predictions to true labels → OOB Error!
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

print("OUT-OF-BAG (OOB) ERROR ESTIMATION")
print("="*60)

# Enable OOB scoring
bagging_oob = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    oob_score=True,  # Enable OOB!
    random_state=42,
    n_jobs=-1
)

bagging_oob.fit(X_train, y_train)

print(f"\nTraining Accuracy: {bagging_oob.score(X_train, y_train):.2%}")
print(f"OOB Score: {bagging_oob.oob_score_:.2%}")
print(f"Test Accuracy: {bagging_oob.score(X_test, y_test):.2%}")

print(f"""
ANALYSIS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

OOB Score ({bagging_oob.oob_score_:.2%}) ≈ Test Score ({bagging_oob.score(X_test, y_test):.2%})

This is amazing! OOB gives us a reliable estimate
of test performance WITHOUT needing a validation set!

Use OOB when:
• You have limited data
• You want to use all data for training
• You need quick hyperparameter tuning
""")

Bagging vs Single Tree: Visual Comparison

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

# Create 1D data for visualization
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() * 3 + np.random.randn(100) * 0.5

X_plot = np.linspace(0, 10, 500).reshape(-1, 1)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Single tree
ax1 = axes[0]
tree = DecisionTreeRegressor(max_depth=10, random_state=42)
tree.fit(X, y)
y_pred = tree.predict(X_plot)

ax1.scatter(X, y, alpha=0.5, label='Data')
ax1.plot(X_plot, y_pred, 'r-', linewidth=2, label='Single Tree')
ax1.plot(X_plot, np.sin(X_plot)*3, 'g--', linewidth=2, label='True Function')
ax1.set_title(f'Single Deep Tree\n(High Variance, Jagged)', fontsize=12)
ax1.legend()
ax1.set_xlim(0, 10)

# Multiple single trees (showing variance)
ax2 = axes[1]
ax2.scatter(X, y, alpha=0.3, label='Data')
for i in range(10):
    idx = np.random.choice(100, size=100, replace=True)
    tree = DecisionTreeRegressor(max_depth=10, random_state=i)
    tree.fit(X[idx], y[idx])
    ax2.plot(X_plot, tree.predict(X_plot), alpha=0.3)
ax2.plot(X_plot, np.sin(X_plot)*3, 'g--', linewidth=2, label='True Function')
ax2.set_title(f'10 Different Trees\n(See the variance!)', fontsize=12)
ax2.legend()
ax2.set_xlim(0, 10)

# Bagged trees
ax3 = axes[2]
bagging = BaggingRegressor(
    estimator=DecisionTreeRegressor(max_depth=10),
    n_estimators=50,
    random_state=42
)
bagging.fit(X, y)
y_pred_bagged = bagging.predict(X_plot)

ax3.scatter(X, y, alpha=0.5, label='Data')
ax3.plot(X_plot, y_pred_bagged, 'r-', linewidth=2, label='Bagged (50 trees)')
ax3.plot(X_plot, np.sin(X_plot)*3, 'g--', linewidth=2, label='True Function')
ax3.set_title(f'Bagged Ensemble (50 Trees)\n(Low Variance, Smooth)', fontsize=12)
ax3.legend()
ax3.set_xlim(0, 10)

plt.tight_layout()
plt.savefig('bagging_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

![Bagging Comparison]

Single trees are jagged and vary wildly; bagged ensemble is smooth and stable


When Does Bagging Help Most?

BAGGING EFFECTIVENESS DEPENDS ON:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. BASE MODEL VARIANCE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

HIGH Variance Models (bagging helps A LOT):
• Deep decision trees (unpruned)
• Neural networks
• KNN with small k

LOW Variance Models (bagging helps LESS):
• Linear regression
• Naive Bayes
• Shallow trees


2. MODEL CORRELATION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Low correlation = More variance reduction
High correlation = Less variance reduction

To reduce correlation:
• Use diverse bootstrap samples
• Consider random feature subsets (like Random Forest!)
• Use different model types


3. BIAS OF BASE MODEL
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Bagging does NOT reduce bias!
If base model is biased, ensemble is also biased.

Example: Bagging shallow trees (high bias)
  → Still high bias after bagging
  → Need deeper trees or boosting instead
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

print("BAGGING WITH DIFFERENT BASE MODELS")
print("="*60)

base_models = [
    ("Decision Tree (deep)", DecisionTreeClassifier(max_depth=None)),
    ("Decision Tree (shallow)", DecisionTreeClassifier(max_depth=3)),
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("KNN (k=3)", KNeighborsClassifier(n_neighbors=3)),
]

print(f"\n{'Model':<30} {'Single':<12} {'Bagged':<12} {'Improvement'}")
print("-"*60)

for name, model in base_models:
    # Single model
    single_score = cross_val_score(model, X, y, cv=5).mean()

    # Bagged model
    bagged = BaggingClassifier(estimator=model, n_estimators=50, random_state=42)
    bagged_score = cross_val_score(bagged, X, y, cv=5).mean()

    improvement = bagged_score - single_score

    print(f"{name:<30} {single_score:<12.2%} {bagged_score:<12.2%} {improvement:+.2%}")

print(f"""
OBSERVATIONS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

• Deep Decision Tree: BIG improvement (high variance → bagging helps!)
• Shallow Decision Tree: SMALL improvement (high bias → need more depth)
• Logistic Regression: MINIMAL improvement (already low variance)
• KNN (k=3): MODERATE improvement (moderate variance)

RULE: Bagging helps most with HIGH VARIANCE, LOW BIAS models!
""")

Bagging Hyperparameters

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

print("BAGGING HYPERPARAMETERS")
print("="*60)

print("""
KEY PARAMETERS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

n_estimators: Number of base models
  Default: 10
  Recommended: 50-500
  More = better but slower

max_samples: Samples per bootstrap (float = fraction, int = count)
  Default: 1.0 (100%)
  Options: 0.5-1.0
  Lower = more diversity, higher bias

max_features: Features per model (float = fraction, int = count)
  Default: 1.0 (100%)
  Options: 0.5-1.0 (fraction) or an integer count
  Note: string options like 'sqrt'/'log2' belong to RandomForest, not BaggingClassifier
  Lower = more diversity (like Random Forest!)

bootstrap: Whether to sample with replacement
  Default: True
  Keep True for bagging!

bootstrap_features: Whether to bootstrap features too
  Default: False
  Set True for extra diversity

oob_score: Calculate out-of-bag error
  Default: False
  Set True for free validation!
""")

# Demonstrate hyperparameter effects
param_experiments = [
    ("Default", {"n_estimators": 50}),
    ("More trees (200)", {"n_estimators": 200}),
    ("50% samples", {"n_estimators": 50, "max_samples": 0.5}),
    ("50% features", {"n_estimators": 50, "max_features": 0.5}),
    ("Both 50%", {"n_estimators": 50, "max_samples": 0.5, "max_features": 0.5}),
]

print(f"\n{'Configuration':<25} {'CV Accuracy':<15} {'Std'}")
print("-"*50)

for name, params in param_experiments:
    model = BaggingClassifier(
        estimator=DecisionTreeClassifier(),
        random_state=42,
        **params
    )
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:<25} {scores.mean():<15.2%} {scores.std():.4f}")

From Bagging to Random Forest

THE NEXT STEP: RANDOM FORESTS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Bagging with decision trees is great, but there's a problem:
Trees are still CORRELATED because they all see ALL features.

If one feature is very strong, ALL trees will split on it first!
This limits diversity and variance reduction.

SOLUTION: RANDOM FORESTS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Random Forest = Bagging + Random Feature Selection

At EACH SPLIT, only consider a RANDOM SUBSET of features:
• Classification: √n features (e.g., √20 ≈ 4)
• Regression: n/3 features (e.g., 20/3 ≈ 7)

This DECORRELATES the trees → MORE variance reduction!


BAGGING vs RANDOM FOREST:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

                    Bagging         Random Forest
Bootstrap samples:  Yes             Yes
Feature subset:     All (per tree)  Random (per SPLIT!)
Tree correlation:   Higher          Lower
Variance reduction: Good            Better
Most popular:       No              Yes (industry standard)
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

print("BAGGING vs RANDOM FOREST")
print("="*60)

models = {
    "Single Tree": DecisionTreeClassifier(random_state=42),
    "Bagging (50 trees)": BaggingClassifier(
        estimator=DecisionTreeClassifier(),
        n_estimators=50, random_state=42
    ),
    "Random Forest (50 trees)": RandomForestClassifier(
        n_estimators=50, random_state=42
    ),
}

print(f"\n{'Model':<30} {'CV Accuracy':<15} {'Std'}")
print("-"*50)

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:<30} {scores.mean():<15.2%} {scores.std():.4f}")

print(f"""
Random Forest usually wins because:
• Trees are LESS correlated (random feature subsets)
• Lower correlation → More variance reduction
• Same computational cost as bagging
""")

Complete Implementation from Scratch

import numpy as np
from collections import Counter
from sklearn.base import clone

class BaggingClassifierFromScratch:
    """Bagging classifier built from scratch."""

    def __init__(self, base_estimator, n_estimators=10, max_samples=1.0, random_state=None):
        self.base_estimator = base_estimator
        self.n_estimators = n_estimators
        self.max_samples = max_samples
        self.random_state = random_state
        self.estimators_ = []
        self.oob_score_ = None

    def _bootstrap_sample(self, X, y, rng):
        """Create a bootstrap sample."""
        n_samples = X.shape[0]
        n_bootstrap = int(n_samples * self.max_samples)

        # Sample WITH replacement
        indices = rng.choice(n_samples, size=n_bootstrap, replace=True)
        oob_indices = list(set(range(n_samples)) - set(indices))

        return X[indices], y[indices], oob_indices

    def fit(self, X, y):
        """Fit the bagging ensemble."""
        X, y = np.array(X), np.array(y)
        n_samples = X.shape[0]

        rng = np.random.RandomState(self.random_state)
        self.estimators_ = []

        # For OOB scoring
        oob_predictions = [[] for _ in range(n_samples)]

        for i in range(self.n_estimators):
            # Create bootstrap sample
            X_boot, y_boot, oob_indices = self._bootstrap_sample(X, y, rng)

            # Clone and fit estimator
            estimator = clone(self.base_estimator)
            estimator.fit(X_boot, y_boot)
            self.estimators_.append(estimator)

            # Store OOB predictions
            if oob_indices:
                oob_pred = estimator.predict(X[oob_indices])
                for idx, pred in zip(oob_indices, oob_pred):
                    oob_predictions[idx].append(pred)

        # Calculate OOB score
        oob_correct = 0
        oob_count = 0
        for i, preds in enumerate(oob_predictions):
            if preds:
                majority = Counter(preds).most_common(1)[0][0]
                if majority == y[i]:
                    oob_correct += 1
                oob_count += 1

        if oob_count > 0:
            self.oob_score_ = oob_correct / oob_count

        return self

    def predict(self, X):
        """Predict using majority vote."""
        X = np.array(X)

        # Get predictions from all estimators
        all_predictions = np.array([est.predict(X) for est in self.estimators_])

        # Majority vote
        final_predictions = []
        for i in range(X.shape[0]):
            votes = all_predictions[:, i]
            majority = Counter(votes).most_common(1)[0][0]
            final_predictions.append(majority)

        return np.array(final_predictions)

    def score(self, X, y):
        """Calculate accuracy."""
        predictions = self.predict(X)
        return np.mean(predictions == y)

# Test it!
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

print("BAGGING FROM SCRATCH")
print("="*60)

# Our implementation
our_bagging = BaggingClassifierFromScratch(
    base_estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
our_bagging.fit(X_train, y_train)

print(f"\nOur Implementation:")
print(f"  OOB Score: {our_bagging.oob_score_:.2%}")
print(f"  Test Accuracy: {our_bagging.score(X_test, y_test):.2%}")

# Sklearn implementation
from sklearn.ensemble import BaggingClassifier
sklearn_bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    oob_score=True,
    random_state=42
)
sklearn_bagging.fit(X_train, y_train)

print(f"\nSklearn Implementation:")
print(f"  OOB Score: {sklearn_bagging.oob_score_:.2%}")
print(f"  Test Accuracy: {sklearn_bagging.score(X_test, y_test):.2%}")

Quick Reference Card

BAGGING: CHEAT SHEET
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

WHAT IT IS:
  Bootstrap Aggregating — train multiple models on 
  different bootstrap samples, combine by voting/averaging

THREE STEPS:
  1. Bootstrap: Create N random samples (with replacement)
  2. Train: Fit one model per bootstrap sample
  3. Aggregate: Vote (classification) or average (regression)

VARIANCE REDUCTION:
  Var(ensemble) ≈ ρσ² + (1-ρ)σ²/N

  • Independent models (ρ=0): Variance → σ²/N
  • More models (N↑): Lower variance
  • Less correlation (ρ↓): Lower variance

KEY HYPERPARAMETERS:
  n_estimators:    50-500 trees (more = better, slower)
  max_samples:     1.0 (100% of data per bootstrap)
  max_features:    1.0 (100% of features)
  oob_score:       True for free validation

OOB (OUT-OF-BAG):
  ~36.8% of data not in each bootstrap → free validation!
  OOB score ≈ test score

WHEN TO USE:
  ✓ High variance models (deep trees, KNN with small k)
  ✓ Want stable predictions
  ✓ Don't want to tune individual models

WHEN NOT EFFECTIVE:
  ✗ Low variance models (linear regression)
  ✗ High bias models (shallow trees)
  ✗ Need interpretability

SKLEARN:
  from sklearn.ensemble import BaggingClassifier, BaggingRegressor

Key Takeaways

  1. Bagging = Bootstrap + Aggregate — Train on random samples, combine predictions

  2. Variance reduction is the goal — Individual errors cancel when averaged

  3. More trees = more stable — Diminishing returns after ~50-100 trees

  4. Works best with high-variance models — Deep trees, neural networks, KNN

  5. OOB gives free validation — ~36.8% of data unused per tree → evaluate for free

  6. Doesn't reduce bias — If base model is biased, ensemble is too

  7. Random Forest is bagging++ — Adds random feature selection for lower correlation

  8. No overfitting risk — More trees can only help (or stay same)


The One-Sentence Summary

Bagging is like a jury system for machine learning: instead of relying on one potentially biased judge (model), we train multiple judges on different evidence (bootstrap samples) and let them vote — individual errors cancel out, variance drops dramatically, and the collective wisdom produces more stable, reliable predictions than any single model could achieve alone.


What's Next?

Now that you understand bagging, you're ready for:

  1. Random Forests — Bagging + random feature selection
  2. Boosting — Sequential learning from mistakes
  3. Stacking — Combining different model types
  4. Out-of-Bag Feature Importance — Which features matter?

Follow me for the next article in the Tree Based Models series!


Let's Connect!

If the jury system made bagging click, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

What's your favorite ensemble method? I love how bagging turns unstable trees into rock-solid predictors! 👨‍⚖️


The wisdom of crowds isn't magic — it's mathematics. When independent judges make independent errors, those errors cancel out. Bagging brings this ancient wisdom to machine learning, proving that twelve noisy trees are better than one perfect tree.


Share this with someone confused by ensemble methods. The jury has spoken!

Happy bagging! 🗳️🌲

Top comments (2)

크르릉이:

Bagging's like the OG ensemble method, right? The whole "wisdom of crowds" vibe really works for reducing variance. It's like the more models you throw into the mix, the more the errors cancel each other out. The jury analogy is interesting, but I'm curious about how it holds up with more complex datasets or if it starts to falter with high bias situations. Just wondering if you've got thoughts on bagging's interaction with other ensemble methods like boosting or stacking? Each has its own spin on correcting errors, so what's the edge cases where bagging really shines compared to the others? Also, do you think the whole idea of training multiple models could become more streamlined with the new fine-tuning tricks in LLMs and advanced inference scaling?

Sachin Kr. Rajput:

Spot on about the high bias thing - you've basically found bagging's weak spot! Bagging only helps with variance, it literally can't fix bias. If your base model is too simple, averaging a thousand of them still gives you a thousand wrong answers lol. That's actually why boosting exists - it trains models sequentially where each one tries to fix the mistakes of the previous one, so it chips away at bias instead.
The way I think about it: bagging = parallel training, reduces variance. Boosting = sequential training, reduces bias. Stacking = let a meta-model figure out how to combine different model types.
Bagging tends to win when you've got noisy data (boosting would overfit to the noise), when you can throw compute at parallel training, or when you want that free OOB validation. Boosting usually gets you better raw accuracy but it's pickier about clean data.
And yeah the LLM connection is interesting - things like self-consistency decoding are basically bagging for reasoning right? Sample a bunch of reasoning paths, majority vote. Same principle. Also MoE architectures kind of bake the ensemble idea directly into the model instead of doing it post-hoc.
Planning to cover boosting in the next article actually - that's where it gets really fun with the sequential error correction. Will definitely touch on when to pick which method!