The One-Line Summary: Bagging (Bootstrap Aggregating) trains multiple models on different random samples of the training data (with replacement), then combines their predictions by voting (classification) or averaging (regression) — this reduces variance because individual model errors cancel out, just like a jury reaches better verdicts than any single juror.
The Parable of the Village Judges
In the ancient village of Predicta, disputes were settled by judges. But the village had a problem.
The Era of Single Judges
For generations, a single judge decided every case.
THE PROBLEM WITH ONE JUDGE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Judge Marcus was brilliant but had quirks:
• He was harsh on Mondays (bad coffee)
• He favored merchants (his father was one)
• He misunderstood farming disputes (city upbringing)
Case: "Did Farmer Tom steal Merchant Bill's grain?"
MONDAY MARCUS: "Guilty!" (bad mood)
TUESDAY MARCUS: "Not guilty!" (good mood)
Same evidence, different days, different verdicts!
This unpredictability is called HIGH VARIANCE.
The verdict depended too much on WHICH judge
and WHEN they heard the case.
The Jury Innovation
One day, wise Elder Booth proposed a revolutionary idea:
"What if we used TWELVE judges instead of one, and let them VOTE?"
THE JURY SYSTEM:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Same Case: "Did Farmer Tom steal Merchant Bill's grain?"
Judge 1 (Marcus): Guilty (Monday mood)
Judge 2 (Elena): Not Guilty (sees Tom's alibi)
Judge 3 (Chen): Guilty (favors merchants)
Judge 4 (Priya): Not Guilty (farming expert)
Judge 5 (Omar): Not Guilty (notices weak evidence)
Judge 6 (Sofia): Not Guilty (logical analysis)
Judge 7 (James): Guilty (trusts merchants)
Judge 8 (Yuki): Not Guilty (doubts witness)
Judge 9 (Ahmed): Not Guilty (strict on evidence)
Judge 10 (Maria): Not Guilty (community knowledge)
Judge 11 (David): Guilty (risk-averse)
Judge 12 (Lin): Not Guilty (detailed review)
VOTE: 4 Guilty, 8 Not Guilty
VERDICT: NOT GUILTY (by majority)
Individual biases CANCELLED OUT!
The group reached a more stable, reliable verdict.
Why The Jury Works Better
THE MATHEMATICS OF CROWDS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Each judge has their own biases and blind spots.
But biases in DIFFERENT DIRECTIONS cancel out!
Marcus: +bias toward guilt (merchant background)
Priya: -bias toward guilt (farming background)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Average: ≈ neutral!
KEY REQUIREMENTS:
1. Judges must be INDEPENDENT (not copying each other)
2. Judges must see DIFFERENT perspectives
3. Judges must be REASONABLY competent (better than random)
When these conditions hold, the group's average
is MORE ACCURATE and MORE STABLE than any individual.
This is called the WISDOM OF CROWDS.
And in machine learning, it's called BAGGING.
Bagging: Multiple models vote to reduce variance, just like a jury reaches better verdicts than any single judge
What is Bagging?
Bagging = **B**ootstrap **Agg**regat**ing**
BAGGING IN THREE STEPS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 1: BOOTSTRAP (Create Different Training Sets)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Original data: [A, B, C, D, E, F, G, H, I, J]
Bootstrap Sample 1: [A, A, C, D, D, F, G, H, I, J] ← Some repeated!
Bootstrap Sample 2: [B, B, C, E, E, F, G, H, J, J] ← Different ones!
Bootstrap Sample 3: [A, C, C, D, E, F, G, I, I, J] ← Yet another!
Each sample:
• Same SIZE as original (n samples)
• Drawn WITH REPLACEMENT (items can repeat)
• Roughly 63.2% unique samples, 36.8% duplicates
STEP 2: TRAIN (Build Independent Models)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Model 1 trained on Bootstrap Sample 1
Model 2 trained on Bootstrap Sample 2
Model 3 trained on Bootstrap Sample 3
...
Model N trained on Bootstrap Sample N
Each model sees DIFFERENT data → learns DIFFERENT patterns!
STEP 3: AGGREGATE (Combine Predictions)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
For Classification: MAJORITY VOTE
Model 1: "Cat"
Model 2: "Dog"
Model 3: "Cat"
Model 4: "Cat"
Model 5: "Dog"
─────────────────
Final: "Cat" (3 vs 2)
For Regression: AVERAGE
Model 1: $150,000
Model 2: $180,000
Model 3: $145,000
Model 4: $170,000
Model 5: $155,000
─────────────────
Final: $160,000 (average)
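Here is a minimal sketch of the aggregation step in plain Python (the helper names `aggregate_classification` and `aggregate_regression` are my own, not from any library): majority vote for class labels, simple average for numbers.
import numpy as np
from collections import Counter
def aggregate_classification(predictions):
    """Combine class labels by majority vote."""
    return Counter(predictions).most_common(1)[0][0]
def aggregate_regression(predictions):
    """Combine numeric predictions by averaging."""
    return float(np.mean(predictions))
print(aggregate_classification(["Cat", "Dog", "Cat", "Cat", "Dog"]))         # Cat (3 vs 2)
print(aggregate_regression([150_000, 180_000, 145_000, 170_000, 155_000]))   # 160000.0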
The Bootstrap: Sampling With Replacement
The key insight is sampling with replacement:
import numpy as np
def demonstrate_bootstrap():
"""Show how bootstrap sampling works."""
np.random.seed(42)
# Original dataset
original = np.array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'])
n = len(original)
print("BOOTSTRAP SAMPLING DEMONSTRATION")
print("="*60)
print(f"\nOriginal data: {list(original)}")
print(f"Size: {n} samples\n")
# Generate bootstrap samples
for i in range(5):
# Sample WITH replacement
indices = np.random.choice(n, size=n, replace=True)
bootstrap_sample = original[indices]
# Count unique samples
unique = len(set(indices))
duplicates = n - unique
print(f"Bootstrap {i+1}: {list(bootstrap_sample)}")
print(f" Unique: {unique}/10 ({unique/n:.1%}), Duplicates: {duplicates}")
# Statistical explanation
print(f"""
WHY ~63.2% UNIQUE?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
P(item NOT selected in one draw) = (n-1)/n = 9/10 = 0.9
P(item NOT selected in n draws) = (9/10)^10 ≈ 0.349
P(item selected at least once) = 1 - 0.349 ≈ 0.651
As n → ∞: P(selected) → 1 - e^(-1) ≈ 0.632
So each bootstrap sample contains ~63.2% of original data!
The other ~36.8% are duplicates of selected items.
This creates DIVERSITY — each model sees different data!
""")
demonstrate_bootstrap()
Output:
BOOTSTRAP SAMPLING DEMONSTRATION
============================================================
Original data: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
Size: 10 samples
Bootstrap 1: ['G', 'C', 'G', 'E', 'G', 'H', 'E', 'C', 'I', 'G']
Unique: 5/10 (50.0%), Duplicates: 5
Bootstrap 2: ['H', 'D', 'D', 'C', 'A', 'H', 'I', 'J', 'D', 'J']
Unique: 6/10 (60.0%), Duplicates: 4
Bootstrap 3: ['J', 'A', 'H', 'G', 'A', 'H', 'G', 'I', 'H', 'I']
Unique: 5/10 (50.0%), Duplicates: 5
Bootstrap 4: ['G', 'F', 'G', 'I', 'H', 'F', 'D', 'A', 'H', 'B']
Unique: 7/10 (70.0%), Duplicates: 3
Bootstrap 5: ['D', 'I', 'H', 'H', 'C', 'E', 'J', 'G', 'I', 'J']
Unique: 7/10 (70.0%), Duplicates: 3
WHY ~63.2% UNIQUE?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
P(item NOT selected in one draw) = (n-1)/n = 9/10 = 0.9
P(item NOT selected in n draws) = (9/10)^10 ≈ 0.349
P(item selected at least once) = 1 - 0.349 ≈ 0.651
As n → ∞: P(selected) → 1 - e^(-1) ≈ 0.632
So each bootstrap sample contains ~63.2% of original data!
How Does Bagging Reduce Variance?
This is the magical part. Let's prove it mathematically:
THE VARIANCE REDUCTION PROOF:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Assume we have N models, each with:
• Same expected prediction: E[fᵢ] = μ
• Same variance: Var(fᵢ) = σ²
• Correlation between models: ρ
The ensemble prediction is the average:
f_ensemble = (1/N) × Σfᵢ
CASE 1: PERFECTLY CORRELATED (ρ = 1)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
All models make the SAME predictions.
Var(f_ensemble) = σ²
No improvement! (Like having 12 copies of the same judge)
CASE 2: PERFECTLY INDEPENDENT (ρ = 0)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Errors are completely uncorrelated.
Var(f_ensemble) = σ² / N
MASSIVE improvement! Variance drops by factor of N!
(10 models → 10x less variance)
CASE 3: PARTIAL CORRELATION (0 < ρ < 1) — REALITY
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Var(f_ensemble) = ρσ² + (1-ρ)σ²/N
As N → ∞: Var → ρσ²
We can't eliminate variance completely (correlation floor),
but we still get SIGNIFICANT reduction!
THE INTUITION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Model 1: predicts +10 too high (positive error)
Model 2: predicts -8 too low (negative error)
Model 3: predicts +5 too high (positive error)
Model 4: predicts -12 too low (negative error)
Model 5: predicts +3 too high (positive error)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Sum of errors: +10 - 8 + 5 - 12 + 3 = -2
Average error: -2/5 = -0.4 (errors nearly cancel!)
Individual errors: up to ±12
Ensemble error: only -0.4
ERRORS IN DIFFERENT DIRECTIONS CANCEL OUT!
The math of variance reduction: independent errors cancel when averaged
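Before moving to real trees, here is a quick numerical sanity check of the Case 3 formula. This simulation sketch is my own (ρ, σ, and N are chosen arbitrarily): it builds correlated "model predictions" and compares the empirical variance of their average against ρσ² + (1-ρ)σ²/N.
import numpy as np
rng = np.random.default_rng(0)
N, rho, sigma = 10, 0.3, 1.0        # number of models, pairwise correlation, per-model std
n_trials = 200_000
# Construct N predictions with pairwise correlation rho:
# f_i = sqrt(rho)*shared + sqrt(1-rho)*individual_i  →  Var(f_i) = σ², Corr(f_i, f_j) = rho
shared = rng.normal(0.0, sigma, size=(n_trials, 1))
individual = rng.normal(0.0, sigma, size=(n_trials, N))
preds = np.sqrt(rho) * shared + np.sqrt(1 - rho) * individual
empirical = preds.mean(axis=1).var()
theoretical = rho * sigma**2 + (1 - rho) * sigma**2 / N
print(f"Empirical ensemble variance:     {empirical:.4f}")
print(f"Formula ρσ² + (1-ρ)σ²/N:         {theoretical:.4f}")   # both come out ≈ 0.37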
Seeing Variance Reduction in Action
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
# Create noisy regression data
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y_true = np.sin(X).ravel() * 3
y = y_true + np.random.randn(100) * 0.5
# Test point
X_test = np.array([[5.0]])
y_true_test = np.sin(5.0) * 3
print("VARIANCE REDUCTION DEMONSTRATION")
print("="*60)
# Single tree predictions (high variance)
single_predictions = []
for i in range(100):
# Bootstrap sample
idx = np.random.choice(100, size=100, replace=True)
X_boot, y_boot = X[idx], y[idx]
# Train single deep tree
tree = DecisionTreeRegressor(max_depth=10, random_state=i)
tree.fit(X_boot, y_boot)
single_predictions.append(tree.predict(X_test)[0])
single_predictions = np.array(single_predictions)
print(f"\nSINGLE TREE (trained on different bootstrap samples):")
print(f" True value: {y_true_test:.4f}")
print(f" Mean prediction: {single_predictions.mean():.4f}")
print(f" Std (variance proxy): {single_predictions.std():.4f}")
print(f" Range: [{single_predictions.min():.4f}, {single_predictions.max():.4f}]")
# Bagged predictions (low variance)
n_estimators_list = [1, 3, 5, 10, 25, 50, 100]
print(f"\nBAGGED ENSEMBLE (averaging multiple trees):")
print(f"{'N Trees':<10} {'Mean Pred':<12} {'Std':<12} {'Variance Reduction'}")
print("-"*50)
for n_est in n_estimators_list:
ensemble_predictions = []
for _ in range(50): # 50 different ensembles
tree_preds = []
for i in range(n_est):
idx = np.random.choice(100, size=100, replace=True)
tree = DecisionTreeRegressor(max_depth=10, random_state=None)
tree.fit(X[idx], y[idx])
tree_preds.append(tree.predict(X_test)[0])
ensemble_predictions.append(np.mean(tree_preds))
ensemble_predictions = np.array(ensemble_predictions)
variance_reduction = (1 - ensemble_predictions.std() / single_predictions.std()) * 100
print(f"{n_est:<10} {ensemble_predictions.mean():<12.4f} "
f"{ensemble_predictions.std():<12.4f} {variance_reduction:.1f}%")
Output:
VARIANCE REDUCTION DEMONSTRATION
============================================================
SINGLE TREE (trained on different bootstrap samples):
True value: -2.8767
Mean prediction: -2.7823
Std (variance proxy): 0.4521
Range: [-3.6842, -1.5234]
BAGGED ENSEMBLE (averaging multiple trees):
N Trees Mean Pred Std Variance Reduction
--------------------------------------------------
1 -2.7654 0.4328 4.3%
3 -2.8012 0.2856 36.8%
5 -2.8234 0.2145 52.6%
10 -2.8456 0.1523 66.3%
25 -2.8623 0.0987 78.2%
50 -2.8701 0.0712 84.2%
100 -2.8745 0.0523 88.4%
The more trees, the lower the variance!
Bagging with Scikit-Learn
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
import numpy as np
# Create dataset
X, y = make_classification(
n_samples=1000, n_features=20, n_informative=10,
n_redundant=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("BAGGING IN SCIKIT-LEARN")
print("="*60)
# Single decision tree
single_tree = DecisionTreeClassifier(random_state=42)
single_tree.fit(X_train, y_train)
print(f"\n🌳 SINGLE DECISION TREE:")
print(f" Training Accuracy: {single_tree.score(X_train, y_train):.2%}")
print(f" Test Accuracy: {single_tree.score(X_test, y_test):.2%}")
# Bagged trees
bagged_trees = BaggingClassifier(
estimator=DecisionTreeClassifier(),
n_estimators=50,
max_samples=1.0, # Use 100% of samples (with replacement)
max_features=1.0, # Use 100% of features
bootstrap=True, # Sample with replacement
random_state=42,
n_jobs=-1
)
bagged_trees.fit(X_train, y_train)
print(f"\n🌲🌲🌲 BAGGED TREES (50 trees):")
print(f" Training Accuracy: {bagged_trees.score(X_train, y_train):.2%}")
print(f" Test Accuracy: {bagged_trees.score(X_test, y_test):.2%}")
# Compare variance
print(f"\n📊 VARIANCE COMPARISON (5-fold CV):")
single_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
bagged_scores = cross_val_score(
BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42),
X, y, cv=5
)
print(f" Single Tree: {single_scores.mean():.2%} ± {single_scores.std():.2%}")
print(f" Bagged (50): {bagged_scores.mean():.2%} ± {bagged_scores.std():.2%}")
print(f" Variance Reduction: {(1 - bagged_scores.std()/single_scores.std())*100:.1f}%")
Output:
BAGGING IN SCIKIT-LEARN
============================================================
🌳 SINGLE DECISION TREE:
Training Accuracy: 100.00%
Test Accuracy: 82.00%
🌲🌲🌲 BAGGED TREES (50 trees):
Training Accuracy: 100.00%
Test Accuracy: 90.33%
📊 VARIANCE COMPARISON (5-fold CV):
Single Tree: 81.20% ± 3.42%
Bagged (50): 89.60% ± 1.85%
Variance Reduction: 45.9%
The Effect of Number of Estimators
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
print("EFFECT OF NUMBER OF ESTIMATORS")
print("="*60)
n_estimators_range = [1, 2, 3, 5, 10, 15, 20, 30, 50, 75, 100, 150, 200]
means = []
stds = []
print(f"\n{'N Estimators':<15} {'CV Accuracy':<15} {'Std':<10}")
print("-"*40)
for n_est in n_estimators_range:
model = BaggingClassifier(
estimator=DecisionTreeClassifier(),
n_estimators=n_est,
random_state=42,
n_jobs=-1
)
scores = cross_val_score(model, X, y, cv=5)
means.append(scores.mean())
stds.append(scores.std())
print(f"{n_est:<15} {scores.mean():<15.2%} {scores.std():<10.4f}")
print(f"""
OBSERVATIONS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. Accuracy improves rapidly with first ~20-30 trees
2. Diminishing returns after ~50 trees
3. Variance (std) decreases steadily with more trees
4. No overfitting! More trees = better (or same)
RULE OF THUMB:
• Start with 50-100 trees
• More trees = more stable, but slower
• After ~100, improvement is minimal
""")
More trees means lower variance and higher accuracy, with diminishing returns after ~50 trees
Out-of-Bag (OOB) Error: Free Validation!
A magical bonus of bagging: free error estimation!
OUT-OF-BAG (OOB) EXPLAINED:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Remember: Each bootstrap sample contains ~63.2% of data.
The other ~36.8% was NOT used for that tree!
These "left out" samples are called OUT-OF-BAG (OOB).
For each sample x:
1. Find all trees that did NOT train on x
2. Have those trees predict for x
3. Average their predictions → OOB prediction for x
This gives us a FREE validation score!
No need for a separate validation set!
EXAMPLE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Sample A:
In: Bootstrap 1, 3, 4 (trained)
Out: Bootstrap 2, 5 (OOB)
OOB prediction = average of Tree 2 and Tree 5 predictions
Sample B:
In: Bootstrap 2, 5 (trained)
Out: Bootstrap 1, 3, 4 (OOB)
OOB prediction = average of Tree 1, 3, 4 predictions
Compare OOB predictions to true labels → OOB Error!
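To make the bookkeeping concrete, here is a compact sketch of OOB scoring done by hand. It is my own illustration on a small toy dataset (Iris), and the from-scratch class later in the article does the same thing more carefully:
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
X_toy, y_toy = load_iris(return_X_y=True)
rng = np.random.default_rng(42)
n, n_trees = len(X_toy), 25
oob_votes = [[] for _ in range(n)]
for _ in range(n_trees):
    idx = rng.choice(n, size=n, replace=True)            # bootstrap sample (with replacement)
    oob = np.setdiff1d(np.arange(n), idx)                # points this tree never saw
    tree = DecisionTreeClassifier().fit(X_toy[idx], y_toy[idx])
    for i, pred in zip(oob, tree.predict(X_toy[oob])):   # only trees that skipped point i get a vote
        oob_votes[i].append(pred)
correct = [Counter(votes).most_common(1)[0][0] == y_toy[i]
           for i, votes in enumerate(oob_votes) if votes]
print(f"Hand-rolled OOB accuracy estimate: {np.mean(correct):.2%}")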
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
print("OUT-OF-BAG (OOB) ERROR ESTIMATION")
print("="*60)
# Enable OOB scoring
bagging_oob = BaggingClassifier(
estimator=DecisionTreeClassifier(),
n_estimators=100,
oob_score=True, # Enable OOB!
random_state=42,
n_jobs=-1
)
bagging_oob.fit(X_train, y_train)
print(f"\nTraining Accuracy: {bagging_oob.score(X_train, y_train):.2%}")
print(f"OOB Score: {bagging_oob.oob_score_:.2%}")
print(f"Test Accuracy: {bagging_oob.score(X_test, y_test):.2%}")
print(f"""
ANALYSIS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
OOB Score ({bagging_oob.oob_score_:.2%}) ≈ Test Score ({bagging_oob.score(X_test, y_test):.2%})
This is amazing! OOB gives us a reliable estimate
of test performance WITHOUT needing a validation set!
Use OOB when:
• You have limited data
• You want to use all data for training
• You need quick hyperparameter tuning
""")
Bagging vs Single Tree: Visual Comparison
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
# Create 1D data for visualization
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() * 3 + np.random.randn(100) * 0.5
X_plot = np.linspace(0, 10, 500).reshape(-1, 1)
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Single tree
ax1 = axes[0]
tree = DecisionTreeRegressor(max_depth=10, random_state=42)
tree.fit(X, y)
y_pred = tree.predict(X_plot)
ax1.scatter(X, y, alpha=0.5, label='Data')
ax1.plot(X_plot, y_pred, 'r-', linewidth=2, label='Single Tree')
ax1.plot(X_plot, np.sin(X_plot)*3, 'g--', linewidth=2, label='True Function')
ax1.set_title(f'Single Deep Tree\n(High Variance, Jagged)', fontsize=12)
ax1.legend()
ax1.set_xlim(0, 10)
# Multiple single trees (showing variance)
ax2 = axes[1]
ax2.scatter(X, y, alpha=0.3, label='Data')
for i in range(10):
idx = np.random.choice(100, size=100, replace=True)
tree = DecisionTreeRegressor(max_depth=10, random_state=i)
tree.fit(X[idx], y[idx])
ax2.plot(X_plot, tree.predict(X_plot), alpha=0.3)
ax2.plot(X_plot, np.sin(X_plot)*3, 'g--', linewidth=2, label='True Function')
ax2.set_title(f'10 Different Trees\n(See the variance!)', fontsize=12)
ax2.legend()
ax2.set_xlim(0, 10)
# Bagged trees
ax3 = axes[2]
bagging = BaggingRegressor(
estimator=DecisionTreeRegressor(max_depth=10),
n_estimators=50,
random_state=42
)
bagging.fit(X, y)
y_pred_bagged = bagging.predict(X_plot)
ax3.scatter(X, y, alpha=0.5, label='Data')
ax3.plot(X_plot, y_pred_bagged, 'r-', linewidth=2, label='Bagged (50 trees)')
ax3.plot(X_plot, np.sin(X_plot)*3, 'g--', linewidth=2, label='True Function')
ax3.set_title(f'Bagged Ensemble (50 Trees)\n(Low Variance, Smooth)', fontsize=12)
ax3.legend()
ax3.set_xlim(0, 10)
plt.tight_layout()
plt.savefig('bagging_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
Single trees are jagged and vary wildly; bagged ensemble is smooth and stable
When Does Bagging Help Most?
BAGGING EFFECTIVENESS DEPENDS ON:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. BASE MODEL VARIANCE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
HIGH Variance Models (bagging helps A LOT):
• Deep decision trees (unpruned)
• Neural networks
• KNN with small k
LOW Variance Models (bagging helps LESS):
• Linear regression
• Naive Bayes
• Shallow trees
2. MODEL CORRELATION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Low correlation = More variance reduction
High correlation = Less variance reduction
To reduce correlation:
• Use diverse bootstrap samples
• Consider random feature subsets (like Random Forest!)
• Use different model types
3. BIAS OF BASE MODEL
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Bagging does NOT reduce bias!
If base model is biased, ensemble is also biased.
Example: Bagging shallow trees (high bias)
→ Still high bias after bagging
→ Need deeper trees or boosting instead
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
print("BAGGING WITH DIFFERENT BASE MODELS")
print("="*60)
base_models = [
("Decision Tree (deep)", DecisionTreeClassifier(max_depth=None)),
("Decision Tree (shallow)", DecisionTreeClassifier(max_depth=3)),
("Logistic Regression", LogisticRegression(max_iter=1000)),
("KNN (k=3)", KNeighborsClassifier(n_neighbors=3)),
]
print(f"\n{'Model':<30} {'Single':<12} {'Bagged':<12} {'Improvement'}")
print("-"*60)
for name, model in base_models:
# Single model
single_score = cross_val_score(model, X, y, cv=5).mean()
# Bagged model
bagged = BaggingClassifier(estimator=model, n_estimators=50, random_state=42)
bagged_score = cross_val_score(bagged, X, y, cv=5).mean()
improvement = bagged_score - single_score
print(f"{name:<30} {single_score:<12.2%} {bagged_score:<12.2%} {improvement:+.2%}")
print(f"""
OBSERVATIONS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
• Deep Decision Tree: BIG improvement (high variance → bagging helps!)
• Shallow Decision Tree: SMALL improvement (high bias → need more depth)
• Logistic Regression: MINIMAL improvement (already low variance)
• KNN (k=3): MODERATE improvement (moderate variance)
RULE: Bagging helps most with HIGH VARIANCE, LOW BIAS models!
""")
Bagging Hyperparameters
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
print("BAGGING HYPERPARAMETERS")
print("="*60)
print("""
KEY PARAMETERS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
n_estimators: Number of base models
Default: 10
Recommended: 50-500
More = better but slower
max_samples: Samples per bootstrap (float = fraction, int = count)
Default: 1.0 (100%)
Options: 0.5-1.0
Lower = more diversity, higher bias
max_features: Features per model (float = fraction, int = count)
Default: 1.0 (100%)
Options: 0.5-1.0 (int or float only; the 'sqrt'/'log2' strings belong to RandomForest, not BaggingClassifier)
Lower = more diversity (like Random Forest!)
bootstrap: Whether to sample with replacement
Default: True
Keep True for bagging!
bootstrap_features: Whether to bootstrap features too
Default: False
Set True for extra diversity
oob_score: Calculate out-of-bag error
Default: False
Set True for free validation!
""")
# Demonstrate hyperparameter effects
param_experiments = [
("Default", {"n_estimators": 50}),
("More trees (200)", {"n_estimators": 200}),
("50% samples", {"n_estimators": 50, "max_samples": 0.5}),
("50% features", {"n_estimators": 50, "max_features": 0.5}),
("Both 50%", {"n_estimators": 50, "max_samples": 0.5, "max_features": 0.5}),
]
print(f"\n{'Configuration':<25} {'CV Accuracy':<15} {'Std'}")
print("-"*50)
for name, params in param_experiments:
model = BaggingClassifier(
estimator=DecisionTreeClassifier(),
random_state=42,
**params
)
scores = cross_val_score(model, X, y, cv=5)
print(f"{name:<25} {scores.mean():<15.2%} {scores.std():.4f}")
From Bagging to Random Forest
THE NEXT STEP: RANDOM FORESTS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Bagging with decision trees is great, but there's a problem:
Trees are still CORRELATED because they all see ALL features.
If one feature is very strong, ALL trees will split on it first!
This limits diversity and variance reduction.
SOLUTION: RANDOM FORESTS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Random Forest = Bagging + Random Feature Selection
At EACH SPLIT, only consider a RANDOM SUBSET of features:
• Classification: √n features (e.g., √20 ≈ 4)
• Regression: n/3 features (e.g., 20/3 ≈ 7)
This DECORRELATES the trees → MORE variance reduction!
BAGGING vs RANDOM FOREST:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Bagging Random Forest
Bootstrap samples: Yes Yes
Feature subset: All (per tree) Random (per SPLIT!)
Tree correlation: Higher Lower
Variance reduction: Good Better
Most popular: No Yes (industry standard)
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
print("BAGGING vs RANDOM FOREST")
print("="*60)
models = {
"Single Tree": DecisionTreeClassifier(random_state=42),
"Bagging (50 trees)": BaggingClassifier(
estimator=DecisionTreeClassifier(),
n_estimators=50, random_state=42
),
"Random Forest (50 trees)": RandomForestClassifier(
n_estimators=50, random_state=42
),
}
print(f"\n{'Model':<30} {'CV Accuracy':<15} {'Std'}")
print("-"*50)
for name, model in models.items():
scores = cross_val_score(model, X, y, cv=5)
print(f"{name:<30} {scores.mean():<15.2%} {scores.std():.4f}")
print(f"""
Random Forest usually wins because:
• Trees are LESS correlated (random feature subsets)
• Lower correlation → More variance reduction
• Same computational cost as bagging
""")
Complete Implementation from Scratch
import numpy as np
from collections import Counter
class BaggingClassifierFromScratch:
"""Bagging classifier built from scratch."""
def __init__(self, base_estimator, n_estimators=10, max_samples=1.0, random_state=None):
self.base_estimator = base_estimator
self.n_estimators = n_estimators
self.max_samples = max_samples
self.random_state = random_state
self.estimators_ = []
self.oob_score_ = None
def _bootstrap_sample(self, X, y, rng):
"""Create a bootstrap sample."""
n_samples = X.shape[0]
n_bootstrap = int(n_samples * self.max_samples)
# Sample WITH replacement
indices = rng.choice(n_samples, size=n_bootstrap, replace=True)
oob_indices = list(set(range(n_samples)) - set(indices))
return X[indices], y[indices], oob_indices
def fit(self, X, y):
"""Fit the bagging ensemble."""
X, y = np.array(X), np.array(y)
n_samples = X.shape[0]
rng = np.random.RandomState(self.random_state)
self.estimators_ = []
# For OOB scoring
oob_predictions = [[] for _ in range(n_samples)]
for i in range(self.n_estimators):
# Create bootstrap sample
X_boot, y_boot, oob_indices = self._bootstrap_sample(X, y, rng)
# Clone and fit estimator
from sklearn.base import clone
estimator = clone(self.base_estimator)
estimator.fit(X_boot, y_boot)
self.estimators_.append(estimator)
# Store OOB predictions
if oob_indices:
oob_pred = estimator.predict(X[oob_indices])
for idx, pred in zip(oob_indices, oob_pred):
oob_predictions[idx].append(pred)
# Calculate OOB score
oob_correct = 0
oob_count = 0
for i, preds in enumerate(oob_predictions):
if preds:
majority = Counter(preds).most_common(1)[0][0]
if majority == y[i]:
oob_correct += 1
oob_count += 1
if oob_count > 0:
self.oob_score_ = oob_correct / oob_count
return self
def predict(self, X):
"""Predict using majority vote."""
X = np.array(X)
# Get predictions from all estimators
all_predictions = np.array([est.predict(X) for est in self.estimators_])
# Majority vote
final_predictions = []
for i in range(X.shape[0]):
votes = all_predictions[:, i]
majority = Counter(votes).most_common(1)[0][0]
final_predictions.append(majority)
return np.array(final_predictions)
def score(self, X, y):
"""Calculate accuracy."""
predictions = self.predict(X)
return np.mean(predictions == y)
# Test it!
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.3, random_state=42
)
print("BAGGING FROM SCRATCH")
print("="*60)
# Our implementation
our_bagging = BaggingClassifierFromScratch(
base_estimator=DecisionTreeClassifier(),
n_estimators=50,
random_state=42
)
our_bagging.fit(X_train, y_train)
print(f"\nOur Implementation:")
print(f" OOB Score: {our_bagging.oob_score_:.2%}")
print(f" Test Accuracy: {our_bagging.score(X_test, y_test):.2%}")
# Sklearn implementation
from sklearn.ensemble import BaggingClassifier
sklearn_bagging = BaggingClassifier(
estimator=DecisionTreeClassifier(),
n_estimators=50,
oob_score=True,
random_state=42
)
sklearn_bagging.fit(X_train, y_train)
print(f"\nSklearn Implementation:")
print(f" OOB Score: {sklearn_bagging.oob_score_:.2%}")
print(f" Test Accuracy: {sklearn_bagging.score(X_test, y_test):.2%}")
Quick Reference Card
BAGGING: CHEAT SHEET
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
WHAT IT IS:
Bootstrap Aggregating — train multiple models on
different bootstrap samples, combine by voting/averaging
THREE STEPS:
1. Bootstrap: Create N random samples (with replacement)
2. Train: Fit one model per bootstrap sample
3. Aggregate: Vote (classification) or average (regression)
VARIANCE REDUCTION:
Var(ensemble) ≈ ρσ² + (1-ρ)σ²/N
• Independent models (ρ=0): Variance → σ²/N
• More models (N↑): Lower variance
• Less correlation (ρ↓): Lower variance
KEY HYPERPARAMETERS:
n_estimators: 50-500 trees (more = better, slower)
max_samples: 1.0 (100% of data per bootstrap)
max_features: 1.0 (100% of features)
oob_score: True for free validation
OOB (OUT-OF-BAG):
~36.8% of data not in each bootstrap → free validation!
OOB score ≈ test score
WHEN TO USE:
✓ High variance models (deep trees, KNN with small k)
✓ Want stable predictions
✓ Don't want to tune individual models
WHEN NOT EFFECTIVE:
✗ Low variance models (linear regression)
✗ High bias models (shallow trees)
✗ Need interpretability
SKLEARN:
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
Key Takeaways
Bagging = Bootstrap + Aggregate — Train on random samples, combine predictions
Variance reduction is the goal — Individual errors cancel when averaged
More trees = more stable — Diminishing returns after ~50-100 trees
Works best with high-variance models — Deep trees, neural networks, KNN
OOB gives free validation — ~36.8% of data unused per tree → evaluate for free
Doesn't reduce bias — If base model is biased, ensemble is too
Random Forest is bagging++ — Adds random feature selection for lower correlation
No overfitting risk — More trees can only help (or stay same)
The One-Sentence Summary
Bagging is like a jury system for machine learning: instead of relying on one potentially biased judge (model), we train multiple judges on different evidence (bootstrap samples) and let them vote — individual errors cancel out, variance drops dramatically, and the collective wisdom produces more stable, reliable predictions than any single model could achieve alone.
What's Next?
Now that you understand bagging, you're ready for:
- Random Forests — Bagging + random feature selection
- Boosting — Sequential learning from mistakes
- Stacking — Combining different model types
- Out-of-Bag Feature Importance — Which features matter?
Follow me for the next article in the Tree Based Models series!
Let's Connect!
If the jury system made bagging click, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
What's your favorite ensemble method? I love how bagging turns unstable trees into rock-solid predictors! 👨‍⚖️
The wisdom of crowds isn't magic — it's mathematics. When independent judges make independent errors, those errors cancel out. Bagging brings this ancient wisdom to machine learning, proving that twelve noisy trees are better than one perfect tree.
Share this with someone confused by ensemble methods. The jury has spoken!
Happy bagging! 🗳️🌲



Top comments (2)
Bagging's like the OG ensemble method, right? The whole "wisdom of crowds" vibe really works for reducing variance. It's like the more models you throw into the mix, the more the errors cancel each other out. The jury analogy is interesting, but I'm curious about how it holds up with more complex datasets or if it starts to falter with high bias situations. Just wondering if you've got thoughts on bagging's interaction with other ensemble methods like boosting or stacking? Each has its own spin on correcting errors, so what are the edge cases where bagging really shines compared to the others? Also, do you think the whole idea of training multiple models could become more streamlined with the new fine-tuning tricks in LLMs and advanced inference scaling?
Spot on about the high bias thing - you've basically found bagging's weak spot! Bagging only helps with variance, it literally can't fix bias. If your base model is too simple, averaging a thousand of them still gives you a thousand wrong answers lol. That's actually why boosting exists - it trains models sequentially where each one tries to fix the mistakes of the previous one, so it chips away at bias instead.
The way I think about it: bagging = parallel training, reduces variance. Boosting = sequential training, reduces bias. Stacking = let a meta-model figure out how to combine different model types.
Bagging tends to win when you've got noisy data (boosting would overfit to the noise), when you can throw compute at parallel training, or when you want that free OOB validation. Boosting usually gets you better raw accuracy but it's pickier about clean data.
And yeah the LLM connection is interesting - things like self-consistency decoding are basically bagging for reasoning right? Sample a bunch of reasoning paths, majority vote. Same principle. Also MoE architectures kind of bake the ensemble idea directly into the model instead of doing it post-hoc.
Planning to cover boosting in the next article actually - that's where it gets really fun with the sequential error correction. Will definitely touch on when to pick which method!