You saw in the last post that decision trees overfit easily. Change a few training examples and the whole tree changes. That instability is the core problem.
The fix is almost embarrassingly simple. Don't build one tree. Build hundreds of them. Make each one slightly different. Then have them all vote on the answer.
That's Random Forest. And it's one of the most reliable, battle-tested algorithms in all of machine learning.
What You'll Learn Here
- Why one tree fails and how combining many fixes it
- What bagging is and how it creates diversity
- What feature randomness is and why it matters
- How to build a Random Forest and tune it
- Out-of-bag error, a free validation trick
- Feature importance from a forest vs a single tree
The Wisdom of Crowds
Here's an experiment that actually happened.
In 1906, roughly 800 people at an English country fair were asked to guess the weight of an ox. Most individual guesses were well off the mark. But the average of all the guesses came to 1,197 pounds, and the ox actually weighed 1,198 pounds.
The crowd was more accurate than almost every individual.
That's the idea behind Random Forest. Each tree makes mistakes. But different trees make different mistakes. When you average their predictions, the mistakes cancel out and the correct signal gets stronger.
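You can see the same effect in a toy simulation (made-up numbers, not the real fair data): each individual guess is off by a lot, but the average of all of them is barely off at all.

import numpy as np

rng = np.random.default_rng(42)
true_weight = 1198                                      # the ox's actual weight
guesses = true_weight + rng.normal(0, 100, size=800)    # 800 noisy individual guesses

print(f"Typical individual error: {np.abs(guesses - true_weight).mean():.1f} lbs")
print(f"Error of the crowd's average: {abs(guesses.mean() - true_weight):.1f} lbs")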
This only works if the trees are different from each other. If every tree makes the same mistakes, averaging does nothing. Random Forest creates diversity in two ways: bagging and feature randomness.
How Diversity Is Created
Method 1: Bagging (Bootstrap Aggregating)
Each tree in the forest is trained on a different random sample of your training data. The sampling is done with replacement, meaning the same example can appear multiple times in one sample and not at all in another.
import numpy as np
# Simulate bagging: 10 training examples, sample with replacement
training_data = list(range(10)) # examples 0 through 9
np.random.seed(42)
for tree_num in range(5):
    bootstrap_sample = np.random.choice(training_data, size=10, replace=True)
    out_of_bag = set(training_data) - set(bootstrap_sample)
    print(f"Tree {tree_num + 1}: trained on {sorted(bootstrap_sample)}")
    print(f"  out-of-bag: {sorted(out_of_bag)}\n")
Output:
Tree 1: trained on [0, 0, 2, 2, 3, 4, 6, 7, 8, 9]
out-of-bag: [1, 5]
Tree 2: trained on [0, 1, 3, 4, 6, 7, 7, 8, 9, 9]
out-of-bag: [2, 5]
Tree 3: trained on [0, 1, 1, 2, 3, 5, 6, 6, 7, 9]
out-of-bag: [4, 8]
...
Each tree sees a different version of the data. So each tree makes somewhat different errors.
Method 2: Feature Randomness
At each split inside each tree, only a random subset of the features is considered. For classification, scikit-learn's default is sqrt(n_features) candidate features per split, controlled by the max_features parameter.
This stops all trees from always splitting on the same best feature. Even if one feature is very powerful, some trees won't use it at certain splits. That forces trees to find other patterns.
Together, bagging and feature randomness keep the trees as uncorrelated with each other as possible. The less their errors are correlated, the more the ensemble gains from averaging them.
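Both mechanisms are built into scikit-learn's RandomForestClassifier; you only touch them through parameters like max_features and bootstrap. A minimal sketch (the 30 features match the breast cancer data used below, and rf_diverse is just an illustrative name):

from sklearn.ensemble import RandomForestClassifier
import numpy as np

n_features = 30                      # the breast cancer dataset below has 30 features
print(int(np.sqrt(n_features)))      # 'sqrt' -> about 5 candidate features per split

# bootstrap=True (the default) enables bagging; max_features controls split randomness
rf_diverse = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                                    bootstrap=True, random_state=42)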
Building Your First Random Forest
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# 100 trees, default settings
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(f"Random Forest Accuracy: {accuracy_score(y_test, y_pred):.3f}")
Output:
Random Forest Accuracy: 0.974
Now compare that to a single decision tree on the same data:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
print(f"Single Tree Accuracy: {accuracy_score(y_test, tree.predict(X_test)):.3f}")
print(f"Random Forest Accuracy: {accuracy_score(y_test, rf.predict(X_test)):.3f}")
Output:
Single Tree Accuracy: 0.930
Random Forest Accuracy: 0.974
The forest beats the single tree without any tuning at all. That's typical.
Watching the Accuracy Grow With More Trees
One useful thing to check: how many trees do you actually need? Accuracy improves as you add trees but eventually levels off.
import matplotlib.pyplot as plt
import numpy as np
n_trees_list = [1, 5, 10, 20, 50, 100, 200, 500]
train_scores = []
test_scores = []
for n in n_trees_list:
    rf_n = RandomForestClassifier(n_estimators=n, random_state=42)
    rf_n.fit(X_train, y_train)
    train_scores.append(accuracy_score(y_train, rf_n.predict(X_train)))
    test_scores.append(accuracy_score(y_test, rf_n.predict(X_test)))
plt.figure(figsize=(9, 5))
plt.plot(n_trees_list, train_scores, label='Train accuracy', color='blue', marker='o')
plt.plot(n_trees_list, test_scores, label='Test accuracy', color='orange', marker='o')
plt.xlabel('Number of Trees')
plt.ylabel('Accuracy')
plt.title('Accuracy vs Number of Trees')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('rf_n_trees.png', dpi=100)
plt.show()
for n, tr, te in zip(n_trees_list, train_scores, test_scores):
    print(f"Trees: {n:<5} Train: {tr:.3f} Test: {te:.3f}")
Output:
Trees: 1 Train: 1.000 Test: 0.912
Trees: 5 Train: 1.000 Test: 0.956
Trees: 10 Train: 1.000 Test: 0.965
Trees: 20 Train: 1.000 Test: 0.965
Trees: 50 Train: 1.000 Test: 0.974
Trees: 100 Train: 1.000 Test: 0.974
Trees: 200 Train: 1.000 Test: 0.974
Trees: 500 Train: 1.000 Test: 0.974
Test accuracy levels off around 100 trees here. Adding more trees after that doesn't hurt, but it slows training for no gain. 100 to 300 is a reasonable range for most problems.
Out-of-Bag Error: Free Validation
Remember that each tree only sees about 63% of the training data due to bootstrapping. The other 37% (out-of-bag examples) can be used to validate each tree without needing a separate test set.
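The 63% figure isn't arbitrary; it's what sampling with replacement gives you. A quick sanity check with made-up numbers:

import numpy as np

n = 10_000
rng = np.random.default_rng(0)
sample = rng.choice(n, size=n, replace=True)

print(len(np.unique(sample)) / n)   # fraction of examples that made it into the sample, ~0.632
print(1 - (1 - 1/n) ** n)           # theoretical value, approaches 1 - 1/e ≈ 0.632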
scikit-learn does this automatically with oob_score=True.
rf_oob = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,   # enable out-of-bag scoring
    random_state=42
)
rf_oob.fit(X_train, y_train)
print(f"OOB Score: {rf_oob.oob_score_:.3f}")
print(f"Test Score: {accuracy_score(y_test, rf_oob.predict(X_test)):.3f}")
Output:
OOB Score: 0.967
Test Score: 0.974
OOB score is very close to the real test score. This is useful when you have limited data and don't want to sacrifice a big chunk for validation.
Feature Importance: More Reliable Than a Single Tree
A single tree's feature importance depends heavily on which tree structure happened to form. Random Forest averages importance across all trees, making it much more stable.
import pandas as pd
import matplotlib.pyplot as plt
importance_df = pd.DataFrame({
    'Feature': data.feature_names,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)
print("Top 10 most important features:")
print(importance_df.head(10).to_string(index=False))
# Plot
plt.figure(figsize=(10, 6))
plt.barh(
    importance_df['Feature'].head(15)[::-1],
    importance_df['Importance'].head(15)[::-1],
    color='steelblue'
)
plt.xlabel('Feature Importance')
plt.title('Random Forest Feature Importance')
plt.tight_layout()
plt.savefig('rf_feature_importance.png', dpi=100)
plt.show()
Output:
Top 10 most important features:
Feature Importance
worst concave points 0.148
worst radius 0.134
worst perimeter 0.112
mean concave points 0.101
worst area 0.098
...
These scores tell you what fraction of the total impurity reduction (Gini, by default) came from each feature across all trees and all splits. Averaged over a whole forest, that estimate is far more stable than a single tree's.
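If you want to see how much single-tree estimates actually swing, you can look at the individual trees through the forest's estimators_ attribute (this assumes the rf model fitted earlier):

import numpy as np

# Importance of every feature in each of the 100 individual trees
per_tree = np.array([t.feature_importances_ for t in rf.estimators_])

spread = pd.DataFrame({
    'Feature': data.feature_names,
    'Mean': per_tree.mean(axis=0),   # essentially what rf.feature_importances_ reports
    'Std': per_tree.std(axis=0)      # how much individual trees disagree
}).sort_values('Mean', ascending=False)
print(spread.head())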
Tuning a Random Forest
The main knobs to turn:
from sklearn.model_selection import cross_val_score
# The key hyperparameters
configs = [
    {'n_estimators': 100, 'max_depth': None, 'max_features': 'sqrt'},   # default
    {'n_estimators': 100, 'max_depth': 10,   'max_features': 'sqrt'},   # limit depth
    {'n_estimators': 100, 'max_depth': None, 'max_features': 'log2'},   # fewer features per split
    {'n_estimators': 200, 'max_depth': 10,   'max_features': 'sqrt'},   # more trees + limit
    {'n_estimators': 100, 'max_depth': None, 'min_samples_leaf': 4},    # bigger leaves
]
print(f"{'Config':<5} {'CV Mean':<10} {'CV Std'}")
print("-" * 30)
for i, config in enumerate(configs):
    rf_c = RandomForestClassifier(**config, random_state=42)
    scores = cross_val_score(rf_c, X_train, y_train, cv=5)
    print(f"{i+1:<5} {scores.mean():<10.3f} {scores.std():.3f}")
Key hyperparameters explained:
- n_estimators: number of trees. More is better but slower. Start at 100.
- max_depth: limits tree depth. Mostly a speed lever; it matters less than in a single tree because averaging already tames overfitting.
- max_features: features considered per split. 'sqrt' is the usual choice for classification; 'log2' considers even fewer, which can help on large feature sets.
- min_samples_leaf: minimum samples in a leaf. Higher values give smoother, less overfit trees.
- n_jobs=-1: use all CPU cores to train in parallel. Always set this.
# Always add n_jobs=-1 in practice
rf_fast = RandomForestClassifier(
    n_estimators=100,
    n_jobs=-1,        # parallelize across all cores
    random_state=42
)
Random Forest for Regression
Random Forest works for regression too. Same idea, but instead of voting on a class, trees average their numeric predictions.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
housing = fetch_california_housing()
X_h, y_h = housing.data, housing.target
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_h, y_h, test_size=0.2, random_state=42
)
rf_reg = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
rf_reg.fit(X_train_h, y_train_h)
y_pred_h = rf_reg.predict(X_test_h)
print(f"R2: {r2_score(y_test_h, y_pred_h):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test_h, y_pred_h)):.3f}")
Output:
R2: 0.805
RMSE: 0.503
Compare that to linear regression's R2 of 0.576 on the same dataset. Random Forest gets 0.805 with zero preprocessing, zero feature engineering, and zero tuning.
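To see the averaging literally, you can rebuild the forest's prediction from its individual trees via estimators_ (using the rf_reg model just fitted):

import numpy as np

# Stack each tree's predictions, then average them across trees
per_tree_preds = np.stack([t.predict(X_test_h) for t in rf_reg.estimators_])
manual_average = per_tree_preds.mean(axis=0)

print(np.allclose(manual_average, rf_reg.predict(X_test_h)))   # True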
Single Tree vs Random Forest: Side by Side
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
datasets = {
    'Breast Cancer': load_breast_cancer(),
}

for name, data in datasets.items():
    X_d, y_d = data.data, data.target
    tree_scores = cross_val_score(
        DecisionTreeClassifier(random_state=42), X_d, y_d, cv=5
    )
    rf_scores = cross_val_score(
        RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
        X_d, y_d, cv=5
    )
    print(f"\n{name}:")
    print(f"  Single Tree:   {tree_scores.mean():.3f} +/- {tree_scores.std():.3f}")
    print(f"  Random Forest: {rf_scores.mean():.3f} +/- {rf_scores.std():.3f}")
Output:
Breast Cancer:
Single Tree: 0.930 +/- 0.017
Random Forest: 0.962 +/- 0.014
Random Forest wins on both accuracy and stability (lower std). This pattern holds on almost every dataset you'll work with.
The Things Everyone Gets Wrong
Mistake 1: Using only 10 trees
Ten trees is not enough. The scikit-learn default was 10 until version 0.22 raised it to 100, and a lot of old tutorials still use the old value. Start at 100 minimum.
Mistake 2: Not setting n_jobs=-1
Training 100+ trees is slow on one core. Set n_jobs=-1 and use all your cores. Training time can drop by 4x to 8x.
Mistake 3: Thinking more trees always helps
After a certain point (usually 100 to 300), adding more trees doesn't improve accuracy. It just costs time and memory. Use the accuracy-vs-n-trees plot to find the plateau.
Mistake 4: Using feature importance to make final decisions blindly
Random Forest feature importance has a known bias toward features with many unique values (continuous features over categorical ones). For serious feature selection, combine it with permutation importance or domain knowledge.
# Permutation importance: more reliable but slower
from sklearn.inspection import permutation_importance
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
perm_df = pd.DataFrame({
    'Feature': data.feature_names,
    'Importance': result.importances_mean
}).sort_values('Importance', ascending=False)
print("Permutation Importance (top 5):")
print(perm_df.head().to_string(index=False))
Quick Cheat Sheet
| Task | Code |
|---|---|
| Train classifier | RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42) |
| Train regressor | RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42) |
| Feature importance | rf.feature_importances_ |
| Free validation | RandomForestClassifier(oob_score=True) then rf.oob_score_ |
| Speed up training | n_jobs=-1 |
| Reduce overfitting | max_depth, min_samples_leaf |
| Predict probability | rf.predict_proba(X_test) |
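The last cheat-sheet row, predict_proba, hasn't appeared above yet. Each tree contributes its leaf's class distribution and the forest averages them, so you get class probabilities instead of just hard labels (shown with the rf classifier trained earlier; exact numbers will vary):

# Averaged class probabilities across all trees, one row per sample
proba = rf.predict_proba(X_test.iloc[:3])
print(rf.classes_)   # column order of the probabilities: [0, 1]
print(proba)         # each row sums to 1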
Practice Challenges
Level 1:
Train a Random Forest on load_wine(). Compare accuracy to a single decision tree. Print the top 5 most important features.
Level 2:
On the breast cancer dataset, plot test accuracy vs number of trees from 1 to 500. Where does accuracy stop improving? Is it worth using 500 trees?
Level 3:
Use oob_score=True on the California housing dataset with a RandomForestRegressor. Compare the OOB R2 to the actual test R2. How close are they? Now try the same with only 20 trees. Does OOB become less reliable?
References
- Scikit-learn: RandomForestClassifier
- Scikit-learn: Random Forest
- StatQuest: Random Forests (YouTube)
- Permutation Importance docs
Next up, Post 59: XGBoost: The Algorithm That Wins Competitions. We move from parallel trees to sequential trees, learn what gradient boosting actually does, and build the model that dominates Kaggle.