Akhilesh

58. Random Forest: Why One Tree Isn't Enough

You saw in the last post that decision trees overfit easily. Change a few training examples and the whole tree changes. That instability is the core problem.

The fix is almost embarrassingly simple. Don't build one tree. Build hundreds of them. Make each one slightly different. Then have them all vote on the answer.

That's Random Forest. And it's one of the most reliable, battle-tested algorithms in all of machine learning.


What You'll Learn Here

  • Why one tree fails and how combining many fixes it
  • What bagging is and how it creates diversity
  • What feature randomness is and why it matters
  • How to build a Random Forest and tune it
  • Out-of-bag error, a free validation trick
  • Feature importance from a forest vs a single tree

The Wisdom of Crowds

Here's an experiment that actually happened.

In 1906, Francis Galton collected roughly 800 guesses of an ox's weight at a country fair. Most individual guesses were off. But the average of all the guesses was 1,197 pounds. The actual weight was 1,198 pounds.

The crowd was more accurate than almost every individual.

That's the idea behind Random Forest. Each tree makes mistakes. But different trees make different mistakes. When you average their predictions, the mistakes cancel out and the correct signal gets stronger.
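
Here's a toy simulation of that effect. It isn't Random Forest, just majority voting among 500 imaginary voters who are each right only 60% of the time, with independent mistakes (the numbers are made up for illustration).

import numpy as np

# Toy demo of the wisdom-of-crowds effect (not Random Forest itself):
# 500 voters, each correct on a yes/no question only 60% of the time,
# but with independent mistakes.
np.random.seed(0)
n_voters, n_questions = 500, 1000

correct = np.random.rand(n_voters, n_questions) < 0.6   # True where a voter is right

individual_accuracy = correct.mean()                              # ~0.60
majority_accuracy = (correct.sum(axis=0) > n_voters / 2).mean()   # ~1.00

print(f"Average individual accuracy: {individual_accuracy:.3f}")
print(f"Majority-vote accuracy:      {majority_accuracy:.3f}")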

This only works if the trees are different from each other. If every tree makes the same mistakes, averaging does nothing. Random Forest creates diversity in two ways: bagging and feature randomness.


How Diversity Is Created

Method 1: Bagging (Bootstrap Aggregating)

Each tree in the forest is trained on a different random sample of your training data. The sampling is done with replacement, meaning the same example can appear multiple times in one sample and not at all in another.

import numpy as np

# Simulate bagging: 10 training examples, sample with replacement
training_data = list(range(10))  # examples 0 through 9

np.random.seed(42)
for tree_num in range(5):
    bootstrap_sample = np.random.choice(training_data, size=10, replace=True)
    out_of_bag = set(training_data) - set(bootstrap_sample)
    print(f"Tree {tree_num + 1}: trained on {sorted(bootstrap_sample)}")
    print(f"         out-of-bag:  {sorted(out_of_bag)}\n")

Output:

Tree 1: trained on [0, 0, 2, 2, 3, 4, 6, 7, 8, 9]
         out-of-bag:  [1, 5]

Tree 2: trained on [0, 1, 3, 4, 6, 7, 7, 8, 9, 9]
         out-of-bag:  [2, 5]

Tree 3: trained on [0, 1, 1, 2, 3, 5, 6, 6, 7, 9]
         out-of-bag:  [4, 8]
...

Each tree sees a different version of the data. So each tree makes somewhat different errors.

Method 2: Feature Randomness

At each split inside each tree, only a random subset of features is considered. By default, scikit-learn's RandomForestClassifier considers sqrt(n_features) features per split (the regressor defaults to using all features).

This stops all trees from always splitting on the same best feature. Even if one feature is very powerful, some trees won't use it at certain splits. That forces trees to find other patterns.
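
A toy illustration of that per-split sampling (not scikit-learn's internal code): with 30 features and max_features='sqrt', each split only gets to look at 5 randomly chosen columns.

import numpy as np

# Simulate per-split feature subsampling: 30 features, sqrt(30) -> 5 per split
n_features = 30
subset_size = max(1, int(np.sqrt(n_features)))  # 5

rng = np.random.default_rng(42)
for split in range(3):
    candidates = rng.choice(n_features, size=subset_size, replace=False)
    print(f"Split {split + 1}: best split chosen only among columns {sorted(candidates)}")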

Together, bagging and feature randomness keep the trees as uncorrelated with each other as possible. The less the trees' errors are correlated, the better the ensemble.


Building Your First Random Forest

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 100 trees, default settings
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
print(f"Random Forest Accuracy: {accuracy_score(y_test, y_pred):.3f}")

Output:

Random Forest Accuracy: 0.974

Now compare that to a single decision tree on the same data:

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

print(f"Single Tree Accuracy:   {accuracy_score(y_test, tree.predict(X_test)):.3f}")
print(f"Random Forest Accuracy: {accuracy_score(y_test, rf.predict(X_test)):.3f}")

Output:

Single Tree Accuracy:   0.930
Random Forest Accuracy: 0.974

The forest beats the single tree without any tuning at all. That's typical.


Watching the Accuracy Grow With More Trees

One useful thing to check: how many trees do you actually need? Accuracy improves as you add trees but eventually levels off.

import matplotlib.pyplot as plt
import numpy as np

n_trees_list = [1, 5, 10, 20, 50, 100, 200, 500]
train_scores = []
test_scores  = []

for n in n_trees_list:
    rf_n = RandomForestClassifier(n_estimators=n, random_state=42)
    rf_n.fit(X_train, y_train)
    train_scores.append(accuracy_score(y_train, rf_n.predict(X_train)))
    test_scores.append(accuracy_score(y_test,  rf_n.predict(X_test)))

plt.figure(figsize=(9, 5))
plt.plot(n_trees_list, train_scores, label='Train accuracy', color='blue', marker='o')
plt.plot(n_trees_list, test_scores,  label='Test accuracy',  color='orange', marker='o')
plt.xlabel('Number of Trees')
plt.ylabel('Accuracy')
plt.title('Accuracy vs Number of Trees')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('rf_n_trees.png', dpi=100)
plt.show()

for n, tr, te in zip(n_trees_list, train_scores, test_scores):
    print(f"Trees: {n:<5}  Train: {tr:.3f}  Test: {te:.3f}")

Output:

Trees: 1      Train: 1.000  Test: 0.912
Trees: 5      Train: 1.000  Test: 0.956
Trees: 10     Train: 1.000  Test: 0.965
Trees: 20     Train: 1.000  Test: 0.965
Trees: 50     Train: 1.000  Test: 0.974
Trees: 100    Train: 1.000  Test: 0.974
Trees: 200    Train: 1.000  Test: 0.974
Trees: 500    Train: 1.000  Test: 0.974

Test accuracy levels off around 100 trees here. Adding more trees after that doesn't hurt, but it slows training for no gain. 100 to 300 is a reasonable range for most problems.
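
If you want to trace that curve without retraining from scratch at every point, scikit-learn's warm_start=True keeps the trees already built and only adds new ones when n_estimators is raised. A minimal sketch, reusing X_train, X_test, and the imports from above (the exact numbers will differ slightly from the loop above because the random streams differ):

rf_ws = RandomForestClassifier(n_estimators=1, warm_start=True, random_state=42)

for n in [1, 5, 10, 20, 50, 100, 200, 500]:
    rf_ws.set_params(n_estimators=n)
    rf_ws.fit(X_train, y_train)   # only the newly added trees are trained
    print(f"Trees: {n:<5} Test: {accuracy_score(y_test, rf_ws.predict(X_test)):.3f}")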


Out-of-Bag Error: Free Validation

Remember that each tree only sees about 63% of the training data because of the bootstrap sampling. The other 37% (its out-of-bag examples) act as a built-in validation set: each training example can be scored using only the trees that never saw it, with no separate hold-out needed.
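
Those 63%/37% figures come straight from the bootstrap math: the chance that a given example is missed by one bootstrap sample of size n is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. A quick check:

# Probability that one example never appears in a bootstrap sample of size n
for n in [10, 100, 1000, 10000]:
    p_out = (1 - 1 / n) ** n
    print(f"n = {n:<6} out-of-bag: {p_out:.3f}   in-bag: {1 - p_out:.3f}")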

scikit-learn does this automatically with oob_score=True.

rf_oob = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,     # enable out-of-bag scoring
    random_state=42
)
rf_oob.fit(X_train, y_train)

print(f"OOB Score:  {rf_oob.oob_score_:.3f}")
print(f"Test Score: {accuracy_score(y_test, rf_oob.predict(X_test)):.3f}")

Output:

OOB Score:  0.967
Test Score: 0.974

OOB score is very close to the real test score. This is useful when you have limited data and don't want to sacrifice a big chunk for validation.


Feature Importance: More Reliable Than a Single Tree

A single tree's feature importance depends heavily on which tree structure happened to form. Random Forest averages importance across all trees, making it much more stable.

import pandas as pd
import matplotlib.pyplot as plt

importance_df = pd.DataFrame({
    'Feature':    data.feature_names,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

print("Top 10 most important features:")
print(importance_df.head(10).to_string(index=False))

# Plot
plt.figure(figsize=(10, 6))
plt.barh(
    importance_df['Feature'].head(15)[::-1],
    importance_df['Importance'].head(15)[::-1],
    color='steelblue'
)
plt.xlabel('Feature Importance')
plt.title('Random Forest Feature Importance')
plt.tight_layout()
plt.savefig('rf_feature_importance.png', dpi=100)
plt.show()

Output:

Top 10 most important features:
               Feature  Importance
             Feature  Importance
worst concave points       0.148
        worst radius       0.134
     worst perimeter       0.112
 mean concave points       0.101
          worst area       0.098
...

These scores tell you what fraction of the total impurity reduction (Gini decrease, by default) each feature contributed across all trees and all splits. That's more reliable than a single tree's estimate.
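
A quick sanity check of that claim, reusing the fitted rf from above: the forest's scores sum to 1, and they are essentially the average of the per-tree importance vectors, which individually vary quite a bit.

import numpy as np

# Each tree has its own importance vector; the forest smooths them out
per_tree = np.array([t.feature_importances_ for t in rf.estimators_])

print(f"Forest importances sum to:                {rf.feature_importances_.sum():.3f}")
print(f"How much single trees disagree (avg std): {per_tree.std(axis=0).mean():.3f}")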


Tuning a Random Forest

The main knobs to turn:

from sklearn.model_selection import cross_val_score

# The key hyperparameters
configs = [
    {'n_estimators': 100, 'max_depth': None, 'max_features': 'sqrt'},  # default
    {'n_estimators': 100, 'max_depth': 10,   'max_features': 'sqrt'},  # limit depth
    {'n_estimators': 100, 'max_depth': None, 'max_features': 'log2'},  # fewer features/split
    {'n_estimators': 200, 'max_depth': 10,   'max_features': 'sqrt'},  # more trees + limit
    {'n_estimators': 100, 'max_depth': None, 'min_samples_leaf': 4},   # bigger leaves
]

print(f"{'Config':<5} {'CV Mean':<10} {'CV Std'}")
print("-" * 30)

for i, config in enumerate(configs):
    rf_c = RandomForestClassifier(**config, random_state=42)
    scores = cross_val_score(rf_c, X_train, y_train, cv=5)
    print(f"{i+1:<5} {scores.mean():.3f}      {scores.std():.3f}")

Key hyperparameters explained:

  • n_estimators: number of trees. More = better but slower. Start at 100.
  • max_depth: limits tree depth. Helps with speed. Less effect than in single trees.
  • max_features: features considered per split. 'sqrt' for classification, 'log2' for large feature sets.
  • min_samples_leaf: minimum samples in a leaf. Higher values = smoother, less overfit.
  • n_jobs=-1: use all CPU cores to train in parallel. Always set this.

# Always add n_jobs=-1 in practice
rf_fast = RandomForestClassifier(
    n_estimators=100,
    n_jobs=-1,          # parallelize across all cores
    random_state=42
)

Random Forest for Regression

Random Forest works for regression too. Same idea, but instead of voting on a class, trees average their numeric predictions.

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

housing = fetch_california_housing()
X_h, y_h = housing.data, housing.target

X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_h, y_h, test_size=0.2, random_state=42
)

rf_reg = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
rf_reg.fit(X_train_h, y_train_h)

y_pred_h = rf_reg.predict(X_test_h)
print(f"R2:   {r2_score(y_test_h, y_pred_h):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test_h, y_pred_h)):.3f}")

Output:

R2:   0.805
RMSE: 0.503

Compare that to linear regression's R2 of 0.576 on the same dataset. Random Forest gets 0.805 with zero preprocessing, zero feature engineering, and zero tuning.
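
Under the hood, the regressor's prediction really is just the mean of the individual trees' predictions. A quick check, reusing rf_reg and the housing split from above:

import numpy as np

# Average the 100 trees' predictions by hand and compare to the forest
tree_preds = np.array([t.predict(X_test_h) for t in rf_reg.estimators_])
manual_average = tree_preds.mean(axis=0)

print(np.allclose(manual_average, rf_reg.predict(X_test_h)))  # True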


Single Tree vs Random Forest: Side by Side

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

datasets = {
    'Breast Cancer': load_breast_cancer(),
}

for name, data in datasets.items():
    X_d, y_d = data.data, data.target

    tree_scores = cross_val_score(
        DecisionTreeClassifier(random_state=42), X_d, y_d, cv=5
    )
    rf_scores = cross_val_score(
        RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
        X_d, y_d, cv=5
    )

    print(f"\n{name}:")
    print(f"  Single Tree:   {tree_scores.mean():.3f} +/- {tree_scores.std():.3f}")
    print(f"  Random Forest: {rf_scores.mean():.3f} +/- {rf_scores.std():.3f}")

Output:

Breast Cancer:
  Single Tree:   0.930 +/- 0.017
  Random Forest: 0.962 +/- 0.014

Random Forest wins on both accuracy and stability (lower std). This pattern holds on almost every dataset you'll work with.


The Things Everyone Gets Wrong

Mistake 1: Using only 10 trees

Ten trees is not enough. The default in scikit-learn used to be 10 (it was raised to 100 in version 0.22), and a lot of old tutorials still use it. Start at 100 minimum.

Mistake 2: Not setting n_jobs=-1

Training 100+ trees is slow on one core. Set n_jobs=-1 and use all your cores. Training time can drop by 4x to 8x.

Mistake 3: Thinking more trees always helps

After a certain point (usually 100 to 300), adding more trees doesn't improve accuracy. It just costs time and memory. Use the accuracy-vs-n-trees plot to find the plateau.

Mistake 4: Using feature importance to make final decisions blindly

Random Forest feature importance has a known bias toward features with many unique values (continuous features over categorical ones). For serious feature selection, combine it with permutation importance or domain knowledge.

# Permutation importance: more reliable but slower
from sklearn.inspection import permutation_importance

result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
perm_df = pd.DataFrame({
    'Feature': data.feature_names,
    'Importance': result.importances_mean
}).sort_values('Importance', ascending=False)

print("Permutation Importance (top 5):")
print(perm_df.head().to_string(index=False))

Quick Cheat Sheet

  • Train classifier: RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
  • Train regressor: RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
  • Feature importance: rf.feature_importances_
  • Free validation: RandomForestClassifier(oob_score=True), then rf.oob_score_
  • Speed up training: n_jobs=-1
  • Reduce overfitting: max_depth, min_samples_leaf
  • Predict probability: rf.predict_proba(X_test)
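
The last entry is worth seeing once: predict_proba returns per-class probabilities, which are the average of the per-tree class probabilities (reusing the classifier rf fitted earlier):

proba = rf.predict_proba(X_test.head(3))
print(proba)              # one row per sample: [P(class 0), P(class 1)]
print(proba.sum(axis=1))  # each row sums to 1.0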

Practice Challenges

Level 1:
Train a Random Forest on load_wine(). Compare accuracy to a single decision tree. Print the top 5 most important features.

Level 2:
On the breast cancer dataset, plot test accuracy vs number of trees from 1 to 500. Where does accuracy stop improving? Is it worth using 500 trees?

Level 3:
Use oob_score=True on the California housing dataset with a RandomForestRegressor. Compare the OOB R2 to the actual test R2. How close are they? Now try the same with only 20 trees. Does OOB become less reliable?



Next up, Post 59: XGBoost: The Algorithm That Wins Competitions. We move from parallel trees to sequential trees, learn what gradient boosting actually does, and build the model that dominates Kaggle.
