Smit

Random Forest Explained: Why It’s More Than Just a Bunch of Trees

Introduction

In the vast forest of machine learning models, Random Forest stands out. But what’s the big deal? And seriously, what’s with that name? Did someone go hiking, see some trees 🌳, and think, “Yep, this is definitely an algorithm”? Probably not.

🔍 Let’s explore why Random Forest is such a powerful and widely used algorithm, and where the name actually comes from.

What is the Random Forest algorithm?

• A simplified way to understand Random Forest is this: imagine you're making a big decision, like choosing which college to attend. Instead of asking just one person for advice, you ask 100 random people. Each gives their opinion, and you go with the majority vote. That, in essence, is how Random Forest works.

• Random Forest builds a collection (or forest) of decision trees. Each tree is trained on a different random subset of the data, sampled with replacement; this process is known as bootstrapping. Every tree learns something a little different because of these variations.

• Each tree then makes its own prediction, and the forest combines all these predictions by taking a majority vote (for classification) or averaging (for regression) to produce the final output.

• Now, each tree is different because it's trained on a different subset of data. Each tree sees different features and samples.

• At every decision point (called a split), each tree only looks at a random subset of features, not all of them. This *randomness* ensures that trees are diverse and less correlated.

• Finally, once all trees have voted, Random Forest aggregates those votes to make a robust final prediction. A minimal sketch of this process follows below.
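
To make that concrete, here is a minimal sketch in plain NumPy of the two ideas above: bootstrapping (each tree trains on a sample drawn with replacement) and majority voting. The labels and the tree votes here are made up purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Toy labels for 10 training samples
y = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])

# Bootstrapping: each "tree" trains on indices sampled WITH replacement,
# so every tree sees a slightly different version of the data
n_trees = 5
for i in range(n_trees):
    idx = rng.integers(0, len(y), size=len(y))
    print(f"Tree {i} trains on samples: {sorted(idx.tolist())}")

# Majority vote: suppose the 5 trees predicted these classes for one input
tree_votes = np.array([1, 0, 1, 1, 1])
final = np.bincount(tree_votes).argmax()  # most frequent vote wins
print("Final prediction:", final)  # -> 1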

The Voting Mechanism Behind Random Forest
Photo reference: https://builtin.com/data-science/random-forest-python

Key Equations for Random Forest:

1. 🌳 Random Forest equation (classification)

ŷ = mode{ h_1(x), h_2(x), …, h_T(x) }

  • h_i(x) is the prediction from the i-th decision tree
  • ŷ is the final predicted class: the most common vote among all T trees

2. 📊 Random Forest equation (regression)

ŷ = (1/T) · Σ_{i=1}^{T} h_i(x)

  • ŷ is the final prediction: the average across all trees, where T is the total number of trees
  • each h_i(x) is the numerical prediction from the i-th decision tree

The image and equations illustrate how the Random Forest algorithm works by making predictions from individual trees, then selecting the majority vote (for classification) or averaging the outputs (for regression) to make a final decision.
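
Here is a tiny worked example of both equations, using hypothetical tree outputs for a single input x:

import numpy as np

# Hypothetical predictions h_i(x) from T = 5 trees

# Classification: each tree outputs a class label; ŷ is the mode
class_preds = np.array([1, 0, 1, 1, 0])
y_hat_class = np.bincount(class_preds).argmax()
print(y_hat_class)  # 1, because three of the five trees voted "1"

# Regression: each tree outputs a number; ŷ is the average
reg_preds = np.array([210.0, 195.0, 205.0, 220.0, 200.0])
y_hat_reg = reg_preds.mean()  # (1/T) * sum of h_i(x)
print(y_hat_reg)  # 206.0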

Why is it so famous?

Random Forest is famous for a few reasons:

  1. 🛡️ Resistant to overfitting
  • Single decision trees often perform well on training data, but their performance tends to drop when faced with new, unseen data.
  • Random Forest solves this problem by averaging the predictions of multiple trees, which helps smooth out noise and leads to more balanced and robust predictions.

2. 🧪 Works well without fine-tuning:

  • Many machine learning algorithms require extensive fine-tuning to perform well. However, Random Forest performs strongly even with its default settings, without much need for parameter tuning.
  • This makes it a popular choice among beginners who need quick and accurate results. A quick sketch of this out-of-the-box behavior follows the list below.

  • Works with all kinds of data:
    📊 Numerical data
    🧩 Data with lots of features
    ❓ Missing values
    🏷️ Categorical data
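
As a quick illustration of the "no tuning needed" point, this sketch cross-validates a RandomForestClassifier with every hyperparameter left at its scikit-learn default, on a synthetic dataset like the one used later in this post. The exact score will vary; the point is that it is already respectable with zero tuning.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=2000, n_features=5,
                           n_informative=3, random_state=0)

# No tuning at all: every hyperparameter stays at its default
clf = RandomForestClassifier(random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(f"Mean accuracy with defaults: {scores.mean():.3f}")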

3. 🌳 Feature importance matters:

  • Most importantly, Random Forest identifies which features matter and which don’t.
  • This helps us better understand the model by providing insights into the variables that influence predictions.
  • It makes it easier to explain results to non-technical people.
  • It also allows for easy modification of which variables to include or exclude in future models (see the feature-importance sketch below).
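
In scikit-learn, this comes from a fitted model's feature_importances_ attribute. Here is a small sketch; the feature names are hypothetical labels for the synthetic columns, chosen to match the loan example later in this post.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=5,
                           n_informative=3, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# feature_importances_ sums to 1.0; larger values mean more influence
names = ["income", "credit_score", "employment", "loan_amount", "interest_rate"]
for name, score in sorted(zip(names, clf.feature_importances_),
                          key=lambda pair: -pair[1]):
    print(f"{name:>15}: {score:.3f}")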

Random Forest in action

  • Now that we know why Random Forest is so famous and how it got its name, let’s see it in action with a simple code snippet.
  • Imagine we want to build a model to predict whether someone will be approved for a house loan by a bank. Features include Income, Credit Score, Employment Status, Loan Amount, Interest Rate, and previous credit card history.
  • Training a single large decision tree often leads to overfitting. What makes Random Forest unique is that it creates many smaller decision trees, each looking at different parts of the data.

Python Code snippet:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Sample dataset that simulates loan applications
X, y = make_classification(
    n_samples=2000,        # 2,000 loan applicants
    n_features=5,          # 5 features: credit score, income, etc.
    n_informative=3,       # only 3 features really affect the outcome
    n_redundant=0,
    random_state=0,
    shuffle=False)

# Create a Random Forest model with small trees
clf = RandomForestClassifier(max_depth=2, random_state=0)

# Train the model
clf.fit(X, y)

# Predict loan approval for a new applicant (one value per feature)
print(clf.predict([[0, 0, 0, 0, 0]]))  # Output: [1] (Approved)

Reference: scikit-learn RandomForestClassifier documentation

Steps the algorithm follows:

  • First, Random Forest randomly picks samples (with replacement) from the list of loan applicants.
  • Then, it builds a decision tree on each sample.
  • This process is repeated many times to create multiple decision trees.
  • When it’s time to decide whether to approve a loan for a person, each tree makes its own decision: “Yes” or “No.”
  • The Random Forest algorithm collects all the votes from the trees and chooses the majority vote. If most trees say “Yes,” the loan is approved; otherwise, it is denied.

That’s it, in simple terms: these are the steps the Random Forest algorithm follows.
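
If you want to see those steps spelled out, here is a rough hand-rolled version built from plain decision trees. It is a simplification: scikit-learn's real RandomForestClassifier also randomizes the features considered at each split, which this sketch leaves out.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=5,
                           n_informative=3, random_state=0)
rng = np.random.default_rng(0)

# Steps 1-3: repeatedly draw a bootstrap sample and fit a tree on it
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
    trees.append(DecisionTreeClassifier(max_depth=2).fit(X[idx], y[idx]))

# Steps 4-5: every tree votes on a new applicant; the majority wins
applicant = np.zeros((1, 5))
votes = np.array([tree.predict(applicant)[0] for tree in trees])
print("Decision:", "Approved" if votes.mean() > 0.5 else "Denied")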

Pros and Cons of the Random Forest algorithm

Pros
  • The best part of the Random Forest algorithm is that it smooths out noise and creates balance in decision-making by considering the majority vote.
  • If one tree makes a mistake, the other trees have a chance to correct it by covering different edge cases.
  • Each tree is trained on different data, which makes the overall model more accurate.
  • It also identifies which features matter to the model and which don’t. This is very important for fine-tuning the model for future use cases.
Cons
  • Slow process: Having many trees making individual decisions slows down the overall process, which is not ideal for real-time systems (some mitigating settings are sketched after this list).
  • Difficult to explain: Random Forest is harder to interpret compared to simpler models like single decision trees or linear regression.
  • High memory usage: It can consume a lot of memory since it depends on the number of trees, each trained on different data subsets, making it computationally expensive.
  • Struggles with sparse data: Random Forest does not perform well on high dimensional, sparse data such as in text classification tasks.
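
Some of these costs can be softened with scikit-learn settings. Here is a sketch of the usual knobs; the values are illustrative, not recommendations.

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100,  # fewer trees -> faster predictions, less memory
    max_depth=10,      # shallower trees are smaller and quicker to evaluate
    n_jobs=-1,         # train and predict on all CPU cores in parallel
    random_state=0)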

Real life impact:

The best thing about the Random Forest algorithm is its impact on many real-world decisions. Here are some of the top real-life use cases:

  • 🏦 Finance: Used to approve loans by identifying risky applicants based on patterns in historical data.
  • 🧬 Healthcare: Widely used to diagnose diseases like cancer and diabetes, predict outcomes, and identify key risk factors.
  • 🛒 Retail: Helps analyze customer behavior and make recommendations—such as predicting when a customer will buy next or when they might stop shopping.
  • 🚗 Transportation: Predicts delays, forecasts demand fluctuations, and helps improve service and convenience for customers.

It’s fascinating how such a straightforward algorithm can have a profound impact across industries, enhancing our daily lives and positively contributing to society.

Follow along as I continue to explore and explain the fascinating world of AI and machine learning in simple terms.
