Pavan Pothuganti

Posted on Jul 4

If Bagging Already Uses 100 Trees, Why Was Random Forest Invented?

#machinelearning #bagging #randomforest

After finally understanding Bagging, I thought I was done with ensemble learning.

The idea made sense.

Take multiple bootstrap samples.

Train multiple Decision Trees.

Combine their predictions using majority voting (or averaging, for regression).

Variance decreases.

Simple.
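The steps above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `bootstrap_sample` and `majority_vote` are made up for this post, the "rows" are just integers standing in for training examples, and the tree training itself is left out.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the most common label among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))  # stand-in for 10 training rows

# One bootstrap sample per tree; each tree would be trained on its own sample.
samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for sample in samples:
    print(sorted(sample))  # same size as the original, drawn with replacement

# Pretend three trained trees predicted these labels for one input.
print(majority_vote(["yes", "no", "yes"]))  # yes
```

Because the sampling is done with replacement, each tree usually sees some rows more than once and misses others entirely.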

Then I came across another algorithm:

Random Forest.

My first reaction was honest:

"Wait... isn't this just Bagging with a fancy name?"

It turns out the answer is no.

And the reason is surprisingly interesting.

Bagging Solves One Problem

Bagging makes Decision Trees more stable.

Each tree is trained on a different bootstrap sample, so the trees don't all see exactly the same data.

That reduces variance.

At this point, I assumed every tree would become completely different.

But that's not always true.

Different Data Doesn't Mean Different Thinking

Imagine you're building a model to predict house prices.

Your features are:

Location
Area
Number of bedrooms
Age of the house
Parking

Suppose Area is by far the strongest predictor.

Even though every tree receives a different bootstrap dataset, most of them will still discover the same thing:

"Area is the best feature to split on."

So what happens?

Tree after tree starts with the same root node.

Many of them grow in very similar ways.

They are trained on different data, but they still think alike.
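Here is a small experiment that shows the effect, as a sketch under stated assumptions. The house data is synthetic and deliberately built so that area dominates the price. The function names (`make_houses`, `best_root_feature`) are invented for this example, and the "tree" is reduced to its root split, chosen by the largest drop in squared error.

```python
import random
from collections import Counter

FEATURES = ["area", "bedrooms", "age"]

def sse(values):
    """Sum of squared errors around the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_root_feature(rows):
    """Feature whose best threshold split reduces squared error the most."""
    total = sse([r["price"] for r in rows])
    best_name, best_gain = None, -1.0
    for name in FEATURES:
        for threshold in sorted({r[name] for r in rows}):
            left = [r["price"] for r in rows if r[name] <= threshold]
            right = [r["price"] for r in rows if r[name] > threshold]
            gain = total - sse(left) - sse(right)
            if gain > best_gain:
                best_name, best_gain = name, gain
    return best_name

def make_houses(rng, n=80):
    """Synthetic data in which area dominates price (an assumption)."""
    rows = []
    for _ in range(n):
        area = rng.randint(50, 250)
        bedrooms = rng.randint(1, 5)
        age = rng.randint(0, 50)
        price = 3.0 * area + 5.0 * bedrooms - 0.5 * age + rng.gauss(0, 20)
        rows.append({"area": area, "bedrooms": bedrooms, "age": age, "price": price})
    return rows

rng = random.Random(42)
houses = make_houses(rng)

root_counts = Counter()
for _ in range(50):
    sample = [rng.choice(houses) for _ in houses]  # bootstrap sample
    root_counts[best_root_feature(sample)] += 1

print(root_counts)  # which feature each of the 50 "trees" picked as its root
```

In a setup like this, area should win the root split in nearly every bootstrap sample, even though no two samples are the same.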

Why Is That a Problem?

Imagine asking 100 people the same question.

If every person has exactly the same information and thinks in exactly the same way, you'll probably hear the same answer 100 times.

Even if that answer is wrong.

Now compare that with asking 100 experts from different backgrounds.

One notices something others missed.

Another approaches the problem differently.

The diversity of opinions often leads to a better final decision.

Random Forest tries to create that diversity.

The Extra Randomness

Bagging changes the rows.

Random Forest changes both the rows and the features.

Instead of letting each tree consider every feature at every split, Random Forest considers only a random subset of features each time a split is made.

Now imagine the strongest feature isn't available for a particular split.

The tree is forced to explore another path.

One tree may begin with Area.

Another may begin with Location.

Another may start with Age.

The trees become less similar.

And that's exactly what we want.
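A toy sketch of that idea follows. The split scores are hypothetical fixed numbers chosen so that Area always looks best. In a real tree the score of each feature depends on the data at that node, so treat this purely as an illustration of the candidate-subset step. `choose_split` and `max_features` are names picked for this example.

```python
import random
from collections import Counter

FEATURES = ["Location", "Area", "Bedrooms", "Age", "Parking"]

# Hypothetical split scores (higher is better). Area dominates by design.
SCORE = {"Location": 0.6, "Area": 0.9, "Bedrooms": 0.4, "Age": 0.5, "Parking": 0.2}

def choose_split(rng, max_features):
    """Pick the best-scoring feature among a random subset of candidates."""
    candidates = rng.sample(FEATURES, max_features)
    return max(candidates, key=SCORE.get)

rng = random.Random(0)

# Bagging: every feature is a candidate, so the root is always the same.
bagging_roots = Counter(choose_split(rng, len(FEATURES)) for _ in range(100))

# Random Forest style: only 2 random candidates per split.
forest_roots = Counter(choose_split(rng, 2) for _ in range(100))

print(bagging_roots)  # Counter({'Area': 100})
print(forest_roots)   # a mix of root features
```

With all five features available, Area wins every time. With only two random candidates, Area is sometimes missing from the draw, and another feature gets its turn at the root.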

Why Diversity Matters

If every tree makes the same mistake, majority voting cannot help.

If different trees make different mistakes, majority voting becomes much more powerful.
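A quick calculation makes the contrast concrete. It assumes an idealized case: a binary classification task where every tree is correct with the same probability and the trees err independently of each other. Real trees are never fully independent, so this is an upper bound on what voting can buy, not a prediction. `majority_accuracy` is a helper written for this post.

```python
from math import comb

def majority_accuracy(n_trees, p_correct):
    """Probability that a strict majority of n independent voters is correct."""
    need = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        for k in range(need, n_trees + 1)
    )

p = 0.6  # each tree alone is right 60% of the time (an assumed value)

# If all trees make identical predictions, the vote is as good as one tree.
print(majority_accuracy(1, p))

# If 101 trees err independently, the vote is right far more often.
print(majority_accuracy(101, p))
```

An odd number of trees is used so the vote can never tie.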

Random Forest doesn't just build more trees.

It builds trees that are less correlated with each other.

That small difference is what makes the algorithm so effective.

The Lesson That Changed My Perspective

For a long time, I thought Random Forest was simply:

"Bagging + a random trick."

Now I see it differently.

Bagging asks:

"How do we make Decision Trees more stable?"

Random Forest asks:

"How do we make those trees think differently?"

Those are two completely different questions.

And that's why Random Forest usually performs better than plain Bagging with Decision Trees.
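For averaged predictions (the regression case), the two questions map onto a standard variance identity. Assume, as a simplification, that each of the B trees has prediction variance σ² at a given input and that every pair of trees has the same correlation ρ. Then:

```latex
\[
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
\]
```

Adding more trees (larger B) only shrinks the second term. The first term, ρσ², stays no matter how many trees are added, and it can only be reduced by making the trees less correlated, which is what the random feature subsets aim for.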

Key Takeaway

Bagging creates multiple Decision Trees using different bootstrap samples of the same dataset.

Random Forest goes one step further by limiting each split to a random subset of features, so the trees don't all rely on the same ones.

It's not about creating more trees.

It's about creating more diverse trees.

Sometimes, diversity is more valuable than quantity.
