
hqqqqy

Posted on • Originally published at mathisimple.com

Entropy and Information Gain in Decision Trees: A Practical Guide


If the "lemon sorting" analogy helped you understand what decision trees do, this article explains how they decide which feature to split on.

The secret lies in two concepts: Entropy and Information Gain.


🌐 This is a cross-post from my interactive tutorial site mathisimple.com, where you can adjust class distributions and instantly see how entropy and information gain change.


What Is Entropy?

In information theory, entropy measures uncertainty or disorder.

  • If all fruits in a group are oranges → entropy = 0 (no uncertainty)
  • If half are oranges and half are lemons → entropy is maximum

The formula for entropy in a binary classification problem is:

H(S) = -p_1 \log_2(p_1) - p_2 \log_2(p_2)

where \( p_1 \) and \( p_2 \) are the proportions of the two classes.
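To make the formula concrete, here is a minimal sketch of binary entropy in Python (the function name and the zero-probability convention comment are mine, not from the original):

```python
import math

def entropy(p1: float) -> float:
    """Binary entropy H(S) = -p1*log2(p1) - p2*log2(p2), with p2 = 1 - p1.
    By convention, 0 * log2(0) is treated as 0 (a term with zero
    probability contributes nothing)."""
    p2 = 1.0 - p1
    return -sum(p * math.log2(p) for p in (p1, p2) if p > 0)

print(entropy(0.5))  # half oranges, half lemons -> 1.0 (maximum uncertainty)
print(entropy(1.0))  # all one class -> no uncertainty
```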

Information Gain = Reduction in Entropy

Information Gain tells us how much uncertainty we remove by asking a particular question.

\text{IG} = \text{Entropy(parent)} - \sum_i \frac{N_i}{N} \times \text{Entropy(child}_i)

The algorithm chooses the split with the highest Information Gain.
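Expressed over class counts, the split-selection rule might look like this (a sketch under my own naming, not code from the post):

```python
import math

def entropy(counts):
    """Entropy of a node from its class counts, e.g. [2, 10]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """IG = Entropy(parent) - sum_i (N_i / N) * Entropy(child_i)."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# A perfect split removes all uncertainty:
print(information_gain([5, 5], [[5, 0], [0, 5]]))  # 1.0
```

In practice the algorithm would evaluate `information_gain` for every candidate split and keep the one with the largest value.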

Worked Example: Lemon Sorting

Initial group: 10 oranges, 10 lemons. Entropy = 1.0 (maximum uncertainty)

Split by "Color":

  • Yellow group (12 fruits): 2 oranges, 10 lemons → Entropy ≈ 0.65
  • Not yellow group (8 fruits): 8 oranges, 0 lemons → Entropy = 0

Weighted average entropy after split: 0.39

Information Gain: 1.0 - 0.39 = 0.61
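The numbers above can be checked with a short self-contained script (the helper `H` is my shorthand for binary entropy, not part of the original article):

```python
import math

def H(p):
    """Binary entropy of a node where one class has proportion p."""
    return 0.0 if p in (0, 1) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

parent = H(10 / 20)                      # 10 oranges, 10 lemons
yellow = H(2 / 12)                       # 2 oranges, 10 lemons
not_yellow = H(8 / 8)                    # 8 oranges, 0 lemons
weighted = (12 / 20) * yellow + (8 / 20) * not_yellow
gain = parent - weighted
print(round(yellow, 2), round(weighted, 2), round(gain, 2))  # 0.65 0.39 0.61
```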

Split by "Shape":

Running the same calculation for a split by shape yields an Information Gain of only 0.15.

Clearly, "Color" is the much better first question.

Why This Math Matters

  • It gives us a rigorous, mathematical way to quantify "how useful is this feature?"
  • It naturally prefers features that create very pure subgroups
  • The same split-selection idea extends to regression, where variance reduction takes the place of Information Gain

Common Misconceptions

  • Higher entropy doesn't always mean "bad" — it means "high uncertainty"
  • Information Gain tends to favor features with more categories (this is why we sometimes use Gain Ratio)
  • A feature with high Information Gain early in the tree is usually very important
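The Gain Ratio correction mentioned above divides Information Gain by the entropy of the split sizes themselves, which penalizes splits that fragment the data into many small children. A rough sketch following the C4.5-style definition (function names are mine):

```python
import math

def entropy(counts):
    """Entropy of a node from its class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(parent, children):
    """Gain Ratio = Information Gain / Split Information, where Split
    Information is the entropy of the child-node sizes themselves."""
    n = sum(parent)
    weighted = sum(sum(c) / n * entropy(c) for c in children)
    gain = entropy(parent) - weighted
    split_info = entropy([sum(c) for c in children])
    return gain / split_info if split_info > 0 else 0.0

# The color split from the worked example:
print(round(gain_ratio([10, 10], [[2, 10], [8, 0]]), 2))  # 0.63
```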

Interactive Entropy Explorer

On mathisimple.com you can:

  • Drag sliders to change class proportions in parent and child nodes
  • Instantly see entropy values and information gain update
  • Try different split scenarios to build intuition
  • Compare Information Gain vs Gini impurity
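For readers curious about that last comparison: Gini impurity for a binary node is \( 2p(1-p) \), and both measures peak at a 50/50 split and vanish at pure nodes. A quick sketch of the two side by side (my own helper functions):

```python
import math

def entropy(p):
    """Binary entropy; 0 at pure nodes, 1.0 at a 50/50 split."""
    return 0.0 if p in (0, 1) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(p):
    """Binary Gini impurity: 1 - p^2 - (1-p)^2 = 2p(1-p); peaks at 0.5."""
    return 2 * p * (1 - p)

for p in (0.0, 0.1, 0.25, 0.5):
    print(f"p={p:.2f}  entropy={entropy(p):.3f}  gini={gini(p):.3f}")
```

The two curves rank most candidate splits the same way, which is why trees built with either criterion often look very similar.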

👉 Explore entropy and information gain interactively


This article pairs perfectly with the previous "lemon analogy" piece. Next, we'll dive into Gini Index and how CART decision trees actually choose splits in practice.

Understanding these concepts removes much of the mystery behind why decision trees make the decisions they do.

