Sachin Kr. Rajput

Information Gain & Entropy: The Game Show Host Who Learned to Ask Perfect Questions

The One-Line Summary: Entropy measures how "uncertain" or "mixed" a set is (high entropy = lots of uncertainty), and Information Gain measures how much a question reduces that uncertainty — so decision trees choose questions with the highest information gain to learn as efficiently as possible.


The Tale of Two Game Show Hosts

The hit TV show "Guess the Animal" had a simple format: contestants asked yes/no questions to identify a mystery animal from a list of 8 possibilities.

The show had two hosts with very different strategies.


Host #1: Random Randy

Randy asked whatever questions popped into his head:

THE ANIMALS: Dog, Cat, Eagle, Penguin, Shark, Goldfish, Snake, Frog

RANDY'S GAME (Mystery Animal: Snake)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Q1: "Does it have fur?"
A:  NO
    Remaining: Eagle, Penguin, Shark, Goldfish, Snake, Frog
    (Eliminated only 2 of 8 = 25%)

Q2: "Can it fly?"
A:  NO
    Remaining: Penguin, Shark, Goldfish, Snake, Frog
    (Eliminated only 1 more)

Q3: "Does it live in water?"
A:  NO
    Remaining: Penguin, Snake, Frog
    (Hmm, penguin is tricky...)

Q4: "Is it a reptile?"
A:  YES
    Remaining: Snake
    (Finally!)

Randy needed 4 questions. Not terrible, but not great.

Host #2: Efficient Emma

Emma had studied information theory. She asked questions strategically:

EMMA'S GAME (Mystery Animal: Snake)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

THE ANIMALS: Dog, Cat, Eagle, Penguin, Shark, Goldfish, Snake, Frog

Q1: "Is it warm-blooded?"
A:  NO
    YES group: Dog, Cat, Eagle, Penguin (4 animals)
    NO group:  Shark, Goldfish, Snake, Frog (4 animals) ✓

    PERFECT 50-50 SPLIT! Eliminated exactly half.

Q2: "Does it have fins?"
A:  NO  
    YES group: Shark, Goldfish (2 animals)
    NO group:  Snake, Frog (2 animals) ✓

    PERFECT 50-50 SPLIT again! Eliminated half of remaining.

Q3: "Does it have legs?"
A:  NO
    YES group: Frog (1 animal)
    NO group:  Snake (1 animal) ✓

    FOUND IT! Snake has no legs.

Emma needed only 3 questions!

The Secret: Emma's Questions Split Evenly

EMMA'S STRATEGY:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Each question splits the remaining options as close 
to 50-50 as possible.

8 animals → 4 animals → 2 animals → 1 animal
    ↓           ↓           ↓
  Q1 (÷2)    Q2 (÷2)     Q3 (÷2)

With perfect splits: log₂(8) = 3 questions needed!


RANDY'S PROBLEM:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

His questions had uneven splits:

"Does it have fur?"
  YES: 2 animals (Dog, Cat)      = 25%
  NO:  6 animals (everything else) = 75%

This is a BAD question! A YES answer would eliminate 75%,
but YES happens only 25% of the time. The likely answer is
NO, which eliminates just 25%.

A 50-50 split GUARANTEES you eliminate 50% every time.

This is exactly what ENTROPY and INFORMATION GAIN measure!
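Before formalizing this, here is a minimal sketch (my own illustration, not part of the show) of why the even split wins on average: if a fraction q of the n remaining candidates would answer YES, the expected number of candidates left after the answer is n(q² + (1-q)²), which is smallest at q = 0.5.

def expected_remaining(n, q):
    """Expected candidates left after one yes/no question.

    The mystery item is uniformly random among n candidates, and a
    fraction q of them would answer YES, so:
    E[remaining] = q*(q*n) + (1-q)*((1-q)*n) = n*(q**2 + (1-q)**2)
    """
    return n * (q**2 + (1 - q)**2)

print(f"Emma  (4 YES / 4 NO, q=0.50): {expected_remaining(8, 0.50):.1f} animals left on average")
print(f"Randy (2 YES / 6 NO, q=0.25): {expected_remaining(8, 0.25):.1f} animals left on average")
# Emma: 4.0, Randy: 5.0 -> the even split removes more candidates on average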

[Figure] The complete picture: Emma's strategy, entropy formula, information gain formula, and key takeaways


What is Entropy?

Entropy measures uncertainty or disorder in a set.

ENTROPY INTUITION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

HIGH ENTROPY = High uncertainty = Hard to predict
LOW ENTROPY  = Low uncertainty  = Easy to predict


EXAMPLE: Predicting tomorrow's weather

Location A: 50% sunny, 50% rainy
  → HIGH entropy (could go either way!)
  → Very uncertain

Location B: 99% sunny, 1% rainy  
  → LOW entropy (almost certainly sunny)
  → Very predictable

Location C: 100% sunny, 0% rainy
  → ZERO entropy (definitely sunny)
  → No uncertainty at all!

The Entropy Formula

ENTROPY FORMULA:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

H(S) = -Σ pᵢ × log₂(pᵢ)

Where:
• S is the set (e.g., a node in a decision tree)
• pᵢ is the proportion of class i in the set
• log₂ is the base-2 logarithm
• The sum is over all classes


WHY log₂?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Because entropy is measured in "BITS" — the number
of yes/no questions needed to identify something.

8 equally likely options → log₂(8) = 3 bits
  (Need 3 yes/no questions)

16 equally likely options → log₂(16) = 4 bits
  (Need 4 yes/no questions)

2 equally likely options → log₂(2) = 1 bit
  (Need 1 yes/no question)
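A quick sanity check of the "bits = yes/no questions" idea, using only the standard library (the option counts below are arbitrary examples):

import math

# Bits needed to single out one of n equally likely options.
# For powers of two this is exact; otherwise round up to whole questions.
for n in [2, 8, 16, 20]:
    bits = math.log2(n)
    print(f"{n:>2} options -> log2({n}) = {bits:.2f} bits (~{math.ceil(bits)} yes/no questions)")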

Entropy Examples: Step by Step

import numpy as np
import matplotlib.pyplot as plt

def entropy(proportions):
    """Calculate entropy from a list of proportions."""
    # Filter out zeros (log(0) is undefined)
    proportions = np.array([p for p in proportions if p > 0])
    return -np.sum(proportions * np.log2(proportions))

print("ENTROPY EXAMPLES")
print("="*60)

examples = [
    ("Pure (all same class)", [1.0]),
    ("50-50 split", [0.5, 0.5]),
    ("75-25 split", [0.75, 0.25]),
    ("90-10 split", [0.9, 0.1]),
    ("99-1 split", [0.99, 0.01]),
    ("3-way equal split", [1/3, 1/3, 1/3]),
    ("4-way equal split", [0.25, 0.25, 0.25, 0.25]),
]

for name, props in examples:
    h = entropy(props)
    print(f"{name:<25} → Entropy = {h:.4f} bits")

print(f"""
INTERPRETATION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

• Pure set (all one class):     H = 0.00 bits
  → No uncertainty! We KNOW the answer.

• 50-50 binary split:           H = 1.00 bit
  → Maximum uncertainty for 2 classes.
  → Need exactly 1 yes/no question.

• 4-way equal split:            H = 2.00 bits
  → Need 2 yes/no questions (2² = 4).

• Skewed splits (90-10, 99-1):  H < 1.00 bit
  → Less uncertain. Often can guess correctly.
""")

Output:

ENTROPY EXAMPLES
============================================================
Pure (all same class)     → Entropy = 0.0000 bits
50-50 split               → Entropy = 1.0000 bits
75-25 split               → Entropy = 0.8113 bits
90-10 split               → Entropy = 0.4690 bits
99-1 split                → Entropy = 0.0808 bits
3-way equal split         → Entropy = 1.5850 bits
4-way equal split         → Entropy = 2.0000 bits

INTERPRETATION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

• Pure set (all one class):     H = 0.00 bits
  → No uncertainty! We KNOW the answer.

• 50-50 binary split:           H = 1.00 bit
  → Maximum uncertainty for 2 classes.
  → Need exactly 1 yes/no question.

• 4-way equal split:            H = 2.00 bits
  → Need 2 yes/no questions (2² = 4).

• Skewed splits (90-10, 99-1):  H < 1.00 bit
  → Less uncertain. Often can guess correctly.

Visualizing Entropy

import numpy as np
import matplotlib.pyplot as plt

def entropy(proportions):
    """Entropy in bits from a list of proportions (same helper as above)."""
    proportions = np.array([p for p in proportions if p > 0])
    return -np.sum(proportions * np.log2(proportions))

# Create entropy curve for binary classification
p = np.linspace(0.001, 0.999, 1000)
h = -p * np.log2(p) - (1-p) * np.log2(1-p)

plt.figure(figsize=(12, 6))
plt.plot(p, h, 'b-', linewidth=3)
plt.fill_between(p, 0, h, alpha=0.2)

# Mark key points
plt.scatter([0.5], [1.0], s=200, c='red', zorder=5, marker='*')
plt.scatter([0.1, 0.9], [entropy([0.1, 0.9])]*2, s=100, c='orange', zorder=5)
plt.scatter([0.01, 0.99], [entropy([0.01, 0.99])]*2, s=100, c='green', zorder=5)

plt.xlabel('Proportion of Class 1 (p)', fontsize=12)
plt.ylabel('Entropy (bits)', fontsize=12)
plt.title('Entropy: Maximum Uncertainty at 50-50 Split', fontsize=14)
plt.xlim(0, 1)
plt.ylim(0, 1.1)
plt.grid(True, alpha=0.3)

plt.annotate('Maximum entropy!\n(most uncertain)', xy=(0.5, 1.0), 
             xytext=(0.65, 0.85), fontsize=10,
             arrowprops=dict(arrowstyle='->', color='red'))
plt.annotate('Low entropy\n(fairly certain)', xy=(0.9, 0.47), 
             xytext=(0.75, 0.6), fontsize=10,
             arrowprops=dict(arrowstyle='->', color='orange'))

plt.savefig('entropy_curve.png', dpi=150, bbox_inches='tight')
plt.show()

[Figure] Entropy is maximized when the split is 50-50 and minimized (zero) when all samples belong to one class


The Story Behind the Formula

Let's understand WHY entropy has this particular formula.

THE INTUITION BEHIND -p × log₂(p):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Imagine you need to communicate which animal was chosen.

SCENARIO 1: 8 equally likely animals
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Each has probability 1/8 = 0.125

To specify one of 8, you need log₂(8) = 3 bits.
(Like asking 3 yes/no questions)

Example binary codes:
  Dog:      000
  Cat:      001
  Eagle:    010
  Penguin:  011
  Shark:    100
  Goldfish: 101
  Snake:    110
  Frog:     111

3 bits to specify any animal.
Entropy = -8 × (1/8) × log₂(1/8) = 3 bits ✓


SCENARIO 2: Skewed probabilities
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

What if Dog appears 50% of the time, and others split the rest?
  Dog: 50% (1/2)
  Others: ~7% each (1/14 each)

Now we can be CLEVER with our code:
  Dog:      0        (1 bit - it's common!)
  Cat:      100      (3 bits)
  Eagle:    101      (3 bits)
  ...

On average, we need FEWER bits because Dog is common.
This is exactly what entropy calculates — the AVERAGE
number of bits needed, weighted by probability!
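Here is a short, self-contained check of that claim (redefining the proportion-based entropy helper from earlier): with Dog at 50% and the other seven animals sharing the rest, the entropy drops well below 3 bits.

import numpy as np

def entropy(proportions):
    """Entropy in bits from a list of probabilities."""
    p = np.array([x for x in proportions if x > 0])
    return -np.sum(p * np.log2(p))

uniform = [1/8] * 8             # 8 equally likely animals
skewed  = [1/2] + [1/14] * 7    # Dog 50%, the other 7 split the rest evenly

print(f"Uniform distribution: H = {entropy(uniform):.4f} bits")   # 3.0000
print(f"Skewed  distribution: H = {entropy(skewed):.4f} bits")    # ~2.4037
# A clever code (a short codeword for Dog) needs ~2.4 bits per animal on average.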

What is Information Gain?

Now we understand entropy (uncertainty). Information Gain is simply:

How much does a question REDUCE uncertainty?

INFORMATION GAIN FORMULA:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

IG(S, A) = H(S) - H(S|A)

       = H(parent) - Weighted Average H(children)

       = Entropy BEFORE - Entropy AFTER

Where:
• S is the set before splitting
• A is the attribute/feature we split on
• H(S|A) is the conditional entropy after splitting


IN PLAIN ENGLISH:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Information Gain = How much UNCERTAINTY did we REMOVE?

High IG → Great question! (Removed lots of uncertainty)
Low IG  → Bad question! (Didn't help much)
Zero IG → Useless question! (Learned nothing)
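A tiny numeric illustration of those cases (made-up label lists, self-contained): a perfect split drives child entropy to zero, while a split that leaves the same class mix in every child gains nothing.

import numpy as np

def H(labels):
    """Entropy in bits of a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = ['Yes'] * 4 + ['No'] * 4                       # H = 1 bit

# Great question: each child is pure -> IG = 1 - 0 = 1 bit
great = [['Yes'] * 4, ['No'] * 4]
# Useless question: each child keeps the same 50-50 mix -> IG = 0
useless = [['Yes'] * 2 + ['No'] * 2, ['Yes'] * 2 + ['No'] * 2]

for name, children in [('great', great), ('useless', useless)]:
    weighted = sum(len(c) / len(parent) * H(c) for c in children)
    print(f"{name:<8} question: IG = {H(parent) - weighted:.2f} bits")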

Back to the Game Show

Let's calculate Information Gain for Emma's and Randy's questions:

import numpy as np

def entropy(labels):
    """Calculate entropy from a list of labels."""
    if len(labels) == 0:
        return 0
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / len(labels)
    return -np.sum(probs * np.log2(probs))

def information_gain(parent_labels, children_labels_list):
    """Calculate information gain from a split."""
    parent_entropy = entropy(parent_labels)

    # Weighted average of children's entropy
    total = len(parent_labels)
    children_entropy = sum(
        (len(child) / total) * entropy(child)
        for child in children_labels_list
    )

    return parent_entropy - children_entropy

# The animals and their properties
animals = {
    'Dog':      {'warm_blooded': True,  'fur': True,  'fins': False, 'legs': True},
    'Cat':      {'warm_blooded': True,  'fur': True,  'fins': False, 'legs': True},
    'Eagle':    {'warm_blooded': True,  'fur': False, 'fins': False, 'legs': True},
    'Penguin':  {'warm_blooded': True,  'fur': False, 'fins': False, 'legs': True},
    'Shark':    {'warm_blooded': False, 'fur': False, 'fins': True,  'legs': False},
    'Goldfish': {'warm_blooded': False, 'fur': False, 'fins': True,  'legs': False},
    'Snake':    {'warm_blooded': False, 'fur': False, 'fins': False, 'legs': False},
    'Frog':     {'warm_blooded': False, 'fur': False, 'fins': False, 'legs': True},
}

print("COMPARING QUESTIONS: INFORMATION GAIN")
print("="*60)

# All animals as parent
all_animals = list(animals.keys())
print(f"Starting with {len(all_animals)} animals (all equally likely)")
print(f"Parent entropy: {entropy(all_animals):.4f} bits")
print(f"(log₂(8) = 3 bits - need 3 yes/no questions ideally)\n")

# Compare different questions
questions = ['warm_blooded', 'fur', 'fins', 'legs']

print(f"{'Question':<20} {'YES group':<20} {'NO group':<25} {'IG (bits)':<10}")
print("-"*75)

for q in questions:
    yes_group = [a for a, props in animals.items() if props[q]]
    no_group = [a for a, props in animals.items() if not props[q]]

    ig = information_gain(all_animals, [yes_group, no_group])

    yes_str = f"{len(yes_group)} animals"
    no_str = f"{len(no_group)} animals"

    print(f"{q:<20} {yes_str:<20} {no_str:<25} {ig:.4f}")

print(f"""
ANALYSIS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

'warm_blooded': 4 YES, 4 NO → PERFECT 50-50 split!
                IG = 1.0 bit (maximum possible!)

'fur':          2 YES, 6 NO → Uneven split
                IG = 0.81 bits (good, but not optimal)

'fins':         2 YES, 6 NO → Same as fur
                IG = 0.81 bits

'legs':         6 YES, 2 NO → Uneven split
                IG = 0.81 bits


WINNER: 'Is it warm-blooded?' 
This is EXACTLY what Emma asked first!
""")

Output:

COMPARING QUESTIONS: INFORMATION GAIN
============================================================
Starting with 8 animals (all equally likely)
Parent entropy: 3.0000 bits
(log₂(8) = 3 bits - need 3 yes/no questions ideally)

Question             YES group            NO group                  IG (bits) 
---------------------------------------------------------------------------
warm_blooded         4 animals            4 animals                 1.0000
fur                  2 animals            6 animals                 0.8113
fins                 2 animals            6 animals                 0.8113
legs                 6 animals            2 animals                 0.8113

ANALYSIS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

'warm_blooded': 4 YES, 4 NO → PERFECT 50-50 split!
                IG = 1.0 bit (maximum possible!)

'fur':          2 YES, 6 NO → Uneven split
                IG = 0.81 bits (good, but not optimal)

WINNER: 'Is it warm-blooded?' 
This is EXACTLY what Emma asked first!

Information Gain for Decision Trees

Now let's apply this to a real classification problem:

import numpy as np
import pandas as pd

def entropy(labels):
    """Entropy in bits from an array of class labels (same helper as above)."""
    if len(labels) == 0:
        return 0
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / len(labels)
    return -np.sum(probs * np.log2(probs))

# Classic "Play Tennis" dataset
data = {
    'Outlook':    ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 
                   'Overcast', 'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 
                   'Overcast', 'Rain'],
    'Temperature':['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 
                   'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    'Humidity':   ['High', 'High', 'High', 'High', 'Normal', 'Normal', 
                   'Normal', 'High', 'Normal', 'Normal', 'Normal', 'High', 
                   'Normal', 'High'],
    'Wind':       ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 
                   'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 
                   'Weak', 'Strong'],
    'PlayTennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 
                   'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
}

df = pd.DataFrame(data)
print("THE PLAY TENNIS DATASET")
print("="*60)
print(df.to_string(index=False))
print(f"\nTotal: {len(df)} days, {sum(df['PlayTennis']=='Yes')} Yes, {sum(df['PlayTennis']=='No')} No")

# Calculate parent entropy
parent_labels = df['PlayTennis'].values
parent_entropy = entropy(parent_labels)
print(f"Parent Entropy: {parent_entropy:.4f} bits")

Output:

THE PLAY TENNIS DATASET
============================================================
  Outlook Temperature Humidity    Wind PlayTennis
    Sunny         Hot     High    Weak         No
    Sunny         Hot     High  Strong         No
 Overcast         Hot     High    Weak        Yes
     Rain        Mild     High    Weak        Yes
     Rain        Cool   Normal    Weak        Yes
     Rain        Cool   Normal  Strong         No
 Overcast        Cool   Normal  Strong        Yes
    Sunny        Mild     High    Weak         No
    Sunny        Cool   Normal    Weak        Yes
     Rain        Mild   Normal    Weak        Yes
    Sunny        Mild   Normal  Strong        Yes
 Overcast        Mild     High  Strong        Yes
 Overcast         Hot   Normal    Weak        Yes
     Rain        Mild     High  Strong         No

Total: 14 days, 9 Yes, 5 No
Parent Entropy: 0.9403 bits

Calculating Information Gain for Each Feature

def calc_ig_for_feature(df, feature, target='PlayTennis'):
    """Calculate information gain for a feature."""
    parent_entropy = entropy(df[target].values)

    # Get unique values
    values = df[feature].unique()

    # Calculate weighted entropy of children
    weighted_entropy = 0
    split_details = []

    for val in values:
        subset = df[df[feature] == val]
        weight = len(subset) / len(df)
        child_entropy = entropy(subset[target].values)
        weighted_entropy += weight * child_entropy

        # Count Yes/No
        yes_count = sum(subset[target] == 'Yes')
        no_count = sum(subset[target] == 'No')
        split_details.append((val, yes_count, no_count, child_entropy))

    ig = parent_entropy - weighted_entropy
    return ig, split_details

print("\nINFORMATION GAIN FOR EACH FEATURE")
print("="*60)
print(f"Parent Entropy: {entropy(df['PlayTennis'].values):.4f} bits")
print(f"(9 Yes, 5 No out of 14 samples)\n")

features = ['Outlook', 'Temperature', 'Humidity', 'Wind']
results = []

for feature in features:
    ig, details = calc_ig_for_feature(df, feature)
    results.append((feature, ig, details))

    print(f"\n{feature}: Information Gain = {ig:.4f} bits")
    print("-" * 50)
    for val, yes, no, h in details:
        print(f"  {val:<12}: {yes} Yes, {no} No  (H = {h:.4f})")

# Find best feature
best_feature = max(results, key=lambda x: x[1])
print(f"\n{'='*60}")
print(f"🏆 BEST SPLIT: {best_feature[0]} (IG = {best_feature[1]:.4f} bits)")
print(f"{'='*60}")

Output:

INFORMATION GAIN FOR EACH FEATURE
============================================================
Parent Entropy: 0.9403 bits
(9 Yes, 5 No out of 14 samples)


Outlook: Information Gain = 0.2467 bits
--------------------------------------------------
  Sunny       : 2 Yes, 3 No  (H = 0.9710)
  Overcast    : 4 Yes, 0 No  (H = 0.0000)
  Rain        : 3 Yes, 2 No  (H = 0.9710)

Temperature: Information Gain = 0.0292 bits
--------------------------------------------------
  Hot         : 2 Yes, 2 No  (H = 1.0000)
  Mild        : 4 Yes, 2 No  (H = 0.9183)
  Cool        : 3 Yes, 1 No  (H = 0.8113)

Humidity: Information Gain = 0.1518 bits
--------------------------------------------------
  High        : 3 Yes, 4 No  (H = 0.9852)
  Normal      : 6 Yes, 1 No  (H = 0.5917)

Wind: Information Gain = 0.0481 bits
--------------------------------------------------
  Weak        : 6 Yes, 2 No  (H = 0.8113)
  Strong      : 3 Yes, 3 No  (H = 1.0000)

============================================================
🏆 BEST SPLIT: Outlook (IG = 0.2467 bits)
============================================================

[Figure] Outlook has the highest information gain because the "Overcast" branch becomes completely pure (all Yes)


Why Outlook Wins: The Power of Pure Nodes

WHY OUTLOOK HAS THE HIGHEST INFORMATION GAIN:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

When we split on Outlook:

           [All 14 days]
            9 Yes, 5 No
            H = 0.940
                │
        Split on Outlook
                │
    ┌───────────┼───────────┐
    ↓           ↓           ↓
 [Sunny]    [Overcast]    [Rain]
 2Y, 3N      4Y, 0N       3Y, 2N
 H=0.971     H=0.000      H=0.971
              ↑
         PURE NODE!
         (Entropy = 0)

The "Overcast" branch is PERFECTLY PURE!
All 4 Overcast days resulted in "Yes" (play tennis).

This is gold! We can immediately say:
"If Overcast → Always play tennis"

The other features don't create any pure nodes,
so they have lower information gain.
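You can verify that number by hand. The short sketch below recomputes IG(Outlook) directly from the counts in the diagram (no dataset needed):

import numpy as np

def H(counts):
    """Entropy in bits from class counts, e.g. H([9, 5])."""
    p = np.array(counts) / sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

parent = H([9, 5])                       # 0.9403 bits
children = (5/14) * H([2, 3]) \
         + (4/14) * H([4, 0]) \
         + (5/14) * H([3, 2])            # 0.6935 bits (weighted average)
print(f"IG(Outlook) = {parent - children:.4f} bits")   # 0.2467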

Step-by-Step: Building the Tree

print("BUILDING THE DECISION TREE")
print("="*60)

print("""
STEP 1: Split on Outlook (highest IG = 0.247)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

                    [Root: All 14 days]
                      9 Yes, 5 No
                      H = 0.940
                           │
                   Outlook = ?
           ┌───────────┼───────────┐
           ↓           ↓           ↓
       [Sunny]    [Overcast]    [Rain]
       2Y, 3N      4Y, 0N       3Y, 2N
       H=0.971     H=0.000      H=0.971
                      │
                   DONE!        
                 (Pure: Yes)
""")

# Now split Sunny branch
sunny_df = df[df['Outlook'] == 'Sunny']
print("\nSTEP 2: Split the Sunny branch")
print("-"*50)

for feature in ['Temperature', 'Humidity', 'Wind']:
    ig, details = calc_ig_for_feature(sunny_df, feature)
    print(f"{feature}: IG = {ig:.4f}")

print("\nHumidity wins! Split Sunny on Humidity:")
print("""
       [Sunny]
       2Y, 3N
          │
    Humidity = ?
    ┌─────────┴─────────┐
    ↓                   ↓
 [High]              [Normal]
 0Y, 3N              2Y, 0N
 DONE!               DONE!
 (Pure: No)          (Pure: Yes)
""")

# Split Rain branch  
rain_df = df[df['Outlook'] == 'Rain']
print("\nSTEP 3: Split the Rain branch")
print("-"*50)

for feature in ['Temperature', 'Humidity', 'Wind']:
    ig, details = calc_ig_for_feature(rain_df, feature)
    print(f"{feature}: IG = {ig:.4f}")

print("\nWind wins! Split Rain on Wind:")
print("""
       [Rain]
       3Y, 2N
          │
      Wind = ?
    ┌─────────┴─────────┐
    ↓                   ↓
 [Weak]              [Strong]
 3Y, 0N              0Y, 2N
 DONE!               DONE!
 (Pure: Yes)         (Pure: No)
""")

The Final Tree

THE COMPLETE DECISION TREE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

                        [Outlook?]
                       /    |    \
                      /     |     \
                   Sunny Overcast  Rain
                    /       |        \
            [Humidity?]    YES    [Wind?]
              /    \               /    \
           High  Normal        Weak  Strong
            |      |             |      |
           NO     YES          YES     NO


DECISION RULES:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. If Outlook = Overcast → Play Tennis (YES)
2. If Outlook = Sunny AND Humidity = High → Don't Play (NO)
3. If Outlook = Sunny AND Humidity = Normal → Play Tennis (YES)
4. If Outlook = Rain AND Wind = Weak → Play Tennis (YES)
5. If Outlook = Rain AND Wind = Strong → Don't Play (NO)

These rules achieve 100% accuracy on the training data!

[Figure] The final decision tree built using information gain to select the best splits
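If you would rather let a library do the splitting, here is a minimal sketch (assuming the df DataFrame from the Play Tennis snippet above is still in scope) that fits scikit-learn's DecisionTreeClassifier with criterion='entropy' on a one-hot encoded copy of the same data. It should find equivalent rules, just expressed as binary splits on the dummy columns.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# One-hot encode the categorical features so sklearn can split on them.
X = pd.get_dummies(df[['Outlook', 'Temperature', 'Humidity', 'Wind']])
y = df['PlayTennis']

clf = DecisionTreeClassifier(criterion='entropy', random_state=42)
clf.fit(X, y)

print(export_text(clf, feature_names=list(X.columns)))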


Entropy vs Gini: A Comparison

Decision trees can use either entropy or Gini impurity. How do they compare?

import numpy as np
import matplotlib.pyplot as plt

# Compare entropy and Gini for binary classification
p = np.linspace(0.001, 0.999, 1000)
entropy_vals = -p * np.log2(p) - (1-p) * np.log2(1-p)
gini_vals = 2 * p * (1 - p)

# Normalize Gini to same scale for comparison
gini_scaled = gini_vals * 2  # Max Gini is 0.5, max entropy is 1

plt.figure(figsize=(12, 6))
plt.plot(p, entropy_vals, 'b-', linewidth=3, label='Entropy')
plt.plot(p, gini_vals, 'r--', linewidth=3, label='Gini Impurity')
plt.plot(p, gini_scaled, 'r:', linewidth=2, label='Gini (scaled 2x)')

plt.xlabel('Proportion of Class 1 (p)', fontsize=12)
plt.ylabel('Impurity', fontsize=12)
plt.title('Entropy vs Gini Impurity: Very Similar Shapes!', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.xlim(0, 1)

plt.savefig('entropy_vs_gini.png', dpi=150, bbox_inches='tight')
plt.show()
ENTROPY vs GINI IMPURITY:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

                    ENTROPY         GINI
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Formula:        -Σ p log₂(p)      1 - Σ p²
Range:          [0, log₂(k)]      [0, 1-1/k]
For binary:     [0, 1]            [0, 0.5]
At 50-50:       1.0               0.5
Computation:    Slower (log)      Faster (no log)
Origin:         Information       Statistical
                theory            classification


WHICH TO USE?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In practice: Results are nearly identical!

• Gini is DEFAULT in scikit-learn (slightly faster)
• Entropy has nice interpretation (bits of information)
• Both prefer balanced splits
• In practice, both usually produce very similar trees

Choose either — the difference rarely matters.

[Figure] Entropy and Gini impurity have very similar shapes — both are maximized at 50-50 and zero when pure
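To see the "results are nearly identical" claim concretely, here is a minimal sketch (iris is just a convenient example dataset) that trains the same scikit-learn tree twice, once per criterion:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

for criterion in ['gini', 'entropy']:
    clf = DecisionTreeClassifier(criterion=criterion, random_state=42)
    clf.fit(X_train, y_train)
    print(f"{criterion:<8} test accuracy: {clf.score(X_test, y_test):.2%}")
# Expect the two accuracies to be equal or nearly so on a dataset this small.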


The Mathematics: Why This Works

THE INFORMATION-THEORETIC VIEW:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Claude Shannon (1948) asked:
"What's the minimum number of bits needed to 
 communicate a message?"

Answer: It depends on PROBABILITY!

If something is CERTAIN (p=1):
  → 0 bits needed (you already know!)

If two outcomes are EQUALLY LIKELY (p=0.5 each):
  → 1 bit needed (just say "yes" or "no")

If 8 outcomes are equally likely:
  → 3 bits needed (log₂(8) = 3)


ENTROPY IS THE AVERAGE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

H = Σ pᵢ × (bits needed for outcome i)
  = Σ pᵢ × log₂(1/pᵢ)
  = -Σ pᵢ × log₂(pᵢ)

Each outcome i:
  - Occurs with probability pᵢ
  - Needs log₂(1/pᵢ) bits to specify
  - Contributes pᵢ × log₂(1/pᵢ) to the average


INFORMATION GAIN = BITS SAVED:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Before question: Need H(parent) bits to specify class
After question:  Need H(children) bits on average
Information Gain: H(parent) - H(children) bits SAVED!

A question with IG = 0.5 bits saves you half a yes/no question on average!

Code: Complete Implementation

import numpy as np
from collections import Counter

class DecisionTreeWithEntropy:
    """Decision tree using information gain (entropy) for splits."""

    def __init__(self, max_depth=None, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.tree = None

    def _entropy(self, y):
        """Calculate entropy of a label array."""
        if len(y) == 0:
            return 0
        counts = Counter(y)
        probs = [count / len(y) for count in counts.values()]
        return -sum(p * np.log2(p) for p in probs if p > 0)

    def _information_gain(self, y, y_left, y_right):
        """Calculate information gain from a split."""
        if len(y_left) == 0 or len(y_right) == 0:
            return 0

        parent_entropy = self._entropy(y)
        n = len(y)
        child_entropy = (
            (len(y_left) / n) * self._entropy(y_left) +
            (len(y_right) / n) * self._entropy(y_right)
        )
        return parent_entropy - child_entropy

    def _best_split(self, X, y):
        """Find the best feature and threshold to split on."""
        best_ig = 0
        best_feature = None
        best_threshold = None

        for feature in range(X.shape[1]):
            thresholds = np.unique(X[:, feature])

            for threshold in thresholds:
                left_mask = X[:, feature] <= threshold
                right_mask = ~left_mask

                if sum(left_mask) == 0 or sum(right_mask) == 0:
                    continue

                ig = self._information_gain(y, y[left_mask], y[right_mask])

                if ig > best_ig:
                    best_ig = ig
                    best_feature = feature
                    best_threshold = threshold

        return best_feature, best_threshold, best_ig

    def _build_tree(self, X, y, depth=0):
        """Recursively build the tree."""
        n_samples = len(y)
        n_classes = len(set(y))

        # Stopping conditions
        if (self.max_depth and depth >= self.max_depth) or \
           n_classes == 1 or \
           n_samples < self.min_samples_split:
            return {'leaf': True, 'class': Counter(y).most_common(1)[0][0],
                    'samples': n_samples, 'entropy': self._entropy(y)}

        # Find best split
        feature, threshold, ig = self._best_split(X, y)

        if feature is None or ig == 0:
            return {'leaf': True, 'class': Counter(y).most_common(1)[0][0],
                    'samples': n_samples, 'entropy': self._entropy(y)}

        # Split
        left_mask = X[:, feature] <= threshold
        right_mask = ~left_mask

        return {
            'leaf': False,
            'feature': feature,
            'threshold': threshold,
            'ig': ig,
            'entropy': self._entropy(y),
            'samples': n_samples,
            'left': self._build_tree(X[left_mask], y[left_mask], depth + 1),
            'right': self._build_tree(X[right_mask], y[right_mask], depth + 1)
        }

    def fit(self, X, y):
        self.tree = self._build_tree(np.array(X), np.array(y))
        return self

    def _predict_one(self, x, node):
        if node['leaf']:
            return node['class']
        if x[node['feature']] <= node['threshold']:
            return self._predict_one(x, node['left'])
        return self._predict_one(x, node['right'])

    def predict(self, X):
        return [self._predict_one(x, self.tree) for x in np.array(X)]

    def print_tree(self, node=None, indent="", feature_names=None):
        """Pretty print the tree with entropy and IG."""
        if node is None:
            node = self.tree

        if node['leaf']:
            print(f"{indent}🎯 Predict: {node['class']} "
                  f"(samples={node['samples']}, H={node['entropy']:.3f})")
        else:
            fname = feature_names[node['feature']] if feature_names else f"X[{node['feature']}]"
            print(f"{indent}📊 {fname} <= {node['threshold']}? "
                  f"(IG={node['ig']:.3f}, H={node['entropy']:.3f}, n={node['samples']})")
            print(f"{indent}├── Yes:")
            self.print_tree(node['left'], indent + "    ", feature_names)
            print(f"{indent}└── No:")
            self.print_tree(node['right'], indent + "    ", feature_names)

# Test on iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

tree = DecisionTreeWithEntropy(max_depth=3)
tree.fit(X_train, y_train)

print("DECISION TREE WITH ENTROPY (INFORMATION GAIN)")
print("="*60)
print("\nTree Structure:")
tree.print_tree(feature_names=iris.feature_names)

# Accuracy
predictions = tree.predict(X_test)
accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
print(f"\nTest Accuracy: {accuracy:.2%}")
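As a sanity check, you can compare the from-scratch tree against scikit-learn's implementation with the same settings (a minimal sketch, assuming X_train, X_test, y_train, y_test from the block above are still in scope):

from sklearn.tree import DecisionTreeClassifier

sk_tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=42)
sk_tree.fit(X_train, y_train)
print(f"scikit-learn test accuracy: {sk_tree.score(X_test, y_test):.2%}")
# Expect a number very close to the custom tree's accuracy above;
# ties between equally good thresholds may be broken differently.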

Quick Reference

ENTROPY AND INFORMATION GAIN: CHEAT SHEET
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

ENTROPY:
  H(S) = -Σ pᵢ log₂(pᵢ)

  • Measures uncertainty/disorder
  • H = 0 → Pure (all one class)
  • H = 1 → Maximum uncertainty (binary 50-50)
  • Higher H → More mixed → Harder to predict

INFORMATION GAIN:
  IG(S, A) = H(S) - Σ (|Sᵥ|/|S|) × H(Sᵥ)

  • Measures reduction in uncertainty
  • IG = 0 → Question tells us nothing
  • High IG → Great question!
  • Decision trees pick highest IG at each split

COMPARISON WITH GINI:
  • Both measure impurity
  • Both prefer balanced splits
  • Gini: 1 - Σ pᵢ² (faster, no log)
  • Entropy: -Σ pᵢ log₂(pᵢ) (information theory)
  • Results usually identical

SCIKIT-LEARN:
  DecisionTreeClassifier(criterion='entropy')  # Use entropy
  DecisionTreeClassifier(criterion='gini')     # Use Gini (default)

Key Takeaways

  1. Entropy measures uncertainty — High entropy = mixed classes = hard to predict

  2. Pure nodes have zero entropy — All samples belong to one class

  3. Maximum entropy at 50-50 — Most uncertain when perfectly balanced

  4. Information Gain = entropy reduction — How much does a question help?

  5. Trees choose highest IG — Ask the most informative question first

  6. 50-50 splits are ideal — They maximize information gain

  7. Entropy uses bits — The number of yes/no questions needed

  8. Gini and entropy are similar — Both work well, Gini is slightly faster


The One-Sentence Summary

Efficient Emma won the game show because she asked questions that split the possibilities in half, maximizing information gain, and decision trees choose their questions the same way: at every node they pick the feature that reduces entropy (uncertainty) the most on the training data, so that each split separates the classes as effectively as possible.


What's Next?

Now that you understand entropy and information gain, you're ready for:

  1. Gini Impurity Deep Dive — The other splitting criterion
  2. Random Forests — Many trees voting together
  3. Pruning Strategies — Preventing overfitting
  4. Feature Importance — Which features matter most?

Follow me for the next article in the Tree Based Models series!


Let's Connect!

If Emma's game show strategy made entropy click, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

What's your favorite "maximum information" question? Mine is "Is it bigger than a breadbox?" — a classic 20 Questions opener that splits the world roughly in half! 📦


The difference between Random Randy and Efficient Emma? Randy asked whatever came to mind; Emma asked questions that maximized information gain. Decision trees learned the same lesson — always ask the question that teaches you the most.


Share this with someone confused by entropy. After meeting Emma, they'll never forget it!

Happy splitting! 🌲
