
Assitan

Posted on • Originally published at Medium on

What’s the Gini index for machine learning?

The Gini index is used when building decision trees. How do we know which feature to split on at the root node? There are a couple of methods, and the Gini index is a good one. It measures whether the leaves are pure or impure, that is, how mixed the labels in each leaf are.

The more diverse the labels in a leaf, the higher its Gini index. Why does that matter? Because if, let’s say, you want to recommend a product using a decision tree, you want the leaves to be as homogeneous as possible so that you can be confident in your recommendation.

[Image: two decision trees with colored dots]

At a glance, it looks like feature A gives leaves with less diversity, and therefore a better score, because we have three purple circles and two red circles. But you know what, let’s be a bit more rigorous.

So to choose which feature to use at the root node, we calculate the diversity of the leaves.

This is the formula:

[Image: the Gini index formula, annotated: the dataset, the complement rule of probability, the sum over classes, and the probability that two samples picked at random from the dataset belong to different classes]
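In the usual notation, the formula can be written out as below. The symbols are assumed here for illustration: D is the set of samples in a leaf, C is the number of classes, and p_i is the proportion of samples in D that belong to class i.

```latex
\[
\mathrm{Gini}(D) = 1 - \sum_{i=1}^{C} p_i^{2}
\]
```

The sum of the squared proportions is the probability that two samples drawn at random (with replacement) share the same class, so taking the complement gives the probability that they belong to different classes. A pure leaf has a Gini index of 0.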

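Then we compute the mean Gini index of the leaves for each tree (typically weighted by the number of samples in each leaf), compare the results, and choose the lowest number. Our winner is feature A!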

[Image: Gini index calculation and the mean]
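To make the comparison concrete, here is a minimal Python sketch of the calculation. The leaf contents are hypothetical (they are not the exact dots from the images), the helper names `gini` and `weighted_gini` are made up for this example, and the mean is assumed to be weighted by leaf size.

```python
from collections import Counter


def gini(labels):
    """Gini index of one leaf: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention for an empty leaf
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def weighted_gini(leaves):
    """Mean Gini index of a split, weighted by the size of each leaf."""
    total = sum(len(leaf) for leaf in leaves)
    return sum(len(leaf) / total * gini(leaf) for leaf in leaves)


# Hypothetical leaves produced by splitting on feature A and on feature B.
split_a = [["purple", "purple", "purple", "red"], ["red", "red", "purple"]]
split_b = [["purple", "purple", "red", "red"], ["purple", "red", "red"]]

scores = {"A": weighted_gini(split_a), "B": weighted_gini(split_b)}

# The feature with the lowest weighted Gini index gives the purest leaves.
best = min(scores, key=scores.get)
print(scores)
print("Best feature:", best)
```

With these made-up leaves, feature A gets the lower score, so it would be chosen for the root node.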

That’s it!

As a beginner in data science, are you overwhelmed by everything you need to put in your notebook? Not sure when and how to do feature engineering, or which metrics to use for validation? Then you can buy my Machine Learning Regression Starter Pack for Beginners on Gumroad! You can get 20% off with the discount code: xsmleqj

If you want to learn more about data science and programming with illustrations, follow me on Twitter.

Originally published at https://codistwa.com.
