Asad Ullah Masood

All about Decision Trees

In machine learning, decision trees are one of the most intuitive and widely-used algorithms for classification and regression tasks. They model decision-making processes in a tree-like structure where each internal node represents a feature (attribute), each branch represents a decision rule, and each leaf node represents the outcome. Decision trees are easy to visualize and interpret, making them an excellent tool for explaining the logic behind predictions.
In this blog, we will explore the basics and the mathematics of decision trees, the ID3 algorithm (a popular method for building decision trees), and CART (Classification and Regression Trees).
What is a Decision Tree?
A decision tree is a supervised learning algorithm that can be used for both classification and regression tasks. It works by recursively partitioning the data into subsets based on feature values, making decisions at each node to maximize a specific criterion (e.g., information gain or Gini index).
Key Components:
• Root Node: The top node in the tree that represents the best feature to split the data.
• Internal Nodes: Represent the features used for splitting the data based on specific decision rules.
• Leaf Nodes: Terminal nodes that represent the predicted outcome (class label or numerical value).
• Branches: Connections between nodes representing the possible values of the features.
How Do Decision Trees Work?
The process of creating a decision tree involves:

  1. Selecting the Best Attribute: Using a metric like Gini impurity, entropy, or information gain, the best attribute to split the data is selected.
  2. Splitting the Dataset: The dataset is split into subsets based on the selected attribute.
  3. Repeating the Process: The process is repeated recursively for each subset, creating a new internal node or leaf node until a stopping criterion is met (e.g., all instances in a node belong to the same class or a predefined depth is reached).
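To make this process concrete, here is a minimal sketch using scikit-learn on a tiny play-tennis-style table. The library choice and the toy rows are illustrative assumptions, not part of the original dataset:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# A toy play-tennis-style table (made-up rows for illustration)
data = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast"],
    "Humidity": ["High", "High", "High", "High", "Normal", "Normal"],
    "Wind":     ["Weak", "Strong", "Weak", "Weak", "Strong", "Strong"],
    "Play":     ["No", "No", "Yes", "Yes", "No", "Yes"],
})

# One-hot encode the categorical input features; the target stays as class labels
X = pd.get_dummies(data[["Outlook", "Humidity", "Wind"]])
y = data["Play"]

# criterion="entropy" splits on information gain; "gini" (the default) also works
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X, y)

# Print the learned decision rules as text
print(export_text(clf, feature_names=list(X.columns)))
```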

Dataset: the classic play-tennis table, where Play Tennis is the output feature and the remaining columns are input features.

Decision tree example where Outlook is chosen (arbitrarily, for illustration) as the root node.
A leaf node in a decision tree is the terminal node at the bottom of the tree, where no further splits are made. Leaf nodes represent the final output or prediction of the decision tree. Once a data point reaches a leaf node, a decision or prediction is made based on the majority class (for classification) or the average value (for regression) of the data points that reach that leaf.
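As a quick illustration with made-up leaf contents, the prediction at a leaf can be computed directly from the samples that land in it:

```python
from collections import Counter

# Classification leaf: predict the majority class of the samples in the leaf
leaf_labels = ["Yes", "Yes", "No", "Yes"]
print(Counter(leaf_labels).most_common(1)[0][0])   # -> "Yes"

# Regression leaf: predict the mean of the target values in the leaf
leaf_targets = [3.2, 2.8, 3.0]
print(sum(leaf_targets) / len(leaf_targets))       # -> 3.0
```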
To check mathematically whether a split is pure, we use entropy or Gini impurity; information gain then helps us decide which feature to split on.
Entropy
Entropy is a concept borrowed from information theory and is commonly used as a measure of uncertainty or disorder in a set of data. In the context of decision trees, entropy is often employed as a criterion to decide how to split data points at each node, aiming to create subsets that are more homogeneous with respect to the target variable.
A low entropy indicates a more ordered or homogeneous set, while a high entropy signifies greater disorder or diversity. In a decision tree, the goal is to reduce entropy by selecting features and split points that result in more ordered subsets.
Entropy values range from 0 to log2(k), where k is the number of classes. The minimum entropy (0) occurs when all instances belong to a single class, making the set perfectly ordered. The maximum entropy (log2(k)) occurs when instances are evenly distributed across all classes, creating a state of maximum disorder.
At each node of a decision tree, the algorithm evaluates the entropy for each feature and split point. The feature and split point that result in the largest reduction in entropy are chosen for the split. The reduction in entropy is often referred to as Information Gain and is calculated as the difference between the entropy before and after the split.
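Entropy and information gain can be computed directly from the class labels. The sketch below is a straightforward translation of the definitions (the function names are my own, not from the original post):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy: H = -sum(p_i * log2(p_i)) over the class proportions."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Example: a 50/50 node has entropy 1; splitting it into two pure halves gains 1 bit
parent = ["Yes", "Yes", "No", "No"]
print(entropy(parent))                                           # 1.0
print(information_gain(parent, [["Yes", "Yes"], ["No", "No"]]))  # 1.0
```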
Gini impurity
Gini impurity is a measure of the impurity or disorder in a set of elements, commonly used in decision tree algorithms, especially for classification tasks. It quantifies the likelihood of misclassification of a randomly chosen element in the dataset.
A lower Gini impurity suggests a more homogeneous set of elements within the node, making it an attractive split in a decision tree. Decision tree algorithms aim to minimize the Gini impurity at each node, selecting the feature and split point that results in the lowest impurity.
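A minimal Gini helper, mirroring the entropy sketch above (again just a direct translation of the definition):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: G = 1 - sum(p_i^2) over the class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

print(gini(["Yes", "Yes", "Yes"]))        # 0.0 -> perfectly pure node
print(gini(["Yes", "No", "Yes", "No"]))   # 0.5 -> maximally impure (binary case)
```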
Entropy and Gini impurity formulas: H(S) = −Σ p_i · log2(p_i) and G = 1 − Σ p_i², where p_i is the proportion of class i in the node.

• The minimum value of entropy is 0, reached when all elements in a node belong to a single class.
• The maximum value depends on the number of classes:
• For 2 classes (binary classification): maximum entropy is 1.
• For 3 classes: maximum entropy is log2(3) ≈ 1.585.
• For 4 classes: maximum entropy is log2(4) = 2, and so on.
• G = 0 indicates a perfectly pure node (all elements belong to the same class).
• G = 0.5 indicates maximum impurity for binary classification (elements evenly split between the two classes); in general, the maximum Gini impurity for k classes is 1 − 1/k.
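These bounds are easy to verify numerically; a quick sanity check (not part of the original post's figures):

```python
import math

# Maximum entropy for k equally likely classes: H = -k * (1/k) * log2(1/k) = log2(k)
for k in (2, 3, 4):
    p = 1.0 / k
    h_max = -sum(p * math.log2(p) for _ in range(k))
    print(f"k={k}: max entropy = {h_max:.3f}, log2({k}) = {math.log2(k):.3f}")

# Maximum Gini impurity for a binary 50/50 node: 1 - (0.5^2 + 0.5^2) = 0.5
print("max binary Gini =", 1 - (0.5 ** 2 + 0.5 ** 2))
```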
Example
Let's take an example of a feature f1 whose node is split into child nodes c1 and c2. The parent node for f1 has 6 Yes and 3 No; after the split, c1 has 3 Yes and 3 No, while c2 has 3 Yes and 0 No. So c2 is a pure split (entropy 0, Gini 0) and c1 is maximally impure (entropy 1, Gini 0.5), as the calculation below shows.
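Working through the numbers (a small sketch using the counts above; helper names are my own):

```python
import math

def entropy(yes, no):
    """Entropy from raw Yes/No counts."""
    total = yes + no
    return 0.0 - sum((c / total) * math.log2(c / total) for c in (yes, no) if c)

def gini(yes, no):
    """Gini impurity from raw Yes/No counts."""
    total = yes + no
    return 1.0 - (yes / total) ** 2 - (no / total) ** 2

# Parent node of f1: 6 Yes / 3 No; children c1: 3 Yes / 3 No, c2: 3 Yes / 0 No
print(entropy(6, 3), entropy(3, 3), entropy(3, 0))   # ~0.918, 1.0 (impure), 0.0 (pure)
print(gini(3, 3), gini(3, 0))                        # 0.5 (impure), 0.0 (pure)

# Information gain of the split: parent entropy minus the weighted child entropies
print(entropy(6, 3) - (6/9) * entropy(3, 3) - (3/9) * entropy(3, 0))   # ~0.252
```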

If the dataset is huge, Gini impurity is usually preferred because it avoids logarithms, making it cheaper to compute than entropy.
Information Gain

Information Gain formula: IG(S, A) = H(S) − Σ_v (|S_v| / |S|) · H(S_v), where S is the parent set, A is the feature being split on, and the S_v are the subsets created by the split.

In the information gain calculation, the weight 8/14 comes from a child node with 8 samples (6 Yes and 2 No) out of the parent's 14 samples (9 Yes and 5 No); similarly, 6/14 comes from the other child node with 6 samples (3 Yes and 3 No) out of the same 14.
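Plugging those counts in gives the gain for this split (a small numeric check of the formula above):

```python
import math

def entropy(yes, no):
    """Entropy from raw Yes/No counts."""
    total = yes + no
    return 0.0 - sum((c / total) * math.log2(c / total) for c in (yes, no) if c)

# Parent node: 9 Yes / 5 No (14 samples); children: 6 Yes / 2 No (8 samples)
# and 3 Yes / 3 No (6 samples) -- the counts referenced above.
ig = entropy(9, 5) - (8/14) * entropy(6, 2) - (6/14) * entropy(3, 3)
print(round(entropy(9, 5), 3), round(ig, 3))   # 0.94 0.048
```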
