In the second chapter of our Symbolic Machine Learning series, we will be talking about one of the most popular machine learning algorithms: Random Forests.
So, let's explore the definition first.
Random Forest is a popular decision-tree-based, non-linear machine learning algorithm (an ensemble learning method) used for classification, regression, and feature selection tasks. It combines multiple decision trees to improve the accuracy and robustness of the model.
Random Forest
The word 'Random' refers to the random selection of data instances for each tree, which is known as the bootstrapping method in statistics and in machine learning as well; in addition, each split considers only a random subset of the features.
The word 'Forest' means that we will not use only one or two decision trees: we will build several decision trees and combine their decisions through the Bagging (bootstrap aggregating) method.
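To make the bootstrapping idea concrete, here is a minimal sketch; the toy dataset of ten instances is an illustrative assumption:

import numpy as np

# a toy dataset of 10 instances (the values are illustrative)
data = np.arange(10)
rng = np.random.default_rng(0)

# a bootstrap sample draws the same number of instances *with replacement*,
# so some instances repeat and some are left out
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)

Each such sample trains one tree, so every tree in the forest sees a slightly different view of the data.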
Here, each decision tree is built from a random subset of the data. By repeatedly drawing subsets from a large dataset and growing a decision tree on each one, we build a Random Forest. Now, there is some probability that a randomly chosen sample in a node would be incorrectly labelled, and this brings us to the term "Gini impurity".
Gini impurity is a commonly used measure of impurity (heterogeneity) in decision-tree-based algorithms for classification tasks. It quantifies how often a randomly chosen data point in a given subset would be incorrectly classified if it were labeled randomly according to the distribution of labels in that subset. In decision trees, the Gini impurity is used to select the best feature to split the data on: we compare the Gini impurity of each possible split and select the one that results in the lowest impurity, i.e. the largest impurity reduction.
Formula
Gini impurity = 1 - (p1^2 + p2^2)
where p1 and p2 are the proportions of the two classes in the dataset (more generally, for k classes, Gini impurity = 1 - (p1^2 + ... + pk^2)). In other words, the Gini impurity measures the probability of misclassifying a randomly chosen element in the dataset if it were labeled according to the distribution of class labels. A lower Gini impurity score indicates a subset that is more homogeneous in terms of class labels, and therefore a better split.
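For example, in a dataset with 4 samples of one class and 6 of the other, p1 = 0.4 and p2 = 0.6, so the Gini impurity is 1 - (0.16 + 0.36) = 0.48; a perfectly pure node (p1 = 1, p2 = 0) has an impurity of 0.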
In case we have to split the data based on a category, we work with the resulting data clusters directly. However, if we have numerical data, we may take the mean (or another threshold) and split the data into two subsets based on that condition; a sketch of such a split follows the code example below.
Here's an example in Python that calculates the Gini impurity of a set of class labels:
import numpy as np

def gini_impurity(labels):
    # count the number of instances of each class
    _, counts = np.unique(labels, return_counts=True)
    # compute the proportion of each class
    probs = counts / len(labels)
    # Gini impurity = 1 - sum of squared class proportions
    impurity = 1 - np.sum(probs ** 2)
    return impurity
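For example, for the labels [0, 0, 1, 1, 1] the class proportions are 0.4 and 0.6, so the function returns 1 - (0.16 + 0.36) = 0.48:

labels = np.array([0, 0, 1, 1, 1])
print(gini_impurity(labels))  # 0.48

Building on this, here is a sketch of how a numerical split could be scored, as described above; weighting each side's impurity by its size is an assumption about a typical implementation, and split_gini is an illustrative name:

def split_gini(feature_values, labels, threshold):
    # partition the labels according to the threshold
    left = labels[feature_values <= threshold]
    right = labels[feature_values > threshold]
    # weight each side's impurity by its share of the samples
    n = len(labels)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

# splitting at the mean separates the two classes perfectly here
feature = np.array([1.0, 2.0, 3.0, 8.0, 9.0])
labels = np.array([0, 0, 0, 1, 1])
print(split_gini(feature, labels, feature.mean()))  # 0.0

A split is accepted when this weighted impurity is lower than the parent node's own impurity.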
Let us now review the steps in the Random Forest classification method:
- Bootstrapping to generate random subsets of the data
- Decision tree construction for each data subset:
  - determine the Gini impurity of each feature
  - determine the Gini impurity of each prospective split
  - grow the decision tree based on the splitting Gini impurity (i.e. if the weighted sum of the Gini impurities of the split sub-trees is lower than the Gini impurity of the parent node, then split the parent node)
- Bagging for ensemble classification
- Majority voting for the final classification decision
Let us now see an example of the whole process:
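Below is a minimal from-scratch sketch of the listed steps, assuming the Iris dataset and scikit-learn's DecisionTreeClassifier (which uses Gini impurity as its default split criterion) as the base learner; n_trees and the other parameter values are illustrative choices:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

n_trees = 25  # size of the forest (an illustrative choice)
rng = np.random.default_rng(42)
trees = []

# bootstrapping + tree construction: each tree is grown on its own
# bootstrap sample, and each split considers a random subset of features
for _ in range(n_trees):
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    tree = DecisionTreeClassifier(criterion="gini", max_features="sqrt")
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# bagging: collect every tree's prediction, then take a majority vote
all_preds = np.array([tree.predict(X_test) for tree in trees])
majority = np.apply_along_axis(
    lambda col: np.bincount(col).argmax(), 0, all_preds)

print("Ensemble accuracy:", (majority == y_test).mean())

In practice, scikit-learn's RandomForestClassifier packages all of these steps behind a single fit/predict interface, with n_estimators controlling the number of trees.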