Random Forest Overview
Random forest is an ensemble learning algorithm that can be applied to both classification and regression problems. Ensemble methods combine several machine learning models and tend to improve prediction performance compared to a single model. Random forest combines three ideas: bagging, random feature selection, and decision trees. Let's describe each component separately.
Bagging
Bagging is short for Bootstrap aggregation. It means drawing a random sample of data points with replacement. For example, if we have a dataset with 1000 observations, we randomly draw 1000 observations with replacement. As a result, the bootstrap sample contains only about 2/3 of the distinct original observations, and the remaining 1/3 of its rows are duplicates of those observations.
We repeat this bootstrap sampling multiple times, and the number of sampled datasets defines the number of decision trees in the ensemble. The default number of estimators in sklearn is 100 for both regression and classification (since scikit-learn version 0.22).
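As a rough sketch of the idea (using NumPy and a toy dataset made up purely for illustration), a single bootstrap sample can be drawn like this:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000                      # size of the original dataset
X = rng.normal(size=(n, 5))   # toy feature matrix with 5 features

# One bootstrap sample: draw n row indices with replacement
boot_idx = rng.integers(0, n, size=n)
X_boot = X[boot_idx]

# Roughly 63% of the original rows appear at least once;
# the rest of the sample consists of repeated rows.
unique_share = len(np.unique(boot_idx)) / n
print(f"unique rows in bootstrap sample: {unique_share:.2%}")
```

Running this several times gives one bootstrap sample per tree, which is exactly what the ensemble size (n_estimators) controls.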
Random feature selection
Random feature selection at each node split when constructing individual decision trees is another source of randomness in the Random Forest algorithm. There are two options: consider all features at each split, or randomly sample from a predefined number of features. For example, by default scikit-learn uses all features for regression and the square root of the number of features for classification problems. This is part of the parameter tuning process, and along with 'max_features' we can also study the effects of adjusting other parameters such as 'max_depth' and 'min_samples_split'.
After the random subset of features has been selected, we use the usual criteria for choosing the best splitting feature, such as Gini impurity (classification problem) or reduction in variance (regression problem).
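In scikit-learn these choices are exposed through the max_features parameter, alongside n_estimators, max_depth, and min_samples_split. A minimal example, with a toy dataset and parameter values chosen purely for illustration (most of them are the library defaults):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy classification data just to show the parameters in use
X, y = make_classification(n_samples=500, n_features=16, random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,      # number of trees (default since 0.22)
    max_features="sqrt",   # features considered at each split (classification default)
    max_depth=None,        # grow trees until leaves are pure (default)
    min_samples_split=2,   # minimum samples required to split a node (default)
    random_state=0,
)
clf.fit(X, y)
print(clf.score(X, y))
```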
Constructing multiple decision trees
Finally, we construct a decision tree for each bootstrap sample, applying the random feature selection process at every split.
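To make the construction concrete, here is a simplified sketch (not scikit-learn's internal implementation) that builds each tree on its own bootstrap sample, with per-split random feature selection handled by the tree's max_features argument; the data and the number of trees are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=16, random_state=0)
rng = np.random.default_rng(0)

n_trees = 25
trees = []
for _ in range(n_trees):
    # Bootstrap sample of the training data
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" samples a random feature subset at every split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)
```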
How Random Forest makes the final decision
As we can see, Random Forest uses several sources of randomness to create multiple decision trees. In the end, each decision tree gets the same vote ('weight') in the final decision: the majority vote of all trees is taken for a classification problem, or the mean of all trees' predictions is calculated for a regression problem.
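Continuing the simplified sketch above (and assuming the list of fitted trees from the previous snippet), the aggregation step looks roughly like this:

```python
import numpy as np

# Classification: each tree gets one equal vote, the majority class wins
all_votes = np.stack([tree.predict(X) for tree in trees])   # shape (n_trees, n_samples)
majority = np.apply_along_axis(
    lambda col: np.bincount(col.astype(int)).argmax(), 0, all_votes
)

# Regression: the ensemble prediction is simply the mean of all trees' outputs
# y_pred = np.mean([tree.predict(X) for tree in trees], axis=0)
```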