Introduction
Random Forest is a very powerful supervised machine learning algorithm, but it can be a little confusing to understand. In simple terms, a Random Forest builds a collection of decision trees, where each tree uses a different random subset of the features (subspace sampling) and only a portion of the total data sampled with replacement (bagging). The forest then returns the mean of the trees' predictions in the case of regression, or the most common class (the mode) in the case of classification. So to really understand Random Forest, we first need to understand how decision trees work.
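To make that concrete, here is a minimal sketch of the idea, assuming scikit-learn and NumPy are available: each tree is fit on a bootstrap sample of the rows and considers a random subset of features when splitting, and the forest's prediction is the mode of the individual trees' predictions. The dataset, the number of trees, and the other settings here are just illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

trees = []
for _ in range(25):                               # a small "forest" of 25 trees
    idx = rng.integers(0, len(X), len(X))         # bagging: bootstrap sample with replacement
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)  # random feature subsets at each split
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Classification: the forest's prediction is the mode of the trees' predictions
all_preds = np.array([t.predict(X) for t in trees])
forest_pred = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_preds)
print("Training accuracy:", (forest_pred == y).mean())
```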
Decision Trees
Decision Trees are very similar to a flowchart: you follow each decision to another step until you reach a final outcome. The way you make the splits in a Decision Tree, though, is a bit different than in your normal flowchart. You start at the top of the tree (the root node) with all of your data and make your first split into two child nodes. From there you keep dividing internal nodes into two more nodes until the data no longer provides enough information to create another split (a leaf node), or until the tree hits a limit set by one of several hyperparameters. That might not sound too complicated, but here comes the math: to create each split, you need to calculate which variable best separates the data out of all the possible splits you could make. Luckily we have computers for that, because this process gets very labor intensive very quickly.
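As a small sketch of how that looks in practice (assuming scikit-learn here), the example below grows a single tree, caps its growth with a couple of the hyperparameters mentioned above, and prints the resulting flowchart-like splits:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()

# max_depth and min_samples_leaf are two hyperparameters that limit how far the tree keeps splitting
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
tree.fit(data.data, data.target)

# export_text prints the fitted tree as a flowchart-like series of if/else splits
print(export_text(tree, feature_names=data.feature_names))
```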
Metrics for Splitting
I will try to keep this part simple, for your sanity as well as mine. There are two common metrics for finding the best split: information gain and Gini impurity. Information gain uses a concept called entropy, which in information theory is roughly the "amount of information held in a variable", to numerically compare how much uncertainty is reduced from before the split to after it, and pick the split that reduces it the most. Gini impurity, on the other hand, measures the probability that a randomly selected sample from the node would be classified incorrectly if it were labeled at random according to the node's class distribution.
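Here is a rough sketch of both metrics for a node described by its class counts. The helper functions below are my own illustrative ones, not from any particular library:

```python
import numpy as np

def entropy(counts):
    """Entropy in bits: high when classes are mixed, zero when the node is pure."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

def gini(counts):
    """Gini impurity: chance a random sample would be labeled incorrectly at random."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - (p ** 2).sum()

print(entropy([5, 5]), gini([5, 5]))    # a maximally mixed node: 1.0 bit, 0.5
print(entropy([10, 0]), gini([10, 0]))  # a pure node: zero entropy, zero impurity

# Information gain of a split = parent entropy minus the weighted
# average entropy of the two children it produces.
gain = entropy([5, 5]) - (5 / 10) * entropy([4, 1]) - (5 / 10) * entropy([1, 4])
print(gain)  # roughly 0.28
```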
Some Pros/Cons of using Random Forest
Pros
- Strong performance and highly accurate for most datasets.
- Can handle a high number of variables.
- Works with missing data.
- Gives the user a good measure of variable importance (see the sketch after this list).
- Resists overfitting because each decision tree is trained on a different sample of the data and a different set of features.
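As a quick look at the variable-importance measure mentioned above, here is a sketch using scikit-learn's RandomForestClassifier (the dataset and settings are just illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# feature_importances_ scores each variable's contribution to the splits (the scores sum to 1)
for name, score in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {score:.3f}")
```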
Cons
- Large memory footprint due to being an ensemble of a large collection of decision trees.
- Slow to train and predict because of the computational power required.
- Tuning hyperparameters can be time consuming, though it is less complex than for some similar algorithms such as Gradient Boosting.
Conclusion
Random Forest is an extremely versatile and well performing machine learning algorithm, especially for classification problems. It also works well with large amounts of data and does a good job of avoiding overfitting. That being said, it is not the most efficient model for making predictions, so it is best used when you have plenty of computational power and memory available.