Data scientists employ a wide range of algorithms to receive and analyze input data and predict output values within an acceptable range. The more experience a data scientist gains, the better they know which algorithm to use for each problem.
Random Forest is one of the most useful of these algorithms, since it works for both classification and regression tasks.
In this article, you'll learn all you need to know about Random Forest. We'll cover:
- What is Random Forest?
- What is Random Forest used for?
- How does Random Forest work?
- Important hyperparameters
- How to run Random Forest in a few lines of code
- Advantages of Random Forest
- Disadvantages of Random Forest
Random Forest, also known as a random decision forest, is an ensemble learning method for classification, regression and other tasks that works by constructing a multitude of decision trees at training time. For classification tasks, the output of the Random Forest is the class selected by most trees; for regression tasks, it is the average of the individual trees' predictions. Random Forest is a supervised machine learning algorithm that grows and combines decision trees to make a 'forest', and implementations are available in both R and Python.
Before we explore Random Forest in more detail, let's break down the keywords in the definition:
- Supervised machine learning
- Classification and regression
- Decision tree
Understanding these keywords will help you understand the concept of Random Forest. We start with supervised machine learning.
Supervised machine learning is a category of machine learning that uses labeled datasets to train algorithms to classify data or predict outcomes accurately.
A good example of supervised learning problems is predicting house prices. First, we need data about the houses: square footage, number of rooms, features, whether a house has a swimming pool or not and so on. We then need to know the prices of these houses, i.e. the corresponding labels. Using the data coming from thousands of houses, their features and prices, we can now train a supervised machine learning model to predict a new house’s price based on the examples observed by the model.
Classification and Regression
Classification is the process of finding a model that helps in the separation of data into multiple categorical classes (discrete values).
Regression is the process of finding a model that predicts continuous real values rather than classes or discrete values.
A simple way to tell the two apart: classification predicts discrete values (yes or no, 1 or 0, etc.) while regression predicts continuous values.
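To make the distinction concrete, here is a minimal sketch using scikit-learn. The single feature and both targets are made up for illustration: one target is a 0/1 label, the other a noisy continuous value.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))            # one made-up feature

y_class = (X[:, 0] > 5).astype(int)              # discrete labels: 0 or 1
y_reg = 3.0 * X[:, 0] + rng.normal(0, 0.5, 200)  # continuous target

clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y_class)
reg = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y_reg)

X_new = np.array([[2.0], [8.0]])
class_pred = clf.predict(X_new)  # values drawn from the label set {0, 1}
reg_pred = reg.predict(X_new)    # real-valued estimates

print("classification:", class_pred)
print("regression:", reg_pred)
```

The classifier can only ever return one of the labels it saw during training, while the regressor returns a real number for each input.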
As mentioned earlier, a Random Forest model combines multiple decision trees to make a 'forest'. A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs and so on.
A decision tree consists of three components: decision nodes, leaf nodes, and a root node.
- Decision node: has two or more branches (e.g. sunny, windy and rainy).
- Leaf node: represents a classification or decision.
- Root node: the topmost decision node, which corresponds to the best predictor.
A decision tree algorithm divides a training dataset into branches, which further segregate into other branches. This sequence continues until a leaf node is attained. The leaf node cannot be segregated further.
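The structure described above can be inspected directly. The following sketch assumes scikit-learn and its bundled iris dataset (neither is part of the original example); it fits a shallow tree and prints it as text.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# max_depth=2 keeps the tree small enough to read; it is an arbitrary choice.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# In the text rendering, the first condition is the root node, indented
# conditions are further decision nodes, and "class: ..." lines are leaf nodes.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```

Each path from the root to a leaf reads as a chain of if/else conditions that ends in a predicted class.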
Data scientists use Random Forest in many industries, such as banking, medicine and e-commerce, to make predictions that help these industries run efficiently:
In banking, it is used to predict which customers are more likely to repay their debts and which will use the bank's services more frequently.
In health care, Random Forest can be used to analyze a patient's medical history to identify an illness. It is also used in the study of genetics.
Retail companies use Random Forest to recommend products and to predict customer satisfaction.
Before we look into how Random Forest works, we need to look into the "ensemble" technique used in the definition of Random Forest. Ensemble means combining multiple models, so a collection of models is used to make predictions rather than an individual model. Ensembles use two types of methods:
Bagging: creates different training subsets by sampling the training data with replacement, trains a model on each subset, and bases the final output on majority voting (or averaging, for regression). Random Forest is an example: the decision trees within a Random Forest are usually trained using the bagging method. Bagging is short for Bootstrap Aggregation.
Bootstrap: randomly samples rows from the dataset, with replacement, to form a sample dataset for every model. Random Forest additionally samples a random subset of the features.
Aggregation: combines the predictions of the models trained on these sample datasets into a single output. Bootstrap Aggregation can be used to reduce the variance of high-variance algorithms such as decision trees.
Boosting: combines weak learners into a strong learner by building models sequentially, so that each new model tries to correct the errors of the previous ones. Examples are AdaBoost and XGBoost.
An ensemble method combines predictions from multiple machine learning algorithms together to make more accurate predictions than an individual model.
Random Forest is also an ensemble method.
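The bootstrap step in bagging can be illustrated with a few lines of NumPy. This is only a sketch on made-up row indices; the dataset size of 1000 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 1000
row_ids = np.arange(n_rows)

# Bootstrap: draw n_rows indices with replacement, so some rows repeat
# and others are never picked.
sample = rng.choice(row_ids, size=n_rows, replace=True)

in_bag = np.unique(sample)
out_of_bag = np.setdiff1d(row_ids, in_bag)
frac_unique = in_bag.size / n_rows

print(f"distinct rows in the sample: {frac_unique:.2f}")
print(f"rows left out of the sample: {out_of_bag.size / n_rows:.2f}")
```

Because sampling is done with replacement, each bootstrap sample contains roughly 63% of the distinct rows, which is why every model in the ensemble sees a slightly different dataset.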
Variance is an error resulting from sensitivity to small fluctuations in the dataset used for training. High variance will cause an algorithm to model irrelevant data, or noise, in the dataset instead of the intended outputs, called signal. This problem is called overfitting. An overfitted model will perform well in training, but won’t be able to distinguish the noise from the signal in an actual test.
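Overfitting is easy to observe on noisy data. The sketch below assumes a synthetic dataset from scikit-learn's `make_classification` with about 10% of labels randomized (`flip_y=0.1`); it compares a single fully grown decision tree with a Random Forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y injects label noise.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

tree_train = tree.score(X_train, y_train)
tree_test = tree.score(X_test, y_test)
forest_test = forest.score(X_test, y_test)

print(f"single tree   train={tree_train:.2f}  test={tree_test:.2f}")
print(f"random forest test={forest_test:.2f}")
```

The single tree memorizes the training set, noise included, and scores noticeably lower on the test set. The forest usually generalizes better on data like this, although the exact numbers depend on the dataset.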
Steps involved in random forest algorithm:
Step I: In Random Forest, n random records are drawn, with replacement, from a data set of k records.
Step II: Individual decision trees are constructed for each sample.
Step III: Each decision tree will generate an output.
Step IV: The final output is decided by majority voting for classification, or by averaging for regression.
Consider the fruit basket as the data, as shown in the figure above. A number of samples are taken from the fruit basket and an individual decision tree is constructed for each sample. Each decision tree generates an output, as shown in the figure. The final output is decided by majority voting. In the figure above, more decision trees output apple than banana, so the final output is apple.
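The four steps can be sketched by hand with plain decision trees. This is an illustrative approximation, not how any particular library implements Random Forest; the synthetic dataset and the choice of 25 trees are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=1)
rng = np.random.default_rng(1)
n_trees = 25  # hypothetical choice for illustration

trees = []
for i in range(n_trees):
    # Step I: draw a bootstrap sample of the rows (with replacement).
    idx = rng.integers(0, len(X), size=len(X))
    # Step II: build one tree per sample. max_features="sqrt" makes each
    # split consider only a random subset of the features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# Step III: every tree produces its own output for the first five rows.
votes = np.array([t.predict(X[:5]) for t in trees])  # shape (n_trees, 5)

# Step IV: majority vote across trees for each row.
final = np.array([np.bincount(col).argmax() for col in votes.T])
print("majority vote:", final)
```

For a regression task, Step IV would average the trees' outputs instead of counting votes. Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, which usually leads to the same class.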
The hyperparameters in Random Forest are either used to increase the predictive power of the model or to make the model faster. Let's look at these hyperparameters:
To increase the predictive power:
- n_estimators: This is the number of trees the algorithm builds before taking the majority vote or the average of the predictions. In general, a higher number of trees improves performance and makes the predictions more stable, but it also slows down the computation.
- max_features: This is the maximum number of features Random Forest considers when looking for the best split at a node.
- min_samples_leaf: This determines the minimum number of samples required to be at a leaf node. A split is only considered if it leaves at least this many training samples in each branch.
- max_depth: This specifies the maximum depth of each tree. The default value for max_depth is None, which means that each tree will expand until every leaf is pure (all of its data comes from the same class) or contains too few samples to split further. Some practitioners suggest a depth of around 5 to 8, but the best value depends on the problem and the data.
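One way to see the effect of these settings is to compare a few configurations with cross-validation. The sketch below uses the scikit-learn parameter names; the synthetic dataset and the three configurations are arbitrary assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=12, n_informative=4,
                           random_state=0)

# Hypothetical configurations, from very constrained to fairly flexible.
settings = [
    {"n_estimators": 10, "max_depth": 2},
    {"n_estimators": 100, "max_depth": None},
    {"n_estimators": 100, "max_depth": 6, "max_features": "sqrt",
     "min_samples_leaf": 3},
]

results = []
for params in settings:
    model = RandomForestClassifier(random_state=0, **params)
    # Mean accuracy over 5 cross-validation folds.
    score = cross_val_score(model, X, y, cv=5).mean()
    results.append((params, score))
    print(params, round(score, 3))
```

In practice the same idea is usually automated with a grid or randomized search, but a small manual comparison like this is often enough to see which settings matter for a given dataset.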
To increase the model's speed, and to make results easier to reproduce and validate:
- n_jobs: This hyperparameter tells the engine how many processors it is allowed to use. If it has a value of 1, it can only use one processor. A value of -1 means that all available processors are used.
- random_state: This hyperparameter makes the model's output replicable. The model will always produce the same results when it has a fixed value of random_state and is given the same hyperparameters and the same training data.
- oob_score: This enables out-of-bag (OOB) scoring, a built-in validation method for Random Forest. Because each tree is trained on a bootstrap sample, roughly one-third of the rows are left out of that tree's training data. These rows are called the out-of-bag samples, and they can be used to evaluate the model's performance. The idea is similar in spirit to cross-validation, but it needs no separate hold-out set.
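The following sketch shows these three settings together in scikit-learn, on a synthetic dataset that is assumed purely for illustration. `fit_forest` is a hypothetical helper, not a library function.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

def fit_forest(seed):
    """Hypothetical helper: train a forest on all cores with OOB scoring."""
    model = RandomForestClassifier(
        n_estimators=100,
        oob_score=True,     # score each row with the trees that never saw it
        n_jobs=-1,          # use every available processor
        random_state=seed,  # fix the randomness for reproducibility
    )
    return model.fit(X, y)

a = fit_forest(42)
b = fit_forest(42)

same = bool((a.predict(X) == b.predict(X)).all())
print("OOB accuracy:", round(a.oob_score_, 3))
print("identical predictions with the same seed:", same)
```

The `oob_score_` attribute is only available after fitting with `oob_score=True`, and it relies on bootstrap sampling, which is the default.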
Now, let's see how to implement Random Forest in a few lines of code.
    # import necessary libraries, e.g. pandas, matplotlib and so on
    # import dataset
    # clean the dataset if necessary
    # visualize if necessary
    # split the dataset into train and test sets
    # (X_train, X_test, y_train, y_test)

    # classifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn import metrics

    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    # fit the model using the training set
    clf.fit(X_train, y_train)
    # predict on the test set
    y_pred = clf.predict(X_test)
    # compare the actual and predicted labels
    print(metrics.accuracy_score(y_test, y_pred))

    # regressor
    from sklearn.ensemble import RandomForestRegressor

    reg = RandomForestRegressor(n_estimators=100, random_state=42)
    # train the model using the training set
    reg.fit(X_train, y_train)
    # predict on the test set
    y_pred = reg.predict(X_test)
    # accuracy_score only applies to classification, so use regression
    # metrics such as R^2 and mean squared error here
    print(metrics.r2_score(y_test, y_pred))
    print(metrics.mean_squared_error(y_test, y_pred))
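The template above assumes that `X_train`, `X_test`, `y_train` and `y_test` already exist. For a version that runs as-is, here is a sketch that substitutes synthetic data from scikit-learn's `make_classification` and `make_regression` for a real dataset. The regressor is scored with regression metrics (R^2 and mean squared error), since accuracy only applies to discrete labels.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# --- classification on synthetic data ---
Xc, yc = make_classification(n_samples=400, n_features=8, random_state=42)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    Xc, yc, test_size=0.25, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(Xc_train, yc_train)
acc = accuracy_score(yc_test, clf.predict(Xc_test))
print("accuracy:", round(acc, 3))

# --- regression on synthetic data ---
Xr, yr = make_regression(n_samples=400, n_features=8, noise=10.0,
                         random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    Xr, yr, test_size=0.25, random_state=42)

reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(Xr_train, yr_train)
yr_pred = reg.predict(Xr_test)
r2 = r2_score(yr_test, yr_pred)
mse = mean_squared_error(yr_test, yr_pred)
print("R^2:", round(r2, 3), "MSE:", round(mse, 1))
```

With a real dataset, only the two data-generation lines need to change; the fit, predict and scoring calls stay the same.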
Random Forest is often more accurate than a single decision tree when analyzing very large datasets. It also tends to produce good results even without hyperparameter tuning. The following are advantages of using Random Forest:
- It reduces overfitting in decision trees and helps to improve the accuracy.
- It is flexible to both classification and regression problems.
- It works well with both categorical and continuous values.
- Some implementations can handle missing values in the data automatically.
- Normalising of data is not required as it uses a rule-based approach.
- It takes less time and expertise to develop.
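The point about normalisation can be checked with a small experiment. The sketch below assumes synthetic data with one feature deliberately put on a much larger scale, and compares a forest trained on raw features with one trained on standardized features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X[:, 0] *= 1000.0  # put one feature on a much larger scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Forest trained on the raw, unscaled features.
raw = RandomForestClassifier(n_estimators=50, random_state=0)
acc_raw = raw.fit(X_train, y_train).score(X_test, y_test)

# Same forest trained on standardized features.
scaler = StandardScaler().fit(X_train)
scaled = RandomForestClassifier(n_estimators=50, random_state=0)
acc_scaled = scaled.fit(scaler.transform(X_train), y_train).score(
    scaler.transform(X_test), y_test)

print(f"raw features:    {acc_raw:.3f}")
print(f"scaled features: {acc_scaled:.3f}")
```

Tree splits depend only on the ordering of values within a feature, so rescaling typically leaves the accuracy unchanged or nearly so.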
Random Forest is really useful. Talk about the Avengers of algorithms!
Like everything else, Random Forest also has some drawbacks:
- It requires much time for training as it combines a lot of decision trees to determine the class.
- Because it is an ensemble of many decision trees, it is harder to interpret than a single tree. Feature importance scores give only a rough picture of the significance of each variable.
- It requires much computational power as well as resources as it builds numerous trees to combine their outputs.
- It is a predictive modeling tool and not a descriptive tool, meaning if you're looking for a description of the relationships in your data, other approaches would be better.
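Although a forest cannot be read like a single tree, scikit-learn exposes impurity-based importance scores that give a rough ranking of the variables. The sketch below assumes a synthetic dataset and made-up feature names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the two informative features in the first two columns.
X, y = make_classification(n_samples=500, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; they are normalized to sum to 1.
importances = forest.feature_importances_
ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

These scores say which features the trees relied on most, not how or in which direction a feature affects the prediction, so they complement rather than replace a descriptive analysis.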
Which of the following is/are true about Random Forest?
I. It can be used for classification task(s)
II. It can be used for regression task(s)
III. It is the act or process of classifying
A. I & II
B. I only
C. II only
Is Random Forest a supervised or unsupervised learning model?
A. Supervised learning
B. Unsupervised learning
When does overfitting occur?
A. The model performs well on testing and not so well on training.
B. The model performs well on both the testing and training.
C. The model doesn't perform well on both testing and training.
D. The model performs well on the training but not on the testing.
The bagging method is a type of ensemble machine learning algorithm called?
D. Bootstrap Aggregation.
Thanks for reading!