In machine learning, regression is a widely used technique for predicting continuous numerical values. Random forest regression is one of the most versatile and effective regression algorithms. In this article we will learn how to implement random forest regression using Python.
Random Forest Regressor
Random Forest Regressor is an ensemble learning algorithm that combines decision trees with the concept of randomness. It belongs to the supervised learning family of algorithms. When trained on data, the algorithm creates multiple decision trees and combines the predictions of all the trees to give the final output.
The name Random Forest comes from two concepts: randomness and forests. Randomness refers to the random sampling of the data used to build each tree. Forest means a collection of trees, which this model creates by generating multiple decision trees and combining them all.
In the decision tree algorithm, all the data is used to build a single tree, which then makes the prediction. The random forest algorithm instead builds as many trees as the user specifies through the n_estimators parameter, and it produces its output by combining the outputs of all these trees.
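The difference between one tree and a forest can be sketched in a few lines. This is a minimal illustration on synthetic data (the noisy sine curve and all variable names are assumptions made for the example, not part of the salary dataset used later).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, assumed only for illustration: a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

# A single decision tree: one model fitted on all the data
tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# A random forest: n_estimators controls how many trees are built
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# The fitted trees are stored in the estimators_ attribute
print("Trees in the forest:", len(forest.estimators_))
print("Single tree prediction:", tree.predict([[5.0]]))
print("Forest prediction:", forest.predict([[5.0]]))
```

The estimators_ list holds one fitted decision tree per requested estimator, so its length matches n_estimators.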
How Random Forest Regressor Works
Data Preparation :
The first step is preparing the training data. Each data point should be represented by a set of features and a corresponding numerical target value.
Generating Trees :
In this step the algorithm creates as many trees as the user specifies through the parameter n_estimators. Each tree is built on a random sample of the training dataset. The sample is drawn randomly with replacement (a bootstrap sample), so the same data point can be used more than once.
Growing The Trees :
Each tree is grown by repeatedly splitting its data into smaller groups based on chosen features. We repeat this process until the data cannot be divided further or a stopping condition is met. Our aim is to make sure that the values we want to predict are as similar as possible within each smaller group.
Combining Predictions :
After building all the trees, we need to combine their predictions. We do this by adding up the predictions from all the trees and taking the average. This is called aggregation. For example, if we are trying to predict a number, like the price of a house, we take the average of all the prices predicted by the trees.
This combined prediction is usually more accurate than relying on just one tree.
Missing Values :
Whether a random forest regressor can handle missing values depends on the implementation. Where it is supported, each decision tree handles the missing values separately while making its prediction, and the predictions are then averaged as usual. Otherwise, missing values need to be filled in (imputed) before training, for example with the mean of the column.
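The sampling, tree-growing, and averaging steps described above can be sketched by hand with plain decision trees. This is a simplified illustration on synthetic data, not how scikit-learn implements its forest internally. The data, n_trees, and the variable names are assumptions, and with a single feature the per-split feature sampling of a real random forest plays no role.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, assumed only for illustration: a noisy straight line
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=300)

n_trees = 25  # plays the role of n_estimators
trees = []
for i in range(n_trees):
    # Generating trees: draw a bootstrap sample (row indices with replacement)
    idx = rng.integers(0, len(X), size=len(X))
    # Growing the tree: fit a decision tree on that sample
    t = DecisionTreeRegressor(random_state=i)
    t.fit(X[idx], y[idx])
    trees.append(t)

# Combining predictions: average the outputs of all trees
X_new = np.array([[2.0], [7.5]])
all_preds = np.stack([t.predict(X_new) for t in trees])  # shape (n_trees, n_points)
bagged_pred = all_preds.mean(axis=0)

print("Spread of individual trees:", all_preds.std(axis=0))
print("Averaged prediction:", bagged_pred)
```

Each individual tree sees a slightly different sample and so gives a slightly different answer; the average is what the ensemble reports.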
Hyper Parameter Tuning
Hyperparameters control the behavior of the algorithm and must be set before the training process starts. Tuning them can affect the performance of the model. Here we will discuss two important hyperparameters and how to tune them.
n_estimators : This parameter sets the number of decision trees in the random forest. Increasing the number of trees can improve the performance of the model, but it also increases the computational cost. We can try different values and pick the one that gives the best results.
max_depth : This parameter sets the maximum depth of each decision tree. A deeper tree can capture more complex relationships in the data but can also lead to overfitting. By tuning this parameter, we can find the right balance between model complexity and generalization.
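To see how max_depth trades off fit against generalization, one option is to compare the score on the training data with the score on held-out data for a few depths. The sketch below uses synthetic data and an arbitrary choice of depths, both assumed only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data, assumed only for illustration: a noisy sine curve
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

scores = {}
for depth in [1, 3, None]:  # None lets each tree grow without a depth limit
    model = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=1)
    model.fit(X_train, y_train)
    # score() returns the R^2 coefficient of determination
    train_r2 = model.score(X_train, y_train)
    test_r2 = model.score(X_test, y_test)
    scores[depth] = (train_r2, test_r2)
    print(f"max_depth={depth}: train R^2={train_r2:.3f}, test R^2={test_r2:.3f}")
```

A large gap between the training score and the test score is a sign that the trees are fitting noise in the training data.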
To tune these parameters we can use Grid Search, Random Search, or Bayesian Optimization.
Grid Search involves deciding the set of values for each hyper parameter and exhaustively evaluating all possible combinations.
Random Search randomly samples combinations of hyperparameters and evaluates their performance.
Bayesian Optimization uses a probabilistic model to search for promising hyperparameters.
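As a sketch of the random search approach, scikit-learn provides RandomizedSearchCV, which evaluates only n_iter sampled combinations instead of the full grid. The synthetic data and the candidate values below are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data with two features, assumed only for illustration
rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(150, 2))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.3, size=150)

# Candidate values to sample from (12 possible combinations in total)
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [None, 3, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,  # only 5 of the 12 combinations are tried
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X, y)

print("Best Parameters:", search.best_params_)
print("Best Score (MSE):", -search.best_score_)
```

Random search is mainly useful when the grid is too large to evaluate exhaustively; for a grid as small as this one, a full grid search is also cheap.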
For a brief explanation and more information on hyperparameter tuning, you can refer to this Link
Implementation of Random Forest Regressor using Python
To implement random forest regression we will use the sklearn library, which provides a wide set of tools for machine learning tasks.
Step 1 : Importing Necessary Libraries
First we will import necessary libraries for loading and manipulating the data.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
Step 2 : Loading the data
We will use salary data for this example.
sal_d = pd.read_csv('Position_Salaries.csv')
Our data will look like this -
Step 3 : Creating x and y variables
x will be the independent variable and y will be the dependent variable.
x = sal_d.iloc[:,1:-1].values
y = sal_d.iloc[:,-1].values
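The slicing above assumes that the first column is a label, the last column is the target, and everything in between is a feature. The small sketch below shows what the two iloc expressions return on a hypothetical table; the column names and values are made up for illustration and are not taken from the actual CSV file.

```python
import pandas as pd

# Hypothetical toy table: column names and values are invented for this example
toy = pd.DataFrame({
    "Position": ["Junior", "Senior", "Manager"],
    "Level": [1, 2, 3],
    "Salary": [40000, 60000, 90000],
})

# All rows, columns from index 1 up to (not including) the last one.
# Using a slice keeps the result two-dimensional, which sklearn expects for x.
x_toy = toy.iloc[:, 1:-1].values

# All rows, last column only: a one-dimensional array of targets
y_toy = toy.iloc[:, -1].values

print("x shape:", x_toy.shape)
print("y shape:", y_toy.shape)
```

Writing 1:-1 rather than a single column index matters here, because scikit-learn estimators expect the features as a two-dimensional array even when there is only one feature.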
Step 4 : Hyper Parameter Tuning
In this step we will use GridSearchCV for hyperparameter tuning.
## Importing required libraries.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
## Defining param grid
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [None, 5, 10]
}
## Creating a random forest regressor object
rf_regressor = RandomForestRegressor(random_state=42)
# Perform grid search for hyperparameter tuning
grid_search = GridSearchCV(rf_regressor, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(x, y)
# Getting the best parameters and best score
best_params = grid_search.best_params_
best_score = -grid_search.best_score_
print("Best Parameters:", best_params)
print("Best Score (MSE):", best_score)
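To get a feel for how much work this grid search does, the combinations can be listed with ParameterGrid from scikit-learn. The sketch below repeats the same grid; cv_folds is a name introduced here only to mirror the cv=5 argument.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the step above
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
}
cv_folds = 5  # mirrors cv=5 in the GridSearchCV call

# Every combination of the listed values
combinations = list(ParameterGrid(param_grid))
n_fits = len(combinations) * cv_folds

print("Combinations:", len(combinations))            # 3 * 3 = 9
print("Model fits during cross-validation:", n_fits)  # 9 * 5 = 45

# Show the first few combinations
for params in combinations[:3]:
    print(params)
```

With the default refit=True, GridSearchCV also fits the best combination once more on the full data, and that fitted model is available as grid_search.best_estimator_.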
Step 5 : Training the Model
In this step we will use the parameters tuned above and the whole dataset to train our model.
rf_regressor_best = RandomForestRegressor(n_estimators=best_params['n_estimators'], max_depth=best_params['max_depth'], random_state=42)
The code above creates a model using the parameters tuned in the previous step.
Below is the code to fit the model on the whole dataset.
# Fitting the model to the training data using the best parameters
rf_regressor_best.fit(x, y)
Step 6 : Predictions.
In this step we will predict the output for a new input that is not in the dataset.
rf_regressor_best.predict([[6.5]])
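The steps above train on the whole dataset, so they do not show how well the model predicts unseen data. One common way to check this is to hold out part of the data and compute the mean squared error on it. The sketch below does so on synthetic data; the data and variable names are assumptions, and the same pattern could be applied to x and y if the dataset is large enough to split.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data, assumed only for illustration: a noisy quadratic curve
rng = np.random.default_rng(3)
X = rng.uniform(1, 10, size=(200, 1))
y = 1000.0 * X[:, 0] ** 2 + rng.normal(scale=500.0, size=200)

# Keep 20% of the rows aside for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=3
)

model = RandomForestRegressor(n_estimators=100, random_state=3)
model.fit(X_train, y_train)

# Predict on rows the model has not seen and measure the error
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = float(np.sqrt(mse))  # same units as the target

print("Test MSE:", mse)
print("Test RMSE:", rmse)
```

The root mean squared error is often easier to read than the MSE because it is expressed in the same units as the value being predicted.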
Conclusion
By following these steps, anyone can implement a random forest regressor using Python. The algorithm has more hyperparameters than the two covered here, and tuning them can improve the model further.
I carried out all of these steps myself in Python. The theory comes from the internet and a Udemy course, and I have written it up in my own words. If there is a mistake above, please leave a comment.