
Avinash Gupta

Random Forest

About Random Forest Classifier:

Random Forest is an ensemble classifier that fits several decision trees on different subsets of a given dataset and averages their predictions to improve accuracy. For homework #2, I fitted several classifiers, including RandomForestClassifier and ExtraTreesClassifier, to predict the binary response variable TREG1 (whether a person is a smoker or not). All variables in the dataset, such as age, gender, race, alcohol use, and others (see the dataset), were used to build the final model. After fitting, these factors influenced the response variable with different levels of importance.
These are the feature importances, sorted in descending order:
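
To illustrate the idea, here is a minimal sketch (on toy, made-up data) of what a random forest does under the hood: each tree is trained on a bootstrap sample of the rows, and the forest aggregates the individual trees' votes:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# toy, made-up data purely for illustration
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # bootstrap sample: draw rows with replacement
    rows = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X[rows], y[rows]))

# each tree votes; for a binary target the majority vote is
# equivalent to thresholding the mean prediction at 0.5
votes = np.array([tree.predict(X) for tree in trees])
forest_prediction = (votes.mean(axis=0) > 0.5).astype(int)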

marever1 0.096374
age 0.083599
DEVIANT1 0.080081
SCHCONN1 0.075221
GPA1 0.074775
DEP1 0.071728
FAMCONCT 0.067389
PARACTV 0.063784
ESTEEM1 0.057945
ALCPROBS1 0.057670
VIOL1 0.048614
ALCEVR1 0.043539
PARPRES 0.039425
WHITE 0.022146
cigavail 0.021671
BLACK 0.018512
BIO_SEX 0.014942
inhever1 0.012832
cocever1 0.012590
PASSIST 0.010221
EXPEL1 0.009777
HISPANIC 0.007991
NAMERICAN 0.005332
ASIAN 0.003844
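
A quick way to eyeball this ranking is a horizontal bar chart. A small sketch using the values above (only the top entries are typed out here; the full list works the same way):

import pandas as pd
import matplotlib.pyplot as plt

# top entries from the importance list above; extend with the full list
importances = pd.Series({
    "marever1": 0.096374,
    "age": 0.083599,
    "DEVIANT1": 0.080081,
    "SCHCONN1": 0.075221,
    "GPA1": 0.074775,
    "DEP1": 0.071728,
})
importances.sort_values().plot(kind="barh")
plt.xlabel("Feature importance")
plt.tight_layout()
plt.show()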

Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
%matplotlib inline
RND_STATE = 55324

# load the Add Health data and drop rows with missing values
AH_data = pd.read_csv("data/tree_addhealth.csv")
data_clean = AH_data.dropna()
data_clean.dtypes

data_clean.describe()

predictors = data_clean[['BIO_SEX', 'HISPANIC', 'WHITE', 'BLACK', 'NAMERICAN', 'ASIAN', 'age',
                         'ALCEVR1', 'ALCPROBS1', 'marever1', 'cocever1', 'inhever1', 'cigavail',
                         'DEP1', 'ESTEEM1', 'VIOL1', 'PASSIST', 'DEVIANT1', 'SCHCONN1', 'GPA1',
                         'EXPEL1', 'FAMCONCT', 'PARACTV', 'PARPRES']]

targets = data_clean.TREG1

# 60/40 train/test split
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4, random_state=RND_STATE)

print("Predict train shape: ", pred_train.shape)
print("Predict test shape: ", pred_test.shape)
print("Target train shape: ", tar_train.shape)
print("Target test shape: ", tar_test.shape)

classifier = RandomForestClassifier(n_estimators=25, random_state=RND_STATE)
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

print("Confusion matrix:")
print(confusion_matrix(tar_test, predictions))
print()
print("Accuracy: ", accuracy_score(tar_test, predictions))

# rank the predictors by the model's feature importances
important_features = pd.Series(data=classifier.feature_importances_, index=predictors.columns)
important_features.sort_values(ascending=False, inplace=True)

print(important_features)

# fit an ExtraTreesClassifier for comparison and inspect its importances
model = ExtraTreesClassifier(random_state=RND_STATE)
model.fit(pred_train, tar_train)

print(model.feature_importances_)

# test-set accuracy as the number of trees grows from 1 to 25
trees = range(1, 26)
accuracy = np.zeros(len(trees))
for idx, n in enumerate(trees):
    classifier = RandomForestClassifier(n_estimators=n, random_state=RND_STATE)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)
plt.xlabel("Number of trees")
plt.ylabel("Accuracy")
plt.show()


Output:

The final model performed well on the test data, reaching an accuracy of 83.4%.
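
For context, the accuracy score is simply the share of correct predictions, i.e. the diagonal of the confusion matrix divided by the total count. A sketch with hypothetical counts (not the actual homework output) that happen to give roughly the same rate:

import numpy as np

# hypothetical counts, NOT the real confusion matrix from the run above
cm = np.array([[1400, 150],   # [[true neg, false pos],
               [ 170, 210]])  #  [false neg, true pos]]
accuracy = np.trace(cm) / cm.sum()  # (TN + TP) / total
print(accuracy)  # ~0.834 for these made-up numbers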

The accuracy as a function of the number of trees can be presented in the plot below:

[Plot: test-set accuracy vs. number of trees in the forest]

As the plot shows, even a single tree already reaches a good level of accuracy, so the data can be described reasonably well with one tree. On the other hand, adding more trees increases the final accuracy slightly, letting the model predict the data a bit more precisely.
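
As a side note, scikit-learn can also estimate generalization accuracy without touching the held-out split, via out-of-bag (OOB) samples. A small sketch, reusing pred_train and tar_train from the code above:

from sklearn.ensemble import RandomForestClassifier

# each tree's bootstrap draw leaves some rows unseen ("out-of-bag");
# scoring on those rows gives a built-in accuracy estimate
oob_model = RandomForestClassifier(n_estimators=25, oob_score=True,
                                   random_state=RND_STATE)
oob_model.fit(pred_train, tar_train)
print("OOB accuracy:", oob_model.oob_score_)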
