Random Forest

#assignment #course

About Random Forest Classifier:

Random Forest is a classifier that trains several decision trees on different random subsets of a given dataset and combines their predictions (by averaging or voting) to improve predictive accuracy. For homework #2, I fitted several classifiers, including RandomForestClassifier and ExtraTreesClassifier, to predict the binary response variable TREG1 (whether a person is a smoker or not). All explanatory variables in the dataset, such as age, gender, race, alcohol use, and others (see dataset), were used to build the final model. After fitting, the model assigned a different level of importance to each of these factors.
I calculated the feature importances and sorted them in descending order:

marever1     0.096374
age          0.083599
DEVIANT1     0.080081
SCHCONN1     0.075221
GPA1         0.074775
DEP1         0.071728
FAMCONCT     0.067389
PARACTV      0.063784
ESTEEM1      0.057945
ALCPROBS1    0.057670
VIOL1        0.048614
ALCEVR1      0.043539
PARPRES      0.039425
WHITE        0.022146
cigavail     0.021671
BLACK        0.018512
BIO_SEX      0.014942
inhever1     0.012832
cocever1     0.012590
PASSIST      0.010221
EXPEL1       0.009777
HISPANIC     0.007991
NAMERICAN    0.005332
ASIAN        0.003844
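As a quick sanity check on the list above, here is a minimal sketch that loads the reported values into a pandas Series and looks at how the importance is distributed. The values are copied from the list; the variable names (`importances`, `total`, `top10_share`) are only illustrative, and the Native American column is written as `NAMERICAN`, the name used in the predictors list in the code.

```python
import pandas as pd

# Importances copied from the list above (feature name -> importance)
importances = pd.Series({
    "marever1": 0.096374, "age": 0.083599, "DEVIANT1": 0.080081,
    "SCHCONN1": 0.075221, "GPA1": 0.074775, "DEP1": 0.071728,
    "FAMCONCT": 0.067389, "PARACTV": 0.063784, "ESTEEM1": 0.057945,
    "ALCPROBS1": 0.057670, "VIOL1": 0.048614, "ALCEVR1": 0.043539,
    "PARPRES": 0.039425, "WHITE": 0.022146, "cigavail": 0.021671,
    "BLACK": 0.018512, "BIO_SEX": 0.014942, "inhever1": 0.012832,
    "cocever1": 0.012590, "PASSIST": 0.010221, "EXPEL1": 0.009777,
    "HISPANIC": 0.007991,
    "NAMERICAN": 0.005332,  # column name as used in the predictors list
    "ASIAN": 0.003844,
})

# scikit-learn normalizes impurity-based importances, so they should sum to about 1
total = importances.sum()

# Share of the total importance held by the ten highest-ranked features
top10_share = importances.sort_values(ascending=False).head(10).sum()

print(f"Sum of all importances: {total:.4f}")
print(f"Share held by the top 10 features: {top10_share:.4f}")
```

Because the importances are normalized, each value can be read as a fraction of the total, which makes it easier to compare groups of features.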

Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
%matplotlib inline
RND_STATE = 55324

AH_data = pd.read_csv("data/tree_addhealth.csv")
data_clean = AH_data.dropna()
data_clean.dtypes

data_clean.describe()

predictors = data_clean[['BIO_SEX', 'HISPANIC', 'WHITE', 'BLACK', 'NAMERICAN', 'ASIAN', 'age',
                         'ALCEVR1', 'ALCPROBS1', 'marever1', 'cocever1', 'inhever1', 'cigavail',
                         'DEP1', 'ESTEEM1', 'VIOL1', 'PASSIST', 'DEVIANT1', 'SCHCONN1', 'GPA1',
                         'EXPEL1', 'FAMCONCT', 'PARACTV', 'PARPRES']]

targets = data_clean.TREG1

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4, random_state=RND_STATE)

print("Predict train shape:", pred_train.shape)
print("Predict test shape:", pred_test.shape)
print("Target train shape:", tar_train.shape)
print("Target test shape:", tar_test.shape)

classifier = RandomForestClassifier(n_estimators=25, random_state=RND_STATE)
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

print("Confusion matrix:")
print(confusion_matrix(tar_test, predictions))
print()
print("Accuracy:", accuracy_score(tar_test, predictions))

important_features = pd.Series(data=classifier.feature_importances_,index=predictors.columns)
important_features.sort_values(ascending=False,inplace=True)

print(important_features)

model = ExtraTreesClassifier(random_state=RND_STATE)
model.fit(pred_train, tar_train)

print(model.feature_importances_)

# Fit forests with 1 to 25 trees and record the test accuracy of each
trees = range(1, 26)
accuracy = np.zeros(len(trees))
for idx, n_trees in enumerate(trees):
    classifier = RandomForestClassifier(n_estimators=n_trees, random_state=RND_STATE)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)
plt.show()
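The ExtraTreesClassifier importances in the code above are printed as a bare array, which is hard to read without the column names. A minimal sketch of labeling and sorting them, the same way the RandomForestClassifier importances are handled, might look like this. It runs on synthetic stand-in data from `make_classification`, and the column names `feat_0` to `feat_4` are hypothetical, not from the Add Health dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data (assumption): 200 rows, 5 hypothetical feature columns
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(5)])

model = ExtraTreesClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# Attach column names to the importances and sort from most to least important
labeled = pd.Series(model.feature_importances_, index=X.columns)
labeled = labeled.sort_values(ascending=False)
print(labeled)
```

With the real data, `X.columns` would be replaced by `predictors.columns`, so both models' rankings could be compared side by side.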

Output:

The final model performed well on the test data, reaching an accuracy of 83.4%. The results can be presented in this plot:

As the plot shows, even a single tree reaches a good level of accuracy, so this data can be described reasonably well with one tree. On the other hand, adding more trees increases the final accuracy slightly, so the forest predicts the data somewhat more precisely.
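To illustrate why adding trees typically helps, here is a rough sketch of the idea behind a random forest, written by hand: each tree is trained on a bootstrap sample of the rows, and the trees' predictions are combined by majority vote. This is a simplification under stated assumptions, not what RandomForestClassifier does internally (scikit-learn averages the trees' predicted probabilities rather than counting hard votes). The data is synthetic, and all names (`votes`, `single_tree_acc`, `forest_acc`) are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption, stands in for the real dataset)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
votes = np.zeros((n_trees, len(y_test)), dtype=int)

for i in range(n_trees):
    # Bootstrap sample: draw training rows with replacement
    rows = rng.integers(0, len(y_train), size=len(y_train))
    # Each split considers only a random subset of the features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[rows], y_train[rows])
    votes[i] = tree.predict(X_test)

# Accuracy of the first tree on its own
single_tree_acc = accuracy_score(y_test, votes[0])

# Majority vote over all trees (odd tree count, so binary labels cannot tie)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
forest_acc = accuracy_score(y_test, forest_pred)

print("Single tree accuracy:", single_tree_acc)
print("Majority vote accuracy:", forest_acc)
```

Individual trees make different mistakes because they see different rows and features, so the vote usually smooths out some of those errors. The exact size of the gain depends on the data.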
