jpetoskey

# Use Random Forest Feature Importance for Feature Selection

Saving on computation is a priority for me, as I am practicing data science on an old machine, and don't currently have access to cloud computing software.

So, what to do when recursive feature elimination (RFE) has been running for 20 minutes while I take a snack break, and is still spinning when I return?
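For context, here is a minimal RFE setup (the data and parameters are illustrative, not from my project). The runtime problem is built in: with `step=1`, RFE refits the whole estimator once for every feature it eliminates, so the cost grows with the number of features:

```python
# Illustrative only: RFE refits the estimator once per elimination step,
# which is why it gets slow with many features on a modest machine.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=200, n_features=25, random_state=0)

rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=10,  # keep the 10 strongest features
    step=1,                   # drop one feature per refit
)
rfe.fit(X, y)
print(rfe.support_.sum())  # number of features kept
```

With 25 features and a target of 10, that is 15 full Random Forest refits before you get an answer.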

My Random Forest models are among my best performers, so their built-in `feature_importances_` attribute is a really nice way to save on computation and still end up with a high-performing model.

After fitting, predicting, and running a confusion matrix and a classification report, I turn to feature importance. It lets me iterate and improve my model by rerunning it on fewer features, or report the most important features to my superiors.
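That workflow can be sketched end to end on synthetic data (the variable name `forest6` mirrors the code later in this post; the dataset here is made up):

```python
# Sketch of the workflow: fit, predict, confusion matrix,
# classification report, then feature importances.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=12, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

forest6 = RandomForestClassifier(n_estimators=100, random_state=1)
forest6.fit(X_train, y_train)
y_pred = forest6.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(forest6.feature_importances_)  # one weight per column; weights sum to 1
```

The key point is that `feature_importances_` comes for free from the already-fitted model, so ranking features adds essentially no computation.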

See the example code on scikit-learn's "Feature importances with a forest of trees" documentation page.
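In the spirit of that example, here is a short sketch (on synthetic data) of pulling the impurity-based importances from a fitted forest, along with their spread across the individual trees in `estimators_`:

```python
# Impurity-based importances, plus their standard deviation across
# the forest's individual trees (data here is synthetic).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

importances = forest.feature_importances_
std = np.std(
    [tree.feature_importances_ for tree in forest.estimators_], axis=0
)

# Rank features from most to least important
for rank, idx in enumerate(np.argsort(importances)[::-1], start=1):
    print(f"{rank}. feature {idx}: {importances[idx]:.3f} (+/- {std[idx]:.3f})")
```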

One of my favorite things to do with this information is to produce a top-ten bar chart that plots the features by importance.

And here is the code I copied or wrote to produce it:

```python
# Feature importance
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Pair each column name with its importance from the fitted forest
features = pd.DataFrame({
    'Feature': X_train.columns.values,
    'Feature Importance': forest6.feature_importances_,
})

# Keep the ten most important features, then sort smallest-to-largest
# so the biggest bar lands at the top of the horizontal bar chart
features = features.nlargest(n=10, columns='Feature Importance')
features = features.sort_values(by='Feature Importance', ascending=True)

import matplotlib.style as style
# style.available  # lists the available styles
style.use('fivethirtyeight')

plt.figure(figsize=(8, 8))
plt.barh(range(10), features['Feature Importance'], align='center')
plt.yticks(np.arange(10), features['Feature'])
plt.xlabel('Feature importance (weight)')
plt.ylabel('Feature')
plt.title('Top Ten Features by Importance')
plt.show()
```
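To close the loop on iterating with fewer features, one way (a sketch, assuming a ranking built from `feature_importances_` as above; the data and names here are illustrative) is to keep only the top-ranked columns and refit:

```python
# Sketch: refit the model on only the top-k features by importance.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(20)])

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank columns by importance and keep the ten strongest
ranking = pd.Series(forest.feature_importances_, index=X.columns)
top_cols = ranking.nlargest(10).index

slim_forest = RandomForestClassifier(n_estimators=100, random_state=0)
slim_forest.fit(X[top_cols], y)
print(X[top_cols].shape)  # half the columns, one cheap refit
```

One cheap refit on the reduced frame replaces the many refits RFE would have needed to reach the same feature count.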

I hope this helps you enjoy `feature_importances_` in scikit-learn!