Use Random Forest Feature Importance for Feature Selection

Saving on computation is a priority for me, as I am practicing data science on an old machine and don't currently have access to cloud computing resources.

So, what to do when recursive feature elimination (RFE) runs for 20 minutes while I take a snack break and is still spinning when I return?
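For reference, this is roughly the kind of setup that gets slow: RFE refits the model once per elimination step. This is a minimal sketch on a generated toy dataset, not my actual data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Illustrative toy dataset: 200 rows, 20 features
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# RFE drops one feature per step (by default) and refits the whole
# forest each time -- that repeated refitting is what eats the time
selector = RFE(RandomForestClassifier(random_state=42),
               n_features_to_select=10)
selector.fit(X, y)

print(selector.support_.sum())  # number of features kept
```

On a toy dataset this finishes quickly, but with many features and a large training set, the per-step refits add up fast.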

The answer is scikit-learn's model.feature_importances_.

My Random Forest models are among my best, so this feature is a really nice way to save on computation and still return a high-performing model.

After fitting, predicting, and generating a confusion matrix and classification report, I turn to feature importance so I can iterate on my model by running it with fewer features, or report the importances to my superiors.

See the documentation code on scikit-learn's "Feature importances with a forest of trees" page.
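The core idea from that page is that, after fitting, the importances are just an attribute on the model, so they cost nothing extra to compute. A minimal sketch on a generated toy dataset (dataset and variable names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative toy dataset with 10 features
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(random_state=0).fit(X, y)

# One impurity-based importance score per column; the scores sum to 1
importances = forest.feature_importances_
print(importances)
```

No extra model fitting is needed: the importances were accumulated during training, which is exactly why this is so much cheaper than RFE.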

One of my favorite things to do with this information is to produce a top-ten-features bar chart that ranks the features by importance.


And here is the code I adapted to produce it:

# Feature importance
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Pair each training column with its importance score
features = pd.DataFrame({'Feature': X_train.columns.values,
                         'Feature Importance': forest6.feature_importances_})

# Keep the ten most important, then sort ascending so the
# largest bar lands at the top of the horizontal bar chart
features = features.nlargest(n=10, columns=['Feature Importance'])
features = features.sort_values(by=['Feature Importance'], ascending=True)

plt.barh(range(10), features['Feature Importance'], align='center')
plt.yticks(np.arange(10), features['Feature'])
plt.xlabel('Feature importance (Weight)')
plt.title('Top Ten Features by Importance')
plt.show()

I hope this helps you enjoy scikit-learn's feature_importances_!
