Saving on computation is a priority for me: I'm practicing data science on an old machine and don't currently have access to cloud computing resources.
So, what do you do when recursive feature elimination (RFE) runs for 20 minutes while you take a snack break and is still spinning when you return?
The answer is scikit-learn's feature_importances_ attribute.
My Random Forest models are among my best, so this attribute is a really nice way to save on computation and still return a high-performing model.
After fitting, predicting, and running a confusion matrix and a classification report, I turn to feature importances so I can iterate on the model with fewer features, or report the most important features to my superiors.
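That workflow can be sketched roughly like this. This is a minimal illustration on a synthetic dataset; the model name `forest`, the toy data, and the hyperparameters are my assumptions, not taken from the original post:

```python
# A sketch of the fit -> predict -> evaluate -> feature_importances_ workflow.
# The synthetic dataset and the variable name `forest` are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# One importance score per feature; the scores sum to 1
print(forest.feature_importances_)
```

Because the scores are normalized to sum to 1, each one can be read as that feature's share of the model's total predictive weight.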
See the documentation code on scikit-learn's "Feature importances with a forest of trees" example page.
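That example looks roughly like the sketch below (paraphrased; the dataset and variable names here are my own illustrative choices, not the page's exact code). It also shows a nice trick from that page: computing the standard deviation of each importance across the individual trees to get error bars:

```python
# Paraphrase of scikit-learn's "Feature importances with a forest of trees"
# example. The synthetic dataset and names are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(random_state=0).fit(X, y)

# Mean decrease in impurity, plus its standard deviation across the trees
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)

forest_importances = pd.Series(importances, index=feature_names)
print(forest_importances.sort_values(ascending=False))

# Bar chart with error bars, as in the documentation example
ax = forest_importances.plot.bar(yerr=std)
ax.set_ylabel("Mean decrease in impurity")
```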
One of my favorite things to do with this information is to produce a top-ten bar chart that plots the features by importance.
Here is the code I copied or wrote to produce it:
```python
import matplotlib.pyplot as plt
import matplotlib.style as style
import numpy as np
import pandas as pd

# Pair each feature name with its importance score
features = pd.DataFrame({
    'Feature': X_train.columns.values,
    'Feature Importance': forest6.feature_importances_,
})

# Keep the ten most important features, then sort ascending so the
# largest bar lands at the top of the horizontal bar chart
features = features.nlargest(10, 'Feature Importance')
features = features.sort_values('Feature Importance', ascending=True)

# style.available lists all built-in styles
style.use('fivethirtyeight')
plt.figure(figsize=(8, 8))
plt.barh(range(10), features['Feature Importance'], align='center')
plt.yticks(np.arange(10), features['Feature'])
plt.xlabel('Feature Importance (Weight)')
plt.ylabel('Feature')
plt.title('Top Ten Features by Importance')
plt.show()
```
I hope this helps you enjoy scikit-learn's feature_importances_!