Timothy Cummins

Pipelines: Clean Your Notebook

Introduction

Normally when I am creating a machine learning model, I run through a bunch of models to get an idea of how the defaults will perform on my data. Every time I do this, though, I end up creating a mess in my notebook and have to keep scrolling up and down to find out what I named each model and how it performed so I can compare it to its neighbor. So I have been trying to implement an awesome tool from Scikit-Learn called Pipeline to help clean up my mess and create a neater notebook. While this is not necessarily the only use, or even debatably the best use, for this tool, it is something that has come in handy for me, and I would like to share it.

Pipeline?

Before I get too far ahead of myself and show you how I use Pipeline, I should give a little more detail about what it does. According to the Scikit-Learn website, "The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters." What this means is that we can put a preprocessor and a model together in the same pipeline, so that we can quickly assess different models. So let's try it out!
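As a quick sketch of that "cross-validated together" idea (using the same wine dataset we will load below), here is what that can look like. Because the scaler and the model travel together in one pipeline, the scaler is re-fit on each training fold and never sees the validation fold:

from sklearn.datasets import load_wine
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Scaler and model are cross-validated as a single estimator
X, y = load_wine(return_X_y=True)
pipe = Pipeline([('scaler', StandardScaler()),
                 ('model', KNeighborsClassifier())])
print(round(cross_val_score(pipe, X, y, cv=5).mean(), 2))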

Example

To get started we need to import all of our libraries and tools, and since we are trying out several different classifiers, there are quite a few of them.

import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

Now we can load in our data. As I have done in previous posts, I really enjoy using Pandas for its layout and ease of access.

wine = load_wine()
data = pd.DataFrame(data=np.c_[wine['data'], wine['target']],
                    columns=wine['feature_names'] + ['target'])
data.describe()


With all of that beautiful data loaded in, let's do our train-test split, and then we can run our pipeline.

X = data.drop('target', axis=1)
y = data['target']
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)

Setting up a Pipeline is amazingly easy: all you need to do is call the Pipeline tool and pass in a named step for each preprocessing process you want to run, then do the same for the model. Here, since I am running multiple classifiers, I just add them to a list and loop through it, building a pipeline for each one.

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="rbf", C=0.025, probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    GradientBoostingClassifier()
    ]

for c in classifiers:
    # Rebuild the pipeline with each classifier, then fit and score it
    pl = Pipeline([('standard scaler', StandardScaler()), ('models', c)])
    pl.fit(Xtrain, ytrain)
    print(f"{c.__class__.__name__} has a score of {round(pl.score(Xtest, ytest), 2)}")


And there we go: we have all of our models running in the same space for an easy early comparison. Another great use I have found for a pipeline like this is quickly running the same model with a couple of different hyperparameters to see the effect they have on the model. Though once you have found a model you want to use, I would recommend Grid Search for hyperparameter optimization in the long run.
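For what that might look like, here is a minimal sketch of a Grid Search over a pipeline. The parameter grid below is my own invented example for KNeighborsClassifier; the key thing is that parameters are addressed as the step name from the pipeline, a double underscore, then the parameter name (so 'models__' here, reusing the step name and split from above):

from sklearn.model_selection import GridSearchCV

pl = Pipeline([('standard scaler', StandardScaler()),
               ('models', KNeighborsClassifier())])
# Pipeline parameters use the <step name>__<parameter> convention
params = {'models__n_neighbors': [3, 5, 7],
          'models__weights': ['uniform', 'distance']}
grid = GridSearchCV(pl, params, cv=5)
grid.fit(Xtrain, ytrain)
print(grid.best_params_, round(grid.best_score_, 2))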
