Okay, here's the deal: I spent hours writing code, reading other people's code, and pair programming with my instructors while they typed code, all in an effort to learn how to use a powerful tool that I knew existed but didn't yet know how to implement. I'm talking about machine learning pipelines. So what have I learned? Pipelines, in theory, are great, but in practice they are hard to use with Pandas (at least the implementation packaged with Scikit-Learn: `from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline, make_union`). Let's talk about that theory, and later I'll get to some key take-aways. In theory, a pipeline should:
- Take in raw_data (predictable data, but completely unprocessed)
- Pre-process the raw_data
  - Things like `fillna(0)`, dropping rows with missing values, ignoring certain columns, converting date-times, that sort of thing
- Fit certain feature transformations from the training set, and remember them for later. Things like:
  - We specify which categorical columns should be one-hot encoded, and it learns how to build the many new columns and drop the old ones
  - Technically even something like `fillna(column.median())` should go here, because it needs to "learn" what the median is from the training set and NOT just blindly fill with the median of whatever data was passed in
  - Normalization of columns would go here, along with any `log()` transformations, etc.
- Apply those fitted transformations to the raw_data (the training set, the test set, the validation set, or even future data that you haven't collected yet)
  - Important to remember: it should NOT "learn" anything from either the test or validation sets
- Be able to receive input parameters for these transformations, so that we can explore the hyper-parameter space.
- Be able to fit a predictive model on this newly processed data (only if it's the training set; I'll explain how this happens later)
- Use this fitted model to predict outcomes from the processed data, and return some sort of error metric so we can compare the results of various models.
On the other hand, a pipeline should NOT:
- `train_test_split` the raw data (that happens before the pipeline ever sees it)
- Calculate new transformations every time we pass raw_data into it (it should apply what it learned during fitting; see the sketch after this list)
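To make the "fit on the training set, apply everywhere else" idea concrete, here is a minimal sketch using a single transformer (`StandardScaler` is just a stand-in; any learned transformation behaves the same way):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0]])

scaler = StandardScaler()
scaler.fit(X_train)                       # learns mean/std from the TRAINING data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training statistics

# The mistake a pipeline protects you from: calling scaler.fit_transform(X_test),
# which would "re-learn" the statistics from the test data.
```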
You're probably tired of hearing it by now, but . . .
Important note: it's always a good idea to split your data BEFORE you do any heavy modifications to it (like one-hot encoding or `fillna`).
This is because "modifying" the data before a split inherently passes some bit of information into the test data that gets pulled out later, which will bias the results of the model. For example, suppose 15% of our data is `null` and we choose to fill those nulls with the median value for the column, THEN split the data. If the test data had any `null` values at all, it has inadvertently gained information about the training data, even though that's never supposed to happen. My first data science instructor really tried to drive home this point, and I'll borrow from him in saying that this is a "Career Limiting Mistake".
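Here is a small sketch of the right order of operations (the `age` column and its values are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'age': [20, 25, None, 40, None, 60]})
train, test = train_test_split(df, test_size=0.5, random_state=0)

train_median = train['age'].median()        # "learned" from the training rows only
train = train.fillna({'age': train_median})
test = test.fillna({'age': train_median})   # applied to test, never re-learned

# The Career Limiting Mistake: df['age'].fillna(df['age'].median()) BEFORE the
# split, which lets the fill value carry information across the train/test boundary.
```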
Okay, back to pipelines: One could think of a pipeline as a wrapper function that calls a lot of other functions, passing the output of one function in as the input to the next. Here's an example in pseudo-code:
```python
def do_everything_in_order(raw_data):
    fillna()
    transform()
    if columns.any() == categorical:
        one_hot_encode(applicable_columns)
    fit_to_model(X_train_cleaned, y_train_cleaned)
    predict(X_test_cleaned)
    return evaluate(prediction, y_test_cleaned)
```
This isn't quite accurate, but it's a helpful way to think about it. More accurately, the pipeline is a class rather than a function. This means it will need to be instantiated as a variable, and it will carry various attributes (nouns, i.e. descriptions) and methods (verbs, i.e. functions that apply an action to the instance itself, but not to other instances of the same class). I wasn't able to build a fully working pipeline this time, but once you do, they work very similarly to any individual model from sklearn (or probably other libraries / your own custom model). If you are reading this blog, you have probably already done something like this:
```python
from sklearn.linear_model import SomethingInCamelCase  # This is almost certainly a model builder; they exist in other parts of sklearn as well, such as ...
from sklearn.neighbors import KNeighborsClassifier
# Whichever model builder you choose, it will probably behave similarly. I'll call mine 'model'.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_raw_data, y_raw_data)

# Various cleaning & transforming steps. This is often MANY steps, and annoying
# to do repeatedly, but eventually it results in ...
X_train_final = clean_transform(X_train)
y_train_final = clean_transform(y_train)

model = KNeighborsClassifier()  # You may pass various arguments here, but NOT any data itself
model.fit(X_train_final, y_train_final)

X_test_predictions = model.predict(X_test_final)  # Oh, shoot! We haven't cleaned X_test yet; time to copy-paste code and change variable names. I wish there were a better way.
some_eval_metric(X_test_predictions)  # See my note below
```
Special Note: This could be as simple as R², but often you'll want something better. I like the F1-score for classification problems. There are many evaluation metrics to choose from, and you'll want to think carefully about which one best matches the specifics of your business case.
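For example, here is a quick sketch (with toy labels) of how accuracy can look fine on an imbalanced problem while the F1-score tells a different story:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels: the positive class is rare, and the model misses half of it
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]

print(accuracy_score(y_true, y_pred))  # ~0.83, looks healthy
print(f1_score(y_true, y_pred))        # ~0.67, exposes the misses on the rare class
```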
So how would pipelines help us here? If you can successfully construct one, your code changes into something like this:
```python
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_raw_data, y_raw_data)

pipe = make_pipeline(
    cleaning_steps,       # the outputs of one ...
    transforming_steps,   # ... MUST be the correct inputs for the next
    model,
    evaluate_function     # this might need to be specified inside the model on the line above
)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)  # this automatically runs the transform steps on X_test, so you don't need to call them first
```
And that's it! The advantages here are numerous, but perhaps most importantly, it applies D.R.Y. programming in a robust way, and behaves like the other objects we are already familiar with.
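To see it end to end, here is a concrete, runnable version of the same pattern on a toy dataset (the imputer, scaler, and classifier are arbitrary stand-ins, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(
    SimpleImputer(strategy='median'),  # a "cleaning" step
    StandardScaler(),                  # a "transforming" step
    KNeighborsClassifier(),            # the model; its .score() runs at the end
)
pipe.fit(X_train, y_train)         # every step is fit on the training data only
print(pipe.score(X_test, y_test))  # X_test flows through the fitted steps, then gets scored
```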
- Pandas does NOT play nicely with Scikit-Learn's implementation of Pipelines. We can basically only use numpy arrays, which means we CANNOT reference columns by name, only by index. This is bad, and hopefully will be upgraded in the future.
- In the meantime, here is another blog that tries to work around this.
- Perhaps a better solution: there is also a separate library called sklearn-pandas which looks promising (see the sketch after this list), but I only learned about it after I had already started over twice trying to get pipelines to work. There is enough demand for using pipelines with pandas that I'm sure a solution will be worked out eventually; maybe you'll even be inspired to work on the source code and make it polished enough to get merged into Scikit-learn, so we won't need an alternate library.
- There is currently a significant upfront development cost to building pipelines. If you only need to do something once, it may not be worth it; but if you need something repeatable, especially something that scales to unseen data that hasn't even been collected yet (and can then be used for future training), pipelines are amazing.
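For a taste of what sklearn-pandas offers, here is a minimal sketch of its `DataFrameMapper`, adapted from the library's documented usage (the columns here are made up):

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn_pandas import DataFrameMapper

df = pd.DataFrame({'pet': ['cat', 'dog', 'cat'], 'children': [1.0, 2.0, 4.0]})

mapper = DataFrameMapper([
    ('pet', LabelBinarizer()),         # columns referenced BY NAME, not by index
    (['children'], StandardScaler()),  # a list keeps the 2-D shape scalers expect
])
mapper.fit_transform(df)
```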
I first found this data through OpenML, but then ended up going straight to the source: the Bureau of Transportation Statistics. The inspiration data was provided by Albert Bifet & Elena Ikonomovska for the Data Expo competition (2009). Here is the description Elena Ikonomovska gave on her website:
The dataset consists of a large amount of records, containing flight arrival and departure details for all the commercial flights within the USA, from October 1987 to April 2008. This is a large dataset with nearly 120 million records (11.5 GB memory size). The dataset was cleaned and records were sorted according to the arrival/departure date (year, month, and day) and time of flight. Its final size is around 116 million records and 5.76 GB of memory.
We will be using OpenML to access the data, along with `fetch_openml` from sklearn, so that we don't even need to worry about unzipping or finding a folder for the data (it's all handled inside Python). (The specifics from OpenML about this dataset can be found here.) First, these are the imports for the entire project, along with the code to load the data into RAM and read the description.
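As a sketch, fetching by name looks something like this (the exact OpenML name and version for this airlines dataset are my assumptions; check the dataset page for the real identifiers):

```python
from sklearn.datasets import fetch_openml

# 'airlines' / version=1 are placeholders; look up the real name or data_id on OpenML
bunch = fetch_openml('airlines', version=1, as_frame=True)
print(bunch.DESCR)  # the dataset description
df = bunch.frame    # features + target as a single pandas DataFrame
```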
Originally, this dataset wasn't gathered for machine learning, but rather for a "Data Expo". Here is the original challenge:
The aim of the data expo is to provide a graphical summary of important features of the data set. This is intentionally vague in order to allow different entries to focus on different aspects of the data.
Check out the resulting posters here.
Great! Let's do a bit of prep work, then we'll build the pipeline. The purpose here is to input "Raw" data, whatever that means for you, and output an evaluation metric.
```python
import doctest

import numpy as np
import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn.pipeline import (
    Pipeline,
    FeatureUnion,
)
from sklearn.compose import (
    ColumnTransformer,
    make_column_transformer,
)
from sklearn.preprocessing import (
    FunctionTransformer,
    OneHotEncoder,
)

raw_data = pd.read_csv(
    'data/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2019_1.csv'
)


def convert_time_to_hour(minutes_since_midnight):
    """Returns the whole number of hours since midnight, rounded down.

    Examples:
    >>> convert_time_to_hour(33)
    0
    >>> convert_time_to_hour(1353)
    22
    """
    return minutes_since_midnight // 60


doctest.testmod()
```
```python
target_col = 'DepDelayMinutes'
predictor_cols = [
    'DepTime',
    'ArrTime',
    'DayOfWeek',
    'OriginAirportID',
    'DestAirportID',
    'Reporting_Airline',
]
cols = predictor_cols + [target_col]
data = raw_data.loc[:, cols].dropna()

# Once the data becomes a numpy array inside the pipeline, only positional
# indices survive, so we have to look the column names up ahead of time
predictor_col_idxs = [data.columns.get_loc(c) for c in predictor_cols]

select_columns = ColumnTransformer(
    remainder='drop',
    transformers=[
        ('select_columns', 'passthrough', predictor_col_idxs),
    ],
)

preprocess_times = ColumnTransformer(
    remainder='drop',
    transformers=[
        ('convert_time_to_hour', FunctionTransformer(convert_time_to_hour), [0, 1]),
        ('keep_other_columns', 'passthrough', [2, 3, 4, 5]),
    ],
)

pipe = Pipeline([
    ('select columns', select_columns),
    ('preprocess time fields', preprocess_times),
    # ('one-hot encode everything', OneHotEncoder()),  # if we allow this to run, it throws an error
])
pipe.fit(data)
pipe.transform(data)

mapper = DataFrameMapper([
    # Beginning to work with sklearn_pandas, but ran out of time
])
```
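For what it's worth, here is a hedged sketch of one way the failing one-hot step might be made to work: apply `OneHotEncoder` only to the categorical columns, referenced by the positional indices they hold AFTER `preprocess_times` reorders them (exactly the index bookkeeping complained about above):

```python
# Indices [2, 3, 4, 5] assume preprocess_times outputs the two converted time
# columns first, then DayOfWeek, OriginAirportID, DestAirportID, Reporting_Airline
one_hot_categoricals = ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('one_hot', OneHotEncoder(handle_unknown='ignore'), [2, 3, 4, 5]),
    ],
)

pipe = Pipeline([
    ('select columns', select_columns),
    ('preprocess time fields', preprocess_times),
    ('one-hot encode categoricals', one_hot_categoricals),
])
pipe.fit(data)
pipe.transform(data)
```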