No, this isn't an awesome sed hack that trains logistic regression models with regexes. It's how to build machine learning models with scripts rather than notebooks.
Well actually, how to do that is pretty straightforward. How to do it effectively may not be. I'm going to walk through my process and reasoning in this post.
Notebooks are nice! What's wrong with training in those? I could (and at some point probably will) write a huge post about why notebooks are bad for writing software. For now I'm going to try writing something that won't get me flamed on Twitter, so here are two (not orthogonal) reasons:
- Ever try to reproduce a model from someone else's notebook? Unless they've written it well, it's pretty hard.
- Ever try to do a code review on a notebook? It sucks.
Writing your model training as a script enables you to train your model in one contained process. If it's set up correctly, another team member can easily train your model without having to ask you fifty questions about it, something you'll appreciate when that model needs to be retrained while you're on vacation. Moreover, code reviews on scripts are far simpler than on notebooks, and scripts can be unit tested and run in CI/CD pipelines for production-grade ML.
The main idea is this: pass everything the model needs as command line arguments, use command line options for hyperparameters, and at the end save both the prediction results and the serialized model to files. It's actually pretty simple, and once you get used to iterating at the command line you'll begin to appreciate having everything in a self-contained script.
You'll need only two special ingredients: a main function and some library to parse the command line arguments. I typically use Click for managing my command line arguments as it's pretty straightforward to work with. Python's standard library also comes with a module, argparse, that lets you set these things up too but I think it's a little less intuitive personally.
So here's the skeleton:

    import click


    @click.command()
    def main():
        pass  # TODO: Implement.


    if __name__ == "__main__":
        main()

Now obviously there's nothing in there so it won't do anything, but let me explain what's going on. Basically `@click.command()` transforms your main function into a Click command. This enables Click to set up your function with things like a help page for you. The key here is you have to decorate a function. It can't just be a pile of code hanging around, it has to be a pile of code wrapped in a function.
If you don't write a lot of scripts the last part might be unfamiliar. `if __name__ == "__main__": ...` effectively says "if this script is invoked as a Python main process, run the main function. Otherwise it's just a library." So if I do `from model import main` inside another script or the interpreter it won't run, but if I hit `python model.py` or `python -m model` at the command line it will. Without that, those last two commands won't do anything. Not saying I know personally because I forget the `if __name__ == "__main__"` thing a lot or anything.
Alright so now we're ready for some code that actually does stuff.

    import click
    import joblib  # standalone package; sklearn.externals.joblib was removed from scikit-learn
    import pandas as pd
    from xgboost import XGBClassifier


    @click.command()
    @click.argument("training_data", type=str)
    @click.option("--model-file", type=str, default="model.pkl")
    @click.option("--prediction-file", type=str, default="predictions.csv")
    @click.option("--n-estimators", type=int, default=500)
    @click.option("--max-depth", type=int, default=3)
    @click.option("--learning-rate", type=float, default=0.15)
    def main(
        training_data, model_file, prediction_file, n_estimators, max_depth, learning_rate
    ):
        # Load the training data and split the features from the label column.
        training_df = pd.read_csv(training_data)
        X = training_df.drop(columns="target")
        y = training_df["target"]

        model = XGBClassifier(
            max_depth=max_depth, n_estimators=n_estimators, learning_rate=learning_rate
        )
        model.fit(X, y)

        # Save the predictions next to the input rows, then serialize the model.
        predictions = model.predict(X)
        training_df.loc[:, "predictions"] = predictions
        training_df.to_csv(prediction_file, index=False)
        joblib.dump(model, model_file)


    if __name__ == "__main__":
        # A little disconcerting, but click injects the arguments for you.
        main()

Obviously there'd be a lot more in there than just train and dump. Personally I put mlflow tracking in there and lots of logging. I also save out plots in a directory for review when it's done (mlflow lets you log these out too which is pretty neat).
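As a rough illustration of the logging part, here is a minimal sketch that uses only the standard library's `logging` module. The logger name `train_model` and the helpers `configure_logging` and `log_run_parameters` are hypothetical names made up for this example; they are not part of the script above, and the mlflow side is left out entirely.

```python
import logging

# Hypothetical logger name for this example.
logger = logging.getLogger("train_model")


def configure_logging(level=logging.INFO):
    """Send timestamped log lines to stderr; call once at the top of main()."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )


def log_run_parameters(**params):
    """Log the hyperparameters in sorted order and return the formatted line."""
    line = ", ".join(f"{name}={value}" for name, value in sorted(params.items()))
    logger.info("Training with %s", line)
    return line


configure_logging()
log_run_parameters(n_estimators=500, max_depth=3, learning_rate=0.15)
```

Logging the parameters at the start of every run means the log file alone is enough to tell which settings produced which model.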
The point is now you can run the whole pipeline with just this at the terminal.

    python train_model.py training_data.csv --n-estimators 100
    # or ...
    python train_model.py training_data.csv --max-depth 10 --learning-rate 0.2
    # or ...
    python train_model.py --help  # Look, documentation! ish.

You have full control over how the model is built right from the terminal and it's just one button. There's very little setup for other people to pick it up and run, and if you've added `help` arguments the script will literally tell people how to run it, all without them having to even open the code itself.
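To make the help text concrete, here is a small sketch of the same kind of command with `help` strings attached. The wording of the help strings and the docstring is made up for illustration, and the training body is replaced with a placeholder `click.echo` call. `CliRunner` is Click's testing helper; it is used here only so the help page can be printed without a terminal.

```python
import click
from click.testing import CliRunner


@click.command()
@click.argument("training_data", type=str)
@click.option("--n-estimators", type=int, default=500, show_default=True,
              help="Number of boosting rounds.")
@click.option("--max-depth", type=int, default=3, show_default=True,
              help="Maximum depth of each tree.")
def main(training_data, n_estimators, max_depth):
    """Train a model on TRAINING_DATA, a CSV file with a 'target' column."""
    # Placeholder body: the real script would train and save the model here.
    click.echo(
        f"Would train on {training_data} with "
        f"n_estimators={n_estimators}, max_depth={max_depth}"
    )


# Equivalent to running the script with --help at the terminal.
result = CliRunner().invoke(main, ["--help"])
print(result.output)
```

The docstring becomes the description at the top of the help page, and each `help` string shows up next to its option, so the documentation lives right beside the parameter it describes.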
The best part is that there's zero code change to adjust your parameters, which isn't possible in a notebook. In production every code change is a risk, and that's mitigated by abstracting your parameters to what's effectively configuration, which is what they are. Moreover, now with just one button you can run this command easily as part of a larger pipeline (for continuous integration, inside Docker, as a background process, etc.). That's very challenging with notebooks.
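As a sketch of what running it as part of a larger pipeline could look like, the snippet below builds the command line for a few `--max-depth` values and hands each one to `subprocess.run`. It assumes the training script is saved as `train_model.py`; `build_command` and `run_sweep` are hypothetical helpers written for this example, and `dry_run` defaults to `True` so that it only prints the commands instead of training anything.

```python
import subprocess
import sys


def build_command(training_data, script="train_model.py", **options):
    """Turn keyword options into the CLI flags the training script expects."""
    command = [sys.executable, script, training_data]
    for name, value in options.items():
        # max_depth=10 becomes "--max-depth 10", matching the Click options.
        command += ["--" + name.replace("_", "-"), str(value)]
    return command


def run_sweep(training_data, max_depths, dry_run=True):
    """Run (or, in a dry run, just print) one training command per max depth."""
    commands = [
        build_command(
            training_data, max_depth=depth, model_file=f"model_depth_{depth}.pkl"
        )
        for depth in max_depths
    ]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError if a training run fails.
            subprocess.run(command, check=True)
    return commands


run_sweep("training_data.csv", [3, 5, 10])
```

Each run writes to its own model file, so nothing gets overwritten, and the training script itself never changes between runs.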
It takes some adjustment, but setting up your ML model training as a script rather than a notebook keeps almost all of the flexibility you have with notebooks while enabling one-button runs, the ability to run as a headless process, straightforward code reviews and simple version control diffs.