<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Blair Hudson</title>
    <description>The latest articles on DEV Community by Blair Hudson (@blairhudson).</description>
    <link>https://dev.to/blairhudson</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F225232%2F7d54d3ac-58ec-497c-9c4c-83190d71621d.png</url>
      <title>DEV Community: Blair Hudson</title>
      <link>https://dev.to/blairhudson</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/blairhudson"/>
    <language>en</language>
    <item>
      <title>Scaling Jupyter notebooks across the world with AWS and Papermill</title>
      <dc:creator>Blair Hudson</dc:creator>
      <pubDate>Wed, 16 Sep 2020 02:40:06 +0000</pubDate>
      <link>https://dev.to/faethm/scaling-jupyter-notebooks-across-the-world-with-aws-and-papermill-41ic</link>
      <guid>https://dev.to/faethm/scaling-jupyter-notebooks-across-the-world-with-aws-and-papermill-41ic</guid>
      <description>&lt;p&gt;As a data scientist, one of the most exciting things to me about Faethm is that data science is at the heart of our products.&lt;/p&gt;

&lt;p&gt;As the head of our data engineering team, it's my responsibility to ensure our data science can scale to meet the needs of our rapidly growing and global customer base.&lt;/p&gt;

&lt;p&gt;In this article, I'm going to share some of the most interesting parts of our approach to scaling data science products, and a few of the unique challenges that we have to address.&lt;/p&gt;

&lt;h2&gt;Faethm is data science for the evolution of work&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxvc2a027v0g8qlh6jlry.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxvc2a027v0g8qlh6jlry.jpg" alt="Faethm's platform"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before we delve into our approach, it's important to understand a few things about Faethm and what we do.&lt;/p&gt;

&lt;p&gt;Our customers depend on us to understand the future of work, and the impacts that technology and shifts in work patterns have on their most critical asset: their people.&lt;/p&gt;

&lt;p&gt;Our data science team is responsible for designing and building our occupation ontology, breaking down the concept of "work" into roles, tasks, skills and a myriad of dynamic analytical attributes that describe all of these at the most detailed level. Our analytics are derived from a growing suite of proprietary machine learning models.&lt;/p&gt;

&lt;p&gt;Our platform ties it all together to help people leaders, strategy leaders and technology leaders make better decisions about their workforce, with a level of detail and speed to insight that is impossible without Faethm.&lt;/p&gt;

&lt;h2&gt;We use Python and Jupyter notebooks for data science&lt;/h2&gt;

&lt;p&gt;Our data scientists primarily use Python, Jupyter notebooks and the ever-growing range of Python packages for data transformation, analysis and modelling that you would expect to see in any data scientist's toolkit (and perhaps some you wouldn't).&lt;/p&gt;

&lt;p&gt;Luckily, running an interactive Jupyter workbench in the cloud is pretty easy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5j77othm8pvenxwkyllb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5j77othm8pvenxwkyllb.jpg" alt="SageMaker architecture components"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS SageMaker provides the notebook platform for our teams to configure managed compute instances to their requirements and turn them on and off on-demand. Self-service access to variably powerful modelling environments requires managing a few IAM Role policies and some clicks in the AWS Console.&lt;/p&gt;

&lt;p&gt;This means a data scientist can SSO into the AWS Console and get started on their next project with access to whatever S3 data is permitted by their access profile. Results are written back to S3, and notebooks are pushed to the appropriate Git repository.&lt;/p&gt;

&lt;p&gt;How do we turn this into a product so that our data scientists never have to think about running an operational workflow?&lt;/p&gt;

&lt;h2&gt;Engineering data science without re-engineering notebooks&lt;/h2&gt;

&lt;p&gt;One of the core design goals of our approach is to scale without re-engineering data science workflows wherever possible.&lt;/p&gt;

&lt;p&gt;Due to the complexity of our models, it's critical that data scientists have full transparency into how their models are functioning in production. So we don't re-write Jupyter notebooks. We don't even extract the code within them into executable Python scripts. We just execute them, exactly as written, with no changes required.&lt;/p&gt;

&lt;p&gt;We do this with Papermill.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpispi9a3ged8i4ihral6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpispi9a3ged8i4ihral6.jpg" alt="Papermill workflow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Papermill is a Python package for parameterising and executing Jupyter notebooks. As long as a notebook declares its parameters (usually with sensible defaults in the first notebook cell), Papermill can execute the notebook (&lt;code&gt;$NOTEBOOK&lt;/code&gt;) on the command line with a single command. Any parameter (&lt;code&gt;-r&lt;/code&gt; for raw strings, &lt;code&gt;-p&lt;/code&gt; for values parsed into Python types) can be overridden at runtime, and Papermill does this by injecting a new notebook cell that assigns the new parameter values.&lt;/p&gt;

&lt;p&gt;A simple Papermill command line operation looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;papermill
papermill &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$NOTEBOOK&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$OUTPUT_NOTEBOOK&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-r&lt;/span&gt; A_RAW_PARAMETER &lt;span class="s2"&gt;"this is always a Python string"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-p&lt;/span&gt; A_PARAMETER &lt;span class="s2"&gt;"True"&lt;/span&gt; &lt;span class="c"&gt;# this is converted to a Python data type&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since Papermill executes the notebook and not just the code, the cell outputs including print statements, error messages, tables and plots are all rendered in the resulting output notebook (&lt;code&gt;$OUTPUT_NOTEBOOK&lt;/code&gt;). This means that the notebook itself becomes a rich log of exactly what was executed, and serves as a friendly diagnostic tool for data scientists to assess model performance and detect any process anomalies.&lt;/p&gt;

&lt;h2&gt;Reproducible notebook workflows&lt;/h2&gt;

&lt;p&gt;Papermill is great for executing our notebooks, but we need notebooks to be executed outside of the SageMaker instance they were created in. We can achieve this by capturing a few extra artifacts alongside our notebooks.&lt;/p&gt;

&lt;p&gt;Firstly, we store a list of package dependencies in a project's Git repository. This is generated easily in the Jupyter terminal with &lt;code&gt;pip freeze &amp;gt; requirements.txt&lt;/code&gt;, though it is often best hand-crafted to keep dependencies to the essentials.&lt;/p&gt;

&lt;p&gt;Any other dependencies are also stored in the repository. These can include scripts, pickled objects (such as trained models) and common metadata.&lt;/p&gt;

&lt;p&gt;We also capture some metadata in a YAML configuration file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;span class="na"&gt;Notebooks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
 &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;my-notebook.ipynb&lt;/span&gt;
 &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;my-second-notebook.ipynb&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This file lists the notebooks in execution order, so a workflow can be composed of multiple independent notebooks to maintain readability.&lt;/p&gt;

&lt;p&gt;Finally, a simple &lt;code&gt;buildspec.yml&lt;/code&gt; configuration file is included to initiate the build process. This is the standard build definition for AWS CodeBuild, which we use as our build pipeline.&lt;/p&gt;
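&lt;p&gt;A minimal &lt;code&gt;buildspec.yml&lt;/code&gt; for this kind of pipeline might look like the following sketch. The account ID, region and repository names here are placeholders, and the exact steps will vary by project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;version: 0.2

phases:
  pre_build:
    commands:
      # authenticate Docker with ECR (placeholder account and region)
      - aws ecr get-login-password --region ap-southeast-2 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.ap-southeast-2.amazonaws.com
  build:
    commands:
      # pass the notebook execution order from the YAML config into the image
      - docker build --build-arg NOTEBOOKS="$NOTEBOOKS" -t notebook-project .
  post_build:
    commands:
      - docker tag notebook-project 123456789012.dkr.ecr.ap-southeast-2.amazonaws.com/notebook-project:latest
      - docker push 123456789012.dkr.ecr.ap-southeast-2.amazonaws.com/notebook-project:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;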

&lt;p&gt;Changes to notebooks, dependencies and other repository items are managed through a combination of production and non-production Git branches, just like any other software project. Pull Requests provide a process for code promotion between staging and production environments, combining manual code review with a series of automated merge checks to create confidence in code changes.&lt;/p&gt;

&lt;h2&gt;Notebook containers built for production deployment&lt;/h2&gt;

&lt;p&gt;To keep our data science team focused on creating data science workflows and not build pipelines, the container build and deployment process is abstracted from individual Jupyter projects.&lt;/p&gt;

&lt;p&gt;Webhooks are configured on each Git repository. Pushing to a branch in a notebook project triggers the build process. Staging and production branches are protected from bad commits by requiring a Pull Request for all changes.&lt;/p&gt;

&lt;p&gt;A standard &lt;code&gt;Dockerfile&lt;/code&gt; consumes the artifacts stored in the project repository at build-time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;FROM python:3.7

RUN pip &lt;span class="nb"&gt;install &lt;/span&gt;papermill

&lt;span class="c"&gt;# package dependencies&lt;/span&gt;
COPY requirements.txt .
RUN pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# notebook execution order from YAML config&lt;/span&gt;
ARG NOTEBOOKS
ENV &lt;span class="nv"&gt;NOTEBOOKS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;NOTEBOOKS&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# prepare entrypoint script&lt;/span&gt;
COPY entrypoint.sh .

&lt;span class="c"&gt;# catch-all for other dependencies in the repository&lt;/span&gt;
COPY &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# these parameters will be injected at run-time&lt;/span&gt;
ENV &lt;span class="nv"&gt;PARAM1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
ENV &lt;span class="nv"&gt;PARAM2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;

CMD ./entrypoint.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The entrypoint is a bash script that iterates over the notebooks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;NOTEBOOK &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;NOTEBOOKS&lt;/span&gt;&lt;span class="p"&gt;//,/ &lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;do
    &lt;/span&gt;papermill &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$NOTEBOOK&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"s3://notebook-output-bucket/&lt;/span&gt;&lt;span class="nv"&gt;$NOTEBOOK&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;-r&lt;/span&gt; PARAM1 &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PARAM1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;-p&lt;/span&gt; PARAM2 &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PARAM2&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This &lt;code&gt;entrypoint.sh&lt;/code&gt; script executes each notebook listed in the configuration file at run-time, and stores the resulting notebook output in S3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Furtby9k07opsvna55wj6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Furtby9k07opsvna55wj6.jpg" alt="Repository build components"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS CodeBuild determines the target environment from the repository branch, builds the container and pushes it to AWS ECR so it is available to be deployed into our container infrastructure.&lt;/p&gt;

&lt;h2&gt;Serverless task execution for Jupyter notebooks&lt;/h2&gt;

&lt;p&gt;With Faethm's customers spanning many different regions across the world, the data is subject to the data regulations of each customer's local jurisdiction. Our data science workflows need to be able to execute in the regions which our customers specify for their data to be stored. With our approach, data does not have to transfer between regions for processing.&lt;/p&gt;

&lt;p&gt;We operate cloud environments in a growing number of customer regions across the world, throughout the Asia Pacific, US and Europe. As Faethm continues to scale, we need to be able to support new regions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7o6klwueohq0ox1h6fd8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7o6klwueohq0ox1h6fd8.jpg" alt="Multi-region Fargate components"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To run our Jupyter notebook containers, each supported region has a VPC with an ECS Fargate cluster configured to run notebook tasks on-demand.&lt;/p&gt;

&lt;p&gt;Each Jupyter project is associated with an ECS task definition, whose template is configured by the build pipeline and deployed through CloudFormation.&lt;/p&gt;
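&lt;p&gt;As a rough sketch, the CloudFormation template for such a task definition might contain a resource along these lines (the family, sizes and role references are illustrative only):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;NotebookTaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: notebook-project
    RequiresCompatibilities:
      - FARGATE
    NetworkMode: awsvpc
    Cpu: "1024"
    Memory: "4096"
    ExecutionRoleArn: !Ref TaskExecutionRole
    TaskRoleArn: !Ref TaskRole
    ContainerDefinitions:
      - Name: notebook
        Image: !Sub "${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/notebook-project:latest"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;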

&lt;h2&gt;Event-driven Jupyter notebook tasks&lt;/h2&gt;

&lt;p&gt;To simplify task execution, each notebook repository has a single event trigger. Typically, a notebook task will run in response to a data object landing in S3, such as a CSV uploaded from a user portal, which then triggers our analysis.&lt;/p&gt;

&lt;p&gt;In the project repository, the YAML configuration file captures the S3 bucket and prefix that will trigger the ECS task definition when a matching CloudTrail event is delivered to EventBridge:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;span class="na"&gt;S3TriggerBucket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;notebook-trigger-bucket&lt;/span&gt;
&lt;span class="na"&gt;S3TriggerKeyPrefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;path/to/data/&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9um1mgu5xusbfqldurip.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9um1mgu5xusbfqldurip.jpg" alt="EventBridge components"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The EventBridge rule template is configured by the build pipeline and deployed through CloudFormation, and this completes the basic requirements for automating Jupyter notebook execution.&lt;/p&gt;
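&lt;p&gt;For reference, the rule's event pattern (here in CloudFormation YAML form, and an illustrative sketch rather than our exact template) matches CloudTrail-logged &lt;code&gt;PutObject&lt;/code&gt; calls against the configured bucket and prefix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;EventPattern:
  source:
    - aws.s3
  detail-type:
    - AWS API Call via CloudTrail
  detail:
    eventSource:
      - s3.amazonaws.com
    eventName:
      - PutObject
    requestParameters:
      bucketName:
        - notebook-trigger-bucket
      key:
        - prefix: path/to/data/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;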

&lt;h2&gt;Putting it all together&lt;/h2&gt;

&lt;p&gt;In this article we've looked at a few of the challenges to scaling and automating data science workflows in a multi-region environment. We've also looked at how to address them within the Jupyter ecosystem and how we are implementing solutions that take advantage of various AWS serverless offerings.&lt;/p&gt;

&lt;p&gt;When you put all of these together, the result is our &lt;em&gt;end-to-end serverless git-ops containerised event-driven Jupyter-notebooks-as-code data science workflow execution pipeline&lt;/em&gt; architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnvj0lq2cvd0c5apytb3b.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnvj0lq2cvd0c5apytb3b.jpg" alt="Notebook automation architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We just call it &lt;code&gt;notebook-pipeline&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;You’ve been reading a post from the Faethm AI engineering blog. We’re hiring, too! If you share our passion for the future of work and want to pioneer world-leading data science and engineering projects, we’d love to hear from you. See our current openings: &lt;a href="https://faethm.ai/careers" rel="noopener noreferrer"&gt;https://faethm.ai/careers&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>datascience</category>
      <category>python</category>
      <category>docker</category>
    </item>
    <item>
      <title>Building a super fast serverless container deployment pipeline on Google Cloud</title>
      <dc:creator>Blair Hudson</dc:creator>
      <pubDate>Wed, 06 Nov 2019 03:10:11 +0000</pubDate>
      <link>https://dev.to/shirtctl/building-a-super-fast-serverless-container-deployment-pipeline-on-google-cloud-251o</link>
      <guid>https://dev.to/shirtctl/building-a-super-fast-serverless-container-deployment-pipeline-on-google-cloud-251o</guid>
      <description>&lt;p&gt;One of our driving principles for &lt;code&gt;shirtctl&lt;/code&gt; is #frugalbydesign - we simply don’t want to be paying for anything that we don’t use.&lt;/p&gt;

&lt;p&gt;Our architecture needs to balance cost alongside other core capabilities like application security 🔒, design flexibility 💪 and developer collaboration 👩‍💻👨‍💻.&lt;/p&gt;

&lt;p&gt;In this post, we’ll be sharing some of the details of our continuous deployment pipeline. We’ve combined BitBucket with Google's Cloud Build service, which deploys our applications onto Cloud Run in an average of 1-2 minutes per build!&lt;/p&gt;

&lt;p&gt;For development, we’ve also created a local build workflow to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Speed up local code iteration 🏎💨&lt;/li&gt;
&lt;li&gt;Minimise the number of Cloud Build jobs and Cloud Run revisions (#frugalbydesign) ☁️&lt;/li&gt;
&lt;li&gt;Keep our commit log tidy! 🧹&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s a high level view of our approach:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1DprbrVC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/kx1pod5sz2e8pj8bes2h.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1DprbrVC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/kx1pod5sz2e8pj8bes2h.jpeg" alt="shirtctl-ci-pipeline"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let’s take a closer look at some of the major components. 🔎 &lt;/p&gt;

&lt;h2&gt;Speedy local builds&lt;/h2&gt;

&lt;p&gt;Our MVP sign-ups API is a Python Flask app. It relies on a handful of Python packages that provide the REST framework, email, storage and other capabilities. Right now it’s a simple &lt;code&gt;api.py&lt;/code&gt; file and a &lt;code&gt;requirements.txt&lt;/code&gt; that captures our package dependencies.&lt;/p&gt;

&lt;p&gt;Our &lt;code&gt;Dockerfile&lt;/code&gt; for local and cloud deployment is purposefully identical, so we can focus on API development.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:slim&lt;/span&gt;

&lt;span class="c"&gt;# install python dependencies&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv /app/env
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;/app/env/bin/pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# configure port (Cloud Run requires 8080)&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PORT=8080&lt;/span&gt;
&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; $PORT&lt;/span&gt;

&lt;span class="c"&gt;# setup application runtime&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app/src&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; GOOGLE_APPLICATION_CREDENTIALS="/app/sa-key.json”&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; entrypoint.sh .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x entrypoint.sh

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; api.py .&lt;/span&gt;

&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["sh", "-c", "./entrypoint.sh"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have a &lt;code&gt;localbuild.sh&lt;/code&gt; script that emulates Cloud Run deployment locally using Docker, which means we can iterate our development tasks very quickly without having to redeploy to Cloud Run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nv"&gt;REPO&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; .git &lt;span class="si"&gt;$(&lt;/span&gt;git config &lt;span class="nt"&gt;--get&lt;/span&gt; remote.origin.url&lt;span class="si"&gt;))&lt;/span&gt;
&lt;span class="nv"&gt;BRANCH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;git rev-parse &lt;span class="nt"&gt;--abbrev-ref&lt;/span&gt; HEAD&lt;span class="si"&gt;)&lt;/span&gt;

gcloud iam service-accounts keys create sa-key.json &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--iam-account&lt;/span&gt; service-account@project.iam.gserviceaccount.com
&lt;span class="nv"&gt;SA_KEY_FILE_BASE64&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;sa-key.json | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

docker build &lt;span class="nt"&gt;-t&lt;/span&gt; shirtctl-&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REPO&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;-&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;BRANCH&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;:latest &lt;span class="nb"&gt;.&lt;/span&gt;

docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;K_SERVICE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;localbuild &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;-e&lt;/span&gt; SA_KEY_FILE_BASE64 &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;:/app/src &lt;span class="se"&gt;\&lt;/span&gt;
 shirtctl-&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REPO&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;-&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;BRANCH&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can “hot reload” 🔥 our changes to develop even faster! &lt;code&gt;entrypoint.sh&lt;/code&gt; determines at run time whether to run Flask or Gunicorn depending on the value of &lt;code&gt;$K_SERVICE&lt;/code&gt;. This way our Flask service restarts automatically when changes to the source code are detected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$SA_KEY_FILE_BASE64&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$GOOGLE_APPLICATION_CREDENTIALS&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$K_SERVICE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"localbuild"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;FLASK_APP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"api.py"&lt;/span&gt;
    &lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;FLASK_DEBUG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
    /app/env/bin/flask run &lt;span class="nt"&gt;--host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$PORT&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;
    /app/env/bin/gunicorn &lt;span class="nt"&gt;--bind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0.0.0:&lt;span class="nv"&gt;$PORT&lt;/span&gt; api:app
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;BitBucket to Cloud Source Repository&lt;/h2&gt;

&lt;p&gt;Code is committed and pushed to a private BitBucket repo. Our branching structure is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⚙️ &lt;code&gt;dev&lt;/code&gt; for feature-based development (we can have as many of these as required!)&lt;/li&gt;
&lt;li&gt;✅ &lt;code&gt;test&lt;/code&gt; where all feature dev branches are merged to (by pull request only)&lt;/li&gt;
&lt;li&gt;🚀 &lt;code&gt;prod&lt;/code&gt; where &lt;code&gt;test&lt;/code&gt; is released to (also by pull request only, with dual approval required)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The BitBucket repo is &lt;a href="https://cloud.google.com/source-repositories/docs/mirroring-a-bitbucket-repository"&gt;automatically synced to a Cloud Source Repository&lt;/a&gt; of the same name and branch structure.&lt;/p&gt;

&lt;h2&gt;Deploying with Cloud Build&lt;/h2&gt;

&lt;p&gt;Cloud Build allows &lt;a href="https://cloud.google.com/cloud-build/docs/running-builds/automate-builds"&gt;a build job to trigger on a push&lt;/a&gt; to our repo. This submits the &lt;code&gt;cloudbuild.yaml&lt;/code&gt; file from our repo to Cloud Build, which accomplishes the following steps for the current branch:&lt;/p&gt;

&lt;p&gt;Pulls the previous Docker image from Google Container Registry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker pull gcr.io/&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;/&lt;span class="nv"&gt;$REPO_NAME&lt;/span&gt;-&lt;span class="nv"&gt;$BRANCH_NAME&lt;/span&gt;:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Builds and tags a new Docker image from our &lt;code&gt;Dockerfile&lt;/code&gt; above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--cache-from&lt;/span&gt; gcr.io/&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;/&lt;span class="nv"&gt;$REPO_NAME&lt;/span&gt;-&lt;span class="nv"&gt;$BRANCH_NAME&lt;/span&gt;:latest &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;-t&lt;/span&gt; gcr.io/&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;/&lt;span class="nv"&gt;$REPO_NAME&lt;/span&gt;-&lt;span class="nv"&gt;$BRANCH_NAME&lt;/span&gt;:&lt;span class="nv"&gt;$SHORT_SHA&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;-t&lt;/span&gt; gcr.io/&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;/&lt;span class="nv"&gt;$REPO_NAME&lt;/span&gt;-&lt;span class="nv"&gt;$BRANCH_NAME&lt;/span&gt;:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pushes the latest image to Google Container Registry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker push gcr.io/&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;/&lt;span class="nv"&gt;$REPO_NAME&lt;/span&gt;-&lt;span class="nv"&gt;$BRANCH_NAME&lt;/span&gt;:&lt;span class="nv"&gt;$SHORT_SHA&lt;/span&gt;
docker push gcr.io/&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;/&lt;span class="nv"&gt;$REPO_NAME&lt;/span&gt;-&lt;span class="nv"&gt;$BRANCH_NAME&lt;/span&gt;:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploys the latest image to Cloud Run, and maps the appropriate domains to access the service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud beta run deploy &lt;span class="nv"&gt;$REPO_NAME&lt;/span&gt;-&lt;span class="nv"&gt;$BRANCH_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
         &lt;span class="nt"&gt;--image&lt;/span&gt; gcr.io/&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;/&lt;span class="nv"&gt;$REPO_NAME&lt;/span&gt;-&lt;span class="nv"&gt;$BRANCH_NAME&lt;/span&gt;:&lt;span class="nv"&gt;$SHORT_SHA&lt;/span&gt;
gcloud beta run domain-mappings create &lt;span class="se"&gt;\&lt;/span&gt;
         &lt;span class="nt"&gt;--service&lt;/span&gt; &lt;span class="nv"&gt;$REPO_NAME&lt;/span&gt;-&lt;span class="nv"&gt;$BRANCH_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
         &lt;span class="nt"&gt;--domain&lt;/span&gt; &lt;span class="nv"&gt;$BRANCH_NAME&lt;/span&gt;.&lt;span class="nv"&gt;$REPO_NAME&lt;/span&gt;.shirtctl.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's all for now! Keep an eye on &lt;a href="https://shirtctl.com"&gt;shirtctl.com&lt;/a&gt; for our MVP sign-ups launch! 👕👚&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>devops</category>
      <category>python</category>
      <category>git</category>
    </item>
    <item>
      <title>shirtctl --blog: a series on building a tech tee startup</title>
      <dc:creator>Blair Hudson</dc:creator>
      <pubDate>Thu, 31 Oct 2019 08:53:00 +0000</pubDate>
      <link>https://dev.to/shirtctl/all-your-shirt-are-belong-to-us-55ji</link>
      <guid>https://dev.to/shirtctl/all-your-shirt-are-belong-to-us-55ji</guid>
      <description>&lt;p&gt;Hey there and welcome to &lt;code&gt;shirtctl --blog&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is the very beginning of the official blog of &lt;code&gt;shirtctl&lt;/code&gt;! Pronounced “&lt;em&gt;shirt control&lt;/em&gt;”, we’re bringing continuous delivery to tech tees. 👕👚&lt;/p&gt;

&lt;h3&gt;
  
  
  Ok, what in the world are you talking about?
&lt;/h3&gt;

&lt;p&gt;In the world of DevOps, continuous delivery is an approach to building and shipping great software to users at any time, with a focus on reliability. 👨‍💻👩‍💻📦🚢&lt;/p&gt;

&lt;p&gt;In the world of tech merch, that means creating and shipping cool t-shirts reliably to fans at any time. 😎&lt;/p&gt;

&lt;h3&gt;
  
  
  So what is this blog about then?
&lt;/h3&gt;

&lt;p&gt;We’re building &lt;code&gt;shirtctl&lt;/code&gt; out in the open! &lt;/p&gt;

&lt;p&gt;In this blog we’ll be publishing a series of short posts around all of the product, user, architecture, engineering and design challenges we have. We'll be detailing the options we explore and how we make key decisions, all to show you the steps we take building &lt;code&gt;shirtctl&lt;/code&gt; from scratch as a cloud-native data-driven DevSecBizFinOps startup! 🚀&lt;/p&gt;

&lt;p&gt;And you can ask us anything along the way! (Just leave a comment.)&lt;/p&gt;

&lt;h3&gt;
  
  
  Hmm... who are you anyway?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;shirtctl&lt;/code&gt; began in the warmth of down-under October 2019 by Sydneysiders &lt;a href="https://www.linkedin.com/in/blairhudson/"&gt;Blair Hudson&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/anthonyjwales/"&gt;Anthony Wales&lt;/a&gt;. Initially one of those &lt;em&gt;this-is-so-crazy-it-just-might-work ideas&lt;/em&gt;, we’re combining our experience across Australia’s technology sector to sprint 🏃‍♂️, hack 💻 and ship 📦 our way to tech tee haven. How exciting!&lt;/p&gt;

&lt;h3&gt;
  
  
  Wait, I’m still confused. What is your product?
&lt;/h3&gt;

&lt;p&gt;How would you like free tech tees delivered to your office every* month? &lt;/p&gt;

&lt;p&gt;🤩 Sounds good? We think so too!&lt;/p&gt;

&lt;p&gt;(*Assuming we can find something in your size. We think we can!)&lt;/p&gt;

&lt;h3&gt;
  
  
  You said free?
&lt;/h3&gt;

&lt;p&gt;We’re making it super easy for technology companies to build their brand and connect with the community by harnessing the awesome power of t-shirts. ✨ They provide the goods and cover shipping. You provide your size and office address. &lt;code&gt;shirtctl&lt;/code&gt; makes it all happen. Simple!&lt;/p&gt;

&lt;h3&gt;
  
  
  I love it! Where do I sign up?
&lt;/h3&gt;

&lt;p&gt;While we haven’t launched yet, we plan to start sign-ups for developers working in Sydney very soon. 🥳&lt;/p&gt;

&lt;p&gt;In true MVP style, devs will be able to sign up using our awesome API and the programming language of their choice! Follow our blog (&lt;a href="https://dev.to/shirtctl"&gt;dev.to/shirtctl&lt;/a&gt;) and keep an eye on &lt;a href="https://shirtctl.com"&gt;shirtctl.com&lt;/a&gt; for docs to get started. &lt;/p&gt;

&lt;p&gt;Once we prioritise it (#agile), we’ll be building out a sign-up form for everyone else to join in on the free tee fun too (including the lazy devs)! 👫👬👭&lt;/p&gt;

&lt;h3&gt;
  
  
  I want to build my brand, how do I make tees available?
&lt;/h3&gt;

&lt;p&gt;Watch this space. &lt;code&gt;shirtctl&lt;/code&gt; is working with a small number of launch partners in Sydney to create a fantastic SX (we coined it, &lt;em&gt;shirt experience&lt;/em&gt; is the next big thing). Then we'll open up to all!&lt;/p&gt;

&lt;p&gt;If you’re really really interested to be involved early, reach out to us in the comments or using the links to our LinkedIn profiles above (since we haven’t prioritised building a contact form yet...).&lt;/p&gt;

&lt;p&gt;👕👚&lt;/p&gt;

</description>
      <category>devops</category>
      <category>design</category>
      <category>microservices</category>
      <category>startup</category>
    </item>
    <item>
      <title>Machine Learning microservices: Python and XGBoost in a tiny 486kB container</title>
      <dc:creator>Blair Hudson</dc:creator>
      <pubDate>Thu, 03 Oct 2019 11:40:55 +0000</pubDate>
      <link>https://dev.to/blairhudson/machine-learning-microservices-python-and-xgboost-in-a-tiny-486kb-container-4on4</link>
      <guid>https://dev.to/blairhudson/machine-learning-microservices-python-and-xgboost-in-a-tiny-486kb-container-4on4</guid>
      <description>&lt;p&gt;In my last post, we looked at &lt;a href="https://dev.to/blairhudson/containers-for-machine-learning-from-scratch-to-kubernetes-2khj"&gt;how to use containers for machine learning from scratch&lt;/a&gt; and covered the complexities of configuring a Python environment suitable to train a model with the powerful (and understandably popular) combination of the Jupyter, Scikit-Learn and XGBoost packages.&lt;/p&gt;

&lt;p&gt;We worked through the complexities of setting up this environment, and then how to use containers to make it easily reproducible and portable. We also looked at how to build and run that environment at scale on Docker Swarm and Kubernetes.&lt;/p&gt;

&lt;p&gt;That article intended to introduce containers to data scientists, and demonstrate how machine learning can fit into the world of containers for those already familiar. If this sounds useful to you, you should definitely &lt;a href="https://dev.to/blairhudson/containers-for-machine-learning-from-scratch-to-kubernetes-2khj"&gt;check it out first&lt;/a&gt; and then come back right here 👇&lt;/p&gt;

&lt;p&gt;In the opening section, I joked that the title of the article (&lt;em&gt;...from scratch to Kubernetes...&lt;/em&gt;) was not a reference to the &lt;code&gt;FROM scratch&lt;/code&gt; command that you might find in Dockerfiles that choose to forgo a base image such as the &lt;code&gt;centos:7&lt;/code&gt; we used to build our Jupyter environment.&lt;/p&gt;

&lt;p&gt;Well, in this follow-on article, we're going to explore why you would actually build a machine learning container using &lt;code&gt;scratch&lt;/code&gt;, and a method for doing so that can avoid re-engineering an entire data science workflow from Python into another language.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is &lt;code&gt;scratch&lt;/code&gt;? Don't I need an operating system?
&lt;/h2&gt;

&lt;p&gt;In Docker, the &lt;code&gt;scratch&lt;/code&gt; image is actually a reserved keyword that literally means "nothing". Normally, you would specify in your Dockerfile a base image from which to build upon. This might be an official base image (such as &lt;code&gt;centos:7&lt;/code&gt;) representing an operating system that includes a package manager and a bunch of tools that will be helpful for you to build your application into a container. This might also be another container you've built previously, where you want to add new layers of functionality such as new packages or scripts for specific tasks.&lt;/p&gt;

&lt;p&gt;When you build a container on the &lt;code&gt;scratch&lt;/code&gt; base, it starts with a total size of 0kB, and only grows as you &lt;code&gt;ADD&lt;/code&gt; or &lt;code&gt;COPY&lt;/code&gt; files into your container and manipulate them from there throughout the build process.&lt;/p&gt;

&lt;p&gt;Why is this good?&lt;/p&gt;

&lt;p&gt;Creating containers that are as small as possible is a challenging practice with many benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smaller images build quicker, transmit faster through a network (no more long wait time for &lt;code&gt;docker push&lt;/code&gt; and &lt;code&gt;docker pull&lt;/code&gt;), take up less space on disk and require less memory&lt;/li&gt;
&lt;li&gt;Smaller images have a reduced attack surface (which means would-be attackers have fewer options for exploiting or compromising your application)&lt;/li&gt;
&lt;li&gt;Smaller images have fewer components to upgrade, patch and secure (which means less work is required to maintain them over time!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course there are tradeoffs.&lt;/p&gt;

&lt;p&gt;Creating containers to be as small as possible often sacrifices tooling that can help with debugging, which means you'll need to consider your approach for this by the time you reach production. It also limits reusability, which means you might end up with many more containers each with highly specialised functionality.&lt;/p&gt;

&lt;p&gt;It turns out that there are many ways to reduce the size of a container before resorting to &lt;code&gt;scratch&lt;/code&gt;. We won't go into these in any more detail in this article, but the techniques include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;switching to a different base image like &lt;code&gt;alpine&lt;/code&gt;, a Linux distribution commonly used with containers due to its small size (run &lt;code&gt;docker pull centos:7&lt;/code&gt; , &lt;code&gt;docker pull alpine&lt;/code&gt;, and then &lt;code&gt;docker images&lt;/code&gt; to find &lt;code&gt;alpine&lt;/code&gt; is a conservative &lt;code&gt;5.58MB&lt;/code&gt; compared to the &lt;code&gt;202MB&lt;/code&gt; size of &lt;code&gt;centos:7&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FStHUzz1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FStHUzz1.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;minimising packages and other dependencies to only install what you need for running your application (in the Python world, this means checking every line of your &lt;code&gt;requirements.txt&lt;/code&gt; file)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;clearing caches and other build artefacts that are not required after install&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
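&lt;p&gt;As an illustrative sketch only (not the Dockerfile we build below), those techniques might combine like this, assuming a hypothetical &lt;code&gt;requirements.txt&lt;/code&gt;:&lt;/p&gt;

```dockerfile
# Illustrative only: a slimmer Python base using the techniques above
FROM alpine

# Install only the runtime packages we need, without keeping the apk cache
RUN apk add --no-cache python3 py3-pip

# --no-cache-dir stops pip's download cache from landing in an image layer
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
```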

&lt;p&gt;We could also implement our own machine learning algorithm entirely in a language that we can execute with minimal dependencies, but that would make it much harder to build, maintain and collaborate with others on.&lt;/p&gt;

&lt;h2&gt;
  
  
  What about existing data science workflows?
&lt;/h2&gt;

&lt;p&gt;Our aim is to create a workflow that allows us to keep using our favourite Python tools to train our model, so let's build a Docker image to do just that.&lt;/p&gt;

&lt;p&gt;Create a suitable directory and add the following to a new file called &lt;code&gt;Dockerfile&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;centos:7&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;jupyter&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;yum &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; epel-release &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    yum &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; python36-devel python36-pip libgomp
&lt;span class="k"&gt;RUN &lt;/span&gt;pip3 &lt;span class="nb"&gt;install &lt;/span&gt;jupyterlab scikit-learn xgboost

&lt;span class="k"&gt;RUN &lt;/span&gt;adduser jupyter
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; jupyter&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /home/jupyter&lt;/span&gt;

&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8888&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can build the container with the following command:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;docker build -t devto-jupyter --target jupyter .&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--target&lt;/code&gt; allows us to build to a specific &lt;code&gt;FROM&lt;/code&gt; step in a &lt;em&gt;multi-stage&lt;/em&gt; Dockerfile (more on this in a bit)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Run the container and bring up your Jupyter instance by browsing to the &lt;code&gt;localhost&lt;/code&gt; address output in the console:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;docker run -it --rm -p "8888:8888"  -v "$(pwd):/home/jupyter" devto-jupyter&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Create a new Jupyter notebook called &lt;code&gt;iris_classifier.ipynb&lt;/code&gt; and within it the following three cells:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;

&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_iris&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;return_X_y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;xgboost&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;

&lt;span class="n"&gt;train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DMatrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;objective&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;multi:softmax&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;num_class&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_boost_round&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;iris.model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In order, these three cells load the dataset our example is based on (the Iris flower dataset), train an XGBoost classifier, and finally save the trained model as a file called &lt;code&gt;iris.model&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;After running each cell, the directory where you executed &lt;code&gt;docker run ...&lt;/code&gt; above should now contain your notebook file and the trained model file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing Multi-stage builds
&lt;/h2&gt;

&lt;p&gt;As we were building our Dockerfile above, we specifically targeted the first &lt;code&gt;FROM&lt;/code&gt; section called &lt;code&gt;jupyter&lt;/code&gt; by using the &lt;code&gt;--target&lt;/code&gt; option in our &lt;code&gt;docker build&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;It turns out that we can have multiple &lt;code&gt;FROM&lt;/code&gt; sections in a single Dockerfile, and combine them to copy our build artefacts from earlier steps in the process to later steps.&lt;/p&gt;

&lt;p&gt;It's quite common when using containers to build microservices with other languages, such as Go, to follow a multi-stage build where the final step copies only the compiled binaries and any dependencies required for execution into an otherwise empty &lt;code&gt;scratch&lt;/code&gt; container.&lt;/p&gt;

&lt;p&gt;Since the build tools for this type of workflow are quite mature in Go, we are going to find a way to apply the same approach to our Python data science process. The catch is that Python is an interpreted language, which makes it difficult to create small application distributions, as they need to bundle the Python interpreter and the full contents of any package dependencies.&lt;/p&gt;

&lt;p&gt;The next step in our Dockerfile simply looks for the notebook we created above, and executes it in place to output the trained model. Go ahead and add this to the bottom of &lt;code&gt;Dockerfile&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;jupyter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;trainer&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --chown=jupyter:jupyter ./iris_classifier.ipynb .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;jupyter nbconvert &lt;span class="nt"&gt;--to&lt;/span&gt; noteook &lt;span class="nt"&gt;--inplace&lt;/span&gt; &lt;span class="nt"&gt;--execute&lt;/span&gt; iris_classifier.ipynb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Predictions with an XGBoost model in Go
&lt;/h2&gt;

&lt;p&gt;It turns out there is an existing &lt;a href="https://github.com/dmitryikh/leaves" rel="noopener noreferrer"&gt;pure Go implementation&lt;/a&gt; of the XGBoost prediction function in a package called Leaves, and the documentation includes some &lt;a href="https://godoc.org/github.com/dmitryikh/leaves" rel="noopener noreferrer"&gt;helpful examples&lt;/a&gt; of how to get started.&lt;/p&gt;

&lt;p&gt;For this article, we're just looking to load up our trained model from the previous step and run a single prediction. We'll take the features as command line arguments so we can run the container with a simple &lt;code&gt;docker run&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;Create a file in the same directory as your Dockerfile and call it &lt;code&gt;iris_classifier_predict.go&lt;/code&gt;, with the contents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"os"&lt;/span&gt;
    &lt;span class="s"&gt;"strconv"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/dmitryikh/leaves"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// Based on: https://godoc.org/github.com/dmitryikh/leaves&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="c"&gt;// load model&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;leaves&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;XGEnsembleFromFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/go/bin/iris.model"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nb"&gt;panic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// preallocate slice to store model prediction&lt;/span&gt;
    &lt;span class="n"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NOutputGroups&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c"&gt;// get inputs as floats&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;arg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;strconv&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ParseFloat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// make predction&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%v&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we need to create a third step in our multi-stage build to compile our microservice so it's ready for prediction. Add this to the bottom of &lt;code&gt;Dockerfile&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;golang:alpine&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;apk update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apk add &lt;span class="nt"&gt;--no-cache&lt;/span&gt; git upx
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; $GOPATH/src/xgbscratch/iris/&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; ./iris_classifier_predict.go .&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;go get &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nv"&gt;GOOS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;linux &lt;span class="nv"&gt;GOARCH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;amd64 go build &lt;span class="nt"&gt;-ldflags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"-w -s"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; /go/bin/iris

&lt;span class="c"&gt;# https://blog.filippo.io/shrink-your-go-binaries-with-this-one-weird-trick/&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;upx &lt;span class="nt"&gt;--brute&lt;/span&gt; /go/bin/iris
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These steps start with a ready-made Go build environment, install Git (to fetch Leaves from GitHub) and upx (the Ultimate Packer for eXecutables), copy in our microservice source from above, build it with a series of flags that essentially mean "bundle everything needed to run standalone", and then compress the resulting binary.&lt;/p&gt;

&lt;p&gt;(For the purposes of this article, upx compression helps us achieve a roughly 60% reduction in our final image footprint. In a future post we'll look at performance benchmarks of these various techniques and the tradeoffs with size, especially around the compression step.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Building our tiny final container and generating predictions
&lt;/h2&gt;

&lt;p&gt;The last step of our Dockerfile needs to take the trained model file &lt;code&gt;iris.model&lt;/code&gt; from the second step and the compiled Go binary from the third step, and run the binary.&lt;/p&gt;

&lt;p&gt;You can add this to the bottom of &lt;code&gt;Dockerfile&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; scratch&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /go/bin/iris /go/bin/iris&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=trainer /home/jupyter/iris.model /go/bin/&lt;/span&gt;

&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["/go/bin/iris"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Build the final container with the following command:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;docker build -t devto-iris .&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run &lt;code&gt;docker images&lt;/code&gt; and you'll find the final image to be around a tiny 486kB!&lt;/p&gt;

&lt;p&gt;Compared to our original training image based on &lt;code&gt;centos:7&lt;/code&gt;, which weighed in at a hefty 1.24GB, we've achieved a size reduction of &lt;strong&gt;99.96%&lt;/strong&gt;, making the final image over &lt;strong&gt;2,500 times&lt;/strong&gt; smaller.&lt;/p&gt;
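&lt;p&gt;That claim is easy to sanity-check with a little arithmetic:&lt;/p&gt;

```python
# Rough size comparison, using decimal units as reported by docker images
original_bytes = 1.24e9  # centos:7-based training image, ~1.24GB
final_bytes = 486e3      # final scratch-based image, ~486kB

ratio = original_bytes / final_bytes
reduction_pct = (1 - final_bytes / original_bytes) * 100

print(round(ratio))             # ~2551, i.e. over 2,500 times smaller
print(round(reduction_pct, 2))  # ~99.96 (percent reduction)
```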

&lt;p&gt;How about actually making some predictions?&lt;/p&gt;

&lt;p&gt;Since our Go binary accepts feature inputs as command line arguments, we can generate individual predictions using &lt;code&gt;docker run&lt;/code&gt; with the following command:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;docker run -it --rm devto-iris 1 2 3 4&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;1 2 3 4&lt;/code&gt; can be replaced with the feature inputs for our model, from which predictions are generated. With this example, the output should be similar to &lt;code&gt;[-0.43101535737514496 0.39559850541076447 0.933891354361549]&lt;/code&gt;, which are the raw scores for each of the three classes (the highest indicates the predicted label)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
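&lt;p&gt;To turn that output into a predicted label, take the index of the largest score; a softmax recovers normalised probabilities if you need them. A minimal sketch in plain Python, using the example output above:&lt;/p&gt;

```python
import math

# Example scores from the container output above, one per Iris class
scores = [-0.43101535737514496, 0.39559850541076447, 0.933891354361549]

# The predicted label is the class with the highest score
predicted_class = max(range(len(scores)), key=lambda i: scores[i])

# A softmax turns the raw scores into probabilities that sum to 1
exps = [math.exp(s) for s in scores]
probs = [e / sum(exps) for e in exps]

print(predicted_class)  # 2, i.e. the third Iris species
```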

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FkGtISIL.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FkGtISIL.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What does this mean?
&lt;/h2&gt;

&lt;p&gt;In addition to the benefits we discussed around data volume, application security and maintenance, tiny containers bring two great advantages to the world of machine learning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;being able to easily deploy a model into heavily resource-constrained environments, such as embedded devices with very little storage. Who knows, you could soon be running XGBoost predictions through your light switch, your sunglasses or your toaster! I'm looking forward to checking out k3OS, a &lt;a href="https://k3os.io" rel="noopener noreferrer"&gt;low-resource operating system based on Kubernetes&lt;/a&gt;, to do exactly that.&lt;/li&gt;
&lt;li&gt;with a much smaller footprint, a model can achieve a much greater prediction throughput ("predictions per second", or &lt;em&gt;pps&lt;/em&gt;), benefiting prediction-hungry applications of machine learning such as recommendation engines, simulation and scenario testing, and pairwise comparisons.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>docker</category>
      <category>go</category>
      <category>python</category>
    </item>
    <item>
      <title>Containers for Machine Learning, from scratch to Kubernetes</title>
      <dc:creator>Blair Hudson</dc:creator>
      <pubDate>Mon, 16 Sep 2019 12:48:35 +0000</pubDate>
      <link>https://dev.to/blairhudson/containers-for-machine-learning-from-scratch-to-kubernetes-2khj</link>
      <guid>https://dev.to/blairhudson/containers-for-machine-learning-from-scratch-to-kubernetes-2khj</guid>
      <description>&lt;p&gt;This article is for all those who keep hearing about the magical concept of &lt;em&gt;containers&lt;/em&gt; from the world of DevOps, and wonder what it might have to do with the equally magical (but perhaps more familiar) concept of &lt;em&gt;machine learning&lt;/em&gt; from the world of Data Science.&lt;/p&gt;

&lt;p&gt;Well, wonder no more — in this article we're going to take a look at using containers for machine learning &lt;em&gt;from scratch&lt;/em&gt;, why they actually make such a good match, and how to run them at scale in both the lightweight Docker Swarm and its popular alternative Kubernetes!&lt;/p&gt;

&lt;p&gt;(No container people... not &lt;code&gt;FROM scratch&lt;/code&gt;, although you can read all about that in &lt;a href="https://dev.to/blairhudson/machine-learning-microservices-python-and-xgboost-in-a-tiny-486kb-container-4on4"&gt;my follow-on post&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  A primer on machine learning in Python
&lt;/h2&gt;

&lt;p&gt;If you've been working with Python for data science for a while, you will already be well-acquainted with tools like Jupyter, Scikit-Learn, Pandas and XGBoost. If not, you'll just have to take my word for it that these are some of the best open source projects out there for machine learning right now.&lt;/p&gt;

&lt;p&gt;For this article, we're going to pull some sample data from everyone's favourite online data science community, Kaggle.&lt;/p&gt;

&lt;p&gt;Assuming you already have Python 3 installed, let's go ahead and install our favourite tools (though you'll probably have most of these already):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pip install jupyterlab pandas scikit-learn xgboost kaggle&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FA9PFWXy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FA9PFWXy.gif" alt="jupyterlab"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(If you’ve had any troubles installing Python 3 or the above package requirements you might like to skip straight to the next section.)&lt;/p&gt;

&lt;p&gt;Once we've configured our &lt;a href="https://github.com/Kaggle/kaggle-api#api-credentials" rel="noopener noreferrer"&gt;local Kaggle credentials&lt;/a&gt;, change to a suitable directory and download and unzip the bank loan prediction dataset (or any other dataset you prefer)!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;kaggle datasets download -d omkar5/dataset-for-bank-loan-prediction&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;unzip dataset-for-bank-loan-prediction.zip&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With our data ready to go, let's run Jupyter Lab and start working on our demonstration model. Use the command &lt;code&gt;jupyter lab&lt;/code&gt; to start the service, which will open &lt;code&gt;http://localhost:8888&lt;/code&gt; in your browser.&lt;/p&gt;

&lt;p&gt;Create a new notebook from the launcher, and call it &lt;code&gt;notebook.ipynb&lt;/code&gt;. You can copy the following code into each cell of your notebook.&lt;/p&gt;

&lt;p&gt;First, we read the Kaggle data into a DataFrame object.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;path_in&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./credit_train.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reading csv from %s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;path_in&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path_in&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we quickly divide our DataFrame into features and a target (&lt;em&gt;but don't try this at home...&lt;/em&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;prep_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Number of Credit Problems&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;select_dtypes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;include&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;number&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bool&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Number of Credit Problems&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;preparing data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;prep_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
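&lt;p&gt;Stripped of pandas, the target construction above is just a boolean threshold over a count column. In plain Python:&lt;/p&gt;

```python
def make_target(credit_problems, threshold=1):
    # label a row positive when its count of credit problems exceeds the threshold
    return [count > threshold for count in credit_problems]

print(make_target([0, 1, 2, 5]))  # [False, False, True, True]
```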



&lt;p&gt;With our data ready, let's fit an XGBoost classifier with all of the default hyper-parameters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;xgboost&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;XGBClassifier&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;XGBClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;training model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When that finishes running, we now have... a model! Admittedly not a very good one, but this article is about containers, not tuning XGBoost. Let's save our model so we can use it later on if necessary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;

&lt;span class="n"&gt;path_out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./model.joblib&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dumping trained model to %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;path_out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path_out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
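&lt;p&gt;The round trip is worth sanity-checking. &lt;code&gt;joblib.dump(model, path)&lt;/code&gt; and &lt;code&gt;joblib.load(path)&lt;/code&gt; mirror the standard library's pickle interface (joblib just handles large numeric arrays more efficiently), so here's the same round trip sketched with stdlib &lt;code&gt;pickle&lt;/code&gt; and a stand-in object:&lt;/p&gt;

```python
import os
import pickle
import tempfile

# stand-in object; for the real thing, joblib.dump/joblib.load work the same way
model = {"name": "xgb-demo", "n_estimators": 100}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored["name"])  # xgb-demo
```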



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FF6cpKQ7.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FF6cpKQ7.gif" alt="jupyter"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Docker for managing your data science environment and executing notebooks
&lt;/h2&gt;

&lt;p&gt;So we just did all of that work to set up our Jupyter environment with the right packages. Depending on our operating system and previous installations we may have even had some unexpected errors. (&lt;em&gt;Did anyone else fail to install XGBoost the first time?&lt;/em&gt;) Hopefully you found a workaround for installing everything, and I hope you took note of the process — since we'll want to be able to repeat it when we take our machine learning project to production later...&lt;/p&gt;

&lt;p&gt;Ok, here comes the juicy part.&lt;/p&gt;

&lt;p&gt;Docker solves this problem for us by allowing us to specify our entire environment (including the operating system and all the installation steps) as a reproducible script, so that we can easily move our machine learning project around without having to resolve the installation challenges ever again! &lt;/p&gt;

&lt;p&gt;You'll need to install Docker. Luckily Docker Desktop for &lt;a href="https://docs.docker.com/docker-for-mac/install/" rel="noopener noreferrer"&gt;Mac&lt;/a&gt; and &lt;a href="https://docs.docker.com/docker-for-windows/install/" rel="noopener noreferrer"&gt;Windows&lt;/a&gt; includes everything we need for this tutorial. Linux users can find Docker in their favourite package manager — but you might need to configure the official Docker repository to get the latest version.&lt;/p&gt;

&lt;p&gt;Once installed, make sure the Docker daemon is running, then run your first container!&lt;/p&gt;

&lt;p&gt;This command will pull the official CentOS 7 Docker image and run an interactive terminal session. (&lt;em&gt;Why CentOS 7?&lt;/em&gt; Because of its similarities to Amazon Linux and Red Hat, which you'll often encounter in enterprise environments. With some tweaking of the &lt;code&gt;yum&lt;/code&gt; installation commands, you could use any base operating system.)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;docker run -it --rm centos:7&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-it&lt;/code&gt; tells Docker to make your container &lt;strong&gt;&lt;em&gt;i&lt;/em&gt;&lt;/strong&gt;nteractive (as opposed to detached) and attaches a &lt;strong&gt;&lt;em&gt;t&lt;/em&gt;&lt;/strong&gt;ty (terminal) session to actually interact with it&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--rm&lt;/code&gt; tells Docker to remove your container as soon as we stop it with &lt;em&gt;ctrl-c&lt;/em&gt; &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Now we want to find the right commands to install Python, Jupyter and our other packages, and as we do we'll write them into a Dockerfile to develop our new container on top of &lt;code&gt;centos:7&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Create a new file and name it &lt;code&gt;Dockerfile&lt;/code&gt;, the contents should look a little something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; centos:7&lt;/span&gt;

&lt;span class="c"&gt;# install python and pip&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;yum &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; epel-release
&lt;span class="k"&gt;RUN &lt;/span&gt;yum &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; python36-devel python36-pip

&lt;span class="c"&gt;# install our pacakges&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip3 &lt;span class="nb"&gt;install &lt;/span&gt;jupyterlab kaggle pandas scikit-learn xgboost 
&lt;span class="c"&gt;# turns out xgboost needs this&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;yum &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; libgomp

&lt;span class="c"&gt;# create a user to run jupyterlab&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;adduser jupyter

&lt;span class="c"&gt;# switch to our user and their home dir&lt;/span&gt;
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; jupyter&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /home/jupyter&lt;/span&gt;

&lt;span class="c"&gt;# tell docker to listen on port 8888 and run jupyterlab&lt;/span&gt;
&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8888&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To build your new container, run this command from the directory containing your &lt;code&gt;Dockerfile&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;docker build -t jupyter .&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This will run each of the instructions in the &lt;code&gt;Dockerfile&lt;/code&gt; except for the final CMD instruction, which is the default command executed when you launch the container, and then tag the built image with the name &lt;em&gt;jupyter&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FhzG7H9o.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FhzG7H9o.gif" alt="docker"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the build is complete, we can run a container based on our new &lt;em&gt;jupyter&lt;/em&gt; image using the default CMD we provided (which will hopefully start our Jupyter server!):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;docker run -it --rm jupyter&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Done? Not quite.&lt;/p&gt;

&lt;p&gt;So it turns out we also need to map the container port to our host computer so we can reach it in the browser. While we're at it, let's also map the current directory to the container user's home directory so we can access our files when Jupyter is launched:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;docker run -it --rm -p "8888:8888" -v "$(pwd):/home/jupyter" jupyter&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-p "HOST_PORT:CONTAINER_PORT"&lt;/code&gt; tells Docker to map a port on our host computer to a port on the container (in this case 8888 to 8888 but they need not be the same)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-v "/host/path/or/file:/container/path/or/file&lt;/code&gt; tells Docker to map a path or file on our host so that the container can access it (and &lt;code&gt;$(pwd)&lt;/code&gt; simply outputs the current host path) &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Using the same notebook cell code as above, write and execute a new &lt;code&gt;notebook.ipynb&lt;/code&gt; using the "containerised" Jupyter service.&lt;/p&gt;

&lt;p&gt;Now we need to automate our notebook execution. In the Jupyter terminal prompt, enter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;jupyter nbconvert --to notebook --inplace --execute notebook.ipynb&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This calls a Jupyter utility to run our notebook and update it in place, so any outputs, tables and charts will be refreshed, in addition to any files the notebook writes (like our saved model).&lt;/p&gt;

&lt;p&gt;When you're done, &lt;em&gt;Ctrl-C&lt;/em&gt; a few times to quit Jupyter (and in doing so, this will exit and remove our container since we set the &lt;code&gt;--rm&lt;/code&gt; option in the previous &lt;code&gt;docker run&lt;/code&gt; command).&lt;/p&gt;

&lt;p&gt;To make things automatable, it turns out we can override the default CMD without creating a new Dockerfile. With this, we can skip running Jupyterlab and instead run our &lt;code&gt;nbconvert&lt;/code&gt; command:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;docker run -it --rm -p "8888:8888" -v "$(pwd):/home/jupyter" jupyter jupyter nbconvert --to notebook --inplace --execute notebook.ipynb&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice that we override the default command (CMD) simply by appending our own command and any arguments to the end of the &lt;code&gt;docker run&lt;/code&gt; command. (Note the first &lt;em&gt;jupyter&lt;/em&gt; is the image tag, while the second is the command that triggers our process.)&lt;/p&gt;

&lt;p&gt;For the curious, this is the same as modifying our &lt;em&gt;Dockerfile&lt;/em&gt; CMD to the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;#...&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["jupyter", "nbconvert", "--to", "notebook", "--inplace", "--execute", "notebook.ipynb"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the container has exited, check &lt;em&gt;model.joblib&lt;/em&gt;, which should have been modified seconds ago.&lt;/p&gt;

&lt;p&gt;Success!&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling your environments with Docker Swarm
&lt;/h2&gt;

&lt;p&gt;Running a container on your computer is one thing — but what if you want to speed up your machine learning workflows beyond what your computer alone can achieve? What if you want to run many of these services at the same time? What if all your data is stored in a remote environment and you don't want to transmit gigabytes of data over the Internet?&lt;/p&gt;

&lt;p&gt;There are loads of great reasons why running containers in a cluster environment is beneficial, but whatever the reason, I'm going to show you just how easy this is by introducing Docker Swarm.&lt;/p&gt;

&lt;p&gt;Conveniently Docker Swarm is a built-in capability of Docker, so to keep following this article you don't need to install anything else. Of course, in reality you would more likely choose to provision multiple compute resources in the cloud and initialise and join your cluster there. In fact, assuming network connectivity between them, you could even set up a cluster that spans multiple cloud providers! (How's that for high availability!? 👊)&lt;/p&gt;

&lt;p&gt;To start a single-node cluster, run &lt;code&gt;docker swarm init&lt;/code&gt;. This designates that host as a manager node in your 'swarm', meaning it is responsible for scheduling services to run across all of the nodes in your cluster. If your manager node goes offline, you lose access to your cluster, so if resiliency is important it's good practice to run 3 or 5 managers: a majority can then maintain consensus even if 1 or 2 nodes fail.&lt;/p&gt;
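&lt;p&gt;The 3-or-5 rule falls out of quorum arithmetic: Raft (which Swarm managers use) needs a strict majority of managers to agree, so a swarm of &lt;em&gt;n&lt;/em&gt; managers tolerates the loss of &lt;em&gt;(n - 1) / 2&lt;/em&gt; of them, rounded down. A quick sketch:&lt;/p&gt;

```python
def manager_fault_tolerance(n_managers):
    # a Raft quorum needs a strict majority, so this many managers can fail
    quorum = n_managers // 2 + 1
    return n_managers - quorum

for n in (1, 3, 5):
    print(n, "managers tolerate", manager_fault_tolerance(n), "failure(s)")
```

&lt;p&gt;Note that 2 managers are no better than 1: the quorum is 2, so a single failure stalls the cluster, which is why odd numbers are recommended.&lt;/p&gt;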

&lt;p&gt;This command will output another command starting with &lt;code&gt;docker swarm join&lt;/code&gt; which, when run on another host, joins that host as a worker node in your swarm. You can run this on as many worker nodes as you want, or even in an auto-scaling arrangement to ensure your cluster always has enough capacity — but we won't need it for now.&lt;/p&gt;

&lt;p&gt;To run Jupyter as a service, Docker Swarm has a special command which is similar to &lt;code&gt;docker run&lt;/code&gt; above. The key difference is that this publishes (exposes) port 8888 across every node in your cluster, regardless of where the container itself is actually running. This means if you send traffic to port 8888 on any node in your cluster, Docker will automatically forward it to the correct host like magic! In certain use cases (such as stateless REST APIs or static application front-ends), you can use this to automatically load balance your services. Cool!&lt;/p&gt;

&lt;p&gt;On a manager node in your cluster (which is your computer for now), run&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;docker service create --name jupyter  --mount type=bind,source=$(pwd),destination=/home/jupyter --publish 8888:8888 jupyter&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--name&lt;/code&gt; gives the service a nickname to easily reference it later (for example, to stop it)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--mount&lt;/code&gt; allows you to bind data into the container&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--publish&lt;/code&gt; exposes the specified port across the cluster&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;(Note that in this case bind-mounting a host directory will work since we only have a single node swarm. In multi-node clusters this won't work so well unless you can guarantee the data at the mount point on each host to be in sync. How to achieve this is not discussed here.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FU7MO1Fi.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FU7MO1Fi.gif" alt="docker-swarm"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After running the command, the service will output various status messages until it converges to a stable state (which basically means that no errors have occurred for 5 seconds once the container command is executed).&lt;/p&gt;

&lt;p&gt;You can run &lt;code&gt;docker service logs -f jupyter&lt;/code&gt; to check the logs (I told you that naming our service would come in handy), and if you want to access Jupyter in your browser, you'll need to do this to retrieve the access token.&lt;/p&gt;

&lt;p&gt;Now you can remove the service by running&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;docker service rm jupyter&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What about our notebook execution? Try running this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;docker service create --name jupyter --mount type=bind,source=$(pwd),destination=/home/jupyter --restart-condition none jupyter jupyter nbconvert --to notebook --inplace --execute notebook.ipynb&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--restart-condition none&lt;/code&gt; is important here to prevent your container from restarting once it has finished executing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;jupyter jupyter [params]&lt;/code&gt; represents the name of the image, then the custom command to run, followed by its parameters (&lt;code&gt;nbconvert ...&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;These commands are getting pretty complex now, so it might be a good idea to start documenting them so we can easily reproduce our services later on. Luckily we have Docker Compose, which is a configuration-based service for doing just that. Here is what the first service command looks like as a &lt;em&gt;compose.yaml&lt;/em&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.3"&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;jupyter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jupyter&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;${PWD}:/home/jupyter&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8888:8888"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you save this, you can run it as a "stack" of services (even though it only describes one right now), using the command:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;docker stack deploy --compose-file compose.yaml jupyter&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Much neater. It turns out you can include many related services in a single Docker Compose stack, and when you deploy one, its services are named &lt;em&gt;stackname_servicename&lt;/em&gt;, so to retrieve the logs enter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;docker service logs -f jupyter_jupyter&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the Docker Compose configuration for running our Jupyter notebook. Note the introduction of the &lt;code&gt;restart_policy&lt;/code&gt;. This is super important for running our job: we expect it to finish, and by default Docker Swarm automatically restarts stopped containers, which would execute your notebook repeatedly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.3"&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;jupyter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jupyter&lt;/span&gt;
    &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;restart_policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;none&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;${PWD}:/home/jupyter&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jupyter nbconvert --to notebook --inplace --execute notebook.ipynb&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Getting started with Kubernetes
&lt;/h2&gt;

&lt;p&gt;Docker Desktop for Mac and Windows also includes a single-node Kubernetes cluster, so in the settings for Docker Desktop you'll want to switch that on. Starting up Kubernetes can take a while, since it is a pretty heavyweight cluster designed for running massive workloads. Think thousands and thousands of containers at once!&lt;/p&gt;

&lt;p&gt;In practice, you'll want to configure your Kubernetes cluster over multiple hosts, and with the introduction of tools like &lt;code&gt;kubeadm&lt;/code&gt; that process is similar to configuring Docker Swarm as we did earlier. We won't be discussing setting up Kubernetes any further in this article, but if you're interested you can read more about &lt;code&gt;kubeadm&lt;/code&gt; &lt;a href="https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/" rel="noopener noreferrer"&gt;here&lt;/a&gt;. If you are planning to use Kubernetes, you might also consider one of the cloud vendor managed services such as &lt;a href="https://aws.amazon.com/eks/" rel="noopener noreferrer"&gt;AWS Elastic Kubernetes Service&lt;/a&gt; or &lt;a href="https://cloud.google.com/kubernetes-engine/" rel="noopener noreferrer"&gt;Google Kubernetes Engine&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In recent versions of Docker and Kubernetes, you can actually deploy a Docker stack straight to Kubernetes — using the same Docker Compose files we created earlier! (Though not without some gotchas, such as the convenient bind-mounted host directory we deployed without fear earlier.)&lt;/p&gt;

&lt;p&gt;To target the locally configured Kubernetes cluster, simply update your command to add &lt;code&gt;--orchestrator kubernetes&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;docker stack deploy --compose-file compose.yaml --orchestrator kubernetes jupyter&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This will deploy a Kubernetes stack just as it deployed a Docker Swarm stack, containing your services (no pun intended). In Kubernetes, the closest equivalent of a Docker Swarm "service" is a "pod".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FFOexn4Y.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FFOexn4Y.gif" alt="kube"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To see what pods are running, and to confirm that our Jupyter stack is one of them, just run this and take note of the exact name of your Jupyter pod (such as &lt;code&gt;jupyter-54f889fdf6-gcshl&lt;/code&gt;).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;kubectl get pods&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As usual you'll need to grab the Jupyter token to access your notebooks, and the equivalent command to access the logs is below. Note that you'll need to use the exact name of the pod from the above command.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;kubectl logs -f jupyter-54f889fdf6-gcshl&lt;/code&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And when you're all done with Jupyter on Kubernetes, you can tear down the stack with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;kubectl delete stack jupyter&lt;/code&gt; &lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>docker</category>
      <category>python</category>
      <category>kubernetes</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
