<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Derek Haynes</title>
    <description>The latest articles on DEV Community by Derek Haynes (@itsderek23).</description>
    <link>https://dev.to/itsderek23</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F158518%2F416a2cd7-2743-45aa-a259-9565c25878a7.jpeg</url>
      <title>DEV Community: Derek Haynes</title>
      <link>https://dev.to/itsderek23</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/itsderek23"/>
    <language>en</language>
    <item>
      <title>Machine learning deserves its own flavor of Continuous Delivery</title>
      <dc:creator>Derek Haynes</dc:creator>
      <pubDate>Fri, 24 Apr 2020 15:54:28 +0000</pubDate>
      <link>https://dev.to/itsderek23/machine-learning-deserves-its-own-flavor-of-continuous-delivery-3ak1</link>
      <guid>https://dev.to/itsderek23/machine-learning-deserves-its-own-flavor-of-continuous-delivery-3ak1</guid>
      <description>&lt;p&gt;Things feel slightly different - but eerily similar - when traveling through Canada as an American. The red vertical lines on McDonald's straws are a bit thicker, for example. It's a lot like my travels through the world of data science as a software engineer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;While the data science world is magical, I'm homesick for the refined state of &lt;a href="https://en.wikipedia.org/wiki/Continuous_delivery"&gt;Continuous Delivery (CD)&lt;/a&gt; in classical software projects when I'm there.&lt;/strong&gt; What's Continuous Delivery?&lt;/p&gt;

&lt;p&gt;Martin Fowler &lt;a href="https://www.martinfowler.com/bliki/ContinuousDelivery.html#footnote-when"&gt;says&lt;/a&gt; you're doing Continuous Delivery when:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Your software is deployable throughout its lifecycle&lt;/li&gt;
&lt;li&gt;Your team prioritizes keeping the software deployable over working on new features&lt;/li&gt;
&lt;li&gt;Anybody can get fast, automated feedback on the production readiness of their systems any time somebody makes a change to them&lt;/li&gt;
&lt;li&gt;You can perform push-button deployments of any version of the software to any environment on demand&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;For a smaller-scale software project deployed to &lt;a href="http://heroku.com"&gt;Heroku&lt;/a&gt;, Continuous Delivery (and its hyper-active sibling &lt;a href="https://www.martinfowler.com/bliki/ContinuousDelivery.html#footnote-when"&gt;Continuous &lt;em&gt;Deployment&lt;/em&gt;&lt;/a&gt;) is a picture of effortless grace. You can set up &lt;a href="https://devcenter.heroku.com/articles/heroku-ci"&gt;CI&lt;/a&gt; in minutes, which means your unit tests run on every push to your git repo. You get &lt;a href="https://devcenter.heroku.com/articles/github-integration-review-apps"&gt;Review Apps&lt;/a&gt; to preview the app from a GitHub pull request. Finally, every &lt;code&gt;git push&lt;/code&gt; to master automatically deploys your app. All of this takes less than an hour to configure.&lt;/p&gt;

&lt;p&gt;Now, why doesn't this same process work for our tangled balls of yarn wrapped around a &lt;em&gt;single&lt;/em&gt; &lt;code&gt;predict&lt;/code&gt; function? Well, the world of data science is not easy to understand until you are deep in its inner provinces. &lt;strong&gt;Let's see why we need &lt;a href="https://martinfowler.com/articles/cd4ml.html"&gt;CD4ML&lt;/a&gt; and how you can go about implementing this in your machine learning project today.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Side-note: CD4ML in action&lt;/strong&gt;? I'd love to put an example machine learning project that implements a lean CD4ML stack on GitHub, but I perform better in front of an audience. &lt;a href="http://eepurl.com/gXANt5"&gt;&lt;strong&gt;Subscribe here to be notified when the sample app is built!&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;Why classical CD doesn't work for machine learning&lt;/h2&gt;

&lt;p&gt;I see 6 key differences in a machine learning project that make it difficult to apply classical software CD systems:&lt;/p&gt;

&lt;h3&gt;1. Machine Learning projects have multiple deliverables&lt;/h3&gt;

&lt;p&gt;Unlike a classical software project that solely exists to deliver an application, a machine learning project has multiple deliverables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ML Models - trained and serialized models with evaluation metrics for each.&lt;/li&gt;
&lt;li&gt;A web service - an application that serves model inference results via an HTTP API.&lt;/li&gt;
&lt;li&gt;Reports - generated analysis results (often from notebooks) that summarize key findings. These can be static or interactive using a tool like &lt;a href="https://www.streamlit.io"&gt;Streamlit&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Having multiple deliverables - each with different build processes and final presentations - is more complicated than compiling a single application. Because these deliverables are tightly coupled to one another - and to the data - it doesn't make sense to create a separate project for each.&lt;/p&gt;

&lt;h3&gt;2. Notebooks are hard to review&lt;/h3&gt;

&lt;p&gt;Code reviews on git branches are so common, they are &lt;a href="https://github.com/features/code-review/"&gt;baked right into GitHub&lt;/a&gt; and other hosted version control systems. GitHub's code review system is great for &lt;code&gt;*.py&lt;/code&gt; files, but it's terrible for viewing changes to JSON-formatted files. Unfortunately, that's how notebook files (&lt;code&gt;*.ipynb&lt;/code&gt;) are stored. This is especially painful because notebooks - not logic abstracted into &lt;code&gt;*.py&lt;/code&gt; files - are the starting point for almost all data science projects.&lt;/p&gt;

&lt;h3&gt;3. Testing isn't simply pass/fail&lt;/h3&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/0MDrZpO_7Q4"&gt;
&lt;/iframe&gt;
&lt;br&gt;
&lt;em&gt;Elle O'Brien covers the ambiguous nature of using CI for model evaluation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It's easy to indicate whether a classical software git branch is OK to merge: run the unit test suite against the branch. These tests contain simple true/false assertions (ex: "can a user sign up?"). Model evaluation is harder and often requires manual review by a data scientist, because evaluation metrics frequently move in different directions. This phase can feel more like a code review in a classical software project.&lt;/p&gt;
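&lt;p&gt;As a sketch of what an automated-but-not-binary check could look like, here's a hypothetical "evaluation gate": it compares a candidate model's metrics to the current baseline and flags regressions for human review instead of failing the build. The metric names and tolerance are made up for illustration.&lt;/p&gt;

```python
TOLERANCE = 0.02  # allowed metric drop before a human needs to look

def evaluate_candidate(baseline, candidate, tolerance=TOLERANCE):
    """Return a verdict per metric instead of a single pass/fail.
    A human still makes the final merge decision."""
    verdicts = {}
    for metric, base_value in baseline.items():
        delta = candidate[metric] - base_value
        if delta >= 0:
            verdicts[metric] = "improved" if delta > 0 else "unchanged"
        elif -delta > tolerance:
            verdicts[metric] = "needs review"  # regression beyond tolerance
        else:
            verdicts[metric] = "within tolerance"
    return verdicts

print(evaluate_candidate({"auc": 0.91, "recall": 0.80},
                         {"auc": 0.92, "recall": 0.75}))
# {'auc': 'improved', 'recall': 'needs review'}
```

&lt;p&gt;CI can post these verdicts as a comment on the pull request; only the "needs review" rows demand a data scientist's attention.&lt;/p&gt;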
&lt;h3&gt;4. Experiments versus git branches&lt;/h3&gt;

&lt;p&gt;Software engineers in classical projects create small, focused git branches when working on a bug or enhancement. While branches are lightweight, they differ from model training experiments in that they usually end up being deployed. Model experiments? Like fish eggs, the vast majority of them will never see open waters. Additionally, an experiment likely has very few code changes (ex: adjusting a hyper-parameter) compared to a classical software git branch.&lt;/p&gt;
&lt;h3&gt;5. Extremely large files&lt;/h3&gt;

&lt;p&gt;Most web applications do not store large files in their git repositories; most data science projects do. Data almost always requires this, and serialized models often do as well (and there's little point in keeping binary files in git anyway).&lt;/p&gt;
&lt;h3&gt;6. Reproducible outputs&lt;/h3&gt;

&lt;p&gt;It's easy for any developer on a classical software project to reproduce the state of the app from any point in time. The app typically changes in just two dimensions - the code and the structure of the database. It's hard to get to a reproducible state in ML projects because many dimensions can change (the amount of data, the data schema, feature extraction, and model selection).&lt;/p&gt;
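&lt;p&gt;One lightweight way to make those dimensions explicit is to fingerprint every input that can change between runs and store the digest next to the trained model artifact. A minimal sketch - the field names here are hypothetical, not from any particular tool:&lt;/p&gt;

```python
import hashlib
import json

def experiment_fingerprint(data_bytes, schema, feature_names, model_params):
    """Hash every dimension that can silently change between runs:
    the raw data, its schema, the feature set, and the model config.
    Store the digest alongside the trained model artifact."""
    payload = {
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "schema": schema,
        "features": sorted(feature_names),
        "params": model_params,
    }
    blob = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

fp = experiment_fingerprint(
    data_bytes=b"user_id,age\n1,34\n",
    schema={"user_id": "int", "age": "int"},
    feature_names=["age"],
    model_params={"model": "logreg", "C": 1.0},
)
print(fp)  # changes whenever any input dimension changes
```

&lt;p&gt;Two runs with the same fingerprint started from the same state; a changed fingerprint tells you exactly that one of the dimensions moved.&lt;/p&gt;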
&lt;h2&gt;How might a simple CD system look for Machine Learning?&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QmH56Q5p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://booklet.ai/img/posts/cd4ml/cd4ml_tools.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QmH56Q5p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://booklet.ai/img/posts/cd4ml/cd4ml_tools.png" alt="cd4ml tools"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Some of the tools available to make CD4ML come to life.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the excellent &lt;a href="https://martinfowler.com/articles/cd4ml.html"&gt;Continuous Delivery for Machine Learning&lt;/a&gt;, the &lt;a href="https://www.thoughtworks.com/"&gt;Thoughtworks&lt;/a&gt; team provides a &lt;a href="https://github.com/ThoughtWorksInc/cd4ml-workshop"&gt;sample ML application&lt;/a&gt; that illustrates a CD4ML implementation. Getting this to run involves a number of moving parts: GoCD, Kubernetes, DVC, Google Storage, MLFlow, Docker, and the EFK stack (ElasticSearch, FluentD, and Kibana). Assembling these nine pieces is approachable for the enterprise clients of Thoughtworks, but it's a mouthful for smaller organizations. &lt;strong&gt;How can leaner organizations implement CD4ML?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Below are my suggestions, ordered by what I'd do first. The earlier items are more approachable. The later items build on the foundation, adding in more automation at the right time.&lt;/p&gt;
&lt;h3&gt;1. Start with Cookiecutter Data Science&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Nobody sits around before creating a new Rails project to figure out where they want to put their views; they just run rails new to get a standard project skeleton like everybody else.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://drivendata.github.io/cookiecutter-data-science/"&gt;Cookiecutter Data Science&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When I view the source code for an existing Ruby on Rails app, I can easily navigate around as they all follow the same structure. Unfortunately, most data science projects don't use a shared project skeleton. However, it doesn't have to be that way: start your data science project with &lt;a href="https://github.com/drivendata/cookiecutter-data-science"&gt;Cookiecutter Data Science&lt;/a&gt; rather than dreaming up your own unique structure.&lt;/p&gt;

&lt;p&gt;Cookiecutter Data Science sets up a project structure that works for most projects, creating directories like &lt;code&gt;data&lt;/code&gt;, &lt;code&gt;models&lt;/code&gt;, &lt;code&gt;notebooks&lt;/code&gt;, and &lt;code&gt;src&lt;/code&gt; (for shared source code).&lt;/p&gt;
&lt;h3&gt;2. nbdime or ReviewNB for better notebook code reviews&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vWFvVdWM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://booklet.ai/img/posts/cd4ml/nbdiff-web.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vWFvVdWM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://booklet.ai/img/posts/cd4ml/nbdiff-web.png" alt="nbdiff"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;nbdime is a Python package that makes it easier to view changes in notebook files.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It's hard to view changes to &lt;code&gt;*.ipynb&lt;/code&gt; in standard diff tools. It's trivial to conduct a code review of a &lt;code&gt;*.py&lt;/code&gt; script, but almost impossible with a JSON-formatted notebook. &lt;a href="https://nbdime.readthedocs.io/en/latest/"&gt;nbdime&lt;/a&gt; and &lt;a href="https://www.reviewnb.com/"&gt;ReviewNB&lt;/a&gt; make this process easier.&lt;/p&gt;
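&lt;p&gt;If adopting one of these tools isn't an option yet, a lighter-weight stopgap is to strip outputs and execution counts from notebooks before committing, so diffs only show code and markdown changes (the dedicated nbstripout tool does this more robustly). A minimal sketch of the idea, relying on the notebook file's on-disk JSON structure:&lt;/p&gt;

```python
import json

def strip_notebook(path):
    """Remove outputs and execution counts from a .ipynb file in place,
    leaving only code and markdown for cleaner git diffs."""
    with open(path) as f:
        nb = json.load(f)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    with open(path, "w") as f:
        json.dump(nb, f, indent=1, sort_keys=True)
```

&lt;p&gt;Run as a pre-commit hook, this keeps the noisy base64 image blobs and cell counters out of every pull request.&lt;/p&gt;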
&lt;h3&gt;3. dvc for data files&lt;/h3&gt;

&lt;pre&gt;
dvc add data/
git add data.dvc .gitignore
git commit -m "Storing data dir in DVC"
git push
dvc push
&lt;/pre&gt;

&lt;p&gt;&lt;em&gt;After installing dvc and adding a remote, the above is all that's required to version-control your &lt;code&gt;data/&lt;/code&gt; directory with dvc.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It's easy to commit large files (especially data files) to a git repo, but Git doesn't handle large files well: they slow things down, can cause the repo to grow quickly, and you can't really view diffs of these files anyway. A solution gaining a lot of steam for version control of large files is &lt;a href="https://github.com/iterative/dvc"&gt;Data Version Control (dvc)&lt;/a&gt;. dvc is easy to install (just use &lt;code&gt;pip&lt;/code&gt;) and its command structure mimics git.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But wait - you're querying data directly from your data warehouse and don't have large data files?&lt;/strong&gt; This is bad, as it virtually guarantees your ML project is not reproducible: the amount of data and its schema are almost certain to change over time. Instead, dump those query results to CSV files. Storage is cheap, and dvc makes it easy to handle these large files.&lt;/p&gt;
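&lt;p&gt;Freezing the query results can be a few lines with Pandas. In this sketch an in-memory SQLite database stands in for the real warehouse, and the table, columns, and file path are all hypothetical:&lt;/p&gt;

```python
import os
import sqlite3

import pandas as pd

# Stand-in for a real warehouse connection (Redshift, BigQuery, etc.).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, label INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, 0), (2, 1)])

# Freeze the training query to a CSV file that dvc can track.
os.makedirs("data/raw", exist_ok=True)
df = pd.read_sql_query("SELECT user_id, label FROM events", conn)
df.to_csv("data/raw/events.csv", index=False)  # afterwards: dvc add data/
```

&lt;p&gt;Re-running the dump produces a new version of the CSV, and dvc records which version each experiment used.&lt;/p&gt;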
&lt;h3&gt;4. dvc for reproducible training pipelines&lt;/h3&gt;

&lt;p&gt;dvc isn't just for large files. As Christopher Samiullah &lt;a href="https://christophergs.com/machine%20learning/2019/05/13/first-impressions-of-dvc/#pipelines"&gt;says&lt;/a&gt; in &lt;em&gt;First Impressions of Data Science Version Control (DVC)&lt;/em&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Pipelines are where DVC really starts to distinguish itself from other version control tools that can handle large data files. DVC pipelines are effectively version controlled steps in a typical machine learning workflow (e.g. data loading, cleaning, feature engineering, training etc.), with expected dependencies and outputs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can use &lt;a href="https://dvc.org/doc/tutorials/get-started/pipeline"&gt;&lt;code&gt;dvc run&lt;/code&gt;&lt;/a&gt; to create a Pipeline stage that encompasses your training pipeline:&lt;/p&gt;

&lt;pre&gt;
dvc run -f train.dvc \
          -d src/train.py -d data/ \
          -o model.pkl \
          python src/train.py data/ model.pkl
&lt;/pre&gt;

&lt;p class="small text-muted"&gt;
&lt;code&gt;dvc run&lt;/code&gt; names this stage &lt;code&gt;train&lt;/code&gt;, depends on &lt;code&gt;train.py&lt;/code&gt; and the &lt;code&gt;data/&lt;/code&gt; directory, and outputs a &lt;code&gt;model.pkl&lt;/code&gt; model.
&lt;/p&gt;
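&lt;p&gt;The &lt;code&gt;src/train.py&lt;/code&gt; that the stage invokes only needs to accept the data directory and the output path. Here's a hedged sketch of its shape - the "model" (per-label feature means) is a stdlib placeholder standing in for whatever estimator your project actually trains, and the column names are made up:&lt;/p&gt;

```python
"""src/train.py - minimal shape of a training entry point for `dvc run`."""
import csv
import pickle
import sys
from pathlib import Path

def train(data_dir, model_path):
    # Load every CSV in the data directory.
    rows = []
    for csv_file in Path(data_dir).glob("*.csv"):
        with open(csv_file) as f:
            rows.extend(csv.DictReader(f))
    # Placeholder "model": the average feature value per label.
    grouped = {}
    for row in rows:
        grouped.setdefault(row["label"], []).append(float(row["feature"]))
    model = {label: sum(v) / len(v) for label, v in grouped.items()}
    # Serialize the trained model where dvc expects the output.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

if __name__ == "__main__":
    train(sys.argv[1], sys.argv[2])  # python src/train.py data/ model.pkl
```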

&lt;p&gt;Later, you run this pipeline with just:&lt;/p&gt;

&lt;pre&gt;
dvc repro train.dvc
&lt;/pre&gt;

&lt;p&gt;Here's where "reproducible" comes in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the dependencies are unchanged - &lt;code&gt;dvc repro&lt;/code&gt; simply grabs &lt;code&gt;model.pkl&lt;/code&gt; from remote storage. There is no need to run the entire training pipeline again.&lt;/li&gt;
&lt;li&gt;If the dependencies change - &lt;code&gt;dvc repro&lt;/code&gt; will warn us and re-run this step.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to go back to a prior state (say you have a &lt;code&gt;v1.0&lt;/code&gt; git tag):&lt;/p&gt;

&lt;pre&gt;
git checkout v1.0
dvc checkout
&lt;/pre&gt;

&lt;p class="small text-muted"&gt;Running &lt;code&gt;dvc checkout&lt;/code&gt; will restore the &lt;code&gt;data/&lt;/code&gt; directory and &lt;code&gt;model.pkl&lt;/code&gt; to is previous state.
&lt;/p&gt;

&lt;h3&gt;5. Move slow training pipelines to a CI cluster&lt;/h3&gt;

&lt;p&gt;In &lt;a href="https://www.youtube.com/watch?v=0MDrZpO_7Q4&amp;amp;list=PLVeJCYrrCemgbA1cWYn3qzdgba20xJS8V&amp;amp;index=6"&gt;Reimagining DevOps&lt;/a&gt;, Elle O'Brien makes an elegant case for using CI to run model training. I think this is brilliant - CI is already used to run a classical software test suite in the cloud. Why not use the larger resources available to you in the cloud to do the same for your model training pipelines? In short, push up a branch with your desired experiment changes and let CI do the training and report results.&lt;/p&gt;

&lt;p&gt;Christopher Samiullah &lt;a href="https://christophergs.com/machine%20learning/2019/05/13/first-impressions-of-dvc/"&gt;also covers moving training to CI&lt;/a&gt;, even providing a link to a &lt;a href="https://github.com/ChristopherGS/dvc_test/pull/1"&gt;GitHub PR&lt;/a&gt; that uses &lt;code&gt;dvc repro train.dvc&lt;/code&gt; in his CI flow.&lt;/p&gt;

&lt;h3&gt;6. Deliverable-specific review apps&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Zpclyeuq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://booklet.ai/img/posts/cd4ml/github-statuses-deploy-previews.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Zpclyeuq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://booklet.ai/img/posts/cd4ml/github-statuses-deploy-previews.png" alt="netlify deploy preview"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Netlify, a PaaS for static sites, lets you view the current version of your site right from a GitHub pull request.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Platforms like Heroku and Netlify allow you to view a running version of your application right from a GitHub pull request. This is fantastic for validating that things work as desired and for presenting work to non-technical team members.&lt;/p&gt;

&lt;p&gt;Do the same using GitHub Actions that start Docker containers to serve your deliverables. I'd spin up the web service and any dynamic versions of reports (like something built with &lt;a href="http://www.streamlit.io"&gt;Streamlit&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Like eating healthy, brushing your teeth, and getting fresh air, there's no downside to implementing Continuous Delivery in a software project. We can't copy and paste the classical software CD process onto machine learning projects, because an ML project ties together large datasets and a slow training process that doesn't generate clear pass/fail acceptance tests, and it produces multiple types of deliverables. By comparison, a classical software project contains just code and has a single deliverable (an application).&lt;/p&gt;

&lt;p&gt;Thankfully, it's possible to create a Machine Learning-specific flavor of Continuous Delivery (&lt;a href="https://martinfowler.com/articles/cd4ml.html"&gt;CD4ML&lt;/a&gt;) for non-enterprise organizations with existing tools (git, Cookiecutter Data Science, nbdime, dvc, and a CI server) today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CD4ML in action&lt;/strong&gt;? I'd love to put an example machine learning project that implements a lean CD4ML stack on GitHub, but I perform better in front of an audience. &lt;a href="http://eepurl.com/gXANt5"&gt;&lt;strong&gt;Subscribe here to be notified when the sample app is built!&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Elsewhere&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://martinfowler.com/articles/cd4ml.html"&gt;Continuous Delivery for Machine Learning&lt;/a&gt; - this blog post (and the CD4ML acronym) wouldn't exist without the original work of Daniel Sato, Arif Wider, and Christoph Windheuser.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://christophergs.com/machine%20learning/2019/05/13/first-impressions-of-dvc/#pipelines"&gt;First Impressions of Data Science Version Control (DVC)&lt;/a&gt; - this is a great technical walkthrough migrating an existing ML project to using DVC for tracking data, creating pipelines, versioning, and CI.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=0MDrZpO_7Q4&amp;amp;list=PLVeJCYrrCemgbA1cWYn3qzdgba20xJS8V&amp;amp;index=6"&gt;Reimagining DevOps for ML by Elle O'Brien&lt;/a&gt; - a concise 8 minute YouTube talk that outlines a novel approach for using CI to train and record metrics for model training experiments.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ljvmiranda921.github.io/notebook/2020/03/06/jupyter-notebooks-in-2020/"&gt;How to use Jupyter Notebooks in 2020&lt;/a&gt; - a comprehensive three-part series on architecting data science projects, best practices, and how the tools ecosystem fits together.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Is your Django app slow? Think like a data scientist, not an engineer</title>
      <dc:creator>Derek Haynes</dc:creator>
      <pubDate>Mon, 22 Apr 2019 15:56:57 +0000</pubDate>
      <link>https://dev.to/scoutapm/is-your-django-app-slow-think-like-a-data-scientist-not-an-engineer-5bnb</link>
      <guid>https://dev.to/scoutapm/is-your-django-app-slow-think-like-a-data-scientist-not-an-engineer-5bnb</guid>
      <description>&lt;p&gt;&lt;em&gt;This post originally appeared on the &lt;a href="https://scoutapm.com/blog/is-your-django-app-slow-ask-a-data-scientist-not-an-engineer"&gt;Scout Blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I'm an engineer by trade. &lt;strong&gt;I rely on intuition&lt;/strong&gt; when investigating a slow Django app. I've solved a lot of performance issues over the years, and the shortcuts my brain takes often work. However, &lt;strong&gt;intuition can fail&lt;/strong&gt;. It can fail hard in complex Django apps with many layers (ex: an SQL database, a NoSQL database, ElasticSearch, etc.) and many views. There's too much noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instead of relying on an engineer's intuition, what if we approached a performance investigation like a data scientist?&lt;/strong&gt; In this post, I'll walk through a performance issue as a data scientist, not as an engineer. I'll share time series data from a real performance issue, along with the Google Colab notebook I used to solve it.&lt;/p&gt;

&lt;h2&gt;The problem part I: too much noise&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MjAb0kFZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.buttercms.com/Q1i3XCC5SACcHvdYXVtM" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MjAb0kFZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.buttercms.com/Q1i3XCC5SACcHvdYXVtM" alt="undefined"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo Credit: pyc &lt;a href="https://www.flickr.com/photos/pyc/4963466757/in/photolist-8yB4K4-pBRCcq-nZXkWu-bWsES4-29rmwqJ-boe2Vk-7vsTeb-9N2guL-ng4zzC-8hE5Cq-zouzT-areoZ2-nLQoy7-8hpHiQ-a5r9xQ-8hE5Ku-ofvatX-n3mHF-fHSwDM-inbd3a-7oT1q-4UsU2J-2526RGe-8LaPMa-32AUXJ-bka1JE-bka1Kf-9aBMJf-dqr3qB-dqrdUr-5sx8D-o4dwhQ-hR6gY-prMTu-4Tg8va-29MP5Hj-7F2BDK-7F6uoN-7EAp5f-7EAnS1-7E4qdX-JoapDG-7F6v6y-zosvP-7DrJvR-7Ewy2F-7F6v9N-6g9j2o-7Ewvci-7DrFoZ"&gt;Carnoy&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Many performance issues are caused by one of the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;A slow layer&lt;/strong&gt; - just one of many layers (the database, app server, etc) is slow and impacting many views in a Django app.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;A slow view&lt;/strong&gt; - one view is generating slow requests. This has a big impact on the performance profile of the app as a whole.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;A high throughput view&lt;/strong&gt; - a rarely used view suddenly sees an influx of traffic and triggers a spike in the overall response time of the app.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;When investigating a performance issue, I start by looking for correlations.&lt;/strong&gt; Are any metrics getting worse at the same time? This can be hard: &lt;strong&gt;a modest Django app with 10 layers and 150 views has 3,000 unique combinations of time series data sets to compare!&lt;/strong&gt; If my intuition can't quickly isolate the issue, it's close to impossible to isolate things on my own. &lt;/p&gt;

&lt;h2&gt;The problem part II: phantom correlations&lt;/h2&gt;

&lt;p&gt;Determining if two time series data sets are correlated is &lt;a href="https://www.kdnuggets.com/2015/02/avoiding-common-mistake-time-series.html"&gt;notoriously fickle&lt;/a&gt;. For example, doesn't it look like the number of films Nicolas Cage appears in each year and swimming pool drownings are correlated?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XoyXvg8o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.buttercms.com/MGY8SSIhS2eyXEaAhuMI" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XoyXvg8o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.buttercms.com/MGY8SSIhS2eyXEaAhuMI" alt="chart.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.indiebound.org/book/9780316339438"&gt;Spurious Correlations&lt;/a&gt; is an entire book related to these seemingly clear but unconnected correlations! So, &lt;strong&gt;why do trends appear to trigger correlations when looking at a&lt;/strong&gt; timeseries &lt;strong&gt;chart?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's an example: five years ago, my area of Colorado &lt;a href="https://999thepoint.com/photos-from-estes-park-colorado-flood-2013/"&gt;experienced a historic flood&lt;/a&gt;. It shut off one of the two major routes into Estes Park, the gateway to Rocky Mountain National Park. If you looked at sales receipts across many different types of businesses in Estes Park, you'd see a sharp decline in revenue while the road was closed and an increase in revenue when the road reopened. This doesn't mean that revenue amongst different stores was correlated. The stores were just impacted by a mutual dependency: a closed road!&lt;/p&gt;

&lt;p&gt;One of the easiest ways to remove a trend from a time series is to calculate the &lt;em&gt;&lt;a href="https://people.duke.edu/~rnau/411diff.htm"&gt;first difference&lt;/a&gt;&lt;/em&gt;. To calculate the first difference, you subtract from each point the point that came before it:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;y'(t) = y(t) - y(t-1)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;That's great, but my visual brain can't re-imagine a time series into its first difference when staring at a chart.&lt;/p&gt;
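&lt;p&gt;Here's a quick synthetic demonstration of why the first difference matters: two unrelated series that merely share an upward trend look almost perfectly correlated until you difference them. The data below is fabricated:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.arange(200)

# Two independent noisy series that both trend upward over time.
df = pd.DataFrame({
    "nicolas_cage_films": t + rng.normal(0, 5, t.size),
    "pool_drownings": t + rng.normal(0, 5, t.size),
})

raw_corr = df.corr().iloc[0, 1]          # looks extremely strong
diff_corr = df.diff().corr().iloc[0, 1]  # the "correlation" vanishes
print(round(raw_corr, 2), round(diff_corr, 2))
```

&lt;p&gt;The shared trend dominates the raw correlation; differencing removes it, leaving only the (nonexistent) relationship between the noise.&lt;/p&gt;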

&lt;h2&gt;Enter Data Science&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;We have a data science problem, not a performance problem!&lt;/strong&gt; We want to identify any highly correlated time series metrics. We want to see past misleading trends. To solve this issue, we'll use the following tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://colab.research.google.com"&gt;Google Colab&lt;/a&gt;, a shared notebook environment&lt;/li&gt;
&lt;li&gt;  Common Python data science libraries like &lt;a href="https://pandas.pydata.org/"&gt;Pandas&lt;/a&gt; and &lt;a href="https://www.scipy.org/"&gt;SciPy&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Performance data collected from &lt;a href="https://scoutapp.com"&gt;Scout&lt;/a&gt;, an Application Performance Monitoring (APM) product. &lt;a href="https://apm.scoutapp.com/users/sign_up"&gt;Sign up&lt;/a&gt; for a free trial if you don't have an account yet.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'll walk through a &lt;a href="https://colab.research.google.com/drive/1VhCwtGLc-tWhB_gbGuBo_cfs7Q5J4dCI"&gt;shared notebook on Google Colab&lt;/a&gt;. You can easily save a copy of this notebook, enter your metrics from Scout, and identify the most significant correlations in your Django app.&lt;/p&gt;

&lt;h2&gt;Step 1: view the app in Scout&lt;/h2&gt;

&lt;p&gt;I log in to Scout and see the following overview chart:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NDedqRIz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.buttercms.com/OxGdJmY0RM2KypKJFcCm" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NDedqRIz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.buttercms.com/OxGdJmY0RM2KypKJFcCm" alt="undefined"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Time spent in SQL queries jumped significantly from 7pm - 9:20pm. Why? This is scary as almost every view touches the database!&lt;/p&gt;

&lt;h2&gt;Step 2: load layers time series data into Pandas&lt;/h2&gt;

&lt;p&gt;To start, I want to look for correlations between the layers (ex: SQL, MongoDB, View) and the average response time of the Django app. There are fewer layers (10) than views (150+) so it's a simpler place to start. I'll grab this time series data from Scout and initialize a Pandas DataFrame. I'll leave this data wrangling &lt;a href="https://colab.research.google.com/drive/1VhCwtGLc-tWhB_gbGuBo_cfs7Q5J4dCI#scrollTo=8KKzj89nfZ-9"&gt;to the notebook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After loading the data into a Pandas DataFrame we can &lt;a href="https://colab.research.google.com/drive/1VhCwtGLc-tWhB_gbGuBo_cfs7Q5J4dCI#scrollTo=j4PKi7wIfZ_L"&gt;plot these layers&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qP1laKOW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.buttercms.com/eZVXiFIcSiGAdVBrhrvb" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qP1laKOW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.buttercms.com/eZVXiFIcSiGAdVBrhrvb" alt="plot.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Step 3: layer correlations&lt;/h3&gt;

&lt;p&gt;Now, let's see if any layers are correlated to the Django app's overall average response time. Before comparing each layer time series to the response time, we want to calculate the first difference of each time series. With Pandas, we can do this very easily via the &lt;code&gt;diff()&lt;/code&gt; function:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.diff()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;After calculating the first difference, we can then look for &lt;a href="https://en.wikipedia.org/wiki/Correlation_coefficient"&gt;correlations&lt;/a&gt; between each pair of time series via the &lt;code&gt;corr()&lt;/code&gt; function. The correlation value ranges from −1 to +1, where ±1 indicates the strongest possible correlation and 0 indicates no correlation at all.&lt;/p&gt;
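&lt;p&gt;Ranking the layers boils down to a couple of lines. This sketch assumes - as in the notebook - a DataFrame with a &lt;code&gt;total&lt;/code&gt; column for overall response time plus one column per layer; the numbers below are made up:&lt;/p&gt;

```python
import pandas as pd

def top_layer_correlations(df):
    """Rank layers by how strongly their first differences correlate
    with the overall response time (a 'total' column is assumed)."""
    corr = df.diff().corr()["total"].drop("total")
    return corr.abs().sort_values(ascending=False)

# Made-up example: 'sql' tracks 'total' closely, 'cache' does not.
df = pd.DataFrame({
    "total": [100, 120, 180, 150, 200, 260],
    "sql":   [40, 60, 115, 85, 140, 195],
    "cache": [5, 4, 6, 5, 4, 6],
})
print(top_layer_correlations(df).index[0])  # prints "sql"
```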

&lt;p&gt;My notebook generates the &lt;a href="https://colab.research.google.com/drive/1VhCwtGLc-tWhB_gbGuBo_cfs7Q5J4dCI#scrollTo=hwm0KcDzfZ_Z"&gt;following result&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X9w8AjeQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.buttercms.com/PKRnV8tXSPqWP4X1Wxwg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X9w8AjeQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.buttercms.com/PKRnV8tXSPqWP4X1Wxwg" alt="corr_layer.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SQL&lt;/code&gt; appears to be correlated to the overall response time of the Django app. To be sure, let's determine the Pearson correlation coefficient's &lt;a href="http://www.eecs.qmul.ac.uk/~norman/blog_articles/p_values.pdf"&gt;p-value&lt;/a&gt;. A low value (&amp;lt; 0.05) indicates the observed correlation between the overall response time and the SQL layer is very unlikely to be due to chance:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_diff = df.diff().dropna()
p_value = scipy.stats.pearsonr(df_diff.total.values, df_diff[top_layer_correl].values)[1]
print("first order series p-value:", p_value)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The p-value is just &lt;code&gt;1.1e-54&lt;/code&gt;. I'm very confident that slow SQL queries are related to an overall slow Django app.&lt;/strong&gt; It's always the database, right?&lt;/p&gt;

&lt;p&gt;Layers are just one dimension we should evaluate. Another is the response time of the Django views.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Rinse+repeat for Django view response times
&lt;/h2&gt;

&lt;p&gt;The overall app response time could increase if a view starts responding slowly. We can see if this is happening by looking for correlations in our view response times versus the overall app response time. We're using the exact same process as we used for layers, just swapping out the layers for time series data from each of our views in the Django app:&lt;/p&gt;
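&lt;p&gt;That "exact same process" is easy to capture in one small helper. A minimal sketch, assuming the metrics arrive as DataFrame columns alongside a &lt;code&gt;total&lt;/code&gt; column (the view names and numbers here are hypothetical, not Scout's actual export format):&lt;/p&gt;

```python
import numpy as np
import pandas as pd
from scipy import stats

def correlate_with_total(df, total_col="total"):
    """Correlate the first difference of each column with the first
    difference of the overall response time, returning {col: (r, p)}."""
    diffs = df.diff().dropna()
    results = {}
    for col in diffs.columns.drop(total_col):
        r, p = stats.pearsonr(diffs[total_col], diffs[col])
        results[col] = (r, p)
    return results

# Usage with synthetic view response times: one view drives the total,
# the other is unrelated noise.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "apps/data": rng.normal(100, 15, n),
    "apps/other": rng.normal(10, 1, n),
})
df["total"] = df["apps/data"] * 0.5 + rng.normal(0, 5, n)

for view, (r, p) in correlate_with_total(df).items():
    print(f"{view}: r={r:.2f}, p={p:.2e}")
```

&lt;p&gt;The same helper works for Step 5 too: pass in throughput columns instead of response-time columns and nothing else changes.&lt;/p&gt;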

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QVB93BVq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.buttercms.com/VNR7BsT1SSKwqwCG3jjG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QVB93BVq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.buttercms.com/VNR7BsT1SSKwqwCG3jjG" alt="undefined"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After calculating the first difference of each time series, &lt;code&gt;apps/data&lt;/code&gt; does appear to be correlated to the overall app response time. &lt;strong&gt;With a p-value of just &lt;code&gt;1.64e-46&lt;/code&gt;, &lt;code&gt;apps/data&lt;/code&gt; is very likely to be correlated to the overall app response time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We're almost done extracting the signal from the noise. We should check to see if traffic to any views triggers slow response times.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Rinse+repeat for Django view throughputs
&lt;/h2&gt;

&lt;p&gt;A little-used, expensive view could hurt the overall response time of the app if throughput to that view suddenly increases. For example, this could happen if a user writes a script that quickly reloads an expensive view. To determine correlations we'll use the exact same process as before, just swapping in the throughput time series data for each Django view:&lt;/p&gt;
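&lt;p&gt;To make the failure mode concrete, here's a toy reproduction (synthetic data, hypothetical numbers) where a scripted burst of traffic to one expensive view drags up the overall response time, and the differenced correlation picks it up:&lt;/p&gt;

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 200

# Baseline throughput of ~5 req/min with a scripted burst in the middle.
throughput = rng.poisson(5, n).astype(float)
throughput[90:110] += 40

# Overall response time creeps up while the expensive view is hammered.
response_time = 80 + 0.8 * throughput + rng.normal(0, 4, n)

# Correlate the first differences, as in the previous steps.
r, p = stats.pearsonr(np.diff(throughput), np.diff(response_time))
print(f"r={r:.2f}, p={p:.2e}")
```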

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nPORDF0Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.buttercms.com/LmHOK63ATuiFISdpwUg1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nPORDF0Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.buttercms.com/LmHOK63ATuiFISdpwUg1" alt="undefined"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;endpoints/sparkline&lt;/code&gt; appears to have a small correlation. The p-value is &lt;code&gt;0.004&lt;/code&gt;: if there were no real relationship between traffic to &lt;code&gt;endpoints/sparkline&lt;/code&gt; and the overall app response time, we'd expect to see a correlation this strong only about 4 times in 1,000. So it does appear that traffic to the &lt;code&gt;endpoints/sparkline&lt;/code&gt; view triggers slower overall app response times, but the evidence is weaker than in our other two tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Using data science, we've been able to sort through far more time series metrics than we ever could with intuition alone. And because we worked with first differences, shared long-term trends couldn't muddy the waters with spurious correlations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We know that our Django app response times are:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  strongly correlated to the performance of our SQL database.&lt;/li&gt;
&lt;li&gt;  strongly correlated to the response time of our &lt;code&gt;apps/data&lt;/code&gt; view.&lt;/li&gt;
&lt;li&gt;  correlated to &lt;code&gt;endpoints/sparkline&lt;/code&gt; traffic. While we're confident in this correlation given the low p-value, it isn't as strong as the previous two correlations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Now it’s time for the engineer!&lt;/strong&gt; With these insights in hand, I’d:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  investigate if the database server is being impacted by something outside of the application. For example, if we have just one database server, a backup process could slow down all queries.&lt;/li&gt;
&lt;li&gt;  investigate if the composition of the requests to the &lt;code&gt;apps/data&lt;/code&gt; view has changed. For example, has a customer with lots of data started hitting this view more? Scout's &lt;a href="http://help.apm.scoutapp.com/#trace-explorer"&gt;Trace Explorer&lt;/a&gt; can help investigate this high-dimensional data.&lt;/li&gt;
&lt;li&gt;  hold off investigating the performance of &lt;code&gt;endpoints/sparkline&lt;/code&gt; as its correlation to the overall app response time wasn't as strong.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;It's important to realize when all of that hard-earned experience doesn't work. My brain simply can't analyze thousands of time series data sets the way our data science tools can.&lt;/strong&gt; It's OK to reach for another tool.&lt;/p&gt;

&lt;p&gt;If you'd like to work through this problem on your own, check out &lt;a href="https://colab.research.google.com/drive/1VhCwtGLc-tWhB_gbGuBo_cfs7Q5J4dCI"&gt;my shared Google Colab notebook&lt;/a&gt; I used when investigating this issue. &lt;strong&gt;Just import your own data from &lt;a href="https://scoutapp.com"&gt;Scout&lt;/a&gt; next time you have a performance issue and let the notebook do the work for you!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>django</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
