DEV Community: Elle O'Brien

How to use GitHub Actions with your GPU

Elle O'Brien — Mon, 24 Aug 2020 16:23:52 +0000

Tools like GitHub Actions and GitLab CI automate repetitive aspects of software development, and they can also automate machine learning tasks like model training, testing, and reporting. By default, these tools run workflows on CPUs.

This tutorial will show you how to set up a GPU (on-premise or cloud) as a self-hosted runner using the CML Docker container, which comes ready with CUDA drivers and software to run GitHub Actions and GitLab CI workflows! It's part of a series of MLOps tutorials I've been making. Enjoy!

MLOps Tutorial 🎦: Track models with Git & GitHub Actions

Elle O'Brien — Mon, 17 Aug 2020 19:47:14 +0000

Did you know you can use Git to keep track of your ML models? Yes, you can use Git to snapshot your project at many stages of development! Then with GitHub Actions, you can take your work to the next level by automating repetitive processes like model training and reporting. I'm creating a video series to help people take advantage of these software tools for data science and ML.

One of the big ideas around Git is to use branches to develop new features. In data science, this can look like using new branches to try out new modeling approaches or ways of processing data. So today I've released a new video tutorial about a frequent question:

How do I compare ML models on different Git branches?

The answer (and the video!) goes a little more in-depth than you might expect. There's an easy approach, and then there's a good approach.

Easy answer. If your model training and evaluation scripts create a metrics file (say, metrics.csv), then you could use

$ git diff metrics.csv
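
If the raw diff proves hard to scan, a small script can line up the two versions of the file. The following is a minimal sketch and not part of the video: it assumes metrics.csv has two columns named metric and value (a hypothetical layout), and that each branch's copy has already been read into a string, for instance from the output of git show <branch>:metrics.csv. The function names and sample numbers are illustrative only.

```python
import csv
import io


def parse_metrics(csv_text):
    """Parse 'metric,value' rows (with a header row) into a dict of floats."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["metric"]: float(row["value"]) for row in reader}


def compare_metrics(base_text, feature_text):
    """Return (metric, base, feature, delta) rows; None marks a missing value."""
    base = parse_metrics(base_text)
    feature = parse_metrics(feature_text)
    rows = []
    for name in sorted(set(base) | set(feature)):
        old, new = base.get(name), feature.get(name)
        delta = None if old is None or new is None else new - old
        rows.append((name, old, new, delta))
    return rows


# Hypothetical file contents standing in for the copies on two branches.
main_csv = "metric,value\naccuracy,0.80\nf1,0.70\n"
feature_csv = "metric,value\naccuracy,0.85\nf1,0.65\n"

for name, old, new, delta in compare_metrics(main_csv, feature_csv):
    change = "n/a" if delta is None else f"{delta:+.2f}"
    print(f"{name}: {old} -> {new} ({change})")
```

A metric that exists on only one branch shows up with a missing value instead of being silently dropped, which makes renamed or removed metrics easy to spot.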

Good answer. So you can do a git diff of your metrics file, but aside from being a little hard to read, there's another issue:

What if metrics.csv is modified by different processes on different branches?

For example, on the main branch of a project, I might run a script train.py that creates metrics.csv. But there's no guarantee that on a feature branch, I or a teammate will keep train.py and metrics.csv in sync. A few scenarios:

Someone manually updates metrics.csv on their branch
Someone changes train.py but forgets to re-run it, so metrics.csv is never re-generated
Someone modifies train.py on a feature branch to output a reformatted file (metrics.json instead of .csv, perhaps), or to output an entirely different file (score.csv).

To avoid these kinds of errors, we need to make sure that our metrics file is tightly linked to the processes that produced it (and any other processes it depends on, like standardizing data).
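
One way to picture this link is to record a fingerprint of the script next to the metrics it produced, and to treat the metrics as stale whenever either one changes. The sketch below illustrates that idea only; it is not how any particular pipeline tool is implemented, and the lock file name and helper functions are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_hash(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def record_run(script, metrics, lock_file):
    """After training, store hashes of the script and the metrics it produced."""
    lock = {"script": file_hash(script), "metrics": file_hash(metrics)}
    Path(lock_file).write_text(json.dumps(lock, indent=2))


def is_up_to_date(script, metrics, lock_file):
    """True only if neither the script nor the metrics changed since record_run."""
    paths = [Path(script), Path(metrics), Path(lock_file)]
    if not all(p.exists() for p in paths):
        return False  # a renamed or missing output counts as out of sync
    lock = json.loads(Path(lock_file).read_text())
    return lock == {"script": file_hash(script), "metrics": file_hash(metrics)}


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp, "train.py")
    metrics = Path(tmp, "metrics.csv")
    lock = Path(tmp, "metrics.lock.json")  # hypothetical file name

    script.write_text("print('training v1')\n")
    metrics.write_text("metric,value\naccuracy,0.80\n")
    record_run(script, metrics, lock)
    fresh = is_up_to_date(script, metrics, lock)

    # Editing the script without re-running it leaves the metrics stale.
    script.write_text("print('training v2')\n")
    stale = not is_up_to_date(script, metrics, lock)

print("fresh after recording:", fresh)
print("stale after editing train.py:", stale)
```

The same check catches all three scenarios above: a hand-edited metrics.csv changes the metrics hash, an edited train.py changes the script hash, and a renamed output file no longer exists at the recorded path.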

Long story short, I set out to make a video about how to do something like a git diff for model metrics and then report it in a Pull Request with GitHub Actions. But I ended up telling a longer story about why and how to use ML pipelines to ensure that your model metrics are reproducibly regenerated on every branch of your project. It got bigger than I expected, but I hope you'll find the tutorial worth it!

Video tutorial 🎥 When data is too big for Git

Elle O'Brien — Thu, 06 Aug 2020 20:32:24 +0000

Have you ever tried to put a large dataset or model weights into Git? Git is amazing except when it comes to big files... which happens pretty often in machine learning.

As part of an MLOps Tutorials series, I made a video covering:

Git fundamentals for ML
How to add external storage (from Google Drive!) to a GitHub repo to store datasets and trained models

There are also some inklings of a topic we'll develop further in upcoming videos: what does it mean to version data as code? How do we create high-level abstractions to separate data from the way it's stored? Stay tuned.

VIDEO 🎥 MLOps tutorial: Intro to continuous integration for ML

Elle O'Brien — Fri, 24 Jul 2020 23:26:03 +0000

Earlier this month, my team launched CML, our latest open-source project in the MLOps space. We think it's a step towards establishing powerful DevOps practices (like continuous integration) as a regular fixture of machine learning and data science projects.

iterative / cml

♾️ CML - Continuous Machine Learning | CI/CD for ML

What is CML? Continuous Machine Learning (CML) is an open-source CLI tool for implementing continuous integration & delivery (CI/CD) with a focus on MLOps. Use it to automate development workflows — including machine provisioning, model training and evaluation, comparing ML experiments across project history, and monitoring changing datasets.

CML can help train and evaluate models — and then generate a visual report with results and metrics — automatically on every pull request.

An example report for a neural style transfer model.

CML principles:

GitFlow for data science. Use GitLab or GitHub to manage ML experiments, track who trained ML models or modified data and when. Codify data and models with DVC instead of pushing to a Git repo.
Auto reports for ML experiments. Auto-generate reports with metrics and plots in each Git pull request. Rigorous engineering practices help your team make informed, data-driven decisions.
No additional services. Build your…

But there are plenty of challenges ahead, and a big one is literacy.

So many data scientists, like developers, are self-taught. Data science degrees have only recently emerged on the scene, which means if you polled a handful of senior-level data scientists, there'd almost certainly be no universal training or certificate among them. Moreover, there's still no widespread agreement about what it takes to be a data scientist: is it an engineering role with a little bit of TensorFlow sprinkled on top? A title for statisticians who can code? We're not expecting an easy resolution to these existential questions anytime soon.

In the meantime, we're starting a video series to help data scientists curious about DevOps (and developers and engineers curious about data science!) get started. Through hands-on coding examples and use cases, we want to give data science practitioners the fundamentals to explore, use, and influence MLOps.

The first video in this series uses a lightweight and fairly popular data science problem (building a model to predict wine quality ratings) as a playground to introduce continuous integration.

The tutorial covers:

Using Git-flow in a data science project (making a feature branch and pull request)
Creating your first GitHub Action to train and evaluate a model
Using CML to generate visual reports in your pull request summarizing model performance
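
The report that ends up in the pull request is ordinary Markdown, so the training script only needs to write its results out as text. Here is a minimal sketch of that step, assuming the scores are available in a dict; the function name, metric names, and numbers are hypothetical, and posting the text to the pull request is left to CML as shown in the video.

```python
def build_report(title, metrics):
    """Render a dict of metric names to numbers as a Markdown table."""
    lines = [f"## {title}", "", "| Metric | Value |", "| --- | --- |"]
    for name, value in sorted(metrics.items()):
        lines.append(f"| {name} | {value:.3f} |")
    return "\n".join(lines) + "\n"


# Hypothetical scores standing in for the output of a real evaluation run.
report = build_report("Model metrics", {"accuracy": 0.81, "f1": 0.742})
print(report)
```

Writing the returned string to a file such as report.md gives the workflow a single artifact to hand off in its final step.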

Code for the project is available online so you can follow along!

elleobrien / wine

wine prediction dataset

Wine quality prediction

Modelling a Kaggle dataset of red wine properties and quality ratings.

We also recommend checking out the CML docs for more details, tutorials, and use cases.

If you have questions, the best way to get in touch is by leaving a comment on the blog, video, or our Discord channel. We're especially interested to hear what use cases you'd like to see covered in future videos: tell us about your data science project and how you could imagine using continuous integration, and we might be able to create a video!