DEV Community

Elle O'Brien

MLOps Tutorial 🎦: Track models with Git & GitHub Actions

Did you know you can use Git to keep track of your ML models? Yes, you can use Git to snapshot your project at many stages of development! Then with GitHub Actions, you can take your work to the next level by automating repetitive processes like model training and reporting. I'm creating a video series to help people take advantage of these software tools for data science and ML.

One of the big ideas around Git is to use branches to develop new features. In data science, this can look like using new branches to try out new modeling approaches or ways of processing data. So today I've released a new video tutorial about a frequent question:

How do I compare ML models on different Git branches?

The answer (and the video!) goes a little more in-depth than you might expect. There's an easy approach, and then there's a good approach.

Easy answer. If your model training and evaluation scripts create a metrics file (say, metrics.csv), then you could use

$ git diff metrics.csv
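
As written, that command compares your working copy of metrics.csv against the staging area. To compare two branches you would name them, for example `git diff main my-experiment -- metrics.csv` (the branch names are placeholders). If the raw diff is hard to scan, a small script can line the numbers up instead. Here's a minimal sketch, assuming metrics.csv has a header row with `metric` and `value` columns (an assumed layout, not something this tutorial prescribes). The CSV text for each branch could come from something like `git show main:metrics.csv`.

```python
import csv
import io


def parse_metrics(csv_text):
    """Parse CSV text with 'metric' and 'value' columns (assumed layout)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["metric"]: float(row["value"]) for row in reader}


def compare_metrics(base_text, branch_text):
    """Return {metric: (base, branch, delta)}; a missing side is None."""
    base = parse_metrics(base_text)
    branch = parse_metrics(branch_text)
    result = {}
    for name in sorted(set(base) | set(branch)):
        old, new = base.get(name), branch.get(name)
        # A delta only makes sense when the metric exists on both sides.
        delta = None if old is None or new is None else new - old
        result[name] = (old, new, delta)
    return result


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    main_csv = "metric,value\naccuracy,0.80\nloss,0.50\n"
    feature_csv = "metric,value\naccuracy,0.85\nloss,0.40\n"
    for name, (old, new, delta) in compare_metrics(main_csv, feature_csv).items():
        print(name, old, new, delta)
```

A metric that exists on only one branch shows up with `None` on the other side, which is a quick hint that the two branches are not producing the same file.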

Good answer. So you can do a git diff of your metrics file, but aside from being a little hard to read, there's another issue:

What if metrics.csv is modified by different processes on different branches?

For example, on the main branch of a project, I might run a script train.py that creates metrics.csv. But there's no guarantee that on a feature branch, I or a teammate will keep train.py and metrics.csv in sync. A few scenarios:

  • Someone manually updates metrics.csv on their branch
  • Someone changes train.py but forgets to re-run it, so metrics.csv is never re-generated
  • Someone modifies train.py on a feature branch to output a reformatted file (metrics.json instead of .csv, perhaps), or to output an entirely different file (score.csv).

To avoid these kinds of errors, we need to make sure that our metrics file is tightly linked to the processes that produced it (and any other processes it depends on, like standardizing data).
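
One way to picture that link: record a fingerprint of everything the metrics file depends on, and stop trusting the metrics once the fingerprint no longer matches. The sketch below is only an illustration of the idea, not necessarily the tooling used in the video. The lock file name `metrics.lock.json` and the helper names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_hash(path):
    """SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def write_lock(deps, outputs, lock_path="metrics.lock.json"):
    """Record hashes of dependencies and outputs after a successful run."""
    lock = {
        "deps": {str(p): file_hash(p) for p in deps},
        "outs": {str(p): file_hash(p) for p in outputs},
    }
    Path(lock_path).write_text(json.dumps(lock, indent=2))


def is_up_to_date(lock_path="metrics.lock.json"):
    """True only if every recorded file still exists with the same hash."""
    lock_file = Path(lock_path)
    if not lock_file.exists():
        return False
    lock = json.loads(lock_file.read_text())
    for section in ("deps", "outs"):
        for path, recorded in lock[section].items():
            if not Path(path).exists() or file_hash(path) != recorded:
                return False
    return True
```

After running train.py you might call `write_lock(["train.py"], ["metrics.csv"])` and commit the lock file. A later `is_up_to_date()` check would then flag each scenario above: a hand-edited metrics.csv, a changed train.py that was never re-run, or an output file that was renamed or removed.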

So, long story short: I set out to make a video about how to do something like a git diff for model metrics and then report it in a Pull Request with GitHub Actions. But I ended up telling a longer story about why and how to use ML pipelines to ensure that your model metrics are reproducibly regenerated on every branch of your project. It got bigger than I expected, but I hope you'll find the tutorial worth it!
