Did you know you can use Git to keep track of your ML models? Yes, you can use Git to snapshot your project at many stages of development! Then with GitHub Actions, you can take your work to the next level by automating repetitive processes like model training and reporting. I'm creating a video series to help people take advantage of these software tools for data science and ML.
One of the big ideas around Git is to use branches to develop new features. In data science, this can look like using new branches to try out new modeling approaches or ways of processing data. So today I've released a new video tutorial about a frequent question:
How do I compare ML models on different Git branches?
The answer (and the video!) goes a little more in depth than you might expect. There's an easy approach, and then there's a good approach.
Easy answer. If your model training and evaluation scripts create a metrics file (say, `metrics.csv`), then you could use
$ git diff metrics.csv
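Since a raw diff of a CSV can be hard to scan, here is a minimal sketch of a more readable comparison between two branches. It rests on assumptions the post does not state: that `metrics.csv` has two columns named `metric` and `value`, and that the helper names below are purely illustrative. It reads each branch's committed copy of the file with `git show <branch>:<path>`.

```python
import csv
import io
import subprocess


def read_metrics(csv_text):
    """Parse CSV text with assumed columns 'metric' and 'value' into a dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["metric"]: float(row["value"]) for row in reader}


def file_on_branch(branch, path="metrics.csv"):
    """Return the committed contents of `path` on `branch` via `git show`."""
    result = subprocess.run(
        ["git", "show", f"{branch}:{path}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def compare_metrics(base, other):
    """Return (name, base_value, other_value, delta) for each metric in either dict.

    A value is None when the metric is missing on that side, and delta is
    None unless both sides have the metric.
    """
    rows = []
    for name in sorted(set(base) | set(other)):
        old, new = base.get(name), other.get(name)
        delta = new - old if old is not None and new is not None else None
        rows.append((name, old, new, delta))
    return rows


def compare_branches(base_branch, other_branch, path="metrics.csv"):
    """Compare the committed metrics file on two branches."""
    base = read_metrics(file_on_branch(base_branch, path))
    other = read_metrics(file_on_branch(other_branch, path))
    return compare_metrics(base, other)
```

Inside a repository you might call `compare_branches("main", "experiment")`, where both branch names are placeholders, and print each row. Note that this only compares whatever was committed; it says nothing about whether those numbers are up to date.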
Good answer. You can run a `git diff` on your metrics file, but aside from the output being a little hard to read, there's another issue: what if `metrics.csv` is modified by different processes on different branches?
For example, on the `main` branch of a project, I might run a script `train.py` that creates `metrics.csv`. But there's no guarantee that on a feature branch, I or a teammate will keep `metrics.csv` in sync. A few scenarios:
- Someone manually updates `metrics.csv` on their branch
- Someone changes `train.py` but forgets to re-run it, so `metrics.csv` is never regenerated
- Someone modifies `train.py` on a feature branch to output a reformatted file (`.csv`, perhaps), or to output an entirely different file
To avoid these kinds of errors, we need to make sure that our metrics file is tightly linked to the processes that produced it (and any other processes it depends on, like standardizing data).
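One way to picture "tightly linked" is to store a fingerprint of the inputs and outputs at training time, so that a stale or hand-edited metrics file can be detected later. The sketch below only illustrates that idea and is not the approach taken in the video: the function names and the `metrics.lock.json` sidecar file are hypothetical, and a real pipeline tool would handle far more cases.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(paths):
    """Combine the names and contents of the given files into one SHA-256 digest."""
    digest = hashlib.sha256()
    for path in sorted(str(p) for p in paths):
        digest.update(path.encode())
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()


def record_run(deps, outputs, lock_file="metrics.lock.json"):
    """Call right after training: remember which inputs produced which outputs."""
    state = {"deps": fingerprint(deps), "outputs": fingerprint(outputs)}
    Path(lock_file).write_text(json.dumps(state))


def is_fresh(deps, outputs, lock_file="metrics.lock.json"):
    """Return True only if neither inputs nor outputs changed since record_run."""
    lock = Path(lock_file)
    if not lock.exists():
        return False
    state = json.loads(lock.read_text())
    return state == {"deps": fingerprint(deps), "outputs": fingerprint(outputs)}
```

A check such as `is_fresh(["train.py"], ["metrics.csv"])` returns `False` when someone edits `metrics.csv` by hand or changes `train.py` without re-running it, which covers the first two scenarios in the list above.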
So, long story short: I set out to make a video about how to do something like a `git diff` for model metrics and then report it in a Pull Request with GitHub Actions. But I ended up telling a longer story about why and how to use ML pipelines to ensure that your model metrics are reproducibly regenerated on every branch of your project. It got bigger than I expected, but I hope you'll find the tutorial worth it!