jaeyow for AWS Community Builders

Small to Reasonable Scale MLOps - An Approach to Effective and Scalable MLOps when you're not a Giant like Google

Quick recap of the AWS Summit ANZ

For me, one of the more memorable presentations at the recently concluded AWS Summit 2022 ANZ was the one where Carsales described their strategy for scaling their AI/ML operations. They did not have a large Data Science team, but as they took on more and more Data Science projects, they needed an effective strategy for scaling their AI operations.

It's undeniable that leadership is instrumental in the success of any company and project; however, I was intrigued by one of the ML tool choices that helped them reach their goal. I was so curious about this choice that I just had to learn more about it, so in this article I will talk about a sound strategy for effectively scaling your AI/ML undertaking and a tool that makes it possible - Metaflow.

Why Metaflow

Metaflow was created at Netflix, where it was used internally in demanding real-life data science projects, and it was open-sourced in 2019. Because of its tight integration with AWS, it plays really well with many AWS services; in fact, these integrations are all described in detail here.

Metaflow logo

As I researched it and used it in a project, I came to the realization that its secret is its simplicity. But don't let this simplicity fool you.

In a nutshell, Metaflow allows you to create DAGs (Directed Acyclic Graphs) - so we are touching on graph theory here, but in the end a DAG is really just a fancy term for a workflow, one that doesn't form a closed loop.

These DAGs, combined with Python, serverless compute, and the open stack, make a very powerful combination. The result is the democratization of the Machine Learning function, making it easier than ever to kick off that personal ML project, or to scale your company's capability in the ML space.
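To make the DAG idea concrete, here is a minimal sketch in plain Python (deliberately not Metaflow's API - in Metaflow you would subclass FlowSpec and mark steps with @step). The step names below are hypothetical; the point is only that "acyclic" means the steps can always be executed in an order that respects every dependency:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# A toy ML workflow: each key is a step, each value is the set of
# steps it depends on. No step can ever wait on itself, directly or
# indirectly - that is what makes the graph acyclic.
dag = {
    "load_data": set(),
    "feature_engineering": {"load_data"},
    "train_model_a": {"feature_engineering"},
    "train_model_b": {"feature_engineering"},
    "evaluate": {"train_model_a", "train_model_b"},
}

def run_workflow(dag):
    """Execute the steps in an order that respects every dependency."""
    order = list(TopologicalSorter(dag).static_order())
    for step in order:
        print(f"running {step}")
    return order

if __name__ == "__main__":
    run_workflow(dag)
```

Metaflow handles this ordering (and much more - retries, data passing, remote execution) for you; you only declare the edges with `self.next(...)` inside each step.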


The typical machine learning process starts with simple experiments, mostly done on laptops or PCs, and this can easily be done with Metaflow. The realization that you are resource constrained will come pretty quickly once more complex algorithms or gigantic volumes of data come into the picture.

However, with the same Metaflow Python scripts (oh yes, you can use R too), plus a sprinkling of decorators, you will have the ability to leverage almost unlimited cloud compute (both GPU and CPU), gigabytes of memory, and a long list of mostly open source SaaS/PaaS tools.
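In Metaflow, that "sprinkling" is decorators like @batch and @resources on individual steps. The toy decorator below is not Metaflow code - it is a hedged stdlib sketch of the pattern: an annotation carries the resource request, while the step's body stays unchanged, which is what lets the same script move from laptop to cloud:

```python
import functools

def resources(cpu=1, memory=4096):
    """Toy stand-in for a Metaflow-style @resources decorator.

    In the real tool, annotating a step like this is all it takes to
    request that the step run remotely with the given resources. Here
    we just record the request and run the function locally.
    """
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            print(f"[would run {fn.__name__} with cpu={cpu}, memory={memory}MB]")
            return fn(*args, **kwargs)
        return inner
    return wrap

@resources(cpu=8, memory=16384)
def train(n_samples):
    # Stand-in for an expensive training step.
    return f"model trained on {n_samples} samples"

if __name__ == "__main__":
    print(train(1_000_000))
```

The key design point: the function body never mentions infrastructure, so removing or swapping the decorator later costs nothing.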

Analysis Paralysis

Remember the feeling of trying to select a tool, but feeling paralyzed for fear of being stuck with it forever? Metaflow will not only enable you to easily pick an ML tool and integrate it into your project, it will also allow you to abandon that choice relatively easily once a better one comes along - and rest assured that you are using a tool that was battle-tested at Netflix.

This article comes with a simple example project. Although its algorithms don't require the resources a more complicated model would, it works with real-world data. I originally created it when I started following Formula 1 more regularly and was looking for something I could learn Data Science and ML with, and wouldn't mind spending countless hours on.

However, if you want a more realistic problem, more worked-out open source examples can be found here and here, and hopefully you will believe me that you don't need to be the size of Google to tackle these types of Data Science problems.

Example ML project with Metaflow

The example project I have here is a very simple workflow that, although it consumes a real-world dataset, was really only created for learning purposes. The following image shows some of the technologies I used to get it working.

Example project with Metaflow

Most of the code is from my General Assembly capstone project, where I go through the process of consuming data pulled from a public API, do a bit of feature engineering, integrate with another popular Machine Learning tool called Comet ML and with GitHub Actions, then train multiple algorithms in parallel - all repeatable, since Metaflow keeps track of all experiment metadata.

Simple Model Training Pipeline using Metaflow

The intention is to show how easy it is to leverage Metaflow not just for orchestrating the parallel workflow, but also for enabling the repeatability of your experiments. The pipeline finishes after the parallel training and testing of models; in reality, you could perform a myriad of tasks after this, such as model selection, model deployment, or even scheduling for retraining.
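The fan-out/fan-in shape of that pipeline (one training branch per algorithm, gathered afterwards) is what Metaflow's foreach branching gives you. As a hedged, self-contained sketch of the same shape using only the standard library - the algorithm names and scoring are dummies, not the project's actual models:

```python
from concurrent.futures import ThreadPoolExecutor

def train_and_score(algorithm, data):
    """Pretend to fit `algorithm` on `data` and return a score.

    A real branch would fit a model here; we derive a dummy score
    so the fan-out/fan-in structure is the only thing on show.
    """
    return algorithm, round(sum(data) / len(data), 2)

def parallel_training(algorithms, data):
    """Fan out one training task per algorithm, then gather results -
    the join step of the workflow."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(train_and_score, algo, data) for algo in algorithms]
        return dict(f.result() for f in futures)

if __name__ == "__main__":
    scores = parallel_training(["linear", "forest", "boosting"], [0.7, 0.8, 0.9])
    print(scores)
```

In Metaflow the fan-out is declared with `self.next(self.train, foreach="algorithms")` and the gather with a join step, and each branch can carry its own @resources request.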

Code is freely available here.


The main disconnect for many people new to Data Science and Machine Learning is the difficulty of shipping models to production. Many Data Science courses may teach you the basics of the whys and hows of using algorithms and building models, but then throw you to the wolves when it comes to deployment and scalability. With Metaflow, it is easy to create ML pipelines for development while you're working on your laptop, and when it is time to push to production, there is minimal work involved in moving that workload to the AWS cloud.

The example project shows that it was easy to create a workflow that performs feature engineering, trains and tests models in parallel, and integrates third-party tools. This is only the tip of the iceberg: Metaflow really does enable you to do ML using the open stack, third-party SaaS offerings, and many free tools, and to do ML on par with the big players.
