jaeyow for AWS Community Builders

Posted on Aug 14, 2022 • Originally published at fullstackdeveloper.tips

Going to Production with GitHub Actions, Metaflow and AWS SageMaker

#machinelearning #mlops #datascience #python

A scalable (and cost-effective) strategy to transition your Machine Learning project from prototype to production

Fully working source code is freely available here.

Three is a crowd

GitHub Actions, Metaflow and AWS SageMaker are awesome technologies by themselves; however, they are seldom mentioned in the same sentence, and even less often used together in the same Machine Learning project.

GitHub Actions is every software developer's favorite CI/CD tool, from the makers of GitHub. It is used by millions of developers around the globe, is very popular in the Open Source community, and is mostly free.

Metaflow, the Data Science framework developed at Netflix, is a very attractive proposition: it lets you move easily from the prototype stage to production without a separate specialist team of platform engineers, which makes your data science team more productive.

And lastly, AWS SageMaker has been in the ML game for a while, and AWS is still the leading Cloud provider (for Cloud services and ML workloads) in the world.

But why use the three together?

It all boils down to the use of Metaflow and its inherent scalability. You can start using Metaflow even when you are still running your experiments on your laptop.

But when you want to move to the cloud, you will need a basic (and free or cheap) scheduler/orchestrator to kick off your workflows: enter GitHub Actions. Using GitHub Actions as your workflow scheduler/orchestrator has some limitations, but for simple projects, or when you're leaning on SageMaker for your training and deployment workloads, it is more than sufficient.

As for AWS SageMaker, teams that are already in the AWS ecosystem will appreciate the almost unlimited deployment possibilities at their fingertips. Typically SageMaker workflows are created and controlled through the AWS console, but to enable us to go to Production in this article, we will be using the AWS Python SDK, through Metaflow.

What does Going to Production mean?

A typical Data Science project starts with the Prototyping stage, and the perennial favorite tool among data professionals is Jupyter Notebooks. It is also my personal playground when I start looking at DS and ML problems, and it's awesome for quick iterations.

It allows me to quickly play with the data, create visualizations, transform data, train models and, to an extent, deploy them. It is awesome, and it has become my favorite too. However, if and when the project needs to progress to production, we need a better tool and work practice. Many teams still deploy to production from their laptop and Jupyter Notebook setup, and though you can, you really shouldn't.

The main characteristic of production workflows is that they should run without a human operator: they should start, execute, and output results automatically. Note that automation doesn’t imply that they work in isolation. They can start as a result of some external event, such as new data becoming available. --Ville Tuulos, Effective Data Science Infrastructure

Because my background is in Software Engineering, I've grown used to continuous integration and deployment for virtually all my software projects: every project should be buildable and deployable at the click of a button, or as a result of a schedule or event, even multiple times a day if required.

Everything is automated, and deployments are typically stress-free, unremarkable events.

This is exactly what we want when we push ML projects to Production: an automated, stress-free, unremarkable process.

The Sample Project

As I'm currently studying towards the AWS ML Specialty certification, I wanted to play a bit more with the built-in algorithms that SageMaker has in its portfolio. One of the first algorithms I checked out was Linear Learner. It's a type of supervised learning algorithm for solving classification and regression problems.

I did not create the notebook from scratch; I based it on the AWS SageMaker example An Introduction to Linear Learner with MNIST.

I basically ported it into Metaflow DAGs as seen in the flow image below:

I've leaned on Python for all my ML work. Metaflow is based on Python (you can use R too), and AWS SageMaker has an extensive Python SDK at your disposal.

Looking at the above flow, we can run an end-to-end ML workflow, from model training on AWS SageMaker ML instances to endpoint creation with your required AWS instance type.

For brevity, I have skipped the part of serving that endpoint to the public, but this can easily be done by putting the endpoint behind an API Gateway and a Lambda. Perhaps we can cover this in a separate article. Also take note that we delete the endpoint before we end the flow to avoid the dreaded AWS bill, especially as we are just checking out the service.

	# Excerpt from the FlowSpec class; assumes `from metaflow import step` at module level.
	@step
	def model_training(self):
	    """
	    Model training
	    - now training starts, first we specify the Docker image for the required algorithm, in this case linear learner
	    - create an estimator with the specified parameters,
	    - set the static hyperparameters, and SageMaker will automatically calculate those set as 'auto'
	    - calling fit() starts the training process, up to the specified number of epochs
	    - then save the model name and location for the next steps
	    - take note that we have to specify an instance for training, which may be different from the endpoint instance
	    """
	    import boto3
	    import sagemaker
	    from sagemaker import image_uris

	    # Docker image of the built-in Linear Learner algorithm for the current region
	    image = image_uris.retrieve(region=boto3.Session().region_name, framework="linear-learner")

	    self.output_location = f"s3://{self.bucket}/{self.prefix}/output"
	    print(f"training artifacts will be uploaded to: {self.output_location}")

	    session = sagemaker.Session()

	    linear = sagemaker.estimator.Estimator(
	        image,
	        self.role,
	        instance_count=1,
	        instance_type="ml.c4.xlarge",
	        output_path=self.output_location,
	        sagemaker_session=session,
	    )
	    linear.set_hyperparameters(
	        epochs=10,
	        feature_dim=784,
	        predictor_type="binary_classifier",
	        mini_batch_size=200,
	    )

	    # blocks until the SageMaker training job has finished
	    linear.fit({"train": self.s3_train_data})

	    # after an Estimator fit, the model will have been persisted in the defined S3 output location base folder
	    self.model_data = linear.model_data
	    print(f"Estimator model data: {self.model_data}")

	    self.next(self.create_sagemaker_model)


In this sample project, I have also used GitHub Actions as our workflow scheduler/orchestrator. This is really to prove that GitHub Actions, even in the free tier, can be used in 'production', albeit in a very simple flow.
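One way a CI job could kick off the flow is through a small Python entry point that shells out to Metaflow's command line. This is a minimal sketch, assuming the flow lives in a file called `flow.py` (a hypothetical name); the `run` command and its `--max-workers` option are standard Metaflow CLI, while the helper functions are illustrative and not part of the sample project.

```python
import subprocess
import sys


def build_run_command(flow_file="flow.py", max_workers=None):
    """Build the argv list for `python <flow_file> run`."""
    # `flow.py` is a hypothetical file name; point it at your own flow.
    command = [sys.executable, flow_file, "run"]
    if max_workers is not None:
        command += ["--max-workers", str(max_workers)]
    return command


def run_flow(flow_file="flow.py", max_workers=None):
    """Run the flow; a non-zero exit raises CalledProcessError and fails the CI job."""
    subprocess.run(build_run_command(flow_file, max_workers), check=True)
```

A workflow step triggered by a repository push could then call `run_flow()`. Because `check=True` propagates a failing flow as an exception, the CI run is marked as failed instead of passing silently.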

It's worth mentioning, though, that Metaflow pushes AWS Step Functions as the workflow orchestrator of choice if we really want one that is robust and easy to use, and Metaflow has an official Step Functions integration available. This is also a topic that I'd like to cover in a future post.

Metaflow performs the SageMaker operations, from model training to model creation to creating and testing the endpoint, all through the Python SDK, where we really have the power of SageMaker at our fingertips.
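The endpoint part of that sequence might look roughly like the helper below. It is a sketch under stated assumptions rather than the sample project's code: it relies on the SageMaker Python SDK's `Estimator.deploy()`, `Predictor.predict()` and `Predictor.delete_endpoint()`, the function name and the default instance type are hypothetical, and the estimator is passed in so the helper can be exercised with a stand-in object.

```python
def predict_then_delete(estimator, payload, instance_type="ml.m4.xlarge", **deploy_kwargs):
    """Deploy a fitted estimator, send one request, and always delete the endpoint.

    `estimator` is expected to behave like a fitted sagemaker.estimator.Estimator.
    The default instance type is an assumption, not taken from the article.
    """
    predictor = estimator.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        **deploy_kwargs,  # for example serializer=... or deserializer=...
    )
    try:
        return predictor.predict(payload)
    finally:
        # Runs even when predict() raises, so no endpoint is left running.
        predictor.delete_endpoint()
```

The `try`/`finally` keeps the clean-up next to the deployment, so a failed test request cannot leave a billable endpoint behind.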

Summary

In this article we've ported the Python code from the AWS MNIST example into Metaflow, trained our ML model in AWS SageMaker, and built and deployed the model to a SageMaker endpoint in production. This is all automated by GitHub Actions (in this example the workflow gets triggered by a repository push), and brought together and made possible by the awesome Metaflow.

Thanks, I've made the source code freely available here.

Let me know if you have any concerns, or any questions, or if you want to collaborate!

Resources
