nicolaspinocea for AWS Community Builders

Posted on Dec 8, 2022 • Edited on Jul 12, 2023

ML-Ops: An automated routine for training and deploying a model using the AWS MLOps Workload Orchestrator and Step Functions

#machinelearning #mlops #aws #deploy

Introduction

One of the greatest challenges for an organization that develops AI solutions, and one that is vital to adding value, is distributing those solutions among its different stakeholders and turning them into tools that support decision-making. Achieving this collaboration and visibility for the data science team requires evolving towards operational solutions: taking a model beyond a Jupyter notebook so that it can be consumed by different teams within the company, while also optimizing the different stages of the machine learning lifecycle.

Next, we show how to create and deploy a machine learning model automatically. Specifically, we build a Step Functions state machine that orchestrates a SageMaker hyperparameter tuning job and then queries its results to obtain the name and path of the best-performing model. That artifact is then registered in SageMaker and an Endpoint is deployed, ready to be queried from external applications. Since this last step is executed through the MLOps Workload Orchestrator, a Lambda function attached to API Gateway is also provisioned, exposing a REST API that can be queried from outside AWS using the corresponding IAM credentials.

Resource creation

Since the resources attached to the Endpoint, i.e. the Lambda function together with API Gateway and its REST API, will be provisioned through the orchestrator, you first need to deploy the MLOps Workload Orchestrator CloudFormation template in the development account (the link to the template is in the reference section).
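If you prefer scripting the deployment over using the console, the following is a minimal sketch that creates the stack with boto3 and waits for it to finish. The helper name is hypothetical, and the stack name, template URL and parameter values are placeholders to be taken from the orchestrator's documentation. The capabilities list is an assumption, based on the template creating IAM resources.

```python
def deploy_orchestrator_stack(cfn_client, stack_name, template_url, parameters):
    """Hypothetical helper: create the orchestrator stack and wait for it.

    cfn_client is expected to be boto3.client('cloudformation').
    template_url and parameters are placeholders; use the values from the
    orchestrator's documentation.
    """
    response = cfn_client.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        # CloudFormation expects parameters as a list of key/value dicts
        Parameters=[
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in parameters.items()
        ],
        # Assumption: the template creates IAM resources and may use transforms
        Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND"],
    )
    # Block until the stack reaches CREATE_COMPLETE (the waiter raises on failure)
    cfn_client.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return response["StackId"]
```

Using a waiter keeps the script from moving on to the Lambda and Step Functions setup before the orchestrator's resources exist.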

Lambda

Since the goal is to automate the whole pipeline, the first Lambda function generates a random name for the SageMaker hyperparameter tuning job, so that every execution launches a job with a fresh name.

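import uuid

def lambda_handler(event, context):
    # Use the first 6 characters of a random UUID as a short unique suffix
    random_uuid = str(uuid.uuid4())[:6]
    # SageMaker limits tuning job names to 32 characters; this one has 30
    job_name = f'training-hyperparameter-{random_uuid}'

    return job_name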
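SageMaker only accepts tuning job names of at most 32 characters, made of letters, digits and hyphens, that start and end with a letter or digit. The prefix above plus the 6-character suffix gives 30 characters, so it fits. If you change the prefix, a check like the following sketch (the helper name is hypothetical) catches invalid names before the state machine reaches SageMaker.

```python
import re
import uuid

# Character pattern documented for HyperParameterTuningJobName; the pattern
# alone does not bound hyphens, so the length limit is checked separately
TUNING_JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9]){0,31}")
MAX_TUNING_JOB_NAME_LENGTH = 32

def make_tuning_job_name(prefix="training-hyperparameter"):
    """Hypothetical helper: build '<prefix>-<6 hex chars>' and validate it."""
    name = f"{prefix}-{str(uuid.uuid4())[:6]}"
    if len(name) > MAX_TUNING_JOB_NAME_LENGTH or not TUNING_JOB_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"invalid tuning job name: {name!r} ({len(name)} chars)")
    return name
```

Failing inside the Lambda gives a clearer error in the Step Functions execution history than a rejected `CreateHyperParameterTuningJob` call.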

Extract the path of the best model

Next, we create a second Lambda function that extracts the artifact of the best-performing training job launched by the hyperparameter tuning job.

The metric that defines which model is the 'best' is set when the tuning job is created, so this function only needs the name of the training job that achieved the best performance.

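def lambda_handler(event, context):
    # Name of the training job with the best value of the tuning objective metric
    best_training_job_name = event['BestTrainingJob']['TrainingJobName']

    # Artifact key under the 'models/' prefix used as the training output path
    return f'models/{best_training_job_name}/output/model.tar.gz'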
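Building the key by string concatenation relies on the tuning job writing its output under a `models/` prefix. As an alternative, the artifact location can be read back from SageMaker: `describe_hyper_parameter_tuning_job` returns `BestTrainingJob`, and `describe_training_job` returns the exact `S3ModelArtifacts` URI. The sketch below assumes the artifact lives in the same bucket the orchestrator reads from, so it returns only the object key; the function name is hypothetical.

```python
from urllib.parse import urlparse

def best_model_artifact_key(sm_client, tuning_job_name):
    """Hypothetical helper: S3 key of the best training job's model.tar.gz.

    sm_client is expected to be boto3.client('sagemaker').
    """
    tuning = sm_client.describe_hyper_parameter_tuning_job(
        HyperParameterTuningJobName=tuning_job_name
    )
    # Training job that scored best on the tuning objective metric
    best_job_name = tuning["BestTrainingJob"]["TrainingJobName"]

    training = sm_client.describe_training_job(TrainingJobName=best_job_name)
    # Full URI, e.g. s3://<bucket>/models/<job>/output/model.tar.gz
    artifact_uri = training["ModelArtifacts"]["S3ModelArtifacts"]
    # Drop the scheme and bucket, keep the key relative to the bucket
    return urlparse(artifact_uri).path.lstrip("/")
```

This costs two extra API calls but stays correct if the training output path changes.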

The next Lambda receives the previous result, i.e. the artifact path of the best model, and hands it to the orchestrator by committing a configuration file to its CodeCommit repository. That commit triggers the orchestrator pipeline, which deploys the Endpoint with its respective API.

import json
import os
import uuid

import boto3

def lambda_handler(event, context):
    client = boto3.client('codecommit')
    # Short random suffix so every deployment gets unique model and endpoint names
    random_uuid = str(uuid.uuid4())[:6]
    repository_name = os.environ['REPOSITORY_NAME']

    # Orchestrator configuration; `event` is the artifact path returned by the previous Lambda.
    # Serializing a dict with json.dumps avoids hand-assembled JSON strings.
    config = {
        "pipeline_type": "byom_realtime_builtin",
        "model_framework": "xgboost",
        "model_framework_version": "1",
        "model_name": f"best-model-{random_uuid}",
        "model_artifact_location": event,
        "data_capture_location": "stepfunctionssample-sagemak-bucketformodelanddata-1vkv7vhuej3kt/capture",
        "inference_instance": "ml.m5.large",
        "endpoint_name": f"endpoint-best-{random_uuid}",
    }
    config_json = json.dumps(config, indent=2)

    # The new commit must reference the current head of the branch as its parent
    response = client.get_branch(
        repositoryName=repository_name,
        branchName='main',
    )
    last_commit_id = response['branch']['commitId']

    response = client.create_commit(
        repositoryName=repository_name,
        branchName='main',
        ...
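The `create_commit` call above is truncated. A minimal sketch of a complete call follows. It assumes the orchestrator watches a file named `mlops-config.json` at the root of the repository; the helper name, author name, email and commit message are illustrative placeholders.

```python
import json

def commit_orchestrator_config(cc_client, repository_name, config, branch="main"):
    """Hypothetical helper: commit the orchestrator config on top of `branch`.

    cc_client is expected to be boto3.client('codecommit').
    """
    # A new commit must name the current tip of the branch as its parent
    head = cc_client.get_branch(repositoryName=repository_name, branchName=branch)
    parent_commit_id = head["branch"]["commitId"]

    response = cc_client.create_commit(
        repositoryName=repository_name,
        branchName=branch,
        parentCommitId=parent_commit_id,
        authorName="step-functions-pipeline",  # placeholder
        email="mlops@example.com",  # placeholder
        commitMessage="Deploy best model from hyperparameter tuning",
        putFiles=[
            {
                # Assumption: the file name the orchestrator expects
                "filePath": "mlops-config.json",
                "fileContent": json.dumps(config, indent=2).encode("utf-8"),
            }
        ],
    )
    return response["commitId"]
```

Because the model and endpoint names carry a random suffix, each commit changes the file content; CodeCommit rejects a commit that writes a file with identical content.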

Orchestration using Step Functions

With the building blocks in place, we orchestrate each of the stages designed above using AWS Step Functions. We create a State Machine, whose visual editor lets us diagram the process logic and chain together each of the components we developed.
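The visual editor produces an Amazon States Language (ASL) definition. A minimal sketch of an equivalent definition, built in Python, follows. The function and state names are hypothetical, the Lambda ARNs are placeholders, and `tuning_config` must hold the `HyperParameterTuningJobConfig` and `TrainingJobDefinition` of your own tuning job. The `.sync` integration waits for the tuning job to finish and returns its description, which contains the `BestTrainingJob` field read by the second Lambda.

```python
import json

# Service integration that creates the tuning job and waits for it to finish
TUNING_RESOURCE = "arn:aws:states:::sagemaker:createHyperParameterTuningJob.sync"

def build_state_machine_definition(name_fn_arn, best_model_fn_arn, commit_fn_arn, tuning_config):
    """Hypothetical helper: chain the three Lambdas and the tuning job as ASL JSON."""
    definition = {
        "Comment": "Hyperparameter tuning, best-model selection and deployment commit",
        "StartAt": "GenerateTuningJobName",
        "States": {
            "GenerateTuningJobName": {
                "Type": "Task",
                "Resource": name_fn_arn,
                # Keep the execution input and add the Lambda's return value to it
                "ResultPath": "$.TuningJobName",
                "Next": "HyperparameterTuning",
            },
            "HyperparameterTuning": {
                "Type": "Task",
                "Resource": TUNING_RESOURCE,
                "Parameters": {
                    # The ".$" suffix marks a JSONPath into the state input
                    "HyperParameterTuningJobName.$": "$.TuningJobName",
                    **tuning_config,
                },
                "Next": "ExtractBestModelPath",
            },
            "ExtractBestModelPath": {
                # Input: tuning job description; output: artifact key
                "Type": "Task",
                "Resource": best_model_fn_arn,
                "Next": "CommitOrchestratorConfig",
            },
            "CommitOrchestratorConfig": {
                # Input: artifact key; commits the orchestrator configuration
                "Type": "Task",
                "Resource": commit_fn_arn,
                "End": True,
            },
        },
    }
    return json.dumps(definition, indent=2)
```

The returned string can be passed as `definition` to the Step Functions `create_state_machine` API together with an execution role. Since the first state writes into `$.TuningJobName`, the execution input should be a JSON object such as `{}`.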

Execute State Machine

Finally, we execute the designed logic: it launches a hyperparameter tuning job, selects the best model, and deploys it behind a REST API through the AWS MLOps Workload Orchestrator deployment pipeline.
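Besides starting it from the console, the state machine can be launched programmatically. A minimal sketch follows, assuming a boto3 Step Functions client and the ARN of the state machine; the helper name is hypothetical. It passes an empty JSON object as input and polls until the execution is no longer running.

```python
import json
import time

def run_pipeline(sfn_client, state_machine_arn, poll_seconds=30):
    """Hypothetical helper: start an execution and wait for a final status.

    sfn_client is expected to be boto3.client('stepfunctions').
    """
    execution = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps({}),
    )
    arn = execution["executionArn"]
    while True:
        status = sfn_client.describe_execution(executionArn=arn)["status"]
        if status != "RUNNING":
            return status  # e.g. SUCCEEDED or FAILED
        time.sleep(poll_seconds)
```

In this architecture, SUCCEEDED means the configuration was committed; the Endpoint itself is deployed afterwards by the orchestrator's own pipeline.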

Conclusion

In this post, we presented an approach to building a machine learning pipeline that covers both training and deployment in the AWS cloud, combining two orchestration services: Step Functions and the MLOps Workload Orchestrator. Because each step is a Lambda function, the modeling logic can keep being extended, for example by launching the process when new files arrive in S3, retraining a model, or adding monitoring jobs to analyze the behavior of a model in production.
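As an example of the first improvement, a Lambda function subscribed to S3 `ObjectCreated` notifications could start the state machine whenever new data arrives. This sketch is hypothetical: the `STATE_MACHINE_ARN` environment variable and the `TrainingDataS3Uri` input field are assumptions, and the optional `sfn_client` argument only exists so the handler can be exercised without AWS.

```python
import json
import os
from urllib.parse import unquote_plus

def lambda_handler(event, context, sfn_client=None):
    if sfn_client is None:
        import boto3  # only needed when running inside Lambda
        sfn_client = boto3.client("stepfunctions")

    execution_arns = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded ('+' for spaces)
        key = unquote_plus(record["s3"]["object"]["key"])
        response = sfn_client.start_execution(
            stateMachineArn=os.environ["STATE_MACHINE_ARN"],
            # Hypothetical input field carrying the new object's location
            input=json.dumps({"TrainingDataS3Uri": f"s3://{bucket}/{key}"}),
        )
        execution_arns.append(response["executionArn"])
    return execution_arns
```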

Acknowledgement

I would like to highlight the collaboration of my friend and teammate @matiasgonzalezes, who was fundamental to the developments shown in this post.
