About this post
This post is a sequel to the previous one below. Please refer to the earlier post before reading this one.
Let's build a simple MLOps workflow on AWS! #1 - ML model preperation - DEV Community
https://dev.to/hikarunakatani/lets-build-a-simple-mlops-workflow-on-aws-1-ml-model-preperation-3af8
Overview
In the previous post, I showed how to implement a simple deep learning model. However, that code was intended for a local laptop environment and was purely experimental. By containerizing the application, you can ensure consistent and reproducible execution across different environments. This approach also enables the use of container orchestration tools like Kubernetes, which simplify managing, scaling, and orchestrating ML training jobs. Running machine learning tasks on a container orchestration tool is especially beneficial for training large ML models, as it allows for distributed training across multiple nodes and efficient resource utilization.
In this post, I'll explain how to run the training code as a Docker container on Amazon ECS. Additionally, I'll demonstrate how to automatically build and deploy the container when changes are made to the model.
Without further ado, let's first look at the overall architecture needed to implement this workflow!
Architecture of the system
In this system, the following workflow will be executed:
- A developer pushes an ML model to the GitHub repository
- The training task, including the model, is automatically built as a Docker image and pushed to the ECR repository
- EventBridge detects the push in the ECR repository and invokes a Lambda function
- Lambda function invokes the ECS task
- The pre-trained ML model gets automatically saved to an S3 bucket
To achieve this, we'll tackle the following tasks step-by-step:
- Preparing AWS resources to automate the deployment process of the training task.
- Building a CI/CD pipeline for the ML model to automatically push Docker images to the repository.
- Testing that the automated deployment process works properly.
In this post, I'll only explain how to implement the first step. For building the AWS resources, I chose Terraform so that the infrastructure is defined as code and can be torn down and rebuilt easily while experimenting.
Preparing AWS resources with Terraform
There are a number of small resources needed to implement the whole system, but I'll focus on the core service settings required to implement the workflow.
EventBridge
In order to trigger the ECS task in an event-driven manner, you have to define an event pattern in EventBridge. I used an event pattern that detects push events in the ECR repository. After that, you need to set the Lambda function as a target of the event rule.
# EventBridge
resource "aws_cloudwatch_event_rule" "ecr_push_rule" {
  name        = "${var.project_name}-run-ecs-task"
  description = "Trigger an ECS task when an image is pushed to ECR"
  event_pattern = jsonencode({
    "source" : ["aws.ecr"],
    "detail-type" : ["ECR Image Action"],
    "detail" : {
      "repository-name" : [aws_ecr_repository.main.name],
      "action-type" : ["PUSH"],
    },
  })
}

resource "aws_cloudwatch_event_target" "ecr_push_target" {
  rule      = aws_cloudwatch_event_rule.ecr_push_rule.name
  target_id = "run-index-py-function"
  arn       = aws_lambda_function.invoke_task.arn
}
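To see what this rule actually matches, here is a small Python sketch that applies the same conditions to a sample event. The event payload and repository name below are illustrative values I made up, not a real captured EventBridge event:

```python
# A simplified ECR image-push event, modeled on the "ECR Image Action"
# detail-type (illustrative values, not a real captured event)
sample_event = {
    "source": "aws.ecr",
    "detail-type": "ECR Image Action",
    "detail": {
        "repository-name": "cifar10-repo",
        "action-type": "PUSH",
        "result": "SUCCESS",
    },
}


def matches_push_rule(event, repository_name):
    """Mimic the EventBridge pattern: match only pushes to one repository."""
    detail = event.get("detail", {})
    return (
        event.get("source") == "aws.ecr"
        and event.get("detail-type") == "ECR Image Action"
        and detail.get("repository-name") == repository_name
        and detail.get("action-type") == "PUSH"
    )


print(matches_push_rule(sample_event, "cifar10-repo"))  # → True
print(matches_push_rule(sample_event, "other-repo"))    # → False
```

EventBridge evaluates these conditions for you, so the Lambda target is only invoked for pushes to the one repository named in the rule.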
Lambda
We use a Lambda function to invoke the training task in ECS. The content of the Lambda function is as follows:
import logging
import os
import sys

import boto3

# Set up logging
logger = logging.getLogger()
for h in logger.handlers:
    logger.removeHandler(h)
h = logging.StreamHandler(sys.stdout)
FORMAT = "%(levelname)s [%(funcName)s] %(message)s"
h.setFormatter(logging.Formatter(FORMAT))
logger.addHandler(h)
logger.setLevel(logging.INFO)

ecs = boto3.client("ecs")


def run_ecs_task(cluster, task_definition, subnets, security_groups):
    """
    Run an ECS task.

    Parameters:
        cluster (str): The name of the ECS cluster.
        task_definition (str): The ARN of the task definition.
        subnets (str): Comma-separated subnet IDs for the task.
        security_groups (str): Comma-separated security group IDs for the task.

    Returns:
        None
    """
    try:
        response = ecs.run_task(
            cluster=cluster,
            taskDefinition=task_definition,
            launchType="FARGATE",
            count=1,
            networkConfiguration={
                "awsvpcConfiguration": {
                    "subnets": subnets.split(","),
                    "securityGroups": security_groups.split(","),
                    "assignPublicIp": "ENABLED",
                }
            },
        )
        logger.info(f"Response: {response}")
        failures = response.get("failures", [])
        if failures:
            logger.error(f"Task failures: {failures}")
    except Exception as e:
        logger.error(f"Error running ECS task: {e}")


def lambda_handler(event, context):
    """
    AWS Lambda function handler.

    Parameters:
        event (dict): The event data passed by the AWS Lambda service.
        context (LambdaContext): The context data passed by the AWS Lambda service.

    Returns:
        None
    """
    try:
        # Get configuration from environment variables
        ECS_CLUSTER = os.environ["ECS_CLUSTER"]
        TASK_DEFINITION_ARN = os.environ["TASK_DEFINITION_ARN"]
        AWSVPC_CONF_SUBNETS = os.environ["AWSVPC_CONF_SUBNETS"]
        AWSVPC_CONF_SECURITY_GROUPS = os.environ["AWSVPC_CONF_SECURITY_GROUPS"]

        logger.info(f"ECS_CLUSTER: {ECS_CLUSTER}")
        logger.info(f"TASK_DEFINITION_ARN: {TASK_DEFINITION_ARN}")

        run_ecs_task(
            ECS_CLUSTER,
            TASK_DEFINITION_ARN,
            AWSVPC_CONF_SUBNETS,
            AWSVPC_CONF_SECURITY_GROUPS,
        )
    except Exception as e:
        logger.error(f"An error occurred while running ECS task: {e}")
Basically, it sends an API call to an ECS cluster to start the task using the AWS SDK (boto3). Please note that you need to specify some settings, such as the ECS cluster name, task definition ARN, VPC subnet, and security groups, to invoke the task. These settings are acquired through the environment variables embedded in the Lambda runtime.
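Since the subnet and security group settings arrive as plain comma-separated strings, it is worth seeing how they expand into the structure that `run_task` expects. The helper below is my own sketch (not part of the repository), and the subnet/security-group IDs are hypothetical:

```python
def build_network_configuration(subnets, security_groups):
    """Expand comma-separated IDs into the awsvpcConfiguration structure
    passed to ecs.run_task (the same shape used in run_ecs_task)."""
    return {
        "awsvpcConfiguration": {
            "subnets": subnets.split(","),
            "securityGroups": security_groups.split(","),
            "assignPublicIp": "ENABLED",
        }
    }


# Hypothetical IDs for illustration
conf = build_network_configuration("subnet-aaa,subnet-bbb", "sg-ccc")
print(conf["awsvpcConfiguration"]["subnets"])         # → ['subnet-aaa', 'subnet-bbb']
print(conf["awsvpcConfiguration"]["securityGroups"])  # → ['sg-ccc']
```

Keeping the environment variables as single strings avoids having to JSON-encode lists in Terraform; the Lambda side just splits on commas.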
To build this handler, we need to prepare the Lambda function in Terraform as shown below:
# Lambda function
resource "aws_lambda_function" "invoke_task" {
  # If the file is not in the current working directory you will need to include a
  # path.module in the filename.
  filename         = "lambda_function.zip"
  function_name    = "${var.project_name}-invoke-task"
  role             = aws_iam_role.lambda_execution_role.arn
  handler          = "invoke_task.lambda_handler"
  source_code_hash = data.archive_file.lambda.output_base64sha256
  runtime          = "python3.9"

  environment {
    variables = {
      ECS_CLUSTER                 = aws_ecs_cluster.main.name
      TASK_DEFINITION_ARN         = aws_ecs_task_definition.main.arn
      AWSVPC_CONF_SUBNETS         = aws_subnet.private1a.id
      AWSVPC_CONF_SECURITY_GROUPS = aws_security_group.ecs.id
    }
  }
}
An important point here is setting the environment variables properly so that the Lambda function has all the information it needs to run the training task. Also, avoid hardcoding these values for better security and operational efficiency. For a more secure solution, I highly recommend using AWS Secrets Manager or AWS Systems Manager Parameter Store instead of plain environment variables.
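Whichever configuration source you use, it helps to fail fast with a clear message when something is missing, rather than hitting a bare `KeyError` mid-handler. The helper below is my own sketch, not part of the repository:

```python
import os

# The variables the handler expects, matching the Terraform environment block
REQUIRED_VARS = [
    "ECS_CLUSTER",
    "TASK_DEFINITION_ARN",
    "AWSVPC_CONF_SUBNETS",
    "AWSVPC_CONF_SECURITY_GROUPS",
]


def load_config(environ=os.environ):
    """Read the required variables, raising one clear error that lists
    everything missing instead of failing on the first KeyError."""
    missing = [name for name in REQUIRED_VARS if name not in environ]
    if missing:
        raise RuntimeError(
            f"Missing environment variables: {', '.join(missing)}"
        )
    return {name: environ[name] for name in REQUIRED_VARS}
```

Calling `load_config()` at the top of `lambda_handler` surfaces a misconfigured deployment in a single log line.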
ECS cluster
# Task Definition
resource "aws_ecs_task_definition" "main" {
  family                   = "${var.project_name}-task"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "2048" # 2 vCPU
  memory                   = "8192" # 8 GB RAM
  task_role_arn            = aws_iam_role.ecs_task_role.arn
  execution_role_arn       = aws_iam_role.ecs_task_exec.arn

  container_definitions = jsonencode([
    {
      name      = "${var.project_name}-container"
      image     = "${aws_ecr_repository.main.repository_url}:latest"
      cpu       = 2048
      memory    = 4096
      essential = true
      portMappings = [
        {
          containerPort = 80
          hostPort      = 80
        }
      ]
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-create-group"  = "true"
          "awslogs-region"        = "ap-northeast-1"
          "awslogs-group"         = "${var.project_name}-log-group"
          "awslogs-stream-prefix" = "ecs"
        }
      }
    }
  ])
}
Training an ML model usually requires a GPU, but I chose CPU because this model doesn't demand that much computing power. Also, GPUs are only supported for ECS on EC2, which requires more complex settings.
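One gotcha with Fargate is that only certain CPU/memory pairs are accepted; an invalid pair is rejected when the task definition is registered. The sketch below encodes the commonly documented combinations at the time of writing (verify against the current AWS documentation before relying on it):

```python
# Valid memory (MiB) values per Fargate CPU-unit value, as commonly
# documented (assumed snapshot; check current AWS docs)
FARGATE_MEMORY_MIB = {
    256: [512, 1024, 2048],
    512: [1024 * i for i in range(1, 5)],    # 1 GB - 4 GB
    1024: [1024 * i for i in range(2, 9)],   # 2 GB - 8 GB
    2048: [1024 * i for i in range(4, 17)],  # 4 GB - 16 GB
    4096: [1024 * i for i in range(8, 31)],  # 8 GB - 30 GB
}


def is_valid_fargate_size(cpu, memory):
    """Check whether a (cpu, memory) pair is an accepted Fargate task size."""
    return memory in FARGATE_MEMORY_MIB.get(cpu, [])


print(is_valid_fargate_size(2048, 8192))  # → True (the task size used above)
print(is_valid_fargate_size(2048, 3072))  # → False
```

The 2 vCPU / 8 GB task size used in the task definition falls inside the 4–16 GB band allowed for 2048 CPU units.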
There are quite a few other resources you need to define, but I won't cover all of them here to keep this post simple. If you're interested in the complete resource settings, please refer to the repository below:
hikarunakatani/cifar10-aws: Simple MLOps workflows
https://github.com/hikarunakatani/cifar10-aws
CI/CD of infrastructure using GitHub Actions
As we defined the infrastructure using Terraform, we can apply CI/CD practices to it as well. We use GitHub Actions to build the CI/CD pipeline. The workflow definition is as follows:
# Execute terraform apply when changes are merged to main branch
name: "Terraform Apply"

on:
  push:
    branches: main

env:
  TF_VERSION: 1.6.5
  AWS_REGION: ap-northeast-1

jobs:
  terraform:
    name: terraform
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: write
      pull-requests: write
      issues: write
      statuses: write

    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - uses: aws-actions/configure-aws-credentials@v1 # Use OIDC token
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Terraform setup
        uses: hashicorp/setup-terraform@v1
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Setup tfcmt
        env:
          TFCMT_VERSION: v3.4.1
        run: |
          wget "https://github.com/suzuki-shunsuke/tfcmt/releases/download/${TFCMT_VERSION}/tfcmt_linux_amd64.tar.gz" -O /tmp/tfcmt.tar.gz
          tar xzf /tmp/tfcmt.tar.gz -C /tmp
          mv /tmp/tfcmt /usr/local/bin
          tfcmt --version

      - name: Terraform init
        run: terraform init

      - name: Terraform fmt
        run: terraform fmt

      - name: Terraform apply
        id: apply
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        # Make apply results comment on commit
        run: tfcmt apply -- terraform apply -auto-approve -no-color -input=false
This is an example of a workflow for the apply process. The trigger for this workflow is a push to the main branch, so the terraform apply command runs when changes are merged into main.
When you want to manipulate AWS resources from GitHub, obviously you need to set AWS credential information. However, directly putting secret information in your repository poses security risks. Instead, you can use an OIDC token to get temporary AWS credential information. This way, you only need to put the ARN of the IAM role in your GitHub account, which is much safer.
Once the workflows have executed properly, you can view the results in the "Actions" tab on GitHub, like this:
If you see the output saying "Apply complete!", you can confirm that your infrastructure has been successfully deployed to the AWS environment.
In the next post, I'll explain how to integrate the training code we created in the first post into the system.