Hikaru

Let's build a simple MLOps workflow on AWS! #3 - Running ML training as a container

About this post

This post is a sequel to the previous one below. Please refer to the earlier post before reading this one.

Let's build a simple MLOps workflow on AWS! #2 - Building infrastructure on AWS - DEV Community
https://dev.to/hikarunakatani/lets-build-a-simple-mlops-workflow-on-aws-2-building-infrastructure-on-aws-3h2j

Overview

We prepared the ML model in the first post and set up the infrastructure on AWS in the second. Now let's configure the settings needed to run the training process as a container and see how it works.

Access to the S3 bucket

As we run this training task on an ECS cluster in a private subnet, we can't use an internet connection to download the training data or upload the trained model. Instead, we place the data in an S3 bucket in advance and access it through a VPC endpoint using the AWS SDK. To PUT/GET objects from the S3 bucket, you can implement the code like this:

import zipfile
import boto3
import botocore.session
from botocore.exceptions import ClientError


def download_data():
    """Download training data from Amazon S3 bucket
    Used when run as an ECS task
    """

    s3 = boto3.client("s3")
    bucket_name = "cifar10-mlops-bucket"
    file_key = "data.zip"
    local_file_path = "data.zip"
    extract_to = "./"

    # Enable botocore debug logging (optional; useful for troubleshooting VPC endpoint access)
    botocore.session.Session().set_debug_logger()

    # Download the file from S3
    try:
        s3.download_file(bucket_name, file_key, local_file_path)
        print("File downloaded successfully.")

        # Extract the contents of the zip file
        with zipfile.ZipFile(local_file_path, "r") as zip_ref:
            zip_ref.extractall(extract_to)
            print(f"Zip file extracted successfully to '{extract_to}'.")

    except ClientError as e:
        print(f"An error occurred while downloading training data {e}")


def upload_model():
    """Upload pre-trained model to S3 bucket"""

    s3_client = boto3.client("s3")
    file_path = "model.pth"
    bucket_name = "cifar10-mlops-bucket"
    object_key = "model.pth"

    try:
        s3_client.upload_file(file_path, bucket_name, object_key)
        print(f"Uploaded {file_path} to {bucket_name}/{object_key}")
    except ClientError as e:
        print(f"An error occurred while uploading {e}")

When accessing an S3 bucket from a private subnet, ensure that the access policy of the VPC endpoint and the S3 bucket itself allows the necessary permissions. Additionally, the target ECS task must be explicitly permitted to access the bucket through its task role.
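For example, the task role needs s3:GetObject and s3:PutObject on the bucket. Here is a minimal sketch of attaching such an inline policy with boto3; the role and policy names are hypothetical, and in practice you'd define this in your IaC alongside the rest of the infrastructure:

import json

import boto3

iam = boto3.client("iam")

# Minimal inline policy granting the task role read/write access to the bucket
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::cifar10-mlops-bucket/*",
        }
    ],
}

iam.put_role_policy(
    RoleName="cifar10-mlops-task-role",  # hypothetical ECS task role name
    PolicyName="AllowTrainingBucketAccess",
    PolicyDocument=json.dumps(policy),
)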

Making a Docker image

# Use a slim Python base image
FROM python:3.8-slim-buster

# Set the working directory in the container to /app
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install any needed packages specified in requirements.txt
RUN pip install --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt

# Make port 80 available to the world outside this container
EXPOSE 80

# Run main.py when the container launches
ENTRYPOINT ["python", "main.py"]
CMD ["--env", "ecs"]

This is a basic example of a Dockerfile for running the training task as a container. An important point here is specifying the environment with CMD ["--env", "ecs"]. Since CMD only supplies default arguments to the ENTRYPOINT, you can still override it locally, e.g. with docker run <image> --env local. This argument is necessary because in a local environment the training data has to be downloaded directly from the internet, as shown below, so the program's behavior must change depending on the environment:

import os

import torch
import torchvision
import torchvision.transforms as transforms

import aws_action  # module containing download_data() / upload_model() shown earlier

download_flag = False

# Dataset directory
data_dir = "./data"

# Download training data if it doesn't exist
# env holds the value of the --env CLI argument (parsing shown below)
if not os.path.exists(os.path.join(data_dir, "cifar-10-batches-py")):
    if env == "local":
        download_flag = True
    elif env == "ecs":
        aws_action.download_data()

# Preprocess data
# Transform PIL Image to tensor
# Normalize a tensor image in each channel with mean and standard deviation
transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
)

# Download the CIFAR-10 training set
trainset = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=download_flag, transform=transform
)
trainloader = torch.utils.data.DataLoader(
    trainset, batch_size=4, shuffle=True, num_workers=os.cpu_count()
)
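The snippet above assumes env has already been set. The original code for this isn't shown, but a minimal argparse sketch that matches the --env flag passed via CMD might look like this:

import argparse

parser = argparse.ArgumentParser(description="CIFAR-10 training entry point")
parser.add_argument(
    "--env",
    choices=["local", "ecs"],
    default="local",
    help="Execution environment; 'ecs' fetches data from S3 instead of the internet",
)
args = parser.parse_args()
env = args.env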

Setting up GitHub Actions

Now that we have a Dockerfile, we can build a Docker image locally and manually push it to the ECR repository. However, our initial goal is to fully automate the entire training process, so we want to avoid doing this manually. Instead, let's manage our training program on GitHub and push the image using GitHub Actions.

Here's a sample GitHub Actions workflow:

name: Push Docker image to Amazon ECR (manual)

on:
  workflow_dispatch:

env:
  AWS_REGION: ap-northeast-1

jobs:
  push:
    runs-on: ubuntu-latest

    permissions:
      id-token: write
      contents: read
      pull-requests: write

    steps:
    - name: Checkout
      uses: actions/checkout@v2

    - name: Get OIDC token
      uses: aws-actions/configure-aws-credentials@v1 # Use OIDC token
      with:
        role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
        aws-region: ${{ env.AWS_REGION }}

    - name: Login to Amazon ECR
      id: login-ecr
      uses: aws-actions/amazon-ecr-login@v1

    - name: Build, tag, and push image to Amazon ECR
      env:
        ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
        ECR_REPOSITORY: cifar10-mlops-repository
        # IMAGE_TAG: ${{ github.sha }}
        IMAGE_TAG: latest
      run: |
        docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
        docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG

In this example, the GitHub Actions workflow runs the docker build command to build a container image and the docker push command to push it to the ECR repository. To keep the deployment process simple, the workflow above uses the latest tag. However, using the latest tag in a production environment is not recommended because it provides no way to track which version of the image is deployed. Instead, it's better to use a unique tag based on the Git commit SHA, such as github.sha, as shown in the official GitHub example.

When changing the tag name of the Docker image, you also need to update the image specified in the ECS task definition. Therefore, in the example below, the task definition is dynamically rendered and deployed with the appropriate image tag.

- name: Build, tag, and push image to Amazon ECR
  id: build-image
  env:
    ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
    IMAGE_TAG: ${{ github.sha }}
  run: |
    # Build a docker container and
    # push it to ECR so that it can
    # be deployed to ECS.
    docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
    docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
    echo "image=$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG" >> $GITHUB_OUTPUT

- name: Fill in the new image ID in the Amazon ECS task definition
  id: task-def
  uses: aws-actions/amazon-ecs-render-task-definition@c804dfbdd57f713b6c079302a4c01db7017a36fc
  with:
    task-definition: ${{ env.ECS_TASK_DEFINITION }}
    container-name: ${{ env.CONTAINER_NAME }}
    image: ${{ steps.build-image.outputs.image }}

- name: Deploy Amazon ECS task definition
  uses: aws-actions/amazon-ecs-deploy-task-definition@df9643053eda01f169e64a0e60233aacca83799a
  with:
    task-definition: ${{ steps.task-def.outputs.task-definition }}
    service: ${{ env.ECS_SERVICE }}
    cluster: ${{ env.ECS_CLUSTER }}
    wait-for-service-stability: true

Deploying to Amazon Elastic Container Service - GitHub Docs
https://docs.github.com/en/actions/deployment/deploying-to-your-cloud-provider/deploying-to-amazon-elastic-container-service

Testing the whole process

Since we have all the components for the entire CI/CD process, we can now test the complete training process. When you push changes to the model repository, GitHub Actions will automatically build the image and push it to the ECR repository. If it succeeds, you'll see a message like the one below in the Actions tab.

(Screenshot: successful workflow run in the GitHub Actions tab)

After that, the Lambda function will be triggered by an EventBridge rule. Let's check if the Lambda function is running properly.
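As a reminder, the Lambda function essentially wraps a single ecs.run_task call. Here's a minimal sketch; the cluster, task definition, and subnet names are hypothetical placeholders:

import boto3

ecs = boto3.client("ecs")


def lambda_handler(event, context):
    """Triggered by the EventBridge rule; starts the training task on Fargate."""
    response = ecs.run_task(
        cluster="cifar10-mlops-cluster",        # hypothetical cluster name
        taskDefinition="cifar10-mlops-task",    # hypothetical task definition family
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-XXXXXXXX"],  # private subnet ID(s)
                "assignPublicIp": "DISABLED",
            }
        },
    )
    # The run_task response is what you'll see in the function's output below
    print(response)
    return {"statusCode": 200}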

(Screenshot: Lambda function invocation result)

In the response of the Lambda function, you'll see information like the following:

{
  "tasks": [
    {
      "attachments": [
        {
          "id": "********-****-****-****-************",
          "type": "ElasticNetworkInterface",
          "status": "PRECREATED",
          "details": [
            {
              "name": "subnetId",
              "value": "subnet-************"
            }
          ]
        }
      ],
      "attributes": [
        {
          "name": "ecs.cpu-architecture",
          "value": "x86_64"
        }
      ],
      "availabilityZone": "ap-northeast-1a",
      "clusterArn": "arn:aws:ecs:ap-northeast-1:************:cluster/*************-cluster",
      "containers": [
        {
          "containerArn": "arn:aws:ecs:ap-northeast-1:************:container/*************-cluster/*************/************",
          "taskArn": "arn:aws:ecs:ap-northeast-1:************:task/*************-cluster/*************",
          "name": "*************-container",
          "image": "************.dkr.ecr.ap-northeast-1.amazonaws.com/*************-repository:latest",
          "lastStatus": "PENDING",
          "networkInterfaces": [],
          "cpu": "2048",
          "memory": "4098"
        }
      ],
      "cpu": "2048",
      "createdAt": "2024-05-26T03:40:35.437000+00:00",
      "desiredStatus": "RUNNING",
      "enableExecuteCommand": false,
      "group": "family:*************-task",
      "lastStatus": "PROVISIONING",
      "launchType": "FARGATE",
      "memory": "8192",
      "overrides": {
        "containerOverrides": [
          {
            "name": "*************-container"
          }
        ],
        "inferenceAcceleratorOverrides": []
      },
      "platformVersion": "1.4.0",
      "platformFamily": "Linux",
      "tags": [],
      "taskArn": "arn:aws:ecs:ap-northeast-1:************:task/*************-cluster/*************",
      "taskDefinitionArn": "arn:aws:ecs:ap-northeast-1:************:task-definition/*************-task:15",
      "version": 1,
      "ephemeralStorage": {
        "sizeInGiB": 20
      }
    }
  ],
  "failures": [],
  "ResponseMetadata": {
    "RequestId": "********-****-****-****-************",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "x-amzn-requestid": "********-****-****-****-************",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "1556",
      "date": "Sun, 26 May 2024 03:40:34 GMT"
    },
    "RetryAttempts": 0
  }
}

The response is quite long, but essentially, what you need to check is that the status code is 200 and that the "failures" array is empty. By verifying this information, you can confirm that the Lambda function is properly triggered by the EventBridge rule.
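If you want the Lambda function itself to surface placement errors instead of relying on manual checks, a small guard inside the hypothetical handler sketched above does the trick:

# Inside the handler, right after run_task: fail loudly on placement errors
if response.get("failures"):
    raise RuntimeError(f"run_task reported failures: {response['failures']}")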
Next, let's also check whether the ECS task is invoked properly. If it is, you'll see the new task created on the ECS console screen.

(Screenshot: new task running on the ECS console)

Once the training has started successfully, you can follow the current status of the training process in the logs.

(Screenshot: training progress in the task logs)

After training finishes, you can confirm that the process completed properly by checking that the trained model was uploaded to the S3 bucket.
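A quick programmatic check works too; here's a minimal sketch, reusing the bucket and key names from earlier:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
try:
    # head_object raises a ClientError (404) if the key doesn't exist
    meta = s3.head_object(Bucket="cifar10-mlops-bucket", Key="model.pth")
    print(f"model.pth found ({meta['ContentLength']} bytes)")
except ClientError:
    print("model.pth not found; the training task may have failed")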

(Screenshot: model.pth uploaded to the S3 bucket)

Well done! We've completed the entire workflow! 👏
There are certainly many improvements to make if you want a production-level MLOps setup, but it's perfectly fine to start with what you're familiar with. Building on your current knowledge and gradually expanding your skills is a solid approach to mastering MLOps.
