A step-by-step walkthrough from Docker image to a live, serverless ML application running in the cloud
Introduction
Deploying a machine learning project is often where things get humbling. You've trained a model, built a pipeline, maybe even wired up a slick dashboard, and then you stare at your terminal wondering how to get any of it onto a real server that other people can access. I've been there.
In this article, I'll walk you through exactly how I deployed a real-time machine learning application on AWS, from pushing a Docker image to ECR, to running two live containers on ECS Fargate, to watching my dashboard update in real time. Every command here actually worked. I'll explain what each step does, why it matters, and what you should expect to see as output.
Whether you're deploying your first ML project or looking for a reference you can actually trust, this guide is for you.
The Project: Real-Time Bike Rental Forecasting
Before we get into the AWS infrastructure, let me briefly describe what we're deploying.
I built a Real-Time Bike Rental Forecasting Application: a live machine learning system that predicts hourly bike rental demand and visualizes the results in real time. The dashboard refreshes every second, showing a live comparison of predicted vs. actual bike rental counts over a rolling time window you control.
The application is built around three core services running simultaneously:
- Model Training Pipeline: trains a CatBoost forecasting model on historical bike rental data
- Inference Service: runs every second, simulating 1 hour of dataset time per tick, and produces predictions
- Real-Time Dashboard: a Dash web app that fetches and plots the latest predicted vs. actual counts
The interface gives operators a simple control panel (you can adjust the display window from as little as 1 hour to 24 hours of history) and a live chart that updates continuously. When the model is tracking well, you see the green predicted line closely shadowing the pink actual line. It's a satisfying thing to watch.
Tech stack:
- Kedro: ML Pipeline Orchestration
- CatBoost: Forecasting Model
- Dash (Plotly): Dashboard
- Docker: Containerization
- Amazon ECR: Image Registry
- Amazon ECS with Fargate: Container Orchestration
- Amazon EFS: Shared Storage
The UI and inference service run as separate Docker containers. They share data through a mounted volume: the inference container writes predictions, and the dashboard reads them. On AWS, that shared volume is handled by Amazon EFS.
Now let's deploy it.
Prerequisites
Before starting, make sure you have the following:
- AWS CLI installed and configured (aws configure with your credentials)
- Docker installed and running
- PowerShell (these commands use PowerShell syntax; adapt to bash on Linux/Mac)
- An AWS account with sufficient permissions (IAM, ECR, ECS, EC2, EFS)
Step 1 — Define Your Session Variables
Start every deployment session by setting these variables. They're referenced throughout every subsequent command, so getting them right upfront saves you a lot of pain.
$REGION = "us-east-1"
$ACCOUNT_ID = aws sts get-caller-identity --query Account --output text
$SERVICE_NAME = "bike-rental-forecasting"
What this does: $REGION sets the AWS region for all operations. $ACCOUNT_ID dynamically fetches your 12-digit AWS account number using the STS (Security Token Service) API, no hardcoding needed. $SERVICE_NAME is the name we'll use consistently for ECR repositories, images, and services.
Expected output for $ACCOUNT_ID:
048908710060
A 12-digit number. If this fails, your AWS CLI isn't configured. Run aws configure first.
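If you're on Linux or Mac, the same session variables translate to bash along these lines. The account ID below is a placeholder for illustration; on a configured machine you'd fetch it dynamically, as shown in the comment:

```shell
# Bash equivalents of the PowerShell session variables.
REGION="us-east-1"
SERVICE_NAME="bike-rental-forecasting"

# With a configured AWS CLI you would fetch this dynamically:
#   ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
# Placeholder used here for illustration:
ACCOUNT_ID="123456789012"

# The full ECR image address used in later steps is assembled from these three values:
ECR_URI="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${SERVICE_NAME}"
echo "$ECR_URI"
```

Everything in the later steps is just string-building on top of these three values, which is why getting them right upfront matters.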
Step 2 — Push Your Docker Image to Amazon ECR
Amazon ECR (Elastic Container Registry) is AWS's managed Docker image registry. Think of it like Docker Hub, but private and tightly integrated with the rest of AWS. ECS will pull your image from here when launching containers.
2.1 — Authenticate Docker with ECR
aws ecr get-login-password --region $REGION |
docker login --username AWS --password-stdin "$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com"
This pipes a temporary ECR authentication token directly into the Docker login command. The token is valid for 12 hours.
Expected output:
Login Succeeded
If you see an authentication error, verify your IAM user has the AmazonEC2ContainerRegistryFullAccess policy attached.
2.2 — Create the ECR Repository (First Time Only)
If you haven't already created a repository for this project:
aws ecr create-repository --repository-name $SERVICE_NAME --region $REGION
Expected output: A JSON block describing your new repository, including the repositoryUri which looks like:
048908710060.dkr.ecr.us-east-1.amazonaws.com/bike-rental-forecasting
2.3 — Build Your Docker Image
docker build -f Dockerfile.aws -t $SERVICE_NAME .
We use a dedicated Dockerfile.aws here. It's good practice to keep a separate Dockerfile for cloud deployments that strips out local dev dependencies and optimizes for image size.
Expected output: A series of build steps ending in:
Successfully built <image-id>
Successfully tagged bike-rental-forecasting:latest
2.4 — Tag and Push the Image
docker tag "${SERVICE_NAME}:latest" `
"$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/${SERVICE_NAME}:latest"
docker push "$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/${SERVICE_NAME}:latest"
The tag command gives your local image the full ECR address. The push uploads it layer by layer.
Expected output:
The push refers to repository [048908710060.dkr.ecr.us-east-1.amazonaws.com/bike-rental-forecasting]
latest: digest: sha256:abc123... size: 1234
Tip: If the push fails mid-way with a TLS timeout, don't panic — just re-run the push command. Docker is smart enough to detect already-uploaded layers and resume from where it left off. Only the remaining layers will be transferred.
Step 3 — Create the IAM Task Execution Role
ECS containers don't have your credentials. They need an IAM role to do things like pull images from ECR and write logs to CloudWatch. This role is called the task execution role, and it must exist before you can register any task definition.
3.1 — Write the Trust Policy and Create the Role
$trustPolicy = @"
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": { "Service": "ecs-tasks.amazonaws.com" },
"Action": "sts:AssumeRole"
}]
}
"@
[System.IO.File]::WriteAllText(
"$PWD\ecs-trust-policy.json", $trustPolicy,
[System.Text.UTF8Encoding]::new($false))
aws iam create-role --role-name ecsTaskExecutionRole `
--assume-role-policy-document file://ecs-trust-policy.json
The trust policy tells AWS: "ECS tasks are allowed to assume this role." We write it to a file first (without a BOM, hence the UTF8Encoding trick) because the AWS CLI reads it from disk.
Expected output: A JSON block describing the newly created role, including its ARN:
{
"Role": {
"RoleName": "ecsTaskExecutionRole",
"Arn": "arn:aws:iam::048908710060:role/ecsTaskExecutionRole",
...
}
}
3.2 — Attach the Required Policies
aws iam attach-role-policy --role-name ecsTaskExecutionRole `
--policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
aws iam attach-role-policy --role-name ecsTaskExecutionRole `
--policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
AmazonECSTaskExecutionRolePolicy covers CloudWatch Logs and Secrets Manager access. AmazonEC2ContainerRegistryReadOnly lets the task pull images from ECR. Both are AWS-managed policies, so they're always up to date.
Expected output: No output means success. AWS CLI returns nothing for successful attach-role-policy calls.
Note: If the role already exists from a previous deployment, you'll get an error on create-role. That's fine — skip creation and jump straight to the attach commands.
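If create-role rejects the policy with a MalformedPolicyDocument error, a stray BOM or a JSON typo is the usual cause. A quick, dependency-free sanity check you can run locally (the policy text mirrors the one above):

```python
import json

# Same trust policy as in the step above.
trust_policy = """
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "ecs-tasks.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
"""

# json.loads rejects a leading BOM and any syntax error outright,
# which is exactly the class of problem create-role chokes on.
policy = json.loads(trust_policy)

statement = policy["Statement"][0]
assert statement["Principal"]["Service"] == "ecs-tasks.amazonaws.com"
assert statement["Action"] == "sts:AssumeRole"
print("trust policy OK")
```

If this parses cleanly but the CLI still complains, re-check that the file on disk was written without a BOM, which is what the UTF8Encoding trick above guarantees.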
Step 4 — Create the ECS Cluster
The cluster is the logical grouping that holds all your services and tasks. With Fargate, there are no EC2 instances to manage; the cluster is purely an organizational boundary.
aws ecs create-cluster --cluster-name bike-rental-cluster --region $REGION
Expected output:
{
"cluster": {
"clusterName": "bike-rental-cluster",
"status": "ACTIVE",
"registeredContainerInstancesCount": 0,
"runningTasksCount": 0,
...
}
}
registeredContainerInstancesCount: 0 is expected with Fargate; there are no managed EC2 instances. Fargate provisions compute on-demand when tasks launch.
Step 5 — Configure Networking (VPC, Subnet, Security Group)
ECS Fargate tasks run inside your VPC. We need to identify the right subnet for them to launch in, and create a security group that controls what traffic can reach them.
5.1 — Get the Default VPC and Subnet
$VPC_ID = aws ec2 describe-vpcs `
--filters "Name=isDefault,Values=true" `
--query "Vpcs[0].VpcId" --output text --region $REGION
$SUBNET_ID = aws ec2 describe-subnets `
--filters "Name=vpc-id,Values=$VPC_ID" `
--query "Subnets[0].SubnetId" --output text --region $REGION
Expected output:
vpc-0ebc1234dar56786a
subnet-0328456779xbvhet0
We use the default VPC here for simplicity. In a production setup, you'd use a custom VPC with private subnets and a load balancer in front.
5.2 — Create a Security Group and Open Required Ports
$SG_ID = aws ec2 create-security-group `
--group-name bike-rental-sg `
--description "Bike Rental App Security Group" `
--vpc-id $VPC_ID --query "GroupId" --output text --region $REGION
Expected output:
sg-1abc026def476580a
Now open the two ports we need:
# Port 8050 — Dash dashboard (HTTP traffic from the internet)
aws ec2 authorize-security-group-ingress `
--group-id $SG_ID --protocol tcp --port 8050 `
--cidr 0.0.0.0/0 --region $REGION
# Port 2049 — NFS/EFS (shared storage between containers)
aws ec2 authorize-security-group-ingress `
--group-id $SG_ID --protocol tcp --port 2049 `
--cidr 0.0.0.0/0 --region $REGION
Port 8050 is where Dash serves the dashboard. Port 2049 is the NFS protocol port that EFS uses. Both rules accept traffic from anywhere (0.0.0.0/0). You can restrict this by IP range for a more secure setup.
Expected output for each: A JSON block confirming the rule was added, showing the protocol, port range, and IP range.
Step 6 — Register Task Definitions
A task definition is ECS's equivalent of a docker-compose.yml. It describes what image to run, how much CPU and memory to allocate, what command to execute, and what ports to expose. We need two: one for the dashboard and one for the inference service.
6.1 — UI / Dashboard Task Definition
$taskDef = @"
{
"family": "bike-rental-task",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "256",
"memory": "512",
"executionRoleArn": "arn:aws:iam::${ACCOUNT_ID}:role/ecsTaskExecutionRole",
"containerDefinitions": [{
"name": "bike-rental-app",
"image": "${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${SERVICE_NAME}:latest",
"command": ["python", "entrypoints/app_ui.py"],
"portMappings": [{"containerPort": 8050, "protocol": "tcp"}],
"environment": [{"name": "PORT", "value": "8050"}]
}]
}
"@
[System.IO.File]::WriteAllText(
"$PWD\task-def.json", $taskDef,
[System.Text.UTF8Encoding]::new($false))
aws ecs register-task-definition `
--cli-input-json file://task-def.json --region $REGION
A few things worth noting:
- "cpu": "256" means 0.25 vCPU; "memory": "512" means 512 MB. These are the smallest Fargate allocations, enough for a lightweight dashboard.
- networkMode: awsvpc gives each task its own elastic network interface (ENI) with its own IP, which is required for Fargate.
- The command overrides the Dockerfile's default CMD, so we can reuse one image for both services.
Expected output: A large JSON block containing taskDefinitionArn:
arn:aws:ecs:us-east-1:048908710060:task-definition/bike-rental-task:1
The :1 at the end is the revision number. Every time you register a new version of the same task definition, this increments.
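If the PowerShell here-string interpolation feels fragile, the same file can be generated with a short Python sketch. The account ID, region, and service name are placeholders; substitute your own values:

```python
import json

# Placeholder values; substitute your real account ID, region, and repo name.
ACCOUNT_ID = "123456789012"
REGION = "us-east-1"
SERVICE_NAME = "bike-rental-forecasting"

task_def = {
    "family": "bike-rental-task",
    "networkMode": "awsvpc",
    "requiresCompatibilities": ["FARGATE"],
    "cpu": "256",      # 0.25 vCPU, the smallest Fargate allocation
    "memory": "512",   # 512 MB
    "executionRoleArn": f"arn:aws:iam::{ACCOUNT_ID}:role/ecsTaskExecutionRole",
    "containerDefinitions": [{
        "name": "bike-rental-app",
        "image": f"{ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com/{SERVICE_NAME}:latest",
        "command": ["python", "entrypoints/app_ui.py"],
        "portMappings": [{"containerPort": 8050, "protocol": "tcp"}],
        "environment": [{"name": "PORT", "value": "8050"}],
    }],
}

# json.dump never emits a BOM, so the resulting file is safe to pass
# to `aws ecs register-task-definition --cli-input-json`.
with open("task-def.json", "w", encoding="utf-8") as f:
    json.dump(task_def, f, indent=2)
```

This sidesteps the BOM issue entirely and makes the interpolated fields easy to audit before registering.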
6.2 — Inference Task Definition
$inferenceTaskDef = @"
{
"family": "bike-rental-inference-task",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "256",
"memory": "512",
"executionRoleArn": "arn:aws:iam::${ACCOUNT_ID}:role/ecsTaskExecutionRole",
"containerDefinitions": [{
"name": "bike-rental-inference",
"image": "${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${SERVICE_NAME}:latest",
"command": ["python", "entrypoints/inference.py"]
}]
}
"@
[System.IO.File]::WriteAllText(
"$PWD\inference-task-def.json", $inferenceTaskDef,
[System.Text.UTF8Encoding]::new($false))
aws ecs register-task-definition `
--cli-input-json file://inference-task-def.json --region $REGION
Notice this task has no portMappings. The inference container doesn't serve HTTP traffic; it just runs the pipeline and writes results to the shared volume. The only difference from the UI task is the command.
Expected output:
arn:aws:ecs:us-east-1:048908710060:task-definition/bike-rental-inference-task:1
Step 7 — Set Up EFS Shared Storage
Both containers need to share data: the inference service writes predictions, and the dashboard reads them. In Docker locally, this is a named volume. On AWS, we use Amazon EFS (Elastic File System) — a managed, serverless NFS file system that multiple containers can mount simultaneously.
$EFS_ID = aws efs create-file-system `
--performance-mode generalPurpose `
--query "FileSystemId" --output text --region $REGION
Expected output:
fs-0abc1234def56789a
Then create a mount target. This is the network endpoint that makes EFS accessible from within your subnet:
aws efs create-mount-target `
--file-system-id $EFS_ID `
--subnet-id $SUBNET_ID `
--security-groups $SG_ID --region $REGION
Expected output: A JSON block describing the mount target, including its IP address within the subnet. Wait a minute or two for the state to transition from creating to available before proceeding.
Why EFS? Unlike S3, EFS behaves like a traditional file system: containers can open, write, and read files using normal file I/O. No special SDK needed. For ML pipelines that exchange files between services, this is the simplest approach on AWS.
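The handshake between the two services really is just ordinary file I/O. Here's a minimal sketch of the pattern; the file names and the temp directory standing in for the EFS mount are illustrative, not the project's actual paths:

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the EFS mount point (/app/data inside the containers).
shared = Path(tempfile.mkdtemp())
predictions_file = shared / "predictions.json"

# Inference side: each tick, write the latest predicted vs. actual pair.
def write_tick(hour: int, predicted: float, actual: float) -> None:
    tmp = predictions_file.with_suffix(".tmp")
    tmp.write_text(json.dumps({"hour": hour, "predicted": predicted, "actual": actual}))
    tmp.replace(predictions_file)  # atomic rename avoids half-written reads

# Dashboard side: on each refresh, read whatever is currently there.
def read_latest() -> dict:
    return json.loads(predictions_file.read_text())

write_tick(hour=42, predicted=188.5, actual=191.0)
print(read_latest())  # {'hour': 42, 'predicted': 188.5, 'actual': 191.0}
```

The write-to-temp-then-rename step matters once two containers share the file: the dashboard never observes a partially written JSON document.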
To use EFS in your task definitions, add a volumes block and mountPoints to each container definition. Here's the pattern to add to your task JSON:
"volumes": [{
"name": "shared-data",
"efsVolumeConfiguration": {
"fileSystemId": "<YOUR_EFS_ID>",
"rootDirectory": "/"
}
}],
"containerDefinitions": [{
...
"mountPoints": [{
"sourceVolume": "shared-data",
"containerPath": "/app/data",
"readOnly": false
}]
}]
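Putting it together, the UI task definition from Step 6.1 with the EFS volume wired in would look like this (account ID and file-system ID are placeholders; use your own $ACCOUNT_ID and $EFS_ID):

```json
{
  "family": "bike-rental-task",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "volumes": [{
    "name": "shared-data",
    "efsVolumeConfiguration": {
      "fileSystemId": "fs-0abc1234def56789a",
      "rootDirectory": "/"
    }
  }],
  "containerDefinitions": [{
    "name": "bike-rental-app",
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/bike-rental-forecasting:latest",
    "command": ["python", "entrypoints/app_ui.py"],
    "portMappings": [{"containerPort": 8050, "protocol": "tcp"}],
    "environment": [{"name": "PORT", "value": "8050"}],
    "mountPoints": [{
      "sourceVolume": "shared-data",
      "containerPath": "/app/data",
      "readOnly": false
    }]
  }]
}
```

Register the edited file with aws ecs register-task-definition again; it becomes a new revision, which you then roll out with the force-new-deployment step below.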
Step 8 — Create and Deploy ECS Services
An ECS service is what keeps your task running. If a task crashes, the service restarts it automatically. It also handles scaling, load balancer integration, and rolling deployments.
8.1 — Create the UI Service
aws ecs create-service `
--cluster bike-rental-cluster `
--service-name bike-rental-service `
--task-definition bike-rental-task `
--desired-count 1 `
--launch-type FARGATE `
--network-configuration `
"awsvpcConfiguration={subnets=[$SUBNET_ID],securityGroups=[$SG_ID],assignPublicIp=ENABLED}" `
--region $REGION
--desired-count 1 tells ECS to keep exactly one task running at all times. assignPublicIp=ENABLED gives the task a public IP so we can reach the dashboard from a browser.
Expected output: A large JSON block with "status": "ACTIVE" and "runningCount": 0. The count starts at 0 and increments as the task pulls the image and starts — usually takes 30–90 seconds.
8.2 — Create the Inference Service
aws ecs create-service `
--cluster bike-rental-cluster `
--service-name bike-rental-inference `
--task-definition bike-rental-inference-task `
--desired-count 1 `
--launch-type FARGATE `
--network-configuration `
"awsvpcConfiguration={subnets=[$SUBNET_ID],securityGroups=[$SG_ID],assignPublicIp=ENABLED}" `
--region $REGION
Expected output: Same structure as above with "serviceName": "bike-rental-inference".
8.3 — Force a Re-Deploy After Updates
Whenever you push a new image or update the task definition, trigger a forced re-deployment:
aws ecs update-service `
--cluster bike-rental-cluster `
--service bike-rental-service `
--task-definition bike-rental-task `
--force-new-deployment --region $REGION
aws ecs update-service `
--cluster bike-rental-cluster `
--service bike-rental-inference `
--task-definition bike-rental-inference-task `
--force-new-deployment --region $REGION
ECS will launch new tasks with the updated image, wait for them to become healthy, then stop the old ones: a zero-downtime rolling deployment by default.
Step 9 — Get the Public IP of Your Running Task
Once the services are running, you need the public IP to access the dashboard.
$TASKS = aws ecs list-tasks --cluster bike-rental-cluster `
--query "taskArns" --output text --region $REGION
foreach ($TASK in $TASKS.Split()) {
$ENI = aws ecs describe-tasks --cluster bike-rental-cluster `
--tasks $TASK `
--query "tasks[0].attachments[0].details[?name=='networkInterfaceId'].value" `
--output text --region $REGION
$IP = aws ec2 describe-network-interfaces `
--network-interface-ids $ENI `
--query "NetworkInterfaces[0].Association.PublicIp" `
--output text --region $REGION
$CONTAINER = aws ecs describe-tasks --cluster bike-rental-cluster `
--tasks $TASK `
--query "tasks[0].containers[0].name" --output text --region $REGION
echo "Container: $CONTAINER -> http://${IP}:8050"
}
This script loops over all running tasks, finds their network interface, resolves the public IP, and prints a clickable URL.
Expected output:
Container: bike-rental-app -> http://54.123.45.68:8050
Container: bike-rental-inference -> http://54.123.45.69:8050
Open the UI container's URL in your browser and you should see the live dashboard — predictions updating every second, the chart scrolling in real time. (The script prints a URL for the inference container too, but since it doesn't serve HTTP, that one won't load in a browser.)
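The JMESPath expression tasks[0].attachments[0].details[?name=='networkInterfaceId'].value is doing a filtered lookup inside the describe-tasks response. If it ever comes back empty, it helps to know the shape it expects; here's the same extraction in plain Python against a trimmed, illustrative payload (the real response has many more fields):

```python
import json

# Trimmed, illustrative describe-tasks response.
response = json.loads("""
{
  "tasks": [{
    "attachments": [{
      "type": "ElasticNetworkInterface",
      "details": [
        {"name": "subnetId", "value": "subnet-0328456779xbvhet0"},
        {"name": "networkInterfaceId", "value": "eni-0abc1234def56789a"}
      ]
    }]
  }]
}
""")

# Equivalent of: tasks[0].attachments[0].details[?name=='networkInterfaceId'].value
details = response["tasks"][0]["attachments"][0]["details"]
eni = next(d["value"] for d in details if d["name"] == "networkInterfaceId")
print(eni)  # eni-0abc1234def56789a
```

An empty result usually means the task hasn't attached its ENI yet (it's still PROVISIONING), so wait a few seconds and re-run the loop.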
Step 10 — Monitoring and Debugging
Once deployed, here are the commands you'll reach for most often.
Check overall service health:
aws ecs describe-services `
--cluster bike-rental-cluster `
--services bike-rental-service bike-rental-inference `
--region $REGION `
--query "services[*].{Name:serviceName,Running:runningCount,Pending:pendingCount,Status:status}"
Expected output:
[
{ "Name": "bike-rental-service", "Running": 1, "Pending": 0, "Status": "ACTIVE" },
{ "Name": "bike-rental-inference", "Running": 1, "Pending": 0, "Status": "ACTIVE" }
]
Running: 1 for both means everything is healthy.
Check recent service events (where errors appear):
aws ecs describe-services `
--cluster bike-rental-cluster `
--services bike-rental-service `
--region $REGION --query "services[0].events"
This is the first place to look when a deployment fails. ECS logs events like image pull failures, task launch errors, and health check failures here in chronological order.
List all running tasks:
aws ecs list-tasks --cluster bike-rental-cluster --region $REGION
Verify security group rules are correct:
aws ec2 describe-security-groups `
--group-ids $SG_ID --region $REGION `
--query "SecurityGroups[0].IpPermissions"
If your dashboard is unreachable, a missing or misconfigured security group rule is the most common culprit. Check that port 8050 is open with 0.0.0.0/0 CIDR.
Architecture Summary
Here's a bird's-eye view of what we built: a single Docker image in ECR, an ECS cluster running two Fargate services (the Dash UI on port 8050 and the inference pipeline), both mounting a shared EFS volume, inside the default VPC behind one security group.
Key Takeaways
Deploying an ML application on AWS doesn't require Kubernetes or a platform team. With ECR, ECS Fargate, and EFS, you get:
- Managed container orchestration: no EC2 instances to patch or maintain
- Serverless scaling: Fargate provisions compute on-demand
- Shared persistent storage: EFS lets multiple containers share a file system like they're on the same machine
- Fast iteration: push a new image, run update-service --force-new-deployment, done
The full deployment from a working Docker image to a live public URL took me less than an hour following this sequence. The trickiest parts were the IAM role setup (easy to get wrong, hard to debug) and remembering that ECS service events — not CloudWatch Logs — are where deployment errors show up first.
I hope this guide saves you some of the trial and error I went through. If you have questions or run into something unexpected, drop a comment below.
All commands were tested and verified during an actual deployment. Account IDs in examples are illustrative.
Link to the deployed application: https://drive.google.com/file/d/1tQp4gdg3AauQ8eNbeLZFdivRdCnvMRB9/view?usp=sharing
