ARAFAT O. OLAYIWOLA

Posted on Jun 6 • Edited on Jun 11

Deploying a Production-Grade Containerized System on AWS: ECS Fargate + ALB + RDS + ElastiCache + EventBridge

#devops #aws #cloud #docker

Author: Arafat Olayiwola — 5x AWS Community Builder

Stack: Python 3.12 · FastAPI · Docker · AWS ECS Fargate · ALB · RDS · ElastiCache · EventBridge
Description: A complete, battle-tested AWS deployment walkthrough for containerized Python APIs — ECS Fargate, Application Load Balancer with ACM, RDS PostgreSQL, ElastiCache Redis, SSM secrets, and EventBridge-triggered scheduled jobs. ~$86/month, no servers to manage.

I've helped dozens of teams get their first serious AWS deployment off the ground. The same questions keep coming up: "Should I use Lambda or ECS?", "Where do I put my secrets?", "How do I run cron jobs without a server?"

This article is the end-to-end answer. It covers the exact architecture I ship for containerized API services in production — one that:

Handles live webhooks with zero cold starts
Keeps all secrets encrypted at rest and out of source control
Runs scheduled jobs on a shared Docker image without packaging nightmares
Costs around $86/month at the MVP tier and scales predictably

No half-baked tutorials. Every command here runs against the real AWS CLI.

Architecture at a Glance

Internet
    │
    ▼
ALB  (HTTPS :443, ACM cert, custom domain)
    │
    ▼
ECS Fargate  (your-api task, 1 vCPU / 2 GB) ◄── ECR (container image)
    │
    ├──► RDS PostgreSQL 16  db.t3.micro  (no public access)
    └──► ElastiCache Redis 7  cache.t3.micro  (no public access)

EventBridge Scheduler (3 cron rules)
    │
    └──► ECS Fargate one-shot tasks  (same image, same VPC)
         └──► RDS + Redis + external APIs

Why ECS Fargate over Lambda?

Lambda is fantastic for true event-driven workloads, but it hits friction fast when your dependency footprint grows. A full production Python stack (ORM, async DB driver, Redis client, third-party SDKs) can easily exceed the 250 MB unzipped limit for Lambda zip deployments, layers included. Lambda container images raise that ceiling, but they mean maintaining a second packaging path. Fargate sidesteps the problem entirely: your Dockerfile is the deployment artifact, and AWS manages the underlying compute.
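To see how close your own dependencies are to that limit, a minimal sketch like the one below sums the on-disk size of an installed dependency tree. It treats the limit as 250 MiB (AWS documents it simply as "250 MB") and is a rough pre-check, not an official packaging tool.

```python
"""Rough check: does an installed dependency tree fit Lambda's zip-package limit?"""
import os

# Lambda's limit for an unzipped zip deployment (function code plus layers).
# Treated here as 250 MiB; AWS documents it simply as "250 MB".
LAMBDA_UNZIPPED_LIMIT_BYTES = 250 * 1024 * 1024


def tree_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_in_lambda_zip(root: str) -> bool:
    """True if the tree is at or under the unzipped limit."""
    return tree_size_bytes(root) <= LAMBDA_UNZIPPED_LIMIT_BYTES
```

For example, after `pip install --target ./lambda-build -r requirements.txt` (the directory name is arbitrary), `tree_size_bytes("lambda-build") / 2**20` gives the unzipped size in MiB.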

Why not EC2?

With Fargate you pay per second of task runtime (with a one-minute minimum per task) and never SSH into an instance. The tradeoff (you lose the ability to tune the OS) is almost always the right one for API workloads.
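Per-second billing matters most for short one-shot tasks such as scheduled jobs. A minimal sketch of the arithmetic, assuming Linux/x86 Fargate billing (runtime rounded up to the second, one-minute minimum). The hourly rates are deliberately left as inputs because they vary by region; take them from the Fargate pricing page.

```python
"""Estimate the Fargate compute cost of a single task run (Linux/x86 billing rules)."""
import math

# Linux Fargate tasks are billed per second with a one-minute minimum.
MIN_BILLED_SECONDS = 60


def billed_seconds(runtime_seconds: float) -> int:
    """Round runtime up to the next whole second, then apply the one-minute minimum."""
    return max(MIN_BILLED_SECONDS, math.ceil(runtime_seconds))


def task_cost(runtime_seconds: float, vcpu: float, memory_gb: float,
              vcpu_hour_rate: float, gb_hour_rate: float) -> float:
    """Cost in USD for one run; the rates are inputs because they differ per region."""
    hours = billed_seconds(runtime_seconds) / 3600
    return hours * (vcpu * vcpu_hour_rate + memory_gb * gb_hour_rate)
```

A 10-second job is billed as 60 seconds, so very frequent tiny jobs cost more than their raw runtime suggests.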

What You'll Build

Layer	Service	Spec	Monthly
Compute	ECS Fargate	1 vCPU / 2 GB, 24/7	$36
Ingress	ALB + ACM	1 LB, HTTPS, free cert	$18
Database	RDS PostgreSQL 16	db.t3.micro, 20 GB gp2	$17
Cache	ElastiCache Redis 7	cache.t3.micro	$14
Scheduled jobs	EventBridge + ECS	3 cron rules	$0
Registry	ECR	< 1 GB image	$0.05
Observability	CloudWatch Logs	~1.5 GB/month	$1
Total			~$86/mo

Reserve RDS and ElastiCache for 1 year and that drops to ~$74/month — a 14% saving for committing to services you're running anyway.
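The table's arithmetic is easy to double-check. The dollar figures below are copied from the table and the paragraph above, not fetched from AWS pricing, so update them if your region or instance sizes differ.

```python
"""Re-add the monthly line items from the cost table and check the reservation saving."""
line_items = {
    "ECS Fargate (1 vCPU / 2 GB, 24/7)": 36.00,
    "ALB + ACM": 18.00,
    "RDS PostgreSQL db.t3.micro": 17.00,
    "ElastiCache cache.t3.micro": 14.00,
    "EventBridge + ECS jobs": 0.00,
    "ECR": 0.05,
    "CloudWatch Logs": 1.00,
}

on_demand_total = sum(line_items.values())
reserved_total = 74.00  # quoted figure with 1-year RDS and ElastiCache reservations
saving = 1 - reserved_total / on_demand_total

print(f"On-demand total: ${on_demand_total:.2f}/month")
print(f"Reserved total:  ${reserved_total:.2f}/month ({saving:.0%} saving)")
```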

Prerequisites

Install the tools:

# AWS CLI v2
brew install awscli       # macOS
# or: https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html

aws configure
# Default region: eu-west-1  (or your preferred region)
# Default output: json

export AWS_DEFAULT_REGION=eu-west-1

Your IAM user needs these managed policies (scope them down after the first deploy):

AmazonRDSFullAccess
AmazonElastiCacheFullAccess
AmazonEC2FullAccess
AmazonECS_FullAccess
AmazonEC2ContainerRegistryFullAccess
AWSLambda_FullAccess
CloudWatchFullAccess
AmazonSSMFullAccess
IAMFullAccess
ElasticLoadBalancingFullAccess
AmazonRoute53FullAccess
AWSCertificateManagerFullAccess
AmazonEventBridgeSchedulerFullAccess

Step 1 — VPC and Security Groups

The network model is the most important thing to get right up front. Every security group follows the principle of least privilege: only the minimum source/destination pair is opened.

# Use the default VPC — fine for single-region MVPs
VPC_ID=$(aws ec2 describe-vpcs \
  --filters "Name=isDefault,Values=true" \
  --query "Vpcs[0].VpcId" --output text)

SUBNET_IDS=$(aws ec2 describe-subnets \
  --filters "Name=vpc-id,Values=$VPC_ID" \
  --query "Subnets[0:2].SubnetId" --output text | tr '\t' ',')
SUBNET_1=$(echo $SUBNET_IDS | cut -d',' -f1)
SUBNET_2=$(echo $SUBNET_IDS | cut -d',' -f2)

echo "VPC: $VPC_ID  |  Subnets: $SUBNET_1, $SUBNET_2"

# ALB security group — public-facing HTTPS + HTTP
ALB_SG=$(aws ec2 create-security-group \
  --group-name myapp-alb-sg \
  --description "App ALB public HTTPS" \
  --vpc-id $VPC_ID --query GroupId --output text)

aws ec2 authorize-security-group-ingress \
  --group-id $ALB_SG --protocol tcp --port 443 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress \
  --group-id $ALB_SG --protocol tcp --port 80 --cidr 0.0.0.0/0

# App security group — port 8000, from ALB only
APP_SG=$(aws ec2 create-security-group \
  --group-name myapp-app-sg \
  --description "ECS tasks" \
  --vpc-id $VPC_ID --query GroupId --output text)

aws ec2 authorize-security-group-ingress \
  --group-id $APP_SG --protocol tcp --port 8000 --source-group $ALB_SG
# No egress rule needed: a new security group already allows all outbound
# traffic, which the tasks use to reach ECR, SSM, CloudWatch Logs, and external APIs.

# Database security group — port 5432, from app only
DB_SG=$(aws ec2 create-security-group \
  --group-name myapp-db-sg \
  --description "RDS PostgreSQL" \
  --vpc-id $VPC_ID --query GroupId --output text)

aws ec2 authorize-security-group-ingress \
  --group-id $DB_SG --protocol tcp --port 5432 --source-group $APP_SG

# Redis security group — port 6379, from app only
REDIS_SG=$(aws ec2 create-security-group \
  --group-name myapp-redis-sg \
  --description "ElastiCache Redis" \
  --vpc-id $VPC_ID --query GroupId --output text)

aws ec2 authorize-security-group-ingress \
  --group-id $REDIS_SG --protocol tcp --port 6379 --source-group $APP_SG

echo "ALB_SG=$ALB_SG  APP_SG=$APP_SG  DB_SG=$DB_SG  REDIS_SG=$REDIS_SG"

The chain: Internet → ALB SG → App SG → DB/Redis SG. No direct internet access to your database or cache. Ever.
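The chain can be modelled as plain data to make the "only the layer above" rule concrete. This is a simplified sketch: group names are the --group-name values from the commands above, "0.0.0.0/0" stands in for the internet, sources are matched exactly (no CIDR containment), and egress is ignored.

```python
"""Model the Step 1 ingress rules and check which hops are allowed."""
from typing import NamedTuple


class Rule(NamedTuple):
    target_sg: str  # security group the ingress rule is attached to
    port: int       # TCP port opened
    source: str     # source security group name, or a CIDR


RULES = [
    Rule("myapp-alb-sg", 443, "0.0.0.0/0"),
    Rule("myapp-alb-sg", 80, "0.0.0.0/0"),
    Rule("myapp-app-sg", 8000, "myapp-alb-sg"),
    Rule("myapp-db-sg", 5432, "myapp-app-sg"),
    Rule("myapp-redis-sg", 6379, "myapp-app-sg"),
]


def allowed(source: str, target_sg: str, port: int) -> bool:
    """True if some ingress rule on target_sg admits traffic from source on port."""
    return any(r.target_sg == target_sg and r.port == port and r.source == source
               for r in RULES)


for hop in [("0.0.0.0/0", "myapp-alb-sg", 443),
            ("myapp-alb-sg", "myapp-app-sg", 8000),
            ("0.0.0.0/0", "myapp-db-sg", 5432)]:
    print(hop, "allowed" if allowed(*hop) else "blocked")
```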

Step 2 — RDS PostgreSQL

aws rds create-db-subnet-group \
  --db-subnet-group-name myapp-db-subnet \
  --db-subnet-group-description "App RDS subnet group" \
  --subnet-ids $SUBNET_1 $SUBNET_2

aws rds create-db-instance \
  --db-instance-identifier myapp-db \
  --db-instance-class db.t3.micro \
  --engine postgres \
  --engine-version 16 \
  --db-name myapp \
  --master-username myapp \
  --master-user-password "<STRONG_PASSWORD>" \
  --allocated-storage 20 \
  --storage-type gp2 \
  --no-multi-az \
  --no-publicly-accessible \
  --db-subnet-group-name myapp-db-subnet \
  --vpc-security-group-ids $DB_SG \
  --backup-retention-period 7 \
  --deletion-protection

aws rds wait db-instance-available --db-instance-identifier myapp-db

DB_HOST=$(aws rds describe-db-instances \
  --db-instance-identifier myapp-db \
  --query "DBInstances[0].Endpoint.Address" --output text)

echo "DB host: $DB_HOST"

A few deliberate choices here:

--no-publicly-accessible: RDS never gets a public IP. To run migrations locally, temporarily make the instance publicly accessible and open port 5432 in the DB SG to your IP, run the migration, then revert both.
--backup-retention-period 7: 7 days of automated snapshots, with no extra charge as long as total backup storage stays within your provisioned database storage.
--deletion-protection: blocks an accidental aws rds delete-db-instance until you explicitly disable it.

Connecting locally for migrations (temporary):

MY_IP=$(curl -s https://checkip.amazonaws.com)
aws ec2 authorize-security-group-ingress \
  --group-id $DB_SG --protocol tcp --port 5432 --cidr "${MY_IP}/32"
aws rds modify-db-instance \
  --db-instance-identifier myapp-db --publicly-accessible --apply-immediately
aws rds wait db-instance-available --db-instance-identifier myapp-db

# run your migrations here

# Lock it back down immediately
aws ec2 revoke-security-group-ingress \
  --group-id $DB_SG --protocol tcp --port 5432 --cidr "${MY_IP}/32"
aws rds modify-db-instance \
  --db-instance-identifier myapp-db --no-publicly-accessible --apply-immediately
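The weak spot in a manual open/close sequence is forgetting to close it when the migration fails. One way to guarantee the revoke runs is a context manager around a boto3 EC2 client. This is a sketch: it covers only the security-group rule (the modify-db-instance toggles above still apply), and `run_migrations()` in the comment is a hypothetical stand-in for your migration command.

```python
"""Open a temporary ingress rule and guarantee it is revoked afterwards."""
from contextlib import contextmanager


@contextmanager
def temporary_ingress(ec2, group_id: str, cidr: str, port: int = 5432):
    """ec2 is a boto3 EC2 client (boto3.client("ec2")) or anything with the same two methods."""
    permissions = [{
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr}],
    }]
    ec2.authorize_security_group_ingress(GroupId=group_id, IpPermissions=permissions)
    try:
        yield
    finally:
        # Runs even if the body raises, so the hole is always closed.
        ec2.revoke_security_group_ingress(GroupId=group_id, IpPermissions=permissions)


# Usage (hypothetical migration step):
#   import boto3
#   with temporary_ingress(boto3.client("ec2"), db_sg_id, f"{my_ip}/32"):
#       run_migrations()
```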

Step 3 — ElastiCache Redis

aws elasticache create-cache-subnet-group \
  --cache-subnet-group-name myapp-redis-subnet \
  --cache-subnet-group-description "App Redis subnet group" \
  --subnet-ids $SUBNET_1 $SUBNET_2

aws elasticache create-cache-cluster \
  --cache-cluster-id myapp-redis \
  --cache-node-type cache.t3.micro \
  --engine redis \
  --engine-version 7.0 \
  --num-cache-nodes 1 \
  --cache-subnet-group-name myapp-redis-subnet \
  --security-group-ids $REDIS_SG

aws elasticache wait cache-cluster-available --cache-cluster-id myapp-redis

REDIS_HOST=$(aws elasticache describe-cache-clusters \
  --cache-cluster-id myapp-redis \
  --show-cache-node-info \
  --query "CacheClusters[0].CacheNodes[0].Endpoint.Address" --output text)

echo "Redis host: $REDIS_HOST"

Redis 7 on a cache.t3.micro (0.5 GiB of memory) is typically enough for rate limiting, session state, and light caching at this tier; watch its memory usage before you push larger datasets into it.

Step 4 — Secrets Management with SSM Parameter Store

This is where most teams make mistakes. Environment variables hardcoded in Dockerfiles or task definitions end up in git history and CloudFormation console outputs. Don't do that.

SSM Parameter Store with SecureString parameters encrypts secrets with KMS, keeps an audit trail in CloudTrail, and integrates natively with ECS task definitions.

put_param() {
  aws ssm put-parameter \
    --name "/myapp/production/$1" \
    --value "$2" \
    --type SecureString \
    --overwrite
}

get_param() {
  aws ssm get-parameter \
    --name "/myapp/production/$(echo $1 | tr '[:lower:]' '[:upper:]')" \
    --with-decryption \
    --query Parameter.Value \
    --output text
}

put_param "APP_ENV"        "production"
put_param "DATABASE_URL"   "postgresql+asyncpg://myapp:<PASSWORD>@${DB_HOST}:5432/myapp"
put_param "REDIS_URL"      "redis://${REDIS_HOST}:6379/0"
put_param "APP_SECRET_KEY" "$(openssl rand -hex 32)"
# ... all your other secrets

Verify:

aws ssm get-parameters-by-path \
  --path "/myapp/production/" \
  --query 'Parameters[*].Name' --output table

The naming convention /app/environment/KEY is important: it lets you scope IAM policies to a path prefix, so the roles your ECS tasks use can read only their own environment's secrets.
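A path-scoped policy is small enough to generate. A sketch, assuming the parameters are encrypted with the default aws/ssm key (a customer managed KMS key would also need kms:Decrypt) and using 123456789012 as a placeholder account ID:

```python
"""Build an IAM policy that can read SSM parameters under one path prefix only."""
import json


def ssm_read_policy(region: str, account_id: str, prefix: str) -> dict:
    """prefix like '/myapp/production/'; the resource becomes ...:parameter/myapp/production/*."""
    path = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["ssm:GetParameter", "ssm:GetParameters"],
            "Resource": f"arn:aws:ssm:{region}:{account_id}:parameter/{path}/*",
        }],
    }


if __name__ == "__main__":
    # 123456789012 is a placeholder account ID.
    print(json.dumps(ssm_read_policy("eu-west-1", "123456789012", "/myapp/production/"), indent=2))
```

Saved to a file, the output could be attached with `aws iam put-role-policy --role-name MyAppECSTaskExecutionRole --policy-name myapp-ssm-read --policy-document file:///tmp/ssm-read.json` (policy and file names are illustrative) in place of the account-wide AmazonSSMReadOnlyAccess.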

Step 5 — Container Registry (ECR)

aws ecr create-repository \
  --repository-name myapp \
  --image-scanning-configuration scanOnPush=true

ECR_URI=$(aws ecr describe-repositories \
  --repository-names myapp \
  --query "repositories[0].repositoryUri" --output text)

aws ecr get-login-password | \
  docker login --username AWS --password-stdin \
  "$(echo $ECR_URI | cut -d'/' -f1)"

scanOnPush=true runs basic image scanning on every push, and the CVE findings show up in the ECR console at no charge. Enhanced scanning, powered by Amazon Inspector, is a separate registry-level setting that is billed through Inspector.

Step 6 — Dockerfile That's Production Ready

Here's the multi-stage Dockerfile pattern that keeps your image lean:

FROM python:3.12.7-slim AS builder

WORKDIR /build

RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential libpq-dev \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt


FROM python:3.12.7-slim AS production

WORKDIR /app

RUN apt-get update && apt-get install -y --no-install-recommends \
    libpq5 curl \
    && rm -rf /var/lib/apt/lists/* \
    && groupadd --gid 1001 appuser \
    && useradd --uid 1001 --gid appuser --no-create-home appuser

COPY --from=builder /root/.local /home/appuser/.local
COPY --chown=appuser:appuser . .

USER appuser

ENV PATH=/home/appuser/.local/bin:$PATH \
    PYTHONPATH=/app \
    PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1

EXPOSE 8000

HEALTHCHECK --interval=30s --timeout=10s --start-period=20s --retries=3 \
  CMD curl -f http://localhost:8000/health || exit 1

# 2 workers per container — Fargate scales containers horizontally
# 120s timeout covers slow AI/external API responses
CMD ["gunicorn", "app.main:app", \
     "--workers", "2", \
     "--worker-class", "uvicorn.workers.UvicornWorker", \
     "--bind", "0.0.0.0:8000", \
     "--timeout", "120", \
     "--keep-alive", "5", \
     "--access-logfile", "-", \
     "--error-logfile", "-"]

Two things worth calling out:

Non-root user: USER appuser is not optional in production. Many compliance frameworks flag containers running as root.
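Multi-stage build: the builder stage has gcc (via build-essential), libpq-dev, and other build tooling. None of that lands in the final image, which only receives the installed packages from the builder's /root/.local plus the libpq5 runtime library.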
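If you would rather not install curl in the runtime image just for health checks, a stdlib-only probe is one option. A sketch, assuming it is saved as /app/healthcheck.py (a hypothetical file name):

```python
"""Stdlib-only health probe, usable in place of `curl -f` in a container."""
import urllib.request


def is_healthy(url: str = "http://localhost:8000/health", timeout: float = 5.0) -> bool:
    """True for a 2xx response; False for HTTP errors, refused connections, or timeouts."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # urllib.error.URLError, HTTPError, and socket timeouts all subclass OSError.
        return False
```

The Dockerfile's HEALTHCHECK could then run `CMD python -c "import sys, healthcheck; sys.exit(0 if healthcheck.is_healthy() else 1)"` (the module is importable because PYTHONPATH=/app), and curl could be dropped from the runtime apt-get line. The task definition's CMD-SHELL health check would need the same change.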

Build and push:

docker build --platform linux/amd64 -t myapp:latest .
docker tag myapp:latest $ECR_URI:latest
docker push $ECR_URI:latest

VERSION=$(git rev-parse --short HEAD)
docker tag myapp:latest $ECR_URI:$VERSION
docker push $ECR_URI:$VERSION

Always push a git SHA tag alongside latest. When something breaks at 3 AM, you want to know exactly which commit is running.

Step 7 — ECS Task Execution Role

The execution role is what ECS uses to pull your image from ECR and, when a task definition references secrets, to read SSM parameters at container startup. This is separate from the task role (what your application code uses).

cat > /tmp/ecs-trust.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "ecs-tasks.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF

aws iam create-role \
  --role-name MyAppECSTaskExecutionRole \
  --assume-role-policy-document file:///tmp/ecs-trust.json

aws iam attach-role-policy \
  --role-name MyAppECSTaskExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

# Allow ECS to read secrets from SSM
aws iam attach-role-policy \
  --role-name MyAppECSTaskExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMReadOnlyAccess

EXEC_ROLE_ARN=$(aws iam get-role \
  --role-name MyAppECSTaskExecutionRole \
  --query Role.Arn --output text)

Step 8 — ECS Cluster, Task Definition, and ALB

This is the longest step but it's the core of the deployment.

Create the ECS Cluster

aws ecs create-cluster \
  --cluster-name myapp \
  --capacity-providers FARGATE \
  --region eu-west-1

Build the Environment Variables from SSM

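Rather than referencing secrets through the container's secrets block (where ECS uses the execution role to fetch each value from SSM when a task starts), I pull all SSM values at deploy time into the task definition's environment block. This is a deliberate tradeoff: nothing is fetched at task startup and the IAM setup stays simple, but the decrypted values are stored in plaintext in the task definition, readable by anyone with ecs:DescribeTaskDefinition, and they only change when you rerun this script and register a new task definition revision.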

# generate-env.py — run this at deploy time
import json, subprocess

keys = [
  'APP_ENV', 'DATABASE_URL', 'REDIS_URL', 'APP_SECRET_KEY',
  # ... all your param names
]

env_list = []
for k in keys:
    val = subprocess.check_output([
      'aws', 'ssm', 'get-parameter',
      '--name', f'/myapp/production/{k}',
      '--with-decryption',
      '--query', 'Parameter.Value',
      '--output', 'text'
    ]).decode().strip()
    env_list.append({'name': k, 'value': val})

print(json.dumps(env_list, indent=2))

python3 generate-env.py > /tmp/myapp-env.json
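If plaintext values in the task definition are not acceptable for your compliance needs, the same key list can instead produce ECS secrets references. A sketch of that alternative (not the approach used in the rest of this guide), with 123456789012 as a placeholder account ID:

```python
"""Alternative to generate-env.py: emit ECS `secrets` references instead of plaintext values."""
import json
import sys

REGION = "eu-west-1"
ACCOUNT_ID = "123456789012"  # placeholder: use your own account ID
PREFIX = "/myapp/production"

KEYS = [
    "APP_ENV", "DATABASE_URL", "REDIS_URL", "APP_SECRET_KEY",
    # ... all your param names
]


def secrets_block(keys, region=REGION, account_id=ACCOUNT_ID, prefix=PREFIX):
    """One {"name", "valueFrom"} entry per key; ECS resolves each value when a task starts."""
    base = f"arn:aws:ssm:{region}:{account_id}:parameter{prefix}"
    return [{"name": k, "valueFrom": f"{base}/{k}"} for k in keys]


if __name__ == "__main__":
    json.dump(secrets_block(KEYS), sys.stdout, indent=2)
```

Redirected to a file such as /tmp/myapp-secrets.json (hypothetical), the container definition would carry `"secrets": $(cat /tmp/myapp-secrets.json)` instead of the `environment` line. The execution role's SSM read permission from Step 7 is what lets ECS resolve the ARNs, so the task definition holds no decrypted values and a forced redeploy picks up rotated secrets.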

Register the Task Definition

aws logs create-log-group --log-group-name /ecs/myapp-api --region eu-west-1

cat > /tmp/myapp-task.json << EOF
{
  "family": "myapp-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "executionRoleArn": "$EXEC_ROLE_ARN",
  "containerDefinitions": [{
    "name": "myapp-api",
    "image": "$ECR_URI:latest",
    "portMappings": [{"containerPort": 8000, "protocol": "tcp"}],
    "environment": $(cat /tmp/myapp-env.json),
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/myapp-api",
        "awslogs-region": "eu-west-1",
        "awslogs-stream-prefix": "ecs"
      }
    },
    "healthCheck": {
      "command": ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"],
      "interval": 30,
      "timeout": 10,
      "retries": 3,
      "startPeriod": 20
    }
  }]
}
EOF

aws ecs register-task-definition --cli-input-json file:///tmp/myapp-task.json

Create the Application Load Balancer

ALB_ARN=$(aws elbv2 create-load-balancer \
  --name myapp-alb \
  --subnets $SUBNET_1 $SUBNET_2 \
  --security-groups $ALB_SG \
  --scheme internet-facing \
  --type application \
  --query 'LoadBalancers[0].LoadBalancerArn' --output text)

ALB_DNS=$(aws elbv2 describe-load-balancers \
  --load-balancer-arns $ALB_ARN \
  --query 'LoadBalancers[0].DNSName' --output text)

# Create target group — ALB forwards to ECS tasks by IP
TG_ARN=$(aws elbv2 create-target-group \
  --name myapp-tg \
  --protocol HTTP --port 8000 \
  --vpc-id $VPC_ID \
  --target-type ip \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3 \
  --query 'TargetGroups[0].TargetGroupArn' --output text)

# Request a free TLS certificate from ACM
CERT_ARN=$(aws acm request-certificate \
  --domain-name api.yourdomain.com \
  --validation-method DNS \
  --query CertificateArn --output text)

echo "Add the DNS CNAME validation record shown in the ACM console, then continue."
# Blocks until ACM has validated the certificate (typically < 5 minutes with Route53)
aws acm wait certificate-validated --certificate-arn $CERT_ARN

# HTTPS listener (after cert validates)
aws elbv2 create-listener \
  --load-balancer-arn $ALB_ARN \
  --protocol HTTPS --port 443 \
  --certificates CertificateArn=$CERT_ARN \
  --default-actions Type=forward,TargetGroupArn=$TG_ARN

# HTTP → HTTPS redirect
aws elbv2 create-listener \
  --load-balancer-arn $ALB_ARN \
  --protocol HTTP --port 80 \
  --default-actions \
    Type=redirect,RedirectConfig='{Protocol=HTTPS,Port=443,StatusCode=HTTP_301}'

Route53 Alias Record

HOSTED_ZONE_ID=$(aws route53 list-hosted-zones \
  --query "HostedZones[?Name=='yourdomain.com.'].Id" \
  --output text | cut -d'/' -f3)

# The alias target zone is the ALB's canonical hosted zone ID
# (Z32O12XQLNTSW2 in eu-west-1); looking it up keeps this working in any region.
# See: https://docs.aws.amazon.com/general/latest/gr/elb.html
ALB_ZONE_ID=$(aws elbv2 describe-load-balancers \
  --load-balancer-arns $ALB_ARN \
  --query 'LoadBalancers[0].CanonicalHostedZoneId' --output text)

aws route53 change-resource-record-sets \
  --hosted-zone-id $HOSTED_ZONE_ID \
  --change-batch "{
    \"Changes\": [{
      \"Action\": \"UPSERT\",
      \"ResourceRecordSet\": {
        \"Name\": \"api.yourdomain.com\",
        \"Type\": \"A\",
        \"AliasTarget\": {
          \"HostedZoneId\": \"$ALB_ZONE_ID\",
          \"DNSName\": \"$ALB_DNS\",
          \"EvaluateTargetHealth\": true
        }
      }
    }]
  }"
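Escaped JSON inside a shell string is easy to break. One alternative is to build the change batch as data and pass it with `--change-batch file:///tmp/alias.json`. In this sketch the ALB zone ID is the load balancer's CanonicalHostedZoneId from describe-load-balancers (not your own hosted zone), and the DNS name in the example is a made-up placeholder.

```python
"""Build the Route 53 alias change batch as data instead of escaped shell JSON."""
import json


def alias_change_batch(record_name: str, alb_dns: str, alb_zone_id: str) -> dict:
    # alb_zone_id is the ALB's CanonicalHostedZoneId, not the zone holding your domain.
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                "AliasTarget": {
                    "HostedZoneId": alb_zone_id,
                    "DNSName": alb_dns,
                    "EvaluateTargetHealth": True,
                },
            },
        }]
    }


if __name__ == "__main__":
    # Placeholder values; substitute $ALB_DNS and the looked-up zone ID.
    batch = alias_change_batch("api.yourdomain.com",
                               "myapp-alb-example.eu-west-1.elb.amazonaws.com",
                               "Z32O12XQLNTSW2")
    print(json.dumps(batch, indent=2))
```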

Create the ECS Service

aws ecs create-service \
  --cluster myapp \
  --service-name myapp-api \
  --task-definition myapp-api \
  --desired-count 1 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={
    subnets=[$SUBNET_1,$SUBNET_2],
    securityGroups=[$APP_SG],
    assignPublicIp=ENABLED
  }" \
  --load-balancers "targetGroupArn=$TG_ARN,containerName=myapp-api,containerPort=8000" \
  --health-check-grace-period-seconds 60 \
  --region eu-west-1

aws ecs wait services-stable --cluster myapp --services myapp-api

curl https://api.yourdomain.com/health
# Expected: {"status": "ok", "env": "production"}

Step 9 — Scheduled Jobs with EventBridge Scheduler + ECS

This is the architecture decision I'm most proud of, and the one that trips people up most.

The naive approach is Lambda. The problem: a production Python API's dependency closure (ORM, async DB driver, HTTP clients, AI SDK) can hit 300-400 MB unzipped, while Lambda's limit for zip packages is 250 MB. To get around it you'd need Lambda container images, a separate build pipeline, and a second ECR repository just for jobs.

The better approach: EventBridge Scheduler triggers one-shot ECS Fargate tasks using the exact same Docker image as your API. No packaging. No separate build. Jobs pick up every dependency update automatically when you deploy a new image.

# IAM role for EventBridge to trigger ECS
cat > /tmp/eb-trust.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"Service": "scheduler.amazonaws.com"},
    "Action": "sts:AssumeRole"
  }]
}
EOF

aws iam create-role --role-name MyAppSchedulerRole \
  --assume-role-policy-document file:///tmp/eb-trust.json
aws iam attach-role-policy --role-name MyAppSchedulerRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonECS_FullAccess

SCHEDULER_ROLE_ARN=$(aws iam get-role \
  --role-name MyAppSchedulerRole --query Role.Arn --output text)
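AmazonECS_FullAccess works but grants far more than the scheduler needs. A narrower sketch, following the shape AWS describes for Scheduler ECS targets (ecs:RunTask on the job task definitions, limited to one cluster, plus iam:PassRole for the execution role). The account ID is a placeholder, and a task role, if you add one, would also need to be passable.

```python
"""Least-privilege policy for the EventBridge Scheduler role (instead of AmazonECS_FullAccess)."""
import json


def scheduler_policy(region: str, account_id: str, cluster: str,
                     task_families: list[str], execution_role_arn: str) -> dict:
    ecs = f"arn:aws:ecs:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Run only the job task definitions (any revision), only on this cluster.
                "Effect": "Allow",
                "Action": "ecs:RunTask",
                "Resource": [f"{ecs}:task-definition/{family}:*" for family in task_families],
                "Condition": {"ArnLike": {"ecs:cluster": f"{ecs}:cluster/{cluster}"}},
            },
            {
                # RunTask hands the execution role to ECS, which requires iam:PassRole.
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": execution_role_arn,
                "Condition": {"StringLike": {"iam:PassedToService": "ecs-tasks.amazonaws.com"}},
            },
        ],
    }


if __name__ == "__main__":
    acct = "123456789012"  # placeholder account ID
    policy = scheduler_policy("eu-west-1", acct, "myapp", ["myapp-cleanup", "myapp-notify"],
                              f"arn:aws:iam::{acct}:role/MyAppECSTaskExecutionRole")
    print(json.dumps(policy, indent=2))
```

The output could be attached to MyAppSchedulerRole with `aws iam put-role-policy` and the managed policy detached.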

Create a task definition per job. Each one reuses the API image and environment; the differences are the command override, a smaller CPU/memory size, and no port mapping or health check:

create_job_task() {
  local NAME=$1 HANDLER=$2
  cat > /tmp/task-${NAME}.json << EOF
{
  "family": "myapp-${NAME}",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "$EXEC_ROLE_ARN",
  "containerDefinitions": [{
    "name": "myapp-${NAME}",
    "image": "$ECR_URI:latest",
    "command": ["python3", "-m", "jobs.${HANDLER}"],
    "environment": $(cat /tmp/myapp-env.json),
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/myapp-api",
        "awslogs-region": "eu-west-1",
        "awslogs-stream-prefix": "${NAME}"
      }
    }
  }]
}
EOF
  aws ecs register-task-definition \
    --cli-input-json file:///tmp/task-${NAME}.json \
    --query 'taskDefinition.taskDefinitionArn' --output text
}

CLEANUP_ARN=$(create_job_task "cleanup" "expire_stale_records")
NOTIFY_ARN=$(create_job_task "notify"  "send_daily_digest")

Wire each task definition to an EventBridge schedule:

create_schedule() {
  local NAME=$1 TASK_ARN=$2 CRON=$3

  aws scheduler create-schedule --name $NAME \
    --schedule-expression "cron($CRON)" \
    --flexible-time-window '{"Mode":"OFF"}' \
    --target "{
      \"Arn\": \"arn:aws:ecs:eu-west-1:$(aws sts get-caller-identity --query Account --output text):cluster/myapp\",
      \"RoleArn\": \"$SCHEDULER_ROLE_ARN\",
      \"EcsParameters\": {
        \"TaskDefinitionArn\": \"$TASK_ARN\",
        \"LaunchType\": \"FARGATE\",
        \"TaskCount\": 1,
        \"NetworkConfiguration\": {
          \"awsvpcConfiguration\": {
            \"Subnets\": [\"$SUBNET_1\", \"$SUBNET_2\"],
            \"SecurityGroups\": [\"$APP_SG\"],
            \"AssignPublicIp\": \"ENABLED\"
          }
        }
      }
    }" --region eu-west-1
}

# Midnight UTC
create_schedule "myapp-nightly-cleanup" "$CLEANUP_ARN" "0 0 * * ? *"
# 8 AM UTC
create_schedule "myapp-daily-notify"    "$NOTIFY_ARN"  "0 8 * * ? *"
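EventBridge cron expressions have six fields (the sixth is the year) and need `?` in day-of-month or day-of-week, which trips up anyone used to five-field Unix cron. Schedules also run in UTC unless you pass --schedule-expression-timezone. A small hypothetical helper like this one (not part of any AWS SDK) can catch shape mistakes before create-schedule does:

```python
"""Split an EventBridge Scheduler cron expression into its six named fields."""
FIELDS = ("minutes", "hours", "day_of_month", "month", "day_of_week", "year")


def parse_eventbridge_cron(expr: str) -> dict:
    """Accepts 'cron(0 0 * * ? *)' or the bare '0 0 * * ? *' form; raises ValueError on bad shape."""
    expr = expr.strip()
    if expr.startswith("cron(") and expr.endswith(")"):
        expr = expr[len("cron("):-1]
    parts = expr.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {expr!r}")
    fields = dict(zip(FIELDS, parts))
    # EventBridge needs '?' in at least one of day-of-month / day-of-week.
    if "?" not in (fields["day_of_month"], fields["day_of_week"]):
        raise ValueError("one of day-of-month or day-of-week must be '?'")
    return fields
```

For instance, `parse_eventbridge_cron("0 0 * * ? *")` reads as minute 0 of hour 0 every day, i.e. midnight UTC, while a Unix-style `"0 0 * * *"` is rejected.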

Test a job manually:

aws ecs run-task \
  --cluster myapp \
  --task-definition myapp-cleanup \
  --launch-type FARGATE \
  --region eu-west-1 \
  --network-configuration "awsvpcConfiguration={
    subnets=[$SUBNET_1,$SUBNET_2],
    securityGroups=[$APP_SG],
    assignPublicIp=ENABLED
  }"

# Watch it run
aws logs tail /ecs/myapp-api --follow --region eu-west-1

Step 10 — Deploying Updates

The deploy loop is three steps:

# 1. Build and push
docker build --platform linux/amd64 -t myapp:latest .
VERSION=$(git rev-parse --short HEAD)
docker tag myapp:latest $ECR_URI:latest
docker tag myapp:latest $ECR_URI:$VERSION
docker push $ECR_URI:latest
docker push $ECR_URI:$VERSION

# 2. Force a new deployment (ECS pulls the new :latest image)
aws ecs update-service \
  --cluster myapp \
  --service myapp-api \
  --force-new-deployment

# 3. Wait for stability
aws ecs wait services-stable --cluster myapp --services myapp-api
echo "Deployed: $VERSION"

ECS runs a rolling deployment by default, and the old task keeps serving traffic until the new task passes its health checks.
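The reason the old task survives is the service's deployment bounds. A sketch of the arithmetic, assuming the rolling-update defaults for this service (minimumHealthyPercent 100, maximumPercent 200) and the ECS rounding rules: the healthy floor rounds up, the running ceiling rounds down.

```python
"""Task bounds ECS enforces during a rolling deployment."""
import math


def rolling_bounds(desired_count: int, minimum_healthy_percent: int = 100,
                   maximum_percent: int = 200) -> tuple[int, int]:
    """Return (minimum healthy tasks, maximum running tasks) during a deployment."""
    low = math.ceil(desired_count * minimum_healthy_percent / 100)   # rounded up
    high = math.floor(desired_count * maximum_percent / 100)         # rounded down
    return low, high
```

With `--desired-count 1`, `rolling_bounds(1)` gives a floor of one healthy task and a ceiling of two, so ECS starts the replacement alongside the old task and only stops the old one once the new one is healthy.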

Observability

All container stdout/stderr goes to CloudWatch Logs automatically via the awslogs driver configured in the task definition.

# Stream live logs
aws logs tail /ecs/myapp-api --follow --region eu-west-1

# Job logs are prefixed — easier to filter
aws logs tail /ecs/myapp-api --follow \
  --log-stream-name-prefix "cleanup/" --region eu-west-1

Set up a billing alert so you're never surprised. The budget below emails you once actual spend passes 80% of a $120 monthly cap, i.e. about $96:

aws budgets create-budget \
  --account-id $(aws sts get-caller-identity --query Account --output text) \
  --budget '{
    "BudgetName": "myapp-monthly-cap",
    "BudgetLimit": {"Amount": "120", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST"
  }' \
  --notifications-with-subscribers '[{
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80
    },
    "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "you@yourcompany.com"}]
  }]'

Security Checklist

Before you call it production:

[ ] APP_ENV=production set in ECS environment
[ ] RDS --no-publicly-accessible — verified with describe-db-instances
[ ] All secrets in SSM under /myapp/production/; nothing hardcoded in Dockerfiles or source-controlled task definition files
[ ] .env in .gitignore and confirmed not in git history
[ ] ALB SG: only ports 80/443 from 0.0.0.0/0
[ ] App SG: only port 8000 from ALB SG
[ ] DB SG: only port 5432 from App SG
[ ] Redis SG: only port 6379 from App SG
[ ] ECR scan-on-push enabled
[ ] Container runs as non-root user

Growth Path

The architecture scales without redesign:

DAU	Bottleneck	Upgrade	Approx. cost
0–500	—	MVP (this guide)	~$86/mo
500–2,000	Memory	2 vCPU / 4 GB Fargate task	~$110/mo
2,000–10,000	RDS IOPS	db.t3.small	~$140/mo
10,000–50,000	DB connections	RDS Proxy	~$220/mo
50,000+	DB throughput	Aurora Serverless v2	~$400+/mo

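Most of these upgrades are a single AWS CLI command or a task definition change. RDS Proxy and Aurora Serverless v2 take more setup (Aurora means migrating the data), but none of them require reworking the architecture.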

Troubleshooting Reference

Symptom	Likely Cause	Fix
ECS task keeps restarting	App crash at startup	`aws logs tail /ecs/myapp-api --follow`
ALB health checks failing	App not ready in time	Increase `health-check-grace-period-seconds` to 120
DB connection refused	Security group	DB SG must allow 5432 from App SG, not from `0.0.0.0/0`
Redis connection refused	Security group	Redis SG must allow 6379 from App SG
ECR pull failure	IAM or networking	Execution role needs `AmazonECSTaskExecutionRolePolicy`; tasks in public subnets also need `assignPublicIp=ENABLED` to reach ECR
SSM parameter not found	Wrong path	All params must live at `/myapp/production/UPPER_CASE`
Scheduled job fails	Job SG can't reach RDS/Redis	Use the same App SG for job task definitions
Container exits immediately	Missing env var	Check CloudWatch logs for the startup error

Closing Thoughts

This stack handles everything from zero users to tens of thousands without infrastructure rewrites. The pattern is boring in the best possible way: ECS Fargate is a managed scheduler, RDS and ElastiCache are managed data stores, and EventBridge Scheduler is managed cron. AWS handles patching and host-level maintenance for all of them; automatic failover additionally requires Multi-AZ RDS and a Redis replication group with replicas, which this MVP tier deliberately skips.

The two decisions that have the most leverage:

Security groups as your firewall. Every layer only opens to the layer above it. No shortcuts.
EventBridge Scheduler → ECS one-shot tasks over Lambda for heavy jobs. Lambda is excellent for lightweight event handlers. Once your dependency tree gets serious, reuse your existing image and let Fargate handle it.

The full setup including cluster, databases, registry, jobs, and custom domain takes about 45 minutes end-to-end following this guide. The Dockerfile multi-stage pattern, SSM parameterization, and EventBridge-ECS job design all translate directly to other languages and frameworks.

If you have questions or want to share how you've adapted this for your stack, drop a comment below.

Arafat Olayiwola is a 5x AWS Community Builder specializing in cloud-native backend architecture and developer tooling. He writes about practical AWS patterns for production systems.

DEV Community