Author: Arafat Olayiwola — 5x AWS Community Builder
Stack: Python 3.12 · FastAPI · Docker · AWS ECS Fargate · ALB · RDS · ElastiCache · EventBridge
Description: A complete, battle-tested AWS deployment walkthrough for containerized Python APIs — ECS Fargate, Application Load Balancer with ACM, RDS PostgreSQL, ElastiCache Redis, SSM secrets, and EventBridge-triggered scheduled jobs. ~$86/month, no servers to manage.
I've helped dozens of teams get their first serious AWS deployment off the ground. The same questions keep coming up: "Should I use Lambda or ECS?", "Where do I put my secrets?", "How do I run cron jobs without a server?"
This article is the end-to-end answer. It covers the exact architecture I ship for containerized API services in production — one that:
- Handles live webhooks with zero cold starts
- Keeps all secrets encrypted at rest and out of source control
- Runs scheduled jobs on a shared Docker image without packaging nightmares
- Costs around $86/month at the MVP tier and scales predictably
No half-baked tutorials. Every command runs against the real AWS CLI.
Architecture at a Glance
Internet
│
▼
ALB (HTTPS :443, ACM cert, custom domain)
│
▼
ECS Fargate (your-api task, 1 vCPU / 2 GB) ◄── ECR (container image)
│
├──► RDS PostgreSQL 16 t3.micro (no public access)
└──► ElastiCache Redis 7 t3.micro (no public access)
EventBridge Scheduler (3 cron rules)
│
└──► ECS Fargate one-shot tasks (same image, same VPC)
└──► RDS + Redis + external APIs
Why ECS Fargate over Lambda?
Lambda is fantastic for true event-driven workloads, but it hits friction fast when your dependency footprint grows. A full production Python stack (ORM, async DB driver, Redis client, third-party SDKs) can easily exceed Lambda's 250 MB unzipped limit for zip deployment packages. Fargate sidesteps that entirely: your Dockerfile is the deployment artifact, and AWS manages the underlying compute.
Why not EC2?
With Fargate you pay per second of task runtime and never SSH into an instance. The tradeoff — you lose the ability to tune the OS — is almost always the right one for API workloads.
What You'll Build
| Layer | Service | Spec | Monthly |
|---|---|---|---|
| Compute | ECS Fargate | 1 vCPU / 2 GB, 24/7 | $36 |
| Ingress | ALB + ACM | 1 LB, HTTPS, free cert | $18 |
| Database | RDS PostgreSQL 16 | db.t3.micro, 20 GB gp2 | $17 |
| Cache | ElastiCache Redis 7 | cache.t3.micro | $14 |
| Scheduled jobs | EventBridge + ECS | 3 cron rules | $0 |
| Registry | ECR | < 1 GB image | $0.05 |
| Observability | CloudWatch Logs | ~1.5 GB/month | $1 |
| Total | | | ~$86/mo |
Reserve RDS and ElastiCache for 1 year and that drops to ~$74/month — a 14% saving for committing to services you're running anyway.
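The totals are easy to re-derive. A minimal sketch, using only the figures from the table and the ~$74 reserved estimate quoted above (actual prices vary by region and over time):

```python
# Monthly line items copied from the cost table (USD).
monthly_costs = {
    "ECS Fargate": 36.00,
    "ALB + ACM": 18.00,
    "RDS PostgreSQL": 17.00,
    "ElastiCache Redis": 14.00,
    "EventBridge + ECS jobs": 0.00,
    "ECR": 0.05,
    "CloudWatch Logs": 1.00,
}

on_demand_total = round(sum(monthly_costs.values()), 2)

# Reserved estimate quoted in the text, not recomputed from AWS pricing.
reserved_total = 74.00
saving_pct = round((on_demand_total - reserved_total) / on_demand_total * 100)

print(f"On-demand: ${on_demand_total:.2f}/month")
print(f"Reserved:  ${reserved_total:.2f}/month ({saving_pct}% saving)")
```

Swapping in your own region's prices is a quick way to sanity-check the estimate before committing to a reservation.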
Prerequisites
Install the tools:
# AWS CLI v2
brew install awscli # macOS
# or: https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html
aws configure
# Default region: eu-west-1 (or your preferred region)
# Default output: json
export AWS_DEFAULT_REGION=eu-west-1
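Your IAM user needs these managed policies (scope them down after the first deploy). The list does not cover ACM, EventBridge Scheduler, or AWS Budgets, which later steps also call, so grant access to those separately: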
AmazonRDSFullAccess
AmazonElastiCacheFullAccess
AmazonEC2FullAccess
AmazonECS_FullAccess
AmazonEC2ContainerRegistryFullAccess
AWSLambda_FullAccess
CloudWatchFullAccess
AmazonSSMFullAccess
IAMFullAccess
ElasticLoadBalancingFullAccess
AmazonRoute53FullAccess
Step 1 — VPC and Security Groups
The network model is the most important thing to get right up front. Every security group follows the principle of least privilege: only the minimum source/destination pair is opened.
# Use the default VPC — fine for single-region MVPs
VPC_ID=$(aws ec2 describe-vpcs \
--filters "Name=isDefault,Values=true" \
--query "Vpcs[0].VpcId" --output text)
SUBNET_IDS=$(aws ec2 describe-subnets \
--filters "Name=vpc-id,Values=$VPC_ID" \
--query "Subnets[0:2].SubnetId" --output text | tr '\t' ',')
SUBNET_1=$(echo $SUBNET_IDS | cut -d',' -f1)
SUBNET_2=$(echo $SUBNET_IDS | cut -d',' -f2)
echo "VPC: $VPC_ID | Subnets: $SUBNET_1, $SUBNET_2"
# ALB security group — public-facing HTTPS + HTTP
ALB_SG=$(aws ec2 create-security-group \
--group-name myapp-alb-sg \
--description "App ALB public HTTPS" \
--vpc-id $VPC_ID --query GroupId --output text)
aws ec2 authorize-security-group-ingress \
--group-id $ALB_SG --protocol tcp --port 443 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress \
--group-id $ALB_SG --protocol tcp --port 80 --cidr 0.0.0.0/0
# App security group — port 8000, from ALB only
APP_SG=$(aws ec2 create-security-group \
--group-name myapp-app-sg \
--description "ECS tasks" \
--vpc-id $VPC_ID --query GroupId --output text)
aws ec2 authorize-security-group-ingress \
--group-id $APP_SG --protocol tcp --port 8000 --source-group $ALB_SG
# No egress rule needed: a new security group already allows all outbound traffic.
# Re-adding that default rule fails with InvalidPermission.Duplicate.
# Database security group — port 5432, from app only
DB_SG=$(aws ec2 create-security-group \
--group-name myapp-db-sg \
--description "RDS PostgreSQL" \
--vpc-id $VPC_ID --query GroupId --output text)
aws ec2 authorize-security-group-ingress \
--group-id $DB_SG --protocol tcp --port 5432 --source-group $APP_SG
# Redis security group — port 6379, from app only
REDIS_SG=$(aws ec2 create-security-group \
--group-name myapp-redis-sg \
--description "ElastiCache Redis" \
--vpc-id $VPC_ID --query GroupId --output text)
aws ec2 authorize-security-group-ingress \
--group-id $REDIS_SG --protocol tcp --port 6379 --source-group $APP_SG
echo "ALB_SG=$ALB_SG APP_SG=$APP_SG DB_SG=$DB_SG REDIS_SG=$REDIS_SG"
The chain: Internet → ALB SG → App SG → DB/Redis SG. No direct internet access to your database or cache. Ever.
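To make the chain concrete, here is a small model of the ingress rules from this step. It is an illustrative sketch only: the group names and the `allowed` helper are hypothetical and nothing here calls AWS.

```python
# Ingress rules from Step 1: target group -> list of (source, port).
# "internet" stands in for 0.0.0.0/0; the other sources are security groups.
INGRESS = {
    "alb": [("internet", 443), ("internet", 80)],
    "app": [("alb", 8000)],
    "db": [("app", 5432)],
    "redis": [("app", 6379)],
}

def allowed(source: str, target: str, port: int) -> bool:
    """Return True if `source` may open a connection to `target` on `port`."""
    return (source, port) in INGRESS.get(target, [])

def reachable_from_internet() -> set:
    """Groups that accept traffic directly from the internet on any port."""
    return {
        target
        for target, rules in INGRESS.items()
        if any(source == "internet" for source, _ in rules)
    }

print(reachable_from_internet())        # {'alb'}
print(allowed("app", "db", 5432))       # True
print(allowed("internet", "db", 5432))  # False
```

Only the ALB group accepts internet traffic; every other group names exactly one upstream group as its source.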
Step 2 — RDS PostgreSQL
aws rds create-db-subnet-group \
--db-subnet-group-name myapp-db-subnet \
--db-subnet-group-description "App RDS subnet group" \
--subnet-ids $SUBNET_1 $SUBNET_2
aws rds create-db-instance \
--db-instance-identifier myapp-db \
--db-instance-class db.t3.micro \
--engine postgres \
--engine-version 16 \
--db-name myapp \
--master-username myapp \
--master-user-password "<STRONG_PASSWORD>" \
--allocated-storage 20 \
--storage-type gp2 \
--no-multi-az \
--no-publicly-accessible \
--db-subnet-group-name myapp-db-subnet \
--vpc-security-group-ids $DB_SG \
--backup-retention-period 7 \
--deletion-protection
aws rds wait db-instance-available --db-instance-identifier myapp-db
DB_HOST=$(aws rds describe-db-instances \
--db-instance-identifier myapp-db \
--query "DBInstances[0].Endpoint.Address" --output text)
echo "DB host: $DB_HOST"
A few deliberate choices here:
- --no-publicly-accessible: RDS never gets a public IP. To run migrations locally, temporarily make the instance public and open port 5432 in the DB SG to your IP, run the migration, then close both again.
- --backup-retention-period 7: 7 days of automated backups. Backup storage is free up to the size of your provisioned database storage.
- --deletion-protection: prevents an accidental aws rds delete-db-instance.
Connecting locally for migrations (temporary):
MY_IP=$(curl -s https://checkip.amazonaws.com)
aws ec2 authorize-security-group-ingress \
--group-id $DB_SG --protocol tcp --port 5432 --cidr "${MY_IP}/32"
aws rds modify-db-instance \
--db-instance-identifier myapp-db --publicly-accessible --apply-immediately
aws rds wait db-instance-available --db-instance-identifier myapp-db
# run your migrations here
# Lock it back down immediately
aws ec2 revoke-security-group-ingress \
--group-id $DB_SG --protocol tcp --port 5432 --cidr "${MY_IP}/32"
aws rds modify-db-instance \
--db-instance-identifier myapp-db --no-publicly-accessible --apply-immediately
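The risk with a manual open/close sequence is forgetting the "close" half when a migration fails midway. One way to guard against that is a context manager that always revokes the rule. This is a sketch under assumptions: `aws` and `temporary_db_ingress` are hypothetical helper names, and it wraps only the security group rule (the public-access toggle would follow the same pattern).

```python
import subprocess
from contextlib import contextmanager

def aws(*args: str) -> None:
    """Run an AWS CLI command and raise if it exits non-zero."""
    subprocess.run(["aws", *args], check=True)

@contextmanager
def temporary_db_ingress(group_id: str, cidr: str, run=aws):
    """Open port 5432 to `cidr`, then always revoke it, even if the body raises."""
    rule = ["--group-id", group_id, "--protocol", "tcp",
            "--port", "5432", "--cidr", cidr]
    run("ec2", "authorize-security-group-ingress", *rule)
    try:
        yield
    finally:
        # Runs on success and on failure, so the rule never lingers.
        run("ec2", "revoke-security-group-ingress", *rule)

# Usage (run_migrations is a placeholder for your own migration command):
# with temporary_db_ingress(db_sg_id, f"{my_ip}/32"):
#     run_migrations()
```

The `run` parameter exists so the command runner can be replaced in tests without touching a real account.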
Step 3 — ElastiCache Redis
aws elasticache create-cache-subnet-group \
--cache-subnet-group-name myapp-redis-subnet \
--cache-subnet-group-description "App Redis subnet group" \
--subnet-ids $SUBNET_1 $SUBNET_2
aws elasticache create-cache-cluster \
--cache-cluster-id myapp-redis \
--cache-node-type cache.t3.micro \
--engine redis \
--engine-version 7.0 \
--num-cache-nodes 1 \
--cache-subnet-group-name myapp-redis-subnet \
--security-group-ids $REDIS_SG
aws elasticache wait cache-cluster-available --cache-cluster-id myapp-redis
REDIS_HOST=$(aws elasticache describe-cache-clusters \
--cache-cluster-id myapp-redis \
--show-cache-node-info \
--query "CacheClusters[0].CacheNodes[0].Endpoint.Address" --output text)
echo "Redis host: $REDIS_HOST"
Redis 7 on a cache.t3.micro handles rate limiting, session state, and caching for thousands of concurrent users at this tier.
Step 4 — Secrets Management with SSM Parameter Store
This is where most teams make mistakes. Environment variables hardcoded in Dockerfiles or task definitions end up in git history and CloudFormation console outputs. Don't do that.
SSM Parameter Store with SecureString parameters encrypts secrets with KMS, keeps an audit trail in CloudTrail, and integrates natively with ECS task definitions.
put_param() {
aws ssm put-parameter \
--name "/myapp/production/$1" \
--value "$2" \
--type SecureString \
--overwrite
}
get_param() {
aws ssm get-parameter \
--name "/myapp/production/$(echo $1 | tr '[:lower:]' '[:upper:]')" \
--with-decryption \
--query Parameter.Value \
--output text
}
put_param "APP_ENV" "production"
put_param "DATABASE_URL" "postgresql+asyncpg://myapp:<PASSWORD>@${DB_HOST}:5432/myapp"
put_param "REDIS_URL" "redis://${REDIS_HOST}:6379/0"
put_param "APP_SECRET_KEY" "$(openssl rand -hex 32)"
# ... all your other secrets
Verify:
aws ssm get-parameters-by-path \
--path "/myapp/production/" \
--query 'Parameters[*].Name' --output table
The naming convention /app/environment/KEY is important — it lets you scope IAM policies to a path prefix, so your ECS task role can only read its own environment's secrets.
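As an illustration of that scoping, the following sketch builds an IAM policy document limited to one path prefix. The account ID is a placeholder and `ssm_read_policy` is a hypothetical helper. It assumes the parameters use the default AWS managed key; a customer managed KMS key would also need a kms:Decrypt statement.

```python
import json

def ssm_read_policy(region: str, account_id: str, prefix: str) -> dict:
    """Build an IAM policy allowing parameter reads only under one path prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["ssm:GetParameter", "ssm:GetParameters"],
            # The ARN joins "parameter" directly to the parameter name's leading slash.
            "Resource": f"arn:aws:ssm:{region}:{account_id}:parameter/{prefix}/*",
        }],
    }

# 123456789012 is a placeholder account ID.
policy = ssm_read_policy("eu-west-1", "123456789012", "/myapp/production/")
print(json.dumps(policy, indent=2))
```

A staging environment would get the same policy with `/myapp/staging/` as the prefix, so neither role can read the other's secrets.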
Step 5 — Container Registry (ECR)
aws ecr create-repository \
--repository-name myapp \
--image-scanning-configuration scanOnPush=true
ECR_URI=$(aws ecr describe-repositories \
--repository-names myapp \
--query "repositories[0].repositoryUri" --output text)
aws ecr get-login-password | \
docker login --username AWS --password-stdin \
"$(echo $ECR_URI | cut -d'/' -f1)"
scanOnPush=true runs ECR basic scanning on every push, so you get CVE findings in the AWS console at no extra charge. Enhanced scanning (powered by Amazon Inspector) is a separate, paid, registry-level setting.
Step 6 — Dockerfile That's Production Ready
Here's the multi-stage Dockerfile pattern that keeps your image lean:
FROM python:3.12.7-slim AS builder
WORKDIR /build
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential libpq-dev \
&& rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt
FROM python:3.12.7-slim AS production
WORKDIR /app
RUN apt-get update && apt-get install -y --no-install-recommends \
libpq5 curl \
&& rm -rf /var/lib/apt/lists/* \
&& groupadd --gid 1001 appuser \
&& useradd --uid 1001 --gid appuser --no-create-home appuser
COPY --from=builder /root/.local /home/appuser/.local
COPY --chown=appuser:appuser . .
USER appuser
ENV PATH=/home/appuser/.local/bin:$PATH \
PYTHONPATH=/app \
PYTHONUNBUFFERED=1 \
PYTHONDONTWRITEBYTECODE=1
EXPOSE 8000
HEALTHCHECK --interval=30s --timeout=10s --start-period=20s --retries=3 \
CMD curl -f http://localhost:8000/health || exit 1
# 2 workers per container — Fargate scales containers horizontally
# 120s timeout covers slow AI/external API responses
CMD ["gunicorn", "app.main:app", \
"--workers", "2", \
"--worker-class", "uvicorn.workers.UvicornWorker", \
"--bind", "0.0.0.0:8000", \
"--timeout", "120", \
"--keep-alive", "5", \
"--access-logfile", "-", \
"--error-logfile", "-"]
Two things worth calling out:
- Non-root user: USER appuser is not optional in production. Many compliance frameworks flag containers running as root.
- Multi-stage build: the builder stage has gcc, libpq-dev, etc. None of that lands in the final image. The runtime image only carries the installed packages copied from the builder.
Build and push:
docker build --platform linux/amd64 -t myapp:latest .
docker tag myapp:latest $ECR_URI:latest
docker push $ECR_URI:latest
VERSION=$(git rev-parse --short HEAD)
docker tag myapp:latest $ECR_URI:$VERSION
docker push $ECR_URI:$VERSION
Always push a git SHA tag alongside latest. When something breaks at 3 AM, you want to know exactly which commit is running.
Step 7 — ECS Task Execution Role
The execution role is what ECS uses to pull your image from ECR and read SSM parameters at container startup. This is separate from the task role (what your application code uses).
cat > /tmp/ecs-trust.json << 'EOF'
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": { "Service": "ecs-tasks.amazonaws.com" },
"Action": "sts:AssumeRole"
}]
}
EOF
aws iam create-role \
--role-name MyAppECSTaskExecutionRole \
--assume-role-policy-document file:///tmp/ecs-trust.json
aws iam attach-role-policy \
--role-name MyAppECSTaskExecutionRole \
--policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
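# Allow ECS to read secrets from SSM (this managed policy covers every parameter
# in the account; swap it for a policy scoped to /myapp/production/* once things work)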
aws iam attach-role-policy \
--role-name MyAppECSTaskExecutionRole \
--policy-arn arn:aws:iam::aws:policy/AmazonSSMReadOnlyAccess
EXEC_ROLE_ARN=$(aws iam get-role \
--role-name MyAppECSTaskExecutionRole \
--query Role.Arn --output text)
Step 8 — ECS Cluster, Task Definition, and ALB
This is the longest step but it's the core of the deployment.
Create the ECS Cluster
aws ecs create-cluster \
--cluster-name myapp \
--capacity-providers FARGATE \
--region eu-west-1
Build the Environment Variables from SSM
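Rather than injecting secrets as ECS secrets references (which adds latency and IAM complexity at scale), I pull all SSM values at deploy time into the task definition's environment block. This is a deliberate tradeoff: simpler IAM and faster task startup, at the cost of the decrypted values being stored in plaintext in the task definition, where anyone with ecs:DescribeTaskDefinition can read them. Secrets rotate only when you regenerate the environment file, register a new task definition revision, and redeploy.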
# generate-env.py: run this at deploy time
import json
import subprocess

keys = [
    'APP_ENV', 'DATABASE_URL', 'REDIS_URL', 'APP_SECRET_KEY',
    # ... all your param names
]
env_list = []
for k in keys:
    # Fetch and decrypt each SecureString parameter via the AWS CLI
    val = subprocess.check_output([
        'aws', 'ssm', 'get-parameter',
        '--name', f'/myapp/production/{k}',
        '--with-decryption',
        '--query', 'Parameter.Value',
        '--output', 'text'
    ]).decode().strip()
    env_list.append({'name': k, 'value': val})
# Emit a JSON array shaped like the task definition's "environment" block
print(json.dumps(env_list, indent=2))

Run it and save the output:
python3 generate-env.py > /tmp/myapp-env.json
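For comparison, this is roughly what the alternative looks like: a `secrets` list whose entries point at SSM parameters by ARN, so ECS resolves the values at task start instead of storing them in the task definition. It is a sketch, not this guide's deployment path. `secrets_block` is a hypothetical helper and the account ID is a placeholder.

```python
import json

def secrets_block(keys: list, region: str, account_id: str, prefix: str) -> list:
    """Map each key to an ECS `secrets` entry that references SSM by ARN."""
    base = f"arn:aws:ssm:{region}:{account_id}:parameter/{prefix.strip('/')}"
    return [{"name": k, "valueFrom": f"{base}/{k}"} for k in keys]

keys = ["APP_ENV", "DATABASE_URL", "REDIS_URL", "APP_SECRET_KEY"]
# 123456789012 is a placeholder account ID.
block = secrets_block(keys, "eu-west-1", "123456789012", "/myapp/production")
print(json.dumps(block, indent=2))
```

With this variant the container definition carries `"secrets": [...]` in place of `"environment": [...]`, and the execution role needs ssm:GetParameters on those parameters.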
Register the Task Definition
aws logs create-log-group --log-group-name /ecs/myapp-api --region eu-west-1
cat > /tmp/myapp-task.json << EOF
{
"family": "myapp-api",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "1024",
"memory": "2048",
"executionRoleArn": "$EXEC_ROLE_ARN",
"containerDefinitions": [{
"name": "myapp-api",
"image": "$ECR_URI:latest",
"portMappings": [{"containerPort": 8000, "protocol": "tcp"}],
"environment": $(cat /tmp/myapp-env.json),
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/myapp-api",
"awslogs-region": "eu-west-1",
"awslogs-stream-prefix": "ecs"
}
},
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"],
"interval": 30,
"timeout": 10,
"retries": 3,
"startPeriod": 20
}
}]
}
EOF
aws ecs register-task-definition --cli-input-json file:///tmp/myapp-task.json
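Fargate only accepts certain cpu/memory pairs, and registration fails on anything else. The sketch below checks a pair before you register. The table covers sizes up to 4 vCPU as I understand the documented limits (larger sizes exist but are omitted), and `is_valid_fargate_size` is a hypothetical helper.

```python
# Allowed memory values (MiB) per CPU value, for sizes up to 4 vCPU.
FARGATE_MEMORY = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}

def is_valid_fargate_size(cpu: int, memory: int) -> bool:
    """Return True if the cpu/memory pair is one Fargate accepts."""
    return memory in FARGATE_MEMORY.get(cpu, [])

print(is_valid_fargate_size(1024, 2048))  # API task in this guide -> True
print(is_valid_fargate_size(512, 1024))   # job tasks in this guide -> True
print(is_valid_fargate_size(1024, 1024))  # too little memory for 1 vCPU -> False
```

The same check is useful later when resizing the task for the growth path.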
Create the Application Load Balancer
ALB_ARN=$(aws elbv2 create-load-balancer \
--name myapp-alb \
--subnets $SUBNET_1 $SUBNET_2 \
--security-groups $ALB_SG \
--scheme internet-facing \
--type application \
--query 'LoadBalancers[0].LoadBalancerArn' --output text)
ALB_DNS=$(aws elbv2 describe-load-balancers \
--load-balancer-arns $ALB_ARN \
--query 'LoadBalancers[0].DNSName' --output text)
# Create target group — ALB forwards to ECS tasks by IP
TG_ARN=$(aws elbv2 create-target-group \
--name myapp-tg \
--protocol HTTP --port 8000 \
--vpc-id $VPC_ID \
--target-type ip \
--health-check-path /health \
--health-check-interval-seconds 30 \
--healthy-threshold-count 2 \
--unhealthy-threshold-count 3 \
--query 'TargetGroups[0].TargetGroupArn' --output text)
# Request a free TLS certificate from ACM
CERT_ARN=$(aws acm request-certificate \
--domain-name api.yourdomain.com \
--validation-method DNS \
--query CertificateArn --output text)
echo "Add the DNS CNAME validation record shown in the ACM console, then continue."
# Block until ACM validates the certificate (typically < 5 minutes with Route53)
aws acm wait certificate-validated --certificate-arn $CERT_ARN
# HTTPS listener (after cert validates)
aws elbv2 create-listener \
--load-balancer-arn $ALB_ARN \
--protocol HTTPS --port 443 \
--certificates CertificateArn=$CERT_ARN \
--default-actions Type=forward,TargetGroupArn=$TG_ARN
# HTTP → HTTPS redirect
aws elbv2 create-listener \
--load-balancer-arn $ALB_ARN \
--protocol HTTP --port 80 \
--default-actions \
Type=redirect,RedirectConfig='{Protocol=HTTPS,Port=443,StatusCode=HTTP_301}'
Route53 Alias Record
HOSTED_ZONE_ID=$(aws route53 list-hosted-zones \
--query "HostedZones[?Name=='yourdomain.com.'].Id" \
--output text | cut -d'/' -f3)
# Note: Z32O12XQLNTSW2 is the ALB hosted zone ID for eu-west-1
# See: https://docs.aws.amazon.com/general/latest/gr/elb.html
aws route53 change-resource-record-sets \
--hosted-zone-id $HOSTED_ZONE_ID \
--change-batch "{
\"Changes\": [{
\"Action\": \"UPSERT\",
\"ResourceRecordSet\": {
\"Name\": \"api.yourdomain.com\",
\"Type\": \"A\",
\"AliasTarget\": {
\"HostedZoneId\": \"Z32O12XQLNTSW2\",
\"DNSName\": \"$ALB_DNS\",
\"EvaluateTargetHealth\": true
}
}
}]
}"
Create the ECS Service
aws ecs create-service \
--cluster myapp \
--service-name myapp-api \
--task-definition myapp-api \
--desired-count 1 \
--launch-type FARGATE \
--network-configuration "awsvpcConfiguration={
subnets=[$SUBNET_1,$SUBNET_2],
securityGroups=[$APP_SG],
assignPublicIp=ENABLED
}" \
--load-balancers "targetGroupArn=$TG_ARN,containerName=myapp-api,containerPort=8000" \
--health-check-grace-period-seconds 60 \
--region eu-west-1
aws ecs wait services-stable --cluster myapp --services myapp-api
curl https://api.yourdomain.com/health
# Expected: {"status": "ok", "env": "production"}
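DNS propagation and the first health checks can take a little while, so a single curl may fail even when the deploy is fine. A small polling helper, sketched here with only the standard library, retries until the endpoint reports healthy. `wait_for_health` is a hypothetical name and the retry counts are arbitrary defaults.

```python
import json
import time
import urllib.request

def wait_for_health(url: str, attempts: int = 10, delay: float = 3.0) -> dict:
    """Poll `url` until it returns HTTP 200 with status "ok", or raise."""
    last_error = None
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                body = json.loads(resp.read().decode())
                if resp.status == 200 and body.get("status") == "ok":
                    return body
                last_error = RuntimeError(f"unexpected response: {body}")
        except (OSError, ValueError) as exc:
            # OSError covers connection, DNS, timeout, and HTTP errors;
            # ValueError covers a body that is not valid JSON.
            last_error = exc
        time.sleep(delay)
    raise RuntimeError(f"{url} never became healthy: {last_error}")

# Usage:
# print(wait_for_health("https://api.yourdomain.com/health"))
```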
Step 9 — Scheduled Jobs with EventBridge Scheduler + ECS
This is the architecture decision I'm most proud of, and the one that trips people up most.
The naive approach is Lambda. The problem: a production Python API's dependency closure (ORM, async DB driver, HTTP clients, AI SDK) can hit 300-400 MB unzipped. Lambda's limit for zip packages is 250 MB unzipped. You'd need custom Docker Lambda images, a separate build pipeline, and a second ECR repository just for jobs.
The better approach: EventBridge Scheduler triggers one-shot ECS Fargate tasks using the exact same Docker image as your API. No packaging. No separate build. Jobs pick up every dependency update automatically when you deploy a new image.
# IAM role for EventBridge to trigger ECS
cat > /tmp/eb-trust.json << 'EOF'
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {"Service": "scheduler.amazonaws.com"},
"Action": "sts:AssumeRole"
}]
}
EOF
aws iam create-role --role-name MyAppSchedulerRole \
--assume-role-policy-document file:///tmp/eb-trust.json
aws iam attach-role-policy --role-name MyAppSchedulerRole \
--policy-arn arn:aws:iam::aws:policy/AmazonECS_FullAccess
SCHEDULER_ROLE_ARN=$(aws iam get-role \
--role-name MyAppSchedulerRole --query Role.Arn --output text)
Create a task definition per job. The differences from the API task are the command override, a smaller CPU/memory allocation, and no port mappings or health check:
create_job_task() {
local NAME=$1 HANDLER=$2
cat > /tmp/task-${NAME}.json << EOF
{
"family": "myapp-${NAME}",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "512",
"memory": "1024",
"executionRoleArn": "$EXEC_ROLE_ARN",
"containerDefinitions": [{
"name": "myapp-${NAME}",
"image": "$ECR_URI:latest",
"command": ["python3", "-m", "jobs.${HANDLER}"],
"environment": $(cat /tmp/myapp-env.json),
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/myapp-api",
"awslogs-region": "eu-west-1",
"awslogs-stream-prefix": "${NAME}"
}
}
}]
}
EOF
aws ecs register-task-definition \
--cli-input-json file:///tmp/task-${NAME}.json \
--query 'taskDefinition.taskDefinitionArn' --output text
}
CLEANUP_ARN=$(create_job_task "cleanup" "expire_stale_records")
NOTIFY_ARN=$(create_job_task "notify" "send_daily_digest")
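The command override runs python3 -m jobs.<handler>, so each job is an ordinary module inside the image. The article does not show one, so the following is a hypothetical sketch of how such a module might be shaped. The placeholder job body and the `run` wrapper are assumptions, not the author's code.

```python
# jobs/expire_stale_records.py (hypothetical layout matching the command override)
import logging
import sys
from typing import Callable

# Log to stdout so the awslogs driver ships everything to CloudWatch.
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
log = logging.getLogger("jobs.expire_stale_records")

def expire_stale_records() -> int:
    """Placeholder for the real work; returns the number of records expired."""
    return 0

def run(job: Callable[[], int]) -> int:
    """Run `job` and translate the outcome into a process exit code."""
    try:
        count = job()
    except Exception:
        log.exception("job failed")
        return 1
    log.info("job finished, %d records processed", count)
    return 0
```

The module would end with sys.exit(run(expire_stale_records)) under an if __name__ == "__main__": guard. A non-zero exit code shows up on the stopped ECS task, which makes failed runs easy to spot and alert on.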
Wire each task definition to an EventBridge schedule:
create_schedule() {
local NAME=$1 TASK_ARN=$2 CRON=$3
aws scheduler create-schedule --name $NAME \
--schedule-expression "cron($CRON)" \
--flexible-time-window '{"Mode":"OFF"}' \
--target "{
\"Arn\": \"arn:aws:ecs:eu-west-1:$(aws sts get-caller-identity --query Account --output text):cluster/myapp\",
\"RoleArn\": \"$SCHEDULER_ROLE_ARN\",
\"EcsParameters\": {
\"TaskDefinitionArn\": \"$TASK_ARN\",
\"LaunchType\": \"FARGATE\",
\"TaskCount\": 1,
\"NetworkConfiguration\": {
\"awsvpcConfiguration\": {
\"Subnets\": [\"$SUBNET_1\", \"$SUBNET_2\"],
\"SecurityGroups\": [\"$APP_SG\"],
\"AssignPublicIp\": \"ENABLED\"
}
}
}
}" --region eu-west-1
}
# 23:00 UTC daily (for midnight UTC, use "0 0 * * ? *")
create_schedule "myapp-nightly-cleanup" "$CLEANUP_ARN" "0 23 * * ? *"
# 8 AM UTC
create_schedule "myapp-daily-notify" "$NOTIFY_ARN" "0 8 * * ? *"
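EventBridge cron expressions have six fields (minutes, hours, day-of-month, month, day-of-week, year), which is easy to get wrong when you are used to five-field crontab syntax. A small helper, hypothetical and for illustration only, makes the hour explicit:

```python
def daily_cron(hour: int, minute: int = 0) -> str:
    """Build the body of a six-field EventBridge cron expression for a daily run."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # Fields: minutes hours day-of-month month day-of-week year.
    # Day-of-month is "*", so day-of-week must be "?".
    return f"{minute} {hour} * * ? *"

print(daily_cron(0))   # 0 0 * * ? *   (midnight)
print(daily_cron(23))  # 0 23 * * ? *  (23:00)
print(daily_cron(8))   # 0 8 * * ? *   (08:00)
```

Schedules are evaluated in UTC unless you pass --schedule-expression-timezone to create-schedule, so double-check which "midnight" you mean.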
Test a job manually:
aws ecs run-task \
--cluster myapp \
--task-definition myapp-cleanup \
--launch-type FARGATE \
--region eu-west-1 \
--network-configuration "awsvpcConfiguration={
subnets=[$SUBNET_1,$SUBNET_2],
securityGroups=[$APP_SG],
assignPublicIp=ENABLED
}"
# Watch it run
aws logs tail /ecs/myapp-api --follow --region eu-west-1
Step 10 — Deploying Updates
The deploy loop is three steps:
# 1. Build and push
docker build --platform linux/amd64 -t myapp:latest .
VERSION=$(git rev-parse --short HEAD)
docker tag myapp:latest $ECR_URI:latest
docker tag myapp:latest $ECR_URI:$VERSION
docker push $ECR_URI:latest
docker push $ECR_URI:$VERSION
# 2. Force a new deployment (ECS pulls the new :latest image)
aws ecs update-service \
--cluster myapp \
--service myapp-api \
--force-new-deployment
# 3. Wait for stability
aws ecs wait services-stable --cluster myapp --services myapp-api
echo "Deployed: $VERSION"
ECS runs a rolling deployment by default — the old task keeps serving traffic until the new task passes health checks.
Observability
All container stdout/stderr goes to CloudWatch Logs automatically via the awslogs driver configured in the task definition.
# Stream live logs
aws logs tail /ecs/myapp-api --follow --region eu-west-1
# Job logs are prefixed — easier to filter
aws logs tail /ecs/myapp-api --follow \
--log-stream-name-prefix "cleanup/" --region eu-west-1
Set up a billing alert so you're never surprised:
aws budgets create-budget \
--account-id $(aws sts get-caller-identity --query Account --output text) \
--budget '{
"BudgetName": "myapp-monthly-cap",
"BudgetLimit": {"Amount": "120", "Unit": "USD"},
"TimeUnit": "MONTHLY",
"BudgetType": "COST"
}' \
--notifications-with-subscribers '[{
"Notification": {
"NotificationType": "ACTUAL",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 80
},
"Subscribers": [{"SubscriptionType": "EMAIL", "Address": "you@yourcompany.com"}]
}]'
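The Threshold value is a percentage of the budget limit rather than a dollar amount, assuming the default threshold type. The arithmetic for the numbers above, as a quick sketch:

```python
budget_limit_usd = 120.0
threshold_pct = 80.0  # assumes the default percentage-based threshold type

alert_at_usd = budget_limit_usd * threshold_pct / 100
headroom_over_baseline = alert_at_usd - 86.0  # ~$86/month MVP estimate from this guide

print(f"Alert fires once actual spend exceeds ${alert_at_usd:.2f}")
print(f"That is ${headroom_over_baseline:.2f} above the MVP baseline")
```

So the email arrives at about $96 of actual spend, roughly $10 above the expected monthly bill.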
Security Checklist
Before you call it production:
- [ ] APP_ENV=production set in ECS environment
- [ ] RDS --no-publicly-accessible, verified with describe-db-instances
- [ ] All secrets in SSM under /myapp/production/, with nothing hardcoded in Dockerfiles or task definitions
- [ ] .env in .gitignore and confirmed not in git history
- [ ] ALB SG: only ports 80/443 from 0.0.0.0/0
- [ ] App SG: only port 8000 from ALB SG
- [ ] DB SG: only port 5432 from App SG
- [ ] Redis SG: only port 6379 from App SG
- [ ] ECR scan-on-push enabled
- [ ] Container runs as non-root user
Growth Path
The architecture scales without redesign:
| DAU | Bottleneck | Upgrade | Approx. cost |
|---|---|---|---|
| 0–500 | — | MVP (this guide) | ~$86/mo |
| 500–2,000 | Memory | 2 vCPU / 4 GB Fargate task | ~$110/mo |
| 2,000–10,000 | RDS IOPS | db.t3.small | ~$140/mo |
| 10,000–50,000 | DB connections | RDS Proxy | ~$220/mo |
| 50,000+ | DB throughput | Aurora Serverless v2 | ~$400+/mo |
Most upgrades are a single AWS CLI command or a task definition change; only the move to Aurora involves a data migration. None of them requires architectural rework.
Troubleshooting Reference
| Symptom | Likely Cause | Fix |
|---|---|---|
| ECS task keeps restarting | App crash at startup | aws logs tail /ecs/myapp-api --follow |
| ALB health checks failing | App not ready in time | Increase health-check-grace-period-seconds to 120 |
| DB connection refused | Security group | DB SG must allow 5432 from App SG, not from 0.0.0.0/0 |
| Redis connection refused | Security group | Redis SG must allow 6379 from App SG |
| ECR pull failure | IAM | Add AmazonEC2ContainerRegistryReadOnly to execution role |
| SSM parameter not found | Wrong path | All params must live at /myapp/production/UPPER_CASE |
| Scheduled job fails | Job SG can't reach RDS/Redis | Use the same App SG for job task definitions |
| Container exits immediately | Missing env var | Check CloudWatch logs for the startup error |
Closing Thoughts
This stack handles everything from zero users to tens of thousands without infrastructure rewrites. The pattern is boring in the best possible way: ECS Fargate is managed compute, RDS and ElastiCache are managed data stores, EventBridge Scheduler is a managed cron. AWS handles host patching and maintenance for all of them; automatic database failover needs Multi-AZ, which this MVP tier deliberately skips.
The two decisions that have the most leverage:
- Security groups as your firewall. Every layer only opens to the layer above it. No shortcuts.
- EventBridge Scheduler → ECS one-shot tasks over Lambda for heavy jobs. Lambda is excellent for lightweight event handlers. Once your dependency tree gets serious, reuse your existing image and let Fargate handle it.
The full setup including cluster, databases, registry, jobs, and custom domain takes about 45 minutes end-to-end following this guide. The Dockerfile multi-stage pattern, SSM parameterization, and EventBridge-ECS job design all translate directly to other languages and frameworks.
If you have questions or want to share how you've adapted this for your stack, drop a comment below.
Arafat Olayiwola is a 5x AWS Community Builder specializing in cloud-native backend architecture and developer tooling. He writes about practical AWS patterns for production systems.