Most AWS tutorials stop at "it works."
I wanted to build something closer to what a real engineering team would operate: network isolation, IAM least privilege, blue-green deployments, secrets management, and clean teardown—all defined as code.
This article walks through the architecture, design decisions, tradeoffs, and the 8 real issues I encountered along the way.
GitHub Repository: ecs-production-platform
Total Cost: $0.12 for complete validation (4-hour session)
Table of Contents
- What Was Built
- Architecture Overview
- Security Design
- Blue-Green Deployment Mechanics
- What Broke (8 Issues)
- Production Tradeoffs
- Cost Analysis
What Was Built
A production-aligned ECS Fargate platform running a Flask API backed by PostgreSQL:
Networking
- Custom VPC (10.0.0.0/16) across 2 Availability Zones
- Public subnets for ALB and ECS tasks
- Private subnets for RDS (no internet route)
Compute
- ECS Fargate services (no EC2 instance management)
- Application Load Balancer with HTTPS (ACM certificate, TLS 1.3)
- Blue-green target groups for zero-downtime deployments
Data
- RDS PostgreSQL 15.12 (single-AZ for free tier)
- Private subnet only, no public endpoint
Security
- IAM role separation (execution vs task)
- SSM Parameter Store for secrets (KMS encrypted)
- Security group layering (internet → ALB → ECS → RDS)
Infrastructure as Code
- 100% Terraform (modular design)
- Remote state in S3 with DynamoDB locking
- Reusable modules (networking, IAM, ALB, ECS, RDS)
36 AWS resources deployed, tested, and cleanly destroyed.
Architecture Overview
┌─────────────────────────────────────────────┐
│ Internet │
└──────────────────┬──────────────────────────┘
│
┌─────────▼─────────┐
│ Route 53 DNS │
└─────────┬──────────┘
│
┌─────────▼──────────┐
│ ACM Certificate │
│ (TLS 1.3) │
└─────────┬───────────┘
│
┌──────────────▼──────────────┐
│ Application Load Balancer │
│ (Public Subnets) │
└──────┬──────────────┬───────┘
│ │
┌──────▼─────┐ ┌─────▼──────┐
│ Blue TG │ │ Green TG │
│ (Weight: │ │ (Weight: │
│ 100%) │ │ 0%) │
└──────┬─────┘ └─────┬──────┘
│ │
┌──────▼──────────────▼──────┐
│ ECS Fargate Services │
│ (Public Subnets) │
│ • 2 tasks (blue) │
│ • 0 tasks (green standby) │
└──────────────┬──────────────┘
│
┌─────────▼──────────┐
│ RDS PostgreSQL │
│ (Private Subnet) │
│ • No public IP │
│ • Port 5432 only │
└─────────────────────┘
Key Design Choices:
- ECS in Public Subnets: Cost optimization—saves $33/month on NAT Gateway
- Single-AZ RDS: Free tier constraint—production would use Multi-AZ
- Security Groups: Each layer enforces isolation for the next
Terraform Module Structure
Instead of one monolithic configuration, concerns are separated into focused modules:
terraform/
├── modules/
│ ├── networking/ # VPC, subnets, security groups, DB subnet group
│ ├── iam/ # Task execution role, task role, policies
│ ├── alb/ # Load balancer, target groups, listeners
│ ├── ecs/ # Cluster, services, task definitions
│ ├── rds/ # PostgreSQL instance, parameter group
│ └── cicd/ # GitHub Actions IAM role (design only)
└── environments/
└── prod/
├── main.tf # Module composition
├── variables.tf # Input variables
├── outputs.tf # Stack outputs
├── backend.tf # S3 + DynamoDB state
└── versions.tf # Provider versions
Module Communication:
- Explicit inputs and outputs only
- No cross-module resource references
- No hidden dependencies
Example Module Call:
module "ecs" {
source = "../../modules/ecs"
project_name = var.project_name
container_image = var.container_image
ecs_task_execution_role_arn = module.iam.ecs_task_execution_role_arn
ecs_task_role_arn = module.iam.ecs_task_role_arn
public_subnet_ids = module.networking.public_subnet_ids
ecs_security_group_id = module.networking.ecs_tasks_security_group_id
target_group_arn = module.alb.blue_target_group_arn
# Database connection
db_host = module.rds.db_address
db_password_ssm_param = "/ecs-prod/db/password"
}
This keeps the architecture composable and prevents circular dependency hell.
Security Design
1. Network Isolation (Security Groups)
Traffic flows in one direction only:
Internet (0.0.0.0/0)
↓ Port 443/80
┌───────────────────┐
│ ALB SG │
│ sg-0373513dd... │
└────────┬──────────┘
↓ Port 8000 (from ALB SG only)
┌────────────────────┐
│ ECS Tasks SG │
│ sg-09a3082e31... │
└────────┬───────────┘
↓ Port 5432 (from ECS SG only)
┌────────────────────┐
│ RDS SG │
│ sg-07a5aae1f9... │
└────────────────────┘
Security Group Rules:
ALB Security Group:
ingress {
description = "HTTPS from internet"
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
egress {
description = "To ECS tasks only"
from_port = 8000
to_port = 8000
protocol = "tcp"
security_groups = [aws_security_group.ecs_tasks.id]
}
ECS Tasks Security Group:
ingress {
description = "From ALB only"
from_port = 8000
to_port = 8000
protocol = "tcp"
security_groups = [aws_security_group.alb.id]
}
egress {
description = "To RDS only"
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = [aws_security_group.rds.id]
}
RDS Security Group:
ingress {
description = "PostgreSQL from ECS only"
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = [aws_security_group.ecs_tasks.id]
}
# No egress rules - database doesn't need outbound
Result:
- No direct internet access to ECS tasks
- No public database endpoint
- No bypass paths
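The layering above can be checked offline. This is a minimal sketch: the rule dicts mimic the shape boto3's describe_security_groups returns, and the group IDs ("sg-alb", "sg-ecs") are illustrative placeholders, not the real IDs from this stack.

```python
def is_locked_down(ingress_rules, allowed_source_sg, port):
    """True iff every ingress rule admits only `allowed_source_sg` on `port`."""
    for rule in ingress_rules:
        if rule.get("IpRanges"):  # any CIDR-based rule is a bypass path
            return False
        sources = {p["GroupId"] for p in rule.get("UserIdGroupPairs", [])}
        if sources != {allowed_source_sg}:
            return False
        if rule["FromPort"] != port or rule["ToPort"] != port:
            return False
    return True

# Rules shaped like boto3's describe_security_groups output (IDs are placeholders)
ecs_ingress = [{
    "FromPort": 8000, "ToPort": 8000, "IpProtocol": "tcp",
    "IpRanges": [],
    "UserIdGroupPairs": [{"GroupId": "sg-alb"}],
}]
rds_ingress = [{
    "FromPort": 5432, "ToPort": 5432, "IpProtocol": "tcp",
    "IpRanges": [],
    "UserIdGroupPairs": [{"GroupId": "sg-ecs"}],
}]

assert is_locked_down(ecs_ingress, "sg-alb", 8000)
assert is_locked_down(rds_ingress, "sg-ecs", 5432)
```

The same predicate, fed live data, makes a cheap CI guardrail: fail the pipeline if anyone adds a 0.0.0.0/0 ingress to the ECS or RDS layer.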
2. IAM Role Separation
Two distinct roles prevent privilege escalation:
Task Execution Role (infrastructure operations):
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": [
"ecr:GetAuthorizationToken",
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage",
"logs:CreateLogStream",
"logs:PutLogEvents",
"ssm:GetParameters"
],
"Resource": "*"
}]
}
Task Role (application runtime):
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": "ssm:GetParameter",
"Resource": "arn:aws:ssm:us-east-1:*:parameter/ecs-prod/*"
}]
}
Why This Matters:
If a container is compromised, the attacker inherits only the task role—not the execution role. They can't:
- Pull arbitrary images from ECR
- Write to CloudWatch logs outside their stream
- Access SSM parameters outside /ecs-prod/*
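To make that last point concrete, here is a sketch of how the task role's resource pattern scopes access. Python's fnmatch only approximates IAM's wildcard matching (real evaluation happens inside AWS), but it is close enough to illustrate the blast-radius boundary; the account ID is a made-up example.

```python
from fnmatch import fnmatch

# Resource pattern from the task role policy above
TASK_ROLE_RESOURCE = "arn:aws:ssm:us-east-1:*:parameter/ecs-prod/*"

def task_role_allows(parameter_arn):
    """Approximate the policy's wildcard match (illustration only)."""
    return fnmatch(parameter_arn, TASK_ROLE_RESOURCE)

# In scope: the app's own namespace
assert task_role_allows(
    "arn:aws:ssm:us-east-1:123456789012:parameter/ecs-prod/db/password")
# Out of scope: anything outside /ecs-prod/*
assert not task_role_allows(
    "arn:aws:ssm:us-east-1:123456789012:parameter/other-app/api-key")
```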
3. Secrets Management Flow
# 1. Generate password
PASSWORD=$(openssl rand -base64 32 | tr -d "=+/" | cut -c1-25)
# 2. Store in SSM (KMS encrypted)
aws ssm put-parameter \
--name "/ecs-prod/db/password" \
--value "$PASSWORD" \
--type "SecureString"
# 3. Reference in task definition
{
"name": "DB_PASSWORD_SSM_PARAM",
"value": "/ecs-prod/db/password"
}
# 4. Application fetches at runtime
import boto3
ssm = boto3.client('ssm')
password = ssm.get_parameter(
Name=os.environ['DB_PASSWORD_SSM_PARAM'],
WithDecryption=True
)['Parameter']['Value']
Never in:
- Source control
- Docker image
- Environment variables (plaintext)
- Terraform state (marked sensitive)
Blue-Green Deployment Mechanics
The ALB HTTPS listener can route traffic to two separate target groups with configurable weights.
Initial State
Blue Target Group: ████████████████████ 100% (2 healthy tasks)
Green Target Group: -------------------- 0% (0 tasks)
Deployment Process
Step 1: Build New Version
# Update app version in app.py
# Build Docker image
docker build -t 758620460011.dkr.ecr.us-east-1.amazonaws.com/ecs-prod/flask-app:v2 .
# Push to ECR
docker push 758620460011.dkr.ecr.us-east-1.amazonaws.com/ecs-prod/flask-app:v2
Step 2: Deploy to Green
# Scale up green service
aws ecs update-service \
--cluster ecs-prod-cluster \
--service ecs-prod-service-green \
--desired-count 2
# Wait for health checks (90 seconds)
aws ecs wait services-stable \
--cluster ecs-prod-cluster \
--services ecs-prod-service-green
Step 3: Switch Traffic (< 1 Second)
# Get listener and target group ARNs
LISTENER_ARN=$(terraform output -raw alb_listener_arn)
GREEN_TG=$(terraform output -raw green_target_group_arn)
# Instant traffic switch
aws elbv2 modify-listener \
--listener-arn $LISTENER_ARN \
--default-actions Type=forward,TargetGroupArn=$GREEN_TG
Step 4: Validate and Cleanup
# Monitor green deployment
curl https://app.cipherpol.xyz/health
# {"version":"2.0.0","deployment":"green","status":"healthy"}
# After 15 minutes of monitoring, scale down blue
aws ecs update-service \
--cluster ecs-prod-cluster \
--service ecs-prod-service \
--desired-count 0
Why This Works
No DNS propagation delays — Traffic switches at ALB layer
No container restarts — Only listener weight changes
Instant rollback — Reverse the listener modification
No downtime — ALB handles connection draining
Rollback Command (Same as Deploy):
# Get blue target group
BLUE_TG=$(terraform output -raw blue_target_group_arn)
# Instant rollback
aws elbv2 modify-listener \
--listener-arn $LISTENER_ARN \
--default-actions Type=forward,TargetGroupArn=$BLUE_TG
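The all-at-once switch above can also be done gradually. A sketch of building the weighted ForwardConfig action for a canary-style shift; the ARNs are placeholders, and the boto3 call at the end is the assumed usage rather than something exercised here:

```python
def weighted_forward_action(blue_arn, green_arn, green_weight):
    """Build the listener default-actions payload for a gradual blue->green shift."""
    assert 0 <= green_weight <= 100
    return [{
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": blue_arn,  "Weight": 100 - green_weight},
                {"TargetGroupArn": green_arn, "Weight": green_weight},
            ]
        },
    }]

# 10% canary: most traffic stays on blue while green proves itself
action = weighted_forward_action("arn:blue", "arn:green", 10)
assert action[0]["ForwardConfig"]["TargetGroups"][0]["Weight"] == 90

# Assumed usage with boto3:
# elbv2.modify_listener(ListenerArn=listener_arn, DefaultActions=action)
```

Stepping green_weight through 10, 50, 100 gives you checkpoints to watch error rates before committing the full cutover.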
Data Persistence Validation
Both blue and green services connect to the same RDS instance. I validated this explicitly:
# 1. Create items while blue is active
curl -X POST https://app.cipherpol.xyz/items \
-H "Content-Type: application/json" \
-d '{"name":"Test Item 1"}'
curl -X POST https://app.cipherpol.xyz/items \
-H "Content-Type: application/json" \
-d '{"name":"Test Item 2"}'
# 2. Switch traffic to green
aws elbv2 modify-listener --listener-arn $LISTENER_ARN --default-actions Type=forward,TargetGroupArn=$GREEN_TG
# 3. Verify data persists
curl https://app.cipherpol.xyz/items | jq
{
"count": 2,
"items": [
{"id": 2, "name": "Test Item 2", "created_at": "2026-02-19T10:54:03"},
{"id": 1, "name": "Test Item 1", "created_at": "2026-02-19T10:53:02"}
]
}
# 4. Create new item on green
curl -X POST https://app.cipherpol.xyz/items \
-H "Content-Type: application/json" \
-d '{"name":"Created on Green v2.0"}'
# 5. Rollback to blue
aws elbv2 modify-listener --listener-arn $LISTENER_ARN --default-actions Type=forward,TargetGroupArn=$BLUE_TG
# 6. Confirm all items still present
curl https://app.cipherpol.xyz/items | jq '.count'
# 3
Result:
5 items persisted across 3 deployment cycles
The deployment layer is stateless
The database is the single source of truth
What Broke (And What I Learned)
Issue 1: RDS Connection Delay After "Available" Status
Problem: ECS tasks failed health checks immediately after terraform apply completed RDS creation.
Symptoms:
ecs-prod-service: unhealthy targets: 2/2
Task stopped reason: Task failed container health checks
Root Cause:
RDS reports available status once the instance is running, but it doesn't accept connections for another 60-90 seconds while:
- Engine background processes initialize
- Internal catalogs and extensions load
- Caches warm up
Fix:
Health check grace period (60s) absorbed the delay:
resource "aws_ecs_service" "app" {
health_check_grace_period_seconds = 60
# Tasks retry connection until RDS is ready
}
Lesson: AWS resource statuses don't always mean "ready for traffic." Plan for initialization time.
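The "tasks retry connection until RDS is ready" behavior can be made explicit in application code instead of relying on crash-and-restart. A minimal backoff sketch; `connect` stands in for something like psycopg2.connect, and the names are illustrative:

```python
import time

def connect_with_retry(connect, attempts=6, base_delay=1.0):
    """Retry a flaky connection with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the real error
            time.sleep(base_delay * 2 ** attempt)

# Simulate RDS accepting connections only on the third attempt
state = {"tries": 0}
def flaky_connect():
    state["tries"] += 1
    if state["tries"] < 3:
        raise ConnectionError("database is still initializing")
    return "connection"

assert connect_with_retry(flaky_connect, base_delay=0.01) == "connection"
assert state["tries"] == 3
```

With six attempts and a 1-second base delay the loop tolerates roughly a minute of startup lag, which covers the 60-90 second gap observed here.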
Issue 2: Docker HEALTHCHECK Missing Dependency
Problem: 100+ ECS task restarts. ALB target health showed healthy, but ECS kept replacing tasks.
Symptoms:
# ALB target group
Target health: healthy (2/2)
# ECS service events
Unhealthy container: flask-app
Task stopped, starting replacement
Root Cause:
# Original (broken)
HEALTHCHECK CMD curl -f http://localhost:8000/health || exit 1
But curl isn't installed in the python:3.11-slim base image.
Why ALB Passed but Container Failed:
- ALB health check: HTTP request from outside the container (port 8000)
- Container health check: Command executed inside the container
Fix:
# Use Python instead of curl
HEALTHCHECK CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=2)"
Lesson:
ALB health checks and container health checks are independent control loops:
- ALB health: Determines if task receives traffic
- Container health: Determines if ECS replaces the task
ECS uses container health, not ALB health.
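The one-line fix can be exercised end to end without Docker. This sketch stands up a throwaway local server with a /health route (a stand-in for the Flask app's endpoint, not the actual app) and runs a curl-free check against it:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = b'{"status":"healthy"}'
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()
    def log_message(self, *args):  # keep health probes out of the logs
        pass

def healthcheck(url, timeout=2):
    """Exit-code-style check: True iff the endpoint answers 200 within the timeout."""
    try:
        return urllib.request.urlopen(url, timeout=timeout).status == 200
    except OSError:  # covers HTTPError, URLError, connection refused, timeout
        return False

server = HTTPServer(("127.0.0.1", 0), Health)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
assert healthcheck(f"http://127.0.0.1:{port}/health")
assert not healthcheck(f"http://127.0.0.1:{port}/missing")
server.shutdown()
```

The same `healthcheck` logic is what the Dockerfile's Python one-liner runs inside the container, so testing it locally catches missing-dependency surprises before ECS does.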
Issue 3: Git Bash Path Conversion on Windows
Problem: SSM parameter /ecs-prod/db/password became C:\Program Files\Git\ecs-prod\db\password
Error:
ParameterNotFound: /C:/Program Files/Git/ecs-prod/db/password
Root Cause: Git Bash on Windows auto-converts Unix-style paths to Windows paths.
Fix:
# Disable path conversion
export MSYS_NO_PATHCONV=1
# Then run AWS CLI
aws ssm get-parameter --name /ecs-prod/db/password --with-decryption
Alternative: Use Windows Command Prompt or PowerShell for AWS CLI commands.
Lesson: Git Bash is great for Unix tools, but AWS CLI needs special handling on Windows.
Issue 4: ALB Listener Syntax Constraint
Problem: After configuring blue-green with weighted routing, subsequent updates using simple syntax failed.
Error:
An error occurred (ValidationError): Cannot use both TargetGroupArn and ForwardConfig in the same action
Root Cause: Once you use ForwardConfig (weighted routing), the ALB API remembers this and requires full JSON syntax for all future updates.
Simple syntax (stopped working):
--default-actions Type=forward,TargetGroupArn=arn:aws:...
Required syntax:
--default-actions '[{
"Type": "forward",
"ForwardConfig": {
"TargetGroups": [{
"TargetGroupArn": "arn:aws:...",
"Weight": 100
}]
}
}]'
Lesson: ALB API is stateful. Once you use advanced features, you can't revert to simple syntax. Document this in runbooks.
Issue 5: PostgreSQL Minor Version Retirement
Problem: Terraform apply failed with:
InvalidParameterValue: Cannot find version 15.4 for postgres
Root Cause: AWS retired PostgreSQL 15.4 in favor of 15.12 (latest patch version).
Fix:
# Before
engine_version = "15.4"
# After
engine_version = "15.12"
Better Fix (Production):
# Pin major version, allow minor updates
engine_version = "15"
Lesson: AWS manages minor version lifecycle. Pin major versions intentionally, but expect patch version changes.
Issue 6: Terraform State Lock Timeout
Problem: terraform plan hung for 5 minutes, then failed:
Error acquiring state lock: timeout waiting for lock
Root Cause: DynamoDB lock table had wrong key schema:
# Wrong
AttributeName=id,KeyType=HASH
# Correct
AttributeName=LockID,KeyType=HASH
Fix:
# Delete wrong table
aws dynamodb delete-table --table-name terraform-state-lock
# Recreate with correct schema
aws dynamodb create-table \
--table-name terraform-state-lock \
--attribute-definitions AttributeName=LockID,AttributeType=S \
--key-schema AttributeName=LockID,KeyType=HASH \
--billing-mode PAY_PER_REQUEST
Lesson: Terraform's DynamoDB lock table requires exactly LockID as the partition key. Case-sensitive.
Issue 7: ECR Image Pull Failures (Intermittent)
Problem: Some task launches failed with:
CannotPullContainerError: API error: manifest unknown
Root Cause: Task execution role was missing ecr:BatchGetImage permission.
Fix: Attached AWS managed policy:
resource "aws_iam_role_policy_attachment" "ecs_task_execution" {
role = aws_iam_role.ecs_task_execution.name
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
Lesson: Custom IAM policies are error-prone. Use AWS managed policies where possible, then restrict with conditions if needed.
Issue 8: Terraform vs Manual Scaling Conflict
Problem: Terraform tried to update blue service's desired_count while I was manually scaling for testing:
Error: concurrent modification detected
Service is being modified by another operation
Fix: Added lifecycle rule to ignore runtime changes:
resource "aws_ecs_service" "app" {
# ... other config
lifecycle {
ignore_changes = [desired_count]
}
}
Lesson: When testing blue-green manually, let Terraform manage infrastructure but ignore runtime scaling changes. Use ignore_changes selectively.
What I'd Change in Production
| This Project | Production Standard | Reason | Cost Impact |
|---|---|---|---|
| ECS in public subnets | Private subnets + NAT Gateway | Defense-in-depth, reduced attack surface | +$33/month |
| Single-AZ RDS | Multi-AZ RDS | 99.95% SLA vs 99.5%, automatic failover | +$15/month |
| Manual ALB switch | AWS CodeDeploy blue-green | Automated rollback based on CloudWatch alarms | $0 (free) |
| No autoscaling | ECS Service Auto Scaling | Handle traffic spikes, reduce idle costs | Variable |
| SSM Parameter Store | AWS Secrets Manager | Automatic rotation, better audit | +$0.40/secret |
| No read replicas | RDS read replica | Offload read traffic from primary | +$15/month |
| 7-day log retention | 30-90 day retention | Compliance, longer incident investigation | +$2/month |
Total Production Cost: ~$80-100/month
This Project Cost: $0.12 for validation
Cost Savings Breakdown:
- NAT Gateway: $33 saved
- Multi-AZ RDS: $15 saved
- Secrets Manager: $0.40 saved
- Read replica: $15 saved
- Total: $63.40/month saved
Cost Breakdown (Actual)
4-Hour Validation Session
| Service | Hourly Rate | Hours | Cost |
|---|---|---|---|
| ALB | $0.0225 | 3 | $0.0675 |
| ECS Fargate (2 tasks × 0.25 vCPU) | Free tier | 3 | $0 |
| RDS db.t3.micro (single-AZ) | Free tier | 7 | $0 |
| Route 53 Hosted Zone | $0.50/month prorated | 3 days | $0.05 |
| S3 State Bucket | <$0.01 | — | $0 |
| CloudWatch Logs | Free tier (5GB) | — | $0 |
| Data Transfer | Free tier (100GB) | <1GB | $0 |
| Total | | | $0.12 |
Monthly Cost If Kept Running
| Service | Monthly Cost | Notes |
|---|---|---|
| ALB | $16.20 | 720 hours × $0.0225 |
| ECS Fargate | $0 | Within 400 vCPU-hour free tier |
| RDS db.t3.micro | $0 | Within 750-hour free tier |
| Route 53 | $0.50 | Hosted zone |
| Storage | <$1 | S3, CloudWatch |
| Total | ~$17/month | |
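The arithmetic behind the two tables, double-checked in a few lines. Rates are the ones quoted above; ALB LCU charges and tax are ignored, so treat these as the same approximations the tables use:

```python
ALB_HOURLY = 0.0225       # per-hour ALB rate from the tables above
HOURS_PER_MONTH = 720

# Monthly table: 720 hours x $0.0225 = $16.20
alb_monthly = ALB_HOURLY * HOURS_PER_MONTH
assert abs(alb_monthly - 16.20) < 1e-9

# Session table: ~3 billed ALB hours plus the prorated hosted zone
session = ALB_HOURLY * 3 + 0.05
assert abs(session - 0.1175) < 1e-9   # reported as ~$0.12
```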
Comparison to EKS:
- EKS control plane: $72/month
- ECS Fargate control plane: $0
- Savings: $72/month for equivalent compute
Key Takeaways
What Worked Well
Modular Terraform — Each module had single responsibility, debugging was easier
Blue-green switching — True zero downtime, instant rollback capability
Security group layering — Network isolation without complexity
SSM secrets — No credentials in code, images, or state files
Free tier optimization — Validated production patterns for $0.12
Documentation — 8 real issues documented with root causes and fixes
What I'd Improve
Terragrunt for DRY — Multi-environment deployments without duplication
Automated testing — Pre-deployment health checks in CI/CD
CodeDeploy integration — Production should automate blue-green
Observability — CloudWatch dashboards for latency, errors, saturation
Database migrations — Flyway or Liquibase for schema versioning
Chaos engineering — Terminate random tasks to test resilience
Repository & Evidence
Full source code with detailed documentation:
GitHub: cypher682/ecs-production-platform
What's included:
- Complete Terraform modules (networking, IAM, ALB, ECS, RDS)
- Flask application with Dockerfile and health checks
- GitHub Actions workflow (OIDC design, not tested live)
- Operational runbooks (deployment failure, database connection, rollback)
- Lessons learned documentation (8 issues with root cause analysis)
- Cost analysis (actual vs production projection)
- Evidence files (Terraform outputs, CloudWatch logs, test results)
Documentation structure:
docs/
├── ARCHITECTURE.md # Design decisions and network diagrams
├── 01_IMPLEMENTATION.md # Phase-by-phase build log
├── 02_LESSONS_LEARNED.md # 8 issues with fixes
├── 03_COST_ANALYSIS.md # Detailed cost breakdown
├── 04_SECURITY.md # IAM policies, secrets flow
├── 05_CICD_DESIGN.md # GitHub Actions workflow design
└── runbooks/
├── deployment-failure.md # What to do when deploy fails
├── database-connection.md # Troubleshooting RDS connectivity
└── rollback-procedure.md # Step-by-step rollback guide
Questions or Feedback?
If you're building something similar or have questions about:
- Blue-green deployment patterns
- IAM least privilege design
- AWS cost optimization strategies
- Terraform module architecture
- Debugging ECS task failures
Drop a comment below! I'll respond with specific examples from this build.
This project was built as a portfolio sprint to demonstrate production-ready AWS skills. The platform was deployed, validated with 5 CRUD operations, and destroyed within 4 hours—total cost $0.12. All code and documentation available on GitHub.
#AWS #Terraform #DevOps #ECS #InfrastructureAsCode #BlueGreenDeployment #CloudEngineering #Docker #PostgreSQL