Designing a Production-Grade Blue-Green ECS Platform on AWS with Terraform

Most AWS tutorials stop at "it works."

I wanted to build something closer to what a real engineering team would operate: network isolation, IAM least privilege, blue-green deployments, secrets management, and clean teardown—all defined as code.

This article walks through the architecture, design decisions, tradeoffs, and the 8 real issues I encountered along the way.

GitHub Repository: ecs-production-platform

Total Cost: $0.12 for complete validation (4-hour session)


Table of Contents

  1. What Was Built
  2. Architecture Overview
  3. Security Design
  4. Blue-Green Deployment Mechanics
  5. What Broke (8 Issues)
  6. Production Tradeoffs
  7. Cost Analysis

What Was Built

A production-aligned ECS Fargate platform running a Flask API backed by PostgreSQL:

Networking

  • Custom VPC (10.0.0.0/16) across 2 Availability Zones
  • Public subnets for ALB and ECS tasks
  • Private subnets for RDS (no internet route)

Compute

  • ECS Fargate services (no EC2 instance management)
  • Application Load Balancer with HTTPS (ACM certificate, TLS 1.3)
  • Blue-green target groups for zero-downtime deployments

Data

  • RDS PostgreSQL 15.12 (single-AZ for free tier)
  • Private subnet only, no public endpoint

Security

  • IAM role separation (execution vs task)
  • SSM Parameter Store for secrets (KMS encrypted)
  • Security group layering (internet → ALB → ECS → RDS)

Infrastructure as Code

  • 100% Terraform (modular design)
  • Remote state in S3 with DynamoDB locking
  • Reusable modules (networking, IAM, ALB, ECS, RDS)

36 AWS resources deployed, tested, and cleanly destroyed.


Architecture Overview

┌─────────────────────────────────────────────┐
│              Internet                        │
└──────────────────┬──────────────────────────┘
                   │
         ┌─────────▼─────────┐
         │    Route 53 DNS    │
         └─────────┬──────────┘
                   │
         ┌─────────▼──────────┐
         │  ACM Certificate    │
         │    (TLS 1.3)        │
         └─────────┬───────────┘
                   │
    ┌──────────────▼──────────────┐
    │  Application Load Balancer  │
    │   (Public Subnets)          │
    └──────┬──────────────┬───────┘
           │              │
    ┌──────▼─────┐ ┌─────▼──────┐
    │ Blue TG    │ │ Green TG   │
    │ (Weight:   │ │ (Weight:   │
    │  100%)     │ │   0%)      │
    └──────┬─────┘ └─────┬──────┘
           │              │
    ┌──────▼──────────────▼──────┐
    │   ECS Fargate Services      │
    │   (Public Subnets)          │
    │   • 2 tasks (blue)          │
    │   • 0 tasks (green standby) │
    └──────────────┬──────────────┘
                   │
         ┌─────────▼──────────┐
         │  RDS PostgreSQL     │
         │  (Private Subnet)   │
         │  • No public IP     │
         │  • Port 5432 only   │
         └─────────────────────┘

Key Design Choices:

  • ECS in Public Subnets: Cost optimization—saves $33/month on NAT Gateway
  • Single-AZ RDS: Free tier constraint—production would use Multi-AZ
  • Security Groups: Each layer enforces isolation for the next

Terraform Module Structure

Instead of one monolithic configuration, concerns are separated into focused modules:

terraform/
├── modules/
│   ├── networking/     # VPC, subnets, security groups, DB subnet group
│   ├── iam/            # Task execution role, task role, policies
│   ├── alb/            # Load balancer, target groups, listeners
│   ├── ecs/            # Cluster, services, task definitions
│   ├── rds/            # PostgreSQL instance, parameter group
│   └── cicd/           # GitHub Actions IAM role (design only)
└── environments/
    └── prod/
        ├── main.tf         # Module composition
        ├── variables.tf    # Input variables
        ├── outputs.tf      # Stack outputs
        ├── backend.tf      # S3 + DynamoDB state
        └── versions.tf     # Provider versions

Module Communication:

  • Explicit inputs and outputs only
  • No cross-module resource references
  • No hidden dependencies

Example Module Call:

module "ecs" {
  source = "../../modules/ecs"

  project_name                = var.project_name
  container_image             = var.container_image
  ecs_task_execution_role_arn = module.iam.ecs_task_execution_role_arn
  ecs_task_role_arn           = module.iam.ecs_task_role_arn
  public_subnet_ids           = module.networking.public_subnet_ids
  ecs_security_group_id       = module.networking.ecs_tasks_security_group_id
  target_group_arn            = module.alb.blue_target_group_arn

  # Database connection
  db_host                     = module.rds.db_address
  db_password_ssm_param       = "/ecs-prod/db/password"
}

This keeps the architecture composable and prevents circular dependency hell.


Security Design

1. Network Isolation (Security Groups)

Traffic flows in one direction only:

Internet (0.0.0.0/0)
    ↓ Port 443/80
┌───────────────────┐
│  ALB SG           │
│  sg-0373513dd...  │
└────────┬──────────┘
         ↓ Port 8000 (from ALB SG only)
┌────────────────────┐
│  ECS Tasks SG      │
│  sg-09a3082e31...  │
└────────┬───────────┘
         ↓ Port 5432 (from ECS SG only)
┌────────────────────┐
│  RDS SG            │
│  sg-07a5aae1f9...  │
└────────────────────┘

Security Group Rules:

ALB Security Group:

ingress {
  description = "HTTPS from internet"
  from_port   = 443
  to_port     = 443
  protocol    = "tcp"
  cidr_blocks = ["0.0.0.0/0"]
}

egress {
  description     = "To ECS tasks only"
  from_port       = 8000
  to_port         = 8000
  protocol        = "tcp"
  security_groups = [aws_security_group.ecs_tasks.id]
}

ECS Tasks Security Group:

ingress {
  description     = "From ALB only"
  from_port       = 8000
  to_port         = 8000
  protocol        = "tcp"
  security_groups = [aws_security_group.alb.id]
}

egress {
  description     = "To RDS only"
  from_port       = 5432
  to_port         = 5432
  protocol        = "tcp"
  security_groups = [aws_security_group.rds.id]
}

RDS Security Group:

ingress {
  description     = "PostgreSQL from ECS only"
  from_port       = 5432
  to_port         = 5432
  protocol        = "tcp"
  security_groups = [aws_security_group.ecs_tasks.id]
}

# No egress rules - database doesn't need outbound

Result:

  • No direct internet access to ECS tasks
  • No public database endpoint
  • No bypass paths

2. IAM Role Separation

Two distinct roles prevent privilege escalation:

Task Execution Role (infrastructure operations):

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "ecr:GetAuthorizationToken",
      "ecr:BatchCheckLayerAvailability",
      "ecr:GetDownloadUrlForLayer",
      "ecr:BatchGetImage",
      "logs:CreateLogStream",
      "logs:PutLogEvents",
      "ssm:GetParameters"
    ],
    "Resource": "*"
  }]
}

Task Role (application runtime):

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "ssm:GetParameter",
    "Resource": "arn:aws:ssm:us-east-1:*:parameter/ecs-prod/*"
  }]
}

Why This Matters:

If a container is compromised, the attacker inherits only the task role—not the execution role. They can't:

  • Pull arbitrary images from ECR
  • Write to CloudWatch logs outside their stream
  • Access SSM parameters outside /ecs-prod/*

3. Secrets Management Flow

# 1. Generate password
PASSWORD=$(openssl rand -base64 32 | tr -d "=+/" | cut -c1-25)

# 2. Store in SSM (KMS encrypted)
aws ssm put-parameter \
  --name "/ecs-prod/db/password" \
  --value "$PASSWORD" \
  --type "SecureString"

# 3. Reference in task definition
{
  "name": "DB_PASSWORD_SSM_PARAM",
  "value": "/ecs-prod/db/password"
}

# 4. Application fetches at runtime
import boto3
ssm = boto3.client('ssm')
password = ssm.get_parameter(
    Name=os.environ['DB_PASSWORD_SSM_PARAM'],
    WithDecryption=True
)['Parameter']['Value']

Never in:

  • Source control
  • Docker image
  • Environment variables (plaintext)
  • Terraform plan output (values marked sensitive; note that sensitive values still exist in the raw state file, which is why the state bucket stays private and encrypted)
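
The openssl pipeline in step 1 has a stdlib equivalent in Python, which is handy if the rotation script lives alongside the app. A minimal sketch (function name is mine, not from the repo):

```python
import secrets
import string

def generate_db_password(length: int = 25) -> str:
    """Generate an alphanumeric password, mirroring the
    openssl rand | tr -d "=+/" | cut -c1-25 pipeline above."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))

print(len(generate_db_password()))  # 25
```

The `secrets` module uses the OS CSPRNG, so this is equivalent in strength to the openssl approach.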

Blue-Green Deployment Mechanics

The ALB HTTPS listener can route traffic to two separate target groups with configurable weights.

Initial State

Blue Target Group:  ████████████████████ 100% (2 healthy tasks)
Green Target Group: -------------------- 0%  (0 tasks)

Deployment Process

Step 1: Build New Version

# Update app version in app.py
# Build Docker image
docker build -t 758620460011.dkr.ecr.us-east-1.amazonaws.com/ecs-prod/flask-app:v2 .

# Push to ECR
docker push 758620460011.dkr.ecr.us-east-1.amazonaws.com/ecs-prod/flask-app:v2

Step 2: Deploy to Green

# Scale up green service
aws ecs update-service \
  --cluster ecs-prod-cluster \
  --service ecs-prod-service-green \
  --desired-count 2

# Wait for health checks (90 seconds)
aws ecs wait services-stable \
  --cluster ecs-prod-cluster \
  --services ecs-prod-service-green

Step 3: Switch Traffic (< 1 Second)

# Get listener and target group ARNs
LISTENER_ARN=$(terraform output -raw alb_listener_arn)
GREEN_TG=$(terraform output -raw green_target_group_arn)

# Instant traffic switch
aws elbv2 modify-listener \
  --listener-arn $LISTENER_ARN \
  --default-actions Type=forward,TargetGroupArn=$GREEN_TG

Step 4: Validate and Cleanup

# Monitor green deployment
curl https://app.cipherpol.xyz/health
# {"version":"2.0.0","deployment":"green","status":"healthy"}

# After 15 minutes of monitoring, scale down blue
aws ecs update-service \
  --cluster ecs-prod-cluster \
  --service ecs-prod-service \
  --desired-count 0

Why This Works

No DNS propagation delays — Traffic switches at ALB layer

No container restarts — Only listener weight changes

Instant rollback — Reverse the listener modification

No downtime — ALB handles connection draining
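
The same weighted-forwarding mechanism also supports gradual canary shifts instead of the all-at-once switch used here. A sketch of the weight math (helper name is mine; each pair would be applied as blue/green weights via modify-listener with a ForwardConfig):

```python
def canary_schedule(steps: int = 4) -> list[tuple[int, int]]:
    """Return (blue_weight, green_weight) pairs summing to 100,
    shifting traffic to green in equal increments."""
    return [(100 - round(100 * i / steps), round(100 * i / steps))
            for i in range(steps + 1)]

print(canary_schedule(4))
# [(100, 0), (75, 25), (50, 50), (25, 75), (0, 100)]
```

Between steps you would watch error rates and latency, and roll back by reapplying (100, 0).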

Rollback Command (Same as Deploy):

# Get blue target group
BLUE_TG=$(terraform output -raw blue_target_group_arn)

# Instant rollback
aws elbv2 modify-listener \
  --listener-arn $LISTENER_ARN \
  --default-actions Type=forward,TargetGroupArn=$BLUE_TG

Data Persistence Validation

Both blue and green services connect to the same RDS instance. I validated this explicitly:

# 1. Create items while blue is active
curl -X POST https://app.cipherpol.xyz/items \
  -H "Content-Type: application/json" \
  -d '{"name":"Test Item 1"}'

curl -X POST https://app.cipherpol.xyz/items \
  -H "Content-Type: application/json" \
  -d '{"name":"Test Item 2"}'

# 2. Switch traffic to green
aws elbv2 modify-listener --listener-arn $LISTENER_ARN --default-actions Type=forward,TargetGroupArn=$GREEN_TG

# 3. Verify data persists
curl https://app.cipherpol.xyz/items | jq
{
  "count": 2,
  "items": [
    {"id": 2, "name": "Test Item 2", "created_at": "2026-02-19T10:54:03"},
    {"id": 1, "name": "Test Item 1", "created_at": "2026-02-19T10:53:02"}
  ]
}

# 4. Create new item on green
curl -X POST https://app.cipherpol.xyz/items \
  -H "Content-Type: application/json" \
  -d '{"name":"Created on Green v2.0"}'

# 5. Rollback to blue
aws elbv2 modify-listener --listener-arn $LISTENER_ARN --default-actions Type=forward,TargetGroupArn=$BLUE_TG

# 6. Confirm all items still present
curl https://app.cipherpol.xyz/items | jq '.count'
# 3

Result:

5 items persisted across 3 deployment cycles

The deployment layer is stateless

The database is the single source of truth


What Broke (And What I Learned)

Issue 1: RDS Connection Delay After "Available" Status

Problem: ECS tasks failed health checks immediately after terraform apply completed RDS creation.

Symptoms:

ecs-prod-service: unhealthy targets: 2/2
Task stopped reason: Task failed container health checks

Root Cause:

RDS reports available status when the instance is running, but doesn't accept connections for another 60-90 seconds while:

  • Background processes initialize
  • Performance schema loads
  • Cache warms up

Fix:

Health check grace period (60s) absorbed the delay:

resource "aws_ecs_service" "app" {
  health_check_grace_period_seconds = 60
  # Tasks retry connection until RDS is ready
}

Lesson: AWS resource statuses don't always mean "ready for traffic." Plan for initialization time.
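
On the application side, the same warm-up window can be absorbed by retrying the first database connection with backoff. A generic sketch (names are mine, not from the repo):

```python
import time

def retry(fn, attempts: int = 10, delay: float = 2.0, backoff: float = 1.5):
    """Call fn until it succeeds, sleeping between failures.
    Useful while RDS finishes warming up after reporting 'available'."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay)
            delay *= backoff

# e.g. conn = retry(lambda: psycopg2.connect(**db_params), attempts=20)
```

Combined with the ECS grace period, this means neither the orchestrator nor the app treats a slow-starting dependency as a hard failure.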


Issue 2: Docker HEALTHCHECK Missing Dependency

Problem: 100+ ECS task restarts. ALB target health showed healthy, but ECS kept replacing tasks.

Symptoms:

# ALB target group
Target health: healthy (2/2)

# ECS service events
Unhealthy container: flask-app
Task stopped, starting replacement

Root Cause:

# Original (broken)
HEALTHCHECK CMD curl -f http://localhost:8000/health || exit 1

But curl isn't included in the python:3.11-slim base image, so the health check command itself failed on every run.

Why ALB Passed but Container Failed:

  • ALB health check: HTTP request from outside the container (port 8000)
  • Container health check: Command executed inside the container

Fix:

# Use Python instead of curl
HEALTHCHECK CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=2)"
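
If the one-liner grows, the same check can live in a small module copied into the image. A sketch (file layout and names are mine; exit codes follow Docker HEALTHCHECK semantics, 0 = healthy, 1 = unhealthy):

```python
import urllib.request
import urllib.error

def check_health(url: str = "http://localhost:8000/health",
                 timeout: float = 2.0) -> int:
    """Return 0 if the endpoint answers 200 within the timeout, else 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 0 if resp.status == 200 else 1
    except (urllib.error.URLError, OSError):
        return 1

# In the Dockerfile, assuming this is saved as healthcheck.py:
# HEALTHCHECK CMD python -c "import healthcheck, sys; sys.exit(healthcheck.check_health())"
```

This stays within the stdlib, so no extra packages land in the image just for health checking.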

Lesson:

ALB health checks and container health checks are independent control loops:

  • ALB health: Determines if task receives traffic
  • Container health: Determines if ECS replaces the task

ECS uses container health, not ALB health.


Issue 3: Git Bash Path Conversion on Windows

Problem: SSM parameter /ecs-prod/db/password became C:\Program Files\Git\ecs-prod\db\password

Error:

ParameterNotFound: /C:/Program Files/Git/ecs-prod/db/password

Root Cause: Git Bash on Windows auto-converts Unix-style paths to Windows paths.

Fix:

# Disable path conversion
export MSYS_NO_PATHCONV=1

# Then run AWS CLI
aws ssm get-parameter --name /ecs-prod/db/password --with-decryption

Alternative: Use Windows Command Prompt or PowerShell for AWS CLI commands.

Lesson: Git Bash is great for Unix tools, but AWS CLI needs special handling on Windows.


Issue 4: ALB Listener Syntax Constraint

Problem: After configuring blue-green with weighted routing, subsequent updates using simple syntax failed.

Error:

An error occurred (ValidationError): Cannot use both TargetGroupArn and ForwardConfig in the same action

Root Cause: Once a listener's default action has used ForwardConfig (weighted routing), the ALB API rejects the shorthand TargetGroupArn form and requires the full JSON syntax for all subsequent updates.

Simple syntax (stopped working):

--default-actions Type=forward,TargetGroupArn=arn:aws:...

Required syntax:

--default-actions '[{
  "Type": "forward",
  "ForwardConfig": {
    "TargetGroups": [{
      "TargetGroupArn": "arn:aws:...",
      "Weight": 100
    }]
  }
}]'

Lesson: ALB API is stateful. Once you use advanced features, you can't revert to simple syntax. Document this in runbooks.
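
One way to sidestep the two syntaxes entirely is to have deployment scripts always emit the full-JSON form. A sketch that builds the payload (function name is mine; the structure matches what `--default-actions` expects):

```python
import json

def forward_action(target_group_arn: str, weight: int = 100) -> str:
    """Build the full ForwardConfig default-actions JSON, which the
    ALB API accepts both before and after weighted routing is used."""
    return json.dumps([{
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": target_group_arn, "Weight": weight}
            ]
        },
    }])
```

The output string can be passed directly to `aws elbv2 modify-listener --default-actions` or to the equivalent boto3 call.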


Issue 5: PostgreSQL Minor Version Retirement

Problem: Terraform apply failed with:

InvalidParameterValue: Cannot find version 15.4 for postgres

Root Cause: AWS had retired PostgreSQL 15.4 in favor of 15.12 (the latest available 15.x minor version).

Fix:

# Before
engine_version = "15.4"

# After
engine_version = "15.12"

Better Fix (Production):

# Pin major version, allow minor updates
engine_version = "15"

Lesson: AWS manages the minor version lifecycle for you. Pin major versions intentionally, but expect minor versions to be retired and replaced over time.


Issue 6: Terraform State Lock Timeout

Problem: terraform plan hung for 5 minutes, then failed:

Error acquiring state lock: timeout waiting for lock

Root Cause: DynamoDB lock table had wrong key schema:

# Wrong
AttributeName=id,KeyType=HASH

# Correct
AttributeName=LockID,KeyType=HASH

Fix:

# Delete wrong table
aws dynamodb delete-table --table-name terraform-state-lock

# Recreate with correct schema
aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST

Lesson: Terraform's DynamoDB lock table requires exactly LockID as the partition key. Case-sensitive.
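
If you bootstrap the backend from a script instead of the CLI, the table spec can be kept as data so the key name is impossible to mistype twice. A sketch (to pass to boto3's dynamodb client; the spec mirrors the create-table command above):

```python
LOCK_TABLE_SPEC = {
    "TableName": "terraform-state-lock",
    "AttributeDefinitions": [
        {"AttributeName": "LockID", "AttributeType": "S"}
    ],
    "KeySchema": [
        # Must be exactly "LockID" (case-sensitive) for Terraform locking
        {"AttributeName": "LockID", "KeyType": "HASH"}
    ],
    "BillingMode": "PAY_PER_REQUEST",
}

# boto3.client("dynamodb").create_table(**LOCK_TABLE_SPEC)
```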


Issue 7: ECR Image Pull Failures (Intermittent)

Problem: Some task launches failed with:

CannotPullContainerError: API error: manifest unknown

Root Cause: Task execution role was missing ecr:BatchGetImage permission.

Fix: Attached AWS managed policy:

resource "aws_iam_role_policy_attachment" "ecs_task_execution" {
  role       = aws_iam_role.ecs_task_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

Lesson: Custom IAM policies are error-prone. Use AWS managed policies where possible, then restrict with conditions if needed.


Issue 8: Terraform vs Manual Scaling Conflict

Problem: Terraform tried to update blue service's desired_count while I was manually scaling for testing:

Error: concurrent modification detected
Service is being modified by another operation

Fix: Added lifecycle rule to ignore runtime changes:

resource "aws_ecs_service" "app" {
  # ... other config

  lifecycle {
    ignore_changes = [desired_count]
  }
}

Lesson: When testing blue-green manually, let Terraform manage infrastructure but ignore runtime scaling changes. Use ignore_changes selectively.


What I'd Change in Production

| This Project | Production Standard | Reason | Cost Impact |
|---|---|---|---|
| ECS in public subnets | Private subnets + NAT Gateway | Defense-in-depth, reduced attack surface | +$33/month |
| Single-AZ RDS | Multi-AZ RDS | 99.95% SLA vs 99.5%, automatic failover | +$15/month |
| Manual ALB switch | AWS CodeDeploy blue-green | Automated rollback based on CloudWatch alarms | $0 (free) |
| No autoscaling | ECS Service Auto Scaling | Handle traffic spikes, reduce idle costs | Variable |
| SSM Parameter Store | AWS Secrets Manager | Automatic rotation, better audit | +$0.40/secret |
| No read replicas | RDS read replica | Offload read traffic from primary | +$15/month |
| 7-day log retention | 30-90 day retention | Compliance, longer incident investigation | +$2/month |

Total Production Cost: ~$80-100/month

This Project Cost: $0.12 for validation

Cost Savings Breakdown:

  • NAT Gateway: $33 saved
  • Multi-AZ RDS: $15 saved
  • Secrets Manager: $0.40 saved
  • Read replica: $15 saved
  • Total: $63.40/month saved

Cost Breakdown (Actual)

4-Hour Validation Session

| Service | Hourly Rate | Hours | Cost |
|---|---|---|---|
| ALB | $0.0225 | 3 | $0.0675 |
| ECS Fargate (2 tasks × 0.25 vCPU) | Free tier | 3 | $0 |
| RDS db.t3.micro (single-AZ) | Free tier | 7 | $0 |
| Route 53 Hosted Zone | $0.50/month prorated | 3 days | $0.05 |
| S3 State Bucket | <$0.01 | — | $0 |
| CloudWatch Logs | Free tier (5 GB) | — | $0 |
| Data Transfer | Free tier (100 GB) | <1 GB | $0 |
| **Total** | | | **$0.12** |

Monthly Cost If Kept Running

| Service | Monthly Cost | Notes |
|---|---|---|
| ALB | $16.20 | 720 hours × $0.0225 |
| ECS Fargate | $0 | Within 400 vCPU-hour free tier |
| RDS db.t3.micro | $0 | Within 750-hour free tier |
| Route 53 | $0.50 | Hosted zone |
| Storage | <$1 | S3, CloudWatch |
| **Total** | **~$17/month** | |
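
The recurring line items are simple arithmetic; a quick sanity check of the ALB figure and the monthly total, using the rates quoted above (storage is an upper-bound estimate):

```python
HOURS_PER_MONTH = 720
ALB_HOURLY = 0.0225        # USD/hour, rate quoted above
ROUTE53_ZONE = 0.50        # USD/month, hosted zone
STORAGE_ESTIMATE = 1.00    # USD/month upper bound for S3 + CloudWatch

alb_monthly = HOURS_PER_MONTH * ALB_HOURLY
total = alb_monthly + ROUTE53_ZONE + STORAGE_ESTIMATE
print(f"ALB: ${alb_monthly:.2f}, total: ~${total:.2f}/month")
# ALB: $16.20, total: ~$17.70/month
```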

Comparison to EKS:

  • EKS control plane: $72/month
  • ECS Fargate control plane: $0
  • Savings: $72/month for equivalent compute

Key Takeaways

What Worked Well

Modular Terraform — Each module had single responsibility, debugging was easier

Blue-green switching — True zero downtime, instant rollback capability

Security group layering — Network isolation without complexity

SSM secrets — No credentials in code, images, or state files

Free tier optimization — Validated production patterns for $0.12

Documentation — 8 real issues documented with root causes and fixes

What I'd Improve

Terragrunt for DRY — Multi-environment deployments without duplication

Automated testing — Pre-deployment health checks in CI/CD

CodeDeploy integration — Production should automate blue-green

Observability — CloudWatch dashboards for latency, errors, saturation

Database migrations — Flyway or Liquibase for schema versioning

Chaos engineering — Terminate random tasks to test resilience


Repository & Evidence

Full source code with detailed documentation:

GitHub: cypher682/ecs-production-platform

What's included:

  • Complete Terraform modules (networking, IAM, ALB, ECS, RDS)
  • Flask application with Dockerfile and health checks
  • GitHub Actions workflow (OIDC design, not tested live)
  • Operational runbooks (deployment failure, database connection, rollback)
  • Lessons learned documentation (8 issues with root cause analysis)
  • Cost analysis (actual vs production projection)
  • Evidence files (Terraform outputs, CloudWatch logs, test results)

Documentation structure:

docs/
├── ARCHITECTURE.md            # Design decisions and network diagrams
├── 01_IMPLEMENTATION.md       # Phase-by-phase build log
├── 02_LESSONS_LEARNED.md      # 8 issues with fixes
├── 03_COST_ANALYSIS.md        # Detailed cost breakdown
├── 04_SECURITY.md             # IAM policies, secrets flow
├── 05_CICD_DESIGN.md          # GitHub Actions workflow design
└── runbooks/
    ├── deployment-failure.md  # What to do when deploy fails
    ├── database-connection.md # Troubleshooting RDS connectivity
    └── rollback-procedure.md  # Step-by-step rollback guide

Questions or Feedback?

If you're building something similar or have questions about:

  • Blue-green deployment patterns
  • IAM least privilege design
  • AWS cost optimization strategies
  • Terraform module architecture
  • Debugging ECS task failures

Drop a comment below! I'll respond with specific examples from this build.


This project was built as a portfolio sprint to demonstrate production-ready AWS skills. The platform was deployed, validated with 5 CRUD operations, and destroyed within 4 hours—total cost $0.12. All code and documentation available on GitHub.

#AWS #Terraform #DevOps #ECS #InfrastructureAsCode #BlueGreenDeployment #CloudEngineering #Docker #PostgreSQL
