Designing a Production-Grade Blue-Green ECS Platform on AWS with Terraform

Most AWS tutorials stop at "it works."

I wanted to build something closer to what a real engineering team would operate: network isolation, IAM least privilege, blue-green deployments, secrets management, and clean teardown—all defined as code.

This article walks through the architecture, design decisions, tradeoffs, and the 8 real issues I encountered along the way.

GitHub Repository: ecs-production-platform

Total Cost: $0.12 for complete validation (4-hour session)


Table of Contents

  1. What Was Built
  2. Architecture Overview
  3. Security Design
  4. Blue-Green Deployment Mechanics
  5. What Broke (8 Issues)
  6. Production Tradeoffs
  7. Cost Analysis

What Was Built

A production-aligned ECS Fargate platform running a Flask API backed by PostgreSQL:

Networking

  • Custom VPC (10.0.0.0/16) across 2 Availability Zones
  • Public subnets for ALB and ECS tasks
  • Private subnets for RDS (no internet route)

Compute

  • ECS Fargate services (no EC2 instance management)
  • Application Load Balancer with HTTPS (ACM certificate, TLS 1.3)
  • Blue-green target groups for zero-downtime deployments

Data

  • RDS PostgreSQL 15.12 (single-AZ for free tier)
  • Private subnet only, no public endpoint

Security

  • IAM role separation (execution vs task)
  • SSM Parameter Store for secrets (KMS encrypted)
  • Security group layering (internet → ALB → ECS → RDS)

Infrastructure as Code

  • 100% Terraform (modular design)
  • Remote state in S3 with DynamoDB locking
  • Reusable modules (networking, IAM, ALB, ECS, RDS)

36 AWS resources deployed, tested, and cleanly destroyed.


Architecture Overview

┌─────────────────────────────────────────────┐
│              Internet                        │
└──────────────────┬──────────────────────────┘
                   │
         ┌─────────▼─────────┐
         │    Route 53 DNS    │
         └─────────┬──────────┘
                   │
         ┌─────────▼──────────┐
         │  ACM Certificate    │
         │    (TLS 1.3)        │
         └─────────┬───────────┘
                   │
    ┌──────────────▼──────────────┐
    │  Application Load Balancer  │
    │   (Public Subnets)          │
    └──────┬──────────────┬───────┘
           │              │
    ┌──────▼─────┐ ┌─────▼──────┐
    │ Blue TG    │ │ Green TG   │
    │ (Weight:   │ │ (Weight:   │
    │  100%)     │ │   0%)      │
    └──────┬─────┘ └─────┬──────┘
           │              │
    ┌──────▼──────────────▼──────┐
    │   ECS Fargate Services      │
    │   (Public Subnets)          │
    │   • 2 tasks (blue)          │
    │   • 0 tasks (green standby) │
    └──────────────┬──────────────┘
                   │
         ┌─────────▼──────────┐
         │  RDS PostgreSQL     │
         │  (Private Subnet)   │
         │  • No public IP     │
         │  • Port 5432 only   │
         └─────────────────────┘

Key Design Choices:

  • ECS in Public Subnets: Cost optimization—saves $33/month on NAT Gateway
  • Single-AZ RDS: Free tier constraint—production would use Multi-AZ
  • Security Groups: Each layer enforces isolation for the next

Terraform Module Structure

Instead of one monolithic configuration, concerns are separated into focused modules:

terraform/
├── modules/
│   ├── networking/     # VPC, subnets, security groups, DB subnet group
│   ├── iam/            # Task execution role, task role, policies
│   ├── alb/            # Load balancer, target groups, listeners
│   ├── ecs/            # Cluster, services, task definitions
│   ├── rds/            # PostgreSQL instance, parameter group
│   └── cicd/           # GitHub Actions IAM role (design only)
└── environments/
    └── prod/
        ├── main.tf         # Module composition
        ├── variables.tf    # Input variables
        ├── outputs.tf      # Stack outputs
        ├── backend.tf      # S3 + DynamoDB state
        └── versions.tf     # Provider versions

Module Communication:

  • Explicit inputs and outputs only
  • No cross-module resource references
  • No hidden dependencies

Example Module Call:

module "ecs" {
  source = "../../modules/ecs"

  project_name                = var.project_name
  container_image             = var.container_image
  ecs_task_execution_role_arn = module.iam.ecs_task_execution_role_arn
  ecs_task_role_arn           = module.iam.ecs_task_role_arn
  public_subnet_ids           = module.networking.public_subnet_ids
  ecs_security_group_id       = module.networking.ecs_tasks_security_group_id
  target_group_arn            = module.alb.blue_target_group_arn

  # Database connection
  db_host                     = module.rds.db_address
  db_password_ssm_param       = "/ecs-prod/db/password"
}

This keeps the architecture composable and prevents circular dependency hell.


Security Design

1. Network Isolation (Security Groups)

Traffic flows in one direction only:

Internet (0.0.0.0/0)
    ↓ Port 443/80
┌───────────────────┐
│  ALB SG           │
│  sg-0373513dd...  │
└────────┬──────────┘
         ↓ Port 8000 (from ALB SG only)
┌────────────────────┐
│  ECS Tasks SG      │
│  sg-09a3082e31...  │
└────────┬───────────┘
         ↓ Port 5432 (from ECS SG only)
┌────────────────────┐
│  RDS SG            │
│  sg-07a5aae1f9...  │
└────────────────────┘

Security Group Rules:

ALB Security Group:

ingress {
  description = "HTTPS from internet"
  from_port   = 443
  to_port     = 443
  protocol    = "tcp"
  cidr_blocks = ["0.0.0.0/0"]
}

egress {
  description     = "To ECS tasks only"
  from_port       = 8000
  to_port         = 8000
  protocol        = "tcp"
  security_groups = [aws_security_group.ecs_tasks.id]
}

ECS Tasks Security Group:

ingress {
  description     = "From ALB only"
  from_port       = 8000
  to_port         = 8000
  protocol        = "tcp"
  security_groups = [aws_security_group.alb.id]
}

egress {
  description     = "To RDS only"
  from_port       = 5432
  to_port         = 5432
  protocol        = "tcp"
  security_groups = [aws_security_group.rds.id]
}

RDS Security Group:

ingress {
  description     = "PostgreSQL from ECS only"
  from_port       = 5432
  to_port         = 5432
  protocol        = "tcp"
  security_groups = [aws_security_group.ecs_tasks.id]
}

# No egress rules - database doesn't need outbound

Result:

  • No direct internet access to ECS tasks
  • No public database endpoint
  • No bypass paths

2. IAM Role Separation

Two distinct roles prevent privilege escalation:

Task Execution Role (infrastructure operations):

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "ecr:GetAuthorizationToken",
      "ecr:BatchCheckLayerAvailability",
      "ecr:GetDownloadUrlForLayer",
      "ecr:BatchGetImage",
      "logs:CreateLogStream",
      "logs:PutLogEvents",
      "ssm:GetParameters"
    ],
    "Resource": "*"
  }]
}

Task Role (application runtime):

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "ssm:GetParameter",
    "Resource": "arn:aws:ssm:us-east-1:*:parameter/ecs-prod/*"
  }]
}

Why This Matters:

If a container is compromised, the attacker inherits only the task role—not the execution role. They can't:

  • Pull arbitrary images from ECR
  • Write to CloudWatch logs outside their stream
  • Access SSM parameters outside /ecs-prod/*

3. Secrets Management Flow

# 1. Generate password
PASSWORD=$(openssl rand -base64 32 | tr -d "=+/" | cut -c1-25)

# 2. Store in SSM (KMS encrypted)
aws ssm put-parameter \
  --name "/ecs-prod/db/password" \
  --value "$PASSWORD" \
  --type "SecureString"

# 3. Reference in task definition
{
  "name": "DB_PASSWORD_SSM_PARAM",
  "value": "/ecs-prod/db/password"
}

# 4. Application fetches at runtime
import boto3
ssm = boto3.client('ssm')
password = ssm.get_parameter(
    Name=os.environ['DB_PASSWORD_SSM_PARAM'],
    WithDecryption=True
)['Parameter']['Value']

Never in:

  • Source control
  • Docker image
  • Environment variables (plaintext)
  • Terraform plan output (values marked sensitive; note that sensitive values still exist in the raw state file, which is why the state bucket stays private and encrypted)
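
The openssl pipeline in step 1 has a stdlib equivalent in Python, which is handy if the rotation script lives alongside the app. A minimal sketch (function name is mine, not from the repo):

```python
import secrets
import string

def generate_db_password(length: int = 25) -> str:
    """Generate an alphanumeric password, mirroring the
    openssl rand | tr -d "=+/" | cut -c1-25 pipeline above."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))

print(len(generate_db_password()))  # 25
```

The `secrets` module uses the OS CSPRNG, so this is equivalent in strength to the openssl approach.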

Blue-Green Deployment Mechanics

The ALB HTTPS listener can route traffic to two separate target groups with configurable weights.

Initial State

Blue Target Group:  ████████████████████ 100% (2 healthy tasks)
Green Target Group: -------------------- 0%  (0 tasks)

Deployment Process

Step 1: Build New Version

# Update app version in app.py
# Build Docker image
docker build -t 758620460011.dkr.ecr.us-east-1.amazonaws.com/ecs-prod/flask-app:v2 .

# Push to ECR
docker push 758620460011.dkr.ecr.us-east-1.amazonaws.com/ecs-prod/flask-app:v2

Step 2: Deploy to Green

# Scale up green service
aws ecs update-service \
  --cluster ecs-prod-cluster \
  --service ecs-prod-service-green \
  --desired-count 2

# Wait for health checks (90 seconds)
aws ecs wait services-stable \
  --cluster ecs-prod-cluster \
  --services ecs-prod-service-green

Step 3: Switch Traffic (< 1 Second)

# Get listener and target group ARNs
LISTENER_ARN=$(terraform output -raw alb_listener_arn)
GREEN_TG=$(terraform output -raw green_target_group_arn)

# Instant traffic switch
aws elbv2 modify-listener \
  --listener-arn $LISTENER_ARN \
  --default-actions Type=forward,TargetGroupArn=$GREEN_TG

Step 4: Validate and Cleanup

# Monitor green deployment
curl https://app.cipherpol.xyz/health
# {"version":"2.0.0","deployment":"green","status":"healthy"}

# After 15 minutes of monitoring, scale down blue
aws ecs update-service \
  --cluster ecs-prod-cluster \
  --service ecs-prod-service \
  --desired-count 0

Why This Works

No DNS propagation delays — Traffic switches at ALB layer

No container restarts — Only listener weight changes

Instant rollback — Reverse the listener modification

No downtime — ALB handles connection draining
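
The same weighted-forwarding mechanism also supports gradual canary shifts instead of the all-at-once switch used here. A sketch of the weight math (helper name is mine; each pair would be applied as blue/green weights via modify-listener with a ForwardConfig):

```python
def canary_schedule(steps: int = 4) -> list[tuple[int, int]]:
    """Return (blue_weight, green_weight) pairs summing to 100,
    shifting traffic to green in equal increments."""
    return [(100 - round(100 * i / steps), round(100 * i / steps))
            for i in range(steps + 1)]

print(canary_schedule(4))
# [(100, 0), (75, 25), (50, 50), (25, 75), (0, 100)]
```

Between steps you would watch error rates and latency, and roll back by reapplying (100, 0).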

Rollback Command (Same as Deploy):

# Get blue target group
BLUE_TG=$(terraform output -raw blue_target_group_arn)

# Instant rollback
aws elbv2 modify-listener \
  --listener-arn $LISTENER_ARN \
  --default-actions Type=forward,TargetGroupArn=$BLUE_TG

Data Persistence Validation

Both blue and green services connect to the same RDS instance. I validated this explicitly:

# 1. Create items while blue is active
curl -X POST https://app.cipherpol.xyz/items \
  -H "Content-Type: application/json" \
  -d '{"name":"Test Item 1"}'

curl -X POST https://app.cipherpol.xyz/items \
  -H "Content-Type: application/json" \
  -d '{"name":"Test Item 2"}'

# 2. Switch traffic to green
aws elbv2 modify-listener --listener-arn $LISTENER_ARN --default-actions Type=forward,TargetGroupArn=$GREEN_TG

# 3. Verify data persists
curl https://app.cipherpol.xyz/items | jq
{
  "count": 2,
  "items": [
    {"id": 2, "name": "Test Item 2", "created_at": "2026-02-19T10:54:03"},
    {"id": 1, "name": "Test Item 1", "created_at": "2026-02-19T10:53:02"}
  ]
}

# 4. Create new item on green
curl -X POST https://app.cipherpol.xyz/items \
  -H "Content-Type: application/json" \
  -d '{"name":"Created on Green v2.0"}'

# 5. Rollback to blue
aws elbv2 modify-listener --listener-arn $LISTENER_ARN --default-actions Type=forward,TargetGroupArn=$BLUE_TG

# 6. Confirm all items still present
curl https://app.cipherpol.xyz/items | jq '.count'
# 3

Result:

5 items persisted across 3 deployment cycles

The deployment layer is stateless

The database is the single source of truth


What Broke (And What I Learned)

Issue 1: RDS Connection Delay After "Available" Status

Problem: ECS tasks failed health checks immediately after terraform apply completed RDS creation.

Symptoms:

ecs-prod-service: unhealthy targets: 2/2
Task stopped reason: Task failed container health checks

Root Cause:

RDS reports available status when the instance is running, but doesn't accept connections for another 60-90 seconds while:

  • Background processes initialize
  • Performance schema loads
  • Cache warms up

Fix:

Health check grace period (60s) absorbed the delay:

resource "aws_ecs_service" "app" {
  health_check_grace_period_seconds = 60
  # Tasks retry connection until RDS is ready
}

Lesson: AWS resource statuses don't always mean "ready for traffic." Plan for initialization time.
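
On the application side, the same warm-up window can be absorbed by retrying the first database connection with backoff. A generic sketch (names are mine, not from the repo):

```python
import time

def retry(fn, attempts: int = 10, delay: float = 2.0, backoff: float = 1.5):
    """Call fn until it succeeds, sleeping between failures.
    Useful while RDS finishes warming up after reporting 'available'."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay)
            delay *= backoff

# e.g. conn = retry(lambda: psycopg2.connect(**db_params), attempts=20)
```

Combined with the ECS grace period, this means neither the orchestrator nor the app treats a slow-starting dependency as a hard failure.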


Issue 2: Docker HEALTHCHECK Missing Dependency

Problem: 100+ ECS task restarts. ALB target health showed healthy, but ECS kept replacing tasks.

Symptoms:

# ALB target group
Target health: healthy (2/2)

# ECS service events
Unhealthy container: flask-app
Task stopped, starting replacement

Root Cause:

# Original (broken)
HEALTHCHECK CMD curl -f http://localhost:8000/health || exit 1

But curl isn't included in the python:3.11-slim base image, so the health check command itself failed on every run.

Why ALB Passed but Container Failed:

  • ALB health check: HTTP request from outside the container (port 8000)
  • Container health check: Command executed inside the container

Fix:

# Use Python instead of curl
HEALTHCHECK CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=2)"
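
If the one-liner grows, the same check can live in a small module copied into the image. A sketch (file layout and names are mine; exit codes follow Docker HEALTHCHECK semantics, 0 = healthy, 1 = unhealthy):

```python
import urllib.request
import urllib.error

def check_health(url: str = "http://localhost:8000/health",
                 timeout: float = 2.0) -> int:
    """Return 0 if the endpoint answers 200 within the timeout, else 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 0 if resp.status == 200 else 1
    except (urllib.error.URLError, OSError):
        return 1

# In the Dockerfile, assuming this is saved as healthcheck.py:
# HEALTHCHECK CMD python -c "import healthcheck, sys; sys.exit(healthcheck.check_health())"
```

This stays within the stdlib, so no extra packages land in the image just for health checking.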

Lesson:

ALB health checks and container health checks are independent control loops:

  • ALB health: Determines if task receives traffic
  • Container health: Determines if ECS replaces the task

ECS uses container health, not ALB health.


Issue 3: Git Bash Path Conversion on Windows

Problem: SSM parameter /ecs-prod/db/password became C:\Program Files\Git\ecs-prod\db\password

Error:

ParameterNotFound: /C:/Program Files/Git/ecs-prod/db/password

Root Cause: Git Bash on Windows auto-converts Unix-style paths to Windows paths.

Fix:

# Disable path conversion
export MSYS_NO_PATHCONV=1

# Then run AWS CLI
aws ssm get-parameter --name /ecs-prod/db/password --with-decryption

Alternative: Use Windows Command Prompt or PowerShell for AWS CLI commands.

Lesson: Git Bash is great for Unix tools, but AWS CLI needs special handling on Windows.


Issue 4: ALB Listener Syntax Constraint

Problem: After configuring blue-green with weighted routing, subsequent updates using simple syntax failed.

Error:

An error occurred (ValidationError): Cannot use both TargetGroupArn and ForwardConfig in the same action

Root Cause: Once a listener's default action has used ForwardConfig (weighted routing), the ALB API rejects the shorthand TargetGroupArn form and requires the full JSON syntax for all subsequent updates.

Simple syntax (stopped working):

--default-actions Type=forward,TargetGroupArn=arn:aws:...

Required syntax:

--default-actions '[{
  "Type": "forward",
  "ForwardConfig": {
    "TargetGroups": [{
      "TargetGroupArn": "arn:aws:...",
      "Weight": 100
    }]
  }
}]'

Lesson: ALB API is stateful. Once you use advanced features, you can't revert to simple syntax. Document this in runbooks.
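
One way to sidestep the two syntaxes entirely is to have deployment scripts always emit the full-JSON form. A sketch that builds the payload (function name is mine; the structure matches what `--default-actions` expects):

```python
import json

def forward_action(target_group_arn: str, weight: int = 100) -> str:
    """Build the full ForwardConfig default-actions JSON, which the
    ALB API accepts both before and after weighted routing is used."""
    return json.dumps([{
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": target_group_arn, "Weight": weight}
            ]
        },
    }])
```

The output string can be passed directly to `aws elbv2 modify-listener --default-actions` or to the equivalent boto3 call.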


Issue 5: PostgreSQL Minor Version Retirement

Problem: Terraform apply failed with:

InvalidParameterValue: Cannot find version 15.4 for postgres

Root Cause: AWS had retired PostgreSQL 15.4 in favor of 15.12 (the latest available 15.x minor version).

Fix:

# Before
engine_version = "15.4"

# After
engine_version = "15.12"

Better Fix (Production):

# Pin major version, allow minor updates
engine_version = "15"

Lesson: AWS manages the minor version lifecycle for you. Pin major versions intentionally, but expect minor versions to be retired and replaced over time.


Issue 6: Terraform State Lock Timeout

Problem: terraform plan hung for 5 minutes, then failed:

Error acquiring state lock: timeout waiting for lock

Root Cause: DynamoDB lock table had wrong key schema:

# Wrong
AttributeName=id,KeyType=HASH

# Correct
AttributeName=LockID,KeyType=HASH

Fix:

# Delete wrong table
aws dynamodb delete-table --table-name terraform-state-lock

# Recreate with correct schema
aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST

Lesson: Terraform's DynamoDB lock table requires exactly LockID as the partition key. Case-sensitive.
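
If you bootstrap the backend from a script instead of the CLI, the table spec can be kept as data so the key name is impossible to mistype twice. A sketch (to pass to boto3's dynamodb client; the spec mirrors the create-table command above):

```python
LOCK_TABLE_SPEC = {
    "TableName": "terraform-state-lock",
    "AttributeDefinitions": [
        {"AttributeName": "LockID", "AttributeType": "S"}
    ],
    "KeySchema": [
        # Must be exactly "LockID" (case-sensitive) for Terraform locking
        {"AttributeName": "LockID", "KeyType": "HASH"}
    ],
    "BillingMode": "PAY_PER_REQUEST",
}

# boto3.client("dynamodb").create_table(**LOCK_TABLE_SPEC)
```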


Issue 7: ECR Image Pull Failures (Intermittent)

Problem: Some task launches failed with:

CannotPullContainerError: API error: manifest unknown

Root Cause: Task execution role was missing ecr:BatchGetImage permission.

Fix: Attached AWS managed policy:

resource "aws_iam_role_policy_attachment" "ecs_task_execution" {
  role       = aws_iam_role.ecs_task_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

Lesson: Custom IAM policies are error-prone. Use AWS managed policies where possible, then restrict with conditions if needed.


Issue 8: Terraform vs Manual Scaling Conflict

Problem: Terraform tried to update blue service's desired_count while I was manually scaling for testing:

Error: concurrent modification detected
Service is being modified by another operation

Fix: Added lifecycle rule to ignore runtime changes:

resource "aws_ecs_service" "app" {
  # ... other config

  lifecycle {
    ignore_changes = [desired_count]
  }
}

Lesson: When testing blue-green manually, let Terraform manage infrastructure but ignore runtime scaling changes. Use ignore_changes selectively.


What I'd Change in Production

| This Project | Production Standard | Reason | Cost Impact |
|---|---|---|---|
| ECS in public subnets | Private subnets + NAT Gateway | Defense-in-depth, reduced attack surface | +$33/month |
| Single-AZ RDS | Multi-AZ RDS | 99.95% SLA vs 99.5%, automatic failover | +$15/month |
| Manual ALB switch | AWS CodeDeploy blue-green | Automated rollback based on CloudWatch alarms | $0 (free) |
| No autoscaling | ECS Service Auto Scaling | Handle traffic spikes, reduce idle costs | Variable |
| SSM Parameter Store | AWS Secrets Manager | Automatic rotation, better audit | +$0.40/secret |
| No read replicas | RDS read replica | Offload read traffic from primary | +$15/month |
| 7-day log retention | 30-90 day retention | Compliance, longer incident investigation | +$2/month |

Total Production Cost: ~$80-100/month

This Project Cost: $0.12 for validation

Cost Savings Breakdown:

  • NAT Gateway: $33 saved
  • Multi-AZ RDS: $15 saved
  • Secrets Manager: $0.40 saved
  • Read replica: $15 saved
  • Total: $63.40/month saved

Cost Breakdown (Actual)

4-Hour Validation Session

| Service | Hourly Rate | Hours | Cost |
|---|---|---|---|
| ALB | $0.0225 | 3 | $0.0675 |
| ECS Fargate (2 tasks × 0.25 vCPU) | Free tier | 3 | $0 |
| RDS db.t3.micro (single-AZ) | Free tier | 7 | $0 |
| Route 53 Hosted Zone | $0.50/month prorated | 3 days | $0.05 |
| S3 State Bucket | <$0.01 | — | $0 |
| CloudWatch Logs | Free tier (5 GB) | — | $0 |
| Data Transfer | Free tier (100 GB) | <1 GB | $0 |
| **Total** | | | **$0.12** |

Monthly Cost If Kept Running

| Service | Monthly Cost | Notes |
|---|---|---|
| ALB | $16.20 | 720 hours × $0.0225 |
| ECS Fargate | $0 | Within 400 vCPU-hour free tier |
| RDS db.t3.micro | $0 | Within 750-hour free tier |
| Route 53 | $0.50 | Hosted zone |
| Storage | <$1 | S3, CloudWatch |
| **Total** | **~$17/month** | |
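
The recurring line items are simple arithmetic; a quick sanity check of the ALB figure and the monthly total, using the rates quoted above (storage is an upper-bound estimate):

```python
HOURS_PER_MONTH = 720
ALB_HOURLY = 0.0225        # USD/hour, rate quoted above
ROUTE53_ZONE = 0.50        # USD/month, hosted zone
STORAGE_ESTIMATE = 1.00    # USD/month upper bound for S3 + CloudWatch

alb_monthly = HOURS_PER_MONTH * ALB_HOURLY
total = alb_monthly + ROUTE53_ZONE + STORAGE_ESTIMATE
print(f"ALB: ${alb_monthly:.2f}, total: ~${total:.2f}/month")
# ALB: $16.20, total: ~$17.70/month
```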

Comparison to EKS:

  • EKS control plane: $72/month
  • ECS Fargate control plane: $0
  • Savings: $72/month for equivalent compute

Key Takeaways

What Worked Well

Modular Terraform — Each module had single responsibility, debugging was easier

Blue-green switching — True zero downtime, instant rollback capability

Security group layering — Network isolation without complexity

SSM secrets — No credentials in code, images, or state files

Free tier optimization — Validated production patterns for $0.12

Documentation — 8 real issues documented with root causes and fixes

What I'd Improve

Terragrunt for DRY — Multi-environment deployments without duplication

Automated testing — Pre-deployment health checks in CI/CD

CodeDeploy integration — Production should automate blue-green

Observability — CloudWatch dashboards for latency, errors, saturation

Database migrations — Flyway or Liquibase for schema versioning

Chaos engineering — Terminate random tasks to test resilience


Repository & Evidence

Full source code with detailed documentation:

GitHub: cypher682/ecs-production-platform

What's included:

  • Complete Terraform modules (networking, IAM, ALB, ECS, RDS)
  • Flask application with Dockerfile and health checks
  • GitHub Actions workflow (OIDC design, not tested live)
  • Operational runbooks (deployment failure, database connection, rollback)
  • Lessons learned documentation (8 issues with root cause analysis)
  • Cost analysis (actual vs production projection)
  • Evidence files (Terraform outputs, CloudWatch logs, test results)

Documentation structure:

docs/
├── ARCHITECTURE.md            # Design decisions and network diagrams
├── 01_IMPLEMENTATION.md       # Phase-by-phase build log
├── 02_LESSONS_LEARNED.md      # 8 issues with fixes
├── 03_COST_ANALYSIS.md        # Detailed cost breakdown
├── 04_SECURITY.md             # IAM policies, secrets flow
├── 05_CICD_DESIGN.md          # GitHub Actions workflow design
└── runbooks/
    ├── deployment-failure.md  # What to do when deploy fails
    ├── database-connection.md # Troubleshooting RDS connectivity
    └── rollback-procedure.md  # Step-by-step rollback guide

Questions or Feedback?

If you're building something similar or have questions about:

  • Blue-green deployment patterns
  • IAM least privilege design
  • AWS cost optimization strategies
  • Terraform module architecture
  • Debugging ECS task failures

Drop a comment below! I'll respond with specific examples from this build.


This project was built as a portfolio sprint to demonstrate production-ready AWS skills. The platform was deployed, validated with 5 CRUD operations, and destroyed within 4 hours—total cost $0.12. All code and documentation available on GitHub.

#AWS #Terraform #DevOps #ECS #InfrastructureAsCode #BlueGreenDeployment #CloudEngineering #Docker #PostgreSQL
