Most deployment tutorials get you to "it works on my machine" status. Few show you how to build infrastructure that survives database connection failures, instance crashes during deployment, or availability zone outages.
I deployed a production-grade Flask API on AWS with auto-scaling that maintains availability through failures. Here's what separates infrastructure that works in demos from infrastructure that works in production.
What I Built
A three-tier Flask API deployment on AWS with PostgreSQL, designed to survive common failure scenarios.
Live Repository: github.com/escanut/terraform-aws-flask-oidc
Architecture:
Application Load Balancer (Multi-AZ)
↓
Auto Scaling Group (2-4 instances)
↓
RDS PostgreSQL (Multi-AZ with automatic failover)
Key Stats:
- Zero-downtime deployments (50% minimum healthy capacity)
- 6-8 minute deployment from commit to live
- 380-second instance warmup prevents premature failures
- Multi-AZ redundancy across database and networking
- Runtime credential retrieval (no secrets in code or Docker images)
- Monthly cost: $158 (us-east-1)
Health Check Integration: The Critical Path
Health checks are where most auto-scaling setups fail. You need three layers working together:
1. Docker Container Health Check
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s --retries=3 \
  CMD curl -f http://localhost:5000/health || exit 1
The 40-second start period covers the ECR image pull (15-20 seconds) plus application initialization; the curl target assumes the app listens on port 5000.
2. Flask Application Health Check
@app.route('/health')
def health():
    # Execute a real query so the check fails when RDS is unreachable
    try:
        conn = get_db()
        cur = conn.cursor()
        cur.execute('SELECT 1')
        cur.close()
        return jsonify({'status': 'healthy'}), 200
    except Exception:
        return jsonify({'status': 'unhealthy'}), 503
Validates the application can actually reach RDS. An instance passing Docker checks but unable to connect to the database is useless.
3. ALB Health Check
health_check {
  path                = "/health"
  interval            = 10
  healthy_threshold   = 2
  unhealthy_threshold = 3
}
Two consecutive successes (20 seconds) confirm stability; three consecutive failures (30 seconds) remove a flapping instance.
A 30-second deregistration delay lets in-flight requests complete before an instance is pulled from the target group.
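Both settings live on the same Terraform target group resource. A minimal sketch, assuming the port and VPC reference (neither appears in the original config):
resource "aws_lb_target_group" "app" {
  port                 = 5000              # assumes the Flask container listens on 5000
  protocol             = "HTTP"
  vpc_id               = aws_vpc.main.id   # hypothetical VPC reference
  deregistration_delay = 30                # drain window for in-flight requests

  health_check {
    path                = "/health"
    interval            = 10
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }
}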
Instance Warmup: Preventing Cascading Failures
This configuration prevented deployment failures:
instance_refresh {
  strategy = "Rolling"
  preferences {
    min_healthy_percentage = 50
    instance_warmup        = 380
  }
}
Instance warmup: 380 seconds (6+ minutes)
What happens during warmup:
- EC2 instance launches (30-60 seconds)
- User data script: apt updates, Docker install, AWS CLI install, ECR login, image pull (4-5 minutes)
- Docker health checks stabilize (40 seconds)
- ALB health checks pass 2 consecutive times (20 seconds)
What happens if warmup is too short (e.g., 120 seconds):
- Auto Scaling Group marks instance "healthy" before Docker finishes pulling the image
- ASG immediately terminates old instances
- New instances aren't ready for traffic
- ALB routes traffic to instances returning connection refused
- 502 Bad Gateway errors cascade to users
380-second warmup ensures:
- Old instances stay alive until new instances are genuinely ready
- Zero user-facing errors during deployments
- Rolling updates complete without capacity loss
This number was determined empirically by monitoring actual instance launch times, not guessing.
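For reference, the bootstrap work that warmup has to cover lives in the launch template's user data. A rough sketch of that script, with the registry address and image name as placeholders (the real script isn't shown in this post):
resource "aws_launch_template" "app" {
  # AMI, instance type, and IAM instance profile omitted for brevity
  user_data = base64encode(<<-EOT
    #!/bin/bash
    set -e
    apt-get update -y                # apt updates and installs: bulk of the 4-5 minutes
    apt-get install -y docker.io awscli
    aws ecr get-login-password --region us-east-1 |
      docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
    docker run -d -p 5000:5000 --restart unless-stopped \
      123456789012.dkr.ecr.us-east-1.amazonaws.com/flask-api:latest   # hypothetical registry/image
  EOT
  )
}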
Multi-AZ Architecture: Eliminating Single Points of Failure
RDS Multi-AZ Configuration:
resource "aws_db_instance" "rds" {
multi_az = true
instance_class = "db.t4g.micro"
storage_encrypted = true
}
Cost: $24.82/month (vs $12.41 for single-AZ)
What Multi-AZ provides:
- Synchronous replication to standby in different availability zone
- Automatic failover in 1-2 minutes (zero manual intervention)
- Zero data loss during AZ failure
What happens during AZ failure without Multi-AZ:
- Database becomes unreachable
- Application returns 503 errors to all users
- Manual recovery required (10-30+ minutes)
- Potential data loss depending on last backup
AZ-level disruptions happen often enough (AWS has had multiple notable incidents across its regions in recent years) that production systems justify the cost.
Dual NAT Gateways:
resource "aws_nat_gateway" "nat" {
count = 2 # One per AZ
}
Cost: $65.70/month (vs $32.85 for single NAT)
EC2 instances in private subnets use NAT Gateways for internet access (apt updates, Docker installation, security patches). Single NAT Gateway creates single point of failure. If the NAT Gateway's AZ fails, instances in the other AZ can't bootstrap or update.
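Spelled out, the per-AZ wiring looks roughly like this; the EIP, subnet, and route table plumbing is my sketch, as only the count = 2 gateway appears above:
resource "aws_eip" "nat" {
  count  = 2
  domain = "vpc"
}

resource "aws_nat_gateway" "nat" {
  count         = 2
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id   # hypothetical public subnet references
}

# Each AZ's private subnet routes through its own NAT Gateway,
# so losing one AZ never strands the instances in the other.
resource "aws_route_table" "private" {
  count  = 2
  vpc_id = aws_vpc.main.id   # hypothetical VPC reference
}

resource "aws_route" "egress" {
  count                  = 2
  route_table_id         = aws_route_table.private[count.index].id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.nat[count.index].id
}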
Total redundancy cost: about $45/month ($12.41 more for Multi-AZ RDS + $32.85 for the second NAT Gateway)
This is insurance against extended outages, manual recovery procedures, and potential data loss.
Security: Runtime Credentials, Not Hardcoded Secrets
Database credentials are retrieved at runtime from AWS Secrets Manager:
import os, json, boto3

# Retrieved at runtime; never baked into code or images
secret_name = os.getenv('DB_SECRET_NAME')
client = boto3.client('secretsmanager')
secret = json.loads(client.get_secret_value(SecretId=secret_name)['SecretString'])
Where credentials are NOT:
- Application code
- Docker image layers
- Environment variables (visible in process listings)
Where credentials ARE:
- AWS Secrets Manager (encrypted at rest, access logged via CloudTrail)
- EC2 instance memory only (ephemeral, destroyed on termination)
IAM instance profile grants read-only access to one specific secret. Even if an instance is compromised, the attacker can't access other secrets.
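In Terraform terms, that scoping looks something like the following; the role, policy, and secret names are my assumptions:
resource "aws_iam_role_policy" "read_db_secret" {
  name = "read-db-secret"      # hypothetical name
  role = aws_iam_role.ec2.id   # hypothetical instance role

  # Read-only, and only for this one secret ARN
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["secretsmanager:GetSecretValue"]
      Resource = [aws_secretsmanager_secret.db.arn]
    }]
  })
}

resource "aws_iam_instance_profile" "app" {
  name = "flask-api-profile"   # hypothetical name
  role = aws_iam_role.ec2.name
}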
Rolling Deployment Strategy
Configuration:
min_healthy_percentage = 50
Deployment flow:
- GitHub Actions builds Docker image, pushes to ECR (tagged with commit SHA)
- Triggers Auto Scaling Group instance refresh
- ASG replaces 50% of instances at a time
- New instances pull latest image, complete 380-second warmup
- ALB confirms new instances are healthy
- Old instances terminated
- Process repeats for remaining instances
Deployment time: 6-8 minutes
- Docker build/push: 3 minutes
- Rolling update with warmup: 3-5 minutes
Zero downtime: ALB continues routing traffic to healthy instances throughout deployment.
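The repo name suggests GitHub Actions authenticates to AWS via OIDC rather than stored access keys; a minimal sketch of that trust relationship (the role name is an assumption):
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]   # GitHub's published thumbprint
}

resource "aws_iam_role" "ci" {
  name = "github-actions-deploy"   # hypothetical name

  # Only workflows from this repository may assume the role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRoleWithWebIdentity"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Condition = {
        StringLike = {
          "token.actions.githubusercontent.com:sub" = "repo:escanut/terraform-aws-flask-oidc:*"
        }
      }
    }]
  })
}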
Failure Scenarios and Recovery
EC2 instance crash:
- ALB health checks fail after 30 seconds
- ALB stops routing traffic to failed instance
- Auto Scaling Group replaces unhealthy instance
- New instance completes warmup and joins rotation
- Impact: Zero user-facing errors
Availability zone failure:
- Multi-AZ RDS fails over to standby (1-2 minutes)
- EC2 instances in failed AZ marked unhealthy
- Auto Scaling Group launches replacements in the healthy AZ
- NAT Gateway redundancy ensures new instances bootstrap successfully
- Impact: 1-2 minutes degraded performance (reduced capacity)
Docker image pull failure:
- Instance warmup prevents premature traffic routing
- Failed instance never receives traffic
- Auto Scaling Group retries launch
- Impact: Zero (deployment time increases, no user errors)
Key Lessons
Instance warmup must account for real bootstrapping time
I initially set warmup to 120 seconds. Deployments failed because that didn't cover the full bootstrap: apt updates and Docker installation took several minutes, the image pull another 40-60 seconds, and health checks roughly 20 more seconds to stabilize. The 380-second figure came from monitoring actual instance launch times.
Health checks must validate actual functionality
A health check that only verifies "the container is running" is useless. The /health endpoint executes a real database query. If the database is unreachable, the instance is genuinely unhealthy.
Multi-AZ costs are insurance premiums
The extra $45/month for Multi-AZ RDS and dual NAT Gateways prevents manual failover procedures, potential data loss, and extended outages. For production systems, this is a rounding error compared to downtime costs.
Conclusion
Production infrastructure is measured by how it handles failures, not how it performs in demos.
Multi-layer health checks, proper warmup periods, Multi-AZ redundancy, and zero-downtime deployments separate infrastructure that works in tutorials from infrastructure that works when databases fail, instances crash, or entire availability zones go offline.
Repository: github.com/escanut/terraform-aws-flask-oidc
Reach out: linkedin.com/in/victorojeje
I'm seeking remote opportunities in Cloud Engineering, DevOps, and Infrastructure Engineering. If your team needs someone who thinks about failure scenarios and production resilience, let's connect.