Most deployment tutorials get you to "it works on my machine" status. Few show you how to build infrastructure that survives database connection failures, instance crashes during deployment, or availability zone outages.
I deployed a production-grade Flask API on AWS with auto-scaling that maintains availability through failures. Here's what separates infrastructure that works in demos from infrastructure that works in production.
What I Built
A three-tier Flask API deployment on AWS with PostgreSQL, designed to survive common failure scenarios.
Live Repository: github.com/escanut/terraform-aws-flask-oidc
Architecture:
Application Load Balancer (Multi-AZ)
↓
Auto Scaling Group (2-4 instances)
↓
RDS PostgreSQL (Multi-AZ with automatic failover)
Key Stats:
- Zero-downtime deployments (50% minimum healthy capacity)
- 6-8 minute deployment from commit to live
- 380-second instance warmup prevents premature failures
- Multi-AZ redundancy across database and networking
- Runtime credential retrieval (no secrets in code or Docker images)
- Monthly cost: $158 (us-east-1)
Health Check Integration: The Critical Path
Health checks are where most auto-scaling setups fail. You need three layers working together:
1. Docker Container Health Check
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s --retries=3 \
  CMD curl -f http://localhost:5000/health || exit 1
The 40-second start period covers the ECR image pull (15-20 seconds) plus application initialization; the curl target assumes the app listens on port 5000.
2. Flask Application Health Check
@app.route('/health')
def health():
    # Execute a real query so the check fails when RDS is unreachable
    try:
        conn = get_db()
        cur = conn.cursor()
        cur.execute('SELECT 1')
        cur.close()
        return jsonify({'status': 'healthy'}), 200
    except Exception:
        return jsonify({'status': 'unhealthy'}), 503
Validates the application can actually reach RDS. An instance passing Docker checks but unable to connect to the database is useless.
3. ALB Health Check
health_check {
  path                = "/health"
  interval            = 10
  healthy_threshold   = 2
  unhealthy_threshold = 3
}
Two consecutive successes (20 seconds) confirm stability; three consecutive failures (30 seconds) remove a flapping instance.
A 30-second deregistration delay lets in-flight requests complete before an instance is pulled from the target group.
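Both settings live on the same Terraform target group resource. A minimal sketch, assuming the port and VPC reference (neither appears in the original config):
resource "aws_lb_target_group" "app" {
  port                 = 5000              # assumes the Flask container listens on 5000
  protocol             = "HTTP"
  vpc_id               = aws_vpc.main.id   # hypothetical VPC reference
  deregistration_delay = 30                # drain window for in-flight requests

  health_check {
    path                = "/health"
    interval            = 10
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }
}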
Instance Warmup: Preventing Cascading Failures
This configuration prevented deployment failures:
instance_refresh {
  strategy = "Rolling"
  preferences {
    min_healthy_percentage = 50
    instance_warmup        = 380
  }
}
Instance warmup: 380 seconds (6+ minutes)
What happens during warmup:
- EC2 instance launches (30-60 seconds)
- User data script: apt updates, Docker install, AWS CLI install, ECR login, image pull (4-5 minutes)
- Docker health checks stabilize (40 seconds)
- ALB health checks pass 2 consecutive times (20 seconds)
What happens if warmup is too short (e.g., 120 seconds):
- Auto Scaling Group marks instance "healthy" before Docker finishes pulling the image
- ASG immediately terminates old instances
- New instances aren't ready for traffic
- ALB routes traffic to instances returning connection refused
- 502 Bad Gateway errors cascade to users
380-second warmup ensures:
- Old instances stay alive until new instances are genuinely ready
- Zero user-facing errors during deployments
- Rolling updates complete without capacity loss
This number was determined empirically by monitoring actual instance launch times, not guessing.
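For reference, the bootstrap work that warmup has to cover lives in the launch template's user data. A rough sketch of that script, with the registry address and image name as placeholders (the real script isn't shown in this post):
resource "aws_launch_template" "app" {
  # AMI, instance type, and IAM instance profile omitted for brevity
  user_data = base64encode(<<-EOT
    #!/bin/bash
    set -e
    apt-get update -y                # apt updates and installs: bulk of the 4-5 minutes
    apt-get install -y docker.io awscli
    aws ecr get-login-password --region us-east-1 |
      docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
    docker run -d -p 5000:5000 --restart unless-stopped \
      123456789012.dkr.ecr.us-east-1.amazonaws.com/flask-api:latest   # hypothetical registry/image
  EOT
  )
}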
Multi-AZ Architecture: Eliminating Single Points of Failure
RDS Multi-AZ Configuration:
resource "aws_db_instance" "rds" {
multi_az = true
instance_class = "db.t4g.micro"
storage_encrypted = true
}
Cost: $24.82/month (vs $12.41 for single-AZ)
What Multi-AZ provides:
- Synchronous replication to standby in different availability zone
- Automatic failover in 1-2 minutes (zero manual intervention)
- Zero data loss during AZ failure
What happens during AZ failure without Multi-AZ:
- Database becomes unreachable
- Application returns 503 errors to all users
- Manual recovery required (10-30+ minutes)
- Potential data loss depending on last backup
AZ-level disruptions happen often enough (AWS has had multiple notable incidents across its regions in recent years) that production systems justify the cost.
Dual NAT Gateways:
resource "aws_nat_gateway" "nat" {
count = 2 # One per AZ
}
Cost: $65.70/month (vs $32.85 for single NAT)
EC2 instances in private subnets use NAT Gateways for internet access (apt updates, Docker installation, security patches). Single NAT Gateway creates single point of failure. If the NAT Gateway's AZ fails, instances in the other AZ can't bootstrap or update.
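Spelled out, the per-AZ wiring looks roughly like this; the EIP, subnet, and route table plumbing is my sketch, as only the count = 2 gateway appears above:
resource "aws_eip" "nat" {
  count  = 2
  domain = "vpc"
}

resource "aws_nat_gateway" "nat" {
  count         = 2
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id   # hypothetical public subnet references
}

# Each AZ's private subnet routes through its own NAT Gateway,
# so losing one AZ never strands the instances in the other.
resource "aws_route_table" "private" {
  count  = 2
  vpc_id = aws_vpc.main.id   # hypothetical VPC reference
}

resource "aws_route" "egress" {
  count                  = 2
  route_table_id         = aws_route_table.private[count.index].id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.nat[count.index].id
}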
Total redundancy cost: about $45/month ($12.41 more for Multi-AZ RDS + $32.85 for the second NAT Gateway)
This is insurance against extended outages, manual recovery procedures, and potential data loss.
Security: Runtime Credentials, Not Hardcoded Secrets
Database credentials are retrieved at runtime from AWS Secrets Manager:
import os, json, boto3

# Retrieved at runtime; never baked into code or images
secret_name = os.getenv('DB_SECRET_NAME')
client = boto3.client('secretsmanager')
secret = json.loads(client.get_secret_value(SecretId=secret_name)['SecretString'])
Where credentials are NOT:
- Application code
- Docker image layers
- Environment variables (visible in process listings)
Where credentials ARE:
- AWS Secrets Manager (encrypted at rest, access logged via CloudTrail)
- EC2 instance memory only (ephemeral, destroyed on termination)
IAM instance profile grants read-only access to one specific secret. Even if an instance is compromised, the attacker can't access other secrets.
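In Terraform terms, that scoping looks something like the following; the role, policy, and secret names are my assumptions:
resource "aws_iam_role_policy" "read_db_secret" {
  name = "read-db-secret"      # hypothetical name
  role = aws_iam_role.ec2.id   # hypothetical instance role

  # Read-only, and only for this one secret ARN
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["secretsmanager:GetSecretValue"]
      Resource = [aws_secretsmanager_secret.db.arn]
    }]
  })
}

resource "aws_iam_instance_profile" "app" {
  name = "flask-api-profile"   # hypothetical name
  role = aws_iam_role.ec2.name
}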
Rolling Deployment Strategy
Configuration:
min_healthy_percentage = 50
Deployment flow:
- GitHub Actions builds Docker image, pushes to ECR (tagged with commit SHA)
- Triggers Auto Scaling Group instance refresh
- ASG replaces 50% of instances at a time
- New instances pull latest image, complete 380-second warmup
- ALB confirms new instances are healthy
- Old instances terminated
- Process repeats for remaining instances
Deployment time: 6-8 minutes
- Docker build/push: 3 minutes
- Rolling update with warmup: 3-5 minutes
Zero downtime: ALB continues routing traffic to healthy instances throughout deployment.
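The repo name suggests GitHub Actions authenticates to AWS via OIDC rather than stored access keys; a minimal sketch of that trust relationship (the role name is an assumption):
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]   # GitHub's published thumbprint
}

resource "aws_iam_role" "ci" {
  name = "github-actions-deploy"   # hypothetical name

  # Only workflows from this repository may assume the role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRoleWithWebIdentity"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Condition = {
        StringLike = {
          "token.actions.githubusercontent.com:sub" = "repo:escanut/terraform-aws-flask-oidc:*"
        }
      }
    }]
  })
}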
Failure Scenarios and Recovery
EC2 instance crash:
- ALB health checks fail after 30 seconds
- ALB stops routing traffic to failed instance
- Auto Scaling Group replaces unhealthy instance
- New instance completes warmup and joins rotation
- Impact: Zero user-facing errors
Availability zone failure:
- Multi-AZ RDS fails over to standby (1-2 minutes)
- EC2 instances in failed AZ marked unhealthy
- Auto Scaling Group launches replacements in the healthy AZ
- NAT Gateway redundancy ensures new instances bootstrap successfully
- Impact: 1-2 minutes degraded performance (reduced capacity)
Docker image pull failure:
- Instance warmup prevents premature traffic routing
- Failed instance never receives traffic
- Auto Scaling Group retries launch
- Impact: Zero (deployment time increases, no user errors)
Key Lessons
Instance warmup must account for real bootstrapping time
I initially set warmup to 120 seconds. Deployments failed because that didn't cover the full bootstrap: apt updates and Docker installation took several minutes, the image pull another 40-60 seconds, and health checks roughly 20 more seconds to stabilize. The 380-second figure came from monitoring actual instance launch times.
Health checks must validate actual functionality
A health check that only verifies "the container is running" is useless. The /health endpoint executes a real database query. If the database is unreachable, the instance is genuinely unhealthy.
Multi-AZ costs are insurance premiums
The extra $45/month for Multi-AZ RDS and dual NAT Gateways prevents manual failover procedures, potential data loss, and extended outages. For production systems, this is a rounding error compared to downtime costs.
Conclusion
Production infrastructure is measured by how it handles failures, not how it performs in demos.
Multi-layer health checks, proper warmup periods, Multi-AZ redundancy, and zero-downtime deployments separate infrastructure that works in tutorials from infrastructure that works when databases fail, instances crash, or entire availability zones go offline.
Repository: github.com/escanut/terraform-aws-flask-oidc
Reach out: linkedin.com/in/victorojeje
I'm seeking remote opportunities in Cloud Engineering, DevOps, and Infrastructure Engineering. If your team needs someone who thinks about failure scenarios and production resilience, let's connect.