After responding to hundreds of AWS production incidents, I've learned that textbook solutions rarely match production reality. Here are 10 incidents that taught me how AWS systems actually break and how to fix them fast.
1. HTTP 4XX Alarms: When Your Users Can't Reach You
3 AM wake-up call: CloudWatch alarm firing for elevated 4XX errors. Traffic looked normal, but 30% of requests were getting 403s.
What I thought: API Gateway throttling or IAM issues.
What it actually was: A code deployment changed how we validated JWT tokens. The validation was now rejecting tokens from our mobile app's older version (which 30% of users hadn't updated yet).
The approach:
Check CloudWatch Insights for specific 4XX types (400, 403, 404)
Correlate with recent deployments using AWS Systems Manager
Examine API Gateway execution logs for rejection patterns
The fix:
Quick triage query in CloudWatch Logs Insights:
```
fields @timestamp, @message, statusCode, userAgent
| filter statusCode >= 400 and statusCode < 500
| stats count(*) as requestCount by statusCode, userAgent
| sort requestCount desc
```
Fast action: Rolled back the deployment, added backward compatibility for token validation, and set up monitoring for version distribution.
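For illustration, backward-compatible validation can be as simple as accepting claims from both client generations until the old app is retired. A minimal sketch assuming PyJWT; the audience values and key handling are placeholders, not our actual implementation:
```python
import jwt  # PyJWT

# Hypothetical values: accept tokens from the new and the legacy mobile client.
ACCEPTED_AUDIENCES = ["mobile-app-v2", "mobile-app-v1"]

def validate_token(token, public_key):
    last_error = None
    for audience in ACCEPTED_AUDIENCES:
        try:
            return jwt.decode(token, public_key, algorithms=["RS256"], audience=audience)
        except jwt.InvalidTokenError as exc:
            last_error = exc
    raise last_error
```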
Lesson learned: 4XX errors are user-facing problems. Always correlate them with deployment times and check for breaking changes in validation logic.
2. HTTP 5XX Alarms: The System Is Breaking
The scenario: 5XX errors spiking during peak traffic. Load balancer health checks passing, but 15% of requests failing.
What I thought: Backend service overwhelmed.
What it actually was: Lambda functions timing out because of cold starts during a traffic spike, returning 504 Gateway Timeout through API Gateway.
The approach:
Distinguish between different 5XX codes (500, 502, 503, 504)
Check ELB/ALB target health in real-time
Examine Lambda concurrent executions and duration
The fix:
Added provisioned concurrency for critical Lambda functions
```bash
aws lambda put-provisioned-concurrency-config \
  --function-name critical-api-handler \
  --provisioned-concurrent-executions 10 \
  --qualifier PROD
```
Implemented retry with exponential backoff for clients calling through API Gateway
Fast action: Enabled Lambda provisioned concurrency for traffic-sensitive functions and added CloudWatch alarms for concurrent execution approaching limits.
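The concurrency alarm takes a few lines of boto3; this is a sketch with an illustrative threshold and function name, so adjust both to your own limits:
```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Illustrative alarm: fire when concurrent executions for one function
# approach its provisioned capacity (threshold of 8 against the 10 above).
cloudwatch.put_metric_alarm(
    AlarmName="critical-api-handler-concurrency-high",
    Namespace="AWS/Lambda",
    MetricName="ConcurrentExecutions",
    Dimensions=[{"Name": "FunctionName", "Value": "critical-api-handler"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=8,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
)
```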
Lesson learned: 5XX errors need immediate action. Set up separate alarms for 502 (bad gateway), 503 (service unavailable), and 504 (timeout)—each tells a different story.
3. Route53 Health Check Failures: DNS Thinks You're Dead
The incident: Route53 failover triggered automatically at 2 PM, routing all traffic to our secondary region, which wasn't ready for full load.
What I thought: Primary region having issues.
What it actually was: Security group change blocked Route53 health check endpoint. Service was healthy, but Route53 couldn't verify it.
The approach:
Verify health check endpoint is accessible from Route53 IP ranges
Check security groups and NACLs
Test health check URL manually from different regions
The fix:
Whitelist the Route53 health checker IPs in the security group. Route53 publishes its IP ranges at https://ip-ranges.amazonaws.com/ip-ranges.json (automated in the sketch below).
Quick health check test:
```bash
curl -v https://api.example.com/health \
  -H "User-Agent: Route53-Health-Check"
```
Fast action: Added Route53 health checker IPs to security group, implemented internal health checks that validate both endpoint accessibility and actual service health.
Lesson learned: Route53 health checks are not the same as your service being healthy. Ensure your health check endpoint tells the full story—database connectivity, downstream dependencies, not just "service is running."
4. Database Connection Pool Exhaustion: The Silent Killer
The scenario: Application logs showing "connection pool exhausted" errors. RDS metrics looked fine—CPU at 20%, connections well below max.
What I thought: Need to increase RDS max_connections.
What it actually was: Application wasn't releasing connections properly after exceptions. Connection pool filled up with zombie connections.
The approach:
Check RDS DatabaseConnections metric vs your pool size
Examine application connection acquisition/release patterns
Look for long-running queries holding connections
The fix:
Implemented proper connection management
```python
from contextlib import contextmanager

@contextmanager
def get_db_connection():
    # connection_pool is the application's existing pool instance
    conn = connection_pool.get_connection()
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()  # Critical: always release the connection back to the pool
```
Added connection pool monitoring:
```python
import boto3

cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_data(
    Namespace='CustomApp/Database',
    MetricData=[{
        'MetricName': 'ConnectionPoolUtilization',
        'Value': pool.active_connections / pool.max_size * 100
    }]
)
```
Fast action: Implemented connection timeout, added circuit breakers, and created CloudWatch dashboard tracking connection pool health.
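The circuit breaker piece doesn't need a framework; a minimal illustrative sketch (not our exact implementation) that fails fast after repeated database errors:
```python
import time

class CircuitBreaker:
    """Minimal sketch: open after N consecutive failures, then allow a
    trial call through once a cooldown has passed."""

    def __init__(self, max_failures=5, reset_timeout=30):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result
```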
Lesson learned: Database connection pools need aggressive monitoring. Set alarms at 70% utilization, not 95%. By then, it's too late.
5. API Rate Limits: When AWS Says "Slow Down"
The incident: Lambda functions failing with "Rate exceeded" errors during a batch job. Processing completely stopped.
What I thought: Hit AWS service limits.
What it actually was: Batch job making 10,000 concurrent DynamoDB writes with no backoff strategy. Hit write capacity limits within seconds.
The approach:
Identify which AWS API is rate limiting (check error messages)
Check Service Quotas dashboard for current limits
Implement exponential backoff with jitter
The fix:
```python
import time
import random
from botocore.exceptions import ClientError

def exponential_backoff_retry(func, max_retries=5):
    for attempt in range(max_retries):
        try:
            return func()
        except ClientError as e:
            if e.response['Error']['Code'] in ['ThrottlingException', 'TooManyRequestsException']:
                if attempt == max_retries - 1:
                    raise
                # Exponential backoff with jitter
                sleep_time = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(sleep_time)
            else:
                raise
```
Or use the AWS SDK's built-in retry support:
```python
import boto3
from botocore.config import Config

config = Config(
    retries={
        'max_attempts': 10,
        'mode': 'adaptive'
    }
)
dynamodb = boto3.client('dynamodb', config=config)
```
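Usage of the retry helper might look like this; the table name and item are placeholders:
```python
import boto3

dynamodb = boto3.client('dynamodb')

# Hypothetical usage: wrap a single throttling-prone write.
exponential_backoff_retry(
    lambda: dynamodb.put_item(
        TableName='reports',                      # placeholder table
        Item={'pk': {'S': 'report#2024-01-01'}}   # placeholder item
    )
)
```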
Fast action: Implemented rate limiting on application side, added CloudWatch metrics for throttled requests, and requested limit increases where justified.
Lesson learned: Don't fight AWS rate limits—work with them. Build backoff into your code from day one, not after the incident.
6. Unhealthy Target Instances: The Load Balancer Lottery
The scenario: ALB sporadically marking healthy instances as unhealthy. Some requests succeeded, others got 502 errors.
What I thought: Instances actually becoming unhealthy under load.
What it actually was: Health check interval too aggressive (5 seconds) with tight timeout (2 seconds). During brief CPU spikes, instances couldn't respond in time and got marked unhealthy.
The approach:
Review target group health check settings
Check instance metrics during health check failures
Examine health check response times
The fix:
Adjusted health check to be more forgiving
```bash
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:... \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 5 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3
```
Made the health check endpoint lightweight (see the sketch below):
Don't: a health check that queries the database
Do: a health check that just verifies the process is alive
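As a sketch (assuming Flask; adapt to whatever framework you run), the routing health check stays trivial while the deep check lives on a separate path:
```python
from flask import Flask, jsonify

app = Flask(__name__)

# Lightweight liveness check for the ALB: no database or downstream calls.
@app.route("/health")
def health():
    return jsonify(status="ok"), 200

# Deeper check for monitoring, kept off the load balancer's routing path.
@app.route("/health/deep")
def deep_health():
    # e.g. check database connectivity and downstream dependencies here
    return jsonify(status="ok"), 200
```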
Fast action: Separated deep health checks (for monitoring) from load balancer health checks (for routing). ALB health checks should be fast and simple.
Lesson learned: Aggressive health checks cause more problems than they solve. Balance between catching real failures and avoiding false positives.
7. Lambda Cold Starts: The Hidden Latency Tax
The incident: P99 latency for API calls spiking to 8 seconds during low traffic periods, while P50 stayed at 200ms.
What I thought: Backend database performance issue.
What it actually was: Lambda cold starts. Functions were shutting down during quiet periods, causing massive latency when the next request arrived.
The approach:
Check Lambda Duration metrics and look for bimodal distribution
Examine Init Duration in CloudWatch Logs Insights
Calculate cold start frequency
The fix:
CloudWatch Insights query to identify cold starts
```
fields @timestamp, @duration, @initDuration
| filter @type = "REPORT"
| stats
    avg(@duration) as avg_duration,
    avg(@initDuration) as avg_cold_start,
    count(@initDuration) as cold_start_count,
    count(*) as total_invocations
| limit 20
```
Solutions applied:
- Provisioned concurrency for critical paths
- Keep functions warm with EventBridge schedule
- Optimize cold start time (smaller deployment package)
Fast action: Implemented provisioned concurrency for user-facing APIs, scheduled pings to keep functions warm, and reduced deployment package size by 60%.
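If you go the warm-up route, the schedule is a couple of EventBridge calls; a boto3 sketch with placeholder names and ARN (provisioned concurrency remains the cleaner option for truly latency-sensitive paths):
```python
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Illustrative warm-up schedule; rule name, function name, and ARN are placeholders.
rule = events.put_rule(
    Name="keep-critical-api-warm",
    ScheduleExpression="rate(5 minutes)",
    State="ENABLED",
)

# Allow EventBridge to invoke the function.
lambda_client.add_permission(
    FunctionName="critical-api-handler",
    StatementId="allow-eventbridge-warmup",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)

events.put_targets(
    Rule="keep-critical-api-warm",
    Targets=[{
        "Id": "warmup",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:critical-api-handler",
        "Input": '{"warmup": true}',  # let the handler short-circuit warm-up pings
    }],
)
```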
Lesson learned: Cold starts are inevitable with Lambda. Design around them—use provisioned concurrency for latency-sensitive operations, or accept the trade-off for batch jobs.
8. DynamoDB Throttling: When NoSQL Says No
The incident: Writes succeeding, but reads failing with ProvisionedThroughputExceededException during daily report generation.
What I thought: Need to increase read capacity units.
What it actually was: The report query used a Scan operation without pagination, creating a hot partition that consumed all read capacity within seconds.
The approach:
Check DynamoDB metrics: ConsumedReadCapacity, ThrottledRequests
Identify access patterns causing hot partitions
Review query patterns (Scan vs Query)
The fix:
```python
# Before: Scan without pagination (disaster)
response = table.scan()
items = response['Items']

# After: Query with pagination and exponential backoff
def query_with_pagination(table, key_condition):
    items = []
    last_evaluated_key = None
    while True:
        if last_evaluated_key:
            response = table.query(
                KeyConditionExpression=key_condition,
                ExclusiveStartKey=last_evaluated_key
            )
        else:
            response = table.query(
                KeyConditionExpression=key_condition
            )
        items.extend(response['Items'])
        last_evaluated_key = response.get('LastEvaluatedKey')
        if not last_evaluated_key:
            break
    return items
```
Enable DynamoDB auto scaling:
```bash
aws application-autoscaling register-scalable-target \
  --service-namespace dynamodb \
  --resource-id "table/YourTable" \
  --scalable-dimension "dynamodb:table:ReadCapacityUnits" \
  --min-capacity 5 \
  --max-capacity 100
# Note: you also need a target-tracking scaling policy (put-scaling-policy)
# before auto scaling actually adjusts capacity.
```
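Calling the paginated helper with the resource-level API might look like this; the table and key names are placeholders:
```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical usage: fetch one customer's records instead of scanning the table.
table = boto3.resource('dynamodb').Table('YourTable')
items = query_with_pagination(table, Key('customer_id').eq('CUST#1234'))
```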
Fast action: Converted Scans to Queries where possible, implemented pagination, enabled auto-scaling, and added composite sort keys to enable efficient queries.
Lesson learned: DynamoDB throttling is almost always a design problem, not a capacity problem. Fix your access patterns before throwing money at provisioned capacity.
9. ELB Connection Draining: Killing Requests During Deployment
The incident: 5% of requests failed during every deployment with 502 errors, despite using blue-green deployments.
What I thought: Instances shutting down too quickly.
What it actually was: Connection draining timeout set to 30 seconds, but some API calls took up to 60 seconds. ALB killed connections mid-request.
The approach:
Check ALB access logs for 502s during deployment windows
Review connection draining settings
Measure actual request duration (P99)
The fix:
Increase the connection draining (deregistration delay) timeout:
```bash
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:... \
  --attributes Key=deregistration_delay.timeout_seconds,Value=120
```
Add a deployment health check that waits for active connections to drain before proceeding:
```bash
while [ $(aws elbv2 describe-target-health \
  --target-group-arn $TG_ARN \
  --query "TargetHealthDescriptions[?TargetHealth.State=='draining'] | length(@)" \
  --output text) -gt 0 ]
do
  echo "Waiting for connections to drain..."
  sleep 10
done
```
Fast action: Increased deregistration delay, implemented graceful shutdown in application (stop accepting new requests, finish existing ones), added pre-deployment validation.
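Graceful shutdown mostly means trapping SIGTERM, failing the health check so the ALB stops sending new traffic, and giving in-flight requests time to finish; a rough framework-agnostic sketch with illustrative names:
```python
import signal
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # Flip the flag; the /health endpoint should now return 503 so the ALB
    # deregisters this target while in-flight requests complete.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def wait_for_inflight_requests(active_request_count, timeout=110):
    # active_request_count is a hypothetical callable exposed by your server;
    # give requests up to (deregistration delay minus a margin) to finish.
    deadline = time.time() + timeout
    while active_request_count() > 0 and time.time() < deadline:
        time.sleep(1)
```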
Lesson learned: Connection draining timeout should be longer than your longest request duration. Monitor P99 request latency and set draining timeout accordingly.
10. Security Group Lockout: How I Locked Myself Out of Production
The incident: Deployment script failed mid-way, leaving security groups in an inconsistent state. Couldn't SSH to instances, couldn't roll back.
What I thought: Need to manually fix security groups.
What it actually was: Automation script had no rollback mechanism. Changed security groups in production without testing.
The approach:
Use AWS Systems Manager Session Manager (doesn't need SSH)
Document security group changes before modifying
Always test infrastructure changes in staging
The fix:
Access the instance without SSH using Session Manager:
```bash
aws ssm start-session --target i-1234567890abcdef0
```
Implement security group changes with a backup:
```bash
# 1. Describe and back up the current security group
aws ec2 describe-security-groups \
  --group-ids sg-12345 > security-group-backup.json

# 2. Make the change atomically
aws ec2 authorize-security-group-ingress \
  --group-id sg-12345 \
  --ip-permissions IpProtocol=tcp,FromPort=443,ToPort=443,IpRanges='[{CidrIp=0.0.0.0/0}]'

# 3. Validate the change worked
# 4. Only then remove the old rule
```
Better: manage security groups with CloudFormation, where changes are tracked and rollback is automatic.
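If a one-off scripted change is unavoidable, bake the rollback in from the start; a boto3 sketch with placeholder IDs, not a drop-in tool:
```python
import json
import boto3

ec2 = boto3.client("ec2")

def apply_ingress_with_rollback(group_id, permission, validate):
    """Sketch: back up current rules, apply one new rule, undo it if validation fails."""
    current = ec2.describe_security_groups(GroupIds=[group_id])
    with open(f"{group_id}-backup.json", "w") as f:
        json.dump(current["SecurityGroups"][0]["IpPermissions"], f)

    ec2.authorize_security_group_ingress(GroupId=group_id, IpPermissions=[permission])
    try:
        validate()  # e.g. smoke tests / connectivity checks against the service
    except Exception:
        # Validation failed: remove the new rule instead of leaving a half-applied state.
        ec2.revoke_security_group_ingress(GroupId=group_id, IpPermissions=[permission])
        raise
```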
Fast action: Enabled Systems Manager Session Manager on all instances, started managing security groups through CloudFormation, implemented change approval process.
Lesson learned: Never modify security groups manually in production. One wrong click can lock you out. Use infrastructure as code and Session Manager as a safety net.
Tools That Make This Easier
When incidents happen, speed matters. I built an Incident Helper to automate the repetitive parts of incident response: gathering CloudWatch logs, checking service health, and identifying common AWS issues.
It won't solve incidents for you, but it cuts down the time spent collecting information so you can focus on fixing the actual problem.

The Real Lesson
AWS gives you powerful tools, but they don't come with training wheels. Every service has failure modes you won't discover until 3 AM on a Saturday.
The incidents that teach you the most aren't the catastrophic ones—they're the subtle ones that make you question your assumptions. The 4XX error that reveals a deployment process gap. The throttling error that exposes an architecture flaw.
Document your incidents. Build your runbooks. Test your failovers. Discuss them weekly with your team. The next incident is already scheduled; you just don't know when.

