Maureen Chebet

Posted on Nov 26

Responding to a Critical Production Incident: A Fintech Case Study with AWS

#incident #aws #fintech

The 2:30 AM Wake-Up Call

Picture this: It's 2:30 AM, and you receive an alert that your payment processing system is experiencing:

45% of payment transactions failing
Database connection pool exhausted
Response times increased from 200ms to 15 seconds (75x degradation!)
Customer complaints flooding social media

The system processes over 1 million transactions daily and handles sensitive financial data. The last deployment was 6 hours ago. This is a critical production incident that requires immediate, systematic response.

In this article, I'll walk through how to handle this scenario using AWS services, covering immediate triage, diagnosis, resolution, and post-incident improvements.

Understanding the Incident

Before diving into solutions, let's understand what we're dealing with:

Scale: 1M+ transactions daily
Impact: Financial losses, customer trust, regulatory compliance
Urgency: Every minute counts
Complexity: Multiple potential root causes (deployment, capacity, database, code)

In fintech, we can't afford ad-hoc fixes. Everything must be systematic, documented, and compliant.

Phase 1: Immediate Response and Triage (First 15 Minutes)

Step 1: Establish Incident War Room

AWS Services Used:

Amazon Chime or Slack (integrated with AWS) for real-time communication
AWS Systems Manager Session Manager for secure access without SSH keys

# Quick access to EC2 instances for diagnosis
aws ssm start-session --target i-1234567890abcdef0

Step 2: Gather Critical Metrics

Amazon CloudWatch is your first line of defense. Immediately check:

# Check database connection pool metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=payment-db \
  --start-time 2024-01-15T02:00:00Z \
  --end-time 2024-01-15T02:30:00Z \
  --period 300 \
  --statistics Average,Maximum
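The raw datapoints are easier to act on once they are compared with the database's connection limit. The following is a minimal sketch of that comparison; the sample datapoints and the limit of 200 are invented for illustration, and `summarize_connections` is a hypothetical helper, not part of any AWS SDK.

```python
# Sketch: summarize DatabaseConnections datapoints against a connection limit.
# The datapoints and the limit of 200 are made-up illustration values.

def summarize_connections(datapoints, max_connections):
    """Return the peak connection count and peak utilization as a percentage."""
    if not datapoints:
        return {"peak": 0, "utilization_pct": 0.0}
    peak = max(dp["Maximum"] for dp in datapoints)
    return {
        "peak": peak,
        "utilization_pct": round(100.0 * peak / max_connections, 1),
    }

sample_datapoints = [
    {"Timestamp": "2024-01-15T02:00:00Z", "Average": 120.0, "Maximum": 150.0},
    {"Timestamp": "2024-01-15T02:05:00Z", "Average": 170.0, "Maximum": 190.0},
    {"Timestamp": "2024-01-15T02:10:00Z", "Average": 195.0, "Maximum": 198.0},
]

summary = summarize_connections(sample_datapoints, max_connections=200)
print(summary)
```

A peak close to 100% of the limit supports the "pool exhausted" alert; a low peak suggests the exhaustion is in the application-side pool rather than on the database.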

Key CloudWatch Dashboards to Check:

RDS Performance Insights - Database connection pool, query performance
Application Load Balancer - Request count, response times, error rates
ECS/EC2 Metrics - CPU, memory, network utilization
Custom Application Metrics - Transaction success rate, error rates

Step 3: Check Recent Changes

AWS Config and CloudTrail help identify what changed:

# Check recent API calls that might have caused the issue
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=ModifyDBInstance \
  --start-time 2024-01-14T20:00:00Z
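The lookup window should start before the last deployment, which was 6 hours before the 2:30 AM alert. A small sketch of that arithmetic follows; the incident date is taken from the example timestamps used in this article, and the 30-minute margin is an assumption.

```python
from datetime import datetime, timedelta, timezone

def change_window(incident_time, lookback_hours, margin_minutes=30):
    """Return (start, end) covering the last deployment plus a safety margin."""
    start = incident_time - timedelta(hours=lookback_hours, minutes=margin_minutes)
    return start, incident_time

# Assumed incident time: 02:30 UTC on the example date used in the commands.
incident = datetime(2024, 1, 15, 2, 30, tzinfo=timezone.utc)
start, end = change_window(incident, lookback_hours=6)

start_arg = start.strftime("%Y-%m-%dT%H:%M:%SZ")
end_arg = end.strftime("%Y-%m-%dT%H:%M:%SZ")
print(start_arg, end_arg)
```

The two strings can be passed as `--start-time` and `--end-time` so the search covers the deployment and everything after it.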

Step 4: Immediate Stabilization Actions

If the database connection pool is exhausted:

Check RDS Performance Insights (Real-time database monitoring):
- Navigate to RDS Console → Performance Insights
- Identify top SQL statements consuming connections
- Look for long-running queries or lock contention
Temporary Connection Pool Increase (with caution):

   -- Check current connection limit
   SHOW max_connections;

   -- In RDS, max_connections is set in the DB parameter group, and changing it requires a reboot
   -- Better: Check for connection leaks in application

Kill Long-Running Queries (only if safe):

   -- Identify problematic queries
   SELECT pid, now() - pg_stat_activity.query_start AS duration, query 
   FROM pg_stat_activity 
   WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';

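   -- Cancel only the running query first (gentler); replace 12345 with a pid from the query above
   SELECT pg_cancel_backend(12345);

   -- Terminate the whole connection (ONLY if it is not a critical transaction)
   SELECT pg_terminate_backend(12345);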

⚠️ Critical Warning: In payment systems, killing connections can cause data integrity issues. Always verify the query is not processing a critical transaction.
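One way to make that verification less error-prone at 2:30 AM is to pre-filter the `pg_stat_activity` rows before a human reviews them. This is a rough sketch only: the keyword list, the table names in it, and the row layout are assumptions, and keyword matching is a crude screen that does not replace manual review.

```python
from datetime import timedelta

# Hypothetical markers for statements that must never be interrupted.
PROTECTED_KEYWORDS = (
    "insert into transactions",
    "update transactions",
    "update ledger",
)

def termination_candidates(rows, min_duration=timedelta(minutes=5)):
    """Return pids of long-running queries that do not look like payment writes."""
    candidates = []
    for row in rows:
        query = row["query"].lower()
        is_long = row["duration"] > min_duration
        is_protected = any(keyword in query for keyword in PROTECTED_KEYWORDS)
        if is_long and not is_protected:
            candidates.append(row["pid"])
    return candidates

# Invented rows shaped like the result of the pg_stat_activity query.
rows = [
    {"pid": 101, "duration": timedelta(minutes=12), "query": "SELECT * FROM report_daily"},
    {"pid": 102, "duration": timedelta(minutes=9), "query": "UPDATE transactions SET status = 'settled'"},
    {"pid": 103, "duration": timedelta(seconds=40), "query": "SELECT 1"},
]

candidates = termination_candidates(rows)
print(candidates)
```

Only the long-running reporting query is proposed; the payment write and the short query are left alone.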

Phase 2: Systematic Diagnosis (15-60 Minutes)

Infrastructure Layer Diagnosis

Amazon CloudWatch Metrics:

# Check EC2/ECS resource utilization
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ClusterName,Value=payment-cluster Name=ServiceName,Value=payment-service \
  --start-time 2024-01-15T02:00:00Z \
  --end-time 2024-01-15T02:30:00Z \
  --period 300 \
  --statistics Average,Maximum

Key Metrics to Check:

CPU utilization > 80%?
Memory utilization > 90%?
Network bandwidth saturated?
Disk I/O wait times high?

Database Layer Deep Dive

Amazon RDS Performance Insights provides real-time analysis:

Top SQL Statements - Which queries are consuming most time?
Wait Events - What is the database waiting on? (I/O, locks, CPU)
Database Load - Average Active Sessions (AAS) vs. capacity
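A common rule of thumb when reading the database load chart is to compare Average Active Sessions with the number of vCPUs on the instance. The sketch below expresses that comparison as a ratio; the figures are illustrative and `db_load_ratio` is a hypothetical helper.

```python
def db_load_ratio(average_active_sessions, vcpus):
    """Ratio of active sessions to vCPUs; above 1.0 means more sessions than vCPUs."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return average_active_sessions / vcpus

# Illustrative values: 24 average active sessions on an 8-vCPU instance.
ratio = db_load_ratio(average_active_sessions=24.0, vcpus=8)
print(f"load ratio: {ratio:.1f}")
```

A ratio well above 1.0 is a prompt to look at the wait events: the sessions may be waiting on locks or I/O rather than CPU.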

RDS Enhanced Monitoring (if enabled):

Provides OS-level metrics at a configurable interval (60 seconds by default, down to 1 second)
Check for I/O bottlenecks, memory pressure

Query Performance Analysis:

-- Check for slow queries
SELECT 
  pid,
  now() - pg_stat_activity.query_start AS duration,
  query,
  state,
  wait_event_type,
  wait_event
FROM pg_stat_activity 
WHERE state != 'idle'
ORDER BY duration DESC;

Application Layer Analysis

AWS X-Ray for distributed tracing:

# If X-Ray is integrated, check traces for slow requests
from aws_xray_sdk.core import xray_recorder

@xray_recorder.capture('process_payment')
def process_payment(transaction):
    # Your payment processing logic
    pass

CloudWatch Logs Insights for log analysis:

# Query application logs for errors
fields @timestamp, @message
| filter @message like /ERROR/
| filter @message like /connection|pool|timeout/
| sort @timestamp desc
| limit 100

Common Issues to Look For:

Connection leaks (connections not being closed)
Memory leaks (gradual memory increase)
Thread pool exhaustion
Cascading failures (circuit breakers tripped)
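Connection leaks are the item on this list most likely to produce an exhausted pool, so it helps to see the pattern in miniature. The toy pool below is not a real driver or pooling library; it only counts checked-out connections to show how an exception path leaks a connection and how a context manager avoids that.

```python
from contextlib import contextmanager

class TinyPool:
    """Toy pool that only counts checked-out connections (illustration only)."""

    def __init__(self, size):
        self.size = size
        self.in_use = 0

    def acquire(self):
        if self.in_use >= self.size:
            raise RuntimeError("connection pool exhausted")
        self.in_use += 1
        return object()  # stand-in for a real connection

    def release(self, conn):
        self.in_use -= 1

    @contextmanager
    def connection(self):
        conn = self.acquire()
        try:
            yield conn
        finally:
            self.release(conn)  # runs even when the body raises

pool = TinyPool(size=2)

def leaky_payment():
    pool.acquire()
    raise ValueError("card declined")  # the connection is never released

def safe_payment():
    with pool.connection():
        raise ValueError("card declined")  # the connection is still released

# Three failing payments through the safe path leave the pool empty.
for _ in range(3):
    try:
        safe_payment()
    except ValueError:
        pass
after_safe = pool.in_use

# A single failing payment through the leaky path holds a connection forever.
try:
    leaky_payment()
except ValueError:
    pass
after_leak = pool.in_use

print(after_safe, after_leak)
```

With a pool of size 2, two leaked connections are enough to make every later request fail, which matches the symptom in this incident.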

Network Layer

VPC Flow Logs (if enabled):

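Check for rejected (REJECT) flows between application and database, which point to security group or network ACL problems
Look at packet and byte counts per flow for unusual traffic patterns (flow logs do not record latency or packet loss)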
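If flow logs are enabled, a quick script can pull out rejected traffic to the database. This sketch assumes the default (version 2) flow log format and PostgreSQL's default port 5432; the sample records, addresses, and interface ID are invented.

```python
# Field order of the default VPC Flow Logs record format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()

def rejected_db_flows(lines, db_port=5432):
    """Return parsed records for REJECTed flows aimed at the database port."""
    rejected = []
    for line in lines:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip malformed or custom-format lines
        record = dict(zip(FIELDS, parts))
        if record["dstport"] == str(db_port) and record["action"] == "REJECT":
            rejected.append(record)
    return rejected

sample = [
    "2 123456789012 eni-0a1b2c3d 10.0.1.15 10.0.2.20 49152 5432 6 10 840 1705284000 1705284060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.16 10.0.2.20 49153 5432 6 3 180 1705284000 1705284060 REJECT OK",
]

flows = rejected_db_flows(sample)
rejected_sources = [flow["srcaddr"] for flow in flows]
print(rejected_sources)
```

Rejected flows from application subnets usually indicate a security group or network ACL change, which can then be cross-checked against CloudTrail.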

CloudWatch Network Metrics:

Check ALB target response times
Verify no network partitions

Phase 3: Resolution Strategy

The resolution depends on the root cause. Here are common scenarios:

Scenario A: Code Issue from Recent Deployment

If the deployment 6 hours ago introduced a bug:

Immediate Rollback using AWS CodeDeploy:

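# Roll back by redeploying the previous task definition.
# appSpecContent is a structure; the AppSpec itself goes in its "content" field as a JSON string.
aws deploy create-deployment \
  --application-name payment-app \
  --deployment-group-name production \
  --revision '{
    "revisionType": "AppSpecContent",
    "appSpecContent": {
      "content": "{\"version\":0.0,\"Resources\":[{\"TargetService\":{\"Type\":\"AWS::ECS::Service\",\"Properties\":{\"TaskDefinition\":\"arn:aws:ecs:region:account:task-definition/previous-version\",\"LoadBalancerInfo\":{\"ContainerName\":\"payment-app\",\"ContainerPort\":8080}}}}]}"
    }
  }'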

Database Migration Rollback (if applicable):
- If deployment included schema changes, check if they're reversible
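- For complex rollbacks, AWS Database Migration Service (DMS) can help copy data to a restored or previous-schema database, but it does not reverse schema migrations by itself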
- Or manually reverse migrations if safe
Validation:
- Monitor CloudWatch metrics for improvement
- Verify transaction success rate returns to normal
- Check database connection pool utilization

Scenario B: Database Performance Issue

If queries are slow due to missing indexes or stale statistics:

Add Missing Indexes (carefully - can lock tables):

-- Analyze query execution plan first (note: EXPLAIN ANALYZE actually runs the query)
EXPLAIN ANALYZE SELECT * FROM transactions WHERE customer_id = '123';

-- If missing index, add it (during low traffic if possible)
CREATE INDEX CONCURRENTLY idx_transactions_customer_id 
ON transactions(customer_id);

Update Query Planner Statistics:

ANALYZE transactions;

Enable RDS Performance Insights (optionally with Amazon DevOps Guru for RDS) to surface problem queries and tuning recommendations

Scenario C: Capacity Issue

If the system simply ran out of capacity:

Auto-Scaling Application Tier:

# Raise the scaling floor and ceiling if auto-scaling is slow
# (to set capacity directly, use: aws ecs update-service --desired-count)
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/payment-cluster/payment-service \
  --min-capacity 10 \
  --max-capacity 50

Scale RDS Instance (vertical scaling):

# Change the RDS instance class (causes a brief outage; shorter with a Multi-AZ failover)
aws rds modify-db-instance \
  --db-instance-identifier payment-db \
  --db-instance-class db.r5.2xlarge \
  --apply-immediately

Enable Read Replicas to offload read queries:

# Create a read replica (if one does not already exist)
aws rds create-db-instance-read-replica \
  --db-instance-identifier payment-db-read-replica \
  --source-db-instance-identifier payment-db

Implement Caching with Amazon ElastiCache:

import redis
import json

# Cache frequently accessed data
cache = redis.Redis(
    host='payment-cache.xxxxx.cache.amazonaws.com',
    port=6379,
    decode_responses=True
)

def get_customer_data(customer_id):
    # Check cache first
    cached = cache.get(f"customer:{customer_id}")
    if cached:
        return json.loads(cached)

    # If not in cache, query database
    data = query_database(customer_id)
    cache.setex(f"customer:{customer_id}", 300, json.dumps(data))
    return data

General Resolution Principles

Fix Root Cause, Not Symptoms
- Don't just increase connection pool size if connections aren't being released
- Identify why connections are stuck
Incremental Changes with Validation
- Make one change at a time
- Validate each change before proceeding
Rollback Capability
- Every change must be reversible
- Use infrastructure as code (CloudFormation/Terraform) for easy rollback
Data Integrity First
- Never compromise transaction atomicity
- Use database transactions for multi-step fixes
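The "one change at a time, validate, keep a rollback" principles can be sketched as a small loop. Everything here is hypothetical: the `state` dictionary stands in for real systems, the change functions simulate their effect on the error rate, and the 5% health threshold is an assumption.

```python
def apply_with_validation(changes, is_healthy):
    """Apply changes one at a time; undo a change if validation fails after it.

    Each change is a (name, apply_fn, rollback_fn) tuple.
    Returns the names of the changes that were kept.
    """
    kept = []
    for name, apply_fn, rollback_fn in changes:
        apply_fn()
        if is_healthy():
            kept.append(name)
        else:
            rollback_fn()
            break  # stop and reassess before trying anything else
    return kept

# Simulated system state; 0.45 mirrors the 45% failure rate in the scenario.
state = {"pool_timeout_s": 30, "error_rate": 0.45}

def lower_timeout():
    state["pool_timeout_s"] = 5
    state["error_rate"] = 0.02  # pretend the fix helped

def restore_timeout():
    state["pool_timeout_s"] = 30

def bad_change():
    state["error_rate"] = 0.60  # pretend this made things worse

def undo_bad_change():
    state["error_rate"] = 0.02

kept = apply_with_validation(
    [
        ("lower_timeout", lower_timeout, restore_timeout),
        ("bad_change", bad_change, undo_bad_change),
    ],
    is_healthy=lambda: state["error_rate"] < 0.05,
)
print(kept)
```

The second change fails validation, is rolled back immediately, and the loop stops so the team can reassess instead of stacking further changes on an unhealthy system.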

Phase 4: Communication Plan

Internal Communication

AWS Services:

Amazon Chime or Slack (with AWS Chatbot integration)
Amazon SNS for automated alerts

Communication Template:

INCIDENT UPDATE - Payment System

Status: ACTIVE
Time: 02:45 AM
Impact: 45% transaction failure rate
Root Cause: [Under investigation / Identified: connection pool exhaustion]
Actions: [Current actions being taken]
ETA: [Conservative estimate]
Next Update: 03:00 AM
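Filling the template from a function keeps updates consistent when several people are posting under pressure. A minimal sketch, with `format_incident_update` as a hypothetical helper and the field values taken from the example above or invented as placeholders:

```python
def format_incident_update(status, time, impact, root_cause, actions, eta, next_update):
    """Render the incident update template as plain text."""
    lines = [
        "INCIDENT UPDATE - Payment System",
        "",
        f"Status: {status}",
        f"Time: {time}",
        f"Impact: {impact}",
        f"Root Cause: {root_cause}",
        f"Actions: {actions}",
        f"ETA: {eta}",
        f"Next Update: {next_update}",
    ]
    return "\n".join(lines)

update = format_incident_update(
    status="ACTIVE",
    time="02:45 AM",
    impact="45% transaction failure rate",
    root_cause="Under investigation",
    actions="Reviewing the last deployment and database connections",
    eta="To be confirmed",
    next_update="03:00 AM",
)
print(update)
```

The resulting string can be pasted into chat or published through whatever notification channel the team already uses.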

External Communication

AWS Services:

Amazon Route 53 Health Checks for status page
CloudWatch Alarms → SNS → Status page API

Status Page Message:

"We're aware of an issue affecting some payment transactions and are working to resolve it. We'll provide updates every 30 minutes. Thank you for your patience."

Automated Status Updates

import boto3

sns = boto3.client('sns')

def send_status_update(message):
    sns.publish(
        TopicArn='arn:aws:sns:region:account:incident-updates',
        Message=message,
        Subject='Payment System Incident Update'
    )

Phase 5: Post-Incident Improvements

1. Enhanced Monitoring and Alerting

CloudWatch Alarms for Proactive Detection:

# Create alarm for database connection pool
aws cloudwatch put-metric-alarm \
  --alarm-name payment-db-connection-pool-warning \
  --alarm-description "Alert when database connections exceed 80 (set the threshold to about 80% of max_connections)" \
  --metric-name DatabaseConnections \
  --namespace AWS/RDS \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=DBInstanceIdentifier,Value=payment-db
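Note that `DatabaseConnections` reports a connection count, not a percentage, so a percentage target has to be converted into a count before it is used as `--threshold`. A small sketch of that conversion, assuming a hypothetical `max_connections` of 500:

```python
def connection_alarm_threshold(max_connections, warn_percent=80):
    """Convert a percentage warning level into a connection count."""
    if not 0 < warn_percent <= 100:
        raise ValueError("warn_percent must be between 1 and 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return max_connections * warn_percent // 100

# 500 is an assumed limit; use the value reported by SHOW max_connections.
threshold = connection_alarm_threshold(max_connections=500)
print(threshold)
```

With a limit of 500, the alarm threshold for an 80% warning would be 400 connections rather than 80.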

CloudWatch Composite Alarms for complex conditions:

{
  "AlarmName": "payment-system-health",
  "AlarmRule": "ALARM(payment-db-connections) OR ALARM(payment-error-rate) OR ALARM(payment-response-time)"
}
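When the list of child alarms grows, the rule expression can be generated rather than typed by hand. The sketch below builds the `AlarmRule` string from alarm names; `build_alarm_rule` is a hypothetical helper, and it wraps each name in double quotes, which is the documented form and a safe choice for names containing hyphens.

```python
def build_alarm_rule(alarm_names, operator="OR"):
    """Join child alarm names into a composite AlarmRule expression."""
    if operator not in ("OR", "AND"):
        raise ValueError("operator must be OR or AND")
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)

rule = build_alarm_rule(
    ["payment-db-connections", "payment-error-rate", "payment-response-time"]
)
print(rule)
```

The child alarms named in the rule must already exist before the composite alarm is created.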

AWS X-Ray Service Map for end-to-end visibility:

Automatically maps service dependencies
Identifies bottlenecks in distributed systems

2. Database Resilience Improvements

RDS Automated Backups:

Ensure point-in-time recovery is enabled
Test restore procedures regularly

RDS Multi-AZ Deployment:

# Enable Multi-AZ for high availability
aws rds modify-db-instance \
  --db-instance-identifier payment-db \
  --multi-az \
  --apply-immediately

Connection Pool Monitoring:

import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client('cloudwatch')

def monitor_connection_pool():
    # Custom metric for application-level connection pool.
    # get_active_connections() is a placeholder for your pool's own counter.
    cloudwatch.put_metric_data(
        Namespace='PaymentApp/ConnectionPool',
        MetricData=[{
            'MetricName': 'ActiveConnections',
            'Value': get_active_connections(),
            'Timestamp': datetime.now(timezone.utc),
            'Unit': 'Count'
        }]
    )

3. Deployment Safety

AWS CodeDeploy with Blue/Green Deployment:

# appspec.yml
version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: <TASK_DEFINITION>
        LoadBalancerInfo:
          ContainerName: "payment-app"
          ContainerPort: 8080
        PlatformVersion: "LATEST"
Hooks:
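  # For ECS deployments, hooks must name Lambda functions, not shell scripts.
  # The function names below are placeholders.
  - BeforeInstall: "ValidateBeforeInstallFunction"
  - AfterInstall: "HealthCheckFunction"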

Canary Deployments with AWS App Mesh:

Gradually route traffic to new version
Automatic rollback on error threshold

Pre-Deployment Database Impact Analysis:

Use AWS DMS to test migrations on a copy of production data
Run EXPLAIN ANALYZE on all new queries

4. Auto-Scaling Configuration

Application Auto-Scaling:

{
  "TargetValue": 70.0,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
  },
  "ScaleInCooldown": 300,
  "ScaleOutCooldown": 60
}
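To reason about what a 70% CPU target implies, a proportional estimate is a useful mental model. This is only an approximation under the assumption that load spreads evenly across tasks; it is not the exact algorithm the scaling service uses. The 10 and 50 bounds are the capacity limits used earlier in this article.

```python
import math

def estimate_desired_count(current_count, current_metric, target_value,
                           min_count, max_count):
    """Rough proportional estimate of the task count needed to reach the target."""
    needed = math.ceil(current_count * current_metric / target_value)
    return max(min_count, min(max_count, needed))

# 10 tasks at 91% average CPU with a 70% target.
desired = estimate_desired_count(
    current_count=10, current_metric=91.0, target_value=70.0,
    min_count=10, max_count=50,
)
print(desired)
```

The short scale-out cooldown (60 seconds) lets capacity follow load quickly, while the longer scale-in cooldown (300 seconds) avoids removing tasks during a brief dip.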

Database Auto-Scaling (Aurora Serverless v2):

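# Create an Aurora Serverless v2 cluster for automatic scaling.
# Data still has to be migrated into it, and the cluster needs at least one
# instance of class db.serverless. The master username is a placeholder.
aws rds create-db-cluster \
  --db-cluster-identifier payment-aurora \
  --engine aurora-postgresql \
  --master-username dbadmin \
  --manage-master-user-password \
  --serverless-v2-scaling-configuration MinCapacity=2,MaxCapacity=16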

5. Incident Response Automation

AWS Systems Manager Automation Documents:

# incident-response-automation.yaml
schemaVersion: '0.3'
description: Automated incident response for connection pool exhaustion
assumeRole: 'arn:aws:iam::account:role/IncidentResponseRole'
mainSteps:
  - name: gatherMetrics
    action: 'aws:executeScript'
    inputs:
      Runtime: python3.8
      Handler: gather_metrics
      Script: |
        import boto3

        def gather_metrics(events, context):
            cloudwatch = boto3.client('cloudwatch')
            # Gather metrics and log to S3
            return {'status': 'collected'}
  - name: scaleApplication
    action: 'aws:executeAwsApi'
    inputs:
      Service: ecs
      Api: UpdateService
      cluster: payment-cluster
      service: payment-service
      desiredCount: 20

AWS Lambda for Automated Responses:

import boto3
import json

def lambda_handler(event, context):
    """
    Automated response to CloudWatch alarm
    """
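    # CloudWatch alarms delivered through SNS wrap the alarm JSON in the SNS message body
    message = json.loads(event['Records'][0]['Sns']['Message'])
    alarm_name = message['AlarmName']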

    if 'connection-pool' in alarm_name.lower():
        # Scale application
        ecs = boto3.client('ecs')
        ecs.update_service(
            cluster='payment-cluster',
            service='payment-service',
            desiredCount=20
        )

        # Send notification
        sns = boto3.client('sns')
        sns.publish(
            TopicArn='arn:aws:sns:region:account:incident-response',
            Message=f'Automated scaling triggered for {alarm_name}'
        )

    return {'statusCode': 200}
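Before automation scales the application in response to a connection alarm, it is worth checking that the extra tasks will not push the database past its connection limit, since every new task brings its own pool. A sketch of that check, with all numbers assumed for illustration (`max_connections` of 500, 20 connections per task, 10 reserved for administration):

```python
def max_safe_task_count(max_connections, pool_size_per_task, reserved_connections=10):
    """Largest task count whose pools fit within the database connection limit."""
    usable = max_connections - reserved_connections
    if usable <= 0 or pool_size_per_task <= 0:
        return 0
    return usable // pool_size_per_task

limit = max_safe_task_count(max_connections=500, pool_size_per_task=20)
fits = 20 <= limit  # would a desired count of 20 tasks fit?
print(limit, fits)
```

If the desired count exceeds this limit, scaling out makes connection exhaustion worse, and the automated response should alert a human instead of adding tasks.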

Key Takeaways

Monitoring is Critical: CloudWatch, X-Ray, and RDS Performance Insights provide the visibility needed for rapid diagnosis
Automation Saves Time: Automated scaling, rollbacks, and incident response reduce MTTR
Systematic Approach: Follow a structured diagnosis methodology rather than random fixes
Document Everything: CloudTrail and CloudWatch Logs provide audit trails for compliance
Learn and Improve: Use post-incident analysis to prevent recurrence

AWS Services Checklist for Incident Response

✅ CloudWatch - Metrics, logs, alarms, dashboards
✅ RDS Performance Insights - Real-time database analysis
✅ X-Ray - Distributed tracing
✅ CloudTrail - Audit log of API calls
✅ Systems Manager - Secure access and automation
✅ CodeDeploy - Safe rollbacks
✅ ElastiCache - Caching to reduce database load
✅ SNS - Incident notifications
✅ Lambda - Automated responses
✅ ECS Auto Scaling - Dynamic capacity adjustment

Conclusion

Handling a critical production incident in a fintech environment requires a balance of speed, thoroughness, and compliance. AWS provides a comprehensive toolkit for monitoring, diagnosis, resolution, and prevention.

The key is to:

Detect early with proactive monitoring
Diagnose systematically using AWS observability tools
Resolve safely with proper rollback capabilities
Communicate transparently with stakeholders
Improve continuously based on learnings

Remember: In fintech, every incident is an opportunity to make the system more resilient. Use AWS services to turn reactive firefighting into proactive system reliability.

Tags: #aws #devops #incident-response #fintech #cloudwatch #rds #production #reliability