The Challenge
Black Friday is coming in 8 weeks, and your payment platform needs to handle a massive traffic spike:
- Normal load: 500 transactions/second
- Expected peak: 5,000 transactions/second (10x increase!)
- Current capacity: Only 1,500 transactions/second
- Budget constraint: Minimize costs and auto-scale down after the event
- Requirement: Zero impact on current service levels during preparation
This isn't just about adding more servers. It requires a comprehensive scaling strategy across the entire stack: application layer, database, caching, networking, and monitoring.
In this article, I'll walk through a complete AWS-based scaling strategy that handles the 10x traffic spike while optimizing costs and maintaining security and compliance.
Scaling Strategy Overview
Multi-Layer Approach
Scaling for a 10x traffic spike requires addressing every layer:
┌─────────────────────────────────────┐
│ Application Load Balancer (ALB) │ ← Distribute traffic
├─────────────────────────────────────┤
│ Application Tier (Auto-Scaling) │ ← Horizontal scaling
├─────────────────────────────────────┤
│ Caching Layer (ElastiCache) │ ← Reduce DB load
├─────────────────────────────────────┤
│ Database Tier (RDS + Replicas) │ ← Read replicas + pooling
└─────────────────────────────────────┘
Key Principles
- Horizontal Scaling: Add more instances, not bigger ones
- Cost Optimization: Use Reserved Instances for baseline, On-Demand for peak
- Predictive Scaling: Pre-scale before traffic arrives (see the policy sketch just after this list)
- Comprehensive Testing: Validate before the event
- Auto-Scale Down: Aggressively reduce capacity after peak
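The scheduled actions shown later in Phase 1 cover the known Black Friday window; EC2 Auto Scaling can also forecast demand from history. Here is a minimal sketch of a predictive scaling policy via boto3, assuming the payment-app-asg group name used throughout this article:

import boto3

autoscaling = boto3.client('autoscaling')

# Predictive scaling: forecast capacity from historical CPU usage and launch
# instances ahead of the forecasted demand.
autoscaling.put_scaling_policy(
    AutoScalingGroupName='payment-app-asg',
    PolicyName='predictive-cpu-70',
    PolicyType='PredictiveScaling',
    PredictiveScalingConfiguration={
        'MetricSpecifications': [{
            'TargetValue': 70.0,
            'PredefinedMetricPairSpecification': {
                'PredefinedMetricType': 'ASGCPUUtilization'
            }
        }],
        'Mode': 'ForecastAndScale',
        'SchedulingBufferTime': 600  # launch ~10 minutes before forecasted need
    }
)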
Phase 1: Application Layer Auto-Scaling
Current State Analysis
# Analyze current capacity
# Current: 1,500 TPS across 10 instances = 150 TPS per instance
# Target: 5,000 TPS plus 20% headroom = 6,000 TPS of capacity
# Required: ~40 instances (4x the current fleet)

# Calculate instance requirements (integer arithmetic: multiply before
# dividing so the 20% headroom isn't truncated to zero)
CURRENT_INSTANCES=10
CURRENT_TPS=1500
TARGET_TPS=5000
HEADROOM_PERCENT=20

TARGET_CAPACITY=$(( TARGET_TPS * (100 + HEADROOM_PERCENT) / 100 ))
TPS_PER_INSTANCE=$(( CURRENT_TPS / CURRENT_INSTANCES ))
INSTANCES_NEEDED=$(( TARGET_CAPACITY / TPS_PER_INSTANCE ))

echo "Target capacity: $TARGET_CAPACITY TPS"   # 6000
echo "Instances needed: $INSTANCES_NEEDED"     # 40
Auto Scaling Group Configuration
CloudFormation Template:
Resources:
  ApplicationAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: 10
      MaxSize: 50
      DesiredCapacity: 10
      VPCZoneIdentifier:
        - !Ref AppSubnet1
        - !Ref AppSubnet2
        - !Ref AppSubnet3
      LaunchTemplate:
        LaunchTemplateId: !Ref ApplicationLaunchTemplate
        Version: !GetAtt ApplicationLaunchTemplate.LatestVersionNumber
      TargetGroupARNs:
        - !Ref ApplicationTargetGroup
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300
      Cooldown: 300
      Tags:
        - Key: Name
          Value: PaymentApp
          PropagateAtLaunch: true
        - Key: Environment
          Value: Production
          PropagateAtLaunch: true

  ScaleUpPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref ApplicationAutoScalingGroup
      PolicyType: TargetTrackingScaling
      EstimatedInstanceWarmup: 60
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ASGAverageCPUUtilization
        TargetValue: 70.0

  # Scheduled action names are taken from the logical IDs
  ScheduledScaleUp:
    Type: AWS::AutoScaling::ScheduledAction
    Properties:
      AutoScalingGroupName: !Ref ApplicationAutoScalingGroup
      DesiredCapacity: 40
      MinSize: 40
      MaxSize: 50
      Recurrence: "0 2 24 11 *"  # Nov 24 at 2 AM UTC (2 hours before peak)

  ScheduledScaleDown:
    Type: AWS::AutoScaling::ScheduledAction
    Properties:
      AutoScalingGroupName: !Ref ApplicationAutoScalingGroup
      DesiredCapacity: 10
      MinSize: 10
      MaxSize: 50
      Recurrence: "0 2 26 11 *"  # Nov 26 at 2 AM UTC (after Black Friday)
Advanced Scaling Policies
Multi-Metric Scaling:
{
  "TargetTrackingScalingPolicies": [
    {
      "TargetValue": 70.0,
      "PredefinedMetricSpecification": {
        "PredefinedMetricType": "ASGAverageCPUUtilization"
      },
      "ScaleInCooldown": 300,
      "ScaleOutCooldown": 60
    },
    {
      "TargetValue": 500.0,
      "PredefinedMetricSpecification": {
        "PredefinedMetricType": "ALBRequestCountPerTarget",
        "ResourceLabel": "app/payment-alb/123456789/targetgroup/payment-tg/123456789"
      },
      "ScaleInCooldown": 300,
      "ScaleOutCooldown": 60
    }
  ]
}
Custom Metric Scaling (Response Time):
import boto3

cloudwatch = boto3.client('cloudwatch')
autoscaling = boto3.client('autoscaling')

def create_response_time_scaling_policy():
    """Create a target tracking policy based on p95 response time"""
    # Publish a sample data point for the custom metric so the policy has
    # data (in production the application emits this metric continuously)
    cloudwatch.put_metric_data(
        Namespace='PaymentApp/Performance',
        MetricData=[{
            'MetricName': 'P95ResponseTime',
            'Value': 0.5,
            'Unit': 'Seconds'
        }]
    )

    # Create scaling policy (EC2 Auto Scaling target tracking does not accept
    # cooldowns inside TargetTrackingConfiguration; use EstimatedInstanceWarmup)
    autoscaling.put_scaling_policy(
        AutoScalingGroupName='payment-app-asg',
        PolicyName='scale-on-response-time',
        PolicyType='TargetTrackingScaling',
        EstimatedInstanceWarmup=60,
        TargetTrackingConfiguration={
            'CustomizedMetricSpecification': {
                'MetricName': 'P95ResponseTime',
                'Namespace': 'PaymentApp/Performance',
                'Statistic': 'Average',
                'Unit': 'Seconds'
            },
            'TargetValue': 0.5  # scale out when p95 exceeds 500ms
        }
    )
ECS Auto-Scaling (If Using Containers)
# Register scalable target
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/payment-cluster/payment-service \
--min-capacity 10 \
--max-capacity 50 \
--role-arn arn:aws:iam::account:role/ecs-autoscaling-role
# Create scaling policy
aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/payment-cluster/payment-service \
--policy-name payment-service-scaling \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 70.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ECSServiceAverageCPUUtilization"
},
"ScaleInCooldown": 300,
"ScaleOutCooldown": 60
}'
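The same pre-event scheduling used for the ASG can be applied to the ECS service's desired count. A sketch via Application Auto Scaling, reusing the cluster and service names from the CLI example above:

import boto3

aas = boto3.client('application-autoscaling')

# Pre-scale the ECS service before the Black Friday peak, mirroring the ASG
# scheduled action (cron fields: minutes hours day-of-month month day-of-week year).
aas.put_scheduled_action(
    ServiceNamespace='ecs',
    ScheduledActionName='black-friday-scale-up',
    ResourceId='service/payment-cluster/payment-service',
    ScalableDimension='ecs:service:DesiredCount',
    Schedule='cron(0 2 24 11 ? *)',  # Nov 24 at 2 AM UTC
    ScalableTargetAction={'MinCapacity': 40, 'MaxCapacity': 50}
)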
Warm Pools for Faster Scaling
# Create a warm pool of pre-initialized instances for faster scale-out
aws autoscaling put-warm-pool \
  --auto-scaling-group-name payment-app-asg \
  --min-size 5 \
  --max-group-prepared-capacity 10 \
  --instance-reuse-policy ReuseOnScaleIn=true
Phase 2: Database Scaling Strategy
Read Replicas Setup
Create RDS Read Replicas:
# Create read replica for Black Friday
aws rds create-db-instance-read-replica \
  --db-instance-identifier payment-db-read-replica-1 \
  --source-db-instance-identifier payment-db-primary \
  --db-instance-class db.r5.4xlarge \
  --no-publicly-accessible \
  --availability-zone us-east-1a

# Create additional replicas in the other AZs
aws rds create-db-instance-read-replica \
  --db-instance-identifier payment-db-read-replica-2 \
  --source-db-instance-identifier payment-db-primary \
  --db-instance-class db.r5.4xlarge \
  --no-publicly-accessible \
  --availability-zone us-east-1b

aws rds create-db-instance-read-replica \
  --db-instance-identifier payment-db-read-replica-3 \
  --source-db-instance-identifier payment-db-primary \
  --db-instance-class db.r5.4xlarge \
  --no-publicly-accessible \
  --availability-zone us-east-1c
Read Replica Auto-Scaling:
import boto3
from datetime import datetime, timedelta

rds = boto3.client('rds')
cloudwatch = boto3.client('cloudwatch')

def scale_read_replicas_based_on_load():
    """Dynamically add/remove read replicas based on primary DB load"""
    # Get current read replica count (the db-instance-id filter does not
    # support wildcards, so filter by name prefix client-side)
    instances = rds.describe_db_instances()['DBInstances']
    replicas = [
        db for db in instances
        if db['DBInstanceIdentifier'].startswith('payment-db-read-replica-')
    ]
    current_replica_count = len(replicas)

    # Get database load metrics for the primary
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/RDS',
        MetricName='CPUUtilization',
        Dimensions=[
            {'Name': 'DBInstanceIdentifier', 'Value': 'payment-db-primary'}
        ],
        StartTime=datetime.utcnow() - timedelta(minutes=5),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=['Average']
    )
    avg_cpu = response['Datapoints'][0]['Average'] if response['Datapoints'] else 0

    # Scale logic
    if avg_cpu > 80 and current_replica_count < 5:
        # Create new replica
        replica_id = f"payment-db-read-replica-{current_replica_count + 1}"
        rds.create_db_instance_read_replica(
            DBInstanceIdentifier=replica_id,
            SourceDBInstanceIdentifier='payment-db-primary',
            DBInstanceClass='db.r5.4xlarge'
        )
        print(f"Created read replica: {replica_id}")
    elif avg_cpu < 40 and current_replica_count > 2:
        # Remove the newest replica (keep at least 2); read replicas cannot
        # take a final snapshot, so skip it
        replica_to_delete = f"payment-db-read-replica-{current_replica_count}"
        rds.delete_db_instance(
            DBInstanceIdentifier=replica_to_delete,
            SkipFinalSnapshot=True
        )
        print(f"Deleted read replica: {replica_to_delete}")
RDS Proxy for Connection Pooling
Create RDS Proxy:
# Create RDS Proxy for connection pooling
aws rds create-db-proxy \
  --db-proxy-name payment-db-proxy \
  --engine-family POSTGRESQL \
  --auth '[{
    "AuthScheme": "SECRETS",
    "SecretArn": "arn:aws:secretsmanager:region:account:secret:payment-db-credentials",
    "IAMAuth": "DISABLED"
  }]' \
  --role-arn arn:aws:iam::account:role/rds-proxy-role \
  --vpc-subnet-ids subnet-123 subnet-456 subnet-789 \
  --vpc-security-group-ids sg-12345678 \
  --require-tls \
  --idle-client-timeout 1800

# Connection pool limits are set on the proxy's default target group
aws rds modify-db-proxy-target-group \
  --db-proxy-name payment-db-proxy \
  --target-group-name default \
  --connection-pool-config MaxConnectionsPercent=100,MaxIdleConnectionsPercent=50

# Register the primary instance as the proxy target. An RDS (non-Aurora)
# proxy fronts a single instance, so the read replicas are reached directly
# or through their own proxies.
aws rds register-db-proxy-targets \
  --db-proxy-name payment-db-proxy \
  --db-instance-identifiers payment-db-primary
Application Connection Using RDS Proxy:
import json

import boto3
import psycopg2

def get_db_connection():
    """Get a database connection through RDS Proxy"""
    # RDS Proxy endpoint (the proxy pools and multiplexes connections;
    # it does not split reads and writes for RDS PostgreSQL)
    proxy_endpoint = "payment-db-proxy.proxy-xxxxx.us-east-1.rds.amazonaws.com"

    # Get credentials from Secrets Manager
    secrets_client = boto3.client('secretsmanager')
    secret = secrets_client.get_secret_value(SecretId='payment-db-credentials')
    credentials = json.loads(secret['SecretString'])

    # Connect through the proxy
    connection = psycopg2.connect(
        host=proxy_endpoint,
        port=5432,
        database=credentials['database'],
        user=credentials['username'],
        password=credentials['password']
    )
    return connection
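Because the proxy fronts only the writer, read traffic still has to be steered to the replicas in application code. A minimal sketch of that routing, assuming hypothetical replica endpoints and reusing the credential lookup above:

import random

import psycopg2

# Hypothetical replica endpoints; in practice pull these from configuration
READ_REPLICA_ENDPOINTS = [
    "payment-db-read-replica-1.xxxxx.us-east-1.rds.amazonaws.com",
    "payment-db-read-replica-2.xxxxx.us-east-1.rds.amazonaws.com",
]

def get_read_connection(credentials):
    """Open a connection to a randomly chosen read replica for read-only queries"""
    host = random.choice(READ_REPLICA_ENDPOINTS)
    return psycopg2.connect(
        host=host,
        port=5432,
        database=credentials['database'],
        user=credentials['username'],
        password=credentials['password']
    )

# Usage: writes go through get_db_connection() (the proxy); reads such as
# transaction history queries go through get_read_connection().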
Database Query Optimization
Identify Slow Queries:
-- Enable query logging
ALTER DATABASE paymentdb SET log_min_duration_statement = 1000; -- Log queries > 1s
-- Find slow queries
SELECT
pid,
now() - pg_stat_activity.query_start AS duration,
query,
state
FROM pg_stat_activity
WHERE state = 'active'
AND now() - pg_stat_activity.query_start > interval '1 second'
ORDER BY duration DESC;
-- Analyze a query execution plan (EXPLAIN ANALYZE executes the query, so
-- substitute a concrete customer id rather than an unbound $1 placeholder)
EXPLAIN ANALYZE
SELECT * FROM transactions
WHERE customer_id = 12345
  AND created_at >= NOW() - INTERVAL '30 days';
Optimize with Indexes:
-- Create indexes for common queries
CREATE INDEX CONCURRENTLY idx_transactions_customer_date
ON transactions(customer_id, created_at DESC);
CREATE INDEX CONCURRENTLY idx_transactions_status
ON transactions(status)
WHERE status IN ('pending', 'processing');
-- Update statistics
ANALYZE transactions;
Phase 3: Caching Strategy
Amazon ElastiCache for Redis
Create Redis Cluster:
# Create a Redis replication group (a multi-node Redis setup with automatic
# failover is created with create-replication-group, not create-cache-cluster)
aws elasticache create-replication-group \
  --replication-group-id payment-cache \
  --replication-group-description "Payment platform cache" \
  --engine redis \
  --engine-version 7.0 \
  --cache-node-type cache.r6g.xlarge \
  --num-cache-clusters 3 \
  --cache-parameter-group-name default.redis7 \
  --security-group-ids sg-cache-12345678 \
  --cache-subnet-group-name payment-cache-subnet-group \
  --port 6379 \
  --snapshot-retention-limit 7 \
  --automatic-failover-enabled \
  --multi-az-enabled
Application-Level Caching:
import json

import redis

redis_client = redis.Redis(
    host='payment-cache.xxxxx.cache.amazonaws.com',
    port=6379,
    decode_responses=True
)

def get_cached_data(cache_key, fetch_function, ttl=300):
    """Generic read-through caching helper"""
    # Try cache first
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)

    # Cache miss - fetch from source
    data = fetch_function()

    # Store in cache (default=str handles datetime/Decimal columns)
    redis_client.setex(
        cache_key,
        ttl,
        json.dumps(data, default=str)
    )
    return data

def get_customer_profile(customer_id):
    """Get customer profile with caching"""
    cache_key = f"customer:profile:{customer_id}"

    def fetch_from_db():
        # Database query
        conn = get_db_connection()
        cursor = conn.cursor()
        cursor.execute(
            "SELECT * FROM customers WHERE id = %s",
            (customer_id,)
        )
        return cursor.fetchone()

    return get_cached_data(cache_key, fetch_from_db, ttl=600)

def get_transaction_history(customer_id, days=30):
    """Get transaction history with caching"""
    cache_key = f"transactions:{customer_id}:{days}"

    def fetch_from_db():
        conn = get_db_connection()
        cursor = conn.cursor()
        cursor.execute("""
            SELECT * FROM transactions
            WHERE customer_id = %s
              AND created_at >= NOW() - %s * INTERVAL '1 day'
            ORDER BY created_at DESC
        """, (customer_id, days))
        return cursor.fetchall()

    # Shorter TTL for transaction data (more dynamic)
    return get_cached_data(cache_key, fetch_from_db, ttl=60)
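One gap to keep in mind: even a 60-second TTL serves stale history right after a new payment. A small sketch of explicit invalidation after a successful write (the helper name here is illustrative, not from the original design):

def invalidate_customer_cache(customer_id, days_windows=(30,)):
    """Delete cached entries for a customer after a write so the next read
    repopulates them from the database."""
    keys = [f"customer:profile:{customer_id}"]
    keys += [f"transactions:{customer_id}:{days}" for days in days_windows]
    redis_client.delete(*keys)

# Usage after a successful payment insert/commit:
# invalidate_customer_cache(customer_id)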
Cache Warming Before Black Friday:
def warm_cache_before_black_friday():
    """Pre-populate cache with frequently accessed data"""
    # Get top 1000 customers by transaction volume
    conn = get_db_connection()
    cursor = conn.cursor()
    cursor.execute("""
        SELECT customer_id, COUNT(*) AS tx_count
        FROM transactions
        WHERE created_at >= NOW() - INTERVAL '90 days'
        GROUP BY customer_id
        ORDER BY tx_count DESC
        LIMIT 1000
    """)
    top_customers = cursor.fetchall()

    # Warm cache for each customer
    for customer_id, _ in top_customers:
        get_customer_profile(customer_id)
        get_transaction_history(customer_id, days=30)
        print(f"Warmed cache for customer {customer_id}")

    print(f"Cache warming complete for {len(top_customers)} customers")
Application Load Balancer Caching
ALB with CloudFront for Static Content:
# Create CloudFront distribution for static assets
aws cloudfront create-distribution \
  --distribution-config '{
    "CallerReference": "payment-app-static",
    "Comment": "Static assets for the payment app",
    "Aliases": {
      "Quantity": 1,
      "Items": ["static.payment.example.com"]
    },
    "DefaultRootObject": "index.html",
    "Origins": {
      "Quantity": 1,
      "Items": [{
        "Id": "payment-alb",
        "DomainName": "payment-alb-123456789.us-east-1.elb.amazonaws.com",
        "CustomOriginConfig": {
          "HTTPPort": 80,
          "HTTPSPort": 443,
          "OriginProtocolPolicy": "https-only"
        }
      }]
    },
    "DefaultCacheBehavior": {
      "TargetOriginId": "payment-alb",
      "ViewerProtocolPolicy": "redirect-to-https",
      "AllowedMethods": {
        "Quantity": 2,
        "Items": ["GET", "HEAD"],
        "CachedMethods": {
          "Quantity": 2,
          "Items": ["GET", "HEAD"]
        }
      },
      "ForwardedValues": {
        "QueryString": false,
        "Cookies": {"Forward": "none"}
      },
      "Compress": true,
      "MinTTL": 0,
      "DefaultTTL": 86400,
      "MaxTTL": 31536000
    },
    "Enabled": true
  }'
Phase 4: Load Testing and Validation
AWS Distributed Load Testing
Create Load Test Configuration:
# load-test-config.yaml
testName: BlackFridayLoadTest
testDescription: Validate 10x traffic spike handling
testDuration: 3600   # 1 hour
rampUp: 600          # 10 minutes ramp-up
targetTPS: 5000
scenarios:
  - name: payment_processing
    weight: 70
    script: |
      POST /api/payments
      Headers:
        Authorization: Bearer ${token}
      Body:
        amount: ${random(10, 1000)}
        currency: USD
        customer_id: ${random_customer_id}
  - name: transaction_history
    weight: 20
    script: |
      GET /api/transactions?customer_id=${random_customer_id}
      Headers:
        Authorization: Bearer ${token}
  - name: account_balance
    weight: 10
    script: |
      GET /api/accounts/${random_customer_id}/balance
      Headers:
        Authorization: Bearer ${token}
Run the Load Test:
There is no dedicated load-testing namespace in the AWS CLI: the Distributed Load Testing on AWS solution is deployed from its CloudFormation template, and test scenarios like the one above are submitted through its web console or API. For full control over the traffic shape, you can also drive the test yourself with a custom script, as shown below.
Custom Load Testing Script:
import asyncio
import time
from statistics import mean, median

import aiohttp

async def send_request(session, url, data):
    """Send a single request and record its outcome"""
    start = time.time()
    try:
        async with session.post(url, json=data) as response:
            await response.read()
            duration = time.time() - start
            return {
                'status': response.status,
                'duration': duration,
                'success': response.status == 200
            }
    except Exception as e:
        return {
            'status': 0,
            'duration': time.time() - start,
            'success': False,
            'error': str(e)
        }

async def load_test(target_tps, duration_seconds):
    """Run load test: fire one batch of target_tps requests roughly every second"""
    url = "https://payment.example.com/api/payments"
    concurrent_requests = target_tps  # one batch of this size per second
    results = []
    start_time = time.time()

    # Raise the connection limit; aiohttp defaults to 100 concurrent connections
    connector = aiohttp.TCPConnector(limit=concurrent_requests)
    async with aiohttp.ClientSession(connector=connector) as session:
        while time.time() - start_time < duration_seconds:
            tasks = []
            for _ in range(concurrent_requests):
                data = {
                    'amount': 100.0,
                    'currency': 'USD',
                    'customer_id': 'test-customer-123'
                }
                tasks.append(send_request(session, url, data))
            batch_results = await asyncio.gather(*tasks)
            results.extend(batch_results)

            # Wait 1 second before the next batch (slow responses lower the
            # effective rate, so run several workers to sustain high TPS)
            await asyncio.sleep(1)

    # Analyze results
    durations = sorted(r['duration'] for r in results)
    successes = [r['success'] for r in results]
    print(f"Total requests: {len(results)}")
    print(f"Success rate: {sum(successes) / len(successes) * 100:.2f}%")
    print(f"Average response time: {mean(durations):.3f}s")
    print(f"Median response time: {median(durations):.3f}s")
    print(f"P95 response time: {durations[int(len(durations) * 0.95)]:.3f}s")
    print(f"P99 response time: {durations[int(len(durations) * 0.99)]:.3f}s")
    return results

# Run load test
asyncio.run(load_test(target_tps=5000, duration_seconds=3600))
Load Testing Phases
Week 1-2: Baseline Testing
# Establish baseline
python load_test.py --target-tps 500 --duration 300
Week 3-5: Incremental Testing
# Test at 1x, 2x, 3x, 5x, 10x
for multiplier in 1 2 3 5 10; do
target_tps=$((500 * multiplier))
echo "Testing at ${target_tps} TPS"
python load_test.py --target-tps $target_tps --duration 600
done
Week 6: Stress Testing
# Push beyond expected load
python load_test.py --target-tps 6000 --duration 1800 # 30 minutes
Week 7: Soak Testing
# Sustained load at peak
python load_test.py --target-tps 5000 --duration 14400 # 4 hours
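The phase commands above pass --target-tps and --duration flags, while the script as written hard-codes its parameters. A minimal argparse wrapper (assuming the coroutine above lives in load_test.py, replacing the hard-coded asyncio.run call at the bottom) makes those invocations work:

import argparse
import asyncio

# Append to load_test.py so the week-by-week commands above can drive it.
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Black Friday load test driver")
    parser.add_argument("--target-tps", type=int, default=500,
                        help="Requests per second to generate")
    parser.add_argument("--duration", type=int, default=300,
                        help="Test duration in seconds")
    args = parser.parse_args()
    asyncio.run(load_test(target_tps=args.target_tps,
                          duration_seconds=args.duration))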
Phase 5: Cost Optimization
Reserved Instances Strategy
# Purchase Reserved Instances for baseline capacity (the offering ID already
# encodes the instance type, term, and payment option)
aws ec2 purchase-reserved-instances-offering \
  --reserved-instances-offering-id 12345678-1234-1234-1234-123456789012 \
  --instance-count 10

# RDS Reserved Instances for the primary database
aws rds purchase-reserved-db-instances-offering \
  --reserved-db-instances-offering-id 12345678-1234-1234-1234-123456789012 \
  --db-instance-count 1
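Since the offering ID encodes the instance type and term, it has to be looked up first. A sketch using boto3 (the filter values shown are illustrative):

import boto3

ec2 = boto3.client('ec2')

# Look up standard, all-upfront t3.large offerings to find an offering ID
# to pass to purchase-reserved-instances-offering
offerings = ec2.describe_reserved_instances_offerings(
    InstanceType='t3.large',
    OfferingClass='standard',
    OfferingType='All Upfront',
    ProductDescription='Linux/UNIX',
    MaxResults=10
)
for offer in offerings['ReservedInstancesOfferings']:
    print(offer['ReservedInstancesOfferingId'],
          offer['Duration'], offer['FixedPrice'])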
Spot Instances for Non-Critical Workloads
import boto3

ec2 = boto3.client('ec2')
autoscaling = boto3.client('autoscaling')

def configure_spot_instances():
    """Configure spot instances for cost savings"""
    # Create launch template with spot instances
    ec2.create_launch_template(
        LaunchTemplateName='payment-app-spot',
        LaunchTemplateData={
            'ImageId': 'ami-12345678',
            'InstanceType': 't3.large',
            'InstanceMarketOptions': {
                'MarketType': 'spot',
                'SpotOptions': {
                    'MaxPrice': '0.10',  # Maximum price per instance-hour
                    'SpotInstanceType': 'one-time',
                    'InstanceInterruptionBehavior': 'terminate'
                }
            }
        }
    )

    # Create separate ASG for spot instances (for non-critical workloads)
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName='payment-app-spot-asg',
        LaunchTemplate={
            'LaunchTemplateName': 'payment-app-spot',
            'Version': '$Latest'
        },
        MinSize=0,
        MaxSize=20,
        DesiredCapacity=0,  # Start with 0, scale up only when needed
        VPCZoneIdentifier='subnet-123,subnet-456',
        Tags=[
            {
                'Key': 'InstanceType',
                'Value': 'Spot',
                'PropagateAtLaunch': True
            }
        ]
    )
Auto-Scaling Down After Peak
import boto3
from datetime import datetime, timedelta

autoscaling = boto3.client('autoscaling')
cloudwatch = boto3.client('cloudwatch')

def aggressive_scale_down():
    """Aggressively scale down after Black Friday"""
    # Check if traffic has dropped
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/ApplicationELB',
        MetricName='RequestCount',
        Dimensions=[
            {'Name': 'LoadBalancer', 'Value': 'app/payment-alb/123456789'}
        ],
        StartTime=datetime.utcnow() - timedelta(hours=2),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=['Sum']
    )

    # Calculate average requests per second over the window
    total_requests = sum(d['Sum'] for d in response['Datapoints'])
    avg_rps = total_requests / (2 * 3600)  # 2 hours

    # If back near normal (500 TPS), scale down aggressively
    if avg_rps < 600:  # 20% above normal
        autoscaling.set_desired_capacity(
            AutoScalingGroupName='payment-app-asg',
            DesiredCapacity=10,  # Back to baseline
            HonorCooldown=False  # Ignore cooldown for faster scale-down
        )
        print(f"Scaled down to baseline (current RPS: {avg_rps:.2f})")
Cost Monitoring Dashboard
import json
from datetime import datetime, timedelta

import boto3

ce = boto3.client('ce')  # Cost Explorer

def get_daily_costs():
    """Get daily AWS costs for the last 7 days, grouped by service"""
    end_date = datetime.now().strftime('%Y-%m-%d')
    start_date = (datetime.now() - timedelta(days=7)).strftime('%Y-%m-%d')

    response = ce.get_cost_and_usage(
        TimePeriod={
            'Start': start_date,
            'End': end_date
        },
        Granularity='DAILY',
        Metrics=['UnblendedCost'],
        GroupBy=[
            {'Type': 'DIMENSION', 'Key': 'SERVICE'},
            {'Type': 'DIMENSION', 'Key': 'USAGE_TYPE'}
        ]
    )

    daily_costs = {}
    for result in response['ResultsByTime']:
        date = result['TimePeriod']['Start']
        daily_costs[date] = {}
        for group in result['Groups']:
            service = group['Keys'][0]
            cost = float(group['Metrics']['UnblendedCost']['Amount'])
            # Accumulate across usage types within each service
            daily_costs[date][service] = daily_costs[date].get(service, 0) + cost
    return daily_costs

# Create CloudWatch dashboard for costs
cloudwatch = boto3.client('cloudwatch')

def create_cost_dashboard():
    """Create CloudWatch dashboard for cost monitoring"""
    dashboard_body = {
        "widgets": [
            {
                "type": "metric",
                "properties": {
                    # Billing metrics live in us-east-1 and carry a Currency dimension
                    "metrics": [
                        ["AWS/Billing", "EstimatedCharges", "Currency", "USD", {"stat": "Maximum"}]
                    ],
                    "period": 86400,
                    "stat": "Maximum",
                    "region": "us-east-1",
                    "title": "Daily AWS Costs"
                }
            }
        ]
    }

    cloudwatch.put_dashboard(
        DashboardName='BlackFridayCosts',
        DashboardBody=json.dumps(dashboard_body)
    )
Phase 6: Monitoring and Observability
CloudWatch Dashboards
Real-Time Operations Dashboard:
import json

import boto3

cloudwatch = boto3.client('cloudwatch')

def create_black_friday_dashboard():
    """Create comprehensive monitoring dashboard"""
    # Dashboard metric arrays take dimensions inline as name/value pairs:
    # ["Namespace", "MetricName", "DimensionName", "DimensionValue", {options}]
    dashboard = {
        "widgets": [
            {
                "type": "metric",
                "properties": {
                    "metrics": [
                        ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/payment-alb/123456789", {"stat": "Sum"}],
                        [".", "TargetResponseTime", ".", ".", {"stat": "Average"}],
                        [".", "HTTPCode_Target_2XX_Count", ".", ".", {"stat": "Sum"}],
                        [".", "HTTPCode_Target_5XX_Count", ".", ".", {"stat": "Sum"}]
                    ],
                    "period": 60,
                    "region": "us-east-1",
                    "title": "Application Load Balancer Metrics"
                }
            },
            {
                "type": "metric",
                "properties": {
                    "metrics": [
                        ["AWS/AutoScaling", "GroupDesiredCapacity", "AutoScalingGroupName", "payment-app-asg"],
                        [".", "GroupInServiceInstances", ".", "."],
                        [".", "GroupTotalInstances", ".", "."]
                    ],
                    "period": 60,
                    "stat": "Average",
                    "region": "us-east-1",
                    "title": "Auto Scaling Group"
                }
            },
            {
                "type": "metric",
                "properties": {
                    "metrics": [
                        ["AWS/RDS", "CPUUtilization", "DBInstanceIdentifier", "payment-db-primary"],
                        [".", "DatabaseConnections", ".", "."],
                        [".", "ReadLatency", ".", "."],
                        [".", "WriteLatency", ".", "."]
                    ],
                    "period": 60,
                    "stat": "Average",
                    "region": "us-east-1",
                    "title": "RDS Database Metrics"
                }
            },
            {
                "type": "metric",
                "properties": {
                    "metrics": [
                        ["AWS/ElastiCache", "CPUUtilization", "CacheClusterId", "payment-cache", {"stat": "Average"}],
                        [".", "NetworkBytesIn", ".", ".", {"stat": "Sum"}],
                        [".", "NetworkBytesOut", ".", ".", {"stat": "Sum"}]
                    ],
                    "period": 60,
                    "region": "us-east-1",
                    "title": "ElastiCache Metrics"
                }
            },
            {
                "type": "log",
                "properties": {
                    "query": "SOURCE '/aws/ecs/payment-service' | fields @timestamp, @message\n| filter @message like /ERROR/\n| sort @timestamp desc\n| limit 100",
                    "region": "us-east-1",
                    "title": "Application Errors",
                    "view": "table"
                }
            }
        ]
    }

    cloudwatch.put_dashboard(
        DashboardName='BlackFridayMonitoring',
        DashboardBody=json.dumps(dashboard)
    )
Critical Alerts
def create_critical_alerts():
    """Create CloudWatch alarms for critical metrics"""
    # Note: in production each alarm also needs Dimensions (the specific ALB
    # or DB instance) and AlarmActions (an SNS topic) to actually page someone.
    alarms = [
        {
            'AlarmName': 'high-error-rate',
            'MetricName': 'HTTPCode_Target_5XX_Count',
            'Namespace': 'AWS/ApplicationELB',
            'Statistic': 'Sum',
            'Period': 60,
            'EvaluationPeriods': 2,
            'Threshold': 10,
            'ComparisonOperator': 'GreaterThanThreshold'
        },
        {
            'AlarmName': 'high-response-time',
            'MetricName': 'TargetResponseTime',
            'Namespace': 'AWS/ApplicationELB',
            'Statistic': 'Average',
            'Period': 60,
            'EvaluationPeriods': 3,
            'Threshold': 1.0,  # 1 second
            'ComparisonOperator': 'GreaterThanThreshold'
        },
        {
            'AlarmName': 'database-connection-pool-exhausted',
            'MetricName': 'DatabaseConnections',
            'Namespace': 'AWS/RDS',
            'Statistic': 'Average',
            'Period': 300,
            'EvaluationPeriods': 2,
            'Threshold': 80,  # connection count at ~80% of the configured max
            'ComparisonOperator': 'GreaterThanThreshold'
        }
    ]

    for alarm in alarms:
        cloudwatch.put_metric_alarm(**alarm)
        print(f"Created alarm: {alarm['AlarmName']}")
Phase 7: Security and Compliance
Maintain Security During Scaling
# Ensure security groups allow application traffic from the ALB
aws ec2 authorize-security-group-ingress \
  --group-id sg-12345678 \
  --protocol tcp \
  --port 8080 \
  --source-group sg-alb-12345678

# Enable WAF in front of the ALB (a visibility config is required;
# attach the web ACL to the ALB afterwards with: aws wafv2 associate-web-acl)
aws wafv2 create-web-acl \
  --name payment-app-waf \
  --scope REGIONAL \
  --default-action Allow={} \
  --visibility-config SampledRequestsEnabled=true,CloudWatchMetricsEnabled=true,MetricName=payment-app-waf \
  --rules file://waf-rules.json
Compliance Monitoring
import boto3

def check_compliance_during_peak():
    """Verify compliance during peak load"""
    config = boto3.client('config')

    # Check encryption compliance via the managed 'encrypted-volumes' rule
    response = config.describe_compliance_by_config_rule(
        ConfigRuleNames=['encrypted-volumes']
    )

    for rule in response['ComplianceByConfigRules']:
        if rule['Compliance']['ComplianceType'] == 'NON_COMPLIANT':
            print("⚠️ Compliance issue detected during peak load")
            return False

    print("✅ All compliance checks passed")
    return True
Implementation Timeline
8-Week Preparation Plan
Weeks 1-2: Planning and Design
- Finalize architecture
- Create CloudFormation templates
- Set up monitoring
Weeks 3-4: Infrastructure Implementation
- Deploy auto-scaling groups
- Set up read replicas
- Configure ElastiCache
- Implement RDS Proxy
Weeks 5-6: Load Testing Phase 1
- Baseline testing
- Incremental load testing (1x, 2x, 3x, 5x)
- Identify and fix bottlenecks
Week 7: Load Testing Phase 2
- Stress testing (6,000 TPS)
- Soak testing (4 hours at 5,000 TPS)
- Failover testing
Week 8: Final Preparations
- Review all configurations
- Team training
- Dry run procedures
- Final checklist
Black Friday (Event Day)
- Pre-scale 2 hours before peak
- Continuous monitoring
- Incident response team on standby
Post-Black Friday
- Scale down within 24 hours
- Post-event review
- Cost analysis
- Lessons learned
Success Metrics
Performance Targets
target_metrics = {
    'response_time_p95_ms': 500,               # < 500ms
    'response_time_p99_ms': 1000,              # < 1s
    'error_rate_percent': 0.1,                 # < 0.1%
    'transaction_success_rate_percent': 99.9,  # > 99.9%
    'availability_percent': 99.99,             # > 99.99%
    'auto_scaling_response_time_minutes': 5,   # < 5 minutes
    'cost_increase_percent': 400               # 4x for the peak period only
}
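To make these targets actionable during load testing, the results returned by load_test() above can be checked against them automatically. A small sketch (field names mirror the dict above; the latency, error-rate, and success-rate targets are the ones measurable from the script's output):

def check_targets(results, targets):
    """Compare observed load-test results against the target metrics"""
    durations = sorted(r['duration'] for r in results)
    successes = sum(1 for r in results if r['success'])
    observed = {
        'response_time_p95_ms': durations[int(len(durations) * 0.95)] * 1000,
        'response_time_p99_ms': durations[int(len(durations) * 0.99)] * 1000,
        'error_rate_percent': 100.0 * (len(results) - successes) / len(results),
        'transaction_success_rate_percent': 100.0 * successes / len(results),
    }
    passed = (
        observed['response_time_p95_ms'] < targets['response_time_p95_ms']
        and observed['response_time_p99_ms'] < targets['response_time_p99_ms']
        and observed['error_rate_percent'] < targets['error_rate_percent']
        and observed['transaction_success_rate_percent'] >= targets['transaction_success_rate_percent']
    )
    print(observed, "PASS" if passed else "FAIL")
    return passed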
Conclusion
Scaling for a 10x traffic spike requires a comprehensive, multi-layered approach. Key takeaways:
- Horizontal auto-scaling with predictive scaling prevents capacity issues
- Read replicas and RDS Proxy handle database load efficiently
- Multi-level caching reduces database pressure significantly
- Comprehensive load testing validates the strategy before the event
- Cost optimization through Reserved Instances and aggressive scale-down
- Real-time monitoring ensures visibility during peak load
The result? A payment platform that handles Black Friday traffic seamlessly while maintaining security, compliance, and cost efficiency.