What the AWS us-east-1 Outage Taught Me About Building Resilient Systems

AWS us-east-1 will go down again. When it does, will your system survive?

This past weekend, I built a system designed to survive it.

After 8 years building subscription infrastructure at Surfline—processing payments through Stripe, Apple, and Google Play—I've learned that the question isn't whether your cloud provider will fail. It's whether your architecture degrades gracefully when it does.

I spent 4 hours implementing three reliability patterns sourced directly from the AWS Builders' Library, Google SRE practices, and Stripe's engineering blog. Here's what I learned.

The Problem: Payment Systems Can't Afford to Fail

When AWS has an incident, your Lambda functions time out. Your DynamoDB calls fail. Your SQS queues back up.

For most applications, users see an error page and retry later. But payment systems are different:

  • A failed charge might actually have succeeded
  • A retry might double-charge the customer
  • A thundering herd of retries can cascade the failure

You need patterns that handle partial failures without losing money or trust.

Pattern 1: Exponential Backoff with Full Jitter

The AWS Builders' Library article on Timeouts, retries, and backoff with jitter changed how I think about retry logic.

The insight: Without jitter, all clients retry at the exact same intervals. If 1,000 requests fail at t=0, they all retry at t=1s, then t=2s, then t=4s—creating synchronized waves that hammer your recovering service.

// Full jitter formula from AWS Builders' Library
const INITIAL_DELAY = 1_000; // ms before the first retry (example value)
const MAX_DELAY = 20_000;    // ms cap on the exponential growth (example value)

const calculateDelay = (attempt: number): number => {
  const exponentialDelay = Math.min(
    MAX_DELAY,
    INITIAL_DELAY * Math.pow(2, attempt)
  );
  // Full jitter: random value between 0 and the exponential delay
  return Math.random() * exponentialDelay;
};
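To make the delay function concrete, here's a minimal retry wrapper around it. MAX_ATTEMPTS and the isRetryable check are illustrative assumptions; only retry errors you know are transient, and never retry an ambiguous payment response without an idempotency key (see Pattern 3).

// Minimal retry wrapper: transient failures are retried with full jitter
const MAX_ATTEMPTS = 5; // example value

async function withRetry<T>(operation: () => Promise<T>): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (!isRetryable(error)) throw error; // don't retry permanent failures
      // Sleep for the jittered delay before the next attempt
      await new Promise((resolve) => setTimeout(resolve, calculateDelay(attempt)));
    }
  }
  throw lastError;
}

// Hypothetical transience check: throttling and timeouts are safe to retry
function isRetryable(error: unknown): boolean {
  const name = (error as { name?: string })?.name ?? '';
  return ['ThrottlingException', 'TimeoutError', 'ServiceUnavailableException'].includes(name);
}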

The result: Success rates improved from ~70% to 99%+ in my load tests. The jitter spreads retry load evenly across time instead of creating synchronized spikes.

AWS Application

This pattern is critical when calling AWS services during degraded states:

  • Lambda retrying DynamoDB during throttling
  • ECS tasks calling external APIs through NAT Gateway
  • Step Functions with retry policies on service integrations
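
For the first case, you usually don't have to hand-roll the loop: the AWS SDK for JavaScript v3 already retries throttled calls with backoff and jitter, and it lets you supply your own delay function. A minimal sketch, assuming SDK v3 and the calculateDelay helper above:

import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { ConfiguredRetryStrategy } from '@smithy/util-retry';

// DynamoDB client whose retries use the full-jitter delay defined above.
// Five attempts is an example value; tune it to your timeout budget.
const client = new DynamoDBClient({
  retryStrategy: new ConfiguredRetryStrategy(
    5,                                    // total attempts, including the first call
    (attempt) => calculateDelay(attempt)  // delay in milliseconds before each retry
  ),
});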

Pattern 2: Bounded Queues with Worker Pools

Here's something I discovered through testing that surprised me:

A bounded queue alone doesn't limit concurrent processing.

I set up a queue with capacity 100, sent 200 requests, and expected ~100 rejections. Instead: zero rejections. Why? Node.js was processing requests faster than they accumulated. The queue checked capacity but didn't control throughput.

// What you actually need: queue + worker pool
class BoundedQueue {
  private queue: Request[] = [];
  private readonly capacity = 100;

  enqueue(request: Request): boolean {
    if (this.queue.length >= this.capacity) {
      return false; // HTTP 429 - fail fast
    }
    this.queue.push(request);
    return true;
  }

  dequeue(): Request | undefined {
    return this.queue.shift();
  }
}

class WorkerPool {
  private activeWorkers = 0;
  private readonly maxWorkers = 10; // THIS controls throughput

  constructor(private readonly handler: (request: Request) => Promise<void>) {}

  process(queue: BoundedQueue): void {
    // Spawn workers only while we're under the cap - this is what actually
    // limits concurrent execution
    while (this.activeWorkers < this.maxWorkers) {
      const request = queue.dequeue();
      if (!request) break; // queue drained

      this.activeWorkers++;
      this.handler(request).finally(() => {
        this.activeWorkers--;
        this.process(queue); // a slot freed up: pick up the next queued request
      });
    }
  }
}
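Wiring the two together might look like this; processPayment and the surrounding HTTP layer are hypothetical stand-ins:

// Hypothetical wiring: shed load at the door, process at fixed concurrency
const queue = new BoundedQueue();
const pool = new WorkerPool(async (request) => {
  await processPayment(request); // hypothetical payment handler
});

function handleIncoming(request: Request): number {
  if (!queue.enqueue(request)) {
    return 429; // queue full - fail fast so clients back off with jitter
  }
  pool.process(queue); // the pool caps itself at maxWorkers
  return 202; // accepted for asynchronous processing
}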

AWS Application

This maps directly to AWS service patterns:

  • SQS + Lambda concurrency limits: The queue (SQS) buffers; reserved concurrency limits throughput
  • API Gateway + throttling: Request queuing with rate limits
  • Kinesis + Lambda: Batch size and parallelization factor control processing rate

The key insight: SQS without Lambda concurrency limits is like a bounded queue without a worker pool—it buffers but doesn't protect downstream systems.

Pattern 3: Idempotency with Strategic Caching

Stripe's idempotency documentation shaped this implementation. The pattern: cache successful responses for 24 hours, never cache errors.

// Minimal types so the store compiles on its own
interface PaymentResponse { success: boolean; }
interface CachedResponse { response: PaymentResponse; ttl: number; }
class ConflictError extends Error {}

class IdempotencyStore {
  private cache = new Map<string, CachedResponse>();
  private inFlight = new Set<string>();

  async process(idempotencyKey: string, operation: () => Promise<PaymentResponse>) {
    // Check cache first
    const cached = this.cache.get(idempotencyKey);
    if (cached) return cached.response;

    // Detect concurrent duplicates
    if (this.inFlight.has(idempotencyKey)) {
      throw new ConflictError('Request already in progress');
    }

    this.inFlight.add(idempotencyKey);
    try {
      const response = await operation();
      // Only cache successes; errors stay retryable
      if (response.success) {
        this.cache.set(idempotencyKey, { response, ttl: 24 * 60 * 60 });
      }
      return response;
    } finally {
      this.inFlight.delete(idempotencyKey);
    }
  }
}
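A quick usage sketch; chargeCard and paymentRequest are hypothetical stand-ins for your provider call:

// Replaying the same key never double-charges: the second call is served from cache
const store = new IdempotencyStore();
const key = 'order-1042'; // derive the key from the order, not from the retry attempt

const first = await store.process(key, () => chargeCard(paymentRequest));
const replay = await store.process(key, () => chargeCard(paymentRequest));
// chargeCard ran once; `replay` is the cached response from the first call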

AWS Application

  • DynamoDB for idempotency keys: Conditional writes with TTL for automatic cleanup
  • Lambda Powertools: Built-in idempotency utility using DynamoDB
  • Step Functions: Native idempotency with execution names

// DynamoDB idempotency pattern (assuming the SDK v3 DynamoDBDocument client).
// A ConditionalCheckFailedException here means the key already exists:
// read and return the stored response instead of charging again.
await dynamodb.put({
  TableName: 'IdempotencyStore',
  Item: {
    idempotencyKey: key,
    response: result,
    ttl: Math.floor(Date.now() / 1000) + 86400 // 24 hours; matches the table's TTL attribute
  },
  ConditionExpression: 'attribute_not_exists(idempotencyKey)'
});
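The Powertools route is nearly drop-in. A sketch assuming the Powertools for AWS Lambda (TypeScript) idempotency utility and a DynamoDB table named IdempotencyStore; by default it hashes the whole event payload to build the key:

import { makeIdempotent } from '@aws-lambda-powertools/idempotency';
import { DynamoDBPersistenceLayer } from '@aws-lambda-powertools/idempotency/dynamodb';

// Persistence layer: Powertools stores results here and replays them for duplicates
const persistenceStore = new DynamoDBPersistenceLayer({
  tableName: 'IdempotencyStore',
});

// The event shape is a hypothetical payment request
export const handler = makeIdempotent(
  async (event: { orderId: string; amount: number }) => {
    // charge the customer here; the return value is cached and replayed on duplicates
    return { status: 'charged', orderId: event.orderId };
  },
  { persistenceStore }
);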

The Architecture: Putting It Together

Here's how these patterns compose into a resilient payment processing system on AWS:

┌─────────────────────────────────────────────────────────────┐
│                     API Gateway                              │
│                   (Rate Limiting)                            │
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│                      SQS Queue                               │
│              (Bounded Queue - Buffer)                        │
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│              Lambda (Reserved Concurrency = 10)              │
│                   (Worker Pool)                              │
│  ┌─────────────────────────────────────────────────────────┐│
│  │  1. Check DynamoDB idempotency store                    ││
│  │  2. Process payment with retry + jitter                 ││
│  │  3. Store result in DynamoDB                            ││
│  └─────────────────────────────────────────────────────────┘│
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│                 DynamoDB Tables                              │
│    - IdempotencyStore (with TTL)                            │
│    - ProcessingResults                                       │
└─────────────────────────────────────────────────────────────┘
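In CDK, the core of that wiring might look like the sketch below. The construct names, asset path, visibility timeout, and DLQ settings are assumptions, and the API Gateway front door plus the ProcessingResults table are omitted for brevity:

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import { SqsEventSource } from 'aws-cdk-lib/aws-lambda-event-sources';

export class ResilientRelayStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Bounded buffer: failed messages land in a DLQ after three attempts
    const dlq = new sqs.Queue(this, 'PaymentDLQ');
    const queue = new sqs.Queue(this, 'PaymentQueue', {
      visibilityTimeout: cdk.Duration.seconds(90),
      deadLetterQueue: { queue: dlq, maxReceiveCount: 3 },
    });

    // Idempotency store with TTL for automatic cleanup
    const idempotencyTable = new dynamodb.Table(this, 'IdempotencyStore', {
      partitionKey: { name: 'idempotencyKey', type: dynamodb.AttributeType.STRING },
      timeToLiveAttribute: 'ttl',
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
    });

    // Worker pool: reserved concurrency caps parallel executions at 10
    const worker = new lambda.Function(this, 'PaymentWorker', {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda'),
      reservedConcurrentExecutions: 10,
      environment: { IDEMPOTENCY_TABLE: idempotencyTable.tableName },
    });

    idempotencyTable.grantReadWriteData(worker);
    worker.addEventSource(new SqsEventSource(queue, { batchSize: 10 }));
  }
}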

Key Takeaways for AWS Builders

  1. Read the AWS Builders' Library. It's written by engineers who've operated services at massive scale. The jitter article alone is worth your time.

  2. Test your assumptions. I assumed bounded queues limited throughput. They don't. Load testing revealed the gap.

  3. Accept the tradeoff. These patterns increase latency. A request that would fail in 100ms might now take 5 seconds across retries. But 99%+ success beats 70% success every time.

  4. Use AWS primitives. SQS, Lambda concurrency, DynamoDB TTL, and Step Functions give you these patterns without building from scratch.

What's Next

The resilient-relay repo has the full implementation. I'm planning to add:

  • Dead-letter queue handling for failed payments
  • CloudWatch metrics for RED (Rate, Errors, Duration) observability
  • Multi-region failover patterns

When us-east-1 goes down again—and it will—your system should degrade gracefully, not catastrophically.

The AWS Builders' Library exists because Amazon learned these lessons operating AWS itself. The patterns are proven. The question is whether we apply them.


What reliability patterns have you implemented in your AWS architectures? I'd love to hear what's worked (or failed spectacularly) in production.


GitHub: resilient-relay | LinkedIn
