Building a Webhook Replay System with AWS Kinesis

The Problem

Payment webhooks from Stripe, Apple, and Google are revenue-critical, but they're tricky to handle correctly. Events arrive out of order, can be duplicated, and if your processing logic has a bug, you can corrupt subscription state with no way to recover.

I built a webhook broker that treats Kinesis Data Streams as an immutable event log. When things go wrong, you can replay events and rebuild subscription state from scratch.

Architecture Overview

Here's how the system works:

Payment Providers → API Gateway → Lambda (Ingestion)
                                       ↓
                                  Kinesis Stream (7-day retention)
                                       ↓
                                  Lambda (Processor)
                                       ↓
                                  DynamoDB (State + Idempotency)

AWS Services Used

API Gateway (HTTP API)

  • Public endpoint for webhooks: /webhooks/{provider}
  • Rate limiting: 100 req/sec, burst 200
  • Routes: /webhooks/stripe, /webhooks/apple, /webhooks/google

Lambda 1: Ingestion

  • Verifies webhook signatures (HMAC-SHA256 for Stripe)
  • Extracts partition key: provider:subscriptionId
  • Writes the raw event to Kinesis (sketched below)
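
A minimal sketch of what this handler might look like, assuming a simplified x-signature header, env var names, and payload shape (real Stripe verification parses the Stripe-Signature header and signs `${timestamp}.${body}`):

import { createHmac, timingSafeEqual } from 'node:crypto';
import { KinesisClient, PutRecordCommand } from '@aws-sdk/client-kinesis';
import type { APIGatewayProxyEventV2, APIGatewayProxyResultV2 } from 'aws-lambda';

const kinesis = new KinesisClient({});

export const handler = async (
  event: APIGatewayProxyEventV2
): Promise<APIGatewayProxyResultV2> => {
  const body = event.body ?? '';

  // Simplified HMAC-SHA256 check: reject anything we can't authenticate
  const expected = createHmac('sha256', process.env.WEBHOOK_SECRET!)
    .update(body)
    .digest('hex');
  const received = event.headers['x-signature'] ?? '';
  if (
    received.length !== expected.length ||
    !timingSafeEqual(Buffer.from(received), Buffer.from(expected))
  ) {
    return { statusCode: 401, body: 'invalid signature' };
  }

  // Partition key keeps all events for one subscription on one shard,
  // which is what guarantees per-subscription ordering downstream
  const payload = JSON.parse(body);
  const partitionKey = `stripe:${payload.subscriptionId}`;

  await kinesis.send(new PutRecordCommand({
    StreamName: process.env.STREAM_NAME!,
    PartitionKey: partitionKey,
    Data: Buffer.from(body),
  }));

  return { statusCode: 200, body: 'ok' };
};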

Kinesis Data Streams

  • 7-day retention (extendable to 365 days)
  • Source of truth for all webhook events
  • Partition key ensures per-subscription ordering

Lambda 2: Processor

  • Reads from Kinesis stream
  • Sorts events by timestamp (handles out-of-order delivery)
  • Uses DynamoDB conditional writes for idempotency
  • Updates subscription state (see the sketch after this list)
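
A minimal sketch of the processor's event loop; the payload fields and helper names are assumptions, and markProcessed is expanded in the idempotency section below:

import type { KinesisStreamEvent } from 'aws-lambda';

// Hypothetical helpers: markProcessed is shown in full under "Idempotency with DynamoDB"
declare function markProcessed(evt: {
  eventId: string; provider: string; subscriptionId: string; timestamp: number;
}): Promise<boolean>;
declare function applySubscriptionUpdate(evt: unknown): Promise<void>;

export const handler = async (event: KinesisStreamEvent): Promise<void> => {
  // Decode the batch, then sort by the provider's event timestamp,
  // since webhooks can arrive out of order
  const events = event.Records
    .map((r) => JSON.parse(Buffer.from(r.kinesis.data, 'base64').toString('utf8')))
    .sort((a, b) => a.timestamp - b.timestamp);

  for (const evt of events) {
    const isNew = await markProcessed(evt); // DynamoDB conditional write
    if (!isNew) continue; // duplicate or replayed event: skip

    await applySubscriptionUpdate(evt); // write the new SubscriptionState
  }
};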

DynamoDB Tables

  • ProcessedEvents: Idempotency check (TTL: 90 days)
  • SubscriptionState: Current subscription data (item shapes sketched below)
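
As a sketch, the item shapes might look like this; the keys match the CLI queries in the demo below, while the remaining fields are assumptions:

// ProcessedEvents: one item per webhook, keyed by eventId
interface ProcessedEvent {
  eventId: string;       // partition key; the deduplication handle
  provider: string;
  subscriptionId: string;
  timestamp: number;
  expiresAt: number;     // epoch seconds; DynamoDB TTL removes items after 90 days
}

// SubscriptionState: current state, keyed the way the recovery demo queries it
interface SubscriptionState {
  provider: string;        // partition key
  subscriptionId: string;  // sort key
  status: string;          // e.g. 'active', 'past_due', 'canceled'
  eventCount: number;      // used in the recovery demo below
}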

SQS Dead Letter Queue

  • Captures poison messages after 3 retries
  • 14-day retention for manual investigation (example wiring below)
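
Assuming the retry and DLQ wiring lives on the processor's Kinesis event source mapping, the equivalent CLI call would look roughly like this (function name and ARNs are placeholders):

aws lambda create-event-source-mapping \
  --function-name webhook-processor \
  --event-source-arn arn:aws:kinesis:us-east-1:123456789012:stream/webhook-events \
  --starting-position TRIM_HORIZON \
  --maximum-retry-attempts 3 \
  --destination-config '{"OnFailure": {"Destination": "arn:aws:sqs:us-east-1:123456789012:webhook-dlq"}}'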

Key Design Decisions

1. Kinesis as Event Log

Kinesis isn't just a queue—it's a durable log. Every webhook is preserved for 7 days (configurable up to 365). This gives you time-travel capability: replay events from any point in the retention window.

For longer-term needs (regulatory audits, multi-year forensics), you can archive to S3 and implement cold replay from there.
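
Extending retention is a single API call; for example (stream name is a placeholder):

aws kinesis increase-stream-retention-period \
  --stream-name webhook-events \
  --retention-period-hours 8760   # 365 days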

2. Partition Keys for Ordering

Events are sharded by provider:subscriptionId (e.g., stripe:sub_premium_user_001). This gives:

  • Ordering guarantees per subscription: Events for the same subscription are processed in order
  • Granular replay control: Replay just one customer's events, or all events from a specific provider

3. Idempotency with DynamoDB

The processor uses conditional writes to the ProcessedEvents table:

import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, PutCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Only write if eventId doesn't exist
await ddb.send(new PutCommand({
  TableName: 'ProcessedEvents',
  Item: { eventId, provider, subscriptionId, timestamp },
  ConditionExpression: 'attribute_not_exists(eventId)',
}));

This prevents duplicate processing, even when replaying events that were already handled.
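
When the condition fails, the SDK throws; a sketch of how the processor can turn that into a clean duplicate check (the helper name is mine, with the client and PutCommand from the snippet above):

import { ConditionalCheckFailedException } from '@aws-sdk/client-dynamodb';

// Returns true if the event is new, false if it was already processed
async function markProcessed(item: {
  eventId: string; provider: string; subscriptionId: string; timestamp: number;
}): Promise<boolean> {
  try {
    await ddb.send(new PutCommand({
      TableName: 'ProcessedEvents',
      Item: item,
      ConditionExpression: 'attribute_not_exists(eventId)',
    }));
    return true;
  } catch (err) {
    if (err instanceof ConditionalCheckFailedException) return false; // duplicate
    throw err; // real failure: surface it so Lambda retries (and eventually DLQs)
  }
}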

Demo: Recovery from Data Loss

Scenario: A bug deletes subscription data from DynamoDB.

Step 1: Initial State

Subscription has 4 processed events:

aws dynamodb get-item \
  --table-name webhook-broker-dev-subscription-state \
  --key '{"provider": {"S": "stripe"}, "subscriptionId": {"S": "sub_premium_user_001"}}'

# Returns: eventCount: 4

Step 2: Simulate Data Loss

Delete subscription state and processed events (simulating a bug):

python scripts/delete_subscription_data.py \
  --provider stripe \
  --subscription sub_premium_user_001 \
  --execute

# Deletes: SubscriptionState + 4 ProcessedEvents

Step 3: Confirm Deletion

aws dynamodb get-item \
  --table-name webhook-broker-dev-subscription-state \
  --key '{"provider": {"S": "stripe"}, "subscriptionId": {"S": "sub_premium_user_001"}}'

# Returns: (empty)

Step 4: Replay from Kinesis

python scripts/replay.py \
  --subscription sub_premium_user_001 \
  --from-beginning \
  --execute

# Replays 9 events from stream
# Processes 4 unique events
# Skips 5 duplicates (idempotency)

Step 5: Verify Recovery

aws dynamodb get-item \
  --table-name webhook-broker-dev-subscription-state \
  --key '{"provider": {"S": "stripe"}, "subscriptionId": {"S": "sub_premium_user_001"}}'

# Returns: eventCount: 4 (restored!)

Result: The subscription was rebuilt with the exact same state, and idempotency prevented duplicate processing.
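
Under the hood, a replay is just a scan of the stream from a chosen position, re-running the same processing logic and letting idempotency skip anything already handled. A minimal TypeScript sketch of the idea behind replay.py (the field names and reprocess hook are assumptions):

import {
  KinesisClient,
  ListShardsCommand,
  GetShardIteratorCommand,
  GetRecordsCommand,
} from '@aws-sdk/client-kinesis';

const kinesis = new KinesisClient({});

// Hypothetical hook: runs the same logic as the processor Lambda
declare function reprocess(evt: { subscriptionId: string }): Promise<void>;

async function replay(streamName: string, subscriptionId: string): Promise<void> {
  const { Shards } = await kinesis.send(new ListShardsCommand({ StreamName: streamName }));
  for (const shard of Shards ?? []) {
    // TRIM_HORIZON starts at the oldest record still inside the retention window
    let { ShardIterator } = await kinesis.send(new GetShardIteratorCommand({
      StreamName: streamName,
      ShardId: shard.ShardId!,
      ShardIteratorType: 'TRIM_HORIZON',
    }));
    while (ShardIterator) {
      const page = await kinesis.send(new GetRecordsCommand({ ShardIterator }));
      for (const record of page.Records ?? []) {
        const evt = JSON.parse(Buffer.from(record.Data!).toString('utf8'));
        if (evt.subscriptionId === subscriptionId) {
          await reprocess(evt); // idempotency in the processor skips duplicates
        }
      }
      // Stop once we've drained this shard up to the tip of the stream
      if ((page.Records ?? []).length === 0 && page.MillisBehindLatest === 0) break;
      ShardIterator = page.NextShardIterator;
    }
  }
}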

Real-World Use Cases

1. Bug Recovery
Deploy a bug that corrupts state → Fix the code → Replay events → State rebuilt correctly

2. Schema Evolution
Add a new field to your subscription model → Replay events to backfill the data

3. Revenue Debugging
Finance reports a discrepancy → Replay specific time range → Trace what happened

4. Provider Outage Recovery
Stripe had an outage yesterday → Replay all Stripe events from that window → Ensure nothing was missed

Tech Stack

  • Infrastructure: Terraform
  • Lambda Functions: TypeScript (Node.js 18)
  • Replay CLI: Python 3.9+
  • AWS Services: API Gateway, Kinesis, Lambda, DynamoDB, SQS

Cost Considerations

With 7-day retention and moderate volume (10,000 events/day):

  • Kinesis Data Streams: ~$15/month (1 shard)
  • Lambda: ~$5/month (first 1M requests free)
  • DynamoDB: ~$5/month (on-demand pricing)
  • API Gateway: ~$3.50/month (first 1M requests free)

Total: ~$30/month for production-grade event replay capability.

Source Code

Full implementation with Terraform, TypeScript Lambdas, and Python replay tool:

https://github.com/ajithmanmu/webhook-broker


The key insight: Treat your event stream as a source of truth, not just a transport layer. When you have an immutable log, recovery becomes a replay operation instead of a panic.
