The Problem
Payment webhooks from Stripe, Apple, and Google are revenue-critical, but they're tricky to handle correctly. Events arrive out of order, can be duplicated, and if your processing logic has a bug, you can corrupt subscription state with no way to recover.
I built a webhook broker that treats Kinesis Data Streams as an immutable event log. When things go wrong, you can replay events and rebuild subscription state from scratch.
Architecture Overview
Here's how the system works:
Payment Providers → API Gateway → Lambda (Ingestion)
↓
Kinesis Stream (7-day retention)
↓
Lambda (Processor)
↓
DynamoDB (State + Idempotency)
AWS Services Used
API Gateway (HTTP API)
- Public endpoint for webhooks: /webhooks/{provider}
- Rate limiting: 100 req/sec, burst 200
- Routes: /webhooks/stripe, /webhooks/apple, /webhooks/google
Lambda 1: Ingestion
- Verifies webhook signatures (HMAC-SHA256 for Stripe; see the sketch below)
- Extracts the partition key: provider:subscriptionId
- Writes the raw event to Kinesis
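Stripe signs each webhook with a Stripe-Signature header of the form t=<timestamp>,v1=<hmac>, where the HMAC-SHA256 is computed over "<timestamp>.<raw body>" using your endpoint's signing secret. A minimal verification sketch (it omits the timestamp-tolerance check Stripe also recommends):

```typescript
import * as crypto from 'node:crypto';

// Returns true if the Stripe-Signature header matches the raw request body
function verifyStripeSignature(rawBody: string, header: string, secret: string): boolean {
  // Parse "t=1699999999,v1=abc123..." into { t, v1 }
  const parts = Object.fromEntries(
    header.split(',').map((kv) => kv.split('=') as [string, string])
  );
  const expected = crypto
    .createHmac('sha256', secret)
    .update(`${parts.t}.${rawBody}`, 'utf8')
    .digest('hex');
  const actual = parts.v1 ?? '';
  // Constant-time comparison; the length check keeps timingSafeEqual from throwing
  return (
    actual.length === expected.length &&
    crypto.timingSafeEqual(Buffer.from(expected), Buffer.from(actual))
  );
}
```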
Kinesis Data Streams
- 7-day retention (extendable to 365 days)
- Source of truth for all webhook events
- Partition key ensures per-subscription ordering
Lambda 2: Processor
- Reads from Kinesis stream
- Sorts events by timestamp to handle out-of-order delivery (see the sketch below)
- Uses DynamoDB conditional writes for idempotency
- Updates subscription state
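Here's a minimal sketch of the processor's shape, assuming the record payload is the JSON written at ingestion (the WebhookEvent shape and processIfNotSeen are stand-ins; the latter is sketched under decision 3 below):

```typescript
import type { KinesisStreamEvent } from 'aws-lambda';

// Assumed shape of the raw event the ingestion Lambda writes to the stream
interface WebhookEvent {
  eventId: string;
  provider: string;
  subscriptionId: string;
  timestamp: number;
}

// Stand-in: the conditional-write step sketched under decision 3 below
declare function processIfNotSeen(webhook: WebhookEvent): Promise<void>;

export const handler = async (event: KinesisStreamEvent): Promise<void> => {
  // Kinesis delivers record payloads base64-encoded
  const webhooks: WebhookEvent[] = event.Records.map((r) =>
    JSON.parse(Buffer.from(r.kinesis.data, 'base64').toString('utf8'))
  );

  // Apply events in provider-timestamp order, not arrival order
  webhooks.sort((a, b) => a.timestamp - b.timestamp);

  for (const webhook of webhooks) {
    await processIfNotSeen(webhook);
  }
};
```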
DynamoDB Tables
- ProcessedEvents: Idempotency check (TTL: 90 days)
- SubscriptionState: Current subscription data
SQS Dead Letter Queue
- Captures poison messages after 3 retries (wiring sketched below)
- 14-day retention for manual investigation
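The retry count and DLQ destination live on the Kinesis-to-Lambda event source mapping. In this project that belongs in Terraform; purely for illustration, roughly the equivalent SDK v3 call (ARNs and names are placeholders):

```typescript
import { LambdaClient, CreateEventSourceMappingCommand } from '@aws-sdk/client-lambda';

const lambda = new LambdaClient({});

// Wire the processor to the stream: 3 retries, then ship the failed
// record's metadata to the SQS dead letter queue
await lambda.send(new CreateEventSourceMappingCommand({
  EventSourceArn: 'arn:aws:kinesis:us-east-1:123456789012:stream/webhook-broker-dev-stream',
  FunctionName: 'webhook-broker-dev-processor',
  StartingPosition: 'TRIM_HORIZON',
  MaximumRetryAttempts: 3,
  DestinationConfig: {
    OnFailure: { Destination: 'arn:aws:sqs:us-east-1:123456789012:webhook-broker-dev-dlq' },
  },
}));
```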
Key Design Decisions
1. Kinesis as Event Log
Kinesis isn't just a queue—it's a durable log. Every webhook is preserved for 7 days (configurable up to 365). This gives you time-travel capability: replay events from any point in the retention window.
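Extending retention beyond the default 24 hours is a single API call. A sketch with SDK v3 (the stream name is a placeholder):

```typescript
import { KinesisClient, IncreaseStreamRetentionPeriodCommand } from '@aws-sdk/client-kinesis';

const kinesis = new KinesisClient({});

// Bump retention to the 365-day maximum (specified in hours)
await kinesis.send(new IncreaseStreamRetentionPeriodCommand({
  StreamName: 'webhook-broker-dev-stream',
  RetentionPeriodHours: 24 * 365,
}));
```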
For longer-term needs (regulatory audits, multi-year forensics), you can archive to S3 and implement cold replay from there.
2. Partition Keys for Ordering
Events are sharded by provider:subscriptionId (e.g., stripe:sub_premium_user_001), as in the sketch below. This gives you:
- Ordering guarantees per subscription: Events for the same subscription are processed in order
- Granular replay control: Replay just one customer's events, or all events from a specific provider
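For illustration, how the ingestion Lambda might build the key when writing to the stream (the stream name is a placeholder):

```typescript
import { KinesisClient, PutRecordCommand } from '@aws-sdk/client-kinesis';

const kinesis = new KinesisClient({});

// All records with the same partition key land on the same shard,
// so each subscription's events stay in order
async function publish(provider: string, subscriptionId: string, rawBody: string): Promise<void> {
  await kinesis.send(new PutRecordCommand({
    StreamName: 'webhook-broker-dev-stream',
    PartitionKey: `${provider}:${subscriptionId}`, // e.g. stripe:sub_premium_user_001
    Data: Buffer.from(rawBody),
  }));
}
```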
3. Idempotency with DynamoDB
The processor uses conditional writes to the ProcessedEvents table:
```typescript
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, PutCommand } from '@aws-sdk/lib-dynamodb';

const dynamodb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Only write if eventId doesn't exist; a duplicate throws
// ConditionalCheckFailedException instead of overwriting
await dynamodb.send(new PutCommand({
  TableName: 'ProcessedEvents',
  Item: { eventId, provider, subscriptionId, timestamp },
  ConditionExpression: 'attribute_not_exists(eventId)',
}));
```
This prevents duplicate processing, even when replaying events that were already handled.
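On a duplicate, SDK v3 surfaces the failed condition as an exception, so skipping is a one-line catch. A sketch (WebhookEvent as in the processor sketch; markProcessed and applyEvent are stand-ins for the writes above):

```typescript
import { ConditionalCheckFailedException } from '@aws-sdk/client-dynamodb';

// Stand-ins: markProcessed is the conditional PutCommand above,
// applyEvent updates SubscriptionState
async function processIfNotSeen(webhook: WebhookEvent): Promise<void> {
  try {
    await markProcessed(webhook);
    await applyEvent(webhook);
  } catch (err) {
    // A failed condition means this eventId was already handled;
    // skipping it here is what makes replays harmless
    if (err instanceof ConditionalCheckFailedException) return;
    throw err;
  }
}
```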
Demo: Recovery from Data Loss
Scenario: A bug deletes subscription data from DynamoDB.
Step 1: Initial State
Subscription has 4 processed events:
aws dynamodb get-item \
--table-name webhook-broker-dev-subscription-state \
--key '{"provider": {"S": "stripe"}, "subscriptionId": {"S": "sub_premium_user_001"}}'
# Returns: eventCount: 4
Step 2: Simulate Data Loss
Delete subscription state and processed events (simulating a bug):
python scripts/delete_subscription_data.py \
--provider stripe \
--subscription sub_premium_user_001 \
--execute
# Deletes: SubscriptionState + 4 ProcessedEvents
Step 3: Confirm Deletion
aws dynamodb get-item \
--table-name webhook-broker-dev-subscription-state \
--key '{"provider": {"S": "stripe"}, "subscriptionId": {"S": "sub_premium_user_001"}}'
# Returns: (empty)
Step 4: Replay from Kinesis
python scripts/replay.py \
--subscription sub_premium_user_001 \
--from-beginning \
--execute
# Replays 9 events from stream
# Processes 4 unique events
# Skips 5 duplicates (idempotency)
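Under the hood, replay is just reading the stream from an earlier position and pushing matching records through the same processing path. The repo's actual CLI is Python; here's a TypeScript sketch of the mechanics (stream name and reprocess are stand-ins, pagination simplified):

```typescript
import {
  KinesisClient,
  ListShardsCommand,
  GetShardIteratorCommand,
  GetRecordsCommand,
} from '@aws-sdk/client-kinesis';

// Stand-in: feeds an event through the same code path as the processor
declare function reprocess(event: { subscriptionId: string }): Promise<void>;

const kinesis = new KinesisClient({});
const StreamName = 'webhook-broker-dev-stream'; // placeholder

const { Shards } = await kinesis.send(new ListShardsCommand({ StreamName }));
for (const shard of Shards ?? []) {
  let { ShardIterator } = await kinesis.send(new GetShardIteratorCommand({
    StreamName,
    ShardId: shard.ShardId!,
    ShardIteratorType: 'TRIM_HORIZON', // or AT_TIMESTAMP for a time window
  }));
  while (ShardIterator) {
    const page = await kinesis.send(new GetRecordsCommand({ ShardIterator }));
    for (const record of page.Records ?? []) {
      const event = JSON.parse(Buffer.from(record.Data!).toString('utf8'));
      if (event.subscriptionId === 'sub_premium_user_001') {
        await reprocess(event);
      }
    }
    if (page.MillisBehindLatest === 0) break; // caught up to the tip of the shard
    ShardIterator = page.NextShardIterator;
  }
}
```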
Step 5: Verify Recovery
aws dynamodb get-item \
--table-name webhook-broker-dev-subscription-state \
--key '{"provider": {"S": "stripe"}, "subscriptionId": {"S": "sub_premium_user_001"}}'
# Returns: eventCount: 4 (restored!)
Result: the subscription was rebuilt with exactly the same state, and idempotency prevented any duplicate processing.
Real-World Use Cases
1. Bug Recovery
Deploy a bug that corrupts state → Fix the code → Replay events → State rebuilt correctly
2. Schema Evolution
Add a new field to your subscription model → Replay events to backfill the data
3. Revenue Debugging
Finance reports a discrepancy → Replay specific time range → Trace what happened
4. Provider Outage Recovery
Stripe had an outage yesterday → Replay all Stripe events from that window → Ensure nothing was missed
Tech Stack
- Infrastructure: Terraform
- Lambda Functions: TypeScript (Node.js 18)
- Replay CLI: Python 3.9+
- AWS Services: API Gateway, Kinesis, Lambda, DynamoDB, SQS
Cost Considerations
With 7-day retention and moderate volume (10,000 events/day):
- Kinesis Data Streams: ~$15/month (1 shard)
- Lambda: ~$5/month (first 1M requests free)
- DynamoDB: ~$5/month (on-demand pricing)
- API Gateway: ~$3.50/month (first 1M requests free)
Total: ~$30/month for production-grade event replay capability.
Source Code
Full implementation with Terraform, TypeScript Lambdas, and Python replay tool:
https://github.com/ajithmanmu/webhook-broker
The key insight: Treat your event stream as a source of truth, not just a transport layer. When you have an immutable log, recovery becomes a replay operation instead of a panic.