A real-world case study on eliminating duplicate payments and race conditions in Stripe webhook architecture.
TL;DR: Stop doing eager sync. Seriously. Webhooks should only verify and enqueue—nothing else. Let a single worker handle all writes. 217 lines replaced 3,769.
By “single worker,” I mean a single logical writer per Stripe object, enforced via queue partitioning — not literally one process.
The Problem: Simple Payments, Complex Failures
A user buys credits. Payment succeeds. They get credited twice.
Another user subscribes. Webhook arrives late. Their subscription shows "pending" for 10 minutes while they refresh angrily.
A third user? Their purchase disappears entirely after a Redis blip.
We thought Stripe webhooks were simple. We were wrong.
Here's how 3,769 lines of "helpful" code created a race condition that could take down our payment system—and the boring fix that solved everything.
What We Were Building
Our platform serves a lot of users. We process payments through Stripe:
- Credit purchases (one-time payments)
- Subscriptions (recurring billing)
The flow seems straightforward:
- Create a checkout session
- Wait for payment confirmation
- Credit the user's account
What could go wrong? Everything.
Inline Webhook Processing: The "Simple" Approach That Backfires
Like most teams, we started with the obvious approach:
// webhooks.controller.ts - The "simple" approach
app.post('/webhooks/stripe', async (req, res) => {
  // 1. Verify the webhook signature
  const event = stripe.webhooks.constructEvent(
    req.body,
    req.headers['stripe-signature'],
    process.env.STRIPE_WEBHOOK_SECRET
  );
  // 2. Process the event inline
  switch (event.type) {
    case 'checkout.session.completed':
      await handleCheckoutCompleted(event.data.object);
      break;
    case 'customer.subscription.created':
      await handleSubscriptionCreated(event.data.object);
      break;
    case 'invoice.payment_succeeded':
      await handleInvoicePayment(event.data.object);
      break;
    case 'charge.refunded':
      await handleRefund(event.data.object);
      break;
    // ... 15 more event types
  }
  // 3. Return 200 to acknowledge receipt
  res.status(200).json({ received: true });
});
Looks clean. Ships fast. Breaks at scale.
The Hidden Assumptions
This code assumes:
| Assumption | Reality |
|---|---|
| Processing is fast | Stripe times out at 30s. Our handlers took 35s during traffic spikes. |
| Dependencies are available | Redis goes down. We crash. No 200. Stripe retries. Duplicate processing. |
| Order doesn't matter | invoice.payment_succeeded arrives before subscription.created. Handler fails. |
| We won't crash mid-processing | Commit to MongoDB, crash, no 200. Stripe retries. Duplicate credit. |
Did you know? In live mode, Stripe retries failed webhook deliveries with exponential backoff for up to 3 days. That "duplicate" you see at 3am? It may be a retry of yesterday's timeout. Debugging webhook issues feels like time travel.
The Consequences
With a large user base, these "edge cases" became daily incidents:
- Duplicate credits from retry storms
- Missing subscriptions from out-of-order events
- Timeouts during high-traffic periods
- Customers seeing inconsistent balances
The webhook controller bloated trying to handle every edge case inline.
The "Eager Sync" Optimization That Made Everything Worse
To improve UX, we added eager synchronization. The idea: don't make users wait for webhooks.
When a user completes checkout and returns to our app, we immediately check with Stripe:
// checkout-return.controller.ts - The "eager" approach
app.get('/checkout/return', async (req, res) => {
  const { session_id } = req.query;
  // Fetch the session from Stripe
  const session = await stripe.checkout.sessions.retrieve(session_id);
  // If payment succeeded, process immediately
  if (session.payment_status === 'paid') {
    await syncCheckoutSession(session); // Credit the user NOW
  }
  res.redirect('/dashboard?purchase=success');
});
Instant feedback. Happy users. Right?
Why Eager Sync Felt Right
- Instant feedback. User sees credits immediately, not "pending."
- No webhook delays. Webhooks can lag by seconds or minutes.
- Handles webhook failures. If our endpoint is down, eager sync still works.
The Hidden Problem: Two Writers, One Race Condition
Now we had two systems processing the same payment:
Timeline A (User is fast):
0ms - User completes payment
100ms - User redirected to /checkout/return
150ms - Eager sync processes payment ✓
500ms - Webhook arrives
550ms - Webhook processes payment ✓ (DUPLICATE!)
Timeline B (Race condition):
0ms - User completes payment
50ms - Webhook arrives, starts processing
60ms - User redirected to /checkout/return
70ms - Eager sync checks "already processed?" → No (webhook hasn't committed yet)
80ms - Eager sync starts processing
90ms - Webhook commits transaction
100ms - Eager sync commits transaction (DUPLICATE!)
The idempotency check didn't help because both systems checked before either committed.
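The interleaving in Timeline B can be reproduced deterministically. The sketch below is an in-memory model, not the production handlers: `makeWriter`, `runSchedule`, and the session ID are hypothetical names, and each writer is split into a `check` step and a `commit` step so that a schedule can interleave them.

```typescript
// In-memory model of two writers that both "check, then commit".
// All names here are hypothetical stand-ins, not the original handlers.
interface Store {
  processed: Record<string, boolean>;
  balance: number;
}

interface Writer {
  check(): void;
  commit(): void;
}

function makeWriter(store: Store, sessionId: string, credits: number): Writer {
  let sawExisting = false;
  return {
    // Step 1: the idempotency check ("already processed?")
    check(): void {
      sawExisting = store.processed[sessionId] === true;
    },
    // Step 2: commit, based on what the earlier check saw
    commit(): void {
      if (sawExisting) return;
      store.processed[sessionId] = true;
      store.balance += credits;
    },
  };
}

type Step = ['webhook' | 'eager', 'check' | 'commit'];

function runSchedule(steps: Step[]): number {
  const store: Store = { processed: {}, balance: 0 };
  const writers = {
    webhook: makeWriter(store, 'cs_test_123', 100),
    eager: makeWriter(store, 'cs_test_123', 100),
  };
  for (const [who, step] of steps) {
    writers[who][step]();
  }
  return store.balance;
}

// One writer finishes before the other starts: the check works.
const sequentialBalance = runSchedule([
  ['webhook', 'check'], ['webhook', 'commit'],
  ['eager', 'check'], ['eager', 'commit'],
]);

// Both writers check before either commits (Timeline B): the check is useless.
const interleavedBalance = runSchedule([
  ['webhook', 'check'], ['eager', 'check'],
  ['webhook', 'commit'], ['eager', 'commit'],
]);

console.log({ sequentialBalance, interleavedBalance });
```

In this model the sequential schedule credits once (balance 100) and the interleaved schedule credits twice (balance 200), even though both writers run exactly the same check.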
Three Failed Fixes (And Why They Failed)
Fix #1: Database Locks
async function syncCheckoutSession(session: Stripe.Checkout.Session) {
  const lock = await acquireLock(`checkout:${session.id}`);
  try {
    const existing = await Transaction.findOne({ stripeSessionId: session.id });
    if (existing) return;
    await creditUserAccount(session);
  } finally {
    await releaseLock(lock);
  }
}
Failed because: Distributed locks across two different code paths are error-prone. Lock contention, deadlocks, and expiration issues.
Fix #2: Unique Constraints
const transactionSchema = new Schema({
  stripeSessionId: { type: String, unique: true }
});
Failed because: Prevents duplicates but creates partial failures. Writer A creates transaction, crashes before crediting wallet. Writer B sees transaction exists, skips everything. User has record but no credits.
Fix #3: Redis Idempotency Keys
const wasSet = await redis.set(idempotencyKey, '1', 'NX', 'EX', 3600);
if (!wasSet) return; // Another process handling this
Failed because: Crash after setting key but before processing = payment stuck forever. Added cleanup jobs, TTLs, state tracking. Complexity exploded.
After three failed fixes, and one very long postmortem, we asked a different question.
The Root Cause: Two Writers, One Race Condition
We were solving the wrong problem.
The issue wasn't "how do we coordinate two writers?"
The issue was "why do we have two writers?"
The Two Generals Problem (1975): Two parties cannot be guaranteed to reach agreement over a channel that can lose messages. This is a proven impossibility result in distributed systems. Keeping two independent writers in agreement about the same payment put us uncomfortably close to that problem.
The fix? Don't have two generals. Have one general (the queue worker) and one messenger (the webhook endpoint).
Eager sync existed because we didn't trust webhooks. But instead of fixing webhook reliability, we added a second system that made everything worse.
Counter-intuitive: Showing "Processing..." for 2 seconds feels faster than showing "Success!" immediately and then correcting to "Actually, duplicate." Users trust systems that appear deliberate, not systems that appear to lie.
Queue-Based Webhook Processing with BullMQ
The Core Principles
| Principle | Implementation |
|---|---|
| Webhooks are source of truth | Frontend only reads state, never writes |
| Webhook handlers do one thing | Verify signature, queue event, return 200 |
| Single writer processes events | Worker with idempotency (actually works now) |
The New Architecture
┌────────────────────────────────────────────────┐
│                     BEFORE                     │
├────────────────────────────────────────────────┤
│                                                │
│        Stripe                   User           │
│          │                       │             │
│          ▼                       ▼             │
│      [Webhook]           [Checkout Return]     │
│          │                       │             │
│          ▼                       ▼             │
│  [Verify + Process]        [Eager Sync]        │
│    (3,769 lines)                 │             │
│          │                       │             │
│          └───────────┬───────────┘             │
│                      │                         │
│                      ▼                         │
│             ┌────────────────┐                 │
│             │  SAME WALLET!  │ ← Both race here│
│             └────────────────┘                 │
│                                                │
└────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│                     AFTER                      │
├────────────────────────────────────────────────┤
│                                                │
│   Stripe                      User             │
│      │                         │               │
│      ▼                         ▼               │
│  [Webhook]             [Checkout Return]       │
│      │                         │               │
│      ▼                         ▼               │
│  [Verify]─▶[Queue]─▶200     [Poll]─▶Dashboard  │
│               │               (read-only!)     │
│               ▼                                │
│         ┌────────────┐                         │
│         │Redis Queue │                         │
│         │  (BullMQ)  │                         │
│         └─────┬──────┘                         │
│               │                                │
│               ▼                                │
│         ┌────────────┐                         │
│         │   Worker   │ ← Single writer         │
│         │(217 lines) │                         │
│         └─────┬──────┘                         │
│               │                                │
│               ▼                                │
│         ┌────────────┐                         │
│         │   Wallet   │ ← No race               │
│         └────────────┘                         │
│                                                │
└────────────────────────────────────────────────┘
The New Webhook Controller (47 Lines)
// webhooks.controller.ts
import Stripe from 'stripe';
import { queueService } from '@/shared/infrastructure/queue';
import { verifyStripeSignature } from './webhooks.service';
// `app` and `logger` are assumed to be set up elsewhere in the application.
app.post('/webhooks/stripe', async (req, res) => {
  try {
    // 1. Verify signature (ONLY job of this endpoint)
    const event = verifyStripeSignature(req);
    // 2. Queue the event for async processing
    const { queued } = await queueService.addStripeEvent(event);
    // 3. Acknowledge receipt immediately
    return res.status(200).json({ received: true, queued });
  } catch (error) {
    // The Stripe SDK exposes its error classes under Stripe.errors
    if (error instanceof Stripe.errors.StripeSignatureVerificationError) {
      return res.status(400).json({ error: 'Invalid signature' });
    }
    // Redis/queue failure - return 503 so Stripe retries later
    logger.error('Webhook queue failure', { error });
    return res.status(503).json({ error: 'Service unavailable' });
  }
});
Verify. Queue. Return 200. That's it.
The Queue Service (217 Lines)
// queue.service.ts
import { Queue } from 'bullmq';
import type Stripe from 'stripe';
// `redis` (connection options) and `logger` are assumed to be defined elsewhere.
const QUEUES = {
  CREDIT_PURCHASE: new Queue('credit-purchase', { connection: redis }),
  SUBSCRIPTION: new Queue('subscription', { connection: redis }),
};
// Declarative event routing
const EVENT_QUEUE_MAP: Record<string, keyof typeof QUEUES> = {
  'checkout.session.completed': 'CREDIT_PURCHASE',
  'charge.refunded': 'CREDIT_PURCHASE',
  'customer.subscription.created': 'SUBSCRIPTION',
  'customer.subscription.updated': 'SUBSCRIPTION',
  'invoice.payment_succeeded': 'SUBSCRIPTION',
};
export async function addStripeEvent(
  event: Stripe.Event
): Promise<{ queued: boolean }> {
  const queueName = EVENT_QUEUE_MAP[event.type];
  if (!queueName) {
    logger.debug(`Unhandled event type: ${event.type}`);
    return { queued: false };
  }
  await QUEUES[queueName].add(event.type, event, {
    jobId: event.id, // BullMQ deduplicates: same ID = no-op
    priority: queueName === 'CREDIT_PURCHASE' ? 1 : 5,
    attempts: 3,
    backoff: { type: 'exponential', delay: 1000 },
    removeOnComplete: { age: 86400 * 3 }, // Keep for 3 days (Stripe retry window)
  });
  return { queued: true };
}
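Declarative routing, plus automatic retries with exponential backoff. One caveat: BullMQ priorities only order jobs within a single queue, so the priority values above do not make credit jobs run ahead of subscription jobs. The separate queues and workers are what keep one kind of event from blocking the other.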
Gotcha: BullMQ's jobId deduplication only works while the job exists in Redis. Once completed/removed, the same jobId can be re-added. Set removeOnComplete: { age: 86400 * 3 } to match Stripe's 3-day retry window, or your database idempotency check becomes the real safety net.
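The retention gotcha is easier to see in a toy model. The class below is not BullMQ and does not use its API. `DedupingQueue` and its methods are hypothetical, and time is passed in explicitly (in seconds) so the behavior is deterministic. It only mirrors the rule described above: an ID is rejected as a duplicate while its job is still retained, and accepted again once the completed job has aged out.

```typescript
// Toy model of "dedupe by job ID while the job is still retained".
// This is NOT BullMQ; it only mirrors the retention rule described above.
interface RetainedJob {
  completedAt: number | null; // seconds; null while the job is still pending
}

class DedupingQueue {
  private jobs: Record<string, RetainedJob> = {};
  private keepCompletedSeconds: number;

  constructor(keepCompletedSeconds: number) {
    this.keepCompletedSeconds = keepCompletedSeconds;
  }

  // Returns true if the job was accepted, false if it was a duplicate no-op.
  add(jobId: string, nowSeconds: number): boolean {
    this.evictExpired(nowSeconds);
    if (this.jobs[jobId] !== undefined) return false;
    this.jobs[jobId] = { completedAt: null };
    return true;
  }

  complete(jobId: string, nowSeconds: number): void {
    const job = this.jobs[jobId];
    if (job !== undefined) job.completedAt = nowSeconds;
  }

  // Drop completed jobs that are older than the retention window.
  private evictExpired(nowSeconds: number): void {
    for (const id of Object.keys(this.jobs)) {
      const job = this.jobs[id];
      if (
        job !== undefined &&
        job.completedAt !== null &&
        nowSeconds - job.completedAt > this.keepCompletedSeconds
      ) {
        delete this.jobs[id];
      }
    }
  }
}

const THREE_DAYS_SECONDS = 86400 * 3;
const modelQueue = new DedupingQueue(THREE_DAYS_SECONDS);

const firstDelivery = modelQueue.add('evt_123', 0);       // accepted
const retryWhilePending = modelQueue.add('evt_123', 10);  // duplicate
modelQueue.complete('evt_123', 20);
const retryWithinWindow = modelQueue.add('evt_123', 3600); // still a duplicate
const retryAfterWindow = modelQueue.add('evt_123', 20 + THREE_DAYS_SECONDS + 1); // accepted again

console.log({ firstDelivery, retryWhilePending, retryWithinWindow, retryAfterWindow });
```

The last call is the case to worry about: once the retained job is gone, the queue can no longer tell a redelivery from a new event, so the database check in the worker has to catch it.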
The Worker (Single Writer)
// credit-purchase.worker.ts
import { Worker } from 'bullmq';
import type Stripe from 'stripe';
// `redis`, `logger`, and the Transaction model are assumed to be defined elsewhere.
const worker = new Worker('credit-purchase', async (job) => {
  const event = job.data as Stripe.Event;
  switch (event.type) {
    case 'checkout.session.completed':
      await processCheckoutCompleted(event.data.object as Stripe.Checkout.Session);
      break;
    case 'charge.refunded':
      await processRefund(event.data.object as Stripe.Charge);
      break;
  }
}, { connection: redis, concurrency: 5 });
async function processCheckoutCompleted(session: Stripe.Checkout.Session) {
  // Idempotency check - NOW works because we're the only writer
  const existing = await Transaction.findOne({ stripeSessionId: session.id });
  if (existing) {
    logger.info('Already processed', { sessionId: session.id });
    return;
  }
  await creditUserAccount(session);
  await createTransaction(session);
  await sendConfirmationEmail(session);
}
Single writer = idempotency checks actually work.
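The TL;DR defines "single worker" as one logical writer per Stripe object, enforced via queue partitioning. The sketch below shows one way that routing could look. It is an assumption rather than the code from this project: `StripeLikeEvent`, `partitionFor`, and `routeEvents` are hypothetical names, and the hash is a plain string hash chosen only because it is stable.

```typescript
// Hypothetical sketch: route events so that every event for a given Stripe
// object lands in the same partition, and each partition has one consumer.
interface StripeLikeEvent {
  id: string;       // e.g. "evt_..."
  type: string;
  objectId: string; // ID of the object the event is about, e.g. "cs_..." or "sub_..."
}

// Stable string hash (not cryptographic), kept as an unsigned 32-bit integer.
function partitionFor(objectId: string, partitionCount: number): number {
  let hash = 0;
  for (let i = 0; i < objectId.length; i++) {
    hash = (hash * 31 + objectId.charCodeAt(i)) >>> 0;
  }
  return hash % partitionCount;
}

// Group events by partition while preserving arrival order inside each one.
function routeEvents(
  events: StripeLikeEvent[],
  partitionCount: number,
): StripeLikeEvent[][] {
  const partitions: StripeLikeEvent[][] = [];
  for (let p = 0; p < partitionCount; p++) partitions.push([]);
  for (const event of events) {
    partitions[partitionFor(event.objectId, partitionCount)].push(event);
  }
  return partitions;
}

const routed = routeEvents(
  [
    { id: 'evt_1', type: 'checkout.session.completed', objectId: 'cs_A' },
    { id: 'evt_2', type: 'customer.subscription.created', objectId: 'sub_B' },
    { id: 'evt_3', type: 'checkout.session.completed', objectId: 'cs_A' },
  ],
  4,
);

console.log(routed.map((partition) => partition.map((e) => e.id)));
```

Both `cs_A` events end up in the same partition, in arrival order, so a consumer that handles one job at a time per partition never runs two writes for the same object concurrently. One way to map this onto the setup above would be one queue per partition, each consumed by a worker with concurrency 1. That mapping is an assumption, not something the original code shows.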
Frontend: Read-Only Status Polling
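// Checkout return - no more eager sync
app.get('/checkout/return', (req, res) => {
  // Encode the value so it cannot inject extra query parameters
  const sessionId = String(req.query.session_id ?? '');
  res.redirect(`/dashboard?session_id=${encodeURIComponent(sessionId)}`);
});
// Status endpoint - read only
app.get('/checkout/status', async (req, res) => {
  // Accept only a plain string so a crafted query object never reaches MongoDB
  const sessionId = req.query.session_id;
  if (typeof sessionId !== 'string') {
    return res.status(400).json({ error: 'Invalid session_id' });
  }
  const transaction = await Transaction.findOne({ stripeSessionId: sessionId });
  return res.json({
    status: transaction ? 'completed' : 'pending',
    credits: transaction?.credits
  });
});
// React hook - polls until complete
export function useCheckoutPolling(sessionId: string | null) {
  return useQuery({
    queryKey: ['checkout-status', sessionId],
    queryFn: () => api.get(`/checkout/status?session_id=${sessionId}`),
    enabled: !!sessionId,
    // TanStack Query v4 callback shape; v5 passes the query object instead
    refetchInterval: (data) =>
      data?.status === 'completed' ? false : 2000,
  });
}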
The frontend only polls for status. The worker is the only writer. No race condition.
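The hook above polls every 2 seconds for as long as the status is pending. If a webhook is delayed by minutes, a growing interval with an upper bound may be gentler on the status endpoint. The helper below is a sketch of that idea, not part of the original code: `nextPollDelayMs`, the 1.5x growth factor, the 15-second cap, and the 20-attempt limit are all assumed values.

```typescript
// Hypothetical helper: decide how long to wait before the next status poll.
type CheckoutStatus = 'pending' | 'completed';

const BASE_DELAY_MS = 2000;  // matches the 2-second interval used above
const MAX_DELAY_MS = 15000;  // assumed cap
const MAX_ATTEMPTS = 20;     // assumed limit before showing a fallback message

// Returns the delay in milliseconds, or false to stop polling.
function nextPollDelayMs(
  status: CheckoutStatus | undefined,
  attempt: number,
): number | false {
  if (status === 'completed') return false; // done, stop polling
  if (attempt >= MAX_ATTEMPTS) return false; // give up, let the UI show a fallback
  const delay = BASE_DELAY_MS * Math.pow(1.5, attempt);
  return Math.min(Math.round(delay), MAX_DELAY_MS);
}

console.log([0, 1, 2, 10].map((attempt) => nextPollDelayMs('pending', attempt)));
```

The caller would track the attempt count itself and pass the result to whatever scheduling mechanism it uses. When the function returns false for a pending status, the UI should say that the payment is still being confirmed instead of spinning forever.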
The Results
| Metric | Before | After |
|---|---|---|
| Webhook controller | 3,769 lines | 47 lines |
| Queue routing | N/A | 217 lines |
| Duplicate transactions | Daily | Zero |
| Stripe timeouts | During traffic spikes | None |
| Debugging time | Hours | Minutes (queue inspection) |
| Race conditions | Constant | Eliminated |
What This Doesn't Handle (Honest Assessment)
| Limitation | Mitigation |
|---|---|
| Queue going down (Redis failure) | Return 503 → Stripe retries for up to 3 days |
| Poison messages (always fail) | Dead-letter after 3 attempts + alerting |
| Event ordering | Handlers are idempotent, check current state |
| Worker crashes mid-processing | Job returns to queue, next attempt reprocesses |
| Signature verification failures | Alert on failure rate > threshold (possible replay attack) |
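For the event-ordering row, "check current state" can be made concrete. The sketch below is one possible reading, not the handlers from this project: `SubscriptionRecord` and `applyEvent` are hypothetical, and only `id` and `created` mirror fields that exist on Stripe events. It creates the record on whichever event arrives first, ignores redeliveries, and refuses to let an older event overwrite newer state.

```typescript
// Hypothetical shapes; only `id` and `created` mirror fields on Stripe events.
type SubscriptionStatus = 'incomplete' | 'active' | 'past_due' | 'canceled';

interface SubscriptionEvent {
  id: string;
  created: number; // Unix seconds when Stripe created the event
  status: SubscriptionStatus;
}

interface SubscriptionRecord {
  status: SubscriptionStatus;
  lastEventCreated: number;
  appliedEventIds: string[];
}

function applyEvent(
  record: SubscriptionRecord | undefined,
  event: SubscriptionEvent,
): SubscriptionRecord {
  if (record === undefined) {
    // First event seen for this subscription, whatever its type: create the record.
    return {
      status: event.status,
      lastEventCreated: event.created,
      appliedEventIds: [event.id],
    };
  }
  // Redelivery of an event we already applied: nothing to do.
  if (record.appliedEventIds.indexOf(event.id) !== -1) return record;
  const appliedEventIds = record.appliedEventIds.concat(event.id);
  if (event.created < record.lastEventCreated) {
    // Older than the state we already hold: remember it, keep the newer state.
    return { ...record, appliedEventIds };
  }
  return { status: event.status, lastEventCreated: event.created, appliedEventIds };
}

const olderEvent: SubscriptionEvent = { id: 'evt_1', created: 100, status: 'incomplete' };
const newerEvent: SubscriptionEvent = { id: 'evt_2', created: 200, status: 'active' };

const inOrder = applyEvent(applyEvent(undefined, olderEvent), newerEvent);
const outOfOrder = applyEvent(applyEvent(undefined, newerEvent), olderEvent);

console.log(inOrder.status, outOfOrder.status);
```

Both orders end with the subscription active. Note that `created` has one-second resolution, so two events can tie. A common alternative is to treat the event as a hint and re-fetch the object from the Stripe API inside the handler.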
Key Takeaways
Two writers = race condition. Not redundancy. Coordination nightmare.
Stop doing eager sync. If you don't trust your webhooks, fix your webhooks—don't add another writer.
Webhooks should only enqueue. Verify signature. Queue event. Return 200. That's it. Nothing else.
Idempotency needs single writers. findOne → create isn't atomic.
When in doubt, queue it. Free retries, backpressure, observability.
The Diff
src/features/billing/webhooks/webhooks.controller.ts | -156 lines
src/features/billing/webhooks/webhooks.service.ts | -3,613 lines
src/shared/infrastructure/queue/queue.service.ts | +217 lines
47 files changed, 368 insertions(+), 3793 deletions(-)
The best code is the code you delete.
Irony: 217 lines is approximately the length of a single well-commented function in our old codebase. The entire queue architecture is smaller than the error handling we needed for one edge case.
Your Turn
If your webhook handler has more than 100 lines, you're probably doing too much inline.
Action items:
- Count your webhook handler lines (be honest)
- List every place that writes payment state
- If you have two writers, pick one
The queue-based approach took 2 weeks to implement. The race conditions it replaced took far longer to diagnose. Choose your battles.
Coming Next: The Wallet Race Condition
Fixing webhook duplicates was just the beginning.
We had another bug. A nastier one:
Operation A: Add 100 credits (reads balance: 50, writes: 150)
Operation B: Deduct 30 credits (reads balance: 50, writes: 20)
// Both operations race on the same balance
// Final balance: 20 or 150, depending on who commits last
// Correct answer: 120
Credits are added and deducted concurrently on the same wallet. This is the classic lost update problem, and it happens even with a single-writer queue when different event types modify the same resource.
The naive fixes that failed:
- Mutex locks (deadlocks, performance cliffs)
- Optimistic locking (retry storms under load)
- Read-then-write patterns (the race IS the read-then-write)
The actual fix: Atomic balance operations that never read before writing.
// Wrong - read then write
const balance = await getBalance(userId);
await setBalance(userId, balance + credits);
// Right - atomic increment
await Wallet.updateOne(
  { userId },
  { $inc: { balance: credits } }
);
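The arithmetic from the example above can be checked with a small in-memory model. This is only an illustration: `WalletModel` is a plain object, not the Mongoose model, and `applyDelta` merely stands in for a server-side atomic increment.

```typescript
// In-memory model of the wallet example; `WalletModel` is a plain object,
// not the Mongoose model used in the article.
interface WalletModel {
  balance: number;
}

// Read-then-write, with both reads happening before either write.
function lostUpdate(start: number): number {
  const wallet: WalletModel = { balance: start };
  const seenByCredit = wallet.balance; // credit path reads 50
  const seenByDebit = wallet.balance;  // debit path reads 50
  wallet.balance = seenByCredit + 100; // writes 150
  wallet.balance = seenByDebit - 30;   // overwrites with 20
  return wallet.balance;
}

// Delta-based update: stands in for a server-side atomic increment.
function applyDelta(wallet: WalletModel, delta: number): void {
  wallet.balance += delta;
}

function atomicResult(start: number, deltas: number[]): number {
  const wallet: WalletModel = { balance: start };
  for (const delta of deltas) applyDelta(wallet, delta);
  return wallet.balance;
}

const racyBalance = lostUpdate(50);
const creditFirst = atomicResult(50, [100, -30]);
const debitFirst = atomicResult(50, [-30, 100]);

console.log({ racyBalance, creditFirst, debitFirst });
```

The read-then-write version ends at 20 because the second write is based on a stale read. The delta version ends at 120 in either order, because each update is applied to whatever the balance is at that moment.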
But it gets more complex with validation (can't go negative), multi-currency, and audit trails.
Next post: How to Fix Wallet Race Conditions: Atomic Operations Without Losing Your Audit Trail
How we made wallet operations atomic without sacrificing the ability to validate, audit, and roll back.
Found this useful? Share it with someone fighting webhook race conditions.
Dhruv Khara is a full-stack engineer who built payment infrastructure. He learned that "helpful" code is often the most dangerous kind.