Webhooks look easy until your system processes the same payment 3 times, drops one critical event, and you can’t prove what actually happened.
This article is a production-grade deep dive into building a webhook ingestion system that survives retries, replays, out-of-order delivery, provider bugs, and your own future self.
If you’ve ever thought “we’ll just verify the signature and store the payload” — this post is for you.
Why webhooks are deceptively hard
Most webhook providers promise:
- at-least-once delivery
- retries on failure
- signed payloads
What they don’t promise:
- ordering
- uniqueness
- consistency
- sane retry behavior
Reality: webhooks are an unreliable distributed queue that you do not control.
Treat them as such.
Failure modes most teams discover too late
- Duplicate events processed twice
- Provider retries for hours after success
- Events arriving out of order
- Partial failures mid-processing
- Clock skew breaking signatures
- Silent drops with no audit trail
A correct design assumes all of these happen daily.
Architecture overview
Webhook Provider
│
│ POST /webhook
▼
Ingress Layer (Fast, Stateless)
│
│ enqueue
▼
Persistent Event Store
│
│ dedupe + order
▼
Event Processor
│
│ side effects
▼
Domain Services
Key principle:
Never do business logic in the webhook handler.
Step 1: Fast acknowledgment (or you will get retries)
Webhook endpoints must:
- verify signature
- persist raw payload
- return 2xx
Nothing else.
app.post('/webhook', async (req, res) => {
  if (!verifySignature(req)) {
    return res.status(400).end() // fail closed; no details leaked to the caller
  }
  await storeRawEvent(req)       // persist headers + raw body, untouched
  res.status(200).end()          // ack fast; all real work happens async
})
If your endpoint takes >1–2 seconds, retries are guaranteed.
Step 2: Raw event persistence (non-negotiable)
Store exactly what you received:
- headers
- body
- timestamp
- provider event ID (if any)
Why?
- replay
- audits
- debugging provider disputes
Never transform at this stage.
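A minimal sketch of what storeRawEvent might look like, assuming node-postgres (db is a pool), a hypothetical raw_events table, and a req.rawBody captured before any JSON parsing. Table, column, and header names are illustrative:

// Hypothetical schema: raw_events(id, provider, provider_event_id, headers, body, received_at)
async function storeRawEvent(req) {
  await db.query(
    `INSERT INTO raw_events (provider, provider_event_id, headers, body, received_at)
     VALUES ($1, $2, $3, $4, now())`,
    [
      'example-provider',                // whichever provider hit this endpoint
      req.headers['x-event-id'] ?? null, // provider event ID, if any
      JSON.stringify(req.headers),       // headers, verbatim
      req.rawBody,                       // body exactly as received, no parsing
    ]
  )
}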
Step 3: Idempotency is not optional
If your system is not idempotent, retries are data corruption.
The wrong approach
- “We’ll check if status already changed” ❌
- “We’ll trust provider event IDs” ❌
The correct approach
Create your own idempotency key:
const key = hash(provider + eventType + externalObjectId)
Persist it with a unique constraint.
If insert fails → duplicate → skip safely.
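One way to enforce that, sketched with node-postgres against a hypothetical processed_events table with a unique constraint on idempotency_key (23505 is Postgres's unique-violation code):

async function claimEvent(key) {
  try {
    // The unique constraint on processed_events.idempotency_key does the real work
    await db.query(
      'INSERT INTO processed_events (idempotency_key) VALUES ($1)',
      [key]
    )
    return true // first time we see this event: process it
  } catch (err) {
    if (err.code === '23505') return false // unique violation: duplicate, skip safely
    throw err                              // anything else is a real failure
  }
}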
Step 4: Ordering without pretending you control time
Providers do not guarantee ordering.
Never assume:
- event A arrives before event B
- timestamps are monotonic
Strategy
- Model events as state transitions
- Reject invalid transitions
if (!isValidTransition(currentState, nextEvent)) {
logAndIgnore()
}
This makes ordering irrelevant.
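A sketch of what isValidTransition might look like, modelling the lifecycle as an explicit map. The event types and states here are illustrative, not taken from any specific provider:

// Each event type maps to the state it tries to move the object into
const EVENT_TARGET = {
  'payment.authorized': 'authorized',
  'payment.captured':   'captured',
  'payment.refunded':   'refunded',
}

// Allowed transitions; anything not listed is rejected and logged
const ALLOWED = {
  created:    ['authorized'],
  authorized: ['captured'],
  captured:   ['refunded'],
}

function isValidTransition(currentState, nextEvent) {
  const target = EVENT_TARGET[nextEvent.type]
  return (ALLOWED[currentState] ?? []).includes(target)
}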
Step 5: Exactly-once side effects (the hard part)
Databases are transactional. External APIs are not.
Pattern: transactional outbox
- Write domain change + outbox record in same transaction
- Commit
- Async worker executes side effects
- Mark outbox as done
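A minimal sketch of the write side with node-postgres, using hypothetical orders and outbox tables:

async function handleOrderPaid(event) {
  const client = await db.connect()
  try {
    await client.query('BEGIN')
    // Domain change and outbox record commit (or roll back) together
    await client.query(
      "UPDATE orders SET status = 'paid' WHERE id = $1", [event.orderId]
    )
    await client.query(
      'INSERT INTO outbox (type, payload) VALUES ($1, $2)',
      ['send_receipt_email', JSON.stringify(event)]
    )
    await client.query('COMMIT')
  } catch (err) {
    await client.query('ROLLBACK')
    throw err
  } finally {
    client.release()
  }
}
// A separate worker polls outbox, performs the side effect, then marks the row done.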
This prevents:
- double emails
- double charges
- partial failures
Step 6: Signature verification pitfalls
Common mistakes:
- parsing JSON before verification
- ignoring header casing
- using system clock blindly
Always:
- verify against raw body
- allow small clock skew
- fail closed
If verification fails → do not retry internally.
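For reference, a sketch of raw-body verification with skew tolerance. It assumes a Stripe-style scheme (timestamp plus HMAC-SHA256 hex digest); the header names, secret source, and rawBody capture are all assumptions:

const crypto = require('crypto')

const TOLERANCE_SECONDS = 300 // accept up to ~5 minutes of clock skew

function verifySignature(req) {
  const timestamp = Number(req.headers['x-webhook-timestamp'])
  const signature = req.headers['x-webhook-signature'] ?? ''

  // Reject missing, stale, or future-dated timestamps, but with tolerance
  if (!timestamp || Math.abs(Date.now() / 1000 - timestamp) > TOLERANCE_SECONDS) {
    return false
  }

  // Sign the raw body, never the parsed (and re-serialized) JSON
  const expected = crypto
    .createHmac('sha256', process.env.WEBHOOK_SECRET)
    .update(`${timestamp}.${req.rawBody}`)
    .digest('hex')

  // Constant-time comparison; fail closed on any mismatch
  const a = Buffer.from(signature)
  const b = Buffer.from(expected)
  return a.length === b.length && crypto.timingSafeEqual(a, b)
}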
Step 7: Observability or it didn’t happen
You need to answer:
- did we receive it?
- did we process it?
- what did it change?
Minimum requirements:
- event ID traceable across logs
- processing status persisted
- dead-letter queue for failures
If you can’t answer these in <5 minutes, your system is blind.
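One lightweight way to make those questions answerable, assuming a status column on the stored events and a hypothetical dead_letters table:

// Status lifecycle: received -> processed | failed (failed rows land in the DLQ)
async function markProcessed(eventId) {
  await db.query(
    "UPDATE raw_events SET status = 'processed', processed_at = now() WHERE id = $1",
    [eventId]
  )
}

async function deadLetter(eventId, err) {
  await db.query(
    'INSERT INTO dead_letters (event_id, error, failed_at) VALUES ($1, $2, now())',
    [eventId, String(err)]
  )
}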
Production checklist
- Endpoint acks in under a second and does nothing else
- Raw payloads persisted verbatim (headers, body, timestamp, provider event ID)
- Idempotency keys you own, enforced by a unique constraint
- Events modelled as state transitions; invalid transitions rejected
- Side effects go through a transactional outbox
- Signatures verified against the raw body, with clock-skew tolerance, failing closed
- Every event traceable end to end, with a dead-letter queue for failures
Miss one, and you'll eventually ship a bug you can't undo.
Final thoughts
Webhooks are not callbacks. They are untrusted, replayable messages.
Once you treat them that way, they become boring.
And boring infrastructure is the goal.
Top comments (2)
Yeah this is the kind of post people think they understand until Stripe (or whoever) politely ruins their weekend.
The biggest W here is how hard you push the “webhook = untrusted distributed queue you don’t control” framing. That mindset alone saves teams from the classic “just verify signature and update DB” speedrun into duplicate charges + angry finance emails.
Also love the clean separation:
ingress = verify + persist + 2xx
everything else = async, replayable, observable
That “never do business logic in the handler” rule should be tattooed on half the internet.
Idempotency section is 🔥 too — especially calling out “trust provider event IDs” as a trap. Providers try to be consistent… until they aren’t. Owning your own idempotency key + unique constraint is the grown-up move.
And thank you for saying the quiet part loud: ordering is a lie. State transitions + rejecting invalid transitions is the only sane way to survive out-of-order delivery without building a fake time machine.
Transactional outbox mention is chef’s kiss. That pattern is the difference between “we’re reliable” and “we occasionally send 3 emails and pretend it’s fine.”
Only thing I’d add (minor) is maybe a quick “how to replay safely” section: like a /replay/:event_id internal endpoint or a “reprocessor” worker that can re-run from raw payloads with guardrails. But honestly the foundations you laid already make replay basically inevitable in a good way.
Boring infrastructure is the goal, and this is exactly how you earn boring.
This is an excellent breakdown — I really like how you frame webhooks as an untrusted, distributed system, because that mental model alone prevents so many real-world failures. The clear separation between ingress and async processing feels like the right long-term solution, and the emphasis on owning idempotency and state transitions shows a very mature, production-tested perspective. I’m especially interested in how teams could extend this with safe replay tooling, but even as-is, this post sets a rock-solid foundation for building truly boring (in the best way) infrastructure.