Art light

Posted on Jan 19

Webhooks at Scale: Designing an Idempotent, Replay-Safe, and Observable Webhook System

#programming #webdev #webhook #development

Webhooks look easy until your system processes the same payment 3 times, drops one critical event, and you can’t prove what actually happened.

This article is a production-grade deep dive into building a webhook ingestion system that survives retries, replays, out-of-order delivery, provider bugs, and your own future self.

If you’ve ever thought “we’ll just verify the signature and store the payload” — this post is for you.

Why webhooks are deceptively hard

Most webhook providers promise:

at-least-once delivery
retries on failure
signed payloads

What they don’t promise:

ordering
uniqueness
consistency
sane retry behavior

Reality: webhooks are an unreliable distributed queue that you do not control.

Treat them as such.

Failure modes most teams discover too late

Duplicate events processed twice
Provider retries for hours after success
Events arriving out of order
Partial failures mid-processing
Clock skew breaking signatures
Silent drops with no audit trail

A correct design assumes all of these happen daily.

Architecture overview

Webhook Provider
   │
   │  POST /webhook
   ▼
Ingress Layer (Fast, Stateless)
   │
   │  enqueue
   ▼
Persistent Event Store
   │
   │  dedupe + order
   ▼
Event Processor
   │
   │  side effects
   ▼
Domain Services

Key principle:

Never do business logic in the webhook handler.

Step 1: Fast acknowledgment (or you will get retries)

Webhook endpoints must:

verify signature
persist raw payload
return 2xx

Nothing else.

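app.post('/webhook', async (req, res) => {
  try {
    verifySignature(req)       // assumed to throw on a bad signature
  } catch { return res.status(401).end() }   // fail closed
  try {
    await storeRawEvent(req)   // raw headers + body, untransformed
    res.status(200).end()
  } catch { res.status(500).end() }   // not stored: let the provider retry
})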

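If your endpoint takes more than a second or two, expect retries: many providers treat a slow response as a failure, and their timeouts are often only a few seconds.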

Step 2: Raw event persistence (non-negotiable)

Store exactly what you received:

headers
body
timestamp
provider event ID (if any)

Why?

replay
audits
debugging provider disputes

Never transform at this stage.
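As a rough illustration of what "store exactly what you received" can look like, here is a minimal sketch. The in-memory `rawEvents` array stands in for a durable table, and the record shape (`bodyBase64`, `receivedAt`, and so on) is an assumption for illustration, not a prescribed schema.

```javascript
const crypto = require('node:crypto')

// Hypothetical in-memory stand-in for a durable table of received webhooks.
const rawEvents = []

// Build a (shallow-frozen) record of exactly what arrived. `rawBody` must be
// the unparsed request bytes (a Buffer), not a re-serialized JSON object.
function toRawEventRecord({ provider, headers, rawBody, providerEventId = null }) {
  return Object.freeze({
    id: crypto.randomUUID(),                 // our own internal ID
    provider,
    providerEventId,                         // may be null
    receivedAt: new Date().toISOString(),    // our clock, not the provider's
    headers: { ...headers },                 // copied as received
    bodyBase64: rawBody.toString('base64'),  // byte-exact, safe to replay
  })
}

function storeRawEvent(input) {
  const record = toRawEventRecord(input)
  rawEvents.push(record)
  return record
}

// Recover the exact bytes later for replay or signature re-verification.
function rawBodyOf(record) {
  return Buffer.from(record.bodyBase64, 'base64')
}
```

Storing the body as bytes (here base64-encoded) rather than as parsed JSON keeps whitespace and key order intact, which matters if the signature ever has to be re-checked during an audit.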

Step 3: Idempotency is not optional

If your system is not idempotent, retries are data corruption.

The wrong approach

“We’ll check if status already changed” ❌
“We’ll trust provider event IDs” ❌

The correct approach

Create your own idempotency key:

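// Serialize the parts unambiguously so ('ab', 'c') and ('a', 'bc') cannot collide.
const key = hash(JSON.stringify([provider, eventType, externalObjectId]))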

Persist it with a unique constraint.

If insert fails → duplicate → skip safely.
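The insert-or-skip idea can be sketched as follows. The `Set` is a stand-in for a table with a unique constraint, and `claim` and `handleOnce` are hypothetical helper names. One assumption beyond the article: if the same event type can legitimately occur more than once for one object, the key would also need a component that distinguishes those occurrences.

```javascript
const crypto = require('node:crypto')

// Hypothetical stand-in for a table with a UNIQUE constraint on `key`.
const processedKeys = new Set()

// JSON.stringify on an array keeps the parts unambiguous:
// ('ab', 'c') and ('a', 'bc') produce different inputs to the hash.
function idempotencyKey(provider, eventType, externalObjectId) {
  return crypto
    .createHash('sha256')
    .update(JSON.stringify([provider, eventType, externalObjectId]))
    .digest('hex')
}

// Returns true if this call claimed the key, false if it was already taken.
// In a real database this would be one INSERT that either succeeds or fails
// with a unique-violation error, so two concurrent workers cannot both win.
function claim(key) {
  if (processedKeys.has(key)) return false
  processedKeys.add(key)
  return true
}

function handleOnce(event, process) {
  const key = idempotencyKey(event.provider, event.type, event.objectId)
  if (!claim(key)) return 'skipped-duplicate'
  process(event)
  return 'processed'
}
```

In a real system the key insert and the domain change would normally share one database transaction. Otherwise a crash after the claim but before processing leaves a key that blocks every retry.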

Step 4: Ordering without pretending you control time

Providers do not guarantee ordering.

Never assume:

event A arrives before event B
timestamps are monotonic

Strategy

Model events as state transitions
Reject invalid transitions

if (!isValidTransition(currentState, nextEvent)) {
  logAndIgnore()
}

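This makes duplicate and stale events harmless. An event that is merely early (it arrives before the transition it depends on) should be parked and retried later rather than dropped, or it is lost for good.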
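A minimal sketch of the transition check, using a hypothetical payment lifecycle. The states, the event names, and the `applyEvent` helper are illustrative assumptions, not taken from any particular provider.

```javascript
// Hypothetical payment lifecycle: state -> { eventType -> nextState }.
const TRANSITIONS = {
  pending:  { 'payment.succeeded': 'paid', 'payment.failed': 'failed' },
  paid:     { 'payment.refunded': 'refunded' },
  failed:   {},
  refunded: {},
}

const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key)

function isValidTransition(currentState, eventType) {
  return has(TRANSITIONS, currentState) && has(TRANSITIONS[currentState], eventType)
}

// Returns the resulting state and whether the event was applied.
// An event that does not fit the current state leaves the state unchanged.
function applyEvent(currentState, eventType) {
  if (!isValidTransition(currentState, eventType)) {
    return { state: currentState, applied: false }
  }
  return { state: TRANSITIONS[currentState][eventType], applied: true }
}
```

With this table, a late duplicate `payment.succeeded` on an already `paid` payment is a no-op, and a `payment.refunded` that shows up while the payment is still `pending` is not applied. Whether a rejected event is dropped or parked for a later retry is a policy decision this sketch leaves out.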

Step 5: Exactly-once side effects (the hard part)

Databases are transactional. External APIs are not.

Pattern: transactional outbox

Write domain change + outbox record in same transaction
Commit
Async worker executes side effects
Mark outbox as done

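Provided each side effect carries its own idempotency key (the worker can crash after executing a side effect but before marking the row done), this prevents: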

double emails
double charges
partial failures
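The four outbox steps can be sketched in memory as below. The `db` object, the `transaction` helper, and the row shape are hypothetical stand-ins. In a real system both "tables" live in one database and `transaction` is that database's own transaction.

```javascript
// In-memory stand-ins for two tables that live in the SAME database.
const db = { orders: new Map(), outbox: [] }

// Simulated transaction: stage writes on a copy, swap in only if fn succeeds.
function transaction(fn) {
  const staged = {
    orders: new Map(db.orders),
    outbox: db.outbox.map(row => ({ ...row })),
  }
  fn(staged) // if this throws, the lines below never run and db is untouched
  db.orders = staged.orders
  db.outbox = staged.outbox
}

// Steps 1-2: domain change and outbox record commit together or not at all.
function markOrderPaid(orderId) {
  transaction(tx => {
    tx.orders.set(orderId, { status: 'paid' })
    tx.outbox.push({
      id: `receipt-email:${orderId}`, // doubles as the side effect's idempotency key
      type: 'send-receipt-email',
      payload: { orderId },
      status: 'pending',
    })
  })
}

// Steps 3-4: a worker drains pending rows. A crash between `send` and the
// status update means the row is retried, so `send` must tolerate repeats;
// passing row.id lets the receiver deduplicate.
async function drainOutbox(send) {
  for (const row of db.outbox) {
    if (row.status !== 'pending') continue
    await send(row.type, row.payload, row.id)
    row.status = 'done'
  }
}
```

On its own the outbox gives at-least-once execution of side effects. The idempotency key passed to `send` is what lets the receiving side turn that into an effectively-once outcome.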

Step 6: Signature verification pitfalls

Common mistakes:

parsing JSON before verification
ignoring header casing
using system clock blindly

Always:

verify against raw body
allow small clock skew
fail closed

If verification fails → reject the request. Do not store it, process it, or retry it internally.
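To make the "always" list concrete, here is a sketch of verification against the raw body with a bounded clock-skew window. The signing scheme (HMAC-SHA256 over `timestamp.rawBody`, hex-encoded) and the 300-second tolerance are assumptions for illustration. Real providers each define their own format, so check their documentation.

```javascript
const crypto = require('node:crypto')

const TOLERANCE_SECONDS = 300 // assumed skew window; tune per provider

// Assumed scheme: hex(HMAC-SHA256(secret, `${timestamp}.` + rawBody)).
function sign(secret, timestamp, rawBody) {
  return crypto
    .createHmac('sha256', secret)
    .update(`${timestamp}.`)
    .update(rawBody) // raw bytes, never re-serialized JSON
    .digest('hex')
}

// Returns true only if every check passes; any doubt means false (fail closed).
function verifySignature({
  secret,
  timestamp,
  signatureHex,
  rawBody,
  nowSeconds = Math.floor(Date.now() / 1000),
}) {
  const ts = Number(timestamp)
  if (!Number.isInteger(ts)) return false
  // Accept small clock skew in either direction, reject anything older or newer.
  if (Math.abs(nowSeconds - ts) > TOLERANCE_SECONDS) return false

  const expected = Buffer.from(sign(secret, String(timestamp), rawBody), 'hex')
  const received = Buffer.from(String(signatureHex), 'hex')
  // timingSafeEqual throws on length mismatch, so check length first.
  if (received.length !== expected.length) return false
  return crypto.timingSafeEqual(received, expected)
}
```

The constant-time comparison avoids leaking how many leading bytes of a forged signature were correct. HTTP header names are case-insensitive, and Node lowercases them in `req.headers`, so read the signature header by its lowercase name.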

Step 7: Observability or it didn’t happen

You need to answer:

did we receive it?
did we process it?
what did it change?

Minimum requirements:

event ID traceable across logs
processing status persisted
dead-letter queue for failures

If you can’t answer these in <5 minutes, your system is blind.
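A minimal sketch of the three minimum requirements: every log line carries the event ID, status is persisted per event, and repeated failures land in a dead-letter list. The in-memory `Map` and array, the status names, and the retry budget of 3 are all assumptions for illustration.

```javascript
const MAX_ATTEMPTS = 3 // assumed retry budget

// Hypothetical stand-ins for a status table and a dead-letter queue.
const statusByEventId = new Map()
const deadLetters = []

// One JSON line per step, always carrying the event ID for correlation.
function log(eventId, message, extra = {}) {
  console.log(JSON.stringify({ eventId, message, ...extra }))
}

// Answers "did we receive it?"
function recordReceived(eventId) {
  statusByEventId.set(eventId, {
    status: 'received', attempts: 0, lastError: null, changes: null,
  })
  log(eventId, 'received')
}

// Runs `handler` once and persists the outcome. Answers "did we process it?"
// and, via the handler's return value, "what did it change?"
function processEvent(eventId, handler) {
  const entry = statusByEventId.get(eventId)
  if (!entry) throw new Error(`unknown event ${eventId}`)
  entry.attempts += 1
  try {
    entry.changes = handler()
    entry.status = 'processed'
    entry.lastError = null
  } catch (err) {
    entry.lastError = err.message
    entry.status = entry.attempts >= MAX_ATTEMPTS ? 'dead-lettered' : 'failed'
    if (entry.status === 'dead-lettered') deadLetters.push(eventId)
  }
  log(eventId, entry.status, { attempts: entry.attempts })
  return entry.status
}
```

With this in place, looking up one event ID in `statusByEventId` answers all three questions, and `deadLetters` is the list of events that need a human.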

Production checklist

Miss one and you’ll eventually ship a bug you can’t undo.

Final thoughts

Webhooks are not callbacks. They are untrusted, replayable messages.

Once you treat them that way, they become boring.

And boring infrastructure is the goal.

Top comments (6)

PEACEBINFLOW • Jan 20

Yeah this is the kind of post people think they understand until Stripe (or whoever) politely ruins their weekend.

The biggest W here is how hard you push the “webhook = untrusted distributed queue you don’t control” framing. That mindset alone saves teams from the classic “just verify signature and update DB” speedrun into duplicate charges + angry finance emails.

Also love the clean separation:

ingress = verify + persist + 2xx
everything else = async, replayable, observable

That “never do business logic in the handler” rule should be tattooed on half the internet.

Idempotency section is 🔥 too — especially calling out “trust provider event IDs” as a trap. Providers try to be consistent… until they aren’t. Owning your own idempotency key + unique constraint is the grown-up move.

And thank you for saying the quiet part loud: ordering is a lie. State transitions + rejecting invalid transitions is the only sane way to survive out-of-order delivery without building a fake time machine.

Transactional outbox mention is chef’s kiss. That pattern is the difference between “we’re reliable” and “we occasionally send 3 emails and pretend it’s fine.”

Only thing I’d add (minor) is maybe a quick “how to replay safely” section: like a /replay/:event_id internal endpoint or a “reprocessor” worker that can re-run from raw payloads with guardrails. But honestly the foundations you laid already make replay basically inevitable in a good way.

Boring infrastructure is the goal, and this is exactly how you earn boring.

Art light • Jan 20

This is an excellent breakdown — I really like how you frame webhooks as an untrusted, distributed system, because that mental model alone prevents so many real-world failures. The clear separation between ingress and async processing feels like the right long-term solution, and the emphasis on owning idempotency and state transitions shows a very mature, production-tested perspective. I’m especially interested in how teams could extend this with safe replay tooling, but even as-is, this post sets a rock-solid foundation for building truly boring (in the best way) infrastructure.

Travis Cole • Jan 25

"And boring infrastructure is the goal" <<<<

Art light • Jan 25

😎
