Art light

Webhooks at Scale: Designing an Idempotent, Replay-Safe, and Observable Webhook System

Webhooks look easy until your system processes the same payment 3 times, drops one critical event, and you can’t prove what actually happened.

This article is a production-grade deep dive into building a webhook ingestion system that survives retries, replays, out-of-order delivery, provider bugs, and your own future self.

If you’ve ever thought “we’ll just verify the signature and store the payload” — this post is for you.

Why webhooks are deceptively hard

Most webhook providers promise:

  • at-least-once delivery
  • retries on failure
  • signed payloads

What they don’t promise:

  • ordering
  • uniqueness
  • consistency
  • sane retry behavior

Reality: webhooks are an unreliable distributed queue that you do not control.

Treat them as such.

Failure modes most teams discover too late

  • Duplicate events processed twice
  • Provider retries for hours after success
  • Events arriving out of order
  • Partial failures mid-processing
  • Clock skew breaking signatures
  • Silent drops with no audit trail

A correct design assumes all of these happen daily.

Architecture overview

Webhook Provider
   │
   │  POST /webhook
   ▼
Ingress Layer (Fast, Stateless)
   │
   │  enqueue
   ▼
Persistent Event Store
   │
   │  dedupe + order
   ▼
Event Processor
   │
   │  side effects
   ▼
Domain Services

Key principle:

Never do business logic in the webhook handler.

Step 1: Fast acknowledgment (or you will get retries)

Webhook endpoints must:

  • verify signature
  • persist raw payload
  • return 2xx

Nothing else.

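// express.raw keeps req.body as a Buffer, so the signature is checked on exact bytes.
app.post('/webhook', express.raw({ type: '*/*' }), async (req, res) => {
  try {
    verifySignature(req)           // assumed to throw on a bad signature
  } catch {
    return res.status(401).end()   // fail closed, store nothing
  }
  // If the write fails, answer 5xx so the provider retries the delivery.
  const stored = await storeRawEvent(req).then(() => true, () => false)
  res.status(stored ? 200 : 500).end()
})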

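Providers enforce a response timeout, often only a few seconds. If your endpoint regularly takes more than 1–2 seconds, expect timeouts under load, and every timeout becomes a retry.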

Step 2: Raw event persistence (non-negotiable)

Store exactly what you received:

  • headers
  • body
  • timestamp
  • provider event ID (if any)

Why?

  • replay
  • audits
  • debugging provider disputes

Never transform at this stage.
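Here is a minimal sketch of what such a raw record could look like. It assumes an in-memory array in place of a durable table, and the function and field names (`storeRawEvent`, `receivedAt`, `status`) are illustrative, not a prescribed schema.

```javascript
const crypto = require('node:crypto')

// Stand-in for an append-only table; a real system would use a durable store.
const rawEvents = []

function storeRawEvent({ provider, headers, rawBody, providerEventId = null }) {
  const record = {
    id: crypto.randomUUID(),                        // our own ID, used for tracing
    provider,
    providerEventId,                                // may be null: not every provider sends one
    headers: { ...headers },                        // copied as received
    body: Buffer.from(rawBody).toString('base64'),  // exact bytes, no JSON parsing
    receivedAt: new Date().toISOString(),           // our receive time, not the provider's
    status: 'received',
  }
  rawEvents.push(Object.freeze(record))
  return record
}

const saved = storeRawEvent({
  provider: 'example-provider',
  headers: { 'x-signature': 'abc123' },
  rawBody: Buffer.from('{"type":"payment.succeeded"}'),
})
```

Storing the body as opaque bytes (base64 here) means a later replay can re-run signature verification and parsing against exactly what arrived.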

Step 3: Idempotency is not optional

If your system is not idempotent, retries are data corruption.

The wrong approach

  • “We’ll check if status already changed” ❌
  • “We’ll trust provider event IDs” ❌

The correct approach

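Create your own idempotency key from the fields that identify the business fact, not the delivery:

// Encode the parts as a JSON array so field boundaries stay unambiguous.
const key = hash(JSON.stringify([provider, eventType, externalObjectId]))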

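Persist it with a unique constraint. If the same event type can legitimately occur more than once for one object (repeated updates, for example), add a distinguishing field such as the object's version, or distinct events will be dropped as duplicates.

If the insert fails with a unique-constraint violation → duplicate → skip safely. Any other failure is a real error and should be retried.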
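The sketch below shows the key-plus-unique-constraint idea end to end. It is a rough illustration: a `Set` stands in for the unique index, and `idempotencyKey`, `claim`, and `handleOnce` are hypothetical names. In a real database the check and the insert must be one atomic `INSERT` guarded by the constraint, ideally in the same transaction as the domain change.

```javascript
const crypto = require('node:crypto')

// Stand-in for a table with a UNIQUE constraint on the key column.
const processedKeys = new Set()

function idempotencyKey(provider, eventType, externalObjectId) {
  // JSON-encode the parts so ('ab', 'c') and ('a', 'bc') produce different keys.
  const material = JSON.stringify([provider, eventType, externalObjectId])
  return crypto.createHash('sha256').update(material).digest('hex')
}

// Returns true if this call claimed the key, false if it was already claimed.
function claim(key) {
  if (processedKeys.has(key)) return false // what a unique violation would tell us
  processedKeys.add(key)
  return true
}

function handleOnce(event, applyEvent) {
  const key = idempotencyKey(event.provider, event.type, event.objectId)
  if (!claim(key)) return 'duplicate'
  applyEvent(event)
  return 'processed'
}
```

Calling `handleOnce` twice with the same event runs `applyEvent` once and reports `'duplicate'` the second time.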

Step 4: Ordering without pretending you control time

Providers do not guarantee ordering.

Never assume:

  • event A arrives before event B
  • timestamps are monotonic

Strategy

  • Model events as state transitions
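  • Reject invalid transitions

if (!isValidTransition(currentState, nextEvent)) {
  // Stale or premature event: record it, leave state untouched.
  logAndIgnore()
}

This takes arrival order out of the correctness equation: a late or duplicated event can never move state backwards. One caveat: an event that arrives too early is rejected as well, so keep it in the raw event store and retry it once its predecessor has been applied instead of discarding it.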
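One way to express the transition check is a plain lookup table. The payment lifecycle below (states and event names) is made up for illustration, and `applyEvent` is a hypothetical helper, so treat this as a sketch rather than a reference model.

```javascript
// Hypothetical payment lifecycle: state -> { eventType: nextState }.
const TRANSITIONS = {
  pending: { 'payment.succeeded': 'paid', 'payment.failed': 'failed' },
  paid: { 'payment.refunded': 'refunded' },
  failed: {},
  refunded: {},
}

function isValidTransition(currentState, eventType) {
  const allowed = TRANSITIONS[currentState] || {}
  return Object.prototype.hasOwnProperty.call(allowed, eventType)
}

// Returns the resulting state and whether the event was applied.
function applyEvent(currentState, eventType) {
  if (!isValidTransition(currentState, eventType)) {
    return { state: currentState, applied: false } // caller logs and parks the event
  }
  return { state: TRANSITIONS[currentState][eventType], applied: true }
}
```

If `payment.refunded` shows up while the payment is still `pending`, it is not applied. Once `payment.succeeded` has moved the state to `paid`, retrying the parked refund goes through.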

Step 5: Exactly-once side effects (the hard part)

Databases are transactional. External APIs are not.

Pattern: transactional outbox

  1. Write domain change + outbox record in same transaction
  2. Commit
  3. Async worker executes side effects
  4. Mark outbox as done

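The worker can crash between steps 3 and 4 and run a side effect again, so the outbox on its own gives you at-least-once execution. Combined with an idempotency key on each external call, this prevents: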

  • double emails
  • double charges
  • partial failures
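The four steps can be sketched with in-memory stand-ins. Everything here is an assumption made for illustration: the `db` object plays the role of two tables in one database, and `markOrderPaid` and `runOutbox` are hypothetical names. A real implementation would wrap the two writes in a single database transaction.

```javascript
// In-memory stand-ins for two tables that live in the same database.
const db = { orders: new Map(), outbox: [] }

// Steps 1-2: write the domain change and the outbox row together.
// With a real database, both writes belong to one transaction.
function markOrderPaid(orderId) {
  db.orders.set(orderId, 'paid')
  db.outbox.push({
    id: `send-receipt:${orderId}`, // doubles as the idempotency key downstream
    type: 'send-receipt',
    orderId,
    done: false,
  })
}

// Steps 3-4: the worker runs pending side effects, then marks them done.
// `send` must tolerate being called twice with the same id, because a crash
// between the call and the `done = true` write causes a re-run.
function runOutbox(send) {
  for (const row of db.outbox) {
    if (row.done) continue
    send(row)
    row.done = true
  }
}
```

Running the worker twice sends each row once. If `send` throws, the row stays pending and is picked up on the next run.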

Step 6: Signature verification pitfalls

Common mistakes:

  • parsing JSON before verification
  • treating header names as case-sensitive
  • comparing timestamps against the system clock with no tolerance

Always:

  • verify against raw body
  • allow small clock skew
  • fail closed

If verification fails → reject the request with a 4xx and stop. Do not retry internally.
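As a rough sketch of those rules, here is an HMAC-SHA256 check over `timestamp.rawBody`. The signing scheme, the 300-second tolerance, and the names `sign` and `verifyHmacSignature` are assumptions for illustration. Every provider defines its own header format and signed string, so check their documentation before copying this.

```javascript
const crypto = require('node:crypto')

const TOLERANCE_SECONDS = 300 // assumed skew allowance; tune per provider

// Assumed scheme: hex HMAC-SHA256 over "<timestamp>.<raw body bytes>".
function sign(secret, timestamp, rawBody) {
  return crypto
    .createHmac('sha256', secret)
    .update(`${timestamp}.`)
    .update(rawBody)
    .digest('hex')
}

function verifyHmacSignature({ secret, rawBody, timestamp, signature, now = Date.now() }) {
  const ts = Number(timestamp)
  if (!Number.isFinite(ts)) return false
  // Allow a small clock skew in either direction, reject anything older.
  if (Math.abs(now / 1000 - ts) > TOLERANCE_SECONDS) return false

  const expected = Buffer.from(sign(secret, timestamp, rawBody), 'hex')
  const received = Buffer.from(String(signature), 'hex')
  // timingSafeEqual throws on length mismatch, so check length first.
  if (received.length !== expected.length) return false
  return crypto.timingSafeEqual(received, expected)
}
```

Every failure path returns `false`, so the caller fails closed. The comparison runs on the raw bytes, before any JSON parsing.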

Step 7: Observability or it didn’t happen

You need to answer:

  • did we receive it?
  • did we process it?
  • what did it change?

Minimum requirements:

  • event ID traceable across logs
  • processing status persisted
  • dead-letter queue for failures

If you can’t answer these in <5 minutes, your system is blind.
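The minimum requirements could look something like this. It is a sketch under assumptions: a `Map` stands in for a status column on the event row, `MAX_ATTEMPTS` is an arbitrary retry budget, and `processWithTracking` is a hypothetical wrapper around whatever handler does the real work.

```javascript
const MAX_ATTEMPTS = 3 // assumed retry budget

const statusByEventId = new Map() // stand-in for a status column on the event row
const deadLetters = []            // stand-in for a dead-letter queue

function log(eventId, message, extra = {}) {
  // One JSON line per step, always carrying the event ID.
  console.log(JSON.stringify({ eventId, message, ...extra }))
}

function processWithTracking(event, handler) {
  const entry = statusByEventId.get(event.id) ?? { status: 'received', attempts: 0 }
  entry.attempts += 1
  try {
    entry.change = handler(event) // the handler reports what it changed
    entry.status = 'processed'
    log(event.id, 'processed', { change: entry.change })
  } catch (err) {
    entry.lastError = err.message
    if (entry.attempts >= MAX_ATTEMPTS) {
      entry.status = 'dead-lettered'
      deadLetters.push({ eventId: event.id, error: err.message })
    } else {
      entry.status = 'failed'
    }
    log(event.id, entry.status, { error: err.message, attempts: entry.attempts })
  }
  statusByEventId.set(event.id, entry)
  return entry.status
}
```

With this in place, "did we receive it, did we process it, what did it change" becomes a single lookup by event ID.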

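Production checklist

  • Acknowledge fast: verify, persist, return 2xx
  • Store the raw event (headers, body, timestamp)
  • Enforce idempotency with a unique constraint
  • Model events as state transitions
  • Route side effects through a transactional outbox
  • Verify signatures against the raw body
  • Persist processing status and keep a dead-letter queue

Miss one, and you’ll eventually ship a bug you can’t undo.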

Final thoughts

Webhooks are not callbacks. They are untrusted, replayable messages.

Once you treat them that way, they become boring.

And boring infrastructure is the goal.

Top comments (2)

PEACEBINFLOW

Yeah this is the kind of post people think they understand until Stripe (or whoever) politely ruins their weekend.

The biggest W here is how hard you push the “webhook = untrusted distributed queue you don’t control” framing. That mindset alone saves teams from the classic “just verify signature and update DB” speedrun into duplicate charges + angry finance emails.

Also love the clean separation:

ingress = verify + persist + 2xx
everything else = async, replayable, observable

That “never do business logic in the handler” rule should be tattooed on half the internet.

Idempotency section is 🔥 too — especially calling out “trust provider event IDs” as a trap. Providers try to be consistent… until they aren’t. Owning your own idempotency key + unique constraint is the grown-up move.

And thank you for saying the quiet part loud: ordering is a lie. State transitions + rejecting invalid transitions is the only sane way to survive out-of-order delivery without building a fake time machine.

Transactional outbox mention is chef’s kiss. That pattern is the difference between “we’re reliable” and “we occasionally send 3 emails and pretend it’s fine.”

Only thing I’d add (minor) is maybe a quick “how to replay safely” section: like a /replay/:event_id internal endpoint or a “reprocessor” worker that can re-run from raw payloads with guardrails. But honestly the foundations you laid already make replay basically inevitable in a good way.

Boring infrastructure is the goal, and this is exactly how you earn boring.

Art light

This is an excellent breakdown — I really like how you frame webhooks as an untrusted, distributed system, because that mental model alone prevents so many real-world failures. The clear separation between ingress and async processing feels like the right long-term solution, and the emphasis on owning idempotency and state transitions shows a very mature, production-tested perspective. I’m especially interested in how teams could extend this with safe replay tooling, but even as-is, this post sets a rock-solid foundation for building truly boring (in the best way) infrastructure.