Webhooks look easy until your system processes the same payment 3 times, drops one critical event, and you can’t prove what actually happened.
Yeah this is the kind of post people think they understand until Stripe (or whoever) politely ruins their weekend.
The biggest W here is how hard you push the “webhook = untrusted distributed queue you don’t control” framing. That mindset alone saves teams from the classic “just verify signature and update DB” speedrun into duplicate charges + angry finance emails.
Also love the clean separation:
ingress = verify + persist + 2xx
everything else = async, replayable, observable
That “never do business logic in the handler” rule should be tattooed on half the internet.
Idempotency section is 🔥 too — especially calling out “trust provider event IDs” as a trap. Providers try to be consistent… until they aren’t. Owning your own idempotency key + unique constraint is the grown-up move.
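To make the "own your idempotency key + unique constraint" point concrete, here's a minimal sketch using an in-memory SQLite table; the key format (`charge:<id>:<status>`) and table name are illustrative, not from the post:

```python
import sqlite3

# Hypothetical sketch: dedupe on an idempotency key we derive ourselves,
# enforced by a UNIQUE/PRIMARY KEY constraint rather than trusting
# provider event IDs to be stable.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE processed_events (
        idempotency_key TEXT PRIMARY KEY,  -- our key, not the provider's
        payload         TEXT NOT NULL
    )
""")

def record_once(idempotency_key: str, payload: str) -> bool:
    """Return True if this event is new; False if it's a duplicate."""
    try:
        with conn:  # transaction: commit on success, rollback on error
            conn.execute(
                "INSERT INTO processed_events VALUES (?, ?)",
                (idempotency_key, payload),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: safe to ack and skip

print(record_once("charge:ch_123:succeeded", "{}"))  # True
print(record_once("charge:ch_123:succeeded", "{}"))  # False
```

The database enforces the guarantee, so two workers racing on the same delivery can't both "win."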
And thank you for saying the quiet part loud: ordering is a lie. State transitions + rejecting invalid transitions is the only sane way to survive out-of-order delivery without building a fake time machine.
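A tiny sketch of what "reject invalid transitions" can look like in practice; the states and transition table here are hypothetical, not from the post:

```python
# Hypothetical sketch: model status as a state machine and reject invalid
# transitions, so out-of-order deliveries can't move state backwards.
VALID_TRANSITIONS = {
    "created":   {"pending", "failed"},
    "pending":   {"succeeded", "failed"},
    "succeeded": set(),   # terminal
    "failed":    set(),   # terminal
}

def apply_event(current: str, target: str) -> str:
    """Apply a transition if valid; otherwise keep the current state."""
    if target in VALID_TRANSITIONS[current]:
        return target
    # Late or duplicate event: ignore it, log it, move on.
    return current

state = "created"
state = apply_event(state, "pending")
state = apply_event(state, "succeeded")
state = apply_event(state, "pending")  # late delivery, rejected
print(state)  # succeeded
```

No clock comparisons, no fake time machine: the state itself encodes what's allowed to happen next.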
Transactional outbox mention is chef’s kiss. That pattern is the difference between “we’re reliable” and “we occasionally send 3 emails and pretend it’s fine.”
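For anyone who hasn't seen the outbox pattern: the trick is committing the business write and the "send this" record in one transaction, then delivering from the outbox separately. A minimal sketch with SQLite (table and function names are illustrative):

```python
import sqlite3

# Hypothetical sketch of a transactional outbox: the payment update and the
# notification record commit in ONE transaction, so a crash can never leave
# the DB updated but the email lost (or the email sent three times).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox   (id INTEGER PRIMARY KEY AUTOINCREMENT,
                           kind TEXT, payload TEXT, sent INTEGER DEFAULT 0);
""")

def mark_paid(payment_id: str):
    with conn:  # single transaction: both rows land, or neither does
        conn.execute("INSERT INTO payments VALUES (?, 'paid')", (payment_id,))
        conn.execute("INSERT INTO outbox (kind, payload) VALUES (?, ?)",
                     ("email.receipt", payment_id))

def drain_outbox() -> int:
    """Separate worker: deliver pending messages, mark them sent."""
    rows = conn.execute(
        "SELECT id, kind, payload FROM outbox WHERE sent = 0").fetchall()
    for row_id, kind, payload in rows:
        # the real send(kind, payload) would go here
        conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    conn.commit()
    return len(rows)

mark_paid("pay_42")
print(drain_outbox())  # 1
print(drain_outbox())  # 0 -- nothing goes out twice
```

The drain step should still be idempotent on the consumer side, since a crash between send and `sent = 1` means at-least-once delivery.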
Only thing I’d add (minor) is maybe a quick “how to replay safely” section: like a `/replay/:event_id` internal endpoint or a “reprocessor” worker that can re-run from raw payloads with guardrails. But honestly the foundations you laid already make replay basically inevitable, in a good way. Boring infrastructure is the goal, and this is exactly how you earn boring.
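Roughly what I mean by a guarded reprocessor, as a sketch (the stored-payload dict and `dry_run` guardrail are hypothetical, assuming the processor is already idempotent):

```python
# Hypothetical sketch of a guarded replay worker: re-run stored raw
# payloads through the normal (idempotent) processor, with a dry-run
# guardrail so you can preview before mutating anything.
RAW_EVENTS = {
    "evt_1": {"type": "charge.succeeded", "id": "ch_1"},
    "evt_2": {"type": "charge.failed", "id": "ch_2"},
}

def process(event: dict) -> str:
    # Stand-in for the real async processor; must be idempotent so
    # replaying an already-handled event is a no-op downstream.
    return f"processed {event['type']}"

def replay(event_id: str, *, dry_run: bool = True) -> str:
    payload = RAW_EVENTS.get(event_id)
    if payload is None:
        raise KeyError(f"no stored payload for {event_id}")
    if dry_run:
        return f"would replay {payload['type']}"  # guardrail: preview first
    return process(payload)

print(replay("evt_1"))                 # would replay charge.succeeded
print(replay("evt_1", dry_run=False))  # processed charge.succeeded
```

Because idempotency lives in the processor, replay needs no special-case logic of its own.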
This is an excellent breakdown — I really like how you frame webhooks as an untrusted, distributed system, because that mental model alone prevents so many real-world failures. The clear separation between ingress and async processing feels like the right long-term solution, and the emphasis on owning idempotency and state transitions shows a very mature, production-tested perspective. I’m especially interested in how teams could extend this with safe replay tooling, but even as-is, this post sets a rock-solid foundation for building truly boring (in the best way) infrastructure.
"And boring infrastructure is the goal" <<<<
😎
Nice deep dive.
Something that helped us in one system was separating webhook concerns into three layers: payload persistence, delivery reliability, and observability.
Once payloads are stored, replay becomes almost free — and debugging integrations gets dramatically easier.
Without that layer you end up trying to reconstruct events from logs, which is painful.
That’s a really solid approach — separating webhook handling into payload persistence, delivery reliability, and observability makes a lot of architectural sense, especially for keeping the system resilient and debuggable. I’m very interested in this pattern because storing the payload first and enabling replay could significantly simplify failure recovery and integration debugging in real-world systems.
Exactly — once the payload is persisted, everything else becomes a controlled system instead of relying on the network.
One thing that surprised me in practice is how useful replay becomes not just for failures but for debugging integrations. When something downstream behaves unexpectedly, being able to re-send the exact event payload is incredibly helpful.
Without that layer you end up trying to reconstruct state from logs, which is rarely fun.
Thank you for your attention!
The "ordering is a lie" framing is correct but there's a layer underneath
it worth naming: even when you model events as state transitions and reject
invalid ones, you still have an implicit arbitration strategy — you just
haven't documented it.
Last-write-wins. First-seen-wins. Timestamp-ordered. Most systems land on
one of these not by design but by however the queue happened to be
implemented. They work until they don't, and when they fail the postmortem
calls it "network instability" because nobody built the layer that makes
the arbitration decision explicit and traceable.
The missing piece between your ingress layer and your event processor is
what I'd call a resolution layer — something that takes conflicting or
ambiguous events and returns not just a state but a confidence score and a
documented basis for the decision. That way when a device reports offline
but its reconnect event was already processed 2.3 seconds earlier, your
system doesn't just pick one — it knows which one it picked, why, and how
confident it was.
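A minimal sketch of what that resolution layer could return; the tie-break rule, threshold, and confidence values here are made up for illustration, assuming events carry a provider timestamp:

```python
from dataclasses import dataclass

# Hypothetical sketch of an explicit resolution layer: given conflicting
# events, return not just the winner but a confidence score and the
# documented basis for the decision, so arbitration is traceable later.
@dataclass
class Resolution:
    winner: dict
    confidence: float
    basis: str

def resolve(events: list[dict]) -> Resolution:
    # Deterministic rule: prefer the latest provider timestamp; ties are
    # broken by event id so replays always produce the same answer.
    ordered = sorted(events, key=lambda e: (e["ts"], e["id"]))
    winner = ordered[-1]
    gap = winner["ts"] - ordered[-2]["ts"] if len(ordered) > 1 else 1.0
    confidence = 0.99 if gap > 5.0 else 0.6  # close timestamps = less sure
    return Resolution(winner, confidence,
                      basis=f"latest-timestamp (gap={gap:.1f}s)")

r = resolve([
    {"id": "evt_a", "type": "device.offline",   "ts": 100.0},
    {"id": "evt_b", "type": "device.reconnect", "ts": 102.3},
])
print(r.winner["type"], r.confidence, r.basis)
```

Because `resolve` is a pure, deterministic function of its inputs, replaying the same raw events always yields the same decision and the same audit record.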
Your replay point is also the right instinct. Safe replay only works if
the arbitration decisions are deterministic — same inputs, same output,
every time. Without that guarantee, replaying raw events can produce
different final states depending on when you run it, which defeats the
audit trail entirely.
Solid post. The transactional outbox section alone is worth the read.
Great point — I really like how you highlighted the implicit arbitration problem, because many event-driven systems quietly rely on things like queue order or timestamps without ever making that decision layer explicit. Your idea of a resolution layer with deterministic arbitration and confidence scoring is exactly the kind of mechanism that could make replay, debugging, and auditing far more reliable in real-world distributed systems.