DEV Community

137Foundry

Why Synchronous Webhook Processing Is a Production Trap

Most webhook implementations start the same way: the event arrives, the handler parses the payload, does some database work, maybe fires an email, and returns 200. It works in testing. It works in early production with low event volumes. Then it fails in predictable and expensive ways.

The failure modes of synchronous webhook processing are not edge cases. They're the normal operating conditions of a production webhook integration. Understanding why they fail makes the fix obvious.

What Synchronous Processing Looks Like

A synchronous webhook handler processes the event in the same request context where it was received. In pseudocode:

POST /webhooks/events
  1. Parse payload
  2. Verify signature
  3. Query database to get account
  4. Update account records
  5. Send confirmation email
  6. Return 200

Steps 3, 4, and 5 involve external calls. A database query under load might take 500ms. An email provider having a slow day might take 2 seconds. If anything in steps 3-5 throws an exception, the handler returns a 500.
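
The handler above can be sketched in Python. `verify_signature`, `load_account`, and `send_confirmation_email` are hypothetical stand-ins for the real integrations; the point is that every step runs inside the request:

```python
import json

ACCOUNTS = {"acct_1": {"credits": 0}}    # stands in for a database

def verify_signature(raw_body: bytes, signature: str) -> bool:
    return signature == "valid"          # placeholder for a real HMAC check

def load_account(account_id: str) -> dict:
    return ACCOUNTS[account_id]          # step 3: database read

def send_confirmation_email(account_id: str) -> None:
    pass                                 # step 5: email provider call

def handle_webhook(raw_body: bytes, signature: str) -> int:
    """All work happens inside the request; any failure becomes a 500."""
    if not verify_signature(raw_body, signature):
        return 400
    try:
        event = json.loads(raw_body)                  # step 1: parse
        account = load_account(event["account_id"])   # step 3: DB read
        account["credits"] += event["amount"]         # step 4: DB write
        send_confirmation_email(event["account_id"])  # step 5: email
        return 200                                    # step 6
    except Exception:
        return 500  # the sender sees this as delivery failure and retries
```

If `load_account` blocks for two seconds, the sender waits two seconds; if it throws, the sender schedules a retry.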

The Retry Problem

Most webhook senders interpret a non-2xx response as delivery failure. Stripe retries webhooks for up to three days, with increasing intervals between attempts; GitHub records failed deliveries and exposes them for redelivery. Most enterprise webhook senders retry on a schedule similar to Stripe's.

When your synchronous handler returns 500 because the database query timed out, the sender queues a retry. The retry arrives while your database is still under load, and the handler returns 500 again. After several retries, you have a queue of the same event being retried repeatedly, each attempt potentially writing partial state to the database before failing.

The synchronous handler creates a worst-case scenario: the database is slow, so the handler fails, so the sender retries, so database load increases further. This is a positive feedback loop.

The Timeout Problem

Webhook senders enforce delivery timeouts. If your endpoint doesn't respond within their timeout window (often 5-30 seconds), they treat it as a failed delivery and schedule a retry.

For most simple operations, this isn't a problem. For operations that involve slow downstream services, it is. A third-party API call that normally completes in 1 second might take 15 seconds under load. Your handler, waiting for that call to complete, times out from the sender's perspective before returning a response. The sender retries. You now have the same event being processed twice simultaneously, each racing to write to the same database records.

The Idempotency Problem That Synchronous Processing Creates

Synchronous processing combined with retries creates idempotency problems in code that was never designed to handle duplicate events. If your handler does:

account.credits += event.amount
account.save()

Running this twice doubles the credit amount. Running it once is correct. Designing for exactly-once execution is hard when you can't guarantee the sender won't retry.

Idempotent processing (checking whether an event ID has already been handled before doing any work) is the correct solution. But tacking it onto a synchronous handler doesn't fix the underlying architecture problem. You're still doing work inside the request window, still subject to timeouts, and still returning 500s on failures that cause retries.
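
A minimal sketch of that event-ID check, using an in-memory set. A production version would use a database table with a unique index on the event ID so that concurrent duplicates also collide; the names here are illustrative:

```python
PROCESSED_EVENT_IDS = set()           # stands in for a unique-indexed table
ACCOUNTS = {"acct_1": {"credits": 0}}

def apply_credit(event: dict) -> str:
    """Apply the credit at most once, however many times it's delivered."""
    if event["id"] in PROCESSED_EVENT_IDS:
        return "duplicate"            # already handled: do no work
    ACCOUNTS[event["account_id"]]["credits"] += event["amount"]
    PROCESSED_EVENT_IDS.add(event["id"])
    return "applied"
```

Delivering the same event twice now changes the balance once instead of twice, but the handler is still synchronous, so the timeout and retry problems remain.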

The Correct Architecture Separates Receiving from Processing

The fix is to separate what happens in the request from what happens after it. The receiver endpoint does three things: verify the signature, store the raw payload, and return 200. Everything else happens in a background worker after the request has been acknowledged.

POST /webhooks/events
  1. Verify signature -> 400 if invalid
  2. Check idempotency (event_id already seen?) -> 200 immediately
  3. Write raw payload to queue with status "pending"
  4. Return 200

[background worker]
  1. Read "pending" event from queue
  2. Process event (queries, updates, notifications)
  3. Mark event as "processed" or "failed"

The receiver now completes in under 500 milliseconds regardless of what processing involves. The sender gets a 200 immediately after delivery. Retries only happen if the network connection fails before the response, not because processing was slow.
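
A sketch of the receiver half, with Python's in-process `queue.Queue` standing in for a durable broker such as Redis, and a placeholder signature check:

```python
import json
import queue

EVENT_QUEUE: queue.Queue = queue.Queue()  # stands in for Redis/RabbitMQ
SEEN_EVENT_IDS = set()                    # idempotency record

def receive_webhook(raw_body: bytes, signature: str) -> int:
    if signature != "valid":              # placeholder for a real HMAC check
        return 400
    event = json.loads(raw_body)
    if event["id"] in SEEN_EVENT_IDS:
        return 200                        # duplicate delivery: ack, do nothing
    SEEN_EVENT_IDS.add(event["id"])
    EVENT_QUEUE.put({"status": "pending", "payload": event, "retries": 0})
    return 200                            # fast ack; no slow work here
```

Nothing in this function touches the database or an email provider, so its latency is bounded by the queue write alone.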

What the Worker Gets

The worker processes events asynchronously, which changes what's possible. Retrying failed events is now the worker's responsibility, not the sender's. If a database is slow, the worker backs off and retries with exponential delay. If a downstream service is down, the event stays in the queue until it becomes available. No 500s, no sender retries, no feedback loops.

Redis works well as the queue layer for this pattern. The receiver appends events to a list or stream. Workers consume from the stream, update event status on completion, and move failed events to a dead-letter queue after exhausting retries.

Designing Worker Retry Logic

The worker's retry behavior matters as much as the receiver's architecture. Without explicit retry logic, a single transient failure leaves the event in a failed state permanently.

A practical worker retry pattern:

  1. Pick up the event and attempt processing.
  2. On success, mark the event as "processed" with a completion timestamp.
  3. On failure, increment a retry count. If below the threshold, return the event to the queue with an exponential delay. If the retry count exceeds the threshold, move the event to a dead-letter queue and emit an alert.
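
The three steps above can be sketched as a worker loop. The in-process queue, the threshold, and the `process` callback are illustrative stand-ins:

```python
import queue
import time

MAX_RETRIES = 3
work_queue: queue.Queue = queue.Queue()
dead_letter_queue = []
processed = []

def run_worker_once(process) -> None:
    """Take one event off the queue and attempt it, following steps 1-3."""
    event = work_queue.get()
    try:
        process(event["payload"])                    # step 1: attempt
        event["status"] = "processed"                # step 2: mark success
        event["processed_at"] = time.time()
        processed.append(event)
    except Exception as exc:                         # step 3: failure path
        event["retries"] = event.get("retries", 0) + 1
        event["last_error"] = str(exc)
        if event["retries"] <= MAX_RETRIES:
            work_queue.put(event)  # real systems re-queue with a backoff delay
        else:
            dead_letter_queue.append(event)  # exhausted: dead-letter + alert
```

The key property: a failure changes the event's state in the queue instead of changing the HTTP response, so the sender never sees it.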

The delay between retries should grow with each attempt. Flat retry intervals put sustained pressure on a downstream service that's already struggling. Exponential backoff -- retry after 10 seconds, then 100, then 1000 -- gives external services time to recover without exhausting retries immediately. Most production systems cap the maximum interval to avoid events sitting in the queue indefinitely.
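
That schedule (10 seconds, then 100, then 1000, with a cap) reduces to one line; the base, factor, and cap values here are illustrative:

```python
def retry_delay(attempt: int, base: float = 10.0, factor: float = 10.0,
                cap: float = 3600.0) -> float:
    """Exponential backoff: 10s, 100s, 1000s, ... capped at one hour."""
    return min(base * factor ** attempt, cap)
```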

Queue infrastructure handles much of this natively. Redis streams track unacknowledged messages per consumer group and allow reclaiming entries that have been pending longer than a configurable idle time. RabbitMQ's dead-letter exchanges route rejected or expired messages to another queue, which can serve as a delayed retry queue.

Dead-Letter Queue Design

Events that exhaust their retry limit need a place to go that isn't silently deleted. A dead-letter queue preserves events that couldn't be processed after multiple attempts, making them available for inspection and manual replay.

The minimum useful dead-letter record includes: the original payload, the event source, the retry count, the last error message, and the timestamp of the last attempt. The error message is critical -- without it, debugging what went wrong requires reconstructing the failure from distributed application logs, which is much slower.

Dead-letter management can be straightforward. A separate database table, a query to list failed events by source and time range, and a replay operation that resets a set of events back to "pending" covers most operational needs. The engineering work is in setting up an alert when dead-letter depth grows past a threshold so the failures are visible before they affect business-critical event types.
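
A sketch of that minimum record and the replay operation as a SQLite table. The column names and the "pending" status value are illustrative, not a prescribed schema:

```python
import json
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE dead_letters (
        event_id     TEXT PRIMARY KEY,
        source       TEXT NOT NULL,
        payload      TEXT NOT NULL,      -- original raw payload
        retry_count  INTEGER NOT NULL,
        last_error   TEXT NOT NULL,      -- critical for debugging
        last_attempt REAL NOT NULL,
        status       TEXT NOT NULL DEFAULT 'dead'
    )""")

def record_dead_letter(event_id, source, payload, retry_count, last_error):
    db.execute(
        "INSERT INTO dead_letters VALUES (?, ?, ?, ?, ?, ?, 'dead')",
        (event_id, source, json.dumps(payload), retry_count,
         last_error, time.time()))

def list_failed(source: str, since: float):
    """List a source's dead events in a time range, with their errors."""
    return db.execute(
        "SELECT event_id, last_error FROM dead_letters "
        "WHERE source = ? AND last_attempt >= ? AND status = 'dead'",
        (source, since)).fetchall()

def replay(source: str) -> int:
    """Reset a source's dead events to 'pending' so workers pick them up."""
    cur = db.execute(
        "UPDATE dead_letters SET status = 'pending' "
        "WHERE source = ? AND status = 'dead'", (source,))
    return cur.rowcount
```

The returned row count from `replay` doubles as a cheap sanity check that the operation touched the events you expected.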

Testing the full async flow end-to-end during development is important. Unit tests verify the processing logic in isolation, but they can't replicate the sender's retry timing or the behavior of the real queue consumer. ngrok exposes your local receiver to the actual external sender so you can exercise the complete path including signature verification, queue writes, and worker consumption under realistic delivery conditions.

When the Synchronous Approach Is Acceptable

For very simple processing (a webhook that only logs the event to a table) and very small volumes, synchronous processing is fine. The failure modes described here only manifest at meaningful event volumes or when processing involves slow external calls.

For a complete implementation of the async receiver pattern including signature verification, idempotency, and failure handling, How to Build a Webhook Receiver That Handles Real-World Traffic covers each component with implementation notes.

The team at 137Foundry builds data integration infrastructure, including webhook receivers for high-volume event processing environments.

Forklift loading warehouse sorting conveyor
Photo by delphinmedia on Pixabay
