Yuriy for Adal

Posted on Jun 29 • Originally published at adal.cloud

Why webhooks get lost and why they sometimes arrive twice

#webhooks #backend #devops #architecture

Webhooks often look deceptively simple.

One service sends an HTTP request. Another service receives it. The receiver returns 200 OK, processes the payload, and everyone moves on.

That model works until the webhook becomes part of a real business process.

A missed event can mean a customer message that never reaches a chatbot, an order that does not appear in a CRM, a subscription that is not updated, or a payment that succeeds but does not unlock access for the customer.

But lost webhooks are only half of the problem. The other half is duplicate delivery. A reliable webhook integration must be prepared for both.

What does it mean to lose a webhook?

A webhook is lost when an event is sent but never successfully processed by the receiving system.

There are many possible causes:

the receiving server is temporarily unavailable;
a reverse proxy is misconfigured;
the application crashes after accepting the request;
the database is overloaded;
a TLS certificate expires;
a deployment introduces an unexpected error;
a DNS or network change routes traffic to infrastructure that is no longer active.

For a low-risk notification, this may be inconvenient but manageable. For a production integration, it can become a business problem.

Imagine a payment provider sends an event confirming that a customer has paid. If your application never processes that event, the customer may be charged but never receive access to the product.

That is not just a technical failure. It creates support work, refunds, manual reconciliation, and a loss of trust.

DNS migrations can create both loss and duplication

A common example is moving an application to a new server.

The usual approach looks simple:

Deploy the application to a new server.
Update the DNS record.
Wait for traffic to move.
Shut down the old server.

The problem is that DNS does not update everywhere at the same moment.

Resolvers, ISPs, corporate networks, and individual machines may cache the old address for different amounts of time. Even when the DNS TTL is low, some traffic can continue reaching the old server after the record has been changed.

During the transition, different webhook senders, or successive requests from the same sender, may reach different servers:

Webhook provider
        |
        +--> old server
        |
        +--> new server

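If the old server is already offline, webhooks sent there fail, and they are lost unless the sender retries after it has picked up the new address. If both servers are active, the same event can be processed twice, for example when a delivery to one server times out and the retry reaches the other. That creates a different class of failure.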

A chatbot might send two replies to the same message. A CRM integration might create duplicate records. A payment workflow might credit a balance twice or issue the same product twice.

Why retries are necessary

Webhook delivery should normally use an at-least-once model.

That means a sender retries delivery when it cannot confirm that the receiving system accepted the event. This is much safer than silently dropping an event after one failed attempt.

However, at-least-once delivery has an important consequence:

The receiving system must assume that the same event can arrive more than once.

A sender may retry because:

the destination returned a 5xx response;
the connection timed out;
the destination was temporarily unavailable;
the sender did not receive the response, even though the receiver processed the request;
the provider intentionally retries to protect against transient network failures.

This is normal behaviour. It is not necessarily a bug in the provider. The receiving side must be designed accordingly.
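To make the sender's side concrete, here is a minimal sketch of an at-least-once delivery loop with exponential backoff. It is an illustration under stated assumptions, not any particular provider's implementation: `deliver_with_retries`, the `send` callable, and the delay values are hypothetical, and real providers use their own schedules and usually stop retrying on some 4xx responses.

```python
import time
from typing import Callable, Optional


def deliver_with_retries(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a 2xx status.

    Returns the number of attempts used and raises RuntimeError if every
    attempt fails. `send` is a hypothetical callable that performs one HTTP
    delivery and returns the response status code.
    """
    for attempt in range(1, max_attempts + 1):
        status: Optional[int]
        try:
            status = send()
        except (ConnectionError, TimeoutError):
            # No response at all: the receiver may or may not have
            # processed the request, so the only safe choice is to retry.
            status = None
        if status is not None and 200 <= status < 300:
            return attempt
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"delivery failed after {max_attempts} attempts")
```

The timeout branch is the one that produces duplicates: the sender cannot tell a request that never arrived from one that was processed but whose response was lost, so it sends the event again.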

Idempotency is the protection against duplicates

The key concept is idempotency.

An idempotent operation produces the same final result whether it runs once or several times.

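For webhook processing, this usually means recording the provider’s event identifier and checking it before running the business logic. Ideally the identifier and the business change are committed together, so that a crash between the two steps does not leave an event marked as handled when its work never ran.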

A simplified flow might look like this:

Receive webhook
        |
        v
Validate signature
        |
        v
Check event ID
        |
        +--> already processed --> return 200 OK
        |
        v
Store event ID
        |
        v
Run business logic
        |
        v
Return 200 OK

For example, if a payment provider sends an event with the ID evt_123, your application should record that ID after receiving it.

If the same event arrives again, the application should recognise it as a duplicate and avoid repeating the business operation.

Returning 200 OK is still important: it tells the sender that the event has been safely handled, even when no additional action was needed.
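The duplicate check can be sketched with a unique constraint on the event ID. This is a minimal example using Python's built-in `sqlite3` module. The table names, `open_db`, and `handle_payment_event` are hypothetical, and it assumes the business change (crediting a balance) lives in the same database, so that it can share a transaction with the event ID.

```python
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    # The PRIMARY KEY makes the database reject a second insert of the same ID.
    db.execute(
        "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS balances ("
        " customer TEXT PRIMARY KEY,"
        " amount_cents INTEGER NOT NULL)"
    )
    return db


def handle_payment_event(
    db: sqlite3.Connection, event_id: str, customer: str, amount_cents: int
) -> bool:
    """Credit the customer once per event ID.

    Returns True if the event was applied and False if it was a duplicate.
    Either way the caller can answer the sender with 200 OK.
    """
    try:
        with db:  # one transaction: commit on success, roll back on error
            db.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
            db.execute(
                "INSERT OR IGNORE INTO balances (customer, amount_cents)"
                " VALUES (?, 0)",
                (customer,),
            )
            db.execute(
                "UPDATE balances SET amount_cents = amount_cents + ?"
                " WHERE customer = ?",
                (amount_cents, customer),
            )
    except sqlite3.IntegrityError:
        # The event ID already exists, so this delivery is a duplicate.
        return False
    return True
```

Letting the database enforce uniqueness avoids a race between "check event ID" and "store event ID" when two deliveries of the same event arrive at almost the same time. Because the insert and the balance update commit together, a failure in the business step also rolls back the event ID, so a later retry is still processed.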

Acknowledge quickly, process safely

Another useful pattern is to separate acceptance from processing.

Instead of performing all business logic directly inside the webhook request handler, the application can:

Validate the request.
Store the event durably.
Return a successful response quickly.
Process the event asynchronously.

This reduces the risk of timeouts and makes failures easier to recover from. It also gives you a durable record of what happened, which is essential when investigating an incident.

Of course, this only works if the storage step itself is reliable. Returning 200 OK before the event has been saved somewhere durable can still cause data loss.
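The four steps above might be sketched as follows, again with `sqlite3` standing in for whatever durable store the application uses. The `inbox` table, its status values, and the functions `accept` and `process_pending` are hypothetical names chosen for illustration. Request validation is left out to keep the example short.

```python
import json
import sqlite3
from typing import Callable


def open_inbox(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS inbox ("
        " event_id TEXT PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'received',"
        " error TEXT)"
    )
    return db


def accept(db: sqlite3.Connection, event_id: str, payload: dict) -> int:
    """Request-handler part: persist the event, then acknowledge it."""
    with db:  # the commit happens before the status code is returned
        db.execute(
            "INSERT OR IGNORE INTO inbox (event_id, payload) VALUES (?, ?)",
            (event_id, json.dumps(payload)),
        )
    return 200


def process_pending(
    db: sqlite3.Connection, handler: Callable[[dict], None]
) -> int:
    """Worker part: run handler on stored events that are still 'received'.

    Returns the number of events processed successfully.
    """
    rows = db.execute(
        "SELECT event_id, payload FROM inbox WHERE status = 'received'"
    ).fetchall()
    processed = 0
    for event_id, payload in rows:
        try:
            handler(json.loads(payload))
        except Exception as exc:
            # Keep the event and the reason so it can be inspected and replayed.
            with db:
                db.execute(
                    "UPDATE inbox SET status = 'failed', error = ?"
                    " WHERE event_id = ?",
                    (str(exc), event_id),
                )
        else:
            with db:
                db.execute(
                    "UPDATE inbox SET status = 'processed' WHERE event_id = ?",
                    (event_id,),
                )
            processed += 1
    return processed
```

In this sketch a worker that crashes between running the handler and updating the status will run the handler again on restart, so the handler itself still needs to be idempotent.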

A practical webhook reliability checklist

A production webhook integration should answer these questions clearly:

What happens when the destination is temporarily unavailable?
Are failed deliveries retried?
How long are retries performed?
Can the same event be delivered more than once?
Does the application use event IDs or idempotency keys?
Are incoming requests stored before expensive processing starts?
Can the delivery history be inspected later?
Can a failed event be replayed safely?
During a migration, how long will the old endpoint remain available?
Can the team distinguish between an event that was never received and one that was received but failed during processing?

If these questions do not have explicit answers, reliability depends too heavily on everything going right.

And in production, something eventually goes wrong.
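The last question in the checklist, telling an event that was never received apart from one that was received but failed, can be answered with a simple reconciliation. The sketch below assumes two things that are not guaranteed: that the provider can supply a list of the event IDs it sent, and that the application records a status per received event. The function `reconcile` and the status values are hypothetical.

```python
from typing import Dict, Iterable, List


def reconcile(
    sent_ids: Iterable[str], inbox: Dict[str, str]
) -> Dict[str, List[str]]:
    """Compare the provider's sent event IDs with local records.

    `inbox` maps event_id -> status, where status is one of
    'received', 'processed', or 'failed' (hypothetical values).
    """
    sent = set(sent_ids)
    # Sent by the provider, but no local record exists at all.
    never_received = sorted(sent - set(inbox))
    # A local record exists, but processing did not succeed.
    failed = sorted(eid for eid in sent if inbox.get(eid) == "failed")
    return {"never_received": never_received, "failed_in_processing": failed}
```

The two groups call for different fixes: events in the first need to be redelivered by the sender, while events in the second are already stored and can be replayed locally once the processing bug is fixed.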

Reliability is not just an HTTP status code

Returning 200 OK is not the same as handling an event reliably.

Reliable webhook processing requires a clear delivery model, retry behaviour, idempotent business logic, durable event storage, and enough visibility to understand what happened when an incident occurs.

That is why webhook infrastructure matters.

Adal Cloud is designed to make webhook delivery predictable and observable: incoming requests are retained, delivery attempts can be inspected, failed deliveries can be retried, and stored requests can be replayed when needed.

When webhooks carry events that matter to customers, payments, or business operations, the delivery path should not be the least reliable part of the system.