Wolyra

Posted on • Originally published at wolyra.ai

Idempotency Patterns for Distributed Systems

In a single-machine system, an operation happens once or it fails. In a distributed system, an operation can happen once, fail, succeed twice, or end in an ambiguous state that nobody can cleanly observe. The network is allowed to drop the request on the way in, drop the response on the way out, time out in the middle, or silently deliver the same message twice. Any service that expects perfect delivery semantics from a network it does not own is a service that will eventually corrupt its own data.

The mitigation is retry. The client did not get a response, so the client tries again. This is correct behavior, and it is the only behavior that keeps distributed systems running. The problem is that if the original request actually succeeded on the server — if the ambiguity was a lost response, not a lost request — the retry creates a duplicate. A second payment. A second email. A second row in the ledger.

Idempotency is what turns retry from a foot-gun into a safety mechanism. It is the property that lets the client retry as many times as it needs to, without the server doing the work more than once. Without it, distributed systems either accept duplicates as a fact of life, or they invest in heavy coordination protocols to prevent them. With it, the coordination disappears.

The natural idempotency of HTTP verbs

Before reaching for sophisticated patterns, it is worth recognizing that HTTP already distinguishes between idempotent and non-idempotent operations, and the distinction is load-bearing.

GET is idempotent by specification. Reading a resource does not change it, so retrying a GET produces the same result regardless of how many times it runs. This is why GET requests are safe to retry automatically at any layer — the browser, the reverse proxy, the SDK. No special server logic is needed.

PUT is idempotent by specification. A PUT request says “this resource should have this state” — the second and tenth PUT with the same body produce the same end state as the first. If a client retries a PUT after a timeout, the worst outcome is a wasted network round trip, not a duplicate resource.

DELETE is idempotent. Deleting something that is already deleted is a no-op. The conventional response is a 204 on the second delete rather than a 404, although both are defensible — the important property is that the server does not treat the retry as an error condition.

POST is the odd one out. POST creates resources, so two POSTs with the same body create two resources. This is correct semantically — the client that sent two POSTs might have genuinely wanted two resources — but it is exactly the operation where retry is most dangerous. When people talk about “idempotency patterns,” they are almost always talking about how to make POST safe to retry.
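This classification can be encoded directly in a client retry layer. A minimal sketch (the helper name is illustrative, not from any particular library; the method set is a subset of the idempotent methods listed in RFC 9110):

```python
# Retry-policy sketch: which HTTP methods a client-side retry layer may
# retry without server-side support. Subset of RFC 9110's idempotent
# methods; the helper name is illustrative.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}

def safe_to_retry(method: str, has_idempotency_key: bool = False) -> bool:
    """A retry is safe if the method is idempotent by specification,
    or if the request carries a key the server deduplicates on."""
    return method.upper() in IDEMPOTENT_METHODS or has_idempotency_key
```

Under this policy, a POST is only retried when the request carries an idempotency key — which is exactly the pattern the rest of this article describes.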

The idempotency key pattern

The dominant solution for retry-safe POSTs is the idempotency key. The shape is simple enough to be described in a paragraph.

The client generates a unique key before sending the request — typically a UUID, sometimes a hash of the request contents. It sends the key in a header, traditionally named Idempotency-Key. The server, when it receives the request, checks whether it has already seen that key. If it has, it returns the cached response from the original request — the same status code, the same body, as if the operation were performed again with identical results. If it has not, it performs the operation, stores the response keyed by the idempotency key, and returns the result.

The effect is that the client can retry the request as many times as it wants. The first successful attempt does the work. Every subsequent retry returns the same response without touching the business logic. The client gets exactly-once semantics without the server needing to coordinate with anyone.
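The whole flow fits in a few lines. A minimal in-memory sketch, with `charge` standing in for the business logic and a plain dict standing in for the idempotency store (all names are illustrative; a production store would be a database table or key-value store):

```python
import uuid

# In-memory sketch of the server-side idempotency-key flow.
store: dict[str, tuple[int, dict]] = {}   # idempotency key -> (status, body)

def charge(amount: int) -> dict:
    # Stand-in for the real business logic.
    return {"charged": amount, "charge_id": str(uuid.uuid4())}

def handle_post(idempotency_key: str, amount: int) -> tuple[int, dict]:
    if idempotency_key in store:
        return store[idempotency_key]     # seen before: replay cached response
    response = (201, charge(amount))      # first time: do the work
    store[idempotency_key] = response     # cache status code and body
    return response
```

A retry with the same key returns the identical response, down to the generated `charge_id`; a fresh key performs a fresh charge.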

Several APIs have made this pattern standard for an entire industry. Stripe’s idempotency keys are the most widely copied implementation, and the shape they popularized is roughly the right default: client-generated keys, server-stored responses, a time-to-live of about twenty-four hours.

Implementation details that matter

The pattern looks simple. The details have sharp edges.

Where the keys live. The server has to store the key and the response somewhere. For low-volume, high-value operations (payments, account creation), a dedicated database table works well. For high-volume operations, a key-value store with appropriate eviction is typically better. Redis with a TTL is the most common choice; the important constraint is that the store must be consistent with the database where the actual state lives — otherwise a key can indicate “already processed” while the actual state change was never persisted.
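The TTL-bounded lookup itself is simple. A sketch approximating what a Redis SETEX-style key provides, using an in-process dict with lazy eviction on read (the class is illustrative, and it deliberately ignores the consistency constraint just discussed):

```python
import time

# Sketch of a TTL-bounded idempotency store with lazy eviction on read,
# approximating a Redis key set with an expiry. Illustrative only.
class TTLStore:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._data: dict[str, tuple[float, object]] = {}

    def set(self, key: str, value: object) -> None:
        self._data[key] = (time.monotonic() + self.ttl, value)

    def get(self, key: str):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._data[key]           # expired: evict and report a miss
            return None
        return value
```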

What counts as “the same request.” If a client sends the same idempotency key but a different request body, is that a retry or a new request? The defensible answer is: it is an error, and the server should reject the second request with a clear error message rather than either treating it as a retry (wrong) or silently processing it as a new request (also wrong). The client is expected to use a fresh key for a fresh request.
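Detecting this requires storing a fingerprint of the original request alongside the key. A sketch, assuming JSON bodies and using a hash of the canonicalized body as the fingerprint (all names are illustrative):

```python
import hashlib
import json

# Sketch of rejecting key reuse: fingerprint the original body, refuse a
# mismatch instead of guessing. Names are illustrative.
seen: dict[str, str] = {}                 # idempotency key -> body fingerprint

def fingerprint(body: dict) -> str:
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def check_request(key: str, body: dict) -> str:
    fp = fingerprint(body)
    if key not in seen:
        seen[key] = fp
        return "new"                      # first use of this key: do the work
    if seen[key] == fp:
        return "retry"                    # same key, same body: replay
    return "conflict"                     # same key, different body: reject
```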

TTL selection. The idempotency cache cannot live forever. Twenty-four hours is a common default, chosen because it is long enough to absorb any reasonable retry window and short enough that storage stays bounded. Shorter TTLs save storage but risk the client retrying after expiry, which creates exactly the duplicate the pattern was designed to prevent. Longer TTLs increase storage without much benefit, because no well-behaved client retries a request from days ago.

What to store. The full response body, the status code, and any response headers that affect client behavior. It is tempting to store only “success” or “failure” and recompute the response on retry, but that recomputation breaks the whole guarantee — the retried response could differ from the original, and the client has no way to know.

The race condition between lookup and write

A subtle bug hides inside the naive implementation. If two retries of the same request arrive at the server simultaneously — which happens constantly in practice, because clients often retry after a short timeout rather than waiting — both requests can check the idempotency store, see no entry, and proceed to perform the operation. Now the work has happened twice.

The fix is that the idempotency record must be inserted with a uniqueness constraint before the business logic runs. The first request to insert wins and proceeds. The second request’s insert fails, and the second request then waits for the first to complete and returns its cached response. This turns the race condition into a wait condition, which is safe.

Getting this right requires a bit of care with database transactions. If the insert and the business logic run in one transaction, the uncommitted insert holds a lock on the unique index, so a concurrent insert of the same key blocks until the first transaction commits (and the second insert fails) or rolls back (and the second insert proceeds). This is the pattern every robust idempotency implementation ends up with, even if the early versions did not.
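The claim step can be sketched with SQLite standing in for the production database. This sketch uses the simpler two-step variant — commit an `in_progress` record first so it is visible to concurrent requests, mark it done after the work — rather than a single enclosing transaction (schema and names are illustrative):

```python
import sqlite3

# Sketch of insert-before-work. SQLite stands in for the production DB;
# the uniqueness constraint on `key` makes the first insert win the race.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE idempotency (
        key      TEXT PRIMARY KEY,   -- uniqueness constraint: first insert wins
        status   TEXT NOT NULL,      -- 'in_progress' or 'done'
        response TEXT
    )""")

def try_claim(key: str) -> bool:
    """Returns True if this request won the race and should do the work."""
    try:
        db.execute(
            "INSERT INTO idempotency (key, status) VALUES (?, 'in_progress')",
            (key,))
        db.commit()
        return True
    except sqlite3.IntegrityError:
        return False                 # another request already claimed this key

def finish(key: str, response: str) -> None:
    db.execute(
        "UPDATE idempotency SET status = 'done', response = ? WHERE key = ?",
        (response, key))
    db.commit()
```

A loser of the race (a `try_claim` that returns False) polls or waits for the record to reach `done`, then returns the stored response.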

The idempotency receipt pattern

A variant worth knowing exists for systems where the client does not want to pre-generate a key, or where the server wants stronger control over what counts as a retry.

In the receipt pattern, the client makes a POST request to a dedicated endpoint that creates an idempotency receipt — an identifier with a short TTL — and returns it. The client then uses that receipt as the idempotency key for its actual request. If the actual request fails or times out, the client can retry with the same receipt safely.

This is more round-trips than the client-generated key pattern, but it has advantages. The server controls the receipt lifetime and format. The client cannot accidentally generate duplicate keys. And the pattern integrates cleanly with authorization flows where the receipt can be scoped to a specific caller.
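The server side of the receipt endpoint is small. A sketch, where the TTL value, the in-memory dict, and all names are illustrative:

```python
import time
import uuid

# Sketch of the receipt pattern: the server mints a short-lived receipt,
# and the actual request must present one that is still valid.
RECEIPT_TTL = 300.0                       # seconds, chosen by the server
receipts: dict[str, float] = {}           # receipt id -> expiry time

def issue_receipt() -> str:
    rid = str(uuid.uuid4())
    receipts[rid] = time.monotonic() + RECEIPT_TTL
    return rid

def redeem(receipt: str) -> bool:
    """True if the receipt exists and has not expired."""
    expiry = receipts.get(receipt)
    return expiry is not None and time.monotonic() < expiry
```

A real implementation would also scope the receipt to the authenticated caller and use it as the idempotency key for the subsequent request, exactly as in the client-generated-key pattern.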

Most APIs are fine with the simpler client-generated key pattern. The receipt pattern is worth reaching for when the client environment is constrained — old mobile clients, embedded systems, systems where generating a sufficiently unique key is harder than it sounds.

Retrofitting onto non-idempotent APIs

A common real-world situation is an API that was designed without idempotency and has now grown large enough that duplicate requests are causing real problems. Retrofitting idempotency is harder than greenfield, but it is not impossible.

The usable pattern: introduce the Idempotency-Key header as optional. Clients that pass it get the deduplication guarantee. Clients that do not pass it behave as before. New clients and updated SDKs start passing the key by default. Over time, the fraction of traffic with keys grows, and the duplicate rate drops proportionally.
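The optional-header behavior amounts to one branch in a middleware. A sketch (the function and cache names are illustrative; a real version would live in the web framework's middleware layer):

```python
# Sketch of retrofitting: the Idempotency-Key header is optional, so
# legacy clients pass straight through unchanged.
cache: dict[str, object] = {}

def with_optional_idempotency(headers: dict, handler, *args):
    key = headers.get("Idempotency-Key")
    if key is None:
        return handler(*args)             # legacy client: old behavior
    if key in cache:
        return cache[key]                 # keyed retry: replay cached response
    response = handler(*args)
    cache[key] = response
    return response
```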

The trap to avoid is making the key mandatory immediately. That breaks existing clients that have been operating under the old contract and creates more incidents than the idempotency program was supposed to solve. Mandatory keys are a destination to migrate to slowly, not a starting point.

What this buys, in practice

The visible benefit of idempotency is that duplicate operations disappear from the error logs. The less visible benefit is larger: retry policies become aggressive. Clients can retry on any network error, any 5xx response, any timeout, without fear of corrupting state. Mobile clients can retry when the connection comes back after a tunnel outage. Background workers can retry on job failure without human intervention. The entire distributed system becomes more resilient, because retry — the simplest reliability mechanism there is — is now safe to use everywhere.

The cost is a modest amount of storage, a small per-request lookup, and a one-time investment in getting the race condition right. The first API a team adds idempotency to feels like overhead. The tenth feels like the baseline. The hundredth is the reason the entire system survives a bad network day without a single corrupted record.

Retry is a fact of distributed systems. Idempotency is what makes retry safe. Any API that mutates state and does not have idempotency support is, at a fundamental level, a timebomb — it has not caused a problem yet because the network has been kind, and it will cause a problem eventually because the network will not stay kind. Build the support before the incident, not after.
