DEV Community

amir

Why We Accidentally Blocked Our Users: A Deep Dive into Idempotency in Distributed Systems

I learned one of my most important distributed-systems lessons the hard way.

We were working on a payment flow connected to an external payment gateway. On paper, the architecture looked solid: microservices, clean database transactions, retry logic, monitoring, and enough security checks to make us feel safe before deployment.

Then production reminded us that real users do not live inside clean architecture diagrams.

Support tickets started coming in. Some users could not complete payments. Some accounts were being blocked too aggressively. At first, it looked like suspicious behavior: multiple payment attempts, repeated payloads, and requests arriving only seconds apart.

But when I dug into the logs, the real problem was not fraud.

It was our own backend.

A user with a slow network clicked the Pay button, waited, saw nothing happen, and clicked again. In another case, the browser retried a request after a timeout. Our backend received multiple identical payment requests within a very short window, and our naive security logic treated them like duplicate transaction anomalies or replay attempts.

We were punishing users for having bad internet.

That incident changed the way I think about payment systems, retries, APIs, and side effects. In distributed systems, you cannot control the network. You cannot control the user's browser. You cannot guarantee that a response will reach the client.

But you must control what happens when the same intent reaches your backend more than once.

That is where idempotency becomes essential.


The Problem Is Not the Retry. The Problem Is the Side Effect.

Retries are normal.

Clients retry. Browsers retry. Mobile networks fail. Gateways time out. Load balancers drop connections. Users double-click buttons. Background workers reprocess jobs. Message queues deliver the same message more than once.

The dangerous part is not receiving the same request multiple times.

The dangerous part is executing the same side effect multiple times.

For example:

  • charging a card twice
  • creating two orders
  • sending duplicate invoices
  • blocking a user after repeated attempts
  • reducing inventory multiple times
  • creating duplicate ledger entries
  • triggering the same notification several times

In backend engineering, especially around payments and financial workflows, the real question is not:

How do I stop duplicate requests?

The better question is:

How do I make duplicate requests safe?


What Idempotency Actually Means

In mathematics, an operation is idempotent when applying it multiple times produces the same result as applying it once.

In API design, some HTTP methods are naturally expected to be idempotent.

GET should not change state.

PUT can be idempotent because replacing a resource with the same representation multiple times should leave the system in the same final state.

DELETE can also be idempotent because deleting the same resource multiple times still means the resource does not exist.

But POST is different.

A POST /payments request usually means:

Create a new payment.

If the client sends the same request twice, the backend may create two payments unless we intentionally design against it.

That is the core problem.

A payment request represents an intent, not just an HTTP payload. If the user intended to pay once, the system should execute that intent once, even if the request arrives multiple times.
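
To make the contrast concrete, here is a minimal sketch using a made-up in-memory store. The `putProfile` and `postPayment` functions are hypothetical names for this illustration only, not part of any real API. Repeating the PUT-style call leaves the state unchanged, while repeating the POST-style call creates a second payment.

```javascript
// Hypothetical in-memory "database", for illustration only.
const profiles = new Map();
const payments = [];

// PUT-style: replace the resource with the given representation.
function putProfile(id, profile) {
  profiles.set(id, { ...profile });
}

// POST-style: every call creates a new resource.
function postPayment(payment) {
  payments.push({ id: `pay_${payments.length + 1}`, ...payment });
}

putProfile("user_1", { name: "Sam" });
putProfile("user_1", { name: "Sam" }); // retry: final state is unchanged

postPayment({ amount: 1000, currency: "USD" });
postPayment({ amount: 1000, currency: "USD" }); // retry: a second payment exists

console.log(profiles.size);   // 1
console.log(payments.length); // 2
```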


Enter the Idempotency Key

An idempotency key is a unique value generated by the client and sent with the request, usually as an HTTP header:

Idempotency-Key: 7f3f0f4c-98a4-4d8c-9b91-7a2f9e4c5d11

The key says:

This request represents one specific user action. If you see this key again, do not execute the action again. Return the result of the first execution.

That simple idea changes the system from:

I received another request, so I will process it again.

to:

I received the same intent again, so I will return the same result.

This is especially important for:

  • payments
  • order creation
  • wallet transfers
  • account provisioning
  • invoice generation
  • message publishing
  • background job processing
  • any workflow where duplicate execution is dangerous

A Good Idempotency Layer Is More Than a Boolean Flag

A common mistake is implementing idempotency like this:

if key exists:
    reject request
else:
    process request

That looks simple, but it is not enough for production systems.

A real implementation needs to answer several questions:

  • Is the request currently being processed?
  • Did the request complete successfully?
  • Should we return a cached response?
  • Did the request fail with a retryable error?
  • Is the same key being reused with a different payload?
  • What happens if two identical requests arrive at the exact same millisecond?

For that reason, I prefer thinking about idempotency as a small state machine.


The State Machine I Like to Use

A practical idempotency record can have these states:

STARTED
COMPLETED
FAILED

STARTED

The server received the key and started processing the request.

This state is important because it protects you from concurrent duplicates. If another request arrives with the same key while the first request is still running, the system should not execute the action again.

Usually, I return something like:

409 Conflict

with a response explaining that the request is already in progress.

COMPLETED

The operation finished successfully.

At this point, the server stores the final response and returns that same response for future requests with the same key.

This is the key behavior that makes retries safe.

FAILED

The operation failed with a server-side or retryable error.

This part depends on your business rules, but in many systems I prefer allowing the client to retry after a true internal failure. The important thing is to be explicit about which failures are cached and which failures are not.
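
The three states can be read as a small decision table. The sketch below is one possible way to write it down; `decideAction` and the action names it returns are hypothetical, and the choice to allow a retry from FAILED is the policy described above, not a universal rule.

```javascript
// Hypothetical helper: given the stored idempotency record (or null when the
// key has never been seen), decide what the request handler should do next.
function decideAction(record) {
  if (record === null) {
    return "EXECUTE"; // first time we see this key
  }
  switch (record.state) {
    case "STARTED":
      return "REPLY_IN_PROGRESS"; // e.g. respond with 409 Conflict
    case "COMPLETED":
      return "REPLAY_CACHED_RESPONSE"; // return the stored response
    case "FAILED":
      return "ALLOW_RETRY"; // policy choice: retry after an internal failure
    default:
      throw new Error(`Unknown idempotency state: ${record.state}`);
  }
}

console.log(decideAction(null));                   // EXECUTE
console.log(decideAction({ state: "STARTED" }));   // REPLY_IN_PROGRESS
console.log(decideAction({ state: "COMPLETED" })); // REPLAY_CACHED_RESPONSE
```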


Why Redis Works Well Here

You can implement idempotency using PostgreSQL with a unique constraint. In some systems, that is perfectly fine.

But in high-throughput APIs, I usually prefer Redis for the idempotency layer because it gives you:

  • fast key lookup
  • atomic operations
  • natural TTL support
  • simple distributed locking primitives
  • low overhead for temporary request metadata

The TTL part matters a lot.

Idempotency keys should not live forever. For many payment-style workflows, a 24-hour TTL is a reasonable starting point. Some systems may need shorter or longer retention depending on reconciliation, compliance, and product behavior.


The Race Condition You Must Handle

The biggest bug in naive idempotency implementations is the race condition.

Imagine two identical requests arrive at the same time:

Request A checks key -> key does not exist
Request B checks key -> key does not exist
Request A processes payment
Request B processes payment

Now you have two charges.

This is why the first write must be atomic.

In Redis, you can use SET with NX and EX:

const lockKey = `idempotency:${key}`;

const acquired = await redis.set(
  lockKey,
  JSON.stringify({
    state: "STARTED",
    payloadHash,
    createdAt: Date.now()
  }),
  "NX",
  "EX",
  86400
);

if (!acquired) {
  // Another request already created this key.
  // Now inspect its current state.
}

NX means "only set this key if it does not already exist."

That one detail is critical. It turns the check-and-set operation into one atomic step.
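
To see the difference without a Redis server, here is a small simulation. `RaceDemoStore` is a hypothetical in-memory stand-in, and its `setIfAbsent` method only imitates the behavior of SET with NX; it is not a Redis client call. The naive handler checks and then writes in two separate steps, so two concurrent requests both pass the check. The atomic handler does both in one step.

```javascript
// In-memory stand-in for the key store (an assumption for this sketch).
class RaceDemoStore {
  constructor() { this.data = new Map(); }
  async get(key) { return this.data.has(key) ? this.data.get(key) : null; }
  async set(key, value) { this.data.set(key, value); }
  // Imitates SET ... NX: the check and the write happen in one step.
  async setIfAbsent(key, value) {
    if (this.data.has(key)) return false;
    this.data.set(key, value);
    return true;
  }
}

// Naive: check first, write later. Another request can slip in between.
async function naiveHandler(store, key, charge) {
  const existing = await store.get(key);
  if (existing !== null) return "duplicate";
  await store.set(key, "STARTED");
  charge();
  return "charged";
}

// Atomic: only the request that creates the key is allowed to charge.
async function atomicHandler(store, key, charge) {
  const acquired = await store.setIfAbsent(key, "STARTED");
  if (!acquired) return "duplicate";
  charge();
  return "charged";
}

// Fire two identical requests concurrently and count the charges.
async function simulate(handler) {
  const store = new RaceDemoStore();
  let charges = 0;
  const charge = () => { charges += 1; };
  await Promise.all([
    handler(store, "key-1", charge),
    handler(store, "key-1", charge)
  ]);
  return charges;
}

simulate(naiveHandler).then((n) => console.log("naive charges:", n));   // 2
simulate(atomicHandler).then((n) => console.log("atomic charges:", n)); // 1
```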


Returning the Cached Response

When the first request completes, we update the idempotency record:

await redis.set(
  `idempotency:${key}`,
  JSON.stringify({
    state: "COMPLETED",
    payloadHash,
    statusCode: 201,
    responseBody: {
      paymentId: "pay_123",
      status: "succeeded"
    },
    completedAt: Date.now()
  }),
  "EX",
  86400
);

Then, if the same request arrives again:

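const raw = await redis.get(`idempotency:${key}`);
// redis.get resolves to null when the key is missing or has expired.
const record = raw ? JSON.parse(raw) : null;

if (record && record.state === "COMPLETED") {
  // Replay the stored response instead of executing the payment again.
  return res.status(record.statusCode).json(record.responseBody);
}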

The client gets a successful response, but the payment is not executed again.

That is the whole point.

The retry becomes harmless.


Payload Fingerprinting: The Edge Case Many People Miss

There is one subtle bug that can become very dangerous.

What if the client reuses the same idempotency key for a different request?

Example:

Key: abc-123
Amount: $10

Then later:

Key: abc-123
Amount: $1,000

If your server only checks the key, it returns the cached response for the old $10 request. The client now believes the $1,000 payment succeeded when it was never executed. That can create serious data integrity problems.

The fix is payload fingerprinting.

When the request first arrives, create a stable hash of the meaningful request body:

import crypto from "crypto";

function createPayloadHash(payload) {
  return crypto
    .createHash("sha256")
    .update(JSON.stringify(payload))
    .digest("hex");
}

Then store that hash with the idempotency key.

On every retry, compare the incoming payload hash with the stored hash.

if (record.payloadHash !== incomingPayloadHash) {
  return res.status(400).json({
    error: "Idempotency key was reused with a different payload."
  });
}

This prevents key reuse bugs from silently corrupting your workflow.

In my opinion, this is not optional for financial systems.
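
One caveat about the hashing step: `JSON.stringify` serializes object keys in insertion order, so two payloads with the same fields in a different order produce different hashes. If clients may build the body in varying order, one option is to sort keys before hashing. The sketch below is an assumption about how that could look; `canonicalize` and `createStablePayloadHash` are hypothetical helper names, and it uses CommonJS `require` so it runs as a standalone script.

```javascript
const nodeCrypto = require("node:crypto");

// Recursively rebuild the value with object keys in sorted order, so that
// logically equal payloads serialize to the same string.
function canonicalize(value) {
  if (Array.isArray(value)) {
    return value.map(canonicalize); // array order is meaningful, keep it
  }
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.keys(value).sort().map((k) => [k, canonicalize(value[k])])
    );
  }
  return value;
}

function createStablePayloadHash(payload) {
  return nodeCrypto
    .createHash("sha256")
    .update(JSON.stringify(canonicalize(payload)))
    .digest("hex");
}

const a = createStablePayloadHash({ amount: 1000, currency: "USD" });
const b = createStablePayloadHash({ currency: "USD", amount: 1000 });
console.log(a === b); // true
```

Hash only the fields that define the intent (amount, currency, recipient, and so on). Volatile fields such as client timestamps would make every retry look like a different payload.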


Be Careful with 4xx and 5xx Responses

Not every response should be cached the same way.

2xx responses

Usually safe to cache.

The operation succeeded, and future retries should return the same response.

5xx responses

This depends on where the failure happened.

If your service failed before executing the side effect, a retry may be safe.

If your service timed out after calling the payment gateway, you may not know whether the external side effect happened. In that case, you need reconciliation, gateway lookup, or a more careful state transition.

This is where many systems get complicated.

4xx responses

Be very careful.

If the user sent invalid input, you may not want to cache that failure forever. Maybe the user fixes the payload and retries. Maybe the frontend generated the key before validation was complete.

Personally, I do not like blindly caching all 4xx responses for long periods. For many product flows, it is better to reject the invalid request and ask the client to generate a new idempotency key after the user changes the input.

The important thing is to define this behavior intentionally.
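
One way to make that behavior explicit is to put the policy in a single function. The sketch below encodes the preferences described in this section; `cachePolicyFor`, its `sideEffectMayHaveHappened` option, and the returned labels are hypothetical names, and your own rules may reasonably differ.

```javascript
// Hypothetical policy helper. The exact rules are a product decision.
function cachePolicyFor(statusCode, { sideEffectMayHaveHappened = false } = {}) {
  if (statusCode >= 200 && statusCode < 300) {
    return "CACHE"; // success: replay the same response on retry
  }
  if (statusCode >= 400 && statusCode < 500) {
    // Invalid input: let the client fix the payload and send a new key.
    return "DO_NOT_CACHE";
  }
  if (statusCode >= 500) {
    // If the gateway may already have acted, the outcome is unknown and
    // needs reconciliation. Otherwise a retry is safe.
    return sideEffectMayHaveHappened ? "RECONCILE" : "ALLOW_RETRY";
  }
  return "DO_NOT_CACHE"; // 1xx and 3xx are not expected here
}

console.log(cachePolicyFor(201)); // CACHE
console.log(cachePolicyFor(422)); // DO_NOT_CACHE
console.log(cachePolicyFor(503)); // ALLOW_RETRY
console.log(cachePolicyFor(504, { sideEffectMayHaveHappened: true })); // RECONCILE
```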


Client-Side Key Generation

The idempotency key should usually be generated on the client.

For example, in a checkout page, generate the key when the user starts a specific payment attempt and reuse that key for retries of the same attempt.

const idempotencyKey = crypto.randomUUID();

await fetch("/api/payments", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Idempotency-Key": idempotencyKey
  },
  body: JSON.stringify({
    amount: 1000,
    currency: "USD"
  })
});

If the server generates the key, it does not help with network failures between the client and server: when the request or its response is lost in transit, the client has no key to send with the retry.

The client needs to be able to say:

I am retrying the same action.

That requires the client to own the key for that action.
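
Here is a minimal sketch of what owning the key looks like in a retry loop. `payWithRetry` is a hypothetical helper, and `fetchImpl` is injected so the sketch can run outside a browser (in a page it would be the global `fetch`). The key is created once per payment attempt and reused for every retry of that attempt. A real client would also add backoff between attempts and decide how to treat a 409 response.

```javascript
const { randomUUID } = require("node:crypto");

async function payWithRetry(fetchImpl, payload, maxAttempts = 3) {
  // One key per user intent, created before the first request is sent.
  const idempotencyKey = randomUUID();
  let lastError;

  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      const res = await fetchImpl("/api/payments", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "Idempotency-Key": idempotencyKey // same key on every attempt
        },
        body: JSON.stringify(payload)
      });
      // In this sketch, only 5xx responses are retried.
      if (res.status < 500) return res;
      lastError = new Error(`Server responded with ${res.status}`);
    } catch (err) {
      // Network failure or timeout: the outcome is unknown, so retry
      // with the SAME key and let the server deduplicate.
      lastError = err;
    }
  }
  throw lastError;
}
```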


Idempotency Is Not a Replacement for Database Integrity

One mistake I see sometimes is treating idempotency as the only protection layer.

It should not be.

Idempotency is part of the design, but you should still use database constraints and transactional boundaries where appropriate.

For example:

  • unique constraints for order numbers
  • unique transaction references
  • ledger constraints
  • gateway transaction IDs
  • consistent status transitions
  • outbox patterns for event publishing

In serious systems, correctness usually comes from layers.

The API idempotency layer prevents duplicate execution at the request boundary.

The database protects final state.

The message queue and worker design protect asynchronous processing.

The reconciliation process protects you when external systems behave unexpectedly.


A Simple Middleware Flow

A clean idempotency middleware can follow this flow:

1. Read Idempotency-Key from the request.
2. Validate that the key is present on the request for dangerous POST endpoints.
3. Create a hash of the request payload.
4. Atomically create a STARTED record with Redis SET NX EX.
5. If the key already exists:
   - compare payload hash
   - if STARTED, return 409 Conflict
   - if COMPLETED, return cached response
   - if FAILED, allow or reject retry based on policy
6. Execute the business operation.
7. Store the final response as COMPLETED.
8. Return the response to the client.

This flow is simple enough to understand, but strong enough to avoid many production problems.
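
The flow above can be sketched as a framework-agnostic function. Everything here is an assumption made for illustration: `InMemoryStore` stands in for Redis (its `setIfAbsent` imitates SET NX, and TTL is left out), and `handleIdempotent` and the request shape are hypothetical. It also caches whatever the operation returns and marks only thrown errors as FAILED, which is one policy among several.

```javascript
const { createHash } = require("node:crypto");

// In-memory stand-in for Redis (assumption). No TTL in this sketch.
class InMemoryStore {
  constructor() { this.data = new Map(); }
  async setIfAbsent(key, value) {
    if (this.data.has(key)) return false;
    this.data.set(key, value);
    return true;
  }
  async get(key) { return this.data.has(key) ? this.data.get(key) : null; }
  async set(key, value) { this.data.set(key, value); }
}

function hashPayload(payload) {
  return createHash("sha256").update(JSON.stringify(payload)).digest("hex");
}

// request:   { idempotencyKey, body }
// operation: async (body) => ({ statusCode, body })
async function handleIdempotent(store, request, operation) {
  const { idempotencyKey, body } = request;
  // Steps 1-2: the key must be present.
  if (!idempotencyKey) {
    return { statusCode: 400, body: { error: "Idempotency-Key header is required." } };
  }
  // Step 3: fingerprint the payload.
  const storeKey = `idempotency:${idempotencyKey}`;
  const payloadHash = hashPayload(body);

  // Step 4: atomically create the STARTED record.
  const acquired = await store.setIfAbsent(storeKey, { state: "STARTED", payloadHash });

  // Step 5: the key already exists, so inspect the record.
  if (!acquired) {
    const record = await store.get(storeKey);
    if (record.payloadHash !== payloadHash) {
      return { statusCode: 400, body: { error: "Idempotency key was reused with a different payload." } };
    }
    if (record.state === "STARTED") {
      return { statusCode: 409, body: { error: "Request is already in progress." } };
    }
    if (record.state === "COMPLETED") {
      return { statusCode: record.statusCode, body: record.responseBody };
    }
    // FAILED: this sketch allows a retry. The transition below is NOT atomic;
    // a real implementation would need an atomic FAILED -> STARTED step.
    await store.set(storeKey, { state: "STARTED", payloadHash });
  }

  try {
    // Step 6: execute the business operation.
    const result = await operation(body);
    // Step 7: store the final response as COMPLETED.
    await store.set(storeKey, {
      state: "COMPLETED",
      payloadHash,
      statusCode: result.statusCode,
      responseBody: result.body
    });
    // Step 8: return the response.
    return result;
  } catch (err) {
    await store.set(storeKey, { state: "FAILED", payloadHash });
    return { statusCode: 500, body: { error: "Internal error. Retry with the same key." } };
  }
}
```

In an Express-style app, a thin middleware would read the `Idempotency-Key` header and the parsed body, call a function like this, and write the returned status and body to the response.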


Practical Rules I Follow Now

After dealing with this in real systems, these are the rules I try to follow:

  1. Any endpoint that creates money movement must support idempotency.
  2. Any endpoint that creates orders, invoices, or irreversible side effects should support idempotency.
  3. The key should represent one user intent, not one HTTP request.
  4. The first write for the idempotency key must be atomic.
  5. Store and compare a payload hash.
  6. Cache successful responses.
  7. Be very careful when caching failures.
  8. Use TTL intentionally.
  9. Do not rely only on frontend button disabling.
  10. Do not use rate limiting as a replacement for idempotency.

That last point is important.

Rate limiting protects infrastructure.

Idempotency protects correctness.

They are not the same thing.


The Real Lesson

The day we accidentally blocked users was not just a payment bug. It was a design lesson.

Our system was technically trying to protect users, but because we did not model retries correctly, we created a worse experience for legitimate customers.

Distributed systems are messy. Networks fail. Clients retry. Users double-click. Gateways time out. Workers reprocess jobs.

A mature backend does not pretend these things will not happen.

A mature backend absorbs them.

Idempotency is one of those patterns that looks simple from the outside, but when you implement it properly, it changes the reliability of the whole system.

For me, the biggest mindset shift was this:

Do not fight duplicate requests. Make duplicate requests safe.

That is the difference between a backend that works in ideal conditions and a backend that survives production.
