amir

Posted on May 28

Why We Accidentally Blocked Our Users: A Deep Dive into Idempotency in Distributed Systems

#backend #distributedsystems #redis #architecture

Redis locks and signed message alternatives

I learned one of my most important distributed-systems lessons the hard way.

We were working on a payment flow connected to an external payment gateway. On paper, the architecture looked solid: microservices, clean database transactions, retry logic, monitoring, and enough security checks to make us feel safe before deployment.

Then production reminded us that real users do not live inside clean architecture diagrams.

Support tickets started coming in. Some users could not complete payments. Some accounts were being blocked too aggressively. At first, it looked like suspicious behavior: multiple payment attempts, repeated payloads, and requests arriving only seconds apart.

But when I dug into the logs, the real problem was not fraud.

It was our own backend.

A user with a slow network clicked the Pay button, waited, saw nothing happen, and clicked again. In another case, the browser retried a request after a timeout. Our backend received multiple identical payment requests within a very short window, and our naive security logic treated them like duplicate transaction anomalies or replay attempts.

We were punishing users for having bad internet.

That incident changed the way I think about payment systems, retries, APIs, and side effects. In distributed systems, you cannot control the network. You cannot control the user's browser. You cannot guarantee that a response will reach the client.

But you must control what happens when the same intent reaches your backend more than once.

That is where idempotency becomes essential.

The Problem Is Not the Retry. The Problem Is the Side Effect.

Retries are normal.

Clients retry. Browsers retry. Mobile networks fail. Gateways time out. Load balancers drop connections. Users double-click buttons. Background workers reprocess jobs. Message queues deliver the same message more than once.

The dangerous part is not receiving the same request multiple times.

The dangerous part is executing the same side effect multiple times.

For example:

charging a card twice
creating two orders
sending duplicate invoices
blocking a user after repeated attempts
reducing inventory multiple times
creating duplicate ledger entries
triggering the same notification several times

In backend engineering, especially around payments and financial workflows, the real question is not:

How do I stop duplicate requests?

The better question is:

How do I make duplicate requests safe?

What Idempotency Actually Means

In mathematics, an operation is idempotent when applying it multiple times produces the same result as applying it once.

In API design, some HTTP methods are naturally expected to be idempotent.

GET should not change state.

PUT can be idempotent because replacing a resource with the same representation multiple times should leave the system in the same final state.

DELETE can also be idempotent because deleting the same resource multiple times still means the resource does not exist.

But POST is different.

A POST /payments request usually means:

Create a new payment.

If the client sends the same request twice, the backend may create two payments unless we intentionally design against it.

That is the core problem.

A payment request represents an intent, not just an HTTP payload. If the user intended to pay once, the system should execute that intent once, even if the request arrives multiple times.

Enter the Idempotency Key

An idempotency key is a unique value generated by the client and sent with the request, usually as an HTTP header:

Idempotency-Key: 7f3f0f4c-98a4-4d8c-9b91-7a2f9e4c5d11

The key says:

This request represents one specific user action. If you see this key again, do not execute the action again. Return the result of the first execution.

That simple idea changes the system from:

I received another request, so I will process it again.

to:

I received the same intent again, so I will return the same result.

This is especially important for:

payments
order creation
wallet transfers
account provisioning
invoice generation
message publishing
background job processing
any workflow where duplicate execution is dangerous

A Good Idempotency Layer Is More Than a Boolean Flag

A common mistake is implementing idempotency like this:

if key exists:
    reject request
else:
    process request

That looks simple, but it is not enough for production systems.

A real implementation needs to answer several questions:

Is the request currently being processed?
Did the request complete successfully?
Should we return a cached response?
Did the request fail with a retryable error?
Is the same key being reused with a different payload?
What happens if two identical requests arrive at the exact same millisecond?

For that reason, I prefer thinking about idempotency as a small state machine.

The State Machine I Like to Use

A practical idempotency record can have these states:

STARTED
COMPLETED
FAILED

`STARTED`

The server received the key and started processing the request.

This state is important because it protects you from concurrent duplicates. If another request arrives with the same key while the first request is still running, the system should not execute the action again.

Usually, I return something like:

409 Conflict

with a response explaining that the request is already in progress.

`COMPLETED`

The operation finished successfully.

At this point, the server stores the final response and returns that same response for future requests with the same key.

This is the key behavior that makes retries safe.

`FAILED`

The operation failed with a server-side or retryable error.

This part depends on your business rules, but in many systems I prefer allowing the client to retry after a true internal failure. The important thing is to be explicit about which failures are cached and which failures are not.
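To make the three states concrete, here is a minimal sketch of the decision they imply. The `decide` function, the action names, and the `retryFailed` option are hypothetical names chosen for illustration; the record fields mirror the ones used in the Redis examples in this post.

```javascript
// Hypothetical helper: given the stored idempotency record (or null when the
// key has never been seen), decide how to treat the incoming request.
const State = Object.freeze({
  STARTED: "STARTED",
  COMPLETED: "COMPLETED",
  FAILED: "FAILED",
});

function decide(record, { retryFailed = true } = {}) {
  // No record yet: this is the first time we see this intent.
  if (record === null) return { action: "EXECUTE" };

  switch (record.state) {
    case State.STARTED:
      // A concurrent duplicate: do not run the side effect again.
      return { action: "REJECT", statusCode: 409 };
    case State.COMPLETED:
      // Replay the stored response.
      return {
        action: "REPLAY",
        statusCode: record.statusCode,
        body: record.responseBody,
      };
    case State.FAILED:
      // Policy decision (assumption): retry true internal failures by default.
      return retryFailed
        ? { action: "EXECUTE" }
        : { action: "REPLAY", statusCode: record.statusCode, body: record.responseBody };
    default:
      throw new Error(`Unknown idempotency state: ${record.state}`);
  }
}
```

Keeping this decision in one pure function makes the policy easy to unit test. One caveat: when a FAILED record leads to a new execution, the move back to STARTED has to be atomic as well, otherwise two retries can race each other.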

Why Redis Works Well Here

You can implement idempotency using PostgreSQL with a unique constraint. In some systems, that is perfectly fine.

But in high-throughput APIs, I usually prefer Redis for the idempotency layer because it gives you:

fast key lookup
atomic operations
natural TTL support
simple distributed locking primitives
low overhead for temporary request metadata

The TTL part matters a lot.

Idempotency keys should not live forever. For many payment-style workflows, a 24-hour TTL is a reasonable starting point. Some systems may need shorter or longer retention depending on reconciliation, compliance, and product behavior.

The Race Condition You Must Handle

The biggest bug in naive idempotency implementations is the race condition.

Imagine two identical requests arrive at the same time:

Request A checks key -> key does not exist
Request B checks key -> key does not exist
Request A processes payment
Request B processes payment

Now you have two charges.

This is why the first write must be atomic.

In Redis, you can use SET with NX and EX:

const lockKey = `idempotency:${key}`;

const acquired = await redis.set(
  lockKey,
  JSON.stringify({
    state: "STARTED",
    payloadHash,
    createdAt: Date.now()
  }),
  "NX",
  "EX",
  86400
);

if (!acquired) {
  // Another request already created this key.
  // Now inspect its current state.
}

NX means "only set this key if it does not already exist."

That one detail is critical. It turns the check-and-set operation into one atomic step.
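To see the effect without a Redis server, here is a small simulation. `FakeRedis`, `handlePayment`, and `demo` are hypothetical names, and the in-memory class only imitates the set-if-absent semantics of `SET ... NX`; it is not a Redis client.

```javascript
// In-memory stand-in for Redis, for illustration only.
class FakeRedis {
  constructor() {
    this.data = new Map();
  }

  // Imitates SET key value NX: resolves "OK" if the key was stored,
  // null if the key already existed. The check and the write happen
  // with no await in between, so nothing can interleave.
  async setNX(key, value) {
    if (this.data.has(key)) return null;
    this.data.set(key, value);
    return "OK";
  }
}

async function handlePayment(store, key, charge) {
  const acquired = await store.setNX(`idempotency:${key}`, "STARTED");
  if (!acquired) return "duplicate"; // someone else owns this intent
  await charge();
  return "charged";
}

async function demo() {
  const store = new FakeRedis();
  let charges = 0;
  const charge = async () => {
    charges += 1;
  };

  // Two identical requests arriving at the same time.
  const results = await Promise.all([
    handlePayment(store, "abc-123", charge),
    handlePayment(store, "abc-123", charge),
  ]);
  return { results, charges };
}
```

Calling `demo()` resolves with one "charged" result, one "duplicate" result, and a charge count of 1. With real Redis the same guarantee comes from the server executing `SET ... NX` as a single command.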

Returning the Cached Response

When the first request completes, we update the idempotency record:

await redis.set(
  `idempotency:${key}`,
  JSON.stringify({
    state: "COMPLETED",
    payloadHash,
    statusCode: 201,
    responseBody: {
      paymentId: "pay_123",
      status: "succeeded"
    },
    completedAt: Date.now()
  }),
  "EX",
  86400
);

Then, if the same request arrives again:

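const raw = await redis.get(`idempotency:${key}`);
// The key may have expired or never existed, so guard against null.
const record = raw ? JSON.parse(raw) : null;

if (record && record.state === "COMPLETED") {
  // Replay the stored response instead of executing the payment again.
  return res.status(record.statusCode).json(record.responseBody);
}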

The client gets a successful response, but the payment is not executed again.

That is the whole point.

The retry becomes harmless.

Payload Fingerprinting: The Edge Case Many People Miss

There is one subtle bug that can become very dangerous.

What if the client reuses the same idempotency key for a different request?

Example:

Key: abc-123
Amount: $10

Then later:

Key: abc-123
Amount: $1,000

If your server only checks the key, it may return the cached response for the old request. That can create serious data integrity problems.

The fix is payload fingerprinting.

When the request first arrives, create a stable hash of the meaningful request body:

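import crypto from "crypto";

// Hash the request body so a reused key with a different payload can be detected.
// Caveat: JSON.stringify keeps key insertion order, so the hash is stable only
// if the client serializes fields in a consistent order.
function createPayloadHash(payload) {
  return crypto
    .createHash("sha256")
    .update(JSON.stringify(payload))
    .digest("hex");
}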

Then store that hash with the idempotency key.

On every retry, compare the incoming payload hash with the stored hash.

if (record.payloadHash !== incomingPayloadHash) {
  return res.status(400).json({
    error: "Idempotency key was reused with a different payload."
  });
}

This prevents key reuse bugs from silently corrupting your workflow.

In my opinion, this is not optional for financial systems.
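One practical wrinkle: `JSON.stringify` follows key insertion order, so `{ amount, currency }` and `{ currency, amount }` produce different hashes even though they describe the same payment. A minimal sketch of one way around that is to serialize with sorted keys before hashing. `canonicalize` is a hypothetical helper, and it assumes the payload came from `JSON.parse`, so it only contains plain JSON types.

```javascript
// Hypothetical helper: serialize a JSON-compatible value with object keys
// sorted, so that key order does not affect the resulting string.
// Assumes the input only contains JSON types (no Dates, Maps, functions).
function canonicalize(value) {
  if (Array.isArray(value)) {
    // Array order is meaningful, so it is preserved.
    return `[${value.map(canonicalize).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.keys(value)
      .filter((k) => value[k] !== undefined) // mirror JSON.stringify
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalize(value[k])}`);
    return `{${entries.join(",")}}`;
  }
  // Strings, numbers, booleans, and null.
  return JSON.stringify(value);
}
```

The output of `canonicalize(payload)` can then be passed to the SHA-256 hash in place of `JSON.stringify(payload)`. It is also worth hashing only the fields that define the intent (amount, currency, recipient) and leaving out volatile ones such as client timestamps.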

Be Careful with 4xx and 5xx Responses

Not every response should be cached the same way.

2xx responses

Usually safe to cache.

The operation succeeded, and future retries should return the same response.

5xx responses

This depends on where the failure happened.

If your service failed before executing the side effect, a retry may be safe.

If your service timed out after calling the payment gateway, you may not know whether the external side effect happened. In that case, you need reconciliation, gateway lookup, or a more careful state transition.

This is where many systems get complicated.

4xx responses

Be very careful.

If the user sent invalid input, you may not want to cache that failure forever. Maybe the user fixes the payload and retries. Maybe the frontend generated the key before validation was complete.

Personally, I do not like blindly caching all 4xx responses for long periods. For many product flows, it is better to reject the invalid request and ask the client to generate a new idempotency key after the user changes the input.

The important thing is to define this behavior intentionally.
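One way to make that intent explicit is to put the policy in a single function. The sketch below encodes the preferences described above; `classifyOutcome`, the returned labels, and the `sideEffectMayHaveRun` flag are hypothetical names, and your own rules may differ.

```javascript
// Hypothetical policy: map the outcome of a request to what the idempotency
// layer should do with it.
function classifyOutcome(statusCode, { sideEffectMayHaveRun = false } = {}) {
  if (statusCode >= 200 && statusCode < 300) {
    // Success: store the response and replay it on retries.
    return "CACHE_RESPONSE";
  }
  if (statusCode >= 400 && statusCode < 500) {
    // Client error: reject, and expect a new key once the input changes.
    return "DO_NOT_CACHE";
  }
  if (statusCode >= 500 && statusCode < 600) {
    // Server error: a retry is only safe if the side effect cannot have run.
    return sideEffectMayHaveRun ? "RECONCILE_BEFORE_RETRY" : "ALLOW_RETRY";
  }
  throw new RangeError(`Unhandled status code: ${statusCode}`);
}
```

The `sideEffectMayHaveRun` flag would be set by the code path that knows whether the gateway call was already attempted, for example after a timeout while waiting for the gateway response.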

Client-Side Key Generation

The idempotency key should usually be generated on the client.

For example, in a checkout page, generate the key when the user starts a specific payment attempt and reuse that key for retries of the same attempt.

const idempotencyKey = crypto.randomUUID();

await fetch("/api/payments", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Idempotency-Key": idempotencyKey
  },
  body: JSON.stringify({
    amount: 1000,
    currency: "USD"
  })
});

If the server generates the key too late, it may not help with network failures between the client and server.

The client needs to be able to say:

I am retrying the same action.

That requires the client to own the key for that action.
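Here is a minimal sketch of a client helper that does this: the key is created once per payment attempt and reused for every retry of that attempt. `postWithRetry` and its options are hypothetical, `fetchImpl` exists only so the function can be tested without a network, and a real client would add backoff between attempts.

```javascript
// Hypothetical helper: POST with retries that all carry the same
// Idempotency-Key, so the server can recognize them as one intent.
async function postWithRetry(url, payload, options = {}) {
  const {
    fetchImpl = fetch, // injectable for tests
    idempotencyKey = crypto.randomUUID(), // one key per user action
    maxAttempts = 3,
  } = options;

  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      const response = await fetchImpl(url, {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "Idempotency-Key": idempotencyKey,
        },
        body: JSON.stringify(payload),
      });
      // Retry only on 5xx; everything else goes back to the caller.
      if (response.status < 500) return response;
      lastError = new Error(`Server error: ${response.status}`);
    } catch (err) {
      // Network failure or timeout: the request may or may not have arrived.
      lastError = err;
    }
  }
  throw lastError;
}
```

The important detail is that the key is created outside the retry loop. Generating a new UUID inside the loop would turn every retry into a new intent and defeat the whole mechanism.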

Idempotency Is Not a Replacement for Database Integrity

One mistake I see sometimes is treating idempotency as the only protection layer.

It should not be.

Idempotency is part of the design, but you should still use database constraints and transactional boundaries where appropriate.

For example:

unique constraints for order numbers
unique transaction references
ledger constraints
gateway transaction IDs
consistent status transitions
outbox patterns for event publishing

In serious systems, correctness usually comes from layers.

The API idempotency layer prevents duplicate execution at the request boundary.

The database protects final state.

The message queue and worker design protect asynchronous processing.

The reconciliation process protects you when external systems behave unexpectedly.

A Simple Middleware Flow

A clean idempotency middleware can follow this flow:

1. Read Idempotency-Key from the request.
2. Require the key on POST endpoints with dangerous side effects, and reject the request if it is missing.
3. Create a hash of the request payload.
4. Atomically create a STARTED record with Redis SET NX EX.
5. If the key already exists:
   - compare payload hash
   - if STARTED, return 409 Conflict
   - if COMPLETED, return cached response
   - if FAILED, allow or reject retry based on policy
6. Execute the business operation.
7. Store the final response as COMPLETED.
8. Return the response to the client.

This flow is simple enough to understand, but strong enough to avoid many production problems.
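The steps above can be sketched as a framework-agnostic handler. Everything named here is hypothetical: `MemoryStore` is an in-memory stand-in for Redis with no TTL handling, and `createIdempotentHandler`, `hashPayload`, and `execute` are illustrative names. In a real service the store methods would map to Redis commands, and the compare-and-set would need a script or transaction.

```javascript
// In-memory stand-in for Redis (illustration only, no TTL).
class MemoryStore {
  constructor() {
    this.data = new Map();
  }
  // Stand-in for SET key value NX EX ttl: true only for the first writer.
  async setIfAbsent(key, record) {
    if (this.data.has(key)) return false;
    this.data.set(key, record);
    return true;
  }
  async get(key) {
    return this.data.get(key) ?? null;
  }
  async set(key, record) {
    this.data.set(key, record);
  }
  // Atomic compare-and-set on the state field.
  async replaceIfState(key, expectedState, record) {
    const current = this.data.get(key);
    if (!current || current.state !== expectedState) return false;
    this.data.set(key, record);
    return true;
  }
}

function createIdempotentHandler({ store, hashPayload, execute }) {
  const conflict = { statusCode: 409, body: { error: "Request is already in progress." } };

  return async function handle({ idempotencyKey, payload }) {
    // Steps 1-2: the key is mandatory for this endpoint.
    if (!idempotencyKey) {
      return { statusCode: 400, body: { error: "Idempotency-Key header is required." } };
    }
    // Step 3: fingerprint the payload.
    const payloadHash = hashPayload(payload);
    const recordKey = `idempotency:${idempotencyKey}`;
    const started = { state: "STARTED", payloadHash };

    // Step 4: atomically claim the key.
    const created = await store.setIfAbsent(recordKey, started);

    // Step 5: the key already exists, so inspect the stored record.
    if (!created) {
      const record = await store.get(recordKey);
      if (!record) return conflict; // expired in between; let the client retry
      if (record.payloadHash !== payloadHash) {
        return { statusCode: 400, body: { error: "Idempotency key was reused with a different payload." } };
      }
      if (record.state === "STARTED") return conflict;
      if (record.state === "COMPLETED") {
        return { statusCode: record.statusCode, body: record.responseBody };
      }
      // FAILED: policy here is to allow one retry to take over the record.
      const tookOver = await store.replaceIfState(recordKey, "FAILED", started);
      if (!tookOver) return conflict;
    }

    // Steps 6-8: run the operation and store the outcome.
    try {
      const result = await execute(payload);
      await store.set(recordKey, {
        state: "COMPLETED",
        payloadHash,
        statusCode: result.statusCode,
        responseBody: result.body,
      });
      return result;
    } catch (err) {
      await store.set(recordKey, { state: "FAILED", payloadHash });
      return { statusCode: 500, body: { error: "Internal error." } };
    }
  };
}
```

In an Express-style middleware, `handle` would be called with the `Idempotency-Key` header and the parsed body, and its return value written to the response. Note that this sketch marks every thrown error as FAILED; if `execute` can time out after calling the gateway, that case needs reconciliation before a retry is allowed.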

Practical Rules I Follow Now

After dealing with this in real systems, these are the rules I try to follow:

Any endpoint that creates money movement must support idempotency.
Any endpoint that creates orders, invoices, or irreversible side effects should support idempotency.
The key should represent one user intent, not one HTTP request.
The first write for the idempotency key must be atomic.
Store and compare a payload hash.
Cache successful responses.
Be very careful when caching failures.
Use TTL intentionally.
Do not rely only on frontend button disabling.
Do not use rate limiting as a replacement for idempotency.

That last point is important.

Rate limiting protects infrastructure.

Idempotency protects correctness.

They are not the same thing.

The Real Lesson

The day we accidentally blocked users was not just a payment bug. It was a design lesson.

Our system was technically trying to protect users, but because we did not model retries correctly, we created a worse experience for legitimate customers.

Distributed systems are messy. Networks fail. Clients retry. Users double-click. Gateways time out. Workers reprocess jobs.

A mature backend does not pretend these things will not happen.

A mature backend absorbs them.

Idempotency is one of those patterns that looks simple from the outside, but when you implement it properly, it changes the reliability of the whole system.

For me, the biggest mindset shift was this:

Do not fight duplicate requests. Make duplicate requests safe.

That is the difference between a backend that works in ideal conditions and a backend that survives production.

Top comments (7)

ANP2 Network • May 28

The Redis lock + idempotency key pattern works, but it pushes the dedup responsibility to the caller — they have to remember to generate the key, send it, and the server has to remember the response for replay. An alternative I've used: have the operation itself be a signed message, so the signature IS the natural idempotency key.

In one settlement flow I work on, a payment-release transaction is keyed on the signed verdict that triggered it (= verifier signs "task passed", ledger debits + credits in the same transaction, verdict signature as dedup key). A double-submitted verdict produces the same signature, hits a primary-key conflict on the ledger insert, second insert silently no-ops. The client never manages idempotency keys; the cryptographic identity of the operation is the key.

Tradeoff: only works when the operation has a stable canonical form that signs the same way on retry. For payloads with timestamps or random nonces, you'd still want explicit idempotency keys.

amir • May 29

Great point. I agree that the canonical form is the hardest part here. I really like this approach because it lets the database handle conflicts with primary keys instead of depending on Redis or other external state. This makes the whole process more predictable and easier to control. I added the “Signed Identity” idea to my draft. Thanks for the inspiration.

Mohammadreza Khalilikhorram • Jun 1

Great lesson.

One of the hardest parts of distributed systems is realizing that retries are a normal behavior, not necessarily malicious traffic.

Idempotency is one of those concepts that looks simple in theory but saves systems from massive production issues in practice. Thanks for sharing the real-world experience.

amir • Jun 2

Thank you so much for reading and sharing your thoughts! I really appreciate your feedback.

Valentyn Kit • Jun 26

A Redis lock only narrows the race window; expiry, a dropped node, or a retry landing right after release all still slip through. The thing that actually closes it is making the side effect idempotent at the system of record: a unique constraint on the idempotency key, and passing that same key to the gateway so it dedupes the charge too. Then the lock is just an optimization. Did you persist the key, or lean on the TTL?
