Chaitanya Srivastav
Why your idempotency implementation is probably broken under concurrent load

Most idempotency implementations are quietly broken under concurrent load.

They work fine in testing. They even work in staging.
Here's why, and what actually needs to happen.

The problem

Network requests fail and get retried. Your payment client times out and fires again. A webhook gets delivered twice. Your order creation endpoint runs twice and charges the customer twice.

The naive fix looks like this:

app.post('/orders', async (req, res, next) => {
  const key = req.headers['idempotency-key']
  if (!key) return next()

  // Check the store for a previous response under this key
  const existing = await store.get(key)
  if (existing) return res.json(existing.response)

  // No entry found — run the handler and cache the result
  const result = await createOrder(req.body)
  await store.set(key, { response: result })
  res.json(result)
})

This looks correct. It isn't.

The race condition

Under concurrent load, two retries with the same key arrive simultaneously:

Thread A: store.get(key) → null   (key doesn't exist yet)
Thread B: store.get(key) → null   (key doesn't exist yet)
Thread A: createOrder()           (runs)
Thread B: createOrder()           (also runs — double charge)
Thread A: store.set(key, result)
Thread B: store.set(key, result)

Both requests pass the get() check before either has written to the store. Both execute the handler. Both charge the customer.

store.get() is atomic — no thread sees it half-done. But atomicity of a single operation doesn't prevent race conditions between operations. The race lives in the gap between get() returning null and set() being called.
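The race is easy to reproduce in a few lines. The following is a hypothetical simulation, not real store code: the await between get() and set() yields the event loop, which is exactly the gap two retries fall into:

```javascript
// Illustrative simulation of the race. The await between the get()
// check and the set() write yields control, letting the duplicate
// request pass the same check before either has written.
const store = new Map()
let charges = 0

async function naiveHandler(key) {
  const existing = store.get(key)
  if (existing) return existing

  await new Promise(r => setImmediate(r))  // simulated store/db latency
  charges++                                // createOrder(): charges the customer
  const result = { charges }
  store.set(key, result)
  return result
}

// Two "simultaneous" retries with the same key:
Promise.all([naiveHandler('abc123'), naiveHandler('abc123')])
  .then(() => console.log(charges))  // 2 — both passed the check, double charge
```

Both calls run synchronously up to the await, so both see an empty store; by the time either writes, the damage is done.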

The fix: collapse check-and-set into one atomic operation

Redis gives you this for free:

SET key value NX EX ttlSeconds

NX means "only set if the key doesn't exist."
EX sets expiry.
Most critically, this is a single atomic command. There is no gap. Only one caller wins across any number of concurrent processes or server instances.

async acquire(key, ttlSeconds) {
  const result = await redis.set(
    key,
    JSON.stringify({ status: 'processing' }),
    { NX: true, EX: ttlSeconds }
  )
  return result === 'OK'  // 'OK' = won, null = lost
}

Now only one request executes the handler. Every other concurrent duplicate sees acquire() return false and gets a 409 in-progress response with a Retry-After header.
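The winner/loser split can be sketched with a hypothetical in-memory analogue of SET NX (illustration only: a Map can't coordinate multiple processes, which is why production code uses Redis, but Node's single-threaded event loop makes the has()/set() pair atomic within one process since there is no await between them):

```javascript
// Hypothetical in-memory stand-in for SET key value NX.
const locks = new Map()

function acquire(key) {
  if (locks.has(key)) return false       // lost: someone else holds the key
  locks.set(key, { status: 'processing' })
  return true                            // won: this caller runs the handler
}

// Two concurrent duplicates with the same key:
console.log(acquire('order-abc123'))  // true  — first caller wins
console.log(acquire('order-abc123'))  // false — duplicate gets the 409 path
```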

What happens if the process crashes between acquire and set?

The lock is stuck as "processing." The next retry arrives and sees a processing lock — it returns 409 and waits. But the handler that acquired the lock is dead. The retry waits forever.

This is why the lock has a TTL (processingTtl). If the process crashes, the lock auto-expires after the TTL (say, 30 seconds) and the next retry re-acquires cleanly. Set processingTtl higher than your p99 handler latency so the lock doesn't expire under a slow-but-alive request.
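A minimal sketch of that expiry logic, using an illustrative in-memory lock (field names like expiresAt are assumptions for the sketch, not the library's internals; Redis handles expiry for you via EX):

```javascript
// Hypothetical TTL-based lock: a missing OR expired entry can be taken.
const locks = new Map()

function acquire(key, ttlMs, now = Date.now()) {
  const lock = locks.get(key)
  if (lock && lock.expiresAt > now) return false  // live lock: back off, retry later
  // Missing or expired — the previous holder finished or crashed; take over.
  locks.set(key, { status: 'processing', expiresAt: now + ttlMs })
  return true
}

const t0 = Date.now()
console.log(acquire('order-abc123', 30_000, t0))          // true  — lock taken
console.log(acquire('order-abc123', 30_000, t0 + 5_000))  // false — holder still alive (or stuck)
console.log(acquire('order-abc123', 30_000, t0 + 31_000)) // true  — TTL elapsed, retry re-acquires
```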

The key reuse problem

There's another failure mode: a client reuses the same idempotency key for a different request.

POST /orders  Idempotency-Key: abc123  body: { item: "keyboard" }  → 201
POST /orders  Idempotency-Key: abc123  body: { item: "mouse" }     → ?

Without validation, the second request silently returns the cached keyboard order. The client thinks they ordered a mouse. This should be rejected.

The fix is fingerprinting — hash the request shape and store it alongside the response. On every duplicate, compare the incoming fingerprint to the stored one. A mismatch means key reuse:

POST /orders { item: "mouse" } with key abc123
→ 422 idempotency_key_mismatch: This key was used with a different request. Use a new key.

Putting it together

Here's the full execution flow of a correct idempotency implementation:

acquire(key) 
  ├── true  → execute handler → set(completed response)
  │                          └── if handler throws → release(key)
  └── false → get(key)
                ├── completed  → validate fingerprint → serve cached response
                └── processing → 409 + Retry-After header
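The whole flow fits in a short function. This is an illustrative in-memory sketch (withIdempotency is a made-up name, not the library's API); within one Node process the get/set pair is atomic because there is no await between them, but across processes you still need the atomic conditional write from earlier:

```javascript
// Hypothetical end-to-end sketch of the flow diagram, in-memory only.
const store = new Map()

async function withIdempotency(key, fp, handler) {
  const entry = store.get(key)
  if (!entry) {
    // acquire: no await between get() and set(), so this pair is
    // atomic within a single Node process (event loop can't interleave)
    store.set(key, { status: 'processing', fingerprint: fp })
    try {
      const response = await handler()
      store.set(key, { status: 'completed', fingerprint: fp, response })
      return { status: 201, body: response }
    } catch (err) {
      store.delete(key)  // release: let a retry run the handler again
      throw err
    }
  }
  if (entry.fingerprint !== fp)
    return { status: 422, body: 'idempotency_key_mismatch' }
  if (entry.status === 'processing')
    return { status: 409, body: 'in progress' }     // plus Retry-After
  return { status: 200, body: entry.response }      // completed: cached response
}
```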

This is what reliability-kit implements — a production-grade idempotency middleware for Express and Fastify, with pluggable store backends so you can use Redis, Postgres, DynamoDB, or any backend that supports a conditional write.

npm install @reliability-tools/express
import { reliability, RedisStore } from '@reliability-tools/express'
import Redis from 'ioredis'

app.use(reliability({
  idempotency: {
    enabled: true,
    store: new RedisStore(new Redis()),
    ttl: 86400,
    fingerprintStrategy: 'full',  // validates method + path + body
  },
}))

The concurrent duplicate race, the stuck lock after a crash, the key reuse fingerprint check — all handled. The handler just handles the request.
