TTLs, key derivation, retry storms, and the six ways teams accidentally let the same charge through twice.
TL;DR
- The idempotency key is the only thing that keeps "the network blinked, let's retry" from turning into two real charges.
- The key has to be client-generated, stable across retries, scoped to the operation, persisted server-side, and TTL'd — miss any one and you have a bug waiting for a busy Friday.
- Replay semantics matter: a retry with the same key must return the same response, not "OK, accepted" again.
- The six classic ways teams get this wrong are all preventable. They just need to be on a checklist.
If your payment API has a /charge endpoint and no idempotency key in the request header, this post is for you. If it has one but you're not sure what the TTL should be — also for you.
The shape of the problem
Every retry in a payment system is a coin flip on a duplicate. The classic failure goes like this:
client ──POST /charge───▶ server (200 OK, settled)
◀──── socket reset ────
client ──POST /charge (retry)──▶ server (200 OK, settled AGAIN)
The server has no way of knowing the second request is the same one. The card was just charged twice. The customer files a chargeback. You eat the fee and (depending on the network) a hit to your dispute ratio.
An idempotency key fixes this with one rule: for a given key, the server returns the same response for a window of time, regardless of how many times the request arrives.
client ──POST /charge key=k1──▶ server (200 OK, settled, store(k1, resp))
◀──── socket reset ────
client ──POST /charge key=k1──▶ server (returns stored resp; no new charge)
Five properties of a key that works
- Client-generated. The point of the key is to survive a retry; if the server generates it, every retry gets a new one. Use a UUIDv4 or <order_id>:<attempt>, or anything else the client controls.
- Stable across retries. The same logical operation must reuse the same key for the entire retry window. If you regenerate on each attempt, you've defeated the mechanism.
- Scoped to the operation. "Charge $42" and "Refund $42" share nothing; reusing a key across them is undefined behavior. Most APIs scope by (merchant, endpoint, key).
- Persisted server-side. Not in memory, not in the load balancer, but in a durable store the next worker can read. Otherwise a process restart between attempt #1 and attempt #2 nukes the protection.
- TTL'd. Keys live for a bounded window (commonly 24h). After that they expire so you can reuse them for new operations and your store doesn't grow forever.
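On the client side, the first two properties come down to where the key is created: once, outside the retry loop. Here is a minimal sketch of that; `charge_with_retries`, `TransientNetworkError`, and the `send(body, headers)` callable are hypothetical names for illustration, not part of any particular HTTP client.

```python
import uuid


class TransientNetworkError(Exception):
    """Hypothetical error: the response never reached the client."""


def charge_with_retries(send, body, max_attempts=3):
    """Call send(body, headers) until it succeeds, reusing one key throughout."""
    key = str(uuid.uuid4())  # generated once, before the first attempt
    headers = {"Idempotency-Key": key}
    last_error = None
    for _ in range(max_attempts):
        try:
            return send(body, headers)
        except TransientNetworkError as exc:
            last_error = exc  # the same key goes out again on the next attempt
    raise last_error


# A fake transport that fails twice, then succeeds, recording the keys it saw.
seen_keys = []


def flaky_send(body, headers):
    seen_keys.append(headers["Idempotency-Key"])
    if len(seen_keys) < 3:
        raise TransientNetworkError("socket reset")
    return {"status": "settled"}


result = charge_with_retries(flaky_send, {"amount": 4200, "currency": "USD"})
```

All three attempts carry the same key. In a real client you would likely also persist the key with the order, so it survives a restart of the client process between attempts.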
A reference implementation
import hashlib
import json
import time

TTL_SECONDS = 24 * 3600  # 24h: fits a typical end-of-day reconciliation window


def fingerprint(body: dict) -> str:
    # Canonical JSON (sorted keys, no whitespace) so equal bodies hash equally
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def handle_charge(req, store, acquirer):
    key = req.headers.get("Idempotency-Key")
    if not key:
        return 400, {"error": "Idempotency-Key required"}

    fp = fingerprint(req.body)
    cached = store.get(key)
    if cached:
        # Same key, different body → reject
        if cached["fingerprint"] != fp:
            return 422, {"error": "Idempotency-Key reuse with different body"}
        # Same key, same body, first attempt still running → nothing to replay yet
        if cached["state"] == "in_flight":
            return 409, {"error": "request in progress"}
        # Same key, same body, finished → replay the previous response
        return cached["status"], cached["response"]

    # First sighting: reserve the slot so concurrent retries serialize
    if not store.set_if_absent(key, {"state": "in_flight", "fingerprint": fp, "ts": time.time()}, ttl=TTL_SECONDS):
        # Another worker is already processing this key: wait or return 409
        return 409, {"error": "request in progress"}

    res = acquirer.charge(req.body)
    store.set(key, {"state": "done", "fingerprint": fp, "status": 200, "response": res, "ts": time.time()}, ttl=TTL_SECONDS)
    return 200, res
Three details worth flagging:
- The fingerprint check (cached["fingerprint"] != fp) is what turns "I sent the same key twice with different amounts" from an exploit into a 422.
- The set_if_absent reservation is what stops two concurrent retries from both calling the acquirer at once. Without it, you've moved the race condition one layer down.
- The 24h TTL is a convention, not a law. Pick it to match your operational window; see the next section.
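The reference implementation assumes a store exposing get, set, and set_if_absent. For unit tests, an in-memory stand-in is enough to exercise the reservation and the TTL. The sketch below is one way to write it; the `InMemoryStore` class and its injectable `clock` parameter are assumptions for illustration, and it is deliberately not durable, so it is a test double rather than something to run in production.

```python
import threading
import time


class InMemoryStore:
    """Test double only: not durable, so unsuitable for production use."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._lock = threading.Lock()
        self._items = {}  # key -> (expires_at, value)

    def _live(self, key):
        """Return the (expires_at, value) pair if present and unexpired."""
        item = self._items.get(key)
        if item is None:
            return None
        if item[0] <= self._clock():
            del self._items[key]  # lazily drop expired entries
            return None
        return item

    def get(self, key):
        with self._lock:
            item = self._live(key)
            return item[1] if item is not None else None

    def set(self, key, value, ttl):
        with self._lock:
            self._items[key] = (self._clock() + ttl, value)

    def set_if_absent(self, key, value, ttl):
        """Atomically store value unless a live entry exists; True if stored."""
        with self._lock:
            if self._live(key) is not None:
                return False
            self._items[key] = (self._clock() + ttl, value)
            return True


# A fake clock makes TTL expiry testable without sleeping.
now = [1000.0]
store = InMemoryStore(clock=lambda: now[0])
first = store.set_if_absent("k1", {"state": "in_flight"}, ttl=60)   # wins the slot
second = store.set_if_absent("k1", {"state": "in_flight"}, ttl=60)  # loses the race
now[0] += 61  # move the fake clock past the TTL
after_expiry = store.get("k1")
```

The lock is what makes set_if_absent a single atomic check-and-write; a plain get followed by set would reintroduce the race between concurrent retries.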
TTL: how long is long enough?
This is where teams disagree. The answer depends on what your retry surface looks like.
| Caller | Realistic retry horizon | TTL guidance |
|---|---|---|
| Browser checkout | seconds → a few minutes | 1h is plenty |
| Mobile app (offline-tolerant) | up to 24h (offline queue) | 24h |
| Server-to-server with manual retry | up to a few days | 48–72h |
| Batch payouts (overnight files) | aligned to settlement cycle | match the cycle (often 24h) |
Two heuristics that survive most arguments:
- Longer than your longest real retry path, shorter than your reconciliation cycle. If a retry can land after you've closed the books on that day, your books are wrong.
- If in doubt, 24h. Then revisit when you have data on actual key reuse age.
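The first heuristic is easy to turn into a guard you can run in a config test. This is a small sketch, assuming you can put a number on your longest retry path and your reconciliation cycle; `check_ttl` and its parameters are hypothetical names, and the strict upper bound is one reading of "shorter than your reconciliation cycle".

```python
from datetime import timedelta


def check_ttl(ttl, longest_retry_path, reconciliation_cycle):
    """Return a list of problems with a proposed TTL; empty means it fits."""
    problems = []
    if ttl < longest_retry_path:
        problems.append("TTL is shorter than the longest retry path")
    if ttl >= reconciliation_cycle:
        problems.append("TTL is not shorter than the reconciliation cycle")
    return problems


# Browser checkout: retries within minutes, 1h TTL, daily reconciliation.
browser = check_ttl(
    ttl=timedelta(hours=1),
    longest_retry_path=timedelta(minutes=5),
    reconciliation_cycle=timedelta(hours=24),
)

# Same 1h TTL, but a client that may retry two hours later.
too_short = check_ttl(
    ttl=timedelta(hours=1),
    longest_retry_path=timedelta(hours=2),
    reconciliation_cycle=timedelta(hours=24),
)
```

The first call returns no problems; the second flags the TTL as shorter than the retry path, which is the double-charge case.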
The six classic ways teams get this wrong
- Regenerating the key on each retry. Almost always a bug in a retry library's "what changes per attempt" config. Fix: generate the key once, persist it with the order, reuse it.
- Storing keys in memory only. Works until your process restarts. Fix: durable store (Postgres unique-index, Redis with persistence, DynamoDB).
- Not scoping the key. A refund retry uses the same key as the original charge → 422 or, worse, undefined. Fix: scope by (merchant, endpoint, key).
- Not fingerprinting the body. Same key + different body = silent acceptance of whichever request arrived first, with no way to tell. Fix: hash the canonical body and compare.
- No reservation between concurrent retries. Two retries arrive on two workers, both miss the cache, both call the acquirer. Fix: INSERT ... ON CONFLICT DO NOTHING or a Redis SET NX.
- TTL shorter than the retry window. Mobile app retries 2 hours later; the key has expired; the charge goes through twice. Fix: TTL ≥ longest legitimate retry path.
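The scoping and reservation fixes combine naturally in SQL: make (merchant, endpoint, key) the primary key and let the insert decide who wins. The sketch below uses Python's built-in sqlite3 with an in-memory database as a stand-in for Postgres; the table and column names are hypothetical, and the ON CONFLICT clause needs SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE idempotency_keys (
        merchant    TEXT NOT NULL,
        endpoint    TEXT NOT NULL,
        key         TEXT NOT NULL,
        fingerprint TEXT NOT NULL,
        state       TEXT NOT NULL,
        PRIMARY KEY (merchant, endpoint, key)
    )
    """
)


def reserve(conn, merchant, endpoint, key, fingerprint):
    """Return True if this caller won the reservation, False if the scoped key exists."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO idempotency_keys (merchant, endpoint, key, fingerprint, state) "
            "VALUES (?, ?, ?, ?, 'in_flight') "
            "ON CONFLICT (merchant, endpoint, key) DO NOTHING",
            (merchant, endpoint, key, fingerprint),
        )
        return cur.rowcount == 1  # 1 row inserted means we own the slot


won_first = reserve(conn, "m1", "/charge", "k1", "fp-abc")   # first sighting
won_retry = reserve(conn, "m1", "/charge", "k1", "fp-abc")   # concurrent retry loses
won_refund = reserve(conn, "m1", "/refund", "k1", "fp-abc")  # different scope, no clash
```

Only the caller whose insert actually created the row goes on to call the acquirer; everyone else reads the stored row and replays or returns 409.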
Idempotency in cascading retries
In a multi-acquirer setup, "retry" can mean two different things:
- Same acquirer, transient network error → same idempotency key. You want the acquirer to deduplicate.
- Different acquirer, after a hard failure → new idempotency key per attempt. The acquirers don't share state; reusing the same key would either be ignored or — worse — happen to clash with someone else's traffic.
A clean way to encode this:
def attempt_key(charge_id: str, attempt: int) -> str:
    return f"{charge_id}:a{attempt}"  # client-stable, per-attempt
charge_id is the logical operation (it lives in your DB and survives retries); attempt increments only when you fail over to a different acquirer.
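To show how the two kinds of retry interact, here is a rough failover loop built around attempt_key. The exception classes, `charge_with_failover`, and the `acquirer(body, key)` call shape are all assumptions for illustration. One design choice is worth noting: when transient retries run out, the outcome is unknown, so this sketch stops instead of failing over, since moving to another acquirer at that point could itself produce a double charge.

```python
class TransientError(Exception):
    """Hypothetical: the response was lost; the charge may or may not have settled."""


class HardFailure(Exception):
    """Hypothetical: the acquirer definitively failed the request."""


def attempt_key(charge_id: str, attempt: int) -> str:
    return f"{charge_id}:a{attempt}"


def charge_with_failover(charge_id, body, acquirers, retries_per_acquirer=3):
    for attempt, acquirer in enumerate(acquirers):
        key = attempt_key(charge_id, attempt)  # new key only when the acquirer changes
        for _ in range(retries_per_acquirer):
            try:
                return acquirer(body, key)
            except TransientError:
                continue  # same acquirer, same key: let it deduplicate
            except HardFailure:
                break  # definitive failure: safe to try the next acquirer
        else:
            # Retries exhausted without a definitive answer: do not fail over blindly.
            raise RuntimeError(f"outcome unknown for {key}; needs reconciliation")
    raise RuntimeError(f"all acquirers failed for {charge_id}")


keys_seen = []


def acquirer_a(body, key):
    keys_seen.append(key)
    if len(keys_seen) == 1:
        raise TransientError("socket reset")
    raise HardFailure("acquirer unavailable")


def acquirer_b(body, key):
    keys_seen.append(key)
    return {"status": "settled", "via": "b"}


result = charge_with_failover("ch_1", {"amount": 4200}, [acquirer_a, acquirer_b])
```

The first acquirer sees ch_1:a0 twice (the transient retry), and the second sees ch_1:a1 once.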
A small checklist for your code review
- [ ] Header name documented (Idempotency-Key) and required on mutating endpoints.
- [ ] Server returns 400 if missing on charge/refund/payout endpoints.
- [ ] Body fingerprint stored alongside the key; mismatch = 422.
- [ ] Reservation pattern prevents concurrent execution.
- [ ] TTL ≥ your longest legitimate retry path and < your reconciliation window.
- [ ] Replay returns the original status code, not a generic 200.
- [ ] Tested with a fault injector that drops responses between server commit and client receipt.
If your checkout endpoint passes all seven, you've taken the most common cause of double charges off the table.
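The last checklist item is the one most often skipped, so here is a minimal sketch of what such a test can look like. `DropResponses` is a hypothetical wrapper that lets the handler finish its work and then throws the response away; the toy handler uses a dict where a real one would use the durable store, and returns 201 so the test can tell a true replay from a generic 200.

```python
class ResponseDropped(Exception):
    """Raised to the client after the server has already committed."""


class DropResponses:
    """Wrap a handler; run it fully, then discard the first `drops` responses."""

    def __init__(self, handler, drops=1):
        self.handler = handler
        self.drops = drops

    def __call__(self, key, body):
        response = self.handler(key, body)  # server-side work happens first
        if self.drops > 0:
            self.drops -= 1
            raise ResponseDropped("response lost after commit")
        return response


def make_handler(acquirer_calls):
    stored = {}  # key -> (status, payload); stands in for the durable store

    def handler(key, body):
        if key in stored:
            return stored[key]  # replay the original status and payload
        acquirer_calls.append(body)
        stored[key] = (201, {"charge_no": len(acquirer_calls)})
        return stored[key]

    return handler


def test_retry_after_dropped_response():
    acquirer_calls = []
    send = DropResponses(make_handler(acquirer_calls), drops=1)
    try:
        send("k1", {"amount": 4200})
    except ResponseDropped:
        pass  # the client never saw the first response
    status, payload = send("k1", {"amount": 4200})
    assert len(acquirer_calls) == 1  # the acquirer was called exactly once
    assert status == 201             # original status code, not a generic 200
    assert payload == {"charge_no": 1}


test_retry_after_dropped_response()
```

Pointing the same wrapper at your real handler, with a real store behind it, is what turns the checklist item into a regression test.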
The next post in this series treats the payout ledger as a real-time read model — same correctness mindset, different shape: append-only events, idempotent projections, end-of-day reconciliation that doesn't need a Friday-night batch. If you want the orchestration-layer view, see the payment orchestration overview.
Author: payments engineer at PaynetEasy (we build payment orchestration and global payouts infrastructure) → payneteasy.com