Mustafa ERBAY

Posted on Jun 7 • Originally published at mustafaerbay.com.tr

Distributed Systems Idempotency Design: 3 Practical Ways

#distributedsystems #career #indiehacker

If you work in a distributed architecture, you have to assume that the network can drop at any moment, packets can be lost, and clients will resend the same request. A gateway timeout caused by a brief spike in network latency can lead the client (or a queue consumer) to send the same payment request a second time. If your system isn’t prepared for this, the user’s card may be charged twice or the warehouse may create two separate shipment orders for the same product.

In twenty years of field work, I have personally had to clean up reconciliation errors worth thousands of dollars that were caused by duplicate operations like these and that brought business processes to a halt. In this post I lay out three practical ways to design idempotency in distributed systems, all of which I have applied in a large‑scale production ERP and in high‑traffic backend services, with technical details, code snippets, and trade‑off analysis.

Why Idempotency Matters in Distributed Systems

Idempotency, as a term in mathematics and computer science, means that applying an operation more than once leaves the state exactly as it was after the first application. The HTTP specification (RFC 7231, since superseded by RFC 9110) defines PUT, DELETE, and the safe methods (GET, HEAD, OPTIONS, TRACE) as idempotent, while POST, which typically writes data or creates resources, carries no such guarantee. In a distributed system, packet loss or delay at the network layer can prevent the client from receiving an “operation successful” response, prompting the client to retry.
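To make the definition concrete, here is a minimal sketch that contrasts an idempotent operation with a non‑idempotent one. The function names and the dictionary that stands in for an order record are illustrative assumptions, not part of any real API.

```python
def set_status(order: dict, status: str) -> dict:
    """Idempotent: repeating the call leaves the same state."""
    order["status"] = status
    return order


def add_charge(order: dict, amount: int) -> dict:
    """Not idempotent: every repetition changes the state again."""
    order["charged"] = order.get("charged", 0) + amount
    return order


order = {"id": "ord_99812"}

set_status(order, "PAID")
set_status(order, "PAID")   # retry: the state is unchanged

add_charge(order, 1500)
add_charge(order, 1500)     # retry: the customer is charged twice

print(order)
```

The second `add_charge` call is exactly what a blind retry of a POST does to a payment endpoint that has no idempotency protection.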

The simplified log below shows a typical problem caused by duplicate requests reaching a payment service. The client receives a gateway timeout although its first request was processed, so it retries, and the same order ends up with a second transaction:

2026-06-07 10:14:02.102 [INFO] POST /api/v1/charges - Payload: {"order_id": "ord_99812", "amount": 1500} - Client_IP: 192.168.12.44
2026-06-07 10:14:03.450 [WARN] HTTP Gateway Timeout (504) returned to client due to internal database latency (1350ms).
2026-06-07 10:14:04.112 [INFO] POST /api/v1/charges - Payload: {"order_id": "ord_99812", "amount": 1500} - Client_IP: 192.168.12.44 (RETRY)
2026-06-07 10:14:04.115 [ERROR] Double-charge detected! Order ord_99812 already processed. Transaction ID: tx_88129-2

If your system cannot distinguish these two requests, two separate financial records are created. Message brokers in distributed architectures typically provide an at‑least‑once delivery guarantee, which makes duplicate consumption inevitable. Therefore, idempotency is not a luxury but a mandatory architectural layer for preserving data integrity.

While working on the supply‑chain integration of a large production ERP, I once saw the same raw‑material order forwarded to a supplier three times because of packet loss. Fixing that kind of error in a large organization took days of phone calls and caused reputational damage.
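A duplicate delivery of that kind can be simulated in a few lines. The sketch below assumes each message carries a `message_id` field and uses an in‑memory set as a stand‑in for a durable store of processed IDs; both are assumptions made for illustration only.

```python
processed_ids: set[str] = set()   # stand-in for a durable store (assumption)
shipments: list[str] = []


def handle_message(message: dict) -> str:
    """Process a message once; skip it if its ID was already handled."""
    message_id = message["message_id"]
    if message_id in processed_ids:
        return "skipped"
    shipments.append(message["order_id"])
    processed_ids.add(message_id)
    return "processed"


# An at-least-once broker may deliver the same message twice.
deliveries = [
    {"message_id": "m-1", "order_id": "ord_99812"},
    {"message_id": "m-1", "order_id": "ord_99812"},  # redelivery
]
results = [handle_message(m) for m in deliveries]
print(results, shipments)
```

In a real consumer the ID check and the business write would need to share one transaction; the three methods in this post are different ways of getting that guarantee.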

Method 1: Unique Request Key and Database Uniqueness Constraints

The most reliable, “nothing can go wrong” approach is to lean on the ACID guarantees of your database engine. The client generates a UUID (the idempotency_key) for each logical operation, reuses it on every retry of that operation, and sends it either in an HTTP header or within the payload. We then protect that key at the database level with a UNIQUE constraint.
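On the client side, the important detail is that the key is generated once per operation and not once per attempt. The sketch below illustrates this with a hypothetical `send` callable and a fake transport that times out once; neither is a real HTTP client, and the retry‑on‑504 rule is an assumption chosen for the demo.

```python
import uuid
from typing import Callable


def send_with_retries(
    send: Callable[[dict, dict], int], payload: dict, max_attempts: int = 3
) -> int:
    """Retry on 504, reusing one idempotency key for all attempts."""
    headers = {"X-Idempotency-Key": str(uuid.uuid4())}  # generated once
    status = 0
    for _ in range(max_attempts):
        status = send(headers, payload)
        if status != 504:
            break
    return status


# Hypothetical transport for the demo: times out once, then succeeds.
seen_keys: list[str] = []


def fake_send(headers: dict, payload: dict) -> int:
    seen_keys.append(headers["X-Idempotency-Key"])
    return 504 if len(seen_keys) == 1 else 201


final_status = send_with_retries(fake_send, {"order_id": "ord_99812", "amount": 1500})
print(final_status, len(seen_keys), len(set(seen_keys)))
```

Both attempts carry the same key, which is what allows the server to recognize the second one as a retry.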

When I built an orders table on PostgreSQL 14+, I discovered that merely defining a unique index isn’t enough; we also need to gracefully handle conflicts by either letting the database raise an error or returning the existing record (UPSERT).

-- Design of idempotency table and unique index on PostgreSQL
CREATE TABLE idempotency_keys (
    key_hash VARCHAR(64) PRIMARY KEY,
    response_code SMALLINT NOT NULL,
    response_body JSONB NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

-- Transactional write scenario linked with the operation table
CREATE OR REPLACE FUNCTION process_payment(
    p_idempotency_key VARCHAR,
    p_order_id VARCHAR,
    p_amount NUMERIC
) RETURNS JSONB AS $$
DECLARE
    v_response JSONB;
BEGIN
    -- Build the response once, as JSONB to match the response_body column
    v_response := jsonb_build_object('status', 'success', 'order_id', p_order_id, 'amount', p_amount);

    -- Claim the key first. A concurrent call with the same key waits here
    -- until this transaction commits or rolls back.
    INSERT INTO idempotency_keys (key_hash, response_code, response_body)
    VALUES (p_idempotency_key, 201, v_response)
    ON CONFLICT (key_hash) DO NOTHING;

    -- FOUND is false when the insert was skipped because of a conflict:
    -- return the response stored by the first call
    IF NOT FOUND THEN
        SELECT response_body INTO v_response FROM idempotency_keys WHERE key_hash = p_idempotency_key;
        RETURN v_response;
    END IF;

    -- Real business logic (e.g., creating payment record) goes here.
    -- If it raises, the key inserted above is rolled back together with it.
    -- ...

    RETURN v_response;
END;
$$ LANGUAGE plpgsql;
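The same pattern can be exercised from application code. The sketch below uses Python's built‑in sqlite3 module purely as a stand‑in for PostgreSQL so that it runs anywhere; the simplified schema and the `payments` table are assumptions for the demo, not the production design.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the production database
conn.execute(
    "CREATE TABLE idempotency_keys (key_hash TEXT PRIMARY KEY, response_body TEXT NOT NULL)"
)
conn.execute("CREATE TABLE payments (order_id TEXT NOT NULL, amount INTEGER NOT NULL)")


def process_payment(key: str, order_id: str, amount: int) -> dict:
    response = {"status": "success", "order_id": order_id, "amount": amount}
    try:
        with conn:  # one transaction: commit on success, roll back on error
            conn.execute(
                "INSERT INTO idempotency_keys VALUES (?, ?)", (key, json.dumps(response))
            )
            conn.execute("INSERT INTO payments VALUES (?, ?)", (order_id, amount))
        return response
    except sqlite3.IntegrityError:
        # The key already exists: return the stored response, write nothing.
        row = conn.execute(
            "SELECT response_body FROM idempotency_keys WHERE key_hash = ?", (key,)
        ).fetchone()
        return json.loads(row[0])


first = process_payment("key-1", "ord_99812", 1500)
retry = process_payment("key-1", "ord_99812", 1500)
payment_count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
print(first == retry, payment_count)
```

The retry receives the same response as the first call, and only one payment row exists, because the primary key rejects the second insert before any business write happens.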

The biggest advantage of this method is that it requires no additional stateful layer (Redis, Memcached, etc.). As long as your database is up, you have an idempotency guarantee. However, every advantage comes with a cost.

In high‑traffic systems (e.g., handling 5,000+ write requests per second), performing a uniqueness check for every request and writing to a B‑tree index can strain disk I/O limits. Moreover, long‑running operations (e.g., calling an external bank API) that keep a database transaction open can quickly exhaust the connection pool (starvation).

⚠️ Connection Pool Blocking

If you have external service integrations, never send an HTTP request to the outside world while a database transaction is open. Doing so can lock your connection pool within seconds.
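One way to follow this rule is to split the work into two short transactions with the external call in between. The sketch below is only an illustration under stated assumptions: sqlite3 stands in for the real database, the PENDING/DONE status column is a hypothetical addition to the key table, and `fake_bank` is a placeholder for the external API.

```python
import sqlite3
from typing import Callable

conn = sqlite3.connect(":memory:")  # stand-in for the production database
conn.execute(
    "CREATE TABLE idempotency_keys ("
    "key_hash TEXT PRIMARY KEY, status TEXT NOT NULL, response TEXT)"
)


def charge(key: str, call_bank: Callable[[], str]) -> str:
    # Transaction 1 (short): claim the key, then commit immediately.
    try:
        with conn:
            conn.execute(
                "INSERT INTO idempotency_keys (key_hash, status) VALUES (?, 'PENDING')",
                (key,),
            )
    except sqlite3.IntegrityError:
        status, response = conn.execute(
            "SELECT status, response FROM idempotency_keys WHERE key_hash = ?", (key,)
        ).fetchone()
        return response if status == "DONE" else "in-progress"

    # No transaction is open while the slow external call runs.
    bank_reference = call_bank()

    # Transaction 2 (short): record the outcome.
    with conn:
        conn.execute(
            "UPDATE idempotency_keys SET status = 'DONE', response = ? WHERE key_hash = ?",
            (bank_reference, key),
        )
    return bank_reference


bank_calls: list[str] = []


def fake_bank() -> str:  # hypothetical external API
    bank_calls.append("call")
    return "bank_ref_1"


first = charge("key-1", fake_bank)
second = charge("key-1", fake_bank)
print(first, second, len(bank_calls))
```

The trade‑off is that a crash between the two transactions leaves the key in PENDING, so a real system would also need a timeout or a recovery job for stale PENDING rows.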

Method 2: Distributed Lock and Token‑Based Validation on Redis

If you want to reduce the load on your database and return responses in milliseconds, you need an in‑memory distributed lock mechanism. This method leverages Redis’s atomic commands and TTL (Time‑To‑Live) feature.

When the client sends an Idempotency-Key, we lock it in Redis with SET key value NX PX milliseconds. The NX flag creates the key only if it does not already exist, and PX gives the key a lifetime in milliseconds. Consequently, a second request with the same key is rejected because the lock is still held.

Below is a simplified example of a Redis‑based, fault‑tolerant idempotency middleware I frequently use in FastAPI backends:

import redis
import time
from typing import Optional

from fastapi import FastAPI, Request, HTTPException, status

app = FastAPI()
# Redis connection pool (at most 50 connections)
redis_pool = redis.ConnectionPool(host='127.0.0.1', port=6379, db=0, max_connections=50)
r = redis.Redis(connection_pool=redis_pool)

def acquire_idempotency_lock(key: str, ttl_ms: int = 5000) -> bool:
    """
    Acquire an atomic lock on Redis.
    Returns False if the key already exists.
    """
    # PX parameter sets TTL in milliseconds, NX writes only if key does not exist
    return bool(r.set(f"idemp:{key}", "PROCESSING", px=ttl_ms, nx=True))

def set_idempotency_response(key: str, response_data: str, ttl_sec: int = 86400):
    """Store the response in Redis when the operation finishes (kept for 1 day)"""
    r.setex(f"idemp_resp:{key}", ttl_sec, response_data)

def get_idempotency_response(key: str) -> Optional[str]:
    """Quickly fetch a previous response from Redis (None if there is none)"""
    val = r.get(f"idemp_resp:{key}")
    return val.decode('utf-8') if val else None

# Declared with plain "def" (not "async def") so FastAPI runs the handler in its
# threadpool; the blocking Redis calls and time.sleep would otherwise stall the event loop.
@app.post("/api/v1/orders")
def create_order(request: Request):
    idemp_key = request.headers.get("X-Idempotency-Key")
    if not idemp_key:
        raise HTTPException(status_code=400, detail="X-Idempotency-Key header is missing")

    # Stage 1: Check if a previous successful response exists
    cached_response = get_idempotency_response(idemp_key)
    if cached_response:
        return {"source": "cache", "data": cached_response}

    # Stage 2: Attempt to acquire distributed lock
    if not acquire_idempotency_lock(idemp_key, ttl_ms=10000): # 10‑second limit
        raise HTTPException(
            status_code=status.HTTP_409_CONFLICT, 
            detail="Another request with the same idempotency key is already in progress."
        )

    try:
        # Real business logic (ERP stock reservation, invoice generation, etc.)
        # Simulated heavy processing time: 150ms
        time.sleep(0.15) 

        result_payload = f"Order created successfully for key {idemp_key}"

        # Stage 3: Store successful response
        set_idempotency_response(idemp_key, result_payload)
        return {"source": "engine", "data": result_payload}

    finally:
        # Clean up the lock (so subsequent retries can proceed in case of error)
        r.delete(f"idemp:{idemp_key}")

The critical part of this code is the finally block. If your business logic throws an exception (e.g., a database connection drops), you must delete the lock immediately; otherwise, a retry a few seconds later would see a “locked” error and fail to proceed.
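One caveat with an unconditional delete: if the handler runs longer than the lock's TTL, the lock may already belong to another request by the time the finally block executes. A common refinement is to store a random owner token in the lock and to delete the key only if the token still matches. The sketch below shows the idea; `FakeRedis` is a hypothetical in‑memory stand‑in so the example runs without a server, and the helper names are assumptions. With redis-py you would pass a real `redis.Redis` client, whose `set(..., nx=True, px=...)` and `eval(script, numkeys, ...)` calls have the same shape.

```python
import uuid

# Lua runs atomically inside Redis: delete the key only if it holds our token.
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""


def acquire_lock(client, key: str, ttl_ms: int = 10000):
    """Return a fresh owner token if the lock was acquired, else None."""
    token = uuid.uuid4().hex
    return token if client.set(key, token, nx=True, px=ttl_ms) else None


def release_lock(client, key: str, token: str) -> bool:
    """Release the lock only when this caller still owns it."""
    return client.eval(RELEASE_SCRIPT, 1, key, token) == 1


class FakeRedis:
    """Hypothetical in-memory stand-in (ignores TTL) for running the sketch."""

    def __init__(self):
        self.data = {}

    def set(self, key, value, nx=False, px=None):
        if nx and key in self.data:
            return None
        self.data[key] = value
        return True

    def eval(self, script, numkeys, key, token):
        # Mirrors RELEASE_SCRIPT: compare the stored token, then delete.
        if self.data.get(key) == token:
            del self.data[key]
            return 1
        return 0


client = FakeRedis()
token_a = acquire_lock(client, "idemp:key-1")
token_b = acquire_lock(client, "idemp:key-1")   # second caller is refused
stale_release = release_lock(client, "idemp:key-1", "someone-elses-token")
own_release = release_lock(client, "idemp:key-1", token_a)
print(token_b, stale_release, own_release)
```

A release attempt with the wrong token leaves the lock in place, so a slow request cannot remove a lock that a later request has acquired.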

The downside of this approach is what happens when Redis runs out of memory (OOM). If an eviction policy like allkeys-lru is active, active locks or cached responses may be evicted, causing the system to silently lose its idempotency guarantee and produce duplicate records.

Method 3: State Machine and Optimistic Locking (Version Column)

In queue‑based (Kafka/RabbitMQ) and event‑driven architectures, the most elegant way to achieve idempotency is to let entities manage their own lifecycle via a State Machine and a version column. This is also known as Optimistic Locking.

Each database record has a state and a version field. State transitions are allowed only according to defined rules. For example, an order can move from PENDING to PROCESSING, but a COMPLETED order cannot be moved back to PROCESSING.

[PENDING, v1]  -->  (Start Process)  -->  [PROCESSING, v2]  -->  (Payment Received)  -->  [COMPLETED, v3]
      |                                        |                                        |
      +------------ (Duplicate Request) ------+------------ (Duplicate Request) ------+
                           |                                                        |
                     [Reject/Ignore]                                          [Reject/Ignore]
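The transition rules in the diagram can be written down as a small lookup table. This is a minimal in‑memory sketch; the `ALLOWED_TRANSITIONS` mapping and the `try_transition` helper are illustrative names, and a real system would enforce the same rule in the database.

```python
ALLOWED_TRANSITIONS = {
    "PENDING": {"PROCESSING"},
    "PROCESSING": {"COMPLETED"},
    "COMPLETED": set(),          # terminal state: nothing is allowed
}


def try_transition(order: dict, target: str) -> bool:
    """Apply the transition if the rules allow it; otherwise change nothing."""
    if target not in ALLOWED_TRANSITIONS[order["status"]]:
        return False
    order["status"] = target
    order["version"] += 1
    return True


order = {"id": "ord_10293", "status": "PENDING", "version": 1}
outcomes = [
    try_transition(order, "PROCESSING"),
    try_transition(order, "PROCESSING"),  # duplicate request: rejected
    try_transition(order, "COMPLETED"),
    try_transition(order, "PROCESSING"),  # moving backwards: rejected
]
print(outcomes, order)
```

Duplicates are rejected without any extra bookkeeping, because the current state alone determines whether a transition is still legal.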

Enforcing this flow at the SQL level means the application does not have to hold any lock or state in memory. The query below shows how to update an order’s state while checking the version:

-- Update state with version control on PostgreSQL
-- If another worker thread has already updated the row, affected rows will be 0.
UPDATE orders 
SET 
    status = 'COMPLETED', 
    version = version + 1,
    updated_at = NOW()
WHERE 
    id = 'ord_10293' 
    AND status = 'PROCESSING' 
    AND version = 4;

If the number of affected rows is 0, we know either the operation has already been performed (the version is now 5) or the record is not in the expected state. In that case the worker does not raise an error; it treats the operation as “already completed” and acknowledges the message (ACK) successfully.
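The affected‑row check can be tried out with Python's built‑in sqlite3 module, used here only as a stand‑in for PostgreSQL. The table layout and the `complete_order` helper are assumptions made for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the production database
conn.execute(
    "CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO orders VALUES ('ord_10293', 'PROCESSING', 4)")
conn.commit()


def complete_order(order_id: str, expected_version: int) -> str:
    """Complete the order only if state and version are as expected."""
    with conn:
        cursor = conn.execute(
            "UPDATE orders SET status = 'COMPLETED', version = version + 1 "
            "WHERE id = ? AND status = 'PROCESSING' AND version = ?",
            (order_id, expected_version),
        )
    # rowcount is 0 when the WHERE clause no longer matches the row.
    return "applied" if cursor.rowcount == 1 else "already-handled"


first = complete_order("ord_10293", 4)
duplicate = complete_order("ord_10293", 4)
row = conn.execute("SELECT status, version FROM orders WHERE id = 'ord_10293'").fetchone()
print(first, duplicate, row)
```

The duplicate call changes nothing and can be acknowledged safely; the version stays at 5 no matter how many times the message is redelivered.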

This method requires no additional locking mechanism, making it highly performant. However, it demands that your workflow be designed around a strict state machine. Implementing it in complex flows where state transitions are ambiguous incurs significant analysis overhead.

Network‑Level Retry Policies and Idempotency‑Key HTTP Header Standards

Beyond application code, the retry policies of API gateways and reverse proxies (Nginx, HAProxy) directly affect idempotency. For example, Nginx’s proxy_next_upstream directive can pass a request to another upstream server after an error or timeout. Since version 1.9.13, Nginx no longer does this for non‑idempotent methods (POST, LOCK, PATCH) once the request has been sent upstream, unless the non_idempotent parameter is set. With that parameter enabled, or on an older version, a first server that started a database transaction but never returned a response opens the door to double processing.

To avoid this, you need to manage non‑idempotent methods carefully in your Nginx configuration. Below is an example Nginx access log that shows how duplicate requests appear:

# Nginx access.log - Client's retry behavior after timeout
192.168.12.44 - - [07/Jun/2026:10:14:02 +0300] "POST /api/v1/charges HTTP/1.1" 504 182 "-" "Go-http-client/1.1" "X-Idempotency-Key: id_9921_abc" upstream_response_time: 1.500s
192.168.12.44 - - [07/Jun/2026:10:14:04 +0300] "POST /api/v1/charges HTTP/1.1" 200 89 "-" "Go-http-client/1.1" "X-Idempotency-Key: id_9921_abc" upstream_response_time: 0.012s

Notice that the second request’s upstream_response_time is only 12 ms: the system recognized the key and returned the stored response (from the idempotency table in Method 1 or the Redis cache in Method 2) instead of running the business logic again.

I have witnessed hundreds of duplicate requests hammering the API layer during a VPS migration while network routes were still stabilizing. If your gateway and code do not recognize and handle the Idempotency-Key header (defined in an IETF draft; the examples in this post use the X-Idempotency-Key variant), keeping the system consistent under that kind of load becomes nearly impossible.

Lessons Learned and Edge‑Case Scenarios from Production

In production, theory rarely matches practice. Systems that look perfect on paper can explode in the wild. Here are two critical failure scenarios I’ve experienced recently, along with the lessons they taught me:

1. Clock Skew and Redis TTL Disaster

In a client project, two API servers had a clock skew of about four seconds. Server A set a Redis idempotency lock using its local clock for TTL, while Server B, seeing a different timestamp, considered the lock “expired” and processed the same key within the same second.

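Solution: Never derive lock expiry from the local clocks of your application servers. A relative TTL (PX or EX) is counted down by the Redis server itself, so client clock skew does not affect it. If you need absolute timestamps, take them from Redis (the TIME command), and keep all servers synchronized to a reliable NTP source.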
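The failure mode is easy to reproduce on paper. The sketch below assumes the lock stored an absolute expiry timestamp computed from a server's local clock; the 3‑second TTL, the clock values, and the `is_expired` helper are hypothetical numbers and names chosen to mirror the four‑second skew described above.

```python
def is_expired(expires_at: float, local_now: float) -> bool:
    """Timestamp-based expiry check, as each server would do it locally."""
    return local_now >= expires_at


true_now = 1_000.0        # hypothetical "real" time in seconds
ttl_seconds = 3.0         # hypothetical lock lifetime

# Server A (correct clock) writes an absolute expiry into the lock.
expires_at = true_now + ttl_seconds

# One second later both servers check the lock.
server_a_now = true_now + 1.0
server_b_now = true_now + 1.0 + 4.0   # Server B's clock runs 4 s ahead

a_sees_expired = is_expired(expires_at, server_a_now)
b_sees_expired = is_expired(expires_at, server_b_now)
print(a_sees_expired, b_sees_expired)
```

Server B treats a lock that is one second old as expired and processes the same key again, which is why the expiry decision should be left to a single clock.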

2. Silent Failure When Redis Hits OOM

The Redis instance behind our financial calculator service ran out of memory, and the configured volatile-lru eviction policy kicked in. Redis evicted some still‑valid idempotency responses to free space. Subsequent retries could not find the previous record, started from scratch, and produced duplicate operations.

Solution: Do not store idempotency data on the same Redis instance that holds cache data. Treat idempotency as critical data. If you must use Redis, set the eviction policy to noeviction so that writes fail loudly, allowing the system to enter a fail‑safe mode instead of silently losing idempotency.

The table below compares the three methods discussed in this article across key metrics:

Criterion	Method 1: DB Unique Constraint	Method 2: Redis Distributed Lock	Method 3: State Machine & Version
Latency	High (disk I/O bound)	Very Low (1‑3 ms, in‑memory)	Low (single UPDATE)
Complexity	Easy	Medium (requires lock management)	High (requires model design)
Data Reliability	Maximum (ACID guarantees)	Medium (OOM or Redis crash risk)	Maximum (DB‑level control)
External Service Compatibility	Poor (long transaction risk)	Excellent (lock before external call)	Medium (state update before external call)
Infrastructure Cost	None (uses existing DB)	Extra (Redis Cluster/Sentinel)	None (uses existing DB)

Conclusion

There is no one‑size‑fits‑all silver bullet for idempotency in distributed systems. If you handle financial or otherwise critical data, Method 1 (DB Unique Constraint) is the safest harbor. For high‑scale, millisecond‑level e‑commerce or payment gateways, you need Method 2 (Redis Lock) with a caching layer. In event‑driven architectures that consume queues, Method 3 (State Machine) gives you the cleanest and most maintainable codebase.

In my own projects I usually adopt a hybrid approach: lock requests at the API‑Gateway level with Redis, and then rely on database unique constraints at the final write stage. This lets me keep both speed and data integrity.

Next step: to further improve fault tolerance in distributed systems, check out my article on the Transactional Outbox Pattern.
