Idempotency Design in Distributed Systems: A Modern Approach

#distributedsystems #backend #postgres #apidesign

When I send an API request in a distributed system and get a "timeout," there are few things as frustrating as not knowing whether that operation actually happened on the server side. The TCP connection might have dropped, the load balancer might have returned a 504 Gateway Timeout, but the database commit might have already gone through in the background. In this moment of uncertainty, if the client sends the request again "just in case" and my system isn't designed according to the principle of idempotency, I end up with two of the same order or a double-billed invoice.

In my 20 years of field experience, the biggest misconception I've seen is the thought that idempotency can be solved by simply "adding a check." In reality, this is an architectural decision that spans from the network layer to database transaction management. Last year, while working on a production ERP, we noticed raw material stocks deviating by 12% due to duplicate production orders coming from operator screens. I realized the problem wasn't "fast-clicking buttons" but a weak idempotency design. In this post, I will explain how I solve this problem in modern systems and which trade-offs I consider.

The Network is a Liar: The Dark Side of Retry Mechanisms

The only way to be sure a network packet reached its destination is to receive an acknowledgment (ACK), but the fact that you didn't receive an ACK doesn't mean the packet didn't arrive. I call this the "Network Layer Paradox." Especially in mobile applications or devices working on weak Wi-Fi networks in factory environments, the packet reaches the server and is processed, but the connection drops while the response is returning. In this case, the "retry" logic on the client side becomes the system's worst enemy.

While examining crash reports from the Play Store for a mobile app I developed, I saw that 4% of users were submitting the same form 3-4 times due to poor connectivity. If I had accepted every request on the backend as a "new order," the database logs would have bloated and inconsistencies would have begun. This is where the concept of the Idempotency-Key comes in. The client should generate a UUID for every unique transaction and send it as a header. When the server sees this key, it should ask, "Have I seen this movie before?"

POST /api/v1/orders
Host: api.mustafaerbay.com
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
Content-Type: application/json

{
  "product_id": "SKU-123",
  "quantity": 10
}

In this structure, if another request comes with the same key after the server returns a 201 Created, it should return the previously generated response (from cache or DB) instead of performing the operation again. However, there is a critical detail here: should you return an HTTP 409 Conflict or the original response? I generally prefer returning the original response so that the state machine on the client side doesn't break.

Idempotency at the Database Level: The PostgreSQL Approach

Doing an if exists check in the application code is often not enough because that insidious plague we call a "race condition" can strike at any moment. When two parallel requests arrive milliseconds apart, both might pass the "no record exists" check. That's why I always rely on "Unique Constraints" at the database level and PostgreSQL's INSERT ... ON CONFLICT capabilities.

In a production project, we needed to perform more than 5,000 inserts per second while processing sensor data from machines. Due to network fluctuations, the same data packets could repeat. Checking at the application layer was consuming 40% more CPU. I found the solution by structuring the database schema like this:

CREATE TABLE idempotency_keys (
    id UUID PRIMARY KEY,
    response_code SMALLINT NOT NULL,
    response_body JSONB NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

-- Usage during transaction:
INSERT INTO idempotency_keys (id, response_code, response_body)
VALUES ('550e8400-e29b-41d4-a716-446655440000', 201, '{"status": "success"}')
ON CONFLICT (id) DO UPDATE SET 
    created_at = EXCLUDED.created_at -- Just verify existence and return the existing record
RETURNING *;

This approach ensures an atomic operation. In PostgreSQL 16+ versions, to prevent WAL (Write Ahead Log) bloating during such heavy write operations, it is essential to "vacuum" the idempotency_keys table at certain intervals (e.g., after 24 hours) or use partitioning to clean up old keys. Remember, idempotency keys shouldn't be stored forever; a TTL (Time To Live) should be determined based on the "finality" period of the operation.

Redis vs PostgreSQL: State Storage Strategies

Which storage unit you choose is directly related to the criticality of the operation. If I'm developing a financial calculator or a payment system, I never trust Redis. Redis, in its default configuration, is a "cache" unit and can lose the last few seconds of data during a crash (unless persistence settings are very aggressive). This means duplicate payments. However, for low-risk operations like a "like" button or "read" status, Redis is perfect.

When using Redis, the command SET key value NX PX 86400000 (with Not Exists and Expiry) is a lifesaver. However, once during a "failover" process in a Redis cluster, we experienced a millisecond synchronization loss while the master was changing, and the same key was written to two different nodes. After that experience, I made it a rule to keep an idempotency table in PostgreSQL for any operation involving "money" or "stock."

⚠️ Performance Trap

Do not add idempotency keys as columns to the main business tables (orders, invoices). This breaks the table's index structure and degrades query performance. Instead, maintain a separate idempotency_log table and associate it with the transaction_id in the main table.

Another advantage of using PostgreSQL is transaction integrity. If the operation fails (rollback), you also delete the idempotency key. In Redis, if the application code crashes, the key might hang there, and the client might receive an "already done" error even though the transaction was never reflected in the DB.

Message Queues and the "Exactly-Once" Lie

When using RabbitMQ or Kafka in distributed systems, be skeptical of those who promise "Exactly-Once Delivery." This is nearly impossible at the network layer. The realistic approach is the duo of "At-Least-Once Delivery" and "Idempotent Consumer." In other words, accept that the message will arrive at least once and prepare the consumer side accordingly.

In one of our projects, while performing an iSCSI supply chain integration, we noticed that orders from the message queue were sometimes processed 2-3 times. The problem was that the consumer would process the message and the connection would drop just as it was about to send the "ACK." Since the message queue (broker) didn't receive the ACK, it would put the message back in the queue. We found the solution by embedding a message_id inside every message and writing this ID to a "processed_messages" table on the consumer side.

# FastAPI + PostgreSQL Consumer Example
async def process_order(message: OrderMessage):
    async with db.transaction():
        # 1. Check if the message was processed before (Idempotency check)
        is_processed = await db.execute(
            "INSERT INTO processed_messages (msg_id) VALUES (:id) ON CONFLICT DO NOTHING",
            {"id": message.msg_id}
        )

        if not is_processed:
            # 2. Already processed, skip silently
            return 

        # 3. Actual business logic
        await create_production_order(message.data)

Thanks to this structure, even if the same message arrives 100 times, only the first message triggers the "real" work due to the unique constraint at the database level. The others remain as just a line in the logs. As I mentioned in my previous [related: PostgreSQL performance tuning] post, if you don't structure the indexes of these tables correctly, consumer lag will start to increase as the system grows.

Idempotency in API Design: Only for POST?

The general consensus is that only POST requests need to be idempotent. This is because GET, PUT, and DELETE are already considered idempotent by nature. However, the practical world isn't always as rosy as RFC documents. For example, a DELETE /user/1 request that deletes a resource will return a 404 when called a second time. This is technically idempotent (the server state doesn't change), but it might be perceived as an "error" on the client side.

My approach is to always use an explicit key, especially for critical "state" changes (canceling, approving, archiving). Even when performing a "cancel" operation (PUT/PATCH), tracking a request_id increases the system's observability. In the architecture we built for a bank's internal platform last month, we mandated an Operation-ID for all mutation (write) requests. This allows us to track the entire journey of a transaction in the logs with a single ID.

Furthermore, the format of idempotency keys is important. Instead of just a random number, deterministic keys that include the type of operation and the time (e.g., order_create_user123_ts20260517) can sometimes be lifesavers during the debugging phase. However, from a security perspective, UUID v4 is always the safest harbor; unpredictability prevents one user from "colliding" with another's key.

Error Management and Operational Experiences

The most common mistake when designing for idempotency is returning only an error code in the "key already exists" scenario. Instead of telling the user "You have already performed this operation," showing the result of the operation as if it were just performed (transparent retry) improves the user experience (UX) tremendously. If the user's internet drops after clicking the button and they see "Order Successfully Received" (without it being created a second time) when they click again, your system is successful.

Once, on an operator panel for a production ERP, the database fell into a lock while a work order was being approved. The operator pressed the button 10 times. If the idempotency key were only checked at the "insert" moment, the first request would wait in the lock while the others would receive errors. As a solution, we added a "Processing" status to the key table. If the status for a key was "processing," we started returning "Please wait, your transaction is in progress" (HTTP 102 Processing) to other incoming requests.

// HTTP 102 Response
{
  "status": "processing",
  "message": "Your transaction is currently being handled by a worker.",
  "retry_after": 2
}

This approach prevents the system from assigning the same task to multiple workers, especially for long-running background jobs (like report generation or AI-based production planning). If you want to optimize resource consumption (CPU/RAM), you should blend these types of "locking mechanisms" with idempotency.

Conclusion: A Pragmatic Approach

There is no such thing as a 100% guarantee in distributed systems; there is only minimizing risks. Idempotency is the most powerful tool we have to manage these risks. When designing, ask yourself: "How much would the company lose if this operation were performed twice?" and "Where can I perform this check with the least performance loss?".

My clear position is this: For all financial and operational write transactions, a mandatory idempotency key must be used at the database level. Checks in the application code can only be a "pre-layer," but the database should always have the final word. Instead of complex libraries or "magic" frameworks, a simple idempotency_keys table and a well-structured middleware are enough to keep your system running for years.

In the next post, I will explain how I use the duo of systemd timer and PostgreSQL partitioning to clean up these keys. For the next steps, you can check out the [related: Distributed Log Monitoring] guide.