Mustafa ERBAY

Posted on Jun 7 • Originally published at mustafaerbay.com.tr

#distributedsystems #guide #software

Idempotency in Distributed Systems: Even If You Process Multiple Times, The Result Remains The Same

While developing a production ERP system, an incident in the order confirmation mechanism pushed me to think deeply about idempotency. Because of network connection issues or accidental double-clicks, users could send the same request multiple times. If each of these requests produced a different outcome, our stock could appear short, our invoices could conflict, or our entire system could descend into chaos. This is precisely where the concept of idempotency comes into play.

In distributed systems, unpredictable situations like network delays, server errors, or repeated requests from the client side are quite common. In such scenarios, idempotency, which guarantees that executing an operation multiple times will yield the same result as executing it once, is critically important for the stability and reliability of our systems. In this post, I will explain what idempotency is, why it's important, and the different approaches used to solve this problem, blending them with my own experiences.

What is Idempotency and Why is it Vitally Important?

Simply put, an operation is idempotent if executing it multiple times produces the same result as executing it once. This is vitally important, especially in distributed systems, where we are unsure if a request has reached the server due to network issues or timeouts. If a request is not idempotent and is sent again due to a network error, the same operation could be triggered twice, leading to data inconsistency.

For example, consider a money deposit operation to a user's account. If this operation is not idempotent and the user accidentally presses the "Send" button twice, money could be deposited into the account twice. Such an error is unacceptable in financial systems. With idempotency, each request carries an "operation ID": once the first request has been processed, later requests with the same ID are recognized and the operation is not executed again. This ensures the system's stability and accuracy.
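The deposit example can be sketched in a few lines. This is a minimal illustration, assuming a toy in-memory account; the `Account` class, its method names, and the `operation_id` parameter are hypothetical and exist only for this sketch.

```python
class Account:
    """Toy in-memory account; stands in for a real ledger."""

    def __init__(self):
        self.balance = 0
        self._seen_operations = set()

    def deposit_naive(self, amount):
        # Not idempotent: every repeated call changes the balance again.
        self.balance += amount
        return self.balance

    def deposit_idempotent(self, operation_id, amount):
        # Idempotent: a repeated operation_id is recognized and skipped.
        if operation_id not in self._seen_operations:
            self._seen_operations.add(operation_id)
            self.balance += amount
        return self.balance


naive = Account()
naive.deposit_naive(100)
naive.deposit_naive(100)              # accidental double-click
naive_balance = naive.balance         # the deposit was applied twice

safe = Account()
safe.deposit_idempotent("op-1", 100)
safe.deposit_idempotent("op-1", 100)  # same request sent again
safe_balance = safe.balance           # the deposit was applied once
```

The only difference between the two methods is the lookup on the operation ID before the balance is touched.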

ℹ️ A Real-Life Example

In the order creation process on an e-commerce site, receiving payment confirmation is a critical step. If a network error occurs after payment confirmation is received but before the order status is updated, the system might not know that the payment was successfully received. In this case, the client side will try to re-send the operation. If the order creation operation is not idempotent, the same order could be created multiple times. This situation leads to serious problems in inventory management and customer satisfaction.

Methods for Ensuring Idempotency: Alternative Approaches

There are several different approaches to ensure idempotency. The chosen method can vary depending on the system's complexity, performance requirements, and existing infrastructure. Here are some commonly used methods:

1. Unique Request IDs

This is one of the most common and effective ways to ensure idempotency. Each client request is sent with a unique ID that can be identified by the server. The server receives this ID and stores it in a database or cache. When a request arrives, the server first checks if this ID has been processed before. If the ID has been seen before, it returns the result of the previous operation instead of re-executing the request.

Things to consider with this approach:

ID Generation: The ID must stay the same across all retries of one operation, so it is usually generated on the client side (or issued by the server before the operation starts). Structures like UUID (Universally Unique Identifier) are commonly used.
ID Storage: Operation IDs should be stored for a specific period, for example an hour or a day. This period is determined by the system's transaction volume and error tolerance.
Performance: Performing a database query for every request can lead to performance issues. Therefore, ID checking is usually done on a fast cache (e.g., Redis).

In one scenario, I used this method for transaction records sent to the API of an accounting software. For each transaction record, an X-Request-ID header was added. On the server side, we used the incoming X-Request-ID as a Redis key and stored the transaction's status (success, failed, processing) as the value. If another request came with the same ID, we would read the status from Redis and return a response accordingly. This prevented the same operation from being triggered twice, even in case of a network timeout of approximately 500ms.

# Example Python (FastAPI) snippet
from fastapi import FastAPI, Request, Response  # Response is used for the 409 reply below
import redis

app = FastAPI()
redis_client = redis.Redis(host='localhost', port=6379, db=0)

@app.middleware("http")
async def idempotency_middleware(request: Request, call_next):
    request_id = request.headers.get("X-Request-ID")
    if not request_id:
        # If there's no ID, we could generate a new one and add it to the header,
        # but in that case, we'd also need to consider the client's retry mechanism.
        # For now, we'll let requests without an ID proceed normally.
        response = await call_next(request)
        return response

    # Check if the ID exists in Redis
    if redis_client.exists(request_id):
        # If the ID exists, it means it has been processed before.
        # Returning the previous response can be complex.
        # For simplicity, let's just return success information or an error.
        # In a real scenario, the previous response might need to be saved.
        print(f"Request ID {request_id} already processed. Skipping.")
        return Response(content="Already processed", status_code=409) # Conflict

    # Proceed with the operation
    response = await call_next(request)

    # If the operation was successful, save the ID to Redis
    if response.status_code < 400: # For successful requests
        redis_client.setex(request_id, 3600, "processed") # Valid for 1 hour

    return response

# Simple endpoint example
@app.post("/process_payment")
async def process_payment(request: Request):
    # Real payment processing logic goes here
    print("Processing payment...")
    return {"message": "Payment processed successfully"}

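This middleware checks the incoming request's X-Request-ID header. If the header is present and the ID is already in Redis, it responds with a 409 Conflict status; otherwise it runs the handler and, if the operation succeeds, saves the ID to Redis for one hour so that later requests with the same ID are blocked. Two limitations are worth noting. The existence check and the later write are separate steps, so two concurrent requests with the same ID can both pass the check; an atomic "set if absent" operation closes that gap. The middleware also returns 409 rather than the original response, so replaying the earlier result would require storing it. The biggest disadvantage of this approach is that it requires an additional data store (like Redis) to track the idempotency status.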
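One way to avoid a gap between checking an ID and recording it is to claim the ID in a single atomic step before doing any work. The sketch below illustrates the idea with a hypothetical `InMemoryKeyStore` guarded by a lock, so it runs without a Redis server; it is not the middleware above. With redis-py, a comparable claim can be made in one call with `redis_client.set(key, value, nx=True, ex=ttl_seconds)`, which only sets the key if it does not exist yet.

```python
import threading


class InMemoryKeyStore:
    """Stand-in for a shared cache; a lock makes set_if_absent atomic."""

    def __init__(self):
        self._lock = threading.Lock()
        self._keys = {}

    def set_if_absent(self, key, value):
        # Check and write happen under one lock, so only one caller can win.
        with self._lock:
            if key in self._keys:
                return False
            self._keys[key] = value
            return True


def handle(store, request_id, results):
    # Claim the ID first; only the winner runs the business logic.
    if store.set_if_absent(request_id, "processing"):
        results.append("executed")
    else:
        results.append("duplicate")


store = InMemoryKeyStore()
results = []
threads = [
    threading.Thread(target=handle, args=(store, "req-42", results))
    for _ in range(10)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

executed_count = results.count("executed")
duplicate_count = results.count("duplicate")
```

Ten concurrent requests with the same ID result in a single execution. If the handler fails after claiming the ID, the key should be deleted or marked as failed so that a legitimate retry is not blocked.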

2. State Tracking

In this method, the status of each operation (e.g., PENDING, PROCESSING, COMPLETED, FAILED) is stored in a database. When a request arrives, the server first checks if a record exists in the database with the relevant operation ID. If a record exists, it acts based on the current status of the operation.

If the operation is COMPLETED, it returns the previous successful result.
If the operation is FAILED, it re-throws the error or offers a retry option.
If the operation is PENDING or PROCESSING, it rejects the request or waits for the operation to finish.

This method may require more data storage space than unique request IDs but offers more detailed control over the operation flow. It is particularly useful for long-running operations.
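The status rules listed above can be sketched as a small decision function. The `Status` enum mirrors the statuses named in the text; the `decide` function and the action strings it returns are hypothetical names chosen for this illustration.

```python
from enum import Enum


class Status(Enum):
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


def decide(existing_status):
    """Map the stored status of an operation ID to the server's next action."""
    if existing_status is None:
        return "execute"                  # first time this ID is seen
    if existing_status is Status.COMPLETED:
        return "return_saved_result"
    if existing_status is Status.FAILED:
        return "return_error_or_retry"
    # PENDING or PROCESSING: another attempt is still in flight
    return "reject_or_wait"


actions = {
    ("NEW" if status is None else status.name): decide(status)
    for status in (None, *Status)
}
```

Keeping the decision in one pure function makes it easy to test each status separately from the database access.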

In a financial data integration project, we had an ETL (Extract, Transform, Load) process that handled large datasets. We kept a job_id for each ETL job. We stored this job_id in the database and updated the job's status (QUEUED, RUNNING, SUCCESS, FAILED). If a request came with the same job_id, we read its status from the database and, if it was SUCCESS, returned a success message, or if it was FAILED, returned the error information. This prevented the same ETL job from being triggered multiple times due to network interruptions or client errors and ensured data integrity. In this system, the status of each job was stored for 24 hours.

-- PostgreSQL example
CREATE TABLE IF NOT EXISTS idempotency_keys (
    request_id UUID PRIMARY KEY,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    processed_at TIMESTAMP WITH TIME ZONE,
    status VARCHAR(20) NOT NULL DEFAULT 'PENDING' -- PENDING, PROCESSING, COMPLETED, FAILED
);

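-- When starting an operation (claims the key atomically; returns no row if the key already exists):
INSERT INTO idempotency_keys (request_id, status) VALUES ('<incoming_uuid>', 'PENDING')
ON CONFLICT (request_id) DO NOTHING
RETURNING request_id;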

-- When checking a request:
SELECT status FROM idempotency_keys WHERE request_id = '<incoming_uuid>';

-- After the operation is completed:
UPDATE idempotency_keys SET status = 'COMPLETED', processed_at = CURRENT_TIMESTAMP WHERE request_id = '<incoming_uuid>';

-- When the operation fails:
UPDATE idempotency_keys SET status = 'FAILED' WHERE request_id = '<incoming_uuid>';

This SQL snippet creates a table named idempotency_keys in which request_id is the primary key and status records the state of the operation. Before an operation starts, it is recorded as PENDING. If another request arrives with the same request_id, its status is checked. When the operation completes or fails, the status is updated. This approach is more durable than a Redis cache but can increase database load.
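The flow around this table can be sketched from application code. To keep the example self-contained, it uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL, a simplified two-column table, and SQLite's `INSERT OR IGNORE` syntax; the function names `start_operation` and `finish_operation` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idempotency_keys ("
    " request_id TEXT PRIMARY KEY,"
    " status TEXT NOT NULL DEFAULT 'PENDING')"
)


def start_operation(request_id):
    """Try to claim request_id; return (claimed, current_status)."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO idempotency_keys (request_id) VALUES (?)",
        (request_id,),
    )
    claimed = cur.rowcount == 1  # 0 rows inserted means the key already existed
    status = conn.execute(
        "SELECT status FROM idempotency_keys WHERE request_id = ?",
        (request_id,),
    ).fetchone()[0]
    return claimed, status


def finish_operation(request_id, succeeded):
    # Record the final outcome so later retries can read it.
    conn.execute(
        "UPDATE idempotency_keys SET status = ? WHERE request_id = ?",
        ("COMPLETED" if succeeded else "FAILED", request_id),
    )


first = start_operation("job-1")       # claims the key
finish_operation("job-1", succeeded=True)
second = start_operation("job-1")      # retry of the same request
```

Only the caller that actually inserted the row proceeds with the work; every other caller reads the stored status and responds accordingly.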

3. Atomic Operations

In some cases, idempotency can be achieved by leveraging the atomic nature of database operations. Atomic operations either succeed entirely or fail entirely. This is useful, especially in scenarios that can be handled with a single database operation.

For example, statements like INSERT ... ON CONFLICT DO NOTHING (PostgreSQL) or INSERT IGNORE (MySQL) skip the insert if a record with the same key (primary key or unique constraint) already exists, while INSERT ... ON CONFLICT DO UPDATE overwrites the existing record instead. Either way, executing the statement multiple times leaves a single row for that key.

I used this method while working on a table storing user preferences for the backend of a mobile application. Each update request from the mobile app was made unique by a combination of user_id and preference_key. This (user_id, preference_key) pair had a UNIQUE constraint in the database. When a request arrived, a simple INSERT INTO user_preferences (user_id, preference_key, preference_value) VALUES (...) ON CONFLICT (user_id, preference_key) DO UPDATE SET preference_value = EXCLUDED.preference_value; query was executed. If a record with this (user_id, preference_key) pair already existed, the query would update only the preference_value without error. This prevented data inconsistency for the mobile app, which had approximately 1500 customers, even with a 2-second network delay.

-- PostgreSQL example
INSERT INTO user_preferences (user_id, preference_key, preference_value)
VALUES (123, 'theme', 'dark')
ON CONFLICT (user_id, preference_key)
DO UPDATE SET preference_value = EXCLUDED.preference_value;

This SQL statement inserts or updates a record in a table where the (user_id, preference_key) pair is unique. If a record with the same pair exists, the DO UPDATE block runs and only preference_value is updated. Because the statement sets an absolute value rather than applying a relative change, running it any number of times leaves the same row, which makes the operation idempotent.
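The effect of repeating the upsert can be checked with a short script. This sketch uses Python's built-in `sqlite3` module instead of PostgreSQL so that it runs standalone; it assumes the bundled SQLite is version 3.24 or newer, which is when the `ON CONFLICT ... DO UPDATE` clause was added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_preferences ("
    " user_id INTEGER, preference_key TEXT, preference_value TEXT,"
    " UNIQUE (user_id, preference_key))"
)

UPSERT = (
    "INSERT INTO user_preferences (user_id, preference_key, preference_value)"
    " VALUES (?, ?, ?)"
    " ON CONFLICT (user_id, preference_key)"
    " DO UPDATE SET preference_value = excluded.preference_value"
)

# The same request delivered three times, e.g. because of client retries.
for _ in range(3):
    conn.execute(UPSERT, (123, "theme", "dark"))

rows = conn.execute(
    "SELECT user_id, preference_key, preference_value FROM user_preferences"
).fetchall()
```

The table ends up with one row regardless of how many times the request is delivered. A relative update in the DO UPDATE clause, such as incrementing a counter, would not have this property.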

4. Client-Side Idempotency Keys

This approach is similar to the unique request ID method, but the client is solely responsible for generating and managing the key. The server validates the key sent by the client and uses it to track the operation's status. This removes key generation from the server but requires more complex logic on the client side, which must keep the key and reuse it on every retry of the same operation.

It's practical, especially for client applications consuming APIs. The client generates a unique key for each new operation and sends it with the request. The server checks if this key has been processed before. If it has, it resends the previous response. In this method, the server needs to store the keys sent by the client and their corresponding responses.

In an enterprise software project, when integrating with different systems, we used a custom-generated correlation_id for each integration request. This correlation_id was known by both the client and the server. The server used the incoming correlation_id to store the operation's status and result in the database. If another request came with the same correlation_id, the server would directly return the saved result. This was successfully used for over 10,000 operations with a network latency tolerance of approximately 1 second.
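The interaction between a retrying client and a server that stores responses can be sketched as follows. Everything here is a toy model: the `Server` class, the `send_with_retries` function, and the simulated lost reply are hypothetical and only illustrate the pattern, not the integration described above.

```python
import uuid


class Server:
    """Toy server that stores the response for each idempotency key."""

    def __init__(self):
        self.responses = {}          # key -> saved response
        self.orders_created = 0
        self.drop_next_reply = True  # simulate one lost response

    def create_order(self, idempotency_key, payload):
        if idempotency_key in self.responses:
            response = self.responses[idempotency_key]  # replay, no new order
        else:
            self.orders_created += 1
            response = {"order_id": self.orders_created, "item": payload["item"]}
            self.responses[idempotency_key] = response
        if self.drop_next_reply:
            self.drop_next_reply = False
            raise TimeoutError("response lost on the network")
        return response


def send_with_retries(server, payload, max_attempts=3):
    key = str(uuid.uuid4())  # generated once, reused for every retry
    for _ in range(max_attempts):
        try:
            return server.create_order(key, payload)
        except TimeoutError:
            continue  # the outcome is unknown, so retry with the same key
    raise RuntimeError("all attempts failed")


server = Server()
response = send_with_retries(server, {"item": "keyboard"})
```

The first attempt creates the order but its reply is lost; the retry carries the same key, so the server replays the saved response instead of creating a second order.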

Trade-offs and Considerations

As with any technical decision, there are trade-offs in the methods for ensuring idempotency:

Performance vs. Reliability: Methods like unique IDs and state tracking require additional database or cache queries, slightly reducing performance. However, this prevents much larger problems that could arise from data inconsistency.
Complexity: Some methods (e.g., state tracking) require additional logic and infrastructure that are more complex to implement and manage.
Storage Duration: Deciding how long operation IDs or statuses should be stored is important. Very short durations increase the risk of errors, while very long durations increase storage costs. Generally, the storage duration should be longer than the longest window in which a client may still retry the request, not just a single network timeout.
Idempotency Key Security: If client-generated keys are used, ensure these keys are unpredictable (for example, random UUIDs) and scoped to the authenticated client, so that one client cannot collide with or reuse another client's key.
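The storage duration trade-off can be made concrete with a small sketch. The `ExpiringKeyStore` class is hypothetical, and the clock is passed in explicitly as `now` (in seconds) so the behavior is easy to follow without waiting for real time to pass.

```python
class ExpiringKeyStore:
    """Remembers keys for ttl_seconds; the clock is passed in explicitly."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._expires_at = {}

    def claim(self, key, now):
        # A key can be claimed if it was never seen or its entry has expired.
        expiry = self._expires_at.get(key)
        if expiry is not None and now < expiry:
            return False
        self._expires_at[key] = now + self.ttl_seconds
        return True


store = ExpiringKeyStore(ttl_seconds=60)
first = store.claim("req-1", now=0)                # original request
retry_in_window = store.claim("req-1", now=30)     # retry while the key is stored
retry_after_expiry = store.claim("req-1", now=90)  # retry after the key expired
```

A retry that arrives after the key has expired is treated as a new request and executed again, which is why the duration should cover the client's full retry window.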

⚠️ A Point to Note

Idempotency concerns the state of the system after the operation, not the response the caller sees. It guarantees that executing an operation multiple times leaves the system in the same state as executing it once; it does not guarantee that every call returns an identical response. For example, a repeated delete may report "not found" on the second call even though the end state is the same. Conversely, adding a log entry is not idempotent on its own: each repetition adds another record, and if the entry carries a fresh timestamp, every added record is different. Such an operation only becomes idempotent once it is deduplicated by a key. In the real world, idempotency is most meaningful for operations that require a state change.

Conclusion: Idempotency is Indispensable for Reliable Systems

In distributed systems, idempotency is an indispensable principle for ensuring the stability and data integrity of our systems in the face of unpredictable network conditions and client behaviors. Different methods such as unique request IDs, operation state tracking, and atomic operations provide us with the tools to implement this principle. Whichever method we choose, the main goal is to guarantee that executing an operation multiple times yields the same result as executing it once.

That issue I experienced with the order confirmation in a production ERP system showed me that idempotency is not just a theoretical concept but a practical and powerful tool for solving the operational challenges we face in the real world. Overlooking this principle when designing our systems can pave the way for bigger and more costly problems down the line. Therefore, I believe that every developer and architect working on distributed systems should understand and implement the concept of idempotency well.

Next, we will look at how to apply idempotency in more complex distributed system components such as message queues.
