Mustafa ERBAY

Originally published at mustafaerbay.com.tr

Idempotency in Distributed Systems: The Realities of Design

Introduction: The Hidden Cost of Idempotency

Ever since I started working with distributed systems, I've heard the word "idempotency" constantly, but it took me a while to fully grasp what it means and how hard it is in practice. The concept seems simple in theory, yet getting it wrong can lead to months-long debugging sessions, data inconsistencies, and, worst of all, customer complaints. The principle that repeating an operation should not change the system's state beyond what the first execution did sounds appealing, but it is not easy to achieve in real-world scenarios involving network latency, server crashes, and unexpected retries.

In my experience, skipping idempotency or designing a system without sufficient consideration often led to serious financial consequences, such as a payment being processed twice or an order being accidentally created multiple times. Over the years, I've encountered these kinds of problems repeatedly, and each time, I've understood anew how critical this fundamental principle is. Even in the financial calculators of one of my side products, a simple double-trigger of a user action caused data inconsistency, showing me that idempotency isn't exclusive to large-scale systems.

Core Principle and Expectations: Why Is It So Hard?

Idempotency means that applying an operation multiple times has the same effect as applying it once. Mathematically, we can think of it as f(f(x)) = f(x). In HTTP, the GET, PUT, and DELETE methods are defined as idempotent: refreshing a page any number of times does not change server state, and repeating a PUT that replaces a resource leaves it in the same final state. POST (and, in general, PATCH) carries no such guarantee, so it's our responsibility to ensure idempotency for those state-changing operations.
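To make the f(f(x)) = f(x) property concrete, here is a minimal sketch. The function names and the order dictionary are hypothetical, chosen only for illustration: assigning a status is idempotent, while adding to a quantity is not.

```python
def set_status(order: dict, status: str) -> dict:
    """Idempotent: assigning the same value again changes nothing."""
    return {**order, "status": status}


def add_quantity(order: dict, quantity: int) -> dict:
    """Not idempotent: every call adds to the running total."""
    return {**order, "quantity": order["quantity"] + quantity}


order = {"status": "NEW", "quantity": 1}

paid_once = set_status(order, "PAID")
paid_twice = set_status(paid_once, "PAID")
print(paid_once == paid_twice)  # True

added_once = add_quantity(order, 1)
added_twice = add_quantity(added_once, 1)
print(added_once == added_twice)  # False
```

Operations of the second kind are the ones that need an explicit idempotency mechanism when they can be retried.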

So, why is it so difficult in distributed systems? The main reasons are network unreliability and partial system failures. When one service calls another, the response might be delayed or never arrive. In such cases, the client usually retries the operation. If the operation completed on the server side but the response didn't reach the client, the second attempt will trigger the operation twice. These scenarios are our data inconsistency nightmares. In a production ERP, a supply order call in a supply chain integration once ran twice due to a momentary network fluctuation, leading to the same product being ordered twice. Such errors directly affect the overall reliability and reputation of the system.

ℹ️ Network and Retries

In distributed systems, it's common for a client to send a request multiple times (retry) due to network delays or partial failures. Idempotency ensures the system remains consistent even during these retries.
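The lost-response scenario can be simulated in a few lines. This is a toy sketch, not a real client: PaymentServer and call_with_retry are hypothetical names, and the "network" is just an exception raised after the side effect has already happened.

```python
class PaymentServer:
    """Toy server that applies the charge, then loses the first response."""

    def __init__(self):
        self.charges = []
        self.drop_next_response = True

    def charge(self, amount):
        self.charges.append(amount)  # the side effect happens first
        if self.drop_next_response:
            self.drop_next_response = False
            raise TimeoutError("response lost on the way back")
        return "ok"


def call_with_retry(server, amount, attempts=3):
    """Naive client: retries on timeout without any idempotency key."""
    for _ in range(attempts):
        try:
            return server.charge(amount)
        except TimeoutError:
            continue
    raise RuntimeError("gave up after retries")


server = PaymentServer()
result = call_with_retry(server, 100)
print(result, server.charges)  # ok [100, 100]
```

The client sees a single successful call, but the server has recorded two charges. Nothing in the request lets the server recognize the retry as a repeat of the first attempt.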

Idempotency in Practice: Solutions at Various Layers

There is no single "silver bullet" for achieving idempotency; different strategies must be applied at different layers. I've developed solutions tailored to the dynamics of each layer, such as databases, API design, and message queues.

Idempotency at the Database Layer

The most common way to achieve idempotency at the database level is to use a unique transaction ID (idempotency key). This key is generated by the client and sent with each request. The server uses this key to check if the same operation has been performed before. For example, in a payment transaction, a value like transaction_id or request_id can be stored with a unique constraint in the database.

In an internal banking platform, double-clicking by a user or retrying after a network error during transfer operations could lead to the same transfer occurring twice. To solve this, we added a UUIDv4-based client_request_id to each transfer request. We defined a UNIQUE constraint for this field in the database.

CREATE TABLE transactions (
    id SERIAL PRIMARY KEY,
    client_request_id UUID UNIQUE NOT NULL,
    account_id INT NOT NULL,
    amount NUMERIC(10, 2) NOT NULL,
    status VARCHAR(50) NOT NULL DEFAULT 'PENDING',
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

When a request arrived, I checked whether a record with this client_request_id already existed. If it did, I returned its current status; otherwise, I created a new record and proceeded with the operation. Because a separate check followed by an insert can race under concurrent requests, it is safer to let the UNIQUE constraint decide: PostgreSQL's INSERT ... ON CONFLICT (client_request_id) DO UPDATE or DO NOTHING construct performs both steps atomically. This approach still brings some challenges. For example, what happens if a second request arrives while the operation's status is PENDING? In this case, it might be necessary to wait for the result of the first operation or define a specific timeout period. [Related: Concurrency Control in PostgreSQL]
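The insert-or-return-existing flow can be sketched as follows. To keep it runnable without a database server, this uses Python's built-in sqlite3 as a stand-in for PostgreSQL and catches the unique-constraint violation instead of using ON CONFLICT. The create_transfer function and the simplified column types are assumptions made for the example.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE transactions (
        id INTEGER PRIMARY KEY,
        client_request_id TEXT UNIQUE NOT NULL,
        account_id INTEGER NOT NULL,
        amount TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'PENDING'
    )
    """
)


def create_transfer(conn, client_request_id, account_id, amount):
    """Return (row_id, status, created); created is False for a replayed request."""
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "INSERT INTO transactions (client_request_id, account_id, amount) "
                "VALUES (?, ?, ?)",
                (client_request_id, account_id, amount),
            )
            return cur.lastrowid, "PENDING", True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint rejected a duplicate key: report the existing row.
        row = conn.execute(
            "SELECT id, status FROM transactions WHERE client_request_id = ?",
            (client_request_id,),
        ).fetchone()
        return row[0], row[1], False


request_id = str(uuid.uuid4())
first = create_transfer(conn, request_id, 42, "100.00")
retry = create_transfer(conn, request_id, 42, "100.00")
row_count = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
print(first, retry, row_count)
```

The database, not the application, decides which request wins, so two concurrent requests with the same key cannot both create a row.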

Idempotency in API Design

Ensuring idempotency in APIs is typically done by using a special HTTP header (Idempotency-Key) or a field within the request body. The client generates a unique key for each operation and sends this key to the server. The server uses this key to cache the request or check it in the database.

We had an order creation API on an e-commerce site. We received complaints that when a customer got stuck on the payment page and then refreshed the page or pressed the browser's back button, the same order was created twice. To solve this problem, I decided to use the Idempotency-Key header.

POST /api/orders HTTP/1.1
Idempotency-Key: e4a7f0e9-7b3b-4c7b-8b5e-9f0a2c3d4e5f
Content-Type: application/json

{
    "items": [{"product_id": 123, "quantity": 1}],
    "customer_id": 456
}

On the backend, I stored this key in Redis for a limited period (e.g., 24 hours) to check whether incoming requests were duplicates. If a request with the same key had arrived before and was still being processed, I would return the pending operation's current status. If the operation was completed, I would return the completed result. This can be done efficiently with Redis's atomic SET command using the NX option (set only if the key does not exist) together with EX for the expiry; the older SETNX (SET if Not eXists) command performs the same check but cannot set the expiry in the same step. However, it's important to back this up with persistent storage to avoid inconsistencies if Redis crashes or the data is evicted.
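The claim-then-complete flow might look like the sketch below. It is only an illustration: IdempotencyStore is an in-memory stand-in for a key-value store with set-if-absent and TTL semantics (it is not a Redis client and is not safe across processes), and handle_create_order and the 409 response for in-progress requests are assumptions.

```python
import time


class IdempotencyStore:
    """In-memory stand-in for a key-value store with set-if-absent and TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, record)

    def _get_live(self, key):
        entry = self._entries.get(key)
        if entry is None or entry[0] <= self._clock():
            self._entries.pop(key, None)  # missing or expired
            return None
        return entry[1]

    def claim(self, key):
        """Reserve the key. Returns None if the claim succeeded, else the existing record."""
        existing = self._get_live(key)
        if existing is not None:
            return existing
        record = {"state": "PENDING", "response": None}
        self._entries[key] = (self._clock() + self._ttl, record)
        return None

    def complete(self, key, response):
        expires_at, _ = self._entries[key]
        self._entries[key] = (expires_at, {"state": "COMPLETED", "response": response})


def handle_create_order(store, idempotency_key, payload, orders):
    existing = store.claim(idempotency_key)
    if existing is not None:
        if existing["state"] == "PENDING":
            return 409, {"error": "request still in progress"}
        return 200, existing["response"]  # replay the stored result
    order_id = len(orders) + 1
    orders.append({"id": order_id, **payload})
    response = {"order_id": order_id}
    store.complete(idempotency_key, response)
    return 201, response


store = IdempotencyStore(ttl_seconds=24 * 60 * 60)
orders = []
key = "e4a7f0e9-7b3b-4c7b-8b5e-9f0a2c3d4e5f"
first = handle_create_order(store, key, {"customer_id": 456}, orders)
second = handle_create_order(store, key, {"customer_id": 456}, orders)
print(first, second, len(orders))
```

The second call returns the stored order ID instead of creating a new order. The injectable clock exists only so that expiry can be tested without waiting.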

Idempotency in Message Queues and Event-Driven Systems

Message queues (Kafka, RabbitMQ, etc.) and event-driven architectures are cornerstones of distributed systems. Due to the "at-least-once" delivery guarantee of messages in these systems, there is always a risk of a message being processed multiple times. Therefore, message handlers (consumers) must be idempotent.

In a manufacturing company's ERP, sensor data from the production line was dropped into a Kafka topic, and then this data was processed to update inventory. Sometimes, due to restarts of Kafka consumers or network outages, the same sensor data message was consumed twice, leading to an incorrect increase in inventory. To prevent this, I added a unique event_id to each sensor message.

{
    "event_id": "7b1e9c2d-3a4f-5b6c-7d8e-9f0a1b2c3d4e",
    "sensor_id": "PRD-LINE-01",
    "product_id": "ITEM-ABC",
    "quantity_produced": 10,
    "timestamp": "2026-05-31T10:00:00Z"
}

On the consumer side, before starting to process each incoming event_id, I checked it against a processed_events table, added it if it didn't exist, and then started the operation. If the record already existed, I skipped the message. This can be implemented with a UNIQUE constraint on PostgreSQL or another database. This approach becomes more robust when combined with optimistic locking in scenarios where the message processing order is also important. [Related: Real-Time Systems with Event Sourcing]
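A minimal sketch of this consumer-side check, again using sqlite3 as a stand-in for the real database. The inventory table and the handle_sensor_event function are assumptions. The deduplication marker and the stock update share one transaction, so a crash between them cannot leave an event marked as processed but not applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
    CREATE TABLE inventory (product_id TEXT PRIMARY KEY, quantity INTEGER NOT NULL);
    INSERT INTO inventory VALUES ('ITEM-ABC', 0);
    """
)


def handle_sensor_event(conn, event):
    """Apply the event once; return False when it is a duplicate delivery."""
    try:
        with conn:  # one transaction: marker and stock update commit together
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["event_id"],),
            )
            conn.execute(
                "UPDATE inventory SET quantity = quantity + ? WHERE product_id = ?",
                (event["quantity_produced"], event["product_id"]),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # event_id already recorded, skip the message


event = {
    "event_id": "7b1e9c2d-3a4f-5b6c-7d8e-9f0a1b2c3d4e",
    "sensor_id": "PRD-LINE-01",
    "product_id": "ITEM-ABC",
    "quantity_produced": 10,
}
applied_first = handle_sensor_event(conn, event)
applied_again = handle_sensor_event(conn, event)
quantity = conn.execute(
    "SELECT quantity FROM inventory WHERE product_id = 'ITEM-ABC'"
).fetchone()[0]
print(applied_first, applied_again, quantity)  # True False 10
```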

Trade-offs and Real-World Scenarios

Ensuring idempotency comes with costs. The most prominent ones are performance, complexity, and storage requirements. Sending a unique key with each request, checking this key, storing it, and returning the appropriate response adds overhead to the system.

Performance: Adding a database or cache query for every operation can reduce the system's overall throughput. Especially in high-traffic systems, these additional queries can severely strain CPU and I/O resources. In the backend of one of my side products, when I added idempotency checks to every request in the early stages, I saw API response times increase by 15%, especially when Redis slowed down. Therefore, applying idempotency only where it's truly critical, rather than to every API call, is a matter of balance.

Complexity: Writing, testing, and maintaining code that implements idempotency logic adds extra complexity. Especially when dealing with distributed transactions or multi-stage operations, managing the lifecycle and validity period of the idempotency key becomes challenging.

Storage: Additional storage space is needed to store idempotency keys and associated transaction statuses. How long this data is stored varies according to system requirements. Storing past transaction keys indefinitely rapidly increases database size. Typically, a period between 24 hours and 7 days is considered sufficient, but in critical situations like financial transactions, this period might be longer. System-level limits like journald rate limits have taught us lessons in managing such logs.
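A retention policy like this can be expressed as a small purge routine. The sketch below is illustrative only: the RETENTION windows, the record layout, and the "kind" field are assumptions, and in a real system the purge would more likely be a scheduled DELETE or a store-level TTL.

```python
from datetime import datetime, timedelta, timezone

# Assumed retention windows; tune these to your own retry and audit needs.
RETENTION = {
    "default": timedelta(hours=24),
    "payment": timedelta(days=7),
}


def purge_expired(records, now):
    """Keep only records still inside the retention window for their kind."""
    kept = []
    for record in records:
        window = RETENTION.get(record["kind"], RETENTION["default"])
        if now - record["created_at"] < window:
            kept.append(record)
    return kept


now = datetime(2026, 6, 3, tzinfo=timezone.utc)
records = [
    {"key": "a", "kind": "default", "created_at": now - timedelta(hours=30)},
    {"key": "b", "kind": "payment", "created_at": now - timedelta(hours=30)},
    {"key": "c", "kind": "default", "created_at": now - timedelta(hours=1)},
]
kept_keys = [r["key"] for r in purge_expired(records, now)]
print(kept_keys)  # ['b', 'c']
```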

⚠️ Misplaced Idempotency

Instead of automatically adding idempotency to all operations, focusing only on critical, state-changing operations that can be affected by retries reduces performance costs and keeps system complexity manageable.

Overlooked Corners and My Mistakes

In my experience, there have been some points often overlooked in idempotency design, and mistakes I've made myself:

  1. Idempotency Key Lifecycle: Once, while integrating payments for a client's project, I set the idempotency key's validity period too short. When a user closed their browser during payment provisioning and reopened it 30 minutes later, the retry carried the same Idempotency-Key, but the system treated it as a new transaction because the key had already expired from Redis. This resulted in the same payment being charged twice. The key's lifetime should be chosen with the transaction completion guarantee and the longest plausible retry window in mind. Generally, 24 hours has been a good starting point for me.

  2. Incorrect Response States: When I make an operation idempotent, if the first request is successful but the response doesn't reach the client, then upon a second request, I must return the result of the first request. Simply stating "operation already performed" is not enough; it's important for the client to receive the result of that initial successful operation (e.g., order number, transaction ID). When I overlooked this, it could lead to a false perception on the client side, such as "operation failed."

  3. Transaction State Transitions: Especially in long-running or multi-step operations, storing the current state of the operation along with the idempotency key becomes critical. If a second request arrives while an operation is in a PENDING state, we should either hold this request or ensure the first operation continues. Optimistic locking or the transactional outbox pattern helps manage such situations. Last month, in the backend of my own side product, when I accidentally put a sleep 360 command into a systemd unit, I experienced an OOM-killed scenario. This left the operation half-finished, and because the idempotency key's state was not managed correctly, the system remained in an inconsistent state. The lesson learned was to switch to polling-wait mechanisms for long-running operations and to update the idempotency key not just when the operation completes, but at every state transition. [Related: Service Management with SystemD]

  4. Database Deadlocks and Performance: UNIQUE constraints and transactions can cause database deadlocks in high concurrency situations. Especially when using Serializable isolation level in PostgreSQL, serialization failure errors need to be managed. These errors require the application to correctly implement its retry logic.
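For the last point, the retry logic can be sketched as a small wrapper. The SerializationFailure class is a hypothetical stand-in for whatever error your database driver raises for SQLSTATE 40001, and run_with_retry and its parameters are assumptions for the example. Because the transaction body may run more than once, it must itself be idempotent.

```python
import random
import time


class SerializationFailure(Exception):
    """Hypothetical stand-in for the driver's serialization-failure error (SQLSTATE 40001)."""


def run_with_retry(txn, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Run txn(), retrying serialization failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except SerializationFailure:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            sleep(delay)


attempts = {"count": 0}


def flaky_transfer():
    """Fails twice with a serialization error, then commits."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise SerializationFailure("could not serialize access")
    return "committed"


result = run_with_retry(flaky_transfer, sleep=lambda _seconds: None)
print(result, attempts["count"])  # committed 3
```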

Beyond Idempotency: Event Sourcing and Immutable Logs

While idempotency is a cornerstone of reliability in distributed systems, it's not the only solution for architecture. In more complex scenarios, especially when requirements like auditing and replaying events to reconstruct past states exist, approaches like event sourcing and immutable logs come into play.

In an event sourcing architecture, the system's state is not directly manipulated; instead, all changes that occur in the system are recorded as an event stream. Each event represents a fact and is inherently immutable. By replaying this event stream from the beginning, we can reconstruct the system's state at any given moment. This approach handles duplicate writes naturally, because attempting to record an event twice is usually prevented by a unique constraint on the event ID. If an event with the same event ID has already been recorded, the system ignores it the second time. Consumers and projections that react to those events still need to be idempotent themselves.
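The two properties described here, duplicate-safe appends and state reconstruction by replay, can be shown with a small in-memory sketch. StockEvent and EventStore are hypothetical names, and a real event store would persist the log and enforce uniqueness with a database constraint rather than a Python set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StockEvent:
    event_id: str
    product_id: str
    delta: int  # positive for inflow, negative for outflow


class EventStore:
    """Append-only log that ignores events whose ID was already recorded."""

    def __init__(self):
        self._events = []
        self._seen = set()

    def append(self, event):
        if event.event_id in self._seen:
            return False  # duplicate: the log is left unchanged
        self._seen.add(event.event_id)
        self._events.append(event)
        return True

    def replay(self, upto=None):
        """Rebuild stock levels from the first `upto` events (all when None)."""
        stock = {}
        for event in self._events[:upto]:
            stock[event.product_id] = stock.get(event.product_id, 0) + event.delta
        return stock


store = EventStore()
inflow = StockEvent("evt-1", "ITEM-ABC", 10)
outflow = StockEvent("evt-2", "ITEM-ABC", -3)

store.append(inflow)
store.append(outflow)
duplicate_accepted = store.append(inflow)  # same event delivered again

print(duplicate_accepted)    # False
print(store.replay())        # {'ITEM-ABC': 7}
print(store.replay(upto=1))  # {'ITEM-ABC': 10}
```

Replaying a prefix of the log gives the state as it was at that point, which is what makes the audit and history use cases possible.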

In my production ERP, when using AI for production planning, tracking product stock movements was critical. In a classic CRUD (Create, Read, Update, Delete) approach, when a stock movement record was accidentally updated or deleted twice, it was very difficult to trace the history. By switching to event sourcing, we recorded every "product inflow," "product outflow," or "stock adjustment" action as an event. We assigned unique event_ids to these events. Thus, even if an event was triggered twice, the system's consistency was maintained, and we could review all past movements, like audit logs. This was a very valuable feature, especially for cost calculations and IFRS integrations.

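# Example event sourcing event structure (helpers below are in-memory stand-ins)
from dataclasses import dataclass
from datetime import datetime

class BaseEvent:
    """Marker base class for domain events."""

@dataclass(frozen=True)
class ProductStockAdjusted(BaseEvent):
    event_id: str
    product_id: str
    quantity: int
    reason: str
    timestamp: datetime

_processed_event_ids: set[str] = set()
_stock: dict[str, int] = {}

def event_id_processed(event_id: str) -> bool:
    return event_id in _processed_event_ids

def mark_event_as_processed(event_id: str) -> None:
    _processed_event_ids.add(event_id)

def update_product_stock(product_id: str, quantity: int) -> None:
    _stock[product_id] = _stock.get(product_id, 0) + quantity

# Event handler
def handle_product_stock_adjusted(event: ProductStockAdjusted) -> None:
    # Idempotency check: has event_id been processed before?
    if event_id_processed(event.event_id):
        print(f"Event {event.event_id} already processed. Skipping.")
        return

    # Update stock and record the event ID (ideally in one transaction)
    update_product_stock(event.product_id, event.quantity)
    mark_event_as_processed(event.event_id)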

This model also offers a more transparent path in distributed transactions and eventual consistency scenarios. However, event sourcing carries its own complexities, such as maintaining the correct order of events, taking snapshots, and managing read models. Nevertheless, the reliability and auditability it provides in critical business processes can be worth this complexity.

Conclusion: Idempotency is a Tool, Not an End

Idempotency is an indispensable part of distributed system architecture and an important tool that enhances the reliability of our systems. However, it's crucial to view it as a tool rather than an end goal. Instead of blindly adding idempotency to every operation, it's essential to apply it where it's truly needed, with the right trade-offs. Considering performance, complexity, and storage costs, choosing the most suitable solution for the project's requirements leads to a more sustainable architecture in the long run.

In my career, every mistake regarding idempotency has shown me that system design is not just a technical matter, but a discipline that requires a deep understanding of operational processes, user experience, and business logic. A correct idempotency strategy not only ensures data consistency but also increases users' trust in your system. My preference is to start with idempotency key usage primarily at the database and API layers, and then transition to more comprehensive patterns like event sourcing when needed. This approach is key to pragmatically building reliability.

Top comments (1)

David Loibner

I especially liked the part about pending states and not just returning 'already processed'.

I was wondering how you would think about this when the client is not a normal service, but an AI agent.

A normal service usually retries because of a timeout or network issue. An agent may retry for messier reasons: it did not understand the previous result, lost track of the workflow, or calls the same state-changing tool again with slightly different wording.

So the same idempotency problem appears, but one layer higher.

Before the agent changes external state, maybe there should be a boundary that checks: is this already pending, already done, partly failed, or a duplicate effect?

From your perspective, should that logic stay inside each target API/tool, or does it make sense to have a separate layer for only that?