DEV Community

Cover image for Idempotency in Distributed Systems: Design Patterns Beyond 'Retry Safely'
Alok Ranjan Daftuar
Alok Ranjan Daftuar

Posted on • Originally published at aloknecessary.github.io

Idempotency in Distributed Systems: Design Patterns Beyond 'Retry Safely'

Every engineer in distributed systems has heard: "make your operations idempotent." It gets repeated in design reviews and dropped into architecture docs as if the word itself is a solution.

The gap between understanding what idempotency means and actually building systems that enforce it under real-world failure conditions is enormous.

"Retry safely" is the starting point — not the destination.


Idempotency vs. Deduplication

Teams often conflate these. They are related but not the same:

  • Idempotency — the outcome is the same regardless of how many times the operation is applied
  • Deduplication — detecting and discarding duplicate requests so the operation executes exactly once

A PUT /users/{id} is naturally idempotent. A POST /orders is not — calling it ten times creates ten orders. To make it safe to retry, you need deduplication. And deduplication requires idempotency keys.


Idempotency Keys

A client-supplied identifier that scopes a request to a single logical operation. Key design decisions:

  • Client ownership — the client knows the intent; it generates the key
  • Time-bounded validity — keys should not live forever; align TTL with your retry window
  • Granularity — one key per logical operation, not per HTTP request
  • Request fingerprinting — hash the payload to detect same key with different payloads (client bugs)

The Two-Phase Reservation Pattern

The naive check-then-act pattern has a race condition. Two concurrent requests with the same key can both pass the existence check.

The correct approach:

  1. Reservation — atomically insert the key as IN_PROGRESS (insert-if-not-exists)
  2. Completion — after processing, update to COMPLETED with the response payload

This eliminates concurrent duplicate processing. Every production idempotency implementation should use this or an equivalent.


API Gateway vs. Application Layer

  • Gateway-level — short-circuit caching for the happy path; fast but doesn't protect your data layer
  • Application-level — owns the deduplication store, handles correctness, coordinates multi-step workflows

In production, they're complementary. The gateway is a performance optimization; the application layer is where correctness lives.


Failure Scenarios Most Teams Miss

  1. Completed-but-undelivered response — operation succeeds, connection drops before response reaches client
  2. Key reuse across different operations — client bug sends different payload with same key
  3. Concurrent requests with the same key — race condition without atomic reservation
  4. Partially-applied multi-step operations — saga crashes mid-workflow, retry re-runs completed steps
  5. Message queue consumers without idempotency — API-layer protection doesn't help queue consumers
  6. TTL expiry during active retry windows — key expires while client is still retrying
  7. Schema evolution breaking retry flows — new schema + old key = fingerprint mismatch rejection

Key Takeaway

Idempotency done right is a system-wide property, not a feature you add to an endpoint. It requires deliberate decisions about key semantics, durable deduplication stores, clear responsibility allocation between gateway and application layers, and explicit handling of failure scenarios beyond "retry on 5xx."

"Retry safely" is where you start. Building a system that is actually safe to retry under real-world conditions is the harder, more interesting problem.


Read the Full Article

This is a summary of my comprehensive deep dive into idempotency patterns. The full article covers each pattern in detail with production-grade implementation strategies:

👉 Idempotency in Distributed Systems — Full Article

The full article includes:

  • Detailed deduplication store architecture with Redis, PostgreSQL, and DynamoDB trade-offs
  • Complete two-phase reservation pattern walkthrough
  • All 7 failure scenarios with mitigations
  • Cross-cutting concerns: observability, testing strategy, and downstream service idempotency

Top comments (0)