- Book: Event-Driven Architecture Pocket Guide: Saga, CQRS, Outbox, and the Traps Nobody Warns You About
- Me: xgabriel.com | GitHub
You charged a customer twice last month. The fix in the post-mortem was "make the consumer idempotent." Two sprints later, you charged a different customer twice. Same root cause, different placement.
That's the bug at the centre of this post. The phrase "make it idempotent" sounds like one thing. It is at least three things, and they fail in different ways.
Why "make the consumer idempotent" is incomplete advice
Idempotency is a property of an operation: applying it twice does the same thing as applying it once. It's not a technique. The techniques are how you arrange your code so that the property holds.
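To make the property concrete, here is a minimal sketch with two operations on a hypothetical `Account` record (the type and function names are invented for illustration). Setting a status twice leaves the same state as setting it once; adding to a balance twice does not.

```go
package main

import "fmt"

// Account is a hypothetical record used only for this illustration.
type Account struct {
	Status  string
	Balance int
}

// SetStatusPaid is idempotent: the end state is the same however
// many times it runs.
func SetStatusPaid(a *Account) { a.Status = "paid" }

// AddFive is not idempotent: every extra run changes the balance again.
func AddFive(a *Account) { a.Balance += 5 }

func main() {
	a := &Account{Status: "pending"}

	SetStatusPaid(a)
	SetStatusPaid(a) // second run changes nothing

	AddFive(a)
	AddFive(a) // second run adds another 5

	fmt.Println(a.Status, a.Balance) // paid 10
}
```

The three techniques below are different ways of making the second kind of operation behave like the first.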
Three techniques dominate in real systems:
- A dedup key in durable storage. Every message has an ID; you record IDs you've processed; you skip ones you've seen.
- A dedup window in memory. Same idea, but the record lives in process memory and only for a window.
- Idempotency by design. The handler is mathematically idempotent. There is nothing to skip because doing it twice has the same effect.
Each one catches a different failure mode. None catches all of them. A team I worked with had Placement 1 (a dedup key in Redis), congratulated themselves, and shipped a duplicate charge two weeks later. The retry ran 90 seconds after the original, well inside their TTL, but the Redis cluster had failed over and the keyspace was empty. Wrong placement for that failure mode.
So before code, the matrix:
| Placement | Catches | Misses | Latency |
|---|---|---|---|
| Dedup key in storage | Genuine duplicates from at-least-once delivery | Reordering, semantic duplicates with different IDs | ~1–3 ms |
| Dedup window in memory | Hot-path duplicates within seconds | Anything after process restart or partition rebalance | ~0.01 ms |
| Idempotency by design | Everything, including replays from last Tuesday | Nothing on the consumer; requires careful handler design | 0 ms |
Pick by the failure mode you're trying to defeat, not by which one is easiest to write.
Placement 1: Dedup key in storage (Redis SET NX)
The workhorse. Every event arrives with a unique ID. Kafka topic-partition-offset tuples work, but most teams prefer a producer-assigned UUID so the dedup key survives repartitioning, topic mirroring, and broker migrations. You record the ID in Redis with a TTL; if the SET fails because the key exists, you skip the message.
package consumer

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

var ErrDuplicate = errors.New("duplicate event")

type DedupStore struct {
	rdb *redis.Client
	ttl time.Duration
}

func NewDedupStore(rdb *redis.Client, ttl time.Duration) *DedupStore {
	return &DedupStore{rdb: rdb, ttl: ttl}
}

// Claim returns nil on first-time, ErrDuplicate on repeat.
// SET NX is the atomic primitive. Don't split into GET + SET.
func (d *DedupStore) Claim(ctx context.Context, eventID string) error {
	key := fmt.Sprintf("dedup:order-events:%s", eventID)
	ok, err := d.rdb.SetNX(ctx, key, "1", d.ttl).Result()
	if err != nil {
		// Never treat a Redis error as "duplicate": return it so the
		// broker redelivers. Better to reprocess than skip wrongly.
		return fmt.Errorf("dedup claim: %w", err)
	}
	if !ok {
		return ErrDuplicate
	}
	return nil
}
Usage in the consumer loop:
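func (c *Consumer) Handle(ctx context.Context, evt OrderEvent) error {
	if err := c.dedup.Claim(ctx, evt.ID); err != nil {
		if errors.Is(err, ErrDuplicate) {
			c.metrics.Inc("dedup.skip")
			return nil // ack the message, do nothing
		}
		return err // Redis is down; let broker redeliver
	}
	// Caveat: the claim is recorded before process runs. If process
	// fails, the redelivery is skipped as a duplicate unless the claim
	// is released or process is itself safe to retry.
	return c.process(ctx, evt)
}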
What this catches: a broker that delivers the same message twice. Kafka does this on consumer-group rebalances. SQS does it when the visibility timeout expires before you delete the message. RabbitMQ does it when you nack on shutdown. Real, common, reliably caught.
What it misses:
- Reordering. Message A, then B, then A again is three deliveries, and the second A is the duplicate. While the dedup key for A is still in the store, the second A is skipped. Once it is gone (TTL expiry, failover), A is re-applied on top of B's state, and a handler that reads before it writes can do damage. The dedup catches the re-delivery, not the re-application of stale state.
- Failover gaps. Redis is a cache. If your TTL is 1 hour and Redis fails over to a replica that hasn't caught up, or the cluster reboots, the dedup keyspace empties. Set TTL based on the broker's worst-case redelivery window, not on what feels reasonable. For Kafka with a 7-day retention and the possibility of a seek for reprocessing, 1 hour is wrong.
- Different IDs, same semantic event. A producer that retries the publish (not the consumer that retries the consume) generates a new event with a new ID. Storage dedup at the consumer can't see that. Catch this at the producer with an idempotency key tied to the business action, not the wire-level event.
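The producer-retry gap in the list above can be narrowed by deriving the key from the business action instead of generating a fresh ID per publish. The following is a minimal sketch, not a prescribed scheme: `BusinessKey`, the action name, and the choice of fields are all assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// BusinessKey derives a stable idempotency key from the business action
// (what is being done, and to which order), not from the wire-level
// event ID. The name and the choice of fields are illustrative.
func BusinessKey(action, orderID string) string {
	// The zero byte separates the fields so ("ab","c") and ("a","bc")
	// cannot collide.
	sum := sha256.Sum256([]byte(action + "\x00" + orderID))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Two publishes of the same business action, e.g. a producer retry.
	// Their wire-level event IDs would differ; the business key does not.
	first := BusinessKey("charge-order", "ord-1001")
	retry := BusinessKey("charge-order", "ord-1001")
	other := BusinessKey("charge-order", "ord-1002")

	fmt.Println(first == retry) // true
	fmt.Println(first == other) // false
}
```

A key like this only works if the business action really is unique per order. If an order can legitimately be charged more than once, the key needs another field that tells those charges apart.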
The gotcha worth a paragraph of its own: SET key value EX 3600 NX is one round trip. If anyone on your team writes GET then SET, the race window is wide enough to drive a truck through. The code review check is "does the dedup go through SetNX or its equivalent." If it goes through anything else, reject the PR.
Placement 2: Dedup window in memory (LRU + Bloom)
The hot path. When you're doing 50k events per second through a consumer, 1ms per event on Redis is 50 seconds of aggregate latency per wall-second across all instances. You need cheaper, and you accept a smaller guarantee.
A small in-process LRU catches the burst-redelivery pattern: an offset commit fails or times out mid-batch, the next poll redelivers the same 200 messages to the same process, and your in-memory set still has them. Roughly 99% of duplicates in a healthy system are this pattern.
package consumer

import (
	"container/list"
	"sync"

	"github.com/bits-and-blooms/bloom/v3"
)

// WindowDedup combines an LRU of recent IDs with a Bloom backstop.
// The LRU gives exact answers within its capacity; the Bloom answers
// "probably seen" beyond it. False positives are acceptable here
// because we layer a real dedup behind this for safety.
type WindowDedup struct {
	mu       sync.Mutex
	capacity int
	order    *list.List
	index    map[string]*list.Element
	bloom    *bloom.BloomFilter
}

func NewWindowDedup(capacity int, expectedItems uint, fpRate float64) *WindowDedup {
	return &WindowDedup{
		capacity: capacity,
		order:    list.New(),
		index:    make(map[string]*list.Element, capacity),
		bloom:    bloom.NewWithEstimates(expectedItems, fpRate),
	}
}

// Seen returns true if the ID has been observed recently.
// false here is honest; true might be a Bloom false-positive.
func (w *WindowDedup) Seen(id string) bool {
	w.mu.Lock()
	defer w.mu.Unlock()

	if elem, ok := w.index[id]; ok {
		// exact hit: refresh recency so the list stays a true LRU
		w.order.MoveToFront(elem)
		return true
	}
	if w.bloom.TestString(id) {
		// the bloom says maybe; caller decides what to do
		return true
	}

	elem := w.order.PushFront(id)
	w.index[id] = elem
	w.bloom.AddString(id)

	if w.order.Len() > w.capacity {
		oldest := w.order.Back()
		w.order.Remove(oldest)
		delete(w.index, oldest.Value.(string))
		// note: bloom doesn't support removal, so it keeps evicted IDs
		// and its false-positive rate climbs as it fills; it is only
		// reset when the process restarts
	}
	return false
}
This is two orders of magnitude faster than Redis. It's also a lie.
When the process restarts (deploy, OOM), the LRU is empty, and after a rebalance the partition's history sits in a different instance's memory. For the first 30 seconds after a restart, every event looks new. If the broker redelivers anything from before the crash, you reprocess. The Bloom filter doesn't survive restart either.
Use this placement only when the cost of a duplicate is acceptable and the volume makes storage dedup impractical. Analytics ingestion. Metrics aggregation. Log fan-out. Not payments. Not order placement. Not anything where the duplicate has an external side effect a customer can see.
There is a sub-pattern worth knowing. Some teams put the Bloom in a sidecar that survives the process restart: the RedisBloom module, or a small mmap-backed file. That moves you halfway back to Placement 1, with worse guarantees than a real SET NX. If you're going that far, just use Placement 1.
Placement 3: Idempotency by design (state-transition guard)
The only one that survives the broker's worst day. Make the handler such that running it twice produces the same state. No dedup table, no window, no infrastructure to fail.
The canonical example is "mark order paid." A naive handler reads the order, sets status = paid, calls Stripe, writes back. Run it twice and you charge twice. Run it idempotently by design and you don't.
public class MarkOrderPaidHandler {
    private final OrderRepository orders;
    private final PaymentGateway gateway;

    public MarkOrderPaidHandler(OrderRepository orders, PaymentGateway gateway) {
        this.orders = orders;
        this.gateway = gateway;
    }

    public Result handle(OrderPaidEvent evt) {
        // The state-transition guard: only flip pending -> paid.
        // If the row is already paid, rowsAffected == 0 and we return.
        // The UPDATE is atomic in Postgres at READ COMMITTED, which is
        // what you almost certainly already have.
        int rowsAffected = orders.markPaid(
            evt.orderId(),
            evt.amountCents(),
            evt.paidAt()
        );
        if (rowsAffected == 0) {
            // already paid, or never pending: both safe to skip
            return Result.alreadyApplied();
        }

        // Side-effect after the guard. The idempotency key here
        // is the order ID, not the event ID. For a repeated key,
        // Stripe returns the original result instead of charging again.
        gateway.chargeWithIdempotencyKey(
            evt.orderId().toString(),
            evt.amountCents()
        );
        return Result.applied();
    }
}
And the SQL:
UPDATE orders
SET status = 'paid',
    paid_amount_cents = ?,
    paid_at = ?
WHERE id = ?
  AND status = 'pending';
That AND status = 'pending' is the entire trick. The handler doesn't need to know about duplicates. The database refuses to apply the transition twice. Run the consumer a thousand times against the same event and you get exactly one transition.
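The "thousand times, one transition" claim can be checked with a small simulation. This sketch replaces the database with a hypothetical in-memory `repo` whose `markPaid` mirrors the guarded UPDATE and returns the number of rows affected. The mutex stands in for the row lock; none of these names come from a real library.

```go
package main

import (
	"fmt"
	"sync"
)

type order struct {
	status          string
	paidAmountCents int64
}

// repo is a hypothetical in-memory stand-in for the orders table.
type repo struct {
	mu     sync.Mutex
	orders map[string]*order
}

// markPaid mirrors UPDATE ... WHERE id = ? AND status = 'pending'
// and returns the number of rows affected (0 or 1).
func (r *repo) markPaid(id string, amountCents int64) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	o, ok := r.orders[id]
	if !ok || o.status != "pending" {
		return 0 // unknown order, or the transition already happened
	}
	o.status = "paid"
	o.paidAmountCents = amountCents
	return 1
}

func main() {
	r := &repo{orders: map[string]*order{"ord-1": {status: "pending"}}}

	transitions := 0
	for i := 0; i < 1000; i++ {
		transitions += r.markPaid("ord-1", 4999)
	}
	fmt.Println("transitions:", transitions) // transitions: 1
}
```

Any side effect gated on a non-zero return value inherits the same at-most-once behavior for duplicate events.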
Three things to watch for here.
First, side effects on retry. The guard stops a duplicate event from reaching Stripe at all, but the charge call itself can still be retried, for example after a timeout. If Stripe is keyed on the order ID and has already accepted that charge, it returns the original charge object instead of charging again. That's the whole point of Stripe's idempotency keys. If your external system doesn't have an idempotency primitive, this technique doesn't fully save you and you're back to layering with Placement 1.
Second, optimistic locking is a different technique that solves a different problem (concurrent writers). The single guarded UPDATE above holds up even when two consumers race, because the row lock serializes them and the loser matches zero rows. That stops being true the moment a handler reads the row, decides in application code, and writes back. If your handler has that read-modify-write shape and you can have two pods consuming the same partition at the same instant, you want both: the guard for duplicates and a version column or SELECT ... FOR UPDATE for concurrency.
Third, the design has to be possible. "Mark paid" is naturally idempotent because it's a one-way state transition. "Add 5 to balance" isn't. You need an applied_event_ids table or a version-keyed log to make it idempotent. When the operation isn't naturally a state transition, you pay the cost in storage somewhere; the design just shifts where.
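For the "add 5 to balance" case, the applied-event approach can be sketched as follows. `Ledger` is a hypothetical in-memory stand-in for a balance row plus an `applied_event_ids` table. In a real database the insert into that table and the balance update would need to share one transaction, which the mutex only approximates here.

```go
package main

import (
	"fmt"
	"sync"
)

// Ledger is a hypothetical stand-in for a balance row plus an
// applied_event_ids table.
type Ledger struct {
	mu      sync.Mutex
	balance int64
	applied map[string]struct{}
}

func NewLedger() *Ledger {
	return &Ledger{applied: make(map[string]struct{})}
}

// Apply adds delta at most once per event ID and reports whether
// this call was the one that applied it.
func (l *Ledger) Apply(eventID string, delta int64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if _, done := l.applied[eventID]; done {
		return false // already applied: the balance is left alone
	}
	// Record the ID and change the balance as one unit.
	l.applied[eventID] = struct{}{}
	l.balance += delta
	return true
}

func (l *Ledger) Balance() int64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.balance
}

func main() {
	l := NewLedger()
	l.Apply("evt-1", 5)
	l.Apply("evt-1", 5) // redelivery: ignored
	l.Apply("evt-2", 5)
	fmt.Println(l.Balance()) // 10
}
```

This is the storage cost the paragraph above describes: the set of applied IDs grows with the event stream, so it needs its own retention policy.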
The decision tree
Is the handler a state transition with a natural guard?
├─ yes → Placement 3. You're done. No infrastructure.
└─ no  → Does it have an external side effect a customer sees?
   ├─ yes → Placement 1 (Redis SET NX or a Postgres unique-key insert),
   │        plus an idempotency key at the external API
   │        if it supports one.
   └─ no  → Is throughput > ~5k/s per consumer?
      ├─ yes → Placement 2 (LRU+Bloom),
      │        accept rare duplicates.
      └─ no  → Placement 1 anyway,
               it's cheap enough.
Read it twice. The first branch eats most of the cases for free if you're willing to model your handlers as state transitions. The second branch keeps you safe when the handler can't be made naturally idempotent. The third branch is the only place where in-memory dedup is the right answer.
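The same tree, written as a function you could keep next to a handler inventory. This is only a sketch: `HandlerProfile`, its fields, and the returned labels are hypothetical, and the 5,000 events/s threshold is the rough "~5k/s" figure from the tree, not a measured limit.

```go
package main

import "fmt"

// HandlerProfile holds the answers to the three questions in the tree.
// The type and field names are hypothetical.
type HandlerProfile struct {
	NaturalStateTransition    bool
	CustomerVisibleSideEffect bool
	EventsPerSecond           int
}

// hotPathThreshold is the approximate "~5k/s per consumer" cut-off.
const hotPathThreshold = 5000

// ChoosePlacement walks the tree top to bottom; the first matching
// branch wins.
func ChoosePlacement(p HandlerProfile) string {
	switch {
	case p.NaturalStateTransition:
		return "Placement 3"
	case p.CustomerVisibleSideEffect:
		return "Placement 1 + external idempotency key"
	case p.EventsPerSecond > hotPathThreshold:
		return "Placement 2"
	default:
		return "Placement 1"
	}
}

func main() {
	fmt.Println(ChoosePlacement(HandlerProfile{NaturalStateTransition: true}))
	fmt.Println(ChoosePlacement(HandlerProfile{CustomerVisibleSideEffect: true}))
	fmt.Println(ChoosePlacement(HandlerProfile{EventsPerSecond: 50000}))
	fmt.Println(ChoosePlacement(HandlerProfile{EventsPerSecond: 200}))
}
```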
The hybrid real production systems run
Most large systems don't pick one. They layer. Cheap check first, expensive check second, design-time guarantee last:
func (c *Consumer) Handle(ctx context.Context, evt OrderEvent) error {
	// Layer 1: in-memory window. A hit is only a hint (Seen can
	// return a Bloom false positive), so count it and let Layer 2
	// confirm before skipping anything.
	if c.window.Seen(evt.ID) {
		c.metrics.Inc("dedup.window.hit")
		// don't return yet; confirm with storage
	}

	// Layer 2: storage dedup. Authoritative for the TTL window.
	switch err := c.dedup.Claim(ctx, evt.ID); {
	case errors.Is(err, ErrDuplicate):
		c.metrics.Inc("dedup.storage.hit")
		return nil
	case err != nil:
		return err
	}

	// Layer 3: design-time. Even if both above lied (Redis
	// failover, process restart), the state transition is
	// the final backstop.
	return c.process(ctx, evt)
}
Each layer earns its keep. Layer 1 is the cheap first look: as written above, a window hit is only counted and then confirmed against storage, because Seen can return a Bloom false positive. To actually skip the Redis round trip on the hot path, return early only on an exact LRU hit. Layer 2 catches what Layer 1 forgot after a restart. Layer 3 catches what Layer 2 forgot during a failover. As written, every event pays the Layer 1 check (microseconds) plus the Layer 2 round trip (a few milliseconds), and the failure modes have to cascade across all three before a duplicate makes it through.
Retrofitting idempotency into a consumer you already shipped
You inherited a handler that isn't idempotent. The broker is at-least-once. You can't rewrite the world. Order of operations:
- Add Placement 1 first. It's the lowest-risk change. Wrap the existing handler in a SetNX check, give it a TTL longer than your worst-case redelivery window (look at the broker's settings, not your intuition), and ship behind a feature flag with dedup.skip and dedup.miss metrics. If those metrics aren't moving, you weren't getting duplicates in the first place.
- Audit handlers for natural state transitions. Anywhere a handler ends with "set status to X," try the WHERE status = previous rewrite. This is a small, safe diff with a large blast-radius improvement. Often this lets you delete the Placement 1 wrapper later.
- Add Placement 2 only if profiling says you need it. Premature in-memory dedup adds a class of bug you don't want before you've proven the throughput problem.
- Never trust the broker's "exactly-once" claim. Kafka EoS is real but narrow: it covers reads from one topic, writes to another, and an offset commit in a single transaction. The moment your handler calls Stripe, sends an email, or writes to a non-transactional system, you're back to at-least-once and back to needing one of the three placements above.
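The first retrofit step (wrap the existing handler, gate it behind a flag, count skips and misses) might look like the sketch below. Everything here is hypothetical scaffolding: `Deduper`, `Counters`, `WithDedup`, and the map-backed store exist only to keep the example runnable, and a real version would plug in the Redis-backed claim and your metrics client.

```go
package main

import "fmt"

// Deduper is a hypothetical interface over whatever backs Placement 1.
type Deduper interface {
	Claim(id string) (first bool, err error)
	Release(id string)
}

// Counters stands in for the dedup.skip and dedup.miss metrics.
// It is not safe for concurrent use; a real metrics client would be.
type Counters struct{ Skip, Miss int }

type HandlerFunc func(id string) error

// WithDedup wraps an existing handler without changing it. When the
// flag is off, events go straight through to the legacy handler.
func WithDedup(enabled func() bool, d Deduper, c *Counters, next HandlerFunc) HandlerFunc {
	return func(id string) error {
		if !enabled() {
			return next(id)
		}
		first, err := d.Claim(id)
		if err != nil {
			return err // store unavailable: let the broker redeliver
		}
		if !first {
			c.Skip++ // duplicate: ack without running the handler
			return nil
		}
		c.Miss++ // first sighting: run the handler
		if err := next(id); err != nil {
			d.Release(id) // do not swallow the redelivery
			return err
		}
		return nil
	}
}

// mapDeduper is an in-memory Deduper for the demonstration only.
type mapDeduper map[string]bool

func (m mapDeduper) Claim(id string) (bool, error) {
	if m[id] {
		return false, nil
	}
	m[id] = true
	return true, nil
}

func (m mapDeduper) Release(id string) { delete(m, id) }

func main() {
	c := &Counters{}
	calls := 0
	legacy := func(id string) error { calls++; return nil }

	h := WithDedup(func() bool { return true }, mapDeduper{}, c, legacy)
	for _, id := range []string{"a", "a", "b"} {
		_ = h(id)
	}
	fmt.Println("handler calls:", calls, "skip:", c.Skip, "miss:", c.Miss)
}
```

Because the wrapper leaves the legacy handler untouched, turning the flag off restores the old behavior exactly, which is what makes this the low-risk first step.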
The phrase "make the consumer idempotent" looks small on a post-mortem action item. It hides a design decision per handler about which failure mode you're defeating. Pick the placement on purpose.
Which placement bit you in production, and how did you find out? Drop the war story in the comments. Bonus points if the duplicate cost real money.
If this was useful
Idempotency is the easy half. Where it gets interesting is the rest of the event-driven stack: outboxes that survive a database failover, sagas that compensate the right step, replay strategies that don't reprocess a Stripe charge from last Tuesday. That's what the Event-Driven Architecture Pocket Guide is built around: the patterns you reach for after the first "we shipped a duplicate" post-mortem, and the traps that turn the second one into something worse.
