Gabriel Anhaia

Posted on Jun 13

Dead-Letter Replay: Doing It Without Double-Processing

#kafka #architecture #eventdriven #backend

The incident is over. The bug that was poisoning your consumer is fixed and deployed. Now there are 4,000 messages sitting in the dead-letter queue, and somebody on the call says the obvious thing: "just replay them."

So you point the DLQ back at the main topic and drain it. Twenty minutes later a customer emails asking why they got charged twice. Half those 4,000 messages had already been partially processed before the consumer choked. Replaying them ran the side effects again.

The replay was the easy part. Replaying without re-applying work you already did is the part nobody writes a runbook for. This is that runbook.

Why naive replay double-processes

A dead-letter queue fills up for two reasons, and they look identical from the outside.

The first is a transient failure: a downstream API was down, the database connection pool was exhausted, a deploy was mid-flight. The message is fine. Once the failure clears, replaying it works.

The second is a poison message: a malformed payload, a schema your consumer can't parse, a payload that violates a business rule. Replaying it a thousand times fails a thousand times. It will never succeed, and every replay attempt burns retry budget that healthy messages need.

The double-processing trap lives in the first category: a message that failed after it had already done part of its job. Your handler charged the card, then crashed before writing status = paid. The broker sees no ack, the message lands in the DLQ, and the payload still says "please charge this card." Replay it and you charge again.

So replay safety is not a replay problem. It's an idempotency problem wearing a replay costume.

The precondition: idempotent consumers

If your consumers aren't idempotent, stop here. Replay will hurt you no matter how careful the protocol is. The cheapest version of idempotency for replay is a state-transition guard: the handler only acts if the entity is in the expected starting state.

UPDATE orders
SET status = 'paid',
    paid_at = ?
WHERE id = ?
  AND status = 'pending';

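If the order was already marked paid by the original (partial) run, this UPDATE affects zero rows and the handler returns without touching the payment gateway. Replay a message a thousand times and you get exactly one transition. The guard only protects side effects that run after it. In the crash described earlier, where the card was charged before status = paid was written, the order is still pending on replay, so the charge itself needs its own protection, such as an idempotency key sent to the payment gateway if the gateway supports one.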
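A minimal sketch of how a handler might use that guard, assuming an in-memory stand-in for the orders table so the example runs without a database. Orders, MarkPaid, and HandlePaid are hypothetical names for illustration; in a real handler MarkPaid would run the guarded UPDATE above and check the affected row count.

```go
package main

import "sync"

// Orders is an in-memory stand-in for the orders table
// (hypothetical, for illustration only).
type Orders struct {
    mu     sync.Mutex
    status map[string]string
}

// MarkPaid mirrors the guarded UPDATE: it flips pending -> paid and
// reports whether this call performed the transition.
func (o *Orders) MarkPaid(id string) bool {
    o.mu.Lock()
    defer o.mu.Unlock()
    if o.status[id] != "pending" {
        return false // equivalent to "zero rows affected"
    }
    o.status[id] = "paid"
    return true
}

// HandlePaid runs the side effect only when this delivery won the
// transition, so replays of the same message become no-ops.
func HandlePaid(o *Orders, id string, sideEffect func(id string)) {
    if !o.MarkPaid(id) {
        return
    }
    sideEffect(id)
}
```

Calling HandlePaid three times for the same pending order runs the side effect once. Note the ordering this sketch assumes: the transition is recorded first, so a side effect that fails afterwards is not retried by a replay and has to be caught by reconciliation.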

When the operation isn't a natural state transition, fall back to a processed-events table keyed on the message ID.

// Claim records the message ID before any side effect.
// A duplicate insert means we already handled this one.
func (s *Store) Claim(
    ctx context.Context, id string,
) (bool, error) {
    tag, err := s.db.Exec(ctx,
        `INSERT INTO processed_events (id)
         VALUES ($1)
         ON CONFLICT (id) DO NOTHING`, id)
    if err != nil {
        return false, err
    }
    return tag.RowsAffected() == 1, nil
}

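Claim returns true only the first time it sees an ID. The handler wraps its side effects behind that boolean. The message ID has to be stable across the original delivery and the replay, so use a producer-assigned UUID, not the broker offset. Because the claim is written before the side effect, a handler that fails after claiming leaves a row that makes its own retry look like a duplicate. Either run the claim and the work in one database transaction, or remove the claim when the handler fails.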

Quarantine poison messages before you replay anything

Before a bulk replay, separate the two categories. Transient failures get replayed. Poison messages get quarantined for a human.

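The signal is the retry count. Every DLQ message should carry how many times it has already failed, stamped in a header. Treat a high count as a heuristic rather than proof: a long outage can push a healthy message past the threshold, which is one reason the record below also keeps the last error.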

type DeadLetter struct {
    ID        string
    Payload   []byte
    Attempts  int
    LastError string
    FailedAt  time.Time
}

const poisonThreshold = 5

func (d DeadLetter) IsPoison() bool {
    return d.Attempts >= poisonThreshold
}

When you drain the DLQ for replay, route by that flag instead of replaying blind.

func partition(
    letters []DeadLetter,
) (replay, quarantine []DeadLetter) {
    for _, l := range letters {
        if l.IsPoison() {
            quarantine = append(quarantine, l)
        } else {
            replay = append(replay, l)
        }
    }
    return replay, quarantine
}

The quarantine set goes to a separate topic or table that nothing auto-consumes. A human inspects each one, decides whether to fix the payload, drop it, or patch the consumer, and only then moves it back. Poison messages that loop through an automatic replay are how a DLQ turns into an infinite-cost retry storm.
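The Attempts field has to be populated from somewhere. A minimal sketch of stamping and reading the retry count, assuming headers are available as a map of byte slices and assuming a header name of x-retry-count. Both are hypothetical; use whatever your broker client and retry middleware actually provide.

```go
package main

import "strconv"

// retryHeader is a hypothetical header name, not a broker standard.
const retryHeader = "x-retry-count"

// attemptsFromHeaders reads the retry count from message headers.
// A missing or unparsable header is reported with ok=false so the
// caller can decide what to do instead of silently assuming zero.
func attemptsFromHeaders(h map[string][]byte) (attempts int, ok bool) {
    raw, found := h[retryHeader]
    if !found {
        return 0, false
    }
    n, err := strconv.Atoi(string(raw))
    if err != nil || n < 0 {
        return 0, false
    }
    return n, true
}

// stampAttempt returns a copy of the headers with the count
// incremented, for the component that dead-letters a failed message.
func stampAttempt(h map[string][]byte) map[string][]byte {
    out := make(map[string][]byte, len(h)+1)
    for k, v := range h {
        out[k] = v
    }
    n, _ := attemptsFromHeaders(h) // no usable header counts as zero prior attempts
    out[retryHeader] = []byte(strconv.Itoa(n + 1))
    return out
}
```

A message whose count cannot be read is arguably safer in quarantine than in the replay set, since you cannot tell how many times it has already failed.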

Partial-batch replay: don't drain the whole queue at once

The instinct after an incident is to replay everything in one shot. Resist it. A 4,000-message flood hits your downstream systems at a rate they never see in normal traffic, and you discover a second outage caused by the recovery from the first.

Replay in bounded batches with a pause between them. This gives you a kill switch: if the first batch shows a problem, you stop after 100 messages instead of 4,000.

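func replayBatched(
    ctx context.Context,
    letters []DeadLetter,
    handler func(context.Context, DeadLetter) error,
    batchSize int,
    pause time.Duration,
) (replayed int, failed []DeadLetter, err error) {
    if batchSize <= 0 {
        // A non-positive step would never advance the loop below.
        return 0, nil, errors.New("replayBatched: batchSize must be > 0")
    }
    for i := 0; i < len(letters); i += batchSize {
        end := i + batchSize
        if end > len(letters) {
            end = len(letters)
        }
        for _, l := range letters[i:end] {
            if e := handler(ctx, l); e != nil {
                // Collect failures for inspection; do not requeue them.
                failed = append(failed, l)
                continue
            }
            replayed++
        }
        if end == len(letters) {
            break // last batch: nothing left to wait for
        }
        // Pause between batches; cancelling ctx here is the kill switch.
        select {
        case <-ctx.Done():
            return replayed, failed, ctx.Err()
        case <-time.After(pause):
        }
    }
    return replayed, failed, nil
}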

Two details matter here. A message that fails during replay goes into failed, not back into the live retry loop. You inspect those separately rather than letting them recycle. And the ctx.Done() check means an operator can cancel the whole replay between batches, which is the kill switch you want at 3 a.m.

The batch size and pause are a function of your downstream capacity, not the queue depth. If the consumer handles 200 messages a second when healthy, replay at a fraction of that. You are competing with live traffic for the same database connections.
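The arithmetic for picking the pause can be sketched as follows. pauseFor is a hypothetical helper, and the 10% share in the usage note is an assumed example, not a recommendation. The calculation ignores the time the handler itself spends on each batch, so the real rate lands somewhat below the target.

```go
package main

import (
    "errors"
    "time"
)

// pauseFor returns the pause between batches that keeps the replay at
// or below targetPerSec messages per second, ignoring handler time.
func pauseFor(batchSize int, targetPerSec float64) (time.Duration, error) {
    if batchSize <= 0 || targetPerSec <= 0 {
        return 0, errors.New("pauseFor: batchSize and targetPerSec must be positive")
    }
    seconds := float64(batchSize) / targetPerSec
    return time.Duration(seconds * float64(time.Second)), nil
}
```

With a healthy rate of 200 messages a second and an assumed replay share of 10%, the target is 20 messages a second, so a batch of 100 calls for a 5-second pause: pauseFor(100, 20).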

The replay runbook

When the page fires and the DLQ is filling, follow the same sequence every time. Improvising the order is how the double-charge happens.

1. Confirm the bug is actually fixed and deployed. Replaying into a still-broken consumer just moves messages from the DLQ to the DLQ, plus the side effects that succeed before the failure point. Check the deploy SHA in production before touching the queue.

2. Snapshot the DLQ. Copy the current contents to an immutable store before you replay. If the replay goes wrong, you need the original messages to start over. Never replay directly out of the only copy you have.

3. Verify idempotency coverage for the affected handler. Pull up the handler and confirm there's a state guard or a Claim check in front of every external side effect. If there isn't, the replay is unsafe and the next step is to add one, not to replay.

4. Partition into replay and quarantine. Route poison messages out. Hand them to a human queue. Do not include them in the bulk replay.

5. Replay one small batch. Start with the smallest batch your tooling allows. Watch the metrics that matter for this handler: duplicate-skip counter, downstream error rate, business-side effects. The duplicate-skip counter is the one that tells you idempotency is doing its job.

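6. Read the dedup metric before continuing. If dedup.skip is climbing, your idempotency layer is catching re-applications. That's the system working. If it stays flat and side effects are firing, stop and check. Either every message in the batch failed before doing any work, or something is letting duplicates through, and you need to know which before the next batch.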

7. Drain the rest in batches with pauses. Keep the kill switch in reach. If error rates climb, stop between batches.

8. Reconcile. After the queue is empty, count what you replayed against what the DLQ held. Check the quarantine set is accounted for. Confirm no entity ended up in a state the original run plus the replay shouldn't have produced.

The step people skip is 6. They replay one batch, see no errors, and assume success. No errors during replay does not mean no duplicates. It means the duplicates were either prevented (good) or silently applied (very bad). The dedup metric is how you tell those two apart.
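Step 8 of the runbook can be sketched as a set comparison over message IDs. This is a minimal sketch, assuming you kept the ID lists from the snapshot, the replay, and the quarantine; reconcile is a hypothetical helper, not part of any tooling named above.

```go
package main

import "sort"

// reconcile checks that every message ID in the DLQ snapshot ended up
// in exactly one bucket: replayed, failed during replay, or quarantined.
// missing lists snapshot IDs found in no bucket. unexpected lists IDs
// that appear in more than one bucket entry or are not in the snapshot.
func reconcile(
    snapshot, replayed, failed, quarantined []string,
) (missing, unexpected []string) {
    seen := make(map[string]int, len(snapshot))
    for _, bucket := range [][]string{replayed, failed, quarantined} {
        for _, id := range bucket {
            seen[id]++
        }
    }
    inSnapshot := make(map[string]bool, len(snapshot))
    for _, id := range snapshot {
        inSnapshot[id] = true
        if seen[id] == 0 {
            missing = append(missing, id)
        }
    }
    for id, n := range seen {
        if n > 1 || !inSnapshot[id] {
            unexpected = append(unexpected, id)
        }
    }
    sort.Strings(missing)
    sort.Strings(unexpected)
    return missing, unexpected
}
```

Both results should be empty after a clean replay. This only accounts for messages; confirming that no entity ended up in an impossible state still needs a query against the business data.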

What "ack the duplicate" actually means

One subtlety trips up replay handlers. When the idempotency layer catches a message it has already processed, the handler must ack it, not fail it.

func (c *Consumer) Replay(
    ctx context.Context, l DeadLetter,
) error {
    fresh, err := c.store.Claim(ctx, l.ID)
    if err != nil {
        return err // real error: let it retry
    }
    if !fresh {
        c.metrics.Inc("dedup.skip")
        return nil // already done: ack, move on
    }
    return c.process(ctx, l.Payload)
}

Returning nil on a duplicate acknowledges the message and removes it from the queue. Returning an error would push it back, and a message that's already been processed would loop forever, looking like a poison message it isn't. The dedup.skip increment is what makes the difference visible on a dashboard during step 6 of the runbook.
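One gap in a claim-then-process handler is worth sketching: if the work fails after the claim was recorded, the retry will be counted as a duplicate and skipped. A minimal in-memory sketch of one way to handle that, assuming a hypothetical Release method that is not part of the Store shown earlier, and assuming the work is safe to run again after a failure (for example because it failed before its side effect, or the side effect carries its own idempotency key).

```go
package main

import (
    "errors"
    "sync"
)

// memStore is an in-memory stand-in for the processed_events table.
type memStore struct {
    mu   sync.Mutex
    seen map[string]bool
}

// Claim reports true only the first time it sees an ID.
func (s *memStore) Claim(id string) bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.seen[id] {
        return false
    }
    s.seen[id] = true
    return true
}

// Release is hypothetical: it removes a claim so a later retry is
// treated as fresh instead of being skipped as a duplicate.
func (s *memStore) Release(id string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    delete(s.seen, id)
}

// errDownstream is a sample failure for the usage example.
var errDownstream = errors.New("downstream unavailable")

// replay acks duplicates, and releases the claim when the work fails.
func replay(s *memStore, skips *int, id string, process func() error) error {
    if !s.Claim(id) {
        *skips++ // stands in for c.metrics.Inc("dedup.skip")
        return nil
    }
    if err := process(); err != nil {
        s.Release(id)
        return err
    }
    return nil
}
```

When the side effect is a write to the same database, running the claim and the work in a single transaction gives the same result without a separate release step.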

What survives the post-mortem

Most replay incidents share the same root cause: the team treated the DLQ as a parking lot and the replay as a "drain" button. It isn't. A dead-letter queue is a record of work that failed at an unknown point, and the only safe assumption is that some of it half-finished.

Build the protocol once. Stamp retry counts so you can quarantine poison messages. Keep idempotency guards in front of every side effect so replay is a no-op for work already done. Replay in bounded batches with a kill switch. Watch the dedup metric, not only the error rate. The runbook is boring on purpose, because the alternative is exciting in the way an incident channel at 3 a.m. is exciting.

What's the worst thing your team replayed by accident, and how did you find out? Drop the story in the comments.

If this was useful

Replay is one corner of a larger problem: how an event-driven system behaves when delivery is at-least-once and the network has a bad day. The Event-Driven Architecture Pocket Guide walks through dead-letter handling alongside the outbox pattern, saga compensation, and the idempotency placements that make replay safe in the first place. If you're writing the replay runbook for the first time, the book has the failure modes mapped out so you don't learn them during the incident.