I was on-call when the P1 hit. Slack lighting up, a downstream team's database rejecting writes. Constraint violations everywhere.
I pulled up the logs. The events looked fine. The schema was correct. The data made sense. Except the timestamps didn't.
```
10:00:01 - Event 1: ADD 100 units to SKU-123
10:00:03 - Event 2: REMOVE 50 units from SKU-123
```
That's the sequence that happened in the real world. Here's what the consumers saw:
```
10:00:01 - Event 1: ADD 100 → FAILED (API timeout) → sent to retry topic
10:00:03 - Event 2: REMOVE 50 → FAILED (CHECK constraint: quantity < 0)
10:03:01 - Event 1: ADD 100 → SUCCESS (from retry topic, 3 minutes late)
```
The downstream inventory table had CHECK (quantity >= 0). Because Event 2 arrived before Event 1 was retried, the system tried to remove stock that didn't exist yet. PostgreSQL said no.
The retry logic did exactly what it was designed to do — keep the pipeline moving by shoving failed messages aside. In doing so, it broke the ordering that every downstream consumer depended on.
The postmortem took two days. The fix took longer. The root cause came down to one fact: your retry decisions propagate to every downstream consumer.
The Downstream Ripple Effect
When you consume from a topic and publish to another, you aren't just moving data. You're a guardian of ordering semantics for everyone downstream.
The team consuming your enriched stream might be:

- Building a state machine — order status must flow CREATED → PAID → SHIPPED, never backwards
- Calculating running totals — account balances or inventory counts where State(N) depends on State(N-1)
- Enforcing causality — you can't update a record that hasn't been "created" yet
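To make the first case concrete, here is a minimal sketch of a downstream state-machine check. The statuses come from the list above; the function and variable names are illustrative, not from any real consumer:

```go
package main

import "fmt"

// validNext encodes the only legal order-status transitions: each status may
// advance to exactly one successor, never skip ahead, never move backwards.
var validNext = map[string]string{
	"CREATED": "PAID",
	"PAID":    "SHIPPED",
}

// canTransition reports whether an incoming status event is a legal next step
// from the current state.
func canTransition(current, incoming string) bool {
	return validNext[current] == incoming
}

func main() {
	fmt.Println(canTransition("CREATED", "PAID"))    // true
	fmt.Println(canTransition("CREATED", "SHIPPED")) // false: a reordered stream breaks here
}
```

A consumer like this rejects (or corrupts on) any event that arrives out of order, which is exactly why the producer's retry strategy matters to it.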
The worst part? You won't know you've broken ordering until production data starts drifting. Or worse — a 2 AM call about corrupted state that's been silently building for weeks.
The Two Bad Options
If your operations are idempotent and commutative — A then B gives the same result as B then A — the standard retry-topic-and-DLQ pattern works fine. Uber does it. Confluent shows it.
But for stateful systems where order matters, you're stuck choosing between two bad options:
Option 1: Stop the World. Don't commit the offset. Retry in place until it succeeds.
```go
func (h *Handler) ConsumeClaim(session sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	for msg := range claim.Messages() {
		// Retry in place until the message succeeds; nothing past this
		// offset gets processed in the meantime.
		for {
			err := h.process(session.Context(), msg)
			if err == nil {
				break
			}
			// Back off, but bail out cleanly if the session is shutting down.
			select {
			case <-session.Context().Done():
				return session.Context().Err()
			case <-time.After(h.backoff.Next()):
			}
		}
		session.MarkMessage(msg, "")
	}
	return nil
}
```
Order is preserved. But one poison pill blocks the entire partition — if offset 10 fails for three hours, offsets 11 through 1,000,000 sit waiting. Your client's background heartbeat thread won't save you — max.poll.interval.ms is a separate timer, and if you exceed it, Kafka assumes your consumer is livelocked, kicks it from the group, and triggers a rebalance storm.
Option 2: Skip and Pray. Send the failed message to a retry topic. Commit. Keep going.
```go
func (h *Handler) ConsumeClaim(session sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	for msg := range claim.Messages() {
		if err := h.process(session.Context(), msg); err != nil {
			// Shunt the failure aside and keep the partition moving.
			if rerr := h.sendToRetryTopic(session.Context(), msg); rerr != nil {
				return rerr // don't commit a message we may have lost
			}
		}
		session.MarkMessage(msg, "")
	}
	return nil
}
```
Throughput maintained. Ordering sacrificed. This is what caused my P1.
The Third Way
We need the availability of Option 2 with the ordering guarantees of Option 1.
Picture a restaurant kitchen during dinner rush. Table 5 orders appetizers, main, dessert — in that order. The steak comes out wrong and needs a redo. You don't shut down the kitchen. You don't stop serving every other table. You hold Table 5's remaining courses until the steak is ready. Tables 6, 7, and 8 keep getting their food on time.
Each table is a message key. A failed dish doesn't block the kitchen — it blocks that table's order until resolved.
When a message for key K fails, all subsequent messages for key K must wait — but messages for other keys proceed normally.
This pattern is described in a Confluent blog post. The concept is straightforward. The implementation is where things get interesting.
The Architecture
To make this work across multiple consumer instances — say, 50 pods in Kubernetes — we need a way to coordinate which keys are currently blocked. A pod that locks a key must somehow tell every other pod. And when a pod crashes mid-retry, the lock must survive.
We'll use Kafka itself as the coordination layer. On top of your original topic, you need two additional topics:
- Retry Topic — messages land here either because they failed or because a predecessor with the same key failed. A separate consumer processes them with backoff. After exhausting retries, messages move to a DLQ — that's a standard pattern we won't rehash here.
- Lock Topic (compacted) — the shared lock registry. Every lock and unlock is a record here.
A retry topic alone isn't enough. It handles the failed message — but the main consumer doesn't know that SKU-123 is in trouble. When the next message for SKU-123 arrives, it processes immediately, out of order. We need a way to tell every consumer instance: "this key is blocked, don't touch it." That's what the lock topic does.
How Lock State Propagates
When a message fails, the consumer writes a record to the lock topic: key="SKU-123", value="LOCKED". When the retry eventually succeeds, it writes a tombstone (null value) to signal the lock is cleared.
Every consumer instance subscribes to the lock topic and builds a local in-memory map from the stream. Because the topic is compacted, Kafka periodically deduplicates records — keeping only the latest value per key. A new instance can read from the beginning and reconstruct the full lock state. The topic is the database.
This means each instance runs two consumer loops: one for business messages (main or retry), and one that continuously tails the lock topic to keep the local map in sync.
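The lock-tailing loop reduces to folding each record into a map. Here is a minimal sketch with illustrative names (`lockRecord`, `applyLockRecord`), not a real library API; it's also the naive boolean version, and the pitfalls later in the post explain why that isn't enough on its own:

```go
package main

import "fmt"

// lockRecord mirrors a record on the compacted lock topic: a nil Value is a
// Kafka tombstone, meaning the lock has been released.
type lockRecord struct {
	Key   string
	Value []byte
}

// applyLockRecord folds one lock-topic record into this instance's local map.
func applyLockRecord(locks map[string]bool, rec lockRecord) {
	if rec.Value == nil {
		delete(locks, rec.Key) // tombstone: lock cleared
		return
	}
	locks[rec.Key] = true
}

func main() {
	locks := make(map[string]bool)
	applyLockRecord(locks, lockRecord{Key: "SKU-123", Value: []byte("LOCKED")})
	fmt.Println(locks["SKU-123"]) // true
	applyLockRecord(locks, lockRecord{Key: "SKU-123", Value: nil})
	fmt.Println(locks["SKU-123"]) // false
}
```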
The Lock Check
Before processing any message, the consumer checks its local map: "Is this key currently blocked?"
```
New message arrives: key="order-123"
│
├─ Lock map: "order-123" locked?
│   │
│   ├─ YES → Redirect to retry topic
│   │        (preserve order behind predecessor)
│   │
│   └─ NO  → Process normally
```
If the key is locked, the message gets redirected to the retry topic — not because it failed, but because a predecessor with the same key failed. It lines up behind that predecessor and waits its turn.
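The check itself is a few lines. `routeMessage` and its return values below are illustrative, not a real API:

```go
package main

import "fmt"

// routeMessage decides what to do with an incoming message given the local
// view of the lock topic: redirect if the key is blocked, otherwise process.
func routeMessage(locks map[string]bool, key string) string {
	if locks[key] {
		// Not a failure: a predecessor with this key failed, so this
		// message waits its turn on the retry topic to preserve order.
		return "redirect-to-retry"
	}
	return "process"
}

func main() {
	locks := map[string]bool{"order-123": true}
	fmt.Println(routeMessage(locks, "order-123")) // redirect-to-retry
	fmt.Println(routeMessage(locks, "order-456")) // process
}
```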
Seeing It In Action
Here's how the full flow handles the scenario from the opening, with a third event for an unrelated key (say, SKU-456) arriving in between:

```
10:00:01  Event 1 (SKU-123): ADD 100   → FAILED → lock SKU-123, send to retry topic
10:00:03  Event 2 (SKU-123): REMOVE 50 → SKU-123 is locked → redirected to retry topic
10:00:04  Event 3 (SKU-456): not locked → processed normally
10:03:01  Retry: Event 1 ADD 100       → SUCCESS → tombstone unlocks SKU-123
10:03:02  Retry: Event 2 REMOVE 50     → SUCCESS
```

Event 3 (different key) was never blocked. Events 1 and 2 (same key) maintained their order despite the failure. The partition kept flowing.
Where It Breaks
That diagram shows the happy path. In a distributed system with 50 pods, there are at least five ways this breaks:
1. The Dual-Write. When a message fails, you need to write a lock and publish the message to the retry topic. If you crash between the two, you've created a zombie lock — a key locked forever with no retry message that will ever release it. Do you wrap both writes in a Kafka transaction? Or is there something cheaper?
2. The Reference Counting Gap. Two messages for the same key fail. You lock the key. The first retry succeeds — do you release the lock? If you do, the third message for that key processes immediately, before the second retry. The lock isn't a boolean. It's a counter.
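A counter-based lock for that case might look like the following sketch (the names are hypothetical); the unlock tombstone is published only when the count returns to zero:

```go
package main

import "fmt"

// lockCounter tracks how many in-flight retries exist per key; a key is only
// truly free when its count returns to zero.
type lockCounter map[string]int

// acquire records one more pending retry for key.
func (l lockCounter) acquire(key string) { l[key]++ }

// release drops one pending retry and reports whether the key is now free,
// i.e. whether it is safe to publish the unlock tombstone.
func (l lockCounter) release(key string) bool {
	l[key]--
	if l[key] <= 0 {
		delete(l, key)
		return true
	}
	return false
}

func main() {
	locks := lockCounter{}
	locks.acquire("SKU-123")              // first failure
	locks.acquire("SKU-123")              // second failure, same key
	fmt.Println(locks.release("SKU-123")) // false: one retry still pending
	fmt.Println(locks.release("SKU-123")) // true: safe to publish tombstone
}
```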
3. The Self-Consumption Trap. Every instance publishes lock records and subscribes to the lock topic. Kafka doesn't filter — you'll consume your own messages. Apply the lock once locally at failure time and again when your own record comes back from the broadcast, and you've double-counted every lock.
4. The Compaction Collision. Three failures for the same key produce three lock records with the same Kafka key. Compaction collapses them into one. After a restart, you replay one record instead of three. Your reference count is silently wrong.
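You can see the collision with a toy model of compaction. The `compact` function below keeps only the latest record per key, which is what Kafka's log cleaner eventually does; three failures written, one record replayed:

```go
package main

import "fmt"

// lockRecord is a simplified record on the lock topic: three separate
// failures for SKU-123 all share the same Kafka key.
type lockRecord struct {
	Key   string
	Value string
}

// compact mimics Kafka log compaction: only the latest record per key
// survives, no matter how many were originally written.
func compact(records []lockRecord) []lockRecord {
	latest := map[string]lockRecord{}
	var order []string
	for _, r := range records {
		if _, seen := latest[r.Key]; !seen {
			order = append(order, r.Key)
		}
		latest[r.Key] = r
	}
	out := make([]lockRecord, 0, len(order))
	for _, k := range order {
		out = append(out, latest[k])
	}
	return out
}

func main() {
	written := []lockRecord{
		{"SKU-123", "LOCKED"}, {"SKU-123", "LOCKED"}, {"SKU-123", "LOCKED"},
	}
	survived := compact(written)
	// A restarted instance replays 1 record, not 3: its reference count is wrong.
	fmt.Println(len(written), "written,", len(survived), "replayed")
}
```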
5. The Rebalancing Race. A new pod starts, gets assigned a partition, and begins processing. But it hasn't finished reading the lock topic yet. It doesn't know what's locked. It processes a message out of order.
None of these are theoretical. They all showed up during implementation.
What's Next
I've built a Go library — kafka-resilience — that handles these edge cases.
In Part 2, we'll walk through each of the five problems above and their fixes. Code included.
Part 3 is the counterargument: operational costs, failure modes, and why you might choose a simpler solution.
The goal isn't to convince you to use this — it's to give you the full picture so you can make an informed decision.
