Myroslav Vivcharyk

Posted on • Originally published at devgeist.com

Ordered Retries in Kafka: Why You Probably Shouldn't Build This

I once spent three weeks in architectural meetings arguing that we should insert SQS between two Kafka topics.

I wasn't doing it because I liked SQS. I was doing it because the "clean" Kafka-native retry logic felt like a complexity bomb our team wasn't ready to defuse.

As engineers, our job isn't just to build clever systems for the next promotion cycle. Our job is to build systems that don't set the on-call engineer's life on fire at 3 AM.

In Part 1 and Part 2, we built a Rube Goldberg machine of distributed locks, compacted topics, and reference counting. It works. It maintains strict ordering during failures without blocking entire partitions.

Before you import my library or build your own, let's talk about the engineering hangover that comes after the party.

The Technical Costs

1. The Cold Start Penalty

Your service was stateless. Kubernetes could kill pods and spin up new ones in seconds. Not anymore.

Now your application carries state. That state lives in the lock topic. On startup, every pod must:

  1. Connect to Kafka
  2. Read the lock topic from the beginning
  3. Rebuild the in-memory map before processing anything

With 100 active locks? Instant. With 500,000 lock records after a bad day (every lock and unlock is a separate record), pod startup goes from seconds to minutes.
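Step 3 is where the time goes. Here is a minimal sketch of the restore pass, assuming lock records are key/value pairs from the compacted topic where a nil value is a tombstone (a release) and anything else acquires or refreshes the lock; the `record` type and field names are illustrative, not the library's actual API:

```go
package main

import "fmt"

// record is one entry replayed from the compacted lock topic.
// A nil Value is a tombstone: the lock was released.
type record struct {
	Key   string
	Value []byte
}

// restoreLocks replays the lock topic from the beginning and rebuilds
// the in-memory lock map. Every pod must finish this loop before it
// may process a single business message - hence the cold start cost.
func restoreLocks(records []record) map[string][]byte {
	locks := make(map[string][]byte)
	for _, r := range records {
		if r.Value == nil {
			delete(locks, r.Key) // tombstone: lock released
		} else {
			locks[r.Key] = r.Value // lock acquired or refreshed
		}
	}
	return locks
}

func main() {
	history := []record{
		{Key: "Order-123", Value: []byte("locked")},
		{Key: "Order-456", Value: []byte("locked")},
		{Key: "Order-123", Value: nil}, // released
	}
	locks := restoreLocks(history)
	fmt.Println(len(locks)) // only Order-456 remains locked
}
```

The loop itself is trivial; the cost is that its length is proportional to the un-compacted tail of the topic, which is exactly what balloons during a failure storm.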

Here's where it gets ugly: your Horizontal Pod Autoscaler sees growing lag and spins up more pods. Those pods sit there "syncing state..." while the lag grows. The new pods can't help because they're still starting up. You've created a scaling death spiral.

Mitigation:

  • Aggressive retention on the lock topic (minutes, not days)
  • Tune max.poll.interval.ms to accommodate sync time
  • Use this pattern only for low-cardinality keys

2. The Memory Pressure

The lock map itself is small — a Go map with 100,000 string keys and integer counters fits in single-digit megabytes. That's not the problem.

The problem is what happens upstream. During a failure storm, every failed message triggers a lock write and a retry redirect. The lock topic and retry topic both grow rapidly. Your Kafka consumers buffer incoming records in memory before processing. The lock consumer, replaying a suddenly-huge topic during a restart, can spike memory well beyond steady-state expectations.

Combine this with Cost #1: your pod OOM-kills during state restore, restarts, tries again, OOM-kills again. Crash loop.

Mitigation:

  • Circuit breaker: if lock count exceeds threshold, fall back to stop-the-world
  • Set memory limits with headroom for failure storms, not steady-state
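The circuit breaker can be as blunt as a single comparison. A hypothetical sketch (the threshold, mode names, and `chooseMode` function are illustrative, not part of any real library):

```go
package main

import "fmt"

// mode is the consumer's current processing strategy.
type mode int

const (
	perKeyLocks  mode = iota // normal: redirect failed keys, keep the partition moving
	stopTheWorld             // fallback: pause the partition, retry in place
)

// chooseMode is a hypothetical circuit breaker: once the live lock
// count crosses a threshold, the bookkeeping itself becomes the risk,
// so we fall back to the simple stop-the-world behavior from Part 1.
func chooseMode(lockCount, threshold int) mode {
	if lockCount > threshold {
		return stopTheWorld
	}
	return perKeyLocks
}

func main() {
	fmt.Println(chooseMode(120, 10_000) == perKeyLocks)     // normal operation
	fmt.Println(chooseMode(50_000, 10_000) == stopTheWorld) // failure storm: degrade gracefully
}
```

The point of the fallback is that stop-the-world has bounded memory: one in-flight message per partition, no lock map growth, no replay amplification on restart.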

3. Topic Overhead

You wanted one topic: orders.

To make ordered retries work, you now operate four:

  • orders — main topic
  • orders-retry — the retry queue
  • orders-lock — the compacted state topic
  • orders-dlq — dead letters

200 topics across 50 services won't bring your Kafka cluster down — production clusters run thousands. The cost is operational: each topic needs provisioning, monitoring, retention policies, and alerting. Your SRE team will love you.
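To make "each topic needs provisioning" concrete, here is what that might look like with the stock kafka-topics.sh CLI. Partition counts, retention values, and compaction settings are illustrative placeholders, not recommendations:

```shell
# Main topic: ordinary delete-based retention.
kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic orders --partitions 12 --replication-factor 3

# Retry topic: same partition count, so the key -> partition mapping lines up.
kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic orders-retry --partitions 12 --replication-factor 3

# Lock topic: compacted, with aggressive segment rolling so compaction
# runs often and restarts replay a shorter tail.
kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic orders-lock --partitions 12 --replication-factor 3 \
  --config cleanup.policy=compact \
  --config segment.ms=600000 \
  --config min.cleanable.dirty.ratio=0.1

# DLQ: long retention, because humans read this one.
kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic orders-dlq --partitions 12 --replication-factor 3 \
  --config retention.ms=1209600000
```

Multiply this by every service that adopts the pattern, plus the matching monitoring and alerting, and the operational cost becomes visible.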

Mitigation:

  • Only use this pattern for topics that truly need it
  • Budget the operational overhead before committing

4. Zombie Lock Operations

In Part 2, we discussed zombie locks — keys locked forever because the retry message never made it. The fix was compensating transactions and careful ordering.

But "rare" still means "happens." When it does, you need to detect and clean it up.

Detection:

  • Alert on locks older than N hours
  • Compare lock topic keys against retry topic — orphans are zombies
  • Dashboard showing lock age distribution

Cleanup:

  • Manual tombstone producer (you'll need the CLI tool)
  • Automated reaper: background job that scans for stale locks and releases them
  • TTL simulation: treat locks older than X hours as expired

The reaper sounds easy until you realize: how do you know a lock is stale vs. just slow? A retry with exponential backoff might legitimately take hours. You need metadata — timestamps, attempt counts — in the lock record itself.

The question: Do you have this tooling built? Do you have observability for it? If not, you're one zombie lock away from a very long night.

The 3 AM Test

For me, this matters more than the technical drawbacks — even though they're intertwined.

Before building any complex system, try to write the runbook for it.

A downstream team calls you: "Why is Order-123 not updating?"

How many hops to find the answer?

  1. Check the main consumer logs
  2. See a "key locked, redirecting" message
  3. Query the lock topic to confirm the lock exists
  4. Query the retry topic to find the original failure
  5. Realize the lock is 4 hours old and the retry message... isn't there

Now what? Do you have tooling to manually produce a tombstone? Do you know which partition that key hashes to? Can you do this at 3 AM, half-asleep, with Slack pinging?
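The partition question at least has a deterministic answer. The Java Kafka client's default partitioner for non-null keys is murmur2 with the sign bit cleared, modulo the partition count; the Go port below is written from that algorithm (other clients, e.g. librdkafka-based ones, default to a different hash, so verify which producer wrote the record before trusting this):

```go
package main

import "fmt"

// partitionForKey mirrors the Java client's default keyed partitioning:
// toPositive(murmur2(key)) % numPartitions. Useful when you need to know
// which partition "Order-123" lives on before producing a tombstone.
func partitionForKey(key []byte, numPartitions uint32) uint32 {
	return (murmur2(key) & 0x7fffffff) % numPartitions
}

// murmur2 is a Go port of the murmur2 variant in the Java Kafka client.
func murmur2(data []byte) uint32 {
	const (
		seed uint32 = 0x9747b28c
		m    uint32 = 0x5bd1e995
		r           = 24
	)
	h := seed ^ uint32(len(data))
	i := 0
	for ; i+4 <= len(data); i += 4 {
		// Assemble 4 bytes little-endian, then mix.
		k := uint32(data[i]) | uint32(data[i+1])<<8 | uint32(data[i+2])<<16 | uint32(data[i+3])<<24
		k *= m
		k ^= k >> r
		k *= m
		h *= m
		h ^= k
	}
	// Tail: the last 1-3 bytes.
	switch len(data) - i {
	case 3:
		h ^= uint32(data[i+2]) << 16
		fallthrough
	case 2:
		h ^= uint32(data[i+1]) << 8
		fallthrough
	case 1:
		h ^= uint32(data[i])
		h *= m
	}
	h ^= h >> 13
	h *= m
	h ^= h >> 15
	return h
}

func main() {
	p := partitionForKey([]byte("Order-123"), 12)
	fmt.Println(p < 12) // deterministic partition in [0, 12)
}
```

Even with this, you still need the topic's partition count and a consumer pointed at that one partition, which is exactly the kind of tooling that should exist before the incident, not during it.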

Here's the kicker: the person debugging this at 3 AM might not be you. In many organizations, the on-call rotation is a separate team — engineers who've never seen your code, who are juggling alerts from a dozen services. Can they navigate your distributed lock registry with a runbook, or are they going to escalate and wake you up anyway?

And there's another problem: Kafka doesn't have key-based lookup. You can't just say "show me the lock for Order-123." To find a specific key, you either scan the entire topic or build custom tooling that maintains an index. Does your company have that tooling? Will you build and maintain it? How will this system behave during a global failure when hundreds of keys are locked simultaneously?

If you can't visualize this "penalty box" of locked keys — which keys are locked, for how long, and why — you haven't built a resilient system. You've built a black box.

The rule: If you can't write the runbook, you can't operate the system.

Observability

The alerts that matter:

  • acquired - released growing over time → locks aren't clearing, potential zombies
  • state_restore.duration approaching max.poll.interval.ms → rebalance loop incoming (see Part 2)
  • dlq.enqueued > 0 → manual intervention required
  • redirected spike without corresponding retry.processed increase → retries aren't succeeding

Without this, you're flying blind. Build the dashboards before you ship the feature.
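Those four alert conditions are simple predicates over a metrics snapshot. A sketch with illustrative field names and placeholder thresholds (any real deployment would wire these to its metrics backend and tune the numbers):

```go
package main

import "fmt"

// snapshot holds illustrative metric values scraped from the service.
type snapshot struct {
	Acquired, Released           float64 // lock counters
	RestoreDurationMs, MaxPollMs float64 // state restore vs. max.poll.interval.ms
	DLQEnqueued                  float64
	Redirected, RetryProcessed   float64
}

// triggeredAlerts evaluates the four alert conditions from the text.
// Thresholds (1000 outstanding locks, 80% of max.poll, 10x redirect
// ratio) are placeholders, not recommendations.
func triggeredAlerts(s snapshot) []string {
	var out []string
	if s.Acquired-s.Released > 1000 {
		out = append(out, "locks not clearing: potential zombies")
	}
	if s.RestoreDurationMs > 0.8*s.MaxPollMs {
		out = append(out, "state restore near max.poll.interval.ms: rebalance loop risk")
	}
	if s.DLQEnqueued > 0 {
		out = append(out, "DLQ non-empty: manual intervention required")
	}
	if s.Redirected > 10*s.RetryProcessed {
		out = append(out, "redirect spike without retry progress")
	}
	return out
}

func main() {
	s := snapshot{
		Acquired: 5000, Released: 100,
		RestoreDurationMs: 290000, MaxPollMs: 300000,
		DLQEnqueued: 3, Redirected: 900, RetryProcessed: 10,
	}
	for _, a := range triggeredAlerts(s) {
		fmt.Println(a) // all four fire in this failure-storm snapshot
	}
}
```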

Before You Build This

Before reaching for distributed locks, exhaust the simpler options.

Option 1: Make It Commutative

I know — in an existing system, this is easy to say and almost impossible to do. But if you're building something new, ask: do you really need strict ordering?

Sometimes you can redesign the problem away. Store deltas instead of absolute values. Use event sourcing with snapshots and recalculate on late arrivals. Add idempotency keys so duplicates and reordering become harmless. These aren't always possible. But when they are, they eliminate the problem entirely rather than managing it.
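To make the commutative option concrete, here is a toy sketch of both ideas: deltas that sum to the same total in any order, and a hypothetical idempotency-key set that makes duplicate delivery harmless (function and field names are illustrative):

```go
package main

import "fmt"

// applyDeltas folds balance deltas into a total. Because addition is
// commutative, any arrival order produces the same result, and strict
// ordering stops mattering.
func applyDeltas(deltas []int) int {
	total := 0
	for _, d := range deltas {
		total += d
	}
	return total
}

// applyOnce uses a hypothetical idempotency-key set: a duplicate
// delivery of the same event is skipped, so redelivery is harmless.
func applyOnce(seen map[string]bool, eventID string, delta int, total *int) {
	if seen[eventID] {
		return // already applied
	}
	seen[eventID] = true
	*total += delta
}

func main() {
	fmt.Println(applyDeltas([]int{+50, -20, +5})) // 35
	fmt.Println(applyDeltas([]int{-20, +5, +50})) // 35: reordered, same answer

	seen, total := map[string]bool{}, 0
	applyOnce(seen, "evt-1", 50, &total)
	applyOnce(seen, "evt-1", 50, &total) // duplicate: ignored
	fmt.Println(total)                   // 50
}
```

Contrast this with storing absolute values ("balance is now 35"), where a late or duplicated event silently corrupts state and ordering becomes load-bearing.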

Option 2: External State Store (Redis/Dragonfly)

Most of you reading this probably thought: "Why not just use Redis and skip all this Kafka-native complexity?"

Fair point. Redis gives you:

  • Native TTL — zombie locks expire automatically, no reaper needed
  • Instant startup — no cold start problem, no state to rebuild
  • Fast lookups — no consumer lag to worry about, O(1) key access
  • Familiar tooling — your team probably already knows how to operate Redis

But this isn't free either. You're adding a new infrastructure dependency. Your Kafka consumer now fails if Redis is down. Network latency on every lock check adds up. And now you're debugging two systems instead of one.

A word on Redlock: if you're considering distributed locking with Redis, read Martin Kleppmann's analysis first. Distributed locking is hard regardless of the tool. Redis doesn't make the fundamental problem easier — it makes the infrastructure easier.

My take: If I were starting fresh today and needed ordered retries with blocking semantics, I'd reach for Redis/Dragonfly before building the Kafka-native solution we covered in Parts 1 and 2. Redis doesn't eliminate complexity — it moves it somewhere your team probably already understands.

Option 3: Accept the Trade-off

Sometimes the "stop-the-world" approach from Part 1 is actually fine.

If your topic has low throughput, blocking the partition during a retry might be acceptable. The complexity of distributed locks only pays off when throughput requirements make blocking unacceptable.

Run the numbers. If your worst-case retry takes 5 minutes and your partition processes 100 messages per hour, you're blocking about 8 messages. Is that worth the operational burden?
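The arithmetic behind that estimate is just arrival rate times blocked time:

```go
package main

import "fmt"

// blockedMessages estimates how many messages queue up behind a
// stop-the-world retry: arrival rate times how long the partition
// is blocked.
func blockedMessages(perHour, retryMinutes float64) float64 {
	return perHour / 60.0 * retryMinutes
}

func main() {
	// 100 msgs/hour, 5-minute worst-case retry.
	fmt.Printf("%.1f\n", blockedMessages(100, 5)) // 8.3
}
```

Run the same formula with your real numbers; if the answer is in the single digits, the stop-the-world approach probably wins.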

When It's Worth It

After all that, there are cases where this pattern might be the right call:

  1. Strict ordering is contractual — Downstream systems will break or charge you money if events arrive out of order. Not "might cause weird bugs" — actually break, with consequences.

  2. Stop-the-world is unacceptable — Throughput requirements mean you can't block a partition for minutes. You've measured this, not assumed it.

  3. You have the tooling for debugging and observability — Your team can see which keys are locked, for how long, and why. You can produce a tombstone manually. An SRE can follow the runbook at 3 AM without escalating.

  4. Your team understands it — The next engineer can debug it without reading three blog posts. You've documented the architecture, the failure modes, and the recovery procedures.

If all four are true, go ahead. The kafka-resilience library handles the edge cases we covered in Part 2.

If any are false, reconsider.

The Tax

Engineering is the art of choosing which problems you want to have.

This solution solves the ordering problem. In exchange, it gives you the state management problem, the cold start problem, the memory pressure problem, and the 3 AM debugging problem.

The technical challenges are solvable. The question is whether you should be the one solving them — or whether a simpler solution, even an imperfect one, is the wiser choice for your team and your system.

Build the clever thing when you've exhausted the simple options. Not before.
