When the event is valid but the entity isn’t ready
Context: Spring Kafka, Confluent Cloud, Java enterprise backend.
The Problem: The Update Succeeded… And That Was The Bug
A Kafka consumer. An incoming event carrying data to apply to a core entity. A downstream archival process. Nothing unusual, until I found silently corrupted entities in production with zero errors in the logs.
The update logic was correct. The entity existed. The data in the event was valid.
But the entity wasn’t in the right lifecycle state to receive this update yet. An internal validation workflow was still in progress upstream.
My consumer didn’t know. It applied the update anyway — successfully — and silently corrupted the entity’s lifecycle in production.

Expected lifecycle vs. Reality: The update succeeds, but breaks the logic order.
A silent lifecycle violation is worse than a crash: nothing alerts, nothing fails visibly. A technically successful operation at the wrong moment is worse than a failure, because a failure you can detect. A silent lifecycle corruption you cannot.

The gap between “Existence” and “Readiness” is where bugs live.
Two Retry Approaches That Look Fine (Until They Hurt)
Trap #1 — “I’ll just poll until it’s ready”
If I keep the record “in-flight” and refuse to commit the offset until the entity is ready, I don’t just delay this one message. Kafka is FIFO per partition: everything behind that offset on that partition is stuck.
With concurrency=3, it’s not that the broker is blocked; it’s one partition’s entire throughput that stalls, silently, under load.

Holding the offset freezes the queue. Partition 0 is stuck, while others flow.
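To make the trap concrete, here is a sketch of what that offset-holding listener tends to look like (illustrative names throughout: `isEntityReady`, `applyUpdate`, and the container factory are hypothetical; assumes Spring Kafka with manual ack mode, where `Acknowledgment.nack(Duration)` re-seeks the record):

```java
// Anti-pattern sketch (do NOT ship): refusing to commit until the entity is ready.
// The nack re-seeks the partition to this record, so every record behind it
// on the SAME partition waits too — exactly the frozen-queue picture above.
@KafkaListener(topics = "entity-created", containerFactory = "manualAckFactory")
public void consume(String message, Acknowledgment ack) {
    if (!isEntityReady(message)) {        // hypothetical readiness check
        ack.nack(Duration.ofSeconds(30)); // pause + re-seek: blocks this partition
        return;
    }
    applyUpdate(message);                 // hypothetical business update
    ack.acknowledge();
}
```

The other two partitions keep flowing, which is why the stall is so easy to miss until one partition's lag shows up on a dashboard.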
Trap #2 — “I’ll Thread.sleep() and try again”
Sleeping in the listener thread is a classic way to accidentally trigger a rebalance. If the consumer stops polling for longer than max.poll.interval.ms, the broker assumes the consumer is dead, rebalances the group, and the message gets replayed — potentially forever.

The infinite rebalance loop: Broker thinks consumer is dead.
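The broker-side rule behind this loop can be modeled in a few lines of plain Java (a deliberately simplified model, not real Kafka): the liveness timer resets only when the consumer calls poll(), so a long Thread.sleep() in the listener is indistinguishable from a dead consumer.

```java
// Simplified model of the max.poll.interval.ms check (not real Kafka):
// only poll() resets the broker's liveness timer for this consumer.
class PollLiveness {
    private final long maxPollIntervalMs;
    private long lastPollAtMs;

    PollLiveness(long maxPollIntervalMs) {
        this.maxPollIntervalMs = maxPollIntervalMs;
    }

    void poll(long nowMs) {
        lastPollAtMs = nowMs; // a listener stuck in Thread.sleep() never gets here
    }

    boolean wouldTriggerRebalance(long nowMs) {
        return nowMs - lastPollAtMs > maxPollIntervalMs;
    }
}
```

With the default max.poll.interval.ms of 5 minutes, a 6-minute sleep between polls already gets the consumer evicted, and the replayed message triggers the same sleep on the next consumer.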
The “Ideal” Pattern I Didn’t Use (On Purpose)
In theory, non-blocking retry is the cleanest approach: acknowledge the message immediately, park it in a dedicated retry topic, and process it later without blocking the main partition. That’s exactly what Spring Kafka’s @RetryableTopic gives you.
@RetryableTopic(
    attempts = "4",
    backoff = @Backoff(delay = 30_000, multiplier = 4),
    include = EntityNotReadyException.class
)
@KafkaListener(topics = "entity-created")
public void consume(String message) { ... }
Under the hood, RetryableTopic creates dedicated retry topics automatically (e.g. entity-created-retry-0, entity-created-dlt).
The key benefit: the offset is committed immediately, so the main partition keeps flowing while the message is retried asynchronously.
It’s a clean option when you’re allowed to create the extra topics.
One important nuance: RetryableTopic doesn’t make retries disappear — it moves them to intermediate retry topics that Spring Kafka creates and consumes automatically. The delay is enforced by a separate consumer on those retry topics. If the retry topic consumer also fails, you end up managing a second-level DLT. Elegant, yes — but not zero-complexity.
Why I didn’t use it: in a restricted enterprise environment (managed Confluent Cloud cluster, governance rules), creating extra retry topics can be forbidden or slow to approve. Topic creation goes through a ticket queue — or a flat “no”. I couldn’t assume I had that lever.
So I went with the best solution available within my constraints.
The Business Question That Made The Architecture Obvious
Before designing anything, I asked one question:
“How long does it usually take for the entity to reach its stable state after the event fires?”
Answer: “A few minutes. Not guaranteed, but usually fast.”
That changed everything. I didn’t need a sophisticated non-blocking retry topology. I needed a pragmatic delay to cover the common case, plus a safety net for the rest. I sized the solution for the failure rate we actually observed — not for a theoretical worst case.
The Fix I Shipped: Reuse What Was Already There
The project already had a retry mechanism wired through Spring Kafka’s DefaultErrorHandler with exponential backoff — already handling transient failures like network timeouts. I just needed to plug in.
// This is essentially the only new business logic I added
if (!isEntityInStableState(fetchedEntity)) {
    throw new EntityNotReadyException(
        "Entity lifecycle not ready: " + event.getEntityId()
    );
}
// → DefaultErrorHandler takes over automatically
The (simplified for confidentiality) backoff configuration:
ExponentialBackOff backOff = new ExponentialBackOff(120_000L, 2.0);
backOff.setMaxInterval(600_000L); // cap at 10 min per attempt
backOff.setMaxElapsedTime(3_600_000L); // stop retrying after ~1h
DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(kafkaTemplate);
DefaultErrorHandler errorHandler = new DefaultErrorHandler(recoverer, backOff);
Retry schedule: 2 min → 4 min → 8 min → 10 min → 10 min → … → DLQ
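You can sanity-check that schedule with a few lines of plain Java. This is an approximation of ExponentialBackOff's arithmetic (an assumption about its internals, not the library itself): each interval is min(initial × multiplierⁿ, maxInterval), and retries stop once the cumulative delay would exceed maxElapsedTime.

```java
import java.util.ArrayList;
import java.util.List;

public class BackoffSchedule {
    // Approximation of Spring's ExponentialBackOff arithmetic (assumed, not
    // the library code): interval_n = min(initial * multiplier^n, maxInterval);
    // stop once cumulative delay would exceed maxElapsedTime.
    static List<Long> schedule(long initialMs, double multiplier,
                               long maxIntervalMs, long maxElapsedMs) {
        List<Long> delays = new ArrayList<>();
        long interval = initialMs;
        long elapsed = 0;
        while (elapsed + interval <= maxElapsedMs) {
            delays.add(interval);
            elapsed += interval;
            interval = Math.min((long) (interval * multiplier), maxIntervalMs);
        }
        return delays;
    }

    public static void main(String[] args) {
        // Same parameters as the configuration above
        for (long d : schedule(120_000L, 2.0, 600_000L, 3_600_000L)) {
            System.out.print(d / 60_000 + "min ");
        }
        // prints: 2min 4min 8min 10min 10min 10min 10min
    }
}
```

Seven attempts totaling 54 minutes of delay, which is where the "~1h" cap comes from.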
⚠️ Critical rule: max.poll.interval.ms must be strictly greater than maxInterval (your per-attempt cap), not maxElapsedTime. The thread only blocks for one interval at a time, not for the entire retry window.
Concrete example: with maxInterval = 600,000 ms (10 min), set max.poll.interval.ms = 660,000 (11 min). Setting it to match maxElapsedTime (1h+) would dangerously delay dead consumer detection.
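Putting both pieces together, the wiring could look roughly like this. This is a hypothetical sketch, not the exact production config: bean names are illustrative, and it assumes Spring Kafka 2.8+ (where setCommonErrorHandler is the hook for DefaultErrorHandler) plus Spring Boot's KafkaProperties.

```java
// Hypothetical wiring sketch; assumes Spring Kafka 2.8+ and Spring Boot.
@Configuration
public class KafkaRetryConfig {

    @Bean
    public ConsumerFactory<String, String> consumerFactory(KafkaProperties kafkaProperties) {
        Map<String, Object> props = kafkaProperties.buildConsumerProperties();
        // 11 min: strictly greater than the 10 min maxInterval, NOT maxElapsedTime
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 660_000);
        return new DefaultKafkaConsumerFactory<>(props);
    }

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory,
            DefaultErrorHandler errorHandler) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        factory.setCommonErrorHandler(errorHandler); // the backoff + DLQ recoverer
        return factory;
    }
}
```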
In production, a CloudWatch alarm on the DLQ ensures the on-call team is notified if an entity never reaches its stable state after 1 hour. Don’t ship a retry mechanism without a safety net you can actually see.

Only 1/3 of throughput is paused during backoff. The rest flows.
Decision Framework (Cheat Sheet)

The cheat sheet: Start simple, reuse existing infrastructure, and only accept partition-blocking if your traffic allows it.
In short: if you can create topics → RetryableTopic. If you’re on a governed broker with rare failures → DefaultErrorHandler + backoff. If failures are frequent at high volume → fix the upstream contract.
What I Learned: Complexity Is a Choice
The temptation in event-driven systems is to reach for the most powerful tool available. Complexity is a choice — and in this case, I chose not to make it.
This wasn’t a Kafka feature problem. It was a definition problem: the event signaled data availability, while my process assumed entity readiness. That gap is invisible until it isn’t, and when it surfaces, it leaves no trace in your logs.
Technically, I could have fought for @RetryableTopic. I could have built a custom non-blocking retry topology. I could have asked the upstream team to delay their event trigger. Instead, I aligned with business reality ("usually a few minutes") and chose the simplest architecture that respected Kafka's partition semantics and my enterprise constraints.
The best architecture decision I made that week was asking a business question before opening my IDE.