A production-grade approach to integrating Kafka with legacy systems using Spring Kafka Retryable Topics.
The Why
When we first discussed connecting a Kafka stream to our legacy downstream system, the obvious question was: why not just consume directly in the downstream app?
The honest answer — our legacy system simply isn't built for it. Kafka's polling model doesn't fit its threading architecture. Connections are tightly controlled, and sneaking a Kafka client into it would increase system risk more than it would solve. But the deeper issue is a fundamental mismatch: our downstream system is synchronous and deterministic. Kafka is asynchronous. That tension can't just be wished away.
So we built something in between — a stateless Kafka adaptor that absorbs event streams, converts them to synchronous HTTP calls, and guarantees zero-loss delivery. This post is everything we learned building it, including the parts that bit us in production.
Tech Stack
- Java 21
- Spring Boot 3.x
- Spring Web (Spring MVC)
- Spring Kafka (with @RetryableTopic)
- OkHttp 5.x
1. The Core Challenge: The Stateless Synchronous Hand-off
Most Kafka consumer implementations follow a "consume-and-save" pattern — pull the message, persist it locally, acknowledge it. Simple and safe.
Our adaptor doesn't have that luxury. It has no local database. There's nowhere to buffer. So the only way to guarantee delivery is to make the Kafka acknowledgement contingent on the downstream system actually accepting the message.
We call this a Synchronous Hand-off: the consumer only ACKs the Kafka message once it receives a positive functional response from the downstream HTTP endpoint. The message is only "gone" from Kafka's perspective once it has safely landed downstream. No silent drops, no optimistic acks.
This one decision shapes the entire resilience strategy that follows.
2. A Multi-Layered Resilience Strategy
Running in production quickly teaches you to distinguish between a blip (a few seconds of network jitter) and a blackout (the downstream service is down for 30 minutes). A flat retry policy treats both the same way — and that's a problem.
So we built a tiered recovery model.
Layer 1 — Synchronous HTTP Retries
The first line of defence is a quick, synchronous retry loop inside our HttpClientService. For a transient 500 or a connection timeout, we don't want to immediately escalate to an async retry topic — that's expensive and slow. A few fast retries handle the vast majority of brief network hiccups before anything more serious kicks in.
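To make the pattern concrete, here is a self-contained sketch of that loop. The names, attempt counts, and pause length are illustrative assumptions, not the actual HttpClientService: the essential idea is a small bounded retry that reports failure instead of throwing, so the caller decides whether to escalate to Layer 2.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Illustrative sketch of the Layer 1 loop (names and limits are assumptions,
// not the real HttpClientService): retry a call a few times with a short,
// fixed pause, and report failure rather than throwing, so the caller can
// decide whether to escalate to the async retry topic.
public final class SyncRetry {

    public static Optional<Boolean> sendWithRetry(Supplier<Boolean> httpCall,
                                                  int maxAttempts,
                                                  long pauseMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (httpCall.get()) {
                    return Optional.of(true);   // downstream accepted the message
                }
            } catch (RuntimeException transientFailure) {
                // e.g. a connection timeout or 5xx mapped to an exception;
                // swallow it here and fall through to the next attempt
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(pauseMillis);  // short fixed pause between quick retries
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return Optional.of(false);              // exhausted: caller escalates to Layer 2
    }
}
```

Returning Optional&lt;Boolean&gt; rather than throwing keeps the "blip vs. blackout" decision in one place: the consumer, not the HTTP client.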
Layer 2 — Asynchronous Non-Blocking Retries (@RetryableTopic)
If the synchronous retries are exhausted, we hand the problem to Spring Kafka's @RetryableTopic. This is really the heart of the whole design.
The key insight here is Head-of-Line blocking avoidance. Without retry topics, a single failed message can pin a partition — nothing behind it gets processed until it either succeeds or is manually skipped. With @RetryableTopic, the failed message is published to a separate retry topic and the main consumer moves on immediately. System throughput stays stable even while the downstream is recovering.
There's a tradeoff worth being upfront about: moving a message to a retry topic breaks ordering guarantees for that partition. Subsequent messages from the same partition will be processed before the retried one. For a lot of systems that's a deal breaker — but for ours it isn't, because each event is an independent financial instruction (an account update, a customer update, etc.) identified by its own transaction ID. The downstream system processes them idempotently by transaction ID, not by arrival sequence. If your domain requires strict ordering — say, a series of balance mutations on the same account that must apply in order — @RetryableTopic in this form is not the right tool without additional sequencing logic on top.
Here's what our actual consumer looks like:
```java
@RetryableTopic(
    include = {DownStreamDownException.class},
    attempts = "${kafka.retry.topic.maxAttempts:1}",
    backoff = @Backoff(delayExpression = "${kafka.retry.backoff.delay:3600000}"),
    autoCreateTopics = "${kafka.autoCreateTopics:false}",
    dltTopicSuffix = DLQ_SUFFIX,
    retryTopicSuffix = RETRY_SUFFIX
)
@KafkaListener(
    topics = {"${spring.kafka.topic.updates.account}"},
    groupId = "${spring.kafka.consumer.group-id}",
    id = ACCOUNT_UPDATES_CONSUMER_ID
)
public void consumeMessage(
        ConsumerRecord<String, String> consumerRecord,
        @Header(value = KafkaHeaders.RECEIVED_PARTITION, required = false) Integer partition,
        @Header(value = KafkaHeaders.OFFSET, required = false) Long offset,
        @Header(value = KafkaHeaders.RECEIVED_TOPIC) String receivedTopic) {
    consume(consumerRecord, partition, offset, receivedTopic);
}
```
A few things worth calling out here:
- include = {DownStreamDownException.class} — only retry on this specific exception. Validation errors and bad payloads should never retry; they should fail fast to the DLT.
- autoCreateTopics = false — this matters a lot in production (more on this below).
- The backoff delay is a flat 1 hour by default (3600000 ms, configurable from the property file). This was a deliberate choice, not a default we left in place. In a financial context, the downstream outage pattern we've seen is almost never "back in 30 seconds" — it's either a brief HTTP blip (caught by the Layer 1 sync retries) or a proper incident taking 20–60 minutes to resolve. A short exponential backoff would just fill the retry topic with noise during that window. One hour means the retry fires once, cleanly, after the service is likely recovered.
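For reference, a minimal application.properties sketch that these placeholders resolve against. The topic name, group ID, and values are illustrative, not our actual configuration; note that in Spring Kafka, attempts counts total deliveries including the first, so a value of 2 yields exactly one async retry pass.

```properties
# Total delivery attempts across main + retry topics (2 = one retry pass)
kafka.retry.topic.maxAttempts=2
# Flat 1-hour delay before the retried message is re-processed
kafka.retry.backoff.delay=3600000
# Never auto-create retry/DLT topics in a locked-down cluster
kafka.autoCreateTopics=false

spring.kafka.topic.updates.account=account-updates
spring.kafka.consumer.group-id=legacy-adaptor
```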
And when the HTTP call fails inside processMessage:
```java
@Override
protected void processMessage(String messageWithMetadata, String endpoint,
                              Integer originalPartition, Long originalOffset, String originalTopic) {
    boolean success = httpClientService
            .sendWithRetry(messageWithMetadata, endpoint, originalPartition, originalOffset, originalTopic)
            .orElse(false);
    if (!success) {
        log.debug("Request failed for Payload: {}", messageWithMetadata);
        throw new DownStreamDownException("DownStream service unavailable for topic: " + originalTopic);
    }
    log.debug("Request Payload: {}", messageWithMetadata);
    trackDeliverySuccess(originalTopic);
}
```
Throwing DownStreamDownException is the trigger. Spring Kafka catches it, sees it matches the include list, and routes the message to the retry topic. Clean and intentional.
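The post doesn't show the exception class itself; a minimal sketch would look like this (the real class may carry extra context such as the HTTP status). The important part is that it is unchecked, so it propagates out of the listener without a throws clause, where Spring Kafka matches it against the include list.

```java
// Minimal sketch (assumption: the real class may carry extra fields).
// Unchecked, so it propagates out of the listener method cleanly and
// Spring Kafka can match it against the @RetryableTopic include list.
public class DownStreamDownException extends RuntimeException {
    public DownStreamDownException(String message) {
        super(message);
    }
}
```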
Layer 3 — Dead Letter Topic (DLT)
If all retry attempts are exhausted, the message lands in the DLT (Spring Kafka calls it a Dead Letter Topic, not Queue — though the concept is the same).
Two scenarios land messages here:
- All retry attempts exhausted.
- Exceptions not in the include list — e.g., a malformed payload that would fail on every attempt anyway. These skip the retry topics entirely and go straight to the DLT.
```java
@DltHandler
public void handleDlt(
        ConsumerRecord<String, String> consumerRecord,
        @Header(value = KafkaHeaders.RECEIVED_PARTITION, required = false) Integer partition,
        @Header(value = KafkaHeaders.OFFSET, required = false) Long offset,
        Exception ex) {
    trackDlq();
    handleDlq(consumerRecord, partition, offset);
}
```
DLT messages are manually monitored. In practice, a message in the DLT is either a bad payload (fix the producer) or evidence of a prolonged downstream outage (manual replay needed after recovery).
3. Topic Naming & Consumer Group Gotcha (We Learned This the Hard Way)
Spring Kafka expects specific naming conventions for retry and DLT topics:
- Retry topic → <main-topic>-retry
- DLT topic → <main-topic>-dlt
It also automatically creates consumer groups for these by appending suffixes to your main group ID.
In UAT, Kafka had auto.create.topics.enable=true and group auto-creation on. Everything just worked. We moved to production — where auto-create is disabled for good reason — and nothing worked. The retry flow silently broke because the consumer groups didn't exist and we didn't have permission to create them.
The fix: explicitly create all three consumer groups before deploying and make sure your service account has the right ACLs on each. This should be part of your deployment checklist, not a production incident.
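As a starting point for that checklist, the provisioning steps can be sketched with the stock Kafka CLI tools. Topic names, partition counts, and the principal below are illustrative assumptions; adjust them to your cluster and security setup.

```shell
# Illustrative checklist (names and counts are assumptions, adapt to your cluster)

# 1. Create the retry and DLT topics alongside the main topic
kafka-topics.sh --bootstrap-server broker:9092 --create --topic account-updates-retry \
  --partitions 3 --replication-factor 3
kafka-topics.sh --bootstrap-server broker:9092 --create --topic account-updates-dlt \
  --partitions 3 --replication-factor 3

# 2. Grant the service account read access on all three topics
kafka-acls.sh --bootstrap-server broker:9092 --add \
  --allow-principal User:legacy-adaptor --operation Read \
  --topic account-updates --topic account-updates-retry --topic account-updates-dlt

# 3. Write access on the retry/DLT topics (the consumer republishes to them)
kafka-acls.sh --bootstrap-server broker:9092 --add \
  --allow-principal User:legacy-adaptor --operation Write \
  --topic account-updates-retry --topic account-updates-dlt

# 4. Read access on the consumer groups (main + suffixed retry/DLT groups);
#    a prefixed pattern covers the auto-suffixed group IDs in one rule
kafka-acls.sh --bootstrap-server broker:9092 --add \
  --allow-principal User:legacy-adaptor --operation Read \
  --group legacy-adaptor --resource-pattern-type prefixed
```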
4. The Silent 20x Header Bloat
This one surprised us. By default, Spring Kafka attaches the full exception stack trace to the headers of every message it routes to a retry topic. In a high-throughput system, a 1KB message can balloon to 20KB just from headers.
Multiply that across millions of retries and you have a serious storage and network overhead problem that never showed up in unit tests.
We solved it with a KafkaProducerInterceptor:
```java
@Override
public ProducerRecord<String, String> onSend(ProducerRecord<String, String> producerRecord) {
    Headers headers = producerRecord.headers();
    String topic = producerRecord.topic();
    if (topic.endsWith("-retry") || topic.endsWith("-dlt")) {
        // Strip the full stack trace - keep essential error metadata, drop the wall of text
        for (String headerKey : HEADERS_TO_REMOVE_RETRY) {
            headers.remove(headerKey);
        }
    }
    // Headers are mutated in place, so the original record can be returned as-is
    // (re-creating the record would silently drop its timestamp).
    return producerRecord;
}
```
HEADERS_TO_REMOVE_RETRY includes kafka_exception-stacktrace and a few others that add size without adding operational value. The result: roughly 90% reduction in retry topic storage overhead, while still keeping the exception message and class for debugging.
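The exact contents of that set aren't shown in the post; a plausible sketch is below. Only kafka_exception-stacktrace is confirmed by the text, the second entry is an assumption about another large, low-value header.

```java
import java.util.Set;

// Sketch only: the post confirms kafka_exception-stacktrace; the other
// entry is an assumption about which bulky headers add little value.
public final class RetryHeaderFilter {
    public static final Set<String> HEADERS_TO_REMOVE_RETRY = Set.of(
            "kafka_exception-stacktrace",       // full stack trace, the main offender
            "kafka_dlt-exception-stacktrace"    // DLT variant of the same trace
    );
}
```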
Register it in your producer config:
```java
props.put(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG, HeaderFilteringProducerInterceptor.class.getName());
```
5. Preserving Original Metadata Across Retries
In financial systems, idempotency is non-negotiable. The downstream service needs to detect duplicates — and it does that using the original Kafka offset and partition as a correlation key.
The problem: when a message moves to a retry topic, Kafka assigns it a new offset and partition. If you blindly forward that metadata, your duplicate detection breaks.
Spring Kafka saves the day here by writing the original coordinates into headers (kafka_original-offset, kafka_original-topic, kafka_original-partition) before routing to the retry topic.
Our KafkaMessageUtil reads these:
```java
public static Long resolveOffset(ConsumerRecord<?, ?> record, String topic) {
    if (topic.contains("-retry")) {
        // Use the original offset from headers, not the retry topic's offset
        return extractLongHeader(record, "kafka_original-offset");
    }
    return record.offset();
}
```
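The extractLongHeader helper isn't shown in the post; its decoding core might look like the sketch below. The assumption here is that Spring Kafka writes these numeric headers as raw big-endian bytes, so an 8-byte value decodes as a long and a 4-byte value as an int (the real KafkaMessageUtil would wrap this with the ConsumerRecord header lookup).

```java
import java.nio.ByteBuffer;

// Sketch of the header-decoding core (assumption: Spring Kafka encodes the
// original offset/partition headers as raw big-endian bytes).
public final class HeaderCodec {
    public static Long decodeLongHeader(byte[] value) {
        if (value == null) {
            return null;                       // header absent on the record
        }
        ByteBuffer buf = ByteBuffer.wrap(value);
        return switch (value.length) {
            case Long.BYTES -> buf.getLong();  // e.g. kafka_original-offset
            case Integer.BYTES -> (long) buf.getInt();
            default -> null;                   // unexpected encoding: let the caller fall back
        };
    }
}
```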
The downstream service always gets the coordinates of the first delivery attempt, regardless of how many retries have happened. This makes duplicate detection work cleanly across the entire message life cycle.
6. The Heartbeat & Operational Safety
There's a well-known failure mode called the Thundering Herd: your downstream comes back online after an outage, and every backed-up consumer immediately floods it with requests, crashing it again.
We prevent this with a HeartbeatScheduler that pings the downstream health endpoint every 300 seconds. If the health check fails, we call pause() on all active consumer containers. Messages accumulate safely in the broker. When the health check recovers, we call resume().
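Stripped of the Spring wiring, the scheduler reduces to a small state machine; here is a self-contained sketch. The timer and the pause/resume callbacks delegating to Spring's KafkaListenerEndpointRegistry containers are assumptions about the surrounding plumbing, not shown code.

```java
import java.util.function.BooleanSupplier;

// Self-contained sketch of the heartbeat state machine. In the real service
// (an assumption about the wiring), tick() runs on a 300-second schedule and
// the callbacks call pause()/resume() on all listener containers.
public final class HeartbeatGate {
    private final BooleanSupplier healthCheck;
    private final Runnable pauseAll;
    private final Runnable resumeAll;
    private boolean paused;

    public HeartbeatGate(BooleanSupplier healthCheck, Runnable pauseAll, Runnable resumeAll) {
        this.healthCheck = healthCheck;
        this.pauseAll = pauseAll;
        this.resumeAll = resumeAll;
    }

    // Invoked on each scheduled tick; acts only on state *transitions*,
    // so a long outage doesn't trigger repeated pause calls.
    public void tick() {
        boolean healthy = healthCheck.getAsBoolean();
        if (!healthy && !paused) {
            pauseAll.run();     // downstream down: let messages pile up in the broker
            paused = true;
        } else if (healthy && paused) {
            resumeAll.run();    // downstream back: drain the backlog
            paused = false;
        }
    }
}
```

Acting only on transitions is what makes pause/resume idempotent from the containers' point of view.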
Worth being honest about what this approach does not cover. A heartbeat tells you if the service is alive — not if it's healthy. A downstream that is up but returning 500 on 40% of real requests will pass the health check and still get flooded the moment consumers resume. We accepted this limitation because our downstream health endpoint is a genuine deep check — it validates DB connectivity and core dependencies, not just a surface-level HTTP 200. Combined with the 1-hour backoff window, the retry storm risk is manageable. But it is a known gap, and it is the main reason we are moving to circuit breakers.
A ConsumerMetricsScheduler logs per-minute delivery success and DLT rates. This gives the SRE team a real-time view of system health without digging through logs.
7. Where We're Heading — Circuit Breakers
We're evaluating Resilience4j as a replacement for the Heartbeat approach. The key advantages:
- Monitors actual traffic — rather than a synthetic health check.
- Half-Open state — instead of resuming all consumers at once after recovery, it lets a few "probe" requests through first. Only if those succeed does it open the floodgates.
- Removes the background polling thread — moves toward a fully event-driven resilience model.
Final Thoughts
Building a zero-loss Kafka adaptor is genuinely harder than it looks. The individual pieces — retries, dead letters, health checks — are all standard. The challenge is in the details: what happens to headers when you retry? What offset do you report to the downstream? What's your consumer group strategy in a locked-down production cluster?
The combination that worked for us:
- A stateless consumer with synchronous HTTP hand-offs
- Spring Kafka @RetryableTopic for async, non-blocking retry flows
- A producer interceptor to keep retry-topic storage sane
- Original metadata preservation for idempotency
- Heartbeat-driven pause/resume to prevent thundering herd
It's production-proven now. Hopefully this saves someone else the three days of debugging we did.
Originally published on Medium.
