<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yogesh Kale</title>
    <description>The latest articles on DEV Community by Yogesh Kale (@yogesh_kale_9e617cb1c2561).</description>
    <link>https://dev.to/yogesh_kale_9e617cb1c2561</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1909150%2F3592ad9f-00ad-441a-812c-d90d26046b11.png</url>
      <title>DEV Community: Yogesh Kale</title>
      <link>https://dev.to/yogesh_kale_9e617cb1c2561</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yogesh_kale_9e617cb1c2561"/>
    <language>en</language>
    <item>
      <title>Building a Zero-Loss Kafka Consumer with Spring Kafka Retryable Topics</title>
      <dc:creator>Yogesh Kale</dc:creator>
      <pubDate>Tue, 14 Apr 2026 15:36:57 +0000</pubDate>
      <link>https://dev.to/yogesh_kale_9e617cb1c2561/building-a-zero-loss-kafka-consumer-with-spring-kafka-retryable-topics-2f36</link>
      <guid>https://dev.to/yogesh_kale_9e617cb1c2561/building-a-zero-loss-kafka-consumer-with-spring-kafka-retryable-topics-2f36</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A production-grade approach to integrating Kafka with legacy systems using Spring Kafka Retryable Topics.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Why
&lt;/h2&gt;

&lt;p&gt;When we first discussed connecting a Kafka stream to our legacy downstream system, the obvious question was: &lt;strong&gt;&lt;em&gt;why not just consume directly in the downstream app?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The honest answer: our legacy system simply isn't built for it. Kafka's polling model doesn't fit its threading architecture. Connections are tightly controlled, and bolting a Kafka client onto it would add more risk than it would remove. But the deeper issue is a fundamental mismatch: our downstream system is &lt;strong&gt;synchronous and deterministic&lt;/strong&gt;. Kafka is &lt;strong&gt;asynchronous&lt;/strong&gt;. That tension can't just be wished away.&lt;/p&gt;

&lt;p&gt;So we built something in between — a &lt;strong&gt;stateless Kafka adaptor&lt;/strong&gt; that absorbs event streams, converts them to synchronous HTTP calls, and guarantees zero-loss delivery. This post covers everything we learned building it, including the parts that bit us in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Java 21&lt;/li&gt;
&lt;li&gt;Spring Boot 3.x&lt;/li&gt;
&lt;li&gt;Spring Web (Spring MVC)&lt;/li&gt;
&lt;li&gt;Spring Kafka (with &lt;code&gt;@RetryableTopic&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;OkHttp 5.x&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. The Core Challenge: The Stateless Synchronous Hand-off
&lt;/h2&gt;

&lt;p&gt;Most Kafka consumer implementations follow a "consume-and-save" pattern — pull the message, persist it locally, acknowledge it. Simple and safe.&lt;/p&gt;

&lt;p&gt;Our adaptor doesn't have that luxury. It has &lt;strong&gt;no local database&lt;/strong&gt;. There's nowhere to buffer. So the only way to guarantee delivery is to make the Kafka acknowledgement contingent on the downstream system actually accepting the message.&lt;/p&gt;

&lt;p&gt;We call this a &lt;strong&gt;Synchronous Hand-off&lt;/strong&gt;: the consumer only ACKs the Kafka message once it receives a positive functional response from the downstream HTTP endpoint. The message is only "gone" from Kafka's perspective once it has safely landed downstream. No silent drops, no optimistic acks.&lt;/p&gt;

&lt;p&gt;This one decision shapes the entire resilience strategy that follows.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. A Multi-Layered Resilience Strategy
&lt;/h2&gt;

&lt;p&gt;Running in production quickly teaches you to distinguish between a &lt;strong&gt;blip&lt;/strong&gt; (a few seconds of network jitter) and a &lt;strong&gt;blackout&lt;/strong&gt; (the downstream service is down for 30 minutes). A flat retry policy treats both the same way — and that's a problem.&lt;/p&gt;

&lt;p&gt;So we built a tiered recovery model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1 — Synchronous HTTP Retries
&lt;/h3&gt;

&lt;p&gt;The first line of defence is a quick, synchronous retry loop inside our &lt;strong&gt;HttpClientService&lt;/strong&gt;. For a transient 500 or a connection timeout, we don't want to immediately escalate to an async retry topic — that's expensive and slow. A few fast retries handle the vast majority of brief network hiccups before anything more serious kicks in.&lt;/p&gt;
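&lt;p&gt;As a minimal sketch of that first layer (the &lt;code&gt;sendOnce&lt;/code&gt; supplier and the attempt/delay parameters are illustrative, not the actual &lt;strong&gt;HttpClientService&lt;/strong&gt; API):&lt;/p&gt;

```java
import java.util.function.BooleanSupplier;

public class SyncRetry {

    // Try the call up to maxAttempts times, sleeping delayMs between failures.
    // Returns true as soon as one attempt succeeds, false once all are exhausted;
    // the caller escalates to the async retry topic on false.
    public static boolean sendWithRetry(BooleanSupplier sendOnce, int maxAttempts, long delayMs) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (sendOnce.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMs); // brief pause before the next quick retry
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```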

&lt;h3&gt;
  
  
  Layer 2 — Asynchronous Non-Blocking Retries (&lt;code&gt;@RetryableTopic&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;If the synchronous retries are exhausted, we hand the problem to Spring Kafka's &lt;code&gt;@RetryableTopic&lt;/code&gt;. This is really the heart of the whole design.&lt;/p&gt;

&lt;p&gt;The key insight here is &lt;strong&gt;Head-of-Line blocking avoidance.&lt;/strong&gt; Without retry topics, a single failed message can pin a partition — nothing behind it gets processed until it either succeeds or is manually skipped. With &lt;code&gt;@RetryableTopic&lt;/code&gt;, the failed message is published to a separate retry topic and the main consumer moves on immediately. System throughput stays stable even while the downstream is recovering.&lt;/p&gt;

&lt;p&gt;There's a tradeoff worth being upfront about: &lt;strong&gt;moving a message to a retry topic breaks ordering guarantees for that partition&lt;/strong&gt;. Subsequent messages from the same partition will be processed before the retried one. For a lot of systems that's a deal breaker — but for ours it isn't, because each event is an independent financial instruction (an account update, a customer update, etc.) identified by its own transaction ID. The downstream system processes them idempotently by transaction ID, not by arrival sequence. If your domain requires strict ordering — say, a series of balance mutations on the same account that must apply in order — &lt;code&gt;@RetryableTopic&lt;/code&gt; in this form is not the right tool without additional sequencing logic on top.&lt;/p&gt;

&lt;p&gt;Here's what our actual consumer looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@RetryableTopic&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;include&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;DownStreamDownException&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;},&lt;/span&gt;
  &lt;span class="n"&gt;attempts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"${kafka.retry.topic.maxAttempts:1}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;backoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;@Backoff&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delayExpression&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"${kafka.retry.backoff.delay:3600000}"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;autoCreateTopics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"${kafka.autoCreateTopics:false}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;dltTopicSuffix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;DLQ_SUFFIX&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;retryTopicSuffix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;RETRY_SUFFIX&lt;/span&gt;
&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@KafkaListener&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"${spring.kafka.topic.updates.account}"&lt;/span&gt;&lt;span class="o"&gt;},&lt;/span&gt;
  &lt;span class="n"&gt;groupId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"${spring.kafka.consumer.group-id}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;ACCOUNT_UPDATES_CONSUMER_ID&lt;/span&gt;
&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;consumeMessage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
  &lt;span class="nc"&gt;ConsumerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;consumerRecord&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nd"&gt;@Header&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaHeaders&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;RECEIVED_PARTITION&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nc"&gt;Integer&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nd"&gt;@Header&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaHeaders&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;OFFSET&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nd"&gt;@Header&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaHeaders&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;RECEIVED_TOPIC&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;receivedTopic&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;consume&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;consumerRecord&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;receivedTopic&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things worth calling out here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;include = {DownStreamDownException.class}&lt;/code&gt;&lt;/strong&gt; — &lt;em&gt;only&lt;/em&gt; retry on this specific exception. Validation errors and bad payloads should never retry; they should fail fast to DLT.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;autoCreateTopics = false&lt;/code&gt;&lt;/strong&gt; — this matters a lot in production (more on this below).&lt;/li&gt;
&lt;li&gt;The backoff delay is a flat &lt;strong&gt;1 hour&lt;/strong&gt; by default (&lt;code&gt;3600000&lt;/code&gt; ms, configurable via properties). This was a deliberate choice, not a default we left in place. In a financial context, the downstream outage pattern we've seen is almost never "back in 30 seconds" — it's either a brief HTTP blip (caught by Layer 1 sync retries) or a proper incident taking 20–60 minutes to resolve. A short exponential backoff would just fill the retry topic with noise during that window. One hour means the retry fires once, cleanly, after the service is likely recovered.&lt;/li&gt;
&lt;/ul&gt;
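&lt;p&gt;The knobs above are plain Spring properties. A representative &lt;code&gt;application.properties&lt;/code&gt; fragment (values are illustrative; note that &lt;code&gt;attempts&lt;/code&gt; counts the first delivery, so the annotation's fallback of &lt;code&gt;1&lt;/code&gt; never uses the retry topic at all):&lt;/p&gt;

```properties
# Total delivery attempts, *including* the first one - values >= 2 are
# needed before the retry topic is actually used.
kafka.retry.topic.maxAttempts=2

# Flat delay before the retry-topic redelivery (1 hour).
kafka.retry.backoff.delay=3600000

# Never auto-create retry/DLT topics in locked-down clusters.
kafka.autoCreateTopics=false
```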

&lt;p&gt;And when the HTTP call &lt;strong&gt;fails inside &lt;code&gt;processMessage&lt;/code&gt;&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Override&lt;/span&gt;
&lt;span class="kd"&gt;protected&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;processMessage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;messageWithMetadata&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nc"&gt;Integer&lt;/span&gt; &lt;span class="n"&gt;originalPartition&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="n"&gt;originalOffset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;originalTopic&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

  &lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpClientService&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sendWithRetry&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messageWithMetadata&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;originalPartition&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;originalOffset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;originalTopic&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;orElse&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;debug&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Request failed for Payload: {}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messageWithMetadata&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;DownStreamDownException&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"DownStream service unavailable for topic: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;originalTopic&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;debug&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Request Payload: {}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messageWithMetadata&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;trackDeliverySuccess&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;originalTopic&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Throwing &lt;code&gt;DownStreamDownException&lt;/code&gt; is the trigger. Spring Kafka catches it, sees it matches the &lt;code&gt;include&lt;/code&gt; list, and routes the message to the retry topic. Clean and intentional.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3 — Dead Letter Topic (DLT)
&lt;/h3&gt;

&lt;p&gt;If all retry attempts are exhausted, the message lands in the DLT (Spring Kafka calls it a Dead Letter &lt;strong&gt;Topic&lt;/strong&gt;, not Queue — though the concept is the same).&lt;/p&gt;

&lt;p&gt;Messages land here in two scenarios:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;All retry attempts exhausted.&lt;/li&gt;
&lt;li&gt;Exceptions &lt;strong&gt;not&lt;/strong&gt; in the &lt;code&gt;include&lt;/code&gt; list — e.g., a malformed payload that would fail on every attempt anyway.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@DltHandler&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;handleDlt&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
  &lt;span class="nc"&gt;ConsumerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;consumerRecord&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nd"&gt;@Header&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaHeaders&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;RECEIVED_PARTITION&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nc"&gt;Integer&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nd"&gt;@Header&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaHeaders&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;OFFSET&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;trackDlq&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;handleDlq&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;consumerRecord&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DLT messages are manually monitored. In practice, a message in the DLT is either a bad payload (fix the producer) or evidence of a prolonged downstream outage (manual replay needed after recovery).&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Topic Naming &amp;amp; Consumer Group Gotcha (We Learned This the Hard Way)
&lt;/h2&gt;

&lt;p&gt;Spring Kafka expects specific naming conventions for retry and DLT topics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retry topic → &lt;code&gt;&amp;lt;main-topic&amp;gt;-retry&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;DLT topic → &lt;code&gt;&amp;lt;main-topic&amp;gt;-dlt&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also &lt;strong&gt;derives consumer group IDs&lt;/strong&gt; for the retry and DLT consumers by appending suffixes to your main group ID.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3oagarybhvm3mbhjk79w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3oagarybhvm3mbhjk79w.png" alt=" " width="432" height="175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In UAT, Kafka had &lt;code&gt;auto.create.topics.enable=true&lt;/code&gt; and group auto-creation on. Everything just worked. We moved to production — where auto-create is &lt;strong&gt;disabled&lt;/strong&gt; for good reason — and nothing worked. The retry flow silently broke because the consumer groups didn't exist and we didn't have permission to create them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; explicitly create the retry and DLT topics and all three consumer groups before deploying, and make sure your service account has the right ACLs on each. This belongs on your deployment checklist, not in a production incident.&lt;/p&gt;
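&lt;p&gt;As a sketch of that checklist step (topic names, group names, and the principal are illustrative; the flags are the standard Kafka CLI ones, and your cluster may additionally need topic-level ACLs):&lt;/p&gt;

```shell
BROKER=broker-1:9092

# Create the retry and DLT topics up front - auto-create is off in production
kafka-topics.sh --bootstrap-server "$BROKER" --create --topic account-updates-retry \
  --partitions 3 --replication-factor 3
kafka-topics.sh --bootstrap-server "$BROKER" --create --topic account-updates-dlt \
  --partitions 3 --replication-factor 3

# Grant the adaptor's service account Read on each of the three consumer groups
for group in adaptor-grp adaptor-grp-retry adaptor-grp-dlt; do
  kafka-acls.sh --bootstrap-server "$BROKER" --add \
    --allow-principal User:adaptor-svc --operation Read --group "$group"
done
```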




&lt;h2&gt;
  
  
  4. The Silent 20x Header Bloat
&lt;/h2&gt;

&lt;p&gt;This one surprised us. By default, Spring Kafka attaches the &lt;strong&gt;full exception stack trace&lt;/strong&gt; to the headers of every message it routes to a retry topic. In a high-throughput system, a 1KB message can balloon to 20KB just from headers.&lt;/p&gt;

&lt;p&gt;Multiply that across millions of retries and you have a serious storage and network overhead problem that never showed up in unit tests.&lt;/p&gt;

&lt;p&gt;We solved it with a &lt;strong&gt;KafkaProducerInterceptor&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Override&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;onSend&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;producerRecord&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nc"&gt;Headers&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;producerRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;producerRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;endsWith&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"-retry"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;endsWith&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"-dlt"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Strip the full stack trace - keep essential error metadata, drop the wall of text&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;headerKey&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;HEADERS_TO_REMOVE_RETRY&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;remove&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headerKey&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;
    &lt;span class="n"&gt;producerRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;producerRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;partition&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;producerRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;producerRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;
  &lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;HEADERS_TO_REMOVE_RETRY&lt;/code&gt; includes &lt;code&gt;kafka_exception-stacktrace&lt;/code&gt; and a few others that add size without adding operational value. The result: roughly &lt;strong&gt;90% reduction&lt;/strong&gt; in retry topic storage overhead, while still keeping the exception message and class for debugging.&lt;/p&gt;

&lt;p&gt;Register it in your producer config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;props&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ProducerConfig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;INTERCEPTOR_CLASSES_CONFIG&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;HeaderFilteringProducerInterceptor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getName&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  5. Preserving Original Metadata Across Retries
&lt;/h2&gt;

&lt;p&gt;In financial systems, idempotency is non-negotiable. The downstream service needs to detect duplicates — and it does that using the original Kafka offset and partition as a correlation key.&lt;/p&gt;

&lt;p&gt;The problem: when a message moves to a retry topic, Kafka assigns it a &lt;strong&gt;new offset and partition&lt;/strong&gt;. If you blindly forward that metadata, your duplicate detection breaks.&lt;/p&gt;

&lt;p&gt;Spring Kafka saves the day here by writing the original coordinates into headers (&lt;code&gt;kafka_original-offset&lt;/code&gt;, &lt;code&gt;kafka_original-topic&lt;/code&gt;, &lt;code&gt;kafka_original-partition&lt;/code&gt;) before routing to the retry topic.&lt;/p&gt;

&lt;p&gt;Our &lt;strong&gt;KafkaMessageUtil&lt;/strong&gt; reads these:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="nf"&gt;resolveOffset&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ConsumerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;?,&lt;/span&gt; &lt;span class="o"&gt;?&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"-retry"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Use the original offset from headers, not the retry topic's offset&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;extractLongHeader&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"kafka_original-offset"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
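&lt;p&gt;The &lt;code&gt;extractLongHeader&lt;/code&gt; helper isn't shown above. The only non-obvious part is that Spring Kafka writes these numeric headers as raw big-endian bytes, so a sketch of the decoding (class and method names here are illustrative):&lt;/p&gt;

```java
import java.nio.ByteBuffer;

public class HeaderCodec {

    // Spring Kafka stores kafka_original-offset as an 8-byte big-endian long
    // in the record header; decode it back, returning null if absent/malformed.
    public static Long decodeLongHeader(byte[] value) {
        if (value == null || value.length != Long.BYTES) {
            return null;
        }
        return ByteBuffer.wrap(value).getLong();
    }
}
```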



&lt;p&gt;The downstream service always gets the coordinates of the &lt;strong&gt;first delivery attempt&lt;/strong&gt;, regardless of how many retries have happened. This makes duplicate detection work cleanly across the entire message life cycle.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. The Heartbeat &amp;amp; Operational Safety
&lt;/h2&gt;

&lt;p&gt;There's a well-known failure mode called the &lt;strong&gt;Thundering Herd&lt;/strong&gt;: your downstream comes back online after an outage, and every backed-up consumer immediately floods it with requests, crashing it again.&lt;/p&gt;

&lt;p&gt;We prevent this with a &lt;strong&gt;HeartbeatScheduler&lt;/strong&gt; that pings the downstream health endpoint every 300 seconds. If the health check fails, we call &lt;code&gt;pause()&lt;/code&gt; on all active consumer containers. Messages accumulate safely in the broker. When the health check recovers, we call &lt;code&gt;resume()&lt;/code&gt;.&lt;/p&gt;
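&lt;p&gt;The pause/resume decision itself is a small state machine: act only on health &lt;em&gt;transitions&lt;/em&gt;, so containers aren't re-paused on every failing poll. A sketch of that logic, decoupled from Spring (the class and names are illustrative):&lt;/p&gt;

```java
public class HealthGate {

    public enum Action { NONE, PAUSE, RESUME }

    private boolean healthy = true;

    // Called with each scheduled health-check result; emits an action
    // only when the downstream's state actually changes.
    public Action onCheck(boolean checkPassed) {
        if (healthy && !checkPassed) {
            healthy = false;
            return Action.PAUSE;   // downstream just went down: pause all containers
        }
        if (!healthy && checkPassed) {
            healthy = true;
            return Action.RESUME;  // downstream recovered: resume consumption
        }
        return Action.NONE;        // steady state, nothing to do
    }
}
```

&lt;p&gt;In the real scheduler, &lt;code&gt;PAUSE&lt;/code&gt;/&lt;code&gt;RESUME&lt;/code&gt; map onto &lt;code&gt;pause()&lt;/code&gt;/&lt;code&gt;resume()&lt;/code&gt; on the containers obtained from Spring Kafka's &lt;code&gt;KafkaListenerEndpointRegistry&lt;/code&gt;.&lt;/p&gt;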

&lt;p&gt;Worth being honest about what this approach does &lt;strong&gt;not&lt;/strong&gt; cover. A heartbeat tells you if the service is alive — not if it's healthy. A downstream that is up but returning &lt;code&gt;500&lt;/code&gt; on 40% of real requests will pass the health check and still get flooded the moment consumers resume. We accepted this limitation because our downstream health endpoint is a genuine deep check — it validates DB connectivity and core dependencies, not just a surface-level HTTP 200. Combined with the 1-hour backoff window, the retry storm risk is manageable. But it is a known gap, and it is the main reason we are moving to circuit breakers.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;ConsumerMetricsScheduler&lt;/strong&gt; logs per-minute delivery success and DLT rates. This gives the SRE team a real-time view of system health without digging through logs.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Where We're Heading — Circuit Breakers
&lt;/h2&gt;

&lt;p&gt;We're evaluating &lt;strong&gt;Resilience4j&lt;/strong&gt; as a replacement for the Heartbeat approach. The key advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Monitors actual traffic&lt;/strong&gt; — rather than a synthetic health check.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Half-Open state&lt;/strong&gt; — instead of resuming all consumers at once after recovery, it lets a few "probe" requests through first. Only if those succeed does it open the floodgates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Removes the background polling thread&lt;/strong&gt; — moves toward a fully event-driven resilience model.&lt;/li&gt;
&lt;/ul&gt;
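&lt;p&gt;The half-open behaviour maps onto a handful of Resilience4j settings. A representative Spring Boot YAML fragment (instance name and values are illustrative, not a tuned config):&lt;/p&gt;

```yaml
resilience4j:
  circuitbreaker:
    instances:
      downstream:
        sliding-window-size: 50                # judge health from the last 50 real calls
        failure-rate-threshold: 50             # open the breaker at 50% failures
        wait-duration-in-open-state: 5m        # stay open this long before probing
        permitted-number-of-calls-in-half-open-state: 3  # the "probe" requests
```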




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Building a zero-loss Kafka adaptor is genuinely harder than it looks. The individual pieces — retries, dead letters, health checks — are all standard. The challenge is in the details: what happens to headers when you retry? What offset do you report to the downstream? What's your consumer group strategy in a locked-down production cluster?&lt;/p&gt;

&lt;p&gt;The combination that worked for us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A stateless consumer with synchronous HTTP hand-offs&lt;/li&gt;
&lt;li&gt;Spring Kafka &lt;code&gt;@RetryableTopic&lt;/code&gt; for async, non-blocking retry flows&lt;/li&gt;
&lt;li&gt;A producer interceptor to keep retry-topic storage sane&lt;/li&gt;
&lt;li&gt;Original metadata preservation for idempotency&lt;/li&gt;
&lt;li&gt;Heartbeat-driven pause/resume to prevent thundering herd&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's production-proven now. Hopefully this saves someone else the three days of debugging we did.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://medium.com/@yogeshdev024/building-a-zero-loss-kafka-adaptor-with-spring-kafka-retryable-topics-e63fcad10c41" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>springboot</category>
      <category>softwaredevelopment</category>
      <category>java</category>
    </item>
  </channel>
</rss>
