<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yogesh Kale</title>
    <description>The latest articles on DEV Community by Yogesh Kale (@yogesh_kale_9e617cb1c2561).</description>
    <link>https://dev.to/yogesh_kale_9e617cb1c2561</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1909150%2F3592ad9f-00ad-441a-812c-d90d26046b11.png</url>
      <title>DEV Community: Yogesh Kale</title>
      <link>https://dev.to/yogesh_kale_9e617cb1c2561</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yogesh_kale_9e617cb1c2561"/>
    <language>en</language>
    <item>
      <title>Guaranteed Message Ordering in Apache Kafka: What It Really Takes in Production (Spring Boot)</title>
      <dc:creator>Yogesh Kale</dc:creator>
      <pubDate>Wed, 22 Apr 2026 15:38:08 +0000</pubDate>
      <link>https://dev.to/yogesh_kale_9e617cb1c2561/guaranteed-message-ordering-in-apache-kafka-what-it-really-takes-in-production-spring-boot-2h3o</link>
      <guid>https://dev.to/yogesh_kale_9e617cb1c2561/guaranteed-message-ordering-in-apache-kafka-what-it-really-takes-in-production-spring-boot-2h3o</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A deep dive into the guarantees, tradeoffs, and production gotchas of implementing strictly ordered Kafka consumers — and the honest conversation about when not to.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Why
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://dev.to/yogesh_kale_9e617cb1c2561/building-a-zero-loss-kafka-consumer-with-spring-kafka-retryable-topics-2f36"&gt;previous post&lt;/a&gt;, I built a zero-loss Kafka consumer using &lt;code&gt;@RetryableTopic&lt;/code&gt;. But I deliberately called out one tradeoff that deserves its own conversation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Moving a message to a retry topic breaks ordering guarantees for that partition."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That sentence sent me down a rabbit hole. If retryable topics — one of Spring Kafka's most powerful resilience tools — inherently break ordering, what does it actually take to guarantee strict message order in a production Kafka system?&lt;/p&gt;

&lt;p&gt;This post is everything I found: the real guarantees Kafka provides, the configuration decisions that break ordering silently, and the honest tradeoffs you need to make before choosing this path.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Kafka Actually Guarantees (And What It Doesn't)
&lt;/h2&gt;

&lt;p&gt;Let's start with what Kafka promises, because most ordering bugs come from misreading this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kafka guarantees ordering within a partition. It makes no guarantees across partitions.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's it. Full stop. Not per topic. Not per consumer group. Per partition.&lt;/p&gt;

&lt;p&gt;Before going deeper, here is where each ordering guarantee actually lives in the pipeline — because each layer can independently break the guarantee above it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PRODUCER                    BROKER                      CONSUMER
─────────────────────────   ─────────────────────────   ─────────────────────────
Stable partition key        Single partition            concurrency=1 (or 1 per partition)
  → same key always           → messages stored         enable.auto.commit=false
    routes to same partition    in append order          AckMode.RECORD
enable.idempotence=true     Replication (acks=all)      No @Async hand-offs
max.in.flight=1               → no gaps from            CooperativeStickyAssignor
  → no broker reordering        leader failover          → rebalance-safe handoff
  from retried batches
                            ← Ordering guarantee lives across ALL three layers →
                            Breaking any one layer silently breaks end-to-end order
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is also the map for debugging ordering incidents in production: work left to right. A gap in your offset sequence? Check the producer's &lt;code&gt;max.in.flight&lt;/code&gt; setting. Out-of-order messages after a deployment? Check the rebalance assignor. Duplicates? Check idempotence and ack mode.&lt;/p&gt;

&lt;p&gt;This has a direct consequence: if messages that must be processed in sequence land on different partitions, you have no ordering guarantee regardless of how well you configure your consumer. The ordering problem starts at the producer, not the consumer.&lt;/p&gt;

&lt;p&gt;Consider an e-commerce order lifecycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Event 1: ORDER_PLACED     (orderId: 1001)
Event 2: PAYMENT_CAPTURED (orderId: 1001)
Event 3: INVENTORY_RESERVED (orderId: 1001)
Event 4: ORDER_SHIPPED    (orderId: 1001)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All four events must be processed in sequence. If &lt;code&gt;Event 3&lt;/code&gt; is processed before &lt;code&gt;Event 2&lt;/code&gt;, inventory is reserved against an unpaid order. If &lt;code&gt;Event 4&lt;/code&gt; arrives before &lt;code&gt;Event 3&lt;/code&gt;, the shipment goes out against unreserved stock.&lt;/p&gt;

&lt;p&gt;This is exactly the class of problem where Kafka ordering matters — and where getting it wrong causes real business damage.&lt;/p&gt;

&lt;p&gt;Or consider a financial context (as a reference, not a recommendation to use Kafka for core ledger operations): a multi-leg settlement where debit must precede credit. Kafka can work here, but only with explicit ordering discipline. We will come back to where Kafka ordering fits and where it does not.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Foundation: Partition Key Strategy
&lt;/h2&gt;

&lt;p&gt;Since ordering is partition-scoped, the partition key is the most important architectural decision you will make for ordered consumers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The rule:&lt;/strong&gt; all messages that must be processed in sequence must share the same partition key. Kafka routes messages with the same key to the same partition using a consistent hash.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Producer: always key by the entity whose events must be ordered&lt;/span&gt;
&lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"order-events"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOrderId&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;   &lt;span class="c1"&gt;// ← This is the partition key&lt;/span&gt;
    &lt;span class="n"&gt;eventPayload&lt;/span&gt;
&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What goes wrong here in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Null keys&lt;/strong&gt; — without a key, the producer spreads messages across partitions (round-robin or sticky batching, depending on the client version). Order is immediately lost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Random UUIDs as keys&lt;/strong&gt; — same problem. Each message lands on a different partition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using a timestamp or producer instance ID as the key&lt;/strong&gt; — a timestamp is effectively unique per message, so related events scatter across partitions; a producer instance ID funnels unrelated entities through one hot partition and still cannot order an entity's events once more than one producer emits them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Changing partition count after go-live&lt;/strong&gt; — Kafka's hash function maps keys to partitions based on partition count. Increasing partitions remaps keys to different partitions, breaking the ordering guarantee for in-flight and future messages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Production rule:&lt;/strong&gt; treat your partition count as immutable once consumers are live. Over-provision partitions at topic creation time rather than resizing later.&lt;/p&gt;
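
&lt;p&gt;To see why resizing is destructive, here is a minimal sketch (an illustration, not production code) that mirrors the keyed path of the default partitioner using the Kafka client's &lt;code&gt;Utils&lt;/code&gt; helpers. The same key can land on a different partition once the partition count changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import org.apache.kafka.common.utils.Utils;

import java.nio.charset.StandardCharsets;

public class PartitionRemapDemo {

    // Mirrors the keyed path of the default partitioner:
    // murmur2 hash of the key bytes, modulo the partition count
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        String orderId = "1001";
        System.out.println(partitionFor(orderId, 6));   // partition while the topic has 6 partitions
        System.out.println(partitionFor(orderId, 12));  // often a different partition after resizing
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Messages already written for that key stay on the old partition while new ones land on the new one, so a consumer can see the tail of an entity's stream before its head.&lt;/p&gt;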




&lt;h2&gt;
  
  
  Consumer Concurrency: The Silent Ordering Killer
&lt;/h2&gt;

&lt;p&gt;This is where the majority of ordering bugs I have seen are introduced — not at the architecture level, but in a configuration property that is easy to overlook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Bean&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;ConcurrentKafkaListenerContainerFactory&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;kafkaListenerContainerFactory&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;ConcurrentKafkaListenerContainerFactory&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;factory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ConcurrentKafkaListenerContainerFactory&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;
    &lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setConsumerFactory&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;consumerFactory&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setConcurrency&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// ← This is the problem&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting &lt;code&gt;concurrency &amp;gt; 1&lt;/code&gt; means Spring Kafka spins up multiple consumer threads, and under normal operation each thread is assigned its own partition — a single partition is never served by two threads simultaneously. The real ordering risk is subtler: during a rebalance, a partition can be moved from one thread to another. The new thread can start consuming from that partition before the previous thread has finished processing its current batch. That window — old thread still running, new thread already polling — is where out-of-order processing occurs. It is easy to overlook precisely because it only surfaces under rebalance conditions, not in steady-state load testing.&lt;/p&gt;

&lt;p&gt;For strict ordering, the rule is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setConcurrency&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// One thread per listener container&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or, if you want to scale with partition count:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Set concurrency = number of partitions&lt;/span&gt;
&lt;span class="c1"&gt;// Each thread owns exactly one partition&lt;/span&gt;
&lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setConcurrency&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;partitionCount&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second option is correct but requires discipline: if you set &lt;code&gt;concurrency=6&lt;/code&gt; against a topic with 6 partitions, each thread owns one partition and ordering within each partition is preserved. If concurrency exceeds the partition count, the extra consumers simply sit idle and waste resources; if it is lower, some threads own several partitions. Per-partition order still holds in that case, but a slow or retrying message on one partition delays every other partition sharing that thread.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Equally important:&lt;/strong&gt; never use &lt;code&gt;@Async&lt;/code&gt; inside a &lt;code&gt;@KafkaListener&lt;/code&gt; method. The moment you hand processing off to another thread pool, the calling thread moves to the next message while the previous one is still being processed. Ordering is gone.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// This silently breaks ordering&lt;/span&gt;
&lt;span class="nd"&gt;@KafkaListener&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"order-events"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;consume&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ConsumerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;asyncService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;processAsync&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt; &lt;span class="c1"&gt;// returns immediately&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Processing must complete before the next message is touched&lt;/span&gt;
&lt;span class="nd"&gt;@KafkaListener&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"order-events"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;consume&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ConsumerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;orderProcessingService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;process&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt; &lt;span class="c1"&gt;// blocking, synchronous&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Offset Commit Strategy: Where Zero-Loss and Ordering Intersect
&lt;/h2&gt;

&lt;p&gt;If you read my previous post on zero-loss consumers, you already know that &lt;code&gt;enable.auto.commit=true&lt;/code&gt; is dangerous for reliability. For ordered consumers, it is doubly dangerous.&lt;/p&gt;

&lt;p&gt;Auto-commit uses a time-based interval. It has no awareness of whether your processing logic has completed. In a high-throughput system, you can auto-commit offsets for messages your application is still processing — and if the application crashes, those messages are never reprocessed, silently breaking both your delivery guarantee and your ordering assumptions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The correct configuration:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// application.yml&lt;/span&gt;
&lt;span class="nl"&gt;spring:&lt;/span&gt;
  &lt;span class="nl"&gt;kafka:&lt;/span&gt;
    &lt;span class="nl"&gt;consumer:&lt;/span&gt;
      &lt;span class="n"&gt;enable&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;auto&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nl"&gt;commit:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="n"&gt;auto&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nl"&gt;reset:&lt;/span&gt; &lt;span class="n"&gt;earliest&lt;/span&gt;

&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nc"&gt;In&lt;/span&gt; &lt;span class="n"&gt;your&lt;/span&gt; &lt;span class="nc"&gt;ContainerFactory&lt;/span&gt;
&lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getContainerProperties&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setAckMode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ContainerProperties&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;AckMode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;RECORD&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;AckMode.RECORD&lt;/code&gt; commits the offset after each record is fully processed. For strict ordering this is the safest mode — you never advance past a message until it has been handled.&lt;/p&gt;

&lt;p&gt;If you need manual control (for example, batching acks across a transaction boundary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getContainerProperties&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setAckMode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ContainerProperties&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;AckMode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;MANUAL_IMMEDIATE&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;MANUAL_IMMEDIATE&lt;/code&gt;, you must call &lt;code&gt;acknowledgment.acknowledge()&lt;/code&gt; explicitly. The critical rule: &lt;strong&gt;acknowledge in order&lt;/strong&gt;. Each acknowledgment commits that record's offset immediately, so acknowledging a later record and then an earlier one moves the committed position backwards. After a restart, the consumer resumes from that earlier committed offset and re-processes messages you already handled.&lt;/p&gt;
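
&lt;p&gt;For reference, a minimal sketch of a &lt;code&gt;MANUAL_IMMEDIATE&lt;/code&gt; listener (reusing the &lt;code&gt;orderProcessingService&lt;/code&gt; from earlier; topic and group names are just examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;@KafkaListener(topics = "order-events", groupId = "order-processor")
public void consume(ConsumerRecord&amp;lt;String, String&amp;gt; record, Acknowledgment ack) {
    orderProcessingService.process(record.value()); // finish the work first
    ack.acknowledge(); // MANUAL_IMMEDIATE commits this record's offset right away
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;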




&lt;h2&gt;
  
  
  Error Handling: The Hardest Tradeoff in Ordered Systems
&lt;/h2&gt;

&lt;p&gt;This is where the real tension lives, and I want to be direct about it because most blog posts gloss over it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fundamental conflict:&lt;/strong&gt; in an ordered consumer, a failed message blocks the partition. Every message behind it waits. But the standard tool for Kafka resilience — &lt;code&gt;@RetryableTopic&lt;/code&gt; — resolves this by moving the failed message to a separate retry topic, which immediately breaks ordering for that partition.&lt;/p&gt;

&lt;p&gt;You have to choose. There is no configuration that gives you both non-blocking retries and guaranteed ordering.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 1: Blocking Retry (Preserves Order, Risks Throughput)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Bean&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;DefaultErrorHandler&lt;/span&gt; &lt;span class="nf"&gt;errorHandler&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Retry up to 3 times with exponential backoff&lt;/span&gt;
    &lt;span class="nc"&gt;ExponentialBackOffWithMaxRetries&lt;/span&gt; &lt;span class="n"&gt;backOff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ExponentialBackOffWithMaxRetries&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;backOff&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setInitialInterval&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1_000L&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;backOff&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setMultiplier&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;backOff&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setMaxInterval&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10_000L&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="nc"&gt;DefaultErrorHandler&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DefaultErrorHandler&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;DeadLetterPublishingRecoverer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafkaTemplate&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="n"&gt;backOff&lt;/span&gt;
    &lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Never retry these — they will always fail&lt;/span&gt;
    &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addNotRetryableExceptions&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;JsonParseException&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="nc"&gt;ValidationException&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;
    &lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;DefaultErrorHandler&lt;/code&gt; and no retry topic, retries happen in-place. The partition is blocked while retries are attempted. This is the correct choice when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your messages have business-level dependencies on each other (true ordering requirement)&lt;/li&gt;
&lt;li&gt;A transient failure retrying for a few seconds is acceptable&lt;/li&gt;
&lt;li&gt;Your downstream can recover quickly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When retries are exhausted, the message goes to the DLT via &lt;code&gt;DeadLetterPublishingRecoverer&lt;/code&gt;. At that point, you have a decision: stop the consumer (strict ordering, no skipping) or skip to the next message (lose strict ordering, maintain throughput).&lt;/p&gt;
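
&lt;p&gt;One way to express that decision in code is to wrap the recoverer: exhausted retries either go to the DLT (skip), or deliberately fail, which makes the error handler seek back to the same record so the partition stays blocked (halt). This is a sketch rather than the only approach; the &lt;code&gt;haltOnExhaustion&lt;/code&gt; flag is a placeholder for whatever signal fits your operations model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;@Bean
public DefaultErrorHandler orderingAwareErrorHandler(KafkaTemplate&amp;lt;String, String&amp;gt; kafkaTemplate) {
    ExponentialBackOffWithMaxRetries backOff = new ExponentialBackOffWithMaxRetries(3);
    backOff.setInitialInterval(1_000L);

    DeadLetterPublishingRecoverer dltRecoverer = new DeadLetterPublishingRecoverer(kafkaTemplate);
    boolean haltOnExhaustion = true; // placeholder: drive this from configuration

    return new DefaultErrorHandler((failedRecord, ex) -&amp;gt; {
        if (haltOnExhaustion) {
            // Throwing from the recoverer makes the handler seek back to this record,
            // so the partition stays blocked instead of the record being skipped
            throw new IllegalStateException("Halting partition " + failedRecord.partition()
                + " at offset " + failedRecord.offset(), ex);
        }
        dltRecoverer.accept(failedRecord, ex); // skip: publish to the DLT and move on
    }, backOff);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;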

&lt;h3&gt;
  
  
  Option 2: DLT With Consumer Pause (Best of Both Under Prolonged Failure)
&lt;/h3&gt;

&lt;p&gt;For scenarios where the downstream is fully down (not just a transient blip), blocking retries will pin your partition for a long time. A better pattern is to pause the consumer and let messages accumulate safely in the broker — they stay in their original partition order, untouched, ready to be processed in sequence once the dependency recovers.&lt;/p&gt;

&lt;p&gt;The trigger for a pause is your error handler detecting consecutive exhausted retries within a time window — not a single failure. A single transient failure should exhaust the backoff and either recover or go to DLT. It is only when failures are sustained (indicating downstream unavailability rather than a bad message) that pausing the entire consumer is warranted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Component&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OrderedConsumerController&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;KafkaListenerEndpointRegistry&lt;/span&gt; &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Called by your error handler after N consecutive DLT publishes within a window,&lt;/span&gt;
    &lt;span class="c1"&gt;// indicating the downstream service is unavailable — not just a bad message&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;pauseConsumer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;listenerId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;MessageListenerContainer&lt;/span&gt; &lt;span class="n"&gt;container&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getListenerContainer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;listenerId&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;container&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isRunning&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;pause&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;warn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Consumer [{}] paused due to sustained downstream failure"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;listenerId&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;resumeConsumer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;listenerId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;MessageListenerContainer&lt;/span&gt; &lt;span class="n"&gt;container&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getListenerContainer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;listenerId&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;container&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;resume&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Consumer [{}] resumed after downstream recovery"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;listenerId&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Recovery monitoring is what completes this pattern. In my &lt;a href="https://dev.to/yogesh_kale_9e617cb1c2561/building-a-zero-loss-kafka-consumer-with-spring-kafka-retryable-topics-2f36"&gt;previous post on zero-loss consumers&lt;/a&gt;, I used a &lt;code&gt;HeartbeatScheduler&lt;/code&gt; that periodically probed the downstream on a fixed interval and published a recovery event once the health check passed. The same approach applies here: a scheduled task that calls your downstream's health endpoint every 10–30 seconds, and calls &lt;code&gt;resumeConsumer()&lt;/code&gt; on success. The key difference in an ordered consumer is that you must not resume until you are confident the dependency is stable — a premature resume that fails immediately will cause another pause cycle and risk committing a partial offset.&lt;/p&gt;
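
&lt;p&gt;A minimal sketch of that probe, assuming a plain HTTP health endpoint and the &lt;code&gt;OrderedConsumerController&lt;/code&gt; shown above (the URL, listener id, and thresholds are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;@Component
public class DownstreamRecoveryProbe {

    private final OrderedConsumerController consumerController;
    private final RestTemplate restTemplate = new RestTemplate();
    private int consecutiveSuccesses = 0;

    public DownstreamRecoveryProbe(OrderedConsumerController consumerController) {
        this.consumerController = consumerController;
    }

    // Requires @EnableScheduling; in a real system, also gate this on whether the
    // consumer is actually paused
    @Scheduled(fixedDelay = 15_000) // probe every 15 seconds
    public void probeDownstream() {
        try {
            // getForEntity throws on connection failures and non-2xx responses
            restTemplate.getForEntity("http://inventory-service/actuator/health", Void.class);
            consecutiveSuccesses++;
        } catch (RestClientException ex) {
            consecutiveSuccesses = 0; // still down: stay paused, try again on the next tick
            return;
        }
        if (consecutiveSuccesses &amp;gt;= 2) { // require two healthy probes before resuming
            consumerController.resumeConsumer("ordered-order-listener");
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;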




&lt;h2&gt;
  
  
  Producer-Side Guarantees: The Part Most Ordering Guides Skip
&lt;/h2&gt;

&lt;p&gt;You can configure your consumer perfectly and still lose ordering at the broker level if your producer is misconfigured. Three settings matter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spring&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;kafka&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;producer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;acks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;all&lt;/span&gt;                    &lt;span class="c1"&gt;# Wait for all in-sync replicas to acknowledge&lt;/span&gt;
      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
      &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;enable.idempotence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;max.in.flight.requests.per.connection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;acks=all&lt;/code&gt;&lt;/strong&gt; — without this, a message can be acknowledged by the leader but not yet replicated. If the leader fails before replication, the message is lost. Your consumer never sees it, and the sequence has a permanent gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;enable.idempotence=true&lt;/code&gt;&lt;/strong&gt; — producer retries can cause duplicate messages at the broker if the original request succeeded but the acknowledgment was lost in transit. With idempotence enabled, Kafka deduplicates using a producer ID and sequence number. This is a prerequisite for correct ordered delivery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;max.in.flight.requests.per.connection=1&lt;/code&gt;&lt;/strong&gt; — this one is subtle. When a producer has multiple in-flight requests, a retry of request N can overtake request N+1 if N+1 was acknowledged first. The result: messages arrive at the broker out of order. Setting this to 1 ensures only one batch is in-flight at a time. Note: with &lt;code&gt;enable.idempotence=true&lt;/code&gt;, Kafka allows up to 5 in-flight requests without reordering risk, but setting it to 1 is the conservative and safest choice for strictly ordered workloads.&lt;/p&gt;
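
&lt;p&gt;If you configure Kafka in Java rather than YAML, the same settings map onto a &lt;code&gt;ProducerFactory&lt;/code&gt; bean along these lines (a sketch; the bootstrap address and serializers are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;@Bean
public ProducerFactory&amp;lt;String, String&amp;gt; orderedProducerFactory() {
    Map&amp;lt;String, Object&amp;gt; props = new HashMap&amp;lt;&amp;gt;();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    props.put(ProducerConfig.ACKS_CONFIG, "all");                       // wait for all in-sync replicas
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);          // broker-side deduplication
    props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1); // no reordering from retries
    props.put(ProducerConfig.RETRIES_CONFIG, 3);
    return new DefaultKafkaProducerFactory&amp;lt;&amp;gt;(props);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;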




&lt;h2&gt;
  
  
  Rebalance Handling: The Ordering Disruption You Cannot Ignore
&lt;/h2&gt;

&lt;p&gt;A consumer group rebalance — triggered by a consumer joining, leaving, or timing out — temporarily pauses all consumers in the group while partition ownership is reassigned. Any in-flight processing during a rebalance can complete out of order relative to the messages that follow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the Cooperative Sticky Assignor:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spring&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;kafka&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;consumer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;partition.assignment.strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;org.apache.kafka.clients.consumer.CooperativeStickyAssignor&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The default &lt;code&gt;RangeAssignor&lt;/code&gt; is a stop-the-world rebalance: all consumers stop, all partitions are unassigned, then reassigned. &lt;code&gt;CooperativeStickyAssignor&lt;/code&gt; only reassigns partitions that actually need to move. Consumers that keep their partition assignments continue processing uninterrupted. For ordered consumers on high-partition topics, this is a significant improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Static Group Membership to reduce rebalance frequency:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spring&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;kafka&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;consumer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;group.instance.id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;order-consumer-instance-1&lt;/span&gt;  &lt;span class="c1"&gt;# Unique per instance&lt;/span&gt;
        &lt;span class="na"&gt;session.timeout.ms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without static membership, every application restart triggers a rebalance. With &lt;code&gt;group.instance.id&lt;/code&gt;, Kafka recognises the consumer as the same instance rejoining and avoids a full rebalance if it comes back within the session timeout. For Kubernetes deployments where pods restart frequently, this is a material difference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implement &lt;code&gt;ConsumerRebalanceListener&lt;/code&gt; for clean handoffs:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Component&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OrderedRebalanceListener&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;ConsumerRebalanceListener&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;TopicPartition&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;currentAssignment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HashSet&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;onPartitionsRevoked&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Collection&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;TopicPartition&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;partitions&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Complete or checkpoint in-flight work before partitions are handed off&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Partitions revoked: {}. Completing in-flight processing."&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;partitions&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="c1"&gt;// Flush any buffered state, commit offsets manually if using MANUAL ack mode&lt;/span&gt;
        &lt;span class="n"&gt;currentAssignment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;removeAll&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;partitions&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;onPartitionsAssigned&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Collection&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;TopicPartition&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;partitions&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Partitions assigned: {}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;partitions&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;currentAssignment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addAll&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;partitions&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Idempotent Processing: Non-Negotiable
&lt;/h2&gt;

&lt;p&gt;Rebalances cause redelivery. Network failures cause redelivery. Consumer restarts cause redelivery. Your consumer will process the same message more than once — the question is whether your business logic handles it correctly.&lt;/p&gt;

&lt;p&gt;Every handler in an ordered consumer must be idempotent. Processing the same message twice must produce the same result as processing it once.&lt;/p&gt;

&lt;p&gt;A simple deduplication pattern using the original offset and partition as a correlation key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Service&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;IdempotentOrderProcessor&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;OrderRepository&lt;/span&gt; &lt;span class="n"&gt;orderRepository&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;ProcessedMessageRepository&lt;/span&gt; &lt;span class="n"&gt;processedMessages&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ConsumerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;deduplicationKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"-"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;partition&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"-"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processedMessages&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;existsByKey&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deduplicationKey&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Duplicate message detected, skipping. Key: {}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;deduplicationKey&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Process the message&lt;/span&gt;
        &lt;span class="nc"&gt;OrderEvent&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;deserialize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
        &lt;span class="n"&gt;orderRepository&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;applyEvent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Mark as processed (ideally in the same transaction as the business operation)&lt;/span&gt;
        &lt;span class="n"&gt;processedMessages&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;save&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ProcessedMessage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deduplicationKey&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important note on this pattern:&lt;/strong&gt; if you are using a relational database, saving the processed message key and applying the business operation should be in the same database transaction. If the application crashes between the business operation and saving the key, you get a reprocessed message on restart. Transactional outbox or a unique constraint on the key column enforced by the database are both valid approaches.&lt;/p&gt;
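
&lt;p&gt;A sketch of that same-transaction variant, using Spring's &lt;code&gt;@Transactional&lt;/code&gt; so the business write and the dedup-key insert commit or roll back together (same repositories as above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;@Transactional
public void process(ConsumerRecord&amp;lt;String, String&amp;gt; record) {
    String deduplicationKey = record.topic() + "-" + record.partition() + "-" + record.offset();
    if (processedMessages.existsByKey(deduplicationKey)) {
        return; // already applied in a previously committed transaction
    }
    orderRepository.applyEvent(deserialize(record.value()));
    processedMessages.save(new ProcessedMessage(deduplicationKey));
    // A crash before commit rolls back both writes, so the redelivered record is
    // applied exactly once; a unique constraint on the key column guards against
    // two instances racing on the same record.
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;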




&lt;h2&gt;
  
  
  Sequence Numbers: Defense in Depth
&lt;/h2&gt;

&lt;p&gt;Kafka offsets tell you where a message sits within a partition. But they do not tell you the intended business sequence — which can differ if messages are produced out of order by the upstream system, or if a producer retry creates a gap.&lt;/p&gt;

&lt;p&gt;For critical ordered consumers, embedding a sequence number in the message payload gives you an additional layer of defense:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OrderEvent&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;orderId&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;eventType&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;sequenceNumber&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;      &lt;span class="c1"&gt;// Monotonically increasing per orderId&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;Instant&lt;/span&gt; &lt;span class="n"&gt;occurredAt&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In your consumer, validate before processing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OrderEvent&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;expectedSeq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orderSequenceTracker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getNext&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOrderId&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getSequenceNumber&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;expectedSeq&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;warn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Duplicate or old event detected. orderId={}, seq={}, expected={}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOrderId&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getSequenceNumber&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;expectedSeq&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Idempotency: already processed&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getSequenceNumber&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;expectedSeq&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Gap in sequence detected. orderId={}, seq={}, expected={}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOrderId&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getSequenceNumber&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;expectedSeq&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;SequenceGapException&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Missing event in sequence for order: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOrderId&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;applyEvent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;orderSequenceTracker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;advance&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOrderId&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A &lt;code&gt;SequenceGapException&lt;/code&gt; thrown here will trigger your error handler. Whether you block-retry, pause the consumer, or route to a DLT depends on your business tolerance — but at least you have detected the gap rather than processing out-of-order silently.&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability: You Cannot Debug What You Cannot See
&lt;/h2&gt;

&lt;p&gt;Ordered consumers fail in subtle ways. A rebalance redelivery processed twice. A gap in the offset sequence. A message processed 200ms before its predecessor completed. None of these show up as errors in your application logs without deliberate instrumentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Log partition and offset on every message:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@KafkaListener&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"order-events"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;groupId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"order-processor"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;consume&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;ConsumerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nd"&gt;@Header&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;KafkaHeaders&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;RECEIVED_PARTITION&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nd"&gt;@Header&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;KafkaHeaders&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;OFFSET&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Processing message. topic={}, partition={}, offset={}, key={}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;

    &lt;span class="n"&gt;orderProcessingService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;process&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Completed message. topic={}, partition={}, offset={}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expose consumer lag as a metric.&lt;/strong&gt; Consumer lag — the gap between the latest offset produced and the latest offset committed by your consumer — is the single most important health signal for an ordered consumer. A growing lag means messages are accumulating unprocessed, which eventually translates to ordering delays even if no failures have occurred.&lt;/p&gt;

&lt;p&gt;With Micrometer and Spring Boot Actuator on the classpath, the Kafka client's consumer metrics, including lag gauges such as &lt;code&gt;kafka.consumer.fetch.manager.records.lag.max&lt;/code&gt;, are registered automatically for consumers created through Spring. Ensure your Prometheus scrape config picks them up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;management&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;application&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;order-consumer&lt;/span&gt;
  &lt;span class="na"&gt;endpoints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;web&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;exposure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;include&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;health,metrics,prometheus&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alert on: lag growing beyond your SLA threshold, DLT publish rate above zero, offset gap between consecutive commits, and rebalance frequency.&lt;/p&gt;
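&lt;p&gt;As a rough starting point, the lag alert can be expressed as a Prometheus rule. The metric name below is what Micrometer's Kafka client binding typically exposes, but treat both the name and the threshold as placeholders to verify against your own &lt;code&gt;/actuator/prometheus&lt;/code&gt; output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;groups:
  - name: ordered-consumer-alerts
    rules:
      - alert: ConsumerLagAboveSla
        # Metric name and threshold are illustrative - confirm against /actuator/prometheus
        expr: max by (topic, partition) (kafka_consumer_fetch_manager_records_lag) &amp;gt; 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consumer lag above SLA on {{ $labels.topic }}/{{ $labels.partition }}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The same pattern extends to the other signals listed above, using whatever counters you expose for DLT publishes and rebalances.&lt;/p&gt;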




&lt;h2&gt;
  
  
  Where Kafka Ordering Fits — And Where It Does Not
&lt;/h2&gt;

&lt;p&gt;After all of the above, the honest answer: Kafka ordering is appropriate for &lt;strong&gt;event-driven workflows where sequence matters but the domain can tolerate the complexity overhead&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It fits well for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Order lifecycle events (placed → paid → shipped → delivered)&lt;/li&gt;
&lt;li&gt;User action streams that must be replayed in order&lt;/li&gt;
&lt;li&gt;Audit trails where chronological sequence is required&lt;/li&gt;
&lt;li&gt;CDC (Change Data Capture) streams that must apply database changes in order&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is the wrong tool for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Core financial ledger operations requiring ACID guarantees across multiple legs&lt;/li&gt;
&lt;li&gt;Workflows where any ordering failure must immediately halt the entire pipeline — a relational database with proper transaction management is simpler and safer&lt;/li&gt;
&lt;li&gt;Low-volume, low-latency scenarios where the operational complexity of Kafka outweighs its throughput benefits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your requirement is "these four operations must all succeed or all fail, in order, atomically" — that is a distributed transaction problem. Look at the Saga pattern with orchestration, or a transactional outbox, before reaching for Kafka ordering.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Full Picture: Configuration Reference
&lt;/h2&gt;

&lt;p&gt;Every property below has a reason. The two most commonly misconfigured ones in ordered consumers are &lt;code&gt;max.poll.interval.ms&lt;/code&gt; and &lt;code&gt;max.poll.records&lt;/code&gt; — and getting either wrong causes silent ordering failures rather than hard errors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spring&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;kafka&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;consumer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enable-auto-commit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="na"&gt;auto-offset-reset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;earliest&lt;/span&gt;
      &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;partition.assignment.strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;org.apache.kafka.clients.consumer.CooperativeStickyAssignor&lt;/span&gt;
        &lt;span class="na"&gt;group.instance.id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${HOSTNAME}-consumer&lt;/span&gt;   &lt;span class="c1"&gt;# Static membership — prevents rebalance on pod restart&lt;/span&gt;
        &lt;span class="na"&gt;session.timeout.ms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60000&lt;/span&gt;

        &lt;span class="c1"&gt;# Conservative ceiling: limits how many records a single poll() fetches.&lt;/span&gt;
        &lt;span class="c1"&gt;# If your processing logic takes 200ms per record, 50 records = 10s of work per poll.&lt;/span&gt;
        &lt;span class="c1"&gt;# This directly informs max.poll.interval.ms below — they must be sized together.&lt;/span&gt;
        &lt;span class="na"&gt;max.poll.records&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;

        &lt;span class="c1"&gt;# Must exceed your worst-case processing time for a full poll batch.&lt;/span&gt;
        &lt;span class="c1"&gt;# Formula: max.poll.interval.ms &amp;gt; max.poll.records × max_processing_time_per_record_ms&lt;/span&gt;
        &lt;span class="c1"&gt;# With 50 records at 200ms each = 10,000ms. 300,000ms gives a 30× safety margin.&lt;/span&gt;
        &lt;span class="c1"&gt;# If this deadline is breached, Kafka treats the consumer as dead, triggers a rebalance,&lt;/span&gt;
        &lt;span class="c1"&gt;# and the in-flight batch may be reprocessed by another thread — breaking ordering.&lt;/span&gt;
        &lt;span class="na"&gt;max.poll.interval.ms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300000&lt;/span&gt;
    &lt;span class="na"&gt;producer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;acks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;all&lt;/span&gt;
      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
      &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;enable.idempotence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;max.in.flight.requests.per.connection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Bean&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;ConcurrentKafkaListenerContainerFactory&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;kafkaListenerContainerFactory&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;ConcurrentKafkaListenerContainerFactory&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;factory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ConcurrentKafkaListenerContainerFactory&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;
    &lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setConsumerFactory&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;consumerFactory&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setConcurrency&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// Or = partition count with sticky assignment&lt;/span&gt;
    &lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getContainerProperties&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setAckMode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ContainerProperties&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;AckMode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;RECORD&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="nc"&gt;ExponentialBackOffWithMaxRetries&lt;/span&gt; &lt;span class="n"&gt;backOff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ExponentialBackOffWithMaxRetries&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;backOff&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setInitialInterval&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1_000L&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;backOff&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setMultiplier&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;backOff&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setMaxInterval&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10_000L&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCommonErrorHandler&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DefaultErrorHandler&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;DeadLetterPublishingRecoverer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafkaTemplate&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="n"&gt;backOff&lt;/span&gt;
    &lt;span class="o"&gt;));&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Guaranteed message ordering in Kafka is achievable, but it is not a feature you turn on. It is a property that emerges from a set of consistent decisions made across the producer, the broker, and the consumer — and broken by any one decision made without understanding its implications.&lt;/p&gt;

&lt;p&gt;The combination that holds in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stable, business-meaningful partition keys set at the producer&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;enable.idempotence=true&lt;/code&gt; and &lt;code&gt;max.in.flight.requests.per.connection=1&lt;/code&gt; on the producer&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;enable.auto.commit=false&lt;/code&gt; with &lt;code&gt;AckMode.RECORD&lt;/code&gt; on the consumer&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;concurrency=1&lt;/code&gt; (or one thread per partition) with no async hand-offs&lt;/li&gt;
&lt;li&gt;Blocking retry with &lt;code&gt;DefaultErrorHandler&lt;/code&gt; — not &lt;code&gt;@RetryableTopic&lt;/code&gt;, which trades ordering for throughput&lt;/li&gt;
&lt;li&gt;Cooperative sticky assignor and static group membership for rebalance stability&lt;/li&gt;
&lt;li&gt;Idempotent processing with deduplication keyed on topic + partition + offset&lt;/li&gt;
&lt;li&gt;Sequence numbers in the payload for gap detection&lt;/li&gt;
&lt;li&gt;Consumer lag and offset gap alerting as the primary health signals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I have not deployed this exact pattern to production — but I have run the zero-loss consumer from my previous post in production, and the foundation is the same: respect Kafka's actual guarantees, design for redelivery, and make your tradeoffs explicit rather than discovering them in an incident.&lt;/p&gt;

&lt;p&gt;If you have implemented ordered consumers in production and found gaps in this — or made different tradeoff decisions — I would genuinely like to hear about it in the comments.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is part of a series on production-grade Kafka with Spring Boot. The previous post covers &lt;a href="https://dev.to/yogesh_kale_9e617cb1c2561/building-a-zero-loss-kafka-consumer-with-spring-kafka-retryable-topics-2f36"&gt;zero-loss delivery with retryable topics&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;#kafka&lt;/code&gt; &lt;code&gt;#springboot&lt;/code&gt; &lt;code&gt;#java&lt;/code&gt; &lt;code&gt;#distributedsystems&lt;/code&gt; &lt;code&gt;#eventdrivenarchitecture&lt;/code&gt; &lt;code&gt;#microservices&lt;/code&gt;&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>distributedsystems</category>
      <category>java</category>
      <category>springboot</category>
    </item>
    <item>
      <title>Building a Zero-Loss Kafka Consumer with Spring Kafka Retryable Topics</title>
      <dc:creator>Yogesh Kale</dc:creator>
      <pubDate>Tue, 14 Apr 2026 15:36:57 +0000</pubDate>
      <link>https://dev.to/yogesh_kale_9e617cb1c2561/building-a-zero-loss-kafka-consumer-with-spring-kafka-retryable-topics-2f36</link>
      <guid>https://dev.to/yogesh_kale_9e617cb1c2561/building-a-zero-loss-kafka-consumer-with-spring-kafka-retryable-topics-2f36</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A production-grade approach to integrating Kafka with legacy systems using Spring Kafka Retryable Topics.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Why
&lt;/h2&gt;

&lt;p&gt;When we first discussed connecting a Kafka stream to our legacy downstream system, the obvious question was: &lt;strong&gt;&lt;em&gt;why not just consume directly in the downstream app?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The honest answer — our legacy system simply isn't built for it. Kafka's polling model doesn't fit its threading architecture. Connections are tightly controlled, and sneaking a Kafka client into it would increase system risk more than it would solve. But the deeper issue is a fundamental mismatch: our downstream system is &lt;strong&gt;synchronous and deterministic&lt;/strong&gt;. Kafka is &lt;strong&gt;asynchronous&lt;/strong&gt;. That tension can't just be wished away.&lt;/p&gt;

&lt;p&gt;So we built something in between — a &lt;strong&gt;stateless Kafka adaptor&lt;/strong&gt; that absorbs event streams, converts them to synchronous HTTP calls, and guarantees zero-loss delivery. This post is everything we learned building it, including the parts that bit us in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Java 21&lt;/li&gt;
&lt;li&gt;Spring Boot 3.x&lt;/li&gt;
&lt;li&gt;Spring Web (Spring MVC)&lt;/li&gt;
&lt;li&gt;Spring Kafka (with &lt;code&gt;@RetryableTopic&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;OkHttp 5.x&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. The Core Challenge: The Stateless Synchronous Hand-off
&lt;/h2&gt;

&lt;p&gt;Most Kafka consumer implementations follow a "consume-and-save" pattern — pull the message, persist it locally, acknowledge it. Simple and safe.&lt;/p&gt;

&lt;p&gt;Our adaptor doesn't have that luxury. It has &lt;strong&gt;no local database&lt;/strong&gt;. There's nowhere to buffer. So the only way to guarantee delivery is to make the Kafka acknowledgement contingent on the downstream system actually accepting the message.&lt;/p&gt;

&lt;p&gt;We call this a &lt;strong&gt;Synchronous Hand-off&lt;/strong&gt;: the consumer only ACKs the Kafka message once it receives a positive functional response from the downstream HTTP endpoint. The message is only "gone" from Kafka's perspective once it has safely landed downstream. No silent drops, no optimistic acks.&lt;/p&gt;

&lt;p&gt;This one decision shapes the entire resilience strategy that follows.&lt;/p&gt;
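&lt;p&gt;As a minimal conceptual sketch (the production consumer, with &lt;code&gt;@RetryableTopic&lt;/code&gt;, appears in Layer 2 below; &lt;code&gt;downstreamClient&lt;/code&gt; here is illustrative): the listener returns normally, and the offset gets committed, only when the downstream accepts the message. Otherwise it throws and the retry machinery takes over.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;@KafkaListener(topics = "${spring.kafka.topic.updates.account}")
public void onMessage(ConsumerRecord&amp;lt;String, String&amp;gt; record) {
    // Synchronous hand-off: the HTTP call happens inline, on the consumer thread
    boolean accepted = downstreamClient.deliver(record.value());

    if (!accepted) {
        // Throwing prevents the offset commit, so the message is never silently dropped
        throw new DownStreamDownException("Downstream rejected message at offset " + record.offset());
    }
    // Returning normally lets Spring Kafka commit the offset: the message has landed downstream
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;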




&lt;h2&gt;
  
  
  2. A Multi-Layered Resilience Strategy
&lt;/h2&gt;

&lt;p&gt;Running in production quickly teaches you to distinguish between a &lt;strong&gt;blip&lt;/strong&gt; (a few seconds of network jitter) and a &lt;strong&gt;blackout&lt;/strong&gt; (the downstream service is down for 30 minutes). A flat retry policy treats both the same way — and that's a problem.&lt;/p&gt;

&lt;p&gt;So we built a tiered recovery model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1 — Synchronous HTTP Retries
&lt;/h3&gt;

&lt;p&gt;The first line of defence is a quick, synchronous retry loop inside our &lt;strong&gt;HttpClientService&lt;/strong&gt;. For a transient 500 or a connection timeout, we don't want to immediately escalate to an async retry topic — that's expensive and slow. A few fast retries handle the vast majority of brief network hiccups before anything more serious kicks in.&lt;/p&gt;
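&lt;p&gt;The real service does this with OkHttp, but the shape of the loop matters more than the client. A simplified sketch using the JDK &lt;code&gt;HttpClient&lt;/code&gt;, with illustrative attempt counts and backoff:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;// Layer 1: a small, bounded, synchronous retry loop for transient failures.
// Anything still failing after this escalates to the @RetryableTopic flow (Layer 2).
boolean sendWithSyncRetry(HttpClient client, String endpoint, String payload) throws InterruptedException {
    int maxAttempts = 3;
    long backoffMillis = 500;

    for (int attempt = 1; attempt &amp;lt;= maxAttempts; attempt++) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

            HttpResponse&amp;lt;String&amp;gt; response = client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() &amp;gt;= 200 &amp;amp;&amp;amp; response.statusCode() &amp;lt; 300) {
                return true;   // downstream accepted the message
            }
        } catch (IOException e) {
            // Connection reset or timeout - treat as transient and retry
        }
        Thread.sleep(backoffMillis * attempt);   // simple linear backoff between quick attempts
    }
    return false;   // exhausted: the caller escalates to the async retry topic
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;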

&lt;h3&gt;
  
  
  Layer 2 — Asynchronous Non-Blocking Retries (&lt;code&gt;@RetryableTopic&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;If the synchronous retries are exhausted, we hand the problem to Spring Kafka's &lt;code&gt;@RetryableTopic&lt;/code&gt;. This is really the heart of the whole design.&lt;/p&gt;

&lt;p&gt;The key insight here is &lt;strong&gt;Head-of-Line blocking avoidance.&lt;/strong&gt; Without retry topics, a single failed message can pin a partition — nothing behind it gets processed until it either succeeds or is manually skipped. With &lt;code&gt;@RetryableTopic&lt;/code&gt;, the failed message is published to a separate retry topic and the main consumer moves on immediately. System throughput stays stable even while the downstream is recovering.&lt;/p&gt;

&lt;p&gt;There's a tradeoff worth being upfront about: &lt;strong&gt;moving a message to a retry topic breaks ordering guarantees for that partition&lt;/strong&gt;. Subsequent messages from the same partition will be processed before the retried one. For a lot of systems that's a deal breaker — but for ours it isn't, because each event is an independent financial instruction (an account update, a customer update, etc.) identified by its own transaction ID. The downstream system processes them idempotently by transaction ID, not by arrival sequence. If your domain requires strict ordering — say, a series of balance mutations on the same account that must apply in order — &lt;code&gt;@RetryableTopic&lt;/code&gt; in this form is not the right tool without additional sequencing logic on top.&lt;/p&gt;

&lt;p&gt;Here's what our actual consumer looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@RetryableTopic&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;include&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;DownStreamDownException&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;},&lt;/span&gt;
  &lt;span class="n"&gt;attempts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"${kafka.retry.topic.maxAttempts:1}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;backoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;@Backoff&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delayExpression&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"${kafka.retry.backoff.delay:3600000}"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;autoCreateTopics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"${kafka.autoCreateTopics:false}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;dltTopicSuffix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;DLQ_SUFFIX&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;retryTopicSuffix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;RETRY_SUFFIX&lt;/span&gt;
&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@KafkaListener&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"${spring.kafka.topic.updates.account}"&lt;/span&gt;&lt;span class="o"&gt;},&lt;/span&gt;
  &lt;span class="n"&gt;groupId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"${spring.kafka.consumer.group-id}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;ACCOUNT_UPDATES_CONSUMER_ID&lt;/span&gt;
&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;consumeMessage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
  &lt;span class="nc"&gt;ConsumerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;consumerRecord&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nd"&gt;@Header&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaHeaders&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;RECEIVED_PARTITION&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nc"&gt;Integer&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nd"&gt;@Header&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaHeaders&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;OFFSET&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nd"&gt;@Header&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaHeaders&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;RECEIVED_TOPIC&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;receivedTopic&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;consume&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;consumerRecord&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;receivedTopic&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things worth calling out here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;include = {DownStreamDownException.class}&lt;/code&gt;&lt;/strong&gt; — &lt;em&gt;only&lt;/em&gt; retry on this specific exception. Validation errors and bad payloads should never retry; they should fail fast to DLT.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;autoCreateTopics = false&lt;/code&gt;&lt;/strong&gt; — this matters a lot in production (more on this below).&lt;/li&gt;
&lt;li&gt;The backoff delay is a flat &lt;strong&gt;1 hour&lt;/strong&gt; by default (&lt;code&gt;3600000ms&lt;/code&gt;, configurable via the property file). This was a deliberate choice, not a default we left in place. In a financial context, the downstream outage pattern we've seen is almost never "back in 30 seconds" — it's either a brief HTTP blip (caught by Layer 1 sync retries) or a proper incident taking 20–60 minutes to resolve. A short exponential backoff would just fill the retry topic with noise during that window. One hour means the retry fires once, cleanly, after the service is likely recovered.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And when the HTTP call &lt;strong&gt;fails inside &lt;code&gt;processMessage&lt;/code&gt;&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Override&lt;/span&gt;
&lt;span class="kd"&gt;protected&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;processMessage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;messageWithMetadata&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nc"&gt;Integer&lt;/span&gt; &lt;span class="n"&gt;originalPartition&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="n"&gt;originalOffset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;originalTopic&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

  &lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpClientService&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sendWithRetry&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messageWithMetadata&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;originalPartition&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;originalOffset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;originalTopic&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;orElse&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;debug&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Request failed for Payload: {}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messageWithMetadata&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;DownStreamDownException&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"DownStream service unavailable for topic: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;originalTopic&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;debug&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Request Payload: {}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messageWithMetadata&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;trackDeliverySuccess&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;originalTopic&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Throwing &lt;code&gt;DownStreamDownException&lt;/code&gt; is the trigger. Spring Kafka catches it, sees it matches the &lt;code&gt;include&lt;/code&gt; list, and routes the message to the retry topic. Clean and intentional.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3 — Dead Letter Topic (DLT)
&lt;/h3&gt;

&lt;p&gt;If all retry attempts are exhausted, the message lands in the DLT (Spring Kafka calls it a Dead Letter &lt;strong&gt;Topic&lt;/strong&gt;, not Queue — though the concept is the same).&lt;/p&gt;

&lt;p&gt;Two scenarios send messages straight here without retrying:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;All retry attempts exhausted.&lt;/li&gt;
&lt;li&gt;Exceptions &lt;strong&gt;not&lt;/strong&gt; in the &lt;code&gt;include&lt;/code&gt; list — e.g., a malformed payload that would fail on every attempt anyway.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@DltHandler&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;handleDlt&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
  &lt;span class="nc"&gt;ConsumerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;consumerRecord&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nd"&gt;@Header&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaHeaders&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;RECEIVED_PARTITION&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nc"&gt;Integer&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nd"&gt;@Header&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaHeaders&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;OFFSET&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;trackDlq&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;handleDlq&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;consumerRecord&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DLT messages are manually monitored. In practice, a message in the DLT is either a bad payload (fix the producer) or evidence of a prolonged downstream outage (manual replay needed after recovery).&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Topic Naming &amp;amp; Consumer Group Gotcha (We Learned This the Hard Way)
&lt;/h2&gt;

&lt;p&gt;Spring Kafka expects specific naming conventions for retry and DLT topics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retry topic → &lt;code&gt;&amp;lt;main-topic&amp;gt;-retry&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;DLT topic → &lt;code&gt;&amp;lt;main-topic&amp;gt;-dlt&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also &lt;strong&gt;automatically creates consumer groups&lt;/strong&gt; for these by appending suffixes to your main group ID.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3oagarybhvm3mbhjk79w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3oagarybhvm3mbhjk79w.png" alt=" " width="432" height="175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In UAT, Kafka had &lt;code&gt;auto.create.topics.enable=true&lt;/code&gt; and group auto-creation on. Everything just worked. We moved to production — where auto-create is &lt;strong&gt;disabled&lt;/strong&gt; for good reason — and nothing worked. The retry flow silently broke because the consumer groups didn't exist and we didn't have permission to create them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; explicitly create all three consumer groups before deploying and make sure your service account has the right ACLs on each. This should be part of your deployment checklist, not a production incident.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. The Silent 20x Header Bloat
&lt;/h2&gt;

&lt;p&gt;This one surprised us. By default, Spring Kafka attaches the &lt;strong&gt;full exception stack trace&lt;/strong&gt; to the headers of every message it routes to a retry topic. In a high-throughput system, a 1KB message can balloon to 20KB just from headers.&lt;/p&gt;

&lt;p&gt;Multiply that across millions of retries and you have a serious storage and network overhead problem that never showed up in unit tests.&lt;/p&gt;

&lt;p&gt;We solved it with a &lt;strong&gt;KafkaProducerInterceptor&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Override&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;onSend&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;producerRecord&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nc"&gt;Headers&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;producerRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;producerRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;endsWith&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"-retry"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;endsWith&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"-dlt"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Strip the full stack trace - keep essential error metadata, drop the wall of text&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;headerKey&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;HEADERS_TO_REMOVE_RETRY&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;remove&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headerKey&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;
    &lt;span class="n"&gt;producerRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;producerRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;partition&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;producerRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;producerRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;
  &lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;HEADERS_TO_REMOVE_RETRY&lt;/code&gt; includes &lt;code&gt;kafka_exception-stacktrace&lt;/code&gt; and a few others that add size without adding operational value. The result: roughly &lt;strong&gt;90% reduction&lt;/strong&gt; in retry topic storage overhead, while still keeping the exception message and class for debugging.&lt;/p&gt;

&lt;p&gt;Register it in your producer config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;props&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ProducerConfig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;INTERCEPTOR_CLASSES_CONFIG&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;HeaderFilteringProducerInterceptor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getName&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  5. Preserving Original Metadata Across Retries
&lt;/h2&gt;

&lt;p&gt;In financial systems, idempotency is non-negotiable. The downstream service needs to detect duplicates — and it does that using the original Kafka offset and partition as a correlation key.&lt;/p&gt;

&lt;p&gt;The problem: when a message moves to a retry topic, Kafka assigns it a &lt;strong&gt;new offset and partition&lt;/strong&gt;. If you blindly forward that metadata, your duplicate detection breaks.&lt;/p&gt;

&lt;p&gt;Spring Kafka saves the day here by writing the original coordinates into headers (&lt;code&gt;kafka_original-offset&lt;/code&gt;, &lt;code&gt;kafka_original-topic&lt;/code&gt;, &lt;code&gt;kafka_original-partition&lt;/code&gt;) before routing to the retry topic.&lt;/p&gt;

&lt;p&gt;Our &lt;strong&gt;KafkaMessageUtil&lt;/strong&gt; reads these:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="nf"&gt;resolveOffset&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ConsumerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;?,&lt;/span&gt; &lt;span class="o"&gt;?&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"-retry"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Use the original offset from headers, not the retry topic's offset&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;extractLongHeader&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"kafka_original-offset"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The downstream service always gets the coordinates of the &lt;strong&gt;first delivery attempt&lt;/strong&gt;, regardless of how many retries have happened. This makes duplicate detection work cleanly across the entire message lifecycle.&lt;/p&gt;
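&lt;p&gt;For completeness, &lt;code&gt;extractLongHeader&lt;/code&gt; can be as small as the sketch below. It assumes the header value is the 8-byte long that Spring Kafka's &lt;code&gt;DeadLetterPublishingRecoverer&lt;/code&gt; writes by default; if you are on a different version or have customised the header functions, verify the byte layout first.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;private static Long extractLongHeader(ConsumerRecord&amp;lt;?, ?&amp;gt; record, String headerName) {
    Header header = record.headers().lastHeader(headerName);
    if (header == null || header.value() == null) {
        // Header missing (e.g. first delivery on the main topic) - fall back to the record's own offset
        return record.offset();
    }
    // Spring Kafka stores the original offset as an 8-byte big-endian long
    return ByteBuffer.wrap(header.value()).getLong();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;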




&lt;h2&gt;
  
  
  6. The Heartbeat &amp;amp; Operational Safety
&lt;/h2&gt;

&lt;p&gt;There's a well-known failure mode called the &lt;strong&gt;Thundering Herd&lt;/strong&gt;: your downstream comes back online after an outage, and every backed-up consumer immediately floods it with requests, crashing it again.&lt;/p&gt;

&lt;p&gt;We prevent this with a &lt;strong&gt;HeartbeatScheduler&lt;/strong&gt; that pings the downstream health endpoint every 300 seconds. If the health check fails, we call &lt;code&gt;pause()&lt;/code&gt; on all active consumer containers. Messages accumulate safely in the broker. When the health check recovers, we call &lt;code&gt;resume()&lt;/code&gt;.&lt;/p&gt;
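&lt;p&gt;In Spring Kafka terms, the pause/resume itself is a few lines against the &lt;code&gt;KafkaListenerEndpointRegistry&lt;/code&gt;. A trimmed-down sketch of the idea (&lt;code&gt;DownstreamHealthClient&lt;/code&gt; is illustrative, and our real scheduler also tracks state transitions and emits metrics):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;@Component
public class HeartbeatScheduler {

    private final KafkaListenerEndpointRegistry registry;
    private final DownstreamHealthClient healthClient;   // illustrative wrapper around the health endpoint

    public HeartbeatScheduler(KafkaListenerEndpointRegistry registry, DownstreamHealthClient healthClient) {
        this.registry = registry;
        this.healthClient = healthClient;
    }

    @Scheduled(fixedDelay = 300_000)   // every 300 seconds
    public void checkDownstream() {
        boolean healthy = healthClient.isHealthy();

        registry.getListenerContainers().forEach(container -&amp;gt; {
            if (!healthy &amp;amp;&amp;amp; !container.isPauseRequested()) {
                container.pause();    // messages accumulate safely in the broker
            } else if (healthy &amp;amp;&amp;amp; container.isPauseRequested()) {
                container.resume();   // downstream recovered - start draining again
            }
        });
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;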

&lt;p&gt;Worth being honest about what this approach does &lt;strong&gt;not&lt;/strong&gt; cover. A heartbeat tells you if the service is alive — not if it's healthy. A downstream that is up but returning &lt;code&gt;500&lt;/code&gt; on 40% of real requests will pass the health check and still get flooded the moment consumers resume. We accepted this limitation because our downstream health endpoint is a genuine deep check — it validates DB connectivity and core dependencies, not just a surface-level HTTP 200. Combined with the 1-hour backoff window, the retry storm risk is manageable. But it is a known gap, and it is the main reason we are moving to circuit breakers.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;ConsumerMetricsScheduler&lt;/strong&gt; logs per-minute delivery success and DLT rates. This gives the SRE team a real-time view of system health without digging through logs.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Where We're Heading — Circuit Breakers
&lt;/h2&gt;

&lt;p&gt;We're evaluating &lt;strong&gt;Resilience4j&lt;/strong&gt; as a replacement for the Heartbeat approach. The key advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Monitors actual traffic&lt;/strong&gt; — rather than a synthetic health check.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Half-Open state&lt;/strong&gt; — instead of resuming all consumers at once after recovery, it lets a few "probe" requests through first. Only if those succeed does it open the floodgates (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Removes the background polling thread&lt;/strong&gt; — moves toward a fully event-driven resilience model.&lt;/li&gt;
&lt;/ul&gt;
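
&lt;p&gt;To make the Half-Open behaviour concrete, here is a rough Resilience4j sketch. The configuration values are illustrative, and this is the direction we are evaluating rather than code we run in production today:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                          // open once half of recent calls fail
    .waitDurationInOpenState(Duration.ofMinutes(1))    // stay open before probing again
    .permittedNumberOfCallsInHalfOpenState(5)          // the "probe" requests after recovery
    .slidingWindowSize(20)
    .build();

CircuitBreaker downstreamBreaker = CircuitBreakerRegistry.of(config).circuitBreaker("downstream");

// Wrap the existing synchronous hand-off. While the breaker is OPEN, calls fail immediately
// with CallNotPermittedException instead of hammering the recovering downstream.
boolean accepted = downstreamBreaker.executeSupplier(
    () -&amp;gt; httpClientService.sendWithRetry(payload, endpoint, partition, offset, topic).orElse(false)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;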




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Building a zero-loss Kafka adaptor is genuinely harder than it looks. The individual pieces — retries, dead letters, health checks — are all standard. The challenge is in the details: what happens to headers when you retry? What offset do you report to the downstream? What's your consumer group strategy in a locked-down production cluster?&lt;/p&gt;

&lt;p&gt;The combination that worked for us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A stateless consumer with synchronous HTTP hand-offs&lt;/li&gt;
&lt;li&gt;Spring Kafka &lt;code&gt;@RetryableTopic&lt;/code&gt; for async, non-blocking retry flows&lt;/li&gt;
&lt;li&gt;A producer interceptor to keep retry-topic storage sane&lt;/li&gt;
&lt;li&gt;Original metadata preservation for idempotency&lt;/li&gt;
&lt;li&gt;Heartbeat-driven pause/resume to prevent thundering herd&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's production-proven now. Hopefully this saves someone else the three days of debugging we did.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://medium.com/@yogeshdev024/building-a-zero-loss-kafka-adaptor-with-spring-kafka-retryable-topics-e63fcad10c41" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>springboot</category>
      <category>softwaredevelopment</category>
      <category>java</category>
    </item>
  </channel>
</rss>
