
Alexandre Amado de Castro

Originally published at platformtoolsmith.com

Kafka Retries: Implementing Consumer Retry with Go

Cover image: three Kafka topics with messages inside them and a Go gopher running to retry.

You don’t “need retries in Kafka” until the day one of your handlers starts failing and you’re forced into a choice: block consumption (and watch lag climb) or keep consuming and retry somewhere else.

This post is about one very pragmatic approach: commit the Kafka offset even when processing fails, then push the failed message into a Go retry queue. Kafka keeps moving, and your application owns the retry policy.

Quick context (assuming you already speak Kafka): consumer groups split partitions across consumers for parallelism. Offsets are committed per partition. In this approach, an offset commit means “Kafka can move on,” not necessarily “side effects succeeded.”

If you’re looking for the other style of “consumer retry” (don’t advance the offset until the handler succeeds), that’s a different design with different tradeoffs. This post is explicitly about keeping consumption moving and absorbing failures in a retry pipeline.

Working with examples

Let's create an example. Imagine you work for a company named ACME, and you have a Kafka topic that receives a new message every time a new customer is created. This customer needs to receive an email saying that their account is fully created and that they need to verify their email.

When you read something like "when X happens, do Y," always think about events! And events, in Kafka, are messages!

We're going to create a microservice that will:

  • Receive this new customer message from Kafka.
  • Compose a nice email.
  • Send it to the customer.
  • Produce a new message to another topic saying that the email was sent!

Let's diagram that for more visibility:

Diagram to visualize the concept above.
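
To keep the rest of the post concrete, here is roughly the shape of that event and handler in Go. The type and field names are my own assumptions, not a fixed contract:

```go
package acme

// CustomerCreated is a hypothetical shape for the "new customer" event
// consumed from Kafka. Field names are illustrative, not a real contract.
type CustomerCreated struct {
	CustomerID string `json:"customer_id"`
	Name       string `json:"name"`
	Email      string `json:"email"`
}

// EmailSent is the follow-up event produced once the email has gone out.
type EmailSent struct {
	CustomerID string `json:"customer_id"`
	SentAtUnix int64  `json:"sent_at_unix"`
}

// WelcomeEmailer groups the four steps above so each one can fail
// (and be retried) independently of Kafka consumption.
type WelcomeEmailer interface {
	LoadTemplate(name string) (string, error)               // fetch the template
	Compose(tmpl string, c CustomerCreated) (string, error) // build the email body
	Send(to, body string) error                             // talk to SMTP
	PublishEmailSent(e EmailSent) error                      // produce the "sent" event
}
```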

Where are the retries?

Cool, but this is a blog post about retries, right? So where are the retries in all of this?

Well, any part of the processing can go wrong, and we don't want our customers to miss out on their account creation emails, do we? So, let's talk about a few things that could go wrong and how we can handle them with retries:

  • The email template database might be down. No problem, we'll just keep trying to connect until it's back up.
  • The template stored in that database might be invalid. We'll check for this and alert the team to update it if needed.
  • The SMTP server might be down or drop a message. We'll keep trying to send the email until it goes through.
  • The application might compose the SMTP message badly, causing the SMTP server to reject it. We'll catch this error and fix the composition before trying again.
  • Producing the message to the other Kafka topic might return an error on Kafka's server side. We'll keep trying until it's successfully sent.

Some of the errors mentioned may require code or data modifications to resolve, such as updating a template or correcting the way the application composes an SMTP message. These types of errors may not be suitable for retries as they may require manual intervention.

On the other hand, other errors such as temporary database downtime, flaky message production, or overloaded SMTP servers can potentially be resolved by retrying the operation.
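
One way to encode that distinction in code is to classify errors explicitly, so the retry loop only retries what can plausibly succeed later. A minimal sketch (the sentinel name is mine):

```go
package acme

import "errors"

// ErrNonRetryable marks failures that need a human (bad template,
// malformed SMTP composition): retrying them only burns attempts.
var ErrNonRetryable = errors.New("non-retryable")

// markNonRetryable wraps an error so callers can detect it with errors.Is.
func markNonRetryable(err error) error {
	return errors.Join(ErrNonRetryable, err)
}

// shouldRetry returns true for transient failures (DB down, SMTP flapping,
// broker hiccups) and false for anything explicitly marked non-retryable.
func shouldRetry(err error) bool {
	return err != nil && !errors.Is(err, ErrNonRetryable)
}
```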

Unlocking Kafka Consumer Retries

This comes up constantly in event-driven systems: Kafka gives you a great log, but it doesn’t give you a “retry policy” button.

I’ve tried a bunch of architectures for retrying message processing. None is perfect, but one is consistently the easiest to operate when your top priority is “keep consuming,” as long as you’re explicit about the semantics trade.

Before diving into the solution, let me share some background on my previous attempts.

AWS SQS: A Great Option, But With Limitations

As a strong advocate for AWS, my first thought when considering retries was AWS SQS.

Amazon Simple Queue Service (SQS) is a fully managed message queuing service that enables the decoupling and scaling of microservices, distributed systems, and serverless applications. It comes with built-in retry behavior (visibility timeout + redrive policies + DLQ) and supports a maximum message size of 256 KB.

Now, don’t get me wrong: the grown-up solution is the pointer pattern — put the real payload in object storage (like S3) and send a small pointer through SQS.

That pattern works, but it adds moving parts:

  • You’ve introduced a second persistence system.
  • You need retention/lifecycle for blobs.
  • You need consistency between “pointer message” and “payload object”.
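
For concreteness, the "pointer" message in that pattern is usually just a tiny envelope like the one below, and keeping it in sync with the stored object is exactly the extra contract you now own (field names are hypothetical):

```go
package acme

// PayloadPointer is what actually travels through SQS in the pointer
// pattern: the real payload lives in object storage, and consumers have
// to fetch it (and handle the case where it is missing or expired).
type PayloadPointer struct {
	Bucket      string `json:"bucket"`
	Key         string `json:"key"`
	ContentType string `json:"content_type"`
	SizeBytes   int64  `json:"size_bytes"`
}
```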

Also, quick correction on Kafka: Kafka message size is configurable (defaults vary, but ~1MB is common). You can tune limits, but you still need an explicit payload contract, or producers will eventually surprise you in production.

So while SQS is a great option, it wasn’t the best fit for what I wanted here: keep the entire pipeline “Kafka-native” without introducing a second storage path just to make retries easier.

Kafka

Yes, exactly: if SQS isn't the right fit here, why not use Kafka itself?

I tried creating an internal Kafka topic used only by the application: on failure, the application would re-enqueue the message to that topic with an incremented count attribute in the header. Let me diagram that to make it clearer:

Diagram to visualize the concept above.
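
With kafka-go, the re-enqueue step looks roughly like this. A sketch only: the retry-count header name is my own convention, and the writer is assumed to point at the internal retry topic:

```go
package acme

import (
	"context"
	"strconv"

	"github.com/segmentio/kafka-go"
)

// requeueWithCount pushes a failed message onto an application-owned retry
// topic, bumping a retry-count header so the consumer can give up eventually.
func requeueWithCount(ctx context.Context, w *kafka.Writer, msg kafka.Message) error {
	count := 0
	for _, h := range msg.Headers {
		if h.Key == "retry-count" {
			count, _ = strconv.Atoi(string(h.Value))
		}
	}
	return w.WriteMessages(ctx, kafka.Message{
		Key:   msg.Key,
		Value: msg.Value,
		Headers: []kafka.Header{
			{Key: "retry-count", Value: []byte(strconv.Itoa(count + 1))},
		},
	})
}
```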

While this approach does work, it has some significant downsides. Setting up Kafka topics is not as straightforward as SQS queues, creating multiple topics can be somewhat cumbersome, and wiring up all the extra consumers and producers in code can become quite messy.

Kafka also doesn’t give you a first-class “retry this message N times” feature at the broker level, and that’s mostly a good thing: Kafka can’t know whether your side effects are safe to repeat. Retries are an application-level decision tied to offsets, commits, backpressure, and idempotency.

Databases

Desperate Times Call for Desperate Measures, right?

As a last resort, I turned to databases for message retries. Modern SQL databases have row-locking mechanisms (such as SELECT ... FOR UPDATE SKIP LOCKED) that can be leveraged to use a table as a queue.

Diagram to visualize the concept above.
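
The trick that makes this workable is row locking. Here is a sketch of claiming work from a retry table, assuming Postgres and table/column names of my own choosing:

```go
package acme

import (
	"context"
	"database/sql"
)

// claimNextRetry grabs one pending retry row, skipping rows already locked
// by other workers. The caller processes it and then deletes or reschedules
// it inside the same transaction.
func claimNextRetry(ctx context.Context, tx *sql.Tx) (id int64, payload []byte, err error) {
	// FOR UPDATE SKIP LOCKED is supported by PostgreSQL and MySQL 8+.
	row := tx.QueryRowContext(ctx, `
		SELECT id, payload
		FROM retry_queue
		WHERE next_attempt_at <= now()
		ORDER BY next_attempt_at
		LIMIT 1
		FOR UPDATE SKIP LOCKED`)
	err = row.Scan(&id, &payload)
	return id, payload, err
}
```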

To my surprise, this approach worked! However, it's not a perfect solution. Databases are not designed to function as queues, and using them in this way can be a stretch. There are a few downsides to this approach:

  • Databases are built for consistency through transactions. They can be made reasonably fast, but they will never perform as well as a specialized tool like Kafka, and this can easily become a bottleneck in the system.
  • Managing all of this code can easily become problematic, and it is not an easy pattern to share across an entire company.

So, what now? Memory

After trying all of the above solutions and being unsatisfied with the results, I decided to look at how other companies and frameworks handle retries.

One framework that stood out to me is Spring, an open-source Java framework that provides a comprehensive programming and configuration model for building modern enterprise applications.

The Spring way

One of the modules within Spring is Spring Kafka, which provides several ways to handle retries when consuming messages. These include using RetryTemplate and RetryCallback from the Spring Retry library to define a retry policy, using the @KafkaListener annotation to configure retry behavior on a per-method basis, or using AOP to handle retries with a retry interceptor.

Spring Kafka allows developers to choose the approach that best fits their use case, whether it is a global retry policy or a more fine-grained per-method retry configuration.

It's been a long time since I worked with Java, and Spring code is far from easy to understand (it's pretty advanced stuff), but after diving deeper into the Spring Kafka framework, I discovered that it implements an error handler using a strategy for handling exceptions thrown by the consumers.

When the listener throws an exception that is not a BatchListenerFailedException (the exception that tells Spring exactly which record in the batch failed), the error handler retries the batch of records from memory. This prevents a consumer rebalance during an extended retry sequence and allows for a more elegant solution.

In my honest opinion, the Spring Kafka framework offers a robust and flexible approach to handling retries that is well worth considering for any enterprise application.

It provides a range of options for developers to choose from and allows for more fine-grained control over retry behavior, making it a great choice for any organization looking to implement a retry mechanism for their messaging system.

Inspired by Spring Kafka (but different semantics) in Go

I currently work with (and love) Go, and as far as my research went, I didn't find any rewrite of the Spring Kafka and Spring Retry frameworks in Go, so let's build our own!

Here’s the key difference in my implementation compared to Spring Kafka’s approach: I’m not trying to pause a partition and “hold the offset line” until processing succeeds.

Instead, I commit the offset and move on, then retry in the background.

My approach includes:

  • Creating a consumer that reads from the topic.
  • Creating a Go channel that handles retries.
  • If a message fails to process, it is sent to the retry channel.
  • The main consumer continues to process the main topic while the other messages are retried.
  • If retries are exhausted, the message is sent to a persisted DLQ for later investigation.

⚠️ Warning:
This pattern intentionally decouples Kafka offsets from processing success.

If you commit a message and your process crashes before the retry succeeds, Kafka will not redeliver that message. With an in-memory Go channel, that means a restart can drop “in-flight retries.”

If you need stronger guarantees than that, the retry queue must be durable (retry topic, DB table, external queue) and your handler must be idempotent.

Also note that retries can happen later than subsequent messages, so you’re trading away strict ordering for throughput.

When not to use this pattern

If you need strict ordering guarantees, or you must be able to restart at any time without losing “in-flight retries,” don’t use an in-memory retry queue behind an early commit. In those cases, you either need a durable retry queue, or you need an offset-holding strategy (pause/seek and commit only after success).

Here’s the full example

This example uses an in-memory Go channel as the retry queue. That keeps Kafka consumption moving, but it also means retries are lost on process restart. In real systems, I usually evolve this into a durable retry queue.

One more practical note: your retry queue is part of your backpressure story. If it fills up, enqueueing can block and slow consumption. The "keep consuming no matter what" version of this pattern needs an explicit overflow strategy (drop, spill to durable storage, or fail fast into DLQ). Also consider adding jitter to your backoff to avoid retry storms when multiple messages fail simultaneously.

Quick orientation: ProcessRetryHandler is the interface your handler implements — Process does the work, MoveToDLQ handles exhausted retries. ConsumerWithRetryOptions wires everything together. The main loop reads from Kafka and enqueues failures; the goroutine retries them with backoff.
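
The exact code matters less than the shape, so here's a compressed sketch of how those pieces can fit together with kafka-go. This is my own minimal version, not a production implementation: it skips graceful shutdown, overflow handling, and metrics:

```go
package acme

import (
	"context"
	"log"
	"math/rand"
	"time"

	"github.com/segmentio/kafka-go"
)

// ProcessRetryHandler is what your business logic implements.
type ProcessRetryHandler interface {
	Process(ctx context.Context, msg kafka.Message) error
	// MoveToDLQ is called once retries are exhausted; it should persist the
	// message somewhere durable for later investigation.
	MoveToDLQ(ctx context.Context, msg kafka.Message)
}

// ConsumerWithRetryOptions wires the reader, the handler, and the retry policy.
type ConsumerWithRetryOptions struct {
	Reader     *kafka.Reader
	Handler    ProcessRetryHandler
	MaxRetries int
	BaseDelay  time.Duration
	QueueSize  int
}

type retryItem struct {
	msg      kafka.Message
	attempts int
}

// ConsumeWithRetry keeps reading (and therefore committing) from Kafka while
// failed messages go to an in-memory channel that a background goroutine
// retries with exponential backoff plus jitter.
func ConsumeWithRetry(ctx context.Context, opts ConsumerWithRetryOptions) error {
	retries := make(chan retryItem, opts.QueueSize)

	go func() { // background retry worker
		for it := range retries {
			// Exponential backoff with jitter to avoid synchronized retry storms.
			delay := opts.BaseDelay << it.attempts
			time.Sleep(delay + time.Duration(rand.Int63n(int64(delay)+1)))

			if err := opts.Handler.Process(ctx, it.msg); err == nil {
				continue
			}
			it.attempts++
			if it.attempts >= opts.MaxRetries {
				opts.Handler.MoveToDLQ(ctx, it.msg)
				continue
			}
			// Re-enqueue for another attempt. If the channel is full this blocks,
			// which is the backpressure/overflow problem called out above.
			retries <- it
		}
	}()

	for {
		// With a GroupID set, ReadMessage commits the offset whether or not
		// Process succeeds: this is the "commit early" half of the trade.
		msg, err := opts.Reader.ReadMessage(ctx)
		if err != nil {
			return err // context cancelled or reader closed
		}
		if err := opts.Handler.Process(ctx, msg); err != nil {
			log.Printf("processing failed, scheduling retry: %v", err)
			retries <- retryItem{msg: msg}
		}
	}
}
```

A durable variant swaps the channel for a retry topic or table, but the handler interface stays the same.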

Semantics (don’t skip this)

This pattern is “Kafka keeps moving,” which means you’re making a deliberate trade: Kafka offsets can advance even when processing hasn’t succeeded yet.

What Kafka guarantees (in this pattern)

If you commit before success (or your client auto-commits on read), Kafka won’t redeliver on restart. That’s effectively at-most-once for your side effects.

If you’re following along with kafka-go, note that Reader.ReadMessage with a GroupID advances committed offsets independently of whether your handler succeeds.

How to get stronger guarantees

If you want at-least-once for the side effects, the retry queue can’t be in-memory. It needs to be durable (retry topic, DB table, external queue) and your handler needs to be idempotent.
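
With kafka-go, that usually means switching from ReadMessage to FetchMessage and committing only after the message is either processed or parked somewhere durable. A sketch, with process and parkDurably standing in for your handler and your durable retry store:

```go
package acme

import (
	"context"

	"github.com/segmentio/kafka-go"
)

// consumeAtLeastOnce commits an offset only after the message has been
// handled or safely handed off to a durable retry store.
func consumeAtLeastOnce(
	ctx context.Context,
	r *kafka.Reader,
	process func(context.Context, kafka.Message) error,
	parkDurably func(context.Context, kafka.Message) error,
) error {
	for {
		msg, err := r.FetchMessage(ctx) // does NOT commit, unlike ReadMessage
		if err != nil {
			return err
		}
		if err := process(ctx, msg); err != nil {
			// Don't commit until the failure is durably recorded. If parking
			// fails too, skip the commit so the message is redelivered after
			// a restart or rebalance.
			if perr := parkDurably(ctx, msg); perr != nil {
				continue
			}
		}
		if err := r.CommitMessages(ctx, msg); err != nil {
			return err
		}
	}
}
```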

Idempotency (practical version)

If processing record X twice causes two emails, two charges, or two rows, you’re going to have a bad week. The fix is usually an idempotency key + write-once constraint (or upsert).
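
A minimal version of that idea: derive a stable key from the message (an event ID, for example) and let a unique constraint reject duplicates. This sketch assumes Postgres and table/column names of my own choosing:

```go
package acme

import (
	"context"
	"database/sql"
)

// sendOnce records the idempotency key and performs the side effect inside
// one transaction; the unique constraint makes a second attempt a no-op.
// A crash between send() and Commit still leaves a small duplicate window;
// that's inherent to external side effects like email.
func sendOnce(ctx context.Context, db *sql.DB, key string, send func() error) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit

	res, err := tx.ExecContext(ctx,
		`INSERT INTO processed_messages (idempotency_key) VALUES ($1)
		 ON CONFLICT (idempotency_key) DO NOTHING`, key)
	if err != nil {
		return err
	}
	if n, _ := res.RowsAffected(); n == 0 {
		return nil // key already recorded: a previous attempt got here first
	}
	if err := send(); err != nil {
		return err // rollback releases the key so a later retry can try again
	}
	return tx.Commit()
}
```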

Poison pills

A poison pill is a record that will never succeed without code/data changes. This pattern is nice because poison pills don’t block Kafka consumption — they show up as retry storms and DLQ events instead.

Decision table

In my head, this “commit early + retry elsewhere” idea has a few maturity levels:

| Option | Kafka consumption | Failure semantics | Operational complexity | When I'd pick it |
| --- | --- | --- | --- | --- |
| In-memory retry channel | Never blocks consumption | Retries are lost on restart | Low | Quick wins, low-blast-radius workflows |
| Durable retry queue (Kafka retry topic / external queue) | Never blocks consumption | At-least-once is achievable (with idempotency) | Medium | Most production systems |
| DB-backed state machine / outbox | Never blocks consumption | Strongest auditability and control | High | Complex workflows, compliance, long retries |

In conclusion

Kafka retries are not a broker feature; they’re a semantics decision.

This “commit early + retry elsewhere” approach is great when your top priority is keeping Kafka consumption unimpaired. The cost is that you’re now running a second system: your retry pipeline. Treat it like a first-class dependency.

If I were shipping this pattern, I’d page on:

  • Retry queue depth (and time-in-queue).
  • Retry exhaustion rate (messages going to DLQ).
  • DLQ age (the DLQ isn’t success; it’s a promise that you’ll look at it later).
  • Processing success latency (p95/p99) for the retry worker.
  • Consumer lag as a sanity check (it should stay flat even during downstream outages).
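
If you expose those signals with Prometheus, the minimum set can be as small as this (metric names are mine; client_golang assumed):

```go
package acme

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// The handful of signals worth paging on for the retry pipeline.
var (
	retryQueueDepth = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "retry_queue_depth",
		Help: "Messages currently waiting in the retry queue.",
	})
	retryExhausted = promauto.NewCounter(prometheus.CounterOpts{
		Name: "retry_exhausted_total",
		Help: "Messages that ran out of retries and went to the DLQ.",
	})
	retryProcessingSeconds = promauto.NewHistogram(prometheus.HistogramOpts{
		Name: "retry_processing_seconds",
		Help: "Latency of successful processing in the retry worker.",
	})
)
```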

Dealing with distributed systems challenges or building reliable event-driven architectures? Let's talk.
