Introduction
Apache Kafka has become the backbone of modern event-driven architectures, enabling high-throughput pipelines, real-time analytics, and loosely coupled microservices. But building resilient Kafka consumers, especially in Go, a language that favors simplicity and explicit control, presents a unique set of challenges.
In production, failures aren't edge cases; they're daily realities. Network partitions, broker restarts, corrupted messages, serialization mismatches, and poison pills can bring down consumers if they aren't handled properly. Without thoughtful design, you risk message loss, duplication, retry storms, or system-wide downtime.
This article dives deep into production-tested patterns for handling failures in Kafka consumers written in Go. Using the excellent kafka-go library, we’ll cover:
- Retry logic with exponential backoff
- Dead-letter queues for unrecoverable messages
- Manual offset commits for at-least-once delivery
- Supervision patterns for resilient runtime behavior
- Schema validation to catch bad data before it crashes your logic
Whether you’re building real-time systems or maintaining a high-throughput message processor, these Go patterns will help you design Kafka consumers that can gracefully survive the messiness of production.
🔁 1. Retry Logic with Backoff (for transient errors)
Scenario: Your consumer hits a timeout while fetching messages, or processing fails temporarily.
Pattern:
Use exponential backoff with jitter to avoid retry storms and to give downstream systems time to recover. In most cases I reach for https://pkg.go.dev/github.com/cenkalti/backoff/v4; a sketch using that package follows the hand-rolled example below.
Example:
func retryWithBackoff(fn func() error) error {
    backoff := 100 * time.Millisecond
    var err error
    for attempts := 0; attempts < 3; attempts++ {
        if err = fn(); err == nil {
            return nil
        }
        // Sleep with a little jitter, then double the backoff for the next attempt.
        time.Sleep(backoff + time.Duration(rand.Intn(100))*time.Millisecond)
        backoff *= 2
    }
    return fmt.Errorf("max retry attempts reached: %w", err)
}
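If you prefer not to hand-roll the loop, roughly the same behaviour can be expressed with the backoff/v4 package mentioned above. A minimal sketch (the processWithRetry name and the three-retry limit are my choices; the package adds jitter by default):

import (
    "context"

    "github.com/cenkalti/backoff/v4"
)

// processWithRetry retries fn with exponential backoff until it succeeds,
// the retry budget is exhausted, or ctx is cancelled.
func processWithRetry(ctx context.Context, fn func() error) error {
    policy := backoff.WithContext(
        backoff.WithMaxRetries(backoff.NewExponentialBackOff(), 3),
        ctx,
    )
    return backoff.Retry(fn, policy)
}

Returning backoff.Permanent(err) from fn stops retrying immediately for errors you already know are unrecoverable.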
Apply to message processing:
for {
    m, err := r.ReadMessage(ctx)
    if err != nil {
        log.Printf("read error: %v", err)
        continue
    }
    if err := retryWithBackoff(func() error {
        return handleMessage(m)
    }); err != nil {
        // All attempts failed: log and move on, or route to a DLQ (next section).
        log.Printf("giving up on message at offset %d: %v", m.Offset, err)
    }
}
🪵 2. Dead Letter Queue (DLQ) for poison pills
Scenario: A malformed message keeps causing panics or errors every time it's retried.
Pattern: After a fixed number of retries, send the message to a separate DLQ topic for analysis.
Example:
func handleWithDLQ(m kafka.Message, producer *kafka.Writer) {
    err := retryWithBackoff(func() error {
        return processMessage(m)
    })
    if err != nil {
        log.Printf("failed to process message: %v", err)
        // Preserve the original key and value, and attach the error so the
        // message can be inspected and replayed later.
        dlqMessage := kafka.Message{
            Key:   m.Key,
            Value: m.Value,
            Headers: []kafka.Header{
                {Key: "error", Value: []byte(err.Error())},
            },
        }
        if dlqErr := producer.WriteMessages(context.Background(), dlqMessage); dlqErr != nil {
            log.Printf("failed to write to DLQ: %v", dlqErr)
        }
    }
}
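For completeness, here is a minimal sketch of how the DLQ producer itself might be wired up with kafka-go; the broker address and the events.dlq topic name are illustrative assumptions:

// Writer that publishes failed messages to a dedicated DLQ topic.
dlqWriter := &kafka.Writer{
    Addr:     kafka.TCP("localhost:9092"),
    Topic:    "events.dlq",
    Balancer: &kafka.LeastBytes{},
}
defer dlqWriter.Close()

// Inside the consume loop:
handleWithDLQ(m, dlqWriter)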
🔄 3. At-Least-Once Delivery with Manual Commit
Scenario: You want to ensure messages are never lost, even if the process crashes.
Pattern: Read messages, process them, and commit offsets after successful handling.
Example:
r := kafka.NewReader(kafka.ReaderConfig{
    Brokers:        []string{"localhost:9092"},
    Topic:          "events",
    GroupID:        "event-handler",
    MinBytes:       10e3, // 10KB
    MaxBytes:       10e6, // 10MB
    CommitInterval: 0,    // 0 = commit synchronously when CommitMessages is called
})

for {
    // FetchMessage does not commit the offset; ReadMessage would auto-commit
    // when a GroupID is set, which defeats manual offset management.
    m, err := r.FetchMessage(ctx)
    if err != nil {
        log.Printf("fetch error: %v", err)
        continue
    }
    if err := handleMessage(m); err != nil {
        log.Printf("processing error: %v", err)
        // optionally write to DLQ; the offset stays uncommitted,
        // so the message is redelivered after a restart or rebalance
        continue
    }
    if err := r.CommitMessages(ctx, m); err != nil {
        log.Printf("commit error: %v", err)
    }
}
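If per-message synchronous commits become a throughput bottleneck, one option is to commit in small batches, since CommitMessages accepts multiple messages. A rough sketch (the batch size is arbitrary; a crash mid-batch simply redelivers the uncommitted messages, which is still at-least-once):

const batchSize = 100
batch := make([]kafka.Message, 0, batchSize)
for {
    m, err := r.FetchMessage(ctx)
    if err != nil {
        log.Printf("fetch error: %v", err)
        continue
    }
    if err := handleMessage(m); err != nil {
        log.Printf("processing error: %v", err)
        continue
    }
    batch = append(batch, m)
    if len(batch) >= batchSize {
        if err := r.CommitMessages(ctx, batch...); err != nil {
            log.Printf("commit error: %v", err)
        }
        batch = batch[:0]
    }
}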
🚦 4. Consumer Supervision & Graceful Restart
Scenario: Your consumer crashes due to an unhandled panic or dependency failure.
Pattern: Wrap the consumer in a supervisor loop that recovers from panics and restarts it with backoff, surfacing metrics and alerts along the way.
Example (basic concept):
func supervise(ctx context.Context, runConsumer func() error, logger *slog.Logger) error {
    // run executes a single consumer session and converts panics into errors.
    run := func() (err error) {
        defer func() {
            if r := recover(); r != nil {
                err = errors.Join(err, fmt.Errorf("panic: %v", r))
            }
        }()
        if err = runConsumer(); err != nil {
            return fmt.Errorf("run consumer: %w", err)
        }
        return nil
    }

    for {
        // Stop restarting once shutdown has been requested.
        select {
        case <-ctx.Done():
            return ctx.Err()
        default:
        }

        err := run()
        if err != nil && !errors.Is(err, context.Canceled) {
            logger.Error("run consumer", "err", err)
            time.Sleep(time.Second) // simple restart backoff
            continue
        }
        if err == nil {
            return nil
        }
        // run returned context.Canceled: treat it as part of shutdown.
        if ctx.Err() != nil {
            logger.Warn("consumer stopped: context canceled")
            return ctx.Err()
        }
        return nil
    }
}
🧪 5. Contract Testing with Schema Validation
Scenario: A consumer crashes due to unexpected payload shape.
Pattern: Validate messages against a schema (Protobuf/Avro) at read time.
With Go + Protobuf + Schema Registry:
- Use Buf for linted schema definitions
- Validate decoded messages before processing
Example:
msg := &MyEvent{}
if err := proto.Unmarshal(m.Value, msg); err != nil {
    log.Printf("unmarshal failed: %v", err)
    // send to DLQ: a malformed payload will never deserialize, so retrying is pointless
}
Alternatively, you can use buf.build/go/protovalidate to validate the decoded message against the constraints declared in your proto schema:
msg := &MyEvent{}
if err := proto.Unmarshal(m.Value, msg); err != nil {
    log.Printf("unmarshal failed: %v", err)
    // send to DLQ
}

// Check the field-level constraints declared in the schema.
if err := protovalidate.Validate(msg); err != nil {
    log.Printf("validation failed: %v", err)
    // send to DLQ
}
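One way to tie this back into the retry and DLQ machinery from the earlier sections is to surface both failure modes as errors from processMessage, so malformed or invalid payloads end up in the DLQ once the retry budget is exhausted. A sketch, where handleEvent stands in for your real business logic:

func processMessage(m kafka.Message) error {
    msg := &MyEvent{}
    if err := proto.Unmarshal(m.Value, msg); err != nil {
        return fmt.Errorf("unmarshal: %w", err)
    }
    if err := protovalidate.Validate(msg); err != nil {
        return fmt.Errorf("validate: %w", err)
    }
    return handleEvent(msg)
}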
✅ Final Thoughts
Failure handling in Kafka consumers is a balancing act between reliability, throughput, and cost. Go, with its simplicity and kafka-go's flexibility, gives you the right primitives to implement production-grade systems.
Checklist for resilient consumers:
- Backoff and retries
- Dead-letter routing
- Manual offset commit
- Panic recovery and supervision
- Schema validation and message contracts