Vlad Pistun
Handling Failures in Kafka Consumers: Go Patterns That Work in Production

Introduction

Apache Kafka has become the backbone of modern event-driven architectures, enabling high-throughput pipelines, real-time analytics, and loosely coupled microservices. But building resilient Kafka consumers, especially in Go, a language that emphasizes simplicity and low-level control, presents a unique set of challenges.

In production, failures aren’t edge cases; they’re daily realities. Network partitions, broker restarts, corrupted messages, serialization mismatches, and poisoned pills can bring down consumers if not handled properly. Without thoughtful design, you risk message loss, duplication, retry storms, or system-wide downtime.

This article dives deep into production-tested patterns for handling failures in Kafka consumers written in Go. Using the excellent kafka-go library, we’ll cover:

  • Retry logic with exponential backoff
  • Dead-letter queues for unrecoverable messages
  • Manual offset commits for at-least-once delivery
  • Supervision patterns for resilient runtime behavior
  • Schema validation to catch bad data before it crashes your logic

Whether you’re building real-time systems or maintaining a high-throughput message processor, these Go patterns will help you design Kafka consumers that can gracefully survive the messiness of production.

🔁 1. Retry Logic with Backoff (for transient errors)

Scenario: Your consumer hits a timeout when fetching messages or a temporary processing failure.

Pattern: Use exponential backoff with jitter to avoid retry storms and give downstream systems time to recover. For retries I usually reach for https://pkg.go.dev/github.com/cenkalti/backoff/v4 (a sketch using it follows the examples below).

Example:


func retryWithBackoff(fn func() error) error {
    backoff := 100 * time.Millisecond
    var err error
    for attempts := 0; attempts < 3; attempts++ {
        if err = fn(); err == nil {
            return nil
        }
        // Add jitter so a fleet of consumers doesn't retry in lockstep.
        time.Sleep(backoff + time.Duration(rand.Intn(100))*time.Millisecond)
        backoff *= 2
    }
    return fmt.Errorf("max retry attempts reached: %w", err)
}


Apply to message processing:


for {
    m, err := r.ReadMessage(ctx)
    if err != nil {
        log.Printf("read error: %v", err)
        continue
    }

    if err := retryWithBackoff(func() error {
        return handleMessage(m)
    }); err != nil {
        log.Printf("giving up on offset %d: %v", m.Offset, err)
        // optionally route to a DLQ (see the next pattern)
    }
}

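For comparison, here is a minimal sketch of the same retry built on the backoff/v4 library mentioned above; retryHandle is a hypothetical name, and handleMessage is the handler from the loop:

import (
    "context"

    "github.com/cenkalti/backoff/v4"
)

func retryHandle(ctx context.Context, m kafka.Message) error {
    // Exponential backoff with built-in jitter, capped at 3 retries
    // and cancelled early if ctx is done.
    b := backoff.WithContext(backoff.WithMaxRetries(backoff.NewExponentialBackOff(), 3), ctx)
    return backoff.Retry(func() error {
        return handleMessage(m)
    }, b)
}

Wrap errors you know are unrecoverable in backoff.Permanent to stop retrying immediately.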

🪵 2. Dead Letter Queue (DLQ) for poison pills

Scenario: A malformed message keeps causing panics or errors every time it's retried.

Pattern: After a fixed number of retries, send the message to a separate DLQ topic for analysis.

Example:

func handleWithDLQ(ctx context.Context, m kafka.Message, producer *kafka.Writer) {
    err := retryWithBackoff(func() error {
        return processMessage(m)
    })

    if err != nil {
        log.Printf("failed to process message: %v", err)
        // Preserve the original key/value and attach the error for analysis.
        dlqMessage := kafka.Message{
            Key:   m.Key,
            Value: m.Value,
            Headers: []kafka.Header{
                {Key: "error", Value: []byte(err.Error())},
            },
        }
        if err := producer.WriteMessages(ctx, dlqMessage); err != nil {
            log.Printf("DLQ write failed: %v", err)
        }
    }
}
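For completeness, a minimal sketch of the DLQ writer itself; the "events.dlq" topic name is an assumption, so use your own naming convention:

dlqWriter := &kafka.Writer{
    Addr:     kafka.TCP("localhost:9092"),
    Topic:    "events.dlq", // assumed DLQ topic name
    Balancer: &kafka.LeastBytes{},
}
defer dlqWriter.Close()

// handleWithDLQ(ctx, m, dlqWriter)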

🔄 3. At-Least-Once Delivery with Manual Commit

Scenario: You want to ensure messages are never lost, even if the process crashes.

Pattern: Read messages, process them, and commit offsets after successful handling.

Example:

r := kafka.NewReader(kafka.ReaderConfig{
    Brokers:  []string{"localhost:9092"},
    Topic:    "events",
    GroupID:  "event-handler",
    MinBytes: 10e3, // 10KB
    MaxBytes: 10e6, // 10MB
    CommitInterval: 0, // commit synchronously when CommitMessages is called
})

for {
    // FetchMessage, unlike ReadMessage, does not auto-commit the offset
    // when a GroupID is set, which is what makes manual commits possible.
    m, err := r.FetchMessage(ctx)
    if err != nil {
        log.Printf("fetch error: %v", err)
        continue
    }

    if err := handleMessage(m); err == nil {
        if err := r.CommitMessages(ctx, m); err != nil {
            log.Printf("commit error: %v", err)
        }
    } else {
        log.Printf("processing error: %v", err)
        // optionally write to DLQ
    }
}
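If per-message commits become a throughput bottleneck, CommitMessages also accepts a batch. A sketch, where the batch size of 100 is an arbitrary assumption:

batch := make([]kafka.Message, 0, 100)
for {
    m, err := r.FetchMessage(ctx)
    if err != nil {
        break
    }
    if err := handleMessage(m); err != nil {
        log.Printf("processing error: %v", err)
        continue
    }
    batch = append(batch, m)
    if len(batch) == cap(batch) {
        // One round-trip commits the highest offset per partition.
        if err := r.CommitMessages(ctx, batch...); err != nil {
            log.Printf("commit error: %v", err)
        }
        batch = batch[:0]
    }
}

Keep in mind that Kafka commits the highest offset per partition, so committing past a failed message acknowledges it too; dead-letter failures before the commit moves on.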

🚦 4. Consumer Supervision & Graceful Restart

Scenario: Your consumer crashes due to an unhandled panic or dependency failure.

Pattern: Use a supervisor loop that recovers panics and restarts the consumer with backoff; in production, add metrics and alerting on each restart.

Example (basic concept):

func superviseConsumer(ctx context.Context, logger *slog.Logger) error {
    run := func() (err error) {
        // Convert panics into errors so a poison pill can't kill the process.
        defer func() {
            if r := recover(); r != nil {
                err = errors.Join(err, fmt.Errorf("panic: %v", r))
            }
        }()

        if err = runConsumer(); err != nil {
            return fmt.Errorf("run consumer: %w", err)
        }
        return nil
    }

    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        default:
        }

        err := run()
        if err != nil && !errors.Is(err, context.Canceled) {
            // Unexpected failure: log, back off briefly, restart.
            logger.Error("run consumer", "err", err)
            time.Sleep(time.Second)
            continue
        }

        if err == nil {
            return nil
        }

        // err is context.Canceled, so a shutdown was requested.
        if ctx.Err() != nil {
            logger.Warn("consumer graceful shutdown timeout")
            return ctx.Err()
        }
        return nil
    }
}
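A sketch of wiring this up, assuming the supervisor above is wrapped as superviseConsumer(ctx, logger):

import (
    "context"
    "errors"
    "log/slog"
    "os/signal"
    "syscall"
)

func main() {
    // Cancel the supervisor's context on SIGINT/SIGTERM for a graceful stop.
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
    defer stop()

    logger := slog.Default()
    if err := superviseConsumer(ctx, logger); err != nil && !errors.Is(err, context.Canceled) {
        logger.Error("consumer exited", "err", err)
    }
}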

🧪 5. Contract Testing with Schema Validation

Scenario: A consumer crashes due to unexpected payload shape.

Pattern: Validate every message against its schema (Protobuf/Avro) at read time, before it reaches your business logic.

With Go + Protobuf + Schema Registry:

  • Use Buf for linted schema definitions
  • Validate decoded messages before processing

Example:

msg := &MyEvent{}
if err := proto.Unmarshal(m.Value, msg); err != nil {
    log.Printf("unmarshal failed: %v", err)
    // send to DLQ
}


Alternatively, you can use buf.build/go/protovalidate to check the decoded message against constraints declared in the proto schema:

msg := &MyEvent{}
if err := proto.Unmarshal(m.Value, msg); err != nil {
    log.Printf("unmarshal failed: %v", err)
    return // send to DLQ; skip validating a half-decoded message
}

if err := protovalidate.Validate(msg); err != nil {
    log.Printf("validation failed: %v", err)
    // send to DLQ
}


✅ Final Thoughts

Failure handling in Kafka consumers is a balancing act between reliability, throughput, and cost. Go, with its simplicity and kafka-go's flexibility, gives you the right primitives to implement production-grade systems.

Checklist for resilient consumers (a condensed sketch combining them follows the list):

  • Backoff and retries
  • Dead-letter routing
  • Manual offset commit
  • Panic recovery and supervision
  • Schema validation and message contracts
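
A condensed sketch pulling these together; retryWithBackoff, processMessage, and dlqWriter are the assumed helpers from the sections above:

for {
    m, err := r.FetchMessage(ctx) // manual commit: fetch without auto-commit
    if err != nil {
        log.Printf("fetch error: %v", err)
        break
    }

    if err := retryWithBackoff(func() error { return processMessage(m) }); err != nil {
        // Retries exhausted: dead-letter the message instead of blocking the partition.
        dlq := kafka.Message{
            Key:     m.Key,
            Value:   m.Value,
            Headers: []kafka.Header{{Key: "error", Value: []byte(err.Error())}},
        }
        if err := dlqWriter.WriteMessages(ctx, dlq); err != nil {
            log.Printf("DLQ write failed: %v", err)
            continue // do not commit; the message will be redelivered
        }
    }

    // Commit whether processed or dead-lettered, so a poison pill isn't re-read forever.
    if err := r.CommitMessages(ctx, m); err != nil {
        log.Printf("commit error: %v", err)
    }
}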
