Vlad Pistun
Handling Failures in Kafka Consumers: Go Patterns That Work in Production

Introduction

Apache Kafka has become the backbone of modern event-driven architectures, enabling high-throughput pipelines, real-time analytics, and loosely coupled microservices. But building resilient Kafka consumers, especially in Go, a language that emphasizes simplicity and low-level control, presents a unique set of challenges.

In production, failures aren’t edge cases; they’re daily realities. Network partitions, broker restarts, corrupted messages, serialization mismatches, and poisoned pills can bring down consumers if not handled properly. Without thoughtful design, you risk message loss, duplication, retry storms, or system-wide downtime.

This article dives deep into production-tested patterns for handling failures in Kafka consumers written in Go. Using the excellent kafka-go library, we’ll cover:

  • Retry logic with exponential backoff
  • Dead-letter queues for unrecoverable messages
  • Manual offset commits for at-least-once delivery
  • Supervision patterns for resilient runtime behavior
  • Schema validation to catch bad data before it crashes your logic

Whether you’re building real-time systems or maintaining a high-throughput message processor, these Go patterns will help you design Kafka consumers that can gracefully survive the messiness of production.

🔁 1. Retry Logic with Backoff (for transient errors)

Scenario: Your consumer hits a timeout when fetching messages or a temporary processing failure.

Pattern: Use exponential backoff with jitter to avoid retry storms and give downstream systems time to recover. For retries I usually reach for https://pkg.go.dev/github.com/cenkalti/backoff/v4 (a sketch using it follows the examples below).

Example:


func retryWithBackoff(fn func() error) error {
    backoff := 100 * time.Millisecond
    var err error
    for attempts := 0; attempts < 3; attempts++ {
        if err = fn(); err == nil {
            return nil
        }
        // Add jitter so a fleet of consumers doesn't retry in lockstep.
        time.Sleep(backoff + time.Duration(rand.Intn(100))*time.Millisecond)
        backoff *= 2
    }
    return fmt.Errorf("max retry attempts reached: %w", err)
}


Apply to message processing:


for {
    m, err := r.ReadMessage(ctx)
    if err != nil {
        log.Printf("read error: %v", err)
        continue
    }

    if err := retryWithBackoff(func() error {
        return handleMessage(m)
    }); err != nil {
        log.Printf("giving up on offset %d: %v", m.Offset, err)
        // optionally route to a DLQ (see the next pattern)
    }
}

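For comparison, here is a minimal sketch of the same retry built on the backoff/v4 library mentioned above; retryHandle is a hypothetical name, and handleMessage is the handler from the loop:

import (
    "context"

    "github.com/cenkalti/backoff/v4"
)

func retryHandle(ctx context.Context, m kafka.Message) error {
    // Exponential backoff with built-in jitter, capped at 3 retries
    // and cancelled early if ctx is done.
    b := backoff.WithContext(backoff.WithMaxRetries(backoff.NewExponentialBackOff(), 3), ctx)
    return backoff.Retry(func() error {
        return handleMessage(m)
    }, b)
}

Wrap errors you know are unrecoverable in backoff.Permanent to stop retrying immediately.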

🪵 2. Dead Letter Queue (DLQ) for poison pills

Scenario: A malformed message keeps causing panics or errors every time it's retried.

Pattern: After a fixed number of retries, send the message to a separate DLQ topic for analysis.

Example:

func handleWithDLQ(ctx context.Context, m kafka.Message, producer *kafka.Writer) {
    err := retryWithBackoff(func() error {
        return processMessage(m)
    })

    if err != nil {
        log.Printf("failed to process message: %v", err)
        // Preserve the original key/value and attach the error for analysis.
        dlqMessage := kafka.Message{
            Key:   m.Key,
            Value: m.Value,
            Headers: []kafka.Header{
                {Key: "error", Value: []byte(err.Error())},
            },
        }
        if err := producer.WriteMessages(ctx, dlqMessage); err != nil {
            log.Printf("DLQ write failed: %v", err)
        }
    }
}
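For completeness, a minimal sketch of the DLQ writer itself; the "events.dlq" topic name is an assumption, so use your own naming convention:

dlqWriter := &kafka.Writer{
    Addr:     kafka.TCP("localhost:9092"),
    Topic:    "events.dlq", // assumed DLQ topic name
    Balancer: &kafka.LeastBytes{},
}
defer dlqWriter.Close()

// handleWithDLQ(ctx, m, dlqWriter)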

🔄 3. At-Least-Once Delivery with Manual Commit

Scenario: You want to ensure messages are never lost, even if the process crashes.

Pattern: Read messages, process them, and commit offsets after successful handling.

Example:

r := kafka.NewReader(kafka.ReaderConfig{
    Brokers:  []string{"localhost:9092"},
    Topic:    "events",
    GroupID:  "event-handler",
    MinBytes: 10e3, // 10KB
    MaxBytes: 10e6, // 10MB
    CommitInterval: 0, // commit synchronously when CommitMessages is called
})

for {
    // FetchMessage, unlike ReadMessage, does not auto-commit the offset
    // when a GroupID is set, which is what makes manual commits possible.
    m, err := r.FetchMessage(ctx)
    if err != nil {
        log.Printf("fetch error: %v", err)
        continue
    }

    if err := handleMessage(m); err == nil {
        if err := r.CommitMessages(ctx, m); err != nil {
            log.Printf("commit error: %v", err)
        }
    } else {
        log.Printf("processing error: %v", err)
        // optionally write to DLQ
    }
}
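If per-message commits become a throughput bottleneck, CommitMessages also accepts a batch. A sketch, where the batch size of 100 is an arbitrary assumption:

batch := make([]kafka.Message, 0, 100)
for {
    m, err := r.FetchMessage(ctx)
    if err != nil {
        break
    }
    if err := handleMessage(m); err != nil {
        log.Printf("processing error: %v", err)
        continue
    }
    batch = append(batch, m)
    if len(batch) == cap(batch) {
        // One round-trip commits the highest offset per partition.
        if err := r.CommitMessages(ctx, batch...); err != nil {
            log.Printf("commit error: %v", err)
        }
        batch = batch[:0]
    }
}

Keep in mind that Kafka commits the highest offset per partition, so committing past a failed message acknowledges it too; dead-letter failures before the commit moves on.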

🚦 4. Consumer Supervision & Graceful Restart

Scenario: Your consumer crashes due to an unhandled panic or dependency failure.

Pattern: Use a supervisor loop that recovers panics and restarts the consumer with backoff; in production, add metrics and alerting on each restart.

Example (basic concept):

func superviseConsumer(ctx context.Context, logger *slog.Logger) error {
    run := func() (err error) {
        // Convert panics into errors so a poison pill can't kill the process.
        defer func() {
            if r := recover(); r != nil {
                err = errors.Join(err, fmt.Errorf("panic: %v", r))
            }
        }()

        if err = runConsumer(); err != nil {
            return fmt.Errorf("run consumer: %w", err)
        }
        return nil
    }

    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        default:
        }

        err := run()
        if err != nil && !errors.Is(err, context.Canceled) {
            // Unexpected failure: log, back off briefly, restart.
            logger.Error("run consumer", "err", err)
            time.Sleep(time.Second)
            continue
        }

        if err == nil {
            return nil
        }

        // err is context.Canceled, so a shutdown was requested.
        if ctx.Err() != nil {
            logger.Warn("consumer graceful shutdown timeout")
            return ctx.Err()
        }
        return nil
    }
}
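A sketch of wiring this up, assuming the supervisor above is wrapped as superviseConsumer(ctx, logger):

import (
    "context"
    "errors"
    "log/slog"
    "os/signal"
    "syscall"
)

func main() {
    // Cancel the supervisor's context on SIGINT/SIGTERM for a graceful stop.
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
    defer stop()

    logger := slog.Default()
    if err := superviseConsumer(ctx, logger); err != nil && !errors.Is(err, context.Canceled) {
        logger.Error("consumer exited", "err", err)
    }
}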

🧪 5. Contract Testing with Schema Validation

Scenario: A consumer crashes due to unexpected payload shape.

Pattern: Validate every message against its schema (Protobuf/Avro) at read time, before it reaches your business logic.

With Go + Protobuf + Schema Registry:

  • Use Buf for linted schema definitions
  • Validate decoded messages before processing

Example:

msg := &MyEvent{}
if err := proto.Unmarshal(m.Value, msg); err != nil {
    log.Printf("unmarshal failed: %v", err)
    // send to DLQ
}


Alternatively, you can use buf.build/go/protovalidate to check the decoded message against constraints declared in the proto schema:

msg := &MyEvent{}
if err := proto.Unmarshal(m.Value, msg); err != nil {
    log.Printf("unmarshal failed: %v", err)
    return // send to DLQ; skip validating a half-decoded message
}

if err := protovalidate.Validate(msg); err != nil {
    log.Printf("validation failed: %v", err)
    // send to DLQ
}


✅ Final Thoughts

Failure handling in Kafka consumers is a balancing act between reliability, throughput, and cost. Go, with its simplicity and kafka-go's flexibility, gives you the right primitives to implement production-grade systems.

Checklist for resilient consumers (a condensed sketch combining them follows the list):

  • Backoff and retries
  • Dead-letter routing
  • Manual offset commit
  • Panic recovery and supervision
  • Schema validation and message contracts
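
A condensed sketch pulling these together; retryWithBackoff, processMessage, and dlqWriter are the assumed helpers from the sections above:

for {
    m, err := r.FetchMessage(ctx) // manual commit: fetch without auto-commit
    if err != nil {
        log.Printf("fetch error: %v", err)
        break
    }

    if err := retryWithBackoff(func() error { return processMessage(m) }); err != nil {
        // Retries exhausted: dead-letter the message instead of blocking the partition.
        dlq := kafka.Message{
            Key:     m.Key,
            Value:   m.Value,
            Headers: []kafka.Header{{Key: "error", Value: []byte(err.Error())}},
        }
        if err := dlqWriter.WriteMessages(ctx, dlq); err != nil {
            log.Printf("DLQ write failed: %v", err)
            continue // do not commit; the message will be redelivered
        }
    }

    // Commit whether processed or dead-lettered, so a poison pill isn't re-read forever.
    if err := r.CommitMessages(ctx, m); err != nil {
        log.Printf("commit error: %v", err)
    }
}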
