In production environments, health checks are your early warning system — but not all checks are created equal. While Kubernetes’ liveness and readiness probes can tell you if a service is running, they can’t tell you if it's working. We hit this problem head-on while building a Kafka-based CVE processing pipeline, and it led us to create intelligent probes that know whether the consumer is truly healthy.
🚨 The Problem with Basic Probes
Our Kafka consumer processes CVE data, stores it in Postgres, and runs in Kubernetes. Initially, we used standard HTTP liveness/readiness endpoints that simply returned `200 OK` if the process was up.
But we started running into invisible failures:
- The consumer would be running but stuck due to a downstream DB issue.
- It would be assigned a partition but not progressing.
- Or it would silently lag behind, and no alerts were triggered.
These kinds of failures wouldn’t be caught until days later, sometimes after critical delays in CVE ingestion.
🧭 What We Needed
We wanted probes that could:
- Detect if the consumer is stuck or not making progress
- Track offset movement per partition
- Check if the Kafka topic and brokers are reachable
- Verify Postgres availability
- Handle partition rebalances without stale state
🧪 How We Solved It
We used Go, the Sarama Kafka client, and some smart in-memory tracking. Here’s how it works:
🧮 1. Offset Tracking
We used `sync.Map` to store:
- `offsetMap`: the most recent offset we’ve seen per partition
- `lastCommittedOffsets`: used to detect whether the consumer has stopped progressing
offsetMap.Store(message.Partition, message.Offset)
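That single `Store` call runs for every message we consume. Below is a minimal sketch of where it sits in Sarama's consumer-group handler; the `consumerHandler` type and the Postgres step are illustrative, and the import path depends on your Sarama version (`github.com/Shopify/sarama` on older releases):

```go
package main

import (
	"sync"

	"github.com/IBM/sarama"
)

// Per-partition state shared with the probe handlers.
var (
	offsetMap            sync.Map // partition (int32) -> last offset seen (int64)
	lastCommittedOffsets sync.Map // partition (int32) -> offset at the previous liveness check (int64)
)

type consumerHandler struct{}

func (h *consumerHandler) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (h *consumerHandler) Cleanup(sarama.ConsumerGroupSession) error { return nil }

// ConsumeClaim records the offset of every message before handling it,
// so the liveness probe can later check whether consumption is still advancing.
func (h *consumerHandler) ConsumeClaim(session sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	for message := range claim.Messages() {
		offsetMap.Store(message.Partition, message.Offset)

		// ... parse the CVE record and write it to Postgres here ...

		session.MarkMessage(message, "")
	}
	return nil
}
```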
🔍 2. Intelligent Liveness Check
The `/liveness` endpoint does more than return `200 OK`. It:
- Connects to Kafka
- Fetches the latest offset for each partition
- Compares it with our last seen offset
- Treats “no new data” as healthy, but flags the consumer as stuck when new messages exist and the offset hasn’t changed since the last check
if currentOffset == committedOffset || currentOffset == committedOffset+1 {
    // no new messages to process
} else if lastCommitted == committedOffset {
    // stuck: offset hasn't advanced
    unhealthy = true
}
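Putting it together, the probe handler looks roughly like the sketch below. It extends the tracking sketch above (add `net/http` to its imports) and reuses `offsetMap` and `lastCommittedOffsets`; the `livenessHandler` name, the `sarama.Client` plumbing, and the HTTP wiring are illustrative rather than our exact code.

```go
// livenessHandler flags the consumer as unhealthy when a partition has
// messages waiting but our tracked offset has not moved since the last probe.
func livenessHandler(client sarama.Client, topic string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		partitions, err := client.Partitions(topic)
		if err != nil {
			http.Error(w, "cannot list partitions: "+err.Error(), http.StatusServiceUnavailable)
			return
		}

		unhealthy := false
		for _, p := range partitions {
			// Latest offset on the broker (the next offset to be written).
			currentOffset, err := client.GetOffset(topic, p, sarama.OffsetNewest)
			if err != nil {
				http.Error(w, "cannot fetch broker offset: "+err.Error(), http.StatusServiceUnavailable)
				return
			}

			seen, ok := offsetMap.Load(p)
			if !ok {
				continue // partition not currently assigned to this consumer
			}
			committedOffset := seen.(int64)
			prev, hasPrev := lastCommittedOffsets.Load(p)

			if currentOffset == committedOffset || currentOffset == committedOffset+1 {
				// no new messages to process; being idle is healthy
			} else if hasPrev && prev.(int64) == committedOffset {
				// new messages exist but the offset hasn't advanced since the last probe
				unhealthy = true
			}

			// remember this probe's offset for the next comparison
			lastCommittedOffsets.Store(p, committedOffset)
		}

		if unhealthy {
			http.Error(w, "consumer is not making progress", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}
```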
📦 3. Readiness Check
The `/readiness` probe ensures:
- At least one Kafka broker is reachable
- The Kafka topic exists
- Postgres is reachable, checked with `PingContext()`
Only if all of these are true does the service report itself as “ready.”
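A hedged sketch of what that handler can look like, again with illustrative names and in the same file as the sketches above (add `context`, `database/sql`, and `time` to the imports). The broker and topic checks go through the same `sarama.Client`, and Postgres is checked with `database/sql`’s `PingContext`:

```go
// readinessHandler reports ready only when a Kafka broker is connected,
// the topic exists in cluster metadata, and Postgres answers a ping.
func readinessHandler(client sarama.Client, db *sql.DB, topic string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		// At least one broker must have a live connection.
		brokerUp := false
		for _, b := range client.Brokers() {
			if connected, _ := b.Connected(); connected {
				brokerUp = true
				break
			}
		}
		if !brokerUp {
			http.Error(w, "no Kafka broker reachable", http.StatusServiceUnavailable)
			return
		}

		// The topic we consume from must exist.
		topics, err := client.Topics()
		if err != nil {
			http.Error(w, "cannot list topics: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		topicFound := false
		for _, t := range topics {
			if t == topic {
				topicFound = true
				break
			}
		}
		if !topicFound {
			http.Error(w, "topic not found", http.StatusServiceUnavailable)
			return
		}

		// Postgres must respond within the timeout.
		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "postgres unreachable: "+err.Error(), http.StatusServiceUnavailable)
			return
		}

		w.WriteHeader(http.StatusOK)
	}
}
```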
🔁 4. Handling Rebalances
During rebalancing, partitions are reassigned, and we clean up the stale state:
offsetMap.Delete(claim.Partition())
lastCommittedOffsets.Delete(claim.Partition())
This prevents invalid health status when the partition is no longer assigned to the consumer.
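One way those deletes can be wired in, as an assumption about placement rather than a prescription: Sarama closes a claim’s message channel when a rebalance revokes the partition, so a deferred cleanup at the end of `ConsumeClaim` (extending the handler sketched earlier) runs exactly when we lose the partition.

```go
// ConsumeClaim drops this partition's tracking state once the claim ends,
// i.e. whenever a rebalance takes the partition away from this consumer.
func (h *consumerHandler) ConsumeClaim(session sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	defer func() {
		offsetMap.Delete(claim.Partition())
		lastCommittedOffsets.Delete(claim.Partition())
	}()

	for message := range claim.Messages() {
		offsetMap.Store(message.Partition, message.Offset)
		// ... process the CVE record and write it to Postgres ...
		session.MarkMessage(message, "")
	}
	return nil
}
```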
💡 Why This is "Intelligent"
Unlike typical health checks, our probes:
- Track progress, not just availability
- Are partition-aware
- Can detect silent failures
- Check both Kafka and DB dependencies
- React to consumer rebalances
This allowed us to catch real issues — like stuck consumers or dead DB connections — before they caused data loss or delays.
📈 Lessons Learned
- Don’t rely on basic probes for stateful services like Kafka consumers
- Offset movement is a powerful signal for consumer health
- Observability is about understanding behavior, not just uptime
✅ Conclusion
This approach has saved us from multiple production incidents by proactively identifying when the consumer isn’t doing useful work. We now treat our consumers like stateful workers — with health checks that are aware of what they’re supposed to be doing, not just whether they’re online.
🙌 Let’s Connect
Thanks for reading! If you enjoyed this post or have thoughts to share, feel free to reach out — I’d love to chat about Kafka, distributed systems, or anything DevOps.