🧠 Building Intelligent Kafka Health Probes in Go

In production environments, health checks are your early warning system — but not all checks are created equal. While Kubernetes’ liveness and readiness probes can tell you if a service is running, they can’t tell you if it's working. We hit this problem head-on while building a Kafka-based CVE processing pipeline, and it led us to create intelligent probes that know whether the consumer is truly healthy.


🚨 The Problem with Basic Probes

Our Kafka consumer processes CVE data, stores it in Postgres, and runs in Kubernetes. Initially, we used standard HTTP liveness/readiness endpoints that simply returned 200 OK if the process was up.

But we started running into invisible failures:

  • The consumer would be running but stuck due to a downstream DB issue.
  • It would be assigned a partition but not progressing.
  • Or it would silently lag behind, and no alerts were triggered.

These failures often went unnoticed for days, sometimes until critical delays in CVE ingestion had already piled up.


🧭 What We Needed

We wanted probes that could:

  • Detect if the consumer is stuck or not making progress
  • Track offset movement per partition
  • Check if the Kafka topic and brokers are reachable
  • Verify Postgres availability
  • Handle partition rebalances without stale state

🧪 How We Solved It

We used Go, the Sarama Kafka client, and some smart in-memory tracking. Here’s how it works:

🧮 1. Offset Tracking

We used two sync.Map instances to track:

  • offsetMap: the most recent offset we’ve seen per partition
  • lastCommittedOffsets: used to detect if the consumer has stopped progressing

```go
offsetMap.Store(message.Partition, message.Offset)
```
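
For context, here's a minimal sketch of how that tracking can be wired into a Sarama ConsumeClaim loop. The handler type, the import path (the IBM-maintained Sarama module; adjust if you're on Shopify/sarama), and the processing step are assumptions for illustration — only the two maps and the Store call come from the pipeline described above.

```go
import (
	"sync"

	"github.com/IBM/sarama"
)

// Shared between the consumer loop and the HTTP probe handlers.
var (
	offsetMap            sync.Map // partition (int32) -> latest offset seen by this consumer
	lastCommittedOffsets sync.Map // partition (int32) -> offset recorded at the previous liveness check
)

// consumerHandler implements sarama.ConsumerGroupHandler (Setup/Cleanup omitted here).
type consumerHandler struct{}

func (h *consumerHandler) ConsumeClaim(session sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	for message := range claim.Messages() {
		// ... parse the CVE record and persist it to Postgres ...

		offsetMap.Store(message.Partition, message.Offset) // track per-partition progress
		session.MarkMessage(message, "")
	}
	return nil
}
```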

🔍 2. Intelligent Liveness Check

The /liveness endpoint does more than return 200 OK. It:

  • Connects to Kafka
  • Fetches the latest offset for each partition
  • Compares it with our last seen offset
  • Checks if there’s no new data OR if the offset hasn’t changed over time (indicating a stuck consumer)

```go
if currentOffset == committedOffset || currentOffset == committedOffset+1 {
    // no new messages to process
} else if lastCommitted == committedOffset {
    // stuck — offset hasn't advanced
    unhealthy = true
}
```
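
Put together, the /liveness handler can look roughly like the sketch below. It reuses the maps from the first snippet plus a long-lived sarama.Client (created once with sarama.NewClient) and net/http; the handler and variable names are illustrative, not the exact production code.

```go
func livenessHandler(client sarama.Client, topic string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		unhealthy := false

		offsetMap.Range(func(key, value any) bool {
			partition := key.(int32)
			committedOffset := value.(int64)

			// Latest offset available on the broker for this partition.
			currentOffset, err := client.GetOffset(topic, partition, sarama.OffsetNewest)
			if err != nil {
				unhealthy = true
				return false
			}

			caughtUp := currentOffset == committedOffset || currentOffset == committedOffset+1
			if !caughtUp {
				// New data exists; if our offset hasn't moved since the last
				// liveness check, the consumer is stuck.
				if prev, ok := lastCommittedOffsets.Load(partition); ok && prev.(int64) == committedOffset {
					unhealthy = true
				}
			}

			// Remember where we were for the next liveness check.
			lastCommittedOffsets.Store(partition, committedOffset)
			return true
		})

		if unhealthy {
			http.Error(w, "consumer not progressing", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}
```

The +1 tolerance mirrors the original check: depending on whether you record the last processed offset or the next offset to consume, a fully caught-up consumer can legitimately sit one behind the broker's newest offset.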

📦 3. Readiness Check

The /readiness probe ensures:

  • At least one Kafka broker is reachable
  • The Kafka topic exists
  • Postgres is reachable using .PingContext()

Only if all of these are true does the service report itself as “ready.”
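
A rough sketch of such a handler, assuming the same sarama.Client plus a standard *sql.DB pool named db (names and timeout are illustrative; the broker check here only inspects the client's cached metadata, and you could additionally attempt a broker connection):

```go
func readinessHandler(client sarama.Client, db *sql.DB, topic string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
		defer cancel()

		// At least one Kafka broker is known to the client.
		if len(client.Brokers()) == 0 {
			http.Error(w, "no Kafka brokers available", http.StatusServiceUnavailable)
			return
		}

		// The topic exists and has partitions.
		if partitions, err := client.Partitions(topic); err != nil || len(partitions) == 0 {
			http.Error(w, "Kafka topic not available", http.StatusServiceUnavailable)
			return
		}

		// Postgres answers a ping within the timeout.
		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "Postgres not reachable", http.StatusServiceUnavailable)
			return
		}

		w.WriteHeader(http.StatusOK)
	}
}
```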


🔁 4. Handling Rebalances

During rebalancing, partitions are reassigned, and we clean up the stale state:

```go
offsetMap.Delete(claim.Partition())
lastCommittedOffsets.Delete(claim.Partition())
```

This prevents invalid health status when the partition is no longer assigned to the consumer.
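
The snippet above runs per claim; another option is to clear everything in the handler's Cleanup hook, which Sarama calls when a consumer group session ends, before partitions are redistributed. A sketch using the handler type from the first snippet:

```go
// Cleanup runs at the end of a consumer group session, so the tracking maps
// stop reporting health for partitions this instance may no longer own.
func (h *consumerHandler) Cleanup(session sarama.ConsumerGroupSession) error {
	for _, partitions := range session.Claims() {
		for _, partition := range partitions {
			offsetMap.Delete(partition)
			lastCommittedOffsets.Delete(partition)
		}
	}
	return nil
}
```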


💡 Why This Is "Intelligent"

Unlike typical health checks, our probes:

  • Track progress, not just availability
  • Are partition-aware
  • Can detect silent failures
  • Check both Kafka and DB dependencies
  • React to consumer rebalances

This allowed us to catch real issues — like stuck consumers or dead DB connections — before they caused data loss or delays.


📈 Lessons Learned

  • Don’t rely on basic probes for stateful services like Kafka consumers
  • Offset movement is a powerful signal for consumer health
  • Observability is about understanding behavior, not just uptime

✅ Conclusion

This approach has saved us from multiple production incidents by proactively identifying when the consumer isn’t doing useful work. We now treat our consumers like stateful workers — with health checks that are aware of what they’re supposed to be doing, not just whether they’re online.


🙌 Let’s Connect

Thanks for reading! If you enjoyed this post or have thoughts to share, feel free to reach out — I’d love to chat about Kafka, distributed systems, or anything DevOps.

📬 Connect with me on LinkedIn
