In production environments, health checks are your early warning system — but not all checks are created equal. While Kubernetes’ liveness and readiness probes can tell you if a service is running, they can’t tell you if it's working. We hit this problem head-on while building a Kafka-based CVE processing pipeline, and it led us to create intelligent probes that know whether the consumer is truly healthy.
🚨 The Problem with Basic Probes
Our Kafka consumer processes CVE data, stores it in Postgres, and runs in Kubernetes. Initially, we used standard HTTP liveness/readiness endpoints that simply returned `200 OK` if the process was up.
But we started running into invisible failures:
- The consumer would be running but stuck due to a downstream DB issue.
- It would be assigned a partition but not progressing.
- Or it would silently lag behind, and no alerts were triggered.
These kinds of failures wouldn’t be caught until days later, sometimes after critical delays in CVE ingestion.
🧭 What We Needed
We wanted probes that could:
- Detect if the consumer is stuck or not making progress
- Track offset movement per partition
- Check if the Kafka topic and brokers are reachable
- Verify Postgres availability
- Handle partition rebalances without stale state
🧪 How We Solved It
We used Go, the Sarama Kafka client, and some smart in-memory tracking. Here’s how it works:
🧮 1. Offset Tracking
We used `sync.Map` to store:
- `offsetMap`: the most recent offset we’ve seen per partition
- `lastCommittedOffsets`: used to detect whether the consumer has stopped progressing
offsetMap.Store(message.Partition, message.Offset)
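That single `Store` call runs for every message we consume. Below is a minimal sketch of where it sits in Sarama's consumer-group handler; the `consumerHandler` type and the Postgres step are illustrative, and the import path depends on your Sarama version (`github.com/Shopify/sarama` on older releases):

```go
package main

import (
	"sync"

	"github.com/IBM/sarama"
)

// Per-partition state shared with the probe handlers.
var (
	offsetMap            sync.Map // partition (int32) -> last offset seen (int64)
	lastCommittedOffsets sync.Map // partition (int32) -> offset at the previous liveness check (int64)
)

type consumerHandler struct{}

func (h *consumerHandler) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (h *consumerHandler) Cleanup(sarama.ConsumerGroupSession) error { return nil }

// ConsumeClaim records the offset of every message before handling it,
// so the liveness probe can later check whether consumption is still advancing.
func (h *consumerHandler) ConsumeClaim(session sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	for message := range claim.Messages() {
		offsetMap.Store(message.Partition, message.Offset)

		// ... parse the CVE record and write it to Postgres here ...

		session.MarkMessage(message, "")
	}
	return nil
}
```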
🔍 2. Intelligent Liveness Check
The `/liveness` endpoint does more than return `200 OK`. It:
- Connects to Kafka
- Fetches the latest offset for each partition
- Compares it with our last seen offset
- Treats “no new data” as healthy, but flags the consumer as stuck when new messages exist and the offset hasn’t changed since the last check
if currentOffset == committedOffset || currentOffset == committedOffset+1 {
    // no new messages to process
} else if lastCommitted == committedOffset {
    // stuck: offset hasn't advanced
    unhealthy = true
}
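Putting it together, the probe handler looks roughly like the sketch below. It extends the tracking sketch above (add `net/http` to its imports) and reuses `offsetMap` and `lastCommittedOffsets`; the `livenessHandler` name, the `sarama.Client` plumbing, and the HTTP wiring are illustrative rather than our exact code.

```go
// livenessHandler flags the consumer as unhealthy when a partition has
// messages waiting but our tracked offset has not moved since the last probe.
func livenessHandler(client sarama.Client, topic string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		partitions, err := client.Partitions(topic)
		if err != nil {
			http.Error(w, "cannot list partitions: "+err.Error(), http.StatusServiceUnavailable)
			return
		}

		unhealthy := false
		for _, p := range partitions {
			// Latest offset on the broker (the next offset to be written).
			currentOffset, err := client.GetOffset(topic, p, sarama.OffsetNewest)
			if err != nil {
				http.Error(w, "cannot fetch broker offset: "+err.Error(), http.StatusServiceUnavailable)
				return
			}

			seen, ok := offsetMap.Load(p)
			if !ok {
				continue // partition not currently assigned to this consumer
			}
			committedOffset := seen.(int64)
			prev, hasPrev := lastCommittedOffsets.Load(p)

			if currentOffset == committedOffset || currentOffset == committedOffset+1 {
				// no new messages to process; being idle is healthy
			} else if hasPrev && prev.(int64) == committedOffset {
				// new messages exist but the offset hasn't advanced since the last probe
				unhealthy = true
			}

			// remember this probe's offset for the next comparison
			lastCommittedOffsets.Store(p, committedOffset)
		}

		if unhealthy {
			http.Error(w, "consumer is not making progress", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}
```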
📦 3. Readiness Check
The `/readiness` probe ensures:
- At least one Kafka broker is reachable
- The Kafka topic exists
- Postgres is reachable, checked with `PingContext()`
Only if all of these are true does the service report itself as “ready.”
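A hedged sketch of what that handler can look like, again with illustrative names and in the same file as the sketches above (add `context`, `database/sql`, and `time` to the imports). The broker and topic checks go through the same `sarama.Client`, and Postgres is checked with `database/sql`’s `PingContext`:

```go
// readinessHandler reports ready only when a Kafka broker is connected,
// the topic exists in cluster metadata, and Postgres answers a ping.
func readinessHandler(client sarama.Client, db *sql.DB, topic string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		// At least one broker must have a live connection.
		brokerUp := false
		for _, b := range client.Brokers() {
			if connected, _ := b.Connected(); connected {
				brokerUp = true
				break
			}
		}
		if !brokerUp {
			http.Error(w, "no Kafka broker reachable", http.StatusServiceUnavailable)
			return
		}

		// The topic we consume from must exist.
		topics, err := client.Topics()
		if err != nil {
			http.Error(w, "cannot list topics: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		topicFound := false
		for _, t := range topics {
			if t == topic {
				topicFound = true
				break
			}
		}
		if !topicFound {
			http.Error(w, "topic not found", http.StatusServiceUnavailable)
			return
		}

		// Postgres must respond within the timeout.
		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "postgres unreachable: "+err.Error(), http.StatusServiceUnavailable)
			return
		}

		w.WriteHeader(http.StatusOK)
	}
}
```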
🔁 4. Handling Rebalances
During rebalancing, partitions are reassigned, and we clean up the stale state:
offsetMap.Delete(claim.Partition())
lastCommittedOffsets.Delete(claim.Partition())
This prevents invalid health status when the partition is no longer assigned to the consumer.
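One way those deletes can be wired in, as an assumption about placement rather than a prescription: Sarama closes a claim’s message channel when a rebalance revokes the partition, so a deferred cleanup at the end of `ConsumeClaim` (extending the handler sketched earlier) runs exactly when we lose the partition.

```go
// ConsumeClaim drops this partition's tracking state once the claim ends,
// i.e. whenever a rebalance takes the partition away from this consumer.
func (h *consumerHandler) ConsumeClaim(session sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	defer func() {
		offsetMap.Delete(claim.Partition())
		lastCommittedOffsets.Delete(claim.Partition())
	}()

	for message := range claim.Messages() {
		offsetMap.Store(message.Partition, message.Offset)
		// ... process the CVE record and write it to Postgres ...
		session.MarkMessage(message, "")
	}
	return nil
}
```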
💡 Why This is "Intelligent"
Unlike typical health checks, our probes:
- Track progress, not just availability
- Are partition-aware
- Can detect silent failures
- Check both Kafka and DB dependencies
- React to consumer rebalances
This allowed us to catch real issues — like stuck consumers or dead DB connections — before they caused data loss or delays.
📈 Lessons Learned
- Don’t rely on basic probes for stateful services like Kafka consumers
- Offset movement is a powerful signal for consumer health
- Observability is about understanding behavior, not just uptime
✅ Conclusion
This approach has saved us from multiple production incidents by proactively identifying when the consumer isn’t doing useful work. We now treat our consumers like stateful workers — with health checks that are aware of what they’re supposed to be doing, not just whether they’re online.
🙌 Let’s Connect
Thanks for reading! If you enjoyed this post or have thoughts to share, feel free to reach out — I’d love to chat about Kafka, distributed systems, or anything DevOps.