When a service handles a high volume of requests, health probes are not just a small deployment detail anymore. They become part of the reliability strategy.
In Kubernetes, liveness and readiness probes help the platform decide two different things:
- Is this container still alive and capable of making forward progress?
- Is this pod ready to receive production traffic right now?
This distinction matters. Even one minute of unavailability in a busy service can mean thousands of dropped requests. In loss-sensitive systems like audit trails, logging systems, and financial services, dropped requests are not just temporary failures. They can become permanent data loss.
What we will cover
In this article, we will look at:
- what `livez` and `readyz` actually mean
- why both probes are needed and why one cannot replace the other
- how dependencies like Kafka, Redis, and OpenSearch affect probe design
- how to think about readiness failures vs liveness failures
- how probe frequency, timeout, and thresholds affect recovery behavior
- common mistakes that make Kubernetes restart pods too aggressively
livez vs readyz
The two probes solve different problems.
Liveness
Liveness answers a simple question: should Kubernetes restart this container?
If the process is stuck, deadlocked, or unable to make progress, the liveness probe should fail so Kubernetes can recreate the pod.
Readiness
Readiness answers another question: should this pod receive traffic?
If the process is alive but temporarily unable to serve requests safely, the readiness probe should fail so Kubernetes removes the pod from the set of ready endpoints without killing it.
This separation is very important in distributed systems. Not every dependency issue should cause a restart. Some failures are transient, and restarting the service can actually make things worse by creating churn, reconnection storms, or repeated cold starts.
One more important point: for automated health decisions, the HTTP status code is what matters. Human-readable details are useful for debugging, but machines should rely on the status code instead of parsing response text.
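To make this concrete, here is a minimal sketch of the two endpoints using only Python's standard library. The `STATE` flags and handler are illustrative stand-ins for real health logic, not a production implementation; the point is that the machine-facing signal is the status code, while the body stays human-readable.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
import urllib.request
import urllib.error

# Hypothetical in-memory flags; a real service would check actual state.
STATE = {"alive": True, "ready": True}

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/livez":
            ok = STATE["alive"]
        elif self.path == "/readyz":
            ok = STATE["ready"]
        else:
            self.send_error(404)
            return
        # Kubernetes only looks at the status code; the body is for humans.
        code = 200 if ok else 503
        body = b"ok" if ok else b"unavailable"
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging in this sketch
```

A kubelet probing these paths never parses "ok" or "unavailable"; it acts purely on 200 vs 503.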
Why Probe Design Matters in Real Services
Let's take an example of a production service that depends on external infrastructure such as Kafka, OpenSearch, databases, and Redis. If those dependencies are not reachable, the service may no longer be able to accept, process, store, or retrieve requests correctly.
Some dependencies are critical. Others are optional or only affect performance. That is exactly why probe design has to be deliberate:
- If every transient Redis failure returns `500` from readiness, pods may flap in and out of service too aggressively.
- If a dependency is essential for correctness, masking the failure may keep serving bad or incomplete responses.
- If a dependency is only an optimization, such as Redis caching, keeping the pod ready may be the right decision while the service falls back to a slower path.
- If a dependency failure causes the process to become unresponsive or completely fail, liveness should fail so Kubernetes can restart it.
There is no single rule that fits every dependency. Health checks should reflect what role that dependency plays in the request path.
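One way to encode "the role the dependency plays" is to classify dependencies before aggregating check results. The sketch below assumes Kafka and OpenSearch are critical and Redis is an optional cache, matching the examples above; the sets and the `compute_readiness` function are illustrative, not a real API.

```python
# Dependencies required for correctness: failing any of these should
# take the pod out of rotation.
CRITICAL = {"kafka", "opensearch"}
# Dependencies that only affect performance: the service can degrade.
OPTIONAL = {"redis"}

def compute_readiness(check_results: dict) -> tuple:
    """check_results maps dependency name -> bool (reachable or not).
    Returns (http_status, detail) for the readyz endpoint."""
    failed_critical = sorted(d for d in CRITICAL if not check_results.get(d, False))
    failed_optional = sorted(d for d in OPTIONAL if not check_results.get(d, False))
    if failed_critical:
        # Critical dependency down: stop traffic, but do not restart.
        return 503, "not ready: " + ", ".join(failed_critical)
    if failed_optional:
        # Degraded but still correct: keep serving on the slower path.
        return 200, "degraded: " + ", ".join(failed_optional)
    return 200, "ready"
```

Note that an optional-dependency failure still returns 200: the pod stays ready, and the detail string records the degradation for humans.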
A Practical Implementation Pattern
One practical pattern is to keep liveness and readiness intentionally different.
livez pattern
A good livez endpoint focuses on forward progress. For read-heavy services, that may mean checking the critical read path. For write-heavy services, it may mean checking process responsiveness and whether repeated Kafka publish failures have pushed the service into an unhealthy state.
readyz pattern
readyz should answer one question: Can this pod safely receive traffic right now? If a critical dependency, such as Kafka, is unavailable or not yet ready, readiness should fail so Kubernetes stops routing traffic to the pod without restarting it.
Example Kubernetes Probe Configuration
Here is a typical probe configuration:
```yaml
livenessProbe:
  httpGet:
    path: /livez
    port: app-port
  initialDelaySeconds: 15
  failureThreshold: 3
  periodSeconds: 60
  timeoutSeconds: 30
readinessProbe:
  httpGet:
    path: /readyz
    port: app-port
  initialDelaySeconds: 5
  periodSeconds: 30
  timeoutSeconds: 30
```
These settings control when probing starts, how often it runs, how long Kubernetes waits, and how many failures it tolerates before acting. In many cases, readiness is checked more frequently than liveness so traffic can be drained quickly without restarting pods too aggressively.
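It helps to work out what such settings imply in the worst case. A rough estimate, assuming failures are consecutive and each failed attempt can take the full timeout: the last failure is observed about `(failureThreshold - 1) * periodSeconds + timeoutSeconds` after the first failing attempt starts. (For a readiness probe with no explicit `failureThreshold`, Kubernetes defaults it to 3.)

```python
def worst_case_seconds(period: int, timeout: int, failure_threshold: int) -> int:
    # Attempts are spaced `period` seconds apart and each can hang for
    # up to `timeout` seconds before counting as a failure, so the
    # threshold-th consecutive failure is observed roughly this long
    # after the first failing attempt begins.
    return (failure_threshold - 1) * period + timeout

# Using the example configuration values (failureThreshold defaults to 3):
liveness_restart = worst_case_seconds(period=60, timeout=30, failure_threshold=3)
readiness_drain = worst_case_seconds(period=30, timeout=30, failure_threshold=3)
print(liveness_restart, readiness_drain)  # 150 90
```

So with these numbers, traffic is drained in about 90 seconds, while a restart takes up to about 150 seconds, which matches the intent of reacting to readiness faster than to liveness.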
What Can Go Wrong in Distributed Systems
Failures are expected in distributed systems. They are not edge cases. They are normal operating conditions.
A dependency can become unavailable for many reasons:
- network partitions between services
- DNS lookup failures
- broker node failures in Kafka
- leader elections or replica rebalancing
- intermittent packet loss or latency spikes
- TLS handshake or certificate issues
- connection pool exhaustion
- CPU starvation or event loop stalls
- storage latency spikes in downstream systems
- OpenSearch cluster pressure or shard unavailability
- authentication token expiry or authorization failures
- service mesh or load balancer routing problems
So the real job of probe design is to distinguish between a pod that is temporarily degraded and a pod that genuinely needs to be removed or restarted.
Intermittent Failures Need Intelligent Handling
For example, suppose Redis disconnects briefly:
- If Redis is only an optimization and requests can still be served correctly, failing readiness may be unnecessary.
- If Redis is required to prevent overload or enforce correctness, readiness may need to fail.
- Liveness should almost never fail for a short Redis outage unless the service is truly wedged and cannot recover without a restart.
Many systems get probe behavior wrong by treating every dependency failure as a reason to restart. A better pattern is this:
- use liveness for stuck or unrecoverable local process failure
- use readiness for temporary inability to serve traffic safely
- use internal retry logic, circuit breakers, and connection recovery for transient downstream issues
Threshold-based liveness logic follows this principle by reacting to repeated failure instead of a single failed operation.
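A minimal sketch of that threshold principle: health only degrades after repeated consecutive failures, and a single success fully resets the count. The class name and threshold value are illustrative.

```python
class FailureThreshold:
    """Tracks consecutive failures of some operation (e.g. Kafka
    publishes) and reports unhealthy only past a threshold."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, success: bool) -> None:
        if success:
            self.consecutive_failures = 0  # one success resets the streak
        else:
            self.consecutive_failures += 1

    def healthy(self) -> bool:
        return self.consecutive_failures < self.threshold
```

The write path would call `record()` after each attempt, and livez would return 503 once `healthy()` is False, so a single dropped connection never triggers a restart but a sustained failure streak does.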
Tuning Probe Frequency and Thresholds
If probes run too frequently:
- they add unnecessary load to the service
- they can add load to downstream systems
- they may amplify transient failures into noisy operational events
If probes run too infrequently:
- unhealthy pods remain in rotation longer
- recovery is delayed
- downtime or bad responses last longer than necessary
There is no universal number that works for every service. The right settings depend on:
- how quickly the service must stop receiving traffic after a failure
- how expensive the probe logic is
- whether downstream systems can tolerate frequent health checks
- whether restarts are cheap or expensive
- whether the service is stateless, stateful, or loss-sensitive
For a loss-sensitive service, probe tuning should be conservative enough to avoid false positives, but still responsive enough to stop sending traffic to a pod that cannot safely process requests.
Practical Guidelines
When designing livez and readyz endpoints for a production service, these rules help:
- Keep `livez` focused on process health and forward progress.
- Use `readyz` to reflect whether the pod can safely serve traffic now.
- Do not make liveness fail for every transient downstream issue.
- Avoid expensive checks inside high-frequency probes.
- Use thresholds and recovery logic for intermittent failures.
- Treat critical dependencies differently from optional or degradable ones.
- Tune probe intervals and thresholds based on service criticality, not guesswork.
Closing Thoughts
Liveness and readiness probes look simple, but they strongly influence availability, recovery behavior, and data safety in Kubernetes.
In simple words:
- `livez` checks whether the process is still functioning in its assigned role
- `readyz` decides whether the pod should continue receiving traffic
That separation is what makes Kubernetes probes useful. They should not just report status. They should reflect operational intent.
If you design them carefully, they help Kubernetes make the right decision during failures. If you design them poorly, they can turn small dependency blips into wider outages.