When a service handles a high volume of requests, health probes are not just a small deployment detail anymore. They become part of the reliability strategy.
In Kubernetes, liveness and readiness probes help the platform decide two different things:
- Is this container still alive and capable of making forward progress?
- Is this pod ready to receive production traffic right now?
This distinction matters. Even one minute of unavailability in a busy service can mean thousands of dropped requests. In loss-sensitive systems like audit trails, logging systems, and financial services, dropped requests are not just temporary failures. They can become permanent data loss.
What we will cover
In this article, we will look at:
- what `livez` and `readyz` actually mean
- why both probes are needed and why one cannot replace the other
- how dependencies like Kafka, Redis, and OpenSearch affect probe design
- how to think about readiness failures vs liveness failures
- how probe frequency, timeout, and thresholds affect recovery behavior
- common mistakes that make Kubernetes restart pods too aggressively
livez vs readyz
The two probes solve different problems.
Liveness
Liveness answers a simple question: should Kubernetes restart this container?
If the process is stuck, deadlocked, or unable to make progress, the liveness probe should fail so Kubernetes can recreate the pod.
Readiness
Readiness answers another question: should this pod receive traffic?
If the process is alive but temporarily unable to serve requests safely, the readiness probe should fail so Kubernetes removes the pod from the set of ready endpoints without killing it.
This separation is very important in distributed systems. Not every dependency issue should cause a restart. Some failures are transient, and restarting the service can actually make things worse by creating churn, reconnection storms, or repeated cold starts.
One more important point: for automated health decisions, the HTTP status code is what matters. Human-readable details are useful for debugging, but machines should rely on the status code instead of parsing response text.
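To make this concrete, here is a minimal sketch of the two endpoints using only Python's standard library. The `STATE` flags and handler are illustrative stand-ins for real health logic, not a production implementation; the point is that the machine-facing signal is the status code, while the body stays human-readable.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
import urllib.request
import urllib.error

# Hypothetical in-memory flags; a real service would check actual state.
STATE = {"alive": True, "ready": True}

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/livez":
            ok = STATE["alive"]
        elif self.path == "/readyz":
            ok = STATE["ready"]
        else:
            self.send_error(404)
            return
        # Kubernetes only looks at the status code; the body is for humans.
        code = 200 if ok else 503
        body = b"ok" if ok else b"unavailable"
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging in this sketch
```

A kubelet probing these paths never parses "ok" or "unavailable"; it acts purely on 200 vs 503.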
Why Probe Design Matters in Real Services
Let's take an example of a production service that depends on external infrastructure such as Kafka, OpenSearch, databases, and Redis. If those dependencies are not reachable, the service may no longer be able to accept, process, store, or retrieve requests correctly.
Some dependencies are critical. Others are optional or only affect performance. That is exactly why probe design has to be deliberate:
- If every transient Redis failure returns `500` from readiness, pods may flap in and out of service too aggressively.
- If a dependency is essential for correctness, masking the failure may keep serving bad or incomplete responses.
- If a dependency is only an optimization, such as Redis caching, keeping the pod ready may be the right decision while the service falls back to a slower path.
- If a dependency failure causes the process to become unresponsive or completely fail, liveness should fail so Kubernetes can restart it.
There is no single rule that fits every dependency. Health checks should reflect what role that dependency plays in the request path.
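One way to encode "the role the dependency plays" is to classify dependencies before aggregating check results. The sketch below assumes Kafka and OpenSearch are critical and Redis is an optional cache, matching the examples above; the sets and the `compute_readiness` function are illustrative, not a real API.

```python
# Dependencies required for correctness: failing any of these should
# take the pod out of rotation.
CRITICAL = {"kafka", "opensearch"}
# Dependencies that only affect performance: the service can degrade.
OPTIONAL = {"redis"}

def compute_readiness(check_results: dict) -> tuple:
    """check_results maps dependency name -> bool (reachable or not).
    Returns (http_status, detail) for the readyz endpoint."""
    failed_critical = sorted(d for d in CRITICAL if not check_results.get(d, False))
    failed_optional = sorted(d for d in OPTIONAL if not check_results.get(d, False))
    if failed_critical:
        # Critical dependency down: stop traffic, but do not restart.
        return 503, "not ready: " + ", ".join(failed_critical)
    if failed_optional:
        # Degraded but still correct: keep serving on the slower path.
        return 200, "degraded: " + ", ".join(failed_optional)
    return 200, "ready"
```

Note that an optional-dependency failure still returns 200: the pod stays ready, and the detail string records the degradation for humans.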
A Practical Implementation Pattern
One practical pattern is to keep liveness and readiness intentionally different.
livez pattern
A good livez endpoint focuses on forward progress. For read-heavy services, that may mean checking the critical read path. For write-heavy services, it may mean checking process responsiveness and whether repeated Kafka publish failures have pushed the service into an unhealthy state.
readyz pattern
readyz should answer one question: Can this pod safely receive traffic right now? If a critical dependency, such as Kafka, is unavailable or not yet ready, readiness should fail so Kubernetes stops routing traffic to the pod without restarting it.
Example Kubernetes Probe Configuration
Here is a typical probe configuration:
```yaml
livenessProbe:
  httpGet:
    path: /livez
    port: app-port
  initialDelaySeconds: 15
  failureThreshold: 3
  periodSeconds: 60
  timeoutSeconds: 30
readinessProbe:
  httpGet:
    path: /readyz
    port: app-port
  initialDelaySeconds: 5
  periodSeconds: 30
  timeoutSeconds: 30
```
These settings control when probing starts, how often it runs, how long Kubernetes waits, and how many failures it tolerates before acting. In many cases, readiness is checked more frequently than liveness so traffic can be drained quickly without restarting pods too aggressively.
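It helps to work out what such settings imply in the worst case. A rough estimate, assuming failures are consecutive and each failed attempt can take the full timeout: the last failure is observed about `(failureThreshold - 1) * periodSeconds + timeoutSeconds` after the first failing attempt starts. (For a readiness probe with no explicit `failureThreshold`, Kubernetes defaults it to 3.)

```python
def worst_case_seconds(period: int, timeout: int, failure_threshold: int) -> int:
    # Attempts are spaced `period` seconds apart and each can hang for
    # up to `timeout` seconds before counting as a failure, so the
    # threshold-th consecutive failure is observed roughly this long
    # after the first failing attempt begins.
    return (failure_threshold - 1) * period + timeout

# Using the example configuration values (failureThreshold defaults to 3):
liveness_restart = worst_case_seconds(period=60, timeout=30, failure_threshold=3)
readiness_drain = worst_case_seconds(period=30, timeout=30, failure_threshold=3)
print(liveness_restart, readiness_drain)  # 150 90
```

So with these numbers, traffic is drained in about 90 seconds, while a restart takes up to about 150 seconds, which matches the intent of reacting to readiness faster than to liveness.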
What Can Go Wrong in Distributed Systems
Failures are expected in distributed systems. They are not edge cases. They are normal operating conditions.
A dependency can become unavailable for many reasons:
- network partitions between services
- DNS lookup failures
- broker node failures in Kafka
- leader elections or replica rebalancing
- intermittent packet loss or latency spikes
- TLS handshake or certificate issues
- connection pool exhaustion
- CPU starvation or event loop stalls
- storage latency spikes in downstream systems
- OpenSearch cluster pressure or shard unavailability
- authentication token expiry or authorization failures
- service mesh or load balancer routing problems
So the real job of probe design is to distinguish between a pod that is temporarily degraded and a pod that genuinely needs to be removed or restarted.
Intermittent Failures Need Intelligent Handling
For example, suppose Redis disconnects briefly:
- If Redis is only an optimization and requests can still be served correctly, failing readiness may be unnecessary.
- If Redis is required to prevent overload or enforce correctness, readiness may need to fail.
- Liveness should almost never fail for a short Redis outage unless the service is truly wedged and cannot recover without a restart.
Many systems get probe behavior wrong by treating every dependency failure as a reason to restart. A better pattern is this:
- use liveness for stuck or unrecoverable local process failure
- use readiness for temporary inability to serve traffic safely
- use internal retry logic, circuit breakers, and connection recovery for transient downstream issues
Threshold-based liveness logic follows this principle by reacting to repeated failure instead of a single failed operation.
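A minimal sketch of that threshold principle: health only degrades after repeated consecutive failures, and a single success fully resets the count. The class name and threshold value are illustrative.

```python
class FailureThreshold:
    """Tracks consecutive failures of some operation (e.g. Kafka
    publishes) and reports unhealthy only past a threshold."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, success: bool) -> None:
        if success:
            self.consecutive_failures = 0  # one success resets the streak
        else:
            self.consecutive_failures += 1

    def healthy(self) -> bool:
        return self.consecutive_failures < self.threshold
```

The write path would call `record()` after each attempt, and livez would return 503 once `healthy()` is False, so a single dropped connection never triggers a restart but a sustained failure streak does.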
Tuning Probe Frequency and Thresholds
If probes run too frequently:
- they add unnecessary load to the service
- they can add load to downstream systems
- they may amplify transient failures into noisy operational events
If probes run too infrequently:
- unhealthy pods remain in rotation longer
- recovery is delayed
- downtime or bad responses last longer than necessary
There is no universal number that works for every service. The right settings depend on:
- how quickly the service must stop receiving traffic after a failure
- how expensive the probe logic is
- whether downstream systems can tolerate frequent health checks
- whether restarts are cheap or expensive
- whether the service is stateless, stateful, or loss-sensitive
For a loss-sensitive service, probe tuning should be conservative enough to avoid false positives, but still responsive enough to stop sending traffic to a pod that cannot safely process requests.
Practical Guidelines
When designing livez and readyz endpoints for a production service, these rules help:
- Keep `livez` focused on process health and forward progress.
- Use `readyz` to reflect whether the pod can safely serve traffic now.
- Do not make liveness fail for every transient downstream issue.
- Avoid expensive checks inside high-frequency probes.
- Use thresholds and recovery logic for intermittent failures.
- Treat critical dependencies differently from optional or degradable ones.
- Tune probe intervals and thresholds based on service criticality, not guesswork.
Closing Thoughts
Liveness and readiness probes look simple, but they strongly influence availability, recovery behavior, and data safety in Kubernetes.
In simple words:
- `livez` checks whether the process is still functioning in its assigned role
- `readyz` decides whether the pod should continue receiving traffic
That separation is what makes Kubernetes probes useful. They should not just report status. They should reflect operational intent.
If you design them carefully, they help Kubernetes make the right decision during failures. If you design them poorly, they can turn small dependency blips into wider outages.