A crash loop isn't always a bug; sometimes it's the orchestrator restarting a service before it's actually capable of serving traffic.
What We're Building
We are defining three distinct health check strategies for a Go-based microservice running in a container orchestration environment. The scope covers implementing HTTP endpoints and external checks that signal the scheduler when the application is initializing, when it is safe to receive traffic, and when it requires a restart. We will define configuration structures that separate initialization logic from traffic readiness logic, so the system can recover from deadlocks while keeping dropped requests to a minimum.
Step 1 — Define the Probe Lifecycle
The core concept is treating health checks as state transitions rather than a single boolean. We must separate the duration required to become ready from the duration required to recover from a failure.
type ProbeConfig struct {
	Startup   Probe `json:"startup"`
	Readiness Probe `json:"readiness"`
	Liveness  Probe `json:"liveness"`
}

type Probe struct {
	Enabled          bool   `json:"enabled"`
	Path             string `json:"path"`
	FailureThreshold int    `json:"failureThreshold"`
	PeriodSeconds    int    `json:"periodSeconds"`
}
This abstraction ensures you never mix startup timing with traffic readiness, allowing the scheduler to apply different backoff strategies for initialization versus recovery.
Step 2 — Implement Startup Probes for Cold Starts
The startup probe tells the orchestrator to ignore health check failures during initialization. Without this, a pod might be marked unhealthy while running database migrations or warming large in-memory caches.
var serverReady atomic.Bool // set via sync/atomic once initialization completes

func initHandler(w http.ResponseWriter, r *http.Request) {
	// Report 503 until initialization is truly complete; the startup
	// probe tolerates these failures up to its configured threshold.
	if !serverReady.Load() {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}
This prevents Kubernetes from killing the pod during heavy initialization, allowing the process to finish loading resources before the system assumes failure.
Step 3 — Implement Readiness Probes for Traffic Gating
The readiness probe signals when the application can safely handle external requests. It should verify external dependencies, like database connections, before exposing the service to the load balancer.
[Startup]  --fails past threshold-->  (Restart)
    |
    v  passes
[Readiness]  --fails-->  (Not Ready: removed from load balancer, pod keeps running)
    |
    v  passes
[Traffic]  <-->  [Liveness check]  --fails past threshold-->  (Restart)
Traffic is only routed when the application successfully connects to its data layer, preventing database connection timeouts for users.
Step 4 — Implement Liveness Probes for Recovery
The liveness probe detects if the application is stuck in a bad state, such as an infinite loop or resource exhaustion. If this check fails, the orchestrator restarts the process.
func liveHandler(w http.ResponseWriter, r *http.Request) {
	// Check for stuck goroutines or runaway memory before reporting healthy
	if !gcSafe() {
		// Attempt an in-process recovery first
		forceGC()
		// Still unhealthy: fail the probe so the orchestrator restarts us
		if !gcSafe() {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
	}
	w.WriteHeader(http.StatusOK)
}
A restart is the last resort when the application thinks it is alive but is actually stuck: the orchestrator kills the process and starts a fresh one, clearing the deadlock. Bear in mind that in-flight requests on the dying pod are lost, which is why liveness thresholds should be conservative and readiness should fail first, letting the load balancer drain traffic before the kill.
Step 5 — Configure Failure Thresholds
Configuring thresholds is critical for avoiding noise. A single momentary database timeout should not trigger a restart if the application can recover within the periodSeconds window.
config := ProbeConfig{
	Startup: Probe{
		Path:             "/startup",
		PeriodSeconds:    10,
		FailureThreshold: 30, // Allow ~5 mins for heavy init
	},
	Readiness: Probe{
		Path:             "/ready",
		PeriodSeconds:    10,
		FailureThreshold: 3, // Max ~30s before traffic stops
	},
	Liveness: Probe{
		Path:             "/live",
		PeriodSeconds:    10,
		FailureThreshold: 3, // Restart quickly if stuck
	},
}
Aggressive thresholds cause unnecessary restarts that degrade user experience, while overly lenient ones let a genuinely stuck service keep failing for minutes before the orchestrator intervenes.
Step 6 — Validate End-to-End Flows
Verification must be automated to ensure the probes work in production, not just development environments. You should mock the readiness dependencies during unit tests to ensure the probe returns 503 correctly when dependencies are simulated as down.
func TestProbeEndpoints(t *testing.T) {
	server := httptest.NewServer(http.HandlerFunc(liveHandler))
	defer server.Close()
	// A healthy process should return 200; failure paths return 503
	resp, err := http.Get(server.URL + "/live")
	if err != nil {
		t.Fatalf("probe request failed: %v", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		t.Fatalf("expected 200, got %d", resp.StatusCode)
	}
}
Automated verification ensures the probes reflect reality in CI pipelines, catching misconfigurations before they reach production.
Key Takeaways
- State Machine: Treat health checks as distinct signals for initialization, traffic readiness, and system recovery rather than a single status.
- Traffic Gating: Always implement readiness probes to ensure no requests are sent to a service that cannot serve data.
- Recovery: Use liveness probes specifically to detect and recover from application hangs or deadlocks automatically.
- Initialization: Configure startup probes with higher failure thresholds to prevent premature kills during heavy bootstrapping.
- Dependencies: Health checks must verify external resources like databases or caches, not just internal HTTP listeners.
What's Next?
- Integrate these probe endpoints into your existing monitoring stack to expose metrics on success rates and latency.
- Review your orchestration provider's documentation to understand default backoff strategies and how they interact with custom probes.
- Consider implementing readiness endpoints that check specific queue depths or thread pool utilization for high-concurrency services.
- Audit your application logs to ensure failure messages from probes align with the root cause of restarts.
Further Reading
- Designing Data-Intensive Applications (Kleppmann) — Relevant for understanding the reliability requirements of data stores that health probes often interact with.
- A Philosophy of Software Design (Ousterhout) — Useful for understanding why keeping probe logic separate from core business logic reduces complexity.
Part of the Architecture Patterns series.