"Deploying a service that crashes during its warm-up window is a guaranteed outage until your liveness probes detect and restart it."
What We're Building
We are implementing a Go backend service that adheres to strict observability standards within a Kubernetes environment. The goal is to distinguish three states: a process that is still starting up, one that is running but cannot yet accept traffic, and one that has deadlocked and must be restarted. We will use three distinct HTTP endpoints to signal these states to the orchestrator. This architecture prevents traffic from hitting a pod that cannot serve requests, while allowing the orchestrator to restart a frozen process without dropping in-flight connections.
Startup Probe:   [--- Initializing ---] -> Checks enabled
Readiness Probe: [--- Not Ready ---] <-> [--- Serving ---]
Liveness Probe:  [--- Alive ---] -> [--- Deadlocked: Restart ---]
Step 1 — Startup Probe Implementation
The startup probe tells the orchestrator to hold off the liveness and readiness checks until the application has initialized core components like the signal handler and logging. We give it a generous failure budget (period times failure threshold) to match the heavy initialization cost.
In Go, we simulate a cold start with a boolean flag, ready, initialized to false, that flips only after the heavy work completes.
var ready = false // guard with sync/atomic if checked concurrently

func main() {
	// Simulate heavy initialization. The startup probe simply fails
	// (connection refused) until the server below starts listening.
	time.Sleep(10 * time.Second)
	ready = true

	http.HandleFunc("/health/ready", readinessHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
This matters because, while the startup probe is still failing, the orchestrator suspends the liveness check, so a slow boot is never mistaken for a dead process.
Step 2 — Readiness Probe Implementation
The readiness probe validates whether the application can actually handle real traffic. This check must verify database connections and other external dependencies before returning a 200 status.
func readinessHandler(w http.ResponseWriter, r *http.Request) {
	// dbIsConnected is maintained by the dependency-validation logic in Step 4.
	if !dbIsConnected {
		http.Error(w, "Dependencies not ready", http.StatusServiceUnavailable)
		return
	}
	if ready {
		w.WriteHeader(http.StatusOK)
	} else {
		w.WriteHeader(http.StatusServiceUnavailable)
	}
}
This separation ensures that the liveness loop does not kill a process that is merely waiting for a connection pool to fill.
Step 3 — Liveness Probe Implementation
The liveness probe detects if the process is stuck in a bad state, such as a deadlock or excessive garbage collection pauses. If this check fails, the orchestrator must restart the container.
We keep this handler extremely lightweight to avoid increasing latency during the check window.
func livenessHandler(w http.ResponseWriter, r *http.Request) {
	// If the process deadlocks, this handler never gets scheduled,
	// the probe times out, and the orchestrator restarts the container.
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("alive"))
}
A heavy check here would defeat the purpose: querying the database from a liveness handler can turn a transient dependency outage into a restart storm. This endpoint monitors process survival and nothing else.
Step 4 — Dependency Validation Logic
Readiness probes often fail when external services become unreachable. We implement a retry loop in the startup logic to prevent rapid restart loops.
func connectToDB() error {
	// Retry with exponential backoff: 1s, 2s, 4s, 8s, 16s.
	for i := 0; i < 5; i++ {
		if _, err := db.New(); err == nil {
			dbIsConnected = true // flips the readiness gate in Step 2
			return nil
		}
		time.Sleep(time.Duration(1<<i) * time.Second)
	}
	return errors.New("db unreachable")
}
This prevents a pod from entering a crash loop if the database is temporarily overloaded or undergoing migration.
Step 5 — Configuration and Timing Tuning
Finally, we expose these checks via configuration. We define the intervals and thresholds in the deployment manifest rather than hardcoding them.
spec:
  containers:
  - name: api
    startupProbe:
      httpGet:
        path: /health/startup
        port: 8080
      periodSeconds: 5
      failureThreshold: 12  # allows up to 60s of initialization
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 3
This approach allows DevOps to tune stability parameters without recompiling the application binary.
Key Takeaways
- Startup Probes should delay traffic routing until the process has finished heavy initialization tasks.
- Readiness Probes must verify external connectivity before accepting live user requests.
- Liveness Probes are strictly for detecting process deadlocks and triggering restarts.
- Graceful Shutdowns must drain in-flight connections before terminating, typically by failing the readiness probe first so traffic is routed away.
- Failure Thresholds define how many consecutive failures trigger an action without false positives.
What's Next
You can now integrate these patterns into your CI/CD pipelines. Consider adding custom metrics to track probe latency over time.
Further Reading
- Designing Data-Intensive Applications (Kleppmann) — covers reliability engineering and the failure detection patterns that underpin health check design.
- A Philosophy of Software Design (Ousterhout) — invaluable guide for keeping probe logic simple and avoiding the complexity that leads to false positives.
Part of the Architecture Patterns series.