A crash loop isn't always a bug; sometimes it's the orchestrator restarting a service before it's actually capable of serving traffic.
What We're Building
We are defining three distinct health check strategies for a Go-based microservice running in a container orchestration environment. The scope covers implementing HTTP endpoints and external checks that signal the scheduler when the application is initializing, when it is safe to receive traffic, and when it requires a restart. We will define configuration structures that separate initialization logic from traffic readiness logic, so the system can recover from deadlocks while keeping dropped requests to a minimum.
Step 1 — Define the Probe Lifecycle
The core concept is treating health checks as state transitions rather than a single boolean. We must separate the duration required to become ready from the duration required to recover from a failure.
type ProbeConfig struct {
	Startup   Probe `json:"startup"`
	Readiness Probe `json:"readiness"`
	Liveness  Probe `json:"liveness"`
}

type Probe struct {
	Enabled          bool   `json:"enabled"`
	Path             string `json:"path"`
	FailureThreshold int    `json:"failureThreshold"`
	PeriodSeconds    int    `json:"periodSeconds"`
}
This abstraction ensures you never mix startup timing with traffic readiness, allowing the scheduler to apply different backoff strategies for initialization versus recovery.
Step 2 — Implement Startup Probes for Cold Starts
The startup probe tells the orchestrator to ignore health check failures during initialization. Without this, a pod might be marked unhealthy while running database migrations or warming large in-memory caches.
var serverReady atomic.Bool // set via sync/atomic once initialization completes

func initHandler(w http.ResponseWriter, r *http.Request) {
	// Report 503 until initialization is truly complete; the startup
	// probe tolerates these failures up to its configured threshold.
	if !serverReady.Load() {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}
This prevents Kubernetes from killing the pod during heavy initialization, allowing the process to finish loading resources before the system assumes failure.
Step 3 — Implement Readiness Probes for Traffic Gating
The readiness probe signals when the application can safely handle external requests. It should verify external dependencies, like database connections, before exposing the service to the load balancer.
[Startup]  --fails past threshold-->  (Restart)
    |
    v  passes
[Readiness]  --fails-->  (Not Ready: removed from load balancer, pod keeps running)
    |
    v  passes
[Traffic]  <-->  [Liveness check]  --fails past threshold-->  (Restart)
Traffic is only routed when the application successfully connects to its data layer, preventing database connection timeouts for users.
Step 4 — Implement Liveness Probes for Recovery
The liveness probe detects if the application is stuck in a bad state, such as an infinite loop or resource exhaustion. If this check fails, the orchestrator restarts the process.
func liveHandler(w http.ResponseWriter, r *http.Request) {
	// Check for stuck goroutines or runaway memory before reporting healthy
	if !gcSafe() {
		// Attempt an in-process recovery first
		forceGC()
		// Still unhealthy: fail the probe so the orchestrator restarts us
		if !gcSafe() {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
	}
	w.WriteHeader(http.StatusOK)
}
A restart is the last resort when the application thinks it is alive but is actually stuck: the orchestrator kills the process and starts a fresh one, clearing the deadlock. Bear in mind that in-flight requests on the dying pod are lost, which is why liveness thresholds should be conservative and readiness should fail first, letting the load balancer drain traffic before the kill.
Step 5 — Configure Failure Thresholds
Configuring thresholds is critical for avoiding noise. A single momentary database timeout should not trigger a restart if the application can recover within the periodSeconds window.
config := ProbeConfig{
	Startup: Probe{
		Path:             "/startup",
		PeriodSeconds:    10,
		FailureThreshold: 30, // Allow ~5 mins for heavy init
	},
	Readiness: Probe{
		Path:             "/ready",
		PeriodSeconds:    10,
		FailureThreshold: 3, // Max ~30s before traffic stops
	},
	Liveness: Probe{
		Path:             "/live",
		PeriodSeconds:    10,
		FailureThreshold: 3, // Restart quickly if stuck
	},
}
Aggressive thresholds cause unnecessary restarts that degrade user experience, while overly lenient ones let a genuinely stuck service keep failing for minutes before the orchestrator intervenes.
Step 6 — Validate End-to-End Flows
Verification must be automated to ensure the probes work in production, not just development environments. You should mock the readiness dependencies during unit tests to ensure the probe returns 503 correctly when dependencies are simulated as down.
func TestProbeEndpoints(t *testing.T) {
	server := httptest.NewServer(http.HandlerFunc(liveHandler))
	defer server.Close()
	// A healthy process should return 200; failure paths return 503
	resp, err := http.Get(server.URL + "/live")
	if err != nil {
		t.Fatalf("probe request failed: %v", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		t.Fatalf("expected 200, got %d", resp.StatusCode)
	}
}
Automated verification ensures the probes reflect reality in CI pipelines, catching misconfigurations before they reach production.
Key Takeaways
- State Machine: Treat health checks as distinct signals for initialization, traffic readiness, and system recovery rather than a single status.
- Traffic Gating: Always implement readiness probes to ensure no requests are sent to a service that cannot serve data.
- Recovery: Use liveness probes specifically to detect and recover from application hangs or deadlocks automatically.
- Initialization: Configure startup probes with higher failure thresholds to prevent premature kills during heavy bootstrapping.
- Dependencies: Health checks must verify external resources like databases or caches, not just internal HTTP listeners.
What's Next?
- Integrate these probe endpoints into your existing monitoring stack to expose metrics on success rates and latency.
- Review your orchestration provider's documentation to understand default backoff strategies and how they interact with custom probes.
- Consider implementing readiness endpoints that check specific queue depths or thread pool utilization for high-concurrency services.
- Audit your application logs to ensure failure messages from probes align with the root cause of restarts.
Further Reading
- Designing Data-Intensive Applications (Kleppmann) — Relevant for understanding the reliability requirements of data stores that health probes often interact with.
- A Philosophy of Software Design (Ousterhout) — Useful for understanding why keeping probe logic separate from core business logic reduces complexity.
Part of the Architecture Patterns series.