- Book: System Design Pocket Guide: Fundamentals — Core Building Blocks for Scalable Systems
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
It's 3am. Postgres stalls for 35 seconds. The pager goes off. By the time someone is on the call, half your fleet is in CrashLoopBackOff and the other half is restarting in a wave. The DB recovered on its own. Your service didn't.
The bug is one line of YAML. Someone copy-pasted the readiness probe block into the liveness slot a year ago. Nobody caught it because nothing went wrong until the database burped.
This is the most common Kubernetes config mistake in the wild, and it's a symptom of a deeper confusion: there isn't one kind of health check. There are four. Each answers a different question. Pick the wrong one and you trade a transient dependency hiccup for a cascading restart storm.
Four shapes, four jobs
| Shape | Question it answers | Action on failure | Frequency | Scope |
|---|---|---|---|---|
| Liveness | Is the process stuck? | Kill and restart the container | 10–30s | In-process only |
| Readiness | Should I get traffic now? | Remove from load balancer | 2–10s | Process + dependencies |
| Startup | Am I still booting? | Suppress liveness during boot | 5–10s, generous failure threshold | In-process only |
| Gossip | Do my peers know I'm alive? | Mark node dead in cluster | 250ms–2s | Cluster membership |
The thing to notice: only readiness should ever check dependencies. Liveness and startup are about this process. Gossip is about the cluster's view of this node. Mixing them is what makes 3am outages so reliably awful.
Liveness: "restart me if I'm stuck"
Liveness answers one question: is this process still doing something useful, or is it wedged? Deadlocked goroutine, infinite loop, OOM but somehow still responding to TCP. That's what liveness catches.
It must be cheap. It must not touch your database. It must not touch Redis, Kafka, the auth service, or any sidecar that could become unhealthy on its own schedule. If your liveness probe depends on Postgres, your service's life depends on Postgres being up. That's not what you want.
A minimal liveness handler in Go:
// pkg/health/live.go
package health

import "net/http"

// Liveness reports only that the process can still serve HTTP.
func Liveness(w http.ResponseWriter, r *http.Request) {
	// no DB, no cache, no downstream. just "i'm here".
	w.WriteHeader(http.StatusOK)
	_, _ = w.Write([]byte("ok"))
}
In Node:
// src/health/live.js
// assumes app is your existing Express application
app.get('/livez', (req, res) => {
  // intentionally trivial. if this times out or 500s, the process is wedged.
  res.status(200).send('ok');
});
You can get fancier by exposing a watchdog that the main loop tickles, so a deadlock in your event processor actually fails liveness:
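// pkg/health/watchdog.go
package health

import (
	"net/http"
	"sync/atomic"
	"time"
)

// Watchdog records the last time the main loop made progress.
type Watchdog struct {
	last atomic.Int64 // UnixNano of the most recent Tick
}

// Tick marks progress. Call it once at startup too: the zero value reads as stale.
func (w *Watchdog) Tick() {
	w.last.Store(time.Now().UnixNano())
}

// Liveness fails once the loop has gone 30 seconds without a Tick.
func (w *Watchdog) Liveness(rw http.ResponseWriter, r *http.Request) {
	age := time.Since(time.Unix(0, w.last.Load()))
	if age > 30*time.Second {
		rw.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	rw.WriteHeader(http.StatusOK)
}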
The main loop calls wd.Tick() after each iteration. If it stops ticking for 30 seconds, the probe starts failing, and after failureThreshold consecutive failures the kubelet restarts the container. That's the right shape: in-process signal, no external dependency.
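Here is a rough sketch of how the loop and the watchdog fit together. It repeats the Watchdog type so it runs on its own, and it adds two things that are not in the original: an injectable clock and a maxAge field, both there only so the example can simulate a wedged loop without sleeping. The runLoop and statusAfterIdle helpers are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// Watchdog is the same idea as above. The injectable clock and the maxAge
// field are additions for this sketch so it can run without sleeping.
type Watchdog struct {
	last   atomic.Int64
	maxAge time.Duration
	now    func() time.Time
}

func NewWatchdog(maxAge time.Duration, now func() time.Time) *Watchdog {
	w := &Watchdog{maxAge: maxAge, now: now}
	w.Tick() // start fresh so the first probe passes before the loop has run
	return w
}

func (w *Watchdog) Tick() { w.last.Store(w.now().UnixNano()) }

func (w *Watchdog) Liveness(rw http.ResponseWriter, r *http.Request) {
	age := w.now().Sub(time.Unix(0, w.last.Load()))
	if age > w.maxAge {
		rw.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	rw.WriteHeader(http.StatusOK)
}

// runLoop stands in for the real event loop: do one unit of work, then tick.
func runLoop(w *Watchdog, jobs <-chan func()) {
	for job := range jobs {
		job()
		w.Tick()
	}
}

// statusAfterIdle processes one job, lets idleSeconds pass on the fake clock
// with no further ticks, then probes /livez the way the kubelet would.
func statusAfterIdle(idleSeconds int) int {
	clock := time.Unix(1_700_000_000, 0)
	wd := NewWatchdog(30*time.Second, func() time.Time { return clock })

	jobs := make(chan func(), 1)
	jobs <- func() {}
	close(jobs)
	runLoop(wd, jobs)

	clock = clock.Add(time.Duration(idleSeconds) * time.Second)

	rec := httptest.NewRecorder()
	wd.Liveness(rec, httptest.NewRequest(http.MethodGet, "/livez", nil))
	return rec.Code
}

func main() {
	fmt.Println("loop ticking:", statusAfterIdle(5))
	fmt.Println("loop wedged: ", statusAfterIdle(31))
}
```

The handler never looks outside the process. The only way it fails is the loop going quiet, which is exactly the condition a restart can fix.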
Readiness: "don't send me traffic right now"
Readiness is where the dependency checks live. Can I reach the database? Is my connection pool warm? Has the message broker accepted my consumer registration? If any of that is broken, I don't want traffic. But I also don't want to die.
That distinction is the entire point. A readiness failure pulls you out of the load balancer. A liveness failure kills your process.
// pkg/health/ready.go
package health

import (
	"context"
	"net/http"
	"time"
)

// Readiness pings each dependency under one shared 500ms deadline.
// Handlers is assumed to carry db (*sql.DB) and redis (a go-redis style client).
func (h *Handlers) Readiness(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 500*time.Millisecond)
	defer cancel()
	if err := h.db.PingContext(ctx); err != nil {
		http.Error(w, "db: "+err.Error(), http.StatusServiceUnavailable)
		return
	}
	if err := h.redis.Ping(ctx).Err(); err != nil {
		http.Error(w, "redis: "+err.Error(), http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}
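In Node:
// src/health/ready.js
// assumes the pool and redis clients are configured with short timeouts,
// so a slow dependency can't hang the probe
app.get('/readyz', async (req, res) => {
  try {
    await pool.query('SELECT 1');
    await redis.ping();
    res.status(200).send('ok');
  } catch (err) {
    // log but don't crash. readiness *is allowed* to fail.
    console.warn('readiness check failed:', err.message);
    res.status(503).send(`not ready: ${err.message}`);
  }
});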
Two gotchas worth flagging.
First: bound the dependency checks with a short timeout. If your DB ping has no deadline and the DB is slow rather than dead, every readiness probe consumes a connection for 30 seconds. You'll exhaust the pool with health checks alone.
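Second: cache aggressively. A readiness check that hits five downstream services every 2 seconds across 200 pods is 500 requests per second of synthetic traffic. Cache the result for 1–2 seconds and serve from memory. The cache earns its keep when more than one prober hits the endpoint (kubelet, cloud load balancer health checks, mesh sidecars); a lone kubelet probing every 2 seconds will mostly miss a 1-second cache.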
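// pkg/health/cached.go
package health

import (
	"context"
	"sync"
	"time"
)

// cachedCheck remembers the last result of check for ttl.
// Failures are cached too, so recovery is seen at most ttl late.
type cachedCheck struct {
	mu     sync.Mutex
	last   time.Time
	result error
	ttl    time.Duration
	check  func(context.Context) error
}

// Run returns the cached result while it is fresh. The lock is held during
// the check, so concurrent probes wait for one call instead of each making their own.
func (c *cachedCheck) Run(ctx context.Context) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Since(c.last) < c.ttl {
		return c.result
	}
	c.result = c.check(ctx)
	c.last = time.Now()
	return c.result
}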
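To make the effect concrete, here is a small sketch that fires a burst of probes at a fake, always-failing dependency and counts how many calls actually reach it. The cachedCheck type is repeated so the example runs on its own; probeBurst and the fake check are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

type cachedCheck struct {
	mu     sync.Mutex
	last   time.Time
	result error
	ttl    time.Duration
	check  func(context.Context) error
}

func (c *cachedCheck) Run(ctx context.Context) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Since(c.last) < c.ttl {
		return c.result
	}
	c.result = c.check(ctx)
	c.last = time.Now()
	return c.result
}

// probeBurst fires probes back to back at a failing fake dependency and
// reports how many of them actually reached it.
func probeBurst(probes, ttlMillis int) (calls int, lastErr error) {
	c := &cachedCheck{
		ttl: time.Duration(ttlMillis) * time.Millisecond,
		check: func(context.Context) error {
			calls++
			return errors.New("db: connection refused")
		},
	}
	for i := 0; i < probes; i++ {
		lastErr = c.Run(context.Background())
	}
	return calls, lastErr
}

func main() {
	cached, _ := probeBurst(100, 1000)
	uncached, _ := probeBurst(100, 0)
	fmt.Println("dependency calls with 1s TTL:", cached)
	fmt.Println("dependency calls with no TTL:", uncached)
}
```

With a 1-second TTL, a hundred back-to-back probes cost one dependency call. Every probe still sees the failure, because the cached error is served until the TTL expires.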
Startup: "wait, I'm still booting"
The most commonly missing probe. Without it, a slow-starting service is in a race with its own liveness check. JVM warmup, big in-memory cache hydration, model loading on an inference pod. Anything that takes more than a few seconds to come up risks getting killed mid-boot.
The fix isn't to crank initialDelaySeconds on liveness to 5 minutes. That's a worst-of-both-worlds hack: the delay is a fixed guess, so a boot that runs long still gets killed, and every container start, restarts included, runs without liveness for the full 5 minutes, so a process that wedges early goes undetected the whole time.
Startup probes solve this cleanly. They suppress liveness and readiness while they're running. Once they pass, they disable themselves and the other two take over.
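// pkg/health/startup.go
package health

import "net/http"

// Startup returns 503 until boot has finished.
// bootDone is assumed to be an atomic.Bool field on Handlers.
func (h *Handlers) Startup(w http.ResponseWriter, r *http.Request) {
	if !h.bootDone.Load() {
		http.Error(w, "still booting", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}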
In Node:
// src/health/startup.js
app.get('/startupz', (req, res) => {
  if (!app.locals.bootComplete) {
    return res.status(503).send('warming caches');
  }
  res.status(200).send('ready');
});
Set bootDone to true after migrations have run, caches are populated, and your first dependency ping succeeds. That's the moment you're ready to be evaluated like a normal pod.
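One way to wire that up, sketched under a few assumptions: the boot steps are plain functions that return an error, and the flag flips only when all of them succeed. The Boot method and the three placeholder steps are hypothetical, not part of the handlers shown earlier.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

type Handlers struct {
	bootDone atomic.Bool
}

func (h *Handlers) Startup(w http.ResponseWriter, r *http.Request) {
	if !h.bootDone.Load() {
		http.Error(w, "still booting", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// Boot runs each step in order and flips bootDone only if all of them succeed.
func (h *Handlers) Boot(steps ...func() error) error {
	for _, step := range steps {
		if err := step(); err != nil {
			return err // bootDone stays false; the startup probe keeps failing
		}
	}
	h.bootDone.Store(true)
	return nil
}

func startupStatus(h *Handlers) int {
	rec := httptest.NewRecorder()
	h.Startup(rec, httptest.NewRequest(http.MethodGet, "/startupz", nil))
	return rec.Code
}

// bootAndProbe probes /startupz before and after boot.
// The three steps are placeholders for real work.
func bootAndProbe(pingFails bool) (before, after int) {
	h := &Handlers{}
	before = startupStatus(h)

	runMigrations := func() error { return nil }
	warmCaches := func() error { return nil }
	pingDependencies := func() error {
		if pingFails {
			return errors.New("db unreachable")
		}
		return nil
	}
	_ = h.Boot(runMigrations, warmCaches, pingDependencies)

	after = startupStatus(h)
	return before, after
}

func main() {
	before, after := bootAndProbe(false)
	fmt.Println("clean boot: ", before, after)
	before, after = bootAndProbe(true)
	fmt.Println("failed boot:", before, after)
}
```

If a step fails, the flag stays false and the startup probe keeps returning 503 until its failure budget runs out. A real service would likely retry the failing step rather than give up on the first error.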
Gossip heartbeats: "peers, I'm here"
The other heartbeat shape lives outside Kubernetes entirely. Cassandra nodes gossip every second to maintain ring membership. Consul members ping each other via the SWIM protocol. Serf and HashiCorp's gossip stack are the same idea. Akka Cluster does its own variant.
These run on timescales of a second or less. Cassandra's default phi_convict_threshold of 8 translates to roughly 8–10 seconds before a node is marked DOWN. Consul's LAN gossip defaults to a 200ms gossip interval and probes a peer every second with a 500ms timeout, falling back to indirect probes through other members before marking the peer suspect.
The gotcha that bites everyone: GC pauses eat your gossip budget. A 4-second stop-the-world pause on a JVM Cassandra node makes peers convict it as dead. They redirect reads and writes, and when the node un-pauses it thinks it's been part of the ring the whole time. Hello, split brain. Hello, stale reads. Hello, that ticket nobody wants to triage.
Tuning the failure detector is the cure. If your worst observed GC pause is 3 seconds, your gossip-convict threshold needs to sit comfortably above that. Eight to ten seconds is the common Cassandra setting. For Consul, you raise probe_timeout and probe_interval together in the gossip_lan block. On G1GC, set -XX:MaxGCPauseMillis as well: it's a target the collector aims for, not a guarantee, but it keeps pauses predictable so you aren't relying on heroic timeouts.
The principle: gossip timeout must exceed worst-case GC pause, by a comfortable margin. Otherwise you're using your heartbeat to detect garbage collection, which is not what it's for.
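The arithmetic behind that principle can be sketched with a deliberately simple model: a fixed silence timeout, not Cassandra's phi accrual detector or SWIM's suspicion mechanism. The function names and the 100% margin are assumptions chosen for illustration.

```go
package main

import "fmt"

// longestSilenceMs is the worst gap a peer can observe between two heartbeats
// when the sender beats every intervalMs and then freezes for pauseMs just
// before the next beat was due.
func longestSilenceMs(intervalMs, pauseMs int) int {
	return intervalMs + pauseMs
}

// convicted reports whether a fixed silence timeout would mark the peer dead.
func convicted(intervalMs, pauseMs, timeoutMs int) bool {
	return longestSilenceMs(intervalMs, pauseMs) > timeoutMs
}

// recommendedTimeoutMs pads the worst-case silence by marginPercent.
func recommendedTimeoutMs(intervalMs, worstPauseMs, marginPercent int) int {
	return longestSilenceMs(intervalMs, worstPauseMs) * (100 + marginPercent) / 100
}

func main() {
	// 1s heartbeats, 4s stop-the-world pause.
	fmt.Println("3s timeout convicts: ", convicted(1000, 4000, 3000))
	fmt.Println("10s timeout convicts:", convicted(1000, 4000, 10000))

	// Worst observed pause of 3s, 100% margin on top of the worst-case silence.
	fmt.Println("recommended timeout (ms):", recommendedTimeoutMs(1000, 3000, 100))
}
```

With 1-second heartbeats and a 3-second worst pause, doubling the worst-case silence gives 8 seconds, which lines up with the eight-to-ten-second range mentioned above.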
The 3am outage pattern: readiness as liveness
Here's the original config. It looks fine on a casual read.
# deployment.yaml: the broken version
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
spec:
  replicas: 30
  template:
    spec:
      containers:
        - name: orders
          image: orders-api:v2.14.0
          livenessProbe:
            httpGet:
              path: /readyz # <-- the bug
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 1
            failureThreshold: 2
Both probes hit /readyz. /readyz pings Postgres. Postgres slows down for 35 seconds. A long checkpoint, a noisy neighbour on the cloud volume, anything. Every pod fails three liveness probes in a row, gets killed, and restarts. All 30 of them. At once.
Now the restarting pods open new connections to the DB during boot. The DB, already slow, gets pummeled. Restarts continue. The pods that come back can't pass readiness because the DB is now actually overloaded. The dashboard shows green-then-red-then-green at 5-second intervals. Someone wakes up.
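The timing is worth working through. As a rough model that ignores probe latency and jitter, a probe trips once failureThreshold consecutive checks land inside the outage. The sketch below applies that to the broken config; the type and function names are hypothetical.

```go
package main

import "fmt"

type probe struct {
	periodSeconds    int
	failureThreshold int
}

// Values taken from the broken deployment above.
var (
	brokenLiveness  = probe{periodSeconds: 10, failureThreshold: 3}
	brokenReadiness = probe{periodSeconds: 5, failureThreshold: 2}
)

// canTrip reports whether an outage is long enough that failureThreshold
// consecutive probes could land inside it, given lucky (or unlucky) timing.
func canTrip(p probe, outageSeconds int) bool {
	return outageSeconds > (p.failureThreshold-1)*p.periodSeconds
}

// mustTrip reports whether an outage is long enough that failureThreshold
// consecutive probes always land inside it, whatever the probe phase.
func mustTrip(p probe, outageSeconds int) bool {
	return outageSeconds >= p.failureThreshold*p.periodSeconds
}

func main() {
	fmt.Println("35s outage restarts every pod:", mustTrip(brokenLiveness, 35))
	fmt.Println("35s outage marks every pod unready:", mustTrip(brokenReadiness, 35))
	fmt.Println("4s blip can restart a pod:", canTrip(brokenLiveness, 4))
}
```

Under this model, any outage of 30 seconds or more restarts every pod and anything under 20 seconds restarts none. In between, it depends on where each pod's probe schedule happens to fall.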
The fix is the config below.
# deployment.yaml: fixed
spec:
  containers:
    - name: orders
      image: orders-api:v2.14.0
      startupProbe:
        httpGet:
          path: /startupz
          port: 8080
        periodSeconds: 5
        failureThreshold: 24 # 2 minutes for slow boots
      livenessProbe:
        httpGet:
          path: /livez # in-process only, no DB
          port: 8080
        periodSeconds: 15
        timeoutSeconds: 2
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /readyz # dependency checks live here
          port: 8080
        periodSeconds: 5
        timeoutSeconds: 2
        failureThreshold: 2
        successThreshold: 1
Three changes. Liveness no longer touches the DB. A startup probe gives slow boots breathing room without weakening liveness afterward. Readiness can fail freely without killing pods. They just drop out of the load balancer until the dependency recovers.
Run this with the same Postgres hiccup and the behaviour is completely different. All 30 pods fail readiness for 35 seconds. With no ready endpoints, callers get 503s or connection errors for those seconds (which one depends on your ingress or mesh). Postgres recovers. Readiness passes. Pods rejoin the LB. No restarts, no connection stampede. Nobody pages anybody.
The right endpoints for each probe
Don't share endpoints between probes. Each one needs its own path, and each path does one thing.
- GET /livez: in-process check. Watchdog timestamp, optionally a "can I allocate memory" smoke test. No I/O. Should be measured in microseconds.
- GET /readyz: dependency check. DB ping, Redis ping, downstream synthetic. Bounded timeouts. Cached for 1–2 seconds. Should be under 100ms.
- GET /startupz: boot completion gate. Returns 503 until your initial migrations, cache warmup, and first downstream connection succeed. Then returns 200 forever.
The naming convention with the z suffix (popularised by Kubernetes itself) is a hint to humans that these are machine endpoints, not part of your public API.
A small but real gotcha: don't put /livez behind your auth middleware. The kubelet doesn't carry your JWT. You'll earn the world's most embarrassing CrashLoopBackOff, and spend the incident discovering that your liveness probe has been getting 401s.
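A minimal sketch of keeping the probe paths outside the auth layer, using only net/http. The requireToken middleware is a stand-in that merely checks for an Authorization header, and the /orders route is a hypothetical business endpoint.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// requireToken is a stand-in auth middleware: it only checks that an
// Authorization header is present. Real validation would go here.
func requireToken(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") == "" {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func ok(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) }

// newRouter registers the probe paths on the outer mux, so they never pass
// through the auth middleware that wraps everything else.
func newRouter() http.Handler {
	api := http.NewServeMux()
	api.HandleFunc("/orders", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "[]")
	})

	root := http.NewServeMux()
	root.HandleFunc("/livez", ok)
	root.HandleFunc("/readyz", ok) // real dependency checks would go here
	root.HandleFunc("/startupz", ok)
	root.Handle("/", requireToken(api))
	return root
}

// statusFor sends one GET through the router and returns the status code.
func statusFor(path, authHeader string) int {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	newRouter().ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("/livez without token: ", statusFor("/livez", ""))
	fmt.Println("/orders without token:", statusFor("/orders", ""))
	fmt.Println("/orders with token:   ", statusFor("/orders", "Bearer demo"))
}
```

Registering the probes on the outer mux makes the exemption structural. A path allow-list inside the middleware would also work, but it is easier to break by accident.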
Tuning the numbers: honest defaults
There's no one-size-fits-all, but these defaults survive most contact with reality. Adjust based on your service's actual boot time and the SLOs of your dependencies.
| Setting | Liveness | Readiness | Startup | Notes |
|---|---|---|---|---|
| initialDelaySeconds | 0 | 0 | 0 | Startup probe handles boot, no delay needed |
| periodSeconds | 15 | 5 | 5 | Liveness can be slow, readiness should be tight |
| timeoutSeconds | 2 | 2 | 2 | Hard cap. Slow probe = unhealthy by definition |
| failureThreshold | 3 | 2 | 24 | Startup gets generous threshold for long boots |
| successThreshold | 1 | 1 | 1 | Readiness can be tuned to 2 if you flap |
For gossip-style heartbeats:
| Setting | Cassandra default | Consul LAN default |
|---|---|---|
| Probe interval | ~1s | 1s |
| Convict / suspect threshold | phi 8.0 (≈ 8–10s) | 3s suspect, 30s+ leave |
| Floor (worst-case GC pause) | tune above your worst GC | tune above your worst GC |
The principle stays the same across all of them: timeouts must exceed worst-case stop-the-world pauses with margin. If you don't know your worst-case GC pause, you don't know if your gossip is correctly tuned.
Five-item checklist
Before you ship a service to production, run through this:
1. Three separate endpoints. /livez, /readyz, /startupz. No shared paths, no auth middleware.
2. No dependencies in liveness. If /livez can fail because the DB is slow, you have the 3am bug.
3. Startup probe for anything that takes >5s to boot. Don't bury slow boots under initialDelaySeconds on liveness.
4. Readiness drains during shutdown. Flip the readiness flag to false on SIGTERM, sleep for periodSeconds × failureThreshold, then close listeners. Otherwise the load balancer keeps sending traffic until your last in-flight request errors out.
5. Gossip timeouts above worst-case GC pause. Cassandra, Consul, Serf, Akka Cluster. Measure your real GC behaviour, then tune above it.
Most services miss at least one of these. The one that bites hardest is #2. The one that bites silently for months is #4. Graceful shutdown without a readiness drain means every deploy drops a small number of requests, and the rate is low enough that nobody connects it to the deploy until much later.
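The readiness drain is short enough to sketch. This version assumes a drainer type that owns the readiness flag (a hypothetical name), and it simulates the SIGTERM by writing to the signal channel so the example terminates by itself. A real service would already have srv.ListenAndServe running in a goroutine.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

type drainer struct{ draining atomic.Bool }

// Readiness returns 503 as soon as draining starts, whatever the dependencies say.
func (d *drainer) Readiness(w http.ResponseWriter, r *http.Request) {
	if d.draining.Load() {
		http.Error(w, "draining", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func readyStatus(d *drainer) int {
	rec := httptest.NewRecorder()
	d.Readiness(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	return rec.Code
}

// drainOnSignal waits for a signal, fails readiness, gives the load balancer
// drainFor to notice, then shuts the server down gracefully.
func drainOnSignal(sig <-chan os.Signal, d *drainer, srv *http.Server, drainFor time.Duration) error {
	<-sig
	d.draining.Store(true) // step 1: stop advertising readiness
	time.Sleep(drainFor)   // step 2: in production, periodSeconds × failureThreshold

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return srv.Shutdown(ctx) // step 3: close listeners, wait for in-flight requests
}

// drainDemo reports readiness before and after a simulated SIGTERM.
func drainDemo(drainMillis int) (before, after int, err error) {
	d := &drainer{}
	srv := &http.Server{Handler: http.HandlerFunc(d.Readiness)}
	before = readyStatus(d)

	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM)
	defer signal.Stop(sig)
	sig <- syscall.SIGTERM // demo only: stands in for the SIGTERM sent at pod deletion

	err = drainOnSignal(sig, d, srv, time.Duration(drainMillis)*time.Millisecond)
	after = readyStatus(d)
	return before, after, err
}

func main() {
	before, after, err := drainDemo(20)
	fmt.Println("before SIGTERM:", before)
	fmt.Println("after SIGTERM: ", after)
	fmt.Println("shutdown error:", err)
}
```

The order matters: fail readiness first, wait, then close listeners. Make sure terminationGracePeriodSeconds on the pod is longer than the drain plus the shutdown timeout, or the kubelet will SIGKILL the process mid-drain.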
What's your shape-mixing horror story? Was it readiness in liveness, no startup probe, or gossip eating your GC pauses? Drop it in the comments.
The four shapes aren't optional pieces of a checklist. They're four distinct questions your platform asks your service, and each one has a different correct answer. Get them straight on paper before you write the YAML, and you'll skip a category of outages entirely.
If this was useful
The deeper question under all of this is what's a system actually made of? Process boundaries, dependency edges, control planes, data planes, and the tiny protocols that hold them together. The chapter on health, membership, and failure detection in the System Design Pocket Guide: Fundamentals — Core Building Blocks for Scalable Systems walks through the building blocks above and the classes of failure each one is supposed to catch, so you can recognise the shape of a system before the YAML hits production.
