Your health check endpoint returns 200 OK while your app serves 500s to every real request. Sound familiar?
Most health checks are useless. They confirm the process is running — something your orchestrator already knows. A real health check system needs three distinct probes, each answering a different question.
## Three Probes, Three Questions
- **Liveness** — "Is this process stuck?" If yes, kill it and restart it.
- **Readiness** — "Can this instance handle traffic?" If not, stop sending it requests.
- **Startup** — "Has this instance finished initializing?" If not, hold off on liveness checks.
They are not interchangeable. Mixing them up causes cascading failures.
## What Each Probe Should (and Should NOT) Check
| Probe | Should Check | Should NOT Check |
|---|---|---|
| Liveness | Event loop responsive, no deadlock | Database connectivity, downstream services |
| Readiness | DB connected, migrations done, cache warm | External third-party APIs |
| Startup | Config loaded, DB pool created, initial cache fill | Anything that should be checked continuously |
The cardinal rule: liveness probes must never check external dependencies. If your database goes down and your liveness probe fails, Kubernetes restarts your pod. The new pod also can't reach the database. It gets restarted too. Now you have a restart loop on top of a database outage.
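The restart loop is easy to see in miniature. A hedged sketch — the helper names (`badLiveness`, `goodLiveness`) are mine, not from any library:

```typescript
// Why liveness must never depend on the database, in ~15 lines.
type Probe = () => Promise<number>; // returns the HTTP status the probe would send

// Anti-pattern: liveness fails whenever the DB is unreachable,
// so a DB outage triggers a restart loop across every pod.
function badLiveness(dbUp: () => Promise<boolean>): Probe {
  return async () => ((await dbUp()) ? 200 : 503);
}

// Correct: liveness reflects only the process itself.
function goodLiveness(): Probe {
  return async () => 200;
}

// Simulate a database outage:
const dbDown = async () => false;

badLiveness(dbDown)().then((code) => console.log('bad liveness:', code)); // 503 → pod restarted for a DB problem
goodLiveness()().then((code) => console.log('good liveness:', code)); // 200 → pod survives the outage
```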
## Implementation
```typescript
// health.ts
import { FastifyInstance } from 'fastify';
import { Pool } from 'pg';
import { Redis } from 'ioredis';

interface HealthDeps {
  db: Pool;
  redis: Redis;
  startedAt: number;
}

interface CheckResult {
  status: 'healthy' | 'degraded' | 'unhealthy';
  latencyMs?: number;
  message?: string;
}

async function checkDb(db: Pool): Promise<CheckResult> {
  const start = Date.now();
  try {
    await db.query('SELECT 1');
    return { status: 'healthy', latencyMs: Date.now() - start };
  } catch (err) {
    return { status: 'unhealthy', message: (err as Error).message };
  }
}

async function checkRedis(redis: Redis): Promise<CheckResult> {
  const start = Date.now();
  try {
    await redis.ping();
    return { status: 'healthy', latencyMs: Date.now() - start };
  } catch (err) {
    return { status: 'unhealthy', message: (err as Error).message };
  }
}

export function registerHealthRoutes(app: FastifyInstance, deps: HealthDeps) {
  // Liveness: is the process alive and not deadlocked?
  app.get('/healthz', async (_req, reply) => {
    reply.code(200).send({ status: 'alive' });
  });

  // Readiness: can we serve traffic?
  app.get('/readyz', async (_req, reply) => {
    const [db, redis] = await Promise.all([
      checkDb(deps.db),
      checkRedis(deps.redis),
    ]);
    const ready = db.status === 'healthy' && redis.status === 'healthy';
    reply.code(ready ? 200 : 503).send({
      status: ready ? 'ready' : 'not_ready',
      checks: { db, redis },
    });
  });

  // Startup: has initialization completed? Once it succeeds, stay green.
  let startupComplete = false;
  app.get('/startupz', async (_req, reply) => {
    if (startupComplete) {
      return reply.code(200).send({ status: 'started' });
    }
    // Check whether all init tasks are done
    const [db, redis] = await Promise.all([
      checkDb(deps.db),
      checkRedis(deps.redis),
    ]);
    if (db.status === 'healthy' && redis.status === 'healthy') {
      startupComplete = true;
      return reply.code(200).send({ status: 'started' });
    }
    reply.code(503).send({
      status: 'starting',
      checks: { db, redis },
      uptimeMs: Date.now() - deps.startedAt,
    });
  });
}
```
Notice: the liveness probe does zero I/O. It confirms the HTTP server can respond. That's it.
## A Better Liveness Probe
The basic version above works, but you can detect event loop stalls:
```typescript
// event-loop-monitor.ts
let lastTick = Date.now();
const MAX_DELAY_MS = 3000;

// Heartbeat: if the event loop is blocked, this stops firing and lastTick goes stale.
setInterval(() => {
  lastTick = Date.now();
}, 1000).unref(); // unref so the interval doesn't keep the process alive on shutdown

export function isEventLoopHealthy(): boolean {
  return Date.now() - lastTick < MAX_DELAY_MS;
}

// In your route:
app.get('/healthz', async (_req, reply) => {
  if (!isEventLoopHealthy()) {
    return reply.code(503).send({ status: 'stuck', detail: 'event loop stalled' });
  }
  reply.code(200).send({ status: 'alive' });
});
```
This catches the real failure mode: CPU-bound work or a synchronous call blocking the loop.
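You can watch the monitor catch a stall in a self-contained demo. The intervals here are shortened (50ms heartbeat, 150ms threshold) so it finishes in under a second; production values would be the 1s/3s from above:

```typescript
// Shortened version of the monitor: heartbeat every 50ms, stall threshold 150ms.
let lastTick = Date.now();
const MAX_DELAY_MS = 150;
const interval = setInterval(() => {
  lastTick = Date.now();
}, 50);

function isEventLoopHealthy(): boolean {
  return Date.now() - lastTick < MAX_DELAY_MS;
}

setTimeout(() => {
  // Busy-wait for 300ms: synchronous work that starves the event loop,
  // so the heartbeat cannot fire and lastTick goes stale.
  const end = Date.now() + 300;
  while (Date.now() < end) { /* blocking */ }
  console.log('healthy after stall?', isEventLoopHealthy()); // false
  clearInterval(interval);
}, 100);
```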
## Graceful Degradation
Not every dependency failure should make your service unready. If Redis is your cache layer and you can fall back to the database, don't pull yourself out of rotation:
```typescript
interface DependencyConfig {
  name: string;
  check: () => Promise<CheckResult>;
  required: boolean; // required = must be healthy for readiness
}

async function evaluateReadiness(deps: DependencyConfig[]) {
  const results = await Promise.allSettled(
    deps.map(async (d) => ({
      name: d.name,
      required: d.required,
      result: await Promise.race([
        d.check(),
        // timeout() rejects (see the Timeouts section below), so convert
        // the rejection into an unhealthy result instead of letting it
        // escape the race. Checks themselves should resolve with an
        // unhealthy result rather than reject.
        timeout(2000).catch((): CheckResult => ({
          status: 'unhealthy',
          message: 'check timed out',
        })),
      ]),
    }))
  );

  const checks: Record<string, CheckResult & { required: boolean }> = {};
  let ready = true;
  for (const r of results) {
    if (r.status === 'fulfilled') {
      checks[r.value.name] = { ...r.value.result, required: r.value.required };
      if (r.value.required && r.value.result.status === 'unhealthy') {
        ready = false;
      }
    }
  }

  const degraded = Object.values(checks).some(
    (c) => !c.required && c.status === 'unhealthy'
  );

  return {
    status: ready ? (degraded ? 'degraded' : 'ready') : 'not_ready',
    checks,
  };
}

// Usage
const dependencies: DependencyConfig[] = [
  { name: 'postgres', check: () => checkDb(db), required: true },
  { name: 'redis', check: () => checkRedis(redis), required: false },
  { name: 'email_api', check: () => checkEmailService(), required: false },
];
```
Three statuses: ready (everything fine), degraded (non-critical deps down, still serving), not_ready (pull from rotation). Your metrics layer should alert on degraded even though traffic still flows.
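The decision logic distills to a few lines. A self-contained sketch (the helper name and shapes are mine, simplified from `evaluateReadiness` above):

```typescript
// Required deps gate readiness; unhealthy optional deps only downgrade to "degraded".
interface DepState {
  name: string;
  required: boolean;
  healthy: boolean;
}

function readinessStatus(deps: DepState[]): 'ready' | 'degraded' | 'not_ready' {
  if (deps.some((d) => d.required && !d.healthy)) return 'not_ready';
  if (deps.some((d) => !d.required && !d.healthy)) return 'degraded';
  return 'ready';
}

console.log(readinessStatus([
  { name: 'postgres', required: true, healthy: true },
  { name: 'redis', required: false, healthy: false }, // cache down
])); // → "degraded": keep serving, fall back to the DB, and alert
```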
## Kubernetes Configuration
```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: api
      livenessProbe:
        httpGet:
          path: /healthz
          port: 3000
        periodSeconds: 10
        failureThreshold: 3   # 30s of failures before restart
        timeoutSeconds: 2
      readinessProbe:
        httpGet:
          path: /readyz
          port: 3000
        periodSeconds: 5
        failureThreshold: 2   # 10s before removing from service
        timeoutSeconds: 3
      startupProbe:
        httpGet:
          path: /startupz
          port: 3000
        periodSeconds: 5
        failureThreshold: 24  # 2 minutes to start up
        timeoutSeconds: 3
```
Key detail: `failureThreshold * periodSeconds` defines your tolerance window. Startup gets a generous window (2 min). Readiness is aggressive (10s) — if you can't serve, stop receiving. Liveness is moderate (30s) — don't restart on a brief hiccup.
The startup probe disables liveness and readiness checks until it succeeds. This is critical for apps with slow init (connection pools, cache warming, migration checks). Without it, the liveness probe kills your pod before it finishes starting.
## Timeouts on Health Checks
Always put a timeout on your dependency checks. A hanging database connection makes your readiness probe hang until the kubelet times out the probe itself, which is slower and less informative than failing fast with a clear message:
```typescript
function timeout(ms: number): Promise<never> {
  return new Promise((_, reject) =>
    setTimeout(() => reject(new Error(`Timeout after ${ms}ms`)), ms)
  );
}

async function checkWithTimeout(
  check: () => Promise<CheckResult>,
  ms: number
): Promise<CheckResult> {
  try {
    return await Promise.race([check(), timeout(ms)]);
  } catch {
    return { status: 'unhealthy', message: `timed out after ${ms}ms` };
  }
}
```
Set check timeouts lower than the probe's `timeoutSeconds`. If your probe times out at 3s, time out your checks at 2s so you return a meaningful error instead of a generic probe failure.
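Here's the helper exercised against a hung dependency; the definitions are repeated so the snippet runs standalone, and the 100ms timeout is just for the demo:

```typescript
interface CheckResult {
  status: 'healthy' | 'unhealthy';
  message?: string;
}

function timeout(ms: number): Promise<never> {
  return new Promise((_, reject) =>
    setTimeout(() => reject(new Error(`Timeout after ${ms}ms`)), ms)
  );
}

async function checkWithTimeout(
  check: () => Promise<CheckResult>,
  ms: number
): Promise<CheckResult> {
  try {
    return await Promise.race([check(), timeout(ms)]);
  } catch {
    return { status: 'unhealthy', message: `timed out after ${ms}ms` };
  }
}

// Stand-in for a hung DB connection: a promise that never settles.
const hangingCheck = () => new Promise<CheckResult>(() => {});

checkWithTimeout(hangingCheck, 100).then((r) => console.log(r));
// → { status: 'unhealthy', message: 'timed out after 100ms' }
```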
## Common Mistakes
1. **Liveness probe checks the database.** Database goes down, all pods restart in a loop, and you have zero capacity when the DB recovers.
2. **No startup probe.** Slow-starting apps get killed by liveness probes during init. You see pods in CrashLoopBackOff and bump the liveness timeout to 60s, which means actually stuck processes take a minute to detect.
3. **Readiness checks external APIs you don't control.** A third-party API blip removes all your pods from service. Only check dependencies you need to serve your core functionality.
4. **Health checks share the main request path.** If your app is overloaded, health checks queue behind real requests and time out. Run probes on a separate port or use a lightweight handler for the health server.
5. **No timeouts on dependency checks.** One hanging connection makes your probe hang until Kubernetes decides your app is dead.
6. **Same endpoint for all three probes.** You lose the ability to express "I'm alive but can't serve traffic" versus "I'm completely stuck." These are fundamentally different states requiring different responses from the orchestrator.
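One way to address mistake 4, sketched with Node's built-in `http` module (the function name is mine): a tiny standalone health server on its own port. It bypasses the main framework's routing and middleware queue, though it still shares the process's event loop, so pair it with an event-loop monitor for full coverage.

```typescript
import http from 'node:http';

// Minimal health server on a dedicated port, separate from the app framework.
export function startHealthServer(
  port: number,
  isAlive: () => boolean
): http.Server {
  const server = http.createServer((req, res) => {
    if (req.url === '/healthz') {
      const ok = isAlive();
      res.writeHead(ok ? 200 : 503, { 'content-type': 'application/json' });
      res.end(JSON.stringify({ status: ok ? 'alive' : 'stuck' }));
      return;
    }
    res.writeHead(404);
    res.end();
  });
  server.listen(port);
  return server;
}
```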
Health checks are the API your application exposes to its infrastructure. Treat them with the same rigor as your public API.
Part of my Production Backend Patterns series. Follow for more practical backend engineering.
If this was useful, consider:
- Sponsoring on GitHub to support more open-source tools
- Buying me a coffee on Ko-fi