DEV Community

Young Gao
Kubernetes Health Probes Done Right: Liveness, Readiness, and Startup (2026 Guide)

Your health check endpoint returns 200 OK while your app serves 500s to every real request. Sound familiar?

Most health checks are useless. They confirm the process is running — something your orchestrator already knows. A real health check system needs three distinct probes, each answering a different question.

Three Probes, Three Questions

Liveness: "Is this process stuck?" — If no, kill it and restart.
Readiness: "Can this instance handle traffic?" — If no, stop sending requests.
Startup: "Has this instance finished initializing?" — If no, wait before checking liveness.

They are not interchangeable. Mixing them up causes cascading failures.

What Each Probe Should (and Should NOT) Check

Probe      | Should check                                        | Should NOT check
Liveness   | Event loop responsive, no deadlock                  | Database connectivity, downstream services
Readiness  | DB connected, migrations done, cache warm           | External third-party APIs
Startup    | Config loaded, DB pool created, initial cache fill  | Anything that should be checked continuously

The cardinal rule: liveness probes must never check external dependencies. If your database goes down and your liveness probe fails, Kubernetes restarts your pod. The new pod also can't reach the database. It gets restarted too. Now you have a restart loop on top of a database outage.

Implementation

// health.ts
import { FastifyInstance } from 'fastify';
import { Pool } from 'pg';
import { Redis } from 'ioredis';

interface HealthDeps {
  db: Pool;
  redis: Redis;
  startedAt: number;
}

interface CheckResult {
  status: 'healthy' | 'degraded' | 'unhealthy';
  latencyMs?: number;
  message?: string;
}

async function checkDb(db: Pool): Promise<CheckResult> {
  const start = Date.now();
  try {
    await db.query('SELECT 1');
    return { status: 'healthy', latencyMs: Date.now() - start };
  } catch (err) {
    return { status: 'unhealthy', message: (err as Error).message };
  }
}

async function checkRedis(redis: Redis): Promise<CheckResult> {
  const start = Date.now();
  try {
    await redis.ping();
    return { status: 'healthy', latencyMs: Date.now() - start };
  } catch (err) {
    return { status: 'unhealthy', message: (err as Error).message };
  }
}

export function registerHealthRoutes(app: FastifyInstance, deps: HealthDeps) {
  // Liveness: is the process alive and not deadlocked?
  app.get('/healthz', async (_req, reply) => {
    reply.code(200).send({ status: 'alive' });
  });

  // Readiness: can we serve traffic?
  app.get('/readyz', async (_req, reply) => {
    const [db, redis] = await Promise.all([
      checkDb(deps.db),
      checkRedis(deps.redis),
    ]);

    const ready = db.status === 'healthy' && redis.status === 'healthy';

    reply.code(ready ? 200 : 503).send({
      status: ready ? 'ready' : 'not_ready',
      checks: { db, redis },
    });
  });

  // Startup: has initialization completed?
  let startupComplete = false;

  app.get('/startupz', async (_req, reply) => {
    if (startupComplete) {
      return reply.code(200).send({ status: 'started' });
    }

    // Check if all init tasks are done
    const [db, redis] = await Promise.all([
      checkDb(deps.db),
      checkRedis(deps.redis),
    ]);

    if (db.status === 'healthy' && redis.status === 'healthy') {
      startupComplete = true;
      return reply.code(200).send({ status: 'started' });
    }

    reply.code(503).send({
      status: 'starting',
      checks: { db, redis },
      uptimeMs: Date.now() - deps.startedAt,
    });
  });
}

Notice: the liveness probe does zero I/O. It confirms the HTTP server can respond. That's it.

A Better Liveness Probe

The basic version above works, but you can go further and detect event loop stalls:

// event-loop-monitor.ts
let lastTick = Date.now();
const MAX_DELAY_MS = 3000;

setInterval(() => {
  lastTick = Date.now();
}, 1000);

export function isEventLoopHealthy(): boolean {
  return Date.now() - lastTick < MAX_DELAY_MS;
}

// In your route:
app.get('/healthz', async (_req, reply) => {
  if (!isEventLoopHealthy()) {
    return reply.code(503).send({ status: 'stuck', detail: 'event loop stalled' });
  }
  reply.code(200).send({ status: 'alive' });
});

This catches the real failure mode: CPU-bound work or a synchronous call blocking the loop.
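To see the monitor trip, here's a standalone sketch (hypothetical demo, not from the service code above) where a synchronous busy-wait plays the role of CPU-bound work:

```typescript
// Standalone demo: a synchronous busy-wait starves the ticker interval,
// so lastTick goes stale and the health check reports a stall.
let lastTick = Date.now();
const MAX_DELAY_MS = 3000;
const ticker = setInterval(() => { lastTick = Date.now(); }, 1000);

function isEventLoopHealthy(): boolean {
  return Date.now() - lastTick < MAX_DELAY_MS;
}

// Simulate CPU-bound work: block the event loop for ~4 seconds.
// The interval callback cannot run while this loop spins.
const blockUntil = Date.now() + 4000;
while (Date.now() < blockUntil) { /* busy-wait */ }

const healthyAfterStall = isEventLoopHealthy();
console.log(healthyAfterStall); // false: last tick is ~4s old, past the 3s budget
clearInterval(ticker);
```

An async `sleep(4000)` would not trip the check, because the ticker keeps running; only work that blocks the loop does.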

Graceful Degradation

Not every dependency failure should make your service unready. If Redis is your cache layer and you can fall back to the database, don't pull yourself out of rotation:

interface DependencyConfig {
  name: string;
  check: () => Promise<CheckResult>;
  required: boolean; // required = must be healthy for readiness
}

async function evaluateReadiness(deps: DependencyConfig[]) {
  const results = await Promise.allSettled(
    deps.map(async (d) => ({
      name: d.name,
      required: d.required,
      // Race the check against a timeout that RESOLVES with an unhealthy
      // result. A rejecting timeout would make the race reject, land the
      // entry in the "rejected" bucket, and silently skip a failed
      // required check in the loop below.
      result: await Promise.race([
        d.check(),
        new Promise<CheckResult>((resolve) =>
          setTimeout(
            () => resolve({ status: 'unhealthy', message: 'check timed out' }),
            2000
          )
        ),
      ]),
    }))
  );

  const checks: Record<string, CheckResult & { required: boolean }> = {};
  let ready = true;

  for (const r of results) {
    if (r.status === 'fulfilled') {
      checks[r.value.name] = { ...r.value.result, required: r.value.required };
      if (r.value.required && r.value.result.status === 'unhealthy') {
        ready = false;
      }
    }
  }

  const degraded = Object.values(checks).some(
    (c) => !c.required && c.status === 'unhealthy'
  );

  return {
    status: ready ? (degraded ? 'degraded' : 'ready') : 'not_ready',
    checks,
  };
}

// Usage
const dependencies: DependencyConfig[] = [
  { name: 'postgres', check: () => checkDb(db), required: true },
  { name: 'redis', check: () => checkRedis(redis), required: false },
  { name: 'email_api', check: () => checkEmailService(), required: false },
];

Three statuses: ready (everything fine), degraded (non-critical deps down, still serving), not_ready (pull from rotation). Your metrics layer should alert on degraded even though traffic still flows.
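To make the three-state behavior concrete, here's a compact, self-contained sketch of the same evaluation logic (the stub checks are hypothetical stand-ins for real Postgres/Redis probes), exercising the degraded path:

```typescript
// Self-contained sketch of three-state readiness evaluation.
interface CheckResult { status: 'healthy' | 'unhealthy'; message?: string }
interface DependencyConfig {
  name: string;
  check: () => Promise<CheckResult>;
  required: boolean;
}

async function evaluateReadiness(deps: DependencyConfig[]) {
  const results = await Promise.all(
    deps.map(async (d) => ({ ...d, result: await d.check() }))
  );
  // Ready unless a *required* dependency is down; degraded if any
  // optional dependency is down while we keep serving.
  const ready = results.every((r) => !r.required || r.result.status === 'healthy');
  const degraded = results.some((r) => !r.required && r.result.status === 'unhealthy');
  return ready ? (degraded ? 'degraded' : 'ready') : 'not_ready';
}

// Required Postgres is up, optional Redis is down: stay in rotation, but flag it.
const deps: DependencyConfig[] = [
  { name: 'postgres', check: async () => ({ status: 'healthy' }), required: true },
  { name: 'redis', check: async () => ({ status: 'unhealthy', message: 'ECONNREFUSED' }), required: false },
];

evaluateReadiness(deps).then((status) => console.log(status)); // "degraded"
```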

Kubernetes Configuration

apiVersion: v1
kind: Pod
spec:
  containers:
    - name: api
      livenessProbe:
        httpGet:
          path: /healthz
          port: 3000
        periodSeconds: 10
        failureThreshold: 3      # 30s of failures before restart
        timeoutSeconds: 2
      readinessProbe:
        httpGet:
          path: /readyz
          port: 3000
        periodSeconds: 5
        failureThreshold: 2      # 10s before removing from service
        timeoutSeconds: 3
      startupProbe:
        httpGet:
          path: /startupz
          port: 3000
        periodSeconds: 5
        failureThreshold: 24     # 2 minutes to start up
        timeoutSeconds: 3

Key detail: failureThreshold * periodSeconds defines your tolerance window. Startup gets a generous window (2 min). Readiness is aggressive (10s) — if you can't serve, stop receiving. Liveness is moderate (30s) — don't restart on a brief hiccup.
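The arithmetic behind those numbers, as a tiny sketch (the helper name is illustrative, not a Kubernetes API):

```typescript
// Worst-case seconds before Kubernetes acts on a failing probe:
// simply failureThreshold * periodSeconds.
function toleranceWindowSeconds(periodSeconds: number, failureThreshold: number): number {
  return periodSeconds * failureThreshold;
}

console.log(toleranceWindowSeconds(10, 3));  // 30  -> liveness: restart after 30s of failures
console.log(toleranceWindowSeconds(5, 2));   // 10  -> readiness: out of rotation after 10s
console.log(toleranceWindowSeconds(5, 24));  // 120 -> startup: up to 2 minutes to initialize
```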

The startup probe disables liveness and readiness checks until it succeeds. This is critical for apps with slow init (connection pools, cache warming, migration checks). Without it, the liveness probe kills your pod before it finishes starting.

Timeouts on Health Checks

Always put a timeout on your dependency checks. A hanging database connection makes your readiness probe hang, the kubelet then times out the probe itself, and you get a slower, less informative failure:

function timeout(ms: number): Promise<never> {
  return new Promise((_, reject) =>
    setTimeout(() => reject(new Error(`Timeout after ${ms}ms`)), ms)
  );
}

async function checkWithTimeout(
  check: () => Promise<CheckResult>,
  ms: number
): Promise<CheckResult> {
  try {
    return await Promise.race([check(), timeout(ms)]);
  } catch {
    return { status: 'unhealthy', message: `timed out after ${ms}ms` };
  }
}

Set check timeouts lower than the probe's timeoutSeconds. If your probe times out at 3s, time out your checks at 2s so you return a meaningful error instead of a generic probe failure.
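Here's the helper in action as a standalone demo (the hanging check is a hypothetical stand-in for a stuck TCP connection):

```typescript
// Demo: a check that never settles is converted into a fast,
// informative "unhealthy" result instead of hanging the probe.
interface CheckResult { status: 'healthy' | 'unhealthy'; message?: string }

function timeout(ms: number): Promise<never> {
  return new Promise((_, reject) =>
    setTimeout(() => reject(new Error(`Timeout after ${ms}ms`)), ms)
  );
}

async function checkWithTimeout(
  check: () => Promise<CheckResult>,
  ms: number
): Promise<CheckResult> {
  try {
    return await Promise.race([check(), timeout(ms)]);
  } catch {
    return { status: 'unhealthy', message: `timed out after ${ms}ms` };
  }
}

// A dependency check that never resolves or rejects.
const hangingCheck = () => new Promise<CheckResult>(() => {});

checkWithTimeout(hangingCheck, 100).then((r) => console.log(r.status)); // "unhealthy"
```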

Common Mistakes

1. Liveness probe checks the database. Database goes down, all pods restart in a loop, now you have zero capacity when the DB recovers.

2. No startup probe. Slow-starting apps get killed by liveness probes during init. You see pods in CrashLoopBackOff and increase the liveness timeout to 60s, which means actual stuck processes take a minute to detect.

3. Readiness checks external APIs you don't control. A third-party API blip removes all your pods from service. Only check dependencies you need to serve your core functionality.

4. Health checks share the main thread pool. If your app is overloaded, health checks queue behind real requests and time out. Run probes on a separate port or use a lightweight framework for the health server.

5. No timeouts on dependency checks. One hanging connection makes your probe hang, then Kubernetes decides your app is dead.

6. Same endpoint for all three probes. You lose the ability to express "I'm alive but can't serve traffic" versus "I'm completely stuck." These are fundamentally different states requiring different responses from the orchestrator.

Health checks are the API your application exposes to its infrastructure. Treat them with the same rigor as your public API.


Part of my Production Backend Patterns series. Follow for more practical backend engineering.

