Iurii Rogulia

Posted on Jun 29 • Originally published at iurii.rogulia.fi

Health Check Endpoint in Node.js: Liveness vs Readiness

#node #typescript #hono #bullmq

Your load balancer is routing traffic to a server whose database connection pool is exhausted. Docker restarted a container that never finished its startup migrations. Kubernetes replaced a healthy pod because the liveness probe hit an endpoint that returned 503 on a transient Redis timeout.

All of these happen because /health returned the wrong thing — or because nobody designed it carefully enough.

A health check endpoint is not a ping. It is the interface between your application and the infrastructure that decides whether your application lives or dies, receives traffic or gets replaced. Getting this interface wrong costs you incidents. Getting it right costs you an afternoon.

Liveness vs Readiness vs Startup

Kubernetes formalised a distinction that applies to any containerised workload:

Liveness probe — "Is this process still running correctly, or has it become a zombie?" If the liveness probe fails, the container is restarted. The question is about process health, not dependency health. If your app is live but Postgres is down, you do not want the container restarted — you want it to stop receiving traffic until Postgres recovers.

Readiness probe — "Is this instance ready to serve requests right now?" If the readiness probe fails, the container is removed from the load balancer rotation. Traffic stops coming to it. The container keeps running. When the probe passes again, traffic resumes. This is the correct mechanism for handling a temporarily unavailable database or Redis.

Startup probe — "Has the application finished initialising?" Some apps take 10–30 seconds on boot — running migrations, warming caches, establishing connection pools. The startup probe gives you time to do this without triggering false liveness failures. Once the startup probe passes, liveness and readiness probes take over.

In Docker Compose or Docker Swarm without Kubernetes, you get a single HEALTHCHECK directive. The semantics are simpler: healthy or unhealthy. Once a container has failed the configured number of consecutive checks, it is marked unhealthy. Swarm then replaces the task on its own; on a plain Docker Compose host the engine only sets the status, and the restart comes from whatever acts on it, such as an autoheal-style watcher. Either way, there is no "remove from load balancer rotation" equivalent: a failed healthcheck means a restart, not a graceful drain.

The consequence: if you check external dependencies (Postgres, Redis) in your Docker healthcheck and those dependencies go down, your container will be restarted, even though a restart cannot fix a database outage. Design accordingly: set retries high (3–5) and interval long (30s+) to tolerate transient failures without triggering unnecessary restarts. Save the aggressive dependency checking for your monitoring system.

The practical mapping:

Probe	What it checks	Failure action
Liveness	Process is alive, not deadlocked	Restart container
Readiness	Dependencies reachable, app ready for load	Remove from load balancer rotation
Startup	App initialisation complete	Delay liveness/readiness probes

What to Check

A useful health endpoint checks the things your app needs to serve requests correctly. For a typical Node.js API backed by Postgres and Redis:

Database connectivity — a lightweight query that exercises the connection pool. Not a SELECT 1 to the database server directly, but through your ORM/connection pool, so you detect pool exhaustion and misconfigured connection strings, not just network reachability.

Redis connectivity — a PING command. If Redis is down and you depend on it for caching or session state, you are degraded. If you depend on it for rate limiting that gates all requests, you may be unhealthy.

Background job queue health — include this in readiness only if serving HTTP traffic directly depends on queue capacity. For example: if your API enqueues jobs and immediately returns a response that assumes the job will be processed, a backed-up or stuck queue is a readiness concern. If jobs run in the background independently of request handling, queue health belongs in /metrics, not /health — a flooded queue should trigger an alert, not take your API offline. In vatnode, queue health lives in metrics: the API can accept Stripe webhooks and return 200 even if workers are temporarily stuck; the jobs will drain once workers recover.

Disk space — optional but useful in containerised environments where logs or temporary files can fill a volume. A simple df check can prevent a class of incidents where the container fills its storage and starts failing writes silently.
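One way to run the disk check without shelling out to df is Node's fs.statfs (available from Node 18.15). The sketch below is an assumption about how such a check could look, not code from a production system: classifyDisk, checkDisk, the /tmp path, and the 10% threshold are hypothetical names and values.

```typescript
import { statfs } from "node:fs/promises";

type DiskStatus = "ok" | "degraded";

interface DiskHealth {
  status: DiskStatus;
  freeRatio: number; // fraction of blocks still available to unprivileged processes
}

// Pure helper: turns raw block counts into a status.
// minFreeRatio is a hypothetical threshold; calibrate it for your volume.
function classifyDisk(availableBlocks: number, totalBlocks: number, minFreeRatio = 0.1): DiskHealth {
  const freeRatio = totalBlocks > 0 ? availableBlocks / totalBlocks : 0;
  return { status: freeRatio >= minFreeRatio ? "ok" : "degraded", freeRatio };
}

// Reads filesystem statistics for the volume that contains `path`.
async function checkDisk(path = "/tmp"): Promise<DiskHealth> {
  const stats = await statfs(path);
  return classifyDisk(stats.bavail, stats.blocks);
}
```

A low-disk result is reported as degraded rather than unhealthy here, on the assumption that the app can still serve reads; adjust that if writes are on the critical path.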

What not to check in a health endpoint:

External third-party APIs (Stripe, Mailgun, VIES). Their transient failures should not take your service offline. Handle them with circuit breakers at the call site.
Business logic assertions. Health is an infrastructure concern, not a data consistency check.
Anything that takes more than a second to complete under normal conditions.

The Response Format

A health endpoint that just returns 200 OK with no body is better than nothing, but barely. A useful response tells you what is healthy, so that when something breaks you know what to investigate.

Here is the structure I use:

// types/health.ts
export type HealthStatus = "ok" | "degraded" | "unhealthy";

export interface ComponentHealth {
  status: HealthStatus;
  latencyMs: number;
  error?: string;
}

export interface HealthResponse {
  status: HealthStatus;
  version: string;
  uptime: number; // seconds
  checks: {
    database: ComponentHealth;
    redis: ComponentHealth;
    queue?: ComponentHealth;
  };
}

The status at the top level is a rollup. If all checks pass, it is ok. If non-critical checks fail (Redis is down but the app can serve cached data from memory), it is degraded. If critical checks fail (database unreachable), it is unhealthy.

The version field is your current deployment version or git SHA. This is extremely useful when debugging — it lets you verify that the instance you are looking at is actually running the code you just deployed.

The uptime field catches restart loops. An instance with 30 seconds of uptime that is supposed to have been running for days has restarted recently.
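The rollup rule can be kept as a small pure function, which makes it easy to unit-test apart from the route. This is a sketch under one assumption that is not part of the types above: each check carries a hypothetical critical flag saying whether the app can serve requests without that dependency.

```typescript
type HealthStatus = "ok" | "degraded" | "unhealthy";

interface CheckResult {
  status: HealthStatus;
  critical: boolean; // hypothetical flag: can the app serve requests without this dependency?
}

// Rolls component results up into one top-level status:
// an unhealthy critical check makes the whole instance "unhealthy";
// any other check that is not "ok" makes it "degraded".
function rollup(checks: Record<string, CheckResult>): HealthStatus {
  const results = Object.values(checks);
  if (results.some((c) => c.critical && c.status === "unhealthy")) return "unhealthy";
  if (results.some((c) => c.status !== "ok")) return "degraded";
  return "ok";
}
```

For example, a reachable database with Redis down rolls up to degraded, while an unreachable database rolls up to unhealthy regardless of the other checks.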

Implementing It in Hono

// routes/health.ts
import { Hono } from "hono";
import { sql } from "drizzle-orm";
import { db } from "@/lib/db";
import { redis } from "@/lib/redis";
import { orderQueue } from "@/lib/queues/order-queue";
import type { HealthResponse, HealthStatus, ComponentHealth } from "@/types/health";

const health = new Hono();

const PROCESS_START = Date.now();

async function checkDatabase(): Promise<ComponentHealth> {
  const start = Date.now();
  try {
    await Promise.race([
      db.execute(sql`SELECT 1`),
      new Promise((_, reject) => setTimeout(() => reject(new Error("timeout")), 2000)),
    ]);
    return { status: "ok", latencyMs: Date.now() - start };
  } catch (err) {
    return {
      status: "unhealthy",
      latencyMs: Date.now() - start,
      error: err instanceof Error ? err.message : "unknown error",
    };
  }
}

async function checkRedis(): Promise<ComponentHealth> {
  const start = Date.now();
  try {
    const response = await Promise.race([
      redis.ping(),
      new Promise<never>((_, reject) => setTimeout(() => reject(new Error("timeout")), 1000)),
    ]);
    if (response !== "PONG") throw new Error(`unexpected response: ${response}`);
    return { status: "ok", latencyMs: Date.now() - start };
  } catch (err) {
    return {
      status: "degraded", // Redis down = degraded, not unhealthy (depends on your app)
      latencyMs: Date.now() - start,
      error: err instanceof Error ? err.message : "unknown error",
    };
  }
}

async function checkQueue(): Promise<ComponentHealth> {
  const start = Date.now();
  try {
    const counts = await Promise.race([
      orderQueue.getJobCounts("waiting", "active", "failed"),
      new Promise<never>((_, reject) => setTimeout(() => reject(new Error("timeout")), 1500)),
    ]);
    // Thresholds are system-specific — calibrate against your normal throughput.
    // A high-volume queue may have 1000 waiting jobs as steady state;
    // a low-volume queue may flag 10 failed jobs as a problem.
    const isHealthy = counts.failed < 50 && counts.waiting < 1000;
    return {
      status: isHealthy ? "ok" : "degraded",
      latencyMs: Date.now() - start,
    };
  } catch (err) {
    return {
      status: "degraded",
      latencyMs: Date.now() - start,
      error: err instanceof Error ? err.message : "unknown error",
    };
  }
}

// This is the readiness-style check — full dependency verification.
// Wire it to /health/ready (Kubernetes) or /health (Caddy/monitoring).
// Docker's container healthcheck should call /health/live instead.
health.get("/health/ready", async (c) => {
  // Each check catches its own errors, so allSettled is only a defensive fallback.
  // The destructured names avoid shadowing the imported `redis` client.
  const [dbSettled, redisSettled, queueSettled] = await Promise.allSettled([
    checkDatabase(),
    checkRedis(),
    checkQueue(),
  ]);

  const db_result =
    dbSettled.status === "fulfilled"
      ? dbSettled.value
      : { status: "unhealthy" as HealthStatus, latencyMs: 0, error: "check threw" };

  const redis_result =
    redisSettled.status === "fulfilled"
      ? redisSettled.value
      : { status: "degraded" as HealthStatus, latencyMs: 0, error: "check threw" };

  const queue_result =
    queueSettled.status === "fulfilled"
      ? queueSettled.value
      : { status: "degraded" as HealthStatus, latencyMs: 0, error: "check threw" };

  // Determine overall status
  let overallStatus: HealthStatus = "ok";
  if (db_result.status === "unhealthy") {
    overallStatus = "unhealthy";
  } else if (redis_result.status !== "ok" || queue_result.status !== "ok") {
    overallStatus = "degraded";
  }

  const body: HealthResponse = {
    status: overallStatus,
    version: process.env.APP_VERSION ?? "unknown",
    uptime: Math.floor((Date.now() - PROCESS_START) / 1000),
    checks: {
      database: db_result,
      redis: redis_result,
      queue: queue_result,
    },
  };

  // 200 for ok and degraded — load balancer should still route traffic
  // 503 for unhealthy — remove from rotation
  const httpStatus = overallStatus === "unhealthy" ? 503 : 200;

  return c.json(body, httpStatus);
});

export { health };

HTTP Status Codes: Why 200 for Degraded

This surprises people: returning 200 for a degraded service is intentional — but it depends on what your load balancer does with the response.

The reasoning: if you return 503 when Redis is temporarily unavailable and your app can still serve most requests from Postgres, you just took yourself out of the load balancer rotation. All your instances are likely degraded at the same time if Redis is down — so you just took down your entire service over a non-fatal condition.

The general rule:

200 — route traffic here (ok or degraded, app is serving requests)
503 — do not route traffic here (unhealthy, app cannot serve requests)

Your monitoring system reads the response body and alerts on degraded status. The load balancer cares about the HTTP status; your alerting cares about the body. Keep these concerns separate.

A few important caveats:

Check what your load balancer actually does. Some older or simpler load balancers (HAProxy with basic config, certain managed cloud LBs) only look at status codes, not the body. The 200-for-degraded pattern assumes your LB and monitoring are separate consumers with separate configurations: the LB acts on the status code, the monitoring system reads the body. If your LB is the only health check consumer, nothing reads the body and a degraded state goes unnoticed, so you may need to return 503 for degraded too.

Consider a separate internal port. A cleaner architecture separates concerns by port: the external port serves user traffic, an internal-only port (e.g., 9090) serves GET /health with full details for the load balancer and monitoring system. This way the detailed health response never touches the public network, and you no longer have to decide per request how much detail a caller is allowed to see. Not always worth the operational overhead, but worth knowing the pattern exists.

The Timeout Circuit Breaker

The single most important constraint on a health endpoint: it must never hang.

If your database connection pool is exhausted, a SELECT 1 query will sit in the queue waiting for a connection to become available. Without a timeout, your health check hangs for 30 seconds, the load balancer times out waiting for a response, and it marks your instance unhealthy — not because it is, but because the health check itself blocked.

Every check in the example above uses Promise.race with a timeout. The thresholds I use:

Database: 2 seconds (a SELECT 1 over a local connection should complete in under 5ms; if it takes 2 seconds something is wrong)
Redis: 1 second (PING should be sub-millisecond)
BullMQ queue counts: 1.5 seconds (reads from Redis, same logic)

One important caveat: Promise.race bounds the health endpoint's response time, but it does not cancel the underlying operation. If db.execute(sql`SELECT 1`) loses the race, it keeps running inside the connection pool; the database driver has no way to know you stopped waiting. This means a pool under pressure can accumulate abandoned queries alongside new ones. Where possible, use driver-level timeouts in addition to Promise.race: most Postgres clients support a query_timeout or statement_timeout setting, and statement_timeout in particular cancels the query at the server level.
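The repeated Promise.race pattern can be folded into one helper that also clears its timer once the check settles. The sketch below is a minimal version under stated assumptions: withTimeout is a hypothetical name, and the AbortSignal it hands to the work function only helps if the client you call accepts one, which you would need to verify for your database and Redis clients.

```typescript
// Rejects if `work` does not settle within `ms`, and clears the timer either way
// so finished checks do not leave pending timers behind.
// The AbortSignal is aborted on timeout; whether the underlying client honours it
// is an assumption to verify per driver.
async function withTimeout<T>(
  work: (signal: AbortSignal) => Promise<T>,
  ms: number,
  label = "check",
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // request cooperative cancellation
      reject(new Error(`${label} timed out after ${ms}ms`));
    }, ms);
  });
  try {
    return await Promise.race([work(controller.signal), timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```

With a helper like this, checkRedis could call withTimeout(() => redis.ping(), 1000, "redis") in place of the inline Promise.race. It still does not cancel a query the driver has already sent, so driver-level timeouts remain the real safeguard.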

The total health check should complete in well under 3 seconds. Configure your Docker or Kubernetes timeout to 5–10 seconds to give it headroom without making your orchestrator wait indefinitely.

Docker and Kubernetes Configuration

In Docker Compose:

services:
  api:
    image: myapp:latest
    healthcheck:
      # Use /health/live, not the full /health — Docker healthcheck triggers restarts,
      # not graceful drains. Checking Postgres/Redis here means a DB outage restarts your container.
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:3001/health/live"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 20s

Using wget --spider (instead of wget -qO-) makes the intent explicit: exit code 0 on 2xx, non-zero on anything else. The -qO- variant prints the body to stdout, which is noise in Docker logs and makes the behavior less obvious.

The start_period gives your app time to finish initialising — running migrations, establishing connection pools — before the health check starts counting failures. Without it, your container can fail health checks during startup and get restarted in a loop before it has had a chance to boot.

In Kubernetes, split liveness and readiness:

# kubernetes/deployment.yaml
spec:
  containers:
    - name: api
      image: myapp:latest
      livenessProbe:
        httpGet:
          path: /health/live
          port: 3001
        initialDelaySeconds: 10
        periodSeconds: 30
        timeoutSeconds: 5
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 3001
        initialDelaySeconds: 5
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 2
      startupProbe:
        httpGet:
          path: /health/live
          port: 3001
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 12 # Allow 60 seconds for startup
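To sanity-check the numbers in a probe configuration like this one, it helps to work out roughly how long Kubernetes waits before acting. The sketch below is back-of-the-envelope arithmetic only: it ignores timeoutSeconds and scheduling jitter, and the function names are hypothetical.

```typescript
interface ProbeTiming {
  initialDelaySeconds?: number;
  periodSeconds: number;
  failureThreshold: number;
}

// Approximate time a sustained failure lasts before Kubernetes acts on it:
// failureThreshold consecutive failures, one per period.
function failureWindowSeconds(probe: ProbeTiming): number {
  return probe.periodSeconds * probe.failureThreshold;
}

// Approximate startup budget: the initial delay plus the failure window.
function startupBudgetSeconds(probe: ProbeTiming): number {
  return (probe.initialDelaySeconds ?? 0) + failureWindowSeconds(probe);
}

const liveness: ProbeTiming = { initialDelaySeconds: 10, periodSeconds: 30, failureThreshold: 3 };
const readiness: ProbeTiming = { initialDelaySeconds: 5, periodSeconds: 10, failureThreshold: 2 };
const startup: ProbeTiming = { initialDelaySeconds: 5, periodSeconds: 5, failureThreshold: 12 };

console.log(failureWindowSeconds(liveness)); // 90: about 90 s of failures before a restart
console.log(failureWindowSeconds(readiness)); // 20: about 20 s before traffic is withdrawn
console.log(startupBudgetSeconds(startup)); // 65: 5 s delay plus the 60 s failure window
```

Read this way, the configuration withdraws traffic quickly (about 20 seconds) but restarts slowly (about 90 seconds), which matches the goal of draining first and restarting only as a last resort.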

With separate probes, implement separate routes:

// /health/live — only checks if the process itself is functional
// No external dependencies; if this fails, restart the container
health.get("/health/live", (c) => {
  return c.json({ status: "ok", uptime: Math.floor((Date.now() - PROCESS_START) / 1000) });
});

// /health/ready — checks if the instance can serve traffic
// Uses the full check above
health.get("/health/ready", async (c) => {
  // ... full dependency checks, returns 503 if unhealthy
});

If you are running Docker Compose without Kubernetes, use a minimal /health/live endpoint for Docker's container healthcheck. Use the full /health or /health/ready endpoint for external monitoring, Caddy upstream checks, or manual diagnostics — not for Docker restarts. The Docker daemon will restart containers that fail their healthcheck after retries consecutive failures, so what Docker calls should be process-level only.

Caching the Result and Managing Load

/health can become your most frequently called endpoint. A load balancer checking every 10 seconds, a monitoring system checking every minute, an autoscaler reading it constantly — all hitting the same Postgres and Redis that serve your actual users.

If each health check fires a SELECT 1 to the database, you are generating steady background load that compounds with your regular query traffic. At low volume this is irrelevant; on a high-traffic system with a loaded database, it matters.

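The fix: cache the health check result for a few seconds. A 5-second TTL caps real dependency checks at 12 per minute per instance, however many consumers poll the endpoint. A single load balancer checking every 10 seconds still triggers a fresh check each time, because the cached result has expired by then; the saving comes when several consumers, or one with a short interval, hit the same instance.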

let cachedResult: { body: HealthResponse; status: number; ts: number } | null = null;
const CACHE_TTL_MS = 5000;

health.get("/health/ready", async (c) => {
  if (cachedResult && Date.now() - cachedResult.ts < CACHE_TTL_MS) {
    return c.json(cachedResult.body, cachedResult.status as 200 | 503);
  }

  // ... run checks ...

  cachedResult = { body, status: httpStatus, ts: Date.now() };
  return c.json(body, httpStatus as 200 | 503);
});

The tradeoff: a failure that occurs between cache refreshes will not be visible for up to 5 seconds. For most systems this is acceptable — load balancers already have failureThreshold and interval buffers built in. If you need sub-second failure detection, you probably need a more sophisticated monitoring pipeline, not a faster health endpoint.

One caveat: cache ok and degraded results freely, but consider a shorter TTL (or no cache) for unhealthy. A single transient Postgres timeout that resolves in 200ms should not cause 5 seconds of 503 responses. A simple approach: set the TTL based on the result status — 5 seconds for ok, 2 seconds for degraded, 0 for unhealthy.
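The status-dependent TTL can be sketched as a small wrapper around whatever function runs the checks. This is an illustration under stated assumptions: cachedByStatus is a hypothetical helper, the injectable clock exists only to make it testable, and concurrent requests that arrive while a refresh is in flight will each run the checks.

```typescript
type HealthStatus = "ok" | "degraded" | "unhealthy";

// TTLs from the text: 5 s for ok, 2 s for degraded, no caching for unhealthy.
const TTL_MS: Record<HealthStatus, number> = { ok: 5000, degraded: 2000, unhealthy: 0 };

interface CacheEntry<T> {
  value: T;
  storedAt: number;
}

// Returns a function that serves the cached result while it is fresh for its status
// and otherwise runs `compute` again. `now` is injectable for testing.
function cachedByStatus<T extends { status: HealthStatus }>(
  compute: () => Promise<T>,
  now: () => number = Date.now,
): () => Promise<T> {
  let entry: CacheEntry<T> | null = null;
  return async () => {
    if (entry && now() - entry.storedAt < TTL_MS[entry.value.status]) {
      return entry.value;
    }
    const value = await compute();
    entry = { value, storedAt: now() };
    return value;
  };
}
```

Because the TTL for unhealthy is 0, the freshness test never passes for an unhealthy entry, so every request re-runs the checks until the dependency recovers.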

A related point: if your monitoring system needs rich queue metrics, failed job counts, and latency histograms — that data belongs in a metrics endpoint (/metrics in Prometheus format, or a dedicated internal route), not in /health. The healthcheck tells the orchestrator whether to route traffic. Metrics tell your team what is happening inside the system. These are different questions with different consumers.

What Not to Expose Publicly

The full JSON response with component statuses and latencies is useful for your monitoring system and your team. It should not be publicly accessible without authentication.

A response body like this:

{
  "checks": {
    "database": { "status": "ok", "latencyMs": 3 },
    "redis": { "status": "degraded", "error": "connect ECONNREFUSED 10.0.0.5:6379" }
  }
}

tells an attacker your internal Redis IP address and that it is currently unreachable. The error strings from failed checks often contain connection strings, hostnames, and infrastructure details.

Options:

Restrict by network — serve the full response only to requests from internal networks or specific IPs. Your load balancer and monitoring system are internal; public-facing traffic is not.
Two endpoints — a public /health that returns only { "status": "ok" } or { "status": "unhealthy" } without details, and an authenticated /health/details that returns the full response.
Strip errors in production: include error fields only when NODE_ENV !== "production", or behind a ?verbose=1 query param gated by an internal header.

On vatnode, there are two health routes in production: /health is publicly reachable (UptimeRobot and Caddy need it) but returns a sanitized body — status rollup and latencies, no error strings or infrastructure details. /health/details is on an internal network interface behind IP allowlisting and returns the full response including error messages from failed checks. UptimeRobot gets enough signal to alert on downtime; the team gets full diagnostics when they need them.
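A sanitized public body of that kind can be derived from the full response with a small mapping function. The sketch below is an assumption about how such a function could look, not the vatnode implementation: toPublicHealth and PublicHealth are hypothetical names, and the local types mirror the ones defined earlier with checks loosened to a record.

```typescript
type HealthStatus = "ok" | "degraded" | "unhealthy";

interface ComponentHealth {
  status: HealthStatus;
  latencyMs: number;
  error?: string;
}

interface FullHealth {
  status: HealthStatus;
  version: string;
  uptime: number;
  checks: Record<string, ComponentHealth>;
}

interface PublicHealth {
  status: HealthStatus;
  checks: Record<string, { status: HealthStatus; latencyMs: number }>;
}

// Builds the public view: keeps the rollup plus status and latency per check,
// and copies nothing else, so error strings (which may contain hostnames or
// connection details) cannot leak through.
function toPublicHealth(full: FullHealth): PublicHealth {
  const checks: PublicHealth["checks"] = {};
  for (const [name, check] of Object.entries(full.checks)) {
    checks[name] = { status: check.status, latencyMs: check.latencyMs };
  }
  return { status: full.status, checks };
}
```

Copying an explicit allowlist of fields, rather than deleting error from a clone, means a field added to the internal response later stays private by default.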

Graceful Degradation and Not Restarting Too Fast

One pattern that burns people: a liveness probe that checks external dependencies and triggers container restarts when a downstream service is flaky.

Postgres goes down for 30 seconds. Your liveness probe checks Postgres. It fails 3 times in a row. Kubernetes restarts your pod. The new pod starts up, Postgres is still recovering, the liveness probe fails again. Kubernetes restarts again. Now you have a restart loop on top of a database outage, and your application is never in a stable state long enough to serve the requests that do not need Postgres.

The fix: liveness probes check only the process itself (is the Node.js event loop responsive?). Readiness probes check external dependencies. When Postgres recovers, the readiness probe passes, traffic resumes, and no container was ever unnecessarily restarted.
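If you want the liveness route to say something about event loop responsiveness rather than only answering, Node's perf_hooks module can sample event loop delay. The sketch below is one possible approach, not the code behind the /health/live route shown earlier: the 500 ms threshold and the function names are hypothetical, and a fully blocked event loop will not answer the probe at all, so the probe timeout still does most of the work.

```typescript
import { monitorEventLoopDelay } from "node:perf_hooks";

// Samples event loop delay in the background; resolution is the sampling interval in ms.
const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

// Hypothetical threshold: treat a p99 delay above 500 ms as "event loop not responsive".
const MAX_P99_DELAY_MS = 500;

// Pure decision helper. The histogram reports nanoseconds, so convert before comparing.
function isLoopResponsive(p99DelayNs: number, maxDelayMs = MAX_P99_DELAY_MS): boolean {
  if (!Number.isFinite(p99DelayNs)) return true; // no usable samples yet
  return p99DelayNs / 1e6 <= maxDelayMs;
}

// Body for a liveness route: read the p99 delay since the last call, then reset the window.
function livenessSnapshot(): { status: "ok" | "unhealthy"; loopDelayP99Ms: number } {
  const p99Ns = loopDelay.percentile(99);
  loopDelay.reset();
  const responsive = isLoopResponsive(p99Ns);
  return {
    status: responsive ? "ok" : "unhealthy",
    loopDelayP99Ms: Number.isFinite(p99Ns) ? Math.round(p99Ns / 1e6) : 0,
  };
}
```

Resetting on every call assumes a single consumer of this route; with several consumers each would see a shorter window than it expects.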

In Docker Compose, you only have one healthcheck, and its failure semantics are different from Kubernetes: an unhealthy container gets restarted, period. There is no "remove from rotation without restarting" equivalent. This makes the Docker case harder: if you include Postgres in your healthcheck, a database outage will trigger container restarts — which is exactly what the Kubernetes section warns against.

The pragmatic approach for Docker Compose: keep the healthcheck strictly process-level — no external dependencies at all. Point it at /health/live, which checks only that the Node.js event loop is responsive. Set retries high (3–5) and interval long (30s+) to absorb transient failures, and use your external monitoring system (Caddy upstream check, UptimeRobot, Grafana) for richer dependency checks. Accept that Docker's healthcheck is a liveness probe in all but name, and build your graceful degradation logic inside the application (circuit breakers, fallback paths) rather than relying on the orchestrator to make the distinction for you.

What Good Looks Like

On vatnode, three consumers read health endpoints with different intentions: Docker calls /health/live every 30 seconds to decide whether to restart the container; Caddy's upstream health check calls /health (readiness-style, full dependency check) every 30 seconds to decide whether to route traffic to this upstream; UptimeRobot calls /health every 5 minutes for external availability monitoring. Each consumer gets the endpoint that matches its semantics. Under normal conditions, the /health response looks like this:

{
  "status": "ok",
  "version": "a3f2c19",
  "uptime": 847293,
  "checks": {
    "database": { "status": "ok", "latencyMs": 2 },
    "redis": { "status": "ok", "latencyMs": 1 },
    "queue": { "status": "ok", "latencyMs": 4 }
  }
}

Uptime of 847293 seconds means roughly 9.8 days since last restart. Database latency of 2ms means the connection pool is healthy and the query is executing normally. This response takes me 3 seconds to read and tells me the system is fine.

When something is wrong, the response tells me where to look immediately — without SSHing into the server, without tailing logs, without waiting for a monitoring alert to fire.

If you're building production infrastructure that needs to stay reliable — whether that's a SaaS API, an e-commerce backend, or a worker-heavy integration platform — healthcheck design is one of those details that seems boring until the 2 AM incident when it's the only tool you have.

I've run these patterns in production across several systems, from vatnode.dev to pikkuna.fi. If you need a senior developer who can own production reliability end-to-end — get in touch. I'm available for freelance projects and long-term engagements.

Related:

Background Jobs in Node.js: BullMQ, pg-boss, or Just a Cron? — BullMQ queue health in practice
Self-Hosting a Production API on a €6/month VPS — Docker Compose healthcheck and Caddy upstream health checks in context
Vatnode VAT Validation SaaS — production system where these health patterns run
