Juan Torchia

Posted on May 28 • Originally published at juanchi.dev

Docker healthchecks: what they actually measure and what you shouldn't promise

#english #node #docker #devops

The right way to know if your container is healthy is to stop asking the container if it's healthy. I know that sounds weird. Let me explain why a HEALTHCHECK that returns 200 OK might be lying straight to your face.

The problem isn't the instruction itself. It's the implicit promise we attach to it: if the healthcheck passes, the app works. That's where it breaks down. A process can respond on /healthz and simultaneously have a disconnected database, a saturated queue, or a hung internal worker. Docker's HEALTHCHECK knows nothing about any of that unless you explicitly teach it.

My thesis: HEALTHCHECK is a useful but narrow operational signal. Telling someone "if the healthcheck passes, the service is fine" is promising something the tool simply cannot deliver.

What the official docs say — and what they don't

The HEALTHCHECK section of the official Dockerfile reference describes the instruction precisely. It runs a command periodically inside the container and moves the container's health status between starting, healthy, and unhealthy based on the command's exit code. Exit 0 = healthy. Exit 1 = unhealthy. Exit 2 = reserved (don't use it). A single failure doesn't flip the status: it takes --retries consecutive failures for the container to be considered unhealthy.

# Basic pattern per the official documentation
HEALTHCHECK --interval=30s --timeout=10s --start-period=15s --retries=3 \
  CMD curl -f http://localhost:3000/healthz || exit 1

What the docs don't say: what that endpoint actually needs to respond with for the check to be meaningful. That's your call — and that's the problem I see most often in other people's production codebases.

The available parameters are --interval, --timeout, --start-period, --retries, and --start-interval (which requires Docker Engine 25.0 or later). Each has a reasonable default, but no default is universal. What no parameter can do is understand the business domain running inside the container.

One more thing that deserves more emphasis than it gets: Docker does not restart the container when it goes unhealthy. Restart policies react to the container exiting, not to its health status, so any reaction has to come from an orchestrator or from tooling you add yourself. Swarm, for example, replaces unhealthy tasks. On Railway, what happens when a container goes unhealthy depends on the service configuration, not on Docker alone. If you're expecting Docker to fix the problem once it detects the failure, you'll be waiting a while.
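
Since Docker only records the status, something else has to read it and act on it. Here's a minimal sketch of the reading part, assuming you feed it the JSON printed by docker inspect --format '{{json .State.Health}}' <container>. The field names (Status, FailingStreak, Log, ExitCode, Output) follow Docker's inspect output; summarizeHealth and the sample payload are hypothetical, and I'm assuming a container without a healthcheck prints null there.

```typescript
// Partial shape of `docker inspect --format '{{json .State.Health}}' <container>`.
// Only the fields used below are modeled.
interface HealthLogEntry {
  ExitCode: number;
  Output: string;
}

interface ContainerHealth {
  Status: string; // "starting", "healthy" or "unhealthy"
  FailingStreak: number;
  Log: HealthLogEntry[] | null;
}

// Hypothetical helper: turn the raw JSON into a one-line summary for logs or alerts.
function summarizeHealth(rawJson: string): string {
  const health = JSON.parse(rawJson) as ContainerHealth | null;
  if (health === null) {
    // Assumption: no healthcheck configured means the template prints null.
    return "no healthcheck configured";
  }
  const log = health.Log ?? [];
  const last = log.length > 0 ? log[log.length - 1] : undefined;
  const lastExit = last === undefined ? "n/a" : String(last.ExitCode);
  return `${health.Status} (failing streak: ${health.FailingStreak}, last exit code: ${lastExit})`;
}

// Illustrative input, not captured from a real container.
const sampleHealth = JSON.stringify({
  Status: "unhealthy",
  FailingStreak: 3,
  Log: [{ ExitCode: 1, Output: "curl: (22) The requested URL returned error: 503" }],
});
console.log(summarizeHealth(sampleHealth));
```

What you do with that summary (alert, restart, drain traffic) is exactly the part Docker leaves to you.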

The standard recipe and its hidden cost

The recipe I see in roughly 80% of the Dockerfiles I read follows this pattern:

# Common recipe — works for liveness, not for full readiness
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD wget -qO- http://localhost:8080/health || exit 1

The /health endpoint returns { "status": "ok" } and HTTP 200. The container shows up as healthy. Clean, tidy.

Now picture this reproducible scenario: the HTTP server is up, responding on the port, but the Postgres connection pool is exhausted because there was a traffic spike and connections weren't released properly. Real-world requests are failing with 503. The healthcheck keeps passing because it's asking the process — not the database.

This isn't hypothetical or some made-up incident. It's the exact behavior you get if the /health endpoint doesn't verify the pool. And most /health endpoints in public repos don't. They verify the process started, not that the service can actually serve traffic.

That difference has a name: liveness vs readiness. Kubernetes split them into two separate probes for a reason. Docker has a single HEALTHCHECK instruction, which forces you to choose what you actually want to measure.

# Endpoint that checks liveness (the process is alive)
# GET /healthz → 200 as long as the server responds

# Endpoint that checks real readiness (the service can handle requests)
# GET /ready → 200 only if DB connected, cache available, workers active

If you use a single endpoint for both, what you lose is diagnostic precision. The container shows healthy when it's actually alive but not ready.
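
To make the split concrete, here's a rough sketch of the two probes as plain functions, independent of any web framework. Everything named here (liveness, readiness, the DependencyCheck type, the db and cache checks) is hypothetical; in a real service the checks would wrap your actual database and cache clients.

```typescript
// A dependency check resolves to true when the dependency is usable.
type DependencyCheck = () => Promise<boolean>;

interface ProbeResponse {
  httpStatus: number;
  dependencies: Record<string, string>;
}

// Liveness: answers as long as the process can run this function at all.
function liveness(): ProbeResponse {
  return { httpStatus: 200, dependencies: {} };
}

// Readiness: 200 only when every named dependency check passes.
async function readiness(checks: Record<string, DependencyCheck>): Promise<ProbeResponse> {
  const dependencies: Record<string, string> = {};
  let allOk = true;
  for (const [name, check] of Object.entries(checks)) {
    // A check that rejects counts as a failure instead of crashing the probe.
    const ok = await check().catch(() => false);
    dependencies[name] = ok ? "ok" : "failing";
    allOk = allOk && ok;
  }
  return { httpStatus: allOk ? 200 : 503, dependencies };
}

// Simulated situation: process alive, cache down.
readiness({
  db: async () => true,
  cache: async () => false,
}).then((ready) => {
  console.log("liveness:", liveness().httpStatus, "readiness:", ready.httpStatus);
});
```

With a single HEALTHCHECK you point Docker at one of the two. Keeping both as separate functions at least lets logs and dashboards tell "dead" apart from "alive but not ready".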

Where people get it wrong: three patterns with real consequences

1. Healthcheck that doesn't cover external dependencies

# This only confirms Node.js is up and listening
HEALTHCHECK CMD node -e "require('http').get('http://localhost:3000/health')"

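If Postgres is down, this check still passes. In fact it passes on any HTTP response, because the one-liner never looks at the status code. The fix is making /health actively query its critical dependencies, and pairing it with a check command that fails on error statuses, such as curl -f: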

// src/app/health/route.ts (Next.js App Router; route handlers live under app/)
import { db } from "@/lib/db"; // your database client (a Prisma client here, given $queryRaw)

export async function GET() {
  try {
    // Minimal query to verify real connectivity
    await db.$queryRaw`SELECT 1`;
    return Response.json({ status: "ok", db: "connected" });
  } catch {
    // Respond with 503 so a status-aware check (e.g. curl -f) exits non-zero
    return Response.json(
      { status: "degraded", db: "unreachable" },
      { status: 503 }
    );
  }
}
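
The 503 only helps if the check command turns it into a non-zero exit. curl -f does that; the node -e one-liner shown earlier doesn't, because it never reads the status. For images that ship without curl, a small Node script can do the job. This is a sketch under a few assumptions: Node 18+ (for the global fetch and AbortSignal.timeout), and the names exitCodeFor and probe are mine, not from any library.

```typescript
// Map an HTTP status (or null for "no response") to a HEALTHCHECK exit code:
// 0 marks the check as passed, 1 as failed.
function exitCodeFor(status: number | null): 0 | 1 {
  if (status === null) return 1; // connection refused, DNS failure, or timeout
  return status >= 200 && status < 300 ? 0 : 1;
}

// Request the endpoint and give up after timeoutMs.
// Resolves to the HTTP status, or null if no response arrived in time.
async function probe(url: string, timeoutMs: number): Promise<number | null> {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return res.status;
  } catch {
    return null;
  }
}

// Entry point for a hypothetical healthcheck script:
// probe("http://localhost:3000/health", 2000).then((s) => process.exit(exitCodeFor(s)));
```

Compiled to a hypothetical healthcheck.js and copied into the image, it could be wired in as HEALTHCHECK CMD node healthcheck.js.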

Now the check measures something real. But watch the trade-off: every healthcheck invocation fires a query to the database. At --interval=10s across a service with many instances, that adds up. Pick the interval deliberately, not by defaulting.
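
One way to bound that cost is to cache the dependency result for a few seconds, so frequent checks reuse it instead of hitting the database every time. A minimal sketch, with the caveat that cachedCheck and its parameters are hypothetical and the clock is injectable only to keep the example testable:

```typescript
type Check = () => Promise<boolean>;

// Wrap a dependency check so the underlying call runs at most once per ttlMs.
// `now` defaults to Date.now and exists only to make the sketch testable.
function cachedCheck(check: Check, ttlMs: number, now: () => number = Date.now): Check {
  let lastResult: boolean | undefined;
  let lastRunAt = 0;
  return async () => {
    const t = now();
    if (lastResult !== undefined && t - lastRunAt < ttlMs) {
      return lastResult; // still fresh: skip the real check
    }
    // A check that rejects is recorded as a failure.
    lastResult = await check().catch(() => false);
    lastRunAt = t;
    return lastResult;
  };
}
```

The trade-off is explicit: with a 5-second TTL, a database outage can go unnoticed by the endpoint for up to 5 seconds on top of the healthcheck interval.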

2. `--start-period` too short for heavy apps

# Spring Boot can take 20-40s to start depending on context
# With start-period=5s, the container goes unhealthy before it's even ready
HEALTHCHECK --start-period=5s --interval=10s CMD curl -f http://localhost:8080/actuator/health || exit 1

If you're on Railway or any platform that reacts to unhealthy state, a short --start-period can get the container killed before it has finished starting. That's not a Docker bug; it's bad calibration. The official docs specify that failures during the start period don't count toward the maximum number of retries, but once that window closes every failed check counts, and an app that is still booting can burn through its retries within a few intervals.
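
The arithmetic is easy to get wrong in your head, so here's a small sketch of it. It's a simplified model, not Docker's scheduler: it assumes every check fails instantly, ignores --start-interval, and treats a check that lands exactly on the start-period boundary as counted. The function name is hypothetical.

```typescript
// Earliest time (seconds after container start) at which the container can be
// marked unhealthy if the app never answers. Simplified model, see assumptions above.
function earliestUnhealthyAt(startPeriodS: number, intervalS: number, retries: number): number {
  if (intervalS <= 0 || retries < 1) {
    throw new RangeError("interval must be positive and retries at least 1");
  }
  let failures = 0;
  let t = 0;
  while (failures < retries) {
    t += intervalS; // checks run every interval, the first one an interval after start
    if (t >= startPeriodS) {
      failures += 1; // failures inside the start period are not counted
    }
  }
  return t;
}

// The Spring Boot example above: start-period=5s, interval=10s, default retries=3.
console.log(earliestUnhealthyAt(5, 10, 3)); // 30
// Same check with a 45s start period.
console.log(earliestUnhealthyAt(45, 10, 3)); // 70
```

Under this model the example configuration can flag the container at around 30 seconds, inside the 20-40s startup range, while a 45s start period pushes that to 70 seconds.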

3. No `HEALTHCHECK` at all

Without a HEALTHCHECK instruction, the container always shows state none. In Docker Compose that means depends_on: condition: service_healthy doesn't work. On Railway and similar platforms, it means you have zero operational status signal.

# docker-compose.yml — pattern with health dependency
services:
  app:
    build: .
    depends_on:
      postgres:
        condition: service_healthy  # Requires postgres to have a HEALTHCHECK
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5

Without a healthcheck on postgres, the depends_on with condition: service_healthy fails at runtime, because Compose has no health status to wait for. And a plain depends_on without the condition only orders container startup; it doesn't wait for Postgres to accept connections. That's the kind of error that shows up at 11pm during a new deploy, when you dig through the logs and realize the app tried to connect before Postgres was ready.

Decision matrix: what to check and when it matters

| Scenario | What to measure | Suggested endpoint | Cost to consider |
| --- | --- | --- | --- |
| Liveness only | Process alive | `/healthz` (always returns 200) | Minimal |
| Readiness with DB | DB accessible | `/ready` (`SELECT 1` or equivalent) | One query per check |
| External dependencies | Critical APIs | `/ready` (low timeout, don't block) | Network latency |
| Worker / job | Own heartbeat | Timestamp file or dedicated endpoint | Custom logic to maintain |
| Local compose only | Startup order | `pg_isready`, `redis-cli ping` | Nothing |

The question worth asking before defining the command: what has to be true for this container to serve real traffic? If the answer includes "the database needs to be connected" or "the worker needs to be alive," that needs to show up in the endpoint you're checking.
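
The worker row is the one without an obvious endpoint, so here's a rough sketch of the timestamp-file approach. The worker touches a file on every loop, and the check fails once the file gets too old. The path, the function names, and the 30-second threshold are all hypothetical.

```typescript
import { statSync, writeFileSync } from "node:fs";

// Worker side: call this once per loop iteration to record "still alive".
function beat(path: string): void {
  writeFileSync(path, String(Date.now()));
}

// A heartbeat is fresh when it was written no more than maxAgeMs ago.
function isHeartbeatFresh(lastBeatMs: number, nowMs: number, maxAgeMs: number): boolean {
  return nowMs - lastBeatMs <= maxAgeMs;
}

// Check side: read the file's modification time and map freshness to an exit code.
// A missing or unreadable file counts as a failed check.
function heartbeatExitCode(path: string, maxAgeMs: number): 0 | 1 {
  try {
    const lastBeatMs = statSync(path).mtimeMs;
    return isHeartbeatFresh(lastBeatMs, Date.now(), maxAgeMs) ? 0 : 1;
  } catch {
    return 1;
  }
}

// Hypothetical healthcheck entry point:
// process.exit(heartbeatExitCode("/tmp/worker-heartbeat", 30_000));
```

This catches a hung loop, which a port check never will. It doesn't tell you whether the jobs themselves are succeeding.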

Real limits: what you can't conclude from a healthcheck

Here's what a HEALTHCHECK can't give you without additional instrumentation:

Response latency: the check only measures whether it responded, not how long it took. An endpoint that takes 9 seconds with --timeout=10s passes as healthy. If latency matters to you, you need external metrics — Prometheus, OpenTelemetry, structured logs.
Response correctness: a status-only check such as curl -f doesn't parse the body. You can return corrupted data and remain healthy as long as the HTTP status is 200.
Business logic state: if a queue is growing out of control, if a reconciliation process is silently failing, if calculations are wrong — none of that is visible to the healthcheck.
Capacity under load: the endpoint responding when Docker invokes it doesn't mean it'll respond when 500 concurrent requests hit it.
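
The first two limits come from the check command rather than from HEALTHCHECK itself, so a custom command can narrow them a little. A sketch of a stricter pass/fail rule, assuming the endpoint returns JSON shaped like { "status": "ok" }; the Observation type, the passes function, and the budget values are hypothetical.

```typescript
interface Observation {
  httpStatus: number;
  elapsedMs: number;
  body: string;
}

// Stricter than a status-only check: also enforce a latency budget and
// require a JSON body whose "status" field is "ok".
function passes(obs: Observation, latencyBudgetMs: number): boolean {
  if (obs.httpStatus !== 200) return false;
  if (obs.elapsedMs > latencyBudgetMs) return false;
  try {
    const parsed: unknown = JSON.parse(obs.body);
    return (
      typeof parsed === "object" &&
      parsed !== null &&
      (parsed as { status?: unknown }).status === "ok"
    );
  } catch {
    return false; // body is not valid JSON
  }
}

// A 9-second answer passes --timeout=10s but not a 1-second budget.
console.log(passes({ httpStatus: 200, elapsedMs: 9000, body: '{"status":"ok"}' }, 1000)); // false
```

Even this only samples one request at one moment. Business logic state and capacity under load still need metrics and traces.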

None of this invalidates HEALTHCHECK. What it does is scope its responsibility. It's a signal that the process is alive and can handle a minimal request. That's genuinely valuable for orchestration and restart policies. It's not enough to claim the service is functioning correctly.

For everything else you need metric-based alerts, distributed traces, or at minimum structured logs you can actually query. The healthcheck is the most basic layer of observability — not the only one.

FAQ — Docker healthcheck best practices

How often should the healthcheck run?

It depends on how fast you want to detect a failure. The --interval=30s default is reasonable for most services. If the check queries the database, dropping to 10s across a service with many instances can generate unnecessary load. For deploy pipelines where you need a fast readiness signal, --interval=5s with a well-calibrated --start-period usually works. There's no universal answer: measure the endpoint's impact before tuning the interval.
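
A back-of-the-envelope estimate helps before measuring anything. This sketch assumes one dependency query per check and ignores how long each check takes; the instance count of 20 is made up for illustration.

```typescript
// Rough number of healthcheck-driven queries per minute across all instances.
function checksPerMinute(instances: number, intervalS: number): number {
  return (instances * 60) / intervalS;
}

console.log(checksPerMinute(20, 30)); // 40
console.log(checksPerMinute(20, 5)); // 240
```

Going from 30s to 5s multiplies the background load by six, whatever the instance count.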

Does the healthcheck restart the container if it fails?

Not by itself. Docker marks the container as unhealthy, but restart policies (--restart always, on-failure, etc.) react to the container exiting, not to its health status. What happens next depends on the orchestrator or on tooling you add: Swarm replaces unhealthy tasks, while plain Docker and Compose leave the container running. On platforms like Railway, the behavior depends on the service configuration. Don't assume the container will restart just because it went unhealthy.

Does it make sense to use HEALTHCHECK in local development?

Yes, especially in compose to control startup order with depends_on: condition: service_healthy. It eliminates that "app started before the database and crashed" cycle we've all lived through. In development you don't need finely tuned intervals — the defaults work fine.

What's the difference between HEALTHCHECK in Dockerfile and healthcheck in docker-compose.yml?

Both configure the same thing, but at different levels. HEALTHCHECK in the Dockerfile is embedded in the image, so it applies whenever you run that image. The healthcheck: key in docker-compose.yml overrides or defines the check for that specific service in that compose file. For images you control, defining it in the Dockerfile makes more sense. For third-party images (postgres, redis, etc.), configuring it in compose is usually the simplest option, since the alternative is building a derived image just to add the instruction.

Can I disable a HEALTHCHECK that comes from a base image?

Yes. The official docs state that HEALTHCHECK NONE disables any healthcheck inherited from the parent image. Useful when you're using a base image that ships a check that doesn't apply to what you're actually running.

Does the healthcheck affect container performance?

The command runs inside the container and consumes the container's own CPU and memory. A lightweight curl has minimal impact. An endpoint that runs complex queries or calls external services on every check can add up. If you ever see unusual CPU usage or database connections on a container that isn't under real traffic load, the healthcheck is one of the first places to look.

Useful signal, limited promise

After working with Docker across daily deploys — on Railway, in local compose, in backends mixing Next.js with separate services — here's what I think is the honest take: HEALTHCHECK is worth configuring properly. Not because it's some observability silver bullet, but because it's the cheapest layer of early detection you can add without any extra infrastructure.

But you have to be honest about what it promises. A healthcheck pointing at an endpoint that just returns 200 OK without verifying dependencies is a liveness signal, not a readiness signal. Calling it a "full health check" is overpromising.

My practical recommendation: if you define a single endpoint for HEALTHCHECK, make it verify the service's critical dependencies — the database at minimum. Calibrate --start-period based on the app's actual startup time. And document in the Dockerfile itself what the check is measuring, so whoever reads it next understands the contract.

What not to do: confuse "the container is healthy" with "the service is functioning correctly." Those phrases look similar and measure completely different things. The first claim you can back up with the healthcheck. The second requires metrics, traces, and alerts — things that start where HEALTHCHECK's scope ends.

If you want to go deeper on connecting observability signals across layers, the post on caching in Next.js App Router and the one on rate limiting before picking a library touch on similar operational decisions — where to put the logic, what each layer actually promises, and when the abstraction is hiding the real problem.

Source

Docker HEALTHCHECK reference: https://docs.docker.com/reference/dockerfile/#healthcheck

This article was originally published on juanchi.dev
