Docker healthchecks: what they actually measure and what you shouldn't promise



Why do we treat a Docker healthcheck like it's a full health monitoring system when the official documentation promises nothing of the sort? You've been running containers for months, the Railway dashboard shows "healthy" in green, and somehow you assume the app is fine. That assumption bothers me. And I think it's worth putting it under pressure.

My thesis is straightforward: a healthcheck that only confirms the process responds can hide entire business failures. The mechanism does what it says it does — nothing more, nothing less. The problem is what we promise on top of it.

Docker healthcheck: what the docs say and what they DON'T

The official HEALTHCHECK reference is clear about its scope. The HEALTHCHECK instruction defines a command that Docker runs periodically inside the container to test whether that container still "works." If the command exits with code 0, the check passed and the container is healthy. If it exits with 1, the check failed; after the configured number of consecutive failures, the container is marked unhealthy. Exit code 2 is reserved and should not be used.

# Minimal example, modeled on the one in the official documentation
HEALTHCHECK --interval=30s --timeout=10s --start-period=15s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1

The available parameters are:

--interval: how often it runs (default: 30s)
--timeout: maximum time a single check may run before it counts as a failure (default: 30s)
--start-period: initial grace period for the container to start up; failed checks in this window don't count toward --retries (default: 0s)
--retries: how many consecutive failures mark it unhealthy (default: 3)

What the documentation doesn't say is that the /health endpoint verifies the real state of the application. That's the responsibility of whoever implements it. Docker only evaluates the exit code of the command — it doesn't read the response body, doesn't parse JSON, doesn't know if the database is down or if a queue is accumulating unprocessed messages.
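To make the "exit code only" contract concrete, here is a minimal sketch of a probe written in Node instead of curl, which can help on images that don't ship curl. The file name healthcheck.js, the 5000 ms timeout, and the URL passed on the command line are my assumptions, not something the Docker documentation prescribes. The only thing Docker would see is the number passed to process.exit.

```javascript
// healthcheck.js: hypothetical probe script (file name and timeout are assumptions).
const http = require('node:http');

// Resolves to the exit code Docker should see: 0 = healthy, 1 = unhealthy.
function probe(url, timeoutMs) {
  return new Promise((resolve) => {
    const req = http.get(url, { timeout: timeoutMs, agent: false }, (res) => {
      res.resume(); // discard the body; only the status code matters here
      resolve(res.statusCode < 400 ? 0 : 1); // same cutoff `curl -f` uses
    });
    // The 'timeout' event does not abort the request by itself.
    req.on('timeout', () => req.destroy(new Error('probe timed out')));
    req.on('error', (err) => {
      console.error(`healthcheck failed: ${err.message}`);
      resolve(1);
    });
  });
}

// Runs only when a URL is passed: node healthcheck.js http://localhost:3000/health
if (process.argv[2]) {
  probe(process.argv[2], 5000).then((code) => process.exit(code));
}
```

In a Dockerfile this would be wired with the exec form of the instruction, for example HEALTHCHECK CMD ["node", "/app/healthcheck.js", "http://localhost:3000/health"], where /app is an assumed install path. Notice that the response body is thrown away: a 200 with {"status":"ok"} and a 200 with anything else look identical to Docker.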

The uncomfortable truth: most examples floating around in tutorials use curl -f against an endpoint that returns {"status":"ok"} without connecting to anything real. A check like that only confirms that the Node, Go, or Java process started and answers HTTP requests with a non-error status. That has value, but it's very limited value.

Where people go wrong: the common recipe and its hidden cost

The most frequent pattern showing up in public Docker setups looks like this:

# Common pattern: proves the HTTP server answers, not that the business logic works
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1

And on the app side, the endpoint does something like this:

// Handler that only confirms the server is responding
app.get('/health', (req, res) => {
  res.status(200).json({ status: 'ok' });
});

The container shows up as healthy. The orchestrator doesn't restart it. Railway shows green. And meanwhile, the PostgreSQL connection could be timing out, the connection pool exhausted, or a critical worker silently dead.

The hidden cost isn't that the healthcheck lies — it's that we believe it more than we should. When a team assumes "healthy = app is fine," they stop looking at logs, metrics, and richer signals. The healthcheck becomes the invisible scapegoat: if the dashboard is green, the problem must be somewhere else.

There's a counterexample that applies to any typical system: imagine a backend that processes payments. The /health endpoint returns 200, the container is healthy, but the payment gateway client has an expired token. Every transaction fails silently. Docker knows nothing about that — and it shouldn't have to, if the endpoint doesn't check for it.
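For contrast, here is a rough sketch of what an endpoint that does check for it could look like. The function runChecks and the check names used in the comment (db, gateway, pool, paymentClient) are hypothetical; the sketch only assumes each dependency can be wrapped in an async function that throws on failure. Whether such a check belongs behind the Docker HEALTHCHECK or behind a separate monitoring endpoint is a different question, covered in the sections that follow.

```javascript
// Runs each named check with a time limit and reports which ones failed.
// `checks` maps a name to an async function that throws (or rejects) on failure.
async function runChecks(checks, timeoutMs) {
  const entries = await Promise.all(
    Object.entries(checks).map(async ([name, check]) => {
      let timer;
      const limit = new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error('timed out')), timeoutMs);
      });
      try {
        // Promise.resolve().then(...) also captures synchronous throws.
        await Promise.race([Promise.resolve().then(check), limit]);
        return [name, 'ok'];
      } catch (err) {
        return [name, `failed: ${err.message}`];
      } finally {
        clearTimeout(timer);
      }
    })
  );
  const healthy = entries.every(([, state]) => state === 'ok');
  return {
    httpStatus: healthy ? 200 : 503,
    body: { status: healthy ? 'ok' : 'degraded', checks: Object.fromEntries(entries) },
  };
}

// Hypothetical wiring inside an Express-style handler:
//   const r = await runChecks({ db: () => pool.query('SELECT 1'),
//                               gateway: () => paymentClient.ping() }, 2000);
//   res.status(r.httpStatus).json(r.body);
```

With this shape, the expired gateway token from the example would turn the response into a 503 with the failing check named in the body, instead of a green 200.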

Decision matrix: when to trust the healthcheck and when to go further

This matrix isn't absolute — it's a sensible criterion for deciding what to verify and from where:

Signal	What it detects	What it DOESN'T detect	Best used for
Basic `HEALTHCHECK` (curl against a static endpoint)	Process started, port open, HTTP server answering	DB failures, broken logic, queues	First-level triage
`/health` endpoint with DB ping	Database connectivity	Corrupt data, permissions, migrations	Basic readiness
Deep `/health` endpoint	Critical dependencies responding	Degraded latency, silent errors	Startup checks in prod
Application metrics (Prometheus, logs)	Errors per route, latencies, queues	Problems outside the app's scope	Real observability
External alerts (uptime monitors)	Availability from outside the container	Internal process state	SLA and contracts

When to trust the basic healthcheck:

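To give the platform a signal that the process has hung or stopped answering. Note that standalone Docker Engine only records the unhealthy status; restart policies react to the container exiting, not to a failed check. Acting on the status takes an orchestrator such as Docker Swarm, or a watchdog you run yourself.
To keep a slow-starting container from being marked unhealthy too early: failures during --start-period don't count toward --retries.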
As a first-level triage tool in local development.

When to go further:

If the service has critical external dependencies (DB, caches, third-party APIs).
If a silent failure has direct business cost.
If the container keeps restarting but the problem persists because the error isn't process-level.

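A useful question before writing the endpoint: what do you want the platform to do if this check fails? If the answer is "restart the container," verify two things. First, that something actually restarts it: standalone Docker Engine only flags the container as unhealthy, while orchestrators such as Docker Swarm replace unhealthy tasks. Second, that restarting fixes that kind of failure. If the problem is a downed DB, restarting the container doesn't help, and it can trigger a restart loop that makes things worse.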
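One way to act on that question is to split the signal in two, so that the check wired to restarts never fails for reasons a restart can't fix. The sketch below is an assumption of mine, not something Docker provides: the paths /livez and /readyz are borrowed from Kubernetes naming conventions, and dependenciesOk stands in for whatever dependency check the app already has.

```javascript
// Hypothetical split: /livez answers "is the process responding?",
// /readyz answers "can it do useful work right now?".
function healthResponse(path, dependenciesOk) {
  if (path === '/livez') {
    // Deliberately ignores dependencies: restarting this container
    // cannot repair a database outage somewhere else.
    return { status: 200, body: { status: 'alive' } };
  }
  if (path === '/readyz') {
    return dependenciesOk
      ? { status: 200, body: { status: 'ready' } }
      : { status: 503, body: { status: 'not ready' } };
  }
  return null; // not a health route; let the normal router handle it
}
```

Under this design the Dockerfile HEALTHCHECK would point at /livez, while external monitors and alerts would watch /readyz. A downed DB then pages a human instead of restarting a container that was never the problem.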

Common mistakes and gotchas the documentation doesn't warn you about

1. --start-period too short for heavy apps.
Apps with migrations, cache warmup, or slow initialization need a generous --start-period. If the first checks fail before the app finishes starting up, Docker can mark the container unhealthy before it's even ready. In Railway, that can translate into premature restarts.

2. The /health endpoint that does too much.
If the check connects to the DB, calls an external API, and verifies storage on every execution every 30 seconds, you're adding artificial load and creating a new failure point. A deep check can fail on timeout even when everything else is fine.

3. exit 1 with no context in the logs.
When a healthcheck fails, the probe's output does not show up in docker logs. Docker keeps the output of the most recent probes (truncated to the first 4096 bytes) in the container's health state, which you can read with docker inspect. If the command prints nothing useful, or you never look there, diagnosing why the check is failing in production becomes harder than it needs to be.

# Better: -sS hides the progress meter but still prints the error on failure,
# so the reason ends up in the health log that docker inspect shows
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD curl -fsS http://localhost:3000/health || exit 1

4. Confusing liveness with readiness.
Docker HEALTHCHECK is essentially a liveness check: is the process still alive? Kubernetes explicitly separates liveness from readiness. In standalone Docker or Railway, that distinction doesn't exist at the orchestrator level — so the responsibility falls on how you design the endpoint and when you expose it.

5. Assuming "healthy" means the container is correctly receiving traffic.
HEALTHCHECK affects the status reported by docker ps, and orchestrators such as Docker Swarm can act on it, but it doesn't guarantee the load balancer (if there's one in front) is routing correctly. Those are separate layers.
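Going back to gotcha 2, one possible way to keep a deep check from hammering its dependencies is to cache the result for a short time. This is a sketch under my own assumptions: cachedCheck is a hypothetical helper, and the right TTL depends on how stale a health answer you can tolerate.

```javascript
// Wraps an async check so it runs at most once per `ttlMs`.
// Callers inside the window share the same (possibly in-flight) result.
// `now` is injectable so the behaviour can be tested without real waiting.
function cachedCheck(check, ttlMs, now = Date.now) {
  let expiresAt = 0;
  let pending = null;
  return function run() {
    if (pending && now() < expiresAt) return pending;
    expiresAt = now() + ttlMs;
    pending = Promise.resolve()
      .then(check)
      .then(
        () => ({ ok: true }),
        (err) => ({ ok: false, error: err.message })
      );
    return pending;
  };
}
```

For example, wrapping a database ping with a 10 second TTL means that a probe every 30 seconds plus any number of dashboard refreshes in between still produce at most one real query per 10 seconds. The trade-off is that a failure can go unreported for up to one TTL.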

FAQ: Docker healthcheck best practices

What exit codes does HEALTHCHECK accept?
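The documentation defines three: 0 (success, the container is healthy), 1 (unhealthy), and 2 (reserved, do not use). Don't rely on how other codes are handled; curl, for instance, exits with 22 when -f sees an HTTP error, which is why the examples normalize failures with || exit 1. The logic lives in the exit code of the command defined in CMD. Docker stores the command's output for inspection but never interprets stdout or the HTTP response body.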

Should I include a database check in the healthcheck?
Depends on what you want the platform to do when it fails. If the DB goes down and restarting the container solves nothing, a DB ping in the healthcheck can generate restart loops with no resolution on any platform that restarts or replaces unhealthy containers. A more sensible approach: verify the DB during app startup and fail there; use the healthcheck to confirm the process keeps responding after it's started.
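A minimal sketch of that "verify at startup and fail there" approach, assuming the dependency can be probed with an async function that throws when it is unreachable. The helper name waitForDependency, the retry counts, and the pool and app objects in the comment are all hypothetical.

```javascript
// Tries `check` up to `attempts` times, waiting `delayMs` between tries.
// Returns the attempt number that succeeded, or throws after the last one.
async function waitForDependency(check, attempts, delayMs) {
  for (let attempt = 1; attempt <= attempts; attempt += 1) {
    try {
      await check();
      return attempt;
    } catch (err) {
      console.error(`startup check ${attempt}/${attempts} failed: ${err.message}`);
      if (attempt < attempts) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw new Error(`dependency not reachable after ${attempts} attempts`);
}

// Hypothetical wiring: only start listening once the DB answers,
// and exit non-zero (a failure the platform can see) if it never does.
//   waitForDependency(() => pool.query('SELECT 1'), 5, 2000)
//     .then(() => app.listen(3000))
//     .catch(() => process.exit(1));
```

The total wait here (attempts times delay) should fit inside the --start-period you configure, otherwise the healthcheck can start counting failures while the app is still legitimately waiting.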

How do I inspect the state of a running healthcheck?

# See state and logs of the last healthcheck
docker inspect --format='{{json .State.Health}}' <container_id> | jq

This shows the current status, the failing streak, and a log of the most recent probes, each with its start and end timestamps, exit code, and captured output. Useful information for diagnosing without guessing.
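If jq isn't at hand, or you want a one-line summary in a script, the same JSON can be condensed with a few lines of Node. This is only a sketch: summarizeHealth is a hypothetical helper, it assumes the Status, FailingStreak, and Log fields (with ExitCode, End, and Output per entry) that docker inspect prints for the health state, and it assumes log entries are ordered oldest to newest.

```javascript
// Takes the JSON text printed by:
//   docker inspect --format='{{json .State.Health}}' <container_id>
function summarizeHealth(jsonText) {
  const health = JSON.parse(jsonText);
  // Assumption: containers without a healthcheck print `null` here.
  if (health === null) return 'no healthcheck configured';
  const log = health.Log || [];
  const last = log[log.length - 1]; // assumed to be the newest probe
  const head = `${health.Status} (failing streak: ${health.FailingStreak})`;
  if (!last) return head;
  const output = String(last.Output).trim();
  return `${head}; last probe exit=${last.ExitCode} at ${last.End}: ${output}`;
}
```

The captured output is the part that usually answers "why," which is the reason it's worth making the probe command print something meaningful on failure.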

Does the healthcheck affect docker compose up behavior?
Yes, if you use depends_on with condition: service_healthy. In that case, the dependent service won't start until the check passes. That's useful, but it requires the start period (start_period under the service's healthcheck key in a Compose file) to be well-calibrated; otherwise the dependency can be marked unhealthy during a slow startup and the dependent service never starts.

Does Railway use the HEALTHCHECK from the Dockerfile?
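Don't assume so. Railway documents its own healthcheck setting: an HTTP path configured on the service, which Railway requests while a deploy is starting and before it routes traffic to the new version. If that request doesn't succeed within the configured timeout, the deploy fails. Whether the Dockerfile's HEALTHCHECK plays any role is something to verify against Railway's current documentation rather than take for granted. Worth checking the deploy logs if the service gets stuck in a loop.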

When does it make sense NOT to define HEALTHCHECK?
In short-lived task containers (batch jobs, migration scripts), where the concept of "continuous health" doesn't apply. Also in local development where the overhead of configuring it doesn't add real value.

Closing: the signal is limited, and that's fine — as long as you know it

A well-configured healthcheck is a basic safety net: it detects that the process died or went zombie, and gives the orchestrator a signal to act on. That's valuable. But if you build an entire service's operations on that single signal, you're promising yourself more than the mechanism can deliver.

My stance is this: use HEALTHCHECK for what it does well — liveness — and build real observability for everything else. Structured logs, application metrics, external alerts. Not because the healthcheck is bad, but because no tool should carry responsibilities that don't belong to it.

The concrete next step: open the Dockerfile of any service you have running and ask yourself what happens if that endpoint returns 200 but the DB is down. If the answer is "I don't find out until someone reports an error," you have an observability gap that the healthcheck will never close.

If you made it this far and you're thinking about how to evaluate dependencies before adding them to an operational stack, you might find what I wrote about how to evaluate npm dependencies before pulling them into production useful. And if the topic of signals in systems resonates with you, the post on Sniffnet for monitoring network traffic follows a similar logic: tools with a clear scope, without overpromising.
