137Foundry

Posted on Jun 22

How to Tell If Your Health Endpoint Is Lying To You

A lying health endpoint is not a fictional problem. Almost every long-running web application accumulates one over time: a /health or /healthz route that returns 200 OK even when the application is, by any reasonable definition, broken. The team trusts the dashboard. The dashboard trusts the endpoint. The endpoint trusts a route handler that was scaffolded three years ago and has not been audited since.

This is a guide to figuring out whether your own health endpoint is telling the truth. It is the audit that every web service should have run at least once and that very few have.

The four ways a health endpoint lies

A health endpoint can lie in four distinct ways. Knowing which one applies tells you what to fix.

Lie #1: It returns 200 OK because the route handler is hard-coded to do so.

The route handler reads, in essence:

app.get("/health", (req, res) => res.send("ok"));

There is no check and no dependency probe. The endpoint reports that the process can serve HTTP, and nothing more. For some teams this is fine: it is what a liveness probe should look like. For most teams that believe they have a real health endpoint, it is not.

How to spot it: read the route handler. If it has no awaits, no database calls, no external checks, no try/catch around dependencies, it is this kind of lie.

Lie #2: It returns 200 OK because the check inside it is catching all errors.

The route handler tries to check the database, the cache, and maybe an upstream service, but every check is wrapped in a try block whose catch returns 200 anyway. The original intent may have been "do not let the health endpoint itself crash," but the implementation hides every dependency error behind a green light.

How to spot it: search the route handler for catch blocks. If any catch returns success, the endpoint can lie about that path.
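A minimal sketch of the difference, written as framework-agnostic functions rather than as any particular service's code. `checkDatabase`, `lyingHealth`, and `honestHealth` are hypothetical names; `checkDatabase` is assumed to resolve when the database answers and reject otherwise.

```javascript
// The lying version: the catch block turns every failure into a 200.
async function lyingHealth(checkDatabase) {
  try {
    await checkDatabase();
    return { status: 200, body: "ok" };
  } catch (err) {
    return { status: 200, body: "ok" }; // the failure is swallowed here
  }
}

// The honest version still never throws, but a failed check becomes a 503.
async function honestHealth(checkDatabase) {
  try {
    await checkDatabase();
    return { status: 200, body: "ok" };
  } catch (err) {
    return { status: 503, body: `database check failed: ${err.message}` };
  }
}
```

Both versions satisfy the original intent of not letting the health endpoint crash. Only the second one reports what the check found.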

Lie #3: It returns 200 OK because it checks the wrong dependency.

The handler checks the database connection pool by querying a stub or a status table that does not exercise the part of the database real requests depend on. Or it checks a cache that has since been replaced, and the check against the old cache still passes. Or it checks a deprecated upstream service that now returns 200 from a maintenance page regardless of state.

How to spot it: trace every check inside the handler back to a real production dependency. If any of them is vestigial, that check is lying.
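One way to make a check hard to fake is a round trip: write a fresh value through the same client the request path uses, read it back, and compare. The sketch below assumes a hypothetical cache client exposing async `set(key, value)` and `get(key)`; the function name and key prefix are illustrative only.

```javascript
const { randomUUID } = require("node:crypto");

// `client` is a hypothetical cache client with async set(key, value) and get(key).
// A real check would also set a short expiry so probe keys do not accumulate.
async function checkCacheRoundTrip(client) {
  const expected = randomUUID();
  const key = `healthcheck:${expected}`; // unique key, so concurrent probes do not collide
  await client.set(key, expected);
  const actual = await client.get(key);
  if (actual !== expected) {
    throw new Error("cache round trip returned an unexpected value");
  }
}
```

A stub or a maintenance page that always answers with the same canned response cannot pass this check, because the expected value changes on every probe.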

Lie #4: It returns 200 OK because the process is fine but the parts that serve real traffic are not.

This is the subtlest lie. The route handler runs on a worker pool, an event loop, or a thread that is completely separate from the workers serving real requests. When the request-serving workers wedge, the health-endpoint worker is unaffected. The endpoint passes; users see latency and errors.

How to spot it: look at the framework's worker configuration. If the health endpoint runs on its own loop or worker pool, the endpoint is structurally unable to detect a wedge in the request-serving path.
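When the health route cannot be moved onto the serving workers, one partial mitigation is to let the serving path record its own state and have the health handler read it. The sketch below assumes both sides run in the same process and share memory; `createServingPathMonitor` and its methods are hypothetical names, not a framework API.

```javascript
// Tracks requests on the real serving path so a health check can notice a wedge.
function createServingPathMonitor({ maxInFlightAgeMs }) {
  const startedAt = new Map(); // request id -> start timestamp in ms

  return {
    // Call when a real request begins.
    start(id, now = Date.now()) {
      startedAt.set(id, now);
    },
    // Call when that request completes, whether it succeeded or failed.
    finish(id) {
      startedAt.delete(id);
    },
    // True when any in-flight request has been running longer than the limit.
    isWedged(now = Date.now()) {
      for (const began of startedAt.values()) {
        if (now - began > maxInFlightAgeMs) return true;
      }
      return false;
    },
  };
}
```

A readiness handler could then return 503 whenever `isWedged()` is true. If the health route lives in a separate process, the same idea would need a shared store, which this sketch does not cover.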

A four-step audit you can run today

The audit takes an hour or two for most web applications and produces concrete output: either the endpoint is telling the truth, or you have a short list of things to fix.

Step 1: Read the route handler.

Open the source file containing the health endpoint. Read it line by line. Write down:

What checks does it run?
What dependencies does each check exercise?
What does it return on success, on partial failure, on total failure?
Is the route served by the same worker pool as real traffic?

If the answers are surprising (you thought it checked the cache and it does not), you have already found one of the lies above.

Step 2: Run a fault-injection test.

In a non-production environment, stop the database. Call the health endpoint. Does it return 503? If it returns 200, the endpoint is lying. Repeat for the cache, for any internal upstream services in the request path, and for the message broker.
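The procedure above can be sketched as a small harness. Everything passed in is a hypothetical hook you would supply for your own environment: `getStatus` resolves to the endpoint's HTTP status code, `stop` takes the dependency down, and `start` brings it back. How those hooks are implemented (an HTTP client call, a container command) is deliberately left out.

```javascript
// Runs one fault-injection round against a health endpoint.
async function faultInjectionRound(name, { getStatus, stop, start }) {
  const before = await getStatus();
  await stop();
  let during;
  try {
    during = await getStatus();
  } finally {
    await start(); // always restore the dependency, even if the probe throws
  }
  return {
    name,
    before,
    during,
    // An honest endpoint is green before the fault and returns 503 while it lasts.
    honest: before === 200 && during === 503,
  };
}
```

Running one round per dependency (database, cache, upstream services, message broker) gives a short table of which failures the endpoint actually reports.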

This is the single most valuable test you can run on a health endpoint, and it is surprisingly underused. Teams audit code; very few teams audit signals.

Step 3: Probe under realistic load.

In a non-production environment, generate traffic that mimics production. Watch the health endpoint while doing so. Does it stay snappy? Does it ever return 503 spuriously? Probing the endpoint on a schedule and recording the results, for example with Prometheus and its blackbox exporter, will surface slow checks and intermittent failures that hand-testing misses.

A health endpoint that takes 2 seconds to respond is not a health endpoint; it is a latency liability. If the orchestrator polls it every 5 seconds on every replica, and each poll queries the database, you have just added meaningful background load to your own database.
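A back-of-the-envelope estimate makes the cost concrete. The two helpers below are illustrative, and the replica, prober, and query counts in the example are made-up numbers; only the 2-second response and 5-second interval come from the paragraph above.

```javascript
// Rough estimate of dependency queries per second generated by health probes alone.
function probeQueriesPerSecond({ replicas, probers, intervalSeconds, queriesPerProbe }) {
  return (replicas * probers * queriesPerProbe) / intervalSeconds;
}

// Fraction of time each prober has a health request in flight.
function probeBusyFraction(responseSeconds, intervalSeconds) {
  return responseSeconds / intervalSeconds;
}

// Made-up fleet: 20 replicas, 3 probers per replica, 2 database queries per probe.
const load = probeQueriesPerSecond({
  replicas: 20,
  probers: 3,
  intervalSeconds: 5,
  queriesPerProbe: 2,
});
console.log(load); // 24 queries per second from probes alone
console.log(probeBusyFraction(2, 5)); // 0.4
```

With a 2-second response and a 5-second interval, each prober keeps a health request open 40% of the time, before any real traffic arrives.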

Step 4: Trace the dashboard's chain.

The dashboard says the service is healthy. Where does that signal come from? Usually it is an external uptime checker, a monitoring tool, or the orchestrator's own status. Confirm what each of them is actually checking. Often the dashboard is summarizing the orchestrator's view, which in turn is summarizing the liveness probe. If the liveness probe is the always-200 lie from above, the entire dashboard chain is built on a lie.

Common discoveries from the audit

In our experience auditing client systems at 137Foundry, the patterns recur:

Pattern 1: The handler was written before the service used a database. The endpoint is a leftover from when the application was a single in-memory service. The original handler was honest at the time and has never been updated.

Pattern 2: The handler checks an old dependency that has been replaced. The application now uses Redis for caching, but the health check still queries a stub that pretends a Memcached instance is up.

Pattern 3: The handler is the same route for liveness, readiness, and detail. It tries to satisfy three different audiences and ends up lying to at least one of them.

Pattern 4: The handler catches exceptions silently. Someone added a catch block years ago to prevent the endpoint from 500ing, and now any internal error is masked as 200.

The fixes are usually 10 to 50 lines of code each, plus a configuration change to the orchestrator's probe settings.

The three-endpoint solution

The longer-term fix is the three-endpoint pattern: split the single /health into /livez (liveness), /readyz (readiness), and /statusz (detailed JSON for humans). Each endpoint has a clear scope and a clear audience.

Liveness checks only that the process can serve HTTP at all. Cheap, fast, no dependencies.
Readiness checks every dependency the application needs to serve real requests. Slightly slower, parallelized, cached briefly.
Detailed status returns a JSON payload with per-dependency information for dashboards and on-call engineers.

This pattern is described fully in the hub article on health check endpoint design worth trusting. Container orchestrators like Kubernetes are designed around this separation, with distinct liveness and readiness probes. Load balancers like Nginx and edge proxies like Cloudflare can also use a health-check result as a routing signal. The split is not a new invention; it is what the underlying infrastructure already expects.
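As a rough illustration of the three scopes, here is a framework-agnostic sketch. It is not the hub article's implementation. `createHealthHandlers`, the `checks` map (dependency name to an async function that throws on failure), and the default timeout and cache values are all assumptions made for the example.

```javascript
// Reject if a check takes longer than `ms`, so one slow dependency cannot stall the probe.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// `checks` is a hypothetical map: dependency name -> async function that throws on failure.
function createHealthHandlers(checks, { timeoutMs = 1000, cacheMs = 2000 } = {}) {
  let cached = null; // { at, results } from the most recent run

  // Run every check in parallel and cache the outcome briefly.
  async function runChecks(now = Date.now()) {
    if (cached && now - cached.at < cacheMs) return cached.results;
    const names = Object.keys(checks);
    const settled = await Promise.allSettled(
      names.map((name) =>
        withTimeout(Promise.resolve().then(() => checks[name]()), timeoutMs)
      )
    );
    const results = {};
    settled.forEach((outcome, i) => {
      results[names[i]] =
        outcome.status === "fulfilled"
          ? { ok: true }
          : { ok: false, error: String(outcome.reason?.message ?? outcome.reason) };
    });
    cached = { at: now, results };
    return results;
  }

  return {
    // Liveness: the process can answer at all. No dependencies.
    livez: async () => ({ status: 200, body: "ok" }),
    // Readiness: 503 unless every dependency check passed.
    readyz: async (now) => {
      const results = await runChecks(now);
      const ready = Object.values(results).every((r) => r.ok);
      return { status: ready ? 200 : 503, body: ready ? "ready" : "not ready" };
    },
    // Detailed status: per-dependency JSON for dashboards and on-call engineers.
    statusz: async (now) => ({
      status: 200,
      body: JSON.stringify(await runChecks(now)),
    }),
  };
}
```

Each returned function yields a `{ status, body }` pair, so wiring it into a route is a matter of copying those two values onto the response object of whatever framework is in use.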

What the audit will surface

After the audit, you will land on one of three outcomes.

Outcome A: The endpoint is honest. The route handler runs real checks, returns 503 on real failures, runs on the production worker pool, and the orchestrator's view matches reality. You have spent two hours and confirmed a piece of your stack is working as intended. Worth knowing.

Outcome B: The endpoint is structurally lying. It is hardcoded to 200, or it is checking the wrong things, or it runs on the wrong worker pool. You have a short list of changes to ship. Usually one afternoon of work.

Outcome C: The endpoint is mostly honest but has gaps. It checks the database but not the cache. It catches one exception silently but not the others. It checks an upstream that no longer matters. You have a slightly longer list, but each item is mechanical.

In our consulting work, Outcome C is the most common. Teams add real checks year by year, and the endpoint grows organically into something that mostly works but has at least one significant lie buried in it.

The takeaway

A lying health endpoint is the most expensive piece of seemingly trivial code in your stack. It is expensive because every alert, dashboard, runbook, and on-call escalation depends on it. When it lies, every downstream signal is poisoned, and the team trusts a green dashboard while users are unhappy.

The audit is short. The fix is mechanical. The cost of doing it once is a fraction of the cost of a single 2 AM page caused by a probe that should have surfaced the problem ten minutes earlier. If you have not audited your health endpoint in the last year, you are almost certainly running on at least one of the four lies above.

For a longer reference and the full three-endpoint design pattern, the hub article covers the rest. The services hub is where the audit conversation usually begins.