137Foundry

Posted on Jun 22

How to Write Readiness Checks That Survive Real Production Traffic

Readiness checks are deceptively easy to write and surprisingly easy to write badly. A readiness check that works in development can become a load-bearing piece of your production infrastructure in ways the author never intended. This guide walks through how to write a readiness check that holds up under real production traffic and does not become its own reliability liability.

The audience for the readiness check is the load balancer in your orchestrator (typically Kubernetes or a similar platform) and any front-end reverse proxy like Nginx or an edge service like Cloudflare. The job of the check is to tell the load balancer whether this instance can serve real production traffic right now.

Step 1: Define what "ready" means for this service

Before writing any code, write down the dependencies your service must have available to serve a real production request. The list should be short and specific. For most web services, it looks like:

The primary database (read and write).
The primary cache.
The message broker, if the request path produces messages synchronously.
Any internal upstream service called synchronously in the request handler.

Things that should not be on the list:

Asynchronous background dependencies. If your service writes events to a queue but the request can complete without confirming queue health, the queue is not a readiness dependency.
Non-critical caches. If a cache outage degrades performance but does not break the request, it should be reported in detail status but not block readiness.
External APIs called outside the synchronous request path. Same reasoning.
Disk space, memory, CPU. These are metrics for Prometheus or another monitoring tool, not binary readiness signals.

This list is the entire substance of the readiness check. Writing it down explicitly is the most important step.
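One way to keep the list honest is to put it in code next to the handler. The sketch below is only an illustration: the dependency names and the READINESS_CONTRACT and blocksReadiness identifiers are hypothetical, not part of any library.

```javascript
// Hypothetical readiness contract for an example service. Every name below is
// illustrative and should be replaced with the service's real dependencies.
const READINESS_CONTRACT = Object.freeze({
  // Failing any of these should fail readiness.
  required: ["primary-db", "primary-cache", "message-broker"],
  // Reported in detailed status only; never blocks readiness.
  reportedOnly: ["secondary-cache", "analytics-api"],
  // Tracked as metrics, not as readiness signals.
  metricsOnly: ["disk", "memory", "cpu"],
});

// Returns true when a dependency name is allowed to fail readiness.
function blocksReadiness(name) {
  return READINESS_CONTRACT.required.includes(name);
}
```

Keeping the contract in one place makes it easy to review when a new dependency is added to the request path.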

Step 2: Implement each dependency check as a fast, isolated function

Each dependency check should be a small function that takes no parameters, runs a single fast query against the dependency, and returns a structured result: an ok flag, the measured latency, and a brief error string on failure.

A reasonable database readiness check looks like:

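// Assumes `db` is an already-connected client whose query() returns a promise.
async function checkPrimaryDb() {
  const start = performance.now();
  try {
    // Cheapest possible round trip: proves connectivity, not query performance.
    await db.query("SELECT 1");
    return { ok: true, latency_ms: performance.now() - start };
  } catch (e) {
    // Never throw; report the failure and let the caller decide what to do.
    return { ok: false, error: e.message, latency_ms: performance.now() - start };
  }
}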

Notes on each part:

The query is SELECT 1. Not a real query. Not a table scan. The check is testing connectivity, not performance.
The timing is captured. You will want it for the detailed status endpoint.
The function does not throw; it returns a structured result. The caller decides what to do with a failure.

Repeat this pattern for each dependency. Resist the urge to write a "generic" check that takes a dependency name and runs a different query per name. Specific is better than generic here; every dependency has its own quirks.
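As an illustration of repeating the pattern, here is a minimal sketch of a cache check. The cache object is a stand-in and its ping() method is an assumption; substitute whatever cheap round-trip command your real client offers.

```javascript
// Stand-in for a real cache client. The ping() method is an assumption; use
// the cheapest round-trip command your client library actually provides.
const cache = {
  async ping() {
    return "PONG";
  },
};

// Same shape as checkPrimaryDb: no parameters, one cheap call, never throws.
async function checkPrimaryCache() {
  const start = performance.now();
  try {
    await cache.ping();
    return { ok: true, latency_ms: performance.now() - start };
  } catch (e) {
    return { ok: false, error: e.message, latency_ms: performance.now() - start };
  }
}
```

The function is deliberately a near copy of the database check. The duplication leaves room for cache-specific handling later without touching the other checks.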

Step 3: Run the checks in parallel and with a timeout

The readiness handler should call every dependency check in parallel and apply a hard timeout (typically 500 milliseconds to 2 seconds). The timeout matters because a slow dependency that does not return in time should fail the readiness check fast, not hang the probe.

A reasonable handler in pseudocode:

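async function readyz(req, res) {
  // withTimeout is assumed to resolve to { ok: false, ... } when the timer
  // expires rather than reject; a rejection would make Promise.all throw
  // out of the handler instead of producing a 503.
  const checks = await Promise.all([
    withTimeout(checkPrimaryDb(), 1000),
    withTimeout(checkPrimaryCache(), 500),
    withTimeout(checkMessageBroker(), 1000),
  ]);
  // Ready only when every dependency reported ok.
  const allOk = checks.every(c => c.ok);
  res.status(allOk ? 200 : 503).json({ checks });
}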

The parallel execution means the overall check completes in roughly the time of the slowest dependency, not the sum of all dependencies. The timeout means a stuck dependency does not stick the probe.
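The withTimeout helper is not defined in this article. One possible sketch, assuming it should always resolve to a result object so the handler never sees a rejection:

```javascript
// Races a check promise against a timer. Resolves (never rejects) so that
// Promise.all in the handler always receives one result per dependency.
function withTimeout(checkPromise, ms) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(
      () => resolve({ ok: false, error: `timed out after ${ms}ms` }),
      ms
    );
  });
  // Convert an unexpected rejection into a failed result.
  const guarded = Promise.resolve(checkPromise).catch((e) => ({
    ok: false,
    error: e instanceof Error ? e.message : String(e),
  }));
  // Clear the timer once the race settles so it does not keep the process alive.
  return Promise.race([guarded, timeout]).finally(() => clearTimeout(timer));
}
```

Note that this only stops waiting for the slow check; it does not cancel the underlying query. If your client supports a per-call timeout or cancellation, setting it as well avoids piling up abandoned work.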

Step 4: Cache the result briefly

The orchestrator polls the readiness endpoint frequently. If every probe ran a fresh database query, the database would experience a constant trickle of health-check load. Multiply by every pod, and the health check becomes a load source.

The fix is to cache the readiness result for a short window (typically 1 to 2 seconds). The handler:

Returns the cached result if it is less than the cache TTL old.
Otherwise runs the checks fresh, caches the result with the timestamp, and returns it.

A 1-second cache means each pod runs at most 1 set of dependency checks per second, regardless of how often the orchestrator polls. This is the right shape: the readiness result is fresh enough to be meaningful but not so fresh that it becomes a workload.
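A minimal sketch of that caching behavior follows. The cached function and its injectable now clock are illustrative names, and sharing one in-flight run between concurrent probes is an addition beyond what the text describes.

```javascript
// Wraps an async function that runs all checks and memoizes its result for
// ttlMs. Concurrent probes during a refresh share one in-flight run.
function cached(runChecks, ttlMs, now = Date.now) {
  let value = null;
  let fetchedAt = -Infinity;
  let inFlight = null;

  return async function getReadiness() {
    // Serve the cached result while it is younger than the TTL.
    if (now() - fetchedAt < ttlMs) return value;
    if (!inFlight) {
      inFlight = runChecks()
        .then((result) => {
          value = result;
          fetchedAt = now();
          return result;
        })
        .finally(() => {
          inFlight = null;
        });
    }
    return inFlight;
  };
}
```

The clock is a parameter only so the TTL logic can be tested without sleeping; in the handler you would call cached(runAllChecks, 1000) and leave the default.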

Step 5: Distinguish between hard and soft failures

Not every dependency is equally critical. A reasonable readiness handler distinguishes:

Hard dependencies. If any of these is down, the service cannot serve requests, so the readiness check should fail. Typical examples: the primary database, the primary cache, and the message broker.
Soft dependencies. If these are degraded, the service can serve requests with degraded behavior. The readiness check should pass; the degraded state should be visible in detailed status.

For most web services, the soft category includes secondary caches, analytics endpoints, and any non-critical external integration. The boundary depends on the service.

A common bug: classifying a soft dependency as hard, then having a non-critical service outage take the whole production cluster out of rotation. The fix is to be explicit about the boundary and to test failure modes for each.
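One way to make the boundary explicit is to evaluate results against a named list of hard dependencies. This is a sketch under assumptions: evaluateReadiness, the result shape, and the dependency names are all hypothetical.

```javascript
// Combines named check results into a readiness verdict. Only names listed in
// hardDeps can fail readiness; every other failure is reported as degraded.
function evaluateReadiness(results, hardDeps) {
  const failed = Object.entries(results)
    .filter(([, result]) => !result.ok)
    .map(([name]) => name);
  const hardFailures = failed.filter((name) => hardDeps.includes(name));
  const degraded = failed.filter((name) => !hardDeps.includes(name));
  const ready = hardFailures.length === 0;
  return {
    status: ready ? 200 : 503,
    body: { ready, hardFailures, degraded, checks: results },
  };
}
```

With this shape, a failing analytics check still yields a 200 but appears under degraded, which is the behavior the soft category calls for.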

Step 6: Handle graceful shutdown

When the process receives SIGTERM (a deploy, a scale-down, an orchestrator-initiated restart), the readiness check should immediately start failing so the load balancer stops routing new requests. Meanwhile, in-flight requests continue to be served until they complete.

In practice this looks like:

let shuttingDown = false;
process.on("SIGTERM", () => { shuttingDown = true; });

async function readyz(req, res) {
  if (shuttingDown) {
    return res.status(503).json({ shutting_down: true });
  }
  // ... normal check logic
}

The liveness probe should keep passing during this window so the orchestrator does not SIGKILL the process before in-flight work has drained.
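The flag alone does not drain anything. A possible sketch of the rest of the sequence, assuming an http.Server-like object and a hypothetical drainDelayMs grace period (neither the createShutdown name nor the delay value comes from the original):

```javascript
// Builds a SIGTERM handler for an http.Server-like object. drainDelayMs is an
// assumed grace period that lets the load balancer observe the failing probe
// before the server stops accepting connections.
function createShutdown(server, drainDelayMs) {
  const state = { shuttingDown: false };

  function onSigterm() {
    if (state.shuttingDown) return Promise.resolve();
    state.shuttingDown = true; // readyz should return 503 from here on
    return new Promise((resolve) => {
      setTimeout(() => {
        // close() stops accepting new connections and invokes the callback
        // once the server has fully closed.
        server.close(() => resolve());
      }, drainDelayMs);
    });
  }

  return { state, onSigterm };
}

// Wiring, assuming `server` is a Node http.Server:
//   const { state, onSigterm } = createShutdown(server, 5000);
//   process.on("SIGTERM", () => onSigterm().then(() => process.exit(0)));
```

The delay should be longer than the orchestrator's probe interval and shorter than its termination grace period; the right values depend on your platform configuration.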

Step 7: Confirm the failure modes in a test environment

This is the step that verifies the design. In a non-production environment, simulate each dependency outage and confirm:

Killing the database causes readiness to fail within the timeout window.
Killing the cache causes readiness to fail.
Killing the message broker causes readiness to fail.
Killing a soft dependency does not cause readiness to fail but does show up in detail status.
SIGTERM causes readiness to fail immediately.
After dependency recovery, readiness recovers automatically.

Run these tests during the initial implementation and after any meaningful change. About an hour of work catches the most common ways a readiness endpoint misreports its state.
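Part of the checklist can also be exercised at unit level before touching a real environment. The sketch below injects stub checks into a simplified readiness function; the readiness, up, down, and runFailureModeSuite names are hypothetical, and this complements rather than replaces killing real dependencies.

```javascript
// Minimal readiness evaluation over injected checks, so outages can be
// simulated without touching real infrastructure.
async function readiness(checks, state) {
  if (state.shuttingDown) return { status: 503, results: {} };
  const names = Object.keys(checks);
  const settled = await Promise.all(names.map((name) => checks[name]()));
  const results = Object.fromEntries(names.map((name, i) => [name, settled[i]]));
  return { status: settled.every((r) => r.ok) ? 200 : 503, results };
}

// Stub checks standing in for a healthy and a failed dependency.
const up = async () => ({ ok: true });
const down = async () => ({ ok: false, error: "connection refused" });

// Walks through outage, recovery, and shutdown, recording each status code.
async function runFailureModeSuite() {
  const state = { shuttingDown: false };
  const outcomes = {};
  outcomes.healthy = (await readiness({ db: up, cache: up }, state)).status;
  outcomes.dbDown = (await readiness({ db: down, cache: up }, state)).status;
  outcomes.recovered = (await readiness({ db: up, cache: up }, state)).status;
  state.shuttingDown = true;
  outcomes.sigterm = (await readiness({ db: up, cache: up }, state)).status;
  return outcomes;
}
```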

Step 8: Watch the probe under real load

Once deployed, monitor the readiness probe over time. Watch for:

Frequent flaps (rapid 200/503/200/503 cycles). Usually indicates a timeout that is too aggressive or a dependency that is genuinely flaky.
Slow response times. If the probe takes more than 500ms on average, something is wrong with the parallelization or the timeout configuration.
Correlation with downstream incidents. Genuine readiness failures should correspond to real downstream issues.

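Tools like Prometheus can record probe results over time, for example by probing the endpoint through the Blackbox exporter or by scraping counters the service itself exports, and graph these signals. The graphs will tell you whether the readiness check is doing its job or has become noise.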
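If you collect probe samples yourself, flapping can be quantified with a small helper. This is a sketch; countFlaps and the { at, status } sample shape are assumptions made for illustration.

```javascript
// Counts transitions between ready and not-ready in a list of probe samples.
// Each sample is { at: epochMillis, status: httpStatus }, ordered by time.
function countFlaps(samples, windowMs, now) {
  // Only consider samples that fall inside the observation window.
  const recent = samples.filter((s) => now - s.at <= windowMs);
  let flaps = 0;
  for (let i = 1; i < recent.length; i++) {
    const wasReady = recent[i - 1].status === 200;
    const isReady = recent[i].status === 200;
    if (wasReady !== isReady) flaps++;
  }
  return flaps;
}
```

A count that stays above a handful per minute is usually a better alerting signal than any single failed probe.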

Common mistakes and how to fix them

A few patterns we have seen in audits at 137Foundry's web development service:

Sequential dependency checks. Each dependency is checked in series. The total probe time is the sum of all checks, which slows the probe response. Fix: parallelize.

No timeout. The check awaits each dependency without a timeout. A slow dependency hangs the probe indefinitely. Fix: add a timeout per dependency.

No caching. Every probe runs fresh checks. Probe load becomes meaningful at scale. Fix: cache for 1-2 seconds.

Catch-all success. A try/catch around the whole handler returns 200 on any exception. Real failures get hidden. Fix: return 503 on any caught error and surface the error in the response body.

Missing graceful shutdown. The probe does not respect SIGTERM. Deploys lose in-flight requests. Fix: add the shuttingDown flag.

Checking the wrong dependency. The check queries a stub or a deprecated endpoint that no longer reflects production state. Fix: trace every check to a real production dependency.
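For the catch-all success mistake in particular, the fix can be a small wrapper. The sketch below assumes an Express-style res object with status() and json(); failClosed is a hypothetical name.

```javascript
// Wraps a handler so that any unexpected exception produces a 503 with the
// error message in the body, instead of being swallowed into a 200.
function failClosed(handler) {
  return async function (req, res) {
    try {
      await handler(req, res);
    } catch (e) {
      const message = e instanceof Error ? e.message : String(e);
      res.status(503).json({ ok: false, error: message });
    }
  };
}
```

Usage would look like app.get("/readyz", failClosed(readyz)), so a bug in the handler takes the instance out of rotation rather than hiding the failure.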

The takeaway

A readiness check that survives production traffic is a small piece of code that defines the dependency contract explicitly, runs checks in parallel with timeouts, caches briefly to avoid becoming load, distinguishes hard and soft failures, and respects graceful shutdown. None of these requirements is hard. All of them are easy to miss.

For the longer write-up on how readiness sits inside the three-endpoint health check design and how the orchestrator, load balancer, and detailed status views all coordinate, the hub article on health check endpoint design is the place to land. The services hub is where the longer engagements live.

Most teams ship a v1 readiness check in a single afternoon. Most teams do not revisit it for two years. The audit described in the common-mistakes section above usually catches at least one bug in the v1 implementation; running it once is a strong upgrade for any production web service.

DEV Community