Health Checks: Understand in 3 Minutes

#healthcheck #livenessprobe #readinessprobe #abotwrotethis

Problem Statement

A health check is a lightweight endpoint or test that tells you whether your application is alive, ready to serve traffic, or dying a slow death. You’ve probably been woken up at 3 AM by an alert that your API is returning 500s, only to find it’s actually working fine – or worse, your load balancer kept sending requests to a zombie instance that was still accepting connections but returning garbage. Health checks exist to answer that simple question: “Is this thing actually working right now?” Without them, your system relies on guesswork, and your users are the first to know something is wrong.

Core Explanation

A health check is a dedicated, minimal path in your application (like GET /health) that returns a quick status. It’s not a full regression test – just a heartbeat. Most health checks fall into two categories:

Liveness check: “Is the process still able to make progress?” This catches deadlocks or infinite loops that leave the process running but stuck. If it fails, the orchestrator (Kubernetes, for example) restarts the container.
Readiness check: “Is the service ready to accept traffic?” This matters after startup or during heavy load when the app might be alive but not able to handle requests (e.g., still loading a cache or waiting for a database connection).

Think of it like a car’s check engine light. The light itself doesn’t tell you which cylinder is misfiring, but it tells you something is wrong before the engine seizes up. Health checks work the same way: they give you a binary signal – healthy or unhealthy – and leave deeper diagnostics for other tools.

Key components of a well-designed health check:

Minimal dependencies: The health endpoint should only validate what’s truly critical (e.g., database connectivity, essential external APIs). If you check every service dependency, a cascading failure can make it look like your app is down when it’s actually fine.
Cached or short timeout: Health checks run frequently – every few seconds. They must be fast (< 100ms ideally) and not block on long operations.
Separation of concerns: Liveness and readiness should be separate endpoints. Mixing them can cause unnecessary restarts when the app is merely slow to start or waiting on a dependency.
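To make the "cached or short timeout" point concrete, here is a minimal sketch of a wrapper that puts a deadline on a dependency check and reuses the last result for a short time. The name createCachedCheck and its options (timeoutMs, ttlMs, now) are made up for illustration, not part of any library, and the default values are only examples.

```javascript
// Hypothetical helper: wraps an async dependency check with a timeout
// and a short-lived cache so frequent probes stay cheap.
function createCachedCheck(checkFn, { timeoutMs = 100, ttlMs = 2000, now = Date.now } = {}) {
  let cached = null; // { healthy, checkedAt }

  return async function check() {
    // Serve the previous answer while it is still fresh.
    if (cached && now() - cached.checkedAt < ttlMs) {
      return cached.healthy;
    }

    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error('health check timed out')), timeoutMs);
    });

    let healthy;
    try {
      // Whichever settles first wins: the real check or the deadline.
      await Promise.race([checkFn(), timeout]);
      healthy = true;
    } catch (err) {
      healthy = false;
    } finally {
      clearTimeout(timer);
    }

    cached = { healthy, checkedAt: now() };
    return healthy;
  };
}
```

A readiness handler could then call something like createCachedCheck(() => db.ping()) once at startup and await the returned function on each request, so a burst of probes does not turn into a burst of database round trips.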

The typical flow: a load balancer or container orchestrator calls GET /health on a schedule. If it returns HTTP 200 (often with a body such as {"status":"ok"}), the service stays in rotation. If it returns 503 or times out, the checker marks the instance as unhealthy and reroutes traffic. Most checkers look only at the status code by default; the body is there for humans.
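Seen from the checker's side, that flow can be sketched as a tiny state tracker. This is an illustration, not how any particular load balancer is implemented: createRotationTracker is a hypothetical name, and the consecutive-failure threshold is an assumption added here because real checkers usually wait for several failed probes before pulling an instance.

```javascript
// Hypothetical checker-side logic: one probe result in, rotation decision out.
function createRotationTracker({ failureThreshold = 3, successThreshold = 1 } = {}) {
  let inRotation = true;
  let failures = 0;
  let successes = 0;

  // statusCode is the HTTP status of GET /health, or null if the probe timed out.
  return function record(statusCode) {
    const ok = statusCode !== null && statusCode >= 200 && statusCode < 400;
    if (ok) {
      failures = 0;
      successes += 1;
      if (!inRotation && successes >= successThreshold) inRotation = true;
    } else {
      successes = 0;
      failures += 1;
      if (inRotation && failures >= failureThreshold) inRotation = false;
    }
    return inRotation; // true = keep sending traffic to this instance
  };
}
```

Requiring a few consecutive failures trades a slightly slower reaction for fewer flaps when a single probe gets lost.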

Practical Context

When to use health checks: Always. Any service that runs in production and is fronted by a load balancer or runs in an orchestrator (Docker Compose, Kubernetes, Nomad, etc.) benefits from health checks. Specifically:

Microservices: Each service should expose liveness and readiness endpoints so orchestration engines can manage lifecycle automatically.
Load-balanced APIs: Load balancers and reverse proxies (AWS ALB, Google Cloud Load Balancing, Nginx) can use health checks to stop sending traffic to failing instances.
Database migrations: A readiness check can delay traffic until migrations complete, avoiding race conditions.
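The migration case boils down to a flag that starts as "not ready" and flips once startup work is done. Here is a rough sketch under that assumption; createReadinessGate, start, and runMigrations are hypothetical names standing in for whatever your service actually does at boot.

```javascript
// Hypothetical readiness gate: reports "not ready" until startup tasks finish.
function createReadinessGate() {
  let ready = false;
  return {
    markReady() {
      ready = true;
    },
    // Returns the status code and body a /ready handler would send.
    response() {
      return ready
        ? { statusCode: 200, body: { status: 'ready' } }
        : { statusCode: 503, body: { status: 'not ready' } };
    },
  };
}

// runMigrations is a stand-in for the service's real startup work.
async function start(gate, runMigrations) {
  await runMigrations();
  gate.markReady(); // only now should traffic be routed here
}
```

In an Express app, the /ready handler would read gate.response() and pass its statusCode and body to res.status(...).json(...), so the server can start listening immediately while still answering 503 until migrations complete.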

What NOT to do with health checks:

Don’t make them do heavy work, like running an expensive database query. That degrades performance and causes spurious failures under load, when the check itself starts timing out.
Don’t use a single health check to test everything – separate critical dependencies from nice-to-haves. If your app depends on an external analytics service that is temporarily down, the health check should still pass; the app can degrade gracefully.
Don’t rely solely on health checks for monitoring – they are for automated decision-making, not for deep troubleshooting. Use separate alerting for latency, error rates, and custom metrics.
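One way to keep nice-to-haves from failing the check is to tag each dependency as critical or not and let only the critical ones affect the status code. The sketch below assumes a hypothetical evaluateHealth function and a made-up dependency shape ({ name, critical, check }); the 'degraded' status string is likewise just an illustration.

```javascript
// Hypothetical aggregator: only critical dependencies can fail the check.
async function evaluateHealth(dependencies) {
  const results = await Promise.all(
    dependencies.map(async ({ name, critical, check }) => {
      try {
        await check();
        return { name, critical, healthy: true };
      } catch (err) {
        return { name, critical, healthy: false };
      }
    })
  );

  const criticalDown = results.some((r) => r.critical && !r.healthy);
  // Optional dependencies that are down are reported but do not fail the check.
  const degraded = results.filter((r) => !r.critical && !r.healthy).map((r) => r.name);

  let status = 'ok';
  if (criticalDown) status = 'unhealthy';
  else if (degraded.length > 0) status = 'degraded';

  return { statusCode: criticalDown ? 503 : 200, body: { status, degraded } };
}
```

With this shape, an analytics outage still returns 200 and shows up in the body for whoever is looking, while a database outage returns 503 and takes the instance out of rotation.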

Why you should care: Health checks are the foundation of self-healing systems. They let your infrastructure automatically restart dead instances, drain traffic from slow ones, and gradually roll out new versions. Without them, you’re running a fire-fighting operation.

Quick Example

Here’s a minimal health check in a Node.js Express app:

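const express = require('express');
const app = express();

// Placeholder so the example runs: swap in your real database client.
// ping() is assumed to be a cheap round trip that rejects if the DB is unreachable.
const db = { ping: async () => {} };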

// Readiness check – also checks DB connectivity
app.get('/ready', async (req, res) => {
  try {
    await db.ping(); // lightweight check
    res.status(200).json({ status: 'ready' });
  } catch (err) {
    res.status(503).json({ status: 'not ready', error: err.message });
  }
});

// Liveness check – just returns 200 if process is running
app.get('/live', (req, res) => {
  res.status(200).json({ status: 'alive' });
});

app.listen(3000);

What this demonstrates: The /live endpoint is a simple “I’m still running” signal. The /ready endpoint also verifies database connectivity. In a Kubernetes Deployment, you’d point livenessProbe to /live and readinessProbe to /ready. This separation prevents the orchestrator from restarting your container every time the DB has a hiccup (liveness stays green), while still stopping new traffic during a real outage (readiness goes red).
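For reference, the probe wiring might look roughly like the fragment below, which would sit under a container entry in the Deployment's pod spec. The port matches app.listen(3000) from the example; the timing values are illustrative assumptions, not recommendations, so tune them for your service.

```yaml
# Fragment of a container spec; values are examples only.
livenessProbe:
  httpGet:
    path: /live
    port: 3000
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  periodSeconds: 5
  timeoutSeconds: 1
  failureThreshold: 3
```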

Key Takeaway

Always implement separate liveness and readiness endpoints that are fast, minimal, and only check what truly matters for traffic acceptance. This single pattern will make your deployments safer, your debugging easier, and your nights quieter. For deeper reading, check out the Kubernetes documentation on Configure Liveness, Readiness and Startup Probes.

DEV Community