Your App is 'Up' But Not Working: Docker Healthchecks

#docker #healthcheck #systemadministration #devops

Last winter, we had a critical integration service running for a production ERP. Around 02:00 AM, everything looked green on my system monitor; the service was in "Up" status in the docker ps output. However, the operator screens were blank, and no data was coming from the production line. When I dove into the logs in a panic, I realized the service couldn't connect to an external API.

This is exactly what it means when a container that appears "Up" is not actually "healthy." By default, Docker only checks whether the main process is running; it doesn't know whether the application is truly functional. In this post, I will explain step by step how to avoid these misleading situations and write robust internal health checks (HEALTHCHECK) for our Docker containers.

Why Do We Fall Into the 'Up' But 'Not Working' State?

This is a problem I frequently encounter, usually exploding at a critical moment. When a container's main process starts successfully, Docker marks it as "Up." But this doesn't mean all of the application's dependencies are ready or that the business logic is working correctly.

These situations usually stem from a few different scenarios. For example, your application might have started, but it hasn't established a database connection yet, or it couldn't pull critical configurations from an external API. When the Redis connection dropped in the backend of my own side project, the service still appeared to be running, but no requests were being answered; this was a typical "Up" but "not working" scenario.

ℹ️ Common Reasons

In-App Initialization Process: The application process has started, but critical dependencies like database connections, cache servers, or other services are not yet ready. This is especially common in large Java applications or services that do a lot of work on initial startup.

External Dependency Issues: The application is running, but the external services it needs (database, message broker, another microservice) are inaccessible or returning errors.

Resource Exhaustion: The container might be hitting a cgroup memory limit or CPU limit. Even if the process appears to be running, it cannot process requests because it lacks sufficient resources. When one of my services on my own VPS hit the cgroup memory.high limit, the HEALTHCHECK was still passing because it only checked that the port was open, but the application couldn't process the data in memory.

Internal Logic Errors: The application process is technically running but has entered an error loop or the business logic is broken.

Using Docker's HEALTHCHECK feature is essential to detect these situations, and it gives an orchestrator or monitoring tool a signal it can act on automatically. Otherwise, many system administrators and developers, myself included, end up dealing with alerts in the middle of the night.

Docker HEALTHCHECK Basics

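Docker HEALTHCHECK allows you to periodically check whether the application inside a container is truly healthy. After the configured number of consecutive failures, Docker marks the container as "unhealthy." Docker Compose can use this status (for example, through depends_on with condition: service_healthy), and Docker Swarm replaces unhealthy tasks. Kubernetes, on the other hand, ignores the Dockerfile HEALTHCHECK and relies on its own liveness and readiness probes. A plain docker run only reports the status; it doesn't restart the container on its own.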
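To see what Docker has recorded for a container, you can read the health section of docker inspect. The sketch below is a small helper of my own (summarize_health is a hypothetical name) that parses the JSON printed by docker inspect --format '{{json .State.Health}}' <container>. The sample string is made up for illustration and trimmed to the fields the helper reads.

```python
import json


def summarize_health(health_json: str) -> str:
    """Summarize the JSON printed by:
    docker inspect --format '{{json .State.Health}}' <container>
    """
    health = json.loads(health_json)
    if health is None:
        # Containers without a HEALTHCHECK have no health section.
        return "no healthcheck configured"
    status = health.get("Status", "unknown")
    streak = health.get("FailingStreak", 0)
    log = health.get("Log") or []
    # The last log entry holds the exit code of the most recent probe.
    last_exit = log[-1].get("ExitCode") if log else None
    return f"status={status} failing_streak={streak} last_exit_code={last_exit}"


# Hypothetical sample, not taken from a real container.
sample = (
    '{"Status": "unhealthy", "FailingStreak": 3, '
    '"Log": [{"ExitCode": 1, "Output": "curl: (7) Failed to connect"}]}'
)
print(summarize_health(sample))
```

In practice you would pipe the docker inspect output into a function like this from a monitoring script, rather than hard-coding a sample.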

You can define a HEALTHCHECK instruction by adding it to your Dockerfile or specifying it in your docker-compose.yml file. In its most basic form, we expect to send a request to a specific HTTP endpoint of your application and receive a successful response.

# HEALTHCHECK definition inside a Dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD curl --fail http://localhost:8080/health || exit 1

Each of the parameters I used here has an important meaning:

--interval=30s: Run the health check every 30 seconds. Checking too frequently can increase resource consumption; checking too infrequently can cause you to notice problems late.
--timeout=10s: Wait a maximum of 10 seconds for the health check command to complete. If the command doesn't finish within this time, the check is considered failed.
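--start-period=5s: Give the container 5 seconds to boot. Failed checks during this window are not counted toward --retries, while a successful check marks the container as healthy right away. Note that with --interval=30s the first check only runs 30 seconds after the container starts, so a 5-second start period has little practical effect here; set it to cover your real boot time. The boot time of a production ERP sometimes took up to 3 minutes; without start_period, it was always reporting 'unhealthy'.
--retries=3: How many consecutive times a HEALTHCHECK must fail before the container is marked as "unhealthy." This prevents false alarms caused by issues like temporary network glitches.
CMD curl --fail http://localhost:8080/health || exit 1: The command to run for the health check. With --fail, curl returns a non-zero exit code when the server responds with an HTTP status of 400 or higher, and it also exits non-zero when it cannot connect at all. The || exit 1 part maps every failure to exit code 1, which is what Docker expects for "unhealthy" (0 means healthy, and 2 is reserved).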
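To make the interaction between --interval, --start-period, and --retries easier to reason about, here is a simplified model of how the health status changes from probe to probe. It is a sketch based on my reading of Docker's documented behavior, not Docker's actual code. The function name simulate_health is my own, and I assume each probe finishes instantly, so probe number k runs k × interval seconds after the container starts.

```python
from typing import List


def simulate_health(results: List[bool], interval: float,
                    start_period: float, retries: int) -> List[str]:
    """Return the container health status after each probe result."""
    status = "starting"
    failing_streak = 0
    history = []
    for probe_number, ok in enumerate(results, start=1):
        # Simplification: the first probe runs one interval after start.
        started_at = probe_number * interval
        if ok:
            failing_streak = 0
            status = "healthy"
        else:
            # Failures are forgiven only while the container has never been
            # healthy and the start period has not yet elapsed.
            in_grace = status == "starting" and started_at < start_period
            if not in_grace:
                failing_streak += 1
                if failing_streak >= retries:
                    status = "unhealthy"
        history.append(status)
    return history


probes = [False, False, False, True, False, False, False]
print(simulate_health(probes, interval=30, start_period=60, retries=3))
# ['starting', 'starting', 'starting', 'healthy', 'healthy', 'healthy', 'unhealthy']
```

In this run only the first failure (at 30 seconds) falls inside the 60-second start period. After the container becomes healthy, it takes three consecutive failures to flip it to unhealthy.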

⚠️ Difference Between CMD-SHELL and CMD

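CMD-SHELL and CMD are the two forms of the test key in a Docker Compose healthcheck. With ["CMD-SHELL", "..."], the command string is run through the container's default shell (/bin/sh -c), so you can use shell features like pipes (|), redirects (>), and ||. With ["CMD", "curl", "--fail", "..."], the command is executed directly, and shell features cannot be used. In a Dockerfile, the same distinction exists between the shell form (HEALTHCHECK CMD curl ... || exit 1, as used above) and the exec form (HEALTHCHECK CMD ["curl", "--fail", "..."]). Generally, the shell variant is more flexible, but remember that it needs a shell inside the image and creates an additional process.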

A Basic HEALTHCHECK Definition

Let's take a simple web application, for example, a FastAPI service running on port 8080. As a first step, we can check that the service accepts connections on that port.

# Dockerfile
FROM python:3.9-slim-buster
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8080
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]

HEALTHCHECK --interval=10s --timeout=5s --start-period=10s --retries=3 \
  CMD python -c "import socket; s = socket.create_connection(('localhost', 8080), timeout=1); s.close()" || exit 1

Here, I used a Python one-liner instead of curl. Why? Because tools like curl are usually not included in slim or alpine-based images, and installing them increases the image size. Python's built-in socket module is a much lighter alternative. Keep in mind that this check only confirms that port 8080 accepts TCP connections; it doesn't request /health or look at the status code. This is a method I frequently use in projects where I want to keep the image size small.
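The one-liner above only opens a TCP connection. If you want to confirm that /health actually answers with a success status, still without installing curl, a minimal sketch with the standard library could look like this. The function name http_healthy and the 2-second default timeout are my own choices, and proxies are disabled on purpose because the probe talks to localhost.

```python
import urllib.request


def http_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True only if the URL answers with a 2xx status within the timeout."""
    # An empty ProxyHandler keeps environment proxy settings out of a local probe.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except Exception:
        # HTTP errors (4xx/5xx), refused connections, and timeouts all count
        # as "not healthy" for the purposes of a probe.
        return False
```

To use it as a HEALTHCHECK, you could save it in a script (say, a hypothetical /app/healthcheck.py) that ends by calling sys.exit(0 if http_healthy("http://localhost:8080/health") else 1), and point the HEALTHCHECK CMD at that script.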

Writing Smarter HEALTHCHECKs

Simply checking that a port is open, or having a /health endpoint that always returns 200 OK, is not enough in many cases. We need to write smarter checks that reflect the actual functionality of your application.

Checking External Dependencies

If your application has critical dependencies like a database, Redis, or an external API, the HEALTHCHECK should check these as well. In a client project, the ERP integration service remained 'Up' even when it couldn't connect to an external iSCSI storage unit. We had to add a custom HEALTHCHECK there.

PostgreSQL Connection Check:

HEALTHCHECK --interval=30s --timeout=5s --start-period=30s --retries=3 \
  CMD pg_isready -h localhost -p 5432 -U myuser || exit 1

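This command checks whether a PostgreSQL server at localhost:5432 is accepting connections. pg_isready doesn't authenticate, so the user doesn't need valid credentials; passing -U myuser mainly avoids failed-connection noise in the server log. On failure, pg_isready can return exit codes other than 1 (for example, 2 when there is no response), which is why || exit 1 is useful here. The pg_isready tool comes with PostgreSQL client packages. As written, the check fits the database container itself; in an application container, you would point -h at the database host instead of localhost.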

Redis Connection Check:

HEALTHCHECK --interval=15s --timeout=3s --start-period=10s --retries=5 \
  CMD redis-cli ping || exit 1

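The redis-cli ping command sends a PING request to the Redis server and returns a non-zero exit code if it cannot connect, which causes the HEALTHCHECK to fail. Be aware that if the server answers with an error reply (for example, while it is still loading data or when authentication is required), redis-cli may still exit with 0, so a stricter check compares the output with PONG, for instance redis-cli ping | grep -q PONG.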
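If the application image doesn't ship redis-cli, the same PING check can be approximated with a few lines of standard-library Python. This is a sketch under some assumptions: the Redis server needs no password and no TLS, and the function name redis_ping and its defaults are my own. It sends an inline PING command and treats only a +PONG reply as healthy.

```python
import socket


def redis_ping(host: str = "localhost", port: int = 6379,
               timeout: float = 2.0) -> bool:
    """Send an inline PING to a Redis server and require a +PONG reply."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            reply = conn.recv(64)
    except OSError:
        # Refused connections, timeouts, and resets all mean "not healthy".
        return False
    # Error replies such as -LOADING or -NOAUTH do not start with +PONG.
    return reply.startswith(b"+PONG")
```

Because the function looks at the reply itself rather than just the connection, a server that accepts connections but answers with an error is reported as unhealthy.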

Internal Application State Check

Sometimes it's necessary to query the application's own internal logic. For example, are certain caches full, are there too many messages in a specific queue, or is an internal service still able to perform its tasks? In the backend of the Android spam blocker app I developed, the service would slow down when certain caches were full. I added the cache status to the /health endpoint.

To do this, you can define a custom /healthz or /readiness endpoint in your application. This endpoint doesn't just return an HTTP 200 OK; it also checks the application's internal dependencies and critical state variables.

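// Example /healthz endpoint in a Node.js Express app.
// checkDatabaseConnection, checkRedisConnection, and checkQueueStatus are
// application-specific helpers; they may return a boolean or a Promise of one.
app.get('/healthz', async (req, res) => {
  try {
    const [isDbConnected, isRedisHealthy, isQueueProcessing] = await Promise.all([
      checkDatabaseConnection(), // Check database connection
      checkRedisConnection(),    // Check Redis connection
      checkQueueStatus(),        // Check message queue status
    ]);
    const healthy = isDbConnected && isRedisHealthy && isQueueProcessing;
    res.status(healthy ? 200 : 503).send(healthy ? 'Healthy' : 'Unhealthy');
  } catch (err) {
    // A check that throws is treated as a failed check
    res.status(503).send('Unhealthy');
  }
});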

Then, we can update the HEALTHCHECK command in the Dockerfile to query this endpoint:

HEALTHCHECK --interval=20s --timeout=5s --start-period=15s --retries=3 \
  CMD curl --fail http://localhost:8080/healthz || exit 1

This approach reflects the real-time status of your application much more accurately and prevents misleading "Up" states.
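Since the earlier example service was written in Python, here is the same aggregation idea as a framework-independent sketch. The name evaluate_health and the shape of the report are my own assumptions; each check is a callable that returns True or False, and a check that raises is recorded as a failure instead of crashing the endpoint.

```python
from typing import Callable, Dict, Tuple


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every check and return an HTTP status code plus a per-check report."""
    report: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            report[name] = "ok" if check() else "failed"
        except Exception as exc:
            # An exception inside a check is reported, not propagated.
            report[name] = f"error: {exc}"
    status_code = 200 if all(value == "ok" for value in report.values()) else 503
    return status_code, report
```

A /healthz handler in FastAPI or any other framework can call this function and return the status code with the report as the response body. Returning the per-check details makes it much easier to see which dependency failed when the HEALTHCHECK starts reporting unhealthy.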

Optimizing HEALTHCHECK Parameters

HEALTHCHECK parameters should be carefully adjusted according to the requirements of your application and environment. Incorrect settings can lead to unnecessary restarts or delayed detection of problems.

| Parameter | Description | Optimization Tips |