Alan West

Why Your Docker Containers Refuse to Die: The PID 1 Problem

You hit docker stop. Nothing happens. You wait ten seconds. Docker eventually sends SIGKILL. The container disappears, but only after a frustrating timeout. Your CI pipeline is slower than it should be, your Kubernetes pod terminations are sluggish, and you have a vague feeling something is wrong.

I hit this exact issue last month while debugging a deployment that took 90 seconds to roll out a single replica. Turned out to be the same boring culprit I've seen on at least four other projects: the PID 1 problem.

Let me walk you through what's actually happening, why it bites so many teams, and how to fix it properly.

The frustrating symptom

Here's what it usually looks like. You've got a Node app, a Python service, or whatever. You build it, run it, and try to stop it:

docker run --name myapp -d my-image:latest
# ... later ...
time docker stop myapp
# real    0m10.234s

Ten seconds. Every. Single. Time. That's the default --time value before Docker gives up and sends SIGKILL. If you're orchestrating dozens of containers, this adds up fast.

Worse, in production, this means your rolling deploys are slow, your zero-downtime story is shaky, and any in-flight requests are getting cut off ungracefully because your app never had a chance to clean up.

The root cause: PID 1 is weird

Here's the part most tutorials skip. In Linux, the process with PID 1 has special status. It's the init process. The kernel treats it differently in two important ways:

  • It does not get the default signal handlers. If you send SIGTERM to PID 1, and the process has no explicit handler for it, the signal is ignored. This is a kernel-level protection meant to keep init from being killed accidentally.
  • It is responsible for reaping zombie child processes. When a process anywhere in the PID namespace loses its parent, the orphan is re-parented to PID 1. When those orphans eventually exit, PID 1 must call wait() on them, or they linger as zombies for the life of the container.

Now, in a Docker container, your application process is PID 1. So if your Node script doesn't explicitly handle SIGTERM, Docker's stop signal goes nowhere. The kernel quietly drops it. Docker waits its timeout, then nukes you with SIGKILL.

You can confirm this is happening with a quick test:

# Inside a running container
ps -ef
# UID  PID  PPID  CMD
# root   1     0  node server.js

That 1 next to your app is the problem.
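
The zombie half of the story is just as easy to demo, by the way. Here's a minimal sketch using a stock alpine image: the shell backgrounds a short-lived child, then execs into sleep, leaving a PID 1 that never calls wait():

# Background a short-lived child, then exec into a PID 1 that never reaps it
docker run -d --name zombie-demo alpine sh -c 'sleep 2 & exec sleep 300'
sleep 5
docker exec zombie-demo ps
# On my machine busybox ps shows the dead sleep as [sleep], stuck in the
# Z (zombie) state; procps ps would tag it <defunct>
docker rm -f zombie-demo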

The proof, in one tiny example

Let me show you the bug in the smallest possible repro. Save this as app.js:

// No SIGTERM handler
setInterval(() => console.log('alive'), 1000);

And a Dockerfile:

FROM node:20-alpine
COPY app.js /app.js
CMD ["node", "/app.js"]

Build and run:

docker build -t pid1-demo .
docker run --name demo -d pid1-demo
time docker stop demo

You'll wait the full 10 seconds. Now compare with this:

// With SIGTERM handler
process.on('SIGTERM', () => {
  console.log('shutting down cleanly');
  process.exit(0);
});
setInterval(() => console.log('alive'), 1000);

Rebuild and stop. Instant. The container exits in well under a second because PID 1 now actually responds to the signal.

Fix #1: Handle signals in your app

The most correct fix is to handle SIGTERM (and usually SIGINT) in your application code. This is the right answer because your app probably needs to do cleanup anyway: drain HTTP connections, finish in-flight DB writes, flush logs.

For a Node HTTP server:

const http = require('http');

const server = http.createServer((req, res) => res.end('ok'));
server.listen(3000);

function shutdown() {
  console.log('signal received, draining...');
  // Stop accepting new connections, finish in-flight ones
  server.close(() => process.exit(0));
  // Hard stop if the drain takes too long
  setTimeout(() => process.exit(1), 8000).unref();
}

process.on('SIGTERM', shutdown);
process.on('SIGINT', shutdown);
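
One detail worth calling out in that snippet: the .unref() on the fallback timer stops the timer itself from keeping the event loop alive, so a clean drain still lets the process exit on its own instead of always waiting the full 8 seconds.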

For Python with Flask/Gunicorn, Gunicorn already handles this for you. For a raw script:

import signal, sys, time

def shutdown(signum, frame):
    print('cleaning up')
    sys.exit(0)

signal.signal(signal.SIGTERM, shutdown)
signal.signal(signal.SIGINT, shutdown)

while True:  # keep the process alive so there's something to stop
    time.sleep(1)

Fix #2: Use a proper init process

Sometimes you can't modify the app, or you've got a shell script as your entrypoint that spawns multiple children. In that case, run a tiny init process as PID 1 and let it handle signals and zombie reaping.

The usual choice is tini, which is around 24KB and does exactly one thing well. Docker actually ships with built-in tini support via the --init flag:

docker run --init --name demo -d pid1-demo

That's it. Docker injects a small init binary as PID 1, your app becomes PID 2, signals get forwarded properly, and zombies get reaped.
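
You can verify what --init does with the same ps trick from earlier. On my setup the injected binary shows up as /sbin/docker-init; the exact path may vary by Docker version, and the PIDs below are illustrative:

docker run --init --name init-demo -d pid1-demo
docker exec init-demo ps -ef
# UID  PID  PPID  CMD
# root   1     0  /sbin/docker-init -- node /app.js
# root   7     1  node /app.js
docker rm -f init-demo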

If you want it baked into the image instead of relying on the runtime flag:

FROM node:20-alpine
RUN apk add --no-cache tini
COPY app.js /app.js
# tini becomes PID 1 and execs your command as a child
ENTRYPOINT ["/sbin/tini", "--"]
CMD ["node", "/app.js"]

For Debian-based images, swap the apk add for apt-get install -y tini. There's also dumb-init, which does the same job; the main behavioral difference is that dumb-init forwards signals to the whole process group by default, while tini signals only its immediate child unless you pass -g. Both are fine.
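
For reference, a Debian-based variant of the same image might look like this; I'm assuming the distro package puts tini at /usr/bin/tini, which is where current Debian and Ubuntu install it:

FROM node:20-slim
RUN apt-get update && apt-get install -y --no-install-recommends tini \
    && rm -rf /var/lib/apt/lists/*
COPY app.js /app.js
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["node", "/app.js"]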

The shell-form CMD trap

One more gotcha. If you write your CMD in shell form, you actually get sh -c "..." as PID 1, not your app:

# Shell form — PID 1 is /bin/sh, NOT node
CMD node /app.js

# Exec form — PID 1 is node
CMD ["node", "/app.js"]

And a non-interactive sh has no SIGTERM handler either, so as PID 1 the signal is silently dropped, and the shell won't forward it to the node child it's waiting on. Always prefer exec form unless you genuinely need shell features. If you do need shell expansion, wrap the command with exec:

CMD ["sh", "-c", "exec node /app.js"]

The exec replaces the shell process with node, so node still ends up as PID 1.
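
The case where this actually earns its keep is environment expansion. Something like the following keeps node as PID 1 while still letting the shell expand the variable (the --port flag is hypothetical; substitute whatever your app actually reads):

CMD ["sh", "-c", "exec node /app.js --port ${PORT:-3000}"]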

Prevention checklist

A few habits that have saved me a lot of debugging time:

  • Default to exec-form CMD and ENTRYPOINT. It's a one-line change that prevents an entire class of bugs.
  • Add --init or bake in tini for any image where you don't fully control the application's signal handling.
  • Test your shutdown path locally with time docker stop <container>. If it takes more than two or three seconds, something is wrong; catch it before production does. A CI version of this check is sketched after this list.
  • Set a sensible terminationGracePeriodSeconds in Kubernetes to match your app's actual drain time. Don't just leave it at the 30-second default and hope.
  • Log on SIGTERM receipt. When something goes wrong in production, you want to know whether the signal arrived at all or was silently dropped.
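
For that shutdown test, here's a rough sketch of the check I'd wire into CI. Treat my-image:latest and the 3-second budget as placeholders for your own image and drain time:

#!/bin/sh
# Fail the build if the container needs more than 3 seconds to stop
cid=$(docker run -d my-image:latest)
sleep 2                                   # give the app a moment to boot
start=$(date +%s)
docker stop "$cid" > /dev/null
elapsed=$(( $(date +%s) - start ))
docker rm "$cid" > /dev/null
if [ "$elapsed" -gt 3 ]; then
  echo "shutdown took ${elapsed}s; SIGTERM is probably being dropped"
  exit 1
fi
echo "clean shutdown in ${elapsed}s"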

The meme version of this is: containers are easy, until they aren't. The boring reality is that Linux process semantics didn't change just because we put a thin namespace wrapper around them. PID 1 is special, signals are easy to drop, and zombies accumulate. Once you internalize that, half the weird container shutdown issues you'll ever see stop being mysterious.
