A container that's misbehaving is one of those problems where your instinct works against you. The pressure pushes you toward the dramatic move — restart it, redeploy, rebuild the image — before you actually know what's wrong. Most of the time the answer was sitting one cheap command away, and you skipped past it.
What's kept me calm through enough of these is having a fixed order. Same sequence every time, cheapest check first, moving up a layer only when the current one doesn't answer the question. When the clock is running, the order is what saves you, because you're not deciding what to do next — you already decided, months ago, when nothing was on fire. Here's the exact sequence I run on a container that's crashing, restarting, hanging, or just acting wrong.
The mental model: start cheap, move up
Every step below costs almost nothing to run and rules out a whole class of causes. The discipline is to actually do them in order instead of jumping to the exciting theory. A crashing container is almost always one of a few things: it exited with a telling status, it's screaming in its logs, its config or mounts are wrong, it's hitting a resource limit, or something is killing it from outside. The workflow walks that list from cheapest to most involved.
Step 1: docker ps -a — what's the status and exit code?
First question, always: what state is the container actually in? Not what you think it's in — what Docker says.
docker ps -a
The -a is non-negotiable here. A crashed container doesn't appear in plain docker ps, so if you skip the flag you'll conclude "it's gone" when it actually exited two seconds after starting. Look at the STATUS column:
- Up 10 minutes: it's running; your problem is inside a live container (skip to logs).
- Restarting (1) 5 seconds ago: it's in a crash loop. The number in parentheses is the last exit code. This is the classic "keeps dying on startup" case.
- Exited (0) 3 minutes ago: it stopped cleanly. Exit 0 means the main process finished normally; maybe it did its job and there's no bug at all.
- Exited (137) or Exited (143): it was killed by a signal. More on those in a second, because they're the ones people misread.
The exit code is the single most useful number on the screen, and it's free. Read it before you form any theory.
Reading exit codes accurately
A few exit codes come up constantly, and getting them right saves you from chasing the wrong problem:
- 0: clean exit. The process ended without error.
- 1: a generic application error. Whatever ran, it failed on its own terms; the logs will tell you why.
- 125: the docker run command itself failed (bad flag, bad image reference). The container never really started.
- 126: the command in the container was found but isn't executable (permissions, or it's not actually a binary).
- 127: command not found. Usually a wrong path or a missing binary in the image.
- 137: the process received SIGKILL (signal 9). The convention is 128 + signal, so 128 + 9 = 137. In containers this most often means the kernel's OOM killer stopped it for exceeding a memory limit, or Docker force-killed it after a stop timeout. It is not automatically "out of memory": it means SIGKILL, and OOM is the most common reason. Confirm it, don't assume it.
- 143: the process received SIGTERM (signal 15). 128 + 15 = 143. This is a graceful stop request; usually the container was told to shut down (a docker stop, an orchestrator scaling down). Often this is expected behavior, not a bug.
That distinction between 137 and 143 matters. 143 is usually "something asked it to stop politely and it did." 137 is "something killed it hard" — and that's the one worth investigating.
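The 128 + signal arithmetic is easy to turn into a lookup you can paste into a shell. This is a minimal sketch, not anything Docker ships: explain_exit_code is a name I made up for illustration, and it only covers the codes discussed above.

```shell
# explain_exit_code is a hypothetical helper, not a Docker command.
# It maps a container exit code to the meanings discussed above.
explain_exit_code() {
    code=$1
    case "$code" in
        ''|*[!0-9]*) echo "not a numeric exit code"; return 1 ;;
        0)   echo "clean exit" ;;
        1)   echo "generic application error" ;;
        125) echo "docker run itself failed" ;;
        126) echo "command found but not executable" ;;
        127) echo "command not found" ;;
        *)
            # By convention, codes above 128 mean 128 + signal number.
            if [ "$code" -gt 128 ] && [ "$code" -lt 256 ]; then
                echo "killed by signal $((code - 128))"
            else
                echo "application-defined exit code"
            fi
            ;;
    esac
}
```

Running explain_exit_code 137 prints "killed by signal 9". That only names the signal; who sent it is still an open question at this point.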
Step 2: docker logs --tail then -f — what did it say on the way down?
Now that you know the status, read what the container actually said. Start bounded — you don't want the entire history — then follow if you need live output:
# Last 100 lines, which usually includes the death rattle
docker logs --tail 100 web-1
# Add timestamps to correlate with other events
docker logs --tail 100 -t web-1
# If it's still running or restarting, follow it live
docker logs -f web-1
For a crash loop, --tail is perfect: the last thing before the exit is almost always the cause — an unhandled exception, a failed connection to a dependency, a missing environment variable. If the container is restarting fast, docker logs -f web-1 lets you watch a fresh crash happen in real time.
One honest caveat: docker logs only shows you what the process wrote to stdout and stderr. If the app logs to a file inside the container instead, this comes back empty, and that emptiness is itself a clue — either the app died before it could log, or its logs are going somewhere you'll have to go get (which is what Step 4 is for).
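One practical wrinkle when you start piping logs into grep: for a container started without a TTY, docker logs writes the container's stderr to your stderr, so a plain pipe only searches stdout. Here's a rough sketch of how I'd fold both streams together. The recent_errors name and the grep pattern are my own assumptions, so adjust the pattern to whatever your app actually prints.

```shell
# recent_errors is a hypothetical helper; the grep pattern is only a starting guess.
# Usage: recent_errors web-1 [since]   (since defaults to 10m)
recent_errors() {
    name=$1
    since=${2:-10m}
    # 2>&1 folds the container's stderr into the pipe so grep sees both streams.
    docker logs --since "$since" -t "$name" 2>&1 |
        grep -iE 'error|exception|fatal|panic|refused'
}
```

An empty result here isn't proof of health. It can also mean the app logs to a file, or uses wording the pattern doesn't cover.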
Step 3: docker inspect --format — state, restarts, mounts, health
If the logs didn't hand you the answer, ask Docker for the structured facts. Raw docker inspect is a wall of JSON; --format pulls out exactly the field you care about:
# Exact state and exit code
docker inspect --format '{{ .State.Status }} {{ .State.ExitCode }}' web-1
# How many times has it restarted? A high count confirms a real loop.
docker inspect --format '{{ .RestartCount }}' web-1
# Was it OOM-killed? This is the definitive answer to the 137 question.
docker inspect --format '{{ .State.OOMKilled }}' web-1
# What's actually mounted where?
docker inspect --format '{{ range .Mounts }}{{ .Source }} -> {{ .Destination }}{{ "\n" }}{{ end }}' web-1
# If the image defines a healthcheck, what's it reporting?
docker inspect --format '{{ .State.Health.Status }}' web-1
.State.OOMKilled is the one that turns a guess into a fact. Saw a 137 and suspect memory? This returns true or false and settles it. RestartCount confirms whether you're really in a loop or just saw one unlucky restart. And the mounts check catches a whole category of "works locally, breaks in the container" bugs — a volume pointing at the wrong host path, or a config file that isn't where the app expects it.
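If you find yourself typing the exit-code and OOM checks back to back, they combine into one small function. This is a sketch under my own naming (why_did_it_die is hypothetical); it uses only the two inspect fields shown above and turns them into a one-line verdict.

```shell
# why_did_it_die is a hypothetical helper name.
# It reads the exit code and the OOMKilled flag, then prints a one-line verdict.
why_did_it_die() {
    name=$1
    code=$(docker inspect --format '{{ .State.ExitCode }}' "$name") || return 1
    oom=$(docker inspect --format '{{ .State.OOMKilled }}' "$name") || return 1

    if [ "$oom" = "true" ]; then
        echo "$name: exit $code, OOM-killed"
    elif [ "$code" = "137" ]; then
        # SIGKILL without the OOM flag: something else sent the signal.
        echo "$name: exit 137 without OOM, look for an external kill"
    else
        echo "$name: exit $code, not OOM-killed"
    fi
}
```

The middle branch is the interesting one: a 137 with OOMKilled set to false is the case where the cause is outside the container's own memory limit.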
Step 4: docker exec -it ... sh — get inside and poke around
If the container is running (or restarting slowly enough to catch), step inside and look with your own eyes:
docker exec -it web-1 sh
# or bash, if the image has it
docker exec -it web-1 bash
exec starts a new process in the existing container without disturbing the main one, so it's safe on something live. Once inside, check the things logs and inspect can't easily show you:
# Is the config file actually there and correct?
cat /etc/app/app.conf
# Are the environment variables what you expect?
env | sort
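# Can it reach its dependencies?
nslookup db-1
# 5432 is a database port, not HTTP, so test the TCP connection directly
# (assumes the image ships nc with -z support; many minimal images don't)
nc -z -w 3 db-1 5432 || echo "cannot reach db-1:5432"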
# Is the disk full inside the container?
df -h
An important limitation: if the container has already exited, exec won't work — there's no running process to exec into. For a container that dies instantly on startup, the trick is to override its entrypoint and start a shell instead, so you can look around a stopped-but-openable version:
docker run --rm -it --entrypoint sh myapp:1.4.2
Now you're in the image's filesystem without the failing startup command, and you can check paths, permissions, and whether that binary even exists.
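Before overriding the entrypoint, it helps to know what the image would have run, so you know which path to go check. A small sketch, assuming the image is available locally; startup_command is a name I invented, and the function just reads the Entrypoint and Cmd fields from the image config.

```shell
# startup_command is a hypothetical helper name.
# Prints the image's configured entrypoint and default command as JSON.
startup_command() {
    image=$1
    docker image inspect \
        --format 'entrypoint={{ json .Config.Entrypoint }} cmd={{ json .Config.Cmd }}' \
        "$image"
}
```

Run startup_command myapp:1.4.2 first, then open the shell with the entrypoint override and ls -l whatever path it printed. A 126 or 127 usually shows up right there as a missing file or a missing execute bit.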
Step 5: docker stats — is it under resource pressure?
If the container runs but is slow, unresponsive, or getting OOM-killed, look at live resource usage:
# Live view of all containers
docker stats
# One snapshot for one container, then exit
docker stats --no-stream web-1
Watch the MEM USAGE / LIMIT column especially. If memory is pinned right against the limit, you've found your 137 — the container is bumping the ceiling and the OOM killer is doing its job. High steady CPU explains sluggishness. This is also where you catch the slow leak: memory climbing over time until the inevitable kill, which looks like a container that "randomly" restarts every few hours.
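Eyeballing that column across a dozen containers gets old, so here's a rough filter that prints only the ones close to their limit. It's a sketch: mem_hogs is a made-up name, the 90 percent default is an arbitrary threshold I picked, and it relies on the Name and MemPerc placeholders of docker stats --format.

```shell
# mem_hogs is a hypothetical helper; the 90% default is an arbitrary choice.
# Usage: mem_hogs [threshold_percent]
mem_hogs() {
    threshold=${1:-90}
    docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' |
        awk -v t="$threshold" '{
            pct = $2
            sub(/%/, "", pct)            # "97.50%" -> "97.50"
            if (pct + 0 >= t + 0) print $1, $2
        }'
}
```

A single snapshot won't show a slow leak. For that, run it on a timer and compare the numbers over a few hours.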
Step 6: docker events — who's acting on this container?
If you've gotten this far and the container is dying but nothing inside it explains why, widen the lens. docker events streams what the daemon is doing, so you can see kills, stops, and OOM events as they happen:
# Watch daemon events live, filtered to one container
docker events --filter container=web-1
# Or replay a recent window after the fact (it keeps streaming afterward; Ctrl-C to stop)
docker events --since 30m --filter container=web-1
This is how you catch an external actor — an orchestrator stopping the container, a health check triggering a restart, a docker kill from a script somewhere. When the inside of the container looks innocent, events tells you what the outside is doing to it.
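When I only care about how the container died, I narrow the stream to the handful of events that matter and give it an end time so it exits instead of following. This is a sketch with an invented name (recent_deaths); the event list is my own pick of die, oom, kill, and stop.

```shell
# recent_deaths is a hypothetical helper name.
# Usage: recent_deaths web-1 [window]   (window defaults to 30m)
recent_deaths() {
    name=$1
    window=${2:-30m}
    # --until makes the command print the window and exit instead of streaming.
    # Repeating --filter event=... matches any one of the listed events.
    docker events \
        --since "$window" --until "$(date +%s)" \
        --filter "container=$name" \
        --filter event=die --filter event=oom \
        --filter event=kill --filter event=stop
}
```

A kill event that shows up before the die event is the signature of something outside the container sending the signal.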
A worked example
Say web-1 is stuck restarting. The order pays off like this: docker ps -a shows Restarting (137). That 137 says SIGKILL, and my first suspect is memory, but I don't assume. docker logs --tail 50 web-1 shows the app booting normally and then just stopping mid-request, no exception. That absence is a hint: clean logs plus a hard kill points away from an application bug. docker inspect --format '{{ .State.OOMKilled }}' web-1 returns true, so now it's a fact, not a hunch. docker stats --no-stream web-1 confirms memory pinned at the limit. The fix isn't in the code; it's a too-low memory limit (or a genuine leak worth profiling). Four cheap commands, and I never once guessed.
Making it repeatable
The reason this works isn't the commands — it's that they're always in the same order, so under pressure you're executing a routine instead of improvising. Write it down as a runbook for your team: status, logs, inspect, exec, stats, events. Six steps, cheapest first. When something breaks at 3 a.m., the person on call shouldn't have to reinvent the sequence.
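A runbook can be as small as a shell function that runs the non-interactive steps in order. Here's one way it might look. It's a sketch, not a finished tool: triage is a name I chose, the 50-line and 30-minute windows are arbitrary, and step 4 (exec) is left out because it needs a human at the keyboard.

```shell
# triage is a hypothetical helper that runs the non-interactive steps in order.
# Usage: triage web-1
triage() {
    name=$1

    echo "== 1. status =="
    # Note: the name filter matches substrings, so web-1 also matches web-10.
    docker ps -a --filter "name=$name" --format '{{.Names}}  {{.Status}}'

    echo "== 2. logs (last 50 lines) =="
    docker logs --tail 50 -t "$name" 2>&1

    echo "== 3. inspect =="
    docker inspect \
        --format 'exit={{ .State.ExitCode }} oom={{ .State.OOMKilled }} restarts={{ .RestartCount }}' \
        "$name"

    echo "== 5. stats =="
    docker stats --no-stream "$name"

    echo "== 6. events (last 30m) =="
    docker events --since 30m --until "$(date +%s)" --filter "container=$name"
}
```

The point of the script is the same as the point of the list: the order is fixed in advance, so the person on call reads output instead of deciding what to type.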
If you want the reference versions of these — including what specific Docker errors mean when they show up in step one or two — I keep a set of Docker error-fix guides at devopsaitoolkit.com/categories/docker, and the broader Docker toolkit lives at devopsaitoolkit.com/stacks/docker.
Wrapping up
The durable lesson isn't any single command — it's the habit of starting cheap and moving up one layer at a time. A misbehaving container feels chaotic, but the diagnosis almost never is: read the exit code, read the logs, confirm with inspect, then go inside. Fix the order in your head once, and every future incident gets a little quieter.