Quick one-liner: Scale worker containers with one command and add restart policies so crashed services recover on their own. Then hit the one problem restart policies can't solve.
🤔 Why This Matters
In the last post, you put nginx in front of a Python web app. One backend, one load balancer, and it works fine.
Then traffic picks up and that single backend starts choking. You need another one. So you add a second container to your compose file, update nginx.conf to include it, and restart everything. It works. Traffic grows and you add a third. You edit the compose file, add another upstream entry in nginx.conf, and restart. Same drill.
Here's where things get interesting. One of those backends crashes. Maybe a memory leak, maybe a bad request triggers a segfault. It goes down, but nginx doesn't know about it. It keeps routing requests to the dead container and users start seeing errors.
We'll work through both of these problems in this post.
✅ Prerequisites
- Episodes 1-8 completed. You know Compose basics: multi-service files, .env files, shared networks, and the up/ps/logs/down workflow.
📦 Scaling Web Apps Behind nginx
Back in the last post, you built a load balancer with nginx in front of a Python web app. If you still have that loadbalance directory, great. If not, grab the files from the exercise at the end of post 08. You need app.py, nginx.conf, and docker-compose.yml.
Here's where we left off:
services:
nginx:
image: nginx:latest
ports:
- "8080:80"
volumes:
- ./nginx.conf:/etc/nginx/conf.d/default.conf
web:
image: python:slim
command: python /app/app.py
volumes:
- ./app.py:/app/app.py
working_dir: /app
One backend. nginx in front. nginx.conf has one upstream entry: server web:5000;.
Now traffic picks up. One backend isn't enough. You need a second one.
The obvious approach: copy the web service, rename the original to web1, add a web2, and update nginx.conf.
Update docker-compose.yml:
services:
nginx:
image: nginx:latest
ports:
- "8080:80"
volumes:
- ./nginx.conf:/etc/nginx/conf.d/default.conf
web1:
image: python:slim
command: python /app/app.py
volumes:
- ./app.py:/app/app.py
working_dir: /app
web2:
image: python:slim
command: python /app/app.py
volumes:
- ./app.py:/app/app.py
working_dir: /app
Update nginx.conf:
upstream backend {
server web1:5000;
server web2:5000;
}
server {
listen 80;
keepalive_timeout 0;
location = /favicon.ico {
return 204;
}
location / {
proxy_pass http://backend;
}
}
keepalive_timeout 0 disables HTTP keep-alive: each request opens a fresh connection, and since nginx picks an upstream per connection, round-robin now happens per request. With keep-alive on, browsers reuse one connection and every refresh lands on the same backend. The favicon.ico block returns an empty 204 directly from nginx — without it, every browser page load fires two requests (/ and /favicon.ico) that interleave on the round-robin, so your page refresh always lands on the same backend.
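A toy model (not nginx itself — ToyProxy is invented for illustration) makes the connection-vs-request distinction concrete: upstream selection happens when a connection opens, not on every request.

```python
# Toy round-robin "proxy": it picks a backend when a connection opens,
# the way nginx selects an upstream per connection, not per request.
from itertools import cycle

class ToyProxy:
    def __init__(self, backends):
        self._pool = cycle(backends)

    def open_connection(self):
        backend = next(self._pool)      # round-robin happens here, once per connection
        return lambda: backend          # every request on this connection hits the same backend

proxy = ToyProxy(["web1", "web2"])

# keep-alive on: the browser reuses one connection, so every refresh hits web1
conn = proxy.open_connection()
print([conn() for _ in range(4)])                      # ['web1', 'web1', 'web1', 'web1']

# keepalive_timeout 0: a fresh connection per request, so the backends alternate
print([proxy.open_connection()() for _ in range(4)])   # ['web2', 'web1', 'web2', 'web1']
```

Disabling keep-alive is a teaching shortcut to make the alternation visible in a browser; in production you would keep keep-alive on and accept that balancing happens per connection.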
Two services. Two upstream entries. Tear down the old stack and start fresh:
$ docker compose down --remove-orphans
$ docker compose up -d
--remove-orphans removes any containers that are no longer defined in your compose file. Without it, the old web container stays behind and blocks network cleanup.
[+] Running 3/3
✔ Container loadbalance-nginx-1 Created
✔ Container loadbalance-web1-1 Created
✔ Container loadbalance-web2-1 Created
Open http://localhost:8080 and refresh. The hostname changes between requests. nginx is round-robining between web1 and web2.
Check the containers:
$ docker compose ps -a
NAME IMAGE COMMAND STATUS
loadbalance-nginx-1 nginx:latest "nginx…" Up 5 min
loadbalance-web1-1 python:slim "python…" Up 5 min
loadbalance-web2-1 python:slim "python…" Up 5 min
Three containers. nginx distributes requests between two backends. It works.
Now let's break one:
$ sudo kill -9 $(docker inspect --format '{{.State.Pid}}' loadbalance-web2-1)
This kills the container process at the OS level, outside Docker's control. Docker sees it as a crash. Let's watch what happens.
Check:
$ docker compose ps -a
NAME IMAGE COMMAND STATUS
loadbalance-nginx-1 nginx:latest "nginx…" Up 6 min
loadbalance-web1-1 python:slim "python…" Up 6 min
loadbalance-web2-1 python:slim "python…" Exited (137) 10s ago
web2 exited with code 137. The kernel killed the process with SIGKILL. Not a clean exit. Just dead.
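The 137 isn't arbitrary: Docker reports a signal death as 128 plus the signal number, and SIGKILL is signal 9. A quick check on Linux:

```python
# Exit code 137 = 128 + 9: the convention for "killed by SIGKILL".
import signal
import subprocess
import sys

# Spawn a child that SIGKILLs itself -- the same fate as our web2 container.
child = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)

print(child.returncode)          # -9: Python surfaces the raw signal as a negative code
print(128 + signal.SIGKILL)      # 137: what Docker (and shells) display instead
```

The same arithmetic decodes other crashes: 139 is SIGSEGV (128 + 11), 143 is SIGTERM (128 + 15).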
Two containers left running. nginx and web1.
Open http://localhost:8080 and refresh. The page still loads. nginx tries to send a request to web2, gets a connection refused, and falls back to web1. You might see one slow refresh, then everything works.
This is the real-world pattern: containers die. Crashes. Memory leaks. Bad requests. nginx keeps serving traffic with whatever backends are alive. But we had to bring web2 back manually. Nobody was watching. Nobody got paged. It just stayed dead until someone noticed the output from docker compose ps -a.
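That fallback is nginx's passive health checking: a refused connection counts as a failed attempt, and the request is retried on the next server in the upstream list. A rough sketch of the idea (simplified — real nginx also tracks max_fails and fail_timeout before marking a server down):

```python
# Simplified model of nginx passive failover: walk the upstream list,
# skip any backend whose connection would be refused, serve from the first live one.
def handle_request(backends, alive):
    for backend in backends:
        if backend in alive:            # a dead backend would raise "connection refused"
            return f"served by {backend}"
    raise ConnectionError("no live upstreams")

# web2 is dead: the request falls back to web1 instead of erroring out
print(handle_request(["web2", "web1"], alive={"web1"}))   # served by web1
```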
That's the second problem we'll fix in this post. But look at the compose file first. web1 and web2 are identical. Same image, same command, same volumes. The only difference is the name. And nginx.conf lists each one by name. Want a third instance? Add web3 to the compose file. Add server web3:5000; to nginx.conf. Want ten? Edit ten lines.
There's a better way.
Replace web1 and web2 with a single web service. Add depends_on so nginx doesn't start before web exists:
services:
nginx:
image: nginx:latest
ports:
- "8080:80"
volumes:
- ./nginx.conf:/etc/nginx/conf.d/default.conf
depends_on:
- web
web:
image: python:slim
command: python /app/app.py
volumes:
- ./app.py:/app/app.py
working_dir: /app
Update nginx.conf to reference just web:
upstream backend {
server web:5000;
}
server {
listen 80;
keepalive_timeout 0;
location = /favicon.ico {
return 204;
}
location / {
proxy_pass http://backend;
}
}
When nginx loads this config, it resolves web through Docker DNS and builds its upstream pool from what it sees at that moment. depends_on helps startup order, but it does not guarantee nginx picks up every replica on first boot. If nginx starts too early, reload it after the stack is up.
One service name. No hardcoded numbers.
Tear down the old stack and start fresh with two instances:
$ docker compose down --remove-orphans
$ docker compose up -d --scale web=2
[+] Running 3/3
✔ Container loadbalance-nginx-1 Created
✔ Container loadbalance-web-1 Created
✔ Container loadbalance-web-2 Created
Then reload nginx in case it started before both replicas were resolvable:
$ docker compose exec nginx nginx -s reload
Open http://localhost:8080 and refresh. The hostname should alternate between the two containers on every request.
Now kill one:
$ sudo kill -9 $(docker inspect --format '{{.State.Pid}}' loadbalance-web-2)
$ docker compose ps -a
NAME IMAGE COMMAND STATUS
loadbalance-nginx-1 nginx:latest "nginx…" Up 2 min
loadbalance-web-1 python:slim "python…" Up 2 min
loadbalance-web-2 python:slim "python…" Exited (137) 3s ago
web-2 is dead. Keep refreshing — the page still loads. nginx detected the failure and routes everything to web-1.
Now bring it back. You don't need to know which container died or restart it by name. Just tell compose how many you want:
$ docker compose up -d --scale web=2
[+] Running 2/2
✔ Container loadbalance-web-1 Running
✔ Container loadbalance-web-2 Started
web-2 is back. Round-robin resumes. With the manual web1/web2 approach you'd have had to run docker compose up web2 and know the exact name. With --scale you just declare the count.
Scale to five with the same command:
$ docker compose up -d --scale web=5
[+] Running 5/5
✔ Container loadbalance-nginx-1 Running
✔ Container loadbalance-web-1 Running
✔ Container loadbalance-web-2 Running
✔ Container loadbalance-web-3 Started
✔ Container loadbalance-web-4 Started
✔ Container loadbalance-web-5 Started
No nginx config changes. But nginx resolved web at startup when only two containers existed — it doesn't know about the three new ones yet. Reload nginx to pick them up:
$ docker compose exec nginx nginx -s reload
Now all five are in the pool. That's the power of --scale — the config stays the same, you just change the count and reload.
But there's a gap in this recovery story. Bringing web-2 back required you to notice it was dead and run the command yourself. Nobody was watching.
Tear it down:
$ docker compose down
🔄 Restart Policies
web-2 stayed dead until you ran the command. What if you didn't notice for an hour?
Back in post 04 we covered restart policies — one flag that tells Docker to bring a container back whenever it exits. The same thing works in Compose. One line, and you never have to babysit a crashed container again.
Add restart: unless-stopped to the web service:
services:
nginx:
image: nginx:latest
ports:
- "8080:80"
volumes:
- ./nginx.conf:/etc/nginx/conf.d/default.conf
web:
image: python:slim
command: python /app/app.py
volumes:
- ./app.py:/app/app.py
working_dir: /app
restart: unless-stopped
Start with two again:
$ docker compose up -d --scale web=2
Now kill web-2:
$ sudo kill -9 $(docker inspect --format '{{.State.Pid}}' loadbalance-web-2)
Check:
$ docker compose ps
NAME IMAGE SERVICE STATUS
loadbalance-nginx-1 nginx:latest nginx Up 2 minutes
loadbalance-web-1 python:slim web Up 2 minutes
loadbalance-web-2 python:slim web Up 1 second
Already back. web-2 restarted so fast it barely registered as dead. You didn't touch anything — Docker saw the process die and brought it straight back up.
unless-stopped restarts on crashes and host reboots, but stays down when you run docker compose stop intentionally.
🧪 Exercise 1: Scale Your Worker Stack from Episode 8
Done with the loadbalance stack. Tear it down:
$ docker compose down
In episode 8 you built a producer/worker/redis stack. One worker, pulling jobs from a queue. Submit enough jobs and the queue grows faster than the worker drains it. One worker isn't enough.
Grab your prodwork directory from episode 8. Your starting point:
prodwork/
├── producer.py
├── worker.py
└── docker-compose.yml
Your job:
- Add restart: unless-stopped to the worker service. The producer and redis services stay as-is.
- Scale to three workers:
$ docker compose up -d --scale worker=3
Check docker compose ps. You should have one redis, one producer, and three workers.
- Watch the logs, then submit jobs:
$ docker compose logs -f
With the log stream open, go to http://localhost:5000 and submit a few jobs. You'll see different workers picking them up in real time. The queue drains faster with three workers pulling from it.
- Kill one worker mid-queue:
$ sudo kill -9 $(docker inspect --format '{{.State.Pid}}' prodwork-worker-2)
Keep watching the logs. worker-2 drops out, the other two keep draining. Once worker-2 restarts, it rejoins and starts pulling jobs again — all visible in the log stream.
- Test that manual stop is respected. Stop the stack:
$ docker compose stop
All containers exit and stay stopped. Start it back up and the workers are ready again. No jobs were processed in the meantime, but nothing was lost either. Redis persisted the queue.
The key thing to notice: you never changed worker.py. Multiple workers connect to the same Redis queue and brpop handles the coordination. Each job goes to exactly one worker.
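You can see why this works without touching worker.py in a small simulation — assuming, as the exercise relies on, that popping a job is atomic (BRPOP removes and returns an item in one step, so two workers can never receive the same job). Here Python's thread-safe queue.Queue stands in for Redis:

```python
# Three "workers" draining one shared queue: each job is claimed exactly once,
# because the pop is atomic. queue.Queue stands in for Redis + BRPOP here.
import collections
import queue
import threading

jobs = queue.Queue()
for i in range(9):
    jobs.put(f"job-{i}")

claims = collections.Counter()
claims_lock = threading.Lock()

def worker():
    while True:
        try:
            job = jobs.get_nowait()     # atomic pop: the BRPOP stand-in
        except queue.Empty:
            return                      # queue drained, worker exits
        with claims_lock:
            claims[job] += 1

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(claims), max(claims.values()))   # 9 jobs claimed, none claimed twice
```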
🧪 Exercise 2: The App That Dies on Boot
Your worker stack is resilient. But it only works because Redis starts fast and unless-stopped keeps retrying until it connects. Now try the same pattern with an app that connects to Postgres. Postgres takes longer to initialise, and this time the app has a flaw that makes restart policies actively dangerous.
Create a new directory:
$ mkdir -p appstack && cd appstack
We'll use noteboard. It's a simple note board app. POST a name and message, GET shows all notes. On first start it runs a migration to create the schema.
docker-compose.yml:
services:
db:
image: postgres:16
environment:
POSTGRES_DB: app
POSTGRES_USER: app
POSTGRES_PASSWORD: app
app:
image: gitea.dtio.app/davidtio/noteboard:latest
ports:
- "5000:5000"
volumes:
- appstate:/app/state
restart: unless-stopped
volumes:
appstate:
Start it and watch the logs:
$ docker compose up
app-1 | First run, setting up...
app-1 | Migration failed: connection to server at "db" (172.20.0.2), port 5432 failed: Connection refused
app-1 |
app-1 | Already installed, skipping migrations
app-1 | Database not ready: connection to server at "db" (172.20.0.2), port 5432 failed: Connection refused
app-1 |
app-1 | Already installed, skipping migrations
app-1 | Serving on :5000
app-1 | Exception occurred during processing of request from ('172.20.0.1', 41410)
app-1 | psycopg2.errors.InFailedSqlTransaction: current transaction is aborted,
app-1 | commands ignored until end of transaction block
Open http://localhost:5000. The browser returns nothing — no error page, just an empty response.
On the very first boot, the app wrote the state file and tried to run the migration — but Postgres wasn't ready. The migration failed mid-transaction, leaving the connection in a broken state. The app exited. unless-stopped brought it back. Every restart after that saw the state file and skipped the migration entirely. Postgres eventually came up, the app reached Serving on :5000, and started accepting requests. But the notes table was never created, and psycopg2 refuses to run any query on a connection with a prior failed transaction. Every request crashes the handler silently.
$ docker compose ps
NAME IMAGE STATUS
appstack-db-1 postgres:16 Up 2 min
appstack-app-1 gitea.dtio.app/davidtio/noteboard:latest Up 2 min
Both containers up. Nothing in docker compose ps tells you anything is wrong.
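The failure sequence is easier to see as code. This is a reconstruction from the log output, not the actual noteboard source — the names (the installed sentinel file, first_boot) are made up for illustration:

```python
# Reconstructed sketch of the flaw: the "installed" marker is written
# BEFORE the migration succeeds, so a crashed migration is never retried.
import os
import tempfile

def first_boot(state_dir, db_ready):
    sentinel = os.path.join(state_dir, "installed")   # hypothetical state file
    if os.path.exists(sentinel):
        return "Already installed, skipping migrations"
    open(sentinel, "w").close()        # BUG: marker written before migrating
    if not db_ready:
        raise ConnectionError('connection to server at "db" failed')
    return "migrated"

state = tempfile.mkdtemp()
try:
    first_boot(state, db_ready=False)      # boot 1: Postgres not up yet -> crash
except ConnectionError:
    pass                                    # restart policy brings the app back...
print(first_boot(state, db_ready=True))     # ...but every later boot skips the migration
```

Writing the marker only after the migration commits would fix the app; nothing in the restart policy can.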
Tear down and clear the volume:
$ docker compose down --volumes
The damage was done in the first few seconds. The state file was written, the migration never ran, and every restart since then has skipped it without question. Restart policies didn't cause this — they just kept the broken app alive long enough that you might not notice until a user reports it.
Restart policies can't solve this. They only decide what happens after a container exits. What you actually need is a way to tell Docker: don't start the app until the database is ready to accept connections.
That's the next post.
🏁 What You've Built
| Feature | What It Does |
|---|---|
| --scale web=5 | Spins up five web containers behind one nginx. No config changes needed. |
| --scale worker=3 | Spins up three workers pulling from the same Redis queue. Each job goes to exactly one worker. |
| Compose DNS + nginx reload | Docker can resolve a service name to multiple scaled instances. Reload nginx after scaling or late starts to refresh the upstream pool. |
| restart: unless-stopped | Auto-restarts crashed containers without manual intervention. See post 04 for the full policy breakdown. |
👉 Coming up: The noteboard died on boot because Postgres wasn't ready, and restart policies made it worse. There has to be a better way to bring a stack up. That's what we tackle next.
Found this helpful? 🙌
- Questions? Drop a comment below or reach out on LinkedIn