
David Tio

Posted on • Originally published at blog.dtio.app

Docker Compose: Scale Out and Stay Resilient (2026)

Quick one-liner: Scale worker containers with one command and add restart policies so crashed services recover on their own. Then hit the one problem restart policies can't solve.


🤔 Why This Matters

In the last post, you put nginx in front of a Python web app. One backend, one load balancer, and it works fine.

Then traffic picks up and that single backend starts choking. You need another one. So you add a second container to your compose file, update nginx.conf to include it, and restart everything. It works. Traffic grows and you add a third. You edit the compose file, add another upstream entry in nginx.conf, and restart. Same drill.

Here's where things get interesting. One of those backends crashes. Maybe a memory leak, maybe a bad request triggers a segfault. It goes down, but nginx doesn't know about it. It keeps routing requests to the dead container and users start seeing errors.

We'll work through both of these problems in this post.


✅ Prerequisites

  • Ep 1-8 completed. You know Compose basics like multi-service files, .env files, shared networks, and the up/ps/logs/down workflow.

📦 Scaling Web Apps Behind nginx

Back in the last post, you built a load balancer with nginx in front of a Python web app. If you still have that loadbalance directory, great. If not, grab the files from the exercise at the end of post 08. You need app.py, nginx.conf, and docker-compose.yml.
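If app.py has gone missing, any small HTTP server that listens on port 5000 and echoes the container hostname will work for everything in this post. Here's a minimal stand-in using only the standard library (a sketch, not necessarily the exact file from post 08):

# app.py (stand-in): serve the container hostname on port 5000
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

HOSTNAME = socket.gethostname()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = f"Hello from {HOSTNAME}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# Bind to 0.0.0.0 so nginx can reach it over the compose network
HTTPServer(("0.0.0.0", 5000), Handler).serve_forever()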

Here's where we left off:

services:
  nginx:
    image: nginx:latest
    ports:
      - "8080:80"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf

  web:
    image: python:slim
    command: python /app/app.py
    volumes:
      - ./app.py:/app/app.py
    working_dir: /app

One backend. nginx in front. nginx.conf has one upstream entry: server web:5000;.

Now traffic picks up. One backend isn't enough. You need a second one.

The obvious approach: copy the web service, rename the original to web1, add a web2, and update nginx.conf.

Update docker-compose.yml:

services:
  nginx:
    image: nginx:latest
    ports:
      - "8080:80"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf

  web1:
    image: python:slim
    command: python /app/app.py
    volumes:
      - ./app.py:/app/app.py
    working_dir: /app

  web2:
    image: python:slim
    command: python /app/app.py
    volumes:
      - ./app.py:/app/app.py
    working_dir: /app

Update nginx.conf:

upstream backend {
    server web1:5000;
    server web2:5000;
}

server {
    listen 80;
    keepalive_timeout 0;

    location = /favicon.ico {
        return 204;
    }

    location / {
        proxy_pass http://backend;
    }
}

keepalive_timeout 0 disables HTTP keep-alive: without it, the browser reuses the same connection and every refresh hits the same backend; with it, each request gets a fresh connection and nginx round-robins properly. The favicon.ico block returns an empty 204 directly from nginx. Without that block, every browser page load fires two requests (/ and /favicon.ico), they interleave on the round-robin, and your page refresh always lands on the same backend.

Two backend services. Two upstream entries. Tear down the old stack and start fresh:

$ docker compose down --remove-orphans
$ docker compose up -d

--remove-orphans removes any containers that are no longer defined in your compose file. Without it, the old web container stays behind and blocks network cleanup.

[+] up 3/3
 ✔ Container loadbalance-nginx-1   Created
 ✔ Container loadbalance-web1-1    Created
 ✔ Container loadbalance-web2-1    Created

Open http://localhost:8080 and refresh. The hostname changes between requests. nginx is round-robining between web1 and web2.
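If you'd rather watch the rotation from a terminal than mash refresh, a few fresh requests in a loop show it just as well. This is a quick sketch; it assumes the response body contains the hostname, as the stand-in app.py above does:

# roundrobin_check.py: each urlopen() opens a fresh connection,
# so consecutive requests should land on alternating backends
import urllib.request

for i in range(6):
    with urllib.request.urlopen("http://localhost:8080/") as resp:
        print(i, resp.read().decode().strip())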

Check the containers:

$ docker compose ps -a
NAME                       IMAGE         COMMAND   STATUS
loadbalance-nginx-1        nginx:latest  "nginx…"  Up 5 min
loadbalance-web1-1         python:slim   "python…" Up 5 min
loadbalance-web2-1         python:slim   "python…" Up 5 min

Three containers. nginx distributes requests between two backends. It works.

Now let's break one:

$ sudo kill -9 $(docker inspect --format '{{.State.Pid}}' loadbalance-web2-1)

This kills the container process at the OS level, outside Docker's control. Docker sees it as a crash. Let's watch what happens.

Check:

$ docker compose ps -a
NAME                       IMAGE         COMMAND   STATUS
loadbalance-nginx-1        nginx:latest  "nginx…"  Up 6 min
loadbalance-web1-1         python:slim   "python…" Up 6 min
loadbalance-web2-1         python:slim   "python…" Exited (137) 10s ago

web2 exited with code 137: 128 plus 9, meaning the process was killed by SIGKILL. Not a clean exit. Just dead.

Two containers left running. nginx and web1.

Open http://localhost:8080 and refresh. The page still loads. nginx tries to send a request to web2, gets a connection refused, and falls back to web1. You might see one slow refresh, then everything works.

This is the real-world pattern: containers die. Crashes. Memory leaks. Bad requests. nginx keeps serving traffic with whatever backends are alive. But web2 doesn't come back on its own. Nobody was watching. Nobody got paged. It stays dead until someone notices the output of docker compose ps -a and restarts it by hand.

That's the second problem we'll fix in this post. But look at the compose file first. web1 and web2 are identical. Same image, same command, same volumes. The only difference is the name. And nginx.conf lists each one by name. Want a third instance? Add web3 to the compose file. Add server web3:5000; to nginx.conf. Want ten? Edit ten lines.

There's a better way.

Replace web1 and web2 with a single web service. Add depends_on so nginx doesn't start before web exists:

services:
  nginx:
    image: nginx:latest
    ports:
      - "8080:80"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf
    depends_on:
      - web

  web:
    image: python:slim
    command: python /app/app.py
    volumes:
      - ./app.py:/app/app.py
    working_dir: /app

Update nginx.conf to reference just web:

upstream backend {
    server web:5000;
}

server {
    listen 80;
    keepalive_timeout 0;

    location = /favicon.ico {
        return 204;
    }

    location / {
        proxy_pass http://backend;
    }
}

When nginx loads this config, it resolves web through Docker DNS and builds its upstream pool from what it sees at that moment. depends_on helps startup order, but it does not guarantee nginx picks up every replica on first boot. If nginx starts too early, reload it after the stack is up.

One service name. No hardcoded numbers.

Tear down the old stack and start fresh with two instances:

$ docker compose down --remove-orphans
$ docker compose up -d --scale web=2
$ docker compose exec nginx nginx -s reload
[+] up 3/3
 ✔ Container loadbalance-nginx-1   Created
 ✔ Container loadbalance-web-1     Created
 ✔ Container loadbalance-web-2     Created

Open http://localhost:8080 and refresh. The hostname should alternate between the two containers on every request.

Now kill one:

$ sudo kill -9 $(docker inspect --format '{{.State.Pid}}' loadbalance-web-2)
$ docker compose ps -a
NAME                       IMAGE         COMMAND   STATUS
loadbalance-nginx-1        nginx:latest  "nginx…"  Up 2 min
loadbalance-web-1          python:slim   "python…" Up 2 min
loadbalance-web-2          python:slim   "python…" Exited (137) 3s ago

web-2 is dead. Keep refreshing — the page still loads. nginx detected the failure and routes everything to web-1.

Now bring it back. You don't need to know which container died or restart it by name. Just tell compose how many you want:

$ docker compose up -d --scale web=2
[+] up 2/2
 ✔ Container loadbalance-web-1     Running
 ✔ Container loadbalance-web-2     Started

web-2 is back. Round-robin resumes. With the manual web1/web2 approach you'd have had to run docker compose up web2 and know the exact name. With --scale you just declare the count.

Scale to five with the same command:

$ docker compose up -d --scale web=5
[+] up 5/5
 ✔ Container loadbalance-nginx-1   Running
 ✔ Container loadbalance-web-1     Running
 ✔ Container loadbalance-web-2     Running
 ✔ Container loadbalance-web-3     Started
 ✔ Container loadbalance-web-4     Started
 ✔ Container loadbalance-web-5     Started

No nginx config changes. But nginx resolved web at startup when only two containers existed — it doesn't know about the three new ones yet. Reload nginx to pick them up:

$ docker compose exec nginx nginx -s reload

Now all five are in the pool. That's the power of --scale — the config stays the same, you just change the count and reload.
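You can see what nginx sees by asking Docker's embedded DNS from inside the network. Run this from any of the web containers (they already have Python), for example by pasting it into an interactive docker compose exec web python3 session. A small sketch:

# dns_probe.py: ask Docker's DNS what "web" resolves to right now.
# With --scale web=5 you should see five distinct IPs, one per replica.
import socket

addrs = {info[4][0] for info in socket.getaddrinfo("web", 5000, proto=socket.IPPROTO_TCP)}
for ip in sorted(addrs):
    print(ip)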

But there's a gap in this recovery story. Bringing web-2 back required you to notice it was dead and run the command yourself. Nobody was watching.

Tear it down:

$ docker compose down

🔄 Restart Policies

web-2 stayed dead until you ran the command. What if you didn't notice for an hour?

Back in post 04 we covered restart policies — one flag that tells Docker to bring a container back whenever it exits. The same thing works in Compose. One line, and you never have to babysit a crashed container again.

Add restart: unless-stopped to the web service:

services:
  nginx:
    image: nginx:latest
    ports:
      - "8080:80"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf

  web:
    image: python:slim
    command: python /app/app.py
    volumes:
      - ./app.py:/app/app.py
    working_dir: /app
    restart: unless-stopped

Start with two again:

$ docker compose up -d --scale web=2

Now kill web-2:

$ sudo kill -9 $(docker inspect --format '{{.State.Pid}}' loadbalance-web-2)

Check:

$ docker compose ps
NAME                  IMAGE          SERVICE  STATUS
loadbalance-nginx-1   nginx:latest   nginx    Up 2 minutes
loadbalance-web-1     python:slim    web      Up 2 minutes
loadbalance-web-2     python:slim    web      Up 1 second

Already back. web-2 restarted so fast it barely registered as dead. You didn't touch anything — Docker saw the process die and brought it straight back up.

unless-stopped restarts on crashes and host reboots, but stays down when you run docker compose stop intentionally.


🧪 Exercise 1: Scale Your Worker Stack from Episode 8

Done with the loadbalance stack. Tear it down:

$ docker compose down

In episode 8 you built a producer/worker/redis stack. One worker, pulling jobs from a queue. Submit enough jobs and the queue grows faster than the worker drains it. One worker isn't enough.

Grab your prodwork directory from episode 8. Your starting point:

prodwork/
├── producer.py
├── worker.py
└── docker-compose.yml

Your job:

  1. Add restart: unless-stopped to the worker service. The producer and redis services stay as-is.

  2. Scale to three workers:

   $ docker compose up -d --scale worker=3

Check docker compose ps. You should have one redis, one producer, and three workers.

  3. Watch the logs, then submit jobs:
   $ docker compose logs -f

With the log stream open, go to http://localhost:5000 and submit a few jobs. You'll see different workers picking them up in real time. The queue drains faster with three workers pulling from it.

  4. Kill one worker mid-queue:
   $ sudo kill -9 $(docker inspect --format '{{.State.Pid}}' prodwork-worker-2)

Keep watching the logs. worker-2 drops out, the other two keep draining. Once worker-2 restarts, it rejoins and starts pulling jobs again — all visible in the log stream.

  5. Test that manual stop is respected. Stop the stack:
   $ docker compose stop

All containers exit and stay stopped. Start it back up and the workers are ready again. No jobs were processed in the meantime, but nothing was lost either. Redis persisted the queue.

The key thing to notice: you never changed worker.py. Multiple workers connect to the same Redis queue and brpop handles the coordination. Each job goes to exactly one worker.
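That coordination comes for free from Redis: BRPOP pops a job atomically, so only one of the connected workers receives it. The core of the loop looks roughly like this (a sketch with an assumed queue name and connection details, not your actual worker.py):

# worker loop sketch: BRPOP blocks until a job arrives and hands it
# to exactly one connected worker, so replicas never double-process a job
import redis

r = redis.Redis(host="redis", port=6379)

while True:
    _queue, job = r.brpop("jobs")   # "jobs" is an assumed queue name
    print("processing", job.decode())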


🧪 Exercise 2: The App That Dies on Boot

Your worker stack is resilient. But it only works because Redis starts fast and unless-stopped keeps retrying until it connects. Now try the same pattern with an app that connects to Postgres. Postgres takes longer to initialise, and this time the app has a flaw that makes restart policies actively dangerous.

Create a new directory:

$ mkdir -p appstack && cd appstack

We'll use noteboard. It's a simple note board app. POST a name and message, GET shows all notes. On first start it runs a migration to create the schema.

docker-compose.yml:

services:
  db:
    image: postgres:16
    environment:
      POSTGRES_DB: app
      POSTGRES_USER: app
      POSTGRES_PASSWORD: app

  app:
    image: gitea.dtio.app/davidtio/noteboard:latest
    ports:
      - "5000:5000"
    volumes:
      - appstate:/app/state
    restart: unless-stopped

volumes:
  appstate:

Start it and watch the logs:

$ docker compose up
app-1  | First run, setting up...
app-1  | Migration failed: connection to server at "db" (172.20.0.2), port 5432 failed: Connection refused
app-1  |
app-1  | Already installed, skipping migrations
app-1  | Database not ready: connection to server at "db" (172.20.0.2), port 5432 failed: Connection refused
app-1  |
app-1  | Already installed, skipping migrations
app-1  | Serving on :5000
app-1  | Exception occurred during processing of request from ('172.20.0.1', 41410)
app-1  | psycopg2.errors.InFailedSqlTransaction: current transaction is aborted,
app-1  | commands ignored until end of transaction block

Open http://localhost:5000. The browser returns nothing — no error page, just an empty response.

On the very first boot, the app wrote the state file and tried to run the migration — but Postgres wasn't ready. The migration failed mid-transaction, leaving the connection in a broken state. The app exited. unless-stopped brought it back. Every restart after that saw the state file and skipped the migration entirely. Postgres eventually came up, the app reached Serving on :5000, and started accepting requests. But the notes table was never created, and psycopg2 refuses to run any query on a connection with a prior failed transaction. Every request crashes the handler silently.

$ docker compose ps
NAME             IMAGE                                      STATUS
appstack-db-1    postgres:16                                Up 2 min
appstack-app-1   gitea.dtio.app/davidtio/noteboard:latest   Up 2 min

Both containers up. Nothing in docker compose ps tells you anything is wrong.

Tear down and clear the volume:

$ docker compose down --volumes

The damage was done in the first few seconds. The state file was written, the migration never ran, and every restart since then has skipped it without question. Restart policies didn't cause this — they just kept the broken app alive long enough that you might not notice until a user reports it.
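For a concrete picture of that failure mode, the first-run logic behaves roughly like this. This is a reconstruction from the log output above, not the actual noteboard code; the table name and state path are assumptions:

# first-run sketch: the marker is written before we know the migration
# succeeded, so every later restart skips straight past the broken schema
import os
import psycopg2

STATE_FILE = "/app/state/installed"

if not os.path.exists(STATE_FILE):
    print("First run, setting up...")
    open(STATE_FILE, "w").close()   # marker written up front -- the bug
    try:
        conn = psycopg2.connect(host="db", dbname="app", user="app", password="app")
        with conn, conn.cursor() as cur:
            cur.execute("CREATE TABLE notes (name text, message text)")
    except Exception as exc:
        print("Migration failed:", exc)   # Postgres isn't accepting connections yet
else:
    print("Already installed, skipping migrations")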

Restart policies can't solve this. They only decide what happens after a container exits. What you actually need is a way to tell Docker: don't start the app until the database is ready to accept connections.

That's the next post.


🏁 What You've Built

  • --scale web=5: spins up five web containers behind one nginx, with no config changes.
  • --scale worker=3: spins up three workers pulling from the same Redis queue; each job goes to exactly one worker.
  • Compose DNS + nginx reload: Docker DNS resolves a service name to every scaled instance; reload nginx after scaling or late starts to refresh the upstream pool.
  • restart: unless-stopped: auto-restarts crashed containers without manual intervention (see post 04 for the full policy breakdown).

👉 Coming up: The noteboard died on boot because Postgres wasn't ready, and restart policies made it worse. There has to be a better way to bring a stack up. That's what we tackle next.


Found this helpful? 🙌

Do you pair the ping with a cheap sanity check (output file size, row count, last-modified delta) before it fires, or is the contract just "the script said it succeeded"?