DEV Community

Ramer Labs

The Ultimate Checklist for Zero‑Downtime Deploys with Docker and Nginx

Introduction

Zero‑downtime deployments are a non‑negotiable expectation for modern services. As a DevOps lead, you’ve probably seen the panic when a new release takes the API offline for a few seconds. The good news? With Docker and Nginx you can orchestrate smooth rollouts that keep traffic flowing, users happy, and SLOs intact. This checklist walks you through the concrete steps you need to turn a "deploy" into a "seamless transition".


Prerequisites

Before you start ticking items off, make sure you have the following in place:

  • Docker Engine 20.10+ on all hosts.
  • Docker Compose (or a Swarm/Kubernetes orchestrator) for multi‑service coordination.
  • Nginx 1.21+ acting as a reverse proxy in front of your containers.
  • A CI pipeline that builds immutable images and pushes them to a registry.
  • Access to monitoring (Prometheus, Grafana) and logging (ELK, Loki) stacks.

If any of these are missing, allocate time to get them up and running first – the checklist assumes they exist.
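As a quick sanity check, a small shell function can verify the tooling is on the PATH before the pipeline starts (the function name and binary list are illustrative):

```shell
# preflight <bin>...: verify each required tool is installed.
preflight() {
  for bin in "$@"; do
    command -v "$bin" >/dev/null 2>&1 || { echo "missing: $bin" >&2; return 1; }
  done
}

# Example: abort early if the host is missing tooling.
# preflight docker docker-compose nginx curl || exit 1
```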


Checklist Overview

1. Build immutable, version‑tagged Docker images
2. Store images in a private, immutable registry
3. Define health‑check endpoints for every container
4. Use a rolling‑update strategy with a small batch size
5. Configure Nginx for blue‑green routing
6. Validate Nginx health checks before swapping traffic
7. Run zero‑downtime DB migrations in a separate step
8. Wire up observability alerts for latency & error spikes
9. Prepare an automated rollback plan
10. Clean up old images and containers after a successful rollout

Below each item is a short “how‑to” you can copy‑paste into your workflow.


1. Build Immutable, Version‑Tagged Docker Images

Never rely on the mutable latest tag. Tag each build with a Git SHA or semantic version so every deploy references an exact, reproducible image.

# Dockerfile
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM node:18-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
EXPOSE 3000
CMD ["node", "dist/index.js"]

Build and push:

VERSION=$(git rev-parse --short HEAD)
docker build -t registry.myco.com/webapp:$VERSION .
docker push registry.myco.com/webapp:$VERSION

The immutable tag guarantees you can roll back by redeploying the exact same image.
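To enforce this in a deploy script, a small guard can reject mutable references outright (a sketch; the function name is illustrative):

```shell
# require_immutable_tag <image-ref>: fail the deploy on mutable refs.
require_immutable_tag() {
  tag="${1##*:}"              # text after the last colon
  case "$tag" in
    latest|"$1"|"")           # ":latest", no tag at all, or an empty tag
      echo "refusing mutable image ref: $1" >&2
      return 1 ;;
    *) return 0 ;;
  esac
}

# Example: require_immutable_tag "registry.myco.com/webapp:$VERSION" || exit 1
```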


2. Store Images in a Private, Immutable Registry

Configure your CI to authenticate against the registry using short‑lived tokens. Example for GitHub Actions:

- name: Login to Docker registry
  uses: docker/login-action@v2
  with:
    registry: registry.myco.com
    username: ${{ secrets.REGISTRY_USER }}
    password: ${{ secrets.REGISTRY_TOKEN }}
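The build‑and‑push step can then reuse the same short‑lived login; a sketch using docker/build-push-action, with the SHA tag scheme mirroring the earlier example:

```yaml
- name: Build and push versioned image
  uses: docker/build-push-action@v5
  with:
    push: true
    tags: registry.myco.com/webapp:${{ github.sha }}
```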

3. Define Health‑Check Endpoints

Docker’s built‑in health checks let the orchestrator know when a container is ready.

services:
  web:
    image: registry.myco.com/webapp:${VERSION}
    ports:
      - "3000:3000"
    healthcheck:
      # BusyBox wget ships with alpine images; curl does not by default
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:3000/health"]
      interval: 10s
      timeout: 3s
      retries: 3

Expose /health to return 200 only when the app can serve traffic and its DB connections are healthy.


4. Use a Rolling‑Update Strategy

If you’re on Docker Swarm, the update_config block controls batch size and delay.

services:
  web:
    deploy:
      replicas: 4
      update_config:
        parallelism: 1        # one container at a time
        delay: 10s            # give it time to pass health check
        order: start-first    # start new, then stop old

Kubernetes users can achieve the same with strategy.rollingUpdate.maxSurge and maxUnavailable.
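For reference, a hedged Kubernetes equivalent of the Swarm settings above (the deployment name is illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # start one new pod first (like start-first)
      maxUnavailable: 0    # never drop below 4 ready pods
```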


5. Configure Nginx for Blue‑Green Routing

Run two upstream groups – green (current) and blue (new). Switch traffic by toggling the proxy_pass target.

upstream green {
    server 10.0.0.11:3000;
    server 10.0.0.12:3000;
}

upstream blue {
    server 10.0.0.21:3000;
    server 10.0.0.22:3000;
}

server {
    listen 80;
    location / {
        # Initially point to green
        proxy_pass http://green;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

When the new containers pass health checks, edit the proxy_pass line to http://blue; and reload Nginx:

nginx -s reload

Because nginx -s reload starts fresh worker processes on the new configuration while the old workers drain their in‑flight requests, the switch is graceful: no connection is dropped mid‑request.
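The swap can be scripted so it is repeatable. This sketch rewrites the proxy_pass target (the config path and upstream names follow the example above):

```shell
# switch_upstream <conf-file> <target>: point proxy_pass at <target>.
switch_upstream() {
  out=$(mktemp)
  sed "s|proxy_pass http://[a-z]*;|proxy_pass http://$2;|" "$1" > "$out" \
    && mv "$out" "$1"
}

# After the blue containers pass their health checks:
# switch_upstream /etc/nginx/conf.d/webapp.conf blue && nginx -t && nginx -s reload
```

Running nginx -t before the reload ensures a typo in the config can never take the proxy down.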


6. Validate Nginx Health Checks Before Swapping

Add passive health checks in Nginx to avoid sending traffic to a failing backend.

upstream blue {
    server 10.0.0.21:3000 max_fails=3 fail_timeout=30s;
    server 10.0.0.22:3000 max_fails=3 fail_timeout=30s;
    keepalive 32;
}

You can also use active health checks (the health_check directive, part of the commercial NGINX Plus) for proactive probing, but the passive approach works well for most SaaS workloads.
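Passive checks pair well with request retries, so a request that hits a failing server is transparently retried on a healthy one (timeout values are illustrative):

```nginx
location / {
    proxy_pass http://blue;
    # On an error or timeout, retry the request on the next server.
    proxy_next_upstream error timeout http_502 http_503;
    proxy_next_upstream_tries 2;
    proxy_connect_timeout 2s;
}
```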


7. Run Zero‑Downtime DB Migrations Separately

Never couple schema changes with the app rollout. Use a tool like Flyway or Prisma migrate to apply migrations in a forward‑compatible manner.

# Example with Prisma (migrate deploy is GA; no feature flag needed)
npx prisma migrate deploy

If a migration needs a back‑fill, run it in a background worker that reads from a queue, ensuring the API stays responsive.
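A back‑fill worker can process rows in bounded batches so the API stays responsive. In this sketch, the fetchBatch and migrateRow helpers are hypothetical stand‑ins for your data access code:

```javascript
// Back-fill in bounded batches with a pause between them, so the
// migration never monopolises the database.
async function backfill(fetchBatch, migrateRow, { batchSize = 100, pauseMs = 50 } = {}) {
  let migrated = 0;
  for (;;) {
    const rows = await fetchBatch(batchSize);   // e.g. SELECT ... LIMIT 100
    if (rows.length === 0) return migrated;     // nothing left to copy
    for (const row of rows) {
      await migrateRow(row);                    // copy old column -> new column
      migrated++;
    }
    // Yield between batches so foreground traffic keeps its throughput.
    await new Promise((resolve) => setTimeout(resolve, pauseMs));
  }
}

module.exports = { backfill };
```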


8. Wire Up Observability Alerts

Create alerts that fire on:

  • 95th‑percentile response latency > 200 ms for more than 2 minutes.
  • Error rate > 1% across the new upstream.
  • Container restarts > 1 per minute.

In Prometheus:

- alert: HighLatency
  expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="web"}[5m])) by (le)) > 0.2
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "95th percentile latency is high"

Having these alerts lets you abort a rollout before users notice a problem.
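A matching error‑rate rule, assuming a counter such as http_requests_total with a status label, might look like:

```yaml
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{job="web",status=~"5.."}[5m]))
        / sum(rate(http_requests_total{job="web"}[5m])) > 0.01
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 1% on the new upstream"
```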


9. Prepare an Automated Rollback Plan

If any alert fires, trigger a rollback via your CI/CD tool. For Docker Swarm you can revert to the previous image tag:

docker service update \
  --image registry.myco.com/webapp:${PREV_VERSION} \
  web_service

# Or let Swarm revert to the previous service spec in one step
docker service rollback web_service

Keep the previous tag stored in a variable (e.g., PREV_VERSION) as part of the deployment script.
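Capturing PREV_VERSION can be automated. This sketch extracts the tag from the currently running service's image reference (with any digest suffix stripped):

```shell
# image_tag <image-ref>: return the tag portion of an image reference.
image_tag() {
  ref="${1%%@*}"        # drop any @sha256:... digest suffix
  echo "${ref##*:}"     # text after the last colon
}

# Before each deploy, remember what is currently running:
# PREV_VERSION=$(image_tag "$(docker service inspect web_service \
#     --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}')")
```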


10. Clean Up Old Images and Containers

After a successful rollout, prune stale images to free space and reduce attack surface.

# Remove dangling images
docker image prune -f
# Remove unused images older than 30 days
docker image prune -a --force --filter "until=720h"

Also decommission the old upstream servers in Nginx and update the inventory file used by your orchestration layer.


Conclusion

Zero‑downtime deployments aren’t magic; they’re the result of disciplined versioning, health‑aware routing, and observability‑driven feedback loops. By following this checklist you can ship features several times a day without ever breaking the user experience. If you need help shipping this, reach out to the team at RamerLabs.
