
The Ultimate Checklist for Zero‑Downtime Deploys with Docker & Nginx

Why Zero‑Downtime Matters

For a DevOps lead, every second of unavailability translates into lost revenue, eroded trust, and a damaged brand. Modern users expect services to stay online even when you push new features, security patches, or database migrations. Achieving zero‑downtime isn’t a magic trick; it’s a disciplined set of practices that combine container orchestration, smart proxying, and observability.


Prerequisites

Before you dive into the checklist, make sure you have the following in place:

  • Docker Engine ≥ 20.10 on every host.
  • Docker Compose (or a compatible orchestrator like Kubernetes) for multi‑service definitions.
  • Nginx 1.21+ (the stock build already includes ngx_http_upstream_module; no extra compile flags needed).
  • A CI/CD runner (GitHub Actions, GitLab CI, or Jenkins) that can push images to a registry.
  • Basic monitoring stack (Prometheus + Grafana or Loki + Grafana) for health checks.

Docker & Nginx Blueprint

Dockerfile (Node.js example)

# Stage 1: install all dependencies and compile the app
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: runtime image with production dependencies only
FROM node:20-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY package*.json ./
RUN npm ci --omit=dev
EXPOSE 3000
CMD ["node", "dist/index.js"]

The multi‑stage build keeps the final image lean and isolates build‑time dependencies: the compiler toolchain and dev dependencies never reach production.
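To build and tag outside CI, something like this works (the full commit SHA matches what the GitHub Actions workflow below uses):

# Tag the image with the current commit so compose can resolve myorg/api:${GIT_SHA}
export GIT_SHA=$(git rev-parse HEAD)
docker build -t "myorg/api:${GIT_SHA}" .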

docker‑compose.yml

version: "3.9"
services:
  api:
    build: .
    image: myorg/api:{{GIT_SHA}}
    restart: always
    environment:
      - NODE_ENV=production
    ports:
      - "3000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
  nginx:
    image: nginx:1.23-alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      api:
        condition: service_healthy

The depends_on clause with condition: service_healthy ensures Nginx starts only after the API has passed its health check, so it never begins routing traffic to a container that isn't ready.
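The health check assumes the API actually serves a /health route. A minimal sketch, assuming an Express app (the framework and route body are illustrative):

const express = require('express');
const app = express();

// Lightweight liveness probe: return 200 once the process can serve traffic.
// Extend with real readiness checks (DB ping, cache warm-up) as needed.
app.get('/health', (req, res) => res.status(200).json({ status: 'ok' }));

app.listen(3000, () => console.log('API listening on :3000'));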

Nginx Reverse‑Proxy Config

worker_processes auto;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;

events { worker_connections 1024; }

http {
    upstream api_upstream {
        server api:3000 max_fails=3 fail_timeout=30s;
    }

    server {
        listen 80;
        location / {
            proxy_pass http://api_upstream;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }
}

With a single server entry, max_fails and fail_timeout simply mark the backend as unavailable for the fail_timeout window; the real payoff comes once the upstream lists multiple instances (for example, after scaling), because Nginx can then retry requests against the remaining healthy servers.
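A sketch of the same upstream scaled to two instances (container names follow Compose's default <project>-<service>-<index> pattern and are assumptions; adjust to your project name):

upstream api_upstream {
    # Round-robin across both replicas; a server is skipped after 3 failures
    server myapp-api-1:3000 max_fails=3 fail_timeout=30s;
    server myapp-api-2:3000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    location / {
        # Retry the request on the next server on connection errors or 502/503
        proxy_next_upstream error timeout http_502 http_503;
        proxy_pass http://api_upstream;
    }
}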


CI/CD Pipeline Checklist

  1. Lint & Unit Test – Fail fast on code quality issues.
  2. Build Docker Image – Tag with commit SHA and latest.
  3. Run Integration Tests against a temporary compose stack.
  4. Push Image to Registry – Ensure the registry is reachable from production nodes.
  5. Deploy to Staging – Use the same compose file; run smoke tests.
  6. Blue‑Green Switch – Deploy the new version alongside the old one.
  7. Health‑Check Validation – Verify /health returns 200 before routing.
  8. Promote to Production – Swap Nginx upstream without downtime.
  9. Post‑Deploy Monitoring – Watch error rates for 5 minutes; auto‑rollback if thresholds breach.

GitHub Actions Snippet

name: CI/CD
on:
  push:
    branches: [main]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      - name: Log in to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DHUB_USER }}
          password: ${{ secrets.DHUB_PASS }}
      - name: Build & Push
        uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          tags: myorg/api:${{ github.sha }},myorg/api:latest
      - name: Deploy Blue‑Green Stack
        run: |
          GIT_SHA=${{ github.sha }} docker compose -f docker-compose.yml up -d --no-deps --scale api=2
          # health check loop omitted for brevity (see the sketch below)

Scaling to api=2 brings up a new container while keeping the old one alive. Strictly speaking this is closer to a rolling update than textbook blue‑green (which runs two complete stacks side by side), but the effect is the same: the old instance keeps serving until the new one proves healthy.
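The omitted health-check loop could look like this sketch (the retry budget and probing via docker compose exec are assumptions):

#!/usr/bin/env bash
set -euo pipefail

# Poll the API health endpoint before declaring the deploy good.
# With --scale, exec targets replica 1 by default; pass --index=N to probe another replica.
for i in $(seq 1 10); do
  if docker compose exec -T api wget -q --spider http://localhost:3000/health; then
    echo "API healthy after $i attempt(s)"
    exit 0
  fi
  sleep 3
done

echo "API failed health checks; aborting deploy" >&2
exit 1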


Blue‑Green Deployment Workflow

  1. Spin Up New Version – docker compose up -d api_new creates a parallel service.
  2. Health Probe – Nginx only adds the new upstream after /health passes.
  3. Swap Traffic – Update the upstream block (or use a DNS‑based service discovery) to point to the new container.
  4. Graceful Drain – Mark the old server down in the upstream block; existing connections finish while no new requests are routed (see the sketch below).
  5. Verify Metrics – Check latency, error rate, and CPU usage for anomalies.
  6. Retire Old Containers – docker compose rm -sf api_old once the new version is stable.

Because nginx -s reload spawns fresh worker processes and gracefully retires the old ones, you can apply an updated upstream block without dropping in‑flight connections.
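A sketch of the drain step, reusing the api_new/api_old names from the workflow above:

upstream api_upstream {
    server api_new:3000 max_fails=3 fail_timeout=30s;
    # "down" stops routing new requests to the old container; in-flight ones complete
    server api_old:3000 down;
}

Validate and apply with docker compose exec nginx nginx -t && docker compose exec nginx nginx -s reload.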


Observability & Logging

  • Prometheus – Scrape /metrics from the API, and from Nginx via the nginx-prometheus-exporter sidecar.
  • Grafana Dashboards – Visualize request latency, 5xx rates, and container restarts.
  • Loki – Centralize Nginx access/error logs for quick grep.
  • Alertmanager – Trigger a webhook to your CI system if error rate > 1% for 2 minutes.
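On the API side, exposing /metrics might look like this sketch using the prom-client package (the library choice is an assumption; any Prometheus client works):

const express = require('express');
const client = require('prom-client');

const app = express();
client.collectDefaultMetrics(); // CPU, memory, event-loop lag, GC, etc.

// Expose all registered metrics in the Prometheus text format
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(3000);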

Sample Prometheus target configuration (9113 is the nginx-prometheus-exporter's default port):

scrape_configs:
  - job_name: "nginx"
    static_configs:
      - targets: ["nginx:9113"]
  - job_name: "api"
    static_configs:
      - targets: ["api:3000"]
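The 1%-for-2-minutes threshold from the Alertmanager bullet might be expressed as a Prometheus alerting rule like this sketch (the http_requests_total metric and its status_code label are assumptions; match them to what your API actually exports):

groups:
  - name: deploy-guard
    rules:
      - alert: HighErrorRate
        # Assumes the API exports http_requests_total with a status_code label
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[2m]))
            / sum(rate(http_requests_total[2m])) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "5xx rate above 1% for 2 minutes – consider rolling back"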

Rollback Strategy

  1. Keep the Previous Image – Tag releases as v1.2.3 and never GC them immediately.
  2. Automated Rollback – If the health check fails three times in a row, run docker compose stop api_new && docker compose up -d api_old (see the sketch after this list).
  3. Database Compatibility – Use backward‑compatible migrations or feature flags to avoid schema lock‑in.
  4. Post‑Mortem – Log the cause, update the checklist, and share findings with the team.
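A sketch of the automated-rollback logic from step 2, assuming the api_new/api_old service names used earlier:

#!/usr/bin/env bash
set -uo pipefail

# Roll back after three consecutive failed health checks against the new container.
FAILS=0
while true; do
  if docker compose exec -T api_new wget -q --spider http://localhost:3000/health; then
    FAILS=0
  else
    FAILS=$((FAILS + 1))
    echo "health check failed ($FAILS/3)" >&2
  fi

  if [ "$FAILS" -ge 3 ]; then
    echo "rolling back to api_old" >&2
    docker compose stop api_new
    docker compose up -d api_old
    exit 1
  fi
  sleep 10
done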

Final Thoughts

Zero‑downtime deployments are achievable with a modest amount of automation, solid health checks, and disciplined observability. By following the checklist above, a DevOps lead can confidently ship new code without upsetting users or triggering costly incidents. If you need a hand shipping this, the team at RamerLabs is happy to help.
