Ramer Labs

Posted on Sep 27

The Ultimate Checklist for Zero‑Downtime Deploys with Docker & Nginx

#cloud #devops #performance

Why Zero‑Downtime Deploys Matter

Every minute of downtime translates into lost revenue, frustrated users, and a dented brand reputation. In high‑traffic SaaS environments, even a brief outage can cascade into a wave of support tickets and churn. A well‑orchestrated zero‑downtime deployment strategy lets you push new features, security patches, or configuration tweaks without ever taking the service offline.

Prerequisites

Before you start ticking items off the checklist, make sure you have the following in place:

Docker Engine (≥ 20.10) installed on all hosts.
Nginx acting as a reverse‑proxy and load‑balancer.
A CI/CD platform (GitHub Actions, GitLab CI, or similar) with permission to push images to your registry.
A monitoring stack (Prometheus + Grafana, Loki, or a SaaS alternative) that can alert on health‑check failures.
SSH access, or an automation tool such as Ansible or Terraform, to provision and configure the infrastructure.

If any of these pieces are missing, you’ll likely hit roadblocks later in the process.
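A quick preflight can confirm the Docker Engine requirement before anything else runs. This is a minimal sketch: the helper name version_ge is hypothetical, and it assumes a sort that supports -V (version sort), as GNU coreutils does.

```shell
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Assumes `sort -V` (version sort) is available on the host.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example preflight (commented out so the snippet has no side effects):
# engine="$(docker version --format '{{.Server.Version}}')"
# version_ge "$engine" "20.10" || { echo "Docker Engine >= 20.10 required" >&2; exit 1; }
```

Running the same check on every host from your provisioning tool keeps version drift from surfacing mid-deploy.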

The Checklist

Below is a practical, step‑by‑step checklist you can copy into a Markdown file, a Confluence page, or your internal runbook. Treat each top‑level bullet as a must‑do before you hit the “Deploy” button.

1️⃣ Prepare Immutable Docker Images

Base on a minimal OS (e.g., alpine or distroless) to reduce attack surface and start‑up time.
Pin all dependencies in your Dockerfile.
Run a multi‑stage build to keep the final image lean.
Tag images with both a semantic version and a git SHA for traceability.

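# Dockerfile – multi‑stage Node.js app
FROM node:18-alpine AS builder
WORKDIR /app
# Install dependencies first so this layer stays cached until package*.json changes
COPY package*.json ./
RUN npm ci
# Copy the source, then build (assumes a "build" script that writes to dist/)
COPY . .
RUN npm run build

FROM node:18-alpine AS runtime
WORKDIR /app
ENV NODE_ENV=production
# Install production dependencies only; the build output still needs them at runtime
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=builder /app/dist ./dist
CMD ["node", "dist/index.js"]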
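The tagging rule (one semantic version, one git SHA) is easy to enforce in a small helper. The sketch below is an illustration only: image_refs is a hypothetical name, the version check is deliberately loose, and myregistry.com/app is the placeholder registry path used elsewhere in this post.

```shell
# image_refs REPO VERSION SHA: print two image references, one per line:
# REPO:VERSION and REPO:SHA. Fails if VERSION does not look like v1.4.2.
image_refs() {
  case "$2" in
    v[0-9]*.[0-9]*.[0-9]*) ;;   # loose shape check, not a full semver parser
    *) echo "not a semantic version tag: $2" >&2; return 1 ;;
  esac
  printf '%s:%s\n%s:%s\n' "$1" "$2" "$1" "$3"
}

# Example (commented out: needs Docker and a git checkout):
# refs="$(image_refs myregistry.com/app v1.4.2 "$(git rev-parse HEAD)")" || exit 1
# docker build $(printf -- '-t %s ' $refs) .
```

Rejecting tags like latest at this point keeps mutable tags out of the registry, which is what makes the images traceable later.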

2️⃣ Configure Nginx for Blue‑Green Routing

Define two upstream blocks (upstream blue and upstream green).
Expose a health‑check endpoint (/healthz) on each container.
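Use the proxy_next_upstream directive to retry a failed request on another server in the same upstream group; it only helps once a group contains more than one server.
Keep config changes idempotent and validate them with nginx -t before reloading. A reload is graceful: new workers pick up the new config while old workers finish their in‑flight requests.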

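# /etc/nginx/conf.d/app.conf
upstream blue {
    # Note: max_fails and fail_timeout are ignored while a group has only one server
    server 10.0.1.10:8080 max_fails=3 fail_timeout=30s;
}
upstream green {
    server 10.0.1.11:8080 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    # Active environment; change to "green" and reload to switch traffic
    set $upstream blue;

    location / {
        proxy_pass http://$upstream; # resolves to the upstream group named by $upstream
    }
    location /healthz {
        proxy_pass http://$upstream/healthz;
    }
}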
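To make the switch repeatable, the active environment can be read and rewritten by a script instead of by hand. The sketch below assumes the active color is recorded on a single `set $upstream <name>;` line in app.conf; that convention, and the helper names active_upstream and set_active_upstream, are assumptions for illustration rather than anything Nginx prescribes.

```shell
# active_upstream FILE: print the environment named on the `set $upstream <name>;` line.
active_upstream() {
  sed -n 's/^[[:space:]]*set[[:space:]]\{1,\}\$upstream[[:space:]]\{1,\}\([a-z]\{1,\}\);.*$/\1/p' "$1" | head -n 1
}

# set_active_upstream FILE NAME: point that line at NAME (blue or green).
# Prints "changed" or "unchanged" so callers can skip a needless reload.
set_active_upstream() {
  case "$2" in blue|green) ;; *) echo "unknown environment: $2" >&2; return 1 ;; esac
  current="$(active_upstream "$1")"
  [ -n "$current" ] || { echo "no set \$upstream line in $1" >&2; return 1; }
  if [ "$current" = "$2" ]; then
    echo unchanged
    return 0
  fi
  tmp="$(mktemp)" || return 1
  sed 's/^\([[:space:]]*set[[:space:]]\{1,\}\$upstream[[:space:]]\{1,\}\)[a-z]\{1,\};/\1'"$2"';/' "$1" > "$tmp" \
    && cat "$tmp" > "$1" && echo changed
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Example (commented out: needs root and a running Nginx):
# [ "$(set_active_upstream /etc/nginx/conf.d/app.conf green)" = changed ] \
#   && sudo nginx -t && sudo nginx -s reload
```

Because the function reports "unchanged" when the file already points at the requested environment, running it twice is harmless, which is the idempotency the checklist asks for.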

3️⃣ Automate CI/CD Pipeline

Build, test, and push the image in a single job.
Store the image tag as an artifact for downstream jobs.
Trigger a deployment job that updates the Nginx upstream variable.
Add a manual approval step for production releases (optional but recommended).

# .github/workflows/deploy.yml (GitHub Actions)
name: Deploy
on:
  push:
    tags:
      - 'v*.*.*'
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      - name: Log in to the registry
        uses: docker/login-action@v2
        with:
          registry: myregistry.com
          username: ${{ secrets.DOCKER_USER }}
          password: ${{ secrets.DOCKER_PASS }}
      - name: Build & push image
        uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          tags: myregistry.com/app:${{ github.ref_name }},myregistry.com/app:${{ github.sha }}
  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: SSH into host, roll out the image, and reload Nginx
        # Assumes an SSH key for the host is already loaded on the runner.
        # StrictHostKeyChecking=no skips host-key verification; pin the host key in known_hosts for production.
        run: |
          ssh -o StrictHostKeyChecking=no ${{ secrets.SSH_USER }}@${{ secrets.SSH_HOST }} \
            "docker service update --image myregistry.com/app:${{ github.sha }} myservice && \
             sudo nginx -t && sudo nginx -s reload"
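One lightweight way to hand the image tag to downstream jobs is a step output rather than an uploaded artifact. The helper below is a sketch: publish_image_tag and the output name image_tag are hypothetical, while GITHUB_OUTPUT is the file path GitHub Actions provides to each step.

```shell
# publish_image_tag TAG: expose TAG as a step output named "image_tag".
# GitHub Actions sets GITHUB_OUTPUT to a file path; each name=value line
# appended to that file becomes an output of the current step.
publish_image_tag() {
  [ -n "$1" ] || { echo "empty image tag" >&2; return 1; }
  [ -n "${GITHUB_OUTPUT:-}" ] || { echo "GITHUB_OUTPUT is not set" >&2; return 1; }
  printf 'image_tag=%s\n' "$1" >> "$GITHUB_OUTPUT"
}
```

If the step that calls this has an id such as meta, the build job can declare an output mapped to steps.meta.outputs.image_tag, and the deploy job can read it as needs.build.outputs.image_tag instead of recomputing the tag.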

4️⃣ Execute a Blue‑Green Switch

Deploy the new version to the idle environment (e.g., green).
Run health checks against green using a curl loop or a monitoring probe.
If all checks pass, update the $upstream variable in Nginx to point to green.
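Gracefully drain the old environment (blue): after the reload, old Nginx worker processes finish their in‑flight requests, so wait for existing connections to complete before stopping the blue containers. Changing max_fails does not drain traffic.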
Mark the old environment as idle and repeat the cycle on the next release.

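# Simple health‑check loop (run from CI or a bastion host)
healthy=false
for i in {1..10}; do
  if curl -sSf http://10.0.1.11:8080/healthz > /dev/null; then
    echo "✅ Green is healthy"
    healthy=true
    break
  fi
  echo "⏳ Waiting for green… (attempt $i/10)"
  sleep 5
done
# Abort the deploy if green never became healthy
[ "$healthy" = true ] || { echo "❌ Green failed health checks" >&2; exit 1; }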
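The switch steps can also be expressed as a gate that names the cut-over target only when the idle environment answers its health check. This is a sketch under assumptions: the function names are hypothetical, and the addresses are the example backends from app.conf above.

```shell
# other_color COLOR: print the environment that is not currently live.
other_color() {
  case "$1" in
    blue)  echo green ;;
    green) echo blue ;;
    *)     echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# backend_addr COLOR: map an environment to its backend address.
backend_addr() {
  case "$1" in
    blue)  echo 10.0.1.10:8080 ;;
    green) echo 10.0.1.11:8080 ;;
    *)     return 1 ;;
  esac
}

# cutover_target LIVE: print the idle environment if it passes /healthz;
# print nothing and fail otherwise, so traffic stays where it is.
cutover_target() {
  target="$(other_color "$1")" || return 1
  addr="$(backend_addr "$target")" || return 1
  curl -sSf --max-time 5 "http://$addr/healthz" > /dev/null || return 1
  echo "$target"
}
```

A deploy job could call cutover_target blue and only touch the Nginx config when it prints green; a non-zero exit leaves the live environment untouched.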

5️⃣ Observe, Log, and Alert

Enable access logs in Nginx, using a JSON log_format (with escape=json) for structured data.
Ship logs to Loki or CloudWatch using a sidecar container.
Instrument your app with Prometheus metrics (process_start_time_seconds, request latency, error rates).
Create alerts for:
- Health‑check failures > 2 consecutive attempts.
- Sudden spike in 5xx responses after a switch.
- CPU or memory thresholds on the new containers.
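Before the dashboards and alert rules exist, a spike in 5xx responses can be spot-checked straight from the access log. The sketch below assumes Nginx's default "combined" log format, where the status code is the ninth whitespace-separated field; with JSON logs you would extract the status field with a JSON tool instead. The name error_rate_5xx is hypothetical.

```shell
# error_rate_5xx LOGFILE: print the percentage of requests that returned a 5xx status.
# Assumes the default "combined" access log format (status code in field 9).
error_rate_5xx() {
  awk '{ total++; if ($9 ~ /^5[0-9][0-9]$/) errors++ }
       END { if (total == 0) print "0.00"; else printf "%.2f\n", 100 * errors / total }' "$1"
}

# Example (commented out: reads the live log):
# error_rate_5xx /var/log/nginx/access.log
```

Comparing the figure just before and just after a switch gives a rough signal while the Prometheus alerts are still being tuned.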

6️⃣ Rollback Plan

Even with a checklist, things can go sideways. Keep a concise rollback procedure:

If health checks fail, revert $upstream back to the previous environment.
Roll back the Docker image by redeploying the last known‑good tag.
Notify the on‑call team via Slack or PagerDuty.
Document the incident in your post‑mortem template.
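Redeploying the last known-good tag is only quick if that tag is written down somewhere. One possible approach, sketched here with hypothetical helper names and a hypothetical history file, is to append each tag to a plain-text log as it is deployed and read the second-to-last entry during a rollback.

```shell
# record_release FILE TAG: append TAG to the release history when it is deployed.
record_release() {
  printf '%s\n' "$2" >> "$1"
}

# previous_release FILE: print the tag that was deployed before the most recent one.
previous_release() {
  [ -f "$1" ] || { echo "no release history at $1" >&2; return 1; }
  [ $(wc -l < "$1") -ge 2 ] || { echo "no earlier release recorded" >&2; return 1; }
  tail -n 2 "$1" | head -n 1
}

# Example rollback (commented out: needs Docker Swarm and the registry):
# tag="$(previous_release /var/lib/deploy/releases.log)" || exit 1
# docker service update --image "myregistry.com/app:$tag" myservice
```

Keeping the history on the deploy host, or in the pipeline's own storage, means the rollback does not depend on anyone remembering which SHA was live yesterday.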

Post‑Deployment Validation

After the switch, run a quick sanity test:

Smoke test a few critical API endpoints.
Verify session persistence (if you use sticky sessions, ensure the new containers share the same session store).
Check CDN cache invalidation if static assets were updated.
Confirm monitoring dashboards show expected traffic patterns.
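The smoke test itself can be a few lines of shell. This is a minimal sketch: smoke_test is a hypothetical helper, and the paths you pass in are whatever endpoints matter for your service.

```shell
# smoke_test BASE_URL PATH...: request each path and fail if any returns a non-2xx status.
smoke_test() {
  base="$1"; shift
  failed=0
  for path in "$@"; do
    # -o /dev/null discards the body; -w prints only the HTTP status code
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$base$path")"
    case "$code" in
      2??) echo "ok   $code $path" ;;
      *)   echo "FAIL $code $path" >&2; failed=1 ;;
    esac
  done
  return "$failed"
}

# Example (commented out: needs the service to be reachable; /api/status is a made-up path):
# smoke_test http://localhost /healthz /api/status || echo "smoke test failed" >&2
```

Because the function checks every path before returning, a single run lists all failing endpoints instead of stopping at the first one.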

TL;DR Checklist Summary

✅ Immutable Docker images with semantic tags
✅ Nginx configured for blue‑green upstreams
✅ CI/CD pipeline that builds, pushes, and reloads
✅ Health‑check loop before traffic cut‑over
✅ Observability stack ready for alerts
✅ Documented rollback steps

By treating each bullet as a gate rather than an afterthought, you can achieve truly zero‑downtime releases, even under heavy load.

If you need help shipping this, the team at https://ramerlabs.com can help.
