Ramer Labs

The Ultimate Checklist for Zero‑Downtime Deploys with Docker & Nginx

Introduction

Zero‑downtime deployments are no longer a nice‑to‑have; they’re a baseline expectation for modern services. As a DevOps lead, you’re probably juggling Docker containers, Nginx reverse proxies, and a CI/CD pipeline that must keep the lights on while you push new code. This checklist walks you through the practical steps to achieve seamless rollouts, from image building to observability, without sacrificing safety.


1. Prepare Your Docker Images

1.1 Use Multi‑Stage Builds

Multi‑stage Dockerfiles keep your final image lean and free of build‑time dependencies.

# ---- Build Stage ----
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# ---- Production Stage ----
FROM node:20-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY package*.json ./
RUN npm ci --production
EXPOSE 3000
CMD ["node", "dist/index.js"]

1.2 Tag Images with Immutable Versions

Never push latest to production. Tag each build with a Git SHA or semantic version, e.g., myapp:1.4.2-a1b2c3.

docker build -t myregistry.com/myapp:1.4.2-a1b2c3 .
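
In CI, the tag can be derived automatically from the commit. A minimal sketch (the registry, app name, and version are placeholders carried over from the example above; in a real pipeline the SHA would come from git rev-parse --short HEAD or $GITHUB_SHA):

```shell
#!/bin/sh
# Compose an immutable image tag from a semantic version and a commit SHA.
VERSION="1.4.2"
SHA="a1b2c3"   # in CI: SHA=$(git rev-parse --short HEAD)
TAG="${VERSION}-${SHA}"
IMAGE="myregistry.com/myapp:${TAG}"
echo "$IMAGE"
# Then: docker build -t "$IMAGE" . && docker push "$IMAGE"
```

Because the tag encodes both the version and the exact commit, you can always trace a running container back to its source.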

2. Blueprint Your Nginx Proxy

2.1 Separate Upstream Blocks per Release

Define distinct upstream groups for the current and candidate containers. This makes traffic shifting painless.

upstream app_current {
    server 127.0.0.1:8081;
}

upstream app_candidate {
    server 127.0.0.1:8082;
}

server {
    listen 80;
    location / {
        proxy_pass http://app_current;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

2.2 Enable Graceful Shutdowns

Add proxy_next_upstream so Nginx retries a failed backend, plus sensible timeouts and a reasonable keepalive_timeout, so Nginx can finish in‑flight requests before dropping a container.

proxy_connect_timeout 5s;
proxy_read_timeout 30s;
proxy_send_timeout 30s;
proxy_next_upstream error timeout http_502;
keepalive_timeout 65s;
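
The container side should match: docker stop sends SIGTERM and only escalates to SIGKILL after a grace period. A sketch, assuming a 35-second window (an assumption chosen to exceed the 30s proxy_read_timeout above):

```shell
#!/bin/sh
# Drain window: 35s > proxy_read_timeout (30s), so in-flight requests
# can finish before the old container is killed.
GRACE=35
CMD="docker stop --time ${GRACE} myapp_current"
echo "$CMD"   # echoed rather than executed in this sketch
```

Your app must also handle SIGTERM itself (stop accepting new connections, finish open ones) for the grace period to matter.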

3. Adopt a Blue‑Green Deployment Strategy

3.1 Define the Flow

  1. Deploy candidate – spin up a new container on the candidate port (e.g., 8082).
  2. Health‑check – run automated smoke tests against the candidate upstream.
  3. Swap traffic – update Nginx config to point proxy_pass from app_current to app_candidate and reload.
  4. Monitor – watch error rates, latency, and logs for a brief period.
  5. Retire old – once confidence is high, stop the old container and rename the candidate to current.
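
The traffic swap in step 3 can be reduced to replacing one config file and reloading. A minimal sketch (paths are assumptions; commands are echoed here so the flow is visible without touching a live server):

```shell
#!/bin/sh
set -e
# Blue-green swap sketch: point the live config at the candidate, then reload.
swap() {
  echo "ln -sfn /etc/nginx/sites/app_candidate.conf /etc/nginx/sites/app.conf"
  echo "nginx -t"          # validate the config before reloading
  echo "nginx -s reload"   # graceful reload; workers finish open requests
}
swap
```

The nginx -t guard matters: a reload with a broken config is exactly the kind of self-inflicted downtime this checklist is meant to prevent.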

3.2 Automate with a CI/CD Job

Here’s a minimal GitHub Actions workflow that orchestrates the above steps.

name: Blue‑Green Deploy
on:
  push:
    branches: [ main ]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Build Docker Image
        run: |
          IMAGE_TAG=${{ github.sha }}
          docker build -t myregistry.com/myapp:$IMAGE_TAG .
          docker push myregistry.com/myapp:$IMAGE_TAG
      - name: Deploy Candidate
        run: |
          docker run -d --name myapp_candidate -p 8082:3000 myregistry.com/myapp:${{ github.sha }}
      - name: Health Check
        run: |
          curl -f http://localhost:8082/health || exit 1
      - name: Switch Nginx Traffic
        run: |
          ssh user@host "sudo cp /etc/nginx/conf.d/app_candidate.conf /etc/nginx/conf.d/app.conf && sudo nginx -s reload"
      - name: Observe
        run: |
          sleep 30 # let traffic settle for 30 seconds
          curl -s http://localhost/metrics | grep error_rate || exit 1
      - name: Cleanup Old
        run: |
          docker stop myapp_current && docker rm myapp_current
          docker rename myapp_candidate myapp_current

4. Observability & Logging

4.1 Centralize Logs

Send container stdout/stderr to a log aggregator (e.g., Loki, ELK). As a local baseline, configure a rotating json-file driver in Docker Compose (swap in a shipping driver such as gelf, or the Loki plugin, to send logs off-host):

services:
  myapp:
    image: myregistry.com/myapp:1.4.2-a1b2c3
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

4.2 Export Metrics

Expose Prometheus‑compatible metrics from your app and route them through Nginx, restricting access so only trusted scrapers can reach the endpoint.

location /metrics {
    proxy_pass http://app_current/metrics;
    allow 127.0.0.1;
    deny all;
}

4.3 Alert on Anomalies

Create alerts for:

  • Error rate > 1% over a 5‑minute window.
  • Latency P95 > 300ms.
  • Container restarts > 2 in the last hour.

5. Rollback Plan

Even with thorough checks, things can go sideways. Keep a one‑click rollback:

  1. Re‑apply the previous Nginx config (app_previous.conf).
  2. Reload Nginx.
  3. Verify health endpoints.
  4. If needed, spin up the older Docker image again.

Document the rollback steps in your runbook and test them quarterly.
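
The four steps fit in one script you can keep next to the runbook. A sketch (file names and the health URL are assumptions; commands are echoed for illustration, and the previous image tag is left as a placeholder to fill in):

```shell
#!/bin/sh
set -e
# One-click rollback sketch: restore previous config, reload, verify.
rollback() {
  echo "cp /etc/nginx/conf.d/app_previous.conf /etc/nginx/conf.d/app.conf"
  echo "nginx -s reload"
  echo "curl -fsS http://localhost/health"
  # If the old container was already removed, restart it from the last good tag:
  echo "docker run -d --name myapp_current -p 8081:3000 myregistry.com/myapp:<previous-tag>"
}
rollback
```

Running this quarterly, as suggested above, is what turns it from a script into an actual safety net.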


6. Checklist Summary

  • [ ] Immutable image tags – no latest in prod.
  • [ ] Multi‑stage Dockerfile – smallest possible runtime.
  • [ ] Separate Nginx upstreams for current and candidate.
  • [ ] Graceful timeout settings in Nginx.
  • [ ] Automated health checks before traffic switch.
  • [ ] CI/CD job that builds, deploys, validates, and swaps.
  • [ ] Centralized logging and metrics export.
  • [ ] Alert thresholds for error rate, latency, restarts.
  • [ ] Documented rollback procedure and periodic drills.

Following this checklist will give you confidence that each push lands without a hiccup, keeping users happy and your team stress‑free.


Closing Thoughts

Zero‑downtime deployments are a combination of disciplined image management, smart proxy configuration, and robust observability. By treating each release as a candidate rather than an overwrite, you gain the safety net needed for fast iteration. If you need help shipping this, the team at https://ramerlabs.com can help.
