Why Zero‑Downtime Matters
For a DevOps lead, every second of unavailability translates into lost revenue, eroded trust, and a damaged brand. Modern users expect services to stay online even when you push new features, security patches, or database migrations. Achieving zero‑downtime isn’t a magic trick; it’s a disciplined set of practices that combine container orchestration, smart proxying, and observability.
Prerequisites
Before you dive into the checklist, make sure you have the following in place:
- Docker Engine ≥ 20.10 on every host.
- Docker Compose (or a compatible orchestrator like Kubernetes) for multi‑service definitions.
- Nginx 1.21+ with the ngx_http_upstream_module (part of the standard build).
- A CI/CD runner (GitHub Actions, GitLab CI, or Jenkins) that can push images to a registry.
- Basic monitoring stack (Prometheus + Grafana or Loki + Grafana) for health checks.
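To quickly confirm those versions are in place, a few generic shell commands are enough (nothing project-specific):
docker --version              # expect 20.10 or newer
docker compose version        # Compose v2 is assumed throughout this post
docker run --rm nginx:1.23-alpine nginx -v   # prints the bundled Nginx version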
Docker & Nginx Blueprint
Dockerfile (Node.js example)
# Build stage: install all dependencies and compile the app
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# Runtime stage: only the compiled output and production dependencies
FROM node:20-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY package*.json ./
RUN npm ci --omit=dev
EXPOSE 3000
CMD ["node", "dist/index.js"]
The multi‑stage build keeps the final image lean (<50 MB) and isolates build‑time dependencies.
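If you want to sanity-check the build locally, something along these lines works (the SHA tag is only a convention and assumes a git checkout):
# Build the image and tag it with the short commit SHA
docker build -t myorg/api:$(git rev-parse --short HEAD) .
# Confirm the final stage really is lean
docker image ls myorg/api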
docker‑compose.yml
version: "3.9"
services:
api:
build: .
image: myorg/api:{{GIT_SHA}}
restart: always
environment:
- NODE_ENV=production
ports:
- "3000"
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
interval: 10s
timeout: 5s
retries: 3
nginx:
image: nginx:1.23-alpine
ports:
- "80:80"
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf:ro
depends_on:
api:
condition: service_healthy
The depends_on block with condition: service_healthy ensures the Nginx container starts only after the API has passed its health check.
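You can confirm the health state that gates the Nginx start with a quick inspect (a one-off check, not part of the compose file):
# Shows starting, healthy, or unhealthy based on the container's health check
docker inspect --format '{{.State.Health.Status}}' $(docker compose ps -q api)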
Nginx Reverse‑Proxy Config
worker_processes auto;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
events { worker_connections 1024; }
http {
    upstream api_upstream {
        server api:3000 max_fails=3 fail_timeout=30s;
    }
    server {
        listen 80;
        location / {
            proxy_pass http://api_upstream;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }
}
The upstream block, together with max_fails and fail_timeout, lets Nginx mark a failing container as unavailable and retry the remaining healthy instances, provided the api service is running more than one replica.
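One way to get there with the compose file above (a sketch; it relies on Docker's embedded DNS returning every replica's address when Nginx reloads its configuration):
# Run two replicas of the api service behind the same service name
docker compose up -d --scale api=2 --no-recreate api
# Reload Nginx so it re-resolves "api" and round-robins across both replicas
docker compose exec nginx nginx -s reload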
CI/CD Pipeline Checklist
- Lint & Unit Test – Fail fast on code quality issues.
-
Build Docker Image – Tag with commit SHA and
latest
. - Run Integration Tests against a temporary compose stack.
- Push Image to Registry – Ensure the registry is reachable from production nodes.
- Deploy to Staging – Use the same compose file; run smoke tests.
- Blue‑Green Switch – Deploy the new version alongside the old one.
-
Health‑Check Validation – Verify
/health
returns200
before routing. - Promote to Production – Swap Nginx upstream without downtime.
- Post‑Deploy Monitoring – Watch error rates for 5 minutes; auto‑rollback if thresholds breach.
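A minimal health-check gate a pipeline step could run before traffic is routed; the endpoint, retry count, and delay are assumptions rather than fixed requirements:
#!/usr/bin/env bash
# Poll the health endpoint until it returns HTTP 200 or the attempts run out.
set -euo pipefail
URL="${1:-http://localhost:3000/health}"
for attempt in $(seq 1 30); do
  if curl -fsS "$URL" > /dev/null; then
    echo "health check passed on attempt $attempt"
    exit 0
  fi
  echo "attempt $attempt failed, retrying in 2s"
  sleep 2
done
echo "health check did not pass after 30 attempts" >&2
exit 1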
GitHub Actions Snippet
name: CI/CD
on:
  push:
    branches: [main]
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      - name: Log in to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DHUB_USER }}
          password: ${{ secrets.DHUB_PASS }}
      - name: Build & Push
        uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          tags: myorg/api:${{ github.sha }},myorg/api:latest
      - name: Deploy Blue‑Green Stack
        run: |
          docker compose -f docker-compose.yml up -d --no-deps --scale api=2
          # health check loop omitted for brevity
The --scale api=2 flag brings up a second container while the existing one keeps serving traffic, which is the basis of the blue‑green rollout described next.
Blue‑Green Deployment Workflow
- Spin Up New Version – docker compose up -d api_new creates a parallel service.
- Health Probe – Nginx only adds the new upstream after /health passes.
- Swap Traffic – Update the upstream block (or use DNS‑based service discovery) to point at the new container.
- Graceful Drain – Mark the old server as down in the upstream (or remove it) and let in‑flight requests finish.
- Verify Metrics – Check latency, error rate, and CPU usage for anomalies.
- Retire the Old Container – docker compose rm -sf api_old once the new version is stable.
The configuration change is applied with nginx -s reload, which starts new worker processes with the updated upstream and lets the old workers finish their in‑flight requests, so the swap happens without dropping connections; the full sequence is sketched below.
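A rough version of that sequence; the api_new/api_old service names come from the list above, healthcheck.sh is the loop from the CI checklist, and it assumes the edited nginx.conf is visible inside the container (for example, mounted from a config directory rather than as a single read‑only file):
#!/usr/bin/env bash
set -euo pipefail
# 1. Start the new version alongside the old one
docker compose up -d api_new
# 2. Block until the new container answers its health probe
./healthcheck.sh "http://localhost:3000/health"
# 3. Point the upstream at api_new in nginx.conf, then reload gracefully
sed -i 's/server api_old:3000/server api_new:3000/' nginx.conf
docker compose exec nginx nginx -t          # validate before reloading
docker compose exec nginx nginx -s reload   # old workers drain, new ones take over
# 4. Retire the old container once metrics look clean
docker compose rm -sf api_old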
Observability & Logging
- Prometheus – Scrape /metrics from both Nginx and the API.
- Grafana Dashboards – Visualize request latency, 5xx rates, and container restarts.
- Loki – Centralize Nginx access/error logs for quick grep.
- Alertmanager – Trigger a webhook to your CI system if the error rate exceeds 1% for 2 minutes.
Sample Prometheus target configuration:
scrape_configs:
  - job_name: "nginx"
    static_configs:
      - targets: ["nginx:9113"]
  - job_name: "api"
    static_configs:
      - targets: ["api:3000"]
Rollback Strategy
- Keep the Previous Image – Tag releases as v1.2.3 and never garbage‑collect them immediately.
- Automated Rollback – If the health check fails three times in a row, stop the new container and bring the old one back: docker compose stop api_new && docker compose up -d api_old (a fuller sketch follows this list).
- Database Compatibility – Use backward‑compatible migrations or feature flags to avoid schema lock‑in.
- Post‑Mortem – Log the cause, update the checklist, and share findings with the team.
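A minimal automated‑rollback sketch along those lines; api_new/api_old reuse the names from the blue‑green workflow, and the endpoint and probe interval are illustrative:
#!/usr/bin/env bash
# Roll back if the health endpoint fails three consecutive probes.
set -euo pipefail
HEALTH_URL="${HEALTH_URL:-http://localhost:3000/health}"
fails=0
for probe in $(seq 1 3); do
  if curl -fsS "$HEALTH_URL" > /dev/null; then
    fails=0
  else
    fails=$((fails + 1))
  fi
  sleep 10
done
if [ "$fails" -ge 3 ]; then
  echo "health check failed three times in a row, rolling back" >&2
  docker compose stop api_new
  docker compose up -d api_old
fi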
Final Thoughts
Zero‑downtime deployments are achievable with a modest amount of automation, solid health checks, and disciplined observability. By following the checklist above, a DevOps lead can confidently ship new code without upsetting users or triggering costly incidents. If you need help shipping this, the team at RamerLabs can help.