Wilson Xu

Building a Zero-Downtime Deployment Pipeline with GitHub Actions and Docker

Every minute of downtime costs money. According to Gartner, the average cost of IT downtime is $5,600 per minute. For SaaS platforms processing transactions, serving API requests, or running real-time dashboards, even a few seconds of unavailability during deployments can erode user trust and trigger SLA violations. Yet many teams still deploy by stopping the old container and starting a new one, accepting a gap where no instance is healthy.

The irony is that most downtime is self-inflicted. Deployments — the very process meant to improve the product — are the single largest source of production incidents at many organizations. The solution is not to deploy less often. It is to make deployments safe.

This tutorial walks through building a complete zero-downtime deployment pipeline using GitHub Actions and Docker. You will implement a blue-green deployment strategy with health checks, automated rollback, and monitoring hooks — all using tools you likely already have in your stack. No Kubernetes cluster required. No expensive orchestration platform. Just Docker Compose, Nginx, and a well-structured deployment script.

By the end, you will have a production-ready CI/CD pipeline that builds, tests, deploys, and verifies your application without dropping a single request.

Architecture Overview: Blue-Green Deployments

Blue-green deployment maintains two identical production environments. At any time, one environment (blue) serves live traffic while the other (green) sits idle or runs the new release candidate. When you deploy, you bring up the new version on the idle environment, verify it passes health checks, then switch the router to point traffic at the new environment. The old environment stays running as an instant rollback target.

Here is the flow:

                    ┌──────────────┐
   Traffic ────────>│   Nginx      │
                    │  (reverse    │
                    │   proxy)     │
                    └──────┬───────┘
                           │
                ┌──────────┴──────────┐
                │                     │
         ┌──────▼──────┐      ┌──────▼──────┐
         │  Blue (v1)  │      │  Green (v2) │
         │  Port 8001  │      │  Port 8002  │
         └─────────────┘      └─────────────┘

Nginx proxies traffic to whichever environment is marked active. The deployment script brings up the inactive environment with the new image, runs health checks against it, then reconfigures Nginx to route traffic to it. The previously active environment becomes the rollback target.

This approach has clear advantages over rolling deployments for smaller setups: the switchover is atomic, rollback is instant (just flip the proxy back), and you always have the previous version running and ready.

Setting Up the Dockerfile

A well-structured multi-stage Dockerfile keeps your production image lean and your build reproducible. This example uses a Node.js application, but the pattern applies to any runtime.

# Stage 1: Install dependencies
FROM node:20-alpine AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --omit=dev

# Stage 2: Build the application
FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 3: Production image
FROM node:20-alpine AS production
WORKDIR /app

RUN addgroup -g 1001 appgroup && \
    adduser -u 1001 -G appgroup -s /bin/sh -D appuser

COPY --from=deps /app/node_modules ./node_modules
COPY --from=builder /app/dist ./dist
COPY package.json ./

# Health check endpoint used by deployment scripts
HEALTHCHECK --interval=10s --timeout=3s --start-period=15s --retries=3 \
    CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1

USER appuser
EXPOSE 3000

CMD ["node", "dist/server.js"]

Key decisions in this Dockerfile:

  • Multi-stage build separates dependency installation, compilation, and runtime. The final image contains only production dependencies and compiled output.
  • Non-root user follows the principle of least privilege.
  • Built-in HEALTHCHECK gives Docker a way to determine if the container is healthy, which the deployment script relies on.
  • Alpine base keeps the image small (typically under 150MB).
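One more lever for a lean, reproducible build: a .dockerignore next to the Dockerfile keeps local artifacts and secrets out of the build context. A minimal sketch (extend it for your project):

```
# .dockerignore
node_modules
dist
.git
.github
*.md
.env*
```

Excluding node_modules and dist matters because the builder stage installs and compiles from scratch; copying stale local copies in would defeat reproducibility.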

The Health Check Endpoint

The deployment pipeline needs a reliable way to verify the application is ready to serve traffic. Add a health check endpoint that validates your critical dependencies:

// src/health.js
import express from 'express';
// App-specific clients — adjust these import paths to wherever your
// database and Redis connections actually live
import { db } from './db.js';
import { redis } from './redis.js';

const router = express.Router();

router.get('/health', async (req, res) => {
  const checks = {
    uptime: process.uptime(),
    timestamp: Date.now(),
    status: 'ok',
    checks: {}
  };

  // Check database connectivity
  try {
    await db.query('SELECT 1');
    checks.checks.database = { status: 'ok' };
  } catch (err) {
    checks.checks.database = { status: 'fail', message: err.message };
    checks.status = 'fail';
  }

  // Check Redis connectivity
  try {
    await redis.ping();
    checks.checks.redis = { status: 'ok' };
  } catch (err) {
    checks.checks.redis = { status: 'fail', message: err.message };
    checks.status = 'fail';
  }

  const statusCode = checks.status === 'ok' ? 200 : 503;
  res.status(statusCode).json(checks);
});

// Lightweight liveness probe (no dependency checks)
router.get('/ready', (req, res) => {
  res.status(200).json({ status: 'ok' });
});

export default router;

The /health endpoint checks real dependencies — database, cache, external services. The /ready endpoint confirms the HTTP server is accepting connections. The deployment script uses both: /ready to know the process started, and /health to confirm it can serve real requests.
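A passing /health response from the code above has this shape (field values are illustrative):

```json
{
  "uptime": 123.45,
  "timestamp": 1700000000000,
  "status": "ok",
  "checks": {
    "database": { "status": "ok" },
    "redis": { "status": "ok" }
  }
}
```

If any dependency check fails, status flips to "fail" and the endpoint returns 503, which is what the deployment script's retry loop keys on.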

Docker Compose for Blue-Green Environments

Define both environments in a single compose file. The deployment script controls which one runs the new version:

# docker-compose.yml
services:
  blue:
    image: ${DOCKER_IMAGE:-myapp:latest}
    container_name: myapp-blue
    ports:
      - "8001:3000"
    environment:
      - NODE_ENV=production
      - APP_COLOR=blue
    restart: unless-stopped
    networks:
      - app-network

  green:
    image: ${DOCKER_IMAGE:-myapp:latest}
    container_name: myapp-green
    ports:
      - "8002:3000"
    environment:
      - NODE_ENV=production
      - APP_COLOR=green
    restart: unless-stopped
    networks:
      - app-network

  nginx:
    image: nginx:alpine
    container_name: myapp-proxy
    ports:
      - "80:80"
    volumes:
      - ./nginx/conf.d:/etc/nginx/conf.d
    depends_on:
      - blue
      - green
    restart: unless-stopped
    networks:
      - app-network

networks:
  app-network:
    driver: bridge

The Nginx configuration routes traffic to the active environment:

# nginx/conf.d/default.conf
upstream app_backend {
    # This file gets rewritten by the deploy script
    # to point at either blue (8001) or green (8002)
    server myapp-blue:3000;
}

server {
    listen 80;

    location / {
        proxy_pass http://app_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_connect_timeout 5s;
        proxy_read_timeout 30s;
    }

    location /nginx-health {
        access_log off;
        return 200 "ok";
    }
}

The Deployment Script

This is the core of the zero-downtime pipeline. The script determines which environment is idle, deploys the new image to it, waits for health checks to pass, switches Nginx, then optionally stops the old environment:

#!/usr/bin/env bash
# deploy.sh — Blue-green deployment with health checks and rollback
set -euo pipefail

DOCKER_IMAGE="${1:?Usage: deploy.sh <image:tag>}"
MAX_HEALTH_RETRIES=30
HEALTH_INTERVAL=2
NGINX_CONF="./nginx/conf.d/default.conf"

log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }

# Determine which environment is currently active
get_active_env() {
  if grep -q "myapp-blue" "$NGINX_CONF"; then
    echo "blue"
  else
    echo "green"
  fi
}

# Determine the target (inactive) environment
ACTIVE=$(get_active_env)
if [ "$ACTIVE" = "blue" ]; then
  TARGET="green"
  TARGET_PORT=8002
  TARGET_CONTAINER="myapp-green"
else
  TARGET="blue"
  TARGET_PORT=8001
  TARGET_CONTAINER="myapp-blue"
fi

log "Active environment: $ACTIVE"
log "Deploying $DOCKER_IMAGE to: $TARGET"

# Pull the new image
log "Pulling image..."
docker pull "$DOCKER_IMAGE"

# Start the target environment with the new image
log "Starting $TARGET environment..."
DOCKER_IMAGE="$DOCKER_IMAGE" docker compose up -d "$TARGET"

# Wait for health checks to pass
log "Waiting for $TARGET to become healthy..."
RETRIES=0
while [ $RETRIES -lt $MAX_HEALTH_RETRIES ]; do
  # "|| true" keeps set -e from aborting the script on connection failure;
  # curl prints 000 on failure, and the fallback covers an empty result
  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
    "http://localhost:${TARGET_PORT}/health" 2>/dev/null || true)
  HTTP_CODE="${HTTP_CODE:-000}"

  if [ "$HTTP_CODE" = "200" ]; then
    log "Health check passed on $TARGET (HTTP $HTTP_CODE)"
    break
  fi

  RETRIES=$((RETRIES + 1))
  log "Health check attempt $RETRIES/$MAX_HEALTH_RETRIES (HTTP $HTTP_CODE)"
  sleep $HEALTH_INTERVAL
done

if [ $RETRIES -eq $MAX_HEALTH_RETRIES ]; then
  log "ERROR: Health checks failed after $MAX_HEALTH_RETRIES attempts"
  log "Rolling back — stopping $TARGET"
  docker compose stop "$TARGET"
  exit 1
fi

# Switch Nginx to point at the new environment
log "Switching traffic to $TARGET..."
if [ "$TARGET" = "blue" ]; then
  sed -i.bak 's/myapp-green:3000/myapp-blue:3000/g' "$NGINX_CONF"
else
  sed -i.bak 's/myapp-blue:3000/myapp-green:3000/g' "$NGINX_CONF"
fi

# Reload Nginx without dropping connections
docker exec myapp-proxy nginx -s reload

# Verify traffic is flowing to the new environment
sleep 2
VERIFY_CODE=$(curl -s -o /dev/null -w "%{http_code}" "http://localhost:80/health" || true)
VERIFY_CODE="${VERIFY_CODE:-000}"
if [ "$VERIFY_CODE" != "200" ]; then
  log "ERROR: Post-switch verification failed. Rolling back Nginx config."
  cp "${NGINX_CONF}.bak" "$NGINX_CONF"
  docker exec myapp-proxy nginx -s reload
  exit 1
fi

log "Deployment complete. $TARGET is now serving traffic."
log "Previous environment ($ACTIVE) is still running as rollback target."

The script follows a strict sequence: deploy to the idle slot, verify health, switch traffic, verify again. If anything fails, it rolls back automatically.

GitHub Actions Workflow

Now wire the deployment script into a GitHub Actions workflow that triggers on pushes to main:

# .github/workflows/deploy.yml
name: Build and Deploy (Zero-Downtime)

on:
  push:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm

      - run: npm ci
      - run: npm run lint
      - run: npm test

  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    outputs:
      image_tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4

      - uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=,format=long
            type=raw,value=latest

      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to production
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.DEPLOY_HOST }}
          username: ${{ secrets.DEPLOY_USER }}
          key: ${{ secrets.DEPLOY_SSH_KEY }}
          script: |
            cd /opt/myapp
            git pull origin main
            ./deploy.sh ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}

      - name: Verify deployment
        run: |
          sleep 5
          # assumes TLS is terminated at or in front of the proxy;
          # "|| true" keeps the step alive so the check below reports the failure
          HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
            "https://${{ secrets.DEPLOY_HOST }}/health" || true)
          if [ "$HTTP_CODE" != "200" ]; then
            echo "Deployment verification failed (HTTP $HTTP_CODE)"
            exit 1
          fi
          echo "Deployment verified (HTTP $HTTP_CODE)"

      - name: Notify on failure
        if: failure()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Deployment failed for ${{ github.repository }}@${{ github.sha }}"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}

The workflow has three stages. Test runs your test suite and linting. Build-and-push creates the Docker image and pushes it to GitHub Container Registry. Deploy SSHs into the production server and runs the blue-green deployment script. A final verification step confirms the deployment succeeded from outside the network.

Notice the environment: production on the deploy job. This enables GitHub's environment protection rules, where you can require manual approvals, restrict which branches can deploy, and add wait timers.
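You can also attach a URL to the environment so GitHub links to the live app directly from the run page. A sketch (the domain is illustrative):

```yaml
  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    environment:
      name: production
      url: https://myapp.example.com  # shown as a "View deployment" link on the run
```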

Rollback Strategy

The blue-green architecture gives you two rollback mechanisms.

Instant rollback is flipping Nginx back to the previous environment. Since the old containers are still running, this takes under a second:

#!/usr/bin/env bash
# rollback.sh — Instant rollback to previous environment
set -euo pipefail

NGINX_CONF="./nginx/conf.d/default.conf"

if grep -q "myapp-blue" "$NGINX_CONF"; then
  CURRENT="blue"
  ROLLBACK_TO="green"
  sed -i 's/myapp-blue:3000/myapp-green:3000/g' "$NGINX_CONF"
else
  CURRENT="green"
  ROLLBACK_TO="blue"
  sed -i 's/myapp-green:3000/myapp-blue:3000/g' "$NGINX_CONF"
fi

docker exec myapp-proxy nginx -s reload

echo "Rolled back from $CURRENT to $ROLLBACK_TO"
echo "Verify: curl http://localhost:80/health"

Version rollback deploys a specific previous image tag. Since every deployment is tagged with the Git SHA, you can redeploy any previous version:

# Redeploy a specific version
./deploy.sh ghcr.io/yourorg/myapp:abc1234

Add a GitHub Actions workflow for manual rollback:

# .github/workflows/rollback.yml
name: Manual Rollback

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}
on:
  workflow_dispatch:
    inputs:
      version:
        description: 'Git SHA or image tag to roll back to'
        required: true

jobs:
  rollback:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Execute rollback
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.DEPLOY_HOST }}
          username: ${{ secrets.DEPLOY_USER }}
          key: ${{ secrets.DEPLOY_SSH_KEY }}
          script: |
            cd /opt/myapp
            ./deploy.sh ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.event.inputs.version }}

This gives your team a one-click rollback button directly in the GitHub Actions UI.

Monitoring and Alerting

A deployment pipeline is only as good as its observability. Add a post-deployment smoke test and integrate with your monitoring stack.

Create a smoke test script that validates critical paths after deployment:

#!/usr/bin/env bash
# smoke-test.sh — Post-deployment validation
set -euo pipefail

BASE_URL="${1:?Usage: smoke-test.sh <base_url>}"
FAILURES=0

check() {
  local name="$1"
  local url="$2"
  local expected_code="${3:-200}"

  # "|| true" keeps set -e from aborting on connection failure
  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$url" || true)
  HTTP_CODE="${HTTP_CODE:-000}"
  if [ "$HTTP_CODE" = "$expected_code" ]; then
    echo "  PASS: $name (HTTP $HTTP_CODE)"
  else
    echo "  FAIL: $name (expected $expected_code, got $HTTP_CODE)"
    FAILURES=$((FAILURES + 1))
  fi
}

echo "Running smoke tests against $BASE_URL"
echo "---"

check "Health endpoint"    "$BASE_URL/health"
check "Ready endpoint"     "$BASE_URL/ready"
check "API root"           "$BASE_URL/api/v1"
check "Static assets"      "$BASE_URL/favicon.ico"
check "Auth (no token)"    "$BASE_URL/api/v1/me" 401

echo "---"
if [ $FAILURES -gt 0 ]; then
  echo "FAILED: $FAILURES check(s) did not pass"
  exit 1
fi
echo "All smoke tests passed"

For ongoing monitoring, expose Prometheus-compatible metrics from your application:

// src/metrics.js
import { collectDefaultMetrics, Registry, Histogram, Counter } from 'prom-client';

const register = new Registry();
collectDefaultMetrics({ register });

export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
  registers: [register],
});

export const deploymentCounter = new Counter({
  name: 'deployment_total',
  help: 'Total number of deployments',
  labelNames: ['status', 'environment'],
  registers: [register],
});

export { register };

Wire the metrics endpoint into your Express app:

// In server.js
import { register, httpRequestDuration } from './metrics.js';

app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    end({ method: req.method, route: req.route?.path || req.path, status_code: res.statusCode });
  });
  next();
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

Configure Prometheus to scrape the metrics endpoint, and set up Grafana alerts for the patterns that indicate a bad deployment:

  • Error rate spike: If the status_code label on http_request_duration_seconds shows a jump in 5xx responses within five minutes of a deployment, fire an alert.
  • Latency increase: If the p99 latency doubles compared to the pre-deployment baseline, flag it.
  • Health check failures: If the /health endpoint returns non-200 more than three times in a row, trigger an automatic rollback via webhook.

After each deployment, the deployment_total counter increments, giving you a clear timeline of deployments overlaid on your application metrics. This makes it trivial to correlate any anomaly with the exact deployment that caused it.
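The error-rate pattern can be expressed as a Prometheus alerting rule. A sketch, assuming the http_request_duration_seconds histogram defined earlier (threshold, windows, and label values are illustrative):

```yaml
# prometheus/alerts.yml
groups:
  - name: deployment-alerts
    rules:
      - alert: ErrorRateSpike
        expr: |
          sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))
            /
          sum(rate(http_request_duration_seconds_count[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "5xx rate above 5% — possible bad deployment"
```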

Putting It All Together

Here is the complete project structure:

myapp/
├── .github/
│   └── workflows/
│       ├── deploy.yml          # CI/CD pipeline
│       └── rollback.yml        # Manual rollback trigger
├── nginx/
│   └── conf.d/
│       └── default.conf        # Nginx upstream config
├── src/
│   ├── server.js               # Application entry point
│   ├── health.js               # Health check routes
│   └── metrics.js              # Prometheus metrics
├── deploy.sh                   # Blue-green deployment script
├── rollback.sh                 # Instant rollback script
├── smoke-test.sh               # Post-deploy validation
├── docker-compose.yml          # Blue + green + nginx
├── Dockerfile                  # Multi-stage production build
└── package.json

The deployment flow from commit to production:

  1. Developer pushes to main.
  2. GitHub Actions runs tests and linting.
  3. Docker image is built and pushed to GHCR with the Git SHA as the tag.
  4. The deploy job SSHs into the production server and runs deploy.sh.
  5. deploy.sh identifies the inactive environment, starts it with the new image, waits for health checks, switches Nginx, and verifies.
  6. If anything fails, the script rolls back automatically. The old environment is still running.
  7. Post-deployment smoke tests validate critical paths.
  8. Slack notification fires on failure.

For teams running multiple services, extract the deployment scripts into a shared repository and parameterize the service name, ports, and health check paths. The same blue-green pattern works for any HTTP service behind a reverse proxy.
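Parameterizing can be as simple as having the shared deploy script source a per-service config file. A sketch (variable names are illustrative — match them to whatever your script expects):

```shell
# myapp.env — per-service settings sourced by a shared deploy script
SERVICE_NAME="myapp"
BLUE_PORT=8001
GREEN_PORT=8002
HEALTH_PATH="/health"
PROXY_CONTAINER="myapp-proxy"
```

The shared script would then start with something like `source "./${1}.env"` and use the variables in place of the hardcoded ports and container names shown earlier.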

Key Takeaways

Start with health checks. The entire pipeline depends on your application reporting its readiness accurately. A /health endpoint that checks real dependencies is non-negotiable.

Keep the previous version running. Blue-green gives you a one-second rollback. Do not tear down the old environment immediately after switching — keep it running for at least one full monitoring cycle.

Test the rollback path. A rollback you have never tested is not a rollback. Run rollback.sh in staging regularly to make sure it works.

Use Git SHAs as image tags. The latest tag is convenient but dangerous. SHA-based tags give you a direct link between your running code and the exact commit that produced it.

Verify from outside. The final health check in the GitHub Actions workflow hits the public URL, not the internal one. This catches proxy misconfigurations, DNS issues, and TLS problems that internal checks would miss.

Automate everything. If your rollback procedure requires someone to SSH into a server and run commands manually at 3 AM, it is not a rollback procedure. It is a hope. The deploy script, the rollback script, and the smoke tests should all run without human intervention.

Scaling Beyond a Single Server

This tutorial uses a single server with Docker Compose, which handles a surprising amount of traffic. But when you outgrow it, the same blue-green pattern scales naturally:

  • Multiple servers: Run the same compose setup on multiple machines behind a load balancer. Deploy to one server at a time (canary), verify, then proceed to the rest.
  • Container orchestration: If you move to Kubernetes, the concepts translate directly. Blue-green becomes two Deployments with a Service switching between them, or you use Kubernetes' built-in rolling update strategy with maxSurge and maxUnavailable settings.
  • Service mesh: Tools like Istio or Linkerd give you traffic splitting at the network level, enabling canary deployments where you shift 5% of traffic to the new version, monitor, then gradually increase.
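In Kubernetes terms, for example, the Nginx switch becomes a Service selector change. A sketch (names and labels are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    color: blue   # change to "green" to switch traffic
  ports:
    - port: 80
      targetPort: 3000
```

Switching traffic is then a single patch — kubectl patch service myapp -p '{"spec":{"selector":{"color":"green"}}}' — which mirrors the sed-and-reload step in deploy.sh.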

The fundamentals remain the same at every scale: deploy the new version alongside the old one, verify it works, shift traffic, and keep the old version ready for rollback.

Zero-downtime deployments are not reserved for large teams with Kubernetes clusters. With Docker Compose, Nginx, and a well-structured deployment script, any team can deploy to production with confidence — and without dropping a single request.
