DEV Community

Young Gao

Zero-Downtime Deployments on Kubernetes: Rolling Updates, Blue-Green, and Canary

Deploying without downtime isn't just about setting strategy: RollingUpdate. It's about health checks that actually verify readiness, connection draining that doesn't drop requests, and rollback triggers that catch problems before users do.

Here's how to set up each deployment strategy correctly on Kubernetes.

The Basics: Why Deployments Fail

Most "zero-downtime" deployments still drop requests because of three mistakes:

  1. Readiness probes that lie — returning 200 before the app can serve traffic
  2. No graceful shutdown — pods killed mid-request
  3. Missing preStop hooks — pod removed from service before in-flight requests complete

Rolling Update (Default)

Rolling updates replace pods incrementally. Kubernetes creates new pods before terminating old ones.

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Create 1 extra pod during update
      maxUnavailable: 0   # Never reduce below desired count
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: myregistry/api-server:v2.1.0
          ports:
            - containerPort: 8080

          # Readiness: "Can this pod serve traffic?"
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3

          # Liveness: "Is this pod stuck/deadlocked?"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
            failureThreshold: 3

          # Startup: "Is the app still starting up?"
          startupProbe:
            httpGet:
              path: /health
              port: 8080
            failureThreshold: 30
            periodSeconds: 2
            # Gives app up to 60s to start before liveness kicks in

          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 10"]
                # Wait for endpoints controller to remove pod from Service
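With these values, the update's capacity envelope is simple arithmetic: maxSurge bounds how far the pod count can rise above replicas, and maxUnavailable bounds how far the ready count can fall below it. A quick sketch of that math (assuming absolute values, as in the manifest above; both fields also accept percentages):

```go
package main

import "fmt"

// rollingUpdateBounds returns the peak pod count and the minimum number of
// ready pods Kubernetes will maintain during a rolling update.
func rollingUpdateBounds(replicas, maxSurge, maxUnavailable int) (peak, minReady int) {
	return replicas + maxSurge, replicas - maxUnavailable
}

func main() {
	// replicas: 4, maxSurge: 1, maxUnavailable: 0, as in the manifest above
	peak, minReady := rollingUpdateBounds(4, 1, 0)
	fmt.Printf("peak pods: %d, minimum ready: %d\n", peak, minReady)
}
```

With maxUnavailable: 0, capacity never dips below the 4 desired pods, at the cost of briefly running a fifth.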

Why preStop: sleep 10?

When Kubernetes deletes a pod, two things happen concurrently:

  1. The Endpoints controller removes the pod from the Service
  2. The kubelet begins container shutdown: the preStop hook (if any), then SIGTERM

If SIGTERM arrives before the Endpoints update propagates to every kube-proxy and load balancer, traffic is still routed to a pod that's shutting down. The preStop sleep holds off SIGTERM long enough for the Endpoints change to propagate.

Readiness vs Liveness vs Startup

// In your Go server (a sketch — connectDB, warmCache, and db are app-specific):
func main() {
    var ready atomic.Bool // atomic: the startup goroutine and handlers access it concurrently

    // Startup: load config, warm caches, connect to DB
    go func() {
        connectDB()
        warmCache()
        ready.Store(true) // Only now accept traffic
    }()

    http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
        // Liveness: am I alive and not deadlocked?
        w.WriteHeader(http.StatusOK)
    })

    http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
        // Readiness: can I serve production traffic right now?
        if !ready.Load() {
            w.WriteHeader(http.StatusServiceUnavailable)
            return
        }
        // Optionally check the DB connection too
        if err := db.Ping(); err != nil {
            w.WriteHeader(http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    })

    http.ListenAndServe(":8080", nil)
}

Never make liveness probes depend on external services. If your database goes down, liveness fails, Kubernetes restarts your pod, the new pod also can't reach the database, it restarts again — crash loop. Use readiness to stop traffic; use liveness only for detecting internal deadlocks.

Blue-Green Deployment

Run two identical environments. Switch traffic from blue (current) to green (new) atomically.

# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-blue
spec:
  replicas: 4
  selector:
    matchLabels:
      app: api-server
      version: blue
  template:
    metadata:
      labels:
        app: api-server
        version: blue
    spec:
      containers:
        - name: api
          image: myregistry/api-server:v2.0.0
---
# green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-green
spec:
  replicas: 4
  selector:
    matchLabels:
      app: api-server
      version: green
  template:
    metadata:
      labels:
        app: api-server
        version: green
    spec:
      containers:
        - name: api
          image: myregistry/api-server:v2.1.0
---
# service.yaml — switch by changing selector
apiVersion: v1
kind: Service
metadata:
  name: api-server
spec:
  selector:
    app: api-server
    version: blue   # Change to "green" to switch
  ports:
    - port: 80
      targetPort: 8080

Switch traffic:

# Deploy green with new version
kubectl apply -f green-deployment.yaml

# Wait for all green pods to be ready
kubectl rollout status deployment/api-server-green

# Switch traffic (atomic — one API call)
kubectl patch service api-server -p '{"spec":{"selector":{"version":"green"}}}'

# Verify, then scale down blue
kubectl scale deployment api-server-blue --replicas=0

Advantage: Instant rollback — just switch the Service selector back to "blue".

Disadvantage: Requires 2x resources during deployment.

Canary Deployment

Route a small percentage of traffic to the new version. If metrics look good, gradually increase.

Simple Canary with Replica Ratios

# Stable: 9 replicas of v2.0.0
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: api-server
      track: stable
  template:
    metadata:
      labels:
        app: api-server
        track: stable
    spec:
      containers:
        - name: api
          image: myregistry/api-server:v2.0.0
---
# Canary: 1 replica of v2.1.0 (gets ~10% of traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: api-server
      track: canary
  template:
    metadata:
      labels:
        app: api-server
        track: canary
    spec:
      containers:
        - name: api
          image: myregistry/api-server:v2.1.0
---
# Service selects both — traffic split by replica count
apiVersion: v1
kind: Service
metadata:
  name: api-server
spec:
  selector:
    app: api-server    # Matches both stable and canary
  ports:
    - port: 80
      targetPort: 8080
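Why one replica out of ten lands near 10%: the Service spreads connections roughly evenly across all ready endpoints, so the canary's expected traffic share is simply its replica fraction. A quick sketch:

```go
package main

import "fmt"

// canaryShare returns the approximate fraction of requests the canary
// receives when a Service load-balances evenly across ready endpoints.
func canaryShare(canaryReplicas, stableReplicas int) float64 {
	return float64(canaryReplicas) / float64(canaryReplicas+stableReplicas)
}

func main() {
	// 1 canary replica alongside 9 stable replicas
	fmt.Printf("%.0f%%\n", canaryShare(1, 9)*100)
}
```

This is approximate: long-lived connections and uneven request costs skew the split, which is why mesh- or ingress-based traffic splitting (Flagger, Argo Rollouts) gives finer control.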

Scale up canary gradually:

# Start: 10% canary
kubectl scale deployment api-server-canary --replicas=1
kubectl scale deployment api-server-stable --replicas=9

# 50% canary
kubectl scale deployment api-server-canary --replicas=5
kubectl scale deployment api-server-stable --replicas=5

# 100% canary (promote)
kubectl scale deployment api-server-canary --replicas=10
kubectl scale deployment api-server-stable --replicas=0

Automated Canary with Metrics

Use a shell script (or Flagger/Argo Rollouts in production):

#!/bin/bash
# canary-promote.sh

CANARY_DEPLOY="api-server-canary"
STABLE_DEPLOY="api-server-stable"
TOTAL_REPLICAS=10
ERROR_THRESHOLD=1  # percent

for pct in 10 25 50 75 100; do
    canary_replicas=$((TOTAL_REPLICAS * pct / 100))
    stable_replicas=$((TOTAL_REPLICAS - canary_replicas))

    echo "Setting canary to ${pct}% (${canary_replicas} replicas)"
    kubectl scale deployment $CANARY_DEPLOY --replicas=$canary_replicas
    kubectl scale deployment $STABLE_DEPLOY --replicas=$stable_replicas

    # Wait and check error rate
    sleep 60

    # Query Prometheus for the canary's 5xx rate (adjust the query and labels for your setup).
    # --data-urlencode avoids broken URLs from the spaces in the PromQL expression.
    error_rate=$(curl -s --get "http://prometheus:9090/api/v1/query" \
        --data-urlencode "query=rate(http_requests_total{deployment=\"${CANARY_DEPLOY}\",code=~\"5..\"}[1m]) / rate(http_requests_total{deployment=\"${CANARY_DEPLOY}\"}[1m]) * 100" \
        | jq -r '.data.result[0].value[1] // "0"')

    if (( $(echo "$error_rate > $ERROR_THRESHOLD" | bc -l) )); then
        echo "Error rate ${error_rate}% exceeds threshold. Rolling back."
        kubectl scale deployment $CANARY_DEPLOY --replicas=0
        kubectl scale deployment $STABLE_DEPLOY --replicas=$TOTAL_REPLICAS
        exit 1
    fi

    echo "Error rate ${error_rate}% — looks good"
done

echo "Canary promoted to 100%"

Graceful Shutdown

Your application must handle SIGTERM correctly:

package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    mux := http.NewServeMux()
    // ... register your handlers on mux ...
    srv := &http.Server{Addr: ":8080", Handler: mux}

    // Start server
    go func() {
        if err := srv.ListenAndServe(); err != http.ErrServerClosed {
            log.Fatal(err)
        }
    }()

    // Wait for SIGTERM
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)
    <-quit

    log.Println("Shutting down — finishing in-flight requests...")

    // Give in-flight requests up to 30s to complete
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    if err := srv.Shutdown(ctx); err != nil {
        log.Printf("Forced shutdown: %v", err)
    }

    log.Println("Server stopped")
}

The shutdown sequence:

  1. The pod is marked Terminating; the Endpoints controller starts removing it from the Service in parallel
  2. The preStop hook runs (sleep 10, giving the Endpoints update time to propagate)
  3. Kubernetes sends SIGTERM; the app stops accepting new connections
  4. The app finishes in-flight requests (up to 30s)
  5. The app exits cleanly
  6. If the app is still running after terminationGracePeriodSeconds (60s, counted from step 1), Kubernetes sends SIGKILL

Pod Disruption Budgets

Prevent too many pods from going down simultaneously (especially during node maintenance):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 3    # Always keep at least 3 pods running
  selector:
    matchLabels:
      app: api-server
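For the 4-replica Deployment above, minAvailable: 3 means the eviction API permits only one voluntary disruption at a time. A simplified sketch of the controller's arithmetic (the real controller also tracks per-pod health):

```go
package main

import "fmt"

// allowedDisruptions approximates the PDB budget: healthy pods minus
// minAvailable, floored at zero.
func allowedDisruptions(healthy, minAvailable int) int {
	if healthy <= minAvailable {
		return 0
	}
	return healthy - minAvailable
}

func main() {
	// 4 healthy replicas, minAvailable: 3, as in the manifests above
	fmt.Println(allowedDisruptions(4, 3))
}
```

If a node drain would push the deployment below 3 ready pods, the eviction is refused until another pod becomes ready elsewhere.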

Quick Reference

Strategy     Downtime  Rollback Speed  Resource Cost  Complexity
Rolling      Zero      Minutes         1.25x          Low
Blue-Green   Zero      Seconds         2x             Medium
Canary       Zero      Seconds         1.1x–2x        High

Start with rolling updates (with proper probes and preStop hooks). Move to canary when you have metrics/monitoring in place. Use blue-green for databases or stateful services where you need instant rollback.

Conclusion

Zero-downtime deployments require getting three things right: readiness probes that genuinely verify readiness, graceful shutdown that drains connections, and preStop hooks that account for Endpoint propagation delay. The deployment strategy (rolling, blue-green, canary) is secondary — if your health checks lie, every strategy will drop requests.


If this was helpful, you can support my work at ko-fi.com/nopkt
