Samson Tanimawo

Canary Deployments: The Pattern That Cut Our Rollback Rate by 80%

Deploy and Pray

Our deployment strategy used to be: merge to main, deploy to all pods, watch Slack for complaints. Professional? No. Common? Absolutely.

After a particularly bad deploy took down checkout for 23 minutes, we implemented canary deployments.

What Canary Deployments Actually Mean

A canary deployment routes a small percentage of traffic to the new version while monitoring for problems:

Traffic flow:
  Users ──→ Load Balancer ──┬─→ 95% → v1.2.3 (current)
                            └─→  5% → v1.2.4 (canary)

If the canary looks healthy after N minutes, gradually increase traffic. If it looks bad, kill it. Zero impact on 95% of users.
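The promote-or-kill loop described above can be sketched in a few lines. This is a simplified model, not our production code: `set_weight` and `error_rate` stand in for whatever your traffic splitter and metrics backend expose, and the stages and threshold are illustrative.

```python
import time


def run_canary(set_weight, error_rate, stages=(5, 25, 50, 100),
               wait_seconds=600, max_error_rate=0.01):
    """Shift traffic to the canary in stages; abort and reset on bad health.

    set_weight(pct)  -- route pct% of traffic to the canary (injected)
    error_rate()     -- current canary error fraction (injected)
    Returns True if the canary reached 100%, False if it was killed.
    """
    for weight in stages:
        set_weight(weight)
        time.sleep(wait_seconds)   # let metrics accumulate at this weight
        if error_rate() > max_error_rate:
            set_weight(0)          # kill switch: all traffic back to stable
            return False
    return True
```

Because the health probe and traffic control are injected, the same loop works whether the split lives in an SMI TrafficSplit, an ALB weighted target group, or an nginx upstream.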

Our Canary Pipeline

# .github/workflows/canary-deploy.yml
canary_deploy:
  steps:
    - name: Deploy canary (5%)
      run: |
        kubectl set image deployment/api-canary api=api:${{ github.sha }}
        kubectl scale deployment/api-canary --replicas=1
        # Configure traffic split
        kubectl apply -f - <<EOF
        apiVersion: split.smi-spec.io/v1alpha1
        kind: TrafficSplit
        metadata:
          name: api-canary
        spec:
          service: api
          backends:
          - service: api-stable
            weight: 95
          - service: api-canary
            weight: 5
        EOF

    - name: Wait and analyze (10 minutes)
      run: |
        sleep 600
        # Check canary health
        # -G + --data-urlencode keeps the PromQL braces/brackets intact;
        # jq -r strips the quotes so bc can compare the number below
        ERROR_RATE=$(curl -sG "prometheus/api/v1/query" \
          --data-urlencode 'query=rate(http_errors{version="canary"}[5m])' \
          | jq -r '.data.result[0].value[1]')
        LATENCY=$(curl -sG "prometheus/api/v1/query" \
          --data-urlencode 'query=histogram_quantile(0.99, rate(http_duration_bucket{version="canary"}[5m]))' \
          | jq -r '.data.result[0].value[1]')

        echo "Canary error rate: $ERROR_RATE"
        echo "Canary p99 latency: $LATENCY"

        if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
          echo "CANARY FAILED: Error rate too high"
          exit 1
        fi

    - name: Promote to 50%
      run: |
        kubectl apply -f traffic-split-50.yaml
        sleep 600  # Wait another 10 min

    - name: Full rollout
      run: |
        kubectl set image deployment/api-stable api=api:${{ github.sha }}
        kubectl delete deployment api-canary
        kubectl delete trafficsplit api-canary

The Canary Checklist

What we check during the canary window:

CANARY_CHECKS = {
    'error_rate': {
        'query': 'rate(http_5xx_total{version="canary"}[5m]) / rate(http_requests_total{version="canary"}[5m])',
        'threshold': 0.01,  # Max 1% errors
        'comparison': 'less_than'
    },
    'latency_p99': {
        'query': 'histogram_quantile(0.99, rate(http_duration_bucket{version="canary"}[5m]))',
        'threshold': 0.5,  # Max 500ms
        'comparison': 'less_than'
    },
    'success_rate': {
        'query': 'rate(http_2xx_total{version="canary"}[5m]) / rate(http_requests_total{version="canary"}[5m])',
        'threshold': 0.99,  # Min 99% success
        'comparison': 'greater_than'
    },
    'memory_usage': {
        'query': 'container_memory_working_set_bytes{version="canary"}',
        'threshold': 512 * 1024 * 1024,  # Max 512MB
        'comparison': 'less_than'
    }
}
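A table like this is only useful with a loop that runs it. Here is a minimal evaluator sketch; the `query_fn` argument is an assumed wrapper around the Prometheus HTTP API (not shown) that takes a PromQL string and returns a float.

```python
def evaluate_checks(checks, query_fn):
    """Run each check's PromQL query and compare it to the threshold.

    checks   -- dict shaped like CANARY_CHECKS above
    query_fn -- callable: PromQL string -> float (e.g. Prometheus wrapper)
    Returns (passed, failures), where failures lists failing check names.
    """
    failures = []
    for name, check in checks.items():
        value = query_fn(check['query'])
        if check['comparison'] == 'less_than':
            ok = value < check['threshold']
        else:  # 'greater_than'
            ok = value > check['threshold']
        if not ok:
            failures.append(name)
    return (not failures, failures)
```

The pipeline's analyze step then becomes: fail the job if `passed` is False, and log `failures` so the on-call engineer knows which signal tripped.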

Results After 6 Months

Metric                                    Before            After
Rollback rate                             15% of deploys    3% of deploys
Mean time to detect bad deploy            25 min            8 min
Customer-facing incidents from deploys    4/month           0.5/month
Deploy frequency                          1x/day (afraid)   5x/day (confident)

The counterintuitive result: we deploy MORE often now because we're less afraid. And because each deploy is smaller, issues are easier to find.

Start Simple

You don't need Istio or any other service mesh for canary deploys. Start with:

  1. Two deployment objects (stable + canary)
  2. A load balancer that supports weighted routing
  3. A script that checks error rates after deploy
  4. A human who decides whether to promote or rollback

Automate from there.
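As a concrete version of step 3, the health check can be a short script against the Prometheus HTTP API using only the standard library. The Prometheus address and metric names in the usage comment are assumptions; swap in your own.

```python
import json
import urllib.parse
import urllib.request


def parse_instant_value(payload: dict) -> float:
    """Extract the scalar result from a Prometheus instant-query response."""
    return float(payload["data"]["result"][0]["value"][1])


def prometheus_value(base_url: str, promql: str) -> float:
    """Run one instant query against the Prometheus HTTP API."""
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url) as resp:
        return parse_instant_value(json.load(resp))


# Usage (assumes an in-cluster Prometheus at this address):
# rate = prometheus_value(
#     "http://prometheus:9090",
#     'sum(rate(http_5xx_total{version="canary"}[5m])) '
#     '/ sum(rate(http_requests_total{version="canary"}[5m]))')
# Promote if rate <= 0.01; otherwise roll back.
```

Exit non-zero when the rate breaches your threshold and any CI system can act as the step-4 human's tripwire.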

If you want AI-powered canary analysis that automatically promotes or rolls back, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
