# Deploy and Pray
Our deployment strategy used to be: merge to main, deploy to all pods, watch Slack for complaints. Professional? No. Common? Absolutely.
After a particularly bad deploy took down checkout for 23 minutes, we implemented canary deployments.
## What Canary Deployments Actually Mean
A canary deployment routes a small percentage of traffic to the new version while monitoring for problems:
Traffic flow:

```
Users ──→ Load Balancer ──→ 95% → v1.2.3 (current)
                        └──→  5% → v1.2.4 (canary)
```
If the canary looks healthy after N minutes, gradually increase traffic. If it looks bad, kill it. Zero impact on 95% of users.
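That promote-or-kill loop can be sketched in a few lines. This is a hypothetical illustration, not our actual controller: `set_traffic_weight` and `check_canary_health` are stand-ins for whatever your load balancer and metrics backend expose.

```python
import time

# Stubs for illustration only: in a real setup these would call your
# load balancer API and metrics backend (both names are hypothetical).
def set_traffic_weight(canary, stable):
    print(f"routing {canary}% to canary, {stable}% to stable")

def check_canary_health():
    return True  # replace with real error-rate / latency checks

def ramp_canary(weights=(5, 25, 50, 100), window_seconds=600):
    """Shift traffic to the canary in steps, aborting on bad health."""
    for weight in weights:
        set_traffic_weight(canary=weight, stable=100 - weight)
        time.sleep(window_seconds)  # let metrics accumulate in the window
        if not check_canary_health():
            set_traffic_weight(canary=0, stable=100)  # instant rollback
            return False
    return True
```

The key property: a failed health check rolls traffic back immediately, without waiting for the ramp to finish.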
## Our Canary Pipeline
```yaml
# .github/workflows/canary-deploy.yml
jobs:
  canary_deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Deploy canary (5%)
        run: |
          kubectl set image deployment/api-canary api=api:${{ github.sha }}
          kubectl scale deployment/api-canary --replicas=1

          # Configure traffic split
          kubectl apply -f - <<EOF
          apiVersion: split.smi-spec.io/v1alpha1
          kind: TrafficSplit
          metadata:
            name: api-canary
          spec:
            service: api
            backends:
            - service: api-stable
              weight: 95
            - service: api-canary
              weight: 5
          EOF

      - name: Wait and analyze (10 minutes)
        run: |
          sleep 600

          # Check canary health (queries must be quoted and URL-encoded)
          ERROR_RATE=$(curl -s -G --data-urlencode 'query=rate(http_errors{version="canary"}[5m])' prometheus/api/v1/query | jq -r '.data.result[0].value[1]')
          LATENCY=$(curl -s -G --data-urlencode 'query=histogram_quantile(0.99, rate(http_duration_bucket{version="canary"}[5m]))' prometheus/api/v1/query | jq -r '.data.result[0].value[1]')
          echo "Canary error rate: $ERROR_RATE"
          echo "Canary p99 latency: $LATENCY"
          if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
            echo "CANARY FAILED: Error rate too high"
            exit 1
          fi

      - name: Promote to 50%
        run: |
          kubectl apply -f traffic-split-50.yaml
          sleep 600  # Wait another 10 min

      - name: Full rollout
        run: |
          kubectl set image deployment/api-stable api=api:${{ github.sha }}
          kubectl delete deployment api-canary
          kubectl delete trafficsplit api-canary
```
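If the inline curl-and-jq checks get unwieldy, the same instant query can be done from Python with only the standard library. A minimal sketch, assuming a Prometheus endpoint at `PROM_URL` (the address is an assumption, not from our actual setup):

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus:9090"  # assumed in-cluster address

def parse_instant_value(payload):
    """Pull the first sample's value out of a Prometheus instant-query response."""
    return float(payload["data"]["result"][0]["value"][1])

def prom_scalar(query):
    """Run an instant PromQL query and return the result as a float."""
    qs = urllib.parse.urlencode({"query": query})  # PromQL needs URL-encoding
    with urllib.request.urlopen(f"{PROM_URL}/api/v1/query?{qs}") as resp:
        return parse_instant_value(json.load(resp))

# e.g. prom_scalar('rate(http_errors{version="canary"}[5m])')
```

URL-encoding the query is the part people forget: raw PromQL is full of braces, brackets, and quotes that break a naive GET request.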
## The Canary Checklist
What we check during the canary window:
```python
CANARY_CHECKS = {
    'error_rate': {
        'query': 'rate(http_5xx_total{version="canary"}[5m]) / rate(http_requests_total{version="canary"}[5m])',
        'threshold': 0.01,  # Max 1% errors
        'comparison': 'less_than',
    },
    'latency_p99': {
        'query': 'histogram_quantile(0.99, rate(http_duration_bucket{version="canary"}[5m]))',
        'threshold': 0.5,  # Max 500ms
        'comparison': 'less_than',
    },
    'success_rate': {
        'query': 'rate(http_2xx_total{version="canary"}[5m]) / rate(http_requests_total{version="canary"}[5m])',
        'threshold': 0.99,  # Min 99% success
        'comparison': 'greater_than',
    },
    'memory_usage': {
        'query': 'container_memory_working_set_bytes{version="canary"}',
        'threshold': 512 * 1024 * 1024,  # Max 512MB
        'comparison': 'less_than',
    },
}
```
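A checks table like this needs an evaluator. Here is one possible sketch; `query_fn` is a placeholder for whatever executes a PromQL query and returns a float (such as a wrapper around the Prometheus HTTP API):

```python
def evaluate_checks(checks, query_fn):
    """Run every canary check; return (passed, list_of_failure_messages).

    checks:   dict shaped like CANARY_CHECKS above
    query_fn: callable taking a PromQL string, returning a float
              (hypothetical here - plug in your metrics client)
    """
    failures = []
    for name, spec in checks.items():
        value = query_fn(spec['query'])
        if spec['comparison'] == 'less_than':
            ok = value < spec['threshold']
        else:  # 'greater_than'
            ok = value > spec['threshold']
        if not ok:
            failures.append(f"{name}: {value} violates threshold {spec['threshold']}")
    return (len(failures) == 0, failures)
```

Returning every failure, not just the first, matters: a canary that is both slow and erroring tells you more than "error rate too high" alone.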
## Results After 6 Months
| Metric | Before | After |
|---|---|---|
| Rollback rate | 15% of deploys | 3% of deploys |
| Mean time to detect bad deploy | 25 min | 8 min |
| Customer-facing incidents from deploys | 4/month | 0.5/month |
| Deploy frequency | 1x/day (afraid) | 5x/day (confident) |
The counterintuitive result: we deploy MORE often now because we're less afraid. And because each deploy is smaller, issues are easier to find.
## Start Simple
You don't need Istio or a service mesh for canary deploys. Start with:
- Two deployment objects (stable + canary)
- A load balancer that supports weighted routing
- A script that checks error rates after deploy
- A human who decides whether to promote or roll back
Automate from there.
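As a starting point, the whole human-in-the-loop version fits in a few lines. This is a sketch under assumed names: `fetch_error_rate` stands in for your metrics query, and the only thing automated is the one decision that should never wait on a human.

```python
def fetch_error_rate():
    """Placeholder: replace with a real metrics query (e.g. Prometheus)."""
    return 0.002  # stub value for illustration

def canary_gate(threshold=0.01, ask=input):
    """Check the canary, auto-rollback on breach, else ask a human."""
    rate = fetch_error_rate()
    print(f"canary error rate: {rate:.4f} (threshold {threshold})")
    if rate > threshold:
        return "rollback"  # automatic: a bad canary should never wait on a human
    answer = ask("promote canary? [y/N] ").strip().lower()
    return "promote" if answer == "y" else "rollback"
```

Defaulting the prompt to "no" is deliberate: an absent-minded Enter keeps the stable version serving traffic.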
If you want AI-powered canary analysis that automatically promotes or rolls back, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
Top comments (1)
Nice overview. One thing I'd add: canary routing at the load balancer level (weighted target groups in ALB, or Nginx upstream weights) is often simpler to set up than service mesh canaries for teams that aren't on Istio yet. You can do 95/5 traffic split with a single ALB listener rule change and roll it back in seconds. The biggest gotcha I've seen is teams doing canary deploys without automated rollback triggers — if your error rate threshold trips, the rollback should be instant, not waiting for someone to notice a dashboard.