
Samson Tanimawo

Canary Deployments: The Pattern That Cut Our Rollback Rate by 80%

Deploy and Pray

Our deployment strategy used to be: merge to main, deploy to all pods, watch Slack for complaints. Professional? No. Common? Absolutely.

After a particularly bad deploy took down checkout for 23 minutes, we implemented canary deployments.

What Canary Deployments Actually Mean

A canary deployment routes a small percentage of traffic to the new version while monitoring for problems:

Traffic flow:
  Users ──→ Load Balancer ──┬──→ 95% → v1.2.3 (current)
                            └──→  5% → v1.2.4 (canary)

If the canary looks healthy after N minutes, gradually increase traffic. If it looks bad, kill it. Zero impact on 95% of users.
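The whole pattern boils down to a loop: shift a small slice of traffic, watch the metrics for a while, then either widen the slice or pull the canary. Here's a rough Python sketch of that loop; the set_canary_weight and canary_is_healthy helpers are placeholders for whatever your load balancer and monitoring actually expose, and the real pipeline below does the same thing with kubectl and Prometheus.

import sys
import time

# Placeholder hooks: wire these up to your load balancer and monitoring.
def set_canary_weight(percent: int) -> None:
    """Send `percent` of traffic to the canary (ALB rule, Nginx weight, TrafficSplit, ...)."""
    print(f"routing {percent}% of traffic to canary")

def canary_is_healthy() -> bool:
    """Return True if error rate and latency are within thresholds."""
    return True  # replace with a real metrics check

RAMP_STEPS = [5, 50, 100]   # the same 5% -> 50% -> 100% schedule as the pipeline below
BAKE_SECONDS = 600          # how long to watch each step

for weight in RAMP_STEPS:
    set_canary_weight(weight)
    time.sleep(BAKE_SECONDS)
    if not canary_is_healthy():
        set_canary_weight(0)  # instant abort: all traffic back to stable
        sys.exit("canary failed, traffic returned to stable")

print("canary promoted to 100%")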

Our Canary Pipeline

# .github/workflows/canary-deploy.yml
canary_deploy:
  steps:
    - name: Deploy canary (5%)
      run: |
        kubectl set image deployment/api-canary api=api:${{ github.sha }}
        kubectl scale deployment/api-canary --replicas=1
        # Configure traffic split
        kubectl apply -f - <<EOF
        apiVersion: split.smi-spec.io/v1alpha1
        kind: TrafficSplit
        metadata:
          name: api-canary
        spec:
          service: api
          backends:
          - service: api-stable
            weight: 95
          - service: api-canary
            weight: 5
        EOF

    - name: Wait and analyze (10 minutes)
      run: |
        sleep 600
        # Check canary health (URL-encode the PromQL and use jq -r so bc gets a bare number)
        ERROR_RATE=$(curl -sG prometheus/api/v1/query \
          --data-urlencode 'query=rate(http_errors{version="canary"}[5m])' \
          | jq -r '.data.result[0].value[1]')
        LATENCY=$(curl -sG prometheus/api/v1/query \
          --data-urlencode 'query=histogram_quantile(0.99, rate(http_duration_bucket{version="canary"}[5m]))' \
          | jq -r '.data.result[0].value[1]')

        echo "Canary error rate: $ERROR_RATE"
        echo "Canary p99 latency: $LATENCY"

        if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
          echo "CANARY FAILED: Error rate too high"
          exit 1
        fi
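        # Same gate for p99 latency, using the 500ms threshold from the checklist below
        if (( $(echo "$LATENCY > 0.5" | bc -l) )); then
          echo "CANARY FAILED: p99 latency too high"
          exit 1
        fi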

    - name: Promote to 50%
      run: |
        kubectl apply -f traffic-split-50.yaml
        sleep 600  # Wait another 10 min

    - name: Full rollout
      run: |
        kubectl set image deployment/api-stable api=api:${{ github.sha }}
        kubectl delete deployment api-canary
        kubectl delete trafficsplit api-canary

The Canary Checklist

What we check during the canary window:

CANARY_CHECKS = {
    'error_rate': {
        'query': 'rate(http_5xx_total{version="canary"}[5m]) / rate(http_requests_total{version="canary"}[5m])',
        'threshold': 0.01,  # Max 1% errors
        'comparison': 'less_than'
    },
    'latency_p99': {
        'query': 'histogram_quantile(0.99, rate(http_duration_bucket{version="canary"}[5m]))',
        'threshold': 0.5,  # Max 500ms
        'comparison': 'less_than'
    },
    'success_rate': {
        'query': 'rate(http_2xx_total{version="canary"}[5m]) / rate(http_requests_total{version="canary"}[5m])',
        'threshold': 0.99,  # Min 99% success
        'comparison': 'greater_than'
    },
    'memory_usage': {
        'query': 'container_memory_working_set_bytes{version="canary"}',
        'threshold': 512 * 1024 * 1024,  # Max 512MB
        'comparison': 'less_than'
    }
}
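For a sense of how these checks run, here's a rough sketch that evaluates each entry against the Prometheus HTTP API (/api/v1/query). The PROMETHEUS_URL and the requests dependency are assumptions for illustration, not our actual tooling.

import requests

PROMETHEUS_URL = "http://prometheus:9090"  # assumed in-cluster address

def query_prometheus(promql: str) -> float:
    """Run an instant query and return the first sample's value."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": promql},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        raise RuntimeError(f"no data for query: {promql}")
    return float(result[0]["value"][1])  # value is [timestamp, "value-as-string"]

def run_canary_checks(checks: dict) -> bool:
    """Return True only if every check passes its threshold."""
    all_passed = True
    for name, check in checks.items():
        value = query_prometheus(check["query"])
        if check["comparison"] == "less_than":
            passed = value < check["threshold"]
        else:  # "greater_than"
            passed = value > check["threshold"]
        print(f"{name}: {value:g} ({'PASS' if passed else 'FAIL'})")
        all_passed = all_passed and passed
    return all_passed

# Usage: promote only when everything is green
# if not run_canary_checks(CANARY_CHECKS):
#     roll_back()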

Results After 6 Months

| Metric | Before | After |
| --- | --- | --- |
| Rollback rate | 15% of deploys | 3% of deploys |
| Mean time to detect bad deploy | 25 min | 8 min |
| Customer-facing incidents from deploys | 4/month | 0.5/month |
| Deploy frequency | 1x/day (afraid) | 5x/day (confident) |

The counterintuitive result: we deploy MORE often now because we're less afraid. And because each deploy is smaller, issues are easier to find.

Start Simple

You don't need Istio or a service mesh for canary deploys. Start with:

  1. Two deployment objects (stable + canary)
  2. A load balancer that supports weighted routing
  3. A script that checks error rates after deploy (a minimal sketch follows below)
  4. A human who decides whether to promote or roll back

Automate from there.
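For step 3, a script as small as the one below is enough to get going. The Prometheus address is a placeholder, the query and threshold are lifted from the checklist above, and the human in step 4 reads the output and runs the actual promote or rollback commands.

import sys
import requests

# Placeholders: point these at your own Prometheus and error-rate query.
PROMETHEUS_URL = "http://prometheus:9090"
ERROR_RATE_QUERY = 'rate(http_5xx_total{version="canary"}[5m]) / rate(http_requests_total{version="canary"}[5m])'
MAX_ERROR_RATE = 0.01  # same 1% ceiling as the checklist above

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": ERROR_RATE_QUERY}, timeout=10)
resp.raise_for_status()
samples = resp.json()["data"]["result"]
error_rate = float(samples[0]["value"][1]) if samples else 0.0

print(f"canary error rate over the last 5m: {error_rate:.2%}")
if error_rate > MAX_ERROR_RATE:
    # A human (or, later, your pipeline) rolls back: scale the canary to 0 or reset the weights.
    sys.exit("RECOMMENDATION: roll back the canary")
print("RECOMMENDATION: promote the canary")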

If you want AI-powered canary analysis that automatically promotes or rolls back, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com

Top comments (1)

Henry A

Nice overview. One thing I'd add: canary routing at the load balancer level (weighted target groups in ALB, or Nginx upstream weights) is often simpler to set up than service mesh canaries for teams that aren't on Istio yet. You can do 95/5 traffic split with a single ALB listener rule change and roll it back in seconds. The biggest gotcha I've seen is teams doing canary deploys without automated rollback triggers — if your error rate threshold trips, the rollback should be instant, not waiting for someone to notice a dashboard.