# Deploy and Pray
Our deployment strategy used to be: merge to main, deploy to all pods, watch Slack for complaints. Professional? No. Common? Absolutely.
After a particularly bad deploy took down checkout for 23 minutes, we implemented canary deployments.
## What Canary Deployments Actually Mean
A canary deployment routes a small percentage of traffic to the new version while monitoring for problems:
Traffic flow:

```
Users ──→ Load Balancer ──→ 95% → v1.2.3 (current)
                        └──→  5% → v1.2.4 (canary)
```
If the canary looks healthy after N minutes, gradually increase traffic. If it looks bad, kill it. Zero impact on 95% of users.
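That promote-or-kill loop can be sketched in a few lines. This is a hypothetical illustration, not our actual controller: `set_traffic_weight` and `check_canary_health` are stand-ins for whatever your load balancer and metrics backend expose.

```python
import time

# Stubs for illustration only: in a real setup these would call your
# load balancer API and metrics backend (both names are hypothetical).
def set_traffic_weight(canary, stable):
    print(f"routing {canary}% to canary, {stable}% to stable")

def check_canary_health():
    return True  # replace with real error-rate / latency checks

def ramp_canary(weights=(5, 25, 50, 100), window_seconds=600):
    """Shift traffic to the canary in steps, aborting on bad health."""
    for weight in weights:
        set_traffic_weight(canary=weight, stable=100 - weight)
        time.sleep(window_seconds)  # let metrics accumulate in the window
        if not check_canary_health():
            set_traffic_weight(canary=0, stable=100)  # instant rollback
            return False
    return True
```

The key property: a failed health check rolls traffic back immediately, without waiting for the ramp to finish.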
## Our Canary Pipeline
```yaml
# .github/workflows/canary-deploy.yml
jobs:
  canary_deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Deploy canary (5%)
        run: |
          kubectl set image deployment/api-canary api=api:${{ github.sha }}
          kubectl scale deployment/api-canary --replicas=1

          # Configure traffic split
          kubectl apply -f - <<EOF
          apiVersion: split.smi-spec.io/v1alpha1
          kind: TrafficSplit
          metadata:
            name: api-canary
          spec:
            service: api
            backends:
            - service: api-stable
              weight: 95
            - service: api-canary
              weight: 5
          EOF

      - name: Wait and analyze (10 minutes)
        run: |
          sleep 600

          # Check canary health (queries must be quoted and URL-encoded)
          ERROR_RATE=$(curl -s -G --data-urlencode 'query=rate(http_errors{version="canary"}[5m])' prometheus/api/v1/query | jq -r '.data.result[0].value[1]')
          LATENCY=$(curl -s -G --data-urlencode 'query=histogram_quantile(0.99, rate(http_duration_bucket{version="canary"}[5m]))' prometheus/api/v1/query | jq -r '.data.result[0].value[1]')
          echo "Canary error rate: $ERROR_RATE"
          echo "Canary p99 latency: $LATENCY"
          if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
            echo "CANARY FAILED: Error rate too high"
            exit 1
          fi

      - name: Promote to 50%
        run: |
          kubectl apply -f traffic-split-50.yaml
          sleep 600  # Wait another 10 min

      - name: Full rollout
        run: |
          kubectl set image deployment/api-stable api=api:${{ github.sha }}
          kubectl delete deployment api-canary
          kubectl delete trafficsplit api-canary
```
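If the inline curl-and-jq checks get unwieldy, the same instant query can be done from Python with only the standard library. A minimal sketch, assuming a Prometheus endpoint at `PROM_URL` (the address is an assumption, not from our actual setup):

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus:9090"  # assumed in-cluster address

def parse_instant_value(payload):
    """Pull the first sample's value out of a Prometheus instant-query response."""
    return float(payload["data"]["result"][0]["value"][1])

def prom_scalar(query):
    """Run an instant PromQL query and return the result as a float."""
    qs = urllib.parse.urlencode({"query": query})  # PromQL needs URL-encoding
    with urllib.request.urlopen(f"{PROM_URL}/api/v1/query?{qs}") as resp:
        return parse_instant_value(json.load(resp))

# e.g. prom_scalar('rate(http_errors{version="canary"}[5m])')
```

URL-encoding the query is the part people forget: raw PromQL is full of braces, brackets, and quotes that break a naive GET request.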
## The Canary Checklist
What we check during the canary window:
```python
CANARY_CHECKS = {
    'error_rate': {
        'query': 'rate(http_5xx_total{version="canary"}[5m]) / rate(http_requests_total{version="canary"}[5m])',
        'threshold': 0.01,  # Max 1% errors
        'comparison': 'less_than',
    },
    'latency_p99': {
        'query': 'histogram_quantile(0.99, rate(http_duration_bucket{version="canary"}[5m]))',
        'threshold': 0.5,  # Max 500ms
        'comparison': 'less_than',
    },
    'success_rate': {
        'query': 'rate(http_2xx_total{version="canary"}[5m]) / rate(http_requests_total{version="canary"}[5m])',
        'threshold': 0.99,  # Min 99% success
        'comparison': 'greater_than',
    },
    'memory_usage': {
        'query': 'container_memory_working_set_bytes{version="canary"}',
        'threshold': 512 * 1024 * 1024,  # Max 512MB
        'comparison': 'less_than',
    },
}
```
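A checks table like this needs an evaluator. Here is one possible sketch; `query_fn` is a placeholder for whatever executes a PromQL query and returns a float (such as a wrapper around the Prometheus HTTP API):

```python
def evaluate_checks(checks, query_fn):
    """Run every canary check; return (passed, list_of_failure_messages).

    checks:   dict shaped like CANARY_CHECKS above
    query_fn: callable taking a PromQL string, returning a float
              (hypothetical here - plug in your metrics client)
    """
    failures = []
    for name, spec in checks.items():
        value = query_fn(spec['query'])
        if spec['comparison'] == 'less_than':
            ok = value < spec['threshold']
        else:  # 'greater_than'
            ok = value > spec['threshold']
        if not ok:
            failures.append(f"{name}: {value} violates threshold {spec['threshold']}")
    return (len(failures) == 0, failures)
```

Returning every failure, not just the first, matters: a canary that is both slow and erroring tells you more than "error rate too high" alone.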
## Results After 6 Months
| Metric | Before | After |
|---|---|---|
| Rollback rate | 15% of deploys | 3% of deploys |
| Mean time to detect bad deploy | 25 min | 8 min |
| Customer-facing incidents from deploys | 4/month | 0.5/month |
| Deploy frequency | 1x/day (afraid) | 5x/day (confident) |
The counterintuitive result: we deploy MORE often now because we're less afraid. And because each deploy is smaller, issues are easier to find.
## Start Simple
You don't need Istio or a service mesh for canary deploys. Start with:
- Two deployment objects (stable + canary)
- A load balancer that supports weighted routing
- A script that checks error rates after deploy
- A human who decides whether to promote or roll back
Automate from there.
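As a starting point, the whole human-in-the-loop version fits in a few lines. This is a sketch under assumed names: `fetch_error_rate` stands in for your metrics query, and the only thing automated is the one decision that should never wait on a human.

```python
def fetch_error_rate():
    """Placeholder: replace with a real metrics query (e.g. Prometheus)."""
    return 0.002  # stub value for illustration

def canary_gate(threshold=0.01, ask=input):
    """Check the canary, auto-rollback on breach, else ask a human."""
    rate = fetch_error_rate()
    print(f"canary error rate: {rate:.4f} (threshold {threshold})")
    if rate > threshold:
        return "rollback"  # automatic: a bad canary should never wait on a human
    answer = ask("promote canary? [y/N] ").strip().lower()
    return "promote" if answer == "y" else "rollback"
```

Defaulting the prompt to "no" is deliberate: an absent-minded Enter keeps the stable version serving traffic.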
If you want AI-powered canary analysis that automatically promotes or rolls back, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
Top comments (1)
Nice overview. One thing I'd add: canary routing at the load balancer level (weighted target groups in ALB, or Nginx upstream weights) is often simpler to set up than service mesh canaries for teams that aren't on Istio yet. You can do 95/5 traffic split with a single ALB listener rule change and roll it back in seconds. The biggest gotcha I've seen is teams doing canary deploys without automated rollback triggers — if your error rate threshold trips, the rollback should be instant, not waiting for someone to notice a dashboard.