# Deploy and Pray
Our deployment strategy used to be: merge to main, deploy to all pods, watch Slack for complaints. Professional? No. Common? Absolutely.
After a particularly bad deploy took down checkout for 23 minutes, we implemented canary deployments.
## What Canary Deployments Actually Mean
A canary deployment routes a small percentage of traffic to the new version while monitoring for problems:
Traffic flow:

```
Users ──→ Load Balancer ──→ 95% ──→ v1.2.3 (current)
                        └──→  5% ──→ v1.2.4 (canary)
```
If the canary looks healthy after N minutes, gradually increase its traffic share. If it looks bad, kill it. Either way, 95% of users are never exposed to the new version.
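To make the split concrete, here is a tiny simulation of weighted routing (the service names are just labels for illustration; a real load balancer does this at the connection or request level):

```python
import random

def route(weights):
    """Pick a backend according to its traffic weight."""
    backends, w = zip(*weights.items())
    return random.choices(backends, weights=w, k=1)[0]

# 95/5 split between stable and canary
split = {"api-stable": 95, "api-canary": 5}
sample = [route(split) for _ in range(100_000)]
canary_share = sample.count("api-canary") / len(sample)
print(f"canary share: {canary_share:.3f}")  # lands near 0.05
```

With 100,000 requests, roughly 5,000 hit the canary: enough traffic to surface problems, small enough that a bad deploy stays a minor incident.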
## Our Canary Pipeline
```yaml
# .github/workflows/canary-deploy.yml
canary_deploy:
  steps:
    - name: Deploy canary (5%)
      run: |
        kubectl set image deployment/api-canary api=api:${{ github.sha }}
        kubectl scale deployment/api-canary --replicas=1
        # Configure traffic split
        kubectl apply -f - <<EOF
        apiVersion: split.smi-spec.io/v1alpha1
        kind: TrafficSplit
        metadata:
          name: api-canary
        spec:
          service: api
          backends:
          - service: api-stable
            weight: 95
          - service: api-canary
            weight: 5
        EOF
    - name: Wait and analyze (10 minutes)
      run: |
        sleep 600
        # Check canary health (queries must be URL-encoded, hence curl -G)
        ERROR_RATE=$(curl -sG prometheus/api/v1/query \
          --data-urlencode 'query=rate(http_errors{version="canary"}[5m])' \
          | jq -r '.data.result[0].value[1]')
        LATENCY=$(curl -sG prometheus/api/v1/query \
          --data-urlencode 'query=histogram_quantile(0.99, rate(http_duration_bucket{version="canary"}[5m]))' \
          | jq -r '.data.result[0].value[1]')
        echo "Canary error rate: $ERROR_RATE"
        echo "Canary p99 latency: $LATENCY"
        if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
          echo "CANARY FAILED: Error rate too high"
          exit 1
        fi
    - name: Promote to 50%
      run: |
        kubectl apply -f traffic-split-50.yaml
        sleep 600  # Wait another 10 min
    - name: Full rollout
      run: |
        kubectl set image deployment/api-stable api=api:${{ github.sha }}
        kubectl delete deployment api-canary
        kubectl delete trafficsplit api-canary
```
## The Canary Checklist
What we check during the canary window:
```python
CANARY_CHECKS = {
    'error_rate': {
        'query': 'rate(http_5xx_total{version="canary"}[5m]) / rate(http_requests_total{version="canary"}[5m])',
        'threshold': 0.01,  # Max 1% errors
        'comparison': 'less_than'
    },
    'latency_p99': {
        'query': 'histogram_quantile(0.99, rate(http_duration_bucket{version="canary"}[5m]))',
        'threshold': 0.5,  # Max 500ms
        'comparison': 'less_than'
    },
    'success_rate': {
        'query': 'rate(http_2xx_total{version="canary"}[5m]) / rate(http_requests_total{version="canary"}[5m])',
        'threshold': 0.99,  # Min 99% success
        'comparison': 'greater_than'
    },
    'memory_usage': {
        'query': 'container_memory_working_set_bytes{version="canary"}',
        'threshold': 512 * 1024 * 1024,  # Max 512MB
        'comparison': 'less_than'
    }
}
```
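Evaluating a checklist like that is a short loop. This is a sketch, not our production code: `query_fn` stands in for whatever Prometheus client you use, and the check dict shapes match the structure above.

```python
def evaluate_checks(checks, query_fn):
    """Run each check's PromQL query and compare the result to its threshold.
    Returns a list of (name, value, passed) tuples."""
    results = []
    for name, check in checks.items():
        value = float(query_fn(check['query']))  # query_fn: stand-in for a Prometheus API call
        if check['comparison'] == 'less_than':
            passed = value < check['threshold']
        else:  # 'greater_than'
            passed = value > check['threshold']
        results.append((name, value, passed))
    return results

def canary_healthy(checks, query_fn):
    """The canary passes only if every check passes."""
    return all(passed for _, _, passed in evaluate_checks(checks, query_fn))

# Demo with a stubbed query function (real code would hit the Prometheus HTTP API)
checks = {'error_rate': {'query': 'err', 'threshold': 0.01, 'comparison': 'less_than'}}
print(canary_healthy(checks, lambda q: 0.002))  # True
```

Requiring all checks to pass is deliberate: a canary that is fast but erroring, or correct but leaking memory, should fail either way.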
## Results After 6 Months
| Metric | Before | After |
|---|---|---|
| Rollback rate | 15% of deploys | 3% of deploys |
| Mean time to detect bad deploy | 25 min | 8 min |
| Customer-facing incidents from deploys | 4/month | 0.5/month |
| Deploy frequency | 1x/day (afraid) | 5x/day (confident) |
The counterintuitive result: we deploy MORE often now because we're less afraid. And because each deploy is smaller, issues are easier to find.
## Start Simple
You don't need Istio or a service mesh for canary deploys. Start with:
- Two deployment objects (stable + canary)
- A load balancer that supports weighted routing
- A script that checks error rates after deploy
- A human who decides whether to promote or roll back
Automate from there.
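That third ingredient, the error-rate check, can start as a few dozen lines. A minimal sketch (the response-parsing follows the Prometheus instant-query format; the metric query and threshold are examples, not prescriptions):

```python
import json
import urllib.parse
import urllib.request

PROMQL_ERRORS = 'rate(http_5xx_total{version="canary"}[5m]) / rate(http_requests_total{version="canary"}[5m])'

def parse_error_rate(payload):
    """Pull the scalar out of a Prometheus instant-query response (0.0 if no data)."""
    result = payload["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def fetch_error_rate(prom_url, query=PROMQL_ERRORS):
    """Query the Prometheus HTTP API for the canary's current error rate."""
    url = prom_url + "/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url) as resp:
        return parse_error_rate(json.load(resp))

def verdict(rate, threshold=0.01):
    """'promote' if the canary error rate is at or under the threshold, else 'rollback'."""
    return "promote" if rate <= threshold else "rollback"

# Example against a canned API response (no live Prometheus needed here):
sample = {"data": {"result": [{"value": [1700000000, "0.003"]}]}}
print(verdict(parse_error_rate(sample)))  # promote
```

Wire the output to a Slack message and let a human push the button. Once the script has agreed with the human for a few weeks, let it push the button itself.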
If you want AI-powered canary analysis that automatically promotes or rolls back, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com