Zero-Downtime Deployments on Kubernetes: Rolling Updates, Blue-Green, and Canary
Deploying without downtime isn't just about setting strategy: RollingUpdate. It's about health checks that actually verify readiness, connection draining that doesn't drop requests, and rollback triggers that catch problems before users do.
Here's how to set up each deployment strategy correctly on Kubernetes.
The Basics: Why Deployments Fail
Most "zero-downtime" deployments still drop requests because of three mistakes:
- Readiness probes that lie — returning 200 before the app can serve traffic
- No graceful shutdown — pods killed mid-request
- Missing preStop hooks — pod removed from service before in-flight requests complete
Rolling Update (Default)
Rolling updates replace pods incrementally. Kubernetes creates new pods before terminating old ones.
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Create 1 extra pod during update
      maxUnavailable: 0  # Never reduce below desired count
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: myregistry/api-server:v2.1.0
          ports:
            - containerPort: 8080
          # Readiness: "Can this pod serve traffic?"
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
          # Liveness: "Is this pod stuck/deadlocked?"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
            failureThreshold: 3
          # Startup: "Is the app still starting up?"
          # Gives the app up to 60s (30 × 2s) to start before liveness kicks in
          startupProbe:
            httpGet:
              path: /health
              port: 8080
            failureThreshold: 30
            periodSeconds: 2
          lifecycle:
            preStop:
              exec:
                # Wait for the Endpoints controller to remove the pod from the Service
                command: ["sh", "-c", "sleep 10"]
Why preStop: sleep 10?
When Kubernetes terminates a pod, two things happen concurrently:
- The Endpoints controller removes the pod from the Service
- The container receives SIGTERM
If SIGTERM arrives before the Endpoints update propagates, the load balancer still sends traffic to a pod that's shutting down. The preStop sleep gives time for the Endpoints change to propagate.
Readiness vs Liveness vs Startup
// In your Go server:
package main

import (
	"database/sql"
	"net/http"
	"sync/atomic"
)

var db *sql.DB // set up in connectDB

func main() {
	// atomic.Bool avoids a data race between the startup goroutine and handlers
	var ready atomic.Bool

	// Startup: load config, warm caches, connect to DB
	go func() {
		connectDB()
		warmCache()
		ready.Store(true) // Only now accept traffic
	}()

	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		// Liveness: am I alive and not deadlocked?
		w.WriteHeader(http.StatusOK)
	})

	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		// Readiness: can I serve production traffic right now?
		if !ready.Load() {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		// Optionally check DB connection
		if err := db.Ping(); err != nil {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	http.ListenAndServe(":8080", nil)
}
Never make liveness probes depend on external services. If your database goes down, liveness fails, Kubernetes restarts your pod, the new pod also can't reach the database, it restarts again — crash loop. Use readiness to stop traffic; use liveness only for detecting internal deadlocks.
Blue-Green Deployment
Run two identical environments. Switch traffic from blue (current) to green (new) atomically.
# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-blue
spec:
  replicas: 4
  selector:
    matchLabels:
      app: api-server
      version: blue
  template:
    metadata:
      labels:
        app: api-server
        version: blue
    spec:
      containers:
        - name: api
          image: myregistry/api-server:v2.0.0
---
# green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-green
spec:
  replicas: 4
  selector:
    matchLabels:
      app: api-server
      version: green
  template:
    metadata:
      labels:
        app: api-server
        version: green
    spec:
      containers:
        - name: api
          image: myregistry/api-server:v2.1.0
---
# service.yaml — switch by changing selector
apiVersion: v1
kind: Service
metadata:
  name: api-server
spec:
  selector:
    app: api-server
    version: blue  # Change to "green" to switch
  ports:
    - port: 80
      targetPort: 8080
Switch traffic:
# Deploy green with new version
kubectl apply -f green-deployment.yaml
# Wait for all green pods to be ready
kubectl rollout status deployment/api-server-green
# Switch traffic (atomic — one API call)
kubectl patch service api-server -p '{"spec":{"selector":{"version":"green"}}}'
# Verify, then scale down blue
kubectl scale deployment api-server-blue --replicas=0
Advantage: Instant rollback — just switch the Service selector back to "blue".
Disadvantage: Requires 2x resources during deployment.
Canary Deployment
Route a small percentage of traffic to the new version. If metrics look good, gradually increase.
Simple Canary with Replica Ratios
# Stable: 9 replicas of v2.0.0
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: api-server
      track: stable
  template:
    metadata:
      labels:
        app: api-server
        track: stable
    spec:
      containers:
        - name: api
          image: myregistry/api-server:v2.0.0
---
# Canary: 1 replica of v2.1.0 (gets ~10% of traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: api-server
      track: canary
  template:
    metadata:
      labels:
        app: api-server
        track: canary
    spec:
      containers:
        - name: api
          image: myregistry/api-server:v2.1.0
---
# Service selects both — traffic split by replica count
apiVersion: v1
kind: Service
metadata:
  name: api-server
spec:
  selector:
    app: api-server  # Matches both stable and canary
  ports:
    - port: 80
      targetPort: 8080
Scale up canary gradually:
# Start: 10% canary
kubectl scale deployment api-server-canary --replicas=1
kubectl scale deployment api-server-stable --replicas=9
# 50% canary
kubectl scale deployment api-server-canary --replicas=5
kubectl scale deployment api-server-stable --replicas=5
# 100% canary (promote)
kubectl scale deployment api-server-canary --replicas=10
kubectl scale deployment api-server-stable --replicas=0
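The split works because kube-proxy balances roughly evenly across a Service's ready endpoints, so the traffic share is just the replica ratio. The helper below mirrors the arithmetic behind these scale steps (the function name is illustrative):

```go
package main

import "fmt"

// splitReplicas returns the canary and stable replica counts for a
// target canary percentage, keeping the total constant.
func splitReplicas(total, canaryPct int) (canary, stable int) {
	canary = total * canaryPct / 100 // integer division rounds down
	return canary, total - canary
}

func main() {
	for _, pct := range []int{10, 50, 100} {
		c, s := splitReplicas(10, pct)
		fmt.Printf("%d%% canary: %d canary / %d stable\n", pct, c, s)
	}
}
```

Note the granularity limit: with 10 total replicas you cannot do a 5% canary; for finer splits you need more replicas or an ingress/service mesh that splits traffic by weight.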
Automated Canary with Metrics
Use a shell script (or Flagger/Argo Rollouts in production):
#!/bin/bash
# canary-promote.sh
CANARY_DEPLOY="api-server-canary"
STABLE_DEPLOY="api-server-stable"
TOTAL_REPLICAS=10
ERROR_THRESHOLD=1  # percent

for pct in 10 25 50 75 100; do
  canary_replicas=$((TOTAL_REPLICAS * pct / 100))
  stable_replicas=$((TOTAL_REPLICAS - canary_replicas))
  echo "Setting canary to ${pct}% (${canary_replicas} replicas)"
  kubectl scale deployment "$CANARY_DEPLOY" --replicas="$canary_replicas"
  kubectl scale deployment "$STABLE_DEPLOY" --replicas="$stable_replicas"

  # Wait and check error rate
  sleep 60

  # Query Prometheus for error rate (adjust the query and label names for your setup);
  # --data-urlencode handles the braces and quotes in the PromQL expression
  error_rate=$(curl -sG "http://prometheus:9090/api/v1/query" \
    --data-urlencode "query=rate(http_requests_total{deployment=\"${CANARY_DEPLOY}\",code=~\"5..\"}[1m])/rate(http_requests_total{deployment=\"${CANARY_DEPLOY}\"}[1m])*100" \
    | jq -r '.data.result[0].value[1] // "0"')

  if (( $(echo "$error_rate > $ERROR_THRESHOLD" | bc -l) )); then
    echo "Error rate ${error_rate}% exceeds threshold. Rolling back."
    kubectl scale deployment "$CANARY_DEPLOY" --replicas=0
    kubectl scale deployment "$STABLE_DEPLOY" --replicas="$TOTAL_REPLICAS"
    exit 1
  fi
  echo "Error rate ${error_rate}% — looks good"
done
echo "Canary promoted to 100%"
echo "Canary promoted to 100%"
Graceful Shutdown
Your application must handle SIGTERM correctly:
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	mux := http.NewServeMux()
	srv := &http.Server{Addr: ":8080", Handler: mux}

	// Start server
	go func() {
		if err := srv.ListenAndServe(); err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	// Wait for SIGTERM
	quit := make(chan os.Signal, 1)
	signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)
	<-quit

	log.Println("Shutting down — finishing in-flight requests...")

	// Give in-flight requests up to 30s to complete
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("Forced shutdown: %v", err)
	}
	log.Println("Server stopped")
}
The shutdown sequence:
- Pod is marked Terminating; the Endpoints controller starts removing it from the Service
- preStop hook runs (sleep 10 — lets the Endpoints update propagate)
- Container receives SIGTERM; app stops accepting new connections
- App finishes in-flight requests (up to 30s)
- App exits cleanly
- If the container is still running, Kubernetes waits up to terminationGracePeriodSeconds (60s) from the start of termination, then sends SIGKILL
Pod Disruption Budgets
Prevent too many pods from going down simultaneously (especially during node maintenance):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 3  # Always keep at least 3 pods running
  selector:
    matchLabels:
      app: api-server
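Roughly, the eviction API allows a voluntary disruption only while the number of healthy pods stays above minAvailable. A simplified sketch of that arithmetic (the real controller also tracks pod health and supports maxUnavailable; the function name is illustrative):

```go
package main

import "fmt"

// allowedDisruptions approximates how many pods can be evicted
// voluntarily right now under a minAvailable budget.
func allowedDisruptions(healthy, minAvailable int) int {
	if healthy <= minAvailable {
		return 0
	}
	return healthy - minAvailable
}

func main() {
	// With 4 healthy replicas and minAvailable: 3, a node drain may
	// evict only one pod at a time.
	fmt.Println(allowedDisruptions(4, 3)) // 1
	fmt.Println(allowedDisruptions(3, 3)) // 0
}
```

This is why a PDB pairs well with `maxUnavailable: 0` rolling updates: node drains and deployments both respect the floor, so capacity never drops below what you have budgeted.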
Quick Reference
| Strategy | Downtime | Rollback Speed | Resource Cost | Complexity |
|---|---|---|---|---|
| Rolling | Zero | Minutes | 1.25x | Low |
| Blue-Green | Zero | Seconds | 2x | Medium |
| Canary | Zero | Seconds | 1.1x–2x | High |
Start with rolling updates (with proper probes and preStop hooks). Move to canary when you have metrics/monitoring in place. Use blue-green for databases or stateful services where you need instant rollback.
Conclusion
Zero-downtime deployments require getting three things right: readiness probes that genuinely verify readiness, graceful shutdown that drains connections, and preStop hooks that account for Endpoint propagation delay. The deployment strategy (rolling, blue-green, canary) is secondary — if your health checks lie, every strategy will drop requests.
If this was helpful, you can support my work at ko-fi.com/nopkt ☕