Kubernetes rollouts: promote on SLOs, not on "pods are Ready"

#kubernetes #devops #cloudnative #sre

Readiness is a local signal. Production impact is global.
Pods can be Ready while your SLO window is already burning.

The failure chain
Rollout shifts traffic fast.
New pods saturate before HPA reacts.
The HPA control loop syncs every 15 seconds by default, and metrics pipeline lag comes on top of that.
P95 latency climbs.
Error rate ticks up.
SLI degrades.
Everything looks healthy. The error budget is draining quietly.
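
To put a number on "draining quietly", here is a rough sketch of the burn-rate arithmetic: the observed error ratio divided by the error budget (1 minus the SLO target). The 99.9% target and 1% error ratio are illustrative assumptions, not figures from a real service, and `burn_rate` is a hypothetical helper name.

```shell
# burn_rate ERROR_RATIO SLO_TARGET
# Prints how many times faster than "exactly on budget" the error budget burns.
# Assumes SLO_TARGET is below 1 (a 100% target has no budget to divide by).
burn_rate() {
  awk -v err="$1" -v slo="$2" 'BEGIN { printf "%.1f\n", err / (1 - slo) }'
}

# Illustrative inputs: 1% of requests failing against a 99.9% availability SLO.
burn_rate 0.01 0.999
```

Under those assumptions the budget burns roughly ten times faster than the SLO allows, while every pod still reports Ready.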

Why "pods are Ready" lies to you
Ready means the container started and passed its readiness probe.
It says nothing about P95 latency, error rate, or whether your SLO slice is holding.
A canary stays green when metrics are too coarse to separate it from the stable version.
Without per-version labels you cannot slice the metrics, so the blast radius stays invisible.

Three fixes

Pre-scale before the first canary step
Bump replicas before traffic shifts.
HPA catches up from a safe baseline instead of a saturated one.
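
A minimal sketch of the pre-scale step, assuming an HPA named `my-app` (hypothetical) and assuming you raise the HPA floor rather than editing replicas directly, since the HPA may otherwise scale a manually bumped workload back down. The 30% headroom is an illustrative number, not a recommendation, and both function names are made up for this example.

```shell
# prescale_replicas CURRENT HEADROOM_PERCENT
# Prints CURRENT increased by HEADROOM_PERCENT, rounded up to a whole pod.
prescale_replicas() {
  local current=$1 headroom=$2
  echo $(( (current * (100 + headroom) + 99) / 100 ))
}

# prescale_hpa HPA_NAME HEADROOM_PERCENT
# Reads the HPA's current replica count and raises minReplicas to the
# pre-scaled value. The API rejects a minReplicas above maxReplicas.
prescale_hpa() {
  local hpa=$1 headroom=$2 current target
  current=$(kubectl get hpa "$hpa" -o jsonpath='{.status.currentReplicas}')
  target=$(prescale_replicas "$current" "$headroom")
  kubectl patch hpa "$hpa" --type merge -p "{\"spec\":{\"minReplicas\":${target}}}"
}

# Usage (requires cluster access; "my-app" is a hypothetical HPA name):
#   prescale_hpa my-app 30
```

Remember to restore the original `minReplicas` once the rollout completes, or the pre-scaled floor becomes the permanent baseline.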
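
Match step interval to your HPA scaleUp window
The default scaleUp stabilization window is 0 seconds (scaleDown defaults to 300 seconds), but your HPA may override it.
Check yours with:
kubectl get hpa -o yaml
Add the HPA sync period and pod startup time to that window. Promoting before that total has elapsed is promoting blind.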
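
One way to turn that into a number is sketched below, assuming an HPA named `my-app` and a pod startup time of 45 seconds (both hypothetical). The 15 seconds is the default HPA sync period. Metrics pipeline lag is not included, so treat the result as a lower bound for the canary pause rather than a tuned value.

```shell
# hpa_scaleup_window HPA_NAME
# Prints the configured scaleUp stabilization window in seconds.
# Falls back to 0, the Kubernetes default, when the field is not set.
hpa_scaleup_window() {
  local window
  window=$(kubectl get hpa "$1" \
    -o jsonpath='{.spec.behavior.scaleUp.stabilizationWindowSeconds}')
  echo "${window:-0}"
}

# min_step_seconds SYNC_PERIOD STABILIZATION_WINDOW POD_STARTUP
# Prints a lower bound, in seconds, for the pause between canary steps.
min_step_seconds() {
  echo $(( $1 + $2 + $3 ))
}

# Usage (requires cluster access; 45s startup is an assumed measurement):
#   window=$(hpa_scaleup_window my-app)
#   min_step_seconds 15 "$window" 45
```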

Gate steps on SLI health
Wire an AnalysisRun in Argo Rollouts that checks that error rate and P95 latency are within SLO bounds before promoting.
If the SLI is still recovering, promotion waits.
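
A sketch of what such a gate might look like as an Argo Rollouts AnalysisTemplate backed by Prometheus. The Prometheus address, metric names, label names, and thresholds (1% errors, 300 ms P95) are all assumptions for illustration; substitute your own SLO bounds and make sure the query selects canary traffic only.

```shell
# Write a hypothetical AnalysisTemplate to disk. The quoted 'EOF' keeps the
# {{args.service-name}} placeholders literal so Argo Rollouts can fill them in.
cat > slo-gate.yaml <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: slo-gate
spec:
  args:
  - name: service-name
  metrics:
  - name: error-rate
    interval: 1m
    count: 5
    failureLimit: 1
    # Assumed SLO bound: fewer than 1% of requests return 5xx.
    successCondition: result[0] < 0.01
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: |
          sum(rate(http_requests_total{service="{{args.service-name}}",code=~"5.."}[2m]))
          /
          sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
  - name: p95-latency
    interval: 1m
    count: 5
    failureLimit: 1
    # Assumed SLO bound: P95 latency under 300 ms.
    successCondition: result[0] < 0.3
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: |
          histogram_quantile(0.95, sum by (le) (
            rate(http_request_duration_seconds_bucket{service="{{args.service-name}}"}[2m])))
EOF

# Apply it once you have adjusted the assumptions above:
#   kubectl apply -f slo-gate.yaml
```

With `count: 5` and `interval: 1m`, each metric is sampled five times over roughly four to five minutes, so the gate evaluates a sustained window rather than a single scrape.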

The rule
Promote only when the canary holds the SLO slice that matters for a fixed window.
Any breach inside that window triggers auto-rollback.
Rollout speed and autoscaler reaction time are tuned independently.
That gap is where error budget burns before anyone pages.
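
The rule could be expressed as canary steps in an Argo Rollouts manifest, sketched here under several assumptions: an AnalysisTemplate named `slo-gate` already exists, the app and image names are placeholders, and the weights and pause durations are illustrative. When an inline analysis step fails, Argo Rollouts aborts the rollout and shifts traffic back to the stable version.

```shell
# Write a hypothetical Rollout to disk. spec.replicas is omitted on the
# assumption that an HPA owns the replica count.
cat > rollout.yaml <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: example.com/my-app:v2
  strategy:
    canary:
      steps:
      - setWeight: 10
      # Pause long enough for the autoscaler to react before judging the SLI.
      - pause: {duration: 60s}
      # Blocks until the AnalysisRun finishes; a failed run aborts the rollout.
      - analysis:
          templates:
          - templateName: slo-gate
          args:
          - name: service-name
            value: my-app
      - setWeight: 50
      - pause: {duration: 60s}
      - analysis:
          templates:
          - templateName: slo-gate
          args:
          - name: service-name
            value: my-app
EOF

# Apply it once the placeholders match your workload:
#   kubectl apply -f rollout.yaml
```

Pairing every weight change with a pause and an analysis step ties rollout speed to autoscaler reaction time instead of leaving the two tuned in isolation.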

Kubernetes rollouts: promote on SLOs, not on "pods are Ready" | Neeraja Khanapure
