LinkedIn Draft — Workflow (2026-04-28)
Not in any textbook — learned this from a 3am page:
Kubernetes rollouts: why 'pods are Ready' is the wrong promotion gate
Readiness is a pod-local signal: the kubelet probes each pod on its own. Production health is a global one. Most rollout pipelines conflate the two, and that's where incidents come from.
Bad gate:
Deploy ──▶ Pods Ready? ──▶ Done
           (local signal)

Good gate:
Deploy ──▶ Pods Ready?
               │
               ▼
        SLO window check
        (error rate + p95)
               │
        Pass ──▶ Promote
        Fail ──▶ Auto-rollback
Where it breaks:
▸ 100% Ready pods while p95 latency spikes: bad cache warmup, noisy neighbor, DB connection saturation.
▸ HPA reacts slower than a fast rollout — you ship overload before autoscaling catches up.
▸ Canary stuck green because metrics lack the right labels/slices to isolate the failing segment.
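To make the last point concrete, here is a minimal Python sketch of how an unsliced metric can hide a failing segment. The numbers, the "region" label, and the 1% error budget are all made up for illustration; they are not from a real incident.

```python
from collections import defaultdict

# Hypothetical per-slice request counts for one canary window.
CANARY_WINDOW = [
    {"region": "eu", "requests": 9000, "errors": 9},
    {"region": "ap", "requests": 1000, "errors": 90},
]

SLO_MAX_ERROR_RATE = 0.01  # assumed 1% error budget


def error_rate(rows):
    """Error rate over all rows combined (the unsliced view)."""
    total = sum(r["requests"] for r in rows)
    return sum(r["errors"] for r in rows) / total if total else 0.0


def error_rate_by(rows, label):
    """Error rate per distinct value of `label` (the sliced view)."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[label]].append(r)
    return {value: error_rate(group) for value, group in groups.items()}


overall = error_rate(CANARY_WINDOW)
per_region = error_rate_by(CANARY_WINDOW, "region")

# Slices that breach the budget even though the aggregate does not.
failing = [k for k, v in per_region.items() if v > SLO_MAX_ERROR_RATE]
```

With these made-up numbers the aggregate error rate sits just under the budget, so an unsliced gate stays green, while the "ap" slice is far over it. The gate can only catch that if the metric carries the label in the first place.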
The rule I keep coming back to:
→ Promote only when the canary holds your SLO slice (error rate + latency) for a fixed observation window. Otherwise: auto-rollback.
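The rule above, as a rough Python sketch. This is not how Argo Rollouts or Flagger implement it; `Sample`, `gate`, and the thresholds are hypothetical names and values, and I'm assuming one sample per scrape interval since the canary started taking traffic.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One scrape interval of canary metrics (hypothetical shape)."""
    error_rate: float  # fraction of failed requests, 0.0 to 1.0
    p95_ms: float      # 95th percentile latency in milliseconds


def gate(samples: Sequence[Sample], window: int,
         max_error_rate: float, max_p95_ms: float) -> str:
    """Return 'rollback', 'wait', or 'promote' for the canary so far."""
    # Any breach inside the observation window fails the canary.
    for s in samples:
        if s.error_rate > max_error_rate or s.p95_ms > max_p95_ms:
            return "rollback"
    # No breach yet, but the window is not complete: keep observing.
    if len(samples) < window:
        return "wait"
    # The SLO slice held for the whole window.
    return "promote"
```

Failing on a single bad sample is the strict version. If your metrics are noisy, tolerating N consecutive breaches before rolling back is a reasonable variation.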
How I sanity-check it:
▸ Argo Rollouts or Flagger with Prometheus gates — error rate, latency percentiles, saturation.
▸ Alert on canary-vs-baseline deltas, not absolute thresholds. Catches regressions that pass absolute checks.
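A small Python sketch of the delta check, under assumptions: `regressed`, the 1.2x ratio, and the noise floor are hypothetical choices for illustration, not defaults from any tool.

```python
def regressed(canary: float, baseline: float,
              max_ratio: float = 1.2, min_abs_delta: float = 0.0) -> bool:
    """True when the canary metric is meaningfully worse than baseline.

    Works for any "lower is better" metric (error rate, p95 latency).
    `min_abs_delta` is a noise floor so tiny baselines do not trip the ratio.
    """
    if baseline <= 0:
        # No usable baseline: fall back to the absolute noise floor.
        return canary > min_abs_delta
    return canary > baseline * max_ratio and (canary - baseline) > min_abs_delta
```

Example: with an absolute threshold of p95 < 250 ms, a canary at 180 ms passes. Against a baseline at 100 ms, regressed(180.0, 100.0) flags it, because the canary is 1.8x slower than what is already running.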
Operational maturity isn't about tools — it's about designing for the failure you haven't seen yet.
Strong opinions on this? Good. I want to hear the pushback.