DEV Community

Neeraja Khanapure


Not in any textbook — learned this from a 3am page:

LinkedIn Draft — Workflow (2026-04-28)

Kubernetes rollouts: why 'pods are Ready' is the wrong promotion gate

Readiness is a local signal: the kubelet probes each pod from its own node. Production health is a global one. Most rollout pipelines conflate the two, and that's where incidents come from.

Bad gate:                         Good gate:

Deploy ──▶ Pods Ready? ──▶ Done   Deploy ──▶ Pods Ready?
           (local signal)                    │
                                             ▼
                                    SLO window check
                                    (error rate + p95)
                                             │
                                    Pass ──▶ Promote
                                    Fail ──▶ Auto-rollback

Where it breaks:
▸ 100% of pods Ready while p95 latency spikes: bad cache warmup, a noisy neighbor, or DB connection saturation.
▸ HPA reacts slower than a fast rollout — you ship overload before autoscaling catches up.
▸ Canary stuck green because metrics lack the right labels/slices to isolate the failing segment.
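The last failure mode is easy to reproduce on paper. Here is a minimal sketch, assuming per-slice request and error counts are already available (for example from a labelled metrics query). The slice names, the counts, and the helper names `aggregate_error_rate` and `failing_slices` are all hypothetical and exist only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical input shape: slice label -> (requests, errors) for the canary.
SliceCounts = Dict[str, Tuple[int, int]]


def aggregate_error_rate(counts: SliceCounts) -> float:
    """Error rate across all slices combined (what an unlabelled metric shows)."""
    total_requests = sum(req for req, _ in counts.values())
    total_errors = sum(err for _, err in counts.values())
    return total_errors / total_requests if total_requests else 0.0


def failing_slices(counts: SliceCounts, max_error_rate: float) -> List[str]:
    """Names of slices whose own error rate exceeds the threshold."""
    return sorted(
        name
        for name, (req, err) in counts.items()
        if req > 0 and err / req > max_error_rate
    )


# Illustrative numbers only: one small slice is badly broken.
counts = {
    "region=us": (9000, 9),
    "region=eu": (900, 4),
    "region=ap": (100, 20),
}

# The aggregate stays under a 1% threshold, so the canary looks green...
print(aggregate_error_rate(counts) <= 0.01)
# ...while the per-slice view isolates the failing segment.
print(failing_slices(counts, max_error_rate=0.01))
```

The point of the sketch is the gap between the two views: a small slice with a high error rate barely moves the combined number, so a gate that only sees the aggregate has nothing to fail on.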

The rule I keep coming back to:
→ Promote only when the canary holds your SLO slice (error rate + latency) for a fixed observation window. Otherwise: auto-rollback.
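As a rough illustration of that rule, the sketch below turns a window of canary samples into a promote, rollback, or wait decision. It is not how Argo Rollouts or Flagger implement their analysis. The `Sample` shape, the `gate_decision` name, and the thresholds are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One scrape of canary metrics (hypothetical shape)."""
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float  # 95th percentile latency in milliseconds


def gate_decision(
    window: Sequence[Sample],
    min_samples: int,
    max_error_rate: float,
    max_p95_ms: float,
) -> str:
    """Return 'promote', 'rollback', or 'wait' for a canary."""
    # Any sample that breaches the SLO fails the gate immediately.
    for sample in window:
        if sample.error_rate > max_error_rate or sample.p95_latency_ms > max_p95_ms:
            return "rollback"
    # Clean so far, but the observation window is not full yet.
    if len(window) < min_samples:
        return "wait"
    # The canary held the SLO for the whole window.
    return "promote"
```

A caller would append one sample per scrape interval and act on the result, so `min_samples` times the scrape interval is the fixed observation window. Pod readiness never appears as an input: it only decides when sampling can start.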

How I sanity-check it:
▸ Argo Rollouts or Flagger with Prometheus gates — error rate, latency percentiles, saturation.
▸ Alert on canary-vs-baseline deltas, not absolute thresholds. Catches regressions that pass absolute checks.
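To make the delta idea concrete, here is a small sketch that compares an absolute check with a canary-vs-baseline check on the same numbers. The function names, the 1.2x latency ratio, and the 0.5 percentage point error budget are assumptions chosen for the example, not recommended values.

```python
def passes_absolute(
    canary_p95_ms: float,
    canary_error_rate: float,
    max_p95_ms: float,
    max_error_rate: float,
) -> bool:
    """Absolute gate: the canary alone against fixed thresholds."""
    return canary_p95_ms <= max_p95_ms and canary_error_rate <= max_error_rate


def regression_detected(
    baseline_p95_ms: float,
    canary_p95_ms: float,
    baseline_error_rate: float,
    canary_error_rate: float,
    max_latency_ratio: float = 1.2,   # assumed tolerance: canary may be 20% slower
    max_error_delta: float = 0.005,   # assumed tolerance: +0.5 percentage points
) -> bool:
    """Relative gate: the canary compared with the baseline over the same window."""
    latency_regressed = canary_p95_ms > baseline_p95_ms * max_latency_ratio
    errors_regressed = (canary_error_rate - baseline_error_rate) > max_error_delta
    return latency_regressed or errors_regressed


# Illustrative numbers: baseline p95 is 100 ms, canary p95 is 180 ms.
# The canary is well under a 300 ms absolute limit...
print(passes_absolute(180.0, 0.001, max_p95_ms=300.0, max_error_rate=0.01))
# ...but it is 80% slower than the baseline, which the delta check flags.
print(regression_detected(100.0, 180.0, 0.001, 0.001))
```

Comparing against a baseline measured over the same window also cancels out shared noise such as a traffic spike or a slow dependency, which would otherwise trip an absolute threshold for both versions.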

Operational maturity isn't about tools — it's about designing for the failure you haven't seen yet.

Deep dive: https://neeraja-portfolio-v1.vercel.app/workflows/kubernetes-rollouts-why-pods-are-ready-is-the-wrong-promotion-gate

Strong opinions on this? Good. I want to hear the pushback.

#kubernetes #reliability #devops #sre
