Workflow Deep Dive

#devops #sre #kubernetes #terraform

LinkedIn Draft — Workflow (2026-01-13)

Kubernetes rollouts: promote on SLOs, not on “pods are Ready”

Readiness is a local signal. Production impact is global. Real rollouts need promotion gates that track user-facing health.

What usually bites later:

A rollout can be 100% Ready while P95 latency and error rate spike (bad cache warmup, noisy neighbor, DB pressure).
The HPA reacts more slowly than a fast rollout, so you ship overload before autoscaling catches up.
A canary gets “stuck green” because your metrics are too coarse (no labels/slices), so you miss the blast radius.

My default rule:
Promote only when your canary holds the SLO slice you care about (error rate and latency) for a fixed window; otherwise, roll back automatically.
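
To make the rule concrete, here is a minimal Python sketch of the decision, not how Argo Rollouts or Flagger implement it. The `Sample` shape, the `decide` function, and the threshold values are all hypothetical; in a real setup the samples would come from your metrics backend.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One evaluation interval of canary metrics (hypothetical shape)."""
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float  # 95th-percentile latency in milliseconds


def decide(samples: Sequence[Sample], window: int,
           max_error_rate: float, max_p95_ms: float) -> str:
    """Return "rollback", "wait", or "promote" for the canary."""
    # Any breach since the canary started fails the gate immediately.
    for s in samples:
        if s.error_rate > max_error_rate or s.p95_latency_ms > max_p95_ms:
            return "rollback"
    # No breach so far, but the hold window is not complete yet.
    if len(samples) < window:
        return "wait"
    return "promote"


# Illustrative thresholds only: 1% errors, 300 ms P95, 5 clean intervals.
healthy = [Sample(error_rate=0.001, p95_latency_ms=120.0)] * 5
print(decide(healthy, window=5, max_error_rate=0.01, max_p95_ms=300.0))
```

The point of the three-way result is that "not yet promoted" and "rolled back" are different states: a canary with too little data waits, while a single breach ends the attempt.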

When I sanity-check this, I usually:

Use Argo Rollouts / Flagger with Prometheus metrics as gates (error-rate, latency, saturation).
Alert on canary vs baseline deltas, not absolute thresholds (reduces noise, catches regressions).
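
A rough Python sketch of the delta idea, under my own assumptions: the function name, the ratio of 2.0, and the absolute floor of 0.002 are illustrative, not defaults from any tool. The floor is there so a near-zero baseline does not turn tiny differences into alarms.

```python
def canary_regressed(canary: float, baseline: float,
                     max_ratio: float = 2.0,
                     min_abs_delta: float = 0.002) -> bool:
    """Flag the canary only when it is worse than baseline by both
    a relative factor and a minimum absolute gap (assumed values)."""
    worse_by_ratio = canary > baseline * max_ratio
    worse_by_gap = (canary - baseline) > min_abs_delta
    return worse_by_ratio and worse_by_gap


# Error rates as fractions of requests.
# Both versions degraded together (shared dependency): not a canary regression.
print(canary_regressed(canary=0.021, baseline=0.020))
# Canary is 6x worse than baseline, yet still under a typical 1% absolute alert.
print(canary_regressed(canary=0.006, baseline=0.001))
```

The first case would page on an absolute 1% threshold even though the new version is not the cause; the second would stay silent on that threshold even though the new version clearly regressed.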

Deep dive (stable link): https://neeraja-portfolio-v1.vercel.app/workflows/kubernetes-rollouts-promote-on-slos-not-on-pods-are-ready