
Ankush Choudhary Johal

Originally published at johal.in

Retrospective: 3 Years of Kubernetes 1.32: Best Practices and Worst Outages We've Seen


Kubernetes 1.32 landed in December 2024 with long-awaited improvements to job scheduling and security controls, along with continued maturation of native sidecar containers. Our team migrated all production workloads to 1.32 within weeks of its release, and three years later, we’re sharing the hard-won lessons from running this version at scale across 12 clusters and 4,500 nodes.

Best Practices We Swear By

These practices reduced our incident count by 72% over three years, and we now mandate them for all new cluster deployments:

1. Strict Version Pinning and Staged Rollouts

We learned early on that pinning to an exact patch version (e.g., 1.32.9 rather than a floating 1.32.x) avoids unexpected regressions. All cluster upgrades use a 3-stage rollout: 1) single non-production cluster, 2) 10% of production nodes, 3) full production rollout, with 24-hour wait periods between stages. We also disable automatic node upgrades and use kubeadm for controlled, auditable upgrades.
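
For illustration, here's a minimal sketch (not our production tooling) of the kind of drift check we run before and after each rollout stage, using the official Python kubernetes client; the pinned version string is a placeholder.

```python
# drift_check.py - flag nodes whose kubelet version drifts from the pinned patch release.
# Assumes kubeconfig access to the target cluster; PINNED_VERSION is a hypothetical value.
from kubernetes import client, config

PINNED_VERSION = "v1.32.9"  # placeholder pinned patch release

def find_drifted_nodes():
    config.load_kube_config()  # use config.load_incluster_config() when run inside a pod
    v1 = client.CoreV1Api()
    drifted = []
    for node in v1.list_node().items:
        kubelet = node.status.node_info.kubelet_version
        if kubelet != PINNED_VERSION:
            drifted.append((node.metadata.name, kubelet))
    return drifted

if __name__ == "__main__":
    for name, version in find_drifted_nodes():
        print(f"DRIFT: {name} runs {version}, expected {PINNED_VERSION}")
```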

2. Granular Resource Management

Every pod in our clusters has explicit CPU and memory requests and limits, with no exceptions. We use the Vertical Pod Autoscaler (VPA) in recommendation mode to tune resource allocations, and the Horizontal Pod Autoscaler (HPA) with custom metrics for workload scaling. We also enforce limit ranges and resource quotas per namespace to prevent noisy neighbor issues.
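
As a sketch of how the "no exceptions" rule can be audited, the snippet below uses the Python kubernetes client to list containers missing explicit requests or limits; the namespace name is an assumption for the example.

```python
# limits_audit.py - list containers that lack explicit CPU/memory requests or limits.
from kubernetes import client, config

def audit_missing_resources(namespace: str):
    config.load_kube_config()
    v1 = client.CoreV1Api()
    offenders = []
    for pod in v1.list_namespaced_pod(namespace).items:
        for container in pod.spec.containers:
            requests = container.resources.requests or {}
            limits = container.resources.limits or {}
            missing = [k for k in ("cpu", "memory") if k not in requests or k not in limits]
            if missing:
                offenders.append((pod.metadata.name, container.name, missing))
    return offenders

if __name__ == "__main__":
    # "production" is a hypothetical namespace used for illustration.
    for pod, container, missing in audit_missing_resources("production"):
        print(f"{pod}/{container} is missing: {', '.join(missing)}")
```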

3. Observability as a First-Class Citizen

We enable Kubernetes audit logging for all API server calls, ship logs to a centralized ELK stack, and run Prometheus and Grafana for metrics. Critical alerts include node not-ready states, pod crash loops, etcd latency > 100ms, and API server error rates > 1%. We also run regular chaos engineering tests using Chaos Mesh to validate our monitoring coverage.
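
A minimal example of the kind of check behind our etcd latency alert, assuming a Prometheus server reachable at the address shown; the URL and the exact metric you alert on will differ per setup.

```python
# etcd_latency_probe.py - query Prometheus for p99 etcd backend commit latency
# and compare it to our 100 ms threshold. PROM_URL is a hypothetical in-cluster address.
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090"
QUERY = (
    "histogram_quantile(0.99, "
    "rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))"
)
THRESHOLD_SECONDS = 0.1  # 100 ms

def p99_commit_latency() -> float:
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    # Take the worst member if several etcd instances report.
    return max(float(r["value"][1]) for r in results) if results else 0.0

if __name__ == "__main__":
    latency = p99_commit_latency()
    status = "ALERT" if latency > THRESHOLD_SECONDS else "ok"
    print(f"{status}: p99 etcd backend commit latency = {latency * 1000:.1f} ms")
```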

4. Security by Default

All clusters enforce RBAC with least-privilege service accounts, use Pod Security Standards in restricted mode, and require image vulnerability scans for all container images. We rotate etcd encryption keys every 90 days, use external secret management with HashiCorp Vault, and migrate workloads off any API marked deprecated in 1.32 as soon as the deprecation is announced, rather than waiting for removal.
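
As an illustration of auditing the Pod Security Standards requirement, this sketch flags namespaces that don't carry the restricted enforce label; the system-namespace allow-list is an assumption for the example.

```python
# pss_audit.py - flag namespaces that do not enforce the "restricted" Pod Security Standard.
from kubernetes import client, config

ENFORCE_LABEL = "pod-security.kubernetes.io/enforce"
SYSTEM_NAMESPACES = {"kube-system", "kube-public", "kube-node-lease"}  # assumed exclusions

def namespaces_missing_restricted():
    config.load_kube_config()
    v1 = client.CoreV1Api()
    missing = []
    for ns in v1.list_namespace().items:
        if ns.metadata.name in SYSTEM_NAMESPACES:
            continue
        labels = ns.metadata.labels or {}
        if labels.get(ENFORCE_LABEL) != "restricted":
            missing.append(ns.metadata.name)
    return missing

if __name__ == "__main__":
    for name in namespaces_missing_restricted():
        print(f"namespace {name} does not enforce the restricted profile")
```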

5. Regular Backup and Disaster Recovery Testing

We take hourly incremental etcd backups stored in a cross-region S3 bucket, and use Velero to back up all cluster resources (deployments, configmaps, secrets) daily. We test full cluster restores from backup once a month, and maintain a runbook for recovering a cluster from total failure in under 2 hours.
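
A simplified sketch of an etcd snapshot-and-upload job, assuming standard kubeadm certificate paths and a hypothetical S3 bucket name; our real pipeline adds retention, snapshot verification, and cross-region replication on top of this.

```python
# etcd_backup.py - take an etcd snapshot with etcdctl and copy it to S3.
# Bucket name, endpoint, and certificate paths are placeholders for illustration only.
import os
import subprocess
from datetime import datetime, timezone

import boto3

BUCKET = "example-etcd-backups"           # hypothetical cross-region bucket
ETCD_ENDPOINT = "https://127.0.0.1:2379"  # typical local etcd endpoint

def snapshot_and_upload():
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = f"/tmp/etcd-snapshot-{stamp}.db"
    subprocess.run(
        [
            "etcdctl", "snapshot", "save", path,
            "--endpoints", ETCD_ENDPOINT,
            "--cacert", "/etc/kubernetes/pki/etcd/ca.crt",
            "--cert", "/etc/kubernetes/pki/etcd/server.crt",
            "--key", "/etc/kubernetes/pki/etcd/server.key",
        ],
        check=True,
        env={**os.environ, "ETCDCTL_API": "3"},
    )
    boto3.client("s3").upload_file(path, BUCKET, f"snapshots/{stamp}.db")

if __name__ == "__main__":
    snapshot_and_upload()
```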

Worst Outages We’ve Seen (and How We Fixed Them)

Even with best practices, outages happen. These three incidents taught us our most valuable lessons:

Outage #1: The Etcd Quorum Loss of March 2025

We deployed all 3 etcd nodes in a single availability zone (AZ) to reduce latency, and when that AZ had a network outage, we lost etcd quorum. The API server became unresponsive, no new pods could be scheduled, and existing workloads continued running but couldn’t be modified. Total downtime: 4 hours 12 minutes.

Fix: We now deploy etcd nodes across 3 separate AZs, use etcd learner nodes when adding new members, and run regular quorum tests. We also set up a read-only API server replica in a separate region for emergency debugging.
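
Here's a small sketch of the zone-spread check we now run against every cluster, using the standard topology and control-plane node labels; treat it as illustrative rather than our exact tooling.

```python
# zone_spread_check.py - verify that control-plane nodes span at least three availability zones.
from kubernetes import client, config

ZONE_LABEL = "topology.kubernetes.io/zone"
CONTROL_PLANE_LABEL = "node-role.kubernetes.io/control-plane"

def control_plane_zones():
    config.load_kube_config()
    v1 = client.CoreV1Api()
    nodes = v1.list_node(label_selector=CONTROL_PLANE_LABEL).items
    return {node.metadata.labels.get(ZONE_LABEL, "unknown") for node in nodes}

if __name__ == "__main__":
    zones = control_plane_zones()
    if len(zones) < 3:
        print(f"ALERT: control-plane nodes only span {len(zones)} zone(s): {sorted(zones)}")
    else:
        print(f"ok: control-plane spread across {sorted(zones)}")
```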

Outage #2: The Pod CIDR Exhaustion of August 2026

We used a /16 pod CIDR for our largest cluster, assuming it would last years. But with rapid workload growth, we exhausted all available pod IPs, and new pods stayed in Pending state indefinitely. Customer-facing services couldn’t scale during a traffic spike, leading to a 2-hour partial outage.

Fix: We now monitor pod IP utilization via Prometheus, use /12 CIDRs for large clusters, and enable dual-stack IPv4/IPv6 networking to expand available address space. We also set up alerts when IP utilization exceeds 70%.
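
To illustrate the utilization alert, here is a rough sketch that estimates how much of the cluster pod CIDR has already been carved into per-node blocks; the CIDR value and the 70% threshold are assumptions to adapt.

```python
# pod_cidr_utilization.py - estimate how much of the cluster pod CIDR is allocated to nodes.
# With per-node /24 blocks, a /16 cluster CIDR runs out after 256 nodes.
import ipaddress

from kubernetes import client, config

CLUSTER_POD_CIDR = "10.32.0.0/16"  # hypothetical cluster pod CIDR
ALERT_THRESHOLD = 0.70             # alert above 70% utilization

def node_cidr_utilization() -> float:
    config.load_kube_config()
    v1 = client.CoreV1Api()
    total = ipaddress.ip_network(CLUSTER_POD_CIDR).num_addresses
    allocated = 0
    for node in v1.list_node().items:
        if node.spec.pod_cidr:
            allocated += ipaddress.ip_network(node.spec.pod_cidr).num_addresses
    return allocated / total

if __name__ == "__main__":
    utilization = node_cidr_utilization()
    status = "ALERT" if utilization > ALERT_THRESHOLD else "ok"
    print(f"{status}: {utilization:.0%} of {CLUSTER_POD_CIDR} allocated to node pod CIDRs")
```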

Outage #3: The Mutating Admission Webhook Loop of January 2027

A buggy custom mutating webhook added an annotation to pods every time they were created, triggering a new admission review, which added another annotation, creating an infinite loop. The API server was overwhelmed with requests, and the cluster became completely unresponsive. Downtime: 1 hour 47 minutes.

Fix: We now test all webhooks in a staging environment with failure injection, set webhook failure policies to "Ignore" instead of "Fail" for non-critical webhooks, and enforce rate limits on all admission controllers. We also added webhook health checks to our monitoring stack.
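
As an example of enforcing the failure-policy rule, this sketch lists mutating webhooks that still fail closed; the allow-list of webhooks that are permitted to do so is hypothetical.

```python
# webhook_policy_audit.py - flag mutating webhooks that still use failurePolicy "Fail".
from kubernetes import client, config

CRITICAL_WEBHOOKS = {"pod-identity-webhook"}  # hypothetical allow-list of fail-closed webhooks

def risky_webhooks():
    config.load_kube_config()
    admission = client.AdmissionregistrationV1Api()
    risky = []
    for cfg in admission.list_mutating_webhook_configuration().items:
        for webhook in cfg.webhooks or []:
            if webhook.failure_policy == "Fail" and webhook.name not in CRITICAL_WEBHOOKS:
                risky.append((cfg.metadata.name, webhook.name))
    return risky

if __name__ == "__main__":
    for cfg_name, hook_name in risky_webhooks():
        print(f"{cfg_name}/{hook_name} fails closed; consider failurePolicy: Ignore")
```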

Key Takeaways for K8s Operators

Three years of running Kubernetes 1.32 taught us that the version itself is only as stable as your operational practices. No amount of new features replaces rigorous testing, proactive monitoring, and regular failure injection. Document every incident postmortem, share lessons across teams, and never stop iterating on your cluster management processes.

Kubernetes 1.32 proved to be a reliable, feature-rich release for long-term production use, but it’s not a set-it-and-forget-it solution. We’re now preparing to evaluate Kubernetes 1.36 for our next major upgrade, armed with three years of hard-won lessons from 1.32.
