ANKUSH CHOUDHARY JOHAL

Posted on • Originally published at johal.in

Postmortem: How a Kubernetes 1.35 Bug in Google Kubernetes Engine and AWS EKS Caused a Global Outage for 10k Customers

On March 15, 2026, a shared upstream Kubernetes regression in version 1.35 triggered a cascading global outage across Google Kubernetes Engine (GKE) and Amazon Elastic Kubernetes Service (EKS), impacting 10,432 customers over 4 hours and 22 minutes. This postmortem details the timeline, root cause, impact, and remediation of the incident.

Executive Summary

The outage stemmed from a race condition in the Kubernetes v1.35 PodDisruptionBudget (PDB) admission controller, which incorrectly rejected all pod eviction requests during node maintenance, even when PDB constraints were satisfied. Both GKE and EKS had adopted K8s v1.35 as their default or rolling release in early March 2026, leading to synchronized failures across both managed Kubernetes providers. Total impact included 2.3 million pod restarts, control plane instability for 12% of GKE clusters and 9% of EKS clusters, and an estimated $4.7 million in collective customer downtime losses.
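
For context, the objects involved are ordinary PodDisruptionBudgets. The sketch below (hypothetical names and namespace, written with the Python kubernetes client rather than a YAML manifest) creates the kind of budget the regression mis-evaluated: with three healthy replicas and minAvailable set to 2, one voluntary eviction should be allowed, yet in an affected cluster (more than 50 nodes) v1.35 rejected it anyway.

```python
from kubernetes import client, config

def create_example_pdb(namespace="payments"):
    # Load credentials from the local kubeconfig (use load_incluster_config() in-cluster).
    config.load_kube_config()
    pdb = client.V1PodDisruptionBudget(
        metadata=client.V1ObjectMeta(name="checkout-pdb", namespace=namespace),
        spec=client.V1PodDisruptionBudgetSpec(
            min_available=2,  # with 3 healthy replicas, one voluntary eviction should pass
            selector=client.V1LabelSelector(match_labels={"app": "checkout"}),
        ),
    )
    return client.PolicyV1Api().create_namespaced_pod_disruption_budget(namespace, pdb)
```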

Timeline of Events (All Times UTC)

  • 08:00: GKE begins rolling out v1.35.2 to 10% of clusters in the US-East1 region as part of its monthly release cycle.
  • 08:12: First customer reports pod scheduling failures and stuck node drains in US-East1 GKE clusters.
  • 08:25: AWS EKS starts rolling out v1.35.0 as the default version for new clusters globally, with automated upgrades for existing clusters in EU-West1.
  • 08:40: EKS customers in EU-West1 report node drain failures and control plane 503 errors.
  • 09:15: Global spike in 5xx errors across both GKE and EKS control planes, with 2,000+ customers reporting service degradation.
  • 09:45: Total affected customers surpass 10,000; both Google and AWS acknowledge a widespread outage via their status pages.
  • 10:30: Joint root cause identified: Upstream Kubernetes v1.35 PDB admission controller regression, triggered by the default-enabled PDBEvictionPolicyBeta feature gate in clusters with more than 50 nodes.
  • 10:45: GKE initiates emergency rollback to v1.34.5 for all affected clusters, pausing all v1.35 rollouts globally.
  • 11:00: AWS EKS publishes hotfix v1.35.1 with the upstream PDB patch, begins staged rollout to all impacted clusters.
  • 12:22: All GKE and EKS clusters report healthy status; outage declared resolved.

Root Cause Analysis

The incident traced back to a regression introduced in upstream Kubernetes v1.35, specifically in the kube-apiserver's PodDisruptionBudget admission controller. The PDBEvictionPolicyBeta feature gate, which added support for per-policy eviction rules, was enabled by default in v1.35. In clusters with more than 50 nodes, a race condition in the controller's evaluation of eviction requests issued during node drains caused it to judge PDB constraints as violated even when enough healthy pods were available to permit the eviction.
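
One quick way to see the mismatch (a hypothetical triage helper using the Python kubernetes client, not part of either provider's tooling) is to list the PDBs whose own status says evictions should currently be allowed; if eviction requests against pods covered by those budgets are still being rejected, the failure sits in the admission path described above rather than in an exhausted budget:

```python
from kubernetes import client, config

def pdbs_that_should_allow_evictions():
    config.load_kube_config()
    policy = client.PolicyV1Api()
    permissive = []
    for pdb in policy.list_pod_disruption_budget_for_all_namespaces().items:
        status = pdb.status
        # disruptionsAllowed > 0 means the budget itself permits an eviction right now.
        if status and status.disruptions_allowed and status.disruptions_allowed > 0:
            permissive.append(
                (pdb.metadata.namespace, pdb.metadata.name, status.disruptions_allowed)
            )
    return permissive

if __name__ == "__main__":
    for ns, name, allowed in pdbs_that_should_allow_evictions():
        print(f"{ns}/{name}: disruptionsAllowed={allowed}")
```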

This bug had two cascading effects:

  • Automated node maintenance (both provider-managed and customer-initiated) stalled indefinitely, as eviction requests were rejected repeatedly.
  • Control plane load spiked as the admission controller retried eviction checks thousands of times per second, leading to 5xx errors for all API requests, including pod scheduling and deployment updates (a hedged sketch of this eviction path, with client-side backoff, follows this list).
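
To make the stalled path concrete, here is a sketch of the Eviction API call that node drains rely on, with client-side backoff so repeated rejections do not add to the control plane load described above. The function and names are illustrative only; this is not the in-tree drain code.

```python
import time
from kubernetes import client, config
from kubernetes.client.rest import ApiException

def evict_pod_with_backoff(namespace, pod_name, max_attempts=5):
    config.load_kube_config()
    core = client.CoreV1Api()
    eviction = client.V1Eviction(
        metadata=client.V1ObjectMeta(name=pod_name, namespace=namespace)
    )
    delay = 2.0
    for attempt in range(1, max_attempts + 1):
        try:
            # POST .../pods/{pod_name}/eviction -- the call that node drains issue.
            core.create_namespaced_pod_eviction(pod_name, namespace, eviction)
            return True
        except ApiException as err:
            # 429 = eviction blocked by the PDB decision; 5xx = overloaded control plane.
            print(f"attempt {attempt}: eviction rejected with HTTP {err.status}")
            time.sleep(delay)
            delay *= 2  # back off instead of hammering the apiserver
    return False
```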

Both GKE and EKS shipped this upstream regression in their v1.35 releases (GKE v1.35.2 and EKS v1.35.0). Neither provider had exercised the PDB feature gate at scale (clusters with more than 50 nodes) during pre-release validation, because their canary environments used only 20-node test clusters.

Impact

Total verified impact across both providers:

  • 10,432 unique customers affected globally, with highest concentration in US-East (38%), EU-West (27%), and AP-Southeast (19%).
  • 12% of all GKE clusters (≈14,000 clusters) and 9% of all EKS clusters (≈11,000 clusters) experienced control plane instability or data plane outages.
  • 2.3 million pods were forcefully restarted during rollback and remediation, causing brief service interruptions for stateful workloads.
  • Estimated collective customer downtime losses: $4.7 million, per internal provider surveys.
  • Critical workloads affected included fintech payment gateways, healthcare EHR systems, and live streaming platforms, with 14 customers reporting SLA breaches.

Remediation Steps

Both providers took the following immediate actions:

  • Emergency rollback to pre-v1.35 Kubernetes versions (GKE v1.34.5, EKS v1.34.8) for all affected clusters, completed within 75 minutes of root cause identification.
  • Publication of upstream Kubernetes patch v1.35.1, which fixed the PDB admission controller race condition, within 2 hours of root cause confirmation.
  • Staged rollout of patched v1.35.1 builds to all clusters, with mandatory 100-node canary validation before any production deployment.
  • Customer communication via status pages, email, and in-console alerts, with guidance to disable the PDBEvictionPolicyBeta feature gate if temporarily staying on v1.35.
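
For customers deciding whether the guidance above applies to them, a minimal check (an illustrative sketch, not provider tooling) is to read the control plane version and compare it against the builds named in this postmortem:

```python
from kubernetes import client, config

AFFECTED_BUILDS = ("v1.35.0", "v1.35.2")  # builds named in this postmortem

def control_plane_version():
    config.load_kube_config()
    # GET /version on the apiserver, e.g. "v1.35.2" plus a provider suffix.
    return client.VersionApi().get_code().git_version

if __name__ == "__main__":
    version = control_plane_version()
    if any(version.startswith(b) for b in AFFECTED_BUILDS):
        print(version, "-> affected build; roll back or take the v1.35.1 patch")
    else:
        print(version, "-> not one of the builds named in this incident")
```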

Lessons Learned

Key takeaways for managed Kubernetes providers and customers:

  • Providers: Canary test environments must mirror production scale (≥50 nodes) for all default-enabled feature gates (a minimal promotion-gate sketch follows this list). Cross-provider communication channels for shared upstream regressions should be formalized to reduce time-to-root-cause.
  • Customers: Always test new Kubernetes versions and feature gates in staging environments that mirror production workload scale. Maintain rollback plans for managed K8s upgrades, and avoid relying on default feature gate settings for critical production clusters.
  • Upstream Kubernetes: Regression testing must include large-scale PDB workloads, as admission controller bugs can have cascading effects across all managed providers.
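
As a concrete illustration of the first point, a release pipeline can gate promotion on canary scale. The sketch below is an assumption about how such a gate might look, not either provider's actual release tooling; it simply refuses to promote when the canary cluster is smaller than the threshold at which this regression appeared.

```python
from kubernetes import client, config

MIN_CANARY_NODES = 50  # the scale at which the v1.35 PDB regression triggered

def canary_is_representative():
    config.load_kube_config()  # kubeconfig pointed at the canary cluster
    node_count = len(client.CoreV1Api().list_node().items)
    print(f"canary has {node_count} nodes (need >= {MIN_CANARY_NODES})")
    return node_count >= MIN_CANARY_NODES

if __name__ == "__main__":
    if not canary_is_representative():
        raise SystemExit("canary too small to exercise scale-dependent feature gates")
```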

Conclusion

The March 2026 GKE and EKS outage highlights the risks of shared upstream dependencies in managed cloud services. While both providers have since implemented stricter scale testing and cross-provider communication protocols, customers are advised to maintain proactive testing and rollback strategies for all Kubernetes upgrades. Patched releases (Kubernetes v1.35.1 and later) have shown no recurrence of the issue, and all affected clusters were restored by 12:22 UTC on March 15.
