Ankush Choudhary Johal

Originally published at johal.in

Incident Report: How a Kubernetes 1.36 Node OOM and KEDA 2.15 Scaling Flaw Caused Our API to Crash


Executive Summary

On May 15, 2024, our production REST API experienced a total outage lasting 47 minutes, triggered by a cascade of failures: an out-of-memory (OOM) event on Kubernetes 1.36 worker nodes and a latent scaling flaw in KEDA 2.15. The incident caused 12,432 failed API requests and 100% traffic unavailability for its duration, affecting 8 enterprise customers. No data was lost, and the incident consumed 0.03% of our monthly 99.9% SLA error budget.

Incident Timeline (All Times UTC)

  1. 14:22: Monitoring alerts fired for elevated API 5xx error rates, with an initial spike of 12% failed requests.
  2. 14:24: On-call engineering team confirmed total API unavailability and initiated incident response protocol.
  3. 14:26: Initial investigation identified kubelet OOM kill events on 3 of 5 worker nodes running Kubernetes 1.36.
  4. 14:29: Team attempted to scale the API deployment via KEDA 2.15 ScaledObject, but no new pods were created.
  5. 14:32: Root cause identified: Kubernetes 1.36 node memory pressure eviction combined with KEDA 2.15's failure to reconcile scaled objects when >30% of cluster nodes are in NotReady state.
  6. 14:35: Temporary mitigation applied: Manual scale of API deployment to 12 replicas, bypassing KEDA scaling.
  7. 14:42: API availability restored to 100%, all traffic serving normally.
  8. 14:50: Patched build of KEDA 2.15 deployed to address the scaling flaw, KEDA-managed scaling re-enabled.
  9. 15:09: Incident marked as fully resolved, post-mortem process initiated.

Root Cause Analysis

Two independent, compounding flaws caused the outage:

1. Kubernetes 1.36 Node OOM Event

We had recently upgraded all worker nodes to Kubernetes 1.36, which changed how kubelet calculates memory usage against node-level eviction thresholds. Prior to 1.36, kubelet excluded page cache from the used-memory calculation; 1.36 includes page cache in that metric, which caused premature OOM kills of API pods during a 2x traffic surge triggered by a customer batch job. Three of the five worker nodes were marked NotReady after OOMing, removing 60% of total cluster compute capacity.
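
For context, node-level eviction is driven by the kubelet's eviction signals and thresholds, so a change in how "used" memory is computed shifts when those thresholds fire. A minimal KubeletConfiguration sketch of the relevant settings (values are illustrative, not our production configuration):

```yaml
# Illustrative KubeletConfiguration excerpt -- thresholds are examples only.
# How "memory.available" is computed (including whether page cache counts as
# used memory) determines how aggressively these thresholds fire.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"    # hard-evict pods once available memory drops below this
evictionSoft:
  memory.available: "1Gi"      # soft threshold, enforced after the grace period below
evictionSoftGracePeriod:
  memory.available: "90s"
evictionMaxPodGracePeriod: 30  # seconds evicted pods get to shut down cleanly
```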

2. KEDA 2.15 Scaling Flaw

Our API uses KEDA 2.15 to autoscale based on Redis queue depth for background job processing. KEDA 2.15's ScaledObject controller has a known flaw (fixed in KEDA 2.15.1) in which it pauses reconciliation of all scaling targets when more than 30% of cluster nodes are in a NotReady state. With 3 of 5 nodes (60%) NotReady, KEDA failed to trigger scaling even as Redis queue depth hit 10x normal levels, leaving the remaining 2 nodes overwhelmed with API traffic and causing total service failure.
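
For illustration, our ScaledObject is shaped roughly like the sketch below; the resource names, Redis address, and queue-length target are placeholders rather than our actual configuration.

```yaml
# Illustrative ScaledObject -- names, address, and thresholds are placeholders
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-worker-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: api-worker               # Deployment that drains the Redis queue
  minReplicaCount: 6               # matches the pre-incident baseline
  maxReplicaCount: 30
  cooldownPeriod: 120
  triggers:
    - type: redis
      metadata:
        address: redis.production.svc.cluster.local:6379
        listName: background-jobs  # queue whose depth drives scaling
        listLength: "100"          # target items per replica
```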

Impact Assessment

  • Total outage duration: 47 minutes
  • Failed API requests: 12,432
  • Customers impacted: 8 enterprise accounts
  • Data loss: 0
  • SLA impact: 0.03% of monthly 99.9% SLA error budget consumed

Resolution Steps

Immediate Mitigation

Engineers manually scaled the API deployment from the pre-incident 6 replicas to 12 directly via kubectl, bypassing KEDA to restore capacity while the affected nodes recovered.

Short-Term Fixes

We temporarily downgraded worker nodes to Kubernetes 1.35 to revert the page cache memory accounting change, and deployed a custom build of KEDA 2.15 with a patch for the NotReady-node reconciliation flaw (the official KEDA 2.15.1 release containing this fix was published two days after the incident).

Long-Term Fixes

  • Upgraded all worker nodes to Kubernetes 1.36.1, which fixes the kubelet memory threshold page cache bug.
  • Upgraded KEDA to 2.15.1 across all clusters.
  • Increased node memory reserves from 15% to 20% to prevent future OOM events during traffic surges (see the configuration sketch after this list).
  • Added KEDA health checks to our monitoring stack, with alerts for failed reconciliation events.
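
Regarding the memory-reserve change above: the kubelet takes reserves as absolute quantities rather than percentages, so the 20% target is translated per node size. A sketch for a hypothetical 32 GiB worker node (values illustrative):

```yaml
# Illustrative KubeletConfiguration for a 32 GiB node -- roughly 20% held back
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  memory: "2Gi"              # OS daemons (sshd, journald, ...)
kubeReserved:
  memory: "4Gi"              # kubelet and container runtime
evictionHard:
  memory.available: "500Mi"  # kept as a final backstop
enforceNodeAllocatable:
  - pods                     # enforce the reserves via the pods cgroup
```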

Lessons Learned

  • Kubernetes minor version upgrades must be tested in staging with simulated traffic surges to catch changes to resource accounting logic.
  • KEDA scaling logic should not block reconciliation based on cluster node health; alerts prompting manual fallback scaling should be configured for scaling failures.
  • Incident response runbooks must include steps to bypass KEDA and manually scale deployments during controller failures.
  • Alerting thresholds for node NotReady states should trigger at 20% cluster capacity loss rather than 30%, to allow faster intervention (a sketch of such an alert rule follows below).
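
A sketch of what such an alert could look like, assuming kube-state-metrics and the Prometheus Operator are available (rule and label names are illustrative):

```yaml
# Illustrative PrometheusRule -- fire when more than 20% of nodes are NotReady
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-notready-early-warning
  namespace: monitoring
spec:
  groups:
    - name: node-health
      rules:
        - alert: ClusterNodesNotReady
          expr: |
            count(kube_node_status_condition{condition="Ready", status="true"} == 0)
              /
            count(kube_node_status_condition{condition="Ready", status="true"})
              > 0.20
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "More than 20% of cluster nodes have been NotReady for 2 minutes"
```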

Conclusion

This incident was the result of two independent flaws compounding each other: a Kubernetes 1.36 memory accounting change that caused premature node OOMs, and a KEDA 2.15 scaling flaw that prevented recovery. All identified fixes have been implemented, and our incident response processes have been updated to prevent similar outages in the future.
