ANKUSH CHOUDHARY JOHAL

Posted on • Originally published at johal.in

Production Outage: Prometheus 3.0 Out of Memory Crashed Our K8s Monitoring Stack

Postmortem for the incident on October 12, 2024, that affected all production Kubernetes clusters in the us-east-1 region.

Incident Summary

At 09:17 UTC on October 12, 2024, our primary Prometheus 3.0 instance running in Kubernetes was out-of-memory (OOM) killed by the kubelet, taking down our entire monitoring stack. This resulted in a 47-minute gap in metrics collection, no alerting for downstream services, and blind spots for on-call engineering teams during peak traffic hours. The incident was resolved by rolling back to Prometheus 2.47.1, with resolution declared at 10:04 UTC.

Timeline

  • 2024-10-11 22:00 UTC: Planned upgrade of Prometheus from 2.47.1 to 3.0.0 in production, following successful staging validation.
  • 2024-10-12 09:15 UTC: Morning traffic ramp-up begins; Prometheus memory usage climbs past 3.8Gi (95% of the 4Gi container limit).
  • 09:17 UTC: Prometheus pod OOM-killed by the kubelet. Monitoring dashboards go blank and no new alerts fire.
  • 09:22 UTC: On-call engineer notices blank Grafana dashboards, confirms Prometheus pod is in CrashLoopBackOff.
  • 09:31 UTC: Rollback to Prometheus 2.47.1 initiated.
  • 09:38 UTC: Prometheus 2.47.1 pod running, scraping resumes, dashboards populate.
  • 10:04 UTC: All monitoring alerts re-enabled, incident declared resolved.

Root Cause Analysis

Two compounding factors led to the OOM crash:

  1. Prometheus 3.0 default memory footprint increase: Prometheus 3.0 introduced a redesigned TSDB index cache that improves query latency by 40% but doubles default in-memory index storage compared to 2.x. Our existing 4Gi memory limit, sized for 2.x, was insufficient for 3.0 under production load.
  2. Unoptimized high-cardinality metrics: A new payments microservice deployed on October 10 added 12,400 new time series with untruncated user_id labels, pushing total active series roughly 18% above the load staging had been validated against. This caused the 3.0 index cache to grow beyond the available memory.

We also identified a gap in our upgrade process: we did not run memory stress tests in staging with production-like time series counts, as staging only had 60% of the active series present in production.
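In hindsight, the series growth was visible in Prometheus's own telemetry before the upgrade: prometheus_tsdb_head_series tracks the number of active series in the TSDB head block. The rule below is a rough sketch (not something we had in place at the time) of how a day-over-day jump like the 18% increase above could be caught early; the rule group name, threshold, and for duration are illustrative assumptions.

```yaml
# Hypothetical alerting rule: flag abnormal growth in active series.
# prometheus_tsdb_head_series is Prometheus's own gauge of series currently
# held in the TSDB head block; the 15% threshold is an assumption.
groups:
  - name: cardinality-watch
    rules:
      - alert: ActiveSeriesGrowthHigh
        expr: |
          prometheus_tsdb_head_series
            / (prometheus_tsdb_head_series offset 1d)
          > 1.15
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Active time series grew more than 15% over the last 24 hours"
```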

Impact

  • 47 minutes of lost metrics for all production services, including critical payments and auth systems.
  • No alerting for 32 minutes (alerts rely on Prometheus), delaying detection of a minor API latency spike in the payments service.
  • On-call teams had no visibility into cluster health, node utilization, or pod status during the outage window.

Resolution and Remediation

Immediate rollback to Prometheus 2.47.1 restored monitoring within 21 minutes of the crash. Post-outage, we took the following steps to safely re-enable Prometheus 3.0:

  • Increased the Prometheus container memory limit to 8Gi, with the request set to 6Gi, to account for 3.0's larger memory footprint (this and the changes below are sketched as configuration fragments after this list).
  • Tuned the TSDB index cache size via the --storage.tsdb.index-cache-size flag to 2Gi, down from the 3.0 default of 4Gi.
  • Added relabeling rules to truncate high-cardinality user_id labels to the first 8 characters, reducing total active series by 14%.
  • Deployed a Prometheus pre-aggregate job to calculate per-service request rate metrics, reducing the number of raw series scraped by 22%.
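For reference, the sizing, relabeling, and pre-aggregation changes above translate roughly to the fragments below. This is a minimal sketch rather than our full configuration: the payments job name and user_id label come from this incident, while the counter name http_requests_total, the service label, and the file layout are assumptions.

```yaml
# statefulset.yaml (fragment): new memory sizing for the Prometheus container.
resources:
  requests:
    memory: 6Gi
  limits:
    memory: 8Gi
---
# prometheus.yml (fragment): truncate user_id label values to their first
# 8 characters at scrape time to cap label cardinality.
scrape_configs:
  - job_name: payments
    metric_relabel_configs:
      - source_labels: [user_id]
        regex: '(.{1,8}).*'
        target_label: user_id
        replacement: '$1'
---
# rules.yml (fragment): pre-aggregate the per-service request rate so dashboards
# read one series per service instead of the raw per-pod counters.
groups:
  - name: service-aggregates
    rules:
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))
```

Note that metric_relabel_configs runs after each scrape, before samples are written to the TSDB, so the truncation reduces stored and indexed series rather than the scrape payload itself.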

Preventative Measures

To avoid similar outages in the future, we implemented the following:

  1. Added Prometheus memory usage alerts: warning triggers at 70% of the memory limit, critical at 85% (example rules and the PDB manifest are sketched after this list).
  2. Updated upgrade runbooks to require staging tests with production-identical time series counts and load profiles.
  3. Deployed a secondary, lightweight Prometheus instance (2.47.1) in a separate namespace to serve as a fallback for critical infrastructure alerts.
  4. Set a Pod Disruption Budget (PDB) for Prometheus pods to ensure at least 1 pod is always available during maintenance.
  5. Added a pre-upgrade check to compare default resource usage between current and target Prometheus versions using official release notes.
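The memory alerts (item 1) and the PDB (item 4) look roughly like the sketches below. The alert expressions assume cAdvisor and kube-state-metrics (v2-style metric names) are already being scraped, and that Prometheus runs in a monitoring namespace with a container named prometheus; the thresholds mirror the 70%/85% policy above.

```yaml
# prometheus-memory-rules.yml: working-set memory as a fraction of the container
# memory limit, joining cAdvisor and kube-state-metrics series. The namespace and
# container names are assumptions about our deployment.
groups:
  - name: prometheus-self-monitoring
    rules:
      - alert: PrometheusMemoryWarning
        expr: |
          container_memory_working_set_bytes{namespace="monitoring",container="prometheus"}
            / on (namespace, pod, container)
          kube_pod_container_resource_limits{namespace="monitoring",container="prometheus",resource="memory"}
          > 0.70
        for: 10m
        labels:
          severity: warning
      - alert: PrometheusMemoryCritical
        expr: |
          container_memory_working_set_bytes{namespace="monitoring",container="prometheus"}
            / on (namespace, pod, container)
          kube_pod_container_resource_limits{namespace="monitoring",container="prometheus",resource="memory"}
          > 0.85
        for: 5m
        labels:
          severity: critical
```

And a minimal PodDisruptionBudget, assuming the Prometheus pods carry the app.kubernetes.io/name: prometheus label:

```yaml
# prometheus-pdb.yaml: keep at least one Prometheus pod available during
# voluntary disruptions such as node drains; the label selector is an assumption.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prometheus
  namespace: monitoring
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: prometheus
```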

Lessons Learned

Upgrading critical infrastructure components like Prometheus requires more than functional validation: resource footprint changes between major versions must be explicitly tested, especially for stateful workloads handling production-scale metric volumes. We also learned that high-cardinality metrics can have outsized impacts on memory usage for TSDB-based monitoring systems, and proactive label optimization is critical for stability.
