ANKUSH CHOUDHARY JOHAL

Posted on • Originally published at johal.in

Postmortem: A Prometheus 3.0 Metrics Scrape Failure Hid a 50% Drop in API Traffic for 3 Hours

This postmortem details a March 12, 2024 incident where a Prometheus 3.0 metrics scrape failure masked a 50% drop in API traffic for 3 hours, delaying detection and remediation of an upstream CDN misconfiguration.

Incident Summary

Duration: 14:15 UTC – 17:45 UTC (3 hours 30 minutes total; the traffic drop went undetected for roughly 3 hours)

Impact: 50% reduction in successful API requests, affecting ~12,000 end users. No data loss occurred, but 22% of requests during the window returned 502 Bad Gateway errors.

Root Cause: A Prometheus 3.0 scrape configuration incompatibility combined with an upstream CDN routing misconfiguration. The scrape failure prevented metrics from reporting the traffic drop, silencing all related alerts.

Incident Timeline (UTC)

  • 14:00: Prometheus 3.0 upgrade is deployed to the production monitoring cluster, following successful staging validation of core metrics.
  • 14:12: CDN configuration is updated to add a new edge node region, accidentally removing 50% of active backend routing rules.
  • 14:15: API traffic drops 50% as half of all requests route to decommissioned backends, returning 502 errors.
  • 14:18: The Prometheus scrape target for API metrics starts failing: Prometheus 3.0 enforces a default 1s scrape timeout (down from 5s in 2.x), and our API metrics endpoint takes ~2s to respond, so every scrape times out.
  • 14:20: No alerts trigger: all API traffic alerts rely on Prometheus metrics, which are no longer reporting.
  • 17:00: First customer report of API errors submitted via support portal.
  • 17:10: On-call engineer acknowledges report, starts investigating API health.
  • 17:15: Engineer discovers the Prometheus scrape failure for the API metrics target and checks the raw metrics endpoint directly, confirming a 22% 502 error rate.
  • 17:25: CDN misconfiguration identified as root cause of traffic drop.
  • 17:30: Prometheus scrape config updated to set explicit 5s timeout, matching pre-3.0 behavior; Prometheus restarted.
  • 17:35: CDN configuration rolled back to last known good state.
  • 17:45: API traffic returns to normal levels; metrics confirm a 99.9% success rate.

Root Cause Analysis

Two independent failures combined to create the incident, and an alerting gap let it go undetected:

  1. Prometheus 3.0 Scrape Timeout Change: Prometheus 3.0 introduced a breaking change that reduced the default scrape timeout from 5 seconds (2.x) to 1 second. Our API metrics endpoint needs ~2 seconds to aggregate and return metrics, so every scrape failed silently. We did not review the 3.0 breaking-changes documentation before deployment, and our staging environment's metrics endpoint responds faster (lower load), so the timeout was never hit in testing. The corrected scrape config is sketched after this list.
  2. CDN Routing Misconfiguration: The CDN deployment script accidentally removed 50% of backend routing rules when adding new edge nodes, as the script did not validate routing coverage post-change. This routed half of all API traffic to decommissioned backends, causing the 50% traffic drop.
  3. Alerting Dependency: All API traffic and error rate alerts were configured to pull from Prometheus metrics. With scrapes failing, no metrics were reported, so no alerts triggered. We had no out-of-band monitoring for API health (e.g., synthetic checks, CDN-level traffic metrics).
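
For reference, the fix amounts to pinning the timeout explicitly in the scrape configuration rather than relying on the version default. A minimal sketch follows; the job name, interval, and target address are illustrative placeholders, not our actual production values.

```yaml
# prometheus.yml excerpt (hypothetical job name and target)
scrape_configs:
  - job_name: 'api-metrics'
    scrape_interval: 15s
    # Explicit override: the API metrics endpoint takes ~2s to respond,
    # so the Prometheus 3.0 default of 1s times out every scrape.
    scrape_timeout: 5s
    static_configs:
      - targets: ['api-backend.internal:9102']
```

Pinning scrape_timeout (and scrape_interval) per job also means that future changes to upstream defaults become a non-event.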

Remediation Steps

  1. Rolled back CDN configuration to the last known good state from 13:45 UTC to restore full routing coverage.
  2. Updated Prometheus scrape config for all API targets to explicitly set scrape_timeout: 5s to override the 3.0 default.
  3. Restarted Prometheus instances to apply the updated config, verified scrape success for all targets.
  4. Added temporary synthetic API checks (pinging the /health endpoint every 30 seconds) to supplement Prometheus metrics during the incident; one possible implementation is sketched after this list.
  5. Published status page update at 17:50 UTC, notified affected customers via email at 18:15 UTC.
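
One way to run a synthetic check like this in the Prometheus ecosystem is to probe the endpoint through the blackbox exporter. The sketch below is illustrative only: the exporter address (blackbox-exporter:9115) and the API hostname are hypothetical, not our actual setup.

```yaml
# Synthetic /health probe via the blackbox exporter (all addresses hypothetical)
scrape_configs:
  - job_name: 'api-health-probe'
    metrics_path: /probe
    params:
      module: [http_2xx]        # expect a 2xx response from the probed URL
    scrape_interval: 30s        # matches the 30-second cadence used during the incident
    static_configs:
      - targets:
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target        # pass the URL to the exporter as ?target=
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115  # scrape the exporter, not the URL directly
```

Alerting on probe_success == 0 for this job catches API-level failures even when the main metrics scrape is silently broken. The longer-term action item below calls for alerting that is fully independent of Prometheus; this sketch only covers the quick supplemental check.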

Lessons Learned

What Went Wrong

  • Prometheus 3.0 upgrade was deployed without reviewing breaking changes or testing scrape configs against production-matching load in staging.
  • No alerting on Prometheus scrape failures: we did not have alerts configured for up == 0 for any scrape targets.
  • CDN deployment pipeline lacked post-deployment validation of routing coverage and traffic levels.
  • Single point of failure for alerting: all API alerts relied solely on Prometheus metrics, with no out-of-band monitoring.

What Went Right

  • A customer report surfaced the issue; we estimate it shortened the incident by roughly an hour compared to it continuing undetected.
  • CDN rollback process was well-documented, taking only 5 minutes to execute once identified.
  • Prometheus config fix was straightforward once the timeout issue was identified, taking 5 minutes to apply.

Action Items

  • Completed: Add up == 0 alerts for all Prometheus scrape targets, notifying on-call of any scrape failure within 1 minute (a sample rule is sketched after this list).
  • Completed: Update internal Prometheus upgrade runbook to include breaking changes review and staging load testing for scrape configs.
  • In Progress: Implement out-of-band API traffic monitoring using CDN-level request metrics and synthetic /health checks, with alerts independent of Prometheus.
  • In Progress: Add post-deployment traffic validation step to CDN pipeline, comparing pre- and post-deployment request volume to catch drops >5%.
  • Pending: Document all Prometheus 3.0 breaking changes in internal engineering wiki, with migration guides for common config patterns.
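
For the completed up == 0 item, a minimal Prometheus alerting rule might look like the following. The rule group name, severity label, and annotation text are assumptions for illustration, not our production configuration.

```yaml
# rules/scrape-health.yml (hypothetical file and group name)
groups:
  - name: scrape-health
    rules:
      - alert: ScrapeTargetDown
        # up is 0 whenever Prometheus fails to scrape a target,
        # including timeouts like the one in this incident.
        expr: up == 0
        for: 1m                  # page after one minute of failed scrapes
        labels:
          severity: page
        annotations:
          summary: "Scrape failing for {{ $labels.job }} on {{ $labels.instance }}"
```

With a rule like this in place on March 12, the on-call would likely have been paged within a couple of minutes of the 14:18 scrape failures instead of learning about the problem from a customer at 17:00.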
