Postmortem: Prometheus 3.0 Metrics Scrape Failure Hid 50% API Traffic Drop for 3 Hours
This postmortem details a March 12, 2024 incident where a Prometheus 3.0 metrics scrape failure masked a 50% drop in API traffic for 3 hours, delaying detection and remediation of an upstream CDN misconfiguration.
Incident Summary
Duration: 14:15 UTC – 17:45 UTC (3 hours 30 minutes total; metrics were blind to the drop from 14:18 to roughly 17:15, masking it for about 3 hours)
Impact: 50% reduction in successful API requests, affecting ~12,000 end users. No data loss occurred, but 22% of requests during the window returned 502 Bad Gateway errors.
Root Cause: Combined Prometheus 3.0 scrape config incompatibility and upstream CDN routing misconfiguration. The scrape failure prevented metrics from reporting the traffic drop, silencing all related alerts.
Incident Timeline (UTC)
- 14:00: Prometheus 3.0 upgrade deployed to the production monitoring cluster, following successful staging validation of core metrics.
- 14:12: CDN configuration updated to add a new edge-node region, accidentally decommissioning 50% of active backend routing rules.
- 14:15: API traffic drops 50% as 50% of requests route to decommissioned backends, returning 502 errors.
- 14:18: Prometheus scrape target for API metrics fails: Prometheus 3.0 enforces a default 1s scrape timeout (down from 5s in 2.x), and our API metrics endpoint takes 2s to respond, causing repeated scrape failures.
- 14:20: No alerts trigger: all API traffic alerts rely on Prometheus metrics, which are no longer reporting.
- 17:00: First customer report of API errors submitted via support portal.
- 17:10: On-call engineer acknowledges report, starts investigating API health.
- 17:15: Engineer discovers Prometheus scrape failure for API metrics target, checks raw metrics endpoint: confirms 502 error rate is 22%.
- 17:25: CDN misconfiguration identified as root cause of traffic drop.
- 17:30: Prometheus scrape config updated to set explicit 5s timeout, matching pre-3.0 behavior; Prometheus restarted.
- 17:35: CDN configuration rolled back to last known good state.
- 17:45: API traffic returns to normal levels, metrics confirm 99.9% success rate.
Root Cause Analysis
Two independent issues combined to create the incident:
- Prometheus 3.0 Scrape Timeout Change: Prometheus 3.0 introduced a breaking change reducing the default scrape timeout from 5 seconds (2.x) to 1 second. Our API metrics endpoint requires ~2 seconds to aggregate and return metrics, so all scrapes failed silently. We did not review the 3.0 breaking changes doc prior to deployment, and our staging environment had a faster metrics endpoint (due to lower load) that did not trigger the timeout, so the issue was not caught in testing.
- CDN Routing Misconfiguration: The CDN deployment script accidentally removed 50% of backend routing rules when adding new edge nodes, as the script did not validate routing coverage post-change. This routed half of all API traffic to decommissioned backends, causing the 50% traffic drop.
- Alerting Dependency: All API traffic and error rate alerts were configured to pull from Prometheus metrics. With scrapes failing, no metrics were reported, so no alerts triggered. We had no out-of-band monitoring for API health (e.g., synthetic checks, CDN-level traffic metrics).
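To illustrate the alerting gap: a threshold alert on a traffic metric returns an empty result, not a low value, once scrapes stop and the series goes stale, so it never fires. A sketch of such a rule (rule and metric names are hypothetical, not our actual config):

```yaml
groups:
  - name: api-traffic
    rules:
      - alert: APITrafficDrop
        # When scrapes fail, http_requests_total stops receiving samples and
        # goes stale; this expression then returns no series at all (not zero),
        # so the alert condition is never met and nothing fires.
        expr: sum(rate(http_requests_total{job="api"}[5m])) < 100
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "API request rate below expected baseline"
```

Pairing threshold rules like this with a companion `up == 0` or `absent()` alert is what catches the no-data case.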
Remediation Steps
- Rolled back CDN configuration to the last known good state from 13:45 UTC to restore full routing coverage.
- Updated Prometheus scrape config for all API targets to explicitly set `scrape_timeout: 5s`, overriding the 3.0 default.
- Restarted Prometheus instances to apply the updated config and verified scrape success for all targets.
- Added temporary synthetic API checks (polling the `/health` endpoint every 30 seconds) to supplement Prometheus metrics during the incident.
- Published status page update at 17:50 UTC, notified affected customers via email at 18:15 UTC.
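The scrape-timeout fix above amounts to a one-line override per job. A minimal sketch of the relevant `prometheus.yml` section (job name and target are illustrative, not our real topology):

```yaml
scrape_configs:
  - job_name: api-metrics          # illustrative job name
    scrape_interval: 15s
    scrape_timeout: 5s             # explicit override; our endpoint needs ~2s to respond
    static_configs:
      - targets:
          - api.internal:9102      # hypothetical target address
```

Note that `scrape_timeout` must not exceed `scrape_interval`, or Prometheus rejects the configuration at load time.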
Lessons Learned
What Went Wrong
- Prometheus 3.0 upgrade was deployed without reviewing breaking changes or testing scrape configs against production-matching load in staging.
- No alerting on Prometheus scrape failures: we did not have alerts configured for `up == 0` on any scrape targets.
- CDN deployment pipeline lacked post-deployment validation of routing coverage and traffic levels.
- Single point of failure for alerting: all API alerts relied solely on Prometheus metrics, with no out-of-band monitoring.
What Went Right
- The customer report was triaged promptly once submitted, shortening the incident by an estimated hour compared to discovery without it.
- CDN rollback process was well-documented, taking only 5 minutes to execute once identified.
- Prometheus config fix was straightforward once the timeout issue was identified, taking 5 minutes to apply.
Action Items
- Completed: Add `up == 0` alerts for all Prometheus scrape targets, notifying on-call of any scrape failure within 1 minute.
- Completed: Update internal Prometheus upgrade runbook to include breaking-changes review and staging load testing for scrape configs.
- In Progress: Implement out-of-band API traffic monitoring using CDN-level request metrics and synthetic `/health` checks, with alerts independent of Prometheus.
- In Progress: Add post-deployment traffic validation step to CDN pipeline, comparing pre- and post-deployment request volume to catch drops >5%.
- Pending: Document all Prometheus 3.0 breaking changes in internal engineering wiki, with migration guides for common config patterns.
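The completed `up == 0` action item can be expressed as a standard Prometheus alerting rule. A sketch, with alert name and labels hypothetical and notification routing omitted:

```yaml
groups:
  - name: scrape-health
    rules:
      - alert: ScrapeTargetDown
        # `up` is synthesized by Prometheus itself for every configured target,
        # so it keeps reporting even when the target's own metrics endpoint
        # times out or fails entirely.
        expr: up == 0
        for: 1m                    # matches the 1-minute notification goal
        labels:
          severity: page
        annotations:
          summary: "Scrape target {{ $labels.instance }} (job {{ $labels.job }}) is failing"
```

Because `up` drops to 0 on scrape timeouts as well as hard failures, this rule would have fired within a minute of the 14:18 scrape failure in this incident.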