ANKUSH CHOUDHARY JOHAL

Posted on • Originally published at johal.in
Incident Postmortem: Datadog Outage Delayed Detection by 2 Hours, Switched to OTel 1.20

Published: October 26, 2024 | Incident Date: October 12, 2024 | Author: Observability Engineering Team

Executive Summary

On October 12, 2024, a widespread Datadog service outage disrupted metric ingestion and alerting for 87% of our production workloads, delaying incident detection by roughly two hours (the backend outage itself lasted 2 hours and 14 minutes). The outage stemmed from a misconfigured Datadog Agent rollout that conflicted with a recent Datadog backend update, causing silent metric drops. Post-incident, we accelerated our migration to OpenTelemetry (OTel) 1.20, completing the switch for all critical services within 72 hours of resolution.

Incident Timeline (UTC)

  • 14:02: Datadog confirms a global backend outage affecting metric ingestion, logging, and alerting.
  • 14:07: Our internal observability dashboard shows 0% metric ingestion from all Datadog-monitored services; no alerts trigger due to dependency on Datadog for alerting.
  • 14:11: First customer report of elevated API gateway error rates reaches support, but support teams lack visibility into service health because of the Datadog outage.
  • 15:53: Engineering team identifies the outage via a secondary, on-premise Prometheus instance (not integrated into alerting) and triggers incident response.
  • 16:16: Datadog confirms resolution of backend outage; metric ingestion resumes.
  • 16:21: All services return to a healthy state; customer impact fully mitigated.
  • 16:30: Incident response team initiates postmortem process.

Root Cause Analysis

The outage had two compounding root causes:

  1. Datadog Backend Regression: A Datadog backend update deployed at 13:45 UTC introduced a race condition in the metric ingestion pipeline, dropping 92% of incoming metrics for customers with high-cardinality tags (our environment uses 14 high-cardinality tags for service segmentation).
  2. Single Point of Failure in Alerting: Our alerting pipeline was 100% dependent on Datadog’s alerting engine, with no fallback to secondary observability tools. Datadog’s outage silenced all critical alerts, including latency and error rate thresholds for our payment processing service.

Additionally, our Datadog Agent configuration had not been updated to support the new backend’s tag validation rules, exacerbating metric drop rates for our environment.

Resolution and Mitigation

Immediate mitigation steps during the outage included:

  • Redirecting customer support to a static status page updated manually via our internal incident management tool (not dependent on Datadog).
  • Scaling our secondary Prometheus instance to handle increased query load from engineering teams investigating the issue.
  • Manually validating payment processing service health via direct database queries, as no observability data was available.

Post-resolution, we prioritized two permanent fixes:

  1. Eliminate Single Point of Failure: Migrate all critical alerting to a multi-vendor observability stack, starting with OpenTelemetry 1.20 for metric collection and fallback alerting via Grafana OnCall (a dual-export sketch follows this list).
  2. Decouple Agent Configuration: Adopt OTel’s standard configuration model to avoid vendor-specific misconfigurations, targeting 100% coverage for critical services within 72 hours.
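
To make the first fix concrete, the sketch below shows how an OpenTelemetry 1.20 Python SDK setup can fan the same metrics out to two OTLP backends at once, so losing one vendor no longer blinds us. The endpoints, service name, and attribute values are illustrative assumptions rather than our actual configuration.

```python
# Minimal sketch: fan the same metrics out to two OTLP backends.
# Endpoints and names below are illustrative assumptions, not our real config.
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

resource = Resource.create({"service.name": "api-gateway"})

# One reader per backend: a Mimir-compatible OTLP endpoint and the local
# Datadog Agent's OTLP intake. If one backend drops data, the other still gets it.
mimir_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="mimir-gateway:4317", insecure=True)
)
datadog_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="datadog-agent:4317", insecure=True)
)

metrics.set_meter_provider(
    MeterProvider(resource=resource, metric_readers=[mimir_reader, datadog_reader])
)

# Application code records each metric once; the SDK exports it to both backends.
meter = metrics.get_meter("payments")
request_counter = meter.create_counter("http.server.requests", unit="1")
request_counter.add(1, {"route": "/charge", "status_code": "200"})
```

The same fan-out can also be done centrally in an OpenTelemetry Collector pipeline; the SDK-level version is shown here only because it keeps the example self-contained.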

Switch to OpenTelemetry 1.20

We accelerated our planned OTel migration following the incident, moving all 142 critical production services to OTel 1.20 by October 15, 2024. Key drivers for choosing OTel 1.20 included:

  • Native support for multi-backend metric exporting, allowing metrics to be sent simultaneously to Grafana Mimir and Datadog (post-outage, we resumed limited Datadog usage for non-critical workloads).
  • Improved high-cardinality tag handling, resolving the core issue that exacerbated metric drops during the Datadog outage (see the View sketch after this list).
  • Stable OTLP (OpenTelemetry Protocol) support, eliminating vendor lock-in for metric collection.
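
The second driver above, better high-cardinality handling, largely comes down to the SDK's View mechanism, which lets us allow-list the attribute keys an instrument may export. A minimal sketch follows; the instrument and attribute names are hypothetical, not our production tag set.

```python
# Sketch: cap metric cardinality with an SDK View by allow-listing attribute keys.
# Instrument and attribute names are hypothetical, not our production tag set.
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.view import View

# Keep only low-cardinality keys on the request counter; anything else
# (user IDs, request IDs, other unbounded tags) is dropped before export.
request_view = View(
    instrument_name="http.server.requests",
    attribute_keys={"service", "region", "status_code"},
)

# Same metric_readers as in the dual-export sketch above, omitted for brevity.
provider = MeterProvider(views=[request_view])
```

Dropping unbounded keys at the SDK keeps tag cardinality predictable regardless of which backend receives the data.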

Post-migration testing confirmed 99.99% metric delivery uptime across backends, with alerting latency reduced by 40% compared to our previous Datadog-only setup.

Lessons Learned

Key takeaways from the incident include:

  1. Avoid Single-Vendor Dependencies for Critical Observability: No single observability vendor should be the sole source of alerting or metric collection for production workloads.
  2. Test Vendor Updates in Staging: We did not catch the Datadog backend update's impact in our staging environment, even though staging mirrors our production high-cardinality tag setup. All vendor updates will now require 24-hour staging validation.
  3. Integrate Secondary Tools into Alerting: Our Prometheus instance was not integrated into alerting, wasting critical time during the outage. All secondary observability tools must now feed into the primary alerting pipeline.
  4. OTel Reduces Vendor Lock-In Risk: The accelerated migration to OTel 1.20 took 72 hours, but would have taken 3 weeks without prior prototyping. Maintain a 6-month roadmap for observability tool migrations to avoid rushed changes post-incident.

Follow-Up Actions

  • Complete OTel migration for all non-critical services by November 1, 2024.
  • Implement automated staging validation for all Datadog (and other vendor) updates by October 31, 2024.
  • Conduct quarterly game days to test observability stack resilience, starting in Q1 2025.
