Datadog · Reliability · 18 May 2026
On March 8, 2023, Datadog — the platform engineers use to know when their own infrastructure is broken — broke. For more than 24 hours, across five regions on three cloud providers, metrics stopped arriving, logs disappeared, and dashboards showed nothing. The people whose job was to fix it couldn't see what was happening. It cost $5 million. It changed how Datadog thinks about building software.
- 24h+ global outage
- $5M revenue loss
- 50–60% Kubernetes nodes lost
- 5 regions, 3 cloud providers
- All affected simultaneously
- Philosophy shift: graceful degradation
The Story
We had built with the assumption that the only way to handle failure was to prevent it entirely — or to stop everything — rather than finding ways to degrade gracefully and continue delivering value to customers, even under extreme conditions.
— Laura de Vesine, Rob Thomas, Maciej Kowalewski, via Datadog Engineering Blog
At 01:31 EST on March 8, 2023, Datadog experienced its first global outage — every region, every cloud provider, simultaneously. The company that monitors the infrastructure of thousands of other companies could not monitor its own. Dashboards loaded but displayed no data. Logs, metrics, alerting, and traces were all unavailable. The engineers whose job was to diagnose and fix the outage were operating without the observability tools that Datadog itself provides. It lasted over 24 hours. It cost $5 million in direct revenue. And it forced a fundamental rethink of how Datadog builds reliable systems.
The immediate cause was disarmingly mundane. An automated update to systemd (the init system and service manager used by most modern Linux distributions; it starts processes, manages services, and handles system initialization) was applied to Datadog's Ubuntu-based virtual machines across all regions simultaneously. The update arrived through a legacy security patch mechanism. Datadog had since built a modern lifecycle automation system for all nodes, but the legacy channel was still active, and it executed the update across the global fleet without a staged rollout, health gates, or human awareness. The update triggered a restart of systemd-networkd (the systemd component that manages network interfaces on Linux hosts), and that restart removed network routes from the machines as they came back up. Nodes that had been reachable over the cluster network simply vanished from the cluster.
THE CIRCULAR DEPENDENCY TRAP
The worst part was not that 50–60% of Kubernetes nodes lost network connectivity — it was what those nodes were running. Among the VMs brought down by the network route removal were the VMs powering Datadog's regionalized control planes based on Cilium (a cloud-native networking platform for Kubernetes that uses eBPF to provide networking, security, and observability for containerized workloads). The control plane going down meant Kubernetes couldn't schedule new pods, auto-repair failed nodes, or scale workloads to compensate. The very system that should have responded to the failure was among the first things the failure took down. This circular dependency — the recovery mechanism depending on the infrastructure that failed — is what turned a 50% node loss into a nearly complete platform outage.
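One way to make this kind of trap visible ahead of time is to model dependencies as a graph and ask which recovery components fall inside the blast radius of a given failure. The following is a minimal sketch of that audit, not Datadog's tooling; the component names and the dependency map are hypothetical and heavily simplified.

```python
def blast_radius(failed, depends_on):
    """Return every component that cannot run once `failed` is down.

    `depends_on[x]` lists the components that x needs in order to run.
    """
    down = set(failed)
    changed = True
    while changed:
        changed = False
        for component, deps in depends_on.items():
            if component not in down and any(d in down for d in deps):
                down.add(component)
                changed = True
    return down


def recovery_at_risk(failed, depends_on, recovery_components):
    """List recovery components that the failure would also take down."""
    return sorted(blast_radius(failed, depends_on) & set(recovery_components))


# Hypothetical dependency map; names are illustrative only.
DEPENDS_ON = {
    "node-network": [],
    "control-plane": ["node-network"],
    "scheduler": ["control-plane"],
    "auto-repair": ["control-plane"],
    "ingestion": ["node-network"],
}
RECOVERY_COMPONENTS = {"scheduler", "auto-repair"}

at_risk = recovery_at_risk({"node-network"}, DEPENDS_ON, RECOVERY_COMPONENTS)
print(at_risk)
```

A non-empty result means the mechanism expected to repair the failure shares a failure domain with it, which is the shape of the problem described above.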
Problem
Simultaneous Global Node Loss at 01:31 EST
A legacy automated Ubuntu security update channel applied a systemd update across Datadog's entire global fleet simultaneously — all five regions, all three cloud providers, all at once. The update caused a systemd-networkd restart interaction that removed network routing tables from nodes as they restarted. 50–60% of Kubernetes nodes lost network connectivity within minutes. Pages loaded but displayed no data. The outage was total from the customer perspective.
Cause
The Control Plane Was in the Blast Radius
The Kubernetes control plane — the cluster management layer responsible for scheduling, auto-repair, and scaling — was among the nodes that lost connectivity. This created a circular dependency: the recovery system needed the cluster to heal, but the cluster could not heal without the recovery system. Additionally, Datadog's multi-region, multi-cloud architecture provided no protection because the update was applied uniformly across all infrastructure simultaneously.
Solution
Manual Node Recovery + Architecture Rethink
Recovery required manual intervention: engineers identified and restarted affected nodes, restoring network routing and bringing Kubernetes control planes back online region by region. The legacy update channel was immediately disabled. But recovery took over 24 hours — far longer than the node loss itself — because services loaded large in-memory caches on startup that were slow to initialize, and the cluster lacked the spare capacity to absorb the sudden recovery surge.
Result
Full Recovery, New Philosophy
Full service was restored after more than 24 hours. Datadog later published a detailed engineering blog post describing not just what happened but the architectural shift it drove: away from never-fail systems and toward systems designed to degrade gracefully when failure inevitably occurs. Published in October 2025, the post documented more than two years of architectural work that followed directly from the March 2023 incident.
💸
Datadog operates on usage-based billing — customers pay for the volume of metrics, logs, and traces they send. During the 24-hour outage, Datadog did not charge customers for data they couldn't send. The $5M revenue loss was direct: one day of global service unavailability translated directly into one day of foregone billing. This number was revealed on an earnings call, making the financial cost of the outage unusually concrete and public.
❌
Multi-Cloud Did Not Help
Datadog ran in five regions across three cloud providers — AWS, GCP, and Azure. This architecture is often cited as a reliability best practice. But it provided zero protection in this incident because the failure mechanism — the automated Ubuntu update — operated at the OS layer, uniformly across all infrastructure regardless of cloud provider. Multi-cloud protects against cloud provider failures. It does not protect against failures in your own automation that touch all infrastructure simultaneously.
The 24-hour recovery time was itself a lesson. Even after the Kubernetes control planes came back online and new pods could be scheduled, services were slow to recover. The investigation found two patterns: some services had insufficient compute allocated relative to others, causing them to wait a long time for Kubernetes to schedule their pods after the control plane recovered. Others loaded large, processing-intensive caches into memory at startup — caches that had been optimized for steady-state operation but were extremely expensive to rebuild from scratch after a complete restart. Both of these were design choices that had seemed reasonable in a world where failure was rare and total restarts were rarer still. In a world where failure must be expected, they were traps.
⚠️
The Irony of the Observability Platform
There is a particular quality of darkness in losing observability tooling during an outage. Engineers responding to the incident relied on Datadog to understand what was happening, and Datadog was the thing that was down. The response team had to work from first principles: SSH into individual hosts, read raw logs, check systemd status directly. The tooling built to abstract away that complexity was unavailable at precisely the moment the complexity needed to be navigated. The incident revealed how dependent Datadog's own on-call rotation was on Datadog itself.
ℹ️
The Square-Wave Failure Pattern
Datadog's engineers described the outage as a square-wave failure — the platform went from fully operational to nearly completely down almost instantaneously, rather than degrading gradually. This pattern is characteristic of failures at the infrastructure layer: when Kubernetes nodes lose network connectivity, every pod running on those nodes disappears from service meshes and load balancers at once. There is no gradual ramp. For an observability platform designed around monitoring continuous signals, a square-wave drop to zero looked different from every other failure mode the monitoring systems had been trained on.
🌐
Datadog ran infrastructure across five regions on three different cloud providers — a setup specifically designed to avoid single points of failure. It provided no protection at all against this incident because the failure mechanism lived at a layer beneath the cloud provider abstraction: the Ubuntu OS update that ran on every Datadog-managed VM, regardless of which cloud it ran on. The lesson is precise: multi-cloud resilience and OS-level automation independence are orthogonal properties.
THE POSTMORTEM DELAY
Datadog waited over two months to publish a public postmortem. The gap generated significant industry commentary, particularly after the CEO referenced the postmortem on an earnings call before it was publicly available. The eventual postmortem was substantive and technical. But the delay, along with the CEO's apparent confusion about whether it had been shared, was widely noted as a departure from the transparency standard set by companies like Cloudflare. Speed of postmortem publication matters for customer trust, especially for a platform whose entire value proposition is reliability and observability.
The Fix
The Philosophical Shift: From Never-Fail to Graceful Degradation
The deep engineering response to the March 2023 outage was not a list of tactical fixes. It was a philosophical shift. Datadog's engineering teams had historically built for reliability through redundancy, designing systems so that individual components never went down. This produced what the postmortem called never-fail architectures: systems where components and services had to be fully functional to serve any user use case. When a component did fail, the entire service path that depended on it failed with it. The incident revealed a hidden assumption: that recovery would be fast and partial failure would be brief. A 24-hour outage broke that assumption completely, and exposed how little thought had gone into what the system should do while broken.
- 24h+ — Total outage duration — longer than the initial node loss because service startup was slow and the cluster lacked capacity to absorb the recovery surge
- $5M — Direct revenue loss from usage-based billing — one day of global unavailability translated to one day of zero billing, revealed publicly on an earnings call
- 50–60% — Kubernetes nodes that lost network connectivity from the systemd update — enough to take down control planes and make automated recovery impossible
- 3 clouds — Cloud providers affected simultaneously — AWS, GCP, and Azure all impacted because the failure was in Datadog's own automation, not in any cloud provider's infrastructure
WHAT GRACEFUL DEGRADATION ACTUALLY MEANS
Datadog's post-incident architectural shift was built on a simple principle: when failure occurs, the system should continue to deliver as much value as possible to as many customers as possible, even if it cannot deliver full value to all customers. This means designing every service with an explicit answer to the question: what does this service do when its dependencies are unavailable? Can it serve stale data? Can it serve a subset of features? Can it serve with degraded accuracy? Or does it have to stop entirely? Most services, when the question is asked honestly, can do better than stop.
# Helper types assumed for illustration; they are not defined in the source.
from dataclasses import dataclass
from typing import Any, Optional

class StorageUnavailable(Exception):
    """Raised when live storage cannot be reached."""

@dataclass
class DataResponse:
    data: Any = None
    staleness_warning: bool = False
    completeness_warning: bool = False
    error: Optional[str] = None
    retry_in: Optional[int] = None  # seconds before the client should retry

# Before: Never-fail architecture (implicit assumption)
class MetricsQueryService:
    def query_metrics(self, metric_name, time_range):
        # If storage is unavailable, this raises an exception.
        # The exception propagates up and the user sees an error page.
        raw_data = self.storage.fetch(metric_name, time_range)
        return self.process(raw_data)  # no fallback

# After: Graceful degradation architecture
class MetricsQueryService:
    def query_metrics(self, metric_name, time_range):
        try:
            # Try live storage first
            raw_data = self.storage.fetch(metric_name, time_range)
            return self.process(raw_data)
        except StorageUnavailable:
            # Fall back to cached/stale data: user sees old data with a warning
            stale_data = self.read_through_cache.fetch(metric_name, time_range)
            if stale_data:
                return DataResponse(data=stale_data, staleness_warning=True)
            # Fall further back: return partial data from other sources
            partial = self.fallback_source.fetch(metric_name, time_range)
            if partial:
                return DataResponse(data=partial, completeness_warning=True)
            # Only now surface an error, and make it informative
            return DataResponse(error='Storage degraded', retry_in=30)
ℹ️
Startup Optimization: Fixing the Recovery Drag
Two changes addressed the slow recovery after node restoration. First, Datadog used Kubernetes priority mechanisms to ensure critical services got compute allocated before lower-priority ones when the cluster came back online — preventing a thundering herd of equal-priority pods all waiting for the same scarce resources. Second, services with large startup caches shortened their lookback windows and changed data formats to eliminate processing-intensive deserialization at startup. Services that had been trying to rebuild six months of cache at startup were redesigned to start with a smaller warm window and build up over time.
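The second change, starting with a small warm window and filling in history later, can be sketched as a cache that separates "ready to serve" from "fully loaded". This is an illustrative sketch only: `WarmingCache`, `fake_load_range`, and the window sizes are assumptions, not details from Datadog's implementation.

```python
from datetime import datetime, timedelta, timezone


class WarmingCache:
    """Cache that loads a short recent window at startup and backfills later."""

    def __init__(self, load_range, full_lookback, initial_window):
        self.load_range = load_range      # callable(start, end) -> dict of entries
        self.full_lookback = full_lookback
        self.initial_window = initial_window
        self.entries = {}
        self.covered_from = None          # oldest timestamp loaded so far
        self.target_from = None           # oldest timestamp we eventually want

    def start(self, now):
        """Load only the most recent window so the service can begin serving."""
        self.covered_from = now - self.initial_window
        self.target_from = now - self.full_lookback
        self.entries.update(self.load_range(self.covered_from, now))

    def backfill_step(self, step):
        """Extend coverage one step further back; True while work remains."""
        if self.covered_from <= self.target_from:
            return False
        new_from = max(self.target_from, self.covered_from - step)
        self.entries.update(self.load_range(new_from, self.covered_from))
        self.covered_from = new_from
        return self.covered_from > self.target_from


def fake_load_range(start, end):
    """Stand-in for an expensive storage read: one entry per whole hour."""
    hours = int((end - start) / timedelta(hours=1))
    return {start + timedelta(hours=i): "row" for i in range(hours)}


now = datetime(2023, 3, 9, tzinfo=timezone.utc)
cache = WarmingCache(fake_load_range, timedelta(days=7), timedelta(hours=6))
cache.start(now)
ready_size = len(cache.entries)        # service can take traffic at this point
while cache.backfill_step(timedelta(days=1)):
    pass                               # in practice this would run in the background
full_size = len(cache.entries)
```

The design point is that time-to-serving depends on `initial_window`, not on `full_lookback`, so a cluster-wide cold start does not have to pay for the entire history before answering its first query.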
✅
The Architectural Patterns That Emerged
Over the two years following the incident, Datadog published a set of graceful degradation patterns applied across its products: persist data early (write to durable storage as early as possible in the pipeline, so recovery is stateless); stale reads (serve cached data with a staleness indicator rather than surfacing an error); partial serving (return what you have rather than nothing); circuit breaking (automatically stop calling a failing dependency, fall back to alternative, re-probe for recovery). None of these patterns were invented by Datadog — they were standard resilience engineering techniques that Datadog had systematically under-applied.
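Of the patterns listed, circuit breaking is the one with the most moving parts, so a small sketch may help. This is a generic, simplified circuit breaker under assumed parameters (`failure_threshold`, `reset_after`); it is not Datadog's implementation, and production breakers usually add per-dependency metrics and concurrency control.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency, serve a fallback, re-probe later."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after    # seconds to wait before probing again
        self.clock = clock
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()         # open: skip the dependency entirely
            # Otherwise half-open: let one probe call through below.
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open, or re-open after a failed probe
            return fallback()
        self.failures = 0
        self.opened_at = None             # success closes the circuit
        return result


# Demonstration with a fake clock so the behavior is deterministic.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_after=10.0, clock=lambda: now[0])
attempts = []


def failing_dependency():
    attempts.append(now[0])
    raise ConnectionError("dependency down")


def stale_answer():
    return "stale"


results = [breaker.call(failing_dependency, stale_answer) for _ in range(4)]
now[0] = 11.0                              # reset window has elapsed
probe_result = breaker.call(lambda: "live", stale_answer)
```

In the demonstration, only the first two calls reach the failing dependency; the next two are answered from the fallback without touching it, and the probe after the reset window closes the circuit again.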
✅
Persist Data Early: The Durability Pattern
One of the most concrete architectural changes after the incident was implementing a persist early pattern across Datadog's data pipelines. Instead of holding data in-memory for processing before writing to durable storage, the system was changed to write to durable storage as soon as data arrived — before processing. This meant that even if processing services went down, incoming customer telemetry was safely on disk and could be processed retroactively when services recovered. Recovery no longer required customers to resend data that had arrived during the outage window.
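The ordering described here, durable write first and processing second, can be illustrated with a small append-only intake log. This is a sketch under stated assumptions: `DurableIntake` and its file-based log are hypothetical stand-ins for whatever durable store a real pipeline would use, and the sketch omits offset tracking, so a replay reprocesses the whole log.

```python
import json
import os
import tempfile


class DurableIntake:
    """Append each payload to a log file before any processing happens."""

    def __init__(self, path):
        self.path = path

    def accept(self, payload):
        """Persist the payload, then acknowledge. Processing is not required."""
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps(payload) + "\n")
            log.flush()
            os.fsync(log.fileno())    # ensure the bytes reach disk before acking
        return True

    def replay(self, process):
        """Feed every persisted payload to `process`; returns the count."""
        processed = 0
        with open(self.path, encoding="utf-8") as log:
            for line in log:
                process(json.loads(line))
                processed += 1
        return processed


with tempfile.TemporaryDirectory() as tmp:
    intake = DurableIntake(os.path.join(tmp, "intake.log"))
    # Processors are "down": data is still accepted and persisted.
    for value in (1, 2, 3):
        intake.accept({"metric": "cpu", "value": value})
    # Processors recover: replay what arrived during the outage window.
    recovered = []
    replayed = intake.replay(recovered.append)
```

Because `accept` never calls the processor, an outage in the processing tier delays results but does not lose the data that arrived in the meantime.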
⚠️
The Kubernetes Priority Class Oversight
After the outage, Datadog's investigation found that many services had not been assigned appropriate Kubernetes Priority Classes — a mechanism that tells the Kubernetes scheduler which pods should get compute resources first when the cluster is under resource pressure. In normal operation, this doesn't matter much. After a large failure where the entire cluster restarts simultaneously, priority classes determine recovery order. Services that should start first (database proxies, ingestion pipelines) were waiting for the same CPU allocations as low-priority background jobs. Recovery order is a design decision that should be made explicitly, not left to scheduler defaults.
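The effect of priority on recovery order can be seen in a toy model of a scheduler working through a backlog with limited CPU. This is a deliberately simplified simulation, not the Kubernetes scheduler: the real one also weighs memory, affinity, and preemption, and the pod names and priority values below are hypothetical.

```python
def schedule(pods, cpu_capacity):
    """Place pods highest priority first until CPU runs out.

    Returns (placed, pending) as lists of pod names.
    """
    placed, pending = [], []
    remaining = cpu_capacity
    # sorted() is stable, so equal-priority pods keep their submission order.
    for pod in sorted(pods, key=lambda p: -p["priority"]):
        if pod["cpu"] <= remaining:
            placed.append(pod["name"])
            remaining -= pod["cpu"]
        else:
            pending.append(pod["name"])
    return placed, pending


# Hypothetical backlog after a cluster-wide restart, in submission order.
PODS = [
    {"name": "batch-report", "priority": 0, "cpu": 4},
    {"name": "ingestion", "priority": 1000, "cpu": 4},
    {"name": "db-proxy", "priority": 1000, "cpu": 2},
    {"name": "cache-warmer", "priority": 100, "cpu": 4},
]

# With explicit priorities, critical services claim the scarce capacity.
placed, pending = schedule(PODS, cpu_capacity=8)

# With every pod at the same priority, submission order decides instead.
flat = [dict(pod, priority=0) for pod in PODS]
placed_flat, pending_flat = schedule(flat, cpu_capacity=8)
```

With priorities set, `ingestion` and `db-proxy` are placed first. With all pods at equal priority, `batch-report` takes capacity ahead of `db-proxy`, which is the kind of accidental recovery order the paragraph above warns about.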
Architecture
The architecture that failed in March 2023 had a specific shape: every product feature in Datadog's platform depended on a chain of services, each of which had to be fully healthy for any part of the chain to work. Logs required a log ingestion pipeline, a storage layer, a query layer, and a frontend — all healthy. If any component in the chain was down, the entire feature was down. The never-fail architecture assumed each link in the chain would always be up. The March 2023 incident showed what happens when multiple links go down simultaneously.
Before: Never-Fail Chain Architecture (Any Failure = Total Failure)
[Interactive diagram available on TechLogStack]
After: Graceful Degradation Architecture (Failure = Degraded, Not Dark)
[Interactive diagram available on TechLogStack]
MULTI-CLOUD IS NOT A RELIABILITY SILVER BULLET
Datadog's global, multi-cloud infrastructure — five regions, three cloud providers — provided zero protection against this incident. The lesson generalizes: multi-cloud protects against cloud provider failures. It does not protect against failures in your own configuration management, your own automation, your own deployment systems, or your own service design. An automated update that runs across all infrastructure uniformly bypasses all multi-cloud redundancy. Organizations that invest heavily in multi-cloud while neglecting the uniformity of their own automation are addressing the wrong failure vector.
⚠️
The Legacy Channel Problem
The update that caused the outage went through a legacy security update mechanism — a channel that Datadog's security team had kept active while building a modern replacement. The modern system had been built; the legacy system had not been decommissioned. This is one of the most common failure patterns in infrastructure: a replaced system that was never actually turned off. The old system executed one last time at the worst possible moment. Every team with legacy automation that still runs in production should audit whether it could execute in a way that bypasses the modern system's safety gates.
🔬
Root Cause Archaeology: Finding the Network Route Bug
The technical root cause was subtle: when systemd-networkd restarted during the OS update, it cleared the network routing table for container workloads that had been set up by Kubernetes's networking plugin (Cilium). New nodes starting up for the first time don't have this problem — they start with an empty routing table and Cilium populates it correctly. But nodes that were already running had existing routing entries that were erased by the systemd-networkd restart. This was a previously unobserved interaction that only manifested when restarting a running node rather than provisioning a new one.
Lessons
The March 2023 Datadog outage is extraordinary for two reasons: the irony of an observability platform going dark, and the depth of the architectural response it drove. The lessons here are not primarily about the incident itself but about the philosophy that emerged from it.
- 01. Build for graceful degradation, not just failure prevention. Every service should have an explicit answer to: what do I do when my dependencies are unavailable? Stale data with a warning, partial results, degraded accuracy — all of these are better than returning nothing. The goal is to serve as many customers as possible, as fully as possible, even while broken.
- 02. Circular dependencies between service infrastructure and recovery infrastructure are a reliability catastrophe waiting to happen. A circular dependency exists when component A depends on component B for recovery while component B depends on component A to be running. Explicitly audit your control planes, monitoring systems, and automation pipelines: if the thing that fixes failures is also in the blast radius of those failures, you have a recovery problem.
- 03. Decommission legacy automation systems completely. The outage was caused by a legacy update channel that still had execution access after its replacement was built. Every organization has deprecated-but-still-running systems. Audit them. A legacy channel that runs once a year can cause an outage just as reliably as one that runs every day.
- 04. Staged rollouts are not optional for automated changes to production infrastructure. A staged rollout applies a change to a small percentage of infrastructure first, checks health, then expands gradually. The Datadog systemd update was applied globally and simultaneously. Rolling out in stages (1% of nodes, health check, 10%, health check) would have caught the network route removal on a handful of nodes before it cascaded to the entire fleet.
- 05. Design service startup to be fast under the conditions that follow a large outage. When a cluster recovers from a significant failure, all services restart simultaneously with no warm caches, competing for scarce cluster capacity. Services optimized for steady-state operation can become bottlenecks in this cold-restart scenario. Test your startup behavior under cluster-wide cold-start conditions, not just under normal rolling restarts.
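The staged rollout in lesson 04 can be sketched as a loop that widens the set of updated nodes only while health checks keep passing. This is a minimal illustration, not a description of Datadog's lifecycle automation: `staged_rollout`, the stage fractions, and the simulated fleet are assumptions, and a real rollout would also wait for a bake period between waves.

```python
def staged_rollout(nodes, apply_update, is_healthy, stages=(0.01, 0.10, 0.50, 1.0)):
    """Apply an update in growing waves, halting at the first failed health check."""
    updated = []
    for fraction in stages:
        target = max(1, int(len(nodes) * fraction))
        wave = nodes[len(updated):target]      # only nodes not yet updated
        for node in wave:
            apply_update(node)
            updated.append(node)
        unhealthy = [node for node in wave if not is_healthy(node)]
        if unhealthy:
            return {"status": "halted", "updated": len(updated), "unhealthy": unhealthy}
    return {"status": "complete", "updated": len(updated), "unhealthy": []}


# Simulated fleet; node names are illustrative.
NODES = [f"node-{i}" for i in range(200)]


def run(update_breaks_routes):
    """Roll out an update that either breaks every node it touches or is benign."""
    broken = set()

    def apply_update(node):
        if update_breaks_routes:
            broken.add(node)

    return staged_rollout(NODES, apply_update, lambda node: node not in broken)


bad_result = run(update_breaks_routes=True)
good_result = run(update_breaks_routes=False)
```

In the simulated bad case the rollout halts after the first wave, with 2 of 200 nodes touched. The benign update proceeds through every stage and reaches the whole fleet.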
ℹ️
More Than Two Years to Publish the Engineering Post
The outage happened in March 2023. The detailed engineering blog post documenting the architectural response was published in October 2025, more than two and a half years later. This timeline reflects the depth of the work: the post described real architectural changes that had been implemented and validated in production across Datadog's entire product portfolio, not aspirational plans. Publishing only after the work was done is the responsible version of transparency. Claiming to have fixed something before you've fixed it erodes trust faster than silence.
GRACEFUL DEGRADATION AS A DESIGN PRINCIPLE, NOT A FEATURE
The deepest lesson from Datadog's post-incident work is that graceful degradation is not a feature you add to a service after it's built — it's a design principle that shapes how the service is architected from the beginning. A service designed to gracefully degrade will have different internal boundaries, different cache strategies, different dependency contracts, and different SLOs than one designed to always succeed. Retrofitting graceful degradation into a never-fail architecture is expensive. Building for it from the start is cheaper. After two years of retrofitting, Datadog's engineering organization now treats the question 'how does this service degrade?' as a required design review criterion.
Datadog's monitoring platform went down for 24 hours — which means the engineers had to debug a global infrastructure failure using SSH, intuition, and the kind of raw log reading skills that got them into engineering in the first place.
TechLogStack — built at scale, broken in public, rebuilt by engineers
This case is a plain-English retelling of publicly available engineering material.