TechLogStack

Posted on May 20 • Originally published at techlogstack.com on May 18

Datadog Went Dark for 24 Hours and Came Back With a Different Philosophy

#devops #kubernetes #reliability #webdev

24h+ global outage — 5 regions, 3 cloud providers, all simultaneously
$5M direct revenue loss — revealed on an earnings call
50–60% of Kubernetes nodes lost network connectivity
Multi-cloud provided zero protection — the failure was in Datadog's own automation
Root cause: a legacy systemd update channel nobody had decommissioned
Outcome: two years of architectural work; philosophy shift from never-fail to graceful degradation

On March 8, 2023, Datadog — the platform engineers use to know when their own infrastructure is broken — broke. For more than 24 hours, across five regions on three cloud providers, metrics stopped arriving, logs disappeared, and dashboards showed nothing. The people whose job was to fix it couldn't see what was happening. It cost $5 million. It changed how Datadog thinks about building software.

The Story

We had built with the assumption that the only way to handle failure was to prevent it entirely — or to stop everything — rather than finding ways to degrade gracefully and continue delivering value to customers, even under extreme conditions.

— Laura de Vesine, Rob Thomas, Maciej Kowalewski, via Datadog Engineering Blog

At 01:31 EST on March 8, 2023, Datadog experienced its first global outage — every region, every cloud provider, simultaneously. The company that monitors the infrastructure of thousands of other companies could not monitor its own. Dashboards loaded but displayed no data. Logs, metrics, alerting, and traces were all unavailable. The engineers responding to the incident were operating without the observability tools that Datadog itself provides.

The immediate cause was disarmingly mundane: an automated systemd update was applied to Datadog's Ubuntu-based VMs across all regions simultaneously. (systemd is the init system and service manager used by most modern Linux distributions; it starts processes, manages services, and handles system initialisation.) The update arrived through a legacy security patch mechanism. Datadog had since built a modern lifecycle automation system for all nodes, but the legacy channel was still active and executed across the global fleet without staged rollout, health gates, or human awareness. The update restarted systemd-networkd, the systemd component responsible for managing network interfaces on Linux hosts, and that restart removed network routes from machines as the service came back up. Nodes that had been reachable moments earlier simply vanished from the cluster.

The Circular Dependency Trap

The worst part was not that 50–60% of Kubernetes nodes lost network connectivity; it was what those nodes were running. Among the affected VMs were those powering Datadog's regionalized control planes based on Cilium (a cloud-native networking platform for Kubernetes that uses eBPF to provide networking, security, and observability for containerised workloads). With the control plane down, Kubernetes couldn't schedule new pods, auto-repair failed nodes, or scale workloads to compensate. The recovery mechanism depended on the infrastructure that had failed, and this circular dependency turned the loss of roughly half the nodes into a nearly complete platform outage.

Problem

Simultaneous Global Node Loss at 01:31 EST

A legacy automated Ubuntu security update channel applied a systemd update across Datadog's entire global fleet at once: all five regions, on all three cloud providers. The update caused a systemd-networkd restart interaction that removed network routes from nodes when the service restarted. Within minutes, 50–60% of Kubernetes nodes had lost network connectivity.

Cause

The Control Plane Was in the Blast Radius

The Kubernetes control plane — responsible for scheduling, auto-repair, and scaling — was among the nodes that lost connectivity. This created a circular dependency: the recovery system needed the cluster to heal, but the cluster couldn't heal without the recovery system. Multi-region, multi-cloud architecture provided no protection because the update was applied uniformly across all infrastructure simultaneously.
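One way to make this kind of dependency visible is to record where each recovery mechanism is hosted and which infrastructure it is supposed to heal, then flag any overlap. The sketch below is illustrative only: the helper `recovery_in_blast_radius` and the component and pool names are hypothetical and do not describe Datadog's actual topology.

```python
from typing import Dict, List, Set


def recovery_in_blast_radius(
    runs_on: Dict[str, Set[str]],
    repairs: Dict[str, Set[str]],
) -> List[str]:
    """Return recovery components hosted on infrastructure they are meant to repair.

    runs_on: component name -> infrastructure pools it is hosted on
    repairs: recovery component name -> infrastructure pools it heals
    """
    at_risk = []
    for component, healed_pools in repairs.items():
        # If a recovery component lives on a pool it heals, losing that pool
        # also removes the thing that would have repaired it.
        if runs_on.get(component, set()) & healed_pools:
            at_risk.append(component)
    return sorted(at_risk)


# Hypothetical topology, for illustration only.
runs_on = {
    "k8s-control-plane": {"worker-pool"},
    "out-of-band-repair": {"isolated-pool"},
}
repairs = {
    "k8s-control-plane": {"worker-pool"},
    "out-of-band-repair": {"worker-pool"},
}

print(recovery_in_blast_radius(runs_on, repairs))  # ['k8s-control-plane']
```

A real audit would also need to follow transitive dependencies (DNS, identity, configuration stores), which this sketch deliberately leaves out.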

Solution

Manual Node Recovery + Architecture Rethink

Recovery required manual intervention: engineers identified and restarted affected nodes region by region, restoring network routing and bringing Kubernetes control planes back online. The legacy update channel was immediately disabled. But recovery took over 24 hours — far longer than the node loss itself — because services loaded large in-memory caches on startup and the cluster lacked spare capacity to absorb the recovery surge.

Result

Full Recovery, New Philosophy

Full service was restored after more than 24 hours. Datadog later published a detailed engineering blog post describing the architectural shift the incident drove: away from never-fail systems and toward systems designed to degrade gracefully when failure inevitably occurs. Published in October 2025, the post documented two years of architectural work that followed directly from the March 2023 incident.

The Fix

The Philosophical Shift: From Never-Fail to Graceful Degradation

The deep engineering response to the March 2023 outage was not a list of tactical fixes. It was a philosophical shift. Datadog's engineering teams had historically built for reliability through redundancy — designing systems so that individual components never went down. This produced never-fail architectures: systems where components had to be fully functional to serve any user use case. When a component failed, the entire service path that depended on it failed with it. The incident revealed a hidden assumption: that recovery would be fast and partial failure would be brief. A 24-hour outage broke that assumption completely.

24h+ — total outage duration; longer than the initial node loss because service startup was slow and the cluster lacked capacity to absorb the recovery surge
$5M — direct revenue loss from usage-based billing; one day of global unavailability = one day of zero billing
50–60% — Kubernetes nodes that lost network connectivity; enough to take down control planes and make automated recovery impossible
3 clouds — AWS, GCP, and Azure all impacted simultaneously; multi-cloud provided zero protection against Datadog's own automation

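from dataclasses import dataclass
from typing import Any, Optional


# Illustrative supporting types so the sketch is self-contained
class StorageUnavailable(Exception):
    """Raised when live storage cannot be reached."""


@dataclass
class DataResponse:
    data: Any = None
    staleness_warning: bool = False
    completeness_warning: bool = False
    error: Optional[str] = None
    retry_in: Optional[int] = None  # seconds until the client should retry


# Before: Never-fail architecture (implicit assumption)
class MetricsQueryService:
    def query_metrics(self, metric_name, time_range):
        # If storage is unavailable, this raises an exception.
        # The exception propagates up and the user sees an error page.
        # No fallback. No partial data. Nothing.
        raw_data = self.storage.fetch(metric_name, time_range)
        return self.process(raw_data)


# After: Graceful degradation architecture
# (storage, read_through_cache, and fallback_source are injected clients, not shown)
class MetricsQueryService:
    def query_metrics(self, metric_name, time_range):
        try:
            # Try live storage first; wrap the result so every path returns a DataResponse
            raw_data = self.storage.fetch(metric_name, time_range)
            return DataResponse(data=self.process(raw_data))

        except StorageUnavailable:
            # Fall back to cached/stale data: user sees old data with a warning
            stale = self.read_through_cache.fetch(metric_name, time_range)
            if stale:
                return DataResponse(data=stale, staleness_warning=True)

            # Fall further: return partial data from alternative sources
            partial = self.fallback_source.fetch(metric_name, time_range)
            if partial:
                return DataResponse(data=partial, completeness_warning=True)

            # Only now surface an error, and make it informative
            return DataResponse(error='Storage degraded', retry_in=30)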

# The key principle: serve as many customers as possible,
# as fully as possible, even while broken.
# Stale data > partial data > informative error > silent nothing.

What Graceful Degradation Actually Means

Datadog's post-incident architectural shift was built on a simple principle: when failure occurs, continue to deliver as much value as possible to as many customers as possible, even if you can't deliver full value to all customers. This means designing every service with an explicit answer to: what does this service do when its dependencies are unavailable? Can it serve stale data? Serve a subset of features? Serve with degraded accuracy? Or does it have to stop entirely? Most services, when the question is asked honestly, can do better than stop.

The architectural patterns that emerged across two years:

Persist data early — write to durable storage as early as possible in the pipeline; recovery becomes stateless and doesn't require customers to resend data
Stale reads — serve cached data with a staleness indicator rather than surfacing an error
Partial serving — return what you have rather than nothing
Circuit breaking — automatically stop calling a failing dependency, fall back to alternative, re-probe for recovery
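The circuit-breaking pattern in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Datadog's implementation: the `CircuitBreaker` class, its thresholds, and the injectable `clock` parameter are all hypothetical, and a production breaker would catch narrower exception types and be safe for concurrent callers.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """Closed until N consecutive failures, then open; re-probes after a cool-down."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, primary: Callable[[], Any], fallback: Callable[[], Any]) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: stop calling the failing dependency and use the fallback.
                return fallback()
            # Cool-down elapsed: let a single probe through (half-open).
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Trip the breaker, or re-open it after a failed probe.
                self.opened_at = self.clock()
            return fallback()
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each dependency call, for example `breaker.call(lambda: storage.fetch(key), lambda: cache.get(key))`, where `storage` and `cache` stand in for whatever clients the service already has.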

Startup optimisation: fixing the recovery drag

Two changes addressed slow recovery after node restoration. First, Datadog used Kubernetes Priority Classes to ensure critical services got compute allocated before lower-priority ones when the cluster came back online — preventing a thundering herd of equal-priority pods all waiting for the same scarce resources. Second, services with large startup caches shortened their lookback windows and changed data formats to eliminate processing-intensive deserialisation at startup. Services trying to rebuild six months of cache at startup were redesigned to start with a smaller warm window and build up over time.
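The effect of priority-ordered recovery can be shown with a toy model. The sketch below is not the Kubernetes scheduler, which also handles preemption, affinity, and many other constraints; it only illustrates why admitting higher-priority work first matters when capacity is scarce. The `Pod` type, the `schedule` function, and the service names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Pod:
    name: str
    priority: int  # higher value is admitted first, like a PriorityClass value
    cpu: int       # requested CPU units


def schedule(pending: List[Pod], capacity: int) -> List[str]:
    """Admit pods in descending priority order until capacity runs out."""
    admitted = []
    for pod in sorted(pending, key=lambda p: p.priority, reverse=True):
        if pod.cpu <= capacity:
            admitted.append(pod.name)
            capacity -= pod.cpu
    return admitted


# Hypothetical workloads competing for 8 CPU units after a cluster-wide restart.
pending = [
    Pod("batch-report", priority=10, cpu=4),
    Pod("metrics-intake", priority=1000, cpu=4),
    Pod("dashboard-api", priority=500, cpu=3),
]

print(schedule(pending, capacity=8))  # ['metrics-intake', 'dashboard-api']
```

With equal priorities the outcome would depend on arrival order, so the batch job could take capacity that the intake path needs.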

Root cause archaeology: finding the network route bug

The technical root cause was subtle: when systemd-networkd restarted during the OS update, it cleared the network routing table for container workloads that had been set up by Cilium. New nodes starting up for the first time don't have this problem — they start with an empty routing table and Cilium populates it correctly. But nodes that were already running had existing routing entries that were erased by the systemd-networkd restart. This was a previously unobserved interaction that only manifested when restarting a running node rather than provisioning a new one.

The postmortem delay — and why the timing matters

The outage happened in March 2023. The detailed engineering blog post documenting the architectural response was published in October 2025, two and a half years later. This timeline reflects the depth of the work: the post described real architectural changes that had been implemented and validated in production across Datadog's entire product portfolio, not aspirational plans. Publishing only after the work was done is the responsible version of transparency. However, the CEO referenced the incident on an earnings call before the postmortem was publicly available, a gap that generated significant industry commentary and underscored that the speed of public postmortem publication matters for customer trust.

Architecture

The architecture that failed in March 2023 had a specific shape: every product feature depended on a chain of services, each of which had to be fully healthy for any part of the chain to work. Logs required a log ingestion pipeline, a storage layer, a query layer, and a frontend — all healthy. If any component was down, the entire feature was down. The never-fail architecture assumed each link in the chain would always be up.

Before: Never-Fail Chain Architecture (Any Failure = Total Failure)

View interactive diagram on TechLogStack →

After: Graceful Degradation Architecture (Failure = Degraded, Not Dark)

View interactive diagram on TechLogStack →

Multi-Cloud Is Not a Reliability Silver Bullet

Datadog ran infrastructure across five regions on three different cloud providers — a setup specifically designed to avoid single points of failure. It provided zero protection against this incident because the failure mechanism lived at a layer beneath the cloud provider abstraction: the Ubuntu OS update that ran on every Datadog-managed VM, regardless of which cloud it ran on. Multi-cloud protects against cloud provider failures. It does not protect against failures in your own configuration management, your own automation, or your own deployment systems. These are orthogonal properties.

The legacy channel problem: deprecated but not decommissioned

The update that caused the outage went through a legacy security update mechanism — a channel that Datadog's security team had kept active while building a modern replacement. The modern system had been built; the legacy system had not been decommissioned. This is one of the most common failure patterns in infrastructure: a replaced system that was never actually turned off. The old system executed one last time at the worst possible moment. Every team with legacy automation that still runs in production should audit whether it could execute in a way that bypasses the modern system's safety gates.

Lessons

Build for graceful degradation, not just failure prevention. Every service should have an explicit answer to: what do I do when my dependencies are unavailable? Stale data with a warning, partial results, degraded accuracy — all are better than returning nothing. The goal is to serve as many customers as possible, as fully as possible, even while broken.
Circular dependencies (when component A depends on component B for recovery, and component B depends on component A to be running) between service infrastructure and recovery infrastructure are a reliability catastrophe. Explicitly audit your control planes, monitoring systems, and automation pipelines: if the thing that fixes failures is also in the blast radius of those failures, you have a recovery problem.
Decommission legacy automation systems completely. The outage was caused by a legacy update channel that still had execution access after its replacement was built. Every organisation has deprecated-but-still-running systems. Audit them. A legacy channel that runs once a year can cause an outage just as reliably as one that runs every day.
Staged rollouts (applying changes to a small percentage of infrastructure first, checking health, then expanding gradually) are not optional for automated changes to production infrastructure. The Datadog systemd update was applied globally and simultaneously. A staged rollout — 1% of nodes, health check, 10%, health check — would have caught the network route removal on a handful of nodes before it cascaded to the entire fleet.
Design service startup to be fast under the conditions that follow a large outage. When a cluster recovers from a significant failure, all services restart simultaneously with no warm caches, competing for scarce cluster capacity. Test your startup behaviour under cluster-wide cold-start conditions — not just under normal rolling restarts.
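The staged rollout lesson above can be sketched as a loop with a health gate between waves. This is a minimal illustration, not Datadog's tooling: `staged_rollout`, `apply_update`, `is_healthy`, and the stage percentages are all hypothetical, and a real system would also wait for a soak period and roll back the nodes it had already touched.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    nodes: Sequence[str],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stage_percents: Sequence[int] = (1, 10, 50, 100),
) -> List[str]:
    """Apply an update in expanding waves, halting at the first failed health gate.

    Returns the nodes that were updated before the rollout finished or halted.
    """
    updated: List[str] = []
    for percent in stage_percents:
        # Always include at least one node so small fleets still make progress.
        target = max(1, len(nodes) * percent // 100)
        for node in nodes[len(updated):target]:
            apply_update(node)
            updated.append(node)
        # Health gate: check every node touched so far before widening the wave.
        if not all(is_healthy(node) for node in updated):
            return updated
    return updated


# Hypothetical fleet where the update breaks networking on every node it touches.
fleet = [f"node-{i}" for i in range(200)]
broken = set()
touched = staged_rollout(fleet, apply_update=broken.add, is_healthy=lambda n: n not in broken)
print(len(touched))  # 2
```

In this toy run the bad update stops after the first 1% wave, leaving 198 of 200 nodes untouched.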

Engineering Glossary

Cilium — a cloud-native networking platform for Kubernetes that uses eBPF to provide networking, security, and observability for containerised workloads. Datadog's Kubernetes control planes were built on Cilium; the systemd-networkd restart cleared the routing tables Cilium had established.

Circular dependency (recovery) — a failure pattern where the recovery mechanism depends on the infrastructure that failed. When Datadog's Kubernetes control plane (which auto-repairs nodes) was itself among the nodes that lost connectivity, the cluster could not self-heal — engineers had to intervene manually.

Graceful degradation — the design principle that when failure occurs, a system should continue to deliver as much value as possible to as many customers as possible, even if it cannot deliver full value. Contrasted with never-fail architecture, which assumes full health and provides no fallback when components fail.

Kubernetes Priority Class — a mechanism telling the Kubernetes scheduler which pods should get compute resources first when the cluster is under resource pressure. Critical for controlling recovery order after a large failure — without explicit priority classes, critical services compete equally with background jobs for scarce resources.

Never-fail architecture — a system design where components must be fully functional for any user-facing feature to work. Any component failure causes total failure of all dependent feature paths. This is the implicit architecture of most services until graceful degradation is explicitly designed in.

Persist early — an architectural pattern where data is written to durable storage as soon as it arrives, before processing. If processing services go down, incoming data is already safely on disk and can be processed retroactively when services recover, without requiring customers to resend it.

Staged rollout — applying a change to a small percentage of infrastructure first, checking health metrics, then expanding gradually. The safety mechanism absent from the March 2023 Datadog deployment that would have caught the systemd-networkd routing issue before it affected the entire global fleet.

systemd-networkd — the systemd component responsible for managing network interfaces on Linux hosts. When it restarted during the OS update, it cleared the network routing table for container workloads previously established by Cilium — the specific interaction that caused 50–60% of Kubernetes nodes to lose connectivity.

This case is a plain-English retelling of publicly available engineering material.
