TechLogStack

Posted on May 20 • Originally published at techlogstack.com on May 17

Cloudflare's Datacenter Partner Failed and the Control Plane Went Dark for 40 Hours

#devops #reliability #architecture #webdev

~40 hours — control plane outage from November 2–4 2023; customers couldn't configure anything
6 hours — time to partial DR restoration; core API and dashboard restored at 17:57 UTC
Data plane kept running — PoPs operated autonomously; existing configs kept working throughout
Log push unavailable for full outage duration — not replicated to DR; some log gaps are permanent
Code Orange — the all-hands incident mobilisation process that didn't exist before this incident
Postmortem published same day; Cloudflare's CEO wrote the first draft from Lisbon that evening

On November 2, 2023, Cloudflare's primary datacenter partner experienced a power failure. The control plane, the system that lets customers configure DNS, firewall rules, and every other Cloudflare service, went dark. It stayed dark, in various forms, for roughly 40 hours. The postmortem introduced a process Cloudflare hadn't had before: Code Orange.

The Story

We have not had such a process in the past, but it's clear today we need to implement a version of it ourselves: Code Orange.

— Cloudflare Engineering, via Post-Mortem on the Cloudflare Control Plane and Analytics Outage, November 2023

Cloudflare operates one of the world's largest content delivery and security networks, with hundreds of PoPs (Points of Presence — Cloudflare's globally distributed server locations that handle actual traffic routing, DDoS mitigation, and content delivery for customers) handling customer traffic across the globe. These PoPs operate largely autonomously from the control plane — once configurations are pushed, the network continues operating even if the control plane has issues. But when the control plane goes down, customers can't configure anything. They can't add DNS records, update firewall rules, change SSL settings, or deploy new Workers. The network keeps running, but it becomes effectively immutable.

The cause was a power failure at Flexential, Cloudflare's primary datacenter partner hosting the control plane infrastructure. Flexential is not a cloud provider — it's a colocation facility where Cloudflare runs its own physical servers. What made this incident severe was that the control plane recovery was not automatic. Failover to Cloudflare's disaster recovery facility required manual orchestration, and some services — particularly raw log delivery — were not replicated to the DR facility and therefore couldn't be recovered until the primary datacenter came back. Services were still not fully restored 40 hours after the initial failure.

The Edge Is Resilient. The Center Is Not.

During the entire 40-hour control plane outage, Cloudflare's data plane continued operating normally — DDoS mitigation, CDN caching, SSL termination, and traffic routing were all functioning. Customers using Cloudflare for traffic performance and security saw no degradation. This is a testament to Cloudflare's edge-resilient architecture — PoPs operate autonomously from the control plane once configured. The outage was exclusively a management plane failure: you couldn't change anything, but what was already configured kept working. Cloudflare's architecture reveals a common pattern in distributed systems: the edges are designed for resilience; the centre is designed for convenience. The November 2023 incident argues that for systems managing global internet infrastructure, the centre must be held to the same resilience standards as the edges.

Problem

Flexential Power Failure at 11:43 UTC Nov 2

Cloudflare's primary datacenter partner experienced a power failure. The control plane — API, dashboard, analytics services — went offline. Edge traffic continued operating normally, but customers could not make any configuration changes. Internal monitoring and log analytics were also impacted.

Cause

Control Plane Not Designed for Autonomous Failover

Unlike Cloudflare's edge network, the control plane was not designed for automatic failover. Recovery required manual orchestration to bring services up at the DR facility. Some data — particularly raw log streams — was not replicated to DR, meaning certain services could not be restored until the primary facility recovered.
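
To make the contrast with manual orchestration concrete, here is a minimal sketch of an automatic failover decision. It is not Cloudflare's design: the `FailoverController` class, the site names, and the threshold of three consecutive failed probes are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailoverController:
    """Promotes the DR site after a run of consecutive failed health probes."""

    failure_threshold: int = 3      # hypothetical: probes that must fail in a row
    active_site: str = "primary"    # hypothetical site names: "primary" and "dr"
    consecutive_failures: int = 0

    def record_probe(self, primary_healthy: bool) -> str:
        """Feed in one probe result and return the site that should be active."""
        if self.active_site != "primary":
            # Already failed over; failing back is a separate, deliberate step.
            return self.active_site
        if primary_healthy:
            self.consecutive_failures = 0
            return self.active_site
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.active_site = "dr"
        return self.active_site


controller = FailoverController(failure_threshold=3)
# A single blip followed by a healthy probe must not trigger failover.
probes = (True, False, True, False, False, False)
history = [controller.record_probe(ok) for ok in probes]
print(history)
```

Requiring several failures in a row is a common way to avoid failing over on a transient blip. The hard part in practice is everything this sketch leaves out: making sure the DR site actually has the state it needs.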

Solution

DR Failover + Manual Service Restoration

Control plane core functionality was restored at the DR facility at 17:57 UTC on Nov 2 — ~6 hours after the incident started. Many customers saw restored API access at this point. However, some services continued to experience issues until Nov 4 as teams worked through recovery of systems that had data in the primary datacenter only.

Result

Full Restoration Nov 4, Code Orange Invented

Services were fully restored at 04:25 UTC on November 4, a little over 40 hours after the initial failure at 11:43 UTC on November 2. The incident prompted Cloudflare to create a new process, Code Orange, modeled on Google's Code Yellow/Red, for major incidents requiring all-hands engineering mobilisation.
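
The "about 6 hours" and "about 40 hours" figures follow directly from the three timestamps quoted in this summary. A short check, using only those timestamps (the variable and function names are just for illustration):

```python
from datetime import datetime, timezone

# Timestamps as quoted in this article, all UTC.
power_failure = datetime(2023, 11, 2, 11, 43, tzinfo=timezone.utc)
dr_restored = datetime(2023, 11, 2, 17, 57, tzinfo=timezone.utc)
fully_restored = datetime(2023, 11, 4, 4, 25, tzinfo=timezone.utc)


def hours_and_minutes(start: datetime, end: datetime) -> tuple:
    """Return the elapsed time between two instants as (hours, minutes)."""
    total_minutes = int((end - start).total_seconds() // 60)
    return divmod(total_minutes, 60)


time_to_dr = hours_and_minutes(power_failure, dr_restored)
time_to_full = hours_and_minutes(power_failure, fully_restored)

print(f"Core API and dashboard back at DR after {time_to_dr[0]}h {time_to_dr[1]}m")
print(f"Full restoration after {time_to_full[0]}h {time_to_full[1]}m")
```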

The Fix

Post-Incident Architecture Changes

The November 2023 control plane outage forced Cloudflare to confront a fundamental architectural gap: the network edge was designed for resilience and independence, but the control plane was not. The fixes needed were architectural. The postmortem identified several categories of required investment: automatic failover for control plane services, expanded data replication to DR, staged rollouts for configuration changes to prevent future single-change outages, and the Code Orange process for mobilising resources during major incidents.

~40h: total outage duration, from the power failure to full restoration of all affected services
6h: time to partial restoration at the DR facility; core API and dashboard restored manually
0: log push gaps that can be recovered; log streams that were not replicated to DR are permanently lost
1: new process created, Code Orange, Cloudflare's all-hands engineering mobilisation protocol

# Conceptual DR architecture requirements derived from the incident
# Every control plane service needs to meet ALL of these criteria

control_plane_service:
  # Recovery requirements
  automatic_failover:
    enabled: true     # no manual orchestration — the 6h delay was manual work
    rto: "< 30 minutes"
    rpo: "< 5 minutes"

  # Data replication requirements
  data_replication:
    primary_dc: "colo-us-east"
    dr_dc: "colo-us-west"
    replication_lag: "< 60s"
    log_streams:
      replicated: true  # log gaps = customer data loss; cannot be recovered
      # The log push service failed this requirement in Nov 2023

  # Configuration safety (action item from this postmortem)
  config_rollout:
    staged: true      # NOT global instant propagation
    health_checks: true
    rollback_trigger: "auto"
    # This item was still incomplete when the Dec 2023 Bot Management outage occurred
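
Requirements like the ones in the block above are only useful if something checks them. The following is a rough sketch of such a check, assuming the YAML has already been loaded into a plain Python dict; the `find_violations` function and the two sample services are hypothetical and exist only for this example.

```python
from __future__ import annotations


def find_violations(service: dict) -> list[str]:
    """Return the requirements from the sketch above that a service fails."""
    violations = []

    if not service.get("automatic_failover", {}).get("enabled", False):
        violations.append("automatic_failover.enabled must be true")

    log_streams = service.get("data_replication", {}).get("log_streams", {})
    if not log_streams.get("replicated", False):
        violations.append("data_replication.log_streams.replicated must be true")

    rollout = service.get("config_rollout", {})
    for key in ("staged", "health_checks"):
        if not rollout.get(key, False):
            violations.append(f"config_rollout.{key} must be true")

    return violations


# Hypothetical service that meets every requirement.
compliant = {
    "automatic_failover": {"enabled": True},
    "data_replication": {"log_streams": {"replicated": True}},
    "config_rollout": {"staged": True, "health_checks": True},
}

# Hypothetical service resembling the log push situation described here:
# manual failover and log streams that exist only in the primary.
primary_only = {
    "automatic_failover": {"enabled": False},
    "data_replication": {"log_streams": {"replicated": False}},
    "config_rollout": {"staged": True, "health_checks": True},
}

print(find_violations(compliant))
print(find_violations(primary_only))
```

Missing keys are treated as failures rather than skipped, so a service that never declared its replication status does not pass by omission.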

Code Orange: The All-Hands Protocol That Didn't Exist

Google has a practice where significant crises trigger a Code Yellow or Code Red — most engineering resources shift to address the issue. Cloudflare had no equivalent process before this incident. The 40-hour outage demonstrated the need for a structured mechanism to mobilise engineering resources across all teams for critical incidents. Code Orange was created as Cloudflare's version: defined criteria for invocation, clear authority chains, cross-team coordination protocols, and explicit criteria for declaring the incident resolved. Before Code Orange, major incidents relied on informal escalation — slower and less coordinated under pressure.

The log push service — Cloudflare's product that delivers raw access logs directly to customer storage buckets in real time — was unavailable for the majority of the outage duration. Unlike the control plane API, which could be brought up at the DR facility using replicated state, the log pipeline infrastructure was primarily hosted in the primary datacenter and not fully replicated to DR. Customers who relied on log push for security monitoring, compliance logging, or billing reconciliation had gaps in their log data that could not be recovered. Cloudflare's postmortem explicitly noted that some datasets would have persistent gaps — data that would never be recovered regardless of DR restoration success.
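
For a customer on the receiving end, the practical question is which time ranges are missing. Here is a minimal sketch of a gap finder, assuming each delivered log batch is described by a `(start, end)` pair of timestamps; that batch format and the sample times are assumptions for illustration, not a description of the log push output.

```python
from datetime import datetime


def find_gaps(batches, window_start, window_end):
    """Return the (start, end) intervals in the window that no batch covers."""
    gaps = []
    cursor = window_start  # everything before the cursor is already covered
    for start, end in sorted(batches):
        if start > cursor:
            gaps.append((cursor, min(start, window_end)))
        cursor = max(cursor, end)
        if cursor >= window_end:
            break
    if cursor < window_end:
        gaps.append((cursor, window_end))
    return gaps


def day(hour: int) -> datetime:
    """Helper for the example: a made-up day, at the given hour."""
    return datetime(2023, 1, 1, hour)


# Hypothetical deliveries: nothing arrived between 01:00 and 04:00.
delivered = [(day(0), day(1)), (day(4), day(6))]
gaps = find_gaps(delivered, window_start=day(0), window_end=day(6))
print(gaps)
```

Running a check like this against your own storage bucket tells you where the holes are; it cannot fill them, which is the point the postmortem makes about persistent gaps.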

The DR facility limitation: why it couldn't handle everything

Cloudflare's disaster recovery facility was able to handle core API and dashboard functionality after the 6-hour manual failover. But each service needed to be evaluated individually for DR completeness: which data is replicated, which processes can be restarted at DR, and what the residual capability is when the primary is down. Services that stored state locally in the primary datacenter without replication could not be restored at DR. DR readiness is not a single binary state — it's a per-service property that requires explicit audit.
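
One way to picture a per-service audit is as a small classification over an inventory. The sketch below is an assumption-heavy illustration: the `ServiceAudit` fields, the three readiness levels, and the values given to each service are invented for the example and are not taken from Cloudflare's audit.

```python
from dataclasses import dataclass
from enum import Enum


class Readiness(Enum):
    READY = "can be restored at DR"
    PARTIAL = "starts at DR with reduced capability"
    BLOCKED = "waits for the primary datacenter"


@dataclass(frozen=True)
class ServiceAudit:
    name: str
    state_replicated_to_dr: bool  # is all required state copied to DR?
    restartable_at_dr: bool       # can its processes be started there?
    usable_without_state: bool    # is it useful at all without that state?


def dr_readiness(audit: ServiceAudit) -> Readiness:
    """Classify one service; DR readiness is per service, not global."""
    if not audit.restartable_at_dr:
        return Readiness.BLOCKED
    if audit.state_replicated_to_dr:
        return Readiness.READY
    return Readiness.PARTIAL if audit.usable_without_state else Readiness.BLOCKED


# Hypothetical inventory; the field values are illustrative only.
inventory = [
    ServiceAudit("api", True, True, True),
    ServiceAudit("dashboard", True, True, True),
    ServiceAudit("log_push", False, True, False),
]

report = {audit.name: dr_readiness(audit) for audit in inventory}
for name, readiness in report.items():
    print(f"{name}: {readiness.value}")
```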

The staged rollout action item — and the December consequence

The November 2023 postmortem explicitly identified staged configuration rollouts as a required improvement: "Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input." The subsequent December 2023 Bot Management outage happened because that work hadn't been completed yet — another global configuration change propagated instantly and caused a systemwide outage affecting 28% of HTTP traffic. This sequence is a stark illustration of why postmortem action items need urgency tracking: the cost of the follow-on incident exceeded the cost of the work that would have prevented it.
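
A staged rollout with automatic rollback can be sketched in a few lines. This is a generic illustration of the technique, not Cloudflare's deployment system: the stage fractions, the `staged_rollout` function, and the `FakeFleet` test double are all assumptions.

```python
def staged_rollout(apply_to_fraction, is_healthy, rollback,
                   stages=(0.01, 0.10, 0.50, 1.0)) -> bool:
    """Widen a change stage by stage; roll back at the first unhealthy stage."""
    for fraction in stages:
        apply_to_fraction(fraction)
        if not is_healthy():
            rollback()
            return False
    return True


class FakeFleet:
    """Stand-in for a fleet of servers, used only to exercise the sketch."""

    def __init__(self, config_is_bad: bool):
        self.config_is_bad = config_is_bad
        self.exposed = 0.0        # fraction of the fleet running the new config
        self.peak_exposure = 0.0  # largest fraction ever exposed
        self.rolled_back = False

    def apply(self, fraction: float) -> None:
        self.exposed = fraction
        self.peak_exposure = max(self.peak_exposure, fraction)

    def healthy(self) -> bool:
        # A bad config shows up as unhealthy as soon as any server runs it.
        return not (self.config_is_bad and self.exposed > 0)

    def rollback(self) -> None:
        self.exposed = 0.0
        self.rolled_back = True


bad = FakeFleet(config_is_bad=True)
bad_result = staged_rollout(bad.apply, bad.healthy, bad.rollback)
print(bad_result, bad.peak_exposure, bad.rolled_back)

good = FakeFleet(config_is_bad=False)
good_result = staged_rollout(good.apply, good.healthy, good.rollback)
print(good_result, good.exposed)
```

In this toy run the bad configuration never reaches more than the first stage. Global instant propagation is the degenerate case where `stages` is just `(1.0,)`.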

Colocation vs cloud: the resilience trade-off made explicit

Cloudflare chose colocation over cloud hosting for the control plane for cost and hardware control reasons. This decision is defensible: cloud hosting at Cloudflare's scale would be extremely expensive. But it comes with a trade-off: a colocation provider supplies space and power, not region failover, so any failover automation has to be built in-house, and until it is built, failover is manual. The November 2023 incident makes this trade-off concrete: 40 hours of control plane impact versus the cost of cloud hosting or of building proper automatic DR. Organizations must make this trade-off explicitly, not accidentally.

Architecture

Cloudflare's architecture has a conceptual split: the data plane (PoPs, edge servers, traffic processing) and the control plane (API, dashboard, configuration management, analytics). The data plane is designed for resilience — distributed across 300+ locations, continuing to serve traffic even if the control plane is unavailable. The control plane was designed for correctness and consolidation — centralized, manageable, cost-efficient. The November 2023 incident exposed the asymmetry.
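
The split can be illustrated with a toy edge node that keeps serving from its last known configuration when the control plane cannot be reached. Everything here (the `EdgeNode` class, the `ControlPlaneUnavailable` exception, the hostname-to-origin config) is a hypothetical simplification, not Cloudflare's implementation.

```python
class ControlPlaneUnavailable(Exception):
    """Raised by the (hypothetical) config fetcher when the control plane is down."""


class EdgeNode:
    """Serves traffic from a local copy of the configuration."""

    def __init__(self, initial_config: dict):
        self._config = dict(initial_config)

    def sync(self, fetch_config) -> bool:
        """Try to pull new config; on failure keep the last known good copy."""
        try:
            new_config = fetch_config()
        except ControlPlaneUnavailable:
            return False
        self._config = dict(new_config)
        return True

    def route(self, hostname: str) -> str:
        """Data plane path: never touches the control plane."""
        return self._config.get(hostname, "no-route")


def control_plane_down():
    raise ControlPlaneUnavailable("primary datacenter lost power")


node = EdgeNode({"example.com": "origin-a"})
synced = node.sync(control_plane_down)
# The sync failed, but existing configuration keeps working.
print(synced, node.route("example.com"))
```

The request path reads only local state, which is why traffic keeps flowing. The same property is why nothing can be changed until `sync` succeeds again.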

Diagram: Cloudflare's Architecture: Data Plane vs Control Plane (interactive version available on TechLogStack)

Diagram: After: Required Control Plane Architecture (DR + Replication) (interactive version available on TechLogStack)

Log Data Persistence: The Gap in DR Planning

The log push service's inability to be restored at the DR facility reveals a gap that is common in operational log pipelines: log data is often considered less critical than application data and receives less investment in replication and DR. But for Cloudflare's enterprise customers, raw access logs are security audit trails, compliance records, and billing evidence. A gap in log data can trigger regulatory issues, security investigations, and customer trust problems that persist long after service is restored. Operational log data is customer data.

Lessons

Expect entire datacenters to fail. This is not a pessimistic assumption; it is a design requirement. Any system that cannot survive the loss of a single datacenter needs to be redesigned. Resilience is not about preventing failure; it's about maintaining operation through it.
Automatic failover (a mechanism that detects failure and routes traffic to backup systems without human intervention) versus manual failover can be the difference between a recovery measured in minutes and the roughly 6 hours it took here. Colocation is cost-efficient, but if it requires manual failover orchestration, the price of that efficiency is availability during rare but consequential failures.
Replicate all data to your DR facility, not just the most critical data. Log pipelines, analytics streams, and operational data that are not replicated to DR leave unrecoverable gaps when the primary datacenter fails. The cost of replication is predictable; the cost of data loss is not.
Large incidents benefit from an explicit all-hands mobilisation process. Google's Code Yellow/Red, Cloudflare's new Code Orange — these are mechanisms for concentrating engineering attention on critical problems without the usual resource negotiation overhead. If your organisation doesn't have an equivalent, the first time you need one will be too late to create it.
Configuration changes to global infrastructure need staged rollouts, not global instant propagation. The November 2023 postmortem explicitly identified this as a required improvement. The subsequent December 2023 global outage from a Bot Management configuration change was a direct consequence of that improvement not yet being implemented. Infrastructure safety systems must be built before the next incident, not after.

Engineering Glossary

Automatic failover: a mechanism that detects failure and routes traffic to backup systems without human intervention. The control plane lacked this, so recovery required manual orchestration and core services took about 6 hours to return.

Code Orange: Cloudflare's all-hands engineering mobilisation protocol for major incidents, created as a direct result of this outage. Modelled on Google's Code Yellow/Red. Defines criteria for invocation, authority chains, coordination protocols, and resolution criteria.

Colocation: a datacenter facility where a company runs its own physical servers in space rented from a third-party provider. It is typically cheaper than cloud at large scale and gives hardware control, but the provider does not supply automated region failover, so failover between facilities has to be engineered (or performed manually) by the tenant.

Control plane: the management layer of Cloudflare's infrastructure: API, dashboard, configuration management, and analytics. Used by customers to configure DNS records, firewall rules, SSL settings, and Workers. Operates separately from the data plane, so when the control plane goes down, existing configurations keep working but nothing can be changed.

Data plane: Cloudflare's PoP network that handles actual traffic routing, DDoS mitigation, CDN caching, and SSL termination. Designed for autonomous operation independent of the control plane. Continued operating normally throughout the 40-hour control plane outage.

DR (Disaster Recovery) facility: Cloudflare's backup datacenter to which the control plane could be failed over during a primary datacenter failure. Required about 6 hours of manual orchestration to activate. Could not restore services (like log push) that weren't replicated from the primary.

This case is a plain-English retelling of publicly available engineering material.
