DEV Community

TechLogStack

Posted on • Originally published at techlogstack.com on

Cloudflare's Datacenter Partner Failed and the Control Plane Went Dark for 40 Hours

Cloudflare · Reliability · 17 May 2026

On November 2, 2023, Cloudflare's primary datacenter partner experienced a power failure. The control plane (the system that lets customers configure DNS, firewall rules, and every other Cloudflare service) went dark. It stayed dark, in various forms, for roughly 40 hours. The postmortem introduced a concept Cloudflare hadn't had before: Code Orange.

  • Nov 2–4 2023 outage
  • ~40 hours control plane down
  • Flexential datacenter failure
  • DR restored at 17:57 Nov 2
  • Log push unavailable for most of the outage
  • Code Orange process created

The Story

We have not had such a process in the past, but it's clear today we need to implement a version of it ourselves: Code Orange.

Source: Cloudflare Engineering, via Post-Mortem on the Cloudflare Control Plane and Analytics Outage, November 2023

Cloudflare operates one of the world's largest content delivery and security networks, with hundreds of PoPs (Points of Presence — Cloudflare's globally distributed server locations that handle actual traffic routing, DDoS mitigation, and content delivery for customers) handling customer traffic across the globe. These PoPs operate largely autonomously from the control plane — once configurations are pushed, the network continues operating even if the control plane has issues. But when the control plane goes down, customers can't configure anything. They can't add DNS records, update firewall rules, change SSL settings, or deploy new Workers. The network keeps running, but it becomes effectively immutable. On November 2, 2023, at 11:43 UTC, that's exactly what happened.

The cause was a power failure at Flexential, Cloudflare's primary datacenter partner hosting the control plane infrastructure. Flexential is not a cloud provider; it is a colocation facility where Cloudflare runs its own physical servers. Power failures in colocation facilities, while rare, do happen. What made this incident severe was that control plane recovery was not automatic. Failover to Cloudflare's disaster recovery facility required manual orchestration, and some services, particularly raw log delivery, were not replicated to the DR facility and therefore couldn't be recovered until the primary datacenter came back. Services were not fully restored until about 40 hours after the initial failure.

🏭

Cloudflare's control plane infrastructure ran in a colocation datacenter — physical servers in a facility owned by a third party, not in a public cloud. This architecture gives Cloudflare hardware control and cost efficiency but means datacenter-level failures require manual failover coordination rather than cloud provider automated region failover.

Problem

Flexential Power Failure at 11:43 UTC Nov 2

Cloudflare's primary datacenter partner experienced a power failure. The control plane — API, dashboard, analytics services — went offline. Edge traffic continued operating normally (PoPs are autonomous), but customers could not make any configuration changes. Internal monitoring and log analytics were also impacted.


Cause

Control Plane Not Designed for Autonomous Failover

Unlike Cloudflare's edge network, the control plane was not designed for automatic failover. Recovery required manual orchestration to bring services up at the DR facility. Some data — particularly raw log streams — was not replicated to DR, meaning certain services could not be restored until the primary facility recovered.


Solution

DR Failover + Manual Service Restoration

Control plane core functionality was restored at the DR facility at 17:57 UTC on Nov 2 — ~6 hours after the incident started. Many customers saw restored API access at this point. However, some services continued to experience issues until Nov 4 as teams worked through recovery of systems that had data in the primary datacenter only.


Result

Full Restoration Nov 4, Code Orange Invented

Services were fully restored at 04:25 UTC on November 4, just over 40 hours after the initial failure. The incident prompted Cloudflare to create a new process, Code Orange, modeled on Google's Code Yellow/Red, for major incidents requiring all-hands engineering mobilization.
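The durations quoted in this timeline follow directly from the three UTC timestamps reported in this article. As a quick check of the arithmetic (the helper name `hours_between` is only illustrative):

```python
from datetime import datetime, timezone

# Timestamps as reported in this article (all UTC).
power_failure = datetime(2023, 11, 2, 11, 43, tzinfo=timezone.utc)
dr_restored = datetime(2023, 11, 2, 17, 57, tzinfo=timezone.utc)
fully_restored = datetime(2023, 11, 4, 4, 25, tzinfo=timezone.utc)


def hours_between(start: datetime, end: datetime) -> float:
    """Elapsed time in hours, rounded to one decimal place."""
    return round((end - start).total_seconds() / 3600, 1)


time_to_dr = hours_between(power_failure, dr_restored)
total_outage = hours_between(power_failure, fully_restored)

print(f"Failure to DR restoration:   {time_to_dr} h")    # 6.2 h
print(f"Failure to full restoration: {total_outage} h")  # 40.7 h
```

So the "~6 hours" and "~40 hours" figures are 6 h 14 min and 40 h 42 min respectively.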


CODE ORANGE: A NEW INCIDENT PROCESS IS BORN

Google has a practice where significant crises trigger a Code Yellow or Code Red — most engineering resources are shifted to address the issue. Cloudflare had no equivalent process before this incident. The 40-hour outage demonstrated the need for a structured mechanism to mobilize engineering resources across all teams for critical incidents. Code Orange was created as Cloudflare's version of this concept. The process includes defined escalation paths, cross-team coordination protocols, and clear criteria for when to invoke it.

The log push service — Cloudflare's product that delivers raw access logs directly to customer storage buckets in real time — was unavailable for the majority of the outage duration. Unlike the control plane API, which could be brought up at the DR facility using replicated state, the log pipeline infrastructure was primarily hosted in the primary datacenter and not fully replicated to DR. Customers who relied on log push for security monitoring, compliance logging, or billing reconciliation had gaps in their log data that could not be recovered. Cloudflare's postmortem explicitly noted that some datasets which are not replicated in the EU would have persistent gaps — data that would never be recovered regardless of DR restoration success.
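For a customer on the receiving end, the practical question is which time ranges are missing. The sketch below is one way to find them, assuming each delivered log batch can be described by a (start, end) time window. The batch windows are made-up illustration data, and `find_gaps` is a hypothetical helper, not part of any Cloudflare API.

```python
from datetime import datetime


def find_gaps(batches, window_start, window_end):
    """Return (gap_start, gap_end) intervals not covered by any delivered batch."""
    gaps = []
    cursor = window_start  # everything before the cursor is known to be covered
    for start, end in sorted(batches):
        if start > cursor:
            gaps.append((cursor, min(start, window_end)))
        cursor = max(cursor, end)
        if cursor >= window_end:
            break
    if cursor < window_end:
        gaps.append((cursor, window_end))
    return gaps


# Hypothetical delivered batches: delivery stops at 11:43 and resumes at 13:30.
day = (2023, 11, 2)
batches = [
    (datetime(*day, 10, 0), datetime(*day, 11, 0)),
    (datetime(*day, 11, 0), datetime(*day, 11, 43)),
    (datetime(*day, 13, 30), datetime(*day, 14, 0)),
]
gaps = find_gaps(batches, datetime(*day, 10, 0), datetime(*day, 14, 0))
for gap_start, gap_end in gaps:
    print(f"missing logs: {gap_start:%H:%M} to {gap_end:%H:%M}")
```

A check like this only tells you that data is missing. Whether it can be backfilled depends on whether the provider still holds the source data, which is the part that failed here.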

⚠️

The 'Expect Entire Datacenters to Fail' Principle

The postmortem contained a striking engineering principle statement: we must expect that entire data centers may fail. This is a design requirement, not a risk acceptance. Any system that would be materially impaired by a single datacenter failure needs to be redesigned so that it either operates independently of any single datacenter or fails over automatically when one goes dark. The control plane architecture had not been held to this standard — and the 40-hour outage was the consequence.
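"Fails over automatically" usually comes down to a health-check loop that promotes the standby site without waiting for a human. The following is a minimal sketch of that decision logic, not Cloudflare's implementation: the class name, the threshold of three consecutive failures, and the site labels are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Promotes the DR site after N consecutive failed health checks."""

    failure_threshold: int = 3  # assumed value; tune to avoid flapping
    active_site: str = "primary"
    consecutive_failures: int = 0

    def record_check(self, primary_healthy: bool) -> str:
        """Record one health-check result and return the site that should serve."""
        if self.active_site == "dr":
            # Failing back is deliberately left as a manual step in this sketch.
            return self.active_site
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active_site = "dr"
        return self.active_site


monitor = FailoverMonitor(failure_threshold=3)
checks = [True, False, True, False, False, False, True]
history = [monitor.record_check(ok) for ok in checks]
print(history)
```

Requiring several consecutive failures trades a slightly slower reaction for protection against failing over on a single dropped probe. A real system would also need the DR site to hold current data, which is the replication problem discussed elsewhere in this article.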

📋

Questions for Flexential Still Outstanding

As of the postmortem's publication, Cloudflare stated that it had a number of questions it needed Flexential to answer. A power failure of this duration at a major colocation facility raises questions about redundant power systems, UPS capacity, diesel generator performance, and facility operations procedures. The postmortem was transparent about this outstanding accountability, an unusual admission that the root cause investigation wasn't complete.

ℹ️

The Data Plane Continued Running

During the entire 40-hour control plane outage, Cloudflare's data plane continued operating normally — DDoS mitigation, CDN caching, SSL termination, and traffic routing were all functioning. Customers using Cloudflare for traffic performance and security saw no degradation. This is a testament to Cloudflare's edge-resilient architecture — PoPs operate autonomously from the control plane once configured. The outage was exclusively a management plane failure: you couldn't change anything, but what was already configured kept working.
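The "already configured keeps working" behavior can be illustrated with a node that keeps its last-known-good configuration when a refresh fails. This is a simplified sketch under assumed names (`EdgeNode`, `ControlPlaneUnavailable`, and the `blocked_paths` setting are all hypothetical), not a description of how Cloudflare's PoPs are built.

```python
class ControlPlaneUnavailable(Exception):
    """Raised by the config fetcher when the control plane cannot be reached."""


class EdgeNode:
    def __init__(self, fetch_config):
        self._fetch_config = fetch_config  # callable returning a dict, or raising
        self._last_good = None

    def refresh(self) -> bool:
        """Try to pull new config; keep the previous one if the control plane is down."""
        try:
            self._last_good = self._fetch_config()
            return True
        except ControlPlaneUnavailable:
            return False

    def handle_request(self, path: str) -> str:
        """Serve traffic using whatever configuration was last received."""
        if self._last_good is None:
            raise RuntimeError("no configuration has ever been received")
        blocked = self._last_good.get("blocked_paths", [])
        return "403 blocked" if path in blocked else "200 ok"


def make_flaky_control_plane():
    """Fake control plane: the first fetch succeeds, every later fetch fails."""
    calls = {"n": 0}

    def fetch():
        calls["n"] += 1
        if calls["n"] > 1:
            raise ControlPlaneUnavailable("primary datacenter offline")
        return {"blocked_paths": ["/admin"]}

    return fetch


node = EdgeNode(make_flaky_control_plane())
first = node.refresh()   # succeeds, config is cached
second = node.refresh()  # fails, cached config is kept
results = (first, second, node.handle_request("/admin"), node.handle_request("/index.html"))
print(results)
```

The firewall rule keeps being enforced after the control plane disappears, but nothing can change it, which matches the "effectively immutable" state described above.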

Analytics and Dashboard Dark for Enterprise Customers

For Cloudflare's enterprise customers, the control plane outage had real operational consequences beyond configuration changes. Security dashboards showing live attack traffic, WAF logs, DNS analytics, and firewall event monitoring were all unavailable. During a 40-hour window when customers couldn't see what was happening on their infrastructure, security teams had reduced visibility precisely when they might have needed it most. The monitoring and analytics darkness was in some ways more operationally painful than the configuration lock.

THE CUSTOMER EXPERIENCE ASYMMETRY

Cloudflare's customer base experienced the outage very differently based on what they used Cloudflare for. Performance-focused customers (CDN, caching) saw nothing: their traffic ran fine. Security-focused customers (WAF, DDoS mitigation) had protection but lost visibility into attacks. Developer customers (Workers, Pages, DNS) were locked out of deploying changes until control plane access returned. Analytics-dependent customers had data gaps that couldn't be recovered. Same outage, four different impact profiles.


The Fix

Post-Incident Architecture Changes

The November 2023 control plane outage forced Cloudflare to confront a fundamental architectural gap: the network edge (PoPs) was designed for resilience and independence, but the control plane was not. The fixes needed were architectural — not configuration tweaks. The postmortem identified several categories of required investment: automatic failover for control plane services, expanded data replication to DR, staged rollouts for configuration changes to prevent future single-change outages, and the Code Orange process for mobilizing resources during major incidents.

  • ~40h — Total outage duration from power failure to full service restoration across all affected services — November 2–4, 2023
  • 6h — Time to partial restoration at DR facility — core API and dashboard functions restored at 17:57 UTC on November 2 after the 11:43 UTC failure
  • 0 — Data recovery possible for log push gaps — certain log streams not replicated to DR resulted in permanent data loss for the outage window
  • 1 — New process created: Code Orange — Cloudflare's all-hands engineering mobilization protocol for major incidents, modeled on Google's Code Yellow/Red

```yaml
# Conceptual DR architecture requirements derived from the incident
# Every control plane service needs to meet these criteria:

control_plane_service:
  # Recovery requirements
  automatic_failover:
    enabled: true              # No manual orchestration needed
    rto: "< 30 minutes"        # Recovery Time Objective
    rpo: "< 5 minutes"         # Recovery Point Objective

  # Data replication requirements
  data_replication:
    primary_dc: "colo-us-east"
    dr_dc: "colo-us-west"      # replicated to DR
    replication_lag: "< 60s"   # data freshness at DR
    log_streams:               # ALL streams, not just some
      replicated: true         # persistent log gaps = customer data loss

  # Configuration safety
  config_rollout:
    staged: true               # NOT global instant propagation
    health_checks: true        # validate before wider rollout
    rollback_trigger: "auto"   # revert on health check failure
```

THE CONFIGURATION ROLLOUT PROBLEM (FORESHADOWING)

Cloudflare's incident follow-ups returned to this theme later. On November 18, 2025, about two years after this outage, a Bot Management configuration file that propagated globally caused another major outage. That incident's postmortem listed the action item 'Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input'. Staged rollouts with health checks, rather than global instant propagation, are the kind of safeguard that limits the blast radius of a bad configuration file.
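The staged rollout idea (apply to a small slice, check health, widen, and revert automatically on failure) can be sketched in a few lines. The stage names, the callback signatures, and the `staged_rollout` function are assumptions made for illustration; this is not Cloudflare's deployment tooling.

```python
def staged_rollout(stages, apply, is_healthy, rollback):
    """Apply a change stage by stage; revert every applied stage on the first failure."""
    applied = []
    for stage in stages:
        apply(stage)
        applied.append(stage)
        if not is_healthy(stage):
            # Undo in reverse order so the most recent change is reverted first.
            for done in reversed(applied):
                rollback(done)
            return {"status": "rolled_back", "failed_stage": stage}
    return {"status": "complete", "failed_stage": None}


# Hypothetical run: the change looks fine on the canary but breaks "region-1".
log = []
result = staged_rollout(
    stages=["canary", "region-1", "global"],
    apply=lambda stage: log.append(f"apply:{stage}"),
    is_healthy=lambda stage: stage != "region-1",
    rollback=lambda stage: log.append(f"rollback:{stage}"),
)
print(result)
print(log)
```

The point of the design is visible in the log: the "global" stage is never touched, so a bad file reaches only the stages that were meant to catch it.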

Code Orange: The All-Hands Protocol

Cloudflare's new Code Orange process, modeled on Google's Code Yellow/Red, provides a structured mechanism for major incidents. When Code Orange is declared, most or all engineering resources shift to the incident. The process defines: who has authority to declare Code Orange, how resources are mobilized, what communication protocols apply, and what the criteria are for declaring the incident resolved. Before Code Orange, major incidents relied on informal escalation — which is slower and less coordinated under pressure.

⚠️

The Colocation Trade-Off

Cloudflare chose colocation over cloud hosting for the control plane for cost and hardware control reasons. This decision is defensible — cloud hosting at Cloudflare's scale would be extremely expensive. But it comes with a trade-off: manual failover instead of automated region failover. The November 2023 incident makes this trade-off concrete: 40 hours of control plane impact versus the cost of cloud hosting or of building proper automatic DR. Organizations must make this trade-off explicitly, not accidentally.

ℹ️

The DR Facility Limitation

Cloudflare's disaster recovery facility was able to handle core API and dashboard functionality after the 6-hour manual failover. But it could not handle everything. Services that stored state locally in the primary datacenter without replication — including the log push pipeline — could not be restored at DR. The architectural lesson: DR readiness is not a single binary state. Each service needs to be evaluated individually for DR completeness: which data is replicated, which processes can be restarted at DR, and what the residual capability is when the primary is down.
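A per-service evaluation like the one described above can start as a simple inventory audit. The sketch below assumes three yes/no questions per service. The service names and their answers are hypothetical examples chosen to mirror the narrative, not Cloudflare's actual inventory.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Service:
    name: str
    data_replicated_to_dr: bool
    can_start_at_dr: bool
    failover_tested: bool


def dr_gaps(service: Service) -> List[str]:
    """List the reasons this service would not be fully recoverable at DR."""
    gaps = []
    if not service.data_replicated_to_dr:
        gaps.append("data not replicated to DR")
    if not service.can_start_at_dr:
        gaps.append("processes cannot be started at DR")
    if not service.failover_tested:
        gaps.append("failover never tested")
    return gaps


# Hypothetical inventory, for illustration only.
inventory = [
    Service("api", data_replicated_to_dr=True, can_start_at_dr=True, failover_tested=True),
    Service("dashboard", data_replicated_to_dr=True, can_start_at_dr=True, failover_tested=False),
    Service("log-push", data_replicated_to_dr=False, can_start_at_dr=False, failover_tested=False),
]

report: Dict[str, List[str]] = {svc.name: dr_gaps(svc) for svc in inventory}
for name, gaps in report.items():
    print(f"{name}: {'DR ready' if not gaps else '; '.join(gaps)}")
```

Reporting the specific gaps rather than a single ready/not-ready flag keeps the audit aligned with the lesson that DR readiness is not binary.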

🔍

Questions for Flexential Still Open

Cloudflare's postmortem noted they had outstanding questions for Flexential about the power failure — specifically about the adequacy of redundant power systems, UPS performance, and datacenter operations. This transparency is notable: Cloudflare publicly acknowledged that they didn't yet have the full picture of why a colocation facility experienced a power failure significant enough to take down their control plane for 40 hours, and that accountability from the vendor was part of the recovery process.


Architecture

Cloudflare's architecture has a conceptual split: the data plane (PoPs, edge servers, traffic processing) and the control plane (API, dashboard, configuration management, analytics). The data plane is designed for resilience — distributed across 300+ locations, able to continue serving traffic even if the control plane is unavailable. The control plane was designed for correctness and consolidation — centralized, manageable, cost-efficient. The November 2023 incident exposed the asymmetry: data plane is nearly indestructible; control plane is a single point of failure.

Cloudflare's Architecture: Data Plane vs Control Plane

View interactive diagram on TechLogStack →


After: Required Control Plane Architecture (DR + Replication)

View interactive diagram on TechLogStack →


THE EDGE IS RESILIENT. THE CENTER IS NOT.

Cloudflare's network architecture reflects a common pattern in distributed systems: the edges are designed for resilience; the center is designed for convenience. Edge nodes must work autonomously during control plane outages. Centralized control planes are often designed for operational simplicity rather than failure resilience — because failures at the center are rare and the cost of the engineering to harden them is high. The November 2023 incident makes the explicit argument that for systems managing global internet infrastructure, the center must be designed with the same resilience standards as the edges.

ℹ️

Log Data Persistence: A Gap in DR Planning

The log push service's inability to be restored at the DR facility reveals a gap in DR planning that is common in operational log pipelines: log data is often considered less critical than application data and receives less investment in replication and DR. But for Cloudflare's enterprise customers, raw access logs are security audit trails, compliance records, and billing evidence. A gap in log data can trigger regulatory issues, security investigations, and customer trust problems that persist long after service is restored.

⚠️

Colocation vs Cloud: The Resilience Trade-Off Made Explicit

The Cloudflare control plane outage makes concrete a trade-off that infrastructure teams often make implicitly: colocation is cheaper and gives hardware control, but failover between datacenters has to be designed, built, and operated in-house. Public cloud providers offer managed multi-region building blocks, typically at significantly higher cost. The November 2023 incident is a data point for how expensive the colocation approach can be during a rare power failure: roughly 40 hours of control plane impact, customer trust damage, and the engineering investment required to build the missing DR capability afterwards.


Lessons

The November 2023 Cloudflare control plane outage is a master class in the asymmetry between designing edges and designing centers. The edge was fine. The center failed. The lesson is to apply the same resilience standards to both.

  1. Expect entire data centers to fail. This is not a pessimistic assumption; it is a design requirement. Any system that cannot survive the loss of a single datacenter needs to be redesigned. Resilience is not about preventing failure; it's about maintaining operation through it.
  2. Automatic failover (a mechanism that detects failure and routes traffic to backup systems without human intervention) versus manual failover can be the difference between a recovery measured in minutes and the roughly 6-hour recovery seen here. Colocation is cost-efficient, but if it requires manual failover orchestration, the price of that efficiency is availability during rare but consequential failures.
  3. Replicate all data to your DR facility, not just the most critical data. Log pipelines, analytics streams, and operational data that are not replicated to DR leave unrecoverable gaps when the primary datacenter fails. The cost of replication is fixed; the cost of data loss is unbounded.
  4. Large incidents benefit from an explicit all-hands mobilization process. Google's Code Yellow/Red and Cloudflare's new Code Orange are mechanisms for concentrating engineering attention on critical problems without the usual resource negotiation overhead. If your organization doesn't have an equivalent, the first time you need one will be too late to create it.
  5. Configuration changes to global infrastructure need staged rollouts, not global instant propagation. Cloudflare's November 18, 2025 outage, caused by a globally propagated Bot Management configuration file, showed what happens without them. Infrastructure safety systems must be built before the next incident, not after.

⚠️

Staged Rollouts and the November 2025 Outage

On November 18, 2025, a Bot Management configuration file propagated globally and caused a network-wide outage. The postmortem for that incident called for hardening how Cloudflare-generated configuration files are ingested. The sequence illustrates why safeguards such as staged configuration rollouts need urgency tracking: the cost of an incident can far exceed the cost of the work that would have prevented it.

DATA REPLICATION IS NOT OPTIONAL FOR LOGS

The log push service's permanent data gaps during the outage elevated a truth that gets underinvested in: operational log data is customer data. For security audits, compliance requirements, billing reconciliation, and forensic investigation, log streams are as important as application data. Organizations that treat logs as ephemeral operational noise rather than durable customer data will find this assumption tested during major incidents.

Cloudflare kept routing traffic for a large share of the internet the entire time its control plane was dark, which proves the edge works and proves the center matters, simultaneously.

TechLogStack — built at scale, broken in public, rebuilt by engineers


This case is a plain-English retelling of publicly available engineering material.

Read the full case on TechLogStack → (interactive diagrams, source links, and the full reader experience).
