- 3rd configuration-related global outage in the 2023–2025 period — same root cause each time
- December 2025 — React CVE fix → testing tool error → global killswitch → HTTP 500 across the network
- Same-day postmortem published — Cloudflare's consistency maintained even when it revealed a repeated pattern
- "Months" — the estimated implementation time for staged rollouts, as quoted in the November 2023 postmortem that preceded this outage
- Priority #1 — CTO Dane Knecht's public commitment to staged configuration rollouts after the third incident
- The staged rollout fix that would have prevented this outage was identified two years and two incidents earlier
In late 2025, Cloudflare was rolling out a fix for a React security vulnerability. The fix caused an error in an internal testing tool, so the team disabled the tool with a global killswitch. The killswitch unexpectedly triggered a bug that sent HTTP 500 errors across Cloudflare's global network. This was the third major configuration-related global outage in two years.
The Story
In this latest outage, Cloudflare was burnt by yet another global configuration change. The previous outage in November happened thanks to a global database permissions change. This change would make it so that Cloudflare's configuration files do not propagate immediately to the full network, as they still do now. But making all global configuration files have staged rollouts is a large implementation that could take months. Evidently, there wasn't time to make it yet, and it has come back to bite Cloudflare.
— The Pragmatic Engineer newsletter, analysis of the Cloudflare December 2025 outage
By December 2025, Cloudflare had experienced two major configuration-related global outages and had identified staged configuration rollouts as the primary systemic fix. That fix was still not fully implemented. Then came a React CVE (a Common Vulnerabilities and Exposures report for a security flaw in the React JavaScript library — CVEs trigger mandatory patching workflows across the industry). Cloudflare was deploying a fix for it in their internal tooling. The patch introduced an error in an internal testing tool. The team disabled the testing tool with a global killswitch. That killswitch, unexpectedly, triggered a bug in an unrelated code path — causing HTTP 500 errors across Cloudflare's entire network.
The pattern was impossible to ignore. Cloudflare had experienced multiple major outages in the 2023–2025 period, each in the same root-cause category: a configuration change propagated globally and instantly, without a staged rollout, and caused unexpected systemic failures. The primary action item from the November 2023 Bot Management outage, implementing staged configuration rollouts, had been explicitly identified as a large implementation that could take months. Each new outage was the price of that implementation not yet being complete.
Problem
React CVE Fix Introduces Testing Tool Error
Cloudflare was rolling out a fix for a React security vulnerability in internal tooling. The fix caused an error in an internal testing tool, prompting the team to disable the tool. The disable was executed as a global configuration change via killswitch.
Cause
Killswitch Triggered Unexpected Code Path Bug
The global killswitch that disabled the testing tool unexpectedly triggered a bug in a code path unrelated to the testing tool itself. The bug caused HTTP 500 errors across Cloudflare's network. Because the killswitch was propagated globally and instantly, the impact was immediate and global.
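How a disable flag can break code that nobody expected it to touch is easier to see in miniature. The sketch below is purely hypothetical and is not Cloudflare's code: the names `evaluate_rule`, `summarize`, `summarize_safely`, and the `killswitched` key are invented for illustration. The flag makes the evaluator skip a rule, while a downstream step still assumes that every rule produced a result.

```python
from typing import Optional


def evaluate_rule(rule: dict) -> Optional[dict]:
    """Return a result for the rule, or None when the rule is killswitched."""
    if rule.get("killswitched"):
        return None  # rule skipped: no result object is produced
    return {"rule_id": rule["id"], "matched": True}


def summarize(rules: list[dict]) -> list[str]:
    """Latent bug: assumes every rule produced a result."""
    return [evaluate_rule(rule)["rule_id"] for rule in rules]


def summarize_safely(rules: list[dict]) -> list[str]:
    """Same summary, but tolerates rules that were skipped."""
    results = (evaluate_rule(rule) for rule in rules)
    return [result["rule_id"] for result in results if result is not None]
```

In this toy version, `summarize` works for years as long as no rule is ever killswitched, then raises `TypeError` the first time one is. A canary stage would surface that exception on a small slice of traffic instead of everywhere at once.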
Solution
Revert Killswitch Configuration
The fix was to revert the killswitch configuration — undoing the disable of the testing tool that had triggered the bug. This brought Cloudflare's network back to its pre-fix state. The React CVE patch then needed to be reworked to avoid triggering the testing tool error.
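A revert like this is simplest when configuration is versioned. As a minimal sketch, assuming an append-only history of snapshots (the `VersionedConfig` class and the `testing_tool_enabled` key are hypothetical, not taken from Cloudflare's systems), a revert re-publishes an earlier snapshot as a new version rather than rewriting history:

```python
from dataclasses import dataclass, field


@dataclass
class VersionedConfig:
    """Append-only history of configuration snapshots (hypothetical)."""

    history: list[dict] = field(default_factory=lambda: [{}])

    @property
    def current(self) -> dict:
        return self.history[-1]

    def apply(self, changes: dict) -> int:
        """Record a new version and return its version number."""
        self.history.append({**self.current, **changes})
        return len(self.history) - 1

    def revert_to(self, version: int) -> int:
        """Revert by re-publishing an old snapshot as a new version."""
        if not 0 <= version < len(self.history):
            raise ValueError(f"unknown version {version}")
        self.history.append(dict(self.history[version]))
        return len(self.history) - 1


cfg = VersionedConfig()
good = cfg.apply({"testing_tool_enabled": True})
cfg.apply({"testing_tool_enabled": False})  # the killswitch
cfg.revert_to(good)                         # back to the known-good state
```

Because the revert is itself a new version, it can travel through the same rollout and audit path as any other change.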
Result
Service Restored, Pattern Acknowledged
Service was restored after reverting the configuration. The postmortem was published on the same day. CTO Dane Knecht acknowledged the pattern publicly and committed to making enhanced rollouts and versioning "the first priority across the organisation" — the same commitment made after the 2023 outages, now with resource allocation and deadline commitment attached.
The Fix
The Systemic Fix: Enhanced Rollouts and Versioning
Cloudflare's CTO described the required fix as "Enhanced Rollouts and Versioning" — applying the same safety and blast mitigation features to configuration data that Cloudflare already applies to software deployments. Software at Cloudflare is deployed gradually, with strict health validation at each stage. Configuration changes had no equivalent safety system. The fix required building one.
- 3rd — configuration-related global outage in 2023–2025; each traceable to instant global config propagation without safety gates
- Months — estimated implementation time for staged rollouts as quoted in the November 2023 postmortem; the duration that allowed two more outages to occur
- Same day — postmortem publication time; Cloudflare's consistency maintained even when it revealed a repeated failure to implement a known fix
- Priority #1 — stated organisational priority for staged configuration rollouts after the December 2025 outage, now with named ownership and deadline commitment
```python
# The required Enhanced Rollouts and Versioning system (illustrative sketch)
# Key design: distinguishes security-critical changes (fast) from config changes (staged)
from dataclasses import dataclass
from enum import Enum, auto

class ConfigChangeType(Enum):
    SECURITY_CRITICAL = auto()
    STANDARD = auto()

@dataclass
class ConfigChange:
    name: str
    type: ConfigChangeType

class ConfigRolloutEngine:
    def deploy_change(self, change: ConfigChange) -> None:
        # Security-critical changes: DDoS mitigations, attack signatures
        # Still fast, but with a validation gate to catch malformed configs
        if change.type == ConfigChangeType.SECURITY_CRITICAL:
            self._validate_config(change)     # must pass before any propagation
            self._deploy_global_fast(change)  # then deploy fast
            return

        # All other changes: staged rollout with health gates
        # This is the path a killswitch would have taken
        self._validate_config(change)
        # Stage 1: 1% canary, to catch the killswitch bug here, not globally
        self._deploy_to_percentage(change, pct=0.01)
        self._wait_and_check_health(minutes=5)
        # Stage 2: 10% cohort
        self._deploy_to_percentage(change, pct=0.10)
        self._wait_and_check_health(minutes=5)
        # Stage 3: 50% cohort
        self._deploy_to_percentage(change, pct=0.50)
        self._wait_and_check_health(minutes=10)
        # Stage 4: Full rollout, only after all health gates pass
        self._deploy_global(change)

    def _validate_config(self, change: ConfigChange) -> None:
        # Size limits, schema validation, semantic checks
        # Catches malformed configs before any propagation occurs
        pass

    def _wait_and_check_health(self, minutes: int) -> None:
        # Error rate, latency, traffic drop metrics
        # Auto-rollback if thresholds exceeded at any stage
        pass

    # Deployment primitives are left as stubs; the source does not describe them
    def _deploy_to_percentage(self, change: ConfigChange, pct: float) -> None:
        pass

    def _deploy_global_fast(self, change: ConfigChange) -> None:
        pass

    def _deploy_global(self, change: ConfigChange) -> None:
        pass
```
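The health gate is left as a stub in the sketch above. One way it might behave, under stated assumptions, is shown here: the `deploy`, `error_rate`, and `rollback` callables and the 1% error threshold are hypothetical stand-ins for real deployment and metrics systems, and the soak time between stages is omitted for brevity.

```python
from typing import Callable

STAGES = [0.01, 0.10, 0.50, 1.00]  # same cohort sizes as the staged path above


def staged_rollout(
    deploy: Callable[[float], None],
    error_rate: Callable[[], float],
    rollback: Callable[[], None],
    max_error_rate: float = 0.01,  # assumed threshold, for illustration only
) -> bool:
    """Expand stage by stage; roll back and stop at the first unhealthy stage."""
    for pct in STAGES:
        deploy(pct)
        if error_rate() > max_error_rate:
            rollback()
            return False
    return True


# Simulated run: the change looks healthy at 1% but fails at 10%.
deployed: list[float] = []


def fake_error_rate() -> float:
    return 0.0 if deployed[-1] < 0.10 else 0.25


ok = staged_rollout(deployed.append, fake_error_rate, lambda: deployed.append(0.0))
```

In the simulated run the rollout stops at the 10% stage and rolls back, so the 50% and 100% stages never see the change.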
The three-outage timeline
Cloudflare's 2023–2025 configuration incidents follow a precise pattern: (1) a routine operational change is made to production infrastructure, (2) the change has unexpected downstream effects, (3) the affected configuration is propagated globally and instantly, (4) the impact is global and immediate. The November 2023 Bot Management outage — a database permissions change — was the first. The December 2025 React outage was the third. Each postmortem identified staged rollouts as the fix. The fix was acknowledged as a large implementation requiring months. Two outages occurred in that window.
Postmortem action items need owners, resources, and deadlines
The Cloudflare staged rollout story is one of the industry's clearest examples of what happens when postmortem action items are treated as backlog entries rather than critical debt. The November 2023 postmortem identified the fix. Two subsequent incidents demonstrated the cost of not implementing it. Engineering organisations need mechanisms to track postmortem action items with urgency — including escalation paths when critical action items age without progress. The difference between "we identified the need for staged rollouts" and "engineer X owns staged rollouts with Y engineers and a Q1 deadline" is the difference between an action item that ages and one that gets done.
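An escalation mechanism of this kind can be very small. The following is a sketch under assumptions, not a description of any real tracker: the `ActionItem` fields and the 90-day ageing threshold are invented for illustration. It flags any open item that lacks an owner or deadline, is past its deadline, or has simply been open too long.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    title: str
    opened: date
    owner: Optional[str] = None
    deadline: Optional[date] = None
    done: bool = False


def needs_escalation(item: ActionItem, today: date, max_age_days: int = 90) -> bool:
    """Flag open items that are unowned, undated, overdue, or ageing."""
    if item.done:
        return False
    if item.owner is None or item.deadline is None:
        return True  # no owner or deadline: a backlog entry, not a commitment
    overdue = today > item.deadline
    too_old = (today - item.opened).days > max_age_days
    return overdue or too_old
```

Running a check like this on a schedule turns "the item aged in the backlog" into a visible event that someone has to respond to.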
Cloudflare's same-day transparency: the accountability mechanism
Cloudflare published their postmortem for the December 2025 React outage on the same day the incident resolved — maintaining their transparency standard for the third major outage in two years. The postmortem explicitly referenced the November 2023 action item that hadn't been completed, and included CTO Dane Knecht's public acknowledgment that staged configuration rollouts "remains our first priority." Three same-day postmortems, three public commitments to the same fix, growing organisational accountability. The third outage finally resulted in resources, deadline commitment, and executive ownership for the staged rollout project.
Architecture
The React outage sits in a chain of failures that reveals a systemic architectural vulnerability in Cloudflare's control plane. At the data plane level — PoPs, traffic routing, DDoS mitigation — Cloudflare's architecture is highly resilient. At the configuration plane level — the system that distributes rules and settings to the data plane — the architecture was designed for speed rather than safety. Three outages in two years from the same root cause is the empirical evidence that speed without safety is not viable at global infrastructure scale.
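A safety layer for the configuration plane could start with a validation gate that runs before any propagation. The sketch below is a minimal illustration, assuming JSON-encoded configuration; the 64 KiB size limit and the two required keys are hypothetical choices, not Cloudflare's actual rules.

```python
import json

MAX_CONFIG_BYTES = 64 * 1024                    # assumed size limit
REQUIRED_KEYS = {"name": str, "enabled": bool}  # assumed schema


def validate_config(raw: bytes) -> list[str]:
    """Return a list of problems; an empty list means the config may propagate."""
    if len(raw) > MAX_CONFIG_BYTES:
        return [f"config is {len(raw)} bytes, limit is {MAX_CONFIG_BYTES}"]
    try:
        config = json.loads(raw)
    except ValueError as exc:  # covers malformed JSON and bad encodings
        return [f"not valid JSON: {exc}"]
    if not isinstance(config, dict):
        return ["top-level value must be an object"]
    problems = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected):
            problems.append(f"{key} must be {expected.__name__}")
    return problems
```

Validation of this sort catches malformed or oversized files, but not a well-formed change that triggers a latent bug, which is why it complements staged rollout rather than replacing it.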
The Configuration Safety Gap: 2023–2025 Timeline
[Diagram: interactive version available on TechLogStack]
Required Enhanced Rollout Architecture for Cloudflare
[Diagram: interactive version available on TechLogStack]
Lessons
A postmortem action item that isn't implemented before the next incident becomes evidence. The staged rollout fix was identified in November 2023. Two subsequent incidents demonstrated its absence. Each would have been preventable had the fix been implemented. Organisations that deprioritise critical postmortem action items pay the price in the form of the next incident.
Killswitches (configuration flags that disable functionality globally) are configuration changes and must be treated with the same safety rigor. A killswitch that propagates globally and instantly, without validation and health gating, is a global instant configuration change. Apply staged rollout requirements to all configuration changes — including disables, removes, and shutdowns.
Security patches create deployment urgency that can override normal safety practices. CVE patches are time-sensitive, which creates pressure to deploy quickly. Build explicit processes for security patching that maintain urgency while preserving safety gates: a staged deployment with short canary windows is nearly as fast as an instant global deployment and far safer.
Postmortem action items need named owners, resource allocation, and deadline commitment — not just backlog entries. The difference between "we identified the need for staged rollouts" and "engineer X owns staged rollouts with Y engineers and a Q1 deadline" is the difference between an action item that ages and one that gets done.
Repeated incidents with the same root cause are not evidence that the fix is impossible — they are evidence that the fix is insufficiently prioritised. Three configuration-related global outages is a forcing function for resource allocation. If the first incident's postmortem doesn't unlock the resources to fix the root cause, count on needing either the second or third incident to do it — and budget the cost of those incidents accordingly.
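The killswitch lesson above can be made concrete with a percentage-gated flag. This is a sketch under assumptions, not Cloudflare's mechanism: the `in_cohort` function, the flag name, and the server identifiers are hypothetical. Each server hashes the flag name with its own ID to decide deterministically whether it falls inside the current rollout percentage.

```python
import hashlib

BUCKETS = 10_000  # resolution of 0.01%


def in_cohort(server_id: str, flag: str, pct: float) -> bool:
    """Deterministically place a server in the first `pct` fraction for a flag."""
    digest = hashlib.sha256(f"{flag}:{server_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    return bucket < pct * BUCKETS


servers = [f"srv-{i}" for i in range(1000)]
flag = "disable-testing-tool"
canary = {s for s in servers if in_cohort(s, flag, 0.01)}
wider = {s for s in servers if in_cohort(s, flag, 0.10)}
```

Because a server's bucket never changes for a given flag, raising the percentage only adds servers: every canary server stays in the wider cohort, so a disable can expand stage by stage just like any other change.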
Engineering Glossary
CVE (Common Vulnerabilities and Exposures) — a public registry of disclosed security vulnerabilities, each assigned a unique identifier. A CVE against a library like React triggers mandatory patching workflows across the industry. CVE patches are time-sensitive, creating deployment urgency that can override normal safety practices.
Enhanced Rollouts and Versioning — Cloudflare CTO Dane Knecht's term for the required fix: applying the same staged deployment and health-gating features to configuration data that Cloudflare already applies to software deployments. The system needs to distinguish between security-critical changes (fast path) and configuration updates (staged path).
Gray failure — (see Slack cellular architecture article) — a partial failure where different components have inconsistent views of system availability. Relevant here: the killswitch bug caused inconsistent behaviour across Cloudflare's network before the full HTTP 500 pattern became visible.
Killswitch — a configuration flag that disables functionality globally. Conceptually simple, but in a complex distributed system it is a configuration change with potentially unexpected dependencies. Must be staged and validated like any other configuration change.
Postmortem action item — a specific engineering commitment made following an incident, describing a change that would prevent recurrence. Treated as critical debt when the incident's root cause is high-severity and reoccurring. Requires named ownership, resource allocation, and deadline commitment to avoid ageing in a backlog.
Staged rollout — deploying a configuration change to a small percentage of infrastructure first, checking health metrics, then expanding gradually. The safety mechanism that was missing from Cloudflare's configuration distribution system and whose absence caused three global outages in two years.
This case is a plain-English retelling of publicly available engineering material.