Cloudflare · Reliability · 17 May 2026
On November 2, 2023 — the same day as the control plane datacenter failure — Cloudflare also experienced a separate six-hour global outage. The cause: a database permission change in ClickHouse generated a corrupt configuration file that was silently propagated to every server in Cloudflare's Bot Management system, crashing it globally.
- Nov 2 2023 outage
- 28% HTTP traffic impacted
- 6 hours total duration
- 2.5h to find root cause
- ClickHouse permission change
- Bot Management crashed globally
The Story
November 2, 2023 was an unusually bad day at Cloudflare. The datacenter power failure that took down the control plane had already created a major incident. Then, separately and concurrently, a different failure caused an independent global outage affecting 28% of Cloudflare's HTTP traffic. The two incidents shared a date but not a cause. The Bot Management outage began with a database permission change in ClickHouse, a column-oriented database designed for real-time analytical queries, which Bot Management queries for feature metadata. The change inadvertently led to a corrupt configuration file, and that file was propagated to every Bot Management node worldwide before anyone noticed something was wrong.
The mechanics are precise. Cloudflare's Bot Management system queries a ClickHouse database to fetch feature metadata: data used to evaluate whether a given request exhibits bot-like behavior patterns. A database change altered query permissions, so the queries no longer read from the distributed tables they normally used. Instead they fell back to a database called 'default', which held a different, larger set of 60 features. The Bot Management configuration file generator fetched this expanded feature set and emitted a file larger than the software consuming it could handle. The oversized file was then propagated throughout Cloudflare's global network, instantly and completely, as a standard configuration update.
THE GLOBAL PROPAGATION PROBLEM
Cloudflare's configuration system was designed to propagate changes globally as fast as possible — this is a feature for legitimate configuration updates. For security changes, speed matters. For this incident, speed was the accelerant: a corrupt configuration file reached every Cloudflare server globally within seconds of being generated. There was no staged rollout, no canary deployment, no percentage-based rollout. One bad file. Every server. Instantly.
Problem
ClickHouse Permission Change Triggers Fallback
A database permission change in ClickHouse caused Bot Management queries to fall back from distributed tables to the 'default' database containing 60 features. The configuration file generator fetched the larger dataset, generating a file that exceeded the size limit of the consuming software.
Cause
Oversized Config Silently Propagated Globally
The oversized configuration file was not validated before propagation. Cloudflare's configuration distribution system treated it like any other config update and propagated it globally to all Bot Management nodes. Every node crashed when it tried to load the oversized file. 28% of HTTP traffic was impacted because Bot Management is in the critical path for Cloudflare's proxy layer.
Solution
2.5h to Find Root Cause, 3.5h to Fix and Deploy
It took 2.5 hours to identify the incorrect configuration files as the source of the outage. Early investigation suspected a DDoS attack because Cloudflare's status page went offline at the same time, a coincidence with an unrelated cause. Once the root cause was identified, stopping the propagation and deploying a correct file took another hour, and cleanup took 2.5 more hours.
Result
Service Restored 6 Hours After Start
The outage was resolved at 17:06 UTC, approximately 6 hours after it started. A new configuration file was deployed. Bot Management came back online globally. The postmortem identified staged configuration rollouts as the primary required fix — the same action item from the control plane outage postmortem that hadn't been implemented yet.
🔍
Cloudflare's status page went offline at the same time as the outage, causing the incident response team to initially suspect a DDoS attack. The status page failure was a coincidence — an unrelated issue — not part of the outage. This created a 2.5-hour investigation red herring: engineers were looking for evidence of an attack while the actual cause was a configuration file size issue.
Matthew: 'None of us were happy — we were embarrassed by what had happened — but we declared it true and accurate. Sent the draft over to the SF team, who did one more sweep, then posted it.'
Matthew Prince, CEO of Cloudflare, discussing the postmortem publication, via The Pragmatic Engineer
Cloudflare CEO Matthew Prince wrote the first version of the incident review at home in Lisbon, the evening the incident resolved. This was not a PR-managed corporate response — it was an engineer's honest account of what went wrong, written while the incident was fresh. The postmortem was then circulated internally, reviewed by the SF team, and published. The same-day publication is unusual for a company of Cloudflare's size and is a demonstration of the cultural commitment to transparency that makes Cloudflare's postmortems some of the most cited in the industry.
⚠️
The November 2023 Postmortem Action Item (Uncompleted)
The postmortem for the earlier November 2023 control plane outage had included an explicit action item: implement staged configuration rollouts so that configuration files do not propagate immediately to the full global network. The Bot Management outage was, in part, a consequence of that work not yet being completed. The postmortem was explicit that this work 'remains our first priority across the organization', but implementation was a large project that could take months.
❌
Why 28% of Traffic Was Affected
Bot Management is not a peripheral feature — it's in the critical path for Cloudflare's proxy layer. When Bot Management crashes on a node, that node's proxy functionality goes offline. 28% of Cloudflare's HTTP traffic routes through nodes where Bot Management is active in the serving path. This architectural coupling — a feature module that can take down the core proxy function — is exactly the kind of dependency that staged rollouts would have contained: a crash on 1% of nodes is very different from a crash on 100%.
ℹ️
Bot Management Architecture: Why It Was Critical Path
Cloudflare's Bot Management evaluates every HTTP request against behavioral signals to determine if it's bot traffic. This evaluation happens inline in the request path — the proxy holds the request while Bot Management runs its checks. This design is necessary for real-time bot mitigation: if the check happened asynchronously, bots could complete their requests before being blocked. The trade-off is that a Bot Management failure blocks the request path entirely rather than allowing traffic through unprotected.
❌
Two Outages, One Day: The Coincidence Tax
The fact that Cloudflare experienced two separate major outages on November 2, 2023 — one from a datacenter power failure, one from a configuration file — created disproportionate reputational damage. Each incident was explainable individually. Together, they suggested to some customers that Cloudflare had a systemic reliability problem rather than two independent bad-luck events. The same is true in reliability engineering generally: coincident failures compound trust damage beyond what either would cause alone.
THE ERROR LOGGING GAP
One finding from the postmortem: the line of code that returned an error from the oversized configuration file did not log the error. If errors had been logged and alerted on when they spiked on nodes, root cause identification would have taken minutes rather than 2.5 hours. Logging errors at the point they occur — not just aggregating them — and alerting on error rate spikes is fundamental debugging infrastructure. This was one of the most actionable lessons from the incident.
The Fix
Required Fixes: Staged Rollouts and Config Validation
The Bot Management outage had two independent root causes that both needed to be addressed. The first: the ClickHouse permission change that caused the query fallback should have been tested in a staging environment where the configuration file output could be validated before propagation. The second: the configuration distribution system should have validated the file size and format before propagating globally — and should never have propagated a configuration change globally and instantly regardless of its validity.
- 28% — HTTP traffic impacted — because Bot Management is in the critical path of Cloudflare's proxy layer, a module crash takes down the proxy function for that node
- 2.5h — Time to identify the root cause — delayed by initial suspicion of DDoS attack after the status page coincidentally went offline at the same time
- 6h — Total outage duration from start to full resolution — 2.5h investigation, 1h fix deployment, 2.5h cleanup
- Instant — Configuration propagation speed before fix — the system was designed to propagate configs globally as fast as possible, which made the corrupt config a global instant failure
```python
# Simplified config validation and staged rollout logic.
# Addresses both root causes of the Bot Management outage.


class ConfigValidationError(Exception):
    """Raised when a config fails size or schema validation."""


class ConfigDeploymentError(Exception):
    """Raised when a rollout stage fails its health check."""


# Maps config_type to a parser exposing validate(data); populated elsewhere.
CONFIG_PARSERS = {}


class ConfigDeployer:
    # The _deploy_*, _health_check_passes, and _rollback hooks belong to the
    # surrounding distribution system and are not shown here.
    MAX_CONFIG_SIZE_BYTES = 10_000_000  # explicit size limit

    def deploy_config(self, config_data: bytes, config_type: str):
        # VALIDATION GATE: Reject invalid configs before any propagation
        self._validate_config(config_data, config_type)

        # STAGED ROLLOUT: Not global-instant anymore
        # Phase 1: Deploy to 1% of nodes
        self._deploy_to_percentage(config_data, pct=0.01)
        if not self._health_check_passes(window_minutes=5):
            self._rollback()  # automatic rollback on health failure
            raise ConfigDeploymentError("Health check failed at 1%")

        # Phase 2: Expand to 10%
        self._deploy_to_percentage(config_data, pct=0.10)
        if not self._health_check_passes(window_minutes=5):
            self._rollback()
            raise ConfigDeploymentError("Health check failed at 10%")

        # Phase 3: Full deployment
        self._deploy_global(config_data)

    def _validate_config(self, data: bytes, config_type: str):
        # Size validation: catches the oversized file from the ClickHouse fallback
        if len(data) > self.MAX_CONFIG_SIZE_BYTES:
            raise ConfigValidationError(
                f"Config size {len(data)} exceeds maximum {self.MAX_CONFIG_SIZE_BYTES}"
            )
        # Schema validation: catches structural issues
        parser = CONFIG_PARSERS[config_type]
        parser.validate(data)  # raises on malformed config
```
THE INVESTIGATION RED HERRING
One of the most instructive details in this postmortem is the DDoS attack hypothesis. Cloudflare's status page went offline at the same time as the Bot Management outage for a completely unrelated reason. Incident responders, seeing both failures, initially looked for evidence of an attack and spent 2.5 hours on the wrong hypothesis. The lesson: when an incident starts, explicitly enumerate and test competing hypotheses rather than pursuing only the first plausible one.
ℹ️
ClickHouse Permission Architecture
Cloudflare's Bot Management uses ClickHouse to query feature metadata — data about which behavioral signals to look for in traffic. The ClickHouse cluster had two query paths: the distributed tables path (normal operation, queries a subset of features), and the 'default' database fallback (60 features, designed for different purposes). The permission change that triggered the fallback was routine maintenance — there was no intent to cause the fallback. This is a reminder that permission changes to production databases require the same testing rigor as code changes.
✅
Same-Day Postmortem: The Transparency Standard
Cloudflare published the incident postmortem on the same day as the outage. This is exceptional: most companies take days or weeks to publish postmortems. The same-day publication reflects a culture where transparency with customers is treated as part of incident response, not a post-recovery PR exercise. Cloudflare's CEO wrote the first draft the evening the incident resolved. That speed and candor are why Cloudflare's postmortems are among the most trusted in the industry.
⚠️
The Missing Error Log
A key finding in the postmortem: the code that crashed when loading the oversized configuration file returned an error but did not log it. This meant that even as nodes were crashing, the specific error causing the crash was not visible in monitoring. Engineers investigating the incident had to work backward from service failures rather than forward from error messages. Every error should be logged at the point it occurs, and log-level alerts should be configured for error rate spikes.
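The logging gap described above can be illustrated with a small sketch. It is not Cloudflare's code: `MAX_FEATURES`, `FeatureLimitExceeded`, `load_features`, and the in-memory `error_counts` counter (a stand-in for a real metrics client) are all hypothetical names chosen for illustration. The point is only that the error is logged and counted at the place it is raised, so an alert on the counter can fire while the failure is still happening.

```python
import logging
from collections import Counter
from typing import List, Optional

logger = logging.getLogger("bot_management.config")
error_counts: Counter = Counter()  # stand-in for a real metrics client

MAX_FEATURES = 50  # hypothetical limit, for illustration only


class FeatureLimitExceeded(Exception):
    """Raised when a feature list is larger than the consumer supports."""


def load_features(names: List[str]) -> List[str]:
    """Return the feature list, or raise if it exceeds the limit."""
    if len(names) > MAX_FEATURES:
        raise FeatureLimitExceeded(
            f"{len(names)} features exceeds limit of {MAX_FEATURES}"
        )
    return names


def load_features_logged(names: List[str]) -> Optional[List[str]]:
    """Log and count the failure where it happens, then report it to the caller."""
    try:
        return load_features(names)
    except FeatureLimitExceeded as exc:
        # Log with the specific cause so responders can work forward from it.
        logger.error("config load failed: %s", exc)
        # Count it so an alert can trigger on a spike across nodes.
        error_counts["config_load_failed"] += 1
        return None
```

A caller that receives `None` can keep its previous feature list. The alerting rule itself (for example, paging when `config_load_failed` rises across many nodes at once) would live in the monitoring system and is not shown.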
THE SPEED-SAFETY TRADEOFF IN CONFIG PROPAGATION
Cloudflare's instant global config propagation was designed for a real use case: when a new DDoS attack signature is detected, Cloudflare needs to push the mitigation rule globally as fast as possible. Security changes genuinely benefit from fast propagation. The fix isn't to make config propagation slower — it's to distinguish between security-critical changes (fast propagation with validation) and configuration updates (staged rollout with health gates). Not all configuration changes have the same urgency requirements.
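One way to picture this distinction is a per-class rollout policy. The sketch below is an assumption-laden illustration, not Cloudflare's design: `ChangeClass`, `RolloutPolicy`, `POLICIES`, and the stage and soak values are hypothetical. It encodes the idea that both classes are validated, but only routine changes pay the cost of staged exposure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class ChangeClass(Enum):
    SECURITY_CRITICAL = "security_critical"  # e.g. a new DDoS mitigation rule
    ROUTINE = "routine"                      # e.g. a regenerated feature file


@dataclass(frozen=True)
class RolloutPolicy:
    stages: Tuple[float, ...]    # fraction of nodes reached at each stage
    soak_seconds: int            # wait between stages while watching health
    validate_first: bool = True  # validation applies to every class


# Hypothetical values chosen for illustration.
POLICIES = {
    ChangeClass.SECURITY_CRITICAL: RolloutPolicy(stages=(1.0,), soak_seconds=0),
    ChangeClass.ROUTINE: RolloutPolicy(stages=(0.01, 0.10, 1.0), soak_seconds=300),
}


def policy_for(change_class: ChangeClass) -> RolloutPolicy:
    """Look up how a change of the given class should be rolled out."""
    return POLICIES[change_class]
```

Under this scheme a regenerated Bot Management feature file would be classed as `ROUTINE` and would reach 1% of nodes first, while an urgent mitigation rule would still go out in one step after passing validation.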
Architecture
The Bot Management outage reveals how Cloudflare's internal architecture works at a feature module level. Bot Management is a module within Cloudflare's proxy software that evaluates every HTTP request against bot detection criteria. When it loads its configuration file at startup (or on configuration update), it reads the feature definitions that determine what signals to analyze. If that configuration file is oversized or malformed, the module crashes — and because it's in the critical path of the proxy, the proxy function for that node crashes too.
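A common defensive pattern for the load path described above is to keep the last known good configuration and refuse to swap in an update that fails basic checks. The sketch below assumes a JSON file with a `features` key and a hypothetical `ConfigHolder` class; the source does not describe Cloudflare's actual file format or loader, so treat every name here as illustrative.

```python
import json


class ConfigHolder:
    """Holds the active config; a bad update never replaces a good one."""

    def __init__(self, initial: dict, max_bytes: int = 1_000_000):
        self.active = initial
        self.max_bytes = max_bytes  # hypothetical size limit
        self.rejected = 0           # count of refused updates, for alerting

    def try_update(self, raw: bytes) -> bool:
        """Validate an update and apply it only if every check passes."""
        if len(raw) > self.max_bytes:
            self.rejected += 1
            return False
        try:
            candidate = json.loads(raw)
        except ValueError:  # covers malformed JSON and undecodable bytes
            self.rejected += 1
            return False
        if not isinstance(candidate, dict) or "features" not in candidate:
            self.rejected += 1
            return False
        self.active = candidate
        return True
```

With this shape, an oversized or malformed file increments `rejected` and leaves `active` untouched, so the module keeps serving with stale but valid data instead of crashing.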
[Diagram: Bot Management Outage: The Configuration Propagation Chain. Interactive version available on TechLogStack.]
[Diagram: After: Config Validation + Staged Rollout Architecture. Interactive version available on TechLogStack.]
THE 'SAME MISTAKE TWICE' CONCERN
Two separate Cloudflare outages on the same day created a serious customer confidence problem. The datacenter outage was triggered by an external failure. The Bot Management outage was self-inflicted, and its blast radius came from a gap the team had already identified: configuration changes propagating globally without a staged rollout. Customers rightly noticed the pattern. CTO Dane Knecht acknowledged in the postmortem that work on global configuration changes 'remains our first priority across the organization', a public commitment to completing the staged rollout work that the team already knew it needed.
⚠️
Module Criticality and Blast Radius
An architecture question raised by this incident: should Bot Management be in the critical path of the proxy layer? If Bot Management crashes, the proxy crashes. An alternative design isolates Bot Management as a non-critical component that the proxy bypasses on failure, allowing traffic to flow (without bot protection) rather than blocking entirely. This fail-open vs fail-closed decision trades security (fail-open lets bots through temporarily) against availability (fail-closed takes the proxy down). For a CDN, the availability argument may outweigh the security argument.
🔒
Fail-Open vs Fail-Closed: The Bot Management Design Decision
The Bot Management outage surfaces a fundamental architecture decision: when a security module fails, should the system fail-open (allow traffic through unprotected) or fail-closed (block traffic until the module recovers)? Fail-open maintains availability but exposes customers to unprotected bot traffic during the failure window. Fail-closed maintains security posture but impacts availability. Cloudflare's current design is fail-closed — 28% of traffic went down rather than flowing unprotected. The right answer depends on whether your customers value security continuity or availability continuity more during module failures.
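The two behaviors can be made concrete with a small sketch. This is a hypothetical request handler, not Cloudflare's proxy: `FailureMode`, `handle_request`, the `score_bot` callable, and the `block_above` threshold are illustrative names, and the returned strings simply label the outcome.

```python
from enum import Enum
from typing import Callable


class FailureMode(Enum):
    FAIL_OPEN = "open"      # module error: let the request through unscored
    FAIL_CLOSED = "closed"  # module error: reject the request


def handle_request(
    request: dict,
    score_bot: Callable[[dict], float],
    mode: FailureMode,
    block_above: float = 0.9,
) -> str:
    """Score a request inline and decide what to do if scoring itself fails."""
    try:
        score = score_bot(request)
    except Exception:
        # The security module is broken; the configured mode decides the outcome.
        if mode is FailureMode.FAIL_OPEN:
            return "allow_unscored"
        return "error_5xx"
    return "block" if score > block_above else "allow"
```

In practice the mode could be chosen per customer or per rule, since a site under active bot attack may prefer fail-closed while most others would prefer to keep serving traffic.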
Lessons
The Cloudflare Bot Management outage teaches a simple lesson about configuration safety that applies to every distributed system: fast global propagation is an availability risk. The lessons here are architectural and process-oriented.
- 01. Validate configuration files before propagating them. Size limits, schema validation, and semantic checks should all run before a configuration update is distributed to production nodes. A corrupt config that fails validation is an alert; a corrupt config that propagates globally is an outage.
- 02. Staged rollouts (deploying configuration changes to a small percentage of nodes first, checking health, then expanding gradually) for configuration changes are as important as staged rollouts for code changes. The same principles apply: canary, health gate, expand. Global instant propagation for configuration changes is a global outage waiting to happen.
- 03. Database permission changes are code changes. They modify system behavior and can cause unexpected fallbacks, query plan changes, and downstream effects. Test them in staging. Apply them with the same rigor as schema migrations. The Cloudflare ClickHouse permission change was routine maintenance that caused a global outage because it wasn't tested for downstream effects.
- 04. When investigating incidents, explicitly enumerate competing hypotheses and test the most likely ones in parallel. The DDoS false lead cost 2.5 hours because investigators committed too quickly to one explanation. Structured incident investigation that tests multiple hypotheses simultaneously finds root causes faster.
- 05. Postmortem action items must have urgency. The same staged rollout improvement identified in the November 2023 control plane outage postmortem would have prevented the Bot Management outage if it had been implemented before the second incident. Postmortem action items are not backlog items — they are debt with interest that accrues in the form of the next incident.
ℹ️
The 2023 Cloudflare Transparency Report
Cloudflare's CEO published the incident review on the same day as the outage, and the write-up was detailed and candid about the mistakes made. This level of post-incident transparency is unusual and valuable for the industry. When major infrastructure providers share honest postmortems, they give other engineering teams a chance to learn from failures they didn't experience themselves and raise the industry standard for incident communication.
CONFIGURATION AS CODE: THE MISSING GATE
The Bot Management config file was generated by a system that fetched data from a database and formatted it. This is code that produces configuration. It had no equivalent of a test suite, a staging environment validation, or a size limit check. Configuration generators need the same quality gates as application code: unit tests for the generation logic, integration tests against real database states, validation of the output before propagation, and size/schema checks at the distribution layer. Configuration generation is engineering, not operations.
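As a rough illustration of what a unit-tested generator might look like, the sketch below pairs a generation function with two tests. Everything in it is assumed for the example: the `(name, type)` row shape, the `build_feature_config` function, the `MAX_FEATURES` value, and the output layout are hypothetical, since the source does not describe the real generator's interface.

```python
from typing import Dict, List, Tuple

MAX_FEATURES = 50  # hypothetical limit agreed with the consuming software


def build_feature_config(rows: List[Tuple[str, str]]) -> Dict[str, list]:
    """Turn (name, type) rows from the metadata query into a config dict."""
    if not rows:
        raise ValueError("metadata query returned no features")
    if len(rows) > MAX_FEATURES:
        raise ValueError(
            f"{len(rows)} features exceeds limit of {MAX_FEATURES}; "
            "refusing to emit config"
        )
    return {"features": [{"name": n, "type": t} for n, t in rows]}


def test_normal_result_builds_config():
    rows = [(f"feature_{i}", "Float64") for i in range(10)]
    assert len(build_feature_config(rows)["features"]) == 10


def test_oversized_result_is_rejected():
    # Simulates the query unexpectedly returning a larger feature set.
    rows = [(f"feature_{i}", "Float64") for i in range(MAX_FEATURES + 10)]
    try:
        build_feature_config(rows)
    except ValueError:
        return
    raise AssertionError("oversized result should have been rejected")
```

The second test is the important one: it checks that an unexpectedly large query result stops at the generator instead of becoming a file that the distribution layer then has to catch.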
The same configuration safety fix that would have prevented the first outage also would have prevented the second outage — which makes the second outage Cloudflare's most expensive action item ever left in a backlog.
TechLogStack — built at scale, broken in public, rebuilt by engineers
This case is a plain-English retelling of publicly available engineering material.