TechLogStack

Posted on May 20 • Originally published at techlogstack.com on May 17

A Database Permission Change in ClickHouse Took Down 28% of Cloudflare's HTTP Traffic

#devops #reliability #database #webdev

28% of Cloudflare's HTTP traffic impacted — because Bot Management is in the critical proxy path
6 hours total outage duration; 2.5 hours to find root cause (delayed by a DDoS false lead)
Cause: a ClickHouse permission change triggered a query fallback to a larger dataset, generating an oversized config file
No validation, no staged rollout — the corrupt file reached every server globally within seconds
The missing safeguard, staged config rollouts, was already an action item from the prior November 2023 postmortem and had not yet been implemented
Same-day postmortem; CEO wrote the first draft the evening the incident resolved

On November 2, 2023 — the same day as the control plane datacenter failure — Cloudflare also experienced a separate six-hour global outage. The cause: a database permission change in ClickHouse generated a corrupt configuration file that was silently propagated to every server in Cloudflare's Bot Management system, crashing it globally.

The Story

November 2, 2023 was an unusually bad day at Cloudflare. The datacenter power failure that took down the control plane had already created a major incident. Then a second, independent failure caused a global outage affecting 28% of Cloudflare's HTTP traffic. The trigger was a database permission change in ClickHouse, a column-oriented database designed for real-time analytical queries that Cloudflare's Bot Management system uses to query feature metadata. The change inadvertently produced a corrupt configuration file, and that file was propagated globally to every Bot Management node before anyone noticed something was wrong.

The failure chain had three steps. First, a database change altered query permissions, so Bot Management queries fell back from the distributed tables they normally used to a database called 'default', which held a larger set of 60 features. Second, the Bot Management configuration file generator fetched this expanded feature set and emitted a file larger than the software consuming it could handle. Third, the oversized file was propagated throughout Cloudflare's global network, instantly and completely, as a standard configuration update.
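The chain can be sketched as a toy model. This is not Cloudflare's generator: the function names, the row counts for the distributed tables, and the consumer limit are assumptions chosen for illustration. Only the 60-feature size of the 'default' database comes from the account above.

```python
# Toy model of the fallback. Every name and number here is hypothetical,
# except the 60 features attributed to the 'default' database in the text.
CONSUMER_FEATURE_LIMIT = 40  # assumed capacity of the software loading the file

# What each data source returns in this toy model.
DISTRIBUTED_TABLE_ROWS = [f"feature_{i}" for i in range(20)]
DEFAULT_DATABASE_ROWS = [f"feature_{i}" for i in range(60)]


def fetch_features(can_read_distributed: bool) -> list[str]:
    """Return feature names, falling back to 'default' when access is lost."""
    if can_read_distributed:
        return DISTRIBUTED_TABLE_ROWS
    # The silent fallback: a bigger result set, but no error is raised.
    return DEFAULT_DATABASE_ROWS


def generate_config(features: list[str]) -> bytes:
    """Serialize one feature per line. Note the absence of any size check."""
    return "\n".join(features).encode("utf-8")


def load_config(blob: bytes, limit: int = CONSUMER_FEATURE_LIMIT) -> list[str]:
    """Consumer side: rejects files with more features than it can hold."""
    features = blob.decode("utf-8").split("\n")
    if len(features) > limit:
        raise ValueError(f"{len(features)} features exceeds limit {limit}")
    return features
```

With `can_read_distributed=True` the file loads; with `False` the generator still succeeds and the failure only surfaces in `load_config`, on the consuming side, which is the shape of the incident described here.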

The Global Propagation Problem

Cloudflare's configuration system was designed to propagate changes globally as fast as possible — a feature for legitimate security updates. For this incident, speed was the accelerant: a corrupt configuration file reached every Cloudflare server globally within seconds of being generated. There was no staged rollout, no canary deployment, no percentage-based rollout. One bad file. Every server. Instantly.

Problem

ClickHouse Permission Change Triggers Fallback

A database permission change caused Bot Management queries to fall back from distributed tables to the 'default' database containing 60 features. The configuration file generator fetched the larger dataset, generating a file that exceeded the size limit of the consuming software.

Cause

Oversized Config Silently Propagated Globally

The oversized configuration file was not validated before propagation. Cloudflare's configuration distribution system treated it like any other config update and propagated it globally to all Bot Management nodes. Every node crashed when it tried to load the oversized file.

Solution

2.5h to Find Root Cause, 3.5h to Fix and Deploy

It took 2.5 hours to identify the incorrect configuration files as the source of the outage. Early investigation suspected a DDoS attack because Cloudflare's status page, in an unrelated failure, went offline at the same time. Once the cause was identified, deploying a correct file took another hour, and cleanup took 2.5 more hours.

Result

Service Restored 6 Hours After Start

'None of us were happy — we were embarrassed by what had happened — but we declared it true and accurate. Sent the draft over to the SF team, who did one more sweep, then posted it.'

— Matthew Prince, CEO of Cloudflare, discussing the postmortem publication, via The Pragmatic Engineer

The outage was resolved at 17:06 UTC, approximately 6 hours after it started. A correct configuration file was deployed. Cloudflare CEO Matthew Prince wrote the first version of the incident review at home in Lisbon, the evening the incident resolved — not a PR-managed corporate response but an engineer's honest account, written while the incident was fresh and published the same day.

The Fix

Required Fixes: Staged Rollouts and Config Validation

The Bot Management outage had two independent root causes. The first: the ClickHouse permission change that caused the query fallback should have been tested in a staging environment where the configuration file output could be validated. The second: the configuration distribution system should have validated the file size and format before propagating globally — and should never have propagated any configuration change globally and instantly regardless of its validity.

28% — HTTP traffic impacted; Bot Management is in the critical path of Cloudflare's proxy layer
2.5h — time to identify root cause; delayed by initial DDoS hypothesis after status page coincidentally went offline
6h — total outage duration: 2.5h investigation, 1h fix deployment, 2.5h cleanup
Instant — configuration propagation speed before fix; designed for security updates, catastrophic for corrupt configs

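# Simplified config validation and staged rollout logic
# Addresses both root causes of the Bot Management outage

class ConfigValidationError(Exception):
    """Raised when a config fails size or schema validation."""

class ConfigDeploymentError(Exception):
    """Raised when a rollout stage fails its health check."""

# Maps config_type to a parser object exposing validate(data).
# Left empty in this sketch; real parsers would be registered elsewhere.
CONFIG_PARSERS = {}

class ConfigDeployer:
    MAX_CONFIG_SIZE_BYTES = 10_000_000  # explicit size limit

    def deploy_config(self, config_data: bytes, config_type: str):
        # VALIDATION GATE: Reject invalid configs before any propagation
        # The oversized ClickHouse config would have been caught here
        self._validate_config(config_data, config_type)

        # STAGED ROLLOUT: Not global-instant anymore
        # Phase 1: Deploy to 1% of nodes, check health
        self._deploy_to_percentage(config_data, pct=0.01)
        if not self._health_check_passes(window_minutes=5):
            self._rollback()
            raise ConfigDeploymentError("Health check failed at 1%")

        # Phase 2: Expand to 10%
        self._deploy_to_percentage(config_data, pct=0.10)
        if not self._health_check_passes(window_minutes=5):
            self._rollback()
            raise ConfigDeploymentError("Health check failed at 10%")

        # Phase 3: Full deployment, only after both health gates pass
        self._deploy_global(config_data)

    def _validate_config(self, data: bytes, config_type: str):
        # Size validation: catches the ClickHouse fallback issue
        if len(data) > self.MAX_CONFIG_SIZE_BYTES:
            raise ConfigValidationError(
                f"Config size {len(data)} exceeds maximum {self.MAX_CONFIG_SIZE_BYTES}"
            )
        # Schema validation: catches structural issues before propagation
        parser = CONFIG_PARSERS.get(config_type)
        if parser is None:
            raise ConfigValidationError(f"No parser registered for {config_type!r}")
        parser.validate(data)  # raises on malformed config

    # Infrastructure hooks, intentionally left unimplemented in this sketch
    def _deploy_to_percentage(self, data: bytes, pct: float):
        raise NotImplementedError

    def _deploy_global(self, data: bytes):
        raise NotImplementedError

    def _health_check_passes(self, window_minutes: int) -> bool:
        raise NotImplementedError

    def _rollback(self):
        raise NotImplementedError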

The Investigation Red Herring

Cloudflare's status page went offline at the same time as the Bot Management outage, a coincidence with an unrelated cause. Incident responders, seeing both the outage and the status page failure, initially focused on finding evidence of a DDoS attack. Pursuing that hypothesis cost 2.5 hours. The lesson: when an incident starts, explicitly enumerate and test competing hypotheses rather than pursuing only the first plausible one. The most visible failure is not always the cause.

One key finding from the postmortem: the line of code that returned an error from the oversized configuration file did not log the error. If errors had been logged and alerted on when they spiked on nodes, root cause identification would have taken minutes rather than 2.5 hours. Logging errors at the point they occur — not just aggregating them — and alerting on error rate spikes is fundamental debugging infrastructure.
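A minimal sketch of that idea follows, assuming a size-limited loader. The logger name, `ErrorRateMonitor`, and the threshold values are all hypothetical and are not taken from Cloudflare's code. It logs the rejection where it happens and uses a sliding window to flag a spike.

```python
import logging
from collections import deque

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("botmgmt.config")


class ErrorRateMonitor:
    """Sliding-window counter that signals when errors spike."""

    def __init__(self, window_seconds: float, threshold: int):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()  # timestamps of recent errors

    def record(self, now: float) -> bool:
        """Record one error at time `now`; return True if the threshold is reached."""
        self._events.append(now)
        # Drop timestamps that have aged out of the window.
        while self._events and now - self._events[0] > self.window_seconds:
            self._events.popleft()
        return len(self._events) >= self.threshold


def load_config(blob: bytes, max_bytes: int,
                monitor: ErrorRateMonitor, now: float) -> bool:
    """Return True if the config is accepted; log and count every rejection."""
    if len(blob) > max_bytes:
        # Log at the point of failure, with the values needed to debug it.
        logger.error("config rejected: size=%d exceeds limit=%d",
                     len(blob), max_bytes)
        if monitor.record(now):
            logger.critical("config load errors spiking: >=%d within %.0fs",
                            monitor.threshold, monitor.window_seconds)
        return False
    return True
```

Passing `now` explicitly keeps the sketch deterministic; a real loader would more likely read a monotonic clock and forward the critical log line to an alerting system.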

Why 28% of traffic was affected — and the fail-open vs fail-closed question

Bot Management is not a peripheral feature — it's in the critical path of Cloudflare's proxy layer. When Bot Management crashes on a node, that node's proxy functionality goes offline. This surfaces a fundamental architecture decision: when a security module fails, should the system fail-open (allow traffic through unprotected) or fail-closed (block traffic until the module recovers)? Fail-closed maintains security posture but impacts availability. Fail-open maintains availability but exposes customers to unprotected bot traffic during the failure window. Cloudflare's current design is fail-closed — 28% of traffic went down rather than flowing unprotected. The right answer depends on whether customers value security continuity or availability continuity more during module failures.
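The two policies can be contrasted in a short sketch. The request shape, the `score_bot` callable, and the `block_threshold` value are assumptions made for illustration; this is not how Cloudflare's proxy is implemented.

```python
from enum import Enum
from typing import Callable


class FailureMode(Enum):
    FAIL_OPEN = "fail_open"      # module failure: let traffic through unscored
    FAIL_CLOSED = "fail_closed"  # module failure: refuse traffic


def handle_request(request: dict,
                   score_bot: Callable[[dict], float],
                   mode: FailureMode,
                   block_threshold: float = 0.9) -> str:
    """Return "allow" or "block" for one request.

    `score_bot` stands in for the bot detection module (hypothetical) and
    may raise if, for example, its configuration failed to load.
    """
    try:
        score = score_bot(request)
    except Exception:
        # The policy decision lives here, not inside the module itself.
        return "allow" if mode is FailureMode.FAIL_OPEN else "block"
    return "block" if score >= block_threshold else "allow"
```

Keeping the decision in the caller makes the failure mode an explicit, configurable choice rather than a side effect of how the module happens to crash.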

The prior postmortem action item that wasn't completed

The postmortem for the previous November 2023 Cloudflare control plane outage had included an explicit action item: implement staged configuration rollouts so that configuration files do not propagate immediately to the full global network. The Bot Management outage was, in part, a consequence of that work not yet being completed. CTO Dane Knecht acknowledged in the postmortem that staged config rollouts "remains our first priority across the organisation" but that implementation was a large project that could take months.

The speed-safety tradeoff in config propagation

Cloudflare's instant global config propagation was designed for a real use case: when a new DDoS attack signature is detected, Cloudflare needs to push the mitigation rule globally as fast as possible. Security changes genuinely benefit from fast propagation. The fix isn't to make all config propagation slower. It's to distinguish between security-critical changes (fast propagation with validation) and routine configuration updates (staged rollout with health gates). Not all configuration changes have the same urgency requirements.
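One way to encode that distinction is to attach a rollout plan to each class of change. The sketch below is an assumption about how such a policy could look; `ChangeClass`, `RolloutPlan`, and the stage percentages are hypothetical names and values, not Cloudflare's.

```python
from dataclasses import dataclass
from enum import Enum


class ChangeClass(Enum):
    SECURITY_CRITICAL = "security_critical"  # e.g. a new attack mitigation rule
    ROUTINE = "routine"                      # e.g. a regenerated feature file


@dataclass(frozen=True)
class RolloutPlan:
    stages: tuple[float, ...]  # cumulative fraction of the fleet per stage
    soak_minutes: int          # health-check wait between stages
    validate_first: bool       # run size/schema validation before stage one


def plan_for(change: ChangeClass) -> RolloutPlan:
    """Pick a rollout plan; validation is mandatory for every class."""
    if change is ChangeClass.SECURITY_CRITICAL:
        # Fast path: a single global stage, but never without validation.
        return RolloutPlan(stages=(1.0,), soak_minutes=0, validate_first=True)
    # Default path: canary, health gate, expand.
    return RolloutPlan(stages=(0.01, 0.10, 1.0), soak_minutes=5,
                       validate_first=True)
```

The point of the design is that the fast path skips the staging, not the validation: both plans set `validate_first=True`.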

Architecture

The Bot Management outage reveals how Cloudflare's internal architecture works at a feature module level. Bot Management is a module within Cloudflare's proxy software that evaluates every HTTP request against bot detection criteria. When it loads its configuration file, it reads the feature definitions that determine what signals to analyse. If that configuration file is oversized or malformed, the module crashes — and because it's in the critical path of the proxy, the proxy function for that node crashes too.

Bot Management Outage: The Configuration Propagation Chain

Interactive diagram available on TechLogStack.

After: Config Validation + Staged Rollout Architecture

Interactive diagram available on TechLogStack.

Configuration as Code: The Missing Gate

The Bot Management config file was generated by a system that fetched data from a database and formatted it — code that produces configuration. It had no equivalent of a test suite, a staging environment validation, or a size limit check. Configuration generators need the same quality gates as application code: unit tests for the generation logic, integration tests against real database states, validation of the output before propagation, and size/schema checks at the distribution layer. Configuration generation is engineering, not operations.
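As a rough illustration of what such gates could look like, the sketch below puts limits inside a generator and pairs it with two plain-assert tests. The JSON layout, `MAX_FEATURES`, `MAX_CONFIG_BYTES`, and the row shape are all hypothetical; the real file format is not described in this account.

```python
import json

# Hypothetical limits, assumed to be agreed with the consuming module.
MAX_FEATURES = 100
MAX_CONFIG_BYTES = 16_384


def generate_config(rows: list[dict]) -> bytes:
    """Build the config from database rows, refusing oversized output."""
    if len(rows) > MAX_FEATURES:
        raise ValueError(f"{len(rows)} features exceeds limit {MAX_FEATURES}")
    blob = json.dumps({"features": rows}, sort_keys=True).encode("utf-8")
    if len(blob) > MAX_CONFIG_BYTES:
        raise ValueError(f"{len(blob)} bytes exceeds limit {MAX_CONFIG_BYTES}")
    return blob


def make_rows(count: int) -> list[dict]:
    """Stand-in for a database state with `count` feature rows."""
    return [{"name": f"feature_{i}", "type": "float"} for i in range(count)]


def test_normal_database_state_produces_loadable_config():
    blob = generate_config(make_rows(60))
    assert len(json.loads(blob)["features"]) == 60
    assert len(blob) <= MAX_CONFIG_BYTES


def test_expanded_database_state_is_rejected():
    try:
        generate_config(make_rows(MAX_FEATURES + 1))
    except ValueError:
        return  # rejected at generation time, before any propagation
    raise AssertionError("oversized feature set was not rejected")
```

The second test is the one that matters for this incident: it feeds the generator an unexpectedly large database state and checks that the failure happens at generation time rather than on production nodes.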

Lessons

Validate configuration files before propagating them. Size limits, schema validation, and semantic checks should all run before a configuration update is distributed to production nodes. A corrupt config that fails validation is an alert; a corrupt config that propagates globally is an outage.
Staged rollouts for configuration changes are as important as staged rollouts for code changes: deploy to a small percentage of nodes first, check health, then expand gradually. The same principles apply: canary, health gate, expand. Global instant propagation for configuration changes is a global outage waiting to happen.
Database permission changes are code changes. They modify system behaviour and can cause unexpected fallbacks, query plan changes, and downstream effects. Test them in staging. Apply them with the same rigour as schema migrations. The ClickHouse permission change was routine maintenance that caused a global outage because it wasn't tested for downstream effects.
When investigating incidents, explicitly enumerate competing hypotheses and test the most likely ones in parallel. The DDoS false lead cost 2.5 hours because investigators committed too quickly to one explanation. Structured incident investigation that tests multiple hypotheses simultaneously finds root causes faster.
Postmortem action items must have urgency. The same staged rollout improvement identified in the prior outage's postmortem would have prevented this outage if implemented before the second incident. Postmortem action items are not backlog items — they are debt with interest that accrues in the form of the next incident.

Engineering Glossary

ClickHouse — a column-oriented database designed for real-time analytical queries. Used by Cloudflare's Bot Management system to query feature metadata — the data that defines which behavioral signals to evaluate for bot detection. A permission change in ClickHouse triggered the query fallback that caused this outage.

Config validation — pre-propagation checks on a configuration file including size limits, schema validation, and semantic checks. The missing gate in this incident: the oversized Bot Management config file was never validated before being distributed globally.

Fail-closed — a failure mode design where a module failure causes the system to block traffic rather than allow it through unprotected. Cloudflare's Bot Management is fail-closed — when it crashed, the proxy function went down with it, impacting 28% of HTTP traffic.

Fail-open — a failure mode design where a module failure causes the system to allow traffic through (unprotected) rather than blocking it. The availability-first alternative to fail-closed; trades security continuity during failure for uptime continuity.

Staged rollout (configuration) — deploying a configuration change to a small percentage of nodes first, running health checks, then expanding gradually to the full fleet. The absent safety mechanism in this incident; would have contained the corrupt config to 1% of nodes rather than propagating it globally.

This case is a plain-English retelling of publicly available engineering material.

Read the full case on TechLogStack →

(Interactive diagrams, source links, and the full reader experience)

TechLogStack — built at scale, broken in public, rebuilt by engineers.