What Actually Happened to the Internet on November 18, 2025?

On November 18th, 2025, the Internet seemed to come apart at the seams. OpenAI, X (Twitter), Canva, Uber, and countless other services used by billions suddenly returned 5xx errors in bright red banners. It wasn't just one company having a rough day; it was the entire modern Internet gasping for air.

But contrary to what many assumed, the issue wasn’t some massive cyberattack or worldwide server meltdown. It all traced back to a single guardian of the Internet’s infrastructure:
CLOUDFLARE!

Cloudflare sits at the gateway of the modern web, absorbing DDoS attacks, accelerating performance, securing APIs, and handling DNS resolution for millions of customers. When Cloudflare breaks, the Internet breaks.

But on this particular day, something strange happened: instead of defending others against denial of service, Cloudflare accidentally denied service to itself.

The Breakdown: How a Tiny Misconfiguration Cascaded Into Global Failure
Cloudflare later released a detailed postmortem. The simplified version is:

A seemingly harmless configuration change in one of their internal systems caused their proxy servers to crash — repeatedly — taking huge parts of the global Internet with them.

Let’s break down what happened, in plain English.

1. Cloudflare's Bot Management System Received a Bad Update

Cloudflare heavily relies on its Bot Management system, a component that classifies incoming traffic as "human" or "bot" using hundreds of behavioural features. These features are periodically updated as traffic patterns change.

Each update is packaged into a “feature file” that all Cloudflare proxy servers download.

But on November 18th:

⚠️ A permissions change in Cloudflare’s ClickHouse cluster caused the system to generate duplicate rows in the feature file.
This made the file much larger than normal — but the system that processed it wasn’t designed to handle that.
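To make that concrete, here is a minimal sketch (with a hypothetical row type and field names, not Cloudflare's actual ClickHouse schema) of how a query that suddenly sees the same metadata twice can double the row count unless the file generator deduplicates by feature name:

```rust
use std::collections::BTreeMap;

// Hypothetical representation of one row returned by the metadata query.
#[derive(Clone)]
struct FeatureRow {
    name: String,
}

// Build the payload for the feature file. Without deduplication, a
// permissions change that exposes a second copy of the same tables
// silently doubles the number of rows in the output.
fn build_feature_file(rows: &[FeatureRow], dedupe: bool) -> Vec<FeatureRow> {
    if !dedupe {
        return rows.to_vec();
    }
    // Keep one row per feature name.
    let mut by_name: BTreeMap<String, FeatureRow> = BTreeMap::new();
    for row in rows {
        by_name.insert(row.name.clone(), row.clone());
    }
    by_name.into_values().collect()
}

fn main() {
    // Simulate the bad query: every feature comes back twice.
    let mut rows = Vec::new();
    for i in 0..180 {
        let row = FeatureRow { name: format!("feature_{i}") };
        rows.push(row.clone());
        rows.push(row); // duplicate row from the second schema
    }
    println!("raw rows:     {}", build_feature_file(&rows, false).len()); // 360
    println!("deduped rows: {}", build_feature_file(&rows, true).len());  // 180
}
```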

2. Proxy Servers Tried to Load the Oversized Feature File — and Crashed

Cloudflare's next-gen proxy engine (FL2), written in Rust, makes a performance-oriented assumption:

The bot feature file will never contain more than 200 features.

To optimize, the system pre-allocated memory for exactly that amount.

But the corrupted file contained more than twice that number.

When the proxies attempted to load it, the limit check failed, and the code called unwrap() on the resulting error, which in Rust triggers a panic and crashes the worker.

In other words:

One malformed config file caused proxies to crash instantly. And since Cloudflare proxies serve nearly all user traffic… 5xx errors began spreading across the Internet like wildfire.
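For intuition, here is a minimal Rust sketch, not Cloudflare's actual FL2 code, showing how a hard-coded capacity combined with an unwrap() turns one oversized input file into a process-killing panic:

```rust
// The assumed hard limit baked into the proxy at build time.
const MAX_FEATURES: usize = 200;

#[derive(Debug, Clone)]
struct Feature {
    name: String,
}

// Load features into a buffer pre-allocated for the expected maximum.
// Exceeding the limit is reported as an error rather than handled.
fn load_features(rows: &[Feature]) -> Result<Vec<Feature>, String> {
    let mut features = Vec::with_capacity(MAX_FEATURES);
    for row in rows {
        if features.len() >= MAX_FEATURES {
            return Err(format!(
                "feature file exceeds the {MAX_FEATURES}-feature limit"
            ));
        }
        features.push(row.clone());
    }
    Ok(features)
}

fn main() {
    // A corrupted file with duplicate rows: more than twice the limit.
    let corrupted: Vec<Feature> = (0..450)
        .map(|i| Feature { name: format!("f{}", i % 200) })
        .collect();

    // unwrap() on the Err aborts the thread with a panic; when this
    // happens inside every proxy worker, the fleet crash-loops.
    let _features = load_features(&corrupted).unwrap();
}
```

Running this sketch panics with `called Result::unwrap() on an Err value`, which is exactly the class of failure described above: the error was detected, but the caller treated it as impossible.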

3. The Worst Part: The Bad File Kept Regenerating

Cloudflare's infrastructure automatically regenerates the bot feature file every few minutes.

Because it was pulling from different ClickHouse nodes — some corrected, some not — the system kept randomly generating:

- sometimes a good file (proxies recovered)
- sometimes a bad file (proxies crashed again)

This created a yo-yo cycle of recovery and collapse, making diagnosis extremely difficult.

At first, even Cloudflare engineers thought it might be a DDoS attack due to the scale and pattern of failure.
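A toy model of that recover/collapse cycle (the node counts and timings below are invented for illustration) shows why the symptoms looked so erratic:

```rust
// Toy model of the yo-yo cycle: the feature file is rebuilt every few
// minutes from whichever ClickHouse node answers, and only some nodes
// have been corrected. These numbers are assumptions, not Cloudflare's.

#[derive(Debug, Clone, Copy)]
enum FeatureFile {
    Good,      // generated by a corrected node
    Oversized, // generated by a node still producing duplicate rows
}

fn regenerate(node_is_fixed: bool) -> FeatureFile {
    if node_is_fixed {
        FeatureFile::Good
    } else {
        FeatureFile::Oversized
    }
}

fn main() {
    // Half the nodes have been corrected, half have not.
    let nodes = [true, false, false, true, false, true];

    for (cycle, node_is_fixed) in nodes.iter().cycle().take(12).enumerate() {
        let file = regenerate(*node_is_fixed);
        let fleet_state = match file {
            FeatureFile::Good => "proxies recover",
            FeatureFile::Oversized => "proxies panic and restart",
        };
        println!("t+{:>2} min: {:?} file -> {}", cycle * 5, file, fleet_state);
    }
}
```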

4. Global Impact: Why So Many Platforms Went Down

Cloudflare's proxy layer is foundational. When it collapses, several layers of the modern web collapse with it:

- Websites can't route traffic
- APIs can't be reached
- Authentication systems fail
- Applications behind Cloudflare appear "offline"

Internally, Cloudflare services also broke:

- Turnstile (the CAPTCHA alternative) stopped working
- Workers KV showed elevated 5xx errors
- Access authentication failed for new logins
- The dashboard became unreachable

This explains why so many independent platforms, even those with massive infrastructures of their own — OpenAI, X, and others — suddenly looked like they were having outages.

They just couldn’t get through Cloudflare.

Why Did This Happen?
A Deeper Look at Architectural Issues

Two questions come up again and again when engineers discuss this outage. Let's answer them.

1. Why was 200 features chosen as the upper limit?

Cloudflare set the 200-feature limit because:

- it kept memory allocation constant and fast
- the bot detection system historically never exceeded that number
- preallocation improves performance and safety

But this optimization became a fragile single point of failure.

If assumptions aren't validated at runtime, even high-performance systems can break catastrophically.

2. Was this an internal Distributed Denial-of-Service?

Technically, no — not in the traditional sense.

But conceptually?

Yes — Cloudflare unintentionally DDoSed itself.

Here’s why:

- every proxy tried to download the oversized feature file
- each attempt caused a crash
- crashed proxies kept retrying
- retries added load to internal systems
- regenerating a bad file caused waves of failures

This resembles a self-induced DDoS loop, even though the root wasn’t malicious.
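A back-of-the-envelope model, using entirely assumed numbers since Cloudflare hasn't published fleet sizes or fetch intervals at this granularity, shows how a crash/retry loop amplifies load on an internal config service without any attacker involved:

```rust
// Assumed numbers for illustration only.
fn main() {
    let proxies: f64 = 10_000.0;        // assumed proxy fleet size
    let normal_fetch_interval = 300.0;  // seconds between routine config fetches
    let crash_restart_interval = 30.0;  // seconds between crash-and-restart cycles

    // Steady state: each proxy fetches the feature file every few minutes.
    let normal_rps = proxies / normal_fetch_interval;

    // Crash loop: each proxy re-fetches the file on every restart.
    let crashloop_rps = proxies / crash_restart_interval;

    println!("normal config-fetch load:     {normal_rps:.1} req/s");
    println!("crash-loop config-fetch load: {crashloop_rps:.1} req/s");
    println!("amplification:                {:.0}x", crashloop_rps / normal_rps);
}
```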

It reveals a weakness of microservice-style architectures:

If a core internal service feeds invalid data into the system, it can overwhelm the entire infrastructure — not through volume, but through bad assumptions.

Cloudflare themselves acknowledged this: their configuration pipeline wasn’t protected by enough safeguards or validation layers.

What Cloudflare Did to Fix It
Cloudflare implemented multiple fixes to prevent a repeat:

- A global kill switch to stop the rollout of corrupted feature files
- Stricter validation to reject oversized or malformed bot feature files
- Runtime safety checks (no more panicking on bad input)
- Better circuit breakers, so proxies fall back to a safe state instead of crashing (see the sketch after this list)
- Hard limits and guardrails around ClickHouse permissions
- Slower, staged rollout of Bot Management updates instead of blasting them globally at once
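Here is a hedged sketch of the "fail safe instead of crashing" idea from that list: validate a candidate feature file before swapping it in, and keep serving with the last known-good version if validation fails. The type and function names are illustrative, not Cloudflare's internals.

```rust
// Assumed limit; mirrors the 200-feature bound discussed above.
const MAX_FEATURES: usize = 200;

#[derive(Clone)]
struct FeatureFile {
    version: u64,
    features: Vec<String>,
}

struct BotConfig {
    active: FeatureFile, // the file the proxy is currently serving with
}

impl BotConfig {
    // Only swap in a candidate that passes validation; otherwise keep
    // the last known-good file and report why the update was rejected.
    fn try_update(&mut self, candidate: FeatureFile) -> Result<(), String> {
        if candidate.features.is_empty() {
            return Err("empty feature file".to_string());
        }
        if candidate.features.len() > MAX_FEATURES {
            return Err(format!(
                "v{} has {} features (limit {MAX_FEATURES}); keeping v{}",
                candidate.version,
                candidate.features.len(),
                self.active.version
            ));
        }
        self.active = candidate;
        Ok(())
    }
}

fn main() {
    let mut config = BotConfig {
        active: FeatureFile { version: 1, features: vec!["ua_entropy".to_string(); 150] },
    };

    // A corrupted candidate is rejected; traffic keeps flowing on v1.
    let corrupted = FeatureFile { version: 2, features: vec!["dup".to_string(); 450] };
    if let Err(reason) = config.try_update(corrupted) {
        eprintln!("rejected update: {reason}");
    }
    assert_eq!(config.active.version, 1);
}
```

The key design point is that a rejected update becomes a logged event rather than a fatal one: the proxy keeps serving with the previous file.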

The fixes show that Cloudflare took the incident seriously — and that the real problem wasn’t just a bug, but its ability to cascade.

Conclusion: The Internet’s Fragility in One Bug
The November 18, 2025 outage wasn’t caused by an attacker, a massive data centre failure, or a cyber-war event.

It was caused by:

- a small configuration change
- a duplicated set of rows
- a feature file that grew too large
- a proxy system that panicked on invalid input
- and a rollout mechanism that propagated the mistake globally

When systems scale to the size of Cloudflare, tiny bugs no longer produce tiny failures. They can break the Internet.
And on November 18th, 2025 — they did.
