Om Shree

Inside Cloudflare's November 18, 2025 Outage: A Deep Dive into What Broke the Internet (Temporarily)

On November 18, 2025, a routine change at Cloudflare, a company that powers about 20% of the web, turned into a nightmare for millions of internet users. Websites ground to a halt, apps failed to load, and error pages popped up like uninvited guests. For over five hours, core parts of the internet felt the ripple effects, from e-commerce sites to developer tools. It wasn't a hacker's plot or a massive cyber assault, as some first feared. Instead, it was a classic case of a small tweak snowballing into chaos due to overlooked limits in the system. In this article, we'll walk through the outage step by step, peering behind the curtain at the tech that failed, why it happened, and what Cloudflare is doing to ensure it doesn't repeat. Drawing from Cloudflare's own detailed postmortem, we'll keep things straightforward: no tech-speak without a plain-English translation.

The Moment Everything Stopped: What Users Saw

Imagine clicking on your favorite online store at around 11:20 AM UTC (that's 6:20 AM Eastern Time in the US, for context), only to hit a wall. Instead of the homepage, you get a stark error page from Cloudflare: "Something went wrong in our network." That's exactly what happened to visitors of countless sites protected by Cloudflare, which handles traffic for giants like Shopify, Discord, and thousands of smaller players. This wasn't a glitch on your end; it was Cloudflare's core network choking, unable to route traffic properly. Services like Workers KV (a fast storage tool for developers) and Access (a secure login system) also sputtered, leaving apps and APIs in limbo.

The outage hit hard because Cloudflare sits at the heart of the modern web. It acts like a traffic cop for the internet, speeding up sites, blocking bad bots, and shielding against attacks. When it falters, the web feels slower and less reliable. By some estimates, the incident disrupted access to over 10 million domains worldwide. But fear not: this was a self-inflicted wound, not malice, and the team sprang into action to patch it up.

A Tick-Tock Timeline: From Spark to Recovery

Outages like this unfold in real time, and Cloudflare's recap gives us a clear play-by-play. Here's how the day unraveled:

  • 11:20 UTC: The First Cracks Appear

    Traffic started failing across Cloudflare's global network. Requests that should have zipped through hit roadblocks, triggering those infamous 5xx server errors (think of them as the digital equivalent of a "server meltdown" message). It looked like a flood of bad traffic overwhelming the system, enough to fool even experts into thinking it was a distributed denial-of-service (DDoS) attack on steroids.

  • 11:20–12:00 UTC: Confusion Sets In

    Engineers scrambled in internal chat rooms, eyeing dashboards that screamed overload. To add to the panic, Cloudflare's own status page, ironically hosted outside their network to avoid just this problem, went dark too. Users saw a simple "Error" message, fueling suspicions of a coordinated hit. (It turned out to be a fluke, unrelated to the main issue, but in the heat of the moment, it felt like the attackers were toying with them.)

  • 12:00–14:30 UTC: Digging for the Root

    The team ruled out DDoS after spotting odd patterns in the logs. No massive inbound floods, just internal failures. They zeroed in on the culprit: a bloated file in their bot-fighting system. Reverting to an older, known-good version of that file got core traffic flowing again around 14:30 UTC.

  • 14:30–17:06 UTC: The Cleanup Crew

    With the immediate fire out, attention turned to the aftermath. As sites came back online, a surge of pent-up traffic hammered other parts of the network. Engineers monitored loads, tweaked configurations, and by 17:06 UTC, everything was humming normally again.

In total, the worst of it lasted about three hours, with full recovery by late afternoon. Cloudflare's transparency here is commendable: they didn't sugarcoat the mess.

Behind the Scenes: The Tech That Tripped Up

To understand why a simple permission tweak caused such havoc, let's pop the hood on Cloudflare's engine room. Picture their network as a bustling highway system: requests from your browser or app enter via secure tunnels (that's the HTTP and TLS layer, encrypting your data like a locked envelope). From there, they hit the "core proxy," Cloudflare's traffic brain, nicknamed FL for "Frontline." This is where the magic (and occasional mayhem) happens: rules for security, speed boosts, and bot detection get applied. Finally, if needed, data zips to Pingora, Cloudflare's backend for caching or fetching fresh content from origin servers.
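
To make that path concrete, here's a minimal Python sketch of the staged flow. Every name in it (the Request type, terminate_tls, core_proxy_rules, fetch_content) is invented for illustration; Cloudflare's real proxies aren't written like this, and the sketch only captures the order of operations described above.

```python
from dataclasses import dataclass

@dataclass
class Request:
    path: str
    encrypted: bool = True

def terminate_tls(req: Request) -> Request:
    # HTTP/TLS layer: unwrap the "locked envelope"
    return Request(path=req.path, encrypted=False)

def core_proxy_rules(req: Request) -> str:
    # FL, the "core proxy": security rules, speed boosts, and bot detection run here
    return "allow"

def fetch_content(req: Request) -> str:
    # Pingora: serve from cache or fetch fresh content from the origin server
    return f"200 OK for {req.path}"

def handle(req: Request) -> str:
    plain = terminate_tls(req)
    if core_proxy_rules(plain) != "allow":
        return "403 Forbidden"
    return fetch_content(plain)

print(handle(Request(path="/shop")))  # -> 200 OK for /shop
```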

At the heart of the failure was Bot Management, one of those frontline modules. Bots aren't always bad; they're automated scripts that can scrape sites, launch attacks, or just crawl for search engines. Cloudflare's system uses machine learning (fancy algorithms that learn from patterns) to score every request: Is this a human typing away or a sneaky robot? Customers set rules based on these scores, like "block anything under 20/100."
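
To picture how such a rule behaves, here's a tiny sketch of the threshold check. The cutoff of 20 comes from the example rule above; the function name and return values are placeholders, not Cloudflare's actual rule engine.

```python
def evaluate_bot_rule(bot_score: int, threshold: int = 20) -> str:
    """Return the action a score-based customer rule would take."""
    if bot_score < threshold:
        return "block"   # low score: the request looks automated
    return "allow"       # high score: the request looks human

print(evaluate_bot_rule(7))   # -> block
print(evaluate_bot_rule(85))  # -> allow
```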

The scoring relies on a "feature file": think of it as a constantly updating cheat sheet of traits the AI checks, like "how fast did the request arrive?" or "does the user agent string look fishy?" This file gets refreshed every few minutes from a database called ClickHouse (a speedy tool for crunching huge datasets) and pushed out to thousands of servers worldwide to keep defenses sharp against evolving bot tricks.
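
Here's a rough sketch of that refresh cycle, under stated assumptions: the JSON format, the five-minute interval, and every function name are placeholders. Only the overall loop (query the database, rebuild the file, push it to the edge) mirrors the description above.

```python
import json
import time

def build_feature_file(rows: list[dict]) -> bytes:
    # Serialize the feature definitions into the file the bot module loads.
    return json.dumps(rows).encode()

def refresh_loop(query_features, push_to_edge, interval_seconds: int = 300):
    # Every few minutes: re-run the query, rebuild the file, fan it out worldwide.
    while True:
        rows = query_features()                  # e.g. a SELECT against ClickHouse
        push_to_edge(build_feature_file(rows))   # distribute to edge servers
        time.sleep(interval_seconds)

# Stand-in data instead of a real database query:
demo_rows = [{"feature": "request_rate"}, {"feature": "user_agent_entropy"}]
print(build_feature_file(demo_rows))
```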

Here's where it went wrong: A routine change to ClickHouse's user permissions, meant to tighten security, backfired. The query that builds the feature file started spitting out duplicates, like a printer jamming and copying the same page twice. In short order, the file doubled in size, going from a tidy list to a bloated mess.
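
A toy illustration of that failure mode, with invented database, table, and column names: if an extra underlying database suddenly becomes visible to the query and nothing filters or deduplicates the metadata, every feature row shows up twice, and the file built from the results doubles.

```python
# Metadata rows as (database, table, column) tuples -- names are made up.
visible_before = [
    ("default", "request_features", "feature_a"),
    ("default", "request_features", "feature_b"),
]

# After the permissions change, the same table also appears under a second,
# underlying database, so an unfiltered listing returns every row twice.
visible_after = visible_before + [
    ("underlying_db", "request_features", "feature_a"),
    ("underlying_db", "request_features", "feature_b"),
]

def feature_rows(metadata):
    # Mirrors a metadata query that selects feature columns without deduplication.
    return [column for (_db, _table, column) in metadata]

print(len(feature_rows(visible_before)))  # 2 features
print(len(feature_rows(visible_after)))   # 4 rows: every feature duplicated
```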

Cloudflare's software? It had a hard-coded size limit on that file, like a suitcase that only holds so much before it bursts. When the oversized version propagated across the network, the bot module crashed. Boom: any traffic needing a bot score got slapped with a 5xx error. This rippled to dependent services: Developers using Workers KV for quick data pulls saw reads fail, and Access logins timed out.
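
Here's a minimal sketch of what such a hard cap looks like in practice. The limit of 200 and every name are hypothetical; the point is that a fixed ceiling intended as a safety net becomes a crash the moment a file bigger than anyone expected shows up.

```python
MAX_FEATURES = 200  # hypothetical hard-coded ceiling, chosen when the code was written

class FeatureFileTooLarge(Exception):
    pass

def load_feature_file(features: list[str]) -> list[str]:
    # Anything past the preallocated limit is treated as fatal.
    if len(features) > MAX_FEATURES:
        raise FeatureFileTooLarge(
            f"{len(features)} features exceeds the limit of {MAX_FEATURES}"
        )
    return features

load_feature_file(["feature"] * 150)        # fits: loads fine
try:
    load_feature_file(["feature"] * 400)    # the post-duplication file: too big
except FeatureFileTooLarge as exc:
    print("module fails to load:", exc)     # in production this surfaced as 5xx errors
```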

Complicating things, Cloudflare was mid-migration to a shiny new proxy version, FL2 (an upgrade for handling even more traffic efficiently). Sites on FL2 hit full errors, while legacy FL ones limped along with bogus scores (every request got a default score that looked like a bot, triggering false positives for customers with bot-blocking rules). It was like two lanes of the highway breaking differently: one with a full blockade, the other with misleading signs.
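
A toy contrast of those two failure modes, again with invented names and values: the newer path fails closed with an error the client sees as a 5xx, while the legacy path keeps answering but hands back a default, bot-looking score that trips blocking rules.

```python
def score_request_fl2(module_crashed: bool, real_score: int) -> int:
    # Newer proxy path: the crash propagates and the request fails outright.
    if module_crashed:
        raise RuntimeError("bot module unavailable -- the client sees a 5xx error")
    return real_score

def score_request_fl(module_crashed: bool, real_score: int) -> int:
    # Legacy path: keep serving, but every request gets a default bot-like score,
    # so customer rules that block low scores start blocking real users.
    if module_crashed:
        return 0
    return real_score

print(score_request_fl(module_crashed=True, real_score=75))   # -> 0 (bogus score)
try:
    score_request_fl2(module_crashed=True, real_score=75)
except RuntimeError as exc:
    print(exc)                                                # -> hard failure
```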

No diagrams were needed to see the pain: Internal chats buzzed with "Is this Aisuru-level DDoS?" (referring to recent mega-attacks hitting terabits per second). But the logs told the truth: it was an inside job, born of good intentions gone awry.

The Fix: Quick Thinking Under Pressure

Once the team pinpointed the swollen file, recovery was straightforward but not instant. They halted its spread, rolled back to a pre-bloat version, and watched traffic normalize. The real grind came next: Managing the "thundering herd" effect, where delayed users all rushed back at once, spiking loads elsewhere. By evening, the network was stable, and apologies flowed from Cloudflare's leadership. "We know we let you down today," they wrote, owning the hit to their reputation and to the broader web.

Lessons from the Wreckage: Building a Tougher Web

Cloudflare didn't stop at "oops." Their postmortem lays bare the failures: Undocumented size limits in code, untested permission changes, and over-reliance on a single file for critical features. It's a reminder that even giants stumble on basics, like assuming a database query won't go rogue.

Looking ahead, they're committing to overhauls: Better testing for config changes, elastic limits that scale with file sizes, and more redundancy in bot scoring to avoid single points of failure. They'll share more as fixes roll out, turning this embarrassment into a blueprint for resilience.
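
As one illustration of the "elastic limits" and redundancy ideas, here's a hedged sketch of failing safe instead of crashing when a config file looks wrong. The threshold and function names are invented; Cloudflare hasn't published the exact implementation of these changes.

```python
MAX_REASONABLE_FEATURES = 200  # hypothetical sanity threshold, not a hard crash point

def load_features_safely(new_features: list[str], last_good: list[str]) -> list[str]:
    """Prefer the fresh file, but fall back to the last known-good version
    instead of taking down the request path when the new one looks wrong."""
    if not new_features or len(new_features) > MAX_REASONABLE_FEATURES:
        # A real system would also log and alert here; the key is to keep serving.
        return last_good
    return new_features

good = ["feature_a", "feature_b"]
bloated = ["feature"] * 400
print(load_features_safely(bloated, last_good=good))  # -> ['feature_a', 'feature_b']
```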

In the end, outages like this underscore the internet's fragility. One company's hiccup can dim the lights for millions, but transparency like Cloudflare's helps us all learn. If you're a dev or site owner, it might be time to diversify your stack or chat with your CDN provider about failover plans. The web's only as strong as its weakest link, and today we're all a bit wiser about reinforcing it.
