<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: technologyInsideOut</title>
    <description>The latest articles on DEV Community by technologyInsideOut (@shimork).</description>
    <link>https://dev.to/shimork</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2244332%2Fb871024b-33f6-42c5-93c8-fb8ea4a7deb6.png</url>
      <title>DEV Community: technologyInsideOut</title>
      <link>https://dev.to/shimork</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shimork"/>
    <language>en</language>
    <item>
      <title>What Actually Happened to the Internet on November 18, 2025?</title>
      <dc:creator>technologyInsideOut</dc:creator>
      <pubDate>Sat, 22 Nov 2025 17:02:14 +0000</pubDate>
      <link>https://dev.to/shimork/what-actually-happened-to-the-internet-on-november-18-2025-2pof</link>
      <guid>https://dev.to/shimork/what-actually-happened-to-the-internet-on-november-18-2025-2pof</guid>
      <description>&lt;p&gt;On November 18th, 2025, the Internet seemed to come apart at the seams. Services used by billions; OpenAI, X (Twitter), Canva, Uber, and countless others suddenly returned 5xx errors in bright red banners. It wasn’t just one company having a rough day… it was the entire modern Internet gasping for air.&lt;/p&gt;

&lt;p&gt;But contrary to what many assumed, the issue wasn’t some massive cyberattack or worldwide server meltdown. It all traced back to a single guardian of the Internet’s infrastructure:&lt;br&gt;
CLOUDFLARE!&lt;/p&gt;

&lt;p&gt;Cloudflare sits at the gateway of the modern web, absorbing DDoS attacks, accelerating performance, securing APIs, and handling DNS resolution for millions of customers. When Cloudflare breaks, the Internet breaks.&lt;/p&gt;

&lt;p&gt;But on this particular day, something strange happened: instead of defending others against denial of service, Cloudflare accidentally denied service to itself.&lt;/p&gt;

&lt;p&gt;The Breakdown: How a Tiny Misconfiguration Cascaded Into Global Failure&lt;br&gt;
Cloudflare later released a detailed postmortem. The simplified version is:&lt;/p&gt;

&lt;p&gt;A seemingly harmless configuration change in one of their internal systems caused their proxy servers to crash — repeatedly — taking huge parts of the global Internet with them.&lt;/p&gt;

&lt;p&gt;Let’s break down what happened, in plain English.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cloudflare’s Bot Management System Received a Bad Update
Cloudflare heavily relies on its Bot Management system — a component that classifies incoming traffic as “human” or “bot” using hundreds of behavioural features. These features are periodically updated as patterns change.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each update is packaged into a “feature file” that all Cloudflare proxy servers download.&lt;/p&gt;

&lt;p&gt;But on November 18th:&lt;/p&gt;

&lt;p&gt;⚠️ A permissions change in Cloudflare’s ClickHouse cluster caused the system to generate duplicate rows in the feature file.&lt;br&gt;
This made the file much larger than normal — but the system that processed it wasn’t designed to handle that.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Proxy Servers Tried to Load the Oversized Feature File — and Crashed
Cloudflare’s next-gen proxy engine (FL2), written in Rust, makes a performance-oriented assumption:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The bot feature file will never contain more than 200 features.&lt;/p&gt;

&lt;p&gt;To optimize, the system pre-allocated memory for exactly that amount.&lt;/p&gt;

&lt;p&gt;But the corrupted file contained more than twice that number.&lt;/p&gt;

&lt;p&gt;When the proxies attempted to load it, the code hit an out-of-bounds condition and called unwrap() on the resulting error. In Rust, calling unwrap() on an Err value triggers a panic, so the proxy process crashed outright.&lt;/p&gt;

&lt;p&gt;In other words:&lt;/p&gt;

&lt;p&gt;One malformed config file caused proxies to crash instantly. And since Cloudflare proxies serve nearly all user traffic… 5xx errors began spreading across the Internet like wildfire.&lt;/p&gt;
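&lt;p&gt;The failure mode can be sketched in a few lines of Python. This is a hypothetical illustration, not Cloudflare’s actual code; the constant MAX_FEATURES and the file contents are invented:&lt;/p&gt;

```python
# Hypothetical sketch of the FL2 assumption: preallocate space for a fixed
# maximum number of bot features and treat anything larger as fatal.

MAX_FEATURES = 200  # the assumed upper bound baked into the proxy

def load_feature_file(lines):
    """Load features into a fixed-capacity table; abort if the bound is exceeded."""
    table = [None] * MAX_FEATURES          # preallocated, constant-size memory
    for i, line in enumerate(lines):
        if i >= MAX_FEATURES:
            # Rough equivalent of calling unwrap() on an Err in Rust:
            # the process dies instead of handling the bad input.
            raise SystemExit("panic: feature file exceeds preallocated capacity")
        table[i] = line
    return table

good_file = ["feature_%d" % i for i in range(200)]
bad_file = ["feature_%d" % i for i in range(400)]   # duplicate rows doubled the size

load_feature_file(good_file)      # loads fine
# load_feature_file(bad_file)     # would terminate the "proxy" with a panic
```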

&lt;ol start="3"&gt;
&lt;li&gt;The Worst Part: The Bad File Kept Regenerating
Cloudflare’s infrastructure automatically regenerates the bot feature file every few minutes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Because it was pulling from different ClickHouse nodes — some corrected, some not — the system kept randomly generating:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sometimes a good file (proxies recovered)&lt;/li&gt;
&lt;li&gt;sometimes a bad file (proxies crashed again)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This created a yo-yo cycle of recovery and collapse, making diagnosis extremely difficult.&lt;/p&gt;

&lt;p&gt;At first, even Cloudflare engineers thought it might be a DDoS attack due to the scale and pattern of failure.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Global Impact: Why So Many Platforms Went Down
Cloudflare’s proxy layer is foundational. When it collapses, several layers of the modern web collapse with it:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Websites can’t route traffic&lt;/li&gt;
&lt;li&gt;APIs can’t be reached&lt;/li&gt;
&lt;li&gt;Authentication systems fail&lt;/li&gt;
&lt;li&gt;Applications behind Cloudflare appear “offline”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Internally, Cloudflare services also broke:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Turnstile (CAPTCHA alternative) stopped working&lt;/li&gt;
&lt;li&gt;Workers KV showed elevated 5xx&lt;/li&gt;
&lt;li&gt;Access authentication failed for new logins&lt;/li&gt;
&lt;li&gt;The Dashboard became unreachable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This explains why so many independent platforms, even those with massive infrastructures of their own — OpenAI, X, and others — suddenly looked like they were having outages.&lt;/p&gt;

&lt;p&gt;They just couldn’t get through Cloudflare.&lt;/p&gt;

&lt;p&gt;Why Did This Happen?&lt;br&gt;
A Deeper Look at Architectural Issues&lt;/p&gt;

&lt;p&gt;Two questions naturally arise from this incident. Let’s answer them.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Why was that memory limit chosen as the upper limit?
Cloudflare set the 200-feature limit because:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;it kept memory allocation constant and fast&lt;/li&gt;
&lt;li&gt;the bot detection system historically never exceeded that number&lt;/li&gt;
&lt;li&gt;preallocation improves performance and safety&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But this optimization became a fragile single point of failure.&lt;/p&gt;

&lt;p&gt;If assumptions aren’t validated at runtime — even high-performance systems can break catastrophically.&lt;/p&gt;
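&lt;p&gt;A hedged sketch of the safer alternative: validate the assumption at runtime and fall back to the last known-good file instead of crashing. Names and sizes here are illustrative, not Cloudflare’s:&lt;/p&gt;

```python
# Validate the input against the assumed bound; on violation, keep serving
# traffic with the last file that was known to be good.

MAX_FEATURES = 200

def load_with_validation(lines, last_known_good):
    if len(lines) > MAX_FEATURES:
        # Reject the bad input and keep running on stale-but-valid data.
        return last_known_good, "rejected oversized feature file"
    return list(lines), "loaded new feature file"

known_good = ["feature_%d" % i for i in range(180)]
oversized = ["feature_%d" % i for i in range(400)]

active, status = load_with_validation(oversized, known_good)
# The proxy stays up: `active` is still the 180-feature known-good set.
```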

&lt;ol start="2"&gt;
&lt;li&gt;Was this an internal Distributed Denial-of-Service?
Technically, no — not in the traditional sense.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But conceptually?&lt;/p&gt;

&lt;p&gt;Yes — Cloudflare unintentionally DDoSed itself.&lt;/p&gt;

&lt;p&gt;Here’s why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;every proxy tried to download the oversized feature file&lt;/li&gt;
&lt;li&gt;each attempt caused a crash&lt;/li&gt;
&lt;li&gt;crashed proxies kept retrying&lt;/li&gt;
&lt;li&gt;retries added load to internal systems&lt;/li&gt;
&lt;li&gt;regenerating a bad file caused waves of failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This resembles a self-induced DDoS loop, even though the root wasn’t malicious.&lt;/p&gt;

&lt;p&gt;It reveals a weakness of microservice-style architectures:&lt;/p&gt;

&lt;p&gt;If a core internal service feeds invalid data into the system, it can overwhelm the entire infrastructure — not through volume, but through bad assumptions.&lt;/p&gt;

&lt;p&gt;Cloudflare themselves acknowledged this: their configuration pipeline wasn’t protected by enough safeguards or validation layers.&lt;/p&gt;

&lt;p&gt;What Cloudflare Did to Fix It&lt;br&gt;
Cloudflare implemented multiple fixes to prevent a repeat:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A global kill switch to stop rollout of corrupted feature files&lt;/li&gt;
&lt;li&gt;Stricter validation to reject oversized or malformed bot feature files&lt;/li&gt;
&lt;li&gt;Runtime safety checks (no more panicking on bad input)&lt;/li&gt;
&lt;li&gt;Better circuit breakers so proxies fall back to a safe state instead of crashing&lt;/li&gt;
&lt;li&gt;Hard limits and guardrails around ClickHouse DB permissions&lt;/li&gt;
&lt;li&gt;Slower rollout of Bot Management updates, instead of blasting them globally at once&lt;/li&gt;
&lt;/ul&gt;
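&lt;p&gt;A kill switch and a circuit breaker can be combined in very little code. The sketch below is illustrative, not Cloudflare’s implementation; the class name and threshold are invented:&lt;/p&gt;

```python
class ConfigCircuitBreaker:
    """Stop pulling new config after repeated failures and pin the current one."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.tripped = False                 # the "kill switch" state

    def try_apply(self, apply_fn, new_config):
        if self.tripped:
            return "rollout halted; serving pinned config"
        try:
            apply_fn(new_config)
            self.consecutive_failures = 0    # a good file resets the breaker
            return "applied"
        except ValueError:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.tripped = True          # break the crash/recover yo-yo cycle
            return "apply failed"
```

&lt;p&gt;Once tripped, the breaker keeps the fleet on the pinned configuration until an operator intervenes, rather than letting a regenerating bad file crash proxies every few minutes.&lt;/p&gt;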

&lt;p&gt;The fixes show that Cloudflare took the incident seriously — and that the real problem wasn’t just a bug, but its ability to cascade.&lt;/p&gt;

&lt;p&gt;Conclusion: The Internet’s Fragility in One Bug&lt;br&gt;
The November 18, 2025 outage wasn’t caused by an attacker, a massive data centre failure, or a cyber-war event.&lt;/p&gt;

&lt;p&gt;It was caused by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a small configuration change&lt;/li&gt;
&lt;li&gt;a duplicated set of rows&lt;/li&gt;
&lt;li&gt;a feature file that grew too large&lt;/li&gt;
&lt;li&gt;a proxy system that panicked on invalid input&lt;/li&gt;
&lt;li&gt;and a rollout mechanism that propagated the mistake globally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When systems scale to the size of Cloudflare, tiny bugs no longer produce tiny failures. They can break the Internet.&lt;br&gt;
And on November 18th, 2025 — they did.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>cloudflarechallenge</category>
      <category>distributedsystems</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Inside AWS's Outage and What It Teaches Developers</title>
      <dc:creator>technologyInsideOut</dc:creator>
      <pubDate>Fri, 24 Oct 2025 21:01:00 +0000</pubDate>
      <link>https://dev.to/shimork/inside-awss-outage-and-what-it-teaches-developers-3h8a</link>
      <guid>https://dev.to/shimork/inside-awss-outage-and-what-it-teaches-developers-3h8a</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;When AWS sneezed, the internet caught a cold.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When AWS experienced its outage, it wasn’t just a small glitch in the cloud; it triggered a domino effect that rippled across much of the internet. The incident wasn’t caused by lost servers or faulty disks; it began with a failure in a monitoring system, the very component meant to keep everything running smoothly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it all started&lt;/strong&gt;&lt;br&gt;
An internal monitoring tool inside AWS began reporting wrong information about service health. It reported some systems as unhealthy when they were actually fine. That wrong reporting triggered automated reactions. Those reactions included updating DNS and routing data to move traffic away from endpoints that were marked unhealthy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Component: DynamoDB and Its Importance
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtmsyaxdmr466obihl5f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtmsyaxdmr466obihl5f.png" alt="AWS DynamoDB Functions" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To understand the outage, it helps to grasp the role of DynamoDB within AWS. It wasn’t just another database; it served as the data backbone that many AWS services depended on.&lt;/p&gt;

&lt;p&gt;Critical elements such as &lt;strong&gt;IAM session tokens&lt;/strong&gt;, &lt;strong&gt;Service metadata&lt;/strong&gt;, and &lt;strong&gt;Routing configurations&lt;/strong&gt; were often stored in DynamoDB. So, when it began to slow down, AWS itself started to struggle.&lt;/p&gt;

&lt;p&gt;Although the outage originated from a malfunctioning monitoring tool, it felt as if DynamoDB had failed. Systems waiting for its responses came to a standstill. Authentication slowed, internal routing delayed, and customers faced errors everywhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why that mattered&lt;/strong&gt;&lt;br&gt;
Many AWS services and customer workloads relied on those DNS and routing records to find endpoints. One of the services most affected was DynamoDB. DynamoDB does more than hold customer data. It also stores critical metadata and state that many AWS control plane functions and other services used. When routing and DNS made DynamoDB appear unreachable, many internal and external operations stopped working or timed out.&lt;/p&gt;

&lt;p&gt;In simple terms: when DynamoDB paused, AWS paused.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Culprit: AWS Monitoring Service
&lt;/h2&gt;

&lt;p&gt;Under normal conditions, AWS’s internal monitoring service continuously evaluated the health of thousands of systems, from load balancers and databases to routing layers and DNS records.&lt;/p&gt;

&lt;p&gt;It supplied this data to other AWS systems that made decisions based on those health signals. For example, Route 53 updated DNS records when a region appeared unhealthy, Auto Scaling adjusted capacity according to demand, and service dashboards displayed those “green” or “red” status lights everyone relies on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj330adoobk74lttdsxso.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj330adoobk74lttdsxso.png" alt="System Monitoring Workflow" width="624" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This feedback loop functioned as AWS’s invisible control system.&lt;br&gt;
When it worked correctly, AWS appeared seamless, always fast, self-healing, and always available.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the Brain Misfired
&lt;/h2&gt;

&lt;p&gt;Then came the turning point. The monitoring system started misreporting health statuses. Route 53 and other routing layers saw the monitoring tool’s unhealthy signals and started removing or de-prioritizing endpoints. That meant requests could not find the right server addresses. In many cases the servers themselves were up and running. The problem was the map that pointed clients to those servers.&lt;/p&gt;

&lt;p&gt;DNS is not instant. DNS records have time to live values and caches exist across the internet. When DNS entries changed, caches held stale values. Clients tried to resolve names and often received answers that no longer matched the actual topology. That added confusion and delay while the system tried to converge on correct records.&lt;/p&gt;
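&lt;p&gt;The caching dynamic can be modelled in a few lines. This is a toy model with assumed TTL values and record names, not real resolver code:&lt;/p&gt;

```python
# Toy DNS cache: once a record is cached, the cache keeps answering with it
# until the TTL expires, even if the authoritative record has been corrected.

import time

class TtlCache:
    def __init__(self):
        self.store = {}   # name -> (value, expires_at)

    def get(self, name, resolve_fn, ttl=300):
        entry = self.store.get(name)
        now = time.monotonic()
        if entry is not None and entry[1] > now:
            return entry[0]                    # cached answer, possibly stale
        value = resolve_fn(name)
        self.store[name] = (value, now + ttl)
        return value

cache = TtlCache()
cache.get("db.example.internal", lambda n: "10.0.0.1", ttl=300)
# Even after the authoritative record is fixed to 10.0.0.2, this cache keeps
# handing out 10.0.0.1 for up to 300 seconds.
```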

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sz881nf7cnnqw7pngxi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sz881nf7cnnqw7pngxi.png" alt="Malfunctioning of Monitoring System" width="625" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This wasn’t a simple service failure rather it was automation amplifying bad data. It was like a self-healing system applying the wrong cure to the wrong patient.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cascade: Flooded Regions and Retry Storms
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsqi6oqvbt1vc4tvjjox.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsqi6oqvbt1vc4tvjjox.png" alt="The cascade" width="625" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How retries amplified the problem&lt;/strong&gt;&lt;br&gt;
The failure was not a single component crashing; it was a sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A monitoring signal went wrong.&lt;/li&gt;
&lt;li&gt;DNS and routing were updated based on that wrong signal.&lt;/li&gt;
&lt;li&gt;Clients could not reach the correct endpoints even though those endpoints were often alive.&lt;/li&gt;
&lt;li&gt;Clients retried at scale and generated huge load.&lt;/li&gt;
&lt;li&gt;Control plane APIs and internal services that relied on DynamoDB and those DNS records started failing or timing out.&lt;/li&gt;
&lt;li&gt;Recovery took much longer than the original event because the system was under a self inflicted load and DNS cache propagation delayed fixes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every API, application, and Lambda function continued to bombard AWS with traffic, overwhelming systems that were already in recovery mode. That was what turned a two-hour disruption into a full-day meltdown.&lt;br&gt;
The network wasn’t brought down by external traffic but rather it was overwhelmed by its own reflexes.&lt;/p&gt;

&lt;p&gt;In architectural terms, this was a classic retry storm: too many clients retrying too aggressively, flooding a half-healed system that was trying to stabilize.&lt;/p&gt;
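&lt;p&gt;The standard remedy for retry storms is exponential backoff with jitter. A minimal sketch, with illustrative parameter values:&lt;/p&gt;

```python
# Exponential backoff with "full jitter": the wait ceiling doubles on every
# attempt, and each client sleeps a random fraction of it, which spreads
# retries out instead of synchronizing them into waves.

import random

def backoff_delays(max_attempts=6, base=0.5, cap=30.0):
    """Yield a randomized sleep time before each retry attempt."""
    for attempt in range(max_attempts):
        ceiling = min(cap, base * (2 ** attempt))   # 0.5s, 1s, 2s, 4s, ...
        yield random.uniform(0, ceiling)

for delay in backoff_delays():
    print("sleep %.2fs, then retry" % delay)
```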

&lt;h2&gt;
  
  
  Lessons for Developers and Architects
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;I want to be clear. AWS runs at enormous scale and provides functionality that would be extremely hard for most teams to build. Respect for that does not stop us from asking why a failure like this happened and what we can learn.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The outage wasn’t about hardware but a control plane collapse, proof that the brain of the cloud could fail even when the body remained healthy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How could AWS miss a seemingly simple design problem?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;Extreme system complexity&lt;/em&gt;
Large systems are full of dependencies. A monitoring tool that looks simple can influence many control loops. At scale, the interactions between independent systems are hard to model exhaustively.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Trust and automation&lt;/em&gt;
Automation relies on signals. If signals are usually accurate, systems are built to trust them. That optimization makes normal operations efficient. It creates fragility when a foundational signal is wrong.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Operational tradeoffs&lt;/em&gt;
Decisions about where to place metadata, how to handle health checks, and how fast to roll out changes all balance availability, latency, and cost. Some tradeoffs that make day to day operation efficient can increase risk in rare edge cases.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Testing and rollout limits&lt;/em&gt;
Simulating every possible interaction at global scale is nearly impossible. Features and monitoring logic are tested, but certain combinations only appear in production conditions.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Human factors&lt;/em&gt;
When things go wrong, teams must decide fast. Automated responses can help, but they can also amplify mistakes before humans can intervene. The window between bad automation and human correction becomes the failure mode.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Why DNS became central in this outage&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;DNS is the directory&lt;/em&gt;
DNS maps names to addresses. Almost every service discovery and routing mechanism ultimately uses name resolution somewhere in the chain.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;DNS affects many surfaces&lt;/em&gt;
When DNS records change, load balancers, client resolvers, and caches must converge. That convergence is not instantaneous and it is global. Missteps at the DNS level therefore have wide reach.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Health checks drive routing&lt;/em&gt;
Many failover and routing strategies depend on health checks that update DNS. That means bad health signals directly change where traffic goes.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Cache and TTL dynamics&lt;/em&gt;
DNS caching improves performance but slows correction. If a wrong record is cached widely, fixes take longer to take effect across clients and networks.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From a developer’s standpoint, this taught some painful but valuable lessons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Never trust a single source of truth. If your monitoring or DNS layer fails, you need external validation — even something simple and independent.&lt;/li&gt;
&lt;li&gt;Design for cloud lies. Your app should be able to handle AWS saying “unavailable” when it isn’t. Implement backoff logic and alternate data paths.&lt;/li&gt;
&lt;li&gt;Managed doesn’t mean invincible. DynamoDB, Route 53, IAM, and the rest are dependencies that can fail in ways beyond your control.&lt;/li&gt;
&lt;li&gt;Retry storms are real. Exponential backoff isn’t optional; it’s a survival mechanism. Retries need to calm the storm, not feed it.&lt;/li&gt;
&lt;/ol&gt;
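&lt;p&gt;An “alternate data path” can be as simple as a read-through cache that serves stale data when the managed store times out. A hedged sketch; the cache shape and the use of TimeoutError are illustrative:&lt;/p&gt;

```python
# Read through to the primary store; on timeout, serve the last locally
# cached copy (degraded but available) instead of failing outright.

local_cache = {}

def fetch_with_fallback(key, primary_fetch):
    try:
        value = primary_fetch(key)       # e.g. a read from a managed database
        local_cache[key] = value         # refresh the local copy on success
        return value, "fresh"
    except TimeoutError:
        if key in local_cache:
            return local_cache[key], "stale"   # degraded but still serving
        raise                            # no fallback available: surface the error
```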

&lt;h2&gt;
  
  
  The Real Problem?
&lt;/h2&gt;

&lt;p&gt;This outage exposed a core truth. The automated systems that make cloud operations possible also become single points of failure if we assume they are always correct. A resilient design assumes that the control plane can lie, caches can be stale, and a trusted health signal can be wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ask yourself and your team:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If the provider’s monitoring or DNS reported false information, could your system still serve core use cases?&lt;/li&gt;
&lt;li&gt;Do your retry policies help or hurt during partial outages?&lt;/li&gt;
&lt;li&gt;Where do you hold critical metadata and how quickly can you switch that path if the control plane is slow?&lt;/li&gt;
&lt;li&gt;Do you have an external sanity checker for critical dependencies?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Practical, concrete recommendations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For developers and architects&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do not assume provider health equals your system health. Add independent health checks and a second opinion from outside the region or provider when critical.&lt;/li&gt;
&lt;li&gt;Build graceful degradation. If a managed service is unavailable, let less critical features degrade while core functionality stays online.&lt;/li&gt;
&lt;li&gt;Implement exponential backoff, jitter, and circuit breakers. Design retry logic to reduce pressure when the system is struggling.&lt;/li&gt;
&lt;li&gt;Keep critical metadata replicated or cached in ways that let your application continue operating if control APIs are slow.&lt;/li&gt;
&lt;li&gt;Consider alternative service discovery or fallback strategies for DNS failures, such as client side caches with safe fallbacks, or a lightweight secondary resolver outside the provider.&lt;/li&gt;
&lt;li&gt;Run chaos experiments that target provider control plane failures, not just resource failures. Test how your system behaves if DNS or IAM becomes temporarily inconsistent.&lt;/li&gt;
&lt;/ul&gt;
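&lt;p&gt;The “second opinion” idea above can be sketched as a rule that only acts when two independent probes agree. The probe functions here are placeholders, e.g. an in-region check and an external one:&lt;/p&gt;

```python
# Quorum-style health decision: remove an endpoint only when two independent
# probes agree it is unhealthy; on disagreement, hold the safe state.

def decide(primary_probe, secondary_probe):
    a, b = primary_probe(), secondary_probe()
    if a == b == "unhealthy":
        return "remove endpoint"          # both sources agree it is really down
    if a != b:
        return "hold state and escalate"  # disagreement: do not auto-remove
    return "keep endpoint"

decide(lambda: "unhealthy", lambda: "healthy")   # -> "hold state and escalate"
```

&lt;p&gt;Had the routing layer required such agreement, a single misreporting monitor could not have removed healthy endpoints on its own.&lt;/p&gt;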

&lt;p&gt;&lt;strong&gt;For platform providers in general&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Treat monitoring infrastructure as a first class system with independent protection. Monitoring and control oracles need isolation and cross checks.&lt;/li&gt;
&lt;li&gt;Build multi source validation into critical control loops. If two independent checks disagree, favor a safe, conservative state or a manual escalation rather than an immediate automated removal of endpoints.&lt;/li&gt;
&lt;li&gt;Make rollback simple and fast for routing changes. Minimize change blast radius by staging routing updates and providing easier means to revert.&lt;/li&gt;
&lt;li&gt;Improve transparency to customers during incidents, with clear signals about which subsystems are affected and what fallback actions customers can take.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>azure</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
