
Sourav Bandyopadhyay


When the Cloud Went Dark ☁️💥

15 hours of downtime.

17M+ outage reports across 60+ countries.

One DNS failure that brought the internet to its knees.

On October 19th, 2025, AWS US-EAST-1 didn't just have a bad day β€” it had a catastrophic collapse.

Snapchat froze. Roblox went dark. UK taxpayers couldn't file returns. Even smart mattresses stopped working (yes, really). Developer tools like Postman became unusable. A Premier League match got interrupted.

If you were building on AWS that day, you had a front-row seat to what happens when an enormous share of the world's internet services depends on a single region.

This wasn't just an outage.

This was a brutal stress test of modern cloud architecture β€” and most of us failed it.


🎯 What This Teardown Covers

I spent weeks dissecting this incident, and here's what you'll discover:

The Technical Reality

  • Why multi-AZ deployments offered zero protection when the entire region went down
  • How a DNS resolution failure cascaded through DynamoDB β†’ EC2 β†’ IAM β†’ everything else
  • The hidden architecture trap: your dependencies have dependencies (and those can fail too)
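
That last point deserves a concrete illustration. Here's a minimal sketch (mine, not the teardown's, with hypothetical service names) of how you can walk a declared dependency map and surface every transitive dependency a service actually rests on:

```python
# Minimal sketch: surface the transitive (hidden) dependencies of a service.
# Service names and edges below are hypothetical examples.
from collections import deque

# Each team usually declares only the dependencies it knows about.
DECLARED_DEPS = {
    "checkout-api":  ["payments-svc", "session-store"],
    "payments-svc":  ["dynamodb"],
    "session-store": ["dynamodb"],
    "dynamodb":      ["internal-dns"],  # the layer nobody writes down
    "internal-dns":  [],
}

def transitive_deps(service):
    """Breadth-first walk of the declared dependency graph."""
    seen, queue = set(), deque(DECLARED_DEPS.get(service, []))
    while queue:
        dep = queue.popleft()
        if dep not in seen:
            seen.add(dep)
            queue.extend(DECLARED_DEPS.get(dep, []))
    return seen

if __name__ == "__main__":
    for svc in DECLARED_DEPS:
        print(f"{svc} ultimately depends on: {sorted(transitive_deps(svc))}")
```

Run something like this against a real service catalogue and the pattern is usually the same: almost every path bottoms out at a low-level shared layer like internal DNS, which is exactly the layer that failed here.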

The Business Impact

  • Real numbers: How a mid-sized e-commerce site lost $62,500 in 15 hours
  • The companies nobody expected to go offline: Eight Sleep, Postman, UK HMRC tax portals
  • Why this one outage blew through typical annual error budgets roughly 17 times over

The Resilience Playbook

  • 3 multi-region patterns (Active-Active, Active-Passive, Pilot Light) with cost-benefit breakdowns
  • Why teams with automated failover experienced a 3-minute incident while manual teams suffered 3-6 hours (a minimal failover sketch follows this list)
  • The hard truth about multi-cloud: when it's overkill vs. when it's essential
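
To make "automated failover" less abstract, here's a minimal sketch of one common pattern: Route 53 failover routing between a primary and a standby regional endpoint. The hosted zone ID, record name, IPs, and health check ID are placeholders, and this is just one option among the patterns above, not the setup the teardown prescribes:

```python
# Minimal sketch: Route 53 active-passive failover between two regional endpoints.
# The hosted zone ID, record name, IPs, and health check ID are placeholders.
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000000000000000"                      # placeholder
RECORD_NAME = "api.example.com."
PRIMARY_IP, SECONDARY_IP = "192.0.2.10", "198.51.100.20"     # documentation IPs
PRIMARY_HEALTH_CHECK_ID = "00000000-0000-0000-0000-000000000000"  # placeholder

def failover_record(identifier, role, ip, health_check_id=None):
    """Build an UPSERT for one half of a PRIMARY/SECONDARY failover pair."""
    record = {
        "Name": RECORD_NAME,
        "Type": "A",
        "SetIdentifier": identifier,
        "Failover": role,              # "PRIMARY" or "SECONDARY"
        "TTL": 60,                     # low TTL so clients pick up the switch quickly
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Active-passive failover: us-east-1 primary, us-west-2 standby",
        "Changes": [
            failover_record("primary-us-east-1", "PRIMARY", PRIMARY_IP,
                            PRIMARY_HEALTH_CHECK_ID),
            failover_record("secondary-us-west-2", "SECONDARY", SECONDARY_IP),
        ],
    },
)
```

Worth noting: DNS-based failover still assumes the DNS layer itself is answering, which is part of why active-active designs pair it with application-level retries against the other region.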

The Bigger Picture

  • This isn't a one-off: Meta (2021), CrowdStrike (2024), Cloudflare-AWS (2025) β€” concentration risk is the pattern
  • How EU DORA and UK Critical Third Parties regulations are changing the game
  • Why "AWS-only expertise" is no longer enough in 2025

πŸ’‘ Why This Matters to You

If you're building anything on the cloud, you're making a bet: that someone else's infrastructure will hold.

But here's what this outage taught us:

Your architecture is only as resilient as your weakest hidden dependency.

You can follow every AWS best practice. Deploy across multiple AZs. Implement health checks. Follow the Well-Architected Framework.

It won't matter if you're still in a single region when DNS fails.

This teardown isn't optional reading.

It's the mirror you need to hold up to your own system design before the next outage finds you unprepared.


πŸ”‘ Three Takeaways You Can't Ignore

1. Multi-AZ β‰  Multi-Region

AZs are rooms in a house. When the entire house floods, every room goes underwater.

You need multiple houses.

2. DNS Is the Silent Killer

It's not just "another service." When internal DNS fails, your entire cloud provider's ecosystem can become unreachable β€” even to their own engineers.

Every modern app has DNS as a hidden single point of failure.
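
One way to feel this in code: resolve hostnames as usual, but keep a last-known-good address to fall back on when resolution itself fails. The hostname and cached address below are hypothetical, and serving stale IPs has real downsides, so take this as a sketch of where DNS sits in your stack rather than a recommendation:

```python
# Minimal sketch: fall back to a last-known-good IP when DNS resolution fails.
# Hostname and cached address are hypothetical; stale IPs carry their own risks.
import socket

LAST_KNOWN_GOOD = {"api.example.com": "192.0.2.10"}  # seeded from an earlier success

def resolve_with_fallback(hostname, port=443):
    try:
        # Normal path: ask the resolver.
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
        ip = infos[0][4][0]
        LAST_KNOWN_GOOD[hostname] = ip   # refresh the cache on every success
        return ip
    except socket.gaierror:
        # Resolution failed: reuse the last address that worked, if we have one.
        cached = LAST_KNOWN_GOOD.get(hostname)
        if cached is None:
            raise
        return cached

if __name__ == "__main__":
    print(resolve_with_fallback("api.example.com"))
```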

3. Resilience Is a Design Choice, Not a Checkbox

The teams that survived this outage didn't get lucky. They designed for it.

They automated failover. They tested regional failures in production. They mapped every dependency.

They treated resilience as a first-class architectural concern.
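
If you want a starting point for that kind of drill, here's a minimal game-day sketch (again mine, not the teardown's) that probes a hypothetical health endpoint in each region and reports whether the standby looks ready to take traffic. The URLs, regions, and latency threshold are all placeholders:

```python
# Minimal game-day sketch: check each region's health endpoint before shifting traffic.
# URLs, regions, and the latency threshold are placeholders.
import time
import urllib.request

REGION_HEALTH_URLS = {
    "us-east-1": "https://us-east-1.api.example.com/healthz",
    "us-west-2": "https://us-west-2.api.example.com/healthz",
}
MAX_LATENCY_SECONDS = 2.0

def probe(url):
    """Return (healthy, latency_seconds) for a single health endpoint."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            healthy = resp.status == 200
    except OSError:
        # URLError, timeouts, and connection errors all subclass OSError.
        healthy = False
    return healthy, time.monotonic() - start

if __name__ == "__main__":
    for region, url in REGION_HEALTH_URLS.items():
        ok, latency = probe(url)
        status = "READY" if ok and latency <= MAX_LATENCY_SECONDS else "NOT READY"
        print(f"{region}: {status} ({latency:.2f}s)")
```

A probe is only the first step; the drill counts when you actually move traffic and watch what breaks, which is how the teams above got to minutes of impact instead of hours.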


πŸ“Š What Engineers Are Saying

"This is the most comprehensive breakdown of the October outage I've read. The section on hidden dependencies alone is worth the read."

β€” Senior SRE at a Fortune 500 company

"We ran our first full regional failover test after reading this. Found 3 critical gaps we didn't know existed."

β€” Platform Engineering Lead

"Forwarded this to our entire engineering org. Should be required reading for anyone designing cloud systems."

β€” CTO, Series B startup


πŸš€ Ready to Build More Resilient Systems?

This is just the beginning. At CoreCraft, we dig deep into the engineering decisions that separate bulletproof systems from fragile ones.

Want more insights like this?

πŸ‘‰ Read the full AWS US-EAST-1 teardown

πŸ‘‰ Subscribe to the CoreCraft newsletter for weekly deep-dives on cloud architecture, incident analysis, and engineering resilience

We cover:

  • Real-world outage autopsies with actionable lessons
  • Resilience patterns that actually work in production
  • The infrastructure decisions that make or break modern systems
  • Multi-cloud strategies, cost optimization, and scaling insights

No fluff. No vendor pitches. Just engineering truth.


πŸ’¬ Let's Talk

Have you experienced a major cloud outage firsthand? How did your team respond?

Drop a comment below β€” I'd love to hear:

  • Your biggest cloud resilience challenge right now
  • Whether your team has run a full regional failover test (and what you learned)
  • Any hidden dependencies you discovered the hard way

And if this post helped you think differently about resilience, share it with your team. Every engineer should read this before the next outage hits.


πŸ“Œ Quick Links

πŸ”— Full Teardown

πŸ“¬ CoreCraft Newsletter

πŸ’Ό Connect with us


P.S. If your team hasn't tested a full regional failover this quarter, it's already overdue. The next outage won't wait for you to be ready.

P.P.S. Downtime isn't just a technical problem β€” it's a trust problem. Every minute your service is offline, your competitors are online. Build resilience not because the cloud is unreliable, but because your customers' trust is irreplaceable.


Building resilient systems in 2025? Join 10,000+ engineers who trust CoreCraft for the insights that matter.

πŸ‘‰ Subscribe now and never miss a breakdown.

Top comments (1)

Alexander Ertli

Typically the biggest cloud resilience challenge is paying for it. Especially when you want a multi-cloud setup, traffic between them is billed like crazy.

I find it a bit odd that to secure against an internal cloud issue you need to buy more from the same vendor, but that's another story.

Thanks for recapping the incident.