FirstPassLab

Posted on Apr 2 • Originally published at firstpasslab.com

Why Multi-AZ Failed: Lessons from the First Kinetic Attack on a Major Cloud Region

#security #aws #devops #cloud

When Iranian drones struck AWS data centers in the UAE and Bahrain on March 1, 2026, they didn't just destroy server racks — they invalidated the multi-AZ assumptions that most cloud architectures are built on. AWS responded by waiving all March charges for ME-CENTRAL-1 and ME-SOUTH-1, an unprecedented move. Here's what engineers need to understand about what failed and how to design around it.

TL;DR: Multi-AZ is not a disaster recovery plan when the threat is geopolitical. Only tested, cross-region failover with active data replication protects against the physical destruction of an entire cloud region.

What Actually Happened

Shahed 136 drones struck two AWS data center facilities, causing structural damage, power grid disruption, and water damage from fire suppression systems. The AWS Service Health Dashboard confirmed:

ME-CENTRAL-1 (UAE): 2 of 3 AZs impaired (mec1-az2, mec1-az3)
ME-SOUTH-1 (Bahrain): 1 of 3 AZs lost power entirely (mes1-az2)
84+ services offline including EC2, S3, DynamoDB, Lambda, RDS, and the Management Console

Regional customers — Careem, Alaan, Tabby, and banking services — went down immediately (CNBC, 2026).

Impact Detail	ME-CENTRAL-1 (UAE)	ME-SOUTH-1 (Bahrain)
AZs Impaired	2 of 3	1 of 3
Services Affected	84+	60+
Power Status (Late March)	Partially restored	Still restoring mes1-az2
Customer Migration	Active to unaffected regions	Active to unaffected regions

The Billing Waiver Created a Second Problem

AWS waived all March usage charges, an unprecedented step. But the waiver also removed Cost and Usage Report (CUR) data from billing dashboards. As Corey Quinn pointed out, for most enterprises the CUR isn't just an invoice; it's the authoritative record of what infrastructure exists. Compliance teams, auditors, and FinOps teams all build on it.

AWS later clarified the data wasn't deleted, just filtered from standard reports. But the lesson is clear: your billing data is also your infrastructure inventory. If your DR playbook doesn't account for billing data availability during a region-wide failure, you have an audit gap.

Why Multi-AZ Didn't Protect Workloads

This is the critical engineering lesson. Multi-AZ distributes across physically separate data centers within a single region, but all AZs sit within the same metro area and share the same geopolitical threat envelope.

Three assumptions that broke:

1. Multi-AZ ≠ Multi-Region

AZs within ME-CENTRAL-1 are ~50-100 km apart. A coordinated strike targeting a metropolitan area reaches multiple AZs. Engineers running workloads across all three AZs still experienced degradation because the surviving AZ couldn't absorb full regional load.
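The capacity problem is easy to check with arithmetic. The sketch below assumes load is spread evenly across AZs before the failure and lands evenly on the survivors afterwards, which real workloads only approximate. The function names are illustrative and do not come from any AWS tooling.

```python
def post_failure_utilization(az_count: int, failed_azs: int, utilization: float) -> float:
    """Per-AZ utilization once the failed AZs' load lands on the survivors.

    Assumes load was evenly spread before the failure and is evenly
    redistributed afterwards (a simplification).
    """
    survivors = az_count - failed_azs
    if survivors <= 0:
        raise ValueError("no surviving AZs to absorb load")
    return utilization * az_count / survivors


def max_safe_utilization(az_count: int, failed_azs: int) -> float:
    """Highest steady-state utilization that still fits on the survivors."""
    survivors = az_count - failed_azs
    if survivors <= 0:
        raise ValueError("no surviving AZs to absorb load")
    return survivors / az_count


if __name__ == "__main__":
    # Three AZs at 50% each, two lost: the survivor would need 150% capacity.
    print(post_failure_utilization(3, 2, 0.5))
    # To ride out a two-AZ loss, each AZ must run at about a third of capacity.
    print(round(max_safe_utilization(3, 2), 2))
```

Under these assumptions, a three-AZ deployment that wants to survive a two-AZ loss has to keep each AZ at roughly one third utilization in steady state, which is an expensive amount of headroom.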

2. Control Plane Failures Cascade

Even where data plane instances survived (mec1-az1), the control plane was disrupted. Customers couldn't launch new instances, modify security groups, or execute failover automation. If your DR runbook requires API calls to the impaired region's control plane, your failover is dead on arrival.
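One way to catch this before an incident is to record, for every runbook step, which region's control-plane API it calls, and then ask which steps break if that region is unreachable. The following is a minimal sketch of that idea. The `RunbookStep` type, the step names, and the choice of eu-west-1 as a standby region are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class RunbookStep:
    name: str
    api_region: str  # region whose control-plane API this step calls


def blocked_steps(steps: Iterable[RunbookStep], impaired_region: str) -> List[str]:
    """Return names of steps that need the impaired region's control plane."""
    return [s.name for s in steps if s.api_region == impaired_region]


if __name__ == "__main__":
    # Hypothetical failover runbook: primary in me-central-1, standby in eu-west-1.
    runbook = [
        RunbookStep("snapshot-primary-db", "me-central-1"),
        RunbookStep("promote-standby-replica", "eu-west-1"),
        RunbookStep("scale-up-standby-fleet", "eu-west-1"),
    ]
    # Any step listed here cannot run while me-central-1's APIs are down.
    print(blocked_steps(runbook, "me-central-1"))
```

A runbook that returns an empty list for the primary region can, at least on paper, be executed entirely from the standby side.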

3. Shared Dependencies Are Invisible

Services running in healthy AZs had hidden dependencies on impaired zones — internal load balancers, DNS resolution, IAM authentication endpoints. These cross-AZ dependencies aren't documented in customer-facing architecture diagrams.

Architecture Pattern	Protects Against	Does NOT Protect Against
Multi-AZ (same region)	Single AZ failure, hardware failure	Regional disaster, military strike
Multi-Region (active-passive)	Full region outage	Data lag during failover, control plane dependency
Multi-Region (active-active)	All above + near-zero RPO failover	Complexity, cost, global routing challenges
Multi-Cloud	Single provider failure	Doubled operational complexity

A Practical Redesign Framework

Here's how to rethink your architecture:

Tier 1: Region Risk Assessment. Before deploying to any region, evaluate the sovereign risk profile. Map regions against active conflict zones, not just latency numbers. AWS operates in the UAE and Bahrain and is investing $5.3B in Saudi Arabia. Each region has a different threat model.
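A region risk assessment can start as a simple weighted scorecard. The sketch below is one possible shape for it. The factor names, the weights, and every score are placeholders to be replaced by your own team's assessment, not ratings of any real region.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical weights; tune them to your own risk appetite.
WEIGHTS = {"conflict_proximity": 3, "grid_dependency": 2, "residency_lock_in": 1}


@dataclass(frozen=True)
class RegionRisk:
    region: str
    conflict_proximity: int  # 1 (low) to 5 (high)
    grid_dependency: int     # 1 (low) to 5 (high)
    residency_lock_in: int   # 1 (low) to 5 (high)

    def score(self) -> int:
        """Weighted sum of the factor scores; higher means riskier."""
        return sum(weight * getattr(self, factor) for factor, weight in WEIGHTS.items())


def rank_by_risk(regions: Iterable[RegionRisk]) -> List[RegionRisk]:
    """Sort regions from highest to lowest risk score."""
    return sorted(regions, key=lambda r: r.score(), reverse=True)


if __name__ == "__main__":
    # Placeholder regions and scores, for illustration only.
    matrix = [
        RegionRisk("example-region-b", 1, 2, 1),
        RegionRisk("example-region-a", 5, 3, 4),
    ]
    for r in rank_by_risk(matrix):
        print(r.region, r.score())
```

The value is less in the number than in forcing a written, reviewable judgment for each region before a workload lands there.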

Tier 2: Cross-Region Data Replication. Implement async or sync replication to a geographically and politically distant region, using services such as S3 Cross-Region Replication, DynamoDB Global Tables, or Aurora Global Database. An RPO under 1 minute generally requires an active-active design, with Global Accelerator (or equivalent) routing traffic between regions.
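An RPO target only means something if it is compared against measured replication lag. The sketch below assumes you can obtain a timezone-aware timestamp for the last change confirmed in the standby region. How you get that timestamp depends on the service and is outside this example, and the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def replication_lag(last_replicated_at: datetime, now: Optional[datetime] = None) -> timedelta:
    """Time since the standby last confirmed a replicated change.

    Both datetimes must be timezone-aware; mixing naive and aware
    values raises TypeError.
    """
    now = now or datetime.now(timezone.utc)
    return now - last_replicated_at


def meets_rpo(last_replicated_at: datetime, rpo: timedelta, now: Optional[datetime] = None) -> bool:
    """True if losing the primary right now would stay within the RPO."""
    return replication_lag(last_replicated_at, now) <= rpo


if __name__ == "__main__":
    now = datetime(2026, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
    last_ok = now - timedelta(seconds=95)
    # 95 seconds of lag misses a 1-minute RPO but fits a 5-minute one.
    print(meets_rpo(last_ok, timedelta(minutes=1), now))
    print(meets_rpo(last_ok, timedelta(minutes=5), now))
```

Running a check like this on a schedule, and alerting when it fails, turns the RPO from a design-document number into something continuously verified.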

Tier 3: Tested Failover. "Untested failover is no failover." Schedule quarterly game days where you actually cut traffic from one region. Organizations that never tested ME-CENTRAL-1 failover discovered missing encryption keys, expired credentials, and incomplete replication during the crisis.

Tier 4: Decouple Data Residency from Compute. If regulations require data in a specific country, architect so compute/serving can operate from a different region while maintaining data locality compliance.

Industry Impact

This was the first confirmed kinetic attack destroying a major cloud provider's infrastructure. Israel reportedly struck a Tehran data center on March 11 (Jerusalem Post), confirming both sides view digital infrastructure as strategic targets.

Counterintuitively, Amazon's stock rallied ~3%, with investors betting the incident accelerates cloud spending on resilience. Oracle's Middle East regions experienced zero incidents during the same period, which strengthens the multi-cloud argument for critical workloads.

Compared to Previous Outages

Outage	Cause	Duration	Regions	Services Down
AWS us-east-1 (Dec 2021)	Scaling bug	~10 hours	1	20+
AWS ME-CENTRAL-1 and ME-SOUTH-1 (Mar 2026)	Drone strikes	Weeks	2	84+
Azure (Jan 2023)	WAN routing misconfig	~5 hours	Multiple	15+
Google Cloud (Apr 2023)	Paris region power failure	~12 hours	1	10+

Physical infrastructure cannot be rebooted. It must be rebuilt. Organizations with infrastructure defined in Terraform or Ansible redeployed in hours. Those relying on ClickOps are still migrating a month later.

Five Things to Do Right Now

1. Audit region dependencies (run once per region, since the CLI queries one region per call): aws ec2 describe-instances --query 'Reservations[].Instances[].[InstanceId,Placement.AvailabilityZone]'
2. Verify cross-region replication: check actual RPO/RTO metrics, not just config
3. Schedule a real failover test within 30 days
4. Review your CUR data pipeline for gaps during crisis scenarios
5. Document a geopolitical risk matrix for every region where you run workloads
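The raw output of the describe-instances command above is a list of [InstanceId, AvailabilityZone] pairs, which is hard to read at scale. The sketch below summarizes that JSON by region and by AZ. It assumes the command was run with --output json, and that AZ names are a region name plus one trailing letter (for example me-central-1a), which holds for standard AZs but not for Local Zones. The helper names and sample instance IDs are illustrative.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def region_of(az: str) -> str:
    """Derive the region from a standard AZ name: 'me-central-1a' -> 'me-central-1'."""
    return az[:-1] if az and az[-1].isalpha() else az


def summarize(cli_outputs: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count instances per region and per AZ across one or more CLI JSON outputs."""
    rows: List[List[str]] = []
    for text in cli_outputs:
        rows.extend(json.loads(text))
    per_az = Counter(az for _instance_id, az in rows)
    per_region = Counter(region_of(az) for _instance_id, az in rows)
    return per_region, per_az


if __name__ == "__main__":
    # Sample outputs standing in for two per-region CLI runs (made-up IDs).
    me_central = '[["i-0aaa", "me-central-1a"], ["i-0bbb", "me-central-1b"]]'
    eu_west = '[["i-0ccc", "eu-west-1a"]]'
    per_region, per_az = summarize([me_central, eu_west])
    print(dict(per_region))
    print(dict(per_az))
```

A region that holds nearly all instances, with nothing in a second region, is the concentration this audit is meant to surface.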

The cloud is not an abstraction. It's concrete, steel, and cooling systems sitting on land that exists inside a geopolitical reality. Engineers who build truly resilient multi-region architectures will define the next decade of enterprise cloud design.


AI Disclosure: This article was adapted from original research with AI assistance for editing and formatting. All technical claims are sourced and linked. The original article contains full source citations.
