DynamoDB Outage: Why Multi-Cloud Fails Startups (And Real DR Wins)

If you felt like half the internet was broken this week, you weren't wrong. 📉 A massive, 15-hour outage in Amazon's us-east-1 region took down DynamoDB and, with it, a huge chunk of the web.

This wasn't just "a server went down." It was a complex, cascading failure that exposed the deep interconnectedness of cloud services. For startups and scaleups, the immediate reaction is often, "We need to be multi-cloud to prevent this!"

Hold on!

The real lesson here isn't about running from your cloud provider. It's about understanding what failed, why us-east-1 is a special kind of dangerous and how to build a realistic Disaster Recovery (DR) plan that won't bankrupt you.


The Anatomy of a Cascading Failure

This outage was a masterclass in how modern, automated systems can fail in spectacular ways. It wasn't one thing; it was a chain of dominoes.

  1. The Trigger: A DNS Race Condition
    It all started with the system that manages the DNS for DynamoDB. Think of DNS as the Internet's phonebook. This automated system had a latent race condition—a hidden bug. Two of its own processes tried to update the DynamoDB DNS record at the exact same time.

    • One process (let's call it "Slow-Worker") grabbed an old plan.
    • A second process ("Fast-Worker") grabbed a new plan and applied it successfully.
    • "Fast-Worker" then did its cleanup, deleting the old plan that "Slow-Worker" was still holding.
    • "Slow-Worker" finally woke up and applied its plan... which was now empty.
    • Result: The main DNS record for dynamodb.us-east-1.amazonaws.com was wiped clean. All its IP addresses vanished.
  2. The First Domino: DynamoDB Goes Offline
    Instantly, any application (including AWS's own internal services) attempting to access DynamoDB in that region received a "does not exist" error. The service was unreachable.

  3. The Cascade: EC2, Lambda and IAM Fall Next
    This is where it gets scary. Cloud services are built on top of other cloud services. And DynamoDB is a Tier 0 service—a foundational block.

    • EC2 failed because its control plane (the "brain" that launches new servers) uses DynamoDB to track the state and leases of its physical hardware. No DynamoDB, no new EC2 instances.
    • Lambda, ECS, EKS and Fargate failed next, because they all ultimately run on EC2 and couldn't get new compute capacity.
    • Network Load Balancers started failing health checks, causing connection errors for services that were technically still running.
    • IAM, which handles authentication, was also impacted. This is critical: during the outage, some engineers were unable to log in to the console to fix the problem.
  4. The 15-Hour Recovery and "Congestive Collapse"
    Engineers fixed the DNS record relatively quickly, but the outage lasted 15 hours. Why?

    • DNS Caching: The "empty" (and wrong) DNS record was cached by resolvers all over the internet. They had to wait for that cache to expire.
    • Congestive Collapse: When the service finally came back, a "thundering herd" of every single service retrying at once hammered DynamoDB. The system, in its weakened recovery state, was so overwhelmed by recovery work that it couldn't make forward progress. Engineers had to manually throttle traffic and drain backlogs to bring it back online safely.
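
There's a client-side lesson hiding in that last point: if every one of your services retries in a tight loop, you are part of the thundering herd. Below is a minimal sketch of capped exponential backoff with full jitter around a DynamoDB read, assuming boto3; the table name, region and the list of error codes treated as retryable are illustrative placeholders, not a definitive policy.

```python
import random
import time

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb", region_name="eu-west-3")


def get_item_with_backoff(table, key, max_attempts=6, base=0.2, cap=20.0):
    """GetItem with capped exponential backoff and full jitter.

    Instead of hammering a recovering service in a tight loop, each retry
    waits a random delay between 0 and min(cap, base * 2**attempt) seconds.
    """
    for attempt in range(max_attempts):
        try:
            return dynamodb.get_item(TableName=table, Key=key)
        except ClientError as err:
            code = err.response["Error"]["Code"]
            # Only retry errors that look like throttling or a wobbly service;
            # anything else should fail fast. (This list is illustrative.)
            if code not in (
                "ProvisionedThroughputExceededException",
                "ThrottlingException",
                "InternalServerError",
                "ServiceUnavailable",
            ):
                raise
            time.sleep(random.uniform(0, min(cap, base * (2 ** attempt))))
    raise RuntimeError(f"GetItem on {table} failed after {max_attempts} attempts")
```

If you'd rather not hand-roll this, botocore's built-in retry modes (e.g. `Config(retries={"mode": "adaptive"})`) give you similar backoff behaviour out of the box.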

The Global Blast Radius: Why You Should Never Host in us-east-1

"But I don't even use us-east-1!" you might say. "I'm in eu-west-3 (Paris)!"

It didn't matter. This outage had a global impact, and it exposed the dirty secret of AWS: us-east-1 (N. Virginia) is not just another region.

Because it's the oldest AWS region, many "global" services have their control planes homed there by default.

  • Global IAM Console: The main IAM dashboard and API are, by default, in us-east-1. During the outage, users in other regions reported being unable to manage permissions or roles.
  • S3 Management Console: The "global" S3 console is also hosted there. You could still access your data in a bucket in Frankfurt, but you couldn't manage the bucket (e.g., change policies, create new buckets).
  • Global Services: Services like DynamoDB Global Tables, which replicate data worldwide, saw massive replication lag to and from the failed region.
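
One small, practical defence on this front: stop depending on global endpoints where a regional one exists. The sketch below pins STS (the service that vends temporary credentials) to a regional endpoint with boto3; the region is a placeholder, and setting the `AWS_STS_REGIONAL_ENDPOINTS=regional` environment variable achieves much the same for default clients.

```python
import boto3
from botocore.config import Config

# The legacy global STS endpoint (sts.amazonaws.com) is served out of
# us-east-1. Pinning STS to your own region keeps credential vending
# working even when N. Virginia is having a bad day.
sts = boto3.client(
    "sts",
    region_name="eu-west-3",
    endpoint_url="https://sts.eu-west-3.amazonaws.com",
    config=Config(retries={"mode": "adaptive", "max_attempts": 5}),
)

print(sts.get_caller_identity()["Account"])
```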

The Multi-Cloud Fallacy: Doubling Your Problems, Not Your Uptime

When an event like this happens, the C-suite's first question is, "Why aren't we on GCP and Azure, too?"

For a startup or scaleup, "multi-cloud" is a trap. It's a strategy for massive, risk-averse banks and Fortune 100s with regulatory requirements, not for a company that needs to move fast.

Chasing multi-cloud to solve for availability is a terrible trade-off. Here’s why:

  1. Exponential Complexity: You think AWS IAM is hard? Now try to manage AWS IAM, Google Cloud IAM and Azure Entra ID and make them all talk to each other securely. Your 3-person DevOps team is now responsible for three entirely different networking stacks, security models and deployment pipelines.
  2. The "Lowest Common Denominator" Problem: This is the killer. The real power of AWS is in its managed services—DynamoDB, S3, Kinesis and Lambda. If you design your app to be "cloud-agnostic," you cannot use any of them. You're forced to build on basic VMs and manage your own databases (PostgreSQL on EC2) and message queues (RabbitMQ on EC2). You've just sacrificed your biggest competitive advantage (velocity) for a false sense of security.
  3. The Talent Chasm: Finding great AWS engineers is hard enough. Finding engineers who are equally expert-level in AWS, GCP and Azure is like finding a unicorn. 🦄 More likely, you'll have a team that is mediocre at all three.
  4. The Hidden Costs: You won't save money. You'll lose all your volume discounts and you'll be hit with a constant stream of data egress fees just to keep your data in sync between clouds. This cost alone can cripple a startup.

The Right Answer: A Real DR Plan (Multi-Region, Not Multi-Cloud)

The problem this week wasn't that AWS failed. The problem was that a single region, us-east-1, failed.

The smart, resilient and cost-effective solution for a startup is not to go multi-cloud, but to go multi-region within your primary cloud.

This is where you must have an honest conversation about Cost vs. Availability. Your availability is a business decision, not just a technical one. Here are your options, from cheapest to most expensive:

1. Cold DR: Backup & Restore

  • How it works: You take regular backups (e.g., EBS snapshots, DynamoDB backups, database dumps to S3) and replicate them to another region using S3 Cross-Region Replication (CRR). If a disaster happens, you manually spin up a new environment from scratch in the new region and restore from the backup (the replication setup is sketched after this list).
  • Cost: Very low. Just storage costs.
  • Availability (RTO/RPO): Very poor. RTO (Recovery Time Objective) is in hours or days. RPO (Recovery Point Objective) is high (e.g., "we lose the last 4 hours of data").
  • Use Case: Good for non-critical systems, dev/test environments.
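
If you go this route, the replication half is mostly configuration. A rough sketch using boto3's `put_bucket_replication`, assuming versioning is already enabled on both buckets; the bucket names and IAM role ARN are placeholders.

```python
import boto3

s3 = boto3.client("s3", region_name="eu-west-3")

# Requires versioning on BOTH buckets and an IAM role S3 can assume to
# copy objects across regions (the role ARN below is a placeholder).
s3.put_bucket_replication(
    Bucket="my-primary-backups",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-crr-role",
        "Rules": [
            {
                "ID": "dr-copy-everything",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {"Prefix": ""},  # empty prefix = the whole bucket
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": "arn:aws:s3:::my-dr-backups",
                    "StorageClass": "STANDARD_IA",
                },
            }
        ],
    },
)
```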

2. Warm DR: Pilot Light (The Startup Sweet Spot 💡)

  • How it works: This is the best balance for most startups.
    • Data: Your critical data is actively replicated to the second region. Use DynamoDB Global Tables or Aurora Global Databases.
    • Infra: A minimal copy of your core infrastructure (e.g., your container images in ECR, a tiny app server, your IaC scripts) is "on" but idle in the DR region. The "pilot light" is lit.
    • Failover: When a disaster hits, you "turn up the gas." You run your scripts to scale up the app servers, promote the standby database to be the new primary and use Route 53 DNS Failover to automatically redirect all traffic (both halves are sketched in code after this list).
  • Cost: Medium. You pay for data replication and minimal idle infrastructure.
  • Availability (RTO/RPO): Good. RTO is in minutes. RPO is near-zero (you lose at most a few seconds of data).
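
To make the pilot light concrete, here is a rough sketch of both halves with boto3: adding a replica region to an existing DynamoDB table (which makes it a Global Table) and creating a PRIMARY/SECONDARY failover record pair in Route 53. The table name, regions, hosted zone ID, health check ID and DNS names are all placeholders, and the table is assumed to meet the Global Tables prerequisites.

```python
import boto3

# 1) Data: replicate the table to the DR region (Global Tables
#    version 2019.11.21).
dynamodb = boto3.client("dynamodb", region_name="eu-west-3")
dynamodb.update_table(
    TableName="orders",
    ReplicaUpdates=[{"Create": {"RegionName": "eu-west-1"}}],
)

# 2) Failover: a PRIMARY/SECONDARY pair in Route 53, so traffic moves
#    automatically when the primary's health check goes red.
route53 = boto3.client("route53")
route53.change_resource_record_sets(
    HostedZoneId="Z0000000000EXAMPLE",
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "primary-eu-west-3",
                    "Failover": "PRIMARY",
                    "TTL": 60,
                    "HealthCheckId": "11111111-2222-3333-4444-555555555555",
                    "ResourceRecords": [{"Value": "api.eu-west-3.example.com"}],
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "secondary-eu-west-1",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "api.eu-west-1.example.com"}],
                },
            },
        ]
    },
)
```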

3. Hot DR: Active-Active

  • How it works: You run your full production stack in two or more regions simultaneously. Route 53 (or a global load balancer) splits traffic between them. If one region fails, the other just takes on 100% of the traffic (the routing setup is sketched below).
  • Cost: Very high. You are paying for 2x (or more) of everything.
  • Availability (RTO/RPO): Excellent. RTO is in seconds (or zero). RPO is zero.
  • Use Case: Only for your absolute, mission-critical, "company-dies-if-it's-down-for-1-minute" services.
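
The routing piece is the same Route 53 machinery, just with a latency-based (or weighted) policy instead of failover. A rough sketch, with the hosted zone ID and DNS names as placeholders; in practice you would also attach a health check per region so a failed region stops receiving traffic.

```python
import boto3

route53 = boto3.client("route53")


def regional_record(region, dns_name):
    # One identical record per region; Route 53 answers with the
    # lowest-latency healthy one for each user.
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "api.example.com",
            "Type": "CNAME",
            "SetIdentifier": f"active-{region}",
            "Region": region,  # latency-based routing
            "TTL": 60,
            "ResourceRecords": [{"Value": dns_name}],
        },
    }


route53.change_resource_record_sets(
    HostedZoneId="Z0000000000EXAMPLE",
    ChangeBatch={
        "Changes": [
            regional_record("eu-west-3", "api.eu-west-3.example.com"),
            regional_record("eu-west-1", "api.eu-west-1.example.com"),
        ]
    },
)
```

The DNS is the easy part; the expensive part of active-active is keeping data consistent when two regions accept writes at the same time.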

Your Survival Checklist

Don't wait for the next outage. As a startup, you can survive this.

  1. Move out of us-east-1 for your primary workloads. Seriously.
  2. Define your RTO/RPO. Have the business conversation: "How long can we be down and how much data can we afford to lose?" This dictates your budget.
  3. Implement a Pilot Light strategy for your core services.
  4. Use native replication: Use DynamoDB Global Tables, Aurora Global DBs and S3 CRR.
  5. Replicate your CI/CD assets: Make sure your container images (ECR) and deployment scripts are in your DR region, too. You can't recover if your recovery tools are in the fire (see the sketch after this checklist).
  6. Test your plan. A DR plan you've never tested is not a plan; it's a prayer.
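
For point 5, ECR can handle the image side for you with registry-level replication rules. A minimal sketch with boto3; the account ID and regions are placeholders.

```python
import boto3

# Every image pushed in the home region is copied to the DR region
# automatically (account ID and regions are placeholders).
ecr = boto3.client("ecr", region_name="eu-west-3")
ecr.put_replication_configuration(
    replicationConfiguration={
        "rules": [
            {"destinations": [{"region": "eu-west-1", "registryId": "123456789012"}]}
        ]
    }
)
```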

This outage was a wake-up call. But the lesson isn't to flee AWS. It's to stop treating "the cloud" as one magic box and start treating a region as your true failure domain.
