Eyal Estrin for AWS Community Builders

Posted on Oct 20 • Originally published at security-24-7.com

Mitigating the risk of a global public cloud outage

#aws #cloud #architecture #design

When designing for resiliency, we usually think about deploying an entire workload across multiple availability zones (i.e., multi-AZ) or, if the business requires it, designing for multi-region (with the challenges that come with it).

One thing most organizations fail to understand is that many of the global services we use every day (identity management, DNS, compute, storage, and databases) are built on the concepts of a “control plane” and a “data plane”.

In public cloud environments, the concepts of "control plane" and "data plane" are fundamental to understanding how cloud services operate and ensure scalability, security, and resilience.

Control Plane - the administrative and orchestration layer responsible for managing, configuring, and controlling cloud resources and services. It handles all control operations such as creating, updating, deleting, and listing resources (CRUDL operations). For example, launching a VM instance, creating an object storage bucket, or configuring network policies are control plane activities.

Data Plane - the functional layer that executes the core operations of the service, such as running virtual machines, handling storage, transmitting network packets, and processing application workloads. For example, running VM instances, data stored in and retrieved from object storage, and network packet forwarding are all part of the data plane.
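To make the distinction concrete, here is a minimal Python sketch that tags a few well-known AWS API actions by plane and flags the control plane steps in a recovery runbook. The action names are real AWS API operations, but the lookup table and the `control_plane_calls` helper are hypothetical illustrations, not part of any AWS SDK.

```python
from enum import Enum


class Plane(Enum):
    CONTROL = "control"
    DATA = "data"


# Illustrative, hand-maintained table (an assumption for this sketch).
# Creating or changing resources is control plane; using them is data plane.
OPERATION_PLANE = {
    ("ec2", "RunInstances"): Plane.CONTROL,
    ("s3", "CreateBucket"): Plane.CONTROL,
    ("s3", "GetObject"): Plane.DATA,
    ("s3", "PutObject"): Plane.DATA,
    ("dynamodb", "CreateTable"): Plane.CONTROL,
    ("dynamodb", "GetItem"): Plane.DATA,
    ("route53", "ChangeResourceRecordSets"): Plane.CONTROL,
}


def control_plane_calls(recovery_steps):
    """Return the runbook steps that are known control plane operations."""
    return [
        step for step in recovery_steps
        if OPERATION_PLANE.get(step) is Plane.CONTROL
    ]
```

Running a disaster recovery runbook through a check like this is one way to spot steps that may not work while a control plane is impaired.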

In this blog post, I will share some recommendations I have collected from the official public cloud documentation.

Recommendations for Resilient Application Design to Mitigate Control Plane Outages in AWS

1. Do Not Rely on Control Plane in Recovery Paths

Avoid dependencies on control plane operations in global or partitional services during failover or disaster recovery. For instance, IAM, Route 53, and Amazon S3 control plane actions are often centralized in specific regions like us-east-1 or us-west-2. An outage in these regions impacts CRUDL operations (create, read, update, delete, list) on resources.
Instead, architect systems to rely primarily on data plane operations, which are regionally distributed and more resilient.
Pre-provision resources like ELBs, Route 53 DNS records, S3 buckets, and API Gateway endpoints to avoid the need for control plane changes during failures.
Use caching and replication to keep critical configuration and state data accessible via the data plane during a control plane outage.
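The caching recommendation above can be sketched as a small "last known good" wrapper. This is an illustration under stated assumptions: `CachedConfig` and its `fetch` callable are hypothetical names, and `fetch` stands in for whatever data plane read you use (for example, an object read from a replicated bucket).

```python
import time
from typing import Callable, Optional


class CachedConfig:
    """Serve configuration from memory and tolerate a failing source."""

    def __init__(self, fetch: Callable[[], dict], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch          # hypothetical loader supplied by the caller
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[dict] = None
        self._fetched_at = 0.0

    def get(self) -> dict:
        fresh = (self._value is not None
                 and self._clock() - self._fetched_at < self._ttl)
        if fresh:
            return self._value
        try:
            self._value = self._fetch()
            self._fetched_at = self._clock()
        except Exception:
            if self._value is None:
                raise  # nothing cached yet, so the failure must surface
            # Otherwise keep serving the stale copy until the source recovers.
        return self._value
```

The design choice here is to prefer stale configuration over no configuration, which is usually the right trade-off during an outage but should be reviewed for values such as credentials or revocation lists.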

2. Use Static Stability Design Patterns to Avoid Control Plane Overload and Failure

Implement static stability by minimizing dynamic changes to the control plane during normal and failure operations. Avoid rapid scaling, reconfiguration, or failover operations that require heavy control plane interactions.
Favor stable, predictable configuration models that do not require frequent updates to DNS records, API endpoints, or IAM policies.
Use fallback mechanisms and circuit breakers to prevent cascading overloads on control plane services.
Design smaller, scoped services in charge of control plane operations to isolate and reduce the risk of large-scale failure due to control plane overload.
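As an illustration of the circuit breaker idea mentioned above, here is a minimal Python sketch. The `CircuitBreaker` class, its thresholds, and the wrapped function are all assumptions made for this example; production systems would typically also add jittered backoff and metrics.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: allow a single trial call (half-open state).
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result
```

Wrapping control plane calls this way means that, once the service starts failing, callers stop retrying for `reset_after` seconds instead of adding load to an already degraded control plane.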

3. Multi-AZ and Multi-Region Architectures for Control Plane Controlled Evacuation

Use multi-AZ deployments to increase availability; utilize patterns like control-plane-controlled evacuation from unavailable AZs.
Automate graceful evacuation of workloads from impaired AZs by orchestrating controlled drain or failover without overwhelming control plane requests.
For critical service control planes hosted in a single region (like the global IAM control plane in us-east-1), consider geographic separation and preplanned failover or fallback strategies.
Validate that failover automation works without real-time dependency on control plane changes whenever possible.
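One way to evacuate an AZ without asking a control plane for new capacity is to over-provision up front, so the surviving AZs can already carry peak load. The arithmetic can be sketched as follows; the function name and the sizing rule are assumptions for illustration (a common static stability rule of thumb), not a formula taken from this post.

```python
import math


def instances_per_az(peak_instances: int, az_count: int,
                     az_failures_tolerated: int = 1) -> int:
    """Instances to run in each AZ so that peak load is still covered
    after losing `az_failures_tolerated` AZs, with no scaling calls."""
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("need at least one surviving AZ")
    return math.ceil(peak_instances / surviving)
```

For example, a workload that needs 12 instances at peak across 3 AZs would run 6 per AZ (18 in total), so losing any single AZ still leaves 12 running. The extra cost is the price of not depending on scaling APIs during the failure.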

4. Prepare for Single Points of Failure in Global Control Planes

Understand AWS global services whose control planes operate in a single region while their data planes are globally distributed (e.g., IAM, Route 53, CloudFront).
Design for the failure modes of "partitional" services by isolating workloads from control plane failures that impact global endpoints.
Implement "break-glass" procedures, with users pre-configured for emergency access during control plane outages.
Use regional endpoints for services like AWS STS to reduce reliance on global endpoints.
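The regional STS recommendation can be sketched in Python as below. The helper names are hypothetical, and the URL pattern assumes the standard `aws` partition (other partitions, such as China, use different domain suffixes). The `boto3` usage is shown only as a comment so the block stays dependency-free.

```python
def regional_sts_endpoint(region: str) -> str:
    """Build the regional STS endpoint URL for the standard AWS partition."""
    return f"https://sts.{region}.amazonaws.com"


def sts_client_kwargs(region: str) -> dict:
    """Keyword arguments that pin an STS client to one region's endpoint."""
    return {
        "service_name": "sts",
        "region_name": region,
        "endpoint_url": regional_sts_endpoint(region),
    }


# Usage with boto3 (assumed to be installed; not imported in this sketch):
#   import boto3
#   sts = boto3.client(**sts_client_kwargs("eu-west-1"))
#   print(sts.get_caller_identity()["Account"])
```

Depending on the SDK version, setting the environment variable `AWS_STS_REGIONAL_ENDPOINTS=regional` may achieve the same result without code changes; check the documentation for the SDK you use.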

5. Operational Excellence and Monitoring

Use AWS Systems Manager Parameter Store, DynamoDB, or S3 (data plane) to store critical runtime configuration separately from the control plane.
Continuously monitor control plane health using Amazon CloudWatch and Route 53 health checks.
Enable automated alarms and operational runbooks for rapid detection and resolution of control plane degradation.
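To illustrate the detection side, here is a minimal sliding-window alarm in Python that could sit behind a canary probing a control plane API. The `ErrorRateAlarm` class and its thresholds are assumptions for this sketch; in practice the same logic is usually expressed as a CloudWatch alarm rather than application code.

```python
from collections import deque


class ErrorRateAlarm:
    """Track the most recent probe results and report when too many fail."""

    def __init__(self, window: int = 20, threshold: float = 0.5,
                 min_samples: int = 5):
        self._results = deque(maxlen=window)  # oldest results drop off
        self._threshold = threshold
        self._min_samples = min_samples

    def record(self, success: bool) -> None:
        self._results.append(success)

    def error_rate(self) -> float:
        if not self._results:
            return 0.0
        failures = sum(1 for ok in self._results if not ok)
        return failures / len(self._results)

    def in_alarm(self) -> bool:
        # Require a minimum number of samples to avoid alarming on noise.
        return (len(self._results) >= self._min_samples
                and self.error_rate() >= self._threshold)
```

A runbook can then key off `in_alarm()` to stop issuing control plane changes and switch to the pre-provisioned recovery path.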

Summary

Designing only for data plane resilience is insufficient. To withstand regional or global control plane disruptions, workloads must avoid real-time control plane dependencies during failover, pre-provision critical resources, and rely on static, replicated, and automated recovery mechanisms.

Readers should broaden their design considerations beyond resiliency and disaster recovery. Include chaos engineering exercises in regular deployment cycles, and plan for security and operational access, specifically how teams will connect to and manage environments during a public cloud outage.

About the author

Eyal Estrin is a seasoned cloud and information security architect, AWS Community Builder, and author of Cloud Security Handbook and Security for Cloud Native Applications. With over 25 years of experience in the IT industry, he brings deep expertise to his work.

Connect with Eyal on social media: https://linktr.ee/eyalestrin.

The opinions expressed here are his own and do not reflect those of his employer.

Top comments (3)

Ben Halpern • Oct 20

Very topical

leob • Oct 21 • Edited

Sounds really really hard, for anyone who's not at "rocket science" competence level - is this advice even practical (for the mere mortals among us) ?

Maybe the pragmatic (also economically) approach is to just accept the tiny risk of an outage like the AWS one which occurred around October the 20th, 2025 - I mean, how often do "big" outages like that occur - every 3 or 4 years? What fraction of a tenth of a percent does that take out of your system's uptime?

If it really requires 100% rock solid uptime, then I suppose you should host it in your own data center, in a nuclear proof bunker - or build a complete copy of your system in a second separate cloud, with continuous data replication and fully automatic failover ;-)

Warren Parad AWS Community Builders • Nov 7

With all the articles on what just happened exactly with that AWS outage and DNS and DynamoDB, I thought it would actually be valuable to share how companies with critical software mitigate these types of incidents. This took me quite some time to write up, and it's quite the long read: dev.to/aws-builders/how-when-aws-w...
