Anderson Leite

"Multi-Cloud" and "Return to On-Prem" Aren't your silver bullets: A reality check after the AWS outage

The recent AWS outage in US-EAST-1 sent shockwaves through the tech industry. DynamoDB went down for hours, taking with it a cascade of services including EC2, Lambda, EKS, and countless third-party platforms. Atlassian Cloud (among several other services) was completely inaccessible or severely degraded. If your team couldn't log in to Jira or Confluence, you felt it.

Note: Amazon has released a post-mortem report for the outage; you can find it here.

In the aftermath, two choruses emerged: "This is why you need multi-cloud!" and "This is why we should have never left on-premises!" Blog posts and LinkedIn hot takes proclaimed both as obvious solutions.

Want to hear the uncomfortable truth that most of these posts won't tell you? Neither approach is a silver bullet, and for many organisations, both are cures worse than the disease.

The SLA Wake-Up Call

Let's talk numbers. AWS IAM's control plane SLA is 99.90% (three nines). That sounds good until you do the math:

  • 99.90% uptime = 8.76 hours of allowed downtime per year
  • 99.99% uptime = 52.56 minutes of allowed downtime per year
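
If you want to reproduce these numbers yourself, here's a quick Python sketch (my own illustration, nothing AWS-specific, just the percentage-to-hours conversion):

```python
# Convert an SLA percentage into the downtime it allows per year.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def allowed_downtime_hours(sla_percent: float) -> float:
    """Hours of downtime per year permitted by a given SLA percentage."""
    return HOURS_PER_YEAR * (1 - sla_percent / 100)

for sla in (99.90, 99.99, 99.999):
    hours = allowed_downtime_hours(sla)
    print(f"{sla}% SLA -> {hours:.2f} h/year ({hours * 60:.1f} minutes)")
# 99.9%  -> 8.76 h/year (525.6 minutes)
# 99.99% -> 0.88 h/year (52.6 minutes)
```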

The October outage lasted approximately 14.5 hours from initial impact (11:48 PM PDT Oct 19) to full recovery across all services (2:20 PM PDT Oct 20). However, here's the SLA detail that matters: AWS Management Console authentication (the IAM control plane component covered by the SLA) was restored at 1:25 AM PDT—just 1 hour and 37 minutes of downtime.

While DynamoDB, EC2, Lambda, and countless other services remained impaired for hours longer, the specific service covered by the 99.90% SLA recovered within its threshold. AWS was technically compliant. No compensation. No penalties. This is the contract you signed, and this is why reading the fine print matters.

And of course, even if you were eligible for compensation, Amazon would never hand you cash, only service credits, as stated here.

Many engineering teams never look at these numbers until an outage forces them to. Here's what I've observed: Organisations that truly cannot tolerate this level of downtime are rare. Most think they can't tolerate it, but when pressed on actual business impact, the numbers tell a different story.

Why Multi-Cloud Isn't the Answer for Most Teams

1. The Architectural Lock-In Problem

Multi-cloud evangelists often gloss over the single biggest challenge: cloud services are not interchangeable commodities.

Consider these common scenarios:

Managed Services Lock-In:

  • AWS SQS → Azure Service Bus or Google Pub/Sub (different APIs, different guarantees, different behaviours)
  • AWS DynamoDB → Azure Cosmos DB or Google Firestore (fundamentally different data models)
  • AWS Lambda → Azure Functions or Google Cloud Functions (different runtime behaviours, cold start characteristics, limits)
  • Azure Key Vault → AWS Secrets Manager or Google Secret Manager (different access patterns and rotation mechanisms; Azure stores keys, secrets, and certificates in one service, while the other two split them across separate services)

You can't just swap these services. Each has unique characteristics, APIs, and operational models. Abstracting them means building your own compatibility layer, which becomes your problem to maintain, test, and debug.
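
To make that concrete, here's a minimal sketch of what such a compatibility layer starts to look like for the SQS / Service Bus case, assuming the boto3 and azure-servicebus SDKs. It's deliberately naive: retries, batching, dead-lettering and the very different delivery semantics are all left out, and that's exactly where the maintenance burden lives.

```python
# A home-grown queue abstraction over two providers (deliberately naive).
# Assumes the boto3 and azure-servicebus SDKs are installed and configured.
from abc import ABC, abstractmethod

class MessageQueue(ABC):
    @abstractmethod
    def send(self, body: str) -> None: ...

class SqsQueue(MessageQueue):
    def __init__(self, queue_url: str):
        import boto3
        self._client = boto3.client("sqs")
        self._queue_url = queue_url

    def send(self, body: str) -> None:
        self._client.send_message(QueueUrl=self._queue_url, MessageBody=body)

class ServiceBusQueue(MessageQueue):
    def __init__(self, connection_string: str, queue_name: str):
        from azure.servicebus import ServiceBusClient, ServiceBusMessage
        self._message_cls = ServiceBusMessage
        # Simplified: in real code the client and sender should be closed or context-managed.
        self._sender = ServiceBusClient.from_connection_string(
            connection_string
        ).get_queue_sender(queue_name)

    def send(self, body: str) -> None:
        self._sender.send_messages(self._message_cls(body))
```

Every difference the abstraction papers over (visibility timeout vs peek-lock, FIFO guarantees, message size limits) eventually leaks into that interface, and you own it.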

IAM and Security Models:

Each cloud provider has a completely different approach to identity and access management. AWS IAM roles work fundamentally differently from Azure Active Directory and Google Cloud IAM. Multi-cloud means your security team needs deep expertise in multiple authorisation paradigms.

Your Own Code Lock-In:

Your software engineering teams most probably use the SDK from your cloud provider to abstract away much of the complexity of talking to that provider (e.g. Azure Managed Identities). How many code changes would be required to make that code run on two (or more) cloud providers at the same time?
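
As a tiny illustration (assuming the azure-identity and azure-storage-blob SDKs, with placeholder names), this is the kind of boring glue that quietly ties a codebase to one provider; there is no line-for-line AWS equivalent of DefaultAzureCredential and Managed Identities, so none of it moves for free:

```python
# Typical Azure-flavoured glue: Managed Identity via DefaultAzureCredential.
# The AWS version would lean on IAM roles and botocore's credential chain instead,
# so this does not translate line for line. The account name below is a placeholder.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

credential = DefaultAzureCredential()  # picks up the VM/App Service managed identity
blobs = BlobServiceClient(
    account_url="https://<your-account>.blob.core.windows.net",
    credential=credential,
)
container = blobs.get_container_client("reports")  # hypothetical container name
```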

2. The Cost Iceberg

Everyone talks about the visible costs, but the real damage is below the waterline.

Visible Costs:

  • Running duplicate infrastructure across providers
  • Lost commitment discounts (you can't commit as heavily to a single provider)
  • Licensing costs that don't transfer between clouds

The Hidden Costs (The Real Killers):

Engineering Time:

  • Learning and maintaining expertise across multiple platforms
  • Writing and maintaining abstraction layers
  • Managing multiple deployment pipelines
  • Debugging cross-cloud integration issues
  • Training new team members on multiple platforms

A rough estimate: a team maintaining true multi-cloud architecture will spend 30-50% more engineering time on infrastructure compared to a single-cloud team of equivalent size. That's not a small number. That's headcount.

Network Egress Costs:
Here's where it gets painful. Cross-cloud data transfer isn't just expensive; it's shockingly expensive.

  • AWS to internet: $0.09/GB (after first GB)
  • Azure to internet: $0.087/GB (after first 5GB)
  • Cross-cloud transfers: Both sides charge egress

If your multi-cloud strategy involves real-time data synchronisation or frequent failover testing, you'll watch your network costs explode. I've seen companies with monthly bills in the tens of thousands just for egress they didn't anticipate.
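
A rough back-of-the-envelope using the list prices above (real bills depend on tiers, regions and free allowances, so treat this as illustrative only):

```python
# Cross-cloud egress estimate using the list prices quoted above (illustrative only).
AWS_EGRESS_PER_GB = 0.09
AZURE_EGRESS_PER_GB = 0.087

monthly_sync_gb = 20 * 1024  # hypothetical: 20 TB/month replicated in each direction

# Each provider charges egress for the traffic leaving its side.
aws_to_azure = monthly_sync_gb * AWS_EGRESS_PER_GB
azure_to_aws = monthly_sync_gb * AZURE_EGRESS_PER_GB
print(f"~${aws_to_azure + azure_to_aws:,.0f}/month just in egress")
# ~$3,625/month for 20 TB each way, before you run a single workload
```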

Operational Complexity:

  • Multiple monitoring systems (CloudWatch, Azure Monitor, Google Cloud Operations) or a separate solution (New Relic, Datadog), plus the costs of ingesting and processing all that data
  • Multiple logging aggregation strategies
  • Multiple support contracts and vendor relationships
  • Multiple compliance and security audits

3. The Disaster Recovery Misconception

Let's address the elephant in the room: multi-cloud doesn't automatically give you better disaster recovery.

The AWS outage primarily affected US-EAST-1. Organisations with resources in other AWS regions (US-WEST-2, EU-WEST-1, etc.) remained operational (I know, you may not have been able to log in to the console or query some services during the disruption, but your workloads kept running; that's my point here). This is the key insight: multi-region within a single cloud provider offers 90% of the resilience benefits at 20% of the complexity.

AWS, Azure, and Google all operate multiple independent regions with separate control planes, separate infrastructure, and geographic separation. A properly architected multi-region deployment on a single cloud can survive:

  • Regional disasters
  • Regional control plane failures
  • Natural disasters
  • Network partitions

What it can't survive:

  • Global control plane failures (extremely rare)
  • Account-level issues
  • Catastrophic vendor failure

The question is: are you optimising for that final 1% scenario? And at what cost?

The "We Should Never Have Left On-Prem" Nostalgia

Another narrative emerging from the AWS outage: "This is why we should have never left bare-metal/on-premises!"

Let's be clear: This is nostalgia talking, not math.

Let's be brutally honest about what "going back to on-prem" actually means. Have you ever calculated how much it costs to build and operate datacenters? Been there, done that back in my days in Angola, and trust me: if you manage to convince your upper management to give you the shitload of money needed to rebuild everything, be prepared to deal with:

  • ALL the contractor delays (yes, they will happen)
  • Import duties (depending on your location) and damage to goods during transport
  • Initial test runs and fixes (yes, you will NOT bring your systems to the location without confirming everything works as expected)
  • The huge maintenance window to move everything back (even if you start with a secondary location and move the main one afterwards, you will still need a maintenance window, and during this time, no offsite / cross-replication will be in place)
  • Operation and monitoring (good luck!)

The Infrastructure Reality Check

Let's break it down to just the main building blocks, ok?

Physical Infrastructure Costs:

  • Data center space: Lease or build (construction costs start at millions, and if you are dreaming of "I will certify my datacenters following the Uptime Institute standards", add AT LEAST an extra million here to start)
  • HVAC systems: Commercial-grade cooling isn't cheap; expect $50K-$500K+ depending on scale. It doesn't matter which model you choose (cooling tower, in-row, top-down, below-row), all of them are expensive, and please don't forget to add the "emergency" system (yes, those multi-split units you assumed you would never need anymore)
  • Fire suppression: Specialised systems (FM-200, Novec, etc.) run $30K-$100K+ per installation
  • Physical security: Badge systems, cameras, intrusion detection, 24/7 monitoring
  • Power infrastructure:
    • Primary power with redundant feeds
    • UPS systems (enterprise-grade: $50K-$500K+)
    • Generator backup (installation: $100K-$1M+, plus fuel storage and maintenance)
    • Power distribution units (PDUs), redundant circuits
  • Network connectivity:
    • Multiple ISP connections (redundancy)
    • BGP and iBGP setup and management
    • DDoS protection hardware/services

Secondary Location Requirements:

Here's the kicker: if you're building on-prem for resilience, you need at least two sites. Otherwise, you're far worse off than single-region cloud. That means:

  • Everything listed above × 2 (minimum)
  • Geographic separation for disaster recovery
  • Network connectivity between sites (MPLS or EoMPLS depending on how you want your environment to behave, dark fiber, or VPN)
  • Data replication infrastructure

Physical Infrastructure (a more detailed breakdown of the items above):

  • Data center space: Lease or purchase, often with multi-year commitments
  • HVAC systems: Industrial cooling that runs 24/7/365 (datacenters generate enormous heat)
  • Power infrastructure:
    • Redundant power feeds from different substations
    • UPS systems (battery backup) for instantaneous failover
    • Generator systems with fuel contracts and maintenance
    • Automatic transfer switches
    • PDUs (Power Distribution Units) and redundant circuits
  • Fire suppression: Specialised systems (often inert gas-based, displacing the oxygen in the room rather than using water) that won't destroy equipment
  • Physical security:
    • 24/7 monitoring and guard services
    • Biometric access controls
    • Mantrap entries
    • Security cameras and recording systems
    • Intrusion detection and prevention
  • Network connectivity:
    • Multiple ISP connections for redundancy
    • BGP routing equipment
    • Firewall and edge security appliances

Disaster Recovery / Failover Requirements:

  • Replication infrastructure: Dedicated network circuits between sites (not cheap)
  • Offsite backups: Tape or disk rotation to a third location, plus managing security and environmental aspects of this location (don't forget to acquire a fire-proof safe to store the offline/offsite data)
  • DR testing: Regular failover drills that consume time and resources

Direct Operating Costs:

  • Hardware: Servers, storage, networking gear (with 3-5 year refresh cycles)
  • Licenses: OS licenses, hypervisor licenses, monitoring tools, backup software
  • Personnel:
    • Data center technicians (24/7 coverage = multiple shifts)
    • Network engineers
    • Storage administrators
    • Security team
    • Facilities management
    • On-call rotations
  • Maintenance contracts: For all hardware and critical systems
  • Compliance and auditing: SOC 2, ISO 27001, PCI-DSS audits for your facilities

The Real Kicker: Capacity Planning

Here's what cloud critics forget: with on-prem, you pay for peak capacity all the time.

That Black Friday traffic spike? You bought servers for that in January. That annual report processing that runs for 3 days each quarter? Those servers sit idle 99% of the year.

Cloud providers can play Tetris with workloads across millions of customers. You can't. Every server you buy for peak capacity is money sitting in a rack depreciating when you're at normal load.
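
Purely hypothetical numbers, but this is the shape of the comparison that matters:

```python
# Entirely hypothetical figures, just to show the shape of the problem:
# on-prem you buy for peak; cloud you pay roughly for what you use.
peak_servers = 100           # sized for Black Friday
average_utilisation = 0.30   # typical load is ~30% of peak
server_cost_per_month = 400  # amortised hardware + power + space per server

on_prem_monthly = peak_servers * server_cost_per_month
cloud_equivalent_monthly = peak_servers * average_utilisation * server_cost_per_month * 1.5
# (the 1.5 factor is a crude stand-in for cloud's per-unit premium)

print(f"on-prem: ${on_prem_monthly:,}/month, cloud-ish: ${cloud_equivalent_monthly:,.0f}/month")
# on-prem: $40,000/month, cloud-ish: $18,000/month
```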

The "We Never Had Outages On-Prem" Myth

Let's inject some honesty here. On-prem had outages too:

  • That time the HVAC failed over the weekend
  • The network switch that died and took 6 hours to replace
  • The storage array that needed an emergency firmware update
  • The fire suppression test that went wrong
  • The power outage that revealed your generator didn't start
  • The database backup that was silently failing for weeks

The difference? Fewer people heard about them. When your internal systems go down, it doesn't make TechCrunch. When AWS goes down, it's global news.

When On-Prem Actually Makes Sense

I'm not saying on-prem is never the answer. It makes sense when:

  1. Regulatory Requirements: Data that legally cannot leave your physical control (rare, but real)
  2. Extreme Security Requirements: Government, defense, or highly sensitive financial systems
  3. Massive Scale with Predictable Load: If you're Netflix-sized with steady traffic, the economics flip
  4. Legacy Constraints: Applications that physically cannot be migrated (yet) and are too critical to rewrite
  5. Specific Hardware Requirements: Custom hardware, specialised GPUs, or equipment that can't be virtualised

For the other 95% of companies? The TCO math doesn't lie. Cloud wins on cost, agility, and yes, even reliability when you architect properly.

When Multi-Cloud Actually Makes Sense

I'm not saying multi-cloud is never the answer. It makes sense when:

  1. Regulatory Requirements: Some industries or regions legally require data to live in specific clouds or geographic locations that only certain providers serve well.

  2. Best-of-Breed Services: You have a clear, specific reason to use different services from different providers (e.g., AWS for compute/storage, while using Snowflake on Azure for data warehousing).

  3. Merger/Acquisition Integration: You've inherited infrastructure from an acquisition and need time to consolidate (personally, I did this recently, setting up an AWS account for a product the company I was working for acquired, while our entire stack relies on Microsoft Azure).

  4. Massive Scale with Resources: You're operating at a scale where vendor negotiations and cost optimisation across providers make financial sense, and you have a dedicated platform team.

  5. Strategic Vendor Independence: Your organisation has a strategic directive (not just an architectural preference) to avoid single vendor dependency, and leadership understands and accepts the costs.

Notice what's not on this list: "preventing outages." That's what multi-region is for.

What You Should Do Instead

1. Design for Regional Resilience

Invest in proper multi-region architecture within your primary cloud:

  • Deploy critical services across at least 2-3 regions
  • Implement proper health checking and automated failover
  • Test your failover regularly (not just once a year)
  • Use global load balancers (Route 53, Azure Traffic Manager, Cloud Load Balancing)
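
For the Route 53 option mentioned above, a minimal boto3 sketch of a DNS failover record could look like this (the hosted zone, health check and endpoints are placeholders that would already need to exist; a matching SECONDARY record points at the standby region):

```python
# Minimal sketch: DNS failover with Route 53 via boto3.
# The hosted zone ID, health check ID and hostnames are placeholders.
import boto3

route53 = boto3.client("route53")
route53.change_resource_record_sets(
    HostedZoneId="Z_EXAMPLE",
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "api.example.com",
                "Type": "CNAME",
                "SetIdentifier": "primary-us-east-1",
                "Failover": "PRIMARY",  # a second record marked "SECONDARY" covers the standby region
                "TTL": 60,
                "ResourceRecords": [{"Value": "primary.us-east-1.example.com"}],
                "HealthCheckId": "11111111-2222-3333-4444-555555555555",  # placeholder
            },
        }],
    },
)
```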

2. Understand Your True RTO/RPO Requirements

Stop guessing. Calculate actual business impact:

  • What's the cost per hour of downtime?
  • What's your Recovery Time Objective (RTO)?
  • What's your Recovery Point Objective (RPO)?
  • How much are you willing to pay to improve these numbers?

Most teams discover their true requirements are more forgiving than they thought.
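
A toy version of that math, with made-up numbers (yours must come from the business, not from a blog post):

```python
# Hypothetical figures: does shaving expected downtime justify the extra spend?
cost_per_hour_down = 10_000          # revenue/productivity lost per hour of outage
expected_hours_down_single = 8.0     # e.g. single-region estimate, per year
expected_hours_down_resilient = 1.0  # e.g. multi-region estimate, per year
extra_annual_cost = 150_000          # added infrastructure and engineering time

avoided_loss = (expected_hours_down_single - expected_hours_down_resilient) * cost_per_hour_down
print(f"avoided loss: ${avoided_loss:,.0f} vs extra cost: ${extra_annual_cost:,.0f}")
# avoided loss: $70,000 vs extra cost: $150,000 -> this upgrade doesn't pay for itself
```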

3. Build for Degraded States

Instead of assuming perfect availability, design systems that can operate in degraded modes:

  • Queue writes when databases are unavailable
  • Cache aggressively
  • Implement circuit breakers
  • Design graceful degradation paths
  • Communicate clearly with users about reduced functionality
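
As one concrete example of the circuit-breaker item above, here's a deliberately minimal sketch (no half-open probing, no metrics, not production code):

```python
# Deliberately minimal circuit breaker: trip after N consecutive failures,
# fail fast for a cooldown period, then let calls through again.
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpenError("failing fast; dependency presumed down")
            self.failures = 0  # cooldown elapsed, try the dependency again
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Wrap calls to a flaky dependency with breaker.call(...) and fall back to a cached or degraded response when CircuitOpenError is raised.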

This resilience works regardless of whether you're single-cloud or multi-cloud.

4. Invest in Observability

You can't fix what you can't see:

  • Comprehensive monitoring across all services
  • Distributed tracing
  • Proper alerting (not alert fatigue)
  • Runbooks for common failure scenarios
  • Regular disaster recovery drills

5. Know Your Dependencies

Map your critical path dependencies:

  • Which managed services are you using?
  • What happens if each one fails?
  • Do you have fallback strategies?
  • Can you operate without them temporarily?

The AWS outage revealed that many organisations had invisible dependencies they didn't know about. Don't let that be you.

The Real Question

Before embarking on a multi-cloud journey or romanticising a return to on-premises, ask yourself:

"Are we trying to solve a technical problem or an organisational fear?"

Fear of vendor lock-in is valid. Nostalgia for "simpler times" with physical servers is understandable. But neither should drive architecture decisions without rigorous business justification.

The costs of multi-cloud in engineering time, complexity, operational overhead, and actual dollars/euros/reais/whatever currency you use must be weighed against the risk you're trying to mitigate.

The cost of on-premises in capital investment, personnel, inflexibility and opportunity cost must be weighed against the control you think you're gaining.

Conclusion

Last week's AWS outage was significant. It affected major services and disrupted businesses worldwide. But the lesson isn't "multi-cloud or bust" and it definitely isn't "back to the data center." The lesson is:

  1. Understand your SLAs (really understand them, with math)
  2. Design for failure within your primary cloud
  3. Implement proper multi-region architecture
  4. Build systems that degrade gracefully
  5. Only go multi-cloud when the business case is clear
  6. Don't romanticise on-prem without doing the full cost analysis

Multi-cloud is a tool, not a religion. On-premises is a deployment model, not a time machine.

For most organisations, the complexity, cost, and engineering overhead of either multi-cloud or on-premises outweigh the marginal resilience benefits over proper single-cloud, multi-region architecture.

Your time and money are better spent building robust, well-architected systems on one cloud platform than either spreading yourself thin across multiple platforms you don't fully understand or rebuilding infrastructure capabilities that cloud providers have spent billions perfecting.


What's your take? Has your organisation seriously evaluated the true cost of multi-cloud or returning to on-prem? I'd love to hear your experiences in the comments.


Tags: #aws #azure #gcp #cloudcomputing #devops #architecture #multicloud #onpremises #datacenter #sre #cloudstrategy
