
will peixoto for AWS Community Builders

Posted on • Originally published at willpeixoto.hashnode.dev

The Price of High Availability: Resilience in Architecture: Decisions Are Strategic, Not Just Technical

High Availability Has a Price — and Resilience Begins With Decisions, Not Stacks


After a major cloud outage — like the ones we’ve seen at AWS — the same questions always resurface in developer forums and company meetings:

“Should we move to a multi-region setup immediately?”

Or worse —

“Should we go back to on-premise?”

The correct answer is rarely technical.

It is strategic.

As Werner Vogels (AWS CTO) often says:

“Everything fails, all the time.”

The question is not if you will fail — it’s when you will fail, and how prepared you are when it happens.

Because it will happen, whether you are in the cloud, on-premise, or multi-cloud.

What truly separates resilient teams is not the absence of failure, but their speed, clarity, and effectiveness in response and recovery.

True architectural maturity isn't about choosing "multi-region" or "on-premise" — it’s about understanding the inherent risk, documenting the decision transparently, and reacting with a plan.


1. The Paradox of Visible Failure

Every time a major outage occurs, technical and executive teams tend to split between two extreme reactions: rushing to multi-region, or retreating to on-premise.

Crucial Reflection:

The cloud doesn’t fail more often than a traditional data center — it simply fails in a more visible, shared, and, ironically, democratic way.

In AWS, problems escalate globally and become trending topics in minutes.

Do you honestly believe your company has superior capacity to manage the resilience of infrastructure at a global scale compared to a major cloud provider?

The debate isn’t “Cloud vs. Data Center.”

It’s a strategic game of Conscious Resilience vs. The Comfort Zone.


2. Cost vs. Continuity: The 9s Game

In the world of infrastructure, every additional "9" in your SLA (Service Level Agreement) typically costs far more than the one before it.

To illustrate the real impact of each availability level, see the maximum allowed downtime per year:

| SLA Level | Downtime per Year |
| --- | --- |
| 99% (Two 9s) | ~3.65 days |
| 99.9% (Three 9s) | ~8 hours 46 minutes |
| 99.99% (Four 9s) | ~52 minutes |
| 99.999% (Five 9s) | ~5 minutes |
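
The downtime budgets above are simple arithmetic: the unavailable fraction multiplied by the length of the period. A minimal sketch, assuming a 365-day year (the function name `max_downtime_minutes` is purely illustrative):

```python
def max_downtime_minutes(availability_pct: float, days: float = 365.0) -> float:
    """Return the maximum downtime, in minutes, allowed over `days`
    for a given availability percentage (e.g. 99.9)."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for level in (99.0, 99.9, 99.99, 99.999):
        minutes = max_downtime_minutes(level)
        print(f"{level}% -> {minutes:.1f} min/year ({minutes / 60:.2f} h)")
```

Passing `days=30` gives the monthly budget instead, which is often the window an SLA is actually measured over.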

Every jump in level typically requires not only duplicated or triplicated infrastructure but also greater operational sophistication.

The added cost must be justified by ROI (Return on Investment) — never by technical pride.

📢 The Non-Negotiable Factor: Regulation

For regulated sectors, the SLA choice is often mandated by law or industry standards.

There, the debate is no longer whether to meet the legally binding SLA but how to achieve it, since the cost of a regulatory fine can far exceed any technical savings.


3. Architectural Patterns: Naming Resilience

Resilience is codified through established patterns.

| Resilience Pattern | Use Case | Failure Type Protected Against |
| --- | --- | --- |
| Multi-AZ | HA baseline (99.9% to 99.99%) | Hardware, Data Center, or single AZ |
| Pilot Light | RTO requirement of several hours | Complete regional failure |
| Active-Active | RTO/RPO near zero | Regional failure and global balancing |
| Circuit Breaker | Any microservice dependency | Cascading failures |
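
Of these patterns, the Circuit Breaker is the one that lives in application code rather than in infrastructure. The sketch below is one minimal way to express it, not a reference implementation: the class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Illustrative breaker: opens after N consecutive failures,
    then allows a trial call once `reset_timeout` seconds have passed."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Fail fast instead of piling load onto a sick dependency.
                raise CircuitOpenError("dependency is isolated; failing fast")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure counter.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each dependency call, for example `breaker.call(lambda: fetch_prices())` with a hypothetical `fetch_prices`, and treat `CircuitOpenError` as the signal to serve a fallback.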

4. Conscious Decisions: The Virtue of ADRs

This is where ADRs (Architecture Decision Records) become crucial.

They capture the decision, the reason, and the accepted risk.

Here's an example of a well-formed ADR:

ADR-014: No Multi-Region Replication for MVP
Context: Traffic < 10 req/s. Replication cost is estimated at > 3x current cost.
Decision: Maintain single-region (Multi-AZ), with daily cross-region backup.
Review Trigger: Upon hitting 100 average req/s OR when SLA (99.95%) causes business impact.
Accepted Risk: Risk of total service downtime in case of a full regional outage (RTO estimated at 4 hours).
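
An ADR's review trigger is only useful if someone actually checks it. As a sketch of how the trigger in ADR-014 could be made machine-checkable, the record can be modelled as a small data structure. The `DecisionRecord` class, its field names, and the `needs_review` method are hypothetical and not part of any ADR tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionRecord:
    """Hypothetical in-code mirror of an ADR and its review trigger."""
    adr_id: str
    title: str
    decision: str
    accepted_risk: str
    review_threshold_rps: float  # average req/s that triggers a review

    def needs_review(self, avg_rps: float, sla_incident: bool = False) -> bool:
        """True when traffic reaches the threshold OR the SLA has caused
        business impact, mirroring the OR in the ADR's review trigger."""
        return avg_rps >= self.review_threshold_rps or sla_incident


adr_014 = DecisionRecord(
    adr_id="ADR-014",
    title="No Multi-Region Replication for MVP",
    decision="Single-region (Multi-AZ) with daily cross-region backup",
    accepted_risk="Total downtime on full regional outage (RTO ~4 hours)",
    review_threshold_rps=100.0,
)

if __name__ == "__main__":
    print(adr_014.needs_review(avg_rps=10))   # current MVP traffic
    print(adr_014.needs_review(avg_rps=120))  # traffic past the trigger
```

A check like this could run against monitoring data on a schedule, so the decision is revisited when its context changes rather than when someone happens to remember it.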


5. Selective Resilience: Not Everything Needs HA (and that’s fine)

Selective resilience is a virtue of efficiency.

Allocating finite resources (money and engineering attention) to unnecessary redundancy is one of the biggest wastes in architecture.

💡 Prioritize High Availability (HA) only for what truly matters:

  • Direct Revenue Functions: Components critical for financial transactions (e.g., checkout and payment APIs).
  • Critical Customer Journey: Functions whose failure prevents customers from using the core value of the product (e.g., login or main catalog).
  • Regulatory & Legal Risk: Services where failure results in legal fines or breaks a penalizing contractual SLA.
  • Critical Data Integrity: Where data loss violates the acceptable RPO.

High availability without a purpose is like putting an airbag on a bicycle.


6. Managed ≠ Immune: The Serverless Mindset

A common mistake is believing that using serverless services (Lambda, DynamoDB, SQS) grants immunity to failure.

It does not.

Failure will still come — and often from where you least expect it.

Managed services reduce the operational surface area, but they do not replace sound design and preparation.

Real resilience does not come from the cloud provider; it comes from the architecture you design on top of it.
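
One way to see why managed does not mean immune: when a request must pass through several services in series, their availabilities multiply. The sketch below assumes independent failures and uses purely illustrative figures, not any provider's published SLA.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up,
    assuming failures are independent (a simplifying assumption)."""
    return prod(availabilities)


if __name__ == "__main__":
    # Three hypothetical managed services, each at 99.95% (illustrative).
    chain = serial_availability(0.9995, 0.9995, 0.9995)
    print(f"composite availability: {chain:.4%}")
    print(f"expected downtime: {(1 - chain) * HOURS_PER_YEAR:.1f} h/year")
```

The composite figure is always lower than the weakest link, which is why the design on top of the services matters as much as the services themselves.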


7. Maturity is in the Question: Architecture as Influence and Translation

The difference between having an opinion and having influence lies in your capacity for strategic clarity.

❓ Where is Your Team's Maturity?

| Immature Teams Ask: | Mature Teams Ask: |
| --- | --- |
| "What stack solves this?" | "What risk are we willing to accept for this cost?" |
| "Should we use K8S or Lambda?" | "What RTO/RPO does the customer expect?" |
| "What does Netflix do?" | "What does our business need to survive a disaster?" |

The Common Trap:

Your team defines the HOW (the stack), but the business defines the WHAT (the acceptable RTO and RPO).

It is your job to hand the question back, so that the risk decision lies with the business.


Translating Resilience Concepts for Leadership

| Term | Translation / Strategic Question |
| --- | --- |
| Multi-Region Failover | "The catastrophe insurance." Reduces downtime from days to hours. Q: How many hours can we afford to be down? |
| RTO / RPO | "Defining the loss limit." Q: How much data (RPO) can we afford to lose and how much time (RTO) to recover? |
| SPOF (Single Point of Failure) | "The Achilles' Heel of Revenue." Q: What's the loss in 1h if this fails? |
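
The "loss in 1h" question can be turned into a rough comparison between the downtime cost avoided and the price of extra redundancy. The sketch below is a deliberately simplified model: it assumes revenue is lost linearly during downtime and that the availability targets are actually met. All function names and figures are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24


def expected_annual_loss(availability_pct: float, revenue_per_hour: float) -> float:
    """Revenue lost per year if downtime exactly matches the availability level."""
    downtime_hours = HOURS_PER_YEAR * (1 - availability_pct / 100)
    return downtime_hours * revenue_per_hour


def redundancy_pays_off(current_pct: float, target_pct: float,
                        revenue_per_hour: float, extra_annual_cost: float) -> bool:
    """True when the loss avoided by the higher level exceeds its extra cost."""
    avoided = (expected_annual_loss(current_pct, revenue_per_hour)
               - expected_annual_loss(target_pct, revenue_per_hour))
    return avoided > extra_annual_cost


if __name__ == "__main__":
    # Hypothetical service earning 10,000 per hour, moving from 99.9% to 99.99%.
    print(round(expected_annual_loss(99.9, 10_000)))
    print(redundancy_pays_off(99.9, 99.99, 10_000, extra_annual_cost=50_000))
```

The point is not the precision of the numbers but that leadership gets a figure to accept or reject, instead of a stack to approve.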

8. Conclusion

There is no such thing as a fail-proof architecture.

But there is such a thing as a surprise-proof organization.

It begins with conscious decisions, documented context (ADRs), and the technical humility to accept that error is part of the equation.

Teams that understand the why before the how build systems that not only scale — they survive.


9. Essential References

These are the documents that define the best practices for resilience in the cloud:


10. Glossary for Resilience

  • RTO (Recovery Time Objective): The maximum acceptable time a system can be down after a failure.
  • RPO (Recovery Point Objective): The maximum acceptable amount of data (measured in time) that can be lost during a disaster.
  • ADR (Architecture Decision Record): A short document recording a technical decision, the reason, and the accepted risk.
  • SPOF (Single Point of Failure): A single component that, if it fails, takes down the entire system or revenue stream.
  • Circuit Breaker: A software pattern that isolates a failing dependency to prevent cascading failures.

11. Want to Go Deeper?

Want to dive deeper into orchestration, resilience, and the strategic role of the developer in the serverless era?

🎤 I'll be speaking at ServerlessDays São Paulo on November 8th, at Cubo Itaú, discussing how to go beyond the stack and build systems that not only function — but thrive in chaos.

Come join the conversation! 🚀
