Ntombizakhona Mabaso

for AWS Community Builders

Posted on Feb 5 • Edited on Feb 6

Design Highly Available And / Or Fault-Tolerant Architectures

#aws #certification #cloud #solutionsarchitect

Exam Guide: Solutions Architect - Associate
🧱 Domain 2: Design Resilient Architectures
📘 Task Statement 2.2

🎯 Designing Highly Available And Fault Tolerant Architectures is about keeping workloads running despite failures.

High Availability (HA): the system recovers quickly from component failures, so downtime is minimal
Fault Tolerance (FT): the system keeps operating through component failures with no interruption

Highly Available usually means Multi-AZ + load balancing + managed services + no single points of failure.
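A quick way to see why redundancy across AZs matters is to do the availability arithmetic. The sketch below assumes failures are independent and uses made-up availability figures (99% per tier); the function names are illustrative, not an AWS API.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def parallel_availability(single: float, copies: int) -> float:
    """Availability of redundant copies where any one copy is enough.

    Assumes the copies fail independently of each other.
    """
    return 1 - (1 - single) ** copies


def serial_availability(*tiers: float) -> float:
    """Availability of a chain of tiers that must all be up at once."""
    result = 1.0
    for tier in tiers:
        result *= tier
    return result


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year for a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    one_az = 0.99  # assumed figure for a tier running in a single AZ
    two_az = parallel_availability(one_az, 2)
    print(f"one AZ : {downtime_minutes_per_year(one_az):.0f} min/year down")
    print(f"two AZs: {downtime_minutes_per_year(two_az):.0f} min/year down")
    # Two single-AZ tiers in a chain are worse than either tier alone.
    print(f"app + DB, single AZ each: {serial_availability(one_az, one_az):.4f}")
```

The takeaway for exam questions: every tier in the chain multiplies risk, so each one needs its own redundancy.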

Knowledge

1 | AWS Global Infrastructure

AZs, Regions, Route 53

1 Availability Zones (AZs): isolated failure domains within a Region

2 Regions: separate geographic areas for disaster recovery

3 Amazon Route 53: DNS-based routing and health checks (common for regional failover)

“Must survive an AZ failure” → Multi-AZ design.

“Must survive a regional outage” → Multi-Region DR + Route 53 failover.

2 | AWS Managed Services With Appropriate Use Cases

This bullet exists because managed services often include built-in HA and scaling, which reduces your operational risk.

Even if services like Comprehend or Polly aren’t HA topics by themselves, the exam tests the principle:

Prefer managed services when you want higher reliability with less custom work.

3 | Basic Networking Concepts

Route Tables

HA and FT depend on correct routing:
1 Public subnets route to an Internet Gateway (IGW)
2 Private subnets may route outbound via NAT Gateway
3 Multi-AZ designs require correct subnet/routing per AZ
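The routing rules above can be sketched as a small check over a subnet inventory. This is a simplified model, not real VPC data: the `Subnet` type, the `routing_problems` function, and the `"igw"` / `"nat:<az>"` target strings are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subnet:
    name: str
    az: str
    public: bool
    default_route_target: str  # hypothetical encoding: "igw", "nat:<az>", or ""


def routing_problems(subnets):
    """Return human-readable findings for routes that weaken a Multi-AZ design."""
    problems = []
    for subnet in subnets:
        target = subnet.default_route_target
        if subnet.public:
            if target != "igw":
                problems.append(
                    f"{subnet.name}: public subnet should send 0.0.0.0/0 to the IGW"
                )
        elif target.startswith("nat:"):
            nat_az = target.split(":", 1)[1]
            if nat_az != subnet.az:
                problems.append(
                    f"{subnet.name}: egress depends on a NAT Gateway in {nat_az}, "
                    f"so an outage there cuts off {subnet.az}"
                )
    return problems


if __name__ == "__main__":
    inventory = [
        Subnet("public-a", "az-a", True, "igw"),
        Subnet("private-a", "az-a", False, "nat:az-a"),
        Subnet("private-b", "az-b", False, "nat:az-a"),  # cross-AZ dependency
    ]
    for finding in routing_problems(inventory):
        print(finding)
```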

4 | Disaster Recovery Strategies

RPO/RTO, backup-restore, pilot light, warm standby, active-active

Know these by cost vs recovery speed:

DR strategy	What it is	Typical RTO/RPO	Cost
Backup & Restore	restore from backups into a new environment	Slow RTO, higher RPO	Lowest
Pilot Light	minimal core services running (e.g., DB + minimal infra)	Medium RTO, medium RPO	Low–Medium
Warm Standby	scaled-down but fully functional stack always running	Faster RTO, low RPO	Medium–High
Active-Active	both Regions serve traffic	Lowest RTO/RPO	Highest

If RTO/RPO are strict, the answer moves toward warm standby / active-active.
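As a rough illustration of that rule of thumb, the sketch below maps an RTO/RPO pair to one of the four strategies. The cut-off values are assumptions chosen only for the example (AWS does not publish fixed thresholds), and taking the tighter of the two targets is a simplification.

```python
from datetime import timedelta


def pick_dr_strategy(rto: timedelta, rpo: timedelta) -> str:
    """Pick the cheapest strategy that plausibly meets the tighter target.

    The thresholds below are illustrative assumptions, not AWS guidance.
    """
    tightest = min(rto, rpo)
    if tightest <= timedelta(minutes=1):
        return "active-active"
    if tightest <= timedelta(minutes=15):
        return "warm standby"
    if tightest <= timedelta(hours=4):
        return "pilot light"
    return "backup & restore"


if __name__ == "__main__":
    print(pick_dr_strategy(timedelta(hours=24), timedelta(hours=24)))
    print(pick_dr_strategy(timedelta(minutes=10), timedelta(minutes=5)))
```

In an exam question, the cost constraint works the other way: pick the least expensive strategy that still meets the stated RTO/RPO.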

5 | Distributed Design Patterns

Common Resilience Patterns

1 Retry with exponential backoff and jitter: avoid thundering herd
2 Timeouts: prevent resource exhaustion
3 Circuit breaker / bulkhead: limit cascade failures
4 Queue-based load leveling: SQS
5 Idempotency: safe retries
6 Multi-AZ deployment: for every critical tier
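Two of these patterns work as a pair: retries are only safe when the operation is idempotent. Here is a minimal sketch of both, assuming a hypothetical `TransientError` for retryable failures and an in-memory dictionary standing in for a durable idempotency store (in practice that would be something like a database table).

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (throttling, timeouts)."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep, rng=random.random):
    """Call operation(), retrying transient failures with full jitter.

    The wait before retry n is a random value between 0 and
    min(max_delay, base_delay * 2**n), which spreads clients out in time.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)


class IdempotentHandler:
    """Runs the wrapped handler at most once per request ID."""

    def __init__(self, handler):
        self._handler = handler
        self._results = {}  # assumption: stands in for a durable store

    def handle(self, request_id, payload):
        if request_id not in self._results:
            self._results[request_id] = self._handler(payload)
        return self._results[request_id]
```

The `sleep` and `rng` parameters exist only so the behaviour can be tested without real waiting or randomness.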

6 | Failover Strategies

Ways Failover Happens On AWS

1 Load balancer failover across targets in multiple AZs within a Region
2 Database failover: RDS Multi-AZ
3 DNS failover: Route 53 health checks across Regions
4 Client-side failover: apps try secondary endpoints

“Fail over between Regions” → Route 53 failover routing (or latency-based + health checks).
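The decision a failover routing policy makes can be modelled in a few lines. This is a toy model, not the Route 53 API: `Endpoint` and `resolve_failover` are made-up names, and answering with the primary when nothing is healthy is a modelling choice in this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    role: str      # "PRIMARY" or "SECONDARY"
    healthy: bool  # result of the most recent health check


def resolve_failover(endpoints):
    """Return the endpoint a failover-style policy would answer with."""
    for role in ("PRIMARY", "SECONDARY"):
        for endpoint in endpoints:
            if endpoint.role == role and endpoint.healthy:
                return endpoint
    # Nothing is healthy: answer with the primary rather than with nothing
    # (a modelling choice made for this sketch).
    for endpoint in endpoints:
        if endpoint.role == "PRIMARY":
            return endpoint
    return None


if __name__ == "__main__":
    records = [
        Endpoint("app.eu-west-1.example.com", "PRIMARY", healthy=False),
        Endpoint("app.us-east-1.example.com", "SECONDARY", healthy=True),
    ]
    print(resolve_failover(records).name)
```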

7 | Immutable Infrastructure

Immutable means you don’t patch servers in place, you replace them:
1 Build a new AMI/container image
2 Deploy new instances/tasks
3 Terminate old ones

Benefits:

Consistency
Faster recovery
Lower configuration drift

“Ensure infrastructure integrity and repeatability” → IaC + immutable deployments.

8 | Load Balancing Concepts

Application Load Balancer

ALB spreads traffic across targets in multiple AZs
Removes any single instance as a single point of failure (SPOF)

9 | Proxy Concepts

Amazon RDS Proxy

RDS Proxy helps reliability especially for spiky/serverless workloads by:
1 Pooling and reusing DB connections
2 Reducing DB overload due to connection storms
3 Shortening failover disruption by keeping application connections open while the database fails over

“Lambda causes too many DB connections” → RDS Proxy.
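The pooling idea behind that answer can be shown with a small in-process pool: many short-lived callers share a fixed number of connections instead of each opening its own. This sketch only illustrates the concept and says nothing about how RDS Proxy is built; `ConnectionPool` and `run_invocations` are hypothetical names, and the "connection" is whatever object the `connect` callable returns.

```python
import queue
import threading
from contextlib import contextmanager


class ConnectionPool:
    """Opens a fixed number of connections up front and lends them out."""

    def __init__(self, size, connect):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=5.0):
        conn = self._idle.get(timeout=timeout)  # wait instead of opening more
        try:
            yield conn
        finally:
            self._idle.put(conn)  # hand the connection back for reuse


def run_invocations(pool, count):
    """Simulate `count` concurrent callers that each need the database briefly."""
    used = []
    lock = threading.Lock()

    def invoke():
        with pool.connection() as conn:
            with lock:
                used.append(conn)

    threads = [threading.Thread(target=invoke) for _ in range(count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return used
```

However many callers arrive, the database only ever sees `size` connections; the rest wait their turn.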

10 | Service Quotas And Throttling

Standby Environments

In DR scenarios, your standby Region or account must have enough quota to scale up.

Know that you can:

Check and adjust Service Quotas
Design for throttling with retries, backoff, and buffering
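A failover capacity check boils down to simple arithmetic: how many instances would the standby need at peak, and does the quota there allow it? The sketch below uses made-up numbers and hypothetical function names; real quota values would come from Service Quotas.

```python
def instances_needed(peak_rps: int, rps_per_instance: int,
                     spare_percent: int = 20) -> int:
    """Instances required to serve peak traffic plus some spare capacity.

    Uses integer ceiling division so the result never rounds down.
    """
    numerator = peak_rps * (100 + spare_percent)
    denominator = 100 * rps_per_instance
    return (numerator + denominator - 1) // denominator


def quota_shortfall(required: int, quota: int, in_use: int = 0) -> int:
    """How far the standby quota falls short of what failover requires."""
    return max(0, required - (quota - in_use))


if __name__ == "__main__":
    required = instances_needed(peak_rps=1000, rps_per_instance=100)
    gap = quota_shortfall(required, quota=10, in_use=2)  # assumed quota values
    if gap:
        print(f"Request a quota increase of at least {gap} before you need it")
```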

11 | Storage Options And Characteristics

Durability And Replication

Storage durability affects architecture choices:
1 S3 is highly durable and regional with options like versioning and replication
2 EBS is replicated within an AZ and you can send snapshots to S3 for durability
3 EFS (Regional file systems) stores data across multiple AZs within a Region

12 | Workload Visibility

AWS X-Ray

Visibility supports HA by helping you detect and diagnose failures:

CloudWatch metrics/alarms for health and scaling
X-Ray for tracing distributed requests and finding bottlenecks

Skills

A | Determine Automation Strategies To Ensure Infrastructure Integrity

Look for:
1 Infrastructure as Code (CloudFormation/CDK/Terraform)
2 Automated deployments (blue/green, rolling)
3 Auto Scaling + health checks
4 Automated recovery actions (replace unhealthy instances/tasks)
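Automated recovery is essentially a reconcile loop: compare the healthy fleet with the desired capacity and launch replacements. The sketch below is a simplified model of that loop, not how Auto Scaling is implemented; `Instance`, `reconcile`, and the `new_id` callable are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    instance_id: str
    az: str
    healthy: bool = True


def reconcile(fleet, desired, azs, new_id):
    """Drop unhealthy instances and launch replacements up to `desired`.

    Each replacement goes to the AZ that currently has the fewest healthy
    instances, which keeps the fleet balanced across AZs.
    """
    survivors = [instance for instance in fleet if instance.healthy]
    while len(survivors) < desired:
        emptiest_az = min(
            azs, key=lambda az: sum(1 for i in survivors if i.az == az)
        )
        survivors.append(Instance(new_id(), emptiest_az))
    return survivors


if __name__ == "__main__":
    ids = iter(f"i-new{n}" for n in range(100))
    fleet = [
        Instance("i-1", "az-a"),
        Instance("i-2", "az-b", healthy=False),  # failed its health check
    ]
    for instance in reconcile(fleet, 2, ["az-a", "az-b"], lambda: next(ids)):
        print(instance)
```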

B | Determine Services Required For HA/FT Across Regions or AZs

Common choices:

Multi-AZ: ALB + Auto Scaling + Multi-AZ database (RDS Multi-AZ)
Multi-Region: Route 53 + replicated data + standby/active environment

“AZ outage must not cause downtime” → Multi-AZ everything.

C | Identify Metrics Based On Business Requirements

Tie Monitoring To User-Impacting KPIs:

1 Availability / error rate (5xx)
2 Latency p95/p99
3 Queue depth / age (SQS)
4 CPU/memory/connections (compute/DB)
5 RPO/RTO compliance signals (backup success, replication lag)
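For the first two KPIs, it helps to know what is actually being computed. Here is a minimal sketch of a nearest-rank percentile and a 5xx error rate over raw samples; in practice CloudWatch computes these for you, and the function names here are illustrative.

```python
def percentile(samples, pct: int):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it. `pct` is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[rank - 1]


def error_rate(status_codes) -> float:
    """Fraction of responses that were server errors (HTTP 5xx)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


if __name__ == "__main__":
    latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 900]
    print("p95 latency:", percentile(latencies_ms, 95), "ms")
    print("5xx rate:", error_rate([200, 200, 503, 200]))
```

Percentiles expose the slow tail that an average hides, which is why p95/p99 are the usual alarm targets for user-facing latency.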

D | Implement Designs To Mitigate Single Points Of Failure

Remove Single Points of Failure

1 Multi-AZ deployments
2 Redundant NAT Gateways (one per AZ as a best practice)
3 Multi-AZ databases
4 Avoid single instance “pet” servers

E | Ensure Durability And Availability Of Data

Backups

1 Automated backups (RDS)
2 Snapshots (EBS, RDS)
3 S3 versioning + replication where required
4 AWS Backup policies when asked for centralized backup

F | Select An Appropriate DR Strategy To Meet Business Requirements

Use RTO/RPO to pick:
1 Backup/Restore (cheap, slow)
2 Pilot Light (medium)
3 Warm Standby (faster)
4 Active-Active (fastest, expensive)

G | Improve Reliability Of Legacy Apps

When app changes are not possible, use infrastructure patterns:

1 Put app behind ALB
2 Use Auto Scaling groups to replace failed instances
3 Use RDS Proxy to stabilize DB connections
4 Use caching to reduce backend load
5 Use DNS failover (Route 53) for regional DR

H | Use Purpose-Built AWS Services

Use managed services to reduce failure modes:

ALB, Auto Scaling, Route 53
RDS Multi-AZ, DynamoDB (managed HA)
SQS/SNS for decoupling spikes and failures
CloudFront for edge caching and origin protection

Cheat Sheet

Requirement	Direction
Survive an instance failure	Auto Scaling + health checks + ALB
Survive an AZ failure	Multi-AZ for each tier (ALB targets across AZs, Multi-AZ DB)
Survive a Region failure	DR strategy + Route 53 failover + replicated data
Strict RTO/RPO	Warm standby or active-active
Lambda overwhelms RDS with connections	RDS Proxy
Need to see bottlenecks across microservices	X-Ray (plus CloudWatch)
Standby must scale during failover	Plan Service Quotas + scaling policies

Recap Checklist ✅

1. [ ] Every critical tier is deployed across multiple AZs

2. [ ] Traffic is distributed via ALB/NLB and unhealthy targets are replaced automatically

3. [ ] Databases use HA features (e.g., RDS Multi-AZ or managed HA services)

4. [ ] DR strategy matches business RTO/RPO (backup/restore vs pilot light vs warm standby vs active-active)

5. [ ] Regional failover uses Route 53 health checks/routing (when required)

6. [ ] Data durability is addressed (backups, snapshots, replication)

7. [ ] Quotas and throttling are considered for failover/standby scaling

8. [ ] Monitoring and tracing exist (CloudWatch + X-Ray)


🚀