Exam Guide: Solutions Architect - Associate
🧱 Domain 2: Design Resilient Architectures
📘 Task Statement 2.2
🎯 Designing Highly Available And Fault Tolerant Architectures is about keeping workloads running despite failures.
- High Availability (HA): the system stays up through component failures, recovering quickly with little or no downtime
- Fault Tolerance (FT): the system continues operating with no interruption, even while a component has failed
Highly Available usually means Multi-AZ + load balancing + managed services + no single points of failure.
Knowledge
1 | AWS Global Infrastructure
AZs, Regions, Route 53
1 Availability Zones (AZs): isolated failure domains within a Region
2 Regions: separate geographic areas for disaster recovery
3 Amazon Route 53: DNS-based routing and health checks (common for regional failover)
“Must survive an AZ failure” → Multi-AZ design.
“Must survive a regional outage” → Multi-region DR + Route 53 failover.
2 | AWS Managed Services With Appropriate Use Cases
This bullet exists because managed services often include built-in HA and scaling, which reduces your operational risk.
Even if services like Comprehend or Polly aren’t HA topics by themselves, the exam tests the principle:
Prefer managed services when you want higher reliability with less custom work.
3 | Basic Networking Concepts
Route Tables
HA and FT depend on correct routing:
1 Public subnets route to an Internet Gateway (IGW)
2 Private subnets may route outbound via NAT Gateway
3 Multi-AZ designs require correct subnet/routing per AZ
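The routing rules above can be checked mechanically. Here is a minimal sketch, assuming a hypothetical in-memory model of subnets: the `Subnet` class, the `routing_problems` helper, and the example IDs are illustrative only, and nothing here calls an AWS API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Subnet:
    name: str
    az: str
    public: bool
    # Target of the 0.0.0.0/0 route, e.g. "igw-1", "nat-a", or None (hypothetical IDs).
    default_route: Optional[str]


def routing_problems(subnets, nat_az):
    """Return readable problems; nat_az maps a NAT gateway ID to its AZ."""
    problems = []
    for s in subnets:
        target = s.default_route or ""
        if s.public and not target.startswith("igw-"):
            problems.append(f"{s.name}: public subnet has no route to an IGW")
        elif not s.public and target.startswith("igw-"):
            problems.append(f"{s.name}: private subnet routes directly to an IGW")
        elif target.startswith("nat-") and nat_az.get(target) != s.az:
            problems.append(f"{s.name}: depends on a NAT gateway in another AZ")
    return problems
```

A private subnet in one AZ that routes through a NAT Gateway in another AZ gets flagged, because losing that other AZ would also cut its outbound traffic.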
4 | Disaster Recovery Strategies
RPO/RTO, backup-restore, pilot light, warm standby, active-active
Know these by cost vs recovery speed:
| DR strategy | What it is | Typical RTO/RPO | Cost |
|---|---|---|---|
| Backup & Restore | restore from backups into a new environment | Slow RTO, higher RPO | Lowest |
| Pilot Light | minimal core services running (e.g., DB + minimal infra) | Medium RTO, medium RPO | Low–Medium |
| Warm Standby | scaled-down but fully functional stack always running | Faster RTO, low RPO | Medium–High |
| Active-Active | both Regions serve traffic | Lowest RTO/RPO | Highest |
If RTO/RPO are strict, the answer moves toward warm standby / active-active.
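To make the cost vs recovery speed trade-off concrete, here is a rough sketch of how the table could be turned into a lookup. The minute thresholds in `DR_TIERS` are illustrative assumptions, not AWS-published limits, and `pick_dr_strategy` is a hypothetical helper; real choices also weigh cost and data replication options.

```python
# (upper bound in minutes, strategy), ordered from strictest to most relaxed.
# The numeric cut-offs are assumptions chosen only for illustration.
DR_TIERS = [
    (1, "Active-Active"),
    (15, "Warm Standby"),
    (120, "Pilot Light"),
]


def pick_dr_strategy(rto_minutes, rpo_minutes):
    """Pick the cheapest strategy whose tier still meets the stricter target."""
    strictest = min(rto_minutes, rpo_minutes)
    for limit, strategy in DR_TIERS:
        if strictest <= limit:
            return strategy
    return "Backup & Restore"
```

The stricter of the two targets drives the decision: a relaxed RTO does not help if the RPO demands near-zero data loss.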
5 | Distributed Design Patterns
Common Resilience Patterns
1 Retry with exponential backoff and jitter: avoids a thundering herd of synchronized retries
2 Timeouts: prevent resource exhaustion
3 Circuit breaker / bulkhead: limit cascade failures
4 Queue-based load leveling: SQS
5 Idempotency: safe retries
6 Multi-AZ deployment: for every critical tier
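The first pattern in the list can be sketched in a few lines. This is a minimal example of retry with exponential backoff and full jitter, assuming the wrapped operation is idempotent (otherwise a retry could apply the same change twice). The function name and default values are hypothetical, and `sleep` and `rng` are injectable only so the behavior can be tested without waiting.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep, rng=random.random):
    """Call operation(); on failure wait a random, growing delay and retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential cap (0.1s, 0.2s, 0.4s, ...) bounded by max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: a random delay in [0, cap) spreads clients apart.
            sleep(rng() * cap)
```

The jitter is what prevents the thundering herd: without it, every client that failed at the same moment would retry at the same moment too.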
6 | Failover Strategies
Ways Failover Happens On AWS
1 Load balancer failover across targets in multiple AZs within a Region
2 Database failover: RDS Multi-AZ
3 DNS failover: Route 53 health checks across Regions
4 Client-side failover: apps try secondary endpoints
“Fail over between Regions” → Route 53 failover routing (or latency-based + health checks).
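Client-side failover (item 4) is the one option in this list that lives in application code. Here is a minimal sketch, assuming a hypothetical `send` callable that raises `ConnectionError` when an endpoint is unreachable; the endpoint names in the usage note are made up.

```python
def call_with_failover(endpoints, send):
    """Try each endpoint in priority order; return (endpoint_used, response)."""
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, send(endpoint)
        except ConnectionError as exc:
            # Remember why this endpoint failed, then try the next one.
            errors.append(f"{endpoint}: {exc}")
    raise ConnectionError("all endpoints failed: " + "; ".join(errors))
```

Calling it with `["primary.example.internal", "secondary.example.internal"]` would only touch the secondary when the primary raises. DNS failover with Route 53 achieves a similar result without changing the client.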
7 | Immutable Infrastructure
Immutable means you don’t patch servers in place, you replace them:
1 Build a new AMI/container image
2 Deploy new instances/tasks
3 Terminate old ones
Benefits:
- Consistency
- Faster recovery
- Lower configuration drift
“Ensure infrastructure integrity and repeatability” → IaC + immutable deployments.
8 | Load Balancing Concepts
Application Load Balancer
- ALB spreads traffic across targets in multiple AZs
- Helps remove single-instance failure as a SPOF
9 | Proxy Concepts
Amazon RDS Proxy
RDS Proxy helps reliability especially for spiky/serverless workloads by:
1 Pooling and reusing DB connections
2 Reducing DB overload due to connection storms
3 Reducing failover time by keeping application connections open while the database fails over
“Lambda causes too many DB connections” → RDS Proxy.
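The pooling idea itself is easy to picture in code. The sketch below is a conceptual illustration only: it is not how RDS Proxy is implemented or configured, and `ConnectionPool` and `connect` are hypothetical names. It shows many short-lived callers sharing a small, fixed set of connections.

```python
import queue


class ConnectionPool:
    """Open a fixed number of connections once and lend them to callers."""

    def __init__(self, connect, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())

    def run(self, work, timeout=1.0):
        # Blocks until a connection is free; raises queue.Empty on timeout,
        # so a burst of callers queues up instead of opening new connections.
        conn = self._idle.get(timeout=timeout)
        try:
            return work(conn)
        finally:
            self._idle.put(conn)  # always hand the connection back
```

However many callers arrive, the database only ever sees `size` connections, which is the effect that protects it during connection storms.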
10 | Service Quotas And Throttling
Standby Environments
In DR scenarios, your standby Region or account must have enough quota to scale up.
Know that you can:
- Check and adjust Service Quotas
- Design for throttling with retries that back off, plus buffering (for example, a queue)
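A quota check for a standby environment boils down to comparing what failover would need against what the standby is allowed. Here is a minimal sketch, assuming you have already collected both sets of numbers. `quota_shortfall` and the quota names in the usage note are hypothetical, and no AWS API is called.

```python
def quota_shortfall(required, quotas):
    """Return {quota_name: missing_amount} for every quota that is too low.

    required: capacity the standby must reach during failover
    quotas:   limits currently applied in the standby Region or account
    """
    shortfall = {}
    for name, need in required.items():
        limit = quotas.get(name, 0)  # an unknown quota is treated as zero
        if need > limit:
            shortfall[name] = need - limit
    return shortfall
```

For example, `quota_shortfall({"vcpus": 256, "elastic_ips": 5}, {"vcpus": 128, "elastic_ips": 5})` reports that 128 more vCPUs are needed, a gap best closed before a failover rather than during one.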
11 | Storage Options And Characteristics
Durability And Replication
Storage durability affects architecture choices:
1 S3 is highly durable and regional with options like versioning and replication
2 EBS is replicated within a single AZ; snapshots are stored in S3, which protects the data against an AZ failure
3 EFS is regional and stores data redundantly across multiple AZs (unless you choose One Zone storage)
12 | Workload Visibility
AWS X-Ray
Visibility supports HA by helping you detect and diagnose failures:
- CloudWatch metrics/alarms for health and scaling
- X-Ray for tracing distributed requests and finding bottlenecks
Skills
A | Determine Automation Strategies To Ensure Infrastructure Integrity
Look for:
1 Infrastructure as Code (CloudFormation/CDK/Terraform)
2 Automated deployments (blue/green, rolling)
3 Auto Scaling + health checks
4 Automated recovery actions (replace unhealthy instances/tasks)
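Items 3 and 4 describe a reconciliation loop: compare the fleet to the desired state and replace whatever is unhealthy. The sketch below models that idea in plain Python. It is a conceptual illustration under stated assumptions, not how Auto Scaling is implemented, and `reconcile`, `is_healthy`, and `launch` are hypothetical names.

```python
def reconcile(fleet, desired_count, is_healthy, launch):
    """Drop unhealthy instances, then launch replacements up to desired_count."""
    survivors = [instance for instance in fleet if is_healthy(instance)]
    while len(survivors) < desired_count:
        # Replace rather than repair: new capacity comes from a fresh launch.
        survivors.append(launch())
    return survivors
```

Because failed instances are replaced instead of fixed in place, this pairs naturally with immutable infrastructure: every replacement starts from the same image.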
B | Determine Services Required For HA/FT Across Regions or AZs
Common choices:
- Multi-AZ: ALB + Auto Scaling + Multi-AZ database (RDS Multi-AZ)
- Multi-region: Route 53 + replicated data + standby/active environment
“AZ outage must not cause downtime” → Multi-AZ everything.
C | Identify Metrics Based On Business Requirements
Tie Monitoring To User-Impacting KPIs:
1 Availability / error rate (5xx)
2 Latency p95/p99
3 Queue depth / age (SQS)
4 CPU/memory/connections (compute/DB)
5 RPO/RTO compliance signals (backup success, replication lag)
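Two of these KPIs, error rate and tail latency, are simple to compute from raw request data. Here is a minimal sketch using the nearest-rank percentile method; the helper names are hypothetical, and in practice CloudWatch computes these statistics for you.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value covering pct% of the samples."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def error_rate(status_codes):
    """Fraction of responses that were server errors (5xx)."""
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)
```

p95 and p99 matter because an average hides the slow requests that users actually notice; alarming on `percentile(latencies_ms, 99)` catches degradation that a mean would smooth over.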
D | Implement Designs To Mitigate Single Points Of Failure
Remove Single Points of Failure
1 Multi-AZ deployments
2 Redundant NAT Gateways (one per AZ for best practice)
3 Multi-AZ databases
4 Avoid single instance “pet” servers
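The arithmetic behind removing single points of failure is worth seeing once. This sketch assumes component failures are independent, which is an idealization (shared dependencies break it), and the function names are hypothetical.

```python
import math


def series(*availabilities):
    """All components must work: availabilities multiply."""
    return math.prod(availabilities)


def redundant(availability, copies):
    """At least one of several identical copies must work."""
    return 1 - (1 - availability) ** copies
```

Under that assumption, two 99% instances behind a load balancer give roughly 99.99% (`redundant(0.99, 2)`), while chaining two 99% tiers in series drops to about 98.01% (`series(0.99, 0.99)`). That is why every tier needs redundancy, not just the web tier.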
E | Ensure Durability And Availability Of Data
Backups
1 Automated backups (RDS)
2 Snapshots (EBS, RDS)
3 S3 versioning + replication where required
4 AWS Backup policies when asked for centralized backup
F | Select An Appropriate DR Strategy To Meet Business Requirements
Use RTO/RPO to pick:
1 Backup/Restore (cheap, slow)
2 Pilot Light (medium)
3 Warm Standby (faster)
4 Active-Active (fastest, expensive)
G | Improve Reliability Of Legacy Apps
When app changes are not possible, use infrastructure patterns:
1 Put app behind ALB
2 Use Auto Scaling groups to replace failed instances
3 Use RDS Proxy to stabilize DB connections
4 Use caching to reduce backend load
5 Use DNS failover (Route 53) for regional DR
H | Use Purpose-Built AWS Services
Use managed services to reduce failure modes:
- ALB, Auto Scaling, Route 53
- RDS Multi-AZ, DynamoDB (managed HA)
- SQS/SNS for decoupling spikes and failures
- CloudFront for edge caching and origin protection
Cheat Sheet
| Requirement | Direction |
|---|---|
| Survive an instance failure | Auto Scaling + health checks + ALB |
| Survive an AZ failure | Multi-AZ for each tier (ALB targets across AZs, Multi-AZ DB) |
| Survive a Region failure | DR strategy + Route 53 failover + replicated data |
| Strict RTO/RPO | Warm standby or active-active |
| Lambda overwhelms RDS with connections | RDS Proxy |
| Need to see bottlenecks across microservices | X-Ray (plus CloudWatch) |
| Standby must scale during failover | Plan Service Quotas + scaling policies |
Recap Checklist ✅
1. [ ] Every critical tier is deployed across multiple AZs
2. [ ] Traffic is distributed via ALB/NLB and unhealthy targets are replaced automatically
3. [ ] Databases use HA features (e.g., RDS Multi-AZ or managed HA services)
4. [ ] DR strategy matches business RTO/RPO (backup/restore vs pilot light vs warm standby vs active-active)
5. [ ] Regional failover uses Route 53 health checks/routing (when required)
6. [ ] Data durability is addressed (backups, snapshots, replication)
7. [ ] Quotas and throttling are considered for failover/standby scaling
8. [ ] Monitoring and tracing exist (CloudWatch + X-Ray)
AWS Whitepapers and Official Documentation
These are the primary AWS documents behind Task Statement 2.2.
You do not need to memorize them; use them to understand why highly available and fault-tolerant architectures work the way they do.
Global Infrastructure and DNS
Disaster Recovery
Networking Foundations
Load Balancing and Reliability
1. Application Load Balancer
2. Auto Scaling (EC2)
Database Reliability
1. RDS Multi-AZ
2. RDS Proxy
Quotas and Limits
Storage Durability / Replication
1. S3 Replication
2. EBS Snapshots
3. EFS Overview
Visibility
1. AWS X-Ray
2. CloudWatch
Managed AI Services (examples from blueprint)
1. Amazon Comprehend
2. Amazon Polly