Active-Passive Failover
Setup:
- You have one primary (active) system handling all the traffic.
- A secondary (passive/standby) system is running in the background (or partially idle), ready to take over if the active one fails.
Failover process:
- If the active system fails, DNS or a load balancer redirects traffic to the passive system.
- Example in AWS:
- Two EC2 instances behind an Elastic IP. Normally only one EC2 is serving traffic.
- Route 53 health checks detect the failure and switch traffic to the standby (see the sketch below).
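To make the EC2/Elastic IP example concrete, here is a minimal boto3 sketch (not a full solution): a small script, which could for example run in a Lambda function triggered by a health-check alarm, re-associates the Elastic IP with the standby instance. The instance ID, allocation ID, and Region are placeholders.

```python
import boto3

# Placeholders: real IDs would come from your own account.
EIP_ALLOCATION_ID = "eipalloc-0123456789abcdef0"
STANDBY_INSTANCE_ID = "i-0standby00000000000"

ec2 = boto3.client("ec2", region_name="us-east-1")

def fail_over_to_standby():
    # Re-point the Elastic IP at the standby instance. AllowReassociation
    # lets the address move even though it is still attached to the
    # (failed) primary instance.
    ec2.associate_address(
        AllocationId=EIP_ALLOCATION_ID,
        InstanceId=STANDBY_INSTANCE_ID,
        AllowReassociation=True,
    )

if __name__ == "__main__":
    fail_over_to_standby()
```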
Pros:
- Simpler to set up.
- Costs less, since the passive system may run at reduced capacity.
Cons:
- Failover introduces downtime (seconds to minutes) while switching.
- The passive system may not be fully utilized until failover happens.
Active-Active Failover
Setup:
- Multiple systems are all active and handling traffic simultaneously.
- Traffic is distributed across regions/instances with load balancers or Route 53 policies.
Failover process:
- If one system fails, traffic is automatically rerouted to the remaining active systems.
- Example in AWS:
- Multi-Region architecture with Route 53 latency-based routing.
- Both Regions (e.g., us-east-1 and eu-west-1) serve traffic concurrently.
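As a rough sketch of what that routing could look like with boto3, the snippet below upserts two latency-based alias records pointing at the Regional ALBs. The hosted zone ID, domain name, ALB DNS names, and ALB canonical hosted-zone IDs are placeholders you would look up in your own account.

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0EXAMPLE"  # placeholder: your public hosted zone

def latency_alias(region, alb_dns_name, alb_zone_id):
    # One latency-based alias record per Region. Route 53 answers with the
    # Region that gives the caller the lowest latency and skips unhealthy
    # targets because EvaluateTargetHealth is enabled.
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "app.example.com",
            "Type": "A",
            "SetIdentifier": f"app-{region}",
            "Region": region,  # marks this as a latency-based record
            "AliasTarget": {
                "HostedZoneId": alb_zone_id,  # the ALB's canonical hosted zone ID
                "DNSName": alb_dns_name,
                "EvaluateTargetHealth": True,
            },
        },
    }

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        latency_alias("us-east-1", "use1-alb.us-east-1.elb.amazonaws.com", "ZALB_USE1"),
        latency_alias("eu-west-1", "euw1-alb.eu-west-1.elb.amazonaws.com", "ZALB_EUW1"),
    ]},
)
```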
Pros:
- No downtime: failover is seamless because the other systems are already serving traffic.
- Better performance (users are served by the nearest region).
Cons:
- More complex to design (need global data replication, conflict resolution).
- More expensive (all systems run at full capacity).
Key Differences
| Feature | Active-Passive | Active-Active |
| --- | --- | --- |
| Traffic during normal ops | Only the primary handles requests | All nodes handle requests |
| Failover time | Seconds to minutes | Near-instant (seamless) |
| Resource utilization | Passive node mostly idle | All resources used |
| Cost | Lower | Higher |
| Complexity | Simpler | More complex (sync, routing, conflicts) |
| AWS example | Route 53 failover policy with a standby Region | Route 53 latency/geolocation policy with multiple Regions |
Example Case
Workload:
- E-commerce app: EC2 Auto Scaling group behind an ALB
- Database: Aurora PostgreSQL DB cluster
Goal:
- Prepare for Region-wide outages
- RTO = 30 minutes
- DR infrastructure doesn't need to run unless failover occurs (so cost efficiency is important)
Recommended DR Solution (Active-Passive)
1. Secondary Region Infrastructure
Compute layer (EC2 + ALB)
- Deploy the same stack (ALB + Auto Scaling group) in a secondary AWS Region, but set:
- Desired capacity = 0
- Maximum capacity = 0
- This ensures no running instances unless you trigger failover, which keeps costs low.
Scaling up during DR:
- In a DR event, raise the Auto Scaling group's maximum and desired capacity so it can launch the app servers (a minimal sketch follows below).
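Here is a minimal sketch of both states, assuming a hypothetical Auto Scaling group name in the secondary Region: zero capacity in normal operation, raised limits during a DR event.

```python
import boto3

asg = boto3.client("autoscaling", region_name="eu-west-1")  # secondary Region

DR_ASG_NAME = "ecommerce-app-dr"  # placeholder

def standby_mode():
    # Normal operation: the DR Auto Scaling group exists but runs nothing.
    asg.update_auto_scaling_group(
        AutoScalingGroupName=DR_ASG_NAME,
        MinSize=0,
        MaxSize=0,
        DesiredCapacity=0,
    )

def activate_dr(desired=4, maximum=8):
    # DR event: raise the maximum and desired capacity so the group
    # actually launches instances behind the secondary ALB.
    asg.update_auto_scaling_group(
        AutoScalingGroupName=DR_ASG_NAME,
        MinSize=desired,
        MaxSize=maximum,
        DesiredCapacity=desired,
    )
```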
2. Database Layer (Aurora)
Convert the primary Aurora PostgreSQL DB cluster into an Aurora Global Database:
- Primary Region = read/write cluster
- Secondary Region = read-only cluster (replicates with ~1 second lag)
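One possible way to wire this up with boto3 is sketched below; the identifiers and ARN are placeholders, and in practice the secondary cluster's engine version and networking settings have to match the existing primary cluster.

```python
import boto3

rds_primary = boto3.client("rds", region_name="us-east-1")
rds_secondary = boto3.client("rds", region_name="eu-west-1")

GLOBAL_ID = "ecommerce-global"  # placeholder
PRIMARY_CLUSTER_ARN = "arn:aws:rds:us-east-1:123456789012:cluster:ecommerce-db"  # placeholder

# 1. Wrap the existing regional cluster in a global database.
rds_primary.create_global_cluster(
    GlobalClusterIdentifier=GLOBAL_ID,
    SourceDBClusterIdentifier=PRIMARY_CLUSTER_ARN,
)

# 2. Add a read-only secondary cluster in the DR Region.
rds_secondary.create_db_cluster(
    DBClusterIdentifier="ecommerce-db-dr",
    Engine="aurora-postgresql",
    GlobalClusterIdentifier=GLOBAL_ID,
)

# 3. Optional: add a reader instance. A "headless" secondary (no instances)
#    is cheaper, but then an instance has to be created at failover time.
rds_secondary.create_db_instance(
    DBInstanceIdentifier="ecommerce-db-dr-1",
    DBClusterIdentifier="ecommerce-db-dr",
    DBInstanceClass="db.r6g.large",
    Engine="aurora-postgresql",
)
```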
In a failover scenario:
- Promote the secondary DB cluster to be read/write
- Point the app at the promoted cluster's writer endpoint
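Below is a sketch of the promotion step using the detach-and-promote approach (remove_from_global_cluster); the global cluster name, cluster identifier, and ARN are placeholders.

```python
import time
import boto3

rds_dr = boto3.client("rds", region_name="eu-west-1")

SECONDARY_CLUSTER_ID = "ecommerce-db-dr"  # placeholder
SECONDARY_CLUSTER_ARN = "arn:aws:rds:eu-west-1:123456789012:cluster:ecommerce-db-dr"  # placeholder

# Detach the secondary cluster from the global database. It becomes a
# standalone cluster and starts accepting writes.
rds_dr.remove_from_global_cluster(
    GlobalClusterIdentifier="ecommerce-global",  # placeholder
    DbClusterIdentifier=SECONDARY_CLUSTER_ARN,
)

# Wait until the cluster reports "available", then repoint the application
# at the promoted cluster's writer endpoint (config change / parameter store).
while True:
    cluster = rds_dr.describe_db_clusters(
        DBClusterIdentifier=SECONDARY_CLUSTER_ID
    )["DBClusters"][0]
    if cluster["Status"] == "available":
        print("Writer endpoint:", cluster["Endpoint"])
        break
    time.sleep(15)
```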
3. DNS Failover
Use Amazon Route 53 with an active-passive failover routing policy:
- Primary endpoint = ALB in the primary Region
- Secondary endpoint = ALB in the secondary Region
- Configure health checks so Route 53 directs traffic to the secondary ALB only when the primary Region fails.
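Here is a sketch of that setup with boto3: one health check against the primary ALB plus a PRIMARY/SECONDARY alias record pair. The domain, hosted zone ID, ALB DNS names, and canonical hosted-zone IDs are placeholders.

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0EXAMPLE"  # placeholder

# Health check against the primary ALB's /health path.
health_check_id = route53.create_health_check(
    CallerReference="primary-alb-health-check-1",  # any unique string
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "primary-alb.us-east-1.elb.amazonaws.com",  # placeholder
        "Port": 443,
        "ResourcePath": "/health",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)["HealthCheck"]["Id"]

def failover_alias(role, alb_dns_name, alb_zone_id, health_check=None):
    # One alias record per Region; the "Failover" field makes this an
    # active-passive pair, and the optional health check guards the primary.
    record = {
        "Name": "shop.example.com",
        "Type": "A",
        "SetIdentifier": f"shop-{role.lower()}",
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,  # the ALB's canonical hosted zone ID
            "DNSName": alb_dns_name,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check:
        record["HealthCheckId"] = health_check
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        failover_alias("PRIMARY", "primary-alb.us-east-1.elb.amazonaws.com",
                       "ZALB_USE1", health_check_id),
        failover_alias("SECONDARY", "dr-alb.eu-west-1.elb.amazonaws.com",
                       "ZALB_EUW1"),
    ]},
)
```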
4. RTO & RPO Considerations
RTO (Recovery Time Objective)
- 30 minutes can be met because:
- Route 53 failover = a few minutes
- ASG scale-up + app bootstrap = within minutes
- Aurora Global DB promotion = a few minutes
RPO (Recovery Point Objective)
- Aurora Global Database provides low-latency replication, so data loss is minimal (seconds).
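If you want to verify that replication lag in practice, one option is to read the AuroraGlobalDBReplicationLag CloudWatch metric (reported in milliseconds) for the secondary cluster; the cluster identifier below is a placeholder.

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # secondary Region

now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="AuroraGlobalDBReplicationLag",  # cross-Region lag in milliseconds
    Dimensions=[{"Name": "DBClusterIdentifier", "Value": "ecommerce-db-dr"}],  # placeholder
    StartTime=now - timedelta(minutes=15),
    EndTime=now,
    Period=60,
    Statistics=["Average", "Maximum"],
)

# Print the last 15 minutes of lag data, oldest first.
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"avg={point['Average']:.0f} ms", f"max={point['Maximum']:.0f} ms")
```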
Final Architecture
Primary Region:
- Active EC2 ASG + ALB
- Aurora PostgreSQL (writer)
Secondary Region:
- Standby EC2 ASG + ALB (capacity=0)
- Aurora Global Database (reader, promoted on failover)
Failover:
- Route 53 health checks detect the outage and switch DNS to the secondary ALB
- Scale up EC2 ASG in the secondary Region
- Promote Aurora Global Database to writer
This solution meets the requirements:
- 30-minute RTO: achievable
- Low operational overhead: DR infrastructure stays mostly idle until needed
- Cost-optimized: only Aurora replication runs continuously