Active-Passive Failover
Setup:
- You have one primary (active) system handling all the traffic.
- A secondary (passive/standby) system is running in the background (or partially idle), ready to take over if the active one fails.
Failover process:
- If the active system fails, DNS or a load balancer redirects traffic to the passive system.
- Example in AWS:
- Two EC2 instances behind an Elastic IP. Normally only one EC2 is serving traffic.
- Route 53 health checks detect the failure and switch traffic to the standby (see the sketch below).
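To make the EC2/Elastic IP example concrete, here is a minimal boto3 sketch (not a full solution): a small script, which could for example run in a Lambda function triggered by a health-check alarm, re-associates the Elastic IP with the standby instance. The instance ID, allocation ID, and Region are placeholders.

```python
import boto3

# Placeholders: real IDs would come from your own account.
EIP_ALLOCATION_ID = "eipalloc-0123456789abcdef0"
STANDBY_INSTANCE_ID = "i-0standby00000000000"

ec2 = boto3.client("ec2", region_name="us-east-1")

def fail_over_to_standby():
    # Re-point the Elastic IP at the standby instance. AllowReassociation
    # lets the address move even though it is still attached to the
    # (failed) primary instance.
    ec2.associate_address(
        AllocationId=EIP_ALLOCATION_ID,
        InstanceId=STANDBY_INSTANCE_ID,
        AllowReassociation=True,
    )

if __name__ == "__main__":
    fail_over_to_standby()
```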
Pros:
- Simpler to set up.
- Costs less, since the passive system may run at reduced capacity.
Cons:
- Failover introduces downtime (seconds to minutes) while switching.
- The passive system may not be fully utilized until failover happens.
Active-Active Failover
Setup:
- Multiple systems are all active and handling traffic simultaneously.
- Traffic is distributed across regions/instances with load balancers or Route 53 policies.
Failover process:
- If one system fails, traffic is automatically rerouted to the remaining active systems.
- Example in AWS:
- Multi-Region architecture with Route 53 latency-based routing.
- Both Regions (e.g., us-east-1 and eu-west-1) serve traffic concurrently.
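As a rough sketch of what that routing could look like with boto3, the snippet below upserts two latency-based alias records pointing at the Regional ALBs. The hosted zone ID, domain name, ALB DNS names, and ALB canonical hosted-zone IDs are placeholders you would look up in your own account.

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0EXAMPLE"  # placeholder: your public hosted zone

def latency_alias(region, alb_dns_name, alb_zone_id):
    # One latency-based alias record per Region. Route 53 answers with the
    # Region that gives the caller the lowest latency and skips unhealthy
    # targets because EvaluateTargetHealth is enabled.
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "app.example.com",
            "Type": "A",
            "SetIdentifier": f"app-{region}",
            "Region": region,  # marks this as a latency-based record
            "AliasTarget": {
                "HostedZoneId": alb_zone_id,  # the ALB's canonical hosted zone ID
                "DNSName": alb_dns_name,
                "EvaluateTargetHealth": True,
            },
        },
    }

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        latency_alias("us-east-1", "use1-alb.us-east-1.elb.amazonaws.com", "ZALB_USE1"),
        latency_alias("eu-west-1", "euw1-alb.eu-west-1.elb.amazonaws.com", "ZALB_EUW1"),
    ]},
)
```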
Pros:
- No downtime: failover is seamless because the other systems are already serving traffic.
- Better performance (users are served by the nearest region).
Cons:
- More complex to design (need global data replication, conflict resolution).
- More expensive (all systems run at full capacity).
Key Differences
| Feature | Active-Passive | Active-Active |
| --- | --- | --- |
| Traffic during normal ops | Only the primary handles requests | All nodes handle requests |
| Failover time | Seconds to minutes | Near-instant (seamless) |
| Resource utilization | Passive node mostly idle | All resources used |
| Cost | Lower | Higher |
| Complexity | Simpler | More complex (sync, routing, conflicts) |
| AWS example | Route 53 failover policy with a standby Region | Route 53 latency/geolocation policy with multiple Regions |
Example Case
Workload:
- E-commerce app: EC2 Auto Scaling group behind an ALB
- Database: Aurora PostgreSQL DB cluster
Goal:
- Prepare for Region-wide outages
- RTO = 30 minutes
- DR infrastructure doesn't need to run unless failover occurs (so cost efficiency is important)
Recommended DR Solution (Active-Passive)
1. Secondary Region Infrastructure
Compute layer (EC2 + ALB)
- Deploy the same stack (ALB + Auto Scaling group) in a secondary AWS Region, but set:
- Desired capacity = 0
- Maximum capacity = 0
- This ensures no running instances unless you trigger failover, which keeps costs low.
Scaling up during DR:
- In a DR event, raise the Auto Scaling group's maximum and desired capacity so it can launch the app servers (a minimal sketch follows below).
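Here is a minimal sketch of both states, assuming a hypothetical Auto Scaling group name in the secondary Region: zero capacity in normal operation, raised limits during a DR event.

```python
import boto3

asg = boto3.client("autoscaling", region_name="eu-west-1")  # secondary Region

DR_ASG_NAME = "ecommerce-app-dr"  # placeholder

def standby_mode():
    # Normal operation: the DR Auto Scaling group exists but runs nothing.
    asg.update_auto_scaling_group(
        AutoScalingGroupName=DR_ASG_NAME,
        MinSize=0,
        MaxSize=0,
        DesiredCapacity=0,
    )

def activate_dr(desired=4, maximum=8):
    # DR event: raise the maximum and desired capacity so the group
    # actually launches instances behind the secondary ALB.
    asg.update_auto_scaling_group(
        AutoScalingGroupName=DR_ASG_NAME,
        MinSize=desired,
        MaxSize=maximum,
        DesiredCapacity=desired,
    )
```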
2. Database Layer (Aurora)
Convert the primary Aurora PostgreSQL DB cluster into an Aurora Global Database:
- Primary Region = read/write cluster
- Secondary Region = read-only cluster (replicates with ~1 second lag)
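One possible way to wire this up with boto3 is sketched below; the identifiers and ARN are placeholders, and in practice the secondary cluster's engine version and networking settings have to match the existing primary cluster.

```python
import boto3

rds_primary = boto3.client("rds", region_name="us-east-1")
rds_secondary = boto3.client("rds", region_name="eu-west-1")

GLOBAL_ID = "ecommerce-global"  # placeholder
PRIMARY_CLUSTER_ARN = "arn:aws:rds:us-east-1:123456789012:cluster:ecommerce-db"  # placeholder

# 1. Wrap the existing regional cluster in a global database.
rds_primary.create_global_cluster(
    GlobalClusterIdentifier=GLOBAL_ID,
    SourceDBClusterIdentifier=PRIMARY_CLUSTER_ARN,
)

# 2. Add a read-only secondary cluster in the DR Region.
rds_secondary.create_db_cluster(
    DBClusterIdentifier="ecommerce-db-dr",
    Engine="aurora-postgresql",
    GlobalClusterIdentifier=GLOBAL_ID,
)

# 3. Optional: add a reader instance. A "headless" secondary (no instances)
#    is cheaper, but then an instance has to be created at failover time.
rds_secondary.create_db_instance(
    DBInstanceIdentifier="ecommerce-db-dr-1",
    DBClusterIdentifier="ecommerce-db-dr",
    DBInstanceClass="db.r6g.large",
    Engine="aurora-postgresql",
)
```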
In a failover scenario:
- Promote the secondary DB cluster to be read/write
- Point the app at the promoted cluster's writer endpoint
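Below is a sketch of the promotion step using the detach-and-promote approach (remove_from_global_cluster); the global cluster name, cluster identifier, and ARN are placeholders.

```python
import time
import boto3

rds_dr = boto3.client("rds", region_name="eu-west-1")

SECONDARY_CLUSTER_ID = "ecommerce-db-dr"  # placeholder
SECONDARY_CLUSTER_ARN = "arn:aws:rds:eu-west-1:123456789012:cluster:ecommerce-db-dr"  # placeholder

# Detach the secondary cluster from the global database. It becomes a
# standalone cluster and starts accepting writes.
rds_dr.remove_from_global_cluster(
    GlobalClusterIdentifier="ecommerce-global",  # placeholder
    DbClusterIdentifier=SECONDARY_CLUSTER_ARN,
)

# Wait until the cluster reports "available", then repoint the application
# at the promoted cluster's writer endpoint (config change / parameter store).
while True:
    cluster = rds_dr.describe_db_clusters(
        DBClusterIdentifier=SECONDARY_CLUSTER_ID
    )["DBClusters"][0]
    if cluster["Status"] == "available":
        print("Writer endpoint:", cluster["Endpoint"])
        break
    time.sleep(15)
```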
3. DNS Failover
Use Amazon Route 53 with an active-passive failover routing policy:
- Primary endpoint = ALB in the primary Region
- Secondary endpoint = ALB in the secondary Region
- Configure health checks so Route 53 directs traffic to the secondary ALB only when the primary Region fails.
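Here is a sketch of that setup with boto3: one health check against the primary ALB plus a PRIMARY/SECONDARY alias record pair. The domain, hosted zone ID, ALB DNS names, and canonical hosted-zone IDs are placeholders.

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0EXAMPLE"  # placeholder

# Health check against the primary ALB's /health path.
health_check_id = route53.create_health_check(
    CallerReference="primary-alb-health-check-1",  # any unique string
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "primary-alb.us-east-1.elb.amazonaws.com",  # placeholder
        "Port": 443,
        "ResourcePath": "/health",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)["HealthCheck"]["Id"]

def failover_alias(role, alb_dns_name, alb_zone_id, health_check=None):
    # One alias record per Region; the "Failover" field makes this an
    # active-passive pair, and the optional health check guards the primary.
    record = {
        "Name": "shop.example.com",
        "Type": "A",
        "SetIdentifier": f"shop-{role.lower()}",
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,  # the ALB's canonical hosted zone ID
            "DNSName": alb_dns_name,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check:
        record["HealthCheckId"] = health_check
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        failover_alias("PRIMARY", "primary-alb.us-east-1.elb.amazonaws.com",
                       "ZALB_USE1", health_check_id),
        failover_alias("SECONDARY", "dr-alb.eu-west-1.elb.amazonaws.com",
                       "ZALB_EUW1"),
    ]},
)
```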
4. RTO & RPO Considerations
RTO (Recovery Time Objective)
- 30 minutes can be met because:
- Route 53 failover = a few minutes
- ASG scale-up + app bootstrap = within minutes
- Aurora Global DB promotion = a few minutes
RPO (Recovery Point Objective)
- Aurora Global Database provides low-latency replication, so data loss is minimal (seconds).
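If you want to verify that replication lag in practice, one option is to read the AuroraGlobalDBReplicationLag CloudWatch metric (reported in milliseconds) for the secondary cluster; the cluster identifier below is a placeholder.

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # secondary Region

now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="AuroraGlobalDBReplicationLag",  # cross-Region lag in milliseconds
    Dimensions=[{"Name": "DBClusterIdentifier", "Value": "ecommerce-db-dr"}],  # placeholder
    StartTime=now - timedelta(minutes=15),
    EndTime=now,
    Period=60,
    Statistics=["Average", "Maximum"],
)

# Print the last 15 minutes of lag data, oldest first.
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"avg={point['Average']:.0f} ms", f"max={point['Maximum']:.0f} ms")
```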
Final Architecture
Primary Region:
- Active EC2 ASG + ALB
- Aurora PostgreSQL (writer)
Secondary Region:
- Standby EC2 ASG + ALB (capacity=0)
- Aurora Global Database (reader, promoted on failover)
Failover:
- Route 53 health checks detect the outage and switch DNS to the secondary ALB
- Scale up EC2 ASG in the secondary Region
- Promote Aurora Global Database to writer
This solution meets the requirements:
- 30-minute RTO: achievable
- Low operational overhead: DR infrastructure stays mostly idle until needed
- Cost-optimized: only Aurora replication runs continuously