DEV Community

Wakeup Flower
AWS Active-Active & Active-Passive Failover

🔄 Active-Passive Failover

  • Setup:

    • You have one primary (active) system handling all the traffic.
    • A secondary (passive/standby) system is running in the background (or partially idle), ready to take over if the active one fails.
  • Failover process:

    • If the active system fails, DNS or a load balancer redirects traffic to the passive system.
    • Example in AWS:
      • Two EC2 instances behind an Elastic IP; normally only one EC2 is serving traffic.
      • Route 53 health checks detect the failure → traffic switches to the standby.
  • Pros:

    • Simpler to set up.
    • Costs less, since the passive system may run at reduced capacity.
  • Cons:

    • Failover introduces downtime (seconds to minutes) while switching.
    • The passive system may not be fully utilized until failover happens.
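The routing decision behind active-passive failover can be sketched in a few lines of Python. This is a toy model of what Route 53 health-checked failover routing decides, not an AWS API; the endpoint names are made up:

```python
# Toy sketch of active-passive routing: one health check on the primary
# decides which endpoint receives traffic. In AWS, Route 53 performs the
# health check and the DNS answer flips to the standby on failure.

def route(primary_healthy: bool, primary: str, standby: str) -> str:
    """Return the endpoint that should receive traffic right now."""
    return primary if primary_healthy else standby

primary, standby = "primary.example.com", "standby.example.com"

print(route(True, primary, standby))   # normal operation: primary serves
print(route(False, primary, standby))  # primary fails: standby takes over
```

Note that the real-world gap between "primary fails" and "standby serves" is where the downtime comes from: health-check intervals, DNS TTLs, and client caching all add seconds to minutes.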

🔄 Active-Active Failover

  • Setup:

    • Multiple systems are all active and handling traffic simultaneously.
    • Traffic is distributed across regions/instances with load balancers or Route 53 policies.
  • Failover process:

    • If one system fails, traffic is automatically rerouted to the remaining active systems.
    • Example in AWS:
    • Multi-Region architecture with Route 53 latency-based routing.
    • Both Regions (e.g., us-east-1 and eu-west-1) serve traffic concurrently.
  • Pros:

    • No downtime: failover is seamless because other systems are already serving traffic.
    • Better performance (users are served by the nearest region).
  • Cons:

    • More complex to design (need global data replication, conflict resolution).
    • More expensive (all systems run at full capacity).
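The same idea for active-active, as a toy sketch: every healthy Region is a candidate, and a failed Region simply drops out of the pool, so the remaining Regions keep serving with no switchover step. The latency numbers here are invented, not measured:

```python
# Toy sketch of active-active, latency-based routing: each request goes to
# the lowest-latency healthy Region. A failed Region is dropped from the
# candidate set, so failover is just "pick from whoever is left".

def pick_region(latency_ms: dict[str, int], healthy: set[str]) -> str:
    """Choose the lowest-latency healthy Region for a request."""
    candidates = {r: ms for r, ms in latency_ms.items() if r in healthy}
    return min(candidates, key=candidates.get)

latencies = {"us-east-1": 20, "eu-west-1": 90}

print(pick_region(latencies, {"us-east-1", "eu-west-1"}))  # nearest Region wins
print(pick_region(latencies, {"eu-west-1"}))               # us-east-1 down: traffic shifts
```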

✅ Key Difference

| Feature | Active-Passive | Active-Active |
| --- | --- | --- |
| Traffic during normal ops | Only primary handles requests | All nodes handle requests |
| Failover time | Seconds to minutes | Near-instant (seamless) |
| Resource utilization | Passive mostly idle | All resources used |
| Cost | Lower | Higher |
| Complexity | Simpler | More complex (sync, routing, conflicts) |
| AWS example | Route 53 failover policy with a standby Region | Route 53 latency/geolocation policy with multiple Regions |

πŸ“ Example case

  • Workload:

    • E-commerce app → EC2 Auto Scaling group behind an ALB
    • Database → Aurora PostgreSQL DB cluster
  • Goal:

    • Prepare for Region-wide outages
    • RTO = 30 minutes
    • DR infra doesn’t need to run unless failover is needed (so cost efficiency is important)

✅ Recommended DR Solution (Active-Passive)

1. Secondary Region Infrastructure

  • Compute layer (EC2 + ALB)

    • Deploy the same stack (ALB + Auto Scaling group) in a secondary AWS Region, but set:
      • Desired capacity = 0
      • Maximum capacity = 0
    • This ensures no instances run unless you trigger failover, which keeps compute costs low.
  • Scaling up during DR:

    • In a DR event, raise the Auto Scaling group's maximum and desired capacity to launch the app servers (both must be raised, since the maximum was set to 0).
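As a hedged sketch, the scale-up step can be expressed as the parameters you would pass to boto3's `update_auto_scaling_group` call. The group name and target sizes below are assumptions for illustration:

```python
# Sketch of the DR scale-up step: build the parameters for
# autoscaling.update_auto_scaling_group(**params). The ASG name and sizes
# are hypothetical; tune them to your normal production capacity.

def dr_scale_up_params(asg_name: str, desired: int, max_size: int) -> dict:
    """Parameters to take a pilot-light ASG from 0 to serving capacity."""
    return {
        "AutoScalingGroupName": asg_name,
        "MinSize": desired,        # keep at least the desired fleet alive
        "DesiredCapacity": desired,
        "MaxSize": max_size,       # must be raised too, since it was 0
    }

params = dr_scale_up_params("app-asg-secondary", desired=3, max_size=6)
print(params)
```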

2. Database Layer (Aurora)

  • Convert the primary Aurora PostgreSQL DB cluster into an Aurora Global Database:

    • Primary Region = read/write cluster
    • Secondary Region = read-only cluster (replicates with ~1 second lag)
  • In a failover scenario:

    • Promote the secondary DB cluster to be read/write
    • Point the app to it
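The promotion step can be sketched the same way: for an unplanned Region outage, detaching the secondary cluster from the global cluster promotes it to a standalone read/write cluster (boto3's `rds.remove_from_global_cluster`). All identifiers below are hypothetical:

```python
# Sketch of the Aurora Global Database promotion step: build the
# parameters for rds.remove_from_global_cluster(**params). Detaching the
# secondary cluster promotes it to standalone read/write. The identifier
# and ARN are placeholders.

def promote_secondary_params(global_id: str, secondary_cluster_arn: str) -> dict:
    """Parameters to detach (and thereby promote) the secondary cluster."""
    return {
        "GlobalClusterIdentifier": global_id,
        "DbClusterIdentifier": secondary_cluster_arn,  # full ARN of the secondary
    }

params = promote_secondary_params(
    "shop-global-db",
    "arn:aws:rds:eu-west-1:123456789012:cluster:shop-db-secondary",
)
print(params)
```

After promotion, the app in the secondary Region must be pointed at the promoted cluster's writer endpoint.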

3. DNS Failover

  • Use Amazon Route 53 with an active-passive failover routing policy:

    • Primary endpoint = ALB in the primary Region
    • Secondary endpoint = ALB in the secondary Region
    • Configure health checks so Route 53 directs traffic to the secondary ALB only when the primary Region fails.

4. RTO & RPO Considerations

  • RTO (Recovery Time Objective)

    • The 30-minute target can be met because:
      • Route 53 failover = a few minutes
      • ASG scale-up + app bootstrap = within minutes
      • Aurora Global DB promotion = a few minutes
  • RPO (Recovery Point Objective)

    • Aurora Global Database provides low-latency replication, so data loss is minimal (seconds).
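A quick sanity check that the step timings fit inside the 30-minute budget. The per-step minutes below are illustrative assumptions, not measurements; you would validate them with a DR drill:

```python
# Sum the assumed failover steps and compare against the 30-minute RTO.
# Per-step minutes are rough assumptions, not measured values.

rto_minutes = 30
steps = {
    "route53_failover": 3,            # health check detection + DNS TTL
    "asg_scale_up_and_bootstrap": 10, # instance launch + app startup
    "aurora_promotion": 5,            # detach/promote secondary cluster
}
total = sum(steps.values())
print(f"{total} of {rto_minutes} minutes used")
```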

🎯 Final Architecture

  • Primary Region:

    • Active EC2 ASG + ALB
    • Aurora PostgreSQL (writer)
  • Secondary Region:

    • Standby EC2 ASG + ALB (capacity=0)
    • Aurora Global Database (reader, promoted on failover)
  • Failover:

    • Trigger Route 53 to switch DNS to the secondary ALB
    • Scale up EC2 ASG in the secondary Region
    • Promote Aurora Global Database to writer

✅ This solution meets the requirements:

  • 30-minute RTO → achievable
  • Low operational overhead → DR infra stays mostly idle until needed
  • Cost-optimized → only the Aurora secondary cluster (and an idle ALB) runs continuously
