Ali-Funk

Posted on Feb 8 • Edited on Feb 12

Building a Multi-Region Disaster Recovery Strategy with Terraform and AWS Route 53

#aws #devops #terraform #cloud

High Availability (HA) and Disaster Recovery (DR) are often discussed in theory, but rarely tested until a real incident occurs. Following a conversation with Marcel Lücht about resilient cloud architectures, I decided to engineer and validate a cross-continental failover solution using Infrastructure as Code (IaC).

The objective was clear: Build a static website that remains online even if an entire AWS region goes offline.

The Architecture

I chose an Active-Passive Failover strategy. This is a cost-effective pattern where resources in a secondary region are only used when the primary region fails.

Primary Region: eu-central-1 (Frankfurt) – Serving the live traffic.

Secondary Region: us-east-1 (N. Virginia) – Serving a static maintenance/failover page.

Traffic Director: AWS Route 53 with Health Checks.

Automation: HashiCorp Terraform.
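The active-passive decision described above can be sketched in a few lines of Python. This is only a simplified model of the routing rule, not Route 53's actual implementation; the `Endpoint` type and `select_endpoint` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    region: str
    role: str  # "PRIMARY" or "SECONDARY"


# Regions taken from the architecture list above.
FRANKFURT = Endpoint(region="eu-central-1", role="PRIMARY")
VIRGINIA = Endpoint(region="us-east-1", role="SECONDARY")


def select_endpoint(primary_healthy: bool) -> Endpoint:
    """Return the endpoint that should receive traffic.

    Simplified model: only the primary has a health check, so the
    secondary is used whenever the primary is reported unhealthy.
    """
    return FRANKFURT if primary_healthy else VIRGINIA


if __name__ == "__main__":
    print(select_endpoint(True).region)
    print(select_endpoint(False).region)
```

The passive side costs nothing in routing terms until the health check flips, which is what makes this pattern cheap to operate.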

Step 1: Infrastructure as Code (Terraform)

Instead of clicking through the AWS Console, I defined the entire stack using Terraform. This ensures that the disaster recovery logic is reproducible and version-controlled.

Key components in the main.tf and route53.tf files included:

Two EC2 Instances: One in Europe, one in the US.

Route 53 Health Check: Configured to probe the IP address of the Frankfurt server every 10 seconds (Route 53 health checks use HTTP, HTTPS, or TCP requests rather than ICMP ping).

Failover Routing Policy:

The Primary Record points to Frankfurt.

The Secondary Record points to Virginia and is only activated if the Health Check returns unhealthy.

Defining the Failover Routing Policy in Terraform.
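As a rough illustration of what such a configuration might contain, the sketch below builds the health check and the two failover records as a Python dictionary and prints them in Terraform's JSON syntax (the kind of content a `.tf.json` file holds). It is not the original route53.tf. The resource names `frankfurt` and `virginia`, the variable `var.zone_id`, the domain `example.com`, the HTTP check on port 80, the 60-second TTL, and the failure threshold of 3 are all assumptions made for the example.

```python
import json


def failover_config(domain, request_interval=10, failure_threshold=3, ttl=60):
    """Build a Terraform JSON document for Route 53 failover routing.

    Assumes aws_instance.frankfurt, aws_instance.virginia, and
    var.zone_id are declared elsewhere (hypothetical names).
    """

    def record(role, instance, health_check_id=None):
        rec = {
            "zone_id": "${var.zone_id}",
            "name": domain,
            "type": "A",
            "ttl": ttl,
            "set_identifier": role.lower(),
            "records": ["${aws_instance." + instance + ".public_ip}"],
            "failover_routing_policy": {"type": role},
        }
        # Only the primary record is tied to the health check.
        if health_check_id is not None:
            rec["health_check_id"] = health_check_id
        return rec

    return {
        "resource": {
            "aws_route53_health_check": {
                "primary": {
                    "ip_address": "${aws_instance.frankfurt.public_ip}",
                    "type": "HTTP",  # assumed protocol
                    "port": 80,
                    "resource_path": "/",
                    "request_interval": request_interval,
                    "failure_threshold": failure_threshold,
                }
            },
            "aws_route53_record": {
                "primary": record(
                    "PRIMARY",
                    "frankfurt",
                    "${aws_route53_health_check.primary.id}",
                ),
                "secondary": record("SECONDARY", "virginia"),
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(failover_config("example.com"), indent=2))
```

The same resources can of course be written directly in HCL; the point here is only to show which attributes tie the primary record to the health check and mark the second record as the fallback.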

Step 2: The Simulation

Once the infrastructure was deployed (terraform apply), the site was live and serving content from Frankfurt. To validate the engineering, I had to simulate a catastrophic failure.

I manually stopped the EC2 instance in eu-central-1. This simulates a scenario where the server crashes or the availability zone becomes unreachable.

Within seconds, the AWS Route 53 Health Check detected the timeout.

Route 53 successfully detects the failure in the primary region.
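How long detection takes depends on the check interval and on how many consecutive failures are required before the endpoint is marked unhealthy. The sketch below models that with a single checker. The 10-second interval comes from the setup described earlier; the failure threshold of 3 is an assumption (it is the Route 53 default, but the value used in this project is not stated), and real health checks aggregate results from many checkers, so treat this as a rough estimate only.

```python
def first_unhealthy_time(results, request_interval=10, failure_threshold=3):
    """Return the seconds elapsed when the endpoint is first marked unhealthy.

    results: iterable of booleans, one per health check (True = passed).
    Returns None if the threshold of consecutive failures is never reached.
    """
    consecutive_failures = 0
    for check_number, passed in enumerate(results, start=1):
        consecutive_failures = 0 if passed else consecutive_failures + 1
        if consecutive_failures >= failure_threshold:
            return check_number * request_interval
    return None


if __name__ == "__main__":
    # The instance is stopped just before the first check: three failed checks in a row.
    print(first_unhealthy_time([False, False, False]))
```

Under these assumptions the primary is declared unhealthy roughly 30 seconds after it stops responding.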

Step 3: The Result

The failover took effect almost immediately. Route 53 began answering DNS queries for the domain with the secondary A record, directing incoming traffic to the secondary server in us-east-1.

Users accessing the domain were served the fallback content from the US server without any manual intervention required from my side.

Traffic is automatically rerouted to the DR region.
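One way to observe the switch from the client side is to poll DNS until the answer stops being the primary address. The following is a minimal sketch, not part of the original project: `watch_failover` is a hypothetical helper, and the domain and IP addresses in the usage example are placeholders from the documentation ranges. The resolver is injectable so the logic can be exercised without a real domain.

```python
import socket
import time


def watch_failover(domain, primary_ip, resolve=socket.gethostbyname,
                   polls=30, delay=5.0, sleep=time.sleep):
    """Poll DNS and return the first address that differs from primary_ip.

    Returns None if the answer never changes within the given number of polls.
    """
    for attempt in range(polls):
        address = resolve(domain)
        if address != primary_ip:
            return address
        if attempt < polls - 1:
            sleep(delay)
    return None


if __name__ == "__main__":
    # Simulated resolver: two answers from the primary, then the secondary.
    answers = iter(["192.0.2.10", "192.0.2.10", "198.51.100.20"])
    result = watch_failover(
        "example.com",
        "192.0.2.10",
        resolve=lambda _domain: next(answers),
        sleep=lambda _seconds: None,
    )
    print(result)
```

When pointed at a real domain, the time until the answer changes also depends on the record's TTL and on any caching in the local resolver, so it may lag behind the moment Route 53 itself fails over.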

Conclusion

Building this project reinforced a core principle of Cloud Engineering: Design for Failure.

By using Terraform, I could build, test, and destroy this environment quickly. By using Route 53, I ensured that the application remains available globally, even during a regional outage.

Cost Analysis (FinOps)

One final observation regarding costs: I set up an AWS Budgets alert to track the spend for this experiment. The total infrastructure cost for deploying this global, multi-region setup was just $0.08 (excluding the domain registration).

This shows that learning enterprise-grade reliability doesn't require an enterprise budget, just smart architecture and automation, including tearing the environment down with terraform destroy once the test is done.

References & Documentation

1. AWS Documentation - Configuring Failover in Route 53: https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/dns-failover.html

2. Terraform Registry - AWS Route 53 Health Check Resource: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/route53_health_check

3. AWS Architecture Blog - Disaster Recovery (DR) Architecture on AWS: https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-i-strategies-for-recovery-in-the-cloud/

Top comments (2)

Aryan Choudhary • Feb 8

This is just incredible, the idea of creating a failover strategy that kicks in automatically when an entire region goes dark is just mind-blowing. I mean, the fact that you simulated a catastrophic failure and it worked like a charm is just amazing. It's this kind of thinking, designing for failure from the start, that I think is really key to making these systems reliable. Very impressive Ali!

Ali-Funk • Feb 8

Thank you, Aryan. I completely agree with you. Shifting the mindset from trying to prevent failure to actually 'designing for it' makes a huge difference. Watching the automation take over in real time was a great experience. It was very cheap to run, but it took me a while to complete this project. I can highly recommend it.