Your team runs a critical REST API (API Gateway → Lambda → DynamoDB) in the us-east-1 region.
To meet a 99.99% availability SLA, you must design a multi‑region failover solution that automatically routes traffic to a standby deployment in eu-west-1 if the primary region experiences an outage. The design must provide:
- Automatic health‑checking of the primary endpoint.
- Seamless DNS‑based failover with minimal client impact.
- Data consistency between the DynamoDB tables in both regions.
Include any trade‑offs (e.g., latency, cost, eventual consistency) and a brief diagram description.
Scenario‑Based Lambda Question 4
Title: Multi‑Region Failover Architecture for a Lambda‑Backed API
Answer:
1. Automatic Health‑Checking
- Route 53 health checks targeting the primary API Gateway endpoint (https://api-id.execute-api.us-east-1.amazonaws.com/prod/health).
- Health‑check type HTTPS, request interval 30 s, failure threshold 3 (a provisioning sketch follows this list).
- Enable a CloudWatch alarm on the health‑check status for visibility.
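A minimal sketch of provisioning that health check with boto3; the hostname and path are placeholders for your own API stage:

```python
import uuid
import boto3

# Route 53 is a global service, so no region is needed on the client.
route53 = boto3.client("route53")

# Placeholder hostname for the primary API Gateway stage.
PRIMARY_HOST = "api-id.execute-api.us-east-1.amazonaws.com"

response = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),  # idempotency token
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": PRIMARY_HOST,
        "ResourcePath": "/prod/health",
        "Port": 443,
        "RequestInterval": 30,  # seconds between checks
        "FailureThreshold": 3,  # consecutive failures before "unhealthy"
    },
)
print("Health check ID:", response["HealthCheck"]["Id"])
```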
2. DNS‑Based Failover
| Component | Configuration |
|---|---|
| Route 53 Hosted Zone | Create an A (Alias) record for api.example.com. |
| Primary record | Alias → API Gateway in us-east-1. Set Routing Policy = Failover, Failover Type = Primary. |
| Secondary record | Alias → API Gateway in eu-west-1. Set Failover Type = Secondary. |
| TTL | Alias records use a fixed, Route 53‑managed TTL (about 60 s); if you use non‑alias records instead, set a low TTL (e.g., 30 s) so clients re‑resolve quickly after failover. |
When the health check fails, Route 53 automatically answers queries with the secondary alias, so clients resolve to the standby API with minimal delay. Note that both deployments need a Regional custom domain name for api.example.com (backed by an ACM certificate in each region) so that either API Gateway will accept requests for the shared hostname.
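A hedged sketch of creating the failover record pair with boto3; the hosted zone ID, health‑check ID, and the regional custom‑domain targets are placeholders you would read from your own deployments (e.g., via get_domain_name):

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000EXAMPLE"  # placeholder: zone for example.com
HEALTH_CHECK_ID = "11111111-aaaa"   # placeholder: ID from the health-check step

# Placeholder alias targets: the regionalDomainName / regionalHostedZoneId
# of each API Gateway custom domain name.
PRIMARY = {"dns": "d-abc123.execute-api.us-east-1.amazonaws.com", "zone": "Z1UJRXOUMOOFQ8"}
SECONDARY = {"dns": "d-def456.execute-api.eu-west-1.amazonaws.com", "zone": "ZLY8HYME6SFDD"}

def failover_change(target, role, health_check_id=None):
    """Build an UPSERT for one half of the failover pair."""
    record = {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": f"api-{role.lower()}",
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "AliasTarget": {
            "HostedZoneId": target["zone"],
            "DNSName": target["dns"],
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        failover_change(PRIMARY, "PRIMARY", HEALTH_CHECK_ID),
        failover_change(SECONDARY, "SECONDARY"),
    ]},
)
```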
3. Data Consistency Between DynamoDB Tables
- DynamoDB Global Tables (version 2019.11.21) spanning us-east-1 and eu-west-1.
- Provides multi‑master, active‑active replication with eventual consistency (typically < 1 s replication latency).
- No additional application code is needed; writes in either region are replicated automatically (a provisioning sketch follows).
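A minimal sketch of adding the replica with boto3, assuming an existing table on the 2019.11.21 version; the table name is a placeholder:

```python
import boto3

# Replicas are added from the region that owns the source table.
dynamodb = boto3.client("dynamodb", region_name="us-east-1")

TABLE_NAME = "api-data"  # placeholder: the API's table

# Adding a replica requires DynamoDB Streams enabled with NEW_AND_OLD_IMAGES.
dynamodb.update_table(
    TableName=TABLE_NAME,
    ReplicaUpdates=[{"Create": {"RegionName": "eu-west-1"}}],
)

# Replica creation is asynchronous; poll Replicas until the status is ACTIVE.
table = dynamodb.describe_table(TableName=TABLE_NAME)["Table"]
print(table.get("Replicas", []))
```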
Alternative (if Global Tables not viable):
- Use DynamoDB Streams + Lambda cross‑region replication: a Lambda in the primary region consumes the table's stream and writes each change to the standby table in eu-west-1 (sketched below). This adds extra latency and operational overhead.
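A hedged sketch of such a replicator handler, assuming the stream is configured with NEW_AND_OLD_IMAGES and the standby table (a placeholder name here) shares the key schema:

```python
import boto3

# Client for the standby region; stream records already carry items in
# DynamoDB's attribute-value format, so no marshalling is needed.
replica = boto3.client("dynamodb", region_name="eu-west-1")
STANDBY_TABLE = "api-data"  # placeholder: standby table name

def handler(event, context):
    """Replicate DynamoDB stream records into the standby table."""
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            replica.put_item(
                TableName=STANDBY_TABLE,
                Item=record["dynamodb"]["NewImage"],
            )
        elif record["eventName"] == "REMOVE":
            replica.delete_item(
                TableName=STANDBY_TABLE,
                Key=record["dynamodb"]["Keys"],
            )
```

Note that this naive version has no conflict resolution, retries, or dead‑letter handling, which is exactly the operational overhead noted above.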
4. Trade‑offs
| Aspect | Consideration |
|---|---|
| Latency | During normal operation all traffic is served from us-east-1, so European clients pay transatlantic latency; they are only routed to the closer eu-west-1 deployment after a failover. |
| Cost | Global Tables incur additional write/read capacity charges for cross‑region replication. Two API Gateways, Lambdas, and CloudWatch metrics double the baseline cost. |
| Consistency | Global Tables are eventually consistent across regions; a write may not be visible in the other region for up to a few seconds, and concurrent writes are resolved last‑writer‑wins. If strict cross‑region consistency is required, you would need a single‑writer, active‑passive design (e.g., promote the standby only after replication catches up), at the cost of availability. |
| Failover Time | Detection takes up to ~90 s (30 s interval × 3 failures); after that, most clients switch within roughly one DNS TTL (~60 s for alias records). Some long‑lived TCP connections may need to be re‑established. |
| Complexity | Using Global Tables is the simplest for data sync; custom stream‑Lambda replication adds operational complexity and monitoring overhead. |
5. Operational Steps Summary
- Deploy the API (API GW + Lambda) in both regions.
- Enable DynamoDB Global Table across the two regions.
- Create a Route 53 health check for the primary API's /health endpoint.
- Set up the primary/secondary failover alias records for api.example.com.
- Test failover by manually disabling the primary API (e.g., removing the Lambda's invoke permission so /health fails) and verify that DNS switches and requests succeed in the secondary region (a verification sketch follows this list).
- Monitor CloudWatch metrics for health‑check status, Route 53 failover events, and DynamoDB replication lag.
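A small verification sketch using only the standard library, assuming the hostname above; run it in a loop while disabling the primary to watch the switch:

```python
import socket
import urllib.request

HOSTNAME = "api.example.com"  # placeholder domain from the records above

# Show which addresses the failover record currently resolves to.
addresses = sorted({info[4][0] for info in socket.getaddrinfo(HOSTNAME, 443)})
print(f"{HOSTNAME} resolves to: {addresses}")

# Hit the health endpoint and report the status code.
with urllib.request.urlopen(f"https://{HOSTNAME}/health", timeout=5) as resp:
    print(f"GET /health -> {resp.status}")
```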
Diagram (described): Client → Route 53 (failover routing + health check) → primary stack (API Gateway → Lambda → DynamoDB in us-east-1) or standby stack (the same in eu-west-1), with DynamoDB Global Tables replicating between the two tables. This architecture satisfies the 99.99% availability goal with automatic detection, a rapid DNS‑based traffic shift, and near‑real‑time data replication.