Anthony Uketui

Designing Active-Active Multi-Region for a Payment Gateway: An Architecture Decision Record

TL;DR: I designed a multi-region Active-Active architecture for a fintech payment gateway to survive full AWS region failures. This ADR documents why we chose Active-Active over Active-Passive, the technical approach using Route 53 + ECS Fargate + Aurora Global Database, and the trade-offs we accepted.


1. Context: Why This Decision Matters

Our payment gateway is a critical-path service. A prolonged outage — especially a full AWS region failure — directly impacts revenue and customer trust.

Our posture at the time:

  • Largely single-region
  • Reactive: we could recover from incidents, but the system was not designed to withstand a full regional failure without manual intervention
  • Production in one region, staging in another

What we needed:

  • Payment gateway stays available during an AWS regional outage
  • Recovery time and data loss reduced to defined RTO/RPO targets
  • Improved latency by serving users from geographically closer regions
  • Fits within our existing tooling: Terraform/Terragrunt, ECS, Aurora

2. The Two Options

Option A: Active-Passive (Disaster Recovery)

A secondary region exists but is idle or minimally used. Failover is manual or semi-automated.

| Pros | Cons |
| --- | --- |
| Lower cost (secondary region is idle) | Downtime during failover (minutes to hours) |
| Simpler to implement | Manual intervention required |
| Less operational complexity | Risk of "cold start" failures in untested DR region |

Option B: Active-Active (High Availability) ← Chosen

Multiple regions actively serve traffic. If one region fails, others continue serving with minimal or no downtime.

| Pros | Cons |
| --- | --- |
| Minimal/zero downtime on failover | Higher cost (duplicate resources) |
| Lower latency (serve from nearest region) | Increased operational complexity |
| Protection against full regional failures | Requires stronger CI/CD and observability discipline |

Decision: Given the criticality of the payment gateway and the business impact of downtime, we chose Active-Active despite its higher complexity and cost.


3. Architecture Overview

3.1 Traffic Routing — Route 53

  • Latency-Based Routing for the payment gateway's public API DNS record
  • Each region exposes an independent endpoint (ALB) registered under the same DNS name
  • Route 53 directs clients to the lowest-latency region
  • Health checks against a regional health endpoint (/healthz) detect regional or service failure
  • Automatic failover: 100% of traffic routes to the surviving region when one is marked unhealthy
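
As a rough illustration of the record setup above, here is a minimal Python sketch that builds the change batch you would pass to Route 53's `ChangeResourceRecordSets` call (for example via boto3's `change_resource_record_sets`). It only constructs the payload and makes no AWS calls. The DNS name, regions, ALB names, hosted zone IDs, and health check IDs are all placeholders I made up for the example, not values from our environment.

```python
from typing import Any, Dict, List

# Hypothetical endpoints, for illustration only.
ENDPOINTS: List[Dict[str, str]] = [
    {
        "region": "eu-west-1",
        "alb_dns_name": "gateway-eu.example.elb.amazonaws.com.",
        "alb_hosted_zone_id": "ZEXAMPLEEU",
        "health_check_id": "hc-eu-example",
    },
    {
        "region": "af-south-1",
        "alb_dns_name": "gateway-af.example.elb.amazonaws.com.",
        "alb_hosted_zone_id": "ZEXAMPLEAF",
        "health_check_id": "hc-af-example",
    },
]


def latency_record_change(dns_name: str, endpoint: Dict[str, str]) -> Dict[str, Any]:
    """Build one UPSERT for a latency-based alias record pointing at a regional ALB."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": dns_name,
            "Type": "A",
            # SetIdentifier must be unique among records sharing the same name and type.
            "SetIdentifier": f"gateway-{endpoint['region']}",
            # Region is what makes this a latency-based record.
            "Region": endpoint["region"],
            "AliasTarget": {
                "HostedZoneId": endpoint["alb_hosted_zone_id"],
                "DNSName": endpoint["alb_dns_name"],
                "EvaluateTargetHealth": True,
            },
            # The health check probing this region's /healthz endpoint.
            "HealthCheckId": endpoint["health_check_id"],
        },
    }


def build_change_batch(dns_name: str, endpoints: List[Dict[str, str]]) -> Dict[str, Any]:
    """One record per region, all registered under the same DNS name."""
    return {
        "Comment": "Latency-based routing for the payment gateway API",
        "Changes": [latency_record_change(dns_name, e) for e in endpoints],
    }


batch = build_change_batch("api.example.com.", ENDPOINTS)
```

In practice we managed records through Terraform rather than scripts; the sketch is only meant to show which fields tie a record to a region and to a health check.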

3.2 Compute Layer — ECS Fargate

  • Identical ECS clusters and services in each target region
  • Terraform/Terragrunt modules parameterized by region ensure infrastructure parity
  • Same container images and versions deployed via pinned ECR image digests
  • This eliminates the "it works in Region A but not Region B" class of issues
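
A parity check along these lines can run in CI before a deploy. This is a minimal sketch, assuming you can produce a mapping of region to service to image reference (for example from rendered task definitions); the service names and references are hypothetical. Because ECR registries are regional, the registry host differs between regions, so the check compares only the `sha256` digest.

```python
import re
from typing import Dict, List, Optional

DIGEST_PATTERN = re.compile(r"sha256:[0-9a-f]{64}")


def digest_of(image_ref: str) -> Optional[str]:
    """Return the sha256 digest of an image reference, or None if it is not pinned."""
    _, sep, digest = image_ref.partition("@")
    if sep and DIGEST_PATTERN.fullmatch(digest):
        return digest
    return None


def find_parity_violations(deployed: Dict[str, Dict[str, str]]) -> List[str]:
    """deployed maps region -> {service name: image reference}."""
    problems: List[str] = []
    digests_by_service: Dict[str, Dict[str, Optional[str]]] = {}

    for region, services in deployed.items():
        for service, ref in services.items():
            digest = digest_of(ref)
            if digest is None:
                problems.append(f"{region}/{service}: not pinned by digest")
            digests_by_service.setdefault(service, {})[region] = digest

    for service, per_region in sorted(digests_by_service.items()):
        # A service must exist in every region.
        for region in sorted(set(deployed) - set(per_region)):
            problems.append(f"{service}: missing in {region}")
        # And every pinned copy must resolve to the same digest.
        pinned = {d for d in per_region.values() if d is not None}
        if len(pinned) > 1:
            problems.append(f"{service}: digest differs across regions")

    return problems


# Hypothetical example: same digest in both regions, different registry hosts.
SAME = "sha256:" + "a" * 64
example = {
    "region-a": {"gateway": f"registry-a.example.com/gateway@{SAME}"},
    "region-b": {"gateway": f"registry-b.example.com/gateway@{SAME}"},
}
violations = find_parity_violations(example)
```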

3.3 Data Layer — Aurora Global Database

This is the hardest part. Payment data must be consistent across regions.

  • Aurora Global Database with a single primary write region
  • One or more read-only secondary clusters in other regions
  • Write forwarding from secondary regions: reads are local, writes are forwarded to primary
  • Replication lag: typically <1 second
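
For write forwarding on Aurora MySQL, a session connected to a secondary cluster has to set the `aurora_replica_read_consistency` variable before its writes are forwarded. The sketch below only builds that statement; the helper names and the idea of choosing the level per request type are my own assumptions for illustration, not a description of our production code.

```python
# Consistency levels accepted by aurora_replica_read_consistency.
VALID_LEVELS = ("EVENTUAL", "SESSION", "GLOBAL")


def write_forwarding_session_sql(level: str = "SESSION") -> str:
    """Return the statement a session on a secondary cluster runs before writing."""
    normalized = level.upper()
    if normalized not in VALID_LEVELS:
        raise ValueError(f"unknown consistency level: {level!r}")
    return f"SET aurora_replica_read_consistency = '{normalized}'"


def level_for_request(needs_read_after_write: bool) -> str:
    """Hypothetical policy: payment flows that re-read their own writes use SESSION."""
    return "SESSION" if needs_read_after_write else "EVENTUAL"


statement = write_forwarding_session_sql(level_for_request(needs_read_after_write=True))
```

The stricter the level, the longer a local read may wait for replication to catch up, so the choice is a latency versus consistency trade-off per code path.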

RTO/RPO Targets:

| Metric | Target | How |
| --- | --- | --- |
| RTO (Recovery Time Objective) | < 1 minute | Route 53 health checks + automatic failover |
| RPO (Recovery Point Objective) | < 1 second | Aurora Global Database replication lag |
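
It helps to sanity-check the RTO target with simple arithmetic. A rough worst case for DNS-level failover is the time for health checks to mark the region unhealthy (check interval × failure threshold) plus the DNS TTL that resolvers may still honor. The sketch below is a simplification under assumed settings: it ignores resolvers or clients that cache beyond the TTL, and it does not include the time to promote a secondary Aurora cluster if the primary write region is the one that fails.

```python
def dns_failover_seconds(check_interval_s: int, failure_threshold: int, dns_ttl_s: int) -> int:
    """Approximate worst-case seconds from failure until clients resolve a healthy region."""
    detection_s = check_interval_s * failure_threshold
    return detection_s + dns_ttl_s


def meets_rto(rto_target_s: int, check_interval_s: int, failure_threshold: int, dns_ttl_s: int) -> bool:
    """True when the estimated DNS failover time fits inside the RTO target."""
    return dns_failover_seconds(check_interval_s, failure_threshold, dns_ttl_s) <= rto_target_s


# Illustrative settings, not measured values.
standard = dns_failover_seconds(check_interval_s=30, failure_threshold=3, dns_ttl_s=60)
fast = dns_failover_seconds(check_interval_s=10, failure_threshold=3, dns_ttl_s=60)
print(standard, fast)
```

With these assumed inputs, neither estimate fits inside one minute, which is why the target has to be validated against your actual health check interval, threshold, and TTL rather than taken as given.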

4. Key Design Decisions

Why Not Active-Passive?

For a payment gateway, even 5 minutes of downtime during manual failover is unacceptable. Active-Passive also carries the risk of "cold start" — the DR region hasn't been serving real traffic, so untested configurations, expired credentials, or capacity issues surface at the worst possible moment.

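Active-Active eliminates this for the traffic and compute layers: both regions are always warm, always serving traffic, always tested. The caveat is the write path. With a single Aurora primary, losing the primary region still requires promoting a secondary cluster before writes resume.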

Why Aurora Global Database Over DynamoDB Global Tables?

Our application is built on relational data models (MySQL/Aurora). Migrating to DynamoDB would require rewriting core application logic. Aurora Global Database provides multi-region replication without changing the data layer.

Why Latency-Based Routing Over Failover Routing?

Latency-based routing provides two benefits: performance (lowest latency) AND resilience (automatic failover via health checks). Pure failover routing only provides the second.
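
To make the combined behavior concrete, here is a small Python model of the decision: among healthy regions, pick the one with the lowest latency for the client. It is a simplified sketch of the routing logic, not Route 53's implementation, and the region names and latencies are invented. One documented Route 53 behavior is reflected here: if every record is unhealthy, all of them are treated as healthy rather than returning nothing.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RegionEndpoint:
    region: str
    latency_ms: float  # latency from the client's network to this region
    healthy: bool      # result of the regional health check


def resolve(endpoints: List[RegionEndpoint]) -> str:
    """Return the region a client would be sent to."""
    if not endpoints:
        raise ValueError("no endpoints configured")
    healthy = [e for e in endpoints if e.healthy]
    # If nothing is healthy, fall back to considering every endpoint.
    candidates = healthy or endpoints
    return min(candidates, key=lambda e: e.latency_ms).region


# Hypothetical client that is closer to region-a.
normal = resolve([
    RegionEndpoint("region-a", 25.0, healthy=True),
    RegionEndpoint("region-b", 110.0, healthy=True),
])
during_outage = resolve([
    RegionEndpoint("region-a", 25.0, healthy=False),
    RegionEndpoint("region-b", 110.0, healthy=True),
])
```

With pure failover routing, the first case would always return the designated primary, regardless of where the client is.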


5. Implementation Phases

Phase 1: Database Replication (Foundation)

  • Set up Aurora Global Database with secondary cluster
  • Validate replication lag and write forwarding behavior
  • Run DR drills to measure actual RTO/RPO
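
One way to turn a drill into numbers is to probe the public endpoint on a fixed interval and record replication lag samples, then compute the observed RTO and worst-case lag afterwards. The sketch below assumes probes are `(timestamp_seconds, succeeded)` pairs and that lag samples are in milliseconds (the unit CloudWatch uses for `AuroraGlobalDBReplicationLag`); how you collect them is up to your tooling.

```python
from typing import List, Optional, Tuple


def observed_rto_seconds(probes: List[Tuple[float, bool]]) -> float:
    """Longest outage: from the first failed probe to the next successful one."""
    ordered = sorted(probes)
    longest = 0.0
    outage_start: Optional[float] = None
    for timestamp, ok in ordered:
        if not ok and outage_start is None:
            outage_start = timestamp
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    # Still failing at the end of the drill: count up to the last probe.
    if outage_start is not None:
        longest = max(longest, ordered[-1][0] - outage_start)
    return longest


def worst_lag_seconds(lag_samples_ms: List[float]) -> float:
    """Worst replication lag seen during the drill, as an upper bound on data at risk."""
    return max(lag_samples_ms) / 1000 if lag_samples_ms else 0.0


# Made-up drill data: probes every 5 seconds, outage from t=5 until recovery at t=20.
drill_probes = [(0, True), (5, False), (10, False), (15, False), (20, True), (25, True)]
rto = observed_rto_seconds(drill_probes)
rpo = worst_lag_seconds([120, 850, 400])
```

The probe interval bounds the precision: with 5-second probes, the measured RTO can be off by up to 5 seconds in either direction.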

Phase 2: Compute Symmetry

  • Deploy identical ECS services in secondary region via Terraform modules
  • Validate container image parity across regions
  • Configure ALBs and health check endpoints
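
For the health check endpoint, the important property is that `/healthz` reflects whether the region can actually process a payment, not just whether the process is up. Below is a minimal standard-library sketch; the individual checks are placeholders I invented, and a real service would plug in its own database and dependency probes.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every check; any failure or exception makes the region unhealthy."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    body = {"status": "ok" if status == 200 else "unhealthy", "checks": results}
    return status, body


# Placeholder checks; replace with real probes (DB ping, downstream reachability).
CHECKS: Dict[str, Callable[[], bool]] = {
    "database": lambda: True,
    "dependencies": lambda: True,
}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/healthz":
            self.send_error(404)
            return
        status, body = evaluate_health(CHECKS)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the health endpoint; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Returning 503 on any failed check means both the ALB and the DNS health checks see a non-2xx status and stop sending traffic to the region.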

Phase 3: Traffic Management

  • Configure Route 53 latency-based routing
  • Set up health checks with appropriate thresholds
  • Gradual traffic shifting (10% → 50% → 100%)
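
Latency-based records do not take a percentage, so one way to stage the shift is to use weighted records during rollout and switch to latency-based routing once the new region is validated. That approach is an assumption on my part for this sketch, as is reading each percentage as the share of requests sent to the new region. With weighted routing, a record's expected share is its weight divided by the sum of all weights.

```python
from typing import Dict, List

STAGES_PERCENT = [10, 50, 100]  # rollout stages from the plan above


def stage_weights(new_region_percent: int) -> Dict[str, int]:
    """Weights for a two-record weighted set at a given rollout stage."""
    if not 0 <= new_region_percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return {"existing": 100 - new_region_percent, "new": new_region_percent}


def expected_share(weights: Dict[str, int]) -> Dict[str, float]:
    """Expected fraction of DNS responses per record: weight / sum of weights."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return {name: weight / total for name, weight in weights.items()}


plan: List[Dict[str, int]] = [stage_weights(p) for p in STAGES_PERCENT]
```

DNS caching means the observed split only approximates these shares, so each stage needs to soak long enough for per-region dashboards to show a stable picture.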

Phase 4: Operational Readiness

  • Quarterly DR drills (mandatory)
  • Cross-region observability (New Relic dashboards per region)
  • Runbooks for manual failover scenarios

6. Trade-offs We Accepted

Higher Cost

We run duplicate compute and database resources across regions. We estimated roughly 1.5–1.8x the single-region cost (not 2x, because the secondary region handles real traffic that would otherwise need scaling in the primary).
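
The reasoning behind "less than 2x" can be shown with a toy model. Assume single-region cost is normalized to 1.0 and split into a database share (fully duplicated in the second region) and a compute share (each region provisioned for only a fraction of the single-region peak, since traffic is split). The shares and fractions below are illustrative inputs, not our actual billing data.

```python
def cost_multiplier(database_share: float, per_region_compute_fraction: float) -> float:
    """Two-region cost relative to a single-region cost of 1.0."""
    if not 0 <= database_share <= 1:
        raise ValueError("database_share must be between 0 and 1")
    compute_share = 1 - database_share
    database_cost = 2 * database_share
    compute_cost = 2 * compute_share * per_region_compute_fraction
    return round(database_cost + compute_cost, 2)


# Each region sized for 75% of the old single-region peak (some failover headroom).
with_headroom = cost_multiplier(database_share=0.4, per_region_compute_fraction=0.75)
# Each region sized for the full peak: plain duplication.
full_duplicate = cost_multiplier(database_share=0.4, per_region_compute_fraction=1.0)
print(with_headroom, full_duplicate)
```

The lower the per-region compute fraction, the cheaper the setup, but the more you depend on autoscaling to absorb the other region's traffic during a failover.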

Increased Complexity

More moving parts = more potential failure modes. We mitigated this with:

  • Infrastructure as Code (Terraform) for consistency
  • Pinned image digests for deployment determinism
  • Regional health dashboards for visibility

Write Forwarding Latency

Writes from the secondary region incur cross-region latency (typically 30–100ms). For a payment gateway, this is acceptable — payment processing already involves multiple external API calls with higher latency.


7. What I'd Do Differently

  1. Start DR drills immediately. Don't wait until the architecture is "complete." Drill with whatever you have — it exposes assumptions faster than any design review.

  2. Define RTO/RPO with the business, not engineering. The business cares about revenue impact per minute of downtime, not technical recovery metrics. Translate RTO/RPO into dollars.

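  3. Plan for split-brain. What happens if both regions think they're primary? Aurora Global Database allows only one writer cluster at a time, which covers the data layer in normal operation, but a manual detach-and-promote while the old primary is still reachable can leave two writable clusters. That case, along with application-level state (caches, queues), needs explicit handling.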

  4. Budget for observability. Multi-region is only as good as your visibility into it. Without per-region dashboards, you're flying blind.


8. When Active-Active Is Overkill

Not every service needs this. Active-Active makes sense when:

  • The service is revenue-critical (payment processing, checkout)
  • Downtime is measured in dollars-per-minute
  • You have the engineering capacity to maintain two regions

For internal tools, staging environments, or low-criticality services, Active-Passive (or even single-region with good backups) is the right call.


I designed this architecture for a fintech payment gateway processing transactions across Africa. If you're building multi-region systems for financial services, I'd love to exchange notes on the data consistency challenges.
