Anthony Uketui

Posted on Jul 2

Designing Active-Active Multi-Region for a Payment Gateway: An Architecture Decision Record

#architecture #aws #fintech #systemdesign

TL;DR: I designed a multi-region Active-Active architecture for a fintech payment gateway to survive full AWS region failures. This ADR documents why we chose Active-Active over Active-Passive, the technical approach using Route 53 + ECS Fargate + Aurora Global Database, and the trade-offs we accepted.

1. Context: Why This Decision Matters

Our payment gateway is a critical-path service. A prolonged outage — especially a full AWS region failure — directly impacts revenue and customer trust.

Our posture at the time:

Largely single-region
Reactive: we could recover from incidents, but the system was not designed to withstand a full regional failure without manual intervention
Production in one region, staging in another

What we needed:

Payment gateway stays available during an AWS regional outage
Recovery time and data loss reduced to defined RTO/RPO targets
Improved latency by serving users from geographically closer regions
Fits within our existing tooling: Terraform/Terragrunt, ECS, Aurora

2. The Two Options

Option A: Active-Passive (Disaster Recovery)

A secondary region exists but is idle or minimally used. Failover is manual or semi-automated.

Pros	Cons
Lower cost (secondary region is idle)	Downtime during failover (minutes to hours)
Simpler to implement	Manual intervention required
Less operational complexity	Risk of "cold start" failures in untested DR region

Option B: Active-Active (High Availability) ← Chosen

Multiple regions actively serve traffic. If one region fails, others continue serving with minimal or no downtime.

Pros	Cons
Minimal/zero downtime on failover	Higher cost (duplicate resources)
Lower latency (serve from nearest region)	Increased operational complexity
Protection against full regional failures	Requires stronger CI/CD and observability discipline

Decision: Given the criticality of the payment gateway and the business impact of downtime, we chose Active-Active despite its higher complexity and cost.

3. Architecture Overview

3.1 Traffic Routing — Route 53

Latency-Based Routing for the payment gateway's public API DNS record
Each region exposes an independent endpoint (ALB) registered under the same DNS name
Route 53 directs clients to the lowest-latency region
Health checks against a regional health endpoint (/healthz) detect regional or service failure
Automatic failover: once a region is marked unhealthy, Route 53 stops returning it, so new DNS lookups send 100% of traffic to the surviving region
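The routing behaviour described above can be modelled in a few lines. This is a simplified simulation, not Route 53 itself: the `RegionEndpoint` type, the `pick_region` function, and the region names are hypothetical, and the latency figures are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionEndpoint:
    region: str        # hypothetical region name, e.g. "af-south-1"
    latency_ms: float  # latency as seen from the client's location
    healthy: bool      # outcome of the /healthz health check


def pick_region(endpoints: list[RegionEndpoint]) -> str:
    """Return the lowest-latency region among the healthy endpoints."""
    candidates = [e for e in endpoints if e.healthy]
    if not candidates:
        # Assumption: if every endpoint is unhealthy, consider all of them
        # instead of returning no answer at all.
        candidates = endpoints
    return min(candidates, key=lambda e: e.latency_ms).region


if __name__ == "__main__":
    endpoints = [
        RegionEndpoint("af-south-1", latency_ms=20.0, healthy=False),
        RegionEndpoint("eu-west-1", latency_ms=140.0, healthy=True),
    ]
    print(pick_region(endpoints))
```

The point of the model is that latency only ranks the regions that pass their health check, which is how one routing policy covers both performance and failover.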

3.2 Compute Layer — ECS Fargate

Identical ECS clusters and services in each target region
Terraform/Terragrunt modules parameterized by region ensure infrastructure parity
Same container images and versions deployed via pinned ECR image digests
This eliminates the "it works in Region A but not Region B" class of issues
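A parity check on pinned digests could look roughly like the sketch below. It assumes the deployed digests have already been collected into a dictionary keyed by service and region; the function name `find_digest_drift`, the service names, and the digest strings are all hypothetical.

```python
def find_digest_drift(
    deployed: dict[str, dict[str, str]],
) -> dict[str, dict[str, str]]:
    """Return the services whose image digest differs between regions.

    `deployed` maps service name -> {region: image digest}.
    """
    drift = {}
    for service, digest_by_region in deployed.items():
        # More than one distinct digest means the regions are out of sync.
        if len(set(digest_by_region.values())) > 1:
            drift[service] = digest_by_region
    return drift


if __name__ == "__main__":
    deployed = {
        "payments-api": {"region-a": "sha256:aaa", "region-b": "sha256:aaa"},
        "ledger-worker": {"region-a": "sha256:bbb", "region-b": "sha256:ccc"},
    }
    for service, digests in find_digest_drift(deployed).items():
        print(f"drift in {service}: {digests}")
```

Running a check like this in the deployment pipeline is one way to turn "same images everywhere" from a convention into something that fails loudly.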

3.3 Data Layer — Aurora Global Database

This is the hardest part. Payment data must be consistent across regions.

Aurora Global Database with a single primary write region
One or more read-only secondary clusters in other regions
Write forwarding from secondary regions: reads are served locally, writes are forwarded to the primary
Replication lag: typically <1 second
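With Aurora MySQL write forwarding, a session connected to a secondary cluster has to set the `aurora_replica_read_consistency` variable before it can issue writes. A minimal sketch, assuming write forwarding is enabled on the secondary cluster and the application uses a DB-API style cursor; the helper name `enable_write_forwarding` is hypothetical.

```python
ALLOWED_CONSISTENCY = {"EVENTUAL", "SESSION", "GLOBAL"}


def enable_write_forwarding(cursor, consistency: str = "SESSION") -> None:
    """Set the session's read consistency so forwarded writes are accepted.

    `cursor` is any object with an execute(sql) method (DB-API style).
    """
    level = consistency.upper()
    if level not in ALLOWED_CONSISTENCY:
        raise ValueError(f"unsupported consistency level: {consistency!r}")
    # `level` comes from the fixed allow-list above, so interpolating it
    # into the statement is safe.
    cursor.execute(f"SET aurora_replica_read_consistency = '{level}'")
```

The level is a trade-off: `EVENTUAL` keeps reads fastest but a session may not see its own forwarded write immediately, while `SESSION` and `GLOBAL` wait for replication to catch up before answering reads.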

RTO/RPO Targets:

Metric	Target	How
RTO (Recovery Time Objective)	< 1 minute	Route 53 health checks + automatic failover
RPO (Recovery Point Objective)	< 1 second	Aurora Global Database replication lag
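It helps to sanity-check the RTO target against the health check settings. The sketch below is a rough estimate only: it assumes detection takes about one check interval per consecutive failure, and that clients may hold a cached DNS answer for up to the record's TTL. The function name and the example values are hypothetical, and it covers traffic routing only, not promotion of a secondary database cluster.

```python
def dns_failover_seconds(
    check_interval_s: int, failure_threshold: int, record_ttl_s: int
) -> int:
    """Rough time for clients to stop reaching a failed region via DNS."""
    # Time for the health check to fail `failure_threshold` times in a row.
    detection_s = check_interval_s * failure_threshold
    # Clients can keep using a cached answer until the TTL runs out.
    return detection_s + record_ttl_s


if __name__ == "__main__":
    estimate = dns_failover_seconds(
        check_interval_s=10, failure_threshold=3, record_ttl_s=60
    )
    print(f"estimated DNS failover time: {estimate}s")
```

With a 10-second interval, a failure threshold of 3, and a 60-second TTL, the estimate comes to 90 seconds, so a sub-minute target would need a shorter TTL or a lower threshold under these assumptions.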

4. Key Design Decisions

Why Not Active-Passive?

For a payment gateway, even 5 minutes of downtime during manual failover is unacceptable. Active-Passive also carries the risk of "cold start" — the DR region hasn't been serving real traffic, so untested configurations, expired credentials, or capacity issues surface at the worst possible moment.

Active-Active eliminates this: both regions are always warm, always serving traffic, always tested.

Why Aurora Global Database Over DynamoDB Global Tables?

Our application is built on relational data models (MySQL/Aurora). Migrating to DynamoDB would require rewriting core application logic. Aurora Global Database provides multi-region replication without changing the data layer.

Why Latency-Based Routing Over Failover Routing?

Latency-based routing provides two benefits: performance (lowest latency) AND resilience (automatic failover via health checks). Pure failover routing only provides the second.

5. Implementation Phases

Phase 1: Database Replication (Foundation)

Set up Aurora Global Database with secondary cluster
Validate replication lag and write forwarding behavior
Run DR drills to measure actual RTO/RPO
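Measuring a drill comes down to three timestamps. A minimal sketch, assuming the drill records when the failure was injected, the timestamp of the last write that reached the secondary, and when service was restored; the function names are hypothetical and the default targets are the ones from the RTO/RPO table.

```python
from datetime import datetime, timedelta


def measure_drill(
    failure_at: datetime,
    last_replicated_write_at: datetime,
    restored_at: datetime,
) -> tuple[timedelta, timedelta]:
    """Return (rto, rpo) observed during one DR drill."""
    # RTO: how long the service was unavailable.
    rto = restored_at - failure_at
    # RPO: the window of writes that had not reached the secondary yet.
    rpo = max(failure_at - last_replicated_write_at, timedelta(0))
    return rto, rpo


def meets_targets(
    rto: timedelta,
    rpo: timedelta,
    rto_target: timedelta = timedelta(minutes=1),
    rpo_target: timedelta = timedelta(seconds=1),
) -> bool:
    """Check measured values against the stated targets."""
    return rto < rto_target and rpo < rpo_target
```

Recording these per drill gives a trend over time, which is more useful than a single pass/fail result.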

Phase 2: Compute Symmetry

Deploy identical ECS services in secondary region via Terraform modules
Validate container image parity across regions
Configure ALBs and health check endpoints
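A health endpoint is only useful if it reflects the dependencies the region actually needs. The sketch below shows one way the logic behind `/healthz` could aggregate dependency checks into a status code; the `healthz` function and the check names are hypothetical, and wiring it into a web framework is left out.

```python
from typing import Callable


def healthz(checks: dict[str, Callable[[], bool]]) -> tuple[int, dict[str, str]]:
    """Run every dependency check and return (HTTP status, per-check result)."""
    results = {}
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that raises counts as a failed dependency.
            ok = False
        results[name] = "ok" if ok else "fail"
    # 200 only when every dependency is reachable, otherwise 503.
    status = 200 if all(r == "ok" for r in results.values()) else 503
    return status, results


if __name__ == "__main__":
    status, body = healthz({"database": lambda: True, "queue": lambda: True})
    print(status, body)
```

One design choice to weigh: including a shared dependency in the check means its failure marks every region unhealthy at once, so it is worth limiting the checks to region-local dependencies.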

Phase 3: Traffic Management

Configure Route 53 latency-based routing
Set up health checks with appropriate thresholds
Gradual traffic shifting (10% → 50% → 100%)
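The gradual shift can be expressed as a small step function with a rollback rule. This sketch assumes the shift is done with weighted records, where each percentage is the share of traffic sent to the new multi-region setup, and that an error-rate threshold gates each step. The names `legacy` and `multi_region`, the 1% threshold, and the rollback-to-zero rule are all assumptions.

```python
SHIFT_STEPS = (10, 50, 100)  # percentages from the rollout plan


def region_weights(multi_region_percent: int) -> dict[str, int]:
    """Split 100 weight units between the old and new routing targets."""
    if not 0 <= multi_region_percent <= 100:
        raise ValueError("percentage must be between 0 and 100")
    return {
        "legacy": 100 - multi_region_percent,
        "multi_region": multi_region_percent,
    }


def next_step(
    current_percent: int, error_rate: float, max_error_rate: float = 0.01
) -> int:
    """Advance to the next step if healthy, otherwise roll back to 0."""
    if error_rate > max_error_rate:
        return 0
    for step in SHIFT_STEPS:
        if step > current_percent:
            return step
    return current_percent  # already at the final step


if __name__ == "__main__":
    percent = next_step(current_percent=10, error_rate=0.002)
    print(percent, region_weights(percent))
```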

Phase 4: Operational Readiness

Quarterly DR drills (mandatory)
Cross-region observability (New Relic dashboards per region)
Runbooks for manual failover scenarios

6. Trade-offs We Accepted

Higher Cost

Duplicate compute and database resources across regions. We estimated roughly 1.5–1.8x the single-region cost (not 2x, because the secondary region handles real traffic that would otherwise need scaling in the primary).

Increased Complexity

More moving parts = more potential failure modes. We mitigated this with:

Infrastructure as Code (Terraform) for consistency
Pinned image digests for deployment determinism
Regional health dashboards for visibility

Write Forwarding Latency

Writes from the secondary region incur cross-region latency (typically 30–100ms). For a payment gateway, this is acceptable — payment processing already involves multiple external API calls with higher latency.

7. What I'd Do Differently

Start DR drills immediately. Don't wait until the architecture is "complete." Drill with whatever you have — it exposes assumptions faster than any design review.
Define RTO/RPO with the business, not engineering. The business cares about revenue impact per minute of downtime, not technical recovery metrics. Translate RTO/RPO into dollars.
Plan for split-brain. What happens if both regions think they're primary? Aurora Global Database keeps a single writer cluster at the data layer, but application-level state (caches, queues) has no such guarantee and needs explicit handling.
Budget for observability. Multi-region is only as good as your visibility into it. Without per-region dashboards, you're flying blind.

8. When Active-Active Is Overkill

Not every service needs this. Active-Active makes sense when:

The service is revenue-critical (payment processing, checkout)
Downtime is measured in dollars-per-minute
You have the engineering capacity to maintain two regions

For internal tools, staging environments, or low-criticality services, Active-Passive (or even single-region with good backups) is the right call.

I designed this architecture for a fintech payment gateway processing transactions across Africa. If you're building multi-region systems for financial services, I'd love to exchange notes on the data consistency challenges.

DEV Community