Jeancy Joachim Mukaka

Posted on Jun 4

How a Single NAT Gateway Can Silently Kill Your AWS High Availability

#aws #terraform #sre #devops

A real-world lesson from a production-like AWS lab challenge

The Scenario That Should Scare You

Imagine this: your AWS environment has two Availability Zones, public and private subnets, an Application Load Balancer, and Auto Scaling. Your architecture diagram looks solid. Then AZ-A goes down. Your ALB shifts traffic to the healthy targets in AZ-B, and your EC2 instances there are running fine. But your application is still broken.

Why? Because every private subnet instance, including those in AZ-B, is routing outbound traffic through one NAT Gateway sitting in AZ-A, which is now unreachable. Any call to an external API, a package repository, or an AWS service endpoint that is not reached through a VPC endpoint now fails.

You didn't have a highly available architecture. You had the illusion of one.

Understanding the Problem: NAT Gateways Are Zonal

A NAT Gateway is not a regional resource. It lives in a specific Availability Zone.

When you create a NAT Gateway, you place it in a specific subnet, which belongs to a specific AZ. If that AZ goes down, your NAT Gateway goes down with it.

Many teams create a single NAT Gateway to save costs, then route all private subnet traffic across all AZs through that one gateway:

Private Subnet AZ-A → 0.0.0.0/0 → nat-09xxxxx (AZ-A) ✅
Private Subnet AZ-B → 0.0.0.0/0 → nat-09xxxxx (AZ-A) ❌

The private subnet in AZ-B is routing through a NAT Gateway in AZ-A. This is a cross-AZ dependency, and a silent Single Point of Failure.
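In Terraform, this anti-pattern is easy to write without noticing. The following is a minimal sketch, not taken from the lab: the names `aws_vpc.main`, `aws_subnet.public_a`, `aws_subnet.private_a`, and `aws_subnet.private_b` are hypothetical and assumed to be declared elsewhere.

```hcl
# Anti-pattern sketch: one NAT Gateway shared by private subnets in two AZs.
# The VPC and all subnets referenced here are hypothetical, declared elsewhere.
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "shared" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public_a.id # public subnet in AZ-A only
}

# A single private route table whose default route targets the AZ-A gateway
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.shared.id
  }
}

# Both AZs share that table, so AZ-B egress depends on AZ-A
resource "aws_route_table_association" "private_a" {
  subnet_id      = aws_subnet.private_a.id # private subnet in AZ-A
  route_table_id = aws_route_table.private.id
}

resource "aws_route_table_association" "private_b" {
  subnet_id      = aws_subnet.private_b.id # private subnet in AZ-B
  route_table_id = aws_route_table.private.id
}
```

Nothing in this configuration is invalid, which is exactly the problem: it plans and applies cleanly, and the dependency only shows up when AZ-A fails.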

What I Found in the Lab

The lab presented a VPC with this structure:

Resource	CIDR / Details
VPC	10.0.0.0/16
Public Subnet AZ-A	10.0.128.0/20
Public Subnet AZ-B	10.0.144.0/20
Private Subnet 1A (AZ-A)	10.0.0.0/19
Private Subnet 1B (AZ-A)	10.0.192.0/21
Private Subnet 2A (AZ-B)	10.0.32.0/19
Private Subnet 2B (AZ-B)	10.0.200.0/21

Two NAT Gateways existed: one in AZ-A, one in AZ-B. At first glance, this looked correct.

But when I inspected the Route Tables, the problem was immediately visible. All four private subnet Route Tables had the same entry:

Destination: 0.0.0.0/0 → Target: nat-09xxxxxxxx (AZ-A)

The NAT Gateway in AZ-B existed, but nobody was using it. It was provisioned but completely disconnected from the routing logic. The two private subnets in AZ-B were silently depending on the NAT Gateway in AZ-A for all outbound internet traffic.

Why This Happens

There are two common causes:

1. Cost-cutting gone wrong
Teams create one NAT Gateway to reduce costs, then forget that high availability requires one per AZ. A NAT Gateway costs approximately $0.045/hour in us-east-1 (other regions differ), plus per-GB data processing and data transfer charges. Running two instead of one adds roughly $32/month in hourly charges ($0.045 × 720 hours in a 30-day month ≈ $32.40), a small price compared to the cost of an outage.

2. Infrastructure drift
The architecture was correct at some point, then someone modified the Route Tables manually, or via a flawed IaC change, and the second NAT Gateway became orphaned without anyone noticing. No alerts, no errors, no warnings. Everything looks fine until AZ-A goes down.

This is what makes this particular SPOF so dangerous: it is completely invisible during normal operations.

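The Fix: One NAT Gateway Per AZ, One Route Table Per AZ

The solution is straightforward: each private subnet must route its outbound internet traffic through the NAT Gateway in its own Availability Zone. One Route Table per AZ is enough for that. The lab used one Route Table per private subnet, which works just as well, as long as each table targets the gateway in its own AZ.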

Correct routing after the fix:

Private Subnet 1A (AZ-A) → 0.0.0.0/0 → nat-AZ-A ✅
Private Subnet 1B (AZ-A) → 0.0.0.0/0 → nat-AZ-A ✅
Private Subnet 2A (AZ-B) → 0.0.0.0/0 → nat-AZ-B ✅
Private Subnet 2B (AZ-B) → 0.0.0.0/0 → nat-AZ-B ✅

Step 1 — Identify which NAT Gateway belongs to which AZ

Go to VPC → NAT Gateways, click each NAT Gateway, and check the Subnet field. The Availability Zone of that subnet is the AZ the NAT Gateway lives in.

Step 2 — Fix the Route Tables for AZ-B private subnets

1. Go to VPC → Route Tables.
2. Find the Route Table associated with Private Subnet 2A (AZ-B).
3. Click Edit routes.
4. Change the target of the 0.0.0.0/0 route from nat-AZ-A to nat-AZ-B.
5. Save changes.
6. Repeat for Private Subnet 2B (AZ-B).
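If you would rather make this change through Terraform than the console, one possible approach is to adopt the existing default route with an `import` block and then repoint it. This is a sketch under assumptions: it needs Terraform 1.5 or later, it assumes the Route Table is not already managed with inline `route` blocks, and every ID below is a placeholder rather than a value from the lab.

```hcl
# Sketch: adopt the existing 0.0.0.0/0 route of Private Subnet 2A's route table
# and repoint it at the AZ-B NAT Gateway. All IDs are placeholders.

import {
  to = aws_route.private_2a_default
  # Import ID format for aws_route: <route table ID>_<destination CIDR>
  id = "rtb-REPLACE_ME_0.0.0.0/0"
}

resource "aws_route" "private_2a_default" {
  route_table_id         = "rtb-REPLACE_ME"      # route table of Private Subnet 2A
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = "nat-REPLACE_ME_AZ_B" # NAT Gateway located in AZ-B
}
```

With real IDs in place, the plan should show the route being imported and its target changing from the AZ-A gateway to the AZ-B gateway. Private Subnet 2B would need the same pair of blocks.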

Step 3 — Verify

All four private subnet Route Tables should now point exclusively to the NAT Gateway in their own AZ. If AZ-A goes down, outbound internet traffic from AZ-B no longer depends on it.
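The same verification can be automated so that it runs on every plan. The sketch below uses a Terraform `check` block (Terraform 1.5 or later) and AWS provider data sources. The variable `private_subnet_ids` is a hypothetical input, and the sketch assumes every listed subnet has an explicitly associated Route Table whose 0.0.0.0/0 route targets a NAT Gateway; if either assumption does not hold, the data lookups fail instead of the check.

```hcl
# Sketch: warn when a private subnet's default route leaves its own AZ.
variable "private_subnet_ids" {
  type        = set(string)
  description = "IDs of the private subnets to audit (hypothetical input)"
}

# Each private subnet, looked up for its Availability Zone
data "aws_subnet" "private" {
  for_each = var.private_subnet_ids
  id       = each.value
}

# The route table explicitly associated with each private subnet
data "aws_route_table" "private" {
  for_each  = var.private_subnet_ids
  subnet_id = each.value
}

locals {
  # NAT Gateway ID behind each subnet's 0.0.0.0/0 route
  default_nat_id = {
    for subnet_id, rt in data.aws_route_table.private :
    subnet_id => one([
      for r in rt.routes : r.nat_gateway_id if r.cidr_block == "0.0.0.0/0"
    ])
  }
}

data "aws_nat_gateway" "default" {
  for_each = local.default_nat_id
  id       = each.value
}

# The public subnet hosting each NAT Gateway, looked up for its AZ
data "aws_subnet" "nat" {
  for_each = data.aws_nat_gateway.default
  id       = each.value.subnet_id
}

check "nat_routes_stay_in_az" {
  assert {
    condition = alltrue([
      for subnet_id, subnet in data.aws_subnet.private :
      subnet.availability_zone == data.aws_subnet.nat[subnet_id].availability_zone
    ])
    error_message = "A private subnet's default route targets a NAT Gateway in another AZ."
  }
}
```

A failed `check` is reported as a warning and does not block the apply, so treat it as an early signal rather than a hard gate.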

Getting It Right From the Start: Terraform

If you're provisioning your VPC with Infrastructure as Code, which you should be, here's how to enforce this pattern correctly with Terraform from day one.

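# NAT Gateway in AZ-A
# Note: `domain = "vpc"` is the AWS provider v5+ syntax (older versions use `vpc = true`)
resource "aws_eip" "nat_a" {
  domain = "vpc"
}

# The Terraform AWS provider docs recommend an explicit depends_on
# for the VPC's Internet Gateway on each NAT Gateway
resource "aws_nat_gateway" "nat_a" {
  allocation_id = aws_eip.nat_a.id
  subnet_id     = aws_subnet.public_a.id

  tags = {
    Name = "nat-gateway-az-a"
  }
}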

# NAT Gateway in AZ-B
resource "aws_eip" "nat_b" {
  domain = "vpc"
}

resource "aws_nat_gateway" "nat_b" {
  allocation_id = aws_eip.nat_b.id
  subnet_id     = aws_subnet.public_b.id

  tags = {
    Name = "nat-gateway-az-b"
  }
}

# Route Table — AZ-A private subnets
resource "aws_route_table" "private_a" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat_a.id
  }

  tags = { Name = "private-rt-az-a" }
}

# Route Table — AZ-B private subnets
resource "aws_route_table" "private_b" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat_b.id
  }

  tags = { Name = "private-rt-az-b" }
}

# Associations — AZ-A
resource "aws_route_table_association" "private_1a" {
  subnet_id      = aws_subnet.private_1a.id
  route_table_id = aws_route_table.private_a.id
}

resource "aws_route_table_association" "private_1b" {
  subnet_id      = aws_subnet.private_1b.id
  route_table_id = aws_route_table.private_a.id
}

# Associations — AZ-B
resource "aws_route_table_association" "private_2a" {
  subnet_id      = aws_subnet.private_2a.id
  route_table_id = aws_route_table.private_b.id
}

resource "aws_route_table_association" "private_2b" {
  subnet_id      = aws_subnet.private_2b.id
  route_table_id = aws_route_table.private_b.id
}

The beauty of this approach: the correct pattern is enforced by design. Each AZ has its own NAT Gateway, its own Route Table, and explicit associations. Drift does not become impossible, but it becomes visible: every intended change goes through code review, and the next terraform plan flags any manual edit to these routes.
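The same layout can also be declared once and repeated per AZ with `for_each`, which makes it harder to wire a subnet to the wrong gateway by copy-paste. This is a sketch of an alternative to the blocks above, not an addition to them. The locals `nat_azs` and `private_subnets` are hypothetical names, and the subnet and VPC resources are the ones the article's code already references. It uses standalone `aws_route` resources, which should not be combined with inline `route` blocks on the same Route Table.

```hcl
locals {
  # One entry per AZ: static label => public subnet that hosts the NAT Gateway
  nat_azs = {
    a = aws_subnet.public_a.id
    b = aws_subnet.public_b.id
  }

  # Each private subnet, tagged with the AZ label it belongs to
  private_subnets = {
    "1a" = { subnet_id = aws_subnet.private_1a.id, az = "a" }
    "1b" = { subnet_id = aws_subnet.private_1b.id, az = "a" }
    "2a" = { subnet_id = aws_subnet.private_2a.id, az = "b" }
    "2b" = { subnet_id = aws_subnet.private_2b.id, az = "b" }
  }
}

resource "aws_eip" "nat" {
  for_each = local.nat_azs
  domain   = "vpc"
}

resource "aws_nat_gateway" "nat" {
  for_each      = local.nat_azs
  allocation_id = aws_eip.nat[each.key].id
  subnet_id     = each.value

  tags = { Name = "nat-gateway-az-${each.key}" }
}

resource "aws_route_table" "private" {
  for_each = local.nat_azs
  vpc_id   = aws_vpc.main.id

  tags = { Name = "private-rt-az-${each.key}" }
}

# The default route of each AZ's table targets the NAT Gateway with the same key
resource "aws_route" "private_default" {
  for_each               = local.nat_azs
  route_table_id         = aws_route_table.private[each.key].id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.nat[each.key].id
}

# Each private subnet is associated with the table of its own AZ label
resource "aws_route_table_association" "private" {
  for_each       = local.private_subnets
  subnet_id      = each.value.subnet_id
  route_table_id = aws_route_table.private[each.value.az].id
}
```

Because the gateway and the Route Table are looked up by the same key, adding a third AZ means adding one entry to `nat_azs` and tagging its private subnets. The `az` labels in `private_subnets` are still written by hand, so they remain the one place where a mismatch can be introduced.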

The Broader Lesson: Designing for Failure

AWS high availability is built on one fundamental principle:

Assume everything will fail. Design so that the failure of any single component does not bring down the entire system.

A NAT Gateway is a component. An Availability Zone is a failure domain. When you route cross-AZ traffic through a single NAT Gateway, you create an invisible dependency that violates this principle, and the worst part is that everything looks fine until the moment it isn't.

The AWS Well-Architected Framework's Reliability Pillar specifically calls for eliminating Single Points of Failure. A shared NAT Gateway is a textbook SPOF, easy to miss precisely because the architecture looks correct at first glance.

Key Takeaways

A NAT Gateway is zonal: it belongs to one specific Availability Zone
Routing all private subnet traffic through a single NAT Gateway creates a hidden Single Point of Failure
The fix: one NAT Gateway per AZ, one Route Table per AZ, and every private subnet associated with the table of its own AZ
Use Terraform to enforce this pattern by design and to catch infrastructure drift early
The cost of two NAT Gateways (~$32/month extra) is nothing compared to the cost of an outage

This article is part of my AWS Solutions Architect Associate (SAA-C03) preparation series. I document real hands-on lab experiences, architecture challenges, and lessons learned along the way.

Follow along for more practical AWS architecture and Infrastructure as Code content.

DEV Community