Jeancy Joachim Mukaka

Posted on Jun 30 • Edited on Jul 1

When VPC Peering Looks Fine But Nothing Works: A 3-Day Debugging Story

#aws #networking #peering #infrastructure

A real-world lesson from a production-like AWS lab

Imagine this: two servers, two VPCs, a peering connection marked as Active, DNS enabled, routes in place. Your colleague tries to reach the PeerServer from the ApiServer. Timeout.

You check the peering connection. Active. You check the routes. Present. You check the Security Groups. Looks fine. Still timing out.
That was me, for 3 days, stuck on a single challenge while the other five were already solved.
This is the story of two misconfigurations that are easy to miss, and that most checklists forget to mention.

The Lab Scenario

The challenge was straightforward on paper.
Two servers. Two VPCs. One peering connection between them.

ApiServer lives inside ApiVPC (CIDR: 10.201.0.0/16)
PeerServer lives inside PeerVPC (CIDR: 10.202.0.0/16)
The two VPCs are connected via AWS VPC Peering

The requirement: both servers must communicate with each other over private DNS, using all ports. And any other server launched in the same subnet as the ApiServer must have the same level of access automatically. One warning was explicit: "Make sure the relevant CIDR range is restricted as much as possible." Simple enough. Except it wasn't.

When my colleague attempted to reach the PeerServer from within the ApiServer, the response was always the same: timeout.

Day 1: Flying Solo

My first instinct was to follow the classic VPC Peering troubleshooting checklist: peering status, route tables.
The peering connection was Active. No issue there.
The route tables looked broken at first: only local routes, nothing pointing to the peering connection. But I couldn't edit them; the lab didn't allow it. Digging further, I found 6 route tables across both VPCs, not just the two main ones I had initially seen. Two of them already had the correct routes in place.
The routing was fine all along. I had just spent a day looking at the wrong tables.

End of Day 1: still timing out.

Day 2: Even AI Couldn't Find It

On Day 2, I brought in AI assistants to speed things up. The suggestions were consistent: peering status, DNS resolution, Security Group rules.

I worked through all of it. DNS resolution enabled on both sides, Requester and Accepter. Security Groups verified and restricted to the right CIDR.
Still timing out. Every suggestion felt right. None of them mentioned one entire layer of AWS networking.

(See the kind of checklist I was working with below)

End of Day 2: DNS enabled, SGs adjusted, routes confirmed. Still timing out.

Day 3: The Two Real Culprits

On Day 3, I changed my approach. Instead of applying suggestions, I decided to go through every single networking layer systematically, one by one, and verify each one with my own eyes before moving to the next.
That's when the two real problems revealed themselves.

Culprit #1 — DNS Resolution Was Disabled

Yes, I had been told to check DNS on Day 2. But what I hadn't fully verified was the exact state of both sides of the peering connection.
In VPC Peering, DNS resolution must be explicitly enabled on both sides independently:

Allow accepter VPC to resolve DNS of hosts in requester VPC → Enabled ✅
Allow requester VPC to resolve DNS of hosts in accepter VPC → Enabled ✅

Once both were confirmed active, private hostnames could finally resolve to private IP addresses across the peering connection. Without this, even with perfect routing and open Security Groups, the servers simply couldn't find each other by name.

This was the first fix.

Culprit #2 — The NACL Nobody Mentioned

This is where it gets interesting.

After confirming DNS, I went deeper and looked at something that had never appeared in any checklist I had received over two days: Network ACLs.
The PeerServer's subnet was associated with a NACL called PrivateACL2. When I opened its inbound rules, this is what I found:

Rule	Type	Protocol	Port Range	Source	Allow/Deny
*	All traffic	All	All	0.0.0.0/0	❌ Deny

One single rule. A catch-all Deny. Zero Allow rules.

Every single packet arriving at the PeerServer's subnet from the ApiServer was being silently dropped at the NACL level, before it could even reach the instance or the Security Group.

This is the critical difference between NACLs and Security Groups that is easy to forget:

Security Groups are stateful → if outbound is allowed, the return traffic is automatically allowed
NACLs are stateless → every direction must be explicitly allowed, inbound AND outbound, independently
NACLs apply to the entire subnet → every server launched in that subnet is automatically subject to the same rules, without needing to touch individual instances

That last point was actually the key to satisfying the challenge requirement: "any other server launched in the same subnet must have the same level of access automatically." A Security Group change on one instance would never achieve that. A NACL rule would.

The fix: I added one inbound rule to PrivateACL2:

Rule	Type	Protocol	Port Range	Source	Allow/Deny
100	All traffic	All	All	10.201.0.0/16	✅ Allow

The source is restricted to exactly 10.201.0.0/16 (the ApiVPC CIDR) and nothing else, which respects the warning about keeping CIDR ranges as restricted as possible.

Challenge validated. ✅

The Key Lesson: Always Check the Full Stack

Three days. Two misconfigurations. One layer that nobody mentioned.
Looking back, the debugging process taught me something more valuable than the fix itself: in AWS networking, a timeout doesn't tell you where the problem is. It only tells you that something, somewhere in the stack, is blocking traffic.
And that stack has more layers than most checklists cover.

Why NACLs Are Always Forgotten

Security Groups get all the attention. They are instance-level, they are stateful, they are the first thing everyone checks. And because they handle return traffic automatically, they feel complete.
NACLs are different. They are subnet-level, stateless, and silent. They don't send back an error. They just drop the packet. Which is exactly why a NACL misconfiguration produces a timeout, not a rejection message.
And because they sit at the subnet level, they are invisible when you are focused on individual instances.

The Complete VPC Peering Troubleshooting Checklist

Next time you face a VPC Peering connectivity issue, go through this list in order:

1. Peering Connection

Status is Active
The two VPCs' CIDR blocks do not overlap (peering works across regions and accounts, but not between overlapping CIDRs)

2. DNS Resolution

Enabled on the Requester VPC side
Enabled on the Accepter VPC side
Both must be explicitly enabled independently

3. Route Tables

Subnet of Server A has a route to VPC-B CIDR via the peering connection
Subnet of Server B has a route to VPC-A CIDR via the peering connection
Check all route tables, not just the Main one

4. Network ACLs ← the one everyone forgets

Inbound rules on Server A's subnet allow traffic from VPC-B CIDR
Outbound rules on Server A's subnet allow traffic to VPC-B CIDR
Inbound rules on Server B's subnet allow traffic from VPC-A CIDR
Outbound rules on Server B's subnet allow traffic to VPC-A CIDR
Always use the specific VPC CIDR, never 0.0.0.0/0

5. Security Groups

Server B's SG allows inbound traffic from VPC-A CIDR on required ports
Server A's SG allows outbound traffic to VPC-B CIDR
Restrict CIDR ranges as much as possible
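
Items 3 and 5 of this checklist can also be written down as Terraform. What follows is a minimal sketch, not the lab's actual configuration. The route table names (api_private, peer_private), the security group name (peer_server), and var.peer_vpc_cidr are hypothetical placeholders. aws_vpc_peering_connection.api_to_peer and var.api_vpc_cidr reuse the names from the Terraform snippets in this article.

```hcl
# Sketch only. Hypothetical names: aws_route_table.api_private,
# aws_route_table.peer_private, aws_security_group.peer_server, var.peer_vpc_cidr.

# Item 3: the route table associated with the ApiServer's subnet
# sends PeerVPC-bound traffic through the peering connection.
resource "aws_route" "api_to_peer" {
  route_table_id            = aws_route_table.api_private.id
  destination_cidr_block    = var.peer_vpc_cidr # 10.202.0.0/16 in the lab
  vpc_peering_connection_id = aws_vpc_peering_connection.api_to_peer.id
}

# Item 3, other direction: the PeerServer's subnet needs the return route.
resource "aws_route" "peer_to_api" {
  route_table_id            = aws_route_table.peer_private.id
  destination_cidr_block    = var.api_vpc_cidr # 10.201.0.0/16 in the lab
  vpc_peering_connection_id = aws_vpc_peering_connection.api_to_peer.id
}

# Item 5: the PeerServer's Security Group accepts all traffic from ApiVPC only.
resource "aws_security_group_rule" "peer_allow_from_api_vpc" {
  type              = "ingress"
  security_group_id = aws_security_group.peer_server.id
  protocol          = "-1" # all protocols
  from_port         = 0
  to_port           = 0
  cidr_blocks       = [var.api_vpc_cidr]
}
```

The routes are attached to the specific route tables associated with each server's subnet, which is the detail that cost me Day 1: a correct route in a table that is not associated with the subnet does nothing.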

The Subnet-Level Requirement

One last thing worth highlighting. The challenge required that any server launched in the same subnet as the ApiServer automatically inherits the same level of access.

This is precisely why the NACL was the right tool here, not the Security Group. A Security Group is attached per instance. A NACL covers the entire subnet, and the rule I added matches on a source CIDR rather than on a specific instance. Any new server launched in the ApiServer's subnet falls inside that CIDR, so it gets the same access with zero additional configuration.

If you solve a connectivity requirement at the Security Group level only, you have to make sure every new instance gets the right Security Group attached. The NACL approach enforces it by design.
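
To make the subnet-level scope concrete, here is a minimal Terraform sketch of how a NACL such as PrivateACL2 is bound to a subnet. It is an illustration under assumptions, not the lab's real definition: aws_subnet.peer_private is a hypothetical name for the PeerServer's subnet, and aws_vpc.peer_vpc reuses the name from the peering snippet in this article.

```hcl
# Sketch only: aws_subnet.peer_private is a hypothetical resource name.
resource "aws_network_acl" "private_acl_2" {
  vpc_id = aws_vpc.peer_vpc.id

  # The association is with the subnet, not with any instance, so every
  # instance launched in this subnet is filtered by the same rules.
  subnet_ids = [aws_subnet.peer_private.id]

  tags = {
    Name = "PrivateACL2"
  }
}
```

No instance appears anywhere in this definition, which is the point. The rules themselves are best kept in separate aws_network_acl_rule resources rather than mixed with inline ingress or egress blocks on the same NACL.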

Codify It So It Never Happens Again

This entire debugging story raises an obvious question: why was any of this discoverable only by clicking through the console for three days?
The answer is that both misconfigurations (DNS resolution disabled, NACL missing an Allow rule) are exactly the kind of settings that get silently skipped during manual setup, and silently missed during manual review. If this infrastructure had been defined in Terraform from the start, both issues would have been visible in a pull request, not buried three clicks deep in the console.

1. Force DNS resolution at the peering connection level

resource "aws_vpc_peering_connection" "api_to_peer" {
  vpc_id      = aws_vpc.api_vpc.id
  peer_vpc_id = aws_vpc.peer_vpc.id
  auto_accept = true # only valid when both VPCs are in the same account and region

  tags = {
    Name = "api-to-peer"
  }
}

resource "aws_vpc_peering_connection_options" "api_to_peer_options" {
  vpc_peering_connection_id = aws_vpc_peering_connection.api_to_peer.id

  requester {
    allow_remote_vpc_dns_resolution = true
  }

  accepter {
    allow_remote_vpc_dns_resolution = true
  }
}

With this in code, DNS resolution on both sides is no longer an optional checkbox someone might forget to tick in the console. It's an explicit, reviewable, enforced setting. If a teammate ever tries to remove it, the change shows up in a diff.
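
One related prerequisite is worth making explicit: for private hostnames to resolve across the peering connection, DNS support and DNS hostnames also need to be enabled on the VPCs themselves. The article does not show how the lab's VPCs were defined, so the following is only a sketch of what aws_vpc.api_vpc and aws_vpc.peer_vpc might look like, using the CIDRs from the lab scenario. The tag values are assumptions.

```hcl
# Sketch only: the lab's real VPC definitions are not shown in this article.
resource "aws_vpc" "api_vpc" {
  cidr_block           = "10.201.0.0/16"
  enable_dns_support   = true # VPC-provided DNS resolution
  enable_dns_hostnames = true # instances receive DNS hostnames

  tags = {
    Name = "ApiVPC"
  }
}

resource "aws_vpc" "peer_vpc" {
  cidr_block           = "10.202.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = {
    Name = "PeerVPC"
  }
}
```

Keeping these two attributes next to the peering options puts every DNS-related setting in one reviewable place.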

2. Make NACL rules explicit, not implicit

# Inbound: allow all traffic from ApiVPC into the PeerServer's subnet.
resource "aws_network_acl_rule" "allow_inbound_from_api_vpc" {
  network_acl_id = aws_network_acl.private_acl_2.id
  rule_number    = 100
  egress         = false
  protocol       = "-1" # all protocols
  rule_action    = "allow"
  cidr_block     = var.api_vpc_cidr # 10.201.0.0/16
  from_port      = 0
  to_port        = 0
}

# Outbound: NACLs are stateless, so traffic back to ApiVPC needs its own rule.
resource "aws_network_acl_rule" "allow_outbound_to_api_vpc" {
  network_acl_id = aws_network_acl.private_acl_2.id
  rule_number    = 100 # inbound and outbound rules are numbered separately
  egress         = true
  protocol       = "-1"
  rule_action    = "allow"
  cidr_block     = var.api_vpc_cidr
  from_port      = 0
  to_port        = 0
}

Notice the CIDR is a variable, not a hardcoded value and definitely not 0.0.0.0/0. This keeps the "restrict the CIDR range as much as possible" requirement enforced by design, not by memory.
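
The snippets above reference var.api_vpc_cidr without declaring it. One possible declaration is sketched below, with a validation block that rejects a catch-all range. The default value comes from the lab scenario. The description text and the exact validation rule are my own assumptions, not part of the original lab.

```hcl
# Sketch only: one way to declare the variable used by the NACL rules.
variable "api_vpc_cidr" {
  description = "CIDR block of ApiVPC, the only range allowed through PrivateACL2"
  type        = string
  default     = "10.201.0.0/16"

  validation {
    # cidrhost() fails on malformed input, so can() turns that into false.
    condition     = can(cidrhost(var.api_vpc_cidr, 0)) && var.api_vpc_cidr != "0.0.0.0/0"
    error_message = "api_vpc_cidr must be a valid, specific CIDR block, not 0.0.0.0/0."
  }
}
```

With a check like this, a plan that sets the variable to 0.0.0.0/0 fails before anything is applied, rather than relying on a reviewer to spot it.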

3. Catch drift before it becomes a 3-day debugging session

The real value of this approach isn't the code itself; it's what it prevents. With these resources managed in Terraform, a terraform plan run in CI on every pull request would have shown a missing NACL rule or a disabled DNS option as a visible diff, instead of a silent timeout discovered days later in production or in a lab.

NAT Gateways, NACLs, peering DNS options: these are exactly the settings that survive for months unnoticed because nobody is actively looking at them. Infrastructure as Code doesn't just make deployments repeatable. It makes the invisible parts of your network visible again.

This article is part of my AWS Solutions Architect Associate (SAA-C03) preparation series. I document real hands-on lab experiences, networking challenges, and lessons learned along the way.

Follow along for more practical AWS architecture and networking content.
