DEV Community: Jeancy Joachim Mukaka

When VPC Peering Looks Fine But Nothing Works: A 3-Day Debugging Story

Jeancy Joachim Mukaka — Tue, 30 Jun 2026 00:47:22 +0000

A real-world lesson from a production-like AWS lab

Imagine this: two servers, two VPCs, a peering connection marked as Active, DNS enabled, routes in place. Your colleague tries to reach the PeerServer from the ApiServer. Timeout.

You check the peering connection. Active. You check the routes. Present. You check the Security Groups. Looks fine. Still timing out.
That was me, for 3 days, stuck on a single challenge while the other five were already solved.
This is the story of two misconfigurations that are easy to miss, and that most checklists forget to mention.

The Lab Scenario

The challenge was straightforward on paper.
Two servers. Two VPCs. One peering connection between them.

ApiServer lives inside ApiVPC (CIDR: 10.201.0.0/16)
PeerServer lives inside PeerVPC (CIDR: 10.202.0.0/16)
The two VPCs are connected via AWS VPC Peering

The requirement: both servers must communicate with each other over private DNS, using all ports. And any other server launched in the same subnet as the ApiServer must have the same level of access automatically. One warning was explicit: "Make sure the relevant CIDR range is restricted as much as possible." Simple enough. Except it wasn't.

When my colleague attempted to reach the PeerServer from within the ApiServer, the response was always the same: timeout.

Day 1: Flying Solo

My first instinct was to follow the classic VPC Peering troubleshooting checklist: peering status, route tables.
The peering connection was Active. No issue there.
The route tables looked broken at first: only local routes, nothing pointing to the peering connection. But I couldn't edit them; the lab didn't allow it. Digging further, I found 6 route tables across both VPCs, not just the two main ones I had initially seen. Two of them already had the correct routes in place.
The routing was fine all along. I had just spent a day looking at the wrong tables.

End of Day 1: still timing out.

Day 2: Even AI Couldn't Find It

On Day 2, I brought in AI assistants to speed things up. The suggestions were consistent: peering status, DNS resolution, Security Group rules.

I worked through all of it. DNS resolution enabled on both sides, Requester and Accepter. Security Groups verified and restricted to the right CIDR.
Still timing out. Every suggestion felt right. None of them mentioned one entire layer of AWS networking.

(See the kind of checklist I was working with below)

End of Day 2: DNS enabled, SGs adjusted, routes confirmed. Still timing out.

Day 3: The Two Real Culprits

On Day 3, I changed my approach. Instead of applying suggestions, I decided to go through every single networking layer systematically, one by one, and verify each one with my own eyes before moving to the next.
That's when the two real problems revealed themselves.

Culprit #1 — DNS Resolution Was Disabled

Yes, I had been told to check DNS on Day 2. But what I hadn't fully verified was the exact state of both sides of the peering connection.
In VPC Peering, DNS resolution must be explicitly enabled on both sides independently:

Allow accepter VPC to resolve DNS of hosts in requester VPC → Enabled ✅
Allow requester VPC to resolve DNS of hosts in accepter VPC → Enabled ✅

Once both were confirmed active, private hostnames could finally resolve to private IP addresses across the peering connection. Without this, even with perfect routing and open Security Groups, the servers simply couldn't find each other by name.

This was the first fix.

Culprit #2 — The NACL Nobody Mentioned

This is where it gets interesting.

After confirming DNS, I went deeper and looked at something that had never appeared in any checklist I had received over two days: Network ACLs.
The PeerServer's subnet was associated with a NACL called PrivateACL2. When I opened its inbound rules, this is what I found:

Rule	Type	Protocol	Port Range	Source	Allow/Deny
*	All traffic	All	All	0.0.0.0/0	❌ Deny

One single rule. A catch-all Deny. Zero Allow rules.

Every single packet arriving at the PeerServer's subnet from the ApiServer was being silently dropped at the NACL level, before it could even reach the instance or the Security Group.

This is the critical difference between NACLs and Security Groups that is easy to forget:

Security Groups are stateful → if outbound is allowed, the return traffic is automatically allowed
NACLs are stateless → every direction must be explicitly allowed, inbound AND outbound, independently
NACLs apply to the entire subnet → every server launched in that subnet is automatically subject to the same rules, without needing to touch individual instances

That last point was actually the key to satisfying the challenge requirement: "any other server launched in the same subnet must have the same level of access automatically." A Security Group change on one instance would never achieve that. A NACL rule would.

The fix: I added one inbound rule to PrivateACL2:

Rule	Type	Protocol	Port Range	Source	Allow/Deny
100	All traffic	All	All	10.201.0.0/16	✅ Allow

Source restricted to exactly 10.201.0.0/16 — the ApiVPC CIDR — and nothing else. Respecting the warning about keeping CIDR ranges as restricted as possible.

Challenge validated. ✅

The Key Lesson: Always Check the Full Stack

Three days. Two misconfigurations. One layer that nobody mentioned.
Looking back, the debugging process taught me something more valuable than the fix itself: in AWS networking, a timeout doesn't tell you where the problem is. It only tells you that something, somewhere in the stack, is blocking traffic.
And that stack has more layers than most checklists cover.

Why NACLs Are Always Forgotten

Security Groups get all the attention. They are instance-level, they are stateful, they are the first thing everyone checks. And because they handle return traffic automatically, they feel complete.
NACLs are different. They are subnet-level, stateless, and silent. They don't send back an error. They just drop the packet. Which is exactly why a NACL misconfiguration produces a timeout, not a rejection message.
And because they sit at the subnet level, they are invisible when you are focused on individual instances.

The Complete VPC Peering Troubleshooting Checklist

Next time you face a VPC Peering connectivity issue, go through this list in order:

1. Peering Connection

Status is Active
Both VPCs are in compatible regions and accounts

2. DNS Resolution

Enabled on the Requester VPC side
Enabled on the Accepter VPC side
Both must be explicitly enabled independently

3. Route Tables

Subnet of Server A has a route to VPC-B CIDR via the peering connection
Subnet of Server B has a route to VPC-A CIDR via the peering connection
Check all route tables, not just the Main one

4. Network ACLs ← the one everyone forgets

Inbound rules on Server A's subnet allow traffic from VPC-B CIDR
Outbound rules on Server A's subnet allow traffic to VPC-B CIDR
Inbound rules on Server B's subnet allow traffic from VPC-A CIDR
Outbound rules on Server B's subnet allow traffic to VPC-A CIDR
Always use the specific VPC CIDR, never 0.0.0.0/0

5. Security Groups

Server B's SG allows inbound traffic from VPC-A CIDR on required ports
Server A's SG allows outbound traffic to VPC-B CIDR
Restrict CIDR ranges as much as possible
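
Items 3 and 5 of the checklist can also be written down as Terraform. This is a minimal sketch, not the lab's actual configuration: the route table names (api_private, peer_private) and the security group name (peer_server) are hypothetical, and it assumes a peering connection resource named api_to_peer between the two lab VPCs.

```hcl
# Item 3: each server's subnet route table needs a route to the other VPC.
resource "aws_route" "api_to_peer" {
  route_table_id            = aws_route_table.api_private.id # hypothetical name
  destination_cidr_block    = "10.202.0.0/16"                # PeerVPC CIDR
  vpc_peering_connection_id = aws_vpc_peering_connection.api_to_peer.id
}

resource "aws_route" "peer_to_api" {
  route_table_id            = aws_route_table.peer_private.id # hypothetical name
  destination_cidr_block    = "10.201.0.0/16"                 # ApiVPC CIDR
  vpc_peering_connection_id = aws_vpc_peering_connection.api_to_peer.id
}

# Item 5: the PeerServer's Security Group accepts traffic from ApiVPC only.
resource "aws_vpc_security_group_ingress_rule" "peer_from_api_vpc" {
  security_group_id = aws_security_group.peer_server.id # hypothetical name
  description       = "All traffic from ApiVPC only"
  cidr_ipv4         = "10.201.0.0/16"
  ip_protocol       = "-1" # all protocols and ports
}
```

Because Security Groups are stateful, the sketch needs only an ingress rule on the receiving side for return traffic to flow; the NACL layer (item 4) still needs rules in both directions.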

The Subnet-Level Requirement

One last thing worth highlighting. The challenge required that any server launched in the same subnet as the ApiServer automatically inherits the same level of access.

This is precisely why the NACL was the right tool here, not the Security Group. A Security Group is attached per instance. A NACL covers the entire subnet. Any new server launched in that subnet automatically inherits the NACL rules, with zero additional configuration.

If you solve a connectivity requirement at the Security Group level only, you will need to manually replicate that configuration for every new instance. The NACL approach enforces it by design.

Codify It So It Never Happens Again

This entire debugging story raises an obvious question: why was any of this discoverable only by clicking through the console for three days?
The answer is that both misconfigurations (DNS resolution disabled, NACL missing an Allow rule) are exactly the kind of settings that get silently skipped during manual setup and silently missed during manual review. If this infrastructure had been defined in Terraform from the start, both issues would have been visible in a pull request, not buried three clicks deep in the console.

1. Force DNS resolution at the peering connection level

resource "aws_vpc_peering_connection" "api_to_peer" {
  vpc_id      = aws_vpc.api_vpc.id
  peer_vpc_id = aws_vpc.peer_vpc.id
  auto_accept = true

  tags = {
    Name = "api-to-peer"
  }
}

resource "aws_vpc_peering_connection_options" "api_to_peer_options" {
  vpc_peering_connection_id = aws_vpc_peering_connection.api_to_peer.id

  requester {
    allow_remote_vpc_dns_resolution = true
  }

  accepter {
    allow_remote_vpc_dns_resolution = true
  }
}

With this in code, DNS resolution on both sides is no longer an optional checkbox someone might forget to tick in the console. It's an explicit, reviewable, enforced setting. If a teammate ever tries to remove it, the change shows up in a diff.
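
One related setting is worth pinning down in the same place. As far as I understand the AWS documentation, the peering DNS options only take effect when both VPCs have DNS resolution and DNS hostnames enabled. A minimal sketch of the two VPC resources referenced above, assuming the lab CIDRs; the tags are illustrative:

```hcl
resource "aws_vpc" "api_vpc" {
  cidr_block           = "10.201.0.0/16"
  enable_dns_support   = true # Amazon-provided DNS resolver is available in the VPC
  enable_dns_hostnames = true # instances receive private DNS hostnames

  tags = { Name = "ApiVPC" }
}

resource "aws_vpc" "peer_vpc" {
  cidr_block           = "10.202.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = { Name = "PeerVPC" }
}
```

Keeping these attributes next to the peering options means a reviewer sees every DNS-related switch in one diff.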

2. Make NACL rules explicit, not implicit

resource "aws_network_acl_rule" "allow_inbound_from_api_vpc" {
  network_acl_id = aws_network_acl.private_acl_2.id
  rule_number    = 100
  egress         = false            # inbound rule
  protocol       = "-1"             # all protocols
  rule_action    = "allow"
  cidr_block     = var.api_vpc_cidr # 10.201.0.0/16
  from_port      = 0
  to_port        = 0
}

# NACLs are stateless, so return traffic needs its own outbound rule
resource "aws_network_acl_rule" "allow_outbound_to_api_vpc" {
  network_acl_id = aws_network_acl.private_acl_2.id
  rule_number    = 100
  egress         = true             # outbound rule
  protocol       = "-1"             # all protocols
  rule_action    = "allow"
  cidr_block     = var.api_vpc_cidr
  from_port      = 0
  to_port        = 0
}

Notice the CIDR is a variable, not a hardcoded value and definitely not 0.0.0.0/0. This keeps the "restrict the CIDR range as much as possible" requirement enforced by design, not by memory.
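
The variable itself can carry the restriction. The sketch below is one possible declaration, not the lab's code: the default value and the validation rule are my assumptions about what "restricted as much as possible" should mean here.

```hcl
variable "api_vpc_cidr" {
  description = "CIDR of ApiVPC, the only source allowed through PrivateACL2"
  type        = string
  default     = "10.201.0.0/16"

  validation {
    # cidrhost() fails on malformed input, so can() doubles as a syntax check
    condition     = can(cidrhost(var.api_vpc_cidr, 0)) && var.api_vpc_cidr != "0.0.0.0/0"
    error_message = "The api_vpc_cidr value must be a valid IPv4 CIDR block and must not be 0.0.0.0/0."
  }
}
```

With this in place, a plan that tries to open the NACL to the whole internet fails before anything is applied.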

3. Catch drift before it becomes a 3-day debugging session

The real value of this approach isn't the code itself, it's what it prevents. A terraform plan run in CI on every pull request would have flagged a missing NACL rule or a disabled DNS option immediately, as a visible diff, instead of a silent timeout discovered days later in production or in a lab.

NAT Gateways, NACLs, peering DNS options: these are exactly the settings that survive for months unnoticed because nobody is actively looking at them. Infrastructure as Code doesn't just make deployments repeatable. It makes the invisible parts of your network visible again.

This article is part of my AWS Solutions Architect Associate (SAA-C03) preparation series. I document real hands-on lab experiences, networking challenges, and lessons learned along the way.

Follow along for more practical AWS architecture and networking content.

How a Single NAT Gateway Can Silently Kill Your AWS High Availability

Jeancy Joachim Mukaka — Thu, 04 Jun 2026 15:47:31 +0000

A real-world lesson from a production-like AWS lab challenge

The Scenario That Should Scare You

Imagine this: your AWS environment has two Availability Zones, public and private subnets, an Application Load Balancer, Auto Scaling. Your architecture diagram looks solid. Then one Availability Zone goes down, your ALB fails over instantly, your EC2 instances in AZ-B are running fine. But your application is still broken.

Because every private subnet instance, including those in AZ-B, is routing outbound traffic through one NAT Gateway sitting in AZ-A. Which is now unreachable.

You didn't have a highly available architecture. You had the illusion of one.

Understanding the Problem: NAT Gateways Are Zonal

A NAT Gateway is not a regional resource. It lives in a specific Availability Zone.

When you create a NAT Gateway, you place it in a specific subnet, which belongs to a specific AZ. If that AZ goes down, your NAT Gateway goes down with it.

Many teams create a single NAT Gateway to save costs, then route all private subnet traffic across all AZs through that one gateway:

Private Subnet AZ-A → 0.0.0.0/0 → nat-09xxxxx (AZ-A) ✅
Private Subnet AZ-B → 0.0.0.0/0 → nat-09xxxxx (AZ-A) ❌

The private subnet in AZ-B is routing through a NAT Gateway in AZ-A. This is a cross-AZ dependency, and a silent Single Point of Failure.

What I Found in the Lab

The lab presented a VPC with this structure:

Resource	CIDR / Details
VPC	10.0.0.0/16
Public Subnet AZ-A	10.0.128.0/20
Public Subnet AZ-B	10.0.144.0/20
Private Subnet 1A (AZ-A)	10.0.0.0/19
Private Subnet 1B (AZ-A)	10.0.192.0/21
Private Subnet 2A (AZ-B)	10.0.32.0/19
Private Subnet 2B (AZ-B)	10.0.200.0/21

Two NAT Gateways existed: one in AZ-A, one in AZ-B. At first glance, this looked correct.

But when I inspected the Route Tables, the problem was immediately visible. All four private subnet Route Tables had the same entry:

Destination: 0.0.0.0/0 → Target: nat-09xxxxxxxx (AZ-A)

The NAT Gateway in AZ-B existed, but nobody was using it. It was provisioned but completely disconnected from the routing logic. The two private subnets in AZ-B were silently depending on the NAT Gateway in AZ-A for all outbound internet traffic.

Why This Happens

There are two common causes:

1. Cost-cutting gone wrong
Teams create one NAT Gateway to reduce costs, then forget that high availability requires one per AZ. A NAT Gateway costs approximately $0.045/hour plus data transfer charges. Running two instead of one adds roughly $32/month in hourly charges ($0.045 × about 720 hours), a small price compared to the cost of an outage.

2. Infrastructure drift
The architecture was correct at some point, then someone modified the Route Tables manually, or via a flawed IaC change, and the second NAT Gateway became orphaned without anyone noticing. No alerts, no errors, no warnings. Everything looks fine until AZ-A goes down.

This is what makes this particular SPOF so dangerous: it is completely invisible during normal operations.

The Fix: One NAT Gateway Per AZ, One Route Table Per Private Subnet

The solution is straightforward: each private subnet must route its outbound internet traffic through the NAT Gateway in its own Availability Zone.

Correct routing after the fix:

Private Subnet 1A (AZ-A) → 0.0.0.0/0 → nat-AZ-A ✅
Private Subnet 1B (AZ-A) → 0.0.0.0/0 → nat-AZ-A ✅
Private Subnet 2A (AZ-B) → 0.0.0.0/0 → nat-AZ-B ✅
Private Subnet 2B (AZ-B) → 0.0.0.0/0 → nat-AZ-B ✅

Step 1 — Identify which NAT Gateway belongs to which AZ

Go to VPC → NAT Gateways, click each NAT Gateway, and check the Subnet field. The subnet tells you which AZ the gateway belongs to.

Step 2 — Fix the Route Tables for AZ-B private subnets

Go to VPC → Route Tables
Find the Route Table associated with Private Subnet 2A (AZ-B)
Click Edit routes
Change 0.0.0.0/0 from nat-AZ-A → nat-AZ-B
Save changes
Repeat for Private Subnet 2B (AZ-B)

Step 3 — Verify

All four private subnet Route Tables should now point exclusively to the NAT Gateway in their own AZ. If AZ-A goes down, AZ-B is completely self-sufficient.

Getting It Right From the Start: Terraform

If you're provisioning your VPC with Infrastructure as Code, which you should be, here's how to enforce this pattern correctly with Terraform from day one.

# NAT Gateway in AZ-A
resource "aws_eip" "nat_a" {
  domain = "vpc"
}

resource "aws_nat_gateway" "nat_a" {
  allocation_id = aws_eip.nat_a.id
  subnet_id     = aws_subnet.public_a.id

  tags = {
    Name = "nat-gateway-az-a"
  }
}

# NAT Gateway in AZ-B
resource "aws_eip" "nat_b" {
  domain = "vpc"
}

resource "aws_nat_gateway" "nat_b" {
  allocation_id = aws_eip.nat_b.id
  subnet_id     = aws_subnet.public_b.id

  tags = {
    Name = "nat-gateway-az-b"
  }
}

# Route Table — AZ-A private subnets
resource "aws_route_table" "private_a" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat_a.id
  }

  tags = { Name = "private-rt-az-a" }
}

# Route Table — AZ-B private subnets
resource "aws_route_table" "private_b" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat_b.id
  }

  tags = { Name = "private-rt-az-b" }
}

# Associations — AZ-A
resource "aws_route_table_association" "private_1a" {
  subnet_id      = aws_subnet.private_1a.id
  route_table_id = aws_route_table.private_a.id
}

resource "aws_route_table_association" "private_1b" {
  subnet_id      = aws_subnet.private_1b.id
  route_table_id = aws_route_table.private_a.id
}

# Associations — AZ-B
resource "aws_route_table_association" "private_2a" {
  subnet_id      = aws_subnet.private_2a.id
  route_table_id = aws_route_table.private_b.id
}

resource "aws_route_table_association" "private_2b" {
  subnet_id      = aws_subnet.private_2b.id
  route_table_id = aws_route_table.private_b.id
}

The beauty of this approach: the correct pattern is enforced by design. Each AZ has its own NAT Gateway, its own Route Table, and explicit associations. Drift becomes much harder to introduce unnoticed, because any change goes through code review and any manual console edit shows up as a diff in the next terraform plan.
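
If the VPC later grows to a third AZ, the same pattern can be generated instead of copied. The following is a sketch under assumptions: the three input variables are hypothetical and not part of the lab, and the subnet IDs are expected to be known values (for example passed in as strings), since for_each keys must be known at plan time.

```hcl
# Hypothetical inputs, not part of the lab
variable "vpc_id" {
  type = string
}

variable "public_subnet_by_az" {
  description = "AZ label => ID of the public subnet hosting that AZ's NAT Gateway"
  type        = map(string)
}

variable "private_subnet_az" {
  description = "Private subnet ID => AZ label (each label must exist in public_subnet_by_az)"
  type        = map(string)
}

# One Elastic IP and one NAT Gateway per AZ
resource "aws_eip" "nat" {
  for_each = var.public_subnet_by_az
  domain   = "vpc"
}

resource "aws_nat_gateway" "nat" {
  for_each      = var.public_subnet_by_az
  allocation_id = aws_eip.nat[each.key].id
  subnet_id     = each.value

  tags = { Name = "nat-gateway-${each.key}" }
}

# One private route table per AZ, pointing at the NAT Gateway of the same AZ
resource "aws_route_table" "private" {
  for_each = var.public_subnet_by_az
  vpc_id   = var.vpc_id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat[each.key].id
  }

  tags = { Name = "private-rt-${each.key}" }
}

# Each private subnet is attached to the route table of its own AZ
resource "aws_route_table_association" "private" {
  for_each       = var.private_subnet_az
  subnet_id      = each.key
  route_table_id = aws_route_table.private[each.value].id
}
```

The design choice here is that a private subnet can only reach a NAT Gateway through its AZ label, so a cross-AZ route cannot be expressed by accident.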

The Broader Lesson: Designing for Failure

AWS high availability is built on one fundamental principle:

Assume everything will fail. Design so that the failure of any single component does not bring down the entire system.

A NAT Gateway is a component. An Availability Zone is a failure domain. When you route cross-AZ traffic through a single NAT Gateway, you create an invisible dependency that violates this principle, and the worst part is that everything looks fine until the moment it isn't.

The AWS Well-Architected Framework's Reliability Pillar specifically calls for eliminating Single Points of Failure. A shared NAT Gateway is a textbook SPOF, easy to miss precisely because the architecture looks correct at first glance.

Key Takeaways

A NAT Gateway is zonal, it belongs to one specific Availability Zone
Routing all private subnet traffic through a single NAT Gateway creates a hidden Single Point of Failure
The fix: one NAT Gateway per AZ, one Route Table per AZ
Use Terraform to enforce this pattern by design and prevent infrastructure drift
The cost of two NAT Gateways (~$32/month extra) is nothing compared to the cost of an outage

This article is part of my AWS Solutions Architect Associate (SAA-C03) preparation series. I document real hands-on lab experiences, architecture challenges, and lessons learned along the way.

Follow along for more practical AWS architecture and Infrastructure as Code content.

Stop Putting Everything in One Terraform State: Use Terragrunt Dependency Blocks

Jeancy Joachim Mukaka — Wed, 29 Apr 2026 15:40:49 +0000

Prerequisites

Before getting started, make sure you have the following:

Basic knowledge of Terraform (HCL syntax, resources, variables, remote state)
Terraform >= 1.11 installed - Download
Terragrunt installed - Installation guide
AWS CLI configured with sufficient permissions to create S3 buckets and EC2 instances
Visual Studio Code with the HashiCorp Terraform extension for syntax highlighting and autocompletion
Read Part 1 of this series: Stop Copy-Pasting Terraform State Configs: Use Terragrunt Instead

Introduction

In Part 1 of this series, we saw how Terragrunt eliminates the repetition of remote state backend configurations across environments. If you haven't read it yet, I recommend starting there - Stop Copy-Pasting Terraform State Configs: Use Terragrunt Instead.
Today, we go one step further.
Most Terraform projects start the same way: everything in one state file. Your VPC, your security groups, your EC2 instances, your RDS database, all managed together. It feels simple and convenient at first. But as your infrastructure grows, this approach becomes a hidden risk.
Imagine this: a developer runs terraform apply to redeploy an EC2 instance that is rebuilt multiple times a day. Because everything is in the same state file, that single command now has access to your VPC configuration, your production database, and your security groups, resources that should never be touched during a routine EC2 redeployment.
One wrong move, one bad variable, one interrupted apply, and you could accidentally destroy or corrupt critical infrastructure that takes hours to rebuild.
In this article, we'll explore how Terragrunt's dependency blocks allow you to split your Terraform state between infrastructure components, so that frequently changed resources never put your critical infrastructure at risk.

The Problem: One State File to Rule Them All

When everything lives in a single Terraform state file, your infrastructure looks like this:

single state file
├── VPC                ← modified once a month
├── Subnets            ← modified once a month
├── Security Groups    ← modified occasionally
├── RDS Database       ← critical, rarely modified
└── EC2 Instances      ← modified 10x per day

Every terraform apply, no matter how small, touches this single state file. This creates three serious problems:

Problem 1: Blast radius: If something goes wrong during a routine EC2 redeployment, the entire state file is at risk. A corrupted state means Terraform loses track of all your resources, VPC, database, everything.
Problem 2: No separation of concerns: A junior developer redeploying an EC2 instance has the same Terraform access as a senior engineer modifying the VPC. There is no natural boundary between critical and non-critical infrastructure.
Problem 3: Slow operations: As your infrastructure grows, Terraform has to refresh the state of every single resource on every terraform plan or terraform apply, even if you're only changing one EC2 instance. This makes operations increasingly slow.

The solution is to split your state between infrastructure components, and Terragrunt dependency blocks make this both simple and elegant.

The Solution: Separate States with Dependency Blocks

Instead of one monolithic state file, Terragrunt allows you to give each infrastructure component its own isolated state:

vpc/                    ← state 1 — modified rarely
security-groups/        ← state 2 — modified occasionally
rds/                    ← state 3 — critical, rarely modified
ec2/                    ← state 4 — modified daily

Each component lives in its own folder with its own terragrunt.hcl file and its own state file in S3:

s3://my-terraform-state/
├── dev/vpc/terraform.tfstate
├── dev/security-groups/terraform.tfstate
├── dev/rds/terraform.tfstate
└── dev/ec2/terraform.tfstate

Now when a developer runs terraform apply on the EC2 component, only the EC2 state is touched. The VPC, the database, and the security groups are completely isolated and protected.
But here's the challenge: if components are separated, how does the EC2 module know the subnet ID from the VPC module? How does the security group know the VPC ID?
This is where Terragrunt's dependency block comes in.
The dependency block allows a component to read the outputs of another component without sharing the same state file:

# ec2/terragrunt.hcl

include "root" {
  path = find_in_parent_folders()
}

# Declare dependency on VPC component
dependency "vpc" {
  config_path = "../vpc"

  mock_outputs = {
    subnet_id = "subnet-00000000"
  }
}

# Declare dependency on security groups component
dependency "security_groups" {
  config_path = "../security-groups"

  mock_outputs = {
    sg_id = "sg-00000000"
  }
}

# Use outputs from dependencies as inputs
inputs = {
  subnet_id      = dependency.vpc.outputs.subnet_id
  security_group = dependency.security_groups.outputs.sg_id
  instance_type  = "t2.micro"
  environment    = "dev"
}

Two things to notice here:
First, config_path points to the folder of the dependency, not a specific file. Terragrunt knows where to find the outputs.
Second, mock_outputs provides fake values for when you run terragrunt plan without the dependencies being deployed yet. This allows you to validate your configuration before deploying anything.

The Complete Project Structure

Here is the complete project structure for a dev environment with separated state files:

project/
├── terragrunt.hcl                    # Root — remote state defined once
└── dev/
    ├── vpc/
    │   ├── terragrunt.hcl
    │   └── main.tf                   # VPC + Subnets
    ├── security-groups/
    │   ├── terragrunt.hcl
    │   └── main.tf                   # Security Groups
    ├── rds/
    │   ├── terragrunt.hcl
    │   └── main.tf                   # RDS Database
    └── ec2/
        ├── terragrunt.hcl
        └── main.tf                   # EC2 Instances

Each component exposes its key values through Terraform outputs, which are then consumed by dependent components via the dependency block.
Here is how the dependency chain flows:

vpc/
  └── outputs: vpc_id, subnet_id
        │
        ├─────────────────────────┐
        ▼                         ▼
security-groups/                 ec2/
  inputs: vpc_id            inputs: subnet_id
  outputs: sg_id                  │
        │                         │
        └─────────────────────────┘
                    ▼
                  ec2/
             inputs: sg_id

Terragrunt reads this dependency graph automatically and deploys components in the correct order: VPC first, then security groups, then EC2. You never have to think about the deployment order manually.
The VPC component is the simplest, it has no dependencies:

# dev/vpc/terragrunt.hcl

include "root" {
  path = find_in_parent_folders()
}

inputs = {
  environment = "dev"
  vpc_cidr    = "10.0.0.0/16"
}

And its main.tf exposes the values that other components need:

# dev/vpc/main.tf

output "vpc_id" {
  value = aws_vpc.main.id
}

output "subnet_id" {
  value = aws_subnet.public.id
}

The outputs are what the dependency block reads when EC2 asks for dependency.vpc.outputs.subnet_id.
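
The security-groups component sits in the middle of the chain: it consumes vpc_id and exposes sg_id. Its files are not shown in this article, so the following is only a sketch of what they might look like; the resource name aws_security_group.app is hypothetical.

```hcl
# dev/security-groups/terragrunt.hcl (sketch)

include "root" {
  path = find_in_parent_folders()
}

dependency "vpc" {
  config_path = "../vpc"

  # Placeholder used only while the VPC has not been applied yet
  mock_outputs = {
    vpc_id = "vpc-00000000"
  }

  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  vpc_id      = dependency.vpc.outputs.vpc_id
  environment = "dev"
}

# dev/security-groups/main.tf (sketch): the output that ec2/ reads as
# dependency.security_groups.outputs.sg_id
output "sg_id" {
  value = aws_security_group.app.id # hypothetical resource name
}
```

The same two-step shape repeats for every component: declare the dependency, then map its outputs to inputs.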

The full code for this project structure is available in the GitHub repository

Deploying with Dependency Blocks

Once your structure is in place, deploying is as simple as one command from the dev/ folder:

terragrunt run-all apply

Terragrunt automatically:

Reads the dependency graph across all components
Deploys in the correct order — VPC → Security Groups → EC2
Passes outputs from one component as inputs to the next
Creates a separate state file in S3 for each component

You can also target individual components:

# Redeploy only EC2 — VPC and Security Groups are untouched
cd dev/ec2/
terragrunt apply

# Check outputs of a specific component
cd dev/vpc/
terragrunt output

# Destroy in reverse order automatically
cd dev/
terragrunt run-all destroy

Notice the power here: when you run terragrunt apply in dev/ec2/ only, Terraform touches only the EC2 state file. Your VPC and database state files are completely safe, even if something goes wrong.

The mock_outputs - Why They Matter

When you run terragrunt plan on the EC2 component before the VPC is deployed, Terragrunt needs values for subnet_id and sg_id to validate the configuration. Since the real values don't exist yet, mock_outputs provides temporary placeholders:

dependency "vpc" {
  config_path = "../vpc"

  mock_outputs = {
    subnet_id = "subnet-00000000"
  }

  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

The mock_outputs_allowed_terraform_commands parameter ensures that mock values are only used during validate and plan, never during apply. This prevents accidental deployments with fake values.

A Note on Security

Before wrapping up, a quick but important note on security, raised by Paul Marcelin in the comments of Part 1.
When your state files are separated by component, you have a natural opportunity to apply different IAM permissions per component. For example:

Developers can have read/write access to the EC2 state file
Only senior engineers or CI/CD pipelines can access the VPC and RDS state files
Production state files can be encrypted with dedicated KMS keys per environment.

This is a significant security improvement over a single state file where everyone has access to everything. Separating state files is the first step, securing them with IAM policies and KMS encryption is the natural next step.
For a deep dive on Terraform state file security, I recommend this LinkedIn post by Yaroslav Naumenko: "Your Terraform state file is a secret. Most teams don't treat it that way."
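
To make the per-component idea concrete, here is a sketch of an IAM policy that limits a developer role to the EC2 state only. It assumes the bucket and key layout shown earlier (my-terraform-state, dev/ec2/terraform.tfstate); the policy name is hypothetical, and a real setup would likely need further statements (for example KMS permissions).

```hcl
data "aws_iam_policy_document" "ec2_state_rw" {
  # Listing is limited to the dev/ec2/ prefix of the state bucket
  statement {
    sid       = "ListEc2StatePrefix"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::my-terraform-state"]

    condition {
      test     = "StringLike"
      variable = "s3:prefix"
      values   = ["dev/ec2/*"]
    }
  }

  # Read/write on the EC2 state object and its lock file only;
  # the VPC and RDS state keys are not covered by this policy
  statement {
    sid       = "ReadWriteEc2State"
    actions   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = ["arn:aws:s3:::my-terraform-state/dev/ec2/*"]
  }
}

resource "aws_iam_policy" "ec2_state_rw" {
  name   = "dev-ec2-state-rw" # hypothetical name
  policy = data.aws_iam_policy_document.ec2_state_rw.json
}
```

A role holding only this policy can plan and apply the EC2 component, but cannot read or overwrite the VPC or RDS state files.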

Conclusion

Let's recap what we covered in this article:

The problem: a monolithic state file creates a dangerous blast radius where routine operations can accidentally affect critical infrastructure
The solution: Terragrunt dependency blocks allow each component to have its own isolated state file
The dependency block reads outputs from other components without sharing their state
mock_outputs allow you to validate configurations before dependencies are deployed
terragrunt run-all apply automatically respects the dependency order
Separating state files is also the foundation for better security: different IAM permissions per component.

Together, Part 1 and Part 2 give you a complete Terragrunt workflow:

Part 1 → One root terragrunt.hcl    = No repeated backend configs
Part 2 → Dependency blocks          = No more monolithic state files

If you found this helpful, share it and follow me for the next article in the series.
The code for this article is available on my GitHub.

Stop Copy-Pasting Terraform State Configs: Use Terragrunt Instead

Jeancy Joachim Mukaka — Mon, 13 Apr 2026 14:31:31 +0000

Prerequisites

Before getting started, make sure you have the following:

Basic knowledge of Terraform (HCL syntax, resources, variables, remote state). The full prerequisite code is available in the GitHub repository
Terraform installed on your machine (v1.11 or higher, since the examples use native S3 state locking)
Terragrunt installed, check the official installation guide
An AWS account with sufficient permissions to create S3 buckets and DynamoDB tables
AWS CLI configured with your credentials (aws configure)
Visual Studio Code as your code editor, with the HashiCorp Terraform extension for syntax highlighting and autocompletion.

Introduction

If you have been working with Terraform for a while, you have probably faced this situation: you have a working configuration for your dev environment, and now you need to deploy the same infrastructure to staging and prod. So you copy the folder, update a few values, including the remote state backend configuration, and repeat. It works, but something feels wrong.
That "something" is a violation of the DRY principle: Don't Repeat Yourself. Every time you duplicate your backend configuration, you create a new opportunity for error and a new file to maintain.
In this article, we will explore how Terragrunt solves this problem by allowing you to define your remote state configuration once and reuse it across all your environments.
If you are new to Terraform, I recommend exploring the prerequisite code on GitHub before diving in.

The Problem: Repeated Remote State

When managing multiple environments with Terraform, most developers end up with a structure like this:

environments/
├── dev/
│   └── main.tf
├── staging/
│   └── main.tf
└── prod/
    └── main.tf

And inside each main.tf, the same backend block appears with only one line changing:

# dev/main.tf
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "dev/terraform.tfstate"
    region         = "us-west-2"
    use_lockfile   = true 
    encrypt        = true
  }
}

The same block is then copy-pasted into staging/main.tf and prod/main.tf, with only the key value changing (staging/terraform.tfstate, prod/terraform.tfstate). That's three files, three times the same configuration. And if you ever need to change the bucket name, the region, or encryption, you have to update every single file manually. This is exactly the kind of repetition that leads to human error and maintenance nightmares.

What is Terragrunt?

Terragrunt is a thin wrapper around Terraform, developed by Gruntwork. It doesn't replace Terraform; it enhances it by providing additional tools to keep your configurations DRY, manageable, and consistent across environments.
Think of it this way: Terraform is the engine, and Terragrunt is the intelligent framework built around it. You still write the same HCL code you know, but Terragrunt handles the repetitive parts for you.
With Terragrunt you can:

Define your remote state configuration once and reuse it across all environments
Automatically create your S3 bucket (and a DynamoDB lock table, if you configure one) if they don't exist
Deploy multiple environments with a single command
Keep your codebase clean, readable, and easy to maintain

The key concept we'll focus on in this article is the remote_state block, the feature that eliminates repeated backend configurations across environments.

The Solution: Centralized Remote State with Terragrunt

Instead of repeating the backend configuration in every environment, Terragrunt lets you define it once in a root terragrunt.hcl file:

project/
├── terragrunt.hcl        ← defined once here
├── dev/
│   └── terragrunt.hcl    ← only what changes
├── staging/
│   └── terragrunt.hcl    ← only what changes
└── prod/
    └── terragrunt.hcl    ← only what changes

The root terragrunt.hcl contains the full remote state configuration:

# terragrunt.hcl (root)
remote_state {
  backend = "s3"

  config = {
    bucket       = "your-terraform-state-bucket"
    key          = "${path_relative_to_include()}/terraform.tfstate"
    region       = "us-west-2"
    encrypt      = true
    use_lockfile = true # Native S3 locking, replaces DynamoDB (Terraform v1.11+)
  }

  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
}

Update: As of Terraform v1.11, DynamoDB-based state locking is deprecated. This example uses native S3 locking via use_lockfile = true. Thanks to Paul Marcelin for pointing this out!

Each environment file simply inherits from the root. Here is the dev example:

# dev/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

inputs = {
  environment   = "dev"
  instance_type = "t2.micro"
}

The staging and prod files follow the exact same structure, only the environment and instance_type values change. That's it. Three environments, three small files, each containing only what is unique to that environment. The backend configuration lives in one place and is never repeated.
The full project structure with all environments is available in the GitHub repository.
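As a rough illustration of "the exact same structure", a staging file might look like the sketch below. The instance_type value is an assumption made up for this example; it is not taken from the repository.

```hcl
# staging/terragrunt.hcl (illustrative sketch)
include "root" {
  # Same lookup as in dev: inherit the root remote_state configuration.
  path = find_in_parent_folders()
}

inputs = {
  environment   = "staging"
  instance_type = "t3.small" # hypothetical value, pick whatever staging needs
}
```

Nothing about the backend appears here: the bucket, region, and locking settings all come from the root file.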

Key Terragrunt Functions Explained

Two functions make all of this possible. Understanding them is the key to mastering Terragrunt.

find_in_parent_folders(): This function automatically searches parent directories for the root terragrunt.hcl file. It allows each environment file to inherit the root configuration without hardcoding the path.

include "root" {
  path = find_in_parent_folders()  # walks up the parent folders until it finds the root terragrunt.hcl
}

No matter how deeply nested your environment folder is, Terragrunt will always find the root configuration.

path_relative_to_include(): This is the function that makes the state key dynamic. It returns the relative path of the current environment folder from the root.

key = "${path_relative_to_include()}/terraform.tfstate"

Concretely, this means:

| Environment folder | Generated state key |
| :--- | :--- |
| `dev/` | `dev/terraform.tfstate` |
| `staging/` | `staging/terraform.tfstate` |
| `prod/` | `prod/terraform.tfstate` |

Each environment automatically gets its own isolated state file in S3, with zero manual configuration.
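The same rule should apply to deeper layouts. The sketch below assumes a hypothetical folder prod/app/ (not part of this article's project) whose terragrunt.hcl includes the root file; the comments show what the key would be expected to resolve to.

```hcl
# terragrunt.hcl (root), evaluated from the hypothetical folder prod/app/
remote_state {
  backend = "s3"

  config = {
    bucket = "your-terraform-state-bucket"
    # path_relative_to_include() would evaluate to "prod/app" here,
    # so the key becomes "prod/app/terraform.tfstate".
    key          = "${path_relative_to_include()}/terraform.tfstate"
    region       = "us-west-2"
    encrypt      = true
    use_lockfile = true
  }
}
```

Because the key follows the folder path, two environments can only collide on a state file if they share the same folder.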

The generate Block

This block is often overlooked but extremely powerful. It tells Terragrunt to automatically generate a backend.tf file in each environment folder before running Terraform.

generate = {
  path      = "backend.tf"
  if_exists = "overwrite_terragrunt"
}

This means you never have to manually write a backend.tf file again. Terragrunt generates it for you, every time, with the correct values.
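To make this concrete, here is roughly what the generated file could look like for the dev environment, assuming the root configuration shown earlier. Treat it as a sketch: the exact attribute order and the header comment Terragrunt writes may differ between versions.

```hcl
# backend.tf (generated in dev/, do not edit by hand)
# Terragrunt prepends its own "Generated by Terragrunt" signature comment here.
terraform {
  backend "s3" {
    bucket       = "your-terraform-state-bucket"
    encrypt      = true
    key          = "dev/terraform.tfstate" # from path_relative_to_include()
    region       = "us-west-2"
    use_lockfile = true
  }
}
```

Since the file is regenerated on each run, it is common to keep it out of version control.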

Deploying All Environments

Once your configuration is in place, deploying all environments is as simple as running a single command from the root folder:

# Deploy all environments at once
terragrunt run-all apply

Terragrunt will automatically:

Detect all terragrunt.hcl files in subdirectories
Run terraform init for each environment
Deploy each environment in the correct order
Create the S3 bucket (and a DynamoDB lock table, if one is configured) if it doesn't exist yet

You can also target a specific environment:

# Deploy only dev
cd dev/
terragrunt apply

# Check outputs across all environments
terragrunt run-all output

# Destroy all environments
terragrunt run-all destroy

Compare this to the old approach where you had to navigate into each folder manually, run terraform init, then terraform apply, and repeat for every environment. With Terragrunt, that entire workflow collapses into one command.

Conclusion

Managing Terraform remote state across multiple environments doesn't have to be painful. With Terragrunt's remote_state block, find_in_parent_folders(), and path_relative_to_include(), you can define your backend configuration once and let Terragrunt handle the rest.
Let's recap what we covered:

The problem: repeated backend configuration across environments violates the DRY principle
The solution: a single root terragrunt.hcl that centralizes the remote state configuration
The magic functions: find_in_parent_folders() and path_relative_to_include(), which make everything dynamic
The power of terragrunt run-all apply: deploy all environments in one command

This is just the beginning of what Terragrunt can do. In the next article, Part 2, we will go deeper and explore how to split your Terraform dependency blocks. You will learn why putting your VPC, your security groups, and your EC2 instances in the same state file is a risk and how to fix it.

If you found this article helpful, feel free to share it and follow me for Part 2. The code for this article is available on my GitHub.