I Built a 3-Tier AWS Architecture With Terraform — Here's What Actually Tripped Me Up
I thought I understood Terraform. Then I tried to inject a database endpoint that didn't exist yet into a server that hadn't booted yet, and I stared at my screen for a solid hour.
That moment taught me more about Infrastructure as Code than any tutorial had.
This is the story of building a production-style 3-tier AWS architecture from scratch — what I built, what broke, and what I'd do differently.
Why I Built This
I'm on a DevOps learning path at AmaliTech, and I'd been doing the usual things: tutorials, small scripts, single-instance deployments. But I kept noticing that production systems don't look like that. They have layers. They isolate things. The database is never directly reachable from the internet.
So I decided to build one — not a toy, but an architecture that actually reflects how real workloads run. The application I chose to deploy on top of it was a Pharma AI assistant: Next.js frontend, Python FastAPI backend, Clerk authentication, Paystack payments, and Groq for the LLM layer.
The goal wasn't to ship the app. It was to build the infrastructure correctly, document the decisions, and understand why each piece exists.
What I Built
The architecture has three layers, each in its own network zone:
Tier 1 — Public (ALB + Bastion Host)
The Application Load Balancer receives traffic from the internet and forwards it to the app tier. The Bastion Host is the only way to SSH into anything — and it's locked down to my IP only.
Tier 2 — Private Application (EC2 Auto Scaling Group)
EC2 instances running Docker containers. They live in private subnets — no public IPs, no direct internet exposure. The only traffic that reaches them comes through the ALB. They can reach the internet outbound through a NAT Gateway (for pulling Docker images), but nothing can reach them directly.
Tier 3 — Private Database (RDS PostgreSQL)
The database sits in private subnets with no route table attached to an internet gateway. Not "protected by a security group." Structurally unreachable from the internet.
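In Terraform terms, that isolation looks roughly like this — a sketch, with resource and variable names being illustrative rather than taken from the repo:

```hcl
# The DB subnet group only contains private subnets, whose route tables
# have no route to an internet gateway.
resource "aws_db_subnet_group" "main" {
  name       = "db-private"
  subnet_ids = var.private_db_subnet_ids
}

resource "aws_db_instance" "main" {
  engine               = "postgres"
  instance_class       = "db.t3.micro"
  allocated_storage    = 20
  db_subnet_group_name = aws_db_subnet_group.main.name
  publicly_accessible  = false # belt and braces: no public IP either
  skip_final_snapshot  = true
  username             = var.db_username
  password             = var.db_password
}
```

Even if a security group rule were misconfigured, there is no network path from the internet to these subnets.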
The Terraform Structure
Everything is modular. Five modules: networking, security, database, alb, compute. The root main.tf just orchestrates them.
```
modules/
├── networking/   # VPC, subnets, IGW, NAT Gateway, route tables
├── security/     # 4 security groups (ALB, Bastion, App, DB)
├── database/     # RDS PostgreSQL in private DB subnets
├── alb/          # Application Load Balancer + target group + listener
└── compute/      # Launch template, Auto Scaling Group, Bastion Host
```
Each module exposes outputs that downstream modules consume: networking feeds security, security feeds database and compute, and the database endpoint ultimately feeds compute.
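In the root `main.tf`, that wiring looks roughly like this (a sketch — module variable and output names here are illustrative, not necessarily the ones in the repo):

```hcl
module "networking" {
  source = "./modules/networking"
}

module "security" {
  source = "./modules/security"
  vpc_id = module.networking.vpc_id
}

module "database" {
  source            = "./modules/database"
  db_subnet_ids     = module.networking.private_db_subnet_ids
  db_security_group = module.security.db_sg_id
}

module "compute" {
  source      = "./modules/compute"
  subnet_ids  = module.networking.private_app_subnet_ids
  app_sg_id   = module.security.app_sg_id
  db_endpoint = module.database.endpoint # resolved before compute runs
}
```

Because `module.compute` references `module.database.endpoint`, Terraform builds the dependency graph automatically — no explicit `depends_on` needed.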
The Problem That Stopped Me Cold
Here's where things got interesting.
My EC2 instances boot by running a user_data.sh script. That script pulls a Docker image from Docker Hub and runs it with environment variables — including the database connection string.
The database connection string includes the RDS endpoint. Like this:
```
postgresql://username:password@mydb.abc123xyz.us-east-1.rds.amazonaws.com:5432/myapp
```
The problem: that endpoint only exists after Terraform creates the RDS instance. Which happens before the EC2 instances boot. Which means I need to pass a value that doesn't exist at the start of terraform apply into a script that runs at the end of it.
I tried a few approaches that didn't work:
- Hardcoding the endpoint — completely defeats the purpose of IaC. Next deploy on a fresh account, it breaks immediately.
- Passing it as a plain string variable — still needs the actual value upfront.
- Running a second script after apply — works once, but now you have manual steps that live outside your Terraform state.
The solution was `templatefile()`.
```hcl
# In the compute module, launch template user_data:
user_data = base64encode(
  templatefile("${path.module}/scripts/user_data.sh", {
    database_url     = "postgresql://${var.db_username}:${var.db_password}@${var.db_endpoint}/${var.db_name}?sslmode=require"
    direct_url       = "postgresql://${var.db_username}:${var.db_password}@${var.db_endpoint}/${var.db_name}?sslmode=require"
    docker_username  = var.dockerhub_username
    docker_password  = var.dockerhub_token
    clerk_secret_key = var.clerk_secret_key
    # ... other vars
  })
)
```
Terraform resolves module outputs in dependency order. By the time it runs the compute module, the database module has already completed — and its endpoint output is available. The templatefile() function substitutes all the variables into the shell script before base64-encoding it. The EC2 instance boots with a fully rendered startup script that has the real database URL already baked in.
No manual steps. No hardcoded values. Works on every fresh deploy.
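For context, the other half of the pattern is the template itself. A sketch of what `user_data.sh` might look like — the image name and environment variable names here are illustrative:

```shell
#!/bin/bash
# user_data.sh template: the ${...} placeholders below are template
# variables that templatefile() substitutes before the instance boots,
# so the script runs with concrete values already baked in.
echo "${docker_password}" | docker login -u "${docker_username}" --password-stdin

docker run -d \
  -p 80:3000 \
  -e DATABASE_URL="${database_url}" \
  -e DIRECT_URL="${direct_url}" \
  -e CLERK_SECRET_KEY="${clerk_secret_key}" \
  "${docker_username}/pharma-app:latest"
```

One subtlety: inside a `templatefile()` template, `${...}` belongs to Terraform, so any genuine shell variable has to be escaped as `$${...}`.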
The Other Gotcha: Docker Hub from a Private Subnet
After solving the database injection problem, I ran terraform apply and watched everything provision cleanly. Then I checked the EC2 logs:
```shell
sudo cat /var/log/user-data.log
```
The Docker pull was failing. The instance couldn't reach Docker Hub.
I had set up private subnets and a NAT Gateway, but I'd missed connecting them. The private route table didn't have a route for 0.0.0.0/0 pointing at the NAT Gateway. Private subnets need that explicit route — they don't inherit it from anywhere.
```hcl
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main.id # This is what I was missing
  }
}
```
Fix applied. Docker pull worked. App came up.
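One related step that's easy to forget: the route table only takes effect on subnets it's explicitly associated with; otherwise they silently fall back to the VPC's main route table. A sketch (subnet references are illustrative):

```hcl
# Associate the private route table with each private app subnet.
resource "aws_route_table_association" "private" {
  count          = length(aws_subnet.private_app)
  subnet_id      = aws_subnet.private_app[count.index].id
  route_table_id = aws_route_table.private.id
}
```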
Security Groups That Reference Each Other
One thing I'm actually proud of in this project is how the security groups are set up.
Instead of allowing traffic from IP address ranges, each security group references another security group as the source:
```hcl
# App security group — only allows HTTP from the ALB security group
ingress {
  from_port       = 80
  to_port         = 80
  protocol        = "tcp"
  security_groups = [aws_security_group.alb.id]
}

# Database security group — only allows PostgreSQL from the App security group
ingress {
  from_port       = 5432
  to_port         = 5432
  protocol        = "tcp"
  security_groups = [aws_security_group.app.id]
}
```
Why does this matter? Because EC2 instances in an Auto Scaling Group come and go. Their IP addresses change. If you allow traffic from 10.0.2.0/24, you need to keep that CIDR accurate forever. If you allow traffic from the App security group ID, any instance in that group is automatically covered — regardless of its IP.
The CI/CD Pipeline
I set up a GitHub Actions workflow that triggers on every push to main:
- Checkout the code
- Log in to Docker Hub
- Build the Docker image from `./pharma_app/Dockerfile`
- Push it with the `latest` tag
When new instances launch (via ASG), they pull `latest` from Docker Hub automatically through the user_data script. So a code push → Docker Hub → next instance launch picks up the new image.
What Actually Running Looks Like
After terraform apply completes, the output gives you the ALB DNS name. Hit that in a browser and the app loads — served through the load balancer, from a private EC2 instance, talking to a database that has no internet exposure.
Key Learnings
- `templatefile()` is how you inject dynamic values into user_data. Terraform resolves the value after the dependency it comes from is complete. Use it — don't work around it.
- Private subnets are not automatically NAT'd. You need to create the NAT Gateway, create a private route table, add a `0.0.0.0/0` route pointing at the NAT, and associate that route table with your private subnets. All four steps. No shortcuts.
- Security groups that reference each other are more resilient than CIDR-based rules. Especially in environments where instances scale dynamically.
- Add retry logic to user_data scripts. Instance networking isn't always ready the second the script runs. Five retries with 15-second delays cost nothing and prevent a class of flaky failures.
- RDS subnet groups require at least two AZs. This is an AWS requirement, not a recommendation. Design your subnets for multi-AZ from the start.
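The retry advice can be sketched as a small helper in the spirit of the user_data script — the function name and the Docker example are illustrative, not lifted from the repo:

```shell
#!/usr/bin/env bash
# retry <attempts> <delay_seconds> <command...>
# Runs the command until it succeeds, or gives up after <attempts> tries.
retry() {
  local attempts=$1 delay=$2 n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "failed after $attempts attempts: $*" >&2
      return 1
    fi
    echo "attempt $n failed, retrying in ${delay}s..." >&2
    n=$((n + 1))
    sleep "$delay"
  done
}

# In user_data this would guard the image pull, e.g.:
# retry 5 15 docker pull "$IMAGE"
```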
Lessons for Other Learners
If you're working through a similar architecture and something isn't connecting, nine times out of ten it's a routing issue or a security group that's too restrictive. Check your route tables before you start doubting your application code.
And don't skip the modular structure because it feels like overhead. When something breaks, knowing that the networking module is isolated from the compute module makes debugging dramatically faster.
Resources & Next Steps
- Full repo: github.com/celetrialprince166/Terraform_3tierArch
- Terraform templatefile() docs
- AWS 3-Tier Architecture reference
Next: I'm building out a full CI/CD pipeline with Jenkins, GitHub Actions, SonarCloud code analysis, and Trivy image scanning. Follow along if that's useful.
Have you built a 3-tier architecture before? What was the part that gave you the most trouble? Drop it in the comments — I'm genuinely curious whether the NAT Gateway thing trips everyone up or just me. 👇