For the last three weeks, I've been building a production-style AWS infrastructure using Terraform, ECS (Fargate), Docker, and Jenkins.
⚠️ Important Clarification
I did not use community Terraform modules.
Every Terraform module in this project was written from scratch by me.
This was not an accident. It was a conscious decision to trade speed for understanding.
I didn't want to learn how to use Terraform.
I wanted to learn how infrastructure actually behaves under real constraints.
This article documents:
- The architecture I built
- How the system works end-to-end
- The real problems I ran into
- Why those problems mattered
- What this project fundamentally changed in how I think about infrastructure
🎯 Motivation: Why Build Everything From Scratch?
Terraform community modules are powerful. They are also abstractions.
After using them in the past, I realized something uncomfortable:
I could deploy fairly complex infrastructure without truly understanding:
- Why certain IAM permissions were required
- How traffic actually flowed through the network
- What Terraform needed during `plan` vs `apply`
- How ECS, ALBs, and IAM interact internally
So I imposed a hard rule on myself:
- ❌ No community modules
- ❌ No copy-pasting large IAM policies without understanding them
- ❌ No "just works" defaults
If something broke, I wanted to know why it broke.
This decision turned a "simple ECS project" into a deep learning exercise.
🏗️ High-Level System Overview
At a high level, this is a two-tier containerized application deployed on AWS ECS using Fargate.
The architecture is intentionally private by default.
🌐 Frontend Layer
- Publicly accessible only through a Public Application Load Balancer
- Runs as ECS Fargate tasks inside private subnets
- No public IPs on ECS tasks
🔒 Backend Layer
- Completely private
- Accessible only via an Internal Application Load Balancer
- Runs as ECS Fargate tasks inside private subnets
- Zero direct internet exposure
The frontend never talks directly to the backend container.
All communication flows through load balancers.
This wasn't just architectural purity; it simplified security reasoning and debugging.
📐 Architecture at a Glance
| Component | Details |
|---|---|
| Frontend | Public ALB → ECS Fargate (private subnet) |
| Backend | Internal ALB → ECS Fargate (private subnet) |
| CI/CD | Jenkins EC2 → Docker → ECR → Terraform → ECS |
| State | S3 + DynamoDB locking |
| Networking | Custom VPC, NAT Gateway, multi-AZ |
🔒 Networking Design: What "Private" Actually Means
I created a custom VPC and treated networking as a first-class concern.
VPC Layout
🟢 Public Subnets
- Public Application Load Balancer
- Jenkins EC2 instance
- Internet Gateway attached
🔴 Private Subnets
- Frontend ECS tasks
- Backend ECS tasks
- Internal Application Load Balancer
🔁 Ingress Flow
Internet → Public ALB → Frontend ECS (private subnet) → Internal ALB → Backend ECS (private subnet)
No other ingress paths exist.
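In Terraform, that single ingress path is enforced by chaining security groups: the task's security group admits traffic only from the ALB's security group, never from a CIDR range. A minimal sketch (resource names and ports are illustrative, not the project's actual values):

```hcl
# Public ALB: the only component that accepts traffic from the internet.
resource "aws_security_group" "public_alb" {
  name   = "public-alb-sg"
  vpc_id = aws_vpc.main.id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Frontend tasks: reachable only via the ALB's security group.
# No CIDR-based ingress rules exist on this group at all.
resource "aws_security_group" "frontend" {
  name   = "frontend-sg"
  vpc_id = aws_vpc.main.id

  ingress {
    from_port       = 3000
    to_port         = 3000
    protocol        = "tcp"
    security_groups = [aws_security_group.public_alb.id]
  }
}
```

Referencing a security group ID instead of a CIDR block is what makes "no other ingress paths exist" verifiable from the code itself.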
⚠️ Egress Reality Check
One of the earliest failures I hit:
ECS tasks couldn't pull images from ECR.
The reason wasn't ECS or IAM.
It was networking.
Private subnets do not magically have outbound internet access.
Fixing this forced me to understand:
- ✅ NAT Gateways
- ✅ Route tables
- ✅ Why "private subnet" doesn't mean "isolated from the world" by default
This single issue reshaped how I think about AWS networking.
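The fix comes down to three resources: an Elastic IP, a NAT Gateway in a public subnet, and a route table association for the private subnets. A hedged sketch of the shape it takes (names are illustrative):

```hcl
# A NAT Gateway lives in a *public* subnet and needs an Elastic IP.
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public.id
}

# Private subnets route all outbound traffic through the NAT Gateway.
# This is what lets ECS tasks pull from ECR without public IPs.
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main.id
  }
}

resource "aws_route_table_association" "private" {
  subnet_id      = aws_subnet.private.id
  route_table_id = aws_route_table.private.id
}
```

Without that default route, a "private" subnet has no outbound path at all, which is exactly why the ECR pulls failed.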
🐳 Containerization and Image Flow
Docker is used for packaging both services.
Key decisions:
- Images are tagged with Git commit SHA
- No `latest` tags
- Every deployment is traceable
Image flow looks like this:
- Jenkins builds Docker images
- Images are pushed to Amazon ECR
- ECS pulls images using the task execution role
This strict immutability made rollbacks and debugging significantly easier.
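That immutability can also be enforced at the registry level, so a SHA tag can never be overwritten after it is pushed. A sketch of what that looks like (repository name is illustrative):

```hcl
# IMMUTABLE rejects any push that reuses an existing tag,
# backing up the "no latest tags" rule with a hard guarantee.
resource "aws_ecr_repository" "frontend" {
  name                 = "frontend"
  image_tag_mutability = "IMMUTABLE"
}
```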
🛠️ Terraform Architecture: Everything Modularized
The entire infrastructure is provisioned using Terraform.
But instead of one massive configuration, I designed small, focused modules, each with a single responsibility.
📦 Custom Modules Include
| Module | Purpose |
|---|---|
| VPC | Network foundation |
| Subnets | Public/Private isolation |
| Route Tables | Traffic routing |
| Security Groups | Firewall rules |
| Public ALB | Internet-facing load balancer |
| Internal ALB | Private load balancer |
| ECS Cluster | Container orchestration |
| ECS Task Definitions | Container specifications |
| ECS Services | Service management |
| ECR Repository | Image storage |
| IAM Roles & Policies | Permissions |
| Jenkins EC2 | CI/CD server |
| Remote Backend | S3 + DynamoDB |
Each module:
- ✅ Exposes only required outputs
- ✅ Avoids leaking internal resource details
- ✅ Enforces clear dependency boundaries
This made the system easier to reason about, and easier to break in controlled ways.
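As an example of what a narrow module interface means in practice, here is a hypothetical shape for the public ALB module (file names and outputs are illustrative, not the project's exact code):

```hcl
# modules/public-alb/variables.tf -- only the inputs the module needs
variable "vpc_id"     { type = string }
variable "subnet_ids" { type = list(string) }

# modules/public-alb/outputs.tf -- only what consumers require;
# internal resources like listeners stay hidden.
output "alb_dns_name" {
  value = aws_lb.this.dns_name
}

output "target_group_arn" {
  value = aws_lb_target_group.this.arn
}
```

A consumer wires in the target group ARN without ever knowing how the ALB is built internally, which is what keeps the dependency boundaries clear.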
💾 Remote State and Locking
Terraform state is stored remotely:
- S3 for state storage
- DynamoDB for state locking
This became critical once CI/CD entered the picture.
Without locking, concurrent applies from Jenkins would have been a disaster.
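The backend configuration that makes this work is small. A sketch (bucket, key, table, and region are illustrative placeholders):

```hcl
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"        # placeholder bucket name
    key            = "ecs-project/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"           # holds the lock record
    encrypt        = true
  }
}
```

The DynamoDB table is what serializes applies: a second `terraform apply` fails fast with a lock error instead of corrupting state.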
🔄 CI/CD Design with Jenkins
Jenkins runs on an EC2 instance inside the VPC.
I intentionally separated CI and CD responsibilities.
🔨 CI Pipeline (Build & Package)
Triggered on GitHub push:
- Checkout code
- Build frontend and backend Docker images
- Tag images with Git SHA
- Push images to ECR
- Export image tags as artifacts
✅ CI has no infrastructure permissions.
🚀 CD Pipeline (Deploy via Terraform)
Triggered after CI completion:
- Fetch image tag artifacts
- Assume AWS IAM role via STS
- Run `terraform init`
- Run `terraform apply`
- Update ECS task definitions with new image versions
✅ Terraform is the only deployment mechanism.
No manual ECS changes.
No clicking in the console.
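The role assumption from the CD step can be expressed directly in the provider block, so Terraform itself exchanges Jenkins' instance credentials for a short-lived deployment role. A sketch (the role ARN and region are placeholders):

```hcl
provider "aws" {
  region = "us-east-1"

  # Jenkins' EC2 instance credentials are exchanged via STS for a
  # short-lived deployment role; only this role can change infrastructure.
  assume_role {
    role_arn     = "arn:aws:iam::123456789012:role/cd-deploy-role"
    session_name = "jenkins-cd"
  }
}
```

Keeping the deployment permissions on an assumed role rather than on the instance profile is what makes the CI/CD permission split enforceable.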
🔐 IAM: The True Difficulty of the Project
IAM was the hardest and most educational part of this project.
Because I didn't use community modules, I had to learn that:
🔍 What I Learned About IAM
Terraform needs read permissions even during creation:
- `terraform plan` fails without `Describe*` and `Get*` permissions

Missing permissions that broke my plans:
- `iam:GetPolicyVersion`
- `ec2:DescribeVpcAttribute`
- `elasticloadbalancing:DescribeLoadBalancerAttributes`
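Those read actions end up in the policy attached to the role Terraform runs under. A minimal sketch covering exactly the three actions above (the policy name and `resources = ["*"]` scope are illustrative; in practice the resource scope should be narrowed where the action supports it):

```hcl
# Read-only actions Terraform needs just to refresh state during plan.
data "aws_iam_policy_document" "plan_read" {
  statement {
    effect = "Allow"
    actions = [
      "iam:GetPolicyVersion",
      "ec2:DescribeVpcAttribute",
      "elasticloadbalancing:DescribeLoadBalancerAttributes",
    ]
    resources = ["*"]
  }
}

resource "aws_iam_policy" "plan_read" {
  name   = "terraform-plan-read"
  policy = data.aws_iam_policy_document.plan_read.json
}
```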
ECS failures were often IAM issues:
- ❌ Incorrect `iam:PassRole` configuration
- ❌ Confusion between the execution role and the task role
Most of my "Terraform errors" were actually IAM design errors.
This project forced me to deeply understand:
- ✅ Trust relationships
- ✅ Role assumption via STS
- ✅ Least-privilege policy design
- ✅ How AWS services act on behalf of other services
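The execution-role-versus-task-role distinction shows up concretely in the task definition, where both ARNs appear side by side. A hedged sketch (family, ports, and sizing are illustrative):

```hcl
resource "aws_ecs_task_definition" "backend" {
  family                   = "backend"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "256"
  memory                   = "512"

  # Execution role: used by the ECS agent itself to pull the
  # image from ECR and write logs. The app never sees it.
  execution_role_arn = aws_iam_role.execution.arn

  # Task role: the credentials the application code inside the
  # container receives at runtime.
  task_role_arn = aws_iam_role.task.arn

  # Note: whichever principal registers this task definition needs
  # iam:PassRole on BOTH role ARNs -- the source of the PassRole errors.
  container_definitions = jsonencode([{
    name         = "backend"
    image        = "${aws_ecr_repository.backend.repository_url}:${var.image_tag}"
    portMappings = [{ containerPort = 8080 }]
  }])
}
```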
🐛 ECS Debugging: Nothing Is Isolated
ECS failures required system-level thinking.
A task failing could be caused by:
- ❌ Image pull failures
- ❌ Missing IAM permissions
- ❌ Incorrect security group rules
- ❌ ALB health check mismatch
- ❌ Networking misconfiguration
There is no single log that tells the full story.
You need to understand how:
- ECS
- ALB
- IAM
- Networking
work together.
This project forced me to build that mental model.
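Of those failure modes, the health check mismatch is the easiest to illustrate: if the target group probes a path or port the container doesn't serve, ECS will kill tasks that are otherwise perfectly healthy. A sketch (path, port, and thresholds are illustrative):

```hcl
resource "aws_lb_target_group" "frontend" {
  name        = "frontend-tg"
  port        = 3000
  protocol    = "HTTP"
  target_type = "ip"        # required for Fargate tasks in awsvpc mode
  vpc_id      = aws_vpc.main.id

  health_check {
    path                = "/health"  # must be a route the container actually serves
    matcher             = "200"
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }
}
```

When this check fails, the only visible symptom is ECS cycling tasks, which is why the root cause is so often hunted in the wrong layer.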
⚠️ Key Mistakes I Made (So You Don't Have To)
1️⃣ Assuming private subnets = no internet access
- Reality: Private subnets need NAT Gateway + route table configuration
- Cost me hours debugging ECR pull failures
2️⃣ Treating IAM as an afterthought
- IAM should be designed first, not patched later
- Most "Terraform errors" were actually IAM design errors
3️⃣ Not separating CI and CD early
- Initially mixed build and deploy logic
- Separation made debugging and security much cleaner
4️⃣ Underestimating security group complexity
- Had to trace traffic flow through multiple layers
- One missing rule broke the entire deployment
📊 Project by the Numbers
| Metric | Count |
|---|---|
| Duration | 3 weeks |
| Custom Terraform modules written | 12+ |
| AWS resources managed | 50+ |
| IAM policies debugged | Too many to count |
| Docker images built | 50+ |
| Failed deployments before success | 15+ |
🎓 What This Project Gave Me
- ✅ Confidence writing Terraform modules from scratch
- ✅ Strong understanding of IAM trust and permission boundaries
- ✅ Practical experience debugging ECS + ALB + networking
- ✅ Clear mental separation of CI vs CD
- ✅ Comfort with production-style AWS constraints
More importantly, it taught me how to reason about infrastructure instead of guessing.
🛑 Why I'm Stopping Here
This project achieved its learning goal.
Continuing to polish it would bring diminishing returns.
The biggest gap in my skill set now is Kubernetes, and I'm moving there next, with the same approach:
- ❌ No shortcuts
- ❌ No blind abstractions
- ✅ Build it, break it, debug it
💭 Final Thought
If you're learning DevOps:
Don't just use Terraform modules.
Write them. Break them. Fix them.
That's where real understanding comes from.
💬 Questions for the Community
- 🤔 What's the most painful IAM issue you've debugged?
- 🤔 Do you prefer community modules or custom modules for learning?
- 🤔 What infrastructure topic should I tackle next?
Drop a comment; I read and respond to all of them.
Tags: #aws #terraform #devops #ecs #docker #jenkins #infrastructure #cicd #learning #iac
If you found this helpful, give it a ❤️ and follow for more deep-dive DevOps content!