Adil Khan
Building a Production-Style AWS ECS Platform with Terraform (Without Community Modules)

For the last three weeks, I've been building a production-style AWS infrastructure using Terraform, ECS (Fargate), Docker, and Jenkins.


⚠️ Important Clarification

I did not use community Terraform modules.

Every Terraform module in this project was written from scratch by me.


This was not an accident. It was a conscious decision to trade speed for understanding.

I didn't want to learn how to use Terraform.
I wanted to learn how infrastructure actually behaves under real constraints.

This article documents:

  • The architecture I built
  • How the system works end-to-end
  • The real problems I ran into
  • Why those problems mattered
  • What this project fundamentally changed in how I think about infrastructure

🎯 Motivation: Why Build Everything From Scratch?

Terraform community modules are powerful. They are also abstractions.

After using them in the past, I realized something uncomfortable:

I could deploy fairly complex infrastructure without truly understanding:

  • Why certain IAM permissions were required
  • How traffic actually flowed through the network
  • What Terraform needed during plan vs apply
  • How ECS, ALBs, and IAM interact internally

So I imposed a hard rule on myself:

  • ❌ No community modules
  • ❌ No copy-pasting large IAM policies without understanding them
  • ❌ No "just works" defaults

If something broke, I wanted to know why it broke.

This decision turned a "simple ECS project" into a deep learning exercise.


πŸ—οΈ High-Level System Overview

At a high level, this is a two-tier containerized application deployed on AWS ECS using Fargate.

[Architecture diagram]

The architecture is intentionally private by default.

🌐 Frontend Layer

  • Publicly accessible only through a Public Application Load Balancer
  • Runs as ECS Fargate tasks inside private subnets
  • No public IPs on ECS tasks

🔒 Backend Layer

  • Completely private
  • Accessible only via an Internal Application Load Balancer
  • Runs as ECS Fargate tasks inside private subnets
  • Zero direct internet exposure

The frontend never talks directly to the backend container.
All communication flows through load balancers.

This wasn't just architectural purity; it simplified security reasoning and debugging.


📋 Architecture at a Glance

| Component  | Details                                      |
| ---------- | -------------------------------------------- |
| Frontend   | Public ALB → ECS Fargate (private subnet)    |
| Backend    | Internal ALB → ECS Fargate (private subnet)  |
| CI/CD      | Jenkins EC2 → Docker → ECR → Terraform → ECS |
| State      | S3 + DynamoDB locking                        |
| Networking | Custom VPC, NAT Gateway, multi-AZ            |

🌐 Networking Design: What "Private" Actually Means

I created a custom VPC and treated networking as a first-class concern.

VPC Layout

🟢 Public Subnets

  • Public Application Load Balancer
  • Jenkins EC2 instance
  • Internet Gateway attached

🔴 Private Subnets

  • Frontend ECS tasks
  • Backend ECS tasks
  • Internal Application Load Balancer

🔄 Ingress Flow

Internet → Public ALB → Frontend ECS (private subnet) → Internal ALB → Backend ECS (private subnet)

No other ingress paths exist.
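
The layout above can be sketched in Terraform roughly like this. Everything here is illustrative (CIDRs, names, and a single AZ for brevity), not the project's actual code:

```hcl
# Illustrative sketch: one public and one private subnet in a single AZ.
# The real layout spans multiple AZs.
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true
}

resource "aws_subnet" "public" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.1.0/24"
  availability_zone       = "us-east-1a"
  map_public_ip_on_launch = true # public ALB and Jenkins live here
}

resource "aws_subnet" "private" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.101.0/24"
  availability_zone = "us-east-1a"
  # ECS tasks and the internal ALB live here; tasks never get public IPs
}
```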


⚠️ Egress Reality Check

One of the earliest failures I hit:

ECS tasks couldn't pull images from ECR.

The reason wasn't ECS or IAM.
It was networking.

Private subnets do not magically have outbound internet access.

Fixing this forced me to understand:

  • ✅ NAT Gateways
  • ✅ Route tables
  • ✅ Why "private subnet" doesn't mean "isolated from the world" by default

This single issue reshaped how I think about AWS networking.
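
A minimal sketch of the fix, assuming the public and private subnets already exist (resource names are illustrative):

```hcl
# The NAT Gateway lives in a PUBLIC subnet but serves PRIVATE subnets.
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public.id
}

# Private route table: the default route goes out through the NAT Gateway,
# which is what lets ECS tasks in private subnets reach ECR.
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main.id
  }
}

resource "aws_route_table_association" "private" {
  subnet_id      = aws_subnet.private.id
  route_table_id = aws_route_table.private.id
}
```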


🐳 Containerization and Image Flow

Docker is used for packaging both services.

Key decisions:

  • Images are tagged with Git commit SHA
  • No latest tags
  • Every deployment is traceable

Image flow looks like this:

  1. Jenkins builds Docker images
  2. Images are pushed to Amazon ECR
  3. ECS pulls images using the task execution role

This strict immutability made rollbacks and debugging significantly easier.
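
One way to enforce the no-latest rule at the repository level is ECR tag immutability; a sketch, with an assumed repository name:

```hcl
resource "aws_ecr_repository" "frontend" {
  name = "myapp/frontend" # illustrative name

  # Rejects any push that reuses an existing tag, which rules out
  # a mutable "latest" tag entirely and keeps every SHA traceable.
  image_tag_mutability = "IMMUTABLE"

  image_scanning_configuration {
    scan_on_push = true
  }
}
```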


πŸ› οΈ Terraform Architecture: Everything Modularized

The entire infrastructure is provisioned using Terraform.

But instead of one massive configuration, I designed small, focused modules, each with a single responsibility.

📦 Custom Modules Include

| Module               | Purpose                       |
| -------------------- | ----------------------------- |
| VPC                  | Network foundation            |
| Subnets              | Public/private isolation      |
| Route Tables         | Traffic routing               |
| Security Groups      | Firewall rules                |
| Public ALB           | Internet-facing load balancer |
| Internal ALB         | Private load balancer         |
| ECS Cluster          | Container orchestration       |
| ECS Task Definitions | Container specifications      |
| ECS Services         | Service management            |
| ECR Repository       | Image storage                 |
| IAM Roles & Policies | Permissions                   |
| Jenkins EC2          | CI/CD server                  |
| Remote Backend       | S3 + DynamoDB                 |

Each module:

  • ✅ Exposes only required outputs
  • ✅ Avoids leaking internal resource details
  • ✅ Enforces clear dependency boundaries

This made the system easier to reason about, and easier to break in controlled ways.
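
To illustrate those boundaries, a root configuration can wire modules together through outputs alone. Module paths and variable names here are assumptions, not the project's actual code:

```hcl
module "vpc" {
  source   = "./modules/vpc"
  vpc_cidr = "10.0.0.0/16"
}

module "ecs_service_frontend" {
  source = "./modules/ecs-service"

  # Only the outputs a module chooses to expose cross the boundary;
  # internals like route tables and NAT Gateways never leak out.
  vpc_id             = module.vpc.vpc_id
  private_subnet_ids = module.vpc.private_subnet_ids
}
```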




💾 Remote State and Locking

Terraform state is stored remotely:

  • S3 for state storage
  • DynamoDB for state locking

This became critical once CI/CD entered the picture.

Without locking, concurrent applies from Jenkins would have been a disaster.
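
The backend configuration for this pattern looks roughly like the following (bucket and table names are placeholders):

```hcl
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket" # placeholder
    key            = "ecs-platform/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks" # placeholder; holds one lock item per state
  }
}
```

With the DynamoDB table in place, a second concurrent `terraform apply` fails fast with a lock error instead of corrupting state.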


🔄 CI/CD Design with Jenkins

Jenkins runs on an EC2 instance inside the VPC.

I intentionally separated CI and CD responsibilities.

🔨 CI Pipeline (Build & Package)

Triggered on GitHub push:

  1. Checkout code
  2. Build frontend and backend Docker images
  3. Tag images with Git SHA
  4. Push images to ECR
  5. Export image tags as artifacts

✅ CI has no infrastructure permissions.


🚀 CD Pipeline (Deploy via Terraform)

Triggered after CI completion:

  1. Fetch image tag artifacts
  2. Assume AWS IAM role via STS
  3. Run terraform init
  4. Run terraform apply
  5. Update ECS task definitions with new image versions

✅ Terraform is the only deployment mechanism.

No manual ECS changes.
No clicking in the console.
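
One common way to connect step 5 back to CI is to feed the CI-produced tag into Terraform as a variable. A sketch, assuming a variable named `image_tag` and an ECR repository resource (the execution role reference is also an assumption):

```hcl
variable "image_tag" {
  description = "Git commit SHA produced by the CI pipeline"
  type        = string
}

resource "aws_ecs_task_definition" "frontend" {
  family                   = "frontend"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "256"
  memory                   = "512"
  execution_role_arn       = aws_iam_role.execution.arn

  container_definitions = jsonencode([{
    name         = "frontend"
    # New SHA -> new task definition revision -> ECS rolls the service.
    image        = "${aws_ecr_repository.frontend.repository_url}:${var.image_tag}"
    portMappings = [{ containerPort = 80 }]
  }])
}
```

The CD pipeline would then run something like `terraform apply -var="image_tag=<git-sha>"`, so the deployed revision is always traceable to a commit.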


πŸ” IAM: The True Difficulty of the Project

IAM was the hardest and most educational part of this project.

Because I didn't use community modules, I had to work out every permission and trust relationship myself.

📚 What I Learned About IAM

Terraform needs read permissions even during creation:

  • terraform plan fails without Describe* and Get* permissions
  • Missing permissions that broke my plans:
    • iam:GetPolicyVersion
    • ec2:DescribeVpcAttribute
    • elasticloadbalancing:DescribeLoadBalancerAttributes
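
The kind of read-only statement that unblocked those plans can be sketched like this; treat it as illustrative and scope it tighter where the APIs allow:

```hcl
data "aws_iam_policy_document" "terraform_plan_reads" {
  statement {
    sid    = "PlanTimeReads"
    effect = "Allow"
    actions = [
      "iam:GetPolicyVersion",
      "ec2:DescribeVpcAttribute",
      "elasticloadbalancing:DescribeLoadBalancerAttributes",
    ]
    # Describe*/Get* calls happen at plan time, before any resource exists,
    # so the planning role needs them even for a first apply.
    resources = ["*"]
  }
}
```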

ECS failures were often IAM issues:

  • ❌ Incorrect iam:PassRole configuration
  • ❌ Confusion between the execution role and the task role
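
That distinction, sketched in Terraform with illustrative role names: the execution role is what the ECS agent uses to pull images and write logs, while the task role is what the application code inside the container assumes at runtime.

```hcl
# Both roles trust ecs-tasks.amazonaws.com, but they serve different callers.
data "aws_iam_policy_document" "ecs_tasks_trust" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

# Execution role: used by the ECS agent (ECR pulls, CloudWatch Logs).
resource "aws_iam_role" "execution" {
  name               = "ecs-execution-role" # illustrative
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_trust.json
}

resource "aws_iam_role_policy_attachment" "execution" {
  role       = aws_iam_role.execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Task role: assumed by the application inside the container.
resource "aws_iam_role" "task" {
  name               = "ecs-task-role" # illustrative
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_trust.json
}
```

And the PassRole trap: whatever principal registers the task definition also needs `iam:PassRole` on both role ARNs, or registration fails.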

Most of my "Terraform errors" were actually IAM design errors.

This project forced me to deeply understand:

  • ✅ Trust relationships
  • ✅ Role assumption via STS
  • ✅ Least-privilege policy design
  • ✅ How AWS services act on behalf of other services

πŸ› ECS Debugging: Nothing Is Isolated

ECS failures required system-level thinking.

A task failing could be caused by:

  • ❌ Image pull failures
  • ❌ Missing IAM permissions
  • ❌ Incorrect security group rules
  • ❌ ALB health check mismatch
  • ❌ Networking misconfiguration

There is no single log that tells the full story.

You need to understand how:

  • ECS
  • ALB
  • IAM
  • Networking

work together.

This project forced me to build that mental model.


⚠️ Key Mistakes I Made (So You Don't Have To)

1️⃣ Assuming private subnets = no internet access

  • Reality: Private subnets need NAT Gateway + route table configuration
  • Cost me hours debugging ECR pull failures

2️⃣ Treating IAM as an afterthought

  • IAM should be designed first, not patched later
  • Most "Terraform errors" were actually IAM design errors

3️⃣ Not separating CI and CD early

  • Initially mixed build and deploy logic
  • Separation made debugging and security much cleaner

4️⃣ Underestimating security group complexity

  • Had to trace traffic flow through multiple layers
  • One missing rule broke the entire deployment
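
Mistake 4 usually comes down to a rule like this one being absent; a sketch, assuming security groups for the ALB and the tasks already exist:

```hcl
# Allow ONLY the public ALB's security group to reach the frontend tasks.
# Without this single rule, every ALB health check fails and the
# service endlessly kills and replaces otherwise healthy tasks.
resource "aws_security_group_rule" "alb_to_frontend" {
  type                     = "ingress"
  from_port                = 80
  to_port                  = 80
  protocol                 = "tcp"
  security_group_id        = aws_security_group.frontend_tasks.id
  source_security_group_id = aws_security_group.public_alb.id
}
```

Referencing the ALB's security group as the source, rather than a CIDR, keeps the rule correct even when the ALB's IPs change.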

📊 Project by the Numbers

| Metric                            | Count             |
| --------------------------------- | ----------------- |
| Duration                          | 3 weeks           |
| Custom Terraform modules written  | 12+               |
| AWS resources managed             | 50+               |
| IAM policies debugged             | Too many to count |
| Docker images built               | 50+               |
| Failed deployments before success | 15+               |

🎓 What This Project Gave Me

  • ✅ Confidence writing Terraform modules from scratch
  • ✅ Strong understanding of IAM trust and permission boundaries
  • ✅ Practical experience debugging ECS + ALB + networking
  • ✅ Clear mental separation of CI vs CD
  • ✅ Comfort with production-style AWS constraints

More importantly, it taught me how to reason about infrastructure instead of guessing.


🛑 Why I'm Stopping Here

This project achieved its learning goal.

Continuing to polish it would bring diminishing returns.

The biggest gap in my skill set now is Kubernetes, and I'm moving there next, with the same approach:

  • ✅ No shortcuts
  • ✅ No blind abstractions
  • ✅ Build it, break it, debug it

💭 Final Thought

If you're learning DevOps:

Don't just use Terraform modules.

Write them. Break them. Fix them.

That's where real understanding comes from.


💬 Questions for the Community

  • 🤔 What's the most painful IAM issue you've debugged?
  • 🤔 Do you prefer community modules or custom modules for learning?
  • 🤔 What infrastructure topic should I tackle next?

Drop a comment; I read and respond to all of them.


Tags: #aws #terraform #devops #ecs #docker #jenkins #infrastructure #cicd #learning #iac


If you found this helpful, give it a ❀️ and follow for more deep-dive DevOps content!
