DEV Community

MapDevops

Stop Fighting AWS Networking — Deploy Your Container in 3 Steps

You Just Want to Deploy a Docker Container. AWS Has Other Plans.

You've got a Dockerfile. It works on your machine. It works in CI. You just want to put it on the internet.

So you open the AWS console and within 15 minutes you're reading about:

  • VPCs, CIDR blocks, and subnet math
  • Internet Gateways vs. NAT Gateways
  • Route tables (public vs. private, and why they're different)
  • Application Load Balancers, target groups, listener rules
  • Security groups that reference other security groups
  • ECS task definitions, services, execution roles, task roles
  • Auto Scaling policies, CloudWatch alarms, Container Insights

You wanted docker run. AWS handed you a 200-page networking textbook.

I've been there. Multiple times. And after the third time I rebuilt this from scratch for a new project, I decided to actually do it right — and never do it again.

But first, let me show you the mistake almost everyone makes on their first try.

The Trap: Fargate in a Public Subnet

Here's what most tutorials teach you (and what the AWS "Getting Started" wizard defaults to):

  1. Create a VPC with public subnets
  2. Put your Fargate tasks in those public subnets
  3. Set assign_public_ip = true so the tasks can pull images from ECR
  4. Attach a security group that allows inbound traffic on your container port
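In Terraform, those four steps usually come out looking something like this (a sketch; the resource names and `module.vpc` outputs are illustrative, not from any specific module):

```hcl
# The "works but exposed" setup most tutorials produce.
resource "aws_ecs_service" "app" {
  name            = "app"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  launch_type     = "FARGATE"
  desired_count   = 1

  network_configuration {
    subnets          = module.vpc.public_subnet_ids   # tasks sit in public subnets
    security_groups  = [aws_security_group.app.id]    # SG opens the container port
    assign_public_ip = true                           # every task gets a public IP
  }
}
```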

It works. Your container is reachable. You ship it. You move on.

But here's what you just did:

  • Your containers have public IP addresses. They are directly addressable from the entire internet.
  • That security group you wrote? It's the only thing between a bad actor and your application process.
  • Every container is an attack surface. Not just the ALB — every running task.
  • If someone finds a vulnerability in your app (or its dependencies), they have a direct network path to exploit it.
  • You skipped the NAT Gateway to save ~$32/month. Your containers are now exposed to save the cost of a nice dinner.

This isn't a theoretical risk. Port scanners hit every public IP on AWS continuously. If your container has a debug endpoint, an unpatched dependency, or even a misconfigured health check — it's findable.

The "save money, skip the NAT" shortcut is a security incident waiting to happen.

The Correct Architecture

Here's what a production Fargate deployment should look like:

                          Internet
                              |
                         +----v----+
                         |   IGW   |
                         +----+----+
                              | HTTP/HTTPS
          +-------------------v-------------------+
          |            VPC  10.0.0.0/16           |
          |                                       |
          |   +-------------------------------+   |
          |   |   ALB  (SG: 80,443 inbound)   |   |
          |   +-------+---------------+-------+   |
          |           |               |           |
          |   +-------v------+ +------v-------+   |
          |   | Public Subnet| | Public Subnet|   |
          |   |  us-east-1a  | |  us-east-1b  |   |
          |   |   [NAT GW]   | |              |   |
          |   +-------+------+ +------+-------+   |
          |           |               |           |
          |   +-------v------+ +------v-------+   |
          |   |Private Subnet| |Private Subnet|   |
          |   |  us-east-1a  | |  us-east-1b  |   |
          |   |  [Fargate]   | |  [Fargate]   |   |
          |   | SG: ALB only | | SG: ALB only |   |
          |   +--------------+ +--------------+   |
          +-----------+---------------+-----------+
                      |               |
              CloudWatch Logs     ECR / IAM

The flow:

  1. Internet -> ALB (lives in public subnets, accepts HTTP/HTTPS)
  2. ALB -> Fargate tasks (live in private subnets, accept traffic only from the ALB's security group)
  3. Fargate -> Internet (outbound only, through the NAT Gateway — for pulling images, calling APIs, etc.)

Your containers have no public IPs. They are unreachable from the internet. The ALB is the only entry point, and it only forwards traffic that matches your listener rules.
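The Terraform for the service is nearly identical to the trap version; only the subnets and the public-IP flag change. A sketch, assuming a VPC module that exposes private subnet IDs and a container listening on port 8080:

```hcl
resource "aws_ecs_service" "app" {
  name            = "app"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  launch_type     = "FARGATE"
  desired_count   = 1

  network_configuration {
    subnets          = module.vpc.private_subnet_ids       # no route from the IGW
    security_groups  = [aws_security_group.ecs_tasks.id]
    assign_public_ip = false                               # image pulls go via the NAT GW
  }

  # The ALB is the only way in; it forwards matched traffic to the tasks.
  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "app"
    container_port   = 8080
  }
}
```

The two-line diff (private subnets, `assign_public_ip = false`) is the entire difference between "directly addressable from the internet" and "reachable only through the ALB".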

The Security Model

  • ALB Security Group: Inbound 80/443 from 0.0.0.0/0. That's it.
  • ECS Task Security Group: Inbound from the ALB security group only, on the container port. Zero other ingress.
  • assign_public_ip = false on every task. Non-negotiable.
  • Task Execution Role: Scoped to AmazonECSTaskExecutionRolePolicy — pull images, write logs, nothing more.
  • Task Role: Empty by default. You add only what your app needs.
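The security-group cross-reference is the detail most people miss: the task SG allows ingress from the ALB's security group by ID, not from a CIDR range. A sketch (names and the port are illustrative):

```hcl
resource "aws_security_group" "alb" {
  name   = "alb"
  vpc_id = module.vpc.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]   # the only thing open to the internet (plus 80, same shape)
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "ecs_tasks" {
  name   = "ecs-tasks"
  vpc_id = module.vpc.vpc_id

  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]   # only the ALB's SG, not a CIDR
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]   # outbound via the NAT GW: ECR pulls, external APIs
  }
}
```

Because the rule references a security group rather than an IP range, it keeps working when the ALB's internal IPs change, and nothing else inside the VPC can reach the tasks by accident.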

The Cost Reality

Yes, the NAT Gateway costs ~$32/month. Here's the cost breakdown you should actually care about:

Variable                 Default     What It Controls
single_nat_gateway       true        One NAT GW (~$32/mo). Set false for HA across AZs.
task_cpu / task_memory   256 / 512   Fargate bills per vCPU-second and GB-second.
min_capacity             1           Auto Scaling floor.

$32/month is the cost of doing this correctly. It's less than a single hour of incident response when your public-subnet containers get probed.
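In Terraform terms, that NAT toggle is just a bool passed down to the networking layer (a sketch; the variable name matches the table above, but your module wiring may differ):

```hcl
variable "single_nat_gateway" {
  description = "true = one shared NAT GW (~$32/mo); false = one per AZ for HA"
  type        = bool
  default     = true
}
```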

The Real Problem: Building This in Terraform

Understanding the architecture is one thing. Implementing it in Terraform is another.

Here's what you actually need to write:

  • VPC with calculated CIDR blocks for public and private subnets across multiple AZs
  • Internet Gateway + NAT Gateway + Elastic IP
  • Route tables (separate for public and private) with correct associations
  • ALB with target group, listener, and health check configuration
  • ECS cluster with Container Insights enabled
  • Task definition with proper CPU/memory combinations (Fargate only accepts specific pairs, so they have to match)
  • Two IAM roles (execution + task) with trust policies
  • Two security groups with cross-references
  • Auto Scaling target, plus policies for CPU and memory
  • CloudWatch log group with configurable retention
  • Variable validation (Terraform won't stop you from setting task_cpu = 999)
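That last bullet matters more than it sounds: Fargate only accepts specific CPU/memory pairs, and Terraform's type system won't catch an invalid one for you. A validation block will, at plan time. A sketch (the allowed values shown are a subset of Fargate's supported combinations; the 16384-CPU tier is omitted):

```hcl
variable "task_cpu" {
  description = "Fargate task CPU units (256 = 0.25 vCPU)"
  type        = number
  default     = 256

  validation {
    condition     = contains([256, 512, 1024, 2048, 4096], var.task_cpu)
    error_message = "task_cpu must be one of 256, 512, 1024, 2048, 4096."
  }
}

variable "task_memory" {
  description = "Fargate task memory in MiB; must pair with task_cpu (e.g. 256 CPU allows only 512, 1024, or 2048 MiB)"
  type        = number
  default     = 512

  validation {
    condition     = var.task_memory >= 512 && var.task_memory <= 30720
    error_message = "task_memory must be between 512 and 30720 MiB."
  }
}
```

With this in place, `terraform plan` rejects `task_cpu = 999` immediately instead of letting AWS reject the task definition at deploy time.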

The first time I built this, it took me roughly 25 hours — including debugging subnet routing, figuring out why tasks couldn't pull images (missing NAT), and learning that Fargate CPU/memory values aren't arbitrary.

The second time took about 10 hours because I still forgot half the edge cases.

The third time, I turned it into a boilerplate.

The Shortcut

I packaged everything above into a production-ready Terraform boilerplate that deploys in 3 steps:

Step 1 — Edit terraform.tfvars:

project_name    = "myapp"
container_image = "nginx:latest"
aws_region      = "us-east-1"

Step 2 — Run the deploy script:

./deploy.sh

Step 3 — Get your URL:

terraform output alb_dns_name

That's it. Secure VPC, ALB, NAT Gateway, private Fargate tasks, Auto Scaling, CloudWatch — all wired up and validated.

It also includes full LocalStack support, so you can test the entire infrastructure locally without spending a cent on AWS:

localstack start -d
source .env
./deploy.sh

What's included

  • Modular, documented Terraform — not a giant main.tf with 500 uncommented lines
  • Input validation — Terraform will reject invalid CPU/memory combos before you deploy
  • Deploy and destroy scripts with confirmation prompts
  • Security best practices enforced by default (private subnets, least-privilege IAM, no public IPs)
  • Cost optimization built in (single NAT Gateway toggle, right-sized defaults)
  • CloudWatch Container Insights for observability
  • Secrets management guidance (SSM / Secrets Manager, not env vars)
  • Architecture diagram (draw.io, editable)

Get the boilerplate

You'll save 20+ hours of Terraform debugging and get an architecture you can actually defend in a security review. I wish I'd had this the first three times I built it.

Questions about the architecture or how to extend it? Drop a comment — happy to help.
