I want to be upfront about something before we get into the technical stuff: this pipeline took me three attempts to get right. The first version worked locally and fell apart in staging. The second version made it to production and immediately caused an incident on a Friday afternoon. The third version — the one I'm going to walk you through — has been running cleanly for several months now.
I'm writing this because when I was building this, I couldn't find a single resource that went end-to-end and talked honestly about what actually goes wrong. Most tutorials show you the happy path. This one shows you the cliffs on either side of it.
Let's build something real.
What We're Actually Building
Before I dump architecture diagrams on you, here's the plain-English version of what this pipeline does:
A developer pushes code to a GitHub branch. Within a few minutes, that code is tested, built into a container image, pushed to ECR, and deployed to ECS Fargate — all without anyone SSHing into anything or clicking buttons in the console. Terraform manages every piece of infrastructure. If something goes wrong at any stage, the pipeline stops and notifies the team. If a deployment goes sideways, ECS rolls back automatically.
That's it. That's the whole goal.
Here's what the stack looks like:
- Source control: GitHub (connected via CodeStar Connections)
- CI/CD orchestration: AWS CodePipeline
- Build: AWS CodeBuild
- Infrastructure as Code: Terraform (with remote state in S3 + DynamoDB locking)
- Container registry: Amazon ECR
- Compute: Amazon ECS on Fargate
- Secrets: AWS Secrets Manager
- Notifications: SNS + Slack webhook
Prerequisites
Before anything else, make sure you have:
- AWS CLI configured with a profile that has sufficient permissions
- Terraform >= 1.5 installed
- Docker installed locally (for testing builds)
- A GitHub repo with your application code
- An AWS account where you can create IAM roles, VPCs, ECS clusters, etc.
I'm going to assume you're comfortable with Terraform basics and have touched ECS at least once. I won't explain what a task definition is from scratch.
Part 1: Terraform Project Structure
One of the biggest mistakes I made early on was treating Terraform as an afterthought — writing pipeline config first and then trying to retrofit IaC around it. Don't do that. Start with Terraform.
Here's the directory layout I settled on after a lot of trial and error:
infrastructure/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   └── prod/
│       ├── main.tf
│       ├── variables.tf
│       └── terraform.tfvars
├── modules/
│   ├── ecr/
│   ├── ecs/
│   ├── codepipeline/
│   ├── iam/
│   └── networking/
└── backend.tf
The environments folder separates dev and prod so each has a completely independent state file. This saved me twice when I fat-fingered something in dev and was glad it couldn't touch prod.
Setting Up Remote State First
Do this before anything else. Local state in a team environment is how you end up with two people applying Terraform at the same time and corrupting the state file, which then no longer matches your real infrastructure.
# backend.tf
terraform {
backend "s3" {
bucket = "your-company-terraform-state"
key = "gitops-demo/dev/terraform.tfstate"
region = "us-east-1"
dynamodb_table = "terraform-state-lock"
encrypt = true
}
}
Create the S3 bucket and DynamoDB table manually before running terraform init. The DynamoDB table needs a LockID string attribute as its primary key.
# One-time bootstrap
aws s3api create-bucket \
--bucket your-company-terraform-state \
--region us-east-1
aws s3api put-bucket-versioning \
--bucket your-company-terraform-state \
--versioning-configuration Status=Enabled
aws dynamodb create-table \
--table-name terraform-state-lock \
--attribute-definitions AttributeName=LockID,AttributeType=S \
--key-schema AttributeName=LockID,KeyType=HASH \
--billing-mode PAY_PER_REQUEST \
--region us-east-1
Enable versioning on the bucket. I learned this the hard way when a botched apply wiped out a resource and I needed to roll back the state file.
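The backend block above hardcodes the dev key, and Terraform only reads backend configuration from the directory it runs in. One way to keep the two state files independent, sketched here on the assumption that both environments share the same bucket and lock table, is to give each environment directory its own copy of the block with a different key. The file name and the prod key below are assumptions that simply mirror the dev naming.

```hcl
# environments/prod/backend.tf (hypothetical file; mirrors the dev backend)
terraform {
  backend "s3" {
    bucket         = "your-company-terraform-state"
    key            = "gitops-demo/prod/terraform.tfstate" # assumed key, parallel to dev
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}
```

Lock entries are keyed by bucket and state path, so a single DynamoDB table can serve both environments without them blocking each other.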
Part 2: Networking Module
Public subnets for the load balancer, private subnets for the Fargate tasks. The tasks should never be directly internet-accessible.
# modules/networking/main.tf
resource "aws_vpc" "main" {
cidr_block = var.vpc_cidr
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "${var.project_name}-${var.environment}-vpc"
Environment = var.environment
ManagedBy = "terraform"
}
}
resource "aws_subnet" "public" {
count = length(var.availability_zones)
vpc_id = aws_vpc.main.id
cidr_block = cidrsubnet(var.vpc_cidr, 4, count.index)
availability_zone = var.availability_zones[count.index]
map_public_ip_on_launch = true
tags = {
Name = "${var.project_name}-${var.environment}-public-${count.index + 1}"
}
}
resource "aws_subnet" "private" {
count = length(var.availability_zones)
vpc_id = aws_vpc.main.id
cidr_block = cidrsubnet(var.vpc_cidr, 4, count.index + length(var.availability_zones))
availability_zone = var.availability_zones[count.index]
tags = {
Name = "${var.project_name}-${var.environment}-private-${count.index + 1}"
}
}
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
tags = { Name = "${var.project_name}-${var.environment}-igw" }
}
resource "aws_eip" "nat" {
count = length(var.availability_zones)
domain = "vpc"
}
resource "aws_nat_gateway" "main" {
count = length(var.availability_zones)
allocation_id = aws_eip.nat[count.index].id
subnet_id = aws_subnet.public[count.index].id
depends_on = [aws_internet_gateway.main]
tags = { Name = "${var.project_name}-${var.environment}-nat-${count.index + 1}" }
}
resource "aws_route_table" "public" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.main.id
}
}
resource "aws_route_table" "private" {
count = length(var.availability_zones)
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
nat_gateway_id = aws_nat_gateway.main[count.index].id
}
}
resource "aws_route_table_association" "public" {
count = length(var.availability_zones)
subnet_id = aws_subnet.public[count.index].id
route_table_id = aws_route_table.public.id
}
resource "aws_route_table_association" "private" {
count = length(var.availability_zones)
subnet_id = aws_subnet.private[count.index].id
route_table_id = aws_route_table.private[count.index].id
}
⚠️ Mistake #1: I originally used a single NAT gateway to save money. Fine for dev, but in prod this became a single point of failure when the AZ hosting the NAT had a brief issue. Use one NAT gateway per AZ in production.
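If you want the cheaper single-NAT layout in dev and the per-AZ layout in prod from the same module, a toggle is one way to express it. The sketch below is not the module as I run it: `single_nat_gateway` and `nat_count` are hypothetical names, and these definitions would replace the `aws_eip`, `aws_nat_gateway`, and private `aws_route_table` resources shown above rather than sit alongside them.

```hcl
# Hypothetical toggle: true = one shared NAT (dev), false = one NAT per AZ (prod)
variable "single_nat_gateway" {
  type    = bool
  default = false
}

locals {
  nat_count = var.single_nat_gateway ? 1 : length(var.availability_zones)
}

resource "aws_eip" "nat" {
  count  = local.nat_count
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  count         = local.nat_count
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
  depends_on    = [aws_internet_gateway.main]
}

resource "aws_route_table" "private" {
  count  = length(var.availability_zones)
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    # With a single NAT, every private route table points at index 0
    nat_gateway_id = aws_nat_gateway.main[var.single_nat_gateway ? 0 : count.index].id
  }
}
```

Leaving the default at `false` means an environment has to opt in to the single point of failure explicitly.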
Part 3: ECR Repository
# modules/ecr/main.tf
resource "aws_ecr_repository" "app" {
name = "${var.project_name}-${var.environment}"
image_tag_mutability = "IMMUTABLE"
image_scanning_configuration {
scan_on_push = true
}
encryption_configuration {
encryption_type = "AES256"
}
tags = {
Name = "${var.project_name}-${var.environment}"
Environment = var.environment
ManagedBy = "terraform"
}
}
resource "aws_ecr_lifecycle_policy" "app" {
repository = aws_ecr_repository.app.name
policy = jsonencode({
rules = [
{
rulePriority = 1
description = "Keep last 10 production images"
selection = {
tagStatus = "tagged"
tagPrefixList = ["prod-"]
countType = "imageCountMoreThan"
countNumber = 10
}
action = { type = "expire" }
},
{
rulePriority = 2
description = "Remove untagged images after 1 day"
selection = {
tagStatus = "untagged"
countType = "sinceImagePushed"
countUnit = "days"
countNumber = 1
}
action = { type = "expire" }
}
]
})
}
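The lifecycle policy is something most tutorials skip entirely. Without it, your ECR storage costs quietly creep up as old images pile up. Note that rule 1 only matches tags starting with prod-, so the prefix has to line up with whatever tag scheme your build actually pushes. The IMMUTABLE tag setting is also non-negotiable if you care about knowing exactly what's running in production: mutable tags let someone overwrite latest and suddenly your rollback target is a mystery. The flip side is that an immutable repository rejects any push that would move an existing tag, so a build that re-pushes latest on every run (as the buildspec in Part 6 does for its cache image) fails once that tag exists. The moving cache tag needs either a mutable repository of its own or a different caching approach.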
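Part 9 reads `module.ecr.repository_url` and `module.ecr.repository_name`, so the module has to export both. A minimal sketch of what that could look like; the file name is an assumption, while the output names are taken from Part 9.

```hcl
# modules/ecr/outputs.tf (assumed file name)
output "repository_url" {
  description = "Registry URL of the repository, without a tag"
  value       = aws_ecr_repository.app.repository_url
}

output "repository_name" {
  description = "Repository name, passed to CodeBuild as ECR_REPOSITORY_NAME"
  value       = aws_ecr_repository.app.name
}
```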
Part 4: IAM Roles
IAM is where most people either lock things down too tight (and spend two hours debugging permission errors) or give everything AdministratorAccess and move on. Neither is good. Here are the three roles you need.
# modules/iam/main.tf
# CodePipeline Role
resource "aws_iam_role" "codepipeline" {
name = "${var.project_name}-${var.environment}-codepipeline-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = { Service = "codepipeline.amazonaws.com" }
Action = "sts:AssumeRole"
}]
})
}
resource "aws_iam_role_policy" "codepipeline" {
name = "codepipeline-policy"
role = aws_iam_role.codepipeline.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = ["s3:GetObject", "s3:GetObjectVersion", "s3:PutObject", "s3:GetBucketVersioning"]
# The artifacts bucket is created in the codepipeline module, so its ARN arrives as a variable (see Part 9)
Resource = [var.artifacts_bucket_arn, "${var.artifacts_bucket_arn}/*"]
},
{
Effect = "Allow"
Action = ["codebuild:BatchGetBuilds", "codebuild:StartBuild"]
Resource = "*"
},
{
Effect = "Allow"
Action = [
"ecs:DescribeServices", "ecs:DescribeTaskDefinition",
"ecs:DescribeTasks", "ecs:ListTasks",
"ecs:RegisterTaskDefinition", "ecs:UpdateService"
]
Resource = "*"
},
{
Effect = "Allow"
Action = ["iam:PassRole"]
Resource = aws_iam_role.ecs_task_execution.arn
},
{
Effect = "Allow"
Action = ["codestar-connections:UseConnection"]
Resource = var.codestar_connection_arn
}
]
})
}
# CodeBuild Role
resource "aws_iam_role" "codebuild" {
name = "${var.project_name}-${var.environment}-codebuild-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = { Service = "codebuild.amazonaws.com" }
Action = "sts:AssumeRole"
}]
})
}
resource "aws_iam_role_policy" "codebuild" {
name = "codebuild-policy"
role = aws_iam_role.codebuild.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"]
Resource = "arn:aws:logs:${var.region}:${var.account_id}:log-group:/aws/codebuild/*"
},
{
Effect = "Allow"
Action = ["s3:GetObject", "s3:PutObject", "s3:GetBucketVersioning"]
Resource = [var.artifacts_bucket_arn, "${var.artifacts_bucket_arn}/*"]
},
{
Effect = "Allow"
Action = [
"ecr:GetAuthorizationToken", "ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer", "ecr:BatchGetImage",
"ecr:InitiateLayerUpload", "ecr:UploadLayerPart",
"ecr:CompleteLayerUpload", "ecr:PutImage"
]
Resource = "*"
},
{
Effect = "Allow"
Action = ["secretsmanager:GetSecretValue"]
Resource = "arn:aws:secretsmanager:${var.region}:${var.account_id}:secret:${var.project_name}/*"
}
]
})
}
# ECS Task Execution Role
resource "aws_iam_role" "ecs_task_execution" {
name = "${var.project_name}-${var.environment}-ecs-execution-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = { Service = "ecs-tasks.amazonaws.com" }
Action = "sts:AssumeRole"
}]
})
}
resource "aws_iam_role_policy_attachment" "ecs_task_execution" {
role = aws_iam_role.ecs_task_execution.name
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
resource "aws_iam_role_policy" "ecs_task_execution_secrets" {
name = "secrets-access"
role = aws_iam_role.ecs_task_execution.id
policy = jsonencode({
Version = "2012-10-17"
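Statement = [{
Effect = "Allow"
Action = ["secretsmanager:GetSecretValue"]
Resource = ["arn:aws:secretsmanager:${var.region}:${var.account_id}:secret:${var.project_name}/*"]
# kms:Decrypt has no effect against a secret ARN. If the secrets use a customer managed
# KMS key, grant kms:Decrypt in a separate statement whose Resource is that key's ARN.
}]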
})
}
⚠️ Mistake #2: My first CodeBuild role allowed ecr:* and s3:* on wildcard resources. It worked, but our security team flagged it immediately. The version above lists only the actions the build needs. ecr:GetAuthorizationToken doesn't support resource-level restrictions, so * is unavoidable for it (the remaining ECR actions could be narrowed further to the repository ARN), and Secrets Manager is scoped to a specific path prefix. Always do this.
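Part 9 also passes `module.iam.ecs_task_role_arn` into the ECS module, which implies a task role and a set of module outputs that the listing above doesn't show. A rough sketch of both follows. The `ecs_task` resource label and the file name are assumptions; the output names are the ones Part 9 references.

```hcl
# ECS task role: the identity the application code itself runs as.
# Hypothetical resource; attach only the policies your application needs.
resource "aws_iam_role" "ecs_task" {
  name = "${var.project_name}-${var.environment}-ecs-task-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

# modules/iam/outputs.tf (assumed file name)
output "codepipeline_role_arn" {
  value = aws_iam_role.codepipeline.arn
}

output "codebuild_role_arn" {
  value = aws_iam_role.codebuild.arn
}

output "ecs_task_execution_role_arn" {
  value = aws_iam_role.ecs_task_execution.arn
}

output "ecs_task_role_arn" {
  value = aws_iam_role.ecs_task.arn
}
```

If the task definition sets a task role, the CodePipeline role's iam:PassRole statement would need to list this role's ARN next to the execution role, because the deploy action registers task definitions that reference both.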
Part 5: ECS Cluster, Task Definition, and Service
# modules/ecs/main.tf
resource "aws_ecs_cluster" "main" {
name = "${var.project_name}-${var.environment}"
setting {
name = "containerInsights"
value = "enabled"
}
tags = {
Name = "${var.project_name}-${var.environment}"
Environment = var.environment
ManagedBy = "terraform"
}
}
resource "aws_ecs_cluster_capacity_providers" "main" {
cluster_name = aws_ecs_cluster.main.name
capacity_providers = ["FARGATE", "FARGATE_SPOT"]
default_capacity_provider_strategy {
base = 1
weight = 100
capacity_provider = "FARGATE"
}
}
resource "aws_cloudwatch_log_group" "app" {
name = "/ecs/${var.project_name}-${var.environment}"
retention_in_days = 30
}
resource "aws_ecs_task_definition" "app" {
family = "${var.project_name}-${var.environment}"
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
cpu = var.task_cpu
memory = var.task_memory
execution_role_arn = var.execution_role_arn
task_role_arn = var.task_role_arn
container_definitions = jsonencode([
{
name = var.container_name
image = "${var.ecr_repository_url}:${var.image_tag}"
essential = true
portMappings = [{ containerPort = var.container_port, protocol = "tcp" }]
environment = [
{ name = "APP_ENV", value = var.environment },
{ name = "PORT", value = tostring(var.container_port) }
]
# NOTE: Use secrets block, NOT environment block for sensitive values
secrets = [
{
name = "DATABASE_URL"
valueFrom = "arn:aws:secretsmanager:${var.region}:${var.account_id}:secret:${var.project_name}/${var.environment}/database-url"
}
]
logConfiguration = {
logDriver = "awslogs"
options = {
"awslogs-group" = "/ecs/${var.project_name}-${var.environment}"
"awslogs-region" = var.region
"awslogs-stream-prefix" = "ecs"
}
}
healthCheck = {
command = ["CMD-SHELL", "curl -f http://localhost:${var.container_port}/health || exit 1"]
interval = 30
timeout = 5
retries = 3
startPeriod = 60
}
}
])
lifecycle {
ignore_changes = [container_definitions]
}
}
resource "aws_lb" "main" {
name = "${var.project_name}-${var.environment}-alb"
internal = false
load_balancer_type = "application"
security_groups = [aws_security_group.alb.id]
subnets = var.public_subnet_ids
enable_deletion_protection = var.environment == "prod" ? true : false
}
resource "aws_lb_target_group" "app" {
name = "${var.project_name}-${var.environment}-tg"
port = var.container_port
protocol = "HTTP"
vpc_id = var.vpc_id
target_type = "ip"
health_check {
enabled = true
healthy_threshold = 2
interval = 30
matcher = "200"
path = "/health"
timeout = 5
unhealthy_threshold = 3
}
lifecycle { create_before_destroy = true }
}
resource "aws_lb_listener" "http" {
load_balancer_arn = aws_lb.main.arn
port = "80"
protocol = "HTTP"
default_action {
type = "redirect"
redirect {
port = "443"
protocol = "HTTPS"
status_code = "HTTP_301"
}
}
}
resource "aws_lb_listener" "https" {
load_balancer_arn = aws_lb.main.arn
port = "443"
protocol = "HTTPS"
ssl_policy = "ELBSecurityPolicy-TLS13-1-2-2021-06"
certificate_arn = var.certificate_arn
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.app.arn
}
}
resource "aws_ecs_service" "app" {
name = "${var.project_name}-${var.environment}"
cluster = aws_ecs_cluster.main.id
task_definition = aws_ecs_task_definition.app.arn
desired_count = var.desired_count
capacity_provider_strategy {
capacity_provider = "FARGATE"
weight = 100
base = 1
}
network_configuration {
subnets = var.private_subnet_ids
security_groups = [aws_security_group.ecs_tasks.id]
assign_public_ip = false
}
load_balancer {
target_group_arn = aws_lb_target_group.app.arn
container_name = var.container_name
container_port = var.container_port
}
deployment_controller { type = "ECS" }
deployment_circuit_breaker {
enable = true
rollback = true
}
deployment_maximum_percent = 200
deployment_minimum_healthy_percent = 100
# Critical: let CodePipeline own the task definition after initial deploy
lifecycle {
ignore_changes = [task_definition, desired_count]
}
depends_on = [aws_lb_listener.https]
}
The deployment_circuit_breaker with rollback = true is the safety net I wish I'd had from day one. If a deployment starts failing health checks, ECS automatically rolls back to the previous task definition — no human intervention needed at 2am.
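The listing above references `aws_security_group.alb` and `aws_security_group.ecs_tasks` without defining them. Here is one plausible shape for the pair, assuming the ALB is public and the tasks should accept traffic only from it; the names match the references above, everything else is an illustrative choice.

```hcl
# Hypothetical security groups for modules/ecs
resource "aws_security_group" "alb" {
  name_prefix = "${var.project_name}-${var.environment}-alb-"
  vpc_id      = var.vpc_id

  ingress {
    description = "HTTP from the internet (redirected to HTTPS by the listener)"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "HTTPS from the internet"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "ecs_tasks" {
  name_prefix = "${var.project_name}-${var.environment}-tasks-"
  vpc_id      = var.vpc_id

  ingress {
    description     = "Application port, reachable only from the ALB"
    from_port       = var.container_port
    to_port         = var.container_port
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

Referencing the ALB's security group instead of a CIDR range is what keeps the tasks unreachable from anywhere else in the VPC.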
Part 6: CodeBuild — The buildspec.yml
This lives in your application repository, not your infrastructure repo.
# buildspec.yml
version: 0.2

env:
  variables:
    DOCKER_BUILDKIT: "1"
  secrets-manager:
    DOCKER_HUB_TOKEN: myapp/docker-hub-token

phases:
  # No install phase: Docker is preinstalled on aws/codebuild/standard:7.0,
  # and that image does not accept docker under runtime-versions.
  pre_build:
    commands:
      - echo "Logging in to Amazon ECR..."
      - aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $ECR_REGISTRY
      # Pull cache image first - without this, --cache-from does nothing
      - docker pull $ECR_REGISTRY/$ECR_REPOSITORY_NAME:latest || true
      - IMAGE_TAG=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c1-7)
      - FULL_IMAGE_URI=$ECR_REGISTRY/$ECR_REPOSITORY_NAME:$IMAGE_TAG
      - echo "Image will be tagged as $FULL_IMAGE_URI"
  build:
    commands:
      - echo "Running tests..."
      - docker build --target test -t app-test .
      - docker run --rm app-test
      - echo "Build started on $(date)"
      # BUILDKIT_INLINE_CACHE=1 embeds cache metadata in the image; BuildKit
      # needs it before --cache-from can reuse layers from a pulled image
      - |
        docker build \
          --build-arg BUILDKIT_INLINE_CACHE=1 \
          --build-arg BUILD_DATE=$(date -u +'%Y-%m-%dT%H:%M:%SZ') \
          --build-arg GIT_COMMIT=$CODEBUILD_RESOLVED_SOURCE_VERSION \
          --cache-from $ECR_REGISTRY/$ECR_REPOSITORY_NAME:latest \
          -t $FULL_IMAGE_URI \
          -t $ECR_REGISTRY/$ECR_REPOSITORY_NAME:latest \
          .
  post_build:
    commands:
      - docker push $FULL_IMAGE_URI
      - docker push $ECR_REGISTRY/$ECR_REPOSITORY_NAME:latest
      - printf '[{"name":"%s","imageUri":"%s"}]' $CONTAINER_NAME $FULL_IMAGE_URI > imagedefinitions.json
      - cat imagedefinitions.json

artifacts:
  files:
    - imagedefinitions.json

cache:
  paths:
    - '/root/.gradle/caches/**/*'
    - '/root/.m2/**/*'
The imagedefinitions.json at the end is what CodePipeline's ECS deploy action reads to know which image to deploy. The format must be exactly right — an array with a single object containing name (matching your container name in the task definition exactly) and imageUri. A lot of people waste time debugging this.
Part 7: CodePipeline
# modules/codepipeline/main.tf
resource "aws_s3_bucket" "artifacts" {
bucket = "${var.project_name}-${var.environment}-pipeline-artifacts"
force_destroy = var.environment != "prod"
}
resource "aws_s3_bucket_versioning" "artifacts" {
bucket = aws_s3_bucket.artifacts.id
versioning_configuration { status = "Enabled" }
}
resource "aws_s3_bucket_server_side_encryption_configuration" "artifacts" {
bucket = aws_s3_bucket.artifacts.id
rule {
apply_server_side_encryption_by_default { sse_algorithm = "AES256" }
}
}
resource "aws_s3_bucket_public_access_block" "artifacts" {
bucket = aws_s3_bucket.artifacts.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
resource "aws_codebuild_project" "app" {
name = "${var.project_name}-${var.environment}-build"
build_timeout = 20
service_role = var.codebuild_role_arn
artifacts { type = "CODEPIPELINE" }
cache {
type = "S3"
location = "${aws_s3_bucket.artifacts.bucket}/build-cache"
}
environment {
compute_type = "BUILD_GENERAL1_SMALL"
image = "aws/codebuild/standard:7.0"
type = "LINUX_CONTAINER"
image_pull_credentials_type = "CODEBUILD"
privileged_mode = true # Required for Docker builds
environment_variable { name = "ECR_REGISTRY", value = var.ecr_registry }
environment_variable { name = "ECR_REPOSITORY_NAME", value = var.ecr_repository_name }
environment_variable { name = "CONTAINER_NAME", value = var.container_name }
environment_variable { name = "ENVIRONMENT", value = var.environment }
}
logs_config {
cloudwatch_logs {
group_name = "/aws/codebuild/${var.project_name}-${var.environment}"
stream_name = "build-log"
}
}
source { type = "CODEPIPELINE" }
}
resource "aws_codepipeline" "main" {
name = "${var.project_name}-${var.environment}-pipeline"
role_arn = var.codepipeline_role_arn
artifact_store {
location = aws_s3_bucket.artifacts.bucket
type = "S3"
}
stage {
name = "Source"
action {
name = "Source"
category = "Source"
owner = "AWS"
provider = "CodeStarSourceConnection"
version = "1"
output_artifacts = ["source_output"]
configuration = {
ConnectionArn = var.codestar_connection_arn
FullRepositoryId = var.github_repository
BranchName = var.branch_name
DetectChanges = true
}
}
}
stage {
name = "Build"
action {
name = "Build"
category = "Build"
owner = "AWS"
provider = "CodeBuild"
version = "1"
input_artifacts = ["source_output"]
output_artifacts = ["build_output"]
configuration = { ProjectName = aws_codebuild_project.app.name }
}
}
stage {
name = "Deploy"
action {
name = "Deploy"
category = "Deploy"
owner = "AWS"
provider = "ECS"
version = "1"
input_artifacts = ["build_output"]
configuration = {
ClusterName = var.ecs_cluster_name
ServiceName = var.ecs_service_name
FileName = "imagedefinitions.json"
}
}
}
}
Connecting to GitHub — Don't Skip This
The CodeStar Connection requires a one-time manual authorization in the AWS console. Terraform creates the connection resource, but it'll be in PENDING status until you go to Developer Tools → Settings → Connections and click Update pending connection.
resource "aws_codestarconnections_connection" "github" {
name = "${var.project_name}-github"
provider_type = "GitHub"
}
output "codestar_connection_arn" {
value = aws_codestarconnections_connection.github.arn
}
After terraform apply, go activate the connection manually. This is a security control — AWS will not let Terraform automatically authorize access to your GitHub account.
Part 8: Pipeline Notifications
resource "aws_sns_topic" "pipeline_notifications" {
name = "${var.project_name}-${var.environment}-pipeline-notifications"
}
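# NOTE: a raw Slack incoming webhook cannot confirm an SNS HTTPS subscription and expects
# a {"text": "..."} JSON body, so in practice this endpoint has to be a relay (for example
# a small Lambda function) that confirms the subscription and reformats the message.
resource "aws_sns_topic_subscription" "slack" {
topic_arn = aws_sns_topic.pipeline_notifications.arn
protocol = "https"
endpoint = var.slack_webhook_url
}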
resource "aws_cloudwatch_event_rule" "pipeline_state_change" {
name = "${var.project_name}-${var.environment}-pipeline-state-change"
description = "Capture CodePipeline state changes"
event_pattern = jsonencode({
source = ["aws.codepipeline"]
detail-type = ["CodePipeline Pipeline Execution State Change"]
detail = {
pipeline = [aws_codepipeline.main.name]
state = ["FAILED", "SUCCEEDED", "STARTED"]
}
})
}
resource "aws_cloudwatch_event_target" "pipeline_sns" {
rule = aws_cloudwatch_event_rule.pipeline_state_change.name
target_id = "SendToSNS"
arn = aws_sns_topic.pipeline_notifications.arn
input_transformer {
input_paths = {
pipeline = "$.detail.pipeline"
state = "$.detail.state"
time = "$.time"
}
input_template = "\"Pipeline <pipeline> changed state to <state> at <time>\""
}
}
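The rule above targets the SNS topic, but EventBridge can only publish to a topic whose resource policy allows the events.amazonaws.com service principal. A sketch of that policy, assuming it lives in the same module as the topic; the `sns_publish` data source name is hypothetical.

```hcl
# Allow EventBridge to publish pipeline state changes to the topic
data "aws_iam_policy_document" "sns_publish" {
  statement {
    effect    = "Allow"
    actions   = ["sns:Publish"]
    resources = [aws_sns_topic.pipeline_notifications.arn]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }
  }
}

resource "aws_sns_topic_policy" "pipeline_notifications" {
  arn    = aws_sns_topic.pipeline_notifications.arn
  policy = data.aws_iam_policy_document.sns_publish.json
}
```

Without this, the rule matches events but nothing ever reaches the topic. If you use the manual approval stage with a NotificationArn, the CodePipeline role would also need sns:Publish on this topic in its own policy.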
Part 9: Putting It All Together
# environments/dev/main.tf
module "networking" {
source = "../../modules/networking"
project_name = var.project_name
environment = "dev"
vpc_cidr = "10.0.0.0/16"
availability_zones = ["us-east-1a", "us-east-1b"]
}
module "ecr" {
source = "../../modules/ecr"
project_name = var.project_name
environment = "dev"
}
module "iam" {
source = "../../modules/iam"
project_name = var.project_name
environment = "dev"
region = var.region
account_id = var.account_id
codestar_connection_arn = var.codestar_connection_arn
artifacts_bucket_arn = module.codepipeline.artifacts_bucket_arn
}
module "ecs" {
source = "../../modules/ecs"
project_name = var.project_name
environment = "dev"
region = var.region
account_id = var.account_id
vpc_id = module.networking.vpc_id
public_subnet_ids = module.networking.public_subnet_ids
private_subnet_ids = module.networking.private_subnet_ids
execution_role_arn = module.iam.ecs_task_execution_role_arn
task_role_arn = module.iam.ecs_task_role_arn
ecr_repository_url = module.ecr.repository_url
container_name = var.container_name
container_port = 8080
task_cpu = 256
task_memory = 512
desired_count = 2
certificate_arn = var.certificate_arn
}
module "codepipeline" {
source = "../../modules/codepipeline"
project_name = var.project_name
environment = "dev"
codebuild_role_arn = module.iam.codebuild_role_arn
codepipeline_role_arn = module.iam.codepipeline_role_arn
ecr_registry = "${var.account_id}.dkr.ecr.${var.region}.amazonaws.com"
ecr_repository_name = module.ecr.repository_name
container_name = var.container_name
ecs_cluster_name = module.ecs.cluster_name
ecs_service_name = module.ecs.service_name
codestar_connection_arn = var.codestar_connection_arn
github_repository = var.github_repository
branch_name = "develop"
slack_webhook_url = var.slack_webhook_url
}
Run these in order the first time:
cd environments/dev
terraform init
terraform plan -out=tfplan
terraform apply tfplan
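The dev `main.tf` above leans on a handful of input variables. A sketch of a matching `variables.tf` is below: the variable names come straight from the module calls, while the types, the default region, and the descriptions are my assumptions.

```hcl
# environments/dev/variables.tf (sketch; names taken from main.tf above)
variable "project_name" {
  type = string
}

variable "region" {
  type    = string
  default = "us-east-1" # assumed default, matching the backend region
}

variable "account_id" {
  type        = string
  description = "12-digit AWS account ID, kept as a string to preserve leading zeros"
}

variable "container_name" {
  type = string
}

variable "certificate_arn" {
  type = string
}

variable "codestar_connection_arn" {
  type = string
}

variable "github_repository" {
  type        = string
  description = "Repository in owner/name form, as the CodeStar source action expects"
}

variable "slack_webhook_url" {
  type      = string
  sensitive = true # keeps the URL out of plan output
}
```

Marking the webhook URL as sensitive hides it in plan and apply output, but it is still stored in state, which is one more reason to keep the state bucket encrypted and private.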
The Mistakes Section
I promised honesty. Here's what actually broke.
Mistake 3 — Docker layer caching wasn't working in CodeBuild
Builds were taking 12–15 minutes because nothing was being cached. privileged_mode = true is required for Docker-in-Docker, but even with that, --cache-from only works if the image exists locally in the build environment. You need to explicitly pull it first:
pre_build:
  commands:
    - docker pull $ECR_REGISTRY/$ECR_REPOSITORY_NAME:latest || true
The || true prevents failure on the very first run when no cache exists yet. After adding this, builds dropped to under 4 minutes.
Mistake 4 — ECS tasks couldn't pull from ECR (CannotPullContainerError)
This one wasted an entire afternoon. The tasks were in private subnets with NAT gateways and had internet access — but ECR pulls were still failing.
The actual cause: an org-level SCP required ECR to be accessed via VPC endpoints. Once I added the three required endpoints, everything worked:
resource "aws_vpc_endpoint" "ecr_dkr" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.ecr.dkr"
vpc_endpoint_type = "Interface"
subnet_ids = aws_subnet.private[*].id
security_group_ids = [aws_security_group.vpc_endpoints.id]
private_dns_enabled = true
}
resource "aws_vpc_endpoint" "ecr_api" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.ecr.api"
vpc_endpoint_type = "Interface"
subnet_ids = aws_subnet.private[*].id
security_group_ids = [aws_security_group.vpc_endpoints.id]
private_dns_enabled = true
}
# ECR uses S3 internally for layer storage — this one is easy to miss
resource "aws_vpc_endpoint" "s3" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.s3"
vpc_endpoint_type = "Gateway"
route_table_ids = aws_route_table.private[*].id
}
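Even if your org doesn't require VPC endpoints, adding them keeps image pulls off the NAT gateways, which cuts NAT data-processing charges and can reduce latency. The S3 gateway endpoint is free; the interface endpoints bill hourly, so check the numbers for your traffic before calling it a pure saving.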
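The two interface endpoints reference `aws_security_group.vpc_endpoints`, which isn't shown above. A minimal sketch, assuming any workload inside the VPC may reach the endpoints over HTTPS:

```hcl
# Hypothetical security group for the ECR interface endpoints
resource "aws_security_group" "vpc_endpoints" {
  name_prefix = "${var.project_name}-${var.environment}-vpce-"
  vpc_id      = aws_vpc.main.id

  ingress {
    description = "HTTPS from inside the VPC"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [aws_vpc.main.cidr_block]
  }
}
```

If the same restriction applies to other services in your org, the tasks' CloudWatch Logs and Secrets Manager calls would need their own interface endpoints too (the logs and secretsmanager service names), following the same pattern.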
Mistake 5 — Terraform fighting CodePipeline over the task definition
Every terraform plan after a deployment wanted to reset the ECS service back to the original task definition ARN from .tfvars. The fix is the ignore_changes = [task_definition] lifecycle block on the ECS service. Without it, you have two systems fighting to own the same resource and you'll keep accidentally reverting deployments.
Mistake 6 — The CodeStar Connection was never activated
I spent 45 minutes staring at a pipeline that failed immediately at the Source stage with a cryptic error. The connection showed up in Terraform state as created successfully. Eventually I found it: connections start in PENDING state and need manual authorization in the console.
Always verify the connection status after creating it:
aws codestar-connections list-connections \
--provider-type-filter GitHub \
--query 'Connections[*].{Name:ConnectionName,Status:ConnectionStatus}'
Mistake 7 — Secrets in plain environment variables
My first version passed database credentials as plain environment variables in the task definition. Those show up in the ECS console, in CloudTrail, and in any container inspection. Always use the secrets block (shown in Part 5) which references Secrets Manager ARNs. The difference looks small in the code but is significant from a security standpoint.
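For reference, here is a rough sketch of how the secret behind DATABASE_URL could be declared so its name matches the path used in Part 5. The resource label and output name are assumptions, and the value is deliberately set outside Terraform so it never lands in state.

```hcl
# Declares the secret container only; the value is written out of band
resource "aws_secretsmanager_secret" "database_url" {
  name = "${var.project_name}/${var.environment}/database-url"
}

output "database_url_secret_arn" {
  description = "Full ARN, including the random suffix Secrets Manager appends"
  value       = aws_secretsmanager_secret.database_url.arn
}
```

Feeding this output into the task definition's valueFrom avoids hand-building the ARN, and the name still falls under the project-name prefix that the IAM policies in Part 4 allow.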
The Day-to-Day Deployment Workflow
Once everything is provisioned, here's what a normal deployment looks like:
1. Developer opens a PR against the develop branch
2. After review and merge, CodePipeline triggers automatically
3. CodeBuild runs tests and builds the Docker image
4. Image is pushed to ECR tagged with the commit SHA (first 7 chars)
5. imagedefinitions.json is written with the new image URI
6. CodePipeline's ECS deploy action registers a new task definition revision
7. ECS performs a rolling deployment (min 100% healthy, max 200%)
8. Health checks confirm new tasks are healthy
9. Old tasks are drained and stopped
10. Slack notification fires with SUCCEEDED status
If step 8 fails, the deployment circuit breaker rolls back to the previous task definition automatically.
For production, I added a manual approval stage between Build and Deploy:
stage {
name = "Approve"
action {
name = "ManualApproval"
category = "Approval"
owner = "AWS"
provider = "Manual"
version = "1"
configuration = {
NotificationArn = aws_sns_topic.pipeline_notifications.arn
CustomData = "Approve deployment to production for ${var.project_name}"
ExternalEntityLink = "https://github.com/${var.github_repository}/compare/main...develop"
}
}
}
Monitoring and Observability
The pipeline is only as good as your ability to know when something's wrong.
Check pipeline execution history:
aws codepipeline list-pipeline-executions \
--pipeline-name your-pipeline-name \
--max-results 20 \
--query 'pipelineExecutionSummaries[*].{Status:status,Start:startTime,Trigger:trigger.triggerType}'
First place to look when a deployment is stuck:
aws ecs describe-services \
--cluster your-cluster-name \
--services your-service-name \
--query 'services[0].events[:10]'
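CloudWatch Container Insights: enabling this on the ECS cluster (shown in Part 5) gives you CPU, memory, and network metrics per task without any additional instrumentation. It isn't free, though. The metrics are billed as CloudWatch custom metrics, so keep an eye on the cost as your task count grows.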
Final Thoughts
Three things I'd tell myself before starting this:
IAM permissions are worth getting right from the start. It takes maybe two extra hours to scope things properly, and it saves you from security findings and the awkward conversation about why your build role had admin access.
The circuit breaker and rollback settings on the ECS service are not optional. At some point you will deploy something that fails health checks in production. You want that to roll back automatically, not at 2am when someone pages you.
ignore_changes on your ECS service is what makes the Terraform + CodePipeline combination actually work. Without it, you have two systems fighting over the same resource and you'll keep silently reverting deployments you just shipped.
The full Terraform code for this post is available on GitHub at [link to your repo]. If you run into something I didn't cover, drop a comment — I'm genuinely curious what edge cases show up in different environments and org setups.
If this helped you, share it with your team. The pipeline is worth the effort.