Rahul Pandya

Posted on May 23

GitOps on AWS: End-to-End CI/CD Pipeline with CodePipeline + Terraform + ECS — Mistakes, Fixes & Lessons Learned

#aws #devops #productivity #tutorial

I want to be upfront about something before we get into the technical stuff: this pipeline took me three attempts to get right. The first version worked locally and fell apart in staging. The second version made it to production and immediately caused an incident on a Friday afternoon. The third version — the one I'm going to walk you through — has been running cleanly for several months now.

I'm writing this because, while building it, I couldn't find a single resource that went end-to-end and talked honestly about what actually goes wrong. Most tutorials show you the happy path. This one shows you the cliffs on either side of it.

Let's build something real.

What We're Actually Building

Before I dump architecture diagrams on you, here's the plain-English version of what this pipeline does:

A developer pushes code to a GitHub branch. Within a few minutes, that code is tested, built into a container image, pushed to ECR, and deployed to ECS Fargate — all without anyone SSHing into anything or clicking buttons in the console. Terraform manages every piece of infrastructure. If something goes wrong at any stage, the pipeline stops and notifies the team. If a deployment goes sideways, ECS rolls back automatically.

That's it. That's the whole goal.

Here's what the stack looks like:

Source control: GitHub (connected via CodeStar Connections)
CI/CD orchestration: AWS CodePipeline
Build: AWS CodeBuild
Infrastructure as Code: Terraform (with remote state in S3 + DynamoDB locking)
Container registry: Amazon ECR
Compute: Amazon ECS on Fargate
Secrets: AWS Secrets Manager
Notifications: SNS + Slack webhook

Prerequisites

Before anything else, make sure you have:

AWS CLI configured with a profile that has sufficient permissions
Terraform >= 1.5 installed
Docker installed locally (for testing builds)
A GitHub repo with your application code
An AWS account where you can create IAM roles, VPCs, ECS clusters, etc.

I'm going to assume you're comfortable with Terraform basics and have touched ECS at least once. I won't explain what a task definition is from scratch.

Part 1: Terraform Project Structure

One of the biggest mistakes I made early on was treating Terraform as an afterthought — writing pipeline config first and then trying to retrofit IaC around it. Don't do that. Start with Terraform.

Here's the directory layout I settled on after a lot of trial and error:

infrastructure/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   └── prod/
│       ├── main.tf
│       ├── variables.tf
│       └── terraform.tfvars
├── modules/
│   ├── ecr/
│   ├── ecs/
│   ├── codepipeline/
│   ├── iam/
│   └── networking/
└── backend.tf

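The environments folder separates dev and prod so each has its own independent state file. This saved me twice when I fat-fingered something in dev and was glad it couldn't touch prod. One caveat: Terraform only reads the backend block from the directory you run it in, so the backend configuration shown next has to live in each environment directory, with a distinct key per environment.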

Setting Up Remote State First

Do this before anything else. Local state in a team environment is how you end up with two people applying Terraform at the same time, overwriting each other's state, and leaving it out of sync with what actually exists in AWS.

# backend.tf
terraform {
  backend "s3" {
    bucket         = "your-company-terraform-state"
    key            = "gitops-demo/dev/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}

Create the S3 bucket and DynamoDB table manually before running terraform init. The DynamoDB table needs a LockID string attribute as its primary key.

# One-time bootstrap
aws s3api create-bucket \
  --bucket your-company-terraform-state \
  --region us-east-1

aws s3api put-bucket-versioning \
  --bucket your-company-terraform-state \
  --versioning-configuration Status=Enabled

aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region us-east-1

Enable versioning on the bucket. I learned this the hard way when a botched apply wiped out a resource and I needed to roll back the state file.
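If you are on a newer Terraform release, the S3 backend can also hold the lock itself. This is a hedged sketch, assuming Terraform 1.10 or later (the post only requires 1.5, so treat it as an alternative rather than what this pipeline uses). The bucket and key are the same placeholders as in backend.tf above.

```hcl
# Alternative backend block: S3-native locking instead of a DynamoDB table.
# Assumes Terraform >= 1.10; on older versions keep dynamodb_table as shown earlier.
terraform {
  backend "s3" {
    bucket       = "your-company-terraform-state"
    key          = "gitops-demo/dev/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true # lock file is written next to the state object in S3
  }
}
```

Switching an existing backend between locking mechanisms needs a terraform init -reconfigure, so it is easier to decide this before the first apply.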

Part 2: Networking Module

Public subnets for the load balancer, private subnets for the Fargate tasks. The tasks should never be directly internet-accessible.

# modules/networking/main.tf

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name        = "${var.project_name}-${var.environment}-vpc"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_subnet" "public" {
  count                   = length(var.availability_zones)
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(var.vpc_cidr, 4, count.index)
  availability_zone       = var.availability_zones[count.index]
  map_public_ip_on_launch = true

  tags = {
    Name = "${var.project_name}-${var.environment}-public-${count.index + 1}"
  }
}

resource "aws_subnet" "private" {
  count             = length(var.availability_zones)
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 4, count.index + length(var.availability_zones))
  availability_zone = var.availability_zones[count.index]

  tags = {
    Name = "${var.project_name}-${var.environment}-private-${count.index + 1}"
  }
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
  tags   = { Name = "${var.project_name}-${var.environment}-igw" }
}

resource "aws_eip" "nat" {
  count  = length(var.availability_zones)
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  count         = length(var.availability_zones)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
  depends_on    = [aws_internet_gateway.main]

  tags = { Name = "${var.project_name}-${var.environment}-nat-${count.index + 1}" }
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table" "private" {
  count  = length(var.availability_zones)
  vpc_id = aws_vpc.main.id
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main[count.index].id
  }
}

resource "aws_route_table_association" "public" {
  count          = length(var.availability_zones)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  count          = length(var.availability_zones)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}

⚠️ Mistake #1: I originally used a single NAT gateway to save money. Fine for dev, but in prod this became a single point of failure when the AZ hosting the NAT had a brief issue. Use one NAT gateway per AZ in production.
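One way to keep the cheap single-NAT setup for dev without giving up per-AZ NAT in prod is a toggle on the module. The sketch below is an assumption on my part, not part of the original module: the variable single_nat_gateway and the local nat_count are hypothetical names, and these blocks would replace the aws_eip, aws_nat_gateway, and private aws_route_table resources shown above.

```hcl
# Hypothetical toggle, not in the original module.
variable "single_nat_gateway" {
  description = "Use one shared NAT gateway (cheaper, but a single point of failure)"
  type        = bool
  default     = false
}

locals {
  # One NAT in total, or one per availability zone.
  nat_count = var.single_nat_gateway ? 1 : length(var.availability_zones)
}

resource "aws_eip" "nat" {
  count  = local.nat_count
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  count         = local.nat_count
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
  depends_on    = [aws_internet_gateway.main]
}

resource "aws_route_table" "private" {
  count  = length(var.availability_zones)
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    # With a single NAT, every private route table points at index 0.
    nat_gateway_id = aws_nat_gateway.main[var.single_nat_gateway ? 0 : count.index].id
  }
}
```

Dev would set single_nat_gateway = true in its module call and prod would leave the default.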

Part 3: ECR Repository

# modules/ecr/main.tf

resource "aws_ecr_repository" "app" {
  name                 = "${var.project_name}-${var.environment}"
  image_tag_mutability = "IMMUTABLE"

  image_scanning_configuration {
    scan_on_push = true
  }

  encryption_configuration {
    encryption_type = "AES256"
  }

  tags = {
    Name        = "${var.project_name}-${var.environment}"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_ecr_lifecycle_policy" "app" {
  repository = aws_ecr_repository.app.name

  policy = jsonencode({
    rules = [
      {
        rulePriority = 1
        description  = "Keep last 10 production images"
        selection = {
          tagStatus     = "tagged"
          tagPrefixList = ["prod-"]
          countType     = "imageCountMoreThan"
          countNumber   = 10
        }
        action = { type = "expire" }
      },
      {
        rulePriority = 2
        description  = "Remove untagged images after 1 day"
        selection = {
          tagStatus   = "untagged"
          countType   = "sinceImagePushed"
          countUnit   = "days"
          countNumber = 1
        }
        action = { type = "expire" }
      }
    ]
  })
}

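The lifecycle policy is something most tutorials skip entirely. Without it, your ECR storage costs quietly creep up as old images pile up. Note that rule 1 only matches tags starting with prod-, so adjust the prefix if your images are tagged differently (the buildspec in Part 6 tags by commit SHA). The IMMUTABLE tag setting is also non-negotiable if you care about knowing exactly what's running in production: mutable tags let someone overwrite latest and suddenly your rollback target is a mystery. The flip side is that an immutable repository rejects any push of a tag that already exists, so a moving tag such as latest can only be pushed once; if you rely on one for build caching, keep it in a separate mutable repository.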
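If your images are tagged by commit SHA rather than with a fixed prefix, a count-based rule over all images is one option. This is a minimal sketch that would stand in for the policy above. The retention count of 30 is an arbitrary assumption; pick a number that comfortably covers how far back you might need to roll back.

```hcl
resource "aws_ecr_lifecycle_policy" "app" {
  repository = aws_ecr_repository.app.name

  policy = jsonencode({
    rules = [
      {
        rulePriority = 1
        description  = "Remove untagged images after 1 day"
        selection = {
          tagStatus   = "untagged"
          countType   = "sinceImagePushed"
          countUnit   = "days"
          countNumber = 1
        }
        action = { type = "expire" }
      },
      {
        # A rule with tagStatus = "any" has to carry the highest rulePriority number.
        rulePriority = 2
        description  = "Keep only the 30 most recent images, whatever their tag"
        selection = {
          tagStatus   = "any"
          countType   = "imageCountMoreThan"
          countNumber = 30
        }
        action = { type = "expire" }
      }
    ]
  })
}
```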

Part 4: IAM Roles

IAM is where most people either lock things down too tight (and spend two hours debugging permission errors) or give everything AdministratorAccess and move on. Neither is good. Here are the three roles defined in this module (Part 9 also passes a separate task role, ecs_task_role_arn, which isn't shown here).

# modules/iam/main.tf

# CodePipeline Role
resource "aws_iam_role" "codepipeline" {
  name = "${var.project_name}-${var.environment}-codepipeline-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "codepipeline.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "codepipeline" {
  name = "codepipeline-policy"
  role = aws_iam_role.codepipeline.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = ["s3:GetObject", "s3:GetObjectVersion", "s3:PutObject", "s3:GetBucketVersioning"]
        Resource = [var.artifacts_bucket_arn, "${var.artifacts_bucket_arn}/*"]
      },
      {
        Effect   = "Allow"
        Action   = ["codebuild:BatchGetBuilds", "codebuild:StartBuild"]
        Resource = "*"
      },
      {
        Effect = "Allow"
        Action = [
          "ecs:DescribeServices", "ecs:DescribeTaskDefinition",
          "ecs:DescribeTasks", "ecs:ListTasks",
          "ecs:RegisterTaskDefinition", "ecs:UpdateService"
        ]
        Resource = "*"
      },
      {
        Effect   = "Allow"
        Action   = ["iam:PassRole"]
        Resource = aws_iam_role.ecs_task_execution.arn
      },
      {
        Effect   = "Allow"
        Action   = ["codestar-connections:UseConnection"]
        Resource = var.codestar_connection_arn
      }
    ]
  })
}

# CodeBuild Role
resource "aws_iam_role" "codebuild" {
  name = "${var.project_name}-${var.environment}-codebuild-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "codebuild.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "codebuild" {
  name = "codebuild-policy"
  role = aws_iam_role.codebuild.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"]
        Resource = "arn:aws:logs:${var.region}:${var.account_id}:log-group:/aws/codebuild/*"
      },
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject", "s3:GetBucketVersioning"]
        Resource = [var.artifacts_bucket_arn, "${var.artifacts_bucket_arn}/*"]
      },
      {
        Effect = "Allow"
        Action = [
          "ecr:GetAuthorizationToken", "ecr:BatchCheckLayerAvailability",
          "ecr:GetDownloadUrlForLayer", "ecr:BatchGetImage",
          "ecr:InitiateLayerUpload", "ecr:UploadLayerPart",
          "ecr:CompleteLayerUpload", "ecr:PutImage"
        ]
        Resource = "*"
      },
      {
        Effect   = "Allow"
        Action   = ["secretsmanager:GetSecretValue"]
        Resource = "arn:aws:secretsmanager:${var.region}:${var.account_id}:secret:${var.project_name}/*"
      }
    ]
  })
}

# ECS Task Execution Role
resource "aws_iam_role" "ecs_task_execution" {
  name = "${var.project_name}-${var.environment}-ecs-execution-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "ecs_task_execution" {
  role       = aws_iam_role.ecs_task_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

resource "aws_iam_role_policy" "ecs_task_execution_secrets" {
  name = "secrets-access"
  role = aws_iam_role.ecs_task_execution.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
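      Effect   = "Allow"
      # kms:Decrypt is left out on purpose: it only applies to KMS key ARNs, so pairing
      # it with a secret ARN grants nothing. If the secrets use a customer-managed key,
      # add a separate statement for kms:Decrypt on that key's ARN.
      Action   = ["secretsmanager:GetSecretValue"]
      Resource = ["arn:aws:secretsmanager:${var.region}:${var.account_id}:secret:${var.project_name}/*"]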
    }]
  })
}

⚠️ Mistake #2: My first CodeBuild role allowed ecr:* and s3:* on all resources. It worked, but our security team flagged it immediately. The version above limits ECR to the specific actions a build needs. ecr:GetAuthorizationToken doesn't support resource-level restrictions, so * is unavoidable for it; the remaining push and pull actions can be scoped further to the repository ARN. Secrets Manager access is scoped to a specific path prefix. Always do this.
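To take the ECR scoping one step further, the statements can be split so that only the token call uses a wildcard. This is a sketch under assumptions: ecr_repository_arn is a hypothetical input variable (the ARN of the repository from the ecr module), and the policy is attached as an extra inline policy rather than edited into the one above.

```hcl
# Hypothetical input: ARN of the repository created in the ecr module.
variable "ecr_repository_arn" {
  type = string
}

data "aws_iam_policy_document" "codebuild_ecr" {
  # GetAuthorizationToken has no resource-level support, so it stays on "*".
  statement {
    effect    = "Allow"
    actions   = ["ecr:GetAuthorizationToken"]
    resources = ["*"]
  }

  # Push and pull actions are limited to the single repository the build uses.
  statement {
    effect = "Allow"
    actions = [
      "ecr:BatchCheckLayerAvailability",
      "ecr:GetDownloadUrlForLayer",
      "ecr:BatchGetImage",
      "ecr:InitiateLayerUpload",
      "ecr:UploadLayerPart",
      "ecr:CompleteLayerUpload",
      "ecr:PutImage",
    ]
    resources = [var.ecr_repository_arn]
  }
}

resource "aws_iam_role_policy" "codebuild_ecr" {
  name   = "codebuild-ecr"
  role   = aws_iam_role.codebuild.id
  policy = data.aws_iam_policy_document.codebuild_ecr.json
}
```

If you adopt this, remove the ECR statement from the main CodeBuild policy so the wildcard version no longer applies.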

Part 5: ECS Cluster, Task Definition, and Service

# modules/ecs/main.tf

resource "aws_ecs_cluster" "main" {
  name = "${var.project_name}-${var.environment}"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }

  tags = {
    Name        = "${var.project_name}-${var.environment}"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_ecs_cluster_capacity_providers" "main" {
  cluster_name       = aws_ecs_cluster.main.name
  capacity_providers = ["FARGATE", "FARGATE_SPOT"]

  default_capacity_provider_strategy {
    base              = 1
    weight            = 100
    capacity_provider = "FARGATE"
  }
}

resource "aws_cloudwatch_log_group" "app" {
  name              = "/ecs/${var.project_name}-${var.environment}"
  retention_in_days = 30
}

resource "aws_ecs_task_definition" "app" {
  family                   = "${var.project_name}-${var.environment}"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = var.task_cpu
  memory                   = var.task_memory
  execution_role_arn       = var.execution_role_arn
  task_role_arn            = var.task_role_arn

  container_definitions = jsonencode([
    {
      name      = var.container_name
      image     = "${var.ecr_repository_url}:${var.image_tag}"
      essential = true

      portMappings = [{ containerPort = var.container_port, protocol = "tcp" }]

      environment = [
        { name = "APP_ENV", value = var.environment },
        { name = "PORT",    value = tostring(var.container_port) }
      ]

      # NOTE: Use secrets block, NOT environment block for sensitive values
      secrets = [
        {
          name      = "DATABASE_URL"
          valueFrom = "arn:aws:secretsmanager:${var.region}:${var.account_id}:secret:${var.project_name}/${var.environment}/database-url"
        }
      ]

      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = "/ecs/${var.project_name}-${var.environment}"
          "awslogs-region"        = var.region
          "awslogs-stream-prefix" = "ecs"
        }
      }

      healthCheck = {
        command     = ["CMD-SHELL", "curl -f http://localhost:${var.container_port}/health || exit 1"]
        interval    = 30
        timeout     = 5
        retries     = 3
        startPeriod = 60
      }
    }
  ])

  lifecycle {
    ignore_changes = [container_definitions]
  }
}

resource "aws_lb" "main" {
  name                       = "${var.project_name}-${var.environment}-alb"
  internal                   = false
  load_balancer_type         = "application"
  security_groups            = [aws_security_group.alb.id]
  subnets                    = var.public_subnet_ids
  enable_deletion_protection = var.environment == "prod"
}

resource "aws_lb_target_group" "app" {
  name        = "${var.project_name}-${var.environment}-tg"
  port        = var.container_port
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"

  health_check {
    enabled             = true
    healthy_threshold   = 2
    interval            = 30
    matcher             = "200"
    path                = "/health"
    timeout             = 5
    unhealthy_threshold = 3
  }

  lifecycle { create_before_destroy = true }
}

resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.main.arn
  port              = "80"
  protocol          = "HTTP"

  default_action {
    type = "redirect"
    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.main.arn
  port              = "443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = var.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}

resource "aws_ecs_service" "app" {
  name            = "${var.project_name}-${var.environment}"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = var.desired_count

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 100
    base              = 1
  }

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.ecs_tasks.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = var.container_name
    container_port   = var.container_port
  }

  deployment_controller { type = "ECS" }

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100

  # Critical: let CodePipeline own the task definition after initial deploy
  lifecycle {
    ignore_changes = [task_definition, desired_count]
  }

  depends_on = [aws_lb_listener.https]
}

The deployment_circuit_breaker with rollback = true is the safety net I wish I'd had from day one. If a deployment starts failing health checks, ECS automatically rolls back to the previous task definition — no human intervention needed at 2am.
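The load balancer and service above reference aws_security_group.alb and aws_security_group.ecs_tasks, which this post doesn't show. Here is a minimal sketch of what they might look like, assuming the ALB is public on ports 80 and 443 and the tasks should only accept traffic from the ALB. The name prefixes are my own choice.

```hcl
resource "aws_security_group" "alb" {
  name_prefix = "${var.project_name}-${var.environment}-alb-"
  vpc_id      = var.vpc_id

  ingress {
    description = "HTTP from anywhere (redirected to HTTPS by the listener)"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "HTTPS from anywhere"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "ecs_tasks" {
  name_prefix = "${var.project_name}-${var.environment}-tasks-"
  vpc_id      = var.vpc_id

  ingress {
    description     = "App port, reachable only from the ALB"
    from_port       = var.container_port
    to_port         = var.container_port
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  # Outbound is open so tasks can reach ECR, Secrets Manager, and CloudWatch Logs.
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```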

Part 6: CodeBuild — The buildspec.yml

This lives in your application repository, not your infrastructure repo.

# buildspec.yml
version: 0.2

env:
  variables:
    DOCKER_BUILDKIT: "1"
  secrets-manager:
    DOCKER_HUB_TOKEN: myapp/docker-hub-token

phases:
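  # No install phase: the aws/codebuild/standard:7.0 image already ships Docker,
  # and docker is not a selectable entry under runtime-versions on that image.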

  pre_build:
    commands:
      - echo "Logging in to Amazon ECR..."
      - aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $ECR_REGISTRY
      # Pull cache image first - without this, --cache-from does nothing
      - docker pull $ECR_REGISTRY/$ECR_REPOSITORY_NAME:latest || true
      - IMAGE_TAG=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c1-7)
      - FULL_IMAGE_URI=$ECR_REGISTRY/$ECR_REPOSITORY_NAME:$IMAGE_TAG
      - echo "Image will be tagged as $FULL_IMAGE_URI"

  build:
    commands:
      - echo "Running tests..."
      - docker build --target test -t app-test .
      - docker run --rm app-test
      - echo "Build started on $(date)"
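      # With BuildKit, --cache-from only reuses layers when the cached image was
      # built with inline cache metadata, hence BUILDKIT_INLINE_CACHE=1.
      - |
        docker build \
          --build-arg BUILD_DATE=$(date -u +'%Y-%m-%dT%H:%M:%SZ') \
          --build-arg GIT_COMMIT=$CODEBUILD_RESOLVED_SOURCE_VERSION \
          --build-arg BUILDKIT_INLINE_CACHE=1 \
          --cache-from $ECR_REGISTRY/$ECR_REPOSITORY_NAME:latest \
          -t $FULL_IMAGE_URI \
          -t $ECR_REGISTRY/$ECR_REPOSITORY_NAME:latest \
          .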

  post_build:
    commands:
      - docker push $FULL_IMAGE_URI
      - docker push $ECR_REGISTRY/$ECR_REPOSITORY_NAME:latest
      - printf '[{"name":"%s","imageUri":"%s"}]' $CONTAINER_NAME $FULL_IMAGE_URI > imagedefinitions.json
      - cat imagedefinitions.json

artifacts:
  files:
    - imagedefinitions.json

cache:
  paths:
    - '/root/.gradle/caches/**/*'
    - '/root/.m2/**/*'

The imagedefinitions.json at the end is what CodePipeline's ECS deploy action reads to know which image to deploy. The format must be exactly right — an array with a single object containing name (matching your container name in the task definition exactly) and imageUri. A lot of people waste time debugging this.

Part 7: CodePipeline

# modules/codepipeline/main.tf

resource "aws_s3_bucket" "artifacts" {
  bucket        = "${var.project_name}-${var.environment}-pipeline-artifacts"
  force_destroy = var.environment != "prod"
}

resource "aws_s3_bucket_versioning" "artifacts" {
  bucket = aws_s3_bucket.artifacts.id
  versioning_configuration { status = "Enabled" }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "artifacts" {
  bucket = aws_s3_bucket.artifacts.id
  rule {
    apply_server_side_encryption_by_default { sse_algorithm = "AES256" }
  }
}

resource "aws_s3_bucket_public_access_block" "artifacts" {
  bucket                  = aws_s3_bucket.artifacts.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_codebuild_project" "app" {
  name          = "${var.project_name}-${var.environment}-build"
  build_timeout = 20
  service_role  = var.codebuild_role_arn

  artifacts { type = "CODEPIPELINE" }

  cache {
    type     = "S3"
    location = "${aws_s3_bucket.artifacts.bucket}/build-cache"
  }

  environment {
    compute_type                = "BUILD_GENERAL1_SMALL"
    image                       = "aws/codebuild/standard:7.0"
    type                        = "LINUX_CONTAINER"
    image_pull_credentials_type = "CODEBUILD"
    privileged_mode             = true # Required for Docker builds

    environment_variable { name = "ECR_REGISTRY",       value = var.ecr_registry }
    environment_variable { name = "ECR_REPOSITORY_NAME", value = var.ecr_repository_name }
    environment_variable { name = "CONTAINER_NAME",      value = var.container_name }
    environment_variable { name = "ENVIRONMENT",         value = var.environment }
  }

  logs_config {
    cloudwatch_logs {
      group_name  = "/aws/codebuild/${var.project_name}-${var.environment}"
      stream_name = "build-log"
    }
  }

  source { type = "CODEPIPELINE" }
}

resource "aws_codepipeline" "main" {
  name     = "${var.project_name}-${var.environment}-pipeline"
  role_arn = var.codepipeline_role_arn

  artifact_store {
    location = aws_s3_bucket.artifacts.bucket
    type     = "S3"
  }

  stage {
    name = "Source"
    action {
      name             = "Source"
      category         = "Source"
      owner            = "AWS"
      provider         = "CodeStarSourceConnection"
      version          = "1"
      output_artifacts = ["source_output"]

      configuration = {
        ConnectionArn    = var.codestar_connection_arn
        FullRepositoryId = var.github_repository
        BranchName       = var.branch_name
        DetectChanges    = true
      }
    }
  }

  stage {
    name = "Build"
    action {
      name             = "Build"
      category         = "Build"
      owner            = "AWS"
      provider         = "CodeBuild"
      version          = "1"
      input_artifacts  = ["source_output"]
      output_artifacts = ["build_output"]
      configuration    = { ProjectName = aws_codebuild_project.app.name }
    }
  }

  stage {
    name = "Deploy"
    action {
      name            = "Deploy"
      category        = "Deploy"
      owner           = "AWS"
      provider        = "ECS"
      version         = "1"
      input_artifacts = ["build_output"]

      configuration = {
        ClusterName = var.ecs_cluster_name
        ServiceName = var.ecs_service_name
        FileName    = "imagedefinitions.json"
      }
    }
  }
}

Connecting to GitHub — Don't Skip This

The CodeStar Connection requires a one-time manual authorization in the AWS console. Terraform creates the connection resource, but it'll be in PENDING status until you go to Developer Tools → Settings → Connections and click Update pending connection.

resource "aws_codestarconnections_connection" "github" {
  name          = "${var.project_name}-github"
  provider_type = "GitHub"
}

output "codestar_connection_arn" {
  value = aws_codestarconnections_connection.github.arn
}

After terraform apply, go activate the connection manually. This is a security control — AWS will not let Terraform automatically authorize access to your GitHub account.

Part 8: Pipeline Notifications

resource "aws_sns_topic" "pipeline_notifications" {
  name = "${var.project_name}-${var.environment}-pipeline-notifications"
}

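# Caveat: an SNS HTTPS subscription must be confirmed by the endpoint, and SNS delivers
# its own JSON envelope. A raw Slack incoming webhook does neither, so in practice
# this endpoint is usually a small relay (for example a Lambda function).
resource "aws_sns_topic_subscription" "slack" {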
  topic_arn = aws_sns_topic.pipeline_notifications.arn
  protocol  = "https"
  endpoint  = var.slack_webhook_url
}

resource "aws_cloudwatch_event_rule" "pipeline_state_change" {
  name        = "${var.project_name}-${var.environment}-pipeline-state-change"
  description = "Capture CodePipeline state changes"

  event_pattern = jsonencode({
    source      = ["aws.codepipeline"]
    detail-type = ["CodePipeline Pipeline Execution State Change"]
    detail = {
      pipeline = [aws_codepipeline.main.name]
      state    = ["FAILED", "SUCCEEDED", "STARTED"]
    }
  })
}

resource "aws_cloudwatch_event_target" "pipeline_sns" {
  rule      = aws_cloudwatch_event_rule.pipeline_state_change.name
  target_id = "SendToSNS"
  arn       = aws_sns_topic.pipeline_notifications.arn

  input_transformer {
    input_paths = {
      pipeline = "$.detail.pipeline"
      state    = "$.detail.state"
      time     = "$.time"
    }
    input_template = "\"Pipeline <pipeline> changed state to <state> at <time>\""
  }
}
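One piece the event target above depends on is permission for EventBridge to publish to the topic; without a topic policy the rule matches but nothing is delivered. A minimal sketch, where the resource names allow_eventbridge_publish and pipeline_notifications are my own:

```hcl
data "aws_iam_policy_document" "allow_eventbridge_publish" {
  statement {
    effect    = "Allow"
    actions   = ["sns:Publish"]
    resources = [aws_sns_topic.pipeline_notifications.arn]

    # EventBridge publishes as the events.amazonaws.com service principal.
    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }
  }
}

resource "aws_sns_topic_policy" "pipeline_notifications" {
  arn    = aws_sns_topic.pipeline_notifications.arn
  policy = data.aws_iam_policy_document.allow_eventbridge_publish.json
}
```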

Part 9: Putting It All Together

# environments/dev/main.tf

module "networking" {
  source             = "../../modules/networking"
  project_name       = var.project_name
  environment        = "dev"
  vpc_cidr           = "10.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b"]
}

module "ecr" {
  source       = "../../modules/ecr"
  project_name = var.project_name
  environment  = "dev"
}

module "iam" {
  source                  = "../../modules/iam"
  project_name            = var.project_name
  environment             = "dev"
  region                  = var.region
  account_id              = var.account_id
  codestar_connection_arn = var.codestar_connection_arn
  artifacts_bucket_arn    = module.codepipeline.artifacts_bucket_arn
}

module "ecs" {
  source              = "../../modules/ecs"
  project_name        = var.project_name
  environment         = "dev"
  region              = var.region
  account_id          = var.account_id
  vpc_id              = module.networking.vpc_id
  public_subnet_ids   = module.networking.public_subnet_ids
  private_subnet_ids  = module.networking.private_subnet_ids
  execution_role_arn  = module.iam.ecs_task_execution_role_arn
  task_role_arn       = module.iam.ecs_task_role_arn
  ecr_repository_url  = module.ecr.repository_url
  container_name      = var.container_name
  container_port      = 8080
  task_cpu            = 256
  task_memory         = 512
  desired_count       = 2
  certificate_arn     = var.certificate_arn
}

module "codepipeline" {
  source                  = "../../modules/codepipeline"
  project_name            = var.project_name
  environment             = "dev"
  codebuild_role_arn      = module.iam.codebuild_role_arn
  codepipeline_role_arn   = module.iam.codepipeline_role_arn
  ecr_registry            = "${var.account_id}.dkr.ecr.${var.region}.amazonaws.com"
  ecr_repository_name     = module.ecr.repository_name
  container_name          = var.container_name
  ecs_cluster_name        = module.ecs.cluster_name
  ecs_service_name        = module.ecs.service_name
  codestar_connection_arn = var.codestar_connection_arn
  github_repository       = var.github_repository
  branch_name             = "develop"
  slack_webhook_url       = var.slack_webhook_url
}

Run these in order the first time:

cd environments/dev
terraform init
terraform plan -out=tfplan
terraform apply tfplan

The Mistakes Section

I promised honesty. Here's what actually broke.

Mistake 3 — Docker layer caching wasn't working in CodeBuild

Builds were taking 12–15 minutes because nothing was being cached. privileged_mode = true is required for Docker-in-Docker, but even with that, --cache-from only works if the image exists locally in the build environment. You need to explicitly pull it first:

pre_build:
  commands:
    - docker pull $ECR_REGISTRY/$ECR_REPOSITORY_NAME:latest || true

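The || true prevents failure on the very first run when no cache exists yet. After adding the pull, builds dropped to under 4 minutes. One more thing to check: with BuildKit enabled (DOCKER_BUILDKIT=1, as in the buildspec), the cached image must also have been built with --build-arg BUILDKIT_INLINE_CACHE=1, otherwise it carries no cache metadata to reuse.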

Mistake 4 — ECS tasks couldn't pull from ECR (CannotPullContainerError)

This one wasted an entire afternoon. The tasks were in private subnets with NAT gateways and had internet access — but ECR pulls were still failing.

The actual cause: an org-level SCP required ECR to be accessed via VPC endpoints. Once I added the three required endpoints, everything worked:

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

# ECR uses S3 internally for layer storage — this one is easy to miss
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.private[*].id
}

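Even if your org doesn't require VPC endpoints, routing image pulls through them instead of the NAT gateways cuts NAT data-processing charges and can reduce latency. The interface endpoints carry their own hourly cost, so check the numbers for small environments, but the S3 gateway endpoint is free and worth adding regardless.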
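The interface endpoints above reference aws_security_group.vpc_endpoints, which isn't shown. A minimal sketch of it, assuming it lives in the networking module next to the VPC and that any workload inside the VPC may reach the endpoints over HTTPS:

```hcl
resource "aws_security_group" "vpc_endpoints" {
  name_prefix = "${var.project_name}-${var.environment}-vpce-"
  vpc_id      = aws_vpc.main.id

  # Interface endpoints only need to accept HTTPS from clients in the VPC;
  # responses flow back automatically because security groups are stateful.
  ingress {
    description = "HTTPS from inside the VPC"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr]
  }
}
```

You could tighten the ingress further to just the ECS task security group if the two modules share that ID.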

Mistake 5 — Terraform fighting CodePipeline over the task definition

Every terraform plan after a deployment wanted to reset the ECS service back to the task definition revision Terraform had last registered, undoing whatever CodePipeline had deployed since. The fix is the ignore_changes = [task_definition] lifecycle block on the ECS service. Without it, you have two systems fighting to own the same resource and you'll keep accidentally reverting deployments.

Mistake 6 — The CodeStar Connection was never activated

I spent 45 minutes staring at a pipeline that failed immediately at the Source stage with a cryptic error. The connection showed up in Terraform state as created successfully. Eventually I found it: connections start in PENDING state and need manual authorization in the console.

Always verify the connection status after creating it:

aws codestar-connections list-connections \
  --provider-type-filter GitHub \
  --query 'Connections[*].{Name:ConnectionName,Status:ConnectionStatus}'

Mistake 7 — Secrets in plain environment variables

My first version passed database credentials as plain environment variables in the task definition. Those show up in the ECS console, in CloudTrail, and in any container inspection. Always use the secrets block (shown in Part 5) which references Secrets Manager ARNs. The difference looks small in the code but is significant from a security standpoint.
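For completeness, here is roughly how the secret itself can be declared so that its name lines up with the valueFrom path in Part 5 and the IAM prefix in Part 4. This is a sketch under one assumption: the secret value is set out of band (console or CLI), so the credential never lands in Terraform state. The resource and output names are hypothetical.

```hcl
# Creates the secret container only; the value is written outside Terraform.
resource "aws_secretsmanager_secret" "database_url" {
  name = "${var.project_name}/${var.environment}/database-url"
}

# The full ARN includes a random suffix that Secrets Manager appends to the name.
output "database_url_secret_arn" {
  value = aws_secretsmanager_secret.database_url.arn
}
```

Passing this output into the ECS module lets the task definition reference the complete ARN instead of building one from strings.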

The Day-to-Day Deployment Workflow

Once everything is provisioned, here's what a normal deployment looks like:

1. Developer opens a PR against the develop branch
2. After review and merge, CodePipeline triggers automatically
3. CodeBuild runs tests and builds the Docker image
4. Image is pushed to ECR tagged with the commit SHA (first 7 chars)
5. imagedefinitions.json is written with the new image URI
6. CodePipeline's ECS deploy action registers a new task definition revision
7. ECS performs a rolling deployment (min 100% healthy, max 200%)
8. Health checks confirm new tasks are healthy
9. Old tasks are drained and stopped
10. Slack notification fires with SUCCEEDED status

If step 8 fails, the deployment circuit breaker rolls back to the previous task definition automatically.
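A rollback you never hear about is easy to miss, so it can be worth alerting on it. ECS emits a deployment state change event when a deployment fails, and an EventBridge rule can forward that to a notification topic. A hedged sketch: notification_topic_arn is a hypothetical variable (for example the SNS topic from Part 8, which would need to allow events.amazonaws.com to publish), and the rule as written matches every ECS service in the account and region.

```hcl
# Hypothetical input: ARN of an SNS topic that EventBridge may publish to.
variable "notification_topic_arn" {
  type = string
}

resource "aws_cloudwatch_event_rule" "ecs_deployment_failed" {
  name        = "${var.project_name}-${var.environment}-ecs-deployment-failed"
  description = "Fires when an ECS service deployment fails"

  event_pattern = jsonencode({
    source      = ["aws.ecs"]
    detail-type = ["ECS Deployment State Change"]
    detail = {
      eventName = ["SERVICE_DEPLOYMENT_FAILED"]
    }
  })
}

resource "aws_cloudwatch_event_target" "ecs_deployment_failed_sns" {
  rule      = aws_cloudwatch_event_rule.ecs_deployment_failed.name
  target_id = "SendToSNS"
  arn       = var.notification_topic_arn
}
```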

For production, I added a manual approval stage between Build and Deploy:

stage {
  name = "Approve"
  action {
    name     = "ManualApproval"
    category = "Approval"
    owner    = "AWS"
    provider = "Manual"
    version  = "1"

    configuration = {
      NotificationArn    = aws_sns_topic.pipeline_notifications.arn
      CustomData         = "Approve deployment to production for ${var.project_name}"
      ExternalEntityLink = "https://github.com/${var.github_repository}/compare/main...develop"
    }
  }
}
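For the NotificationArn in the approval action to do anything, the pipeline's service role has to be allowed to publish to that topic, and the policy in Part 4 doesn't include that. A small sketch of the extra permission, assuming it sits in the IAM module and receives the topic ARN through a hypothetical variable named notification_topic_arn:

```hcl
# Hypothetical input: ARN of the SNS topic used for approval notifications.
variable "notification_topic_arn" {
  type = string
}

resource "aws_iam_role_policy" "codepipeline_approval_notify" {
  name = "codepipeline-approval-notify"
  role = aws_iam_role.codepipeline.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["sns:Publish"]
      Resource = var.notification_topic_arn
    }]
  })
}
```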

Monitoring and Observability

The pipeline is only as good as your ability to know when something's wrong.

Check pipeline execution history:

aws codepipeline list-pipeline-executions \
  --pipeline-name your-pipeline-name \
  --max-results 20 \
  --query 'pipelineExecutionSummaries[*].{Status:status,Start:startTime,Trigger:trigger.triggerType}'

First place to look when a deployment is stuck:

aws ecs describe-services \
  --cluster your-cluster-name \
  --services your-service-name \
  --query 'services[0].events[:10]'

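CloudWatch Container Insights: enabling this on the ECS cluster (shown in Part 5) gives you CPU, memory, and network metrics per task without any additional instrumentation. It isn't free, though. The metrics are billed as CloudWatch custom metrics, so check the cost before enabling it on busy clusters.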

Final Thoughts

Three things I'd tell myself before starting this:

IAM permissions are worth getting right from the start. It takes maybe two extra hours to scope things properly, and it saves you from security findings and the awkward conversation about why your build role had admin access.

The circuit breaker and rollback settings on the ECS service are not optional. At some point you will deploy something that fails health checks in production. You want that to roll back automatically, not by hand at 2am after someone gets paged.

ignore_changes on your ECS service is what makes the Terraform + CodePipeline combination actually work. Without it, you have two systems fighting over the same resource and you'll keep silently reverting deployments you just shipped.

The full Terraform code for this post is available on GitHub at [link to your repo]. If you run into something I didn't cover, drop a comment — I'm genuinely curious what edge cases show up in different environments and org setups.

If this helped you, share it with your team. The pipeline is worth the effort.

DEV Community