Elsa Adjei

Building a Production-Ready ECS Pipeline: What I Learned Splitting Infrastructure into Layers

Reading time: 5 minutes


"Just run terraform apply and you're done" — that's what I thought before I tried to build a real CI/CD pipeline.

Unfortunately, things rarely follow the 'happy path'. I soon learned that much more goes into creating a production-ready CI/CD pipeline, from security measures to weighing trade-offs and optimising cost. The process was valuable, though: it taught me to work out what was best for my own needs and constraints, and it grew my engineering mindset.

What you'll learn:

  • Why splitting infrastructure into persistent and ephemeral layers saves you from breaking your own pipelines
  • The iterative reality of IAM least privilege (it's not a one-shot thing)
  • Trade-offs of scratch Docker images that tutorials don't mention

What I Built

I built a fully automated CI/CD pipeline that deploys Gatus, a health-monitoring application, to ECS Fargate as a container. The pipeline includes security scanning for both the container image and the Infrastructure as Code, and the app is served from my custom domain after deployment.

Architecture

Architecture Diagram
Two-layer design:

| Layer      | Contains                        | Lifecycle                        |
|------------|---------------------------------|----------------------------------|
| Persistent | OIDC, ECR, ACM, Route 53 zone   | Rarely changes                   |
| Ephemeral  | VPC, ALB, ECS, Route 53 records | Created/destroyed per deployment |

Why this split:
The split lets me destroy my compute resources without breaking my CI/CD identity. That keeps costs in check, since I tore down the compute layer daily, and it shrinks the blast radius of any issues a terraform destroy could cause.
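For illustration, here's roughly how the two layers can live side by side as separate Terraform root modules that share one S3 bucket but use different state keys. The persistent key matches the snippet later in this post; the ephemeral key is an assumed name for the sake of the sketch.

# persistent/backend.tf
terraform {
  backend "s3" {
    bucket = "terraform-state-gatus-elsa"
    key    = "gatus/persistent/terraform.tfstate" # OIDC, ECR, ACM, hosted zone
    region = "eu-west-2"
  }
}

# ephemeral/backend.tf (key name is illustrative)
terraform {
  backend "s3" {
    bucket = "terraform-state-gatus-elsa"
    key    = "gatus/ephemeral/terraform.tfstate" # VPC, ALB, ECS, Route 53 records
    region = "eu-west-2"
  }
}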


The Hard Parts

1. Remote State Across Layers

The problem:
One of the first issues I had to figure out was wiring the ephemeral and persistent layers together. The ephemeral layer needed values from the persistent layer, such as the hosted zone ID, the certificate ARN and the ECR repository URL.

What tripped me up:
I decided to let the ephemeral layer read the persistent layer's remote state. Both layers use an S3 remote backend, with their two state files stored in the same bucket, so the ephemeral layer can pull in the persistent state through a data block in a remote_state.tf file. But how does this look in Terraform?

The solution:

data "terraform_remote_state" "persistent" {
  backend = "s3"
  config = {
   bucket = "terraform-state-gatus-elsa"
   key    = "gatus/persistent/terraform.tfstate"
   region = "eu-west-2"
 }
}


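Once that data source exists, the ephemeral layer can reference the persistent layer's outputs wherever it needs them. A minimal sketch of the consuming side is below; the output names (hosted_zone_id, certificate_arn, ecr_repository_url) are assumptions for illustration and need to match whatever the persistent layer actually exports.

# Output names are assumed for this sketch
locals {
  hosted_zone_id  = data.terraform_remote_state.persistent.outputs.hosted_zone_id
  certificate_arn = data.terraform_remote_state.persistent.outputs.certificate_arn
  ecr_repo_url    = data.terraform_remote_state.persistent.outputs.ecr_repository_url
}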

2. IAM: Death by a Thousand Permissions

The problem:
Rather than trying to work out up front exactly which permissions my IAM roles needed (spoiler: it was a lot!), I started with broad privileges so that my CI/CD pipelines would work, and then scoped down towards least privilege.

What I expected:
As I started scoping down, I expected Terraform to surface every missing permission in one go when the pipeline ran, making it a "one and done" exercise. I was very, very wrong.

What actually happened:
I severely underestimated how many permissions the configuration needed. It became an iterative cycle: apply the plan, watch it fail and complain about a missing permission, check CloudTrail to see which actions were actually being used, add them to the configuration, repeat. It was a gruelling process.

Permissions I didn't expect:

  • ec2:ModifySubnetAttribute
  • route53:GetChange (needs different ARN format)
  • wafv2:*
  • cloudwatch:PutMetricAlarm

Lesson:
Sometimes least privilege is a process, not a destination. And CloudTrail is your friend!
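For a flavour of where that cycle ends up, here's a trimmed-down Terraform policy sketch covering a couple of the surprises above. It's illustrative only: the resource ARNs are placeholders, and in practice each action should be scoped to the narrowest resource CloudTrail shows it touching.

# Illustrative only: actions taken from the list above, ARNs are placeholders
data "aws_iam_policy_document" "deploy" {
  statement {
    sid       = "NetworkAndMonitoring"
    actions   = ["ec2:ModifySubnetAttribute", "cloudwatch:PutMetricAlarm"]
    resources = ["*"] # tighten to specific subnet/alarm ARNs where possible
  }

  statement {
    sid     = "Route53ChangeStatus"
    actions = ["route53:GetChange"]
    # GetChange is authorised against change IDs, not hosted zone ARNs
    resources = ["arn:aws:route53:::change/*"]
  }
}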

3. Scratch Dockerfile Trade-offs

The appeal:
I wanted to go with a scratch Dockerfile. The appeal was a tiny image, thanks to how minimal scratch is, and, from a security standpoint, a minimal attack surface, since the image lacks things like a shell.

What I gave up:

  • No shell for debugging
  • Can't add HEALTHCHECK instruction
  • Can't fetch config at runtime

Why I kept it:
Health checks were already handled in my workflow and by the load balancer, so I felt adequately covered there. I also think the security benefits are worth it, and the small image keeps pulls and pushes fast in the pipeline.
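For reference, a minimal multi-stage scratch build looks something like the sketch below. It isn't my exact Dockerfile: the Go version, paths and build flags are assumptions, and the CA certificates are copied in because a scratch image has none of its own.

# Build stage: version and paths here are illustrative
FROM golang:1.22-alpine AS build
WORKDIR /src
COPY . .
# Static binary so it can run on scratch with no libc
RUN CGO_ENABLED=0 go build -o /app .

# Runtime stage: no shell, no package manager, just the binary
FROM scratch
COPY --from=build /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
COPY --from=build /app /app
ENTRYPOINT ["/app"]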


What Clicked

Remote State as a Contract

My ephemeral layer needed things from the persistent layer. How does it get them? I used terraform_remote_state to read outputs, but the "aha" moment was understanding that these outputs are a contract, not just values: the persistent layer promises to provide certain outputs, and the ephemeral layer consumes them. Neither cares how the other is implemented internally.
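Concretely, the contract is just the persistent layer's outputs file. A sketch of the producing side is below; the output names and the resource names they reference are placeholders, and the important part is that these outputs are the only interface the ephemeral layer depends on.

# persistent/outputs.tf (names are illustrative)
output "hosted_zone_id" {
  value = aws_route53_zone.main.zone_id
}

output "certificate_arn" {
  value = aws_acm_certificate.main.arn
}

output "ecr_repository_url" {
  value = aws_ecr_repository.gatus.repository_url
}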

Pipeline Variables

When my CD workflow triggered via workflow_run, actions/checkout checked out the latest commit on main by default, which was not necessarily the commit CI had just tested. I realised I had to specify the ref to avoid deploying untested code. The fix was ref: ${{ github.event.workflow_run.head_sha }}.
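In the CD workflow, that fix sits on the checkout step, roughly like this (the job and step layout is illustrative):

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # Check out the exact commit the CI run tested, not the tip of main
          ref: ${{ github.event.workflow_run.head_sha }}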

The Approval Gate OIDC Gotcha

When I added environment: production to my terraform-apply job, the OIDC trust policy broke: the token's sub claim switched from the main-branch format to the environment format, so I had to allow both in my trust policy.
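In Terraform terms, that means the role's trust policy has to accept both sub formats. Here's a minimal sketch: the repository path is taken from my repo link, and the OIDC provider resource name is a placeholder.

# Both sub formats must be trusted; provider resource name is a placeholder
data "aws_iam_policy_document" "github_oidc_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values = [
        "repo:ElsaDevOps/Gatus-ECS:ref:refs/heads/main",    # jobs running on main
        "repo:ElsaDevOps/Gatus-ECS:environment:production", # apply job behind the approval gate
      ]
    }
  }
}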


What I'd Do Differently

Write the README First

What happened:
As I was building the project, I kept thinking of more features to add, which led to a bit of scope creep.

Next time:
It's easier said than done, but next time I'd decide exactly what the end state should be at the start of the project and then stick to it as closely as possible. I'd also write down my architecture decisions before writing any code.


The Result

Demo

Gatus Dashboard

Pipeline

CI

Approval Gate

CD

Health check

Pipeline stages (a rough workflow sketch follows the list):

  • Build and push to ECR (commit SHA tag)
  • Trivy scan (container vulnerabilities)
  • Checkov scan (Terraform misconfigurations)
  • Terraform plan
  • Manual approval gate
  • Terraform apply
  • Health check

Duration: CI ~1 min 30 s, CD ~2 min 47 s
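As a rough sketch of how the CI stages map onto workflow steps (tool invocations, paths and the ECR_URL variable are illustrative, not my exact configuration, and it assumes Trivy and Checkov are available on the runner):

jobs:
  build-scan-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Tag the image with the commit SHA so CD can deploy exactly what was tested
      - run: docker build -t "$ECR_URL:${{ github.sha }}" .
      # Fail the job on high/critical container vulnerabilities
      - run: trivy image --exit-code 1 --severity HIGH,CRITICAL "$ECR_URL:${{ github.sha }}"
      # Scan the Terraform for misconfigurations
      - run: checkov -d terraform/
      - run: docker push "$ECR_URL:${{ github.sha }}"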


Key Takeaways

  1. Split infrastructure by lifecycle — persistent identity/registry, ephemeral compute
  2. IAM least privilege is iterative — use CloudTrail, expect multiple passes
  3. Scratch images have trade-offs — know what you're giving up
  4. Test your pipeline edge cases — approval gates, checkout SHAs, artifact paths
  5. Document before you build — README-first prevents scope creep

Links

  • Repository: [github.com/ElsaDevOps/Gatus-ECS]

What's Next

Right now I'm working on a few things: exploring technology that interests and impresses me, learning Go, and levelling up my infrastructure skills with Kubernetes, Amazon EKS in particular. To that end, I'm building my own Tailscale/Headscale Prometheus exporter, which I'll then deploy to EKS. I really love Tailscale's product, and I've already learned a lot about Go and networking along the way, so stay tuned!


Questions or feedback? [www.linkedin.com/in/elsadevops]
