How I Built a GitOps Pipeline That Deploys Itself — and Rolls Back When Things Break
I used to deploy by SSHing into a server, pulling new code, restarting Docker Compose, and hoping.
That worked until the day I pushed a bug to production on a Friday afternoon and spent the weekend manually rolling it back.
This is the story of rebuilding that entire workflow — from "SSH and pray" to a system where a git push triggers security scans, builds container images, shifts traffic 10% at a time, and automatically reverts if anything looks wrong.
Where It Started
The app is a full-stack notes manager: Next.js frontend, NestJS backend, PostgreSQL, with Nginx as the reverse proxy. Four containers. Nothing exotic.
The original deployment process:
ssh ubuntu@my-server-ip
cd /opt/notes-app
git pull
docker-compose down && docker-compose up -d --build
# Go get coffee. Hope it comes back up.
This is fine when you have one server and one developer. It breaks down the moment you want to deploy without downtime, roll back a bad release, or prove to a future employer that you know what you're doing.
So I documented the rebuild as four distinct phases — not because I planned it that way, but because each phase solved a specific pain I'd already felt.
Phase 1: Automate the Build (GitHub Actions)
First step was getting the build out of my hands entirely. A GitHub Actions workflow that fires on every push to main:
name: CI/CD Pipeline

on:
  push:
    branches: [main]

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Login to Amazon ECR
        uses: aws-actions/amazon-ecr-login@v2
      - name: Build and push backend
        uses: docker/build-push-action@v5
        with:
          context: ./backend
          push: true
          tags: ${{ env.ECR_REGISTRY }}/notes-backend:${{ github.sha }}
Three images (backend, frontend, proxy), each tagged with the git commit SHA. No more latest tags overwriting each other. Every commit gets its own immutable image.
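The tagging scheme is easy to reproduce locally. A minimal sketch, with a placeholder registry URL and SHA standing in for the real values:

```shell
# Sketch of the immutable-tag scheme: one image per commit SHA.
# The registry account and SHA below are placeholders.
ECR_REGISTRY="123456789012.dkr.ecr.eu-west-1.amazonaws.com"
GIT_SHA="4f2a91c"   # normally: $(git rev-parse --short HEAD)
TAG="${ECR_REGISTRY}/notes-backend:${GIT_SHA}"
echo "$TAG"
# → 123456789012.dkr.ecr.eu-west-1.amazonaws.com/notes-backend:4f2a91c
```

Because the SHA never changes, re-pulling that tag always yields the exact image that commit produced.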
Phase 2: Add Security Gates Before Automating Deployments
Here's where I made a deliberate choice most tutorials skip: I added security scanning before I automated the actual deployment.
The logic: if you automate deployment of insecure code, you've just made insecurity faster.
The Jenkins pipeline I built runs 7 gates in sequence:
- Gitleaks — scans the entire git history for hardcoded credentials, API keys, tokens
- TypeScript + ESLint — type errors and code style issues caught at build time
- npm audit — dependency vulnerability scan
- SonarCloud — code quality gates (complexity, duplication, security rules)
- Docker build — images built and tagged with git SHA
- Trivy — scans each container image for CVEs (HIGH and CRITICAL flagged)
- Syft — generates a Software Bill of Materials (CycloneDX + SPDX JSON)
In lab mode, these gates report but don't block. In production, you'd set exit-code: 1 on Trivy and SonarCloud to make them hard gates. The point was to build the habit of having the gates, not to enforce them from day one.
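As a sketch of what flipping the Trivy gate from lab to production mode looks like (`--severity` and `--exit-code` are real Trivy flags; the image name is a placeholder):

```shell
# Sketch: the same scan in both modes, assuming trivy is installed.
IMAGE="notes-backend:4f2a91c"   # placeholder tag

# Lab mode: report HIGH/CRITICAL findings but never fail the build.
lab_scan()  { trivy image --severity HIGH,CRITICAL --exit-code 0 "$IMAGE"; }

# Production mode: any HIGH/CRITICAL finding fails the stage.
prod_scan() { trivy image --severity HIGH,CRITICAL --exit-code 1 "$IMAGE"; }
```

Switching modes is a one-flag change, which is what makes "report first, enforce later" a low-cost habit to build.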
Phase 3: Move to ECS Fargate
Running Docker Compose on EC2 is fine until you need the EC2 instance to scale, fail over, or restart containers automatically. ECS Fargate solves all three: serverless containers, AWS manages the underlying compute, you define the task and it runs.
The Terraform configuration provisions the entire stack:
resource "aws_ecs_cluster" "main" {
  name = "${var.project_name}-cluster"
}

resource "aws_ecs_service" "app" {
  name            = "${var.project_name}-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  launch_type     = "FARGATE"

  deployment_controller {
    type = "CODE_DEPLOY" # Hands deployment control to CodeDeploy
  }

  lifecycle {
    ignore_changes = [task_definition] # CI/CD owns this, not Terraform
  }
}
That ignore_changes = [task_definition] line matters more than it looks. Terraform manages the infrastructure. Jenkins manages which task definition revision is deployed. Without it, every terraform apply would roll back to whatever task definition Terraform last knew about — overwriting the version Jenkins just pushed.
The Networking Trap That Got Me
Before I talk about Phase 4, there's a specific failure I need to document because it will get you too.
My backend couldn't connect to the database. ECONNREFUSED database:5432.
In Docker Compose, services reach each other by their service name. database resolves because Docker creates a shared bridge network with DNS for each service name.
ECS Fargate uses awsvpc network mode. All containers in the same task share a single network namespace — effectively the same localhost. There's no inter-container DNS. The hostname database doesn't resolve to anything.
The fix is one word:
# Docker Compose — works locally
DATABASE_URL=postgresql://user:pass@database:5432/db
# ECS Fargate — same task = same localhost
DATABASE_URL=postgresql://user:pass@localhost:5432/db
This isn't in the getting-started guide. It's buried in the ECS networking docs. It will silently break every multi-container Fargate deployment that was originally written for Docker Compose.
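One way to keep a single image working in both environments is to branch on a Fargate-provided environment variable at container start. A sketch, assuming `ECS_CONTAINER_METADATA_URI_V4` as the signal (Fargate platform 1.4+ sets it; the credentials are placeholders):

```shell
# Sketch: pick the DB host from the runtime environment.
# ECS_CONTAINER_METADATA_URI_V4 is set by Fargate, absent under Compose.
if [ -n "${ECS_CONTAINER_METADATA_URI_V4:-}" ]; then
  DB_HOST="localhost"   # same task = same network namespace
else
  DB_HOST="database"    # Docker Compose service-name DNS
fi
export DATABASE_URL="postgresql://user:pass@${DB_HOST}:5432/db"
echo "$DATABASE_URL"
```

This keeps the image itself identical across environments, which matters once you're tagging by commit SHA.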
Phase 4: Blue/Green Deployments with CodeDeploy
This is the part that makes the system production-grade.
When a new version is deployed, CodeDeploy spins up new ECS tasks (Green) alongside the existing ones (Blue). Traffic shifts 10% per minute from Blue to Green. If a CloudWatch alarm fires during the shift — 5xx error rate, unhealthy targets — traffic instantly reverts to 100% Blue.
T+0s Blue: 100% Green: starting ← deploy begins
T+60s Blue: 90% Green: 10% ← 10% shifted
T+120s Blue: 80% Green: 20% ← steady if healthy
...
T+600s Blue: 0% Green: 100% ← complete
T+900s Blue tasks terminated ← cleanup
If the CloudWatch alarm fires at any point: traffic snaps back to 100% Blue instantly.
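That linear schedule isn't custom code; it maps onto a built-in CodeDeploy deployment configuration. A sketch of what the `codedeploy-input.json` might contain (the application and group names are placeholders; `CodeDeployDefault.ECSLinear10PercentEvery1Minutes` is a real AWS-managed configuration):

```shell
# Sketch: selecting the built-in 10%-per-minute shift.
cat > codedeploy-input.json <<'EOF'
{
  "applicationName": "notes-app",
  "deploymentGroupName": "notes-app-dg",
  "deploymentConfigName": "CodeDeployDefault.ECSLinear10PercentEvery1Minutes"
}
EOF
```

The real input file also carries the AppSpec revision block; this fragment only shows where the traffic-shifting behaviour comes from.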
The Jenkins pipeline orchestrates this:
stage('Deploy to ECS') {
  steps {
    sh '''
      # Render task definition with current image tags
      ./ecs/render-task-def.sh \
        --image-tag ${GIT_COMMIT:0:7} \
        --region eu-west-1

      # Register new task definition revision
      TASK_DEF_ARN=$(aws ecs register-task-definition \
        --cli-input-json file://ecs/task-definition-rendered.json \
        --query taskDefinition.taskDefinitionArn \
        --output text)

      # Trigger CodeDeploy blue/green
      aws deploy create-deployment \
        --cli-input-json file://ecs/codedeploy-input.json
    '''
  }
}
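The pipeline can also block until CodeDeploy reaches a terminal state instead of returning as soon as the deployment is created. A hedged sketch using the real `aws deploy get-deployment` call (the deployment ID is whatever `create-deployment` returned):

```shell
# Sketch: poll a CodeDeploy deployment until it succeeds or fails.
wait_for_deployment() {
  local id="$1"
  while true; do
    status=$(aws deploy get-deployment --deployment-id "$id" \
      --query 'deploymentInfo.status' --output text)
    case "$status" in
      Succeeded)      return 0 ;;   # green took 100% of traffic
      Failed|Stopped) return 1 ;;   # alarm fired, traffic reverted
      *)              sleep 15 ;;   # Created / InProgress: keep waiting
    esac
  done
}
```

Failing the Jenkins stage on rollback makes the revert visible in CI history, not just in the AWS console.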
Observability: Knowing It's Actually Working
Deploying successfully and knowing it's working are different things.
The observability stack:
- Prometheus scraping the NestJS /metrics endpoint every 15 seconds
- Grafana dashboards for request rate, latency, error rate, container health
- Alertmanager routing alert notifications to a Slack channel
- CloudWatch for ECS logs with 30-day retention
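On the scrape side, the 15-second interval is plain Prometheus configuration. A minimal sketch (the target host, port, and job name are assumptions, not from the repo):

```yaml
# prometheus.yml — minimal sketch; the backend target is an assumption
scrape_configs:
  - job_name: 'notes-backend'
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ['backend:3000']
```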
The Prometheus NestJS integration is worth noting — NestJS doesn't expose metrics by default. You need to instrument it:
// metrics.module.ts
import { Module } from '@nestjs/common';
import { PrometheusModule } from '@willsoto/nestjs-prometheus';

@Module({
  imports: [
    PrometheusModule.register({
      path: '/metrics',
      defaultMetrics: { enabled: true },
    }),
  ],
})
export class MetricsModule {}
Once that's running, Prometheus scrapes HTTP request counts, latency histograms, and error rates automatically.
Key Learnings
1. Security gates belong before automated deployment, not after. The moment you automate deployment of untested, unscanned code, you've made your pipeline a liability. Build the gates first, then automate.
2. Fargate awsvpc mode changes inter-container communication fundamentally. Same-task containers talk on localhost. Cross-task communication needs service discovery or an internal load balancer. Know this before you hit it in production.
3. ignore_changes = [task_definition] is required when Terraform and CI/CD share an ECS service. Without it, Terraform and Jenkins will fight over task definition revisions on every apply.
4. Blue/green is only as good as your alarms. If your CloudWatch alarm isn't configured before the deployment starts, there's nothing to trigger the rollback. The alarm is the safety net — set it up before you need it.
5. The AppSpec for CodeDeploy must be JSON-wrapped via CLI. This is undocumented in the happy path. Use jq to wrap the YAML content as an AppSpecContent JSON object or the deployment will fail with an unhelpful error.
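As a sketch of learning #5: wrap the AppSpec YAML into the JSON envelope that `aws deploy create-deployment` expects. The file names here are placeholders and the AppSpec body is a stand-in; jq's `-n`/`--arg` flags are standard:

```shell
# Stand-in AppSpec for illustration only.
cat > appspec.yaml <<'EOF'
version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
EOF

# Wrap the raw YAML as an AppSpecContent JSON object.
jq -n --arg content "$(cat appspec.yaml)" \
  '{revision: {revisionType: "AppSpecContent",
               appSpecContent: {content: $content}}}' > revision.json
```

jq handles the escaping of newlines and quotes, which is exactly the part that produces the unhelpful error when done by hand.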
What I'd Do Differently
If I started this project today, I'd add AWS Systems Manager Session Manager from the start instead of a bastion host. No SSH port exposed, no key rotation, full audit trail of every session — and it's cheaper than running a separate EC2 instance as a jump box.
I'd also set the security gates to blocking mode from day one, not lab mode. The discipline of having a hard quality gate early shapes how you write code.
Resources & Next Steps
- Full repository — github.com/celetrialprince166/gitops_lab
- ECS Fargate awsvpc networking docs
- CodeDeploy ECS Blue/Green deployment guide
- Terraform ECS lifecycle meta-argument
Next I'm building the advanced observability layer — distributed tracing with OpenTelemetry and Jaeger across the full service mesh. Follow along if that's useful.
What's the most important thing your deployment pipeline is missing right now? Drop it in the comments — I'm building a list of what engineers actually care about vs. what tutorials focus on. 👇





