<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Prince Ayiku</title>
    <description>The latest articles on DEV Community by Prince Ayiku (@prince_ayiku_166).</description>
    <link>https://dev.to/prince_ayiku_166</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2082058%2F43008bfa-e14b-4c2f-b205-ad6e8ca2496b.png</url>
      <title>DEV Community: Prince Ayiku</title>
      <link>https://dev.to/prince_ayiku_166</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/prince_ayiku_166"/>
    <language>en</language>
    <item>
      <title>I Built a Fully Serverless Task Manager on AWS — Here's What the Docs Don't Tell You</title>
      <dc:creator>Prince Ayiku</dc:creator>
      <pubDate>Mon, 20 Apr 2026 08:54:47 +0000</pubDate>
      <link>https://dev.to/prince_ayiku_166/i-built-a-fully-serverless-task-manager-on-aws-heres-what-the-docs-dont-tell-you-37c8</link>
      <guid>https://dev.to/prince_ayiku_166/i-built-a-fully-serverless-task-manager-on-aws-heres-what-the-docs-dont-tell-you-37c8</guid>
      <description>&lt;p&gt;I spent weeks building a fully serverless task management system on AWS — Lambda, DynamoDB, Cognito, SNS, SES, Amplify, the whole stack — provisioned entirely with Terraform and wired into a GitHub Actions CI/CD pipeline.&lt;/p&gt;

&lt;p&gt;Here's what I learned. Not the happy path. The real stuff.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Repo: &lt;a href="https://github.com/celetrialprince166/Serverless-task-management-app" rel="noopener noreferrer"&gt;github.com/celetrialprince166/Serverless-task-management-app&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;A role-based task management app where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Admins&lt;/strong&gt; can create, assign, update, and delete tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Members&lt;/strong&gt; can view their assigned tasks and update status&lt;/li&gt;
&lt;li&gt;Email notifications fire automatically when tasks are assigned or status changes&lt;/li&gt;
&lt;li&gt;Everything runs serverless on AWS — zero servers to manage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The stack:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Frontend&lt;/td&gt;
&lt;td&gt;React 19 + Vite + Tailwind → AWS Amplify&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;API Gateway REST + Cognito JWT auth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compute&lt;/td&gt;
&lt;td&gt;15+ Lambda functions, Node.js 20, TypeScript&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;td&gt;DynamoDB (single-table design)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Notifications&lt;/td&gt;
&lt;td&gt;DynamoDB Streams → SNS → SES&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IaC&lt;/td&gt;
&lt;td&gt;Terraform (modular)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD&lt;/td&gt;
&lt;td&gt;GitHub Actions (Checkov + npm audit + terraform validate)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9wuum9q19qb37olk3vv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9wuum9q19qb37olk3vv.png" alt="Architecture diagram" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Gotcha #1: DynamoDB Single-Table Design Is Not Optional
&lt;/h2&gt;

&lt;p&gt;I started with multiple DynamoDB tables — one for tasks, one for users, one for assignments. Classic relational thinking.&lt;/p&gt;

&lt;p&gt;The problem: DynamoDB has no JOIN. To get a task with its assignees, I needed three separate &lt;code&gt;GetItem&lt;/code&gt; calls. Three round-trips. Three places for something to fail.&lt;/p&gt;

&lt;p&gt;The fix: &lt;strong&gt;single-table design&lt;/strong&gt; using composite primary keys.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Task item&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;PK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TASK#01HXYZ&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;SK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;METADATA&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;IN_PROGRESS&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Assignment — same table, different SK prefix&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;PK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TASK#01HXYZ&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;SK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ASSIGN#USER#01HJKL&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now &lt;code&gt;query(PK = "TASK#01HXYZ")&lt;/code&gt; returns the task AND all its assignments in one call. A &lt;code&gt;begins_with(SK, "ASSIGN#")&lt;/code&gt; key condition narrows the query to assignments server-side; otherwise, you split the returned items by SK prefix client-side.&lt;/p&gt;
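&lt;p&gt;One way to sketch that client-side split (the item shape and helper name here are my own, not taken from the repo):&lt;/p&gt;

```typescript
// Hypothetical helper: partition one query's items by SK prefix.
// In single-table design, one Query returns heterogeneous items;
// the SK tells you which entity each one is.
interface Item {
  PK: string;
  SK: string;
  [attr: string]: unknown;
}

function splitTaskItems(items: Item[]) {
  const task = items.find((i) => i.SK === "METADATA");
  const assignments = items.filter((i) => i.SK.startsWith("ASSIGN#"));
  return { task, assignments };
}
```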

&lt;p&gt;The trade-off: you must define ALL access patterns before building the schema. Change your access patterns later and you're adding GSIs or doing table scans.&lt;/p&gt;

&lt;p&gt;I used &lt;code&gt;PAY_PER_REQUEST&lt;/code&gt; billing — scales to zero when nothing's happening, which is perfect for a project portfolio app.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gotcha #2: The Cognito Post-Confirmation Trigger Fires More Than Once
&lt;/h2&gt;

&lt;p&gt;I added a post-confirmation Lambda trigger to create the user record in DynamoDB after signup.&lt;/p&gt;

&lt;p&gt;What I didn't know: &lt;strong&gt;this trigger is not guaranteed to fire exactly once&lt;/strong&gt;. It also runs after forgot-password confirmations, and Cognito retries the invocation if the function errors, so the same user can trigger it repeatedly.&lt;/p&gt;

&lt;p&gt;Without an idempotency guard, every re-invocation overwrites the user's DynamoDB record, silently resetting any role I'd manually set to &lt;code&gt;ADMIN&lt;/code&gt; back to &lt;code&gt;MEMBER&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Fix: one line in the Lambda's &lt;code&gt;PutItemCommand&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;ConditionExpression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;attribute_not_exists(PK)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="c1"&gt;// Only writes if the item doesn't exist yet — idempotent&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the first sign-in creates the profile. Every subsequent one does nothing.&lt;/p&gt;
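&lt;p&gt;The whole guard, sketched as a testable function. The &lt;code&gt;send&lt;/code&gt; parameter stands in for the DynamoDB client's send method, and the key and role names are assumptions, not the repo's actual schema:&lt;/p&gt;

```typescript
// Sketch of an idempotent profile write. DynamoDB rejects a conditional
// put on an existing item with ConditionalCheckFailedException, which we
// treat as "already created" rather than an error.
async function createProfileOnce(send: Function, userId: string) {
  try {
    await send({
      Item: { PK: "USER#" + userId, SK: "PROFILE", role: "MEMBER" },
      ConditionExpression: "attribute_not_exists(PK)", // the idempotency guard
    });
    return "created";
  } catch (err) {
    if ((err as Error).name === "ConditionalCheckFailedException") {
      return "exists"; // repeat invocations are harmless no-ops
    }
    throw err; // anything else is a real failure
  }
}
```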




&lt;h2&gt;
  
  
  Gotcha #3: CORS Is a Three-Layer Problem in API Gateway
&lt;/h2&gt;

&lt;p&gt;Browser &lt;code&gt;CORS error&lt;/code&gt;. Classic. Except in serverless, there are three places to fix it and you need all three:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1&lt;/strong&gt; — API Gateway: add an &lt;code&gt;OPTIONS&lt;/code&gt; method to every resource and configure Gateway Responses for 4xx/5xx&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2&lt;/strong&gt; — Lambda: every handler response must include CORS headers&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Access-Control-Allow-Origin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ALLOWED_ORIGIN&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;*&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Access-Control-Allow-Headers&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type,Authorization&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Layer 3&lt;/strong&gt; — The one nobody mentions: &lt;strong&gt;Gateway Responses for auth errors&lt;/strong&gt;. When Cognito rejects a JWT, API Gateway returns a 401 — but that error comes from the authorizer, not from Lambda. So your Lambda CORS headers don't run. You get a CORS error that's actually a 401.&lt;/p&gt;

&lt;p&gt;Fix it in Terraform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_api_gateway_gateway_response"&lt;/span&gt; &lt;span class="s2"&gt;"cors_4xx"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;rest_api_id&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rest_api_id&lt;/span&gt;
  &lt;span class="nx"&gt;response_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DEFAULT_4XX"&lt;/span&gt;
  &lt;span class="nx"&gt;response_parameters&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"gatewayresponse.header.Access-Control-Allow-Origin"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"'*'"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Gotcha #4: GitHub Actions OIDC Has a Silent Permission Requirement
&lt;/h2&gt;

&lt;p&gt;I switched from static AWS access keys to OIDC federation for the CI pipeline. The &lt;code&gt;configure-aws-credentials&lt;/code&gt; action just said: "Credentials could not be loaded."&lt;/p&gt;

&lt;p&gt;No mention of permissions. No useful error.&lt;/p&gt;

&lt;p&gt;The fix is one block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;id-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;   &lt;span class="c1"&gt;# This is required. Without it, OIDC JWT request fails silently.&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub's documentation mentions this, but it's buried. The action's error message gives you zero hint that this is the problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gotcha #5: Amplify Monorepo Needs &lt;code&gt;appRoot&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;My repo has both &lt;code&gt;backend/&lt;/code&gt; and &lt;code&gt;frontend/&lt;/code&gt;. Amplify's default build config looks for &lt;code&gt;package.json&lt;/code&gt; in the repository root. It doesn't find one. Build fails with a vague error.&lt;/p&gt;

&lt;p&gt;Fix in &lt;code&gt;amplify.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;applications&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;frontend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;phases&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;preBuild&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;npm ci&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;npm run build&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;artifacts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;baseDirectory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dist&lt;/span&gt;
        &lt;span class="na"&gt;files&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;**/*"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;appRoot&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;   &lt;span class="c1"&gt;# ← This is the important line&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without &lt;code&gt;appRoot&lt;/code&gt;, Amplify tries to build from the repo root and fails every time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gotcha #6: Notifications Must Be Decoupled — Or You'll Regret It
&lt;/h2&gt;

&lt;p&gt;My first instinct was to have the task Lambda call SES directly after writing to DynamoDB. Simple. Direct.&lt;/p&gt;

&lt;p&gt;The problem: SES latency adds to every task write response time. If SES is down, task writes fail. If I want to add Slack notifications later, I edit the task Lambda.&lt;/p&gt;

&lt;p&gt;The right pattern is DynamoDB Streams → SNS → email formatter Lambda:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task write Lambda  →  DynamoDB  →  Stream  →  SNS  →  Email Lambda  →  SES
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task writes are fast — no SES dependency&lt;/li&gt;
&lt;li&gt;SES failures don't break task creation&lt;/li&gt;
&lt;li&gt;Adding Slack = adding one SNS subscriber. Zero changes to existing code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The stream processor detects events by checking the item's SK prefix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// New assignment? SK starts with "ASSIGN#"&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;newImage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SK&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;startsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ASSIGN#&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;eventName&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;INSERT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PublishCommand&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TASK_ASSIGNED&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Status changed? SK is "METADATA" and status field differs&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;newImage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SK&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;METADATA&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;oldImage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;newImage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PublishCommand&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;STATUS_CHANGED&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use provisioned concurrency for critical Lambda functions.&lt;/strong&gt; Cold starts added up to 800 ms on my heavier handlers. For a real production app, provisioned concurrency keeps warm instances ready.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add a WebSocket API for real-time task updates.&lt;/strong&gt; Right now the frontend polls. API Gateway WebSockets would let me push status changes to connected clients instantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design GSIs more carefully upfront.&lt;/strong&gt; I added a second GSI midway through because I hadn't thought through the "list tasks assigned to user" access pattern. Adding a GSI to a live table triggers a backfill of existing items, which takes time and consumes capacity on a larger table. Designing the indexes upfront is far cheaper.&lt;/p&gt;
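&lt;p&gt;For the "tasks assigned to user" pattern, a common approach is an inverted-index GSI that swaps the key pair. A sketch with assumed table, index, and attribute names:&lt;/p&gt;

```typescript
// Hypothetical: if assignment items carry GSI1PK = SK and GSI1SK = PK,
// then GSI1 makes them queryable by user instead of by task.
const queryParams = {
  TableName: "tasks-table",          // assumed table name
  IndexName: "GSI1",                 // assumed inverted index
  KeyConditionExpression: "GSI1PK = :user",
  ExpressionAttributeValues: { ":user": "ASSIGN#USER#01HJKL" },
};
```

&lt;p&gt;Each matching item's &lt;code&gt;GSI1SK&lt;/code&gt; is then a &lt;code&gt;TASK#&lt;/code&gt; id for that user.&lt;/p&gt;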




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Single-table DynamoDB requires upfront access pattern design&lt;/strong&gt; — change your mind later and you're adding indexes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cognito post-confirmation triggers are NOT idempotent by default&lt;/strong&gt; — guard your DynamoDB writes with &lt;code&gt;attribute_not_exists&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CORS in API Gateway has three independent configuration points&lt;/strong&gt; — miss the Gateway Responses and you'll see CORS errors that are actually 401s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Actions OIDC requires &lt;code&gt;id-token: write&lt;/code&gt;&lt;/strong&gt; — the error message won't tell you this&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amplify monorepo requires &lt;code&gt;appRoot&lt;/code&gt;&lt;/strong&gt; — without it, every Amplify build fails with a vague error&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decouple notifications from writes&lt;/strong&gt; — Streams → SNS → Lambda is the right pattern, always&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Full technical deep-dive with complete Terraform code and all 15+ Lambda handlers: &lt;a href="https://princeayiku.hashnode.dev/aws-serverless-task-manager-lambda-dynamodb-cognito" rel="noopener noreferrer"&gt;My Hashnode blog&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What does your serverless notification architecture look like? Still doing direct Lambda → SES, or have you moved to a queue/stream pattern? Drop it in the comments. 👇&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>serverless</category>
      <category>terraform</category>
    </item>
    <item>
      <title>I Built a Self-Healing Observability Stack on AWS ECS — Here Are the Bugs That Nearly Broke Me</title>
      <dc:creator>Prince Ayiku</dc:creator>
      <pubDate>Mon, 13 Apr 2026 08:44:34 +0000</pubDate>
      <link>https://dev.to/prince_ayiku_166/i-built-a-self-healing-observability-stack-on-aws-ecs-here-are-the-bugs-that-nearly-broke-me-pi1</link>
      <guid>https://dev.to/prince_ayiku_166/i-built-a-self-healing-observability-stack-on-aws-ecs-here-are-the-bugs-that-nearly-broke-me-pi1</guid>
      <description>&lt;h1&gt;
  
  
  I Built a Self-Healing Observability Stack — Here Are the Bugs That Nearly Broke Me
&lt;/h1&gt;

&lt;p&gt;My blue/green deployment rolled back successfully.&lt;/p&gt;

&lt;p&gt;I had no idea why.&lt;/p&gt;

&lt;p&gt;The CloudWatch alarm fired. CodeDeploy reverted. The Slack alert said "5xx spike." But which service? Which endpoint? Which specific request triggered the cascade? All I had was a timestamp and an alarm name. The system worked exactly as designed — and I couldn't explain what it had just protected me from.&lt;/p&gt;

&lt;p&gt;That's when I started this project.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Was Actually Missing
&lt;/h2&gt;

&lt;p&gt;I'd built a solid GitOps pipeline by this point: Jenkins security gates, ECS Fargate, blue/green deployments with automatic rollback. The deployment mechanics were production-grade. The observability layer was... three CloudWatch log groups and a feeling.&lt;/p&gt;

&lt;p&gt;The stack I built to close that gap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry&lt;/strong&gt; auto-instrumentation on the NestJS backend — every HTTP request generates a trace with spans across every service hop&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jaeger&lt;/strong&gt; as the trace backend (receiving via OTLP HTTP on port 4318)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pino structured logging&lt;/strong&gt; with &lt;code&gt;trace_id&lt;/code&gt; and &lt;code&gt;span_id&lt;/code&gt; injected into every log line — so a CloudWatch log entry links directly to a Jaeger trace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus&lt;/strong&gt; scraping custom NestJS metrics (request rate, latency histograms, error counters)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana&lt;/strong&gt; dashboards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alertmanager → Slack&lt;/strong&gt; for alert routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda auto-remediation&lt;/strong&gt; — a function that detects high error rates via CloudWatch alarm and autonomously stops unhealthy ECS tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjown1xqalbmr0ui70lzl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjown1xqalbmr0ui70lzl.png" alt="Advanced Observability Stack Architecture" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The goal: when something breaks, I can go from Slack alert → log line → trace → root cause in one click. And if the error rate spikes, the system handles it before I even see the alert.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bug #1: OpenTelemetry Was Running But Not Working
&lt;/h2&gt;

&lt;p&gt;This was the first thing I got completely wrong.&lt;/p&gt;

&lt;p&gt;I installed &lt;code&gt;@opentelemetry/auto-instrumentations-node&lt;/code&gt;, wired up the OTLP exporter, pointed it at Jaeger, and ran the app. Zero traces in Jaeger. No error. No warning. Just nothing.&lt;/p&gt;

&lt;p&gt;I spent a long time confirming things that weren't the problem: Jaeger was reachable, the exporter config was correct, the SDK was initialising without throwing. Everything looked fine. Nothing was traced.&lt;/p&gt;

&lt;p&gt;The problem was import order.&lt;/p&gt;

&lt;p&gt;Node.js auto-instrumentation works by monkey-patching built-in modules (&lt;code&gt;http&lt;/code&gt;, &lt;code&gt;https&lt;/code&gt;, &lt;code&gt;net&lt;/code&gt;) at process startup. The patches need to be applied &lt;strong&gt;before&lt;/strong&gt; any other module loads. If NestJS (or Express, or anything) bootstraps first, those modules are already in memory — the patches never apply. The app runs normally but generates no spans.&lt;/p&gt;

&lt;p&gt;The fix is one constraint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// main.ts — THIS ORDER IS MANDATORY&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./tracing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;          &lt;span class="c1"&gt;// Must be FIRST — patches Node.js internals&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;NestFactory&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@nestjs/core&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;bootstrap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;NestFactory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;AppModule&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3001&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;bootstrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the &lt;code&gt;tracing.ts&lt;/code&gt; initialisation itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;NodeSDK&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/sdk-node&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;OTLPTraceExporter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/exporter-trace-otlp-http&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;getNodeAutoInstrumentations&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/auto-instrumentations-node&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sdk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;NodeSDK&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;traceExporter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OTLPTraceExporter&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:4318/v1/traces&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// localhost because ECS awsvpc&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;instrumentations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;getNodeAutoInstrumentations&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  &lt;span class="c1"&gt;// Synchronous — must complete before bootstrap() runs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the URL: &lt;code&gt;localhost:4318&lt;/code&gt;, not &lt;code&gt;jaeger:4318&lt;/code&gt;. ECS Fargate's &lt;code&gt;awsvpc&lt;/code&gt; network mode puts all containers in the same task into a shared network namespace. Same-task containers talk on &lt;code&gt;localhost&lt;/code&gt;. Docker Compose service names don't resolve here.&lt;/p&gt;

&lt;p&gt;After fixing the import order, traces started flowing immediately.&lt;/p&gt;




&lt;h2&gt;
  
  
  Correlating Logs to Traces
&lt;/h2&gt;

&lt;p&gt;Having traces is useful. Having traces you can find from a log line is the actual goal.&lt;/p&gt;

&lt;p&gt;The Pino logger needed a &lt;code&gt;mixin&lt;/code&gt; function that reads the active OpenTelemetry span and injects its IDs into every log entry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/api&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pino&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="nf"&gt;mixin&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;span&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getActiveSpan&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;spanContext&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;trace_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;span_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;spanId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;formatters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;level&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;label&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now every log line looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"msg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Database connection refused"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"trace_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4bf92f3577b34da6a3ce929d0e0e4736"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"span_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"00f067aa0ba902b7"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-01-15T14:23:01.234Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a Slack alert fires, I can grep CloudWatch Logs for the &lt;code&gt;trace_id&lt;/code&gt;, then paste it directly into Jaeger's search. One click. The full trace — every service, every database query, every millisecond — is right there.&lt;/p&gt;
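&lt;p&gt;The correlation step itself is just an equality filter over structured log lines. A minimal sketch of that logic (in practice this would be a CloudWatch Logs Insights query or a &lt;code&gt;filter_log_events&lt;/code&gt; call, but the idea is the same):&lt;/p&gt;

```python
import json

def lines_for_trace(log_lines, trace_id):
    """Filter structured (JSON) log lines down to one request's story.

    Every Pino line carries the injected trace_id, so correlating a
    Slack alert to its full request history is a single equality filter.
    """
    matches = []
    for raw in log_lines:
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (startup banners, etc.)
        if entry.get("trace_id") == trace_id:
            matches.append(entry)
    return matches
```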

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5j3z7wiaqtuel1hjohwd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5j3z7wiaqtuel1hjohwd.png" alt="Grafana Dashboard" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Bug #2: Lambda Auto-Remediation Broke the Deployment Controller
&lt;/h2&gt;

&lt;p&gt;This one was more subtle.&lt;/p&gt;

&lt;p&gt;The Lambda function's job: CloudWatch alarm fires (high 5xx rate) → Lambda detects it → Lambda restarts the unhealthy ECS service.&lt;/p&gt;

&lt;p&gt;My first implementation used &lt;code&gt;UpdateService&lt;/code&gt; with &lt;code&gt;forceNewDeployment: true&lt;/code&gt;. That's the standard approach for restarting an ECS service. It should have worked.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ecs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_service&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cluster_arn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;forceNewDeployment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# This fails silently or throws
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It threw:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;InvalidParameterException: Unable to update the service because
a deployment is already in progress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reason: when an ECS service uses &lt;code&gt;deployment_controller { type = "CODE_DEPLOY" }&lt;/code&gt;, AWS hands deployment control entirely to CodeDeploy. &lt;code&gt;UpdateService --forceNewDeployment&lt;/code&gt; is incompatible with an active CodeDeploy-controlled service. The two systems conflict.&lt;/p&gt;

&lt;p&gt;The correct approach is &lt;code&gt;ecs:StopTask&lt;/code&gt; — stop the specific unhealthy task directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remediate_unhealthy_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster_arn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;service_arn&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# List running tasks for the service
&lt;/span&gt;    &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ecs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_tasks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cluster_arn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;serviceName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;service_arn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;desiredStatus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RUNNING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;taskArns&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;

    &lt;span class="c1"&gt;# Stop the first running task
&lt;/span&gt;    &lt;span class="n"&gt;ecs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stop_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cluster_arn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Auto-remediation: high error rate detected via CloudWatch alarm&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a task stops, ECS detects the task count is below desired and launches a replacement. The service recovers. CodeDeploy is never touched. No deployment state corruption.&lt;/p&gt;
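&lt;p&gt;The decision boils down to one branch on the deployment controller type, which is worth encoding explicitly if the remediation Lambda ever covers services with different controllers. A minimal sketch (the helper name is mine, not from the repo):&lt;/p&gt;

```python
def remediation_action(deployment_controller: str) -> str:
    """Pick the safe restart mechanism for an ECS service.

    forceNewDeployment only works when ECS itself owns deployments.
    Under CODE_DEPLOY, the service must be recycled by stopping tasks
    and letting ECS backfill to the desired count.
    """
    if deployment_controller == "CODE_DEPLOY":
        return "stop_task"          # ecs:StopTask on the unhealthy task
    return "force_new_deployment"   # ecs:UpdateService is fine here
```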




&lt;h2&gt;
  
  
  Bug #3: The Idempotency Problem
&lt;/h2&gt;

&lt;p&gt;Lambda triggered three times for the same alarm window. Three concurrent invocations. Three tasks stopped simultaneously. The service dropped to zero running tasks and couldn't recover fast enough to pass health checks.&lt;/p&gt;

&lt;p&gt;The fix: check your own logs before acting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_recent_remediation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_group&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_stream_prefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return True if auto-remediation ran successfully in the last N minutes.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;cutoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;window_minutes&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;streams&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe_log_streams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;logGroupName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_group&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;logStreamNamePrefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_stream_prefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;orderBy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;LastEventTime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;descending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
    &lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;logStreams&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;streams&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_log_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;logGroupName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_group&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;logStreamName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;logStreamName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;startTime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cutoff&lt;/span&gt;
        &lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;events&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Auto-remediation successful&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;check_recent_remediation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LOG_GROUP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LOG_STREAM_PREFIX&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Recent remediation found — skipping to avoid thrash&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;skipped&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;idempotency_guard&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nf"&gt;remediate_unhealthy_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CLUSTER_ARN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SERVICE_ARN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Auto-remediation successful&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One alarm. One Lambda invocation that acts. All subsequent invocations within 10 minutes exit early. The service gets one clean restart instead of a cascade.&lt;/p&gt;
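&lt;p&gt;Stripped of the CloudWatch API calls, the guard is a pure predicate over recent log events, which makes the windowing logic easy to unit test. A sketch of that core (function name is my own):&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

def should_skip(events, window_minutes=10, now=None):
    """Idempotency guard distilled to its essence.

    events: list of (timestamp, message) pairs, e.g. pulled from
    CloudWatch Logs. Any 'Auto-remediation successful' marker inside
    the window means another invocation already acted on this alarm.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(minutes=window_minutes)
    return any(
        ts >= cutoff and "Auto-remediation successful" in msg
        for ts, msg in events
    )
```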

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2t5j3yl412r55p0xdmk3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2t5j3yl412r55p0xdmk3.png" alt="Jenkins Pipeline Flow" width="800" height="154"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Full Pipeline: 11 Stages
&lt;/h2&gt;

&lt;p&gt;The Jenkins pipeline that drives all of this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Secret Scan&lt;/strong&gt; (Gitleaks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type Check + Lint&lt;/strong&gt; (TypeScript + ESLint)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependency Audit&lt;/strong&gt; (npm audit)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Quality&lt;/strong&gt; (SonarCloud)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build Images&lt;/strong&gt; (Docker, tagged with git SHA)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image Scan&lt;/strong&gt; (Trivy CVE detection)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SBOM Generation&lt;/strong&gt; (Syft — CycloneDX + SPDX)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IaC Scan&lt;/strong&gt; (Checkov on Terraform)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ECR Push&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Task Definition Registration&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blue/Green Deployment&lt;/strong&gt; (CodeDeploy, 10% traffic per minute)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Security gates first. Deployment last. The same principle from the GitOps project applies here.&lt;/p&gt;




&lt;h2&gt;
  
  
  What It Looks Like When It Works
&lt;/h2&gt;

&lt;p&gt;A request comes in to the NestJS backend:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;OpenTelemetry generates a trace ID and creates a root span&lt;/li&gt;
&lt;li&gt;Each downstream call (database query, external HTTP) gets a child span&lt;/li&gt;
&lt;li&gt;Pino injects the trace ID into every log line during that request's lifecycle&lt;/li&gt;
&lt;li&gt;Prometheus records the request duration in a histogram&lt;/li&gt;
&lt;li&gt;If the response is 5xx: Alertmanager routes to Slack with the alarm context&lt;/li&gt;
&lt;li&gt;In Slack: I see the alert, click the CloudWatch link, grep for the trace ID, open Jaeger, see the full call graph in under 60 seconds&lt;/li&gt;
&lt;/ol&gt;
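&lt;p&gt;Step 6 can even be scripted: pull the &lt;code&gt;trace_id&lt;/code&gt; out of the alert's log line and build the Jaeger link directly. A small sketch, assuming the Jaeger UI at its default port (16686):&lt;/p&gt;

```python
import json

def jaeger_url(log_line: str, jaeger_base: str = "http://localhost:16686") -> str:
    """Turn one alert log line into a Jaeger trace link.

    The trace_id that Pino injected (step 3 above) is all the Jaeger
    UI needs to render the full call graph (step 6).
    """
    trace_id = json.loads(log_line)["trace_id"]
    return f"{jaeger_base}/trace/{trace_id}"
```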

&lt;p&gt;And if the error rate crosses the threshold:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;CloudWatch alarm fires&lt;/li&gt;
&lt;li&gt;Lambda checks for recent remediation (idempotency guard)&lt;/li&gt;
&lt;li&gt;Lambda stops the unhealthy task&lt;/li&gt;
&lt;li&gt;ECS replaces it with a fresh task&lt;/li&gt;
&lt;li&gt;Error rate drops&lt;/li&gt;
&lt;li&gt;Alarm clears&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No manual intervention. No 3am pages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0tlku4ht6z5rxqg2vnk5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0tlku4ht6z5rxqg2vnk5.png" alt="Slack Alerts" width="759" height="402"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OTel import order is a hard constraint, not a preference.&lt;/strong&gt; The SDK must patch Node.js internals before any framework loads. One wrong line breaks the entire tracing setup with no error message to guide you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;ecs:StopTask&lt;/code&gt; is the correct remediation call when using CODE_DEPLOY.&lt;/strong&gt; &lt;code&gt;forceNewDeployment&lt;/code&gt; conflicts with the CodeDeploy controller. Stop the task — ECS handles the replacement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Idempotency in Lambda isn't optional when CloudWatch alarms are your trigger.&lt;/strong&gt; Alarms fire multiple times. Your remediation function needs to know when it already ran.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trace ID correlation turns three separate signals into one investigation.&lt;/strong&gt; Logs, traces, and metrics are each useful in isolation. Together, with the trace ID as the link, they tell the complete story of a request.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/celetrialprince166/Advanced_monitoring" rel="noopener noreferrer"&gt;Full repository — github.com/celetrialprince166/Advanced_monitoring&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://opentelemetry.io/docs/languages/js/automatic/" rel="noopener noreferrer"&gt;OpenTelemetry Node.js Auto-Instrumentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.jaegertracing.io/docs/latest/apis/" rel="noopener noreferrer"&gt;Jaeger OTLP Ingestion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/codedeploy/latest/userguide/deployment-steps-ecs.html" rel="noopener noreferrer"&gt;AWS CodeDeploy ECS Blue/Green&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What's the most useful observability signal you've added to a production system? Drop it below — I'm building a list of what actually helps vs. what just adds noise. 👇&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>opentelemetry</category>
      <category>observability</category>
    </item>
    <item>
      <title>Building a GitOps Pipeline on AWS ECS: From Manual SSH to Zero-Downtime Blue/Green Deployments</title>
      <dc:creator>Prince Ayiku</dc:creator>
      <pubDate>Tue, 07 Apr 2026 08:30:30 +0000</pubDate>
      <link>https://dev.to/prince_ayiku_166/building-a-gitops-pipeline-on-aws-ecs-from-manual-ssh-to-zero-downtime-bluegreen-deployments-3dlo</link>
      <guid>https://dev.to/prince_ayiku_166/building-a-gitops-pipeline-on-aws-ecs-from-manual-ssh-to-zero-downtime-bluegreen-deployments-3dlo</guid>
      <description>&lt;h1&gt;
  
  
  How I Built a GitOps Pipeline That Deploys Itself — and Rolls Back When Things Break
&lt;/h1&gt;

&lt;p&gt;I used to deploy by SSHing into a server, pulling new code, restarting Docker Compose, and hoping.&lt;/p&gt;

&lt;p&gt;That worked until the day I pushed a bug to production on a Friday afternoon and spent the weekend manually rolling it back.&lt;/p&gt;

&lt;p&gt;This is the story of rebuilding that entire workflow — from "SSH and pray" to a system where a git push triggers security scans, builds container images, shifts traffic 10% at a time, and automatically reverts if anything looks wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where It Started
&lt;/h2&gt;

&lt;p&gt;The app is a full-stack notes manager: Next.js frontend, NestJS backend, PostgreSQL, with Nginx as the reverse proxy. Four containers. Nothing exotic.&lt;/p&gt;

&lt;p&gt;The original deployment process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh ubuntu@my-server-ip
&lt;span class="nb"&gt;cd&lt;/span&gt; /opt/notes-app
git pull
docker-compose down &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;span class="c"&gt;# Go get coffee. Hope it comes back up.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is fine when you have one server and one developer. It breaks down the moment you want to deploy without downtime, roll back a bad release, or prove to a future employer that you know what you're doing.&lt;/p&gt;

&lt;p&gt;So I documented the rebuild as four distinct phases — not because I planned it that way, but because each phase solved a specific pain I'd already felt.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 1: Automate the Build (GitHub Actions)
&lt;/h2&gt;

&lt;p&gt;First step was getting the build out of my hands entirely. A GitHub Actions workflow that fires on every push to &lt;code&gt;main&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CI/CD Pipeline&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;build-and-push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Login to Amazon ECR&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/amazon-ecr-login@v2&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build and push backend&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/build-push-action@v5&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./backend&lt;/span&gt;
          &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ env.ECR_REGISTRY }}/notes-backend:${{ github.sha }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three images (backend, frontend, proxy), each tagged with the git commit SHA. No more &lt;code&gt;latest&lt;/code&gt; tags overwriting each other. Every commit gets its own immutable image.&lt;/p&gt;
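&lt;p&gt;The tagging scheme is trivial but worth being precise about, since the whole rollback story rests on it. A sketch (the registry hostname in the test is a placeholder, not the real one):&lt;/p&gt;

```python
def image_ref(registry: str, name: str, git_sha: str) -> str:
    """Build the immutable image reference for one commit.

    Tagging with the commit SHA instead of 'latest' gives every build
    its own address, so a rollback is just 'deploy the previous SHA'.
    """
    return f"{registry}/{name}:{git_sha}"
```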

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl0lql1qf3g1vnty1fqci.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl0lql1qf3g1vnty1fqci.png" alt="Architecture Diagram" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 2: Add Security Gates Before Automating Deployments
&lt;/h2&gt;

&lt;p&gt;Here's where I made a deliberate choice most tutorials skip: I added security scanning &lt;em&gt;before&lt;/em&gt; I automated the actual deployment.&lt;/p&gt;

&lt;p&gt;The logic: if you automate deployment of insecure code, you've just made insecurity faster.&lt;/p&gt;

&lt;p&gt;The Jenkins pipeline I built runs 7 gates in sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Gitleaks&lt;/strong&gt; — scans the entire git history for hardcoded credentials, API keys, tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TypeScript + ESLint&lt;/strong&gt; — type errors and code style issues caught at build time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;npm audit&lt;/strong&gt; — dependency vulnerability scan&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SonarCloud&lt;/strong&gt; — code quality gates (complexity, duplication, security rules)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker build&lt;/strong&gt; — images built and tagged with git SHA&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trivy&lt;/strong&gt; — scans each container image for CVEs (HIGH and CRITICAL flagged)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Syft&lt;/strong&gt; — generates a Software Bill of Materials (CycloneDX + SPDX JSON)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In lab mode, these gates report but don't block. In production, you'd run Trivy with &lt;code&gt;--exit-code 1&lt;/code&gt; and fail the build on the SonarCloud quality gate status to make them hard stops. The point was to build the habit of having the gates, not to enforce them from day one.&lt;/p&gt;
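&lt;p&gt;That report-vs-block switch can live in one small policy function shared by the scan stages. A minimal sketch of the idea (names are mine, not from the pipeline):&lt;/p&gt;

```python
def gate_exit_code(findings, enforce=False, blocking=("HIGH", "CRITICAL")):
    """Report-only vs hard-gate behavior for a scanner stage.

    findings: severity strings emitted by a scanner such as Trivy.
    Lab mode (enforce=False) surfaces findings but returns 0 so the
    pipeline continues; production mode fails the stage on any
    blocking severity.
    """
    hits = [s for s in findings if s in blocking]
    if enforce and hits:
        return 1  # hard gate: non-zero exit fails the pipeline stage
    return 0      # report-only: findings logged, build stays green
```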




&lt;h2&gt;
  
  
  Phase 3: Move to ECS Fargate
&lt;/h2&gt;

&lt;p&gt;Running Docker Compose on EC2 is fine until you need the EC2 instance to scale, fail over, or restart containers automatically. ECS Fargate solves all three: serverless containers, AWS manages the underlying compute, you define the task and it runs.&lt;/p&gt;

&lt;p&gt;The Terraform configuration provisions the entire stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecs_cluster"&lt;/span&gt; &lt;span class="s2"&gt;"main"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.project_name}-cluster"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecs_service"&lt;/span&gt; &lt;span class="s2"&gt;"app"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.project_name}-service"&lt;/span&gt;
  &lt;span class="nx"&gt;cluster&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecs_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;task_definition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecs_task_definition&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;launch_type&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"FARGATE"&lt;/span&gt;

  &lt;span class="nx"&gt;deployment_controller&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"CODE_DEPLOY"&lt;/span&gt;  &lt;span class="c1"&gt;# Hands deployment control to CodeDeploy&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;lifecycle&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;ignore_changes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;task_definition&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# CI/CD owns this, not Terraform&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;ignore_changes = [task_definition]&lt;/code&gt; line matters more than it looks. Terraform manages the infrastructure. Jenkins manages which task definition revision is deployed. Without it, every &lt;code&gt;terraform apply&lt;/code&gt; would roll back to whatever task definition Terraform last knew about — overwriting the version Jenkins just pushed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbjplpe4pazdp2ccvdkun.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbjplpe4pazdp2ccvdkun.png" alt="ECS Fargate Architecture" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Networking Trap That Got Me
&lt;/h2&gt;

&lt;p&gt;Before I talk about Phase 4, there's a specific failure I need to document because it will get you too.&lt;/p&gt;

&lt;p&gt;My backend couldn't connect to the database. &lt;code&gt;ECONNREFUSED database:5432&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In Docker Compose, services reach each other by their service name. &lt;code&gt;database&lt;/code&gt; resolves because Docker creates a shared bridge network with DNS for each service name.&lt;/p&gt;

&lt;p&gt;ECS Fargate uses &lt;code&gt;awsvpc&lt;/code&gt; network mode. All containers in the same task share a single network namespace — effectively the same &lt;code&gt;localhost&lt;/code&gt;. There's no inter-container DNS. The hostname &lt;code&gt;database&lt;/code&gt; doesn't resolve to anything.&lt;/p&gt;

&lt;p&gt;The fix is one word:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Docker Compose — works locally&lt;/span&gt;
&lt;span class="nv"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgresql://user:pass@database:5432/db

&lt;span class="c"&gt;# ECS Fargate — same task = same localhost&lt;/span&gt;
&lt;span class="nv"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgresql://user:pass@localhost:5432/db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't in the getting-started guide. It's buried in the ECS networking docs. And it will break every multi-container Fargate deployment that was originally written for Docker Compose, with an error message that points nowhere near the cause.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 4: Blue/Green Deployments with CodeDeploy
&lt;/h2&gt;

&lt;p&gt;This is the part that makes the system production-grade.&lt;/p&gt;

&lt;p&gt;When a new version is deployed, CodeDeploy spins up new ECS tasks (Green) alongside the existing ones (Blue). Traffic shifts 10% per minute from Blue to Green. If a CloudWatch alarm fires during the shift — 5xx error rate, unhealthy targets — traffic instantly reverts to 100% Blue.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T+0s    Blue: 100%    Green: starting     ← deploy begins
T+60s   Blue:  90%    Green: 10%          ← 10% shifted
T+120s  Blue:  80%    Green: 20%          ← steady if healthy
...
T+600s  Blue:   0%    Green: 100%         ← complete
T+900s  Blue tasks terminated             ← cleanup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the CloudWatch alarm fires at any point: traffic snaps back to 100% Blue instantly.&lt;/p&gt;
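&lt;p&gt;The alarm that arms that rollback can live in the same Terraform as everything else. A sketch of the shape, not copied from the repo (the name, threshold, and the &lt;code&gt;aws_lb.main&lt;/code&gt; reference are illustrative assumptions):&lt;/p&gt;

```hcl
# Illustrative alarm: trip when the ALB sees a burst of 5xx responses
# from the targets during the traffic shift.
resource "aws_cloudwatch_metric_alarm" "target_5xx" {
  alarm_name          = "${var.project_name}-target-5xx"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 1
  threshold           = 10
  comparison_operator = "GreaterThanOrEqualToThreshold"

  dimensions = {
    # Assumes an aws_lb.main resource elsewhere in the configuration.
    LoadBalancer = aws_lb.main.arn_suffix
  }
}
```

&lt;p&gt;Attach this alarm to the CodeDeploy deployment group and the 10%-per-minute shift gets its safety net.&lt;/p&gt;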

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwor7dj1yo1lweoi0wff0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwor7dj1yo1lweoi0wff0.png" alt="CodeDeploy Blue/Green Architecture" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Jenkins pipeline orchestrates this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight groovy"&gt;&lt;code&gt;&lt;span class="n"&gt;stage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Deploy to ECS'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;steps&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;sh&lt;/span&gt; &lt;span class="s1"&gt;'''
          # Render task definition with current image tags
          ./ecs/render-task-def.sh \
            --image-tag ${GIT_COMMIT:0:7} \
            --region eu-west-1

          # Register new task definition revision
          TASK_DEF_ARN=$(aws ecs register-task-definition \
            --cli-input-json file://ecs/task-definition-rendered.json \
            --query taskDefinition.taskDefinitionArn \
            --output text)

          # Trigger CodeDeploy blue/green
          aws deploy create-deployment \
            --cli-input-json file://ecs/codedeploy-input.json
        '''&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5j9kvm0eb6u8m87ax2xd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5j9kvm0eb6u8m87ax2xd.png" alt="Jenkins Pipeline Flow" width="800" height="118"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability: Knowing It's Actually Working
&lt;/h2&gt;

&lt;p&gt;Deploying successfully and &lt;em&gt;knowing&lt;/em&gt; it's working are different things.&lt;/p&gt;

&lt;p&gt;The observability stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus&lt;/strong&gt; scraping NestJS &lt;code&gt;/metrics&lt;/code&gt; endpoint every 15 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana&lt;/strong&gt; dashboards for request rate, latency, error rate, container health&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alertmanager&lt;/strong&gt; routing alert notifications to a Slack channel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch&lt;/strong&gt; for ECS logs with 30-day retention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb54p6zagremnhewy08hk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb54p6zagremnhewy08hk.png" alt="Grafana Dashboard" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjfgzvth86ab27n7jeyvk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjfgzvth86ab27n7jeyvk.png" alt="Slack Alerts" width="759" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Prometheus NestJS integration is worth noting — NestJS doesn't expose metrics by default. You need to instrument it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// metrics.module.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;PrometheusModule&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@willsoto/nestjs-prometheus&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;imports&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nx"&gt;PrometheusModule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/metrics&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;defaultMetrics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MetricsModule&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once that's running, Prometheus scrapes HTTP request counts, latency histograms, and error rates automatically.&lt;/p&gt;
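&lt;p&gt;On the Prometheus side, the matching scrape job is a few lines of config. A sketch, with the target host and port as placeholders for wherever the backend actually runs:&lt;/p&gt;

```yaml
scrape_configs:
  - job_name: nestjs-backend
    metrics_path: /metrics
    scrape_interval: 15s          # matches the 15-second cadence above
    static_configs:
      - targets: ['backend:3000'] # placeholder host:port
```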




&lt;h2&gt;
  
  
  Key Learnings
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Security gates belong before automated deployment, not after.&lt;/strong&gt; The moment you automate deployment of untested, unscanned code, you've made your pipeline a liability. Build the gates first, then automate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Fargate &lt;code&gt;awsvpc&lt;/code&gt; mode changes inter-container communication fundamentally.&lt;/strong&gt; Same-task containers talk on &lt;code&gt;localhost&lt;/code&gt;. Cross-task communication needs service discovery or an internal load balancer. Know this before you hit it in production.&lt;/p&gt;
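&lt;p&gt;For the cross-task case, the standard fix is ECS service discovery via Cloud Map. A minimal Terraform sketch (not from the repo; the namespace name and &lt;code&gt;var.vpc_id&lt;/code&gt; are illustrative):&lt;/p&gt;

```hcl
# Private DNS namespace so tasks in other services can resolve
# "database.internal.local" instead of relying on Compose-style names.
resource "aws_service_discovery_private_dns_namespace" "internal" {
  name = "internal.local"
  vpc  = var.vpc_id
}

resource "aws_service_discovery_service" "database" {
  name = "database"

  dns_config {
    namespace_id = aws_service_discovery_private_dns_namespace.internal.id
    dns_records {
      type = "A"
      ttl  = 10
    }
  }
}
```

&lt;p&gt;Wiring the ECS service to that registry gives other tasks a stable DNS name again, which is what Docker Compose was providing for free.&lt;/p&gt;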

&lt;p&gt;&lt;strong&gt;3. &lt;code&gt;ignore_changes = [task_definition]&lt;/code&gt; is required when Terraform and CI/CD share an ECS service.&lt;/strong&gt; Without it, Terraform and Jenkins will fight over task definition revisions on every apply.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Blue/green is only as good as your alarms.&lt;/strong&gt; If your CloudWatch alarm isn't configured before the deployment starts, there's nothing to trigger the rollback. The alarm is the safety net — set it up before you need it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. The AppSpec for CodeDeploy must be JSON-wrapped via CLI.&lt;/strong&gt; The happy-path docs don't mention this. Use &lt;code&gt;jq&lt;/code&gt; to wrap the YAML content as an &lt;code&gt;AppSpecContent&lt;/code&gt; JSON object, or the deployment will fail with an unhelpful error.&lt;/p&gt;
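&lt;p&gt;A minimal, runnable sketch of that wrapping step, assuming jq 1.6 or newer. The stub appspec here stands in for the real one, which names the task definition and container:&lt;/p&gt;

```shell
# Stub appspec for illustration only.
printf 'version: 0.0\nResources: []\n' > appspec.yaml

# Wrap the raw YAML as a CodeDeploy AppSpecContent revision object.
jq -n --rawfile spec appspec.yaml \
  '{revisionType: "AppSpecContent", appSpecContent: {content: $spec}}' \
  > revision.json
```

&lt;p&gt;The resulting JSON is what &lt;code&gt;aws deploy create-deployment&lt;/code&gt; expects as the revision; passing the YAML directly is what produces the unhelpful error.&lt;/p&gt;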




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;If I started this project today, I'd use AWS Systems Manager Session Manager from the start instead of a bastion host. No SSH port exposed, no key rotation, full audit trail of every session — and it's cheaper than running a separate EC2 instance as a jump box.&lt;/p&gt;
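&lt;p&gt;The day-to-day replacement for SSH is one command. A sketch, with a placeholder instance ID (assumes the SSM agent is running and the instance profile includes &lt;code&gt;AmazonSSMManagedInstanceCore&lt;/code&gt;):&lt;/p&gt;

```shell
# Interactive shell on a private instance, no inbound ports open.
# The instance ID below is a placeholder.
aws ssm start-session --target i-0123456789abcdef0
```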

&lt;p&gt;I'd also set the security gates to blocking mode from day one, not lab mode. The discipline of having a hard quality gate early shapes how you write code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources &amp;amp; Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/celetrialprince166/gitops_lab" rel="noopener noreferrer"&gt;Full repository — github.com/celetrialprince166/gitops_lab&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-networking-awsvpc.html" rel="noopener noreferrer"&gt;ECS Fargate awsvpc networking docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/codedeploy/latest/userguide/deployment-steps-ecs.html" rel="noopener noreferrer"&gt;CodeDeploy ECS Blue/Green deployment guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/meta-arguments/lifecycle" rel="noopener noreferrer"&gt;Terraform ECS lifecycle meta-argument&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next I'm building the advanced observability layer — distributed tracing with OpenTelemetry and Jaeger across the full service mesh. Follow along if that's useful.&lt;/p&gt;




&lt;p&gt;What's the most important thing your deployment pipeline is missing right now? Drop it in the comments — I'm building a list of what engineers actually care about vs. what tutorials focus on. 👇&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>gitops</category>
      <category>docker</category>
    </item>
    <item>
      <title>How I Built a GitOps Pipeline That Deploys Itself — and Rolls Back When Things Break</title>
      <dc:creator>Prince Ayiku</dc:creator>
      <pubDate>Tue, 07 Apr 2026 08:30:06 +0000</pubDate>
      <link>https://dev.to/prince_ayiku_166/how-i-built-a-gitops-pipeline-that-deploys-itself-and-rolls-back-when-things-break-5f8m</link>
      <guid>https://dev.to/prince_ayiku_166/how-i-built-a-gitops-pipeline-that-deploys-itself-and-rolls-back-when-things-break-5f8m</guid>
      <description>&lt;h1&gt;
  
  
  How I Built a GitOps Pipeline That Deploys Itself — and Rolls Back When Things Break
&lt;/h1&gt;

&lt;p&gt;I used to deploy by SSHing into a server, pulling new code, restarting Docker Compose, and hoping.&lt;/p&gt;

&lt;p&gt;That worked until the day I pushed a bug to production on a Friday afternoon and spent the weekend manually rolling it back.&lt;/p&gt;

&lt;p&gt;This is the story of rebuilding that entire workflow — from "SSH and pray" to a system where a git push triggers security scans, builds container images, shifts traffic 10% at a time, and automatically reverts if anything looks wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where It Started
&lt;/h2&gt;

&lt;p&gt;The app is a full-stack notes manager: Next.js frontend, NestJS backend, PostgreSQL, with Nginx as the reverse proxy. Four containers. Nothing exotic.&lt;/p&gt;

&lt;p&gt;The original deployment process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh ubuntu@my-server-ip
&lt;span class="nb"&gt;cd&lt;/span&gt; /opt/notes-app
git pull
docker-compose down &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;span class="c"&gt;# Go get coffee. Hope it comes back up.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is fine when you have one server and one developer. It breaks down the moment you want to deploy without downtime, roll back a bad release, or prove to a future employer that you know what you're doing.&lt;/p&gt;

&lt;p&gt;So I documented the rebuild as four distinct phases — not because I planned it that way, but because each phase solved a specific pain I'd already felt.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 1: Automate the Build (GitHub Actions)
&lt;/h2&gt;

&lt;p&gt;First step was getting the build out of my hands entirely. A GitHub Actions workflow that fires on every push to &lt;code&gt;main&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CI/CD Pipeline&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;build-and-push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Login to Amazon ECR&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/amazon-ecr-login@v2&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build and push backend&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/build-push-action@v5&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./backend&lt;/span&gt;
          &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ env.ECR_REGISTRY }}/notes-backend:${{ github.sha }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three images (backend, frontend, proxy), each tagged with the git commit SHA. No more &lt;code&gt;latest&lt;/code&gt; tags overwriting each other. Every commit gets its own immutable image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Farch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Farch.png" alt="Architecture Diagram" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 2: Add Security Gates Before Automating Deployments
&lt;/h2&gt;

&lt;p&gt;Here's where I made a deliberate choice most tutorials skip: I added security scanning &lt;em&gt;before&lt;/em&gt; I automated the actual deployment.&lt;/p&gt;

&lt;p&gt;The logic: if you automate deployment of insecure code, you've just made insecurity faster.&lt;/p&gt;

&lt;p&gt;The Jenkins pipeline I built runs 7 gates in sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Gitleaks&lt;/strong&gt; — scans the entire git history for hardcoded credentials, API keys, tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TypeScript + ESLint&lt;/strong&gt; — type errors and code style issues caught at build time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;npm audit&lt;/strong&gt; — dependency vulnerability scan&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SonarCloud&lt;/strong&gt; — code quality gates (complexity, duplication, security rules)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker build&lt;/strong&gt; — images built and tagged with git SHA&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trivy&lt;/strong&gt; — scans each container image for CVEs (HIGH and CRITICAL flagged)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Syft&lt;/strong&gt; — generates a Software Bill of Materials (CycloneDX + SPDX JSON)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In lab mode, these gates report but don't block. In production, you'd set &lt;code&gt;exit-code: 1&lt;/code&gt; on Trivy and enforce the SonarCloud quality gate so both become hard stops. The point was to build the habit of having the gates, not to enforce them from day one.&lt;/p&gt;
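&lt;p&gt;The first gate is also the cheapest to reproduce locally before pushing. A sketch of the Gitleaks invocation (flags as in Gitleaks v8; exact flags may differ by version):&lt;/p&gt;

```shell
# Scan the working tree and full git history for committed secrets;
# exits nonzero when findings exist, so it works as a pipeline gate.
gitleaks detect --source . --redact \
  --report-format json --report-path gitleaks-report.json
```

&lt;p&gt;Running this in a pre-push hook catches the credential before it ever reaches the pipeline, let alone the git history.&lt;/p&gt;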




&lt;h2&gt;
  
  
  Phase 3: Move to ECS Fargate
&lt;/h2&gt;

&lt;p&gt;Running Docker Compose on EC2 is fine until you need the EC2 instance to scale, fail over, or restart containers automatically. ECS Fargate solves all three: serverless containers, AWS manages the underlying compute, you define the task and it runs.&lt;/p&gt;

&lt;p&gt;The Terraform configuration provisions the entire stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecs_cluster"&lt;/span&gt; &lt;span class="s2"&gt;"main"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.project_name}-cluster"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecs_service"&lt;/span&gt; &lt;span class="s2"&gt;"app"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.project_name}-service"&lt;/span&gt;
  &lt;span class="nx"&gt;cluster&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecs_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;task_definition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecs_task_definition&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;launch_type&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"FARGATE"&lt;/span&gt;

  &lt;span class="nx"&gt;deployment_controller&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"CODE_DEPLOY"&lt;/span&gt;  &lt;span class="c1"&gt;# Hands deployment control to CodeDeploy&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;lifecycle&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;ignore_changes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;task_definition&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# CI/CD owns this, not Terraform&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;ignore_changes = [task_definition]&lt;/code&gt; line matters more than it looks. Terraform manages the infrastructure. Jenkins manages which task definition revision is deployed. Without it, every &lt;code&gt;terraform apply&lt;/code&gt; would roll back to whatever task definition Terraform last knew about — overwriting the version Jenkins just pushed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Ffargatearch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Ffargatearch.png" alt="ECS Fargate Architecture" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Networking Trap That Got Me
&lt;/h2&gt;

&lt;p&gt;Before I talk about Phase 4, there's a specific failure I need to document because it will get you too.&lt;/p&gt;

&lt;p&gt;My backend couldn't connect to the database. &lt;code&gt;ECONNREFUSED database:5432&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In Docker Compose, services reach each other by their service name. &lt;code&gt;database&lt;/code&gt; resolves because Docker creates a shared bridge network with DNS for each service name.&lt;/p&gt;

&lt;p&gt;ECS Fargate uses &lt;code&gt;awsvpc&lt;/code&gt; network mode. All containers in the same task share a single network namespace — effectively the same &lt;code&gt;localhost&lt;/code&gt;. There's no inter-container DNS. The hostname &lt;code&gt;database&lt;/code&gt; doesn't resolve to anything.&lt;/p&gt;

&lt;p&gt;The fix is one word:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Docker Compose — works locally&lt;/span&gt;
&lt;span class="nv"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgresql://user:pass@database:5432/db

&lt;span class="c"&gt;# ECS Fargate — same task = same localhost&lt;/span&gt;
&lt;span class="nv"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgresql://user:pass@localhost:5432/db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't in the getting-started guide. It's buried in the ECS networking docs. And it will break every multi-container Fargate deployment that was originally written for Docker Compose, with an error message that points nowhere near the cause.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 4: Blue/Green Deployments with CodeDeploy
&lt;/h2&gt;

&lt;p&gt;This is the part that makes the system production-grade.&lt;/p&gt;

&lt;p&gt;When a new version is deployed, CodeDeploy spins up new ECS tasks (Green) alongside the existing ones (Blue). Traffic shifts 10% per minute from Blue to Green. If a CloudWatch alarm fires during the shift — 5xx error rate, unhealthy targets — traffic instantly reverts to 100% Blue.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T+0s    Blue: 100%    Green: starting     ← deploy begins
T+60s   Blue:  90%    Green: 10%          ← 10% shifted
T+120s  Blue:  80%    Green: 20%          ← steady if healthy
...
T+600s  Blue:   0%    Green: 100%         ← complete
T+900s  Blue tasks terminated             ← cleanup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the CloudWatch alarm fires at any point: traffic snaps back to 100% Blue instantly.&lt;/p&gt;
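&lt;p&gt;That snap-back behaviour is configured on the deployment group, not on each deployment. A hedged CLI sketch (application, group, and alarm names are placeholders; shorthand syntax per the AWS CLI docs):&lt;/p&gt;

```shell
# Attach the CloudWatch alarm and enable automatic rollback
# for alarm-triggered and failed deployments.
aws deploy update-deployment-group \
  --application-name notes-app \
  --current-deployment-group-name notes-app-dg \
  --alarm-configuration enabled=true,alarms=[{name=alb-5xx-rate}] \
  --auto-rollback-configuration enabled=true,events=DEPLOYMENT_FAILURE,DEPLOYMENT_STOP_ON_ALARM
```

&lt;p&gt;Without &lt;code&gt;DEPLOYMENT_STOP_ON_ALARM&lt;/code&gt; in the rollback events, the alarm stops the deployment but leaves traffic wherever it was.&lt;/p&gt;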

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Fcodedeployarch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Fcodedeployarch.png" alt="CodeDeploy Blue/Green Architecture" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Jenkins pipeline orchestrates this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight groovy"&gt;&lt;code&gt;&lt;span class="n"&gt;stage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Deploy to ECS'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;steps&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;sh&lt;/span&gt; &lt;span class="s1"&gt;'''
          # Render task definition with current image tags
          ./ecs/render-task-def.sh \
            --image-tag ${GIT_COMMIT:0:7} \
            --region eu-west-1

          # Register new task definition revision
          TASK_DEF_ARN=$(aws ecs register-task-definition \
            --cli-input-json file://ecs/task-definition-rendered.json \
            --query taskDefinition.taskDefinitionArn \
            --output text)

          # Trigger CodeDeploy blue/green
          aws deploy create-deployment \
            --cli-input-json file://ecs/codedeploy-input.json
        '''&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Fjenkins_pipeline.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Fjenkins_pipeline.png" alt="Jenkins Pipeline Flow" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability: Knowing It's Actually Working
&lt;/h2&gt;

&lt;p&gt;Deploying successfully and &lt;em&gt;knowing&lt;/em&gt; it's working are different things.&lt;/p&gt;

&lt;p&gt;The observability stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus&lt;/strong&gt; scraping NestJS &lt;code&gt;/metrics&lt;/code&gt; endpoint every 15 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana&lt;/strong&gt; dashboards for request rate, latency, error rate, container health&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alertmanager&lt;/strong&gt; routing alert notifications to a Slack channel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch&lt;/strong&gt; for ECS logs with 30-day retention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Fgrafanadash.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Fgrafanadash.png" alt="Grafana Dashboard" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Fslackalertscreenshot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Fslackalertscreenshot.png" alt="Slack Alerts" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Prometheus NestJS integration is worth noting — NestJS doesn't expose metrics by default. You need to instrument it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// metrics.module.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;PrometheusModule&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@willsoto/nestjs-prometheus&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;imports&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nx"&gt;PrometheusModule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/metrics&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;defaultMetrics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MetricsModule&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once that's running, Prometheus scrapes the Node.js default metrics (heap usage, event loop lag, GC timings) automatically. HTTP request counts, latency histograms, and error rates need their own counter and histogram providers registered on top of the default set.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Learnings
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Security gates belong before automated deployment, not after.&lt;/strong&gt; The moment you automate deployment of untested, unscanned code, you've made your pipeline a liability. Build the gates first, then automate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Fargate &lt;code&gt;awsvpc&lt;/code&gt; mode changes inter-container communication fundamentally.&lt;/strong&gt; Same-task containers talk on &lt;code&gt;localhost&lt;/code&gt;. Cross-task communication needs service discovery or an internal load balancer. Know this before you hit it in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. &lt;code&gt;ignore_changes = [task_definition]&lt;/code&gt; is required when Terraform and CI/CD share an ECS service.&lt;/strong&gt; Without it, Terraform and Jenkins will fight over task definition revisions on every apply.&lt;/p&gt;
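
&lt;p&gt;In Terraform that looks like this (a minimal sketch; the service and cluster names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_ecs_service" "app" {
  name            = "app"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 2

  lifecycle {
    # Let the CI/CD pipeline own task definition revisions;
    # Terraform stops trying to "fix" them on every apply.
    ignore_changes = [task_definition]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;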

&lt;p&gt;&lt;strong&gt;4. Blue/green is only as good as your alarms.&lt;/strong&gt; If your CloudWatch alarm isn't configured before the deployment starts, there's nothing to trigger the rollback. The alarm is the safety net — set it up before you need it.&lt;/p&gt;
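
&lt;p&gt;A rough sketch of that wiring (the metric choice, threshold, and resource names are illustrative, and the deployment group's required arguments are elided):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_cloudwatch_metric_alarm" "target_5xx" {
  alarm_name          = "app-target-5xx"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 10
  comparison_operator = "GreaterThanThreshold"
}

# Attach it to the deployment group so a breach stops the deployment
resource "aws_codedeploy_deployment_group" "app" {
  # ... app name, service role, blue/green settings ...
  alarm_configuration {
    alarms  = [aws_cloudwatch_metric_alarm.target_5xx.alarm_name]
    enabled = true
  }
  auto_rollback_configuration {
    enabled = true
    events  = ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;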

&lt;p&gt;&lt;strong&gt;5. The AppSpec for CodeDeploy must be JSON-wrapped via the CLI.&lt;/strong&gt; The happy-path docs don't cover this step. Use &lt;code&gt;jq&lt;/code&gt; to wrap the YAML content as an &lt;code&gt;AppSpecContent&lt;/code&gt; JSON object, or the deployment will fail with an unhelpful error.&lt;/p&gt;
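
&lt;p&gt;The wrapping step, sketched (the application and deployment-group names are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Wrap the raw YAML as an AppSpecContent revision object
APPSPEC=$(jq -n --arg content "$(cat appspec.yaml)" \
  '{revisionType: "AppSpecContent", appSpecContent: {content: $content}}')

# Hand the wrapped revision to CodeDeploy
aws deploy create-deployment \
  --application-name my-ecs-app \
  --deployment-group-name my-ecs-dg \
  --revision "$APPSPEC"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;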




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;If I started this project today, I'd add AWS Systems Manager Session Manager from the start instead of a Bastion Host. No SSH port exposed, no key rotation, full audit trail of every session — and it's cheaper than running a separate EC2 instance as a jump box.&lt;/p&gt;

&lt;p&gt;I'd also set the security gates to blocking mode from day one, not lab mode. The discipline of having a hard quality gate early shapes how you write code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources &amp;amp; Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/celetrialprince166/gitops_lab" rel="noopener noreferrer"&gt;Full repository — github.com/celetrialprince166/gitops_lab&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-networking-awsvpc.html" rel="noopener noreferrer"&gt;ECS Fargate awsvpc networking docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/codedeploy/latest/userguide/deployment-steps-ecs.html" rel="noopener noreferrer"&gt;CodeDeploy ECS Blue/Green deployment guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/meta-arguments/lifecycle" rel="noopener noreferrer"&gt;Terraform ECS lifecycle meta-argument&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next I'm building the advanced observability layer — distributed tracing with OpenTelemetry and Jaeger across the full service mesh. Follow along if that's useful.&lt;/p&gt;




&lt;p&gt;What's the most important thing your deployment pipeline is missing right now? Drop it in the comments — I'm building a list of what engineers actually care about vs. what tutorials focus on. 👇&lt;/p&gt;

</description>
      <category>devops</category>
      <category>gitops</category>
      <category>aws</category>
      <category>cicd</category>
    </item>
    <item>
      <title>I Built a 3-Tier AWS Architecture With Terraform — Here's What Actually Tripped Me Up</title>
      <dc:creator>Prince Ayiku</dc:creator>
      <pubDate>Tue, 31 Mar 2026 18:11:06 +0000</pubDate>
      <link>https://dev.to/prince_ayiku_166/i-built-a-3-tier-aws-architecture-with-terraform-heres-what-actually-tripped-me-up-3d55</link>
      <guid>https://dev.to/prince_ayiku_166/i-built-a-3-tier-aws-architecture-with-terraform-heres-what-actually-tripped-me-up-3d55</guid>
      <description>&lt;h1&gt;
  
  
  I Built a 3-Tier AWS Architecture With Terraform — Here's What Actually Tripped Me Up
&lt;/h1&gt;

&lt;p&gt;I thought I understood Terraform. Then I tried to inject a database endpoint that didn't exist yet into a server that hadn't booted yet, and I stared at my screen for a solid hour.&lt;/p&gt;

&lt;p&gt;That moment taught me more about Infrastructure as Code than any tutorial had.&lt;/p&gt;

&lt;p&gt;This is the story of building a production-style 3-tier AWS architecture from scratch — what I built, what broke, and what I'd do differently.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Built This
&lt;/h2&gt;

&lt;p&gt;I'm on a DevOps learning path at AmaliTech, and I'd been doing the usual things: tutorials, small scripts, single-instance deployments. But I kept noticing that production systems don't look like that. They have layers. They isolate things. The database is never directly reachable from the internet.&lt;/p&gt;

&lt;p&gt;So I decided to build one — not a toy, but an architecture that actually reflects how real workloads run. The application I chose to deploy on top of it was a Pharma AI assistant: Next.js frontend, Python FastAPI backend, Clerk authentication, Paystack payments, and Groq for the LLM layer.&lt;/p&gt;

&lt;p&gt;The goal wasn't to ship the app. It was to build the infrastructure correctly, document the decisions, and understand &lt;em&gt;why&lt;/em&gt; each piece exists.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;The architecture has three layers, each in its own network zone:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttebdlpf1zn6tc62b6zk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttebdlpf1zn6tc62b6zk.png" alt="3-Tier Architecture" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1 — Public (ALB + Bastion Host)&lt;/strong&gt;&lt;br&gt;
The Application Load Balancer receives traffic from the internet and forwards it to the app tier. The Bastion Host is the only way to SSH into anything — and it's locked down to my IP only.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 2 — Private Application (EC2 Auto Scaling Group)&lt;/strong&gt;&lt;br&gt;
EC2 instances running Docker containers. They live in private subnets — no public IPs, no direct internet exposure. The only traffic that reaches them comes through the ALB. They can reach the internet outbound through a NAT Gateway (for pulling Docker images), but nothing can reach them directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 3 — Private Database (RDS PostgreSQL)&lt;/strong&gt;&lt;br&gt;
The database sits in private subnets with no route table attached to an internet gateway. Not "protected by a security group." &lt;em&gt;Structurally unreachable&lt;/em&gt; from the internet.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Terraform Structure
&lt;/h2&gt;

&lt;p&gt;Everything is modular. Five modules: &lt;code&gt;networking&lt;/code&gt;, &lt;code&gt;security&lt;/code&gt;, &lt;code&gt;database&lt;/code&gt;, &lt;code&gt;alb&lt;/code&gt;, &lt;code&gt;compute&lt;/code&gt;. The root &lt;code&gt;main.tf&lt;/code&gt; just orchestrates them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;modules/
├── networking/   # VPC, subnets, IGW, NAT Gateway, route tables
├── security/     # 4 security groups (ALB, Bastion, App, DB)
├── database/     # RDS PostgreSQL in private DB subnets
├── alb/          # Application Load Balancer + target group + listener
└── compute/      # Launch template, Auto Scaling Group, Bastion Host
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each module exposes outputs that downstream modules depend on: networking flows into security, security flows into compute and database, and everything ultimately converges on compute.&lt;/p&gt;
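
&lt;p&gt;A sketch of that wiring in the root module (the output and variable names here are illustrative, not necessarily the repo's exact ones):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Root main.tf: each module's outputs feed the next one down
module "networking" {
  source = "./modules/networking"
}

module "security" {
  source = "./modules/security"
  vpc_id = module.networking.vpc_id
}

module "database" {
  source     = "./modules/database"
  subnet_ids = module.networking.db_subnet_ids
  db_sg_id   = module.security.db_sg_id
}

module "compute" {
  source      = "./modules/compute"
  subnet_ids  = module.networking.private_subnet_ids
  app_sg_id   = module.security.app_sg_id
  db_endpoint = module.database.endpoint
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;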

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8aezjg2uki1bfqq8077.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8aezjg2uki1bfqq8077.png" alt="Terraform Apply Output" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem That Stopped Me Cold
&lt;/h2&gt;

&lt;p&gt;Here's where things got interesting.&lt;/p&gt;

&lt;p&gt;My EC2 instances boot by running a &lt;code&gt;user_data.sh&lt;/code&gt; script. That script pulls a Docker image from Docker Hub and runs it with environment variables — including the database connection string.&lt;/p&gt;

&lt;p&gt;The database connection string includes the RDS endpoint. Like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgresql://username:password@mydb.abc123xyz.us-east-1.rds.amazonaws.com:5432/myapp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem: that endpoint only exists &lt;em&gt;after&lt;/em&gt; Terraform creates the RDS instance. Which happens &lt;em&gt;before&lt;/em&gt; the EC2 instances boot. Which means I need to pass a value that doesn't exist at the start of &lt;code&gt;terraform apply&lt;/code&gt; into a script that runs at the end of it.&lt;/p&gt;

&lt;p&gt;I tried a few approaches that didn't work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hardcoding the endpoint&lt;/strong&gt; — completely defeats the purpose of IaC. Next deploy on a fresh account, it breaks immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Passing it as a plain string variable&lt;/strong&gt; — still needs the actual value upfront.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Running a second script after apply&lt;/strong&gt; — works once, but now you have manual steps that live outside your Terraform state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The solution was &lt;code&gt;templatefile()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In the compute module, launch template user_data:&lt;/span&gt;
&lt;span class="nx"&gt;user_data&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;base64encode&lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;templatefile&lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"${path.module}/scripts/user_data.sh"&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;database_url&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"postgresql://${var.db_username}:${var.db_password}@${var.db_endpoint}/${var.db_name}?sslmode=require"&lt;/span&gt;
    &lt;span class="nx"&gt;direct_url&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"postgresql://${var.db_username}:${var.db_password}@${var.db_endpoint}/${var.db_name}?sslmode=require"&lt;/span&gt;
    &lt;span class="nx"&gt;docker_username&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dockerhub_username&lt;/span&gt;
    &lt;span class="nx"&gt;docker_password&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dockerhub_token&lt;/span&gt;
    &lt;span class="nx"&gt;clerk_secret_key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;clerk_secret_key&lt;/span&gt;
    &lt;span class="c1"&gt;# ... other vars&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform resolves module outputs in dependency order. By the time it runs the compute module, the database module has already completed — and its endpoint output is available. The &lt;code&gt;templatefile()&lt;/code&gt; function substitutes all the variables into the shell script before base64-encoding it. The EC2 instance boots with a fully rendered startup script that has the real database URL already baked in.&lt;/p&gt;

&lt;p&gt;No manual steps. No hardcoded values. Works on every fresh deploy.&lt;/p&gt;
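
&lt;p&gt;For completeness, the template side looks roughly like this (the image name and variable set are illustrative); the &lt;code&gt;${...}&lt;/code&gt; placeholders are filled in by &lt;code&gt;templatefile()&lt;/code&gt;, not by the shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/bin/bash
# user_data.sh: every ${...} below is a templatefile() variable,
# already substituted before the instance ever boots
echo "${docker_password}" | docker login -u "${docker_username}" --password-stdin
docker run -d -p 80:3000 \
  -e DATABASE_URL="${database_url}" \
  -e CLERK_SECRET_KEY="${clerk_secret_key}" \
  myuser/pharma-app:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;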




&lt;h2&gt;
  
  
  The Other Gotcha: Docker Hub from a Private Subnet
&lt;/h2&gt;

&lt;p&gt;After solving the database injection problem, I ran &lt;code&gt;terraform apply&lt;/code&gt; and watched everything provision cleanly. Then I checked the EC2 logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo cat&lt;/span&gt; /var/log/user-data.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Docker pull was failing. The instance couldn't reach Docker Hub.&lt;/p&gt;

&lt;p&gt;I had set up private subnets and a NAT Gateway, but I'd missed connecting them. The private route table didn't have a route for &lt;code&gt;0.0.0.0/0&lt;/code&gt; pointing at the NAT Gateway. Private subnets need that explicit route — they don't inherit it from anywhere.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route_table"&lt;/span&gt; &lt;span class="s2"&gt;"private"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;route&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;cidr_block&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;
    &lt;span class="nx"&gt;nat_gateway_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_nat_gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;  &lt;span class="c1"&gt;# This is what I was missing&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fix applied. Docker pull worked. App came up.&lt;/p&gt;
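
&lt;p&gt;One related detail: the route table only takes effect once it's associated with each private subnet (a sketch; the subnet resource layout is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Without these associations, the NAT route above is never used
resource "aws_route_table_association" "private" {
  count          = length(aws_subnet.private)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;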




&lt;h2&gt;
  
  
  Security Groups That Reference Each Other
&lt;/h2&gt;

&lt;p&gt;One thing I'm actually proud of in this project is how the security groups are set up.&lt;/p&gt;

&lt;p&gt;Instead of allowing traffic from IP address ranges, each security group references another security group as the source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# App security group — only allows HTTP from the ALB security group&lt;/span&gt;
&lt;span class="nx"&gt;ingress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;from_port&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
  &lt;span class="nx"&gt;to_port&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
  &lt;span class="nx"&gt;protocol&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
  &lt;span class="nx"&gt;security_groups&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;alb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Database security group — only allows PostgreSQL from the App security group&lt;/span&gt;
&lt;span class="nx"&gt;ingress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;from_port&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5432&lt;/span&gt;
  &lt;span class="nx"&gt;to_port&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5432&lt;/span&gt;
  &lt;span class="nx"&gt;protocol&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
  &lt;span class="nx"&gt;security_groups&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why does this matter? Because EC2 instances in an Auto Scaling Group come and go. Their IP addresses change. If you allow traffic from &lt;code&gt;10.0.2.0/24&lt;/code&gt;, you need to keep that CIDR accurate forever. If you allow traffic from the App security group ID, any instance in that group is automatically covered — regardless of its IP.&lt;/p&gt;




&lt;h2&gt;
  
  
  The CI/CD Pipeline
&lt;/h2&gt;

&lt;p&gt;I set up a GitHub Actions workflow that triggers on every push to &lt;code&gt;main&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Checkout the code&lt;/li&gt;
&lt;li&gt;Log in to Docker Hub&lt;/li&gt;
&lt;li&gt;Build the Docker image from &lt;code&gt;./pharma_app/Dockerfile&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Push it with the &lt;code&gt;latest&lt;/code&gt; tag&lt;/li&gt;
&lt;/ol&gt;
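
&lt;p&gt;Those four steps sketch out to a workflow along these lines (the secret names and image tag are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# .github/workflows/build.yml (sketch)
name: build-and-push
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: ./pharma_app
          push: true
          tags: myuser/pharma-app:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;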

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy5qb4nfpcg8opgbtdwq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy5qb4nfpcg8opgbtdwq.png" alt="CI/CD Pipeline" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When new instances launch (via ASG), they pull &lt;code&gt;latest&lt;/code&gt; from Docker Hub automatically through the user_data script. So a code push → Docker Hub → next instance launch picks up the new image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk9l7ehpxfszz1ytzzj5q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk9l7ehpxfszz1ytzzj5q.png" alt="Successful Build" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Running Looks Like
&lt;/h2&gt;

&lt;p&gt;After &lt;code&gt;terraform apply&lt;/code&gt; completes, the output gives you the ALB DNS name. Hit that in a browser and the app loads — served through the load balancer, from a private EC2 instance, talking to a database that has no internet exposure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8sjst4te2p8wx5t193ps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8sjst4te2p8wx5t193ps.png" alt="Application Running" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Learnings
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;templatefile()&lt;/code&gt; is how you inject dynamic values into user_data.&lt;/strong&gt; Terraform resolves the value after the dependency it comes from is complete. Use it — don't work around it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Private subnets are not automatically NAT'd.&lt;/strong&gt; You need to create the NAT Gateway, create a private route table, add a &lt;code&gt;0.0.0.0/0&lt;/code&gt; route pointing at the NAT, and associate that route table with your private subnets. All four steps. No shortcuts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Security groups that reference each other are more resilient than CIDR-based rules.&lt;/strong&gt; Especially in environments where instances scale dynamically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add retry logic to user_data scripts.&lt;/strong&gt; Instance networking isn't always ready the second the script runs. Five retries with 15-second delays costs nothing and prevents a class of flaky failures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RDS subnet groups require at least two AZs.&lt;/strong&gt; This is an AWS requirement, not a recommendation. Design your subnets for multi-AZ from the start.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
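
&lt;p&gt;The retry idea from point 4 can be sketched as a small shell helper (the attempt count, delay, and wrapped command are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/bin/bash
# retry: run a command until it succeeds, up to a limit.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  local attempts=$1; shift
  local delay=$1; shift
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "retry: giving up after $n attempts"
      return 1
    fi
    echo "retry: attempt $n failed, sleeping ${delay}s"
    sleep "$delay"
    n=$((n + 1))
  done
}

# In user_data, wrap the flaky network-dependent step, e.g.:
# retry 5 15 docker pull myuser/pharma-app:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;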




&lt;h2&gt;
  
  
  Lessons for Other Learners
&lt;/h2&gt;

&lt;p&gt;If you're working through a similar architecture and something isn't connecting, nine times out of ten it's a routing issue or a security group that's too restrictive. Check your route tables before you start doubting your application code.&lt;/p&gt;

&lt;p&gt;And don't skip the modular structure because it feels like overhead. When something breaks, knowing that the networking module is isolated from the compute module makes debugging dramatically faster.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources &amp;amp; Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Full repo: &lt;a href="https://github.com/celetrialprince166/Terraform_3tierArch" rel="noopener noreferrer"&gt;github.com/celetrialprince166/Terraform_3tierArch&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/functions/templatefile" rel="noopener noreferrer"&gt;Terraform templatefile() docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/whitepapers/latest/serverless-multi-tier-architectures-api-gateway-lambda/three-tier-architecture-overview.html" rel="noopener noreferrer"&gt;AWS 3-Tier Architecture reference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next: I'm building out a full CI/CD pipeline with Jenkins, GitHub Actions, SonarCloud code analysis, and Trivy image scanning. Follow along if that's useful.&lt;/p&gt;




&lt;p&gt;Have you built a 3-tier architecture before? What was the part that gave you the most trouble? Drop it in the comments — I'm genuinely curious whether the NAT Gateway thing trips everyone up or just me. 👇&lt;/p&gt;

</description>
      <category>devops</category>
      <category>terraform</category>
      <category>aws</category>
      <category>iac</category>
    </item>
  </channel>
</rss>
