<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Prince Ayiku</title>
    <description>The latest articles on DEV Community by Prince Ayiku (@prince_ayiku_166).</description>
    <link>https://dev.to/prince_ayiku_166</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2082058%2F43008bfa-e14b-4c2f-b205-ad6e8ca2496b.png</url>
      <title>DEV Community: Prince Ayiku</title>
      <link>https://dev.to/prince_ayiku_166</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/prince_ayiku_166"/>
    <language>en</language>
    <item>
      <title>I Built a Fully Serverless Task Manager on AWS — Here's What the Docs Don't Tell You</title>
      <dc:creator>Prince Ayiku</dc:creator>
      <pubDate>Mon, 20 Apr 2026 08:54:47 +0000</pubDate>
      <link>https://dev.to/prince_ayiku_166/i-built-a-fully-serverless-task-manager-on-aws-heres-what-the-docs-dont-tell-you-37c8</link>
      <guid>https://dev.to/prince_ayiku_166/i-built-a-fully-serverless-task-manager-on-aws-heres-what-the-docs-dont-tell-you-37c8</guid>
      <description>&lt;p&gt;I spent weeks building a fully serverless task management system on AWS — Lambda, DynamoDB, Cognito, SNS, SES, Amplify, the whole stack — provisioned entirely with Terraform and wired into a GitHub Actions CI/CD pipeline.&lt;/p&gt;

&lt;p&gt;Here's what I learned. Not the happy path. The real stuff.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Repo: &lt;a href="https://github.com/celetrialprince166/Serverless-task-management-app" rel="noopener noreferrer"&gt;github.com/celetrialprince166/Serverless-task-management-app&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;A role-based task management app where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Admins&lt;/strong&gt; can create, assign, update, and delete tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Members&lt;/strong&gt; can view their assigned tasks and update status&lt;/li&gt;
&lt;li&gt;Email notifications fire automatically when tasks are assigned or status changes&lt;/li&gt;
&lt;li&gt;Everything runs serverless on AWS — zero servers to manage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The stack:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Frontend&lt;/td&gt;
&lt;td&gt;React 19 + Vite + Tailwind → AWS Amplify&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;API Gateway REST + Cognito JWT auth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compute&lt;/td&gt;
&lt;td&gt;15+ Lambda functions, Node.js 20, TypeScript&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;td&gt;DynamoDB (single-table design)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Notifications&lt;/td&gt;
&lt;td&gt;DynamoDB Streams → SNS → SES&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IaC&lt;/td&gt;
&lt;td&gt;Terraform (modular)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD&lt;/td&gt;
&lt;td&gt;GitHub Actions (Checkov + npm audit + terraform validate)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9wuum9q19qb37olk3vv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9wuum9q19qb37olk3vv.png" alt="Architecture diagram" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Gotcha #1: DynamoDB Single-Table Design Is Not Optional
&lt;/h2&gt;

&lt;p&gt;I started with multiple DynamoDB tables — one for tasks, one for users, one for assignments. Classic relational thinking.&lt;/p&gt;

&lt;p&gt;The problem: DynamoDB has no JOIN. To get a task with its assignees, I needed three separate &lt;code&gt;GetItem&lt;/code&gt; calls. Three round-trips. Three places for something to fail.&lt;/p&gt;

&lt;p&gt;The fix: &lt;strong&gt;single-table design&lt;/strong&gt; using composite primary keys.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Task item&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;PK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TASK#01HXYZ&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;SK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;METADATA&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;IN_PROGRESS&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Assignment — same table, different SK prefix&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;PK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TASK#01HXYZ&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;SK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ASSIGN#USER#01HJKL&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now &lt;code&gt;query(PK = "TASK#01HXYZ")&lt;/code&gt; returns the task AND all its assignments in one call. A &lt;code&gt;begins_with(SK, "ASSIGN#")&lt;/code&gt; key condition narrows the query to assignments server-side; otherwise, you split the returned items by SK prefix client-side.&lt;/p&gt;
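&lt;p&gt;One way to sketch that client-side split (the item shape and helper name here are my own, not taken from the repo):&lt;/p&gt;

```typescript
// Hypothetical helper: partition one query's items by SK prefix.
// In single-table design, one Query returns heterogeneous items;
// the SK tells you which entity each one is.
interface Item {
  PK: string;
  SK: string;
  [attr: string]: unknown;
}

function splitTaskItems(items: Item[]) {
  const task = items.find((i) => i.SK === "METADATA");
  const assignments = items.filter((i) => i.SK.startsWith("ASSIGN#"));
  return { task, assignments };
}
```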

&lt;p&gt;The trade-off: you must define ALL access patterns before building the schema. Change your access patterns later and you're adding GSIs or doing table scans.&lt;/p&gt;

&lt;p&gt;I used &lt;code&gt;PAY_PER_REQUEST&lt;/code&gt; billing — scales to zero when nothing's happening, which is perfect for a project portfolio app.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gotcha #2: The Cognito Post-Confirmation Trigger Fires More Than Once
&lt;/h2&gt;

&lt;p&gt;I added a post-confirmation Lambda trigger to create the user record in DynamoDB after signup.&lt;/p&gt;

&lt;p&gt;What I didn't know: &lt;strong&gt;this trigger is not guaranteed to fire exactly once&lt;/strong&gt;. It also runs after forgot-password confirmations, and Cognito retries the invocation if the function errors, so the same user can trigger it repeatedly.&lt;/p&gt;

&lt;p&gt;Without an idempotency guard, every re-invocation overwrites the user's DynamoDB record, silently resetting any role I'd manually set to &lt;code&gt;ADMIN&lt;/code&gt; back to &lt;code&gt;MEMBER&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Fix: one line in the Lambda's &lt;code&gt;PutItemCommand&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;ConditionExpression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;attribute_not_exists(PK)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="c1"&gt;// Only writes if the item doesn't exist yet — idempotent&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the first sign-in creates the profile. Every subsequent one does nothing.&lt;/p&gt;
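&lt;p&gt;The whole guard, sketched as a testable function. The &lt;code&gt;send&lt;/code&gt; parameter stands in for the DynamoDB client's send method, and the key and role names are assumptions, not the repo's actual schema:&lt;/p&gt;

```typescript
// Sketch of an idempotent profile write. DynamoDB rejects a conditional
// put on an existing item with ConditionalCheckFailedException, which we
// treat as "already created" rather than an error.
async function createProfileOnce(send: Function, userId: string) {
  try {
    await send({
      Item: { PK: "USER#" + userId, SK: "PROFILE", role: "MEMBER" },
      ConditionExpression: "attribute_not_exists(PK)", // the idempotency guard
    });
    return "created";
  } catch (err) {
    if ((err as Error).name === "ConditionalCheckFailedException") {
      return "exists"; // repeat invocations are harmless no-ops
    }
    throw err; // anything else is a real failure
  }
}
```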




&lt;h2&gt;
  
  
  Gotcha #3: CORS Is a Three-Layer Problem in API Gateway
&lt;/h2&gt;

&lt;p&gt;Browser &lt;code&gt;CORS error&lt;/code&gt;. Classic. Except in serverless, there are three places to fix it and you need all three:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1&lt;/strong&gt; — API Gateway: add an &lt;code&gt;OPTIONS&lt;/code&gt; method to every resource and configure Gateway Responses for 4xx/5xx&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2&lt;/strong&gt; — Lambda: every handler response must include CORS headers&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Access-Control-Allow-Origin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ALLOWED_ORIGIN&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;*&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Access-Control-Allow-Headers&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type,Authorization&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Layer 3&lt;/strong&gt; — The one nobody mentions: &lt;strong&gt;Gateway Responses for auth errors&lt;/strong&gt;. When Cognito rejects a JWT, API Gateway returns a 401 — but that error comes from the authorizer, not from Lambda. So your Lambda CORS headers don't run. You get a CORS error that's actually a 401.&lt;/p&gt;

&lt;p&gt;Fix it in Terraform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_api_gateway_gateway_response"&lt;/span&gt; &lt;span class="s2"&gt;"cors_4xx"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;rest_api_id&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rest_api_id&lt;/span&gt;
  &lt;span class="nx"&gt;response_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DEFAULT_4XX"&lt;/span&gt;
  &lt;span class="nx"&gt;response_parameters&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"gatewayresponse.header.Access-Control-Allow-Origin"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"'*'"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Gotcha #4: GitHub Actions OIDC Has a Silent Permission Requirement
&lt;/h2&gt;

&lt;p&gt;I switched from static AWS access keys to OIDC federation for the CI pipeline. The &lt;code&gt;configure-aws-credentials&lt;/code&gt; action just said: "Credentials could not be loaded."&lt;/p&gt;

&lt;p&gt;No mention of permissions. No useful error.&lt;/p&gt;

&lt;p&gt;The fix is one block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;id-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;   &lt;span class="c1"&gt;# This is required. Without it, OIDC JWT request fails silently.&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub's documentation mentions this, but it's buried. The action's error message gives you zero hint that this is the problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gotcha #5: Amplify Monorepo Needs &lt;code&gt;appRoot&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;My repo has both &lt;code&gt;backend/&lt;/code&gt; and &lt;code&gt;frontend/&lt;/code&gt;. Amplify's default build config looks for &lt;code&gt;package.json&lt;/code&gt; in the repository root. It doesn't find one. Build fails with a vague error.&lt;/p&gt;

&lt;p&gt;Fix in &lt;code&gt;amplify.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;applications&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;frontend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;phases&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;preBuild&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;npm ci&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;npm run build&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;artifacts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;baseDirectory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dist&lt;/span&gt;
        &lt;span class="na"&gt;files&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;**/*"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;appRoot&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;   &lt;span class="c1"&gt;# ← This is the important line&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without &lt;code&gt;appRoot&lt;/code&gt;, Amplify tries to build from the repo root and fails every time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gotcha #6: Notifications Must Be Decoupled — Or You'll Regret It
&lt;/h2&gt;

&lt;p&gt;My first instinct was to have the task Lambda call SES directly after writing to DynamoDB. Simple. Direct.&lt;/p&gt;

&lt;p&gt;The problem: SES latency adds to every task write response time. If SES is down, task writes fail. If I want to add Slack notifications later, I edit the task Lambda.&lt;/p&gt;

&lt;p&gt;The right pattern is DynamoDB Streams → SNS → email formatter Lambda:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task write Lambda  →  DynamoDB  →  Stream  →  SNS  →  Email Lambda  →  SES
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task writes are fast — no SES dependency&lt;/li&gt;
&lt;li&gt;SES failures don't break task creation&lt;/li&gt;
&lt;li&gt;Adding Slack = adding one SNS subscriber. Zero changes to existing code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The stream processor detects events by checking the item's SK prefix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// New assignment? SK starts with "ASSIGN#"&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;newImage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SK&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;startsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ASSIGN#&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;eventName&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;INSERT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PublishCommand&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TASK_ASSIGNED&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Status changed? SK is "METADATA" and status field differs&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;newImage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SK&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;METADATA&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;oldImage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;newImage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PublishCommand&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;STATUS_CHANGED&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use provisioned concurrency for critical Lambda functions.&lt;/strong&gt; Cold starts added up to 800 ms on my heavier handlers. For a real production app, provisioned concurrency keeps warm instances ready.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add a WebSocket API for real-time task updates.&lt;/strong&gt; Right now the frontend polls. API Gateway WebSockets would let me push status changes to connected clients instantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design GSIs more carefully upfront.&lt;/strong&gt; I added a second GSI midway through because I hadn't thought through the "list tasks assigned to user" access pattern. Adding a GSI to a live table triggers a backfill of existing items, which takes time and consumes capacity on a larger table. Designing the indexes upfront is far cheaper.&lt;/p&gt;
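&lt;p&gt;For the "tasks assigned to user" pattern, a common approach is an inverted-index GSI that swaps the key pair. A sketch with assumed table, index, and attribute names:&lt;/p&gt;

```typescript
// Hypothetical: if assignment items carry GSI1PK = SK and GSI1SK = PK,
// then GSI1 makes them queryable by user instead of by task.
const queryParams = {
  TableName: "tasks-table",          // assumed table name
  IndexName: "GSI1",                 // assumed inverted index
  KeyConditionExpression: "GSI1PK = :user",
  ExpressionAttributeValues: { ":user": "ASSIGN#USER#01HJKL" },
};
```

&lt;p&gt;Each matching item's &lt;code&gt;GSI1SK&lt;/code&gt; is then a &lt;code&gt;TASK#&lt;/code&gt; id for that user.&lt;/p&gt;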




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Single-table DynamoDB requires upfront access pattern design&lt;/strong&gt; — change your mind later and you're adding indexes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cognito post-confirmation triggers are NOT idempotent by default&lt;/strong&gt; — guard your DynamoDB writes with &lt;code&gt;attribute_not_exists&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CORS in API Gateway has three independent configuration points&lt;/strong&gt; — miss the Gateway Responses and you'll see CORS errors that are actually 401s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Actions OIDC requires &lt;code&gt;id-token: write&lt;/code&gt;&lt;/strong&gt; — the error message won't tell you this&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amplify monorepo requires &lt;code&gt;appRoot&lt;/code&gt;&lt;/strong&gt; — without it, every Amplify build fails with a vague error&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decouple notifications from writes&lt;/strong&gt; — Streams → SNS → Lambda is the right pattern, always&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Full technical deep-dive with complete Terraform code and all 15+ Lambda handlers: &lt;a href="https://princeayiku.hashnode.dev/aws-serverless-task-manager-lambda-dynamodb-cognito" rel="noopener noreferrer"&gt;My Hashnode blog&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What does your serverless notification architecture look like? Still doing direct Lambda → SES, or have you moved to a queue/stream pattern? Drop it in the comments. 👇&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>serverless</category>
      <category>terraform</category>
    </item>
    <item>
      <title>I Built a Self-Healing Observability Stack on AWS ECS — Here Are the Bugs That Nearly Broke Me</title>
      <dc:creator>Prince Ayiku</dc:creator>
      <pubDate>Mon, 13 Apr 2026 08:44:34 +0000</pubDate>
      <link>https://dev.to/prince_ayiku_166/i-built-a-self-healing-observability-stack-on-aws-ecs-here-are-the-bugs-that-nearly-broke-me-pi1</link>
      <guid>https://dev.to/prince_ayiku_166/i-built-a-self-healing-observability-stack-on-aws-ecs-here-are-the-bugs-that-nearly-broke-me-pi1</guid>
      <description>&lt;h1&gt;
  
  
  I Built a Self-Healing Observability Stack — Here Are the Bugs That Nearly Broke Me
&lt;/h1&gt;

&lt;p&gt;My blue/green deployment rolled back successfully.&lt;/p&gt;

&lt;p&gt;I had no idea why.&lt;/p&gt;

&lt;p&gt;The CloudWatch alarm fired. CodeDeploy reverted. The Slack alert said "5xx spike." But which service? Which endpoint? Which specific request triggered the cascade? All I had was a timestamp and an alarm name. The system worked exactly as designed — and I couldn't explain what it had just protected me from.&lt;/p&gt;

&lt;p&gt;That's when I started this project.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Was Actually Missing
&lt;/h2&gt;

&lt;p&gt;I'd built a solid GitOps pipeline by this point: Jenkins security gates, ECS Fargate, blue/green deployments with automatic rollback. The deployment mechanics were production-grade. The observability layer was... three CloudWatch log groups and a feeling.&lt;/p&gt;

&lt;p&gt;The stack I built to close that gap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry&lt;/strong&gt; auto-instrumentation on the NestJS backend — every HTTP request generates a trace with spans across every service hop&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jaeger&lt;/strong&gt; as the trace backend (receiving via OTLP HTTP on port 4318)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pino structured logging&lt;/strong&gt; with &lt;code&gt;trace_id&lt;/code&gt; and &lt;code&gt;span_id&lt;/code&gt; injected into every log line — so a CloudWatch log entry links directly to a Jaeger trace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus&lt;/strong&gt; scraping custom NestJS metrics (request rate, latency histograms, error counters)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana&lt;/strong&gt; dashboards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alertmanager → Slack&lt;/strong&gt; for alert routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda auto-remediation&lt;/strong&gt; — a function that detects high error rates via CloudWatch alarm and autonomously stops unhealthy ECS tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjown1xqalbmr0ui70lzl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjown1xqalbmr0ui70lzl.png" alt="Advanced Observability Stack Architecture" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The goal: when something breaks, I can go from Slack alert → log line → trace → root cause in one click. And if the error rate spikes, the system handles it before I even see the alert.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bug #1: OpenTelemetry Was Running But Not Working
&lt;/h2&gt;

&lt;p&gt;This was the first thing I got completely wrong.&lt;/p&gt;

&lt;p&gt;I installed &lt;code&gt;@opentelemetry/auto-instrumentations-node&lt;/code&gt;, wired up the OTLP exporter, pointed it at Jaeger, and ran the app. Zero traces in Jaeger. No error. No warning. Just nothing.&lt;/p&gt;

&lt;p&gt;I spent a long time confirming things that weren't the problem: Jaeger was reachable, the exporter config was correct, the SDK was initialising without throwing. Everything looked fine. Nothing was traced.&lt;/p&gt;

&lt;p&gt;The problem was import order.&lt;/p&gt;

&lt;p&gt;Node.js auto-instrumentation works by monkey-patching built-in modules (&lt;code&gt;http&lt;/code&gt;, &lt;code&gt;https&lt;/code&gt;, &lt;code&gt;net&lt;/code&gt;) at process startup. The patches need to be applied &lt;strong&gt;before&lt;/strong&gt; any other module loads. If NestJS (or Express, or anything) bootstraps first, those modules are already in memory — the patches never apply. The app runs normally but generates no spans.&lt;/p&gt;

&lt;p&gt;The fix is one constraint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// main.ts — THIS ORDER IS MANDATORY&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./tracing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;          &lt;span class="c1"&gt;// Must be FIRST — patches Node.js internals&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;NestFactory&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@nestjs/core&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;bootstrap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;NestFactory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;AppModule&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3001&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;bootstrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the &lt;code&gt;tracing.ts&lt;/code&gt; initialisation itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;NodeSDK&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/sdk-node&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;OTLPTraceExporter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/exporter-trace-otlp-http&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;getNodeAutoInstrumentations&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/auto-instrumentations-node&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sdk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;NodeSDK&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;traceExporter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OTLPTraceExporter&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:4318/v1/traces&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// localhost because ECS awsvpc&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;instrumentations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;getNodeAutoInstrumentations&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  &lt;span class="c1"&gt;// Synchronous — must complete before bootstrap() runs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the URL: &lt;code&gt;localhost:4318&lt;/code&gt;, not &lt;code&gt;jaeger:4318&lt;/code&gt;. ECS Fargate's &lt;code&gt;awsvpc&lt;/code&gt; network mode puts all containers in the same task into a shared network namespace. Same-task containers talk on &lt;code&gt;localhost&lt;/code&gt;. Docker Compose service names don't resolve here.&lt;/p&gt;

&lt;p&gt;After fixing the import order, traces started flowing immediately.&lt;/p&gt;




&lt;h2&gt;
  
  
  Correlating Logs to Traces
&lt;/h2&gt;

&lt;p&gt;Having traces is useful. Having traces you can find from a log line is the actual goal.&lt;/p&gt;

&lt;p&gt;The Pino logger needed a &lt;code&gt;mixin&lt;/code&gt; function that reads the active OpenTelemetry span and injects its IDs into every log entry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/api&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pino&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="nf"&gt;mixin&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;span&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getActiveSpan&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;spanContext&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;trace_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;span_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;spanId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;formatters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;level&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;label&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now every log line looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"msg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Database connection refused"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"trace_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4bf92f3577b34da6a3ce929d0e0e4736"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"span_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"00f067aa0ba902b7"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-01-15T14:23:01.234Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a Slack alert fires, I can grep CloudWatch Logs for the &lt;code&gt;trace_id&lt;/code&gt;, then paste it directly into Jaeger's search. One click. The full trace — every service, every database query, every millisecond — is right there.&lt;/p&gt;
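&lt;p&gt;The correlation step itself is just an equality filter over structured log lines. A minimal sketch of that logic (in practice this would be a CloudWatch Logs Insights query or a &lt;code&gt;filter_log_events&lt;/code&gt; call, but the idea is the same):&lt;/p&gt;

```python
import json

def lines_for_trace(log_lines, trace_id):
    """Filter structured (JSON) log lines down to one request's story.

    Every Pino line carries the injected trace_id, so correlating a
    Slack alert to its full request history is a single equality filter.
    """
    matches = []
    for raw in log_lines:
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (startup banners, etc.)
        if entry.get("trace_id") == trace_id:
            matches.append(entry)
    return matches
```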

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5j3z7wiaqtuel1hjohwd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5j3z7wiaqtuel1hjohwd.png" alt="Grafana Dashboard" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Bug #2: Lambda Auto-Remediation Broke the Deployment Controller
&lt;/h2&gt;

&lt;p&gt;This one was more subtle.&lt;/p&gt;

&lt;p&gt;The Lambda function's job: CloudWatch alarm fires (high 5xx rate) → Lambda detects it → Lambda restarts the unhealthy ECS service.&lt;/p&gt;

&lt;p&gt;My first implementation used &lt;code&gt;UpdateService&lt;/code&gt; with &lt;code&gt;forceNewDeployment: true&lt;/code&gt;. That's the standard approach for restarting an ECS service. It should have worked.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ecs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_service&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cluster_arn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;forceNewDeployment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# This fails silently or throws
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It threw:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;InvalidParameterException: Unable to update the service because
a deployment is already in progress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reason: when an ECS service uses &lt;code&gt;deployment_controller { type = "CODE_DEPLOY" }&lt;/code&gt;, AWS hands deployment control entirely to CodeDeploy. &lt;code&gt;UpdateService --forceNewDeployment&lt;/code&gt; is incompatible with an active CodeDeploy-controlled service. The two systems conflict.&lt;/p&gt;

&lt;p&gt;The correct approach is &lt;code&gt;ecs:StopTask&lt;/code&gt; — stop the specific unhealthy task directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remediate_unhealthy_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster_arn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;service_arn&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# List running tasks for the service
&lt;/span&gt;    &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ecs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_tasks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cluster_arn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;serviceName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;service_arn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;desiredStatus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RUNNING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;taskArns&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;

    &lt;span class="c1"&gt;# Stop the first running task
&lt;/span&gt;    &lt;span class="n"&gt;ecs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stop_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cluster_arn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Auto-remediation: high error rate detected via CloudWatch alarm&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a task stops, ECS detects the task count is below desired and launches a replacement. The service recovers. CodeDeploy is never touched. No deployment state corruption.&lt;/p&gt;
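&lt;p&gt;The decision boils down to one branch on the deployment controller type, which is worth encoding explicitly if the remediation Lambda ever covers services with different controllers. A minimal sketch (the helper name is mine, not from the repo):&lt;/p&gt;

```python
def remediation_action(deployment_controller: str) -> str:
    """Pick the safe restart mechanism for an ECS service.

    forceNewDeployment only works when ECS itself owns deployments.
    Under CODE_DEPLOY, the service must be recycled by stopping tasks
    and letting ECS backfill to the desired count.
    """
    if deployment_controller == "CODE_DEPLOY":
        return "stop_task"          # ecs:StopTask on the unhealthy task
    return "force_new_deployment"   # ecs:UpdateService is fine here
```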




&lt;h2&gt;
  
  
  Bug #3: The Idempotency Problem
&lt;/h2&gt;

&lt;p&gt;Lambda triggered three times for the same alarm window. Three concurrent invocations. Three tasks stopped simultaneously. The service dropped to zero running tasks and couldn't recover fast enough to pass health checks.&lt;/p&gt;

&lt;p&gt;The fix: check your own logs before acting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_recent_remediation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_group&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_stream_prefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return True if auto-remediation ran successfully in the last N minutes.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;cutoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;window_minutes&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;streams&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe_log_streams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;logGroupName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_group&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;logStreamNamePrefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_stream_prefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;orderBy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;LastEventTime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;descending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
    &lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;logStreams&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;streams&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_log_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;logGroupName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_group&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;logStreamName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;logStreamName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;startTime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cutoff&lt;/span&gt;
        &lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;events&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Auto-remediation successful&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;check_recent_remediation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LOG_GROUP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LOG_STREAM_PREFIX&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Recent remediation found — skipping to avoid thrash&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;skipped&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;idempotency_guard&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nf"&gt;remediate_unhealthy_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CLUSTER_ARN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SERVICE_ARN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Auto-remediation successful&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One alarm. One Lambda invocation that acts. All subsequent invocations within 10 minutes exit early. The service gets one clean restart instead of a cascade.&lt;/p&gt;
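&lt;p&gt;Stripped of the CloudWatch API calls, the guard is a pure predicate over recent log events, which makes the windowing logic easy to unit test. A sketch of that core (function name is my own):&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

def should_skip(events, window_minutes=10, now=None):
    """Idempotency guard distilled to its essence.

    events: list of (timestamp, message) pairs, e.g. pulled from
    CloudWatch Logs. Any 'Auto-remediation successful' marker inside
    the window means another invocation already acted on this alarm.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(minutes=window_minutes)
    return any(
        ts >= cutoff and "Auto-remediation successful" in msg
        for ts, msg in events
    )
```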

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2t5j3yl412r55p0xdmk3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2t5j3yl412r55p0xdmk3.png" alt="Jenkins Pipeline Flow" width="800" height="154"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Full Pipeline: 11 Stages
&lt;/h2&gt;

&lt;p&gt;The Jenkins pipeline that drives all of this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Secret Scan&lt;/strong&gt; (Gitleaks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type Check + Lint&lt;/strong&gt; (TypeScript + ESLint)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependency Audit&lt;/strong&gt; (npm audit)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Quality&lt;/strong&gt; (SonarCloud)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build Images&lt;/strong&gt; (Docker, tagged with git SHA)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image Scan&lt;/strong&gt; (Trivy CVE detection)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SBOM Generation&lt;/strong&gt; (Syft — CycloneDX + SPDX)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IaC Scan&lt;/strong&gt; (Checkov on Terraform)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ECR Push&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Task Definition Registration&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blue/Green Deployment&lt;/strong&gt; (CodeDeploy, 10% traffic per minute)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Security gates first. Deployment last. The same principle from the GitOps project applies here.&lt;/p&gt;




&lt;h2&gt;
  
  
  What It Looks Like When It Works
&lt;/h2&gt;

&lt;p&gt;A request comes in to the NestJS backend:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;OpenTelemetry generates a trace ID and creates a root span&lt;/li&gt;
&lt;li&gt;Each downstream call (database query, external HTTP) gets a child span&lt;/li&gt;
&lt;li&gt;Pino injects the trace ID into every log line during that request's lifecycle&lt;/li&gt;
&lt;li&gt;Prometheus records the request duration in a histogram&lt;/li&gt;
&lt;li&gt;If the response is 5xx: Alertmanager routes to Slack with the alarm context&lt;/li&gt;
&lt;li&gt;In Slack: I see the alert, click the CloudWatch link, grep for the trace ID, open Jaeger, see the full call graph in under 60 seconds&lt;/li&gt;
&lt;/ol&gt;
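&lt;p&gt;Step 6 can even be scripted: pull the &lt;code&gt;trace_id&lt;/code&gt; out of the alert's log line and build the Jaeger link directly. A small sketch, assuming the Jaeger UI at its default port (16686):&lt;/p&gt;

```python
import json

def jaeger_url(log_line: str, jaeger_base: str = "http://localhost:16686") -> str:
    """Turn one alert log line into a Jaeger trace link.

    The trace_id that Pino injected (step 3 above) is all the Jaeger
    UI needs to render the full call graph (step 6).
    """
    trace_id = json.loads(log_line)["trace_id"]
    return f"{jaeger_base}/trace/{trace_id}"
```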

&lt;p&gt;And if the error rate crosses the threshold:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;CloudWatch alarm fires&lt;/li&gt;
&lt;li&gt;Lambda checks for recent remediation (idempotency guard)&lt;/li&gt;
&lt;li&gt;Lambda stops the unhealthy task&lt;/li&gt;
&lt;li&gt;ECS replaces it with a fresh task&lt;/li&gt;
&lt;li&gt;Error rate drops&lt;/li&gt;
&lt;li&gt;Alarm clears&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No manual intervention. No 3am pages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0tlku4ht6z5rxqg2vnk5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0tlku4ht6z5rxqg2vnk5.png" alt="Slack Alerts" width="759" height="402"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OTel import order is a hard constraint, not a preference.&lt;/strong&gt; The SDK must patch Node.js internals before any framework loads. One wrong line breaks the entire tracing setup with no error message to guide you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;ecs:StopTask&lt;/code&gt; is the correct remediation call when using CODE_DEPLOY.&lt;/strong&gt; &lt;code&gt;forceNewDeployment&lt;/code&gt; conflicts with the CodeDeploy controller. Stop the task — ECS handles the replacement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Idempotency in Lambda isn't optional when CloudWatch alarms are your trigger.&lt;/strong&gt; Alarms fire multiple times. Your remediation function needs to know when it already ran.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trace ID correlation turns three separate signals into one investigation.&lt;/strong&gt; Logs, traces, and metrics are each useful in isolation. Together, with the trace ID as the link, they tell the complete story of a request.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/celetrialprince166/Advanced_monitoring" rel="noopener noreferrer"&gt;Full repository — github.com/celetrialprince166/Advanced_monitoring&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://opentelemetry.io/docs/languages/js/automatic/" rel="noopener noreferrer"&gt;OpenTelemetry Node.js Auto-Instrumentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.jaegertracing.io/docs/latest/apis/" rel="noopener noreferrer"&gt;Jaeger OTLP Ingestion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/codedeploy/latest/userguide/deployment-steps-ecs.html" rel="noopener noreferrer"&gt;AWS CodeDeploy ECS Blue/Green&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What's the most useful observability signal you've added to a production system? Drop it below — I'm building a list of what actually helps vs. what just adds noise. 👇&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>opentelemetry</category>
      <category>observability</category>
    </item>
    <item>
      <title>Building a GitOps Pipeline on AWS ECS: From Manual SSH to Zero-Downtime Blue/Green Deployments</title>
      <dc:creator>Prince Ayiku</dc:creator>
      <pubDate>Tue, 07 Apr 2026 08:30:30 +0000</pubDate>
      <link>https://dev.to/prince_ayiku_166/building-a-gitops-pipeline-on-aws-ecs-from-manual-ssh-to-zero-downtime-bluegreen-deployments-3dlo</link>
      <guid>https://dev.to/prince_ayiku_166/building-a-gitops-pipeline-on-aws-ecs-from-manual-ssh-to-zero-downtime-bluegreen-deployments-3dlo</guid>
      <description>&lt;h1&gt;
  
  
  How I Built a GitOps Pipeline That Deploys Itself — and Rolls Back When Things Break
&lt;/h1&gt;

&lt;p&gt;I used to deploy by SSHing into a server, pulling new code, restarting Docker Compose, and hoping.&lt;/p&gt;

&lt;p&gt;That worked until the day I pushed a bug to production on a Friday afternoon and spent the weekend manually rolling it back.&lt;/p&gt;

&lt;p&gt;This is the story of rebuilding that entire workflow — from "SSH and pray" to a system where a git push triggers security scans, builds container images, shifts traffic 10% at a time, and automatically reverts if anything looks wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where It Started
&lt;/h2&gt;

&lt;p&gt;The app is a full-stack notes manager: Next.js frontend, NestJS backend, PostgreSQL, with Nginx as the reverse proxy. Four containers. Nothing exotic.&lt;/p&gt;

&lt;p&gt;The original deployment process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh ubuntu@my-server-ip
&lt;span class="nb"&gt;cd&lt;/span&gt; /opt/notes-app
git pull
docker-compose down &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;span class="c"&gt;# Go get coffee. Hope it comes back up.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is fine when you have one server and one developer. It breaks down the moment you want to deploy without downtime, roll back a bad release, or prove to a future employer that you know what you're doing.&lt;/p&gt;

&lt;p&gt;So I documented the rebuild as four distinct phases — not because I planned it that way, but because each phase solved a specific pain I'd already felt.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 1: Automate the Build (GitHub Actions)
&lt;/h2&gt;

&lt;p&gt;First step was getting the build out of my hands entirely. A GitHub Actions workflow that fires on every push to &lt;code&gt;main&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CI/CD Pipeline&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;build-and-push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Login to Amazon ECR&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/amazon-ecr-login@v2&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build and push backend&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/build-push-action@v5&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./backend&lt;/span&gt;
          &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ env.ECR_REGISTRY }}/notes-backend:${{ github.sha }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three images (backend, frontend, proxy), each tagged with the git commit SHA. No more &lt;code&gt;latest&lt;/code&gt; tags overwriting each other. Every commit gets its own immutable image.&lt;/p&gt;
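&lt;p&gt;The tagging scheme is trivial but worth being precise about, since the whole rollback story rests on it. A sketch (the registry hostname in the test is a placeholder, not the real one):&lt;/p&gt;

```python
def image_ref(registry: str, name: str, git_sha: str) -> str:
    """Build the immutable image reference for one commit.

    Tagging with the commit SHA instead of 'latest' gives every build
    its own address, so a rollback is just 'deploy the previous SHA'.
    """
    return f"{registry}/{name}:{git_sha}"
```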

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl0lql1qf3g1vnty1fqci.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl0lql1qf3g1vnty1fqci.png" alt="Architecture Diagram" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 2: Add Security Gates Before Automating Deployments
&lt;/h2&gt;

&lt;p&gt;Here's where I made a deliberate choice most tutorials skip: I added security scanning &lt;em&gt;before&lt;/em&gt; I automated the actual deployment.&lt;/p&gt;

&lt;p&gt;The logic: if you automate deployment of insecure code, you've just made insecurity faster.&lt;/p&gt;

&lt;p&gt;The Jenkins pipeline I built runs 7 gates in sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Gitleaks&lt;/strong&gt; — scans the entire git history for hardcoded credentials, API keys, tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TypeScript + ESLint&lt;/strong&gt; — type errors and code style issues caught at build time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;npm audit&lt;/strong&gt; — dependency vulnerability scan&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SonarCloud&lt;/strong&gt; — code quality gates (complexity, duplication, security rules)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker build&lt;/strong&gt; — images built and tagged with git SHA&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trivy&lt;/strong&gt; — scans each container image for CVEs (HIGH and CRITICAL flagged)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Syft&lt;/strong&gt; — generates a Software Bill of Materials (CycloneDX + SPDX JSON)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In lab mode, these gates report but don't block. In production, you'd run Trivy with &lt;code&gt;--exit-code 1&lt;/code&gt; and fail the build on the SonarCloud quality gate status to make them hard stops. The point was to build the habit of having the gates, not to enforce them from day one.&lt;/p&gt;
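&lt;p&gt;That report-vs-block switch can live in one small policy function shared by the scan stages. A minimal sketch of the idea (names are mine, not from the pipeline):&lt;/p&gt;

```python
def gate_exit_code(findings, enforce=False, blocking=("HIGH", "CRITICAL")):
    """Report-only vs hard-gate behavior for a scanner stage.

    findings: severity strings emitted by a scanner such as Trivy.
    Lab mode (enforce=False) surfaces findings but returns 0 so the
    pipeline continues; production mode fails the stage on any
    blocking severity.
    """
    hits = [s for s in findings if s in blocking]
    if enforce and hits:
        return 1  # hard gate: non-zero exit fails the pipeline stage
    return 0      # report-only: findings logged, build stays green
```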




&lt;h2&gt;
  
  
  Phase 3: Move to ECS Fargate
&lt;/h2&gt;

&lt;p&gt;Running Docker Compose on EC2 is fine until you need the EC2 instance to scale, fail over, or restart containers automatically. ECS Fargate solves all three: serverless containers, AWS manages the underlying compute, you define the task and it runs.&lt;/p&gt;

&lt;p&gt;The Terraform configuration provisions the entire stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecs_cluster"&lt;/span&gt; &lt;span class="s2"&gt;"main"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.project_name}-cluster"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecs_service"&lt;/span&gt; &lt;span class="s2"&gt;"app"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.project_name}-service"&lt;/span&gt;
  &lt;span class="nx"&gt;cluster&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecs_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;task_definition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecs_task_definition&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;launch_type&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"FARGATE"&lt;/span&gt;

  &lt;span class="nx"&gt;deployment_controller&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"CODE_DEPLOY"&lt;/span&gt;  &lt;span class="c1"&gt;# Hands deployment control to CodeDeploy&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;lifecycle&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;ignore_changes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;task_definition&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# CI/CD owns this, not Terraform&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;ignore_changes = [task_definition]&lt;/code&gt; line matters more than it looks. Terraform manages the infrastructure. Jenkins manages which task definition revision is deployed. Without it, every &lt;code&gt;terraform apply&lt;/code&gt; would roll back to whatever task definition Terraform last knew about — overwriting the version Jenkins just pushed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbjplpe4pazdp2ccvdkun.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbjplpe4pazdp2ccvdkun.png" alt="ECS Fargate Architecture" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Networking Trap That Got Me
&lt;/h2&gt;

&lt;p&gt;Before I talk about Phase 4, there's a specific failure I need to document because it will get you too.&lt;/p&gt;

&lt;p&gt;My backend couldn't connect to the database. &lt;code&gt;ECONNREFUSED database:5432&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In Docker Compose, services reach each other by their service name. &lt;code&gt;database&lt;/code&gt; resolves because Docker creates a shared bridge network with DNS for each service name.&lt;/p&gt;

&lt;p&gt;ECS Fargate uses &lt;code&gt;awsvpc&lt;/code&gt; network mode. All containers in the same task share a single network namespace — effectively the same &lt;code&gt;localhost&lt;/code&gt;. There's no inter-container DNS. The hostname &lt;code&gt;database&lt;/code&gt; doesn't resolve to anything.&lt;/p&gt;

&lt;p&gt;The fix is one word:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Docker Compose — works locally&lt;/span&gt;
&lt;span class="nv"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgresql://user:pass@database:5432/db

&lt;span class="c"&gt;# ECS Fargate — same task = same localhost&lt;/span&gt;
&lt;span class="nv"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgresql://user:pass@localhost:5432/db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't in the getting-started guide. It's buried in the ECS networking docs. And it will break every multi-container Fargate deployment that was originally written for Docker Compose, with an error message that points nowhere near the cause.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 4: Blue/Green Deployments with CodeDeploy
&lt;/h2&gt;

&lt;p&gt;This is the part that makes the system production-grade.&lt;/p&gt;

&lt;p&gt;When a new version is deployed, CodeDeploy spins up new ECS tasks (Green) alongside the existing ones (Blue). Traffic shifts 10% per minute from Blue to Green. If a CloudWatch alarm fires during the shift — 5xx error rate, unhealthy targets — traffic instantly reverts to 100% Blue.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T+0s    Blue: 100%    Green: starting     ← deploy begins
T+60s   Blue:  90%    Green: 10%          ← 10% shifted
T+120s  Blue:  80%    Green: 20%          ← steady if healthy
...
T+600s  Blue:   0%    Green: 100%         ← complete
T+900s  Blue tasks terminated             ← cleanup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the CloudWatch alarm fires at any point: traffic snaps back to 100% Blue instantly.&lt;/p&gt;
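&lt;p&gt;The alarm that arms that rollback can live in the same Terraform as everything else. A sketch of the shape, not copied from the repo (the name, threshold, and the &lt;code&gt;aws_lb.main&lt;/code&gt; reference are illustrative assumptions):&lt;/p&gt;

```hcl
# Illustrative alarm: trip when the ALB sees a burst of 5xx responses
# from the targets during the traffic shift.
resource "aws_cloudwatch_metric_alarm" "target_5xx" {
  alarm_name          = "${var.project_name}-target-5xx"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 1
  threshold           = 10
  comparison_operator = "GreaterThanOrEqualToThreshold"

  dimensions = {
    # Assumes an aws_lb.main resource elsewhere in the configuration.
    LoadBalancer = aws_lb.main.arn_suffix
  }
}
```

&lt;p&gt;Attach this alarm to the CodeDeploy deployment group and the 10%-per-minute shift gets its safety net.&lt;/p&gt;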

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwor7dj1yo1lweoi0wff0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwor7dj1yo1lweoi0wff0.png" alt="CodeDeploy Blue/Green Architecture" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Jenkins pipeline orchestrates this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight groovy"&gt;&lt;code&gt;&lt;span class="n"&gt;stage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Deploy to ECS'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;steps&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;sh&lt;/span&gt; &lt;span class="s1"&gt;'''
          # Render task definition with current image tags
          ./ecs/render-task-def.sh \
            --image-tag ${GIT_COMMIT:0:7} \
            --region eu-west-1

          # Register new task definition revision
          TASK_DEF_ARN=$(aws ecs register-task-definition \
            --cli-input-json file://ecs/task-definition-rendered.json \
            --query taskDefinition.taskDefinitionArn \
            --output text)

          # Trigger CodeDeploy blue/green
          aws deploy create-deployment \
            --cli-input-json file://ecs/codedeploy-input.json
        '''&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5j9kvm0eb6u8m87ax2xd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5j9kvm0eb6u8m87ax2xd.png" alt="Jenkins Pipeline Flow" width="800" height="118"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability: Knowing It's Actually Working
&lt;/h2&gt;

&lt;p&gt;Deploying successfully and &lt;em&gt;knowing&lt;/em&gt; it's working are different things.&lt;/p&gt;

&lt;p&gt;The observability stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus&lt;/strong&gt; scraping NestJS &lt;code&gt;/metrics&lt;/code&gt; endpoint every 15 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana&lt;/strong&gt; dashboards for request rate, latency, error rate, container health&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alertmanager&lt;/strong&gt; routing alert notifications to a Slack channel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch&lt;/strong&gt; for ECS logs with 30-day retention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb54p6zagremnhewy08hk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb54p6zagremnhewy08hk.png" alt="Grafana Dashboard" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjfgzvth86ab27n7jeyvk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjfgzvth86ab27n7jeyvk.png" alt="Slack Alerts" width="759" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Prometheus NestJS integration is worth noting — NestJS doesn't expose metrics by default. You need to instrument it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// metrics.module.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;PrometheusModule&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@willsoto/nestjs-prometheus&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;imports&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nx"&gt;PrometheusModule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/metrics&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;defaultMetrics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MetricsModule&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once that's running, Prometheus scrapes HTTP request counts, latency histograms, and error rates automatically.&lt;/p&gt;
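&lt;p&gt;On the Prometheus side, the matching scrape job is a few lines of config. A sketch, with the target host and port as placeholders for wherever the backend actually runs:&lt;/p&gt;

```yaml
scrape_configs:
  - job_name: nestjs-backend
    metrics_path: /metrics
    scrape_interval: 15s          # matches the 15-second cadence above
    static_configs:
      - targets: ['backend:3000'] # placeholder host:port
```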




&lt;h2&gt;
  
  
  Key Learnings
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Security gates belong before automated deployment, not after.&lt;/strong&gt; The moment you automate deployment of untested, unscanned code, you've made your pipeline a liability. Build the gates first, then automate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Fargate &lt;code&gt;awsvpc&lt;/code&gt; mode changes inter-container communication fundamentally.&lt;/strong&gt; Same-task containers talk on &lt;code&gt;localhost&lt;/code&gt;. Cross-task communication needs service discovery or an internal load balancer. Know this before you hit it in production.&lt;/p&gt;
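&lt;p&gt;For the cross-task case, the standard fix is ECS service discovery via Cloud Map. A minimal Terraform sketch (not from the repo; the namespace name and &lt;code&gt;var.vpc_id&lt;/code&gt; are illustrative):&lt;/p&gt;

```hcl
# Private DNS namespace so tasks in other services can resolve
# "database.internal.local" instead of relying on Compose-style names.
resource "aws_service_discovery_private_dns_namespace" "internal" {
  name = "internal.local"
  vpc  = var.vpc_id
}

resource "aws_service_discovery_service" "database" {
  name = "database"

  dns_config {
    namespace_id = aws_service_discovery_private_dns_namespace.internal.id
    dns_records {
      type = "A"
      ttl  = 10
    }
  }
}
```

&lt;p&gt;Wiring the ECS service to that registry gives other tasks a stable DNS name again, which is what Docker Compose was providing for free.&lt;/p&gt;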

&lt;p&gt;&lt;strong&gt;3. &lt;code&gt;ignore_changes = [task_definition]&lt;/code&gt; is required when Terraform and CI/CD share an ECS service.&lt;/strong&gt; Without it, Terraform and Jenkins will fight over task definition revisions on every apply.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Blue/green is only as good as your alarms.&lt;/strong&gt; If your CloudWatch alarm isn't configured before the deployment starts, there's nothing to trigger the rollback. The alarm is the safety net — set it up before you need it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. The AppSpec for CodeDeploy must be JSON-wrapped via CLI.&lt;/strong&gt; The happy-path docs don't mention this. Use &lt;code&gt;jq&lt;/code&gt; to wrap the YAML content as an &lt;code&gt;AppSpecContent&lt;/code&gt; JSON object, or the deployment will fail with an unhelpful error.&lt;/p&gt;
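&lt;p&gt;A minimal, runnable sketch of that wrapping step, assuming jq 1.6 or newer. The stub appspec here stands in for the real one, which names the task definition and container:&lt;/p&gt;

```shell
# Stub appspec for illustration only.
printf 'version: 0.0\nResources: []\n' > appspec.yaml

# Wrap the raw YAML as a CodeDeploy AppSpecContent revision object.
jq -n --rawfile spec appspec.yaml \
  '{revisionType: "AppSpecContent", appSpecContent: {content: $spec}}' \
  > revision.json
```

&lt;p&gt;The resulting JSON is what &lt;code&gt;aws deploy create-deployment&lt;/code&gt; expects as the revision; passing the YAML directly is what produces the unhelpful error.&lt;/p&gt;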




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;If I started this project today, I'd use AWS Systems Manager Session Manager from the start instead of a bastion host. No SSH port exposed, no key rotation, full audit trail of every session — and it's cheaper than running a separate EC2 instance as a jump box.&lt;/p&gt;
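&lt;p&gt;The day-to-day replacement for SSH is one command. A sketch, with a placeholder instance ID (assumes the SSM agent is running and the instance profile includes &lt;code&gt;AmazonSSMManagedInstanceCore&lt;/code&gt;):&lt;/p&gt;

```shell
# Interactive shell on a private instance, no inbound ports open.
# The instance ID below is a placeholder.
aws ssm start-session --target i-0123456789abcdef0
```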

&lt;p&gt;I'd also set the security gates to blocking mode from day one, not lab mode. The discipline of having a hard quality gate early shapes how you write code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources &amp;amp; Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/celetrialprince166/gitops_lab" rel="noopener noreferrer"&gt;Full repository — github.com/celetrialprince166/gitops_lab&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-networking-awsvpc.html" rel="noopener noreferrer"&gt;ECS Fargate awsvpc networking docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/codedeploy/latest/userguide/deployment-steps-ecs.html" rel="noopener noreferrer"&gt;CodeDeploy ECS Blue/Green deployment guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/meta-arguments/lifecycle" rel="noopener noreferrer"&gt;Terraform ECS lifecycle meta-argument&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next I'm building the advanced observability layer — distributed tracing with OpenTelemetry and Jaeger across the full service mesh. Follow along if that's useful.&lt;/p&gt;




&lt;p&gt;What's the most important thing your deployment pipeline is missing right now? Drop it in the comments — I'm building a list of what engineers actually care about vs. what tutorials focus on. 👇&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>gitops</category>
      <category>docker</category>
    </item>
    <item>
      <title>How I Built a GitOps Pipeline That Deploys Itself — and Rolls Back When Things Break</title>
      <dc:creator>Prince Ayiku</dc:creator>
      <pubDate>Tue, 07 Apr 2026 08:30:06 +0000</pubDate>
      <link>https://dev.to/prince_ayiku_166/how-i-built-a-gitops-pipeline-that-deploys-itself-and-rolls-back-when-things-break-5f8m</link>
      <guid>https://dev.to/prince_ayiku_166/how-i-built-a-gitops-pipeline-that-deploys-itself-and-rolls-back-when-things-break-5f8m</guid>
      <description>&lt;h1&gt;
  
  
  How I Built a GitOps Pipeline That Deploys Itself — and Rolls Back When Things Break
&lt;/h1&gt;

&lt;p&gt;I used to deploy by SSHing into a server, pulling new code, restarting Docker Compose, and hoping.&lt;/p&gt;

&lt;p&gt;That worked until the day I pushed a bug to production on a Friday afternoon and spent the weekend manually rolling it back.&lt;/p&gt;

&lt;p&gt;This is the story of rebuilding that entire workflow — from "SSH and pray" to a system where a git push triggers security scans, builds container images, shifts traffic 10% at a time, and automatically reverts if anything looks wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where It Started
&lt;/h2&gt;

&lt;p&gt;The app is a full-stack notes manager: Next.js frontend, NestJS backend, PostgreSQL, with Nginx as the reverse proxy. Four containers. Nothing exotic.&lt;/p&gt;

&lt;p&gt;The original deployment process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh ubuntu@my-server-ip
&lt;span class="nb"&gt;cd&lt;/span&gt; /opt/notes-app
git pull
docker-compose down &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;span class="c"&gt;# Go get coffee. Hope it comes back up.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is fine when you have one server and one developer. It breaks down the moment you want to deploy without downtime, roll back a bad release, or prove to a future employer that you know what you're doing.&lt;/p&gt;

&lt;p&gt;So I documented the rebuild as four distinct phases — not because I planned it that way, but because each phase solved a specific pain I'd already felt.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 1: Automate the Build (GitHub Actions)
&lt;/h2&gt;

&lt;p&gt;First step was getting the build out of my hands entirely. A GitHub Actions workflow that fires on every push to &lt;code&gt;main&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CI/CD Pipeline&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;build-and-push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Login to Amazon ECR&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/amazon-ecr-login@v2&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build and push backend&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/build-push-action@v5&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./backend&lt;/span&gt;
          &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ env.ECR_REGISTRY }}/notes-backend:${{ github.sha }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three images (backend, frontend, proxy), each tagged with the git commit SHA. No more &lt;code&gt;latest&lt;/code&gt; tags overwriting each other. Every commit gets its own immutable image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Farch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Farch.png" alt="Architecture Diagram" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 2: Add Security Gates Before Automating Deployments
&lt;/h2&gt;

&lt;p&gt;Here's where I made a deliberate choice most tutorials skip: I added security scanning &lt;em&gt;before&lt;/em&gt; I automated the actual deployment.&lt;/p&gt;

&lt;p&gt;The logic: if you automate deployment of insecure code, you've just made insecurity faster.&lt;/p&gt;

&lt;p&gt;The Jenkins pipeline I built runs 7 gates in sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Gitleaks&lt;/strong&gt; — scans the entire git history for hardcoded credentials, API keys, tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TypeScript + ESLint&lt;/strong&gt; — type errors and code style issues caught at build time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;npm audit&lt;/strong&gt; — dependency vulnerability scan&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SonarCloud&lt;/strong&gt; — code quality gates (complexity, duplication, security rules)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker build&lt;/strong&gt; — images built and tagged with git SHA&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trivy&lt;/strong&gt; — scans each container image for CVEs (HIGH and CRITICAL flagged)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Syft&lt;/strong&gt; — generates a Software Bill of Materials (CycloneDX + SPDX JSON)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In lab mode, these gates report but don't block. In production, you'd set &lt;code&gt;exit-code: 1&lt;/code&gt; on Trivy and enforce the SonarCloud quality gate so both become hard stops. The point was to build the habit of having the gates, not to enforce them from day one.&lt;/p&gt;
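&lt;p&gt;The first gate is also the cheapest to reproduce locally before pushing. A sketch of the Gitleaks invocation (flags as in Gitleaks v8; exact flags may differ by version):&lt;/p&gt;

```shell
# Scan the working tree and full git history for committed secrets;
# exits nonzero when findings exist, so it works as a pipeline gate.
gitleaks detect --source . --redact \
  --report-format json --report-path gitleaks-report.json
```

&lt;p&gt;Running this in a pre-push hook catches the credential before it ever reaches the pipeline, let alone the git history.&lt;/p&gt;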




&lt;h2&gt;
  
  
  Phase 3: Move to ECS Fargate
&lt;/h2&gt;

&lt;p&gt;Running Docker Compose on EC2 is fine until you need the EC2 instance to scale, fail over, or restart containers automatically. ECS Fargate solves all three: serverless containers, AWS manages the underlying compute, you define the task and it runs.&lt;/p&gt;

&lt;p&gt;The Terraform configuration provisions the entire stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecs_cluster"&lt;/span&gt; &lt;span class="s2"&gt;"main"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.project_name}-cluster"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecs_service"&lt;/span&gt; &lt;span class="s2"&gt;"app"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.project_name}-service"&lt;/span&gt;
  &lt;span class="nx"&gt;cluster&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecs_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;task_definition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecs_task_definition&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;launch_type&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"FARGATE"&lt;/span&gt;

  &lt;span class="nx"&gt;deployment_controller&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"CODE_DEPLOY"&lt;/span&gt;  &lt;span class="c1"&gt;# Hands deployment control to CodeDeploy&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;lifecycle&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;ignore_changes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;task_definition&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# CI/CD owns this, not Terraform&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;ignore_changes = [task_definition]&lt;/code&gt; line matters more than it looks. Terraform manages the infrastructure. Jenkins manages which task definition revision is deployed. Without it, every &lt;code&gt;terraform apply&lt;/code&gt; would roll back to whatever task definition Terraform last knew about — overwriting the version Jenkins just pushed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Ffargatearch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Ffargatearch.png" alt="ECS Fargate Architecture" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Networking Trap That Got Me
&lt;/h2&gt;

&lt;p&gt;Before I talk about Phase 4, there's a specific failure I need to document because it will get you too.&lt;/p&gt;

&lt;p&gt;My backend couldn't connect to the database. &lt;code&gt;ECONNREFUSED database:5432&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In Docker Compose, services reach each other by their service name. &lt;code&gt;database&lt;/code&gt; resolves because Docker creates a shared bridge network with DNS for each service name.&lt;/p&gt;

&lt;p&gt;ECS Fargate uses &lt;code&gt;awsvpc&lt;/code&gt; network mode. All containers in the same task share a single network namespace — effectively the same &lt;code&gt;localhost&lt;/code&gt;. There's no inter-container DNS. The hostname &lt;code&gt;database&lt;/code&gt; doesn't resolve to anything.&lt;/p&gt;

&lt;p&gt;The fix is one word:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Docker Compose — works locally&lt;/span&gt;
&lt;span class="nv"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgresql://user:pass@database:5432/db

&lt;span class="c"&gt;# ECS Fargate — same task = same localhost&lt;/span&gt;
&lt;span class="nv"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgresql://user:pass@localhost:5432/db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't in the getting-started guide. It's buried in the ECS networking docs. And it will break every multi-container Fargate deployment that was originally written for Docker Compose, with an error message that points nowhere near the cause.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 4: Blue/Green Deployments with CodeDeploy
&lt;/h2&gt;

&lt;p&gt;This is the part that makes the system production-grade.&lt;/p&gt;

&lt;p&gt;When a new version is deployed, CodeDeploy spins up new ECS tasks (Green) alongside the existing ones (Blue). Traffic shifts 10% per minute from Blue to Green. If a CloudWatch alarm fires during the shift — 5xx error rate, unhealthy targets — traffic instantly reverts to 100% Blue.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T+0s    Blue: 100%    Green: starting     ← deploy begins
T+60s   Blue:  90%    Green: 10%          ← 10% shifted
T+120s  Blue:  80%    Green: 20%          ← steady if healthy
...
T+600s  Blue:   0%    Green: 100%         ← complete
T+900s  Blue tasks terminated             ← cleanup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the CloudWatch alarm fires at any point: traffic snaps back to 100% Blue instantly.&lt;/p&gt;
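&lt;p&gt;That snap-back behaviour is configured on the deployment group, not on each deployment. A hedged CLI sketch (application, group, and alarm names are placeholders; shorthand syntax per the AWS CLI docs):&lt;/p&gt;

```shell
# Attach the CloudWatch alarm and enable automatic rollback
# for alarm-triggered and failed deployments.
aws deploy update-deployment-group \
  --application-name notes-app \
  --current-deployment-group-name notes-app-dg \
  --alarm-configuration enabled=true,alarms=[{name=alb-5xx-rate}] \
  --auto-rollback-configuration enabled=true,events=DEPLOYMENT_FAILURE,DEPLOYMENT_STOP_ON_ALARM
```

&lt;p&gt;Without &lt;code&gt;DEPLOYMENT_STOP_ON_ALARM&lt;/code&gt; in the rollback events, the alarm stops the deployment but leaves traffic wherever it was.&lt;/p&gt;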

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Fcodedeployarch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Fcodedeployarch.png" alt="CodeDeploy Blue/Green Architecture" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Jenkins pipeline orchestrates this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight groovy"&gt;&lt;code&gt;&lt;span class="n"&gt;stage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Deploy to ECS'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;steps&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;sh&lt;/span&gt; &lt;span class="s1"&gt;'''
          # Render task definition with current image tags
          ./ecs/render-task-def.sh \
            --image-tag ${GIT_COMMIT:0:7} \
            --region eu-west-1

          # Register new task definition revision
          TASK_DEF_ARN=$(aws ecs register-task-definition \
            --cli-input-json file://ecs/task-definition-rendered.json \
            --query taskDefinition.taskDefinitionArn \
            --output text)

          # Trigger CodeDeploy blue/green
          aws deploy create-deployment \
            --cli-input-json file://ecs/codedeploy-input.json
        '''&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Fjenkins_pipeline.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Fjenkins_pipeline.png" alt="Jenkins Pipeline Flow" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability: Knowing It's Actually Working
&lt;/h2&gt;

&lt;p&gt;Deploying successfully and &lt;em&gt;knowing&lt;/em&gt; it's working are different things.&lt;/p&gt;

&lt;p&gt;The observability stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus&lt;/strong&gt; scraping NestJS &lt;code&gt;/metrics&lt;/code&gt; endpoint every 15 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana&lt;/strong&gt; dashboards for request rate, latency, error rate, container health&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alertmanager&lt;/strong&gt; routing alert notifications to a Slack channel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch&lt;/strong&gt; for ECS logs with 30-day retention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Fgrafanadash.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Fgrafanadash.png" alt="Grafana Dashboard" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Fslackalertscreenshot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fceletrialprince166%2Fgitops_lab%2Fmain%2FMulti_Container_App%2Fimages%2Fslackalertscreenshot.png" alt="Slack Alerts" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Prometheus NestJS integration is worth noting — NestJS doesn't expose metrics by default. You need to instrument it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// metrics.module.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;PrometheusModule&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@willsoto/nestjs-prometheus&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;imports&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nx"&gt;PrometheusModule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/metrics&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;defaultMetrics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MetricsModule&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once that's running, Prometheus scrapes the Node.js default metrics (heap usage, event loop lag, GC timings) automatically. HTTP request counts, latency histograms, and error rates need their own counter and histogram providers registered on top of the default set.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Learnings
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Security gates belong before automated deployment, not after.&lt;/strong&gt; The moment you automate deployment of untested, unscanned code, you've made your pipeline a liability. Build the gates first, then automate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Fargate &lt;code&gt;awsvpc&lt;/code&gt; mode changes inter-container communication fundamentally.&lt;/strong&gt; Same-task containers talk on &lt;code&gt;localhost&lt;/code&gt;. Cross-task communication needs service discovery or an internal load balancer. Know this before you hit it in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. &lt;code&gt;ignore_changes = [task_definition]&lt;/code&gt; is required when Terraform and CI/CD share an ECS service.&lt;/strong&gt; Without it, Terraform and Jenkins will fight over task definition revisions on every apply.&lt;/p&gt;
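
&lt;p&gt;In Terraform that looks like this (a minimal sketch; the service and cluster names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_ecs_service" "app" {
  name            = "app"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 2

  lifecycle {
    # Let the CI/CD pipeline own task definition revisions;
    # Terraform stops trying to "fix" them on every apply.
    ignore_changes = [task_definition]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;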

&lt;p&gt;&lt;strong&gt;4. Blue/green is only as good as your alarms.&lt;/strong&gt; If your CloudWatch alarm isn't configured before the deployment starts, there's nothing to trigger the rollback. The alarm is the safety net — set it up before you need it.&lt;/p&gt;
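
&lt;p&gt;A rough sketch of that wiring (the metric choice, threshold, and resource names are illustrative, and the deployment group's required arguments are elided):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_cloudwatch_metric_alarm" "target_5xx" {
  alarm_name          = "app-target-5xx"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 10
  comparison_operator = "GreaterThanThreshold"
}

# Attach it to the deployment group so a breach stops the deployment
resource "aws_codedeploy_deployment_group" "app" {
  # ... app name, service role, blue/green settings ...
  alarm_configuration {
    alarms  = [aws_cloudwatch_metric_alarm.target_5xx.alarm_name]
    enabled = true
  }
  auto_rollback_configuration {
    enabled = true
    events  = ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;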

&lt;p&gt;&lt;strong&gt;5. The AppSpec for CodeDeploy must be JSON-wrapped via the CLI.&lt;/strong&gt; The happy-path docs don't cover this step. Use &lt;code&gt;jq&lt;/code&gt; to wrap the YAML content as an &lt;code&gt;AppSpecContent&lt;/code&gt; JSON object, or the deployment will fail with an unhelpful error.&lt;/p&gt;
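
&lt;p&gt;The wrapping step, sketched (the application and deployment-group names are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Wrap the raw YAML as an AppSpecContent revision object
APPSPEC=$(jq -n --arg content "$(cat appspec.yaml)" \
  '{revisionType: "AppSpecContent", appSpecContent: {content: $content}}')

# Hand the wrapped revision to CodeDeploy
aws deploy create-deployment \
  --application-name my-ecs-app \
  --deployment-group-name my-ecs-dg \
  --revision "$APPSPEC"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;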




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;If I started this project today, I'd add AWS Systems Manager Session Manager from the start instead of a Bastion Host. No SSH port exposed, no key rotation, full audit trail of every session — and it's cheaper than running a separate EC2 instance as a jump box.&lt;/p&gt;

&lt;p&gt;I'd also set the security gates to blocking mode from day one, not lab mode. The discipline of having a hard quality gate early shapes how you write code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources &amp;amp; Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/celetrialprince166/gitops_lab" rel="noopener noreferrer"&gt;Full repository — github.com/celetrialprince166/gitops_lab&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-networking-awsvpc.html" rel="noopener noreferrer"&gt;ECS Fargate awsvpc networking docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/codedeploy/latest/userguide/deployment-steps-ecs.html" rel="noopener noreferrer"&gt;CodeDeploy ECS Blue/Green deployment guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/meta-arguments/lifecycle" rel="noopener noreferrer"&gt;Terraform ECS lifecycle meta-argument&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next I'm building the advanced observability layer — distributed tracing with OpenTelemetry and Jaeger across the full service mesh. Follow along if that's useful.&lt;/p&gt;




&lt;p&gt;What's the most important thing your deployment pipeline is missing right now? Drop it in the comments — I'm building a list of what engineers actually care about vs. what tutorials focus on. 👇&lt;/p&gt;

</description>
      <category>devops</category>
      <category>gitops</category>
      <category>aws</category>
      <category>cicd</category>
    </item>
    <item>
      <title>I Built a 3-Tier AWS Architecture With Terraform — Here's What Actually Tripped Me Up</title>
      <dc:creator>Prince Ayiku</dc:creator>
      <pubDate>Tue, 31 Mar 2026 18:11:06 +0000</pubDate>
      <link>https://dev.to/prince_ayiku_166/i-built-a-3-tier-aws-architecture-with-terraform-heres-what-actually-tripped-me-up-3d55</link>
      <guid>https://dev.to/prince_ayiku_166/i-built-a-3-tier-aws-architecture-with-terraform-heres-what-actually-tripped-me-up-3d55</guid>
      <description>&lt;h1&gt;
  
  
  I Built a 3-Tier AWS Architecture With Terraform — Here's What Actually Tripped Me Up
&lt;/h1&gt;

&lt;p&gt;I thought I understood Terraform. Then I tried to inject a database endpoint that didn't exist yet into a server that hadn't booted yet, and I stared at my screen for a solid hour.&lt;/p&gt;

&lt;p&gt;That moment taught me more about Infrastructure as Code than any tutorial had.&lt;/p&gt;

&lt;p&gt;This is the story of building a production-style 3-tier AWS architecture from scratch — what I built, what broke, and what I'd do differently.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Built This
&lt;/h2&gt;

&lt;p&gt;I'm on a DevOps learning path at AmaliTech, and I'd been doing the usual things: tutorials, small scripts, single-instance deployments. But I kept noticing that production systems don't look like that. They have layers. They isolate things. The database is never directly reachable from the internet.&lt;/p&gt;

&lt;p&gt;So I decided to build one — not a toy, but an architecture that actually reflects how real workloads run. The application I chose to deploy on top of it was a Pharma AI assistant: Next.js frontend, Python FastAPI backend, Clerk authentication, Paystack payments, and Groq for the LLM layer.&lt;/p&gt;

&lt;p&gt;The goal wasn't to ship the app. It was to build the infrastructure correctly, document the decisions, and understand &lt;em&gt;why&lt;/em&gt; each piece exists.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;The architecture has three layers, each in its own network zone:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttebdlpf1zn6tc62b6zk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttebdlpf1zn6tc62b6zk.png" alt="3-Tier Architecture" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1 — Public (ALB + Bastion Host)&lt;/strong&gt;&lt;br&gt;
The Application Load Balancer receives traffic from the internet and forwards it to the app tier. The Bastion Host is the only way to SSH into anything — and it's locked down to my IP only.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 2 — Private Application (EC2 Auto Scaling Group)&lt;/strong&gt;&lt;br&gt;
EC2 instances running Docker containers. They live in private subnets — no public IPs, no direct internet exposure. The only traffic that reaches them comes through the ALB. They can reach the internet outbound through a NAT Gateway (for pulling Docker images), but nothing can reach them directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 3 — Private Database (RDS PostgreSQL)&lt;/strong&gt;&lt;br&gt;
The database sits in private subnets with no route table attached to an internet gateway. Not "protected by a security group." &lt;em&gt;Structurally unreachable&lt;/em&gt; from the internet.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Terraform Structure
&lt;/h2&gt;

&lt;p&gt;Everything is modular. Five modules: &lt;code&gt;networking&lt;/code&gt;, &lt;code&gt;security&lt;/code&gt;, &lt;code&gt;database&lt;/code&gt;, &lt;code&gt;alb&lt;/code&gt;, &lt;code&gt;compute&lt;/code&gt;. The root &lt;code&gt;main.tf&lt;/code&gt; just orchestrates them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;modules/
├── networking/   # VPC, subnets, IGW, NAT Gateway, route tables
├── security/     # 4 security groups (ALB, Bastion, App, DB)
├── database/     # RDS PostgreSQL in private DB subnets
├── alb/          # Application Load Balancer + target group + listener
└── compute/      # Launch template, Auto Scaling Group, Bastion Host
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each module exposes outputs that downstream modules depend on: networking flows into security, security flows into compute and database, and everything ultimately converges on compute.&lt;/p&gt;
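
&lt;p&gt;A sketch of that wiring in the root module (the output and variable names here are illustrative, not necessarily the repo's exact ones):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Root main.tf: each module's outputs feed the next one down
module "networking" {
  source = "./modules/networking"
}

module "security" {
  source = "./modules/security"
  vpc_id = module.networking.vpc_id
}

module "database" {
  source     = "./modules/database"
  subnet_ids = module.networking.db_subnet_ids
  db_sg_id   = module.security.db_sg_id
}

module "compute" {
  source      = "./modules/compute"
  subnet_ids  = module.networking.private_subnet_ids
  app_sg_id   = module.security.app_sg_id
  db_endpoint = module.database.endpoint
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;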

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8aezjg2uki1bfqq8077.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8aezjg2uki1bfqq8077.png" alt="Terraform Apply Output" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem That Stopped Me Cold
&lt;/h2&gt;

&lt;p&gt;Here's where things got interesting.&lt;/p&gt;

&lt;p&gt;My EC2 instances boot by running a &lt;code&gt;user_data.sh&lt;/code&gt; script. That script pulls a Docker image from Docker Hub and runs it with environment variables — including the database connection string.&lt;/p&gt;

&lt;p&gt;The database connection string includes the RDS endpoint. Like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgresql://username:password@mydb.abc123xyz.us-east-1.rds.amazonaws.com:5432/myapp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem: that endpoint only exists &lt;em&gt;after&lt;/em&gt; Terraform creates the RDS instance. Which happens &lt;em&gt;before&lt;/em&gt; the EC2 instances boot. Which means I need to pass a value that doesn't exist at the start of &lt;code&gt;terraform apply&lt;/code&gt; into a script that runs at the end of it.&lt;/p&gt;

&lt;p&gt;I tried a few approaches that didn't work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hardcoding the endpoint&lt;/strong&gt; — completely defeats the purpose of IaC. Next deploy on a fresh account, it breaks immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Passing it as a plain string variable&lt;/strong&gt; — still needs the actual value upfront.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Running a second script after apply&lt;/strong&gt; — works once, but now you have manual steps that live outside your Terraform state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The solution was &lt;code&gt;templatefile()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In the compute module, launch template user_data:&lt;/span&gt;
&lt;span class="nx"&gt;user_data&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;base64encode&lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;templatefile&lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"${path.module}/scripts/user_data.sh"&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;database_url&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"postgresql://${var.db_username}:${var.db_password}@${var.db_endpoint}/${var.db_name}?sslmode=require"&lt;/span&gt;
    &lt;span class="nx"&gt;direct_url&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"postgresql://${var.db_username}:${var.db_password}@${var.db_endpoint}/${var.db_name}?sslmode=require"&lt;/span&gt;
    &lt;span class="nx"&gt;docker_username&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dockerhub_username&lt;/span&gt;
    &lt;span class="nx"&gt;docker_password&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dockerhub_token&lt;/span&gt;
    &lt;span class="nx"&gt;clerk_secret_key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;clerk_secret_key&lt;/span&gt;
    &lt;span class="c1"&gt;# ... other vars&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform resolves module outputs in dependency order. By the time it runs the compute module, the database module has already completed — and its endpoint output is available. The &lt;code&gt;templatefile()&lt;/code&gt; function substitutes all the variables into the shell script before base64-encoding it. The EC2 instance boots with a fully rendered startup script that has the real database URL already baked in.&lt;/p&gt;

&lt;p&gt;No manual steps. No hardcoded values. Works on every fresh deploy.&lt;/p&gt;
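
&lt;p&gt;For completeness, the template side looks roughly like this (the image name and variable set are illustrative); the &lt;code&gt;${...}&lt;/code&gt; placeholders are filled in by &lt;code&gt;templatefile()&lt;/code&gt;, not by the shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/bin/bash
# user_data.sh: every ${...} below is a templatefile() variable,
# already substituted before the instance ever boots
echo "${docker_password}" | docker login -u "${docker_username}" --password-stdin
docker run -d -p 80:3000 \
  -e DATABASE_URL="${database_url}" \
  -e CLERK_SECRET_KEY="${clerk_secret_key}" \
  myuser/pharma-app:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;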




&lt;h2&gt;
  
  
  The Other Gotcha: Docker Hub from a Private Subnet
&lt;/h2&gt;

&lt;p&gt;After solving the database injection problem, I ran &lt;code&gt;terraform apply&lt;/code&gt; and watched everything provision cleanly. Then I checked the EC2 logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo cat&lt;/span&gt; /var/log/user-data.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Docker pull was failing. The instance couldn't reach Docker Hub.&lt;/p&gt;

&lt;p&gt;I had set up private subnets and a NAT Gateway, but I'd missed connecting them. The private route table didn't have a route for &lt;code&gt;0.0.0.0/0&lt;/code&gt; pointing at the NAT Gateway. Private subnets need that explicit route — they don't inherit it from anywhere.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route_table"&lt;/span&gt; &lt;span class="s2"&gt;"private"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;route&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;cidr_block&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;
    &lt;span class="nx"&gt;nat_gateway_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_nat_gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;  &lt;span class="c1"&gt;# This is what I was missing&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fix applied. Docker pull worked. App came up.&lt;/p&gt;
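
&lt;p&gt;One related detail: the route table only takes effect once it's associated with each private subnet (a sketch; the subnet resource layout is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Without these associations, the NAT route above is never used
resource "aws_route_table_association" "private" {
  count          = length(aws_subnet.private)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;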




&lt;h2&gt;
  
  
  Security Groups That Reference Each Other
&lt;/h2&gt;

&lt;p&gt;One thing I'm actually proud of in this project is how the security groups are set up.&lt;/p&gt;

&lt;p&gt;Instead of allowing traffic from IP address ranges, each security group references another security group as the source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# App security group — only allows HTTP from the ALB security group&lt;/span&gt;
&lt;span class="nx"&gt;ingress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;from_port&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
  &lt;span class="nx"&gt;to_port&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
  &lt;span class="nx"&gt;protocol&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
  &lt;span class="nx"&gt;security_groups&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;alb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Database security group — only allows PostgreSQL from the App security group&lt;/span&gt;
&lt;span class="nx"&gt;ingress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;from_port&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5432&lt;/span&gt;
  &lt;span class="nx"&gt;to_port&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5432&lt;/span&gt;
  &lt;span class="nx"&gt;protocol&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
  &lt;span class="nx"&gt;security_groups&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why does this matter? Because EC2 instances in an Auto Scaling Group come and go. Their IP addresses change. If you allow traffic from &lt;code&gt;10.0.2.0/24&lt;/code&gt;, you need to keep that CIDR accurate forever. If you allow traffic from the App security group ID, any instance in that group is automatically covered — regardless of its IP.&lt;/p&gt;




&lt;h2&gt;
  
  
  The CI/CD Pipeline
&lt;/h2&gt;

&lt;p&gt;I set up a GitHub Actions workflow that triggers on every push to &lt;code&gt;main&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Checkout the code&lt;/li&gt;
&lt;li&gt;Log in to Docker Hub&lt;/li&gt;
&lt;li&gt;Build the Docker image from &lt;code&gt;./pharma_app/Dockerfile&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Push it with the &lt;code&gt;latest&lt;/code&gt; tag&lt;/li&gt;
&lt;/ol&gt;
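
&lt;p&gt;Those four steps sketch out to a workflow along these lines (the secret names and image tag are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# .github/workflows/build.yml (sketch)
name: build-and-push
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: ./pharma_app
          push: true
          tags: myuser/pharma-app:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;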

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy5qb4nfpcg8opgbtdwq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy5qb4nfpcg8opgbtdwq.png" alt="CI/CD Pipeline" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When new instances launch (via ASG), they pull &lt;code&gt;latest&lt;/code&gt; from Docker Hub automatically through the user_data script. So a code push → Docker Hub → next instance launch picks up the new image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk9l7ehpxfszz1ytzzj5q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk9l7ehpxfszz1ytzzj5q.png" alt="Successful Build" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Running Looks Like
&lt;/h2&gt;

&lt;p&gt;After &lt;code&gt;terraform apply&lt;/code&gt; completes, the output gives you the ALB DNS name. Hit that in a browser and the app loads — served through the load balancer, from a private EC2 instance, talking to a database that has no internet exposure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8sjst4te2p8wx5t193ps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8sjst4te2p8wx5t193ps.png" alt="Application Running" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Learnings
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;templatefile()&lt;/code&gt; is how you inject dynamic values into user_data.&lt;/strong&gt; Terraform resolves the value after the dependency it comes from is complete. Use it — don't work around it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Private subnets are not automatically NAT'd.&lt;/strong&gt; You need to create the NAT Gateway, create a private route table, add a &lt;code&gt;0.0.0.0/0&lt;/code&gt; route pointing at the NAT, and associate that route table with your private subnets. All four steps. No shortcuts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Security groups that reference each other are more resilient than CIDR-based rules.&lt;/strong&gt; Especially in environments where instances scale dynamically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add retry logic to user_data scripts.&lt;/strong&gt; Instance networking isn't always ready the second the script runs. Five retries with 15-second delays costs nothing and prevents a class of flaky failures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RDS subnet groups require at least two AZs.&lt;/strong&gt; This is an AWS requirement, not a recommendation. Design your subnets for multi-AZ from the start.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
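
&lt;p&gt;The retry idea from point 4 can be sketched as a small shell helper (the attempt count, delay, and wrapped command are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/bin/bash
# retry: run a command until it succeeds, up to a limit.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  local attempts=$1; shift
  local delay=$1; shift
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "retry: giving up after $n attempts"
      return 1
    fi
    echo "retry: attempt $n failed, sleeping ${delay}s"
    sleep "$delay"
    n=$((n + 1))
  done
}

# In user_data, wrap the flaky network-dependent step, e.g.:
# retry 5 15 docker pull myuser/pharma-app:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;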




&lt;h2&gt;
  
  
  Lessons for Other Learners
&lt;/h2&gt;

&lt;p&gt;If you're working through a similar architecture and something isn't connecting, nine times out of ten it's a routing issue or a security group that's too restrictive. Check your route tables before you start doubting your application code.&lt;/p&gt;

&lt;p&gt;And don't skip the modular structure because it feels like overhead. When something breaks, knowing that the networking module is isolated from the compute module makes debugging dramatically faster.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources &amp;amp; Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Full repo: &lt;a href="https://github.com/celetrialprince166/Terraform_3tierArch" rel="noopener noreferrer"&gt;github.com/celetrialprince166/Terraform_3tierArch&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/functions/templatefile" rel="noopener noreferrer"&gt;Terraform templatefile() docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/whitepapers/latest/serverless-multi-tier-architectures-api-gateway-lambda/three-tier-architecture-overview.html" rel="noopener noreferrer"&gt;AWS 3-Tier Architecture reference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next: I'm building out a full CI/CD pipeline with Jenkins, GitHub Actions, SonarCloud code analysis, and Trivy image scanning. Follow along if that's useful.&lt;/p&gt;




&lt;p&gt;Have you built a 3-tier architecture before? What was the part that gave you the most trouble? Drop it in the comments — I'm genuinely curious whether the NAT Gateway thing trips everyone up or just me. 👇&lt;/p&gt;

</description>
      <category>devops</category>
      <category>terraform</category>
      <category>aws</category>
      <category>iac</category>
    </item>
  </channel>
</rss>
