<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Suleiman Abdulkadir</title>
    <description>The latest articles on DEV Community by Suleiman Abdulkadir (@suletete).</description>
    <link>https://dev.to/suletete</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1134100%2F1bc50dbc-4e21-4e49-8a1e-abea00988ff6.png</url>
      <title>DEV Community: Suleiman Abdulkadir</title>
      <link>https://dev.to/suletete</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/suletete"/>
    <language>en</language>
    <item>
      <title>I built a self-healing web app on AWS and watched it recover from failure in real time</title>
      <dc:creator>Suleiman Abdulkadir</dc:creator>
      <pubDate>Wed, 01 Jul 2026 09:00:00 +0000</pubDate>
      <link>https://dev.to/suletete/i-built-a-self-healing-web-app-on-aws-and-watched-it-recover-from-failure-in-real-time-4eok</link>
      <guid>https://dev.to/suletete/i-built-a-self-healing-web-app-on-aws-and-watched-it-recover-from-failure-in-real-time-4eok</guid>
      <description>&lt;p&gt;I wanted to actually understand AWS networking. Not "I followed a tutorial, and it worked" understand. More like "I can explain why this NAT Gateway exists and what breaks if I delete it" understand.&lt;/p&gt;

&lt;p&gt;So I built CloudPulse. It's a TypeScript app that monitors its own infrastructure and displays the health of every instance on a dashboard. The interesting part: when you kill an instance, the system detects the failure and replaces it automatically while users never notice anything went wrong.&lt;/p&gt;

&lt;p&gt;No Terraform. No CloudFormation. Raw AWS CLI calls in bash scripts, each one commented so I'd remember what it does in six months.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fexieus3d9nddpok18q9a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fexieus3d9nddpok18q9a.png" alt="CloudPulse Dashboard" width="799" height="620"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How it's wired together
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fdqjp8jgq0d8s7di6yj2k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fdqjp8jgq0d8s7di6yj2k.png" alt="Architecture diagram" width="800" height="780"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Internet traffic hits an Application Load Balancer sitting in public subnets. The ALB forwards requests to EC2 instances in private subnets on port 3000. Those instances have no public IP; they can't be reached directly from the internet at all. When they need to talk to AWS APIs (publishing CloudWatch metrics, describing their own ASG), they go through NAT Gateways.&lt;/p&gt;

&lt;p&gt;There's one NAT Gateway per availability zone. If the one in AZ-1 dies, only the instance in AZ-1 loses outbound connectivity. The instance in AZ-2 keeps working through its own NAT Gateway. That's the point of having two.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fkh6qew1ie5y8bd8mljj9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fkh6qew1ie5y8bd8mljj9.png" alt="Network layout" width="799" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The ALB checks /health every 30 seconds. Three consecutive failures and the instance gets pulled from the target group. The Auto Scaling Group notices the instance is unhealthy, terminates it, and launches a fresh one. No human involved.&lt;/p&gt;
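
&lt;p&gt;As a rough sketch of what those two settings mean in practice: the time the ALB needs to give up on a broken instance is about the probe interval times the failure threshold. The type and function names below are made up for illustration and are not taken from CloudPulse.&lt;/p&gt;

```typescript
// Hypothetical names; the values mirror the health check settings described above.
interface HealthCheckSettings {
  intervalSeconds: number;    // how often the ALB probes /health
  unhealthyThreshold: number; // consecutive failures before the target is marked unhealthy
}

// Approximate time from an instance breaking to the ALB marking it unhealthy:
// it takes `unhealthyThreshold` failed probes, spaced `intervalSeconds` apart.
function secondsToMarkUnhealthy(settings: HealthCheckSettings): number {
  return settings.intervalSeconds * settings.unhealthyThreshold;
}

const cloudPulse: HealthCheckSettings = { intervalSeconds: 30, unhealthyThreshold: 3 };
console.log(secondsToMarkUnhealthy(cloudPulse)); // 90
```

&lt;p&gt;So an instance whose app hangs can stay in rotation for roughly a minute and a half before the ALB stops routing to it. This ignores the per-probe timeout, so treat it as an estimate rather than a guarantee.&lt;/p&gt;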

&lt;h2&gt;
  
  
  The part I actually wanted to see: killing an instance
&lt;/h2&gt;

&lt;p&gt;This is why I built the whole thing. I wanted to watch a system heal itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fa8vzt7ynky72lxv0qoy0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fa8vzt7ynky72lxv0qoy0.png" alt="Terminating an instance" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I ran &lt;code&gt;aws ec2 terminate-instances&lt;/code&gt; on one of the two running instances. Then I sat there watching the dashboard refresh every 30 seconds.&lt;/p&gt;

&lt;p&gt;Within about a minute, the terminated instance showed up as unhealthy. The ASG launched a replacement. The new instance booted Amazon Linux 2023, pulled my app from S3, installed dependencies, started the Node process, and began responding to the ALB health checks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fnm5okr0up2a5zqb7c4t3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fnm5okr0up2a5zqb7c4t3.png" alt="New instance coming up" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fki8d646u1fiz98kbed9j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fki8d646u1fiz98kbed9j.png" alt="Recovery complete" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Total recovery time: under 2 minutes. And during those 2 minutes, the ALB was sending all traffic to the surviving instance. Nobody waiting for a page load would have noticed anything.&lt;/p&gt;

&lt;p&gt;That's the thing about self-healing infrastructure. It's boring when it works. You kill something, wait a bit, and everything is back to normal. But getting to that boring place required wiring up health checks, ASG policies, target group settings, and IAM permissions correctly. The boring outcome is the proof that the wiring works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Auto scaling under load
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fshvnj0vc622dlth0t39p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fshvnj0vc622dlth0t39p.png" alt="Auto Scaling Group" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I connected to one of the instances via SSM Session Manager (no SSH keys anywhere in this setup) and ran &lt;code&gt;stress --cpu 4 --timeout 180&lt;/code&gt;. This pegged the CPU at 100% for 3 minutes.&lt;/p&gt;

&lt;p&gt;CloudWatch saw the CPUUtilization metric exceed 70% for 2 consecutive 60-second periods. The high-CPU alarm fired, and the ASG added a third instance. When the stress test ended and CPU dropped below 30% for 2 minutes, the matching low-CPU alarm fired, and the ASG removed the extra instance.&lt;/p&gt;

&lt;p&gt;The scaling policies have a 300-second cooldown so they don't thrash back and forth.&lt;/p&gt;
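
&lt;p&gt;To make the alarm rule concrete, here is a small sketch of the "N consecutive breaching periods" check. It is not CloudWatch's API, only an illustration: &lt;code&gt;alarmBreached&lt;/code&gt; and the sample datapoints are invented, and it assumes one alarm for scaling out and a separate one for scaling in.&lt;/p&gt;

```typescript
// Hypothetical helper, not part of any AWS SDK.
type Comparison = "above" | "below";

// Returns true when the most recent `periods` datapoints all breach the threshold.
function alarmBreached(
  datapoints: number[], // one average CPU % per 60-second period, oldest first
  threshold: number,
  periods: number,      // assumed to be at least 1
  comparison: Comparison,
): boolean {
  if (periods > datapoints.length) return false; // not enough data yet
  const recent = datapoints.slice(-periods);
  return recent.every((cpu) =>
    comparison === "above" ? cpu > threshold : threshold > cpu,
  );
}

const duringStress = [22, 35, 98, 100];
const afterStress = [100, 64, 18, 12];

console.log(alarmBreached(duringStress, 70, 2, "above")); // true: add an instance
console.log(alarmBreached(afterStress, 30, 2, "below"));  // true: remove an instance
console.log(alarmBreached([22, 98], 70, 2, "above"));     // false: only one breaching period
```

&lt;p&gt;Requiring two periods in a row is what keeps a single CPU spike from launching an instance; the cooldown then keeps the two alarms from fighting each other.&lt;/p&gt;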

&lt;h2&gt;
  
  
  The instances themselves
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fg2h3smcian7ihc93k10a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fg2h3smcian7ihc93k10a.png" alt="Running instances" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both run t3.micro (free tier eligible, sort of; you get 750 hours/month, but 2 instances running around the clock burn about 1,440 hours in a 30-day month). Private subnets, no public IP, no SSH key pair. I access them through Systems Manager Session Manager when I need to poke around.&lt;/p&gt;

&lt;p&gt;The IAM role attached to the instances allows exactly four things: publish CloudWatch metrics, describe EC2 instances, describe ASG state, and use SSM for shell access. Nothing else.&lt;/p&gt;

&lt;h2&gt;
  
  
  One-command deployment
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsuletetes%2FAWS-HA-WebApp%2Fmain%2Fdiagrams%2FCloudPulse%2FDeployed%2520On%2520Terminal%2520%28IDE%29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsuletetes%2FAWS-HA-WebApp%2Fmain%2Fdiagrams%2FCloudPulse%2FDeployed%2520On%2520Terminal%2520%28IDE%29.png" alt="Terminal deployment" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;bash deploy.sh&lt;/code&gt; runs five scripts in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;iam.sh creates the role and instance profile&lt;/li&gt;
&lt;li&gt;vpc.sh builds the entire network (this takes ~3 minutes because NAT Gateways are slow to provision)&lt;/li&gt;
&lt;li&gt;alb.sh creates security groups, the load balancer, target group, and listener&lt;/li&gt;
&lt;li&gt;compute.sh creates the launch template and ASG (instances start booting here)&lt;/li&gt;
&lt;li&gt;monitoring.sh creates the CloudWatch alarms&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At the end, it prints the ALB URL. Wait 3-5 minutes for instances to pass health checks, then open it.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;bash teardown.sh&lt;/code&gt; deletes everything in reverse order. Takes about 3 minutes. I run it every time I finish a learning session because NAT Gateways cost $2/day just sitting there.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I used
&lt;/h2&gt;

&lt;p&gt;The app itself is TypeScript on Express. Server-side rendered HTML with EJS (no frontend framework; the dashboard is one page that refreshes every 30 seconds). 101 tests across unit, property-based (fast-check), and integration (supertest).&lt;/p&gt;

&lt;p&gt;The infrastructure is pure AWS CLI in bash. Every script sources a shared config file and a common utilities file. Resource IDs get saved to an env file so scripts can reference what previous scripts created.&lt;/p&gt;

&lt;p&gt;AWS services in this project: VPC, public/private subnets, Internet Gateway, NAT Gateways, route tables, NACLs, security groups, Application Load Balancer, EC2 via Launch Template, Auto Scaling Group, CloudWatch custom metrics, CloudWatch alarms, IAM roles with instance profiles, EBS gp3 volumes, SSM Session Manager, and S3 for code delivery.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned the hard way
&lt;/h2&gt;

&lt;p&gt;Git Bash on Windows rewrites any path starting with &lt;code&gt;/&lt;/code&gt; to a Windows path. My health check path &lt;code&gt;/health&lt;/code&gt; became &lt;code&gt;C:/Program Files/Git/health&lt;/code&gt; during deployment. Took me a while to figure out why the target group health check was failing. Fix: &lt;code&gt;export MSYS_NO_PATHCONV=1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;ALB network interfaces take 5-10 minutes to fully release after you delete the ALB. If you try to delete the security groups too early, you get "DependencyViolation" errors. The teardown script has to wait.&lt;/p&gt;

&lt;p&gt;IAM is eventually consistent. If you create an instance profile and immediately reference it in a launch template, it sometimes fails because the profile hasn't propagated yet. I added a 10-second sleep after IAM operations. Ugly, but it works.&lt;/p&gt;

&lt;p&gt;Security groups that reference each other can't be deleted independently. You have to remove the cross-reference rules first, then delete them. The teardown script handles this, but it was a pain to debug the first time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Source
&lt;/h2&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/suletetes/AWS-HA-WebApp" rel="noopener noreferrer"&gt;suletetes/AWS-HA-WebApp&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>typescript</category>
      <category>cloud</category>
    </item>
    <item>
      <title>I built a group expense app where the database refuses to let balances lie</title>
      <dc:creator>Suleiman Abdulkadir</dc:creator>
      <pubDate>Sat, 27 Jun 2026 16:14:37 +0000</pubDate>
      <link>https://dev.to/suletete/i-built-a-group-expense-app-where-the-database-refuses-to-let-balances-lie-ec0</link>
      <guid>https://dev.to/suletete/i-built-a-group-expense-app-where-the-database-refuses-to-let-balances-lie-ec0</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Built for the &lt;strong&gt;H0 Hackathon&lt;/strong&gt; AWS + Vercel track.&lt;br&gt;
Live: &lt;a href="https://ledgerloop-delta.vercel.app" rel="noopener noreferrer"&gt;ledgerloop-delta.vercel.app&lt;/a&gt;&lt;br&gt;
Repo: &lt;a href="https://github.com/suletete/LedgerLoop" rel="noopener noreferrer"&gt;github.com/suletete/LedgerLoop&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Three friends in three time zones split a hotel bill. Two of them add expenses from different cities at the same time. Most apps let one of those writes quietly overwrite the other. Balances drift by a cent. Nobody notices until someone gets asked to pay more than they actually owe.&lt;/p&gt;

&lt;p&gt;That's not a UI problem. It's a math problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The invariant that started everything
&lt;/h2&gt;

&lt;p&gt;In a closed expense group, the sum of all net balances must equal zero. Always. If Alice owes Bob $50 and Bob owes Carol $30, those flows have to cancel out across the whole group. That's not a business requirement someone wrote in a spec. It's arithmetic.&lt;/p&gt;

&lt;p&gt;Most apps treat this as an application concern; they lock a row, update a running total, and release the lock. The problem: if two writes race, application-layer optimistic locking can fail silently. You get a wrong number and no error.&lt;/p&gt;

&lt;p&gt;I wanted the database to refuse the inconsistency. Don't paper over it. Refuse it.&lt;/p&gt;

&lt;p&gt;Aurora PostgreSQL runs with serializable isolation. When two transactions touch the same data at the same instant, one of them gets &lt;code&gt;SQLSTATE 40001&lt;/code&gt;, a serialization failure, and the database aborts it. No silent merge. No wrong number. A clean error you can retry from a fresh read.&lt;/p&gt;

&lt;p&gt;The rest of the architecture follows from that.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I built
&lt;/h2&gt;

&lt;p&gt;LedgerLoop is a group expense ledger. Friends split shared costs (rent, trips, dinners), and the app figures out exactly what each person owes, reduced to the fewest possible payments.&lt;/p&gt;

&lt;p&gt;The core flow: add an expense, split it (equal, by percentage, or exact amounts), and watch the balances update. If someone tries to settle $600 when they only owe $500, the system rejects it before anything hits the database. Twelve tangled debts collapse to four payments.&lt;/p&gt;
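
&lt;p&gt;One common way to collapse a tangle of debts is a greedy settle-up over net balances: keep matching the largest debtor with the largest creditor. The sketch below shows that idea only. It is not necessarily the algorithm LedgerLoop uses, the names are hypothetical, and the greedy approach needs at most one payment fewer than the number of members but does not always find the true minimum.&lt;/p&gt;

```typescript
// Hypothetical shapes for illustration; amounts are integer minor units.
interface Balance { member: string; netMinor: number; } // positive = is owed, negative = owes
interface Payment { from: string; to: string; amountMinor: number; }

// Greedy settle-up: repeatedly pair the largest debtor with the largest creditor.
// Assumes the net balances sum to zero.
function simplifyDebts(balances: Balance[]): Payment[] {
  const debtors = balances
    .filter((b) => 0 > b.netMinor)
    .map((b) => ({ member: b.member, left: -b.netMinor }));
  const creditors = balances
    .filter((b) => b.netMinor > 0)
    .map((b) => ({ member: b.member, left: b.netMinor }));
  const payments: Payment[] = [];

  while (Math.min(debtors.length, creditors.length) > 0) {
    debtors.sort((a, b) => b.left - a.left);
    creditors.sort((a, b) => b.left - a.left);
    const debtor = debtors[0];
    const creditor = creditors[0];
    const amountMinor = Math.min(debtor.left, creditor.left);
    payments.push({ from: debtor.member, to: creditor.member, amountMinor });
    debtor.left -= amountMinor;
    creditor.left -= amountMinor;
    // Every payment clears at least one side, so the loop always terminates.
    if (debtor.left === 0) debtors.shift();
    if (creditor.left === 0) creditors.shift();
  }
  return payments;
}

const plan = simplifyDebts([
  { member: "Alice", netMinor: 7000 },
  { member: "Bob", netMinor: -2000 },
  { member: "Carol", netMinor: -5000 },
]);
plan.forEach((p) => console.log(`${p.from} pays ${p.to} ${p.amountMinor}`));
// Carol pays Alice 5000
// Bob pays Alice 2000
```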

&lt;p&gt;Stack: Next.js 15 (App Router) on Vercel, TypeScript strict mode, Tailwind CSS, Aurora PostgreSQL Serverless v2, Vitest + fast-check.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F0ufhdp4x4n739f3rt72f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F0ufhdp4x4n739f3rt72f.png" alt="LedgerLoop architecture overview" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The concurrency model changed my architecture
&lt;/h2&gt;

&lt;p&gt;Traditional systems: lock the row, update it, release the lock. With optimistic concurrency control (OCC), you read a snapshot, compute your change, and commit. If someone else changed the same data while you were computing, you get &lt;code&gt;40001&lt;/code&gt; and start over from a fresh read.&lt;/p&gt;

&lt;p&gt;That sounds like a retry strategy. It's more than that.&lt;/p&gt;

&lt;p&gt;Once the model clicked, I stopped storing derived values. A running balance column is a conflict surface. Two writes that touch it will race. So I didn't store it. Balance is computed fresh on every read, from the raw ledger: expenses, splits, and settlements. Nothing to corrupt. Nothing to get out of sync.&lt;/p&gt;

&lt;p&gt;The ledger is append-only. Expenses and settlements are inserted. Corrections are new reversing rows, not edits. Two inserts with different UUIDs rarely collide. The conflict window gets narrow enough that retries are rare in practice.&lt;/p&gt;

&lt;p&gt;Every write goes through &lt;code&gt;withOccRetry&lt;/code&gt;: on &lt;code&gt;40001&lt;/code&gt;, exponential backoff with jitter, up to 4 attempts. If all 4 exhaust, a clean error comes back. The ledger is exactly as it was before the first attempt.&lt;/p&gt;
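
&lt;p&gt;A minimal sketch of what a wrapper like that can look like. This is an illustration under assumptions, not the code from the repo: the 50 ms base delay, the "full jitter" formula, and the helper names are mine, and it assumes the database error exposes the SQLSTATE as a &lt;code&gt;code&lt;/code&gt; property.&lt;/p&gt;

```typescript
// Hypothetical sketch; the real withOccRetry in LedgerLoop may differ.
const SERIALIZATION_FAILURE = "40001";

function isSerializationFailure(err: unknown): boolean {
  return (err as { code?: string } | null)?.code === SERIALIZATION_FAILURE;
}

// Full jitter: a random delay between 0 and baseMs * 2^(attempt - 1).
function backoffDelayMs(attempt: number, baseMs: number, random: () => number): number {
  return Math.floor(random() * baseMs * 2 ** (attempt - 1));
}

function sleep(ms: number) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function withOccRetry(
  op: () => unknown, // the transactional write; may return a promise
  maxAttempts = 4,
  baseMs = 50,       // assumed base delay
) {
  let attempt = 1;
  while (true) {
    try {
      return await op();
    } catch (err) {
      // Only serialization failures are safe to retry from a fresh read.
      if (!isSerializationFailure(err)) throw err;
      if (attempt >= maxAttempts) {
        throw new Error(`gave up after ${maxAttempts} OCC conflicts`);
      }
      await sleep(backoffDelayMs(attempt, baseMs, Math.random));
      attempt += 1;
    }
  }
}

// Usage: the first two attempts hit a simulated conflict, the third commits.
let demoCalls = 0;
withOccRetry(() => {
  demoCalls += 1;
  if (3 > demoCalls) throw { code: SERIALIZATION_FAILURE };
  return "committed";
}).then((result) => console.log(result, "after", demoCalls, "attempts"));
// committed after 3 attempts
```

&lt;p&gt;The jitter matters because two writers that conflicted once would otherwise retry in lockstep and conflict again.&lt;/p&gt;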

&lt;p&gt;The architecture didn't lead me to the concurrency model. The concurrency model told me what the architecture had to look like.&lt;/p&gt;

&lt;h2&gt;
  
  
  The remainder bug property testing found
&lt;/h2&gt;

&lt;p&gt;My first equal-split implementation passed every unit test I wrote.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Naive: $10.00 / 3 = 333 + 333 + 333 = 999 cents. One cent gone, every time.&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;perPerson&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;n&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The property test caught it on the second run. fast-check threw random amounts and random participant counts at the invariant &lt;code&gt;sum(shares) == amount&lt;/code&gt; and found a specific combination I hadn't thought to try.&lt;/p&gt;

&lt;p&gt;The fix was two lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;n&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;remainder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;base&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;// First `remainder` participants get base + 1, the rest get base.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For $10.00 / 3: 334 + 333 + 333 = 1000 cents. Always.&lt;/p&gt;
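
&lt;p&gt;Put together as a complete function, the fix might look like the sketch below. The function name and the input validation are my own additions for illustration; only the &lt;code&gt;base&lt;/code&gt; and &lt;code&gt;remainder&lt;/code&gt; lines come from the snippet above.&lt;/p&gt;

```typescript
// Split an amount in minor units (cents) so the shares always sum to the amount.
// Hypothetical name and validation; the core two lines match the fix above.
function splitEqually(amountMinor: number, participants: number): number[] {
  if (!Number.isInteger(amountMinor) || 0 > amountMinor) {
    throw new RangeError("amountMinor must be a non-negative integer");
  }
  if (!Number.isInteger(participants) || 1 > participants) {
    throw new RangeError("participants must be a positive integer");
  }
  const base = Math.floor(amountMinor / participants);
  const remainder = amountMinor - base * participants;
  // First `remainder` participants get base + 1, the rest get base.
  return Array.from({ length: participants }, (_, i) => (remainder > i ? base + 1 : base));
}

console.log(splitEqually(1000, 3).join(" + ")); // 334 + 333 + 333
```

&lt;p&gt;Because &lt;code&gt;remainder&lt;/code&gt; is always smaller than the participant count, nobody ever pays more than one minor unit above anyone else.&lt;/p&gt;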

&lt;p&gt;I wrote dozens of unit tests before this. None of them found it. The property test found it on run 2 because it generated the exact amount and participant count combination where naive flooring fails.&lt;/p&gt;

&lt;p&gt;That's property testing. You write down what must always be true and let the library go looking for trouble.&lt;/p&gt;

&lt;h2&gt;
  
  
  Six invariants, each one trying to break itself
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;What must hold&lt;/th&gt;
&lt;th&gt;Enforced by&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;INV-1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;sum(splits) == expense amount&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Split Calculator + atomic transaction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INV-2&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;sum(balances) == 0&lt;/code&gt; across a group&lt;/td&gt;
&lt;td&gt;Balance Engine derivation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INV-3&lt;/td&gt;
&lt;td&gt;No double-counting under concurrency&lt;/td&gt;
&lt;td&gt;Aurora SERIALIZABLE + withOccRetry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INV-4&lt;/td&gt;
&lt;td&gt;Money is always integer minor units&lt;/td&gt;
&lt;td&gt;BIGINT storage, no floats&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INV-5&lt;/td&gt;
&lt;td&gt;Settlement &amp;lt;= what's owed&lt;/td&gt;
&lt;td&gt;Settlement Validator against derived ledger&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INV-6&lt;/td&gt;
&lt;td&gt;Every row references a real entity&lt;/td&gt;
&lt;td&gt;Auth Guard + DB foreign keys&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;27 property-based tests total. Each one runs 100+ random inputs via fast-check.&lt;/p&gt;

&lt;p&gt;INV-2 is the most satisfying. The balance engine reads every expense, split, and settlement for a group and computes each member's net. Positive means the group owes you. Negative means you owe. The sum across the whole group is always zero. That isn't something verified after the fact; it's a consequence of how the formula adds up.&lt;/p&gt;
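
&lt;p&gt;Here is a simplified sketch of that derivation. The row shapes and the &lt;code&gt;deriveBalances&lt;/code&gt; name are hypothetical and much smaller than the real schema, but they show why the total cancels: every ledger row adds some amount to one member and subtracts the same amount from others.&lt;/p&gt;

```typescript
// Hypothetical row shapes; amounts are integer minor units.
interface Split { member: string; shareMinor: number; }
interface Expense { paidBy: string; amountMinor: number; splits: Split[]; }
interface Settlement { from: string; to: string; amountMinor: number; }
type Balances = { [member: string]: number };

// Net balance per member: positive means the group owes you, negative means you owe.
function deriveBalances(expenses: Expense[], settlements: Settlement[]): Balances {
  const net: Balances = {};
  const add = (member: string, deltaMinor: number) => {
    net[member] = (net[member] ?? 0) + deltaMinor;
  };
  for (const expense of expenses) {
    add(expense.paidBy, expense.amountMinor); // the payer fronted the full amount
    for (const split of expense.splits) {
      add(split.member, -split.shareMinor);   // each member owes their share
    }
  }
  for (const settlement of settlements) {
    add(settlement.from, settlement.amountMinor); // paying a debt moves the payer toward zero
    add(settlement.to, -settlement.amountMinor);  // and the recipient is owed less
  }
  return net;
}

const balances = deriveBalances(
  [{
    paidBy: "Alice",
    amountMinor: 9000,
    splits: [
      { member: "Alice", shareMinor: 3000 },
      { member: "Bob", shareMinor: 3000 },
      { member: "Carol", shareMinor: 3000 },
    ],
  }],
  [{ from: "Bob", to: "Alice", amountMinor: 3000 }],
);
const total = Object.keys(balances).reduce((sum, member) => sum + balances[member], 0);
console.log(balances["Alice"], balances["Bob"], balances["Carol"], total); // 3000 0 -3000 0
```

&lt;p&gt;The zero total depends on INV-1: if the splits of an expense did not add up to its amount, the payer's credit and the members' debits would no longer cancel.&lt;/p&gt;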

&lt;p&gt;Here's the schema those invariants protect. Groups and membership:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fs1nken5ijldhlfnab7jm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fs1nken5ijldhlfnab7jm.png" alt="Groups and membership schema" width="800" height="716"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The append-only ledger records expenses, splits, and settlements. No UPDATE ever runs on these tables. Corrections are new reversing rows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fguddh1izqqz2l7hr0kwd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fguddh1izqqz2l7hr0kwd.png" alt="Expense ledger schema" width="800" height="539"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Auth tables are persisted to Aurora, so sessions survive Vercel cold starts:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ff6e1s3ly1u44s7372axk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ff6e1s3ly1u44s7372axk.png" alt="Auth schema" width="800" height="698"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why integers everywhere
&lt;/h2&gt;

&lt;p&gt;IEEE 754 can't represent 0.1 exactly. In JavaScript, &lt;code&gt;0.1 + 0.2&lt;/code&gt; comes back as &lt;code&gt;0.30000000000000004&lt;/code&gt;, and money arithmetic drifts the same way. The fix isn't smarter rounding; it's never using floats.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$10.00&lt;/code&gt; is stored as &lt;code&gt;1000&lt;/code&gt; cents. The database column is &lt;code&gt;BIGINT&lt;/code&gt;. The TypeScript type is &lt;code&gt;number&lt;/code&gt; (safe for integers up to 2^53 - 1). Formatting back to major units happens exactly once, at the display layer, and nowhere else. Any field with money in it ends in &lt;code&gt;Minor&lt;/code&gt;: &lt;code&gt;amountMinor&lt;/code&gt;, &lt;code&gt;shareMinor&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The percentage split problem is subtler. &lt;code&gt;Math.round(amount * pct / 100)&lt;/code&gt; per person produces totals that are off by 2 minor units on certain inputs. The actual fix: floor every share, sum the floors, hand out the shortfall one unit at a time to whoever had the largest fractional part. The property test for this ran 100 iterations. It hasn't failed once.&lt;/p&gt;
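
&lt;p&gt;That approach is usually called the largest remainder method. A minimal sketch, assuming whole-number percentages that sum to 100 (the function name, that assumption, and the tie-breaking rule are mine, not taken from LedgerLoop):&lt;/p&gt;

```typescript
// Largest remainder split in integer minor units.
// Assumes `percents` are whole numbers summing to 100; ties go to the earlier participant.
function splitByPercentage(amountMinor: number, percents: number[]): number[] {
  const totalPct = percents.reduce((a, b) => a + b, 0);
  if (totalPct !== 100) throw new RangeError("percentages must sum to 100");

  // Integer math only: amount * pct = 100 * share + leftover, with leftover in 0..99.
  const parts = percents.map((pct, index) => {
    const scaled = amountMinor * pct;
    return { index, share: Math.floor(scaled / 100), leftover: scaled % 100 };
  });

  // Flooring can only lose units, never add them.
  const shortfall = amountMinor - parts.reduce((sum, p) => sum + p.share, 0);

  // Hand out the shortfall one unit at a time to the largest fractional parts.
  const byLeftover = parts
    .slice()
    .sort((a, b) => b.leftover - a.leftover || a.index - b.index);
  byLeftover.slice(0, shortfall).forEach((p) => {
    p.share += 1;
  });

  return parts.map((p) => p.share);
}

console.log(splitByPercentage(50, [33, 33, 34]).join(" + ")); // 17 + 16 + 17
```

&lt;p&gt;For the same input, rounding each share on its own gives 17 + 17 + 17 = 51, one unit more than the 50 being split.&lt;/p&gt;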

&lt;h2&gt;
  
  
  Testing OCC without a live database
&lt;/h2&gt;

&lt;p&gt;138 tests. 24 files. Under 25 seconds. No database, no network.&lt;/p&gt;

&lt;p&gt;The in-memory fake has an &lt;code&gt;injectOccConflict(n)&lt;/code&gt; method. Pass &lt;code&gt;n = 2&lt;/code&gt; and the next two writes throw &lt;code&gt;40001&lt;/code&gt; before touching the state. That's how the retry path gets exercised, including the full backoff sequence, without a live database.&lt;/p&gt;
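
&lt;p&gt;A stripped-down sketch of such a fake, to show the mechanism. The class and its single &lt;code&gt;insert&lt;/code&gt; method are hypothetical stand-ins; the real fake implements the full persistence interface. The bare retry loop at the end is only there to exercise it.&lt;/p&gt;

```typescript
// Hypothetical fake; only injectOccConflict(n) is named in the text above.
interface LedgerRow { id: string; amountMinor: number; }

class InMemoryLedgerFake {
  private rows: LedgerRow[] = [];
  private pendingConflicts = 0;

  // The next `n` writes fail with a serialization error before touching state.
  injectOccConflict(n: number): void {
    this.pendingConflicts = n;
  }

  insert(row: LedgerRow): void {
    if (this.pendingConflicts > 0) {
      this.pendingConflicts -= 1;
      throw Object.assign(new Error("serialization failure"), { code: "40001" });
    }
    this.rows.push(row);
  }

  count(): number {
    return this.rows.length;
  }
}

const fake = new InMemoryLedgerFake();
fake.injectOccConflict(2);

// Retry until the write lands; anything other than 40001 is rethrown.
let attempts = 0;
while (true) {
  attempts += 1;
  try {
    fake.insert({ id: "exp-1", amountMinor: 1000 });
    break;
  } catch (err) {
    if ((err as { code?: string }).code !== "40001") throw err;
  }
}
console.log(attempts, fake.count()); // 3 1
```

&lt;p&gt;Because the conflict is thrown before the row is pushed, a test can also assert that failed attempts left the ledger untouched.&lt;/p&gt;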

&lt;p&gt;The persistence layer sits behind an interface. The entire test suite runs against the fake. The real Aurora adapter and the fake both implement the same interface, so swapping them is one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/lib/persistence-factory.ts&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getPersistence&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nx"&gt;Persistence&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AURORA_HOST&lt;/span&gt;
    &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AuroraPersistence&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;InMemoryPersistence&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsuletetes%2Fledgerloop%2Fmain%2Fdocs%2Fledgerloop_request_flow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsuletetes%2Fledgerloop%2Fmain%2Fdocs%2Fledgerloop_request_flow.png" alt="addExpense write path" width="799" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One Next.js deployment. One Aurora database. No cross-service coordination. Concurrency correctness is Aurora's job, not the application's.&lt;/p&gt;

&lt;p&gt;Every write goes through four steps left to right: auth check → split calculation → OCC retry wrapper → Aurora atomic transaction. The red dashed arrow is the &lt;code&gt;SQLSTATE 40001&lt;/code&gt; retry loop. When Aurora raises it on a conflict, &lt;code&gt;withOccRetry&lt;/code&gt; backs off and retries, and the second attempt usually lands cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Registration&lt;/strong&gt;: how a new account gets created and persisted to Aurora:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F49m5t42q7b42ooa4nqb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F49m5t42q7b42ooa4nqb8.png" alt="Registration flow" width="800" height="516"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sign-in and session lookup&lt;/strong&gt;: the warm path hits the in-memory store, and a cold start falls back to Aurora:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F3kyrchd5qn7m0tc7pp0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F3kyrchd5qn7m0tc7pp0g.png" alt="Sign-in and session flow" width="799" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/suletete/LedgerLoop
&lt;span class="nb"&gt;cd &lt;/span&gt;LedgerLoop
npm &lt;span class="nb"&gt;install
&lt;/span&gt;npm &lt;span class="nb"&gt;test&lt;/span&gt;      &lt;span class="c"&gt;# 138 tests, all pass, no database needed&lt;/span&gt;
npm run dev   &lt;span class="c"&gt;# http://localhost:3000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No AWS account. No Docker. The in-memory fake handles everything locally.&lt;/p&gt;

&lt;p&gt;Or go straight to &lt;a href="https://ledgerloop-delta.vercel.app" rel="noopener noreferrer"&gt;ledgerloop-delta.vercel.app&lt;/a&gt; and register. The first request may take 5-10 seconds for an Aurora Serverless cold start. The second is fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;The settle-up flow is half-wired. The UI renders, and the server action exists. I ran out of time connecting them before the deadline. That's the most obvious gap.&lt;/p&gt;

&lt;p&gt;The OCC demo page shows two concurrent writes and lets you watch one retry. It works, but real-time feedback instead of a page reload would be better. That's a WebSocket problem I decided not to introduce mid-hackathon.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thing that stuck with me
&lt;/h2&gt;

&lt;p&gt;I came into this thinking the concurrency problem was a detail I'd handle at the end. It turned out to be the thing everything else bent around. Append-only inserts, no stored balances, no mutable totals to race on: none of those were in the original plan. The OCC model required them.&lt;/p&gt;

&lt;p&gt;The database refusing a conflicting write isn't a retry strategy. It's a contract. The architecture is just what you have to build to honor it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Built for the H0 Hackathon (AWS + Vercel track). #H0Hackathon&lt;/em&gt;&lt;/p&gt;

</description>
      <category>nextjs</category>
      <category>typescript</category>
      <category>aws</category>
      <category>h0hackathon</category>
    </item>
    <item>
      <title>I built an event-driven order system with both ECS and Lambda. Here's why.</title>
      <dc:creator>Suleiman Abdulkadir</dc:creator>
      <pubDate>Tue, 23 Jun 2026 09:00:00 +0000</pubDate>
      <link>https://dev.to/suletete/i-built-an-event-driven-order-system-with-both-ecs-and-lambda-heres-why-fcp</link>
      <guid>https://dev.to/suletete/i-built-an-event-driven-order-system-with-both-ecs-and-lambda-heres-why-fcp</guid>
      <description>&lt;p&gt;Every AWS interview I've done asks some version of the same question: containers or serverless? And every time, the "right answer" is "it depends." Which is true but useless.&lt;/p&gt;

&lt;p&gt;So I built a system that uses both. On purpose. Not as a compromise, but because different parts of the same application have different runtime needs. The API needs consistent latency. Background jobs need to scale to zero. Trying to force both into one compute model is the wrong move.&lt;/p&gt;

&lt;p&gt;This is EventForge. It's an e-commerce order processing platform with event-driven architecture, a Step Functions saga, and about 15 AWS services wired together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsuletetes%2FEventForge%2Fmain%2Fdocs%2Feventforge-architecture.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsuletetes%2FEventForge%2Fmain%2Fdocs%2Feventforge-architecture.png" alt="Full system architecture" width="800" height="942"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The full picture is hard to read at this scale, so I split it into three views.&lt;/p&gt;

&lt;h3&gt;
  
  
  Request flow
&lt;/h3&gt;

&lt;p&gt;A user signs in via Cognito. The React frontend then sends authenticated requests to the ALB, which forwards them to ECS Fargate containers running the Express API. The API reads from and writes to DynamoDB and publishes events to EventBridge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F2xw5rqs992uuxl6kv2i4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F2xw5rqs992uuxl6kv2i4.png" alt="Request Flow" width="800" height="916"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Order workflow
&lt;/h3&gt;

&lt;p&gt;This view shows the Step Functions saga that processes an order through validation, inventory reservation, payment, and confirmation, with compensation paths when something fails.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsuletetes%2FEventForge%2Fmain%2Fdocs%2Feventforge-2-order-workflow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsuletetes%2FEventForge%2Fmain%2Fdocs%2Feventforge-2-order-workflow.png" alt="Order Workflow" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Background processing
&lt;/h3&gt;

&lt;p&gt;After an order completes, EventBridge fans out to SQS queues. Lambda processors handle emails, PDF receipts, and webhooks. Dead letter queues catch failures, and CloudWatch alarms notify on DLQ depth.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fhq63eg0yu4indosp1j07.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fhq63eg0yu4indosp1j07.png" alt="Background Processing" width="800" height="1212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The containers vs. serverless thing
&lt;/h2&gt;

&lt;p&gt;I'll keep this short because it's genuinely simple once you see it:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;ECS Fargate (API)&lt;/th&gt;
&lt;th&gt;Lambda (background)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Response time&lt;/td&gt;
&lt;td&gt;Consistent sub-200ms&lt;/td&gt;
&lt;td&gt;Cold starts add 500ms-3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sustained traffic&lt;/td&gt;
&lt;td&gt;Predictable cost&lt;/td&gt;
&lt;td&gt;Expensive at high RPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Idle periods&lt;/td&gt;
&lt;td&gt;You're paying anyway&lt;/td&gt;
&lt;td&gt;Free (scale to zero)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Burst scaling&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;My API runs on Fargate. Two tasks behind an ALB. Health checks pass in under 100ms because there's no cold start penalty. Users hit the order creation endpoint, get a 201 back in ~150ms, and the system handles the rest asynchronously.&lt;/p&gt;

&lt;p&gt;The "rest" is ten Lambda functions that process emails, generate PDF receipts, deliver webhooks, and run the entire order fulfillment saga. They sit idle most of the time. When an order comes in, they wake up, do their thing, and go back to sleep. I pay nothing when nobody's ordering.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fgrl6da22d2zxz9m4yt5f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fgrl6da22d2zxz9m4yt5f.png" alt="All Lambda functions in the console" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The order saga (the interesting part)
&lt;/h2&gt;

&lt;p&gt;This is where I spent most of my time. An order goes through four steps: validate, reserve inventory, charge payment, confirm. If step 3 (payment) fails after step 2 (inventory) succeeded, you have a problem. Inventory is reserved but the order is dead.&lt;/p&gt;

&lt;p&gt;Step Functions handles this with a saga pattern:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fd1rhpeb898bl6b9hqdlh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fd1rhpeb898bl6b9hqdlh.png" alt="Step Functions workflow" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each step is a separate Lambda. If &lt;code&gt;ChargePayment&lt;/code&gt; throws, the workflow doesn't just fail. It routes to &lt;code&gt;ReleaseInventory&lt;/code&gt; first, which undoes the reservation. Then it calls &lt;code&gt;OrderFailed&lt;/code&gt; to persist the failure status. Only then does the execution terminate.&lt;/p&gt;

&lt;p&gt;I defined the workflow in ASL (Amazon States Language). Each state uses &lt;code&gt;Catch&lt;/code&gt; blocks that route to compensation steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"ChargePayment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${ChargePaymentFunctionArn}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ConfirmOrder"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Catch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ErrorEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"States.ALL"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ReleaseInventory"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The compensation path runs in reverse. Payment failed? Release the reservation. Reservation failed? Nothing to compensate, just mark as failed. It's boring when it works, which is the point.&lt;/p&gt;
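
&lt;p&gt;To make the compensation order concrete, here is a minimal TypeScript sketch of the same idea outside Step Functions. It is not the deployed workflow: the &lt;code&gt;Step&lt;/code&gt; type and &lt;code&gt;runSaga&lt;/code&gt; function are hypothetical names for illustration, the &lt;code&gt;ValidateOrder&lt;/code&gt; and &lt;code&gt;ReserveInventory&lt;/code&gt; step names are assumed, and the steps are plain synchronous callbacks instead of Lambda invocations.&lt;/p&gt;

```typescript
// Hypothetical step shape: an action plus an optional compensating action.
type Step = {
  name: string;
  run: () => void;          // throws on failure
  compensate?: () => void;  // undoes a step that already completed
};

// Runs steps in order. On the first failure, runs the compensations of the
// steps that completed, in reverse order, then reports FAILED.
function runSaga(steps: Step[]): { status: "CONFIRMED" | "FAILED"; log: string[] } {
  const log: string[] = [];
  const completed: Step[] = [];
  for (const step of steps) {
    try {
      step.run();
      log.push(`ok:${step.name}`);
      completed.push(step);
    } catch {
      log.push(`failed:${step.name}`);
      for (const done of completed.reverse()) {
        if (done.compensate) {
          done.compensate();
          log.push(`compensated:${done.name}`);
        }
      }
      return { status: "FAILED", log };
    }
  }
  return { status: "CONFIRMED", log };
}

// Example: payment fails after inventory was reserved.
let reserved = 0;
const result = runSaga([
  { name: "ValidateOrder", run: () => {} },
  { name: "ReserveInventory", run: () => { reserved += 1; }, compensate: () => { reserved -= 1; } },
  { name: "ChargePayment", run: () => { throw new Error("card declined"); } },
  { name: "ConfirmOrder", run: () => {} },
]);
console.log(result.status, result.log, reserved);
```

&lt;p&gt;In the real workflow the same ordering comes from the &lt;code&gt;Catch&lt;/code&gt; routes in the state machine, not from a loop in application code.&lt;/p&gt;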

&lt;h2&gt;
  
  
  EventBridge does the fan-out
&lt;/h2&gt;

&lt;p&gt;When &lt;code&gt;ConfirmOrder&lt;/code&gt; completes, it publishes an &lt;code&gt;order.completed&lt;/code&gt; event to a custom EventBridge bus. One event, three consumers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;SQS queue -&amp;gt; Lambda sends confirmation email (SES)&lt;/li&gt;
&lt;li&gt;SQS queue -&amp;gt; Lambda generates PDF receipt (uploads to S3)&lt;/li&gt;
&lt;li&gt;SQS queue -&amp;gt; Lambda delivers to registered webhook URLs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fqrer7lehzg8027je36k5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fqrer7lehzg8027je36k5.png" alt="EventBridge bus" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each queue has a dead letter queue. Each DLQ has a CloudWatch alarm. If messages start piling up in the DLQ, something is broken and I want to know.&lt;/p&gt;

&lt;p&gt;The PDF processor generates a minimal valid PDF (no library dependencies, just raw PDF syntax) and uploads it to S3 under &lt;code&gt;receipts/{orderId}.pdf&lt;/code&gt;. The orders API exposes a presigned URL endpoint so users can download their receipt.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fxopr67d46cbvs0n2si7a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fxopr67d46cbvs0n2si7a.png" alt="Receipts stored in S3" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F3qltexodkltg4nvygent.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F3qltexodkltg4nvygent.png" alt="Generated PDF receipt" width="800" height="865"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;External systems can also push events in via API Gateway. There's a separate HTTP API with a &lt;code&gt;POST /webhooks/ingest&lt;/code&gt; route that validates the payload and publishes to EventBridge. This is how third party services would feed events into the system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fsktul2fubtwmk82t5bw6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fsktul2fubtwmk82t5bw6.png" alt="API Gateway for webhook ingestion" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The processors unwrap the EventBridge envelope (the SQS body is the full EventBridge event, not just the detail), extract the order data, and do their thing. I wasted two hours on this during deployment. The processors kept crashing and the DLQs were filling up. Turned out EventBridge wraps your payload in an envelope with &lt;code&gt;version&lt;/code&gt;, &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;source&lt;/code&gt;, &lt;code&gt;detail-type&lt;/code&gt; fields, and the actual data is nested inside &lt;code&gt;detail&lt;/code&gt;. My code was doing &lt;code&gt;JSON.parse(body)&lt;/code&gt; and treating the result as the order directly. Everything was &lt;code&gt;undefined&lt;/code&gt;.&lt;/p&gt;
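
&lt;p&gt;A minimal sketch of the unwrapping step, assuming a hypothetical &lt;code&gt;OrderDetail&lt;/code&gt; shape and a made-up &lt;code&gt;source&lt;/code&gt; value (the real payload fields are not shown in this post). The point is only that the order sits under &lt;code&gt;detail&lt;/code&gt;, not at the top level of the parsed SQS body.&lt;/p&gt;

```typescript
// Hypothetical order shape; the real detail payload is an assumption here.
type OrderDetail = { orderId: string; total: number };

// Subset of the EventBridge envelope fields mentioned above.
type EventBridgeEnvelope = {
  version: string;
  id: string;
  source: string;
  "detail-type": string;
  detail: OrderDetail;
};

// The SQS record body is the whole EventBridge event as a JSON string,
// so the order has to be read from `detail`.
function extractOrder(sqsBody: string): OrderDetail {
  const envelope = JSON.parse(sqsBody) as { detail?: OrderDetail };
  if (envelope.detail === undefined) {
    throw new Error("SQS body is not an EventBridge envelope (no detail field)");
  }
  return envelope.detail;
}

// Example envelope, shaped like what EventBridge delivers to SQS.
const event: EventBridgeEnvelope = {
  version: "0",
  id: "example-id",
  source: "eventforge.orders", // hypothetical source name
  "detail-type": "order.completed",
  detail: { orderId: "ord-123", total: 42.5 },
};
const body = JSON.stringify(event);

const order = extractOrder(body);
console.log(order.orderId);
```

&lt;p&gt;Throwing on a missing &lt;code&gt;detail&lt;/code&gt; field makes a malformed message fail loudly and land in the DLQ, instead of producing an order full of &lt;code&gt;undefined&lt;/code&gt; values.&lt;/p&gt;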

&lt;h2&gt;
  
  
  The API layer
&lt;/h2&gt;

&lt;p&gt;TypeScript, Express, running in a Docker container on Fargate. Standard stuff. The parts worth mentioning:&lt;/p&gt;

&lt;p&gt;JWT validation against Cognito (JWKS endpoint with key caching), an event publisher that retries transient EventBridge failures with exponential backoff, and a DynamoDB single-table design where orders, events, and webhook registrations all live in one table with composite keys.&lt;/p&gt;
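
&lt;p&gt;The retry idea can be sketched roughly like this. It is an illustration, not the publisher from the repo: &lt;code&gt;backoffDelayMs&lt;/code&gt; and &lt;code&gt;retryWithBackoff&lt;/code&gt; are hypothetical names, the 100ms base and 5s cap are assumed values, and the loop is synchronous with an injected &lt;code&gt;wait&lt;/code&gt; callback to keep the example short. A real publisher would await the SDK call and a timer instead.&lt;/p&gt;

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt: number, baseMs = 100, capMs = 5000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Calls `attemptPublish` until it succeeds, the error is not transient,
// or `maxAttempts` is reached. Returns the number of attempts used.
function retryWithBackoff(
  attemptPublish: () => void,              // throws on failure
  isTransient: (err: unknown) => boolean,  // decides whether a retry makes sense
  wait: (ms: number) => void,              // injected so the sketch stays synchronous
  maxAttempts = 4,
): number {
  for (let attempt = 0; ; attempt++) {
    try {
      attemptPublish();
      return attempt + 1;
    } catch (err) {
      const lastAttempt = attempt + 1 >= maxAttempts;
      if (lastAttempt || !isTransient(err)) throw err;
      wait(backoffDelayMs(attempt));
    }
  }
}

// Example: the first two calls fail with a transient error, the third succeeds.
let calls = 0;
const waits: number[] = [];
const attempts = retryWithBackoff(
  () => {
    calls += 1;
    if (calls === 1 || calls === 2) throw new Error("ThrottlingException");
  },
  () => true,                      // treat every error as transient in this demo
  (ms) => { waits.push(ms); },     // record the delays instead of sleeping
);
console.log(attempts, waits);
```

&lt;p&gt;Capping the delay keeps a long outage from stretching a single request indefinitely. Adding random jitter to each delay is a common refinement that this sketch leaves out.&lt;/p&gt;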

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4tvpgba0d153ipkfoqh3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4tvpgba0d153ipkfoqh3.png" alt="Cognito user pool" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fyii1yaczmyivnm0v4pfz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fyii1yaczmyivnm0v4pfz.png" alt="ALB serving the API" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Docker image lives in ECR. The GitHub Actions pipeline pushes a new image on every merge to main, and ECS picks it up on the next deployment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4r0kykkgp25w615gxjt8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4r0kykkgp25w615gxjt8.png" alt="ECR repository" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The frontend is React on S3 with static website hosting. It polls &lt;code&gt;/api/events&lt;/code&gt; and &lt;code&gt;/api/orders&lt;/code&gt; every 10 seconds. There's a form to create orders and a section to register webhook URLs. Nothing fancy, but it proves the whole pipeline works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F8jgadb4wx1xcbv5ffdwh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F8jgadb4wx1xcbv5ffdwh.png" alt="Dashboard on S3" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Infrastructure as code (all of it)
&lt;/h2&gt;

&lt;p&gt;One &lt;code&gt;template.yaml&lt;/code&gt; at the root. Nested stacks for each layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPC (2 AZs, public/private subnets, NAT gateways)&lt;/li&gt;
&lt;li&gt;DynamoDB&lt;/li&gt;
&lt;li&gt;SQS queues + DLQs&lt;/li&gt;
&lt;li&gt;Cognito&lt;/li&gt;
&lt;li&gt;IAM roles (least privilege per service)&lt;/li&gt;
&lt;li&gt;ECS cluster + service + ALB&lt;/li&gt;
&lt;li&gt;EventBridge bus + rules&lt;/li&gt;
&lt;li&gt;Lambda functions&lt;/li&gt;
&lt;li&gt;API Gateway (for external webhook ingestion)&lt;/li&gt;
&lt;li&gt;CloudWatch alarms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4ruhcvoztjacbmugo3dc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4ruhcvoztjacbmugo3dc.png" alt="CloudFormation stacks" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sam package&lt;/code&gt; and &lt;code&gt;sam deploy&lt;/code&gt;. Two commands to go from code to running infrastructure. The Lambda code is pre-bundled with esbuild into self-contained files because SAM can't resolve npm workspace symlinks (this took me a while to figure out). I wrote a small script (&lt;code&gt;scripts/bundle-lambdas.js&lt;/code&gt;) that creates ten individual bundles, each with all dependencies inlined except the AWS SDK (provided by the runtime).&lt;/p&gt;

&lt;h2&gt;
  
  
  The deployment pipeline
&lt;/h2&gt;

&lt;p&gt;GitHub Actions. Push to main and it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Builds TypeScript&lt;/li&gt;
&lt;li&gt;Bundles Lambdas with esbuild&lt;/li&gt;
&lt;li&gt;Builds and pushes the Docker image to ECR&lt;/li&gt;
&lt;li&gt;Packages and deploys with SAM&lt;/li&gt;
&lt;li&gt;Reads the new Cognito pool ID and ALB DNS from stack outputs&lt;/li&gt;
&lt;li&gt;Rebuilds the frontend with those values baked in&lt;/li&gt;
&lt;li&gt;Syncs to S3&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The whole thing takes about 8 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;343 tests. 19 of them are property-based (fast-check). Those generate 100 random inputs per test and verify invariants like "for any valid order request, the system always produces exactly one pending record and one event" and "for any webhook registration, the URL hash is deterministic."&lt;/p&gt;

&lt;p&gt;The property-based tests caught two bugs that unit tests missed: an edge case in the order validator where a price of exactly 0.00 passed validation (it shouldn't), and a race condition in the idempotency check where two identical requests within the same millisecond could both succeed.&lt;/p&gt;
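
&lt;p&gt;For readers who haven't used property-based testing, here is a dependency-free sketch of the idea behind the price bug. It does not use fast-check (which adds smarter generators and shrinking), and &lt;code&gt;isValidPrice&lt;/code&gt;, &lt;code&gt;makeRng&lt;/code&gt;, and &lt;code&gt;checkProperty&lt;/code&gt; are hypothetical names, not code from the repo.&lt;/p&gt;

```typescript
// Hypothetical validator: a price must be finite and strictly above zero.
function isValidPrice(price: number): boolean {
  if (!isFinite(price)) return false;
  return price > 0; // `>= 0` here would reproduce the 0.00 bug
}

// Small deterministic pseudo-random generator so runs are repeatable.
function makeRng(seed: number): () => number {
  let state = seed;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // value in [0, 1)
  };
}

// Checks a property against the edge cases first, then `runs` random inputs.
// Returns every input for which the property did not hold.
function checkProperty(
  gen: (rand: () => number) => number,
  edgeCases: number[],
  property: (value: number) => boolean,
  runs = 100,
): number[] {
  const rand = makeRng(42);
  const inputs = edgeCases.slice();
  while (runs > inputs.length - edgeCases.length) inputs.push(gen(rand));
  return inputs.filter((value) => !property(value));
}

// Property: no price at or below zero is ever accepted.
const counterexamples = checkProperty(
  (rand) => -rand() * 1000,   // random prices in (-1000, 0]
  [0, -0.01],                 // boundary values a generator should always try
  (price) => !isValidPrice(price),
);
console.log(counterexamples.length);
```

&lt;p&gt;The boundary values matter more than the random ones here: a validator that compares with &lt;code&gt;&amp;gt;= 0&lt;/code&gt; passes almost every random input and fails only on exactly zero.&lt;/p&gt;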

&lt;h2&gt;
  
  
  What it costs
&lt;/h2&gt;

&lt;p&gt;About $35/month with two Fargate tasks running. Most of that is the ALB ($16/month regardless of traffic) and Fargate compute ($18). Everything else (Lambda, DynamoDB, SQS, EventBridge) falls under free tier at low traffic.&lt;/p&gt;

&lt;p&gt;If you're showing this off for 30 minutes and then tearing it down, it costs about $0.50.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stuff I'd do differently
&lt;/h2&gt;

&lt;p&gt;The NAT gateways are expensive for a demo. I'd use VPC endpoints for DynamoDB and EventBridge instead, which drops the monthly cost significantly. I kept the NAT gateways because ECS tasks in private subnets need them to pull images from ECR, but there's an ECR VPC endpoint that solves that too.&lt;/p&gt;

&lt;p&gt;SES is still in sandbox mode, so emails only go to verified addresses. For a real production system you'd request production access.&lt;/p&gt;

&lt;p&gt;The frontend is HTTP-only (S3 static hosting). A real deployment would put CloudFront in front for HTTPS. I tried it during development but hit a circular dependency between the OAC and the bucket policy, so I dropped it and went with direct S3 hosting. Works fine for a demo.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/suletetes/EventForge" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>typescript</category>
      <category>architecture</category>
    </item>
    <item>
      <title>I built a pipeline that rolls itself back when production breaks</title>
      <dc:creator>Suleiman Abdulkadir</dc:creator>
      <pubDate>Fri, 05 Jun 2026 22:04:55 +0000</pubDate>
      <link>https://dev.to/suletete/i-built-a-pipeline-that-rolls-itself-back-when-production-breaks-30f6</link>
      <guid>https://dev.to/suletete/i-built-a-pipeline-that-rolls-itself-back-when-production-breaks-30f6</guid>
      <description>&lt;p&gt;Deployments that break silently at night bother me. By the time someone checks Slack in the morning, users have been hitting 502s for hours. I built ShipGuard because I wanted the infrastructure to fix itself before I even knew something was wrong.&lt;/p&gt;

&lt;p&gt;It's a CodePipeline that does blue/green deployment with canary traffic shifting to EC2. If the new version starts returning 5xx errors, CodeDeploy shifts traffic back to the old version and kills the broken instances. I don't have to do anything.&lt;/p&gt;

&lt;p&gt;Three CloudFormation templates. Everything in source control. Nothing configured by hand.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pipeline flow
&lt;/h2&gt;

&lt;p&gt;Push to main. Everything after that is automatic:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pull source from GitHub&lt;/li&gt;
&lt;li&gt;npm audit, Trivy, git-secrets run first. High or critical vuln? Build dies.&lt;/li&gt;
&lt;li&gt;Tests run, TypeScript compiles, artifact gets packaged&lt;/li&gt;
&lt;li&gt;Deploy to staging (one instance, in-place)&lt;/li&gt;
&lt;li&gt;Email lands asking me to approve&lt;/li&gt;
&lt;li&gt;I approve. Blue/green starts on production.&lt;/li&gt;
&lt;li&gt;10% of traffic routes to the new version for 5 minutes&lt;/li&gt;
&lt;li&gt;CloudWatch alarm stays quiet? Remaining traffic shifts over.&lt;/li&gt;
&lt;li&gt;Old instances terminated. Done.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the alarm fires during steps 7 or 8, traffic goes 100% back to the previous version. Green instances die. I get an email explaining which alarm triggered the rollback.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfxzsxsnq8nj4o7i0ln7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfxzsxsnq8nj4o7i0ln7.png" alt="ShipGuard Architecture" width="800" height="671"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Things that bit me
&lt;/h2&gt;

&lt;h3&gt;
  
  
  TimeBasedCanary doesn't work for EC2
&lt;/h3&gt;

&lt;p&gt;I spent an afternoon trying to configure &lt;code&gt;TimeBasedCanary&lt;/code&gt; in a custom deployment config. CloudFormation accepted the template at lint time and then failed at deploy time with "Traffic routing configuration should be null for Server deployment configuration."&lt;/p&gt;

&lt;p&gt;Turns out canary percentage configs only exist for ECS and Lambda. For EC2, the ALB and target group weights handle traffic shifting, not a CodeDeploy config. Nowhere in the docs does it say this clearly; you just find out when it breaks.&lt;/p&gt;

&lt;h3&gt;
  
  
  IAM role chain from hell
&lt;/h3&gt;

&lt;p&gt;Four roles. They all need to trust different AWS services, and they all need slightly different permissions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pipeline role needs &lt;code&gt;iam:PassRole&lt;/code&gt; to hand off to CodeDeploy&lt;/li&gt;
&lt;li&gt;CodeBuild role needs S3 access to the artifact bucket plus CloudWatch Logs&lt;/li&gt;
&lt;li&gt;CodeDeploy role needs EC2, ASG, ALB, S3&lt;/li&gt;
&lt;li&gt;EC2 instance profile needs to pull from S3 and push CloudWatch metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Miss one permission and you get "Access Denied" with no indication of which call failed or which role is the problem. I iterated on this more times than I'd like to admit.&lt;/p&gt;

&lt;h3&gt;
  
  
  You need an AMI in your Launch Template
&lt;/h3&gt;

&lt;p&gt;This one's embarrassing. cfn-lint doesn't catch a missing &lt;code&gt;ImageId&lt;/code&gt; in a Launch Template. CloudFormation doesn't catch it either, until the ASG actually tries to spin up an instance and fails. The fix is an SSM parameter that resolves to the latest Amazon Linux 2023 AMI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;LatestAmiId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::SSM::Parameter::Value&amp;lt;AWS::EC2::Image::Id&amp;gt;&lt;/span&gt;
  &lt;span class="na"&gt;Default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  CodeStarNotifications has different tag syntax
&lt;/h3&gt;

&lt;p&gt;I deployed the pipeline stack three times before figuring this out. Most resources in CloudFormation take Tags as a list of key/value pairs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Name&lt;/span&gt;
    &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-rule&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;AWS::CodeStarNotifications::NotificationRule&lt;/code&gt; wants a map:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-rule&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CloudFormation gives you "Properties validation failed" with no hint about what's wrong. I found the answer buried in a GitHub issue after 20 minutes of searching.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security scanning
&lt;/h2&gt;

&lt;p&gt;Three scans run before tests. If any exit non-zero, the build stops, and nothing reaches staging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;pre_build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm audit --audit-level=high&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;trivy fs --severity HIGH,CRITICAL --exit-code 1 .&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;git secrets --scan&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No third-party service needed. npm audit is already there, Trivy downloads in a few seconds during install, and git-secrets is an AWS open source tool that's one clone away. The pipeline sends a notification identifying which scan killed the build.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rollback alarm
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Production5xxAlarm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::CloudWatch::Alarm&lt;/span&gt;
  &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;MetricName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTPCode_Target_5XX_Count&lt;/span&gt;
    &lt;span class="na"&gt;Namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS/ApplicationELB&lt;/span&gt;
    &lt;span class="na"&gt;Statistic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Sum&lt;/span&gt;
    &lt;span class="na"&gt;Period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
    &lt;span class="na"&gt;EvaluationPeriods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;Threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;ComparisonOperator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GreaterThanThreshold&lt;/span&gt;
    &lt;span class="na"&gt;TreatMissingData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;notBreaching&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More than 10 5xx errors in 60 seconds trigger the alarm (the comparison is strict, so exactly 10 does not). CodeDeploy has &lt;code&gt;DEPLOYMENT_STOP_ON_ALARM&lt;/code&gt; in its rollback config, so it catches the alarm and reverses the deployment.&lt;/p&gt;

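&lt;p&gt;&lt;code&gt;TreatMissingData: notBreaching&lt;/code&gt; is worth noting. During periods with zero traffic (nights, weekends) the ALB emits no 5xx datapoints at all, and CloudWatch's default for missing data is &lt;code&gt;missing&lt;/code&gt;, which drops the alarm into &lt;code&gt;INSUFFICIENT_DATA&lt;/code&gt; instead of &lt;code&gt;OK&lt;/code&gt;. With &lt;code&gt;notBreaching&lt;/code&gt;, an empty period counts as healthy. Getting this wrong cost me a false rollback the first time I tested this on a weekend.&lt;/p&gt;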
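
&lt;p&gt;To make the threshold and missing-data behaviour concrete, here is a rough JavaScript sketch of how one evaluation period of the alarm above resolves. It is a simplified model, not CloudWatch's real evaluator: &lt;code&gt;evaluatePeriod&lt;/code&gt; and its arguments are hypothetical names, and it only covers the single-period &lt;code&gt;Sum&lt;/code&gt; case from the template.&lt;/p&gt;

```javascript
// Simplified model of one evaluation period for the alarm above.
// Assumption: covers only Statistic=Sum with EvaluationPeriods=1;
// the real CloudWatch evaluator handles many more cases.
const THRESHOLD = 10;

function evaluatePeriod(errorCounts, treatMissingData) {
  // No datapoints at all: the ALB reported nothing for this minute.
  if (errorCounts.length === 0) {
    if (treatMissingData === "notBreaching") return "OK";
    if (treatMissingData === "breaching") return "ALARM";
    if (treatMissingData === "missing") return "INSUFFICIENT_DATA";
    throw new Error(`mode not covered by this sketch: ${treatMissingData}`);
  }

  const sum = errorCounts.reduce((total, count) => total + count, 0);
  // GreaterThanThreshold is strict: a sum of exactly 10 does not fire.
  return sum > THRESHOLD ? "ALARM" : "OK";
}

console.log(evaluatePeriod([4, 6], "notBreaching")); // "OK" (sum is 10)
console.log(evaluatePeriod([4, 7], "notBreaching")); // "ALARM" (sum is 11)
console.log(evaluatePeriod([], "notBreaching")); // "OK" (quiet period)
```

&lt;p&gt;The last call is the quiet-weekend case: no traffic, no datapoints, and the alarm still resolves to &lt;code&gt;OK&lt;/code&gt;.&lt;/p&gt;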

&lt;h2&gt;
  
  
  What I'd change next time
&lt;/h2&gt;

&lt;p&gt;I'd probably use ECS Fargate. CodeDeploy's blue/green for ECS actually supports &lt;code&gt;TimeBasedCanary&lt;/code&gt; properly, so you can do true 10% -&amp;gt; 50% -&amp;gt; 100% shifts with observation windows between each step. EC2 blue/green is coarser. You get "new instances with traffic control," but the fine-grained percentage steps aren't natively there.&lt;/p&gt;

&lt;p&gt;I stuck with EC2 because I wanted to learn how the instance-level deployment mechanics work. Worth it for the education. Probably not what I'd pick for a real production system in 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  Repo
&lt;/h2&gt;

&lt;p&gt;Public: &lt;a href="https://github.com/suletetes/ShipGuard" rel="noopener noreferrer"&gt;github.com/suletetes/ShipGuard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three templates, one buildspec, one appspec, four shell scripts. Deploy staging first, then production, then pipeline. Push code. Pipeline picks it up.&lt;/p&gt;

&lt;p&gt;Costs about $45/month with everything running. ALBs are most of that ($16 each). Tear down staging when you're not testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;CodePipeline, CodeBuild, CodeDeploy&lt;/li&gt;
&lt;li&gt;EC2 Auto Scaling behind ALBs&lt;/li&gt;
&lt;li&gt;CloudWatch alarms&lt;/li&gt;
&lt;li&gt;SNS&lt;/li&gt;
&lt;li&gt;S3&lt;/li&gt;
&lt;li&gt;CloudFormation&lt;/li&gt;
&lt;li&gt;TypeScript/Express (the app being deployed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've done the "deploy a Lambda" tutorials and want something closer to what production infrastructure actually looks like, this hits the right problems. Cross-stack references, IAM chains, blue/green mechanics, alarm-driven automation. The stuff that takes four tries to get right and nobody warns you about in advance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The deployed infrastructure
&lt;/h2&gt;

&lt;p&gt;Here's what the stacks actually create in AWS:&lt;/p&gt;

&lt;h3&gt;
  
  
  CloudFormation stacks
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1yp9825xciq8uvk9o48b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1yp9825xciq8uvk9o48b.png" alt="CloudFormation Stacks" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  VPC subnets (multi-AZ)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmrz07hdfdd202pxb6ot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmrz07hdfdd202pxb6ot.png" alt="Subnets" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Security groups
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3yqh4v380kw3z76jg7yy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3yqh4v380kw3z76jg7yy.png" alt="Security Groups" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2y79a23dydqu1o139q0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2y79a23dydqu1o139q0.png" alt="Security Group Rules" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Load balancers
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjoiyz1eiz4o1flpikzrx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjoiyz1eiz4o1flpikzrx.png" alt="ALB" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Target groups (blue and green)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkd24ci62gik06scc5f84.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkd24ci62gik06scc5f84.png" alt="Target Groups" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Auto Scaling groups
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0jn4i5rpvnvpzgxr1o3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0jn4i5rpvnvpzgxr1o3.png" alt="ASG" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  EC2 instances
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtyvx446c3lkx3t2lbd5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtyvx446c3lkx3t2lbd5.png" alt="Instances" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  S3 artifact bucket
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwsia67c89xs7a0a75v9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwsia67c89xs7a0a75v9.png" alt="S3" width="799" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  SNS topics
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3igc3blm3qjrul7jtop.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3igc3blm3qjrul7jtop.png" alt="SNS" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>cicd</category>
      <category>cloudformation</category>
    </item>
    <item>
      <title>Migrating a MERN app to AWS serverless (and what broke)</title>
      <dc:creator>Suleiman Abdulkadir</dc:creator>
      <pubDate>Thu, 28 May 2026 21:02:34 +0000</pubDate>
      <link>https://dev.to/suletete/migrating-a-mern-app-to-aws-serverless-and-what-broke-318m</link>
      <guid>https://dev.to/suletete/migrating-a-mern-app-to-aws-serverless-and-what-broke-318m</guid>
      <description>&lt;p&gt;I built Taskly about a year ago. Standard MERN stack, ran on a $10/month VPS with PM2 and nginx. It worked fine. Nobody was complaining.&lt;/p&gt;

&lt;p&gt;I migrated it to AWS serverless anyway. Partly to learn, partly because I was mass-applying to DevOps roles and needed something real to talk about in interviews. "I deployed a hello world Lambda" doesn't cut it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The app
&lt;/h2&gt;

&lt;p&gt;Task management for small teams. Tasks, projects, teams, calendar, notifications, avatar uploads, productivity stats. About 15 API routes, 6 Mongoose models. React frontend with Context API. Nothing fancy, but enough moving parts that the migration wasn't trivial.&lt;/p&gt;

&lt;p&gt;Original stack: Express, session auth, MongoDB, Cloudinary, Resend for emails.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I ended up
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft0qx27ok2e0qztf2lh4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft0qx27ok2e0qztf2lh4x.png" alt=" " width="800" height="835"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Request flow: users hit CloudFront for the React app, WAF-filtered API Gateway for the backend. Lambda runs Express via serverless-express, talks to DocumentDB in a private VPC, pushes events to EventBridge, and queues emails through SQS to SES.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ypo0lfdj43xttj7vq88.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ypo0lfdj43xttj7vq88.png" alt=" " width="800" height="804"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Network layer: VPC with public and private subnets across two AZs. Security groups restrict DocumentDB to Lambda only (port 27017). VPC endpoints for Secrets Manager. NAT gateway for outbound.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8w79g2vyopdwteuyps2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8w79g2vyopdwteuyps2.png" alt=" " width="800" height="612"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Deployment: GitHub Actions authenticates via OIDC (no stored keys), packages Lambda, does canary traffic shifting, monitors error rate, auto-rolls back if anything breaks.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;VPC with private subnets. Security groups. NAT gateway. Secrets Manager. 12 Terraform modules, about 2000 lines of HCL. GitHub Actions CI/CD with OIDC auth and canary deployments.&lt;/p&gt;

&lt;p&gt;Took about 3 weeks of evenings.&lt;/p&gt;
&lt;h2&gt;
  
  
  Sessions broke immediately
&lt;/h2&gt;

&lt;p&gt;First thing. My Express app used express-session with a MongoDB store. Lambda can hand each request to a different instance. Sessions were just gone.&lt;/p&gt;

&lt;p&gt;I ended up with dual-mode auth: sessions for local dev (easy to debug, familiar), Cognito JWTs for production (stateless, works with Lambda). The middleware checks which environment it's running in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;COGNITO_USER_POOL_ID&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;COGNITO_CLIENT_ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;validateCognitoToken&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isAuthenticated&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;401&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({...});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not elegant. Works.&lt;/p&gt;

&lt;h2&gt;
  
  
  DocumentDB is almost MongoDB
&lt;/h2&gt;

&lt;p&gt;95% compatible. The other 5% shows up at the worst times.&lt;/p&gt;

&lt;p&gt;Some aggregation stages behave differently. The connection needs a TLS certificate bundle you have to download from AWS and ship with your Lambda zip. I only caught these because I tested against the actual cluster, not just local Mongo.&lt;/p&gt;

&lt;p&gt;If I'd only tested locally, these would have been production bugs discovered at 2 am.&lt;/p&gt;

&lt;h2&gt;
  
  
  Terraform got out of hand fast
&lt;/h2&gt;

&lt;p&gt;Started with one main.tf. Lasted a day.&lt;/p&gt;

&lt;p&gt;Split into modules: vpc, lambda, iam, s3, documentdb, waf, ses, api-gateway, cloudfront, secrets, monitoring, disaster-recovery. State in S3 with DynamoDB locking.&lt;/p&gt;

&lt;p&gt;The thing about Terraform: &lt;code&gt;plan&lt;/code&gt; says "2 to add, 0 to destroy" and you feel safe. Then &lt;code&gt;apply&lt;/code&gt; takes 15 minutes because NAT gateways are slow. And if it fails halfway, you get to learn about &lt;code&gt;terraform state rm&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security groups
&lt;/h2&gt;

&lt;p&gt;The mental model that clicked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lambda SG: egress all (needs DocumentDB, NAT, VPC endpoints)&lt;/li&gt;
&lt;li&gt;DocumentDB SG: ingress port 27017 from Lambda SG only&lt;/li&gt;
&lt;li&gt;VPC Endpoints SG: ingress port 443 from Lambda SG only&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three groups. Database unreachable from internet. Lambda can reach database. Done.&lt;/p&gt;

&lt;h2&gt;
  
  
  Canary deploys
&lt;/h2&gt;

&lt;p&gt;The CI/CD pipeline packages the code, uploads to S3, publishes a new Lambda version, shifts 10% of traffic to it, waits 5 minutes watching CloudWatch error metrics, and either promotes to 100% or rolls back.&lt;/p&gt;

&lt;p&gt;Saved me twice. Once from a missing env var, once from a dependency that worked locally but not in the Lambda runtime.&lt;/p&gt;
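
&lt;p&gt;The shift, watch, then promote-or-roll-back flow can be sketched roughly like this. It is an illustration under assumptions, not the actual workflow from the repo: &lt;code&gt;runCanary&lt;/code&gt;, &lt;code&gt;shiftTraffic&lt;/code&gt;, &lt;code&gt;getErrorRate&lt;/code&gt; and &lt;code&gt;sleep&lt;/code&gt; are hypothetical names, and the AWS calls are injected as plain functions so the logic runs without credentials.&lt;/p&gt;

```javascript
// Sketch of the canary decision loop. Every function passed in via
// `deps` is a hypothetical stand-in for a real AWS call.
async function runCanary(deps, options) {
  const { shiftTraffic, getErrorRate, sleep } = deps;
  const { canaryPercent, checks, intervalMs, maxErrorRate } = options;

  await shiftTraffic(canaryPercent); // e.g. 10% to the new version

  for (let remaining = checks; remaining > 0; remaining -= 1) {
    await sleep(intervalMs);
    const errorRate = await getErrorRate();
    if (errorRate > maxErrorRate) {
      await shiftTraffic(0); // roll back: all traffic to the old version
      return { promoted: false, errorRate };
    }
  }

  await shiftTraffic(100); // observation window passed cleanly
  return { promoted: true };
}

// Demo with fakes: the third metric reading is bad, so it rolls back.
const shifts = [];
const readings = [0.001, 0.002, 0.2];
runCanary(
  {
    shiftTraffic: async (percent) => { shifts.push(percent); },
    getErrorRate: async () => readings.shift(),
    sleep: async () => {}, // skip real waiting in the demo
  },
  { canaryPercent: 10, checks: 5, intervalMs: 60000, maxErrorRate: 0.01 }
).then((result) => console.log(result, shifts)); // rolled back; shifts is [10, 0]
```

&lt;p&gt;Five checks one minute apart give the 5-minute window described above. Injecting the AWS calls also makes the rollback path easy to exercise in a unit test.&lt;/p&gt;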

&lt;h2&gt;
  
  
  What I'd change
&lt;/h2&gt;

&lt;p&gt;Skip the VPC for Lambda if possible. The ENI attachment adds cold start latency, and NAT gateways cost $32/month each. DocumentDB forces you into a VPC though, so I was stuck.&lt;/p&gt;

&lt;p&gt;Write smaller Terraform modules. My IAM module has 8 policies in one file. Should be separate.&lt;/p&gt;

&lt;p&gt;Set up CI/CD first, not last. I did manual deploys for weeks. Dumb.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Old VPS: $10/month&lt;/li&gt;
&lt;li&gt;AWS serverless: ~$45/month (mostly NAT gateway and DocumentDB)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More expensive. But I actually understand VPCs, IAM, security groups, and Terraform now. That's worth more than $35/month to me.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/suletetes/taskly" rel="noopener noreferrer"&gt;github.com/suletetes/taskly&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Infrastructure in &lt;code&gt;infrastructure/&lt;/code&gt;, Lambda handler in &lt;code&gt;backend/lambda/handler.js&lt;/code&gt;, CI/CD in &lt;code&gt;.github/workflows/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you're doing something similar, start with VPC and DocumentDB. They take the longest to provision and have the most surprises. Get those working, then add Lambda and API Gateway on top.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>node</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I Built a SaaS Platform From Scratch. Here's How I Architected It on AWS.</title>
      <dc:creator>Suleiman Abdulkadir</dc:creator>
      <pubDate>Tue, 28 Apr 2026 00:03:29 +0000</pubDate>
      <link>https://dev.to/suletete/i-built-a-saas-platform-from-scratch-heres-how-i-architected-it-on-aws-5li</link>
      <guid>https://dev.to/suletete/i-built-a-saas-platform-from-scratch-heres-how-i-architected-it-on-aws-5li</guid>
      <description>&lt;p&gt;So I've been working on something for a while now. It's called TechVerse. It's a SaaS e-commerce platform, and I built the whole thing from the ground up using the MERN stack.&lt;/p&gt;

&lt;p&gt;I want to talk about the cloud architecture side of things. Not the textbook version. The real version. The one where you're staring at your screen at 2am trying to figure out why your WebSocket connections keep dropping, or why your API response times just spiked to 4 seconds.&lt;/p&gt;

&lt;p&gt;Let me walk you through it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem I Was Trying to Solve
&lt;/h2&gt;

&lt;p&gt;Here's the situation. There are thousands of tech retailers in Nigeria who sell laptops, phones, accessories, all kinds of stuff. Most of them run their entire business from a physical shop. No website. No online store. Nothing.&lt;/p&gt;

&lt;p&gt;Why? Because getting a custom e-commerce site built costs a fortune. And the international platforms? They charge in dollars. That's a dealbreaker when your local currency fluctuates every other week.&lt;/p&gt;

&lt;p&gt;So I thought, what if I built a SaaS platform that lets these businesses spin up a professional online store for a fraction of the cost? Pricing in local currency. Optimized for local internet speeds. Local payment gateways baked right in.&lt;/p&gt;

&lt;p&gt;That was the idea. Now I had to figure out how to actually build and deploy it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing the Stack
&lt;/h2&gt;

&lt;p&gt;I went with what I know best. MongoDB, Express, React, Node. The classic MERN stack. But I made some specific choices that matter for production.&lt;/p&gt;

&lt;p&gt;React 19 with Vite 7 on the frontend. Vite is ridiculously fast for builds. The dev experience alone is worth it, but more importantly, the production bundles are tiny. That matters when your users are on 3G connections.&lt;/p&gt;

&lt;p&gt;Node.js 20 with Express on the backend. Nothing fancy here. It works. It scales. The ecosystem is massive. I added Socket.io for real-time features like order notifications and live inventory updates.&lt;/p&gt;

&lt;p&gt;MongoDB Atlas for the database. I considered self-hosting on EC2, but honestly, managed databases save you so much headache. Automated backups, point-in-time recovery, monitoring. All handled. I went with an M10 cluster to start.&lt;/p&gt;

&lt;p&gt;Redis through ElastiCache for caching and session management. This was a game changer for performance. More on that later.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture (And Why Each Piece Exists)
&lt;/h2&gt;

&lt;p&gt;Alright, let's break down the actual AWS setup. I'll explain why I chose each service, not just what it does.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Entry Point: Route 53 and CloudFront
&lt;/h3&gt;

&lt;p&gt;Every request starts at Route 53 for DNS resolution. Simple enough. But the real magic is CloudFront.&lt;/p&gt;

&lt;p&gt;CloudFront is AWS's CDN, and it has edge locations in Lagos. That's huge. It means my users in Nigeria are hitting a server that's physically close to them, not one sitting in Ireland or Virginia.&lt;/p&gt;

&lt;p&gt;I configured CloudFront to do two things. Static file requests go to an S3 bucket where my React build lives. API requests get forwarded to my backend through the Application Load Balancer. One domain, two destinations. Clean and simple.&lt;/p&gt;

&lt;p&gt;I also attached an ACM certificate here. Free SSL. No reason not to use it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Frontend: S3
&lt;/h3&gt;

&lt;p&gt;The React app gets built by Vite, and the output goes straight into an S3 bucket. No servers involved. S3 serves static files incredibly well, and combined with CloudFront caching, my frontend loads in under 2 seconds on a 3G connection.&lt;/p&gt;

&lt;p&gt;I set up error page redirects so that 404s go back to index.html. That's essential for single-page apps with client-side routing. Without it, refreshing any page that isn't the root would give you a blank screen.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Backend: EC2 with Auto Scaling
&lt;/h3&gt;

&lt;p&gt;Here's where it gets interesting. My Node.js API runs on EC2 instances inside a public subnet. I have two instances behind an Application Load Balancer, with an Auto Scaling Group that can spin up to four instances based on CPU utilization.&lt;/p&gt;

&lt;p&gt;Why not Fargate or Lambda? Honestly, for a WebSocket-heavy application, EC2 gives you more control. Lambda has cold starts that would kill the real-time experience. Fargate is great but adds complexity I didn't need yet. EC2 with a good Auto Scaling policy hits the sweet spot.&lt;/p&gt;

&lt;p&gt;The ALB distributes traffic evenly and handles health checks. If one instance goes down, traffic automatically routes to the healthy ones. No manual intervention needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Data Layer: MongoDB Atlas and ElastiCache
&lt;/h3&gt;

&lt;p&gt;MongoDB Atlas sits in a private subnet. It's peered with my VPC, so the connection is fast and secure. No public internet involved.&lt;/p&gt;

&lt;p&gt;ElastiCache Redis handles three things for me. Session storage, so users stay logged in across multiple EC2 instances. Response caching, so repeated database queries don't hit MongoDB every time. And rate limiting, so I can throttle abusive requests without adding load to my application servers.&lt;/p&gt;

&lt;p&gt;Before I added Redis, my average API response time was around 400ms. After? Under 200ms. That's the kind of improvement that users actually feel.&lt;/p&gt;
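
&lt;p&gt;For the rate-limiting piece, a fixed-window counter is about the simplest approach that works. The sketch below is an assumption about the general shape, not TechVerse's actual code: &lt;code&gt;createRateLimiter&lt;/code&gt; is a hypothetical name, and an in-memory &lt;code&gt;Map&lt;/code&gt; stands in for Redis so the example runs on its own. Against a real cluster the same idea is usually an &lt;code&gt;INCR&lt;/code&gt; on a per-client key plus an &lt;code&gt;EXPIRE&lt;/code&gt; when the counter is first created.&lt;/p&gt;

```javascript
// Fixed-window rate limiter sketch. A Map stands in for Redis here so
// the example is self-contained; the clock is injectable for testing.
function createRateLimiter({ limit, windowMs, now = Date.now }) {
  const windows = new Map(); // clientId -> { count, resetAt }

  return function isAllowed(clientId) {
    const current = now();
    const entry = windows.get(clientId);

    // First request, or the previous window has expired: start fresh.
    if (entry === undefined || current >= entry.resetAt) {
      windows.set(clientId, { count: 1, resetAt: current + windowMs });
      return true;
    }

    if (entry.count >= limit) return false; // budget used up
    entry.count += 1;
    return true;
  };
}

// Demo: 2 requests per second, driven by a fake clock.
let clock = 0;
const isAllowed = createRateLimiter({ limit: 2, windowMs: 1000, now: () => clock });

console.log(isAllowed("shop-42")); // true
console.log(isAllowed("shop-42")); // true
console.log(isAllowed("shop-42")); // false (third call in the window)
clock = 1000;
console.log(isAllowed("shop-42")); // true (new window)
```

&lt;p&gt;Keeping the counters in Redis rather than in process memory is what makes the limit hold across several EC2 instances.&lt;/p&gt;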

&lt;h3&gt;
  
  
  Monitoring and Email: CloudWatch and SES
&lt;/h3&gt;

&lt;p&gt;CloudWatch collects logs and metrics from everything. EC2 instances, the load balancer, Redis, all of it. I set up alarms for CPU spikes, memory usage, and error rates. If something breaks at 3am, I get a notification.&lt;/p&gt;

&lt;p&gt;Amazon SES handles transactional emails. Order confirmations, password resets, shipping updates. It's cheap and reliable. Way better than trying to manage your own SMTP server.&lt;/p&gt;

&lt;h3&gt;
  
  
  Backups
&lt;/h3&gt;

&lt;p&gt;Everything gets backed up to S3. MongoDB Atlas handles its own backups, but I also dump snapshots to S3 for extra safety. CloudWatch logs go there too. Storage is cheap. Losing data is not.&lt;/p&gt;

&lt;h2&gt;
  
  
  The CI/CD Pipeline
&lt;/h2&gt;

&lt;p&gt;This part I'm actually proud of. GitHub Actions handles everything.&lt;/p&gt;

&lt;p&gt;When I push to the main branch, here's what happens. The pipeline runs tests. If they pass, it builds the React frontend and syncs it to S3. Then it deploys the backend to EC2 through the load balancer. Zero downtime. The whole process takes about 4 minutes.&lt;/p&gt;

&lt;p&gt;I also have separate workflows for staging and production. Feature branches deploy to staging automatically. Production requires a manual approval step. That one extra click has saved me from shipping broken code more than once.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stripe Integration
&lt;/h2&gt;

&lt;p&gt;Payments go through Stripe. The integration is bidirectional. My EC2 instances send payment requests to Stripe's API, and Stripe sends webhook events back for things like successful charges, refunds, and subscription updates.&lt;/p&gt;

&lt;p&gt;I handle webhooks on a dedicated endpoint with signature verification. Never trust incoming data without verifying it. That's a lesson you only need to learn once.&lt;/p&gt;
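
&lt;p&gt;For anyone curious what signature verification involves, here is a minimal sketch of the HMAC scheme Stripe documents: the &lt;code&gt;Stripe-Signature&lt;/code&gt; header carries a timestamp &lt;code&gt;t&lt;/code&gt; and a signature &lt;code&gt;v1&lt;/code&gt;, computed over the timestamp, a dot, and the raw request body. &lt;code&gt;verifySignature&lt;/code&gt; and the demo secret are made up for illustration. In a real app the official SDK's &lt;code&gt;stripe.webhooks.constructEvent()&lt;/code&gt; does this for you, and that is what I'd reach for.&lt;/p&gt;

```javascript
const crypto = require("node:crypto");

// Sketch of HMAC-SHA256 webhook verification. Assumes a header shaped
// like "t=1700000000,v1=hexsignature" and a signed payload of
// `${t}.${rawBody}`. Simplified: real headers may carry several entries.
function verifySignature(rawBody, header, secret, toleranceSec, nowSec) {
  const parts = Object.fromEntries(
    header.split(",").map((pair) => pair.split("="))
  );
  if (!parts.t || !parts.v1) return false;

  // Reject stale events to limit replay attacks.
  if (Math.abs(nowSec - Number(parts.t)) > toleranceSec) return false;

  const expected = crypto
    .createHmac("sha256", secret)
    .update(`${parts.t}.${rawBody}`)
    .digest();
  const given = Buffer.from(parts.v1, "hex");

  // timingSafeEqual throws on a length mismatch, so check that first.
  if (given.length !== expected.length) return false;
  return crypto.timingSafeEqual(given, expected);
}

// Demo: sign a payload with a made-up secret, then verify it.
const secret = "whsec_example_only";
const body = JSON.stringify({ type: "charge.succeeded" });
const timestamp = 1700000000;
const signature = crypto
  .createHmac("sha256", secret)
  .update(`${timestamp}.${body}`)
  .digest("hex");
const header = `t=${timestamp},v1=${signature}`;

console.log(verifySignature(body, header, secret, 300, timestamp + 10)); // true
console.log(verifySignature(body + " ", header, secret, 300, timestamp + 10)); // false
```

&lt;p&gt;Two details matter in practice: the check has to run against the raw body, before any JSON parsing middleware touches it, and the comparison should be constant-time.&lt;/p&gt;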

&lt;h2&gt;
  
  
  What This Actually Costs
&lt;/h2&gt;

&lt;p&gt;Here's the part everyone wants to know. My monthly AWS bill for this setup is roughly $100 to $110. That breaks down to about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EC2 instances (t3.small): $25-30&lt;/li&gt;
&lt;li&gt;MongoDB Atlas (M10): $57&lt;/li&gt;
&lt;li&gt;ElastiCache Redis (t3.micro): $12&lt;/li&gt;
&lt;li&gt;S3 and CloudFront: $5-10&lt;/li&gt;
&lt;li&gt;Route 53 and misc: $2-3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a production SaaS platform with auto-scaling, CDN, managed database, caching, monitoring, and automated deployments, that's pretty reasonable. It can comfortably handle hundreds of concurrent users and thousands of daily requests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons I Learned the Hard Way
&lt;/h2&gt;

&lt;p&gt;Let me share a few things that bit me during this process.&lt;/p&gt;

&lt;p&gt;WebSocket connections through CloudFront need specific configuration. You have to set up the right cache behaviors and forward the Upgrade header. I spent an entire weekend debugging why Socket.io worked locally but not in production. The fix was three lines of CloudFront config.&lt;/p&gt;

&lt;p&gt;Don't skip the VPC design. I initially put everything in a public subnet because it was easier. Then I realized my database was exposed to the internet. Moved it to a private subnet immediately. Take the time to set up your network properly from day one.&lt;/p&gt;

&lt;p&gt;Redis connection pooling matters. My first implementation created a new Redis connection for every request. Under load, I was hitting connection limits within minutes. Connection pooling fixed it instantly.&lt;/p&gt;
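
&lt;p&gt;The smallest version of that fix is plain connection reuse: create the client once and hand the same one to every request. That is a single shared client rather than a full pool, which is a swap I'm making to keep the example short. &lt;code&gt;createSharedClient&lt;/code&gt; and &lt;code&gt;connect&lt;/code&gt; are hypothetical names, and &lt;code&gt;connect&lt;/code&gt; stands in for whatever async factory your Redis library provides.&lt;/p&gt;

```javascript
// Lazy singleton sketch: build the client once and reuse it.
// `connect` is a hypothetical async factory for a Redis client.
function createSharedClient(connect) {
  let clientPromise = null;

  return function getClient() {
    if (clientPromise === null) {
      // Cache the promise, not the resolved client, so concurrent first
      // callers still share a single connection attempt.
      clientPromise = connect().catch((error) => {
        clientPromise = null; // allow a retry after a failed connect
        throw error;
      });
    }
    return clientPromise;
  };
}

// Demo with a fake factory that counts how often it is called.
let connections = 0;
const getClient = createSharedClient(async () => {
  connections += 1;
  return { id: connections };
});

Promise.all([getClient(), getClient(), getClient()]).then((clients) => {
  console.log(connections); // 1
  console.log(clients[0] === clients[2]); // true
});
```

&lt;p&gt;Three "requests" arrive at once and the factory still runs a single time, which is the property that keeps you under the connection limit.&lt;/p&gt;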

&lt;p&gt;Auto Scaling needs a cooldown period. Without it, your instances will scale up and down like a yo-yo. I set a 5-minute cooldown, and the scaling became smooth and predictable.&lt;/p&gt;

&lt;p&gt;Environment variables are not optional. I had a brief moment where I accidentally committed a JWT secret to GitHub. Rotated it immediately and moved everything to AWS Systems Manager Parameter Store. Use it. It's free.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The architecture I have now works well for the current stage. But I'm already thinking about what comes next as the platform grows.&lt;/p&gt;

&lt;p&gt;I want to add a message queue. Probably SQS. Right now, some of my background tasks like sending emails and processing images run synchronously. That's fine with 50 users. It won't be fine with 500.&lt;/p&gt;

&lt;p&gt;I'm also looking at moving to containers eventually. ECS with Fargate would give me better resource utilization and simpler deployments. But that's a migration I'll do when the current setup starts showing strain, not before.&lt;/p&gt;

&lt;p&gt;And I need better observability. CloudWatch is good for basics, but I want distributed tracing. Probably AWS X-Ray or something like Datadog. When you have multiple services talking to each other, you need to see the full picture of a request's journey.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Building a SaaS platform is one thing. Making it production-ready on AWS is a completely different challenge. It forces you to think about things you never consider during development. Network security. Scaling behavior. Cost optimization. Disaster recovery.&lt;/p&gt;

&lt;p&gt;But here's what I've realized. You don't need a perfect architecture on day one. You need one that works, that you understand, and that you can evolve. Start simple. Add complexity only when you have a real reason to.&lt;/p&gt;

&lt;p&gt;If you want to dig into the code, the full project is on GitHub: &lt;a href="https://github.com/suletetes/TechVerse" rel="noopener noreferrer"&gt;TechVerse on GitHub&lt;/a&gt;&lt;/p&gt;




</description>
      <category>aws</category>
      <category>webdev</category>
      <category>cloud</category>
      <category>javascript</category>
    </item>
  </channel>
</rss>
