DEV Community: Suleiman Abdulkadir

I built a self-healing web app on AWS and watched it recover from failure in real time

Suleiman Abdulkadir — Wed, 01 Jul 2026 09:00:00 +0000

I actually wanted to understand AWS networking. Not "I followed a tutorial, and it worked" understand. More like "I can explain why this NAT Gateway exists and what breaks if I delete it" understand.

So I built CloudPulse. It's a TypeScript app that monitors its own infrastructure and displays the health of every instance on a dashboard. The interesting part: when you kill an instance, the system detects the failure and replaces it automatically while users never notice anything went wrong.

No Terraform. No CloudFormation. Raw AWS CLI calls in bash scripts, each one commented so I'd remember what it does in six months.

How it's wired together

Internet traffic hits an Application Load Balancer sitting in public subnets. The ALB forwards requests to EC2 instances in private subnets on port 3000. Those instances have no public IP; they can't be reached directly from the internet at all. When they need to talk to AWS APIs (publishing CloudWatch metrics, describing their own ASG), they go through NAT Gateways.

There's one NAT Gateway per availability zone. If the one in AZ-1 dies, only the instance in AZ-1 loses outbound connectivity. The instance in AZ-2 keeps working through its own NAT Gateway. That's the point of having two.

The ALB checks /health every 30 seconds. Three consecutive failures and the instance gets pulled from the target group. The Auto Scaling Group notices the instance is unhealthy, terminates it, and launches a fresh one. No human involved.
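The counting behind that threshold is easy to sanity-check with a small simulation. This is only a sketch of the logic, not how the ALB is implemented: `HealthTracker` and its methods are made-up names, and the only real values are the 30-second interval and the threshold of 3 from the setup above.

```typescript
// Hypothetical model of an unhealthy-threshold counter (not an AWS API).
class HealthTracker {
  private consecutiveFailures = 0;
  private readonly unhealthyThreshold: number;
  private readonly intervalSeconds: number;

  constructor(unhealthyThreshold: number, intervalSeconds: number) {
    this.unhealthyThreshold = unhealthyThreshold;
    this.intervalSeconds = intervalSeconds;
  }

  // Record one check result; returns true once the target counts as unhealthy.
  record(passed: boolean): boolean {
    this.consecutiveFailures = passed ? 0 : this.consecutiveFailures + 1;
    return this.consecutiveFailures >= this.unhealthyThreshold;
  }

  // Upper bound on seconds from a failure to detection, ignoring check timeouts.
  secondsToDetect(): number {
    return this.unhealthyThreshold * this.intervalSeconds;
  }
}

const tracker = new HealthTracker(3, 30);
// A single passing check in the middle resets the failure streak.
const verdicts = [false, true, false, false, false].map((ok) => tracker.record(ok));
console.log(verdicts);                  // [false, false, false, false, true]
console.log(tracker.secondsToDetect()); // 90
```

The point of requiring consecutive failures is that one slow response doesn't get an otherwise healthy instance replaced.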

The part I actually wanted to see: killing an instance

This is why I built the whole thing. I wanted to watch a system heal itself.

I ran aws ec2 terminate-instances on one of the two running instances. Then I sat there watching the dashboard refresh every 30 seconds.

Within about a minute, the terminated instance showed up as unhealthy. The ASG launched a replacement. The new instance booted Amazon Linux 2023, pulled my app from S3, installed dependencies, started the Node process, and began responding to the ALB health checks.

Total recovery time: under 2 minutes. And during those 2 minutes, the ALB was sending all traffic to the surviving instance. Nobody waiting for a page load would have noticed anything.

That's the thing about self-healing infrastructure. It's boring when it works. You kill something, wait a bit, and everything is back to normal. But getting to that boring place required wiring up health checks, ASG policies, target group settings, and IAM permissions correctly. The boring outcome is the proof that the wiring works.

Auto scaling under load

I connected to one of the instances via SSM Session Manager (no SSH keys anywhere in this setup) and ran stress --cpu 4 --timeout 180. This pegged the CPU at 100% for 3 minutes.

CloudWatch saw the CPUUtilization metric exceed 70% for 2 consecutive 60-second periods. The scale-out alarm fired, and the ASG added a third instance. When the stress test ended and CPU dropped below 30% for 2 minutes, the scale-in alarm fired, and the ASG removed the extra instance.

The scaling policies have a 300-second cooldown so they don't thrash back and forth.
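To see how the thresholds and the cooldown interact, here is a rough simulation. It is a sketch under assumptions, not the ASG's real behavior: `simulate` and `ScalingConfig` are invented names, and the group size limits (minimum 2, maximum 3) are my guess from the description above.

```typescript
// Hypothetical simulation of two CPU alarms plus a cooldown (not an AWS API).
interface ScalingConfig {
  highCpu: number;         // scale out when CPU stays above this
  lowCpu: number;          // scale in when CPU stays below this
  periodsRequired: number; // consecutive periods needed to trigger
  periodSeconds: number;
  cooldownSeconds: number;
  minSize: number;         // assumed
  maxSize: number;         // assumed
}

// Returns the group size at the end of each period.
function simulate(cpuPerPeriod: number[], cfg: ScalingConfig): number[] {
  let size = cfg.minSize;
  let lastActionAt = -Infinity;
  const sizes: number[] = [];

  cpuPerPeriod.forEach((_, i) => {
    const now = (i + 1) * cfg.periodSeconds;
    const recent = cpuPerPeriod.slice(Math.max(0, i + 1 - cfg.periodsRequired), i + 1);
    const enoughData = recent.length === cfg.periodsRequired;
    const coolingDown = now - lastActionAt < cfg.cooldownSeconds;

    if (enoughData && !coolingDown) {
      if (recent.every((c) => c > cfg.highCpu) && size < cfg.maxSize) {
        size += 1;
        lastActionAt = now;
      } else if (recent.every((c) => c < cfg.lowCpu) && size > cfg.minSize) {
        size -= 1;
        lastActionAt = now;
      }
    }
    sizes.push(size);
  });
  return sizes;
}

const cfg: ScalingConfig = {
  highCpu: 70, lowCpu: 30, periodsRequired: 2,
  periodSeconds: 60, cooldownSeconds: 300, minSize: 2, maxSize: 3,
};
// Three minutes of load, then idle.
const sizes = simulate([20, 100, 100, 100, 20, 20, 20, 20, 20], cfg);
console.log(sizes); // [2, 2, 3, 3, 3, 3, 3, 2, 2]
```

In this model the scale-in waits for the cooldown to expire even though CPU has been low for several periods, which is the anti-thrashing behavior the cooldown is there for.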

The instances themselves

Both run t3.micro (free tier eligible, sort of; you get 750 hours/month, but 2 instances burn 1440 hours). Private subnets, no public IP, no SSH key pair. I access them through Systems Manager Session Manager when I need to poke around.

The IAM role attached to the instances allows exactly four things: publish CloudWatch metrics, describe EC2 instances, describe ASG state, and use SSM for shell access. Nothing else.

One-command deployment

bash deploy.sh runs five scripts in order:

iam.sh creates the role and instance profile
vpc.sh builds the entire network (this takes ~3 minutes because NAT Gateways are slow to provision)
alb.sh creates security groups, the load balancer, target group, and listener
compute.sh creates the launch template and ASG (instances start booting here)
monitoring.sh creates the CloudWatch alarms

At the end, it prints the ALB URL. Wait 3-5 minutes for instances to pass health checks, then open it.

bash teardown.sh deletes everything in reverse order. Takes about 3 minutes. I run it every time I finish a learning session because NAT Gateways cost $2/day just sitting there.

What I used

The app itself is TypeScript on Express. Server-side rendered HTML with EJS (no frontend framework; the dashboard is one page that refreshes every 30 seconds). 101 tests across unit, property-based (fast-check), and integration (supertest).

The infrastructure is pure AWS CLI in bash. Every script sources a shared config file and a common utilities file. Resource IDs get saved to an env file so scripts can reference what previous scripts created.

AWS services in this project: VPC, public/private subnets, Internet Gateway, NAT Gateways, route tables, NACLs, security groups, Application Load Balancer, EC2 via Launch Template, Auto Scaling Group, CloudWatch custom metrics, CloudWatch alarms, IAM roles with instance profiles, EBS gp3 volumes, SSM Session Manager, and S3 for code delivery.

What I learned the hard way

Git Bash on Windows rewrites any path starting with / to a Windows path. My health check path /health became C:/Program Files/Git/health during deployment. Took me a while to figure out why the target group health check was failing. Fix: export MSYS_NO_PATHCONV=1.

ALB network interfaces take 5-10 minutes to fully release after you delete the ALB. If you try to delete the security groups too early, you get "DependencyViolation" errors. The teardown script has to wait.

IAM is eventually consistent. If you create an instance profile and immediately reference it in a launch template, it sometimes fails because the profile hasn't propagated yet. I added a 10-second sleep after IAM operations. Ugly, but it works.

Security groups that reference each other can't be deleted independently. You have to remove the cross-reference rules first, then delete them. The teardown script handles this, but it was a pain to debug the first time.

Source

GitHub: suletetes/AWS-HA-WebApp

I built a group expense app where the database refuses to let balances lie

Suleiman Abdulkadir — Sat, 27 Jun 2026 16:14:37 +0000

Built for the H0 Hackathon AWS + Vercel track.
Live: ledgerloop-delta.vercel.app
Repo: github.com/suletete/LedgerLoop

Three friends in three time zones split a hotel bill. Two of them add expenses from different cities at the same time. Most apps let one of those writes quietly overwrite the other. Balances drift by a cent. Nobody notices until someone gets asked to pay more than they actually owe.

That's not a UI problem. It's a math problem.

The invariant that started everything

In a closed expense group, the sum of all net balances must equal zero. Always. If Alice owes Bob $50 and Bob owes Carol $30, those flows have to cancel out across the whole group. That's not a business requirement someone wrote in a spec. It's arithmetic.

Most apps treat this as an application concern; they lock a row, update a running total, and release the lock. The problem: if two writes race, application-layer optimistic locking can fail silently. You get a wrong number and no error.

I wanted the database to refuse the inconsistency. Don't paper over it. Refuse it.

Aurora PostgreSQL runs with serializable isolation. When two transactions touch the same data at the same instant, one of them gets SQLSTATE 40001, a serialization failure, and the database aborts it. No silent merge. No wrong number. A clean error you can retry from a fresh read.

The rest of the architecture follows from that.

What I built

LedgerLoop is a group expense ledger. Friends split shared costs (rent, trips, dinners), and the app figures out exactly what each person owes, reduced to the fewest possible payments.

The core flow: add an expense, split it (equal, by percentage, or exact amounts), and watch the balances update. If someone tries to settle $600 when they only owe $500, the system rejects it before anything hits the database. Twelve tangled debts collapse to four payments.
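Collapsing tangled debts into a short list of payments can be sketched with a common greedy approach: match the largest debtor against the largest creditor until everyone is at zero. This is not taken from the LedgerLoop repo, the names are illustrative, and greedy matching guarantees at most n - 1 payments for n members rather than a proven minimum in every case.

```typescript
// Illustrative names. Balances are integer minor units (cents):
// positive = the group owes this member, negative = this member owes.
interface Payment { from: string; to: string; amountMinor: number; }

function simplifyDebts(net: Record<string, number>): Payment[] {
  const names = Object.keys(net);
  const total = names.reduce((sum, name) => sum + net[name], 0);
  if (total !== 0) throw new Error("net balances must sum to zero");

  const creditors = names.filter((n) => net[n] > 0)
    .map((name) => ({ name, left: net[name] }))
    .sort((a, b) => b.left - a.left);
  const debtors = names.filter((n) => net[n] < 0)
    .map((name) => ({ name, left: -net[name] }))
    .sort((a, b) => b.left - a.left);

  const payments: Payment[] = [];
  let c = 0;
  let d = 0;
  while (c < creditors.length && d < debtors.length) {
    // Each payment fully clears at least one side, so the loop is short.
    const amountMinor = Math.min(creditors[c].left, debtors[d].left);
    payments.push({ from: debtors[d].name, to: creditors[c].name, amountMinor });
    creditors[c].left -= amountMinor;
    debtors[d].left -= amountMinor;
    if (creditors[c].left === 0) c++;
    if (debtors[d].left === 0) d++;
  }
  return payments;
}

// Alice owes Bob $50 and Bob owes Carol $30, expressed as net positions.
const payments = simplifyDebts({ alice: -5000, bob: 2000, carol: 3000 });
console.log(payments);
// [ { from: "alice", to: "carol", amountMinor: 3000 },
//   { from: "alice", to: "bob", amountMinor: 2000 } ]
```

Working from net positions instead of individual debts is what lets the chain through Bob disappear: Alice pays Carol directly.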

Stack: Next.js 15 (App Router) on Vercel, TypeScript strict mode, Tailwind CSS, Aurora PostgreSQL Serverless v2, Vitest + fast-check.

The concurrency model changed my architecture

Traditional systems: lock the row, update it, release the lock. With OCC, you read a snapshot, compute your change, and commit. If someone else changed the same data while you were computing, you get 40001 and start over from a fresh read.

That sounds like a retry strategy. It's more than that.

Once the model clicked, I stopped storing derived values. A running balance column is a conflict surface. Two writes that touch it will race. So I didn't store it. Balance is computed fresh on every read from the raw ledger: expenses, splits, and settlements. Nothing to corrupt. Nothing to get out of sync.

The ledger is append-only. Expenses and settlements are inserted. Corrections are new reversing rows, not edits. Two inserts with different UUIDs rarely collide. The conflict window gets narrow enough that retries are rare in practice.

Every write goes through withOccRetry: on 40001, exponential backoff with jitter, up to 4 attempts. If all 4 exhaust, a clean error comes back. The ledger is exactly as it was before the first attempt.
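A wrapper like that could look roughly like the sketch below. Only the name withOccRetry, the 40001 code, and the limit of 4 attempts come from the description above. The base delay, the "full jitter" formula, and the injectable `wait` parameter are assumptions made so the example runs without a database.

```typescript
// Sketch only: everything except SQLSTATE 40001 and the 4-attempt cap is assumed.
interface PgLikeError { code?: string; }

const SERIALIZATION_FAILURE = "40001";

// Full jitter: a random delay in [0, baseMs * 2^attempt).
function backoffDelayMs(
  attempt: number,
  baseMs: number = 50,
  random: () => number = Math.random,
): number {
  return Math.floor(random() * baseMs * Math.pow(2, attempt));
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withOccRetry<T>(
  tx: () => Promise<T>,
  maxAttempts: number = 4,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      // tx must start from a fresh read each time it is called.
      return await tx();
    } catch (err) {
      const retryable = (err as PgLikeError | null)?.code === SERIALIZATION_FAILURE;
      // Non-conflict errors and the final failed attempt propagate to the caller.
      if (!retryable || attempt + 1 >= maxAttempts) throw err;
      await wait(backoffDelayMs(attempt));
    }
  }
}
```

Because the ledger is append-only and the failed transaction was aborted by the database, retrying the whole closure is safe: nothing from the failed attempt was committed.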

The architecture didn't lead me to the concurrency model. The concurrency model told me what the architecture had to look like.

The remainder bug property testing found

My first equal-split implementation passed every unit test I wrote.

// Naive: $10.00 / 3 = 333 + 333 + 333 = 999 cents. One cent gone, every time.
const perPerson = Math.floor(amount / n);

The property test caught it on the second run. fast-check threw random amounts and random participant counts at the invariant sum(shares) == amount and found a specific combination I hadn't thought to try.

The fix was two lines:

const base = Math.floor(amount / n);
const remainder = amount - base * n;
// First `remainder` participants get base + 1, the rest get base.

For $10.00 / 3: 334 + 333 + 333 = 1000 cents. Always.
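Filled out into a complete function, the fix might look like this. The function name and the input checks are additions for illustration; the two lines of arithmetic are the ones shown above.

```typescript
// Split an integer amount (minor units) so the shares always sum to the total.
function splitEqual(amountMinor: number, n: number): number[] {
  if (!Number.isInteger(amountMinor) || amountMinor < 0) {
    throw new Error("amountMinor must be a non-negative integer");
  }
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error("n must be a positive integer");
  }
  const base = Math.floor(amountMinor / n);
  const remainder = amountMinor - base * n;
  // First `remainder` participants get base + 1, the rest get base.
  const shares: number[] = [];
  for (let i = 0; i < n; i++) shares.push(i < remainder ? base + 1 : base);
  return shares;
}

console.log(splitEqual(1000, 3)); // [334, 333, 333]
```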

I wrote dozens of unit tests before this. None of them found it. The property test found it on run 2 because it generated the exact amount and participant count combination where naive flooring fails.

That's property testing. You write down what must always be true and let the library go looking for trouble.

Six invariants, each one trying to break itself

#	What must hold	Enforced by
INV-1	`sum(splits) == expense amount`	Split Calculator + atomic transaction
INV-2	`sum(balances) == 0` across a group	Balance Engine derivation
INV-3	No double-counting under concurrency	Aurora SERIALIZABLE + withOccRetry
INV-4	Money is always integer minor units	BIGINT storage, no floats
INV-5	Settlement <= what's owed	Settlement Validator against derived ledger
INV-6	Every row references a real entity	Auth Guard + DB foreign keys

27 property-based tests total. Each one runs 100+ random inputs via fast-check.

INV-2 is the most satisfying. The balance engine reads every expense, split, and settlement for a group and computes each member's net. Positive means the group owes you. Negative means you owe. The sum across the whole group is always zero. That isn't something verified after the fact; it's a consequence of how the formula adds up.
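A derivation along those lines can be sketched in a few lines. The record shapes and the function name here are assumptions, not the repo's types. What matters is that every expense adds the amount to the payer and subtracts the same total across the splits, and every settlement adds and subtracts the same figure, so the group total cannot move away from zero.

```typescript
// Illustrative shapes; all amounts are integer minor units.
interface Split { memberId: string; shareMinor: number; }
interface Expense { paidBy: string; amountMinor: number; splits: Split[]; }
interface Settlement { from: string; to: string; amountMinor: number; }

function deriveBalances(
  members: string[],
  expenses: Expense[],
  settlements: Settlement[],
): Map<string, number> {
  const net = new Map<string, number>();
  members.forEach((m) => net.set(m, 0));
  const add = (id: string, delta: number): void => {
    net.set(id, (net.get(id) ?? 0) + delta);
  };

  expenses.forEach((e) => {
    add(e.paidBy, e.amountMinor);                      // payer fronted the full amount
    e.splits.forEach((s) => add(s.memberId, -s.shareMinor)); // everyone owes their share
  });
  settlements.forEach((s) => {
    add(s.from, s.amountMinor);  // paying down a debt raises your net
    add(s.to, -s.amountMinor);   // receiving a payment lowers what you're owed
  });
  return net;
}

// A $300.00 hotel bill paid by Alice, split three ways; Bob has paid back $40.00.
const balances = deriveBalances(
  ["alice", "bob", "carol"],
  [{
    paidBy: "alice",
    amountMinor: 30000,
    splits: [
      { memberId: "alice", shareMinor: 10000 },
      { memberId: "bob", shareMinor: 10000 },
      { memberId: "carol", shareMinor: 10000 },
    ],
  }],
  [{ from: "bob", to: "alice", amountMinor: 4000 }],
);
console.log(balances.get("alice"), balances.get("bob"), balances.get("carol"));
// 16000 -6000 -10000
```

The zero-sum property here depends on INV-1: if the splits of an expense did not add up to its amount, the group total would drift by exactly that difference.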

Here's the schema those invariants protect. Groups and membership:

The append-only ledger records expenses, splits, and settlements. No UPDATE ever runs on these tables. Corrections are new reversing rows:

Auth tables persisted to Aurora, so sessions survive Vercel cold starts:

Why integers everywhere

IEEE 754 can't represent 0.1 exactly. $10.50 stored as a float might come back as 10.4999...97. The fix isn't smarter rounding; it's never using floats.

$10.00 is stored as 1000 cents. The database column is BIGINT. The TypeScript type is number (safe for integers up to 2^53 - 1). Formatting back to major units happens exactly once, at the display layer, and nowhere else. Any field with money in it ends in Minor: amountMinor, shareMinor.
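A display-layer formatter for that convention can be very small. This is a sketch that assumes a two-decimal currency; `formatMinor` is a made-up name, not a function from the repo.

```typescript
// Hypothetical display helper: integer cents in, "10.50"-style string out.
function formatMinor(amountMinor: number): string {
  if (!Number.isSafeInteger(amountMinor)) {
    throw new Error("amountMinor must be a safe integer");
  }
  const sign = amountMinor < 0 ? "-" : "";
  const abs = Math.abs(amountMinor);
  const major = Math.floor(abs / 100);
  const cents = abs % 100;
  return sign + major + "." + (cents < 10 ? "0" : "") + cents;
}

console.log(0.1 + 0.2 === 0.3);  // false: binary floats can't hold these exactly
console.log(10 + 20 === 30);     // true: integer cents add exactly
console.log(formatMinor(1050));  // "10.50"
console.log(formatMinor(-5));    // "-0.05"
```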

The percentage split problem is subtler. Math.round(amount * pct / 100) per person produces totals that are off by 2 minor units on certain inputs. The actual fix: floor every share, sum the floors, hand out the shortfall one unit at a time to whoever had the largest fractional part. The property test for this ran 100 iterations. It hasn't failed once.
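The floor-then-distribute approach described above is usually called the largest remainder method. A minimal sketch, assuming whole-number percentages that sum to 100 (the function name and the tie-break rule are my own choices):

```typescript
// Largest-remainder split. Assumes integer percentages that sum to 100.
function splitByPercent(amountMinor: number, percents: number[]): number[] {
  if (!Number.isInteger(amountMinor) || amountMinor < 0) {
    throw new Error("amountMinor must be a non-negative integer");
  }
  if (percents.some((p) => !Number.isInteger(p) || p < 0)) {
    throw new Error("percentages must be non-negative integers");
  }
  if (percents.reduce((a, b) => a + b, 0) !== 100) {
    throw new Error("percentages must sum to 100");
  }

  // amount * pct stays an integer, so the floor and the leftover are both exact.
  const parts = percents.map((pct, index) => {
    const scaled = amountMinor * pct;
    return { index, share: Math.floor(scaled / 100), fraction: scaled % 100 };
  });

  let shortfall = amountMinor - parts.reduce((sum, p) => sum + p.share, 0);

  // Largest fractional part first; ties go to the earlier participant.
  const order = parts.slice().sort((a, b) => b.fraction - a.fraction || a.index - b.index);
  for (const p of order) {
    if (shortfall === 0) break;
    p.share += 1; // `order` holds the same objects as `parts`
    shortfall -= 1;
  }
  return parts.map((p) => p.share);
}

// Math.round per share would give 17 + 17 + 17 = 51 for this input.
console.log(splitByPercent(50, [33, 33, 34])); // [17, 16, 17]
```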

Testing OCC without a live database

138 tests. 24 files. Under 25 seconds. No database, no network.

The in-memory fake has an injectOccConflict(n) method. Pass n = 2 and the next two writes throw 40001 before touching the state. That's how the retry path gets exercised, including the full backoff sequence, without a live database.
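The conflict-injection idea fits in a few lines. In this sketch only the method name injectOccConflict comes from the description above. The class names, the `insert` and `count` methods, and the synchronous API are simplifications for illustration.

```typescript
// Sketch of a fake with conflict injection; everything but injectOccConflict is assumed.
class FakeSerializationFailure extends Error {
  readonly code = "40001";
  constructor() {
    super("could not serialize access");
  }
}

class FakeLedger {
  private rows: string[] = [];
  private pendingConflicts = 0;

  // The next n writes will fail with 40001.
  injectOccConflict(n: number): void {
    this.pendingConflicts = n;
  }

  insert(row: string): void {
    if (this.pendingConflicts > 0) {
      this.pendingConflicts -= 1;
      throw new FakeSerializationFailure(); // thrown before any state changes
    }
    this.rows.push(row);
  }

  count(): number {
    return this.rows.length;
  }
}

const fake = new FakeLedger();
fake.injectOccConflict(2);

let attempts = 0;
for (;;) {
  attempts++;
  try {
    fake.insert("expense-1");
    break;
  } catch (e) {
    // Anything other than a serialization failure is a real bug.
    if ((e as { code?: string }).code !== "40001") throw e;
  }
}
console.log(attempts, fake.count()); // 3 1
```

Throwing before the state is touched mirrors what the database does on 40001: the aborted transaction leaves nothing behind, so the retry starts clean.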

The persistence layer sits behind an interface. The entire test suite runs against the fake. The real Aurora adapter and the fake both implement the same interface, so swapping them is one line:

// src/lib/persistence-factory.ts
export function getPersistence(): Persistence {
  return process.env.AURORA_HOST
    ? new AuroraPersistence()
    : new InMemoryPersistence();
}

The architecture

One Next.js deployment. One Aurora database. No cross-service coordination. Concurrency correctness is Aurora's job, not the application's.

Every write goes through four steps left to right: auth check → split calculation → OCC retry wrapper → Aurora atomic transaction. The red dashed arrow is the SQLSTATE 40001 retry loop. Aurora fires it on a conflict, withOccRetry backs off and retries, and the second attempt lands cleanly.

Registration: how a new account gets created and persisted to Aurora:

Sign-in and session lookup (the warm path hits in-memory, a cold start falls back to Aurora):

Try it yourself

git clone https://github.com/suletete/LedgerLoop
cd LedgerLoop
npm install
npm test      # 138 tests, all pass, no database needed
npm run dev   # http://localhost:3000

No AWS account. No Docker. The in-memory fake handles everything locally.

Or go straight to ledgerloop-delta.vercel.app and register. The first request may take 5-10 seconds for an Aurora Serverless cold start. The second is fast.

What I'd do differently

The settle-up flow is half-wired. The UI renders, and the server action exists. I ran out of time connecting them before the deadline. That's the most obvious gap.

The OCC demo page shows two concurrent writes and lets you watch one retry. It works, but real-time feedback instead of a page reload would be better. That's a WebSocket problem I decided not to introduce mid-hackathon.

The thing that stuck with me

I came into this thinking the concurrency problem was a detail I'd handle at the end. It turned out to be the thing everything else bent around. Append-only inserts, no stored balances, no mutable totals to race on, none of those were the original plan. The OCC model required them.

The database refusing a conflicting write isn't a retry strategy. It's a contract. The architecture is just what you have to build to honor it.


I built an event-driven order system with both ECS and Lambda. Here's why.

Suleiman Abdulkadir — Tue, 23 Jun 2026 09:00:00 +0000

Every AWS interview I've done asks some version of the same question: containers or serverless? And every time, the "right answer" is "it depends." Which is true but useless.

So I built a system that uses both. On purpose. Not as a compromise, but because different parts of the same application have different runtime needs. The API needs consistent latency. Background jobs need to scale to zero. Trying to force both into one compute model is the wrong move.

This is EventForge. It's an e-commerce order processing platform with event-driven architecture, a Step Functions saga, and about 15 AWS services wired together.

The full picture is hard to read at this scale, so I split it into three views.

Request flow

A user signs in via Cognito, and the React frontend sends authenticated requests to the ALB, which forwards them to ECS Fargate containers running the Express API. The API reads/writes DynamoDB and publishes events to EventBridge.

Order workflow

The Step Functions saga processes an order through validation, inventory reservation, payment, and confirmation, with compensation paths when something fails.

Background processing

After an order completes, EventBridge fans out to SQS queues. Lambda processors handle emails, PDF receipts, and webhooks. Dead letter queues catch failures, and CloudWatch alarms notify on DLQ depth.

The containers vs. serverless thing

I'll keep this short because it's genuinely simple once you see it:

	ECS Fargate (API)	Lambda (background)
Response time	Consistent sub-200ms	Cold starts add 500ms-3s
Sustained traffic	Predictable cost	Expensive at high RPS
Idle periods	You're paying anyway	Free (scale to zero)
Burst scaling	Minutes	Milliseconds

My API runs on Fargate. Two tasks behind an ALB. Health checks pass in under 100ms because there's no cold start penalty. Users hit the order creation endpoint, get a 201 back in ~150ms, and the system handles the rest asynchronously.

The "rest" is ten Lambda functions that process emails, generate PDF receipts, deliver webhooks, and run the entire order fulfillment saga. They sit idle most of the time. When an order comes in, they wake up, do their thing, and go back to sleep. I pay nothing when nobody's ordering.

The order saga (the interesting part)

This is where I spent most of my time. An order goes through four steps: validate, reserve inventory, charge payment, confirm. If step 3 (payment) fails after step 2 (inventory) succeeded, you have a problem. Inventory is reserved but the order is dead.

Step Functions handles this with a saga pattern:

Each step is a separate Lambda. If ChargePayment throws, the workflow doesn't just fail. It routes to ReleaseInventory first, which undoes the reservation. Then it calls OrderFailed to persist the failure status. Only then does the execution terminate.

I defined the workflow in ASL (Amazon States Language). Each state uses Catch blocks that route to compensation steps:

"ChargePayment": {
  "Type": "Task",
  "Resource": "${ChargePaymentFunctionArn}",
  "Next": "ConfirmOrder",
  "Catch": [{
    "ErrorEquals": ["States.ALL"],
    "Next": "ReleaseInventory"
  }]
}

The compensation path runs in reverse. Payment failed? Release the reservation. Reservation failed? Nothing to compensate, just mark as failed. It's boring when it works, which is the point.
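Outside of Step Functions, the same compensation idea can be shown as a plain in-memory loop. This is a synchronous sketch and not how the state machine executes: ValidateOrder and ReserveInventory are assumed step names, and the log array only exists so the order of events is visible.

```typescript
// In-memory sketch of saga compensation (not Step Functions itself).
interface SagaStep {
  name: string;
  run: () => void;          // throws on failure
  compensate?: () => void;  // undoes a completed run()
}

function runSaga(steps: SagaStep[], log: string[]): "CONFIRMED" | "FAILED" {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.run();
      log.push(step.name);
      completed.push(step);
    } catch (e) {
      log.push(step.name + " failed");
      // Undo completed steps, most recent first.
      for (const done of completed.slice().reverse()) {
        if (done.compensate) done.compensate();
      }
      log.push("OrderFailed");
      return "FAILED";
    }
  }
  return "CONFIRMED";
}

const log: string[] = [];
const outcome = runSaga(
  [
    { name: "ValidateOrder", run: () => {} },
    {
      name: "ReserveInventory",
      run: () => {},
      compensate: () => { log.push("ReleaseInventory"); },
    },
    { name: "ChargePayment", run: () => { throw new Error("card declined"); } },
    { name: "ConfirmOrder", run: () => {} },
  ],
  log,
);
console.log(outcome); // "FAILED"
console.log(log);
// ["ValidateOrder", "ReserveInventory", "ChargePayment failed", "ReleaseInventory", "OrderFailed"]
```

Steps without a compensate function, like validation, are simply skipped on the way back, which matches the "nothing to compensate" case.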

EventBridge does the fan-out

When ConfirmOrder completes, it publishes an order.completed event to a custom EventBridge bus. One event, three consumers:

SQS queue -> Lambda sends confirmation email (SES)
SQS queue -> Lambda generates PDF receipt (uploads to S3)
SQS queue -> Lambda delivers to registered webhook URLs

Each queue has a dead letter queue. Each DLQ has a CloudWatch alarm. If messages start piling up in the DLQ, something is broken and I want to know.

The PDF processor generates a minimal valid PDF (no library dependencies, just raw PDF syntax) and uploads it to S3 under receipts/{orderId}.pdf. The orders API exposes a presigned URL endpoint so users can download their receipt.

External systems can also push events in via API Gateway. There's a separate HTTP API with a POST /webhooks/ingest route that validates the payload and publishes to EventBridge. This is how third party services would feed events into the system.

The processors unwrap the EventBridge envelope (the SQS body is the full EventBridge event, not just the detail), extract the order data, and do their thing. I wasted two hours on this during deployment. The processors kept crashing and the DLQs were filling up. Turned out EventBridge wraps your payload in an envelope with version, id, source, detail-type fields, and the actual data is nested inside detail. My code was doing JSON.parse(body) and treating the result as the order directly. Everything was undefined.
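The unwrapping step is small once the shape is known. In this sketch the envelope fields are the standard EventBridge ones, while `OrderDetail`, its fields, the source name, and the `unwrapOrder` helper are invented for the example.

```typescript
// Envelope fields as EventBridge delivers them; OrderDetail is a made-up payload shape.
interface EventBridgeEnvelope<T> {
  version: string;
  id: string;
  "detail-type": string;
  source: string;
  account: string;
  time: string;
  region: string;
  resources: string[];
  detail: T;
}

interface OrderDetail { orderId: string; totalMinor: number; }

function unwrapOrder(sqsBody: string): OrderDetail {
  const envelope = JSON.parse(sqsBody) as Partial<EventBridgeEnvelope<Partial<OrderDetail>>>;
  const detail = envelope.detail;
  // Fail loudly instead of passing `undefined` fields downstream.
  if (!detail || typeof detail.orderId !== "string" || typeof detail.totalMinor !== "number") {
    throw new Error("SQS body is not an EventBridge envelope with an order in `detail`");
  }
  return { orderId: detail.orderId, totalMinor: detail.totalMinor };
}

const body = JSON.stringify({
  version: "0",
  id: "example-id",
  "detail-type": "order.completed",
  source: "eventforge.orders", // made-up source name
  account: "123456789012",
  time: "2026-06-23T09:00:00Z",
  region: "us-east-1",
  resources: [],
  detail: { orderId: "ord-123", totalMinor: 4999 },
});

console.log(unwrapOrder(body).orderId); // "ord-123"
```

Treating the parsed body as the order directly, as in the bug described above, would hit the validation error here instead of silently producing undefined fields.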

The API layer

TypeScript, Express, running in a Docker container on Fargate. Standard stuff. The parts worth mentioning:

JWT validation against Cognito (JWKS endpoint with key caching), an event publisher that retries transient EventBridge failures with exponential backoff, and a DynamoDB single-table design where orders, events, and webhook registrations all live in one table with composite keys.

The Docker image lives in ECR. The GitHub Actions pipeline pushes a new image on every merge to main, and ECS picks it up on the next deployment.

The frontend is React on S3 with static website hosting. It polls /api/events and /api/orders every 10 seconds. There's a form to create orders and a section to register webhook URLs. Nothing fancy, but it proves the whole pipeline works.

Infrastructure as code (all of it)

One template.yaml at the root. Nested stacks for each layer:

VPC (2 AZs, public/private subnets, NAT gateways)
DynamoDB
SQS queues + DLQs
Cognito
IAM roles (least privilege per service)
ECS cluster + service + ALB
EventBridge bus + rules
Lambda functions
API Gateway (for external webhook ingestion)
CloudWatch alarms

sam package and sam deploy. Two commands to go from code to running infrastructure. The Lambda code is pre-bundled with esbuild into self-contained files because SAM can't resolve npm workspace symlinks (this took me a while to figure out). I wrote a small script (scripts/bundle-lambdas.js) that creates ten individual bundles, each with all dependencies inlined except the AWS SDK (provided by the runtime).

The deployment pipeline

GitHub Actions. Push to main and it:

Builds TypeScript
Bundles Lambdas with esbuild
Builds and pushes the Docker image to ECR
Packages and deploys with SAM
Reads the new Cognito pool ID and ALB DNS from stack outputs
Rebuilds the frontend with those values baked in
Syncs to S3

The whole thing takes about 8 minutes.

Testing

343 tests. 19 of them are property-based (fast-check). Those generate 100 random inputs per test and verify invariants like "for any valid order request, the system always produces exactly one pending record and one event" and "for any webhook registration, the URL hash is deterministic."

The property-based tests caught two bugs that unit tests missed: an edge case in the order validator where a price of exactly 0.00 passed validation (it shouldn't), and a race condition in the idempotency check where two identical requests within the same millisecond could both succeed.

What it costs

About $35/month with two Fargate tasks running. Most of that is the ALB ($16/month regardless of traffic) and Fargate compute ($18). Everything else (Lambda, DynamoDB, SQS, EventBridge) falls under free tier at low traffic.

If you're showing this off for 30 minutes and then tearing it down, it costs about $0.50.

Stuff I'd do differently

The NAT gateways are expensive for a demo. I'd use VPC endpoints for DynamoDB and EventBridge instead, which drops the monthly cost significantly. I kept the NAT gateways because ECS tasks in private subnets need them to pull images from ECR, but there's an ECR VPC endpoint that solves that too.

SES is still in sandbox mode, so emails only go to verified addresses. For a real production system you'd request production access.

The frontend is HTTP-only (S3 static hosting). A real deployment would put CloudFront in front for HTTPS. I tried it during development but hit a circular dependency between the OAC and the bucket policy, so I dropped it and went with direct S3 hosting. Works fine for a demo.

I built a pipeline that rolls itself back when production breaks

Suleiman Abdulkadir — Fri, 05 Jun 2026 22:04:55 +0000

Deployments that break silently at night bother me. By the time someone checks Slack in the morning, users have been hitting 502s for hours. I built ShipGuard because I wanted the infrastructure to fix itself before I even knew something was wrong.

It's a CodePipeline that does blue/green deployment with canary traffic shifting to EC2. If the new version starts returning 5xx errors, CodeDeploy shifts traffic back to the old version and kills the broken instances. I don't have to do anything.

Three CloudFormation templates. Everything in source control. Nothing configured by hand.

The pipeline flow

Push to main. Everything after that is automatic:

1. Pull source from GitHub
2. npm audit, Trivy, git-secrets run first. High or critical vuln? Build dies.
3. Tests run, TypeScript compiles, artifact gets packaged
4. Deploy to staging (one instance, in-place)
5. Email lands asking me to approve
6. I approve. Blue/green starts on production.
7. 10% of traffic routes to the new version for 5 minutes
8. CloudWatch alarm stays quiet? Remaining traffic shifts over.
9. Old instances terminated. Done.

If the alarm fires during steps 7 or 8, traffic goes 100% back to the previous version. Green instances die. I get an email explaining which alarm triggered the rollback.

Things that bit me

TimeBasedCanary doesn't work for EC2

I spent an afternoon trying to configure TimeBasedCanary in a custom deployment config. The template passed linting, then CloudFormation failed at deploy time with "Traffic routing configuration should be null for Server deployment configuration."

Turns out canary percentage configs only exist for ECS and Lambda. For EC2, the ALB and target group weights handle traffic shifting, not a CodeDeploy config. Nowhere in the docs does it say this clearly; you just find out when it breaks.

IAM role chain from hell

Four roles. They all need to trust different AWS services, and they all need slightly different permissions:

Pipeline role needs iam:PassRole to hand off to CodeDeploy
CodeBuild role needs S3 access to the artifact bucket plus CloudWatch Logs
CodeDeploy role needs EC2, ASG, ALB, S3
EC2 instance profile needs to pull from S3 and push CloudWatch metrics

Miss one permission and you get "Access Denied" with no indication of which call failed or which role is the problem. I iterated on this more times than I'd like to admit.

You need an AMI in your Launch Template

This one's embarrassing. cfn-lint doesn't catch a missing ImageId in a Launch Template. CloudFormation doesn't catch it either, until the ASG actually tries to spin up an instance and fails. The fix is an SSM parameter that resolves to the latest Amazon Linux 2023 AMI:

LatestAmiId:
  Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
  Default: /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64

CodeStarNotifications has different tag syntax

I deployed the pipeline stack three times before figuring this out. Most other resources in CloudFormation take Tags as a list:

Tags:
  - Key: Name
    Value: my-rule

AWS::CodeStarNotifications::NotificationRule wants a map:

Tags:
  Name: my-rule

CloudFormation gives you "Properties validation failed" with no hint about what's wrong. I found the answer buried in a GitHub issue after 20 minutes of searching.

Security scanning

Three scans run before tests. If any exit non-zero, the build stops, and nothing reaches staging:

pre_build:
  commands:
    - npm audit --audit-level=high
    - trivy fs --severity HIGH,CRITICAL --exit-code 1 .
    - git secrets --scan

No third-party service needed. npm audit is already there, Trivy downloads in a few seconds during install, and git-secrets is an AWS open source tool that's one clone away. The pipeline sends a notification identifying which scan killed the build.

The rollback alarm

Production5xxAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    MetricName: HTTPCode_Target_5XX_Count
    Namespace: AWS/ApplicationELB
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 10
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching

More than 10 5xx errors in 60 seconds trigger the alarm (the comparison is GreaterThanThreshold, so exactly 10 does not). CodeDeploy has DEPLOYMENT_STOP_ON_ALARM in its rollback config, so it catches the alarm and reverses the deployment.

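TreatMissingData: notBreaching is worth noting. With no traffic (nights, weekends) the load balancer emits no 5xx datapoints at all, and by default CloudWatch treats those gaps as missing, which drops the alarm into INSUFFICIENT_DATA instead of leaving it in OK. notBreaching tells CloudWatch to count an empty period as healthy. I hit a false rollback the first time I tested this on a weekend, before I set it.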
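How the threshold and the missing-data setting combine for a single evaluation period can be modelled in a few lines. This is a simplified sketch, not the CloudWatch implementation: it covers three of the TreatMissingData settings (notBreaching, breaching, missing), leaves out ignore, and the function name is invented.

```typescript
// Simplified model of a one-period alarm; not the CloudWatch implementation.
type MissingDataPolicy = "notBreaching" | "breaching" | "missing";
type AlarmState = "OK" | "ALARM" | "INSUFFICIENT_DATA";

function evaluate5xxAlarm(
  sum5xx: number | undefined, // undefined = no datapoint for the period
  threshold: number,
  policy: MissingDataPolicy,
): AlarmState {
  if (sum5xx === undefined) {
    if (policy === "notBreaching") return "OK";
    if (policy === "breaching") return "ALARM";
    return "INSUFFICIENT_DATA";
  }
  return sum5xx > threshold ? "ALARM" : "OK"; // GreaterThanThreshold
}

console.log(evaluate5xxAlarm(11, 10, "notBreaching"));        // "ALARM"
console.log(evaluate5xxAlarm(10, 10, "notBreaching"));        // "OK"
console.log(evaluate5xxAlarm(undefined, 10, "notBreaching")); // "OK"
console.log(evaluate5xxAlarm(undefined, 10, "missing"));      // "INSUFFICIENT_DATA"
```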

What I'd change next time

I'd probably use ECS Fargate. CodeDeploy's blue/green for ECS actually supports TimeBasedCanary properly, so you can do true 10% -> 50% -> 100% shifts with observation windows between each step. EC2 blue/green is coarser. You get "new instances with traffic control," but the fine-grained percentage steps aren't natively there.

I stuck with EC2 because I wanted to learn how the instance-level deployment mechanics work. Worth it for the education. Probably not what I'd pick for a real production system in 2026.

Repo

Public: github.com/suletetes/ShipGuard

Three templates, one buildspec, one appspec, four shell scripts. Deploy staging first, then production, then pipeline. Push code. Pipeline picks it up.

Costs about $45/month with everything running. ALBs are most of that ($16 each). Tear down staging when you're not testing.

Stack

CodePipeline, CodeBuild, CodeDeploy
EC2 Auto Scaling behind ALBs
CloudWatch alarms
SNS
S3
CloudFormation
TypeScript/Express (the app being deployed)

If you've done the "deploy a Lambda" tutorials and want something closer to what production infrastructure actually looks like, this hits the right problems. Cross-stack references, IAM chains, blue/green mechanics, alarm-driven automation. The stuff that takes four tries to get right and nobody warns you about in advance.

The deployed infrastructure

Here's what the stacks actually create in AWS:

CloudFormation stacks
VPC subnets (multi-AZ)
Security groups
Load balancers
Target groups (blue and green)
Auto Scaling groups
EC2 instances
S3 artifact bucket
SNS topics

Migrating a MERN app to AWS serverless (and what broke)

Suleiman Abdulkadir — Thu, 28 May 2026 21:02:34 +0000

I built Taskly about a year ago. Standard MERN stack, ran on a $10/month VPS with PM2 and nginx. It worked fine. Nobody was complaining.

I migrated it to AWS serverless anyway. Partly to learn, partly because I was mass applying to DevOps roles and needed something real to talk about in interviews. "I deployed a hello world Lambda" doesn't cut it.

The app

Task management for small teams. Tasks, projects, teams, calendar, notifications, avatar uploads, productivity stats. About 15 API routes, 6 Mongoose models. React frontend with Context API. Nothing fancy, but enough moving parts that the migration wasn't trivial.

Original stack: Express, session auth, MongoDB, Cloudinary, Resend for emails.

Where I ended up

Request flow: users hit CloudFront for the React app, WAF-filtered API Gateway for the backend. Lambda runs Express via serverless-express, talks to DocumentDB in a private VPC, pushes events to EventBridge, and queues emails through SQS to SES.

Network layer: VPC with public and private subnets across two AZs. Security groups restrict DocumentDB to Lambda only (port 27017). VPC endpoints for Secrets Manager. NAT gateway for outbound.

Deployment: GitHub Actions authenticates via OIDC (no stored keys), packages Lambda, does canary traffic shifting, monitors error rate, auto-rolls back if anything breaks.

VPC with private subnets. Security groups. NAT gateway. Secrets Manager. 12 Terraform modules, about 2000 lines of HCL. GitHub Actions CI/CD with OIDC auth and canary deployments.

Took about 3 weeks of evenings.

Sessions broke immediately

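First thing. My Express app used express-session with a MongoDB store. Lambda doesn't keep one long-lived server process around; execution environments come and go, and nothing held in memory carries over between them. Sessions were just gone.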

I ended up with dual mode auth. Sessions for local dev (easy to debug, familiar), Cognito JWT for production (stateless, works with Lambda). The middleware checks which environment it's running in:

if (process.env.COGNITO_USER_POOL_ID && process.env.COGNITO_CLIENT_ID) {
  return validateCognitoToken(req, res, next);
} else {
  return req.isAuthenticated() ? next() : res.status(401).json({...});
}

Not elegant. Works.
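
A minimal sketch of how a check like this could be wrapped into a complete Express-style middleware. It is not the code from the repo: createAuthMiddleware, the injected verifyToken function, and the error messages are hypothetical names chosen for illustration, and verifyToken stands in for whatever actually validates the Cognito JWT.

```javascript
// Sketch only. createAuthMiddleware and verifyToken are hypothetical names;
// verifyToken is assumed to resolve with the user claims or throw.
function createAuthMiddleware({ env = process.env, verifyToken }) {
  // Cognito mode is on only when both variables are present.
  const cognitoEnabled = Boolean(env.COGNITO_USER_POOL_ID && env.COGNITO_CLIENT_ID);

  return async function requireAuth(req, res, next) {
    if (!cognitoEnabled) {
      // Local dev: fall back to the session check (Passport-style isAuthenticated()).
      if (typeof req.isAuthenticated === 'function' && req.isAuthenticated()) {
        return next();
      }
      return res.status(401).json({ error: 'Not authenticated' });
    }

    // Production: expect an "Authorization: Bearer <token>" header.
    const header = (req.headers && req.headers.authorization) || '';
    const [scheme, token] = header.split(' ');
    if (scheme !== 'Bearer' || !token) {
      return res.status(401).json({ error: 'Missing bearer token' });
    }

    try {
      req.user = await verifyToken(token);
      return next();
    } catch (err) {
      return res.status(401).json({ error: 'Invalid token' });
    }
  };
}
```

Passing env and verifyToken in as arguments keeps the branch testable without a real user pool or a live session store.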

DocumentDB is almost MongoDB

95% compatible. The other 5% shows up at the worst times.

Some aggregation stages behave differently. The connection needs a TLS certificate bundle you have to download from AWS and ship with your Lambda zip. I only caught these because I tested against the actual cluster, not just local Mongo.

If I'd only tested locally, these would have been production bugs discovered at 2 am.
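
As a rough illustration of the TLS requirement, here is a sketch of building connection settings for DocumentDB. The function name, host, credentials, and file path are placeholders, not values from the project. The option names (tls, tlsCAFile, retryWrites) are standard MongoDB Node driver options; retryWrites is switched off because DocumentDB does not support retryable writes.

```javascript
// Sketch only: buildDocumentDbConfig is a hypothetical helper, and every
// value passed to it below is a placeholder.
function buildDocumentDbConfig({ host, port = 27017, user, password, database, caFile }) {
  // Credentials are percent-encoded so characters like "@" don't break the URI.
  const credentials = `${encodeURIComponent(user)}:${encodeURIComponent(password)}`;
  return {
    uri: `mongodb://${credentials}@${host}:${port}/${database}`,
    options: {
      tls: true,           // DocumentDB clusters have TLS enabled by default
      tlsCAFile: caFile,   // path to the CA bundle shipped inside the Lambda zip
      retryWrites: false,  // retryable writes are not supported by DocumentDB
    },
  };
}

const config = buildDocumentDbConfig({
  host: 'example-cluster.local',
  user: 'app user',
  password: 'p@ss',
  database: 'taskly',
  caFile: '/var/task/global-bundle.pem',
});
console.log(config.uri);
```

The resulting uri and options could then be handed to mongoose.connect(uri, options). The bundle path assumes the certificate file sits at the root of the deployment package.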

Terraform got out of hand fast

Started with one main.tf. Lasted a day.

Split into modules: vpc, lambda, iam, s3, documentdb, waf, ses, api-gateway, cloudfront, secrets, monitoring, disaster-recovery. State in S3 with DynamoDB locking.

The thing about Terraform: plan says "2 to add, 0 to destroy" and you feel safe. Then apply takes 15 minutes because NAT gateways are slow. And if it fails halfway, you get to learn about terraform state rm.

Security groups

The mental model that clicked:

Lambda SG: egress all (needs DocumentDB, NAT, VPC endpoints)
DocumentDB SG: ingress port 27017 from Lambda SG only
VPC Endpoints SG: ingress port 443 from Lambda SG only

Three groups. Database unreachable from internet. Lambda can reach database. Done.
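
The same mental model can be written down as data. This is a toy sketch, not Terraform and not an AWS API: the group names and the isAllowed helper are made up for illustration, and it only models ingress rules, relying on the fact that security groups deny inbound traffic unless a rule allows it.

```javascript
// Toy model of the three groups above. Names are illustrative only.
const securityGroups = {
  lambda: { ingress: [] }, // nothing connects in to Lambda; it only dials out
  documentdb: { ingress: [{ port: 27017, fromGroup: 'lambda' }] },
  vpcEndpoints: { ingress: [{ port: 443, fromGroup: 'lambda' }] },
};

// Returns true only if the target group has an ingress rule that names
// the source group and the port. Anything else is denied by default.
function isAllowed(groups, source, target, port) {
  const rules = groups[target] ? groups[target].ingress : [];
  return rules.some((rule) => rule.port === port && rule.fromGroup === source);
}

console.log(isAllowed(securityGroups, 'lambda', 'documentdb', 27017));   // Lambda to database
console.log(isAllowed(securityGroups, 'internet', 'documentdb', 27017)); // internet to database
```

Because the DocumentDB rule references the Lambda group rather than an IP range, any new Lambda ENI is covered automatically and nothing else is.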

Canary deploys

The CI/CD pipeline packages the code, uploads to S3, publishes a new Lambda version, shifts 10% of traffic to it, waits 5 minutes watching CloudWatch error metrics, and either promotes to 100% or rolls back.

Saved me twice. Once from a missing env var, once from a dependency that worked locally but not in the Lambda runtime.
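
A sketch of the two decisions in that flow, under stated assumptions. canaryAliasParams builds the request shape for Lambda's UpdateAlias call, where RoutingConfig.AdditionalVersionWeights sends a fraction of traffic to a second version. decideCanary is a hypothetical helper, and its 1% error threshold is an assumption; the post doesn't say what threshold the real pipeline uses.

```javascript
// Parameters for Lambda UpdateAlias: the alias keeps pointing at the stable
// version while `weight` (0.1 = 10%) of invocations go to the canary version.
function canaryAliasParams({ functionName, aliasName, stableVersion, canaryVersion, weight }) {
  return {
    FunctionName: functionName,
    Name: aliasName,
    FunctionVersion: stableVersion,
    RoutingConfig: { AdditionalVersionWeights: { [canaryVersion]: weight } },
  };
}

// Hypothetical decision step after the observation window.
// maxErrorRate is an assumed threshold. Zero traffic is treated as healthy
// here; a real pipeline might prefer to wait longer instead.
function decideCanary({ invocations, errors }, maxErrorRate = 0.01) {
  const errorRate = invocations > 0 ? errors / invocations : 0;
  return errorRate > maxErrorRate ? 'rollback' : 'promote';
}

const params = canaryAliasParams({
  functionName: 'example-api', // placeholder name
  aliasName: 'live',           // placeholder alias
  stableVersion: '7',
  canaryVersion: '8',
  weight: 0.1,
});
console.log(JSON.stringify(params.RoutingConfig));
console.log(decideCanary({ invocations: 200, errors: 10 }));
```

Promoting means pointing the alias at the new version and clearing the routing config; rolling back means clearing the routing config and leaving the alias where it was.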

What I'd change

Skip the VPC for Lambda if possible. It adds networking to manage and can still add some cold start latency, and NAT gateways cost about $32/month each before data processing charges. DocumentDB forces you into a VPC though, so I was stuck.

Write smaller Terraform modules. My IAM module has 8 policies in one file. Should be separate.

Set up CI/CD first, not last. I did manual deploys for weeks. Dumb.

Cost

Old VPS: $10/month
AWS serverless: ~$45/month (mostly NAT gateway and DocumentDB)

More expensive. But I actually understand VPCs, IAM, security groups, and Terraform now. That's worth more than $35/month to me.

Code

github.com/suletetes/taskly

Infrastructure in infrastructure/, Lambda handler in backend/lambda/handler.js, CI/CD in .github/workflows/.

If you're doing something similar, start with VPC and DocumentDB. They take the longest to provision and have the most surprises. Get those working, then add Lambda and API Gateway on top.

I Built a SaaS Platform From Scratch. Here's How I Architected It on AWS.

Suleiman Abdulkadir — Tue, 28 Apr 2026 00:03:29 +0000

So I've been working on something for a while now. It's called TechVerse. It's a SaaS e-commerce platform, and I built the whole thing from the ground up using the MERN stack.

I want to talk about the cloud architecture side of things. Not the textbook version. The real version. The one where you're staring at your screen at 2am trying to figure out why your WebSocket connections keep dropping, or why your API response times just spiked to 4 seconds.

Let me walk you through it.

The Problem I Was Trying to Solve

Here's the situation. There are thousands of tech retailers in Nigeria who sell laptops, phones, accessories, all kinds of stuff. Most of them run their entire business from a physical shop. No website. No online store. Nothing.

Why? Because getting a custom e-commerce site built costs a fortune. And the international platforms? They charge in dollars. That's a dealbreaker when your local currency fluctuates every other week.

So I thought, what if I built a SaaS platform that lets these businesses spin up a professional online store for a fraction of the cost? Pricing in local currency. Optimized for local internet speeds. Local payment gateways baked right in.

That was the idea. Now I had to figure out how to actually build and deploy it.

Choosing the Stack

I went with what I know best. MongoDB, Express, React, Node. The classic MERN stack. But I made some specific choices that matter for production.

React 19 with Vite 7 on the frontend. Vite is ridiculously fast for builds. The dev experience alone is worth it, but more importantly, the production bundles are tiny. That matters when your users are on 3G connections.

Node.js 20 with Express on the backend. Nothing fancy here. It works. It scales. The ecosystem is massive. I added Socket.io for real-time features like order notifications and live inventory updates.

MongoDB Atlas for the database. I considered self-hosting on EC2, but honestly, managed databases save you so much headache. Automated backups, point-in-time recovery, monitoring. All handled. I went with an M10 cluster to start.

Redis through ElastiCache for caching and session management. This was a game changer for performance. More on that later.

The Architecture (And Why Each Piece Exists)

Alright, let's break down the actual AWS setup. I'll explain why I chose each service, not just what it does.

The Entry Point: Route 53 and CloudFront

Every request starts at Route 53 for DNS resolution. Simple enough. But the real magic is CloudFront.

CloudFront is AWS's CDN, and it has edge locations in Lagos. That's huge. It means my users in Nigeria are hitting a server that's physically close to them, not one sitting in Ireland or Virginia.

I configured CloudFront to do two things. Static file requests go to an S3 bucket where my React build lives. API requests get forwarded to my backend through the Application Load Balancer. One domain, two destinations. Clean and simple.

I also attached an ACM certificate here. Free SSL. No reason not to use it.

The Frontend: S3

The React app gets built by Vite, and the output goes straight into an S3 bucket. No servers involved. S3 serves static files incredibly well, and combined with CloudFront caching, my frontend loads in under 2 seconds on a 3G connection.

I set up error page redirects so that 404s go back to index.html. That's essential for single-page apps with client-side routing. Without it, refreshing any page that isn't the root would give you a blank screen.

The Backend: EC2 with Auto Scaling

Here's where it gets interesting. My Node.js API runs on EC2 instances inside a public subnet. I have two instances behind an Application Load Balancer, with an Auto Scaling Group that can spin up to four instances based on CPU utilization.

Why not Fargate or Lambda? Honestly, for a WebSocket-heavy application, EC2 gives you more control. Lambda has cold starts that would kill the real-time experience. Fargate is great but adds complexity I didn't need yet. EC2 with a good Auto Scaling policy hits the sweet spot.

The ALB distributes traffic evenly and handles health checks. If one instance goes down, traffic automatically routes to the healthy ones. No manual intervention needed.

The Data Layer: MongoDB Atlas and ElastiCache

MongoDB Atlas sits in a private subnet. It's peered with my VPC, so the connection is fast and secure. No public internet involved.

ElastiCache Redis handles three things for me. Session storage, so users stay logged in across multiple EC2 instances. Response caching, so repeated database queries don't hit MongoDB every time. And rate limiting, so I can throttle abusive requests without adding load to my application servers.

Before I added Redis, my average API response time was around 400ms. After? Under 200ms. That's the kind of improvement that users actually feel.
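
To make the rate limiting use case concrete, here is a sketch of a fixed-window limiter built on two Redis commands, INCR and EXPIRE. It is not TechVerse's implementation. FakeRedis is an in-memory stand-in so the example runs without a server; clients such as ioredis expose incr and expire methods with the same shape. The key prefix, limit, and window are assumptions.

```javascript
// In-memory stand-in for the two Redis commands the limiter needs.
class FakeRedis {
  constructor(now = () => Date.now()) {
    this.now = now;
    this.data = new Map();
  }

  // Returns the entry for a key, dropping it first if its TTL has passed.
  _live(key) {
    const entry = this.data.get(key);
    if (entry && entry.expiresAt !== null && entry.expiresAt <= this.now()) {
      this.data.delete(key);
      return undefined;
    }
    return entry;
  }

  async incr(key) {
    const entry = this._live(key) || { value: 0, expiresAt: null };
    entry.value += 1;
    this.data.set(key, entry);
    return entry.value;
  }

  async expire(key, seconds) {
    const entry = this._live(key);
    if (!entry) return 0;
    entry.expiresAt = this.now() + seconds * 1000;
    return 1;
  }
}

// Fixed-window limiter: count requests per client, start the window's TTL
// on the first request, and reject once the count passes the limit.
async function allowRequest(redis, clientId, { limit = 100, windowSeconds = 60 } = {}) {
  const key = `ratelimit:${clientId}`;
  const count = await redis.incr(key);
  if (count === 1) {
    await redis.expire(key, windowSeconds);
  }
  return count <= limit;
}
```

Because the counter lives in Redis rather than in process memory, every EC2 instance behind the load balancer sees the same count for a given client.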

Monitoring and Email: CloudWatch and SES

CloudWatch collects logs and metrics from everything. EC2 instances, the load balancer, Redis, all of it. I set up alarms for CPU spikes, memory usage, and error rates. If something breaks at 3am, I get a notification.

Amazon SES handles transactional emails. Order confirmations, password resets, shipping updates. It's cheap and reliable. Way better than trying to manage your own SMTP server.

Backups

Everything gets backed up to S3. MongoDB Atlas handles its own backups, but I also dump snapshots to S3 for extra safety. CloudWatch logs go there too. Storage is cheap. Losing data is not.

The CI/CD Pipeline

This part I'm actually proud of. GitHub Actions handles everything.

When I push to the main branch, here's what happens. The pipeline runs tests. If they pass, it builds the React frontend and syncs it to S3. Then it deploys the backend to EC2 through the load balancer. Zero downtime. The whole process takes about 4 minutes.

I also have separate workflows for staging and production. Feature branches deploy to staging automatically. Production requires a manual approval step. That one extra click has saved me from shipping broken code more than once.

Stripe Integration

Payments go through Stripe. The integration is bidirectional. My EC2 instances send payment requests to Stripe's API, and Stripe sends webhook events back for things like successful charges, refunds, and subscription updates.

I handle webhooks on a dedicated endpoint with signature verification. Never trust incoming data without verifying it. That's a lesson you only need to learn once.
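
For a sense of what signature verification involves, here is a hand-rolled sketch of Stripe's documented scheme: the Stripe-Signature header carries a timestamp (t) and one or more v1 signatures, each an HMAC-SHA256 of "timestamp.rawBody" keyed with the endpoint's signing secret. This is an illustration only. In a real handler the official library's stripe.webhooks.constructEvent does this check, and the function names and the 300-second tolerance default here are assumptions of the sketch.

```javascript
const crypto = require('node:crypto');

// HMAC-SHA256 over "<timestamp>.<raw body>", hex-encoded.
function computeSignature(timestamp, rawBody, secret) {
  return crypto
    .createHmac('sha256', secret)
    .update(`${timestamp}.${rawBody}`, 'utf8')
    .digest('hex');
}

// Returns true only if the header has a fresh timestamp and at least one
// v1 signature that matches the one computed from the raw body.
function verifyStripeSignature(rawBody, header, secret, options = {}) {
  const { toleranceSeconds = 300, nowSeconds = Math.floor(Date.now() / 1000) } = options;

  let timestampText = null;
  const candidates = [];
  for (const part of String(header).split(',')) {
    const idx = part.indexOf('=');
    if (idx === -1) continue;
    const key = part.slice(0, idx).trim();
    const value = part.slice(idx + 1).trim();
    if (key === 't') timestampText = value;
    if (key === 'v1') candidates.push(value);
  }

  if (timestampText === null || timestampText === '' || candidates.length === 0) return false;
  const timestamp = Number(timestampText);
  if (!Number.isFinite(timestamp)) return false;

  // Reject old events so a captured request cannot be replayed later.
  if (Math.abs(nowSeconds - timestamp) > toleranceSeconds) return false;

  const expected = Buffer.from(computeSignature(timestampText, rawBody, secret), 'utf8');
  return candidates.some((candidate) => {
    const given = Buffer.from(candidate, 'utf8');
    // timingSafeEqual throws on length mismatch, so check length first.
    return given.length === expected.length && crypto.timingSafeEqual(given, expected);
  });
}
```

The check has to run on the raw request body. If a JSON body parser has already re-serialized the payload, the bytes differ and the signature will not match.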

What This Actually Costs

Here's the part everyone wants to know. My monthly AWS bill for this setup is roughly $100 to $110. That breaks down to about:

EC2 instances (t3.small): $25-30
MongoDB Atlas (M10): $57
ElastiCache Redis (cache.t3.micro): $12
S3 and CloudFront: $5-10
Route 53 and misc: $2-3

For a production SaaS platform with auto-scaling, CDN, managed database, caching, monitoring, and automated deployments, that's pretty reasonable. It can comfortably handle hundreds of concurrent users and thousands of daily requests.

Lessons I Learned the Hard Way

Let me share a few things that bit me during this process.

WebSocket connections through CloudFront need specific configuration. You have to set up the right cache behaviors and forward the Upgrade header. I spent an entire weekend debugging why Socket.io worked locally but not in production. The fix was three lines of CloudFront config.

Don't skip the VPC design. I initially put everything in a public subnet because it was easier. Then I realized my database was exposed to the internet. Moved it to a private subnet immediately. Take the time to set up your network properly from day one.

Redis connection pooling matters. My first implementation created a new Redis connection for every request. Under load, I was hitting connection limits within minutes. Connection pooling fixed it instantly.
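
One simple way to stop opening a connection per request is to create the client once and reuse it. The sketch below shows that idea; it is not a pool in the strict sense and not necessarily the fix used in TechVerse. createSharedClient is a hypothetical helper, and the factory argument stands in for whatever constructs the real Redis client.

```javascript
// Wraps a client factory so the client is created on first use and the
// same instance is returned on every later call.
function createSharedClient(factory) {
  let client = null;
  return function getClient() {
    if (client === null) {
      client = factory();
    }
    return client;
  };
}

// Usage with a dummy factory that counts how many clients get created.
let created = 0;
const getRedis = createSharedClient(() => {
  created += 1;
  return { id: created };
});

getRedis();
getRedis();
getRedis();
console.log(created);
```

Request handlers then call getRedis() instead of constructing a client, so the number of open connections stays tied to the number of processes, not the number of requests.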

Auto Scaling needs a cooldown period. Without it, your instances will scale up and down like a yo-yo. I set a 5-minute cooldown, and the scaling became smooth and predictable.
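
The cooldown idea is easy to see in a few lines. This sketch is a toy model of the behavior, not how the Auto Scaling service is implemented: createScaler and requestScaling are made-up names, and the injected clock exists only so the example can be run deterministically.

```javascript
// After any scaling action is applied, further actions are ignored until
// cooldownMs has passed. Default mirrors the 5-minute cooldown above.
function createScaler({ cooldownMs = 5 * 60 * 1000, now = () => Date.now() } = {}) {
  let lastActionAt = null;
  return function requestScaling(action) {
    const t = now();
    if (lastActionAt !== null && t - lastActionAt < cooldownMs) {
      return { applied: false, reason: 'cooldown' };
    }
    lastActionAt = t;
    return { applied: true, action };
  };
}

// Usage with a fake clock: a scale-in request one minute after a scale-out
// is ignored, which is what stops the yo-yo.
let clock = 0;
const requestScaling = createScaler({ now: () => clock });
console.log(requestScaling('scale-out'));
clock = 60 * 1000;
console.log(requestScaling('scale-in'));
```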

Environment variables are not optional. I had a brief moment where I accidentally committed a JWT secret to GitHub. Rotated it immediately and moved everything to AWS Systems Manager Parameter Store. Use it. Standard parameters are free.

What's Next

The architecture I have now works well for the current stage. But I'm already thinking about what comes next as the platform grows.

I want to add a message queue. Probably SQS. Right now, some of my background tasks like sending emails and processing images run synchronously. That's fine with 50 users. It won't be fine with 500.

I'm also looking at moving to containers eventually. ECS with Fargate would give me better resource utilization and simpler deployments. But that's a migration I'll do when the current setup starts showing strain, not before.

And I need better observability. CloudWatch is good for basics, but I want distributed tracing. Probably AWS X-Ray or something like Datadog. When you have multiple services talking to each other, you need to see the full picture of a request's journey.

Final Thoughts

Building a SaaS platform is one thing. Making it production-ready on AWS is a completely different challenge. It forces you to think about things you never consider during development. Network security. Scaling behavior. Cost optimization. Disaster recovery.

But here's what I've realized. You don't need a perfect architecture on day one. You need one that works, that you understand, and that you can evolve. Start simple. Add complexity only when you have a real reason to.

If you want to dig into the code, the full project, TechVerse, is on GitHub.
