Suleiman Abdulkadir

Posted on Jun 23

I built an event-driven order system with both ECS and Lambda. Here's why.

#aws #serverless #typescript #architecture

Every AWS interview I've done asks some version of the same question: containers or serverless? And every time, the "right answer" is "it depends." Which is true but useless.

So I built a system that uses both. On purpose. Not as a compromise, but because different parts of the same application have different runtime needs. The API needs consistent latency. Background jobs need to scale to zero. Trying to force both into one compute model is the wrong move.

This is EventForge. It's an e-commerce order processing platform with an event-driven architecture, a Step Functions saga, and about 15 AWS services wired together.

The full picture is hard to read at this scale, so I split it into three views.

Request flow

A user signs in via Cognito. The React frontend sends authenticated requests to the ALB, which forwards them to ECS Fargate containers running the Express API. The API reads and writes DynamoDB and publishes events to EventBridge.

Order workflow

A Step Functions saga processes each order through validation, inventory reservation, payment, and confirmation, with compensation paths when something fails.

Background processing

After an order completes, EventBridge fans out to SQS queues. Lambda processors handle emails, PDF receipts, and webhooks. Dead letter queues catch failures, and CloudWatch alarms notify on DLQ depth.

The containers vs. serverless thing

I'll keep this short because it's genuinely simple once you see it:

	ECS Fargate (API)	Lambda (background)
Response time	Consistent sub-200ms	Cold starts add 500ms-3s
Sustained traffic	Predictable cost	Expensive at high RPS
Idle periods	You're paying anyway	Free (scale to zero)
Burst scaling	Minutes	Seconds

My API runs on Fargate. Two tasks behind an ALB. Health checks pass in under 100ms because there's no cold start penalty. Users hit the order creation endpoint, get a 201 back in ~150ms, and the system handles the rest asynchronously.

The "rest" is ten Lambda functions that process emails, generate PDF receipts, deliver webhooks, and run the entire order fulfillment saga. They sit idle most of the time. When an order comes in, they wake up, do their thing, and go back to sleep. I pay nothing when nobody's ordering.

The order saga (the interesting part)

This is where I spent most of my time. An order goes through four steps: validate, reserve inventory, charge payment, confirm. If step 3 (payment) fails after step 2 (inventory) succeeded, you have a problem. Inventory is reserved but the order is dead.

Step Functions handles this with a saga pattern:

Each step is a separate Lambda. If ChargePayment throws, the workflow doesn't just fail. It routes to ReleaseInventory first, which undoes the reservation. Then it calls OrderFailed to persist the failure status. Only then does the execution terminate.

I defined the workflow in ASL (Amazon States Language). Each state uses Catch blocks that route to compensation steps:

"ChargePayment": {
  "Type": "Task",
  "Resource": "${ChargePaymentFunctionArn}",
  "Next": "ConfirmOrder",
  "Catch": [{
    "ErrorEquals": ["States.ALL"],
    "Next": "ReleaseInventory"
  }]
}

The compensation path runs in reverse. Payment failed? Release the reservation. Reservation failed? Nothing to compensate, just mark as failed. It's boring when it works, which is the point.
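The reverse-order compensation can be illustrated outside Step Functions with a small in-memory sketch. This is not the ASL workflow itself: the `SagaStep` type, `runSaga`, and the step bodies are hypothetical, and `ValidateOrder` and `ReserveInventory` are assumed state names. It only mimics the routing that the Catch blocks perform.

```typescript
// Hypothetical in-memory model of the saga; the real workflow is ASL in Step Functions.
type SagaStep = {
  name: string;
  run: () => void;          // throws to signal failure
  compensate?: () => void;  // undo action, if the step has one
};

type SagaResult = { status: "CONFIRMED" | "FAILED"; log: string[] };

function runSaga(steps: SagaStep[]): SagaResult {
  const log: string[] = [];
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.run();
      log.push(`ok:${step.name}`);
      completed.push(step);
    } catch {
      log.push(`failed:${step.name}`);
      // Undo the steps that already succeeded, most recent first.
      for (const done of completed.slice().reverse()) {
        if (done.compensate) {
          done.compensate();
          log.push(`compensated:${done.name}`);
        }
      }
      // Persist the failure status only after compensation has run.
      log.push("OrderFailed");
      return { status: "FAILED", log };
    }
  }
  return { status: "CONFIRMED", log };
}

// Payment fails after inventory was reserved, so the reservation must be released.
let reserved = 0;
const result = runSaga([
  { name: "ValidateOrder", run: () => {} },
  { name: "ReserveInventory", run: () => { reserved += 1; }, compensate: () => { reserved -= 1; } },
  { name: "ChargePayment", run: () => { throw new Error("card declined"); } },
  { name: "ConfirmOrder", run: () => {} },
]);
console.log(result.status, result.log, reserved);
```

If validation or reservation fails instead, `completed` holds nothing that defines a `compensate` action, so the run goes straight to `OrderFailed`, which matches the "nothing to compensate" case.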

EventBridge does the fan-out

When ConfirmOrder completes, it publishes an order.completed event to a custom EventBridge bus. One event, three consumers:

SQS queue -> Lambda sends confirmation email (SES)
SQS queue -> Lambda generates PDF receipt (uploads to S3)
SQS queue -> Lambda delivers to registered webhook URLs

Each queue has a dead letter queue. Each DLQ has a CloudWatch alarm. If messages start piling up in the DLQ, something is broken and I want to know.

The PDF processor generates a minimal valid PDF (no library dependencies, just raw PDF syntax) and uploads it to S3 under receipts/{orderId}.pdf. The orders API exposes a presigned URL endpoint so users can download their receipt.
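For a sense of what "raw PDF syntax" involves, here is a minimal sketch of a one-page PDF built by string concatenation. It is not the EventForge processor: `buildMinimalPdf`, the page size, the Helvetica font, and the text layout are all assumptions, and it only handles ASCII text so that string length equals byte length when computing offsets.

```typescript
// Hypothetical helper: builds a one-page PDF with a few lines of ASCII text.
function buildMinimalPdf(lines: string[]): string {
  // Backslashes and parentheses must be escaped inside PDF string literals.
  const esc = (s: string): string => s.replace(/[\\()]/g, "\\$&");
  const text = lines
    .map((line, i) => `${i === 0 ? "" : "0 -16 Td "}(${esc(line)}) Tj`)
    .join("\n");
  const stream = `BT /F1 12 Tf 72 720 Td\n${text}\nET`;

  const objects = [
    "<< /Type /Catalog /Pages 2 0 R >>",
    "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] " +
      "/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
    `<< /Length ${stream.length} >>\nstream\n${stream}\nendstream`,
    "<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
  ];

  let pdf = "%PDF-1.4\n";
  const offsets: number[] = [];
  objects.forEach((body, i) => {
    offsets.push(pdf.length); // byte offset of "N 0 obj" (ASCII only)
    pdf += `${i + 1} 0 obj\n${body}\nendobj\n`;
  });

  // Cross-reference table: every entry is exactly 20 bytes long.
  const xrefStart = pdf.length;
  pdf += `xref\n0 ${objects.length + 1}\n0000000000 65535 f \n`;
  for (const offset of offsets) {
    pdf += `${("0000000000" + offset).slice(-10)} 00000 n \n`;
  }
  pdf += `trailer\n<< /Size ${objects.length + 1} /Root 1 0 R >>\n`;
  pdf += `startxref\n${xrefStart}\n%%EOF\n`;
  return pdf;
}

const receipt = buildMinimalPdf(["Order ord-1", "Total: 42.50 (paid)"]);
console.log(receipt.length);
```

The fiddly part is the bookkeeping: the xref table and `startxref` must hold exact byte offsets, which is why this sketch computes them while appending instead of hard-coding them.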

External systems can also push events in via API Gateway. There's a separate HTTP API with a POST /webhooks/ingest route that validates the payload and publishes to EventBridge. This is how third-party services would feed events into the system.

The processors unwrap the EventBridge envelope (the SQS body is the full EventBridge event, not just the detail), extract the order data, and do their thing. I wasted two hours on this during deployment. The processors kept crashing and the DLQs were filling up. It turned out that EventBridge wraps your payload in an envelope with version, id, source, and detail-type fields, and the actual data is nested inside detail. My code was doing JSON.parse(body) and treating the result as the order directly, so every field came back undefined.
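A minimal sketch of the unwrapping step, assuming the SQS body is the JSON-serialized EventBridge event. The envelope field names are the standard EventBridge ones; `OrderDetail`, `SqsRecord`, `extractOrder`, the `eventforge.orders` source value, and the use of order.completed as the detail-type are hypothetical stand-ins for the real code.

```typescript
// Hypothetical order shape; the real detail payload may carry more fields.
interface OrderDetail { orderId: string; total: number }

// Subset of the EventBridge envelope that matters here.
interface EventBridgeEnvelope<T> {
  version: string;
  id: string;
  source: string;
  "detail-type": string;
  detail: T;
}

// Only the SQS record fields this sketch needs.
interface SqsRecord { messageId: string; body: string }

function extractOrder(record: SqsRecord): OrderDetail {
  const envelope = JSON.parse(record.body) as Partial<EventBridgeEnvelope<OrderDetail>>;
  const detail = envelope.detail;
  // Fail loudly instead of passing undefined fields downstream.
  if (!detail || typeof detail.orderId !== "string") {
    throw new Error(`Record ${record.messageId}: body has no usable detail.orderId`);
  }
  return detail;
}

const record: SqsRecord = {
  messageId: "msg-1",
  body: JSON.stringify({
    version: "0",
    id: "abc-123",
    source: "eventforge.orders", // hypothetical source name
    "detail-type": "order.completed",
    detail: { orderId: "ord-1", total: 42.5 },
  }),
};

const naive = JSON.parse(record.body) as Partial<OrderDetail>;
console.log(naive.orderId);                // undefined: the order lives under detail
console.log(extractOrder(record).orderId); // "ord-1"
```

Throwing on a malformed body sends the message to the DLQ after its retries, which is noisy but far easier to debug than a processor that silently works with undefined values.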

The API layer

TypeScript, Express, running in a Docker container on Fargate. Standard stuff. The parts worth mentioning:

JWT validation against Cognito (JWKS endpoint with key caching), an event publisher that retries transient EventBridge failures with exponential backoff, and a DynamoDB single-table design where orders, events, and webhook registrations all live in one table with composite keys.
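The retry logic in the event publisher can be sketched as a generic helper. This is an assumption about its shape, not the actual EventForge code: `withRetry`, `backoffDelayMs`, the default delays, and the caller-supplied `isTransient` check are all hypothetical, and jitter is left out to keep the example short.

```typescript
// Delay before the next try for a 0-based attempt: base * 2^attempt, capped.
function backoffDelayMs(attempt: number, baseMs = 100, capMs = 2000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

type Sleep = (ms: number) => Promise<void>;
const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => { setTimeout(() => resolve(), ms); });

// Runs `operation`, retrying only errors that `isTransient` accepts.
async function withRetry<T>(
  operation: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  maxAttempts = 4,
  sleep: Sleep = realSleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (!isTransient(err)) break; // permanent failure: give up immediately
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}

// Usage with a fake publisher that fails twice before succeeding.
let calls = 0;
const flakyPublish = async (): Promise<string> => {
  calls += 1;
  if (calls < 3) throw new Error("throttled"); // stand-in for a transient failure
  return "published";
};

withRetry(flakyPublish, () => true, 4, async () => {}) // no-op sleep keeps the demo fast
  .then((outcome) => console.log(outcome, calls));
```

Passing `sleep` in as a parameter is a design choice for testability: tests can swap in a no-op and assert on the computed delays without waiting on real timers.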

The Docker image lives in ECR. The GitHub Actions pipeline pushes a new image on every merge to main, and ECS picks it up on the next deployment.

The frontend is React on S3 with static website hosting. It polls /api/events and /api/orders every 10 seconds. There's a form to create orders and a section to register webhook URLs. Nothing fancy, but it proves the whole pipeline works.

Infrastructure as code (all of it)

One template.yaml at the root. Nested stacks for each layer:

VPC (2 AZs, public/private subnets, NAT gateways)
DynamoDB
SQS queues + DLQs
Cognito
IAM roles (least privilege per service)
ECS cluster + service + ALB
EventBridge bus + rules
Lambda functions
API Gateway (for external webhook ingestion)
CloudWatch alarms

sam package and sam deploy. Two commands to go from code to running infrastructure. The Lambda code is pre-bundled with esbuild into self-contained files because SAM can't resolve npm workspace symlinks (this took me a while to figure out). I wrote a small script (scripts/bundle-lambdas.js) that creates ten individual bundles, each with all dependencies inlined except the AWS SDK (provided by the runtime).

The deployment pipeline

GitHub Actions. Push to main and it:

Builds TypeScript
Bundles Lambdas with esbuild
Builds and pushes the Docker image to ECR
Packages and deploys with SAM
Reads the new Cognito pool ID and ALB DNS from stack outputs
Rebuilds the frontend with those values baked in
Syncs to S3

The whole thing takes about 8 minutes.

Testing

343 tests. 19 of them are property-based (fast-check). Those generate 100 random inputs per test and verify invariants like "for any valid order request, the system always produces exactly one pending record and one event" and "for any webhook registration, the URL hash is deterministic."

The property-based tests caught two bugs that unit tests missed: an edge case in the order validator where a price of exactly 0.00 passed validation (it shouldn't), and a race condition in the idempotency check where two identical requests within the same millisecond could both succeed.
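The zero-price bug is a good illustration of what a property check buys you. The sketch below swaps fast-check for a small hand-rolled seeded generator so that it has no dependencies. `isValidPrice`, `buggyIsValidPrice`, and `findCounterexamples` are hypothetical, and the property ("a non-positive price never passes validation") is an assumed reading of the rule the real validator enforces.

```typescript
// Hypothetical fixed validator: a price must be finite and strictly positive.
function isValidPrice(price: number): boolean {
  return isFinite(price) && price > 0;
}

// Hypothetical buggy version: `>=` lets a price of exactly 0.00 through.
function buggyIsValidPrice(price: number): boolean {
  return isFinite(price) && price >= 0;
}

// Park-Miller style PRNG so every run sees the same "random" inputs.
function makeRandom(seed: number): () => number {
  let state = (seed % 2147483646) + 1;
  return () => {
    state = (state * 48271) % 2147483647;
    return state / 2147483647;
  };
}

// Property: no non-positive price is ever accepted by the validator.
function findCounterexamples(
  validator: (price: number) => boolean,
  runs = 100,
  seed = 42,
): number[] {
  const random = makeRandom(seed);
  // Boundary values go in explicitly; random sampling rarely lands exactly on 0.
  const samples: number[] = [0, -0.01, 0.01];
  for (let i = 0; i < runs; i++) {
    // Whole cents between -50.00 and 50.00.
    samples.push(Math.round(random() * 10000 - 5000) / 100);
  }
  return samples.filter((price) => price <= 0 && validator(price));
}

console.log(findCounterexamples(isValidPrice).length);  // 0: the property holds
console.log(findCounterexamples(buggyIsValidPrice)[0]); // 0: the failing input
```

A real fast-check property would also shrink failing inputs toward a minimal counterexample, which this sketch does not attempt.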

What it costs

About $35/month with two Fargate tasks running. Most of that is the ALB ($16/month regardless of traffic) and Fargate compute ($18). Everything else (Lambda, DynamoDB, SQS, EventBridge) falls under free tier at low traffic.

If you're showing this off for 30 minutes and then tearing it down, it costs about $0.50.

Stuff I'd do differently

The NAT gateways are expensive for a demo. I'd use VPC endpoints for DynamoDB and EventBridge instead, which drops the monthly cost significantly. I kept the NAT gateways because ECS tasks in private subnets need them to pull images from ECR, but VPC endpoints solve that too: the two ECR interface endpoints (ecr.api and ecr.dkr) plus an S3 gateway endpoint, since the image layers are served from S3.

SES is still in sandbox mode, so emails only go to verified addresses. For a real production system you'd request production access.

The frontend is HTTP-only (S3 static hosting). A real deployment would put CloudFront in front for HTTPS. I tried it during development but hit a circular dependency between the OAC and the bucket policy, so I dropped it and went with direct S3 hosting. Works fine for a demo.
