DEV Community: Joud Awad

Day 2/30 AWS System Design Patterns

Joud Awad — Wed, 22 Jul 2026 15:32:06 +0000

A fintech startup is building a real-time payment processing platform. Three teams need to react when a payment completes: the notifications team (send push + email), the fraud team (score the transaction), and the analytics team (write to the data warehouse for reporting and reconciliation — including re-processing historical events whenever finance restates a report).

The backend architect picks EventBridge (serverless event bus — rules are evaluated at publish time; an event that matches no enabled rule is discarded, not stored) for all three. Each team wires up a rule that triggers their Lambda (serverless compute). Three teams, three rules, one bus. Clean.

Six months later, the data warehouse goes down for a planned 4-hour migration on Sunday night. To stop their Lambda from throwing 14,000 errors against a dead warehouse, the analytics team disables their EventBridge rule for the window. Standard procedure. They re-enable it Monday at 6 AM.

14,000 payment events fire during those 4 hours. The fraud and notifications rules process every one of them. For the analytics rule, EventBridge evaluates each event, finds no enabled match, and moves on. No delivery is attempted. Nothing fails. Nothing is retried. Nothing is stored.
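
To make that failure mode concrete, here is a toy in-memory model of the behavior described above. It is not the EventBridge API: `ToyBus`, `Rule`, and `publish` are hypothetical names, and the only thing modeled is what this post states, namely that rules are matched at publish time and nothing is kept for a rule that is disabled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Rule:
    """A hypothetical rule: a name, a target callback, and an enabled flag."""
    name: str
    target: Callable[[dict], None]
    enabled: bool = True


@dataclass
class ToyBus:
    """Toy stand-in for an event bus that matches rules at publish time."""
    rules: Dict[str, Rule] = field(default_factory=dict)

    def add_rule(self, rule: Rule) -> None:
        self.rules[rule.name] = rule

    def publish(self, event: dict) -> int:
        """Deliver to every enabled rule; return how many rules received it."""
        delivered = 0
        for rule in self.rules.values():
            if rule.enabled:
                rule.target(event)
                delivered += 1
        # Nothing is kept for disabled rules: for them the event is simply gone.
        return delivered


received: Dict[str, List[dict]] = {"fraud": [], "notifications": [], "analytics": []}
bus = ToyBus()
for team in received:
    bus.add_rule(Rule(team, received[team].append))

bus.publish({"payment_id": 1})            # all three rules enabled
bus.rules["analytics"].enabled = False    # maintenance window starts
bus.publish({"payment_id": 2})            # analytics never sees this one
bus.rules["analytics"].enabled = True     # window ends
bus.publish({"payment_id": 3})

print([e["payment_id"] for e in received["analytics"]])  # [1, 3]
print([e["payment_id"] for e in received["fraud"]])      # [1, 2, 3]
```

Payment 2 is not sitting in an error queue or a retry buffer for analytics. In this model, as in the scenario, there is no record that it was ever owed to them.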

$1.4M in transactions are missing from the Monday reconciliation report. The audit team flags it by 9 AM.

The architect chose EventBridge for all three consumers. That was the mistake. What should the analytics team have been wired to instead?

A) Enable a DLQ (Dead Letter Queue — captures events EventBridge attempted to deliver to a target but could not) on the analytics rule — Sunday's 14,000 events would have been captured and re-driven after the maintenance window

B) Use SQS (message queue — buffers messages until a consumer processes and deletes them; each message is consumed once, then gone) as the rule target — the queue absorbs events while the pipeline is down, and the Lambda drains it on resume without disabling anything

C) Consume from a Kinesis Data Stream (streaming log — records are retained for 24 hours by default, extendable up to 365 days, regardless of whether anyone reads them; each consumer owns its read position and resumes from it) — during maintenance the pipeline simply stops, then resumes from its last checkpoint and reads everything it missed

D) Enable EventBridge Archive + Replay (stores events published to the bus; an operator can later replay them, scoped to selected rules) — the archive retains Sunday's events and a Monday-morning replay re-delivers them to the analytics rule

Answer in the comments.

Day 1/30 AWS System Design Patterns

Joud Awad — Tue, 21 Jul 2026 15:30:27 +0000

A gaming platform stores player leaderboard data in DynamoDB (managed NoSQL database, distributes data across internal partitions — each partition has its own throughput ceiling). The table uses game_id as the partition key and player_id as the sort key. The platform runs 15 active games. One game — game_id = "battle-royale" — accounts for 78% of all read traffic. It is the flagship title that launched 6 months ago and still dominates the player base.

The table is provisioned at 10,000 RCUs. CloudWatch (AWS monitoring service, reports consumed RCUs as a table-level aggregate across all partitions) shows average consumed RCUs at 4,300 — well under the provisioned capacity. But ThrottlingException errors are spiking on leaderboard reads for battle-royale during peak hours, and P99 read latency has climbed to 820 ms.
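
A quick back-of-the-envelope check on those numbers, as a sketch only. It assumes consumed RCUs split across games in proportion to read traffic, which the post does not state, and it works from the average figure, so peak-hour load on the busiest key would be higher.

```python
TABLE_AVG_CONSUMED_RCU = 4_300   # the CloudWatch table-level average above
HOT_KEY_SHARE_PERCENT = 78       # battle-royale's share of read traffic
ACTIVE_GAMES = 15

# Assumption: consumed RCUs are distributed like read traffic.
hot_key_rcu = TABLE_AVG_CONSUMED_RCU * HOT_KEY_SHARE_PERCENT // 100
other_games_rcu = TABLE_AVG_CONSUMED_RCU - hot_key_rcu
avg_other_game_rcu = other_games_rcu / (ACTIVE_GAMES - 1)

print(hot_key_rcu)                    # 3354
print(other_games_rcu)                # 946
print(round(avg_other_game_rcu, 1))   # 67.6
```

The table-level aggregate reads as 4,300 out of 10,000. Under this assumption, one partition key is carrying about 3,354 of that while each of the other 14 games averages under 70.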

The on-call engineer opens a ticket to increase provisioned capacity to 20,000 RCUs. The support team approves the change. It goes live.

The throttling does not stop. CloudWatch still shows consumed RCUs well below the new limit.

You have 10,000 RCUs of headroom sitting idle and a table that is still throttling. Should doubling the provisioned capacity have fixed it, and if not, what is actually happening?

A) Yes — doubling provisioned RCUs gives the table enough capacity headroom to absorb the battle-royale traffic spike

B) No — the table needs to migrate to DynamoDB on-demand capacity mode (no pre-provisioned capacity, adapts per-partition throughput dynamically), which removes per-partition limits entirely

C) No — battle-royale is a hot partition; per-partition throughput limits apply regardless of total table RCU provisioning; traffic concentrated on one partition key hits that partition's ceiling even when the table has headroom

D) No — the issue is a missing GSI (Global Secondary Index — a secondary index for alternate access patterns) on game_id causing full partition scans on leaderboard reads

Saga Pattern Explained: Distributed Transactions in Microservices

Joud Awad — Mon, 20 Jul 2026 15:49:08 +0000

Distributed transactions are one of the hardest problems you can face in any distributed system.

The difficulty is not the complexity of the implementation; it is the change in mental model when you go from a single ACID transaction to a distributed one.

Inside one database, ACID handles this for you: any step fails, the whole thing rolls back, clean. Spread one order across four microservices and the safety net is gone. No transaction can span four databases.

The instinct is two-phase commit. Don't. 2PC holds locks across every service while it waits for all of them to vote yes, so one slow participant freezes the whole set. Most managed and NoSQL databases won't support it anyway.

The alternative is the saga: a chain of local transactions, each with a compensating action for when a later step fails.

Here's what most people miss. A saga gives you A, C, and D. It quietly drops the I. No isolation across services means two sagas running at once can corrupt each other: lost updates, dirty reads, non-repeatable reads. Real anomalies, in production, that no framework hides for you.

And "compensation" is not rollback. Committed data doesn't un-commit. You run a new transaction that makes it right: a refund, a cancellation, a restock.
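
A minimal sketch of that idea, assuming an in-process runner and three made-up steps (`create_order`, `charge`, `assign_courier`). None of these names come from a real framework; the point is only that each compensation is a new forward action, run in reverse order.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            log.append(f"done:{name}")
            completed.append(step)
        except Exception:
            log.append(f"failed:{name}")
            for done_name, _, compensate in reversed(completed):
                compensate()  # a new transaction that makes it right, not an un-commit
                log.append(f"compensated:{done_name}")
            return False
    return True


state = {"order": None, "charged": 0}


def create_order() -> None:
    state["order"] = "PENDING"


def cancel_order() -> None:
    state["order"] = "CANCELLED"


def charge() -> None:
    state["charged"] += 25


def refund() -> None:
    state["charged"] -= 25


def assign_courier() -> None:
    raise RuntimeError("no courier available")  # simulated failure in a later step


def nothing_to_undo() -> None:
    pass


log: List[str] = []
ok = run_saga(
    [
        ("order", create_order, cancel_order),
        ("payment", charge, refund),
        ("courier", assign_courier, nothing_to_undo),
    ],
    log,
)
print(ok)     # False
print(state)  # {'order': 'CANCELLED', 'charged': 0}
print(log)
```

Note that the order row still exists afterwards, marked CANCELLED, and the charge was offset by a refund. Nothing was rolled back in the database sense.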

Two ways to coordinate the same saga.

Choreography: services react to each other's events, no central coordinator. Clean with 3 services. A nightmare to trace at 3am when there are 9.
Orchestration: one brain (Step Functions, Temporal) tells each service what to do next. More moving parts, but you can see the flow and debug it.

The fix for the missing isolation is boring and it works: a status column. PENDING, then CONFIRMED or CANCELLED. A semantic lock so nothing downstream acts on half-finished state.
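
Here is one way the status column could look, sketched with SQLite so it runs anywhere. The `orders` table and both helper functions are illustrative assumptions, not a prescribed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO orders (id, status) VALUES (1, 'PENDING'), (2, 'PENDING')")
conn.commit()


def transition(order_id: int, new_status: str) -> bool:
    """Move an order out of PENDING; return False if it already left PENDING."""
    cur = conn.execute(
        "UPDATE orders SET status = ? WHERE id = ? AND status = 'PENDING'",
        (new_status, order_id),
    )
    conn.commit()
    return cur.rowcount == 1


def visible_to_downstream() -> list:
    """Downstream readers only see orders whose saga finished successfully."""
    rows = conn.execute("SELECT id FROM orders WHERE status = 'CONFIRMED' ORDER BY id")
    return [row[0] for row in rows]


print(visible_to_downstream())       # []
print(transition(1, "CONFIRMED"))    # True
print(transition(1, "CANCELLED"))    # False, order 1 is no longer PENDING
print(transition(2, "CANCELLED"))    # True
print(visible_to_downstream())       # [1]
```

The `AND status = 'PENDING'` guard is what makes this a lock rather than a label: a second saga that tries to finish the same order finds nothing to update.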

I put the whole thing into one video. Why 2PC is a dead end, choreography vs orchestration running the same saga, the isolation anomalies that bite at scale, and one food-order example carried across four services start to finish.

https://www.youtube.com/watch?v=--O9JUsvtEg

The Outbox Pattern Explained (Complete Guide)

Joud Awad — Mon, 13 Jul 2026 16:20:20 +0000

Your service writes an order to the database, then publishes an event to Kafka. Two lines. Nobody flags it in review.
In production, it quietly loses events.

The failure nobody catches until it's a support ticket: the DB commit succeeds, then the process dies before the Kafka publish lands. A network blip, a pod restart, a broker that's down for four seconds. Now the order exists in your database, but the shipping service never heard about it. Your data says one thing, your event stream says another, and nothing threw an error.

This is the dual-write problem. You're writing to two systems that don't share a transaction, and no amount of reordering makes "save then publish" atomic. Flip the two lines and you get the mirror bug: event out, DB write fails.

The outbox pattern fixes it by refusing to do two writes at all.
Instead of publishing to Kafka, you insert the event into an outbox table inside the same local transaction as your business data. One transaction, one database. Both rows commit or neither does. Your database already guarantees that. A separate relay then reads the outbox and forwards events to the broker.

That's the whole trick. You turn a distributed-transaction problem into a boring local one, and databases are very good at boring local transactions.
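
A minimal sketch of the write side, using SQLite as a stand-in for whatever database the service already has. The table layout, the `order.created` topic name, and the `fail_after_order` switch are assumptions made for the demo.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL);
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    sent INTEGER NOT NULL DEFAULT 0
);
""")


def place_order(order_id: int, item: str, fail_after_order: bool = False) -> None:
    """Insert the order and its event in one local transaction."""
    with conn:  # commits on success, rolls back if the block raises
        conn.execute("INSERT INTO orders (id, item) VALUES (?, ?)", (order_id, item))
        if fail_after_order:
            raise RuntimeError("simulated crash between the two inserts")
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order.created", json.dumps({"order_id": order_id, "item": item})),
        )


place_order(1, "pizza")
try:
    place_order(2, "burger", fail_after_order=True)
except RuntimeError:
    pass

orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
events = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
print(orders, events)  # 1 1
```

The failed second order leaves no row in either table, so the two can never disagree.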

Now the part most "complete guides" skip.
The outbox does not give you exactly-once delivery. It gives you at-least-once. The relay can publish an event and then crash before it marks the row as sent, so it sends again on restart. Log-based CDC has the same gap: it ships the change, dies before the offset is acknowledged, resends on recovery. If your consumers aren't idempotent, you didn't kill the bug. You just moved it downstream.
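
As a rough illustration of what "idempotent consumer" means here, the sketch below deduplicates on an event id. `ShippingConsumer` and the `event_id` field are hypothetical, and the in-memory set is an assumption for brevity: a real consumer would need to persist the seen ids together with its side effect.

```python
from typing import Dict, List, Set


class ShippingConsumer:
    """Hypothetical consumer that tolerates at-least-once delivery."""

    def __init__(self) -> None:
        self.seen_event_ids: Set[str] = set()
        self.shipments: List[int] = []

    def handle(self, event: Dict) -> bool:
        """Process an event once; return False when it is a duplicate."""
        if event["event_id"] in self.seen_event_ids:
            return False
        self.shipments.append(event["order_id"])
        self.seen_event_ids.add(event["event_id"])
        return True


consumer = ShippingConsumer()
event = {"event_id": "evt-1", "order_id": 42}
print(consumer.handle(event))  # True
print(consumer.handle(event))  # False, the relay resent it after a restart
print(consumer.shipments)      # [42]
```
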
Two ways to run the relay, and neither is the "right" one:
Polling: a loop that queries the table for unsent rows. Easy to reason about. It also hammers the database, ties your latency to the poll interval, and needs locking so two instances don't grab the same row.

CDC: Debezium tailing the Postgres WAL or MySQL binlog. Near real-time, no query load, keeps commit order. You pay for it in operations: replication slots, WAL retention, and another connector to babysit.

Polling is completely fine to start with. You can switch to CDC later without changing the outbox schema.
One thing everyone forgets: that table grows forever unless you delete processed rows. You find this out somewhere around a few hundred million rows, usually at the worst time.
There's no reliability magic here. Just an honest admission that you can't touch two systems atomically, so you stop pretending and design around it.
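
The polling relay and the cleanup step described above could look roughly like this. It is a single-instance sketch on SQLite: `relay_once`, `purge_sent`, and the `sent` flag are assumed names, and the `publish` callback stands in for a real broker client.

```python
import sqlite3
from typing import Callable, List

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "payload TEXT NOT NULL, sent INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany("INSERT INTO outbox (payload) VALUES (?)", [("e1",), ("e2",), ("e3",)])
conn.commit()


def relay_once(publish: Callable[[str], None], batch_size: int = 100) -> int:
    """Forward one batch of unsent rows in id order; return how many were sent."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for row_id, payload in rows:
        publish(payload)  # a crash right after this line means the row is sent again
        conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
        conn.commit()
    return len(rows)


def purge_sent() -> int:
    """Delete rows that were already forwarded so the table stops growing."""
    cur = conn.execute("DELETE FROM outbox WHERE sent = 1")
    conn.commit()
    return cur.rowcount


published: List[str] = []
print(relay_once(published.append, batch_size=2))  # 2
print(relay_once(published.append, batch_size=2))  # 1
print(published)                                   # ['e1', 'e2', 'e3']
print(purge_sent())                                # 3
```

With more than one relay instance this needs the row locking mentioned earlier; the sketch deliberately leaves that out.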

If you've run this in production: polling or CDC, and what pushed you one way?

Full end-to-end breakdown here: https://www.youtube.com/watch?v=BJvQdS0m-Kw

Free e-book with 60 scenario-based system design interview questions

Joud Awad — Thu, 09 Jul 2026 12:20:33 +0000

I wrote down the 60 questions I wish someone had handed me years ago.

For a long time I thought system design was about memorizing architectures. Draw the boxes, name a database, sprinkle in a cache, done. Then I ended up in rooms where the interviewer just kept asking “why,” and the pretty diagrams stopped saving me.

The real work is in the tradeoffs. What breaks at 10x traffic, why you’d accept eventual consistency and what it quietly costs you later, whether a queue is solving the problem or just hiding it somewhere you’ll meet again at 3am.

None of that fits on a flashcard.

So the e-book is 60+ scenario-based questions instead. They start beginner-friendly and climb to the kind that make experienced engineers go quiet for a second. Each one drops you into a real situation and makes you reason through it, the way a sharp interviewer would. Or the way production does, with less mercy.

Two ways to use it:

Prepping for interviews? It’s reps on the questions that keep coming up.
Already building these systems? It’s a way to pressure-test what you assume you know.
It’s free. No email wall, no upsell at the end. Just the questions and the reasoning behind them.

Link’s below. One thing I’m curious about: which system design question did you walk in thinking you had cold, right until you didn’t?

https://drive.google.com/file/d/1RZGN4VZgFc5w8LRkhpoXfpmtfB_n9hzY/view?usp=sharing

CQRS Explained

Joud Awad — Mon, 06 Jul 2026 10:39:02 +0000

Forcing reads and writes through the same model is a common problem in system design, and it is the problem CQRS is built to solve. However, many explanations reduce CQRS to a binary choice: either you implement it or you don't.

In reality, CQRS is more like a ladder. In my latest video, I break down CQRS into four levels, each corresponding to specific challenges that arise as a system scales. Each level addresses a unique problem and comes with its own complexities.

The common pitfall? Teams often leap to the most advanced level, assuming it’s the best approach, and end up with unnecessary complexity. The lower levels can provide significant benefits at a fraction of the cost.

I explore all four levels, detailing the use case for each and pinpointing where the complexity begins to escalate. Check it out here:

https://youtu.be/fi6HbqmVJL0

56/60 Days System Design Questions

Joud Awad — Wed, 01 Jul 2026 16:29:39 +0000

Your background job ran for 4 minutes and nobody knows if it finished.

That's not a job queue problem. That's a missing design problem.

Long-running jobs break every assumption you built for synchronous APIs. Your load balancer times out after 30s. Your mobile client doesn't know whether to retry. Your retry logic re-runs a job that already half-completed.

Here's the real scenario:

You're processing a video upload. The job takes 2–8 minutes. Millions of users.

What do you expose to the client?

A) Polling endpoint — client hits /jobs/:id/status every 5s until done
B) Webhook — job fires a POST to client's callback URL on completion
C) SSE / WebSocket — server pushes progress updates in real time
D) Synchronous wait — keep the HTTP connection open until the job finishes

One scales to millions without coupling your infrastructure to client uptime.

The others have hard production failure modes most teams don't discover until 3 AM.

The deeper problem isn't transport — it's these 4 things nobody gets right the first time:

→ Idempotency. Every job must be safe to re-run. If your retry logic can double-charge, double-send, or double-process — you don't have retries, you have bugs waiting.

→ Progress granularity. "0% → 100%" is useless for a 6-minute job. You need intermediate states: queued, processing, transcoding, uploading, complete. Clients need something to show users.

→ Timeout vs failure. A job that stops responding isn't the same as a job that failed. Dead workers, OOM kills, spot instance evictions — your queue needs a heartbeat or a visibility timeout, not just a try/catch.

→ Deduplication. The client will retry. Your queue will redeliver. You need a dedup key scoped to the original request — not the job run.
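
Two of those points, the request-scoped dedup key and the intermediate states, fit in a small sketch. `JobStore` is a hypothetical in-memory stand-in for a job table, and the key format is whatever the client sends; nothing here is a specific queue's API.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict

# The intermediate states named above, in order.
STATES = ["queued", "processing", "transcoding", "uploading", "complete"]


@dataclass
class Job:
    job_id: str
    state: str = "queued"


@dataclass
class JobStore:
    """In-memory stand-in for a job table keyed by a client-supplied dedup key."""
    jobs: Dict[str, Job] = field(default_factory=dict)
    by_dedup_key: Dict[str, str] = field(default_factory=dict)

    def submit(self, dedup_key: str) -> Job:
        """Return the existing job for a retried request instead of creating another."""
        if dedup_key in self.by_dedup_key:
            return self.jobs[self.by_dedup_key[dedup_key]]
        job = Job(job_id=str(uuid.uuid4()))
        self.jobs[job.job_id] = job
        self.by_dedup_key[dedup_key] = job.job_id
        return job

    def advance(self, job_id: str) -> str:
        """Move a job to the next state; a finished job stays complete."""
        job = self.jobs[job_id]
        index = STATES.index(job.state)
        job.state = STATES[min(index + 1, len(STATES) - 1)]
        return job.state


store = JobStore()
first = store.submit("upload-abc")
retry = store.submit("upload-abc")     # the client retried the same request
print(first.job_id == retry.job_id)    # True
print(len(store.jobs))                 # 1
print(store.advance(first.job_id))     # processing
```

The key belongs to the original request, so a retry maps back to the same job no matter how many times the worker itself has been restarted.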

Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments.

#30DaysOfSystemDesign #SystemDesign #BackendEngineering #DistributedSystems

55/60 Days System Design Questions

Joud Awad — Tue, 30 Jun 2026 16:46:23 +0000

You built an agent. It works.

Now you need 5 of them running in parallel, sharing state, and handing off work to each other.

Your pipeline breaks on the first real workload.

Here's the setup:
You're building a research agent system. A user asks a complex question. You need to:

• Fan out to 3 specialized sub-agents simultaneously
• One agent might spawn 2 more based on what it finds
• They all write back to shared context
• A final agent synthesizes everything
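
To make the shape of that workload concrete, here is a single-process sketch with `asyncio`. It only illustrates the fan-out, the dynamic spawn, the shared context, and the final synthesis; it is not an argument for any of the options that follow. The agent names and the rule that only `papers` spawns follow-ups are invented for the demo.

```python
import asyncio
from typing import Dict, List


async def sub_agent(name: str, question: str, context: Dict[str, str]) -> List[str]:
    """Hypothetical sub-agent: writes a finding and may request follow-up agents."""
    await asyncio.sleep(0)  # stands in for an LLM or tool call
    context[name] = f"{name} findings for: {question}"
    # Assumption for the demo: only the 'papers' agent spawns two more agents.
    return ["citations", "datasets"] if name == "papers" else []


async def research(question: str) -> str:
    context: Dict[str, str] = {}  # the shared context all agents write to
    first_wave = ["web", "papers", "code"]
    spawned = await asyncio.gather(*(sub_agent(n, question, context) for n in first_wave))
    second_wave = [name for names in spawned for name in names]
    await asyncio.gather(*(sub_agent(n, question, context) for n in second_wave))
    # Final synthesis step: here just a sorted join of everything in shared context.
    return " | ".join(context[key] for key in sorted(context))


summary = asyncio.run(research("q"))
print(summary.count("findings"))  # 5
```

In one process with one event loop this is easy. The question in this post is what replaces the local dictionary and the two `gather` calls once the agents are separate services.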

Classic multi-agent orchestration. You have 4 options for how agents coordinate.

A) Centralized Orchestrator — one controller agent dispatches tasks, collects results, manages shared state. Agents are dumb workers.

B) Decentralized Peer Handoff — each agent decides who gets the task next. No central controller. Agents communicate directly.

C) Shared Message Queue + Blackboard — all agents read/write to a shared blackboard. Coordination happens through state, not calls.

D) Hierarchical Nesting — orchestrator spawns sub-orchestrators. Each sub-tree is self-coordinating. Recursive decomposition of the problem.

Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments.

Drop your answer 👇

54/60 Days System Design Questions

Joud Awad — Mon, 29 Jun 2026 16:20:56 +0000

You built a RAG pipeline. Works great in dev.

6 months later, your users complain: "The search results are garbage."

You haven't changed a line of code.

Here's what happened:

Your product evolved. New features, new docs, new support tickets. The data drifted — but your embedding index didn't.

Now you're serving a 400GB FAISS index that was last rebuilt in January. Your chunks are stale. Your nearest-neighbor results point to deprecated docs. Your LLM is confidently hallucinating from outdated context.

You need to fix this. 4 engineers each propose a solution:

A) Scheduled full rebuild
Every Sunday, re-embed the entire corpus from scratch. Replace the index atomically. Slow (4h+ at scale), expensive, but always fresh.

B) Incremental upserts + soft delete
On every document change, re-embed only the affected chunks. Mark deleted chunks as tombstoned. Keep a version field on each vector. Index size grows over time; compact quarterly.

C) Embedding version registry + hot swap
Track which embedding model version produced each vector. When the model drifts (fine-tuned or upgraded), invalidate the mismatched vectors and rebuild only those. Two indexes run in parallel during migration. Route traffic by model version.

D) Approximate staleness detection
Run a nightly job that samples 1% of your corpus, re-embeds it, and measures cosine distance against the stored vector. If drift exceeds a threshold, trigger a full rebuild. Otherwise, skip it. Cheap monitoring, reactive rebuilds.

Real constraint: your corpus is 50M chunks. Full rebuild = 4 hours + ~$800 in embedding API cost. You deploy model updates every 6 weeks.
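
Some quick arithmetic on those constraints, as a sketch. It assumes embedding cost scales linearly with the number of chunks, which the post does not claim, and it ignores compute and engineering time.

```python
CHUNKS = 50_000_000
REBUILD_COST_USD = 800
WEEKS_PER_YEAR = 52
MODEL_UPDATE_INTERVAL_WEEKS = 6

# Option A style: a full rebuild every week.
weekly_rebuild_cost_per_year = WEEKS_PER_YEAR * REBUILD_COST_USD

# How many rebuilds a year if you only rebuilt on model updates.
rebuilds_per_year_on_model_update = WEEKS_PER_YEAR / MODEL_UPDATE_INTERVAL_WEEKS

# Assumption: cost is linear in chunk count.
cost_per_million_chunks = REBUILD_COST_USD / (CHUNKS / 1_000_000)

# Option D style: re-embedding a 1% sample every night.
nightly_sample_cost = REBUILD_COST_USD / 100
nightly_sampling_cost_per_year = 365 * nightly_sample_cost

print(weekly_rebuild_cost_per_year)                  # 41600
print(round(rebuilds_per_year_on_model_update, 1))   # 8.7
print(cost_per_million_chunks)                       # 16.0
print(nightly_sampling_cost_per_year)                # 2920.0
```
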

Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments.

#30DaysOfSystemDesign #SystemDesign #MachineLearning #MLEngineering

53/60 Days System Design Questions

Joud Awad — Sun, 28 Jun 2026 16:20:06 +0000

Your migration ran fine in staging.

Then you ran it in production.

The app went down.

Not because the SQL was wrong. Because you ran it on a live table with 40 million rows while 8 services were actively writing to it.

Your setup:
→ PostgreSQL. users table. 40M rows. Active writes from 8 services.
→ Product request: split full_name into first_name + last_name.
→ You have a 2-hour maintenance window tonight.

The engineering question: how do you ship this without downtime?

A) Run ALTER TABLE to drop full_name and add first_name + last_name in a single migration during the maintenance window.

B) Add first_name + last_name as nullable columns first → backfill → update all services to write to both → drop full_name only after everything is migrated.

C) Create a new users_v2 table with the target schema → dual-write to both tables → flip the read pointer → drain the old table.

D) Add a DB view that aliases full_name as first_name || ' ' || last_name → let each service migrate off it at its own pace.
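
For reference, the first two steps of option B (add nullable columns, then backfill) can be sketched as below. SQLite is used only so the example runs without a server; the scenario itself is PostgreSQL. The split-on-first-space rule and the `backfill` helper are assumptions for illustration, and the dual-write and drop steps are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO users (id, full_name) VALUES (?, ?)",
    [(1, "Ada Lovelace"), (2, "Alan Turing"), (3, "Cher")],
)

# Expand: add the new columns as nullable so existing writers keep working.
conn.execute("ALTER TABLE users ADD COLUMN first_name TEXT")
conn.execute("ALTER TABLE users ADD COLUMN last_name TEXT")
conn.commit()


def backfill(batch_size: int) -> int:
    """Backfill in small batches so no single statement touches the whole table."""
    total = 0
    while True:
        rows = conn.execute(
            "SELECT id, full_name FROM users WHERE first_name IS NULL LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return total
        for user_id, full_name in rows:
            # Assumption: split on the first space; real names need a better rule.
            first, _, last = full_name.partition(" ")
            conn.execute(
                "UPDATE users SET first_name = ?, last_name = ? WHERE id = ?",
                (first, last, user_id),
            )
        conn.commit()  # keep each transaction short
        total += len(rows)


print(backfill(batch_size=2))  # 3
print(conn.execute("SELECT first_name, last_name FROM users ORDER BY id").fetchall())
```

`full_name` is still there at the end. Every service that reads or writes it keeps working until it has been moved over.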

Drop your answer 👇

52/60 Days System Design Questions

Joud Awad — Sat, 27 Jun 2026 16:12:50 +0000

Your API just shipped a breaking change.

/users now returns fullName instead of first_name + last_name. 3 mobile clients broke. 1 partner integration went down. Your on-call is not happy.

You had a versioning strategy. It just wasn't the right one.

There are 4 ways to version an API. Here's what actually happens when you pick each one in production:

A — URL path versioning (/v1/users, /v2/users)
Simple. Explicit. Every request makes the version visible in logs and caches. But now you're maintaining 2 full route trees. A bugfix in the business logic layer has to be patched in both. Teams quietly let v1 rot.

B — Header versioning (API-Version: 2)
Clean URLs. Version negotiation in the transport layer, not the path. Harder to test in a browser, invisible in logs unless you instrument for it, and clients forget to send the header — defaulting to whatever your server decides "latest" means.

C — Query param versioning (/users?version=2)
Fast to implement. Zero client SDK changes. Cache-unfriendly — every CDN layer treats ?version=1 and ?version=2 as separate cache keys. Works until you have 40 endpoints and version drift becomes untrackable.

D — Content negotiation (Accept: application/vnd.api.v2+json)
The REST-purist approach. Semantically correct — you're asking for a representation, not a route. Almost nobody implements it right. Client library support is inconsistent. One wrong Accept header and you get a 406.
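
To see the four side by side, here is a small resolver that reads the version from each location in turn. `resolve_version`, the precedence order, and `DEFAULT_VERSION` are assumptions for illustration. Header lookup is case-sensitive here for brevity, whereas real HTTP header names are not.

```python
import re
from typing import Mapping
from urllib.parse import parse_qs, urlsplit

DEFAULT_VERSION = 1  # assumption: what this server falls back to


def resolve_version(url: str, headers: Mapping[str, str]) -> int:
    """Return the API version requested via path, header, query param, or Accept."""
    parts = urlsplit(url)
    path_match = re.match(r"^/v(\d+)/", parts.path)
    if path_match:                                   # A: /v2/users
        return int(path_match.group(1))
    if "API-Version" in headers:                     # B: API-Version: 2
        return int(headers["API-Version"])
    query = parse_qs(parts.query)
    if "version" in query:                           # C: /users?version=2
        return int(query["version"][0])
    accept_match = re.search(
        r"application/vnd\.api\.v(\d+)\+json", headers.get("Accept", "")
    )
    if accept_match:                                 # D: vendor media type
        return int(accept_match.group(1))
    return DEFAULT_VERSION                           # the silent default from B


print(resolve_version("/v2/users", {}))                                         # 2
print(resolve_version("/users", {"API-Version": "2"}))                          # 2
print(resolve_version("/users?version=3", {}))                                  # 3
print(resolve_version("/users", {"Accept": "application/vnd.api.v2+json"}))     # 2
print(resolve_version("/users", {}))                                            # 1
```

The last line is the case worth staring at: a client that sends nothing still gets an answer, and which answer is entirely the server's decision.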

Which strategy does your team use?

51/60 Days System Design Questions

Joud Awad — Fri, 26 Jun 2026 16:37:54 +0000

You're building a B2B SaaS product. 50 enterprise customers. Each one wants their data isolated. Some are on free plans. A few are paying $50k/year and demanding SLA guarantees.

Your current setup:
→ One database. One schema. A tenant_id column on every table.
→ One app server handling all traffic.
→ A free-tier customer running a badly-written bulk export just hammered your DB for 40 seconds. A paying enterprise customer's checkout flow timed out.

Your investors are not happy. Neither is that enterprise customer.

The engineering question: how do you isolate tenants without rebuilding the whole product?

A) Keep one shared DB — add row-level security + query budgets per tenant to enforce limits.

B) Schema-per-tenant — every customer gets their own schema in the same Postgres instance, migrations run per-schema.

C) Database-per-tenant (silo model) — each enterprise customer gets a dedicated DB. Free tier stays pooled.

D) Middleware bridge — route requests to tenant-specific DB clusters based on a tenant registry, free tier stays on shared pool.
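
The registry lookup in option D is simple enough to sketch. `TenantRegistry`, `promote`, and both connection strings are hypothetical, and moving the tenant's data is not shown.

```python
from dataclasses import dataclass, field
from typing import Dict

SHARED_POOL_DSN = "postgres://shared-pool/app"  # hypothetical connection string


@dataclass
class TenantRegistry:
    """Maps tenant ids to a dedicated cluster; anyone unlisted uses the shared pool."""
    dedicated: Dict[str, str] = field(default_factory=dict)

    def promote(self, tenant_id: str, dsn: str) -> None:
        """Point a tenant at its own cluster (data migration not shown)."""
        self.dedicated[tenant_id] = dsn

    def dsn_for(self, tenant_id: str) -> str:
        """Resolve where this tenant's queries should go."""
        return self.dedicated.get(tenant_id, SHARED_POOL_DSN)


registry = TenantRegistry()
registry.promote("acme-enterprise", "postgres://acme-cluster/app")

print(registry.dsn_for("acme-enterprise"))  # postgres://acme-cluster/app
print(registry.dsn_for("free-tier-123"))    # postgres://shared-pool/app
```
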

One of these is a band-aid. One will collapse under 500 tenants. One is how Notion, Salesforce, and every serious B2B at scale actually operates.

Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments (including the noisy neighbor failure mode nobody warns you about).

If your team is building multi-tenant systems, share this. The wrong isolation model is a rewrite waiting to happen.

Drop your answer 👇

#30DaysOfSystemDesign #SystemDesign #BackendEngineering #SoftwareArchitecture