<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Joud Awad</title>
    <description>The latest articles on DEV Community by Joud Awad (@thejoud1997).</description>
    <link>https://dev.to/thejoud1997</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1238326%2F5d65a5d6-611d-4526-9bc2-d2d8643d5226.png</url>
      <title>DEV Community: Joud Awad</title>
      <link>https://dev.to/thejoud1997</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/thejoud1997"/>
    <language>en</language>
    <item>
      <title>Day 36/60 System Design Questions</title>
      <dc:creator>Joud Awad</dc:creator>
      <pubDate>Thu, 11 Jun 2026 17:46:09 +0000</pubDate>
      <link>https://dev.to/thejoud1997/day-3660-days-system-design-questions-5h31</link>
      <guid>https://dev.to/thejoud1997/day-3660-days-system-design-questions-5h31</guid>
      <description>&lt;p&gt;Your API response went from 320ms to 95ms after a CDN switch.&lt;/p&gt;

&lt;p&gt;Nothing changed on the server. Same origin. Same payload.&lt;/p&gt;

&lt;p&gt;The only difference: the CDN started speaking HTTP/3 to your clients.&lt;/p&gt;

&lt;p&gt;Here's the setup:&lt;br&gt;
Mobile app → CDN → Load Balancer → Origin (NestJS) → DB&lt;/p&gt;

&lt;p&gt;Every request goes through the CDN first. Latency is acceptable until clients hit high packet-loss conditions (mobile 4G-to-5G handoffs, flaky WiFi, Asia-Pacific routes), where p99 latency climbs past 800ms. The CDN supports HTTP/2 and HTTP/3. What do you enable?&lt;/p&gt;

&lt;p&gt;A) HTTP/2 only — multiplexing over a single TCP connection eliminates the old HTTP/1.1 head-of-line problem. Proven, widely supported.&lt;/p&gt;

&lt;p&gt;B) HTTP/3 only. Built on QUIC (UDP), eliminates TCP head-of-line blocking entirely, 0-RTT connection resumption. Modern clients handle it fine.&lt;/p&gt;

&lt;p&gt;C) HTTP/2 to origin, HTTP/2 to clients. Maximize multiplexing end-to-end, avoid QUIC instability in enterprise firewalls.&lt;/p&gt;

&lt;p&gt;D) HTTP/3 to clients, HTTP/2 to origin. QUIC handles the lossy last mile, HTTP/2 handles the stable datacenter leg.&lt;/p&gt;

&lt;p&gt;One of these is how Netflix, Cloudflare, and most major CDNs actually handle this in production.&lt;/p&gt;

&lt;p&gt;Pick one (A, B, C, or D) and tell me why. Full breakdown in the comments.&lt;/p&gt;

&lt;p&gt;If your team is debating CDN config or protocol upgrades, share this. The tradeoff is sharper than most people think.&lt;/p&gt;

&lt;p&gt;Drop your answer 👇&lt;/p&gt;

&lt;p&gt;#30DaysOfSystemDesign #SystemDesign #HTTP3 #WebPerformance&lt;/p&gt;

</description>
      <category>abotwrotethis</category>
      <category>systemdesign</category>
      <category>network</category>
      <category>backend</category>
    </item>
    <item>
      <title>Day 35/60 System Design Questions</title>
      <dc:creator>Joud Awad</dc:creator>
      <pubDate>Wed, 10 Jun 2026 16:23:56 +0000</pubDate>
      <link>https://dev.to/thejoud1997/day-3560-system-design-questions-2kck</link>
      <guid>https://dev.to/thejoud1997/day-3560-system-design-questions-2kck</guid>
      <description>&lt;p&gt;This article was written with the assistance of AI tooling for structure and syntax. The concepts, tradeoffs, and production context are based on my own engineering experience and research. #ABotWroteThis&lt;/p&gt;

&lt;p&gt;You're building the "find nearby drivers" feature for a ride-hailing app.&lt;/p&gt;

&lt;p&gt;At peak, you have 500,000 active drivers updating their GPS location every 5 seconds. Riders query for drivers within 2km. At scale, you're doing ~100,000 proximity queries per second.&lt;/p&gt;

&lt;p&gt;Your naive implementation does this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;drivers&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;lat&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;lng&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It works fine at 1,000 drivers. At 500,000, it's a full table scan on every query. Latency hits 800ms. Riders see a spinner. Drivers miss trips.&lt;/p&gt;

&lt;p&gt;Your team proposes 4 approaches to fix this:&lt;/p&gt;

&lt;p&gt;A) Geohash partitioning — Encode each driver's location into a geohash string. Index by geohash prefix. Proximity queries become a string lookup on the index.&lt;/p&gt;

&lt;p&gt;B) PostGIS with spatial indexes — Add a PostGIS extension to Postgres. Use a proper R-tree/GiST spatial index for bounding-box and radius queries.&lt;/p&gt;

&lt;p&gt;C) Quadtree in memory — Keep all active driver positions in a quadtree data structure in a Redis-backed in-memory service. Decompose space recursively until each cell has ≤ N drivers.&lt;/p&gt;

&lt;p&gt;D) H3 hexagonal grid (Uber's system) — Divide the earth into hexagonal cells at multiple resolutions. Assign each driver to a cell. Queries check the target cell + 6 neighbors at the right resolution.&lt;/p&gt;

&lt;p&gt;You need sub-50ms p99 latency, real-time updates, and it has to stay accurate at cell boundaries.&lt;/p&gt;

&lt;p&gt;Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments.&lt;/p&gt;

&lt;p&gt;If your team has argued about spatial indexing before, share this. Worth the debate.&lt;/p&gt;

&lt;p&gt;Drop your answer 👇&lt;/p&gt;

&lt;p&gt;#30DaysOfSystemDesign #SystemDesign #SoftwareArchitecture #DistributedSystems&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>distributedsystems</category>
      <category>database</category>
      <category>backend</category>
    </item>
    <item>
      <title>34/60 Days System Design Questions</title>
      <dc:creator>Joud Awad</dc:creator>
      <pubDate>Tue, 09 Jun 2026 16:30:01 +0000</pubDate>
      <link>https://dev.to/thejoud1997/3460-days-system-design-questions-1k24</link>
      <guid>https://dev.to/thejoud1997/3460-days-system-design-questions-1k24</guid>
      <description>&lt;p&gt;Your AI feature works in the demo.&lt;/p&gt;

&lt;p&gt;It fails in production 3 weeks later. Nobody touched the model. Nobody changed the code.&lt;/p&gt;

&lt;p&gt;The only thing that changed: the inputs got messier.&lt;/p&gt;

&lt;p&gt;Here's the setup:&lt;/p&gt;

&lt;p&gt;You're at a SaaS company. 50,000 support tickets a week. Your team builds an AI triage system — GPT-4o classifies each ticket into 6 categories (billing, bug, feature request, account access, security, other) so the right team gets it instantly.&lt;/p&gt;

&lt;p&gt;In dev, it reaches 71% accuracy. You need 90%+ to cut manual review.&lt;/p&gt;

&lt;p&gt;The model is locked. The budget for inference isn't unlimited. You need to close the 19-point gap.&lt;/p&gt;

&lt;p&gt;Here are your four options:&lt;/p&gt;

&lt;p&gt;A) Zero-shot with a better system prompt — rewrite the instructions, add explicit category definitions, specify edge case rules. No examples.&lt;/p&gt;

&lt;p&gt;B) Few-shot examples — add 3–5 real classified tickets directly in the prompt. One example per category edge case.&lt;/p&gt;

&lt;p&gt;C) Chain-of-Thought — add "think step-by-step before answering" to the prompt. Force the model to reason through the ticket before outputting the category.&lt;/p&gt;

&lt;p&gt;D) Self-Consistency — run each ticket through the model 5 times with temperature=0.7, take the majority vote across outputs.&lt;/p&gt;
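
&lt;p&gt;To make the majority-vote step in option D concrete, here is a minimal Python sketch. The sample_fn argument is a hypothetical stand-in for one sampled model call, and the canned labels are made up so the snippet runs offline; it is an illustration under those assumptions, not a real client integration.&lt;/p&gt;

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def classify_with_self_consistency(ticket, sample_fn, runs=5):
    """Call sample_fn(ticket) `runs` times and return the majority label.

    sample_fn is a hypothetical stand-in for one sampled model call
    (for example at temperature 0.7); it is not a real client API.
    """
    votes = [sample_fn(ticket) for _ in range(runs)]
    return majority_vote(votes), votes

# Deterministic stub in place of the model, so the sketch runs offline.
_canned = iter(["billing", "bug", "billing", "billing", "other"])
label, votes = classify_with_self_consistency(
    "I was charged twice this month", lambda ticket: next(_canned)
)
print(label)  # billing
```

&lt;p&gt;With 5 runs per ticket, 50,000 tickets a week becomes 250,000 model calls a week, which is the cost side of this option.&lt;/p&gt;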

&lt;p&gt;Same model. Same ticket. Four different accuracy + cost profiles.&lt;/p&gt;

&lt;p&gt;Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments.&lt;/p&gt;

&lt;p&gt;#30DaysOfSystemDesign #AI #SystemDesign #BackendEngineering&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>systemdesign</category>
      <category>backend</category>
    </item>
    <item>
      <title>33/60 Days System Design Questions</title>
      <dc:creator>Joud Awad</dc:creator>
      <pubDate>Mon, 08 Jun 2026 15:55:29 +0000</pubDate>
      <link>https://dev.to/thejoud1997/3360-days-system-design-questions-3e2c</link>
      <guid>https://dev.to/thejoud1997/3360-days-system-design-questions-3e2c</guid>
      <description>&lt;p&gt;Your order service takes 200 writes/sec at peak.&lt;br&gt;
You audit 6 months of data. Something's off — two orders show the same ID, different totals.&lt;/p&gt;

&lt;p&gt;You have the current state. You don't have how it got there.&lt;/p&gt;

&lt;p&gt;Your DB is a graveyard of overwritten rows.&lt;/p&gt;

&lt;p&gt;Here's the system:&lt;/p&gt;

&lt;p&gt;• OrderService → Postgres (current state only)&lt;br&gt;
• Events: placed, updated, cancelled, refunded&lt;br&gt;
• Every UPDATE overwrites the previous row&lt;br&gt;
• No audit log. No event history. No replay.&lt;/p&gt;

&lt;p&gt;A billing dispute just landed. You need to reconstruct exactly what happened to Order #8471. You can't.&lt;/p&gt;

&lt;p&gt;That's the problem Event Sourcing solves.&lt;/p&gt;

&lt;p&gt;Instead of storing the current state, you store the sequence of events that produced it.&lt;/p&gt;
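
&lt;p&gt;A minimal Python sketch of that idea: current state is rebuilt by folding the events in order. The event types come from the list above; the field names, the amounts, and the rule that a refund sets the total to 0 are assumptions made for illustration.&lt;/p&gt;

```python
def apply_event(state, event):
    """Fold one event into the order state and return the new state."""
    kind = event["type"]
    if kind == "placed":
        return {"status": "placed", "total": event["total"]}
    if kind == "updated":
        return {**state, "total": event["total"]}
    if kind == "cancelled":
        return {**state, "status": "cancelled"}
    if kind == "refunded":
        # Assumption: a refund zeroes the order total.
        return {**state, "status": "refunded", "total": 0}
    raise ValueError(f"unknown event type: {kind}")

def replay(events):
    """Rebuild current state by applying events in order, oldest first."""
    state = {}
    for event in events:
        state = apply_event(state, event)
    return state

# Hypothetical history for one order; the amounts are made up.
history = [
    {"type": "placed", "total": 120},
    {"type": "updated", "total": 95},
    {"type": "refunded"},
]
print(replay(history))      # {'status': 'refunded', 'total': 0}
print(replay(history[:2]))  # {'status': 'placed', 'total': 95}
```

&lt;p&gt;Replaying a prefix of the history, as in the last line, is what lets you answer what an order looked like at an earlier point in time.&lt;/p&gt;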

&lt;p&gt;What's your approach when redesigning this service?&lt;/p&gt;

&lt;p&gt;A) Event Sourcing — append-only event log as the source of truth, current state derived from replaying events.&lt;br&gt;
B) Change Data Capture (CDC) — keep Postgres as-is, but stream all row changes to Kafka for an audit trail.&lt;br&gt;
C) Add an audit_log table — trigger-based shadow writes on every INSERT/UPDATE/DELETE.&lt;br&gt;
D) Dual-write — write to both the current-state table and a separate events table on every operation.&lt;/p&gt;

&lt;p&gt;One of these gives you full replay, projection flexibility, and a real source of truth. The others are patches.&lt;/p&gt;

&lt;p&gt;Pick one — A, B, C, or D — and tell me why. I'll drop the full breakdown in the comments.&lt;/p&gt;

&lt;p&gt;If your team is arguing about audit trails or event-driven redesigns, tag someone who needs to see this.&lt;/p&gt;

&lt;p&gt;Drop your answer 👇&lt;/p&gt;

&lt;p&gt;#30DaysOfSystemDesign #SystemDesign #EventSourcing #SoftwareArchitecture&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>eventdriven</category>
      <category>backend</category>
      <category>database</category>
    </item>
    <item>
      <title>32/60 Days System Design Questions!</title>
      <dc:creator>Joud Awad</dc:creator>
      <pubDate>Sun, 07 Jun 2026 18:22:35 +0000</pubDate>
      <link>https://dev.to/thejoud1997/3260-days-system-design-questions-288j</link>
      <guid>https://dev.to/thejoud1997/3260-days-system-design-questions-288j</guid>
      <description>&lt;p&gt;Your startup just got its first SOC 2 audit.&lt;/p&gt;

&lt;p&gt;The auditor asks: "Where are your database passwords, API keys, and service tokens stored?"&lt;/p&gt;

&lt;p&gt;Your senior engineer goes quiet.&lt;/p&gt;

&lt;p&gt;Turns out half of them are in .env files committed to git 18 months ago. Three are hardcoded in Lambda environment variables. One is in a Slack message from 2023.&lt;/p&gt;

&lt;p&gt;You have 6 services in production, 4 environments, and zero rotation policy.&lt;/p&gt;

&lt;p&gt;Here's the setup:&lt;/p&gt;

&lt;p&gt;• NestJS API → Postgres (password in env var)&lt;br&gt;
• NestJS API → Stripe (API key in env var)&lt;br&gt;
• Background workers → SQS, S3 (AWS credentials in env var)&lt;br&gt;
• 3rd-party webhooks → HMAC secrets in env var&lt;br&gt;
• Zero rotation. Zero audit trail. Zero centralized access control.&lt;/p&gt;

&lt;p&gt;You need to fix this. And you can't take downtime.&lt;/p&gt;

&lt;p&gt;A) Move everything to AWS Secrets Manager — SDK calls at runtime, IAM controls access, auto-rotation built in.&lt;/p&gt;

&lt;p&gt;B) Use HashiCorp Vault — dynamic secrets, fine-grained policies, works across any cloud or on-prem.&lt;/p&gt;

&lt;p&gt;C) Use environment variables injected at deploy time via CI/CD — secrets stored in GitHub Actions / GitLab CI secrets vault, never touch disk.&lt;/p&gt;

&lt;p&gt;D) Encrypt secrets with KMS and store ciphertext in your own database — decrypt at runtime, full control.&lt;/p&gt;

&lt;p&gt;All four are used in production at real companies.&lt;/p&gt;

&lt;p&gt;Pick one — A, B, C, or D — and tell me why. I'll drop the full breakdown in the comments.&lt;/p&gt;

&lt;p&gt;If your team is having this argument right now, share this post. Someone needs to see it.&lt;/p&gt;

&lt;p&gt;Drop your answer 👇&lt;/p&gt;

&lt;p&gt;#30DaysOfSystemDesign #SystemDesign #BackendEngineering #CloudArchitecture&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>security</category>
      <category>automation</category>
      <category>backend</category>
    </item>
    <item>
      <title>31/60 Days System Design Questions!</title>
      <dc:creator>Joud Awad</dc:creator>
      <pubDate>Sat, 06 Jun 2026 17:40:39 +0000</pubDate>
      <link>https://dev.to/thejoud1997/3160-days-system-design-questions-koo</link>
      <guid>https://dev.to/thejoud1997/3160-days-system-design-questions-koo</guid>
      <description>&lt;p&gt;Your e-commerce platform just crossed 2M daily active users. 70% are in the US, 30% in Europe.&lt;/p&gt;

&lt;p&gt;Latency complaints are piling up from European users — 380ms average round-trip to your US-East region. Support tickets are up 40%. Black Friday is in 6 weeks.&lt;/p&gt;

&lt;p&gt;Your infrastructure: single AWS us-east-1 region, RDS PostgreSQL (primary), Redis cache, 12 microservices behind an API Gateway.&lt;/p&gt;

&lt;p&gt;You need to get European latency under 80ms. The engineering team is debating four approaches.&lt;/p&gt;

&lt;p&gt;Here's your constraint: you cannot afford a full database rewrite, and you need this shipped before Black Friday.&lt;/p&gt;

&lt;p&gt;A) Active-Active multi-region — deploy the full stack in eu-west-1, use a distributed database (CockroachDB or Aurora Global), route users to the nearest region. Writes go to both regions simultaneously.&lt;/p&gt;

&lt;p&gt;B) Active-Passive with read replicas — keep us-east-1 as primary, spin up eu-west-1 as a hot standby with read replicas. European reads go local, writes still go to US. Failover in minutes if US goes down.&lt;/p&gt;

&lt;p&gt;C) CDN + Edge caching — keep the single region, push static assets and cacheable API responses to CloudFront edge nodes in Europe. No database changes.&lt;/p&gt;

&lt;p&gt;D) Active-Active with eventual consistency — deploy full stack in both regions, allow each region to own its writes, sync asynchronously. Accept that a European user might see a US write 200ms late.&lt;/p&gt;

&lt;p&gt;Three of these are real patterns production teams use. Only one actually solves the problem you have — under your constraints, before Black Friday.&lt;/p&gt;

&lt;p&gt;Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments.&lt;/p&gt;

&lt;p&gt;If your team has been in this exact debate, share this with them. The tradeoffs are what matter.&lt;/p&gt;

&lt;p&gt;Drop your answer 👇&lt;/p&gt;

&lt;p&gt;#30DaysOfSystemDesign #SystemDesign #DistributedSystems #SoftwareArchitecture&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>backend</category>
      <category>database</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>30/60 Days System Design Questions!</title>
      <dc:creator>Joud Awad</dc:creator>
      <pubDate>Fri, 05 Jun 2026 16:01:09 +0000</pubDate>
      <link>https://dev.to/thejoud1997/3060-days-system-design-questions-13im</link>
      <guid>https://dev.to/thejoud1997/3060-days-system-design-questions-13im</guid>
      <description>&lt;p&gt;You're building a file upload service. 10TB of user files today. 100TB in 12 months.&lt;/p&gt;

&lt;p&gt;Your team is having a fight.&lt;/p&gt;

&lt;p&gt;The backend lead says: "Just use S3. Done."&lt;br&gt;
The DevOps engineer says: "Mount an EBS volume. Simpler, faster."&lt;br&gt;
The platform architect says: "We need EFS — multiple services need to read the same files."&lt;br&gt;
The startup CTO says: "We can't afford cloud storage at scale. Self-host with MinIO."&lt;/p&gt;

&lt;p&gt;All four have shipped this in production. All four have opinions backed by scars.&lt;/p&gt;

&lt;p&gt;Here's the setup:&lt;br&gt;
— Upload service (NestJS) receives files from mobile + web clients&lt;br&gt;
— ML pipeline needs to read uploaded images for processing&lt;br&gt;
— Audit service needs read access to the same files&lt;br&gt;
— Files range from 5KB profile pics to 2GB video exports&lt;br&gt;
— You're on AWS&lt;/p&gt;

&lt;p&gt;What do you pick?&lt;/p&gt;

&lt;p&gt;A) S3 — object storage, infinite scale, pay per GB, no servers to manage.&lt;br&gt;
B) EBS — block storage, SSD-backed, attach to your EC2, low latency reads.&lt;br&gt;
C) EFS — managed NFS, shared across multiple EC2 instances simultaneously.&lt;br&gt;
D) MinIO on EC2 — S3-compatible self-hosted object storage, you own the infra.&lt;/p&gt;

&lt;p&gt;One of these is the obvious right answer for this scenario.&lt;br&gt;
One of them will destroy your architecture quietly over 6 months.&lt;br&gt;
One is great — but not for this problem.&lt;br&gt;
One is a trap that looks like cost savings.&lt;/p&gt;

&lt;p&gt;Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments.&lt;/p&gt;

&lt;p&gt;If your team has had this exact fight, share this. Someone needs to win it with data, not opinions.&lt;/p&gt;

&lt;p&gt;Drop your answer 👇&lt;/p&gt;

&lt;p&gt;#30DaysOfSystemDesign #SystemDesign #AWS #CloudArchitecture&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>devops</category>
      <category>backend</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>29/60 Days System Design Questions</title>
      <dc:creator>Joud Awad</dc:creator>
      <pubDate>Thu, 04 Jun 2026 17:48:30 +0000</pubDate>
      <link>https://dev.to/thejoud1997/2960-days-system-design-questions-5gch</link>
      <guid>https://dev.to/thejoud1997/2960-days-system-design-questions-5gch</guid>
      <description>&lt;p&gt;You have an AI product with 4 specialized agents: a Planner, a Researcher, a Coder, and a Reviewer.&lt;/p&gt;

&lt;p&gt;The Planner breaks down the task. The Researcher pulls context. The Coder implements. The Reviewer catches bugs.&lt;/p&gt;

&lt;p&gt;Simple on paper. In production, it's falling apart.&lt;/p&gt;

&lt;p&gt;Here's what's happening:&lt;/p&gt;

&lt;p&gt;• The Researcher sometimes returns before the Planner finishes → Coder gets incomplete context&lt;br&gt;
• The Reviewer flags issues → but there's no retry loop, so bugs ship anyway&lt;br&gt;
• One agent timeout hangs the entire pipeline for 40 seconds&lt;br&gt;
• You have no visibility into which agent failed or why&lt;/p&gt;

&lt;p&gt;You need to redesign the orchestration layer. What do you do?&lt;/p&gt;

&lt;p&gt;A) Centralized orchestrator — one controller calls each agent in sequence, owns retry logic, tracks state in a DB, times out per step individually.&lt;br&gt;
B) Choreography via event bus — agents publish/subscribe to events, no central controller, each agent triggers the next autonomously.&lt;br&gt;
C) DAG-based execution — model the pipeline as a directed acyclic graph, parallelize independent steps, block only on real dependencies.&lt;br&gt;
D) Supervisor pattern — a meta-agent monitors all others, detects failures, decides whether to retry, reroute, or escalate to a human.&lt;/p&gt;
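
&lt;p&gt;For readers who have not modeled a pipeline as a graph before, here is a small Python sketch of the dependency ordering described in option C, using the standard-library graphlib module. The dependency map is an assumption read off the pipeline description (plan, research, code, review), and run_agent is a hypothetical callback, not a real agent framework API.&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Each key depends on the agents in its set (assumed dependencies,
# read off the pipeline description: plan, research, code, review).
dependencies = {
    "Researcher": {"Planner"},
    "Coder": {"Researcher"},
    "Reviewer": {"Coder"},
}

def run_pipeline(deps, run_agent):
    """Run every agent after the agents it depends on; return the order."""
    order = []
    for agent in TopologicalSorter(deps).static_order():
        run_agent(agent)  # hypothetical callback that does the real work
        order.append(agent)
    return order

executed = run_pipeline(dependencies, lambda name: None)
print(executed)  # ['Planner', 'Researcher', 'Coder', 'Reviewer']
```

&lt;p&gt;This sketch runs the steps one at a time. TopologicalSorter also exposes prepare(), get_ready(), and done(), which can be used to hand independent steps to workers concurrently once the graph has branches.&lt;/p&gt;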

&lt;p&gt;All four exist in production AI systems. Only one handles your specific failure modes without introducing new ones.&lt;/p&gt;

&lt;p&gt;Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments.&lt;/p&gt;

&lt;p&gt;If you're building agentic systems, share this. Most teams hit these exact problems at month 2.&lt;/p&gt;

&lt;p&gt;Drop your answer 👇&lt;/p&gt;

&lt;p&gt;#30DaysOfSystemDesign #SystemDesign #AgenticAI #AIEngineering&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agentaichallenge</category>
      <category>systemdesign</category>
      <category>llm</category>
    </item>
    <item>
      <title>28/30 Days System Design Questions!</title>
      <dc:creator>Joud Awad</dc:creator>
      <pubDate>Wed, 03 Jun 2026 16:52:11 +0000</pubDate>
      <link>https://dev.to/thejoud1997/2830-days-system-design-questions-5hh1</link>
      <guid>https://dev.to/thejoud1997/2830-days-system-design-questions-5hh1</guid>
      <description>&lt;p&gt;You're building a semantic search feature for a B2B SaaS product.&lt;/p&gt;

&lt;p&gt;The corpus: 4 million support articles, docs, and user-generated tickets. Users type natural language queries. They expect Google-quality results — not keyword matching.&lt;/p&gt;

&lt;p&gt;Your current stack: PostgreSQL 15, Redis, and a Node.js backend. The search team says ILIKE and pg_trgm aren't cutting it. Embeddings are the answer. Now you need a place to store and query 1536-dimensional vectors (OpenAI text-embedding-ada-002) at &amp;lt;100ms p99.&lt;/p&gt;

&lt;p&gt;4 million rows. ~24GB of raw embeddings. Query volume: 300 req/s with weekend spikes to 900 req/s.&lt;/p&gt;
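
&lt;p&gt;The ~24GB figure can be checked with a few lines of Python. This assumes each dimension is stored as a 4-byte float32 with no compression, and it counts only the raw vectors, not any index built on top of them.&lt;/p&gt;

```python
VECTORS = 4_000_000      # rows in the corpus
DIMENSIONS = 1536        # ada-002 embedding width
BYTES_PER_FLOAT32 = 4    # assumption: float32 storage, no compression

raw_bytes = VECTORS * DIMENSIONS * BYTES_PER_FLOAT32
print(raw_bytes)                    # 24576000000
print(round(raw_bytes / 1e9, 1))    # 24.6 (GB, decimal)
print(round(raw_bytes / 2**30, 1))  # 22.9 (GiB)
```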

&lt;p&gt;Where do you store and query those vectors?&lt;/p&gt;

&lt;p&gt;A) pgvector extension on your existing PostgreSQL — store embeddings in a new column, query with the &amp;lt;=&amp;gt; cosine distance operator.&lt;br&gt;
B) Pinecone — fully managed vector database, serverless tier, no infra to run.&lt;br&gt;
C) Weaviate — open-source vector DB, self-hosted on Kubernetes, full control over indexing.&lt;br&gt;
D) Qdrant — open-source vector DB, Rust-based, self-hosted or cloud, optimized for high-throughput filtering.&lt;/p&gt;

&lt;p&gt;All four are used in production at scale. But only one fits this scenario without hidden costs that bite you at 300 req/s.&lt;/p&gt;

&lt;p&gt;Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments.&lt;/p&gt;

&lt;p&gt;If this is the debate your team is about to have, share it. These decisions are hard to reverse.&lt;/p&gt;

&lt;p&gt;Drop your answer 👇&lt;/p&gt;

&lt;p&gt;#30DaysOfSystemDesign #SystemDesign #VectorDatabases #MachineLearning&lt;/p&gt;

</description>
      <category>ai</category>
      <category>systemdesign</category>
      <category>database</category>
      <category>llm</category>
    </item>
    <item>
      <title>27/30 Days System Design Questions!</title>
      <dc:creator>Joud Awad</dc:creator>
      <pubDate>Tue, 02 Jun 2026 17:16:57 +0000</pubDate>
      <link>https://dev.to/thejoud1997/2730-days-system-design-questions-4l26</link>
      <guid>https://dev.to/thejoud1997/2730-days-system-design-questions-4l26</guid>
      <description>&lt;p&gt;Your LLM answers are wrong. Not hallucination-wrong — outdated-wrong.&lt;/p&gt;

&lt;p&gt;You shipped a customer support bot on GPT-4. Its training data ends in early 2024. Your product has changed 14 times since then. Every week, users get answers that were accurate 8 months ago and are flat-out wrong today.&lt;/p&gt;

&lt;p&gt;The team is debating the fix.&lt;/p&gt;

&lt;p&gt;Here's the setup:&lt;br&gt;
NestJS API → OpenAI GPT-4 + PostgreSQL (product knowledge base)&lt;br&gt;
~2,000 support queries/day, 15% return wrong answers tied to stale knowledge&lt;br&gt;
Knowledge base updates weekly — new pricing, new features, deprecated flows&lt;br&gt;
Budget: mid-size startup, not training custom models from scratch&lt;br&gt;
You need accurate, up-to-date answers without re-training on every product update&lt;/p&gt;

&lt;p&gt;What do you do?&lt;/p&gt;

&lt;p&gt;A) RAG — embed your knowledge base, retrieve relevant chunks at query time, inject into context. Model stays the same, knowledge is always fresh.&lt;br&gt;
B) Fine-tune the base model — train GPT-4 (or an open-source equivalent) on your product docs. The model internalizes your domain.&lt;br&gt;
C) Fine-tune + RAG hybrid — fine-tune for style/tone/domain fluency, RAG for factual grounding. Best of both worlds.&lt;br&gt;
D) Prompt engineering only — detailed system prompt + few-shot examples. No infra, no training, just better instructions.&lt;/p&gt;

&lt;p&gt;All four are in production somewhere. Only one actually solves the problem in front of you.&lt;/p&gt;

&lt;p&gt;Pick one — A, B, C, or D — and tell me why. I'll drop the full breakdown in the comments (including the one that feels like the obvious upgrade but makes your freshness problem worse, not better).&lt;/p&gt;

&lt;p&gt;If your team is arguing RAG vs fine-tuning right now, share this. The tradeoff is worth mapping before you commit.&lt;/p&gt;

&lt;p&gt;Drop your answer 👇&lt;/p&gt;

&lt;p&gt;#30DaysOfSystemDesign #SystemDesign #RAG #MachineLearning&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>distributedsystems</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>26/30 Days System Design Questions!</title>
      <dc:creator>Joud Awad</dc:creator>
      <pubDate>Mon, 01 Jun 2026 17:28:07 +0000</pubDate>
      <link>https://dev.to/thejoud1997/2630-days-system-design-questions-22kd</link>
      <guid>https://dev.to/thejoud1997/2630-days-system-design-questions-22kd</guid>
      <description>&lt;p&gt;Your cache and DB are out of sync. Again.&lt;/p&gt;

&lt;p&gt;A user updates their profile. The cache still serves the old name for the next 10 minutes. Support gets a ticket. You patch it with a cache flush. It happens again next week.&lt;/p&gt;

&lt;p&gt;You're asked to fix write consistency before it becomes a customer-facing incident.&lt;/p&gt;

&lt;p&gt;Here's the setup:&lt;br&gt;
NestJS API → PostgreSQL (source of truth) + Redis (cache)&lt;br&gt;
~600 req/s reads, ~80 req/s writes at peak&lt;br&gt;
Current pattern: write to DB, manually invalidate cache key on success&lt;br&gt;
3 incidents this month — all traced back to stale cache after writes&lt;br&gt;
You need a strategy that survives race conditions, retries, and partial failures&lt;/p&gt;

&lt;p&gt;What do you change?&lt;/p&gt;

&lt;p&gt;A) Write-through — write to cache and DB together, synchronously. Cache is always warm, always consistent.&lt;br&gt;
B) Write-behind — write to cache first, async flush to DB. Fast writes, eventual persistence.&lt;br&gt;
C) Write-around — skip the cache on writes entirely. Write to DB only. Cache fills on next read miss.&lt;br&gt;
D) Dual-write with an outbox — write to DB + publish an event. A consumer updates the cache from the event log.&lt;/p&gt;

&lt;p&gt;All four are used in production. Only one actually survives the failure modes in this setup.&lt;/p&gt;

&lt;p&gt;Pick one — A, B, C, or D — and tell me why. I'll drop the full breakdown in the comments (including the one that looks safest but will burn you at scale).&lt;/p&gt;

&lt;p&gt;If your team argues about this at design review, share it with them. The debate is worth having before an incident forces it.&lt;/p&gt;

&lt;p&gt;Drop your answer 👇&lt;/p&gt;

&lt;p&gt;#30DaysOfSystemDesign #SystemDesign #Caching #SoftwareArchitecture&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>systemdesign</category>
      <category>redis</category>
      <category>architecture</category>
    </item>
    <item>
      <title>25/30 Days System Design Questions!</title>
      <dc:creator>Joud Awad</dc:creator>
      <pubDate>Sun, 31 May 2026 17:50:02 +0000</pubDate>
      <link>https://dev.to/thejoud1997/2530-days-system-design-questions-3iah</link>
      <guid>https://dev.to/thejoud1997/2530-days-system-design-questions-3iah</guid>
      <description>&lt;p&gt;Your order processing service runs on SQS.&lt;/p&gt;

&lt;p&gt;Normal load: 200 orders/min. Consumers keep up fine.&lt;/p&gt;

&lt;p&gt;Then Black Friday hits. Producers start pushing 4,000 orders/min. Queue depth climbs to 80,000 messages in 20 minutes. Your downstream DB is at 95% CPU. Consumers are falling behind and you're watching the queue grow in real time.&lt;/p&gt;

&lt;p&gt;You need to handle this backpressure. What do you do?&lt;/p&gt;

&lt;p&gt;A) Scale consumers horizontally — add more Lambda functions / EC2 workers to chew through the backlog faster.&lt;/p&gt;

&lt;p&gt;B) Set a visibility timeout and route failures to a dead-letter queue to protect against poison pills.&lt;/p&gt;

&lt;p&gt;C) Rate-limit producers at the source — use a token bucket or sliding window to cap how fast messages enter the queue.&lt;/p&gt;

&lt;p&gt;D) Switch to SQS delay queues — defer message visibility to spread out delivery and reduce consumer pressure.&lt;/p&gt;

&lt;p&gt;Three of these are real patterns engineers reach for. Only one actually solves backpressure.&lt;/p&gt;

&lt;p&gt;Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments.&lt;/p&gt;

&lt;p&gt;If this made you second-guess your instinct, share it — someone on your team is designing this right now.&lt;/p&gt;

&lt;p&gt;#30DaysOfSystemDesign #SystemDesign #AWS #SoftwareArchitecture&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>distributedsystems</category>
      <category>architecture</category>
      <category>backend</category>
    </item>
  </channel>
</rss>
