Joud Awad

Posted on May 25

20/30 Days System Design Questions!

#architecture #distributedsystems #microservices #systemdesign

Your checkout service calls a 3rd-party fraud-check API on every order.
That API just started timing out at 30s instead of its usual 200ms.

Your Node.js checkout pods have a 50-connection pool. Within 90 seconds, every connection is parked waiting on the fraud API. New checkout requests pile up in the queue. P99 latency on /checkout goes from 300ms to 28s. Customers retry. Pods OOM. The fraud API is degraded — your entire checkout is down.

Here's the setup:

• Checkout (NestJS) → Fraud API (3rd party) — 30s timeouts
• Same pods also handle /cart, /orders, /health — all healthy dependencies
• Fraud API's own dashboard says it'll be back in ~10 minutes
• Your SLO budget for the quarter is about to evaporate

You need to stop the bleeding without losing the rest of checkout. What do you do?

A) Drop the timeout to 2s and add 3 retries with exponential backoff.
B) Add a Circuit Breaker that opens after N failures, then half-opens with a single probe request before fully closing.
C) Bulkhead the fraud API calls into a separate connection pool / thread pool so they can't starve the rest of checkout.
D) Both B and C — circuit breaker for the failing dependency, bulkhead to isolate the blast radius.

Three of these are patterns senior engineers genuinely debate in postmortems. One of them is the answer most staff engineers actually ship. One is the answer that makes the outage worse.

Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments.

If your team has ever had one slow downstream take down a healthy service, repost this. That conversation needs to happen before the outage, not after.

Drop your answer 👇

#30DaysOfSystemDesign #SystemDesign #Resilience #DistributedSystems

Top comments (4)

Joud Awad • May 25

Answer: D — Circuit Breaker + Bulkhead, together ✅

Here's why, and why the other three trick smart engineers:

Why D wins (Circuit Breaker + Bulkhead):

These two patterns solve two different failure modes, and you need both.

The Circuit Breaker stops you from hammering a dead dependency. After N consecutive failures (or a failure rate threshold over a rolling window), it flips to OPEN — every subsequent call to the fraud API fails instantly with a fallback. No 30s wait. No connection held hostage. After a cooldown (say 30s), it goes HALF-OPEN: it allows exactly one probe request through. If that probe succeeds, the breaker closes and traffic resumes. If it fails, back to OPEN for another cooldown. Half-open is the part most tutorials gloss over — it's what prevents the thundering herd from re-killing a service that's just coming back up.
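
The CLOSED, OPEN, HALF-OPEN cycle described above can be sketched as a small state machine. This is a minimal illustration, not the API of Resilience4j, Polly, or any other library: the class name, method names, and injectable clock are all hypothetical, and it only implements the "N consecutive failures" variant, not a rolling-window failure rate.

```typescript
// Minimal circuit breaker state machine (illustrative sketch, not a library API).
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private consecutiveFailures = 0;
  private openedAt = 0;
  private probeInFlight = false;
  private readonly failureThreshold: number; // the "N" from the text
  private readonly cooldownMs: number; // e.g. 30_000
  private readonly now: () => number; // injectable clock, handy for tests

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Returns true if the caller may contact the dependency right now.
  allowRequest(): boolean {
    if (this.state === "CLOSED") return true;
    if (this.state === "OPEN") {
      if (this.now() - this.openedAt < this.cooldownMs) return false; // fail fast
      this.state = "HALF_OPEN"; // cooldown elapsed
      this.probeInFlight = false;
    }
    // HALF_OPEN: let exactly one probe through and reject everything else.
    if (this.probeInFlight) return false;
    this.probeInFlight = true;
    return true;
  }

  // A success (normal call or probe) closes the breaker and resets the count.
  recordSuccess(): void {
    this.state = "CLOSED";
    this.consecutiveFailures = 0;
    this.probeInFlight = false;
  }

  recordFailure(): void {
    if (this.state === "HALF_OPEN") {
      this.trip(); // failed probe: back to OPEN for another cooldown
      return;
    }
    if (this.state === "OPEN") return; // late failure from a call started before the trip
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) this.trip();
  }

  getState(): BreakerState {
    return this.state;
  }

  private trip(): void {
    this.state = "OPEN";
    this.openedAt = this.now();
    this.probeInFlight = false;
  }
}
```

In use, the checkout handler would call `allowRequest()` before each fraud check, return the fallback when it gets `false`, and report the outcome of every real call through `recordSuccess()` or `recordFailure()`.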

The Bulkhead is the part most engineers forget exists until they get burned. It isolates resource pools. Your fraud API gets its own dedicated pool of, say, 10 connections — separate from the 40 connections that serve /cart, /orders, /health. When the fraud API hangs, it can saturate its 10 connections, but the other 40 stay free. /cart keeps working. /health keeps working. The blast radius stops at the fraud feature. The ship doesn't sink because one compartment flooded — that's literally where the name comes from.
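
As a rough sketch of the idea, a bulkhead can be as small as a counter that caps concurrent calls to one dependency and rejects the overflow instead of queueing it. The `Bulkhead` class and its method names below are hypothetical; the 10/40 split is the example sizing from the paragraph above, not a recommendation.

```typescript
// Minimal bulkhead: caps concurrent calls to one dependency (illustrative sketch).
class Bulkhead {
  private inFlight = 0;
  private readonly maxConcurrent: number;

  constructor(maxConcurrent: number) {
    this.maxConcurrent = maxConcurrent;
  }

  // Claims a slot if one is free; never waits, so callers cannot pile up.
  tryAcquire(): boolean {
    if (this.inFlight >= this.maxConcurrent) return false;
    this.inFlight += 1;
    return true;
  }

  release(): void {
    if (this.inFlight > 0) this.inFlight -= 1;
  }

  // Runs the task inside the compartment, or rejects immediately when it is full.
  async run<T>(task: () => Promise<T>): Promise<T> {
    if (!this.tryAcquire()) throw new Error("bulkhead full");
    try {
      return await task();
    } finally {
      this.release(); // always free the slot, even if the task throws
    }
  }

  available(): number {
    return this.maxConcurrent - this.inFlight;
  }
}

// Sizing taken from the example in the text: 10 slots for fraud, 40 for everything else.
const fraudBulkhead = new Bulkhead(10);
const generalBulkhead = new Bulkhead(40);
```

In a Node.js service, one way to get a similar effect at the socket level is to give the fraud client its own `http.Agent` with a small `maxSockets`, so its connections never come out of the pool that serves other outbound calls.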

Resilience4j, Polly (.NET), and Hystrix (legacy) all ship both. AWS App Mesh + Envoy give you bulkheads at the proxy layer. Netflix wrote Hystrix specifically because they learned this lesson the hard way: a single slow dependency cascading through shared thread pools took down recommendations during peak hours.

Joud Awad • May 25

Why B alone is the staff-engineer trap answer:

A circuit breaker without a bulkhead is the answer that sounds complete in an interview. It's the most-mentioned pattern in resilience talks. And it does fix the specific failure described — once the breaker opens, fraud calls fail fast and stop tying up connections.

But here's the production reality: between failure #1 and the breaker tripping (failure #N), every one of those N requests is still holding a connection from the shared pool, and so is every other fraud call that has started but not yet timed out. A breaker that counts timeouts records nothing until the first call has hung for the full 30s. If your threshold is 20 failures over 10 seconds, you can still saturate a 50-connection pool before the breaker even notices. You've reduced the outage window, not eliminated the cascade.
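
A back-of-the-envelope check makes the gap concrete. The sketch below assumes a steady arrival rate and that every fraud call hangs until the timeout fires. The 2 requests per second per pod is a made-up figure for illustration; the 50-connection pool and 30s timeout come from the scenario.

```typescript
// Connections parked on the slow dependency before the first timeout is recorded.
// Assumes a steady arrival rate and that every call hangs for the full timeout.
function connectionsHeldBeforeFirstFailure(requestsPerSecond: number, timeoutSeconds: number): number {
  return requestsPerSecond * timeoutSeconds;
}

// Time until a shared pool is fully occupied by hanging calls.
function secondsUntilPoolSaturated(poolSize: number, requestsPerSecond: number): number {
  return poolSize / requestsPerSecond;
}

// Hypothetical load: 2 checkout requests per second per pod (not a figure from the post).
const held = connectionsHeldBeforeFirstFailure(2, 30); // 60, more than the 50-connection pool
const saturatedAfter = secondsUntilPoolSaturated(50, 2); // 25 seconds, before the first 30s timeout
```

Under these assumptions the shared pool is full at 25 seconds, while the breaker sees its first failure at 30 seconds, so no threshold setting can trip it in time.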

Bulkheads make the breaker's detection delay tolerable. However long the breaker takes to trip, the damage stays inside the fraud pool, so you get slack to detect and react without the rest of the system already being on fire.

This is the trap because B is right and incomplete. Senior engineers stop at B in design reviews because it sounds sufficient. Staff engineers ship D because they've seen B fail in production at 3am.

Joud Awad • May 25

Why C alone is partial:

Bulkhead alone is better than nothing — it contains the damage. But your 10 dedicated fraud connections will still spend 30 seconds each waiting on doomed calls. You're paying full latency cost on every checkout, just on a smaller pool. Customers still see slow checkouts; you just don't lose /cart with them.

Bulkhead without a breaker = "the leak is contained, but the room is still flooding."

Joud Awad • May 25

Why A is wrong (and dangerous):

Lower timeout + retries with backoff is one of the worst things you can do to a dependency that is already degraded. The shorter timeout is reasonable on its own; the retries are what make this the canonical anti-pattern.

The fraud API is already struggling. You just decided to send it up to 4x the traffic: the original call plus 3 retries for every checkout. Every retry storm from every checkout pod converges on a service that needs less load to recover. This is how a partial outage becomes a total outage, and how a 10-minute incident becomes a 4-hour incident. Retries belong on transient errors (a single 503, a connection reset), not on systemic degradation. And they always need a circuit breaker upstream of them to cap the blast.
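
The cost of option A can be worked out with a few lines of arithmetic. This is a sketch only: the 2s timeout and 3 retries come from option A, while the 500ms base backoff and the doubling schedule are assumptions, since the option does not specify them.

```typescript
// Worst-case cost of option A for a single checkout request (illustrative arithmetic).
function totalAttempts(retries: number): number {
  return 1 + retries; // the original call plus each retry
}

function worstCaseLatencyMs(timeoutMs: number, retries: number, baseBackoffMs: number): number {
  let total = timeoutMs; // the first attempt times out
  for (let i = 0; i < retries; i++) {
    total += baseBackoffMs * 2 ** i; // exponential backoff before the next retry
    total += timeoutMs; // the retry also times out
  }
  return total;
}

// 2s timeout and 3 retries are from option A; the 500ms base backoff is an assumption.
const attempts = totalAttempts(3); // 4 calls hit the fraud API per checkout
const latencyMs = worstCaseLatencyMs(2000, 3, 500); // 11500 ms before the request finally fails
```

So while the fraud API is down, each checkout sends it four calls instead of one and still holds its connection for about 11.5 seconds under these assumptions.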

If you've ever read a postmortem with the phrase "retry storm contributed to extended recovery time" — this is what they meant.