22/30 Days System Design Questions

#systemdesign #distributedsystems #architecture #softwareengineering

Your payment service just charged a customer.

It writes to the DB. Now it needs to tell the notification service: "Send the confirmation email."

The HTTP call times out. Did it arrive? You don't know. You retry. Customer gets two emails.

You've just hit the Two Generals Problem. It's not a bug. It's a proof.

No protocol over an unreliable channel can guarantee that both sides know they agree, because whichever message is sent last can always be lost. Not HTTP. Not TCP. Not your retry loop. The uncertainty is mathematically irreducible.

Here's the setup:

PaymentService (Node.js, PostgreSQL) → NotificationService (Go)

~40ms p99 latency, occasional 504s under load.

You need to send exactly one confirmation email per payment — no double-sends, no missed sends.

What do you build?

A) Retry with exponential backoff until NotificationService returns 200. If you keep retrying until you get an ACK, you know it arrived.

B) Wrap both in a distributed transaction (2PC) — PaymentService and NotificationService commit together or neither does.

C) Outbox pattern — PaymentService writes the notification event to an outbox table in the same DB transaction as the payment. A relay process delivers it separately.

D) Push to SQS with at-least-once delivery. NotificationService deduplicates on a stable idempotency key. Accept you might send twice, but never miss.

One of these is a trap that senior engineers fall into every time. One of them doesn't solve the fundamental impossibility at all. And one is what you actually ship.

Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments.

#30DaysOfSystemDesign #SystemDesign #DistributedSystems #SoftwareArchitecture

Top comments (4)

Joud Awad • May 28 • Edited

D — SQS + at-least-once + idempotency key (CORRECT)

The only answer that accepts the impossibility and designs around it. At-least-once means the message will arrive, maybe twice. Idempotency on the consumer side (a key like payment_id:email:v1 in a dedup table) makes the second delivery a no-op. No missed sends, no double-sends in practice. It's the same idea behind Stripe's idempotency keys and the deduplication IDs on SQS FIFO queues. The queue absorbs the uncertainty. Idempotency absorbs the duplicates.
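A minimal sketch of the consumer-side dedup, in Python with an in-memory SQLite table standing in for whatever store NotificationService really uses. The names `make_dedup_store`, `handle_delivery`, and the `processed` table are hypothetical, and the key format `<payment_id>:email:v1` is an assumed reading of the key named above.

```python
import sqlite3
from typing import Callable


def make_dedup_store() -> sqlite3.Connection:
    """Create an in-memory dedup table (stand-in for a real database)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed (idempotency_key TEXT PRIMARY KEY)")
    return conn


def handle_delivery(
    conn: sqlite3.Connection,
    payment_id: str,
    send_email: Callable[[str], None],
) -> bool:
    """Process one queue delivery. Returns True if an email was sent,
    False if this delivery was a duplicate and was skipped."""
    key = f"{payment_id}:email:v1"  # assumed key format
    try:
        # One transaction: claim the key, then send. If send_email raises,
        # the claim is rolled back so a redelivery can try again.
        with conn:
            conn.execute(
                "INSERT INTO processed (idempotency_key) VALUES (?)", (key,)
            )
            send_email(payment_id)
    except sqlite3.IntegrityError:
        # The primary key already exists: this payment was handled before.
        return False
    return True
```

A crash after the email goes out but before the commit would still let a redelivery send it again, which is why the claim above is "in practice" rather than "never".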

Joud Awad • May 28

A — Retry until ACK (SENIOR ENGINEER TRAP)

Feels airtight. It's not. What if the 200 response is what timed out — not the request? NotificationService sent the email AND returned 200, but you never saw the response. So you retry. Second email sent. Now you need to confirm your ACK arrived too... which is the Two Generals recursion. No finite number of retries closes this loop. More handshakes = more latency and more failure surfaces, not more certainty.
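The lost-ACK case can be simulated in a few lines of Python. This is a toy model under stated assumptions: `FlakyNotificationService` and `retry_until_ack` are hypothetical names, the dropped response is modeled as a `TimeoutError`, and the backoff sleeps are left out for brevity.

```python
class FlakyNotificationService:
    """Sends the email every time, but loses the first N responses."""

    def __init__(self, drop_first_n_responses: int) -> None:
        self.emails_sent = 0
        self._responses_to_drop = drop_first_n_responses

    def send_confirmation(self, payment_id: str) -> int:
        self.emails_sent += 1  # the side effect happens before the reply
        if self._responses_to_drop > 0:
            self._responses_to_drop -= 1
            raise TimeoutError("response lost on the way back")
        return 200


def retry_until_ack(
    service: FlakyNotificationService, payment_id: str, max_attempts: int = 5
) -> int:
    """Retry until a 200 is seen. Returns the number of attempts made."""
    for attempt in range(1, max_attempts + 1):
        try:
            if service.send_confirmation(payment_id) == 200:
                return attempt
        except TimeoutError:
            continue  # the caller cannot tell "never arrived" from "ACK lost"
    raise RuntimeError("gave up without an ACK")


service = FlakyNotificationService(drop_first_n_responses=1)
attempts = retry_until_ack(service, "pay_123")
print(attempts, service.emails_sent)  # 2 2
```

The caller ends with exactly one ACK, and the customer ends with two emails.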

Joud Awad • May 28

B — 2PC (WRONG)

The coordinator is a single point of failure. If it crashes between Phase 1 (prepare) and Phase 2 (commit), both services are stuck with locks held, waiting for a decision that never comes. You've traded message uncertainty for coordinator-failure uncertainty — same problem, more complexity.
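A toy Python model of that blocking window, assuming an in-process coordinator and two participants. `Participant`, `two_phase_commit`, and `CoordinatorCrashed` are hypothetical names, and a real 2PC implementation would also need durable logs and timeouts that this sketch leaves out.

```python
from enum import Enum
from typing import List


class State(Enum):
    IDLE = "idle"
    PREPARED = "prepared"
    COMMITTED = "committed"
    ABORTED = "aborted"


class Participant:
    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = State.IDLE
        self.lock_held = False

    def prepare(self) -> bool:
        if not self.can_commit:
            return False  # vote "no"
        self.lock_held = True  # locks are held from prepare onward
        self.state = State.PREPARED
        return True

    def commit(self) -> None:
        self.state = State.COMMITTED
        self.lock_held = False

    def abort(self) -> None:
        self.state = State.ABORTED
        self.lock_held = False


class CoordinatorCrashed(Exception):
    pass


def two_phase_commit(
    participants: List[Participant], crash_before_decision: bool = False
) -> bool:
    # Phase 1: collect votes from every participant.
    votes = [p.prepare() for p in participants]
    if crash_before_decision:
        raise CoordinatorCrashed("no commit/abort decision was sent")
    # Phase 2: commit only if every vote was "yes".
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

Calling `two_phase_commit(parts, crash_before_decision=True)` leaves every participant in `PREPARED` with `lock_held` still true, which is the stuck state described above.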

Joud Awad • May 28

C — Outbox Pattern (CORRECT but heavier)

Actually solid: atomicity at the DB level, so the event and the payment are always in sync. But you're now running a relay process (a poller or a CDC pipeline) plus outbox cleanup. For most teams with SQS already in place, D gets 95% of the same guarantees with a fraction of the infrastructure. What D gives up is the atomic write: a crash between the DB commit and the SQS publish can lose the event, which is the gap C closes. C shines at scale or when strict ordering matters. D is the right answer for most teams today.
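A minimal outbox sketch in Python, with in-memory SQLite standing in for PostgreSQL and a polling relay rather than CDC. The table layout and the names `init_db`, `record_payment`, and `relay_once` are assumptions for illustration, and `publish` is whatever hands the event to the queue.

```python
import json
import sqlite3
from typing import Callable


def init_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments (id TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, "
        "published INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def record_payment(conn: sqlite3.Connection, payment_id: str, amount_cents: int) -> None:
    """Write the payment and its notification event in ONE transaction."""
    event = {"type": "payment_confirmed", "payment_id": payment_id}
    with conn:  # both inserts commit together or roll back together
        conn.execute(
            "INSERT INTO payments (id, amount_cents) VALUES (?, ?)",
            (payment_id, amount_cents),
        )
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay_once(conn: sqlite3.Connection, publish: Callable[[dict], None]) -> int:
    """Publish every unpublished outbox row in order. Returns rows handled."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        # Mark as published only after publish() returned. A crash between
        # the two steps means the row is published again on the next pass.
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

The relay is itself at-least-once, so the consumer still needs an idempotency key even with an outbox in place.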