Day 26/30: System Design Questions!

#distributedsystems #systemdesign #redis #architecture

Your cache and DB are out of sync. Again.

A user updates their profile. The cache still serves the old name for the next 10 minutes. Support gets a ticket. You patch it with a cache flush. It happens again next week.

You're asked to fix write consistency before it becomes a customer-facing incident.

Here's the setup:
NestJS API → PostgreSQL (source of truth) + Redis (cache)
~600 req/s reads, ~80 req/s writes at peak
Current pattern: write to DB, manually invalidate cache key on success
3 incidents this month — all traced back to stale cache after writes
You need a strategy that survives race conditions, retries, and partial failures
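
To make the race concrete, here is a minimal sketch of one interleaving that can break "write to DB, then invalidate". It assumes reads use the usual cache-aside fill on a miss, which the post does not state. The two Maps are hypothetical in-memory stand-ins for PostgreSQL and Redis, and the key name is made up.

```typescript
// Hypothetical in-memory stand-ins for PostgreSQL and Redis (not real clients).
const raceDb = new Map<string, string>([["user:1:name", "Old Name"]]);
const raceCache = new Map<string, string>();
const raceKey = "user:1:name";

// Step 1: a reader misses the cache and loads the current row from the DB.
const valueSeenByReader = raceCache.get(raceKey) ?? raceDb.get(raceKey) ?? "";

// Step 2: before the reader fills the cache, a writer commits and invalidates.
raceDb.set(raceKey, "New Name");
raceCache.delete(raceKey); // nothing is cached yet, so this is a no-op

// Step 3: the reader resumes and caches the value it loaded in step 1.
raceCache.set(raceKey, valueSeenByReader);

// The DB holds "New Name" while the cache holds "Old Name".
console.log(raceDb.get(raceKey), raceCache.get(raceKey));
```

Nothing failed here. The invalidation succeeded and the cache is still stale until the key expires or is flushed.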

What do you change?

A) Write-through — write to cache and DB together, synchronously. Cache is always warm, always consistent.
B) Write-behind — write to cache first, async flush to DB. Fast writes, eventual persistence.
C) Write-around — skip the cache on writes entirely. Write to DB only. Cache fills on next read miss.
D) Dual-write with an outbox — write to DB + publish an event. A consumer updates the cache from the event log.

All four are used in production. Only one actually survives the failure modes in this setup.

Pick one — A, B, C, or D — and tell me why. I'll drop the full breakdown in the comments (including the one that looks safest but will burn you at scale).

If your team argues about this at design review, share it with them. The debate is worth having before an incident forces it.

Drop your answer 👇

#30DaysOfSystemDesign #SystemDesign #Caching #SoftwareArchitecture

Top comments (4)

Joud Awad • Jun 1

Answer: D — Dual-write with an outbox ✅

The core problem isn't "which order to write in" — it's "what happens when one write succeeds and the other fails?"

Why D wins: Every naive dual-write has the same partial-failure problem: write to DB, then crash (or hit a Redis error) before the cache write lands, and the cache stays stale until the key expires or someone flushes it. The outbox makes the cache update a consequence of the DB write, not a sibling operation. One atomic DB transaction writes the record plus an outbox event, and a consumer updates Redis from that event. Delivery is at-least-once, so the consumer should be idempotent and apply events for a key in order. You also get replay: if Redis goes down and comes back, re-process the outbox.
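
A minimal sketch of the outbox flow, runnable on its own. Everything in it is an assumption for illustration: `OutboxDb`, `OutboxCache`, `OutboxEvent` and `drainOutbox` are hypothetical names, and the classes are in-memory stand-ins rather than real PostgreSQL or Redis clients. In a real service the row update and the outbox insert would share one PostgreSQL transaction.

```typescript
// Hypothetical event shape: one row in an outbox table.
type OutboxEvent = { id: number; key: string; value: string };

// In-memory stand-in for PostgreSQL.
class OutboxDb {
  rows = new Map<string, string>();
  outbox: OutboxEvent[] = [];
  private nextId = 1;

  // Stands in for BEGIN ... COMMIT: the row and its event are written together.
  updateProfile(key: string, value: string): void {
    this.rows.set(key, value);
    this.outbox.push({ id: this.nextId, key, value });
    this.nextId += 1;
  }
}

// In-memory stand-in for Redis that can be switched off to simulate an outage.
class OutboxCache {
  data = new Map<string, string>();
  available = true;

  set(key: string, value: string): void {
    if (!this.available) throw new Error("cache unavailable");
    this.data.set(key, value);
  }
}

// Consumer: applies events in id order and returns the last id it applied.
// On a cache error it stops, so the next run retries from the same offset.
function drainOutbox(db: OutboxDb, cache: OutboxCache, lastApplied: number): number {
  for (const event of db.outbox) {
    if (event.id <= lastApplied) continue; // already applied, skip on replay
    try {
      cache.set(event.key, event.value);
    } catch {
      break;
    }
    lastApplied = event.id;
  }
  return lastApplied;
}

const outboxDb = new OutboxDb();
const outboxCache = new OutboxCache();

outboxCache.available = false; // Redis is down during the write
outboxDb.updateProfile("user:1:name", "Ada");
const offsetWhileDown = drainOutbox(outboxDb, outboxCache, 0); // 0: nothing applied

outboxCache.available = true; // Redis comes back
const offsetAfterRecovery = drainOutbox(outboxDb, outboxCache, offsetWhileDown); // 1

console.log(offsetWhileDown, offsetAfterRecovery, outboxCache.data.get("user:1:name"));
```

The write path never touches the cache, so a Redis outage delays the cache update instead of losing it.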

Joud Awad • Jun 1

Why A fails (Write-through): Looks safest. Is the most dangerous at scale. Two synchronous I/Os on every write. If Redis is slow, your write API is slow. If Redis is down, do you block the user? You've made Redis a hard dependency of your write path.
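
A small sketch of the hard-dependency problem, assuming the write path does the DB write first and then the cache write in the same request. `WriteThroughStore` and `writeThrough` are hypothetical in-memory stand-ins, not a real client API.

```typescript
// Hypothetical in-memory store used for both the "DB" and the "cache".
class WriteThroughStore {
  data = new Map<string, string>();
  up = true;

  set(key: string, value: string): void {
    if (!this.up) throw new Error("store unavailable");
    this.data.set(key, value);
  }
}

// Write-through: both writes happen synchronously inside the request.
function writeThrough(
  db: WriteThroughStore,
  cache: WriteThroughStore,
  key: string,
  value: string,
): void {
  db.set(key, value); // first I/O: commits
  cache.set(key, value); // second I/O: if this throws, the request fails anyway
}

const wtDb = new WriteThroughStore();
const wtCache = new WriteThroughStore();
writeThrough(wtDb, wtCache, "user:1:name", "Old Name");

wtCache.up = false; // Redis outage
let requestFailed = false;
try {
  writeThrough(wtDb, wtCache, "user:1:name", "New Name");
} catch {
  requestFailed = true;
}

// The user saw an error, the DB has "New Name", the cache still has "Old Name".
console.log(requestFailed, wtDb.data.get("user:1:name"), wtCache.data.get("user:1:name"));
```

The two writes are not atomic, so a cache failure after the DB commit produces both a failed request and a stale cache entry.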

Joud Awad • Jun 1

Why B fails (Write-behind): Fast, but: if the async worker crashes before DB flush, that write is gone. Acceptable for analytics counters. Never acceptable for a source of truth.
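
A sketch of that loss, assuming the pending writes live only in the worker's memory until the flush runs. `WriteBehindBuffer` and its methods are hypothetical names, and the Map stands in for PostgreSQL.

```typescript
// Hypothetical write-behind buffer: cache first, DB later.
class WriteBehindBuffer {
  readonly cache = new Map<string, string>();
  private pending: Array<[string, string]> = [];

  // The write is acknowledged as soon as the cache and the queue are updated.
  write(key: string, value: string): void {
    this.cache.set(key, value);
    this.pending.push([key, value]);
  }

  // Async flush, modelled as an explicit call. Returns how many writes it persisted.
  flushTo(db: Map<string, string>): number {
    const flushed = this.pending.length;
    for (const [key, value] of this.pending) db.set(key, value);
    this.pending = [];
    return flushed;
  }

  // Simulates the worker process dying and taking its in-memory queue with it.
  simulateCrash(): void {
    this.pending = [];
  }
}

const behindDb = new Map<string, string>();
const buffer = new WriteBehindBuffer();

buffer.write("user:1:name", "New Name"); // the user gets a success response here
buffer.simulateCrash(); // the worker dies before the flush
const flushedAfterRestart = buffer.flushTo(behindDb); // 0: nothing left to persist

// The cache has the value, the source of truth never received it.
console.log(flushedAfterRestart, buffer.cache.get("user:1:name"), behindDb.has("user:1:name"));
```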

Joud Awad • Jun 1

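Why C is a partial fix (Write-around): Avoids writing stale values by never touching the cache on the write path. Clean, but it is only a partial fix: an already-cached key still has to expire or be invalidated before readers see the new value, a read miss served from a lagging replica breaks read-after-write, and high-write workloads tank your cache hit rate.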
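
A sketch of the replica-lag case, assuming cache misses are served from a read replica and the key is not cached when the write happens. All names are hypothetical and the Maps are in-memory stand-ins for the primary, the replica and Redis.

```typescript
// Hypothetical stand-ins: primary, a lagging read replica, and the cache.
const primaryDb = new Map<string, string>([["user:1:name", "Old Name"]]);
const replicaDb = new Map<string, string>([["user:1:name", "Old Name"]]);
const readCache = new Map<string, string>();

// Write-around: the write goes to the primary only.
function writeAround(key: string, value: string): void {
  primaryDb.set(key, value);
}

// Cache-aside read: on a miss, load from the replica and fill the cache.
function readName(key: string): string | undefined {
  const hit = readCache.get(key);
  if (hit !== undefined) return hit;
  const fromReplica = replicaDb.get(key);
  if (fromReplica !== undefined) readCache.set(key, fromReplica);
  return fromReplica;
}

// Replication catching up, modelled as an explicit copy.
function replicate(): void {
  primaryDb.forEach((value, key) => replicaDb.set(key, value));
}

writeAround("user:1:name", "New Name");
const readAfterWrite = readName("user:1:name"); // miss, served by the lagging replica
replicate(); // the replica now has "New Name"
const readAfterCatchUp = readName("user:1:name"); // cache hit on the old value

// Both reads return "Old Name" even though the primary holds "New Name".
console.log(readAfterWrite, readAfterCatchUp, primaryDb.get("user:1:name"));
```

The stale value outlives the lag itself, because the cache was filled from the replica before it caught up.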