DEV Community

Joud Awad

19/30 Days System Design Questions!

Your checkout endpoint has a 400ms P95. Profiling shows 70% of that is DB reads.

You add a read replica and point all SELECT queries at it. P95 drops to 90ms. The team celebrates.

Two hours later, support tickets flood in. Customers update their shipping address but see the old one on the confirmation screen. One customer gets charged twice because the "order already exists" check read stale data and missed the duplicate.

Here's the setup:
• Primary → handles all writes, replication lag ~200ms
• Replica → handling 100% of reads
• Affected flows → profile updates, order dedup, payment idempotency

The replica is working exactly as designed. That's the problem.

What do you do?

A) Read-your-writes consistency: route a user's reads to primary for a short window after they write.
B) Synchronous replication: make primary wait for replica to confirm before ACKing the write.
C) Monitor replica lag + retry: detect when lag exceeds a threshold and fall back to primary.
D) Route critical reads to primary: replicas only serve non-critical reads like analytics.

All four are real patterns running in production. Only one solves the stale-read problem without killing the performance win you just shipped.

Pick one (A, B, C, or D) and tell me why. Full breakdown in the comments, including which answer is the senior engineer trap.

If your team has ever added a read replica and spent a week debugging stale data, share this with them.

Drop your answer 👇

#30DaysOfSystemDesign #SystemDesign #Databases #SoftwareArchitecture

Top comments (4)

Joud Awad • Edited

Why A wins:
After a user performs a write, their subsequent reads are routed to the primary for a short window: typically a few seconds, or until the replica has caught up with that write. Everyone else keeps reading from the replica, so the performance win is preserved for the vast majority of traffic.
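
A minimal sketch of that routing rule, in Python. The class name, the 2-second window, and the in-memory dict are all assumptions for illustration, not something from the original setup; in a multi-instance service the "last write" marker would have to live somewhere shared, such as the session or a cache.

```python
import time


class ReadYourWritesRouter:
    """Hypothetical router: pin a user's reads to primary right after they write."""

    def __init__(self, pin_seconds=2.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds  # assumed window, comfortably above ~200ms lag
        self.clock = clock              # injectable so the window can be tested
        self._last_write = {}           # user_id -> time of that user's latest write

    def record_write(self, user_id):
        """Call this whenever the user performs a write on the primary."""
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id):
        """Return 'primary' inside the pin window, 'replica' otherwise."""
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "primary"
        return "replica"
```

Only the user who just wrote pays the primary-read cost, and only for the length of the window. Every other read still lands on the replica.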

Joud Awad

Why B is wrong:
Synchronous replication means the primary waits for the replica to confirm every write. Replica lag goes to zero, but your write P95 goes from 20ms to 80–120ms. Worse, you've coupled your primary's availability to your replica's health.
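
A toy model of both costs, in Python. It assumes the replica round trip simply adds to the local commit time, and it uses `None` to stand for an unreachable replica; the function and exception names are hypothetical.

```python
class ReplicaUnavailable(Exception):
    """Raised when the primary cannot get a confirmation from the replica."""


def synchronous_write_latency_ms(local_commit_ms, replica_ack_ms):
    """Toy model: the write is ACKed only after the replica confirms it.

    replica_ack_ms is the time to ship the write and hear back, or None
    if the replica is unreachable (assumption: no timeout/fallback policy).
    """
    if replica_ack_ms is None:
        # The write cannot complete: replica health now gates primary writes.
        raise ReplicaUnavailable("write blocked waiting for replica")
    return local_commit_ms + replica_ack_ms
```

Under this model, a 20ms local commit plus a 60ms to 100ms replica round trip gives the 80 to 120ms range above, and a replica outage turns into a write outage.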

Joud Awad

Why C is wrong:
Average lag isn't your problem. Your problem is that a user reads their own write within 200ms of making it. The lag metric looks fine (200ms is "normal"), so the threshold never trips and nothing falls back to primary, but the data the user needs is still in transit.
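
To make that concrete, here is a small Python sketch. The 500ms threshold and both function names are assumptions for illustration; the 200ms lag comes from the setup in the post.

```python
def choose_by_lag(replica_lag_ms, threshold_ms):
    """Option C: fall back to primary only when measured lag exceeds the threshold."""
    return "primary" if replica_lag_ms > threshold_ms else "replica"


def read_is_stale(ms_since_write, replica_lag_ms, target):
    """A replica read is stale if the write has not had time to replicate yet."""
    return target == "replica" and ms_since_write < replica_lag_ms


# Lag is a "healthy" 200ms, so a 500ms threshold (assumed) never triggers fallback.
target = choose_by_lag(replica_lag_ms=200, threshold_ms=500)

# The user reloads the confirmation screen 50ms after saving their address.
stale = read_is_stale(ms_since_write=50, replica_lag_ms=200, target=target)
```

Here `target` is the replica and `stale` is true: the monitor is green and the user still sees the old address.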

Joud Awad

Why D is the trap:
"Critical" is not a stable category. The list of reads pinned to primary quietly shrinks, stale-read bugs come back, and you spend the next quarter playing whack-a-mole with consistency bugs. You've traded a systematic solution for a per-feature judgment call.
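
A sketch of what option D tends to look like in code, in Python. The query names and the allowlist are hypothetical, chosen to mirror the affected flows in the post.

```python
# Hypothetical allowlist: someone has to remember to add every critical read here.
CRITICAL_READS = {"order_dedup_check", "payment_idempotency_check"}


def target_for_read(query_name):
    """Option D: listed reads go to primary, everything else goes to the replica."""
    return "primary" if query_name in CRITICAL_READS else "replica"
```

Any read nobody classified, such as a hypothetical `shipping_address_confirmation`, silently defaults to the replica. The routing is only as correct as the list, and the list is maintained by hand, one feature at a time.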