Your checkout endpoint has a 400ms P95. Profiling shows 70% of that is DB reads.
You add a read replica and point all SELECT queries at it. P95 drops to 90ms. The team celebrates.
Two hours later, support tickets flood in. Customers update their shipping address but see the old one on the confirmation screen. One customer gets charged twice because the "order already exists" check read stale data and missed the duplicate.
Here's the setup:
• Primary → handles all writes, replication lag ~200ms
• Replica → handling 100% of reads
• Affected flows → profile updates, order dedup, payment idempotency
The replica is working exactly as designed. That's the problem.
What do you do?
A) Read-your-writes consistency: route a user's reads to primary for a short window after they write.
B) Synchronous replication: make primary wait for replica to confirm before ACKing the write.
C) Monitor replica lag + retry: detect when lag exceeds a threshold and fall back to primary.
D) Route critical reads to primary: replicas only serve non-critical reads like analytics.
All four are real patterns running in production. Only one solves the stale-read problem without killing the performance win you just shipped.
Pick one (A, B, C, or D) and tell me why. Full breakdown in the comments, including which answer is the senior engineer trap.
If your team has ever added a read replica and spent a week debugging stale data, share this with them.
Drop your answer 👇
Top comments (4)
Why A wins:
After a user performs a write, their subsequent reads are routed to the primary for a short window: typically a few seconds, or until replica lag catches up. Everyone else reads from the replica. The performance win is preserved for the vast majority of traffic.
Why B is wrong:
Synchronous replication means the primary waits for the replica to confirm every write. Replica lag goes to zero, but your write P95 goes from 20ms to 80–120ms. Worse, you've coupled your primary's availability to your replica's health.
Why C is wrong:
Average lag isn't your problem. Your problem is that a user reads their own write within 200ms of making it. The lag metric looks fine (200ms is "normal"), but the data the user needs is still in transit.
Why D is the trap:
"Critical" is not a stable category. The primary read list quietly shrinks, stale read bugs come back, and you spend the next quarter playing whack-a-mole with consistency bugs. You've traded a systematic solution for a per-feature judgment call.