The Bug That Only Happens Sometimes
User reports: "I updated my profile but it still shows the old name."
You refresh: new name appears.
You ask them to refresh: still old name.
You debug together: it's back to new name.
The user: "This is clearly broken."
You: "It's working for me..."
Congratulations, you're debugging eventual consistency.
What Eventual Consistency Actually Means
In a distributed system, a write to one node takes time to propagate to others. During that window:
- Some reads see the new value
- Some reads see the old value
- All reads will eventually see the new value
That "eventually" window can be milliseconds or seconds. Sometimes minutes if something is wrong.
Users don't care about CAP theorem. They care that they updated their profile and it shows wrong.
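If you want to feel the window, here's a toy simulation: two dicts standing in for a primary and a replica, plus an artificial replication delay. It's not a real datastore, just a sketch of the timing.

import threading, time

primary, replica = {}, {}

def write(key, value, replication_delay=0.2):
    primary[key] = value
    # Replication is asynchronous: the replica only sees the write later
    threading.Timer(replication_delay, lambda: replica.update({key: value})).start()

write("name", "New Name")
print(replica.get("name"))   # None -> a read inside the window sees stale data
time.sleep(0.3)
print(replica.get("name"))   # "New Name" -> eventually consistent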
Why It's So Hard to Debug
Three reasons:
1. Non-deterministic reproduction
You can't reliably reproduce it. Sometimes the bug fires, sometimes it doesn't. This makes unit tests useless and traditional debugging painful.
2. Time-dependent
The bug depends on when the read happens relative to the write. Microseconds matter. Your laptop is too fast to see it locally.
3. State propagation is invisible
Logs show you the read and the write. They don't show you "this read hit replica 2 which hadn't received the replication event yet."
The Common Patterns
Pattern 1: Read your own writes
User updates profile → write goes to primary → user reads → read hits a replica that hasn't replicated yet → user sees stale data.
# WRONG
def update_profile(user_id, new_data):
    primary.write(user_id, new_data)
    return read_from_replica(user_id)  # Might be stale!

# RIGHT
def update_profile(user_id, new_data):
    primary.write(user_id, new_data)
    return primary.read(user_id)  # Guaranteed fresh
Pattern 2: Cache staleness
Write goes to database, but the cache still has the old value.
# Fix: invalidate on write
def update_profile(user_id, new_data):
    db.write(user_id, new_data)
    cache.delete(user_id)  # Force the next read to go to the DB
Or use a shorter TTL. Or use read-through caching.
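A minimal read-through sketch with a short TTL. The cache.get/cache.set calls (and the ttl argument) are assumptions about your cache client, not a specific library's API:

# Read-through cache with a short TTL
CACHE_TTL_SECONDS = 30   # a short TTL bounds how stale a cached read can get

def get_profile(user_id):
    cached = cache.get(user_id)
    if cached is not None:
        return cached
    fresh = db.read(user_id)
    cache.set(user_id, fresh, ttl=CACHE_TTL_SECONDS)
    return fresh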
Pattern 3: Async propagation
User clicks "upgrade subscription" → API returns 200 → subscription_service emits event → billing_service processes → account_service updates → user's next page load still shows "trial"
# Fix: wait for propagation before returning
async def upgrade(user_id, plan):
    result = await subscription_service.upgrade(user_id, plan)
    # Wait for downstream services to process the event
    await wait_for_account_update(user_id, plan, timeout=5)
    return result
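The wait_for_account_update helper isn't defined above; one plausible sketch is a bounded poll against the downstream read model (account_service.get and the plan attribute are illustrative names):

import asyncio, time

async def wait_for_account_update(user_id, plan, timeout=5, poll_interval=0.1):
    # Poll the account service until the new plan is visible, or give up
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        account = await account_service.get(user_id)
        if account.plan == plan:
            return
        await asyncio.sleep(poll_interval)
    raise TimeoutError(f"Upgrade for user {user_id} not visible after {timeout}s")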
Pattern 4: Eventually consistent views
Using Elasticsearch as a read replica of Postgres: writes go to Postgres, then replicate to ES. If you write and then immediately read, you get the old value.
# Fix 1: read from primary for recent writes
def get_user(user_id, recent_write=False):
    if recent_write:
        return postgres.get(user_id)
    return elasticsearch.get(user_id)

# Fix 2: track a replication epoch and wait for it
def get_user(user_id, min_epoch=None):
    if min_epoch:
        wait_for_replication_to_reach(min_epoch)
    return elasticsearch.get(user_id)
The Detection Strategy
Since bugs are non-deterministic, detection has to be probabilistic:
# In your logging middleware:
def log_inconsistency(user_id, field, expected, actual):
    if expected != actual:
        metrics.increment('consistency.mismatch', tags=[
            f'field:{field}'
        ])
        logger.warning('Consistency mismatch', user=user_id,
                       field=field, expected=expected, actual=actual)
Monitor the mismatch rate. If it's > 0.01%, you have a real problem.
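One cheap way to feed log_inconsistency is a shadow read: on a small sample of requests, fetch the same key from both the primary and the replica path and compare. The sample rate and the primary/read_from_replica names below are illustrative:

import random

SHADOW_READ_SAMPLE_RATE = 0.01  # compare 1% of reads against the primary

def get_profile(user_id):
    value = read_from_replica(user_id)
    if random.random() < SHADOW_READ_SAMPLE_RATE:
        fresh = primary.read(user_id)
        log_inconsistency(user_id, 'profile', expected=fresh, actual=value)
    return value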
The Diagnostic Questions
When debugging, ask:
How many replicas are involved? More replicas = longer propagation delay.
What's the normal replication lag? Check monitoring. Normal is usually < 1 second.
Is the lag elevated right now? A spike from 100ms to 5s is a red flag.
Was the write synchronous or async? Async writes have no guarantee of being visible on reads.
Does the client have a retry? Retries at the wrong layer can produce duplicates or stale reads.
Is there a cache in the path? Caches are the most common source of "phantom staleness."
The Debugging Tools
1. Replication lag metrics
For PostgreSQL:
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
For MongoDB:
rs.printSecondaryReplicationInfo()
For Redis:
INFO replication
Alert on lag > 1 second sustained.
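To make that alertable, poll the Postgres query above and publish the result as a metric. A sketch, assuming a psycopg2-style connection and a metrics client with a gauge method:

# Poll the lag query and publish it as a gauge you can alert on
def report_replication_lag(pg_conn):
    with pg_conn.cursor() as cur:
        cur.execute(
            "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())")
        lag_seconds = cur.fetchone()[0] or 0.0  # NULL on a primary -> treat as 0
    metrics.gauge('replication.lag_seconds', lag_seconds)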
2. Request tracing
Use OpenTelemetry to trace a single user action across services. You'll see where time is spent and which service is talking to which replica.
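A minimal sketch with the OpenTelemetry Python API; the db.node attribute is my own naming, and primary.node_id is assumed to exist on your client:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def update_profile(user_id, new_data):
    with tracer.start_as_current_span("update_profile") as span:
        span.set_attribute("db.node", primary.node_id)  # which node handled the write
        primary.write(user_id, new_data)
        return primary.read(user_id)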
3. Log all reads with node ID
Every read should log which node/replica served it. When debugging, you can see if the bug correlates with a specific replica.
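Concretely, if read_from_replica looked something like this (the pool object is hypothetical), every stale read would carry the replica that served it:

def read_from_replica(user_id):
    replica = pool.pick_replica()  # hypothetical pool; returns the node chosen for this read
    value = replica.read(user_id)
    logger.info('read', user=user_id, node=replica.name)  # which replica served the read
    return value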
4. Deterministic test environment
Build a test environment where you can introduce artificial replication delay. Helps reproduce bugs locally:
# In test
replica.delay_ms = 500 # Force 500ms replication lag
# Run the flow, assert consistency
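A sketch of the fake replica behind that snippet, so the read-your-own-writes bug from Pattern 1 becomes reproducible in a unit test. DelayedReplica is invented for this example:

import threading

class DelayedReplica:
    """In-memory 'replica' that applies writes only after an artificial delay."""
    def __init__(self, delay_ms=0):
        self.delay_ms = delay_ms
        self._data = {}

    def replicate(self, key, value):
        # The write lands on the replica only after delay_ms
        threading.Timer(self.delay_ms / 1000,
                        lambda: self._data.update({key: value})).start()

    def read(self, key):
        return self._data.get(key)

# In test
replica = DelayedReplica()
replica.delay_ms = 500                 # Force 500ms replication lag
replica.replicate("name", "new")
assert replica.read("name") is None    # stale read inside the window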
The Fix Strategies
Strategy 1: Strong consistency where it matters
Not everything needs to be eventually consistent. For critical flows:
- User-visible state changes → read from primary
- Financial operations → strong consistency (transactions)
- Security operations (auth, permissions) → strong consistency
Accept the latency hit for correctness.
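In practice this often ends up as a small routing helper. A sketch, reusing the illustrative primary/read_from_replica objects from Pattern 1, where the critical flag is a deliberate per-call-site decision:

# Route critical reads to the primary, everything else to replicas
def read(key, critical=False):
    if critical:
        return primary.read(key)      # strong consistency, higher latency
    return read_from_replica(key)     # eventually consistent, cheap

balance = read("balance:42", critical=True)  # financial: read from primary
bio = read("bio:42")                         # cosmetic: replica is fine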
Strategy 2: Causal consistency
Track "you just wrote X" and ensure subsequent reads see X:
# Client sends "min_epoch" based on last write
def handle_request(request):
    if request.min_epoch:
        wait_for_replication(request.min_epoch)
    return db.read(request.key)
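On the client side, the epoch usually comes back in the write response and gets attached to later reads from the same session. A sketch, where the epoch field and the api object are assumptions about your service:

class SessionClient:
    """Remembers the epoch of this session's last write ('read your own writes')."""
    def __init__(self, api):
        self.api = api
        self.last_write_epoch = None

    def write(self, key, value):
        response = self.api.write(key, value)
        self.last_write_epoch = response["epoch"]  # assumed: server returns its epoch
        return response

    def read(self, key):
        return self.api.read(key, min_epoch=self.last_write_epoch)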
Strategy 3: User-visible time
Sometimes the solution is UX, not infrastructure:
"Your changes are saved. They may take up to 30 seconds to appear."
Set expectations. Users understand "saved but syncing" better than "works for you but broken for me."
Strategy 4: Accept and monitor
Some inconsistency is tolerable. Log it, alert if it exceeds a threshold, and fix the worst offenders.
The Hardest Bugs
The worst eventual consistency bugs happen at failure recovery time. A write that was in-flight during a network partition might:
- Succeed on primary but fail to replicate
- Fail on primary but succeed on one replica
- Be applied in a different order than another write
- Be lost entirely
These are fundamental distributed systems problems. Defense: idempotent operations, client-side retries with unique IDs, careful read-after-write semantics.
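Of those defenses, idempotency keyed on a client-generated ID is the easiest to sketch. The processed-ID store here is a plain in-memory set purely for illustration; in production it would be a durable store with a TTL:

import uuid

processed_ids = set()  # in production: a durable store, not process memory

def apply_write(request_id, key, value):
    # Replays of the same request (client retries, redelivered events) become no-ops
    if request_id in processed_ids:
        return "already_applied"
    primary.write(key, value)
    processed_ids.add(request_id)
    return "applied"

# Client side: one ID per logical operation, reused across retries
request_id = str(uuid.uuid4())
apply_write(request_id, "plan:42", "pro")
apply_write(request_id, "plan:42", "pro")  # retry: safely ignored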
The Takeaway
Eventual consistency isn't a bug. It's a tradeoff.
The bug is:
- Not understanding where consistency matters in your system
- Not having visibility into replication lag
- Not documenting consistency guarantees to users
Fix these, and "eventually consistent" stops being a dirty phrase and starts being a feature.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com