Eventual Consistency: Debugging the Hardest Class of Bugs

The Bug That Only Happens Sometimes

User reports: "I updated my profile but it still shows the old name."
You refresh: new name appears.
You ask them to refresh: still old name.
You debug together: it's back to new name.
The user: "This is clearly broken."
You: "It's working for me..."

Congratulations, you're debugging eventual consistency.

What Eventual Consistency Actually Means

In a distributed system, a write to one node takes time to propagate to others. During that window:

  • Some reads see the new value
  • Some reads see the old value
  • All reads will eventually see the new value

That "eventually" window can be milliseconds or seconds. Sometimes minutes if something is wrong.

Users don't care about the CAP theorem. They care that they updated their profile and it still shows the old name.

Why It's So Hard to Debug

Three reasons:

1. Non-deterministic reproduction

You can't reliably reproduce it. Sometimes the bug fires, sometimes it doesn't. This makes unit tests useless and traditional debugging painful.

2. Time-dependent

The bug depends on when the read happens relative to the write. Microseconds matter. Your laptop is too fast to see it locally.

3. State propagation is invisible

Logs show you the read and the write. They don't show you "this read hit replica 2 which hadn't received the replication event yet."

The Common Patterns

Pattern 1: Read your own writes

User updates profile → write goes to primary → user reads → read hits a replica that hasn't replicated yet → user sees stale data.

# WRONG
def update_profile(user_id, new_data):
    primary.write(user_id, new_data)
    return read_from_replica(user_id)  # Might be stale!

# RIGHT
def update_profile(user_id, new_data):
    primary.write(user_id, new_data)
    return primary.read(user_id)  # Guaranteed fresh

Pattern 2: Cache staleness

Write goes to database, but the cache still has the old value.

# Fix: invalidate on write
def update_profile(user_id, new_data):
    db.write(user_id, new_data)
    cache.delete(user_id)  # Force next read from DB

Or use a shorter TTL. Or use read-through caching.
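
Read-through means the cache repopulates itself from the database on a miss, so a short TTL bounds how long staleness can last. A minimal sketch, assuming a cache client with TTL support and a db.read helper (both names are placeholders):

CACHE_TTL_SECONDS = 30  # short TTL bounds the staleness window

def get_profile(user_id):
    # Read-through: check the cache, fall back to the DB, repopulate.
    cached = cache.get(user_id)
    if cached is not None:
        return cached
    fresh = db.read(user_id)
    cache.set(user_id, fresh, ttl=CACHE_TTL_SECONDS)
    return fresh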

Pattern 3: Async propagation

User clicks "upgrade subscription" → API returns 200 → subscription_service emits event → billing_service processes → account_service updates → user's next page load still shows "trial"

# Fix: wait for propagation before returning
async def upgrade(user_id, plan):
    result = await subscription_service.upgrade(user_id, plan)
    # Wait for downstream to process
    await wait_for_account_update(user_id, plan, timeout=5)
    return result
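
The wait_for_account_update helper is doing the real work there. A minimal polling sketch, assuming account_service exposes a get_plan call (both names are placeholders):

import asyncio
import time

async def wait_for_account_update(user_id, plan, timeout=5):
    # Poll the downstream service until the new plan is visible,
    # or raise after `timeout` seconds.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if await account_service.get_plan(user_id) == plan:
            return
        await asyncio.sleep(0.1)  # back off between polls
    raise TimeoutError(f"plan change for {user_id} not visible after {timeout}s")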

Pattern 4: Eventually consistent views

Using Elasticsearch as a read replica of Postgres: writes go to Postgres, then replicate to ES. If you write and immediately read, you can get the old value.

# Fix 1: read from primary for recent writes
def get_user(user_id, recent_write=False):
    if recent_write:
        return postgres.get(user_id)
    return elasticsearch.get(user_id)

# Fix 2: tracking epoch and waiting
def get_user(user_id, min_epoch=None):
    if min_epoch:
        wait_for_replication_to_reach(min_epoch)
    return elasticsearch.get(user_id)

The Detection Strategy

Since these bugs are non-deterministic, detection has to be probabilistic:

# In your logging middleware:
def log_inconsistency(user_id, field, expected, actual):
    if expected != actual:
        metrics.increment('consistency.mismatch', tags=[
            f'field:{field}'
        ])
        logger.warning('Consistency mismatch', user=user_id,
                       field=field, expected=expected, actual=actual)

Monitor the mismatch rate. If it's > 0.01%, you have a real problem.
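
One way to generate those data points is a shadow read: on a small sample of requests, read from both the replica and the primary and compare. A sketch, assuming dict-like records and the primary/replica clients from earlier:

import random

SHADOW_SAMPLE_RATE = 0.01  # compare 1% of reads against the primary

def read_with_shadow_check(user_id, field):
    record = replica.read(user_id)
    if random.random() < SHADOW_SAMPLE_RATE:
        fresh = primary.read(user_id)
        log_inconsistency(user_id, field, fresh[field], record[field])
    return record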

The Diagnostic Questions

When debugging, ask:

  1. How many replicas are involved? More replicas = longer propagation delay.

  2. What's the normal replication lag? Check monitoring. Normal is usually < 1 second.

  3. Is the lag elevated right now? A spike from 100ms to 5s is a red flag.

  4. Was the write synchronous or async? Async writes have no guarantee of being visible on reads.

  5. Does the client have a retry? Retries at the wrong layer can produce duplicates or stale reads.

  6. Is there a cache in the path? Caches are the most common source of "phantom staleness."

The Debugging Tools

1. Replication lag metrics

For PostgreSQL:

SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;

For MongoDB:

rs.printSecondaryReplicationInfo()

For Redis:

INFO replication

Alert on lag > 1 second sustained.
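
To get that alert, poll the lag on a schedule and export it as a metric. A sketch for the Postgres case, assuming a psycopg2 connection to the replica and a statsd-style metrics client like the one above:

import psycopg2

def report_replication_lag(conn):
    # Runs against the replica: how far behind is WAL replay?
    with conn.cursor() as cur:
        cur.execute(
            "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"
        )
        lag_seconds = cur.fetchone()[0] or 0.0  # NULL if nothing replayed yet
    metrics.gauge('replication.lag_seconds', lag_seconds)
    return lag_seconds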

2. Request tracing

Use OpenTelemetry to trace a single user action across services. You'll see where time is spent and which service is talking to which replica.
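
A sketch with the OpenTelemetry Python API; the span and attribute names are my own convention, and replica_pool is a stand-in for however you select a replica:

from opentelemetry import trace

tracer = trace.get_tracer("profile-service")

def get_profile(user_id):
    with tracer.start_as_current_span("profile.read") as span:
        node = replica_pool.pick()  # stand-in for your replica selection
        span.set_attribute("db.replica", node.name)
        span.set_attribute("user.id", user_id)
        return node.read(user_id)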

3. Log all reads with node ID

Every read should log which node/replica served it. When debugging, you can see if the bug correlates with a specific replica.
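
The same idea works in plain structured logging (replica_pool and node.name are placeholders again):

def read_with_node_logging(key):
    node = replica_pool.pick()
    value = node.read(key)
    logger.info('read_served', key=key, node_id=node.name)
    return value

Grep for the user ID: if the stale reads cluster on one node_id, you've found the lagging replica.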

4. Deterministic test environment

Build a test environment where you can introduce artificial replication delay. Helps reproduce bugs locally:

# In test
replica.delay_ms = 500 # Force 500ms replication lag
# Run the flow, assert consistency
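
The replica.delay_ms line above is pseudocode. A self-contained fake you could actually use, sketching one way to implement it:

import time

class FakeReplica:
    """In-memory replica that applies writes after an artificial delay."""
    def __init__(self, delay_ms=0):
        self.delay_ms = delay_ms
        self._data = {}
        self._pending = []  # (apply_at, key, value)

    def replicate(self, key, value):
        # Record the write, to become visible after delay_ms.
        self._pending.append((time.monotonic() + self.delay_ms / 1000, key, value))

    def read(self, key):
        # Apply any replication events whose delay has elapsed.
        now = time.monotonic()
        still_pending = []
        for apply_at, k, v in self._pending:
            if apply_at <= now:
                self._data[k] = v
            else:
                still_pending.append((apply_at, k, v))
        self._pending = still_pending
        return self._data.get(key)

# Run the flow, assert consistency:
replica = FakeReplica(delay_ms=500)
replica.replicate("user:1", "new name")
assert replica.read("user:1") is None        # too early: stale read
time.sleep(0.6)
assert replica.read("user:1") == "new name"  # propagation complete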

The Fix Strategies

Strategy 1: Strong consistency where it matters

Not everything needs to be eventually consistent. For critical flows:

  • User-visible state changes → read from primary
  • Financial operations → strong consistency (transactions)
  • Security operations (auth, permissions) → strong consistency

Accept the latency hit for correctness.
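
One way to encode that decision is a routing helper with an explicit list of flows that must read from the primary (the flow names are illustrative):

CRITICAL_FLOWS = {"auth", "billing", "permissions"}

def read(key, flow):
    # Critical flows pay the latency cost of a primary read;
    # everything else tolerates replica staleness.
    if flow in CRITICAL_FLOWS:
        return primary.read(key)
    return replica.read(key)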

Strategy 2: Causal consistency

Track "you just wrote X" and ensure subsequent reads see X:

# Client sends "min_epoch" based on last write
def handle_request(request):
    if request.min_epoch:
        wait_for_replication(request.min_epoch)
    return db.read(request.key)
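
wait_for_replication is the interesting part. A sketch, assuming the replica exposes the last epoch (e.g. a log sequence number) it has applied:

import time

def wait_for_replication(min_epoch, timeout=5.0, poll_interval=0.05):
    # Block until the replica has applied at least min_epoch, so the
    # subsequent read is causally after the client's last write.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if replica.applied_epoch() >= min_epoch:  # assumed accessor
            return
        time.sleep(poll_interval)
    raise TimeoutError(f"replica still behind epoch {min_epoch} after {timeout}s")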

Strategy 3: User-visible time

Sometimes the solution is UX, not infrastructure:

"Your changes are saved. They may take up to 30 seconds to appear."

Set expectations. Users understand "saved but syncing" better than "works for you but broken for me."

Strategy 4: Accept and monitor

Some inconsistency is tolerable. Log it, alert if it exceeds a threshold, and fix the worst offenders.

The Hardest Bugs

The worst eventual consistency bugs happen at failure recovery time. A write that was in-flight during a network partition might:

  • Succeed on primary but fail to replicate
  • Fail on primary but succeed on one replica
  • Be applied in a different order than another write
  • Be lost entirely

These are fundamental distributed systems problems. Defense: idempotent operations, client-side retries with unique IDs, careful read-after-write semantics.
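For the retry defense specifically, the standard pattern is an idempotency key: the client generates a unique ID per logical operation, and the server deduplicates on it. A minimal sketch (the in-memory dict stands in for a durable store, and apply_upgrade for the real operation):

import uuid

_results = {}  # idempotency_key -> result; use a durable store in practice

def upgrade(user_id, plan, idempotency_key):
    # A retry with the same key returns the original result
    # instead of applying the upgrade a second time.
    if idempotency_key in _results:
        return _results[idempotency_key]
    result = apply_upgrade(user_id, plan)
    _results[idempotency_key] = result
    return result

# Client side: generate one key per logical action, reuse it across retries.
key = str(uuid.uuid4())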

The Takeaway

Eventual consistency isn't a bug. It's a tradeoff.

The bug is:

  • Not understanding where consistency matters in your system
  • Not having visibility into replication lag
  • Not documenting consistency guarantees to users

Fix these, and "eventually consistent" stops being a dirty phrase and starts being a feature.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
