Alan West

Why Your Database Is Lying to You (And How to Catch It)

You deploy a distributed database. The docs promise linearizable reads, durable writes, and seamless failover. You build your app on those guarantees. Then one day, a network partition hits, a node restarts, and suddenly you've got duplicate charges, missing records, or data that traveled back in time.

Sound familiar? Yeah, me too.

The Problem Nobody Talks About in Standups

Distributed systems make promises. Strong consistency. Exactly-once delivery. Automatic failover with zero data loss. And a shocking number of those promises are, to put it charitably, aspirational.

Kyle Kingsbury's Jepsen project has been systematically proving this for years — testing databases under real failure conditions and finding that many of them don't deliver what they claim. We're talking about mainstream, production-grade systems shipping bugs that violate their own documented guarantees.

But here's the thing that keeps me up at night: most teams never discover these issues until they're staring at corrupted production data. The failures are subtle. They happen during edge cases — network partitions, clock skew, leader elections — that your integration tests never simulate.

Root Cause: Why Systems Lie

The root cause isn't malice. It's complexity. Distributed consensus is genuinely hard, and there are a few recurring reasons systems fail to deliver:

  • Stale reads after failover: A replica gets promoted to leader but hasn't received all writes from the old leader. Your app reads stale data and acts on it.
  • Split-brain writes: Two nodes both think they're the leader during a partition. Both accept writes. When the partition heals, one set of writes gets silently dropped.
  • Dirty reads on uncommitted transactions: Some systems expose data from transactions that haven't fully committed across the cluster.
  • Clock-dependent ordering: Systems that rely on wall clocks for ordering (instead of logical clocks) can reorder or lose events when clocks drift.

The tricky part is that these bugs don't show up during normal operation. Your monitoring is green. Your tests pass. Everything looks fine — until it isn't.
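To make the clock-skew failure mode concrete, here's a toy sketch of last-write-wins (LWW) conflict resolution. The `lww_merge` helper is illustrative, not any real database's API, but the failure it demonstrates is exactly what wall-clock ordering can do:

```python
# Toy last-write-wins (LWW) register: each write carries the wall-clock
# timestamp of the node that accepted it. Purely illustrative.

def lww_merge(a, b):
    """Keep whichever (value, wall_clock_ts) pair has the larger timestamp."""
    return a if a[1] >= b[1] else b

# Node A's clock is correct; node B's clock runs ~10 seconds behind.
write_on_a = ("balance=100", 1_700_000_000.0)  # happened FIRST in real time
write_on_b = ("balance=70",  1_699_999_995.0)  # happened SECOND, but B's
                                               # skewed clock stamped it earlier

# When the replicas merge, the genuinely newer write from B is discarded.
winner = lww_merge(write_on_a, write_on_b)
print(winner[0])  # "balance=100": the later update to 70 is silently lost
```

A logical clock (Lamport clock or version vector) would have ordered these writes correctly, which is why the bullet above singles out wall-clock ordering as the hazard.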

Step 1: Stop Trusting, Start Verifying

First thing I do with any distributed datastore: actually read the consistency model docs. Not the marketing page. The actual technical documentation about what guarantees the system provides and under what conditions.

# Questions to answer before you trust a system:
consistency_checklist:
  - What consistency level is the DEFAULT? (not the strongest available)
  - What happens to in-flight writes during leader election?
  - Does the system use synchronous or asynchronous replication?
  - What's the documented behavior during network partitions?
  - Are there known open issues related to data consistency?

You'd be amazed how often the default configuration is the weakest consistency option. The marketing says "strongly consistent" but the default config gives you eventual consistency because it's faster for benchmarks.
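One way to defend against weak defaults is to make the consistency level a required argument at your application boundary. This is a hypothetical wrapper, not any real driver's API, but the pattern applies to most clients that accept a per-request consistency setting:

```python
from enum import Enum

class Consistency(Enum):
    ONE = 1      # fast and weak: a common driver default
    QUORUM = 2   # a majority of replicas must acknowledge
    ALL = 3      # every replica must acknowledge

class ExplicitConsistencyClient:
    """Hypothetical wrapper that refuses any call without an explicit level,
    so no caller silently inherits the driver's (often weakest) default."""

    def __init__(self, raw_client):
        self.raw = raw_client

    def read(self, key, consistency=None):
        if consistency is None:
            raise ValueError(
                "specify a consistency level explicitly; the driver default "
                "may be weaker than the marketing page implies"
            )
        return self.raw.read(key, consistency)
```

Now a code review can't miss a read that quietly relies on the default: every call site states what it actually needs.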

Step 2: Simulate Failures Locally

Don't wait for production to test your system's behavior under failure. You can simulate partitions and crashes locally using tools that already exist in your toolkit.

#!/bin/bash
# Simple network partition simulation using iptables (Linux).
# Run this ON one database node (here, NODE_B's host) to cut it off from
# NODE_A, simulating a partition between the two. Requires root.

NODE_A="192.168.1.10"   # the peer we cut ourselves off from
NODE_B="192.168.1.11"   # this host

# Create partition: drop all traffic to and from NODE_A
echo "Inducing partition between $NODE_A and $NODE_B"
iptables -A INPUT -s "$NODE_A" -j DROP
iptables -A OUTPUT -d "$NODE_A" -j DROP

# Run your write workload against the cluster during the partition
echo "Running writes for 30 seconds during partition..."
sleep 30

# Heal partition: delete the exact rules we added above
echo "Healing partition"
iptables -D INPUT -s "$NODE_A" -j DROP
iptables -D OUTPUT -d "$NODE_A" -j DROP

# Give the cluster time to re-elect a leader and resync
sleep 10

# Now verify: did all writes survive, and do all replicas agree?
echo "Verifying data integrity..."

For a more sophisticated approach, tools like Jepsen (Clojure-based), Namazu (fault injection for containers), or even Toxiproxy from Shopify let you inject latency, partitions, and connection resets between your services.

Step 3: Build a Verification Layer

Here's a pattern I've started using in every project that depends on distributed state: write a verification worker that continuously audits your data for invariant violations.

import logging
from datetime import datetime

logger = logging.getLogger("consistency_auditor")

class ConsistencyAuditor:
    """Continuously checks that your data invariants actually hold."""

    def __init__(self, primary_store, replica_store):
        self.primary = primary_store
        self.replica = replica_store

    def check_replication_lag(self, sample_keys: list[str]):
        """Compare a sample of records between primary and replica."""
        mismatches = []
        for key in sample_keys:
            primary_val = self.primary.get(key)
            replica_val = self.replica.get(key)

            if primary_val != replica_val:
                mismatches.append({
                    "key": key,
                    "primary": primary_val,
                    "replica": replica_val,
                    "detected_at": datetime.utcnow().isoformat()
                })

        if mismatches:
            # Don't just log — alert. This is a real consistency violation.
            logger.critical(
                f"Found {len(mismatches)} replication mismatches",
                extra={"mismatches": mismatches}
            )
        return mismatches

    def verify_monotonic_writes(self, entity_id: str):
        """Ensure version numbers only go forward, never backward."""
        history = self.primary.get_version_history(entity_id)
        for i in range(1, len(history)):
            if history[i].version <= history[i - 1].version:
                logger.critical(
                    f"Non-monotonic version for {entity_id}: "
                    f"{history[i-1].version} -> {history[i].version}"
                )
                return False
        return True

This isn't paranoid. It's engineering. If your database says writes are durable, verify it. If it claims reads are consistent, check.

Step 4: Design for Dishonesty

Once you accept that your infrastructure might not keep its promises, you can design around it:

  • Idempotency keys everywhere: Every write operation should be safe to retry. Use a unique key per operation so that duplicate delivery doesn't create duplicate side effects.
  • Event sourcing for critical paths: Instead of trusting that the current state is correct, keep an append-only log of events. You can always rebuild state from the log.
  • CRDTs for conflict resolution: If you know split-brain writes can happen, use data structures that are mathematically guaranteed to converge. Counters, sets, registers — there are well-studied CRDT types for common patterns.
  • Explicit version vectors: Attach a version vector (not a timestamp) to every piece of data. When conflicts are detected, you have enough information to resolve them deterministically.
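The idempotency-key bullet is the easiest of these to adopt today. Here's a minimal sketch of the pattern using an in-memory store and a hypothetical `PaymentService` (in production the key-to-result map would live in durable storage with a TTL):

```python
import uuid

class PaymentService:
    """Toy service demonstrating the idempotency-key pattern."""

    def __init__(self):
        self._processed = {}  # idempotency_key -> result of first attempt
        self.charges = []     # actual side effects

    def charge(self, idempotency_key, amount_cents):
        # A retry or duplicate delivery replays the stored result
        # instead of creating a second side effect.
        if idempotency_key in self._processed:
            return self._processed[idempotency_key]
        charge_id = str(uuid.uuid4())
        self.charges.append((charge_id, amount_cents))
        self._processed[idempotency_key] = charge_id
        return charge_id

svc = PaymentService()
key = "order-1234-charge"          # one unique key per logical operation
first = svc.charge(key, 4999)
retry = svc.charge(key, 4999)      # e.g. the client timed out and retried
assert first == retry and len(svc.charges) == 1  # exactly one real charge
```

The key must be chosen by the *caller* and tied to the logical operation (this order, this attempt), not generated per request, or retries get fresh keys and the protection evaporates.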

Step 5: Monitor the Promises, Not Just the Metrics

Most monitoring tracks CPU, memory, disk, request latency. That's necessary but not sufficient. You also need to monitor the semantic guarantees:

# Example: a Prometheus metric that tracks consistency violations.
# Expose this from your auditor and alert on it.
consistency_violations_total{type="stale_read"} 0
consistency_violations_total{type="lost_write"} 0
consistency_violations_total{type="version_regression"} 0

# Alert rule (modern Prometheus YAML): any non-zero value is a problem
# - alert: ConsistencyViolation
#   expr: consistency_violations_total > 0
#   for: 1m
#   labels:
#     severity: critical

A green dashboard means nothing if you're not measuring the things that actually matter to your application's correctness.
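If pulling in a full metrics library feels like overkill for the auditor, the Prometheus text exposition format is simple enough to emit by hand. A minimal sketch (the `render_violation_metrics` helper is hypothetical):

```python
def render_violation_metrics(counts):
    """Render violation counters in Prometheus text exposition format.

    `counts` maps a violation type (e.g. "stale_read") to its running
    count, as tallied by an auditing worker like the one in Step 3.
    """
    lines = ["# TYPE consistency_violations_total counter"]
    for vtype in sorted(counts):
        lines.append(
            f'consistency_violations_total{{type="{vtype}"}} {counts[vtype]}'
        )
    return "\n".join(lines)

print(render_violation_metrics({"stale_read": 0, "lost_write": 1}))
```

Serve that string from a `/metrics` HTTP endpoint and Prometheus can scrape it like any other target.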

The Uncomfortable Truth

We've collectively built an ecosystem where distributed systems marketing promises outrun implementation reality. And honestly, I don't think that's changing anytime soon — the incentives are wrong.

So the pragmatic move is to stop treating your database's consistency guarantees as facts and start treating them as hypotheses. Test them. Monitor them. Build safety nets for when they fail.

The future of distributed systems might be full of overpromises, but your application doesn't have to be the one that pays the price. Verify everything, trust nothing, and keep an audit log.
