
Consistency Models: Why Your Database Lies to You (And When That’s Fine)

The Day Amazon’s Shopping Cart Became Haunted

In 2007, Amazon engineers discovered something disturbing: deleted items were coming back from the dead.

A customer would remove an item from their cart, continue shopping, then check out—only to find the “deleted” item had reappeared and been charged to their card. Support tickets flooded in. The bug wasn’t random—it happened during network partitions between Dynamo replicas.

Here’s what was happening: when a customer deleted an item, the write went to Replica A. Replica B, temporarily partitioned, never got the message. When the network healed, Replica B’s version—which still contained the item—merged with Replica A’s version. No clear “winner,” so the item resurrected.

Amazon’s response? They didn’t fix the bug. They redesigned the entire system to embrace it.

The result was Dynamo, the database that pioneered eventual consistency at scale, introduced vector clocks for conflict resolution, and changed how we think about distributed data.

This is where most engineers’ understanding stops: “CAP says you can’t have consistency and availability during partitions.” True. But that binary misses the entire spectrum of how consistent your system actually needs to be.



The Consistency Spectrum Nobody Explains

CAP doesn’t say your data is either “perfectly consistent” or “complete chaos.” Between those extremes lies a gradient as wide as the difference between a bank transfer and a Twitter like.

The full spectrum:

Strong ←────────────────────────────────────────────────→ Weak
Linearizable    Sequential    Causal    Session    Eventual

High latency                                     Low latency
Expensive                                        Cheap
Simple reasoning                                 Complex bugs

Every distributed database picks a point on this line. Your job as an engineer: know which point, understand the trade-offs, and avoid promising guarantees your system doesn’t provide.


The Four Models That Matter

1. Linearizability (Strongest)

Definition: Every operation appears to execute atomically at some point between its start and completion. All clients see operations in the same real-time order.

Think of it as: A single-threaded system, no matter how distributed it actually is.

Code behavior:

// Client A writes
await db.write('x', 5); // completes at time T1

// Client B reads immediately after
const val = await db.read('x'); // guaranteed to see 5
console.log(val); // always prints 5, never 0 or a stale value

Real systems: Google Spanner (using TrueTime atomic clocks), etcd, Zookeeper

Cost: High latency. Every write needs global coordination. Cross-region writes can take 100ms+.

When to use: Financial transactions, inventory management, anything where “approximately right” means “catastrophically wrong.”


2. Causal Consistency (The Pragmatic Middle)

Definition: If operation A causally affects operation B (e.g., B reads A’s write), all nodes see them in that order. Independent operations can appear in any order.

Think of it as: Preserving the story’s plot, but letting unrelated subplots unfold in parallel.

Code behavior:

// Thread 1: Post a tweet
await db.write('tweet:123', 'Hello world');
await db.write('tweet:123:likes', 0); // causally dependent

// Thread 2: Read elsewhere
const tweet = await db.read('tweet:123'); // may be stale
const likes = await db.read('tweet:123:likes');
// Guarantee: if likes exists, tweet must also be visible
// No guarantee: how fresh either value is

Real systems: Azure Cosmos DB (its session consistency level preserves causal order within a session), Cassandra with careful config, some Redis setups

Cost: Medium latency. Requires tracking causal dependencies (vector clocks, version vectors) but no global locks.

When to use: Social networks, collaborative apps, anywhere causality matters but exact timing doesn’t.


3. Sequential Consistency

Definition: Operations from each client execute in program order, but there’s no guarantee of real-time ordering across clients.

Think of it as: Everyone agrees on a single interleaving of all clients’ operations, and each client’s own operations stay in program order, but that interleaving doesn’t have to match wall-clock time.

Code behavior:

// Client A
await db.write('x', 1);
await db.write('x', 2);

// Client B always sees 1 then 2 (or just 2, never 2 then 1)

// What sequential consistency does NOT promise: real-time ordering.
// If Client D writes y = 7 after A finishes in wall-clock time, the single
// order that everyone agrees on may still place y = 7 before x = 1.
// All clients observe the same interleaving; it just may not match the clock.

Real systems: Some distributed SQL databases, older MongoDB replica sets

Cost: Medium-low latency. Cheaper than linearizability but still requires some coordination.

When to use: Analytics systems, read-heavy workloads where per-user consistency matters but cross-user doesn’t.


4. Eventual Consistency (Weakest, Fastest)

Definition: If no new updates occur, eventually all replicas converge to the same value. Before then: anything goes.

Think of it as: “Trust me, it’ll make sense... eventually.”

Code behavior:

// Write to DynamoDB
await dynamoDB.putItem({
  TableName: 'Users',
  Item: { userId: '123', name: 'Alice' }
});

// Read immediately from a different replica
const user = await dynamoDB.getItem({
  TableName: 'Users',
  Key: { userId: '123' },
  ConsistentRead: false // eventual consistency
});

console.log(user.Item.name);
// Might print: undefined (write hasn't propagated)
// Might print: "Bob" (old value still cached)
// Eventually prints: "Alice"

Real systems: DynamoDB (default), Cassandra at consistency level ONE, most CDNs

Cost: Lowest latency. Writes return immediately, reads hit local cache.

When to use: View counts, likes, analytics dashboards, CDN content—anywhere staleness is annoying but not damaging.


The Hybrid Models You’ll Actually Use

Real systems rarely pick one model globally. Instead, they offer tunable consistency per operation.

DynamoDB: Eventual by Default, Strong on Demand

// Fast but potentially stale
const staleUser = await dynamoDB.getItem({
  TableName: 'Users',
  Key: { userId: '123' },
  ConsistentRead: false // default, eventual
});

// Slow but guaranteed fresh
const freshUser = await dynamoDB.getItem({
  TableName: 'Users',
  Key: { userId: '123' },
  ConsistentRead: true // forces read from primary
});

AWS documentation: Read Consistency Options

Cassandra: Choose Per Query

// Write to majority of replicas (CP-leaning)
await cassandra.execute(
  'INSERT INTO users (id, name) VALUES (?, ?)',
  [123, 'Alice'],
  { consistency: cassandra.types.consistencies.quorum }
);

// Read from any replica (AP-leaning)
const result = await cassandra.execute(
  'SELECT name FROM users WHERE id = ?',
  [123],
  { consistency: cassandra.types.consistencies.one }
);

Pattern: Strong writes for critical data, eventual reads for performance.
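
Why QUORUM writes behave "strongly" is simple arithmetic: if the replicas that must acknowledge a write (W) plus the replicas consulted on a read (R) exceed the replication factor (N), every read quorum overlaps every write quorum, so at least one replica in the read set holds the latest acknowledged write. A minimal sketch of that rule; checkQuorumOverlap is a hypothetical helper, not part of the Cassandra driver:

// Quorum overlap rule: reads are guaranteed to see the latest acknowledged
// write only when writeReplicas + readReplicas > totalReplicas.
// Hypothetical helper for illustration, not a driver API.
function checkQuorumOverlap(totalReplicas, writeReplicas, readReplicas) {
  const overlaps = writeReplicas + readReplicas > totalReplicas;
  return overlaps
    ? 'read and write quorums share a replica: fresh reads'
    : 'quorums can miss each other: stale reads possible';
}

console.log(checkQuorumOverlap(3, 2, 2)); // QUORUM write + QUORUM read on RF=3: fresh reads
console.log(checkQuorumOverlap(3, 2, 1)); // QUORUM write + ONE read: stale reads possible

With the QUORUM write and ONE read shown above, the sums do not overlap, which is exactly the trade the "strong writes, eventual reads" pattern accepts.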


Session Consistency: The Mobile App Sweet Spot

Definition: Within a single user session, you see your own writes and causally-related operations. Across sessions: no guarantees.

Why it matters: Users expect their own actions to be reflected immediately. They don’t care if other users see stale data for a few seconds.

// Mobile app example
async function updateProfile(userId, newBio) {
  // Write with session token
  await db.write(
    `user:${userId}:bio`, 
    newBio,
    { sessionToken: user.session }
  );

  // Immediate read in same session - guaranteed to see new bio
  const profile = await db.read(
    `user:${userId}:bio`,
    { sessionToken: user.session }
  );

  return profile; // always returns newBio
}

// Different user reads same profile - might see old bio
// But that’s fine, they’ll see the update within seconds


Systems that provide this: Azure Cosmos DB (session consistency), MongoDB with read preference primary


The Decision Tree

Choose Linearizable when:

  • Money is involved (payments, account balances)

  • Inventory is limited (ticket sales, product stock)

  • Regulatory compliance requires audit trails

  • Pattern: Correctness > Speed, always

Choose Causal when:

  • Social interactions (posts, comments, reactions)

  • Collaborative editing (Google Docs, Figma)

  • Chat applications

  • Pattern: User expectations require logical ordering

Choose Session when:

  • User profiles and preferences

  • Shopping carts

  • Any single-user workflow

  • Pattern: “I see my changes immediately” matters, cross-user sync doesn’t

Choose Eventual when:

  • Analytics and metrics

  • Content recommendations

  • Search indices

  • View/like counts

  • Pattern: Speed matters, approximate is good enough
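
One way to keep this decision tree out of tribal knowledge is to encode it as a per-operation policy that the data layer consults. A minimal sketch, assuming a generic store.read(key, { consistency }) interface; the level names and helper are illustrative, not any specific database's API:

// Hypothetical policy map from data category to consistency level.
// Level names and the store interface are illustrative only.
const CONSISTENCY_POLICY = {
  accountBalance: 'linearizable', // money: correctness over speed
  shoppingCart:   'session',      // users must see their own changes
  userBio:        'session',
  commentThread:  'causal',       // replies must not appear before parents
  viewCount:      'eventual',     // staleness is harmless
};

async function readWithPolicy(store, kind, key) {
  const level = CONSISTENCY_POLICY[kind] ?? 'eventual'; // default to the cheapest level
  return store.read(key, { consistency: level });
}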


How Systems Fail (And Recover)

Twitter’s 2014 Timeline Ordering Bug

Problem: Tweets appeared out-of-order during replication lag

Root cause: Causal consistency violated—replies appeared before original tweets

Fix: Added explicit happens-before tracking for tweet threads

Reddit Vote Count Jumps

Problem: Upvote count changes dramatically on page refresh

Root cause: Eventual consistency—counting across replicas with stale reads

Fix: Added “updated X seconds ago” indicators, set user expectations

The Amazon Shopping Cart (2007)

Problem: Deleted items resurrected after partition

Root cause: Eventual consistency with no conflict resolution

Fix: Vector clocks + client-side merge logic in Dynamo
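
To make that fix concrete, here is a minimal sketch of the idea: each replica's cart carries a vector clock, a conflict is detected when neither clock dominates the other, and the client merges the concurrent versions. Dynamo merged carts by union, which is exactly why deletes needed extra care (tombstones) to avoid the resurrection described above. Helper names are illustrative, not Dynamo's actual API:

// Compare two vector clocks: did one version happen before the other,
// or are they concurrent (a true conflict)?
function compareClocks(a, b) {
  const nodes = new Set([...Object.keys(a), ...Object.keys(b)]);
  let aBehind = false;
  let bBehind = false;
  for (const n of nodes) {
    const av = a[n] ?? 0;
    const bv = b[n] ?? 0;
    if (av < bv) aBehind = true;
    if (bv < av) bBehind = true;
  }
  if (aBehind && bBehind) return 'concurrent';
  if (aBehind) return 'a-before-b';
  if (bBehind) return 'b-before-a';
  return 'equal';
}

// Client-side merge for concurrent cart versions: union of item IDs.
// Without delete tombstones, this union resurrects removed items,
// which is the 2007 behavior described above.
function mergeCarts(versionA, versionB) {
  return { items: [...new Set([...versionA.items, ...versionB.items])] };
}

const replicaA = { items: ['book'], clock: { A: 2, B: 1 } };            // saw the delete
const replicaB = { items: ['book', 'toaster'], clock: { A: 1, B: 2 } }; // missed the delete

if (compareClocks(replicaA.clock, replicaB.clock) === 'concurrent') {
  console.log(mergeCarts(replicaA, replicaB)); // { items: ['book', 'toaster'] }
}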

Key lesson: Weak consistency isn’t failure. Hiding it from users is.


Debugging Consistency Issues

Symptom: “My write disappeared”

  • Check if you’re reading from a different replica than you wrote to

  • Verify write acknowledgment (did it actually succeed?)

  • Look for partition healing—merges can “undo” writes

Symptom: “Data time-travels backward”

  • Reading from replicas with different lag

  • Check read-after-write guarantees in your client library

  • Consider session consistency or sticky routing
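
A lightweight guard for this symptom is to track the highest version a session has already seen per key and refuse to go backwards. A minimal sketch, assuming reads return a monotonically increasing version alongside the value; the readVersioned method and preferPrimary option are hypothetical:

// Per-session monotonic-read guard: never hand back a value older than one
// this session has already observed for the same key. Assumes each read
// returns { value, version } with an increasing version number, which is
// a hypothetical convention rather than a specific database's API.
class MonotonicSession {
  constructor(store) {
    this.store = store;
    this.highestSeen = new Map(); // key -> highest version observed so far
  }

  async read(key) {
    let result = await this.store.readVersioned(key);
    const floor = this.highestSeen.get(key) ?? -1;
    if (result.version < floor) {
      // The replica is behind what this session already saw: re-read from primary
      result = await this.store.readVersioned(key, { preferPrimary: true });
    }
    this.highestSeen.set(key, Math.max(result.version, floor));
    return result;
  }
}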

Symptom: “Conflicts I can’t explain”

  • Concurrent writes during partition

  • Check if system uses last-write-wins (LWW) or vector clocks

  • Look for application-level conflict resolution bugs


Practical Takeaways

1. Consistency is a spectrum, not a binary

Stop saying “my DB is consistent.” Specify: linearizable? causal? eventual?

2. Most apps need multiple consistency levels

Bank balance: strong. User bio: session. Feed ranking: eventual.

3. Latency and consistency are inversely related

Every 9 of consistency costs you a 0 in latency. Pick your battles.

4. Communicate your guarantees to users

“Updated 5 seconds ago” > silent staleness. Honesty scales.

5. Test with partitions, not just load

Use chaos engineering (Chaos Monkey, Gremlin) to simulate real failure modes.
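
You don't need full chaos tooling to start: even a toy test with two in-memory replicas and an artificial replication delay will expose read-after-write assumptions in your code. A minimal sketch; everything below is a stand-in, not a real database client:

// Toy two-replica store with artificial replication lag, used as a stand-in
// for a real partition test. Not any actual database client.
class LaggyReplicatedStore {
  constructor(replicationDelayMs) {
    this.primary = new Map();
    this.replica = new Map();
    this.delay = replicationDelayMs;
  }
  async write(key, value) {
    this.primary.set(key, value);                               // acknowledged immediately
    setTimeout(() => this.replica.set(key, value), this.delay); // propagates later
  }
  async read(key, { consistent = false } = {}) {
    return consistent ? this.primary.get(key) : this.replica.get(key);
  }
}

const store = new LaggyReplicatedStore(200);
await store.write('user:123:bio', 'Hello');
console.log(await store.read('user:123:bio'));                       // undefined (stale replica)
console.log(await store.read('user:123:bio', { consistent: true })); // 'Hello'
setTimeout(async () => {
  console.log(await store.read('user:123:bio')); // 'Hello' once replication catches up
}, 300);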




This is Chapter 2 from my book on distributed systems fundamentals. Subscribe for weekly breakdowns of consensus, replication, and the other concepts that actually matter in production.

Quick context: If you’re jumping in here, this builds on CAP Theorem—the fundamental trade-off between consistency and availability during network partitions. Start with CAP first if you haven’t read it yet. This post assumes you understand why systems can’t have perfect consistency AND availability.


Question for readers: Have you debugged a consistency bug in production? What was the “aha” moment when you realized it was a replication/consistency issue? Drop a comment—building a collection of war stories.

