Ed

Originally published at olko.substack.com
CAP Theorem: What Senior Engineers Actually Need to Know


When GitHub Chose Consistency Over Your Pull Request

At 3:11 a.m. UTC on October 21, 2018, GitHub’s monitoring detected replication lag spiking from milliseconds to hours between U.S. data centers. Within minutes, the platform serving 30 million developers began showing stale data.

GitHub’s response? They deliberately froze all writes for five hours.

The banner every developer saw that morning:

“GitHub is experiencing degraded availability while we recover replication consistency.”

This wasn’t incompetence—it was the most expensive proof of CAP Theorem in production history.

When their MySQL cluster split across coasts, GitHub faced the choice every distributed system eventually faces: serve requests with potentially wrong data, or stop serving requests entirely. They chose the latter. Their post-mortem is worth reading.

If you’re designing anything beyond a single server—DynamoDB tables, Kafka clusters, replicated Postgres, microservices—you will make this exact choice. Understanding CAP isn’t optional. It’s the gravitational law of distributed computing.


Triangle of quality: CAP

The Theorem in 30 Seconds

CAP Theorem (conjectured by Brewer in 2000, formalized by Gilbert & Lynch in 2002): a distributed system cannot simultaneously guarantee all three of:

  1. Consistency (C) – All nodes see identical data simultaneously

  2. Availability (A) – Every request to a non-failing node gets a non-error response

  3. Partition Tolerance (P) – System functions despite network failures between nodes

Since network partitions are inevitable in distributed systems, you’re really choosing between C or A when failures occur.


Why This Matters: The Coffee Shop Test

Two branches of a coffee chain share a loyalty-points database. The network between them fails.

Scenario 1: Choose Availability (AP)

  • Both locations keep accepting purchases

  • Customer spends 500 points at Downtown, then immediately drives to Uptown and spends the same 500 points again

  • System reconciles later, discovers the conflict

  • Result: Lost revenue, but no angry customers

Scenario 2: Choose Consistency (CP)

  • Downtown can’t verify points balance with Uptown

  • Register locks until sync returns

  • Result: Guaranteed correctness, angry customers waiting

No architecture eliminates this trade-off. You only choose where to suffer.


Code: Seeing the Choice

AP System (Availability-First)

```js
// DynamoDB-style eventual consistency
async function purchaseWithPoints(customerId, points) {
  // Write to the local node immediately; if even the local write fails,
  // let the error propagate — that's a real failure, not a partition
  await localDB.write({ customerId, points, timestamp: Date.now() });

  // Replicate asynchronously (fire and forget);
  // unreachable replicas catch up after the partition heals
  replicateAsync(customerId, points).catch(err => {
    logger.warn('Replication delayed', err);
  });

  return {
    success: true,
    consistency: 'eventual',
    message: 'Purchase recorded, replicating...'
  };
}
```

This code prioritizes uptime. Even if 2 of 3 replicas are down, customers transact. Conflicts get resolved through:

  • Last-write-wins (simple, potentially lossy)

  • Vector clocks (complex, deterministic)

  • Conflict-free replicated data types (CRDTs)

Real-world AP systems: DynamoDB, Cassandra, Riak
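Last-write-wins is simple enough to sketch in full. The `mergeLWW` helper below is a hypothetical illustration (not any particular database's implementation): each replica stores a timestamp alongside the value, and reconciliation keeps the newer write — which is exactly how the earlier write gets silently lost.

```javascript
// Hypothetical last-write-wins reconciliation: each replica stores
// { value, timestamp }; the merge keeps the most recent write.
function mergeLWW(a, b) {
  return a.timestamp >= b.timestamp ? a : b;
}

// Two branches recorded conflicting balances during the partition:
const downtown = { value: 0,   timestamp: 1700000002000 }; // later write: points spent here
const uptown   = { value: 500, timestamp: 1700000001000 }; // earlier write

const merged = mergeLWW(downtown, uptown);
console.log(merged.value); // 0 — the later write wins; the earlier one is discarded
```

Note the lossiness: if both branches had deducted different amounts, one deduction simply vanishes. That is the price of "simple", and why vector clocks and CRDTs exist.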


CP System (Consistency-First)

```js
// MongoDB-style primary-secondary with write concerns
async function purchaseWithPoints(customerId, points) {
  const session = await startTransaction();

  try {
    // Write to the primary
    await primaryDB.write({ customerId, points }, { session });

    // Block until a majority of replicas confirm
    await waitForReplication({
      writeConcern: 'majority',
      timeout: 5000 // fail if replication doesn't complete within 5s
    });

    await session.commitTransaction();
    return {
      success: true,
      consistency: 'strong',
      message: 'Purchase confirmed on a majority of replicas'
    };
  } catch (err) {
    await session.abortTransaction();
    // Refuse the write rather than create divergent data
    return {
      success: false,
      error: 'Network partition detected, refusing write'
    };
  }
}
```

This code prioritizes correctness. During partitions, requests fail rather than create divergent data. MongoDB’s evolution toward this model came after data loss incidents in early versions.

Real-world CP systems: Spanner, HBase, MongoDB with majority write concern


The Hybrid Reality: Tunable Consistency

Modern databases rarely pick a single point on the spectrum. Instead, they let you choose per operation.

DynamoDB example:

```js
// Strong read (CP behavior)
const strongItem = await dynamoDB.getItem({
  TableName: 'Users',
  Key: { userId: '123' },
  ConsistentRead: true // returns the most recent committed write
});

// Eventual read (AP behavior)
const eventualItem = await dynamoDB.getItem({
  TableName: 'Users',
  Key: { userId: '123' },
  ConsistentRead: false // may read from a stale replica
});
```

Cassandra’s quorum tuning:

```js
// Send to all replicas, succeed when a majority (2 of 3) confirm (CP-leaning)
await cassandra.execute(query, params, {
  consistency: Consistency.QUORUM
});

// Succeed as soon as 1 replica confirms (AP-leaning)
await cassandra.execute(query, params, {
  consistency: Consistency.ONE
});
```
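The quorum settings above follow a simple rule: with N replicas, a read of R nodes is guaranteed to overlap the most recent write of W nodes whenever R + W > N. A tiny helper (hypothetical, not part of any driver API) makes the trade-off explicit:

```javascript
// With N replicas, reads of R nodes overlap writes of W nodes — and
// therefore see the latest write — whenever R + W > N.
function isStronglyConsistent(n, r, w) {
  return r + w > n;
}

const N = 3;
console.log(isStronglyConsistent(N, 2, 2)); // true  — QUORUM reads + QUORUM writes
console.log(isStronglyConsistent(N, 1, 1)); // false — ONE/ONE: fast, but stale reads possible
console.log(isStronglyConsistent(N, 3, 1)); // true  — read-ALL tolerates single-replica writes
```

This is why QUORUM/QUORUM is the usual "CP-leaning" default: it is the cheapest pair that still satisfies the overlap inequality.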

The Decision Tree: When to Pick What

Choose CP (Consistency > Availability) when:

  • Financial transactions (double-spending is catastrophic)

  • Inventory systems (overselling angers customers)

  • Config management (inconsistent configs break prod)

  • Medical records (wrong data kills people)

Pattern: When incorrect data causes more damage than downtime.

Choose AP (Availability > Consistency) when:

  • Social media feeds (stale likes don’t matter)

  • Analytics dashboards (approximate counts acceptable)

  • Shopping cart contents (user fixes conflicts themselves)

  • Messaging “read receipts” (eventual accuracy sufficient)

Pattern: When downtime frustrates users more than temporary staleness.

The gray area:

Most systems need both—strong consistency for writes, eventual consistency for reads. This is where PACELC extends CAP: even when there's no partition, you trade latency for consistency.
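PACELC's "else" branch is easy to see with numbers. Suppose three replicas acknowledge a write with different latencies (the values below are assumed for illustration): a strongly consistent write must wait for a majority of acks, while an eventual write can return after the fastest one. This toy model is not a real driver, just the arithmetic:

```javascript
// Toy PACELC model: per-replica ack latencies in ms (assumed values).
const ackLatencies = [5, 40, 120]; // local node, same-region replica, cross-region replica

// Eventual write: return as soon as the fastest (local) node acks.
function eventualWriteLatency(latencies) {
  return Math.min(...latencies);
}

// Strong (majority-of-3) write: wait for the 2nd ack, i.e. the median.
function majorityWriteLatency(latencies) {
  const sorted = [...latencies].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length / 2)];
}

console.log(eventualWriteLatency(ackLatencies)); // 5  — fast, but later reads may be stale
console.log(majorityWriteLatency(ackLatencies)); // 40 — slower even with no partition at all
```

The 8x gap exists on a perfectly healthy network — that latency cost, not partition handling, is what PACELC adds to the conversation.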


What About “CA” Systems?

The CA corner (Consistency + Availability, no Partition Tolerance) is theoretically impossible in truly distributed systems. Network partitions aren’t optional—they happen.

Single-node databases like standalone Postgres or Redis are “CA” but only because they’re not distributed. The moment you add a replica, you’re playing CAP.


The Real Lesson: Partitions Choose For You

Eric Brewer (whose conjecture became the CAP Theorem) made the point in his 2000 PODC keynote: partitions aren't theoretical. They're Tuesday.

  • Cross-datacenter links fail

  • Rack switches crash

  • OS updates hang

  • Cloud providers have “availability zone degradation”

If you haven’t explicitly chosen CP or AP, the network will decide for you during the outage—usually in the worst possible way.

GitHub chose CP, went down for 5 hours, but preserved data integrity.

AWS DynamoDB chooses AP by default, stays up during partitions, accepts stale reads.

Neither is “better.” Both are intentional choices aligned with business requirements.


Takeaway Box

Use CP when:

Incorrect data causes financial loss, safety issues, or user-facing inconsistency that support can’t fix.

Use AP when:

Downtime loses revenue, user trust, or competitive edge faster than eventual consistency causes problems.

Use tunable consistency when:

Different operations have different requirements. Read-heavy analytics can be eventual; write-heavy checkout must be strong.

Test your choice:

Simulate partitions using chaos engineering tools (Netflix's Chaos Kong, Gremlin). Discover your actual behavior before production does.
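You don't need Chaos Kong to start the drill — you can assert your client's partition behavior in a unit test. The sketch below uses a hypothetical in-memory CP-style store that refuses writes while a simulated partition flag is set:

```javascript
// Minimal CP-style store for a partition drill (hypothetical, in-memory).
class CPStore {
  constructor() {
    this.data = new Map();
    this.partitioned = false;
  }
  write(key, value) {
    if (this.partitioned) {
      // CP choice: refuse the write rather than risk divergent replicas.
      throw new Error('partition detected, write refused');
    }
    this.data.set(key, value);
  }
  read(key) {
    return this.data.get(key);
  }
}

const store = new CPStore();
store.write('balance', 500);
store.partitioned = true; // inject the failure, chaos-engineering style
try {
  store.write('balance', 0);
} catch (e) {
  console.log(e.message); // the system fails loudly instead of silently diverging
}
console.log(store.read('balance')); // 500 — pre-partition data intact
```

An AP-flavored variant of the same test would assert the opposite: the write succeeds during the partition, and reconciliation is verified afterward. Either way, the test pins down which behavior you actually chose.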



This is Chapter 1 from my upcoming book on distributed systems fundamentals. If you found this useful, subscribe for weekly deep-dives on consensus, eventual consistency, and other concepts that matter in production.

Questions? Disagree with something? Leave a comment—I’m testing this material and want to know what I got wrong.


