Ed

Originally published at olko.substack.com
CAP Theorem: What Senior Engineers Actually Need to Know


When GitHub Chose Consistency Over Your Pull Request

At 3:11 a.m. UTC on October 21, 2018, GitHub’s monitoring detected replication lag spiking from milliseconds to hours between U.S. data centers. Within minutes, the platform serving 30 million developers began showing stale data.

GitHub’s response? They deliberately froze all writes for five hours.

The banner every developer saw that morning:

“GitHub is experiencing degraded availability while we recover replication consistency.”

This wasn’t incompetence—it was the most expensive proof of CAP Theorem in production history.

When their MySQL cluster split across coasts, GitHub faced the choice every distributed system eventually faces: serve requests with potentially wrong data, or stop serving requests entirely. They chose the latter. Their post-mortem is worth reading.

If you’re designing anything beyond a single server—DynamoDB tables, Kafka clusters, replicated Postgres, microservices—you will make this exact choice. Understanding CAP isn’t optional. It’s the gravitational law of distributed computing.


Triangle of quality: CAP

The Theorem in 30 Seconds

CAP Theorem (conjectured by Brewer in 2000, formalized by Gilbert & Lynch in 2002): a distributed system cannot simultaneously guarantee all three of:

  1. Consistency (C) – All nodes see identical data simultaneously

  2. Availability (A) – Every request to a non-failing node gets a non-error response

  3. Partition Tolerance (P) – System functions despite network failures between nodes

Since network partitions are inevitable in distributed systems, you’re really choosing between C or A when failures occur.


Why This Matters: The Coffee Shop Test

Two branches of a coffee chain share a loyalty-points database. The network between them fails.

Scenario 1: Choose Availability (AP)

  • Both locations keep accepting purchases

  • Customer spends 500 points at Downtown, then immediately drives to Uptown and spends the same 500 points again

  • System reconciles later, discovers the conflict

  • Result: Lost revenue, but no angry customers

Scenario 2: Choose Consistency (CP)

  • Downtown can’t verify points balance with Uptown

  • Register locks until sync returns

  • Result: Guaranteed correctness, angry customers waiting

No architecture eliminates this trade-off. You only choose where to suffer.


Code: Seeing the Choice

AP System (Availability-First)

```js
// DynamoDB-style eventual consistency
async function purchaseWithPoints(customerId, points) {
  // Write to the local node immediately; if even the local write fails,
  // let the error propagate — that's a real failure, not a partition
  await localDB.write({ customerId, points, timestamp: Date.now() });

  // Replicate asynchronously (fire and forget);
  // unreachable replicas catch up after the partition heals
  replicateAsync(customerId, points).catch(err => {
    logger.warn('Replication delayed', err);
  });

  return {
    success: true,
    consistency: 'eventual',
    message: 'Purchase recorded, replicating...'
  };
}
```

This code prioritizes uptime. Even if 2 of 3 replicas are down, customers transact. Conflicts get resolved through:

  • Last-write-wins (simple, potentially lossy)

  • Vector clocks (complex, deterministic)

  • Conflict-free replicated data types (CRDTs)

Real-world AP systems: DynamoDB, Cassandra, Riak
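Last-write-wins is simple enough to sketch in full. The `mergeLWW` helper below is a hypothetical illustration (not any particular database's implementation): each replica stores a timestamp alongside the value, and reconciliation keeps the newer write — which is exactly how the earlier write gets silently lost.

```javascript
// Hypothetical last-write-wins reconciliation: each replica stores
// { value, timestamp }; the merge keeps the most recent write.
function mergeLWW(a, b) {
  return a.timestamp >= b.timestamp ? a : b;
}

// Two branches recorded conflicting balances during the partition:
const downtown = { value: 0,   timestamp: 1700000002000 }; // later write: points spent here
const uptown   = { value: 500, timestamp: 1700000001000 }; // earlier write

const merged = mergeLWW(downtown, uptown);
console.log(merged.value); // 0 — the later write wins; the earlier one is discarded
```

Note the lossiness: if both branches had deducted different amounts, one deduction simply vanishes. That is the price of "simple", and why vector clocks and CRDTs exist.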


CP System (Consistency-First)

```js
// MongoDB-style primary-secondary with write concerns
async function purchaseWithPoints(customerId, points) {
  const session = await startTransaction();

  try {
    // Write to the primary
    await primaryDB.write({ customerId, points }, { session });

    // Block until a majority of replicas confirm
    await waitForReplication({
      writeConcern: 'majority',
      timeout: 5000 // fail if replication doesn't complete within 5s
    });

    await session.commitTransaction();
    return {
      success: true,
      consistency: 'strong',
      message: 'Purchase confirmed on a majority of replicas'
    };
  } catch (err) {
    await session.abortTransaction();
    // Refuse the write rather than create divergent data
    return {
      success: false,
      error: 'Network partition detected, refusing write'
    };
  }
}
```

This code prioritizes correctness. During partitions, requests fail rather than create divergent data. MongoDB’s evolution toward this model came after data loss incidents in early versions.

Real-world CP systems: Spanner, HBase, MongoDB with majority write concern


The Hybrid Reality: Tunable Consistency

Modern databases rarely pick a single point on the spectrum. Instead, they let you choose per operation.

DynamoDB example:

```js
// Strong read (CP behavior)
const strongItem = await dynamoDB.getItem({
  TableName: 'Users',
  Key: { userId: '123' },
  ConsistentRead: true // returns the most recent committed write
});

// Eventual read (AP behavior)
const eventualItem = await dynamoDB.getItem({
  TableName: 'Users',
  Key: { userId: '123' },
  ConsistentRead: false // may read from a stale replica
});
```

Cassandra’s quorum tuning:

```js
// Send to all replicas, succeed when a majority (2 of 3) confirm (CP-leaning)
await cassandra.execute(query, params, {
  consistency: Consistency.QUORUM
});

// Succeed as soon as 1 replica confirms (AP-leaning)
await cassandra.execute(query, params, {
  consistency: Consistency.ONE
});
```
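The quorum settings above follow a simple rule: with N replicas, a read of R nodes is guaranteed to overlap the most recent write of W nodes whenever R + W > N. A tiny helper (hypothetical, not part of any driver API) makes the trade-off explicit:

```javascript
// With N replicas, reads of R nodes overlap writes of W nodes — and
// therefore see the latest write — whenever R + W > N.
function isStronglyConsistent(n, r, w) {
  return r + w > n;
}

const N = 3;
console.log(isStronglyConsistent(N, 2, 2)); // true  — QUORUM reads + QUORUM writes
console.log(isStronglyConsistent(N, 1, 1)); // false — ONE/ONE: fast, but stale reads possible
console.log(isStronglyConsistent(N, 3, 1)); // true  — read-ALL tolerates single-replica writes
```

This is why QUORUM/QUORUM is the usual "CP-leaning" default: it is the cheapest pair that still satisfies the overlap inequality.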

The Decision Tree: When to Pick What

Choose CP (Consistency > Availability) when:

  • Financial transactions (double-spending is catastrophic)

  • Inventory systems (overselling angers customers)

  • Config management (inconsistent configs break prod)

  • Medical records (wrong data kills people)

Pattern: When incorrect data causes more damage than downtime.

Choose AP (Availability > Consistency) when:

  • Social media feeds (stale likes don’t matter)

  • Analytics dashboards (approximate counts acceptable)

  • Shopping cart contents (user fixes conflicts themselves)

  • Messaging “read receipts” (eventual accuracy sufficient)

Pattern: When downtime frustrates users more than temporary staleness.

The gray area:

Most systems need both—strong consistency for writes, eventual consistency for reads. This is where PACELC extends CAP: even when there's no partition, you trade latency for consistency.
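PACELC's "else" branch is easy to see with numbers. Suppose three replicas acknowledge a write with different latencies (the values below are assumed for illustration): a strongly consistent write must wait for a majority of acks, while an eventual write can return after the fastest one. This toy model is not a real driver, just the arithmetic:

```javascript
// Toy PACELC model: per-replica ack latencies in ms (assumed values).
const ackLatencies = [5, 40, 120]; // local node, same-region replica, cross-region replica

// Eventual write: return as soon as the fastest (local) node acks.
function eventualWriteLatency(latencies) {
  return Math.min(...latencies);
}

// Strong (majority-of-3) write: wait for the 2nd ack, i.e. the median.
function majorityWriteLatency(latencies) {
  const sorted = [...latencies].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length / 2)];
}

console.log(eventualWriteLatency(ackLatencies)); // 5  — fast, but later reads may be stale
console.log(majorityWriteLatency(ackLatencies)); // 40 — slower even with no partition at all
```

The 8x gap exists on a perfectly healthy network — that latency cost, not partition handling, is what PACELC adds to the conversation.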


What About “CA” Systems?

The CA corner (Consistency + Availability, no Partition Tolerance) is theoretically impossible in truly distributed systems. Network partitions aren’t optional—they happen.

Single-node databases like standalone Postgres or Redis are “CA” but only because they’re not distributed. The moment you add a replica, you’re playing CAP.


The Real Lesson: Partitions Choose For You

Eric Brewer (whose conjecture became the CAP Theorem) made the point in his 2000 PODC keynote: partitions aren't theoretical. They're Tuesday.

  • Cross-datacenter links fail

  • Rack switches crash

  • OS updates hang

  • Cloud providers have “availability zone degradation”

If you haven’t explicitly chosen CP or AP, the network will decide for you during the outage—usually in the worst possible way.

GitHub chose CP, went down for 5 hours, but preserved data integrity.

AWS DynamoDB chooses AP by default, stays up during partitions, accepts stale reads.

Neither is “better.” Both are intentional choices aligned with business requirements.


Takeaway Box

Use CP when:

Incorrect data causes financial loss, safety issues, or user-facing inconsistency that support can’t fix.

Use AP when:

Downtime loses revenue, user trust, or competitive edge faster than eventual consistency causes problems.

Use tunable consistency when:

Different operations have different requirements. Read-heavy analytics can be eventual; write-heavy checkout must be strong.

Test your choice:

Simulate partitions using chaos engineering tools (Netflix's Chaos Kong, Gremlin). Discover your actual behavior before production does.
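You don't need Chaos Kong to start the drill — you can assert your client's partition behavior in a unit test. The sketch below uses a hypothetical in-memory CP-style store that refuses writes while a simulated partition flag is set:

```javascript
// Minimal CP-style store for a partition drill (hypothetical, in-memory).
class CPStore {
  constructor() {
    this.data = new Map();
    this.partitioned = false;
  }
  write(key, value) {
    if (this.partitioned) {
      // CP choice: refuse the write rather than risk divergent replicas.
      throw new Error('partition detected, write refused');
    }
    this.data.set(key, value);
  }
  read(key) {
    return this.data.get(key);
  }
}

const store = new CPStore();
store.write('balance', 500);
store.partitioned = true; // inject the failure, chaos-engineering style
try {
  store.write('balance', 0);
} catch (e) {
  console.log(e.message); // the system fails loudly instead of silently diverging
}
console.log(store.read('balance')); // 500 — pre-partition data intact
```

An AP-flavored variant of the same test would assert the opposite: the write succeeds during the partition, and reconciliation is verified afterward. Either way, the test pins down which behavior you actually chose.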



This is Chapter 1 from my upcoming book on distributed systems fundamentals. If you found this useful, subscribe for weekly deep-dives on consensus, eventual consistency, and other concepts that matter in production.

Questions? Disagree with something? Leave a comment—I’m testing this material and want to know what I got wrong.


