Distributed Systems: The CAP Theorem

So, I'm gonna start talking about distributed systems as I learn them. I'm starting with the CAP theorem, which seems like a central consideration in designing distributed data stores (DBs that shard and replicate data across multiple nodes).

This is mostly a summary of other sources, particularly Coda Hale's fantastic article "You Can't Sacrifice Partition Tolerance" (linked below).

The three parts of CAP are as follows:

  • Consistency: Every time we read from the database, we either get the most recent write or an error; we never silently get stale data. (This is different from ACID consistency, which is about keeping the data valid with respect to a schema and its constraints.)
  • Availability: Every request gets a (non-error) response, pretty much no matter what, but with no assurance that the most recent data will be sent back.
  • Partition Tolerance: This one basically means the system works over a network; it keeps functioning despite messages being dropped, truncated, delayed, or otherwise mangled in transit.

Essentially, we take Partition Tolerance as a given: we can't really have a distributed data store without relying on a network, and it becomes definitionally not-distributed if we give this one up. People are going to accidentally cut wires, EMPs will happen, or the power grid will go down. We can't have a distributed system at all without accounting for these hiccups.

So the choice becomes this: when the network cuts us off from some of our nodes, do we still accept requests, sending back partial (or not fully up-to-date) data? Or do we wait for everything to come back online, reconcile any writes that might have happened in the meantime, and not serve data until we know we're good to go?
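To make that choice concrete, here's a toy sketch (in Python, nothing like a real database) of a single key stored on two replicas. Everything in it (TinyStore, read_consistent, read_available) is a made-up name purely for illustration: the "consistent" read refuses to answer when a partition means it can't confirm it has the latest write, while the "available" read always answers, possibly with stale data.

```python
class Replica:
    def __init__(self):
        self.value = None
        self.version = 0


class TinyStore:
    def __init__(self):
        self.replicas = [Replica(), Replica()]
        self.partitioned = False  # True = the replicas can't reach each other

    def write(self, value):
        # Writes land on replica 0; replication to replica 1 only succeeds
        # while the network is healthy.
        r0, r1 = self.replicas
        r0.version += 1
        r0.value = value
        if not self.partitioned:
            r1.version, r1.value = r0.version, r0.value

    def read_consistent(self, i):
        # Choosing C over A: refuse to answer unless we can confirm our copy
        # is up to date (here, by checking with the other replica).
        if self.partitioned:
            raise RuntimeError("unavailable: can't confirm we have the latest write")
        return self.replicas[i].value

    def read_available(self, i):
        # Choosing A over C: always answer, even if the local copy is stale.
        return self.replicas[i].value


store = TinyStore()
store.write("v1")
store.partitioned = True   # the network partitions
store.write("v2")          # only replica 0 sees this write

print(store.read_available(1))   # -> "v1" (stale, but we answered)
try:
    store.read_consistent(1)
except RuntimeError as err:
    print(err)                   # -> unavailable: can't confirm we have the latest write
```

Real systems are vastly more subtle than this (quorums, leases, consensus protocols), but this is the basic shape of the trade-off.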

Of course, it's not completely cut and dried. Michael Stonebraker makes assertions about CAP that, frankly, I'm not quite qualified to evaluate. It seems like his concern is that the availability you'd gain by sacrificing consistency isn't a real thing, and that single-node failures can be handled while maintaining consistency, without any disruption to availability.

To summarize his first point: the CAP theorem assumes reliable applications, a reliable DBMS, reliable humans, and no change in resources. Unfortunately, none of those can be taken for granted, and all of them have to be considered. In addition, modeling node failures as network partitions leads to an obviously bad conclusion.

Obvious to whom? Probably someone more well-versed in databases than I am. Regardless, Stonebraker ends up clarifying that it doesn't make much sense to "give up" consistency, since a sufficiently screwed-up network will kill your application anyway. In his view, it's not infrastructural failure but human error, bad actors, and glitchy code that are the real, frequent sources of downtime.

Hale responds with some stories, though:

Multi-node failures may be rarer than single-node failures, but they are still common enough to have serious effects on business. In my limited experience I’ve dealt with long-lived network partitions in a single data center (DC), PDU failures, switch failures, accidental power cycles of whole racks, whole-DC backbone failures, whole-DC power failures, and a hypoglycemic driver smashing his Ford pickup truck into a DC’s HVAC system. And I’m not even an ops guy.

Ultimately, I lean somewhat towards Hale's perspective, which is essentially this: there are usually tolerances around consistency, and eventual consistency is good enough a lot of the time. Losing availability means losing money, and money is ultimately what determines the requirements of most systems, so availability is what we need to prioritize.
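"Eventual" can be as simple as replicas accepting writes independently and then reconciling when they can talk again. Here's a minimal sketch of one common (and crude) reconciliation strategy, last-write-wins; the class and counter below are invented for illustration, and real systems lean on wall-clock timestamps, vector clocks, CRDTs, or application-level merge logic instead of a single global counter.

```python
import itertools

_clock = itertools.count(1)  # stand-in for a timestamp, purely for the demo


class EventualReplica:
    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value):
        # Always accept the write locally: this is the "stay available" choice.
        self.data[key] = (next(_clock), value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

    def merge(self, other):
        # Reconcile with another replica: for each key, keep whichever
        # side saw the newer write (last-write-wins).
        for key, (ts, value) in other.data.items():
            if key not in self.data or ts > self.data[key][0]:
                self.data[key] = (ts, value)


a = EventualReplica("a")
b = EventualReplica("b")

# During a partition, each side keeps taking writes (availability preserved).
a.write("cart", ["book"])
b.write("cart", ["book", "mug"])

# The partition heals: the replicas exchange state and converge.
a.merge(b)
b.merge(a)
assert a.read("cart") == b.read("cart")
print(a.read("cart"))  # -> ['book', 'mug'] (the later write won)
```

Note that last-write-wins can silently discard one side's concurrent update; whether that risk is acceptable is exactly the kind of consistency tolerance being talked about here.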

(As an aside, I'm also really tickled by the names of distributed systems - CockroachDB, and Project Voldemort are particularly funny names to me)

Further reading:
https://www.microsoft.com/en-us/research/wp-content/uploads/2011/10/ConsistencyAndBaseballReport.pdf
https://codahale.com/you-cant-sacrifice-partition-tolerance/
https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html
https://en.wikipedia.org/wiki/Consistency_model
