Trade-offs of Distributed Data Stores : Theories and Theorems

#captheorem #pacelc #distributedsystems #nosql

Distributed Data Store refers to either a distributed database storing on a number of nodes (often in a replicated fashion)
OR
a computer network storing information on a number of peer network nodes.

Certain theorems describes potential limitations and trade-offs (say, in consistency) for distributed databases, CAP Theorem is one of them.
The CAP Theorem states that any distributed data store can provide only two of the following three guarantees :
Consistency-Availability-Tolerance towards Network Partition
This theorem is also named as Brewer's theorem after computer scientist Eric Brewer who gave it's hypothesis and the theorem was proved by Gilbert and Lynch.

Considering a distributed system with more than one node,
In a Consistent system, once a client writes a value to any of the nodes and gets a response, it expects to get the latest value next time it read back from any of the nodes.
Being Available means all non-failing nodes continue to serve requests if the system is partitioned.
System should function correctly despite arbitrary Network Partitions.

It is important to note that Consistency in CAP Theorem [Consistent-Available-Partition-tolerant] isn't the same as in ACID [Atomic-Consistent-Isolated-Durable] database transactional properties.
CAP comes from the distributed systems theory, while ACID belongs to database systems one.
“C” in the CAP Theorem refers to each node in the system having the same data, the “C” in ACID refers to a single node enforcing the same rules on every potential commit, such as the type of a field or a field not being null.

Database systems designed with traditional ACID guarantees in mind such as RDBMS choose consistency over availability,
whereas systems designed around the BASE philosophy, common in the NoSQL movement for example, choose availability over consistency.
Eventual consistency is a consistency model used in distributed computing to achieve high availability, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. Eventually-consistent services are often classified as providing BASE semantics (Basically-available, Soft-state, Eventual consistency).

Basically available: reading and writing operations are available as much as possible (using all nodes of a database cluster), but might not be consistent (the write might not persist after conflicts are reconciled, and the read might not get the latest write)
Soft-state: without consistency guarantees, after some amount of time, we only have some probability of knowing the state, since it might not yet have converged
Eventually consistent: If we execute some writes and then the system functions long enough, we can know the state of the data; any further reads of that data item will return the same value

Consistency Model specifies a contract between the programmer and a system, wherein the system guarantees that if the programmer follows the rules for operations on memory, memory will be consistent and the results of reading, writing, or updating memory will be predictable. Consistency models are used in distributed systems like distributed shared memory systems or distributed data stores (such as filesystems, databases, optimistic replication systems or web caching).

PACELC Theorem is an extension to the CAP theorem. It states that an additional trade-off exists: between latency and consistency, even in absence of partitions, thus providing a more complete portrayal of the potential consistency trade-offs for distributed systems.

PACELC Theorem states that in case of network partitioning (P) in a distributed computer system, one has to choose between availability (A) and consistency (C), but else (E), even when the system is running normally in the absence of partitions, one has to choose between latency (L) and consistency (C). PACELC reorders CAP into PAC and adds a statement about what happens when the network is present. That is, you may choose between increased latency and consistency (Else Latency or Consistency).

Database PACELC ratings :
The default versions of DynamoDB, Cassandra, Riak and Cosmos DB are PA/EL systems: if a partition occurs, they give up consistency for availability, and under normal operation they give up consistency for lower latency.
Fully ACID systems such as VoltDB/H-Store, Megastore, MySQL Cluster and PostgreSQL are PC/EC: they refuse to give up consistency, and will pay the availability and latency costs to achieve it. BigTable and related systems such as HBase are also PC/EC.
Cosmos DB supports five tunable consistency levels that allow for tradeoffs between C/A during P, and L/C during E. Cosmos DB never violates the specified consistency level, so it's formally CP.
Cosmos DB offers five well-defined levels. From strongest to weakest, the levels are:
Strong, Bounded staleness, Session, Consistent prefix, Eventual
MongoDB can be classified as a PA/EC system. In the baseline case, the system guarantees reads and writes to be consistent.

A high availability requirement implies that the system must replicate data. As soon as a distributed system replicates data, a trade-off between consistency and latency arises.
Thus, the choice of the system characteristics depends on the application and requirements.

References :
Wikipedia - Distributed Data Stores
Wikipedia - Consistency Model
Wikipedia - PACELC
mwhittaker.github.io
edisciplinas.usp.br
medium.com/@skeller88
blog.thislongrun.com
Eventual_consistency
ardalis.com
Azure CosmosDB

Image Resources :
CAP-theorem-with-databases-that-choose-CA-CP-and-AP
Summary Table

DEV Community

Trade-offs of Distributed Data Stores : Theories and Theorems

Top comments (0)

Read next

Docker in Microservices Architecture: Building Scalable and Resilient Systems

Understanding This and Super in Typescript

Exploring the Function of Internet Marketing in Current Business Landscape

20 Amazing AI Tools You Can Use Right Now