This blog is part of a series on designing distributed applications. In this chapter, we look at distributed state - and the applications that manage it: databases. We won't go too deep into individual products, and will instead keep the discussion focused on the tradeoffs of particular approaches.
The simplest database is a file (or a block device). You can seek around and write to it, or you can mmap it and write to memory.
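As a minimal sketch (plain Python on a POSIX system; the `toy.db` name and the size are made up, and durability is hand-waved), here's a "database" that is just a file accessed through `mmap`:

```python
import mmap
import os

DB_PATH = "toy.db"   # hypothetical file name
DB_SIZE = 4096       # one page; a real file-backed store would grow this

# Create the backing file if needed, sized up front so mmap can cover it.
if not os.path.exists(DB_PATH):
    with open(DB_PATH, "wb") as f:
        f.truncate(DB_SIZE)

with open(DB_PATH, "r+b") as f:
    mm = mmap.mmap(f.fileno(), DB_SIZE)
    mm[0:5] = b"hello"   # "write to memory" - the OS pages it back to disk
    mm.flush()           # ask the kernel to write the dirty pages out
    print(mm[0:5])       # b'hello'
    mm.close()
```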
What are the issues with using a file as a database? Lack of a networked API - you can't expose a file on a TCP port; you need an intermediary (which is a shame - Linux has all kinds of low-level file-oriented primitives, why not this one?). Lack of granular concurrency - `flock` only locks the entire file (POSIX byte-range locks via `fcntl` exist, but are famously error-prone). And lack of high availability - a file can't natively be replicated or sharded.
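To make the locking point concrete, here's an illustration using `flock` on the `toy.db` file from the sketch above (Linux/macOS only) - an advisory lock that only ever covers the whole file:

```python
import fcntl

# Whole-file advisory lock: no per-record granularity - a second process
# calling flock on the same file blocks here until we release it.
with open("toy.db", "ab") as f:
    fcntl.flock(f, fcntl.LOCK_EX)   # exclusive lock on the ENTIRE file
    f.write(b"append under lock\n")
    fcntl.flock(f, fcntl.LOCK_UN)
```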
We discussed network APIs in part 1.5, so let's look at consensus and concurrency control. Reliable writes to a cluster can be leaderful or leaderless. Leadership can come with manual failover - as in Postgres - or with automatic elections - as in Paxos and Raft. I suggest watching the original presentation on Raft, it's really insightful.
Leaderful means that all writes have to go through the leader. That limits scaling, but helps consistency - Postgres being a good example. The way to achieve scaling with leaders is by sharding - have portions of the data controlled by independent processes, and thus different leaders. But you sacrifice consistency between shards - and you get Cassandra's Lightweight Transactions (docs), which only work within one shard. As for Postgres - sharding an individual Postgres instance is awkward, so why not run multiple Postgres instances and shard that way? What you get is Citus or Apache Cloudberry.
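Here's a toy sketch of the routing half of that idea - keys hashed to independent shards, each with its own leader (the shard addresses are invented for illustration):

```python
import hashlib

# Hypothetical leader addresses - one independent leader per shard.
SHARD_LEADERS = ["pg-shard-0:5432", "pg-shard-1:5432", "pg-shard-2:5432"]

def leader_for(key: str) -> str:
    # Each shard has its own leader; a transaction touching keys on
    # different shards spans two leaders - and loses the single-leader
    # consistency guarantees.
    digest = hashlib.sha256(key.encode()).digest()
    shard = int.from_bytes(digest[:8], "big") % len(SHARD_LEADERS)
    return SHARD_LEADERS[shard]

print(leader_for("user:42"))    # all writes for this key go to one leader
print(leader_for("user:1337"))  # ...possibly a different leader for this one
```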
But what if you don't want leader-per-partition, but true multi-master writing? You will inevitably get conflicting updates, and you have to decide between merging and Last Writer Wins. Merging can be built in - as with CRDTs, which you should check out (for example, here). But CRDTs are not general-purpose. Conflicts can also be exposed to the user - as with Git, which is a kind of database. Or you can just do Last Writer Wins - which is what Cassandra does by default.
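A minimal sketch of both approaches, assuming nothing beyond the standard library: a grow-only counter CRDT, where merge is built in and commutative, and a last-writer-wins register, where "merge" just picks the larger timestamp:

```python
# Grow-only counter CRDT: each replica increments its own slot;
# merge takes the per-replica maximum, so it commutes and converges.
class GCounter:
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts = {replica_id: 0}

    def increment(self):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + 1

    def merge(self, other: "GCounter"):
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self) -> int:
        return sum(self.counts.values())

# Last-writer-wins register: no merging - the later timestamp simply wins,
# and concurrent older updates are silently dropped (Cassandra's default).
class LWWRegister:
    def __init__(self):
        self.value, self.ts = None, 0.0

    def set(self, value, ts: float):
        if ts >= self.ts:
            self.value, self.ts = value, ts

a, b = GCounter("a"), GCounter("b")
a.increment(); b.increment(); b.increment()
a.merge(b)
print(a.value())  # 3 - both replicas' increments survive the merge

r = LWWRegister()
r.set("first", ts=1.0); r.set("second", ts=2.0); r.set("concurrent", ts=1.5)
print(r.value)    # 'second' - the older concurrent write was dropped
```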
Raft and Paxos give us atomic compare-and-swap within a shard - and this can be extended to full serializability, since any reads and writes can be made atomic under a single leader. The mechanism would be 2PL, or MVCC with the SSI extension. But what about across shards, with different leaders? Well, we can run one more Paxos or Raft instance - this time over the shards, each of which has its own Paxos or Raft group - and achieve consensus. This costs a lot of round-trips, so the fewer shards are involved in a transaction, the better. The paper on distributed commit with Paxos is here.
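The within-shard half can be sketched like this - a stand-in for a Raft leader that applies operations one at a time in log order, which is what makes compare-and-swap atomic (the real thing would replicate each op to a quorum before applying it):

```python
import threading

class ShardLeader:
    # Stand-in for a Raft/Paxos leader: operations apply one at a time,
    # in log order, so compare-and-swap is trivially atomic in the shard.
    def __init__(self):
        self.state = {}
        self.log_order = threading.Lock()  # models "one leader, one log"

    def compare_and_swap(self, key, expected, new) -> bool:
        with self.log_order:
            if self.state.get(key) != expected:
                return False          # someone else won; the caller retries
            self.state[key] = new
            return True

leader = ShardLeader()
print(leader.compare_and_swap("x", None, 1))  # True - slot was empty
print(leader.compare_and_swap("x", 0, 2))     # False - expected value mismatch
```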
This is much better than the established two-phase commit protocol, 2PC. 2PC looks a lot like state machine replication in Raft - the leader sends updates, the quorum confirms receiving them, the leader sends "commit" - except 2PC demands that everyone acknowledges, not just a quorum, and if the leader fails, there is no automatic reelection: everyone just waits for the leader to come back alive. To me, Raft/Paxos looks strictly better. In fact, if you take 2PC, make it wait on a quorum instead of all nodes, and add leader reelection - it starts to look a lot like Raft.
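For contrast, here's the 2PC skeleton itself, reduced to a toy - note the two properties called out above: the coordinator needs every participant's vote, and a coordinator crash between the phases leaves participants stuck in "prepared":

```python
class Participant:
    def __init__(self, name: str):
        self.name, self.prepared = name, False

    def prepare(self) -> bool:
        # Phase 1: the participant holds locks and persists its intent
        # before voting yes - from here it cannot unilaterally abort.
        self.prepared = True
        return True

    def commit(self):
        self.prepared = False

    def abort(self):
        self.prepared = False

def two_phase_commit(participants) -> bool:
    # Phase 1: unlike a quorum protocol, ALL participants must vote yes.
    votes = [p.prepare() for p in participants]
    if not all(votes):
        for p in participants:
            p.abort()
        return False
    # If the coordinator dies right here, everyone waits - there is no
    # reelection; prepared participants sit blocked, holding their locks.
    for p in participants:
        p.commit()
    return True

print(two_phase_commit([Participant("a"), Participant("b")]))  # True
```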
I mentioned 2PL, Two-Phase Locking, and MVCC, Multi-Version Concurrency Control. These are the two main concurrency control systems within a shard. 2PL takes locks for readers and writers - so conflicting transactions block each other, while MVCC creates a snapshot of the state for each transaction - which avoids blocking, but the old versions need garbage collection afterwards. MVCC is generally faster, especially for transactions with the relaxed isolation level of "snapshot isolation" - ideal for read-only transactions that simply need a consistent view of the data. Using snapshot isolation for writing transactions can lead to write skew - so to be safe, one should stick to serializable writing transactions.
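A toy MVCC sketch (txids double as timestamps, and commit/abort bookkeeping is omitted to keep it short): writers append versions, readers see only versions older than their snapshot, and vacuum pays the garbage collection bill:

```python
import itertools

class MVCCStore:
    def __init__(self):
        self.versions = {}              # key -> [(txid, value), ...]
        self._txids = itertools.count(1)

    def begin(self) -> int:
        # A transaction's snapshot is "all versions written before my txid".
        return next(self._txids)

    def write(self, txid: int, key, value):
        # Writers never overwrite in place - they append a new version.
        self.versions.setdefault(key, []).append((txid, value))

    def read(self, txid: int, key):
        # No read locks: pick the newest version visible to our snapshot.
        visible = [v for t, v in self.versions.get(key, []) if t < txid]
        return visible[-1] if visible else None

    def vacuum(self, oldest_active_txid: int):
        # The cost of MVCC: superseded versions need garbage collection.
        for key, vs in self.versions.items():
            old = [(t, v) for t, v in vs if t < oldest_active_txid]
            new = [(t, v) for t, v in vs if t >= oldest_active_txid]
            self.versions[key] = old[-1:] + new  # keep one old version + all new

store = MVCCStore()
t1 = store.begin()
store.write(t1, "x", "v1")
t2 = store.begin()
store.write(t2, "x", "v2")
print(store.read(t2, "x"))  # 'v1' - t2 sees the world as of its snapshot
```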
And lastly, there is one more subject to address - the interaction between application-level transactions and database-level transactions. You should watch this video for full context. Basically, when every application has its own database and follows the saga pattern, the applications implement their own distributed transactions - just at the application level. This is not ideal, since these implementations are improvised and prone to errors. But we don't want to do distributed transactions across the whole internet either - certainly not with 2PC; we can't lock the world to wait for decisions from across the Internet. The real world is not consistent - if we are to model it, we need to deal with that fact.
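The saga pattern mentioned above, reduced to its essence - a chain of local transactions, each paired with a compensating action that runs in reverse order when a later step fails (the travel-booking services here are invented):

```python
def book_flight() -> bool:
    print("flight booked")
    return True

def cancel_flight():
    print("flight cancelled")

def charge_card() -> bool:
    print("card charge failed")   # simulate a failing step
    return False

def refund_card():
    print("card refunded")

# Each step is a local transaction in one service's own database;
# the compensations are the application-level "rollback".
SAGA = [(book_flight, cancel_flight), (charge_card, refund_card)]

def run_saga(steps) -> bool:
    done = []
    for action, compensate in steps:
        if action():
            done.append(compensate)
        else:
            # Undo completed steps in reverse order - the improvised,
            # error-prone distributed rollback the text describes.
            for comp in reversed(done):
                comp()
            return False
    return True

print(run_saga(SAGA))  # False - the flight is booked, then cancelled
```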
So what's the solution? We want distributed state to be handled by a dedicated system, making application logic as pure as possible (it doesn't matter if it's a functional or object-oriented language - both can be modeled with pure computation in mind, and are equivalent). And we want distributed transactions. 2PC across the internet won't scale - we need to use Paxos Commit, or equivalents. Furthermore, current serializability enforcement is too coarse - it only checks which rows were read and written, not their contents, not business constraints. For example, consider that an entity/object/actor of "account" has the business constraint of being non-negative. Multiple long-lived transactions across services decrement the balance, each checking that it stays non-negative. If the DB only checks the fact of reading and writing, these transactions would abort each other, despite not actually violating the constraint. This doesn't scale.
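Here's a concrete toy version of that over-conservative check: two concurrent withdrawals of 10 from a balance of 100 both satisfy the constraint, yet row-level conflict detection aborts the second one anyway:

```python
class Account:
    # Serializability enforced by WHICH rows were read/written,
    # not by the business constraint (balance >= 0) itself.
    def __init__(self, balance: int):
        self.balance = balance
        self.committed_writes = 0   # writes committed to this row so far

class Tx:
    def __init__(self, account: Account):
        self.account = account
        self.snapshot = account.balance              # MVCC-style snapshot read
        self.writes_seen = account.committed_writes  # row state at tx start

    def withdraw_and_commit(self, amount: int) -> str:
        if self.snapshot - amount < 0:
            return "constraint violated"
        # Row-level conflict check: someone committed a write to a row we
        # read after our snapshot -> abort, even though the constraint
        # would actually still hold.
        if self.account.committed_writes > self.writes_seen:
            return "aborted (row conflict)"
        self.account.balance = self.snapshot - amount
        self.account.committed_writes += 1
        return "committed"

acct = Account(100)
t1, t2 = Tx(acct), Tx(acct)        # two concurrent withdrawals
print(t1.withdraw_and_commit(10))  # committed -> balance 90
print(t2.withdraw_and_commit(10))  # aborted - yet 100 - 10 - 10 = 80 >= 0
```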
Let's imagine a world with distributed transactions that actually check business constraints. They could still cancel each other at unpredictable times. This would be fine, except applications are written under different assumptions: the source of truth is in-process state, which gets reflected into a database. A workflow has a notion of forward progress - it cannot be rolled back several steps if the distributed transaction gets cancelled. What I'm proposing is a different model - one where the source of truth is the database, and application code is a pure function that makes no assumptions about monotonic time and is instead driven entirely by the database. Under that paradigm, distributed transactions can scale. This is very similar to the actor model.
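A sketch of what that model could look like (all names here are invented): application code is a pure (state, event) -> state function, and the "database" drives it by replaying events - a cancelled transaction just means the function gets re-driven from an earlier state, nothing needs rolling back:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    balance: int

@dataclass(frozen=True)
class Withdraw:
    amount: int

def handle(state: State, event: Withdraw) -> State:
    # Pure: no clocks, no I/O, no in-process source of truth. If a
    # distributed transaction is cancelled, the database simply re-drives
    # this function from an earlier state.
    if state.balance - event.amount < 0:
        return state  # reject: the business constraint would break
    return State(balance=state.balance - event.amount)

# The "database" replays committed events through the pure function:
s = State(balance=100)
for ev in [Withdraw(30), Withdraw(90), Withdraw(50)]:
    s = handle(s, ev)
print(s)  # State(balance=20) - the 90 withdrawal was rejected
```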
Thank you for reading this post! Next one will be about practical design of a distributed system.