CodingBlocks

Designing Data-Intensive Applications – Leaderless Replication

We wrap up our replication discussion of Designing Data-Intensive Applications, this time discussing leaderless replication strategies and issues, while Allen missed his calling, Joe doesn’t read the gray boxes, and Michael lives in a future where we use apps.

If you’re reading this via your podcast player, you can find this episode’s full show notes at https://www.codingblocks.net/episode162. As Joe would say, check it out and join in on the conversation.

Sponsors

  • Educative.io – Learn in-demand tech skills with hands-on courses using live developer environments. Visit educative.io/codingblocks to get an additional 10% off an Educative Unlimited annual subscription.

Survey Says

Do you have TikTok installed?
  • Heck yeah, I love those videos.
  • Nope. No way. Never.

News

  • Thank you for the latest review!
    • iTunes: tuns3r
Designing Data-Intensive Applications – check out the book!

Single Leader to Multi-Leader to Leaderless

  • When you have leaders and followers, the leader is responsible for making sure the followers get operations in the correct order
  • Amazon’s Dynamo brought the trend into the modern era, and these databases are all Dynamo-inspired:
    • Riak
    • Cassandra
    • Voldemort
  • We’ve talked about NoSQL databases before
  • What exactly is NewSQL? https://en.wikipedia.org/wiki/NewSQL
  • What if we just let every replica take writes? Couple ways to do this…
    • You can write to several replicas
    • You can use a coordinator node to pass on the writes
  • But how do you keep these operations in order? You don’t!
    • Thought exercise: how can you make sure operation order doesn’t matter?
    • A couple of ideas: no partial updates, increments, version numbers (see the sketch below)
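
To make that thought exercise concrete, here’s a minimal sketch (in Python, with made-up names, not any particular database’s implementation) of how version numbers let a replica accept writes in any order and still converge:

```python
# Minimal sketch: a replica that accepts versioned writes in any order.
# The names and structure here are illustrative only.

class Replica:
    def __init__(self):
        self.store = {}  # key -> (version, value)

    def write(self, key, version, value):
        # Keep only the write with the highest version number, so applying
        # the same writes out of order converges to the same state.
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)

r1, r2 = Replica(), Replica()
# The same two writes arrive in opposite orders on each replica...
r1.write("user:1", 1, "alice"); r1.write("user:1", 2, "alicia")
r2.write("user:1", 2, "alicia"); r2.write("user:1", 1, "alice")
# ...and both replicas end up with the same value.
assert r1.store == r2.store == {"user:1": (2, "alicia")}
```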

Multiple Writes, Multiple Reads

  • What do you do if your client (or coordinator) tries to write to multiple nodes… and some are down?
  • Well, it’s an implementation detail: you can choose to enforce a “quorum”, meaning some number of nodes have to acknowledge the write.
    • This threshold can be configurable, making it so some percentage of nodes must accept a write for it to succeed
    • What about nodes that are out of date?
    • The trick to mitigating stale data: the replicas keep a version number with each value, and you only use the latest data – potentially by querying multiple nodes at the same time for the requested data (see the sketch after this list)
    • We’ve talked about logical clocks before; they’re a way of tracking time via observed changes, like the total number of changes to a collection/table, so there are no time zone or nanosecond differences to worry about
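
A rough sketch of that read path, assuming each replica hands back a (version, value) pair (the StubReplica and read_latest names are just for illustration):

```python
# Minimal sketch: query several replicas for the same key and keep the value
# with the highest version number. The replica objects here are stand-ins.

class StubReplica:
    def __init__(self, data):
        self.data = data  # key -> (version, value)

    def read(self, key):
        return self.data[key]

def read_latest(key, replicas):
    responses = []
    for replica in replicas:
        try:
            responses.append(replica.read(key))
        except Exception:
            continue  # an unavailable replica just contributes nothing
    if not responses:
        raise RuntimeError("no replica answered")
    # Highest version wins; stale copies are ignored.
    return max(responses, key=lambda pair: pair[0])

replicas = [
    StubReplica({"cart:42": (3, ["milk", "eggs"])}),   # up to date
    StubReplica({"cart:42": (2, ["milk"])}),           # stale
]
print(read_latest("cart:42", replicas))  # -> (3, ['milk', 'eggs'])
```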

How do you keep data in sync?

  • About those unavailable nodes… there are 2 ways to fix them up:
    • Read Repair: when the client realizes it got stale data from one of the replicas, it can send the updated data (with the version number) back to that replica. Pretty cool! Works well for data that is read frequently (see the sketch after this list)
    • Anti-Entropy: the nodes can also run similar background tasks, querying other replicas to see which are out of date – ordering not guaranteed!
    • Voldemort: ONLY uses read repair – this could lead to loss of data if multiple replicas went down and the “new” data was never read after being written
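
Here’s a hedged sketch of read repair from the client’s point of view, building on the versioned reads above (all names are illustrative):

```python
# Minimal read-repair sketch: after reading from all replicas, push the newest
# (version, value) back to any replica that returned something older.

class StubReplica:
    def __init__(self, data):
        self.data = data  # key -> (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))

    def write(self, key, version, value):
        if version > self.data.get(key, (0, None))[0]:
            self.data[key] = (version, value)

def read_with_repair(key, replicas):
    responses = [(replica, replica.read(key)) for replica in replicas]
    newest_version, newest_value = max(r for _, r in responses)
    for replica, (version, _) in responses:
        if version < newest_version:
            # The stale replica gets fixed up as a side effect of the read.
            replica.write(key, newest_version, newest_value)
    return newest_value

replicas = [StubReplica({"k": (2, "new")}), StubReplica({"k": (1, "old")})]
print(read_with_repair("k", replicas))   # -> "new"
print(replicas[1].data)                  # stale replica now holds (2, "new")
```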

Quorums for reading and writing

  • Quick Reminder: We are still talking about 100% of the data on each replica
  • 3 major numbers at play:
    • Number of nodes (n)
    • Number of confirmed writes (w)
    • Number of reads required (r)
  • If you want to be safe, the nodes you write to and the nodes you read from should include some overlap
  • A common way to ensure that overlap: require that the number of writes plus the number of reads is greater than the number of nodes (w + r > n) – see the sketch after this list
  • Example: you have 10 nodes – if you use 5 for writing and 5 for reading… you may not have an overlap, resulting in potentially stale data!
  • Common approach: take the number of nodes (an odd number), add 1, then divide by 2 – that’s the number of readers and writers you should require
    • 9 nodes – 5 writes and 5 reads – ensures non-stale data
    • When using this approach, you can handle n / 2 (rounded down) failed nodes
  • How would you tweak the numbers for a write heavy workload?
  • Typically, you write to and read from ALL replicas, but you only need successful responses from w writers and r readers
  • What if you have a LOT of nodes?!?
  • Note: there’s still room for problems here – the author explicitly lists 5 types of edge cases, plus one category of miscellaneous timing edge cases, all variations of readers and writers getting out of sync or things happening at the same time
  • If you really want to be safe, you need consensus (r = w = n) or transactions (that’s a whole other chapter)
  • Note that if fewer than the required number of readers or writers return an OK, then an error is returned from the operation
  • Also worth considering: you don’t have to have overlap – having readers + writers ≤ nodes means you could get stale data, but possibly with lower latency and a lower probability of error responses
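
To make the quorum arithmetic concrete, a tiny sketch of the w + r > n rule and the examples from the notes above:

```python
# Quorum arithmetic from the notes: with n replicas, w write acks and r read
# responses, reads are guaranteed to overlap the latest write when w + r > n.

def has_overlap(n, w, r):
    return w + r > n

n = 9
w = r = (n + 1) // 2          # the "common approach": 5 writes and 5 reads
print(has_overlap(n, w, r))   # True  -> at least one read hits a fresh replica
print(n // 2)                 # 4     -> number of failed nodes you can tolerate

# The 10-node counterexample from the notes: 5 writes + 5 reads need not overlap.
print(has_overlap(10, 5, 5))  # False -> stale reads are possible
```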

Monitoring staleness

  • Single/Multi Leader lag is generally easy to monitor – you just query the leader and the replicas to see which operation they are on
  • Leaderless databases don’t have guaranteed ordering so you can’t do it this way
  • If the system only uses read repair (where the data is fixed up by clients only as it is read) then you can have data that is ancient
  • It’s hard to give a good algorithm description here because so much relies on the implementation details

And when things don’t work?

  • Multi-writes and multi-reads are great when a small % of nodes are down or slow
  • What if that % is higher?
    • Return an error when we can’t get quorum?
    • Accept writes and catch the unavailable nodes back up later?
  • If you choose to continue operating, we call it a “sloppy quorum” – when you allow reads or writes on replicas that aren’t the “home” nodes – the book likens it to getting locked out of your house and asking your neighbor if you can stay at their place for the night
  • This increases (write) availability, at the cost of consistency
  • Technically it’s not a quorum at all, but it’s the best we can do in that situation if you really care about availability – the data is stored somewhere, just not where it’d normally be stored, and it can be handed back to the “home” nodes once they recover (see the sketch below)
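
A very rough sketch of that “crash at the neighbor’s place” idea, i.e. a fallback node holding a write plus a hint about where it belongs (a simplified illustration of hinted handoff, not any particular database’s implementation):

```python
# Sloppy-quorum sketch: if a "home" node is down, write to a fallback node and
# remember a hint so the data can be handed back later. Purely illustrative.

class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}          # key -> value
        self.hints = []         # (home_node, key, value) held for someone else

def sloppy_write(key, value, home_nodes, fallback_nodes):
    for home in home_nodes:
        if home.up:
            home.data[key] = value
        else:
            # A neighbor takes the write plus a hint about where it belongs.
            neighbor = next(n for n in fallback_nodes if n.up)
            neighbor.hints.append((home, key, value))

def hand_off(node):
    # Once the home nodes recover, hand the hinted writes back to them.
    for home, key, value in node.hints:
        if home.up:
            home.data[key] = value
    node.hints = [h for h in node.hints if not h[0].up]

a, b, c = Node("a"), Node("b"), Node("c")
b.up = False
sloppy_write("k", "v", home_nodes=[a, b], fallback_nodes=[c])
b.up = True
hand_off(c)
print(b.data)  # {'k': 'v'} -> the write eventually reaches its home node
```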

Detecting Concurrent Writes

  • What do you get when you write the same key at the same time with different values?
  • Remember, we’re talking about logical clocks here so imagine that 2 clients both write version #17 to two different nodes
  • This may sound unlikely, but when you realize we’re talking logical clocks, and systems that can operate at reduced capacity…it happens
  • What can we do about it?
    • Last write wins: but which one is considered last? Remember how replicas catch up (read repair or anti-entropy)? Either way, the data will eventually become consistent, but we can’t say which write will win… just that one will eventually take over (see the sketch after this list)
      • Note: We can take something else into account here, like clock time…but no perfect answer
      • LWW is good when your data is immutable, like logs – Cassandra recommends using a UUID as a key for each write operation
    • Happens-Before Relationship – Riak has CRDTs (conflict-free replicated data types) that bundle a version vector to help with this
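
A tiny illustration of last write wins (a hedged sketch that uses wall-clock timestamps to pick a winner; real systems differ in how they define “last”):

```python
# Last-write-wins sketch: each write carries a timestamp, and a replica simply
# keeps the write with the biggest timestamp. Concurrent writes are silently lost.

def lww_merge(a, b):
    # a and b are (timestamp, value) pairs for the same key.
    return a if a[0] >= b[0] else b

write_1 = (1699000000.123, "blue")
write_2 = (1699000000.456, "green")   # "later" only by wall-clock time
print(lww_merge(write_1, write_2))    # (1699000000.456, 'green') -> "blue" is dropped
```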

This “happens-before” relationship and concurrency

  • How do we know whether the operations are concurrent or not?
    Basically if neither operation knows about the other, then they are concurrent…
  • Three possible states if you have writes A and B
    • A happened before B
    • B happened before A
    • A and B happened concurrently
  • When there is a happens-before relationship, you take the later value
  • When they are concurrent, you have to figure out how to resolve the conflicts (see the sketch after this list)
    • Merging concurrently written values
      • Last write wins?
      • Union the data?
      • No good answer
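
Here’s a rough sketch of how comparing version vectors answers the “happened before or concurrent?” question (a minimal illustration; real implementations such as Riak’s dotted version vectors are more involved):

```python
# Compare two version vectors (maps of replica -> counter).
# A happened before B if every counter in A is <= the counter in B (and they differ).
# If neither dominates the other, the writes were concurrent.

def compare(a, b):
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "A happened before B"
    if b_le_a:
        return "B happened before A"
    return "concurrent"

print(compare({"n1": 1}, {"n1": 2}))                     # A happened before B -> keep B
print(compare({"n1": 2, "n2": 0}, {"n1": 1, "n2": 1}))   # concurrent -> must merge
```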

Version vectors

  • The collection of version numbers from all replicas is called a version vector
  • Riak uses dotted version vectors – the version vectors are sent back to the clients when values are read, and need to be sent back to the db when the value is written back
    • Doing this allows the db to understand if the write was an overwrite or concurrent
    • This also allows applications to merge siblings by reading from one replica and writing to another, without losing data as long as the siblings are merged correctly (see the sketch below)
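
A hedged sketch of that read / merge / write-back cycle (the Store class and its methods are made up for illustration, not Riak’s actual API):

```python
# Sketch of the read / merge / write-back cycle with a version vector as the
# causal context. Illustrative only.

class Store:
    def __init__(self, node_id):
        self.node_id = node_id
        self.siblings = []      # list of concurrently written values
        self.vector = {}        # version vector covering everything stored so far

    def read(self):
        return list(self.siblings), dict(self.vector)

    def write(self, value, context):
        # Anything the client's context covers is overwritten; anything newer
        # than the context would survive as a sibling (omitted here for brevity).
        self.siblings = [value]
        self.vector = dict(context)
        self.vector[self.node_id] = self.vector.get(self.node_id, 0) + 1

store = Store("n1")
store.siblings = [["milk"], ["eggs"]]       # two concurrent cart writes
store.vector = {"n1": 2}

values, context = store.read()
merged = sorted({item for v in values for item in v})   # client merges (union)
store.write(merged, context)
print(store.read())   # ([['eggs', 'milk']], {'n1': 3}) -> siblings resolved
```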

Resources We Like

  • Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann (Amazon)
  • Past episode discussions on Designing Data-Intensive Applications (Coding Blocks)
  • Designing Data-Intensive Applications – Data Models: Relational vs Document (episode 123)
  • NewSQL (Wikipedia)
  • Do not allow Jeff Bezos to return to Earth (Change.org)
  • Man Invests $20 in Obscure Cryptocurrency, Becomes Trillionaire Overnight, at Least Temporarily (Newsweek)
  • Quantifying Eventual Consistency with PBS (Bailis.org)
  • Riak Distributed Data Types (Riak.com)

Tip of the Week

  • A GitHub repo with a list of “falsehoods”: common things that people believe but aren’t true, targeted at the kinds of assumptions programmers might make when working in domains they’re less familiar with. (GitHub)
  • The Linux at command lets you easily schedule commands to run in the future. It’s really user friendly so you can be lazy with how you specify the command, for example echo "command_to_be_run" | at 09:00 or at 09:00 -f /path/to/some/executable (linuxize.com)
  • You can try Kotlin online at play.kotlinlang.org; it’s an online editor with links to lots of examples. (play.kotlinlang.org)
  • The Docker COPY instruction will be re-run (invalidating the build cache) if there are changes to the files being copied. You can use a .dockerignore file to skip files that you don’t care about to trim down unnecessary work and build times. (docs.docker.com)
