Some forms of distributed snapshots were around for a while already when Chandy-Lamport's distributed snapshots paper was first published in 1985. Lamport considers this protocol a straightforward application of the basic ideas from Lamport clocks. Other than reviewing the paper, in this post I'll also present some examples of real world implementations and a TLA+ specification of the protocol.
What problem is it trying to solve?
You need to record the global state of a program. Why? Because, for example, you have some complex computation ongoing, and you want to know which step has reached. Or you have a long-running computation, and you want to take a snapshot as a backup to allow restarting the computation again from the checkpoint rather than from the beginning in case any machine fails.
For the state of the program, we refer to the local variables and in general to the history of states that the program went through.
Why is taking snapshots hard? Well, first of all, the snapshotting algorithm should not interfere with the running computation.
Secondly, if your program is a single process on a single machine, this is straightforward! You could create an api to say "record the snapshot in 5 seconds" or "every 2 hours". For a multi-thread/multiprocess program running on a single machine, you can create a similar api.
In a distributed system, this api won't work because there is no global shared clock. You could end up with out-of-sync snapshots providing an inconsistent view of the system. Other than the state of the process itself, we could also have inflight messages that should be included in the snapshot. As an example of inconsistent snapshot, a process B could record that it received a message from A and A's snapshot does not include that the message was sent over to B.
The paper has a good visual representation: imagining you wanted to take a picture of a sky filled with migrating birds. One photo is not enough, you will need to take multiple pictures and stitch them together in a way that provides a consistent view of the landscape. This is the challenge that this paper is trying to solve.
Top comments (0)