Kurt

Posted on Oct 15, 2022 • Originally published at getcode.substack.com on Oct 14, 2022

A Nibble of Content-Defined Chunking

#hash #filetransfer #deduplication #chunking

Nibble: a small piece of food bitten off. In computing: half a byte of information. In every nibble, I explain one idea from computing science or software engineering in five minutes of reading time.

Consider the problem of backing up large binary files, like virtual machine images, zip files, images, or videos. Because typically only a small fraction of the data changes, it is terribly slow and expensive to transfer each and every file again and again, with every backup. Ideally, we'd transfer only the changes since the last backup, and nothing else. Of course, we can't afford to miss something either!

Photo by Mae Mu on Unsplash

A naïve approach would be to find the changed parts by comparing the current version of the files with the previously backed-up version. However, that requires a complete transfer of each file, which defeats the purpose.

Fixed-size chunking

A better solution is to make the backup content addressable. Split each file into parts of fixed size called chunks and compute a hash of each chunk's content. Initially, we transfer each chunk to the backup store keyed by its hash and store a file with the chunk names that make up every file in the backup.

Afterward, if disaster strikes, we can recover files by downloading their list of chunk names, downloading the chunks, and then simply joining them together.

This approach is called fixed-size chunking.

As a bonus, fixed-size chunking de-duplicates: if one or more chunks are identical by hash, they're transferred only once. It doesn't matter if the duplicate chunk is in the same or another file.

Chunking addresses incremental transfer nicely. If a chunk changes, only that chunk needs to be transferred:

Are we done? Not quite. With fixed-size chunking, a few bytes inserted or removed at the beginning of a file means all of the following chunks get a different hash, even though they haven't changed! That significantly affects our ability to de-duplicate and transfer incrementally.

Content-defined chunking

The solution is to use content-defined chunking. Instead of splitting the file into fixed chunks, we let the file content determine where to split. We can do this by computing a rolling hash and splitting when the rolling hash satisfies a condition - typically, when some number of bits in the hash is zero.

First, let's understand what a rolling hash is. A rolling hash has a fixed window size, say 64 bytes. For each window in the file, the hash is computed, so the first hash is bytes 0 to 63, then 1 to 64, then 2 to 65, and so on. A rolling hash is efficient because it can be incrementally updated, by removing the contribution of the oldest byte and adding the new byte.

Now, to find chunk boundaries, we check whether the lowest N bits of each hash are all zeros. If so, we have a new chunk boundary. The all-zeroes condition is arbitrary - may as well be all ones, or two, or spell your name. More importantly, by choosing N we can target a chunk size of 2ᴺ bytes on average. That's because the hash's bits are uniformly distributed, so a string of N bits occurs with a probability of 1 in 2ᴺ.

The rest is like fixed-size chunking: hash the content of each chunk, and use that to transfer and de-duplicate.

Content-defined chunking neatly solves the problem of inserting or removing data, because the chunk boundaries are no longer offset-based. Now, if data is inserted anywhere outside the chunk boundaries, the boundaries "move" with the data, and most chunks stay the same.