DEV Community

Kurt
Kurt

Posted on • Originally published at getcode.substack.com on

1

A Nibble of Content-Defined Chunking

Nibble: a small piece of food bitten off. In computing: half a byte of information. In every nibble, I explain one idea from computing science or software engineering in five minutes of reading time.

Consider the problem of backing up large binary files, like virtual machine images, zip files, images, or videos. Because typically only a small fraction of the data changes, it is terribly slow and expensive to transfer each and every file again and again, with every backup. Ideally, we'd transfer only the changes since the last backup, and nothing else. Of course, we can't afford to miss something either!

flat-lay photography of chocolate bars

Photo by Mae Mu on Unsplash

A naïve approach would be to find the changed parts by comparing the current version of the files with the previously backed-up version. However, that requires a complete transfer of each file, which defeats the purpose.

Fixed-size chunking

A better solution is to make the backup content addressable. Split each file into parts of fixed size called chunks and compute a hash of each chunk's content. Initially, we transfer each chunk to the backup store keyed by its hash and store a file with the chunk names that make up every file in the backup.

Afterward, if disaster strikes, we can recover files by downloading their list of chunk names, downloading the chunks, and then simply joining them together.

This approach is called fixed-size chunking.

As a bonus, fixed-size chunking de-duplicates: if one or more chunks are identical by hash, they're transferred only once. It doesn't matter if the duplicate chunk is in the same or another file.

Chunking addresses incremental transfer nicely. If a chunk changes, only that chunk needs to be transferred:

Are we done? Not quite. With fixed-size chunking, a few bytes inserted or removed at the beginning of a file means all of the following chunks get a different hash, even though they haven't changed! That significantly affects our ability to de-duplicate and transfer incrementally.

Content-defined chunking

The solution is to use content-defined chunking. Instead of splitting the file into fixed chunks, we let the file content determine where to split. We can do this by computing a rolling hash and splitting when the rolling hash satisfies a condition - typically, when some number of bits in the hash is zero.

First, let's understand what a rolling hash is. A rolling hash has a fixed window size, say 64 bytes. For each window in the file, the hash is computed, so the first hash is bytes 0 to 63, then 1 to 64, then 2 to 65, and so on. A rolling hash is efficient because it can be incrementally updated, by removing the contribution of the oldest byte and adding the new byte.

Now, to find chunk boundaries, we check whether the lowest N bits of each hash are all zeros. If so, we have a new chunk boundary. The all-zeroes condition is arbitrary - may as well be all ones, or two, or spell your name. More importantly, by choosing N we can target a chunk size of 2ᴺ bytes on average. That's because the hash's bits are uniformly distributed, so a string of N bits occurs with a probability of 1 in 2ᴺ.

The rest is like fixed-size chunking: hash the content of each chunk, and use that to transfer and de-duplicate.

Content-defined chunking neatly solves the problem of inserting or removing data, because the chunk boundaries are no longer offset-based. Now, if data is inserted anywhere outside the chunk boundaries, the boundaries "move" with the data, and most chunks stay the same.

Thanks for reading the first-ever nibble! I intend to write one nibble every month. Think of them as the amuse-bouches of Get Code.

For more, subscribe to my newletter and follow me on Twitter

AWS Security LIVE!

Tune in for AWS Security LIVE!

Join AWS Security LIVE! for expert insights and actionable tips to protect your organization and keep security teams prepared.

Learn More

Top comments (0)

A Workflow Copilot. Tailored to You.

Pieces.app image

Our desktop app, with its intelligent copilot, streamlines coding by generating snippets, extracting code from screenshots, and accelerating problem-solving.

Read the docs

👋 Kindness is contagious

Immerse yourself in a wealth of knowledge with this piece, supported by the inclusive DEV Community—every developer, no matter where they are in their journey, is invited to contribute to our collective wisdom.

A simple “thank you” goes a long way—express your gratitude below in the comments!

Gathering insights enriches our journey on DEV and fortifies our community ties. Did you find this article valuable? Taking a moment to thank the author can have a significant impact.

Okay