
Discussion on: what software based backup solutions do you use?

 
@RubenKelevra

Austin, I have to correct you:

Restic does indeed deduplicate at the block level. It uses a rolling-hash algorithm (a Rabin fingerprint) as its chunker.

In short, a rolling-hash algorithm reacts to byte patterns within the file and cuts chunk boundaries there. If two files contain the same patterns, there's a high chance the cuts land at the same positions, which lets it deduplicate files whose data isn't aligned to any specific block size.

So if you, for example, have multiple VM images to back up from multiple machines, and the data inside them is laid out differently, say with a 4 KiB block size in one VM and a 512-byte block size in the other, the Rabin chunker can still identify the matching streams of data and deduplicate the redundancies.
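To make that concrete, here's a toy content-defined chunker in Go (the language restic is written in). To be clear, this is my own minimal sketch, not restic's code: restic's real chunker uses a 64-bit Rabin fingerprint over a 64-byte window with minimum/maximum chunk sizes around 512 KiB and 8 MiB, while the mask, multiplier, and chunk sizes below are simplified assumptions. The demo inserts a single byte at the front of 1 MiB of data and counts how many chunks still match:

```go
package main

import (
	"fmt"
	"math/rand"
)

// Toy content-defined chunker built on a polynomial rolling hash.
// Simplifications vs. restic's real chunker (assumptions, not restic's
// code): no minimum/maximum chunk size, a small average chunk size, and
// an arbitrary multiplier instead of a random irreducible polynomial.
const (
	windowSize = 64            // bytes in the rolling window
	maskBits   = 12            // cut where the low 12 hash bits are zero: ~4 KiB average chunks
	multiplier = 1099511628211 // FNV 64-bit prime, chosen arbitrarily here
)

// pow is multiplier^(windowSize-1) mod 2^64, used to drop the byte
// leaving the window.
var pow = func() uint64 {
	p := uint64(1)
	for i := 0; i < windowSize-1; i++ {
		p *= multiplier
	}
	return p
}()

// chunk cuts data wherever the rolling hash over the last windowSize
// bytes ends in maskBits zero bits.
func chunk(data []byte) [][]byte {
	const mask = uint64(1)<<maskBits - 1
	var chunks [][]byte
	var h uint64
	start := 0
	for i := 0; i < len(data); i++ {
		if i < windowSize {
			h = h*multiplier + uint64(data[i]) // still filling the first window
		} else {
			h = (h-uint64(data[i-windowSize])*pow)*multiplier + uint64(data[i])
		}
		if i >= windowSize-1 && h&mask == 0 {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:]) // trailing partial chunk
	}
	return chunks
}

func main() {
	rng := rand.New(rand.NewSource(1))
	data := make([]byte, 1<<20) // 1 MiB of reproducible pseudo-random data
	rng.Read(data)

	// The worst case for fixed blocks: one byte inserted at the very
	// front, so every later byte is misaligned.
	shifted := append([]byte{0x42}, data...)

	seen := make(map[string]bool)
	for _, c := range chunk(data) {
		seen[string(c)] = true
	}

	matched, total := 0, 0
	for _, c := range chunk(shifted) {
		if seen[string(c)] {
			matched++
		}
		total++
	}
	// Nearly every chunk re-synchronizes shortly after the inserted byte.
	fmt.Printf("%d of %d chunks of the shifted copy are already known\n", matched, total)
}
```

Because the cut points depend only on the bytes in the local window, the boundaries line up again shortly after the edit, and everything past the first chunk deduplicates.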

Compare this to ZFS, for example: it has variable-sized blocks, which go up to 128 KiB and can be deduplicated, but only records that are identical and aligned to record boundaries match. When your data is slightly misaligned, like a new version of a file where the data has moved, or VM images whose internal block sizes don't line up with the record size, you tend to get zero deduplication.
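The same experiment with aligned fixed-size blocks shows that failure mode. (4 KiB is just a stand-in for a dedup record size in this sketch, not ZFS's actual default recordsize of 128 KiB:)

```go
package main

import (
	"bytes"
	"fmt"
	"math/rand"
)

func main() {
	const blockSize = 4096 // stand-in for an aligned dedup record
	rng := rand.New(rand.NewSource(1))
	data := make([]byte, 1<<20)
	rng.Read(data)

	// Same data with one byte inserted at the front.
	shifted := append([]byte{0x42}, data...)

	// Compare aligned blocks at the same offsets, which is effectively
	// what block-aligned dedup does: expect zero matches.
	matches := 0
	for off := 0; off+blockSize <= len(data); off += blockSize {
		if bytes.Equal(data[off:off+blockSize], shifted[off:off+blockSize]) {
			matches++
		}
	}
	fmt.Printf("identical aligned %d-byte blocks after a 1-byte insert: %d of %d\n",
		blockSize, matches, len(data)/blockSize)
}
```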

The speed at which restic processes my system is pretty impressive. I only back up certain parts of the system with it, around 40 GiB, and it calculates the diff in just 85 seconds:

```
scan finished in 85.643s: 733157 files, 40.726 GiB

Files:         577 new,  4757 changed, 727820 unmodified
Dirs:          122 new,   611 changed, 110675 unmodified
Data Blobs:   1614 new
Tree Blobs:    711 new
Added to the repo: 395.722 MiB
```

As you can see, a daily snapshot weighs in at only about 400 MiB in this case :)
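If you want to reproduce this: the summary above is simply what a normal backup run prints. A minimal invocation, with the repository path as a placeholder:

```bash
restic -r /srv/restic-repo backup /etc /home
```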