The Unix Way — Episode 13
At three in the morning, on a production cluster with state measured in terabytes, the question is not which backup tool you have. It is whether any of them will finish before breakfast.
The usual answer involves three tools: one for backup, one for replication, one for versioning. Each excellent in isolation, and architecturally in each other's way once state crosses the point where tarballs and rsync stop being a reasonable proposition. FreeBSD's answer has been sitting there since FreeBSD 7.0 (2008). ZFS treats snapshots, replication, and copy-on-write as properties of the filesystem itself. Three workflows become three uses of the same primitive.
A Short History
ZFS began at Sun Microsystems in 2001 and was released under the CDDL in 2005 as part of OpenSolaris. Paweł Jakub Dawidek ported it to FreeBSD in 2007; FreeBSD 7.0 (2008) shipped it as an experimental option, and FreeBSD 8.0 (2009) promoted it to production status. When Oracle acquired Sun in 2010 and quietly closed the open development of Solaris, the future of ZFS outside Oracle's walls was in doubt. The illumos community forked OpenSolaris to preserve it; in 2013 the OpenZFS project was founded to unify development across FreeBSD, illumos, and (eventually) Linux and macOS. In 2020, OpenZFS 2.0 merged the Linux and FreeBSD codebases into a single source tree. Since FreeBSD 13.0 (2021), FreeBSD has shipped the unified OpenZFS 2 codebase.
The interesting consequence: ZFS is no longer "the Solaris filesystem ported to FreeBSD". It is a platform-independent filesystem maintained by a community that includes everyone who depends on it. The most architecturally committed user remains FreeBSD, where ZFS has been the installer default for over a decade.
Local Snapshots
zfs snapshot tank/data@before-i-try-something-clever
That is the entire command including a proper commit message. Atomic, takes milliseconds on active datasets, no locks, no files copied. The snapshot is an immutable view of every block at that instant. A snapshot is not a copy; it is a reference to the blocks as they were. Writing to the live dataset allocates new blocks; the snapshot continues to point to the old ones. This is copy-on-write at the filesystem layer, not bolted on top of it.
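The lifecycle around that one command is equally terse. A minimal sketch, with pool and dataset names purely illustrative:

```shell
# Create a named snapshot of the dataset (atomic, near-instantaneous).
zfs snapshot tank/data@before-i-try-something-clever

# List every snapshot that exists under this dataset.
zfs list -t snapshot -r tank/data

# Snapshots are also browsable read-only through the hidden .zfs directory,
# which is handy for retrieving a single file without a full rollback.
ls /tank/data/.zfs/snapshot/before-i-try-something-clever/
```

The `.zfs/snapshot` directory is accessible even when hidden from directory listings, so restoring one accidentally deleted file is a `cp`, not a restore job.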
Linux has btrfs, which offers similar subvolume snapshots. It is, in fairness, the only serious counterpart. Alas, btrfs RAID 5/6 is still marked "strongly discouraged for production" in 2026, which one does find a trifle awkward for a system meant to hold long-term state. Btrfs has its strengths: it ships with mainline Linux, it is well-integrated into Ubuntu and Fedora, and its single-disk and mirrored-RAID configurations are perfectly serviceable. It is not, however, what one chooses for a 40 TB production cluster with parity RAID.
Across a Cluster
zfs send tank/data@A | ssh standby zfs receive tank/data
zfs send -i tank/data@A tank/data@B | ssh standby zfs receive tank/data
Block-level, not file-level. The first command ships the entire dataset as a stream. The second ships only the delta between snapshots A and B. For a 40 TB dataset with small deltas, the incremental transfer is seconds, not hours. Published benchmarks put full-send rates at 100 to 300 MB/s and incremental sends at around 60 MB/s on commodity hardware; the real surprise is how small the deltas become. A 1 TB dataset with a 1% daily change rate ships roughly 10 GB per replication cycle. A 512 MB benchmark dataset where only metadata changed: 33 KB incremental. Once the initial full send is done, ongoing replication is close to free for stable workloads.
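A minimal replication cycle built on those two commands might look like the following; host and dataset names are placeholders, and error handling is deliberately omitted:

```shell
#!/bin/sh
# Illustrative incremental replication cycle: take a new snapshot,
# ship only the delta since the previous one, then roll the baseline forward.
DS=tank/data
REMOTE=standby

zfs snapshot "${DS}@repl-new"
zfs send -i "${DS}@repl-prev" "${DS}@repl-new" | \
    ssh "${REMOTE}" zfs receive -F "${DS}"

# The new snapshot becomes the baseline for the next cycle.
zfs destroy "${DS}@repl-prev"
zfs rename "${DS}@repl-new" "${DS}@repl-prev"
```

The `-F` on the receiving side rolls the standby back to the common snapshot before applying the delta, which keeps the cycle robust against stray writes on the target.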
This is how disaster-recovery standbys are kept in sync, how clusters replicate read replicas, how staging is rebuilt from production in the time it takes to boil a kettle. The automation ecosystem around this primitive is mature: Sanoid handles policy-driven snapshot scheduling, Syncoid orchestrates incremental sends via cron, zrepl runs as a long-lived daemon with TLS and bandwidth limiting for links that matter. Each sits on top of the same five or six subcommands.
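In practice the glue is a one-line cron entry. A sketch using Syncoid, with schedule and dataset names as examples rather than recommendations:

```shell
# Illustrative /etc/cron.d entry: replicate every 15 minutes, reusing the
# snapshots Sanoid already takes instead of creating extra sync snapshots.
*/15 * * * * root syncoid --no-sync-snap tank/data root@standby:tank/data
```

Syncoid discovers the most recent common snapshot on both sides and constructs the incremental send itself, which is precisely the bookkeeping one does not want to hand-roll at three in the morning.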
Btrfs send/receive exists. It (alas) requires the source subvolume to be made read-only before the send, which is a trifle inconvenient for continuously active state, and its automation ecosystem has not reached the same maturity.
Something Rather Like Git
If one insists on the metaphor: every snapshot is an immutable commit, every clone a branch.
zfs clone tank/data@A tank/branch-experiment
The clone costs no disk space until it diverges. Reads still come from the shared baseline. This is Git's copy-on-write semantics applied to the entire filesystem, except it works on blocks rather than files, and it does not care whether the content is source code, a PostgreSQL data directory, or 40 TB of scientific observations.
One can spin up fifty clones of a multi-terabyte database for fifty engineers, each with their own writable branch, for the price of the blocks they actually change. AWS has productised exactly this pattern in FSx for OpenZFS: Oracle databases at near-petabyte scale, cloned in seconds, zero additional capacity consumed until divergence. The same pattern is used in CI pipelines to give every test run its own writable snapshot of the production dataset, torn down when the run completes.
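The per-run CI pattern fits in a handful of lines. A sketch, with snapshot and run names invented for illustration:

```shell
# Give one test run its own writable branch of a production snapshot.
RUN=ci-run-1234
zfs clone tank/pgdata@nightly "tank/${RUN}"

# ... run the test suite against /tank/ci-run-1234 ...

# Tear the branch down afterwards; only blocks the run actually wrote
# were ever allocated, and they are freed here.
zfs destroy "tank/${RUN}"
```

Fifty concurrent runs cost fifty clones' worth of divergence, not fifty copies of the dataset.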
zfs diff reports changes at file level between two snapshots. zfs rollback returns a dataset to a prior snapshot, undoing every write since. zfs promote swaps a clone with its origin, turning the branch into the main line. Replication via send is effectively git push. The filesystem is the version control.
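Spelled out, the analogy maps command for command; dataset names are illustrative:

```shell
# "git diff": which files changed between two snapshots?
zfs diff tank/data@A tank/data@B

# "git reset --hard": discard every write since the snapshot.
# (-r also destroys any snapshots taken after @A, which rollback requires.)
zfs rollback -r tank/data@A

# "git merge", after a fashion: make the clone the new main line and
# turn the old origin into its dependent.
zfs promote tank/branch-experiment
```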
Btrfs offers subvolume snapshots and clones. Its RAID 5/6 is officially not production-ready, its dedup is offline-only (batch, not inline), and its quota accounting has historically slowed noticeably with many snapshots. One would prefer rather more predictable ground under the version-control layer of one's infrastructure.
The Point
On FreeBSD, ZFS is the installer default. The snapshot-as-stream philosophy is not a backup strategy bolted on top. It is the filesystem. There is no separate "backup layer", "replication layer", or "branching layer". There is the pool, the dataset, the snapshot, and the stream. Everything else is a composition of those four primitives.
This is the Unix philosophy applied to storage: small, orthogonal operations that compose. The value is not in any one of the commands. It is in the fact that snapshot, send, receive, clone, diff, rollback, and promote are variations on the same model.
Backup, replication, and branching are not three tools. They are three uses of the same primitive.
Read the full article on vivianvoss.net →
By Vivian Voss — System Architect & Software Developer. Follow me on LinkedIn for daily technical writing.
