Copy-on-Write, Explained Through fork() and Snapshots

Copying data is expensive. Copying data you never modify is wasted work. Copy-on-write (CoW) is the trick that resolves that tension: you hand out what looks like an independent copy, but no bytes move until someone actually writes. Until that first write, every "copy" is the same physical data, shared and marked read-only. The copy happens lazily, per unit, only when a writer forces it.

That single idea shows up in three places most developers touch every week: the fork() system call, filesystem snapshots on ZFS and Btrfs, and the multi-version concurrency control inside Postgres. They look unrelated until you see they're the same mechanism applied at different granularities.

The mechanism: share read-only, copy on the fault

The unit of sharing on a modern CPU is the page — 4 KiB on x86-64 by default. Your process doesn't address physical memory directly; it addresses virtual pages, and the page table maps each virtual page to a physical frame, plus permission bits like read, write, and execute.

Copy-on-write works by lying about those permission bits. When you want two logical copies of a region, you don't duplicate the underlying frames. You point both page tables at the same physical frames and clear the write bit on both. Reads go straight through and cost nothing extra. The moment either side issues a write, the CPU's memory management unit sees the cleared write bit and raises a page fault.

The kernel's fault handler is where the actual copy happens. It allocates a fresh frame, copies the 4 KiB of contents, repoints the faulting process's page table entry at the new frame, restores the write bit, and resumes the instruction. The writer never knows it faulted. The other side still references the original, untouched. A reference count on each shared frame tells the kernel whether a copy is even necessary — if the count is already 1, there's no one to protect, so it just flips the write bit back on instead of copying.
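
The fault path described above can be modeled in a few lines. This is a toy simulation, not kernel code: the `Frame` and `AddressSpace` classes, the `clone` method, and the `copies` counter are all names invented here for illustration, and a real kernel does this work on hardware page tables rather than Python dictionaries.

```python
PAGE_SIZE = 4096  # bytes per page, matching the x86-64 default


class Frame:
    """A physical frame: page contents plus a count of page tables mapping it."""

    def __init__(self, data=None):
        self.data = bytearray(data) if data is not None else bytearray(PAGE_SIZE)
        self.refcount = 1


class AddressSpace:
    """Toy page table: virtual page number -> [frame, write bit]."""

    def __init__(self):
        self.table = {}
        self.copies = 0  # frames this address space had to duplicate

    def map_page(self, vpn):
        self.table[vpn] = [Frame(), True]

    def clone(self):
        """Share every frame with a new address space and clear both write bits."""
        other = AddressSpace()
        for vpn, entry in self.table.items():
            frame = entry[0]
            frame.refcount += 1
            entry[1] = False
            other.table[vpn] = [frame, False]
        return other

    def read(self, vpn, offset):
        return self.table[vpn][0].data[offset]

    def write(self, vpn, offset, value):
        entry = self.table[vpn]
        if not entry[1]:
            self._write_fault(entry)  # stands in for the hardware page fault
        entry[0].data[offset] = value

    def _write_fault(self, entry):
        frame = entry[0]
        if frame.refcount > 1:
            # Someone else still maps this frame: copy it and repoint our entry.
            frame.refcount -= 1
            entry[0] = Frame(frame.data)
            self.copies += 1
        # With a single remaining owner, no copy is needed at all.
        entry[1] = True
```

After `child = parent.clone()`, the first write by the parent duplicates the frame, while a later write by the child finds a reference count of 1 and only restores its write bit.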

The cost of a copy-on-write "copy" is driven by how much you write afterward, not by how big the region is. The up-front work is limited to duplicating the mappings and clearing write bits, so duplicating a 4 GiB address space is cheap if the child touches only a few pages. It's expensive only if the child rewrites most of it, at which point you've paid roughly what an eager copy would have cost anyway, plus a page fault per page, spread out over time.
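
To put rough numbers on that, here is a back-of-the-envelope calculation. It counts only the bytes of page contents that get copied and deliberately ignores page-table duplication and per-fault overhead; the function names are made up for this sketch.

```python
PAGE_SIZE = 4096  # bytes, the x86-64 default page size


def pages_in(region_bytes):
    """Number of pages needed to cover a region (rounded up)."""
    return -(-region_bytes // PAGE_SIZE)


def cow_copy_bytes(region_bytes, distinct_pages_written):
    """Bytes of page contents copied lazily: one page per distinct page written."""
    touched = min(distinct_pages_written, pages_in(region_bytes))
    return touched * PAGE_SIZE


region = 4 * 1024**3                # a 4 GiB address space
print(pages_in(region))             # 1048576 pages
print(cow_copy_bytes(region, 10))   # 40960 bytes if the child writes 10 pages
print(cow_copy_bytes(region, pages_in(region)) == region)  # True: full rewrite
```

An eager copy would move all 4 GiB up front, whereas the lazy version moves 40 KiB in the ten-page case.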

fork() is the textbook case

When a Unix process calls fork(), the kernel needs to produce a child with an identical address space. Eagerly duplicating every page would make fork() scale with the parent's memory footprint, which is brutal for a large process — and pointless, because the overwhelmingly common next move is execve(), which throws the whole address space away and loads a new program.

So fork() copies the page tables, not the page contents, and marks every private writable page read-only in both parent and child. (Mappings created as shared stay shared and writable; copy-on-write applies only to private memory.) Both processes share physical memory. Execution continues until one of them writes to a shared page; that page, and only that page, gets duplicated by the fault handler. If the child immediately calls execve(), almost nothing was ever copied.
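
The observable half of this is easy to check from user space. The sketch below, which assumes a Unix-like system where `os.fork` is available, forks a child that overwrites a buffer and then compares what each side sees. It demonstrates the isolation that copy-on-write preserves, not the page sharing itself, which is invisible at this level; the function name is hypothetical.

```python
import os


def fork_and_diverge():
    """Fork, let the child overwrite a buffer, and return (parent_view, child_view)."""
    buf = bytearray(b"original")
    read_fd, write_fd = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: this write lands on the child's private copy of the memory.
        os.close(read_fd)
        buf[:] = b"modified"
        os.write(write_fd, bytes(buf))
        os.close(write_fd)
        os._exit(0)

    # Parent: collect the child's view, then compare it with our own buffer.
    os.close(write_fd)
    child_view = os.read(read_fd, 64)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return bytes(buf), child_view


if __name__ == "__main__":
    print(fork_and_diverge())  # (b'original', b'modified')
```

One caveat when experimenting in Python specifically: the interpreter updates reference counts stored inside objects, so even code that only reads an object can dirty the page holding it.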

This is also why fork()-based snapshotting works. Redis takes a point-in-time RDB snapshot by calling fork() and letting the child serialize memory to disk while the parent keeps serving traffic. The child sees a frozen view: any key the parent mutates after the fork triggers a CoW page copy, so the child keeps reading the pre-fork bytes. The catch is memory pressure — if the parent writes heavily during the save, copied pages accumulate, and a write-heavy Redis can transiently approach double its resident size during a background save.
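
The same pattern can be sketched in miniature. This is an analogy to the approach described above, not Redis's implementation: the function name, the pipe-based handshake, and the use of JSON are all choices made for this example, and it assumes a Unix-like system with `os.fork`. The child waits until the parent has finished mutating, then serializes what it sees.

```python
import json
import os


def snapshot_while_mutating(store):
    """Fork a child that serializes `store` after the parent has mutated its copy."""
    go_r, go_w = os.pipe()    # parent -> child: "I have finished mutating"
    out_r, out_w = os.pipe()  # child -> parent: serialized snapshot

    pid = os.fork()
    if pid == 0:
        os.close(go_w)
        os.close(out_r)
        os.read(go_r, 1)  # block until the parent has changed its own copy
        payload = json.dumps(store, sort_keys=True).encode()
        with os.fdopen(out_w, "wb") as out:
            out.write(payload)
        os._exit(0)

    os.close(go_r)
    os.close(out_w)
    store["counter"] = store.get("counter", 0) + 1  # post-fork writes
    store["added_after_fork"] = True
    os.write(go_w, b"x")
    os.close(go_w)

    chunks = []
    while True:
        chunk = os.read(out_r, 65536)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(out_r)
    os.waitpid(pid, 0)
    return json.loads(b"".join(chunks))
```

Calling `snapshot_while_mutating({"counter": 1})` returns `{"counter": 1}` even though the parent's dictionary has already moved on to `{"counter": 2, "added_after_fork": True}` by the time the child serializes.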

Copy-on-write makes "how much memory does this process use" a slippery question. Resident set size counts shared CoW pages against every process that maps them, so summing RSS across a parent and its forks double-counts memory that physically exists once. On Linux, PSS (proportional set size) in /proc/[pid]/smaps divides shared pages by the number of sharers — that's the number you want when accounting for forked workers.
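
A small parser makes the RSS versus PSS comparison concrete. This sketch assumes the usual smaps layout of `Field:   <number> kB` lines, one set per mapping; the helper names are invented here, and `read_own_usage` only works on Linux.

```python
import re


def sum_smaps_field(smaps_text, field):
    """Sum one per-mapping field (for example "Rss" or "Pss") in kB."""
    # Anchoring on "Field:" keeps "Pss" from also matching "Pss_Dirty" and similar.
    pattern = re.compile(r"^" + re.escape(field) + r":\s+(\d+)\s+kB", re.MULTILINE)
    return sum(int(kb) for kb in pattern.findall(smaps_text))


def read_own_usage():
    """Return (rss_kb, pss_kb) for the current process. Linux only."""
    with open("/proc/self/smaps") as f:
        text = f.read()
    return sum_smaps_field(text, "Rss"), sum_smaps_field(text, "Pss")
```

Running `read_own_usage()` in a parent and in each forked worker, then summing the PSS values, should give a total much closer to physical usage than the summed RSS values.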

Snapshots: the same idea, larger units

Filesystem and database snapshots apply copy-on-write above the page level, to disk blocks and row versions.

A CoW filesystem like ZFS or Btrfs never overwrites a live block in place. When you modify a file, it writes the new data to a free block and updates the metadata to point there, leaving the old block intact. A snapshot is then almost free: you record the current root of the tree and stop reclaiming the blocks it references. The live filesystem keeps moving forward onto new blocks; the snapshot keeps pointing at the old ones. Blocks are shared between the live view and the snapshot until a write diverges them. This is exactly the page-fault dance, just with the storage allocator playing the role of the fault handler. The space a snapshot holds exclusively is the set of old blocks that the live filesystem has since replaced or freed, so a snapshot of data that never changes costs almost nothing.
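
A toy model shows how little a snapshot has to do. Everything here is hypothetical and heavily simplified: `CowVolume` uses a flat map from logical block number to block id, and taking a snapshot copies that map, whereas a real CoW filesystem keeps a tree and only records its root. The sharing and reclamation behavior is the part worth looking at.

```python
class CowVolume:
    """Toy CoW volume: blocks are never overwritten, only superseded."""

    def __init__(self):
        self.blocks = {}     # block id -> bytes, immutable once written
        self.next_id = 0
        self.live = {}       # logical block number -> block id
        self.snapshots = {}  # snapshot name -> frozen block map

    def write(self, lbn, data):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = bytes(data)  # always a fresh block
        self.live[lbn] = block_id            # the old block is left intact
        self._reclaim()

    def read(self, lbn, snapshot=None):
        table = self.live if snapshot is None else self.snapshots[snapshot]
        return self.blocks[table[lbn]]

    def snapshot(self, name):
        self.snapshots[name] = dict(self.live)  # copies pointers, not data

    def delete_snapshot(self, name):
        del self.snapshots[name]
        self._reclaim()

    def unique_to_snapshot(self, name):
        """Block ids that only this snapshot still references."""
        others = set(self.live.values())
        for other_name, table in self.snapshots.items():
            if other_name != name:
                others.update(table.values())
        return set(self.snapshots[name].values()) - others

    def _reclaim(self):
        """Free every block that neither the live view nor a snapshot references."""
        referenced = set(self.live.values())
        for table in self.snapshots.values():
            referenced.update(table.values())
        for block_id in list(self.blocks):
            if block_id not in referenced:
                del self.blocks[block_id]
```

Right after `snapshot("s1")`, `unique_to_snapshot("s1")` is empty. It grows by one block for each logical block the live view overwrites, and those blocks are freed only when the snapshot is deleted.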

Database MVCC is the row-level version. Instead of locking a row so readers and writers take turns, Postgres writes a new version of the row on update and leaves the old version in place. Each query reads whichever versions are visible to its snapshot, which Postgres takes per statement at the default READ COMMITTED isolation level and once per transaction at REPEATABLE READ and above, so a long-running read never blocks a concurrent write and vice versa. Old versions are shared by every snapshot old enough to see them, and vacuum cleans them up only once no transaction can reference them anymore. The reference-count idea returns as visibility bookkeeping.
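
The visibility rule can be sketched with a toy table. The field names `xmin` and `xmax` are borrowed from Postgres, where they record the creating and superseding transaction of a row version, but everything else is a simplification invented for this example: one snapshot per transaction, no aborts, no concurrent writers to the same key, and a `vacuum` that is told which transactions are still open.

```python
class Table:
    """Toy MVCC table: updates append a version instead of overwriting."""

    def __init__(self):
        self.versions = []     # dicts with keys: key, value, xmin, xmax
        self.next_xid = 1
        self.committed = set()

    def begin(self):
        xid = self.next_xid
        self.next_xid += 1
        # The snapshot is the set of transactions committed at this moment.
        return {"xid": xid, "visible": set(self.committed)}

    def commit(self, txn):
        self.committed.add(txn["xid"])

    def _sees(self, txn, xid):
        return xid == txn["xid"] or xid in txn["visible"]

    def read(self, txn, key):
        for v in self.versions:
            if v["key"] != key or not self._sees(txn, v["xmin"]):
                continue
            if v["xmax"] is not None and self._sees(txn, v["xmax"]):
                continue  # superseded by a transaction this snapshot can see
            return v["value"]
        return None

    def write(self, txn, key, value):
        for v in self.versions:
            if v["key"] == key and v["xmax"] is None:
                v["xmax"] = txn["xid"]  # mark as superseded, leave it in place
        self.versions.append(
            {"key": key, "value": value, "xmin": txn["xid"], "xmax": None}
        )

    def vacuum(self, open_txns):
        """Drop versions whose superseding transaction every open snapshot sees."""
        def dead(v):
            return v["xmax"] in self.committed and all(
                self._sees(t, v["xmax"]) for t in open_txns
            )
        self.versions = [v for v in self.versions if not dead(v)]
```

A reader that began before an update committed keeps getting the old value from `read`, and `vacuum` leaves that old version alone for as long as the reader is in the `open_txns` list.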

Reading kernel and database source is the fastest way to make this concrete — the do_wp_page fault handler in the Linux mm code, or the tuple visibility checks in Postgres, are short and surprisingly readable once you know what you're looking for. A capable editor that can jump across a large C codebase and answer "who clears this write bit" without you grepping by hand earns its keep here.

The payoff of seeing these three as one mechanism is practical. When a forked worker pool balloons in memory, you know it's CoW pages diverging under write pressure, not a leak. When a Postgres table bloats, you know dead row versions are accumulating faster than vacuum reclaims them. When a snapshot you forgot about quietly consumes a disk, you know it's pinning blocks the live filesystem has long since moved past. Same lazy copy, same reference counting, same failure mode: copies you stopped tracking.

Originally published at pickuma.com.