I used to think of a file as a solid block of data. 10GB file, 10GB on disk. That's just something you'd normally assume, right?
I was wrong.
I'm a CS student currently going deep on systems programming — reading TLPI, writing C daily, learning how the OS actually works under the hood. This week I was working through file I/O and had one of those moments where something you thought you understood turns out to be completely different one level down.
You can seek past the end of a file and write there. The filesystem won't fill the gap with real data. It just remembers that the gap exists.
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int fd = open("huge_file", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) { perror("open"); exit(1); }

    // seek to the 10GB mark without writing anything
    if (lseek(fd, 10LL * 1024 * 1024 * 1024, SEEK_SET) == -1) {
        perror("lseek"); exit(1);
    }

    // write exactly one byte
    if (write(fd, "X", 1) == -1) {
        perror("write"); exit(1);
    }

    close(fd);
    return 0;
}
Run this and check the file two different ways:
ls -lh huge_file # 10G
du -h huge_file # 4.0K
The file reports 10GB in size. It uses 4KB of actual disk. Both numbers are telling the truth.
ls -lh shows the logical size — the offset just past the last byte you wrote. The filesystem knows the file is 10GB in the sense that if you read from byte 0, you'll eventually reach the X at the 10GB mark.
du -h shows allocated blocks — the physical disk space actually consumed. There's one 4KB block holding your single X. Everything before it is a hole: a range of offsets the filesystem tracks in metadata without backing it with real storage. Reading from a hole returns bytes of value zero, synthesized by the kernel on the fly. No disk access needed — there's nothing stored to read.
This isn't just a curiosity. It's the mechanism behind tools you use every day.
When you create a virtual machine in VirtualBox or QEMU and tell it "this VM gets a 100GB disk," you're not consuming 100GB immediately. The disk image is a sparse file. It grows as the VM writes data. The "100GB" is just a ceiling.
Docker layers work on the same idea — copy-on-write means a layer only allocates blocks when data actually changes. Database engines use sparse files for pre-allocated tables that fill in over time without ever needing to write zeros to reserve space.
The thing that stuck with me isn't the trick itself. It's what it reveals about how the filesystem actually thinks.
A file isn't a sequence of bytes on disk. It's a mapping from logical offsets to physical blocks — and some offsets don't map to anything. The OS constructs the illusion of a contiguous sequence on demand.
Once you see that, other things start clicking too. Why copying a sparse file with a tool that isn't sparse-aware (or piping it through cat) can consume far more disk than the original — it reads back the zeros and writes them out as real blocks. Why rsync has a --sparse flag, and GNU cp a --sparse option. Why checking disk usage with ls alone will mislead you.
Systems programming keeps doing this to me. I think I understand something, and then one level lower the actual mechanism is completely different from the abstraction I was using. The abstraction is useful. But knowing what's underneath makes you a better engineer, not just someone who knows how to use the tools.