Vincent Du

Posted on Jan 7

Building a File Copier 4x Faster Than cp Using io_uring

#linux #cpp #performance #tutorial

I built a high-performance file copier for ML datasets using Linux io_uring. On the right workload, it's 4.2x faster than cp -r. Here's what I learned about when async I/O helps—and when it doesn't.

The Problem: Millions of Small Files

ML training datasets often contain millions of small files:

Dataset	Files	Typical Size
ImageNet	1.28M	100-200KB JPEG
COCO	330K	50-500KB
MNIST	70K	784 bytes
CIFAR-10	60K	3KB

Copying these with cp -r is painfully slow. Each file needs at least four syscalls (open, read, write, close), and cp issues them one at a time, waiting for each to finish before starting the next. For 100,000 files, that's 400,000+ syscalls executed sequentially.
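To make the per-file cost concrete, here is a minimal sketch of a blocking copy loop. It is not cp's actual source; `copy_file_blocking` is a hypothetical helper, and the 64KB buffer size is an assumption chosen for illustration.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstddef>

// Hypothetical helper: copies one file using only blocking syscalls.
// Returns the number of bytes copied, or -1 on error.
inline long long copy_file_blocking(const char* src, const char* dst) {
    assert(src != nullptr && dst != nullptr);
    int in = open(src, O_RDONLY);                             // open source
    if (in < 0) return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);  // create destination
    if (out < 0) { close(in); return -1; }

    char buf[64 * 1024];  // assumed buffer size
    long long total = 0;
    bool ok = true;
    while (ok) {
        ssize_t n = read(in, buf, sizeof buf);                // blocks until data arrives
        if (n < 0) { if (errno == EINTR) continue; ok = false; break; }
        if (n == 0) break;                                    // end of file
        ssize_t off = 0;
        while (off < n) {                                     // handle short writes
            ssize_t w = write(out, buf + off, static_cast<std::size_t>(n - off));
            if (w < 0) { if (errno == EINTR) continue; ok = false; break; }
            off += w;
        }
        if (ok) total += n;
    }
    close(in);
    close(out);
    return ok ? total : -1;
}
```

Even for a small file this sketch makes about seven syscalls (two opens, a read, a write, a final read that reports end of file, two closes), and none of them can start until the previous one has returned.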

The Solution: io_uring

io_uring is a Linux async I/O interface (kernel 5.1+) that enables:

Batched submission - Queue dozens of operations, submit with one syscall
Async completion - Operations complete out of order
Zero-copy - Splice data directly between file descriptors via kernel pipes

Instead of: open → read → write → close → repeat

We do: submit 64 opens → process completions → submit reads/writes → batch everything

Architecture

┌──────────────┐     ┌─────────────────┐     ┌─────────────────────┐
│ Main Thread  │────▶│  WorkQueue<T>   │────▶│  Worker Threads     │
│ (scanner)    │     │  (thread-safe)  │     │  (per-thread uring) │
└──────────────┘     └─────────────────┘     └─────────────────────┘
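The queue in the middle of this diagram can be sketched as a mutex plus a condition variable. The real `WorkQueue<T>` in the repository may look different; the `close()` method and the `std::optional` return type here are assumptions made for the example.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <utility>

// Hypothetical sketch of a blocking, thread-safe work queue.
template <typename T>
class WorkQueue {
public:
    // Called by the scanner thread for each discovered file.
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            items_.push(std::move(item));
        }
        cv_.notify_one();
    }

    // Called by the scanner once no more items will be pushed.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            closed_ = true;
        }
        cv_.notify_all();
    }

    // Called by workers. Blocks until an item is available and returns
    // std::nullopt once the queue is closed and fully drained.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return std::nullopt;
        T item = std::move(items_.front());
        items_.pop();
        return item;
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<T> items_;
    bool closed_ = false;
};
```

A worker would loop on `pop()` until it returns `std::nullopt`, feeding each item into its own ring.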

Each file progresses through a state machine:

OPENING_SRC → STATING → OPENING_DST → SPLICE_IN ⇄ SPLICE_OUT → CLOSING_SRC → CLOSING_DST → DONE

Key design decisions:

64 files in-flight per worker simultaneously
Per-thread io_uring instances (avoids lock contention)
Inode sorting for sequential disk access
Splice zero-copy for data transfer (source → pipe → destination)
Buffer pool with 4KB-aligned allocations (O_DIRECT compatible)
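As an illustration of the last item, a 4KB-aligned buffer pool might look roughly like this. It is a sketch under assumptions, not the repository's `BufferPool`: the class layout, method names, and the choice of `std::aligned_alloc` are mine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical fixed-size pool of 4KB-aligned buffers.
class BufferPool {
public:
    static constexpr std::size_t kAlign = 4096;

    // min_size must be greater than zero; it is rounded up to a multiple
    // of 4096 because std::aligned_alloc requires that.
    BufferPool(std::size_t count, std::size_t min_size)
        : buf_size_((min_size + kAlign - 1) / kAlign * kAlign) {
        for (std::size_t i = 0; i < count; ++i) {
            void* p = std::aligned_alloc(kAlign, buf_size_);
            if (p == nullptr) break;  // pool ends up smaller than requested
            all_.push_back(p);
        }
        free_ = all_;
    }
    ~BufferPool() { for (void* p : all_) std::free(p); }
    BufferPool(const BufferPool&) = delete;
    BufferPool& operator=(const BufferPool&) = delete;

    // Returns nullptr when every buffer is in use.
    void* acquire() {
        if (free_.empty()) return nullptr;
        void* p = free_.back();
        free_.pop_back();
        return p;
    }

    // Hands a buffer obtained from acquire() back to the pool.
    void release(void* p) {
        assert(p != nullptr);
        free_.push_back(p);
    }

    std::size_t buffer_size() const { return buf_size_; }

private:
    std::size_t buf_size_;
    std::vector<void*> all_;   // every buffer the pool owns
    std::vector<void*> free_;  // buffers currently available
};
```

This version is not thread-safe, which fits a design where each worker owns its own pool alongside its own ring.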

Benchmark Results

Local NVMe (Cold Cache)

Workload	cp -r	uring-sync	Speedup
100K × 4KB files (400MB)	7.67s	5.14s	1.5x
100K × 100KB files (10GB)	22.7s	5.4s	4.2x

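Key insight: on fast storage, the larger files benefit more from io_uring. uring-sync's wall time barely changes between the two workloads (5.14s vs 5.4s), while cp -r takes about 3x longer on the 100KB files (7.67s vs 22.7s). With 64 files in flight, the larger reads and writes overlap instead of running back to back.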

GCP pd-balanced (SSD-backed, 100GB)

Workload	cp -r	uring-sync	Speedup
100K × 4KB files	67.7s	31.5s	2.15x
100K × 100KB files	139.6s	64.7s	2.16x

Both file sizes show a consistent improvement of roughly 2.15x on this cloud SSD volume.

Why io_uring Helps

On fast storage (NVMe, SSD), the bottleneck is CPU and syscall overhead, not the disk:

cp -r: Processes files sequentially, 12+ syscalls per file once stat and other bookkeeping calls are counted alongside open, read, write, and close
io_uring: 64 files in-flight, batched syscalls, async completion

The bigger the files, the more time a sequential copier spends blocked waiting for each I/O to complete, and the more io_uring's ability to overlap those waits helps. That's why we see a 4.2x speedup for 100KB files vs 1.5x for 4KB files on NVMe.

Implementation Details

The State Machine

Each file copy is a state machine with these transitions:

enum class FileState {
    OPENING_SRC,    // Opening source file
    STATING,        // Getting file size
    OPENING_DST,    // Creating destination
    SPLICE_IN,      // Reading into kernel pipe
    SPLICE_OUT,     // Writing from pipe to dest
    CLOSING_SRC,    // Closing source
    CLOSING_DST,    // Closing destination
    DONE
};

Completions drive state transitions. When a completion arrives, we look up the file context and advance its state.
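A minimal sketch of that transition logic follows. The `FileContext` fields and the `advance` function are hypothetical names for illustration; the sketch assumes the caller has already stored the file size from the stat result and passes in the result code of each completion.

```cpp
#include <cassert>
#include <cstdint>

enum class FileState {
    OPENING_SRC, STATING, OPENING_DST, SPLICE_IN, SPLICE_OUT,
    CLOSING_SRC, CLOSING_DST, DONE
};

// Hypothetical per-file context; field names are illustrative.
struct FileContext {
    FileState state = FileState::OPENING_SRC;
    std::uint64_t size = 0;     // filled in by the caller from the stat result
    std::uint64_t copied = 0;   // bytes written to the destination so far
    std::uint64_t in_pipe = 0;  // bytes spliced into the pipe, not yet drained
};

// Advances one file after a completion whose result code is `res`.
// Returns false if the operation failed or made no progress.
inline bool advance(FileContext& f, std::int32_t res) {
    if (res < 0) return false;  // negative results are -errno values
    const auto n = static_cast<std::uint64_t>(res);
    switch (f.state) {
    case FileState::OPENING_SRC: f.state = FileState::STATING; break;
    case FileState::STATING:     f.state = FileState::OPENING_DST; break;
    case FileState::OPENING_DST:
        // Empty files skip the splice loop entirely.
        f.state = f.size == 0 ? FileState::CLOSING_SRC : FileState::SPLICE_IN;
        break;
    case FileState::SPLICE_IN:
        if (n == 0) return false;  // source ended before `size` bytes
        f.in_pipe = n;
        f.state = FileState::SPLICE_OUT;
        break;
    case FileState::SPLICE_OUT:
        if (n == 0 || n > f.in_pipe) return false;
        f.in_pipe -= n;
        f.copied += n;
        if (f.in_pipe == 0)  // pipe drained: read more, or finish
            f.state = f.copied < f.size ? FileState::SPLICE_IN
                                        : FileState::CLOSING_SRC;
        break;  // otherwise stay in SPLICE_OUT to drain the rest
    case FileState::CLOSING_SRC: f.state = FileState::CLOSING_DST; break;
    case FileState::CLOSING_DST: f.state = FileState::DONE; break;
    case FileState::DONE: break;
    }
    return true;
}
```

The SPLICE_IN and SPLICE_OUT cases bounce back and forth until `copied` reaches `size`, and a short write simply leaves the file in SPLICE_OUT for another round.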

Splice Zero-Copy

Instead of read() → userspace buffer → write(), we use splice():

Source FD → Kernel Pipe → Destination FD

Data never touches userspace. The kernel moves pages between the file descriptors through the pipe instead of copying them into a user buffer and back.

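// Splice from source into pipe (SPLICE_IN).
// A pipe has no file offset, so the pipe side of each call passes -1.
io_uring_prep_splice(sqe, src_fd, offset, pipe_write_fd, -1, chunk_size, 0);

// Splice from pipe to destination (SPLICE_OUT).
// This goes on its own SQE; the two operations cannot share one.
io_uring_prep_splice(sqe, pipe_read_fd, -1, dst_fd, offset, chunk_size, 0);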
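The same source → pipe → destination path can be tried without io_uring by using the blocking splice(2) syscall. This sketch swaps the async submission for synchronous calls so it is easy to run on its own; `splice_copy` is a hypothetical helper, and the 64KB chunk size is an assumption.

```cpp
#include <fcntl.h>   // splice(), SPLICE_F_MOVE (Linux-specific)
#include <unistd.h>
#include <cassert>
#include <cstddef>

// Hypothetical helper: moves up to `len` bytes from src_fd to dst_fd through
// a pipe, using each descriptor's current file offset.
// Returns bytes moved, or -1 on error.
inline long long splice_copy(int src_fd, int dst_fd, long long len) {
    assert(len >= 0);
    int p[2];
    if (pipe(p) != 0) return -1;

    const long long chunk = 64 * 1024;  // assumed chunk size
    long long total = 0;
    bool ok = true;
    while (ok && total < len) {
        const long long want = (len - total < chunk) ? (len - total) : chunk;
        // Source file -> pipe write end.
        ssize_t in = splice(src_fd, nullptr, p[1], nullptr,
                            static_cast<std::size_t>(want), SPLICE_F_MOVE);
        if (in < 0) { ok = false; break; }
        if (in == 0) break;  // source hit end of file early
        // Pipe read end -> destination file; loop in case of short writes.
        ssize_t left = in;
        while (left > 0) {
            ssize_t out = splice(p[0], nullptr, dst_fd, nullptr,
                                 static_cast<std::size_t>(left), SPLICE_F_MOVE);
            if (out <= 0) { ok = false; break; }
            left -= out;
        }
        if (ok) total += in;
    }
    close(p[0]);
    close(p[1]);
    return ok ? total : -1;
}
```

Each chunk costs two syscalls here. The io_uring version issues the same two splice operations per chunk, but keeps many files' chunks in flight at once instead of blocking on each.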

Inode Sorting

Before copying, we sort files by inode number:

std::sort(files.begin(), files.end(),
    [](const auto& a, const auto& b) { return a.inode < b.inode; });

This encourages sequential disk access since inodes are typically allocated sequentially for files created together.
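For context, the inode numbers have to come from somewhere before the sort can run. One way to collect them, sketched here with a hypothetical `FileEntry` struct and `sort_by_inode` helper, is to call lstat on each path during the scan:

```cpp
#include <sys/stat.h>
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical scan record; the repository's own struct may differ.
struct FileEntry {
    std::string path;
    ino_t inode = 0;
    off_t size = 0;
};

// Stats every path and returns the entries sorted by inode number.
// Paths that cannot be stat'ed are skipped.
inline std::vector<FileEntry> sort_by_inode(const std::vector<std::string>& paths) {
    std::vector<FileEntry> files;
    files.reserve(paths.size());
    for (const auto& p : paths) {
        struct stat st{};
        if (lstat(p.c_str(), &st) != 0) continue;
        files.push_back({p, st.st_ino, st.st_size});
    }
    std::sort(files.begin(), files.end(),
              [](const FileEntry& a, const FileEntry& b) { return a.inode < b.inode; });
    return files;
}
```

Inode numbers are only comparable within a single filesystem, so this ordering is most meaningful when the whole source tree lives on one volume.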

What I Learned

Single worker beats multi-threading for local NVMe. Lock contention outweighs the benefit of parallelism when the storage is already this fast.
Queue depth matters more than thread count. 64 files in-flight per worker was the sweet spot in my tests.
Profile your actual workload. Synthetic benchmarks lie. Test with your real data.
io_uring shines on fast storage. When the disk can keep up, reducing syscall overhead yields big gains.

What's Next: Network Transfer

This tool now also supports network file transfer with kTLS encryption, achieving 58% faster transfers than rsync. See the companion post: Beating rsync by 58% with Kernel TLS.

Code

The full implementation is ~1,400 lines of C++20. Key components:

Component	Purpose
`RingManager`	io_uring wrapper with SQE/CQE management
`BufferPool`	4KB-aligned buffer allocation
`PipePool`	Reusable kernel pipes for splice
`WorkQueue`	Thread-safe file queue
`FileContext`	Per-file state machine
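As one example of these components, a pool of reusable pipes could be sketched as below. This is an assumption about how such a pool might work, not the repository's `PipePool`; the `Pipe` struct and method names are hypothetical.

```cpp
#include <fcntl.h>   // O_CLOEXEC
#include <unistd.h>  // pipe2(), close() (pipe2 is Linux-specific)
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pair of pipe descriptors.
struct Pipe {
    int read_fd = -1;
    int write_fd = -1;
};

// Hypothetical pool that creates its pipes once and hands them out for reuse,
// avoiding a pipe()/close() pair for every file copied.
class PipePool {
public:
    explicit PipePool(std::size_t count) {
        for (std::size_t i = 0; i < count; ++i) {
            int fds[2];
            if (pipe2(fds, O_CLOEXEC) != 0) break;  // pool may end up smaller
            all_.push_back(Pipe{fds[0], fds[1]});
        }
        free_ = all_;
    }
    ~PipePool() {
        for (const Pipe& p : all_) {
            close(p.read_fd);
            close(p.write_fd);
        }
    }
    PipePool(const PipePool&) = delete;
    PipePool& operator=(const PipePool&) = delete;

    // Returns false when every pipe is in use.
    bool acquire(Pipe& out) {
        if (free_.empty()) return false;
        out = free_.back();
        free_.pop_back();
        return true;
    }

    // The pipe must be fully drained before it is released, or leftover
    // bytes would leak into the next file that uses it.
    void release(const Pipe& p) {
        assert(p.read_fd >= 0 && p.write_fd >= 0);
        free_.push_back(p);
    }

    std::size_t available() const { return free_.size(); }

private:
    std::vector<Pipe> all_;   // every pipe the pool owns
    std::vector<Pipe> free_;  // pipes currently available
};
```

With 64 files in flight per worker, a pool of 64 pipes per worker would give each in-flight file its own pipe.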

Build requirements:

Linux kernel 5.1+ for io_uring itself (5.7+ for the splice opcode, IORING_OP_SPLICE)
liburing
C++20

Conclusion

io_uring can dramatically speed up small-file workloads: up to 4.2x faster on NVMe and about 2x faster on cloud SSD in these benchmarks. The key is reducing syscall overhead through batching and async I/O.

When to use io_uring for file copying:

Many small files (ML datasets, source trees)
Fast storage (NVMe, SSD)
CPU-bound on syscall overhead

When cp -r is fine:

Single large files (already efficient)
One-off copies where complexity isn't worth it

The code is available at github.com/VincentDu2021/uring_sync. Benchmarks were run on Ubuntu 24.04 with kernel 6.14 on local NVMe and GCP Compute Engine VMs.
