Marius-Florin Cristian

Posted on • Originally published at keibisoft.com

Optimizing encrypted P2P file transfer - from 225 to 441 MB/s

Part of the KEIBIDROP development blog. KEIBIDROP is in active development. Release is coming soon.

KEIBIDROP transfers files between two peers over encrypted gRPC. The full stack:

```
Disk I/O -> FUSE kernel -> FUSE daemon -> gRPC framing -> ChaCha20-Poly1305 -> TCP -> Peer
```

We built micro-benchmarks for each layer and measured throughput with 1GB files on an Intel MacBook Pro.

Baseline numbers

| Layer | Throughput | Overhead |
| --- | --- | --- |
| Raw disk (SSD) | ~5 GB/s | -- |
| Raw gRPC (no encryption) | 981 MB/s | 5x vs disk |
| Encrypted gRPC (ChaCha20) | 437 MB/s | 2.2x vs raw gRPC |
| FUSE end-to-end | 225 MB/s | 1.9x vs encrypted gRPC |

The encryption layer costs 2.2x. FUSE adds another 1.9x.

Six optimizations

1. Cache the AEAD cipher. The original code created a new ChaCha20-Poly1305 cipher for every message. Caching it in the constructor is safe because the nonce is a monotonic counter.

```go
// Before: a new cipher was constructed for every message.
aead, _ := chacha20poly1305.NewX(s.key)  // expensive: runs the key schedule each call

// After: created once in the constructor and reused for every message.
// Safe because the nonce is a monotonic counter and never repeats.
type SecureWriter struct {
    aead cipher.AEAD  // created once
}
```

2. Single combined TCP write. Two syscalls per message (header + payload) became one. At 512KB chunks, this halves the write() call count.

3. In-place decryption. `aead.Open(ciphertext[:0], nonce, ciphertext, nil)` decrypts into the ciphertext's own buffer instead of allocating a new one: ~2000 fewer allocations per 1GB transfer.

4. Async cache writes in FUSE Read. FUSE needs the data returned to the kernel. Writing that data to the local cache file can happen in a background goroutine. A WaitGroup ensures everything completes before file close. Cache write overhead dropped from 16% to 1.8%.

```go
// FUSE Read: hand the data to the kernel immediately; persist it to
// the local cache file in a background goroutine. f.CacheWg is waited
// on at file close so no cache write outlives the handle.
n := copy(buff, data)
if cacheFD != nil {
    f.CacheWg.Add(1)
    go func() {
        defer f.CacheWg.Done()
        cacheFD.WriteAt(cacheData, cacheOffset) // error handling elided here
    }()
}
return n
```

5. Push-based StreamFile RPC. The original streaming used request-response per chunk. A new server-streaming RPC pushes all chunks without waiting. On-demand reads (random access) still use bidirectional streaming.

6. Silencing hot-path logs. The FUSE handlers contained ~120 `slog.Info`/`Debug`/`Warn` calls, and even structured logging allocates when building attributes. Commenting out non-error logs in `Read`, `Write`, and `Getattr` gave a 2-6% throughput improvement.

After optimizations

| Layer | Before | After | Gain |
| --- | --- | --- | --- |
| Encrypted gRPC | 437 MB/s | 441 MB/s | +0.9% |
| FUSE end-to-end | 225 MB/s | 231 MB/s | +2.7% |

The encryption tax: 1.49x

We built a controlled benchmark comparing identical gRPC transfers over plain TCP vs ChaCha20-Poly1305:

| Transport | MB/s | Duration (100MB) |
| --- | --- | --- |
| Plain TCP + gRPC | 657 | 152ms |
| Encrypted (ChaCha20-Poly1305) | 441 | 227ms |

33% of the transfer time goes to encryption. Our earlier 2.2x estimate compared numbers from different test runs under different system load; the controlled A/B benchmark (same test, same machine) shows the real cost is much lower: 1.49x.
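The table's figures are internally consistent; a quick back-of-envelope check using the values above:

```go
package main

import "fmt"

// throughputMBs derives MB/s from a payload size and a duration, as
// in the A/B table above.
func throughputMBs(payloadMB, seconds float64) float64 {
	return payloadMB / seconds
}

func main() {
	plain := throughputMBs(100, 0.152)     // ~657 MB/s
	encrypted := throughputMBs(100, 0.227) // ~441 MB/s
	fmt.Printf("tax: %.2fx, time share: %.0f%%\n",
		plain/encrypted,            // ~1.49x
		(0.227-0.152)/0.227*100)    // encryption's share of encrypted transfer time
}
```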

The irreducible 48%

The gap between encrypted gRPC (441 MB/s) and FUSE end-to-end (231 MB/s) is 48%. Every FUSE Read requires two kernel-to-userspace context switches. We confirmed this by measuring raw FUSE overhead (local file read through FUSE vs direct): ~48%, matching exactly.

```
Plain gRPC:     657 MB/s  (ceiling)
  -33% encryption
Encrypted gRPC: 441 MB/s
  -48% FUSE kernel overhead
FUSE E2E:       231 MB/s
```

The FUSE overhead is nearly twice the encryption cost. If we wanted to improve throughput, reducing FUSE transitions would have more impact than switching ciphers. But FUSE is the cost of running a userspace filesystem, and for secure P2P file sharing the tradeoff is worth it.

What we learned

  1. Without isolating each layer, we would have optimized the wrong thing. The biggest win (async cache writes, 16% to 1.8%) was only visible when measured independently.
  2. Caching the AEAD cipher and in-place decryption are tiny code changes with measurable impact at 2000+ messages per second.
  3. Structured logging is not free on hot paths. Even checking the log level has overhead.
  4. FUSE has an inherent 48% overhead from context switches. No amount of userspace optimization can fix this.
  5. Server-streaming eliminates round-trip latency for sequential prefetch. Request-response is still needed for random access.
  6. Our initial 2.2x encryption estimate was wrong. A controlled A/B benchmark (same test, same system) showed 1.49x. Always compare apples to apples.

More from the KEIBIDROP blog: full series | product page | FAQ
