beefed.ai

Posted on Mar 26 • Originally published at beefed.ai

Filesystem Caching and Buffer Management for Low Latency

Why filesystem caching controls io-latency more than raw disk speed
How an eviction-policy prevents latency collapse during pressure
When write-back-cache reduces io-latency and when it doesn't
Techniques to scale the page-cache under heavy concurrency
Quantifying cache effectiveness: metrics and measurement protocols
Practical cache-management checklist you can run tonight

The cache is the control plane for application-visible I/O: a well-tuned page-cache and buffer subsystem will often beat adding more SSDs when your goal is predictable low tail latency. Your job isn’t simply to buy faster media — it’s to shape how pages enter, live in, and leave RAM so that misses are rare and writeback never stalls production threads.

You’re likely seeing one or more of the following symptoms: good median throughput but exploding 95th/99th percentiles, long pauses on fsync/O_SYNC calls, background writeback stealing CPU and IO bandwidth, or unpredictable reclaim latencies that manifest as service tail-latency. Those symptoms point to cache-management and writeback dynamics rather than the raw device. The fix lives in layered controls: read-ahead, eviction-policy, write aggregation, and coherent page-cache design tied to careful measurement.

Why filesystem caching controls io-latency more than raw disk speed

The kernel’s page-cache is the primary mechanism by which file data and mmap-backed pages are served; normal reads and writes flow through that layer before the block layer and device drivers. When a page is resident, you get DRAM latency; when it’s not, you pay the full device and stack cost plus any queueing. Because p99 is set by the slowest 1% of requests, a single percentage-point change in cache hit-rate (for example, misses rising from 0.5% to 1.5%) can move p99 from DRAM-class to device-class latency, a jump of orders of magnitude for small-random workloads. (docs.kernel.org)

Read path: a cache hit resolves in microseconds (page lookup + memcpy or zero-copy through mmap). Misses trigger block I/O, device service time, and possible scheduling delays.
Read-ahead matters: sequential access patterns trigger proactive fetches; correct readahead sizing converts many reads from misses into hits and dramatically reduces small-read latency.
Memory-mapped IO uses the same structures as buffered IO; mmap can be a win for throughput but increases pressure on page-cache management.

Practical corollary: investing in SSD bandwidth without addressing cache thrash, writeback storms, and read-ahead tuning usually means paying to treat a symptom rather than the root cause.
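To make the hit-rate argument concrete, here is a minimal sketch of a two-point latency model: every hit costs one fixed latency and every miss another. The function names and the 2 µs / 200 µs figures are assumptions chosen for illustration, not measurements from any particular system.

```python
def mean_latency_us(miss_ratio, hit_us, miss_us):
    """Expected latency when a fraction miss_ratio of requests miss the cache."""
    return (1.0 - miss_ratio) * hit_us + miss_ratio * miss_us


def p99_latency_us(miss_ratio, hit_us, miss_us):
    """p99 in the two-point model: it flips to miss latency once misses exceed 1%."""
    return miss_us if miss_ratio > 0.01 else hit_us


if __name__ == "__main__":
    HIT_US, MISS_US = 2.0, 200.0  # assumed hit and miss costs
    for miss_ratio in (0.001, 0.005, 0.015, 0.05):
        mean = mean_latency_us(miss_ratio, HIT_US, MISS_US)
        p99 = p99_latency_us(miss_ratio, HIT_US, MISS_US)
        print(f"miss={miss_ratio:.3f}  mean={mean:6.2f} us  p99={p99:6.1f} us")
```

In this model the mean moves gently as misses grow, while p99 jumps by the full hit-to-miss gap the moment the miss ratio crosses 1%. Real systems add queueing on top, so treat the output as a lower bound on how sharp the transition can be.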

How an eviction-policy prevents latency collapse during pressure

An eviction-policy is the circuit breaker between memory pressure and I/O thrashing. Naive LRU will pollute the cache with one-time sequential scans; good designs separate recency and frequency, maintain short-term history, and resist one-shot scans. Adaptive policies (for example ARC) track both recent and frequent sets and adapt automatically to workload shifts, improving overall hit-rate without manual tuning. (usenix.org)

Key mechanics and implementation notes:

Linux keeps LRU vectors (lruvec) per NUMA node and per memory cgroup, each with active and inactive lists, and adds pages to them in per-CPU batches to reduce lock contention; reclaim happens via kswapd and direct reclaim paths.
Dirty-page handling is orthogonal to pure eviction: evicting a dirty page forces writeback or stalls reclaim, so eviction-policy and writeback throttling must coordinate.
Metadata deserves higher priority: aggressively evicting cached inodes and directory entries sends path lookups back to disk, which makes every affected operation more expensive and amplifies latency.
Scan-resistance: when access patterns exhibit long sequential scans, a good eviction-policy avoids filling the cache with cold pages (ghost lists or history help here).

Operationally, set your eviction strategy goals explicitly: minimize p99 for small reads, bound writeback backlog to avoid stalls, and prioritize low-latency metadata access. Using an adaptive replacement layer or a simple hot/cold demotion can yield large improvements in hit-rate with minimal overhead.
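The scan-resistance point can be illustrated with a small simulation. The sketch below compares plain LRU against a simplified two-list (inactive/active) policy on a workload with a hot set followed by a one-time scan. It is a toy model under assumed sizes (50 hot pages, 100-page cache, 1000-page scan); it is neither ARC nor the kernel’s actual reclaim code, and the class names are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Plain LRU: every miss inserts at the MRU end and may evict the LRU page."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()

    def access(self, key):
        if key in self.pages:
            self.pages.move_to_end(key)
            return True
        self.pages[key] = None
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)
        return False


class TwoListCache:
    """New pages enter an inactive list; only a second access promotes them to active."""

    def __init__(self, capacity):
        self.active_cap = capacity // 2
        self.inactive_cap = capacity - self.active_cap
        self.active = OrderedDict()
        self.inactive = OrderedDict()

    def _add_inactive(self, key):
        self.inactive[key] = None
        if len(self.inactive) > self.inactive_cap:
            self.inactive.popitem(last=False)  # cold pages are evicted from here

    def access(self, key):
        if key in self.active:
            self.active.move_to_end(key)
            return True
        if key in self.inactive:
            del self.inactive[key]
            self.active[key] = None
            if len(self.active) > self.active_cap:
                demoted, _ = self.active.popitem(last=False)
                self._add_inactive(demoted)  # demote instead of dropping outright
            return True
        self._add_inactive(key)
        return False


def post_scan_hit_ratio(cache, hot_pages=50, scan_pages=1000, passes=3):
    """Warm a hot set, run a one-time scan, then measure hits on the hot set."""
    hot = [("hot", i) for i in range(hot_pages)]
    for _ in range(passes):
        for key in hot:
            cache.access(key)
    for i in range(scan_pages):
        cache.access(("scan", i))
    hits = sum(cache.access(key) for _ in range(passes) for key in hot)
    return hits / (passes * hot_pages)


if __name__ == "__main__":
    print("LRU      :", round(post_scan_hit_ratio(LRUCache(100)), 3))
    print("two-list :", round(post_scan_hit_ratio(TwoListCache(100)), 3))
```

With these sizes the scan flushes the whole hot set out of the plain LRU, so the first pass afterwards is all misses. The two-list variant keeps the hot set in the active list because scan pages are never touched twice and never get promoted.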

Important: Eviction decisions are effective only if your writeback subsystem can sustain the resulting write traffic; eviction without controlled writeback simply moves latency to the storage subsystem.

When write-back-cache reduces io-latency and when it doesn't

The label write-back-cache covers two related ideas: (1) the kernel’s delayed-write model (dirty pages collected in the page-cache and flushed asynchronously), and (2) device-level write caches (for example, DRAM on an SSD). At the application level, write-back hides device latency by acknowledging writes before persistence, but that behaviour changes durability semantics: a write is not durable until a subsequent fsync/fdatasync returns, or until the write call itself returns on a file opened with O_SYNC/O_DSYNC. Use fsync/fdatasync to force durability; their semantics are explicit and blocking. (man7.org)
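As a small illustration of that durability boundary, the sketch below times the same write with and without a trailing fsync. The helper name `timed_write` and the 1 MiB payload are assumptions for the example; it uses only standard `os` calls.

```python
import os
import tempfile
import time


def timed_write(path, payload, durable):
    """Write payload to path; if durable, block until fsync returns. Returns seconds."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        start = time.perf_counter()
        view = memoryview(payload)
        while len(view) > 0:
            written = os.write(fd, view)  # returns once the pages are dirtied in RAM
            view = view[written:]
        if durable:
            os.fsync(fd)  # returns only after the kernel reports the data as flushed
        return time.perf_counter() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    payload = b"x" * (1 << 20)  # 1 MiB, an arbitrary size for the demo
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "demo.bin")
        buffered = timed_write(target, payload, durable=False)
        synced = timed_write(target, payload, durable=True)
        print(f"buffered: {buffered * 1e3:.3f} ms  with fsync: {synced * 1e3:.3f} ms")
```

Run it against the filesystem you care about rather than a tmpfs-backed temp directory, where fsync has almost nothing to do. The gap between the two numbers is the latency that write-back hides from the application and that a durable commit has to pay.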

Compare behavior in practical terms:

| Property | Write-back-cache | Write-through |
| --- | --- | --- |
| Application-visible write latency | Low (ack once the page is dirtied) | High (ack on device commit) |
| Durability without `fsync` | Not guaranteed | Guaranteed once the write returns (if device honors flushes) |
| Throughput for small random writes | High (coalescing) | Low (many syncs) |
| Risk on power loss | Depends on device PLP | Low (if device honors flushes) |

When write-back helps:

Your workload tolerates async durability (e.g., caches, logs buffered with periodic commits).
The system aggregates small writes into larger sequential flushes, reducing per-write overhead.

When write-back hurts:

High sustained dirty-backlog leads to writeback storms that saturate the I/O queue and produce long tail latencies.
Frequent synchronous flushes (fsync) interleaved with write-back cause mixed synchronous and asynchronous work that amplifies latency spikes.

Hardware note: SSD on-board caches can accelerate write-back dramatically but require power-loss protection to provide the same durability guarantees as a synchronous write. Always treat device caches as part of the durability model, not a free performance subsidy.

Techniques to scale the `page-cache` under heavy concurrency

Scaling is about removing global hotspots and making the common path lock-light and cache-friendly. For page-cache that means sharding, batching, NUMA-awareness, and leveraging async IO submission paths.

Practical techniques that move real-world meters:

Shard hot namespaces: partition large files or object keyspaces so locks and LRU lists don’t collide. Use directory- or inode-based sharding so each shard has its own working-set. This reduces cross-core contention on per-file page lookup and mapping locks.
Use per-CPU batching: the kernel’s pagevec batching reduces the number of lock acquisitions and atomic operations on LRU lists, and batching frequent small operations in the application likewise cuts syscalls.
Bypass page-cache for large streaming workloads: enable O_DIRECT or direct=1 in benchmarks to avoid competing with small-random traffic that needs low-latency cached access.
Prefer io_uring submission/completion for high concurrency: it avoids thread-per-request traps and reduces kernel-to-user context-switch overhead in I/O-heavy paths.
NUMA placement: allocate and keep hot pages on the CPU/node where the consuming threads run to avoid cross-node latency.
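The sharding idea in the list above can be sketched in a few lines: map each key to one of N backing files so that each shard has its own inode and therefore its own page-cache mapping. The root directory, file naming, and use of CRC32 are assumptions made for illustration only.

```python
import os
import zlib


def shard_path(key, shard_count, root="/var/data/shards"):
    """Map a key to one of shard_count files (root and naming are hypothetical)."""
    shard = zlib.crc32(key.encode("utf-8")) % shard_count
    return os.path.join(root, f"shard-{shard:03d}.dat")


if __name__ == "__main__":
    for key in ("user:1001", "user:1002", "order:77"):
        print(key, "->", shard_path(key, shard_count=8))
```

A stable hash keeps a key on the same shard across restarts, so its pages stay warm in one mapping. Whether 8, 64, or more shards is right depends on where profiling shows contention.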

Example fio pattern to stress page-cache vs direct I/O: test both modes and compare tail latencies. The following runs a high-concurrency random-read test using the page cache (direct=0) and then bypasses it (direct=1). Use the results to compute the miss cost and hit benefit. (fio.readthedocs.io)

# Warm cache (populate)
fio --name=warm --rw=read --bs=1M --size=10G --filename=/mnt/testfile --direct=0 --runtime=60 --time_based

# Test with page-cache (iodepth needs an async ioengine; the default psync engine ignores it)
fio --name=pcache-test --rw=randread --bs=4k --numjobs=64 --iodepth=32 --ioengine=io_uring \
    --filename=/mnt/testfile --direct=0 --runtime=120 --time_based --group_reporting

# Test bypassing page-cache (measure underlying device; --readonly guards the raw device)
fio --name=device-test --rw=randread --bs=4k --numjobs=64 --iodepth=32 --ioengine=io_uring \
    --filename=/dev/nvme0n1 --direct=1 --readonly --runtime=120 --time_based --group_reporting

When concurrency increases, watch for contention on shared data structures (per-file mapping locks, LRU list locks). If you profile and find a hot lock, either reduce sharing via sharding or move latency-critical flows to O_DIRECT.

Quantifying cache effectiveness: metrics and measurement protocols

Good tuning starts with a repeatable measurement plan that isolates hit cost, miss cost, and contention cost. Use the following metrics and tools:

Primary metrics

Hit ratio (cached reads / total reads): absolute and per-file/inode.
Miss service time (time to satisfy a miss, typically µs on NVMe and ms on spinning disks): directly maps to device + queueing latency.
p50/p95/p99/p99.9 I/O latency for both reads and writes.
Dirty bytes / dirty page build-up rate (bytes/s): indicates writeback pressure.
Page reclaim rate and kswapd activity: high rates show memory pressure/thrashing.
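If you collect raw per-request latencies (from an application log or a trace), the percentile metrics above can be computed with a nearest-rank method. The sketch below is one such approach; the `slow_threshold_us` split is only a rough proxy for hit versus miss and is an assumption, not something the kernel reports.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the sample at rank ceil(pct/100 * N) in sorted order."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[min(rank, len(ordered)) - 1]


def summarize(latencies_us, slow_threshold_us):
    """Summarize latencies; requests under the threshold are treated as likely hits."""
    fast = sum(1 for value in latencies_us if value < slow_threshold_us)
    return {
        "fast_ratio": fast / len(latencies_us),
        "p50": percentile(latencies_us, 50),
        "p95": percentile(latencies_us, 95),
        "p99": percentile(latencies_us, 99),
        "p99.9": percentile(latencies_us, 99.9),
    }


if __name__ == "__main__":
    # Synthetic data: 985 fast requests at 2 us and 15 slow ones at 200 us.
    samples = [2.0] * 985 + [200.0] * 15
    print(summarize(samples, slow_threshold_us=50.0))
```

With 1.5% slow requests in the synthetic data, p50 and p95 still report the fast path while p99 and p99.9 report the slow one, which is why medians alone hide cache problems.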

Tools and methods

fio for synthetic workloads and for measuring cache vs device: compare direct=0 and direct=1 runs to measure the page-cache benefit. (fio.readthedocs.io)
vmstat and /proc/vmstat for page-in/page-out, pgfault, pgmajfault.
iostat -x / blktrace to measure device latency and request patterns.
bpftrace / eBPF for low-overhead tracing of kernel events and to build histograms of vfs_read/vfs_write or page-fault handling latencies. Example one-liner that builds a latency histogram for vfs_read (run as root): (ebpf.io)

sudo bpftrace -e 'kprobe:vfs_read { @s[tid] = nsecs; }
                  kretprobe:vfs_read /@s[tid]/ { @lat = hist((nsecs - @s[tid])/1000); delete(@s[tid]); }'
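The /proc/vmstat counters mentioned in the tools list are cumulative, so they only become useful as rates. A minimal sketch for sampling them twice and printing per-second deltas follows; the helper names are hypothetical, and the counter names (pgpgin, pgpgout, pgfault, pgmajfault) are the ones listed above.

```python
import os
import time


def parse_vmstat(text):
    """Parse 'name value' lines, as found in /proc/vmstat, into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def rates(before, after, interval_s,
          names=("pgpgin", "pgpgout", "pgfault", "pgmajfault")):
    """Per-second deltas for the selected counters present in both samples."""
    return {
        name: (after[name] - before[name]) / interval_s
        for name in names
        if name in before and name in after
    }


def read_vmstat(path="/proc/vmstat"):
    with open(path) as handle:
        return parse_vmstat(handle.read())


if __name__ == "__main__" and os.path.exists("/proc/vmstat"):
    first = read_vmstat()
    time.sleep(1.0)
    second = read_vmstat()
    for name, per_second in rates(first, second, 1.0).items():
        print(f"{name}: {per_second:.0f}/s")
```

A rising pgmajfault rate while the workload is steady is a useful hint that mmap-backed pages are being evicted and faulted back in from storage.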

Measurement protocol (repeatable)

Snapshot system knobs: sysctl vm.* (including vm.dirty_*, vm.vfs_cache_pressure) and cat /sys/block/<dev>/queue/read_ahead_kb.
Cold-cache run: on a dedicated test system, flush dirty pages and clear caches (sync; echo 3 > /proc/sys/vm/drop_caches as root), then run fio with direct=1 to measure device baseline.
Warm-cache run: warm the cache and run fio with direct=0 to measure cached performance.
Concurrency sweep: sweep --numjobs and --iodepth to find knee points where contention appears.
Trace at the knee: collect blktrace and bpftrace samples to see whether latency arises in the block layer, writeback, or page fault handlers.

That combination isolates whether latency gains are possible via cache tuning (higher cache hit-rate) or require system-level architecture changes (sharding, NUMA, dedicated I/O nodes).
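The concurrency sweep from the protocol is easy to script. The sketch below only generates fio command lines for a numjobs × iodepth grid and leaves running them to you; the grid values, job names, and output file names are assumptions, and it presumes a fio build that supports the io_uring engine.

```python
import itertools
import shlex


def sweep_commands(filename, numjobs=(1, 4, 16, 64), iodepths=(1, 8, 32),
                   direct=0, runtime_s=60):
    """Yield (name, argv) pairs for a numjobs x iodepth random-read sweep."""
    for jobs, depth in itertools.product(numjobs, iodepths):
        name = f"sweep-j{jobs}-d{depth}"
        argv = [
            "fio", f"--name={name}", "--rw=randread", "--bs=4k",
            f"--numjobs={jobs}", f"--iodepth={depth}", "--ioengine=io_uring",
            f"--filename={filename}", f"--direct={direct}",
            f"--runtime={runtime_s}", "--time_based", "--group_reporting",
            "--output-format=json", f"--output={name}.json",
        ]
        yield name, argv


if __name__ == "__main__":
    # Print the commands instead of executing them, so the plan can be reviewed first.
    for _, argv in sweep_commands("/mnt/testfile"):
        print(shlex.join(argv))
```

Plotting p99 against numjobs for each iodepth usually makes the knee obvious: the point where throughput stops rising and tail latency starts climbing is where tracing should begin.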

Practical cache-management checklist you can run tonight

This checklist gives a safe, repeatable sequence you can run on a staging node to understand and bound cache behavior.

Inventory current state
- sysctl vm.dirty_bytes vm.dirty_background_bytes vm.vfs_cache_pressure vm.dirty_ratio vm.dirty_background_ratio
- cat /sys/block/<dev>/queue/read_ahead_kb
- vmstat 1 (observe si, so, and the CPU st column)

Measure baseline

Device baseline (cold): on a test machine, as root:

 sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'   # careful: do not run on production
 # iodepth only takes effect with an async ioengine; --readonly guards the raw device
 fio --name=device-baseline --rw=randread --bs=4k --size=10G \
     --filename=/dev/nvme0n1 --direct=1 --readonly --ioengine=io_uring --numjobs=16 --iodepth=64 \
     --runtime=60 --time_based --group_reporting --output=device-baseline.txt

Cached baseline (warm):

 fio --name=warmup --rw=read --bs=1M --size=10G --filename=/mnt/testfile --direct=0 --runtime=60 --time_based
 fio --name=cache-baseline --rw=randread --bs=4k --filename=/mnt/testfile --direct=0 --ioengine=io_uring --numjobs=16 --iodepth=64 --runtime=60 --time_based --group_reporting --output=cache-baseline.txt

Identify miss cost and hit benefit
- Compare the p99/p50 between device-baseline.txt and cache-baseline.txt. The difference approximates miss cost and shows how much latency the page-cache buys you.
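If you re-run both baselines with --output-format=json, the comparison can be scripted. The sketch below assumes the fio 3.x JSON layout, where read completion latency percentiles appear under jobs[0].read.clat_ns.percentile with keys such as "99.000000"; verify that against your fio version. The sample numbers are made up for illustration.

```python
import json


def read_percentiles_us(report, keys=("50.000000", "99.000000")):
    """Pull read completion-latency percentiles (in microseconds) from a fio JSON report."""
    job = report["jobs"][0]  # with --group_reporting the group is aggregated into one entry
    table = job["read"]["clat_ns"]["percentile"]
    return {key: table[key] / 1000.0 for key in keys}


def miss_cost_us(device, cached, key="99.000000"):
    """Approximate per-miss penalty as device latency minus cached latency."""
    return device[key] - cached[key]


if __name__ == "__main__":
    # Hypothetical reports; in practice use json.load(open("device-baseline.json")).
    device_report = json.loads(
        '{"jobs": [{"read": {"clat_ns": {"percentile": '
        '{"50.000000": 90000, "99.000000": 250000}}}}]}'
    )
    cached_report = json.loads(
        '{"jobs": [{"read": {"clat_ns": {"percentile": '
        '{"50.000000": 2000, "99.000000": 15000}}}}]}'
    )
    device = read_percentiles_us(device_report)
    cached = read_percentiles_us(cached_report)
    print("p99 miss cost (us):", miss_cost_us(device, cached))
```

Keeping the two reports side by side in version control makes it easy to see whether a later tuning change moved the cached numbers, the device numbers, or both.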
Limit dirty backlog to avoid writeback storms
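- Use vm.dirty_bytes / vm.dirty_background_bytes to cap the absolute dirty backlog rather than ratios on large-memory machines. Each *_bytes setting is the counterpart of its *_ratio setting: writing one makes the other read as 0. Example (as a starting experiment only):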
```
 sudo sysctl -w vm.dirty_background_bytes=67108864   # 64MB
 sudo sysctl -w vm.dirty_bytes=268435456            # 256MB
```

Observe vmstat and iostat while driving load; tune the values to keep background writeback steady and prevent large, sudden flushes.
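A quick back-of-the-envelope calculation shows why absolute caps matter on large-memory machines. The sketch below assumes a 20% dirty ratio, 256 GiB of dirtyable memory, and a device that sustains 1 GiB/s of writes; all three figures are illustrative assumptions, and the kernel computes the ratio against available memory rather than total RAM.

```python
GIB = 1 << 30
MIB = 1 << 20


def dirty_limit_bytes(dirtyable_bytes, ratio_percent):
    """Backlog allowed by a percentage-based limit."""
    return dirtyable_bytes * ratio_percent // 100


def drain_seconds(backlog_bytes, device_write_bytes_per_s):
    """Time to flush a full backlog at a sustained write rate."""
    return backlog_bytes / device_write_bytes_per_s


if __name__ == "__main__":
    by_ratio = dirty_limit_bytes(256 * GIB, 20)  # assumed 20% of 256 GiB
    by_bytes = 256 * MIB                         # the vm.dirty_bytes example value
    for label, backlog in (("ratio", by_ratio), ("bytes", by_bytes)):
        seconds = drain_seconds(backlog, 1 * GIB)  # assumed 1 GiB/s sustained writes
        print(f"{label}: backlog={backlog / GIB:.2f} GiB  drain={seconds:.2f} s")
```

Under these assumptions the ratio-based limit allows roughly 51 GiB of dirty data, which takes about 51 seconds to drain, while the 256 MiB cap drains in a quarter of a second. That difference is the size of the stall an unlucky fsync can end up waiting behind.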

Tune readahead for your dominant access pattern

Query and set:

 cat /sys/block/<dev>/queue/read_ahead_kb
 sudo bash -c 'echo 128 > /sys/block/<dev>/queue/read_ahead_kb'  # 128 KiB example

Re-run warm-cache fio tests to quantify effect on sequential and mixed reads.

Profile and locate contention
- Use perf/flamegraphs and bpftrace to locate hot locks or functions (per-file mapping locks, lru_add, page-fault handlers).
- If kernel-level locks dominate, explore sharding or moving high-throughput flows to O_DIRECT.
Iterate with realistic load
- Re-run the baseline measurements (device and cached) under realistic concurrency (numjobs and iodepth) and verify that p99 behavior has improved or is at least bounded.
- Keep a changelog of each sysctl and read_ahead change so you can revert.

Note: Always run these steps on staging before applying to production; changing vm.dirty_* alters how much unsynced data is at risk in a crash and how writeback is throttled, and dropping caches causes a burst of cold-cache misses.

Sources:
Page Cache — The Linux Kernel documentation - Kernel-level explanation of the page-cache design, folios, and how regular reads/writes and mmaps interact with the cache. (docs.kernel.org)

fsync(2) — Linux manual page (man7) - POSIX/Linux semantics for fsync/fdatasync, blocking behaviour, and durability considerations. (man7.org)

ARC: A Self-Tuning, Low Overhead Replacement Cache (FAST 2003) - The original ARC description and properties (recency+frequency, scan-resistance). (usenix.org)

fio — Flexible I/O Tester documentation - Recommended benchmarking tool for measuring page-cache vs device performance and for concurrency sweeps. (fio.readthedocs.io)

eBPF — Introduction & docs (ebpf.io) - eBPF/bpftrace resources for building low-overhead kernel probes and histograms of VFS and block-layer latencies. (ebpf.io)
