- Designing representative workloads for meaningful benchmarks
- Building a reliable test harness: fio, iostat, and custom drivers
- What matters: p99 latency, throughput, IOPS, and variability
- Systematic bottleneck analysis and step-by-step storage tuning
- Practical benchmarking: repeatable suites, CI automation, and reporting
- Sources
Benchmarking storage engines is not an academic exercise — it’s the single most reliable lever you have to surface the gaps between your SLOs and reality. Measure the right workload, track the tails, and you stop chasing illusions of performance that evaporate under production load.
The problem you actually have is rarely "disk is slow." Symptoms look like: high aggregate throughput in microbenchmarks but frequent production slowdowns at the p99; unpredictable latency spikes during compactions; or test harnesses that show great IOPS numbers while end users complain about occasional 100–500ms requests. Those symptoms point to a combination of mismatched workloads, hidden queueing effects, and compaction/GC/network side effects — the exact friction a repeatable, telemetry-driven benchmarking approach is built to uncover.
## Designing representative workloads for meaningful benchmarks
A benchmark that doesn't model production is a lie you have to pay for later. The objective here: convert production telemetry into a small, repeatable set of synthetic workloads that exercise the same resource profile (reads/writes, key/value sizes, skew, concurrency, and temporal bursts).
- Capture the signal you actually care about:
  - Operation mix (read/write/scan percentages), per endpoint.
  - Key and value size distributions (histograms, not single averages).
  - Access skew (Zipfian parameters), hot prefixes, and fan-out patterns.
  - Concurrency per client and aggregate concurrency across clients/time windows.
  - Failure or GC events that correlate with tail spikes.
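To make the capture step concrete, here is a minimal sketch that reduces a request trace to two of the workload parameters you need most often: the read proportion and the value-size median. The trace format (`op,key,value_bytes`) is invented for illustration, not taken from any particular telemetry system.

```python
import csv
import io

# Hypothetical trace format (invented for illustration): op,key,value_bytes per request.
TRACE = io.StringIO("""\
op,key,value_bytes
read,user:1001,0
read,user:1001,0
write,user:2002,512
read,user:3003,0
write,user:1001,480
read,user:1001,0
read,user:2002,0
read,user:1001,0
read,user:4004,0
read,user:1001,0
""")

ops, value_sizes = [], []
for row in csv.DictReader(TRACE):
    ops.append(row["op"])
    if row["op"] == "write":
        value_sizes.append(int(row["value_bytes"]))

# Two of the parameters the synthetic workload needs: op mix and value-size median.
read_ratio = ops.count("read") / len(ops)
median_value = sorted(value_sizes)[len(value_sizes) // 2]
print(f"readproportion={read_ratio:.2f}")
print(f"median_value_bytes={median_value}")
```

The same loop extends naturally to per-key frequency counts for estimating skew.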
- Tools and mapping:
  - Use trace-based generators (YCSB or its ports) for key/value and op-mix shaping. YCSB exposes `recordcount`, `operationcount`, and key distribution generators (Zipfian/Latest) for accurate reproduction.
  - For RocksDB-specific flows use `db_bench` to reproduce `fill*`, `readwhilewriting`, and compaction-heavy runs; `db_bench` accepts many RocksDB options so you can reproduce memtable/compaction/level behavior.
- Practical translation (example):
  - Production telemetry: 90% point-reads, 10% writes, key size 16B, value median 512B, skew ≈ Zipf(0.9), average client concurrency 24 with spikes to 240.
  - Synthetic mapping:
    - YCSB: `workloada` with `readproportion=0.9`, `recordcount` scaled down, and `requestdistribution=zipfian` to match the 0.9 skew.
    - RocksDB: `db_bench --benchmarks=fillrandom,readrandom,readwhilewriting --use_existing_db` with `--threads=24`, plus a short phase that ramps to `--threads=240` for spike tests.
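When neither YCSB nor `db_bench` fits your client, the skew itself is straightforward to reproduce. The sketch below implements a bounded Zipfian sampler over a fixed keyspace via the inverse-CDF method; the keyspace size and the 0.9 skew are the illustrative figures from the mapping above, not universal defaults.

```python
import bisect
import itertools
import random

def zipfian_sampler(n_keys, skew, seed=42):
    """Yield key ranks 0..n_keys-1 with P(rank i) proportional to 1/(i+1)**skew."""
    weights = [1.0 / (i + 1) ** skew for i in range(n_keys)]
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]
    rng = random.Random(seed)  # fixed seed keeps benchmark runs reproducible
    while True:
        # Inverse-CDF sampling: find the first rank whose cumulative weight
        # covers a uniform draw over [0, total).
        yield bisect.bisect_left(cumulative, rng.random() * total)

# Draw 10k samples at the production-observed skew of 0.9 over a 1k keyspace.
sampler = zipfian_sampler(n_keys=1000, skew=0.9)
counts = {}
for _ in range(10_000):
    rank = next(sampler)
    counts[rank] = counts.get(rank, 0) + 1

# Low ranks should dominate: rank 0 is the hottest key by a wide margin.
hottest = max(counts, key=counts.get)
print("hottest rank:", hottest)
```

Map the sampled rank to an actual key (e.g., a hashed prefix) to also reproduce hot-prefix patterns.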
- Why warm-up and steady-state matter:
  - LSM-based engines exhibit warm-up and compaction transients (write amplification, level growth) that mask steady-state. Design a run with a warm-up population and a long measurement window rather than a short cold run.
## Building a reliable test harness: fio, iostat, and custom drivers
A test harness is orchestration + telemetry. The harness must reliably create the workload and collect system+device+engine metrics in sync.
- Minimum components:
  - Workload generator(s): `fio` for block-level tests, `db_bench` for RocksDB microbenchmarks, and YCSB (or go-ycsb) for application-level flows.
  - System collectors: `iostat`/`sar` for device-level metrics, `vmstat` and `top`/`htop` for CPU/memory, and `perf`/eBPF for hotspots. Use `iostat -x -m 1` to capture extended device stats per second.
  - Engine telemetry: RocksDB `--statistics`, `--histogram`, and `--stats_per_interval` flags, plus log capture.
  - Storage tracing: `blktrace`/`bpftrace` for deep I/O sequencing when necessary.
fio best-practice invocation (example):

```bash
fio --name=randrw-4k-q64 \
    --ioengine=libaio --direct=1 \
    --rw=randrw --rwmixread=70 \
    --bs=4k --numjobs=4 --iodepth=64 \
    --time_based --runtime=120 --group_reporting \
    --output=fio.json --output-format=json+
```

This emits a json+ payload including latency histograms suitable for automated parsing. Use `rate_iops` with `rate_process=poisson` to model bursty (Poisson) submission, or `latency_target` to probe for a sustainable steady state.
- iostat workflow:
  - Run `iostat -x -m 1 > iostat.csv` concurrently with workload runs to collect `%util`, `avgqu-sz`, `await`, and `svctm` (note: `svctm` is deprecated in some versions). Use these to detect device saturation (`%util` ≈ 100) and rising `await`.
- Parsing and aggregation:
  - Convert fio json+ with `fio_jsonplus_clat2csv` or a small Python script (or `jq`) to extract `clat` percentiles and IOPS per interval. `fiologparser_hist.py` is shipped with fio and converts clat histograms to CSV.
  - Correlate timestamped `fio` percentiles with `iostat` snapshots to map p99 spikes to device-level events.
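As a sketch of the parsing step, the snippet below pulls the read-side p99 out of a fio JSON payload using only the standard library. The mock document keeps just the `jobs[].read.clat_ns.percentile` path described in the fio documentation; treat the exact field set as an assumption to verify against your fio version.

```python
import json

# Minimal mock of fio --output-format=json+ output. Real payloads carry many
# more fields; only the percentile path used below is modeled here.
mock = json.loads("""
{
  "jobs": [
    {
      "jobname": "randrw-4k-q64",
      "read": {
        "iops": 48211.5,
        "clat_ns": {
          "percentile": {"50.000000": 812000, "99.000000": 4915200}
        }
      }
    }
  ]
}
""")

for job in mock["jobs"]:
    # fio reports completion latency in nanoseconds; convert to milliseconds.
    p99_ms = job["read"]["clat_ns"]["percentile"]["99.000000"] / 1e6
    print(f'{job["jobname"]}: read p99 = {p99_ms:.2f} ms, iops = {job["read"]["iops"]:.0f}')
```

Point `json.load` at the real `fio.json` and the same loop produces per-job p99 rows ready for CSV export.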
**Important:** Always include host metadata (CPU model, kernel version, NVMe model, filesystem, mount options) with each run so you can reason about environmental differences.
## What matters: p99 latency, throughput, IOPS, and variability
Metrics are signals, not goals. Choose the right metric for the question you’re asking.
| Metric | What it measures | Why it matters | How to measure |
|---|---|---|---|
| p99 latency | Time below which 99% of requests complete | Captures tail behavior that damages user experience and compounds across fan-out; tail metrics map directly to SLOs | fio json+ `clat` percentiles; application traces |
| Throughput (MB/s) | Aggregate data rate | Useful for bulk-transfer capacity questions and throughput-bound workloads | fio `bw`, OS network/storage counters |
| IOPS | Number of I/O ops per second | Good for small-random workloads; interacts with queue depth and latency via Little's Law | fio `iops` fields; device counters |
| Variability / histograms | Distribution shape (stdev, IQR, histogram bins) | Tells whether spikes are rare outliers or frequent and deterministic | fio histograms, application tracing |
| Device `%util` / `avgqu-sz` | How busy the device is, and queue length | High `%util` plus rising `await` indicates device saturation | `iostat -x` |
Why p99 specifically: p99 exposes the long tail that usually drives end-user frustration and SLO misses. In distributed flows the slowest leg dominates end-to-end latency; reducing medians rarely improves real UX when tails remain high.
Measuring variability: Prefer histograms and percentiles over averages. Export clat histograms at short intervals to detect transient spikes (e.g., periodic compaction bursts).
Concurrency math (use this frequently): Little’s Law relates concurrency, throughput, and latency: L = λ × W (where L = concurrency/queue depth, λ = throughput [IOPS], W = avg latency in seconds). Use this to pick queue depths and reason about expected IOPS vs latency.
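A two-line helper makes the law easy to apply in both directions; the numbers in the usage lines are illustrative.

```python
def required_queue_depth(target_iops, avg_latency_s):
    """Little's Law, L = lambda * W: concurrency needed to sustain target_iops
    at avg_latency_s average latency."""
    return target_iops * avg_latency_s

def expected_iops(queue_depth, avg_latency_s):
    """Rearranged, lambda = L / W: the throughput ceiling implied by a given
    concurrency and average latency."""
    return queue_depth / avg_latency_s

# 20k IOPS at 0.5 ms average latency needs roughly QD 10 ...
print(required_queue_depth(20_000, 0.0005))
# ... while a client stuck at QD 4 with 1 ms latency tops out near 4k IOPS.
print(expected_iops(4, 0.001))
```

Use the first form to size `iodepth` for a target, and the second to sanity-check whether a measured IOPS number is even reachable at your application's concurrency.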
## Systematic bottleneck analysis and step-by-step storage tuning
Triage first, tune second. Follow a methodical loop: measure → hypothesize → modify one variable → re-measure.
- Baseline and scope:
  - Produce a reproducible baseline run: warm the DB, run a 10–30 minute measurement window, and capture `fio`/`db_bench` outputs plus `iostat`/`vmstat`/RocksDB stats. Store outputs and host metadata.
- Isolate raw device capability:
  - Run `fio` against the raw block device with `direct=1`, single-threaded, then increase `numjobs`/`iodepth` to find the knee. Use `--output-format=json+` and `fio_jsonplus_clat2csv` to capture p99 at each point.
  - Look for `%util` hitting 100% or `await` suddenly increasing — that's a device bottleneck. `iostat -x -m 1` gives the per-second picture.
Apply Little’s Law to sanity-check contention:
queue_depth ≈ IOPS * avg_latency_seconds
# e.g., desired 50k IOPS at 1ms avg -> QD = 50,000 * 0.001 = 50
If the device needs QD 50 to reach target IOPS, but host or application can only drive QD 4, you will not reach throughput without parallelism.
- Narrow the scope: CPU vs disk vs RocksDB internals:
  - CPU: high `sys` or `user` in `top`, or compaction threads pegged in `perf top`, points to CPU-bound compaction.
  - Disk: `%util` at 90–100% with rising `await` points to I/O-bound.
  - RocksDB: `--stats_per_interval` shows compaction write amplification and stalls; `level0_file_num_compaction_trigger`, `max_background_compactions`, and `write_buffer_size` are the first levers.
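The triage above can be encoded as a first-pass heuristic. Everything in this sketch is an assumption: the thresholds, argument names, and precedence order are illustrative starting points to calibrate against your own baselines, not published rules.

```python
def classify_bottleneck(util_pct, await_ms, baseline_await_ms,
                        cpu_busy_pct, write_stall_pct):
    """Map coarse metrics to a first hypothesis; thresholds are illustrative.
    Check engine-level stalls first, then device saturation, then CPU."""
    if write_stall_pct > 5:
        return "rocksdb: write stalls, inspect L0 triggers and compaction backlog"
    if util_pct >= 90 and await_ms > 2 * baseline_await_ms:
        return "disk: device saturated, await rising while %util is pegged"
    if cpu_busy_pct >= 85:
        return "cpu: likely compaction or checksum bound, profile with perf top"
    return "no single obvious bottleneck, re-check queue depth against Little's Law"

# %util 97, await 8.4 ms against a 1.1 ms baseline, CPU 40% busy, no stalls:
print(classify_bottleneck(97, 8.4, 1.1, 40, 0))
```

Keeping the classification in code makes the triage repeatable and lets you log the hypothesis alongside each run's artifacts.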
- RocksDB tuning sequence (order matters):
  1. Reproduce with `--disable_wal` on disposable DBs to see the WAL cost baseline (does not preserve durability; use only for microbenchmarks).
  2. Tune `write_buffer_size` and `max_write_buffer_number` to increase memtable flush size if CPU is underutilized and compactions can be amortized.
  3. Increase `max_background_compactions` to process L0→L1 more quickly, but watch CPU and I/O contention. More compaction threads increase throughput but can raise p99 if they steal CPU and I/O from foreground operations.
  4. Adjust `level0_file_num_compaction_trigger`, `level0_slowdown_writes_trigger`, and `level0_stop_writes_trigger` to control write stalls.
  5. Consider `use_plain_table`, `mmap_reads`, or `pin_l0_filter_and_index_blocks_in_cache` when read latency matters and working sets are cache-friendly.
- Device-level knobs:
  - For NVMe, ensure correct driver parameters and avoid unnecessary scheduler work (`mq-deadline`, or `none` on modern blk-mq stacks). Confirm mount options (e.g., `noatime`) and check whether the filesystem is appropriate. Test raw block device vs filesystem-bound runs to understand the difference. Be conservative: some filesystem options affect durability semantics.
- Validate trade-offs:
  - Run the workload with production-like write amplification enabled. Tuning that improves the median but worsens p99 is a red flag. Repeat the baseline after each change and compare p99 and throughput.
Contrarian insight (hard-won): chasing higher aggregate IOPS without watching the p99 usually backfires. Increasing background compaction threads or queue depths often raises throughput but also widens the latency distribution unless CPU, I/O and memory headroom are verified first.
## Practical benchmarking: repeatable suites, CI automation, and reporting
Your benchmarks need to be code: runnable scripts, versioned configs, and deterministic artifacts.
- Test-suite structure:
  - `01-sanity`: raw-device fio single-threaded, checks device health.
  - `02-db-warmup`: db_bench populate with deterministic keyset.
  - `03-read-heavy`: workload matching production read ratio.
  - `04-write-heavy`: workload to exercise the compaction path.
  - `05-spike-tests`: burst concurrency patterns to exercise tail behavior.
- Example benchmark runner (bash snippet):

```bash
#!/usr/bin/env bash
set -euo pipefail
OUTDIR=results/$(date +%Y%m%d-%H%M%S)
mkdir -p "$OUTDIR"

# collect host metadata
lscpu > "$OUTDIR"/lscpu.txt
nvme list > "$OUTDIR"/nvme.txt || lsblk > "$OUTDIR"/lsblk.txt

# start iostat in the background BEFORE the workload so its samples cover the run
iostat -x -m 1 > "$OUTDIR"/iostat.log &
IOSTAT_PID=$!

# run fio job with json+ output
fio --name=test --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
    --rw=randread --bs=4k --numjobs=8 --iodepth=64 --time_based --runtime=120 \
    --output="$OUTDIR"/fio-test.json --output-format=json+

# stop the collector once the workload finishes
kill "$IOSTAT_PID"
```
- CI integration (GitHub Actions example):

```yaml
name: storage-bench
on: [workflow_dispatch]
jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install fio
        run: sudo apt-get update && sudo apt-get install -y fio
      - name: Run benchmarks
        run: ./bench/run_all.sh
      - name: Upload artifacts
        uses: actions/upload-artifact@v4
        with:
          name: bench-results
          path: results/**
```
Note: CI runners are ephemeral and have variable hardware. Use CI for regression detection (compare new vs baseline runs) and store baseline artifacts on durable storage, but perform final approval on dedicated hardware labs.
- Reporting and comparison:
  - Store json+ outputs and host metadata. Use `fiologparser_hist.py` or the included `fio_jsonplus_clat2csv` to convert `clat` histograms to CSV for plotting.
  - Compute deltas on key signals (p50, p95, p99, throughput) and report percent change and absolute change.
  - Automate a simple regression check: flag if p99 increases beyond X% or the absolute p99 rises above the SLO.
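Such a regression gate fits in a few lines. A hedged sketch: the 10% limit and 5 ms SLO below are placeholder thresholds, not values from any real pipeline.

```python
def p99_regression(baseline_ns, candidate_ns,
                   max_pct_increase=10.0, slo_ns=5_000_000):
    """Return a list of failure messages; an empty list means the run passes.
    Thresholds are illustrative placeholders; tune them per service SLO."""
    failures = []
    pct = (candidate_ns - baseline_ns) / baseline_ns * 100
    if pct > max_pct_increase:
        failures.append(f"p99 regressed {pct:.1f}% (limit {max_pct_increase:.0f}%)")
    if candidate_ns > slo_ns:
        failures.append(f"p99 {candidate_ns / 1e6:.2f} ms exceeds SLO {slo_ns / 1e6:.2f} ms")
    return failures

# A 4.1 ms -> 4.8 ms p99 move is a ~17% regression: flagged even though the SLO holds.
print(p99_regression(4_100_000, 4_800_000))
```

In CI, feed it the baseline artifact's p99 and the fresh run's p99, and fail the job when the returned list is non-empty.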
- Repeatability checklist:
  - Record hardware + kernel + fs + driver versions.
  - Use the same job files and seeds for synthetic generators.
  - Warm to steady state before measurement.
  - Run each test ≥3 times and use the median run for reporting.
  - Store raw artifacts (fio json+, iostat, RocksDB stats).
## Closing statement
Good benchmarking is a discipline: define representative workloads from production traces, build a harness that captures both device and engine signals, make percentile and histogram data your primary lenses, and change one variable at a time while automating repeatable runs. Measure to learn, not to validate hope.
## Sources
- RocksDB — Benchmarking tools (GitHub Wiki): documentation and examples for db_bench, benchmark options, and RocksDB-specific benchmarking patterns used in the article.
- RocksDB* Tuning Guide on Intel® Xeon® Processor Platforms: practical system-level and RocksDB parameter tuning notes, and explanation of LSM behavior and compaction trade-offs.
- fio documentation (Read the Docs): fio job file options, json+ output, percentile settings, and latency profiling examples referenced for fio workflows.
- iostat man page: definitions and examples for iostat fields such as %util, await, and extended reporting flags used for device telemetry.
- What Is P99 Latency? (Aerospike blog): rationale for why p99/tail metrics matter and how tail amplification affects distributed systems.
- Little's law (Wikipedia): queueing relationship used to relate IOPS, latency, and queue depth for capacity reasoning.
- YCSB — Yahoo! Cloud Serving Benchmark (GitHub): workload generator for application-level CRUD patterns and distributions; used for mapping production mixes.
- fio latency profile examples (fio docs): examples such as Poisson request submission and latency profiling used to model bursts and steady state.
- fio tools: fio_jsonplus_clat2csv: utility and pattern for converting fio json+ latency dumps into CSV for plotting and CI analysis.
- Azure: queue depth and IOPS relationship (Azure docs): practical guidance and formula relating queue depth, IOPS, and latency for storage volumes.