beefed.ai

Posted on • Originally published at beefed.ai
Benchmarking & Performance Tuning for Storage Engines

  • Designing representative workloads for meaningful benchmarks
  • Building a reliable test harness: fio, iostat, and custom drivers
  • What matters: p99 latency, throughput, IOPS, and variability
  • Systematic bottleneck analysis and step-by-step storage tuning
  • Practical benchmarking: repeatable suites, CI automation, and reporting
  • Sources

Benchmarking storage engines is not an academic exercise — it’s the single most reliable lever you have to surface the gaps between your SLOs and reality. Measure the right workload, track the tails, and you stop chasing illusions of performance that evaporate under production load.

The problem you actually have is rarely "disk is slow." Symptoms look like: high aggregate throughput in microbenchmarks but frequent production slowdowns at the p99; unpredictable latency spikes during compactions; or test harnesses that show great IOPS numbers while end users complain about occasional 100–500 ms requests. Those symptoms point to a combination of mismatched workloads, hidden queueing effects, and compaction, GC, and network side effects — exactly the friction a repeatable, telemetry-driven benchmarking approach is built to uncover.

Designing representative workloads for meaningful benchmarks

A benchmark that doesn't model production is a lie you have to pay for later. The objective here: convert production telemetry into a small, repeatable set of synthetic workloads that exercise the same resource profile (reads/writes, key/value sizes, skew, concurrency, and temporal bursts).

  • Capture the signal you actually care about:

    • Operation mix (read/write/scan percentages), per-endpoint.
    • Key and value size distributions (histograms, not single averages).
    • Access skew (Zipfian parameters), hot prefixes, and fan-out patterns.
    • Concurrency per client and aggregate concurrency across clients/time windows.
    • Failure or GC events that correlate with tail spikes.
  • Tools and mapping:

    • Use trace-based generators (YCSB or its ports) for key/value and op-mix shaping. YCSB exposes recordcount, operationcount, and key distribution generators (Zipfian/Latest) for accurate reproduction.
    • For RocksDB-specific flows use db_bench to reproduce fill*, readwhilewriting, and compaction-heavy runs; db_bench accepts many RocksDB options so you can reproduce memtable/compaction/level behavior.
  • Practical translation (example):

    • Production telemetry: 90% point-reads, 10% writes, key size 16B, value median 512B, skew ≈ Zipf(0.9), average client concurrency 24 with spikes to 240.
    • Synthetic mapping:

      • YCSB workload: workloada with readproportion=0.9, recordcount scaled down, requestdistribution=zipfian with skew 0.9.
      • RocksDB: db_bench --benchmarks=fillrandom,readrandom,readwhilewriting --use_existing_db with --threads=24 and a short phase that ramps to --threads=240 for spike tests.
  • Why warm-up and steady-state matter:

    • LSM-based engines exhibit warm-up and compaction transients (write amplification, level growth) that mask steady-state. Design a run with a warm-up population and a long measurement window rather than a short cold run.
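Before wiring production numbers into YCSB or db_bench, it helps to sanity-check the skew itself. Below is a minimal sketch of a bounded Zipf sampler in Python — the key format, key count, and skew value are illustrative, not taken from any particular engine or trace:

```python
import random

# Bounded-Zipf sampler (skew s over n keys), used to check that a
# synthetic keyset reproduces the skew measured in production.
def zipf_weights(n, s):
    weights = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_keys(n, s, count, seed=42):
    random.seed(seed)                      # deterministic runs for repeatability
    keys = [f"user{i}" for i in range(n)]  # hypothetical key format
    return random.choices(keys, weights=zipf_weights(n, s), k=count)

samples = sample_keys(n=1000, s=0.9, count=100_000)
top1 = samples.count("user0") / len(samples)
print(f"hottest key share: {top1:.2%}")  # roughly 9-10% for n=1000, s=0.9
```

Comparing the hottest-key share against what your telemetry reports is a quick way to catch a skew parameter that was transcribed wrong before burning hours on full benchmark runs.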

Building a reliable test harness: fio, iostat, and custom drivers

A test harness is orchestration + telemetry. The harness must reliably create the workload and collect system+device+engine metrics in sync.

  • Minimum components:

    1. Workload generator(s): fio for block-level tests, db_bench for RocksDB microbenchmarks, and YCSB (or go-ycsb) for application-level flows.
    2. System collectors: iostat/sar for device-level metrics, vmstat and top/htop for CPU/memory, and perf/eBPF for hotspots. Use iostat -x -m 1 to capture extended device stats per second.
    3. Engine telemetry: RocksDB --statistics, --histogram and --stats_per_interval flags, plus log capture.
    4. Storage tracing: blktrace/bpftrace for deep I/O sequencing when necessary.
  • fio best-practice invocation (example):

fio --name=randrw-4k-q64 \
    --ioengine=libaio --direct=1 \
    --rw=randrw --rwmixread=70 \
    --bs=4k --numjobs=4 --iodepth=64 \
    --time_based --runtime=120 --group_reporting \
    --output=fio.json --output-format=json+

This emits a json+ payload including latency histograms suitable for automated parsing. Use rate_iops together with rate_process=poisson to model bursty (Poisson) submission, and latency_target/latency_window to search for a sustainable steady state.

  • iostat workflow:

    • Run iostat -x -m 1 > iostat.log concurrently with workload runs to collect %util, aqu-sz (avgqu-sz in older sysstat versions), and r_await/w_await (note: svctm is deprecated and removed in newer versions). Use these to detect device saturation (%util ≈ 100) and rising await.
  • Parsing and aggregation:

    • Convert fio json+ with fio_jsonplus_clat2csv or a small Python script (or jq) to extract clat percentiles and IOPS per interval. fiologparser_hist.py is shipped with fio and converts clat histograms to CSV.
    • Correlate timestamped fio percentiles with iostat snapshots to map p99 spikes to device-level events.
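The conversion step can also be done directly in Python. A hedged sketch of pulling p99 completion latency and IOPS out of a fio json+ result — field names follow fio's JSON layout, and error handling is omitted:

```python
import json

def summarize(path):
    """Extract per-job IOPS and p99 completion latency from fio JSON output."""
    with open(path) as f:
        data = json.load(f)
    rows = []
    for job in data["jobs"]:
        for op in ("read", "write"):
            stats = job[op]
            # fio reports completion-latency percentiles in nanoseconds,
            # keyed as strings like "99.000000"
            pct = stats.get("clat_ns", {}).get("percentile", {})
            p99_ns = pct.get("99.000000")
            rows.append({
                "job": job["jobname"],
                "op": op,
                "iops": stats["iops"],
                "p99_ms": p99_ns / 1e6 if p99_ns is not None else None,
            })
    return rows
```

Feeding each interval's summary into a time-series alongside the iostat log is what makes the p99-spike-to-device-event correlation in the next bullet practical.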

Important: Always include host metadata (CPU model, kernel version, NVMe model, filesystem, mount options) with each run so you can reason about environmental differences.

What matters: p99 latency, throughput, IOPS, and variability

Metrics are signals, not goals. Choose the right metric for the question you’re asking.

| Metric | What it measures | Why it matters | How to measure |
| --- | --- | --- | --- |
| p99 latency | Time below which 99% of requests complete | Captures tail behavior that damages user experience and compounds across fan-out; tail metrics map directly to SLOs | fio json+ clat percentiles; application traces |
| Throughput (MB/s) | Aggregate data rate | Useful for bulk-transfer capacity questions and throughput-bound workloads | fio bw fields; OS network/storage counters |
| IOPS | I/O operations per second | Good for small-random workloads; interacts with queue depth and latency via Little's Law | fio iops fields; device counters |
| Variability / histograms | Distribution shape (stdev, IQR, histogram bins) | Tells whether spikes are rare outliers or frequent and deterministic | fio histograms; application tracing |
| Device %util / aqu-sz | How busy the device is and how long its queue runs | High %util plus rising await indicates device saturation | iostat -x |
  • Why p99 specifically: p99 exposes the long tail that usually drives end-user frustration and SLO misses. In distributed flows the slowest leg dominates end-to-end latency; reducing medians rarely improves real UX when tails remain high.

  • Measuring variability: Prefer histograms and percentiles over averages. Export clat histograms at short intervals to detect transient spikes (e.g., periodic compaction bursts).

  • Concurrency math (use this frequently): Little’s Law relates concurrency, throughput, and latency: L = λ × W (where L = concurrency/queue depth, λ = throughput [IOPS], W = avg latency in seconds). Use this to pick queue depths and reason about expected IOPS vs latency.
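As a worked example of that relationship — the function names are mine, the arithmetic is just L = λ × W rearranged both ways:

```python
# Little's Law: L = lambda * W.
def required_queue_depth(target_iops, avg_latency_s):
    # L = lambda * W: concurrency needed to sustain target_iops
    return target_iops * avg_latency_s

def achievable_iops(queue_depth, avg_latency_s):
    # lambda = L / W: best-case IOPS at a fixed queue depth
    return queue_depth / avg_latency_s

print(required_queue_depth(50_000, 0.001))  # 50.0 -> need QD ~50 for 50k IOPS at 1 ms
print(achievable_iops(4, 0.001))            # 4000.0 -> QD 4 caps you at 4k IOPS
```

The second function is the one worth running first: if the concurrency your application can actually drive caps IOPS below target, no device-level tuning will close the gap.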

Systematic bottleneck analysis and step-by-step storage tuning

Triage first, tune second. Follow a methodical loop: measure → hypothesize → modify one variable → re-measure.

  1. Baseline and scope:

    • Produce a reproducible baseline run: warm the DB, run a 10–30 minute measurement window, and capture fio/db_bench outputs plus iostat/vmstat/RocksDB stats. Store outputs and host metadata.
  2. Isolate raw device capability:

    • Run fio against the raw block device with direct=1, single-threaded and then increase numjobs/iodepth to find the knee. Use --output-format=json+ and fio_jsonplus_clat2csv to capture p99 at each point.
    • Look for %util hitting 100% or await suddenly increasing — that’s a device bottleneck. iostat -x -m 1 gives the per-second picture.
  3. Apply Little’s Law to sanity-check contention:

queue_depth ≈ IOPS * avg_latency_seconds
# e.g., desired 50k IOPS at 1ms avg -> QD = 50,000 * 0.001 = 50

If the device needs QD 50 to reach target IOPS, but host or application can only drive QD 4, you will not reach throughput without parallelism.
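The iodepth sweep from step 2 can be reduced to a knee-finding pass. A hedged sketch — the 10% threshold and the sample numbers are illustrative, not measured:

```python
def find_knee(samples, min_gain=0.10):
    """samples: list of (queue_depth, iops) pairs, sorted by queue_depth.
    Returns the queue depth after which the next step's IOPS gain
    falls below min_gain (default 10%)."""
    for (qd_prev, iops_prev), (qd_next, iops_next) in zip(samples, samples[1:]):
        if (iops_next - iops_prev) / iops_prev < min_gain:
            return qd_prev
    return samples[-1][0]  # never flattened within the sweep

# Hypothetical sweep results from a fio iodepth ladder
sweep = [(1, 12_000), (4, 40_000), (16, 90_000), (64, 95_000), (128, 96_000)]
print(find_knee(sweep))  # 16: beyond QD 16 the throughput gains are marginal
```

Past the knee, extra queue depth mostly buys you latency, not throughput, so it is also the point where p99 typically starts to widen.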

  4. Narrow the scope: CPU vs. disk vs. RocksDB internals:

    • CPU: high sys or user in top, or compaction threads pegged by perf top, points to CPU-bound compaction.
    • Disk: %util at 90–100% with rising await points to I/O-bound.
    • RocksDB: --stats_per_interval shows compaction write amplification and stalls; level0_file_num_compaction_trigger, max_background_compactions, write_buffer_size are first levers.
  5. RocksDB tuning sequence (order matters):

    • Reproduce with --disable_wal on disposable DBs to see WAL cost baseline (does not preserve durability — only for microbench).
    • Tune write_buffer_size and max_write_buffer_number to increase memtable flush size if CPU is underutilized and compactions can be amortized.
    • Increase max_background_compactions to process L0→L1 more quickly, but watch CPU and I/O contention. More compaction threads increase throughput but can raise p99 if they steal CPU and I/O from foreground operations.
    • Adjust level0_file_num_compaction_trigger, level0_slowdown_writes_trigger, and level0_stop_writes_trigger to control write stalls.
    • Consider use_plain_table, mmap_reads, or pin_l0_filter_and_index_blocks_in_cache when read-latency matters and working sets are cache-friendly.
  6. Device-level knobs:

    • For NVMe, ensure correct driver parameters and avoid unnecessary scheduler work (e.g., none or mq-deadline on blk-mq stacks). Confirm mount options (e.g., noatime) and check whether the filesystem is appropriate. Test raw block device vs. filesystem-backed runs to understand the difference. Be conservative: some filesystem options affect durability semantics.
  7. Validate trade-offs:

    • Run workload with production-like write amplification enabled. Tuning that improves median but worsens p99 is a red flag. Repeat the baseline after each change and compare p99 and throughput.
  8. Contrarian insight (hard-won): chasing higher aggregate IOPS without watching the p99 usually backfires. Increasing background compaction threads or queue depths often raises throughput but also widens the latency distribution unless CPU, I/O, and memory headroom are verified first.

Practical benchmarking: repeatable suites, CI automation, and reporting

Your benchmarks need to be code: runnable scripts, versioned configs, and deterministic artifacts.

  • Test-suite structure:

    • 01-sanity: raw-device fio single-threaded, checks device health.
    • 02-db-warmup: db_bench populate with deterministic keyset.
    • 03-read-heavy: workload matching production read ratio.
    • 04-write-heavy: workload to exercise compaction path.
    • 05-spike-tests: burst concurrency patterns to exercise tail behavior.
  • Example benchmark runner (bash snippet):

#!/usr/bin/env bash
set -euo pipefail
OUTDIR=results/$(date +%Y%m%d-%H%M%S)
mkdir -p "$OUTDIR"
# collect host metadata
lscpu > "$OUTDIR"/lscpu.txt
nvme list > "$OUTDIR"/nvme.txt || lsblk > "$OUTDIR"/lsblk.txt
# start iostat in the background so it samples while fio runs
iostat -x -m 1 > "$OUTDIR"/iostat.log &
IOSTAT_PID=$!
# run fio job with json+ output
fio --name=test --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
    --rw=randread --bs=4k --numjobs=8 --iodepth=64 --time_based --runtime=120 \
    --output="$OUTDIR"/fio-test.json --output-format=json+
# stop the collector once fio finishes
kill "$IOSTAT_PID"
  • CI integration (GitHub Actions example):
name: storage-bench
on: [workflow_dispatch]
jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install fio
        run: sudo apt-get update && sudo apt-get install -y fio
      - name: Run benchmarks
        run: ./bench/run_all.sh
      - name: Upload artifacts
        uses: actions/upload-artifact@v4
        with:
          name: bench-results
          path: results/**

Note: CI runners are ephemeral and have variable hardware. Use CI for regression detection (compare new vs baseline runs) and store baseline artifacts on durable storage, but perform final approval on dedicated hardware labs.

  • Reporting and comparison:

    • Store JSON+ outputs and host metadata. Use fiologparser_hist.py or the included fio_jsonplus_clat2csv to convert clat histograms to CSV for plotting.
    • Compute deltas on key signals (p50, p95, p99, throughput) and report percent change and absolute change.
    • Automate a simple regression check: flag if p99 increases beyond X% or p99 absolute increases above SLO.
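That regression gate can be a few lines of Python. A sketch — the 10% threshold and 5 ms SLO below are placeholders for your own targets:

```python
def p99_regressed(baseline_p99_ms, new_p99_ms,
                  max_pct_increase=10.0, slo_p99_ms=5.0):
    """Flag a run if p99 worsens by more than max_pct_increase percent
    relative to baseline, or breaches the absolute SLO outright."""
    pct_change = (new_p99_ms - baseline_p99_ms) / baseline_p99_ms * 100
    return pct_change > max_pct_increase or new_p99_ms > slo_p99_ms

print(p99_regressed(2.0, 2.1))  # +5%, under SLO -> False
print(p99_regressed(2.0, 2.5))  # +25% -> True
```

Checking both the relative delta and the absolute SLO matters: a run can drift within tolerance for many commits and still end up over the SLO, and the absolute check catches that.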
  • Repeatability checklist:

    1. Record hardware + kernel + fs + driver versions.
    2. Use the same job files and seeds for synthetic generators.
    3. Warm to steady state before measurement.
    4. Run each test ≥3 times and use the median run for reporting.
    5. Store raw artifacts (fio JSON+, iostat, RocksDB stats).
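Picking the median run (checklist item 4) can be automated too. A small sketch — keying on p99 is my choice here; keying on throughput works equally well:

```python
def median_run(runs):
    """runs: list of dicts with a 'p99_ms' key; returns the median run
    so a single noisy run cannot skew the reported numbers."""
    ordered = sorted(runs, key=lambda r: r["p99_ms"])
    return ordered[len(ordered) // 2]

# Hypothetical results from three repeats of the same job
runs = [{"id": "a", "p99_ms": 2.1},
        {"id": "b", "p99_ms": 9.7},
        {"id": "c", "p99_ms": 2.3}]
print(median_run(runs)["id"])  # "c": p99 of 2.3 ms is the median of the three
```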

Closing statement
Good benchmarking is a discipline: define representative workloads from production traces, build a harness that captures both device and engine signals, make percentile and histogram data your primary lenses, and change one variable at a time while automating repeatable runs. Measure to learn, not to validate hope.

Sources

RocksDB — Benchmarking tools (GitHub Wiki) - Documentation and examples for db_bench, benchmark options and RocksDB-specific benchmarking patterns used in the article.

RocksDB* Tuning Guide on Intel® Xeon® Processor Platforms - Practical system-level and RocksDB parameter tuning notes, and explanation of LSM behavior and compaction trade-offs.

fio documentation (readthedocs) - fio job file options, json+ output, percentile settings, and latency profiling examples referenced for fio workflows.

iostat man page (manpages.org) - Definitions and examples for iostat fields such as %util, await, and extended reporting flags used for device telemetry.

What Is P99 Latency? (Aerospike blog) - Rationale for why p99/tail metrics matter and how tail amplification affects distributed systems.

Little's law (Wikipedia) - Queueing relationship used to relate IOPS, latency, and queue depth for capacity reasoning.

YCSB — Yahoo! Cloud Serving Benchmark (GitHub) - Workload generator for application-level CRUD patterns and distributions; used for mapping production mixes.

fio latency profile examples (fio docs examples) - Examples such as Poisson request submission and latency profiling used to model bursts and steady-state.

fio tools: fio_jsonplus_clat2csv (fio tools) - Utility and pattern for converting fio json+ latency dumps into CSV for plotting and CI analysis.

Azure: Queue depth and IOPS relationship (Azure docs) - Practical guidance and formula relating queue depth, IOPS, and latency for storage volumes.
