You see the symptoms: steady average latency, intermittent large spikes at p99/p999, and simple profilers that show nothing useful. That symptom set points to rare, expensive events — long syscalls, cache-miss storms, cross‑NUMA memory fetches, preemption jitter — which amplify with fan‑out and user scale and cannot be solved by looking at averages alone.
Contents
- When and What to Profile for Tail Latency
- Use perf to Capture Hardware Counters and Build Flame Graphs
- bpftrace Recipes for Live, Kernel‑Aware Tracing
- Read Traces Like a Surgeon: Interpreting Cache‑Miss and Syscall Hotspots
- Practical Application: A p99/p999 Profiling Checklist You Can Run Tonight
When and What to Profile for Tail Latency
For tail work you must measure the right signal, at the right place, and at the right time. The highest-value signals for p99/p999 hunting are:
- Wall-clock tail markers (SLO timestamps, request IDs, client-observed times). Capture time windows around these markers.
- PMU hardware counters: `cycles`, `instructions`, `cache-misses` (L1/LLC), `branch-misses`. These surface microarchitectural stalls and memory-bound behavior; `perf` exposes standard names mapped to the CPU PMU.
- Sampled call stacks (user + kernel) captured while the offending thread is running or blocked. Aggregated stacks show hotspots in code paths.
- Off‑CPU / sleep stacks showing where threads block (futex, poll/epoll, I/O). These explain why a thread saw a long pause.
- Syscall frequency and latency histograms to find noisy syscalls that dominate the tail.
- NUMA and memory placement metrics (remote memory accesses, `numastat`) when you see memory-driven tails.
When to capture:
- Target around the spike. Continuous high-rate sampling in production adds overhead; instead capture a short, focused window correlated to the SLO violation. For exploratory work you can sample longer at low frequency, then chase p99 with short, high-frequency bursts.
Hard truth: averages hide the tail. Aggregate counters help triage (are we CPU bound, memory bound, or I/O bound?), but you must combine counters with stack traces and syscall histograms to get a causal story.
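To see how far the mean and the tail diverge, it helps to pull the percentiles straight from a latency log before reaching for any profiler. A minimal sketch, assuming a file `latencies.txt` with one latency sample in microseconds per line (the filename and the nearest-rank index math are illustrative, not a standard tool):

```shell
# Sketch: nearest-rank p99/p999 from a one-sample-per-line latency log.
# latencies.txt is a placeholder name; one latency value (us) per line.
sort -n latencies.txt | awk '
  { v[NR] = $1; sum += $1 }
  END {
    p99  = v[int((NR *  99 +  99) / 100)]   # nearest-rank (ceiling) index
    p999 = v[int((NR * 999 + 999) / 1000)]
    printf "mean %.1f us   p99 %s us   p999 %s us\n", sum / NR, p99, p999
  }'
```

With 1000 evenly spread samples the mean sits near the median while p99/p999 sit at the extreme ranks, which is exactly the gap that motivates windowed captures rather than average-watching.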
Use perf to Capture Hardware Counters and Build Flame Graphs
perf remains the canonical PMU sampler for CPU and microarchitectural events. Use it to collect stack samples tied to hardware events and produce flame graphs that visualize where time is concentrated.
Minimal flow (system‑wide, low-noise):
# system-wide CPU sampling (99Hz), capture callchains
sudo perf record -F 99 -a -g -- sleep 60
# produce folded stacks and render flame graph (FlameGraph tools required)
sudo perf script | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > perf-cpu.svg
If you need PMU-driven sampling (e.g., only when LLC misses occur):
# capture stacks when LLC load misses fire
sudo perf record -e LLC-load-misses -F 199 -a -g -- sleep 30
sudo perf script | ./stackcollapse-perf.pl > out.folded
./flamegraph.pl out.folded > perf-llc.svg
Notes and options:
- Use `-F` to control sampling frequency; 50–200 Hz works for many workloads. Raise to 500–1000 Hz for sub-ms phenomena, but limit the duration because of overhead.
- For accurate user-space call stacks on optimized builds use `--call-graph dwarf` (or `lbr` on supported Intel CPUs) to avoid frame-pointer artifacts. The `perf record` man page documents the call-graph modes and their limits.
- You can also attach to a PID with `-p <pid>` rather than sampling system-wide.
- The common flame graph pipeline is `perf script | stackcollapse-perf.pl | flamegraph.pl`. Brendan Gregg's FlameGraph repository and documentation are the canonical references.
Interpreting flame graphs:
- Wide blocks = many samples in that stack. For CPU-bound p99, the culpable function appears wide at the top. For I/O-driven tails you will often see kernel syscall frames (e.g., `ppoll`, `futex`), and the busy work will live below or in sibling stacks.
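You can also quantify "wide" straight from the folded file without rendering an SVG. A rough sketch that credits each stack's sample count to its leaf frame, assuming the usual stackcollapse output of `frame;frame;leaf count` per line with no spaces inside frame names:

```shell
# Sketch: top leaf frames by sample count from a folded-stack file.
# Assumes "main;foo;bar 123" lines (stack first, count last, no spaces
# inside frame names) as produced by stackcollapse-perf.pl.
awk '{
  n = split($1, frames, ";")   # break the semicolon-joined stack apart
  leaf[frames[n]] += $NF       # credit the samples to the leaf frame
} END {
  for (f in leaf) print leaf[f], f
}' out.perf-folded | sort -rn | head
```

The same idea works one level down the stack (`frames[n-1]`) when the leaf is a trivial helper and the interesting caller sits just below it.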
bpftrace Recipes for Live, Kernel‑Aware Tracing
When you need context — argument values, filenames, histograms keyed by PID/comm, or low-overhead live sampling — reach for bpftrace. It gives you programmable probes: kprobes, uprobes, tracepoints, and hardware event hooks, with histogram and stack utilities built in.
Quick recipes (one-liners you can run in prod for short windows):
- Syscall counts (per second):
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); } interval:s:1 { print(@); clear(@); }'
- Per-syscall latency histogram (example: `execve`, using the stable syscall tracepoints; the original kprobe target `do_sys_execve` is not a kernel symbol):
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_execve { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_execve /@start[tid]/ {
  @lat_us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'
- Sample user stacks at ~100Hz for a PID:
sudo bpftrace -e 'profile:hz:99 /pid == 12345/ { @[ustack] = count(); } interval:s:10 { print(@); clear(@); }'
- Count LLC cache-misses by process/thread:
sudo bpftrace -e 'hardware:cache-misses:1000000 { @[comm, pid] = count(); }'
Practical tips:
- Use `tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args.filename)); }` to get syscall args via tracepoint `args` structs when you need filenames or flags.
- Prefer tracepoints (stable ABI) when available; use kprobes/uprobes when you need lower-level hooks at function entry/exit.
- Keep probes narrowly scoped (by `pid`, `comm`, or cgroup) during production captures to limit overhead and noisy output.
bpftrace ships with many ready-made tools (biolatency, opensnoop, runqlat, etc.) that implement common diagnostics; use those as building blocks.
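The same scoping discipline applies to the ready-made tools: wrapping any capture in coreutils `timeout` guarantees a forgotten terminal cannot leave a probe attached. A sketch, where the `tools/runqlat.bt` path is an assumption about where the bpftrace repo is cloned:

```shell
# Sketch: hard-bound a bpftrace capture to a 10-second window.
# --signal=INT makes bpftrace print its histograms before exiting
# (the default SIGTERM would drop them); the tools/ path is assumed.
sudo timeout --signal=INT 10 bpftrace tools/runqlat.bt
```

GNU `timeout` exits with status 124 when the window expires, which also makes these bounded captures easy to script and alert on.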
Read Traces Like a Surgeon: Interpreting Cache‑Miss and Syscall Hotspots
Capturing traces is only half the battle. The other half is mapping signals to surgical fixes.
- High LLC or L1 miss rates on p99 samples:
- Diagnose whether the miss storm is coming from a particular call chain in the flame graph. If the culprit is a tight loop that walks pointer-chasing data structures (linked lists, trees), convert to contiguous layouts (SoA or packed arrays), reduce pointer indirection, and consider software prefetching. Hardware vendors’ guides and profiling experience back this approach.
- Consider TLB pressure and page-size; high TLB miss rates call for large pages or working set shrinking. Intel tooling guides and VTune discuss TLB and cache guidance.
- Frequent expensive syscalls visible in `bpftrace` histograms:
  - `futex`-dominated tails usually imply lock contention. Inspect stack traces to identify which lock or allocator is the hotspot; reduce lock scope, move to lock‑free algorithms where appropriate, or batch work off the critical path. Off-CPU stacks and syscall histograms show the slow path clearly.
  - `epoll_pwait`/`ppoll` and long `read`/`write` indicate blocked I/O; follow the stack to the I/O source (database, filesystem, network) and target the external dependency. perf and strace-style traces corroborate each other.
- High cross‑socket memory accesses or asymmetrical node activity:
  - `numastat` and `numactl` can show remote memory usage; a remote access is often tens to hundreds of nanoseconds slower than a local one and shows up as p99 outliers when memory locality breaks. Pin threads and memory via `numactl`, or correct allocator behaviour, to eliminate remote hops.
- Branch mispredictions and long chains of instruction stalls:
  - Use `perf record -e branch-misses` and view call stacks to find mispredicted branch patterns; refactor hot code to be more branch-predictable or use branchless idioms in hot loops.
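For the NUMA case above, the interesting ratio is how much traffic lands on a remote node. A sketch that derives it from `numastat`-style per-node counters; the embedded sample numbers are illustrative, not real output:

```shell
# Sketch: estimate the fraction of remote-node memory placements from
# numastat-style output (sample numbers below are illustrative only).
numastat_output='
                           node0           node1
numa_hit                 9000000         8000000
numa_miss                      0               0
numa_foreign                   0               0
interleave_hit             11000           11000
local_node               8800000         7600000
other_node                200000          400000'
echo "$numastat_output" | awk '
  /^local_node/ { for (i = 2; i <= NF; i++) local += $i }
  /^other_node/ { for (i = 2; i <= NF; i++) other += $i }
  END { printf "remote fraction: %.1f%%\n", 100 * other / (local + other) }'
```

In live use you would pipe `numastat` itself into the awk filter; a remote fraction that jumps during p99 windows is the signal to start pinning.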
Important: a single tool rarely tells the whole story. Cross-correlate PMU counters, flame graphs, `bpftrace` histograms, and off‑CPU stacks to form a causal chain: "cache misses in function X → repeated kernel syscall Y → remote NUMA fetch" — then act on the weakest link.
Practical Application: A p99/p999 Profiling Checklist You Can Run Tonight
A compact, repeatable protocol to go from spike to fix.
- Mark the window
- Capture a timestamped sample of the SLO violation and note request identifiers or trace IDs.
- Lightweight counters (quick triage)
  - Run a short `perf stat` across the service (1–5 s) to see whether the system is CPU, memory, or I/O bound:
sudo perf stat -e cycles,instructions,cache-references,cache-misses -p $(pidof myservice) -- sleep 5
- Sample stacks for hotspots
- Low-noise baseline (30–120s):
sudo perf record -F 99 -a -g -- sleep 60
sudo perf script | ./stackcollapse-perf.pl > all.folded
./flamegraph.pl all.folded > cpu.svg
- PMU-focused window (capture when spike happens):
sudo perf record -e cache-misses -F 199 -a -g -- sleep 20
sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > llc.svg
- Live syscall and latency histograms (short bursts)
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); } interval:s:5 { print(@); clear(@); }'
# latency hist for a suspect syscall, run for ~10s
sudo bpftrace -e 'kprobe:vfs_read { @s[tid]=nsecs } kretprobe:vfs_read /@s[tid]/ { @lat_us = hist((nsecs-@s[tid])/1000); delete(@s[tid]); }'
- Off‑CPU analysis
  - Use `perf record -g -a -- sleep 30` plus `perf script` to look for blocking syscalls (`futex`, `epoll_pwait`, `read`) and correlate with the flame graphs and bpftrace histograms.
- Map observation → targeted fix
  - High per‑thread `cache-misses` in function X: rework data layout to contiguous arrays, align hot fields, prefetch, or reduce the working set.
  - `futex` / locking dominating p99: inspect the lock's hot path, consider partitioning, change the lock choice (spin vs mutex), or reduce contended hotspots.
  - Remote NUMA hops on p99: pin threads + memory (`numactl --cpunodebind` + `--membind`) or refactor the allocator to prefer the local node.
- Verify with controlled re-run
  - Rerun the same `perf` + `bpftrace` captures and compare p99/p999 before/after your change. Keep the exact command lines in a versioned doc for reproducibility.
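For the before/after comparison, FlameGraph's difffolded.pl renders differential flame graphs; when you just want the numbers, a standalone sketch over two folded captures also works (the file names here are placeholders):

```shell
# Sketch: per-stack sample delta between two folded captures.
# before.folded / after.folded are placeholder names for the two runs.
awk '
  NR == FNR { before[$1] = $NF; next }     # first file: record counts
  {
    delta = $NF - before[$1]               # second file: compute delta
    if (delta != 0) print delta, $1
    delete before[$1]
  }
  END { for (s in before) print -before[s], s }  # stacks that vanished
' before.folded after.folded | sort -rn
```

Positive rows are stacks that grew after the change, negative rows shrank or disappeared; a fix that worked should push your suspect stack firmly negative.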
Comparison at a glance
| Capability | perf | bpftrace |
|---|---|---|
| PMU sampling (cycles, cache) | Strong (low-level events, `perf stat`/`record`) | Limited (can count/trace PMCs, but less established for complex PMU workflows) |
| Call-stack sampling & flame graphs | Standard pipeline (`perf record` + `flamegraph.pl`) | Can sample `ustack`/`kstack`; good for quick checks, but the SVG pipeline is external |
| Syscall arg inspection & histograms | Basic (`strace`/`perf trace`) | Excellent (tracepoints/kprobes + `hist()` and `printf()` primitives) |
| Production safety for short bursts | Good if scoped | Excellent if narrowly scoped (pid/cgroup) and short-lived |
| Ease of ad-hoc queries | Requires some tooling | Fast one-liners + built-in histograms |
Sources
The Tail at Scale - Dean & Barroso (2013). Background on why p99/p999 tail behavior dominates at scale and the kinds of variability that cause tails.
CPU Flame Graphs — Brendan Gregg - Practical perf→flamegraph workflow and guidance about sampling frequency and eBPF profile alternatives.
FlameGraph (GitHub) — brendangregg/FlameGraph - stackcollapse-perf.pl and flamegraph.pl tools and usage examples for rendering SVG flame graphs.
perf tutorial — perf.wiki.kernel.org - perf events, perf stat, and PMU event usage and advice for sampling and multiplexing.
bpftrace (GitHub) — iovisor/bpftrace - bpftrace examples, probe types, and one-liners for histograms and stack sampling.
perf-record(1) — man7.org Linux manual page - perf record options, --call-graph modes (dwarf/lbr/fp) and practical flags.
BPF Performance Tools — Brendan Gregg (book page) - Reference for bpftrace/BPF tools, many ready‑to‑run scripts, and deeper observability patterns.
numactl(8) — man7.org Linux manual page - numactl usage and options for binding threads and memory to NUMA nodes.
Apply measurement rigor: isolate windows, collect counters + stacks, and correlate across perf and bpftrace outputs to produce a single causal chain you can act on.