You see the symptoms: steady average latency, intermittent large spikes at p99/p999, and simple profilers that show nothing useful. That symptom set points to rare, expensive events — long syscalls, cache-miss storms, cross‑NUMA memory fetches, preemption jitter — which amplify with fan‑out and user scale and cannot be solved by looking at averages alone.
Contents
- When and What to Profile for Tail Latency
- Use perf to Capture Hardware Counters and Build Flame Graphs
- bpftrace Recipes for Live, Kernel‑Aware Tracing
- Read Traces Like a Surgeon: Interpreting Cache‑Miss and Syscall Hotspots
- Practical Application: A p99/p999 Profiling Checklist You Can Run Tonight
When and What to Profile for Tail Latency
For tail work you must measure the right signal, at the right place, and at the right time. The highest-value signals for p99/p999 hunting are:
- Wall-clock tail markers (SLO timestamps, request IDs, client-observed times). Capture time windows around these markers.
- PMU hardware counters: `cycles`, `instructions`, `cache-misses` (L1/LLC), `branch-misses`. These surface microarchitectural stalls and memory-bound behavior; `perf` exposes standard names mapped to the CPU PMU.
- Sampled call stacks (user + kernel) captured while the offending thread is running or blocked. Aggregated stacks show hotspots in code paths.
- Off‑CPU / sleep stacks showing where threads block (futex, poll/epoll, I/O). These explain why a thread saw a long pause.
- Syscall frequency and latency histograms to find noisy syscalls that dominate the tail.
- NUMA and memory placement metrics (remote memory accesses, `numastat`) when you see memory-driven tails.
When to capture:
- Target around the spike. Continuous high-rate sampling in production adds overhead; instead capture a short, focused window correlated to the SLO violation. For exploratory work you can sample longer at low frequency, then chase p99 with short, high-frequency bursts.
Hard truth: averages hide the tail. Aggregate counters help triage (are we CPU bound, memory bound, or I/O bound?), but you must combine counters with stack traces and syscall histograms to get a causal story.
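To see how far the mean and the tail diverge, it helps to pull the percentiles straight from a latency log before reaching for any profiler. A minimal sketch, assuming a file `latencies.txt` with one latency sample in microseconds per line (the filename and the nearest-rank index math are illustrative, not a standard tool):

```shell
# Sketch: nearest-rank p99/p999 from a one-sample-per-line latency log.
# latencies.txt is a placeholder name; one latency value (us) per line.
sort -n latencies.txt | awk '
  { v[NR] = $1; sum += $1 }
  END {
    p99  = v[int((NR *  99 +  99) / 100)]   # nearest-rank (ceiling) index
    p999 = v[int((NR * 999 + 999) / 1000)]
    printf "mean %.1f us   p99 %s us   p999 %s us\n", sum / NR, p99, p999
  }'
```

With 1000 evenly spread samples the mean sits near the median while p99/p999 sit at the extreme ranks, which is exactly the gap that motivates windowed captures rather than average-watching.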
Use perf to Capture Hardware Counters and Build Flame Graphs
perf remains the canonical PMU sampler for CPU and microarchitectural events. Use it to collect stack samples tied to hardware events and produce flame graphs that visualize where time is concentrated.
Minimal flow (system‑wide, low-noise):
# system-wide CPU sampling (99Hz), capture callchains
sudo perf record -F 99 -a -g -- sleep 60
# produce folded stacks and render flame graph (FlameGraph tools required)
sudo perf script | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > perf-cpu.svg
If you need PMU-driven sampling (e.g., only when LLC misses occur):
# capture stacks when LLC load misses fire
sudo perf record -e LLC-load-misses -F 199 -a -g -- sleep 30
sudo perf script | ./stackcollapse-perf.pl > out.folded
./flamegraph.pl out.folded > perf-llc.svg
Notes and options:
- Use `-F` to control sampling frequency; 50–200 Hz works for many workloads. Raise to 500–1000 Hz for sub-ms phenomena, but limit the duration because of overhead.
- For accurate user-space call stacks on optimized builds use `--call-graph dwarf` (or `lbr` on supported Intel CPUs) to avoid frame-pointer artifacts. The `perf record` man page documents the call-graph modes and their limits.
- You can also attach to a PID with `-p <pid>` rather than sampling system-wide.
- The common flame graph pipeline is `perf script | stackcollapse-perf.pl | flamegraph.pl`. Brendan Gregg's FlameGraph repository and documentation are the canonical references.
Interpreting flame graphs:
- Wide blocks = many samples in that stack. For CPU-bound p99, the culpable function appears wide at the top. For I/O-driven tails you will often see kernel syscall frames (e.g., `ppoll`, `futex`), and the busy work will live below or in sibling stacks.
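You can also quantify "wide" straight from the folded file without rendering an SVG. A rough sketch that credits each stack's sample count to its leaf frame, assuming the usual stackcollapse output of `frame;frame;leaf count` per line with no spaces inside frame names:

```shell
# Sketch: top leaf frames by sample count from a folded-stack file.
# Assumes "main;foo;bar 123" lines (stack first, count last, no spaces
# inside frame names) as produced by stackcollapse-perf.pl.
awk '{
  n = split($1, frames, ";")   # break the semicolon-joined stack apart
  leaf[frames[n]] += $NF       # credit the samples to the leaf frame
} END {
  for (f in leaf) print leaf[f], f
}' out.perf-folded | sort -rn | head
```

The same idea works one level down the stack (`frames[n-1]`) when the leaf is a trivial helper and the interesting caller sits just below it.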
bpftrace Recipes for Live, Kernel‑Aware Tracing
When you need context — argument values, filenames, histograms keyed by PID/comm, or low-overhead live sampling — reach for bpftrace. It gives you programmable probes: kprobes, uprobes, tracepoints, and hardware event hooks, with histogram and stack utilities built in.
Quick recipes (one-liners you can run in prod for short windows):
- Syscall counts (per second):
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); } interval:s:1 { print(@); clear(@); }'
- Per-syscall latency histogram (example: `execve`, using the stable syscall tracepoints; the original kprobe target `do_sys_execve` is not a kernel symbol):
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_execve { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_execve /@start[tid]/ {
  @lat_us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'
- Sample user stacks at ~100Hz for a PID:
sudo bpftrace -e 'profile:hz:99 /pid == 12345/ { @[ustack] = count(); } interval:s:10 { print(@); clear(@); }'
- Count LLC cache-misses by process/thread:
sudo bpftrace -e 'hardware:cache-misses:1000000 { @[comm, pid] = count(); }'
Practical tips:
- Use `tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args.filename)); }` to get syscall args via tracepoint `args` structs when you need filenames or flags.
- Prefer tracepoints (stable ABI) when available; use kprobes/uprobes when you need lower-level hooks at function entry/exit.
- Keep probes narrowly scoped (by `pid`, `comm`, or cgroup) during production captures to limit overhead and noisy output.
bpftrace ships with many ready-made tools (biolatency, opensnoop, runqlat, etc.) that implement common diagnostics; use those as building blocks.
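The same scoping discipline applies to the ready-made tools: wrapping any capture in coreutils `timeout` guarantees a forgotten terminal cannot leave a probe attached. A sketch, where the `tools/runqlat.bt` path is an assumption about where the bpftrace repo is cloned:

```shell
# Sketch: hard-bound a bpftrace capture to a 10-second window.
# --signal=INT makes bpftrace print its histograms before exiting
# (the default SIGTERM would drop them); the tools/ path is assumed.
sudo timeout --signal=INT 10 bpftrace tools/runqlat.bt
```

GNU `timeout` exits with status 124 when the window expires, which also makes these bounded captures easy to script and alert on.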
Read Traces Like a Surgeon: Interpreting Cache‑Miss and Syscall Hotspots
Capturing traces is only half the battle. The other half is mapping signals to surgical fixes.
- High LLC or L1 miss rates on p99 samples:
- Diagnose whether the miss storm is coming from a particular call chain in the flame graph. If the culprit is a tight loop that walks pointer-chasing data structures (linked lists, trees), convert to contiguous layouts (SoA or packed arrays), reduce pointer indirection, and consider software prefetching. Hardware vendors’ guides and profiling experience back this approach.
- Consider TLB pressure and page-size; high TLB miss rates call for large pages or working set shrinking. Intel tooling guides and VTune discuss TLB and cache guidance.
- Frequent expensive syscalls visible in `bpftrace` histograms:
  - `futex`-dominated tails usually imply lock contention. Inspect stack traces to identify which lock or allocator is the hotspot; reduce lock scope, move to lock‑free algorithms where appropriate, or batch work off the critical path. Off-CPU stacks and syscall histograms show the slow path clearly.
  - `epoll_pwait`/`ppoll` and long `read`/`write` indicate blocked I/O; follow the stack to the I/O source (database, filesystem, network) and target the external dependency. perf and strace-style traces corroborate each other.
- High cross‑socket memory accesses or asymmetrical node activity:
  - `numastat` and `numactl` can show remote memory usage; a remote access is often tens to hundreds of nanoseconds slower than a local one and shows up as p99 outliers when memory locality breaks. Pin threads and memory via `numactl`, or correct allocator behaviour, to eliminate remote hops.
- Branch mispredictions and long chains of instruction stalls:
  - Use `perf record -e branch-misses` and view call stacks to find mispredicted branch patterns; refactor hot code to be more branch-predictable or use branchless idioms in hot loops.
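For the NUMA case above, the interesting ratio is how much traffic lands on a remote node. A sketch that derives it from `numastat`-style per-node counters; the embedded sample numbers are illustrative, not real output:

```shell
# Sketch: estimate the fraction of remote-node memory placements from
# numastat-style output (sample numbers below are illustrative only).
numastat_output='
                           node0           node1
numa_hit                 9000000         8000000
numa_miss                      0               0
numa_foreign                   0               0
interleave_hit             11000           11000
local_node               8800000         7600000
other_node                200000          400000'
echo "$numastat_output" | awk '
  /^local_node/ { for (i = 2; i <= NF; i++) local += $i }
  /^other_node/ { for (i = 2; i <= NF; i++) other += $i }
  END { printf "remote fraction: %.1f%%\n", 100 * other / (local + other) }'
```

In live use you would pipe `numastat` itself into the awk filter; a remote fraction that jumps during p99 windows is the signal to start pinning.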
Important: a single tool rarely tells the whole story. Cross-correlate PMU counters, flame graphs, `bpftrace` histograms, and off‑CPU stacks to form a causal chain: "cache misses in function X → repeated kernel syscall Y → remote NUMA fetch" — then act on the weakest link.
Practical Application: A p99/p999 Profiling Checklist You Can Run Tonight
A compact, repeatable protocol to go from spike to fix.
- Mark the window
- Capture a timestamped sample of the SLO violation and note request identifiers or trace IDs.
- Lightweight counters (quick triage)
  - Run a short `perf stat` across the service (1–5 s) to see whether the system is CPU, memory, or I/O bound:
sudo perf stat -e cycles,instructions,cache-references,cache-misses -p $(pidof myservice) -- sleep 5
- Sample stacks for hotspots
- Low-noise baseline (30–120s):
sudo perf record -F 99 -a -g -- sleep 60
sudo perf script | ./stackcollapse-perf.pl > all.folded
./flamegraph.pl all.folded > cpu.svg
- PMU-focused window (capture when spike happens):
sudo perf record -e cache-misses -F 199 -a -g -- sleep 20
sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > llc.svg
- Live syscall and latency histograms (short bursts)
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); } interval:s:5 { print(@); clear(@); }'
# latency hist for a suspect syscall, run for ~10s
sudo bpftrace -e 'kprobe:vfs_read { @s[tid]=nsecs } kretprobe:vfs_read /@s[tid]/ { @lat_us = hist((nsecs-@s[tid])/1000); delete(@s[tid]); }'
- Off‑CPU analysis
  - Use `perf record -g -a -- sleep 30` plus `perf script` to look for blocking syscalls (`futex`, `epoll_pwait`, `read`) and correlate with the flame graphs and bpftrace histograms.
- Map observation → targeted fix
  - High per‑thread `cache-misses` in function X: rework data layout to contiguous arrays, align hot fields, prefetch, or reduce the working set.
  - `futex` / locking dominating p99: inspect the lock's hot path, consider partitioning, change the lock choice (spin vs mutex), or reduce contended hotspots.
  - Remote NUMA hops on p99: pin threads + memory (`numactl --cpunodebind` + `--membind`) or refactor the allocator to prefer the local node.
- Verify with controlled re-run
  - Rerun the same `perf` + `bpftrace` captures and compare p99/p999 before/after your change. Keep the exact command lines in a versioned doc for reproducibility.
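For the before/after comparison, FlameGraph's difffolded.pl renders differential flame graphs; when you just want the numbers, a standalone sketch over two folded captures also works (the file names here are placeholders):

```shell
# Sketch: per-stack sample delta between two folded captures.
# before.folded / after.folded are placeholder names for the two runs.
awk '
  NR == FNR { before[$1] = $NF; next }     # first file: record counts
  {
    delta = $NF - before[$1]               # second file: compute delta
    if (delta != 0) print delta, $1
    delete before[$1]
  }
  END { for (s in before) print -before[s], s }  # stacks that vanished
' before.folded after.folded | sort -rn
```

Positive rows are stacks that grew after the change, negative rows shrank or disappeared; a fix that worked should push your suspect stack firmly negative.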
Comparison at a glance
| Capability | perf | bpftrace |
|---|---|---|
| PMU sampling (cycles, cache) | Strong (low-level events, `perf stat`/`record`) | Limited (can count/trace PMCs, but less established for complex PMU workflows) |
| Call-stack sampling & flame graphs | Standard pipeline (`perf record` + `flamegraph.pl`) | Can sample `ustack`/`kstack`; good for quick checks, but the SVG pipeline is external |
| Syscall arg inspection & histograms | Basic (`strace`/`perf trace`) | Excellent (tracepoints/kprobes + `hist()` and `printf()` primitives) |
| Production safety for short bursts | Good if scoped | Excellent if narrowly scoped (pid/cgroup) and short-lived |
| Ease of ad-hoc queries | Requires some tooling | Fast one-liners + built-in histograms |
Sources
The Tail at Scale - Dean & Barroso (2013). Background on why p99/p999 tail behavior dominates at scale and the kinds of variability that cause tails.
CPU Flame Graphs — Brendan Gregg - Practical perf→flamegraph workflow and guidance about sampling frequency and eBPF profile alternatives.
FlameGraph (GitHub) — brendangregg/FlameGraph - stackcollapse-perf.pl and flamegraph.pl tools and usage examples for rendering SVG flame graphs.
perf tutorial — perf.wiki.kernel.org - perf events, perf stat, and PMU event usage and advice for sampling and multiplexing.
bpftrace (GitHub) — iovisor/bpftrace - bpftrace examples, probe types, and one-liners for histograms and stack sampling.
perf-record(1) — man7.org Linux manual page - perf record options, --call-graph modes (dwarf/lbr/fp) and practical flags.
BPF Performance Tools — Brendan Gregg (book page) - Reference for bpftrace/BPF tools, many ready‑to‑run scripts, and deeper observability patterns.
numactl(8) — man7.org Linux manual page - numactl usage and options for binding threads and memory to NUMA nodes.
Apply measurement rigor: isolate windows, collect counters + stacks, and correlate across perf and bpftrace outputs to produce a single causal chain you can act on.