beefed.ai

Posted on • Originally published at beefed.ai

Designing a One-Click CLI Profiler for Engineers

  • Why a true 'one-click' profiler changes developer behavior
  • Sampling, symbols, and export formats that actually work
  • Designing low-overhead probes you can run in production
  • Profiling UX: CLI ergonomics, defaults, and flame-graph output
  • Actionable checklist: ship a one-click profiler in 8 steps

Profiling must be cheap, fast, and trustworthy — otherwise it becomes a curiosity instead of infrastructure. A one-click profiler should turn the act of measurement into a reflex: one command, low noise, a deterministic artifact (flame graph / pprof / speedscope) that your team can inspect and attach to an issue.

Most teams avoid profiling because it’s slow, fragile, or requires special privileges — that friction means performance regressions linger, expensive resources stay hidden, and root-cause hunts take days. Continuous and low-cost sampling (the architecture behind modern one-click profilers) addresses these adoption problems by making profiling a non-invasive, always-available signal for engineering workflows.

Why a true 'one-click' profiler changes developer behavior

A one-click profiler flips profiling from a gated, expert-only activity into a standard diagnostic tool the whole team uses. When the barrier drops from "request access + rebuild + instrument" to "run profile --short", velocity changes: regressions are reproducible artifacts, performance becomes part of PR reviews, and engineers stop guessing where CPU time is going. Parca and Pyroscope both frame continuous, low-overhead sampling as the mechanism that makes always-on profiling realistic; that cultural change is the primary product-level win.

Practical corollaries that matter when you design the tool:

  • Make the first-run experience frictionless: no build changes, no source edits, minimal privileges (or clear guidance when privileges are required).
  • Make the output shareable by default: an SVG, pprof protobuf, and a speedscope JSON give you quick review, deep analysis, and IDE-friendly import points.
  • Treat profiles as first-class artifacts: store them with the same care you store test results — timestamped, annotated with commit/branch, and linked to CI runs.
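A minimal sketch of the "self-describing artifact" idea, assuming a POSIX shell and git; the file names and the oneclick-profile command are illustrative, not a real tool:

```shell
# Wrap a recording step and emit a metadata sidecar next to the artifact,
# so the profile can be attached to an issue and still be interpretable.
out="profile.svg"
meta="${out}.meta.json"

commit=$(git rev-parse --short HEAD 2>/dev/null || echo unknown)
branch=$(git rev-parse --abbrev-ref HEAD 2>/dev/null || echo unknown)

# The actual recording would go here, e.g.:
# oneclick-profile record --duration 30s --format=flamegraph -o "$out"

printf '{"commit":"%s","branch":"%s","host":"%s","recorded_at":"%s"}\n' \
  "$commit" "$branch" "$(hostname)" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$meta"
cat "$meta"
```

Storing the sidecar next to the artifact (rather than inside it) keeps the SVG/pprof output format-agnostic while still making every run traceable to a commit and host.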

Sampling, symbols, and export formats that actually work

Sampling beats instrumentation for production: a well-configured sampler gives representative stacks with negligible perturbation. Timed sampling (what perf, py-spy, and eBPF-based samplers use) is how flame graphs are derived and why they scale to production workloads.

Practical sampling rules

  • Start at ≈100 Hz (99 Hz in practice, the perf convention, since an odd rate avoids sampling in lockstep with 100 Hz timer-driven activity). That yields roughly 3,000 samples in a 30 s run, usually enough to expose hot paths without swamping the target. Use -F 99 with perf or profile:hz:99 with bpftrace as a sensible default.
  • For very short traces or microbenchmarks, raise the rate; for always-on continuous collection, drop to 1–10 Hz and aggregate over time.
  • Sample wall-clock (off-CPU) in addition to on-CPU for IO/blocked analysis. Flame graph variants exist for both on-CPU and off-CPU views.
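These rates are easy to reason about as a simple sample budget; the arithmetic below just restates the rule of thumb, and shows why continuous low-rate collection only becomes useful once aggregated over time:

```shell
# samples ≈ rate_hz × duration_s, per thread that stays on-CPU
for rate in 99 10 1; do
  echo "rate=${rate}Hz duration=30s -> $((rate * 30)) samples"
done
```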

Symbol / unwinding strategy (what actually yields readable stacks)

  • Prefer frame-pointer unwinding when available (it's cheap and reliable). Many distributions now enable frame pointers for OS libraries to improve stack traces. Where frame pointers are missing, DWARF-based unwinding helps but is heavier and sometimes brittle. Brendan Gregg has practical notes on this tradeoff and why frame pointers matter again.
  • Collect debuginfo for significant binaries (strip debug symbols in release artifacts but publish .debug packages or use a symbol server). For eBPF/CO-RE agents, BTF and debuginfo uploads (or a symbol service) dramatically improve usability.
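One small, testable piece of that pipeline is deriving a stable key for symbol lookup. A sketch, assuming a symbol service that can key on either an ELF build-id or a content hash (the hash fallback is an assumption of this sketch, not a standard):

```shell
# Extract the ELF build-id if binutils is available; otherwise fall
# back to a content hash so the artifact still has a stable identity.
bin="/bin/sh"
id=$(readelf -n "$bin" 2>/dev/null | awk '/Build ID/ {print $NF}')
[ -n "$id" ] || id=$(sha256sum "$bin" | awk '{print $1}')
echo "symbol key: $id"
```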

Export formats: pick at least two that cover the UX triangle

  • pprof (profile.proto): rich metadata, cross-language tooling (pprof), good for CI/automation. Many backends (cloud profilers and Pyroscope) accept this protobuf.
  • Folded stacks / FlameGraph SVG: minimal, human-friendly, and interactive in a browser — the canonical artifact for PRs and post-mortems. Brendan Gregg’s FlameGraph toolkit remains the de facto converter for perf-derived stacks.
  • Speedscope JSON: excellent for multi-language interactive exploration and embedding into web UIs. Use it when you expect engineers to open profiles in a browser or in IDE plugins.
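To make the "folded stacks" intermediate concrete: it is just one line per unique stack, frames joined by semicolons, followed by a sample count. A toy sketch of the aggregation step (real collapsers like stackcollapse-perf.pl do this plus frame cleanup):

```shell
# Toy folded-stack input: "frame;frame;frame count"
cat > /tmp/toy.folded <<'EOF'
main;parse;read_file 12
main;parse;read_file 8
main;render 5
EOF

# Merge duplicate stacks by summing their counts
awk '{ c[$1] += $2 } END { for (s in c) print s, c[s] }' \
  /tmp/toy.folded > /tmp/toy.merged
cat /tmp/toy.merged
```

The merged file is exactly what flamegraph.pl consumes; the width of each box in the SVG is proportional to these counts.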

Example pipeline snippets

# Native C/C++ / system-level: perf -> folded -> flamegraph.svg
sudo perf record -F 99 -p $PID -g -- sleep 30
sudo perf script | ./FlameGraph/stackcollapse-perf.pl > /tmp/profile.folded
./FlameGraph/flamegraph.pl /tmp/profile.folded > /tmp/profile.svg
# Python: record with py-spy (non-invasive)
py-spy record -o profile.speedscope --format speedscope --pid $PID --rate 100 --duration 30
| Format | Best for | Pros | Cons |
| --- | --- | --- | --- |
| pprof (proto) | CI, automated regressions, cross-language analysis | Rich metadata; canonical for programmatic diffing and cloud profilers | Binary protobuf; needs pprof tooling to inspect |
| FlameGraph (folded → SVG) | Human post-mortems, PR attachments | Easy to generate from perf; immediate visual insight | Static SVG can be large; lacks pprof metadata |
| Speedscope JSON | Interactive browser analysis, multi-language | Responsive viewer; timeline + grouped views | Conversion may lose some metadata; viewer-dependent |

Designing low-overhead probes you can run in production

Low overhead is non-negotiable. Design probes so the act of measuring does not perturb the system you’re trying to understand.

Probe design patterns that work

  • Use sampling over instrumentation for CPU and general-purpose performance profiling; sample in the kernel or via safe user-space samplers. Sampling reduces the amount of data and the frequency of costly syscall interactions.
  • Leverage eBPF for system-wide, language-agnostic sampling where possible. eBPF runs in kernel space and is constrained by the verifier and helper APIs — that makes many eBPF probes both safe and low-overhead when implemented correctly. Prefer aggregated counters and maps in the kernel to avoid heavy per-sample copy traffic.
  • Avoid transferring raw stacks for every sample. Aggregate in-kernel (counts per stack) and pull only summaries periodically, or use per-CPU ring buffers sized appropriately. Parca’s architecture follows this philosophy: collect low-level stacks with minimal per-sample overhead and archive aggregated data for query.

Probe types and when to use them

  • perf_event sampling — generic CPU sampling and low-level PMU events. Use this as your default sampler for native code.
  • kprobe / uprobe — targeted kernel/user-space dynamic probes (use sparingly; good for targeted investigations).
  • USDT (user static tracepoints) — ideal for instrumenting long-lived language runtimes or frameworks without changing sampling behavior.
  • Runtime-specific samplers — use py-spy for CPython to get accurate Python-level frames without hacking the interpreter; use runtime/pprof for Go where pprof is native.

Safety and operational controls

  • Always measure and publish the profiler’s own overhead. Continuous agents should target single-digit percent overhead at most and provide "off" modes. Parca and Pyroscope emphasize that continuous on-production collection must be minimally invasive.
  • Guard privileges: require explicit opt-in for privileged modes (kernel tracepoints, eBPF loading, which needs CAP_BPF and CAP_PERFMON on modern kernels, or CAP_SYS_ADMIN on older ones). Document perf_event_paranoid relaxation when necessary and provide fallback modes for unprivileged collection.
  • Implement robust failure paths: your agent must gracefully detach on OOM, verifier failure, or denied capabilities; do not let profiling cause application instability.
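A sketch of the privilege probe such a fallback path needs. The threshold semantics are simplified here; perf_event_paranoid values differ slightly across kernel versions and distributions:

```shell
# Pick a sampling mode based on /proc/sys/kernel/perf_event_paranoid:
# roughly, <= 1 permits kernel+user sampling for unprivileged users,
# 2 restricts to user-space, and higher values need CAP_PERFMON/root.
paranoid=$(cat /proc/sys/kernel/perf_event_paranoid 2>/dev/null || echo 4)
if [ "$paranoid" -le 1 ]; then
  mode="kernel+user sampling"
elif [ "$paranoid" -le 2 ]; then
  mode="user-space sampling only"
else
  mode="unprivileged fallback (runtime sampler, no perf_event)"
fi
echo "mode: $mode"
```

Deciding the mode up front, rather than failing mid-run, is what lets the tool degrade gracefully instead of erroring out on locked-down hosts.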

Concrete eBPF example (bpftrace one-liner)

# sample user-space stacks for a PID at 99Hz and count each unique user stack
sudo bpftrace -e 'profile:hz:99 /pid == 1234/ { @[ustack] = count(); }'

That same pattern is the basis of many production eBPF agents, but production code moves the logic into libbpf C/Rust consumers, uses per-CPU ring buffers, and implements symbolization offline.

Profiling UX: CLI ergonomics, defaults, and flame-graph output

A one-click CLI profiler lives or dies by its defaults and its ergonomics. The goal: minimal typing, predictable artifacts, and safe defaults.

Design decisions that pay off

  • Single binary with small set of subcommands: record, top, report, upload. record creates artifacts, top is a live summary, report converts or uploads artifacts to a chosen backend. Pattern after py-spy and perf.
  • Sensible defaults:
    • --duration 30s for a representative snapshot (short dev runs can use --short=10s).
    • --rate 99 (or --hz 99) as the default sampling frequency.
    • --format supports flamegraph, pprof, and speedscope.
    • Auto-annotate profiles with git commit, binary build-id, kernel version, and host so artifacts are self-describing.
  • Explicit modes: --production uses conservative rates (1–5 Hz) and streaming upload; --local uses higher rates for developer iteration.
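A sketch of how those defaults and modes could compose in the CLI's argument handling (flag names follow the hypothetical oneclick-profile interface above; only the space-separated "--flag value" form is parsed, for brevity):

```shell
# Defaults first; flags override; mode switches swap in preset rates.
duration="30s"; rate=99; format="flamegraph"
while [ $# -gt 0 ]; do
  case "$1" in
    --duration)   duration="$2"; shift 2 ;;
    --rate)       rate="$2";     shift 2 ;;
    --format)     format="$2";   shift 2 ;;
    --production) rate=5;        shift ;;  # conservative always-on rate
    --local)      rate=199;      shift ;;  # higher rate for dev iteration
    *)            shift ;;
  esac
done
echo "recording: duration=$duration rate=${rate}Hz format=$format"
```

The key property is that every run prints its effective configuration, so the artifact's provenance is never ambiguous.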

CLI example (user perspective)

# quick local: 10s flame graph
oneclick-profile record --duration 10s --format=flamegraph -o profile.svg

# produce pprof for CI automation
oneclick-profile record --duration 30s --format=pprof -o profile.pb.gz

# live top-like view
oneclick-profile top --pid $PID

Flame graph & visualization UX

  • Produce an interactive SVG by default for immediate inspection; include search and zoomable labels. Brendan Gregg’s FlameGraph scripts produce compact and readable SVGs that engineers expect.
  • Also emit pprof protobuf and speedscope JSON so the artifact slots into CI workflows, pprof comparisons, or the speedscope interactive viewer.
  • When running in CI, attach the SVG to the run and publish the pprof for automated diffing.


Important: Always include the build-id / debug-id and the exact command line in the profile metadata. Without matching symbols, a flame graph becomes a list of hex addresses — useless for actionable fixes.

IDE and PR workflows

  • Make oneclick-profile produce a single HTML or SVG that can be embedded into a PR comment or opened by developers with one click. Speedscope JSON is also friendly for browser embedding and IDE plugins.

Actionable checklist: ship a one-click profiler in 8 steps

This checklist is a compact implementation plan you can execute in sprints.

  1. Define scope & success criteria
    • Languages initially supported (e.g., C/C++, Go, Python, Java).
    • Target overhead budget (e.g., <2% for short runs, <0.5% for always-on sampling).
  2. Choose the data model and exports
    • Support pprof (profile.proto), flamegraph SVG (folded stacks), and speedscope JSON.
  3. Implement a local CLI with safe defaults
    • Subcommands: record, top, report, upload.
    • Defaults: --duration 30s, --rate 99, --format=flamegraph.
  4. Build sampling backends
    • For native binaries: perf pipeline + optional eBPF agent (libbpf/CO-RE).
    • For Python: integrate py-spy as a fallback to capture Python-level frames non-invasively.
  5. Implement symbolization and debuginfo pipeline
    • Automatic collection of build-id and debuginfo upload to a symbol server; use addr2line, eu-unstrip, or pprof symbolizers to resolve addresses into function/lines.
  6. Add production-friendly agents and aggregation
    • eBPF agent that aggregates counts in-kernel; push compressed series to Parca/Pyroscope backends for long-term analysis.
  7. CI integration for performance regression detection
    • Capture pprof during benchmark runs in CI, store as artifact, and compare against baseline using pprof or custom diffs. Example GitHub Actions snippet:
name: Profile Regression Test
on: [push]
jobs:
  profile:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build
        run: make -j
      - name: Run workload and profile
        run: ./bin/oneclick-profile record --duration 30s --format=pprof -o profile.pb.gz
      - uses: actions/upload-artifact@v4
        with:
          name: profile
          path: profile.pb.gz
  8. Observe & iterate
    • Emit telemetry about agent CPU overhead, sample counts, and adoption. Store representative flame graphs in a "perf repo" for quick browsing and to support post-mortem work.

Quick checklist (operational):

  • [ ] Default record duration documented
  • [ ] Debuginfo upload mechanism in place
  • [ ] pprof + flamegraph.svg produced for each run
  • [ ] Agent overhead measured and reported
  • [ ] Safe fallback modes documented for unprivileged runs
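For the "agent overhead measured" item, even a coarse check beats none. A sketch using ps; AGENT_PID is a placeholder for whatever PID your agent reports, defaulting to the current shell here only so the snippet runs standalone:

```shell
# Compare the agent's instantaneous CPU share against the overhead budget.
budget="2.0"
pid="${AGENT_PID:-$$}"
cpu=$(ps -o %cpu= -p "$pid" | tr -d ' ')
if awk -v c="$cpu" -v b="$budget" 'BEGIN { exit !(c <= b) }'; then
  verdict="within budget (${cpu}% <= ${budget}%)"
else
  verdict="over budget (${cpu}% > ${budget}%)"
fi
echo "$verdict"
```

A production agent would publish this as a metric rather than a one-shot check, but the budget comparison is the same.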

Sources
BPF Documentation — The Linux Kernel documentation - Kernel-side description of eBPF, libbpf, BTF, program types, helper functions and safety constraints used when designing eBPF-based sampling agents.

Flame Graphs — Brendan Gregg - Origin and best-practices for flame graphs, why sampling was chosen, and typical generation pipelines. Used for visualization guidance and folded-stack conversion.

perf: Linux profiling with performance counters (perf wiki) - Authoritative description of perf, perf record/perf report, sampling frequency usage (-F 99) and security considerations for perf_event.

Parca — Overview / Continuous Profiling docs - Rationale and architecture for continuous, low-overhead profiling using eBPF and aggregation, and deployment guidance.

Grafana Pyroscope — Configure the client to send profiles - How Pyroscope collects low-overhead profiles (including eBPF collection), and discussion of continuous profiling as an observability signal.

py-spy — Sampling profiler for Python programs (GitHub) - Practical example of a non-invasive, low-overhead process-level sampler for Python and recommended CLI patterns (record, top, dump).

pprof — Google pprof (GitHub / docs) - Specification of the profile.proto format used by pprof, and tooling for programmatic analysis and CI integration.

Speedscope and file format background (speedscope.app / Mozilla blog) - Interactive profile viewer guidance and why speedscope JSON is useful for multi-language, interactive exploration.

This is a practical blueprint: make the profiler the easiest diagnostic you own, ensure the sampling and symbolization choices are conservative and measurable, and produce artifacts that humans and automation both use.
