This article was originally published on AI Study Room. For the full version with working code examples and related articles, visit the original post.
Performance Profiling: perf, Flamegraphs, py-spy, pprof
Introduction
Performance profiling identifies where your application spends its time — CPU, memory, I/O, or blocking. Without profiling, optimization is guesswork. This article covers four profiling approaches: perf for system-level Linux profiling, flamegraphs for visualization, py-spy for Python without code changes, and pprof for Go applications.
perf (Linux Profiler)
The built-in Linux profiler for CPU, hardware events, and tracepoints:
# CPU profiling
perf record -F 99 -g ./myapp # Sample at 99Hz with call graphs
perf record -F 99 -p PID -g -- sleep 30 # Profile running process for 30s
perf report --stdio # Text report
perf report -g graph # Call graph report
# Common events
perf stat ./myapp # Execution statistics
perf stat -e cache-misses ./myapp # Cache miss analysis
perf stat -e branch-misses ./myapp # Branch prediction
perf stat -e context-switches -p PID # Context switch monitoring
# Hardware event sampling
perf record -e cycles -F 99 -a -g -- sleep 10 # System-wide CPU sampling for 10s
# Tracepoints
perf record -e sched:sched_switch -a -g # Context switch tracing
perf record -e syscalls:sys_enter_write -a # Write syscall tracing
# Top-like live view
perf top -p PID
perf top -e cache-misses
# Generate flamegraph data
perf script > out.perf
Key metrics: cycles for CPU time, cache-misses for memory bottleneck detection, context-switches for contention issues.
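To try these commands against something reproducible, any CPU-bound program works as a target. Here is a minimal sketch (the script name and both function names are invented for illustration; note that perf on CPython mostly resolves interpreter-level C frames, while py-spy below resolves the Python frames):

```python
# cpu_workload.py -- hypothetical target, e.g.:
#   perf record -F 99 -g python3 cpu_workload.py

def hot_path(n):
    """Deliberately heavy: O(n) arithmetic per call."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def cold_path(n):
    """Light work: should be a narrow frame in the profile."""
    return sum(range(n))

def main():
    acc = 0
    for _ in range(50):
        acc += hot_path(100_000)  # dominates the samples
        acc += cold_path(1_000)   # barely registers
    return acc

if __name__ == "__main__":
    print(main())
```

A profile of this script should show hot_path as the wide frame, which makes it a quick sanity check that sampling and symbol resolution are working.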
Flamegraphs
Brendan Gregg's visualization for profiler output:
# Install FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph
# Generate flamegraph from perf data
perf script | ./FlameGraph/stackcollapse-perf.pl > out.folded
./FlameGraph/flamegraph.pl out.folded > flamegraph.svg
# Generate differential flamegraph (before/after)
# Before optimization:
perf script | ./FlameGraph/stackcollapse-perf.pl > before.folded
# After optimization:
perf script | ./FlameGraph/stackcollapse-perf.pl > optimized.folded
./FlameGraph/difffolded.pl before.folded optimized.folded | ./FlameGraph/flamegraph.pl > diff.svg
Reading flamegraphs: The x-axis shows the stack profile population sorted alphabetically — it is not a timeline. Each rectangle is a function frame; its width is the fraction of samples containing that frame, so wider means more CPU time. The y-axis is stack depth. Look for wide plateaus along the top edge — those are the functions directly consuming CPU.
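The folded file that flamegraph.pl consumes is a simple text format: one semicolon-joined stack per line followed by a sample count, which is exactly what stackcollapse-perf.pl emits. A minimal sketch of that aggregation step, using invented sample data:

```python
from collections import Counter

# Invented raw samples: each tuple is one call stack, outermost frame first
samples = [
    ("main", "parse", "tokenize"),
    ("main", "parse", "tokenize"),
    ("main", "parse", "tokenize"),
    ("main", "render"),
]

# Fold: identical stacks collapse into "frame1;frame2;... count" lines,
# the input format for flamegraph.pl
folded = Counter(";".join(stack) for stack in samples)
for stack, count in sorted(folded.items()):
    print(f"{stack} {count}")
# main;parse;tokenize 3  -> the widest box: 3 of 4 samples
# main;render 1
```

Writing those two lines to a file and running flamegraph.pl on it produces a valid (if tiny) SVG, which is a handy way to understand what the real pipeline does.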
For other languages:
# JavaScript (Node.js): run with perf symbol maps, then sample with perf
node --perf-basic-prof app.js &
perf record -F 99 -g -p $! -- sleep 30
perf script | ./FlameGraph/stackcollapse-perf.pl > out.folded
# Python with py-spy
py-spy record -o profile.svg --pid $PID
# Go with pprof
go tool pprof -http=:8080 'http://localhost:6060/debug/pprof/profile?seconds=30'
py-spy
Sampling profiler for Python without modifying code:
# Installation
pip install py-spy
# Attach to a running process
py-spy record -o profile.svg --pid 12345
# Or launch and profile a script
py-spy record -o profile.svg -- python myapp.py
# Top-like live view
py-spy top --pid 12345
# Dump current stack traces
py-spy dump --pid 12345
# Profile specific duration
py-spy record -o profile.svg --pid 12345 --duration 30
# With subprocesses (the flag must come before --, or it is passed to myapp.py)
py-spy record --subprocesses -o profile.svg -- python myapp.py
# Native frames
py-spy record --native -o profile.svg --pid 12345
# Save raw data for later analysis
py-spy record -o profile.raw --pid 12345 --format raw
Key advantages: No code changes required, works with running processes, safe for production (read-only), native code frame support.
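What py-spy dump shows is easiest to see with a target whose threads are in distinct states. A hypothetical sketch (file name and structure invented): one thread blocks on a queue while the main thread feeds it, so a dump of this process would show the worker idle inside Queue.get and the main thread active.

```python
# worker_demo.py -- hypothetical target for `py-spy dump --pid <pid>`
import queue
import threading

jobs = queue.Queue()
results = []

def worker():
    # Blocks in Queue.get between items -- a dump would show this frame idle
    while True:
        item = jobs.get()
        if item is None:
            return
        results.append(item * item)

t = threading.Thread(target=worker)
t.start()
for i in range(5):
    jobs.put(i)
jobs.put(None)   # sentinel: tell the worker to exit
t.join()
print(results)   # squares of 0..4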
pprof (Go)
Go's built-in profiling tool:
package main

import (
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof handlers
)

func main() {
    // Start pprof HTTP server
    go func() {
        http.ListenAndServe("localhost:6060", nil)
    }()
    // Your application code...
}
# Collect profiles
go tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30' # CPU
go tool pprof http://localhost:6060/debug/pprof/heap # Memory
go tool pprof http://localhost:6060/debug/pprof/goroutine # Goroutines
go tool pprof http://localhost:6060/debug/pprof/block # Blocking (enable with runtime.SetBlockProfileRate)
go tool pprof http://localhost:6060/debug/pprof/mutex # Mutex contention (enable with runtime.SetMutexProfileFraction)
# Interactive exploration
go tool pprof cpu.pprof
(pprof) top10 # Top 10 functions
(pprof) list myFunc # Source with line-level timing
(pprof) web # Open in browser (requires graphviz)
(pprof) pdf # Generate PDF
(pprof) peek myFunc # Caller/callee view
# Web interface
go tool pprof -http=:8080 cpu.pprof
# Allocations profiling
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/allocs
# Compare profiles
go tool pprof -http=:8080 -diff_base=before.pprof after.pprof