丁久

Posted on • Originally published at dingjiu1989-hue.github.io

Performance Profiling: perf, Flamegraphs, py-spy, pprof

This article was originally published on AI Study Room. For the full version with working code examples and related articles, visit the original post.

Introduction

Performance profiling identifies where your application spends its time — CPU, memory, I/O, or blocking. Without profiling, optimization is guesswork. This article covers four profiling approaches: perf for system-level Linux profiling, flamegraphs for visualization, py-spy for Python without code changes, and pprof for Go applications.

perf (Linux Profiler)

The built-in Linux profiler for CPU, hardware events, and tracepoints:

```shell
# CPU profiling
perf record -F 99 -g ./myapp            # Sample at 99 Hz with call graphs
perf record -F 99 -p PID -g -- sleep 30 # Profile a running process for 30 s
perf report --stdio                     # Text report
perf report -g graph                    # Call-graph report

# Common events
perf stat ./myapp                       # Execution statistics
perf stat -e cache-misses ./myapp       # Cache-miss analysis
perf stat -e branch-misses ./myapp      # Branch prediction
perf stat -e context-switches -p PID    # Context-switch monitoring

# Hardware event sampling
perf record -e cycles -F 99 -a -g -- sleep 10  # System-wide CPU sampling for 10 s

# Tracepoints
perf record -e sched:sched_switch -a -g     # Context-switch tracing
perf record -e syscalls:sys_enter_write -a  # Write-syscall tracing

# Top-like live view
perf top -p PID
perf top -e cache-misses

# Generate flamegraph data
perf script > out.perf
```

Key metrics: cycles for CPU time, cache-misses for memory bottleneck detection, context-switches for contention issues.
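perf needs a workload to sample; a minimal CPU-bound stand-in for the hypothetical `./myapp` above (the names `inner_loop` and `outer` are illustrative) is enough to experiment with the commands in this section. Nested calls like these show up as stacked frames in the call-graph report:

```python
# cpu_work.py -- hypothetical CPU-bound target for profiling exercises.
# Run under perf:  perf record -F 99 -g -- python cpu_work.py
# (perf sees the interpreter's native frames; use py-spy, below,
#  to see the Python-level functions by name.)

def inner_loop(n):
    """Hot function: dominates the samples in a profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def outer(rounds, n):
    """Calls inner_loop repeatedly, forming a two-level stack."""
    return sum(inner_loop(n) for _ in range(rounds))

if __name__ == "__main__":
    print(outer(rounds=20, n=100_000))
```

At a 99 Hz sampling rate, a run of a few seconds yields a few hundred stack samples, which is plenty to make the hot path obvious in `perf report`.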

Flamegraphs

Brendan Gregg's visualization for profiler output:

```shell
# Install FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph

# Generate a flamegraph from perf data
perf script | ./FlameGraph/stackcollapse-perf.pl > out.folded
./FlameGraph/flamegraph.pl out.folded > flamegraph.svg

# Differential flamegraph (before/after)
# Before optimization:
perf script | ./FlameGraph/stackcollapse-perf.pl > before.folded
# After optimization:
perf script | ./FlameGraph/stackcollapse-perf.pl > optimized.folded
./FlameGraph/difffolded.pl before.folded optimized.folded | ./FlameGraph/flamegraph.pl > diff.svg
```

Reading flamegraphs: The x-axis shows the sampled stack population sorted alphabetically, not the passage of time. Each rectangle is a function; its width is the fraction of samples it appeared in, so wider means more CPU time. The y-axis is stack depth. Look for wide rectangles along the top edge: those are the hot functions actually on-CPU.
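The folded format those scripts exchange is simple: one line per unique stack, frames joined by semicolons, followed by a sample count. A minimal Python sketch of the collapse and diff steps (stand-ins for `stackcollapse-perf.pl` and `difffolded.pl`, not their full logic):

```python
from collections import Counter

def collapse(samples):
    """Aggregate raw stack samples (outermost-first frame lists)
    into folded 'frame;frame;frame count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

def diff_folded(before, after):
    """Per-stack sample delta between two folded profiles
    (positive delta = that stack got more samples after the change)."""
    parse = lambda lines: {s: int(n) for s, n in (l.rsplit(" ", 1) for l in lines)}
    b, a = parse(before), parse(after)
    return {stack: a.get(stack, 0) - b.get(stack, 0) for stack in b.keys() | a.keys()}

samples = [["main", "parse"], ["main", "parse"], ["main", "render"]]
folded = collapse(samples)
# folded == ["main;parse 2", "main;render 1"]
```

flamegraph.pl then draws each folded line as a column of rectangles, with widths proportional to the counts.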

For other languages:

```shell
# JavaScript (Node.js) -- --perf-basic-prof emits a JIT symbol map for perf
perf record -F 99 -g -- node --perf-basic-prof app.js
perf script | ./FlameGraph/stackcollapse-perf.pl > out.folded
./FlameGraph/flamegraph.pl out.folded > node.svg

# Python with py-spy (writes the flamegraph SVG directly)
py-spy record -o profile.svg --pid $PID

# Go with pprof (the web UI includes a flame graph view)
go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/profile?seconds=30"
```

py-spy

Sampling profiler for Python without modifying code:

```shell
# Installation
pip install py-spy

# Profile a running process, or launch a script under the profiler
py-spy record -o profile.svg --pid 12345
py-spy record -o profile.svg -- python myapp.py

# Top-like live view
py-spy top --pid 12345

# Dump current stack traces
py-spy dump --pid 12345

# Profile for a specific duration
py-spy record -o profile.svg --pid 12345 --duration 30

# Include subprocesses (py-spy flags go before the -- separator)
py-spy record --subprocesses -o profile.svg -- python myapp.py

# Include native (C extension) frames
py-spy record --native -o profile.svg --pid 12345

# Save raw data for later analysis
py-spy record -o profile.raw --format raw --pid 12345
```

Key advantages: No code changes required, works with running processes, safe for production (read-only), native code frame support.
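Under the hood, a sampling profiler periodically snapshots each thread's call stack and counts what it sees; py-spy does this from outside the process by reading interpreter memory. An in-process Python sketch of the same idea (an illustration of the sampling principle only, not how py-spy is implemented):

```python
import collections
import sys
import threading
import time

def sample_thread(thread_id, duration=0.3, interval=0.005):
    """Snapshot the thread's innermost frame at a fixed interval and
    count function names -- the essence of sampling profiling."""
    counts = collections.Counter()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)
    return counts

def hot_function():
    total = 0
    for i in range(20_000_000):  # long enough to be caught by the sampler
        total += i * i
    return total

worker = threading.Thread(target=hot_function)
worker.start()
counts = sample_thread(worker.ident)
worker.join()
print(counts.most_common(1)[0][0])  # hot_function dominates the samples
```

Because sampling only reads state at intervals instead of instrumenting every call, overhead stays low and roughly constant, which is why this approach is safe on production workloads.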

pprof (Go)

Go's built-in profiling tool:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Start the pprof HTTP server on a side goroutine
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// Your application code...
}
```

```shell
# Collect profiles (quote URLs containing ? so the shell doesn't glob them)
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"  # CPU
go tool pprof http://localhost:6060/debug/pprof/heap                  # Memory
go tool pprof http://localhost:6060/debug/pprof/goroutine             # Goroutines
go tool pprof http://localhost:6060/debug/pprof/block                 # Blocking
go tool pprof http://localhost:6060/debug/pprof/mutex                 # Mutex contention
# Note: block/mutex profiles are empty unless enabled in code via
# runtime.SetBlockProfileRate / runtime.SetMutexProfileFraction.

# Interactive exploration
go tool pprof cpu.pprof
(pprof) top10           # Top 10 functions
(pprof) list myFunc     # Source with line-level timing
(pprof) web             # Open in browser (requires graphviz)
(pprof) pdf             # Generate PDF
(pprof) peek myFunc     # Caller/callee view

# Web interface
go tool pprof -http=:8080 cpu.pprof

# Allocation profiling
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/allocs

# Compare profiles
go tool pprof -http=:8080 -diff_base=before.pprof after.pprof
```

Found this useful? Check out more developer guides and tool comparisons on AI Study Room.
