James Lee

Posted on May 19

Go Performance Optimization: pprof, Flame Graphs & Hotspot Profiling

#go #performance #tooling #tutorial

Performance optimization in Go isn't guesswork — it's a systematic process backed by data. In this article, we'll walk through the full optimization workflow, explain how pprof works internally, and show you how to read flame graphs to locate bottlenecks with confidence.

1. The Performance Optimization Workflow

Before touching any code, follow this four-step process:

1. Understand the code → Clarify the common logic and usage scenarios
        ↓
2. Write benchmarks → Simulate realistic traffic and workloads
        ↓
3. Collect data → Use pprof or flame graphs to capture runtime behavior
        ↓
4. Optimize hotspots → Focus on the functions with the highest relative cost

Rule of thumb: Optimization closer to the application layer (e.g. caching, async logic) typically yields ms-level improvements. Code-level micro-optimizations yield µs-level gains. Always start with the bigger wins.
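As a minimal sketch of step 2, the program below benchmarks two ways of building a string. The functions joinConcat and joinBuilder are hypothetical stand-ins for your own hot path; testing.Benchmark is used only so the example runs as a plain program rather than under go test.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// joinConcat builds a string with repeated +=, which copies the string on every step.
func joinConcat(n int) string {
	s := ""
	for i := 0; i < n; i++ {
		s += "x"
	}
	return s
}

// joinBuilder builds the same string with strings.Builder, which reuses a growing buffer.
func joinBuilder(n int) string {
	var b strings.Builder
	for i := 0; i < n; i++ {
		b.WriteString("x")
	}
	return b.String()
}

func main() {
	candidates := []struct {
		name string
		fn   func(int) string
	}{
		{"concat", joinConcat},
		{"builder", joinBuilder},
	}
	for _, c := range candidates {
		// testing.Benchmark picks b.N automatically, just like "go test -bench".
		res := testing.Benchmark(func(b *testing.B) {
			b.ReportAllocs()
			for i := 0; i < b.N; i++ {
				c.fn(1000)
			}
		})
		fmt.Printf("%-8s %s %s\n", c.name, res.String(), res.MemString())
	}
}
```

In a real project the same loops would normally live in a _test.go file as BenchmarkXxx functions, run with go test -bench=. -benchmem, and adding -cpuprofile cpu.prof produces the data for step 3.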

2. What Is pprof?

pprof is Go's built-in tool for visualizing and analyzing performance profiling data. It collects runtime data — goroutine stacks, memory allocations, CPU usage — and lets you identify exactly where your program spends its time and memory.

Two Types of Profilers

Type	How It Works	Examples
Sampling profiler	Measures at regular time intervals	Go CPU profiler
Tracing profiler	Fires on specific events (function call, lock, GC)	Go execution tracer

A sampling profiler has two core components:

Sampler — a callback triggered at fixed intervals that captures the current stack trace
Data collector — aggregates all captured stack traces into a statistical summary (call counts, memory sizes, etc.)

3. How CPU Profiling Works

Go's CPU profiler uses a stack trace + statistics model.

┌──────────────────────────────────────────────────────┐
│               CPU Profiling Pipeline                 │
│                                                      │
│  pprof.StartCPUProfile()                             │
│          ↓                                           │
│  Go runtime sets SIGPROF signal handler              │
│  (via setitimer / timer_create / timer_settime)      │
│          ↓                                           │
│  SIGPROF fires every 10ms (100Hz, fixed rate)        │
│          ↓                                           │
│  Kernel delivers signal to a running goroutine       │
│          ↓                                           │
│  sigProfHandler captures goroutine stack trace       │
│          ↓                                           │
│  Stack written to profBuf                            │
│  (lock-free single-writer / single-reader ring buf)  │
│          ↓                                           │
│  profileWriter goroutine reads profBuf               │
│          ↓                                           │
│  Results aggregated into profMap (hashmap)           │
│          ↓                                           │
│  pprof.StopCPUProfile() → output .prof file          │
└──────────────────────────────────────────────────────┘

Key details:

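Sampling rate: 100 Hz (one sample every 10 ms). pprof.StartCPUProfile always requests this rate
Only goroutines that are executing on a CPU are sampled. Goroutines parked on I/O, channels, or locks use no CPU time, so they never appear (Go's network I/O is non-blocking and parks the goroutine instead of the thread)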
Each captured stack can be tagged with a custom label for later filtering
The lock-free profBuf structure (runtime/profbuf.go) ensures minimal overhead during signal handling

Note: Because I/O-waiting goroutines are excluded, CPU profiling alone won't reveal I/O bottlenecks. Use fgprof, a third-party profiler built on runtime.GoroutineProfile, to capture both running and waiting goroutines.

4. How Heap Profiling Works

Heap profiling also uses a stack trace + statistics model, but instead of a timer, it hooks directly into the memory allocator.

Memory allocation path
        ↓
Heap profiler intercepts allocation
(samples every 512KB allocated by default)
        ↓
Captures current stack trace
        ↓
Aggregates samples → per-function allocation counts

Key metrics:

Metric	Meaning
`alloc_space`	Total bytes allocated (cumulative)
`alloc_objects`	Total objects allocated (cumulative)
`inuse_space`	Bytes currently in use
`inuse_objects`	Objects currently in use

Formula: inuse = alloc - free

Because heap profiling is also sampled (by default, about one sample per 512 KB allocated), the reported sizes are statistical estimates scaled up from the samples, not exact byte counts. The relative proportions are still accurate enough to locate hotspots.
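A heap profile can be captured the same way. The sketch below assumes a hypothetical allocate helper that keeps its buffers reachable so they count as in-use memory, and it writes the profile to memory rather than to a file.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// retained keeps allocations reachable so they are reported as in-use memory.
var retained [][]byte

// allocate retains n buffers of size bytes each.
func allocate(n, size int) {
	for i := 0; i < n; i++ {
		retained = append(retained, make([]byte, size))
	}
}

// heapProfileSize allocates some memory, writes a heap profile into a
// buffer, and returns the size of the encoded profile in bytes.
func heapProfileSize() (int, error) {
	allocate(16, 1<<20) // 16 MiB in total, well above the sampling interval
	runtime.GC()        // heap profiles reflect the most recently completed GC cycle
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

func main() {
	// MemProfileRate is the average number of bytes allocated between samples.
	fmt.Println("sampling interval in bytes:", runtime.MemProfileRate)
	n, err := heapProfileSize()
	if err != nil {
		fmt.Println("could not write heap profile:", err)
		return
	}
	fmt.Printf("captured %d bytes of heap profile data\n", n)
}
```

When the profile is saved to a file, go tool pprof -sample_index=inuse_space (or alloc_space, alloc_objects, inuse_objects) selects which of the four metrics to display.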

5. Other Profiling Types

Goroutine Profiling

Captures the call stack of every live user goroutine, whether it is running, runnable, or blocked (system goroutines started by the runtime are excluded).

stop the world
    → iterate allg slice
    → output stack trace for each goroutine
start the world
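To see such a snapshot, the sketch below parks one goroutine on a channel and then dumps the goroutine profile as text. The blockedWorker function is a hypothetical placeholder; its frames should be listed in the output alongside main.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// blockedWorker signals that it has started, then waits until release is closed.
func blockedWorker(started chan<- struct{}, release <-chan struct{}) {
	started <- struct{}{}
	<-release
}

// goroutineDump starts one waiting goroutine and returns a text
// snapshot of all current goroutine stacks.
func goroutineDump() (string, error) {
	started := make(chan struct{})
	release := make(chan struct{})
	go blockedWorker(started, release)
	<-started            // make sure the worker exists before taking the snapshot
	defer close(release) // let the worker exit afterwards

	var buf bytes.Buffer
	// debug=1 produces a human-readable format with identical stacks grouped together.
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 1); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := goroutineDump()
	if err != nil {
		fmt.Println("could not write goroutine profile:", err)
		return
	}
	fmt.Print(out)
}
```

Passing debug=0 instead writes the compressed format that go tool pprof reads.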

Block Profiling

Samples blocking operations (channel waits, mutex waits) by duration.

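The rate passed to runtime.SetBlockProfileRate is a duration in nanoseconds: blocks at least that long are always recorded, shorter ones are sampled proportionally
Rate: 1 = record every block; 0 (the default) turns block profiling off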

Lock Contention Profiling

Samples mutex contention — how often locks are contested and for how long.

Rate: 1 = record every contention event; a rate of n records about 1 in n (set with runtime.SetMutexProfileFraction)
Uses the same report/aggregate pattern as block profiling
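Both profiles are off by default and have to be enabled in code. The following sketch turns them on at rate 1, forces one channel wait and one contended lock, and reports how many distinct stacks each profile recorded. The contend function and its timing are assumptions chosen to make contention likely in a toy program.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// contend forces a channel wait and a contended mutex acquisition, then
// returns the number of distinct stacks in the block and mutex profiles.
func contend() (blockStacks, mutexStacks int) {
	runtime.SetBlockProfileRate(1)     // record every blocking event
	runtime.SetMutexProfileFraction(1) // record every contention event
	defer runtime.SetBlockProfileRate(0)
	defer runtime.SetMutexProfileFraction(0)

	var mu sync.Mutex
	var wg sync.WaitGroup
	ready := make(chan struct{})

	mu.Lock()
	wg.Add(1)
	go func() {
		defer wg.Done()
		ready <- struct{}{}
		mu.Lock() // contended: the caller still holds the lock
		mu.Unlock()
	}()
	<-ready                           // channel wait, visible in the block profile
	time.Sleep(20 * time.Millisecond) // give the goroutine time to queue on mu
	mu.Unlock()                       // contention is attributed to this unlock
	wg.Wait()

	return pprof.Lookup("block").Count(), pprof.Lookup("mutex").Count()
}

func main() {
	b, m := contend()
	fmt.Printf("block profile stacks: %d, mutex profile stacks: %d\n", b, m)
}
```

Rate 1 is convenient for a demonstration. In a busy service a larger value keeps the overhead lower, at the cost of coarser data.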

6. How to Read a Flame Graph

A flame graph is the most intuitive way to visualize profiling data. Here's how to interpret it:

┌─────────────────────────────────────────────────────┐
│                    Flame Graph                      │
│                                                     │
│   [narrow]  encodeJSON  [narrow]                    │  ← top: currently on CPU
│   [    processRequest         ]                     │
│   [         handleHTTP                    ]         │
│   [              ServeHTTP                     ]    │
│   [                   main                         ]│  ← bottom: entry point
│                                                     │
│  ← call order: bottom to top                        │
│  ← width = time proportion (wider = more CPU)       │
│  ← color has no special meaning                     │
└─────────────────────────────────────────────────────┘

Reading rules:

Axis	Meaning
Vertical (Y)	Call stack depth — bottom is the entry point, top is what's running on CPU
Horizontal (X)	Alphabetically sorted, merged call stacks — not time order
Width of a block	Proportion of samples — wider = more CPU time = likely bottleneck
Color	No special meaning — just for visual contrast

Focus on wide blocks near the top — these are the functions consuming the most CPU and are your primary optimization targets.
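The merging and width rules can be made concrete with a small model. This is an illustrative sketch, not how pprof is implemented: stacks are plain string slices (root first), demoSamples is invented data that mirrors the diagram above, and width is simplified to the share of samples in which a function appears.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// demoSamples is made-up data: four sampled stacks, listed root first.
var demoSamples = [][]string{
	{"main", "ServeHTTP", "handleHTTP", "processRequest", "encodeJSON"},
	{"main", "ServeHTTP", "handleHTTP", "processRequest", "encodeJSON"},
	{"main", "ServeHTTP", "handleHTTP", "processRequest"},
	{"main", "ServeHTTP"},
}

// fold merges identical stacks and counts how often each one was sampled.
func fold(samples [][]string) map[string]int {
	counts := make(map[string]int)
	for _, stack := range samples {
		counts[strings.Join(stack, ";")]++
	}
	return counts
}

// width is the share of samples whose stack includes fn (the block's width).
func width(samples [][]string, fn string) float64 {
	if len(samples) == 0 {
		return 0
	}
	hits := 0
	for _, stack := range samples {
		for _, frame := range stack {
			if frame == fn {
				hits++
				break // count each sample once, even with recursion
			}
		}
	}
	return float64(hits) / float64(len(samples))
}

// selfWidth is the share of samples in which fn is the top frame, i.e. on CPU.
func selfWidth(samples [][]string, fn string) float64 {
	if len(samples) == 0 {
		return 0
	}
	hits := 0
	for _, stack := range samples {
		if len(stack) > 0 && stack[len(stack)-1] == fn {
			hits++
		}
	}
	return float64(hits) / float64(len(samples))
}

func main() {
	folded := fold(demoSamples)
	keys := make([]string, 0, len(folded))
	for k := range folded {
		keys = append(keys, k)
	}
	sort.Strings(keys) // alphabetical order, as on the X axis
	for _, k := range keys {
		fmt.Println(k, folded[k])
	}
	for _, fn := range []string{"main", "processRequest", "encodeJSON"} {
		fmt.Printf("%s: width %.2f, self %.2f\n", fn, width(demoSamples, fn), selfWidth(demoSamples, fn))
	}
}
```

In this model main spans the full width but has no self time, while encodeJSON is narrower yet is where the CPU is actually spent, which is why the wide blocks near the top deserve attention first.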

7. go tool trace — When pprof Isn't Enough

pprof tells you what is using CPU. But it can't tell you why a goroutine isn't running. For that, use go tool trace.

Possible reasons a goroutine isn't running:

Blocked on a syscall
Blocked on a channel or mutex
Blocked by the GC (STW)
Not scheduled by the runtime

go tool trace captures these events with nanosecond-level precision:

Event Category	Examples
Goroutine lifecycle	create / block / unblock
Syscall	enter / exit / block
GC events	mark start, STW, sweep
Heap	allocation / free size changes
Processor	start / stop

Trace UI panels:

Timeline     → execution time axis (zoomable)
─────────────────────────────────────────────
Heap         → memory alloc/free over time (line chart)
Goroutines   → GCWaiting | Runnable | Running counts
Threads      → InSyscall | Running counts
─────────────────────────────────────────────
P0 ~ Pn      → one row per virtual processor (GOMAXPROCS)
               shows which goroutine ran on each P
               click a goroutine → stack trace + related events
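Collecting a trace takes only a few calls to runtime/trace. The sketch below is a minimal example under assumptions: the task name batch and region name work are arbitrary, and the trace is written to memory rather than to a file.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"runtime/trace"
	"sync"
)

// traceWorkers records an execution trace while n goroutines do some
// work and returns the size of the recorded trace in bytes.
func traceWorkers(n int) (int, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return 0, err // for example, tracing is already enabled
	}

	// A task groups related work across goroutines in the trace UI.
	ctx, task := trace.NewTask(context.Background(), "batch")
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// A region marks a named time span on this goroutine.
			trace.WithRegion(ctx, "work", func() {
				sum := 0
				for j := 0; j < 1_000_000; j++ {
					sum += j
				}
				_ = sum
			})
		}()
	}
	wg.Wait()
	task.End()
	trace.Stop() // flushes all pending events to buf
	return buf.Len(), nil
}

func main() {
	n, err := traceWorkers(4)
	if err != nil {
		fmt.Println("could not start trace:", err)
		return
	}
	fmt.Printf("captured %d bytes of trace data\n", n)
}
```

Writing the same data to a file such as trace.out and running go tool trace trace.out opens the panels described above in a browser.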

8. Quick Reference: Profiling Types Summary

Type	What It Captures	Sampling Rate	Trigger
CPU	Function call time	100Hz (10ms)	SIGPROF signal
Heap	Alloc/inuse memory	Every 512KB	Memory allocator hook
Goroutine	All live goroutine stacks (running and blocked)	On demand	STW snapshot
ThreadCreate	OS thread creation stacks	On demand	STW snapshot
Block	Blocking op duration	Threshold-based	Block event hook
Mutex	Lock contention duration	Ratio-based	Lock event hook
Trace	All runtime events	Continuous	Event instrumentation

Summary

Use pprof CPU profiling to find functions consuming the most CPU time
Use heap profiling to locate memory allocation hotspots
Use flame graphs to visually identify wide, top-level blocks as bottlenecks
Use block/mutex profiling to diagnose concurrency contention
Use go tool trace when you need to understand why goroutines aren't running

Next in this series: Go Heap Memory Allocation: tcmalloc, Mutator/Allocator & Multi-Level Cache (Part 2)

If this breakdown of Go's profiling internals was useful, follow the series for deeper dives into the Go runtime — scheduler, GC, memory allocator, and more.

DEV Community