Performance optimization in Go isn't guesswork — it's a systematic process backed by data. In this article, we'll walk through the full optimization workflow, explain how pprof works internally, and show you how to read flame graphs to locate bottlenecks with confidence.
1. The Performance Optimization Workflow
Before touching any code, follow this four-step process:
1. Understand the code → Clarify the common logic and usage scenarios
↓
2. Write benchmarks → Simulate realistic traffic and workloads
↓
3. Collect data → Use pprof or flame graphs to capture runtime behavior
↓
4. Optimize hotspots → Focus on the functions with the highest relative cost
Rule of thumb: Optimization closer to the application layer (e.g. caching, async logic) typically yields ms-level improvements. Code-level micro-optimizations yield µs-level gains. Always start with the bigger wins.
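As a minimal sketch of step 2, the program below benchmarks two ways of building a string. The functions concatPlus and concatBuilder are hypothetical stand-ins for real application code, and testing.Benchmark is used only so the example runs as a plain program; in a real project the same loop would normally live in a _test.go file and run with go test -bench.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// concatPlus builds a string with repeated +=, which reallocates as it grows.
func concatPlus(n int) string {
	s := ""
	for i := 0; i < n; i++ {
		s += "x"
	}
	return s
}

// concatBuilder builds the same string with strings.Builder.
func concatBuilder(n int) string {
	var b strings.Builder
	for i := 0; i < n; i++ {
		b.WriteString("x")
	}
	return b.String()
}

func main() {
	cases := []struct {
		name string
		fn   func(int) string
	}{
		{"plus", concatPlus},
		{"builder", concatBuilder},
	}
	for _, c := range cases {
		// testing.Benchmark picks b.N automatically and returns the timings.
		res := testing.Benchmark(func(b *testing.B) {
			b.ReportAllocs()
			for i := 0; i < b.N; i++ {
				_ = c.fn(1000)
			}
		})
		fmt.Printf("%-8s %s %s\n", c.name, res.String(), res.MemString())
	}
}
```

The input size of 1000 is arbitrary; a useful benchmark should use sizes and data shapes taken from real traffic.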
2. What Is pprof?
pprof is Go's built-in tool for visualizing and analyzing performance profiling data. It collects runtime data — goroutine stacks, memory allocations, CPU usage — and lets you identify exactly where your program spends its time and memory.
Two Types of Profilers
| Type | How It Works | Examples |
|---|---|---|
| Sampling profiler | Measures at regular time intervals | Go CPU profiler |
| Tracing profiler | Fires on specific events (function call, lock, GC) | Go execution tracer |
A sampling profiler has two core components:
- Sampler — a callback triggered at fixed intervals that captures the current stack trace
- Data collector — aggregates all captured stack traces into a statistical summary (call counts, memory sizes, etc.)
3. How CPU Profiling Works
Go's CPU profiler uses a stack trace + statistics model.
┌──────────────────────────────────────────────────────┐
│ CPU Profiling Pipeline │
│ │
│ pprof.StartCPUProfile() │
│ ↓ │
│ Go runtime sets SIGPROF signal handler │
│ (via setitimer / timer_create / timer_settime) │
│ ↓ │
│ SIGPROF fires every 10ms (100Hz, fixed rate) │
│ ↓ │
│ Kernel delivers signal to a thread running Go code │
│ ↓ │
│ sigProfHandler captures goroutine stack trace │
│ ↓ │
│ Stack written to profBuf │
│ (lock-free single-writer / single-reader ring buf) │
│ ↓ │
│ profileWriter goroutine reads profBuf │
│ ↓ │
│ Results aggregated into profMap (hashmap) │
│ ↓ │
│ pprof.StopCPUProfile() → output .prof file │
└──────────────────────────────────────────────────────┘
Key details:
- Sampling rate: 100Hz (every 10ms). pprof.StartCPUProfile hard-codes this value, so it is effectively fixed in normal use
- Only running goroutines are captured. Goroutines blocked on I/O are not counted (Go uses non-blocking I/O)
- Each captured stack can be tagged with a custom label for later filtering
- The lock-free profBuf structure (runtime/profbuf.go) ensures minimal overhead during signal handling
Note: Because I/O-waiting goroutines are excluded, CPU profiling alone won't reveal I/O bottlenecks. Use fgprof (which calls runtime.GoroutineProfile) to capture both running and waiting goroutines.
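The pipeline above is driven from user code by two calls, pprof.StartCPUProfile and pprof.StopCPUProfile. The following sketch wraps them in a hypothetical helper, profileCPU, and tags the work with a label via pprof.Do. The busyWork function and the "phase" label are illustrative assumptions, not part of any real workload.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"runtime/pprof"
)

var sink uint64 // keeps the compiler from discarding busyWork's result

// busyWork burns CPU with a simple FNV-style mixing loop.
func busyWork(n int) uint64 {
	var h uint64 = 14695981039346656037
	for i := 0; i < n; i++ {
		h ^= uint64(i)
		h *= 1099511628211
	}
	return h
}

// profileCPU runs work under the CPU profiler and returns the raw profile.
func profileCPU(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err // for example, a profile is already running
	}
	work()
	pprof.StopCPUProfile() // flushes the remaining samples into buf
	return buf.Bytes(), nil
}

func main() {
	data, err := profileCPU(func() {
		// Samples taken inside this function carry the label phase=hash.
		labels := pprof.Labels("phase", "hash")
		pprof.Do(context.Background(), labels, func(ctx context.Context) {
			sink = busyWork(200_000_000)
		})
	})
	if err != nil {
		fmt.Println("profiling failed:", err)
		return
	}
	fmt.Printf("captured %d bytes of profile data\n", len(data))
}
```

In practice the profile is usually written to a file and opened with go tool pprof; the in-memory buffer here just keeps the sketch self-contained.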
4. How Heap Profiling Works
Heap profiling also uses a stack trace + statistics model, but instead of a timer, it hooks directly into the memory allocator.
Memory allocation path
↓
Heap profiler intercepts allocation
(samples every 512KB allocated by default)
↓
Captures current stack trace
↓
Aggregates samples → per-function allocation counts
Key metrics:
| Metric | Meaning |
|---|---|
| alloc_space | Total bytes allocated (cumulative) |
| alloc_objects | Total objects allocated (cumulative) |
| inuse_space | Bytes currently in use |
| inuse_objects | Objects currently in use |
Formula:
inuse = alloc - free
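Because heap profiling is also sampled (default: on average once per 512KB allocated), the reported sizes are statistical estimates scaled up from the samples rather than exact totals, and small or infrequent allocations may not appear at all. The relative proportions are still accurate enough to locate hotspots.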
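To make the inuse = alloc - free relationship concrete, here is a small sketch that reads the runtime's sampled allocation records directly with runtime.MemProfile. The helper heapTotals and the retained slice are hypothetical names, and setting runtime.MemProfileRate to 1 (sample every allocation) is an assumption made for demonstration only; it is too expensive for production use.

```go
package main

import (
	"fmt"
	"runtime"
)

var retained [][]byte // keeps some allocations alive so they stay "in use"

// heapTotals sums the sampled allocation records across all call stacks.
func heapTotals() (alloc, free, inuse int64) {
	runtime.GC() // the profile reflects the heap as of the last completed GC

	// Grow the record slice until the whole profile fits.
	var records []runtime.MemProfileRecord
	n, ok := runtime.MemProfile(nil, true)
	for !ok {
		records = make([]runtime.MemProfileRecord, n+50)
		n, ok = runtime.MemProfile(records, true)
	}
	for _, r := range records[:n] {
		alloc += r.AllocBytes
		free += r.FreeBytes
		inuse += r.InUseBytes() // defined as AllocBytes - FreeBytes
	}
	return alloc, free, inuse
}

func main() {
	runtime.MemProfileRate = 1 // assumption: sample everything, demo only

	for i := 0; i < 100; i++ {
		retained = append(retained, make([]byte, 64<<10)) // stays reachable
		_ = make([]byte, 64<<10)                          // becomes garbage
	}

	alloc, free, inuse := heapTotals()
	fmt.Printf("alloc=%d free=%d inuse=%d\n", alloc, free, inuse)
}
```

The same data is what pprof.Lookup("heap") serializes; go tool pprof then presents it under the four metric names in the table above.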
5. Other Profiling Types
Goroutine Profiling
Captures the call stack of every live user goroutine, whether it is running, runnable, or blocked (runtime-internal system goroutines are excluded).
stop the world
→ iterate allg slice
→ output stack trace for each goroutine
start the world
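A goroutine snapshot can be taken on demand through pprof.Lookup("goroutine"). The sketch below starts a few goroutines that wait on a channel and asks the profile how many goroutines it sees; countWithBlocked is a hypothetical helper written for this illustration.

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
)

// countWithBlocked starts n goroutines that wait on a channel and returns
// the number of goroutines the profile reports while they are still alive.
func countWithBlocked(n int) int {
	release := make(chan struct{})
	for i := 0; i < n; i++ {
		go func() { <-release }()
	}
	count := pprof.Lookup("goroutine").Count()
	close(release) // let the waiting goroutines exit
	return count
}

func main() {
	fmt.Println("goroutines seen:", countWithBlocked(3))

	// debug=1 prints the stacks as text, grouped by identical call stack.
	if err := pprof.Lookup("goroutine").WriteTo(os.Stdout, 1); err != nil {
		fmt.Println("write failed:", err)
	}
}
```

The count includes the calling goroutine, so it is at least one higher than the number of waiters.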
Block Profiling
Samples blocking operations (channel waits, mutex waits) by duration.
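- Configured with runtime.SetBlockProfileRate(rate): the profiler aims to sample, on average, one blocking event per rate nanoseconds spent blocked
- Rate: 1 = record every block; a rate of 0 or less turns the profile off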
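As a rough sketch of block profiling in use, the helper below (blockedStacks, a hypothetical name) enables the profile at rate 1, forces one channel wait, and reports how many distinct blocking stacks were recorded. The 20ms delay is arbitrary.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"time"
)

// blockedStacks enables block profiling, blocks once on a channel, and
// returns the number of distinct stacks in the block profile.
func blockedStacks() int {
	runtime.SetBlockProfileRate(1)       // record every blocking event
	defer runtime.SetBlockProfileRate(0) // switch the profile off again

	done := make(chan struct{})
	go func() {
		time.Sleep(20 * time.Millisecond)
		close(done)
	}()
	<-done // the receive blocks until the channel is closed

	return pprof.Lookup("block").Count()
}

func main() {
	fmt.Println("blocking stacks recorded:", blockedStacks())
}
```

Rate 1 is convenient for a demo but adds overhead to every blocking operation; a larger rate is the usual choice for a running service.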
Lock Contention Profiling
Samples mutex contention — how often locks are contested and for how long.
- Rate: 1 = record every contention event; a rate of n records roughly 1 in n (set with runtime.SetMutexProfileFraction)
- Uses the same report/aggregate pattern as block profiling
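The mutex profile can be exercised the same way. In this sketch, contendedStacks (a hypothetical helper) holds a sync.Mutex while a second goroutine tries to acquire it, then checks that the contention was recorded. The sleep is only there to make sure the second goroutine is actually waiting before the lock is released.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// contendedStacks forces one contended Lock and returns the number of
// distinct stacks in the mutex profile.
func contendedStacks() int {
	prev := runtime.SetMutexProfileFraction(1) // record every contention event
	defer runtime.SetMutexProfileFraction(prev)

	var mu sync.Mutex
	var wg sync.WaitGroup

	mu.Lock()
	wg.Add(1)
	go func() {
		defer wg.Done()
		mu.Lock() // waits here because the lock is already held
		mu.Unlock()
	}()

	time.Sleep(20 * time.Millisecond) // give the waiter time to queue up
	mu.Unlock()                       // contention is attributed to this Unlock
	wg.Wait()

	return pprof.Lookup("mutex").Count()
}

func main() {
	fmt.Println("contended stacks recorded:", contendedStacks())
}
```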
6. How to Read a Flame Graph
A flame graph is the most intuitive way to visualize profiling data. Here's how to interpret it:
┌─────────────────────────────────────────────────────┐
│ Flame Graph │
│ │
│ [narrow] encodeJSON [narrow] │ ← top: currently on CPU
│ [ processRequest ] │
│ [ handleHTTP ] │
│ [ ServeHTTP ] │
│ [ main ]│ ← bottom: entry point
│ │
│ ← call order: bottom to top │
│ ← width = time proportion (wider = more CPU) │
│ ← color has no special meaning │
└─────────────────────────────────────────────────────┘
Reading rules:
| Axis | Meaning |
|---|---|
| Vertical (Y) | Call stack depth — bottom is the entry point, top is what's running on CPU |
| Horizontal (X) | Alphabetically sorted, merged call stacks — not time order |
| Width of a block | Proportion of samples — wider = more CPU time = likely bottleneck |
| Color | No special meaning — just for visual contrast |
Focus on wide blocks near the top — these are the functions consuming the most CPU and are your primary optimization targets.
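The width rule can be illustrated with a toy calculation. The sketch below is not how pprof renders flame graphs internally; it only assumes each sample is a call stack listed from entry point to leaf, and counts how many samples pass through every call path. The function widths and the sample data are made up for this example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// widths returns, for every call-path prefix, how many samples pass
// through it. A frame's width in the graph is proportional to this count.
func widths(samples [][]string) map[string]int {
	w := make(map[string]int)
	for _, stack := range samples {
		for depth := 1; depth <= len(stack); depth++ {
			w[strings.Join(stack[:depth], ";")]++
		}
	}
	return w
}

func main() {
	// Hypothetical samples, each ordered from entry point to leaf.
	samples := [][]string{
		{"main", "ServeHTTP", "handleHTTP", "encodeJSON"},
		{"main", "ServeHTTP", "handleHTTP", "encodeJSON"},
		{"main", "ServeHTTP", "handleHTTP", "parseQuery"},
		{"main", "ServeHTTP", "logRequest"},
	}

	w := widths(samples)
	paths := make([]string, 0, len(w))
	for p := range w {
		paths = append(paths, p)
	}
	sort.Strings(paths) // sorted by name, like the X axis, not by time

	for _, p := range paths {
		fmt.Printf("%-45s %3d%%\n", p, 100*w[p]/len(samples))
	}
}
```

Identical stacks collapse into one path with a higher count, which is why a function that shows up in many samples appears as a single wide block rather than many narrow ones.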
7. go tool trace — When pprof Isn't Enough
pprof tells you what is using CPU. But it can't tell you why a goroutine isn't running. For that, use go tool trace.
Possible reasons a goroutine isn't running:
- Blocked on a syscall
- Blocked on a channel or mutex
- Blocked by the GC (STW)
- Not scheduled by the runtime
go tool trace captures these events with nanosecond-level precision:
| Event Category | Examples |
|---|---|
| Goroutine lifecycle | create / block / unblock |
| Syscall | enter / exit / block |
| GC events | mark start, STW, sweep |
| Heap | allocation / free size changes |
| Processor | start / stop |
Trace UI panels:
Timeline → execution time axis (zoomable)
─────────────────────────────────────────────
Heap → memory alloc/free over time (line chart)
Goroutines → GCWaiting | Runnable | Running counts
Threads → InSyscall | Running counts
─────────────────────────────────────────────
P0 ~ Pn → one row per virtual processor (GOMAXPROCS)
shows which goroutine ran on each P
click a goroutine → stack trace + related events
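Trace data is produced with the runtime/trace package. The following sketch wraps trace.Start and trace.Stop in a hypothetical helper, captureTrace, and marks each worker with a named region so it is easier to find in the UI. The worker count, the "worker" region name, and the sleep are illustrative assumptions.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"runtime/trace"
	"sync"
	"time"
)

// captureTrace records an execution trace while work runs and returns
// the raw trace bytes.
func captureTrace(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return nil, err // for example, tracing is already enabled
	}
	work()
	trace.Stop() // flushes buffered events into buf
	return buf.Bytes(), nil
}

func main() {
	data, err := captureTrace(func() {
		ctx := context.Background()
		var wg sync.WaitGroup
		for i := 0; i < 4; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				// The region groups this goroutine's events under one name.
				trace.WithRegion(ctx, "worker", func() {
					time.Sleep(5 * time.Millisecond)
				})
			}()
		}
		wg.Wait()
	})
	if err != nil {
		fmt.Println("tracing failed:", err)
		return
	}
	fmt.Printf("captured %d bytes of trace data\n", len(data))
}
```

Writing the bytes to a file such as trace.out and running go tool trace trace.out opens the panels described above.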
8. Quick Reference: Profiling Types Summary
| Type | What It Captures | Sampling Rate | Trigger |
|---|---|---|---|
| CPU | Function call time | 100Hz (10ms) | SIGPROF signal |
| Heap | Alloc/inuse memory | Every 512KB | Memory allocator hook |
| Goroutine | All live goroutine stacks | On demand | STW snapshot |
| ThreadCreate | OS thread creation stacks | On demand | STW snapshot |
| Block | Blocking op duration | Rate-based (ns spent blocked) | Block event hook |
| Mutex | Lock contention duration | Ratio-based | Lock event hook |
| Trace | All runtime events | Continuous | Event instrumentation |
Summary
- Use pprof CPU profiling to find functions consuming the most CPU time
- Use heap profiling to locate memory allocation hotspots
- Use flame graphs to visually identify wide, top-level blocks as bottlenecks
- Use block/mutex profiling to diagnose concurrency contention
- Use go tool trace when you need to understand why goroutines aren't running
Next in this series: Go Heap Memory Allocation: tcmalloc, Mutator/Allocator & Multi-Level Cache (Part 2)
If this breakdown of Go's profiling internals was useful, follow the series for deeper dives into the Go runtime — scheduler, GC, memory allocator, and more.