James Lee

Go Performance Optimization: pprof, Flame Graphs & Hotspot Profiling

Performance optimization in Go isn't guesswork — it's a systematic process backed by data. In this article, we'll walk through the full optimization workflow, explain how pprof works internally, and show you how to read flame graphs to locate bottlenecks with confidence.


1. The Performance Optimization Workflow

Before touching any code, follow this four-step process:

1. Understand the code → Clarify the common logic and usage scenarios
        ↓
2. Write benchmarks → Simulate realistic traffic and workloads
        ↓
3. Collect data → Use pprof or flame graphs to capture runtime behavior
        ↓
4. Optimize hotspots → Focus on the functions with the highest relative cost
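Step 2 of the workflow might look like the sketch below. The `payload` type and the `encode` function are hypothetical stand-ins for whatever hot path you want to measure. The benchmark has the standard signature, and `testing.Benchmark` is used here only so the example runs as a plain program.

```go
package main

import (
	"encoding/json"
	"fmt"
	"testing"
)

// payload is a hypothetical request body, invented for this illustration.
type payload struct {
	ID   int      `json:"id"`
	Name string   `json:"name"`
	Tags []string `json:"tags"`
}

// encode is the hypothetical hot path under test.
func encode(p payload) ([]byte, error) {
	return json.Marshal(p)
}

// BenchmarkEncode has the standard benchmark signature, so it could also
// live in a _test.go file and run under `go test -bench=.`.
func BenchmarkEncode(b *testing.B) {
	p := payload{ID: 1, Name: "demo", Tags: []string{"a", "b", "c"}}
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := encode(p); err != nil {
			b.Fatal(err)
		}
	}
}

func main() {
	// testing.Benchmark runs a benchmark function outside of `go test`.
	res := testing.Benchmark(BenchmarkEncode)
	fmt.Println(res.String(), res.MemString())
}
```

When the benchmark lives in a `_test.go` file, `go test -bench=. -cpuprofile cpu.prof -memprofile mem.prof` collects the data for step 3 in the same run.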

Rule of thumb: Optimization closer to the application layer (e.g. caching, async logic) typically yields ms-level improvements. Code-level micro-optimizations yield µs-level gains. Always start with the bigger wins.


2. What Is pprof?

pprof is Go's built-in tool for visualizing and analyzing performance profiling data. It collects runtime data — goroutine stacks, memory allocations, CPU usage — and lets you identify exactly where your program spends its time and memory.
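As a quick orientation, the following sketch lists the profiles that the `runtime/pprof` package exposes by name. The helper `profileNames` is illustrative, not part of the library. The CPU profile is not in this list because it is started and stopped explicitly rather than looked up.

```go
package main

import (
	"fmt"
	"runtime/pprof"
)

// profileNames returns the names of all profiles registered with runtime/pprof.
func profileNames() []string {
	var names []string
	for _, p := range pprof.Profiles() {
		names = append(names, p.Name())
	}
	return names
}

func main() {
	for _, p := range pprof.Profiles() {
		// Count reports how many stacks the profile currently holds.
		fmt.Printf("%-14s %d stacks\n", p.Name(), p.Count())
	}
	fmt.Println(profileNames())
}
```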

Two Types of Profilers

| Type | How It Works | Examples |
|------|--------------|----------|
| Sampling profiler | Measures at regular time intervals | Go CPU profiler |
| Tracing profiler | Fires on specific events (function call, lock, GC) | Go execution tracer |

A sampling profiler has two core components:

  • Sampler — a callback triggered at fixed intervals that captures the current stack trace
  • Data collector — aggregates all captured stack traces into a statistical summary (call counts, memory sizes, etc.)

3. How CPU Profiling Works

Go's CPU profiler uses a stack trace + statistics model.

┌──────────────────────────────────────────────────────┐
│               CPU Profiling Pipeline                 │
│                                                      │
│  pprof.StartCPUProfile()                             │
│          ↓                                           │
│  Go runtime sets SIGPROF signal handler              │
│  (via setitimer / timer_create / timer_settime)      │
│          ↓                                           │
│  SIGPROF fires every 10ms (100Hz, fixed rate)        │
│          ↓                                           │
│  Kernel delivers signal to a running thread          │
│          ↓                                           │
│  sigProfHandler captures goroutine stack trace       │
│          ↓                                           │
│  Stack written to profBuf                            │
│  (lock-free single-writer / single-reader ring buf)  │
│          ↓                                           │
│  profileWriter goroutine reads profBuf               │
│          ↓                                           │
│  Results aggregated into profMap (hashmap)           │
│          ↓                                           │
│  pprof.StopCPUProfile() → output .prof file          │
└──────────────────────────────────────────────────────┘

Key details:

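  • Sampling rate: 100 Hz (one sample per 10 ms of CPU time). pprof.StartCPUProfile always requests this rate
  • Only goroutines that are executing on a CPU are sampled. Goroutines parked while waiting for I/O, channels, or locks use no CPU time and so do not appear (Go's network I/O is non-blocking underneath, so a waiting goroutine is descheduled rather than spinning)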
  • Each captured stack can be tagged with a custom label for later filtering
  • The lock-free profBuf structure (runtime/profbuf.go) ensures minimal overhead during signal handling

Note: Because I/O-waiting goroutines are excluded, CPU profiling alone won't reveal I/O bottlenecks. Use fgprof (which calls runtime.GoroutineProfile) to capture both running and waiting goroutines.
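The pipeline above can be exercised with a minimal sketch like the one below. `busyWork` and `profileCPU` are hypothetical helpers, and the `endpoint` label is an arbitrary example of tagging samples. The profile is written to an in-memory buffer to keep the example self-contained; in practice you would usually write to a file.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log"
	"runtime/pprof"
)

// busyWork is a hypothetical CPU-bound function that gives the sampler something to see.
func busyWork(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

// profileCPU runs fn under the CPU profiler and returns the encoded profile.
func profileCPU(fn func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	fn()
	pprof.StopCPUProfile() // blocks until all buffered samples are written to buf
	return buf.Bytes(), nil
}

func main() {
	data, err := profileCPU(func() {
		// Samples taken while this callback runs carry the label endpoint=/demo.
		labels := pprof.Labels("endpoint", "/demo")
		pprof.Do(context.Background(), labels, func(context.Context) {
			busyWork(200_000_000)
		})
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("captured %d bytes of profile data\n", len(data))
}
```

Writing the same bytes to a file such as `cpu.prof` lets you open the result with `go tool pprof cpu.prof`.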


4. How Heap Profiling Works

Heap profiling also uses a stack trace + statistics model, but instead of a timer, it hooks directly into the memory allocator.

Memory allocation path
        ↓
Heap profiler intercepts allocation
(samples every 512KB allocated by default)
        ↓
Captures current stack trace
        ↓
Aggregates samples → per-function allocation counts

Key metrics:

| Metric | Meaning |
|--------|---------|
| alloc_space | Total bytes allocated (cumulative) |
| alloc_objects | Total objects allocated (cumulative) |
| inuse_space | Bytes currently in use |
| inuse_objects | Objects currently in use |

Formula: inuse = alloc - free

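Because heap profiling is also sampled (by default, on average once per 512KB allocated), the reported sizes are statistical estimates scaled up from the samples rather than exact totals. Small or rare allocation sites may be missed entirely, but the relative proportions are accurate enough to locate hotspots.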
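The relationship between the metrics can be checked directly against the runtime's sampled records. This is a minimal sketch: `heapTotals` and `sink` are hypothetical names, and setting `runtime.MemProfileRate` to 1 (sample every allocation) is done here only to make a tiny demo visible, not as a production setting.

```go
package main

import (
	"fmt"
	"runtime"
)

// sink keeps allocations reachable so they count as in use.
var sink [][]byte

// heapTotals sums the runtime's sampled allocation records.
func heapTotals() (alloc, free, inuse int64) {
	// Run a GC so recent allocations are more likely to be reflected.
	runtime.GC()
	n, _ := runtime.MemProfile(nil, true)
	recs := make([]runtime.MemProfileRecord, n+50)
	n, ok := runtime.MemProfile(recs, true)
	for !ok {
		// The profile grew between calls; retry with more room.
		recs = make([]runtime.MemProfileRecord, n+50)
		n, ok = runtime.MemProfile(recs, true)
	}
	for i := range recs[:n] {
		r := &recs[i]
		alloc += r.AllocBytes
		free += r.FreeBytes
		inuse += r.InUseBytes() // AllocBytes - FreeBytes
	}
	return alloc, free, inuse
}

func main() {
	// The default rate is 512 * 1024; it should be changed as early as possible.
	runtime.MemProfileRate = 1
	for i := 0; i < 100; i++ {
		sink = append(sink, make([]byte, 64<<10))
	}
	alloc, free, inuse := heapTotals()
	fmt.Printf("alloc=%d free=%d inuse=%d\n", alloc, free, inuse)
}
```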


5. Other Profiling Types

Goroutine Profiling

Captures the call stack of every live goroutine, whether it is running, runnable, or blocked. Runtime-internal system goroutines are omitted from the output.

stop the world
    → iterate allg slice
    → output stack trace for each goroutine
start the world
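A small sketch of taking a goroutine snapshot from inside a program follows. `parkedWorker`, `goroutineDump`, and `profileMentions` are hypothetical helpers. The worker is deliberately blocked on a channel to show that goroutines which are not running still appear in the dump.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"strings"
)

// parkedWorker signals that it has started, then blocks until release is closed.
func parkedWorker(release <-chan struct{}, ready chan<- struct{}) {
	ready <- struct{}{}
	<-release
}

// goroutineDump returns the goroutine profile as text.
func goroutineDump() string {
	var buf bytes.Buffer
	// debug=1 writes symbolized, human-readable stacks.
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 1); err != nil {
		return ""
	}
	return buf.String()
}

// profileMentions reports whether any goroutine stack contains name.
func profileMentions(name string) bool {
	return strings.Contains(goroutineDump(), name)
}

func main() {
	release := make(chan struct{})
	ready := make(chan struct{})
	for i := 0; i < 3; i++ {
		go parkedWorker(release, ready)
	}
	for i := 0; i < 3; i++ {
		<-ready
	}
	fmt.Println("worker found in profile:", profileMentions("parkedWorker"))
	fmt.Print(goroutineDump())
	close(release)
}
```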

Block Profiling

Samples blocking operations (channel waits, mutex waits) by duration.

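  • Controlled by runtime.SetBlockProfileRate(rate), where rate is in nanoseconds: events that block for at least that long are always recorded, and shorter ones are sampled proportionally
  • Rate 1 = record every blocking event; rate 0 (the default) disables the profile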
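A minimal sketch of enabling block profiling around a piece of code is shown below. `blockOnChannel`, `blockDemo`, and `blockEvents` are hypothetical helpers; a rate of 1 is used so that the single blocking receive in this demo is recorded.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"time"
)

// blockOnChannel makes the caller wait on a channel receive for roughly d.
func blockOnChannel(d time.Duration) {
	ch := make(chan struct{})
	go func() {
		time.Sleep(d)
		close(ch)
	}()
	<-ch // this receive is the blocking event the profiler records
}

// blockDemo is a fixed workload used by main.
func blockDemo() {
	blockOnChannel(20 * time.Millisecond)
}

// blockEvents enables block profiling, runs fn, and returns how many
// distinct blocking stacks the profile holds afterwards.
func blockEvents(fn func()) int {
	runtime.SetBlockProfileRate(1) // 1 = record every blocking event
	defer runtime.SetBlockProfileRate(0)
	fn()
	return pprof.Lookup("block").Count()
}

func main() {
	fmt.Println("blocking stacks recorded:", blockEvents(blockDemo))
}
```

Recording every event adds overhead, so a larger rate is usually a better fit for long-running services.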

Lock Contention Profiling

Samples mutex contention — how often locks are contested and for how long.

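  • Controlled by runtime.SetMutexProfileFraction(rate): on average 1 in rate contention events is reported, so rate 1 = record every contention event (uncontended lock operations are not recorded)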
  • Uses the same report/aggregate pattern as block profiling
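The same pattern applies to mutex profiling. In this sketch, `contend` and `mutexEvents` are hypothetical helpers, and the 5 ms hold time exists only to force other goroutines to wait on the lock.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// contend makes several goroutines compete for one mutex.
func contend(workers int) {
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			time.Sleep(5 * time.Millisecond) // hold the lock so others must wait
			mu.Unlock()
		}()
	}
	wg.Wait()
}

// mutexEvents enables mutex profiling, runs fn, and returns how many
// distinct contention stacks the profile holds afterwards.
func mutexEvents(fn func()) int {
	prev := runtime.SetMutexProfileFraction(1) // 1 = report every contention event
	defer runtime.SetMutexProfileFraction(prev)
	fn()
	return pprof.Lookup("mutex").Count()
}

func main() {
	n := mutexEvents(func() { contend(4) })
	fmt.Println("contention stacks recorded:", n)
}
```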

6. How to Read a Flame Graph

A flame graph is the most intuitive way to visualize profiling data. Here's how to interpret it:

┌─────────────────────────────────────────────────────┐
│                    Flame Graph                      │
│                                                     │
│   [narrow]  encodeJSON  [narrow]                    │  ← top: currently on CPU
│   [    processRequest         ]                     │
│   [         handleHTTP                    ]         │
│   [              ServeHTTP                     ]    │
│   [                   main                         ]│  ← bottom: entry point
│                                                     │
│  ← call order: bottom to top                        │
│  ← width = time proportion (wider = more CPU)       │
│  ← color has no special meaning                     │
└─────────────────────────────────────────────────────┘

Reading rules:

| Element | Meaning |
|---------|---------|
| Vertical axis (Y) | Call stack depth: the bottom is the entry point, the top is what's running on CPU |
| Horizontal axis (X) | Alphabetically sorted, merged call stacks, not time order |
| Width of a block | Proportion of samples: wider = more CPU time = likely bottleneck |
| Color | No special meaning, just visual contrast |

Focus on wide blocks near the top — these are the functions consuming the most CPU and are your primary optimization targets.
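To make the "width = proportion of samples" rule concrete, here is a toy sketch of how sampled stacks could be merged into block widths. It is not how pprof is implemented; `sample` and `frameWidths` are hypothetical names, and the four stacks are invented to mirror the diagram above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sample is one captured stack, ordered from entry point to the function on CPU.
type sample []string

// frameWidths merges stacks that share a prefix and returns, for every
// call path, the fraction of all samples that passed through it.
func frameWidths(samples []sample) map[string]float64 {
	counts := map[string]int{}
	for _, s := range samples {
		for depth := 1; depth <= len(s); depth++ {
			counts[strings.Join(s[:depth], ";")]++
		}
	}
	widths := make(map[string]float64, len(counts))
	for path, n := range counts {
		widths[path] = float64(n) / float64(len(samples))
	}
	return widths
}

func main() {
	// Hypothetical samples, invented for illustration.
	samples := []sample{
		{"main", "ServeHTTP", "handleHTTP", "processRequest", "encodeJSON"},
		{"main", "ServeHTTP", "handleHTTP", "processRequest", "encodeJSON"},
		{"main", "ServeHTTP", "handleHTTP", "processRequest"},
		{"main", "ServeHTTP", "handleHTTP"},
	}
	widths := frameWidths(samples)
	paths := make([]string, 0, len(widths))
	for p := range widths {
		paths = append(paths, p)
	}
	sort.Strings(paths) // alphabetical, like the X axis of a flame graph
	for _, p := range paths {
		fmt.Printf("%5.0f%%  %s\n", widths[p]*100, p)
	}
}
```

In this made-up data, `encodeJSON` is on top of the stack in two of four samples, so its block spans half the width of `main`.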


7. go tool trace — When pprof Isn't Enough

pprof tells you what is using CPU. But it can't tell you why a goroutine isn't running. For that, use go tool trace.

Possible reasons a goroutine isn't running:

  • Blocked on a syscall
  • Blocked on a channel or mutex
  • Blocked by the GC (STW)
  • Not scheduled by the runtime

go tool trace captures these events with nanosecond-level precision:

| Event Category | Examples |
|----------------|----------|
| Goroutine lifecycle | create / block / unblock |
| Syscall | enter / exit / block |
| GC events | mark start, STW, sweep |
| Heap | allocation / free size changes |
| Processor | start / stop |

Trace UI panels:

Timeline     → execution time axis (zoomable)
─────────────────────────────────────────────
Heap         → memory alloc/free over time (line chart)
Goroutines   → GCWaiting | Runnable | Running counts
Threads      → InSyscall | Running counts
─────────────────────────────────────────────
P0 ~ Pn      → one row per virtual processor (GOMAXPROCS)
               shows which goroutine ran on each P
               click a goroutine → stack trace + related events
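A trace can be captured from inside a program with the `runtime/trace` package. The sketch below is a minimal example under assumptions: `workload` and `traceDemo` are hypothetical helpers, and the task name "demo" and region name "consumer" are arbitrary. The trace goes to an in-memory buffer to keep the example self-contained.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log"
	"runtime/trace"
	"sync"
)

// workload is a hypothetical task: one goroutine blocks on a channel
// until the caller sends a value.
func workload(ctx context.Context) {
	ch := make(chan int)
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		// The region marks this span of the goroutine in the trace.
		trace.WithRegion(ctx, "consumer", func() { <-ch })
	}()
	ch <- 1
	wg.Wait()
}

// traceDemo records an execution trace while workload runs.
func traceDemo() ([]byte, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return nil, err
	}
	ctx, task := trace.NewTask(context.Background(), "demo")
	workload(ctx)
	task.End()
	trace.Stop() // flushes all pending events to buf
	return buf.Bytes(), nil
}

func main() {
	data, err := traceDemo()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("captured %d bytes of trace data\n", len(data))
}
```

Writing the bytes to a file such as `trace.out` and running `go tool trace trace.out` opens the UI described above.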

8. Quick Reference: Profiling Types Summary

| Type | What It Captures | Sampling Rate | Trigger |
|------|------------------|---------------|---------|
| CPU | On-CPU time per call stack | 100Hz (10ms) | SIGPROF signal |
| Heap | Alloc/inuse memory | Every 512KB (on average) | Memory allocator hook |
| Goroutine | All live goroutine stacks | On demand | STW snapshot |
| ThreadCreate | OS thread creation stacks | On demand | STW snapshot |
| Block | Blocking op duration | Threshold-based | Block event hook |
| Mutex | Lock contention duration | Ratio-based | Lock event hook |
| Trace | All runtime events | Continuous | Event instrumentation |

Summary

  • Use pprof CPU profiling to find functions consuming the most CPU time
  • Use heap profiling to locate memory allocation hotspots
  • Use flame graphs to visually identify wide, top-level blocks as bottlenecks
  • Use block/mutex profiling to diagnose concurrency contention
  • Use go tool trace when you need to understand why goroutines aren't running

Next in this series: Go Heap Memory Allocation: tcmalloc, Mutator/Allocator & Multi-Level Cache (Part 2)


If this breakdown of Go's profiling internals was useful, follow the series for deeper dives into the Go runtime — scheduler, GC, memory allocator, and more.
