What is pprof
pprof is Go's built-in profiling tool that helps you understand where your program spends CPU time, how it uses memory, where goroutines block, and more.
It's part of the standard library (runtime/pprof and net/http/pprof) and can be used both locally and in production via HTTP endpoints.
How it works under the hood
The Go runtime includes internal hooks and counters that collect low-level statistics:
- CPU - samples stack traces at regular intervals (default 100 Hz).
- Memory (heap) — records allocation data from the garbage collector.
- Block / Mutex — tracks delays caused by synchronization (e.g., sync.Mutex, channel, select).
- Goroutine — captures snapshots of all running goroutines and their stack traces.
pprof gathers this data and can:
- Write it to files (`pprof.WriteHeapProfile`, `pprof.StartCPUProfile`; see the sketch below);
- Expose it via HTTP endpoints (`net/http/pprof`);
- Export it in a format compatible with `go tool pprof`, Speedscope, or Parca.
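Besides the dedicated helpers, `runtime/pprof` also exposes the other profile types as named profiles through `pprof.Lookup`. A minimal sketch of dumping one to a file (the file name and error handling are illustrative):

```go
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// Dump the current goroutine profile to a file.
	// Other named profiles: "heap", "allocs", "block", "mutex", "threadcreate".
	f, err := os.Create("goroutine.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// debug=0 writes the compressed binary format understood by `go tool pprof`.
	if err := pprof.Lookup("goroutine").WriteTo(f, 0); err != nil {
		panic(err)
	}
}
```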
Main profile types
| Type | Focus | When to use | What it shows | How to get |
|---|---|---|---|---|
| CPU | Execution time | High CPU load, slow processing | Where CPU time is spent | `pprof.StartCPUProfile(file)` or `/debug/pprof/profile` |
| Heap | Memory usage | Memory leaks, OOM, high RAM | Memory allocations (live and temporary) | `pprof.WriteHeapProfile(file)` or `/debug/pprof/heap` |
| Goroutine | Concurrency snapshot | Deadlocks, leaks, hanging requests | Stack traces of all goroutines | `/debug/pprof/goroutine` |
| Block | Waiting time | Latency, thread stalls | Where goroutines are blocked | `/debug/pprof/block` |
| Mutex | Lock contention | Poor scalability | Where mutexes are most frequently held | `/debug/pprof/mutex` |
| Allocs | Allocation frequency | GC pressure, short-lived allocations | All memory allocations, including freed ones | `/debug/pprof/allocs` |
CPU Profile - where processing time goes
The Go runtime samples stack traces about 100 times per second during execution. This tells you which functions consume the most CPU time - i.e. where the CPU is actually being used.
When to use
- The app is slow or CPU-bound;
- You want to identify hot paths;
- You're optimizing loops, parsing, serialization, or number crunching.

#### Typical findings
- Slow `json.Marshal` in loops;
- Overuse of `fmt.Sprintf` (see the sketch below);
- Too many small allocations per iteration.
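As a sketch of the `fmt.Sprintf` finding, here is the kind of loop that tends to dominate a CPU profile, next to a cheaper variant; the function names and sizes are made up for illustration:

```go
package main

import (
	"fmt"
	"strconv"
)

// buildIDsSlow is the kind of loop a CPU profile flags:
// fmt.Sprintf runs the full formatting machinery and allocates on every iteration.
func buildIDsSlow(n int) []string {
	ids := make([]string, 0, n)
	for i := 0; i < n; i++ {
		ids = append(ids, fmt.Sprintf("id-%d", i))
	}
	return ids
}

// buildIDsFast does the same work with strconv.Itoa,
// which is noticeably cheaper in hot paths.
func buildIDsFast(n int) []string {
	ids := make([]string, 0, n)
	for i := 0; i < n; i++ {
		ids = append(ids, "id-"+strconv.Itoa(i))
	}
	return ids
}

func main() {
	_ = buildIDsSlow(1_000_000)
	_ = buildIDsFast(1_000_000)
}
```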
Command
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
How to read
- top shows the most expensive functions;
- web (or Speedscope) visualizes a flamegraph (width = time).
Heap Profile - memory usage
Shows how much memory is allocated and where allocations happen. Data is collected from the garbage collector (GC).
When to use
- Memory usage keeps growing;
- There's a memory leak;
- You need to know who allocates too often or too much.

#### Typical findings
- Temporary objects inside loops;
- Unbounded caches or slices (see the sketch below);
- Unreleased references keeping memory alive.
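A hedged sketch of the "unbounded cache" pattern: nothing ever evicts entries, so an inuse_space heap profile points straight at the function that fills the map. The names and sizes are illustrative:

```go
package main

import "strconv"

// cache grows without bound because nothing is ever evicted.
// In an inuse_space heap profile, the allocation site that fills it
// (handle below) shows up as the dominant live memory.
var cache = map[string][]byte{}

func handle(key string, payload []byte) {
	cache[key] = payload // every request pins its payload for the process lifetime
}

func main() {
	for i := 0; i < 100_000; i++ {
		handle("req-"+strconv.Itoa(i), make([]byte, 1024))
	}
}
```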
Command
go tool pprof http://localhost:6060/debug/pprof/heap
How to read
- AllocObjects / AllocSpace → total allocations;
- InUseObjects / InUseSpace → currently live objects.
Use flags like --alloc_space or --inuse_space to switch views.
Goroutine Profile - snapshot of all goroutines
Captures stack traces of all goroutines at a single point in time.
When to use
- The app freezes or stops responding;
- Goroutine count keeps increasing;
- You suspect a deadlock or goroutine leak.

#### Typical findings
- Goroutines stuck on a channel receive (`<-ch`) with no sender (see the sketch below);
- A `WaitGroup` that never reaches zero;
- An infinite `select {}` without a `default`.
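A minimal sketch of the first pattern, with illustrative names: each call leaks one goroutine parked on a channel receive forever, and the dump from `?debug=2` shows it in the `chan receive` state:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// leak starts a goroutine that waits on a channel nobody writes to.
// In /debug/pprof/goroutine?debug=2 it appears as "chan receive",
// and the count grows by one per call.
func leak() {
	ch := make(chan int)
	go func() {
		<-ch // blocks forever: no sender exists
	}()
}

func main() {
	for i := 0; i < 10; i++ {
		leak()
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Println("goroutines:", runtime.NumGoroutine()) // roughly 11: main + 10 leaked
}
```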
Command
curl http://localhost:6060/debug/pprof/goroutine?debug=2
How to read
You'll see text stack traces. Look for "sleep", "chan receive", "mutex", etc.
Usually it's easy to spot where execution is stuck.
Block Profile - where goroutines are waiting
Tracks how long goroutines are blocked, waiting on synchronization primitives (channels, mutexes, conditions).
When to use
- The app hangs under low load;
- High latency in simple operations;
- You want to find wait points that slow down performance.

#### Typical findings
- Overloaded channels (see the sketch below);
- Shared data structures causing contention;
- Backpressure in worker queues.
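To illustrate the backpressure case, here is a sketch where many producers feed one slow consumer over an unbuffered channel; with block profiling enabled, the wait time concentrates on the channel send. The port, sizes, and timings are arbitrary:

```go
package main

import (
	"net/http"
	_ "net/http/pprof"
	"runtime"
	"time"
)

func main() {
	runtime.SetBlockProfileRate(1) // record every blocking event (fine for an experiment, too costly for prod)

	go http.ListenAndServe("localhost:6060", nil)

	jobs := make(chan int) // unbuffered: a send blocks until the worker is ready

	// One deliberately slow worker.
	go func() {
		for range jobs {
			time.Sleep(10 * time.Millisecond)
		}
	}()

	// Producers spend most of their time blocked on the send;
	// /debug/pprof/block attributes that wait to this line.
	for i := 0; i < 1000; i++ {
		jobs <- i
	}
}
```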
Enable in code
runtime.SetBlockProfileRate(1)
Command
go tool pprof http://localhost:6060/debug/pprof/block
Mutex Profile - lock contention
Records how long goroutines hold a mutex and how long others wait for it.
When to use
- CPU usage is low but app is slow;
- Throughput doesn't scale with concurrency;
- You suspect shared locks or global bottlenecks.

#### Typical findings
- A global `sync.Mutex` in a hot path (see the sketch below);
- Shared maps without sharding;
- Logging or metrics inside critical sections.
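A sketch of the "global `sync.Mutex` in a hot path" finding, with made-up types: every goroutine funnels through one lock, so the mutex profile attributes the contention to `Inc`. Sharding the map (or `sync.Map` for read-heavy workloads) is the usual fix:

```go
package main

import (
	"runtime"
	"sync"
)

// Counters is a single shared structure that every goroutine fights over.
// With mutex profiling enabled, /debug/pprof/mutex points at Inc.
type Counters struct {
	mu sync.Mutex
	m  map[string]int
}

func (c *Counters) Inc(key string) {
	c.mu.Lock()
	c.m[key]++
	c.mu.Unlock()
}

func main() {
	runtime.SetMutexProfileFraction(1) // sample every contention event (experiment only)

	c := &Counters{m: map[string]int{}}

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100_000; j++ {
				c.Inc("hits")
			}
		}()
	}
	wg.Wait()
}
```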
Enable in code
runtime.SetMutexProfileFraction(1)
Command
go tool pprof http://localhost:6060/debug/pprof/mutex
Allocs Profile - all allocations (including freed ones)
Shows all memory allocations, not just the ones still in use.
Useful to understand allocation rate and GC pressure.
When to use
- The app allocates too frequently (high GC load);
- You're optimizing short-lived, high-throughput operations.

#### Typical findings
- Repeated string concatenations (`+`, `fmt.Sprintf`);
- Allocating a new `[]byte` on each request (see the sketch below);
- Inefficient `append` or map usage.
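For the "[]byte on each request" finding, one common fix is reusing buffers through `sync.Pool`; an allocs profile taken before and after shows the drop in allocation count. The handler shape below is illustrative:

```go
package main

import (
	"bytes"
	"sync"
)

// bufPool hands out reusable buffers instead of allocating a fresh
// bytes.Buffer (and its backing []byte) on every request.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func render(payload string) []byte {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)

	buf.WriteString(`{"data":"`)
	buf.WriteString(payload)
	buf.WriteString(`"}`)

	// Copy out the result; the buffer itself goes back to the pool.
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out
}

func main() {
	for i := 0; i < 100_000; i++ {
		_ = render("hello")
	}
}
```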
Command
go tool pprof http://localhost:6060/debug/pprof/allocs
Warning! Block and Mutex profiling should not be enabled permanently in production.
Use sampling values to reduce overhead, for example:
```go
runtime.SetBlockProfileRate(10000)   // sample roughly one blocking event per 10,000 ns spent blocked
runtime.SetMutexProfileFraction(100) // report on average 1 of every 100 mutex contention events
```
Using pprof via HTTP
Enable profiling on a running service:

```go
package main

import (
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	go func() {
		http.ListenAndServe("localhost:6060", nil)
	}()
	// your app logic
}
```

Open the following link in your browser: `http://localhost:6060/debug/pprof/`

Then run a go command to collect a profile:

go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
Using pprof directly in code
Recording a CPU profile to a file:

```go
f, _ := os.Create("cpu.pprof")
pprof.StartCPUProfile(f)
defer pprof.StopCPUProfile()

// Run your workload
workload()
```

Analyze it with:

```
go tool pprof cpu.pprof
(pprof) top
(pprof) web
```
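A similar sketch for writing a heap snapshot to a file; forcing a GC first makes the in-use numbers current. The file name and `workload` function are placeholders:

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	workload() // whatever you want to measure

	f, err := os.Create("heap.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	runtime.GC() // get up-to-date in-use statistics before the snapshot
	if err := pprof.WriteHeapProfile(f); err != nil {
		panic(err)
	}
}

func workload() {
	_ = make([]byte, 64<<20) // placeholder allocation, purely illustrative
}
```

Analyze it the same way: `go tool pprof heap.pprof`.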
Best practices for production
- Restrict access to `/debug/pprof/` (e.g., Basic Auth, internal IPs, an env flag; see the sketch after this list).
- Don't run CPU profiling all the time - it adds roughly 5–10% overhead.
- Capture CPU profiles for at least 10–30 seconds.
- Heap profiles are safe to collect in production.
- For visualization, use:
  - `go tool pprof -http` — quick interactive inspection in the browser;
  - Speedscope — fast and intuitive;
  - Parca — continuous profiling.
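One way to implement the first point, assuming you control the HTTP setup yourself: serve pprof from its own internal-only listener instead of the application's public mux. The addresses and routes here are illustrative:

```go
package main

import (
	"net/http"
	"net/http/pprof"
)

func main() {
	// Public API on its own mux: no profiling endpoints exposed here.
	api := http.NewServeMux()
	api.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	go http.ListenAndServe(":8080", api)

	// pprof only on localhost (or an internal interface / behind auth).
	debug := http.NewServeMux()
	debug.HandleFunc("/debug/pprof/", pprof.Index)
	debug.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	debug.HandleFunc("/debug/pprof/profile", pprof.Profile)
	debug.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	debug.HandleFunc("/debug/pprof/trace", pprof.Trace)
	http.ListenAndServe("localhost:6060", debug)
}
```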
Reading profiling results
What a pprof graph represents
pprof generates a call graph, where:
- Node (box) - a function;
- Edge (arrow) - a call from one function to another;
- Node weight - how much CPU time or memory that function consumed;
- Edge weight - how much cost was passed to the callee functions.
In short, the graph shows who calls whom and where time or memory is being spent.
Red flags to look for:
- Wide bottom block → hottest function
- `runtime.mallocgc` dominating → too many allocations
- `sync.(*Mutex).Lock` high → contention
- Many narrow repeated blocks → allocations inside a loop
Common visualization modes
top
Shows a summary table:
```
(pprof) top
Showing nodes accounting for 34.29s, 85.73% of 40.00s total
      flat  flat%   sum%        cum   cum%
    10.23s  25.6%  25.6%     18.92s  47.3%  main.work
     8.69s  21.7%  47.3%     10.35s  25.9%  processData
```
Column meanings:
- flat - time spent inside the function itself (excluding callees);
- cum (cumulative) - total time including callees;
- flat% / cum% - the same values relative to total runtime.

##### Highlights:
- Large flat time - optimize that specific function.
- Large cum time but small flat - the problem is in a callee.
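A tiny illustration of that last point, with made-up functions: `handle` shows a large cum but almost no flat time, because nearly all of its cost sits in its callee `parse`:

```go
package main

import "strings"

// handle shows up with small "flat" but large "cum":
// its own work is trivial, the cost is in the callee.
func handle(payload string) int {
	return parse(payload)
}

// parse carries the real cost, so its "flat" time is large.
func parse(payload string) int {
	return len(strings.Fields(payload))
}

func main() {
	s := strings.Repeat("word ", 1_000)
	total := 0
	for i := 0; i < 10_000; i++ {
		total += handle(s)
	}
	_ = total
}
```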
list funcName
Shows annotated source code lines:
```
(pprof) list compute
Total: 40s
ROUTINE ======================== compute in main.go
    10.00s     15.00s (flat, cum) 37.50% of Total
```
##### Highlights:
- You can see which specific lines consume CPU or allocate memory — ideal for micro-optimizations.
web (or go tool pprof -http=:6060)
Opens an interactive call graph in your browser.
Color & size meaning:
- Red - functions consuming the most resources
- Yellow - medium cost
- Green - minor impact
- Box width - time or memory weight
- Arrow - function call relationship

##### Highlights:
- The wider and redder the box, the hotter the function.
Flamegraph
A flamegraph is the most intuitive format.
Example:
```
main
└── handleRequest
    ├── parseJSON
    └── processData
        ├── validate
        └── saveToDB
```
Each rectangle = a function:
- X-axis = total time (width = cost)
- Y-axis = call stack depth
- Bottom → top = call chain (from main to leaf functions)

##### Highlights:
- If `parseJSON` is wide - JSON encoding is CPU-heavy.
- If `saveToDB` dominates - DB operations are the bottleneck.

##### Example interpretation
For example, your flamegraph shows:
main → handle → json.Marshal → reflect.Value.Interface
and reflect.Value.Interface takes 40% of CPU time.
That's a clear indicator of slow reflection-based serialization - replace it with a faster encoder such as jsoniter or easyjson.
Practical reading tips
- Start with the widest blocks at the bottom — they consume the most time or memory.
- Ignore runtime internals (`runtime.*`, `syscall.*`) unless something abnormal shows up.
- Look for repeating narrow peaks — they often mean inefficient work inside loops.
- If CPU looks fine but the app hangs, check block or mutex profiles — likely synchronization issues, not CPU load.
- Compare before/after:

go tool pprof -diff_base heap1.pprof heap2.pprof
Heap vs Allocs
| Profile | Shows | Common misunderstanding |
|---|---|---|
| heap (inuse) | live objects currently in memory | "memory leak" — but may just be a cache or buffer |
| allocs | all allocations since process start (even freed) | engineers read "growth = leak" — but this counter is cumulative and always grows |
Visualization tools comparison
| Tool | Format | Best for |
|---|---|---|
| CLI (top, list) | Text | Quick inspection, remote servers |
| Web UI (pprof -http) | Interactive graph | Exploring call hierarchy |
| Speedscope | Visual | Immediate hotspot identification |
| Parca | Continuous profiling | Real-time production monitoring |
Conclusion
Profiling is one of the most valuable tools for diagnosing performance issues in Go applications, and pprof provides everything you need - from understanding CPU hotspots to uncovering memory leaks, goroutine leaks, synchronization bottlenecks, and inefficient allocation patterns.
The key to using pprof effectively is knowing which profile to capture, how to interpret what you see, and how to compare snapshots over time. CPU profiles reveal hot paths, heap profiles uncover leaks or excessive allocations, goroutine dumps expose deadlocks or runaway concurrency, while block and mutex profiles highlight contention that's invisible to standard metrics.
Most importantly:
- Always start from the widest blocks in flamegraphs.
- Use diffing to compare “before/after” optimizations.
- Enable advanced profiles (block/mutex) only when needed.
- Treat pprof as part of your standard debugging workflow - not as a last resort.
Mastering pprof turns performance debugging from guesswork into a repeatable, data-driven process. Once your team gets comfortable reading profiles, performance problems that used to take days can be solved in minutes.
Measure, don't guess - profile first.