📘 This post pairs with two books I've written on Go. Book 1: The Complete Guide to Go Programming teaches the language. Book 2: Hexagonal Architecture in Go teaches how to architect services you can actually profile without crying. Or grab both together as the Thinking in Go collection: Kindle / Paperback. Short blurbs for each at the bottom.
Your Go service is slow. You added structured logs. You added Prometheus counters. One endpoint still takes 400ms on P99 and nothing in your dashboards explains why.
That's the moment pprof exists for. It's also the moment most Go developers bounce off pprof, because the docs drop you into profile.proto and flame graphs without ever telling you what to look at.
This walkthrough fixes that. We'll enable pprof in two lines, capture a real profile, open the web UI, and then do the part every tutorial skips: actually read the thing. By the end you'll know what flat and cum really mean, you'll recognize three common profile shapes, and you'll know what Go 1.26 changed about the workflow.
The four profile types in 60 seconds
pprof gives you four profiles that matter in day-to-day work:
- CPU profile. The runtime samples 100 times a second. Answers "which functions are burning cycles?"
- Heap profile. Snapshot of allocations. Answers "who's eating memory and who's producing garbage?"
- Goroutine profile. Every live goroutine and its stack. Answers "what's stuck and where?"
- Mutex profile. Contended locks. Answers "where is my concurrency serializing itself?"
There's also a block profile (time spent blocked on synchronization primitives) and, new in Go 1.26, an experimental goroutine leak profile that uses GC reachability analysis. More on both later.
Quick mental rule: CPU profile answers "why is this slow", heap profile answers "why is this fat", mutex profile answers "why is adding goroutines not helping". Start there.
Step 1: Enable pprof in two lines
Import the side-effect package and start a server on a port nothing else uses:
```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on the default mux

	"github.com/yourorg/yourapp"
)

func main() {
	// profiling endpoints on localhost, on their own port
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	yourapp.Run()
}
```
Two things worth being paranoid about.
Don't expose :6060 to the public internet. The pprof endpoints let anyone download your memory and CPU profiles, which often contain query strings, database rows, and secrets in transit. Bind to localhost or put it behind an internal auth layer. Every few months somebody pays out a bug bounty because a team forgot.
Don't reuse your main HTTP mux. If your service uses http.DefaultServeMux for real traffic, importing net/http/pprof silently bolts /debug/pprof/* onto it. That's both ugly and a security footgun. Keep pprof on its own mux on its own port.
If you want an explicit mux instead of the default one, import net/http/pprof without the underscore and wire the handlers yourself:
Optional: explicit mux setup
```go
import (
	"log"
	"net/http"
	"net/http/pprof" // NOT the blank import, we need pprof.Index etc.
)

func startPprof() {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)

	server := &http.Server{Addr: "localhost:6060", Handler: mux}
	go func() { log.Println(server.ListenAndServe()) }()
}
```
Ugly, but it prevents the accidental attach to your public mux.
Step 2: Capture a CPU profile
Run this against a process you've already hit with some load:
```shell
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"
```

(Quote the URL: some shells, zsh included, treat the `?` as a glob character.)
What happens: go tool pprof hits that endpoint, the runtime collects 30 seconds of CPU samples, then you land in an interactive shell. The shell is fine for quick top commands. It's not where you want to live.
Step 3: Open the web UI
The useful view is the web UI:
```shell
go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/profile?seconds=30"
```
This captures the profile and opens a browser tab at localhost:8080.
Go 1.26 change worth knowing: the web UI now defaults to the flame graph view. Before 1.26 it opened on the graph view (boxes and arrows). If you've been profiling Go for a while, the new default will look wrong at first. The old graph is still there under View → Graph or at /ui/graph. Keep both. They answer different questions: the flame graph is about where the time goes, the call graph is about how calls relate.
Reading the flame graph: flat vs cum
This is the concept every other tutorial skips, and it's the one that unlocks pprof.
Every function in the profile has two numbers attached:
- `flat`: time spent in this function's own body, excluding anything it called.
- `cum`: time spent in this function plus everything it called (its children).

If main calls A, and A calls B, and B is the hot function, then:

- `B`: `flat` is high, `cum` is high.
- `A`: `flat` is low (it's mostly just calling `B`), `cum` is high (because `B` is its child).
- `main`: `flat` is effectively zero, `cum` is basically the whole program.
Flame graph width = cum. Color intensity roughly tracks flat. So wide and dark is where the work actually happens.
```
main [cum: 100%]
└── handleRequest [cum: 98%]
    ├── parseJSON [flat: 3%, cum: 5%]
    ├── computeDiscount [flat: 71%, cum: 74%]   ← hot
    └── writeResponse [flat: 2%, cum: 3%]
```
The classic mistake that still shows up in code review: someone opens pprof, sees main taking 100% of the time, and says "nothing's hot." main always takes 100% cumulative. You're looking for leaf-ish functions with high flat, not roots with high cum. If that clicks, you've passed the part of pprof most people never get past.
Shape 1: The tight-loop hotspot
This is the friendly shape. One function is obviously doing all the work, and the fix is usually "stop doing that in a loop" or "cache the result."
Typical culprit:
```go
// compiles the regex 10,000x per request because it lives inside the loop
func (s *PriceService) TotalForCart(items []Item) float64 {
	var total float64
	for _, it := range items {
		// regexp.MustCompile inside a hot loop. classic.
		re := regexp.MustCompile(`^SKU-(\d+)$`)
		if re.MatchString(it.SKU) {
			total += s.priceFor(it.SKU)
		}
	}
	return total
}
```
The flame graph will show regexp.MustCompile (and the regexp.compile work beneath it) sitting wide and dark right underneath TotalForCart. Fix: hoist the regex to a package-level var.
```go
var skuRegex = regexp.MustCompile(`^SKU-(\d+)$`)

func (s *PriceService) TotalForCart(items []Item) float64 {
	var total float64
	for _, it := range items {
		if skuRegex.MatchString(it.SKU) {
			total += s.priceFor(it.SKU)
		}
	}
	return total
}
```
The trap isn't finding the hotspot. It's the post-fix sanity check. After you hoist the regex, profile again. If the function is still slow, it wasn't the regex. Don't close the ticket on vibes.
Shape 2: GC pressure (heap profile territory)
This shape shows up when your CPU profile is a mess of runtime.mallocgc, runtime.scanobject, and runtime.gcBgMarkWorker. Nothing in your code is obviously hot, but the runtime is spending a serious chunk of its time collecting garbage.
Switch to the heap profile:
```shell
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/heap
```
The heap profile has two flavors the UI lets you toggle between:
- `inuse_space`: bytes currently live. Shows you who's holding memory right now.
- `alloc_space`: bytes allocated over time. Shows you who's producing garbage, which is what puts pressure on GC.
For allocation hotspots, use alloc_space. A function can have tiny inuse_space (because everything it allocates is short-lived) while still being the worst offender for GC pressure. inuse_space is for memory leaks; alloc_space is for GC pressure. Mixing them up is how people spend a week optimizing the wrong thing.
A sidebar for Go 1.26: the Green Tea GC is now default, and it reduced GC overhead by 10 to 40% for heap-heavy workloads. If you last profiled the service on Go 1.24, upgrade and re-profile before you start optimizing. You may have already shipped the fix by upgrading the toolchain.
Common allocation offenders the heap profile surfaces:
- `fmt.Sprintf` in hot paths. Switch to `strconv` or a `strings.Builder`.
- `json.Marshal` on the same struct repeatedly. Cache the result or use a pool.
- Slices without preallocation. Use `s := make([]T, 0, n)` when you know `n`.
- `any` conversions in generic-ish helpers. Each boxing allocates.
- Concatenating strings in a loop with `+`. Use a `strings.Builder`.
Shape 3: Lock contention
The third shape is the sneaky one. CPU is mostly idle. Throughput is bad. Adding goroutines doesn't help, and sometimes it makes things worse.
That's lock contention, and the CPU profile won't tell you directly. Use the mutex profile:
```go
import "runtime"

func init() {
	runtime.SetMutexProfileFraction(5) // sample roughly 1 in 5 contended mutex events
}
```
Then capture:
```shell
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/mutex
```
The profile shows which sync.Mutex.Lock and sync.RWMutex.Lock sites are costing you contention time. The frames above the lock are the callers fighting over it.
Classic culprit: a global cache with a single sync.RWMutex, accessed from every request, where the write path holds the lock while doing I/O.
```go
// the bug: holding the write lock across an HTTP call
func (c *Cache) Refresh(key string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	// this is the poison. any slow call here blocks every reader.
	value, err := c.upstream.Fetch(key)
	if err != nil {
		return err
	}
	c.data[key] = value
	return nil
}
```
The fix is to do the slow work outside the lock and only take the lock for the map write:
```go
func (c *Cache) Refresh(key string) error {
	value, err := c.upstream.Fetch(key) // no lock held here
	if err != nil {
		return err
	}
	c.mu.Lock()
	c.data[key] = value
	c.mu.Unlock()
	return nil
}
```
You won't find this from a CPU profile. You'll find it from the mutex profile, and only if you remembered to call SetMutexProfileFraction. That's the footgun: mutex profiling is off by default, so a team can run pprof for years and never catch their worst bottleneck.
Bonus in Go 1.26: the goroutine leak profile
This one isn't in most "intro to pprof" posts because it shipped in February 2026. Enable it with a build-time env var:
```shell
GOEXPERIMENT=goroutineleakprofile go build -o myapp ./cmd/myapp
```
Then run the binary and hit the new endpoint:
```shell
go tool pprof http://localhost:6060/debug/pprof/goroutineleak
```
What it does: the runtime uses GC reachability analysis to find goroutines blocked on channels, mutexes, or condition variables that nothing in the program can reach anymore. If a goroutine is parked on a <-ch and the only references to ch are held by other blocked goroutines, the whole island is a leak, and this profile will list it with stacks.
What it doesn't catch: leaks where the blocking primitive is still reachable. A goroutine parked forever on ctx.Done() from a context.Background() isn't a leak as far as the GC is concerned, because the context is still reachable through live code. You still need pattern-based debugging for those. We'll go deep on the patterns in the next post in this series.
The Go team expects this profile to be default in 1.27. Worth wiring into staging now so you're not surprised.
Continuous profiling in production
Profiling on demand with go tool pprof is fine for dev and incident response. For production you want continuous profiling: Grafana Pyroscope (which absorbed Phlare), Parca, or Datadog Profiler. All of them pull pprof data on a schedule and let you diff profiles across time, which turns "the service got slower this week" into a two-minute investigation instead of a four-hour one.
One opinion I'll die on: turn on mutex and block profiling globally in your continuous profiler, not just CPU. A lot of the perf regressions that end up in public postmortems (including ones from Grafana Labs, Cloudflare, and Discord over the last few years) turned out to be lock-contention or goroutine-blocking issues, not CPU issues. If your continuous profiling only catches CPU, you're looking under the streetlight.
One more gotcha: profile duration vs traffic
A 30-second CPU profile on a service getting 2 requests per second is mostly noise. A 30-second CPU profile during a burst where the service is getting 2000 requests per second is gold.
Capture profiles under realistic load. A few options:
- Let production run for 60 seconds with `?seconds=60` during peak hours.
- Replay production traffic against staging with something like `goreplay` and profile staging.
- Load-test with `vegeta` or `k6` and profile during the test.
It's easy to end up "optimizing" a service based on a profile captured during a dev reload cycle and then be mystified when prod performance doesn't budge. Profiles are only as representative as the workload you captured them under.
Next step
Open a Go service you own. Add the two lines from Step 1. Deploy to staging. Generate some load. Capture a profile. Open the flame graph.
Find the widest, darkest flat frame that isn't in runtime.*. That's your suspect. Go fix it, re-profile, and see if the shape changed.
You now know more about your service than your dashboards do.
Question for the comments: what's the weirdest profile shape you've seen in production that none of the three above describe? I'll read every reply.
The books
📘 The Complete Guide to Go Programming — Book 1 in the series. The language, end to end: syntax, concurrency, interfaces, error handling, the standard library, the pieces most tutorials skip. Start here if you want a real foundation in Go.
📘 Hexagonal Architecture in Go — Book 2. How to build Go services that stay understandable past 50,000 lines. Ports, adapters, testing strategy at every layer, dependency injection in main() without frameworks. 22 chapters. Every example tested in a companion repo.
📘 Or the full collection: Thinking in Go on Amazon as Kindle or Paperback.
Next in **Go in Production**: goroutine leaks. The four patterns that cause 90% of them, and what Go 1.26's new leak profile actually catches.
