Prasad Ekke

Posted on Jun 30 • Originally published at Medium

Profiling a Go Service in Production: pprof in 10 Minutes

#go #designpatterns #programming #backend

Go ships with a profiler built in. No third-party tools, no agents to install, no instrumentation to add upfront. If your service imports net/http/pprof, you can profile it live in production right now — CPU usage, memory allocations, goroutine counts, blocking operations. The data is available over HTTP, readable with standard Go tooling.

Most engineers know pprof exists. Fewer have actually used it under pressure, on a real service, to find a real problem. This post walks through the mechanics — how to enable it, how to collect a profile, how to read a flame graph — and then shows the three classes of problems it catches most often.

Enabling pprof

For an HTTP service, one import is all it takes:

import (
    "log"
    "net/http"

    _ "net/http/pprof" // registers handlers on DefaultServeMux
)

func main() {
    // Your service setup...

    // pprof endpoints are now available on DefaultServeMux
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // ... rest of main
}

The blank import registers the pprof HTTP handlers automatically. They’re available at:

GET /debug/pprof/ — index of available profiles
GET /debug/pprof/goroutine — all current goroutines
GET /debug/pprof/heap — memory allocations
GET /debug/pprof/profile?seconds=30 — CPU profile (30-second sample)
GET /debug/pprof/block — goroutine blocking events
GET /debug/pprof/mutex — mutex contention

Security note: bind pprof to localhost only, or a private interface. Never expose it on your public port. If your service runs in Kubernetes, use kubectl port-forward to reach it.

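For services not using DefaultServeMux, import net/http/pprof by name (not as a blank import) and register the handlers explicitly:

mux := http.NewServeMux()
// pprof.Index also serves the named profiles under this prefix,
// such as /debug/pprof/heap and /debug/pprof/goroutine.
mux.HandleFunc("/debug/pprof/", pprof.Index)
mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
mux.HandleFunc("/debug/pprof/trace", pprof.Trace)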
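To make the explicit registration concrete, here is a minimal sketch of a dedicated debug mux that can be exercised in memory. The helper names newDebugMux and statusFor are illustrative assumptions, not part of any library; the pprof handlers themselves come from net/http/pprof.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newDebugMux (hypothetical helper) returns a mux that serves only pprof.
// Note: importing net/http/pprof still registers the same handlers on
// http.DefaultServeMux, so avoid serving DefaultServeMux on a public port.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index) // index plus named profiles
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// statusFor (hypothetical helper) issues an in-memory GET against h
// and returns the HTTP status code, without opening a socket.
func statusFor(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newDebugMux()
	fmt.Println("index status:", statusFor(mux, "/debug/pprof/"))
	fmt.Println("heap status: ", statusFor(mux, "/debug/pprof/heap?debug=1"))

	// In a real service you would serve this mux on a private address, e.g.:
	//   go func() { log.Println(http.ListenAndServe("localhost:6060", mux)) }()
}
```

Keeping pprof on its own mux and its own listener means the public router never sees the /debug/pprof/ routes, which is one way to follow the security note above.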

Collecting a CPU profile

With the service running under load (profiling a quiet service tells you nothing useful), collect a 30-second CPU sample:

go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

This downloads the profile and drops you into an interactive shell. Or skip the shell and go straight to a flame graph:

# Collect the profile to a file
curl -o cpu.prof "http://localhost:6060/debug/pprof/profile?seconds=30"

# Open the flame graph in a browser
go tool pprof -http=:8080 cpu.prof

The -http flag serves a web UI with multiple views. The flame graph is the most useful starting point.

Reading a flame graph

A flame graph shows where your program spends CPU time. Each horizontal bar is a function, and its width is the share of samples in which that function was on the stack: wider means more CPU. The vertical axis is the call stack. In the pprof web UI the root is at the top and each bar sits directly below its caller, as in this sketch:

|          processJob          |  encode  |
|      fetchFromDB       |other|
|       sql.Query        |
|       net.Read         |

Look for wide bars with little or nothing beneath them: those functions are burning CPU themselves rather than passing it on to callees. If json.Marshal is unexpectedly wide, you’re spending a lot of time serializing. If runtime.mallocgc is wide, you’re allocating heavily. If sync.Mutex.Lock is wide, you have lock contention.

The three most common findings in worker pool services:

Problem 1: Excessive allocations

Symptom: runtime.mallocgc is prominent in the CPU profile. Heap profile shows high allocation rate with short-lived objects.

# Collect a heap profile
curl -o heap.prof "http://localhost:6060/debug/pprof/heap"
go tool pprof -http=:8080 heap.prof

In the heap profile, switch to “alloc_objects” or “alloc_space” view (not “inuse” — that shows what’s live, not what’s being allocated). Look for hot allocation sites.

Common cause in worker pools: allocating a new buffer or slice on every job iteration.

// ❌ Allocates a new byte slice for every job
func processJob(ctx context.Context, job Job) error {
    buf := make([]byte, 4096)
    // use buf...
    return nil
}

// ✅ Reuse buffers with sync.Pool
var bufPool = sync.Pool{
    New: func() interface{} {
        b := make([]byte, 4096)
        return &b
    },
}

func processJob(ctx context.Context, job Job) error {
    bufPtr := bufPool.Get().(*[]byte)
    defer bufPool.Put(bufPtr)
    buf := *bufPtr
    // use buf...
    return nil
}

sync.Pool maintains a pool of reusable objects. The garbage collector may drop pooled objects at any time, so the pool doesn’t pin memory or prevent collection; what it does is lower the allocation rate, which reduces GC pressure.
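One way to check that pooling actually reduces allocations is to count them directly. The following is a rough sketch, not a benchmark: processFresh, processPooled, fill, and allocsPerCall are hypothetical names, and the package-level sink is an assumption used to force the fresh buffer onto the heap, standing in for real code where the buffer escapes.

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

const bufSize = 4096

// sink keeps the fresh buffer reachable so it must be heap-allocated.
var sink []byte

var bufPool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, bufSize)
		return &b // store a pointer so Put/Get do not allocate a slice header
	},
}

// fill is a hypothetical stand-in for real work done on the buffer.
func fill(buf []byte) {
	for i := range buf {
		buf[i] = byte(i)
	}
}

// processFresh allocates a new buffer on every call.
func processFresh() {
	buf := make([]byte, bufSize)
	fill(buf)
	sink = buf
}

// processPooled borrows a buffer from the pool and returns it afterwards.
func processPooled() {
	bufPtr := bufPool.Get().(*[]byte)
	defer bufPool.Put(bufPtr)
	fill(*bufPtr)
}

// allocsPerCall reports the average number of heap allocations per call to f.
func allocsPerCall(f func()) float64 {
	return testing.AllocsPerRun(1000, f)
}

func main() {
	fmt.Printf("fresh:  %.0f allocs/call\n", allocsPerCall(processFresh))
	fmt.Printf("pooled: %.0f allocs/call\n", allocsPerCall(processPooled))
}
```

The numbers only tell you about allocation counts in this toy setup. In a real service, confirm the effect by comparing alloc_space in heap profiles taken before and after the change.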

Problem 2: Goroutine leak

Symptom: goroutine count grows over time and never decreases.

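# Dump every goroutine with its state and stack
curl "http://localhost:6060/debug/pprof/goroutine?debug=2"

The debug=2 parameter returns a full text dump: one entry per goroutine with its state, how long it has been waiting, and its stack trace. (debug=1 is more compact: it prints the total count on the first line and groups identical stacks with a count for each.) Look for hundreds or thousands of goroutines all blocked at the same location; that is your leak.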

goroutine 1042 [chan receive, 3 minutes]:
main.worker()
    /app/worker.go:45 +0x68
created by main.startWorkers
    /app/pool.go:23 +0x4c

goroutine 1043 [chan receive, 3 minutes]:
...
(997 more like this)

A thousand goroutines blocked on chan receive for 3 minutes means the jobs channel was never closed. The fix is covered in the concurrency mistakes post in this series — the producer must own close(jobs) via defer.
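As a reminder of what that fix looks like, here is a minimal sketch in which the producer owns the channel and closes it via defer, so every worker's range loop terminates. The Job type and runPool function are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Job is a hypothetical unit of work.
type Job struct{ ID int }

// runPool starts the given number of workers, feeds them n jobs, waits for
// all workers to exit, and returns how many jobs were processed.
func runPool(workers, n int) int64 {
	jobs := make(chan Job)
	var processed int64
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// This loop ends only once jobs is closed and drained.
			// Without the close below, the worker would block here forever.
			for range jobs {
				atomic.AddInt64(&processed, 1)
			}
		}()
	}

	// The producer owns the channel: it is the only sender, so it closes.
	go func() {
		defer close(jobs)
		for i := 0; i < n; i++ {
			jobs <- Job{ID: i}
		}
	}()

	wg.Wait() // returns only because close(jobs) lets every worker exit
	return atomic.LoadInt64(&processed)
}

func main() {
	fmt.Println("processed:", runPool(4, 100))
}
```

If you delete the defer close(jobs) line, wg.Wait never returns and the workers show up in the goroutine dump in the chan receive state, which is the pattern shown above.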

You can also check the goroutine count programmatically:

// In your health check endpoint
func healthHandler(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintf(w, "goroutines: %d\n", runtime.NumGoroutine())
}

If that number grows continuously over hours of runtime and never stabilizes, you have a leak.

Problem 3: Lock contention

Symptom: throughput is low even though the CPUs are not saturated. Workers are running, but they spend most of their time waiting for a lock instead of making progress.

Mutex profiling is off by default because it has overhead. Enable it from inside the process only while investigating contention:

import "runtime"

runtime.SetMutexProfileFraction(1)
defer runtime.SetMutexProfileFraction(0)

Then collect the mutex profile:

curl -o mutex.prof "http://localhost:6060/debug/pprof/mutex"
go tool pprof -http=:8080 mutex.prof
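To keep the profiling window tight, the enable and restore steps can be wrapped around the code under investigation. This is a sketch under assumptions: withMutexProfiling, currentMutexFraction, and contend are hypothetical helpers, and contend is an artificial workload that exists only to produce contention.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// withMutexProfiling enables mutex profiling at the given fraction,
// runs fn, and then restores whatever setting was active before.
func withMutexProfiling(fraction int, fn func()) {
	prev := runtime.SetMutexProfileFraction(fraction)
	defer runtime.SetMutexProfileFraction(prev)
	fn()
}

// currentMutexFraction reads the current setting without changing it
// (a negative rate is a read-only query).
func currentMutexFraction() int {
	return runtime.SetMutexProfileFraction(-1)
}

// contend is an artificial workload: several goroutines fight over one mutex.
func contend(workers int) {
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 20; j++ {
				mu.Lock()
				time.Sleep(100 * time.Microsecond) // hold the lock briefly
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
}

func main() {
	withMutexProfiling(1, func() {
		contend(8)
		// Count reports how many distinct stacks the mutex profile holds.
		fmt.Println("stacks in mutex profile:", pprof.Lookup("mutex").Count())
	})
	fmt.Println("fraction after restore:", currentMutexFraction())
}
```

A fraction of 1 records every contention event, which is fine for a short investigation; a larger value samples a smaller share of events and costs less.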

Common cause in worker pools: a shared results map protected by a single mutex, written to by all workers simultaneously.

// ❌ All workers contend on one mutex
type ResultCollector struct {
    mu      sync.Mutex
    results map[string]Result
}

func (c *ResultCollector) Add(id string, result Result) {
    c.mu.Lock()
    c.results[id] = result
    c.mu.Unlock()
}

Solutions, in order of preference:

// ✅ Option 1: Use a results channel instead of shared state
results := make(chan Result, 1000)

// Workers send to channel (no lock)
go func() {
    for job := range jobs {
        result := processJob(ctx, job)
        results <- result
    }
}()

// Single collector goroutine reads from channel (no contention)
go func() {
    for result := range results {
        collected[result.ID] = result
    }
}()

// ✅ Option 2: sync.Map for concurrent read-heavy workloads
var results sync.Map
results.Store(id, result)
val, ok := results.Load(id)

The channel approach is idiomatic Go — share memory by communicating rather than communicating by sharing memory. The sync.Map approach is better when you need random access to results across goroutines.

A quick reference for which profile to use

High CPU, unclear where — CPU profile: pprof .../profile?seconds=30.

Memory growing over time — heap profile (inuse_space): pprof .../heap.

High GC pressure — heap profile (alloc_objects): pprof .../heap.

Goroutine count growing — goroutine profile: pprof .../goroutine?debug=1.

Low throughput despite low CPU — mutex or block profile: pprof .../mutex or .../block. Enable mutex profiling with runtime.SetMutexProfileFraction; enable block profiling with runtime.SetBlockProfileRate.

One workflow for production incidents

When something is wrong and you need to understand it quickly:

# 1. Check goroutine count and the most common stacks (cheap, immediate)
curl -s "http://localhost:6060/debug/pprof/goroutine?debug=1" | head -50

# 2. If CPU is high, get a 30s CPU profile
curl -o cpu.prof "http://localhost:6060/debug/pprof/profile?seconds=30"

# 3. If memory is growing, get a heap snapshot
curl -o heap.prof "http://localhost:6060/debug/pprof/heap"

# 4. Open whichever profile is relevant
go tool pprof -http=:8080 cpu.prof

The whole process takes under 5 minutes. You don’t need to reproduce the problem locally, you don’t need to redeploy with extra instrumentation, and you don’t need to restart the service. The profiler runs against the live process.

Summary

pprof is built into Go’s standard library and available with a single import. The three problems it catches most reliably in worker pool services are excessive allocations (fix: sync.Pool), goroutine leaks (fix: close channels, use context), and mutex contention (fix: channel-based result collection or sync.Map). The flame graph view in go tool pprof -http is the fastest path from profile to finding.

The only prerequisite is running the service under realistic load when you collect the profile. A profile from a quiet service tells you where your code could spend time, not where it does.

Previous in this series: context.Context Is Not Optional
Next: Container-aware resource management in Go.
