Prasad Ekke

Posted on • Originally published at Medium

Profiling a Go Service in Production: pprof in 10 Minutes

Go ships with a profiler built in. No third-party tools, no agents to install, no instrumentation to add upfront. If your service imports net/http/pprof, you can profile it live in production right now — CPU usage, memory allocations, goroutine counts, blocking operations. The data is available over HTTP, readable with standard Go tooling.

Most engineers know pprof exists. Fewer have actually used it under pressure, on a real service, to find a real problem. This post walks through the mechanics — how to enable it, how to collect a profile, how to read a flame graph — and then shows the three classes of problems it catches most often.


Enabling pprof

For an HTTP service, one import is all it takes:

import (
    "log"
    "net/http"

    _ "net/http/pprof" // registers handlers on DefaultServeMux
)

func main() {
    // Your service setup...

    // pprof endpoints are now available on DefaultServeMux
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // ... rest of main
}

The blank import registers the pprof HTTP handlers automatically. They’re available at:

  • GET /debug/pprof/ — index of available profiles
  • GET /debug/pprof/goroutine — all current goroutines
  • GET /debug/pprof/heap — memory allocations
  • GET /debug/pprof/profile?seconds=30 — CPU profile (30-second sample)
  • GET /debug/pprof/block — goroutine blocking events
  • GET /debug/pprof/mutex — mutex contention

Security note: bind pprof to localhost only, or a private interface. Never expose it on your public port. If your service runs in Kubernetes, use kubectl port-forward to reach it.

For services not using DefaultServeMux, register the handlers explicitly:

// pprof here is the net/http/pprof package (imported by name), not runtime/pprof
mux := http.NewServeMux()
mux.HandleFunc("/debug/pprof/", pprof.Index) // also serves heap, goroutine, block, mutex
mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
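To make the explicit registration concrete, here is a minimal sketch of a dedicated debug mux plus an in-memory check that the endpoints respond. The names newDebugMux and statusFor are made up for this example, and httptest is used only so the snippet runs without opening a port. One thing to keep in mind: importing net/http/pprof by name still registers its handlers on DefaultServeMux, so the public server should be given its own mux rather than nil.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newDebugMux returns a mux that serves only the pprof endpoints.
// The function name is hypothetical; the handlers are the ones
// exported by net/http/pprof.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index) // index plus named profiles (heap, goroutine, ...)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// statusFor issues an in-memory GET against h and returns the status code.
func statusFor(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	debug := newDebugMux()
	for _, p := range []string{"/debug/pprof/", "/debug/pprof/heap", "/debug/pprof/goroutine"} {
		fmt.Println(p, statusFor(debug, p))
	}
	// In a real service you would serve this mux on a private address, e.g.
	//   go func() { log.Println(http.ListenAndServe("localhost:6060", debug)) }()
}
```

Keeping the debug mux separate from the application mux means the pprof routes cannot leak onto the public listener by accident.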

Collecting a CPU profile

With the service running under load (profiling a quiet service tells you nothing useful), collect a 30-second CPU sample:

go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

This downloads the profile and drops you into an interactive shell. Or skip the shell and go straight to a flame graph:

# Collect the profile to a file
curl -o cpu.prof "http://localhost:6060/debug/pprof/profile?seconds=30"

# Open the flame graph in a browser
go tool pprof -http=:8080 cpu.prof

The -http flag serves a web UI with multiple views. The flame graph is the most useful starting point.


Reading a flame graph

A flame graph shows where your program spends time. Each horizontal bar is a function, and its width is the share of CPU samples in which that function was on the stack: wider means more CPU. The vertical axis is the call stack. In the pprof web UI the root is at the top, so each function sits above the functions it calls.

|          processJob          |     |  encode  |
|       fetchFromDB      |other|
|         sql.Query            |
|          net.Read            |

Look for wide bars with little or nothing beneath them: those functions are burning CPU themselves rather than passing it on to callees. If json.Marshal is unexpectedly wide, you’re spending a lot of time serializing. If runtime.mallocgc is wide, you’re allocating heavily. If sync.(*Mutex).Lock is wide, goroutines are fighting over a lock.

The three most common findings in worker pool services:


Problem 1: Excessive allocations

Symptom: runtime.mallocgc is prominent in the CPU profile. Heap profile shows high allocation rate with short-lived objects.

# Collect a heap profile
curl -o heap.prof "http://localhost:6060/debug/pprof/heap"
go tool pprof -http=:8080 heap.prof

In the heap profile, switch the sample type to alloc_objects or alloc_space (in the web UI, via the SAMPLE menu) rather than the inuse views, which show what is live right now, not what is being allocated. Look for hot allocation sites.

Common cause in worker pools: allocating a new buffer or slice on every job iteration.

// ❌ Allocates a new byte slice for every job
func processJob(ctx context.Context, job Job) error {
    buf := make([]byte, 4096)
    // use buf...
    return nil
}

// ✅ Reuse buffers with sync.Pool
var bufPool = sync.Pool{
    New: func() interface{} {
        b := make([]byte, 4096)
        return &b
    },
}

func processJob(ctx context.Context, job Job) error {
    bufPtr := bufPool.Get().(*[]byte)
    defer bufPool.Put(bufPtr)
    buf := *bufPtr
    // use buf...
    return nil
}

sync.Pool maintains a set of reusable objects. The runtime may drop pooled objects at any time, typically during garbage collection, so the pool does not pin memory forever. What it does is lower the allocation rate, which reduces GC pressure.
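One way to check that the pool is actually saving allocations is to count them. The sketch below uses testing.AllocsPerRun from the standard library to compare a fresh 4 KiB buffer per call against a pooled one. The function names fresh, pooled and allocsPerCall are hypothetical, and the package-level sink exists only to force the fresh buffer onto the heap the way a real job would.

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

// sink keeps the fresh buffer reachable so the compiler cannot
// place it on the stack.
var sink []byte

var bufPool = sync.Pool{
	New: func() any {
		b := make([]byte, 4096)
		return &b
	},
}

// fresh allocates a new 4 KiB buffer on every call.
func fresh() {
	buf := make([]byte, 4096)
	buf[0] = 1
	sink = buf
}

// pooled borrows a buffer from bufPool and hands it back when done.
func pooled() {
	bufPtr := bufPool.Get().(*[]byte)
	defer bufPool.Put(bufPtr)
	buf := *bufPtr
	buf[0] = 1
}

// allocsPerCall reports the average number of heap allocations per call of f.
func allocsPerCall(f func()) float64 {
	return testing.AllocsPerRun(1000, f)
}

func main() {
	fmt.Println("fresh  allocs/call:", allocsPerCall(fresh))
	fmt.Println("pooled allocs/call:", allocsPerCall(pooled))
}
```

On a typical run the pooled version should report fewer allocations per call than the fresh one. Storing a pointer to the slice in the pool, as the original example does, avoids an extra allocation when the value is converted to an interface.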


Problem 2: Goroutine leak

Symptom: goroutine count grows over time and never decreases.

# Dump every goroutine with its state and stack trace
curl "http://localhost:6060/debug/pprof/goroutine?debug=2"

With debug=2 the endpoint returns a text listing of every goroutine, including its wait reason and how long it has been waiting. (debug=1 groups identical stacks and prints a count for each, which is quicker to scan when there are thousands.) Look for hundreds or thousands of goroutines all blocked at the same location: that’s your leak.

goroutine 1042 [chan receive, 3 minutes]:
main.worker()
    /app/worker.go:45 +0x68
created by main.startWorkers
    /app/pool.go:23 +0x4c

goroutine 1043 [chan receive, 3 minutes]:
...
(997 more like this)

A thousand goroutines parked on chan receive for 3 minutes almost always means nobody closed the jobs channel after the last job was sent. The fix is covered in the concurrency mistakes post in this series: the producer must own close(jobs), ideally via defer.
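As a reminder of what that fix looks like, here is a small self-contained sketch. It is not the code from the earlier post; runPool and its parameters are hypothetical. The point is only that the sending side closes the channel with defer, which is what lets every worker's range loop end.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// runPool starts `workers` goroutines, feeds them `count` jobs, and
// returns how many jobs were handled.
func runPool(workers, count int) int64 {
	jobs := make(chan int)
	var handled atomic.Int64
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range jobs { // ends only when jobs is closed
				handled.Add(1)
			}
		}()
	}

	// The producer owns the channel, so the producer closes it.
	func() {
		defer close(jobs)
		for i := 0; i < count; i++ {
			jobs <- i
		}
	}()

	wg.Wait() // would block forever if close(jobs) were missing
	return handled.Load()
}

func main() {
	fmt.Println("goroutines before:", runtime.NumGoroutine())
	fmt.Println("jobs handled:", runPool(8, 100))
	fmt.Println("goroutines after:", runtime.NumGoroutine())
}
```

Remove the defer close(jobs) line and the program hangs in wg.Wait with every worker parked on chan receive, which is the same picture the goroutine dump above shows.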

You can also check the goroutine count programmatically:

// In your health check endpoint
func healthHandler(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintf(w, "goroutines: %d\n", runtime.NumGoroutine())
}

If that number grows continuously over hours of runtime and never stabilizes, you have a leak.


Problem 3: Lock contention

Symptom: throughput is low even though the CPU is not saturated. Workers are running but spend most of their time waiting for a lock instead of making progress.

Mutex profiling is off by default because it has overhead. Enable it from inside the process only while investigating contention:

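import "runtime"

// Call from inside a function, e.g. main or an admin handler.
runtime.SetMutexProfileFraction(1)       // 1 = report every contention event
defer runtime.SetMutexProfileFraction(0) // 0 = off again when that function returns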

Then collect the mutex profile:

curl -o mutex.prof "http://localhost:6060/debug/pprof/mutex"
go tool pprof -http=:8080 mutex.prof
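To try the mutex profile without touching a real service, the sketch below turns profiling on, hammers one mutex from several goroutines, and reads the profile in-process. contendedProfile is a hypothetical helper and the workload is deliberately artificial. It also shows a slightly safer restore pattern: SetMutexProfileFraction returns the previous setting, so you can put back whatever was there instead of assuming 0.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
)

// contendedProfile enables mutex profiling, runs a deliberately contended
// counter workload, and returns the final count plus the text-form profile.
func contendedProfile(workers, iterations int) (int, string, error) {
	prev := runtime.SetMutexProfileFraction(1)  // 1 = record every contention event
	defer runtime.SetMutexProfileFraction(prev) // restore the earlier setting

	var (
		mu      sync.Mutex
		counter int
		wg      sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < iterations; j++ {
				mu.Lock()
				counter++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()

	var buf bytes.Buffer
	if err := pprof.Lookup("mutex").WriteTo(&buf, 1); err != nil {
		return counter, "", err
	}
	return counter, buf.String(), nil
}

func main() {
	total, profile, err := contendedProfile(8, 10000)
	if err != nil {
		fmt.Println("profile failed:", err)
		return
	}
	fmt.Println("increments:", total)
	fmt.Print(profile)
}
```

How many contention events show up depends on scheduling, so treat the profile contents as indicative rather than repeatable.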

Common cause in worker pools: a shared results map protected by a single mutex, written to by all workers simultaneously.

// ❌ All workers contend on one mutex
type ResultCollector struct {
    mu      sync.Mutex
    results map[string]Result
}

func (c *ResultCollector) Add(id string, result Result) {
    c.mu.Lock()
    c.results[id] = result
    c.mu.Unlock()
}

Solutions, in order of preference:

// ✅ Option 1: Use a results channel instead of shared state
results := make(chan Result, 1000)

// Workers send to channel (no lock)
go func() {
    for job := range jobs {
        results <- processJob(ctx, job) // assumes a processJob variant that returns a Result
    }
}()

// Single collector goroutine owns the map (no contention).
// Close results once every worker has returned, or this goroutine leaks.
collected := make(map[string]Result)
go func() {
    for result := range results {
        collected[result.ID] = result
    }
}()

// ✅ Option 2: sync.Map for concurrent read-heavy workloads
var results sync.Map
results.Store(id, result)
val, ok := results.Load(id)

The channel approach is idiomatic Go — share memory by communicating rather than communicating by sharing memory. The sync.Map approach is better when you need random access to results across goroutines.
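For completeness, here is the channel approach as a runnable sketch with the shutdown steps filled in. Result, processJob and collect are simplified stand-ins invented for this example (processJob here takes an int and returns a Result, unlike the earlier signature). The parts worth noticing are who closes which channel: the producer closes the jobs channel, and a small goroutine closes results once every worker has finished.

```go
package main

import (
	"fmt"
	"sync"
)

// Result is a hypothetical result type keyed by ID.
type Result struct {
	ID    string
	Value int
}

// processJob is a stand-in for real work.
func processJob(job int) Result {
	return Result{ID: fmt.Sprintf("job-%d", job), Value: job * job}
}

// collect fans jobs out to `workers` goroutines and gathers the results
// into a map that only the calling goroutine touches, so no mutex is needed.
func collect(jobs []int, workers int) map[string]Result {
	jobCh := make(chan int)
	results := make(chan Result, len(jobs))

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range jobCh {
				results <- processJob(job)
			}
		}()
	}

	// Producer: owns jobCh and closes it when all jobs are sent.
	go func() {
		defer close(jobCh)
		for _, j := range jobs {
			jobCh <- j
		}
	}()

	// Closer: results is closed only after every worker has returned.
	go func() {
		wg.Wait()
		close(results)
	}()

	collected := make(map[string]Result)
	for r := range results {
		collected[r.ID] = r
	}
	return collected
}

func main() {
	got := collect([]int{1, 2, 3, 4, 5}, 3)
	fmt.Println("results collected:", len(got))
	fmt.Println("job-3 value:", got["job-3"].Value)
}
```

Because the map has a single owner, the mutex profile for this version should show nothing from the collector at all.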


A quick reference for which profile to use

High CPU, unclear where — CPU profile: pprof .../profile?seconds=30.

Memory growing over time — heap profile (inuse_space): pprof .../heap.

High GC pressure — heap profile (alloc_objects): pprof .../heap.

Goroutine count growing — goroutine profile: pprof .../goroutine?debug=1.

Low throughput despite low CPU — mutex or block profile: pprof .../mutex or .../block. Enable mutex profiling with runtime.SetMutexProfileFraction; enable block profiling with runtime.SetBlockProfileRate.


One workflow for production incidents

When something is wrong and you need to understand it quickly:

# 1. Check goroutine count (cheap, immediate): the first line reports the total
curl -s "localhost:6060/debug/pprof/goroutine?debug=1" | head -50

# 2. If CPU is high, get a 30s CPU profile
curl -o cpu.prof "localhost:6060/debug/pprof/profile?seconds=30"

# 3. If memory is growing, get a heap snapshot
curl -o heap.prof localhost:6060/debug/pprof/heap

# 4. Open whichever profile is relevant
go tool pprof -http=:8080 cpu.prof

The whole process takes under 5 minutes. You don’t need to reproduce the problem locally, you don’t need to redeploy with extra instrumentation, and you don’t need to restart the service. The profiler runs against the live process.


Summary

pprof is built into Go’s standard library and available with a single import. The three problems it catches most reliably in worker pool services are excessive allocations (fix: sync.Pool), goroutine leaks (fix: close channels, use context), and mutex contention (fix: channel-based result collection or sync.Map). The flame graph view in go tool pprof -http is the fastest path from profile to finding.

The only prerequisite is running the service under realistic load when you collect the profile. A profile from a quiet service tells you where your code could spend time, not where it does.


Previous in this series: context.Context Is Not Optional
Next: Container-aware resource management in Go.
