The Go Execution Tracer: Finding Latency the Profiler Misses

#go #performance #concurrency

Book: The Complete Guide to Go Programming
Also by me: Hexagonal Architecture in Go — the companion book in the Thinking in Go series
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

You've been here. A handler's p99 is 400ms and nobody knows why.
The CPU profile is clean: the flame graph is wide and shallow, no
single function eats real time. Memory looks fine. The service is
barely working, yet requests sit there for hundreds of
milliseconds doing nothing you can see.

That is the signature of a problem pprof cannot show you. The CPU
profiler samples what the CPU is executing. When a goroutine is
blocked on a channel, parked waiting for the scheduler, or stuck
behind a garbage-collection pause, it is executing nothing. It
does not appear in the CPU profile at all, because from the CPU's
point of view it does not exist. The latency is real. The profiler
is blind to it.

The Go execution tracer sees the other half. It records what every
goroutine is doing over time, including the time it spends waiting.
That is where the missing 400ms lives.

What the tracer records that pprof does not

pprof answers "where is the CPU spending cycles." It samples the
stack a few hundred times a second and builds a statistical
picture. Good for CPU-bound code. Useless for a goroutine that is
idle by design.

The execution tracer answers a different question: "what was every
goroutine, processor, and system thread doing at each moment." It
is event-based, not sampled. The runtime emits an event when a
goroutine starts running, blocks on a channel, gets unblocked, does
a syscall, or gets preempted by the GC. You get a timeline, not a
histogram.

Turning it on for a snapshot takes three lines:

import (
    "os"
    "runtime/trace"
)

func main() {
    f, _ := os.Create("trace.out")
    defer f.Close()
    trace.Start(f)
    defer trace.Stop()

    run() // your workload
}

Then open the viewer:

go tool trace trace.out

That launches a local web UI. The interesting views are "View
trace" (the timeline) and the per-goroutine breakdown that tells
you where wall-clock time actually went.

If the code under suspicion is a benchmark, skip the manual
plumbing:

go test -trace trace.out -bench BenchmarkHandler

For a live service, wire the standard net/http/pprof endpoint and
pull a window on demand. go tool trace reads a file, so fetch
the window first, then open it:

curl -o trace.out \
  "http://localhost:6060/debug/pprof/trace?seconds=5"
go tool trace trace.out

Five seconds of trace on a busy service is a lot of data. Keep the
window short.

Reading the lanes: scheduler, GC, and syscalls

The timeline is organized into horizontal lanes. Learning to read
three of them covers most latency mysteries.

The proc lanes. There is one lane per P (logical processor,
capped by GOMAXPROCS). Each shows which goroutine is running on
that processor at each instant. Gaps in a proc lane mean the
processor had no runnable goroutine to run. If your service is slow
and the proc lanes are full of gaps, you are not CPU-bound. You are
blocked on something, and the CPU profile was never going to tell
you what.

The GC region. Garbage collection shows up as its own marked
region on the timeline. Go's collector is concurrent, but it still
has short stop-the-world phases and it steals CPU from your
goroutines during the mark phase. If your p99 spikes line up with
GC regions, the latency is allocation pressure, and the fix is
fewer allocations, not faster handlers. The CPU profile would have
shown runtime.mallocgc as busy, but never connected it to the
specific slow requests. The tracer does, on the timeline.

Syscall rows. When a goroutine makes a blocking syscall, the
runtime hands its P to another thread so other goroutines keep
running. The tracer marks that goroutine as being in a syscall for
the whole duration. A goroutine that spends most of its life in a
syscall lane is waiting on the network, the disk, or a lock held
in C code. That is I/O latency, and again, invisible to a CPU
profile.

The mental shift: pprof is a picture of busy. The tracer is a
picture of waiting. Production latency is almost always a waiting
problem.

Spotting a goroutine that blocks when it should not

Here is a concrete example that looks fine and is not. A worker
pool where every worker shares one output channel with a slow
consumer:

func run() {
    jobs := make(chan int, 100)
    results := make(chan int) // unbuffered

    for w := 0; w < 8; w++ {
        go worker(jobs, results)
    }
    go func() {
        for i := 0; i < 10000; i++ {
            jobs <- i
        }
        close(jobs)
    }()

    for i := 0; i < 10000; i++ {
        <-results
        time.Sleep(50 * time.Microsecond) // slow drain
    }
}

func worker(jobs <-chan int, results chan<- int) {
    for j := range jobs {
        results <- compute(j) // blocks here
    }
}

Eight workers, plenty of parallelism on paper. Run it under the
tracer and open the per-goroutine view. You will see the workers
spend almost all their wall-clock time in one state: blocked
sending on results. The unbuffered channel plus a slow drain
means seven of the eight workers are parked at any moment, waiting
for the consumer to accept a value. The proc lanes are mostly idle.
Adding more workers does nothing, because the bottleneck is the
single unbuffered channel, not compute.

pprof on this program shows compute at the top of the CPU
profile and looks healthy. The CPU is busy when it runs. The
problem is it barely runs. Only the tracer shows the workers
sitting in "chan send" for the bulk of the timeline.

The tracer even labels the block reason. In the per-goroutine
detail you get a breakdown like this:

// Execution time:         3ms
// Sync block time:      810ms  <- waiting on channels
// Scheduler wait time:    2ms

When "Sync block time" dwarfs "Execution time," you have found
your latency, and no amount of staring at CPU flame graphs would
have surfaced it.

When the tracer beats pprof, and when it doesn't

Reach for the execution tracer when:

p99 is high but CPU utilization is low. Classic waiting problem.
Latency spikes correlate with nothing in the CPU profile.
You suspect channel contention, lock contention, or a goroutine serializing work that should be parallel.
You want to see whether GC pauses are eating specific requests.
You need to know why GOMAXPROCS cores are sitting idle under load.

Reach for pprof when:

CPU utilization is genuinely high and you need to cut cycles.
You are chasing an allocation rate (the heap profile), not latency.
You want a cheap always-on profile. The execution tracer is heavier and produces far more data; it is a scalpel for a window, not a monitor you leave running.

They are complementary. The profiler tells you what is expensive
when it runs. The tracer tells you why it is not running. Start
with pprof for a CPU-bound service. Reach for the tracer the
moment the numbers stop adding up, the CPU looks calm, and the
latency refuses to explain itself.

What to do on Monday

Take the slowest endpoint you own. Pull a five-second trace under
production load with the pprof endpoint. Open go tool trace, go
to the per-goroutine view, and sort by sync block time. If the top
goroutines spend more time blocked than executing, you have been
optimizing the wrong thing. The fix is on the timeline, not in the
flame graph.

The tracer has a learning curve the profiler does not. The payoff
is that it answers the question production latency actually asks:
what is my code waiting for. The profiler only ever told you what
it was doing.

Runtime behavior like this is where Go rewards you for
understanding the machine under the language. The Complete Guide
to Go Programming goes deep on the scheduler, the garbage
collector, and how goroutines actually park and wake, which is
exactly the model you need to read a trace fluently. Hexagonal
Architecture in Go keeps that instrumentation at the right
boundary, so tracing and profiling live in the adapters and never
leak into your domain.