Nithin Bharadwaj
Go Execution Tracer: The Tool That Finally Explains Why Your Concurrent Application Is Slow


When your Go application slows down, you might reach for a CPU or memory profile. I’ve done that many times. You get a report showing which functions are using the most time or allocating the most memory. But sometimes, especially when your program is doing many things at once, those profiles leave you with more questions than answers. You see high CPU usage, but you can’t tell why. Is a goroutine stuck waiting for a lock? Is the scheduler bouncing work between threads? Is the garbage collector stopping the world too often? This is where traditional tools stop and a more powerful one begins.

Go’s execution tracer is that tool. It doesn't just tell you what code is running; it tells you the story of how it runs. It records a timeline of events inside the Go runtime itself. Every time a goroutine starts, stops, waits, or wakes up, the tracer makes a note. It times every garbage collection cycle, every network wait, and every system call, all with microsecond precision. It’s like having a super-slow-motion camera inside your program. When you have a performance mystery in a concurrent system, this is how you solve it.

Let’s start with the basics. How do you even turn it on? The simplest way is to write a trace to a file and then use the go tool trace command to view it.

package main

import (
    "os"
    "runtime/trace"
)

func main() {
    // Create a file to hold the trace data.
    f, err := os.Create("my_trace.out")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // Start the tracer, writing to our file.
    if err := trace.Start(f); err != nil {
        panic(err)
    }
    defer trace.Stop()

    // This is your application code. Everything here gets traced.
    runMyApplication()
}

After running this, you get a file called my_trace.out. You then run go tool trace my_trace.out. This starts a local web server and opens your browser to a special interface. At first glance, it can be overwhelming: a timeline full of colored bars, one row per processor. But every bar has a meaning. Bars on the proc rows are goroutines executing, a dedicated row shows garbage collection cycles, and counter graphs at the top track goroutine counts and heap size over time.

The magic is in the details. You can click on any point in time and see exactly what every goroutine in your program was doing at that microsecond. This is impossible with a standard profile.

I remember debugging an API server that would occasionally have huge spikes in response time. CPU profiles showed nothing unusual. The tracer told a different story. I opened the trace and zoomed into one of the slow request periods. I saw a solid block of red—a garbage collection pause—that lasted nearly 100 milliseconds. More importantly, I could see the stack trace that triggered the large allocation leading to that GC. It was a simple line of code converting a large struct to JSON for logging. The fix was to move that logging to a background goroutine. The tracer gave me the direct, visual cause and effect.

But you don't always want to trace everything. You might only care about a specific operation, like handling an HTTP request. You can mark regions of your code to make them stand out in the trace.

import (
    "net/http"
    "runtime/trace"
)

func handleRequest(w http.ResponseWriter, r *http.Request) {
    // Start a named region. It appears as a labeled section on the
    // goroutine's timeline. StartRegion returns a *trace.Region;
    // call End on it when the work is done.
    region := trace.StartRegion(r.Context(), "HandleAPIRequest")
    defer region.End()

    // Your handler logic here.
    userID := r.URL.Query().Get("id")
    data, err := fetchUserData(r.Context(), userID) // this function can contain its own regions
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    w.Write(data)
}

When you look at the trace now, you’ll see a clear block labeled “HandleAPIRequest”. Inside it, you can see sub-regions for the database call, JSON marshaling, or any other function you annotate. This connects the abstract runtime events to your actual business logic. It answers the question, “What was my code doing when the scheduler went crazy?”

Let’s talk about the scheduler, because that’s where concurrency problems often live. Go manages goroutines on a pool of OS threads. The scheduler’s job is to keep those threads busy. Sometimes, it doesn’t work perfectly. A goroutine might be ready to run, but no thread is free. This is called scheduler latency. The tracer shows this clearly.

Here’s a small, problematic program.

package main

import (
    "os"
    "runtime"
    "runtime/trace"
    "sync"
)

func main() {
    f, err := os.Create("sched_latency.out")
    if err != nil {
        panic(err)
    }
    defer f.Close()
    if err := trace.Start(f); err != nil {
        panic(err)
    }
    defer trace.Stop()

    // Restrict to a single OS thread. This is an exaggerated example.
    runtime.GOMAXPROCS(1)

    var wg sync.WaitGroup
    for i := 0; i < 10; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            // Simulate some work
            var counter int
            for j := 0; j < 1e7; j++ {
                counter++
            }
        }(i)
    }
    wg.Wait()
}

If you trace this and open it in the tool, you’ll see a tell-tale pattern. The goroutines won’t run in parallel. They’ll take turns, one after the other, on the single thread. You’ll see each goroutine in the “Runnable” state (waiting for a turn) while another executes. The timeline looks like a staggered staircase. In a real-world app, you might see this if GOMAXPROCS is set too low for your workload, or if you have a few long-running goroutines that don’t yield, starving others.

The tracer can also show you blocking operations. This is its killer feature. A CPU profile shows time spent on the CPU. It hides time spent waiting. But waiting is often the root of latency.

func problematicFunction() {
    var mu sync.Mutex
    data := make(map[string]int) // the map must be initialized before any writes
    var wg sync.WaitGroup

    // A goroutine that holds the lock for a long time.
    wg.Add(1)
    go func() {
        defer wg.Done()
        mu.Lock()
        time.Sleep(100 * time.Millisecond) // Simulating a long operation
        data["key"] = 1
        mu.Unlock()
    }()

    // Ten goroutines that need the lock.
    for i := 0; i < 10; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            mu.Lock() // <<< They will all block here, one after another.
            data["key"] = id
            mu.Unlock()
        }(i)
    }
    wg.Wait()
}

In the trace view, you won’t just see that these goroutines are slow. You’ll see a cluster of them all in a “SyncBlock” state, lined up behind each other, with the stack trace pointing to the mu.Lock() line. The duration of each block is right there. This visual proof is what you need to argue for refactoring the locking strategy, perhaps using a read-write mutex or sharding the data.

Network and system call blocking are just as clear. Let’s say you have a microservice calling three others.

func callServices(ctx context.Context) (Result, error) {
    defer trace.StartRegion(ctx, "callServices").End()

    var result Result
    var err error

    // Network call 1
    r1 := trace.StartRegion(ctx, "CallUserService")
    result.User, err = userService.Fetch(ctx)
    r1.End()
    if err != nil {
        return result, err
    }

    // Network call 2
    r2 := trace.StartRegion(ctx, "CallOrderService")
    result.Orders, err = orderService.Fetch(ctx)
    r2.End()
    if err != nil {
        return result, err
    }

    // Network call 3
    r3 := trace.StartRegion(ctx, "CallInventoryService")
    result.Inventory, err = inventoryService.Fetch(ctx)
    r3.End()

    return result, err
}

If you trace this, you might expect the three calls to happen in sequence, one after the other. But what if they’re supposed to be concurrent? The trace will show you the ugly truth: three separate, sequential blocks of time where a goroutine is in a “Syscall” or “NetworkWait” state. The fix is to launch goroutines and use a sync.WaitGroup or channels. The tracer doesn’t just find problems; it helps you verify your concurrency patterns are working as you drew them on the whiteboard.

Memory and garbage collection are another major source of pauses. The tracer has a special view for GC activity.

func triggerGCPauses() {
    // GC events are always recorded; trace.Start takes only the writer.
    f, _ := os.Create("gc_trace.out")
    trace.Start(f)
    defer trace.Stop()

    // A loop that creates a lot of short-lived garbage.
    for i := 0; i < 1000; i++ {
        // This slice allocation is temporary and will be garbage.
        data := make([]byte, 10*1024) // 10KB
        _ = process(data)             // process does something and discards 'data'
    }
}

The tracer records the start and end of every garbage collection cycle by default; no extra option is needed. In the web view, the GC row on the timeline shows each cycle, and the minimum mutator utilization page summarizes how much time collection is taking from your program. You can zoom into a long cycle and see what was happening right before it. Often, you'll find a pattern: a specific function allocates a large chunk of memory, which quickly becomes garbage and forces a collection. This leads to optimizations like using sync.Pool for common objects.

The overhead of tracing is a valid concern. You shouldn’t leave it on in production all the time. In my experience, the overhead is between 5% and 15%, depending on how many events you capture. The strategy is to use it like a diagnostic scan. You turn it on for short, representative periods when you are investigating a known issue. You can also trigger it programmatically based on conditions, like when a request takes longer than a certain threshold.

Here’s a more complete example of a managed tracer you might integrate.

type DebugTracer struct {
    enabled   bool
    traceFile *os.File
    mu        sync.Mutex
}

func (dt *DebugTracer) StartTrace(duration time.Duration, path string) error {
    dt.mu.Lock()
    defer dt.mu.Unlock()

    if dt.enabled {
        return fmt.Errorf("trace already in progress")
    }

    f, err := os.Create(path)
    if err != nil {
        return err
    }
    // trace.Start takes only the writer; GC, scheduler, and syscall
    // events are all captured by default.
    if err := trace.Start(f); err != nil {
        f.Close()
        return err
    }

    dt.traceFile = f
    dt.enabled = true

    // Automatically stop after the duration.
    time.AfterFunc(duration, func() {
        dt.StopTrace()
        log.Printf("Trace saved to %s for analysis", path)
    })
    return nil
}

func (dt *DebugTracer) StopTrace() error {
    dt.mu.Lock()
    defer dt.mu.Unlock()

    if !dt.enabled {
        return nil
    }
    trace.Stop()
    err := dt.traceFile.Close()
    dt.enabled = false
    dt.traceFile = nil
    return err
}

// Use it in an HTTP server.
func main() {
    tracer := &DebugTracer{}
    http.HandleFunc("/debug/trace", func(w http.ResponseWriter, r *http.Request) {
        dur, _ := time.ParseDuration(r.URL.Query().Get("duration"))
        if dur == 0 {
            dur = 5 * time.Second
        }
        if err := tracer.StartTrace(dur, "/tmp/go_trace.out"); err != nil {
            http.Error(w, err.Error(), 500)
            return
        }
        w.Write([]byte("Tracing started\n"))
    })

    // ... your server code
}

This gives you an HTTP endpoint to start a trace for 5 seconds. You can hit it when your service is behaving poorly, collect the trace file, and analyze it later.

The final step is analysis. The go tool trace interface is powerful but manual. For larger teams, you might want to extract metrics automatically. You can parse the trace file programmatically (the golang.org/x/exp/trace package provides a reader for recent trace formats), though the API is not officially stable. More commonly, you learn to read the visualizations and teach your team to do the same. You look for a few key things: thick walls of red (GC), long chains of blocked goroutines (contention), large gaps where goroutines are runnable but not running (scheduler latency), and long, unbroken regions of network wait (sequential I/O).

Using the tracer changes how you think about performance. You stop guessing. Instead of saying, “I think the lock is the problem,” you can say, “The trace shows 15 goroutines blocked on this mutex for an average of 45 milliseconds.” This is a fact. You fix the mutex, take another trace, and see the block chain is gone. The improvement is now measurable and visible.

It turns performance tuning from a dark art into a systematic investigation. You have a timeline of evidence. For anyone building concurrent systems in Go, learning to use the execution tracer is not an advanced trick. It is a fundamental skill. It provides the missing link between your code, the runtime, and the actual behavior of your program on the CPU. Start with a simple trace of a slow test. Look at it. Click around. The story of your program’s performance is waiting in there.
