Shannon

Utilizing the Go 1.25 Flight Recorder with tracing middleware

Table of Contents

  1. Overview
  2. Utilizing a FlightRecorder
  3. Code
  4. Run it
  5. Looking at the trace file
  6. Wrap up

Overview

I recently came across this video from Datadog about profiling request latency, and I realized I had never really seen or used Go's runtime/trace package. Profiling production services is not an easy task, and the concept of continuous profiling has become a lot more popular in recent years; this CNCF article gives a good overview of the topic.

Anyway, Go has some more recent functionality that allows this kind of tracing with low overhead, meaning it won’t consume 10-20% of your CPU but closer to 1-2%. Additionally, it lets you capture short snippets of trace data instead of a full continuous trace, which can consume significant disk and RAM if you record for an extended period.

The purpose of this article is to give more of a layman’s approach and another example beyond the following two articles (tbh, I’d suggest reading those two over this one :-P):

  1. More powerful Go execution traces
  2. Flight Recorder in Go 1.25

I want to state up front that I am not an expert on this and still have a lot of questions myself. The hope is to eventually see more examples of this being used in production and to keep digging into it myself.

Utilizing a FlightRecorder

The FlightRecorder (FR) type is new in Go 1.25 and captures execution trace data with low overhead, as mentioned above. It takes two optional configuration fields from a struct called FlightRecorderConfig: a minimum age (MinAge) or a maximum number of bytes (MaxBytes). The articles above go into this as well, but I’ll be setting MinAge since I have a simulated estimate for how long a slow request will take.

It is quite straightforward to set up a new FR:

// Set up the flight recorder.
fr := trace.NewFlightRecorder(trace.FlightRecorderConfig{
    MinAge: 5 * time.Second,
})
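
For comparison, here is the size-based variant. The 4 MiB cap below is just an illustrative value I picked, not a recommendation:

// Alternative: bound the in-memory trace window by size instead of age.
fr := trace.NewFlightRecorder(trace.FlightRecorderConfig{
    MaxBytes: 4 << 20, // keep roughly the last 4 MiB of trace data
})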

Code

For this code, I decided to wire the tracing up as HTTP middleware because it feels like an appropriate, realistic example. I will get into some specific code details further down, but the pseudocode is essentially this:

  1. curl /ping
  2. HTTP server is called → middleware handler is called → ping handler is called
  3. Generate a random number between 0 and 2
  4. If the random number == 0, simulate blocking syscalls with sleeps, and simulate scheduler pressure by spawning many goroutines while limiting GOMAXPROCS to 2
  5. If the request takes longer than 0.5 seconds, take a trace snapshot
  6. Write trace to trace.[number-here].out

Code comments specific to this article are written as n_[comment] and explained in the numbered list after the code.

package main

import (
    "bytes"
    "fmt"
    "log"
    "math/rand"
    "net/http"
    "os"
    "runtime"
    "runtime/trace"
    "sync/atomic"
    "time"
)

// app holds the flight recorder, snapshot count, and max snapshots to take
type app struct {
    fr *trace.FlightRecorder
    // 1_ limit the number of snapshots to take
    snapshotCount atomic.Int32
    maxSnapshots  int32
}

// doBlockingWork simulates blocking syscalls by sleeping repeatedly
func doBlockingWork() {
    for i := 0; i < 100; i++ {
        time.Sleep(10 * time.Millisecond) // sleep blocking syscall
    }
}

// doSchedulerWork simulates scheduler pressure by spawning many goroutines on a capped GOMAXPROCS
func doSchedulerWork() {
    // 2_ purposely cap the number of CPUs to 2
    runtime.GOMAXPROCS(2)
    for i := 0; i < 100; i++ {
        go func() {
            _ = fmt.Sprintf("hello")
        }()
    }
    time.Sleep(1 * time.Second)
}

// handlePing responds with a "pong" message
func (a *app) handlePing(w http.ResponseWriter, r *http.Request) {
    start := time.Now()

    // 3_ 1/3 chance of doing work
    if rand.Intn(3) == 0 {
        doBlockingWork()
        doSchedulerWork()
    }

    w.WriteHeader(http.StatusOK)
    fmt.Fprintf(w, "pong, duration=%s\n", time.Since(start))
}

// tracingMiddleware is a middleware that takes a snapshot of the trace if the request takes longer than 500ms
func (a *app) tracingMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        // 4_ serve the request (the /ping endpoint)
        next.ServeHTTP(w, r)
        if time.Since(start) > 500*time.Millisecond {
            // 5_ take a snapshot of the trace if > 500ms
            a.takeSnapshot()
        }
    })
}

// takeSnapshot writes the trace to a file
func (a *app) takeSnapshot() {
    count := a.snapshotCount.Add(1)
    // 6_ limit the total number of snapshots. this could rapidly fill up disk space if not limited.
    if count <= a.maxSnapshots {
        filename := fmt.Sprintf("trace.%d.out", count)
        log.Printf("Taking trace %d/%d and saving to file %s", count, a.maxSnapshots, filename)
        var b bytes.Buffer
        _, err := a.fr.WriteTo(&b)
        if err != nil {
            log.Print(err)
            return
        }

        // write to a file for each snapshot
        if err := os.WriteFile(filename, b.Bytes(), 0o755); err != nil {
            log.Print(err)
            return
        }
    }
}

func main() {
    fr := trace.NewFlightRecorder(trace.FlightRecorderConfig{
        // 7_ set the minimum age of the trace to 5 seconds, which is roughly double the duration of a long request
        MinAge: 5 * time.Second,
    })
    fr.Start()
    defer fr.Stop()

    app := &app{
        fr:            fr,
        snapshotCount: atomic.Int32{},
        maxSnapshots:  2,
    }

    // run http server
    http.Handle("/ping", app.tracingMiddleware(http.HandlerFunc(app.handlePing)))
    log.Fatal(http.ListenAndServe(":5051", nil))
}


Let’s talk about each of the labeled comments now:

  1. Basically, I’m showing that there should be some kind of hard cap on how many traces you take. On a production service, each snapshot could be large, and without a cap you run the risk of filling up disk space. If containerized, that disk might be shared with other containers and you run the risk of evictions, etc.
  2. We’re capping GOMAXPROCS so that all the goroutines cannot be scheduled at once. This essentially makes the wall-clock time much longer than the execution time, which we should be able to see in the trace.
  3. This just ensures the slow path doesn’t run on every request. It’s a contrived way of making some requests take longer than others.
  4. Every http.Handler has the ServeHTTP(w, r) method, which is how the middleware calls the next handler. In this case, our only other handler is handlePing, but it is wired up in such a way that you could wrap any Handler.
  5. This will only take a snapshot if the request is slow. A normal GET /ping returns in a few microseconds.
  6. Basically the same as number 1. If I were writing this for production, I might have tracing disabled by default and add a --enable-tracing flag or something similar to start the continuous profiling (see the sketch after this list).
  7. As mentioned above, the suggestion is to double what you expect the duration to be. I expect a slow request to take around 2-2.5 seconds, so we set MinAge to 5 seconds.
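
For point 6, here is a minimal sketch of what gating the recorder behind a flag might look like. The --enable-tracing flag name is just my assumption, not an established convention:

package main

import (
    "flag"
    "runtime/trace"
    "time"
)

// Hypothetical flag; the flight recorder only starts when it is set.
var enableTracing = flag.Bool("enable-tracing", false, "start the flight recorder and allow trace snapshots")

func main() {
    flag.Parse()

    var fr *trace.FlightRecorder
    if *enableTracing {
        fr = trace.NewFlightRecorder(trace.FlightRecorderConfig{
            MinAge: 5 * time.Second,
        })
        fr.Start()
        defer fr.Stop()
    }

    // The rest of the server setup from the full example stays the same;
    // takeSnapshot would also need to check that fr is non-nil before calling WriteTo.
    _ = fr
}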

Run it

Let’s spin this thing up.

For the server: go run main.go

And in another terminal for your client: while true; do curl localhost:5051/ping; sleep 1; done

You should see something like this for your client side:

pong, duration=4.333µs
pong, duration=2.417µs
pong, duration=2.126059041s
pong, duration=1.167µs
pong, duration=1.458µs
pong, duration=2.140110958s
pong, duration=1.333µs
pong, duration=1.416µs
pong, duration=1.833µs

And then there should be some generated trace files now too (in the directory your server was running):

➜  tracing git:(main) ls trace.*
trace.1.out     trace.2.out

Looking at the trace file

We use the following command to look at the trace file: go tool trace trace.1.out. It should automatically open a browser tab pointing at a localhost address, and the command output also prints the link.

I am not an expert by any means on the Chromium profiling UI, but let’s examine the goroutines spawned by the doSchedulerWork function.

Goroutine analysis: main.doSchedulerWork.func1 → Execution time and Scheduled wait time are listed.

We can see here that the scheduled wait time is an order of magnitude greater than the execution time (microseconds of waiting versus nanoseconds of actual execution).

Wrap up

I hope this was an interesting read, and if you come across more on this topic or better examples of using the package, send them my way!
