Distributed tracing helps me see what happens to a request as it travels through my software. When a user clicks a button, that single action might trigger dozens of small operations across many different services. A tracing system strings those operations together into a single story. This story shows me where delays happen and how my services depend on each other. Today, I'll show you how to build this capability in Go, focusing on three main ideas: carrying trace information between services, collecting data efficiently, and deciding what to record without overloading the system.
Let's start with the big picture. My implementation is built around a Tracer struct, the main object that manages the entire tracing process. When I create a Tracer, I give it a name for my service and a sampling rate. The sampling rate is crucial: it tells the system what percentage of requests I actually want to record, because in a high-traffic system, recording every single request would be impossibly expensive. These settings live in a small configuration object that controls the tracer's behavior.
tracer := NewTracer("order-service", 0.1) // Sample 10% of traces
The heart of the system is the span. A span represents one unit of work, like a database query or an HTTP handler. My StartSpan method is where the magic begins. It needs a context, which is Go's way of carrying request-scoped information, and a name for the operation.
First, it checks if there's already a trace happening by looking for a parent span in the provided context. If a parent exists, the new span will become its child. This is how we build the hierarchy of a trace. Then, it asks the sampler if this new span should be recorded. The sampler makes a decision based on the rules I've configured, like a simple percentage or a more complex rate limit.
func (t *Tracer) StartSpan(ctx context.Context, name string, opts ...SpanOption) (context.Context, *Span) {
    var parentSpanContext trace.SpanContext
    if parent := trace.SpanFromContext(ctx); parent != nil {
        parentSpanContext = parent.SpanContext()
    }
    samplingResult := t.sampler.ShouldSample(SamplingParameters{
        TraceID:       generateTraceID(),
        ParentContext: parentSpanContext,
        Name:          name,
        Attributes:    make(map[string]interface{}),
    })
    // ... create a real or no-op span based on the sampling decision
}
If the sampler says "no," we return a special kind of span that does nothing. It's a no-op. This is important for performance. The code still runs through the same functions, but the tracing instrumentation adds almost zero cost. If the sampler says "yes," we create a real span. For performance, I use a sync.Pool. This is a pool of reusable span objects. Creating and destroying millions of small objects can put a lot of pressure on Go's garbage collector. The pool keeps a collection of unused spans ready to go. When I need a new one, I get it from the pool. When I'm done, I clean it and put it back.
span := t.spanPool.Get().(*Span)
// ... configure the span
return ctx, span
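The full get/reset/put cycle around the pool can be sketched like this. The acquireSpan and releaseSpan helpers and the reset method are assumptions for illustration; the important rule is that a span is wiped clean before it goes back into the pool:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Span fields here are illustrative; the article's Span carries more state.
type Span struct {
	Name      string
	StartTime time.Time
	EndTime   time.Time
	Attrs     map[string]interface{}
}

// reset clears a span so it can be safely reused by another request.
func (s *Span) reset() {
	*s = Span{}
}

var spanPool = sync.Pool{
	New: func() interface{} { return new(Span) },
}

func acquireSpan(name string) *Span {
	s := spanPool.Get().(*Span)
	s.Name = name
	s.StartTime = time.Now()
	return s
}

func releaseSpan(s *Span) {
	s.reset() // never return a dirty span to the pool
	spanPool.Put(s)
}

func main() {
	s := acquireSpan("db.query")
	fmt.Println(s.Name)
	releaseSpan(s)
}
```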
A span holds useful information: a unique ID, its parent's ID, when it started, and when it ended. I can also attach attributes to it. Attributes are simple key-value pairs that describe the work, like http.method="GET" or db.query="SELECT * FROM users". This turns a generic "database call" span into a specific, searchable record. When the work is done, I call EndSpan. This calculates the duration, sets a final status (like success or error), and prepares the span data for export. The span is then sent to a channel, which acts as a buffer before the data is shipped to a backend system. Finally, I reset the span and return it to the pool.
Getting trace data from one service to another is called context propagation. This is how a trace started in Service A continues into Service B. For HTTP, the trace ID and span ID are packed into headers. My system uses a propagator to handle this. When an HTTP request arrives, the middleware uses Extract to pull the trace context out of the headers and put it into the Go context.
ctx := tracer.Extract(r.Context(), propagation.HeaderCarrier(r.Header))
When Service A needs to call Service B, it uses Inject to write that same context into the outgoing HTTP request headers. The downstream service will then extract it, and the circle continues. This works the same way for gRPC or message queues; you just use a different "carrier" object to move the data.
tracer.Inject(ctx, propagation.HeaderCarrier(r.Header))
Sampling is how I control the amount of data. The simplest method is probability sampling, which rolls the dice once for each new trace: it draws a random number, and if that number falls below the configured rate (say 0.1 for 10%), the trace is sampled. This is predictable and easy to understand, but it has a downside. During a sudden traffic surge, 10% of a huge number is still a huge number, which could crash my tracing backend.
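A probability sampler is only a few lines; this sketch (type and method names assumed) makes the dice roll explicit:

```go
package main

import (
	"fmt"
	"math/rand"
)

type ProbabilitySampler struct {
	rate float64 // fraction of traces to record, e.g. 0.1 for 10%
}

// ShouldSample draws one random number per new trace and compares it
// against the configured rate.
func (p *ProbabilitySampler) ShouldSample() bool {
	return rand.Float64() < p.rate
}

func main() {
	s := &ProbabilitySampler{rate: 0.1}
	sampled := 0
	for i := 0; i < 100000; i++ {
		if s.ShouldSample() {
			sampled++
		}
	}
	fmt.Println(sampled) // roughly 10% of 100000
}
```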
A more sophisticated approach is rate-limiting sampling. This sampler allows me to say, "I want at most 100 spans per second." It works with a credit system. Every second, it gains a certain number of credits (e.g., 100). When a span is created, it spends one credit. If there are no credits left, new spans are not sampled until more credits accumulate. This gives me a hard upper limit on my data volume.
func (rls *RateLimitingSampler) ShouldSample(params SamplingParameters) SamplingResult {
    rls.mu.Lock()
    defer rls.mu.Unlock()
    // Accrue credits for the time elapsed since the last call, and
    // advance the timestamp so credits aren't granted twice.
    now := time.Now()
    elapsed := now.Sub(rls.lastCreditUpdate).Seconds()
    rls.lastCreditUpdate = now
    rls.currentCredits += elapsed * rls.creditsPerSecond
    if rls.currentCredits > rls.creditsPerSecond {
        rls.currentCredits = rls.creditsPerSecond // cap the burst to one second's budget
    }
    // Spend a credit if we have one
    if rls.currentCredits >= 1.0 {
        rls.currentCredits -= 1.0
        return SamplingResult{Decision: RecordAndSample}
    }
    return SamplingResult{Decision: Drop}
}
An even smarter system might use adaptive sampling. This could increase the sampling rate automatically if it detects a rise in HTTP error codes, giving me more visibility during failures. The sampler interface makes it easy to plug in these different strategies.
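The pluggable interface that makes this swapping possible might look like the following; the names Decision, SamplingParameters, and AlwaysOnSampler are assumptions chosen to match the snippets above:

```go
package main

import "fmt"

type Decision int

const (
	Drop Decision = iota
	RecordAndSample
)

type SamplingParameters struct {
	TraceID string
	Name    string
}

type SamplingResult struct {
	Decision Decision
}

// Sampler lets probability, rate-limiting, and adaptive strategies be
// swapped without touching the tracer itself.
type Sampler interface {
	ShouldSample(params SamplingParameters) SamplingResult
}

// AlwaysOnSampler is a trivial implementation, useful in tests.
type AlwaysOnSampler struct{}

func (AlwaysOnSampler) ShouldSample(SamplingParameters) SamplingResult {
	return SamplingResult{Decision: RecordAndSample}
}

func main() {
	var s Sampler = AlwaysOnSampler{}
	fmt.Println(s.ShouldSample(SamplingParameters{Name: "test"}).Decision == RecordAndSample)
}
```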
Collecting spans is one thing; sending them somewhere useful is another. The TraceExporter handles this. Spans are sent into a buffered channel (exporterCh). A separate goroutine reads from this channel and groups spans into batches. Batching is critical for efficiency. Sending one span per HTTP request to my tracing backend would be incredibly wasteful. By grouping them, I might reduce network overhead by 95% or more.
This batch processor either waits for a batch to fill up (say, 100 spans) or for a timer to fire (e.g., every 5 seconds). This way, spans are exported quickly during high traffic, but I don't wait forever for a partial batch during low traffic.
func (te *TraceExporter) processBatches() {
    batch := make([]*SpanData, 0, te.batchSize)
    for {
        select {
        case span := <-te.exporterCh:
            batch = append(batch, span)
            if len(batch) >= te.batchSize {
                te.batchCh <- batch
                batch = make([]*SpanData, 0, te.batchSize) // Reset batch
            }
        case <-te.flushTicker.C: // Timer fired
            if len(batch) > 0 {
                te.batchCh <- batch
                batch = make([]*SpanData, 0, te.batchSize)
            }
        }
    }
}
Another goroutine, the exportWorker, takes these ready batches and sends them to the backend collector via an HTTP POST request. If the request fails, I log the error. In a production system, I'd add a retry mechanism with a delay, so transient network problems don't cause data loss.
To make tracing usable for a web service, I create HTTP middleware. This middleware wraps my normal HTTP handler. Its job is automatic: extract trace context, start a span for the request, call the real handler, record the HTTP status code, and end the span. This pattern means my application code doesn't need to think about tracing for every single route; it's handled globally.
func TracingMiddleware(tracer *Tracer, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        ctx := tracer.Extract(r.Context(), propagation.HeaderCarrier(r.Header))
        ctx, span := tracer.StartSpan(ctx, fmt.Sprintf("%s %s", r.Method, r.URL.Path))
        // ... call next.ServeHTTP with the traced context, record the
        // status code, and end the span
    })
}
Inside my application handlers, I can then create more precise child spans. For example, in an order API, I might create one span for the database query and another for checking the cache. Because these are created from the context provided by the middleware, they automatically link back to the main HTTP request span.
_, dbSpan := tracer.StartSpan(r.Context(), "db.query")
defer tracer.EndSpan(dbSpan, nil)
// ... run database logic
Managing resources is key in production. I keep an eye on statistics: how many spans started, how many ended, how many were dropped because the export buffer was full. Monitoring these stats helps me tune the buffer sizes and sampling rates. I also make sure to set timeouts on the HTTP client that exports data, so a slow backend doesn't stall my entire application.
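Those counters belong on the hot path, so they should be atomic rather than mutex-guarded; a sketch (the TracerStats type is an assumption):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// TracerStats counts span lifecycle events with atomic operations so
// hot paths never take a lock just to bump a counter.
type TracerStats struct {
	started int64
	ended   int64
	dropped int64 // spans lost because the export buffer was full
}

func (s *TracerStats) SpanStarted() { atomic.AddInt64(&s.started, 1) }
func (s *TracerStats) SpanEnded()   { atomic.AddInt64(&s.ended, 1) }
func (s *TracerStats) SpanDropped() { atomic.AddInt64(&s.dropped, 1) }

// Snapshot returns a consistent-enough view for dashboards and tuning.
func (s *TracerStats) Snapshot() (started, ended, dropped int64) {
	return atomic.LoadInt64(&s.started),
		atomic.LoadInt64(&s.ended),
		atomic.LoadInt64(&s.dropped)
}

func main() {
	var stats TracerStats
	stats.SpanStarted()
	stats.SpanEnded()
	started, ended, dropped := stats.Snapshot()
	fmt.Println(started, ended, dropped)
}
```

A rising dropped count is the signal to grow the exporter buffer or lower the sampling rate.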
Building this from the ground up gives me a clear understanding of the costs involved. Span creation from the pool costs mere nanoseconds. The memory used is proportional to the number of active, sampled traces. For most services, the overhead of tracing is far less than 1% of CPU time, which is a good trade for the operational clarity it provides.
The end result is a system that provides a coherent view of requests across service boundaries. I can see that a slow response on the checkout page was caused by a timeout in the payment service, which itself was waiting on a fraud-check database. This visibility turns a maze of interconnected services into a map I can understand and debug. It starts with a simple span, connected by context, filtered by sampling, and efficiently shipped to a place where I can piece the story together.