Nithin Bharadwaj

Distributed Tracing in Go: Build Observability Into Your Microservices From Scratch

As a best-selling author, I invite you to explore my books on Amazon. Don't forget to follow me on Medium and show your support. Thank you! Your support means the world!

When you have a system built from many small services, figuring out what went wrong or what's slow can feel like searching for a needle in a haystack. A request might bounce through five, ten, or even twenty different services. If it fails or slows down, which link in the chain is the problem? This is where distributed tracing comes in. It's like giving a unique passport to each request as it enters your system, stamping it at every service it visits. Later, you can gather all those stamps to see the complete journey.

I think of a "trace" as the entire story of one request. Each step in that story, like a call to a database or another service, is a "span." By collecting these spans, we can see the whole picture: how long each part took, if it failed, and how the services are connected.

Let's talk about how to build this in Go. The goal is to create something that adds clarity without slowing everything down. We need a few core parts: a way to start and track traces, a method to pass trace information between services, a smart system to decide which requests to record, and a place to store and analyze the data.

First, we establish a tracer. This is the main object that manages the tracing lifecycle. We'll use the OpenTelemetry project as a foundation because it provides excellent standards and tools.

package main

import (
    "context"
    "fmt"
    "log"
    "math/rand"
    "net/http"
    "sync"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/propagation"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
    "go.opentelemetry.io/otel/trace"
)

// DistributedTracer is our central manager.
type DistributedTracer struct {
    tracer     trace.Tracer
    propagator propagation.TextMapPropagator
    sampler    *TraceSampler
    spanStore  *SpanStorage
}

We initialize it by connecting to a backend like Jaeger, which will collect and visualize our traces. We also set up a "propagator." This is the crucial piece that knows how to pack trace information into HTTP headers or other message formats to send it to the next service.

func NewDistributedTracer(serviceName string, sampleRate float64) (*DistributedTracer, error) {
    exp, err := jaeger.New(jaeger.WithCollectorEndpoint())
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exp),
        sdktrace.WithSampler(sdktrace.TraceIDRatioBased(sampleRate)),
    )

    otel.SetTracerProvider(tp)
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},
        propagation.Baggage{},
    ))

    return &DistributedTracer{
        tracer:     tp.Tracer(serviceName),
        propagator: otel.GetTextMapPropagator(),
        sampler:    NewTraceSampler(sampleRate),
        spanStore:  NewSpanStorage(10000),
    }, nil
}

Now, the magic of propagation. When Service A calls Service B, it must pass along the trace context. We do this by injecting the data into the HTTP headers before the call.

// InjectTrace puts trace context into headers, gRPC metadata, etc.
func (dt *DistributedTracer) InjectTrace(ctx context.Context, carrier propagation.TextMapCarrier) {
    dt.propagator.Inject(ctx, carrier)
}

// ExtractTrace gets trace context from an incoming request.
func (dt *DistributedTracer) ExtractTrace(ctx context.Context, carrier propagation.TextMapCarrier) context.Context {
    return dt.propagator.Extract(ctx, carrier)
}

In practice, this means before Service A makes an HTTP request to B, it calls InjectTrace on the headers. Service B, upon receiving the request, immediately calls ExtractTrace to retrieve the context and link its work back to the original trace.
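
To make the client side concrete, here's a minimal sketch of Service A's outgoing call; callUserService and the user-service URL are placeholders for illustration, not part of the code above:

// callUserService sketches the client side of propagation: start a span for
// the outgoing call, then inject the trace context into the request headers
// so Service B can link its spans to the same trace. The function name and
// URL are hypothetical.
func callUserService(ctx context.Context, dt *DistributedTracer) error {
    ctx, span := dt.tracer.Start(ctx, "call_user_service")
    defer span.End()

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://user-service:8080/users/42", nil)
    if err != nil {
        return err
    }

    // Pack the current trace context into the outgoing headers.
    dt.InjectTrace(ctx, propagation.HeaderCarrier(req.Header))

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        span.RecordError(err)
        return err
    }
    defer resp.Body.Close()
    return nil
}

On the receiving side, the middleware shown later calls ExtractTrace, so Service B's spans land in the same trace.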

You can't trace every single request in a high-volume system. The overhead would be too great. This is where sampling is essential. You might only record 1 or 10 out of every 100 requests. The key is to sample intelligently.

A simple sampler might just use a random percentage. A more advanced one can adapt: if a particular service operation starts throwing errors, you might temporarily sample it more heavily to gather more data.

type TraceSampler struct {
    baseRate       float64
    adaptive       bool
    mu             sync.RWMutex
    operationRates map[string]float64
}

// NewTraceSampler builds a sampler with adaptive tuning enabled.
func NewTraceSampler(baseRate float64) *TraceSampler {
    return &TraceSampler{
        baseRate:       baseRate,
        adaptive:       true,
        operationRates: make(map[string]float64),
    }
}

func (ts *TraceSampler) ShouldSample(operation string) bool {
    ts.mu.RLock()
    rate, exists := ts.operationRates[operation]
    ts.mu.RUnlock()

    if !exists {
        rate = ts.baseRate // Default to the base rate
    }
    // Simple probabilistic check
    return rand.Float64() < rate
}

func (ts *TraceSampler) AdjustRate(operation string, currentErrorRate float64) {
    if !ts.adaptive {
        return
    }
    ts.mu.Lock()
    defer ts.mu.Unlock()

    if currentErrorRate > 0.1 { // If errors are high
        ts.operationRates[operation] = ts.baseRate * 3.0 // Sample more
        if ts.operationRates[operation] > 1.0 {
            ts.operationRates[operation] = 1.0 // Cap at 100%
        }
    } else {
        ts.operationRates[operation] = ts.baseRate // Reset to normal
    }
}

Now, where do we put the spans we collect? We need a temporary storage system. In production, spans are sent to a backend like Jaeger. For our understanding, let's look at a simple in-memory store.

type SpanStorage struct {
    spans    map[TraceID][]*SpanData
    mu       sync.RWMutex
    capacity int
}

type SpanData struct {
    TraceID    TraceID
    SpanID     SpanID
    Name       string
    StartTime  time.Time
    EndTime    time.Time
    Attributes map[string]interface{}
    HasError   bool
}

func (ss *SpanStorage) StoreSpan(span *SpanData) {
    ss.mu.Lock()
    defer ss.mu.Unlock()

    if len(ss.spans) >= ss.capacity {
        ss.evictOldest()
    }
    ss.spans[span.TraceID] = append(ss.spans[span.TraceID], span)
}
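
The store above leans on a few pieces it doesn't define. Here's a minimal sketch of them, assuming simple string-based IDs and a crude eviction policy; in a real service you would also feed StoreSpan from the SDK (for example via a custom sdktrace.SpanProcessor's OnEnd hook) rather than calling it by hand.

// TraceID and SpanID are simple string aliases used to key the in-memory
// store. (The OpenTelemetry SDK has its own binary ID types; strings keep
// this sketch small.)
type TraceID string
type SpanID string

func NewSpanStorage(capacity int) *SpanStorage {
    return &SpanStorage{
        spans:    make(map[TraceID][]*SpanData),
        capacity: capacity,
    }
}

// evictOldest drops one trace to stay under capacity. A real store would
// evict by age; Go's map iteration order makes this effectively random.
func (ss *SpanStorage) evictOldest() {
    for id := range ss.spans {
        delete(ss.spans, id)
        return
    }
}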

The most practical way to add tracing is through middleware. For an HTTP service, a middleware function can handle all the boilerplate: extracting context, starting a span, and recording the result.

func TracingMiddleware(tracer *DistributedTracer) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // 1. Extract trace context from incoming request headers
            ctx := tracer.ExtractTrace(r.Context(), propagation.HeaderCarrier(r.Header))

            // 2. Start a new span for this request
            spanName := r.Method + " " + r.URL.Path
            ctx, span := tracer.tracer.Start(ctx, spanName)
            defer span.End() // End the span when the handler finishes

            // 3. Echo the trace context in the response headers so callers can
            //    correlate with it. Downstream calls still need InjectTrace
            //    applied to their own outgoing request headers, as shown earlier.
            tracer.InjectTrace(ctx, propagation.HeaderCarrier(w.Header()))

            // 4. Wrap the response writer to capture the status code
            wr := &responseWriter{ResponseWriter: w, statusCode: 200}
            next.ServeHTTP(wr, r.WithContext(ctx)) // Process the request

            // 5. Record important details about the request
            span.SetAttributes(
                semconv.HTTPMethodKey.String(r.Method),
                semconv.HTTPStatusCodeKey.Int(wr.statusCode),
                semconv.HTTPRouteKey.String(r.URL.Path),
            )
            if wr.statusCode >= 500 {
                span.RecordError(fmt.Errorf("server error: %d", wr.statusCode))
            }
        })
    }
}

// A helper to capture the HTTP status code
type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}

With this middleware, every HTTP request is automatically traced. Developers can focus on business logic, and the tracing is woven in.

Once you have traces, the real value comes from analysis. You can write simple analyzers to scan your span storage and find patterns.

type PerformanceReport struct {
    SlowestEndpoints map[string]time.Duration
    FrequentErrors   map[string]int
    ServiceDeps      map[string][]string // Which services call which others
}

func (dt *DistributedTracer) GenerateReport() *PerformanceReport {
    report := &PerformanceReport{
        SlowestEndpoints: make(map[string]time.Duration),
        FrequentErrors:   make(map[string]int),
        ServiceDeps:      make(map[string][]string),
    }

    dt.spanStore.mu.RLock()
    defer dt.spanStore.mu.RUnlock()

    for _, spans := range dt.spanStore.spans {
        for _, span := range spans {
            // Find slow operations
            duration := span.EndTime.Sub(span.StartTime)
            if maxDur, exists := report.SlowestEndpoints[span.Name]; !exists || duration > maxDur {
                report.SlowestEndpoints[span.Name] = duration
            }
            // Count errors
            if span.HasError {
                report.FrequentErrors[span.Name]++
            }
            // Note dependencies from attributes
            if targetService, ok := span.Attributes["peer.service"].(string); ok {
                report.ServiceDeps[span.Name] = append(report.ServiceDeps[span.Name], targetService)
            }
        }
    }
    return report
}

Running this report every few minutes can show you which endpoints are consistently slow, which are failing, and how your service graph is connected. I've used this to find a hidden database call that was adding 500ms to an API endpoint—a call that wasn't in the main code path but was triggered by a poorly written library.

Let's put it all together in a main function to see how it works in a small example.

func main() {
    tracer, err := NewDistributedTracer("payment-service", 0.3) // Sample 30% of requests
    if err != nil {
        log.Fatal(err)
    }

    mux := http.NewServeMux()
    mux.HandleFunc("/charge", func(w http.ResponseWriter, r *http.Request) {
        // Simulate work
        time.Sleep(time.Millisecond * 10)

        // Simulate a call to a user service. In real code you would pass the
        // returned child context into that call; here we discard it to keep
        // the example compiling.
        _, userSpan := tracer.tracer.Start(r.Context(), "validate_user")
        time.Sleep(time.Millisecond * 5)
        userSpan.End()

        w.Write([]byte("Charged"))
    })

    // Wrap the entire router with tracing
    tracedHandler := TracingMiddleware(tracer)(mux)

    // Start a background reporter
    go func() {
        ticker := time.NewTicker(60 * time.Second)
        for range ticker.C {
            report := tracer.GenerateReport()
            log.Printf("Performance Snapshot: %+v\n", report.SlowestEndpoints)
        }
    }()

    log.Println("Server starting on :8080")
    http.ListenAndServe(":8080", tracedHandler)
}

When you run this, a request to /charge will generate a trace with (at least) two spans: one for the HTTP handler and a child span for the validate_user operation. If you sampled this request, its journey would appear in your Jaeger UI, showing you the timing breakdown.

A few important lessons from building these systems. First, keep identifiers consistent. A trace must have the same ID across all services. The OpenTelemetry propagator handles this. Second, be mindful of context. Always pass the context.Context from the request through any function that might start a span or make an external call.
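
As a concrete example of that rule, an internal helper should accept the request's context as its first argument and derive its span from it. chargeCard and debitAccount below are hypothetical names used only for illustration.

// chargeCard illustrates the context-passing rule: it takes the caller's ctx
// and starts its span from it, so the span nests under the HTTP handler's
// span in the same trace.
func chargeCard(ctx context.Context, dt *DistributedTracer, amountCents int) error {
    ctx, span := dt.tracer.Start(ctx, "charge_card")
    defer span.End()

    // Pass ctx onward to anything that might start spans or call out.
    if err := debitAccount(ctx, amountCents); err != nil {
        span.RecordError(err)
        return err
    }
    return nil
}

// debitAccount stands in for a real downstream call that would itself
// receive the context and continue the trace.
func debitAccount(ctx context.Context, amountCents int) error {
    select {
    case <-time.After(2 * time.Millisecond): // simulated work
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}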

Third, sampling is your best tool for controlling cost. Start with a low rate (like 1%) in production and adjust based on your needs. Finally, remember that traces are just one piece. They combine with logs and metrics to give you full insight. A trace can tell you which service is slow, but you might need a metric to tell you how many times it happened, and a log to tell you why.
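
One cheap way to connect traces and logs is to stamp every log line with the active trace ID, so you can jump from a log entry straight to its trace. A minimal sketch using the standard library logger:

// logWithTrace prefixes a log line with the current trace ID (if any),
// making it easy to look up the corresponding trace in Jaeger.
func logWithTrace(ctx context.Context, msg string) {
    sc := trace.SpanContextFromContext(ctx)
    if sc.HasTraceID() {
        log.Printf("trace_id=%s %s", sc.TraceID().String(), msg)
        return
    }
    log.Print(msg)
}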

Building a distributed tracing system fundamentally changes how you understand your microservices. It turns a tangled web of independent processes into a coherent, observable flow. You stop guessing about performance and start knowing. The small amount of code you add to each service pays for itself the first time you use a trace to pinpoint a critical failure in seconds instead of hours.

📘 Check out my latest ebook for free on my channel!

Be sure to like, share, comment, and subscribe to the channel!


101 Books

101 Books is an AI-driven publishing company co-founded by author Aarav Joshi. By leveraging advanced AI technology, we keep our publishing costs incredibly low—some books are priced as low as $4—making quality knowledge accessible to everyone.

Check out our book Golang Clean Code available on Amazon.

Stay tuned for updates and exciting news. When shopping for books, search for Aarav Joshi to find more of our titles. Use the provided link to enjoy special discounts!

Our Creations

Be sure to check out our creations:

Investor Central | Investor Central Spanish | Investor Central German | Smart Living | Epochs & Echoes | Puzzling Mysteries | Hindutva | Elite Dev | Java Elite Dev | Golang Elite Dev | Python Elite Dev | JS Elite Dev | JS Schools


We are on Medium

Tech Koala Insights | Epochs & Echoes World | Investor Central Medium | Puzzling Mysteries Medium | Science & Epochs Medium | Modern Hindutva
