Dylan Dumont

Distributed Tracing From Scratch: Propagating Trace Context Across Services

"Without trace context propagation, your microservice spans remain isolated islands of latency data."

What We're Building

We will build a manual implementation of distributed tracing in a Go-based microservice architecture. This means implementing the W3C Trace Context standard to propagate context across HTTP boundaries without relying on external SDKs. By working through the raw mechanics of context injection and extraction, we build a foundation for observability that is language-agnostic in principle, implemented here in Go for its strict type system.

Step 1 — Define the Trace Context Structure

Before injecting data, we must define what constitutes a valid trace context. The W3C Trace Context standard requires a traceparent header containing the version, trace ID, parent ID, and flags.

type W3CTraceContext struct {
    Version  uint8
    TraceID  string // 32 hex chars (16 bytes)
    SpanID   string // 16 hex chars (8 bytes)
    ParentID string // 16 hex chars; empty for a root span
    Flags    uint8
}

We use a struct to ensure type safety. The spec encodes IDs as lowercase hex strings of fixed length, so the receiving service can validate them cheaply: an ID of the wrong length, containing non-hex characters, or consisting entirely of zeroes should be rejected as malformed.

Step 2 — Inject Context on Outbound Requests

When Service A calls Service B, Service A must generate a trace context and append it to the request headers. This links the current span to the overall distributed trace.

func injectTraceHeaders(outbound *http.Request, inbound http.Header) {
    parent := extractW3CContext(inbound)
    if parent == nil {
        return // root span: generate a fresh trace ID instead (omitted)
    }

    // Generate a new span ID for this hop
    childSpanID := generateSpanID()

    // Format: version-traceid-parentid-flags; the current spec version is 00
    outbound.Header.Set("traceparent", fmt.Sprintf("00-%s-%s-%02x",
        parent.TraceID, childSpanID, parent.Flags))
}

Injection happens at the HTTP client layer, just before the outbound request is sent. This ensures that every outgoing request carries the lineage the downstream service needs to link its spans back to the originating trace.

Step 3 — Extract Context on Inbound Requests

On the receiving side, Service B must read the incoming headers. We parse the traceparent string using the standard format: version-traceid-parentid-flags.

func extractW3CContext(headers http.Header) *W3CTraceContext {
    traceParent := headers.Get("traceparent")
    if traceParent == "" {
        return nil // No parent context
    }

    parts := strings.Split(traceParent, "-")
    if len(parts) != 4 {
        return nil
    }

    // Version and flags are two-digit hex fields, not raw bytes
    version, err := strconv.ParseUint(parts[0], 16, 8)
    if err != nil {
        return nil
    }
    flags, err := strconv.ParseUint(parts[3], 16, 8)
    if err != nil {
        return nil
    }

    return &W3CTraceContext{
        Version:  uint8(version),
        TraceID:  parts[1],
        ParentID: parts[2],
        Flags:    uint8(flags),
    }
}

Parsing is the most error-prone step in a manual implementation, so we validate the field count and hex encoding rigorously and return nil on anything malformed rather than crash. If the header is missing, we treat the request as a root span and initiate a new trace.
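Beyond counting the hyphen-separated parts, each field has a fixed length and a fixed alphabet. Here is a minimal validity check along those lines (the `isValidTraceparent` helper is my own naming, not from the article):

```go
package main

import (
	"fmt"
	"strings"
)

// isValidTraceparent checks the structural rules of the header:
// four hyphen-separated fields of fixed lengths (2, 32, 16, 2),
// each made only of lowercase hex digits, as the spec requires.
func isValidTraceparent(v string) bool {
	parts := strings.Split(v, "-")
	if len(parts) != 4 {
		return false
	}
	lengths := []int{2, 32, 16, 2}
	for i, p := range parts {
		if len(p) != lengths[i] {
			return false
		}
		for _, c := range p {
			if !strings.ContainsRune("0123456789abcdef", c) {
				return false
			}
		}
	}
	return true
}

func main() {
	fmt.Println(isValidTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")) // true
	fmt.Println(isValidTraceparent("not-a-trace"))                                             // false
}
```

A production implementation would additionally reject all-zero trace and parent IDs, which the spec treats as invalid.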

Step 4 — Implement Sampling Decision Logic

Not all traces should be recorded. We implement a sampler that checks the trace flags or a parent’s sampling decision. If the parent sampled the trace, the child must continue sampling to maintain context integrity.

func shouldSample(headers http.Header) bool {
    parent := extractW3CContext(headers)
    if parent == nil {
        return true // Always sample new (root) traces by default
    }

    // Honor the upstream decision: bit 0 of the flags byte is "sampled"
    return parent.Flags&0x01 == 0x01
}

In production, you would check bit 0 of the flags byte (1 = sampled, 0 = not sampled). Honoring the upstream decision prevents fragmented traces in which some services record their spans while others silently drop theirs.
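The flags check is just a hex parse and a bit test. A minimal sketch (the `sampledFromFlags` helper is my own naming) operating directly on the two-digit flags field:

```go
package main

import (
	"fmt"
	"strconv"
)

// sampledFromFlags decodes the two-hex-digit flags field of a
// traceparent header and tests bit 0, the "sampled" flag.
func sampledFromFlags(flags string) bool {
	v, err := strconv.ParseUint(flags, 16, 8)
	if err != nil {
		return false // unparseable flags: treat as not sampled
	}
	return v&0x01 == 0x01
}

func main() {
	fmt.Println(sampledFromFlags("01")) // true
	fmt.Println(sampledFromFlags("00")) // false
}
```

Because flags is a bit field, future spec versions can add more bits without breaking this check.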

Step 5 — Correlate Local Spans into a Trace

Finally, we store the local span metadata (duration, status) and associate it with the extracted TraceID. This local span ID is unique per request, but the TraceID is shared across all services in the call chain.

type LocalSpan struct {
    ID          string
    TraceID     string
    StartTime   time.Time
    Duration    time.Duration
    ServiceName string
}

// Store this locally to report later to a backend
func finishSpan(span *LocalSpan, end time.Time) {
    span.Duration = end.Sub(span.StartTime)
    // Send span to collector, keyed by TraceID
}

This step bridges the gap between individual function executions and the end-to-end view that tracing backends like Jaeger or Grafana Tempo provide. Without this association, you cannot reconstruct the timeline of a user request.

Key Takeaways

  • W3C Standard — Adhering to industry specifications ensures tools from different vendors can ingest your data.
  • Header Injection — Propagating via HTTP headers is lighter weight than custom query parameters or body payloads.
  • Sampling Logic — Deciding whether to record a span requires respecting the decision made by upstream services to manage storage costs.
  • Correlation IDs — The TraceID acts as a correlation key allowing you to stitch logs and metrics from disparate systems together.
  • Parent Pointer — Maintaining the parent span ID allows for hierarchical tree structures in visualization dashboards.
  • Error Handling — Parsing malformed headers gracefully prevents service crashes during high-volume traffic bursts.

What's Next?

  1. Integrate with an OpenTelemetry SDK to automate span creation and export.
  2. Implement the tracestate header for custom attributes like user session IDs.
  3. Set up a backend collector (Jaeger or Zipkin) to ingest the exported spans.
  4. Configure sampling rates per environment (production vs staging) to optimize data retention.

Part of the Architecture Patterns series.