You have logs. You have metrics. A request enters your system through the API gateway, hops across five services, and fails somewhere deep in the order processing pipeline. You open Kibana, grep through thousands of log lines, and spend forty minutes correlating timestamps by hand.
Distributed tracing eliminates that pain. It gives you a single, end-to-end view of a request as it flows through every service in your architecture. And with OpenTelemetry becoming the industry standard, there has never been a better time to wire it in.
This article walks through instrumenting Go services with OpenTelemetry from scratch. No toy examples — everything here is production-grade code you can drop into a real system.
Why Distributed Tracing Matters
In a monolith, a stack trace tells you everything. In a distributed system, a single user action might touch an API gateway, an auth service, an order service, a payment provider, a notification queue, and a database. When something goes wrong — or just gets slow — you need to answer: which service, which call, which dependency?
Distributed tracing answers this by assigning a trace ID to each incoming request and propagating it through every downstream call. Each unit of work within a service becomes a span, and spans nest to form a tree that represents the full request lifecycle.
The payoff is immediate:
- Latency diagnosis: See exactly which service or database call is the bottleneck.
- Error attribution: Know that the 500 came from the payment service's connection pool, not your code.
- Dependency mapping: Visualize how services actually communicate at runtime, not how the architecture diagram says they should.
- SLA tracking: Measure per-endpoint latency distributions across the entire call chain.
OpenTelemetry in 60 Seconds
OpenTelemetry (OTel) is a CNCF project that provides a vendor-neutral API, SDK, and set of tools for generating telemetry data. For tracing, the key concepts are:
- TracerProvider: Factory that creates tracers and manages span processors.
- Tracer: Creates spans within a specific instrumentation scope (usually one per package).
- Span: Represents a unit of work. Has a name, start/end time, attributes, events, and status.
- SpanProcessor: Handles completed spans (batching, exporting).
- Exporter: Sends span data to a backend (Jaeger, Zipkin, OTLP collector).
- Propagator: Serializes/deserializes trace context across process boundaries (HTTP headers, message queues).
Setting Up the OTel SDK
Start with the dependencies. We will use OTLP over gRPC as the export protocol, which works with Jaeger, Tempo, Datadog, and any OTLP-compatible backend.
```bash
go get go.opentelemetry.io/otel \
  go.opentelemetry.io/otel/sdk \
  go.opentelemetry.io/otel/sdk/trace \
  go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc \
  go.opentelemetry.io/otel/propagation \
  go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
```
Now build the tracer provider. This is your application's tracing backbone — initialize it once at startup and shut it down cleanly on exit.
```go
package telemetry

import (
	"context"
	"fmt"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

type Config struct {
	ServiceName    string
	ServiceVersion string
	Environment    string
	OTLPEndpoint   string // e.g., "otel-collector:4317"
	SampleRate     float64
}

func InitTracer(ctx context.Context, cfg Config) (shutdown func(context.Context) error, err error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(cfg.OTLPEndpoint),
		otlptracegrpc.WithInsecure(), // Use WithTLSCredentials in production
	)
	if err != nil {
		return nil, fmt.Errorf("creating OTLP exporter: %w", err)
	}

	res, err := resource.Merge(
		resource.Default(),
		resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceName(cfg.ServiceName),
			semconv.ServiceVersion(cfg.ServiceVersion),
			semconv.DeploymentEnvironment(cfg.Environment),
		),
	)
	if err != nil {
		return nil, fmt.Errorf("creating resource: %w", err)
	}

	sampler := sdktrace.ParentBased(
		sdktrace.TraceIDRatioBased(cfg.SampleRate),
	)

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter,
			sdktrace.WithMaxQueueSize(2048),
			sdktrace.WithMaxExportBatchSize(512),
			sdktrace.WithBatchTimeout(5*time.Second),
		),
		sdktrace.WithResource(res),
		sdktrace.WithSampler(sampler),
	)

	otel.SetTracerProvider(tp)
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))

	return tp.Shutdown, nil
}
```
A few deliberate decisions here worth explaining:
ParentBased sampler: If an incoming request already carries a sampling decision (from an upstream service), we honor it. This prevents broken traces where a parent is sampled but a child is not. The TraceIDRatioBased sampler only applies to root spans — requests that originate at this service.
Batch processor tuning: The defaults are conservative. For high-throughput services (10k+ requests/sec), increase MaxQueueSize and MaxExportBatchSize. The BatchTimeout controls the maximum delay before a batch is flushed, even if it is not full.
Resource attributes: These tag every span with service identity. Backend UIs use them for filtering and grouping. Always include service name, version, and environment at minimum.
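To build intuition for the batch knobs, here is a toy model of what the batch processor does with `MaxExportBatchSize`. This is a conceptual sketch only, not the SDK's implementation: the real processor is concurrent, also flushes on `BatchTimeout`, and drops spans when the queue overflows. All names here (`batcher`, `onEnd`) are illustrative.

```go
package main

import "fmt"

// batcher is a toy model of the batch span processor: finished spans
// queue up and are exported in chunks of maxBatch. (The real processor
// also flushes on a timer and enforces a bounded queue.)
type batcher struct {
	maxBatch int
	queue    []string
	exported [][]string
}

func (b *batcher) onEnd(span string) {
	b.queue = append(b.queue, span)
	if len(b.queue) >= b.maxBatch {
		b.flush()
	}
}

func (b *batcher) flush() {
	if len(b.queue) == 0 {
		return
	}
	b.exported = append(b.exported, b.queue)
	b.queue = nil
}

func main() {
	b := &batcher{maxBatch: 3}
	for i := 1; i <= 7; i++ {
		b.onEnd(fmt.Sprintf("span-%d", i))
	}
	// The final flush is what TracerProvider.Shutdown does for you:
	// without it, span-7 would be lost.
	b.flush()
	fmt.Println(len(b.exported)) // 3 batches: 3 + 3 + 1
}
```

The leftover span in the last batch is exactly why the deferred shutdown in the next section matters.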
Wiring It Into main()
```go
func main() {
	ctx := context.Background()

	shutdown, err := telemetry.InitTracer(ctx, telemetry.Config{
		ServiceName:    "order-service",
		ServiceVersion: "1.4.2",
		Environment:    os.Getenv("APP_ENV"),
		OTLPEndpoint:   os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT"),
		SampleRate:     0.1, // Sample 10% of new traces
	})
	if err != nil {
		log.Fatalf("init tracer: %v", err)
	}
	defer func() {
		// Give the exporter 10 seconds to flush remaining spans
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		defer cancel()
		if err := shutdown(ctx); err != nil {
			log.Printf("tracer shutdown: %v", err)
		}
	}()

	// ... start HTTP server
}
```
The deferred shutdown is critical. Without it, spans buffered in the batch processor are lost when the process exits. This pairs well with the graceful shutdown pattern from article #14 in this series.
Instrumenting HTTP Handlers
OpenTelemetry provides otelhttp, a middleware that automatically creates spans for incoming HTTP requests and extracts trace context from headers.
```go
package main

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func newRouter() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/orders", handleListOrders)
	mux.HandleFunc("/orders/create", handleCreateOrder)

	// Wrap the entire mux with OTel instrumentation
	return otelhttp.NewHandler(mux, "http-server",
		otelhttp.WithMessageEvents(otelhttp.ReadEvents, otelhttp.WriteEvents),
	)
}
```
Every incoming request now generates a span named after the HTTP route, with attributes for method, status code, URL scheme, and request/response size. The middleware also extracts traceparent and tracestate headers automatically — this is how trace context arrives from upstream services.
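You never need to parse those headers yourself (the propagator handles both directions), but knowing the shape of `traceparent` helps when you are staring at raw headers during debugging. The W3C Trace Context format is `version-traceid-parentid-flags`. A small sketch with a hypothetical helper (`parseTraceparent` is my name, not a library function):

```go
package main

import (
	"fmt"
	"strings"
)

// parseTraceparent splits a W3C traceparent header into its four
// fields: version, trace ID (32 hex chars), parent span ID (16 hex
// chars), and trace flags ("01" means the trace is sampled).
func parseTraceparent(h string) (version, traceID, spanID, flags string, err error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", "", "", "", fmt.Errorf("malformed traceparent: %q", h)
	}
	return parts[0], parts[1], parts[2], parts[3], nil
}

func main() {
	// A header as it appears on the wire between services
	h := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	v, tid, sid, fl, err := parseTraceparent(h)
	if err != nil {
		panic(err)
	}
	fmt.Println(v, tid, sid, fl)
}
```

The trace ID stays constant across the whole request; the parent span ID changes at every hop.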
For more granular route naming (avoiding high-cardinality span names like /orders/abc123), use a custom span name formatter:
```go
otelhttp.NewHandler(mux, "http-server",
	otelhttp.WithSpanNameFormatter(func(operation string, r *http.Request) string {
		// Use the route pattern, not the actual URL.
		// r.Pattern is populated by ServeMux from Go 1.23 onward.
		return fmt.Sprintf("%s %s", r.Method, r.Pattern)
	}),
)
```
Propagating Context Across Services
Trace context propagation is where distributed tracing earns its name. When service A calls service B, it must inject the current trace context into the outgoing request headers. Service B then extracts it and continues the same trace.
Wrap your HTTP client with otelhttp.NewTransport:
```go
package httpclient

import (
	"net/http"
	"time"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// New returns an HTTP client that propagates trace context
// and creates client spans for every outgoing request.
func New() *http.Client {
	return &http.Client{
		Timeout: 30 * time.Second,
		Transport: otelhttp.NewTransport(
			&http.Transport{
				MaxIdleConns:        100,
				MaxIdleConnsPerHost: 10,
				IdleConnTimeout:     90 * time.Second,
			},
		),
	}
}
```
Now every outgoing HTTP call automatically injects traceparent headers and creates a client-side span. Here is what it looks like in a handler that calls a downstream service:
```go
func handleCreateOrder(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()

	// This span is a child of the HTTP server span created by otelhttp
	order, err := processOrder(ctx, r)
	if err != nil {
		http.Error(w, "failed to process order", http.StatusInternalServerError)
		return
	}

	// The client automatically propagates trace context
	client := httpclient.New()
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		"http://payment-service/charge", orderToJSON(order))
	if err != nil {
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}
	resp, err := client.Do(req)
	// ...
}
```
The key detail: always pass ctx through. The context carries the current span. If you use context.Background() instead of the request context, you break the trace chain.
Custom Spans and Attributes
Auto-instrumentation covers HTTP boundaries, but the most valuable tracing data comes from custom spans around your business logic.
```go
package order

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("github.com/yourorg/order-service/internal/order")

func ProcessOrder(ctx context.Context, req CreateOrderRequest) (*Order, error) {
	ctx, span := tracer.Start(ctx, "order.Process",
		trace.WithAttributes(
			attribute.String("order.customer_id", req.CustomerID),
			attribute.Int("order.item_count", len(req.Items)),
			attribute.String("order.currency", req.Currency),
		),
	)
	defer span.End()

	// Validate inventory
	if err := validateInventory(ctx, req.Items); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "inventory validation failed")
		return nil, fmt.Errorf("validate inventory: %w", err)
	}

	// Calculate pricing
	total, err := calculateTotal(ctx, req.Items, req.Currency)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "pricing calculation failed")
		return nil, fmt.Errorf("calculate total: %w", err)
	}
	span.SetAttributes(attribute.Float64("order.total", total))

	// Persist
	order, err := saveOrder(ctx, req, total)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "persistence failed")
		return nil, fmt.Errorf("save order: %w", err)
	}
	span.SetAttributes(attribute.String("order.id", order.ID))

	return order, nil
}
```
Several patterns at work here:
Tracer per package: `otel.Tracer("...")` creates an instrumentation scope. Use the fully qualified package path. This shows up in tracing UIs and helps identify which code produced a span.

Record errors AND set status: `RecordError` adds an exception event to the span (with stack trace if available). `SetStatus` marks the span as failed. Do both — some backends use one, some the other.

Add attributes progressively: You do not need to know all attributes upfront. Add them as the function progresses. The `order.id` is only available after the database write, so we set it then.

Return the enriched ctx: `tracer.Start` returns a new context with the span. Pass this to downstream functions so their spans nest correctly.
Instrumenting Database Calls
Database calls are almost always where latency hides. Create spans around them:
```go
func saveOrder(ctx context.Context, req CreateOrderRequest, total float64) (*Order, error) {
	ctx, span := tracer.Start(ctx, "order.SaveToDB",
		trace.WithAttributes(
			attribute.String("db.system", "postgresql"),
			attribute.String("db.operation", "INSERT"),
			attribute.String("db.sql.table", "orders"),
		),
	)
	defer span.End()

	query := `INSERT INTO orders (customer_id, total, currency, status)
		VALUES ($1, $2, $3, 'pending') RETURNING id, created_at`

	var order Order
	err := db.QueryRowContext(ctx, query, req.CustomerID, total, req.Currency).
		Scan(&order.ID, &order.CreatedAt)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "db insert failed")
		return nil, err
	}
	return &order, nil
}
```
Use the OpenTelemetry semantic conventions for databases (db.system, db.operation, db.sql.table) so your tracing backend can render database-specific views. Do not put the full SQL query in a span attribute in production — it may contain PII.
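If you still want some query context on the span, record only the leading SQL verb rather than the statement itself. A hedged sketch (`sqlOperation` is an illustrative helper, not part of any SDK):

```go
package main

import (
	"fmt"
	"strings"
)

// sqlOperation extracts the leading verb (INSERT, SELECT, UPDATE, ...)
// from a query so it can be recorded as a low-cardinality, PII-free
// span attribute like db.operation.
func sqlOperation(query string) string {
	fields := strings.Fields(strings.TrimSpace(query))
	if len(fields) == 0 {
		return "UNKNOWN"
	}
	return strings.ToUpper(fields[0])
}

func main() {
	q := `INSERT INTO orders (customer_id, total) VALUES ($1, $2)`
	fmt.Println(sqlOperation(q)) // INSERT
}
```

The verb plus `db.sql.table` is usually enough to identify a slow query in a trace without shipping customer data to your tracing backend.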
Connecting to Jaeger or Zipkin
Option 1: OTLP-native (Recommended)
Modern Jaeger (v1.35+) natively supports OTLP over gRPC on port 4317. Our setup already works — just point OTEL_EXPORTER_OTLP_ENDPOINT at jaeger:4317.
For Grafana Tempo, it is the same: OTLP on port 4317.
Option 2: OpenTelemetry Collector
For production, run an OTel Collector as a sidecar or daemonset. This decouples your services from the tracing backend and adds capabilities like tail-based sampling, attribute filtering, and multi-backend fan-out.
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 5s
    send_batch_size: 512
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, otlp/tempo]
```
The collector's memory_limiter processor is essential. Without it, a traffic spike can OOM the collector and you lose all buffered spans.
Option 3: Zipkin Exporter
If you are locked into Zipkin, use the dedicated exporter:
```go
import "go.opentelemetry.io/otel/exporters/zipkin"

exporter, err := zipkin.New("http://zipkin:9411/api/v2/spans")
```
Trace Context in Async Workflows
HTTP propagation covers synchronous calls, but many production systems use message queues. You need to manually inject and extract trace context.
Producer (publishing to a queue):
```go
func publishOrderEvent(ctx context.Context, event OrderCreatedEvent) error {
	carrier := make(propagation.MapCarrier)
	otel.GetTextMapPropagator().Inject(ctx, carrier)

	// Attach trace context as message headers
	headers := make(map[string]string)
	for _, key := range []string{"traceparent", "tracestate"} {
		if val := carrier.Get(key); val != "" {
			headers[key] = val
		}
	}

	msg := &queue.Message{
		Body:    encodeEvent(event),
		Headers: headers,
	}
	return queue.Publish(ctx, "order.created", msg)
}
```
Consumer (reading from a queue):
```go
func handleOrderCreatedMessage(msg *queue.Message) error {
	carrier := propagation.MapCarrier(msg.Headers)
	ctx := otel.GetTextMapPropagator().Extract(context.Background(), carrier)

	ctx, span := tracer.Start(ctx, "queue.ProcessOrderCreated",
		trace.WithSpanKind(trace.SpanKindConsumer),
		trace.WithAttributes(
			attribute.String("messaging.system", "rabbitmq"),
			attribute.String("messaging.operation", "process"),
		),
	)
	defer span.End()

	// Process with the restored trace context
	return processOrderCreated(ctx, msg.Body)
}
```
The consumer's span becomes a child of the producer's span, even though they run in different processes and potentially on different machines. This is the real power of distributed tracing.
Production Best Practices
1. Sample Intelligently
Tracing 100% of traffic is expensive. Use ParentBased(TraceIDRatioBased(0.1)) to sample 10% of new traces while always honoring upstream sampling decisions. For error investigation, implement tail-based sampling at the collector level — it can be configured to keep every trace that contains an error span, regardless of the head-sampling ratio.
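Ratio sampling is deterministic in the trace ID, which is why sampling decisions stay consistent across services that share a trace. The sketch below shows the idea only; it is not the SDK's exact algorithm, and `shouldSample` is an illustrative name:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// shouldSample mimics the idea behind TraceIDRatioBased: interpret
// part of the 16-byte trace ID as an integer and sample when it falls
// below ratio * max. Same trace ID in, same decision out, everywhere.
func shouldSample(traceID [16]byte, ratio float64) bool {
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1 // use 63 bits
	bound := uint64(ratio * (1 << 63))
	return x < bound
}

func main() {
	var id [16]byte
	id[15] = 1 // a tiny value in the sampled range for any ratio > 0
	fmt.Println(shouldSample(id, 0.1)) // true
	fmt.Println(shouldSample(id, 0.0)) // false
}
```

Because the decision is a pure function of the trace ID, two services that both apply a 10% ratio to the same trace agree without coordinating, and ParentBased simply short-circuits the computation for non-root spans.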
2. Control Span Cardinality
Every unique span name creates a series in your backend. Avoid dynamic values in span names:
```go
// BAD: creates thousands of unique span names
tracer.Start(ctx, fmt.Sprintf("getUser-%s", userID))

// GOOD: use attributes for dynamic values
tracer.Start(ctx, "user.Get",
	trace.WithAttributes(attribute.String("user.id", userID)),
)
```
3. Keep Attribute Values Bounded
High-cardinality attributes (full URLs, raw SQL, request bodies) bloat storage and slow down queries. Stick to IDs, enums, and short strings. Use span events for detailed debugging data that you only need occasionally.
4. Set Span Kind Correctly
Span kind tells the backend how to render the span:
- `SpanKindServer` — incoming RPC/HTTP (set by `otelhttp` middleware)
- `SpanKindClient` — outgoing RPC/HTTP (set by `otelhttp` transport)
- `SpanKindProducer` — message queue publish
- `SpanKindConsumer` — message queue consume
- `SpanKindInternal` — default; local operations
5. Use Span Links for Fan-Out
When one request triggers multiple async operations, use span links instead of parent-child relationships. A batch job that processes 1000 messages should link to each producer span, not parent them — otherwise you get an unreadable trace tree.
```go
links := make([]trace.Link, len(messages))
for i, msg := range messages {
	carrier := propagation.MapCarrier(msg.Headers)
	remoteCtx := otel.GetTextMapPropagator().Extract(context.Background(), carrier)
	links[i] = trace.LinkFromContext(remoteCtx)
}

ctx, span := tracer.Start(ctx, "batch.ProcessMessages",
	trace.WithLinks(links...),
)
```
6. Correlate Traces with Logs
Inject the trace ID into your structured logger so log lines are searchable by trace:
```go
func loggerFromContext(ctx context.Context) *slog.Logger {
	spanCtx := trace.SpanContextFromContext(ctx)
	if !spanCtx.IsValid() {
		// No active span: avoid logging all-zero IDs
		return slog.Default()
	}
	return slog.Default().With(
		slog.String("trace_id", spanCtx.TraceID().String()),
		slog.String("span_id", spanCtx.SpanID().String()),
	)
}
```
Grafana, Datadog, and most observability platforms can jump from a log line to the full trace when trace_id is present.
7. Flush on Shutdown
We covered this in the setup, but it bears repeating: TracerProvider.Shutdown() must be called with a generous timeout. The batch processor may have hundreds of spans queued. Pair it with your graceful shutdown handler.
What a Full Trace Looks Like
After instrumenting two services, here is what a trace looks like in Jaeger:
```text
order-service: POST /orders/create        [============] 320ms
├─ order.Process                          [==========]  280ms
│  ├─ order.ValidateInventory             [===]          85ms
│  ├─ order.CalculateTotal                [=]            12ms
│  └─ order.SaveToDB                      [==]           45ms
└─ HTTP POST payment-service/charge       [=====]       130ms
   └─ payment-service: POST /charge       [====]        125ms
      ├─ payment.ValidateCard             [=]            15ms
      └─ payment.ProcessCharge            [===]         105ms
```
At a glance: 320ms total, and the payment service accounts for 130ms of that. The database insert is 45ms — reasonable for a write with index updates. If this endpoint starts breaching its latency SLO, you know exactly where to look.
Wrapping Up
The initial investment in distributed tracing pays for itself the first time you debug a cross-service latency issue without manually correlating logs. The setup is:
- Initialize a `TracerProvider` with OTLP export and sensible batching.
- Wrap HTTP servers with `otelhttp.NewHandler` for automatic span creation.
- Wrap HTTP clients with `otelhttp.NewTransport` for automatic context propagation.
- Add custom spans around business logic and database calls.
- Manually propagate context through message queues.
- Sample, control cardinality, correlate with logs.
All the code in this article uses the stable OpenTelemetry Go SDK. It works with any OTLP-compatible backend — Jaeger, Tempo, Datadog, Honeycomb, or the OTel Collector. Pick one, point the exporter at it, and start seeing your system for what it really is.
If this article helped you, consider buying me a coffee on Ko-fi! Follow me for more production backend patterns.