DevHelm

Posted on Jun 8 • Originally published at devhelm.io

Distributed Tracing 101: The Mental Model, the Standards, and Your First Pipeline

#guides #infrastructure #reliability

A request enters your system through an API gateway, hits an authentication service, queries a database, calls a payment provider, publishes an event to a message queue, and returns a response. When that request takes 4 seconds instead of 400 milliseconds, which service is responsible?

Without distributed tracing, you open five dashboards, compare timestamps in five different log streams, and try to reconstruct the request path from memory. With distributed tracing, you open one trace and see every hop, every duration, and every failure — in a single view.

Distributed tracing is the practice of propagating a unique identifier through every service that handles a request, recording the work each service does as spans, and assembling those spans into a trace that represents the request's complete journey.

The mental model: spans and traces

A span is a named, timed operation. "Query user table" is a span. "Call Stripe API" is a span. "Validate JWT" is a span. Each span records:

A name (what happened)
A start time and duration (how long it took)
A status (OK, error, or unset)
Attributes (key-value metadata: http.method=POST, db.statement=SELECT..., rpc.service=PaymentService)
A parent span ID (which span triggered this one)

A trace is a tree of spans rooted at the entry point. The root span represents the entire request. Child spans represent sub-operations. Because every span has at most one parent, the parent-child relationships form a tree (a special case of a directed acyclic graph) that mirrors the actual execution flow.

Trace: a1b2c3d4 (POST /api/v1/orders)
├── [12ms] Validate JWT
├── [340ms] Query order history
│   └── [320ms] PostgreSQL SELECT
├── [1,200ms] Call Stripe API
│   ├── [800ms] Create PaymentIntent
│   └── [380ms] Confirm PaymentIntent
└── [45ms] Publish OrderCreated event
    └── [38ms] NATS publish

From this trace, you can immediately see that the Stripe API call dominates the latency (1,200ms out of ~1,600ms total). No log correlation, no dashboard cross-referencing, no guesswork.
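The same reading can be done programmatically. The sketch below models the top-level spans of the example trace and ranks the root's children by duration. The `Span` interface, the `slowestChildren` helper, and the span IDs are illustrative assumptions, not part of any tracing SDK, and the root duration uses the approximate total quoted above.

```typescript
// A simplified span shape based on the fields listed above.
// This is an illustrative model, not the OpenTelemetry SDK's Span type.
interface Span {
  spanId: string;
  parentSpanId?: string; // undefined for the root span
  name: string;
  durationMs: number;
}

// Direct children of the given span, slowest first.
function slowestChildren(spans: Span[], parentId: string): Span[] {
  return spans
    .filter((s) => s.parentSpanId === parentId)
    .sort((a, b) => b.durationMs - a.durationMs);
}

// Top-level spans of the example trace. Span IDs are invented, and the
// root duration is the approximate total (~1,600 ms) from the text.
const orderTrace: Span[] = [
  { spanId: "root", name: "POST /api/v1/orders", durationMs: 1600 },
  { spanId: "jwt", parentSpanId: "root", name: "Validate JWT", durationMs: 12 },
  { spanId: "hist", parentSpanId: "root", name: "Query order history", durationMs: 340 },
  { spanId: "stripe", parentSpanId: "root", name: "Call Stripe API", durationMs: 1200 },
  { spanId: "event", parentSpanId: "root", name: "Publish OrderCreated event", durationMs: 45 },
];

const ranked = slowestChildren(orderTrace, "root");
const slowestName = ranked[0]?.name; // "Call Stripe API"
```

A tracing backend's UI does this ranking visually; the point of the sketch is only that parent IDs and durations are enough to answer "where did the time go?".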

Context propagation: the glue

Spans only form a trace if each service knows which trace it's participating in. This happens through context propagation — injecting the trace ID and parent span ID into the request headers, then extracting them on the receiving side.

The standard header format is W3C Trace Context:

traceparent: 00-a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6-a1b2c3d4e5f6a7b8-01

This single header carries four dash-separated fields: a version (00), the 32-character hex trace ID, the 16-character hex parent span ID, and trace flags (01 means sampled). Every HTTP client, gRPC framework, and message queue client that supports W3C Trace Context can propagate context automatically. If you're using OpenTelemetry SDKs, W3C Trace Context propagation is enabled by default.
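To make the header layout concrete, here is a minimal parsing sketch. It assumes the four-field layout shown above and applies the basic validity rules from the W3C Trace Context specification (lowercase hex, no all-zero IDs, version ff is invalid). The `TraceContext` type and `parseTraceparent` function are hypothetical names for illustration; in practice the OpenTelemetry SDK's propagator does this for you.

```typescript
interface TraceContext {
  version: string;
  traceId: string;
  parentSpanId: string;
  sampled: boolean;
}

// Parse a traceparent header value; returns null when it is malformed.
function parseTraceparent(header: string): TraceContext | null {
  const [version, traceId, parentSpanId, flags, ...rest] = header.trim().split("-");
  if (!version || !traceId || !parentSpanId || !flags || rest.length > 0) return null;

  // Each field is fixed-width lowercase hex.
  if (!/^[0-9a-f]{2}$/.test(version) || version === "ff") return null;
  if (!/^[0-9a-f]{32}$/.test(traceId) || /^0+$/.test(traceId)) return null;
  if (!/^[0-9a-f]{16}$/.test(parentSpanId) || /^0+$/.test(parentSpanId)) return null;
  if (!/^[0-9a-f]{2}$/.test(flags)) return null;

  // The lowest bit of the flags byte is the "sampled" flag.
  const sampled = (parseInt(flags, 16) & 0x01) === 0x01;
  return { version, traceId, parentSpanId, sampled };
}

const parsed = parseTraceparent(
  "00-a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6-a1b2c3d4e5f6a7b8-01",
);
```

For the example header, `parsed` holds the trace ID `a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6`, the parent span ID `a1b2c3d4e5f6a7b8`, and `sampled: true`.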

The failure mode to watch for: a service that doesn't propagate context creates a broken trace. The spans from upstream and downstream services exist in the backend, but they don't connect. The trace view shows two disconnected fragments instead of one coherent tree. This is almost always caused by an uninstrumented HTTP client or a custom queue consumer that doesn't extract the traceparent header.
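The difference between a connected and a broken trace can be sketched without any SDK. Below, a producer writes a traceparent header into message headers and a consumer either continues that trace or, when the header is missing, unknowingly starts a new one. All names here (`MessageHeaders`, `SpanRef`, `injectContext`, `startConsumerSpan`, `randomHex`) are hypothetical, and the parsing is deliberately naive. In real code the OpenTelemetry API's `propagation.inject` and `propagation.extract` do this work.

```typescript
type MessageHeaders = Record<string, string>;

interface SpanRef {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
}

// Random lowercase hex string; adequate for a sketch, not for production IDs.
function randomHex(byteCount: number): string {
  let out = "";
  for (let i = 0; i < byteCount; i++) {
    out += Math.floor(Math.random() * 256).toString(16).padStart(2, "0");
  }
  return out;
}

// Producer side: copy the current span's identity into the outgoing headers.
function injectContext(current: SpanRef, headers: MessageHeaders): MessageHeaders {
  return { ...headers, traceparent: `00-${current.traceId}-${current.spanId}-01` };
}

// Consumer side: continue the trace if a header is present, else start a new one.
function startConsumerSpan(headers: MessageHeaders): SpanRef {
  const [, traceId, parentSpanId] = (headers["traceparent"] ?? "").split("-");
  if (traceId && parentSpanId) {
    return { traceId, spanId: randomHex(8), parentSpanId };
  }
  // No context extracted: this span becomes the root of a separate trace.
  return { traceId: randomHex(16), spanId: randomHex(8) };
}

const producerSpan: SpanRef = {
  traceId: "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6",
  spanId: "a1b2c3d4e5f6a7b8",
};

const connected = startConsumerSpan(injectContext(producerSpan, {}));
const broken = startConsumerSpan({}); // header was never injected or never read
```

`connected` shares the producer's trace ID and points at the producer span as its parent. `broken` has a fresh trace ID and no parent, which is exactly the disconnected fragment described above.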

The standards: OpenTracing → OpenCensus → OpenTelemetry

The distributed tracing ecosystem went through a painful convergence:

OpenTracing (2016–2019). The first vendor-neutral tracing API. Defined the span/trace/context model. Adopted by Jaeger, Zipkin, and many vendor SDKs. Problem: it was an API spec only — no implementation. Every vendor shipped a different SDK with a different wire format.

OpenCensus (2017–2019). Google's attempt to standardize instrumentation across metrics and tracing. Included both the API and an SDK implementation. Problem: it competed with OpenTracing, fragmenting the ecosystem further.

OpenTelemetry (2019–present). The merger of OpenTracing and OpenCensus under the CNCF. Covers traces, metrics, and logs with a unified API, SDK, and wire protocol (OTLP). This is the convergence point — if you're starting today, start with OpenTelemetry.

The practical consequence: if you see a library or tutorial using opentracing or opencensus imports, it's using a deprecated path. Migrate to OpenTelemetry (for Node.js, the @opentelemetry/* packages). The concepts are the same; the wire protocol and SDK are different.

The tool landscape

Distributed tracing has two layers: the instrumentation layer (what generates and collects spans) and the backend layer (what stores and queries them). OpenTelemetry has won the instrumentation layer. The backend layer is still competitive:

Backend	Architecture	Storage	Strengths	Weaknesses
Jaeger	Collector + Query + UI	Elasticsearch, Cassandra, Badger (Kafka as an ingestion buffer)	CNCF graduated, battle-tested, flexible storage.	UI is functional but basic. No built-in metrics.
Zipkin	Monolithic or microservice	Cassandra, Elasticsearch, MySQL, in-memory	Simpler to deploy than Jaeger, smaller resource footprint.	Fewer features, smaller community, less active development.
Grafana Tempo	Distributed, object-storage-native	S3, GCS, Azure Blob	Cheapest at scale (no indexing). TraceQL is expressive.	Requires Grafana for visualization. Search depends on trace discovery (exemplars).
Datadog APM	SaaS	Managed	Zero operational burden. Unified with metrics and logs.	Expensive. Vendor lock-in.
Honeycomb	SaaS, columnar storage	Managed	Arbitrary-dimension queries. Excellent for high-cardinality.	Expensive at scale. Learning curve for BubbleUp queries.

For a detailed Jaeger vs Zipkin comparison, including architecture differences, OTel integration, and a decision table, see our dedicated comparison. For the relationship between OpenTelemetry and Jaeger — they complement each other, they don't compete — see that guide.

Your first tracing pipeline

The fastest path to a working trace pipeline is: OTel SDK → OTel Collector → Jaeger. Here's a minimal setup.

1. Instrument your application

For a Node.js Express application:

npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-grpc

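// tracing.ts (the file name is arbitrary): load this before any application
// code so the auto-instrumentations can patch modules as they are first imported.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";

const sdk = new NodeSDK({
  // OTLP over gRPC to the Collector started in step 2
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4317",
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: "order-service", // the name this service appears under in Jaeger
});

sdk.start();

// Flush buffered spans on shutdown so the last traces are not lost.
process.on("SIGTERM", () => {
  void sdk.shutdown();
});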

This auto-instruments HTTP, gRPC, database clients, and popular frameworks. Every incoming request creates a span. Every outgoing HTTP call creates a child span. Context propagation is automatic.

2. Run the OTel Collector

Use the config from our OTel Collector guide:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 5s
    send_batch_size: 512

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]

3. Run Jaeger

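# Assumes the Collector runs as a container on the same user-defined network
# ("tracing" is a placeholder name). The alias matches the jaeger-collector
# hostname in the Collector config. Port 4317 is not published on the host,
# because the Collector already listens there.
docker network create tracing
docker run -d --name jaeger \
  --network tracing --network-alias jaeger-collector \
  -p 16686:16686 \
  jaegertracing/jaeger:latest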

Open http://localhost:16686 and you'll see traces from your application. Click on a trace to see the span tree — every service hop, every database query, every external API call, with timing for each.

Sampling: the cost control lever

In a high-throughput system (10,000+ requests per second), tracing every request generates terabytes of data per day. Sampling reduces the volume while preserving diagnostic value.
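A back-of-the-envelope estimate shows where the terabytes come from. The spans-per-trace and bytes-per-span figures below are assumptions chosen for illustration, not measurements; substitute your own numbers. `dailyTraceBytes` is a hypothetical helper.

```typescript
// Estimated bytes of span data produced per day.
function dailyTraceBytes(
  requestsPerSecond: number,
  spansPerTrace: number,
  bytesPerSpan: number,
  sampleRatio: number = 1,
): number {
  const secondsPerDay = 86400;
  return requestsPerSecond * secondsPerDay * spansPerTrace * bytesPerSpan * sampleRatio;
}

// Assumed workload: 10,000 req/s, 10 spans per trace, 500 bytes per span.
const unsampledBytes = dailyTraceBytes(10000, 10, 500); // 4.32e12 bytes, about 4.3 TB/day
const tenPercentBytes = dailyTraceBytes(10000, 10, 500, 0.1); // about 0.43 TB/day
```

Under these assumptions, tracing everything produces roughly 4.3 TB per day, and 10% sampling cuts that by a factor of ten.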

Head-based sampling decides at the entry point whether to trace the request. Simple and predictable, but it can miss rare errors: the decision is made before the outcome is known, so 10% sampling drops 90% of all traces, including 90% of the error traces from a 0.1% error rate.
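One common way to make the head-based decision is to derive it from the trace ID, so the same trace always gets the same answer. The sketch below is written in the spirit of trace-ID-ratio sampling but is not the OpenTelemetry SDK's exact algorithm. It assumes trace IDs are randomly generated, and `shouldSample` is a hypothetical name.

```typescript
// Decide whether to record a trace, using only its ID and the target ratio.
// The first 8 hex characters are read as a number in [0, 2^32); IDs below
// ratio * 2^32 are kept. A malformed ID parses to NaN and is never sampled.
function shouldSample(traceId: string, ratio: number): boolean {
  const bucket = parseInt(traceId.slice(0, 8), 16);
  return bucket < ratio * 0x100000000;
}

const exampleTraceId = "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6";
const keptAtTenPercent = shouldSample(exampleTraceId, 0.1); // false: 0xa1b2c3d4 is above the 10% cutoff
const keptAtFull = shouldSample(exampleTraceId, 1); // true: ratio 1 keeps everything
```

Because the decision ignores what happens later in the request, an error trace has the same chance of being dropped as any other trace.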

Tail-based sampling records all spans initially, then decides at the Collector whether to keep the complete trace. This lets you keep 100% of error traces, 100% of slow traces, and sample 1% of normal traces. The trade-off: the Collector must buffer all spans until the trace completes, which requires more memory.
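The tail-based policy described above (keep errors, keep slow traces, sample the rest) can be sketched as a single decision function over a completed trace. The types, the `keepTrace` name, and the thresholds are illustrative assumptions. In a real pipeline this policy is usually configured in the Collector, for example with the tail_sampling processor from the contrib distribution, rather than hand-written.

```typescript
interface FinishedSpan {
  parentSpanId?: string; // undefined for the root span
  status: "ok" | "error" | "unset";
  durationMs: number;
}

interface TailPolicy {
  slowThresholdMs: number; // keep traces whose root span is at least this slow
  baselineRatio: number; // fraction of ordinary traces to keep
}

// Decide whether to keep a complete trace. `random` is injectable so the
// decision can be made deterministic in tests.
function keepTrace(
  spans: FinishedSpan[],
  policy: TailPolicy,
  random: () => number = Math.random,
): boolean {
  if (spans.some((s) => s.status === "error")) return true;
  const root = spans.find((s) => s.parentSpanId === undefined);
  if (root !== undefined && root.durationMs >= policy.slowThresholdMs) return true;
  return random() < policy.baselineRatio;
}

const policy: TailPolicy = { slowThresholdMs: 1000, baselineRatio: 0.01 };

const errorTrace: FinishedSpan[] = [
  { status: "unset", durationMs: 120 },
  { parentSpanId: "root", status: "error", durationMs: 80 },
];
const keepError = keepTrace(errorTrace, policy); // true, regardless of the random draw
```

The function only works on a complete span list, which is the memory cost mentioned above: every span has to be held somewhere until the trace is finished and the decision can be made.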

For most teams, start with head-based sampling at 10–50% and add tail-based sampling when you find yourself missing critical traces.

Monitoring the tracing pipeline itself

Your tracing pipeline is infrastructure that can fail. The OTel Collector can OOM, Jaeger's Elasticsearch backend can run out of disk, and the network between your Collector and backend can partition. When any of these fail, traces are silently dropped — you don't notice until someone asks "why are there no traces for this incident?"

External monitoring closes the gap. A 30-second health check on your Collector's health endpoint and your Jaeger query service catches pipeline failures before the gap in your trace data becomes a blind spot. Set up these checks at app.devhelm.io — the infrastructure that observes your application should itself be observed by something outside your stack.

