DevHelm

Posted on • Originally published at devhelm.io

Jaeger Tracing Explained: How Distributed Tracing Works

Distributed tracing answers the question that uptime monitoring can't: a request failed, but which service in the chain caused it?

When your checkout endpoint returns a 500 and your monitoring dashboard shows the API is degraded, you know that something broke. You don't know where in the chain of payment-service -> inventory-service -> shipping-service the failure originated. Distributed tracing instruments every service in the path and stitches the results into a single timeline — a trace — that shows exactly where latency accumulated or where an error propagated from.

Jaeger is one of the most widely deployed open-source distributed tracing backends and a CNCF graduated project. This guide covers how it works, how to set it up with OpenTelemetry, and when tracing complements (rather than replaces) uptime monitoring.

What Jaeger does

Jaeger collects, stores, and visualizes traces. A trace is a tree of spans — each span represents one unit of work (an HTTP request handler, a database query, a gRPC call to another service). Spans carry timing data, status codes, and arbitrary tags.
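To make the "tree of spans" idea concrete, here is a minimal sketch that models spans as plain records and renders them as an indented timeline. The `Span` fields, the span IDs, and the durations are illustrative assumptions, not Jaeger's actual storage model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Illustrative span record (hypothetical fields, not Jaeger's schema)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    operation: str
    start_ms: int
    duration_ms: int
    tags: dict = field(default_factory=dict)


def build_tree(spans):
    """Index spans by parent ID so the trace can be walked from the root."""
    children = {}
    root = None
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            children.setdefault(span.parent_id, []).append(span)
    return root, children


def render(node, children, depth=0):
    """Return one indented line per span, depth-first, ordered by start time."""
    lines = [f"{'  ' * depth}{node.operation} ({node.duration_ms} ms)"]
    for child in sorted(children.get(node.span_id, []), key=lambda s: s.start_ms):
        lines.extend(render(child, children, depth + 1))
    return lines


# Made-up spans for a single checkout request.
spans = [
    Span("a", None, "POST /checkout", 0, 120),
    Span("b", "a", "payment-service charge", 5, 80),
    Span("c", "b", "SELECT orders", 10, 60, {"db.system": "postgresql"}),
    Span("d", "a", "inventory-service reserve", 90, 25),
]
root, children = build_tree(spans)
print("\n".join(render(root, children)))
```

The parent ID on each span is what lets a backend reassemble the tree after spans arrive independently from different services.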

When you instrument your services with OpenTelemetry SDKs, each outgoing request propagates a trace ID via HTTP headers. Jaeger collects spans from every service, groups them by trace ID, and renders the full request path as a timeline.
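By default, OpenTelemetry SDKs carry this context in the W3C Trace Context `traceparent` header, which has the form `version-traceid-parentid-flags`. The sketch below shows roughly what propagation involves, using only the standard library; the helper names are hypothetical, and in practice the SDK's propagator does this for you.

```python
import re
import secrets

# version (2 hex) - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled=True):
    """Start a new trace: fresh random trace ID and span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    _version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid under the W3C spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def child_traceparent(incoming):
    """Header for an outgoing call: same trace ID, new span ID."""
    ctx = parse_traceparent(incoming)
    if ctx is None:
        return new_traceparent()  # no valid context, so start a new trace
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(child_traceparent(incoming))
```

Because every hop reuses the incoming trace ID and only replaces the span ID, the backend can group spans from all services under one trace.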

The architecture has four components:

Component | Role
--- | ---
Agent | Lightweight daemon that receives spans from the SDK and forwards them to the collector. Optional with OTLP, since the SDK can send spans directly, and deprecated in current Jaeger releases.
Collector | Receives spans, validates and indexes them, and writes them to storage.
Query | API and UI for searching and visualizing traces.
Storage | Pluggable backend: Elasticsearch, Cassandra, Badger, or in-memory for development. Kafka can sit between the collector and storage as a buffer.

For development and small deployments, Jaeger ships an all-in-one binary that bundles all four components with in-memory storage:

docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

Port 16686 is the Jaeger UI. Ports 4317 (gRPC) and 4318 (HTTP) receive OTLP spans from OpenTelemetry SDKs.

Instrumenting with OpenTelemetry

Jaeger originally shipped its own client libraries, but the project now officially recommends OpenTelemetry SDKs for instrumentation. The OTel SDK instruments your code; the OTLP exporter sends spans to Jaeger's collector endpoint.

A minimal Python example:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter,
)

provider = TracerProvider()
exporter = OTLPSpanExporter(
    endpoint="http://localhost:4317", insecure=True
)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", "ORD-1234")
    span.set_attribute("order.total", 89.99)

The BatchSpanProcessor buffers spans and flushes them in batches to avoid blocking your request path. For production, add resource attributes (service name, version, environment) so you can filter traces in the Jaeger UI.

The equivalent in TypeScript with auto-instrumentation:

import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4317",
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: "checkout-service",
});

sdk.start();

The auto-instrumentations-node package automatically instruments HTTP, gRPC, Express, database clients, and dozens of other libraries — no manual span creation needed for most frameworks.

Jaeger vs Zipkin vs commercial APMs

Feature | Jaeger | Zipkin | Datadog APM / New Relic
--- | --- | --- | ---
License | Apache 2.0 | Apache 2.0 | Proprietary
Storage | Elasticsearch, Cassandra, Badger (Kafka as a buffer) | Elasticsearch, Cassandra, MySQL | Vendor-hosted
Protocol | OTLP native, Thrift legacy | Zipkin JSON/Thrift, OTLP via collector | OTLP, proprietary agents
Auto-instrumentation | Via OpenTelemetry SDKs | Via OpenTelemetry or Brave | Proprietary agents with deep integration
Cost | Infrastructure only | Infrastructure only | Roughly $15-75/host/month
Trace analytics | Search + compare in UI | Search in UI | ML anomaly detection, dashboards, alerting

Choose Jaeger when you want full control over your tracing data, you're already running Elasticsearch or Cassandra, and you have the capacity to operate the backend.

Choose a commercial APM when you need trace-based alerting, ML anomaly detection, or you don't want to operate storage infrastructure. The cost is real, but so is the operational burden of self-hosted tracing.

Choose Zipkin when you have an existing Zipkin deployment. For new setups, Jaeger has better OTLP support and a more active community.

For larger deployments, the OpenTelemetry Collector sits between your SDKs and Jaeger, handling batching, sampling, and multi-backend fan-out.
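Sampling only works across services if every hop makes the same keep-or-drop decision for a given trace. One common way to get that property is to derive the decision from the trace ID itself. The function below is a sketch of that idea under stated assumptions (it compares the low 64 bits of the trace ID against a ratio-derived bound); it is not the Collector's implementation, and `should_sample` is a hypothetical name.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic ratio-based sampling keyed on the trace ID.

    The same trace ID always yields the same decision, so services that
    sample independently still keep or drop whole traces together.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    bound = round(ratio * (1 << 64))
    low_64_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low_64_bits < bound


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(should_sample(trace_id, 0.5))  # same answer on every service
```

Because the decision depends only on the trace ID and the ratio, lowering the ratio never keeps a trace that a higher ratio would have dropped.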

When tracing isn't enough

Tracing answers "which service in the chain is slow?" It doesn't answer "is the service reachable from the outside?" A trace only exists when a request is made — if your service is completely down and rejecting connections, there are no spans to collect.

This is where uptime monitoring and tracing complement each other:

Signal | What it tells you | Blind spot
--- | --- | ---
Uptime monitoring | Is the service responding? How fast? Is the SSL cert valid? | Why is it slow? Which internal dependency is the bottleneck?
Distributed tracing | Where in the call chain did latency accumulate? Which downstream service errored? | Is the service reachable at all? Did the DNS resolve? Did the cert expire?

The most useful setup is both: an external monitor that checks your endpoints every 30 seconds from multiple regions, combined with internal tracing that captures the request path when those endpoints are hit. When the monitor fires an alert because p95 latency crossed your SLO threshold, the trace for that time window shows you exactly which downstream call caused the spike.
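As a rough illustration of the alert condition described above, the sketch below computes a nearest-rank p95 over a window of latency samples and compares it to a threshold. The sample values and the 500 ms threshold are made up, and real monitors may use a different percentile method.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def breaches_slo(latencies_ms, threshold_ms):
    """True when the window's p95 latency exceeds the SLO threshold."""
    return p95(latencies_ms) > threshold_ms


# Hypothetical window: 18 fast checks and 2 slow ones.
window = [120] * 18 + [2100, 2300]
print(p95(window), breaches_slo(window, threshold_ms=500))
```

When this condition fires, the window's start and end timestamps are what you would use to search Jaeger for the slow traces.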

For dependency-driven outages (a cloud provider degrades, your payment service slows down, your checkout endpoint breaches its latency budget), a vendor status feed tells you the dependency degraded before you start digging through traces. That head start on root cause identification can be the difference between a 15-minute MTTR and an hour-long investigation.

Where to start

If you've never used distributed tracing, start with Jaeger all-in-one in Docker (the command above), instrument one service with the OpenTelemetry SDK, and trace a single request end-to-end. The first time you see a 2-second span on a database query that you thought took 50ms, tracing has paid for itself.

Pair it with external monitoring. Set up a monitor at app.devhelm.io for the same endpoint you're tracing — you'll know both that the service is degraded and where in the call chain the problem lives.


Originally published on DevHelm.
