DEV Community

DevHelm

Posted on • Originally published at devhelm.io

Jaeger vs Zipkin: Which Distributed Tracing Backend to Pick in 2026

Jaeger and Zipkin both store and query distributed traces. They both support Elasticsearch and Cassandra as storage backends. They both accept data from OpenTelemetry instrumented applications. If you're evaluating them side by side, the marketing pages won't help — they describe the same features with different adjectives.

This comparison focuses on the architectural differences that actually affect your operational experience. For the foundational concepts — spans, traces, context propagation — see Distributed Tracing 101.

Origin and governance

Jaeger was built at Uber in 2015 to trace requests across their microservice fleet. It was open-sourced, donated to the CNCF, and graduated in 2019. It is written in Go. Active development continues under the CNCF umbrella with hundreds of contributors.

Zipkin was built at Twitter in 2012, inspired by Google's Dapper paper. It is written in Java. It is an independent open-source project and is not part of the CNCF. Development is active but slower than Jaeger's, with a smaller contributor base.

The governance difference matters for long-term bets. CNCF graduation means Jaeger has committed maintainers, a security audit process, and a defined path for new features. Zipkin relies on a smaller group of core maintainers.

Architecture

This is the most consequential difference.

Zipkin is monolithic. The collector, storage interface, query API, and web UI run as a single process. You deploy one binary (or one Docker container), point it at a storage backend, and you're done. This makes Zipkin trivially easy to deploy and operate for small-to-medium workloads.

Jaeger is distributed. The architecture separates into independently scalable components:

| Component | Role |
| --- | --- |
| jaeger-collector | Receives spans, validates, indexes, writes to storage |
| jaeger-query | Serves the UI and API, reads from storage |
| jaeger-agent | Optional. Runs per-node, buffers spans, forwards to collector |
| jaeger-ingester | Optional. Reads from Kafka for high-volume deployments |

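Each component can be scaled independently. Under heavy load, you scale the collector horizontally without touching the query service. Where it is deployed, the agent buffers spans locally, so a brief collector outage is less likely to drop data from your applications. Note that jaeger-agent is deprecated in recent Jaeger releases, which favor sending OTLP directly to the collector or through an OpenTelemetry Collector.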

The trade-off: Jaeger requires more operational knowledge to deploy and tune. You're running 2–4 separate services instead of one.

When the architecture difference matters

Below ~100,000 spans/second: Zipkin's monolithic architecture is fine. One process, one container, straightforward resource allocation.

Above ~100,000 spans/second: Zipkin's single process becomes a bottleneck. The collector, storage writer, and query service compete for the same CPU and memory. Jaeger's separated architecture lets you scale the collector (the write path) independently of the query service (the read path).
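To see which side of that threshold a system falls on, a rough estimate of ingest rate helps. The sketch below is a back-of-the-envelope calculation only: the function names, the sample traffic figures, and the assumption that ingest rate is simply requests × spans per trace × sampling rate are all illustrative, and the 100,000 cutoff is the approximate figure used in this article.

```python
# Rough ingest-rate estimate. All names and sample numbers are hypothetical.
THRESHOLD_SPANS_PER_SECOND = 100_000  # approximate cutoff used in this article


def spans_per_second(requests_per_second, spans_per_trace, sampling_rate):
    """Estimate how many spans per second reach the tracing backend."""
    return requests_per_second * spans_per_trace * sampling_rate


def suggest_architecture(rate):
    """Map an estimated ingest rate onto the rule of thumb above."""
    if rate < THRESHOLD_SPANS_PER_SECOND:
        return "a single process is likely fine"
    return "consider scaling the write path separately"


# Example: 5,000 requests/s, about 40 spans per trace, 10% of traces sampled.
rate = spans_per_second(requests_per_second=5_000, spans_per_trace=40, sampling_rate=0.1)
print(f"{rate:,.0f} spans/s: {suggest_architecture(rate)}")
```

Real span counts per trace vary widely between services, so it is worth measuring them rather than guessing.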

With Kafka as a buffer: Jaeger has a native Kafka integration via the ingester component. Write spans to Kafka, then the ingester reads and writes to storage asynchronously. This absorbs traffic spikes without backpressure to your applications. Zipkin supports Kafka as a transport layer, but the integration is less mature.
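The buffering effect can be illustrated with a toy queue model. This is not Kafka or Jaeger code: it only assumes that spans arrive at a varying rate and that a hypothetical storage writer drains them at a fixed rate, and it reports how large the backlog gets. The traffic numbers are made up.

```python
# Toy model of a buffer between span producers and a storage writer.
# Units are arbitrary (for example, thousands of spans per second).
def max_backlog(arrivals_per_tick, drain_per_tick):
    """Return the peak number of spans waiting in the buffer."""
    backlog = 0
    peak = 0
    for arrived in arrivals_per_tick:
        # Spans that cannot be written this tick stay in the buffer.
        backlog = max(0, backlog + arrived - drain_per_tick)
        peak = max(peak, backlog)
    return peak


# Steady traffic of 50, with a two-tick spike to 300; the writer handles 100 per tick.
traffic = [50, 50, 300, 300, 50, 50, 50]
peak = max_backlog(traffic, drain_per_tick=100)
print(peak)  # 400
```

In this model the spike is absorbed as long as the buffer can hold the peak backlog, and the writer catches up once traffic returns to normal. Without a buffer, the same spike would have to be handled by rejecting spans or slowing producers.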

Storage backends

| Backend | Jaeger | Zipkin |
| --- | --- | --- |
| Elasticsearch | First-class support. Most common production choice. | Supported, commonly used. |
| Cassandra | First-class support. Jaeger was originally built on Cassandra at Uber. | Supported (Zipkin's original backend at Twitter). |
| MySQL | Not supported. | Supported. Suitable for small deployments only. |
| Kafka | Native ingester component for buffering. | Transport layer support, not primary storage. |
| Badger | Supported (embedded key-value store, for single-node deployments). | Not supported. |
| In-memory | Supported (development only). | Supported (development only). |

For production, both converge on Elasticsearch or Cassandra. The choice between those two is a separate decision based on your existing infrastructure and query patterns.

Query and UI

Jaeger UI is a React application with trace search, trace detail view, trace comparison (side-by-side diff of two traces), service dependency graphs, and Service Performance Monitoring (SPM) dashboards. The trace comparison feature is useful for debugging — compare a slow trace against a fast trace to identify the divergence point.

Zipkin UI is simpler. It offers trace search, trace detail view, and a dependency diagram. No trace comparison, no SPM. The interface is functional but less feature-rich.

For teams using Grafana, both integrate as data sources. Grafana's native Jaeger and Zipkin data sources let you query traces from your existing dashboards, reducing the need to use either tool's built-in UI.

OTel integration

Both accept traces from OpenTelemetry instrumented applications:

  • Jaeger natively accepts OTLP (gRPC and HTTP). Configure the OTel Collector's OTLP exporter to point at the Jaeger collector. No protocol translation needed.
  • Zipkin requires the Zipkin exporter in the OTel Collector, which translates OTLP spans to Zipkin's wire format. This works but adds a translation layer.
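To make the translation layer concrete, here is a simplified sketch of what converting one span involves. It is not the Collector's actual exporter code. The input is a hypothetical, flattened dict that loosely follows OTLP JSON field names (real OTLP attributes are a list of typed key/value pairs, and the service name lives on the resource), and the output follows the Zipkin v2 JSON span shape, where timestamps are in microseconds and tag values are strings.

```python
# Simplified OTLP -> Zipkin v2 span translation. Illustrative only.
KIND_MAP = {
    "SPAN_KIND_SERVER": "SERVER",
    "SPAN_KIND_CLIENT": "CLIENT",
    "SPAN_KIND_PRODUCER": "PRODUCER",
    "SPAN_KIND_CONSUMER": "CONSUMER",
}


def otlp_to_zipkin(span, service_name):
    """Convert a flattened OTLP-style span dict into a Zipkin v2 span dict."""
    # OTLP timestamps are nanoseconds; Zipkin uses microseconds.
    start_us = span["startTimeUnixNano"] // 1_000
    end_us = span["endTimeUnixNano"] // 1_000
    zipkin = {
        "traceId": span["traceId"],
        "id": span["spanId"],
        "name": span["name"],
        "timestamp": start_us,
        "duration": end_us - start_us,
        "localEndpoint": {"serviceName": service_name},
        # Zipkin tags are string-to-string, so typed values are stringified.
        "tags": {key: str(value) for key, value in span.get("attributes", {}).items()},
    }
    if span.get("parentSpanId"):
        zipkin["parentId"] = span["parentSpanId"]
    kind = KIND_MAP.get(span.get("kind"))
    if kind:  # internal and unspecified kinds have no Zipkin equivalent
        zipkin["kind"] = kind
    return zipkin


example = {
    "traceId": "5b8aa5a2d2c872e8321cf37308d69df2",
    "spanId": "051581bf3cb55c13",
    "parentSpanId": "",
    "name": "GET /checkout",
    "kind": "SPAN_KIND_SERVER",
    "startTimeUnixNano": 1_700_000_000_000_000_000,
    "endTimeUnixNano": 1_700_000_000_250_000_000,
    "attributes": {"http.status_code": 200},
}
converted = otlp_to_zipkin(example, service_name="checkout")
print(converted["duration"], converted["tags"])
```

The type information lost when attributes become strings is one example of why a translation step is an extra place for surprises.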

If you're starting with OpenTelemetry (and you should be — see OTel vs Jaeger for why), Jaeger's native OTLP support is a practical advantage. One less protocol conversion, one less thing to debug.

Sampling

Jaeger supports adaptive sampling: the collector dynamically adjusts sampling rates per service based on traffic volume. High-traffic services are given a lower sampling probability, so a smaller share of their traces is kept; low-traffic services keep a larger share of theirs. Remote sampling lets you change sampling rates without redeploying your applications.

Zipkin supports fixed-rate and probability-based sampling. You set a percentage, and that percentage of traces gets recorded. Changing the rate requires reconfiguring the Zipkin client or the OTel SDK's sampler.

Adaptive sampling matters at scale. If your checkout service handles 100 RPS and your admin panel handles 1 RPS, a flat 10% sampling rate gives you 10 checkout traces and 0.1 admin traces per second. Adaptive sampling automatically keeps more admin traces because the volume is lower.
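The arithmetic can be sketched in a few lines. The adaptive rule below is a deliberately simplified stand-in, not Jaeger's actual algorithm: it assumes a hypothetical target of kept traces per second for each service and derives a probability from the observed request rate. The parameter name `target_per_second` is illustrative.

```python
# Flat vs. adaptive sampling for the two services in the example above.
def flat_sampled(rps, rate):
    """Traces kept per second under a fixed sampling rate."""
    return rps * rate


def adaptive_probability(rps, target_per_second):
    """Sampling probability that aims for a fixed number of kept traces per second."""
    if rps <= 0:
        return 1.0
    return min(1.0, target_per_second / rps)


services = {"checkout": 100.0, "admin": 1.0}  # requests per second
for name, rps in services.items():
    flat = flat_sampled(rps, rate=0.10)
    probability = adaptive_probability(rps, target_per_second=1.0)
    print(f"{name}: flat keeps {flat:.1f}/s, adaptive p={probability:.2f} keeps {rps * probability:.1f}/s")
```

With a target of one trace per second, the checkout service is sampled at 1% and the admin panel at 100%, so both end up with comparable trace volume instead of the admin panel being starved.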

Decision table

| If you... | Pick |
| --- | --- |
| Run fewer than 10 services and want minimal operational overhead | Zipkin |
| Need trace comparison (diff two traces side by side) | Jaeger |
| Already run Elasticsearch and want to reuse it | Either (both support Elasticsearch well) |
| Need adaptive sampling for high-volume services | Jaeger |
| Want a single binary with zero configuration | Zipkin |
| Run on Kubernetes and want an official operator | Jaeger |
| Need Kafka as a buffer for traffic spikes | Jaeger |
| Prefer MySQL over Elasticsearch/Cassandra | Zipkin |
| Value CNCF governance and long-term maintenance | Jaeger |

The common answer in 2026

For most teams starting a new tracing deployment in 2026, the answer is Jaeger. The CNCF backing, native OTLP support, Kubernetes operator, adaptive sampling, and trace comparison features collectively outweigh Zipkin's simplicity advantage — especially since Jaeger's all-in-one deployment mode (jaeger-all-in-one) gives you a single binary for development and small production workloads anyway.

Zipkin remains a valid choice if you have an existing Zipkin deployment, prefer MySQL storage, or want the simplest possible setup for a small-scale system.

Both tools sit downstream of the OTel Collector. If you instrument with OpenTelemetry and export via the Collector, switching from Zipkin to Jaeger (or vice versa) is a config change — not a re-instrumentation project.
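As a rough illustration of how small that config change is, the sketch below builds two OTel Collector configurations as Python dicts and prints them as JSON for readability (the Collector itself is normally configured in YAML). The hostnames `jaeger-collector` and `zipkin` are placeholders, the ports are the conventional defaults, and the `zipkin` exporter is assumed to come from the Collector's contrib distribution. Treat it as a sketch rather than a production config.

```python
import json


def collector_config(backend):
    """Build a minimal OTel Collector config for the chosen tracing backend."""
    # Hostnames are placeholders; adjust to your environment.
    exporters = {
        "jaeger": ("otlp", {"endpoint": "jaeger-collector:4317", "tls": {"insecure": True}}),
        "zipkin": ("zipkin", {"endpoint": "http://zipkin:9411/api/v2/spans"}),
    }
    name, settings = exporters[backend]
    return {
        # Applications always send OTLP to the Collector, whichever backend is used.
        "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
        "exporters": {name: settings},
        "service": {
            "pipelines": {"traces": {"receivers": ["otlp"], "exporters": [name]}}
        },
    }


jaeger_cfg = collector_config("jaeger")
zipkin_cfg = collector_config("zipkin")
print(json.dumps(jaeger_cfg["exporters"], indent=2))
print(json.dumps(zipkin_cfg["exporters"], indent=2))
```

Only the exporter and the pipeline entry that references it differ between the two; the receiver side, which is what your applications talk to, stays the same.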

Monitor whichever backend you choose with external health checks at app.devhelm.io. A tracing backend that goes down silently means you lose trace data during the exact window when you're most likely to need it — during an incident.

