Raj Kundalia

Distributed Tracing in Spring Boot: A Practical Guide to OpenTelemetry and Jaeger

TL;DR

Distributed tracing helps you understand how requests flow through microservices by tracking every hop with minimal overhead. This guide covers OpenTelemetry integration in Spring Boot 4 using the native starter, explains core concepts like spans and context propagation, and demonstrates Jaeger-based tracing with best practices for production. Whether you're debugging latency issues or optimizing service dependencies, distributed tracing provides the visibility modern architectures demand.

GitHub Repository: learning-distributed-tracing

The Problem: Debugging in the Dark

In a monolithic application, debugging a slow request is straightforward. Add some logging, attach a profiler, and you can see exactly where time is spent. But microservices change everything. A single user request might touch ten or more services, each with its own logs. Failures often happen between services, not inside them. When something breaks or slows down, where do you even start?

Traditional logging falls short here. Sure, you can correlate logs by request ID, but manually piecing together the journey across services, databases, and queues is tedious and error-prone. You need something that automatically tracks the entire execution path, measures timing at each step, and shows you the complete picture. That's distributed tracing.

Understanding Observability: Metrics, Logs, and Traces

Modern observability rests on three pillars. Metrics are numerical measurements like CPU usage or request count—great for alerting but lacking context for debugging. Logs are discrete events that tell you what happened at a specific moment but struggle with correlation across distributed systems. Traces capture the complete journey of a request through your system, showing execution flow and timing.

These pillars complement each other. Metrics tell you there's a problem, logs provide event details, and traces show you the execution path. Together, they form a complete observability strategy.

It's worth distinguishing observability from monitoring. Monitoring answers "Is the system healthy?" through dashboards and alerts. Observability answers "Why is the system behaving this way?" by designing systems to answer questions you didn't anticipate. Distributed tracing is a core enabler of observability, not a replacement for monitoring.

The Fundamentals of Distributed Tracing

Telemetry refers to automated data collection from remote sources—your application constantly reporting its health and activity. Spans are the building blocks of traces, representing units of work with start time, duration, and metadata. When Service A calls Service B, both create spans that form a parent-child relationship showing the call hierarchy.

Traces are collections of spans representing a single transaction. A trace ID ties all related spans together across service boundaries. Context Propagation maintains trace continuity—when Service A calls Service B, it passes the trace context in HTTP headers, allowing Service B to create child spans under the same trace.
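
To make spans, traces, and the parent-child relationship concrete, here is a minimal sketch using the OpenTelemetry API directly (class, tracer, and span names are illustrative; in a Spring Boot application the framework creates most of these spans for you):

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class CheckoutFlow {

    private final Tracer tracer;

    public CheckoutFlow(OpenTelemetry openTelemetry) {
        // "order-service" is an illustrative instrumentation scope name
        this.tracer = openTelemetry.getTracer("order-service");
    }

    public void placeOrder() {
        // Parent span for the overall unit of work
        Span parent = tracer.spanBuilder("placeOrder").startSpan();
        try (Scope ignored = parent.makeCurrent()) {
            // Any span started while the parent is "current" becomes its child,
            // so both spans share the same trace ID
            Span child = tracer.spanBuilder("reserveInventory").startSpan();
            try (Scope ignoredChild = child.makeCurrent()) {
                // ... call the inventory service here ...
            } finally {
                child.end();
            }
        } finally {
            parent.end();
        }
    }
}
```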

OpenTelemetry: The Industry Standard

Before OpenTelemetry, every observability vendor had proprietary SDKs and formats. If you wanted to switch from Jaeger to Zipkin, you'd re-instrument your entire codebase. This vendor lock-in meant architectural decisions became permanent commitments.

OpenTelemetry is a vendor-neutral framework providing APIs, SDKs, and tools for telemetry data. Formed by merging OpenTracing and OpenCensus, it provides a single instrumentation API that works with any backend. The value proposition is simple: instrument once, send data anywhere.

The architecture includes the API and SDK for creating telemetry, Auto-instrumentation for frameworks like Spring and JDBC, and the Collector—an optional but recommended component that receives, processes, and exports telemetry.

While this article focuses on distributed tracing, it's worth noting that OpenTelemetry standardizes all three pillars of observability—metrics, logs, and traces. The same SDK and protocol handle all three, giving you a unified approach to instrumentation across your entire observability stack.

OTLP (OpenTelemetry Protocol) is the wire format for transmitting telemetry data. Supporting both gRPC and HTTP transports, OTLP defines how traces, metrics, and logs are serialized and sent to collectors or backends. The protocol handles backpressure, retries, and batching for reliable delivery. Most modern observability tools now support OTLP natively, making it the de facto standard.
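
As a rough, hand-wired sketch of what such an export pipeline looks like underneath (the Spring Boot starter assembles the equivalent from your application properties; the endpoint below assumes a collector or Jaeger instance listening on the default OTLP gRPC port 4317):

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class OtlpPipeline {

    // Sketch of the SDK wiring the starter performs for you behind the scenes
    static SdkTracerProvider buildTracerProvider() {
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                // Default OTLP/gRPC endpoint; both the Collector and Jaeger accept it
                .setEndpoint("http://localhost:4317")
                .build();

        return SdkTracerProvider.builder()
                // Batches finished spans and exports them over OTLP
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build();
    }
}
```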

Spring Boot 4 and OpenTelemetry Integration

Spring Boot 4 brings first-class support for OpenTelemetry through the spring-boot-starter-opentelemetry dependency. This starter provides automatic configuration and instrumentation for common scenarios like HTTP requests, database calls, and messaging.

Previous versions of Spring Boot required manual setup using the OpenTelemetry Java agent or custom configuration. Spring Boot 2 and 3 users could leverage the Java agent for bytecode instrumentation, which worked but added operational complexity. The agent approach meant deploying a JAR alongside your application and configuring it via environment variables or system properties.

With Spring Boot 4, the starter eliminates much of this complexity. Add the dependency, configure a few properties, and you're done. Under the hood, it uses Spring's auto-configuration to set up the OpenTelemetry SDK, register instrumentation libraries, and configure exporters based on your application properties.

The starter automatically instruments:

  • HTTP requests and responses via Spring MVC and WebFlux
  • RestTemplate, RestClient, and WebClient calls
  • JDBC database operations
  • Logs (trace and span IDs are automatically included)

For instrumentation beyond the starter's defaults, such as Kafka messaging, you can add spans manually with the @WithSpan annotation, or use the OpenTelemetry Java agent, which provides automatic instrumentation for 150+ libraries.
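
A small sketch of the annotation approach (service, span, and attribute names are made up for illustration):

```java
import io.opentelemetry.instrumentation.annotations.SpanAttribute;
import io.opentelemetry.instrumentation.annotations.WithSpan;
import org.springframework.stereotype.Service;

@Service
public class PaymentService {

    // Creates a span named "validatePaymentRequest" around each call to this method;
    // the annotated parameter is recorded as a span attribute
    @WithSpan("validatePaymentRequest")
    public void validate(@SpanAttribute("payment.order.id") String orderId) {
        // ... domain logic ...
    }
}
```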

Spring Boot Actuator's Role: While Actuator isn't required for tracing, it plays a complementary role in Spring Boot 4's observability story. Actuator's ObservationRegistry is what actually observes requests and framework operations. The OpenTelemetry starter bridges these observations into OTel-compliant traces. Think of Actuator as operational introspection (health, metrics) and OpenTelemetry as behavioral introspection (request flows).
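
To illustrate the relationship, here is a small sketch that records a custom observation through Micrometer's API (names are illustrative); with the OpenTelemetry bridge in place, the observation shows up as a span in the same trace:

```java
import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationRegistry;
import org.springframework.stereotype.Service;

@Service
public class ReportService {

    private final ObservationRegistry registry;

    public ReportService(ObservationRegistry registry) {
        this.registry = registry;
    }

    public void generate() {
        // Observed work is timed and, via the bridge, exported as a span
        Observation.createNotStarted("report.generate", registry)
                .lowCardinalityKeyValue("report.type", "monthly")
                .observe(() -> {
                    // ... work to be observed ...
                });
    }
}
```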

You can still use the Java agent if you need instrumentation for libraries outside Spring's ecosystem, but for typical Spring Boot applications, the starter is sufficient and more maintainable. Framework-level instrumentation gives you baseline visibility automatically, while custom spans should be added only where domain insight is needed. This balance is critical—over-instrumentation creates noise, while under-instrumentation hides intent.

Jaeger: Your Trace Backend

Jaeger is an open-source distributed tracing platform originally developed by Uber, providing storage, querying, and visualization for traces. While OpenTelemetry handles generation and collection, Jaeger handles the backend.

Jaeger's architecture includes agents, collectors, a query service, and a web UI. For development, the all-in-one Docker image combines all components. A common misconception is that Jaeger requires Kubernetes—it doesn't. Jaeger runs on Docker, VMs, or bare metal. The all-in-one image works for local development, while production typically uses separate components with external storage like Cassandra or Elasticsearch.

Jaeger supports multiple ingestion formats, including OTLP. With OpenTelemetry's standardization, OTLP is now recommended, meaning your Spring Boot application sends traces in OTLP format directly to Jaeger without needing Jaeger-specific libraries.

Tracing Beyond Services: Databases and Message Queues

One of the most powerful aspects of distributed tracing is visibility into external dependencies. When your application makes a database call or publishes to Kafka, those operations appear as spans in your trace.

Database tracing works through JDBC instrumentation. When your Spring Boot application executes a SQL query, the OpenTelemetry instrumentation automatically creates a span containing the query, execution time, and database connection details. This visibility is crucial for identifying slow queries or N+1 problems—those situations where you're executing one query to fetch entities, then N additional queries to fetch related data for each entity. Database spans make these anti-patterns immediately visible in your trace timeline. However, be mindful of sensitive data. Database spans can include SQL statements with parameter values, which might contain PII. OpenTelemetry provides span processors to redact or mask sensitive information before export.

Message queue tracing extends traces across asynchronous boundaries. When Service A publishes a message to Kafka, it injects the trace context into message headers. When Service B consumes that message, it extracts the context and continues the trace. This creates a parent-child relationship between the producer and consumer spans, even though they execute at different times. The result is end-to-end visibility into asynchronous workflows, making it much easier to debug message processing issues or track down where data transformations went wrong.
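
For producers not covered by automatic instrumentation, context injection can be done by hand with the propagation API. A rough sketch, assuming the kafka-clients library is on the classpath (Spring Kafka or the Java agent can handle this for you):

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapSetter;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.nio.charset.StandardCharsets;

public class TracingKafkaPublisher {

    // Writes each propagation field (e.g. the W3C "traceparent" header) into the record headers
    private static final TextMapSetter<ProducerRecord<String, String>> SETTER =
            (record, key, value) ->
                    record.headers().add(key, value.getBytes(StandardCharsets.UTF_8));

    public ProducerRecord<String, String> withTraceContext(ProducerRecord<String, String> record) {
        GlobalOpenTelemetry.getPropagators()
                .getTextMapPropagator()
                .inject(Context.current(), record, SETTER);
        return record;
    }
}
```

On the consumer side, the matching extract call rebuilds the context from those headers before the consumer span is started, which is what links the two sides of the asynchronous hop.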

Performance Impact and Production Considerations

Distributed tracing adds overhead from creating spans, serializing data, and network transmission. The impact varies by component:

CPU: Span creation and serialization typically add microseconds per operation. The OpenTelemetry SDK uses efficient batching to minimize per-span overhead.

Memory: The SDK buffers spans before export. Configure batch size and timeout based on traffic patterns and memory constraints to prevent excessive buffering.

Network IO: Sending traces to a local collector over localhost has minimal impact. Remote backends introduce latency and bandwidth usage. Using a collector to batch and compress traces reduces network overhead significantly. Importantly, the collector absorbs most of the performance cost, acting as a buffer between your applications and backends.

In practice, overhead is typically under 5 percent for CPU and memory. The key is intelligent sampling—trace 1-5 percent of traffic in production rather than every request (development should trace 100 percent for debugging). OpenTelemetry supports probability-based sampling for production and rate-limiting to cap traces per second.
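
A hand-wired sketch of those two knobs, sampling ratio and batch buffering (the numbers are illustrative starting points; with the starter you would normally set the equivalent application properties instead):

```java
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
import io.opentelemetry.sdk.trace.export.SpanExporter;
import io.opentelemetry.sdk.trace.samplers.Sampler;

import java.time.Duration;

public class SampledPipeline {

    static SdkTracerProvider buildTracerProvider(SpanExporter exporter) {
        return SdkTracerProvider.builder()
                // Honor the caller's sampling decision; for new traces keep roughly 5%
                .setSampler(Sampler.parentBased(Sampler.traceIdRatioBased(0.05)))
                // Bound buffering memory and control how often batches are flushed
                .addSpanProcessor(BatchSpanProcessor.builder(exporter)
                        .setMaxQueueSize(2048)
                        .setMaxExportBatchSize(512)
                        .setScheduleDelay(Duration.ofSeconds(5))
                        .build())
                .build();
    }
}
```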

Best Practices for Distributed Tracing

Use meaningful span names: "validatePaymentRequest" beats "process" every time. Good naming makes traces self-documenting.

Add relevant attributes: Follow OpenTelemetry semantic conventions for HTTP, databases, and queues. Add custom attributes for business context like user ID or tenant ID.
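
A tiny sketch of tagging whatever span is currently active with business context (the key names are illustrative; where an official semantic convention exists, prefer its constant):

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.trace.Span;

public class SpanTagging {

    static void tagCurrentSpan(String userId, String tenantId) {
        // Span.current() returns the active span created by the framework instrumentation
        Span span = Span.current();
        span.setAttribute(AttributeKey.stringKey("app.user.id"), userId);
        span.setAttribute(AttributeKey.stringKey("app.tenant.id"), tenantId);
    }
}
```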

Don't over-instrument: Creating spans for every method produces noise. Focus on external calls, database queries, and significant business logic.

Implement proper error handling: Mark spans as failed and record exception details when errors occur. This helps identify which service and operation caused failures.
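
A minimal sketch of that pattern (class name and error message are illustrative):

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;

public class PaymentClient {

    void charge(String orderId) {
        Span span = Span.current();
        try {
            // ... call the payment provider here ...
        } catch (RuntimeException e) {
            // Record the exception as a span event and mark the span as failed
            span.recordException(e);
            span.setStatus(StatusCode.ERROR, "charge failed for order " + orderId);
            throw e;
        }
    }
}
```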

Sample intelligently: Trace everything in development (probability 1.0), but use 1-5 percent sampling in production (probability 0.01-0.05). This gives you statistically significant insights without overloading infrastructure. Consider adaptive sampling that increases rates for slow requests or errors.

Watch for orphaned spans: When requests hand off work to async thread pools, ensure context propagation is maintained. If a new thread loses the trace context, your trace will break, resulting in disconnected "orphaned spans" that can't be correlated. Spring Boot 4 usually handles this automatically, but verify your custom executors are properly instrumented.
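
One way to keep the trace context attached when handing work to your own thread pool is to wrap the executor, as in this sketch (class name and pool size are illustrative; Spring's context-propagating task decorators are an alternative):

```java
import io.opentelemetry.context.Context;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncWork {

    // Wrapping the executor captures the current trace context at submit time
    // and restores it on the worker thread, so spans created there stay attached
    // to the original trace instead of becoming orphans
    private final ExecutorService tracedExecutor =
            Context.taskWrapping(Executors.newFixedThreadPool(4));

    public void submit(Runnable task) {
        tracedExecutor.submit(task);
    }
}
```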

Use the Collector: It provides buffering, enrichment, routing, and reliability that SDK exporters alone cannot.

Monitor your telemetry pipeline: Track export success rates and latency. If your pipeline breaks, you're debugging blind.

Querying and Analyzing Traces

Jaeger's UI provides powerful analysis tools. Search for traces by service, operation, tags, duration, and time range. The trace timeline shows the complete request flow with parent-child relationships visually nested. For advanced use cases, Jaeger Query Language (JQL) enables programmatic querying and integration with automated alerting systems. The trace comparison feature helps identify performance regressions by highlighting timing differences between trace versions.

Conclusion

Distributed tracing transforms how you understand and debug microservices. By automatically capturing request flows and timing information, it eliminates the guesswork from performance analysis and incident response. OpenTelemetry provides the standardized instrumentation, OTLP handles reliable transmission, and backends like Jaeger give you the visualization and querying tools to make sense of the data.

Spring Boot 4's native OpenTelemetry support makes adoption straightforward. Add the starter, configure your exporter, and you're tracing HTTP requests, database queries, and message queues with minimal code. The result is a system where every request tells its own story, complete with timing, dependencies, and errors.

Start small. Enable tracing in one service, verify the data reaches Jaeger, and gradually expand to your entire application. The visibility you gain will pay dividends the first time you debug a cross-service issue or optimize a slow endpoint. Distributed tracing isn't just a monitoring tool; it's a fundamental shift in how you understand distributed systems.

For hands-on examples and complete configuration, check out the learning-distributed-tracing repository.

Learning Links:

https://spring.io/blog/2025/11/18/opentelemetry-with-spring-boot
https://opentelemetry.io/docs/zero-code/java/spring-boot-starter/
https://foojay.io/today/spring-boot-4-opentelemetry-explained/
https://last9.io/blog/opentelemetry-for-spring/
https://signoz.io/blog/opentelemetry-spring-boot/
https://vorozco.com/blog/2024/2024-11-18-A-practical-guide-spring-boot-open-telemetry.html
https://medium.com/cloud-native-daily/how-to-send-traces-from-spring-boot-to-jaeger-229c19f544db
https://medium.com/xebia-engineering/jaeger-integration-with-spring-boot-application-3c6ec4a96a6f
https://blog.vinsguru.com/distributed-tracing-in-microservices-with-jaeger/
https://last9.io/blog/distributed-tracing-with-spring-boot/
https://signoz.io/blog/jaeger-vs-zipkin/
