
Juan Torchia

Posted on • Originally published at juanchi.dev

OpenTelemetry on Spring Boot 3: when logs say OK and traces show the problem

There's a question I've asked myself many times while debugging backend systems: did the request take long because the DB was slow, because the downstream kept us waiting, or because some internal loop fired 60 queries to fetch 60 records? The log says duration_ms=340 and status=200. That's it. You start guessing.

That moment of uncertainty is where this lab came from. Not to measure OpenTelemetry overhead, not to compare Jaeger against Tempo, but to answer something more concrete: what signals do you lose when you only have good logs, and what shows up when you add a trace?

The repo is at github.com/JuanTorchia/opentelemetry-spring-boot-lab, commit c12ea4e848dc431c8bbd324318399172302fe053, tag editorial-final-diagnosis-comparison-v2.

The setup: a lab that produces evidence, not benchmarks

The stack is Spring Boot 3.5.7, Java 21, PostgreSQL 16, OpenTelemetry API 1.43.0, OpenTelemetry Java Agent 2.9.0, and Jaeger all-in-one. Everything starts with Docker Compose. To reproduce it:

# Quick smoke test with small dataset (1k tasks)
.\scripts\run-lab.ps1 -Mode smoke -Size small

# Full editorial run (50k tasks, 200 requests, warmup 20, concurrency 8)
.\scripts\run-lab.ps1 -Mode editorial -Size editorial -Runs 3 -Requests 200 -Warmup 20 -Concurrency 8

The runner starts Compose, downloads the agent into tools/, packages the jar, seeds Postgres with synthetic tables (organizations, users, projects, tasks, comments), runs the scenarios, queries Jaeger by traceId, and regenerates the reports in results/.

Jaeger was chosen for local simplicity: one image, web UI, REST API to query traces by traceId. Tempo is also valid, but needs more moving parts for a local editorial demo. This is not a production stack recommendation.
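For reference, the "one image" claim can be sketched as a minimal Compose service (ports and the OTLP env var are the upstream Jaeger defaults; the lab's actual compose file may differ):

```yaml
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - COLLECTOR_OTLP_ENABLED=true   # accept OTLP directly from the Java agent
    ports:
      - "16686:16686"   # web UI
      - "4317:4317"     # OTLP gRPC
      - "4318:4318"     # OTLP HTTP
```

The agent then only needs an OTLP endpoint pointing at localhost; no separate collector is required for a local demo.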

The editorial dataset has 50,000 tasks; the small dataset has 1,000. That difference matters: with the larger dataset the N+1 produces visible fan-out rather than a microsecond gap that disappears into noise.

The instrumentation decision I care about most

The pom.xml has opentelemetry-api as a compile dependency, but the agent arrives at runtime. That means HTTP server, HTTP client, and JDBC are instrumented automatically without touching business code.
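That split can be sketched as follows (the version comes from the stack listed above; the agent jar name under tools/ is illustrative, since the runner downloads it):

```xml
<!-- pom.xml: compile against the API only; no SDK, no exporter wiring -->
<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-api</artifactId>
  <version>1.43.0</version>
</dependency>

<!-- at runtime the agent does the heavy lifting:
     java -javaagent:tools/opentelemetry-javaagent.jar -jar target/app.jar -->
```

Business code sees only the stable API surface (Tracer, Span); the agent supplies the SDK, the exporters, and the HTTP/JDBC instrumentation.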

Manual spans are used only for business stages that the agent can't infer:

// LabService.java β€” manual span to mark business intent
Span span = tracer.spanBuilder("business.n_plus_one.load_tasks_then_comments").startSpan();
try (var ignored = span.makeCurrent()) {
    // first fetches tasks, then runs one query per task
    List<Map<String, Object>> tasks = jdbcTemplate.queryForList(
        "select t.id, t.title, u.display_name as assignee from tasks t "
        + "join users u on u.id = t.assignee_id order by t.id limit ?",
        limit);
    // (sketch) collect the enriched rows; the real code may build this differently
    List<Map<String, Object>> enriched = new ArrayList<>();
    for (Map<String, Object> task : tasks) {
        Long taskId = ((Number) task.get("id")).longValue();
        // this query repeats per task → fan-out
        Integer comments = jdbcTemplate.queryForObject(
            "select count(*) from comments where task_id = ?",
            Integer.class, taskId);
        Map<String, Object> row = new HashMap<>(task);
        row.put("comment_count", comments);
        enriched.add(row);
    }
    span.setAttribute("lab.n_plus_one.expected_extra_queries", enriched.size());
} finally {
    span.end();
}

That mix is more honest for the post: auto-instrumentation for infrastructure, manual spans to explain intent. If I had used only manual spans, the lab would require observability-specific code in every layer. If I had relied only on the agent, business spans would be invisible.

The logback-spring.xml injects traceId and spanId into every log line:

<!-- logback-spring.xml -->
<pattern>%d{yyyy-MM-dd'T'HH:mm:ss.SSSXXX} %-5level traceId=%X{trace_id:-none} spanId=%X{span_id:-none} %logger{36} - %msg%n</pattern>

That's what connects both worlds. A log with traceId lets you jump directly to the trace in Jaeger. Without it, logs and traces are islands.

The matrix that summarizes the diagnosis

| Scenario | p95 | Avg spans | Avg DB spans | Error spans/request | Defensible diagnosis |
| --- | --- | --- | --- | --- | --- |
| baseline | 55 ms | 3.04 | 1.04 | 0 | Healthy request, no weird story. |
| optimized | 59 ms | 3.04 | 1.04 | 0 | Same functional shape, no DB fan-out. |
| n-plus-one | 209 ms | 63.38 | 61.38 | 0 | DB fan-out visible inside one request. |
| downstream-slow | 374 ms | 4 | 0 | 0 | Time concentrates in downstream. |
| mixed | 395 ms | 7.57 | 1.57 | 0 | DB, downstream, and transformation compete. |
| partial-error | 184 ms | 6.27 | 1.27 | 3 | Downstream error inside a partial response. |

This table is not trying to crown a tool. It summarizes which signals are available for diagnosis. The strong point is not that one number is universal: it is that N+1 leaves a very different shape than the optimized case, and that shape does not appear in a flat log unless you enable SQL debug.

What the six scenarios reveal

The lab has six endpoints: baseline, n-plus-one, optimized, downstream-slow, mixed, and partial-error. Each produces different signals that the runner consolidates into results/comparison.md and results/diagnosis-comparison.md.

The finding I most want to defend:

N+1 vs optimized: both return the same response shape. The log for both says status=200. The difference lives in the trace: n-plus-one generates an average of 63.38 spans per request in the editorial run; optimized generates 3.04. That's not a universal performance claim β€” it's a diagnostic signal. With only logs and no SQL debug enabled, the difference is ambiguous. With the trace, DB fan-out is visible without extra configuration.
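The span-count gap is just the query-count gap made visible. A toy, dependency-free simulation of the two access patterns (not the lab's code; each fake query stands in for one JDBC span in the trace):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model: one call to query() = one JDBC span in the trace.
public class FanOutDemo {
    static AtomicInteger queries = new AtomicInteger();

    static List<Long> query(String sql) {      // counts "spans", returns fake rows
        queries.incrementAndGet();
        return List.of(1L, 2L, 3L);
    }

    static int nPlusOne(List<Long> taskIds) {
        query("select * from tasks");          // 1 query for the task list
        for (Long id : taskIds) {              // +1 query per task → fan-out
            query("select count(*) from comments where task_id = " + id);
        }
        return queries.get();
    }

    static int optimized(List<Long> taskIds) {
        // one aggregate join: constant query count regardless of task count
        query("select t.id, count(c.id) from tasks t "
            + "left join comments c on c.task_id = t.id group by t.id");
        return queries.get();
    }

    public static void main(String[] args) {
        List<Long> sixtyTasks = new ArrayList<>();
        for (long i = 0; i < 60; i++) sixtyTasks.add(i);

        queries.set(0);
        System.out.println("n+1 queries: " + nPlusOne(sixtyTasks));       // 61
        queries.set(0);
        System.out.println("optimized queries: " + optimized(sixtyTasks)); // 1
    }
}
```

Sixty tasks produce 61 "spans" in the first shape and 1 in the second, which is exactly the difference the trace renders and the flat log hides.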

Downstream-slow: p95 sits at 374 ms, very close to the configured 300 ms delay. Logs show total duration and traceId. What they don't show is where that time went: was it DB? was it the downstream? was it in-memory transformation? The trace separates it: the downstream HTTP client span dominates the hierarchy. The local DB appears as a secondary span with low duration.

Mixed: this is where flat logs fail the most. Three stages compete (DB, downstream, transformation) and none is obviously dominant. p95 reaches 395 ms. The trace shows the temporal distribution per stage. The log just says it was slow.

Partial-error: the endpoint responds with HTTP 206 (partial content). The log records traceId, status, and error type. The trace goes further: the downstream span is marked with error, nested under a request that technically responded. Logs and trace don't replace each other here β€” they complement. The log alerts and lets you correlate. The trace places the error in the causal hierarchy.
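The control flow behind that 206 can be sketched with plain Java (hypothetical names, no framework; the lab's real handler additionally marks the downstream span via the OpenTelemetry API, omitted here to keep the snippet dependency-free):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: assemble a partial response when the downstream fails.
public class PartialResponseDemo {
    record Result(int httpStatus, Map<String, Object> body) {}

    static Result handle(boolean downstreamOk) {
        Map<String, Object> body = new LinkedHashMap<>();
        body.put("tasks", List.of("t1", "t2"));      // local DB part always succeeds
        if (downstreamOk) {
            body.put("enrichment", "from-downstream");
            return new Result(200, body);
        }
        // downstream failed: respond anyway, but make the degradation visible
        body.put("enrichment_error", "downstream_unavailable");
        return new Result(206, body);
    }

    public static void main(String[] args) {
        System.out.println(handle(false).httpStatus()); // 206
        System.out.println(handle(true).httpStatus());  // 200
    }
}
```

The log line for the 206 carries the alert and the traceId; the trace carries the errored downstream span nested under the otherwise-successful request.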

The screenshot that changed the diagnosis

In Jaeger, n-plus-one does not look like a request that is merely a bit slower. It looks like a request with DB fan-out: many repeated spans under the same business operation.

Jaeger trace showing DB fan-out in the N+1 scenario

The optimized case, on the other hand, keeps a compact shape. I do not need to read the code to suspect that the problem in the previous case was not "Postgres is slow" in the abstract, but the shape of the queries.

Jaeger trace for the optimized scenario

The partial-error case matters for another reason: the request can respond, while the downstream span is marked as errored. That nuance is exactly where logs and traces complement each other: the log alerts, the trace locates.

Jaeger trace with partial downstream error marked

The honest limits of the metrics

The *_vs_root_pct fields in results/diagnosis-comparison.md are cumulative percentages of span durations exported by Jaeger. They can exceed 100% when there are nested spans, client/server pairs, or overlap. The duration_denominator_type field indicates what was used as the denominator: root_span, http_request_span, or largest_observed_span if the trace was ambiguous.
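A quick illustration, with made-up numbers, of why cumulative span time can exceed the root duration: a client/server pair for the same HTTP call contributes its duration twice, and nested spans overlap their parents.

```java
// Made-up durations: a 100 ms root containing an overlapping client/server
// pair (80 ms and 78 ms for the same call) plus a 30 ms JDBC span.
public class CumulativeDemo {
    static double cumulativePct(long rootMs, long[] childMs) {
        long sum = 0;
        for (long d : childMs) sum += d;       // spans are summed, not merged
        return 100.0 * sum / rootMs;
    }

    public static void main(String[] args) {
        double pct = cumulativePct(100, new long[] {80, 78, 30});
        System.out.println("cumulative child time vs root: " + pct + "%"); // 188.0%
    }
}
```

Nothing is wrong with the trace; the sum just double-counts overlapping wall-clock time, which is why these fields read as diagnostic signals rather than a time budget.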

These are not overhead numbers. They are not an exact distribution of real request time. They are cumulative diagnostic signals. Treating them like CPU percentages would be a misread that this lab doesn't try to encourage.

Similarly, diagnosis_confidence_* is an editorial classification coded in ScenarioDiagnosis.java, not an automatically measured metric. For N+1, diagnosisConfidenceLogs is low and diagnosisConfidenceTrace is high. That reflects the fact that without SQL debug, the log is ambiguous. It's not a universal benchmark of which tool is better.

My position: what I accept and what I don't buy

I accept that OpenTelemetry with the Java Agent is a reasonable way to add structural visibility to a Spring Boot 3 app without polluting business code. JDBC and HTTP client auto-instrumentation works well for common scenarios.

I don't buy the narrative that traces replace logs. The lab's RequestCompletionLoggingFilter is a Servlet filter that records every completed request with scenario, method, path, status, and duration. Those logs are operationally useful even when Jaeger is unavailable. The traceId in the log is the bridge, not the replacement.
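A dependency-free sketch of the log line such a filter might format (hypothetical helper; the real filter uses the Servlet API and reads the trace id that the agent puts into the MDC):

```java
public class CompletionLogDemo {
    // Formats one request-completion log line; in the real filter the
    // traceId comes from MDC, populated by the OpenTelemetry Java agent.
    static String logLine(String scenario, String method, String path,
                          int status, long durationMs, String traceId) {
        return String.format(
            "scenario=%s method=%s path=%s status=%d duration_ms=%d traceId=%s",
            scenario, method, path, status, durationMs, traceId);
    }

    public static void main(String[] args) {
        System.out.println(logLine("n-plus-one", "GET", "/lab/n-plus-one",
                200, 209, "4bf92f3577b34da6a3ce929d0e0e4736"));
    }
}
```

That line is useful on its own when Jaeger is down, and the trailing traceId is the bridge back to the trace when it isn't.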

I also don't buy that Jaeger is the only valid option. It was chosen because it starts with one image and has a ready web UI. Tempo, Zipkin, or any OTLP-compatible backend would solve the same problem in this context.

The honest trade-off is this: auto-instrumentation reduces accidental work but adds an agent on the classpath that exports data in the background. In a local lab that's trivial. In production, agent overhead depends on load, exporter configuration, and sampling. This lab doesn't measure that, and claiming otherwise would be misleading.

What to do with this

If you already have structured logs in production with traceId and spanId, the next step isn't replacing anything. It's adding a trace backend and connecting both worlds. The lab shows that Spring Boot 3 auto-instrumentation with the Java Agent is enough for common scenarios, and that manual spans only make sense when you want to name business intent that the agent can't infer.

If you're evaluating whether the effort is worth it: the case where it's most clearly justified isn't the healthy baseline. It's the mixed scenario or the N+1, where logs give you a number and the trace gives you a shape. The difference between guessing and diagnosing.

After this lab, my rule is simple: logs tell you what happened; traces help you understand how it happened. If the flat log only gives you total duration, you do not have an explanation yet. You have a clue.


