Henry Li

Posted on • Originally published at exesolution.com

You're Flying Blind: Adding LLM Observability to Spring AI with OpenTelemetry and Self-Hosted Langfuse

Your Spring Boot service returns 200 OK. Latency looks fine in Datadog. Users are complaining the answers are wrong and slow.

You open the logs. Nothing useful. You check your APM traces. HTTP span: 1.2 seconds. Business logic: 40ms. That leaves 1.16 seconds completely unaccounted for — inside the LLM call, where your standard tooling sees nothing.

This is the observability gap in every LLM-enabled Java service. Standard APM tools were not built to capture what actually matters: which prompt triggered which model, how many tokens it consumed, what it cost, whether the tool call chain stalled on the third retry, or which span in a multi-step RAG pipeline blew the latency budget.

This post walks through a runnable setup that closes that gap: Spring AI + OpenTelemetry + self-hosted Langfuse, fully containerized, no data leaving your infrastructure.

The full solution with source code, Docker Compose, and 11 execution screenshots is at exesolution.com. This post covers the core problem, the trace architecture, and the key configuration decisions.


What You Can't See Without LLM-Specific Tracing

Before getting into the setup, it's worth being specific about what's missing. Most teams discover these gaps the hard way:

Latency attribution — A request takes 3 seconds. Your APM shows the HTTP span. It doesn't show whether the latency came from the embedding call, the LLM completion, a tool invocation, or a retry on a transient 429. You can't fix what you can't locate.

Token and cost accumulation — In a chain with retrieval, reranking, a summarization step, and a final completion, tokens accumulate across multiple model calls. Without per-span token metadata, your cost reports are aggregates that tell you you're spending money but not where.

Prompt correlation — When a user reports a bad answer, you need to know the exact prompt that produced it, the model version, and the full context window. Without trace-level prompt capture, incident investigation is manual and slow.

Cross-service correlation — An upstream HTTP request triggers an async enrichment job that calls an LLM. Without W3C traceparent propagation through the LLM span, these two halves of the trace appear in separate, unrelated records.

Sensitive data control — You need observability, but you can't send prompt content to a third-party SaaS. Self-hosted tracing is the only viable path in regulated environments.
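Several of these gaps reduce to the same missing primitive: per-span metadata. As a sketch of what per-span token data buys you, here is the cost-attribution fold in plain Java — the span names and per-1k-token prices are made up for illustration, not taken from any real pricing sheet:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CostAttribution {
    // Hypothetical per-span usage record; in a real trace these numbers
    // come from span attributes like llm.prompt_tokens.
    record SpanUsage(String name, int promptTokens, int completionTokens) {}

    // Attribute cost per span, so reports show *where* tokens go,
    // not just the aggregate bill.
    static Map<String, Double> costPerSpan(List<SpanUsage> spans,
                                           double inPer1k, double outPer1k) {
        Map<String, Double> costs = new LinkedHashMap<>();
        for (SpanUsage s : spans) {
            double cost = s.promptTokens() / 1000.0 * inPer1k
                        + s.completionTokens() / 1000.0 * outPer1k;
            costs.merge(s.name(), cost, Double::sum);
        }
        return costs;
    }

    public static void main(String[] args) {
        var spans = List.of(
                new SpanUsage("embedding", 200, 0),
                new SpanUsage("rerank", 400, 50),
                new SpanUsage("completion", 312, 87));
        System.out.println(costPerSpan(spans, 0.15, 0.60));
    }
}
```

Without the per-span records, all you can compute is the sum — which is exactly the aggregate-only cost report described above.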


The Trace Architecture

The setup has four components in a single Docker Compose stack:

```plaintext
Spring Boot Application
 └─→ OpenTelemetry Java SDK (in-process)
      └─→ OTLP Exporter (HTTP/protobuf)
           └─→ Langfuse ingestion endpoint (:4318)
                ├─→ PostgreSQL (trace storage)
                └─→ Langfuse UI (trace inspection)
```

Spring AI generates the spans. When you call ChatClient, Spring AI wraps the model invocation in an OpenTelemetry span automatically. Tool calls, embedding calls, and retries each get child spans. You don't write instrumentation code.

The OTel SDK handles propagation and export. W3C trace context flows from inbound HTTP requests through business logic spans into LLM spans — all linked in one trace. The SDK batches spans and exports them via OTLP without blocking the application thread.
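For intuition, the W3C `traceparent` header that carries this context is just a dash-separated string: a version, a 32-hex-digit trace-id, a 16-hex-digit parent span-id, and trace flags. The OTel SDK builds and parses it for you; this toy version is purely illustrative:

```java
public class TraceParent {
    // Minimal W3C traceparent builder (version 00). Sampled flag is bit 0.
    static String format(String traceId, String spanId, boolean sampled) {
        return "00-" + traceId + "-" + spanId + (sampled ? "-01" : "-00");
    }

    // Extract the trace-id so both halves of a cross-service trace
    // can be joined on the same 32-hex identifier.
    static String traceIdOf(String traceparent) {
        String[] parts = traceparent.split("-");
        if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16)
            throw new IllegalArgumentException("malformed traceparent: " + traceparent);
        return parts[1];
    }

    public static void main(String[] args) {
        String h = format("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true);
        System.out.println(h);
        System.out.println(traceIdOf(h));
    }
}
```

The async-enrichment gap from earlier is exactly what happens when this header is not forwarded: the downstream LLM span gets a fresh trace-id and the two halves never join.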

Langfuse receives and stores everything. It's the same Langfuse you may know from the Python world, but self-hosted: PostgreSQL for persistence, its own ingestion API on port 4318, and a UI for trace inspection, filtering by model/cost/latency, and prompt review.

The key architectural decision: Langfuse runs on your infrastructure. Prompts, responses, and token metadata never leave your network. This matters for compliance and is non-negotiable in many enterprise contexts.


What Each Span Carries

Once running, every ChatClient call produces a span with these attributes visible in the Langfuse UI:

```plaintext
llm.model             → "gpt-4o-mini"
llm.prompt_tokens     → 312
llm.completion_tokens → 87
llm.total_tokens      → 399
llm.latency_ms        → 1143
error.type            → (present only on failure)
```

Nested under the LLM span: tool call spans (if your ChatClient uses tools), each with their own latency and result status. Nested under those: any downstream spans from calls the tool makes.

The Langfuse UI groups these into a flame graph per trace. You can filter by model, sort by token count, drill into a specific prompt, or search for traces where error.type is set.


Configuration

Three environment variable blocks wire the stack together.

Spring Boot application:

```bash
SPRING_PROFILES_ACTIVE=otel
SPRING_AI_OPENAI_API_KEY=sk-...
```

OpenTelemetry Java SDK:

```bash
OTEL_SERVICE_NAME=spring-ai-llm-service
OTEL_TRACES_EXPORTER=otlp
OTEL_EXPORTER_OTLP_ENDPOINT=http://langfuse:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf

# Sample 20% of traces in normal operation.
# Note: the ratio applies at trace start (head-based sampling).
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.2

OTEL_RESOURCE_ATTRIBUTES=deployment.environment=local
```

Langfuse (self-hosted):

```bash
LANGFUSE_PUBLIC_KEY=lf_pk_****
LANGFUSE_SECRET_KEY=lf_sk_****
DATABASE_URL=postgresql://langfuse:langfuse@postgres:5432/langfuse
```

The sampling configuration deserves a note. parentbased_traceidratio at 0.2 means 20% of traces are sampled — enough for operational visibility without the storage overhead of 100% capture. One caveat: this is head-based sampling, so the keep/drop decision is made when the trace starts, before anyone knows whether it will fail. If you need every error trace captured regardless of the ratio, route spans through an OpenTelemetry Collector with tail-based sampling. For debugging sessions, bump the ratio to 1.0 and restart — it's an environment variable, so no code change is needed.
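The ratio sampler is deterministic per trace-id, which is why a parent's decision and a child's decision agree. A simplified sketch of the idea — the real OTel SDK sampler differs in detail, this is not its exact algorithm:

```java
public class RatioSampler {
    // Simplified trace-id ratio sampling: derive a number from the
    // trace id and compare it to ratio * MAX. Deterministic, so every
    // service looking at the same trace id reaches the same decision.
    static boolean shouldSample(String traceIdHex, double ratio) {
        // Use the low 8 bytes (last 16 hex chars) of the 32-hex trace id.
        long bits = Long.parseUnsignedLong(traceIdHex.substring(16), 16);
        long bound = (long) (ratio * Long.MAX_VALUE);
        // Math.abs(Long.MIN_VALUE) edge case is ignored in this sketch.
        return Math.abs(bits) < bound;
    }

    public static void main(String[] args) {
        String id = "4bf92f3577b34da6a3ce929d0e0e4736";
        System.out.println(shouldSample(id, 0.2));
        System.out.println(shouldSample(id, 1.0));
    }
}
```

This is also why "bump the ratio and restart" is safe: changing the ratio only moves the bound; it never re-randomizes past decisions.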


Running It

```bash
docker compose pull
docker compose up -d
```

Startup order is managed by Compose health checks: PostgreSQL first, then Langfuse services, then the Spring Boot application. No manual sequencing needed.
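The ordering is expressed with `depends_on` conditions against each service's health check. A minimal sketch — service names and image tags here are illustrative and may differ from the actual compose file in the solution:

```yaml
services:
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U langfuse"]
      interval: 5s
      retries: 10

  langfuse:
    image: langfuse/langfuse:2
    depends_on:
      postgres:
        condition: service_healthy   # wait for a passing health check, not just start

  app:
    build: .
    depends_on:
      langfuse:
        condition: service_started
```

`condition: service_healthy` is the piece that makes "no manual sequencing" true — plain `depends_on` only orders container starts, not readiness.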

Verify the stack is up:

```bash
# Spring Boot health
curl -s http://localhost:8080/actuator/health | jq .status
# → "UP"

# Langfuse UI
open http://localhost:3000
# → log in with credentials from your .env
```

Trigger a trace:

```bash
curl -s -X POST http://localhost:8080/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Summarize the quarterly report"}' \
  | jq .
```

What to look for in the Langfuse UI:

Open the Traces view. You should see an entry for spring-ai-llm-service. Expand it — you'll see an HTTP span at the root, a business logic span below it, and an LLM invocation span as a child of that. Click the LLM span: model name, token counts, and latency are in the attributes panel on the right.

If you called any tools, each tool call appears as a child span of the LLM span, with its own duration and result status.


Prompt and Response Redaction

By default, prompt and response content is captured in span attributes. For environments where this isn't acceptable, two options:

Metadata-only mode — Disable payload capture entirely. Token counts and latency are retained; prompt and response content are not recorded. One configuration flag, no code change.

Partial redaction — Apply regex-based masking in the OTEL instrumentation layer before spans are exported. PII patterns (emails, phone numbers, account numbers) are replaced with [REDACTED] in the span attributes. The LLM still receives the full content; only the observability record is masked.

Both modes are configured in application.yml with the otel Spring profile. The full configuration is in the solution at exesolution.com.
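To make the partial-redaction option concrete, here is a minimal sketch of the regex-masking idea, applied to a span attribute value before export. The class name and patterns are illustrative, not the solution's actual implementation — real deployments tune patterns to their own data:

```java
import java.util.regex.Pattern;

public class PiiRedactor {
    // Hypothetical masking patterns for email addresses and phone numbers.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    private static final Pattern PHONE =
            Pattern.compile("\\+?\\d[\\d\\s-]{7,}\\d");

    // Mask PII in the observability record; the LLM call itself is untouched.
    static String redact(String attributeValue) {
        String masked = EMAIL.matcher(attributeValue).replaceAll("[REDACTED]");
        return PHONE.matcher(masked).replaceAll("[REDACTED]");
    }

    public static void main(String[] args) {
        System.out.println(redact("Contact jane.doe@example.com or +1 555 123 4567"));
    }
}
```

The key property, as the post notes, is that masking happens in the export path only — the model still sees the full prompt, while the stored span never does.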


Operational Notes

If Langfuse goes down: The OTEL batch processor drops spans after the queue saturates. Application traffic is completely unaffected — tracing degrades gracefully. No circuit breaker needed.

Disabling tracing without a redeploy:

```bash
OTEL_TRACES_EXPORTER=none
```

Restart the application container with this variable set. Tracing stops; everything else continues normally.

Non-OpenAI providers: The instrumentation is provider-agnostic. It works with any Spring AI ChatModel implementation — Anthropic, Azure OpenAI, Ollama, Mistral. The span attributes are populated by Spring AI's abstraction layer, not by provider-specific code.

Kubernetes: The same OTEL and Langfuse configuration applies. Docker Compose is provided for local and CI reproducibility; the Kubernetes equivalent is straightforward — deploy Langfuse as a Helm chart and point OTEL_EXPORTER_OTLP_ENDPOINT at the service.


What's in the Full Solution

The verified solution at exesolution.com includes everything to run this from scratch:

  • Complete Spring Boot project with otel profile, OTel dependencies, and ChatClient wiring
  • Full Docker Compose stack: Spring Boot app + Langfuse (web + worker) + PostgreSQL
  • application.yml with sampling, batching, and redaction configuration
  • 11 evidence screenshots: Docker Compose build, running containers, chat API test, Langfuse dashboard, and five trace detail views showing nested spans with token and latency data
  • Verification checklist: services running, traces visible, sampling confirmed, redaction verified

👉 Full solution + runnable code + evidence at exesolution.com

Free registration required to access the code bundle and evidence images.


The Practical Case for Self-Hosted

The cloud-hosted Langfuse option is fine for many projects. But if you're in financial services, healthcare, or any context with data residency requirements, sending prompt content to a third-party SaaS is a non-starter. Self-hosted Langfuse on Docker Compose or Kubernetes gives you the same UI and the same trace schema — the only difference is the data never leaves your network.

The setup in this solution takes about 15 minutes from git clone to first trace in the UI. That's a reasonable investment for closing the observability gap that every LLM service eventually hits.


Questions about the OTel configuration or the Langfuse setup? Leave a comment below.
