Henry Li

Posted on • Originally published at exesolution.com

You're Flying Blind: Adding LLM Observability to Spring AI with OpenTelemetry and Self-Hosted Langfuse

Your Spring Boot service returns 200 OK. Latency looks fine in Datadog. Users are complaining the answers are wrong and slow.

You open the logs. Nothing useful. You check your APM traces. HTTP span: 1.2 seconds. Business logic: 40ms. That leaves 1.16 seconds completely unaccounted for — inside the LLM call, where your standard tooling sees nothing.

This is the observability gap in every LLM-enabled Java service. Standard APM tools were not built to capture what actually matters: which prompt triggered which model, how many tokens it consumed, what it cost, whether the tool call chain stalled on the third retry, or which span in a multi-step RAG pipeline blew the latency budget.

This post walks through a runnable setup that closes that gap: Spring AI + OpenTelemetry + self-hosted Langfuse, fully containerized, no data leaving your infrastructure.

The full solution with source code, Docker Compose, and 11 execution screenshots is at exesolution.com. This post covers the core problem, the trace architecture, and the key configuration decisions.


What You Can't See Without LLM-Specific Tracing

Before getting into the setup, it's worth being specific about what's missing. Most teams discover these gaps the hard way:

Latency attribution — A request takes 3 seconds. Your APM shows the HTTP span. It doesn't show whether the latency came from the embedding call, the LLM completion, a tool invocation, or a retry on a transient 429. You can't fix what you can't locate.

Token and cost accumulation — In a chain with retrieval, reranking, a summarization step, and a final completion, tokens accumulate across multiple model calls. Without per-span token metadata, your cost reports are aggregates that tell you you're spending money but not where.

Prompt correlation — When a user reports a bad answer, you need to know the exact prompt that produced it, the model version, and the full context window. Without trace-level prompt capture, incident investigation is manual and slow.

Cross-service correlation — An upstream HTTP request triggers an async enrichment job that calls an LLM. Without W3C traceparent propagation through the LLM span, these two halves of the trace appear in separate, unrelated records.

Sensitive data control — You need observability, but you can't send prompt content to a third-party SaaS. Self-hosted tracing is the only viable path in regulated environments.
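Several of these gaps reduce to the same missing primitive: per-span metadata. As a sketch of what per-span token data buys you, here is the cost-attribution fold in plain Java — the span names and per-1k-token prices are made up for illustration, not taken from any real pricing sheet:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CostAttribution {
    // Hypothetical per-span usage record; in a real trace these numbers
    // come from span attributes like llm.prompt_tokens.
    record SpanUsage(String name, int promptTokens, int completionTokens) {}

    // Attribute cost per span, so reports show *where* tokens go,
    // not just the aggregate bill.
    static Map<String, Double> costPerSpan(List<SpanUsage> spans,
                                           double inPer1k, double outPer1k) {
        Map<String, Double> costs = new LinkedHashMap<>();
        for (SpanUsage s : spans) {
            double cost = s.promptTokens() / 1000.0 * inPer1k
                        + s.completionTokens() / 1000.0 * outPer1k;
            costs.merge(s.name(), cost, Double::sum);
        }
        return costs;
    }

    public static void main(String[] args) {
        var spans = List.of(
                new SpanUsage("embedding", 200, 0),
                new SpanUsage("rerank", 400, 50),
                new SpanUsage("completion", 312, 87));
        System.out.println(costPerSpan(spans, 0.15, 0.60));
    }
}
```

Without the per-span records, all you can compute is the sum — which is exactly the aggregate-only cost report described above.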


The Trace Architecture

The setup has four components in a single Docker Compose stack:

```plaintext
Spring Boot Application
 └─→ OpenTelemetry Java SDK (in-process)
      └─→ OTLP Exporter (HTTP/protobuf)
           └─→ Langfuse ingestion endpoint (:4318)
                ├─→ PostgreSQL (trace storage)
                └─→ Langfuse UI (trace inspection)
```

Spring AI generates the spans. When you call ChatClient, Spring AI wraps the model invocation in an OpenTelemetry span automatically. Tool calls, embedding calls, and retries each get child spans. You don't write instrumentation code.

The OTel SDK handles propagation and export. W3C trace context flows from inbound HTTP requests through business logic spans into LLM spans — all linked in one trace. The SDK batches spans and exports them via OTLP without blocking the application thread.
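For intuition, the W3C `traceparent` header that carries this context is just a dash-separated string: a version, a 32-hex-digit trace-id, a 16-hex-digit parent span-id, and trace flags. The OTel SDK builds and parses it for you; this toy version is purely illustrative:

```java
public class TraceParent {
    // Minimal W3C traceparent builder (version 00). Sampled flag is bit 0.
    static String format(String traceId, String spanId, boolean sampled) {
        return "00-" + traceId + "-" + spanId + (sampled ? "-01" : "-00");
    }

    // Extract the trace-id so both halves of a cross-service trace
    // can be joined on the same 32-hex identifier.
    static String traceIdOf(String traceparent) {
        String[] parts = traceparent.split("-");
        if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16)
            throw new IllegalArgumentException("malformed traceparent: " + traceparent);
        return parts[1];
    }

    public static void main(String[] args) {
        String h = format("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true);
        System.out.println(h);
        System.out.println(traceIdOf(h));
    }
}
```

The async-enrichment gap from earlier is exactly what happens when this header is not forwarded: the downstream LLM span gets a fresh trace-id and the two halves never join.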

Langfuse receives and stores everything. It's the same Langfuse you may know from the Python world, but self-hosted: PostgreSQL for persistence, its own ingestion API on port 4318, and a UI for trace inspection, filtering by model/cost/latency, and prompt review.

The key architectural decision: Langfuse runs on your infrastructure. Prompts, responses, and token metadata never leave your network. This matters for compliance and is non-negotiable in many enterprise contexts.


What Each Span Carries

Once running, every ChatClient call produces a span with these attributes visible in the Langfuse UI:

```plaintext
llm.model             → "gpt-4o-mini"
llm.prompt_tokens     → 312
llm.completion_tokens → 87
llm.total_tokens      → 399
llm.latency_ms        → 1143
error.type            → (present only on failure)
```

Nested under the LLM span: tool call spans (if your ChatClient uses tools), each with their own latency and result status. Nested under those: any downstream spans from calls the tool makes.

The Langfuse UI groups these into a flame graph per trace. You can filter by model, sort by token count, drill into a specific prompt, or search for traces where error.type is set.


Configuration

Three environment variable blocks wire the stack together.

Spring Boot application:

```bash
SPRING_PROFILES_ACTIVE=otel
SPRING_AI_OPENAI_API_KEY=sk-...
```

OpenTelemetry Java SDK:

```bash
OTEL_SERVICE_NAME=spring-ai-llm-service
OTEL_TRACES_EXPORTER=otlp
OTEL_EXPORTER_OTLP_ENDPOINT=http://langfuse:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf

# Sample 20% of traces in normal operation.
# Note: the ratio applies at trace start (head-based sampling).
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.2

OTEL_RESOURCE_ATTRIBUTES=deployment.environment=local
```

Langfuse (self-hosted):

```bash
LANGFUSE_PUBLIC_KEY=lf_pk_****
LANGFUSE_SECRET_KEY=lf_sk_****
DATABASE_URL=postgresql://langfuse:langfuse@postgres:5432/langfuse
```

The sampling configuration deserves a note. parentbased_traceidratio at 0.2 means 20% of traces are sampled — enough for operational visibility without the storage overhead of 100% capture. One caveat: this is head-based sampling, so the keep/drop decision is made when the trace starts, before anyone knows whether it will fail. If you need every error trace captured regardless of the ratio, route spans through an OpenTelemetry Collector with tail-based sampling. For debugging sessions, bump the ratio to 1.0 and restart — it's an environment variable, so no code change is needed.
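The ratio sampler is deterministic per trace-id, which is why a parent's decision and a child's decision agree. A simplified sketch of the idea — the real OTel SDK sampler differs in detail, this is not its exact algorithm:

```java
public class RatioSampler {
    // Simplified trace-id ratio sampling: derive a number from the
    // trace id and compare it to ratio * MAX. Deterministic, so every
    // service looking at the same trace id reaches the same decision.
    static boolean shouldSample(String traceIdHex, double ratio) {
        // Use the low 8 bytes (last 16 hex chars) of the 32-hex trace id.
        long bits = Long.parseUnsignedLong(traceIdHex.substring(16), 16);
        long bound = (long) (ratio * Long.MAX_VALUE);
        // Math.abs(Long.MIN_VALUE) edge case is ignored in this sketch.
        return Math.abs(bits) < bound;
    }

    public static void main(String[] args) {
        String id = "4bf92f3577b34da6a3ce929d0e0e4736";
        System.out.println(shouldSample(id, 0.2));
        System.out.println(shouldSample(id, 1.0));
    }
}
```

This is also why "bump the ratio and restart" is safe: changing the ratio only moves the bound; it never re-randomizes past decisions.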


Running It

```bash
docker compose pull
docker compose up -d
```

Startup order is managed by Compose health checks: PostgreSQL first, then Langfuse services, then the Spring Boot application. No manual sequencing needed.
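The ordering is expressed with `depends_on` conditions against each service's health check. A minimal sketch — service names and image tags here are illustrative and may differ from the actual compose file in the solution:

```yaml
services:
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U langfuse"]
      interval: 5s
      retries: 10

  langfuse:
    image: langfuse/langfuse:2
    depends_on:
      postgres:
        condition: service_healthy   # wait for a passing health check, not just start

  app:
    build: .
    depends_on:
      langfuse:
        condition: service_started
```

`condition: service_healthy` is the piece that makes "no manual sequencing" true — plain `depends_on` only orders container starts, not readiness.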

Verify the stack is up:

```bash
# Spring Boot health
curl -s http://localhost:8080/actuator/health | jq .status
# → "UP"

# Langfuse UI
open http://localhost:3000
# → log in with credentials from your .env
```

Trigger a trace:

```bash
curl -s -X POST http://localhost:8080/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Summarize the quarterly report"}' \
  | jq .
```

What to look for in the Langfuse UI:

Open the Traces view. You should see an entry for spring-ai-llm-service. Expand it — you'll see an HTTP span at the root, a business logic span below it, and an LLM invocation span as a child of that. Click the LLM span: model name, token counts, and latency are in the attributes panel on the right.

If you called any tools, each tool call appears as a child span of the LLM span, with its own duration and result status.


Prompt and Response Redaction

By default, prompt and response content is captured in span attributes. For environments where this isn't acceptable, two options:

Metadata-only mode — Disable payload capture entirely. Token counts and latency are retained; prompt and response content are not recorded. One configuration flag, no code change.

Partial redaction — Apply regex-based masking in the OTEL instrumentation layer before spans are exported. PII patterns (emails, phone numbers, account numbers) are replaced with [REDACTED] in the span attributes. The LLM still receives the full content; only the observability record is masked.

Both modes are configured in application.yml with the otel Spring profile. The full configuration is in the solution at exesolution.com.
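To make the partial-redaction option concrete, here is a minimal sketch of the regex-masking idea, applied to a span attribute value before export. The class name and patterns are illustrative, not the solution's actual implementation — real deployments tune patterns to their own data:

```java
import java.util.regex.Pattern;

public class PiiRedactor {
    // Hypothetical masking patterns for email addresses and phone numbers.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    private static final Pattern PHONE =
            Pattern.compile("\\+?\\d[\\d\\s-]{7,}\\d");

    // Mask PII in the observability record; the LLM call itself is untouched.
    static String redact(String attributeValue) {
        String masked = EMAIL.matcher(attributeValue).replaceAll("[REDACTED]");
        return PHONE.matcher(masked).replaceAll("[REDACTED]");
    }

    public static void main(String[] args) {
        System.out.println(redact("Contact jane.doe@example.com or +1 555 123 4567"));
    }
}
```

The key property, as the post notes, is that masking happens in the export path only — the model still sees the full prompt, while the stored span never does.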


Operational Notes

If Langfuse goes down: The OTEL batch processor drops spans after the queue saturates. Application traffic is completely unaffected — tracing degrades gracefully. No circuit breaker needed.

Disabling tracing without a redeploy:

```bash
OTEL_TRACES_EXPORTER=none
```

Restart the application container with this variable set. Tracing stops; everything else continues normally.

Non-OpenAI providers: The instrumentation is provider-agnostic. It works with any Spring AI ChatModel implementation — Anthropic, Azure OpenAI, Ollama, Mistral. The span attributes are populated by Spring AI's abstraction layer, not by provider-specific code.

Kubernetes: The same OTEL and Langfuse configuration applies. Docker Compose is provided for local and CI reproducibility; the Kubernetes equivalent is straightforward — deploy Langfuse as a Helm chart and point OTEL_EXPORTER_OTLP_ENDPOINT at the service.


What's in the Full Solution

The verified solution at exesolution.com includes everything to run this from scratch:

  • Complete Spring Boot project with otel profile, OTel dependencies, and ChatClient wiring
  • Full Docker Compose stack: Spring Boot app + Langfuse (web + worker) + PostgreSQL
  • application.yml with sampling, batching, and redaction configuration
  • 11 evidence screenshots: Docker Compose build, running containers, chat API test, Langfuse dashboard, and five trace detail views showing nested spans with token and latency data
  • Verification checklist: services running, traces visible, sampling confirmed, redaction verified

👉 Full solution + runnable code + evidence at exesolution.com

Free registration required to access the code bundle and evidence images.


The Practical Case for Self-Hosted

The cloud-hosted Langfuse option is fine for many projects. But if you're in financial services, healthcare, or any context with data residency requirements, sending prompt content to a third-party SaaS is a non-starter. Self-hosted Langfuse on Docker Compose or Kubernetes gives you the same UI and the same trace schema — the only difference is the data never leaves your network.

The setup in this solution takes about 15 minutes from git clone to first trace in the UI. That's a reasonable investment for closing the observability gap that every LLM service eventually hits.


Questions about the OTel configuration or the Langfuse setup? Leave a comment below.
