DevHelm

Posted on Jun 8 • Originally published at devhelm.io

LLM Observability: What Breaks in Production and How to Instrument It

#ai #guides #infrastructure

Traditional Application Performance Monitoring (APM) tracks latency, error rate, and throughput. For a REST API backed by a PostgreSQL database, that's enough — the system is deterministic, the failure modes are well-understood, and a p99 latency spike has a finite set of causes.

LLM applications break this model. The same prompt can produce different outputs on consecutive calls. Latency varies by an order of magnitude depending on output length. A "successful" response (HTTP 200, valid JSON) can contain hallucinated facts, toxic content, or instructions that contradict your system prompt. The error rate metric that anchors traditional monitoring becomes a lagging indicator at best, and misleading at worst.

LLM observability is the practice of instrumenting LLM applications to capture the signals that actually predict production failures — not just availability and latency, but token economics, output quality, and the behavioral boundaries that keep autonomous agents from going off the rails.

The five signals that matter

Traditional APM gives you three signals: latency, error rate, and throughput (the RED method). LLM applications need five.

1. Latency — decomposed

A single LLM call has three latency components: time to first token (TTFT), inter-token latency (the streaming speed), and total completion time. TTFT matters for user-facing chat applications where perceived responsiveness depends on how fast the first word appears. Total completion time matters for batch pipelines and agent tool calls where you're waiting for the full response before acting.

A p99 latency of 8 seconds is fine for a batch summarization job and catastrophic for a chat interface. Report both TTFT and total time as separate metrics, broken down by model and provider.
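As a rough illustration of that decomposition, the sketch below times any iterable of streamed chunks. The names `StreamTiming` and `time_stream` are hypothetical, and the injectable `clock` parameter is an assumption made so the function can be tested without a real provider.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamTiming:
    ttft_s: Optional[float]              # time to first token, None if the stream was empty
    mean_inter_token_s: Optional[float]  # average gap between consecutive chunks
    total_s: float                       # request start to end of stream
    chunks: int


def time_stream(stream: Iterable[str],
                clock: Callable[[], float] = time.perf_counter) -> StreamTiming:
    """Consume a streamed response and record the three latency components."""
    start = clock()
    first = last = None
    count = 0
    for _ in stream:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    end = clock()
    ttft = None if first is None else first - start
    inter = (last - first) / (count - 1) if count > 1 else None
    return StreamTiming(ttft, inter, end - start, count)
```

In practice `stream` would be the chunk iterator returned by your provider SDK; each field can then be reported as its own metric, tagged with model and provider.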

2. Token usage and cost

Every LLM call has a dollar cost determined by input tokens (your prompt) and output tokens (the model's response). A prompt injection that causes the model to produce maximum-length output can dramatically inflate your cost per request. A retrieval-augmented generation (RAG) pipeline that stuffs too much context into the prompt burns input tokens without improving quality.

Track input_tokens, output_tokens, and total_cost_usd per request. Aggregate by model, endpoint, and user. Set alerts on cost-per-minute — a runaway agent loop or a prompt injection attack shows up as a cost spike before it shows up in error rates.
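A minimal sketch of per-request cost and a cost-per-minute alert might look like the following. The price table, the model name `example-model`, and the class `CostPerMinuteAlert` are all placeholders for illustration; substitute your provider's actual rates.

```python
from collections import deque
from typing import Deque, Dict, Tuple

# Hypothetical prices in USD per 1M tokens: (input, output).
PRICES_PER_MTOK: Dict[str, Tuple[float, float]] = {"example-model": (2.50, 10.00)}


def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call from its token counts."""
    in_price, out_price = PRICES_PER_MTOK[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


class CostPerMinuteAlert:
    """Sliding-window sum of request costs, compared against a threshold."""

    def __init__(self, threshold_usd: float, window_s: float = 60.0) -> None:
        self.threshold_usd = threshold_usd
        self.window_s = window_s
        self._events: Deque[Tuple[float, float]] = deque()  # (timestamp, cost)
        self._total = 0.0

    def record(self, timestamp: float, cost_usd: float) -> bool:
        """Add one request; return True if spend inside the window exceeds the threshold."""
        self._events.append((timestamp, cost_usd))
        self._total += cost_usd
        # Drop requests that have aged out of the window.
        while self._events and self._events[0][0] <= timestamp - self.window_s:
            _, old_cost = self._events.popleft()
            self._total -= old_cost
        return self._total > self.threshold_usd
```

Keeping one alert instance per model, endpoint, or user gives the aggregation dimensions described above.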

3. Error rate — expanded

HTTP-level errors (429 rate limits, 500 server errors, timeouts) are the obvious failures. But LLM apps have two additional error classes:

Structured output failures. You asked for JSON with a specific schema; the model returned something that doesn't parse, or that parses but doesn't match the schema. Either way it arrives as a 200 response, so it is invisible to traditional monitoring.
Guardrail violations. The model produced content that your safety filters reject. The LLM call "succeeded" from the API's perspective, but your application refused to serve the result.

Track each class separately. An aggregate error rate that mixes "OpenAI returned 429" with "output failed schema validation" obscures the root cause.
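One way to keep the classes apart is to classify every call into exactly one outcome before incrementing a counter. The sketch below assumes a simple "required keys" check standing in for real schema validation, and a caller-supplied `guardrail` predicate; both are illustrative, as are the names `LLMOutcome` and `classify`.

```python
import json
from enum import Enum
from typing import Callable, Iterable, Optional


class LLMOutcome(Enum):
    OK = "ok"
    HTTP_ERROR = "http_error"
    STRUCTURED_OUTPUT_FAILURE = "structured_output_failure"
    GUARDRAIL_VIOLATION = "guardrail_violation"


def classify(status: Optional[int],
             body: str,
             required_keys: Iterable[str],
             guardrail: Optional[Callable[[dict], bool]] = None) -> LLMOutcome:
    """Map one LLM call to a single error class. status=None stands for a timeout."""
    if status is None or not 200 <= status < 300:
        return LLMOutcome.HTTP_ERROR
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return LLMOutcome.STRUCTURED_OUTPUT_FAILURE
    # Valid JSON that misses the expected shape is still a structured output failure.
    if not isinstance(payload, dict) or any(k not in payload for k in required_keys):
        return LLMOutcome.STRUCTURED_OUTPUT_FAILURE
    # The guardrail returns True when the content is acceptable.
    if guardrail is not None and not guardrail(payload):
        return LLMOutcome.GUARDRAIL_VIOLATION
    return LLMOutcome.OK
```

Emitting `outcome.value` as a metric label yields one time series per class rather than a single blended error rate.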

4. Output quality indicators

This is the signal that has no equivalent in traditional APM. A deterministic API either returns the correct result or an error. An LLM can return a response that is syntactically valid, structurally correct, and factually wrong.

Exhaustive quality evaluation (checking every response against ground truth) is too expensive for production. Instead, track proxy indicators:

Finish reason. stop means the model completed naturally. length means it hit the token limit, so the response is truncated. content_filter means the safety system intervened. Track the distribution of finish reasons; a spike in length means responses are being cut off at the output token cap (or, less often, at the context window), so either raise the limit or tighten the prompt.
Latent feedback loops. User actions that correlate with output quality — retry rate, edit rate after accepting a suggestion, time spent reading before acting. These are application-specific but often the best quality signal available.
Semantic similarity to expected output. For tasks with reference answers (RAG, summarization), compute embedding cosine similarity between the model output and the expected result. Track it as a metric and alert on distribution shifts.
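Two of these proxies are cheap to compute. The sketch below shows a finish-reason share calculation and a plain cosine similarity; it assumes the embeddings are already available as lists of floats (how you obtain them is outside this example), and the function names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Sequence


def finish_reason_shares(reasons: Iterable[str]) -> Dict[str, float]:
    """Fraction of responses per finish reason over some window."""
    counts = Counter(reasons)
    total = sum(counts.values())
    return {reason: n / total for reason, n in counts.items()} if total else {}


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Recording the similarity score as a histogram, rather than a single average, makes later distribution-shift checks possible.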

5. Cost circuit breakers

Agent systems that loop — calling tools, reasoning about results, calling more tools — can accumulate unbounded costs. A coding agent that misinterprets an error and retries the same failing approach 50 times burns tokens without making progress.

Track cumulative cost per session and per user. Set hard limits: if a single agent session exceeds your cost threshold, terminate it. This is not just a business concern — it's a safety boundary that prevents a single malformed input from draining your API budget.
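A hard limit of this kind can be sketched as a small budget object that the agent loop charges after every model or tool call. `SessionBudget` and `BudgetExceeded` are hypothetical names, and the step cap is an added assumption that guards against loops of cheap calls.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent session crosses its cost or step limit."""


class SessionBudget:
    """Cumulative cost tracker for one agent session."""

    def __init__(self, limit_usd: float, max_steps: int) -> None:
        self.limit_usd = limit_usd
        self.max_steps = max_steps
        self.spent_usd = 0.0
        self.steps = 0

    def charge(self, cost_usd: float) -> None:
        """Record one call; raise if the session is now over either limit."""
        self.spent_usd += cost_usd
        self.steps += 1
        if self.spent_usd > self.limit_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.4f}, limit ${self.limit_usd:.4f}")
        if self.steps > self.max_steps:
            raise BudgetExceeded(
                f"{self.steps} steps, limit {self.max_steps}")
```

The agent loop would catch `BudgetExceeded` at the top level, terminate the session, and emit an event so the trip shows up on a dashboard.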

Why traditional monitoring isn't enough

The fundamental problem is non-determinism. Traditional monitoring assumes that the same input produces the same output, so you can reason about system behavior from aggregate metrics. LLM applications violate this assumption at every layer:

Prompt sensitivity. Adding a single word to a prompt can change the model's behavior from helpful to harmful. There's no equivalent in traditional systems — adding a query parameter to a REST endpoint doesn't randomly change the response schema.
Model drift. When OpenAI updates gpt-4o behind the scenes (same model name, different weights), your application's behavior changes without any deployment on your side. The gen_ai.request.model and gen_ai.response.model attributes can differ — and the gap is worth monitoring.
Context window economics. A 128k context window doesn't mean you should use all of it. Output quality tends to degrade and cost rises as you approach the limit. Traditional APM has no concept of "this request used 87% of its available input capacity."
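The last two items lend themselves to simple derived signals. The sketch below flags when the resolved model behind an alias changes between calls and computes context utilization; the function names and the in-memory `seen` dictionary are assumptions for illustration, not part of any library.

```python
from typing import Dict


def resolved_model_changed(seen: Dict[str, str],
                           request_model: str,
                           response_model: str) -> bool:
    """Return True when the model actually serving an alias differs from the last call.

    `seen` maps each requested model name to the response model last observed for it.
    """
    previous = seen.get(request_model)
    seen[request_model] = response_model
    return previous is not None and previous != response_model


def context_utilization(input_tokens: int, context_window: int) -> float:
    """Fraction of the context window consumed by the prompt."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    return input_tokens / context_window
```

A change flagged by `resolved_model_changed` is a natural annotation to place on quality and latency dashboards, since it marks a behavior change you did not deploy.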

Instrumenting with OpenTelemetry GenAI conventions

The OpenTelemetry GenAI semantic conventions define a standard schema for LLM telemetry. As of v1.40.0 (February 2026), the gen_ai.* namespace is experimental but already adopted by the major instrumentation libraries.

Every LLM call becomes a span with a standardized name: {operation} {model}. A chat completion to GPT-4o produces a span named chat gpt-4o. The key attributes:

Span: chat gpt-4o
Kind: CLIENT
Attributes:
  gen_ai.operation.name:          "chat"
  gen_ai.provider.name:           "openai"
  gen_ai.request.model:           "gpt-4o"
  gen_ai.response.model:          "gpt-4o-2024-11-20"
  gen_ai.usage.input_tokens:      1842
  gen_ai.usage.output_tokens:     326
  gen_ai.response.finish_reasons: ["stop"]
  gen_ai.request.temperature:     0.7
  server.address:                 "api.openai.com"
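If you are instrumenting by hand rather than through an auto-instrumentation library, the span name and attributes can be assembled as plain data first. The helper below is a hypothetical sketch using only the standard library; `genai_chat_span` is not an OpenTelemetry function, and the attribute keys simply mirror the listing above.

```python
from typing import Any, Dict, List, Tuple


def genai_chat_span(provider: str,
                    request_model: str,
                    response_model: str,
                    input_tokens: int,
                    output_tokens: int,
                    finish_reasons: List[str],
                    server_address: str) -> Tuple[str, Dict[str, Any]]:
    """Build the span name and gen_ai.* attributes for one chat completion."""
    operation = "chat"
    name = f"{operation} {request_model}"  # "{operation} {model}" naming rule
    attributes: Dict[str, Any] = {
        "gen_ai.operation.name": operation,
        "gen_ai.provider.name": provider,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": finish_reasons,
        "server.address": server_address,
    }
    return name, attributes
```

With the OpenTelemetry Python API installed, the returned pair could be passed to `tracer.start_as_current_span(name, kind=SpanKind.CLIENT, attributes=attributes)`. Response-side values are only known after the call returns, so in that case they would be set on the open span with `span.set_attribute`.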

For agent systems, the conventions define additional span types: create_agent, invoke_agent, and execute_tool. An agent span tree shows the full decision chain — what the agent decided to do, which tools it called, and what each tool returned. Agent spans carry gen_ai.agent.name and tool execution spans carry gen_ai.tool.name, giving you the ability to trace cost and latency per tool and per agent step.

The OTel Collector processes these spans identically to any other OTLP data. Export to Jaeger for trace visualization, to Prometheus for metrics aggregation, and to your log backend for event-level detail. No custom pipeline required.

Prompt and completion content is not captured by default — these contain user data and are potentially large. Opt in with the OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT environment variable when you need full-text debugging.

The tool landscape — honest assessment

LLM-specific observability platforms

Tool	Strengths	Weaknesses
LangSmith	Deep LangChain integration, prompt versioning, evaluation datasets, annotation queues.	Tightly coupled to LangChain. Limited value if you don't use LangChain. Closed source.
Helicone	Proxy-based (no SDK changes), cost tracking, caching, rate limiting, prompt management.	Adds a network hop. All LLM traffic routes through a third-party proxy.
Arize Phoenix	Open-source trace viewer, embedding drift detection, supports OTel natively.	Evaluation features are less mature than LangSmith. Smaller community.
OpenLLMetry (Traceloop)	Open-source OTel-based instrumentation for LLM frameworks. Vendor-neutral.	Instrumentation library, not a platform — you still need a backend.

General observability platforms with LLM support

Tool	Strengths	Weaknesses
Datadog LLM Observability	Unified with existing APM, no new vendor, prompt-level traces.	Expensive. LLM monitoring is an add-on to an already-expensive platform.
New Relic AI Monitoring	Similar unified approach, consumption-based pricing.	GenAI features are newer and less mature than Datadog's.

The OpenTelemetry-native path

Use the OTel GenAI semantic conventions with auto-instrumentation libraries (opentelemetry-instrumentation-openai, opentelemetry-instrumentation-anthropic), export to your existing observability stack (Jaeger + Prometheus + Grafana), and add custom metrics for quality signals that the conventions don't cover.

This path has the highest setup cost and the lowest vendor lock-in. You own the data pipeline, you own the schema, and you can switch backends without re-instrumenting.

What to instrument first

If you're running LLM calls in production today and have zero observability beyond HTTP-level monitoring, here's the priority order:

Week 1: Token usage and cost tracking. This is the signal most likely to catch a production incident before it becomes expensive. Add OTel auto-instrumentation, export to your existing metrics backend, and set a daily cost alert.

Week 2: Latency decomposition. Break down TTFT vs total completion time per model. Set SLOs for each: TTFT under 500ms at p95 for chat interfaces, total time under 10s at p95 for batch.
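Checking a window of samples against those SLOs takes only a few lines. This sketch uses the nearest-rank percentile definition, which is one of several in common use, and the helper names are illustrative; your metrics backend may compute percentiles differently.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile (p in 1..100) of a non-empty sample."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def slo_met(samples_s: Sequence[float], threshold_s: float, p: int = 95) -> bool:
    """True when the p-th percentile latency is within the threshold."""
    return percentile(samples_s, p) <= threshold_s


# Thresholds from the text: TTFT for chat, total completion time for batch.
CHAT_TTFT_SLO_S = 0.5
BATCH_TOTAL_SLO_S = 10.0
```

Evaluating `slo_met` per model keeps a slow model from hiding behind a fast one in the aggregate.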

Week 3: Error classification. Separate HTTP errors from structured output failures from guardrail violations. Build a dashboard that shows each class independently.

Week 4: Output quality baselines. Start logging finish reason distributions. If you have reference answers, compute embedding similarity scores and track the distribution. Set alerts on distribution shifts, not absolute thresholds — you're looking for changes, not perfection.
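One simple way to alert on a shift rather than an absolute value is to compare the current finish-reason distribution with a stored baseline. The sketch below uses total variation distance, which is my choice for illustration and not something the conventions prescribe; the 0.1 threshold is likewise a placeholder to tune.

```python
from typing import Dict


def total_variation(baseline: Dict[str, int], current: Dict[str, int]) -> float:
    """Total variation distance between two count distributions (0 = identical, 1 = disjoint)."""
    base_total = sum(baseline.values())
    cur_total = sum(current.values())
    if base_total == 0 or cur_total == 0:
        raise ValueError("both distributions need at least one observation")
    keys = set(baseline) | set(current)
    return 0.5 * sum(
        abs(baseline.get(k, 0) / base_total - current.get(k, 0) / cur_total)
        for k in keys
    )


def distribution_shifted(baseline: Dict[str, int],
                         current: Dict[str, int],
                         threshold: float = 0.1) -> bool:
    """True when the current window has drifted from the baseline by more than the threshold."""
    return total_variation(baseline, current) > threshold
```

The same comparison works for bucketed similarity scores: bin them into a histogram and pass the bucket counts in place of finish-reason counts.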

The infrastructure layer underneath

LLM observability tools track what happens inside your application. But your application depends on external infrastructure: the OpenAI API, the Anthropic API, your vector database, your embedding service. When any of these degrade, your LLM application degrades — and the root cause is invisible to application-level instrumentation.

An external monitor that checks your model provider's API status, your Pinecone endpoint health, and your embedding service latency every 30 seconds catches provider outages before they propagate through your application. When your LLM observability dashboard shows a latency spike, you want to know immediately whether it's your code or your provider — set up infrastructure checks at app.devhelm.io starting with your most critical model provider endpoint.
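For a sense of what such a check does, here is a minimal standard-library probe. It is a generic sketch and not DevHelm's implementation; `probe` and `ProbeResult` are hypothetical names, and the `opener` and `clock` parameters exist only so the function can be exercised without network access.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    url: str
    status: Optional[int]  # None when no HTTP response was received
    latency_s: float
    healthy: bool


def probe(url: str,
          timeout_s: float = 5.0,
          opener: Callable = urllib.request.urlopen,
          clock: Callable[[], float] = time.perf_counter) -> ProbeResult:
    """Issue one GET and record status and latency."""
    start = clock()
    try:
        with opener(url, timeout=timeout_s) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with an error status
    except OSError:
        status = None      # DNS failure, refused connection, timeout, and similar
    latency = clock() - start
    healthy = status is not None and 200 <= status < 300
    return ProbeResult(url, status, latency, healthy)
```

A scheduler would call `probe` for each dependency every 30 seconds and record the results alongside application telemetry, so a latency spike can be matched against provider health at the same timestamp.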

