Arindam Majumder

Posted on Mar 6 • Originally published at levelup.gitconnected.com

What is LLM Observability? The Complete Guide (2026)

#ai #llm #observability #programming

TL;DR: LLM observability is your ability to understand what your language models are doing in production, not just whether they're up, but whether they're good.
This guide covers everything: what it is, how it differs from traditional monitoring, the four pillars, key metrics, RAG and agent observability, enterprise challenges, the current tools landscape, and how to implement it from scratch.

Shipping an LLM feature is no longer the hard part. Keeping it reliable, fast, and cheap in production is.

Most engineering teams hit the same wall about three months after their first production deployment:

responses quietly degrade,
costs balloon unexpectedly,
a customer escalation surfaces a class of failures nobody noticed.

And the team realizes they have almost no visibility into what the model is actually doing.

That's where we need LLM observability.

In this article, we'll understand what LLM observability is, why traditional monitoring isn't enough, the four pillars you need to instrument, and how to implement it, for simple deployments, RAG pipelines, and agentic systems alike.

Let's start with the problem that makes all of this necessary.

The Problem Nobody Talks About

You can have a perfect HTTP 200 response from an LLM API and still be on the edge of a production disaster.

The model responded. No error was thrown. The latency was acceptable. Your uptime check is green.

And somewhere in that response, the model hallucinated a fact, cited a policy that doesn't exist, or gave a customer the wrong refund amount, with complete confidence, in fluent prose.

This is the core problem that LLM observability exists to solve. Traditional software monitoring tells you whether your system worked. LLM observability tells you whether it worked well.

The gap between those two questions is where most enterprise AI deployments silently fail.

The LLM observability platform market was valued at $1.44 billion in 2024 and is projected to reach $6.80 billion by 2029, with a 36.3% CAGR. But market growth doesn't tell you what to do on Monday morning.

What is LLM Observability?

LLM observability is the practice of capturing, analyzing, and acting on telemetry data from large language model applications in production. It gives you the capacity to understand the internal states of an LLM system through its outputs, with which your team can ensure that the models function accurately, reliably, and safely at scale.

More precisely, LLM observability means capturing inference-level data, token usage, prompt content, response quality, error rates, latency breakdowns, and cost, and correlating it with user interactions to provide a complete, queryable picture of system behavior.

Before going further, three terms are worth distinguishing:

Term	Scope
LLM Monitoring	Continuously tracks performance metrics (latency, token usage, error rates) as a health signal
LLM Observability	Goes beyond monitoring & provides in-depth insight into how and why an LLM behaves the way it does
LLM Tracing	Captures request/response flow through an LLM pipeline, tracking inputs, intermediate stages, and outputs

Observability includes monitoring but adds root-cause analysis. Monitoring tells you something is wrong. Observability tells you why.

Why LLM Observability is Different from Traditional APM

Your platform team already runs Datadog, Grafana, New Relic, or some combination. You have RED metrics (Rate, Errors, Duration) on every service. Your infrastructure is well-monitored.

That's great but none of these are sufficient for LLMs.

Traditional APM was designed for deterministic systems. A successful API call means the function completed correctly. An HTTP 200 means the request was handled. These assumptions break completely the moment you introduce a language model.

They are non-deterministic in nature; they can give 5 different answers in 5 calls.

Here's the fundamental incompatibility:

Dimension	Traditional APM	LLM Observability
Primary question	Did it work?	Was it good?
Success signal	HTTP 200, no exception	Output quality score
Input space	Finite, structured	Infinite, natural language
Output behavior	Deterministic	Non-deterministic
Failure modes	Errors, timeouts	Hallucinations, drift, unsafe content
Cost model	Fixed compute	Variable (token-based)
Debugging unit	Stack trace	Prompt + context + response chain
Analysis type	Threshold alerts	Semantic search over interactions

The most dangerous LLM failures are silent. The model doesn't throw an exception when it makes something up. Degradations happen without a single error being raised.

A customer service bot starts giving inaccurate answers. A RAG pipeline silently returns less relevant chunks after an index update. A fine-tuned model's quality drifts after a base model version change by the API provider. None of these appear in your existing dashboards.

LLM observability doesn't replace traditional APM; it supplements it. You still need infrastructure health monitoring. You just need an additional layer that understands what's happening at the content level, not just the infrastructure level.

The Four Pillars of LLM Observability

Traditional observability has three pillars: logs, metrics, and traces. LLM observability keeps all three and adds a fourth that doesn't exist in traditional software monitoring at all.

Pillar 1: Metrics

Metrics in LLM observability cover two distinct layers, infrastructure performance and business quality, and you need both.

Performance metrics:

Time to First Token (TTFT) is the elapsed time between when a request is sent and when the first response token arrives. It's the primary latency signal for streaming interfaces because it's what users feel as "waiting." TTFT has two components: the prefill phase (the model processes all input tokens and builds its KV-cache) plus any scheduling or queue time. The longer the prompt, the higher the TTFT. This is why prompt size distribution is a leading indicator for latency degradation.
Token throughput (TPS) is the number of output tokens generated per second after the first token. This is your compute load metric. It's not the same as requests per second; a single request can be 50 tokens or 50,000. Token throughput tells you what the model is actually doing; RPS tells you how many requests arrived. Both matter, for different reasons.
End-to-end latency is what most teams track. But the unit that matters is percentiles, not averages. LLM response time distributions are heavily skewed by prompt length, context size, and load. p50 is your median user experience. p95 is your bad-day experience. p99 is where your SLA commitments should live. Average latency in a skewed distribution is a lie; it will always look better than what your worst users actually experience.

Error metrics:

Not all LLM errors are equal, and treating them as a single "error rate" metric is one of the most common mistakes in inference monitoring.

4xx errors: malformed requests, invalid parameters, context length violations. Client problem. The application team owns it.
429s: rate limit exhaustion. Capacity problem. Needs a quota increase or traffic shaping, not a bug fix.
5xx errors: infrastructure failures, model unavailability. Infra team incident. Pages differently from a 4xx.

When these are merged into a single error rate, a 429 storm looks identical to an infrastructure outage. You escalate to the wrong team, debug the wrong layer, and waste hours that your users are already feeling.

Nebius Token Factory separates these three error classes natively each tracked as its own signal, filterable by endpoint, region, and API key. This is how error monitoring should work by default.

Quality metrics:

These are unique to LLM observability and have no equivalent in traditional APM:

Groundedness score: Alignment between the response and source documents (critical for RAG)
Relevance score: How well the response addresses the actual user query
Faithfulness: Whether the response is supported by the retrieved context
Hallucination rate: Percentage of responses containing fabricated information
Refusal rate: Proportion of queries declined by safety filters

Cost metrics:

Token consumption drives your AI budget, and it can grow invisibly fast.

Track cost per request broken down by model, endpoint, team, and API key. Track input vs. output token ratios a single bloated system prompt, replicated across millions of requests, can materially change your monthly bill. Track cache hit ratio.

Most teams that audit their token usage find that a 30–50% reduction is achievable without any quality loss.

Nebius Token Factory tracks token throughput broken down by project, endpoint, and API key out of the box giving you the per-consumer cost visibility that most teams have to build custom tooling for.

Pillar 2: Traces

Traces are the backbone of LLM observability. A trace captures the complete lifecycle of a request as it moves through every component of your system, from the user's input through retrieval, re-ranking, the LLM call, post-processing, and back to the user.

The trace hierarchy for LLM applications:

Level	Description
Session	A multi-turn user conversation (groups related traces)
Trace	Complete end-to-end request lifecycle
Span	A discrete unit of work within the trace
Generation	A single LLM call, prompt in, completion out
Retrieval	A RAG document fetch operation
Tool Call	An external API call made by an agent
Event	A state milestone within a span

What each generation span should capture:

The exact prompt content (or a hash, if privacy-constrained)
Model name and version
Temperature and sampling parameters
Input token count and output token count
TTFT, total latency, and per-token latency
Cost in USD
Evaluation scores
Tool arguments and returns (for agents)
Retrieved document chunks and relevance scores (for RAG)

The critical discipline: propagate a single trace_id through every layer of the system. Application → retriever → guardrails → model call → post-processing. Without this thread, distributed traces are incoherent. Debugging becomes guessing.

Pillar 3: Logs

LLM logs differ from traditional application logs in one fundamental way: the payload is unstructured natural language, not structured error codes.

A JSON log of a successful API call tells you almost nothing about whether that call was valuable. You need a different approach.

Recommended log structure for every LLM call:

trace_id              # Propagated from parent span
timestamp
model_id
model_provider
input_tokens
output_tokens
latency_ms
ttft_ms
cost_usd
user_id               # If applicable
session_id
environment           # dev / staging / prod
application_name
error_type            # If applicable
evaluation_scores     # groundedness, relevance, etc.

Privacy-first default: Log metadata, not content. Token counts, model names, latency, and trace IDs by default, not raw prompt and response content. Prompts frequently contain PII. Enable full content capture only for authenticated sessions with explicit data governance controls.

A semantic search capability over stored prompts is eventually necessary at scale. Traditional log indexing is insufficient for natural language; you can't grep your way to the root cause in a production LLM incident when the relevant signal is "prompts that started producing low groundedness scores after Tuesday."

Pillar 4: Evaluation

This is the pillar that has no equivalent in traditional software observability. It is unique to LLM systems, and it is the pillar most teams skip.

Evaluation is the practice of systematically assessing the quality of LLM outputs. Not just whether requests were completed, but whether the responses were good by the standards your application requires.

LLM-as-Judge: A separate LLM evaluates the outputs of your primary LLM. Scores across dimensions: relevance, accuracy, coherence, safety, tone. Scales to production traffic volumes without human bottlenecks.
Online evaluation: Running evals on live traffic in real-time. Flags outputs that fall below quality thresholds as they happen.
Offline evaluation: Running evals on captured traces as a batch job. Slower but more thorough, uses more expensive evaluators on a sample of historical traffic.
Human annotation: The gold standard for precision. Human labelers reviewing flagged outputs and feeding corrections back into training datasets.

The practical implementation: define 3–5 evaluation metrics you will actually act on. Tracking 20 metrics and acting on none is worse than tracking 3 rigorously. For most applications, a good starting set is: groundedness (for RAG), response relevance, format compliance, and safety.

LLM Observability for RAG Pipelines

RAG (Retrieval-Augmented Generation) is now one of the most common LLM deployment patterns in the enterprise. It also introduces multiple independent observable stages, each of which can fail or degrade silently.

The observable stages in a RAG pipeline:

Query processing: User query embedding, query rewriting, or expansion
Retrieval: Vector database lookup, hybrid search, metadata filtering
Reranking: Scoring and ordering retrieved chunks by relevance
Context assembly: Stuffing ranked chunks into the prompt within context window limits
Generation: The LLM call
Post-processing: Citation extraction, formatting, safety filtering

Each stage has its own failure mode. Slow retrieval. Poor reranking quality. Context window overflow. Hallucination despite good retrieval. Good retrieval that the model ignores.

RAG-specific metrics:

Context relevance: Are the retrieved chunks actually relevant to the query?
Faithfulness/groundedness: Does the generated answer stay within what the retrieved documents support? (This is your RAG-specific hallucination detector.)
Answer relevance: Does the response address the original question?
Recall@k: Of all truly relevant documents in the corpus, what fraction was retrieved?
Chunk hit rate: How often at least one retrieved chunk was genuinely useful
Context window utilization: Truncation rates and overflow events

The key insight for RAG observability: evaluate components independently from the full pipeline. Good retrieval does not guarantee good answers. Good answers can sometimes emerge from poor retrieval.

You need both component-level evaluation (retrieval quality) and end-to-end evaluation (answer quality), and you need to track them separately, because they can disagree.

Wrap your retrieval function in a span that captures: which documents were fetched, their relevance scores, the query that produced them, and the latency. Link this span to the parent trace. When a user gets a bad answer, you need to be able to answer: "Was the retrieval bad, or was the model bad with good data?"

LLM Observability for Agentic Systems

Agents are the hardest observability problem in the LLM space.

In a simple LLM call, you have one input and one output. In an agentic system, a single user request might execute 15 LLM calls across multiple models, trigger 8 tool calls, read from memory, spawn sub-agents, and make decisions at branching points, all before returning a response.

None of this is visible in standard API logs.

Why agents are uniquely hard:

Execution path variability: The same input can produce completely different execution paths across runs. Slight phrasing changes, ambiguous instructions, or slight differences in retrieved memory can produce different tool call sequences.
Depth: Multi-hop reasoning chains mean errors can originate deep in the chain, far from where symptoms appear.
Temporal span: Agent tasks can run for minutes to hours. The request/response tracing model breaks down.
Silent failure: An agent might silently terminate early, return a partial result, or use the wrong tool, without any error being raised.

According to LangChain's State of AI Agents report, 89% of organizations have implemented some form of observability for their AI agents. But quality issues remain the primary production barrier, cited by 32% of organizations because monitoring that the agent ran is not the same as monitoring that the agent worked correctly.

Required trace data for agents:

Every LLM call: prompt, completion, token usage, latency, cost
Every tool call: function name, arguments passed, return value, latency
Every memory read/write
Every branching decision with the reasoning that led to it
Session context across multiple turns

The single most important capability: Step-level trace reconstruction. For any failed or low-quality agent run, you must be able to reconstruct the exact sequence of decisions the agent made. Not just "the agent returned a bad result", but "at step 7, the agent called the search tool with query X, received result Y, and then decided Z, which led to the failure."

Without that capability, debugging agentic failures is not debugging. It's guessing.

Enterprise LLM Observability Challenges

For enterprises, observability is not just a developer experience concern. It's a compliance, security, and governance concern.

Data Privacy

LLMs in enterprise settings operate on sensitive data, such as customer records, contracts, source code, financial data, and medical records. Standard observability that logs raw prompts can inadvertently create a regulated data lake nobody intended to build.

The practical implications:

PII in prompts: User inputs frequently contain names, account numbers, addresses, and medical information. Your observability pipeline needs PII scrubbing middleware before data is written to any storage system.
Prompt content as IP: Even without PII, system prompts may encode proprietary business logic, product roadmaps, or trade secrets. Who has access to that data in your observability tool?
Data residency: GDPR, HIPAA, and financial services regulations may prohibit transmitting prompt data to US-based SaaS platforms for EU-based or regulated-industry teams.

The default should be: log metadata, not content. Enable full prompt/response logging only where it's been explicitly reviewed and approved.

For teams running under GDPR or strict data residency requirements, Nebius Token Factory stores observability metrics in the EU-North region regardless of where inference runs and supports Zero Data Retention mode, where requests and outputs are never stored or reused. Both options exist because the right choice depends on your regulatory context.

Compliance Requirements

EU AI Act: Requires transparency, risk documentation, and human oversight for high-risk AI systems. Observability is the technical foundation for compliance; you cannot document risk you cannot measure.

GDPR: Data protection, right to explanation, right to erasure. Every AI-assisted decision that affects a user must be traceable. "Right to erasure" is particularly complex for LLMs; models don't support selective forgetting.

HIPAA: Any LLM processing patient information requires audit trails of every interaction, Business Associate Agreements (BAAs) with every vendor in the data path (including your observability tool), and strict access controls.

This is not theoretical. These requirements apply to production LLM deployments today, and the cost of non-compliance dwarfs the cost of proper observability infrastructure.

Multi-Tenancy

Enterprises serving multiple internal teams or external customers from shared LLM infrastructure face a set of observability challenges that single-tenant deployments don't.

Cost attribution requires per-request metadata tagging, team ID, product, and customer ID, so you can do showback and chargeback across cost centers. Without this, your AI costs are opaque at the organizational level, even if your infrastructure team can see the aggregate bill.

Nebius Token Factory's per-API-key and per-project filtering makes this practical without additional instrumentation each consumer's token usage, latency, and error rates are filterable independently.

Data isolation means Tenant A's prompts and conversation history must never appear in Tenant B's context. Your retrieval layer and vector stores must support tenant-scoped filtering, and your observability data access must be role-gated by the same tenant boundaries.

Differential SLAs require per-tenant performance tracking. If you've committed to different latency or availability targets for different customers or teams, you need to monitor against those targets independently.

Non-Determinism at Scale

A production LLM can return a different response to the same prompt across runs. This isn't a bug, it's inherent to how these systems work. But it fundamentally changes what monitoring means.

You cannot monitor LLM quality as binary pass/fail. You have to monitor distributions, trends, and statistical anomalies. A model behaving "correctly" on 97% of requests is different from one behaving correctly on 83%, but neither is 0% failure rate. You need statistical significance thresholds, not simple error rate alerts.

Model provider updates compound this. When OpenAI or Anthropic updates their base model, your fine-tuned model or prompt template may produce different outputs without any change on your end. Observability is the only way to catch this before users do.

The Current Tool Landscape

The LLM observability tool market has matured into three categories. Choosing the wrong one for your situation creates lock-in, privacy exposure, or observability blind spots.

Purpose-Built LLM Observability Platforms

Langfuse: Open source (MIT license), self-hostable, framework-agnostic. Best for teams with data residency requirements or those who want full control without per-request SaaS pricing at scale.
LangSmith: Commercial, deep LangChain/LangGraph integration. Best for teams already invested in the LangChain ecosystem who want native debugging tools.
Helicone: Proxy-based integration (change one URL, logging starts immediately). Best for fast time-to-value when using vanilla API calls.
Arize Phoenix: Open source, OpenTelemetry-native, strong for LLM and RAG evaluation. Best for teams wanting vendor-neutral instrumentation.
Maxim AI: Ultra-low latency gateway (<11 microseconds at 5,000 RPS) plus evaluation. Best for teams needing gateway + eval + observability in one platform.

Enterprise APM Platforms Adding LLM Support

Datadog LLM Observability: Native OTel GenAI Semantic Convention support (v1.37+) and "AI Guard" for real-time security guardrails. Best for enterprises already standardized on Datadog who want LLM observability in the same pane of glass as infrastructure monitoring.
Grafana + OpenLIT: For teams already running Prometheus/Grafana, OpenLIT (pip install openlit, two lines of setup) exports LLM metrics via OTLP to Grafana Cloud. Best for teams who want to stay in their existing observability stack.
Splunk: Hallucination detection, drift management, compliance audit trails. Best for enterprise security/compliance teams already standardized on Splunk.

Inference Platforms with Built-in Observability

An often-overlooked category: inference platforms that instrument your models at the infrastructure level, without requiring application-layer SDKs.

Nebius Token Factory is the clearest example of this approach. TTFT, token throughput, error categorization (4xx/429/5xx separately), active replica counts, and prompt size distributions are all tracked natively per endpoint, per API key, per region with near-real-time updates. Metrics export via Prometheus for integration with your existing Grafana dashboards.

The advantage of this approach: you get inference-layer observability without any instrumentation overhead in your application code. The disadvantage: it doesn't give you the application-layer context (session IDs, user IDs, evaluation scores) that purpose-built platforms provide. The right answer for most teams is both infrastructure observability at the inference layer and application observability at the SDK layer, unified in one Grafana dashboard.

How to Choose

Criteria	Recommendation
Data residency required	Self-hosted: Langfuse, Grafana + OpenLIT
Already on LangChain	LangSmith
Fastest time-to-value	Helicone (proxy)
Already on Datadog	Datadog LLM Observability
Evaluation-first approach	Arize Phoenix, Braintrust
Inference-layer visibility	Nebius Token Factory
Enterprise scale + compliance	Combination of the above

Final Thought

The teams that operate AI reliably are not necessarily the ones with the best models. They are the ones who can see what their models are actually doing in production and respond before their users notice.

LLM observability is not an ML concern or an infra concern. It is a product quality concern. Every blind spot in your observability stack is a failure mode that will eventually become an incident. The only variable is whether your team finds it first or your users do.

The four pillars metrics, traces, logs, and evaluation are your minimum. Start with the metrics and traces. Add evaluation before your team thinks you need it. Close the loop between observation and improvement.

Instrument everything. Ship nothing you can't see.

Top comments (2)

Umar zeb • Mar 10

great post

Arindam Majumder • Mar 10

Thanks for checking