A user reports that an AI application's response is bad. The dashboard, however, is green: 200 OK, latency within the expected percentile, clean logs. The complaint is legitimate, and traditional observability fails to capture it.
The tooling refined over a decade for web services measures latency, errors, and throughput. Generative AI systems demand more: cost per request, output quality, propagation of semantic errors, and the behavior of an agent that decides at each iteration what to do next. OpenTelemetry and its GenAI semantic conventions provide a foundation, but effective implementation requires addressing five distinct layers and, above all, the correlations between them.
This article walks through those layers with concrete attributes, named metrics, JSON span examples, and diagrams. The structure follows the natural flow of a request: agent orchestration, retrieval pipeline, cost, quality, and security as an initial filter. After that, a section dedicated to cross-layer correlations, the telemetry pipeline that ties everything together, the most frequent anti-patterns, and an actionable minimal implementation.
The system under observation
Before instrumenting anything, the scope of the system needs to be pinned down. The reference application receives user queries, executes an agent capable of reasoning across multiple iterations and calling tools, queries a private corpus through a retrieval pipeline, invokes one or more external language models in streaming mode, and returns a response. This pattern covers most modern AI applications: vertical assistants, conversational search engines, productivity copilots, automated support systems.
This system has four boundaries where traditional observability falls short. The boundary between application code and external models: the call is not a normal HTTP request, latency depends on prompt length, output arrives in streaming mode token by token, and the model may refuse to respond due to its own filters. The boundary between the query and the retrieved documents: relevance is a qualitative dimension that cannot be reduced to a simple numeric metric. The boundary between usage and cost: each token has a price, prices change, cached tokens cost differently, and the bill arrives a month later in aggregate. The boundary between response and validity: a 200 does not mean a correct response, quality is subjective, contextual, and only verifiable with additional effort.
The layers that compose the observability system respond to these boundaries. The rest of the article develops each one with concrete attributes and explicit positions.
The orchestration layer
The modern agent is not a function that takes input and returns output. It is a loop that decides at each iteration: reason, call a tool, query a model, reason again. The natural way to model it is hierarchical. Root span per interaction, child spans per iteration, grandchild spans per tool call, and, in more complex systems, sub-agents that generate sub-hierarchies.
The first strong position of the article: Jaeger is not adequate for agents with more than three levels of depth. Plan from the outset for a different UI: Langfuse, Arize Phoenix, SigNoz, or one built on top of ClickHouse. Reusing the tool designed for microservices typically results in weeks lost and the same conclusion.
To ground this in practice, an agent iteration span, following OpenTelemetry GenAI semantic conventions and extended with custom attributes, looks like this:
{
  "name": "agent.iteration",
  "kind": "INTERNAL",
  "attributes": {
    "agent.iteration.number": 3,
    "agent.iteration.reason": "tool_call_required",
    "gen_ai.system": "anthropic",
    "gen_ai.request.model": "claude-sonnet-4",
    "gen_ai.response.model": "claude-sonnet-4-20250514",
    "gen_ai.usage.input_tokens": 4523,
    "gen_ai.usage.output_tokens": 187,
    "gen_ai.usage.cached_tokens": 4200,
    "gen_ai.error.type": null,
    "cost.usd": 0.0143,
    "feature.name": "support_assistant",
    "tenant.id": "t_4521"
  },
  "events": [
    {"name": "first_token", "time_unix_nano": "..."},
    {"name": "tool_call.dispatched", "attributes": {"tool.name": "search_kb"}}
  ]
}
The gen_ai.* attributes are the ones standardized by OpenTelemetry, in experimental status but adopted by serious SDKs. Attributes without the gen_ai prefix, such as cost.usd, feature.name, and tenant.id, are custom extensions that should be isolated in their own namespace so they can be renamed when the standard matures. The events inside the span are crucial: they mark significant moments such as the arrival of the first token or the dispatch of a tool call, and they form the basis for derived metrics covered in the next subsection.
The most relevant architectural decision in this layer is whether to capture full prompts and completions in spans. Strong position: capture the full prompt only on errors and outliers. For everything else, hash plus separate storage, or nothing. There are three reasons, all of them painful. Size: a RAG-augmented prompt can contain tens of thousands of tokens, and storing the full text in every span makes trace size scale with prompt length rather than with request count. Privacy: the prompt may contain user PII, which then travels through the entire telemetry pipeline toward systems not designed with personal data controls. Cost: any tracing backend bills by volume, and storing natural-language text multiplies bills by one or two orders of magnitude compared to structured metadata.
In practice, teams do not pick a single point on the spectrum but rather a combination. Hash of the prompt always stored, full text only on errors and cost outliers, PII redaction at the collector. This is defense in depth, not maximalism in any direction.
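A minimal sketch of that combination with the OpenTelemetry Python API; the attribute names outside the gen_ai namespace and the outlier threshold are assumptions, not part of any convention:

```python
import hashlib

from opentelemetry import trace

COST_OUTLIER_USD = 0.25  # hypothetical per-request threshold


def record_prompt(span: trace.Span, prompt: str, error: Exception | None, cost_usd: float) -> None:
    # Hash always: enough to deduplicate and to join against cold storage later
    # without shipping the text through the telemetry pipeline.
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    span.set_attribute("prompt.hash", digest)
    span.set_attribute("prompt.length_chars", len(prompt))

    # Full text only when the trace is worth the storage: errors and cost outliers.
    # PII redaction still happens downstream at the collector.
    if error is not None or cost_usd > COST_OUTLIER_USD:
        span.set_attribute("prompt.full_text", prompt)
```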
Streaming, TTFT and TPOT
Here lies one of the most common gaps in AI observability: ignoring streaming. LLM output does not arrive as a single payload but in chunks, typically via Server-Sent Events. Total latency stops being the relevant metric. What matters is the TTFT (Time To First Token), which defines how quickly the user perceives that the system is responding, and the TPOT (Time Per Output Token), which defines the fluency of generation.
The client SDK or wrapper around the LLM emits span events: stream.start, stream.first_token, stream.last_token. The collector consumes them and derives metrics: TTFT as the difference between request and first_token, TPOT as (last_token - first_token) / (output_tokens - 1). These two metrics, not total latency, are the ones that belong in product-oriented SLOs.
{
  "name": "llm.generation",
  "attributes": {
    "gen_ai.streaming": true,
    "gen_ai.usage.output_tokens": 412,
    "stream.ttft_ms": 240,
    "stream.tpot_ms_avg": 18,
    "stream.duration_ms": 7656
  },
  "events": [
    {"name": "stream.start", "time_unix_nano": "..."},
    {"name": "stream.first_token", "time_unix_nano": "..."},
    {"name": "stream.last_token", "time_unix_nano": "..."}
  ]
}
A TTFT below one second feels instantaneous; above three seconds it kills the experience. A TPOT below 30 ms reads fluently; above 80 ms it feels like a slow connection. These numbers should be explicit system SLOs, not anecdotal references. Aggregate latency does not capture what the user perceives: TTFT and TPOT must be measured as first-class metrics.
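A sketch of a wrapper that emits those events; here the two metrics are also derived in-process and attached as attributes, matching the span above, although the collector can do the same derivation from the events. Counting chunks as tokens is an approximation, since real SDKs report exact usage separately:

```python
import time

from opentelemetry import trace

tracer = trace.get_tracer("llm.client")


def stream_with_telemetry(stream):
    """Wrap a streamed generation and derive TTFT and TPOT from its chunks."""
    with tracer.start_as_current_span("llm.generation") as span:
        start = time.monotonic()
        span.add_event("stream.start")
        first = last = None
        chunks = 0
        for chunk in stream:
            now = time.monotonic()
            if first is None:
                first = now
                span.add_event("stream.first_token")
            last = now
            chunks += 1
            yield chunk
        span.add_event("stream.last_token")
        if first is not None:
            span.set_attribute("stream.ttft_ms", round((first - start) * 1000))
            if chunks > 1:
                span.set_attribute(
                    "stream.tpot_ms_avg", round((last - first) * 1000 / (chunks - 1), 1)
                )
```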
Error types and tooling
Not all LLM errors are equal. gen_ai.error.type should use a controlled vocabulary: rate_limit, content_filter, max_tokens_exceeded, structured_output_malformed, timeout, provider_outage. Each requires a different corrective action. Grouping them as a generic error loses critical information.
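A normalization sketch along those lines; the string checks are illustrative heuristics, not the exception hierarchy of any particular SDK:

```python
def classify_llm_error(exc: Exception) -> str:
    """Map a provider exception onto the controlled vocabulary for gen_ai.error.type."""
    name = type(exc).__name__.lower()
    message = str(exc).lower()
    if "ratelimit" in name or "rate limit" in message or "429" in message:
        return "rate_limit"
    if "content_filter" in message or "refus" in message:
        return "content_filter"
    if "max_tokens" in message or "context length" in message:
        return "max_tokens_exceeded"
    if "timeout" in name or "timed out" in message:
        return "timeout"
    if "json" in message or "schema" in message:
        return "structured_output_malformed"
    return "provider_outage"  # conservative default for anything that looks like a 5xx
```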
On tooling, an explicit stance is warranted: OpenLLMetry from Traceloop is the option most aligned with pure OpenTelemetry, exporting to standard OTLP and integrating with any backend. Arize's OpenInference is more oriented toward evaluation. Langfuse and LangSmith are complete products with their own protocols. Vendors imply real lock-in: OTLP exports exist but are partial, and dashboards are built around their proprietary data models. The promised interoperability is more marketing than engineering at this moment.
The retrieval layer
If the orchestration layer governs the agent's logic, the retrieval layer governs its inputs. A typical query traverses five phases: text-to-vector embedding, vector search against the index, candidate re-ranking, final prompt construction with the chosen documents, and generation. Each phase has its own latency profile, its own failure mode, and its own relevant metrics.
A frequent anti-pattern: a single "rag" span covering the entire pipeline. If vector search degrades from 30 ms to 300 ms, that change should be visible in isolation rather than diluted in an aggregate latency that also includes generation. One span per phase, with specific attributes, is not a luxury: it is the difference between being able to operate the system and not.
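A sketch of that structure with the OpenTelemetry Python API; the pipeline callables (embed, search, rerank, build_prompt) are placeholders provided by the caller, and only the span layout and attributes matter here:

```python
from opentelemetry import trace

tracer = trace.get_tracer("rag.pipeline")


def retrieve_and_build_prompt(query: str, embed, search, rerank, build_prompt) -> str:
    with tracer.start_as_current_span("rag.embedding"):
        vector = embed(query)

    with tracer.start_as_current_span("rag.search.vector") as span:
        candidates = search(vector)  # list of (doc_id, score) pairs, by convention here
        span.set_attribute("vector.top_k", len(candidates))
        scores = [score for _, score in candidates]
        if scores:
            span.set_attribute("vector.distance.min", min(scores))
            span.set_attribute("vector.distance.max", max(scores))
            span.set_attribute("vector.distance.avg", sum(scores) / len(scores))

    with tracer.start_as_current_span("rag.rerank") as span:
        ranked = rerank(query, candidates)
        span.set_attribute("docs.retrieved", [doc_id for doc_id, _ in ranked])

    with tracer.start_as_current_span("rag.prompt.build") as span:
        prompt = build_prompt(query, ranked)
        span.set_attribute("prompt.length_chars", len(prompt))
        return prompt
```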
A concrete vector search span:
{
  "name": "rag.search.vector",
  "attributes": {
    "vector.index": "docs_v3",
    "vector.top_k": 10,
    "vector.distance.min": 0.21,
    "vector.distance.max": 0.67,
    "vector.distance.avg": 0.43,
    "cache.embedding.hit": true,
    "cache.search.hit": false,
    "docs.retrieved": ["doc_4521", "doc_8931", "doc_2901"],
    "rag.session.id": "sess_abc123"
  }
}
The concrete metrics worth exposing in this layer, by name:
- rag_phase_latency_ms{phase} — latency per phase, separated, not aggregated.
- rag_cache_hit_ratio{level} — embedding, search, response. Three levels, three ratios.
- rag_retrieval_relevance_proxy — average score of the top_k that actually ended up in the prompt after re-ranking. Not real relevance, but a proxy.
- rag_docs_referenced_ratio — percentage of retrieved documents that appear cited in the model's response. Drift in this metric is an early signal of degradation.
- rag_prompt_tokens_p99 — distribution of constructed prompt size. Uncontrolled growth heads toward silent truncation.
The defining problem of this layer is measuring quality without ground truth. That topic is developed in its own section later, because quality deserves its own layer rather than a paragraph hidden here.
The cache deserves specific attention. Three levels, three different economics. Embedding cache: same query, same vector. Hit ratio is typically high in applications with repetitive queries. Search cache: same vector, same results. Sensitive to corpus updates. Response cache: same query, same response. Controversial given the expected variability of LLMs, but in deterministic applications with zero temperature it saves enormously. Without hit/miss counters at each level, a drop in embedding cache rate goes unnoticed until the provider's bill arrives showing inexplicable growth.
Detecting silent regressions is where most teams get burned. Same query, same corpus, different results over time. Common causes: the embedding model provider changed versions without notice, someone re-indexed documents, the reranker was updated, the corpus grew. A defensive pattern: canonical queries executed periodically, score distribution archived as baseline, alert when the distribution diverges from the historical baseline. This is not optional, it is operational survival.
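A sketch of that canary, assuming a callable that runs vector search for a canonical query and an archived baseline mean and standard deviation:

```python
import statistics


def check_retrieval_drift(canonical_queries, run_vector_search,
                          baseline_mean, baseline_stdev, n_sigmas=3.0):
    """Compare today's score distribution for canonical queries against the baseline."""
    scores = []
    for query in canonical_queries:
        scores.extend(score for _, score in run_vector_search(query))
    current_mean = statistics.fmean(scores)
    drift = abs(current_mean - baseline_mean) / baseline_stdev if baseline_stdev else 0.0
    return {
        "current_mean": current_mean,
        "sigmas_from_baseline": drift,
        "alert": drift > n_sigmas,  # wire this to the alerting system of choice
    }
```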
The cost layer
Of the five layers, cost is the only one where traditional observability actively misleads. A green dashboard while the system bleeds money. The bill arrives a month later, aggregated, with no breakdown by feature, no attribution by user, no distinction between legitimate use and a bug in a loop. An observability practice that does not include cost as a first-class citizen is incomplete for AI systems.
A decision that confuses many teams is how to model tokens. The correct answer is to model them simultaneously as span attribute and as metric. As an attribute, tokens travel alongside the individual trace: for this specific query from this specific user, how many tokens were consumed? In which phase? This is the queryable, debuggable, post-mortem dimension. As a metric with labels, tokens enable fast aggregation: total per model, per feature, per hour. Cardinality limits the labels: do not put user_id as a label if the TSDB is to remain alive. That is why both forms are needed. The metric answers "how much." The attribute answers "who and why."
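A sketch of recording both forms at once with the OpenTelemetry Python API; the metric carries only low-cardinality labels while the span keeps the per-request detail:

```python
from opentelemetry import metrics, trace

meter = metrics.get_meter("llm.client")
tokens_total = meter.create_counter("tokens_total", unit="{token}", description="Tokens consumed")


def record_usage(span: trace.Span, model: str, feature: str,
                 input_tokens: int, output_tokens: int) -> None:
    # Attribute: travels with the individual trace. Answers "who and why".
    span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
    # Metric: fast aggregation. Answers "how much". No user_id label, by design.
    tokens_total.add(input_tokens, {"model": model, "feature": feature, "type": "input"})
    tokens_total.add(output_tokens, {"model": model, "feature": feature, "type": "output"})
```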
The concrete cost metrics any serious system should expose:
- cost_usd_total{model, feature, tenant} — aggregate cost in dollars with relevant low-cardinality labels.
- tokens_total{model, type, cache_status} — type is input or output, cache_status is cached or fresh.
- cost_per_request_p50/p95/p99{feature} — distribution per feature, not global.
- cost_anomaly_score{tenant} — deviation from the tenant's baseline. Alert when it exceeds N standard deviations.
On real-time cost calculation, tokens are a proxy: what is billed are dollars, and the conversion requires a pricing table the system must know. Strong position: the table lives in the collector, not in each SDK or the backend. Reason: updating prices is a configuration change, not a redeploy, and cost is materialized before storage rather than in post-processing. Distributing this logic across every SDK and service adds friction for zero return.
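A sketch of the enrichment logic itself; the prices are placeholders, not real list prices, and in practice this runs as a custom collector processor or a small enrichment service fed by configuration:

```python
# (model, token type) -> dollars per million tokens. Placeholder values.
PRICING_USD_PER_MTOK = {
    ("claude-sonnet-4", "input"): 3.00,
    ("claude-sonnet-4", "input_cached"): 0.30,
    ("claude-sonnet-4", "output"): 15.00,
}


def enrich_with_cost(attributes: dict) -> dict:
    """Materialize cost.usd from token counts before the span reaches storage."""
    model = attributes.get("gen_ai.request.model", "")
    cached = attributes.get("gen_ai.usage.cached_tokens", 0)
    # Cached tokens are treated as a subset of input tokens, as in the span example above.
    fresh_input = attributes.get("gen_ai.usage.input_tokens", 0) - cached
    output = attributes.get("gen_ai.usage.output_tokens", 0)
    cost = (
        fresh_input * PRICING_USD_PER_MTOK.get((model, "input"), 0.0)
        + cached * PRICING_USD_PER_MTOK.get((model, "input_cached"), 0.0)
        + output * PRICING_USD_PER_MTOK.get((model, "output"), 0.0)
    ) / 1_000_000
    attributes["cost.usd"] = round(cost, 6)
    return attributes
```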
Cost anti-patterns that recur in real systems: not separating input from output tokens (their prices differ by orders of magnitude), treating cached and non-cached tokens equally (Anthropic and OpenAI charge fractions of the standard price for cached), not versioning the model in spans (when the provider updates, historical metrics stop being comparable), not labeling by feature (when finance asks for a breakdown by functionality, no answer exists).
On prompt caching specifically: the gen_ai.usage.cached_tokens attribute is the difference between knowing what is billed and guessing. Providers offer cached tokens at a fraction of the price. Optimizing prompts to maximize cache hits changes the system's economics, but only if the observability layer distinguishes them. Treating all tokens equally makes attribution lie and optimizations operate blind.
Cost anomaly detection is where this layer demonstrates its operational value. A user suddenly consuming a hundred times the normal usage may be a legitimate use case, a bug retrying in a loop, or an attack. The three situations require different responses, and only well-instrumented systems can distinguish them quickly. Useful signals: associated errors, temporal patterns, similarity between consecutive prompts, correlation with other users. Without these signals, the first attempt at resource exhaustion becomes visible only through the bill.
The strong idea of this layer: in AI systems, FinOps is not a department separate from observability. It is a dimension alongside latency, errors, and throughput. Treating it as a monthly finance problem leads to consistent lateness.
The quality layer
Here lies the gap most teams leave uncovered. Quality is the dimension that distinguishes an AI system from a traditional web service, and yet it is rarely treated as a first-class metric. A response can be fast, cheap, and technically successful, and still be bad. An observability practice that does not capture that difference operates a system whose output is not being measured.
The first problem is one of definition. What is "quality" when discussing LLM output? Not a single metric. It is a vector with at least four dimensions worth handling separately: groundedness (the response is supported by retrieved documents or fabricates content), relevance (the response addresses what the user actually asked), completeness (the response covers all aspects of the query), safety (the response does not include problematic content). Each dimension needs its own scorer, its own threshold, its own alert.
Operationalizing quality requires three complementary approaches. None alone is sufficient.
Indirect user signals. The user follows up with the same question rephrased, clicks a "this didn't help" button, abandons the session, copies or does not copy the response content. Each signal is noisy, but aggregated over thousands of interactions they reveal patterns of degradation. These signals are linked back to the original span via the session.id, closing the loop between what happened and how it was received.
Structural metrics. They do not measure quality directly, but their drift over time almost always indicates that something has changed. Distribution of similarity scores returned by vector search. Percentage of the top_k that ended up in the prompt. Proportion of retrieved documents referenced in the response. Size of the constructed prompt. Number of agent iterations. These metrics are cheap, deterministic, and useful as an early warning system.
LLM-as-judge. A model, typically more expensive or more capable than the production one, evaluates a sample of responses against the defined quality dimensions. The score returns to the original trace via session ID.
{
  "name": "quality.evaluation",
  "attributes": {
    "trace_id_ref": "4f7e9a8b...",
    "session.id": "sess_abc123",
    "quality.score.overall": 3.2,
    "quality.scorer.type": "llm_judge",
    "quality.scorer.model": "claude-opus-4",
    "quality.dimensions.groundedness": 4.0,
    "quality.dimensions.relevance": 2.0,
    "quality.dimensions.completeness": 4.0,
    "quality.dimensions.safety": 5.0,
    "quality.evaluated_at": "2026-04-30T14:23:00Z"
  }
}
This brings up the problem few teams address: the recursive observability hell. The judge model itself suffers from hallucinations, drift, and high cost. Who watches the watcher? Mandatory judge metrics: judge.cost.usd_total, judge.score.distribution, judge.agreement_rate (with humans on sub-samples), judge.refusal_rate. Without these, the judge is a black box evaluating another black box. Strong position: when the LLM-as-judge cost exceeds five percent of production cost, the evaluation pipeline ceases to be sustainable and must be redesigned.
The sampling strategy for evaluation is also non-trivial. Uniform sampling captures average behavior but misses the rare cases that matter most. Stratified sampling by feature, by tenant, by error type captures the extremes better. The defensible practice: 100% of responses with user-reported errors, stratified sampling of 1-5% of the rest, sampling of cost and latency outliers.
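A sketch of that policy as a single decision function; the record shape, thresholds, and base rate are assumptions, and the base rate can vary per feature or tenant to stratify further:

```python
import random


def should_evaluate(record: dict, base_rate: float = 0.02) -> bool:
    """Decide whether a response goes to the LLM-as-judge pipeline."""
    if record.get("user_reported_error"):
        return True  # 100% of user-reported failures
    if record.get("cost_usd", 0.0) > record.get("cost_p99_usd", float("inf")):
        return True  # cost outliers
    if record.get("latency_ms", 0.0) > record.get("latency_p99_ms", float("inf")):
        return True  # latency outliers
    return random.random() < base_rate  # 1-5% of everything else
```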
The concrete quality metrics any serious system exposes:
- quality_score_p50/p95{dimension, feature} — distribution per dimension and feature, not aggregated.
- quality_alert_threshold_breaches{dimension} — number of evaluations that fell below the defined threshold.
- user_negative_feedback_rate{feature} — noisy but actionable proxy.
- quality_drift_score — deviation of the current distribution from a baseline window. Automatic alert when it exceeds N sigmas.
The operating rule: quality needs its own dashboard, its own alerts, and its own weekly review. Without the same rigor applied to latency and errors, the system's output remains outside the scope of measurement.
Correlation across layers
Here lies the real value of having all layers instrumented: the correlations between them. An AI system is not modular, it is systemic. Problems do not appear isolated in one layer; they appear in chains that cross several. Treating each layer separately and watching independent dashboards loses the pattern.
Three correlation chains that recur in real systems:
Chain A: silent retrieval degradation. Vector search scores fall, retrieved documents become less relevant, the model compensates by fabricating content (hallucination), quality drops without latency or cost raising any flag, and users start abandoning. A retrieval-only dashboard shows scores falling but "still passing the threshold." A feedback-only analysis surfaces the problem weeks late. The signal appears by crossing both views.
Chain B: context overflow from corpus growth. The corpus grows, the constructed prompt becomes longer, at some point it begins silently truncating at the model's limit, responses lose completeness, and meanwhile the cost of input tokens keeps rising even though they add no real value. Without correlation between prompt.tokens (retrieval layer) and quality.dimensions.completeness (quality layer), the problem hides for months.
Chain C: runaway agent. A tool fails with an error the agent cannot interpret. The agent retries. Iterations grow. Each iteration adds tokens. The monthly bill arrives with a forty percent increase without anything having triggered traditional monitoring. The signal is the correlation between tool.error.rate (orchestration layer) and iterations.count (orchestration layer) and cost.usd (cost layer).
The operational consequence is that the useful dashboard is not one per layer. It is one per correlation. The most valuable: quality.score vs retrieval.scores.avg, cost.per_request vs iterations.count, prompt.tokens vs quality.completeness. Observability limited to isolated layers produces symptoms, not diagnoses.
Security as a telemetry filter
Security in AI systems is neither a future problem nor a footnote in a "what is not yet solved" section. It is an initial filter that blocks requests before they reach the LLM, and that filter generates its own massive telemetry that must be observed with the same rigor as the rest of the system. Treating it as a minor section is among the costliest mistakes that can be made in enterprise AI architecture.
Guardrails apply at two points: before sending the prompt to the LLM (input guardrails) and before returning the response to the user (output guardrails). Each detects different patterns. Input guardrails look for prompt injection, jailbreaks, out-of-scope queries, PII in external user requests. Output guardrails look for PII leaked by the model, generated problematic content, hallucinations detectable by source verification.
Each guardrail activation is a span with its attributes:
{
  "name": "guardrail.input.check",
  "attributes": {
    "guardrail.type": "prompt_injection",
    "guardrail.detector": "regex_v2 + llm_classifier",
    "guardrail.verdict": "blocked",
    "guardrail.confidence": 0.87,
    "guardrail.signals": [
      "imperative_instructions_in_user_content",
      "system_prompt_override_attempt"
    ],
    "guardrail.action": "request_rejected",
    "user.id_hash": "..."
  }
}
The concrete signals worth capturing to detect prompt injection through patterns in spans: anomalous prompt length compared to the user's distribution, presence of imperative instructions in user content ("ignore previous instructions", "act as"), embeddings close to known jailbreak corpora, unusual ratio of special characters or control tokens, attempts to inject markdown or structures that mimic system instructions.
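A sketch of the cheap heuristic side of that detection, feeding the guardrail.signals attribute shown above; the patterns and thresholds are illustrative, and a production detector layers an ML classifier on top:

```python
import re

INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"\bact as\b", re.IGNORECASE),
    re.compile(r"you are now", re.IGNORECASE),
    re.compile(r"system prompt", re.IGNORECASE),
]


def injection_signals(user_content: str, typical_length: int = 500) -> list[str]:
    """Return the heuristic signals triggered by a piece of user content."""
    signals = []
    if any(pattern.search(user_content) for pattern in INJECTION_PATTERNS):
        signals.append("imperative_instructions_in_user_content")
    if len(user_content) > 10 * typical_length:
        signals.append("anomalous_prompt_length")
    specials = sum(not c.isalnum() and not c.isspace() for c in user_content)
    if user_content and specials / len(user_content) > 0.3:
        signals.append("unusual_special_character_ratio")
    return signals
```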
Mandatory metrics for the security layer:
- guardrail_blocks_total{type, detector} — blocks per type and detector. The distribution reveals which attacks are most frequent.
- guardrail_false_positive_rate{type} — legitimate requests blocked, measured on a sample with human review.
- guardrail_latency_ms_p95{type} — guardrails add latency before the LLM. Above 200 ms, TTFT starts to feel slow.
- guardrail_bypass_attempts{user_id_hash} — repeated attempts by the same user after a block. Adversarial actor signal.
Strong position: an AI API exposed to external users without input guardrails is not free of prompt injection attacks; it is simply not registering them. Prompt injection attacks on public APIs are the norm, not the exception.
Correlation with other layers is again where the value lies. A spike of blocks from a specific detector correlated with a particular endpoint indicates that the endpoint exposes a new surface. A high bypass attempt rate from a specific user is a candidate for investigation or aggressive rate limiting. False positives growing after a detector update is a regression in the classification model worth rolling back.
The telemetry pipeline
So far the discussion has covered what to generate. This section is operational: how to process it. The telemetry pipeline connects SDKs with backends, and in AI systems it has specific demands that justify a different architecture from traditional microservices.
The collector has three non-negotiable processors in AI systems. First, PII redaction, which in this domain is non-trivial because PII lives in natural language inside prompts, not in structured fields. Regex rules for obvious cases (emails, phone numbers), lightweight NER models as sidecar for names and entities, allowlist of safe fields rather than denylist (the former philosophy is more conservative, and in this domain the cost of a false negative far exceeds that of a false positive). Second, tail sampling: head sampling leaves too much value on the table. Observing the complete trace before deciding, retaining 100% of errors and cost outliers, and sampling the rest is the right approach. The RAM and CPU cost on the collector under tail sampling is real and must be planned for. Third, enrichment: cost calculation from tokens (the pricing table lives here), labeling by feature and tenant if not provided by the SDK, annotation of the active model version.
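A sketch of the redaction step for the obvious cases; the patterns are simplified for illustration, and NER-based redaction for names would sit behind the same interface:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


def redact_span_attributes(attributes: dict, allowlist: set[str]) -> dict:
    # Allowlist-first: string attributes not explicitly marked safe get redacted,
    # never dropped silently, so the span shape stays stable.
    return {
        key: value if key in allowlist or not isinstance(value, str) else redact_pii(value)
        for key, value in attributes.items()
    }
```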
Storage architecture is where most teams collapse economically. Strong position: use data tiering from day one. Three tiers, three economics. Hot path on ClickHouse or equivalent: structured metadata, aggregable metrics, all attributes without long natural-language text. Queryable, low-latency. Intermediate buffer on Kafka: heavy prompts and completions are published to a dedicated topic, decoupled from the critical path, consumed asynchronously. Cold storage on S3 or GCS: full text is stored indexed by content hash. The hot span stores a pointer to the hash, not the text. Recovering full text is an on-demand operation, not automatic.
The result of this architecture is that the per-trace cost on the hot path remains predictable even as prompts grow, while cold storage is far cheaper per byte. Without tiering, every new long prompt linearly amplifies the hot path bill, and the system becomes unsustainable at scale.
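A sketch of the hot/cold split at the point of capture; publish stands in for any producer writing to the dedicated Kafka topic, and the attribute name is an assumption:

```python
import hashlib
import json


def offload_payload(attributes: dict, prompt: str, completion: str, publish) -> None:
    """Keep only a content-hash pointer on the hot path; ship the text to the buffer topic."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    attributes["prompt.hash"] = digest  # hot span stores the pointer, never the text
    # Consumed asynchronously and written to object storage under the same hash key.
    publish(digest, json.dumps({"prompt": prompt, "completion": completion}))
```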
On the backend decision: for teams smaller than ten people or with volume below a few thousand requests per day, the most efficient path is to start with a specialized vendor (Langfuse, Arize, Helicone). Self-hosting OTel for AI before reaching scale is operational friction with zero return. Migration to a custom stack is justified once volume offsets the operational cost. Large teams with mature internal platforms can skip that phase, but they are the exception, not the rule.
Anti-patterns
A short list of errors that recur in real systems and warrant explicit avoidance:
- A single span for the entire RAG. Hides where the problem is when something fails.
- Not separating input from output tokens. Their prices differ by orders of magnitude.
- Not tracking cache hits per level. A drop in hit rate goes unnoticed until the bill arrives.
- Not versioning the model in each span. When the provider updates, historical metrics stop being comparable.
- No labeling by feature or tenant. Breakdowns by functionality or by client are inaccessible.
- Capturing the full prompt in every span. Triples the tracing bill and leaks PII.
- Using Jaeger for agents with sub-agents. A wall of horizontal bars that does not fit on screen.
- Treating all LLM errors as one. Rate limit, content filter, and malformed structured output are different problems.
- Measuring aggregate latency in streaming. TTFT and TPOT are the metrics the user perceives.
- Not instrumenting guardrails. An exposed API without guardrails is already under attack, only invisibly.
- LLM-as-judge without observing the judge. A black box evaluating another black box.
- Quality as an anecdotal metric. Without its own dashboard and automatic alerts, it is not a metric, it is theater.
Minimal implementation
To bring this model into production immediately, the actionable minimum set is:
Mandatory spans: root span per user request, span per agent iteration, span per tool call, span per RAG phase (minimum: vector search and generation), span per guardrail activation.
Mandatory attributes on every LLM generation span: gen_ai.system, gen_ai.request.model, gen_ai.response.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.cached_tokens, gen_ai.error.type, cost.usd, feature.name, tenant.id, stream.ttft_ms, stream.tpot_ms_avg.
Mandatory attributes on every retrieval span: vector.index, vector.top_k, vector.distance.avg, cache.embedding.hit, cache.search.hit, docs.retrieved (IDs), prompt.tokens (final).
Minimum metrics with low-cardinality labels: cost_usd_total{model, feature, tenant}, tokens_total{model, type, cache_status}, rag_phase_latency_ms{phase}, rag_cache_hit_ratio{level}, quality_score_p50/p95{dimension, feature}, guardrail_blocks_total{type}, stream_ttft_ms{model}, stream_tpot_ms{model}.
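A sketch of declaring those instruments with the OpenTelemetry metrics API; labels are attached at record time and kept low-cardinality:

```python
from opentelemetry import metrics

meter = metrics.get_meter("ai.observability")

cost_usd_total = meter.create_counter("cost_usd_total", unit="USD")
tokens_total = meter.create_counter("tokens_total", unit="{token}")
rag_phase_latency_ms = meter.create_histogram("rag_phase_latency_ms", unit="ms")
stream_ttft_ms = meter.create_histogram("stream_ttft_ms", unit="ms")
stream_tpot_ms = meter.create_histogram("stream_tpot_ms", unit="ms")
guardrail_blocks_total = meter.create_counter("guardrail_blocks_total")
# Ratios such as rag_cache_hit_ratio are derived at query time from hit/request counters.

# Usage: one low-cardinality label set per data point.
rag_phase_latency_ms.record(42.0, {"phase": "vector_search"})
cost_usd_total.add(0.0143, {"model": "claude-sonnet-4", "feature": "support_assistant", "tenant": "t_4521"})
```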
Collector with three processors: PII redaction (allowlist-first), tail sampling (100% errors, 100% cost outliers, sampling of the rest), enrichment with centralized cost calculation.
Storage in three tiers: ClickHouse for hot (or vendor if volume is low), Kafka as buffer for heavy prompts/completions, S3 for cold with hash indexing.
Dashboards from day one: latency per phase, TTFT and TPOT per model, cost per feature with anomaly detection, error rate per LLM type, sampled quality per dimension, guardrail blocks per type, the three cross-layer correlations (quality vs retrieval, cost vs iterations, prompt tokens vs completeness).
With this set, the system is observable. Lacking any of these pieces, the system appears observable without being so.
Closing
The observability of AI systems is a discipline still taking shape. There is not yet a decade of refined practices comparable to those available for web services. What is available is a partial map, a set of emerging patterns, and a tooling ecosystem that changes every quarter. Against that volatility, the most useful thing is to understand the problems each layer solves and how they intersect.
Three operational principles:
Without measuring cost per feature, the system operates blind. The monthly bill is not a metric, it is a delayed punishment. Cost must be treated as a real-time, attributable, alertable dimension, on equal footing with latency and errors.
Without separating quality from latency as independent metrics, claiming the system works has no foundation. A 200 OK is not success in AI. A fast response can be a bad response. Without a continuous evaluation system with its own dashboard, what is measured is server availability, not product functionality.
Without tracing each agent iteration with its nested tool calls, there is no real understanding of the system. Runaway agents, sub-agents that consume the budget, chains of tool calls that fail silently: all these problems are invisible without hierarchical instrumentation. And all of them translate into cost or reputational loss.
The ideal observability system for an AI application is not the one with the most metrics or the most expensive. It is the one that spans the five layers and their correlations at once and allows answering, in less than five minutes, the questions that matter: why a response is bad, what it cost, which documents support it, whether the agent behaved as expected, whether there were attack attempts. When the observability system fails to answer these questions, the problem lies not in the tool but in the underlying conceptual model.