- Book: Observability for LLM Applications — paperback and hardcover on Amazon · Ebook from Apr 22
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
This post is not anti-Datadog. It is anti-stopping at Datadog.
Datadog is excellent at what APM has always been about: latency histograms, error budgets, service dependency graphs, the shape of a request as it moves through your stack. That stack was designed for HTTP. The correctness of an HTTP service is a property of the bytes on the wire. A 200 means the thing worked. A 500 means it didn't.
LLM applications break that assumption. The bytes on the wire are fluent paragraphs of natural language. Correctness lives inside them, not around them.
## The green dashboard, the red inbox
The canonical scene: a support ticket lands overnight. A customer says the AI assistant is "confidently making things up." The on-call opens the laptop.
p99 latency on the chat endpoint: 1.8 seconds, flat. Error rate: 0.00%. The Datadog APM flame graph shows the outbound call to api.anthropic.com at 1.42 seconds, returning 200. The vector DB span is green. Redis is green. Every span in the trace is green.
The customer pastes the offending query. The on-call pastes it into staging. The model returns something that looks reasonable. The same query in production returns a different answer. Neither is obviously wrong on its face. Both are paragraphs of fluent English that cite sources and sound confident.
The only person in the loop who can tell which one is correct is the customer, and the customer is asleep.
This is not an APM bug. APM is answering a different question.
## What APM measures
RED (rate, errors, duration). USE (utilization, saturation, errors). Both were built for a world where the correct behavior of a service was a question about its transport layer. Both are accurate. Both are useful. Both stop at the edge of the HTTP response.
Braintrust's 2026 piece puts the gap directly: "APM tools treat AI like any other service — they capture latency, error rates, and token counts, but don't evaluate whether the model's response was faithful, relevant, or safe."
Faithful. Relevant. Safe.
None of those are HTTP status codes.
## The four signals APM cannot see
The book walks through eight failure classes that return 200. Here are the four that show up most often in postmortems:
Hallucination. The response describes a universe that does not exist. APM view: HTTP 200, 1.4s, 312 output tokens, a few cents of cost. The string is well-formed. The string is wrong. Arize's LLM evaluation guide puts it plainly: "your application can exhibit perfect operational health while simultaneously failing users with factually incorrect, irrelevant, or unsafe content."
Silent provider drift. The model ID you call and the artifact your request is actually run against are not the same thing. The mapping is maintained by the vendor. It can change for a subset of traffic without a version bump. Your APM sees the same endpoint, the same status code, the same latency histogram. Users see a different model. The Anthropic August 2025 three-bug cascade is the canonical public example.
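One way to surface this, sketched here as an illustration rather than anything from the book: the semconv spans described later in this post carry both `gen_ai.request.model` (the alias you asked for) and `gen_ai.response.model` (the concrete artifact that actually ran). Tracking the mapping between the two catches a vendor-side remap without a version bump. `ModelDriftDetector` and its method names are hypothetical.

```python
# Hypothetical sketch: detect silent provider drift by tracking which
# concrete model ID (gen_ai.response.model) each requested alias
# (gen_ai.request.model) resolves to across spans.
from collections import defaultdict

class ModelDriftDetector:
    """Tracks the concrete artifacts seen behind each requested model alias."""

    def __init__(self):
        # alias -> set of concrete model IDs observed so far
        self.resolved = defaultdict(set)

    def observe(self, requested: str, resolved: str) -> bool:
        """Record one span's (request.model, response.model) pair.
        Returns True the first time a *new* concrete ID appears for an
        alias that already had a baseline — i.e. the drift moment."""
        known = self.resolved[requested]
        is_new = len(known) > 0 and resolved not in known
        known.add(resolved)
        return is_new

detector = ModelDriftDetector()
detector.observe("gpt-4o-mini", "gpt-4o-mini-2024-07-18")            # baseline
drifted = detector.observe("gpt-4o-mini", "gpt-4o-mini-2024-11-20")  # new artifact
```

Wired into a span processor, `drifted` becoming true is an alertable event — the endpoint, status code, and latency histogram all look identical before and after.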
Retrieval drift in RAG. You ship a RAG pipeline in March. It works. In June, support tickets start drifting up. The vector DB dashboards are clean. The embedding provider pushed a new version of their model while your index still holds vectors from the old one, and the two spaces no longer align. None of this shows up as errors.
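A minimal sketch of one countermeasure, assuming you keep a small fixed canary set of documents: periodically re-embed the canaries with the live embedding model and compare against the vectors stored at index time. A falling mean similarity means the index and the query embeddings no longer share a space. `drift_score` and the canary mechanism are illustrative, not a named technique from the book.

```python
# Hypothetical sketch: detect embedding-space drift by re-embedding a
# fixed canary set and comparing against the index-time vectors.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def drift_score(stored_vectors, fresh_vectors):
    """Mean cosine similarity between index-time and freshly computed
    embeddings of the same canary documents. ~1.0 means the spaces
    still align; a drop means the provider changed the model under you."""
    sims = [cosine(s, f) for s, f in zip(stored_vectors, fresh_vectors)]
    return sum(sims) / len(sims)

stored = [[1.0, 0.0], [0.0, 1.0]]   # vectors written at index time in March
fresh = [[1.0, 0.0], [0.0, 1.0]]    # same docs re-embedded today
score = drift_score(stored, fresh)
```

Run this on a schedule and chart `score` as its own time series; it degrades weeks before the support tickets do.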
Tool-call misfire in agents. An agent calls search_orders, finds the order, then calls email_customer with a polite explanation of the cancellation policy. It never calls refund_order. The customer is not refunded. Every span in the trace returned 200. Each tool call succeeded. The unit of failure is the relationship between spans. APM has no vocabulary for that.
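Since the unit of failure is the relationship between spans, the check has to run at the trace level. A minimal sketch of such an invariant, using the tool names from the scenario above (the rule itself is a hypothetical example, not a general-purpose library):

```python
# Hypothetical sketch: a trace-level invariant over an agent's ordered
# tool calls — every span can return 200 and the trace can still fail.
def check_refund_invariant(tool_calls):
    """tool_calls: ordered list of tool names extracted from one agent
    trace. Returns a violation message, or None if the trace is consistent.

    Rule: if the agent emailed the customer, a refund_order call must
    also appear in the same trace."""
    emailed = "email_customer" in tool_calls
    refunded = "refund_order" in tool_calls
    if emailed and not refunded:
        return "email_customer fired without refund_order in the same trace"
    return None

good = ["search_orders", "refund_order", "email_customer"]
bad = ["search_orders", "email_customer"]  # the failure from the scenario
violation = check_refund_invariant(bad)
```

The interesting design choice is where this runs: not inside the agent, but over exported traces, so the check survives prompt and tool-schema changes.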
## What the second signal looks like
Datadog, a few quarters late, has noticed. Datadog LLM Observability now consumes OpenTelemetry GenAI semconv v1.37+ natively. So does Langfuse (now owned by ClickHouse). So does Arize Phoenix. The way you ship to any of them is the same:
```python
# instrumentation.py — vendor-neutral LLM tracing.
from opentelemetry.instrumentation.openai_v2 import OpenAIInstrumentor

OpenAIInstrumentor().instrument()

# Every OpenAI call now emits a span with:
#   gen_ai.operation.name  = "chat"
#   gen_ai.provider.name   = "openai"
#   gen_ai.request.model   = "gpt-4o-mini"
#   gen_ai.response.model  = "gpt-4o-mini-2024-07-18"
#   gen_ai.usage.input_tokens / output_tokens
#   gen_ai.response.finish_reasons
```
That's the tracing layer. On top of it, you run an online eval: a rolling sample of production traces, scored by an LLM-as-judge on a multi-axis rubric. Faithfulness. Relevance. Format adherence. Safety. Cost per session. Each axis is its own time series. Each axis alerts independently.
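The shape of that eval layer can be sketched in a few lines. Everything here is illustrative: `AXES`, `score_trace`, and the judge callable are hypothetical names, and a real judge would be a separate model prompted with the trace plus a rubric.

```python
# Hypothetical sketch: score a sampled production trace on a multi-axis
# rubric, one score per axis, so each axis can chart and alert on its own.
AXES = ["faithfulness", "relevance", "format_adherence", "safety"]

def score_trace(trace, judge):
    """trace: the captured input/output (and retrieved context) for one
    sampled request. judge: callable(trace, axis) -> float in [0, 1],
    normally an LLM-as-judge call. Returns one score per rubric axis."""
    return {axis: judge(trace, axis) for axis in AXES}

def stub_judge(trace, axis):
    # Stand-in for the real judge model; returns a fixed score.
    return 0.9

scores = score_trace({"input": "...", "output": "..."}, stub_judge)
# Emit each scores[axis] as its own time series (e.g. a gauge per axis);
# each one alerts independently of the others.
```

Cost per session is the one axis that doesn't need a judge — it falls directly out of the `gen_ai.usage.*` token counts on the spans.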
That is the signal the on-call wanted at 02:14. A judge score that dropped from 0.92 to 0.78 three hours before the customer's ticket. A retrieval-relevance graph that has been sliding since June. A tool-call success rate per agent per tool that shows exactly which tool the agent has been misusing.
Your APM is not wrong. It is just not enough.
## If this was useful
The thesis of the book (Observability for LLM Applications) is narrow and load-bearing: the failure modes that matter in an LLM application are invisible to the observability stack you already run. Chapter 1 walks through the eight failure classes, Chapter 4 walks through OpenTelemetry GenAI semconv, and Part III (Chapters 8–11) walks through building the eval signal that sits on top.
- Book: Observability for LLM Applications — paperback and hardcover now; ebook April 22.
- Hermes IDE: hermes-ide.com — the IDE for developers shipping with Claude Code and other AI tools.
- Me: xgabriel.com · github.com/gabrielanhaia.
