- Book: LLM Observability Pocket Guide
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
At 11pm last Tuesday a support ticket came in: the bot quoted a 90-day refund policy that hasn't existed since 2022, cited a clause that was never in the contract, and offered to escalate to a manager who left in 2023. The APM dashboard was green. P99 latency at 1.4s. Error rate flat at 0.02%. Token throughput rising in a healthy line.
Every metric on the dashboard is correct. Every metric is also, for this kind of bug, useless.
APMs were not originally built to measure correctness. They were built to instrument HTTP calls, database queries, and queue lag: work where "did it succeed" is a status code. LLM calls succeed almost always. The 200 lies. Vendors including Datadog are now shipping LLM Observability features to close exactly this gap, but the broader category still treats correctness as a bolt-on. What you need on top is a correctness layer: a second class of telemetry that asks not "did the call return" but "was the answer right."
What APM sees and what it misses
Datadog, New Relic, and Dynatrace each treat an LLM call as one more outbound HTTP span. They give you what HTTP gives you: status code, latency, request size, response size. With LLM-aware product tiers, they also pull usage.input_tokens and usage.output_tokens from the response and compute a cost. Datadog shipped native OTel GenAI semconv support in v1.37, New Relic AI Monitoring ingests the same gen_ai.* attributes, and Dynatrace's Davis AI maps them into its observability views. The gen_ai.* attributes from the official OTel spec flow into all three.
That's a real upgrade. It's also still the same shape of telemetry: operational, not semantic. Latency, throughput, error, cost. None of these say anything about whether the model said something true.
Compare to a normal API. If a payments service charges $50 instead of $5, the operational metrics still go green: 200 OK, p99 fine, throughput fine. The bug is in the payload. We historically caught those bugs with contract tests and golden invariants: code that asserts on the response body, not just the response code.
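For contrast, here's that pattern one layer down: a minimal, hypothetical contract test (the endpoint and payload shape are made up) that asserts on the body, not the status code.
# Hypothetical golden-invariant test. The status-code assertion passes either way;
# the payload assertion is what catches the $50-instead-of-$5 bug.
import requests
def test_charge_amount_matches_invoice():
    resp = requests.post(
        "https://payments.internal/charge",  # stand-in URL
        json={"invoice_id": "inv_123", "amount_cents": 500},
        timeout=5,
    )
    assert resp.status_code == 200                # what APM sees
    assert resp.json()["amount_cents"] == 500     # the invariant APM never checks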
LLM correctness is the same problem one layer up. The "payload" is now a paragraph of natural language. You cannot grep it for a bug. You can, however, score it.
The four signals the correctness layer needs
After a year of working through this with teams shipping production LLM features, I keep seeing the same four signals in every deployment that's worth the effort.
1. Judge-based output scoring. Run a second model against a sample of production responses. Ask it: given the prompt, the retrieved context, and the response, score this on faithfulness, relevance, and harmfulness. This is the LLM-as-a-judge pattern that Langfuse, Arize Phoenix, and Arize all ship templates for. It's noisy: judges and humans disagree often enough to matter, on the order of 10–20% in published evals (see Zheng et al., "Judging LLM-as-a-Judge"). It still catches the failure modes APM can't see at all.
2. Golden-set drift. Keep a fixed set of 100–500 known prompts with known acceptable answers. Re-run them on every model swap, every prompt edit, every retrieval-config change. Track the eval score over time. When the line moves down, you have a regression, even if your traffic dashboard is unchanged. (A CI sketch of this check follows the list.)
3. Retrieval-relevance check (for RAG). Score the retrieved chunks against the user's query before the LLM ever sees them. Low retrieval relevance → high hallucination risk, every time. This is the metric where the failure mode "the bot answered without the right context" becomes legible.
4. Per-tenant cost alerting. This one is operational, not semantic. It still lives in the correctness layer because the cost signal is one of the few that surfaces abuse, prompt injection, and runaway agents. A single tenant's token spend jumping from $4 to $400 within an hour is a signal you cannot get from aggregate dashboards.
Datadog now ships hallucination detection inside their LLM Observability product, which covers signal #1 directly. The other three (golden-set drift, retrieval relevance, per-tenant cost) either aren't shipped natively or sit behind "buy the right tier, configure the right monitor."
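Signal #2 is the easiest of the three to build yourself. A minimal sketch of the CI check, assuming you keep the golden prompts in a local JSON file and reuse the answer() and judge() functions defined later in this post; the file shape and threshold are illustrative, not prescriptive.
# Illustrative golden-set drift check, meant to run at 100% (no sampling) in CI on
# every prompt, model, or retrieval-config change.
# Assumes golden_set.json shaped like [{"q": ..., "ctx": ...}, ...] and the
# answer() / judge() functions shown below.
import json
import statistics
def run_golden_set(path: str = "golden_set.json", threshold: float = 0.8) -> None:
    with open(path) as f:
        cases = json.load(f)
    scores = [
        judge(case["q"], case["ctx"], answer(case["q"], case["ctx"]))["score"]
        for case in cases
    ]
    p50 = statistics.median(scores)
    print(f"golden-set median score: {p50:.2f} across {len(cases)} cases")
    assert p50 >= threshold, "correctness regression: fail the build"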
A working decorator
The correctness layer, as code. A Python decorator that wraps any LLM call, emits an OTel trace per the GenAI semconv, and on a sampled fraction of calls fires a synchronous judge call and writes the score back as a span attribute.
import os
import random
import json
from functools import wraps
from typing import Callable
from openai import OpenAI
from opentelemetry import trace
client = OpenAI()
tracer = trace.get_tracer("llm.correctness")
JUDGE_PROMPT = """You are a strict evaluator. Score the response on
faithfulness to the provided context (0.0 - 1.0). Return JSON:
{"score": float, "reason": str}
Question: {q}
Context: {ctx}
Response: {resp}"""
def judge(q: str, ctx: str, resp: str) -> dict:
    # Ask a second model to score the response; temperature 0 and JSON mode
    # keep the verdict deterministic and parseable.
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(q=q, ctx=ctx, resp=resp),
            }
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(out.choices[0].message.content)
The judge function is plain. The decorator below is where the OTel work happens. Note: the gen_ai.prompt / gen_ai.completion span attributes used here track OTel GenAI semconv v1.27; newer semconv revisions move prompt and completion content onto log events instead, so pin the version your collector understands.
def observed_llm(judge_sample_rate: float = 0.05):
    def deco(fn: Callable):
        @wraps(fn)
        def wrapper(question: str, context: str, **kwargs):
            with tracer.start_as_current_span("llm.call") as span:
                # Operational telemetry on the parent span.
                span.set_attribute("gen_ai.system", "openai")
                span.set_attribute("gen_ai.request.model", "gpt-4o-mini")
                span.set_attribute("gen_ai.prompt", question[:1000])
                response = fn(question, context, **kwargs)
                span.set_attribute("gen_ai.completion", response[:1000])
                # Semantic telemetry: judge a sampled fraction of calls.
                if random.random() < judge_sample_rate:
                    with tracer.start_as_current_span("llm.judge") as j:
                        result = judge(question, context, response)
                        j.set_attribute("llm.judge.score", result["score"])
                        j.set_attribute("llm.judge.reason", result["reason"][:500])
                        span.set_attribute("llm.correctness.score", result["score"])
                return response
        return wrapper
    return deco
Use it like this:
@observed_llm(judge_sample_rate=0.05)
def answer(question: str, context: str) -> str:
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return r.choices[0].message.content
There are two spans. The parent llm.call carries the operational telemetry (model, prompt, completion). The child llm.judge carries the semantic telemetry (score, reason). Both share a trace ID. Your APM reads the parent and renders it the way it always has. Your observability backend reads llm.correctness.score and alerts when the rolling p50 drops below 0.7.
The sampling rate matters too. Judging every call doubles your LLM bill and adds latency to every request. 5% sampled is enough to detect a meaningful score drop within an hour at moderate traffic. Bump to 100% for golden-set runs in CI; keep it sampled in production.
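Back-of-envelope, with the traffic numbers as pure assumptions:
# Rough sampling math; the request rate is a made-up example, plug in your own.
requests_per_hour = 2_000
judge_sample_rate = 0.05
judged_per_hour = int(requests_per_hour * judge_sample_rate)
print(judged_per_hour)  # 100 scored responses per hour
# ~100 judged responses is enough sample to notice a p50 score shift within the
# hour, while only 1 request in 20 pays the extra judge latency and cost.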
Where this fits in OTel land
The OTel GenAI semconv defines a gen_ai.* namespace: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, prompts, completions. The decorator above writes a subset of those; a real implementation pulls token counts off the API response and writes them too.
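A sketch of that extra write, assuming the response is a ChatCompletion object from the openai SDK; the helper name is mine, the attribute names come from the semconv.
# Copy token counts from the SDK response onto the llm.call span.
# completion.usage can be None (e.g. on some streamed responses), so guard it.
def record_token_usage(span, completion) -> None:
    usage = getattr(completion, "usage", None)
    if usage is not None:
        span.set_attribute("gen_ai.usage.input_tokens", usage.prompt_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", usage.completion_tokens)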
The judge score is not in the standard yet. The GenAI semconv working group is still hashing out a gen_ai.evaluation.* namespace, so attributes like llm.correctness.score are vendor-shaped today. Langfuse, Phoenix, and Arize each map them differently into their UIs. Pick a name and stick to it across services until the spec stabilizes.
If you're already running an OTel collector, this decorator slots into the existing pipeline. The collector forwards the llm.call and llm.judge spans to whichever backend you point at: Datadog, Langfuse, Phoenix, all of the above. You don't pick "APM versus correctness." You run both and let them read different attributes.
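If the tracer in the snippets above isn't wired up yet, a minimal SDK setup looks roughly like this; the endpoint is an assumption, point it at your own collector.
# Minimal OTel SDK wiring so the llm.call / llm.judge spans reach a collector.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)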
The honest version
Three things this decorator does not do, and that the rest of your observability stack still has to cover.
It doesn't replace golden-set CI runs. Sampled judges in production catch regressions days after they ship; golden sets catch them before a deploy. You need both.
It doesn't catch silent retrieval failure by itself. To score retrieval, you'd add another span (rag.retrieve) with a relevance score against the query, before the LLM call ever happens. The pattern is the same; the prompt is different.
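For what it's worth, a sketch of that span, reusing the client, tracer, and json imports from the decorator section; the prompt text, attribute name, and model choice are illustrative, not from any spec.
# Illustrative rag.retrieve span: score the chunks against the query before the
# LLM call ever happens. rag.relevance.score is my own attribute name.
RETRIEVAL_JUDGE_PROMPT = """Score how relevant the retrieved context is to the
question (0.0 - 1.0). Return JSON: {{"score": float, "reason": str}}
Question: {q}
Context: {ctx}"""
def score_retrieval(question: str, chunks: list[str]) -> float:
    joined = "\n---\n".join(chunks)
    with tracer.start_as_current_span("rag.retrieve") as span:
        out = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": RETRIEVAL_JUDGE_PROMPT.format(q=question, ctx=joined),
            }],
            response_format={"type": "json_object"},
            temperature=0,
        )
        score = json.loads(out.choices[0].message.content)["score"]
        span.set_attribute("rag.relevance.score", score)
        return score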
It doesn't tell you why. A score of 0.4 says the answer was bad. The reason field gives you a clue. Reproducing requires the prompt, the context, the model version, and the sampling parameters all on the same span. Make sure your tracer captures them.
What goes on the dashboard
If you're staring at your APM tomorrow and want to bolt panels onto it, this is the order I'd build them:
- Rolling p50 of llm.correctness.score over 1h, by route. This is your hallucination canary.
- Count of llm.correctness.score < 0.5 per hour, by route. Failure rate, not average. Averages hide cliff edges.
- Per-tenant token cost over 24h, alerting on > 5x of trailing 7-day p99. Catches abuse and prompt-injection loops.
- Retrieval-relevance score, p10 over 1h. When this drops, the LLM is about to hallucinate; you have ~minutes of warning.
These four panels tell you everything the green dashboard was lying about. None of them are exotic; all of them are downstream of writing one extra attribute on a span. The hard part isn't the metric: it's deciding the metric matters before something embarrassing ships and a customer screenshots it.
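If you want to prototype the cost panel's alert logic before you have a dashboard for it, here's a toy version, assuming a hypothetical get_daily_spend() query against wherever your token metrics land.
# Toy per-tenant cost alert: flag any tenant whose spend today exceeds 5x the
# p99 of its trailing 7 days. get_daily_spend() is hypothetical: it returns a
# list of daily dollar totals from your metrics backend.
import statistics
def tenants_to_alert(tenants: list[str]) -> list[str]:
    flagged = []
    for tenant in tenants:
        history = get_daily_spend(tenant, days=7)      # e.g. [12.4, 9.8, ...]
        today = get_daily_spend(tenant, days=1)[-1]    # today's running total
        baseline = statistics.quantiles(history, n=100)[98]  # ~p99 of trailing week
        if today > 5 * baseline:
            flagged.append(tenant)
    return flagged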
If this was useful
The LLM Observability Pocket Guide goes deeper on the correctness layer end-to-end: what to put on each span per the OTel GenAI semconv, how to wire judges as child spans without doubling your bill, the eval rigs that catch silent regressions, and how to tell when "we need observability" actually means "we need a different model." If your APM is green and your support queue is on fire, the book is for you.
