I'm learning LLM observability the way most people learn things in 2026: by asking models to walk me through it. The prompts are mine, written from "I don't fully understand this yet." The depth comes from the model. The verification — re-running the queries, sanity-checking the math, anonymizing the screenshots — is mine again. I publish what comes out so whoever's behind me on the same path can skip the early confusion.
Three days ago I audited a self-hosted Langfuse instance and found a 32% error rate, a max_tokens=720000 bug, and a $1.11 single call from untruncated retrieval context. Then I audited the LLM-as-a-judge layer on top of it and found that 22 percentage points of the Hallucination score were pipeline errors being graded as model output.
This week I re-pulled the same instance. The fixes landed. The numbers got dramatically better. And the data exposed a different bug — one that the previous audits couldn't see because the noise floor was too high.
This is what changed, what's still broken, and the new problem hiding under "everything looks great."
1. Before / after, on the same instance
| Metric | 3 days ago | Today |
|---|---|---|
| Error rate (application calls) | 32% | 0.0% |
| In/out token ratio | 97:1 | 1.8:1 |
| `max_tokens` bug calls | 91 (28% of traffic) | 0 |
| Invalid model slugs in pool | 2 (`openrouter/free`, `gemma-4-26b-a4b-it`) | 1 |
| Cost over window | $2.86 | $0.00 |
| Throughput | bursty, user-driven | flat 20 traces/hour |
Four bugs from the previous audit are gone:
- `max_tokens=720000` corrected — no more context-overflow rejections.
- `openrouter/free` removed from routing — the slug that was failing 100%.
- Retrieval context truncation in place — the in/out token ratio dropped 50×.
- Premium models pulled from the eval mix — the entire fleet is on the `:free` tier.
One remains: `google/gemma-4-26b-a4b-it:free` is still in the pool. One call slipped through today. Cheap fix.
2. The new shape of the data
Today's traffic is not user traffic. It's a benchmark loop:
```
trace.name distribution (today, 400 traces):

OpenRouter Request                  100   ← actual application calls
Execute evaluator: Correctness      100   ← judge calls
Execute evaluator: Hallucination    100   ← judge calls
Execute evaluator: Toxicity         100   ← judge calls
```
Twenty traces per hour, every hour, for nineteen hours. This is exactly what you want during a stabilization phase — you're not depending on users to surface variance; you're feeding it on a timer. It's also why a single-judge metric saturating to 1.000 is dangerous right now, which is the rest of this post.
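If you want to reproduce the cadence check, a minimal sketch against an exported trace table is enough; the file name and the `timestamp` / `name` columns below are placeholder assumptions, not the actual Langfuse export schema.

```python
# Sketch: confirm the benchmark loop cadence from a trace export.
# Assumptions: a CSV with one row per trace and 'timestamp' / 'name' columns.
import pandas as pd

traces = pd.read_csv("traces_export.csv", parse_dates=["timestamp"])

# Distribution of trace names over the window (app calls vs. judge calls)
print(traces["name"].value_counts())

# Traces per hour: a flat benchmark loop shows up as a near-constant count
per_hour = traces.set_index("timestamp").resample("1h").size()
print(per_hour.describe())
```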
3. The Correctness leaderboard saturated
```
Correctness (n≥3, today, level != ERROR):

inclusionai/ling-2.6-1t:free                      1.000   n=3
minimax/minimax-m2.5:free                         1.000   n=8
meta-llama/llama-3.2-3b-instruct:free             1.000   n=6
nvidia/nemotron-3-nano-omni-30b-reasoning:free    1.000   n=4
poolside/laguna-m.1:free                          1.000   n=4
openai/gpt-oss-20b:free                           1.000   n=8
openai/gpt-oss-120b:free                          1.000   n=6
tencent/hy3-preview:free                          1.000   n=3
poolside/laguna-xs.2:free                         1.000   n=7
liquid/lfm-2.5-1.2b-instruct:free                 0.857   n=7
meta-llama/llama-3.3-70b-instruct:free            0.833   n=6
qwen/qwen3-next-80b-a3b-instruct:free             0.833   n=6
nvidia/nemotron-nano-9b-v2:free                   0.800   n=10
qwen/qwen3-coder:free                             0.750   n=4
```
Three days ago `tencent/hy3-preview:free` was at the bottom with 0.573. Today it's tied at 1.000 with eight other models. The model didn't get better. The benchmark prompt set is too easy for this rubric to discriminate.
If you stop here and act on this leaderboard, you'll route equal weights to a 1.2B parameter model and a 120B parameter model on the basis that they're "equivalently correct." They're not. The judge can't tell, on this prompt set, with this rubric.
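For reference, a leaderboard like the one above can be rebuilt from exported scores with a few lines of pandas. This is only a sketch: the `model`, `level`, and `correctness` column names are illustrative assumptions about the export shape.

```python
# Sketch: Correctness leaderboard with the same filters (level != ERROR, n >= 3).
# Assumptions: one row per scored generation with 'model', 'level', 'correctness'.
import pandas as pd

scores = pd.read_csv("correctness_scores.csv")

board = (
    scores[scores["level"] != "ERROR"]
    .groupby("model")["correctness"]
    .agg(mean="mean", n="count")
    .query("n >= 3")
    .sort_values("mean", ascending=False)
)
print(board)

# Saturation check: how many models sit at the ceiling?
print((board["mean"] >= 0.999).sum(), "of", len(board), "models at 1.000")
```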
4. Where the rubric actually broke
When two judges run on the same generation and disagree wildly, you have a rubric problem. Today's data has 17 of these on 100 application calls — a 17% rate of judge disagreement.
Same observation, two different verdicts:
```
[obsId=5d42ef596a8f] poolside/laguna-m.1:free
output: <verbatim copy of the input prompt, no real generation>

Correctness   = 1.0   "exact match to the provided ground truth"
Hallucination = 0.0   "exact copy of input query, fails to produce content"
```
The model echoed the prompt back instead of answering. The Correctness judge rewards textual match against the reference output. The Hallucination judge penalizes outputs that produce no real content. Both are correct readings of their own rubric. Both are looking at the same broken output. They reach opposite conclusions.
The pattern repeats across `poolside/laguna-m.1` (3 cases), `openai/gpt-oss-120b` (2 cases), `nvidia/nemotron-nano-9b-v2` (2 cases), and 10 other models with one each.
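Flagging these cases doesn't require another judge. A sketch of the filter, assuming the two scores are already joined into one row per observation; the column names and cutoffs are illustrative:

```python
# Sketch: find observations where Correctness says "perfect" and Hallucination
# says "no real content". Assumptions: a joined table with one row per
# observation and 'obs_id', 'model', 'correctness', 'hallucination' columns.
import pandas as pd

df = pd.read_csv("judge_scores_joined.csv")

disagree = df[(df["correctness"] >= 0.9) & (df["hallucination"] <= 0.1)]
print(f"{len(disagree)} / {len(df)} observations with opposite verdicts")
print(disagree.groupby("model").size().sort_values(ascending=False))
```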
5. Cross-judge correlation, three time windows
Pearson r(Correctness, Hallucination) on the same observations:
```
audit 1 (May 02-03, n=72)  : r =  0.018
audit 2 (May 02-05, n=143) : r =  0.056
today   (May 06,    n=100) : r = -0.027
```
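The r values are plain Pearson correlations per window; a sketch of the computation on the same hypothetical joined table, with an assumed `audit` label column:

```python
# Sketch: cross-judge Pearson correlation per audit window.
# Assumptions: the joined score table from above, plus an 'audit' label column.
import pandas as pd

df = pd.read_csv("judge_scores_joined.csv")

for window, g in df.groupby("audit"):
    r = g["correctness"].corr(g["hallucination"])  # Pearson by default
    print(f"{window}: r = {r:+.3f} (n={len(g)})")
```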
Three independent samples. Three near-zero correlations. Two LLM judges scoring closely related concepts on the same outputs agree at chance level, consistently, across five days.
This is not a bug in either judge. It's a property of the rubrics: "matches reference" and "introduces no fabricated content" measure genuinely different things. A prompt-echo can satisfy the first while failing the second. A creative-but-wrong answer can satisfy the second while failing the first. The two scores are nearly statistically independent.
The operational rule: never ship a routing change based on a single judge improving. You'd be optimizing one axis while a second judge silently regresses on the orthogonal one.
6. Toxicity is dead weight
```
Toxicity scores today: 100 / 100 = 0.000
```

Same as the previous audit. The judge prompt is fine — the comments are coherent ("neutral instructions, no harmful content"). The workload simply contains zero toxic content. Running this judge spends `gemini-2.5-flash` tokens to produce a constant.
If your workload is agent-instruction-shaped, Toxicity is the wrong third judge. Better candidates:
- Echo Detection: boolean — is the output a verbatim copy of the input? This would have caught all 17 of the disagreements above without an LLM call (Levenshtein distance suffices).
- Format Compliance: does the output respect the expected schema? On agent workloads, malformed JSON is the most common silent failure.
- Refusal Detection: did the model decline? Correctness scores a refusal as 0 even when refusal was the right action. A separate signal would let you distinguish "incorrect" from "refused, possibly correctly."
7. Five fixes, prioritized
1. **Add an anti-echo clause to the Correctness rubric.** Append to the prompt: "If the generation echoes the input/prompt without producing a substantive response, score 0 regardless of textual overlap with the ground truth." This breaks the artificial 1.000 ceiling on prompt-echo cases.
2. **Add a deterministic echo detector at the pipeline level.** Hash + normalized Levenshtein on input vs. output, threshold at 0.85. Cheaper, faster, and not dependent on LLM judge interpretation (a sketch follows this list).
3. **Replace Toxicity with Format Compliance or Echo Detection.** A constant signal is no signal. The token budget is better spent elsewhere.
4. **Diversify the benchmark prompt set.** The current set saturates this rubric. Add: multi-step reasoning, strict format constraints, refusal-eligible prompts, adversarial paraphrases.
5. **Remove `google/gemma-4-26b-a4b-it:free` from the routing pool.** Confirmed invalid slug, surviving from the previous audit by inertia.
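A minimal sketch of fix 2, standard library only. The 0.85 threshold comes from the fix above; the normalization (strip + lowercase) and the function names are illustrative choices, not an existing pipeline API.

```python
# Sketch: deterministic echo detector. Exact copies short-circuit on a hash;
# near-copies fall back to normalized Levenshtein similarity at the 0.85 threshold.
import hashlib


def _levenshtein(a: str, b: str) -> int:
    # Two-row dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]


def is_echo(prompt: str, output: str, threshold: float = 0.85) -> bool:
    p, o = prompt.strip().lower(), output.strip().lower()
    if hashlib.sha256(p.encode()).digest() == hashlib.sha256(o.encode()).digest():
        return True  # verbatim copy of the input
    longest = max(len(p), len(o)) or 1
    similarity = 1.0 - _levenshtein(p, o) / longest
    return similarity >= threshold
```

Logged as a boolean score alongside the judges, a check like this would have caught all 17 of today's disagreements without a single LLM call.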
8. The pattern across three audits
Each audit revealed problems the previous one couldn't see:
- Audit 1 found infrastructure bugs (errors, oversized contexts, invalid slugs). The judge layer was being run, but its output was contaminated by infrastructure noise — the leaderboard reflected which models tolerated bad inputs, not which models were good.
- Audit 2 quantified the contamination: 22 percentage points of judge score were pipeline errors. Filtering them out produced a usable leaderboard.
- Audit 3 (today) found that fixing the infrastructure exposed a new failure mode: prompt-echo outputs that pass Correctness while failing Hallucination, with the leaderboard saturating to 1.000 and hiding the difference between models.
Each layer of fix lets you see the next layer of bug. The data was never wrong — your noise floor was just too high to read it.
If you're standing up an LLM judge pipeline, expect this sequence. Don't trust the first leaderboard. Don't trust the second one either. Cross-correlate two judges with non-overlapping rubrics, and treat sustained disagreement as a feature: it's where the real failure modes live.
Self-hosted Langfuse + OpenRouter. Internal hostnames, user IDs, and product codenames omitted. Public model slugs preserved verbatim for reproducibility.