Disclosure: I don't write these analyses alone. I'm learning LLM observability the same way most people are learning anything new in 2026 — by asking models to walk me through it. The prompts are mine, the depth comes from the model, the verification is mine again. I publish what I learn so others tracing the same path don't have to start from zero. With that out of the way:
A self-hosted Langfuse instance running a custom LLM-as-a-judge evaluator with a Hallucination rubric flagged 86% of scored generations as hallucinating. That number, taken at face value, would suggest a fleet of completely broken models. The number is misleading. After resolving all 72 scores back to their underlying observations, the picture splits cleanly in two: roughly 42% of the flagged "hallucinations" are infrastructure failures the judge cannot see, and the remaining 58% are real model behavior — split across four distinct failure modes that need different fixes.
This is a follow-up to a prior audit of the same instance (previous post). The new dimension here is automated quality scoring, and what it teaches you about your evaluator stack the moment you take it seriously.
1. The headline number, and why it is wrong
The Hallucination evaluator scored 72 generations across the project's free-tier model fleet. Distribution:
value=1.0   55   (flagged)
value=0.9    3
value=0.8    4
value=0.5    1
value=0.2    1
value=0.0    8   (faithful)

mean = 0.856  →  "86% hallucinating"
A scalar mean across 72 scores does not tell you why. The first useful split is by the observation's level field, which Langfuse populates from the SDK and tells you whether the underlying API call succeeded:
level=ERROR 28 / 72 (the API call itself failed)
level=DEFAULT 44 / 72 (call succeeded; output exists)
Now cross that with the score:
flagged (score > 0.5): 62
└─ level=ERROR: 26 (42% of flagged)
└─ level=DEFAULT: 36 (58% of flagged)
unflagged (score <= 0.5): 10
└─ level=ERROR: 2
└─ level=DEFAULT: 8
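If you want this split for your own project, it is a one-liner once scores are joined to observations. A sketch, assuming the df built by the reproduction script in section 6:

import pandas as pd

# df comes from the reproduction script in section 6:
# one row per Hallucination score, with "level" and "score" columns.
print(pd.crosstab(df["level"], df["score"] > 0.5, margins=True))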
The judge fires on 26 generations where the upstream model never produced a response. These are not hallucinations. They are pipeline failures the judge has no way to recognize as such.
2. Why the judge cannot see infrastructure
Inspect a flagged-as-hallucinating but level=ERROR observation:
# Input (what the model was asked to do)
{
  "messages": [
    {"role": "system", "content": "You are a context summarization assistant. ..."},
    {"role": "user", "content": "..."}
  ]
}

# Output (what got logged)
{
  "completion": null,
  "reasoning": null,
  "rawRequest": {
    "model": "openrouter/free",
    "max_completion_tokens": 720000,
    "stream": true,
    ...
  }
}
The LLM-as-a-judge sees a valid prompt and an "answer" that isn't an answer. Naturally it concludes the model failed to follow instructions. Its comment for one such case:
The generation is an exact copy of the input prompt … indicating a complete failure to follow instructions.
The model never ran. The output object is the request configuration, not a completion. The previous audit identified two reasons this happens at scale on this instance: an invalid model slug (openrouter/free) and a max_completion_tokens of 720000. Both cause OpenRouter to reject the request gateway-side. The SDK then logs the request envelope as the "output" because there is no completion to record.
The implication is that an LLM-as-a-judge is structurally blind to your infrastructure. It scores the artifact in front of it, not the path that produced it. If your evaluator is computing aggregate metrics over scored runs without filtering on level != "ERROR", those metrics are contaminated by infrastructure noise in direct proportion to your error rate.
The fix is one filter, applied before any aggregation:
# wrong: averages over failed calls too
hallucination_rate = df["score"].mean()

# right: only score successful generations
genuine = df[df["level"] != "ERROR"]
hallucination_rate = genuine["score"].mean()
For this dataset that single filter changes the headline from 0.856 to 0.689. Still high, and still the real problem — but no longer inflated by 17 points of pipeline noise.
3. The 36 genuine hallucinations cluster into four patterns
Filtering to flagged + non-error leaves 36 generations. Reading through every judge comment, the failures cluster into four distinct patterns:
Pattern A — Prompt echo (most frequent)
The model returns the input verbatim instead of executing the task. Example judge comment:
The generation is a verbatim copy of the input query, including both system and user messages, instead of generating the requested JSON agent profile.
This is not classical hallucination. Classical hallucination is the model confidently inventing facts. Prompt echo is more interesting: the model outputs the conversation as if it were continuing it, treating the system prompt as user content to be summarized. This is a known failure mode of small instruction-tuned models on highly structured tasks (e.g. "produce a JSON with fields X, Y, Z given this conversation"). Models in the 3B–30B range fail this way more often than 70B+ models do.
By model, prompt-echo dominates among the smallest free-tier slugs in the fleet (llama-3.2-3b-instruct, nemotron-nano-9b-v2, nemotron-nano-12b).
Fix: bind these models to simpler tasks (classification, extraction with regex-validated outputs) and route structured-summary tasks to a 70B+ tier. A pydantic schema validator on the output, with a single-shot retry on parse failure, eliminates most of the user-facing impact.
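A minimal sketch of that guard, assuming pydantic v2. AgentProfile and call_model are hypothetical stand-ins for your output schema and model client:

from pydantic import BaseModel, ValidationError

class AgentProfile(BaseModel):  # hypothetical output schema
    name: str
    role: str
    goals: list[str]

def generate_profile(prompt: str, call_model) -> AgentProfile | None:
    # One initial attempt plus a single-shot retry on parse failure.
    for _ in range(2):
        raw = call_model(prompt)
        try:
            return AgentProfile.model_validate_json(raw)
        except ValidationError:
            # A prompt echo never parses as the schema, so it dies here
            # instead of reaching the user.
            prompt += "\nReturn ONLY the JSON object, nothing else."
    return None  # caller escalates to the 70B+ tier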
Pattern B — Fabricated tool APIs
The agent invents endpoints, fields, or response shapes for tools that exist conceptually but whose schemas the model never saw. Example:
The agent hallucinated the existence and API structure for interacting … with specific body parameters. This information was not provided in the context.
The model knew the goal (interact with a post), didn't have the tool schema, and confabulated a plausible REST shape (POST /v1/posts/interact with a body that "feels right"). The judge correctly catches this.
Fix: this is a tool-binding problem, not a model problem. Either (a) provide the tool schema explicitly via function-calling APIs, or (b) wrap the unknown surface with a tool that returns its own OpenAPI spec on demand. Models stop fabricating when they have something concrete to bind to.
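As an illustration of option (a), here is a tool schema in the OpenAI-style function-calling format. The tool name and fields are invented for this example, not a real API:

# Hypothetical tool schema; the point is that every field the model
# needs is declared, so there is nothing left to confabulate.
interact_tool = {
    "type": "function",
    "function": {
        "name": "interact_with_post",
        "description": "Comment on or upvote an existing post.",
        "parameters": {
            "type": "object",
            "properties": {
                "post_id": {"type": "string"},
                "action": {"type": "string", "enum": ["comment", "upvote"]},
                "comment_text": {"type": "string"},
            },
            "required": ["post_id", "action"],
        },
    },
}
# Passed via the provider's tools parameter, e.g.
# client.chat.completions.create(..., tools=[interact_tool])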
Pattern C — Tool-output misinterpretation
The agent runs a malformed command, gets a success-shaped response from a permissive runner, and proceeds as if the command worked.
The assistant's initial tool call to exec a curl command was syntactically incorrect, concatenating two URLs with a comma. Despite this, the simulated tool output indicated "success": true, which is implausible for such a malformed command.
This is partly a tool design failure: the runner returned success: true for a failed command. But the model also failed to notice the implausibility. Two failures stacked.
Fix: tool runners should never return success: true on non-zero exit codes. Have the runner inject the exit code, stderr, and the exact command executed into the tool result. Models read these signals when they are present.
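A minimal sketch of that contract, wrapping subprocess: success is derived from the exit code, and the model sees the command, exit code, and stderr in every result:

import shlex
import subprocess

def run_command(cmd: str, timeout: int = 30) -> dict:
    proc = subprocess.run(
        shlex.split(cmd), capture_output=True, text=True, timeout=timeout
    )
    return {
        "command": cmd,                    # echo back exactly what ran
        "exit_code": proc.returncode,
        "success": proc.returncode == 0,   # never True on non-zero exit
        "stdout": proc.stdout[-4000:],     # truncated, not dropped
        "stderr": proc.stderr[-4000:],     # the signal the model reads
    }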
Pattern D — Instruction skipping in long system prompts
The agent retrieves the right context but skips explicit imperative steps in the system prompt.
The assistant retrieves relevant posts but does not comment or upvote them as directed. It also consistently fails to update the timestamp in memory state as instructed.
Long system prompts with multi-step procedural instructions get partial execution from smaller models. The agent does the cognitively easy parts (search, retrieve) and skips the parts that require tool calls with side effects.
Fix: decompose the procedure into discrete tool calls with explicit ordering. A plan_then_execute wrapper that forces the model to enumerate steps before executing them measurably reduces step-skipping. So does demoting procedural instructions out of the system prompt and into a tool whose first action is to read the procedure.
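A sketch of what such a wrapper can look like; plan_then_execute, call_model, and execute_step are hypothetical names, not a library API:

import json

def plan_then_execute(task: str, call_model, execute_step) -> list:
    # First force the model to enumerate the full procedure, including
    # the side-effect steps it tends to skip.
    plan = call_model(
        f"Task: {task}\n"
        "Return a JSON array of strings, one per required step. "
        "Include every step with side effects (comment, upvote, memory update)."
    )
    steps = json.loads(plan)
    # Then execute one step per call, so a skipped step is visible
    # in the trace instead of silently absent.
    results = []
    for step in steps:
        results.append(execute_step(step, context=results))
    return results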
4. Hallucination and Correctness do not agree
The same instance runs a separate Correctness evaluator (also LLM-as-a-judge, also gemini-2.5-flash as the judge model). Both scored the same 72 traces. Pearson correlation between the two scores per trace:
r(Hallucination, Correctness) = 0.018
Statistically zero. Two judges run on the same generations, scoring closely related concepts, agree at chance level.
This is worth pausing on. It does not mean either judge is wrong. It means that:
- The two rubrics are measuring genuinely different things. Correctness rewards whether the output matches a reference. Hallucination punishes invention not grounded in the input. A model can be correct and invent reasoning to get there. A model can be incorrect and never invent anything (e.g. by refusing or echoing).
- Aggregating quality from a single judge is unreliable. If you ship a release based on Hallucination ↑, you may be shipping Correctness ↓ and never see it.
- The signal-to-noise ratio of LLM judges on free-tier model outputs is low enough that you should treat any single-judge metric as a directional indicator, not a number to optimize against directly.
The practical move is to score a small held-out set with multiple rubrics, treat their disagreement as a feature (it tells you which dimension a regression hit), and reserve human eval for the disagreements.
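A sketch of both moves, assuming the scores list fetched by the reproduction script in the next section. The 1 - Hallucination flip assumes higher Correctness and lower Hallucination both mean better:

import pandas as pd

# scores: the raw list from /api/public/scores (see section 6)
pairs = pd.DataFrame(scores)[["traceId", "name", "value"]]
wide = (
    pairs.pivot_table(index="traceId", columns="name", values="value")
         .dropna(subset=["Hallucination", "Correctness"])
)

# Pearson correlation between the two judges (r = 0.018 here)
print(wide["Hallucination"].corr(wide["Correctness"]))

# Rank traces by how strongly the judges' implied quality verdicts
# conflict; the top of this list is the human-eval queue.
wide["disagreement"] = ((1 - wide["Hallucination"]) - wide["Correctness"]).abs()
review_queue = wide.sort_values("disagreement", ascending=False).head(20)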
5. What changes operationally
Five concrete changes from this analysis:
1. Filter out level == "ERROR" observations before any aggregate quality metric. The current dashboard reads mean(Hallucination) = 0.856. After filtering: 0.689. The 0.17 difference is pure infrastructure noise.

2. Tag judge runs with the input/output shape they saw. Add a failed_pipeline boolean to score metadata when the output is a request envelope, not a completion. Most teams don't do this; it makes the artifact-vs-content distinction queryable (a detection sketch follows this list).

3. Route structured-output tasks away from sub-30B models. Prompt-echo is concentrated in this size class on this workload. The fix is routing, not prompting.

4. Wrap tool runners to never return success: true on non-zero exit. This single change eliminates the entire Pattern C failure class.

5. Run two judges with different rubrics on the same data and watch their disagreement, not their agreement. Where they diverge is where the real quality signal lives.
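For change 2, the detection half is a shape check; this heuristic mirrors the reproduction script in section 6. Whether your Langfuse version accepts metadata on score writes is worth verifying before you rely on it:

def is_request_envelope(output) -> bool:
    """True when the logged output is the request config, not a completion."""
    return (
        isinstance(output, dict)
        and output.get("completion") is None
        and "rawRequest" in output
    )

# Attach the result as a failed_pipeline flag wherever your scores
# carry metadata, so dashboards can filter on it directly.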
6. Code: how to reproduce this analysis on your own instance
import os
import httpx
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

BASE = os.environ["LANGFUSE_BASE_URL"].rstrip("/")
AUTH = (os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"])

def paginate(client, path, params=None):
    params = dict(params or {})
    params.setdefault("limit", 100)
    page = 1
    while True:
        params["page"] = page
        j = client.get(f"{BASE}{path}", params=params).json()
        yield from j.get("data", [])
        if page >= j.get("meta", {}).get("totalPages", 1):
            break
        page += 1

with httpx.Client(auth=AUTH, timeout=60) as c:
    scores = list(paginate(c, "/api/public/scores"))

H = [s for s in scores if s["name"] == "Hallucination"]

# Hallucination scores attach to OTel-style 16-char span IDs.
# These don't appear in the bulk /observations list — fetch each directly.
def fetch_obs(obs_id):
    with httpx.Client(auth=AUTH, timeout=30) as c:
        r = c.get(f"{BASE}/api/public/observations/{obs_id}")
        return r.json() if r.status_code == 200 else None

with ThreadPoolExecutor(max_workers=8) as ex:
    obs_by_id = dict(zip(
        [s["observationId"] for s in H],
        ex.map(fetch_obs, [s["observationId"] for s in H]),
    ))

rows = []
for s in H:
    o = obs_by_id.get(s["observationId"])
    if not o:
        continue
    rows.append({
        "score": s["value"],
        "model": o.get("model"),
        "level": o.get("level"),
        "is_pipeline_failure": (
            isinstance(o.get("output"), dict)
            and o["output"].get("completion") is None
        ),
    })

df = pd.DataFrame(rows)
genuine = df[~df["is_pipeline_failure"]]
print(f"Raw mean: {df['score'].mean():.3f}")
print(f"Filtered: {genuine['score'].mean():.3f}")
print(f"Pipeline-noise contribution: {df['score'].mean() - genuine['score'].mean():.3f}")
Two API endpoints, one filter, and the difference between a number that misleads and a number that helps.
7. The meta-lesson
Hallucination evaluators are useful. They surface patterns that no static metric will. But like any LLM-graded signal, the score is a function of what the judge can see — and the judge's view is exactly what the SDK chose to log. If your SDK logs request envelopes when calls fail, your judge will score request envelopes. If your judge scores request envelopes, your dashboard will tell you the model is hallucinating when in fact your gateway is rejecting requests.
Aggregate metrics from a single judge over unfiltered data are not signals. They are an average of signal and noise that you have to separate by hand the first time, and then bake into your pipeline so it stays separated. The good news is that the separation is cheap once you've done it once. The bad news is that nobody does it once until they have a number that looks suspicious enough to investigate.
Eighty-six percent looked suspicious enough.
Top comments (1)
Really valuable breakdown — the infrastructure-vs-model distinction is something most teams miss entirely.
We ran into exactly this pattern building our own AI data pipeline: our hallucination scores were inflated because failed API calls were being passed to the judge as if they were model outputs. The judge had no visibility into the HTTP layer, so it just saw garbage and flagged it.
Pattern B (fabricated tool APIs) is especially painful at scale. When your agent is working with dozens of tools and starts inventing endpoint schemas, the failures are silent in the worst way — it looks like it's working until you diff the actual API response.
The r=0.018 correlation between hallucination and correctness is the key insight here. Most teams I've talked to treat these as proxies for each other. Separating them into distinct rubrics and using disagreement as a signal is a much more mature approach.
Thanks for publishing this — bookmarking for our next eval design review.