陈瀚

A Postmortem on Autonomous LLM-as-Judge: How My Eval Agent Got Two Verdicts Wrong Before I Found a Sandbox Bug

I run an autonomous eval agent against new coding-agent stacks before trusting their numbers. The setup is standard: same task, same workflow, swap the shell × model combo, score the resulting diff on six dimensions. Last week the eval gave me a verdict that turned out to be wrong — twice — for the same root cause. The agent generating the verdict never flagged any uncertainty.

I'm sharing the postmortem because the failure mode is the kind of thing that quietly poisons any LLM-as-judge pipeline running in production, and mine only got caught because I happened to ask the right follow-up question.

Three combos, identical task, scored autonomously by Claude Code (Opus 4.6) running headless in a fresh session for each retest.

Exhibit A: the eval agent's verdicts

Run 1. C1 (OpenCode + MiniMax-M2.7) scored 15/60. Verdict in the auto-generated report:

"Consistent with previous results: fast execution but no meaningful code output."

Run 2. Fresh session, no memory of run 1. C1 scored 16/60. New verdict, written confidently:

"Consistent: MiniMax cannot implement the task. The model may lack the capability to read external files and produce code changes in this Rust codebase."

Read that quote again. The agent identified the exact symptom — "may lack the capability to read external files" — and immediately blamed the model. It never asked the next question: is something in my pipeline preventing the agent from reading external files in the first place?

Two independent autonomous reports, both confidently ranking MiniMax dead last. If I'd shipped this leaderboard at this point, no one downstream would have questioned it — the wording was airtight.

The investigation that should have happened on run 1

I sent one instruction to a fresh session: "go deeper, check the daemon logs before retrying."

That's all. No hint about where to look, no hypothesis.

The new session traced the plan step's output to a spill file at `~/.orchestratord/logs/<task_id>.txt`. The plan step itself was working fine — producing 50KB of useful context. But the OpenCode shell runs its agent inside a sandbox that, by default, only allows reads inside the workspace directory. The spill file was outside the workspace. So `implement` was getting an empty string, not the plan output.

plan step:      ✅ success (50KB output spilled to disk)
implement step: receives empty string, produces nothing
eval step:      "MiniMax cannot implement the task."
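The failure shape above can be modeled in a few lines. This is a hypothetical sketch, not the orchestrator's actual code: it assumes a sandbox policy that silently resolves out-of-workspace reads to nothing, which is exactly the symptom the third session found.

```python
from pathlib import Path

def read_step_output(spill_path: Path, workspace: Path) -> str:
    # Hypothetical model of the sandbox policy: only reads under the
    # workspace root succeed. An out-of-workspace spill file yields an
    # empty string instead of an error -- a silent failure, so the next
    # step sees "no plan output" rather than "read denied".
    resolved = spill_path.resolve()
    if workspace.resolve() not in resolved.parents:
        return ""  # sandbox denial, surfaced as empty output
    try:
        return resolved.read_text()
    except OSError:
        return ""
```

The dangerous property is that the denial is indistinguishable, downstream, from a model that genuinely produced nothing — which is precisely how two verdicts got written.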

Two confident wrong verdicts, one config bug.

The session filed a one-line config fix (spill path goes inside the workspace), then re-ran the whole benchmark. C1 produced real code this time: 219 lines added, a `RetryConfig` struct, an actual `connect_with_retry` helper. Score: 18/60 — still mediocre, because the model's unit tests had four type-mismatch compile errors. But that's a real model weakness, not an infrastructure mirage.
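The actual fix lives in the repo; the shape of it is just "derive the spill path from the workspace instead of the home directory." A minimal sketch — the `.orchestrator/spills` location is my assumption, not the real layout:

```python
from pathlib import Path

def spill_path(workspace: Path, task_id: str) -> Path:
    # Fixed default: spill files live under the workspace root, so every
    # sandboxed agent in the harness can read them back. (Hypothetical
    # subdirectory name; the real fix may differ.)
    p = workspace / ".orchestrator" / "spills" / f"{task_id}.txt"
    p.parent.mkdir(parents=True, exist_ok=True)
    return p
```

The design point is that the readable-from-sandbox property is now guaranteed by construction rather than being an undocumented assumption about where `~` resolves.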

Same numerical score range as before (15→16→18). Completely different story underneath.

What this means for production LLM-as-judge

The piece that should make anyone running autonomous eval uncomfortable: the agent never decided on its own to check the daemon logs. The first two sessions ran the exact prompt that production eval pipelines use ("execute the benchmark, collect artifacts, write a report") and produced confident, well-structured, plausible failure analysis. Neither session paused on the line "may lack the capability to read external files" to ask whether the pipeline was the cause.

The bug was discoverable. The third session found it in a single investigation pass with no hint — it just had to be told to look. So the fix isn't "use a smarter model"; the fix is structural.

What I changed in the pipeline:

  1. Spill paths now default to a workspace-relative location that is sandbox-readable from every agent sandbox in the harness. (Previously this was an undocumented assumption.)
  2. The eval prompt now includes a mandatory "sanity-check the harness" step that runs before the agent is allowed to attribute failure to the model. The step looks for specific symptoms (empty stdin/stdout, missing context blocks, sandbox denials in logs) and surfaces them as harness candidates rather than letting them silently shape the verdict.
  3. Any verdict containing absolute language like "cannot" or "incapable" is flagged for human review against quantitative artifacts (event logs, exit codes) before it lands in the leaderboard. Two of the three retests above produced exactly such language; both were wrong.
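The third change is mechanical enough to show. A hypothetical version of the flag — the term list is illustrative, not the one actually deployed:

```python
import re

# Absolute-language markers that trigger human review before a verdict
# lands in the leaderboard. Illustrative list; extend as needed.
ABSOLUTE_TERMS = re.compile(r"\b(cannot|incapable|impossible|never)\b",
                            re.IGNORECASE)

def needs_human_review(verdict: str) -> bool:
    # Confident absolute claims get checked against quantitative
    # artifacts (event logs, exit codes) before publication.
    return bool(ABSOLUTE_TERMS.search(verdict))
```

Both wrong verdicts above would have tripped this: "cannot implement the task" matches on the first term.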

None of these are clever. They're the kind of thing you put in after something like this has happened, not before. Which is the actual point of the postmortem.
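For concreteness, the harness sanity-check from step 2 reduces to a symptom scan before the verdict is allowed to mention the model. This is a sketch under my own assumptions about the step-input and log shapes, not the pipeline's real interface:

```python
def harness_symptoms(step_inputs: dict[str, str], daemon_log: str) -> list[str]:
    # Collect harness-failure candidates the judge must rule out before
    # attributing a failure to the model. Interface is hypothetical.
    symptoms = []
    for step, payload in step_inputs.items():
        if not payload.strip():
            symptoms.append(f"empty input to step '{step}'")
    lowered = daemon_log.lower()
    for marker in ("permission denied", "sandbox", "read-only"):
        if marker in lowered:
            symptoms.append(f"daemon log mentions '{marker}'")
    return symptoms
```

In the runs above, this would have flagged the empty `implement` input on run 1 — before any verdict was written.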

Bonus oddity for the eval-pipeline obsessives

Same retest, separate finding: in the post-fix run, the winning combo (Codex + GPT-5.4, 50/60, 12 passing tests, clippy-clean) had a `step_finished` success rate of 25% — three of its four orchestrator steps reported failure. Meanwhile the worst combo (the one that almost got blamed for not knowing how to read files, 18/60) had a 50% step success rate.

The "step success rate" dimension turned out to be inversely correlated with code quality in this run, because the failing steps were self_test and benchmark_eval — both downstream of implement, both apparently buggy themselves. Another reminder: agent eval metrics are mostly noise unless someone has personally verified each one means what you think it means.

(And yes, my eval agent — also Claude Code — gave Codex + GPT-5.4 the highest score but not a perfect one. It insists this is purely on the merits.)

Where this all happened

The orchestrator and workflow definitions are open-sourced at github.com/c9r-io/orchestrator. The fix is FR-092. The agent manifests, the benchmark workflow, and the exact prompts used for both the eval agent and the target agents are in fixtures/benchmarks/. If you're running an autonomous eval pipeline of your own and want to sanity-check it against this failure mode, the spill-path/sandbox interaction is the specific thing to look for.

The orchestrator isn't the interesting part of this post. The interesting part is that an autonomous evaluator confidently produced two wrong reports, never flagged uncertainty, and the only reason I caught it is one human follow-up question.
