<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: 陈瀚</title>
    <description>The latest articles on DEV Community by 陈瀚 (@gpgkd906).</description>
    <link>https://dev.to/gpgkd906</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3789263%2Fb49c13d9-51a8-4536-93a7-d69b58d7b165.jpeg</url>
      <title>DEV Community: 陈瀚</title>
      <link>https://dev.to/gpgkd906</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gpgkd906"/>
    <language>en</language>
    <item>
      <title>A Postmortem on Autonomous LLM-as-Judge: How My Eval Agent Got Two Verdicts Wrong Before I Found a Sandbox Bug</title>
      <dc:creator>陈瀚</dc:creator>
      <pubDate>Wed, 08 Apr 2026 13:26:20 +0000</pubDate>
      <link>https://dev.to/gpgkd906/a-postmortem-on-autonomous-llm-as-judge-how-my-eval-agent-got-two-verdicts-wrong-before-i-found-a-a3j</link>
      <guid>https://dev.to/gpgkd906/a-postmortem-on-autonomous-llm-as-judge-how-my-eval-agent-got-two-verdicts-wrong-before-i-found-a-a3j</guid>
      <description>&lt;p&gt;I run an autonomous eval agent against new coding-agent stacks before trusting their numbers. The setup is standard: same task, same workflow, swap the shell × model combo, score the resulting diff on six dimensions. Last week the eval gave me a verdict that turned out to be wrong — twice — for the same root cause. The agent generating the verdict never flagged any uncertainty.&lt;/p&gt;

&lt;p&gt;I'm sharing the postmortem because the failure mode is the kind of thing that quietly poisons any LLM-as-judge pipeline running in production, and mine only got caught because I happened to ask the right follow-up question.&lt;/p&gt;

&lt;p&gt;Three combos, identical task, scored autonomously by Claude Code (Opus 4.6) running headless in a fresh session for each retest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exhibit A: the eval agent's verdicts
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Run 1.&lt;/strong&gt; C1 (OpenCode + MiniMax-M2.7) scored &lt;strong&gt;15/60&lt;/strong&gt;. Verdict in the auto-generated report:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Consistent with previous results: fast execution but no meaningful code output."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Run 2.&lt;/strong&gt; Fresh session, no memory of run 1. C1 scored &lt;strong&gt;16/60&lt;/strong&gt;. New verdict, written confidently:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Consistent: MiniMax cannot implement the task. The model may lack the capability to read external files and produce code changes in this Rust codebase."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Read that quote again. The agent identified the &lt;em&gt;exact&lt;/em&gt; symptom — "may lack the capability to read external files" — and immediately blamed the model. It never asked the next question: &lt;em&gt;is something in my pipeline preventing the agent from reading external files in the first place?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Two independent autonomous reports, both confidently ranking MiniMax dead last. If I'd shipped this leaderboard at this point, no one downstream would have questioned it — the wording was airtight.&lt;/p&gt;

&lt;h2&gt;
  
  
  The investigation that should have happened on run 1
&lt;/h2&gt;

&lt;p&gt;I sent one instruction to a fresh session: &lt;em&gt;"go deeper, check the daemon logs before retrying."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's all. No hint about where to look, no hypothesis.&lt;/p&gt;

&lt;p&gt;The new session traced the plan step's output to a spill file at &lt;code&gt;~/.orchestratord/logs/&amp;lt;task_id&amp;gt;.txt&lt;/code&gt;. The plan step itself was working fine — producing 50KB of useful context. But the OpenCode shell runs its agent inside a sandbox that, by default, only allows reads inside the workspace directory. The spill file was outside the workspace. So &lt;code&gt;implement&lt;/code&gt; was getting an empty string, not the plan output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plan step:      ✅ success (50KB output spilled to disk)
implement step: receives empty string, produces nothing
eval step:      "MiniMax cannot implement the task."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
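&lt;p&gt;A cheap preflight would have caught this class of bug before any verdict was written. A minimal sketch, with hypothetical paths standing in for the real config values:&lt;/p&gt;

```shell
# Sketch: fail loudly if a configured spill path falls outside the
# workspace, since a workspace-only sandbox will treat it as unreadable.
# Both paths here are hypothetical stand-ins, not the pipeline's real config.
workspace="/work/repo"
spill_dir="/home/ci/.orchestratord/logs"   # pre-fix style default, outside the workspace

case "$spill_dir" in
  "$workspace"/*) echo "spill path readable from the sandbox" ;;
  *)              echo "WARN: spill path outside workspace" ;;
esac
```

&lt;p&gt;Run at harness startup, a check like this turns a silent empty-string handoff into a loud warning on the first pass instead of the third session.&lt;/p&gt;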



&lt;p&gt;Two confident wrong verdicts, one config bug.&lt;/p&gt;

&lt;p&gt;The session filed a one-line config fix (spill path goes inside the workspace), then re-ran the whole benchmark. C1 produced real code this time: 219 lines added, a &lt;code&gt;RetryConfig&lt;/code&gt; struct, an actual &lt;code&gt;connect_with_retry&lt;/code&gt; helper. Score: 18/60 — still mediocre, because the model's unit tests had four type-mismatch compile errors. But that's a &lt;em&gt;real&lt;/em&gt; model weakness, not an infrastructure mirage.&lt;/p&gt;

&lt;p&gt;Same numerical score range as before (15→16→18). Completely different story underneath.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for production LLM-as-judge
&lt;/h2&gt;

&lt;p&gt;The piece that should make anyone running autonomous eval uncomfortable: &lt;strong&gt;the agent never decided on its own to check the daemon logs&lt;/strong&gt;. The first two sessions ran the exact prompt that production eval pipelines use ("execute the benchmark, collect artifacts, write a report") and produced confident, well-structured, plausible failure analysis. Neither session paused on the line &lt;em&gt;"may lack the capability to read external files"&lt;/em&gt; to ask whether the pipeline was the cause.&lt;/p&gt;

&lt;p&gt;The bug was discoverable. The third session found it in a single investigation pass with no hint — it just had to be told to look. So the fix isn't "use a smarter model"; the fix is structural.&lt;/p&gt;

&lt;p&gt;What I changed in the pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Spill paths now default to a workspace-relative location&lt;/strong&gt; that is readable from every agent sandbox in the harness. (Previously this was an undocumented assumption.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The eval prompt now includes a mandatory "sanity-check the harness" step&lt;/strong&gt; that runs before the agent is allowed to attribute failure to the model. The step looks for specific symptoms (empty stdin/stdout, missing context blocks, sandbox denials in logs) and surfaces them as harness candidates rather than letting them silently shape the verdict.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Any verdict containing absolute language like "cannot" or "incapable"&lt;/strong&gt; is flagged for human review against quantitative artifacts (event logs, exit codes) before it lands in the leaderboard. Two of the three retests above produced exactly such language; both were wrong.&lt;/li&gt;
&lt;/ol&gt;
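&lt;p&gt;The third change is simple enough to sketch. A minimal version of the absolute-language gate, with an assumed word list; the real pipeline's list and review mechanism are not shown here:&lt;/p&gt;

```shell
# Sketch: flag verdicts containing absolute language for human review
# before they reach the leaderboard. Word list is an assumption.
verdict="Consistent: MiniMax cannot implement the task."

flagged=no
for word in cannot incapable impossible never; do
  # space-padding is a crude word boundary check
  case " $verdict " in
    *" $word "*) flagged=yes ;;
  esac
done
echo "needs_human_review=$flagged"
```

&lt;p&gt;Space-padding is a crude word boundary and misses punctuation-adjacent matches; a production version would want real tokenization. The point is that the gate is trivial to add, and in this incident it would have fired on both wrong verdicts.&lt;/p&gt;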

&lt;p&gt;None of these are clever. They're the kind of thing you put in &lt;em&gt;after&lt;/em&gt; something like this has happened, not before. Which is the actual point of the postmortem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bonus oddity for the eval-pipeline obsessives
&lt;/h2&gt;

&lt;p&gt;Same retest, separate finding: in the post-fix run, the &lt;em&gt;winning&lt;/em&gt; combo (Codex + GPT-5.4, 50/60, 12 passing tests, clippy-clean) had a &lt;code&gt;step_finished&lt;/code&gt; success rate of &lt;strong&gt;25%&lt;/strong&gt; — three of its four orchestrator steps reported failure. Meanwhile the &lt;em&gt;worst&lt;/em&gt; combo (the one that almost got blamed for not knowing how to read files, 18/60) had a &lt;strong&gt;50%&lt;/strong&gt; step success rate.&lt;/p&gt;

&lt;p&gt;The "step success rate" dimension turned out to be inversely correlated with code quality in this run, because the failing steps were &lt;code&gt;self_test&lt;/code&gt; and &lt;code&gt;benchmark_eval&lt;/code&gt; — both downstream of &lt;code&gt;implement&lt;/code&gt;, both apparently buggy themselves. Another reminder: agent eval metrics are mostly noise unless someone has personally verified each one means what you think it means.&lt;/p&gt;

&lt;p&gt;(And yes, my eval agent — also Claude Code — gave Codex + GPT-5.4 the highest score but not a perfect one. It insists this is purely on the merits.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this all happened
&lt;/h2&gt;

&lt;p&gt;The orchestrator and workflow definitions are open-sourced at &lt;a href="https://github.com/c9r-io/orchestrator" rel="noopener noreferrer"&gt;github.com/c9r-io/orchestrator&lt;/a&gt;. The fix is FR-092. The agent manifests, the benchmark workflow, and the exact prompts used for both the eval agent and the target agents are in &lt;code&gt;fixtures/benchmarks/&lt;/code&gt;. If you're running an autonomous eval pipeline of your own and want to sanity-check it against this failure mode, the spill-path/sandbox interaction is the specific thing to look for.&lt;/p&gt;

&lt;p&gt;The orchestrator isn't the interesting part of this post. The interesting part is that an autonomous evaluator confidently produced two wrong reports, never flagged uncertainty, and the only reason I caught it is one human follow-up question.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>postmortem</category>
      <category>debugging</category>
    </item>
    <item>
      <title>From Auth9 to Agent Orchestrator: how an AI-native development method evolved into a Harness Engineering control plane</title>
      <dc:creator>陈瀚</dc:creator>
      <pubDate>Sun, 29 Mar 2026 12:21:46 +0000</pubDate>
      <link>https://dev.to/gpgkd906/from-auth9-to-agent-orchestrator-how-an-ai-native-development-method-evolved-into-a-harness-2mhd</link>
      <guid>https://dev.to/gpgkd906/from-auth9-to-agent-orchestrator-how-an-ai-native-development-method-evolved-into-a-harness-2mhd</guid>
      <description>&lt;p&gt;I have spent years practicing extreme programming and TDD. So when AI coding tools became good enough to handle a meaningful share of day-to-day work, I adopted them quickly and enthusiastically.&lt;/p&gt;

&lt;p&gt;Then I hit a very predictable wall.&lt;/p&gt;

&lt;p&gt;I became the bottleneck.&lt;/p&gt;

&lt;p&gt;AI could write code quickly. It could write tests quickly too. But the final question, "is this actually correct?", still landed on me. I had to review the implementation, run the environment, click through flows in the browser, inspect application logs, check database state, and decide whether the output was real or just superficially plausible.&lt;/p&gt;

&lt;p&gt;In other words, AI had accelerated generation, but I was still manually carrying too much of the verification burden. The faster the model became, the more manual review and QA work accumulated around me.&lt;/p&gt;

&lt;p&gt;That was the moment I started pushing testing even further left, but this time not just in the classic TDD sense. I started pushing the entire validation loop left.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shifting the validation loop left
&lt;/h2&gt;

&lt;p&gt;I stopped waiting until after implementation to think seriously about testing.&lt;/p&gt;

&lt;p&gt;Before writing code, I started requiring AI to produce explicit test plans. After implementation, AI was not allowed to stop at "done". It had to execute the validation plan: drive the browser, inspect logs, check database state, compare outcomes against expectations, create structured tickets for failures, fix them, and rerun the relevant checks until the results converged.&lt;/p&gt;
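&lt;p&gt;That repair loop is the part worth automating. A self-contained sketch of its control flow, with a stand-in validation step; the real step drives a browser, inspects logs, and checks database state:&lt;/p&gt;

```shell
# Sketch of the ticket/fix/retest loop. run_validation is a stand-in
# that fails twice and then passes, so the script is self-contained;
# a real harness would execute the full validation plan here.
attempts_needed=3
run_validation() { [ "$1" -ge "$attempts_needed" ]; }

attempt=1
while [ "$attempt" -le 5 ]; do
  if run_validation "$attempt"; then
    echo "converged on attempt $attempt"
    break
  fi
  echo "attempt $attempt failed: file ticket, apply fix, retest"
  attempt=$((attempt + 1))
done
```

&lt;p&gt;The essential property is the exit condition: the agent is not allowed to declare "done" until the validation step itself passes or the attempt budget runs out.&lt;/p&gt;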

&lt;p&gt;Over time, this stopped feeling like a set of prompting tricks and started feeling like a method. The method included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;test plans before implementation&lt;/li&gt;
&lt;li&gt;structured docs for QA, security, and UI/UX verification&lt;/li&gt;
&lt;li&gt;ticket-driven repair loops&lt;/li&gt;
&lt;li&gt;doc governance to keep the verification layer from rotting&lt;/li&gt;
&lt;li&gt;reusable Skills that encode repeatable development and validation behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point I needed a real proving ground.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Auth9 mattered
&lt;/h2&gt;

&lt;p&gt;I chose &lt;a href="https://github.com/c9r-io/auth9" rel="noopener noreferrer"&gt;Auth9&lt;/a&gt;, a full identity platform, because I wanted something difficult enough to make the method fail if it was weak.&lt;/p&gt;

&lt;p&gt;Identity systems are full of dangerous edges: protocol semantics, state transitions, interoperability, security constraints, permission models, and long tails of compatibility work. If a method can help govern that kind of system, it is probably doing something real.&lt;/p&gt;

&lt;p&gt;Auth9 was where this approach became concrete. While building it, I kept refining the method itself: how docs should be governed, which checks needed to become standard, how to turn recurring behavior into Skills, and how to keep the ticket/fix/retest loop honest.&lt;/p&gt;

&lt;p&gt;As the project evolved through real iterations, I became convinced that this was not just a convenient way to ship features faster. It was becoming a viable way to govern complex software over time.&lt;/p&gt;

&lt;p&gt;That was when Agent Orchestrator began.&lt;/p&gt;

&lt;p&gt;I did not begin with a plan to build a platform. I began with a method that was proving useful, and I no longer wanted to supervise every step manually. If the method was real, it should be able to keep running after I stepped away from the keyboard. That requirement naturally pulled me toward a control plane.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mid-March test
&lt;/h2&gt;

&lt;p&gt;By mid-February, I was already using early Orchestrator-style automation inside Auth9. In mid-March, I decided to run a high-risk experiment: I wanted to see whether Orchestrator and this method could actually carry a complex low-level refactor.&lt;/p&gt;

&lt;p&gt;I replaced the headless Keycloak setup under Auth9 with a native &lt;code&gt;auth9-oidc&lt;/code&gt; engine.&lt;/p&gt;

&lt;p&gt;For an identity platform, replacing the underlying OIDC engine is not a cosmetic change. It touches protocol behavior, state flow, interoperability assumptions, and a long list of edge cases that often surface only after the "main" work seems complete.&lt;/p&gt;

&lt;p&gt;Orchestrator did not magically remove the difficulty, but it did provide a structure around the work: execution flow, task state, logs, tickets, and repeatable validation.&lt;/p&gt;

&lt;p&gt;The core replacement landed over three days. OIDC conformance and Keycloak legacy cleanup followed within the same week.&lt;/p&gt;

&lt;p&gt;More importantly, the story did not end there. The same method and Orchestrator-assisted workflow helped converge the technical debt that surfaced after the change, and eventually completed the community OIDC Certification tests on the native engine.&lt;/p&gt;

&lt;p&gt;That sequence mattered more to me than the initial three-day number. It showed that the method and the tool were not only useful for greenfield development. They could also help govern a high-risk system through long-running, uncomfortable change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the project is called Orchestrator
&lt;/h2&gt;

&lt;p&gt;At the time, the word I cared most about was orchestration.&lt;/p&gt;

&lt;p&gt;What I wanted most was a reliable way to orchestrate this method: execute the next step, keep state, record logs, preserve intermediate outputs, stop safely, and resume later. So the project became Agent Orchestrator.&lt;/p&gt;

&lt;p&gt;The broader conceptual framing came later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Later, I found a better name for the category
&lt;/h2&gt;

&lt;p&gt;When OpenAI later used the term Harness Engineering, I immediately recognized the shape of the work. Not because I thought I had coined the idea first, but because the term described something I had already been converging on through practice.&lt;/p&gt;

&lt;p&gt;The point was larger than orchestration alone. What mattered was the full harness around the agent: workflow, constraints, observability, recovery, and feedback loops.&lt;/p&gt;

&lt;p&gt;That is why I now describe Agent Orchestrator as a Harness Engineering control plane. The project name came first; the clearer positioning came later.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Agent Orchestrator actually is
&lt;/h2&gt;

&lt;p&gt;It is a local-first control plane for shell-based coding agents. Agents, workflows, and step templates are declared in YAML. A daemon (&lt;code&gt;orchestratord&lt;/code&gt;) schedules steps, routes work by capability, keeps task state in SQLite, streams logs, and enforces guardrails such as sandboxing and output redaction. The CLI is machine-parseable so agents can drive it too.&lt;/p&gt;

&lt;p&gt;Some design choices matter a lot here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;local-first runtime&lt;/strong&gt; so the control plane stays close to the repository&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQLite-backed task state&lt;/strong&gt; so long-running work remains inspectable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;machine-readable CLI output&lt;/strong&gt; so agents can participate directly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;declarative YAML resources&lt;/strong&gt; so the workflow logic lives outside one model session&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;support for heterogeneous shell agents&lt;/strong&gt; so the method does not depend on one vendor&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To me, Orchestrator is not another code-generation plugin. It is the control surface that lets this method keep running.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part I trust most
&lt;/h2&gt;

&lt;p&gt;One reason I trust this framing is that the project has also been used on itself.&lt;/p&gt;

&lt;p&gt;Self-bootstrap and self-evolution were important validation paths from early on. If the method is real, it should not only work on downstream projects. It should also survive contact with the control plane itself.&lt;/p&gt;

&lt;p&gt;If you are also trying to turn repeated AI-assisted engineering work into something more systematic and more durable, that is exactly the gap I built Orchestrator to address.&lt;/p&gt;

&lt;p&gt;You can try it here:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;c9r-io/tap/orchestrator
&lt;span class="c"&gt;# or&lt;/span&gt;
cargo &lt;span class="nb"&gt;install &lt;/span&gt;orchestrator-cli orchestratord
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;orchestratord &lt;span class="nt"&gt;--foreground&lt;/span&gt; &lt;span class="nt"&gt;--workers&lt;/span&gt; 2 &amp;amp;
orchestrator init
orchestrator apply &lt;span class="nt"&gt;-f&lt;/span&gt; manifest.yaml
orchestrator task create &lt;span class="nt"&gt;--goal&lt;/span&gt; &lt;span class="s2"&gt;"My first task"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Docs&lt;/strong&gt;: &lt;a href="https://docs.c9r.io" rel="noopener noreferrer"&gt;docs.c9r.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/c9r-io/orchestrator" rel="noopener noreferrer"&gt;github.com/c9r-io/orchestrator&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth9&lt;/strong&gt;: &lt;a href="https://github.com/c9r-io/auth9" rel="noopener noreferrer"&gt;github.com/c9r-io/auth9&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License&lt;/strong&gt;: MIT&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;It is open source, still evolving, and I would genuinely like to hear how other people are turning repeated AI-assisted engineering work into something that can survive real software delivery.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>ai</category>
      <category>automation</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
