Bohyeon Jang

Posted on May 31

Why I used three different critic roles instead of one (and what the eval taught me)

#llm #python #ai #evaluation

Why I used three different critic roles instead of one (and what the eval taught me)

I built Crucible: three specialized critic agents that audit any LLM output in parallel, an adjudicator that synthesizes their critiques into a confidence-scored verdict, and an eval harness that measures whether the whole thing actually works better than just asking a single model to check itself.

Here is what I learned, including the part where the honest answer is "not as much as I hoped."

The problem: a model cannot reliably audit its own blind spots

When a language model generates output, it has already committed to a direction. Ask it to self-review and it will often ratify the same confident mistake it just made, not because it is lazy, but because self-review activates the same internal heuristics that produced the error.

The failure mode that made this concrete for me: imagine an LLM answering a question about file storage and it says "save uploads to /uploads/ on the server." That looks reasonable in isolation. The model reviews it and says "looks good." But the advice assumes a single-server deployment. In a horizontally-scaled setup, that /uploads/ directory does not exist on every instance, and you now have a race condition that corrupts user data in production.

The model did not hallucinate. It gave correct advice for the wrong context. Self-review did not catch it because both passes made the same contextual assumption.

Multi-agent verification is the obvious response: get independent perspectives that do not share the same failure mode.

Three roles, not three instances of the same model

The naive version of "multi-agent review" is: run the same model three times with slightly different temperatures and hope disagreement surfaces problems. That is mostly noise. You get variance in phrasing, not in perspective.

Crucible uses three structurally different critic roles:

Accuracy critic: are the claims true and internally consistent? Hallucinated entities, wrong numbers, citations that do not exist.
Logic critic: does the reasoning follow? Is the conclusion actually supported by the premises given?
Completeness critic: what is missing? What did the prompt ask for that the output omitted?

Each critic has a narrow mandate: explicit instructions not to stray into the other dimensions. The accuracy critic is told: "Do NOT comment on logic flow or completeness. Stay strictly on factual correctness." This is deliberate. Focused critics produce cleaner signal. A generalist critic reviewing everything at once tends to cluster around the most obvious problem and miss the others.

The adjudicator then reads all three critiques and produces a typed verdict: confirmed_issues (issues the adjudicator judged real and consequential, where cross-critic agreement is strong signal but a clear high-severity single-critic flag also qualifies), dismissed_flags (issues a critic raised that the adjudicator overruled as out-of-scope, pedantic, or insufficiently supported), and a quality_score with a confidence rating.

The dismissed_flags field turned out to be one of the more useful things in practice. When only one critic fires on something, that is often a false positive, a critic being overzealous within its dimension. The adjudicator's job is to apply cross-critic weight, not just union every flag.

The asyncio.gather decision: why not LangGraph

I looked at LangGraph. For a 3-node fan-out (run three critics in parallel, collect results, pass to adjudicator) it is ceremony. Here is the actual orchestration in Crucible:

raw = await asyncio.gather(
    *(critic.run(output_text, original_prompt, model) for critic in CRITICS),
    return_exceptions=True,
)

That is four lines. LangGraph would have given me a graph definition, node registration, state management, and a debugging UI that I would never open. At this scale, the abstraction costs more than it saves.

The decision record I wrote for this project has a line I keep coming back to: "For a 3-node fan-out, LangGraph is ceremony. asyncio.gather is ~4 lines and easier to explain in an interview." Not a knock on LangGraph. It genuinely earns its keep at larger scale. But building something you cannot explain in five minutes is not a feature.

The provider question: Claude x 3, with a path to diversity

Here is a decision I want to be transparent about. The brief for Crucible called for three different providers: GPT-4o for accuracy, Claude for logic, Gemini for completeness. The theory is sound. Different training data means different failure modes, so you are less likely to have all three critics share the same blind spot.

I built the architecture to support this. The provider resolution is in src/providers.py: if OPENAI_API_KEY is set, the accuracy critic upgrades to GPT-4o. If GEMINI_API_KEY is set, the completeness critic upgrades to Gemini. Otherwise all three critics run on Claude.

But I explicitly cut multi-provider as a v1 requirement. The decision record:

"Three distinct critic prompts on one strong model already produce lens diversity, and removing the multi-provider dependency means a reviewer can run the demo with a single API key."

This is the honest tradeoff. Three well-scoped critic prompts on Claude produce genuinely different outputs because the task is structurally different: one is hunting for false facts, one is evaluating logical structure, one is checking completeness against the stated goal. That is real lens diversity. The additional diversity you get from different providers is real but incremental, and it comes with real cost: three sets of API keys, three different rate limits, three different latency profiles, three different pricing models.

For a v1 that needs to ship and be demonstrable, I chose the simpler version. The architecture is ready for the upgrade.

What the eval taught me (the honest version)

I built an eval harness with 12 test cases: 10 with planted errors (15 errors total across accuracy, logic, and completeness dimensions), 2 clean cases with no errors. Each planted error has a list of keywords that count as "caught."

Results: 15/15 planted errors caught by the panel, 0 false positives on clean cases.

Here is the part I did not expect: the single-model baseline also caught all 15.

My first reaction was that the eval was broken. But after looking at the cases, I think the result is right and what it tells me is more specific than I initially thought.

The panel is not dramatically better at detection than a single-model self-eval on a well-designed golden set. What it is better at is structure. The panel's output tells you:

Which specific dimension is failing (accuracy vs. logic vs. completeness)
Which critics agreed and which one dissented
Which flags were dismissed and why
A quality score broken down by confidence

The baseline gives you a flat list of findings. It might catch the same errors, but you do not know if it is confident, whether two independent perspectives agreed, or whether it is being overcautious in one dimension and undercautious in another.

If you need a quick yes/no on whether an output is broken, a single model with a well-crafted prompt might be fine. If you need structured, auditable signal with per-dimension accountability and confidence levels, the panel earns its complexity.

One thing that genuinely surprised me

The adjudicator's dismissed_flags list.

I expected critics to either agree or independently find different problems. What I did not anticipate was the frequency with which one critic would fire on something that the other two explicitly did not flag. The adjudicator correctly handling that case (not just unioning all flags) turned out to matter more than I expected.

In the eval, a few cases had the logic critic flagging something the accuracy and completeness critics ignored. In those cases, the adjudicator's job was to apply cross-critic corroboration and either confirm it (if the logic issue was real but the others were out of scope) or dismiss it (if it looked like an overcautious hit). Getting that right required the adjudicator to understand the mandate of each critic, not just count votes.

That structure (critics with narrow scopes, an adjudicator with full context) ended up being more important to output quality than any individual critic prompt.

What I would do differently

The keyword-match detection in the eval harness is deterministic and cheap, but it is too brittle for a real benchmark. A critic might correctly identify a problem using different terminology and the match fails. v2 needs an LLM-as-judge matcher that evaluates semantic equivalence rather than substring presence. The current harness gives clean numbers but probably undercounts slightly.

I would also push harder on the provider diversity sooner. The architecture is there. The next meaningful eval question is whether GPT-4o catches accuracy errors that Claude misses on the same cases, and that requires actually running it, not theorizing about it.

The code

The full project is at github.com/bhj37193/crucible. The entry point is python -m src.runner "<output text>". The eval runs with python -m evals.run_eval. No framework dependencies beyond FastAPI and the Anthropic SDK.

The decision records are in /planning/decisions. The cut list for v1 is explicit about what was removed and why. If you are reading this and thinking "but why didn't you just use LangGraph," that document is the answer.

Top comments (2)

Harjot Singh • May 31

One strong reviewer would be enough, I was wrong is the exact lesson, and the why is that a single critic has a single blind spot no matter how good it is. One reviewer, however well-prompted, evaluates along one implicit set of priorities, so the things it systematically doesn't look for, it misses every time, consistently, which feels like reliability but is just a stable gap. Splitting into distinct critic roles works because each lens fails differently: a correctness critic, a tone/clarity critic, and a factual-grounding critic catch non-overlapping problems, and the union of three narrow reviewers beats one generalist trying to hold every concern at once and diluting all of them. It's the same principle as diverse human reviewers, redundancy with identical perspectives adds little, diversity of perspective is what adds coverage. The part I'd love to hear more on is what the eval taught you, because the right metric here isn't each critic's solo accuracy, it's marginal coverage: what does critic C catch that A and B both missed? That's how you know a role is earning its slot versus just agreeing with the others. Diverse critics that fail differently beat one critic that's strong on average. That perspective-diversity-for-real-coverage instinct is core to how I think about verification in Moonshift. From the eval, did any of the three critics turn out to overlap so much it wasn't pulling its weight?

Bohyeon Jang • Jun 1

My eval has this gap. I measured panel vs single-model, not per-critic contribution, so there's no breakdown of "critic C caught X that A and B both missed."

The real bet is cross-provider diversity over role naming. GPT-4o and Claude have different training distributions and different priors about what "complete" means. That's where divergent failure modes should come from. Whether it's actually true, 15 cases isn't enough to detect.

Something I keep coming back to: baseline catches all 15 too. The panel's wins are in output structure (per-dimension verdicts, confidence scores) not catch rate. Whether the critic diversity is doing real marginal work is still genuinely open.

What I want to run next: each critic in isolation, same inputs, pairwise overlap on catches. If two critics share 85%+ of flags, one isn't earning its slot.