Shipped this weekend: a 3-agent blind cross-lab evaluation workflow on heym, MIT licensed, callable as an HTTP endpoint by any coding agent or autonomous loop. The thesis is structural: models cannot reliably self-evaluate, so an external blind primitive is the only honest fix. The workflow lives at github.com/ejentum/agent-teams/tree/main/blind-eval-trio.
The workflow is open source. It optionally uses Ejentum's harness API for cognitive priming (free tier 100 calls; paid tier for ongoing use). The harness is attachable, not required. I tested four configurations on the same payload (MCP only, MCP + routing skills, MCP + heavyweight matched skills, bare baseline) and the bare baseline produced equivalent role-disciplined output. The structural integrity comes from cross-lab routing plus role-disciplined system prompts plus tool lockout, not from the harness layer. Calling the workflow "powered by Ejentum" without disclosing that the harness is icing rather than load-bearing would be dishonest, so I'm naming it up front.
Why this matters now
Karpathy's autoresearch uses Git as its whole control loop. Claude Code's GitHub Action takes an issue and opens a PR. Codex Cloud is built on the same idea. Autonomous agents are increasingly committing to actions without a human gate. The bottleneck is no longer "what should the agent do," it's "what should the agent do BEFORE it commits to doing it."
Self-evaluation doesn't fill that gap. The literature is unambiguous: Huang et al. ("Large Language Models Cannot Self-Correct Reasoning Yet", arXiv:2310.01798), the LLM-as-judge work showing that a model judging its own output collapses into self-preference, and the more recent CorrectBench results. Asking the same model to critique its own plan reproduces the original blind spots. "Single LLM wearing three reviewer hats" is prompt theater that rubber-stamps itself.
GitHub knows this. They shipped Copilot CLI's "Rubber Duck" in April: a focused review agent powered by a complementary model family that critiques a non-trivial change after planning but before implementation. They measured a 74.7% closure of the Sonnet → Opus performance gap when Sonnet runs with Rubber Duck enabled. It ships free inside Copilot CLI and owns the pre-commitment cross-model critic surface for the developer-tools lane.
This workflow is for everyone else: agent runtime developers building autonomous loops on Claude Agent SDK / LangGraph / AutoGen / CrewAI / heym; multi-agent system designers who want a callable primitive their orchestrator can hit; Cursor / Cline / Aider users; security teams running Claude Code in restricted environments without Copilot CLI; researchers building custom Python pipelines around the Anthropic or OpenAI APIs directly. None of them get Rubber Duck for free; all of them can self-host this.
What I built
Three agents in parallel, each on a different model lab, each locked to one role and one cognitive operation:
| Agent | Model | Role | Hard rule |
|---|---|---|---|
| steelmanAgent | OpenAI gpt-5-nano | Strongest case FOR the method | Pure advocacy, zero smuggled critique. If nothing defensible, returns "No defensible aspects found." |
| stresstestAgent | Anthropic Claude Opus 4 | Where the method BREAKS | Severity-tagged failure modes with concrete breaking scenarios. No softening. |
| gapfinderAgent | Zhipu GLM 4.7 | What is MISSING (steps + articulation depth) | Names three deeper implicit assumptions when articulation is shallow. Mandatory section. |
The calling agent submits a structured payload: `{ task: string, method: { goal, steps, assumptions, expected_risks } }`. The schema is itself the discipline — the agent literally cannot submit until it has articulated all four fields. That structure forces the agent to make implicit reasoning explicit, which is half the value before the eval even runs.
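As a concrete illustration, here is the engineering-refactor test case from the dogfood runs below expressed against that schema. This is a sketch only: the goal/steps wording and the Python serialization are mine, not the workflow's wire format.

```python
# Sketch of a structured payload (field names from the schema above; the
# specific goal/steps text is illustrative, not the repo's test fixture).
payload = {
    "task": "Replace raise UserNotFound(id) with return None and update callers",
    "method": {
        "goal": "Simplify error handling in the user-lookup path",
        "steps": [
            "Change the raise site to return None",
            "Update every caller that currently catches UserNotFound",
        ],
        "assumptions": ["Semantics unchanged"],
        "expected_risks": ["None identified"],
    },
}
```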
The three agents process in parallel. There is no synthesizer node — the three evaluations are returned raw, as a structured JSON object, and the calling agent integrates them. Flattening the disagreement via consensus would defeat the purpose; the integration tension between three voices on different labs is the signal.
What makes the structure hold
Three properties have to be simultaneously true for this not to collapse into prompt theater:
Cross-lab routing reduces (but does not eliminate) correlated failure modes. Three different RLHF priors, three different training distributions, three different alignment baselines. The decorrelation is intuited from training-distribution diversity, not benchmarked — I have not formally measured the decorrelation delta vs same-lab routing. When all three converge on the same critique, that's a stronger signal than any single model's verdict; when they fragment, the disagreement itself flags contested territory. The empirical claim is "in dogfood runs across multiple domains, the three models produced visibly different writing styles and surfaced different concerns." Stronger statistical claims would require a controlled experiment I haven't run.
Tool lockout per role. Each agent's system prompt contains a HARD RULE: "You may ONLY call harness_X. Calling any other tool is a protocol violation." Even with all four Ejentum harness tools visible to the agent, the locked role prevents tool-switching. Verified empirically across hundreds of runs — none of the agents have violated their lockout.
Forced output structure. Each role has prescribed sections (Defensible aspects + Why this method fits the task / Failure modes + Hidden assumptions / Missing from method + Alternatives not considered + Articulation quality). Each section has a discipline — failure modes need severity tags and concrete scenarios, gap_finder must include the articulation-quality critique even when the input looks fine. The structure makes rubber-stamping mechanically harder.
No synthesizer. The structuring node downstream of the three agents is non-LLM — it just packages three text fields into JSON. There is no fourth agent reading the three outputs and deciding "the consensus is X." That fourth agent would itself become the new failure mode (single-LLM judging three single-LLM outputs collapses to single-LLM-judge).
The obvious objection to "no synthesizer" is that the integration burden moves to the calling agent — and the calling agent is the same agent we said couldn't self-evaluate. The answer is that integration is a different cognitive operation than self-evaluation. When you read three external voices critiquing your plan, the self-preference bias that wrecks self-correction operates more weakly: you're not judging your own work, you're reconciling outside feedback. Not eliminated, but lower-loss than a fourth-LLM-as-judge would be. The usage_note field in the response prompts the calling agent to "incorporate feedback, do not judge consensus" to reinforce the right cognitive operation.
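To make that distinction concrete, here is a minimal sketch of what the integration step can look like on the calling side. The prompt wording is hypothetical; the point is that the three evaluations come back as external feedback to revise against, not verdicts to aggregate.

```python
# Hypothetical helper on the calling-agent side: fold the three raw evaluations
# into the next turn as feedback to incorporate, not scores to average.
def build_revision_prompt(original_plan: str, evals: dict) -> str:
    return (
        "You previously proposed this plan:\n\n"
        f"{original_plan}\n\n"
        "Three independent external reviewers responded. Revise the plan to "
        "address their feedback. Do not score them or pick a winner.\n\n"
        f"STEELMAN:\n{evals['steelman']}\n\n"
        f"STRESS TEST:\n{evals['stress_test']}\n\n"
        f"GAP FINDER:\n{evals['gap_finder']}\n"
    )
```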
What you'd actually get from THIS specific workflow vs writing your own
The honest disclosure that the bare baseline produces equivalent output without the harness raises a fair question: if role-disciplined system prompts plus cross-lab routing are doing the work, why not write three prompts and route to three model APIs yourself in 30 minutes?
You can. The reason to use this template instead is that the system prompts have been tuned across many real test runs, and several of the load-bearing rules emerged from observing failure modes that aren't obvious until you've watched the agents actually run on adversarial payloads:
- HARD RULE 3 (input scope lockout) was added after observing chat-trigger thread accumulation contaminate output across consecutive test runs. Without it, agents helpfully evaluate prior task context they shouldn't be evaluating.
- The articulation-quality mandatory section in gap_finder was added after observing gap_finder skip the deeper-assumptions critique on inputs that looked surface-fine. Without making it mandatory, the gate doesn't bite.
- The "no smuggled critique" advocacy rule in steelman was added after observing steelman drift into "I see why you might think this works, BUT..." patterns under certain payload framings.
- The severity-tag-plus-concrete-scenario discipline in stress_test was added after observing failure modes that named generic risks without identifying specific trigger conditions.
Each of these prompts is 30 minutes of writing on its own. The accumulated tuning across them is several days of dogfooding. Fork the prompts; you don't have to start from zero.
Tested across domains
The same workflow, with no domain-specific tuning, was run on five distinct domains during dogfooding (n=1 per domain — anecdotal, not formally benchmarked):
Engineering refactor planning. Test payload: "Replace `raise UserNotFound(id)` with `return None` and update callers," framed as cleanup, with the assumption claim "semantics unchanged." The stress_test agent caught the false claim immediately: "The method assumes 'semantics unchanged' when exception vs None fundamentally changes the contract — from 'fail loudly' to 'fail silently.'" That catch is reproducible across multiple runs.

Payments migration decision. Test payload: "Migrate production payments from Stripe to in-house PSP via Wells Fargo, PCI-DSS Level 1 in 8 weeks, 4-engineer team, 'eliminate the 2.9% + $0.30 fee.'" The stress_test agent produced senior-payments-engineer-level analysis: it caught the 8-week PCI-DSS timeline as fantasy ("47 remediation items, month 4 with no certification"), the Wells Fargo merchant-vs-PSP-status confusion ("$500K reserve, $100K/month limit first year"), the Visa/Mastercard direct-integration complexity (named EMV 3DS 2.0, MIP/VIP connections, leased lines, the $50K Visa testing fee), and the regulatory dimension (state money transmitter licenses, KYC/AML, OFAC, SCA — California DFPI shutdown with an 18-month MTL timeline).
Security incident response. Test payload: "Anomalous outbound traffic from prod-db-01, 50MB/h to Southeast Asia IP for 3 days; plan: block traffic immediately, take memory dump, reset credentials, run CrowdStrike scan, restore from yesterday's backup, resume operations within 48 hours." The stress_test agent caught premature containment alerting the attacker, backup integrity unverifiable mid-investigation, 48-hour timeline as fantasy. The gap_finder surfaced the entire missing legal/regulatory dimension (breach notification laws, FBI cyber crime engagement, cyber insurance carrier notification, customer notification planning).
Investigative reasoning (locked-room case analysis — psychiatrist found dead with note, GSR on dominant hand, "depression and lawsuit explain motivation, close as suicide in 2 weeks"). The gap_finder agent caught the meta-framing critique: "The method establishes a 2-week timeline before evidence evaluation, suggesting outcome-driven investigation rather than evidence-driven conclusion. The actual risk is that a homicide remains undetected." Not a forensic catch — a critique of the shape of the investigation method itself.
Strategic product decisions (the meta-evaluation, below).
Five domains, no domain-specific tuning. The workflow produced useful output in each case. n=1 per domain is anecdotal — there's no measured false-positive rate or inter-rater agreement score, and "useful" is a qualitative judgment. If you want quantitative confidence, run it on your own payloads and compare to your own ground truth.
The dogfood moment that actually changed my plan
The most useful run during development was when I pointed the workflow at my own product decision. The task: "Decide whether to build blindeval.com as a sub-brand product of Ejentum, with its own MCP server (blindeval-mcp) distributed alongside ejentum-mcp." The method: a four-week build plan with hosted SaaS endpoint, per-call pricing at $1/eval, cross-promotion strategy, soft launch via Reddit and dev.to.
Two of the three agents (Anthropic Opus 4 stress_test and Zhipu GLM 4.7 gap_finder) converged on the same critique from completely different angles:
- No customer validation done before building. Both flagged this as the missing load-bearing step.
- The 4-week timeline is fantasy. Stress_test: "billing meter integration alone takes 3 weeks." Gap_finder: same conclusion via different path.
- Sub-brand strategy may dilute rather than amplify. Both surfaced the brand cannibalization risk.
- The $1/eval pricing is unvalidated. Both flagged it as guess, not data.
- Operational capacity for two products is not addressed. Both surfaced the team-bandwidth-trap risk.
Gap_finder also surfaced novel alternatives I hadn't considered: ship the cross-lab review pattern as an OSS template riding GitHub Rubber Duck's market education without competing on its turf; pivot to a publishable instrument rather than a hosted service; delay launch until after customer validation interviews.
What actually changed in my plan after reading the three evaluations:
- Timeline: 4-week paid SaaS build → indefinite, hosted version deferred until customer signal justifies it.
- Brand strategy: Sub-brand SaaS with separate MCP package → blindeval.com as a positioning landing page, the workflow shipped as a free entry inside the existing `agent-teams/` repo, future hosted version routed through existing Ejentum infrastructure if/when warranted.
- Launch order: Paid endpoint first → open-source workflow first, then hosted, then maybe MCP wrapper.
What didn't change: the intent to build something at the blindeval.com domain eventually. I had already bought the domain before running the eval, so "abandoning the project" wasn't on the table. What the eval did do was reorder the build sequence and force the customer-validation step that I had skipped.
The workflow shifted my plan from a 4-week paid SaaS build to an open-source-first launch with hosted version deferred until customer signal justifies it. That's the honest version of "I took the agent's advice." Less dramatic than the original framing, more accurate.
How to use it
The fastest path:
- Self-host heym v0.0.20+ via Docker.
- Import `blind_eval_trio.json` into the heym canvas.
- Configure 3 model credentials (Anthropic, OpenAI, OpenRouter or direct Zhipu).
- Optional: attach the Ejentum MCP server to each agent for cognitive harness priming. Free tier covers 100 calls.
- Send a (task, method) payload via chat panel for testing, or via webhook for production calling.
For programmatic agent integration, heym exposes every workflow as an HTTP endpoint:
```bash
curl -X POST --no-buffer \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  "http://YOUR_HEYM_HOST/api/workflows/YOUR_WORKFLOW_ID/execute/stream" \
  -d '{
    "text": "TASK: <your task>\n\nMETHOD:\ngoal: ...\nsteps:\n  1. ...\nassumptions:\n  - ...\nexpected_risks:\n  - ..."
  }'
```
SSE events stream as each agent completes. Final event contains the structured JSON output:
```json
{
  "steelman": "Defensible aspects: ...",
  "stress_test": "## Failure modes: ...",
  "gap_finder": "Missing from method: ...",
  "usage_note": "Three independent evaluations, no synthesis. Integrate into your decision; do not score-and-aggregate."
}
```
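If you are calling it from Python rather than curl, a minimal client might look like the sketch below. The endpoint path comes from the curl example above; the SSE handling is an assumption (it simply keeps the last `data:` line), so check heym's actual event framing before relying on it.

```python
import json
import requests

# Hypothetical values: substitute your own host and workflow ID.
HEYM_URL = "http://YOUR_HEYM_HOST/api/workflows/YOUR_WORKFLOW_ID/execute/stream"

payload = {
    "text": "TASK: <your task>\n\nMETHOD:\ngoal: ...\nsteps:\n  1. ...\n"
            "assumptions:\n  - ...\nexpected_risks:\n  - ..."
}

last_data = None
with requests.post(HEYM_URL, json=payload, stream=True,
                   headers={"Accept": "text/event-stream"}) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        # SSE data lines look like "data: {...}"; keep the most recent one,
        # which per the docs above is the structured JSON output.
        if line and line.startswith("data:"):
            last_data = line[len("data:"):].strip()

evals = json.loads(last_data)
print(evals["usage_note"])  # then hand steelman / stress_test / gap_finder back to the agent
```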
The full setup walkthrough, verification test set (4 ready-to-paste payloads), and architecture explanation live in the heym setup guide.
Where this fits and where it doesn't
This is a pre-commitment evaluation primitive for agent runtimes. It's not a human-PR-review SaaS (CodeRabbit / Greptile occupy that), not a post-execution observability dashboard (Patronus / Galileo / Braintrust occupy that), not a per-step linter (50-80s latency makes it a high-stakes-decisions tool only — architecture choices, deployment plans, refactor approaches, security incident response, strategic moves), and not a Copilot CLI replacement (GitHub Rubber Duck does that for free, use it if you're on Copilot). Use it when your agent is about to commit to something you'd want a senior colleague to review and you don't have one available.
Where this is going
The pattern (workflow without orchestrator + N specialists with locked roles + cross-lab routing + no synthesizer) generalizes to other high-stakes evaluation tasks where multi-cognitive review beats single-agent output:
- Refactor planner (reasoning + code + memory)
- Security audit triage (anti-deception + code + reasoning)
- Production debug forensic (reasoning + code + memory)
- Strategic decision audit (reasoning + anti-deception + memory)
Each follows the same structural rule: no synthesizer, locked roles per agent, forced output structure, cross-lab assignment. The architecture encodes the multi-cognitive value into the workflow shape rather than leaving it to prompt theater.
If you fork this and build a team for your own use case, drop a folder in agent-teams/ with workflow + system prompts + verification tests, and I'll merge it.
Open source, MIT, repo at github.com/ejentum/agent-teams/tree/main/blind-eval-trio. Built on heym (v0.0.20+) with optional Ejentum harness API for cognitive priming. Questions or contributions: info@ejentum.com.
