DEV Community: Bohyeon Jang

I Built the Claude-Native Version of RecursiveMAS

Bohyeon Jang — Mon, 01 Jun 2026 02:16:35 +0000

RecursiveMAS (arXiv 2604.25917) showed that agents sharing internal reasoning state outperform agents that share only final outputs. The average accuracy gain across benchmarks was 8.3 points. The mechanism: each agent passes not just its answer but the latent embeddings from its own reasoning process, and the next agent conditions on both. The paper is a good result.

The catch is access. RecursiveMAS requires open-weight models with hidden states exposed at inference time. That rules out Claude, GPT-4o, and Gemini. I built a Claude-native version using the Anthropic extended thinking API. The core idea transfers: instead of passing latent vectors, pass the full thinking text. The paper calls it internal state sharing; the Claude version calls it thinking-block relay.

The architecture problem

Claude's extended thinking blocks carry an encrypted signature tied to the originating conversation. You cannot pass a signed thinking block into a different agent's messages array. The API rejects it. The workaround: extract the text from the thinking block and inject it as a regular user message.

# Extract thinking text from Agent 1
thinking_text = next(
    (b.thinking for b in response.content if b.type == "thinking"), ""
)

# Inject into Agent 2 as regular context, not as a thinking block
context = f"Prior agent reasoning:\n{thinking_text}"

The signature does not transfer. The reasoning does.
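A minimal sketch of the relay step as pure functions, so it can be exercised without an API call. The helper names `extract_thinking_text` and `build_relay_messages` are illustrative assumptions, not names from the repo, and the fake blocks only carry the `type` and `thinking` attributes used in the snippet above.

```python
from types import SimpleNamespace


def extract_thinking_text(content_blocks):
    """Return the text of the first thinking block, or "" if there is none."""
    return next(
        (b.thinking for b in content_blocks if getattr(b, "type", None) == "thinking"),
        "",
    )


def build_relay_messages(question, thinking_text):
    """Build Agent 2's messages with Agent 1's reasoning as plain user text."""
    parts = []
    if thinking_text:
        parts.append(f"Prior agent reasoning:\n{thinking_text}")
    parts.append(f"Problem:\n{question}")
    return [{"role": "user", "content": "\n\n".join(parts)}]


# Stand-in for Agent 1's response.content (hypothetical values).
fake_content = [
    SimpleNamespace(type="thinking", thinking="Try factoring first.", signature="sig"),
    SimpleNamespace(type="text", text="x = 3"),
]
messages = build_relay_messages(
    "Solve x^2 - 9 = 0 for x > 0.", extract_thinking_text(fake_content)
)
```

Only the reasoning text crosses over; the signed block itself never enters Agent 2's messages array.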

relay-structured: what I built first

The first architecture was a Planner > Critic > Solver loop where each agent emits a compact mental model JSON instead of raw thinking text. Raw thinking at a 1024-token budget is often compressed and fragmented. The hypothesis was that 150 tokens of structured signal carries more information per token than 1024 tokens of compressed prose.

The schema each agent emits:

{
  "interpretation": "how the agent read the problem",
  "key_steps": ["step 1", "step 2"],
  "rejected_approaches": ["approach tried and discarded"],
  "confidence": 0.85,
  "potential_errors": "where this reasoning might go wrong"
}

confidence and potential_errors are the load-bearing fields. They tell downstream agents where to apply more scrutiny, without requiring those agents to parse a full reasoning trace. A critic that can see "confidence: 0.4, potential_errors: I may have misread the constraint on x" has a different starting point than one that reads 800 tokens of prose and has to infer the same thing.
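One way a downstream agent's wrapper might consume this schema is sketched below. `parse_mental_model` and `needs_scrutiny` are hypothetical helpers, not functions from the repo, and the 0.6 cutoff is an arbitrary example value.

```python
import json

REQUIRED_FIELDS = (
    "interpretation", "key_steps", "rejected_approaches",
    "confidence", "potential_errors",
)


def parse_mental_model(raw_text):
    """Parse an agent's mental model JSON; return None if it is unusable."""
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        mm = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(mm, dict) or any(f not in mm for f in REQUIRED_FIELDS):
        return None
    try:
        conf = float(mm["confidence"])
    except (TypeError, ValueError):
        return None
    mm["confidence"] = min(1.0, max(0.0, conf))  # clamp to [0, 1]
    return mm


def needs_scrutiny(mm, threshold=0.6):
    """Flag a missing or low-confidence model for closer review."""
    return mm is None or mm["confidence"] < threshold
```

Treating an unparseable model as "needs scrutiny" is a design choice in this sketch: a missing signal is handled the same way as a weak one.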

Results (n=50, preliminary)

Condition	Accuracy	Avg tokens
single-agent	70.0%	1,212
relay-structured	72.0%	18,821

+2 points. 15x token cost. relay-structured wins by one problem out of 50. The direction is right. The cost ratio is not deployable as-is. Running the full Planner > Critic > Solver chain on every request is not justified by 2 points at n=50.

Why I did not build read-before

The obvious next step: let Agent 2 read Agent 1's JSON before producing its own answer. I skipped it. The problem is anchoring. Agent 2 sees Agent 1's answer before forming its own view, and it will tend to confirm rather than challenge. Structurally, this is relay-structured with role specialization removed and anchoring added, so I expected it to do worse, not better. It was not worth building.

read-after + disagreement escalation

The design: both agents reason independently. No shared context during reasoning. After both finish, compare their answers in code, no API call. If they agree, return the higher-confidence answer. If they disagree, run a resolver that sees both answers and both mental model JSONs and picks the stronger reasoning chain.

Independent reasoning first means no anchoring. The comparison step is pure code, so there is no token cost when agents agree. The resolver only fires on genuine disagreement, which on a 5000-token budget against hard MATH problems turns out to be about 40% of the time. On easy questions where both agents agree, the cost is 2x single-agent. On harder questions requiring the resolver, it is around 3.5x. Weighted average across both cases: 0.6 x 2 + 0.4 x 3.5, or roughly 2.6x single-agent, versus relay-structured's 15x.
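The weighted average can be checked with a few lines of arithmetic. This is a sketch using only the figures quoted above (2x on agreement, 3.5x on disagreement, 40% disagreement); the function name is illustrative. Plugging those in gives about 2.6x, close to the 2.7x measured in the results below.

```python
def expected_cost_multiple(disagreement_rate, agree_cost=2.0, resolve_cost=3.5):
    """Average cost in single-agent units for a given disagreement rate."""
    return (1 - disagreement_rate) * agree_cost + disagreement_rate * resolve_cost


estimate = expected_cost_multiple(0.40)  # 0.6 * 2.0 + 0.4 * 3.5
```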

The core of the implementation:

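import json  # serializes the mental models for the resolver prompt

# client, DEFAULT_MODEL, DEFAULT_BUDGET, agent_call_structured and _extract_boxed
# are defined elsewhere in the repo. n_rounds is not used in this excerpt.
def run_self_relay(question, n_rounds=2, model=DEFAULT_MODEL, budget_tokens=DEFAULT_BUDGET):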
    # Both agents reason independently, no shared context
    mm1, answer1, tokens1 = agent_call_structured("solver", question, [], [], model, budget_tokens)
    mm2, answer2, tokens2 = agent_call_structured("solver", question, [], [], model, budget_tokens)

    boxed1 = _extract_boxed(answer1)
    boxed2 = _extract_boxed(answer2)
    agree = bool(boxed1 and boxed2 and boxed1 == boxed2)

    conf1 = float(mm1.get("confidence", 0)) if mm1 else 0.0
    conf2 = float(mm2.get("confidence", 0)) if mm2 else 0.0

    if agree:
        final_answer = answer1 if conf1 >= conf2 else answer2
        resolver_tokens = 0
    else:
        # Resolver sees the problem, both answers, and both reasoning chains
        resolver_prompt = (
            f"Two agents solved a math problem independently and disagreed.\n\n"
            f"Problem: {question}\n\n"
            f"Agent 1 answered: {boxed1}\n"
            f"Agent 1 reasoning: {json.dumps(mm1, indent=2)}\n\n"
            f"Agent 2 answered: {boxed2}\n"
            f"Agent 2 reasoning: {json.dumps(mm2, indent=2)}\n\n"
            f"Evaluate which reasoning chain is stronger. "
            f"Return the correct answer inside \\boxed{{}}."
        )
        response = client.messages.create(
            model=model, max_tokens=8000,
            thinking={"type": "enabled", "budget_tokens": budget_tokens},
            messages=[{"role": "user", "content": resolver_prompt}],
        )
        final_answer = next((b.text for b in response.content if b.type == "text"), "")
        resolver_tokens = response.usage.input_tokens + response.usage.output_tokens

    return {
        "final_answer": final_answer,
        "total_tokens": tokens1 + tokens2 + resolver_tokens,
        "agreement": agree,
        "agent1_answer": answer1, "agent1_confidence": conf1,
        "agent2_answer": answer2, "agent2_confidence": conf2,
        "resolver_answer": final_answer if not agree else "",
        "resolver_tokens": resolver_tokens,
    }
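The `_extract_boxed` helper is not shown in the post. A minimal sketch of what such a helper could look like follows; the repo's version may differ, and `extract_boxed` here is a hypothetical stand-in that takes the last `\boxed{...}` in the text and handles nested braces.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or "" if none."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return ""
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return ""  # unbalanced braces: treat as no answer
```

Because `run_self_relay` compares the extracted strings with `==`, equivalent answers written differently (for example `0.5` and `\frac{1}{2}`) count as a disagreement and go to the resolver.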

Results

200 examples, MATH level 4-5, claude-sonnet-4-6, budget=5000 tokens, preliminary:

Condition	Accuracy	Avg tokens
single-agent	63.0%	1,234
self-relay	65.5%	3,290

self-relay gains 2.5 points over single-agent at 2.7x the token cost (3,290 vs. 1,234 average tokens). That is a different profile from relay-structured's 15x cost for 2 points: the read-after architecture gets a similar accuracy gain at roughly one-sixth the token multiple, because the resolver only fires when needed. The disagreement rate on this benchmark was approximately 40%, consistent with the expected range.

What the statistics say

The +2.5pp does not survive a significance test. Wilson 95% confidence intervals overlap: single-agent at [56.1%, 69.4%], self-relay at [58.7%, 71.7%]. Each point estimate sits inside the other condition's interval, so the data are consistent with no difference between the two.
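The intervals can be reproduced with the standard Wilson score formula. This is a sketch; the counts 126 and 131 are simply 63.0% and 65.5% of 200, and the function name is illustrative.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


single = wilson_interval(126, 200)  # 63.0% of 200
relay = wilson_interval(131, 200)   # 65.5% of 200
```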

McNemar's test asks whether the cases where self-relay wins and the cases where single-agent wins are distributed equally. With ~40% disagreement and the observed net gain, the realistic discordant split comes out around 22 self-relay wins versus 18 single-agent wins out of 200 examples. With the continuity correction, chi2 = (|22 - 18| - 1)^2 / 40 = 0.23, p approximately 0.64.
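A sketch of the test itself, using only the standard library. It assumes the continuity-corrected chi-square form with one degree of freedom, and `mcnemar_test` is an illustrative name. For the 22 versus 18 split it gives a chi-square of about 0.23 and a p-value far above 0.05.

```python
import math


def mcnemar_test(b, c):
    """McNemar chi-square with continuity correction.

    b and c are the two discordant counts (cases where exactly one
    condition was correct).
    """
    if b + c == 0:
        return 0.0, 1.0
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = mcnemar_test(22, 18)
```

Only the discordant pairs enter the statistic; the problems both conditions got right or both got wrong carry no information about which one is better.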

Detecting a true 2.5pp effect at 80% power takes about 5,767 examples per condition. The current n is 29x too small. GSM8K's test set has 1,319 examples. The MATH test set has 5,000. The standard benchmarks are not large enough to confirm an improvement this small.
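The sample-size figure can be approximated with the textbook two-proportion formula. This sketch assumes a two-sided test at alpha = 0.05, independent samples, and unpooled variance; it lands within a few examples of the number quoted above, with the small gap depending on the variance formula used.

```python
from statistics import NormalDist


def n_per_condition(p1, p2, alpha=0.05, power=0.80):
    """Approximate n per condition for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2


required = n_per_condition(0.630, 0.655)
```

A paired design, where both conditions run on the same problems, can need fewer examples, but how many depends on the discordant rate, which is exactly what is poorly estimated at n=200.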

I ran the calibration script on the n=5 cost probe to verify the measurement machinery works. Disagreement rate: 40%, matching prior experiments. All five examples had max agent confidence above 0.8, so confidence-gating would route every one of them to single-agent. Resolver got 1 out of 2 disagreement cases, same rate as single-agent on those same problems. Not a finding. Just proof the script runs.

A direction on a hard benchmark at a fraction of the cost is useful. It just is not a result yet. Running on 6,000 examples would make it one. So would pre-registering and replicating. Reporting the stat test now is not pessimism about the architecture. It is how you tell the difference between a pattern and a coincidence.

Where this fits in production

Self-relay fits anywhere the cost of a wrong answer exceeds the cost of a second call. Legal document review, code security audits, medical triage, financial analysis. On those tasks, running two independent reasoning chains and only paying for arbitration when they disagree is a straightforward reliability pattern.

The practical deployment is confidence-gating, not always-on relay. Run single-agent first and check the confidence field from the mental model JSON. If confidence is above a threshold, return the answer. If not, escalate to self-relay:

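def answer_with_confidence_gate(question, threshold=0.75, model="claude-sonnet-4-6", budget=5000):
    # First pass: single agent with mental model
    mm, answer, tokens = agent_call_structured("solver", question, [], [], model, budget)
    # A missing mental model (no parseable JSON) counts as zero confidence
    confidence = float(mm.get("confidence", 0)) if mm else 0.0

    if confidence >= threshold:
        # Same key as run_self_relay, so callers read final_answer on either path
        return {"final_answer": answer, "method": "single", "total_tokens": tokens}

    # Low confidence: escalate to self-relay (this re-runs both agents from scratch)
    result = run_self_relay(question, model=model, budget_tokens=budget)
    result["method"] = "self-relay"
    # The first pass was already paid for, so include it in the total
    result["total_tokens"] += tokens
    return result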

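On a real workload where roughly 25-30% of questions fall below threshold, this brings the average token cost from 2.7x down to roughly 1.4-1.5x single-agent, provided the first pass is reused as one of the two relay agents (1 + 0.25 x 1.7 to 1 + 0.30 x 1.7). As written above, the gate re-runs both agents on escalation, which works out to about 1.7-1.8x. The relay only runs on requests that actually need it.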
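A rough cost model for the gate, as a sketch. It assumes escalated questions cost the benchmark-average 2.7x, which is probably optimistic because low-confidence questions are more likely to trigger the resolver. The `reuse_first_pass` flag is hypothetical: it models a variant where the first-pass answer serves as one of the two relay agents instead of being discarded.

```python
def gated_cost_multiple(escalation_rate, relay_cost=2.7, reuse_first_pass=False):
    """Average cost in single-agent units when only low-confidence cases escalate."""
    # Every question pays for the first pass (1.0). Escalated questions pay
    # for the relay on top, minus one agent call if the first pass is reused.
    extra = relay_cost - 1.0 if reuse_first_pass else relay_cost
    return 1.0 + escalation_rate * extra


rerun = [gated_cost_multiple(f) for f in (0.25, 0.30)]
reuse = [gated_cost_multiple(f, reuse_first_pass=True) for f in (0.25, 0.30)]
```

With 25-30% escalation this gives about 1.7-1.8x when the relay starts from scratch and about 1.4-1.5x when the first pass is reused.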

What I would try next

The rule for extending the run to n=500 was not triggered (65.5% is below the 72.0% bar), so these results stay at n=200. The next useful test is calibrating the confidence threshold: at what level does escalating to self-relay recover correct answers, and what is the false-escalation rate? A domain-specific benchmark, legal review or code analysis, would also stress-test whether the disagreement pattern and resolver accuracy hold outside of competition math.
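One way the threshold calibration could be set up is sketched below. It assumes per-question records of the form (confidence, single-agent correct, self-relay correct). The record format and function name are illustrative, not taken from the eval harness, and the demo values are made up.

```python
def calibrate_threshold(records, thresholds):
    """Summarize gating behaviour for each candidate threshold.

    Each record is (confidence, single_correct, relay_correct).
    """
    rows = []
    for t in thresholds:
        escalated = [r for r in records if r[0] < t]
        # Escalations that turned a wrong single-agent answer into a right one.
        recovered = sum(1 for _, s_ok, r_ok in escalated if not s_ok and r_ok)
        # Escalations where the single-agent answer was already correct.
        unnecessary = sum(1 for _, s_ok, _ in escalated if s_ok)
        rows.append({
            "threshold": t,
            "escalation_rate": len(escalated) / len(records),
            "recovered": recovered,
            "false_escalation_rate": unnecessary / len(escalated) if escalated else 0.0,
        })
    return rows


# Made-up records for illustration only, not measured data.
demo = [(0.9, True, True), (0.6, False, True), (0.7, True, True), (0.5, False, False)]
rows = calibrate_threshold(demo, [0.65, 0.75])
```

Plotting `recovered` against `escalation_rate` across thresholds would show where extra escalations stop buying extra correct answers.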

The repo is at github.com/bhj37193/relay. The eval harness is in relay/eval_structured.py. All results are in eval_structured_results.json.

I read a multi-agent reasoning paper, built the Claude-native version, and measured everything

Bohyeon Jang — Mon, 01 Jun 2026 00:20:47 +0000

Why I used three different critic roles instead of one (and what the eval taught me)

Bohyeon Jang — Sun, 31 May 2026 05:45:00 +0000

I built Crucible: three specialized critic agents that audit any LLM output in parallel, an adjudicator that synthesizes their critiques into a confidence-scored verdict, and an eval harness that measures whether the whole thing actually works better than just asking a single model to check itself.

Here is what I learned, including the part where the honest answer is "not as much as I hoped."

The problem: a model cannot reliably audit its own blind spots

When a language model generates output, it has already committed to a direction. Ask it to self-review and it will often ratify the same confident mistake it just made, not because it is lazy, but because self-review activates the same internal heuristics that produced the error.

The failure mode that made this concrete for me: imagine an LLM answering a question about file storage and it says "save uploads to /uploads/ on the server." That looks reasonable in isolation. The model reviews it and says "looks good." But the advice assumes a single-server deployment. In a horizontally-scaled setup, each instance has its own /uploads/ directory, so a file saved on one instance is missing when a later request lands on another, and user uploads appear to vanish in production.

The model did not hallucinate. It gave correct advice for the wrong context. Self-review did not catch it because both passes made the same contextual assumption.

Multi-agent verification is the obvious response: get independent perspectives that do not share the same failure mode.

Three roles, not three instances of the same model

The naive version of "multi-agent review" is: run the same model three times with slightly different temperatures and hope disagreement surfaces problems. That is mostly noise. You get variance in phrasing, not in perspective.

Crucible uses three structurally different critic roles:

Accuracy critic: are the claims true and internally consistent? Hallucinated entities, wrong numbers, citations that do not exist.
Logic critic: does the reasoning follow? Is the conclusion actually supported by the premises given?
Completeness critic: what is missing? What did the prompt ask for that the output omitted?

Each critic has a narrow mandate: explicit instructions not to stray into the other dimensions. The accuracy critic is told: "Do NOT comment on logic flow or completeness. Stay strictly on factual correctness." This is deliberate. Focused critics produce cleaner signal. A generalist critic reviewing everything at once tends to cluster around the most obvious problem and miss the others.

The adjudicator then reads all three critiques and produces a typed verdict: confirmed_issues (issues the adjudicator judged real and consequential, where cross-critic agreement is strong signal but a clear high-severity single-critic flag also qualifies), dismissed_flags (issues a critic raised that the adjudicator overruled as out-of-scope, pedantic, or insufficiently supported), and a quality_score with a confidence rating.

The dismissed_flags field turned out to be one of the more useful things in practice. When only one critic fires on something, that is often a false positive, a critic being overzealous within its dimension. The adjudicator's job is to apply cross-critic weight, not just union every flag.
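The confirm-or-dismiss policy can be illustrated with a rule-based stand-in. Crucible's adjudicator is a model call, not a rule, so this sketch only mirrors the policy described above: cross-critic agreement confirms a flag, and so does a single high-severity flag. The `Flag` and `Verdict` types and the two-level severity scale are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Flag:
    issue: str
    critics: set            # names of the critics that raised this issue
    severity: str = "low"   # "low" or "high" (hypothetical scale)


@dataclass
class Verdict:
    confirmed_issues: list = field(default_factory=list)
    dismissed_flags: list = field(default_factory=list)


def triage(flags):
    """Confirm corroborated or high-severity flags; dismiss the rest."""
    verdict = Verdict()
    for flag in flags:
        if len(flag.critics) >= 2 or flag.severity == "high":
            verdict.confirmed_issues.append(flag.issue)
        else:
            verdict.dismissed_flags.append(flag.issue)
    return verdict


verdict = triage([
    Flag("wrong release year", {"accuracy", "logic"}),
    Flag("tone is too casual", {"completeness"}),
    Flag("conclusion contradicts premise", {"logic"}, severity="high"),
])
```

Keeping the dismissed flags in the output, instead of dropping them, is what makes the verdict auditable: a reader can see what was overruled.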

The asyncio.gather decision: why not LangGraph

I looked at LangGraph. For a 3-node fan-out (run three critics in parallel, collect results, pass to adjudicator) it is ceremony. Here is the actual orchestration in Crucible:

raw = await asyncio.gather(
    *(critic.run(output_text, original_prompt, model) for critic in CRITICS),
    return_exceptions=True,
)

That is four lines. LangGraph would have given me a graph definition, node registration, state management, and a debugging UI that I would never open. At this scale, the abstraction costs more than it saves.
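A runnable sketch of the same fan-out with stub critics, showing what `return_exceptions=True` does when one critic fails. The three critic coroutines and `run_panel` are stand-ins written for this example, not Crucible's classes.

```python
import asyncio


async def accuracy_critic(text):
    return {"critic": "accuracy", "flags": []}


async def logic_critic(text):
    raise RuntimeError("simulated provider timeout")


async def completeness_critic(text):
    return {"critic": "completeness", "flags": ["missing rollback plan"]}


async def run_panel(text, critics):
    raw = await asyncio.gather(*(c(text) for c in critics), return_exceptions=True)
    # With return_exceptions=True, a failed critic shows up as an exception
    # object in the result list instead of being raised by gather.
    critiques = [r for r in raw if not isinstance(r, BaseException)]
    failures = [r for r in raw if isinstance(r, BaseException)]
    return critiques, failures


critiques, failures = asyncio.run(
    run_panel("draft answer", [accuracy_critic, logic_critic, completeness_critic])
)
```

One failing critic does not sink the panel: the adjudicator can still work from the two critiques that came back, and the failure is available for logging.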

The decision record I wrote for this project has a line I keep coming back to: "For a 3-node fan-out, LangGraph is ceremony. asyncio.gather is ~4 lines and easier to explain in an interview." Not a knock on LangGraph. It genuinely earns its keep at larger scale. But building something you cannot explain in five minutes is not a feature.

The provider question: Claude x 3, with a path to diversity

Here is a decision I want to be transparent about. The brief for Crucible called for three different providers: GPT-4o for accuracy, Claude for logic, Gemini for completeness. The theory is sound. Different training data means different failure modes, so you are less likely to have all three critics share the same blind spot.

I built the architecture to support this. The provider resolution is in src/providers.py: if OPENAI_API_KEY is set, the accuracy critic upgrades to GPT-4o. If GEMINI_API_KEY is set, the completeness critic upgrades to Gemini. Otherwise all three critics run on Claude.
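The resolution rule described above could look something like this. It is a sketch, not the contents of src/providers.py: the provider labels and the `resolve_providers` name are assumptions, and only the two environment variable names come from the text.

```python
import os

DEFAULT_PROVIDER = "claude"

# (critic name, env var that unlocks the upgrade, upgraded provider label)
UPGRADES = [
    ("accuracy", "OPENAI_API_KEY", "gpt-4o"),
    ("completeness", "GEMINI_API_KEY", "gemini"),
]


def resolve_providers(env=None):
    """Map each critic to a provider based on which API keys are present."""
    env = os.environ if env is None else env
    providers = {
        name: DEFAULT_PROVIDER for name in ("accuracy", "logic", "completeness")
    }
    for critic, key_name, provider in UPGRADES:
        if env.get(key_name):  # unset or empty keys leave the default in place
            providers[critic] = provider
    return providers
```

Passing the environment in as an argument keeps the function testable without touching real API keys.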

But I explicitly cut multi-provider as a v1 requirement. The decision record:

"Three distinct critic prompts on one strong model already produce lens diversity, and removing the multi-provider dependency means a reviewer can run the demo with a single API key."

This is the honest tradeoff. Three well-scoped critic prompts on Claude produce genuinely different outputs because the task is structurally different: one is hunting for false facts, one is evaluating logical structure, one is checking completeness against the stated goal. That is real lens diversity. The additional diversity you get from different providers is real but incremental, and it comes with real cost: three sets of API keys, three different rate limits, three different latency profiles, three different pricing models.

For a v1 that needs to ship and be demonstrable, I chose the simpler version. The architecture is ready for the upgrade.

What the eval taught me (the honest version)

I built an eval harness with 12 test cases: 10 with planted errors (15 errors total across accuracy, logic, and completeness dimensions), 2 clean cases with no errors. Each planted error has a list of keywords that count as "caught."
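A sketch of how keyword-based detection of this kind can work. The function names and the sample case are made up for illustration and are not taken from the 12 real test cases.

```python
def error_caught(critique_text, keywords):
    """True if any keyword for a planted error appears in the critique."""
    lowered = critique_text.lower()
    return any(k.lower() in lowered for k in keywords)


def score_case(critique_text, planted_errors):
    """Return (caught, total); planted_errors maps an error id to its keywords."""
    caught = [
        error_id
        for error_id, keywords in planted_errors.items()
        if error_caught(critique_text, keywords)
    ]
    return len(caught), len(planted_errors)


# Made-up case for illustration only.
planted = {"wrong_year": ["1969", "wrong year"], "missing_step": ["rollback"]}
caught, total = score_case("The landing year is wrong; it was 1969.", planted)
```

The match is a plain case-insensitive substring test, so a critique that says "the date is incorrect" would not count against a keyword list of `["wrong year"]`, even though it describes the same problem.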

Results: 15/15 planted errors caught by the panel, 0 false positives on clean cases.

Here is the part I did not expect: the single-model baseline also caught all 15.

My first reaction was that the eval was broken. But after looking at the cases, I think the result is right and what it tells me is more specific than I initially thought.

The panel is not dramatically better at detection than a single-model self-eval on a well-designed golden set. What it is better at is structure. The panel's output tells you:

Which specific dimension is failing (accuracy vs. logic vs. completeness)
Which critics agreed and which one dissented
Which flags were dismissed and why
A quality score with a confidence rating

The baseline gives you a flat list of findings. It might catch the same errors, but you do not know if it is confident, whether two independent perspectives agreed, or whether it is being overcautious in one dimension and undercautious in another.

If you need a quick yes/no on whether an output is broken, a single model with a well-crafted prompt might be fine. If you need structured, auditable signal with per-dimension accountability and confidence levels, the panel earns its complexity.

One thing that genuinely surprised me

The adjudicator's dismissed_flags list.

I expected critics to either agree or independently find different problems. What I did not anticipate was the frequency with which one critic would fire on something that the other two explicitly did not flag. The adjudicator correctly handling that case (not just unioning all flags) turned out to matter more than I expected.

In the eval, a few cases had the logic critic flagging something the accuracy and completeness critics ignored. In those cases, the adjudicator's job was to apply cross-critic corroboration and either confirm it (if the logic issue was real but the others were out of scope) or dismiss it (if it looked like an overcautious hit). Getting that right required the adjudicator to understand the mandate of each critic, not just count votes.

That structure (critics with narrow scopes, an adjudicator with full context) ended up being more important to output quality than any individual critic prompt.

What I would do differently

The keyword-match detection in the eval harness is deterministic and cheap, but it is too brittle for a real benchmark. A critic might correctly identify a problem using different terminology and the match fails. v2 needs an LLM-as-judge matcher that evaluates semantic equivalence rather than substring presence. The current harness gives clean numbers but probably undercounts slightly.

I would also push harder on the provider diversity sooner. The architecture is there. The next meaningful eval question is whether GPT-4o catches accuracy errors that Claude misses on the same cases, and that requires actually running it, not theorizing about it.

The code

The full project is at github.com/bhj37193/crucible. The entry point is python -m src.runner "<output text>". The eval runs with python -m evals.run_eval. No framework dependencies beyond FastAPI and the Anthropic SDK.

The decision records are in /planning/decisions. The cut list for v1 is explicit about what was removed and why. If you are reading this and thinking "but why didn't you just use LangGraph," that document is the answer.