Bohyeon Jang

Posted on Jun 1

I Built the Claude-Native Version of RecursiveMAS

#ai #python #programming #machinelearning

RecursiveMAS (arXiv 2604.25917) showed that agents sharing internal reasoning state outperform agents that share only final outputs. The average accuracy gain across benchmarks was 8.3 points. The mechanism: each agent passes not just its answer but the latent embeddings from its own reasoning process, and the next agent conditions on both. The paper is a good result.

The catch is access. RecursiveMAS requires open-weight models with hidden states exposed at inference time. That rules out Claude, GPT-4o, and Gemini. I built a Claude-native version using the Anthropic extended thinking API. The core idea transfers: instead of passing latent vectors, pass the full thinking text. The paper calls it internal state sharing; the Claude version calls it thinking-block relay.

The architecture problem

Claude's extended thinking blocks carry an encrypted signature tied to the originating conversation. You cannot pass a signed thinking block into a different agent's messages array. The API rejects it. The workaround: extract the text from the thinking block and inject it as a regular user message.

# Extract thinking text from Agent 1
thinking_text = next(
    (b.thinking for b in response.content if b.type == "thinking"), ""
)

# Inject into Agent 2 as regular context, not as a thinking block
context = f"Prior agent reasoning:\n{thinking_text}"

The signature does not transfer. The reasoning does.

relay-structured: what I built first

The first architecture was a Planner > Critic > Solver loop where each agent emits a compact mental model JSON instead of raw thinking text. Raw thinking at a 1024-token budget is often compressed and fragmented. The hypothesis was that 150 tokens of structured signal carries more information per token than 1024 tokens of compressed prose.

The schema each agent emits:

{
  "interpretation": "how the agent read the problem",
  "key_steps": ["step 1", "step 2"],
  "rejected_approaches": ["approach tried and discarded"],
  "confidence": 0.85,
  "potential_errors": "where this reasoning might go wrong"
}

confidence and potential_errors are the load-bearing fields. They tell downstream agents where to apply more scrutiny, without requiring those agents to parse a full reasoning trace. A critic that can see "confidence: 0.4, potential_errors: I may have misread the constraint on x" has a different starting point than one that reads 800 tokens of prose and has to infer the same thing.

Results (n=50, preliminary)

Condition	Accuracy	Avg tokens
single-agent	70.0%	1,212
relay-structured	72.0%	18,821

+2 points. 15x token cost. relay-structured wins by one problem out of 50. The direction is right. The cost ratio is not deployable as-is. Running the full Planner > Critic > Solver chain on every request is not justified by 2 points at n=50.

Why I did not build read-before

The obvious next step: let Agent 2 read Agent 1's JSON before producing its own answer. I skipped it. The problem is anchoring. Agent 2 sees Agent 1's answer before forming its own view, and it will tend to confirm rather than challenge. This is mathematically equivalent to relay-structured with role specialization removed and anchoring added. The expected result is worse, not better. It was not worth building.

read-after + disagreement escalation

The design: both agents reason independently. No shared context during reasoning. After both finish, compare their answers in code, no API call. If they agree, return the higher-confidence answer. If they disagree, run a resolver that sees both answers and both mental model JSONs and picks the stronger reasoning chain.

Independent reasoning first means no anchoring. The comparison step is pure code, so there is no token cost when agents agree. The resolver only fires on genuine disagreement, which on a 5000-token budget against hard MATH problems turns out to be about 40% of the time. On easy questions where both agents agree, the cost is 2x single-agent. On harder questions requiring the resolver, it is around 3.5x. Weighted average across both cases: roughly 2.9x single-agent, versus relay-structured's 15x.

The core of the implementation:

def run_self_relay(question, n_rounds=2, model=DEFAULT_MODEL, budget_tokens=DEFAULT_BUDGET):
    # Both agents reason independently, no shared context
    mm1, answer1, tokens1 = agent_call_structured("solver", question, [], [], model, budget_tokens)
    mm2, answer2, tokens2 = agent_call_structured("solver", question, [], [], model, budget_tokens)

    boxed1 = _extract_boxed(answer1)
    boxed2 = _extract_boxed(answer2)
    agree = bool(boxed1 and boxed2 and boxed1 == boxed2)

    conf1 = float(mm1.get("confidence", 0)) if mm1 else 0.0
    conf2 = float(mm2.get("confidence", 0)) if mm2 else 0.0

    if agree:
        final_answer = answer1 if conf1 >= conf2 else answer2
        resolver_tokens = 0
    else:
        # Resolver sees the problem, both answers, and both reasoning chains
        resolver_prompt = (
            f"Two agents solved a math problem independently and disagreed.\n\n"
            f"Problem: {question}\n\n"
            f"Agent 1 answered: {boxed1}\n"
            f"Agent 1 reasoning: {json.dumps(mm1, indent=2)}\n\n"
            f"Agent 2 answered: {boxed2}\n"
            f"Agent 2 reasoning: {json.dumps(mm2, indent=2)}\n\n"
            f"Evaluate which reasoning chain is stronger. "
            f"Return the correct answer inside \\boxed{{}}."
        )
        response = client.messages.create(
            model=model, max_tokens=8000,
            thinking={"type": "enabled", "budget_tokens": budget_tokens},
            messages=[{"role": "user", "content": resolver_prompt}],
        )
        final_answer = next((b.text for b in response.content if b.type == "text"), "")
        resolver_tokens = response.usage.input_tokens + response.usage.output_tokens

    return {
        "final_answer": final_answer,
        "total_tokens": tokens1 + tokens2 + resolver_tokens,
        "agreement": agree,
        "agent1_answer": answer1, "agent1_confidence": conf1,
        "agent2_answer": answer2, "agent2_confidence": conf2,
        "resolver_answer": final_answer if not agree else "",
        "resolver_tokens": resolver_tokens,
    }

Results

200 examples, MATH level 4-5, claude-sonnet-4-6, budget=5000 tokens, preliminary:

Condition	Accuracy	Avg tokens
single-agent	63.0%	1,234
self-relay	65.5%	3,290

self-relay gains 2.5 points over single-agent at 2.7x the token cost. That is a different profile from relay-structured's 15x cost for 2 points: the read-after architecture gets a similar accuracy gain at roughly one-sixth the token overhead. The disagreement rate on this benchmark was approximately 40%, consistent with the expected range.

The token cost is 2.7x single-agent. relay-structured's was 15x. At the same 5000-token budget, self-relay gets a similar accuracy benefit at a fraction of the cost, because the resolver only fires when needed.

What the statistics say

The +2.5pp does not survive a stat test. Wilson 95% confidence intervals overlap: single-agent at [56.1%, 69.4%], self-relay at [58.7%, 71.7%]. Both point estimates are fully consistent with the other condition being equal.

McNemar's test asks whether the cases where self-relay wins and the cases where single-agent wins are distributed equally. With ~40% disagreement and the observed net gain, the realistic discordant split comes out around 22 self-relay wins versus 18 single-agent wins out of 200 examples. chi2=0.23, p approximately 0.89.

Detecting a true 2.5pp effect at 80% power takes about 5,767 examples. The current n is 29x too small. GSM8K's test set has roughly 1,319 examples. The full MATH dataset has roughly 5,000. The standard benchmarks are not large enough to confirm an improvement this small.

I ran the calibration script on the n=5 cost probe to verify the measurement machinery works. Disagreement rate: 40%, matching prior experiments. All five examples had max agent confidence above 0.8, so confidence-gating would route every one of them to single-agent. Resolver got 1 out of 2 disagreement cases, same rate as single-agent on those same problems. Not a finding. Just proof the script runs.

A direction on a hard benchmark at a fraction of the cost is useful. It just is not a result yet. Running on 6,000 examples would make it one. So would pre-registering and replicating. Reporting the stat test now is not pessimism about the architecture. It is how you tell the difference between a pattern and a coincidence.

Where this fits in production

Self-relay fits anywhere the cost of a wrong answer exceeds the cost of a second call. Legal document review, code security audits, medical triage, financial analysis. On those tasks, running two independent reasoning chains and only paying for arbitration when they disagree is a straightforward reliability pattern.

The practical deployment is confidence-gating, not always-on relay. Run single-agent first and check the confidence field from the mental model JSON. If confidence is above a threshold, return the answer. If not, escalate to self-relay:

def answer_with_confidence_gate(question, threshold=0.75, model="claude-sonnet-4-6", budget=5000):
    # First pass: single agent with mental model
    mm, answer, tokens = agent_call_structured("solver", question, [], [], model, budget)
    confidence = float(mm.get("confidence", 0))

    if confidence >= threshold:
        return {"answer": answer, "method": "single", "total_tokens": tokens}

    # Low confidence: escalate to self-relay
    result = run_self_relay(question, model=model, budget_tokens=budget)
    result["method"] = "self-relay"
    return result

On a real workload where roughly 25-30% of questions fall below threshold, this brings the average token overhead from 2.7x down to roughly 1.3-1.5x single-agent. The relay only runs on requests that actually need it.

What I would try next

The n=500 threshold was not triggered (65.5% is below 72.0%), so these results stay at n=200. The next useful test is calibrating the confidence threshold: at what level does escalating to self-relay recover correct answers, and what is the false-escalation rate? A domain-specific benchmark, legal review or code analysis, would also stress-test whether the disagreement pattern and resolver accuracy hold outside of competition math.

The repo is at github.com/bhj37193/relay. The eval harness is in relay/eval_structured.py. All results are in eval_structured_results.json.

DEV Community