DEV Community

Alex Towell

Posted on • Originally published at metafunctor.com

I Spent $0.48 to Find Out When MCTS Actually Works for LLM Reasoning

Does tree search help LLM reasoning? The literature can't decide.

ReST-MCTS* says yes. AB-MCTS got a NeurIPS spotlight. "Limits of
PRM-Guided Tree Search" says no: MCTS with a process reward model used
11x more tokens than best-of-N for zero accuracy gain. Snell et al.
found beam search degrades performance on easy problems.

I built mcts-reasoning
and ran controlled experiments to find where the boundary is.
Total API cost: $0.48.

Setup

Four methods, same budget. Eight solution attempts per problem:

  • Pass@1: One shot.
  • Best-of-N: 8 independent solutions, verifier scores each, pick the best.
  • Self-consistency: 8 solutions, majority vote.
  • MCTS: 5 initial solutions scored by verifier, then 3 more guided by UCB1, informed by what worked and what failed.
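For concreteness, here is a minimal sketch of the two selection rules that differ only in how they pick a winner. The candidate answers and scores below are made up for illustration; none of this is code from mcts-reasoning:

```python
from collections import Counter

# Hypothetical candidates with verifier scores (fraction of
# constraints satisfied). Illustrative values only.
samples = [
    ({"A": 7, "B": 2}, 0.83),
    ({"A": 7, "B": 3}, 0.75),
    ({"A": 7, "B": 2}, 0.83),
    ({"A": 6, "B": 2}, 0.92),
]

# Best-of-N: keep whichever candidate the verifier scores highest.
best_of_n = max(samples, key=lambda s: s[1])[0]

# Self-consistency: majority vote over answers, ignoring scores.
votes = Counter(tuple(sorted(ans.items())) for ans, _ in samples)
majority = dict(votes.most_common(1)[0][0])

print(best_of_n)  # {'A': 6, 'B': 2} -- highest score
print(majority)   # {'A': 7, 'B': 2} -- most common answer
```

Note the two rules can disagree: the most popular answer is not necessarily the best-scored one, which matters later.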

Model: Claude Haiku 4.5. Problems: constraint satisfaction. Find integer
values for variables satisfying simultaneous constraints. Example:

```
Find A, B, C, D, E, F satisfying ALL constraints:
  1. A * D = 21
  2. C = A + F
  3. B * F = 20
  4. E = A + 2
  5. D + B = A
  6. E - F = B
  7. A mod 7 = 0
  8. C mod 4 = 0
  9. B * D = 12
  10. C > E > A > F > B > D
  11. A + B + C + D + E + F = 40
  12. E * D = 27
```

The verifier is a Python function. Checks each constraint, returns the
fraction satisfied. No LLM in the loop. Deterministic.
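A sketch of what such a verifier might look like for the example problem above. The actual scorer in mcts-reasoning may be structured differently; this just shows the partial-credit idea:

```python
# Deterministic partial-credit verifier for the example problem.
# Returns the fraction of constraints satisfied, not just pass/fail.
def verify(sol: dict) -> float:
    A, B, C, D, E, F = (sol[k] for k in "ABCDEF")
    constraints = [
        A * D == 21,
        C == A + F,
        B * F == 20,
        E == A + 2,
        D + B == A,
        E - F == B,
        A % 7 == 0,
        C % 4 == 0,
        B * D == 12,
        C > E > A > F > B > D,
        A + B + C + D + E + F == 40,
        E * D == 27,
    ]
    return sum(constraints) / len(constraints)

# The intended solution scores 1.0; a near miss gets partial credit.
print(verify({"A": 7, "B": 4, "C": 12, "D": 3, "E": 9, "F": 5}))  # 1.0
```

The return value is the gradient: a solution violating one constraint scores 11/12, not 0, so search can tell "close" from "far".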

Calibration

Easy problems first (3-5 variables, 5-9 constraints). Haiku solved them
in one pass. All methods tied at 100%.

5-variable problems with 9 constraints: Pass@1 dropped to 65%.
Self-consistency failed one problem. But best-of-N still tied MCTS,
because with 8 independent samples at least one is usually correct.
Best-of-N just picks it.
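The arithmetic behind "usually": assuming independent attempts, the chance that at least one of n samples succeeds is 1 − (1 − Pass@1)^n.

```python
# Probability that at least one of 8 independent attempts succeeds,
# given a single-attempt success rate of 65%.
p1 = 0.65
pass_at_8 = 1 - (1 - p1) ** 8
print(round(pass_at_8, 4))  # 0.9998
```

At Pass@1 = 65%, eight blind samples almost always contain a winner, so a good selector is all best-of-N needs. The independence assumption is optimistic in practice, but it explains the ceiling.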

I needed problems where blind sampling hits a ceiling.

Results

Ten harder problems: 6-8 variables, 12-15 constraints. Products,
modular arithmetic, ordering chains, cascading dependencies. Pass@1
dropped to 29%.

| Method        | Solve Rate | Avg Score |
|---------------|------------|-----------|
| Pass@1        | 29%        | 0.29      |
| Pass@8 oracle | 90%        | 0.90      |
| SC@8          | 90%        | 0.90      |
| BestOfN@8     | 90%        | 0.90      |
| MCTS(5+3)     | 100%       | 1.00      |

MCTS solved all 10. Every other method failed on one problem (v6_3).

v6_3 is a 6-variable, 12-constraint problem where none of 8 independent
samples found the correct solution. Pass@8 oracle: 0/8.
Self-consistency picks the most popular wrong answer. BestOfN picks the
best wrong answer. Both fail.

MCTS sees that initial attempts satisfied 10/12 constraints but violated
specific ones. UCB1 selects the most promising partial solution. The
next attempt, informed by the failure pattern, satisfies all 12.
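A minimal sketch of the UCB1 selection step. The scores, visit counts, and exploration constant below are illustrative, not taken from mcts-reasoning:

```python
import math

# UCB1: trade off a node's mean verifier score (exploitation)
# against how rarely it has been visited (exploration).
def ucb1(mean_reward: float, visits: int, total_visits: int,
         c: float = 1.4) -> float:
    if visits == 0:
        return float("inf")  # always try unvisited branches first
    return mean_reward + c * math.sqrt(math.log(total_visits) / visits)

# Five initial attempts: (mean verifier score, visit count).
nodes = [(0.83, 1), (0.75, 1), (0.83, 1), (0.58, 1), (0.92, 1)]
total = sum(v for _, v in nodes)
best = max(range(len(nodes)), key=lambda i: ucb1(*nodes[i], total))
print(best)  # 4 -- the highest-scoring attempt is expanded next
```

With equal visit counts the exploration terms cancel and the best-scoring partial solution wins; as one branch accumulates visits, its exploration bonus shrinks and neglected branches get their turn.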

Total: $0.48. 180 API calls, about 190K tokens.

When MCTS Helps

The pattern across three rounds of experiments:

  1. Easy problems (Pass@1 > 80%): No advantage. The model solves them.

  2. Medium (Pass@1 40-70%): MCTS ties BestOfN. Blind sampling usually
    contains a correct solution. The verifier selects it.

  3. Hard (Pass@1 < 30%): MCTS pulls ahead. When Pass@8 oracle is
    low, blind sampling can't find the answer. MCTS's informed exploration
    does.

The condition: MCTS adds value when independent sampling hits a ceiling
and the verifier provides a gradient.

The gradient part matters. A binary pass/fail verifier says "wrong" but
not how wrong. Partial credit (constraints satisfied / total) gives MCTS
something to work with. The exploration phase sees what's close and
adjusts.

Why Self-Consistency Can't Help

Self-consistency and UCB1 have a structural conflict.

Self-consistency rewards consensus. UCB1 rewards diversity: it explores
undervisited branches precisely because they're undervisited. Using
self-consistency as a reward signal inside MCTS tells the tree to explore
and converge at the same time. The exploration term pushes toward novel
solutions. The consistency reward penalizes them.

On v6_3, all 8 samples failed. SC selected the most common failure
mode. A per-path verifier doesn't have this problem. Each solution is
scored against the constraints independently. Good solutions propagate
through the tree regardless of what other branches found.

I haven't seen this conflict discussed in the literature. Most prior
MCTS-for-LLM work uses per-path evaluation without explaining why
self-consistency is absent.

What the Literature Says

These results fit a pattern.

Snell et al. (2024): compute-optimal test-time scaling needs
difficulty-adaptive allocation. Easy problems need no search. Hard
problems need search plus good verifiers.

"Limits of PRM-Guided Tree Search" (2025): PRM-guided MCTS fails to
beat best-of-N because PRM quality degrades with depth. Noisy reward,
no benefit from search.

"Don't Get Lost in the Trees" (ACL 2025): verifier variance causes
search pathologies. Deterministic verifiers avoid this.

Chen et al. (2024): tree search beats reranking only above roughly 90%
discriminator accuracy. Deterministic constraint checkers hit 100%.

MCTS was built for games with perfect information. Chess and Go have
deterministic reward signals. When the reward is noisy, the search can't
exploit it.

The Verification Asymmetry

The problems where MCTS helps share a structure: easy to verify, hard to
solve.

A constraint satisfaction problem with 8 variables and 15 constraints is
hard to solve. The LLM has to coordinate assignments across all
variables simultaneously. Checking a proposed solution is trivial:
evaluate each constraint, count violations.

This is the asymmetry that makes NP problems interesting. Checking a
certificate is polynomial. Finding one is (presumably) not. It's why
search works for code generation (run the tests) and proof checking
(verify the steps) but not for open-ended essay writing (no verifier).

The same asymmetry shows up at other levels.
Reverse-Process Synthetic Data Generation exploits it
for training data: run the easy direction (differentiation) to get
solved examples of the hard direction (integration).
Science as Verifiable Search
is the same observation about scientific method: science is search
through hypothesis space, and the bottleneck is the cost of testing.
Cheap verification enables fast iteration.

At training time, verifiable rewards let you RL a model into producing
better reasoning (DeepSeek-R1, GRPO). At inference time, verifiable
rewards let you search over candidate solutions (MCTS, best-of-N). At
the level of scientific discovery, verifiable predictions let you prune
hypothesis space. "Reward is Enough" (Silver, Singh, Precup, and
Sutton) is the abstract version of this.

The practical question for LLM reasoning: can you write a verifier? If
yes, search is worth trying. If not, best-of-N with an LLM judge is
probably the ceiling.

Code

Open source: mcts-reasoning.

```shell
pip install -e ".[anthropic]"
export ANTHROPIC_API_KEY=your-key
python experiments/run_csp.py --hard --budget 1.00
```

The provider tracks token usage and enforces budget caps.

Limitations

The problems are hand-crafted. A generator calibrated by Pass@1 rate
would be more convincing. Ten problems show the pattern but aren't
enough for statistical significance.

I tested one model. A weaker model might show MCTS advantage on simpler
problems. A stronger one would need harder problems.

The MCTS exploration context shows which solutions scored well and
poorly, but not which specific constraints were violated. Adding
evaluator feedback to the exploration prompt is an obvious improvement
I haven't tried yet.

Summary

Three conditions for MCTS to help LLM reasoning:

  1. A deterministic verifier. Not a learned reward model, not an LLM judge.
  2. Partial credit from the verifier. A gradient, not just pass/fail.
  3. A problem hard enough that blind sampling can't reliably solve it.

When all three hold, MCTS outperforms best-of-N, self-consistency, and
single-pass. When any one fails, it doesn't.
