There's a technique everyone reaches for when they want a language model to be more reliable: run it a few times and take the majority answer. Self-consistency. Sample five, vote, ship the winner. It's in every "how to make your LLM more accurate" post.
I measured it against the alternative on problems at the model's failure frontier, and the result was the opposite of the folklore: majority voting barely helped, and it scored worse than the single best reasoner. The thing that actually worked, keeping the best verified answer, beat voting by 20 points and the single-pass baseline by 22.
Here's the data, on the same fixed model, same problems, graded against ground truth computed in Python (so the grading is objective, not the model marking its own homework):
| Scaffold around the model | Accuracy |
|---|---|
| single pass (one sample) | 28% |
| self-consistency (majority of 5) | 30% |
| best-of-N + verifier (keep the best valid of 5) | 50% |
Voting bought +2 points over doing nothing. The verifier bought +22.
Why voting fails, specifically
The problems were subset-sum — genuine search, no shortcut, the kind of task where careful reasoning can only approximate. When I looked at why voting flopped, two failure modes were sitting right there in the answers.
Correlated errors. On several problems all five reasoners returned the same wrong answer — a common greedy local optimum they all fell into. Majority vote cannot fix a unanimous mistake. Five samples that share a bias aren't five opinions; they're one opinion with error bars.
The correct answer is a minority. On two problems, exactly one of the five agents found the true optimum — and got outvoted four-to-one. Majority vote didn't just miss the right answer; it actively threw it away. It saw the correct solution and discarded it because it was rare.
That second one is the part everyone gets wrong. On many search and optimization tasks, finding the optimum is rare but checking a candidate answer is cheap. That asymmetry is the whole game, and voting is blind to it. It treats a correct-but-rare answer identically to a wrong-but-rare one: as noise to be outvoted.
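Both failure modes are easy to reproduce on paper. The snippet below is a minimal sketch, not the harness from the experiment: `majority_vote` and the sampled values are hypothetical, chosen only to mirror the two cases described above (a unanimous wrong answer, and a correct answer outvoted four-to-one).

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical samples for one maximization problem whose true optimum is 100.

# Case 1: correlated errors. All five samples land on the same greedy value.
unanimous_wrong = [95, 95, 95, 95, 95]

# Case 2: the correct answer is a minority. One sample finds the optimum.
rare_correct = [95, 95, 100, 95, 95]

print(majority_vote(unanimous_wrong))  # 95: a vote cannot repair a shared mistake
print(majority_vote(rare_correct))     # 95: the optimum (100) was sampled, then discarded
```

In the second case the right answer was in the sample set the whole time. The vote never looks at the value of an answer, only at how often it appears.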
What to do instead
If the answer is verifiable, don't vote — verify. Generate N attempts, check each one, keep the best that passes. On these problems every agent's answer was a constructively valid lower bound (it exhibited an actual subset summing to a value), so the verifier was literally "keep the largest valid claim." That captures the rare correct find instead of burying it.
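As a rough illustration of "keep the largest valid claim", here is a sketch under stated assumptions. It assumes the task is "pick a subset whose sum is as large as possible without exceeding a target", which is my reading of the setup rather than the exact formulation used in the experiment. The names `verify` and `best_of_n` and the sample numbers are hypothetical.

```python
from collections import Counter


def verify(numbers, target, subset):
    """Return the subset's sum if the claim is valid, else None.

    Assumed validity rule: the subset uses only numbers that are actually
    available (respecting duplicates) and its sum does not exceed the target.
    """
    available = Counter(numbers)
    used = Counter(subset)
    if any(count > available[value] for value, count in used.items()):
        return None  # claims a number that is not in the problem
    total = sum(subset)
    return total if total <= target else None


def best_of_n(numbers, target, candidates):
    """Check every candidate and keep the one with the largest valid sum."""
    best_subset, best_sum = None, None
    for subset in candidates:
        value = verify(numbers, target, subset)
        if value is not None and (best_sum is None or value > best_sum):
            best_subset, best_sum = subset, value
    return best_subset, best_sum


numbers, target = [60, 45, 35, 20], 100
candidates = [
    [60, 35],      # greedy answer, sum 95
    [60, 35],
    [45, 35, 20],  # the rare optimum, sum 100
    [60, 35],
    [60, 45],      # invalid: sum 105 exceeds the target
]
print(best_of_n(numbers, target, candidates))  # ([45, 35, 20], 100)
```

A majority vote over the same five candidates would return the greedy 95. The checker returns 100 because it ranks answers by verified value, not by frequency, and it drops the invalid claim instead of counting it.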
This isn't a clever trick I invented. Generate many, verify, keep the best is a well-known best-of-N pattern, and it's why "sample-and-vote" underperforms "sample-and-check" on tasks you can check. The verifier is whatever proves correctness for your task: run the test, recompute the value, type-check the output, execute the code. Voting is what you fall back to only when there's nothing to check and the errors are genuinely independent, a much narrower case than the technique's popularity suggests.
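To see how narrow the "independent errors" case is, here is a back-of-the-envelope calculation. It rests on a simplifying assumption that is mine, not the experiment's: each sample is independently right with probability `p`, and the vote succeeds only when a strict majority is right. Real self-consistency takes a plurality over many distinct answers, so treat this as intuition rather than a model of the results above. The function name `majority_correct_prob` is hypothetical.

```python
from math import comb


def majority_correct_prob(p, n=5):
    """Probability that a strict majority of n independent samples is correct,
    treating each sample as simply right (probability p) or wrong."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


for p in (0.3, 0.5, 0.7):
    print(f"per-sample accuracy {p:.1f} -> majority-of-5 accuracy {majority_correct_prob(p):.3f}")
```

Under these assumptions, voting amplifies whatever side of 50% the model is already on: above it, the majority is right more often than a single sample; below it, the majority is right less often. At the failure frontier, where per-sample accuracy is low, even perfectly independent samples give a vote little to work with.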
The honest limits
I'm not going to oversell a small experiment. This was ten frontier problems — a clear directional result, not a tight confidence interval. On easy, saturated tasks (deterministic arithmetic the model already solves at 100%) none of this matters: there's nothing to vote on and nothing to fix, and stacking five samples is just paying 5× for the same answer. The verifier edge shows up precisely where the model is at its limit and the answers start to diverge.
And "keep the max valid claim" was exactly right for this maximization. For your task the verifier is different — but the principle transfers cleanly: on verifiable work, a checker that keeps the best beats a vote that keeps the popular.
The takeaway I'd carry into any system that spends test-time compute: spending more samples is not the lever. Spending them on the right aggregator is. A verifier turns a rare correct answer into the shipped answer. A vote turns it into an outlier and deletes it.