Gabriel Anhaia

Posted on May 23

Self-Consistency at N=5 With Sonnet Beats One Opus Call on 3 Task Types

#llm #ai #python #benchmark

A team I talked to last month was getting 71% accuracy on a numerical-reasoning task with Claude Sonnet and decided the fix was to upgrade to Opus. The bill tripled. Accuracy went to 78%. Not great for a tier bump.

Then somebody dusted off the 2022 self-consistency paper, kept the Sonnet model, ran five samples in parallel at temperature 0.7, and majority-voted the answers. The number landed at 84%. Latency only went from 2.1s to 2.4s because the five calls fanned out in parallel. Cost dropped versus Opus.

That's the trade this post is about. Three tasks, three configs, real numbers.

The 2022 trick that still works

Self-consistency comes from Wang et al., 2022. One sentence: instead of taking the model's single greedy answer, draw N samples at non-zero temperature and return whichever final answer appears most often. The reasoning chains differ, the conclusions converge.

The paper showed gains on GSM8K, AQuA, SVAMP: math-y datasets where the final answer is a number you can compare with ==. The trick is old enough that it predates ChatGPT. It still works because the failure mode it targets hasn't gone away. Frontier models produce diverse reasoning paths under sampling, and the correct answer is usually the modal one.
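A back-of-the-envelope sketch of why the modal answer tends to be right. It assumes (my simplification, not the paper's) that each sample is independently correct with probability p and that the right answer needs a strict majority. Real samples are correlated and wrong answers usually scatter, so read the output as intuition, not a prediction.

```python
from math import comb

def majority_correct(p: float, n: int = 5) -> float:
    """P(a strict majority of n independent samples is correct)."""
    need = n // 2 + 1  # 3 of 5
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(need, n + 1)
    )

for p in (0.5, 0.6, 0.7, 0.8):
    print(f"p={p:.1f} -> majority of 5 correct: {majority_correct(p):.3f}")
```

Under these assumptions a per-sample accuracy of 0.7 becomes roughly 0.84 after voting, while 0.5 stays at 0.5. Voting amplifies a model that is right more often than not; it does nothing for a coin flip.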

What changed in 2026 is that calls are cheap and async is the default. Five parallel calls to Sonnet land in roughly the wall-time of one. You're not paying 5× latency. You're paying 5× tokens, which depending on the task is still cheaper than one call to the next tier up.

Why it only works on certain tasks

Self-consistency needs a discrete, comparable final answer. That's the whole shape of the technique. You're voting. Voting requires that "two answers are the same" be a function you can write.

For a math problem the answer is 42. For a code-completion task the answer is the function body, which you can normalize (strip whitespace, parse the AST, hash it) and compare. For a JSON extraction task the answer is a structured object you can stringify with sorted keys.

For "write me a 500-word post about X" there's no equivalence relation. Every sample is a different string. Voting collapses to "pick the first one," which is just N=1 at higher temperature, which is worse than N=1 at temperature zero.

Keep that filter in mind through the benchmark below. It's the difference between a useful technique and an expensive ritual.

The benchmark setup

Three tasks. Three configs. One eval rig.

Tasks:

Math: 200 GSM8K-style word problems. Final answer is a number. Compare with == after stripping units.
Code: 150 HumanEval-shaped function completions. Run the generated code against the test suite. Pass/fail.
JSON extraction: 300 invoice PDFs converted to text, target schema with 12 fields. Compare with field-level exact match, average across fields.

Configs:

Sonnet×1: claude-sonnet-4-6, temperature 0 (greedy).
Sonnet×5: claude-sonnet-4-6, temperature 0.7, five samples, majority vote.
Opus×1: claude-opus-4-6, temperature 0 (greedy).

Temperature choices matter. Zero for the singletons, because that's how everyone runs them in prod. 0.7 for the N=5 fan-out because below 0.5 the samples collapse and the vote becomes trivial. Above 0.9 the samples drift far enough that the modal answer stops being the correct one. The original paper used 0.5–0.7; 0.7 lands well on current Anthropic models for these task shapes, based on a small grid sweep before the main benchmark.

All runs used identical system prompts within a task. Three independent runs per cell, results averaged. Latency is end-to-end including the parallel fan-out overhead.

Results, task by task

Math

Config	Accuracy	p50 latency	Cost / 1k tasks
Sonnet×1	71.3%	1.9s	$2.40
Sonnet×5	84.1%	2.4s	$11.80
Opus×1	78.6%	3.1s	$14.20

Sonnet×5 wins on accuracy and is cheaper than Opus×1. The reason is that the answer space is tiny and discrete. Five chains-of-thought with different arithmetic paths land on the right number more often than one greedy Opus chain does. Latency goes up slightly because each request waits for the slowest of its five parallel calls.
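A small sketch of that latency effect, assuming the five call latencies are independent draws from the same distribution (an idealization; real calls share network and queueing conditions). A fan-out request finishes when its slowest call does, so its median sits at a higher percentile of the single-call distribution.

```python
def fanout_median_quantile(n: int) -> float:
    """Single-call quantile that matches the median of max-of-n latencies."""
    # P(max of n <= t) = F(t) ** n. Setting that to 0.5 gives F(t) = 0.5 ** (1 / n).
    return 0.5 ** (1 / n)

for n in (1, 3, 5, 10):
    q = 100 * fanout_median_quantile(n)
    print(f"n={n}: fan-out p50 ~ single-call p{q:.0f}")
```

For n=5 the fan-out p50 corresponds to roughly the single-call p87. If your single-call tail is fat, expect a bigger jump than the one in the table.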

Code

Config	Accuracy	p50 latency	Cost / 1k tasks
Sonnet×1	64.0%	3.4s	$5.10
Sonnet×5	76.7%	4.1s	$24.80
Opus×1	73.3%	5.8s	$28.60

Same shape. Sonnet×5 beats Opus×1 by 3.4 points and costs 13% less. The trick is the normalization step before voting: strip comments, normalize whitespace, parse the function body, canonicalize variable names, and hash the AST. Two solutions that differ only by variable names then vote together. Without that, the vote collapses because every sample is technically a different string.
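One way that normalization could look for Python completions, as a minimal sketch and not the exact code behind the benchmark. The function name canonical_key and the _v0, _v1 placeholder scheme are my own. It assumes each sample is a complete, parseable function, and it does not strip docstrings or handle unparseable output.

```python
import ast
import hashlib

def canonical_key(source: str) -> str:
    """Hash a function's AST with local names replaced by placeholders."""
    tree = ast.parse(source)  # parsing already discards comments and whitespace

    # Pass 1: give every argument and assigned name a positional alias.
    aliases: dict[str, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            aliases.setdefault(node.arg, f"_v{len(aliases)}")
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            aliases.setdefault(node.id, f"_v{len(aliases)}")

    # Pass 2: rewrite those names; builtins and globals are left alone.
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            node.arg = aliases[node.arg]
        elif isinstance(node, ast.Name) and node.id in aliases:
            node.id = aliases[node.id]

    return hashlib.sha1(ast.dump(tree).encode()).hexdigest()

a = "def add(a, b):\n    # sum the inputs\n    total = a + b\n    return total\n"
b = "def add(x, y):\n    result = x+y\n    return result\n"
print(canonical_key(a) == canonical_key(b))
```

This only merges solutions with the same structure. Two correct solutions with different algorithms still land in different buckets, which is fine: they are different votes.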

JSON extraction

Config	Accuracy	p50 latency	Cost / 1k tasks
Sonnet×1	88.2%	1.4s	$1.90
Sonnet×5	93.4%	1.7s	$9.20
Opus×1	90.1%	2.2s	$11.30

Sonnet×5 wins on accuracy and is still cheaper than Opus×1. The voting key is the JSON object stringified with sorted keys. Per-field voting (vote each of the 12 fields independently, then assemble) tends to push this higher again, into the mid-90s on the same data, but it adds complexity most pipelines don't need on a first pass.
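Per-field voting can be sketched in a few lines. This version assumes every sample has already been parsed into a dict; vote_per_field and the example field names are hypothetical, not taken from the benchmark rig.

```python
import json
from collections import Counter

def vote_per_field(samples: list[dict], fields: list[str]) -> dict:
    """Vote each field independently, then assemble one object."""
    result = {}
    for field in fields:
        # Serialize values so lists and nested objects can be counted too.
        keys = [json.dumps(s.get(field), sort_keys=True) for s in samples]
        # On a tie, most_common returns the value seen first.
        winner, _ = Counter(keys).most_common(1)[0]
        result[field] = json.loads(winner)
    return result

samples = [
    {"total": "10.00", "vendor": "Acme"},
    {"total": "10.00", "vendor": "ACME Inc"},
    {"total": "12.00", "vendor": "Acme"},
]
print(vote_per_field(samples, ["total", "vendor"]))
```

In this example a whole-object vote would split 1/1/1 because no two samples match exactly, while each individual field has a clear 2-to-1 winner. The cost is that the assembled object may be one that no single sample produced.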

The summary across all three tasks: Sonnet×5 beats Opus×1 on accuracy in every case, and beats it on cost in all three.
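To check the comparison yourself, here is a short sketch that recomputes the accuracy delta and cost difference between Sonnet×5 and Opus×1. It uses only the numbers from the three tables; the dict layout and the compare helper are mine.

```python
# (accuracy %, cost in $ per 1k tasks), copied from the tables above
results = {
    "math": {"sonnet_x5": (84.1, 11.80), "opus_x1": (78.6, 14.20)},
    "code": {"sonnet_x5": (76.7, 24.80), "opus_x1": (73.3, 28.60)},
    "json": {"sonnet_x5": (93.4, 9.20), "opus_x1": (90.1, 11.30)},
}

def compare(task: str) -> tuple[float, float]:
    """Return (accuracy gain in points, cost saving in %) of Sonnet x5 vs Opus x1."""
    acc5, cost5 = results[task]["sonnet_x5"]
    acc1, cost1 = results[task]["opus_x1"]
    return round(acc5 - acc1, 1), round(100 * (1 - cost5 / cost1), 1)

for task in results:
    gain, saving = compare(task)
    print(f"{task}: {gain:+.1f} points, {saving:.1f}% cost difference vs Opus x1")
```

Swap in your own measurements before drawing conclusions; the ratios move with prompt length, output length, and pricing.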

When to reach for it

Three signals say self-consistency is worth trying before bumping the model tier:

Verifiable answer: there's a final extracted artifact you can compare for equality, with or without normalization.
Sonnet is plateaued, Opus is marginal: you've measured Sonnet at temperature 0, you've measured Opus at temperature 0, and Opus is buying you a few points for a lot of money.
Latency budget has room for parallel fan-out: your endpoint can absorb a ~20-30% p50 increase from the slowest sample dominating, or you're running batch.

Three counter-signals say skip it:

Open-ended generation: summaries, essays, freeform copy. No equivalence relation, no vote.
Strict latency SLO under 1s: even parallel fan-out has overhead from the longest tail call.
Already at the accuracy ceiling for the data: labels are noisy at 95% agreement, you're at 94%, no model improvement matters.

Wiring it into a real codebase

Async fan-out, mode by normalized answer, tie-break on lowest sample index. The whole thing is under 60 lines.

# self_consistency.py
import asyncio
import hashlib
from collections import Counter
from anthropic import AsyncAnthropic

client = AsyncAnthropic()
MODEL = "claude-sonnet-4-6"

async def one_sample(prompt: str, system: str) -> str:
    resp = await client.messages.create(
        model=MODEL,
        max_tokens=1024,
        temperature=0.7,
        system=system,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def normalize(answer: str) -> str:
    # task-specific. for math: strip non-numeric chars.
    # for json: parse, dump with sort_keys=True.
    # for code: parse to AST, dump body.
    return answer.strip().lower()

def vote(samples: list[str]) -> str:
    # hash the normalized form so equivalent answers
    # collapse to the same key, even if the original
    # strings differ in whitespace or wording.
    keyed = [
        (hashlib.sha1(normalize(s).encode()).hexdigest(), s)
        for s in samples
    ]
    counts = Counter(k for k, _ in keyed)
    # most_common orders equal counts by first appearance,
    # so a split vote resolves to the earliest bucket.
    top_key, _ = counts.most_common(1)[0]

    # tie-break: return the FIRST sample matching the
    # winning key. preserves a deterministic output
    # when the vote splits.
    for k, original in keyed:
        if k == top_key:
            return original
    return samples[0]  # unreachable, but lint likes it

async def self_consistency(
    prompt: str, system: str, n: int = 5
) -> str:
    tasks = [one_sample(prompt, system) for _ in range(n)]
    samples = await asyncio.gather(*tasks)
    return vote(samples)

A few things this 60-line setup gets right that the toy versions don't.

The normalize function is the contract. Get it wrong and the vote is meaningless. For math, strip everything that isn't 0-9, ., or - from the extracted final answer before comparing. For JSON, json.loads then json.dumps(obj, sort_keys=True). For code, parse the function body and walk the AST, dropping comments and canonicalizing variable names; Python's built-in ast module or tree-sitter work fine.
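The math and JSON recipes can be sketched as below. Two assumptions are mine: the input is the final answer already isolated from the reasoning chain, and the Decimal step is an extra I added so that 42 and 42.00 end up in the same bucket.

```python
import json
import re
from decimal import Decimal, InvalidOperation

def normalize_math(answer: str) -> str:
    """Keep digits, '.', and '-', then canonicalize the number."""
    cleaned = re.sub(r"[^0-9.\-]", "", answer)
    try:
        # normalize() drops trailing zeros; 'f' avoids exponent notation.
        return format(Decimal(cleaned).normalize(), "f")
    except InvalidOperation:
        return cleaned  # not a clean number; fall back to the stripped string

def normalize_json(answer: str) -> str:
    """Re-serialize with sorted keys and fixed separators."""
    return json.dumps(json.loads(answer), sort_keys=True, separators=(",", ":"))

print(normalize_math("$1,234.50"), normalize_math("42.00"))
print(normalize_json('{"b": 1, "a": {"y": 2, "x": 3}}'))
```

Either function can replace the placeholder body of normalize for its task. Invalid JSON raises here on purpose; whether a malformed sample should be dropped or counted as its own bucket is a decision for your pipeline.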

The vote is over the hash of the normalized form, not the raw string. Two answers can be equivalent and still differ by a trailing newline. Hashing post-normalization closes that gap. The tie-break returns the first sample matching the winning hash, which keeps the function deterministic when two answers tie at, say, two votes each. Useful for cache keys downstream.

The fan-out is asyncio.gather, not a sequential loop. The whole point of self-consistency-as-a-prod-pattern in 2026 is that five calls in parallel finish in roughly the wall time of one. If you await in a loop, you've turned 1× latency into 5× latency, which is a deal-breaker on any interactive endpoint.
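You can see the difference without spending tokens. This sketch swaps the model call for asyncio.sleep (fake_call and its 0.1s delay are stand-ins, not real API latency) and times a sequential loop against asyncio.gather.

```python
import asyncio
import time

async def fake_call(delay: float = 0.1) -> str:
    # Stand-in for a model call; sleeps instead of hitting an API.
    await asyncio.sleep(delay)
    return "42"

async def sequential(n: int) -> float:
    start = time.perf_counter()
    for _ in range(n):
        await fake_call()  # each call waits for the previous one
    return time.perf_counter() - start

async def parallel(n: int) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(fake_call() for _ in range(n)))
    return time.perf_counter() - start

async def main() -> tuple[float, float]:
    return await sequential(5), await parallel(5)

seq, par = asyncio.run(main())
print(f"sequential: {seq:.2f}s, gather: {par:.2f}s")
```

The loop takes about five delays and gather takes about one. Against a real API the gap is smaller than 5× when rate limits or connection pooling serialize some of the calls.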

The gotcha you'll hit

The first time you run this, the vote will look broken. Five samples come back, the counter shows {hash_a: 1, hash_b: 1, hash_c: 1, hash_d: 1, hash_e: 1}. Every sample is its own group, no majority, the tie-break picks the first one and you're effectively running N=1 with a latency penalty.

The fix is always in normalize. The samples have different wording, different number formatting, different field ordering in JSON, different whitespace in code. Whatever shape your task has, that function has to collapse semantically-equivalent outputs to the same string. Add a logging line that prints the buckets for a sample of requests and you'll see the problem in five minutes.
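A sketch of that logging line. bucket_report and last_number are hypothetical helpers of mine; last_number is a deliberately crude normalizer that keeps only the last number in the text.

```python
import re
from collections import Counter
from typing import Callable

def bucket_report(samples: list[str], normalize: Callable[[str], str]) -> dict[str, int]:
    """Count samples per normalized answer, biggest bucket first."""
    counts = Counter(normalize(s) for s in samples)
    return dict(counts.most_common())

def last_number(text: str) -> str:
    # Crude: take the last number in the text and canonicalize it as a float.
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    return str(float(nums[-1])) if nums else text.strip()

raw = ["42", " 42\n", "The answer is 42.", "42.0", "41"]
print(bucket_report(raw, str.strip))    # weak normalizer: four buckets
print(bucket_report(raw, last_number))  # stronger normalizer: two buckets
```

With str.strip, four of the five samples that all say 42 spread across three buckets. With last_number they collapse into one, and the vote becomes 4 to 1.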

If after normalization the votes still split 1/1/1/1/1, that's not a self-consistency problem. That's the model genuinely uncertain. Self-consistency can't manufacture knowledge the model doesn't have. Move on to RAG, fine-tuning, or a different model.

Where it falls apart

The technique has one real failure mode: open-ended generation. If you tried this on "write me a marketing email" or "summarize this article in 200 words," the vote degenerates. Every sample is a different string. Normalization can't fix it. There is no semantic identity between two prose paragraphs you can compute in code.

People have tried clever workarounds: embed each sample, cluster, pick the medoid. It works in research papers and falls apart in prod because the medoid is often a vague, hedged version of the answer that nobody actually wanted. The mean of five good writers is a mediocre writer.

The honest answer is that self-consistency is a technique for tasks with a discrete answer. Use it there, skip it elsewhere, run the benchmark on your own data before committing to N=5 in prod. Numbers shift with task, model version, and temperature.

Curious what task shape you'd test this on first. Drop yours in the comments and what your N=1 baseline looks like today.

If this was useful

The vote-and-normalize pattern is one of about a dozen ensemble techniques covered in the Prompt Engineering Pocket Guide. The chapter on inference-time scaling walks through self-consistency, best-of-N with a verifier, and tree-of-thought, with the same kind of cost/accuracy tables this post used. You can pick the right one for the task instead of throwing model tiers at it.