Devanshu Biswas

Sample Your LLM 5 Times and Take a Majority Vote — Accuracy Jumps 35 Points

🌐 Live demo (LOOK · UNDERSTAND · BUILD): https://dev48v.infy.uk/prompt/day3-self-consistency.html

Day 3 of my PromptFromZero series. 50 LLM techniques in 50 days, each visualized with LOOK / UNDERSTAND / BUILD.

Today: self-consistency. The simplest +35-point accuracy lift you can give a small model. Pairs naturally with Chain of Thought (Day 2).


The setup

Same hard math problem. Same model. Five runs.

A train leaves at 60 km/h. 30 minutes later, a second train
leaves the same station at 80 km/h on the same track. How
many minutes after the second train leaves do they meet?

Correct answer: 90 min. (First has 30 km lead. Second closes at 20 km/h. 30÷20 = 1.5h.)
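The arithmetic behind that answer, as a minimal sketch. `catchUpMinutes` is a hypothetical helper name, not part of the script later in this post, and it assumes constant speeds with the second train being the faster one.

```javascript
// Hypothetical helper: minutes after the second train departs until it catches the first.
// Assumes constant speeds and fastKmh > slowKmh.
function catchUpMinutes(slowKmh, fastKmh, headStartMin) {
  const leadKm = slowKmh * (headStartMin / 60); // distance covered before the second train starts
  const closingKmh = fastKmh - slowKmh;         // how fast the gap shrinks
  return (leadKm / closingKmh) * 60;            // hours converted to minutes
}

console.log(catchUpMinutes(60, 80, 30)); // 90
```

Dividing the 30 km lead by 80 instead of the 20 km/h closing speed gives 22.5 minutes, which is the kind of slip a single run can make.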

A single CoT call on gemini-2.5-flash with temperature 0.7:

Run 1: 90  ✓

Looks right. But run it again:

Run 2: 22  ✗  (the model divided by 80 instead of 20)

You can't tell which run is the right one from a single call. The wrong run looks just as confident as the right one. Single-roll = stuck with whatever the dice say.


The fix

Sample the same prompt N times in parallel, extract each numeric answer, and take the majority.

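import { generateText } from "ai";
import { google } from "@ai-sdk/google";

const model = google("gemini-2.5-flash");

// Example wording; any CoT prompt that ends with the final number works.
const prompt = `A train leaves at 60 km/h. 30 minutes later, a second train
leaves the same station at 80 km/h on the same track. How
many minutes after the second train leaves do they meet?
Think step by step, then end with the final number of minutes.`;

const N = 5;
// Fire all N calls at once; each one is an independent sample.
const samples = await Promise.all(
  Array.from({ length: N }, () =>
    generateText({ model, prompt, temperature: 0.7 })
  )
);

// Take the last number in the text as the model's final answer.
function extract(text) {
  const nums = text.match(/-?\d+(?:\.\d+)?/g);
  return nums ? nums[nums.length - 1] : null;
}

// Samples with no number at all do not get a vote.
const answers = samples.map(s => extract(s.text)).filter(a => a !== null);
console.log("Samples:", JSON.stringify(answers));

const tally = {};
for (const a of answers) tally[a] = (tally[a] || 0) + 1;
// Highest count first; fall back to "no winner" if nothing was extracted.
const [winner, votes] = Object.entries(tally)
  .sort((a, b) => b[1] - a[1])[0] ?? [null, 0];

console.log(`Final: ${winner}  ${votes}/${N} votes`);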

Now:

Samples: ["90", "90", "22", "90", "90"]
Final: 90  4/5 votes

The outlier got out-voted. The wrong answer never reaches the user.
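To try the extract-and-vote step without spending API calls, here is a rough offline sketch. `extractNumber`, `majorityVote`, and the `fakeSamples` strings are made up for illustration. Unlike the script above, it parses answers to numbers so that "90" and "90.0" land in the same bucket, which is a design choice of this sketch rather than something the original script does.

```javascript
// Pull the last number out of a completion and parse it, so "90" and "90.0" share a bucket.
function extractNumber(text) {
  const nums = text.match(/-?\d+(?:\.\d+)?/g);
  return nums ? Number(nums[nums.length - 1]) : null;
}

// Return the most common parsed answer and its vote count.
function majorityVote(texts) {
  const tally = new Map();
  for (const text of texts) {
    const answer = extractNumber(text);
    if (answer === null) continue; // unparseable samples do not vote
    tally.set(answer, (tally.get(answer) ?? 0) + 1);
  }
  let winner = null;
  let votes = 0;
  for (const [answer, count] of tally) {
    if (count > votes) { winner = answer; votes = count; } // on a tie, the first-seen answer stays
  }
  return { winner, votes, total: texts.length };
}

// Hypothetical completions, invented for this example.
const fakeSamples = [
  "Gap is 30 km, closing speed 20 km/h. Answer: 90",
  "So they meet after 90.0",
  "30 / 80 = 0.375 h. Answer: 22",
  "No numeric answer here.",
  "Answer: 90",
];
console.log(majorityVote(fakeSamples)); // { winner: 90, votes: 3, total: 5 }
```

Taking the last number is fragile if the model keeps talking after its answer (for example "90 minutes, or 1.5 hours"), so asking the model to finish with the bare number helps.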


Why it works

Each LLM call is a stochastic sample from the model's probability distribution over outputs. With temperature 0 you'd get essentially the same answer every time, so a wrong answer stays wrong on every run. With temperature 0.7 the model takes slightly different reasoning paths, and independent errors rarely line up on the same wrong answer, while correct paths converge on the same right one.

If the model is right 60% of the time on a problem:

  • 1 sample: 60% accuracy
  • 3 samples + majority: 1 - P(2 or 3 wrong) ≈ 1 - (0.4³ + 3·0.4²·0.6) = 65%
  • 5 samples + majority: ≈ 68% on this distribution
  • Stack that on top of Chain-of-Thought's lift over zero-shot (~+25 points) and you approach 95% accuracy on grade-school math.

Numbers depend on the model and the problem. The shape is always the same: as long as the correct answer is the model's most likely one, more samples → fewer mistakes, with diminishing returns past N=10.
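The 65% and 68% figures above can be reproduced with a few lines. This is a sketch under two simplifying assumptions: samples are independent, and every wrong sample gives the same wrong answer (the worst case for voting). `majorityAccuracy` and `choose` are names invented here.

```javascript
// Probability that a strict majority of n independent samples is correct,
// when each sample is correct with probability p and all wrong answers agree.
function majorityAccuracy(p, n) {
  let total = 0;
  for (let k = Math.floor(n / 2) + 1; k <= n; k++) {
    total += choose(n, k) * p ** k * (1 - p) ** (n - k);
  }
  return total;
}

// Binomial coefficient C(n, k).
function choose(n, k) {
  let c = 1;
  for (let i = 1; i <= k; i++) c = (c * (n - k + i)) / i;
  return c;
}

for (const n of [1, 3, 5]) {
  console.log(n, majorityAccuracy(0.6, n).toFixed(3));
}
// 1 0.600
// 3 0.648
// 5 0.683
```

When wrong answers scatter across different values instead of agreeing, the correct answer can win with fewer than half the votes, so real gains can be larger than this model predicts. With p below 0.5 in this worst-case model, voting makes things worse, not better.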


Temperature matters

  • temp = 0 → deterministic, all 5 samples identical, defeats the point
  • temp = 0.7 → sweet spot, diverse reasoning paths, math stays valid
  • temp = 1.5 → too random, the model starts writing nonsense

You want diversity without losing competence. 0.7 is a common default.
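To see why those three settings behave so differently, here is a toy sketch of temperature-scaled softmax over made-up scores for three candidate tokens. The function name and the scores are invented for illustration, and how a hosted model applies temperature internally is an assumption here; this shows only the standard formula.

```javascript
// Softmax over raw scores, with each score divided by the temperature first.
// Temperature 0 is not handled here (it would divide by zero).
function softmaxWithTemperature(logits, temperature) {
  const scaled = logits.map(x => x / temperature);
  const max = Math.max(...scaled);                 // subtract the max for numerical stability
  const exps = scaled.map(x => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Made-up scores for three candidate next tokens.
const logits = [2.0, 1.0, 0.0];
for (const t of [0.1, 0.7, 1.5]) {
  console.log(t, softmaxWithTemperature(logits, t).map(p => p.toFixed(2)));
}
```

At 0.1 almost all the probability sits on the top token, at 0.7 the runner-up tokens get a real chance, and at 1.5 the distribution flattens enough that unlikely tokens are picked often.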


Confidence for free

votes / N gives you a free confidence score:

  • 5/5 → trust it, auto-accept
  • 3-4/5 → use but flag for human review
  • ≤2/5 → the model is guessing, refuse to answer

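You can build confidence-aware behavior into an AI product from this signal alone. Check the thresholds against labeled examples before treating the score as calibrated.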
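The three tiers above translate into a small routing function. A minimal sketch: `route` is a hypothetical name, and generalizing the N=5 thresholds to "unanimous / strict majority / anything else" is an assumption of this example.

```javascript
// Map a vote count to an action, following the three tiers listed above.
// For n = 5 this gives: 5 -> accept, 3 or 4 -> review, 2 or fewer -> refuse.
function route(votes, n) {
  if (votes === n) return "accept";     // unanimous
  if (votes > n / 2) return "review";   // strict majority, but not unanimous
  return "refuse";                      // no majority: the model is guessing
}

console.log(route(5, 5)); // accept
console.log(route(3, 5)); // review
console.log(route(2, 5)); // refuse
```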


The trade-off — cost

N=5 costs 5× the tokens of a single call. Per request:

  • Single CoT: ~1k tokens, 60% accurate on hard math
  • Self-consistency (N=5): ~5k tokens, ~95% accurate

For high-stakes problems (medical, finance, code review, judgments), you pay 5× to lift accuracy from 60 → 95%. For low-stakes tasks (chat, summarization, creative writing), single-shot CoT is fine.
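One way to weigh that trade-off is tokens spent per correct answer. A back-of-the-envelope sketch using the rough figures above; `tokensPerCorrectAnswer` is a name made up for this example.

```javascript
// Expected tokens spent per correct answer: tokens per request divided by accuracy.
function tokensPerCorrectAnswer(tokensPerRequest, accuracy) {
  return tokensPerRequest / accuracy;
}

console.log(Math.round(tokensPerCorrectAnswer(1000, 0.60))); // 1667
console.log(Math.round(tokensPerCorrectAnswer(5000, 0.95))); // 5263
```

On those figures, each correct answer costs roughly 3× more with voting, while the share of wrong answers drops from 40% to 5%. Whether that is worth it depends on what a wrong answer costs you.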


Non-numeric answers

For text answers (yes/no, multi-class), normalize before tallying. "Yes" / "yes." / " yes" should all count as one bucket.

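function normalize(s) {
  // Lowercase, then drop everything except letters and digits (spaces and punctuation included).
  return s.toLowerCase().replace(/[^a-z0-9]/g, "");
}
// Tally these instead of the raw answers.
const canonical = answers.map(normalize);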
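Putting normalization and tallying together for text labels might look like the sketch below. `normalizeLabel`, `tallyLabels`, and the sample answers are invented for illustration. Note that this cleanup also strips "." and "-", so it suits labels like yes/no, not decimals or negative numbers.

```javascript
// Collapse case, whitespace, and punctuation so variants of one label share a bucket.
function normalizeLabel(s) {
  return s.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Count normalized labels and return them sorted by votes, highest first.
function tallyLabels(rawAnswers) {
  const counts = new Map();
  for (const raw of rawAnswers) {
    const key = normalizeLabel(raw);
    if (key === "") continue; // nothing left after cleanup, so no vote
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Hypothetical raw answers, made up for this example.
console.log(tallyLabels(["Yes", "yes.", " yes", "No", "NO!"]));
// [ [ 'yes', 3 ], [ 'no', 2 ] ]
```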

Build it in 10 minutes

mkdir self-consistency && cd self-consistency
npm init -y
npm install ai @ai-sdk/google
echo "GOOGLE_GENERATIVE_AI_API_KEY=your_key" > .env

Get a free Gemini key at https://aistudio.google.com/apikey — no credit card.

Drop the JS snippet above into self-consistency.mjs and run it (the --env-file flag needs Node 20.6 or later):

node --env-file=.env self-consistency.mjs

5 parallel calls. Tally. Winner.


Try it now

Three tabs on one page:
https://dev48v.infy.uk/prompt/day3-self-consistency.html

  • LOOK — animated 5-sample run with live tally bar chart
  • UNDERSTAND — 9 click-through steps on why it works
  • BUILD — full Node script, copy + run

What's next in PromptFromZero

Day 4: Few-shot prompting. Drop 2-3 worked examples in the prompt → the model copies the format and reasoning depth on the actual question. The poor man's fine-tune.

🌐 All techniques: https://dev48v.infy.uk/promptfromzero.php
