🌐 Live demo (LOOK · UNDERSTAND · BUILD): https://dev48v.infy.uk/prompt/day3-self-consistency.html
Day 3 of my PromptFromZero series. 50 LLM techniques in 50 days, each visualized with LOOK / UNDERSTAND / BUILD.
Today: self-consistency. The simplest +35-point accuracy lift you can give a small model. Pairs naturally with Chain of Thought (Day 2).
The setup
Same hard math problem. Same model. Five runs.
A train leaves at 60 km/h. 30 minutes later, a second train
leaves the same station at 80 km/h on the same track. How
many minutes after the second train leaves do they meet?
Correct answer: 90 min. (First has 30 km lead. Second closes at 20 km/h. 30÷20 = 1.5h.)
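The arithmetic behind that answer can be checked in a few lines. This is only a sketch; `minutesToMeet` and its parameter names are illustrative and do not come from the demo page.

```javascript
// Catch-up time for two trains on the same track (illustrative helper).
function minutesToMeet(firstKmh, secondKmh, headStartMin) {
  const leadKm = firstKmh * (headStartMin / 60); // how far ahead the first train is
  const closingKmh = secondKmh - firstKmh;       // how fast the gap shrinks
  return (leadKm / closingKmh) * 60;             // hours → minutes
}

console.log(minutesToMeet(60, 80, 30)); // 90
```

The key step is dividing the lead by the closing speed (20 km/h), not by the second train's own speed.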
A single CoT call on gemini-2.5-flash with temperature 0.7:
Run 1: 90 ✓
Looks right. But run it again:
Run 2: 22 ✗ (the model divided by 80 instead of 20)
You can't tell which run is the right one from a single call. The wrong run looks just as confident as the right one. With a single roll, you're stuck with whatever the dice say.
The fix
Sample the same prompt N times in parallel, extract each numeric answer, and take the majority.
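import { generateText } from "ai";
import { google } from "@ai-sdk/google";

const model = google("gemini-2.5-flash");
const N = 5;
// The question from "The setup". The step-by-step cue is an assumed CoT wording.
const prompt = `A train leaves at 60 km/h. 30 minutes later, a second train
leaves the same station at 80 km/h on the same track. How many minutes after
the second train leaves do they meet? Think step by step, then end with the
final answer in minutes as a plain number.`;

// Fire N independent samples of the same prompt in parallel.
const samples = await Promise.all(
  Array.from({ length: N }, () =>
    generateText({ model, prompt, temperature: 0.7 })
  )
);

// Treat the last number in the response as the model's final answer.
function extract(text) {
  const nums = text.match(/-?\d+(?:\.\d+)?/g);
  return nums ? nums[nums.length - 1] : null;
}
const answers = samples.map(s => extract(s.text));

// Count votes per answer; a run with no number casts no vote.
const tally = {};
for (const a of answers) {
  if (a !== null) tally[a] = (tally[a] || 0) + 1;
}
const [winner, votes] = Object.entries(tally)
  .sort((a, b) => b[1] - a[1])[0] ?? [null, 0];
console.log(`Final: ${winner} — ${votes}/${N} votes`);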
Now:
Samples: ["90", "90", "22", "90", "90"]
Final: 90 — 4/5 votes
The outlier got out-voted. The wrong answer never reaches the user.
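The voting step can also be pulled into a pure function that you can test without an API key. The sketch below is one possible shape, not the demo's code: `vote` is a hypothetical name, and merging "90" with "90.0" and reporting ties are assumptions added here.

```javascript
// Majority vote over already-extracted answers (illustrative helper).
function vote(answers) {
  const tally = new Map();
  for (const a of answers) {
    if (a === null) continue;        // run produced no number: no vote
    const key = String(Number(a));   // "90.0" and "90" share one bucket
    tally.set(key, (tally.get(key) || 0) + 1);
  }
  const ranked = [...tally.entries()].sort((x, y) => y[1] - x[1]);
  if (ranked.length === 0) return { winner: null, votes: 0, tied: false };
  const [winner, votes] = ranked[0];
  // A tie for first place means there is no real majority signal.
  const tied = ranked.length > 1 && ranked[1][1] === votes;
  return { winner, votes, tied };
}

console.log(vote(["90", "90", "22", "90", "90"])); // winner "90" with 4 votes
```

When `tied` is true, a reasonable policy is to sample a few more runs rather than pick either answer.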
Why it works
Each LLM call is a stochastic sample from the model's probability distribution over outputs. With temperature 0 you'd get essentially the same answer every time, right or wrong. With temperature 0.7 the model takes slightly different reasoning paths, and errors that are independent rarely all land on the same wrong answer.
If the model is right 60% of the time on a problem:
- 1 sample: 60% accuracy
- 3 samples + majority: 1 - P(2 or 3 wrong) ≈ 1 - (0.4³ + 3·0.4²·0.6) = 65%
- 5 samples + majority: ≈ 68% on this distribution
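Those figures are a floor: they assume every wrong run lands on the same wrong answer. In practice wrong answers tend to scatter across different values, so the right answer only needs a plurality, not a strict majority. Stacked on Chain-of-Thought's lift over zero-shot (~+25 points), that is where figures like ~95% on grade-school math come from.
Numbers depend on the model and the problem. The shape is always the same as long as the model is right more often than wrong: more samples → fewer mistakes, with diminishing returns past N=10.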
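The 65% and 68% figures come from a simple binomial model, which can be reproduced as follows. This is a sketch under the assumption that runs are independent and each is correct with probability p; the helper names `choose`, `atLeast`, and `majorityAccuracy` are illustrative.

```javascript
// Binomial coefficient C(n, k).
function choose(n, k) {
  let c = 1;
  for (let i = 1; i <= k; i++) c = (c * (n - k + i)) / i;
  return c;
}

// P(at least `need` of n independent runs are correct), each correct with prob p.
function atLeast(n, need, p) {
  let total = 0;
  for (let k = need; k <= n; k++) {
    total += choose(n, k) * p ** k * (1 - p) ** (n - k);
  }
  return total;
}

// Strict majority: more than half of the runs must be correct.
function majorityAccuracy(n, p) {
  return atLeast(n, Math.floor(n / 2) + 1, p);
}

console.log(majorityAccuracy(1, 0.6).toFixed(3)); // 0.600
console.log(majorityAccuracy(3, 0.6).toFixed(3)); // 0.648
console.log(majorityAccuracy(5, 0.6).toFixed(3)); // 0.683

// Best case for scattering: if every wrong run gave a different wrong answer,
// 2 correct runs out of 5 would already win the plurality.
console.log(atLeast(5, 2, 0.6).toFixed(3)); // 0.913
```

Real behavior sits somewhere between the strict-majority figure and the all-scattered best case, depending on how often the model repeats the same mistake.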
Temperature matters
- temp = 0 → deterministic, all 5 samples identical, defeats the point
- temp = 0.7 → sweet spot, diverse reasoning paths, math stays valid
- temp = 1.5 → too random, the model starts writing nonsense
You want diversity without losing competence. 0.7 is a common default; tune it per model and task.
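To see why temperature changes diversity, it helps to look at how it reshapes a probability distribution. The sketch below applies the standard softmax-with-temperature formula to made-up scores; the `logits` values are invented for illustration and say nothing about any real model's internals.

```javascript
// Softmax with temperature over raw scores (logits).
// Does not handle temperature = 0, which is conventionally treated as "always pick the top score".
function softmaxT(logits, temperature) {
  const scaled = logits.map(l => l / temperature);
  const max = Math.max(...scaled);                  // subtract max for numerical stability
  const exps = scaled.map(s => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

const exampleLogits = [2.0, 1.0, 0.1]; // hypothetical scores for three candidate tokens
for (const t of [0.2, 0.7, 1.5]) {
  console.log(t, softmaxT(exampleLogits, t).map(p => p.toFixed(2)));
}
```

Low temperatures pile almost all probability onto the top candidate, while high temperatures flatten the distribution and let unlikely candidates through.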
Confidence for free
votes / N gives you a free confidence score:
- 5/5 → trust it, auto-accept
- 3-4/5 → use but flag for human review
- ≤2/5 → the model is guessing, refuse to answer
Vote share is a rough signal rather than a calibrated probability, but it is enough to decide which answers to accept, flag, or refuse.
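The three tiers above translate into a small routing function. This is a sketch: `route` and its return labels are hypothetical, and extending the 5-sample thresholds to other values of N (unanimous → accept, strict majority → review) is an assumption.

```javascript
// Map a vote count to an action, following the 5/5, 3-4/5, ≤2/5 tiers.
function route(votes, n) {
  if (votes === n) return "accept";    // unanimous
  if (votes > n / 2) return "review";  // strict majority, but not unanimous
  return "refuse";                     // no majority: the model is guessing
}

console.log(route(5, 5)); // accept
console.log(route(4, 5)); // review
console.log(route(2, 5)); // refuse
```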
The trade-off — cost
N=5 means roughly 5× the tokens of a single call. Per request:
- Single CoT: ~1k tokens, 60% accurate on hard math
- Self-consistency (N=5): ~5k tokens, ~95% accurate
For high-stakes problems (medical, finance, code review, judgments), you pay 5× to lift accuracy from 60 → 95%. For low-stakes tasks (chat, summarization, creative writing), single-shot CoT is fine.
Non-numeric answers
For text answers (yes/no, multi-class), normalize before tallying. "Yes" / "yes." / " yes" should all count as one bucket.
function normalize(s) {
  // Lowercase and strip everything except letters and digits ("yes." → "yes").
  return s.toLowerCase().replace(/[^a-z0-9]/g, "");
}
const canonical = answers.map(normalize);
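Put together, a text-answer vote might look like the sketch below. It redefines `normalizeText` so the block runs on its own; `voteText` and the sample answers are illustrative, not taken from the demo.

```javascript
// Lowercase and strip everything except letters and digits.
function normalizeText(s) {
  return s.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Plurality vote over free-text answers after normalization.
function voteText(rawAnswers) {
  const tally = {};
  for (const raw of rawAnswers) {
    const key = normalizeText(raw);
    if (key === "") continue; // nothing usable left after normalization
    tally[key] = (tally[key] || 0) + 1;
  }
  const ranked = Object.entries(tally).sort((a, b) => b[1] - a[1]);
  if (ranked.length === 0) return { winner: null, votes: 0 };
  return { winner: ranked[0][0], votes: ranked[0][1] };
}

console.log(voteText(["Yes", "yes.", " yes", "No", "NO!"])); // winner "yes" with 3 votes
```

For longer free-form answers this kind of exact-match bucketing stops working, so it is best suited to short labels such as yes/no or class names.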
Build it in 10 minutes
mkdir self-consistency && cd self-consistency
npm init -y
npm install ai @ai-sdk/google
echo "GOOGLE_GENERATIVE_AI_API_KEY=your_key" > .env
Get a free Gemini key at https://aistudio.google.com/apikey — no credit card.
Drop the JS snippet above into self-consistency.mjs and run it (the --env-file flag needs Node 20.6 or newer):
node --env-file=.env self-consistency.mjs
5 parallel calls. Tally. Winner.
Try it now
Three tabs on one page:
https://dev48v.infy.uk/prompt/day3-self-consistency.html
- LOOK — animated 5-sample run with live tally bar chart
- UNDERSTAND — 9 click-through steps on why it works
- BUILD — full Node script, copy + run
What's next in PromptFromZero
Day 4: Few-shot prompting. Drop 2-3 worked examples in the prompt → the model copies the format and reasoning depth on the actual question. The poor man's fine-tune.
🌐 All techniques: https://dev48v.infy.uk/promptfromzero.php