
Mukunda Rao Katta


LLM Jury: Get a Better Answer by Asking Multiple Models to Vote

One LLM can be wrong. Three LLMs voting on the same question are harder to fool.

This is not a new idea in ML — ensemble methods and majority voting have been standard practice for decades. Applying it to LLM inference is straightforward, but writing the parallel calls, collecting results, and implementing voting logic is boilerplate that every team rewrites.

llm-multi-vote packages the pattern.


The Shape of the Fix

from llm_multi_vote import MultiVote, VoteResult

mv = MultiVote(voters=[
    ("claude", lambda p: call_claude(p)),
    ("gpt4o",  lambda p: call_openai(p)),
    ("gemini", lambda p: call_gemini(p)),
])

result: VoteResult = mv.vote("Is this code safe to run? Answer yes or no.\n\n" + code)

print(f"Winner: {result.winner}")        # "yes" or "no"
print(f"Votes: {result.votes}")          # {"yes": 2, "no": 1}
print(f"Confidence: {result.confidence}") # 0.67 (2/3)
print(f"Agreement: {result.agreement}")  # False (not unanimous)

Three voters. One call. A structured result with the winning answer, vote breakdown, and confidence score.


What It Does NOT Do

llm-multi-vote does not guarantee correctness. Three models can all be wrong about the same thing, especially when the question has a systematic bias that affects all three.

It does not handle structured output. If your voters return JSON objects, you need to normalize them before voting. The voting logic operates on string values.
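
One way to do that normalization is to reduce each JSON payload to a single string before it reaches the vote. The sketch below is illustrative only: label_from_json and the "label" field name are assumptions, not part of llm-multi-vote.

```python
import json

def label_from_json(raw: str, field: str = "label") -> str:
    """Reduce a JSON voter response to one comparable string.

    `field` is a hypothetical key; use whatever key your prompt asks for.
    """
    data = json.loads(raw)
    return str(data[field]).strip().lower()

# Two differently formatted payloads collapse to the same vote.
a = label_from_json('{"label": "Unsafe", "reason": "uses eval"}')
b = label_from_json('{"reason": "shell call", "label": " unsafe "}')
print(a, b)
```

A voter would then wrap its client call, for example ("claude", lambda p: label_from_json(call_claude(p))). A response that is not valid JSON raises inside the lambda, so it would be treated as a failed voter rather than counted as a vote.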

By default, voters run sequentially. Parallel execution is opt-in with parallel=True, which uses ThreadPoolExecutor. If your LLM clients are thread-safe (most are), parallel mode reduces wall-clock time from roughly the sum of the voters' latencies to roughly the latency of the slowest voter.
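
For intuition, here is a rough sketch of what a ThreadPoolExecutor fan-out over voters can look like. It is not the library's internal code: run_voters_parallel, slow_voter, and broken_voter are hypothetical names, and the voters are stubs that sleep instead of calling a model.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_voters_parallel(voters, prompt):
    """Call every voter concurrently; a voter that raises yields None."""
    def safe_call(fn):
        try:
            return fn(prompt)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=max(1, len(voters))) as pool:
        # Submit everything first, then collect in the original voter order.
        futures = [(name, pool.submit(safe_call, fn)) for name, fn in voters]
        return [(name, future.result()) for name, future in futures]

def slow_voter(answer, delay=0.2):
    """Stub voter that waits `delay` seconds, standing in for a network call."""
    def voter(prompt):
        time.sleep(delay)
        return answer
    return voter

def broken_voter(prompt):
    raise RuntimeError("provider unavailable")

voters = [("a", slow_voter("yes")), ("b", slow_voter("no")), ("c", broken_voter)]
start = time.perf_counter()
responses = run_voters_parallel(voters, "Is this safe?")
elapsed = time.perf_counter() - start
print(responses, f"{elapsed:.2f}s")
```

The two sleeping stubs overlap, so the elapsed time should land near one delay rather than two. Results are collected in submission order, which keeps any order-based tie-breaking stable.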


Inside the Library

The voting strategies:

from enum import Enum

class VoteStrategy(Enum):
    MAJORITY = "majority"     # most votes wins; tie goes to first in order
    UNANIMOUS = "unanimous"   # only return winner if all voters agree
    WEIGHTED = "weighted"     # caller provides weights per voter

For MAJORITY voting:

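from collections import Counter

def vote(self, prompt: str) -> VoteResult:
    responses = []
    for name, fn in self._voters:
        try:
            responses.append((name, fn(prompt).strip().lower()))
        except Exception:
            responses.append((name, None))  # voter failed; excluded from the count

    valid = [r for _, r in responses if r is not None]
    if not valid:
        # Every voter failed: no winner and zero confidence.
        # votes={} and agreement=False are assumed values for this case.
        return VoteResult(
            winner=None,
            votes={},
            confidence=0.0,
            agreement=False,
            voter_responses=responses,
        )

    counts = Counter(valid)
    winner = counts.most_common(1)[0][0]  # ties go to the answer seen first

    return VoteResult(
        winner=winner,
        votes=dict(counts),
        confidence=counts[winner] / len(valid),
        agreement=len(counts) == 1,
        voter_responses=responses,
    )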

Failed voters are excluded from counting. If all voters fail, VoteResult.winner is None and confidence is 0.

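For normalized outputs (yes/no, classification labels), instruct voters to return exactly one token or word. The voting logic does exact string matching. As the excerpt above shows, vote() applies .strip().lower() to each response, so "Yes" and "yes" count as the same value, but "yes." or "Yes, it is safe" are still distinct answers. Strip punctuation and extra words in your voter lambda.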
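
A small helper along these lines can do that cleanup inside the voter lambda. This is a sketch, and first_word is a hypothetical name rather than something the library ships.

```python
import re

def first_word(answer: str) -> str:
    """Return the first alphabetic word, lowercased; '' if there is none."""
    match = re.search(r"[A-Za-z]+", answer)
    return match.group(0).lower() if match else ""

samples = ["Yes.", "  NO, it deletes files", "**yes**", "123"]
cleaned = [first_word(s) for s in samples]
print(cleaned)
```

This only helps when the model leads with its verdict. A reply such as "The answer is yes" would be reduced to "the", which is why the prompt should still ask for a one-word answer.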


When to Use It

Use it for high-stakes binary or categorical decisions where you want a second opinion: security classification (safe/unsafe), sentiment labels (positive/negative/neutral), content moderation flags.

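The confidence score is the key output. A confidence of 1.0 (unanimous agreement) is a strong signal. With three voters, a confidence of 0.33 means every voter gave a different answer, so the winner was picked by tie-break order alone. Treat that as a weak signal that warrants human review.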

The cost tradeoff is real: three voters cost roughly three times as much as one. Use voting for decisions that justify the cost. For routine classifications with a well-calibrated single model, voting adds cost without proportional benefit.

Skip it for generation tasks. If you want a diverse set of responses (creative writing, brainstorming), run voters independently and present all responses. Voting collapses diversity, which is the opposite of what generation tasks need.


Install

pip install git+https://github.com/MukundaKatta/llm-multi-vote

Then wire up the voters and call vote():

from llm_multi_vote import MultiVote, VoteStrategy, VoteResult

mv = MultiVote(
    voters=[
        ("claude-sonnet", lambda p: call_claude(p)),
        ("claude-haiku",  lambda p: call_claude_haiku(p)),
        ("gpt4o-mini",    lambda p: call_openai_mini(p)),
    ],
    strategy=VoteStrategy.MAJORITY,
    parallel=True,  # run all voters concurrently
)

def classify_intent(user_message: str) -> str:
    prompt = f"""Classify the user's intent.
Reply with exactly one word: question, complaint, request, or feedback.

User message: {user_message}"""

    result: VoteResult = mv.vote(prompt)

    # 2/3 is about 0.667, so "< 0.67" would also reject a 2-of-3 majority
    if result.confidence < 2 / 3:
        # Weak agreement: default to human review queue
        return "unclear"

    return result.winner

Sibling Libraries

Library               What it solves
prompt-eval-rubric    0.0-1.0 quality scoring for freeform responses
llm-fallback-chain    Try providers in order on failure
llm-batch-coalesce    Collapse duplicate LLM calls from concurrent callers
agentsnap             Record responses for regression testing
llm-output-validator  Rule-based validation of individual responses

The combined pattern: use MultiVote for the decision, llm-output-validator to validate each voter's output before counting, and agentsnap to track confidence scores over time.


What's Next

Weighted voting where voter weights are calibrated from historical accuracy. If Claude agrees with human reviewers 92% of the time on a specific task and GPT-4o agrees 85% of the time, weight Claude more heavily. The calibration data would come from a labeled dataset.
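
As a rough illustration of the idea, the sketch below sums per-voter weights instead of counting heads. weighted_winner is a hypothetical function, not the library's WEIGHTED implementation. The 0.92 and 0.85 weights mirror the figures above, the 0.80 for a third voter is made up, and using raw accuracy as the weight is only one possible choice.

```python
from collections import defaultdict

def weighted_winner(responses, weights):
    """responses: list of (voter_name, answer); weights: voter_name -> positive float.

    Returns (winner, share of total weight). Voters missing from `weights`
    default to 1.0; failed voters (answer None) are skipped.
    """
    totals = defaultdict(float)
    for name, answer in responses:
        if answer is not None:
            totals[answer] += weights.get(name, 1.0)
    if not totals:
        return None, 0.0
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / sum(totals.values())

responses = [("claude", "unsafe"), ("gpt4o", "safe"), ("gemini", "safe")]
weights = {"claude": 0.92, "gpt4o": 0.85, "gemini": 0.80}  # 0.80 is illustrative
winner, share = weighted_winner(responses, weights)
print(winner, round(share, 3))
```

With accuracy-sized weights, two agreeing voters still outvote one. A single voter only overrides the other two once its weight exceeds their combined weight.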

Abstain support: allow voters to return a special "abstain" value when they are not confident. Only votes from non-abstaining voters count. This is useful for tasks where it is better to say "I don't know" than to guess.
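
A minimal sketch of how abstentions could be tallied, assuming a sentinel string. ABSTAIN and tally_with_abstain are hypothetical names, since this feature is not implemented yet.

```python
from collections import Counter

ABSTAIN = "abstain"  # hypothetical sentinel value a voter returns when unsure

def tally_with_abstain(answers):
    """Return (winner, confidence, number_of_counted_votes).

    Abstentions and failed voters (None) are dropped before counting.
    """
    counted = [a for a in answers if a is not None and a != ABSTAIN]
    if not counted:
        return None, 0.0, 0
    winner, top = Counter(counted).most_common(1)[0]
    return winner, top / len(counted), len(counted)

result = tally_with_abstain(["yes", ABSTAIN, "yes"])
print(result)
```

Returning the number of counted votes alongside the confidence matters here: a confidence of 1.0 from two voters is weaker evidence than 1.0 from three.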

Async voters for async agent frameworks: the current implementation uses ThreadPoolExecutor for parallel mode. An async_vote() with asyncio.gather() would be cleaner for async code.
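
What that could look like, as a sketch only: async_vote below is a standalone function written for illustration, not the planned API, and the voters are stub coroutines rather than real model clients.

```python
import asyncio
from collections import Counter

async def async_vote(voters, prompt):
    """voters: list of (name, async callable). Returns (winner, confidence)."""
    # return_exceptions=True keeps one failing voter from cancelling the rest.
    results = await asyncio.gather(
        *(fn(prompt) for _, fn in voters), return_exceptions=True
    )
    # Exceptions come back as objects in the results list; keep only string answers.
    answers = [r.strip().lower() for r in results if isinstance(r, str)]
    if not answers:
        return None, 0.0
    winner, top = Counter(answers).most_common(1)[0]
    return winner, top / len(answers)

def make_voter(answer):
    """Stub async voter standing in for an async LLM client call."""
    async def voter(prompt):
        await asyncio.sleep(0.01)
        return answer
    return voter

async def failing_voter(prompt):
    raise RuntimeError("provider unavailable")

voters = [
    ("a", make_voter("Yes")),
    ("b", make_voter("yes ")),
    ("c", make_voter("no")),
    ("d", failing_voter),
]
winner, confidence = asyncio.run(async_vote(voters, "Is this safe?"))
print(winner, round(confidence, 2))
```

Inside an already running event loop, the call would be awaited directly instead of going through asyncio.run().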


Built as part of the agent-stack family: composable Python primitives for production LLM agents.
