
Stephen Collins

Building a Reliable and Accurate LLM Application with Voting Systems


Ensuring the reliability and accuracy of applications powered by large language models (LLMs) is of the utmost importance. One effective strategy to enhance both is to implement a voting system across multiple LLMs. This tutorial walks you through setting up a simple yet powerful voting system with several LLM APIs, so your application delivers consistent, high-quality results.

All of the code for this tutorial is available in my GitHub repo.

Why Use a Voting System?

Voting systems combine the strengths of multiple models, leveraging their diverse perspectives to produce a more reliable and accurate outcome. Here are some key benefits:

  • Increased Accuracy: Aggregating outputs from multiple models often yields better performance than any single model.
  • Robustness: The system is less prone to individual model errors, ensuring more stable predictions.
  • Better Generalization: Combining multiple models helps in capturing a broader range of knowledge, improving generalization to new inputs.

Step-by-Step Guide to Implementing a Voting System

Step 1: Choose Your LLM APIs

First, select multiple LLM APIs from different providers. For this tutorial, we’ll use models from three popular providers: Google’s Gemini 1.5 Flash, Anthropic’s Claude 3.5 Sonnet, and OpenAI’s GPT-4o.

Step 2: Initialize the APIs

Ensure you have the necessary API keys and initialize the clients for each provider. Here is how to get started with Anthropic's API key, Google's credential JSON, and OpenAI's API key. And here is our initial setup for our AI-powered image processing application:



from dotenv import load_dotenv
from ai.factory import create_ai_processor

load_dotenv()

google_processor = create_ai_processor("google", "gemini-1.5-flash-001")
openai_processor = create_ai_processor("openai", "gpt-4o")
anthropic_processor = create_ai_processor(
    "anthropic", "claude-3-5-sonnet-20240620")
voters = [google_processor, openai_processor, anthropic_processor]



We initialize each processor from each vendor. Check out the abstract base class AIProcessor to see how to implement your own processor.
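If you don't have the repo handy, the interface looks roughly like this. This is a hypothetical sketch inferred from the methods called later in this tutorial (`process`, `get_vendor`, `get_model_name`), not the repo's exact code:

```python
from abc import ABC, abstractmethod


class AIProcessor(ABC):
    """Minimal interface each vendor-specific processor implements.

    Hypothetical sketch: method names are taken from the calls made in
    this tutorial; see the GitHub repo for the real implementation.
    """

    def __init__(self, vendor: str, model_name: str):
        self._vendor = vendor
        self._model_name = model_name

    def get_vendor(self) -> str:
        return self._vendor

    def get_model_name(self) -> str:
        return self._model_name

    @abstractmethod
    def process(self, prompt: str, image: bytes) -> str:
        """Send the prompt and image to the vendor's API; return the raw text reply."""
        ...
```

Any class that subclasses `AIProcessor` and implements `process` can be dropped into the `voters` list, which is what keeps the voting logic vendor-agnostic.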

Step 3: Define Functions to Get Responses

Create functions to send the same input query to each API and collect their responses:



def majority_voting_system_votes(prompt, image):
    """Ask every voter the same question and return the most common answer."""
    votes = []
    for voter in voters:
        vote = voter.process(prompt, image)
        # Normalize numeric replies so "3" and 3 count as the same vote
        votes.append(int(vote) if vote.isdigit() else vote)
        print(f"VENDOR: {voter.get_vendor()} MODEL: {voter.get_model_name()} VOTE: {vote}")
    # Plurality wins; ties are broken arbitrarily
    return max(set(votes), key=votes.count)


def weighted_voting_system_votes(prompt, image, weights):
    """Like majority voting, but each voter's answer counts for its weight."""
    weighted_responses = {}

    for voter, weight in zip(voters, weights):
        vote = voter.process(prompt, image)
        # Normalize numeric replies so "3" and 3 count as the same vote
        vote = int(vote) if vote.isdigit() else vote
        print(f"VENDOR: {voter.get_vendor()} MODEL: {voter.get_model_name()} VOTE: {vote} WEIGHT: {weight}")
        weighted_responses[vote] = weighted_responses.get(vote, 0) + weight

    # The answer with the highest total weight wins
    return max(weighted_responses, key=weighted_responses.get)


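To see why weights matter, here's a tiny self-contained example of the same two counting strategies, using plain lists of votes instead of live API calls. When the models disagree, majority voting picks the answer most voters gave, while weights can tip the result toward a trusted minority:

```python
from collections import Counter


def majority_vote(votes):
    # Most common answer wins (Counter breaks ties by first occurrence)
    return Counter(votes).most_common(1)[0][0]


def weighted_vote(votes, weights):
    # Sum each answer's weight; the heaviest answer wins
    totals = {}
    for vote, weight in zip(votes, weights):
        totals[vote] = totals.get(vote, 0) + weight
    return max(totals, key=totals.get)


votes = [3, 4, 4]  # model A says 3; models B and C say 4
print(majority_vote(votes))                    # 4
print(weighted_vote(votes, [0.6, 0.2, 0.2]))   # 3: A's weight outvotes B + C
```

Here a heavily weighted model A (0.6) overrules the two lighter models combined (0.2 + 0.2), which is exactly the behavior you want when one model has a proven track record on the task.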

Step 4: Test the System

Now that your voting system is set up, test it with an example prompt:



# Example usage
prompt = "How many coins are in the image? Only respond with a number."

with open("./images/coins.png", "rb") as image_file:
    image = image_file.read()

final_vote = majority_voting_system_votes(prompt, image)
print("Majority Voting Final Vote:", final_vote)

# Example weights for Google Gemini, OpenAI GPT-4o, and Claude 3.5 Sonnet respectively
weights = [0.4, 0.3, 0.3]
final_vote = weighted_voting_system_votes(prompt, image, weights)
print("Weighted Voting Final Vote:", final_vote)



If the app is working as expected, you should see output like the following:



VENDOR: google MODEL: gemini-1.5-flash-001 VOTE: 3
VENDOR: openai MODEL: gpt-4o VOTE: 3
VENDOR: anthropic MODEL: claude-3-5-sonnet-20240620 VOTE: 3
Majority Voting Final Vote: 3
VENDOR: google MODEL: gemini-1.5-flash-001 VOTE: 3 WEIGHT: 0.4
VENDOR: openai MODEL: gpt-4o VOTE: 3 WEIGHT: 0.3
VENDOR: anthropic MODEL: claude-3-5-sonnet-20240620 VOTE: 3 WEIGHT: 0.3
Weighted Voting Final Vote: 3





Enhancing the Voting System

While majority and weighted voting are great starts, you can enhance your system with techniques like performance monitoring.

Performance Monitoring

Monitor the performance of each model by keeping track of their accuracy over time. Adjust the weights based on this performance data to ensure the most reliable models have more influence on the final decision.
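One simple way to do this (a sketch of my own, not code from the repo) is to track each model's running accuracy against known-correct answers and derive the voting weights from those scores:

```python
class WeightTracker:
    """Tracks per-model accuracy and derives voting weights from it.

    Hypothetical sketch: record() is called whenever you can verify a
    model's answer against ground truth; weights() feeds the weighted
    voting function.
    """

    def __init__(self, model_names):
        self.correct = {name: 0 for name in model_names}
        self.total = {name: 0 for name in model_names}

    def record(self, model_name, was_correct):
        self.total[model_name] += 1
        if was_correct:
            self.correct[model_name] += 1

    def weights(self):
        # Laplace-smoothed accuracy, so a brand-new model isn't zeroed out
        scores = {
            name: (self.correct[name] + 1) / (self.total[name] + 2)
            for name in self.total
        }
        norm = sum(scores.values())
        # Normalize so the weights sum to 1
        return {name: score / norm for name, score in scores.items()}
```

You would then pass the values of `weights()` (in voter order) into `weighted_voting_system_votes`, letting the most reliable models gain influence over time.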

Best Practices

  • Model Diversity: Use different types of LLMs to benefit from diverse perspectives.
  • Data Management: Properly handle training and validation data to avoid data leakage and ensure fair evaluation.
  • Regular Evaluation: Continuously evaluate the performance of your voting system and individual models to maintain high accuracy and reliability.

Conclusion

Implementing a voting system with multiple LLMs can significantly enhance the performance of your application. By leveraging the strengths of different models, you can achieve higher accuracy, improved robustness, and better generalization. Start with a simple majority voting system and gradually incorporate more sophisticated techniques like weighted voting and performance monitoring to optimize your application further.

By following this tutorial, you'll be well on your way to building a reliable and accurate LLM application. Whether you're working on an image recognition system, a sentiment classification tool, or any other AI-powered application, a voting system can help ensure your results are consistently top-notch.

Top comments (2)

H-S

do you have the equivalent, but for a vector database ? without images

Stephen Collins

So here’s how I would implement a similar system for text-based tasks without images:

I would create a “judge” or “evaluator”. You send multiple LLM outputs to a single (or a panel of) “judge” LLM(s) that determines which output is the best one to send forward.

Perhaps a simple blog post to explain further!