When dealing with LLMs like GPT-4, PaLM 2, etc., we can often get varying outputs for the same prompt. This raises the question: how can we decide which answer to trust given these varied responses?
This is where self-consistency-style prompting comes into play. By passing the same prompt to different LLMs, we can use their agreement to verify and improve the accuracy of the final output. In the example we show here, we use a majority vote (a quorum) among the responses to determine the final output; you can of course use other heuristics as well. The idea is inspired by the principles outlined in the "Self-Consistency" paper, which shows how sampling a diverse set of reasoning paths from a single language model and taking the most consistent answer can improve accuracy.
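As a concrete illustration, here is a minimal sketch of such a majority-vote heuristic in Python. The responses list and the normalization step are illustrative assumptions, not part of the cookbook.

```python
# Minimal sketch of a majority-vote (quorum) heuristic over model responses.
from collections import Counter

def majority_vote(responses: list[str]) -> str:
    """Return the answer that the most responses agree on, after light normalization."""
    normalized = [r.strip().lower() for r in responses]
    winner, _ = Counter(normalized).most_common(1)[0]
    return winner

# Example: three models answer the same question, one disagrees.
responses = ["42", "42", "The answer is 41"]
print(majority_vote(responses))  # "42" wins 2-to-1
```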
How to use multiple LLMs in seconds 🤖
AIConfig steps in as the orchestrator to easily coordinate your prompts against multiple LLMs without having to wrangle with the different APIs and providers.
- Prompt each LLM: We send the same prompt from the AIConfig to several LLMs to get various reasoning paths. With a simple config.run call and the model name set on the prompt template, you get outputs from several models (see the sketch after this list).
- Evaluate responses: AIConfig then uses GPT-4 (or any other model you set up) to determine a quorum, basically a majority vote on the best response.
The benefits? You get enhanced accuracy and a safety net in case one LLM hallucinates and produces the wrong output!
🎁 Template for Multi-LLM Consistency: https://github.com/lastmile-ai/aiconfig/tree/main/cookbooks/Multi-LLM-Consistency
⭐️ We recently launched AIConfig as our first open-source project. Please support us with a star ⭐️ on GitHub!
https://github.com/lastmile-ai/aiconfig/tree/main
Comments
@tanyarai This is indeed an interesting concept. However, I am more interested in understanding how the majority vote works for realistic tasks such as summarization, data extraction, or data mining. How can we evaluate the quality of the output generated by each of the LLMs? And what if the quorum determiner itself is flawed or hallucinates?
That is a great question. In this case GPT-4 is used as the evaluator of the quorum, but you could enable function calling with GPT-4 to run user-defined functions for whatever heuristic you choose, which gives you more control over accuracy. It's a nice pattern: the LLM does the evaluation, while your deterministic functions keep it honest. Thanks for bringing it up!
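A rough sketch of that pattern, using the OpenAI Python SDK's tool-calling interface, might look like the following. The function name, schema, and prompts are made up for illustration and are not part of the cookbook.

```python
# Sketch: let GPT-4 call a user-defined function so the quorum decision is
# reported in a structured form, then cross-check it with deterministic code.
import json
from collections import Counter
from openai import OpenAI

client = OpenAI()

candidates = ["42", "42", "41"]

# Illustrative function schema the model is asked to call.
tools = [{
    "type": "function",
    "function": {
        "name": "report_quorum",
        "description": "Report which candidate answer wins the majority vote.",
        "parameters": {
            "type": "object",
            "properties": {
                "chosen_answer": {"type": "string"},
                "vote_count": {"type": "integer"},
            },
            "required": ["chosen_answer", "vote_count"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"Pick the majority answer among these candidates: {candidates}",
    }],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "report_quorum"}},
)

# The model's answer arrives as structured JSON arguments.
args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)

# Safety net: compare the model's pick against a local deterministic tally.
expected, _ = Counter(candidates).most_common(1)[0]
assert args["chosen_answer"] == expected, "Evaluator disagreed with the tally"
print(args)
```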