MindsDB Team

Originally published at mindsdb.com

Which LLM to Choose: 12 key aspects to consider when building AI solutions

Overview of the Leading LLMs

The leaderboard below presents a high-level comparison of leading large language models (LLMs) from providers such as OpenAI, Google, Anthropic, Cohere, Meta, Mistral AI, and Databricks. Models are evaluated on key factors including price, quality, speed, context window length, and licensing, and are rated on a three-star scale for price, quality, and speed so you can quickly identify the right model. Later in this post, we'll dive deeper into each of these categories, as well as other aspects to consider when building applications with LLMs.

A more detailed version of this leaderboard can be found here.

Navigating the LLM Revolution

The rise of large language models (LLMs) has revolutionized natural language processing, enabling companies to seamlessly apply AI to a broad range of tasks. This analysis explores leading LLMs, examining their capabilities, applications, and performance, as well as key considerations when deciding which model to use. Our focus includes not only OpenAI models but also notable contenders like Anthropic, Meta, Google, and more.

LLMs have evolved from traditional NLP models designed for specific tasks to versatile tools capable of handling a wide range of applications. Models like OpenAI's ChatGPT have shifted the paradigm, excelling in various tasks without the need for specialized training. By connecting these models to internal data, businesses can seamlessly integrate AI, outperforming traditional NLP models in most major tasks.

Over the past year, the adoption of LLM technologies has surged, with startups and major corporations developing their own foundation models. Giants like OpenAI, Google, and Meta lead the charge, while newcomers like Mistral AI and Databricks are also making significant strides. This blog aims to guide you through the complex process of choosing the right model for your needs.

Understanding LLM Benchmarks

When selecting a model, the instinct might be to choose the “best” one. However, picking an LLM can be more complicated than it seems. Standard benchmarks rank models based on their performance on test sets, which can be generalized knowledge or domain-specific (e.g., coding, multilingual tasks). While benchmarks are useful, they have limitations:

  • Data Leakage: Test data often leaks into training datasets, causing models to memorize answers. This skews leaderboard results and means they don't accurately reflect real-world performance.

  • Errors: Some leaderboards have fundamental flaws and errors, so their rankings should be taken with a grain of salt.

  • Real-World Performance: Benchmark scores might not correlate directly with performance on domain-specific tasks, especially if your use case differs from the benchmark scenarios.

There are many different benchmarks, each with its own pros and cons, and it's generally good practice to look at model performance across several of them. Some of the popular benchmarks to consider include the following:

  • MMLU (Massive Multitask Language Understanding): A benchmark that evaluates the performance of large language models across 57 diverse subjects at varying difficulty levels using multiple-choice questions.
  • Chatbot Arena: A web application where users chat with multiple models and vote on the best output. This benchmark considers response quality, speed, and conciseness.
  • MT-Bench (Multi-Turn Benchmark): A benchmark that measures how well large language models handle multi-turn conversations, with responses scored across categories such as writing, reasoning, math, and coding.
  • HumanEval: A benchmark that assesses the code generation capabilities of large language models by evaluating their ability to generate correct and functional code based on given programming problems.

We pull model benchmarks from Artificial Analysis and recommend checking them out for more information. For easier reference, our comparison chart ranks models on price and MMLU score out of three stars using quartile rankings. Note that these benchmarks are not perfect representations of model performance. Recently, Scale AI launched a private leaderboard to provide better model evaluations, which we also recommend exploring. While leaderboards are useful guides, remember to consider other factors like cost, speed, privacy, and specific features when choosing an LLM.

Why Experiment with Different Models

Benchmarks and performance aren't the only things to consider when picking a model. Choosing the right LLM involves weighing factors such as cost, speed, and performance. For example, if you are developing a local application where models run on-device, a large model would be slow to the point of being unusable. While most leaderboards place OpenAI's GPT-4 at the top based on standard benchmarks, it may not always be the best choice for every use case. Swapping in different open-source models and augmenting them with external data using techniques like RAG (retrieval-augmented generation) can bridge performance gaps and reduce costs, while also offering other benefits in speed and specialized capabilities. For instance, some models are faster (beneficial for real-time applications) and cheaper (useful for processing large volumes of text).

5 Key Aspects for Picking an LLM:

  1. Performance: Models are typically evaluated on standard benchmarks and ranked based on scores. Depending on the use case, consider benchmarks like MMLU for general knowledge and HumanEval for code-related tasks.
  2. Cost: LLMs are usually hosted by a company and priced per token. Costs can vary significantly; for example, smaller open-source models like Llama-3-8b ($0.20/1M tokens) cost significantly less than GPT-4 ($30/1M tokens). While cheaper models may not perform as well, they can be sufficient for some tasks. Exploring different models across various price ranges can help identify the best model for your budget.
  3. Output Speed: Depending on the use case, speed can be crucial. In real-time applications like voice apps, slow responses degrade the user experience, whereas speed matters less for tasks like processing meeting transcripts overnight. Output speed is measured in two ways: latency, or time to first token (TTFT), and overall tokens per second (throughput). A quick TTFT allows for streaming responses, making the application feel faster, while high throughput is crucial for tasks that require the full text to be generated quickly (a quick timing sketch follows this list).
  4. Privacy Features: Proprietary models from companies like OpenAI, Google, and Anthropic are accessed via API, requiring data to be shared with the hosting company. For some applications, this is acceptable, but in others, using an open-source model that can be hosted locally might be better, ensuring data remains on-premises.
  5. Specific Capabilities: Some models are specialized for tasks like code generation, tool use, function calling, or processing multiple modalities (e.g., images, audio). Using these specialized models can be more cost-effective and result in better performance compared to larger, more expensive general models. Some examples include Code Llama and Cohere’s retrieval models.
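
To make the speed metrics concrete, here is a minimal timing sketch, assuming the OpenAI Python SDK (v1+) with an OPENAI_API_KEY set in the environment; the model name and prompt are illustrative, and token counts are approximated from character length rather than a real tokenizer.

```python
# A minimal sketch of measuring time to first token (TTFT) and throughput.
# Assumes the OpenAI Python SDK with OPENAI_API_KEY in the environment;
# the model name and prompt are illustrative.
import time

from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
first_token_at = None
pieces = []

stream = client.chat.completions.create(
    model="gpt-4o",  # swap in whichever model you want to benchmark
    messages=[{"role": "user", "content": "Explain RAG in two sentences."}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # latency to first token
        pieces.append(delta)
elapsed = time.perf_counter() - start

text = "".join(pieces)
approx_tokens = len(text) / 4  # crude rule of thumb: ~4 characters per token
print(f"TTFT: {first_token_at - start:.2f}s")
print(f"Throughput: {approx_tokens / elapsed:.1f} tokens/s (approximate)")
```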

Which Model to Use

Choosing the right model is often the first step in building a robust application. LLMs are large pretrained models with extensive general knowledge and reasoning capabilities, suitable for a wide range of tasks. However, they have limitations, including a fixed knowledge cutoff date.

To keep models up-to-date, they often need external data, which can be integrated using tools like search APIs or techniques like retrieval-augmented generation (RAG). Some tasks, like sentiment analysis, classification, or translation, can be handled by models without additional data, especially when prompted with a few examples. However, tasks requiring specific or private data, such as chatbots referencing internal documents, need supplementary data to function correctly.
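
As an illustration of handling such a task with no extra data, here is a minimal few-shot prompting sketch, assuming the OpenAI Python SDK with an OPENAI_API_KEY set; the model name, labels, and reviews are illustrative.

```python
# A minimal few-shot prompting sketch for sentiment analysis: a handful of
# in-prompt examples steer the model, with no fine-tuning or external data.
# The model name is illustrative; a cheap model is often enough for this.
from openai import OpenAI

client = OpenAI()

prompt = """Classify the sentiment of each review as positive or negative.

Review: "The battery lasts all day, love it."
Sentiment: positive

Review: "Stopped working after a week."
Sentiment: negative

Review: "Shipping was slow, but the product itself is great."
Sentiment:"""

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)  # expected: "positive"
```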

When an application relies heavily on a model's internal knowledge, proprietary models like GPT-4, Gemini, and Claude 3 Opus tend to outperform smaller open-source models like Llama or Mistral. This is because larger models possess more extensive general knowledge and superior reasoning abilities. However, the performance gap narrows when model outputs are augmented with external data and techniques like few-shot prompting.

General Guidance for Picking a Model

  • Start with High-Performing Models: Begin prototyping and developing your LLM application with top-performing models like OpenAI's GPT-4, Google's Gemini, or Anthropic's Claude Opus to ensure high output quality.
  • Iterate and Optimize: After establishing a baseline with a high-performing model, explore swapping in different models based on specific use cases and cost considerations. Techniques such as supplementing with few-shot examples or connecting to external tools can help match the performance of larger models.
  • Evaluate Trade-offs: Consider other benefits beyond performance, such as speed. In some cases, a slight drop in performance might be acceptable if it results in significant improvements in generation speed or cost savings. By experimenting with different models and techniques, you can optimize both the cost and performance of your LLM applications, potentially achieving better results than sticking with a single model.

7 Key Considerations for Building an LLM Application

LLMs are excellent for quick demos since they provide a great out-of-the-box experience with their internal knowledge. However, creating a robust and reliable application involves more than just the model. Building an end-to-end AI application includes several key components:

  1. Data Connectors: LLM applications often need to connect models with data from various sources like databases, APIs, and cloud storage. Tools like MindsDB simplify this process by integrating multiple data sources into a single platform, making data management and utilization more efficient.
  2. Data Preprocessing: Preparing and cleaning data ensures quality inputs for the model. LLMs perform best with structured data, so preprocessing raw data is crucial for improving model accuracy and efficiency.
  3. Embedding Models: Embedding models encode data into dense vector representations, capturing semantic meaning and aiding in tasks like similarity search and classification. High-quality embeddings enhance the model’s ability to understand and process data effectively.
  4. Vector Databases: Vector databases store and query embeddings efficiently, enabling fast similarity searches and handling large volumes of high-dimensional data. They are crucial for applications requiring real-time responses and high scalability.
  5. RAG Pipelines: RAG pipelines enhance LLM responses by integrating external data. Relevant documents or data are retrieved and used to augment the model's output, providing more accurate and up-to-date responses. Setting up a RAG pipeline involves many steps, from retrieving the correct documents (using traditional methods like keyword search, or semantic similarity) to reranking or preprocessing them with models (a minimal RAG sketch appears at the end of this section).
  6. Prompt Engineering/Management: Effective prompts guide the model to produce specific responses. Prompt engineering involves designing and managing prompts to be contextually relevant and optimized for performance, significantly enhancing the model’s output accuracy and relevance.
  7. Observability and Evaluation: Monitoring and evaluating model performance is crucial for reliability. Observability tools track metrics like response time and accuracy, while evaluation tools assess outputs against benchmarks. These tools help in detecting issues and making data-driven improvements.

Testing and building robust pipelines that integrate these components is crucial for creating production-grade LLM applications. MindsDB can help by providing a streamlined way to connect and preprocess data, making the process more efficient and reliable.
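
To make the RAG pipeline concrete, here is a minimal sketch, assuming the OpenAI Python SDK and numpy; an in-memory dot-product search stands in for a real vector database, and the documents, question, and model names are illustrative.

```python
# A minimal RAG sketch: embed documents, retrieve the most relevant one for a
# question, and prepend it to the prompt as context. An in-memory dot product
# stands in for a vector database; assumes OPENAI_API_KEY is set.
import numpy as np
from openai import OpenAI

client = OpenAI()

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday through Friday, 9am to 5pm UTC.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vecs = embed(docs)

question = "How long do I have to return an item?"
q_vec = embed([question])[0]

# OpenAI embeddings are unit-normalized, so a dot product is cosine similarity.
best_doc = docs[int(np.argmax(doc_vecs @ q_vec))]

answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"Answer using this context:\n{best_doc}\n\nQuestion: {question}",
    }],
)
print(answer.choices[0].message.content)
```

A production pipeline would add the pieces described above: chunking and preprocessing, a real vector database for scale, and reranking of the retrieved candidates.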

Introducing “Minds” - pre-packaged AI systems

Minds are AI systems with built-in expertise designed to help AI agents accomplish tasks. These plug-and-play systems need little setup and are designed to be consumed seamlessly, just like traditional LLMs, through an OpenAI-compatible API.

Minds abstract away the complexities of building an AI application, bundling all of the components into a single "Mind" that creates an agent to seamlessly accomplish the task it was created for. Our first Mind, the Database-Mind, is designed to interact directly with data using natural language. To use it, you simply pass in a database schema and ask questions in natural language; the Mind handles the rest and returns answers.

Database-Mind is the first of many to come. You can check it out for free here.

Deployment Options: Self-Hosting vs. Serverless

When deploying LLMs, you have several options, each with its own advantages and considerations. Here's a comparison between self-hosting and serverless deployment, along with insights on using inference providers.

Self-Hosting
Self-hosting LLMs provides greater control over the environment and ensures that all data remains on-premises, which can be crucial for maintaining privacy and security. This approach is particularly beneficial for applications that handle sensitive information and cannot afford to share data with third parties. However, self-hosting requires significant upfront investment in infrastructure and technical expertise to manage and maintain the systems. While this can lead to lower costs for high-volume usage in the long run, the initial setup and ongoing management can be complex and resource-intensive.
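
A common pattern for self-hosting is to serve an open-source model behind an OpenAI-compatible endpoint (for example, with vLLM or Ollama), so application code barely changes. Here is a minimal sketch, assuming such a server is already running locally; the port, API key, and model name are illustrative.

```python
# A minimal sketch of querying a self-hosted, OpenAI-compatible server
# (e.g., vLLM's default endpoint on localhost:8000). Data never leaves
# your infrastructure. The port, key, and model name are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # whatever the server loaded
    messages=[{"role": "user", "content": "Hello from an on-prem deployment."}],
)
print(resp.choices[0].message.content)
```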

Serverless Deployment
Serverless deployment offers the advantage of scalability and reduced maintenance overhead. This option is ideal for applications that need to scale quickly and efficiently without the need for significant infrastructure investment. With serverless deployment, you can focus on developing your application while the service provider handles the infrastructure, scaling, and maintenance. This model is particularly useful for variable workloads, where the demand can fluctuate, as it allows for automatic scaling to meet the demand without manual intervention.

Inference Providers
Inference providers like Anyscale, Fireworks AI, and Together AI offer services that simplify the deployment and management of LLMs. These providers bring several advantages (a drop-in swap is sketched after the list below):

  • Ease of Integration: Inference providers offer standardized APIs, making it simple to integrate LLMs into your applications.
  • Scalability: They provide auto-scaling capabilities to handle varying workloads efficiently.
  • Cost Efficiency: By hosting open-source models, these providers can offer serverless endpoints at a lower cost compared to proprietary models, enabling you to swap out expensive models for cheaper alternatives without sacrificing performance.
  • Advanced Features: Many inference providers offer additional features such as model fine-tuning and assistance in deploying custom instances, allowing you to tailor the models to your specific needs.
  • Monitoring and Optimization: These services include tools for monitoring and optimizing model performance, helping to ensure reliability and efficiency.
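
Here is a minimal sketch of the drop-in swap mentioned above, assuming your providers expose OpenAI-compatible endpoints (as Together AI does); the base URLs, model names, and environment variables are illustrative and should be checked against each provider's documentation.

```python
# A minimal sketch of swapping between an expensive proprietary model and a
# cheaper provider-hosted open-source model by changing only the base URL and
# model name. Provider details are illustrative; check your provider's docs.
import os

from openai import OpenAI

PROVIDERS = {
    "openai": ("https://api.openai.com/v1", "gpt-4o", "OPENAI_API_KEY"),
    "together": (
        "https://api.together.xyz/v1",
        "meta-llama/Llama-3-8b-chat-hf",
        "TOGETHER_API_KEY",
    ),
}

def ask(provider: str, prompt: str) -> str:
    base_url, model, key_env = PROVIDERS[provider]
    client = OpenAI(base_url=base_url, api_key=os.environ[key_env])
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(ask("together", "Name one good use case for a small open-source LLM."))
```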

Inference providers can significantly reduce the complexity of deploying and scaling LLMs, making it easier for businesses to leverage the power of these models without the need for extensive infrastructure and technical expertise.

Wrapping up

Navigating the landscape of large language models (LLMs) requires a careful balance of multiple factors such as performance, cost, speed, privacy, and specific capabilities. The evolution of LLMs from task-specific NLP models to versatile, general-purpose tools has revolutionized natural language processing and broadened the range of applications where these models can excel.

While benchmarks provide valuable insights into model performance, they should be considered alongside real-world application needs and constraints. It's crucial to experiment with different models and techniques, such as retrieval-augmented generation (RAG) and prompt engineering, to find the optimal balance for your specific use case.

Understanding the deployment options is also vital. Whether you choose to self-host or use serverless deployment, each approach comes with its own set of benefits and trade-offs. Inference providers can simplify the deployment process, offering scalable and cost-effective solutions that integrate seamlessly into your application infrastructure.

In conclusion, the right LLM for your application depends on a thorough evaluation of your requirements and constraints. By staying informed about the capabilities and limitations of different models, and by leveraging the right tools and techniques, you can harness the full potential of LLMs to drive innovation and efficiency in your projects. The landscape of LLMs is rapidly evolving, and staying adaptable and open to experimentation will be key to success in this dynamic field.
