Recently, I've been working on integrating RAG (Bedrock Knowledge Base) into a chatbot. However, I wasn't sure how to choose the right LLM.
As I started researching, I realized that there are countless leaderboards available, which left me feeling lost. To organize my understanding, I compiled this summary. I hope other engineers find it useful.
1. Leaderboards for Open-Source Models
Open LLM Leaderboard
This is the most well-known leaderboard for comparing open-source models. It lets you filter by criteria such as whether a model comes from an official provider or is optimized for edge devices, making it easy to search.
Key benchmarks include:
- IFEval: Evaluates instruction-following capabilities
- BBH (BIG-Bench Hard): A suite of particularly difficult multi-step reasoning tasks drawn from BIG-Bench
- MATH: Measures problem-solving skills on competition-level mathematics
- GPQA (Graduate-Level Google-Proof Q&A): Expert-written science questions that can't be answered with a simple web search
- MuSR (Multistep Soft Reasoning): Tests long-context reasoning over narratives such as murder mysteries
- MMLU-Pro: A harder version of MMLU that assesses broad knowledge and language understanding
Big Code Models Leaderboard
This leaderboard focuses on evaluating models for code generation. It provides performance scores for different programming languages like Python, Java, JavaScript, and C++, making it easier to select a model that excels in your target language. The Win Rate metric gives an overall assessment, which is useful if you're building a coding assistant.
LLM-Perf Leaderboard
In real-world applications, factors beyond accuracy, such as speed and memory usage, are critical. This leaderboard prioritizes these practical considerations.
The Find Your Best Model tab is particularly useful. You can select your GPU specs (e.g., A10-24GB-150W) and immediately see the best-performing models for that hardware.
The chart is very intuitive: models in the upper-left corner strike the best balance between speed and accuracy, and smaller bubbles indicate lower memory usage. If you plan to deploy on an AWS g5.xlarge instance (24GB VRAM), you can quickly identify the best-suited models here.
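As a quick sanity check alongside the leaderboard, you can estimate whether a model's weights even fit on a given GPU. The sketch below is a back-of-the-envelope calculation; the 20% overhead factor is my rough assumption, and real usage depends on context length, batch size, and the inference engine.

```python
def estimated_vram_gb(params_billions: float, bytes_per_param: float = 2.0,
                      overhead_factor: float = 1.2) -> float:
    """Rough VRAM estimate: weights only, plus ~20% for KV cache/activations.

    bytes_per_param: 2.0 for fp16/bf16, 1.0 for 8-bit, 0.5 for 4-bit quantization.
    overhead_factor is an assumption, not an exact figure.
    """
    return params_billions * bytes_per_param * overhead_factor


GPU_VRAM_GB = 24  # A10 on an AWS g5.xlarge

for name, params_b in [("7B", 7), ("13B", 13), ("70B", 70)]:
    for label, bpp in [("fp16", 2.0), ("4-bit", 0.5)]:
        need = estimated_vram_gb(params_b, bpp)
        verdict = "fits" if need <= GPU_VRAM_GB else "does NOT fit"
        print(f"{name} {label}: ~{need:.1f} GB -> {verdict} in {GPU_VRAM_GB} GB")
```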
2. Domain-Specific Leaderboards
Hugging Face also provides leaderboards tailored to specific domains. If your project belongs to a specialized field, these are worth checking out.
Open Medical-LLM Leaderboard
A leaderboard dedicated to medical AI models, evaluating capabilities in clinical knowledge and medical genetics. Essential for healthcare-related applications.
Language-Specific Leaderboards
If your project requires working with languages like Japanese, these leaderboards can be particularly helpful.
3. Comparing Open-Source and Closed-Source Models
Vellum Leaderboard
This leaderboard compares both closed-source models (e.g., GPT, Claude) and open-source models (e.g., Llama) side by side.
It provides detailed insights, including model generation speed, latency, context window size, and per-token pricing. Additionally, its comparison feature allows you to directly evaluate new models against previous generations. If you're deciding between using an API-based service or hosting a model yourself, this leaderboard's cost information is invaluable.
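When weighing an API against self-hosting, it helps to turn the leaderboard's pricing columns into a simple break-even calculation. Every number below is a hypothetical placeholder; plug in the current figures from Vellum and your cloud provider.

```python
# Hypothetical placeholder prices -- replace with real figures from the
# Vellum leaderboard and your cloud provider's pricing page.
API_INPUT_PER_1K = 0.003     # USD per 1K input tokens
API_OUTPUT_PER_1K = 0.015    # USD per 1K output tokens
GPU_INSTANCE_PER_HOUR = 1.0  # USD per hour for a self-hosted GPU instance

AVG_INPUT_TOKENS = 2000      # prompt + retrieved context per request
AVG_OUTPUT_TOKENS = 500


def api_cost_per_request() -> float:
    return (AVG_INPUT_TOKENS / 1000) * API_INPUT_PER_1K + \
           (AVG_OUTPUT_TOKENS / 1000) * API_OUTPUT_PER_1K


def self_hosted_cost_per_request(requests_per_hour: int) -> float:
    # Ignores idle time, autoscaling, and the ops effort of running it yourself.
    return GPU_INSTANCE_PER_HOUR / requests_per_hour


for rph in (10, 100, 1000):
    print(f"{rph:>5} req/h  API: ${api_cost_per_request():.4f}  "
          f"self-hosted: ${self_hosted_cost_per_request(rph):.4f} per request")
```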
SEAL Leaderboard
This leaderboard evaluates models based on specific practical skills rather than general benchmarks.
For example, the MultiChallenge benchmark assesses capabilities such as conversational consistency, clarity of explanations, and information synthesis—key factors for chatbot applications.
LMSYS Chatbot Arena
Unlike other leaderboards that rely on automated benchmarks, this one evaluates models based on human judgment. Since it incorporates user preferences, it can help predict real-world user experience more accurately.
Additional Thoughts
Even after evaluating models through various leaderboards, real-world constraints often dictate final choices.
In my case, I needed to build a Bedrock Knowledge Base in the Tokyo region, where only Claude 3.5 Sonnet and Claude 3 Haiku were available (and surprisingly, 3.5 Haiku wasn't). Additionally, the Tokyo region doesn't support custom model imports, which limited my options.
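You can confirm what a given region actually offers before committing to a design. Here is a minimal boto3 sketch; the exact set of models returned depends on your account and region.

```python
import boto3


def list_claude_models(region: str) -> list[str]:
    """List Anthropic text models available in a Bedrock region."""
    bedrock = boto3.client("bedrock", region_name=region)
    resp = bedrock.list_foundation_models(byProvider="Anthropic",
                                          byOutputModality="TEXT")
    return [m["modelId"] for m in resp["modelSummaries"]]


print("Tokyo   :", list_claude_models("ap-northeast-1"))
print("Virginia:", list_claude_models("us-east-1"))
```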
Ultimately, I decided to proceed with Claude 3.5 Sonnet. It performed well, especially after optimizations such as query decomposition, FMP, and chunking, which significantly improved response accuracy.
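For reference, this is roughly how that setup looks with the RetrieveAndGenerate API in boto3. Treat it as a sketch: the knowledge base ID and question are placeholders, and the query-decomposition option is configured through the orchestration settings as I understand them at the time of writing, so double-check the field names against the current documentation.

```python
import boto3

client = boto3.client("bedrock-agent-runtime", region_name="ap-northeast-1")

response = client.retrieve_and_generate(
    input={"text": "How do I rotate the API keys for service X?"},  # placeholder question
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",  # placeholder
            "modelArn": "arn:aws:bedrock:ap-northeast-1::foundation-model/"
                        "anthropic.claude-3-5-sonnet-20240620-v1:0",
            # Query decomposition: breaks a complex question into sub-queries
            # before retrieval. It improved accuracy for me but adds latency.
            "orchestrationConfiguration": {
                "queryTransformationConfiguration": {"type": "QUERY_DECOMPOSITION"}
            },
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {"numberOfResults": 5}
            },
        },
    },
)
print(response["output"]["text"])
```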
However, query decomposition increased latency (response times exceeded 10 seconds), so I'm currently considering mitigation strategies. Possible solutions include:
- Using Bedrock in the Virginia region, accepting cross-region communication overhead but gaining access to Claude 3.5 Haiku (a quick latency check is sketched after this list).
- Importing an open-source model like Llama as a custom model.
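Before committing to the cross-region option, it's worth measuring the actual round-trip difference from your environment. A rough sketch using the Converse API; the model IDs are examples, so confirm which ones are enabled in your account and regions.

```python
import time
import boto3


def time_converse(region: str, model_id: str, prompt: str) -> float:
    """Return wall-clock seconds for a single Converse call in a region."""
    client = boto3.client("bedrock-runtime", region_name=region)
    start = time.perf_counter()
    client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 200},
    )
    return time.perf_counter() - start


prompt = "Summarize the benefits of query decomposition in two sentences."

# Example model IDs -- check what is available to you before running.
print("Tokyo, Claude 3.5 Sonnet :",
      time_converse("ap-northeast-1",
                    "anthropic.claude-3-5-sonnet-20240620-v1:0", prompt))
print("Virginia, Claude 3.5 Haiku:",
      time_converse("us-east-1",
                    "anthropic.claude-3-5-haiku-20241022-v1:0", prompt))
```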
Choosing the right LLM isn't just about benchmarks—deployment constraints matter just as much. Hopefully, this summary helps others navigate their own model selection process!