Recently, I've been working on integrating RAG (Bedrock Knowledge Base) into a chatbot. However, I wasn't sure how to choose the right LLM.
As I started researching, I realized that there are countless leaderboards available, which left me feeling lost. To organize my understanding, I compiled this summary. I hope other engineers find it useful.
1. Leaderboards for Open-Source Models
Open LLM Leaderboard
This is the most well-known leaderboard for comparing open-source models. It lets you filter by criteria such as whether a model comes from an official provider or is optimized for edge devices, making it easy to search.
Key benchmarks include:
- IFEval: Evaluates instruction-following capabilities
- BBH (BIG-Bench Hard): A suite of particularly difficult multi-step reasoning tasks drawn from BIG-Bench
- MATH: Measures problem-solving skills on competition-level mathematics
- GPQA (Graduate-Level Google-Proof Q&A): Expert-written science questions that can't be answered with a simple web search
- MuSR (Multistep Soft Reasoning): Tests long-context reasoning over narratives such as murder mysteries
- MMLU-Pro: A harder version of MMLU that assesses broad knowledge and language understanding
Big Code Models Leaderboard
This leaderboard focuses on evaluating models for code generation. It provides performance scores for different programming languages like Python, Java, JavaScript, and C++, making it easier to select a model that excels in your target language. The Win Rate metric gives an overall assessment, which is useful if you're building a coding assistant.
LLM-Perf Leaderboard
In real-world applications, factors beyond accuracy, such as speed and memory usage, are critical. This leaderboard prioritizes these practical considerations.
The Find Your Best Model tab is particularly useful. You can select your GPU specs (e.g., A10-24GB-150W) and immediately see the best-performing models for that hardware.
The chart is very intuitive: models in the upper-left corner strike the best balance between speed and accuracy, and smaller bubbles indicate lower memory usage. If you plan to deploy on an AWS g5.xlarge instance (24GB VRAM), you can quickly identify the best-suited models here.
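As a quick sanity check alongside the leaderboard, you can estimate whether a model's weights even fit on a given GPU. The sketch below is a back-of-the-envelope calculation; the 20% overhead factor is my rough assumption, and real usage depends on context length, batch size, and the inference engine.

```python
def estimated_vram_gb(params_billions: float, bytes_per_param: float = 2.0,
                      overhead_factor: float = 1.2) -> float:
    """Rough VRAM estimate: weights only, plus ~20% for KV cache/activations.

    bytes_per_param: 2.0 for fp16/bf16, 1.0 for 8-bit, 0.5 for 4-bit quantization.
    overhead_factor is an assumption, not an exact figure.
    """
    return params_billions * bytes_per_param * overhead_factor


GPU_VRAM_GB = 24  # A10 on an AWS g5.xlarge

for name, params_b in [("7B", 7), ("13B", 13), ("70B", 70)]:
    for label, bpp in [("fp16", 2.0), ("4-bit", 0.5)]:
        need = estimated_vram_gb(params_b, bpp)
        verdict = "fits" if need <= GPU_VRAM_GB else "does NOT fit"
        print(f"{name} {label}: ~{need:.1f} GB -> {verdict} in {GPU_VRAM_GB} GB")
```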
2. Domain-Specific Leaderboards
Hugging Face also provides leaderboards tailored to specific domains. If your project belongs to a specialized field, these are worth checking out.
Open Medical-LLM Leaderboard
A leaderboard dedicated to medical AI models, evaluating capabilities in clinical knowledge and medical genetics. Essential for healthcare-related applications.
Language-Specific Leaderboards
If your project requires working with languages like Japanese, these leaderboards can be particularly helpful.
3. Comparing Open-Source and Closed-Source Models
Vellum Leaderboard
This leaderboard compares both closed-source models (e.g., GPT, Claude) and open-source models (e.g., Llama) side by side.
It provides detailed insights, including model generation speed, latency, context window size, and per-token pricing. Additionally, its comparison feature allows you to directly evaluate new models against previous generations. If you're deciding between using an API-based service or hosting a model yourself, this leaderboard's cost information is invaluable.
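When weighing an API against self-hosting, it helps to turn the leaderboard's pricing columns into a simple break-even calculation. Every number below is a hypothetical placeholder; plug in the current figures from Vellum and your cloud provider.

```python
# Hypothetical placeholder prices -- replace with real figures from the
# Vellum leaderboard and your cloud provider's pricing page.
API_INPUT_PER_1K = 0.003     # USD per 1K input tokens
API_OUTPUT_PER_1K = 0.015    # USD per 1K output tokens
GPU_INSTANCE_PER_HOUR = 1.0  # USD per hour for a self-hosted GPU instance

AVG_INPUT_TOKENS = 2000      # prompt + retrieved context per request
AVG_OUTPUT_TOKENS = 500


def api_cost_per_request() -> float:
    return (AVG_INPUT_TOKENS / 1000) * API_INPUT_PER_1K + \
           (AVG_OUTPUT_TOKENS / 1000) * API_OUTPUT_PER_1K


def self_hosted_cost_per_request(requests_per_hour: int) -> float:
    # Ignores idle time, autoscaling, and the ops effort of running it yourself.
    return GPU_INSTANCE_PER_HOUR / requests_per_hour


for rph in (10, 100, 1000):
    print(f"{rph:>5} req/h  API: ${api_cost_per_request():.4f}  "
          f"self-hosted: ${self_hosted_cost_per_request(rph):.4f} per request")
```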
SEAL Leaderboard
This leaderboard evaluates models based on specific practical skills rather than general benchmarks.
For example, the MultiChallenge benchmark assesses capabilities such as conversational consistency, clarity of explanations, and information synthesis—key factors for chatbot applications.
LMSYS Chatbot Arena
Unlike other leaderboards that rely on automated benchmarks, this one evaluates models based on human judgment. Since it incorporates user preferences, it can help predict real-world user experience more accurately.
Additional Thoughts
Even after evaluating models through various leaderboards, real-world constraints often dictate final choices.
In my case, I needed to build a Bedrock Knowledge Base in the Tokyo region, where only Claude 3.5 Sonnet and Claude 3 Haiku were available (and surprisingly, 3.5 Haiku wasn't). Additionally, the Tokyo region doesn't support custom model imports, which limited my options.
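You can confirm what a given region actually offers before committing to a design. Here is a minimal boto3 sketch; the exact set of models returned depends on your account and region.

```python
import boto3


def list_claude_models(region: str) -> list[str]:
    """List Anthropic text models available in a Bedrock region."""
    bedrock = boto3.client("bedrock", region_name=region)
    resp = bedrock.list_foundation_models(byProvider="Anthropic",
                                          byOutputModality="TEXT")
    return [m["modelId"] for m in resp["modelSummaries"]]


print("Tokyo   :", list_claude_models("ap-northeast-1"))
print("Virginia:", list_claude_models("us-east-1"))
```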
Ultimately, I decided to proceed with Claude 3.5 Sonnet. It performed well, especially after optimizations such as query decomposition, FMP, and chunking, which significantly improved response accuracy.
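For reference, this is roughly how that setup looks with the RetrieveAndGenerate API in boto3. Treat it as a sketch: the knowledge base ID and question are placeholders, and the query-decomposition option is configured through the orchestration settings as I understand them at the time of writing, so double-check the field names against the current documentation.

```python
import boto3

client = boto3.client("bedrock-agent-runtime", region_name="ap-northeast-1")

response = client.retrieve_and_generate(
    input={"text": "How do I rotate the API keys for service X?"},  # placeholder question
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",  # placeholder
            "modelArn": "arn:aws:bedrock:ap-northeast-1::foundation-model/"
                        "anthropic.claude-3-5-sonnet-20240620-v1:0",
            # Query decomposition: breaks a complex question into sub-queries
            # before retrieval. It improved accuracy for me but adds latency.
            "orchestrationConfiguration": {
                "queryTransformationConfiguration": {"type": "QUERY_DECOMPOSITION"}
            },
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {"numberOfResults": 5}
            },
        },
    },
)
print(response["output"]["text"])
```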
However, query decomposition increased latency (response times exceeded 10 seconds), so I'm currently considering mitigation strategies. Possible solutions include:
- Using Bedrock in the Virginia region, accepting cross-region communication overhead but gaining access to Claude 3.5 Haiku (a quick latency check is sketched after this list).
- Importing an open-source model like Llama as a custom model.
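Before committing to the cross-region option, it's worth measuring the actual round-trip difference from your environment. A rough sketch using the Converse API; the model IDs are examples, so confirm which ones are enabled in your account and regions.

```python
import time
import boto3


def time_converse(region: str, model_id: str, prompt: str) -> float:
    """Return wall-clock seconds for a single Converse call in a region."""
    client = boto3.client("bedrock-runtime", region_name=region)
    start = time.perf_counter()
    client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 200},
    )
    return time.perf_counter() - start


prompt = "Summarize the benefits of query decomposition in two sentences."

# Example model IDs -- check what is available to you before running.
print("Tokyo, Claude 3.5 Sonnet :",
      time_converse("ap-northeast-1",
                    "anthropic.claude-3-5-sonnet-20240620-v1:0", prompt))
print("Virginia, Claude 3.5 Haiku:",
      time_converse("us-east-1",
                    "anthropic.claude-3-5-haiku-20241022-v1:0", prompt))
```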
Choosing the right LLM isn't just about benchmarks—deployment constraints matter just as much. Hopefully, this summary helps others navigate their own model selection process!