LLM Model Selection Made Easy: The Most Useful Leaderboards for Real-World Applications

Recently, I've been working on integrating RAG (Bedrock Knowledge Base) into a chatbot. However, I wasn't sure how to choose the right LLM model.

As I started researching, I realized that there are countless leaderboards available, which left me feeling lost. To organize my understanding, I compiled this summary. I hope other engineers find it useful.
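
For context, the chatbot side of this is a call to the Knowledge Base through Bedrock's RetrieveAndGenerate API. Here's a minimal boto3 sketch of that integration; the knowledge base ID and model ARN are placeholders:

```python
import boto3

# Placeholders: substitute your own knowledge base ID and model ARN.
KNOWLEDGE_BASE_ID = "XXXXXXXXXX"
MODEL_ARN = "arn:aws:bedrock:ap-northeast-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0"

client = boto3.client("bedrock-agent-runtime", region_name="ap-northeast-1")

def ask(question: str) -> str:
    """Retrieve relevant chunks from the knowledge base and generate an answer."""
    response = client.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": KNOWLEDGE_BASE_ID,
                "modelArn": MODEL_ARN,
            },
        },
    )
    return response["output"]["text"]

print(ask("What is our refund policy?"))
```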

1. Leaderboards for Open-Source Models

Open LLM Leaderboard

This is the most well-known leaderboard for comparing open-source models. It allows filtering based on criteria such as whether a model is provided by an official provider or optimized for edge devices, making it easy to search.

Open LLM Leaderboard

Key benchmarks include:

  • IFEval: Evaluates instruction-following capabilities
  • BBH (Big Bench Hard): A suite of challenging tasks that test advanced reasoning
  • MATH: Measures problem-solving skills in mathematics
  • GPQA (Graduate-Level Google-Proof Q&A): Graduate-level science questions that test expert knowledge and reasoning
  • MUSR: Multi-step reasoning evaluated on tasks such as murder mysteries
  • MMLU-PRO: A more challenging variant of MMLU that assesses broad language understanding

Screenshot
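
If you want to slice the rankings yourself rather than in the UI, one low-tech option is to export the table and filter it with pandas. The file name, column names, and thresholds below are hypothetical, loosely matching the benchmark labels above:

```python
import pandas as pd

# Hypothetical CSV export of the leaderboard table; column names are assumptions
# based on the benchmark labels above (IFEval, BBH, MATH, GPQA, MUSR, MMLU-PRO).
df = pd.read_csv("open_llm_leaderboard.csv")

# Example: shortlist models that follow instructions well and are strong on MMLU-PRO.
shortlist = (
    df[(df["IFEval"] >= 80) & (df["MMLU-PRO"] >= 50)]
    .sort_values("Average", ascending=False)
    .head(10)
)
print(shortlist[["Model", "Average", "IFEval", "MMLU-PRO"]])
```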

Big Code Models Leaderboard

This leaderboard focuses on evaluating models for code generation. It provides performance scores for different programming languages like Python, Java, JavaScript, and C++, making it easier to select a model that excels in your target language. The Win Rate metric gives an overall assessment, which is useful if you're building a coding assistant.

Big Code Models Leaderboard

Screenshot
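
Scores on this board are typically HumanEval-style pass@k per language. If you ever need to compute the same metric on your own test set, the standard unbiased estimator is small enough to sketch here (this is the general formula, not the leaderboard's exact harness):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n = samples generated, c = samples that passed the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 passed the unit tests.
print(pass_at_k(n=200, c=37, k=1))  # ~0.185
```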

LLM-Perf Leaderboard

In real-world applications, factors beyond accuracy, such as speed and memory usage, are critical. This leaderboard prioritizes these practical considerations.

LLM-Perf Leaderboard

The Find Your Best Model tab is particularly useful. You can select your GPU specs (e.g., A10-24GB-150W in the screenshot) and immediately see which models fit it best.

Screenshot

The chart is very intuitive—models in the upper-left corner strike the best balance between speed and accuracy. Smaller bubbles indicate lower memory usage. If you plan to deploy on an AWS g5.xlarge instance (24GB VRAM), you can quickly identify the best-suited models here.
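
As a quick sanity check before (or alongside) the chart, you can roughly estimate whether a model fits a given GPU from its parameter count and precision. The 20% overhead factor below is a ballpark assumption, not an exact rule:

```python
def estimated_vram_gb(params_billion: float, bytes_per_param: float, overhead: float = 1.2) -> float:
    """Very rough VRAM estimate for inference: weights only, plus ~20% for KV cache/activations."""
    return params_billion * bytes_per_param * overhead

# Example: a 24 GB card (e.g., the A10G in an AWS g5.xlarge).
print(estimated_vram_gb(7, 2.0))   # 7B in fp16   -> ~16.8 GB: fits
print(estimated_vram_gb(13, 2.0))  # 13B in fp16  -> ~31.2 GB: does not fit
print(estimated_vram_gb(13, 0.5))  # 13B in 4-bit -> ~7.8 GB: fits comfortably
```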

2. Domain-Specific Leaderboards

Hugging Face also provides leaderboards tailored to specific domains. If your project belongs to a specialized field, these are worth checking out.

Open Medical-LLM Leaderboard

A leaderboard dedicated to medical AI models, evaluating capabilities in clinical knowledge and medical genetics. Essential for healthcare-related applications.

Open Medical-LLM Leaderboard

Language-Specific Leaderboards

If your project requires working with languages like Japanese, these leaderboards can be particularly helpful.

Open Japanese-LLM Leaderboard

3. Comparing Open-Source and Closed-Source Models

Vellum Leaderboard

This leaderboard compares both closed-source models (e.g., GPT, Claude) and open-source models (e.g., Llama) side by side.

Vellum Leaderboard

It provides detailed insights, including model generation speed, latency, context window size, and per-token pricing. Additionally, its comparison feature allows you to directly evaluate new models against previous generations. If you're deciding between using an API-based service or hosting a model yourself, this leaderboard's cost information is invaluable.

Screenshot
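
When weighing an API against self-hosting, a back-of-the-envelope calculation is often enough to make the call. All prices and volumes below are placeholders; substitute current figures from the leaderboard:

```python
def monthly_api_cost(requests_per_day, input_tokens, output_tokens,
                     usd_per_1k_input, usd_per_1k_output):
    """Approximate monthly cost of a hosted API, billed per token."""
    daily = requests_per_day * (
        input_tokens / 1000 * usd_per_1k_input
        + output_tokens / 1000 * usd_per_1k_output
    )
    return daily * 30

def monthly_selfhost_cost(instance_usd_per_hour, hours_per_day=24):
    """Approximate monthly cost of keeping a GPU instance running for self-hosting."""
    return instance_usd_per_hour * hours_per_day * 30

# Placeholder numbers for illustration only.
print(monthly_api_cost(5000, 1500, 500, 0.003, 0.015))  # hypothetical per-token pricing
print(monthly_selfhost_cost(1.0))                       # hypothetical ~$1/h GPU instance
```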

SEAL Leaderboard

This leaderboard evaluates models based on specific practical skills rather than general benchmarks.

SEAL Leaderboard

For example, the MultiChallenge benchmark assesses capabilities such as conversational consistency, clarity of explanations, and information synthesis—key factors for chatbot applications.

Screenshot

LMSYS Chatbot Arena

Unlike the other leaderboards, which rely on automated benchmarks, this one evaluates models based on human judgment. Since it incorporates user preferences, it can predict real-world user experience more accurately.

LMSYS Chatbot Arena

Screenshot
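
Arena-style boards collect pairwise human votes ("which answer was better?") and aggregate them into a rating, Elo-style. A minimal sketch of that update rule, just to show how individual preferences turn into a ranking (the Arena's actual aggregation is more sophisticated):

```python
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Update two ratings after one human vote in an A-vs-B comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Example: an underdog (1000) beats a favourite (1200).
print(elo_update(1000, 1200, a_wins=True))  # (~1024.3, ~1175.7)
```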

Additional Thoughts

Even after evaluating models through various leaderboards, real-world constraints often dictate final choices.

In my case, I needed to build a Bedrock Knowledge Base in the Tokyo region, where only Claude 3.5 Sonnet and Claude 3 Haiku were available (and surprisingly, 3.5 Haiku wasn't). Additionally, the Tokyo region doesn't support custom model imports, which limited my options.

Ultimately, I decided to proceed with Claude 3.5 Sonnet. It performed well, especially after optimizations such as query decomposition, FMP, and chunking, which significantly improved response accuracy.

However, query decomposition increased latency (response times exceeded 10 seconds), so I'm currently considering mitigation strategies. Possible solutions include:

  • Using Bedrock in the Virginia region, accepting cross-region communication overhead but gaining access to Claude 3.5 Haiku.
  • Importing an open-source model like Llama as a custom model.
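
Before committing to either option, it's worth confirming what each region actually exposes. Bedrock's ListFoundationModels API makes that easy to script; a quick boto3 sketch:

```python
import boto3

def anthropic_models(region: str) -> list[str]:
    """List Anthropic model IDs available through Bedrock in a given region."""
    bedrock = boto3.client("bedrock", region_name=region)
    models = bedrock.list_foundation_models(byProvider="Anthropic")["modelSummaries"]
    return sorted(m["modelId"] for m in models)

# Compare Tokyo with Virginia.
print("ap-northeast-1:", anthropic_models("ap-northeast-1"))
print("us-east-1:", anthropic_models("us-east-1"))
```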

Choosing the right LLM isn't just about benchmarks—deployment constraints matter just as much. Hopefully, this summary helps others navigate their own model selection process!
