DEV Community

Firoj Alam

Benchmarking LLMs Made Easy with LLMeBench

🔹 Are you evaluating Large Language Models (LLMs) for your NLP tasks?
🔹 Do you need a flexible, scalable framework that supports multiple providers?

Look no further: LLMeBench is here!

What is LLMeBench?

LLMeBench is an open-source benchmarking framework designed to help researchers and developers evaluate LLMs across different tasks, providers, and languages.

With LLMeBench 1.1.0, we've added:

✅ Expanded modality support (text, vision, multimodal tasks)
✅ More evaluation metrics for precise comparisons
✅ Improved dataset integration for smoother benchmarking

🔗 GitHub Repo → github.com/qcri/LLMeBench

💡 Why Benchmarking LLMs is Important

The rapid rise of GPT-4, BLOOMZ, Falcon, and LLaMA has created a need for systematic performance evaluation. LLMs behave differently across tasks, datasets, and languages, making standardized benchmarking essential for:

📌 Model Comparison → Which LLM performs best for a specific task?
📌 Cost & Latency Analysis → Is an LLM efficient for real-world deployment?
📌 Fairness & Bias Detection → Does the model exhibit language-specific biases?

LLMeBench addresses these challenges with a structured benchmarking approach that supports model providers such as:
🟢 OpenAI (GPT models)
🟢 Hugging Face Inference API
🟢 Azure AI models
🟢 Models deployed through vLLM
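Hosted providers are typically configured through API credentials in environment variables before a benchmark is launched. The variable name below is an assumption based on the OpenAI SDK convention, not confirmed LLMeBench configuration; check the framework's documentation for the exact names each model class reads.

```shell
# Hypothetical credential setup (variable name follows the OpenAI SDK
# convention; the LLMeBench docs list the exact names each provider needs).
export OPENAI_API_KEY="sk-placeholder"

# Confirm the key is exported before launching the benchmark.
echo "OPENAI_API_KEY is ${OPENAI_API_KEY:+set}"
```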

Getting Started with LLMeBench

  1. Install LLMeBench

    pip install 'llmebench[fewshot]'

  2. Download the current assets:

    python -m llmebench assets download

This will fetch the assets and place them in the current working directory.

  3. Download one of the datasets, e.g., ArSAS:

    python -m llmebench data download ArSAS

This will download the data into a data folder in the current working directory.

  4. Evaluate!

For example, to evaluate the performance of a random baseline for sentiment analysis on the ArSAS dataset, run:

    python -m llmebench --filter 'sentiment/ArSAS_Random*' assets/ results/

This uses the "ArSAS_Random" asset: a file that specifies the dataset, model, and task to evaluate. Here, ArSAS_Random refers to the ArSAS dataset paired with the Random baseline model, and the corresponding asset file for the Arabic sentiment analysis task lives under assets/ar/sentiment_emotion_others/sentiment/. Results will be saved in the results/ directory.
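An asset is a small Python module. The sketch below shows the general shape: a config() that wires together a dataset, task, and model, plus prompt-construction and post-processing hooks. The names and string values here are illustrative stand-ins rather than LLMeBench's actual imports; see the repo's assets/ directory for real examples.

```python
# Hypothetical sketch of an LLMeBench asset module. Real assets import
# dataset/task/model classes from llmebench; plain strings are used here
# so the sketch stays self-contained.

def config():
    # Wire together the dataset, task, and model to benchmark.
    return {
        "dataset": "ArSASDataset",
        "task": "SentimentTask",
        "model": "RandomModel",
    }

def prompt(input_sample):
    # Build the prompt sent to the model for a single sample.
    return [{"role": "user", "content": f"Classify the sentiment: {input_sample}"}]

def post_process(response):
    # Map the raw model output to a task label.
    return response.strip()
```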

  5. View the Results

LLMeBench generates a performance report with:

📊 Accuracy
⏳ Response time
📈 Task-specific metrics
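As a rough illustration of what the accuracy number in such a report means (this is not LLMeBench's internal code), accuracy is simply the fraction of predictions that match the gold labels:

```python
def accuracy(gold, predicted):
    # Fraction of predictions that exactly match the gold labels.
    assert len(gold) == len(predicted), "label lists must align"
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

# Two of the three predictions match the gold labels, so accuracy is 2/3.
print(accuracy(["Positive", "Negative", "Neutral"],
               ["Positive", "Negative", "Positive"]))
```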

🎯 Why Use LLMeBench?

✔ Works with any NLP model & dataset
✔ Supports multiple providers (OpenAI, HF, Azure, Petals)
✔ Handles multimodal & multilingual benchmarking
✔ Saves time & effort in evaluation

โญ Join the Community & Contribute

Weโ€™re excited to see researchers & developers using LLMeBench for their benchmarking needs! ๐Ÿš€

๐Ÿ”— Try LLMeBench today: github.com/qcri/LLMeBench
โญ If you find it useful, give us a star on GitHub!

๐Ÿ’ฌ Have feedback or feature requests? Open an issue or PR -- weโ€™d love to hear from you!

💡 What's Next?

We're constantly improving LLMeBench with new features & optimizations. Stay tuned for:
✅ More task-specific benchmarking modules
✅ Fine-grained evaluation for multilingual models
✅ Support for additional model providers

🔥 If you're working with LLMs and benchmarking, we'd love to hear how LLMeBench can help your workflow! Drop a comment below or connect with us on GitHub! 🚀✨
