Are you evaluating Large Language Models (LLMs) for your NLP tasks? Do you need a flexible, scalable framework that supports multiple providers? Look no further: LLMeBench is here!
What is LLMeBench?
LLMeBench is an open-source benchmarking framework designed to help researchers and developers evaluate LLMs across different tasks, providers, and languages.
With LLMeBench 1.1.0, we've added:
- Expanded modality support (text, vision, multimodal tasks)
- More evaluation metrics for precise comparisons
- Improved dataset integration for smoother benchmarking
GitHub repo: github.com/qcri/LLMeBench
Why Benchmarking LLMs Is Important
The rapid rise of GPT-4, BLOOMZ, Falcon, and LLaMA has created a need for systematic performance evaluation. LLMs behave differently across tasks, datasets, and languages, making standardized benchmarking essential for:
- Model Comparison: Which LLM performs best for a specific task?
- Cost & Latency Analysis: Is an LLM efficient for real-world deployment?
- Fairness & Bias Detection: Does the model exhibit language-specific biases?
LLMeBench addresses these challenges with a structured benchmarking approach that supports various model providers (a short sketch of how a provider is selected follows this list):
- OpenAI (GPT models)
- Hugging Face Inference API
- Azure AI models
- Models deployed through vLLM
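To make that concrete, here is a minimal, hypothetical fragment of an asset configuration showing where the provider is chosen. The class name OpenAIModel and the max_tries argument are assumptions made for illustration; the actual model classes and their options are defined by LLMeBench itself, so check the repository documentation before copying this.

# Hypothetical fragment of an asset's config() illustrating provider selection.
# OpenAIModel and max_tries are assumed names for illustration only; consult
# the LLMeBench docs for the real model classes and arguments.
from llmebench.models import OpenAIModel


def config():
    return {
        # ... dataset and task entries omitted for brevity ...
        "model": OpenAIModel,
        "model_args": {
            # Provider credentials are normally supplied via environment
            # variables rather than hard-coded in the asset.
            "max_tries": 3,
        },
    }

Switching providers then amounts to changing this one entry (and exporting the matching credentials) while the dataset, task, and metrics stay untouched.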
Getting Started with LLMeBench
- Install LLMeBench:
pip install 'llmebench[fewshot]'
- Download the current assets:
python -m llmebench assets download
This will fetch assets and place them in the current working directory.
- Download one of the datasets, e.g., ArSAS:
python -m llmebench data download ArSAS
This will download the data into the data folder inside the current working directory.
- Evaluate!
For example, to evaluate the performance of a random baseline for sentiment analysis on the ArSAS dataset, you can run:
python -m llmebench --filter 'sentiment/ArSAS_Random*' assets/ results/
This uses the ArSAS_Random "asset": a file that specifies the dataset, model, and task to evaluate. Here, ArSAS_Random is the asset name, referring to the ArSAS dataset and the Random model, and assets/ar/sentiment_emotion_others/sentiment/ is the directory containing the benchmarking asset for the sentiment analysis task on the Arabic ArSAS dataset. Results will be saved in a directory called results.
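To give a feel for what such an asset contains, below is a minimal sketch of a random sentiment baseline. The imports, class names, and the random_response key are illustrative assumptions rather than the exact shipped file; the real asset lives under assets/ar/sentiment_emotion_others/sentiment/ in the repository.

# Illustrative sketch of an LLMeBench asset (not the exact shipped file).
# An asset is a Python module exposing config(), prompt(), and post_process();
# the class names and the response key below are assumptions for illustration.
from llmebench.datasets import ArSASDataset
from llmebench.models import RandomModel
from llmebench.tasks import SentimentTask


def config():
    # Wire together the dataset, task, and model to benchmark.
    return {
        "dataset": ArSASDataset,
        "task": SentimentTask,
        "model": RandomModel,
        "model_args": {
            # A random baseline only needs the label set to sample from.
            "class_labels": ["Positive", "Negative", "Neutral", "Mixed"],
        },
    }


def prompt(input_sample):
    # A random baseline passes the sample straight through; an LLM-backed
    # asset would build the actual prompt text here.
    return input_sample


def post_process(response):
    # Map the raw model response to a label the evaluation metric expects.
    return response["random_response"]

The --filter 'sentiment/ArSAS_Random*' pattern above is what selects an asset like this and runs it end to end.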
- View the Results
LLMeBench generates a performance report with:
- Accuracy
- Response time
- Task-specific metrics
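If you want to poke at the raw output yourself, a quick generic way is to walk the results directory and print whatever JSON it contains. The sketch below assumes only that the run leaves JSON files somewhere under results/; the exact filenames and report keys depend on the task and are not specified here.

# Generic helper for browsing benchmark output. It assumes only that the run
# left JSON files under results/; the layout and keys are task-specific.
import json
from pathlib import Path


def show_results(results_dir="results"):
    for path in sorted(Path(results_dir).rglob("*.json")):
        with open(path) as f:
            report = json.load(f)
        print(f"== {path} ==")
        if isinstance(report, dict):
            # Print top-level entries so metrics such as accuracy stand out.
            for key, value in report.items():
                print(f"  {key}: {value}")
        else:
            print(f"  {report}")


if __name__ == "__main__":
    show_results()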
Why Use LLMeBench?
- Works with any NLP model & dataset
- Supports multiple providers (OpenAI, HF, Azure, Petals)
- Handles multimodal & multilingual benchmarking
- Saves time & effort in evaluation
Join the Community & Contribute
We're excited to see researchers & developers using LLMeBench for their benchmarking needs!
Try LLMeBench today: github.com/qcri/LLMeBench
If you find it useful, give us a star on GitHub!
Have feedback or feature requests? Open an issue or PR; we'd love to hear from you!
What's Next?
We're constantly improving LLMeBench with new features & optimizations. Stay tuned for:
- More task-specific benchmarking modules
- Fine-grained evaluation for multilingual models
- Support for additional model providers
If you're working with LLMs and benchmarking, we'd love to hear how LLMeBench can help your workflow! Drop a comment below or connect with us on GitHub!