TL;DR: I built an open-source LLM evaluation framework that uses academic benchmarks (MMLU, TruthfulQA, HellaSwag) to provide reproducible performance comparisons. Published on PyPI as llm-benchmark-toolkit.
## Why I Built This
When I started evaluating LLMs for production use, I needed a way to make confident decisions about which models to deploy. I wanted evaluation metrics that I could:
- **Verify independently**: run the same tests and get the same results
- **Compare fairly**: use consistent benchmarks across different models
- **Share with confidence**: point my team to public datasets they could validate
I couldn't find a simple tool that did all of this, so I built one.
## The Challenge
Choosing the right LLM for production is hard. You need to balance:
- **Accuracy**: does it give correct answers?
- **Performance**: how fast does it run on your hardware?
- **Size**: can you deploy it within your infrastructure constraints?
To make these decisions confidently, I needed metrics based on standardized tests that anyone could reproduce.
## The Solution: Academic Benchmarks
I built llm-benchmark-toolkit around academic benchmarks - the same datasets cited in research papers:
### MMLU (Massive Multitask Language Understanding)
- 14,042 questions across 57 subjects
- Tests general knowledge (history, science, math, etc.)
- Multiple choice format
### TruthfulQA
- 817 questions testing factual accuracy
- Focuses on common misconceptions
- Measures how truthful answers are
### HellaSwag
- 10,042 questions on commonsense reasoning
- Tests ability to predict what happens next
- Evaluates real-world understanding
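To make the evaluation mechanics concrete, here is a minimal sketch of how a multiple-choice item in the MMLU/HellaSwag style can be rendered into a prompt and scored. The helper names and the sample question are my own inventions for illustration, not part of the toolkit:

```python
# Sketch: turning a multiple-choice item into a prompt and scoring
# the model's letter answer. The sample question is invented.
CHOICE_LABELS = ["A", "B", "C", "D"]

def format_mcq(question: str, choices: list[str]) -> str:
    """Render the question and its four options as one prompt string."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in zip(CHOICE_LABELS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def score_answer(model_output: str, correct_index: int) -> bool:
    """Interpret the first character of the reply as the chosen letter."""
    text = model_output.strip().upper()
    text = text.removeprefix("ANSWER:").strip()  # tolerate "Answer: B"
    return bool(text) and text[0] == CHOICE_LABELS[correct_index]

prompt = format_mcq(
    "What is the boiling point of water at sea level?",
    ["90 C", "100 C", "110 C", "120 C"],
)
is_correct = score_answer("B", correct_index=1)  # True
```

Accuracy over a benchmark is then just the fraction of items where `score_answer` returns `True`.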
## Real-World Example
Here's what these benchmarks show for a lightweight model running on CPU:
```text
Model:       qwen2.5:0.5b (500M parameters)
MMLU:        35.2%  (14,042 questions)
TruthfulQA:  42.1%  (817 questions)
HellaSwag:   48.3%  (10,042 questions)
Performance: 288 tokens/sec
Hardware:    AMD Ryzen 9 5950X, 64GB RAM
```
These numbers tell a clear story:
- **35% MMLU** is solid for a 500M-parameter model (random guessing scores 25%)
- **288 tok/s** is fast enough for real-time applications
- **Reproducible**: anyone can rerun the same benchmarks and verify these numbers
## Comparing Models
The framework makes it easy to compare different models fairly:
### qwen2.5:0.5b vs phi3.5:3.8b
| Metric | qwen2.5:0.5b | phi3.5:3.8b |
|---|---|---|
| MMLU | 35% | 58% |
| TruthfulQA | 42% | 61% |
| HellaSwag | 48% | 72% |
| Tokens/sec | 288 | 47 |
| RAM Usage | 1.2GB | 4.5GB |
**The tradeoff is clear:** the smaller model is roughly 6x faster but scores 20-30 percentage points lower on every benchmark. This data lets you choose based on your specific requirements.
## How It Works

### 1. Simple Installation

```bash
pip install llm-benchmark-toolkit
```
### 2. CLI Evaluation

```bash
# Evaluate a single model
llm-eval --model qwen2.5:0.5b --benchmarks mmlu,truthfulqa

# Compare multiple models
llm-eval --model qwen2.5:0.5b --model phi3.5:3.8b --benchmarks all
```
### 3. Python API

```python
from llm_evaluator import LLMEvaluator

# Initialize evaluator
evaluator = LLMEvaluator(
    provider="ollama",
    model="qwen2.5:0.5b"
)

# Run benchmarks
results = evaluator.evaluate([
    "mmlu",
    "truthfulqa",
    "hellaswag"
])

# Generate dashboard
evaluator.create_dashboard("results.html")
```
## Architecture

### Provider Abstraction
The framework supports multiple LLM providers through a unified interface:
```python
# Works with any provider
evaluator = LLMEvaluator(
    provider="ollama",  # or "openai", "anthropic", "huggingface"
    model="qwen2.5:0.5b"
)
```
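A unified interface like this is typically just a shared method signature that every backend satisfies. Here is one way such an abstraction could be sketched with a structural `Protocol`; the `generate` signature and class names are my assumptions, not the toolkit's actual API:

```python
# Sketch of a provider abstraction via structural typing; names and
# signatures are illustrative assumptions, not the toolkit's real API.
from typing import Protocol

class LLMProvider(Protocol):
    def generate(self, prompt: str, max_tokens: int = 256) -> str: ...

class EchoProvider:
    """Trivial backend used to exercise the interface without a model."""
    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        return prompt[:max_tokens]

def run_prompt(provider: LLMProvider, prompt: str) -> str:
    # The evaluator depends only on the Protocol, so an Ollama, OpenAI,
    # Anthropic, or HuggingFace backend can be swapped in unchanged.
    return provider.generate(prompt)

answer = run_prompt(EchoProvider(), "2+2=")
```

Because `Protocol` uses structural subtyping, a new backend only has to implement `generate`; it never needs to inherit from a framework base class.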
### Caching System
Benchmark runs are cached to avoid redundant API calls:
```python
from llm_evaluator.providers import CachedProvider

cached = CachedProvider(
    base_provider=ollama_provider,
    cache_dir=".eval_cache"
)
```
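Under the hood, a cache like this typically keys responses on a hash of the model name and prompt. A minimal dependency-free sketch of the idea (not the toolkit's actual implementation):

```python
# Minimal sketch of disk caching keyed on (model, prompt); the real
# CachedProvider's internals may differ.
import hashlib
import json
import pathlib
import tempfile

class DiskCache:
    def __init__(self, cache_dir: str) -> None:
        self.dir = pathlib.Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)

    def _path(self, model: str, prompt: str) -> pathlib.Path:
        digest = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
        return self.dir / f"{digest}.json"

    def get_or_call(self, model: str, prompt: str, fn):
        path = self._path(model, prompt)
        if path.exists():  # cache hit: skip the backend entirely
            return json.loads(path.read_text())["response"]
        response = fn(prompt)
        path.write_text(json.dumps({"response": response}))
        return response

# Demo: the second identical call is served from disk, not the backend.
calls: list[str] = []
def backend(prompt: str) -> str:
    calls.append(prompt)
    return prompt.upper()

with tempfile.TemporaryDirectory() as tmp:
    cache = DiskCache(tmp)
    first = cache.get_or_call("qwen2.5:0.5b", "hi", backend)
    second = cache.get_or_call("qwen2.5:0.5b", "hi", backend)
```

This matters most for API-based providers, where a re-run of a 14,042-question benchmark would otherwise repeat every paid call.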
### Visualization Dashboard
The framework generates interactive HTML dashboards with:
- Benchmark scores
- Performance metrics
- System information
- Comparison charts
## What I Learned

### 1. Context Matters
Raw scores need context to be meaningful:
- 35% MMLU sounds low in isolation
- But for a 500M-parameter model running on CPU, it's a respectable score
- GPT-4 scores ~86% on MMLU (its parameter count is undisclosed, but it is orders of magnitude larger)
- Random guessing scores 25% on four-option multiple choice
Always include model size and hardware specs with your results.
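One way to bake that context into the number itself is to normalize accuracy so that chance level maps to 0 and a perfect score to 1. This chance-corrected framing is mine, not a toolkit feature:

```python
# Chance-corrected accuracy: maps random guessing to 0.0 and a
# perfect score to 1.0, so four-option and binary tasks compare fairly.
def above_chance(accuracy: float, n_choices: int = 4) -> float:
    chance = 1.0 / n_choices
    return (accuracy - chance) / (1.0 - chance)

# qwen2.5:0.5b's 35.2% MMLU is ~13.6% of the way from chance to perfect.
corrected = above_chance(0.352)
```

Seen this way, the 500M model clearly knows something, while a score of exactly 25% would be indistinguishable from guessing.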
### 2. Reproducibility Builds Trust
Using public datasets means:
- Anyone can verify your claims
- Results can be compared across papers/projects
- Teams can validate findings independently
This transparency is crucial for production decisions.
### 3. Performance Varies by Hardware
The same model performs differently on different hardware:
```text
# Example: qwen2.5:0.5b throughput by machine
CPU (Ryzen 9 5950X): 288 tok/s
GPU (RTX 3090):      450 tok/s
MacBook M2:          320 tok/s
```
Always include hardware specs in your benchmarks.
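Throughput figures like these can be collected with a simple wall-clock harness around any generation call. In this sketch, `fake_generate` is a stub standing in for a real model call that reports how many tokens it produced:

```python
# Rough harness for measuring tokens/sec around any generation call.
import time

def measure_throughput(generate, prompt: str) -> float:
    """Time one generation call and return tokens per second."""
    start = time.perf_counter()
    n_tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

def fake_generate(prompt: str) -> int:
    time.sleep(0.01)  # stub: pretend inference takes ~10 ms
    return 64         # and produced 64 tokens

tok_per_sec = measure_throughput(fake_generate, "hello")
```

For stable numbers in practice you would average over many prompts and discard the first (warm-up) call, since model loading and cache effects dominate early measurements.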
### 4. Standardized Tests Enable Fair Comparison
With academic benchmarks, you can:
- Compare your results to published papers
- Evaluate new models against established baselines
- Make data-driven deployment decisions
## Tech Stack
The framework is built with production-grade practices:
### Core Technologies
- Python 3.11+ with strict mypy typing
- HuggingFace datasets for benchmark data
- Plotly + Matplotlib for visualizations
- Click for CLI interface
- Pydantic for configuration
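The toolkit uses Pydantic for configuration; to keep this sketch dependency-free I show the same flavor of validated config with a stdlib dataclass instead. The field names are invented for illustration:

```python
# Validated evaluation config, sketched with a stdlib dataclass in
# place of Pydantic. Field names are illustrative assumptions.
from dataclasses import dataclass, field

KNOWN_PROVIDERS = {"ollama", "openai", "anthropic", "huggingface"}

@dataclass
class EvalConfig:
    provider: str
    model: str
    benchmarks: list[str] = field(default_factory=lambda: ["mmlu"])

    def __post_init__(self) -> None:
        # Fail fast on typos instead of erroring mid-benchmark.
        if self.provider not in KNOWN_PROVIDERS:
            raise ValueError(f"unknown provider: {self.provider!r}")

cfg = EvalConfig(provider="ollama", model="qwen2.5:0.5b")
```

A Pydantic `BaseModel` adds type coercion and richer error messages on top of this, which is why it's worth the dependency in the real project.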
### Quality Standards
- 58 passing tests with 89% coverage
- Strict typing enforced by mypy
- CI/CD pipeline with GitHub Actions
- Code quality validated by ruff + black
```bash
# Run tests
pytest tests/ -v --cov=src

# Type checking
mypy src/ --strict

# Linting
ruff check src/
black src/ --check
```
## Installation & Usage

### Quick Start
```bash
# Install
pip install llm-benchmark-toolkit

# Run evaluation
llm-eval --model qwen2.5:0.5b --benchmarks mmlu

# Get help
llm-eval --help
```
### Python API Example

```python
from llm_evaluator import LLMEvaluator

# Create evaluator
evaluator = LLMEvaluator(
    provider="ollama",
    model="phi3.5:3.8b"
)

# Run benchmarks
results = evaluator.evaluate(["mmlu", "hellaswag"])

# Print results
for benchmark, score in results.items():
    print(f"{benchmark}: {score['accuracy']:.1f}%")

# Generate dashboard
evaluator.create_dashboard("evaluation.html")
```
## Future Plans
I'm planning to add:
### More Benchmarks
- GSM8K - Math reasoning (8,500 questions)
- HumanEval - Code generation (164 problems)
- BBH - Big-Bench Hard (challenging reasoning)
### Enhanced Features
- Multi-GPU support for distributed evaluation
- Cost tracking for API-based models
- Live monitoring dashboard
- Automated model comparison reports
### Community Contributions
- Custom benchmark support
- Additional provider integrations
- Performance optimizations
## Contributing
This is an open-source project and contributions are welcome!
Ways to contribute:
- Report bugs or suggest features (GitHub issues)
- Add new benchmarks or providers (Pull requests)
- Improve documentation
- Share your evaluation results
Check out the contributing guide to get started.
## Resources

### Links
- GitHub: https://github.com/NahuelGiudizi/llm-evaluation
- PyPI: https://pypi.org/project/llm-benchmark-toolkit/
- Documentation: Coming soon
### Installation

```bash
pip install llm-benchmark-toolkit
```
### Related Project
I also built ai-safety-tester - a security testing framework for LLMs:
- Prompt injection detection
- Bias analysis
- CVE-style vulnerability scoring
```bash
pip install ai-safety-tester
```
## What's Your Experience?
I'd love to hear from others working on LLM evaluation:
- What benchmarks do you use?
- How do you make production deployment decisions?
- What evaluation challenges have you faced?
Drop a comment or reach out - I'm always interested in learning from the community.
## Conclusion
Building this framework taught me that reproducibility is more valuable than impressive-looking scores.
Using standardized academic benchmarks provides:
- Confidence in model selection
- Fair comparisons across models
- Reproducible results anyone can verify
If you're evaluating LLMs and need reproducible metrics, give llm-benchmark-toolkit a try. Feedback and contributions are always welcome!
Questions? Open an issue on GitHub or connect with me on LinkedIn.
Want to contribute? Check out the contributing guide.
Building open-source tools for transparent and reproducible LLM evaluation.