Nahuel Giudizi
Building a Production-Grade LLM Evaluation Framework: From Demo Datasets to Academic Benchmarks

TL;DR: I built an open-source LLM evaluation framework that uses academic benchmarks (MMLU, TruthfulQA, HellaSwag) to provide reproducible performance comparisons. Published on PyPI as llm-benchmark-toolkit.


Why I Built This

When I started evaluating LLMs for production use, I needed a way to make confident decisions about which models to deploy. I wanted evaluation metrics that I could:

  • Verify independently - Run the same tests and get the same results
  • Compare fairly - Use consistent benchmarks across different models
  • Share with confidence - Point my team to public datasets they could validate

I couldn't find a simple tool that did all of this, so I built one.


The Challenge

Choosing the right LLM for production is hard. You need to balance:

  • Accuracy - Does it give correct answers?
  • Performance - How fast does it run on our hardware?
  • Size - Can we deploy it with our infrastructure constraints?

To make these decisions confidently, I needed metrics based on standardized tests that anyone could reproduce.


The Solution: Academic Benchmarks

I built llm-benchmark-toolkit around academic benchmarks - the same datasets cited in research papers:

MMLU (Massive Multitask Language Understanding)

  • 14,042 questions across 57 subjects
  • Tests general knowledge (history, science, math, etc.)
  • Multiple choice format
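
To make the format concrete, here is a minimal sketch of how an MMLU-style question might be rendered into a prompt. This is purely illustrative; `format_mmlu_prompt` is my own hypothetical helper, not the toolkit's actual prompt template.

```python
def format_mmlu_prompt(question: str, choices: list[str]) -> str:
    """Render a question and its four options as a single multiple-choice prompt."""
    letters = ["A", "B", "C", "D"]
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(letters, choices)]
    lines.append("Answer:")
    return "\n".join(lines)

prompt = format_mmlu_prompt(
    "What is the capital of France?",
    ["Berlin", "Madrid", "Paris", "Rome"],
)
print(prompt)
```

The model's reply is then matched against the correct letter, which is what makes the benchmark easy to score automatically.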

TruthfulQA

  • 817 questions testing factual accuracy
  • Focuses on common misconceptions
  • Measures how truthful answers are

HellaSwag

  • 10,042 questions on commonsense reasoning
  • Tests ability to predict what happens next
  • Evaluates real-world understanding

Real-World Example

Here's what these benchmarks show for a lightweight model running on CPU:

Model: qwen2.5:0.5b (500M parameters)

MMLU:       35.2%  (14,042 questions)
TruthfulQA: 42.1%  (817 questions)
HellaSwag:  48.3%  (10,042 questions)
Performance: 288 tokens/sec
Hardware:   AMD Ryzen 9 5950X, 64GB RAM

These numbers tell a clear story:

  • 35% MMLU is good for a 500M parameter model
  • 288 tok/s is fast enough for real-time applications
  • Results are reproducible - anyone can verify them

Comparing Models

The framework makes it easy to compare different models fairly:

qwen2.5:0.5b vs phi3.5:3.8b

| Metric     | qwen2.5:0.5b | phi3.5:3.8b |
|------------|--------------|-------------|
| MMLU       | 35%          | 58%         |
| TruthfulQA | 42%          | 61%         |
| HellaSwag  | 48%          | 72%         |
| Tokens/sec | 288          | 47          |
| RAM usage  | 1.2GB        | 4.5GB       |

The tradeoff is clear: the smaller model is roughly 6x faster and uses about a quarter of the RAM, but trails by roughly 20 percentage points on each benchmark. This data helps you choose based on your specific requirements.
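
The arithmetic behind that summary is worth spelling out (numbers taken from the table above):

```python
# Scores and throughput from the comparison table above
small = {"mmlu": 35, "truthfulqa": 42, "hellaswag": 48, "tok_s": 288}
large = {"mmlu": 58, "truthfulqa": 61, "hellaswag": 72, "tok_s": 47}

speedup = small["tok_s"] / large["tok_s"]
deltas = {k: large[k] - small[k] for k in ("mmlu", "truthfulqa", "hellaswag")}

print(f"speedup: {speedup:.1f}x")   # ~6.1x
print(f"accuracy gaps (points): {deltas}")  # 19-24 points per benchmark
```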


How It Works

1. Simple Installation

pip install llm-benchmark-toolkit

2. CLI Evaluation

# Evaluate a single model
llm-eval --model qwen2.5:0.5b --benchmarks mmlu,truthfulqa

# Compare multiple models
llm-eval --model qwen2.5:0.5b --model phi3.5:3.8b --benchmarks all

3. Python API

from llm_evaluator import LLMEvaluator

# Initialize evaluator
evaluator = LLMEvaluator(
    provider="ollama",
    model="qwen2.5:0.5b"
)

# Run benchmarks
results = evaluator.evaluate([
    "mmlu",
    "truthfulqa",
    "hellaswag"
])

# Generate dashboard
evaluator.create_dashboard("results.html")

Architecture

Provider Abstraction

The framework supports multiple LLM providers through a unified interface:

# Works with any provider
evaluator = LLMEvaluator(
    provider="ollama",  # or "openai", "anthropic", "huggingface"
    model="qwen2.5:0.5b"
)

Caching System

Benchmark runs are cached to avoid redundant API calls:

from llm_evaluator.providers import CachedProvider

# Wrap an existing provider instance (e.g. the Ollama provider created earlier)
cached = CachedProvider(
    base_provider=ollama_provider,
    cache_dir=".eval_cache"
)
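
The general idea behind this kind of cache is to key each response by a hash of the model and prompt, so repeated benchmark runs hit disk instead of the model. Here's a minimal sketch of that pattern; it's my own illustration, not necessarily how `CachedProvider` is implemented internally.

```python
import hashlib
import json
from pathlib import Path

def cache_key(model: str, prompt: str) -> str:
    """Derive a stable filename from the model name and prompt."""
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_generate(model, prompt, generate, cache_dir=".eval_cache"):
    """Return a cached response if one exists; otherwise call the model and store it."""
    path = Path(cache_dir) / f"{cache_key(model, prompt)}.json"
    if path.exists():
        return json.loads(path.read_text())
    response = generate(model, prompt)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(response))
    return response
```

Because the key covers both model and prompt, changing either one produces a fresh entry while identical re-runs are free.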

Visualization Dashboard

The framework generates interactive HTML dashboards with:

  • Benchmark scores
  • Performance metrics
  • System information
  • Comparison charts

What I Learned

1. Context Matters

Raw scores need context to be meaningful:

  • 35% MMLU sounds low in isolation
  • But for a 500M parameter model on CPU, it's actually impressive
  • GPT-4 scores ~86% (with orders of magnitude more parameters)
  • Random guessing = 25% on multiple choice

Always include model size and hardware specs with your results.
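
One simple way to put a multiple-choice score in context is to normalize it against chance: measure how much of the headroom above random guessing the model actually captures. The helper below is my own convention for illustration, not part of the toolkit.

```python
def above_chance(accuracy: float, num_choices: int = 4) -> float:
    """Fraction of the headroom above random guessing that the model achieves."""
    chance = 1.0 / num_choices
    return (accuracy - chance) / (1.0 - chance)

print(f"{above_chance(0.352):.1%}")  # qwen2.5:0.5b on MMLU -> 13.6% of headroom
print(f"{above_chance(0.86):.1%}")   # a GPT-4-level score  -> 81.3% of headroom
```

On this view, 35.2% on a 4-way multiple-choice benchmark is well above the 25% floor, while a score at the floor normalizes to exactly zero.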

2. Reproducibility Builds Trust

Using public datasets means:

  • Anyone can verify your claims
  • Results can be compared across papers/projects
  • Teams can validate findings independently

This transparency is crucial for production decisions.

3. Performance Varies by Hardware

The same model performs differently on different hardware:

# Example: qwen2.5:0.5b performance
CPU (Ryzen 9):     288 tok/s
GPU (RTX 3090):    450 tok/s
MacBook M2:        320 tok/s

Always include hardware specs in your benchmarks.
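
If you want a rough tokens/sec number on your own machine, a simple timing wrapper is enough. This is a generic sketch (the toolkit records throughput for you); note the whitespace-split token count is only an approximation of a real tokenizer.

```python
import time

def measure_throughput(generate, prompt: str) -> float:
    """Time a generation call and return approximate tokens per second."""
    start = time.perf_counter()
    output = generate(prompt)
    elapsed = time.perf_counter() - start
    # Crude token estimate: whitespace-split words; real tokenizers differ.
    return len(output.split()) / elapsed
```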

4. Standardized Tests Enable Fair Comparison

With academic benchmarks, you can:

  • Compare your results to published papers
  • Evaluate new models against established baselines
  • Make data-driven deployment decisions

Tech Stack

The framework is built with production-grade practices:

Core Technologies

  • Python 3.11+ with strict mypy typing
  • HuggingFace datasets for benchmark data
  • Plotly + Matplotlib for visualizations
  • Click for CLI interface
  • Pydantic for configuration

Quality Standards

  • 58 passing tests with 89% coverage
  • Strict typing enforced by mypy
  • CI/CD pipeline with GitHub Actions
  • Code quality validated by ruff + black

# Run tests
pytest tests/ -v --cov=src

# Type checking
mypy src/ --strict

# Linting
ruff check src/
black src/ --check

Installation & Usage

Quick Start

# Install
pip install llm-benchmark-toolkit

# Run evaluation
llm-eval --model qwen2.5:0.5b --benchmarks mmlu

# Get help
llm-eval --help

Python API Example

from llm_evaluator import LLMEvaluator

# Create evaluator
evaluator = LLMEvaluator(
    provider="ollama",
    model="phi3.5:3.8b"
)

# Run benchmarks
results = evaluator.evaluate(["mmlu", "hellaswag"])

# Print results
for benchmark, score in results.items():
    print(f"{benchmark}: {score['accuracy']:.1f}%")

# Generate dashboard
evaluator.create_dashboard("evaluation.html")

Future Plans

I'm planning to add:

More Benchmarks

  • GSM8K - Math reasoning (8,500 questions)
  • HumanEval - Code generation (164 problems)
  • BBH - Big-Bench Hard (challenging reasoning)

Enhanced Features

  • Multi-GPU support for distributed evaluation
  • Cost tracking for API-based models
  • Live monitoring dashboard
  • Automated model comparison reports

Community Contributions

  • Custom benchmark support
  • Additional provider integrations
  • Performance optimizations

Contributing

This is an open-source project and contributions are welcome!

Ways to contribute:

  • Report bugs or suggest features (GitHub issues)
  • Add new benchmarks or providers (Pull requests)
  • Improve documentation
  • Share your evaluation results

Check out the contributing guide to get started.


Resources

Installation

pip install llm-benchmark-toolkit

Related Project

I also built ai-safety-tester - a security testing framework for LLMs:

  • Prompt injection detection
  • Bias analysis
  • CVE-style vulnerability scoring
pip install ai-safety-tester

What's Your Experience?

I'd love to hear from others working on LLM evaluation:

  • What benchmarks do you use?
  • How do you make production deployment decisions?
  • What evaluation challenges have you faced?

Drop a comment or reach out - I'm always interested in learning from the community.


Conclusion

Building this framework taught me that reproducibility is more valuable than impressive-looking scores.

Using standardized academic benchmarks provides:

  • Confidence in model selection
  • Fair comparisons across models
  • Reproducible results anyone can verify

If you're evaluating LLMs and need reproducible metrics, give llm-benchmark-toolkit a try. Feedback and contributions are always welcome!


Questions? Open an issue on GitHub or connect with me on LinkedIn.

Want to contribute? Check out the contributing guide.


Building open-source tools for transparent and reproducible LLM evaluation.

PyPI: https://pypi.org/project/llm-benchmark-toolkit/

Install:

pip install llm-benchmark-toolkit
llm-eval --help