Nahuel Giudizi
Building a Production-Grade LLM Evaluation Framework: From Demo Datasets to Academic Benchmarks

TL;DR: I built an open-source LLM evaluation framework that uses academic benchmarks (MMLU, TruthfulQA, HellaSwag) to provide reproducible performance comparisons. Published on PyPI as llm-benchmark-toolkit.


Why I Built This

When I started evaluating LLMs for production use, I needed a way to make confident decisions about which models to deploy. I wanted evaluation metrics that I could:

  • Verify independently - Run the same tests and get the same results
  • Compare fairly - Use consistent benchmarks across different models
  • Share with confidence - Point my team to public datasets they could validate

I couldn't find a simple tool that did all of this, so I built one.


The Challenge

Choosing the right LLM for production is hard. You need to balance:

  • Accuracy - Does it give correct answers?
  • Performance - How fast does it run on our hardware?
  • Size - Can we deploy it with our infrastructure constraints?

To make these decisions confidently, I needed metrics based on standardized tests that anyone could reproduce.


The Solution: Academic Benchmarks

I built llm-benchmark-toolkit around academic benchmarks - the same datasets cited in research papers:

MMLU (Massive Multitask Language Understanding)

  • 14,042 questions across 57 subjects
  • Tests general knowledge (history, science, math, etc.)
  • Multiple choice format
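
To make the format concrete, here is a minimal sketch of how an MMLU-style question might be rendered into a prompt. This is purely illustrative; `format_mmlu_prompt` is my own hypothetical helper, not the toolkit's actual prompt template.

```python
def format_mmlu_prompt(question: str, choices: list[str]) -> str:
    """Render a question and its four options as a single multiple-choice prompt."""
    letters = ["A", "B", "C", "D"]
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(letters, choices)]
    lines.append("Answer:")
    return "\n".join(lines)

prompt = format_mmlu_prompt(
    "What is the capital of France?",
    ["Berlin", "Madrid", "Paris", "Rome"],
)
print(prompt)
```

The model's reply is then matched against the correct letter, which is what makes the benchmark easy to score automatically.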

TruthfulQA

  • 817 questions testing factual accuracy
  • Focuses on common misconceptions
  • Measures how truthful answers are

HellaSwag

  • 10,042 questions on commonsense reasoning
  • Tests ability to predict what happens next
  • Evaluates real-world understanding

Real-World Example

Here's what these benchmarks show for a lightweight model running on CPU:

Model: qwen2.5:0.5b (500M parameters)

MMLU:       35.2%  (14,042 questions)
TruthfulQA: 42.1%  (817 questions)
HellaSwag:  48.3%  (10,042 questions)
Performance: 288 tokens/sec
Hardware:   AMD Ryzen 9 5950X, 64GB RAM

These numbers tell a clear story:

  • 35% MMLU is good for a 500M parameter model
  • 288 tok/s is fast enough for real-time applications
  • Results are reproducible - anyone can verify them

Comparing Models

The framework makes it easy to compare different models fairly:

qwen2.5:0.5b vs phi3.5:3.8b

| Metric     | qwen2.5:0.5b | phi3.5:3.8b |
|------------|--------------|-------------|
| MMLU       | 35%          | 58%         |
| TruthfulQA | 42%          | 61%         |
| HellaSwag  | 48%          | 72%         |
| Tokens/sec | 288          | 47          |
| RAM usage  | 1.2GB        | 4.5GB       |

The tradeoff is clear: the smaller model is roughly 6x faster and uses about a quarter of the RAM, but trails by roughly 20 percentage points on each benchmark. This data helps you choose based on your specific requirements.
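
The arithmetic behind that summary is worth spelling out (numbers taken from the table above):

```python
# Scores and throughput from the comparison table above
small = {"mmlu": 35, "truthfulqa": 42, "hellaswag": 48, "tok_s": 288}
large = {"mmlu": 58, "truthfulqa": 61, "hellaswag": 72, "tok_s": 47}

speedup = small["tok_s"] / large["tok_s"]
deltas = {k: large[k] - small[k] for k in ("mmlu", "truthfulqa", "hellaswag")}

print(f"speedup: {speedup:.1f}x")   # ~6.1x
print(f"accuracy gaps (points): {deltas}")  # 19-24 points per benchmark
```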


How It Works

1. Simple Installation

pip install llm-benchmark-toolkit

2. CLI Evaluation

# Evaluate a single model
llm-eval --model qwen2.5:0.5b --benchmarks mmlu,truthfulqa

# Compare multiple models
llm-eval --model qwen2.5:0.5b --model phi3.5:3.8b --benchmarks all

3. Python API

from llm_evaluator import LLMEvaluator

# Initialize evaluator
evaluator = LLMEvaluator(
    provider="ollama",
    model="qwen2.5:0.5b"
)

# Run benchmarks
results = evaluator.evaluate([
    "mmlu",
    "truthfulqa",
    "hellaswag"
])

# Generate dashboard
evaluator.create_dashboard("results.html")

Architecture

Provider Abstraction

The framework supports multiple LLM providers through a unified interface:

# Works with any provider
evaluator = LLMEvaluator(
    provider="ollama",  # or "openai", "anthropic", "huggingface"
    model="qwen2.5:0.5b"
)

Caching System

Benchmark runs are cached to avoid redundant API calls:

from llm_evaluator.providers import CachedProvider

# Wrap an existing provider instance (e.g. the Ollama provider created earlier)
cached = CachedProvider(
    base_provider=ollama_provider,
    cache_dir=".eval_cache"
)
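
The general idea behind this kind of cache is to key each response by a hash of the model and prompt, so repeated benchmark runs hit disk instead of the model. Here's a minimal sketch of that pattern; it's my own illustration, not necessarily how `CachedProvider` is implemented internally.

```python
import hashlib
import json
from pathlib import Path

def cache_key(model: str, prompt: str) -> str:
    """Derive a stable filename from the model name and prompt."""
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_generate(model, prompt, generate, cache_dir=".eval_cache"):
    """Return a cached response if one exists; otherwise call the model and store it."""
    path = Path(cache_dir) / f"{cache_key(model, prompt)}.json"
    if path.exists():
        return json.loads(path.read_text())
    response = generate(model, prompt)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(response))
    return response
```

Because the key covers both model and prompt, changing either one produces a fresh entry while identical re-runs are free.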

Visualization Dashboard

The framework generates interactive HTML dashboards with:

  • Benchmark scores
  • Performance metrics
  • System information
  • Comparison charts

What I Learned

1. Context Matters

Raw scores need context to be meaningful:

  • 35% MMLU sounds low in isolation
  • But for a 500M parameter model on CPU, it's actually impressive
  • GPT-4 scores ~86% (with orders of magnitude more parameters)
  • Random guessing = 25% on multiple choice

Always include model size and hardware specs with your results.
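
One simple way to put a multiple-choice score in context is to normalize it against chance: measure how much of the headroom above random guessing the model actually captures. The helper below is my own convention for illustration, not part of the toolkit.

```python
def above_chance(accuracy: float, num_choices: int = 4) -> float:
    """Fraction of the headroom above random guessing that the model achieves."""
    chance = 1.0 / num_choices
    return (accuracy - chance) / (1.0 - chance)

print(f"{above_chance(0.352):.1%}")  # qwen2.5:0.5b on MMLU -> 13.6% of headroom
print(f"{above_chance(0.86):.1%}")   # a GPT-4-level score  -> 81.3% of headroom
```

On this view, 35.2% on a 4-way multiple-choice benchmark is well above the 25% floor, while a score at the floor normalizes to exactly zero.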

2. Reproducibility Builds Trust

Using public datasets means:

  • Anyone can verify your claims
  • Results can be compared across papers/projects
  • Teams can validate findings independently

This transparency is crucial for production decisions.

3. Performance Varies by Hardware

The same model performs differently on different hardware:

# Example: qwen2.5:0.5b performance
CPU (Ryzen 9):     288 tok/s
GPU (RTX 3090):    450 tok/s
MacBook M2:        320 tok/s

Always include hardware specs in your benchmarks.
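
If you want a rough tokens/sec number on your own machine, a simple timing wrapper is enough. This is a generic sketch (the toolkit records throughput for you); note the whitespace-split token count is only an approximation of a real tokenizer.

```python
import time

def measure_throughput(generate, prompt: str) -> float:
    """Time a generation call and return approximate tokens per second."""
    start = time.perf_counter()
    output = generate(prompt)
    elapsed = time.perf_counter() - start
    # Crude token estimate: whitespace-split words; real tokenizers differ.
    return len(output.split()) / elapsed
```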

4. Standardized Tests Enable Fair Comparison

With academic benchmarks, you can:

  • Compare your results to published papers
  • Evaluate new models against established baselines
  • Make data-driven deployment decisions

Tech Stack

The framework is built with production-grade practices:

Core Technologies

  • Python 3.11+ with strict mypy typing
  • HuggingFace datasets for benchmark data
  • Plotly + Matplotlib for visualizations
  • Click for CLI interface
  • Pydantic for configuration

Quality Standards

  • 58 passing tests with 89% coverage
  • Strict typing enforced by mypy
  • CI/CD pipeline with GitHub Actions
  • Code quality validated by ruff + black

# Run tests
pytest tests/ -v --cov=src

# Type checking
mypy src/ --strict

# Linting
ruff check src/
black src/ --check

Installation & Usage

Quick Start

# Install
pip install llm-benchmark-toolkit

# Run evaluation
llm-eval --model qwen2.5:0.5b --benchmarks mmlu

# Get help
llm-eval --help

Python API Example

from llm_evaluator import LLMEvaluator

# Create evaluator
evaluator = LLMEvaluator(
    provider="ollama",
    model="phi3.5:3.8b"
)

# Run benchmarks
results = evaluator.evaluate(["mmlu", "hellaswag"])

# Print results
for benchmark, score in results.items():
    print(f"{benchmark}: {score['accuracy']:.1f}%")

# Generate dashboard
evaluator.create_dashboard("evaluation.html")

Future Plans

I'm planning to add:

More Benchmarks

  • GSM8K - Math reasoning (8,500 questions)
  • HumanEval - Code generation (164 problems)
  • BBH - Big-Bench Hard (challenging reasoning)

Enhanced Features

  • Multi-GPU support for distributed evaluation
  • Cost tracking for API-based models
  • Live monitoring dashboard
  • Automated model comparison reports

Community Contributions

  • Custom benchmark support
  • Additional provider integrations
  • Performance optimizations

Contributing

This is an open-source project and contributions are welcome!

Ways to contribute:

  • Report bugs or suggest features (GitHub issues)
  • Add new benchmarks or providers (Pull requests)
  • Improve documentation
  • Share your evaluation results

Check out the contributing guide to get started.


Resources

Installation

pip install llm-benchmark-toolkit

Related Project

I also built ai-safety-tester - a security testing framework for LLMs:

  • Prompt injection detection
  • Bias analysis
  • CVE-style vulnerability scoring
pip install ai-safety-tester

What's Your Experience?

I'd love to hear from others working on LLM evaluation:

  • What benchmarks do you use?
  • How do you make production deployment decisions?
  • What evaluation challenges have you faced?

Drop a comment or reach out - I'm always interested in learning from the community.


Conclusion

Building this framework taught me that reproducibility is more valuable than impressive-looking scores.

Using standardized academic benchmarks provides:

  • Confidence in model selection
  • Fair comparisons across models
  • Reproducible results anyone can verify

If you're evaluating LLMs and need reproducible metrics, give llm-benchmark-toolkit a try. Feedback and contributions are always welcome!


Questions? Open an issue on GitHub or connect with me on LinkedIn.

Want to contribute? Check out the contributing guide.


Building open-source tools for transparent and reproducible LLM evaluation.

PyPI: https://pypi.org/project/llm-benchmark-toolkit/

Install:

pip install llm-benchmark-toolkit
llm-eval --help