What is an LLM evaluation harness? A deep dive into lm-eval-harness
You fine-tuned a 7B model. It aced your smoke tests, your colleague ran a few prompts and shrugged approvingly, and the README is now full of cherry-picked outputs that look great in a screenshot. Then someone asks: how good is it, really? — and you realize you have no number to point at. No MMLU score. No HellaSwag. Nothing reproducible, nothing you can defend in a PR review, nothing you can compare to last week's checkpoint.
That's the gap an evaluation harness fills. It turns "vibes-based evaluation" into something with a score, a stderr, and a config file you can re-run next Tuesday.
Why evaluate LLMs at all?
Two reasons that actually matter:
- Comparability. If you can't put a number on a model, you can't compare it to anything else — not the previous checkpoint, not the open-source baseline, not the commercial API you're trying to replace. Leaderboards are noisy and gaming-prone, but a local leaderboard with the tasks you care about is one of the most useful artifacts a team can build.
- Regression detection. Most model regressions are silent. A 0.3-point drop on MMLU won't show up in a chat session, but it will show up in CI. People who ship models for a living treat evals the way backend engineers treat unit tests: mandatory, run on every PR, and blocking on regressions.
You don't need a hundred benchmarks. You need the three to five tasks that map to your actual use case, plus one or two general capability anchors (MMLU, HellaSwag) so you can sanity-check that you didn't accidentally destroy basic reasoning while you were tuning for your domain.
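The regression-detection idea above can be sketched as a small CI gate. This is a minimal sketch, not something lm-eval-harness ships: the function name `is_regression` and the threshold `z` are assumptions for illustration, and it treats the two runs as independent, which is conservative when both runs score the same test examples.

```python
import math


def is_regression(baseline: float, candidate: float,
                  baseline_stderr: float, candidate_stderr: float,
                  z: float = 1.96) -> bool:
    """Return True if the candidate score dropped by more than the noise band."""
    # Standard error of the difference, assuming the two runs are independent.
    diff_stderr = math.sqrt(baseline_stderr ** 2 + candidate_stderr ** 2)
    drop = baseline - candidate
    # Only flag drops that exceed z standard errors of the difference.
    return drop > z * diff_stderr


# A 0.3-point drop with 0.4-point stderrs is inside the noise band.
print(is_regression(0.654, 0.651, 0.004, 0.004))  # False
# A 3.4-point drop is not.
print(is_regression(0.654, 0.620, 0.004, 0.004))  # True
```

The gate only blocks on drops it can distinguish from sampling noise, so a small delta on a small benchmark passes until you add more examples or tighten `z`.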
What is an "evaluation harness"?
An evaluation harness is the software that sits between a model and a benchmark. It handles the boring-but-critical parts: loading the model weights, tokenizing prompts in the way the benchmark expects, running inference, extracting the answer from a longer generation, scoring it against a ground-truth key, aggregating across examples, and writing out a JSON or CSV you can diff against last week's run.
The key insight is the separation between the model and the test. The benchmark is just a dataset plus a scoring rule. The harness is the plumbing. Keeping them separate is what lets you evaluate the same model on many benchmarks, or many models on the same benchmark, without reimplementing either side.
Here's what the pipeline looks like end to end:
flowchart LR
A[Load model<br/>HF / vLLM / API] --> B[Format prompt<br/>task template]
B --> C[Generate<br/>logprobs or text]
C --> D[Extract answer<br/>regex / logprob argmax]
D --> E[Score<br/>acc, F1, BLEU, …]
E --> F[Aggregate<br/>mean, stderr, fewshot splits]
F --> G[Write results<br/>JSON / CSV / wandb]
Every box above is configurable in lm-eval-harness. That's the whole game.
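To make the stages concrete, here is a toy version of the same pipeline in plain Python. It is a sketch under stated assumptions, not how lm-eval-harness is implemented: `Task`, `evaluate`, and `toy_model` are hypothetical names, the model is any callable from prompt to text, and scoring is exact match on a regex-extracted answer.

```python
import math
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    """A benchmark: rows of data, a prompt template, and an answer pattern."""
    docs: list           # each doc is a dict with "question" and "answer"
    template: str        # str.format template applied to each doc
    answer_pattern: str  # regex whose first group captures the answer


def extract_answer(task: Task, generation: str) -> Optional[str]:
    match = re.search(task.answer_pattern, generation)
    return match.group(1) if match else None


def evaluate(model: Callable[[str], str], task: Task) -> dict:
    scores = []
    for doc in task.docs:
        prompt = task.template.format(**doc)           # format prompt
        generation = model(prompt)                     # generate
        prediction = extract_answer(task, generation)  # extract answer
        scores.append(1.0 if prediction == doc["answer"] else 0.0)  # score
    # Aggregate: mean accuracy and the sample standard error of the mean.
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1) if n > 1 else 0.0
    return {"acc": mean, "acc_stderr": math.sqrt(variance / n), "n": n}


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a real backend: always answers "4".
    return "The answer is 4."


task = Task(
    docs=[{"question": "2 + 2", "answer": "4"},
          {"question": "3 + 3", "answer": "6"}],
    template="Q: What is {question}?\nA:",
    answer_pattern=r"answer is (\d+)",
)
result = evaluate(toy_model, task)
print(result)
```

Because `evaluate` only sees a callable and a `Task`, swapping the model or the benchmark does not touch the other side, which is the separation the harness is built around.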
lm-eval-harness, in detail
EleutherAI started the project in 2020 as a unified way to reproduce published LLM benchmark numbers. It's now at v0.4.12 (May 2026), ships with 200+ tasks spanning reasoning, knowledge, coding, math, multilingual, and long-context benchmarks, and supports a long list of model backends: Hugging Face transformers, vLLM, SGLang, GPT-NeoX, Megatron-DeepSpeed, plus API endpoints for OpenAI, Anthropic, and a few others.
A few things changed in the last year that are worth knowing about:
- The CLI got refactored (v0.4.x). The old flat lm_eval --tasks ... still works, but the new style uses subcommands: lm_eval run, lm_eval ls, lm_eval validate. You can now also drive a whole run from a YAML config file via --config, which is the only sane way to manage more than a handful of tasks.
- The install got lighter. The base package no longer pulls in transformers or torch. You install the backend you actually need: pip install lm_eval[hf] or lm_eval[vllm] or lm_eval[api]. A 30 MB wheel instead of a 4 GB one.
- Multimodal is in prototype via the hf-multimodal and vllm-vlm model types, with mmmu as the first real task. If you're doing vision-language, look at lmms-eval instead; it's a fork with much broader multimodal task coverage.
Anatomy of a task
Every benchmark in the registry is a YAML file. Here's a real one — hellaswag.yaml, straight from the repo:
tag:
  - multiple_choice
task: hellaswag
dataset_path: Rowan/hellaswag
dataset_name: null
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: null
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{label}}"
doc_to_choice: "choices"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
The fields you'll touch most:
- task: the task's registered name, what you pass to --tasks.
- dataset_path: a Hugging Face dataset id. Most tasks point at a public dataset; private ones need an HF_TOKEN env var.
- output_type: drives the whole scoring pipeline. multiple_choice uses logprob-based argmax (fast, no generation). generate_until requires the model to actually produce text. There are also loglikelihood, for scoring a fixed continuation, and loglikelihood_rolling, for perplexity-style tasks.
- doc_to_text / doc_to_target / doc_to_choice: Jinja2 templates (or plain column names, as with "choices" here) that extract fields from each dataset row. {{query}} is a column in the row.
- metric_list: what to compute. acc is raw accuracy, acc_norm is accuracy after length normalization (matters for HellaSwag and a few others where the raw score otherwise depends on how long each choice is).
- metadata.version: bumped whenever a task definition changes, so old result files don't get conflated with new ones. If you change a task, bump this.
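The difference between acc and acc_norm is easiest to see on one example. The sketch below is illustrative only: `pick_choice` is a hypothetical helper, the log-likelihoods are made up, and dividing by character count is an assumption about the exact normalizer the harness uses.

```python
def pick_choice(loglikelihoods, choices, normalize=False):
    """Return the index of the best-scoring choice.

    With normalize=True, each log-likelihood is divided by the length of its
    choice string, so long continuations are not penalized just for being long.
    """
    if normalize:
        scores = [ll / len(choice) for ll, choice in zip(loglikelihoods, choices)]
    else:
        scores = list(loglikelihoods)
    return max(range(len(scores)), key=scores.__getitem__)


choices = ["sits.", "picks up the ball and throws it to the dog."]
loglikelihoods = [-6.0, -20.0]  # made-up summed token log-probs per choice

raw_pick = pick_choice(loglikelihoods, choices)                   # acc-style
norm_pick = pick_choice(loglikelihoods, choices, normalize=True)  # acc_norm-style
print(raw_pick, norm_pick)
```

The raw pick favors the short ending because a summed log-probability shrinks with every extra token; the normalized pick compares per-character scores instead, which is why the two metrics can disagree on the same model output.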
You can write your own task by dropping a YAML file in a directory and pointing at it with --include_path. People do this for domain-specific benchmarks constantly.
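The process_docs hook in the YAML above is where fields like query and choices come from: it reshapes each raw dataset row before the templates run. What follows is a simplified per-row sketch, not the repo's utils.py. The column names (`activity_label`, `ctx_a`, `ctx_b`, `endings`, `label`) are assumed from the HellaSwag dataset card, and `clean` and `process_row` are hypothetical helpers.

```python
import re


def clean(text: str) -> str:
    """Drop bracketed markup such as "[step]" and collapse repeated spaces."""
    text = re.sub(r"\[.*?\]", "", text.strip())
    return re.sub(r"\s{2,}", " ", text).strip()


def process_row(row: dict) -> dict:
    """Map one raw row to the fields the task templates reference."""
    context = row["ctx_a"] + " " + row["ctx_b"].capitalize()
    return {
        "query": clean(row["activity_label"] + ": " + context),  # {{query}}
        "choices": [clean(ending) for ending in row["endings"]],  # doc_to_choice
        "label": int(row["label"]),                               # {{label}}
    }


raw_row = {
    "activity_label": "Washing dishes",
    "ctx_a": "A man stands at a sink.",
    "ctx_b": "he",
    "endings": ["[step] rinses a plate.", "flies away."],
    "label": "0",
}
doc = process_row(raw_row)
print(doc["query"])
```

In a real custom task you would apply a function like this across the whole dataset and reference it from your own YAML; treat the exact hook signature as something to confirm against the harness's task guide.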
Running it yourself
Install with the Hugging Face backend:
pip install "lm_eval[hf]"
Run HellaSwag on a small public model:
lm_eval run \
--model hf \
--model_args pretrained=meta-llama/Llama-3.2-1B,dtype=bfloat16 \
--tasks hellaswag \
--batch_size 8 \
--output_path ./results
You'll get a machine-readable results JSON under ./results, plus per-sample logs if you also pass --log_samples. A 1B model on HellaSwag runs in a few minutes on a single A100. The first run downloads the dataset, so give it a few extra seconds.
For vLLM (much faster on bigger models):
pip install "lm_eval[vllm]"
lm_eval run --model vllm --model_args pretrained=mistralai/Mistral-7B-v0.3 --tasks mmlu,hellaswag,arc_easy
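Once a run finishes, you will usually want the numbers in a script rather than in a terminal table. The reader below is a sketch that assumes the results file has a top-level "results" mapping keyed by task name, with metric keys shaped like "acc,none"; check one of your own files before relying on that layout. `load_metrics` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_metrics(path) -> dict:
    """Return {task: {metric: value}} from a results JSON file."""
    data = json.loads(Path(path).read_text())
    metrics = {}
    for task, values in data.get("results", {}).items():
        metrics[task] = {
            # Assumed key shape "metric,filter": keep only the metric name.
            key.split(",")[0]: value
            for key, value in values.items()
            # Skip non-numeric entries such as aliases or "N/A" placeholders.
            if isinstance(value, (int, float)) and not isinstance(value, bool)
        }
    return metrics
```

Call it with the path to the JSON file the run wrote, then diff the returned dictionary against the one from last week's checkpoint. If a task reports the same metric under several filters, this sketch keeps only one of them, so extend the key handling if you use more than one filter.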
lm-eval-harness vs the alternatives
| Harness | Best at | Not great at | Maintained by |
|---|---|---|---|
| lm-eval-harness | breadth, OSS community, YAML-defined tasks, multi-backend | UI, custom metric UX | EleutherAI |
| OpenCompass | Chinese-language coverage, leaderboard-style reporting, integrated model zoo | English-only tasks, customization | Shanghai AI Lab |
| HELM | transparency, multi-metric reporting (calibration, robustness, fairness), classic leaderboard | running your own models fast, lightweight eval | Stanford CRFM |
| lighteval | Hugging Face integration, runs on HF Spaces / Inference Endpoints, slimmer | less task coverage than lm-eval | Hugging Face |
| bigcode-eval-harness | code generation (HumanEval, MBPP, MultiPL-E, RepoBench) | non-code tasks | BigCode |
The honest summary: lm-eval-harness is the default for most teams, OpenCompass if you care about Chinese benchmarks, HELM if you want the multi-axis Stanford-style reporting, and lighteval if you're already deep in the HF ecosystem and want something that integrates with the Hub.
Common pitfalls
A few traps that bite everyone the first time:
- Data contamination. Your model may have seen the test set during pretraining. There's no clean fix, but you should at least know your model's training cutoff and pick benchmarks whose data was published after that cutoff when you can. MMLU is essentially saturated at this point.
- Prompt-format sensitivity. Changing the few-shot separator, the answer-extraction regex, or even the ordering of choices can swing results by 1–2 points. Pin the lm-eval-harness version and the task config version in your results. A "regression" that's actually a harness version bump is a real failure mode.
- Few-shot variance. The shot count comes from the task config unless you override it with --num_fewshot, and 0-shot, 5-shot, and 25-shot can give very different numbers. Report which one you used. Run a stability check (same eval, two seeds, different few-shot order) before you trust a 0.3-point delta.
- License gotchas. Some datasets in the registry have non-commercial licenses. Evaluating on them is generally fine, but if you also train on that data, the resulting model weights may inherit restrictions depending on your jurisdiction. Read the dataset card.
- The "GPT-4-as-judge" trap. Some benchmarks score free-form generations by asking GPT-4 to rate them. This is a separate evaluation chain with its own biases and costs. If you use one of these, you're not really running an LLM eval — you're running an LLM-eval-of-LLM-judgments pipeline. Treat the score accordingly.
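The stability check mentioned under few-shot variance can be as small as the sketch below. It assumes you have already run each model a few times with different seeds or few-shot orderings and collected the scores; `delta_exceeds_noise` and the multiplier `k` are hypothetical choices, not a harness feature.

```python
import statistics


def delta_exceeds_noise(baseline_runs, candidate_runs, k: float = 2.0) -> bool:
    """True if the gap between mean scores is larger than k times the spread.

    Each argument is a list of scores from repeated runs of the same eval
    (at least two runs each, so a standard deviation is defined).
    """
    gap = abs(statistics.mean(candidate_runs) - statistics.mean(baseline_runs))
    # Use the noisier of the two models as the yardstick.
    noise = max(statistics.stdev(baseline_runs), statistics.stdev(candidate_runs))
    return gap > k * noise


baseline = [65.4, 65.9, 65.1]    # made-up scores across three seeds
small_move = [65.1, 65.6, 65.0]  # roughly a 0.2-point shift
large_move = [62.0, 62.4, 61.9]  # roughly a 3.4-point shift

print(delta_exceeds_noise(baseline, small_move))  # False
print(delta_exceeds_noise(baseline, large_move))  # True
```

Three runs give a rough spread estimate at best, so read a False here as "not yet distinguishable", not as proof that nothing changed.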
When NOT to use it
lm-eval-harness is the wrong tool if:
- You're monitoring production traffic. You need Langfuse / Phoenix / Helicone / Braintrust for that. Online eval is a different problem class: implicit feedback, drift detection, hallucination rates on your data, not on HellaSwag.
- You need a domain-specific benchmark. If you're shipping a legal contract reviewer, "MMLU is 65.4" tells you almost nothing. Build a small (~200–500 example) hand-graded test set from real production samples, version it, and run it on every PR. lm-eval-harness's --include_path makes this easy.
- You're evaluating a tiny custom model on a toy task. A 50M-parameter model fine-tuned for sentiment classification doesn't need HellaSwag. Just write a Python script that calls the model 1000 times and computes accuracy. The harness overhead is real.
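For the hand-graded test set, "version it" can be as simple as stamping every results file with a hash of the examples. The helper below is one way to do that, offered as a sketch: `golden_set_version` is a hypothetical name, and the 12-character prefix is an arbitrary choice.

```python
import hashlib
import json


def golden_set_version(examples: list) -> str:
    """Return a short content hash that changes when any example changes."""
    # Canonical JSON: sorted keys and fixed separators, so the hash depends on
    # content and example order, not on dict key order or whitespace.
    canonical = json.dumps(examples, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


examples = [
    {"input": "Clause 4.2 limits liability to fees paid.", "label": "ok"},
    {"input": "Clause 9 waives all warranties.", "label": "flag"},
]
version = golden_set_version(examples)
print(version)
```

Store the returned string next to each score. Two runs are only comparable when their versions match, which catches the case where someone quietly fixed a label between checkpoints.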
TL;DR
- An LLM evaluation harness is the plumbing between a model and a standardized benchmark. It loads the model, formats prompts, runs inference, scores answers, and writes results.
- lm-eval-harness (EleutherAI) is the de facto OSS standard. v0.4.12, 200+ tasks, multiple backends.
- A task is a YAML file with fields like output_type, doc_to_text, and metric_list. You can write your own and point at it with --include_path.
- Run a small, version-pinned set of tasks that map to your use case, plus 1–2 general anchors. Don't trust deltas smaller than ~0.5 points without a stability check.
- Use it for offline eval and regression detection. For production monitoring, use an observability tool. For domain-specific eval, write your own.
Next post: how to actually build that domain-specific eval set — sampling strategy, inter-rater agreement, and the "is my golden set still golden" problem.
If you're building a model and want a second pair of eyes on your eval setup, I'm collecting feedback for the next post — drop a comment or DM the kinds of tasks you'd want covered.