Tech_Nuggets

Posted on Jun 3

What is an LLM evaluation harness? A deep dive into lm-eval-harness

#llm #ai #evaluation #opensource

You fine-tuned a 7B model. It aced your smoke tests, your colleague ran a few prompts and shrugged approvingly, and the README is now full of cherry-picked outputs that look great in a screenshot. Then someone asks: how good is it, really? — and you realize you have no number to point at. No MMLU score. No HellaSwag. Nothing reproducible, nothing you can defend in a PR review, nothing you can compare to last week's checkpoint.

That's the gap an evaluation harness fills. It turns "vibes-based evaluation" into something with a score, a stderr, and a config file you can re-run next Tuesday.

Why evaluate LLMs at all?

Two reasons that actually matter:

Comparability. If you can't put a number on a model, you can't compare it to anything else — not the previous checkpoint, not the open-source baseline, not the commercial API you're trying to replace. Leaderboards are noisy and gaming-prone, but a local leaderboard with the tasks you care about is one of the most useful artifacts a team can build.
Regression detection. Most model regressions are silent. A 0.3-point drop on MMLU won't show up in a chat session, but it will show up in CI. People who ship models for a living treat evals the way backend engineers treat unit tests: mandatory, run on every PR, and blocking on regressions.

You don't need a hundred benchmarks. You need the three to five tasks that map to your actual use case, plus one or two general capability anchors (MMLU, HellaSwag) so you can sanity-check that you didn't accidentally destroy basic reasoning while you were tuning for your domain.
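The regression-detection idea above can be sketched as a small CI gate. This is a minimal sketch, not part of lm-eval-harness: the function names, the 1.96 threshold, and the example size of 10,000 are all assumptions for illustration. It treats each accuracy as a binomial proportion and flags a drop only when it is larger than the combined sampling noise.

```python
import math


def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of an accuracy measured on n independent examples."""
    return math.sqrt(acc * (1.0 - acc) / n)


def is_regression(baseline: float, candidate: float, n: int, z: float = 1.96) -> bool:
    """Return True if candidate is below baseline by more than z combined standard errors.

    Assumption: both scores come from the same benchmark with n examples.
    The runs are treated as independent, which is conservative when the
    two models are scored on the exact same examples.
    """
    combined = math.sqrt(
        binomial_stderr(baseline, n) ** 2 + binomial_stderr(candidate, n) ** 2
    )
    return (baseline - candidate) > z * combined


# Hypothetical scores on a 10,000-example benchmark.
small_drop = is_regression(baseline=0.600, candidate=0.597, n=10_000)
large_drop = is_regression(baseline=0.600, candidate=0.570, n=10_000)
print(small_drop, large_drop)
```

In a CI job, a True result would fail the build. How strict to make the threshold is a team decision; the point is that the gate is a number and a rule rather than a judgment call.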

What is an "evaluation harness"?

An evaluation harness is the software that sits between a model and a benchmark. It handles the boring-but-critical parts: loading the model weights, tokenizing prompts in the way the benchmark expects, running inference, extracting the answer from a longer generation, scoring it against a ground-truth key, aggregating across examples, and writing out a JSON or CSV you can diff against last week's run.

The key insight is the separation between the model and the test. The benchmark is just a dataset plus a scoring rule. The harness is the plumbing. Keeping them separate is what lets you evaluate the same model on many benchmarks, or many models on the same benchmark, without reimplementing either side.

Here's what the pipeline looks like end to end:

flowchart LR
    A[Load model<br/>HF / vLLM / API] --> B[Format prompt<br/>task template]
    B --> C[Generate<br/>logprobs or text]
    C --> D[Extract answer<br/>regex / logprob argmax]
    D --> E[Score<br/>acc, F1, BLEU, …]
    E --> F[Aggregate<br/>mean, stderr, fewshot splits]
    F --> G[Write results<br/>JSON / CSV / wandb]

Every box above is configurable in lm-eval-harness. That's the whole game.
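To make the model/test separation concrete, here is a toy harness in plain Python. It is only a sketch of the pipeline stages (format, generate, extract, score, aggregate); Task, run_task, and toy_model are hypothetical names and have nothing to do with lm-eval-harness internals.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    """A benchmark: a dataset plus a prompt template and an answer-extraction rule."""
    name: str
    docs: List[Dict[str, str]]
    template: str        # prompt template with a {question} placeholder
    answer_pattern: str  # regex with one capture group for the answer


def run_task(model: Callable[[str], str], task: Task) -> float:
    """The harness: format prompt, generate, extract answer, score, aggregate."""
    correct = 0
    for doc in task.docs:
        prompt = task.template.format(question=doc["question"])
        output = model(prompt)
        match = re.search(task.answer_pattern, output)
        prediction = match.group(1) if match else None
        correct += int(prediction == doc["answer"])
    return correct / len(task.docs)


def toy_model(prompt: str) -> str:
    """Stand-in for a real model: it always gives the same answer."""
    return "The answer is 4."


arithmetic = Task(
    name="toy_arithmetic",
    docs=[
        {"question": "2 + 2", "answer": "4"},
        {"question": "3 + 3", "answer": "6"},
    ],
    template="Q: What is {question}?\nA:",
    answer_pattern=r"answer is (\d+)",
)

score = run_task(toy_model, arithmetic)
print(f"{arithmetic.name}: acc={score:.2f}")
```

Swapping toy_model for another callable, or arithmetic for another Task, changes one side without touching the other, which is the property the real harness gives you at scale.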

lm-eval-harness, in detail

EleutherAI started the project in 2020 as a unified way to reproduce published LLM benchmark numbers. It's now at v0.4.12 (May 2026), ships with 200+ tasks spanning reasoning, knowledge, coding, math, multilingual, and long-context benchmarks, and supports a long list of model backends: Hugging Face transformers, vLLM, SGLang, GPT-NeoX, Megatron-DeepSpeed, plus API endpoints for OpenAI, Anthropic, and a few others.

A few things changed in the last year that are worth knowing about:

The CLI got refactored (v0.4.x). The old flat lm_eval --tasks ... still works, but the new style uses subcommands: lm_eval run, lm_eval ls, lm_eval validate. You can now also drive a whole run from a YAML config file via --config, which is the only sane way to manage more than a handful of tasks.
The install got lighter. The base package no longer pulls in transformers or torch. You install the backend you actually need: pip install lm_eval[hf] or lm_eval[vllm] or lm_eval[api]. A 30 MB wheel instead of a 4 GB one.
Multimodal is in prototype via hf-multimodal and vllm-vlm model types, with mmmu as the first real task. If you're doing vision-language, look at lmms-eval instead; it's a fork with much broader multimodal task coverage.

Anatomy of a task

Every benchmark in the registry is a YAML file. Here's a real one — hellaswag.yaml, straight from the repo:

tag:
  - multiple_choice
task: hellaswag
dataset_path: Rowan/hellaswag
dataset_name: null
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: null
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{label}}"
doc_to_choice: "choices"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0

The fields you'll touch most:

task: the task's registered name, what you pass to --tasks.
dataset_path: a Hugging Face dataset id. Most tasks point at a public dataset; private ones need an HF_TOKEN env var.
output_type: drives the whole scoring pipeline. multiple_choice compares the log-likelihood of each choice and takes the argmax (fast, no generation). generate_until requires the model to actually produce text. There are also loglikelihood and loglikelihood_rolling, the latter for perplexity-style tasks.
doc_to_text / doc_to_target / doc_to_choice: how to build the prompt, the gold answer, and the answer options from each dataset row. Values are usually Jinja2 templates or plain column names. {{query}} is a field in the row (for HellaSwag, one added by process_docs).
metric_list: what to compute. acc is raw accuracy, acc_norm is accuracy after length normalization (matters for HellaSwag and a few others where raw summed log-probabilities put longer choices at an unfair disadvantage).
metadata.version: bumped whenever a task definition changes, so old result files don't get conflated with new ones. If you change a task, bump this.
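The difference between acc and acc_norm is easier to see with numbers. The sketch below assumes each choice has already been scored with a summed log-probability, and that normalization divides by the length of the choice string; the exact length unit the harness uses is an assumption here, and the function names and values are made up for illustration.

```python
from typing import List


def pick_raw(logprobs: List[float]) -> int:
    """Index of the choice with the highest summed log-probability (the acc rule)."""
    return max(range(len(logprobs)), key=lambda i: logprobs[i])


def pick_length_normalized(logprobs: List[float], choices: List[str]) -> int:
    """Index of the choice with the highest log-probability per character
    (a sketch of the acc_norm rule; the length unit is an assumption)."""
    return max(range(len(choices)), key=lambda i: logprobs[i] / len(choices[i]))


# Hypothetical scores: the longer choice has a lower total log-probability
# simply because it has more tokens to pay for.
choices = ["yes", "certainly it does"]
logprobs = [-4.0, -9.0]

raw_pick = pick_raw(logprobs)
norm_pick = pick_length_normalized(logprobs, choices)
print(choices[raw_pick], "|", choices[norm_pick])
```

The two rules disagree on this example: the raw rule prefers the short choice, while the per-character rule prefers the long one. That is why a task can report noticeably different acc and acc_norm values for the same model.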

You can write your own task by dropping a YAML file in a directory and pointing at it with --include_path. People do this for domain-specific benchmarks constantly.

Running it yourself

Install with the Hugging Face backend:

pip install lm_eval[hf]

Run HellaSwag on a small public model:

lm_eval run \
  --model hf \
  --model_args pretrained=meta-llama/Llama-3.2-1B,dtype=bfloat16 \
  --tasks hellaswag \
  --batch_size 8 \
  --output_path ./results

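With --output_path ./results, the harness writes a machine-readable results JSON under ./results (in recent versions, inside a per-model subdirectory with a timestamped filename). Per-sample logs are written only if you also pass --log_samples. A 1B model on HellaSwag runs in a few minutes on a single A100. The first run downloads the dataset, so give it a little extra time.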
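Once a run finishes, you will usually want the scores in a flat form you can diff or log. The sketch below assumes the results JSON has a top-level "results" object keyed by task name, with metric keys of the form "metric,filter" such as "acc,none". That layout matches what I have seen from v0.4.x but treat it as an assumption and check your own file. The sample values are invented.

```python
import json
from typing import Dict


def summarize(results: dict) -> Dict[str, Dict[str, float]]:
    """Flatten a results dict into {task: {metric: value}}, keeping numeric fields only.

    Assumption: metric keys look like "acc,none". Only the part before the
    comma is kept, so tasks with several filters would need extra handling.
    """
    summary: Dict[str, Dict[str, float]] = {}
    for task, metrics in results.get("results", {}).items():
        row: Dict[str, float] = {}
        for key, value in metrics.items():
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                row[key.split(",")[0]] = float(value)
        summary[task] = row
    return summary


# Invented sample in the assumed layout; real files contain more fields.
sample = json.loads("""
{
  "results": {
    "hellaswag": {
      "alias": "hellaswag",
      "acc,none": 0.45,
      "acc_stderr,none": 0.005,
      "acc_norm,none": 0.61,
      "acc_norm_stderr,none": 0.005
    }
  },
  "versions": {"hellaswag": 1.0}
}
""")

summary = summarize(sample)
for task, row in summary.items():
    print(task, row)
```

For a real run, replace the sample with json.load on the file the harness wrote.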

For vLLM (much faster on bigger models):

pip install lm_eval[vllm]
lm_eval run --model vllm --model_args pretrained=mistralai/Mistral-7B-v0.3 --tasks mmlu,hellaswag,arc_easy

lm-eval-harness vs the alternatives

Harness	Best at	Not great at	Maintained by
lm-eval-harness	breadth, OSS community, YAML-defined tasks, multi-backend	UI, custom metric UX	EleutherAI
OpenCompass	Chinese-language coverage, leaderboard-style reporting, integrated model zoo	English-only tasks, customization	Shanghai AI Lab
HELM	transparency, multi-metric reporting (calibration, robustness, fairness), classic leaderboard	running your own models fast, lightweight eval	Stanford CRFM
lighteval	Hugging Face integration, runs on HF Spaces / Inference Endpoints, slimmer codebase	task coverage (narrower than lm-eval)	Hugging Face
bigcode-evaluation-harness	code generation (HumanEval, MBPP, MultiPL-E, RepoBench)	non-code tasks	BigCode

The honest summary: lm-eval-harness is the default for most teams, OpenCompass if you care about Chinese benchmarks, HELM if you want the multi-axis Stanford-style reporting, and lighteval if you're already deep in the HF ecosystem and want something that integrates with the Hub.

Common pitfalls

A few traps that bite everyone the first time:

Data contamination. Your model may have seen the test set during pretraining. There's no clean fix, but you should at least know your model's training cutoff and pick benchmarks whose data was published after that cutoff when you can. MMLU, for example, has been public since 2020 and is essentially saturated at this point, so treat it as a sanity check rather than a fine-grained signal.
Prompt-format sensitivity. Changing the few-shot separator, the answer-extraction regex, or even the ordering of choices can swing results by 1–2 points. Pin the lm-eval-harness version and the task config version in your results. A "regression" that's actually a harness version bump is a real failure mode.
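Few-shot variance. The harness uses whatever num_fewshot the task config sets (often 0) unless you override it with --num_fewshot, while published numbers frequently use 5-shot or more, and 0-shot, 5-shot, and 25-shot can give very different numbers. Report which one you used. Run a stability check (same eval, two seeds, different few-shot order) before you trust a 0.3-point delta.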
License gotchas. Some datasets in the registry have non-commercial licenses. Running them for internal evaluation is usually fine, but training on them or using them in a commercial product may carry restrictions, depending on the license and your jurisdiction. Read the dataset card.
The "GPT-4-as-judge" trap. Some benchmarks score free-form generations by asking GPT-4 to rate them. This is a separate evaluation chain with its own biases and costs. If you use one of these, you're not really running an LLM eval — you're running an LLM-eval-of-LLM-judgments pipeline. Treat the score accordingly.
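The stability check mentioned under few-shot variance can be sketched as follows. The idea is to repeat each eval with a few seeds or few-shot orderings, then compare the gap between two models against the run-to-run spread. The function name, the factor of 2, and all scores below are assumptions for illustration, not a statistical standard.

```python
import statistics
from typing import List


def delta_exceeds_noise(
    baseline_runs: List[float], candidate_runs: List[float], k: float = 2.0
) -> bool:
    """Return True if the mean gap between two models is larger than k times
    the larger of the two run-to-run standard deviations.

    Each list needs at least two runs (different seeds or few-shot orders).
    """
    noise = max(statistics.stdev(baseline_runs), statistics.stdev(candidate_runs))
    delta = abs(statistics.mean(candidate_runs) - statistics.mean(baseline_runs))
    return delta > k * noise


# Hypothetical accuracies from three seeds per model.
baseline = [0.652, 0.648, 0.655]
within_noise = delta_exceeds_noise(baseline, [0.650, 0.646, 0.651])
real_change = delta_exceeds_noise(baseline, [0.620, 0.618, 0.623])
print(within_noise, real_change)
```

Two or three repeats are a crude estimate of variance, but even that is enough to stop you from chasing a delta that disappears when the few-shot order changes.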

When NOT to use it

lm-eval-harness is the wrong tool if:

You're monitoring production traffic. You need Langfuse / Phoenix / Helicone / Braintrust for that. Online eval is a different problem class: implicit feedback, drift detection, hallucination rates on your data, not on HellaSwag.
You need a domain-specific benchmark. If you're shipping a legal contract reviewer, "MMLU is 65.4" tells you almost nothing. Build a small (~200–500 example) hand-graded test set from real production samples, version it, and run it on every PR. The built-in tasks are the wrong tool here, though the harness can still run your set as a custom task via --include_path.
You're evaluating a tiny custom model on a toy task. A 50M-parameter model fine-tuned for sentiment classification doesn't need HellaSwag. Just write a Python script that calls the model 1000 times and computes accuracy. The harness overhead is real.
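For the tiny-model case, the "just write a script" option can be as small as this. It is a sketch under assumptions: the examples are dicts with "text" and "label" keys, stored one JSON object per line, and keyword_classifier is a hypothetical stand-in for your model's predict function.

```python
import json
from typing import Callable, Iterable, List


def load_jsonl(path: str) -> List[dict]:
    """Read one JSON object per line, skipping blank lines (assumed file layout)."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def accuracy(predict: Callable[[str], str], examples: Iterable[dict]) -> float:
    """Fraction of examples where the predicted label matches the gold label."""
    total = 0
    correct = 0
    for example in examples:
        total += 1
        predicted = predict(example["text"]).strip().lower()
        correct += int(predicted == example["label"].strip().lower())
    if total == 0:
        raise ValueError("no examples to evaluate")
    return correct / total


def keyword_classifier(text: str) -> str:
    """Hypothetical stand-in for a real sentiment model."""
    return "positive" if "great" in text.lower() else "negative"


examples = [
    {"text": "great product", "label": "positive"},
    {"text": "terrible support", "label": "negative"},
    {"text": "not great", "label": "negative"},
]

acc = accuracy(keyword_classifier, examples)
print(f"accuracy: {acc:.3f}")
```

Keeping the test set in a versioned JSONL file and calling load_jsonl on it gives you the same run-on-every-PR habit as the harness, with none of the setup.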

TL;DR

An LLM evaluation harness is the plumbing between a model and a standardized benchmark. It loads the model, formats prompts, runs inference, scores answers, and writes results.
lm-eval-harness (EleutherAI) is the de facto OSS standard. v0.4.12, 200+ tasks, multiple backends.
A task is a YAML file with fields like output_type, doc_to_text, and metric_list. You can write your own and point at it with --include_path.
Run a small, version-pinned set of tasks that map to your use case, plus 1–2 general anchors. Don't trust deltas smaller than ~0.5 points without a stability check.
Use it for offline eval and regression detection. For production monitoring, use an observability tool. For domain-specific eval, write your own.

Next post: how to actually build that domain-specific eval set — sampling strategy, inter-rater agreement, and the "is my golden set still golden" problem.

If you're building a model and want a second pair of eyes on your eval setup, I'm collecting feedback for the next post — drop a comment or DM the kinds of tasks you'd want covered.
