I Ran 1,000 LLM Evals Over 12 Months. Here's What Actually Moved the Needle

#ai #llm #agents #llmtools

A 95% on MMLU doesn't mean your model will write a correct pagination query. I learned this the hard way, running eval after eval until 3 AM and watching green lights that lied to me.

After a year of benchmarking LLMs in production — coding tasks, agentic pipelines, RAG pipelines — I've got opinions. Some of them are contrarian. All of them come from watching models fail in ways that no leaderboard predicted.

The Benchmark Problem

Every serious AI engineering team eventually hits the same wall: the numbers on the leaderboard don't match the numbers in production.

MMLU, HumanEval, GSM8K: these are fine as sanity checks. If your model scores below 60% on HumanEval, something is seriously wrong. But the inverse doesn't hold. A 92% on HumanEval tells you almost nothing about whether the model will handle your own application-specific business logic correctly.

The reason is obvious when you say it out loud: these benchmarks were designed to measure general capability, not task-specific performance. Your production use case is not in the benchmark's test set. The distribution your users care about is not the distribution the benchmark samples from.

# A naive eval setup that will fool you
def naive_eval(model, test_cases):
    score = sum(
        model.generate(prompt) == expected
        for prompt, expected in test_cases
    )
    return score / len(test_cases)

# This looks great. It's probably lying.

The == comparison is the first red flag. Real outputs are rarely character-for-character identical to the expected string, even when they are correct. And if your test cases are synthetic (which most are), you're measuring the model's ability to follow patterns in your synthetic data, not its ability to handle real user inputs.
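To make the exact-match problem concrete, here is a minimal sketch that assumes the model is asked to return JSON. The helper name `json_equivalent` and the sample strings are mine, purely for illustration. It compares parsed values instead of raw text, so whitespace and key order stop counting as failures.

```python
import json

def json_equivalent(output: str, expected: str) -> bool:
    """Compare two JSON strings by parsed value instead of raw text."""
    try:
        return json.loads(output) == json.loads(expected)
    except json.JSONDecodeError:
        # Unparseable output counts as a failed case, not a crash
        return False

# Hypothetical expected answer and model output: same content, different layout
expected = '{"page": 2, "limit": 50}'
output = '{ "limit": 50,\n  "page": 2 }'

print(output == expected)                 # False
print(json_equivalent(output, expected))  # True
```

This only covers structured outputs. Free-form text and code need other checks, which is where the harness in the next section comes in.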

Building an Eval Harness That Doesn't Lie

After a few months of bad signals, I built a small eval harness. It won't win any research awards, but it works.

from collections import defaultdict
from typing import Callable, Optional

class EvalHarness:
    def __init__(self, judge_fn: Optional[Callable[[str], str]] = None):
        # Functions added through @harness.register(...), grouped by name
        self.results = defaultdict(list)
        # Separate (ideally stronger) model, used only for "llm_judge" checks
        self.judge_fn = judge_fn

    def register(self, name: str):
        """Decorator to register an eval case."""
        def decorator(fn: Callable):
            self.results[name].append(fn)
            return fn
        return decorator

    def run(self, model_fn: Callable[[str], str], dataset: list[dict]) -> dict:
        """Run eval on a dataset, return per-category accuracy in percent."""
        scores = defaultdict(lambda: {"correct": 0, "total": 0})

        for item in dataset:
            category = item.get("category", "uncategorized")
            prompt = item["prompt"]
            expected = item["expected"]

            output = model_fn(prompt)

            # Use semantic check, not exact match
            is_correct = self._semantic_check(output, expected, item)

            scores[category]["total"] += 1
            if is_correct:
                scores[category]["correct"] += 1

        return {
            cat: round(s["correct"] / s["total"] * 100, 1)
            for cat, s in scores.items()
        }

    def _semantic_check(self, output: str, expected, item: dict) -> bool:
        """Plug in your own logic: LLM-as-judge, regex, unit test, etc."""
        check_type = item.get("check", "exact")

        if check_type == "llm_judge":
            # Use a stronger model to judge output quality
            if self.judge_fn is None:
                raise ValueError("llm_judge checks need a judge_fn")
            judge_prompt = (
                f"Given the input: {item['prompt']}\n"
                f"Output: {output}\n"
                f"Expected: {expected}\n"
                "Does the output satisfy the intent? Answer yes or no."
            )
            # Check how the verdict starts; a substring test would also
            # accept a "no" answer that happens to contain "yes" later on
            verdict = self.judge_fn(judge_prompt).strip().lower()
            return verdict.startswith("yes")

        elif check_type == "unit_test":
            # Execute code and check test results.
            # WARNING: exec runs model-generated code in this process;
            # isolate it before pointing this at untrusted output.
            exec_globals = {}
            try:
                exec(output, exec_globals)
            except Exception:
                return False  # code that crashes is a failed case
            return exec_globals.get("test_result") == expected

        else:  # exact
            return output.strip() == expected.strip()

The key insight here: different eval cases need different judges. Code outputs need unit tests. Reasoning outputs might need an LLM-as-judge with a carefully crafted prompt. Text classification might just need exact match.

One score, one check type, one dataset — that will mislead you every time.
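For code outputs, running the generated code in a separate process is a safer starting point than calling `exec` inside the harness. The sketch below is one way to do it, under assumptions: the helper name `passes_unit_test`, the 5-second timeout, and the assert-based test format are all my own choices for illustration. A subprocess is also not a real sandbox, so treat this as isolation from crashes and infinite loops, not from malicious code.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def passes_unit_test(generated_code: str, test_code: str,
                     timeout_s: float = 5.0) -> bool:
    """Run generated code plus assert-based tests in a separate Python process."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(generated_code + "\n\n" + test_code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False  # hangs and infinite loops count as failures
        # A failed assert or any uncaught exception gives a non-zero exit code
        return proc.returncode == 0

# Hypothetical model outputs for the prompt "write add(a, b)"
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5"

print(passes_unit_test(good, tests))  # True
print(passes_unit_test(bad, tests))   # False
```

The pass/fail signal is just the exit code, which keeps the check independent of whatever the generated code prints.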

The Production Trap That No One Talks About

Here's the thing that cost me the most grief: eval distribution drift.

You build your harness. You pick your test cases. You get great scores. You ship. Three months later, your users are asking for something slightly different, your provider has updated the model's weights, and your eval scores are still 94% but your production error rate has tripled.

This happens because eval cases rot. The real world moves; your test cases don't.

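from datetime import datetime, timedelta

# Track eval freshness
def check_eval_freshness(dataset: list[dict], drift_threshold_days: int = 14) -> bool:
    """Return True only if every case was added within the threshold."""
    cutoff = datetime.now() - timedelta(days=drift_threshold_days)

    # "added" is expected to be a naive ISO date string; cases without one
    # fall back to 2020-01-01 and therefore count as stale
    stale = [item for item in dataset
             if datetime.fromisoformat(item.get("added", "2020-01-01")) < cutoff]

    if stale:
        print(f"⚠️  {len(stale)}/{len(dataset)} eval cases are stale (>{drift_threshold_days}d old)")
        return False
    return True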

I now run this check on every eval dataset before trusting any numbers. Stale evals are worse than no evals — they give you false confidence.

What I Learned After 1,000 Runs

After roughly a thousand eval runs across five different model families, here's what held:

1. Task-specific evals predict production behavior far better than general benchmarks. When I switched from MMLU to a custom dataset of 200 real user queries, the correlation with production performance jumped from ~0.3 to ~0.7. Worth the upfront cost.

2. LLM-as-judge is genuinely useful but needs guardrails. Using GPT-4o to judge outputs from a weaker model works surprisingly well — until it starts being charitable to outputs that are technically wrong but sound confident. Calibrate with a golden set of 20-30 human-labeled examples.

3. The delta between runs matters more than absolute scores. If model A scores 85% and model B scores 87%, I don't care. If model A scores 85% today and 91% tomorrow after a change, that's signal. Track deltas, not absolutes.

4. Eval leakage is real and easy to miss. If you're sampling test cases from data the model may have seen in training, you're measuring memorization, not reasoning. Use held-out sets. Check your data pipeline before you trust anything.

5. Fast feedback loops beat comprehensive evaluation. I'd rather run 50 quick evals that tell me "something is wrong" in 10 minutes than 500 comprehensive evals that tell me the same thing in 6 hours. Build for iteration speed early.
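The golden-set calibration from point 2 can be sketched in a few lines. This is an illustration under assumptions, not the exact script I run: the function name `judge_calibration`, the `{"human": ..., "judge": ...}` record shape, and the four toy records are all made up for the example. The number to watch is the false-pass rate, since that is where a charitable judge shows up.

```python
def judge_calibration(golden: list[dict]) -> dict:
    """Compare judge verdicts with human labels on a small golden set.

    Each item is assumed to look like {"human": bool, "judge": bool},
    where True means "this output is correct".
    """
    if not golden:
        raise ValueError("golden set is empty")

    agree = sum(1 for g in golden if g["human"] == g["judge"])

    # Charitable judging: the human says wrong, the judge says right
    human_wrong = [g for g in golden if not g["human"]]
    false_passes = sum(1 for g in human_wrong if g["judge"])

    return {
        "agreement": agree / len(golden),
        "false_pass_rate": false_passes / len(human_wrong) if human_wrong else 0.0,
    }

# Toy golden set; a real one would have the 20-30 labeled examples
golden = [
    {"human": True, "judge": True},
    {"human": True, "judge": True},
    {"human": False, "judge": True},   # confident but wrong, judge passed it
    {"human": False, "judge": False},
]

print(judge_calibration(golden))
# {'agreement': 0.75, 'false_pass_rate': 0.5}
```

If the false-pass rate is high, tighten the judge prompt or switch judges before trusting any score that depends on it.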

The Takeaway

The best eval is the one that predicts what your users will experience — not the one that makes your model look good.

Generic benchmarks are a necessary starting point, but they're the floor, not the ceiling. If you're shipping LLM-powered features to real users and making decisions based only on MMLU or HumanEval, you're flying blind.

Invest in task-specific eval datasets. Build a harness that supports multiple judgment methods. Track freshness. And for the love of good engineering: run evals on the actual thing users will see, not a proxy that correlates loosely with it.

The model's 95% score on your benchmark probably doesn't mean what you think it means. Your production error rate is the only score that actually matters.