Saurav Bhattacharya

Posted on Jun 21

Goodhart's Law Comes for Your Agent Evals: Why Your Green Dashboard Stops Meaning Anything

#ai #agents #observability #evaluation

There is a specific moment in the life of every agent team that nobody puts on the roadmap. You build an eval suite. It catches real bugs. You wire it into CI as a release gate. The dashboard goes green. And then, somewhere over the next three months, the green stops meaning anything — while everyone keeps treating it like it does.

This is Goodhart's Law, and it is coming for your agent evals whether you plan for it or not.

"When a measure becomes a target, it ceases to be a good measure."

The day your eval suite becomes the thing that decides what ships, it stops being a neutral measurement of quality and becomes a target your team optimizes toward. That is not a hypothetical risk. It is the default trajectory, and most teams only notice after a "fully passing" release lands in production and quietly makes everything worse.

How a good eval suite rots

The decay is boring, which is exactly why it's dangerous. Here's the usual sequence:

1. You write evals against the bugs you already found. Reasonable. But now your suite measures yesterday's failure modes, not tomorrow's.
2. A change fails one case. Instead of asking "did we regress?", someone asks "is the eval too strict?" and tweaks the assertion until it's green.
3. Prompts get tuned to the eval set. Few-shot examples drift toward the exact phrasings your judge rewards. The agent gets better at your test cases and no better at the actual job.
4. The held-out set quietly becomes the training set. Every case you debug against is a case you've now overfit to.

The endpoint is an agent with a 98% pass rate that is measurably worse for users — because the score is now measuring how well the agent satisfies the test, not how well it does the work. The map replaced the territory.

The tell: a green gate you can't explain

The cleanest signal that Goodhart has arrived is this — a release passes the gate, and nobody on the team can explain why a specific borderline case passed. It just did. The score is a number with no narrative behind it.

That's the real problem. A pass/fail bit is not a measurement you can reason about. It's a measurement you can only trust or distrust. And trust, unaudited, always decays toward green.

This is exactly the seam where the two tools I lean on have to work as one unit, not as separate dashboards.

agent-eval scores and gates the output. It runs the deterministic checks, the model-as-judge rubrics, the drift and hallucination signals — and it returns a verdict on what the agent produced.

AgentLens captures the trace of how the agent got there: every model call and tool step, the resolved inputs (after templating, not the raw template), and the raw outputs before any post-processing.

Neither half is sufficient alone, and that's the entire point. A bare eval score is a target waiting to be gamed. A bare trace is forensic data with no verdict attached. You need agent-eval's score anchored to AgentLens's trace so that every gate decision carries a "show me why" attached to it. When a borderline case flips, you don't argue about whether the eval is too strict — you open the trace, see the resolved prompt and the exact tool output, and find out whether the agent actually reasoned correctly or got lucky on a phrasing.

That linkage is what keeps the measure honest. The eval tells you the gate flipped; the trace tells you whether the flip was earned.

What it looks like in code

The anti-pattern is a gate that returns a boolean and nothing else:

// Goodhart bait: a verdict with no evidence behind it.
async function gate(testCase: TestCase): Promise<boolean> {
  const output = await runAgent(testCase.input);
  return (await judge(output, testCase.expected)) >= 0.8; // green or red, no "why"
}

The fix is to make the score and the trace travel together, so a passing case is auditable, not just countable:

import { evaluate } from "agent-eval";
import { trace } from "agentlens";

interface GatedResult {
  passed: boolean;
  score: number;
  traceId: string;     // the receipt
  heldOut: boolean;    // true only if this case has never been debugged against
}

async function gatedRun(testCase: TestCase): Promise<GatedResult> {
  // AgentLens records every model + tool step, resolved inputs, raw outputs.
  const session = trace.start({ caseId: testCase.id });

  const output = await runAgent(testCase.input, { trace: session });

  // agent-eval scores the OUTPUT: deterministic checks + judge rubric + drift.
  const verdict = await evaluate(output, {
    expected: testCase.expected,
    checks: ["schema", "grounding", "drift"],
    judge: "rubric-v3",
  });

  await session.attach({ verdict }); // bind score <-> trace

  return {
    passed: verdict.score >= 0.8,
    score: verdict.score,
    traceId: session.id,         // open this to see WHY it passed
    heldOut: testCase.heldOut,   // overfit guard, see below
  };
}

Two things in that snippet are doing the anti-Goodhart work. The traceId means no pass is unexplainable — every green is one click from its own evidence. And heldOut is the discipline that keeps the suite from collapsing into a training set.
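
One way to put the traceId to work is to audit the narrowest passes first. The following is a minimal sketch, assuming results shaped like the GatedResult interface above. The `borderlinePasses` helper and its `margin` parameter are hypothetical names for illustration, not part of agent-eval or AgentLens.

```typescript
// Mirrors the GatedResult interface above; redeclared so the sketch is self-contained.
interface GatedResult {
  passed: boolean;
  score: number;
  traceId: string;
  heldOut: boolean;
}

// Hypothetical helper: passes whose score cleared the threshold by at most `margin`.
// These are the greens most likely to be luck, so their traces are worth opening first.
function borderlinePasses(
  results: GatedResult[],
  threshold = 0.8,
  margin = 0.05,
): GatedResult[] {
  return results
    .filter((r) => r.passed && r.score - threshold <= margin)
    .sort((a, b) => a.score - b.score); // closest to the threshold first
}

// Made-up sample data for illustration only.
const sample: GatedResult[] = [
  { passed: true, score: 0.97, traceId: "t-1", heldOut: true },
  { passed: true, score: 0.81, traceId: "t-2", heldOut: false },
  { passed: false, score: 0.62, traceId: "t-3", heldOut: true },
  { passed: true, score: 0.84, traceId: "t-4", heldOut: true },
];

const toAudit = borderlinePasses(sample).map((r) => r.traceId);
console.log(toAudit); // the trace IDs t-2 and t-4, in that order
```

The margin is a judgment call. The point is only that a reviewer gets a short, ordered list of traces to read instead of a single pass rate.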

Three rules to keep the measure honest

Tooling won't save you from Goodhart on its own. The process around it has to hold the line:

1. Quarantine a held-out set you never debug against. If you've ever opened the trace for a case to fix a failure, that case is burned for measurement: it's now a regression test, not an evaluation. Keep a rotating set you only ever score, never tune toward. When held-out and debugged scores diverge, that gap is your overfit, measured directly.
2. Treat eval edits like production changes. Loosening an assertion to get green is a code change with a blast radius. It needs a diff, a reviewer, and a one-line justification anchored to a trace: "this case was wrong because the trace shows X," not "this was flaky."
3. Mine new cases from production traces, not your imagination. The cases you invent reflect failures you can already picture. The cases in your AgentLens traces reflect what users actually trigger. Promote real, surprising traces into the held-out set continuously, so the suite keeps measuring a moving target instead of a frozen one.
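
The gap described in the first rule can be computed directly from scored results. This is a minimal sketch, assuming each scored case carries a `heldOut` flag like the one in GatedResult above. `ScoredCase`, `passRate`, and `overfitGap` are hypothetical names, not part of agent-eval.

```typescript
// Minimal shape for this sketch. `heldOut` is true only for cases
// that have never been debugged against.
interface ScoredCase {
  passed: boolean;
  heldOut: boolean;
}

// Fraction of cases that passed. Returns NaN for an empty slice so a missing
// held-out set shows up as NaN instead of a misleading 0 or 1.
function passRate(cases: ScoredCase[]): number {
  if (cases.length === 0) return NaN;
  return cases.filter((c) => c.passed).length / cases.length;
}

// Hypothetical helper: debugged pass rate minus held-out pass rate.
// A large positive gap suggests the suite has been tuned toward.
function overfitGap(cases: ScoredCase[]) {
  const heldOut = passRate(cases.filter((c) => c.heldOut));
  const debugged = passRate(cases.filter((c) => !c.heldOut));
  return { heldOut, debugged, gap: debugged - heldOut };
}

// Made-up run: four debugged cases all pass, two of four held-out cases pass.
const run: ScoredCase[] = [
  { passed: true, heldOut: false },
  { passed: true, heldOut: false },
  { passed: true, heldOut: false },
  { passed: true, heldOut: false },
  { passed: true, heldOut: true },
  { passed: true, heldOut: true },
  { passed: false, heldOut: true },
  { passed: false, heldOut: true },
];

const report = overfitGap(run);
console.log(report); // heldOut: 0.5, debugged: 1, gap: 0.5
```

Tracking the gap per release, not only the headline pass rate, makes it visible when the debugged cases keep improving while the held-out ones do not.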

The uncomfortable conclusion

A green eval dashboard is not evidence that your agent is good. It is evidence that your agent satisfies your evals — and those are only the same thing while you're actively defending the gap between them.

The teams that ship reliable agents aren't the ones with the highest pass rates. They're the ones who can pull up any green checkmark and explain, from the trace, exactly why it earned the pass. agent-eval gives you the verdict; AgentLens gives you the receipt. Keep them bound together, keep a real held-out set, and your dashboard might actually keep meaning something six months from now.

Most won't. Now you know why.

Top comments (2)

Armorer Labs • Jun 21

This is the uncomfortable eval problem in one sentence: the metric becomes the target and then stops measuring the behavior you cared about.

The dashboard needs adversarial pressure and periodic replay against real failures. I would also keep sampled receipts from production-like runs: what the agent saw, what tools it called, what evidence it used, and why the evaluator passed it.

A green number without inspectable examples is too easy to trust for too long.

Maya Andersson • Jun 23

Goodhart hitting eval suites once they become release gates is real, and the held-out set helps, but the quieter failure is the judge itself drifting as you tune prompts toward whatever it happens to reward. Before trusting a green dashboard, I would want the inter-rater reliability between the LLM judge and a human-labeled sample tracked over time, since a rising score with a falling Cohen's kappa means you are overfitting the judge, not improving the agent. Are you also stratifying the held-out set by failure mode? A single aggregate pass rate can stay flat while a specific slice quietly collapses.
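
For reference, the judge-versus-human agreement check mentioned above can be sketched in a few lines. This assumes binary pass/fail labels from both the LLM judge and a human reviewer on the same sample of cases. The function name and the sample labels are illustrative assumptions, not taken from any tool named in the post.

```typescript
// Cohen's kappa for two raters giving binary pass/fail labels on the same cases.
// kappa = (po - pe) / (1 - pe), where po is the observed agreement and pe is the
// agreement expected by chance given each rater's own pass rate.
function cohensKappa(judge: boolean[], human: boolean[]): number {
  if (judge.length !== human.length || judge.length === 0) {
    throw new Error("label arrays must be non-empty and the same length");
  }
  const n = judge.length;
  let agree = 0;
  let judgePass = 0;
  let humanPass = 0;
  for (let i = 0; i < n; i++) {
    if (judge[i] === human[i]) agree++;
    if (judge[i]) judgePass++;
    if (human[i]) humanPass++;
  }
  const po = agree / n;
  const pj = judgePass / n;
  const ph = humanPass / n;
  const pe = pj * ph + (1 - pj) * (1 - ph);
  if (pe === 1) return NaN; // both raters used one label throughout; kappa is undefined
  return (po - pe) / (1 - pe);
}

// Made-up sample: the judge passes 8 of 10 cases, the human passes 6 of 10.
const judgeLabels = [true, true, true, true, true, true, true, true, false, false];
const humanLabels = [true, true, true, true, true, true, false, false, false, false];

const kappa = cohensKappa(judgeLabels, humanLabels);
console.log(kappa.toFixed(3)); // 0.545
```

Here the two raters agree on 8 of 10 cases, but chance alone would produce 56% agreement, so kappa lands well below the raw 80%. Recomputing it on a fresh human-labeled sample each release is one way to notice the judge drifting.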