Goodhart's Law Just Got a Slash Command

#claudecode #aiengineering #goodhartslaw #agentdesign

Anthropic added the /goal command to Claude Code in v2.1.139. You set a completion condition; the agent keeps working across turns; a second model reads the transcript and decides whether the condition was met. It is the built-in version of the keep-going loops people have been hand-rolling for long agent work.

A careful field guide for it circulated this week, and the piece lands the right diagnosis. A verification-only condition produces a correct-but-useless result. The worked example built a space shooter as a 960×540 canvas with a triangle, a dot, and three starfield pixels. Every machine check passed. The recommended cure is the wrong one: write better conditions, point them at a longer PRD that defines what good looks like, keep the condition short and the spec long. Better conditions do not escape this failure. They relocate it.

The Slash Command Has a Fifty-Year-Old Name

Marilyn Strathern's formulation of Goodhart's Law is the canonical statement of what /goal automates: "When a measure becomes a target, it ceases to be a good measure." Targets get optimized with full discipline. Anything outside the target does not appear in the result, because nothing unmeasured can fail the goal. /goal takes this dynamic — previously an organizational pathology — and ships it as a CLI primitive. The condition is the target. The agent is the optimizer. The evaluator-as-judge enforces the target with mechanical rigor.

The field guide does not contain the word "Goodhart," and the omission matters. Every paragraph of it describes Goodhart's Law without naming what it is fighting.

The HUD That Wasn't Checked

The strongest evidence for the structural read is buried in the piece's own conclusion. The "fixed" three-games run used per-version visual assertions — for the 70s build, an automated check asserts the renderer uses stroke and line primitives only; the 8-bit and modern builds have their own. The public repo shows the work. And then, from the closing paragraph:

The modern version's headless playtest renderer stubs text drawing, so its headless screenshots show no HUD; it renders correctly in a browser. The visual assertion passed without ever checking for the HUD, which is the same lesson one level down. It measured the effects it was told to measure, not the HUD it was not.

That is the diagnosis from the opening of the piece recurring inside the fix from the middle. The PRD got longer. The condition got smarter. The unmeasured thing — text rendering — moved one room over and the shoebox followed it. This is what Goodhart's Law does to every system that automates a measure into a target. The fix is not a stricter spec. There is no spec that anticipates the thing you didn't think to check.

Why the Loop Cannot Save Itself

The structural reason /goal cannot escape this on its own is in Anthropic's own description of the feature. The evaluator runs no tools. It reads the transcript. The field guide flags this in two separate sections — first in How to use it ("The evaluator only read the transcript. Verify the result the way you would verify a colleague's pull request before you trust it") and then again in Gotchas ("A confident summary of broken work reads as 'fine'"). Both statements are correct. Both close the case.

A verifier that does not run the artifact has not verified the artifact. It has verified the transcript. The transcript is the artifact's lawyer, not its auditor. The same model that produced the broken thing also produced the summary of the broken thing, and a second model trained on the same loss function reading that summary is not adversarial review. It is paperwork.

The Narrow Case That Survives

/goal earns its keep where the goal and the measure are the same object. Tests pass. Build is green. The queue is empty. Every module is under a size budget. Coverage is over a threshold. In that case Goodhart does not bite, because there is nothing unmeasured to subvert — you wanted the tests green and the tests are green. This is the argument from Babysitter, Auditor, Prayer. Or Tests. two weeks ago, restated: anything with deterministic verification is the right place to lean on a loop; anything that needs judgment is not.

The moment your goal includes a judgment term — looks good, is fun, has a HUD, is well-designed, feels right — you have left the domain /goal can serve. The PRD-as-context pattern does not change this. The evaluator still does not read the PRD. The evaluator still does not run the artifact. It is doing what its documentation says: summarizing whether the transcript looks like it satisfied a condition.

The Cost Ledger

The three-game example cost about 91 minutes across the three runs, plus the upfront work writing the PRD and the goal prompt. That is one half of the productivity story. The other half is the audit. The field guide is explicit about this: "Audit 'achieved' yourself. The evaluator only read the transcript. Verify the result the way you would verify a colleague's pull request before you trust it."

If you audit every "achieved" result with the rigor of a real PR review, the loop did not eliminate the work. It moved the work to a different verb. The savings are real only when the verification is mechanical and you can skip the audit because the tests genuinely passed. Outside the mechanical case, the audit is the work, and the time spent writing a longer PRD is overhead the hand-rolled loop did not have.

What Survives Contact With Goodhart

Two patterns survive. The first is the narrow mechanical case above. Use /goal for it, write a short condition that exactly equals what you want, and trust the green build. The second is a hand-rolled loop you write yourself, where the verification step is code rather than English. A loop with code-level verification surfaces missing checks as tests that do not run or assertions that do not compile. A /goal condition that misses the HUD just announces "achieved." The visible failure is the cheaper one to fix.

Goodhart's Law has been around for fifty years. Every system that has automated a measure into a target has lived through the same failure — KPIs, OKRs, SLAs, test scores, hospital wait times, sales quotas, ad-engagement metrics, every algorithmic feed. Now the pattern is a slash command. The PRD-as-spec recipe is the same trap with extra documentation.

Use the feature where the goal and the measure coincide. Everywhere else, the audit is the loop and the human is the evaluator.

Top comments (1)

BaoDev Studio • May 20

This is the right read on /goal — and the same pattern shows up one layer down when you have agents reviewing other agents' work. Our internal quality-gate agent runs against a 40-item checklist. The backend agent satisfies the 40 items exactly. The checklist becomes the ceiling, not the floor: anything outside those 40 items is invisible to the dispatch loop.

We've mitigated it (not solved it) with two patterns:

(1) A periodic "red-team" agent dispatch with NO spec — just "find something wrong with this PR that isn't in the standard review checklist". About 1 in 5 finds something real that quality-gate missed.

(2) Bug bounties on production for issues that should have been caught. The bounty goes into the next quality-gate checklist update, not just the patch. The checklist grows, but the growth is empirical (real defects) rather than speculative (PRDs that try to enumerate every failure mode upfront).

The second-model approach you describe is still Goodhart-bound because the second model is reading the same surface as the first. The relocation is real, agree.