DEV Community

Cover image for The judge gate: why a passing validator isn't a finished feature
Youssef
Youssef

Posted on • Originally published at github.com

The judge gate: why a passing validator isn't a finished feature

There's a failure mode shared by every autonomous coding agent I've watched: the agent declares victory the moment its checks pass. Tests are green. Linter is happy. Build succeeds. Done.

Except the test was checking the wrong thing. The linter doesn't know about placeholders. The build doesn't care that the function returns null on the path that mattered. The agent rationalized those away because they didn't trigger any red.

A passing validator is necessary. It is not sufficient.

This post is about the gap, and one pattern for closing it.

Two existing patterns

Two prior tools name the durable-autonomous-goal shape. They're worth knowing.

OpenAI Codex /goal. You give Codex an objective and a stopping condition; it works toward both across many turns. The contract is "complete X without stopping until Y." When Y matches, Codex stops. (docs)

The Ralph loop, by Geoffrey Huntley. The simplest possible agentic loop:

while :; do cat PROMPT.md | claude-code; done
Enter fullscreen mode Exit fullscreen mode

Ralph leans on validators — compile, test, static analysis — as the back-pressure that keeps each iteration honest. The famous Ralph rule:

DO NOT IMPLEMENT PLACEHOLDER OR SIMPLE IMPLEMENTATIONS. WE WANT FULL IMPLEMENTATIONS. DO IT OR I WILL YELL AT YOU.

That rule is in the prompt because Ralph noticed the same thing every long-running agent notices: validators alone don't catch placeholders. The model chases compilation reward, ships stubs that compile, declares done. So Ralph yells at it.

Both tools work. Both have the same hole: a passing validator can ship a stubbed implementation.

The judge as a separate gate

The pattern I want to talk about is putting a second agent — let's call it the judge — between "validator passes" and "goal is done." The judge is a fresh-context subagent, spawned new each time, that:

  • Receives the contract's explicit Definition of Done as a checklist
  • Reads the full diff and each modified file end-to-end via the Read tool
  • Returns a structured verdict: approve or reject + an actionable fix-list
  • Never sees the executing agent's reasoning, so it can't be argued into accepting weak work

The key part is the fresh context. The executing agent has spent the last hour persuading itself that the implementation is fine. The judge has spent zero minutes on it. When the judge reads function getThreshold() { return null; // TODO: pick value }, it doesn't have the executing agent's "yeah but the tests pass" history. It just sees a TODO. Reject.

A worked example

I shipped this pattern as a Claude Code plugin called goalkeeper. To show what the judge actually catches, here's a real reject-cycle on a real project (one of mine, not a contrived demo). The whole cycle took three minutes, end to end.

The goal: add a benchmark test to a TypeScript dedup CLI. Measure cluster() runtime on a synthetic 1000-file corpus. Strict Definition of Done — seven items including "no literal TODO, FIXME, or sentinel placeholder values (e.g. 9999, -1, 0) used to mean 'I will pick this later'."

Validator: npm run typecheck && npm test -- benchmark. About 200ms when there's nothing to run; under a second when there's a test.

Round 1. The agent writes the test. Seeds a realistic synthetic corpus (700 unique audio fingerprints, 200 cross-format dupes, 100 byte-identical duplicates), runs cluster(), logs the runtime, asserts under a threshold. The threshold:

// Baseline runtime ceiling: 9999 // TODO: pick proper baseline once we have measurements
const MAX_RUNTIME_MS = 9999;
Enter fullscreen mode Exit fullscreen mode

Validator runs:

> tsc --noEmit
> vitest run --passWithNoTests benchmark

 ✓ src/__tests__/benchmark.test.ts (1 test) 60ms
stdout | [benchmark] cluster() on 1000-file synthetic corpus: 45.5ms
 Tests  1 passed (1)
EXIT: 0
Enter fullscreen mode Exit fullscreen mode

The actual runtime is 45.5ms. The sentinel 9999 is trivially satisfied. A linter checking for .skip would let this through. A typechecker would let this through. The test runner would let this through. Every validator that sensible CI runs would let this through.

The judge is what's left. Fresh-context subagent spawns, reads the contract and the file, and:

VERDICT: reject

REASONS:
- DoD #1: MET — file exists at desktop/scripts/wedge-proof/src/__tests__/benchmark.test.ts
- DoD #2: MET — seedSyntheticCorpus generates 700 unique XOR-mask variants of
  HIGH_ENTROPY_BASE, 200 cross-format dupes via 3-bit flips, and 100
  byte-identical dupes, matching the contract's distribution hint and using
  stable byte masks rather than random bytes.
- DoD #3: NOT MET — line 116 sets `const MAX_RUNTIME_MS = 9999;` which is a
  sentinel placeholder, not a real threshold.
- DoD #4: MET — line 134 logs the measured runtime; validator captured "45.5ms".
- DoD #5: MET — npm test -- benchmark exited 0 with 1/1 passing.
- DoD #6: MET — no .skip, .todo, or .only markers in the file.
- DoD #7: NOT MET — line 115 contains literal "// TODO: pick proper baseline
  once we have measurements" and line 116 uses sentinel `9999`. Both are
  explicitly forbidden by the contract's anti-placeholder rule which calls
  out 9999 by name.
- Anti-placeholder check: VIOLATED.

FIX_LIST:
1. Replace the sentinel `MAX_RUNTIME_MS = 9999` on line 116 with a real,
   defensible threshold based on the measured 45.5ms baseline — pick a
   value that gives reasonable headroom (e.g. ~10–100x the measured
   runtime) and is not a placeholder.
2. Remove the `// TODO: pick proper baseline once we have measurements`
   comment. Replace with a one-line justification explaining why the
   chosen threshold was picked, as the anti-placeholder rule permits.
3. Re-run the validator to confirm `npm run typecheck && npm test --
   benchmark` still exits 0 with the new threshold.
Enter fullscreen mode Exit fullscreen mode

The interesting parts:

  • The judge correctly assessed every DoD item independently. Five MET (including the corpus generation, the runtime logging, the validator pass). Two NOT MET (the sentinel value, the TODO comment). This isn't binary — most of the work was good. Only the two specific lines were broken.
  • The fix-list is executable, not advisory. Specific lines, specific value ranges, specific format for the replacement comment. The next iteration doesn't need a human to translate "the agent shipped a placeholder" into actions.
  • Goalkeeper preserves the rejection count as a historical signal — the goal is now at rejection_count: 1/5. After 5, it pauses and asks for human help.

Round 2. The executing agent reads the latest ## judge rejected block in the goal's append-only log. Replaces lines 115–116:

// 10x headroom over the measured ~45ms baseline on a 1000-file synthetic corpus.
// Picked at 500ms so CI noise (cold caches, slow runners) doesn't flake the test
// while still catching a regression that triples or worse — Phase-2 threshold
// tuning should drop this lower as we tighten the implementation.
const MAX_RUNTIME_MS = 500;
Enter fullscreen mode Exit fullscreen mode

500ms is 10x the measured baseline. There's a three-line justification. No TODO. No sentinel.

Validator: still passes (44.3ms vs 500ms threshold).

Judge:

VERDICT: approve

REASONS:
- DoD #1-#7: all MET
- DoD #3: MAX_RUNTIME_MS = 500 is a real, defensible threshold within the
  contract's "under 5000ms" bound, tightened to ~10x the measured ~45ms
  baseline; clear justification comment.
- DoD #7: no remaining literal TODO/FIXME/sentinel placeholder values.
- Anti-placeholder check: CLEAN — previous 9999 sentinel and TODO comment
  have been removed; replaced with a justified 500ms threshold and rationale.

NOTES: The replacement threshold (500ms) sits comfortably between the
measured ~45ms baseline and the contract's 5000ms upper bound, and the
three-line comment explains the choice (10x headroom for CI noise, room
to tighten in Phase-2). All previous-rejection fix-list items addressed.
Enter fullscreen mode Exit fullscreen mode

Three minutes wall-clock from activation to approval. The test is real, committed, in the repo. The judge wasn't fooled by a passing validator. The fix-list translated the rejection into action without a human in the loop.

What each dogfood pass added

The bigger story is that the spec itself improved every time the dogfood surfaced a gap. The CHANGELOG is the iteration log:

  • v0.1.1 — first dogfood found that the judge was trusting git diff excerpts and missing context buried in the file body. Fix: judge subagent prompt now requires reading each modified file end-to-end with the Read tool, not just the diff. Also: capture git rev-parse HEAD + dirty-paths at activation so the judge can distinguish goal-caused changes from pre-existing ones.
  • v0.1.2 — chain dogfood found three different terminal active.json shapes were being written by three different skills ({slug:null, completed_at} vs {slug:null, cleared_at} vs {slug:null, chain_completed_at}). Converged to one canonical shape. Added link_approvals[] to chain.json for chain-level visibility. Added recovery-from-interrupted-advance section.
  • v0.1.3 — rejection-cycle dogfood found the judge subagent was self-correcting mid-response (marking one DoD NOT MET then revising to MET inside the same verdict block). Added "Output once. Pre-think before writing." to the prompt template. Also added needs_human_at to the state schema.
  • v0.1.4 — schema audit (running jsonschema against every contract in the repo) found two real YAML bugs in shipped contracts: a bullet starting with a quoted phrase, and an unquoted string containing a colon inside quoted text. Fixed, plus added a scripts/validate-contracts.py so future contributors can't ship the same shape of bug.
  • v0.1.5 — full lifecycle dogfood (pause / resume / clear / resume-from-needs_human / chain-abort) ran 47 hand-authored assertions across 6 transitions, all passed. One spec gap surfaced and fixed: the chain-level log entry on abort goes to the cleared slug's log.md (no separate chain-log file).
  • v0.1.6 — comparing on-disk structure to actually-installed plugins (caveman, ralph-loop) caught a release-blocker: the plugin was using flat skills/<name>.md instead of the required skills/<name>/SKILL.md subdirectory layout, and was missing .claude-plugin/marketplace.json entirely. Without v0.1.6, /plugin install would have failed silently. The four prior dogfood passes were all verifying skill content; none had exercised Claude Code's discovery of those skills. The fifth gate caught it.
  • v0.1.7 — codified the 47 lifecycle assertions plus 7 more into scripts/test-lifecycle.py. 67/67 assertions, stdlib-only, runs in seconds. The synthetic-test scaffold for future spec changes.
  • v0.1.8 — wired .github/workflows/test.yml to run both check scripts on every push, so the regression protection from v0.1.4 + v0.1.7 doesn't drift.
  • v0.1.9 — dogfooding on a separate large engineering goal surfaced that the validator command was failing on pre-existing lint errors in files the goal never touched. Without subtraction, every checkpoint's validator run fails for reasons unrelated to the work, blocking goalkeeper from operating on any codebase with pre-existing debt. Added validator_baseline_result + validator_baseline_failing_paths to the state schema; the judge subtracts pre-existing failures from goal-caused ones.
  • v0.1.10 — branding pass for the public launch.

That's 10 versions, 5 dogfood passes, 14 commits, one session. Each pass found something the previous spec assumed away.

What this is not

This isn't an "AI agent framework." It's a thin set of skills that imposes one discipline on top of whatever agent you're already running. The contract is a markdown file. The state is a handful of JSON files. The judge is a subagent invocation. You can read the entire spec in 20 minutes.

It is also not Codex /goal reimplemented for Claude. The lifecycle (set / status / pause / resume / clear) mirrors Codex deliberately, because that's the right surface. The judge is what's added.

It's also not a replacement for code review. Real review by another human catches things no judge catches — domain knowledge, taste, business context the contract didn't think to encode. The judge gates completion, not quality. The right reading is: "this goal is in a state where it deserves human review," not "this goal is shipped."

Try it

GitHub logo itsuzef / goalkeeper

Set durable goals. Approve at the gate. Contract-driven autonomous goal execution for Claude Code. A subagent judge gates completion against an explicit Definition of Done.

# Inside Claude Code:
/plugin marketplace add itsuzef/goalkeeper
/plugin install goalkeeper@goalkeeper
Enter fullscreen mode Exit fullscreen mode

The README has the full comparison table (goalkeeper vs Codex /goal vs Ralph across 9 dimensions). The examples/ directory has three full contracts you can adapt: a Jest-to-Vitest migration, an iterative prompt-optimization run, and a four-step chain. The full reject-cycle transcript above is also at docs/demo.md with every verdict line preserved.

Star goalkeeper on GitHub and try the judge gate on your next agent loop →

A question for you

I'm curious: have you shipped a feature where the tests passed, the lint was clean, the build succeeded — and the feature was still fundamentally broken because the agent stubbed something out and you didn't catch it in review?

How are you currently gating autonomous workflows in your stack? Manual review only? Validator-only with a strong anti-placeholder prompt? Something with a judge or critic step? I'd genuinely like to hear what's working and what's failing — drop a comment with the shape of your loop and where the gaps are.


goalkeeper is MIT. Shipped 2026-05-10. The CHANGELOG is the dogfood log.

Top comments (1)

Collapse
 
itsuzef profile image
Youssef

Author here, with one of my favorite judge moments from building this:

I shipped a CONTRIBUTING.md where one section had a literal TODO: add concrete example placeholder, and a separate
section had a sentence describing the anti-placeholder rule that mentioned the string "TODO: real implementation"
as a pattern to AVOID.

A simple grep-for-TODO would have either flagged both (false positive) or neither (false negative). The judge
correctly distinguished them: violated on the literal placeholder, clean on the rule-documentation reference. That's
the kind of judgment call I couldn't have written a linter for — and exactly the kind validators silently miss.

What's a placeholder pattern your validator can't catch?