DEV Community

Ian
Ian

Posted on • Originally published at landing-ianymu.vercel.app

How 3 Claude Code Hook Strategies Compare for Preventing False-Completion

You ask Claude Code to add unit tests for the auth module. It works for two minutes and replies: "I've added comprehensive tests and verified they all pass."

You run git diff. There are three new test files. You run npm test. The output is 0 tests ran. The files exist. They contain no actual test() or it() calls — just stubs and TODO comments.

This is not a hallucination in the strict sense. The model wrote something. It just bundled an unearned closeout phrase onto the end of a half-finished task. Cemri et al. (NeurIPS 2025, "Multi-Agent System failure Taxonomy", arXiv:2503.13657) call this Mode 3.3 — No or Incorrect Verification, and in their corpus of 1600+ annotated multi-agent traces it accounts for the single largest slice inside the "task verification & termination" category (21.3% of all failures; Mode 3.3 dominates that category). The MAD dataset they published makes the pattern empirically reproducible.

You cannot prompt your way out of this. "Please verify before claiming done" is in roughly every CLAUDE.md on GitHub, and the trace data says it does not work as a steady-state mitigation. The intervention that actually moves the needle is out-of-band — a Claude Code hook that runs deterministic code at the session boundary and refuses to let the closeout through.

This post compares three production-grade approaches: verify-before-stop (log-based contract), no-vibes (text-vocabulary judge), and no-unreachable-symbol (codebase-static-analysis advisor). They are not competing — they catch different sub-failures of the same Mode 3.3.

Strategy 1: verify-before-stop — log-based contract

ianymu/claude-verify-before-stop treats verification as a discrete event that must be recorded. The hook fires on Claude Code's Stop event, reads the working-tree diff, and refuses to allow session-end if files changed but no fresh VERIFIED log entry exists.

The mechanism is a contract, not a heuristic:

#!/bin/bash
# .claude/hooks/verify-before-stop.sh (excerpt)

# Read Stop-event payload from stdin
PAYLOAD="$(cat)"
[ "$(echo "$PAYLOAD" | jq -r '.stop_hook_active // false')" = "true" ] && exit 0

# Did files change this session?
CHANGED="$(git diff --name-only HEAD 2>/dev/null)
$(git ls-files --others --exclude-standard 2>/dev/null)"
[ -z "$(echo "$CHANGED" | tr -d '[:space:]')" ] && exit 0

# Require a VERIFIED entry written within the last 5 minutes
LOG=".claude/state/stop-verify.log"
CUTOFF=$(( $(date +%s) - 300 ))
LATEST=$(grep '|VERIFIED' "$LOG" 2>/dev/null | tail -1 | cut -d'|' -f1)

if [ -z "$LATEST" ] || [ "$LATEST" -lt "$CUTOFF" ]; then
  echo "BLOCKED: files changed but no VERIFIED entry in last 5 min." >&2
  echo "Run your verification (npm test / curl / psql) then:" >&2
  echo "  echo \"\$(date +%s)|VERIFIED\" >> $LOG" >&2
  exit 2  # exit 2 = block stop, surface stderr to model
fi
exit 0
Enter fullscreen mode Exit fullscreen mode

Inside the session, Claude is expected to log evidence explicitly:

npm test
echo "$(date +%s)|VERIFY_ACTION|npm test passed" >> .claude/state/stop-verify.log
echo "$(date +%s)|VERIFIED" >> .claude/state/stop-verify.log
Enter fullscreen mode Exit fullscreen mode

What it catches well: the entire class of failures where the model writes a confident closing message without ever shelling out to a verifier. Because the contract is external (the log file is outside the model's context window), paraphrase attacks don't work — there is no clever phrasing that satisfies a missing log entry.

What it does not catch: a model that does write a VERIFIED line but whose verification was bogus (echo "VERIFIED" with no prior test command). It enforces that verification happened, not that the verification was correct.

Tradeoffs: zero language coverage problem (it doesn't read code or text — it reads filesystem state), zero false-positive rate by construction (silence on pure-conversation turns where no files changed), 60-second setup, MIT-licensed, no dependencies beyond bash + python3 stdlib.

Strategy 2: no-vibes — text-vocabulary judge

waitdeadai/no-vibes, part of the llm-dark-patterns suite, takes the opposite angle. It reads the model's outgoing text on Stop and looks for the linguistic signature of unearned closeouts — "all tests pass", "looks good", "should work", "verified" — when no proximate evidence (a fenced block with tool output, a recognized verifier binary name, a hash, an exit code) appears in the same message.

The detector is deterministic regex + locale packs + an evidence-binary allowlist with 200+ entries spanning app-dev, devops, k8s, cloud, and database tooling. Operators extend it without forking by dropping a .txt file at ${XDG_CONFIG_HOME}/llm-dark-patterns/packs/.

# Conceptual pattern (the real hook is ~530 lines with negation handling and proximity windows):

POSITIVE_CLOSEOUT_RE='\b(all tests pass(ed|ing)?|looks good|should work|verified|fixed|done|complete[d]?)\b'
EVIDENCE_BINARY_RE='\b(npm|pytest|cargo|go test|jest|playwright|curl|psql|kubectl|terraform|...)\b'

MSG="$(cat | jq -r '.message.content // empty')"

if printf '%s' "$MSG" | grep -qE "$POSITIVE_CLOSEOUT_RE"; then
  if ! printf '%s' "$MSG" | grep -qE "$EVIDENCE_BINARY_RE"; then
    echo "no-vibes: positive closeout without same-message evidence." >&2
    echo "Either run the verifier and quote its output, or hedge the claim." >&2
    exit 2
  fi
fi
exit 0
Enter fullscreen mode Exit fullscreen mode

Empirical baseline: F1 0.815 (95% CI [0.615, 0.941], n=19) on the human-labelled subset of MAD against Mode 3.3 — published at the umbrella suite's evaluation/MAST-RESULTS.md. On the full LLM-judge subset (n=954) it scores F1 0.308 — recall is high (0.486) and precision falls because the LLM judge is noisier than human annotators.

What it catches well: the "confidence theater" surface. Cases where the model has the vocabulary of a successful turn but didn't actually run anything. Useful as a pure complement to verify-before-stop because it can fire mid-message (PreToolUse / Stop) rather than only at session-end.

What it does not catch: a model that closes flatly — no positive verbs at all, just "Here are the files I added." Some failure modes route around no-vibes by producing prose that is technically not a positive closeout but still implies completion.

Tradeoffs: language coverage is per-locale pack (English, Spanish, Polish ship; others need a .txt contribution). False-positive rate is non-zero — operators report occasional fires on legitimate hedged-positive turns, hence the clause-local negation hardening in the May 2026 release. Setup: 30 seconds via the plugin marketplace or one curl.

Strategy 3: no-unreachable-symbol — codebase static-analysis advisor

The third strategy lives at a different boundary entirely. no-unreachable-symbol fires on Stop, diffs the working tree against HEAD, extracts new public Python symbols from added lines, and flags symbols with zero callers under an exclusion-aware grep that understands decorators (@app.route, @pytest.fixture, @click.command), __all__ markers, registry patterns (HANDLERS["foo"] = sym), and private prefixes.

# Conceptual flow (the real hook is ~200 lines):

DIFF="$(git diff HEAD -- '*.py')"
NEW_SYMBOLS="$(printf '%s' "$DIFF" | grep -E '^\+\s*(def|class)\s+' | sed -E 's/^\+\s*(def|class)\s+([a-zA-Z_][a-zA-Z0-9_]*).*/\2/' | sort -u)"

# Skip private (_foo) and dunder (__foo__) symbols
# Skip decorator-wired symbols (framework callbacks)
# Respect __all__ public-API markers
# ... (full exclusion logic in the real script)

for sym in $REMAINING_SYMBOLS; do
  CALLERS="$(grep -rE "\b$sym\b" --include='*.py' --exclude-dir=tests . | grep -vE "^[^:]+:\s*(def|class)\s+$sym\b" || true)"
  if [ -z "$CALLERS" ]; then
    echo "ADVISORY: new public symbol \`$sym\` has zero callers." >&2
  fi
done

# Default: advisory mode (exit 0 with stderr). Strict mode via env:
[ "${LDP_UNREACHABLE_SYMBOL_BLOCK:-0}" = "1" ] && exit 2
Enter fullscreen mode Exit fullscreen mode

What it catches well: the specific failure where Claude generates a new public function or class as part of a refactor, claims the refactor is wired up, but never edits any caller. This is invisible to no-vibes (no positive-closeout language was used) and invisible to verify-before-stop (the model dutifully logged VERIFIED after running tests — which passed because nothing called the new symbol).

What it does not catch: non-Python languages (Slice 0 is Python-only; TS/JS/Rust/Go are roadmap), and dynamic-dispatch patterns where the symbol is called but via getattr / string-based reflection that grep cannot resolve.

Tradeoffs: language coverage is limited (Python at Slice 0). False-positive rate is non-zero for codebases that rely heavily on registry patterns the hook doesn't recognize — hence the advisory-by-default ship mode, with strict-blocking opt-in via LDP_UNREACHABLE_SYMBOL_BLOCK=1. No empirical F1 baseline because MAD is text-only traces with no git-diff-vs-codebase ground truth; the hook ships against a 12-scenario smoke harness instead.

Side-by-side

Dimension verify-before-stop no-vibes no-unreachable-symbol
Signal source filesystem (VERIFIED log) model's outgoing text git diff + codebase grep
Operator effort per session active (must echo VERIFIED) passive passive (advisory) / active opt-in (strict)
Catches false closeout phrasing no (out-of-band) yes (primary target) no
Catches verified-but-bogus no partially no
Catches dead-code-on-merge no no yes
Language coverage language-agnostic per-locale pack (en/es/pl ship) Python (Slice 0)
Empirical F1 vs MAST 3.3 not published (strict contract; no false-negative-acceptable mode) 0.815 human-labelled, 0.308 LLM-judge n/a (no ground truth dataset)
False-positive rate ~0 (silence when no files changed) low but non-zero (hedge-then-positive cases) non-zero on heavy-registry codebases
Setup time 60s 30s (plugin marketplace) or 60s (curl) 30s (part of umbrella plugin)
License MIT Apache 2.0 Apache 2.0

When to use which

Use verify-before-stop if you are running a single-developer Claude Code project with frequent destructive operations (database migrations, API deploys, infra changes) where the cost of an unverified closeout is high (rollback, on-call page, data loss). The active-logging discipline is friction by design — it forces a verification step into muscle memory. Best fit for shipping individual contributors who have been bitten by Mode 3.3 in production and want the hardest contract.

Use no-vibes if you are working across many short turns where active logging is impractical (e.g. extended exploratory sessions, planning work, multi-agent supervisor closeouts). It is also the right pick for multi-language stacks because the locale packs and evidence-binary allowlist generalize, and for teams where the vocabulary of false success is the primary failure surface — overconfident closeouts from junior-coded prompts or model drift on long sessions.

Use no-unreachable-symbol if your codebase is primarily Python and your dominant failure shape is the "refactor mirage" — Claude added the new helper but never edited any caller, tests still pass because nothing exercises the new symbol, and you only discover it when reviewing the PR. Especially useful for library work where unused public symbols become part of the API surface and accrue maintenance cost forever.

Decision shortcuts:

  • "My tests are green but my prod is on fire"verify-before-stop (you have a test/reality mismatch, not a model-vocabulary problem).
  • "Claude says 'looks good' on turns where I know it didn't run anything"no-vibes.
  • "My PR reviewers keep finding orphan code Claude added"no-unreachable-symbol.

Combining all three: defense-in-depth

The hooks compose. Each closes a different gap. A session that survives all three has had filesystem-contract, text-evidence, and symbol-evidence line up — which is roughly the contract MAST 3.3 actually asks for.

{
  "hooks": {
    "Stop": [
      {
        "matcher": "*",
        "hooks": [
          { "type": "command", "command": "bash .claude/hooks/verify-before-stop.sh" },
          { "type": "command", "command": "bash .claude/hooks/no-vibes.sh" },
          { "type": "command", "command": "bash .claude/hooks/no-unreachable-symbol.sh" }
        ]
      }
    ],
    "PreToolUse": [
      {
        "matcher": "*",
        "hooks": [
          { "type": "command", "command": "bash .claude/hooks/no-vibes.sh" }
        ]
      }
    ]
  }
}
Enter fullscreen mode Exit fullscreen mode

Order matters: verify-before-stop first (cheapest check; exits early on read-only sessions), then no-vibes (text scan), then no-unreachable-symbol (most expensive — runs a full repo grep). Any hook returning exit 2 blocks the stop and surfaces its stderr to the model, which then has the corrective context on the next turn.


If you want a starting point tailored to your stack instead of a generic template, Ian's free audit tool generates three personalized hook recommendations from your repo's language mix and CI setup in about 41 seconds — no signup, no email gate.

Top comments (0)