DEV Community: Ian

How 3 Claude Code Hook Strategies Compare for Preventing False-Completion

Ian — Wed, 20 May 2026 16:30:07 +0000

You ask Claude Code to add unit tests for the auth module. It works for two minutes and replies: "I've added comprehensive tests and verified they all pass."

You run git diff. There are three new test files. You run npm test. The output is 0 tests ran. The files exist. They contain no actual test() or it() calls — just stubs and TODO comments.

This is not a hallucination in the strict sense. The model wrote something. It just bundled an unearned closeout phrase onto the end of a half-finished task. Cemri et al. (NeurIPS 2025, "Multi-Agent System failure Taxonomy", arXiv:2503.13657) call this Mode 3.3 — No or Incorrect Verification, and in their corpus of 1600+ annotated multi-agent traces it accounts for the single largest slice inside the "task verification & termination" category (21.3% of all failures; Mode 3.3 dominates that category). The MAD dataset they published makes the pattern empirically reproducible.

You cannot prompt your way out of this. "Please verify before claiming done" is in roughly every CLAUDE.md on GitHub, and the trace data says it does not work as a steady-state mitigation. The intervention that actually moves the needle is out-of-band — a Claude Code hook that runs deterministic code at the session boundary and refuses to let the closeout through.

This post compares three production-grade approaches: verify-before-stop (log-based contract), no-vibes (text-vocabulary judge), and no-unreachable-symbol (codebase-static-analysis advisor). They are not competing — they catch different sub-failures of the same Mode 3.3.

Strategy 1: `verify-before-stop` — log-based contract

ianymu/claude-verify-before-stop treats verification as a discrete event that must be recorded. The hook fires on Claude Code's Stop event, reads the working-tree diff, and refuses to allow session-end if files changed but no fresh VERIFIED log entry exists.

The mechanism is a contract, not a heuristic:

#!/bin/bash
# .claude/hooks/verify-before-stop.sh (excerpt)

# Read Stop-event payload from stdin
PAYLOAD="$(cat)"
[ "$(echo "$PAYLOAD" | jq -r '.stop_hook_active // false')" = "true" ] && exit 0

# Did files change this session?
CHANGED="$(git diff --name-only HEAD 2>/dev/null)
$(git ls-files --others --exclude-standard 2>/dev/null)"
[ -z "$(echo "$CHANGED" | tr -d '[:space:]')" ] && exit 0

# Require a VERIFIED entry written within the last 5 minutes
LOG=".claude/state/stop-verify.log"
CUTOFF=$(( $(date +%s) - 300 ))
LATEST=$(grep '|VERIFIED' "$LOG" 2>/dev/null | tail -1 | cut -d'|' -f1)

if [ -z "$LATEST" ] || [ "$LATEST" -lt "$CUTOFF" ]; then
  echo "BLOCKED: files changed but no VERIFIED entry in last 5 min." >&2
  echo "Run your verification (npm test / curl / psql) then:" >&2
  echo "  echo \"\$(date +%s)|VERIFIED\" >> $LOG" >&2
  exit 2  # exit 2 = block stop, surface stderr to model
fi
exit 0

Inside the session, Claude is expected to log evidence explicitly:

npm test
echo "$(date +%s)|VERIFY_ACTION|npm test passed" >> .claude/state/stop-verify.log
echo "$(date +%s)|VERIFIED" >> .claude/state/stop-verify.log

What it catches well: the entire class of failures where the model writes a confident closing message without ever shelling out to a verifier. Because the contract is external (the log file is outside the model's context window), paraphrase attacks don't work — there is no clever phrasing that satisfies a missing log entry.

What it does not catch: a model that does write a VERIFIED line but whose verification was bogus (echo "VERIFIED" with no prior test command). It enforces that verification happened, not that the verification was correct.

Tradeoffs: zero language coverage problem (it doesn't read code or text — it reads filesystem state), zero false-positive rate by construction (silence on pure-conversation turns where no files changed), 60-second setup, MIT-licensed, no dependencies beyond bash + python3 stdlib.

Strategy 2: `no-vibes` — text-vocabulary judge

waitdeadai/no-vibes, part of the llm-dark-patterns suite, takes the opposite angle. It reads the model's outgoing text on Stop and looks for the linguistic signature of unearned closeouts — "all tests pass", "looks good", "should work", "verified" — when no proximate evidence (a fenced block with tool output, a recognized verifier binary name, a hash, an exit code) appears in the same message.

The detector is deterministic regex + locale packs + an evidence-binary allowlist with 200+ entries spanning app-dev, devops, k8s, cloud, and database tooling. Operators extend it without forking by dropping a .txt file at ${XDG_CONFIG_HOME}/llm-dark-patterns/packs/.

# Conceptual pattern (the real hook is ~530 lines with negation handling and proximity windows):

POSITIVE_CLOSEOUT_RE='\b(all tests pass(ed|ing)?|looks good|should work|verified|fixed|done|complete[d]?)\b'
EVIDENCE_BINARY_RE='\b(npm|pytest|cargo|go test|jest|playwright|curl|psql|kubectl|terraform|...)\b'

MSG="$(cat | jq -r '.message.content // empty')"

if printf '%s' "$MSG" | grep -qE "$POSITIVE_CLOSEOUT_RE"; then
  if ! printf '%s' "$MSG" | grep -qE "$EVIDENCE_BINARY_RE"; then
    echo "no-vibes: positive closeout without same-message evidence." >&2
    echo "Either run the verifier and quote its output, or hedge the claim." >&2
    exit 2
  fi
fi
exit 0

Empirical baseline: F1 0.815 (95% CI [0.615, 0.941], n=19) on the human-labelled subset of MAD against Mode 3.3 — published at the umbrella suite's evaluation/MAST-RESULTS.md. On the full LLM-judge subset (n=954) it scores F1 0.308 — recall is high (0.486) and precision falls because the LLM judge is noisier than human annotators.

What it catches well: the "confidence theater" surface. Cases where the model has the vocabulary of a successful turn but didn't actually run anything. Useful as a pure complement to verify-before-stop because it can fire mid-message (PreToolUse / Stop) rather than only at session-end.

What it does not catch: a model that closes flatly — no positive verbs at all, just "Here are the files I added." Some failure modes route around no-vibes by producing prose that is technically not a positive closeout but still implies completion.

Tradeoffs: language coverage is per-locale pack (English, Spanish, Polish ship; others need a .txt contribution). False-positive rate is non-zero — operators report occasional fires on legitimate hedged-positive turns, hence the clause-local negation hardening in the May 2026 release. Setup: 30 seconds via the plugin marketplace or one curl.

Strategy 3: `no-unreachable-symbol` — codebase static-analysis advisor

The third strategy lives at a different boundary entirely. no-unreachable-symbol fires on Stop, diffs the working tree against HEAD, extracts new public Python symbols from added lines, and flags symbols with zero callers under an exclusion-aware grep that understands decorators (@app.route, @pytest.fixture, @click.command), __all__ markers, registry patterns (HANDLERS["foo"] = sym), and private prefixes.

# Conceptual flow (the real hook is ~200 lines):

DIFF="$(git diff HEAD -- '*.py')"
NEW_SYMBOLS="$(printf '%s' "$DIFF" | grep -E '^\+\s*(def|class)\s+' | sed -E 's/^\+\s*(def|class)\s+([a-zA-Z_][a-zA-Z0-9_]*).*/\2/' | sort -u)"

# Skip private (_foo) and dunder (__foo__) symbols
# Skip decorator-wired symbols (framework callbacks)
# Respect __all__ public-API markers
# ... (full exclusion logic in the real script)

for sym in $REMAINING_SYMBOLS; do
  CALLERS="$(grep -rE "\b$sym\b" --include='*.py' --exclude-dir=tests . | grep -vE "^[^:]+:\s*(def|class)\s+$sym\b" || true)"
  if [ -z "$CALLERS" ]; then
    echo "ADVISORY: new public symbol \`$sym\` has zero callers." >&2
  fi
done

# Default: advisory mode (exit 0 with stderr). Strict mode via env:
[ "${LDP_UNREACHABLE_SYMBOL_BLOCK:-0}" = "1" ] && exit 2

What it catches well: the specific failure where Claude generates a new public function or class as part of a refactor, claims the refactor is wired up, but never edits any caller. This is invisible to no-vibes (no positive-closeout language was used) and invisible to verify-before-stop (the model dutifully logged VERIFIED after running tests — which passed because nothing called the new symbol).

What it does not catch: non-Python languages (Slice 0 is Python-only; TS/JS/Rust/Go are roadmap), and dynamic-dispatch patterns where the symbol is called but via getattr / string-based reflection that grep cannot resolve.

Tradeoffs: language coverage is limited (Python at Slice 0). False-positive rate is non-zero for codebases that rely heavily on registry patterns the hook doesn't recognize — hence the advisory-by-default ship mode, with strict-blocking opt-in via LDP_UNREACHABLE_SYMBOL_BLOCK=1. No empirical F1 baseline because MAD is text-only traces with no git-diff-vs-codebase ground truth; the hook ships against a 12-scenario smoke harness instead.

Side-by-side

Dimension	`verify-before-stop`	`no-vibes`	`no-unreachable-symbol`
Signal source	filesystem (`VERIFIED` log)	model's outgoing text	git diff + codebase grep
Operator effort per session	active (must `echo VERIFIED`)	passive	passive (advisory) / active opt-in (strict)
Catches false closeout phrasing	no (out-of-band)	yes (primary target)	no
Catches verified-but-bogus	no	partially	no
Catches dead-code-on-merge	no	no	yes
Language coverage	language-agnostic	per-locale pack (en/es/pl ship)	Python (Slice 0)
Empirical F1 vs MAST 3.3	not published (strict contract; no false-negative-acceptable mode)	0.815 human-labelled, 0.308 LLM-judge	n/a (no ground truth dataset)
False-positive rate	~0 (silence when no files changed)	low but non-zero (hedge-then-positive cases)	non-zero on heavy-registry codebases
Setup time	60s	30s (plugin marketplace) or 60s (curl)	30s (part of umbrella plugin)
License	MIT	Apache 2.0	Apache 2.0

When to use which

Use verify-before-stop if you are running a single-developer Claude Code project with frequent destructive operations (database migrations, API deploys, infra changes) where the cost of an unverified closeout is high (rollback, on-call page, data loss). The active-logging discipline is friction by design — it forces a verification step into muscle memory. Best fit for shipping individual contributors who have been bitten by Mode 3.3 in production and want the hardest contract.

Use no-vibes if you are working across many short turns where active logging is impractical (e.g. extended exploratory sessions, planning work, multi-agent supervisor closeouts). It is also the right pick for multi-language stacks because the locale packs and evidence-binary allowlist generalize, and for teams where the vocabulary of false success is the primary failure surface — overconfident closeouts from junior-coded prompts or model drift on long sessions.

Use no-unreachable-symbol if your codebase is primarily Python and your dominant failure shape is the "refactor mirage" — Claude added the new helper but never edited any caller, tests still pass because nothing exercises the new symbol, and you only discover it when reviewing the PR. Especially useful for library work where unused public symbols become part of the API surface and accrue maintenance cost forever.

Decision shortcuts:

"My tests are green but my prod is on fire" → verify-before-stop (you have a test/reality mismatch, not a model-vocabulary problem).
"Claude says 'looks good' on turns where I know it didn't run anything" → no-vibes.
"My PR reviewers keep finding orphan code Claude added" → no-unreachable-symbol.

Combining all three: defense-in-depth

The hooks compose. Each closes a different gap. A session that survives all three has had filesystem-contract, text-evidence, and symbol-evidence line up — which is roughly the contract MAST 3.3 actually asks for.

{
  "hooks": {
    "Stop": [
      {
        "matcher": "*",
        "hooks": [
          { "type": "command", "command": "bash .claude/hooks/verify-before-stop.sh" },
          { "type": "command", "command": "bash .claude/hooks/no-vibes.sh" },
          { "type": "command", "command": "bash .claude/hooks/no-unreachable-symbol.sh" }
        ]
      }
    ],
    "PreToolUse": [
      {
        "matcher": "*",
        "hooks": [
          { "type": "command", "command": "bash .claude/hooks/no-vibes.sh" }
        ]
      }
    ]
  }
}

Order matters: verify-before-stop first (cheapest check; exits early on read-only sessions), then no-vibes (text scan), then no-unreachable-symbol (most expensive — runs a full repo grep). Any hook returning exit 2 blocks the stop and surfaces its stderr to the model, which then has the corrective context on the next turn.

If you want a starting point tailored to your stack instead of a generic template, Ian's free audit tool generates three personalized hook recommendations from your repo's language mix and CI setup in about 41 seconds — no signup, no email gate.

Stop Claude Code from lying about completion — a 50-line bash hook

Ian — Wed, 20 May 2026 11:38:04 +0000

After 12 months running Claude Code on 14 parallel projects, the same regression cycle kept happening:

Claude: "All tests passing ✅"
Me:     [merges PR]
Prod:   [throws 500s]
Me:     [2h debugging]
Tomorrow: [same cycle]

The model isn't lying on purpose — it's just optimistic about its own work. The fix isn't a better prompt. The fix is a workflow guard.

So I built verify-before-stop.sh — a Stop hook that blocks the session from ending until verification is actually logged. Here's the full implementation and why it works.

The pattern

Claude Code's Stop hook fires when the model tries to end a session. The hook receives JSON on stdin with stop_hook_active, transcript_path, session_id. If the hook exits non-zero with stderr message, Claude continues the turn — and the stderr becomes part of the model's context.

That's the leverage point. If the model claims "done" without proof, exit 2 with instructions for what proof is required.

The hook (50 lines)

#!/bin/bash
# verify-before-stop.sh

INPUT=$(cat)

# Avoid infinite loop — Claude Code sets stop_hook_active=true
# when continuing from a previous block
STOP_HOOK_ACTIVE=$(echo "$INPUT" | python3 -c \
  "import sys,json; d=json.load(sys.stdin); print('true' if d.get('stop_hook_active') else 'false')" \
  2>/dev/null)
if [ "$STOP_HOOK_ACTIVE" = "true" ]; then
    exit 0
fi

VERIFY_LOG=".claude/state/stop-verify.log"
mkdir -p .claude/state

# No file changes → pure conversation → allow stop
HAS_CHANGES=$(git diff --name-only 2>/dev/null | head -5)
HAS_UNTRACKED=$(git ls-files --others --exclude-standard 2>/dev/null | grep -v '.claude/state/' | head -5)
if [ -z "$HAS_CHANGES" ] && [ -z "$HAS_UNTRACKED" ]; then
    exit 0
fi

# Files changed → require VERIFIED log entry in last 5 min
if [ -f "$VERIFY_LOG" ]; then
    FIVE_MIN_AGO=$(date -v-5M +%s 2>/dev/null || date -d '5 minutes ago' +%s 2>/dev/null || echo 0)
    LAST_VERIFY=$(grep '|VERIFIED' "$VERIFY_LOG" 2>/dev/null | tail -1 | cut -d'|' -f1)
    LAST_ACTION=$(grep '|VERIFY_ACTION' "$VERIFY_LOG" 2>/dev/null | tail -1 | cut -d'|' -f1)
    if [ -n "$LAST_VERIFY" ] && [ "$LAST_VERIFY" -gt "$FIVE_MIN_AGO" ] 2>/dev/null; then
        if [ -n "$LAST_ACTION" ] && [ "$LAST_ACTION" -gt "$FIVE_MIN_AGO" ] 2>/dev/null; then
            echo "$(date +%s)|STOP_ALLOWED" >> "$VERIFY_LOG"
            exit 0
        fi
    fi
fi

echo "$(date +%s)|STOP_BLOCKED" >> "$VERIFY_LOG"
echo "⛔ BLOCKED: files changed but no verification logged in last 5 min." >&2
echo "Required: log a VERIFY_ACTION + VERIFIED entry, e.g.:" >&2
echo '   echo "$(date +%s)|VERIFY_ACTION|ran npm test, 3 failures" >> .claude/state/stop-verify.log' >&2
echo '   echo "$(date +%s)|VERIFIED" >> .claude/state/stop-verify.log' >&2
exit 2

Install

Drop into .claude/hooks/ and wire it up:

// .claude/settings.json
{
  "hooks": {
    "Stop": [{
      "matcher": "*",
      "hooks": [
        { "type": "command", "command": "bash .claude/hooks/verify-before-stop.sh" }
      ]
    }]
  }
}

Restart your Claude Code session. Done.

How verification works in practice

When Claude tries to end after edits, the hook blocks with stderr telling it exactly what to log. The model then has to either:

# A) Actually verify
$ npm test
# ...test output...
$ echo "$(date +%s)|VERIFY_ACTION|npm test all green" >> .claude/state/stop-verify.log
$ echo "$(date +%s)|VERIFIED" >> .claude/state/stop-verify.log

# B) Admit it didn't verify
# Claude in next turn: "I couldn't actually run the tests — they require a
# dev DB. Here's what I changed; please run npm test manually before merging."

Either way, you stop getting false "done" reports.

4 design decisions worth defending

1. Why a 5-minute TTL on VERIFIED?
Long enough that legitimate verify→stop sequences don't trip on it. Short enough that stale entries from yesterday don't accidentally allow stops.

2. Why git diff as the change signal?
Catches both edits and untracked new files. Excludes .claude/state/ (the hook's own writes) to avoid self-triggering loops.

3. Why exit 2 (not 1)?
Claude Code treats exit 2 as a Stop block with stderr passed to the model. Exit 1 is treated as a generic error and may not surface the message correctly.

4. Why explicit VERIFY_ACTION + VERIFIED two-line pattern?
The VERIFY_ACTION line forces the model to commit what it verified in writing — not just claim it. Doubles as an audit trail you can grep later.

Watch out for the cap

Claude Code v2.1.143+ added a built-in safeguard: after ~8 consecutive Stop-hook blocks, the turn ends with a warning regardless. This is intentional — prevents broken hooks from infinite-looping the model. Override with CLAUDE_CODE_STOP_HOOK_BLOCK_CAP=20 if you have legitimate reason for more retries.

Design your hook to give the model enough information to comply on the next turn, not on the 8th turn.

What this saved me

After 6 months of using this hook across 14 projects:

Eliminated "AI says tests pass, they didn't" regressions — the #1 source of "wait, I thought you said this worked" merges.
Forced explicit verification logging — .claude/state/stop-verify.log now reads like an audit trail.
Survived conversation compaction — the log file persists, so even after Claude Code summarizes the session, the verification record is still on disk.

Open source

Full source on GitHub: https://github.com/ianymu/claude-verify-before-stop

MIT license. Star it if you find it useful — easier to find when you need it again.

If you want the rest

I packaged this with 5 more hooks I built over those 12 months — cost-tracker.sh (logs every $ of Opus spend to costs.jsonl), block-secrets.sh (scans Write/Edit/Bash for sk-ant-*/JWT/AWS keys before commit), force-progress-update.sh (checkpoint every 5 actions to survive compaction), pre-compact-diary.sh (preserves WIP context), enforce-autoplan.sh (no code-write without a plan).

Full 6-pack is $49 launch price (regular $79), 30-day money-back, instant download via Polar: https://landing-ianymu.vercel.app

But honestly — verify-before-stop alone solves 80% of the pain. Start there. Use it free for a week, then decide if the rest is worth it.

Discussion

Curious if you've built similar workflow guards. The verify-before-stop pattern feels obvious in hindsight — what other "lies of completion" patterns have you caught?

Drop your own hooks in the comments. If they're solid I'll add them to the README with credit.

— Ian

I analyzed thousands of founder posts and built a platform to solve the #1 complaint

Ian — Tue, 10 Mar 2026 12:44:42 +0000

6 months ago I quit my job to build products solo. The coding was fine. Marketing I could figure out. What almost made me give up was the silence.

Not "I'm lonely" silence. More like — I just spent three weeks on a feature and I genuinely can't tell if it's brilliant or stupid. And there's nobody to ask.

The research

I started reading everything solo founders were posting online. Reddit, HN, IndieHackers, Twitter. Thousands of posts. Same three problems everywhere:

1. No accountability
You set goals on Monday. By Wednesday they're gone. Nobody notices, nobody cares. Without a co-founder or team, goals are just wishes.

2. Generic advice
"Have you tried content marketing?" Thanks. Very helpful. The advice you get from people who don't know your product is almost always useless. Not because they don't care — they just don't have context.

3. Communities built for broadcasting, not connection
You scroll, read, maybe post something. It gets 2 upvotes. You move on. There are hundreds of founder communities, but they're all optimized for content consumption, not for actual relationships.

What actually worked

The one thing that changed my productivity was finding another founder at my exact stage. We did weekly 30-minute calls. Simple format:

What did you do this week?
What's the plan for next week?
What's blocking you?

My output doubled. Just knowing someone would ask me on Friday whether I did the thing I said I'd do on Monday — that was enough.

Then he got busy, the calls stopped, and I went back to building alone.

What I'm building

So I'm making a platform called ADot to solve this at scale.

The core: AI matches you with a founder at your stage (pre-launch? first 10 users? trying to hit $1K MRR?) for structured weekly check-ins. Not networking events. Not another Slack group. Actual accountability with someone who gets it.

On top of that:

Community feed where milestones get seen (not buried)
AI recommendations based on what similar products did to grow
Open-source founder tools (pain point analyzer, competitor tracker)

Current status

Just launched the landing page. Validating whether this resonates or if I'm projecting.

If you've ever built alone and wished someone was in the trenches with you: https://adot-lp.vercel.app

Feedback >> signups. Tell me why this won't work.

I analyzed thousands of founder posts and built a platform to solve the #1 complaint

Ian — Tue, 10 Mar 2026 12:44:42 +0000

6 months ago I quit my job to build products solo. The coding was fine. Marketing I could figure out. What almost made me give up was the silence.

Not "I'm lonely" silence. More like — I just spent three weeks on a feature and I genuinely can't tell if it's brilliant or stupid. And there's nobody to ask.

The research

I started reading everything solo founders were posting online. Reddit, HN, IndieHackers, Twitter. Thousands of posts. Same three problems everywhere:

1. No accountability
You set goals on Monday. By Wednesday they're gone. Nobody notices, nobody cares. Without a co-founder or team, goals are just wishes.

What actually worked

The one thing that changed my productivity was finding another founder at my exact stage. We did weekly 30-minute calls. Simple format:

What did you do this week?
What's the plan for next week?
What's blocking you?

My output doubled. Just knowing someone would ask me on Friday whether I did the thing I said I'd do on Monday — that was enough.

Then he got busy, the calls stopped, and I went back to building alone.

What I'm building

So I'm making a platform called ADot to solve this at scale.

On top of that:

Community feed where milestones get seen (not buried)
AI recommendations based on what similar products did to grow
Open-source founder tools (pain point analyzer, competitor tracker)

Current status

Just launched the landing page. Validating whether this resonates or if I'm projecting.

If you've ever built alone and wished someone was in the trenches with you: https://adot-lp.vercel.app

Feedback >> signups. Tell me why this won't work.

DEV Community: Ian

How 3 Claude Code Hook Strategies Compare for Preventing False-Completion

Strategy 1: verify-before-stop — log-based contract

Strategy 2: no-vibes — text-vocabulary judge

Strategy 3: no-unreachable-symbol — codebase static-analysis advisor

Side-by-side

When to use which

Combining all three: defense-in-depth

Stop Claude Code from lying about completion — a 50-line bash hook

The pattern

The hook (50 lines)

Install

How verification works in practice

4 design decisions worth defending

Watch out for the cap

What this saved me

Open source

If you want the rest

Discussion

I analyzed thousands of founder posts and built a platform to solve the #1 complaint

The research

What actually worked

What I'm building

Current status

I analyzed thousands of founder posts and built a platform to solve the #1 complaint

The research

What actually worked

What I'm building

Current status

Strategy 1: `verify-before-stop` — log-based contract

Strategy 2: `no-vibes` — text-vocabulary judge

Strategy 3: `no-unreachable-symbol` — codebase static-analysis advisor