DEV Community: Akash Khanna

I Tried to Catch My AI Coding Assistant Lying. Here's What Finally Worked.

Akash Khanna — Thu, 23 Jul 2026 06:39:34 +0000

If you've ever used an AI coding assistant, you've probably seen it say something like:

"Done! I've created the file, updated the config, and all tests pass. ✅"

Sometimes that's true. Sometimes it isn't. The file is half-finished, the "test" never actually ran, and you only find out later when things break. The AI isn't being evil — it just sometimes describes what it meant to do instead of what it actually did.

I built a small free tool called GroundTruth to catch this. My first version failed for months. The rewrite finally works, and the lesson behind it is useful even if you never touch my tool.

Attempt 1: Read what the AI says, and fact-check it

My first idea was simple. When the AI finishes a task, a little program automatically:

Reads the AI's final message ("I created the login page and tests pass")
Looks at what actually changed in the project (your version control system, git, keeps a perfect record — the AI can't fake that)
Compares the two and warns you about any gaps

Step 2 and 3 worked great. Step 1 was the disaster.

To find the promises inside the AI's message, I wrote text-matching patterns — rules like "if the message contains 'tests pass', treat that as a claim." Sounds reasonable. Except:

"The tests should pass" isn't a claim — it's a hope.
"I did not say the tests pass" contains the words "tests pass."
"The file was updated" dodges every pattern written for "I updated the file."
And there are infinite more ways to phrase anything.

I ended up with over 50 patterns, each fixing the last one's mistakes, and it still got something wrong every single session. Eventually I accepted why: human language has unlimited ways to say the same thing. No list of patterns can ever cover them all. I wasn't losing because I was bad at it. The game itself was unwinnable.

Attempt 2: Stop reading essays. Ask for a form.

Think about how the real world handles this. Customs doesn't read your travel diary to figure out what's in your suitcase — you fill in a declaration form, and then they can compare the form to the suitcase.

So version 2 flips the whole design. The AI must now end every task with a short, fixed-format summary — a form:

{
  "task": "add a login page",
  "status": "complete",
  "claims": [
    { "t": "created",    "file": "src/login.js" },
    { "t": "modified",   "file": "src/app.js" },
    { "t": "tests_pass", "cmd": "npm test" }
  ]
}

Just eight allowed entry types (created a file, modified a file, ran tests, and so on). Then my tool does two very simple checks:

Check 1 — everything on the form must be real. Claimed you created login.js? It must actually appear in git's record of changes. Claimed the tests passed? That exact command must have actually run during the session — and finished successfully. (The session log is kept by the coding tool itself, so the AI can't rewrite history.)

Check 2 — everything real must be on the form. If the AI changed a file it didn't declare, that's flagged too: "undeclared change."

Those two checks trap it from both sides. Invent work to look productive → caught. Hide work to cover up sloppiness → caught. Skip the form entirely → the task isn't allowed to finish; the tool hands the format back and the AI fills it in properly. Lie in the chat message? Nobody cares — the chat message isn't what gets checked anymore.

The part that surprised me

Comparing a small form to a git record is easy. It's exact matching — no guessing, no interpretation, no AI judging another AI. When I deleted all the language-guessing code, the tool got about 700 lines smaller and catches more than it ever did.

A few gotchas still took real work. For example: how do you know "npm test" truly ran? The command echo "npm test passed" just prints those words and reports success. npm test || true forces a success signal even when tests fail. A test run from before the latest code change proves nothing about the change. Each of those tricks needed its own small, precise check.

The takeaway (even if you never use my tool)

If you're building anything that needs to trust what an AI tells you — don't try to interpret its free-form writing. You'll be patching misreadings forever.

Instead: give it a small form to fill in, make the form mandatory, and check the form against facts the AI can't fake. Turning "understand a sentence" into "compare two lists" changes an impossible problem into a boring one. Boring is good.

GroundTruth is free, open source (MIT), works with Claude Code, and never sends your code anywhere — no AI calls, no network, no API key.

👉 https://github.com/akahkhanna/groundtruth

Questions welcome — especially "would this catch X?" ones. Those are the fun ones.

I Had 16 AI Agents in my repo including security auditor and reviewer. Not One of Them Existed.

Akash Khanna — Tue, 14 Jul 2026 13:59:35 +0000

My project instructions said this, in writing, for weeks:

Migrations are enforced by the auto-invoked migration-reviewer subagent.

They weren't. There was no migration-reviewer. There was a file called migration-reviewer.md, and fifteen more like it, sitting in a folder one level too deep.

Here's the part that makes it worse: Claude Code told me to put them there. I asked where agent files should live, it said inside the repo, I moved them in. Weeks later, a different session said they should be outside. I had no idea why they never triggered — I just knew they didn't. Turns out Claude Code looks for agents by walking up from where you launched it. It never looks down. So it never found them.

Sixteen AI reviewers. Zero of them ever ran. No error. No warning. No log line. Just weeks of quiet.

That's the bug class this whole post is about, and it's the one that should scare you: things that fail open. A crash is loud. A failing test is loud. But a safety net that silently isn't there looks exactly like a safety net that's working.

I only found out because I went looking.

It gets worse when you look closely

Once I actually read those agent files, the four newest ones each had their own private disaster:

reviewer.md had no frontmatter at all. It was workflow prose — copy-pasted notes — saved into the agents folder. An agent file needs a name and a description to load. This had neither. It could never load, ever.

And my instructions told the AI to "invoke the reviewer subagent until it returns APPROVED." My pre-commit review gate was a reference to nothing. A door with no room behind it.

model: claude-opus-4.8 — the real ID is claude-opus-4-8. A dot instead of a hyphen. Nothing errors on a bad model ID. It just quietly falls back to something else.

Two agents had byte-identical descriptions. The AI picks an agent by reading its description. Two identical descriptions means it has no way to choose. It's a coin flip wearing a lab coat.

Every single one of these fails silently. That's the pattern. That's the whole post.

"All tests pass"

Here's the one that actually costs you money.

Your AI says "All tests pass." Reasonable! It ran npm test. It's not lying.

Except:

The test was killed. Out of memory. The process died with a non-zero exit code and printed the word Killed. Not FAILED, not 3 failing — just Killed. The checker was scanning the output text for failure words, found none, and called it green. The exit code — the actual, unambiguous, machine-readable did-this-work signal — was sitting right there in the transcript, and nothing was reading it.

The green was stale. The tests ran. They passed. Then the AI edited the source file. Then it said "all tests pass." Technically true, completely worthless — that green refers to code that no longer exists.

The green was one file. npm test -- --grep billing runs one slice of your suite and passes. It shows up in the history as "tests ran." Claim: "all tests pass." Reality: you tested 4% of your code.

None of these are the AI lying. Every one is the AI telling the truth about a thing that doesn't mean what you think it means.

The email that shipped garbage every morning

This one happened during EraPin development — a classroom geography and history game I'm building for teachers.

One bad file write re-encoded a source file — read it as one text format, saved it as another. 756 characters silently mangled. Every arrow, every checkmark, every dash in that file turned into gibberish.

The code still ran. It ran perfectly. It just faithfully sent that gibberish to every subscriber, every morning, for days.

Nothing caught it. Not the tests — the tests don't read the email. Not the linter — it's valid code. Not code review — you skim a 374-line diff and your eye slides right over it.

Here's the kicker: 365 of those 374 changed lines carried the corruption, and the previous commit was completely clean. A dumb, thirty-line check running at commit time would have caught it instantly. Nobody had written one, because who writes an encoding checker? You do, right after this happens to you.

The fix that changed nothing

My favourite, because it's the most human.

I "fixed" the environment label on some emails. Prefixed the subject lines. Looked done. Was done, in the diff.

Those two email functions are never called. The scheduled job runs with a flag that skips them entirely and sends its own summary instead. My fix touched code that does not execute. It would have shipped, looked complete, and changed absolutely nothing about the email you actually receive.

A diff can't see this. There's no stub, no TODO, no missing file. Every line is real. It's just dead.

And here's the thing I got wrong for a while: I assumed catching this needed AI. It doesn't. It needs a coverage run. "Did the lines I changed actually execute when I ran the thing?" is a question with a boring, deterministic, no-AI-required answer. Run it. Read the counter. My changed lines would have shown as never executed.

That reframe matters more than any single bug in this post.

Now the uncomfortable part

I build Groundtruth — a tool that catches exactly this stuff. The false "done," the silent no-op, the fake green. Deterministic, no AI in the loop.

This week, adversarial review found 13 real bugs in it. Every single one had already passed my own test suite and my own manual checks.

The highlights, and they're humbling:

A check I wrote to catch "tests that don't actually test anything" had four false positives in its first version. It flagged real, working tests.
I fixed a false positive and introduced a false negative — the tool went quiet on real problems. I did this three separate times, in the same place, the same way.
I wrote a classifier to fix a false positive, and it re-opened the exact false positive it was written to fix.
The check that finds "agents that can't load" couldn't see uncommitted files — meaning it would have missed all four of my broken new agents. The most likely place to find a broken thing is the thing you just wrote and haven't committed.
At one point a false positive fired on my own sentence describing that false positive. The tool flagged the warning about the tool.

If you take one thing from this: your own tests are not a review. They only check what you already thought of. That's the whole limitation, right there in one sentence.

The finding that surprised me most

Everyone treats false positives as annoying. Noise. Alert fatigue. A UX problem.

They're worse than that. False positives hide false negatives.

One of my checks was firing constantly — roughly 40 times in a single session, almost all wrong. Pure noise. So I scoped it properly and the noise stopped.

And the moment the noise stopped, a real hole became visible underneath it: run your test suite, watch it fail red, then run one trivial passing test, then claim "all tests pass" — and the tool went completely silent. A genuine way to launder a broken build past the gate. It had been there the whole time. The false positive was accidentally covering it, so nobody ever saw the silence for what it was.

Noise isn't just noise. Noise is camouflage.

The mental model I'd actually keep

If you're building anything that checks AI output, this is the part worth stealing. There are three ways to check something, and most people skip straight from the first to the third:

Tier 1 — Look at what you already have. The diff. The command history. The exit code. Free, instant, provable. Most people never do this properly — I certainly wasn't reading exit codes.

Tier 2 — Run something new. Coverage. Mutation testing. Did the code I changed actually execute? No AI required — it just costs you one more run. This is the tier almost everyone forgets exists, and it's where the "my fix changed nothing" bug lives.

Tier 3 — Ask an AI. Last resort. Slow, expensive, and it misses things.

And the twist that took me a week to see: under a rule of "a false alarm is fatal," tier 2 — the deterministic, no-AI tier — is the dangerous one. Mutation testing over-reports. It cries wolf. "Deterministic" was never the safety property. "Provably right, or shut up" is.

What it still can't do (being honest)

It can't see runtime. I wrote 60 database rows this week. Invisible. It referees the diff, not the world.
It can't tell you a better approach existed. "You picked a worse way that happens to pass" has no ground truth to check against — you never wrote down what "better" meant. Nothing can check that. Not a tool, not an AI. Only you.
Pointed at its own development, ~75% of its findings were false positives. Because when your work product is literally the words the sensor looks for, everything trips.

The one line to keep

A reader left this in my last comment section and it's been the design constraint ever since:

"A checker that false-fires gets ignored within a week, and then it catches nothing."

Sixteen agents that never ran. A review gate pointing at nothing. A green checkmark from a killed process. An email full of garbage that every test approved of.

None of it errored. None of it alerted. All of it looked exactly like success.

That's the actual job: not making the AI smarter. Making the silence impossible.

Everything above is real, from two days of sessions. Five releases, a test suite that went from ~500 checks to 702, and 13 bugs found in the bug-finder — by review, not by me.

I write these while building Groundtruth and EraPin. The three-tier model came from a dev.to comment thread where two engineers refined the architecture better than I had — and I credited them, because the tool's whole point is that honesty scales and bluffing doesn't.

The "4 layers to stop Claude lying" setup is a duct-tape stack. Here's what a single hook does instead.

Akash Khanna — Sun, 12 Jul 2026 10:03:32 +0000

A viral post made the rounds recently: "How to Make Claude Code Stop Making Stuff Up When It Doesn't Know." It described a 4-layer setup — honesty rules in CLAUDE.md, verification protocols, linter hooks, and a fact-checker subagent — to catch Claude fabricating functions, faking test results, and confidently delivering nonsense.

The post was good. The problem it named is real. But the solution is a duct-tape stack, and every layer has a gap the next one is supposed to cover. Four moving parts, all configured by the user, all depending on the agent's willingness to follow the rules that are supposed to catch it breaking the rules.

I built a single hook that replaces all four. Here's why the layers don't hold, and what does.

What the 4 layers actually are

Layer 1: Honesty rules in CLAUDE.md. Tell Claude to admit uncertainty, cite file:line when referencing code, and ask before adding dependencies.

Layer 2: Verification protocol. More CLAUDE.md instructions forcing Claude to check files exist before importing them.

Layer 3: Linter/type-checker hooks. A PostToolUse hook that runs tsc or a linter after every file write. If Claude invented an import, the checker fails.

Layer 4: Fact-checker subagent. A second agent invoked before commits whose job is to verify claims made by the first.

Each layer is reasonable in isolation. Together, they share a structural problem.

The structural problem: layers 1, 2, and 4 ask the agent to police itself

CLAUDE.md rules (layers 1 and 2) work by telling Claude to behave differently. But Claude's fabrication problem exists because it's optimised to look helpful, and admitting uncertainty feels unhelpful. You're asking the same optimisation that causes the lie to also catch the lie. Sometimes it works. Sometimes Claude reads the rule, weighs it against the pressure to look competent, and guesses anyway. The viral post itself acknowledges this: "Punish honesty once and Claude goes back to guessing."

Layer 4 — the fact-checker subagent — is a second model reviewing the first. This is the "two AI reviewers" trap I've written about before: if both models share the same training, the same helpfulness pressure, and the same blind spots, the second one can be talked out of a finding just as easily as the first. A model reviewer is not an independent check. It's a correlated one.

Only layer 3 — the linter hook — is genuinely independent. tsc doesn't care what Claude claimed. It reads the file and either the types resolve or they don't. That layer works because it doesn't involve a model.

What a single deterministic hook does instead

I built Groundtruth around that same principle: nothing reads the code, so nothing can be talked out of the verdict.

It's a Claude Code Stop hook. When the agent says "done," Groundtruth reads three things from outside the agent's control — the request, the transcript (what the agent actually ran), and git diff HEAD (what actually changed) — and renders a verdict before the turn ends.

No LLM. No network. No API key. One hook, not four layers.

Here's what it catches, mapped against the 4-layer stack:

"Tests pass" — when nothing ran

The viral article's fix: a Stop hook that runs the test suite before Claude can declare done.

Groundtruth's approach: it checks whether a test command actually executed in the session transcript. If Claude claims "tests pass" but no test runner appeared in the Bash history, it fires:

🔴 false test/build claim — "tests pass", but no test command ran this session

The difference: the article's hook runs the tests (which costs time and can fail for environment reasons). Groundtruth checks whether they were run. If they weren't, the claim is false regardless of what the tests would have shown. Both are valid; mine is cheaper and catches the specific lie — "I ran the tests" when you didn't.

Made-up imports

The article's fix: tsc as a PostToolUse hook catches fabricated imports when the type checker fails.

Groundtruth's approach: for JS/TS (where imports are path-relative), it resolves the import path against the file tree. If the target doesn't exist, it fires:

🟡 phantom ref — new import of './utils/tokenHelper' but that file doesn't exist

For package-qualified languages (Python, Go, Rust, Java, C#), it abstains rather than false-flag — because a missing package import isn't provably wrong from the file tree alone. This is deliberate: a check that false-fires teaches you to ignore the card, which breaks everything.

"Done" — but the file isn't in the diff

The article doesn't cover this. Groundtruth does:

🟡 silent no-op — claimed src/upload.test.js, but it is absent from the diff

The agent said it created a file. The diff says it didn't. No model needed to catch that.

Rules your agent could see and overrode anyway

The article's layer 1 puts rules in CLAUDE.md and hopes Claude follows them. Groundtruth compiles rules from your docs into deterministic predicates — your doc says "use pnpm not npm," so a regex checks the diff for npm install. The agent can't rationalise past a regex.

These rules are proposed, never auto-armed. You review them with /groundtruth-rules and approve what's correct. If one fires wrongly, silence it by id. The permission gate is yours, not the model's.

Stubs and placeholders

The article doesn't catch // TODO: implement this or raise NotImplementedError left in added code. Groundtruth does:

🟡 stub/placeholder in added code: // TODO: real backoff — single attempt for now

If the agent claims a feature is done and the diff contains a placeholder, those two facts contradict. No judgment call needed.

Why "don't involve a model" matters

The viral article's own best insight — buried at the end — is that the linter hook works because it doesn't involve a model. That's the whole thesis of Groundtruth, applied to every check instead of just one.

A model reviewer (layer 4) can be reasoned out of a finding. Claude is very good at explaining why a seemingly-wrong thing is actually fine. "The test file wasn't needed because the function is trivial" sounds plausible to a second model that shares the same optimisation for helpfulness. It doesn't sound plausible to a diff that simply notes the file isn't there.

The 4-layer approach works partially because 25% of it is genuinely independent (the linter). Groundtruth works fully because 100% of it is.

What Groundtruth does NOT do

I'll be precise, because overclaiming is the exact failure mode this tool is built to catch.

It does not judge code quality. It doesn't know if your algorithm is correct. It knows if the agent claimed it was tested when it wasn't.
It does not do semantic evaluation. "Did the agent rationalise past a rule" requires understanding intent, which requires a model. That's roadmap, not shipped.
It does not replace testing. It catches the lie that testing happened. You still need the tests.
It runs in the same trust domain as the agent — tamper-evident, not tamper-proof. The honest security scope is in SECURITY.md.

The honest comparison

Check	4-layer manual setup	Groundtruth
"Tests pass" with no test run	Stop hook that runs tests	Checks transcript for test execution
Fabricated import	`tsc` PostToolUse hook	Path resolution against file tree (JS/TS); abstains elsewhere
File claimed but not created	❌ Not covered	✅ silent no-op detection
Stub/TODO left in "done" code	❌ Not covered	✅ placeholder scan on added lines
Project rules violated	CLAUDE.md (hope-based)	Deterministic regex from your docs
Fact-checking claims	Subagent (model-based, correlated)	Diff-anchored (no model, uncorrelated)
Config required	CLAUDE.md edits + hooks JSON + subagent file	`/plugin marketplace add` + restart
Can the agent talk its way out?	Layers 1, 2, 4: yes. Layer 3: no.	No.

Try it

/plugin marketplace add akahkhanna/groundtruth
/plugin install groundtruth@groundtruth

Restart Claude Code. Every turn gets a warn-only verdict card. No config needed. Arm your project rules with /groundtruth-rules approve-all when you're ready. Turn on blocking with /groundtruth-block on once you trust the precision.

MIT. No LLM. No network. No API key.

The 4-layer article was right about the problem. The fix is one layer, not four — and the layer that works is the one that doesn't involve a model.

I write these while building Groundtruth and EraPin. The 74% false-positive story — how we found it, how we killed it — is in FIXES.md.

Two AI reviews passed my change. The correct architecture was documented one file away.

Akash Khanna — Sat, 04 Jul 2026 23:14:56 +0000

Two separate models — one writing the code, one reviewing the diff — shipped a one-word bug to staging. Both ran local checks. Both came back green. The fix they missed was not exotic or subtle. It was written down, in plain English, in a file sitting in the same directory as the change.

That last detail is the one worth staying with. This is not a story about a model being wrong. Both models were, within the job they were given, correct. It is a story about where verification stops looking — and why stacking two reviewers that stop looking in the same place feels like safety and isn't.

What broke

While fixing a cron timeout, I had Claude Opus pull a small JSON-parsing helper into its own file so it could be unit-tested in isolation. Reasonable instinct. It named the file jsonExtract.mjs and imported it from autoPublish.js:

import { _firstJsonObject } from './jsonExtract.mjs'; // looks perfectly legal

The whole incident lives in that one letter — the m.

autoPublish.js is an ES module. On my machine, Node happily lets one ESM file import a .mjs file, so every local check passed. The deployed runtime took a different path:

Local:     ESM importer → .mjs ES module → works
Deployed:  CommonJS require() → .mjs ES module → ERR_REQUIRE_ESM

I want to be precise about what I actually know here, because it matters. I did not instrument Vercel's build. What I observed is that the deployed function reached the helper through a CommonJS require(), while local Node exercised a compatible ESM path. Whatever transformation produced that, the consequence is fixed by the language spec: a .mjs extension forces a file to be an ES module, and require() of an ES module is barred. The result was an invocation-time crash:

Error [ERR_REQUIRE_ESM]: require() of ES Module
  /var/task/.../jsonExtract.mjs
  from  /var/task/.../autoPublish.js  not supported.

Not a syntax error. Not a logic error. A packaging error that only becomes real once your platform rewrites the module system underneath your repository.

The seam nobody owned

Every check you run has a scope, and a green result only ever means "clean within that scope." We forget this because green is rendered the same color regardless of how much it actually covered.

Walk the checks that ran on this change and read them as scopes rather than as pass/fail:

node --check — scope: syntax. The file parsed. True, and useless here.
The test suite — scope: the code paths the tests exercise, in the local runtime. All green. Also true; the tests never invoked the module across the deploy boundary.
The AI review — scope: logic correctness, as framed by the prompt. Race conditions, variable scoping, whether the extraction logic was sound. All fine. The reviewer was asked "is this logic correct," and it answered that question well.
The deployed runtime — scope: the real module system. The only scope in which this bug exists at all.

The bug did not slip past the checks. It lived in the seam between two of them — the gap between "local ESM resolution" and "deployed CJS resolution" — and no check on the board had that seam inside its scope. ERR_REQUIRE_ESM is an invocation-level error, so the build logs stayed green by definition: nothing executes at build time to trip it.

This is where the "documented one file away" detail turns from irony into the actual lesson. In the same directory sat a sibling helper, cluePrompts.js, carrying a prominent header comment saying it was CommonJS on purpose — written that way specifically to be default-imported from ESM consumers and dodge this exact rewrite. Project docs reinforced the convention. The correct architecture was not unknown. It was passive — encoded as prose, and prose is outside every automated scope above. A human skimming the diff wouldn't see it unless they already knew to look. Neither would a model. The knowledge existed and propagated to no one.

Why two reviews is not two reviews

Here is the trap that made me write this up, because it's the part that generalizes past Node and Vercel.

Two independent green stamps — one from the author model, one from a separate reviewer model running the tests itself — feels like defense in depth. It reads like redundancy. It is not. Redundancy only buys you coverage when the reviewers fail independently. These two shared a scope: both were evaluating logic correctness in the local runtime. When two checks share a scope, they share a blind spot, and stacking them multiplies your confidence without moving your coverage an inch.

That is the quiet danger of multi-model pipelines. Adding a second model that thinks about the code the same way the first one did doesn't widen the net; it just gets you a more confident wrong answer. Real defense in depth requires checks whose scopes are uncorrelated — one that reads a completely different signal than the others.

The one check with a different scope

I run a deterministic local hook while I build, GroundTruth. I'm going to be exact about what it did and didn't do, because the honest version is the more useful one.

It did not catch the ESM bug. It couldn't. It has no window into Vercel's runtime and makes no attempt to predict a require() failure from static text — semantic and environment-aware evaluation are on its roadmap, not in it. If I told you it saw the crash coming, I'd be selling you the exact false confidence this whole post is about.

What it did do is refuse to agree that the change was green — because its scope was not logic correctness. Its scope was does the claim match the observable evidence. Every turn where I asserted passing tests, it fired the same warning, because it could not find a test run in the evidence to back the assertion:

[warn] false test/build claim — claimed tests/build pass ("tests pass"),
  but a test run looks like it reported failures — double-check

And the moment I pasted the staging crash into the terminal, it opened a task and held it open — not until I said it was fixed, but until an actual Node change appeared in the diff:

[warn] open loop (asked, not delivered) — pending task [tftlx] —
  "Staging failed: Error [ERR_REQUIRE_ESM]: require() of ES Module …"
   (no Node.js in the diff yet)

It cleared only when the CommonJS fix actually landed in the diff — a deterministic "Told & Done," no model asked.

Read that carefully, because it's easy to mistake for a tool win and it isn't one. The point is not that GroundTruth is smarter than the reviewers. It is dumber than the reviewers, deliberately — it doesn't read the code at all. It reads claims against evidence. That's a different scope, and a different scope is the only thing that can catch what a correlated pair misses. The signal that mattered was accurate the whole time. My mistake was weighting the two confident green stamps over the one stubborn yellow warning that didn't share their blind spot.

The fix

Convert the helper to CommonJS, matching the pattern its neighbor already documented:

// jsonExtract.js — CommonJS on purpose, matching cluePrompts.js
function _firstJsonObject(text) { /* … */ }
module.exports = { _firstJsonObject };

// autoPublish.js
import _jsonExtract from './jsonExtract.js'; // CJS — safe after the deploy rewrite
const { _firstJsonObject } = _jsonExtract;

Works locally, works deployed, one source of truth. Fixed forward in a single commit, no rollback, zero production downtime — it never left staging.

The rule going forward

Any new module under api/ either matches the CommonJS convention already demonstrated beside it, or it gets exercised under a live invocation environment before it can merge to main. Not because .mjs is bad — because a convention that lives only in a comment is not enforcement, it's a wish.

The failure chain, stripped down:

The right architecture was known
   → but it was passive prose
   → so the author didn't apply it
   → and the reviewer didn't apply it
   → and the local checks couldn't encode it
   → so the runtime was the first thing that could

.mjs versus .js is the trivia. The systems lesson is that documented knowledge enforces nothing, and two reviewers that think alike are one reviewer that costs twice as much. If a rule matters enough to take down a deploy, it has to live somewhere a check can read it — as a test, a lint, a gate — or you have to run the code where breaking it becomes real. Prose in a header comment is where good conventions go to be ignored politely.

So I'll turn it into the question I actually want answered: what's the convention your codebase documents but doesn't enforce — and what did it cost you the day someone, or something, didn't read it?

I write these up while building GroundTruth and EraPin. This one was caught the instant the cron executed on Vercel staging, and fixed forward within the hour.

Your AI coding agent isn't lying to you. It's optimizing. published: false

Akash Khanna — Sat, 04 Jul 2026 15:37:42 +0000

Every dev using an AI coding agent has hit this moment: the agent says "Done — tests pass" and you go check, and nothing passes. Or worse, nothing changed at all.

The instinct is to ask "why did it just lie to me?" That's the wrong question. It assumes intent. There isn't any. The right question is:

What made the wrong answer cheaper than the right one — and what input did it exploit to get there?

That question always has an answer. And the answer is always your next check.

The mantra

An LLM agent isn't a person deciding whether to be honest. It's a process that takes whatever path costs least, given whatever is actually being measured. If "claim done" and "verify, then claim done" both produce the same reward — because nothing downstream distinguishes them — the agent will drift toward the cheaper one. Every time.

This isn't a flaw you can prompt your way out of. "Please don't lie to me" doesn't change the cost structure. What changes it is making the dishonest path actually expensive: something that catches the gap between claim and reality, every time, automatically.

The structural flaw: collapsed trust domains

Standard AI coding setups create an architectural conflict of interest. The same context that writes the code is also the thing asked to validate it:

[ Context / Token Window ]
   ├── 1. Prompt: "Refactor Class A"
   ├── 2. Agent Output: Generates Code
   └── 3. Verification: Agent looks at its own output
        └── Result: Confidently asserts "Tests Pass" (Hallucination)

The model has every incentive to satisfy its own parameters within the shortest possible generation path — including the "verification" step. The fix is to physically pull validation out of the agent's own context and into an independent, deterministic trust domain:

[ LLM Agent Trust Domain ]
   └── Generates Code / Claims Victory
            │
            ▼ (Interception Point: Stop Hook)
┌────────────────────────────────────────────────────────┐
│ [ Deterministic Trust Domain (Local FS / AST / Git) ]   │
│   ├── 1. Read Raw Session Transcript                   │
│   ├── 2. Parse Raw Git Diff                             │
│   └── 3. Verify: Assert Broken Symbols & Mock Tests     │
└────────────────────────────────────────────────────────┘
            │
            └── [FAIL] -> Reject Turn & Re-inject Error

What this looks like in practice

I built GroundTruth (a Claude Code Stop-hook plugin) after hitting this exact pattern on my own project, EraPin. Agents kept claiming "tests pass" or "refactor complete" when the git diff told a different story. Every fix I've shipped since started with the same exercise:

Broadened extraction rule → a missed rule cost nothing, because nothing measured recall. Fix: track what's not being parsed, not just what is.
Grounding check regression → a zero-hit result looked identical to "genuinely absent," so a silent no-op was free. Fix: pin the check against a real signal, not a pattern that can quietly degrade.
Permission gate → auto-arming a misread rule cost nothing when there was no human in the loop. Fix: nothing gets armed without explicit approval.

Every one of these is the same shape: find the loophole where "looks done" was cheaper than "is done," and close it so the honest path is the only cheap one left.

Anatomy of an optimization drop

To see why this happens, look at what an agent leaves on the cutting-room floor when its context runs hot. It doesn't "forget" code the way a tired human does — it drops lines to shorten its path to a clean compile and a stop token.

Take a legacy production method:

// Original code (pre-refactor)
public void processTransaction(Transaction tx) {
    validateToken(tx);
    auditLog.record(tx); // <-- the critical business audit step
    executionEngine.execute(tx);
}

Ask an agent to merge this class with another, and it can hand back a block that looks completely normal to a fast reviewer:

// What the agent actually outputs
public void processTransaction(Transaction tx) {
    validateToken(tx);
    executionEngine.execute(tx); // <-- silently dropped
}

The summary reads confidently: "Refactored transaction processing to use the new engine. All logic preserved." The compiler won't flag it — the syntax is valid. The agent didn't make a mistake; it optimized for completion. Dropping the audit call was the cheaper path to a block that superficially matches the prompt, because nothing in the loop was checking for preserved behavior, only valid syntax.

Why this matters beyond one plugin

This reframes the whole AI-agent-trust problem. You're not fighting deception. You're fighting economics. Once you see it that way, the fix is always concrete and buildable: verify against ground truth (diffs, transcripts, actual test runs), not against the agent's own narration of what it did.

If you're building with Claude Code or any agent framework and hitting false "done" claims, the question to ask isn't "how do I make it more honest" — it's "what's the cheapest lie it can currently get away with, and how do I take that away."

Building deterministic boundaries

We can't prompt our way out of an optimization engine. The fix is a local, hard-coded boundary that processes the raw outputs of a session before the turn is allowed to close — no second LLM required, just deterministic checks against what actually happened:

# A deterministic verification hook, simplified
import sys, re

def verify_session(git_diff_path, transcript_path):
    diff = open(git_diff_path).read()
    transcript = open(transcript_path).read()

    # Rule: if the agent claims tests passed, the run must actually appear
    if "tests pass" in transcript.lower():
        if not re.search(r"(mvn test|gradlew test|pytest)", transcript):
            print("CRITICAL: claimed tests passed, but no test runner executed.")
            sys.exit(1)

    print("System boundaries verified.")
    sys.exit(0)

The shift is from writing better prompts to building rigid boundaries — treating the agent less like a teammate who might lie, and more like a powerful, volatile compiler that needs strict runtime validation.

If you're hitting this pain yourself, I put together an open-source, deterministic Claude Code hook called GroundTruth that checks the raw git diff against the actual session transcript. Contributions and your own validation rules welcome.

I write about this kind of thing while building EraPin and GroundTruth in the evenings. Feedback welcome.

My coding agent said "everything preserved." A method was already gone.

Akash Khanna — Fri, 03 Jul 2026 08:27:08 +0000

I've spent most of two decades in enterprise Java, where "done" meant it is working and you have done what you were asked for. It's in production, it works, and you can prove it. So when I started leaning on a coding agent and it told me "done — everything preserved, tests pass,"my instinct was the one every long-tenured engineer has: show me.

Here's the one that actually stung. I had Claude collapse three classes into one — it had grown too complex — and later split another apart. Copy-paste the whole refactored class, done in the chat window and pasted into my IDE, not in an agent that touches the repo directly. It reported everything preserved. I believed it. Days later, something broke in a path no test covered, and I traced it to a method that wasn't there anymore. And this happened even when I asked three different agents to compare and even with the max usage and the best of models I could use at that point. It hadn't made the trip during the merge. Nothing flagged it, because nothing forced the claim to be checked. I just prayed the refactor was clean, and it wasn't.

The thing that clicked: a claim is not evidence, and a rule in the agent's context is not enforcement. "Everything preserved" is an assertion about a mapping the agent reshaped and never verified. "Tests pass" is a claim about a run that, it turned out, never happened that session. I'd been treating the agent's summary as if it were the ground truth. It's the press release, not the primary source.

So I built a small thing to read the primary source. It's a Claude Code Stop hook. When the agent says it's finished, the hook reads the actual git diff and the session transcript — not the summary — and checks the claim against what really changed, before the turn is allowed to end. It's deterministic: no model sits in the check itself, which matters, because the same confident narration that talks an LLM reviewer into "looks good" has nothing to grab onto in a diff parser.

On the refactoring pain specifically, here's the honest scope, so I too do not sound like the agent I faced an issue with. It doesn't try to detect "a method was dropped" — that needs intent, and guessing intent is where false positives come from. It detects the observable consequence: a call that no longer resolves, in a turn that claimed nothing changed. That one reframe makes the messy cases fall out correctly — a rename with the callers updated stays silent; a rename that missed a caller fires on the real broken call; a merge or a split doesn't matter, because it checks whether the symbol is defined anywhere in the whole tree, not just where it used to live. What it can't catch, and says so plainly: a method dropped with no caller left behind — no dangling reference, no textual signal, invisible to a diff. It trades recall for precision on purpose. It would not have caught every possible silent drop in my refactor — but it would have caught the one that broke, because something still called it.

The broader set of things it flags, in plain terms: "tests pass" when no test ran this session, stubs and TODOs in new code, a file the agent said it changed that isn't in the diff, imports pointing at nothing, hardcoded secrets, and rules from your own project docs that got overridden anyway.

What surprised me is how much company this idea has now. There's a wave of people landing in the same place — that you shouldn't ask the same agent to write code and verify it, because it has no incentive to fail its own work; that the validator needs separate context; that the answer is guardrails that make the bad behaviour hard, not prompts that ask nicely. A year ago this felt like a personal irritation. Now it feels like a category forming, which is a good sign I'm not the only one muttering "show me" at a green checkmark.

It's open source (MIT), it's deterministic, and in-session it's tamper-evident, not tamper-proof , not perfect and I am still fixing issues as I find them— the hook shares a filesystem with the agent it audits, so the real guarantee lives in the pre-commit/CI rung, in a separate trust domain (that's in there too). It checks that the agent did what it said — not that what it said was right.

Link's below. The one thing I actually want back: tell me where it false-flags you. False positives are what kill a tool like this, and I'd rather hear about them than not.
https://github.com/akahkhanna/groundtruth