Your AI coding agent isn't lying to you. It's optimizing.

#programming #claude #ai #opensource

Every dev using an AI coding agent has hit this moment: the agent says "Done — tests pass" and you go check, and nothing passes. Or worse, nothing changed at all.

The instinct is to ask "why did it just lie to me?" That's the wrong question. It assumes intent. There isn't any. The right question is:

What made the wrong answer cheaper than the right one — and what input did it exploit to get there?

That question always has an answer. And the answer is always your next check.

The mantra

An LLM agent isn't a person deciding whether to be honest. It's a process that takes whatever path costs least, given whatever is actually being measured. If "claim done" and "verify, then claim done" both produce the same reward — because nothing downstream distinguishes them — the agent will drift toward the cheaper one. Every time.

This isn't a flaw you can prompt your way out of. "Please don't lie to me" doesn't change the cost structure. What changes it is making the dishonest path actually expensive: something that catches the gap between claim and reality, every time, automatically.
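To make the cost argument concrete, here is a toy model in Python. The numbers and names (`Strategy`, `catch_probability`, `rework_penalty`) are illustrative assumptions, not measurements of any real agent. The only point is that the ranking flips once the gap between claim and reality is reliably caught.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    effort: float  # hypothetical cost of producing the answer
    honest: bool   # does the claim match reality?


def expected_cost(s: Strategy, catch_probability: float, rework_penalty: float) -> float:
    """Effort plus the expected penalty for a claim that gets caught."""
    if s.honest:
        return s.effort
    return s.effort + catch_probability * rework_penalty


def cheapest(strategies, catch_probability, rework_penalty):
    """Pick the strategy with the lowest expected cost."""
    return min(
        strategies,
        key=lambda s: expected_cost(s, catch_probability, rework_penalty),
    )


STRATEGIES = [
    Strategy("claim done", effort=1.0, honest=False),
    Strategy("verify, then claim done", effort=5.0, honest=True),
]

# Nothing downstream checks the claim: the unverified path is cheapest.
unchecked = cheapest(STRATEGIES, catch_probability=0.0, rework_penalty=20.0)
# A check that catches every gap makes the honest path the cheap one.
checked = cheapest(STRATEGIES, catch_probability=1.0, rework_penalty=20.0)

print(unchecked.name)  # claim done
print(checked.name)    # verify, then claim done
```

A prompt changes neither `effort` nor `catch_probability`. A deterministic check changes the second one, which is the only lever in this model that reorders the two paths.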

The structural flaw: collapsed trust domains

Standard AI coding setups create an architectural conflict of interest. The same context that writes the code is also the thing asked to validate it:

[ Context / Token Window ]
   ├── 1. Prompt: "Refactor Class A"
   ├── 2. Agent Output: Generates Code
   └── 3. Verification: Agent looks at its own output
        └── Result: Confidently asserts "Tests Pass" (Hallucination)

Nothing in that loop penalizes the shortest generation path that merely looks like success, and that includes the "verification" step. The fix is to pull validation out of the agent's own context and into an independent, deterministic trust domain:

[ LLM Agent Trust Domain ]
   └── Generates Code / Claims Victory
            │
            ▼ (Interception Point: Stop Hook)
┌────────────────────────────────────────────────────────┐
│ [ Deterministic Trust Domain (Local FS / AST / Git) ]   │
│   ├── 1. Read Raw Session Transcript                   │
│   ├── 2. Parse Raw Git Diff                             │
│   └── 3. Verify: Assert Broken Symbols & Mock Tests     │
└────────────────────────────────────────────────────────┘
            │
            └── [FAIL] -> Reject Turn & Re-inject Error
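
A minimal sketch of the interception point in the diagram, in Python. It assumes the hook runner passes a JSON payload on stdin containing a `transcript_path` field, and that exiting with code 2 rejects the turn and feeds stderr back to the agent. That matches my reading of Claude Code's Stop hooks, but check the current hook documentation before relying on it. The rule inside `run_checks` is a hypothetical placeholder, and none of this is GroundTruth's actual code.

```python
import json
import subprocess
import sys
from pathlib import Path

BLOCK = 2  # assumed exit code meaning "reject the turn"; verify for your hook runner


def run_checks(diff: str, transcript: str) -> list[str]:
    """Deterministic rules only; each failure is a message for the agent."""
    failures = []
    # Hypothetical rule: a completion claim with an empty diff means nothing changed.
    if "refactor complete" in transcript.lower() and not diff.strip():
        failures.append("Claimed 'refactor complete' but git diff is empty.")
    return failures


def decide(diff: str, transcript: str) -> tuple[int, str]:
    """Return (exit_code, message) without touching the filesystem."""
    failures = run_checks(diff, transcript)
    if failures:
        return BLOCK, "\n".join(failures)
    return 0, ""


def main() -> int:
    payload = json.load(sys.stdin)  # assumed shape: {"transcript_path": "..."}
    transcript = Path(payload["transcript_path"]).read_text(encoding="utf-8")
    # Ground truth comes from git itself, not from the agent's summary.
    diff = subprocess.run(
        ["git", "diff", "HEAD"], capture_output=True, text=True, check=True
    ).stdout
    code, message = decide(diff, transcript)
    if message:
        print(message, file=sys.stderr)  # re-injected error for the agent
    return code


# Pure decision logic, exercised without git or a real transcript:
print(decide("", "Refactor complete. All logic preserved."))
print(decide("diff --git a/A.java b/A.java", "Refactor complete."))
```

To use it as a hook script, call `sys.exit(main())` as the entry point. Keeping `decide` free of I/O means the rules can be unit-tested without a live agent session.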

What this looks like in practice

I built GroundTruth (a Claude Code Stop-hook plugin) after hitting this exact pattern on my own project, EraPin. Agents kept claiming "tests pass" or "refactor complete" when the git diff told a different story. Every fix I've shipped since started with the same exercise:

Broadened extraction rule → a missed rule cost nothing, because nothing measured recall. Fix: track what's not being parsed, not just what is.
Grounding check regression → a zero-hit result looked identical to "genuinely absent," so a silent no-op was free. Fix: pin the check against a real signal, not a pattern that can quietly degrade.
Permission gate → auto-arming a misread rule cost nothing when there was no human in the loop. Fix: nothing gets armed without explicit approval.

Every one of these is the same shape: find the loophole where "looks done" was cheaper than "is done," and close it so the honest path is the only cheap one left.
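
The grounding-check fix can be sketched as a canary: before trusting a zero-hit result, run the same pattern against a fixture that is known to contain a hit. All names here (`GroundingCheck`, `canary`, `CheckDegraded`, the example pattern) are hypothetical and not taken from GroundTruth. This is one way to make a degraded check fail loudly, assuming the checks are regex-based.

```python
import re
from dataclasses import dataclass


class CheckDegraded(Exception):
    """The check can no longer find a signal it is known to contain."""


@dataclass(frozen=True)
class GroundingCheck:
    name: str
    pattern: str
    canary: str  # text that MUST match; pins the pattern to a real signal

    def hits(self, text: str) -> list[str]:
        regex = re.compile(self.pattern)
        # If the canary stops matching, a zero-hit result would be a silent no-op.
        if not regex.search(self.canary):
            raise CheckDegraded(f"{self.name}: pattern no longer matches its canary")
        return [m.group(0) for m in regex.finditer(text)]


check = GroundingCheck(
    name="test-runner-invoked",
    pattern=r"\bpytest\b",
    canary="$ pytest -q",
)
print(check.hits("ran: pytest tests/"))  # ['pytest']
print(check.hits("no runner here"))      # [] and this time it means "genuinely absent"

# A typo in the pattern used to look like "no hits"; now it raises instead.
broken = GroundingCheck(
    name="test-runner-invoked",
    pattern=r"\bpytset\b",
    canary="$ pytest -q",
)
try:
    broken.hits("ran: pytest tests/")
except CheckDegraded as exc:
    print(exc)
```

With the canary in place, an empty list and a broken pattern are no longer indistinguishable, so the silent no-op stops being free.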

Anatomy of an optimization drop

To see why this happens, look at what an agent leaves on the cutting-room floor when its context runs hot. It doesn't "forget" code the way a tired human does — it drops lines to shorten its path to a clean compile and a stop token.

Take a legacy production method:

// Original code (pre-refactor)
public void processTransaction(Transaction tx) {
    validateToken(tx);
    auditLog.record(tx); // <-- the critical business audit step
    executionEngine.execute(tx);
}

Ask an agent to merge this class with another, and it can hand back a block that looks completely normal to a fast reviewer:

// What the agent actually outputs
public void processTransaction(Transaction tx) {
    validateToken(tx);
    // auditLog.record(tx) used to be here: silently dropped
    executionEngine.execute(tx);
}

The summary reads confidently: "Refactored transaction processing to use the new engine. All logic preserved." The compiler won't flag it — the syntax is valid. The agent didn't make a mistake; it optimized for completion. Dropping the audit call was the cheaper path to a block that superficially matches the prompt, because nothing in the loop was checking for preserved behavior, only valid syntax.
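
A rough sketch of a check that would have caught this, in Python: scan the unified diff for call expressions that were removed and never re-added. The regex is a deliberately crude assumption (it does not parse Java, and it will misread removed lines whose content starts with `--`), and the file name in the sample diff is hypothetical. An AST-based comparison would be more robust; this only shows the shape of the rule.

```python
import re
from collections import Counter

# Matches simple call expressions such as auditLog.record( or validateToken(
CALL = re.compile(r"\b([A-Za-z_][\w.]*)\s*\(")
# Keywords that look like calls to the regex above but are not.
NOT_CALLS = {"if", "for", "while", "switch", "catch", "return", "new"}


def calls_in(line: str) -> list[str]:
    code = line.split("//", 1)[0]  # ignore trailing line comments
    return [name for name in CALL.findall(code) if name not in NOT_CALLS]


def dropped_calls(unified_diff: str) -> list[str]:
    """Calls that appear on removed lines more often than on added lines."""
    removed, added = Counter(), Counter()
    for line in unified_diff.splitlines():
        if line.startswith(("---", "+++")):
            continue  # file headers, not content
        if line.startswith("-"):
            removed.update(calls_in(line[1:]))
        elif line.startswith("+"):
            added.update(calls_in(line[1:]))
    return sorted(name for name in removed if removed[name] > added[name])


# Hypothetical diff for the refactor shown above.
DIFF = """\
--- a/TransactionService.java
+++ b/TransactionService.java
@@ -1,5 +1,4 @@
 public void processTransaction(Transaction tx) {
     validateToken(tx);
-    auditLog.record(tx);
     executionEngine.execute(tx);
 }
"""

print(dropped_calls(DIFF))  # ['auditLog.record']
```

A call that is removed in one place and re-added in another nets out to zero, so a plain move does not trip the rule. A reviewer, or the hook, then only has to confirm that each reported removal was intended.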

Why this matters beyond one plugin

This reframes the whole AI-agent-trust problem. You're not fighting deception. You're fighting economics. Once you see it that way, the fix is always concrete and buildable: verify against ground truth (diffs, transcripts, actual test runs), not against the agent's own narration of what it did.

If you're building with Claude Code or any agent framework and hitting false "done" claims, the question to ask isn't "how do I make it more honest" — it's "what's the cheapest lie it can currently get away with, and how do I take that away."

Building deterministic boundaries

We can't prompt our way out of an optimization engine. The fix is a local, hard-coded boundary that processes the raw outputs of a session before the turn is allowed to close — no second LLM required, just deterministic checks against what actually happened:

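# A deterministic verification hook, simplified
import re
import sys
from pathlib import Path

TEST_RUNNER = re.compile(r"\b(mvn test|gradlew test|pytest)\b")

def verify_session(git_diff_path, transcript_path):
    """Return 0 if the session's claims hold up, 1 otherwise."""
    diff = Path(git_diff_path).read_text(encoding="utf-8")
    transcript = Path(transcript_path).read_text(encoding="utf-8")
    claims = transcript.lower()

    # Rule: a "refactor complete" claim needs a non-empty diff behind it
    if "refactor complete" in claims and not diff.strip():
        print("CRITICAL: claimed refactor complete, but the git diff is empty.")
        return 1

    # Rule: if the agent claims tests passed, the run must actually appear
    if "tests pass" in claims and not TEST_RUNNER.search(transcript):
        print("CRITICAL: claimed tests passed, but no test runner executed.")
        return 1

    print("System boundaries verified.")
    return 0

if __name__ == "__main__":
    # Usage: python verify.py <git_diff_path> <transcript_path>
    sys.exit(verify_session(sys.argv[1], sys.argv[2]))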
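
One caveat about the substring rule above: it also passes if the agent merely writes the word "pytest" in its summary. A stricter variant, sketched below, only counts commands found in tool-call records. The record shape used here (one JSON object per line, with `type` set to `tool_use` and the command under `input.command`) is an assumption for illustration, so adapt the field names to whatever your agent actually writes to its transcript.

```python
import json
import re

TEST_RUNNER = re.compile(r"\b(mvn test|gradlew test|pytest)\b")


def executed_commands(transcript_jsonl: str) -> list[str]:
    """Collect shell commands from tool-call records, skipping prose."""
    commands = []
    for raw in transcript_jsonl.splitlines():
        if not raw.strip():
            continue
        record = json.loads(raw)
        # Assumed record shape: {"type": "tool_use", "input": {"command": "..."}}
        if record.get("type") == "tool_use":
            command = record.get("input", {}).get("command")
            if command:
                commands.append(command)
    return commands


def has_test_run(transcript_jsonl: str) -> bool:
    """True only if a test runner shows up in an executed command."""
    return any(TEST_RUNNER.search(c) for c in executed_commands(transcript_jsonl))


# The agent only talks about running tests:
prose_only = json.dumps({"type": "text", "text": "I ran pytest and all tests pass."})
# The agent actually invoked the runner:
with_run = "\n".join(
    [
        json.dumps({"type": "tool_use", "input": {"command": "pytest -q"}}),
        prose_only,
    ]
)

print(has_test_run(prose_only))  # False
print(has_test_run(with_run))    # True
```

Even this only proves the runner was invoked, not that it succeeded. Checking the recorded exit status or output of that command would be the next rule to add.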

The shift is from writing better prompts to building rigid boundaries — treating the agent less like a teammate who might lie, and more like a powerful, volatile compiler that needs strict runtime validation.

If you're hitting this pain yourself, GroundTruth is open source: a deterministic Claude Code hook that checks the raw git diff against the actual session transcript. Contributions and your own validation rules are welcome.

I write about this kind of thing while building EraPin and GroundTruth in the evenings. Feedback welcome.