Code that looks right and lies: a field guide to intent↔code drift

Viscount Sparrow — Thu, 04 Jun 2026 11:08:48 +0000

AI agents now write enormous amounts of code, and it usually looks right. It compiles, it passes the tests, it reads cleanly in review. But "looks right" and "does what it said it would" are different things — and the gap between them is where the real bugs hide now.

I keep seeing the same failure mode: a pull request claims one thing, and the diff quietly does another. The reviewer reads the claim, confirms the claim is present, and sails right past the part nobody mentioned. At PR volume — especially with agents generating the code — humans simply can't audit every change for

Silent scope — the change does more than it claims

The most dangerous one, because the extra behavior hides behind a true, narrow description.

import { logger } from "../log";
-const SESSION_TTL = 30 * 60; // 30 minutes
+const SESSION_TTL = 24 * 60 * 60; // 24 hours
export function createSession(user) {

logger.info("creating session for", user.id); return { user, ttl: SESSION_TTL }; }

▎ PR says: "Add debug logging to session creation. No behavior changes."

The logging is real — and that truth is the camouflage. The 48× session-lifetime change is the actual risk, and a reviewer skimming for "logging" never sees it.

Unbacked claim — the description promises what the code doesn't deliver

// PR: "Adds rate limiting (5 req/min) to the login route."
router.post("/login", async (req, res) => {

logger.info("login attempt", req.ip); const user = await authenticate(req.body); return res.json({ token: sign(user) }); });

The body advertises rate limiting; the diff adds a log line and nothing that counts or rejects requests. The title set an expectation strong enough that your brain auto-filled the missing implementation.

Doc drift — code and its referenced doc now contradict each other

-// Spec (CLEANUP.md): "expireStaleData runs once daily at 04:00 UTC."
-exports.expireStaleData = schedule("every 24 hours").onRun(...);
+exports.expireStaleData = schedule("every 1 hours").onRun(...);

Nobody updated the doc. Six months later someone reasons about load using "daily" and is wrong by 24×. Your most trusted reference becomes confidently wrong.

Missing impl — a stated requirement has no code

// Spec: "Players get 3 free guesses; extras are paid (max +2)."
-const FREE_GUESSES = 3;
+const FREE_GUESSES = 1; // paid extras: TODO

"Behind the spec" is fine if everyone knows. Drift is when only the code knows — and the spec keeps overstating what ships.

Intent mismatch — the code does something different from the claim

// Design: "All message writes go through the server trigger."
match /matches/{id}/messages/{msgId} {

allow create: if false; // server-only
allow create: if isMember(id); // client writes directly }

The feature still "works" — through a different door than the one the design built. The mismatch only bites later: a missing notification, an unmoderated message, a broken rollup.

Why this is hard to catch by hand

Every one of these passes CI and looks clean. The signal isn't in the code's quality — it's in the distance between the code and its stated intent. That's a comparison, and it's exactly the kind humans stop doing carefully at scale.

So I built a small open-source tool for it: Verdict reads a PR's declared intent (issue / description / linked spec) and the actual diff, and flags only these fidelity gaps — evidence-backed, or nothing. Not a style linter, not a general reviewer. It runs as a GitHub Action and a CLI, and it's model-agnostic
(OpenAI-compatible or Anthropic).

The first thing I did was run it on its own codebase. It found 16 real issues I then fixed — including a way an attacker's PR text could try to talk the judge into passing. Dogfooding a verification tool is the only honest way to ship one.

It's early (v0). If the "claim vs code" gap is something you've felt, I'd genuinely like your eyes on it:

Repo: https://github.com/verdict-ci/verdict
npm i -g verdict-ci then git diff main... | verdict --intent "what you meant to do"

What drift patterns have bitten you that aren't on this list? That's the part I most want to hear.

DEV Community: Viscount Sparrow

Code that looks right and lies: a field guide to intent↔code drift