Rasmus Ros

Posted on Jun 2 • Originally published at eignex.com

Logic Drift: The Failure Mode Agents Can't See

#agents #ai #softwareengineering #vibecoding

Vibe coding works for the first week or two. You describe what you want, the agent writes it, tests pass, you ship. A few weeks in, progress falls off a cliff. New prompts start breaking older features in ways that pass the obvious tests, but later surface in production.

Vibe coding is the version where you fully trust the agent, don't read or only skim the code, and ship. Agentic coding is the version where you still read every diff, but the line between the two is a convention that decays when you're tired, when the diff is large, or when you're four hours in and the feature is almost done. So I'm treating vibe coding here as the failure mode of agentic coding rather than a separate thing.

The issue is structural, since coding agents have no equivalent of the source/generated-output boundary that a compiler gives us, and so prompt, code, tests, and previous agent output are all editable and all treated as input. The fix has to come from the harness vendors, in the form of a protected region the agent can read but can't rewrite without an explicit human unlock, because another instruction file isn't going to cut it. Until they ship the real thing, the workarounds are all a bit unsatisfying.

A public case of vibe coding going wrong. Lemkin was experimenting on a personal project, but real production systems have been wiped the same way. Replit is apparently "the safest place for vibe coding" according to their marketing.

It's tempting to read this as a problem that only kicks in once you have a real team or a serious codebase, but even the vendors selling these agents are starting to see its limits. From a recent interview:

if you close your eyes, and you don't look at the code, and you have AIs build things with shaky foundations as you add another floor, and another floor, and another floor, things start to kind of crumble.

Michael Truell, Cursor CEO

I don't want to cast blame on the users here ("professional" SWEs doing vibe coding is another story). The dream is real: a tool that lets you build production software without the years of engineering muscle memory it usually takes. The marketing says it's safe and the product produces plausible work. The loop stays quiet until something breaks, and the dev forums are full of stories where it did: leaked secrets, runaway agents, silent regressions.

Even if you don't use agents, or you always read the diffs carefully, you still have to deal with the consequences. They usually arrive as a vibe-coded PR or demo from a non-technical colleague that engineering then has to finish properly. It's hard to be the engineer who always says no, especially when these colleagues are excited to contribute and think they made something good. The question is whether we want to fix it, control it, or ban it.

Why this fails

The agent reads both the prompt and the code, treating them as equally important since either can be changed at any time. This is different from a compiler, which operates in one direction. You write Go, it produces assembly, and there's no confusion about which side to edit. If you change the Go file, the assembly gets regenerated next time. If you edit the assembly directly, that is a mistake, and the next compile will silently overwrite it.

Now, picture a compiler that is right 95% of the time. Sometimes it regenerates code in a different file you didn't plan to modify, treating its previous output as input for the next run. Nobody reads the assembly because the main reason for trusting the compiler is that you don't have to. So, when things go wrong, nobody notices. The compiler continues to treat its past output as if it were the source, causing errors to accumulate unnoticed.

Compilers gave us assembly we never had to look at. The agent loop asks us to look at both.

To make this concrete, say that in week 1 you ask the agent to add a payment flow and it does the right thing, e.g., a GDPR consent check before the charge and the amount bounded against the user's daily cap:

if not user.has_consent("payments"):
    raise PaymentDenied("missing consent")

if amount <= 0 or amount > user.daily_cap:
    raise PaymentDenied("amount out of bounds")

You revisit the same function weeks later and ask the agent for a quick cleanup pass, and it comes back looking like this:

if amount <= 0:
    raise PaymentDenied("amount out of bounds")

if amount > user.daily_cap and not user.is_premium:
    raise PaymentDenied("amount out of bounds")

The tests still pass and the code is clean and readable, but the GDPR check is gone, and the fraud cap has been silently lifted for premium users without anyone asking for it.
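A minimal sketch of why the tests stay green, assuming a test suite that only covers the happy path and the basic bounds. The `User` class, the `denied` helper, and both test functions are hypothetical stand-ins invented for illustration; only the two guard blocks come from the snippets above.

```python
from dataclasses import dataclass, field


class PaymentDenied(Exception):
    pass


@dataclass
class User:
    # Hypothetical stand-in for the user object in the snippets above.
    daily_cap: int = 100
    is_premium: bool = False
    consents: set = field(default_factory=set)

    def has_consent(self, scope):
        return scope in self.consents


def check_week1(user, amount):
    # The guards as written in week 1.
    if not user.has_consent("payments"):
        raise PaymentDenied("missing consent")
    if amount <= 0 or amount > user.daily_cap:
        raise PaymentDenied("amount out of bounds")


def check_after_cleanup(user, amount):
    # The guards after the cleanup pass.
    if amount <= 0:
        raise PaymentDenied("amount out of bounds")
    if amount > user.daily_cap and not user.is_premium:
        raise PaymentDenied("amount out of bounds")


def denied(check, user, amount):
    try:
        check(user, amount)
    except PaymentDenied:
        return True
    return False


def obvious_tests(check):
    """Happy path plus basic bounds, for a consenting non-premium user."""
    ok_user = User(consents={"payments"})
    return (
        not denied(check, ok_user, 50)
        and denied(check, ok_user, 0)
        and denied(check, ok_user, 500)
    )


def invariant_tests(check):
    """Tests that name the two rules themselves."""
    no_consent = User()
    premium = User(is_premium=True, consents={"payments"})
    return denied(check, no_consent, 50) and denied(check, premium, 500)
```

Under these assumptions, `obvious_tests` returns `True` for both versions, while `invariant_tests` returns `True` for `check_week1` and `False` for `check_after_cleanup`. The regression only shows up if someone wrote down the rule as a test before the cleanup happened.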

I've been calling this logic drift. The code shape is roughly the same, but an earlier constraint is subtly relaxed. An invariant becomes conditional, a guard gets moved a few lines down past the thing it was supposed to guard, an authorization check gets duplicated and one of the copies is wrong. The diff just says a guard moved. The source never stated that the guard was load-bearing, so the review never catches the moment it is no longer load-bearing.

This actually happened on the Linux kernel recently. A maintainer submitted a patch generated by an AI that removed a __read_mostly annotation. This annotation is a hint to the compiler about cacheline placement, and removing it causes contention on every multi-core system that the kernel ships to. On review, the line seemed like a simple cleanup, so the patch was accepted, and Torvalds later said that he would have viewed it differently if he had known it was written by AI.

The shape of a fix

The fix needs to be in the harness, the layer between the model and your filesystem (Cursor, Claude Code, Replit, an IDE plugin). The simplest implementation is a way of tagging a comment and the code immediately following it as human owned so that the agent can read it and reference it and suggest a patch but cannot implement the patch without the human unlocking it first. That puts the source/assembly boundary back into the code.

Protected regions like this are a really old idea. Code generators have used BEGIN USER CODE / END USER CODE markers for decades because rerunning the generator overwrites whatever you had hand-edited inside the generated file. Agentic coding has the same overwrite problem, except there's no generator and no rerun, just an agent editing ordinary source files in the background. There's no codegen template to put the markers in, so the lock has to live one layer up, in the harness itself.
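For readers who haven't met the pattern, here is a rough sketch of how a generator might honor such markers on a rerun. The exact marker syntax, the named regions, and the `regenerate` function are assumptions made for this example, not any particular tool's format.

```python
import re

# Assumed marker format: "# BEGIN USER CODE <name>" ... "# END USER CODE <name>".
REGION = re.compile(
    r"# BEGIN USER CODE (?P<name>\w+)\n(?P<body>.*?)# END USER CODE (?P=name)\n",
    re.DOTALL,
)


def regenerate(template: str, previous_output: str) -> str:
    """Render `template`, carrying over hand-edited regions from the last run."""
    # Collect whatever the human wrote inside each named region last time.
    saved = {m["name"]: m["body"] for m in REGION.finditer(previous_output)}

    def keep_user_body(match):
        name = match["name"]
        # Prefer the saved body; fall back to the template's default.
        body = saved.get(name, match["body"])
        return f"# BEGIN USER CODE {name}\n{body}# END USER CODE {name}\n"

    return REGION.sub(keep_user_body, template)
```

Everything outside the markers is regenerated from the template, and everything inside survives. The boundary works because the generator itself enforces it, which is the property the agent loop is missing.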

A # lock: comment does the job one statement at a time, in the spirit of Python's # type: comments or coverage.py's # pragma: no cover:

def charge_card(user, amount, idempotency_key):
    # lock: gdpr art 6 - refuse charge if no payment consent
    if not user.has_consent("payments"):
        raise PaymentDenied("missing consent")

    # lock: fraud SLA - reject amounts <=0 or above user.daily_cap
    if amount <= 0 or amount > user.daily_cap:
        raise PaymentDenied("amount out of bounds")

    invoice = build_invoice(user, amount, idempotency_key)
    metrics.timing("invoice.build", invoice.elapsed_ms)

    receipt = stripe.charge(invoice.token, amount)
    # lock: pci audit trail, compliance keeps asking, dont remove
    log.info("charged", user=user.id, amount=amount)
    return receipt

The # lock: comment locks itself and the syntax node immediately below, so attaching it to an if covers the whole block and attaching it to a single call covers just that line. The comment contains the motivation and is locked along with the code.

Note that this does not rely on the model to cooperate. The harness already sits between the agent and the filesystem. Before applying any patch, it analyses the file, determines where the locks are placed, and refuses all attempts to edit the spans containing the locks, unless of course they are explicitly unlocked by the user.
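To show this is mechanically feasible, here is a rough sketch of the check for Python sources, using only the standard library. No harness ships this today as far as I know; `locked_spans`, `violations`, and the `ORIGINAL` sample are hypothetical names for illustration, and a real implementation would need a parser per language and an unlock mechanism.

```python
import ast
import difflib
import io
import tokenize

LOCK_PREFIX = "# lock:"

ORIGINAL = """\
def charge_card(user, amount):
    # lock: gdpr art 6 - refuse charge if no payment consent
    if not user.has_consent("payments"):
        raise PaymentDenied("missing consent")

    # lock: fraud SLA - reject amounts <=0 or above user.daily_cap
    if amount <= 0 or amount > user.daily_cap:
        raise PaymentDenied("amount out of bounds")

    return stripe.charge(user, amount)
"""


def locked_spans(source):
    """Return (first_line, last_line) for each lock comment plus the
    statement that starts on the first code line below it."""
    lock_lines = [
        tok.start[0]
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.COMMENT and tok.string.startswith(LOCK_PREFIX)
    ]
    # Map each line that starts a statement to the line where it ends.
    ends = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.stmt):
            ends[node.lineno] = max(ends.get(node.lineno, 0), node.end_lineno)
    spans = []
    for line in lock_lines:
        start = min((n for n in ends if n > line), default=None)
        if start is not None:
            spans.append((line, ends[start]))
    return spans


def violations(old, new):
    """Return the locked spans of `old` that the edit to `new` would touch."""
    spans = locked_spans(old)
    matcher = difflib.SequenceMatcher(
        None, old.splitlines(), new.splitlines(), autojunk=False
    )
    hit = set()
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        for first, last in spans:
            if tag == "insert":
                # New lines land before old line i1 + 1; that splits the
                # span only if the line is inside it but not its first line.
                inside = first < i1 + 1 <= last
            else:
                # Old lines i1 + 1 .. i2 (1-indexed) were changed or removed.
                inside = i1 + 1 <= last and i2 >= first
            if inside:
                hit.add((first, last))
    return sorted(hit)
```

A harness built along these lines would apply the patch only when `violations(old, new)` comes back empty, and otherwise show the proposed change to the human and wait for an unlock. Because the check runs on the file contents before and after, it doesn't matter whether the model read, understood, or ignored the comment.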

What's been tried

The first answer everyone reaches for is discipline (agentic coding is a trap): use the agent less, keep diffs small, review everything. This all works well right up until the tool itself drains any remaining self-discipline you might have. You pull the lever and a perfectly functional piece of code drops out of the app. And even if your own discipline is strong, you cannot enforce it on others.

Traditional engineering processes work well for humans, but don't scale to the scope of agents. Requirements live outside the code and are not generally read by agents. Tests, types, and linters all give the agent rails to follow, but none of them says: don't change this line, ever. Code review can catch some of the drift, but it's a scale problem. Reviewing takes far longer than it takes an agent to spit out a new feature.

The harness vendors themselves have caught up some too, but most of what they've shipped is still not hard constraints. Persistent memory survives sessions, skills bundle known procedures, code search has gone from grep to semantic indexing, and AGENTS.md files politely beg the agent not to touch certain functions. Cursor has project rules, Claude Code has hooks that can intercept tool calls, GitHub Copilot has custom instructions, and OpenCode has modes that can't write to production files at all. I actually use a lot of it.

AGENTS.md, on closer inspection.

So that's roughly where I land. The harness vendors aren't going to ship a real lock anytime soon, and until they do, the only boundary that reliably holds is one the agent can't see or touch. Current solutions are helpful, but only as advisory hints, not as the lock itself.

Top comments (9)

Max Quimby • Jun 8

The "no source/generated boundary" framing is the sharpest way I've seen this put. What makes it insidious is that an editable test file turns "tests pass" from a verification signal into a target — and agents are very good at hitting targets. The most common drift I see isn't the agent rewriting logic, it's the agent quietly weakening an assertion so the obvious test goes green while the behavior underneath rots.

The workaround that's held up best for us is mechanical rather than another instruction file: run verification from a clean checkout the agent has no write access to, so it physically cannot touch the tests it's being graded against. Not as clean as a real protected region, but it restores the boundary the harness doesn't give you.

Curious where you'd draw the protected line — the test files, or the assertions? Protecting whole files turned out too coarse for us (you do want the agent adding new tests), but "can append, can't weaken existing" is hard to enforce without semantic diffing.

Mykola Kondratiuk • Jun 11

reading the diffs isn't the whole fix - teams that review every diff still drift when intent only lives in someone's head. the real variable is whether the agent can recover intent from artifacts, or has to invent it.

Rasmus Ros • Jun 13

Teams can read every diff and still drift if the reason behind the change never made it into the PR, issue, or doc. An agent just exposes that faster by filling in the blanks with something nobody meant.

Mykola Kondratiuk • Jun 13

filling in the blanks makes a bad assumption visible, which is actually a better failure mode than quiet drift. you can argue with a wrong guess. you can't argue with undefined intent.

Kantemir Satibalov • Jul 1

Good article. "Logic drift" is a spot-on name for it — the code looks fine, tests pass, but an important constraint quietly disappears during a refactor.

On # lock: — the idea makes sense, but it only protects what you marked in advance. The dangerous bugs are usually in the stuff you didn't know was important. Probably need tests for the rules themselves too, like "no payment without consent", regardless of how the agent rearranges the lines.

Rasmus Ros • Jul 2

Thanks. You're right about the limitations. Unfortunately there is no silver bullet here I think.

Gissur Runarsson • Jun 25

This really lands. The thing that got me is the failure is invisible because the intent lives outside the artifact. Nothing breaks loudly. You just find out later in production.

I keep hitting the exact same shape one layer up outside the codebase. I have been measuring how the big models answer when someone asks for a tool in a category. It is the same failure mode you are describing. The model quietly drops the newer option and names the incumbent. No error, no signal. The company just never shows up in the answer and nobody notices until the buyers stop coming. The load bearing constraint, this company exists and does X, is not in what the model can see, so it gets optimized away.

It actually made me think of your loss function piece too. The objective when it recommends is not be right about new tools. It is produce the safe plausible answer. And the safe answer is whatever the training consensus already rewarded.

Where do you land on this for agents? Does the boundary have to be enforced structurally like you argue, or is there a version where the eval just gets good enough to catch the drift before it ships? I keep flip flopping on it.

Rasmus Ros • Jun 27

It is the same failure mode. Nothing fails loudly, the newer tool just disappears and the incumbent keeps getting recommended until someone makes a production decision from a stale map.

Gissur Runarsson • Jun 28

The way i have come to see it:
eval catches the drift you thought to test for, structure catches the drift you did not. For the recommendation layer it is the same shape, you cannot eval your way into being seen if the constraint "this company exists and does X" was never in the model's inputs. no test fires because nothing is wrong, the information just is not there.

So for agents i land closer to your structural side, with one caveat: structure for the load-bearing invariants, eval for the soft stuff where "good enough" really is good enough and a miss is cheap. The mistake i keep watching people make is using eval for the load-bearing parts because writing a test is cheaper than enforcing an invariant. That is the safe-plausible-answer failure again, one layer up.

The recommendation version is almost funny: the only thing that reliably moves the model is putting the constraint back into the world it reads from. published, specific, third-party evidence. You cannot test your way in, you have to be in the inputs. where do you draw the line on which invariants earn structure vs eval?