Enjoy Kumawat

Posted on Jun 29

My AI Wrote Code That Passed Every Test and Was Still Wrong

#ai #llm #codequality #claudecode

The scariest bug I shipped this year came from code that did everything right. It compiled. It passed the tests. The linter was happy. The PR looked clean. My AI agent wrote it, I skimmed it, it worked in the demo — and it was still wrong in a way that none of those green checkmarks could see.

Here's the function, lightly disguised:

import os

def load_env(path):
    with open(path, encoding="utf-8") as f:  # close the file handle when done
        for line in f:
            line = line.strip()
            if "=" in line and not line.startswith("#"):
                k, v = line.split("=", 1)
                os.environ[k] = v.strip().strip('"').strip("'")

Looks fine, right? Parses a .env, strips quotes, skips comments. It passed my test, which had one clean KEY=value line. It worked when I ran it. And it clobbered an environment variable I'd already exported in my shell, because it used os.environ[k] = v instead of os.environ.setdefault(k, v). The test never caught it because the test didn't model the one situation that mattered: a variable that already existed.
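To make the difference concrete, here is a minimal sketch of the two write styles side by side. The variable name DEMO_API_KEY is made up for illustration, and patch.dict from the standard library is used only so the demo leaves the real environment untouched.

```python
import os
from unittest.mock import patch

# Simulate a variable that was already exported in the shell.
with patch.dict(os.environ, {"DEMO_API_KEY": "real-key-from-shell"}):
    # Plain assignment: whatever the .env file says wins.
    os.environ["DEMO_API_KEY"] = "placeholder"
    after_assignment = os.environ["DEMO_API_KEY"]

with patch.dict(os.environ, {"DEMO_API_KEY": "real-key-from-shell"}):
    # setdefault: writes only when the key is absent, so the shell value survives.
    os.environ.setdefault("DEMO_API_KEY", "placeholder")
    after_setdefault = os.environ["DEMO_API_KEY"]

print(after_assignment)  # placeholder
print(after_setdefault)  # real-key-from-shell
```

Both calls succeed silently, which is why nothing short of a test with a pre-existing variable can tell them apart.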

The code was functional. It was not correct. Those are different words, and AI-generated code is exactly where the gap between them gets dangerous.

Why "it works" lies to you with a straight face

When I write a sloppy function, I usually know it's sloppy. I felt the shortcut. There's a little voice. AI-generated code arrives with none of that. It's confident, idiomatic, well-formatted, and carries zero signal about which parts the model actually reasoned through versus pattern-matched from a million GitHub repos.

That polish is the trap. A human's rough draft looks rough, so you review it carefully. The AI's draft looks finished, so you review it the way you review finished code — for style, not for truth. And "does it run" answers a much easier question than "does it do the right thing when the input is hostile, empty, duplicated, or already in a weird state."

The model optimizes for plausible. Plausible and correct overlap a lot. The whole problem lives in the gap where they don't.

Where the gap actually hides

After getting burned a few times, I started noticing the same categories over and over. AI code is reliably weakest at:

- State that already exists. Overwriting instead of merging, creating instead of upserting, assuming the file/dir/key isn't there yet.
- The empty and the duplicate. Zero rows, one row, the same row twice. The happy path is n >= 2 distinct items.
- Idempotency. Run it twice: does it do the same thing, or double it?
- Off-by-one at boundaries. Especially in slicing, pagination, and "strip the first line" logic.
- Silent error swallowing. except: pass that turns a real failure into a wrong-but-quiet success.

None of these throw. None of them fail a naive test. They just produce the wrong answer, calmly, until production finds the edge case for you.
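Two of those categories, idempotency and silent error swallowing, fit in a few lines. This is an illustrative sketch, not code from my repo: the tag and port functions are hypothetical stand-ins, and the fallback port 8080 is an arbitrary choice.

```python
from __future__ import annotations


def add_tag_naive(tags: list[str], tag: str) -> list[str]:
    """Appends unconditionally: fine on the first run, doubled on the second."""
    return tags + [tag]


def add_tag_idempotent(tags: list[str], tag: str) -> list[str]:
    """Appends only when the tag is missing, so repeated runs converge."""
    return tags if tag in tags else tags + [tag]


def is_idempotent(fn, state, *args) -> bool:
    """Apply fn once, then again to its own output; the two results should match."""
    once = fn(state, *args)
    twice = fn(once, *args)
    return once == twice


def parse_port_quiet(raw: str) -> int:
    """Silent swallowing: bad input becomes a wrong-but-quiet default."""
    try:
        return int(raw)
    except Exception:
        return 8080


def parse_port_loud(raw: str) -> int:
    """Same parse, but a bad value surfaces as an error that names the input."""
    try:
        return int(raw)
    except ValueError as exc:
        raise ValueError(f"invalid port: {raw!r}") from exc
```

A single-call test passes for both tag functions and both port parsers. Only the second call, or the malformed input, separates them: is_idempotent(add_tag_naive, [], "ai") is False, and parse_port_quiet("80a") returns 8080 without a word.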

What I changed: review the spec, not the syntax

The fix wasn't "trust the AI less." It was changing what I review for. I stopped reading AI code to check that it's well-written — it almost always is — and started reading it to find the input the author never imagined.

Three things that stuck:

1. Make the AI write the test for the case it's avoiding. When I get a function, I don't ask "is this right?" I ask the agent to write the test for the nasty input:

def test_load_env_does_not_clobber_existing():
    os.environ["DEV_TO_API"] = "real-key-from-shell"
    load_env(env_file_containing("DEV_TO_API=placeholder"))
    assert os.environ["DEV_TO_API"] == "real-key-from-shell"  # FAILS with os.environ[k] = v

That test turns an invisible correctness bug into a visible, red, functional failure. Now my safety net actually models the danger.

2. Use a second model as an adversary, not a cheerleader. I run a separate review pass whose entire job is to attack the code — "what input breaks this?" — not to approve it. The author-model and the reviewer-model are different lanes on purpose. A model grading its own homework finds nothing; a model told to break someone else's code finds plenty. (In this repo I literally have a different tool review what the first one wrote.)
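The shape of that pass, roughly. This is a sketch and not my actual tooling: ask_reviewer is a hypothetical callable you would wire to whatever second model or tool you use, the prompt wording is illustrative, and the "- " bullet format is an assumption about how the reviewer is told to answer.

```python
from typing import Callable, List

# Illustrative prompt: the reviewer is asked to break the code, not to approve it.
ADVERSARY_PROMPT = """You did not write this code. Your job is to break it.
List concrete inputs or starting states that make it produce a wrong result
without raising an error. One finding per line, each starting with "- ".

{code}
"""


def adversarial_review(code: str, ask_reviewer: Callable[[str], str]) -> List[str]:
    """Send code to a separate reviewer and return its findings as a list.

    ask_reviewer is supplied by the caller (hypothetical): it takes a prompt
    string and returns the reviewer's reply as text.
    """
    reply = ask_reviewer(ADVERSARY_PROMPT.format(code=code))
    # Keep only the bullet lines; anything else in the reply is ignored.
    return [
        line[2:].strip()
        for line in reply.splitlines()
        if line.startswith("- ")
    ]
```

Each finding that comes back is a candidate test case, which feeds straight into the first habit.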

3. Treat "passed the tests" as the floor, not the ceiling. Green tests mean the cases I thought of work. They say nothing about the cases I didn't. With AI code, the cases I didn't think of are precisely where the model also didn't think — because it was pattern-matching my framing, not the real world.

The one-line fix, and the real lesson

The actual patch was tiny:

os.environ.setdefault(k, v.strip().strip('"').strip("'"))  # don't clobber a live env var

One method. But I'd never have found it by asking "does this work," because it did work, right up until it ruined my afternoon.
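In context, the patched function looks roughly like this. It is a sketch of the same function from the top of the post with only the write changed, plus a with block so the file gets closed; everything else is as before.

```python
import os


def load_env(path):
    """Load KEY=value pairs from a .env file without overriding existing variables."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if "=" in line and not line.startswith("#"):
                k, v = line.split("=", 1)
                # setdefault only writes when k is absent, so a value already
                # exported in the shell wins over the one in the file.
                os.environ.setdefault(k, v.strip().strip('"').strip("'"))
```

A side effect worth noting: with setdefault, the first value wins when the same key appears twice in the file, whereas the assignment version let the last one win. Whether that is what you want is a spec question, which is the point of the whole post.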

Here's what I tell myself now every time an agent hands me a clean diff: functional is the AI's job; correct is still mine. The model got dramatically better at producing code that runs. It did not get better at understanding what I actually meant. That gap doesn't show up in the test output — it shows up three weeks later in a bug report. Review for the meaning, write the test for the case the model dodged, and let a second model try to tear it apart before reality does.

Top comments (2)

Adam Lewis • Jul 1

The test had one clean KEY=value line, so it could only ever pass. It was written to match the code rather than the contract, so nothing you fed it would have turned it red. With an agent writing the code and the test off the same prompt, that's the default unless the cases come from outside it: the malformed line, the variable that's already set. Green just meant the code agreed with itself.

Pon • Jun 30

The .env clobber is a clean example because it's the rare one that stays purely a correctness bug. Your five categories are all 'the input the author never imagined.' There's a sixth hiding in the same blind spot, except it flips from correctness to security: the input an attacker imagines. Same mechanism you named -- the test is one honest user, so it passes -- but now the missing case isn't a duplicate row, it's a second user reading the first one's data. The AI writes an access check that trusts a client-supplied id, or an RLS policy that's enabled but set to USING (true), and every checkmark stays green because no test in the suite ever logs in as someone else. Functional, correct for you, wide open to anyone who isn't you. Your 'write the test for the case the model dodged' is the exact fix here -- the security version is just 'write the test as a different user.' The adversary-model pass is the right shape too; the one thing I'd add is that for the security slice the adversary has to be told to log in as the wrong person, because attacking the logic and attacking the trust boundary surface different bugs. I built a small scanner that does this categorization for the Supabase and auth slice specifically -- if pointing it at a repo would be useful, happy to run it for free and send back whatever it finds, no strings.