The scariest bug I shipped this year came from code that did everything right. It compiled. It passed the tests. The linter was happy. The PR looked clean. My AI agent wrote it, I skimmed it, it worked in the demo — and it was still wrong in a way that none of those green checkmarks could see.
Here's the function, lightly disguised:
def load_env(path):
for line in open(path, encoding="utf-8"):
line = line.strip()
if "=" in line and not line.startswith("#"):
k, v = line.split("=", 1)
os.environ[k] = v.strip().strip('"').strip("'")
Looks fine, right? Parses a .env, strips quotes, skips comments. It passed my test, which had one clean KEY=value line. It worked when I ran it. And it clobbered an environment variable I'd already exported in my shell, because it used os.environ[k] = v instead of os.environ.setdefault(k, v). The test never caught it because the test didn't model the one situation that mattered: a variable that already existed.
The code was functional. It was not correct. Those are different words and AI-generated code is exactly where the gap between them gets dangerous.
Why "it works" lies to you with a straight face
When I write a sloppy function, I usually know it's sloppy. I felt the shortcut. There's a little voice. AI-generated code arrives with none of that. It's confident, idiomatic, well-formatted, and carries zero signal about which parts the model actually reasoned through versus pattern-matched from a million GitHub repos.
That polish is the trap. A human's rough draft looks rough, so you review it carefully. The AI's draft looks finished, so you review it the way you review finished code — for style, not for truth. And "does it run" answers a much easier question than "does it do the right thing when the input is hostile, empty, duplicated, or already in a weird state."
The model optimizes for plausible. Plausible and correct overlap a lot. The whole problem lives in the gap where they don't.
Where the gap actually hides
After getting burned a few times, I started noticing the same categories over and over. AI code is reliably weakest at:
- State that already exists. Overwriting instead of merging, creating instead of upserting, assuming the file/dir/key isn't there yet.
-
The empty and the duplicate. Zero rows, one row, the same row twice. The happy path is
n >= 2 distinct items. - Idempotency. Run it twice — does it do the same thing, or double it?
- Off-by-one at boundaries. Especially in slicing, pagination, and "strip the first line" logic.
-
Silent error swallowing.
except: passthat turns a real failure into a wrong-but-quiet success.
None of these throw. None of them fail a naive test. They just produce the wrong answer, calmly, until production finds the edge case for you.
What I changed: review the spec, not the syntax
The fix wasn't "trust the AI less." It was changing what I review for. I stopped reading AI code to check that it's well-written — it almost always is — and started reading it to find the input the author never imagined.
Three things that stuck:
1. Make the AI write the test for the case it's avoiding. When I get a function, I don't ask "is this right?" I ask the agent to write the test for the nasty input:
def test_load_env_does_not_clobber_existing():
os.environ["DEV_TO_API"] = "real-key-from-shell"
load_env(env_file_containing("DEV_TO_API=placeholder"))
assert os.environ["DEV_TO_API"] == "real-key-from-shell" # FAILS with os.environ[k] = v
That test turns an invisible correctness bug into a visible, red, functional failure. Now my safety net actually models the danger.
2. Use a second model as an adversary, not a cheerleader. I run a separate review pass whose entire job is to attack the code — "what input breaks this?" — not to approve it. The author-model and the reviewer-model are different lanes on purpose. A model grading its own homework finds nothing; a model told to break someone else's code finds plenty. (In this repo I literally have a different tool review what the first one wrote.)
3. Treat "passed the tests" as the floor, not the ceiling. Green tests mean the cases I thought of work. They say nothing about the cases I didn't. With AI code, the cases I didn't think of are precisely where the model also didn't think — because it was pattern-matching my framing, not the real world.
The one-line fix, and the real lesson
The actual patch was tiny:
os.environ.setdefault(k, v.strip().strip('"').strip("'")) # don't clobber a live env var
One method. But I'd never have found it by asking "does this work," because it did work, right up until it ruined my afternoon.
Here's what I tell myself now every time an agent hands me a clean diff: functional is the AI's job; correct is still mine. The model got dramatically better at producing code that runs. It did not get better at understanding what I actually meant. That gap doesn't show up in the test output — it shows up three weeks later in a bug report. Review for the meaning, write the test for the case the model dodged, and let a second model try to tear it apart before reality does.
Top comments (0)