DEV Community

Enjoy Kumawat
Enjoy Kumawat

Posted on

My AI Wrote Code That Passed Every Test and Was Still Wrong

The scariest bug I shipped this year came from code that did everything right. It compiled. It passed the tests. The linter was happy. The PR looked clean. My AI agent wrote it, I skimmed it, it worked in the demo — and it was still wrong in a way that none of those green checkmarks could see.

Here's the function, lightly disguised:

def load_env(path):
    for line in open(path, encoding="utf-8"):
        line = line.strip()
        if "=" in line and not line.startswith("#"):
            k, v = line.split("=", 1)
            os.environ[k] = v.strip().strip('"').strip("'")
Enter fullscreen mode Exit fullscreen mode

Looks fine, right? Parses a .env, strips quotes, skips comments. It passed my test, which had one clean KEY=value line. It worked when I ran it. And it clobbered an environment variable I'd already exported in my shell, because it used os.environ[k] = v instead of os.environ.setdefault(k, v). The test never caught it because the test didn't model the one situation that mattered: a variable that already existed.

The code was functional. It was not correct. Those are different words and AI-generated code is exactly where the gap between them gets dangerous.

Why "it works" lies to you with a straight face

When I write a sloppy function, I usually know it's sloppy. I felt the shortcut. There's a little voice. AI-generated code arrives with none of that. It's confident, idiomatic, well-formatted, and carries zero signal about which parts the model actually reasoned through versus pattern-matched from a million GitHub repos.

That polish is the trap. A human's rough draft looks rough, so you review it carefully. The AI's draft looks finished, so you review it the way you review finished code — for style, not for truth. And "does it run" answers a much easier question than "does it do the right thing when the input is hostile, empty, duplicated, or already in a weird state."

The model optimizes for plausible. Plausible and correct overlap a lot. The whole problem lives in the gap where they don't.

Where the gap actually hides

After getting burned a few times, I started noticing the same categories over and over. AI code is reliably weakest at:

  • State that already exists. Overwriting instead of merging, creating instead of upserting, assuming the file/dir/key isn't there yet.
  • The empty and the duplicate. Zero rows, one row, the same row twice. The happy path is n >= 2 distinct items.
  • Idempotency. Run it twice — does it do the same thing, or double it?
  • Off-by-one at boundaries. Especially in slicing, pagination, and "strip the first line" logic.
  • Silent error swallowing. except: pass that turns a real failure into a wrong-but-quiet success.

None of these throw. None of them fail a naive test. They just produce the wrong answer, calmly, until production finds the edge case for you.

What I changed: review the spec, not the syntax

The fix wasn't "trust the AI less." It was changing what I review for. I stopped reading AI code to check that it's well-written — it almost always is — and started reading it to find the input the author never imagined.

Three things that stuck:

1. Make the AI write the test for the case it's avoiding. When I get a function, I don't ask "is this right?" I ask the agent to write the test for the nasty input:

def test_load_env_does_not_clobber_existing():
    os.environ["DEV_TO_API"] = "real-key-from-shell"
    load_env(env_file_containing("DEV_TO_API=placeholder"))
    assert os.environ["DEV_TO_API"] == "real-key-from-shell"  # FAILS with os.environ[k] = v
Enter fullscreen mode Exit fullscreen mode

That test turns an invisible correctness bug into a visible, red, functional failure. Now my safety net actually models the danger.

2. Use a second model as an adversary, not a cheerleader. I run a separate review pass whose entire job is to attack the code — "what input breaks this?" — not to approve it. The author-model and the reviewer-model are different lanes on purpose. A model grading its own homework finds nothing; a model told to break someone else's code finds plenty. (In this repo I literally have a different tool review what the first one wrote.)

3. Treat "passed the tests" as the floor, not the ceiling. Green tests mean the cases I thought of work. They say nothing about the cases I didn't. With AI code, the cases I didn't think of are precisely where the model also didn't think — because it was pattern-matching my framing, not the real world.

The one-line fix, and the real lesson

The actual patch was tiny:

os.environ.setdefault(k, v.strip().strip('"').strip("'"))  # don't clobber a live env var
Enter fullscreen mode Exit fullscreen mode

One method. But I'd never have found it by asking "does this work," because it did work, right up until it ruined my afternoon.

Here's what I tell myself now every time an agent hands me a clean diff: functional is the AI's job; correct is still mine. The model got dramatically better at producing code that runs. It did not get better at understanding what I actually meant. That gap doesn't show up in the test output — it shows up three weeks later in a bug report. Review for the meaning, write the test for the case the model dodged, and let a second model try to tear it apart before reality does.

Top comments (0)