Why 'AI-Generated Code is a Minefield' is Trending — And What 2 Months of Building a Static Scanner Taught Me

The top trending article on Qiita this week (Japan's largest dev community) is titled "AI-Written Pandas Code is Usually a Minefield." Hundreds of replies — and almost all of them boil down to the same observation: AI code looks correct, runs once locally, and then quietly breaks six weeks later in production.

I've been heads-down for two months building a static analysis scanner for AI-generated code — no LLM in the engine, just 93 deterministic rules across 14 categories. The trending discussion lines up almost exactly with what I keep seeing in the wild, so I want to write down what's actually going on.

What the discussion is really about

The Qiita thread starts with pandas, but the pattern generalizes. AI-generated code consistently fails in three places:

  • Silent data-type drift — code that worked on the toy DataFrame in the example does something subtly different on real-world data (see the sketch after this list).
  • Deprecated APIs — models trained on older tutorials happily emit calls that trigger warnings (or worse, return different types) on modern library versions.
  • Missing edge cases — happy-path code that quietly assumes no empty inputs, no NaN, no encoding issues.
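
To make the first one concrete, here's a minimal pandas sketch of dtype drift. It's not taken from the Qiita thread, just the canonical failure shape: a single missing value silently turns an integer column into floats.

```python
import io
import pandas as pd

# The toy data the AI example was written against: clean integers.
toy = pd.read_csv(io.StringIO("user_id,score\n1,10\n2,20"))
print(toy["score"].dtype)  # int64

# Real-world data with one missing score: pandas silently promotes
# the whole column to float64, so exact equality checks, merge keys,
# and downstream schemas all shift underneath you.
real = pd.read_csv(io.StringIO("user_id,score\n1,10\n2,\n3,30"))
print(real["score"].dtype)  # float64
```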

None of these are exotic. They're the kind of thing a senior reviewer would flag in five seconds. The problem is that AI-generated code increasingly doesn't go through review — it goes straight from chat window to commit.

Why another LLM can't fix this

The first version of my scanner used an LLM for the analysis pass. That felt like the obvious move. It failed for a reason that turned out to be structural, not a tuning problem.

I ran the same security-review prompt against the same Python file five times. I got five different verdicts. Three flagged a SQL-injection pattern. One missed it entirely. One hallucinated a path-traversal vulnerability that didn't exist.

That's not a reviewer. That's a coin flip with extra steps.

Tightening the prompt helped at the margin. Lowering temperature helped at the margin. But the fundamental property of an LLM is that it's a probability distribution over outputs, and a reviewer needs to satisfy a much harder property: same input, same verdict, every time. Without that, you can't gate a CI pipeline on it. You can't tell a developer "this is safe to merge." You just have a confident-sounding chatbot.

The fix wasn't a better prompt. It was ripping the LLM out of the analysis path entirely and rewriting the engine around AST parsing and pattern rules.
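
To show what "AST parsing and pattern rules" means in practice, here's a minimal sketch using Python's built-in ast module to flag one pattern: SQL assembled inside an f-string. This isn't one of the scanner's actual rules, but the shape is the same: parse once, walk the tree, report findings.

```python
import ast
import sys

SQL_KEYWORDS = ("select ", "insert ", "update ", "delete ", "drop ")

class FStringSqlRule(ast.NodeVisitor):
    """Flag f-strings whose literal text looks like SQL and that
    interpolate at least one expression: the classic injection shape."""

    def __init__(self):
        self.findings = []

    def visit_JoinedStr(self, node):
        # Collect the constant (non-interpolated) parts of the f-string.
        literal = "".join(
            part.value for part in node.values
            if isinstance(part, ast.Constant) and isinstance(part.value, str)
        ).lower()
        interpolates = any(isinstance(p, ast.FormattedValue) for p in node.values)
        if interpolates and any(kw in literal for kw in SQL_KEYWORDS):
            self.findings.append(node.lineno)
        self.generic_visit(node)

path = sys.argv[1]
rule = FStringSqlRule()
rule.visit(ast.parse(open(path).read(), filename=path))
for lineno in rule.findings:
    print(f"{path}:{lineno}: SQL composed via f-string")
```

Same file in, same findings out, every run. That's the property the LLM pass couldn't give me.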

What I keep finding in real AI-generated projects

When I scan AI-heavy repos, the same handful of patterns surfaces over and over:

  1. SQL composed via f-strings. The model picked up an old tutorial pattern and just kept emitting it.
  2. Hardcoded credentials — API keys and tokens dropped directly into source instead of read from environment.
  3. pickle for deserialization in contexts where the input could be untrusted, even when modern alternatives existed.
  4. Path operations without validation: os.path.join with user input passed straight through.

Each individual one is trivial to detect. The reason they keep slipping through is that AI-generated code runs. Local tests pass — because the local tests are often also AI-generated against the same flawed assumptions. The failure mode isn't "this code is obviously broken." It's "this code looks idiomatic until a real edge case shows up."
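
For reference, the safe counterparts are just as short. This is a generic sketch, not tied to any framework, and the variable names are illustrative:

```python
import json
import os
import sqlite3
from pathlib import Path

# 1. SQL: parameterized query instead of f-string interpolation.
user_id = "42"  # pretend this arrived in a request
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT)")
conn.execute("SELECT * FROM users WHERE id = ?", (user_id,))  # not f"...{user_id}"

# 2. Credentials: read from the environment, never hardcoded in source.
api_key = os.environ.get("API_KEY", "")  # variable name is illustrative

# 3. Deserialization: json for anything untrusted; pickle.loads can run code.
payload = json.loads('{"role": "user"}')

# 4. Paths: resolve and check containment instead of trusting os.path.join.
base = Path("/srv/uploads").resolve()
candidate = (base / "../../etc/passwd").resolve()
print(candidate.is_relative_to(base))  # False -> reject the request (Python 3.9+)
```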

What changed for me

For my own projects, every AI-generated commit now goes through static analysis before merge. Not LLM review — deterministic rule-based scanning. The check completes in roughly two seconds. It returns Pass or Fail. There is no probabilistic verdict to argue with.
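
In practice that gate is just a pre-commit hook or a CI step. Here's a sketch; codeheal-scan is a hypothetical command name, and the only contract that matters is the exit code.

```python
#!/usr/bin/env python3
# Sketch of a pre-commit hook. "codeheal-scan" is a hypothetical CLI;
# substitute whatever deterministic scanner you run. Exit 0 = Pass,
# nonzero = Fail, and the same diff always gets the same verdict.
import subprocess
import sys

staged = subprocess.run(
    ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
    capture_output=True, text=True, check=True,
).stdout.split()

py_files = [p for p in staged if p.endswith(".py")]
if not py_files:
    sys.exit(0)  # nothing to scan

result = subprocess.run(["codeheal-scan", *py_files])
sys.exit(result.returncode)  # nonzero blocks the merge
```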

That's also what I built CodeHeal to do for other devs — 14 categories, 93 detection rules, no signup required for the free tier (5 scans/day). Paste a snippet or point it at a repo and it will tell you what's there, deterministically, in seconds.

If you've been wondering why AI-generated code feels great in the editor and lousy a month later, the Qiita commenters have already named it. The fix isn't a better LLM. It's getting the LLM out of the gate.


Try it on your own AI-generated code: https://scanner-saas.vercel.app/scan
