DEV Community

Dimitris Kyrkos

Functional doesn't mean correct. That's the biggest risk with AI-generated code.

The code runs. That's not the question.

There's a failure mode with AI-generated code that's harder to catch than ordinary bugs, security holes, or performance problems. The code works. The interface looks right. The tests pass. And the system quietly solves the wrong problem.

This is different from broken code. Broken code announces itself. It throws errors, fails tests, crashes in production. You find it and fix it. The feedback loop is fast.

Code that's functional but wrong is silent. It runs perfectly while misunderstanding the actual requirement. And because it looks clean and passes every automated check, it can live in production for months before someone notices it's doing the wrong thing confidently.

Why this happens more with AI

When a human writes code, the act of building forces engagement with the requirement. You read the spec, you think about it, you translate it into logic. Sometimes you realize halfway through that the requirement doesn't make sense, or that there's an edge case the spec didn't cover, or that what the client asked for isn't what they actually need. That friction is valuable. It's where misunderstandings surface.

AI skips all of that. You prompt it, and it produces output that structurally matches what you described. But "structurally matches the prompt" and "solves the real problem" are very different things. The AI doesn't know your business context. It doesn't know that "calculate the discount" means something different for wholesale customers than for retail ones. It doesn't know that "send a notification" shouldn't happen during a maintenance window. It doesn't know that the requirement as written is actually wrong and that a human would have flagged it.

The output looks right because the code is well-formed. The output is wrong because the intent behind the code was never verified.

The specific ways this shows up

The requirement gets interpreted literally. You ask for a search function and the AI builds one that matches exact strings. The actual users expect fuzzy matching, typo tolerance, and synonym handling. The code works perfectly. It's just not what anyone needed.
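Here's a minimal sketch of that gap. The catalog and both function names are made up for illustration, and Python's standard-library difflib stands in for typo tolerance only; a real system would more likely lean on a search engine, and synonym handling is not covered here.

```python
from difflib import get_close_matches

# Hypothetical catalog, for illustration only.
CATALOG = ["wireless keyboard", "wireless mouse", "usb-c charger"]


def search_literal(query: str) -> list[str]:
    """A literal reading of 'build a search function': exact matches only."""
    return [item for item in CATALOG if item == query]


def search_tolerant(query: str) -> list[str]:
    """Closer to what users expect: small typos still find the item."""
    return get_close_matches(query.lower(), CATALOG, n=3, cutoff=0.8)


typo_query = "wireles keybord"
print(search_literal(typo_query))   # nothing found, and no error either
print(search_tolerant(typo_query))  # finds the keyboard despite the typos
```

Both functions "work." Only one of them matches what the users of the search box were expecting.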

Business rules get flattened. The AI implements the rule as stated in the prompt but misses the exceptions that everyone on the team knows about but nobody wrote down. A pricing function that doesn't account for the grandfather clause on legacy accounts. A permissions check that doesn't know about the temporary elevated access your support team uses during escalations.
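To make the grandfather-clause case concrete, here's a rough sketch. The Account type, its fields, and both function names are hypothetical; the point is only that the first function is a faithful implementation of the prompt and still wrong for legacy accounts.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class Account:
    plan_price: Decimal
    # Hypothetical field: the price a legacy account was promised, if any.
    grandfathered_price: Optional[Decimal] = None


def monthly_price_as_prompted(account: Account) -> Decimal:
    """Implements the rule exactly as the prompt stated it."""
    return account.plan_price


def monthly_price_with_exception(account: Account) -> Decimal:
    """Adds the rule nobody wrote down: legacy accounts keep their old price."""
    if account.grandfathered_price is not None:
        return account.grandfathered_price
    return account.plan_price


legacy = Account(plan_price=Decimal("29.00"), grandfathered_price=Decimal("19.00"))
print(monthly_price_as_prompted(legacy))     # charges the current plan price
print(monthly_price_with_exception(legacy))  # honors the grandfathered price
```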

Edge cases get the happy path treatment. The AI handles the common case well because that's what the prompt described. The uncommon cases, the ones that cause actual production incidents, get default behavior that technically doesn't crash but produces wrong results silently.
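One common shape of this, sketched with an invented currency-conversion helper (the rates, names, and fallback value are all assumptions for illustration): a default that keeps the code from crashing and quietly returns a wrong number instead.

```python
# Hypothetical exchange rates to USD, for illustration only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1}


def to_usd_silent(amount: float, currency: str) -> float:
    """Happy-path treatment: an unknown currency falls back to a rate of 1.0."""
    return amount * RATES_TO_USD.get(currency, 1.0)


def to_usd_strict(amount: float, currency: str) -> float:
    """Fails loudly when the rate is unknown instead of guessing."""
    try:
        rate = RATES_TO_USD[currency]
    except KeyError:
        raise ValueError(f"no exchange rate for {currency!r}") from None
    return amount * rate


# No crash, no log line, and 100 yen is now reported as 100 dollars.
print(to_usd_silent(100.0, "JPY"))
```

The strict version is less convenient, but it turns a silent wrong result back into the kind of failure that announces itself.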

Validation is the actual work

Vibe coding gives you speed. Validation gives you correctness. They're different things and one doesn't substitute for the other.

The teams handling this well do something boring but effective: they verify that the generated code solves the right problem before they verify that it solves it correctly. That means going back to the actual requirement, not the prompt, and asking whether the output matches what the business actually needs. Not what the prompt said. What the business needs. Those are often different.

Then they check the edge cases. Not the ones the AI tested for, the ones it couldn't know about because they live in the team's domain knowledge, not in the codebase.

Then they ask the question that matters most: could this code produce wrong results silently? Not crash, not throw errors, just quietly do the wrong thing and look fine on every dashboard. That's the failure mode that AI makes much more likely, and it's the one that most validation processes don't test for.

The uncomfortable bottom line

LLMs are very good at producing code that looks structurally right. They're also very good at producing code that confidently solves a problem you don't actually have. The gap between those two things is where engineering judgment lives.

AI didn't remove the need for that judgment. It made it the only thing standing between "the code runs" and "the system actually works."

How does your team validate that AI-generated code solves the right problem, not just any problem?

Top comments (4)

xulingfeng

The 'functional-but-wrong code is silent' line hit hard. We spent months chasing a 97.2% coverage report that looked beautiful — until someone asked if it was testing the right things. Turns out running code ≠ working system.

Dimitris Kyrkos

The 97.2% number is a perfect example because it creates exactly the kind of false confidence this post is about. Nobody questions a test suite with 97% coverage. But coverage answers "did this code execute," not "did this code do the right thing," and those are completely different questions. You can cover every line and still test the AI's interpretation of the requirement rather than the actual requirement. That's the trap: the metrics look great precisely because they're measuring the wrong thing, and the better they look, the less likely anyone is to dig deeper.

xulingfeng

Funny how 97.2% keeps finding its way into these conversations 😂 But you're right — the green bars look their best when they're measuring the wrong thing. Makes me wonder if there's a way to track "did it do what we actually asked" vs "did it run"

Dimitris Kyrkos

Ha, the 97.2% is becoming the unofficial mascot of this conversation. On tracking the difference, the closest thing I've seen work is specifying acceptance criteria in terms of business outcomes before generation, not after. Instead of "does this function return the right value," you write "given a wholesale customer with a legacy discount, the final price should be X." The first one tells you if the code ran; the second tells you if it did the right thing. The gap between those two is exactly where the silent failures live, and most test suites only cover the first one because that's what's easy to automate.
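As a rough sketch, that kind of criterion might look like the test below. Everything in it is an assumption made up for illustration: the Customer type, the final_price function, the 20% wholesale rate, the 5% legacy discount, and the expected 76.00. In practice the expected number comes from the business, written down before any code is generated.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical business rule, for illustration only.
WHOLESALE_DISCOUNT = Decimal("0.20")


@dataclass(frozen=True)
class Customer:
    segment: str               # "wholesale" or "retail"
    legacy_discount: Decimal   # fraction, e.g. Decimal("0.05") for 5%


def final_price(list_price: Decimal, customer: Customer) -> Decimal:
    """Applies the wholesale discount first, then the legacy discount on top."""
    price = list_price
    if customer.segment == "wholesale":
        price *= Decimal("1") - WHOLESALE_DISCOUNT
    price *= Decimal("1") - customer.legacy_discount
    return price.quantize(Decimal("0.01"))


def test_wholesale_legacy_customer_pays_agreed_price() -> None:
    # Given a wholesale customer with a legacy discount...
    customer = Customer(segment="wholesale", legacy_discount=Decimal("0.05"))
    # ...the final price on a 100.00 list price should be 76.00.
    assert final_price(Decimal("100.00"), customer) == Decimal("76.00")


test_wholesale_legacy_customer_pays_agreed_price()
```

The test name and its given/then comments read like the requirement, so a reviewer can check the criterion itself without reading the implementation.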