Dimitris Kyrkos

Posted on Jun 26

Functional doesn't mean correct. That's the biggest risk with AI-generated code.

#ai #webdev #programming #discuss

The code runs. That's not the question.

There's a failure mode with AI-generated code that's harder to catch than bugs, security holes, or performance problems. The code works. The interface looks right. The tests pass. And the system quietly solves the wrong problem.

This is different from broken code. Broken code announces itself. It throws errors, fails tests, crashes in production. You find it and fix it. The feedback loop is fast.

Code that's functional but wrong is silent. It runs perfectly while misunderstanding the actual requirement. And because it looks clean and passes every automated check, it can live in production for months before someone notices it's doing the wrong thing confidently.

Why this happens more with AI

When a human writes code, the act of building forces engagement with the requirement. You read the spec, you think about it, you translate it into logic. Sometimes you realize halfway through that the requirement doesn't make sense, or that there's an edge case the spec didn't cover, or that what the client asked for isn't what they actually need. That friction is valuable. It's where misunderstandings surface.

AI skips all of that. You prompt it, it produces output that structurally matches what you described. But "structurally matches the prompt" and "solves the real problem" are very different things. The AI doesn't know your business context. It doesn't know that "calculate the discount" means something different for wholesale customers than retail ones. It doesn't know that "send a notification" shouldn't happen during a maintenance window. It doesn't know that the requirement as written is actually wrong and a human would have flagged it.

The output looks right because the code is well-formed. The output is wrong because the intent behind the code was never verified.

The specific ways this shows up

The requirement gets interpreted literally. You ask for a search function and the AI builds one that matches exact strings. The actual users expect fuzzy matching, typo tolerance, and synonym handling. The code works perfectly. It's just not what anyone needed.
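As a rough illustration of that gap, here is a minimal sketch using only Python's standard library. The product list and both function names are hypothetical, and `difflib` only stands in for typo tolerance; it does nothing for synonyms.

```python
from difflib import get_close_matches

# Hypothetical catalogue, invented for this sketch.
PRODUCTS = ["wireless keyboard", "wireless mouse", "usb-c charger"]


def search_exact(query: str) -> list[str]:
    """Literal reading of 'build a search function': exact string match only."""
    return [p for p in PRODUCTS if p == query]


def search_tolerant(query: str, cutoff: float = 0.6) -> list[str]:
    """Closer to what users expect: small typos still find the product.

    Results are ordered from most to least similar.
    """
    return get_close_matches(query.lower(), PRODUCTS, n=3, cutoff=cutoff)


# A user types a query with two typos.
print(search_exact("wireles keybord"))     # returns []
print(search_tolerant("wireles keybord"))  # best match comes first
```

Both functions "work" in the sense that they run and return a list. Only one of them matches what the people using the search box expect, and nothing in the literal prompt says which.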

Business rules get flattened. The AI implements the rule as stated in the prompt but misses the exceptions that everyone on the team knows about but nobody wrote down. A pricing function that doesn't account for the grandfather clause on legacy accounts. A permissions check that doesn't know about the temporary elevated access your support team uses during escalations.

Edge cases get the happy path treatment. The AI handles the common case well because that's what the prompt described. The uncommon cases, the ones that cause actual production incidents, get default behavior that technically doesn't crash but produces wrong results silently.
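A small sketch of what "default behavior that doesn't crash" can look like. The rates table and both function names are assumptions made up for this example, not taken from any real system.

```python
# Hypothetical, illustrative exchange rates.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1}


def to_usd_silent(amount: float, currency: str) -> float:
    """Happy-path style: an unknown currency falls back to a rate of 1.0.

    Never crashes, but quietly treats 1000 JPY as 1000 USD.
    """
    return amount * RATES_TO_USD.get(currency, 1.0)


def to_usd_loud(amount: float, currency: str) -> float:
    """Same logic, but an unknown currency is an error instead of a guess."""
    if currency not in RATES_TO_USD:
        raise ValueError(f"no exchange rate configured for {currency!r}")
    return amount * RATES_TO_USD[currency]
```

The silent version passes any test that only uses the currencies in the prompt. The loud version turns the uncommon case back into broken code, which at least announces itself.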

Validation is the actual work

Vibe coding gives you speed. Validation gives you correctness. They're different things and one doesn't substitute for the other.

The teams handling this well do something boring but effective: they verify that the generated code solves the right problem before they verify that it solves it correctly. That means going back to the actual requirement, not the prompt, and asking whether the output matches what the business actually needs. Not what the prompt said. What the business needs. Those are often different.

Then they check the edge cases. Not the ones the AI tested for, the ones it couldn't know about because they live in the team's domain knowledge, not in the codebase.

Then they ask the question that matters most: could this code produce wrong results silently? Not crash, not throw errors, just quietly do the wrong thing and look fine on every dashboard. That's the failure mode that AI makes much more likely, and it's the one that most validation processes don't test for.

The uncomfortable bottom line

LLMs are very good at producing code that looks structurally right. They're also very good at producing code that confidently solves a problem you don't actually have. The gap between those two things is where engineering judgment lives.

AI didn't remove the need for that judgment. It made it the only thing standing between "the code runs" and "the system actually works."

How does your team validate that AI-generated code solves the right problem, not just any problem?

Top comments (67)

UnitBuilds • Jun 26

Yip... And that's the fatal flaw: a unit test that passes doesn't tell you how the code behaves in production unless it's explicitly written to test all the edge cases, which you generally don't do, because a 600 LOC unit test for a single function is bloat. Do that for an entire codebase and you have more test code than production code; it's just not feasible. I had a case in the early days of developing V.A.L.I.D. where it would create a unit test that passed, then I'd run the app and get errors everywhere. It had written the tests based on the boilerplate it generated instead of the variables I gave it, so it was a self-gratifying system that just made pointless tests. That also pushed the need for constraints and the fuzzer: constraints so it has a range to test in, and the fuzzer to test whether it breaks in production. Now it's rock solid. I even built an entire autonomous accounting suite on it and saved having to write the 400k LOC that it generated for me (82% generated, to be exact, from a 500k LOC codebase). That's exactly where it should be: write all the unique stuff, generate all the things where mistakes could happen, but properly guarded so they don't surface in production, and before anything is shipped, run the fuzzer to test every single UI element and make sure the rules and relations don't break.

Dimitris Kyrkos • Jun 26

The "self-gratifying system" description is the most honest way I've heard anyone describe AI-generated tests. That's exactly the failure mode: the AI tests its own assumptions, the tests pass, and you have a green dashboard that proves nothing about production. Your fix of adding constraints and fuzzing is the right architecture because it introduces an adversarial element the AI can't game. The AI generates the code, but the fuzzer doesn't care what the AI intended, it just tries to break things. That separation between "code author" and "code breaker" is what most teams are missing. The 82% generated ratio is interesting too, because at that volume you literally cannot review everything manually. That's why the automated guardrails aren't optional at that scale: they're the only thing between you and shipping 400k lines of unverified code.
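For readers who haven't used this pattern, here is a minimal sketch of the "constraints plus fuzzer" idea in plain Python. It is not V.A.L.I.D. or its Roslyn generator; `invoice_total`, the input ranges, and the invariants are all assumptions invented for illustration.

```python
import random


def invoice_total(lines: list[tuple[int, int]], discount_pct: int) -> int:
    """Hypothetical code under test: total in cents after a percentage discount."""
    subtotal = sum(qty * unit_cents for qty, unit_cents in lines)
    return subtotal - subtotal * discount_pct // 100


def fuzz_invoice_total(runs: int = 10_000, seed: int = 0) -> int:
    """Throw random in-range inputs at the function and check invariants,
    not hand-picked expected values. Returns the number of cases checked."""
    rng = random.Random(seed)  # seeded so a failure can be reproduced
    for _ in range(runs):
        # Constraints: up to 20 lines, quantities 0..1000, prices 0..100000 cents.
        lines = [(rng.randint(0, 1_000), rng.randint(0, 100_000))
                 for _ in range(rng.randint(0, 20))]
        discount = rng.randint(0, 100)  # constraint: 0..100 percent
        subtotal = sum(qty * cents for qty, cents in lines)
        total = invoice_total(lines, discount)

        # Invariants that must hold whatever the author of the code intended.
        assert 0 <= total <= subtotal, (lines, discount, total)
        if discount == 0:
            assert total == subtotal
        if discount == 100:
            assert total == 0
    return runs


print(fuzz_invoice_total())  # prints 10000 if no invariant was violated
```

The point of the separation is that the invariants come from the business rules, not from the generated code, so the generator has nothing to copy its own assumptions into.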

UnitBuilds • Jun 26

Exactly. Coca-Cola doesn't have to verify each bottle manufactured; they do spot checks and they make damn sure their production line is flawless. With that mentality you can produce 400k LOC of generated code, because you trust your Roslyn generator, and instead of testing 1-10 bottles you run the fuzzer to blast it for a minute at 3600 multi-dependency mutations a second. That way you know without a doubt that it's unbreakable, because it ran a year's usage in a minute. Nobody really cares that it can run that fast; for them it just means it takes less than a ms for their screen to update the total when they enter a value. But the reason it works that fast is what makes the architecture production ready.

Dimitris Kyrkos • Jun 26

The Coca-Cola analogy is solid. Trust the production line, verify the output statistically rather than individually. At 400k generated lines that's really the only model that works because inspecting every bottle is impossible at that volume. The speed being a side effect of the architecture rather than the goal is a good insight too, you didn't optimize for fast, you optimized for correct, and fast came for free because the architecture had no waste in it.

UnitBuilds • Jun 26

Exactly, if you can be certain it will do it right 100 times, you know it can do it perfectly a million times. So optimize once, test twice, and use forever.

Dimitris Kyrkos • Jun 26

That's a good engineering principle in one sentence. Solid thread.

xulingfeng • Jun 26

The 'functional-but-wrong code is silent' line hit hard. We spent months chasing a 97.2% coverage report that looked beautiful — until someone asked if it was testing the right things. Turns out running code ≠ working system.

Dimitris Kyrkos • Jun 26

The 97.2% number is a perfect example because it creates exactly the kind of false confidence this post is about. Nobody questions a test suite with 97% coverage. But coverage answers "did this code execute" not "did this code do the right thing," and those are completely different questions. You can cover every line and still test the AI's interpretation of the requirement rather than the actual requirement. That's the trap: the metrics look great precisely because they're measuring the wrong thing, and the better they look the less likely anyone is to dig deeper.
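A tiny sketch of how full coverage and a wrong result can coexist. The shipping rule, the function, and both checks are hypothetical, made up for this example.

```python
def shipping_fee(order_total: float) -> float:
    """Assumed business rule: orders of 50.00 or more ship free.

    The implementation uses > instead of >=, so an order of exactly
    50.00 is still charged. That is the silent bug.
    """
    if order_total > 50:
        return 0.0
    return 4.99


def checks_derived_from_code() -> bool:
    """Executes both branches, so line and branch coverage are 100%."""
    return shipping_fee(80) == 0.0 and shipping_fee(10) == 4.99


def check_derived_from_requirement() -> bool:
    """Asks about the boundary the business rule actually names."""
    return shipping_fee(50) == 0.0


print(checks_derived_from_code())        # True: the coverage report is green
print(check_derived_from_requirement())  # False: the rule is still violated
```

The first check was written by looking at the code, so it inherits the code's interpretation. The second was written by looking at the rule, and it is the only one that notices.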

xulingfeng • Jun 26

Funny how 97.2% keeps finding its way into these conversations 😂 But you're right — the green bars look their best when they're measuring the wrong thing. Makes me wonder if there's a way to track "did it do what we actually asked" vs "did it run"

Dimitris Kyrkos • Jun 26 • Edited

Ha the 97.2% is becoming the unofficial mascot of this conversation. On tracking the difference, the closest thing I've seen work is specifying acceptance criteria in terms of business outcomes before generation, not after. Instead of "does this function return the right value" you write "given a wholesale customer with a legacy discount, the final price should be X." The first one tells you if the code ran, the second tells you if it did the right thing. The gap between those two is exactly where the silent failures live, and most test suites only cover the first one because that's what's easy to automate.
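To make that concrete, here is a sketch of one acceptance criterion written as a business outcome. The discount percentages, the `Customer` shape, and the 76.00 figure are all invented for illustration; the real rules would come from whoever owns pricing.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Customer:
    segment: str           # "retail" or "wholesale" (hypothetical values)
    legacy_discount: bool  # grandfathered account flag (hypothetical)


def final_price(list_price: Decimal, customer: Customer) -> Decimal:
    """Hypothetical pricing rules, invented for this sketch:
    wholesale gets 20% off, legacy accounts get a further 5% off."""
    price = list_price
    if customer.segment == "wholesale":
        price *= Decimal("0.80")
    if customer.legacy_discount:
        price *= Decimal("0.95")
    return price.quantize(Decimal("0.01"))


def acceptance_wholesale_legacy() -> bool:
    """Given a wholesale customer with a legacy discount and a 100.00
    list price, the final price should be 76.00."""
    customer = Customer(segment="wholesale", legacy_discount=True)
    return final_price(Decimal("100.00"), customer) == Decimal("76.00")


print(acceptance_wholesale_legacy())  # True
```

The docstring of the acceptance check is the part a product person can read and dispute, which is the whole reason to write it before generation rather than after.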

xulingfeng • Jun 26

Haha tell me about it — 97.2% is basically the unofficial mascot of this whole account now 😂
The acceptance criteria point is solid. Biggest thing is: a lot of teams don't skip writing business outcomes because they're lazy — they skip it because they genuinely don't know how to turn 'user should feel supported' into a test assertion. The whole thing falls apart before it even starts.
Do you find it's a training thing or a process thing when you see teams miss that gap?
Also, the emojis at the end cracked me up 🤣

Dimitris Kyrkos • Jun 26

Both but mostly a language problem. Engineers think in code and product people think in feelings, and nobody in the middle translates "user should feel supported" into something testable. The teams where this works have someone, usually a senior dev or a pragmatic PM, who can take a vague business outcome and break it into concrete observable behaviors. "User should feel supported" becomes "support request gets a response within 2 hours, response includes at least one actionable next step, user doesn't reopen the same ticket within 7 days." Now you have three things you can actually measure. That translation skill is rare and it's not taught in bootcamps or CS programs, people pick it up from years of watching vague requirements turn into production incidents. Process helps because you can mandate the translation step, but without at least one person on the team who's good at it the process just produces bad acceptance criteria instead of no acceptance criteria.
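Those three observable behaviors translate fairly directly into checks. A minimal sketch, assuming a hypothetical `Ticket` record, and assuming "within 7 days" is measured from when the ticket was closed:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Ticket:
    """Hypothetical support ticket record; field names are assumptions."""
    opened_at: datetime
    first_response_at: Optional[datetime]
    response_has_next_step: bool
    closed_at: Optional[datetime] = None
    reopened_at: Optional[datetime] = None


def supported_checks(ticket: Ticket) -> dict[str, bool]:
    """'User should feel supported' broken into three measurable checks."""
    responded_in_time = (
        ticket.first_response_at is not None
        and ticket.first_response_at - ticket.opened_at <= timedelta(hours=2)
    )
    reopened_quickly = (
        ticket.closed_at is not None
        and ticket.reopened_at is not None
        and ticket.reopened_at - ticket.closed_at <= timedelta(days=7)
    )
    return {
        "response_within_2h": responded_in_time,
        "actionable_next_step": ticket.response_has_next_step,
        "not_reopened_within_7d": not reopened_quickly,
    }
```

Whether a response really contains an actionable next step still needs a human or a reviewed heuristic to decide; the sketch only records the answer as a boolean.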

xulingfeng • Jun 26

I hear you. Let's be real — most of what they teach in school and bootcamps is just rigid templates. The actual job throws you curveballs that none of those templates ever prepared you for. You learn that instinct by falling on your face enough times. Some people are born with it, sure. Some people never learn regardless of how many times they fall. Really enjoying this conversation, learned a lot 🤝

Dimitris Kyrkos • Jun 26

Likewise. And you're right, the instinct comes from the scars. No shortcut for that part. Good thread, catch you on the next one.

Dirk Mattig • Jun 29

Your post highlights the inconvenient truth that the old and new main culprits are more often than not sloppy or nonexistent requirements. The road to hell is paved with good intentions - and implicit assumptions. Humans and machines have one thing in common: they are not mindreaders.

In your example, "search function" is not a requirement; it is a feature that needs a requirement specification. Software developers have a long tradition of filling in the requirement gaps, but it has always been a double-edged sword in my view. Developers tend to think they always know what a customer really needs, but that does not mean the customer will always agree. It does not even mean a customer is always grateful for the feedback. You could also be perceived as a troublemaker. I know what I am talking about 😉

So I think it is actually a good thing that AI takes requirements literally, because not only does it give you incredible steering power, but it also highlights the gaps in the requirements.

Dimitris Kyrkos • Jun 30

The "AI taking requirements literally is actually a good thing" reframe is interesting and I think you're partially right. It does surface gaps faster than a human developer who silently fills them in with assumptions. The problem is that a human filling in gaps with wrong assumptions at least produces a conversation when the result doesn't match expectations. AI filling in gaps with literal interpretation produces code that looks finished, so nobody realizes there was a gap at all until production exposes it. The troublemaker point is real though and I've lived it too. The developer who says "this requirement doesn't make sense" is doing the most valuable work on the team and is often the least appreciated for it. Maybe the best use of AI in this context is as the diplomatic troublemaker: "here's what I built from your spec, and here are the five questions your spec didn't answer that I had to guess on." That would surface the gaps without the politics.

TxDesk • Jun 27

the silent-wrong framing is right, and the part i'd add is that the green checkmark has the same failure mode as the code it's checking. a test passing tells you the code does what the test says, not what the requirement says, and those drift apart exactly the way the code and the intent do. so you can end up trusting a number (coverage, pass rate) for the same reason you trusted the clean-looking code: it looks like correctness without being tied to it.

i hit the inverted version of this today. a defect tracker said two things were broken; the tests said they'd pass; the truth was in the code, which had been fixed weeks earlier and the tracker just never caught up. functional-but-wrong and broken-but-actually-fine are the same disease, a status signal that was never re-derived against the real thing it's supposed to represent.

which is why the "verify it solves the right problem before you verify it solves it correctly" line is the whole game. the only check that doesn't drift is the one that goes back to the requirement itself. everything downstream of that, tests, dashboards, reviewer scores, is a proxy, and proxies rot quietly. how do you keep the requirement itself from being the thing that's stale? that's the layer under yours i keep running into.

Dimitris Kyrkos • Jun 29

The "broken-but-actually-fine" inversion is a great catch and you're right that it's the same disease. A status signal detached from the thing it claims to represent. Proxies rot quietly is a good way to put it.

On your question about stale requirements, honestly nobody has fully solved this but the least bad approach I've seen is tying requirements to observable behavior rather than documentation. A written requirement goes stale the moment someone changes the system and doesn't update the doc. But a requirement expressed as "given this input, this specific business outcome should happen" can be re-validated against the running system at any time. The requirement becomes executable rather than descriptive. It still needs a human to define it correctly in the first place but at least staleness becomes detectable because the check fails when the system drifts, instead of the doc sitting there looking accurate while the system quietly diverged six months ago. The layer under that, making sure the executable requirement itself still reflects what the business actually needs, is a human conversation and I don't think there's an automated answer for it. Someone has to periodically ask "do we still want this to be true" and that's a process discipline, not a tooling problem.

TxDesk • Jun 30

executable-not-descriptive is the right answer, and it's the same move one level down: an executable requirement is a proxy that re-derives against the running system instead of sitting static, which is exactly why it doesn't rot the way a doc does. it's tied to the thing it represents. the doc detached the moment someone changed the system; the executable check can't, because drift makes it fail.

and your last point is the honest floor: even the executable requirement is a proxy for intent, and intent is the one layer nothing automated reaches. "do we still want this to be true" can't be checked against the system because it's a question about what the system should be, not what it is. so the chain bottoms out in a human asking that on a schedule. which is maybe the real takeaway: you can push the rot further and further down by making each layer re-derivable, but the bottom layer is always someone deciding it still matters. good thread, this one clarified a lot for me.

Dimitris Kyrkos • Jun 30

You just described the entire proxy chain more clearly than the post did. Every layer can be made re-derivable against the layer below it, except the bottom one, which is a human deciding it still matters. That's the constraint and there's no engineering around it, only discipline through it. Good thread, learned from this one too.

TxDesk • Jun 30

that's the whole thing: re-derive every layer you can, and respect that the bottom one is a choice, not a check. discipline through it, not around it, well put. good thread, this one sharpened it for me too. catch you on the next piece.

Dimitris Kyrkos • Jun 30

See you there.

Yurii Cherkasov • Jul 5

What I've eventually learned is to force the AI to build within its own guardrails.

Not just unit tests or prompt-level guidelines, but deterministic structural checks that make entire classes of mistakes impossible. Think of them as hooks that continuously verify architectural and coding constraints.

It's hard to share a complex example, though I certainly have some, but here are a few simple ones:

Import from an internal/private module -> fail
Use a deprecated API -> fail
Introduce a circular dependency or a "quick fix" local import -> fail
Violate layering or architectural boundaries -> fail
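As a sketch of the first rule in that list, here is a deterministic check built on Python's standard `ast` module. The definition of "private" used here (a path component that starts with a single underscore, or a component named `internal`) is an assumption for illustration; a real project would encode its own convention.

```python
import ast


def is_private(part: str) -> bool:
    """Assumed convention: '_name' or 'internal' marks a private component.
    Dunder names such as '__future__' are not treated as private."""
    return part == "internal" or (part.startswith("_") and not part.startswith("__"))


def find_private_imports(source: str) -> list[str]:
    """Return every imported module path that contains a private component."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for 'from . import x'
        else:
            continue
        for name in names:
            if any(is_private(part) for part in name.split(".")):
                violations.append(name)
    return violations


# Hypothetical generated code to check.
generated = """
import os
from billing._ledger import post_entry
import payments.internal.gateway
"""
violations = find_private_imports(generated)
if violations:
    print("FAIL: private imports found:", violations)
```

Wired into a pre-commit hook or CI step, a non-empty result becomes the hard fail described above. The other rules in the list (deprecated APIs, circular dependencies, layering) need more machinery than this sketch shows.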

At that point, the agent stops behaving like a clever burglar trying to satisfy the prompt at any cost and starts behaving more like an engineer working under supervision.

Once those guardrails exist, you combine deterministic checks with agentic ones, because some checks can't be strictly formalized. You can still make them increasingly hard to get past, because a result that isn't supported by the test metrics doesn't get through.

The downside is that these guardrails aren't going to write themselves. You only discover what needs to be enforced after watching agents solve real tasks for a while. Every recurring "clever monkey fix" eventually becomes another deterministic rule, linter, static analysis check, or architectural test.

In my experience, that's the real feedback loop of agentic development: observe recurring failure modes, formalize them, automate their detection. Over time, the space in which the AI can silently solve the wrong problem becomes smaller.

Dimitris Kyrkos • Jul 6

The "observe recurring failure modes, formalize them, automate their detection" loop is the right development process for guardrails and I think most teams skip it because they try to design the rules upfront instead of letting them emerge from actual agent behavior. Your point about the clever burglar is accurate too, without constraints the AI will find whatever path satisfies the prompt with the least effort, and that path is almost never the architecturally sound one. The layering violation check is a good example because that's one a human developer would catch through experience ("we don't import from that module") but the AI has no concept of unless you make it a hard fail. The honest tradeoff you named at the end is the part most people skip over: the guardrails cost real engineering time to build and maintain, and you can only build them after you've watched the agent fail enough times to know what to prevent. There's no shortcut for that.

Yurii Cherkasov • Jul 6 • Edited

But this is where AI can also help.

Building proper guardrails often starts with understanding the structure of the codebase: imports, dependencies, call graphs, layering, ownership boundaries, public vs. private APIs, and so on. Once you go deeper, you quickly end up in ASTs, formal analysis, compiler tooling, and other areas that are proper CS topics.

But that does not mean you have to build the whole thing alone.

First, the underlying tools for formal code analysis already exist: Flake8 for Python, Clang LibTooling with AST matchers for C++, Error Prone for Java. Use them as a starting point, but the whole team has to agree on the workflow first.

Then use AI to help design the fence it will later have to obey. Ask it to analyze recurring failure patterns, suggest static checks, generate AST-based rules, write custom linters, build dependency-graph validators, or explain which parser/compiler APIs are appropriate for your language.

Dimitris Kyrkos • Jul 6

Using the AI to help build its own constraints is a good practical shortcut. It's faster at generating AST rules and custom linter configs than writing them from scratch, and it's one of the few cases where the AI's lack of ego works in your favor since it won't push back on building a rule that restricts its own behavior. The key part of your answer is "the whole team has to agree on the workflow first" because the tooling layer is the easy part. The hard part is getting developers to commit to treating guardrail failures as hard stops rather than suggestions they override when they're in a hurry. Best tools in the world don't help if the team has a culture of skipping the check when it's inconvenient.

algorhymer • Jun 26

Functional does not mean correct!

Yes! EXACTLY!
Imperative can be correct too, and more efficient.
Also, overuse of monads or ML-style side effects in functional code simply makes it look functional, but it becomes imperative in reality.

Yes. Good post.
Devs are very good at producing code that looks structurally right. They're also very good at producing code that confidently solves a problem you don't actually have. The gap between those two things is where Quality Assurance lives. Devs barely understand what they are doing, so you need huge teams of testers to babysit each and every little ticket.

QAs are very good at producing tests that look structurally right. They're also very good at producing tests that confidently cover an infinite graph you don't actually have. The gap between those two things is where Mathematics lives. QAs barely understand what they are doing, so you need huge teams of theorists to babysit each and every little QA Institution and training center.

Mathematicians are very good at producing proofs that look structurally right. They're also very good at producing conundrums you don't actually have. The gap between those two things is where tentacle-armed unimaginable multidimensional space horrors live. Mathematicians barely understand what they are doing, so you need huge teams of tentacle-armed unimaginable multidimensional space horrors to babysit each and every little planckian frame of this Big Brother show called Universe, which we are streaming now in HD Dimension to our customers.

Dimitris Kyrkos • Jun 29

Ha the turtles-all-the-way-down escalation from devs to QAs to mathematicians to cosmic tentacle horrors is a journey I did not expect this comment section to take. To clarify though, the "functional" in the title means "the code functions, it runs, it works" not functional-vs-imperative as a paradigm. The point is that code can execute correctly and still solve the wrong problem, regardless of whether you wrote it in Haskell or assembly. But I appreciate the interdimensional QA framework, will update our review checklist accordingly.

algorhymer • Jun 29

'not functional-vs-imperative as a paradigm'

That part of my comment was a joke.
I use jokes for testing System 1 vs. System 2 capabilities of both the carbon and the silicon based workforce.

All other parts of my comment were not jokes.
They were a rebuttal of your post's naively optimistic main conclusion.
If we cannot vouch for our own code, as the large corpus of CVEs and meme level left-pad facepalm blunders show, then what chance do we have if throughput is increased?

One can argue pro/con for LLMs' capabilities when it comes to code generation.
For me, that's a moot point.
My argument is that we humans were outgunned by sheer volume of 'ready for testing' tickets since the 1960s.
Due to bro culture, rampant egoism and financial interests, this was not openly acknowledged by the programming and corporate community.
We are still in the denial phase. LLMs simply made the symptom more pronounced, but they did not cause it.
To my surprise, we are still in the denial phase.

But more importantly as I showed: We even lost our humor. Which is sad.

If life seems jolly rotten
There's something you've forgotten
And that's to laugh and smile and dance and sing

Theo Valmis • Jun 29

The 'functional but wrong' failure mode is the one that should scare teams most, because it passes every gate we built to catch the loud failures. Broken code announces itself; code that confidently solves the wrong problem ships clean and lives for months. Your point about the friction of writing being where misunderstandings surface is key, the agent skips the part where a human would have flagged that the requirement itself was wrong. Tests prove the code runs. Nothing in the pipeline proves it does the right thing. That gap is where the next generation of checks has to live, verifying intent, not just behavior.

Dimitris Kyrkos • Jun 30

"Verifying intent, not just behavior" is a clean way to frame where the tooling gap is right now. The entire CI/CD pipeline was built to answer "does this work" and nobody built the equivalent pipeline for "is this right." That's the missing layer and I don't think it's a tooling problem, it's a process problem, because intent lives in humans not in code. The closest proxy is concrete acceptance criteria written before generation, but even that only catches the intent that someone thought to write down.

Mateo Ruiz • Jun 26

This is an important distinction that doesn't get discussed enough. AI-generated code can be syntactically correct, pass tests, and still miss the actual business intent. We've found that the biggest wins come from validating requirements and edge cases before focusing on implementation quality. It's a pattern we see often while building production AI applications at IT Path Solutions. The engineering challenge isn't just generating code; it's making sure it solves the right problem.

Dimitris Kyrkos • Jun 26

Agreed. Requirements validation before implementation is where most teams underinvest, especially when AI makes the implementation step feel instant. Thanks for reading.

Ömer Berat Sezer • Jun 26

Really enjoyed this. I think the solution isn't just better prompting, it's better delegation. Give AI a complete task with clear success criteria, then review the outcome instead of every line of code.
I recently shared my thoughts on this in my blog post, Asking vs Delegating AI Agents 🧐

Dimitris Kyrkos • Jun 29

The delegation framing makes sense. Clear success criteria upfront changes the review from "read every line" to "did it meet the spec," which scales way better when the volume of generated code is high. Thanks for reading.

Kartik N V J K • Jun 26

The silent-wrong failure mode is the one I trust least, precisely because every automated check stays green. The friction you describe, realizing mid-build that the spec itself is wrong, is the part AI removes, and that friction was doing real validation work. I've started treating "passes tests" as evidence the code does something, never evidence it does the right thing.

Dimitris Kyrkos • Jun 29

"Passes tests as evidence the code does something, not evidence it does the right thing" is a clean reframe and I'm going to steal that for future conversations. The friction point is the part I keep coming back to as well. Nobody designed that friction on purpose, it was just a side effect of building being slow, and we only realized it was doing validation work after we removed it.
