Bala Paranj

Posted on May 29 • Edited on Jun 1

Fallacies of GenAI Development #2: If the Output Looks Correct, It Is Correct

#ai #softwaredevelopment #architecture #engineering

This is the second in a series of eight posts on the false assumptions teams make when building with generative AI. Fallacy #1 covered why faster code generation doesn't mean faster engineering. This post covers why code that looks right isn't necessarily right — and what right requires.

The Fallacy

"The AI-generated code compiles, passes tests, and reads well. Therefore it's correct."

Why it's tempting

You prompt an AI agent. It generates a function. The function compiles. The existing tests pass. You read through it — the variable names make sense, the logic follows a recognizable pattern, the error handling looks reasonable. A colleague glances at it during code review. Looks good to me. You merge.

The code LOOKS like it was written by someone who understood the problem. The structure is good. The naming is conventional. The patterns are familiar. Everything about it signals competence.

This is more dangerous than obviously bad code. Obviously bad code gets caught. Code that looks good gets waved through by reviewers, by CI pipelines, and by the developer's own judgment. The danger isn't code that fails visibly. It's code that fails invisibly.

Why it's wrong

AI-generated code is optimized for plausibility, not correctness. The model produces output that matches patterns from its training data — patterns of code that LOOKED correct in the millions of repositories it learned from. The output inherits the surface appearance of correctness without inheriting the underlying reasoning that made the original code correct.

Bender, Gebru, et al. (2021) argued this in "On the Dangers of Stochastic Parrots": a language model stitches together sequences of linguistic forms according to probabilistic information about how they combine, without reference to meaning. The AI produces plausible code because it's predicting the most likely shape of a correct answer, not reasoning about correctness.

A Stanford study made this concrete: Perry et al. (2023) found that developers with access to an AI assistant wrote significantly less secure code than those without one, and were more likely to believe their code was secure. One plausible reading: the output looked professional enough that it never triggered the developer's critical scrutiny.

Three specific failure modes:

Failure mode 1: Correct for the common case, wrong for the edge case

The AI handles the happy path flawlessly. It handles the most common error cases. It misses the edge case that matters — the one that constitutes your security boundary, your financial calculation precision, or your data integrity guarantee.

// AI-generated: looks correct
func calculateInterest(principal float64, rate float64, days int) float64 {
    return principal * rate * float64(days) / 365.0
}

Compiles. Passes the test for a standard 30-day period. Reads well. But: leap years? Negative principal? Negative days? Rate above 1.0? Days exceeding the term? Precision loss on large principals (float64 carries only about 15 to 17 significant decimal digits)? Each edge case is a potential financial error that won't surface in normal testing.

The AI didn't think about these cases. It pattern-matched a common interest calculation. A developer who WROTE this function would have encountered the edge cases during development: struggling with the leap year question, asking about negative values, checking the spec. Correctness comes from that struggle. The AI skipped it. OpenAI's own Codex evaluation (Chen et al., 2021) documented a related limit: Codex solved 28.8% of the HumanEval problems with a single sample, reaching 70.2% only when 100 samples were drawn per problem, and performance dropped sharply as the number of chained operations in the task description grew. A model trained on average code tends to default to the most common, and often least robust, implementation.
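One way to force those questions into the open is to turn each assumption into an explicit check. The following is a minimal sketch, not a reference implementation: it assumes simple (non-compounding) interest, and the name calculateInterestChecked, the integer-cents representation, the termDays and basis parameters, and every rejection rule are illustrative assumptions that a real system would take from its spec.

```go
package main

import (
	"errors"
	"math"
)

// Hypothetical rejection rules; a real system would take these from its spec.
var (
	errPrincipal = errors.New("principal must be between 0 and maxExactCents")
	errRate      = errors.New("rate must be a fraction between 0 and 1")
	errDays      = errors.New("days must be between 0 and termDays")
	errBasis     = errors.New("basis must be 365 or 366")
)

// Every integer up to 2^53 is exactly representable as a float64.
const maxExactCents = int64(1) << 53

// calculateInterestChecked computes simple interest in whole cents.
// Assumptions: principal is held in integer cents, rate is an annual
// fraction (0.05 means 5%), the caller chooses the day-count basis
// (365 or 366), and the result is rounded to the nearest cent.
func calculateInterestChecked(principalCents int64, rate float64, days, termDays, basis int) (int64, error) {
	if principalCents < 0 || principalCents > maxExactCents {
		return 0, errPrincipal
	}
	if math.IsNaN(rate) || rate < 0 || rate > 1 {
		return 0, errRate
	}
	if days < 0 || days > termDays {
		return 0, errDays
	}
	if basis != 365 && basis != 366 {
		return 0, errBasis
	}
	interest := float64(principalCents) * rate * float64(days) / float64(basis)
	return int64(math.Round(interest)), nil
}
```

Whether a negative principal or a rate above 1.0 should be rejected is a business decision, not a coding one. The point of the sketch is that each rule now exists somewhere a reviewer can disagree with it.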

Failure mode 2: Locally correct, globally wrong

Each function is correct in isolation. The composition is wrong. Function A correctly parses input. Function B correctly transforms data. Function C correctly writes output. But A's output format doesn't match B's expected input. Or B's transformation assumes an ordering that A doesn't guarantee. Or C writes to a resource that A already locked.

This is the composition problem from Fallacy #1 — but at the code level, not the system level. AI generates each piece by pattern-matching against similar pieces in training data. The pieces look correct individually. Nobody checks whether they compose correctly, because each piece passed its own review.
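The ordering case is easy to reproduce. In this sketch, parseIDs plays Function A and containsID plays Function B; both names, and the containsIDChecked wrapper, are hypothetical. Each function is correct against its own contract, and the bug exists only in the call that connects them.

```go
package main

import (
	"errors"
	"sort"
	"strconv"
	"strings"
)

// parseIDs is "Function A": it parses comma-separated IDs and keeps
// them in input order. Nothing here promises a sorted result.
func parseIDs(csv string) ([]int, error) {
	parts := strings.Split(csv, ",")
	ids := make([]int, 0, len(parts))
	for _, p := range parts {
		id, err := strconv.Atoi(strings.TrimSpace(p))
		if err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, nil
}

// containsID is "Function B": a binary search, correct only when the
// slice is sorted in ascending order.
func containsID(sortedIDs []int, target int) bool {
	i := sort.SearchInts(sortedIDs, target)
	return i < len(sortedIDs) && sortedIDs[i] == target
}

// containsIDChecked turns B's silent precondition into a mechanical check.
func containsIDChecked(ids []int, target int) (bool, error) {
	if !sort.IntsAreSorted(ids) {
		return false, errors.New("containsID requires sorted input")
	}
	return containsID(ids, target), nil
}
```

With the input "30, 10, 20", containsID reports that 10 is absent even though it is in the slice. No unit test of either function in isolation would show that.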

Failure mode 3: Semantically different from what was intended

The code does something. It does it correctly. It's not what you wanted.

You asked for a function that validates user input. The AI generates a function that checks string length and character types. You meant a function that checks the input against the business rules for your specific domain: valid account numbers, permitted transaction types, jurisdictional constraints. The AI generated a PLAUSIBLE interpretation of "validates." Not YOUR interpretation.

The code compiles. The tests pass (because the tests also validate string length and characters). The review looks fine (the function does what it says). But the intended validation — the business rules — was never implemented. The AI solved the how perfectly. It guessed the why. And without a specification that anchors the why, no amount of code review will catch the gap — because the code does something reasonable. It's just not the right something.
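The gap between the two readings of "validates" can be shown side by side. Everything in this sketch is invented for illustration: the Transfer type, the 10-digit account format, and the permittedKinds table stand in for whatever your domain's rules are.

```go
package main

// Transfer is a hypothetical request type used only for this illustration.
type Transfer struct {
	Account      string
	Kind         string // for example "domestic" or "wire"
	Jurisdiction string // for example "US" or "DE"
}

// validateFormat is the plausible reading of "validate": it checks
// length and character class, and nothing about the domain.
func validateFormat(t Transfer) bool {
	if len(t.Account) != 10 {
		return false
	}
	for _, r := range t.Account {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

// permittedKinds holds made-up business rules: which transaction kinds
// each jurisdiction allows. Real rules come from the domain spec.
var permittedKinds = map[string]map[string]bool{
	"US": {"domestic": true, "wire": true},
	"DE": {"domestic": true},
}

// validateBusinessRules is the intended reading: the format must be
// valid AND the transaction kind must be permitted in the jurisdiction.
func validateBusinessRules(t Transfer) bool {
	if !validateFormat(t) {
		return false
	}
	// Indexing a missing jurisdiction yields a nil map, which reads as false.
	return permittedKinds[t.Jurisdiction][t.Kind]
}
```

A wire transfer in "DE" passes validateFormat and fails validateBusinessRules. A test suite written against the first function stays green while the second never gets written.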

Peter Naur explained why in 1985: software isn't just the code — it's the mental theory the programmer possesses about how the software handles the problem. The AI can generate the artifact, but it doesn't possess the theory. Without the theory, the code is an artifact that might look like the solution but lacks the internal logic of the intended design. The developer who writes the validation function builds a theory of the domain's rules during development. The AI skips that theory-building entirely.

What correct requires

Byron Cook built Amazon's automated reasoning organization — 300+ scientists, 15+ teams, formal verification embedded across AWS. The insight he discovered over 11 years: executives don't want bug reports. They want proofs.

Newcombe et al. (2015) documented a related practice in "How Amazon Web Services Uses Formal Methods": AWS engineers used TLA+ and model checking to find subtle design bugs that testing and code review had not caught. At industrial scale, "tests pass" is an insufficient definition of "correct."

The distinction:

"Looks correct" (what teams do now):
    Code compiles           ← syntax check
    Tests pass              ← checks tested cases
    Review approves         ← human judgment on appearance

    Gap: what about the cases nobody tested?
    Gap: what about the compositions nobody checked?
    Gap: what about the properties nobody thought to verify?

"IS correct" (what proof provides):
    Property declared       ← "no unauthorized access path exists"
    Property verified       ← checked against ALL possible inputs
    Evidence produced       ← the specific evaluation trace

    No gap: the property either holds for every case or it doesn't.
    The verification is exhaustive, not sampled.

"Looks correct" is an opinion. "IS correct" is evidence. The difference between them is the difference between "I looked and didn't see anything wrong" and "I proved nothing wrong exists."

The cost of plausible

The cost isn't immediate. Plausible code ships. It works for weeks, months, sometimes years. The cost arrives when:

An incident occurs that nobody can diagnose. The code that looked correct has a subtle bug in an edge case. The developer who merged it has no mental model of how it works — they read it, it looked fine, they approved. Debugging AI-generated code you don't understand takes longer than writing the code would have taken, because you're building the mental model during the crisis instead of during development.

An auditor asks for evidence. "How do you know this function correctly handles PII?" The answer: "It passed code review and the tests pass." The auditor: "Show me the tests." You show them. The tests check the happy path. The auditor: "What about edge cases X, Y, Z?" Silence. The test suite verified what someone thought to test. Nobody thought to test the thing the auditor is asking about.

A security researcher finds a path nobody anticipated. The AI-generated IAM policy is correct for the intended use case. But the policy's conditions, evaluated together, are mathematically equivalent to Principal: * — allowing public access through a logical path nobody wrote because the AI pattern-matched the condition blocks from training data without understanding their composition. The policy LOOKS restrictive. The math says it isn't.

The resolution: properties, not appearances

The cache hierarchy from Fallacy #1 has layers. The first two already exist in most codebases:

L1 cache (types):
    The compiler catches type violations instantly.
    AI generates code with wrong types → caught before any human sees it.
    Fast. Deterministic. Already deployed.

L2 cache (tests):
    CI catches behavioral violations before merge.
    AI generates code that breaks existing tests → caught before merge.
    Fast. Deterministic for tested cases. Already deployed.

L3 cache (specification gate — what's missing):
    A mechanical check that verifies properties nobody wrote tests for.
    Security invariants. Architectural boundaries. Cross-service contracts.
    Composition correctness across module boundaries.

    Existing L3-adjacent tools you can adopt today:
    → Property-based testing (QuickCheck, Hypothesis) — tests properties
      across randomly generated inputs, not hand-picked examples
    → Static analysis (Semgrep, SonarQube) — checks structural patterns
      across the codebase without running the code  
    → Contract testing (Pact, Dredd): verifies that an API matches an agreed
      contract (consumer-defined in Pact, OpenAPI or API Blueprint in Dredd)
    → Formal verification (Z3, AWS Zelkova) — proves properties
      mathematically across ALL possible inputs

    Each is a step closer to L3. Property-based testing is the
    easiest first step — it moves you from "tested 5 examples"
    to "tested 10,000 random inputs against one property."

L1 and L2 verify what someone THOUGHT to check — type contracts and test cases. L3 verifies what must ALWAYS be true — properties that hold regardless of implementation.

The L3 check for the interest calculation: "the result must never be negative for positive principal and positive rate." One property. Verified on every change. Catches the edge case the test missed — not because someone anticipated the specific failure, but because the property is universal.
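That property can be checked mechanically with Go's standard testing/quick package. This is a minimal sketch under stated assumptions: naiveInterest mirrors the AI-generated function from earlier, while guardedInterest, checkNeverNegative, the input ranges, the iteration count, and the fixed seed are all illustrative choices.

```go
package main

import (
	"errors"
	"math/rand"
	"testing/quick"
)

// interestFunc is the shape both implementations share in this sketch.
type interestFunc func(principal, rate float64, days int) (float64, error)

// naiveInterest mirrors the AI-generated calculation: no input checks.
func naiveInterest(principal, rate float64, days int) (float64, error) {
	return principal * rate * float64(days) / 365.0, nil
}

// guardedInterest is a hypothetical fix that rejects negative day counts.
func guardedInterest(principal, rate float64, days int) (float64, error) {
	if days < 0 {
		return 0, errors.New("days must be non-negative")
	}
	return principal * rate * float64(days) / 365.0, nil
}

// checkNeverNegative checks the property "for positive principal and
// positive rate, an accepted calculation never returns a negative
// result" against 1,000 generated inputs. It returns nil if no
// counterexample was found, or an error describing the failing input.
func checkNeverNegative(f interestFunc) error {
	property := func(p uint32, r uint16, d int16) bool {
		principal := float64(p) + 1       // always > 0
		rate := (float64(r) + 1) / 65536 // always in (0, 1]
		result, err := f(principal, rate, int(d))
		if err != nil {
			return true // a rejected input cannot violate the property
		}
		return result >= 0
	}
	// A fixed seed keeps the CI run reproducible; the seed value is arbitrary.
	cfg := &quick.Config{MaxCount: 1000, Rand: rand.New(rand.NewSource(1))}
	return quick.Check(property, cfg)
}
```

Nobody had to anticipate the negative-days case. The generator produced it and the property rejected it, so checkNeverNegative(naiveInterest) reports a counterexample while checkNeverNegative(guardedInterest) returns nil. This is still sampling, not proof: it is the first step on the gradient, not the last.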

The L3 check for the IAM policy: "no principal outside the organization can access any resource tagged as sensitive." Not a test case for one specific policy. A property verified across every policy in the snapshot. Catches the mathematical-equivalent-to-star policy — not by pattern-matching the text, but by evaluating the logic.
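To see what "evaluating the logic" means, here is a deliberately tiny toy model. It is NOT AWS IAM's evaluation semantics, and it is not how Zelkova works internally: real tools reason symbolically instead of enumerating. The Request fields, the Statement type, and both policies are invented for illustration. The idea it shows is that a property is checked against every possible request, not against the policy's text.

```go
package main

// Request is a toy access request with three boolean attributes.
type Request struct {
	InOrg     bool // the caller belongs to the organization
	FromVPC   bool // the call arrives through the private network
	Sensitive bool // the target resource is tagged sensitive
}

// Statement is a toy "allow" rule: it permits a request when it returns true.
type Statement func(Request) bool

// allows reports whether any statement in the policy permits the request.
func allows(policy []Statement, r Request) bool {
	for _, s := range policy {
		if s(r) {
			return true
		}
	}
	return false
}

// findExternalSensitiveAccess enumerates all 2^3 possible requests and
// returns a counterexample to the property "no caller outside the
// organization can reach a sensitive resource", if one exists.
func findExternalSensitiveAccess(policy []Statement) (Request, bool) {
	bools := []bool{false, true}
	for _, inOrg := range bools {
		for _, fromVPC := range bools {
			for _, sensitive := range bools {
				r := Request{InOrg: inOrg, FromVPC: fromVPC, Sensitive: sensitive}
				if !r.InOrg && r.Sensitive && allows(policy, r) {
					return r, true
				}
			}
		}
	}
	return Request{}, false
}

// leakyPolicy: each statement looks restrictive on its own, but together
// they allow every request, because FromVPC || !FromVPC is always true.
var leakyPolicy = []Statement{
	func(r Request) bool { return r.FromVPC },
	func(r Request) bool { return !r.FromVPC },
}

// tightPolicy requires organization membership on both paths.
var tightPolicy = []Statement{
	func(r Request) bool { return r.InOrg && r.FromVPC },
	func(r Request) bool { return r.InOrg && !r.FromVPC },
}
```

Enumeration works here because the toy domain has eight requests. Real policies have unbounded string-valued conditions, which is why production tools use solvers.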

The L3 check for the composition: "Function A's output type must be a subset of Function B's input type." In other words, every value A can produce must be a value B accepts. Not verified by testing A and B separately. Verified by checking the interface contract between them, mechanically, on every change that touches either function.
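In a typed language, part of that contract can be pushed into the types themselves, so the compiler enforces it on every build. A sketch under stated assumptions: Record, SortedRecords, and SortRecords are hypothetical names, and the guarantee is strongest if the type lives in its own package, where the unexported field keeps other code from building an unsorted value.

```go
package main

import "sort"

// Record is a hypothetical data item.
type Record struct {
	ID     int
	Amount int64
}

// SortedRecords carries the ordering guarantee in its type. Code outside
// the defining package cannot set items directly, so a non-empty value
// has to come from SortRecords.
type SortedRecords struct {
	items []Record
}

// SortRecords is the only intended constructor. It copies the input so
// that later changes by the caller cannot break the ordering.
func SortRecords(rs []Record) SortedRecords {
	items := append([]Record(nil), rs...)
	sort.Slice(items, func(i, j int) bool { return items[i].ID < items[j].ID })
	return SortedRecords{items: items}
}

// Find is "Function B". It needs sorted input, and the receiver type
// says so: calling it on a raw []Record does not compile.
func (s SortedRecords) Find(id int) (Record, bool) {
	i := sort.Search(len(s.items), func(i int) bool { return s.items[i].ID >= id })
	if i < len(s.items) && s.items[i].ID == id {
		return s.items[i], true
	}
	return Record{}, false
}
```

The ordering assumption is no longer a comment that a generated caller can ignore. A producer that returns an unsorted []Record fails to compile at the call site instead of failing silently in production.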

The difference between testing and proving

Testing checks specific inputs. If you test 1,000 inputs and they all pass, you know 1,000 inputs work. You don't know about input 1,001.

Proving checks ALL inputs. If a property is proved, it holds for every possible input within the model being verified, including the ones nobody thought to test. The verification is mathematical, not sampled.

Dijkstra stated this in 1970: "Program testing can be used to show the presence of bugs, but never to show their absence." This is the fundamental limit that separates testing from verification. Moving from example-based testing to property-based verification is not an improvement in degree — it's a change in kind.
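The gap between sampling and exhausting the input space fits in a few lines when the domain is small enough to enumerate. In this sketch, absInt8, the hand-picked sample list, and the two check functions are illustrative. The property is "the absolute value is never negative."

```go
package main

// absInt8 looks correct and passes the usual example tests.
func absInt8(x int8) int8 {
	if x < 0 {
		return -x
	}
	return x
}

// sampledCheck tests the property on hand-picked examples, the way an
// example-based test suite would. It reports whether all of them passed.
func sampledCheck() bool {
	for _, x := range []int8{-5, -1, 0, 1, 5, 127} {
		if absInt8(x) < 0 {
			return false
		}
	}
	return true
}

// exhaustiveCheck tests the property on all 256 int8 values and returns
// every input that violates it.
func exhaustiveCheck() []int8 {
	var violations []int8
	for i := -128; i <= 127; i++ {
		x := int8(i)
		if absInt8(x) < 0 {
			violations = append(violations, x)
		}
	}
	return violations
}
```

The sampled check passes. The exhaustive check finds exactly one violation: -128, whose negation wraps back to -128 because int8 has no +128. For a 256-value domain, enumeration is a proof. For larger domains, solvers and model checkers play the same role without visiting every input.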

AWS learned this distinction at scale. Cook: "You can't go to a customer and say 'Good news, we found 10,000 more bugs.' They say 'Why am I using AWS if you have bugs?' But you CAN say 'We proved this property holds under these assumptions.' That's why they moved their data to the cloud."

The same distinction applies to AI-generated code. "This code passed 47 tests" is useful but incomplete. "This code satisfies these 12 properties across ALL possible inputs" is evidence. The first is testing. The second is proving. The invisible bugs live in the gaps between them.

You don't need to prove EVERYTHING — that's the formal methods mistake of the 1980s. You need to prove the PROPERTIES THAT MATTER — security invariants, financial correctness guarantees, data integrity constraints, architectural boundaries. The small set of things that must ALWAYS be true, regardless of how the AI implemented them.

And you don't have to jump straight to mathematical proofs. There's a practical gradient:

Example-based tests:    "This input produces this output"           (5 cases checked)
Property-based tests:   "For random inputs, this property holds"    (10,000 random cases)
Contract tests:         "This API matches its specification"         (every endpoint, every field)
Formal verification:    "This property holds for EVERY possible case" (mathematical proof)

Each step catches more than the previous one. Property-based testing (QuickCheck, Hypothesis) is the most accessible first step — one afternoon to adopt, and it immediately catches edge cases that example-based tests miss. You don't need Z3 to start. You need one property and one tool that checks it across more inputs than you'd write by hand.

What you can do this week

1. Identify one property that must always hold in your system. Not a test case. A property. "No API endpoint returns PII without authentication." "No database query returns results from a tenant other than the requesting tenant." "No financial calculation produces a negative balance for a credit transaction." One property. Write it down.

2. Ask: would your current tests catch a violation? If the AI generated code that subtly violated this property — not obviously, but through an edge case or a composition error — would your test suite catch it? If the answer is "probably" or "I think so," you don't have verification. You have hope.

3. Add one mechanical check for that property. A CI check. A contract test. A schema validation. Something that verifies the property on every change, deterministically, regardless of how the code was generated. The property is the specification. The check is the enforcement. The combination is what "correct" means.
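As a sketch of what step 3 can look like, take the tenant-isolation property from step 1. The Row type, both query functions, and checkTenantIsolation are hypothetical. In a real codebase the query would hit your data layer and the check would run as a CI test.

```go
package main

import (
	"fmt"
	"strings"
)

// Row is a hypothetical stored record owned by one tenant.
type Row struct {
	Tenant string
	Data   string
}

// QueryFunc is the shape of any function that fetches rows for a tenant.
type QueryFunc func(store []Row, tenant string) []Row

// queryExact returns only the rows owned by the requesting tenant.
func queryExact(store []Row, tenant string) []Row {
	var out []Row
	for _, r := range store {
		if r.Tenant == tenant {
			out = append(out, r)
		}
	}
	return out
}

// queryPrefix contains a plausible-looking bug: prefix matching lets
// tenant "acme" read rows that belong to tenant "acme-eu".
func queryPrefix(store []Row, tenant string) []Row {
	var out []Row
	for _, r := range store {
		if strings.HasPrefix(r.Tenant, tenant) {
			out = append(out, r)
		}
	}
	return out
}

// checkTenantIsolation enforces the property "no query returns rows
// from a tenant other than the requesting tenant" for every tenant
// present in the store. It returns the first violation it finds.
func checkTenantIsolation(q QueryFunc, store []Row) error {
	for _, requester := range store {
		for _, got := range q(store, requester.Tenant) {
			if got.Tenant != requester.Tenant {
				return fmt.Errorf("tenant %q received a row owned by %q",
					requester.Tenant, got.Tenant)
			}
		}
	}
	return nil
}
```

The check knows nothing about how the query is implemented. It only knows what must be true of the result, so it keeps working when the implementation is regenerated.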

"Looks correct" is how we got here. "IS correct" is how we get out. The difference is one property, mechanically verified, on every change.

Next in the series: Fallacy #3, "You Can Verify AI Output With Another AI." Why wrapping a non-deterministic system with another non-deterministic layer doesn't converge on reliability, and what deterministic verification looks like in practice.

The Fallacies of GenAI Development: eight assumptions every team is making. Each one leads to an architectural failure. Each one has already been solved.

References

Bender, E.M., Gebru, T., et al. (2021). "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" FAccT '21.
Chen, M., et al. (2021). "Evaluating Large Language Models Trained on Code." arXiv:2107.03374 (OpenAI Codex).
Dijkstra, E.W. (1970). "Notes on Structured Programming." EWD249.
Naur, P. (1985). "Programming as Theory Building." Microprocessing and Microprogramming, 15(5).
Newcombe, C., Rath, T., et al. (2015). "How Amazon Web Services Uses Formal Methods." Communications of the ACM, 58(4).
Perry, N., et al. (2023). "Do Users Write More Insecure Code with AI Assistants?" ACM CCS 2023.
