<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bala Paranj</title>
    <description>The latest articles on DEV Community by Bala Paranj (@bala_paranj_059d338e44e7e).</description>
    <link>https://dev.to/bala_paranj_059d338e44e7e</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3862804%2F7ea6c560-63cb-4daf-a713-450532280b0a.jpg</url>
      <title>DEV Community: Bala Paranj</title>
      <link>https://dev.to/bala_paranj_059d338e44e7e</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bala_paranj_059d338e44e7e"/>
    <language>en</language>
    <item>
      <title>From Fallacies to Superpowers: Eight Agent Skills That Make AI-Assisted Development Work</title>
      <dc:creator>Bala Paranj</dc:creator>
      <pubDate>Sun, 07 Jun 2026 11:59:32 +0000</pubDate>
      <link>https://dev.to/bala_paranj_059d338e44e7e/from-fallacies-to-superpowers-eight-agent-skills-that-make-ai-assisted-development-work-nmg</link>
      <guid>https://dev.to/bala_paranj_059d338e44e7e/from-fallacies-to-superpowers-eight-agent-skills-that-make-ai-assisted-development-work-nmg</guid>
      <description>&lt;p&gt;The &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/the-fallacies-of-genai-development-1m54"&gt;Fallacies of GenAI Development&lt;/a&gt; named eight assumptions that break AI-assisted development. The resolutions were framed as human knowledge — things the engineer must understand and apply.&lt;/p&gt;

&lt;p&gt;But the resolutions don't have to live in the engineer's head. They can live in the agent's workflow.&lt;/p&gt;

&lt;p&gt;Projects like &lt;a href="https://github.com/obra/superpowers" rel="noopener noreferrer"&gt;Superpowers&lt;/a&gt; proved that agents can follow structured methodologies — brainstorm before coding, write tests before implementation, review against specs before declaring success. The skills are mandatory workflows, not suggestions. The agent checks for relevant skills before any task.&lt;/p&gt;

&lt;p&gt;The same approach works for the Fallacies resolutions. Each one can be encoded as an agent skill that fires automatically. The engineer doesn't need to remember "check for existing libraries before generating." The agent does it as a mandatory step.&lt;/p&gt;

&lt;p&gt;Here are the eight skills. Each one resolves one fallacy. Each one is achievable today with current agent capabilities.&lt;/p&gt;

&lt;p&gt;But first, a critical boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  The line between agent skill and human judgment
&lt;/h2&gt;

&lt;p&gt;Not everything in a Fallacy resolution should be automated. The Fallacies series itself warns against this — Fallacy #3 (AI can't verify AI) and Fallacy #4 (dropping review) exist because teams automated judgment calls that should have stayed with humans.&lt;/p&gt;

&lt;p&gt;Each skill below has two halves:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The mechanical half (agent does this):&lt;/strong&gt; Search for existing libraries. Run the compiler. Execute the linter. Read the specification file. Count the boundaries. These are deterministic actions with deterministic outputs. The agent executes them. No judgment required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The judgment half (agent surfaces this to the human):&lt;/strong&gt; "Is this the right library?" "Does this architectural constraint still apply?" "Should this uncovered decision become a new spec?" These require context, domain knowledge, and strategic thinking. The agent surfaces the question. The human answers it.&lt;/p&gt;

&lt;p&gt;The agent does the LEGWORK. The human makes the CALL. The agent that tries to make the call is Fallacy #3 — using AI to verify AI. The agent that doesn't do the legwork is wasting human attention on mechanical work (Fallacy #4).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WRONG:  Agent decides "this library is the right choice" → Fallacy #3
WRONG:  Human searches for libraries manually             → Fallacy #4
RIGHT:  Agent searches, presents 3 options with tradeoffs → human picks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every skill below respects this boundary. Watch for the split: steps marked &lt;strong&gt;[MECHANICAL]&lt;/strong&gt; are what the agent does autonomously. Steps marked &lt;strong&gt;[SURFACE]&lt;/strong&gt; are what the agent presents for human decision.&lt;/p&gt;
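&lt;p&gt;One way to picture the split is as data: every step carries a kind, and the runner executes the mechanical ones while queuing the surface ones for a person. This is a minimal sketch under stated assumptions. The names &lt;code&gt;Step&lt;/code&gt; and &lt;code&gt;run_skill&lt;/code&gt; are hypothetical and are not part of Superpowers or any agent framework.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable, Optional

MECHANICAL = "MECHANICAL"
SURFACE = "SURFACE"


@dataclass
class Step:
    kind: str                                   # MECHANICAL or SURFACE
    description: str
    action: Optional[Callable[[], str]] = None  # only mechanical steps carry code


def run_skill(steps: list[Step]) -> tuple[list[str], list[str]]:
    """Run mechanical steps; queue surface steps as questions for the human."""
    results, questions = [], []
    for step in steps:
        if step.kind == SURFACE:
            # The agent never answers a SURFACE step itself.
            questions.append(step.description)
        elif step.action is None:
            raise ValueError(f"mechanical step has no action: {step.description}")
        else:
            results.append(f"{step.description}: {step.action()}")
    return results, questions


# Hypothetical steps, loosely modelled on the library-search example above.
steps = [
    Step(MECHANICAL, "Search for existing libraries", lambda: "3 candidates found"),
    Step(SURFACE, "Which of the 3 candidates should be used?"),
]
results, questions = run_skill(steps)
```

&lt;p&gt;The runner has no code path that resolves a surface step. That is the point: the boundary is enforced by structure, not by the agent's good intentions.&lt;/p&gt;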




&lt;h2&gt;
  
  
  Skill 1: Compose-First
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Resolves Fallacy #1:&lt;/strong&gt; Faster generation ≠ faster engineering&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it fires:&lt;/strong&gt; Before generating any implementation code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the agent does:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. [MECHANICAL] Parse the task: what capability is needed?
2. [MECHANICAL] Search for existing functions in the codebase that already provide it
3. [MECHANICAL] Search for well-maintained upstream libraries that provide it
4. [SURFACE]    Present findings: "Found 2 existing options: [library A] (last updated 
                3 days ago, 12k stars) and [library B] (last updated 8 months ago, 
                200 stars). Also found internal utils/retry.go with similar logic.
                Shall I compose from one of these, or generate new?"
5. [MECHANICAL] After human picks: write the import + glue code
6. [MECHANICAL] Log the decision: "Composed from [library] per human approval"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this is a superpower:&lt;/strong&gt; The agent that composes instead of generating produces 80-95% less code for the same capability. Less code = less to maintain, test, debug, secure. The agent becomes a librarian, not a typist. The codebase shrinks while capabilities grow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it looks like in practice:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without skill:
    Task: "Add HTTP retry logic"
    Agent: generates 150 lines of retry implementation

With skill:
    Task: "Add HTTP retry logic"  
    Agent: "Found existing retry library in go.mod dependencies.
            Writing 6 lines of configuration instead of 150 lines
            of implementation."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
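&lt;p&gt;Step 2 of the skill (search the codebase for existing functions) is purely mechanical. A rough sketch, assuming Go sources held in memory as strings and a simple regular expression in place of a real parser. The file contents and the helper name &lt;code&gt;find_existing&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import re

# Matches top-level Go function and method declarations; captures the name only.
FUNC_DEF = re.compile(r"^func\s+(?:\([^)]*\)\s*)?(\w+)", re.MULTILINE)


def find_existing(sources: dict[str, str], keyword: str) -> list[str]:
    """List 'path: function' pairs whose function name contains the keyword."""
    hits = []
    for path, text in sorted(sources.items()):
        for name in FUNC_DEF.findall(text):
            if keyword.lower() in name.lower():
                hits.append(f"{path}: {name}")
    return hits


# Hypothetical codebase snapshot.
sources = {
    "utils/retry.go": "package utils\n\nfunc RetryWithBackoff(n int) error {\n\treturn nil\n}\n",
    "api/client.go": "package api\n\nfunc (c *Client) Get(url string) error {\n\treturn nil\n}\n",
}
hits = find_existing(sources, "retry")
```

&lt;p&gt;The function only lists candidates. Whether &lt;code&gt;RetryWithBackoff&lt;/code&gt; is the right thing to compose from stays a [SURFACE] question.&lt;/p&gt;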






&lt;h2&gt;
  
  
  Skill 2: Property-Check
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Resolves Fallacy #2:&lt;/strong&gt; Plausible ≠ correct&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it fires:&lt;/strong&gt; After generating any code, before presenting to the human.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the agent does:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. [MECHANICAL] Read the project's property definitions (if they exist):
                .properties/ directory, INVARIANTS.md, CI check configs,
                type constraints, API contracts, schema definitions
2. [MECHANICAL] Run available mechanical checks:
                type checker, linter rules, contract tests
3. [MECHANICAL] For properties with clear pass/fail: evaluate and report result
4. [SURFACE]    For properties requiring judgment: "Generated code touches user 
                data. INVARIANT says 'all user-data endpoints require auth 
                middleware.' I added the auth wrapper — please verify this is 
                the correct middleware for this endpoint."
5. [MECHANICAL] Report: "Mechanically verified: N properties passed. 
                Flagged for human review: M properties (listed above)."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this is a superpower:&lt;/strong&gt; The agent doesn't just generate plausible code. It checks its own output against declared properties before the human ever sees it. The human receives code that's already been evaluated against the team's safety boundaries — not just code that looks right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it looks like in practice:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without skill:
    Agent generates API endpoint. Looks correct. Human merges.
    Endpoint returns PII without authentication. Discovered in production.

With skill:
    Agent generates API endpoint.
    Property check: "INVARIANT: All endpoints returning user data 
    require authentication middleware."
    Agent: "Generated endpoint does not include auth middleware. 
    Adding authentication wrapper before presenting."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
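&lt;p&gt;An invariant with a clear pass/fail answer can be written as a plain predicate. The sketch below assumes endpoints are described by a small record. The &lt;code&gt;Endpoint&lt;/code&gt; type, the paths, and the middleware name &lt;code&gt;"auth"&lt;/code&gt; are illustrative assumptions, not a real framework's API.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    path: str
    returns_user_data: bool
    middleware: list[str] = field(default_factory=list)


def violates_auth_invariant(ep: Endpoint) -> bool:
    """INVARIANT: all user-data endpoints require auth middleware."""
    return ep.returns_user_data and "auth" not in ep.middleware


# Hypothetical endpoints produced by a generation step.
endpoints = [
    Endpoint("/health", returns_user_data=False),
    Endpoint("/users/me", returns_user_data=True, middleware=["auth"]),
    Endpoint("/users/export", returns_user_data=True),
]
violations = [ep.path for ep in endpoints if violates_auth_invariant(ep)]
passed = len(endpoints) - len(violations)
report = f"Mechanically verified: {passed} passed. Violations: {violations}"
```

&lt;p&gt;The predicate can say that auth middleware is present. It cannot say that it is the correct middleware for the endpoint, which is why step 4 stays with the human.&lt;/p&gt;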






&lt;h2&gt;
  
  
  Skill 3: Mechanical-Verify
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Resolves Fallacy #3:&lt;/strong&gt; AI can't verify AI&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it fires:&lt;/strong&gt; When the agent needs to verify its own output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the agent does:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. [MECHANICAL] Classify each property to verify:
                Type constraint → run compiler
                API contract    → run contract test
                Structural      → run linter/static analysis
                Universal       → run property-based test
                Subjective      → flag for human (NOT self-review)
2. [MECHANICAL] Run ALL mechanical checks. Collect results.
3. [SURFACE]    For subjective properties: "This error message says
                'invalid input.' Is that clear enough for your users, 
                or should it specify what's invalid?"
4. [MECHANICAL] Report: "Mechanically verified: [list with pass/fail].
                Human review needed: [subjective items listed above]."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this is a superpower:&lt;/strong&gt; The agent stops pretending it can judge its own output. It runs every mechanical check available and presents the RESULTS, not its OPINION. For subjective properties, it doesn't self-review — it surfaces the question to the human. The agent knows what it can verify deterministically and what it can't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it looks like in practice:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without skill:
    Agent reviews its own code: "This looks correct."
    The code has a subtle type mismatch the agent doesn't catch
    because it pattern-matches appearance, not logic.

With skill:
    Agent: "Running compiler... type mismatch on line 47:
    expected []byte, got string. Fixing before presenting."
    The compiler caught what the agent's self-review would miss.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
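&lt;p&gt;The classification in step 1 is a lookup table, not a judgment. A minimal sketch, assuming each property has already been tagged with a category. The category keys and the property names are hypothetical labels chosen for this example.&lt;/p&gt;

```python
# Category -> deterministic check, mirroring the table in step 1.
CHECK_FOR = {
    "type": "compiler",
    "api_contract": "contract test",
    "structural": "linter/static analysis",
    "universal": "property-based test",
}


def route(properties: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    """Split properties into mechanically checkable ones and human-review ones."""
    mechanical, human = {}, []
    for name, category in properties.items():
        check = CHECK_FOR.get(category)
        if check is None:
            # Subjective or unknown category: flag for the human, never self-review.
            human.append(name)
        else:
            mechanical[name] = check
    return mechanical, human


# Hypothetical properties for one generated change.
properties = {
    "Parse returns []byte": "type",
    "GET /users matches contract": "api_contract",
    "Error message is clear": "subjective",
}
mechanical, human = route(properties)
```

&lt;p&gt;Anything the table does not recognise falls through to the human list. The default is to escalate, not to guess.&lt;/p&gt;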






&lt;h2&gt;
  
  
  Skill 4: Spec-Before-Code
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Resolves Fallacy #4:&lt;/strong&gt; Dropping review ≠ removing bottleneck&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it fires:&lt;/strong&gt; Before writing any implementation, after the brainstorming/planning phase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the agent does:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. [MECHANICAL] Read all specifications that govern the target module:
                module interface, API contract, database schema, ADRs, conventions
2. [MECHANICAL] Extract the list of constraints that apply
3. [SURFACE]    Present constraints to human: "Before I write code, these 
                constraints apply to this module: [list]. Are these correct? 
                Any I'm missing?"
4. [MECHANICAL] After human confirms: generate within those constraints
5. [MECHANICAL] After generating: verify output satisfies each constraint
                using available mechanical checks
6. [SURFACE]    If any constraint can't be mechanically verified: "I couldn't
                confirm compliance with [constraint]. Please review this 
                specific aspect."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this is a superpower:&lt;/strong&gt; The agent doesn't just generate code and hope someone reviews it. It reads the existing specifications, confirms the constraints with the human, generates within those constraints, and verifies compliance. The human reviews the constraint list (small, fast) instead of the code (large, slow). The review moves to the right level — specifications for humans, code verification for the agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it looks like in practice:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without skill:
    Agent generates code. Human reviews 200 lines. Takes 45 minutes.
    Misses that the code uses exceptions instead of result types.

With skill:
    Agent: "Module conventions require result types for error handling
    (from ADR-007). Generating with result types."
    Human reviews: "Yes, those constraints are correct. Go ahead."
    Agent generates. Agent verifies against constraints. 
    Human reviews the 3-line constraint confirmation, not the 200-line implementation.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
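&lt;p&gt;The gate between step 3 and step 4 can be sketched as a simple check: generation is allowed only once every extracted constraint has been confirmed. The function names and the second constraint below are hypothetical.&lt;/p&gt;

```python
def unconfirmed(constraints: list[str], confirmed: set[str]) -> list[str]:
    """Constraints the human has not yet confirmed, in their original order."""
    return [c for c in constraints if c not in confirmed]


def may_generate(constraints: list[str], confirmed: set[str]) -> bool:
    """Generation is allowed only when nothing is left unconfirmed."""
    return not unconfirmed(constraints, confirmed)


constraints = [
    "Use result types for error handling (ADR-007)",
    "Do not change the exported interface",  # hypothetical constraint
]
partial_ok = may_generate(constraints, confirmed={constraints[0]})
full_ok = may_generate(constraints, confirmed=set(constraints))
pending = unconfirmed(constraints, confirmed={constraints[0]})
```

&lt;p&gt;The human's review surface is the &lt;code&gt;constraints&lt;/code&gt; list, a few lines long, not the implementation that follows from it.&lt;/p&gt;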






&lt;h2&gt;
  
  
  Skill 5: Output-Audit
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Resolves Fallacy #5:&lt;/strong&gt; Better context ≠ correct output&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it fires:&lt;/strong&gt; After generation, specifically checking output against properties that AREN'T in the retrieved context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the agent does:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. [MECHANICAL] After generating code, search project docs for architectural
                properties that apply but weren't in the original context:
                - ADRs mentioning timeout, authentication, PII, concurrency
                - CI check configurations
                - CLAUDE.md / CONVENTIONS.md constraints
2. [MECHANICAL] For each found property: check if the generated code 
                violates it using available mechanical tools
3. [SURFACE]    Present findings: "Found 3 architectural properties not in 
                my original context. Timeout policy (ADR-012) applies — 
                I added context.WithTimeout. PII handling policy applies — 
                please verify I'm not logging the user email on line 34.
                Encryption-at-rest policy does not apply to this code path."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this is a superpower:&lt;/strong&gt; The agent compensates for its own context limitation. RAG retrieves documents that are semantically similar to the task. Architectural properties are semantically DISTANT from the code they govern. This skill explicitly searches for the properties that RAG would miss — because they live in ADRs, convention documents, and CI configurations that aren't similar to the implementation task in vector space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it looks like in practice:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without skill:
    RAG retrieves correct API docs. Agent generates correct API call.
    Code makes synchronous external call without timeout inside a
    transaction. Timeout policy is in ADR-012, never retrieved.
    Connection pool exhausts in production.

With skill:
    Agent generates API call. Output-audit fires.
    Agent: "Checking timeout policies... Found ADR-012: 
    'All external calls require context.WithTimeout(5s).'
    Generated code lacks timeout. Adding before presenting."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
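&lt;p&gt;Step 1 does not rely on semantic similarity. It is a deliberate keyword sweep over documents that retrieval tends to skip. A minimal sketch, assuming the docs are already loaded as strings. The file paths and contents are hypothetical, and the keyword list is taken from the step above.&lt;/p&gt;

```python
KEYWORDS = ("timeout", "authentication", "pii", "concurrency")


def find_properties(docs: dict[str, str]) -> dict[str, list[str]]:
    """Map each document to the architectural keywords it mentions."""
    found = {}
    for path, text in sorted(docs.items()):
        lowered = text.lower()
        matched = [k for k in KEYWORDS if k in lowered]
        if matched:
            found[path] = matched
    return found


# Hypothetical project documents.
docs = {
    "docs/adr/ADR-012.md": "All external calls require context.WithTimeout(5s).",
    "docs/adr/ADR-003.md": "Use tabs for indentation.",
    "CONVENTIONS.md": "Never log PII such as user email addresses.",
}
found = find_properties(docs)
```

&lt;p&gt;A substring match is crude and will produce false positives. That is acceptable here, because the output is a list of documents to check against, not a verdict.&lt;/p&gt;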






&lt;h2&gt;
  
  
  Skill 6: Deletion-Aware
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Resolves Fallacy #6:&lt;/strong&gt; Generated code is a liability&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it fires:&lt;/strong&gt; During implementation and refactoring tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the agent does:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. [MECHANICAL] Before generating, search the codebase for existing 
                implementations of the same or similar functionality
2. [SURFACE]    If duplicates found: "Found 3 existing implementations 
                of date formatting: utils/dates.go, handlers/format.go, 
                api/helpers.go. Recommend consolidating to one. Which 
                should be the canonical version, or should I extract a 
                new shared function?"
3. [MECHANICAL] After human decides: implement the consolidation
4. [MECHANICAL] After completing any task, report additions AND deletions: 
                "Added 45 lines. Deleted 120 lines. Net: -75 lines."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this is a superpower:&lt;/strong&gt; The agent actively shrinks the codebase. Instead of the default behavior (generate new code for every task), the agent searches for duplication, extracts shared functions, and deletes redundant implementations. The deletions-to-additions ratio improves. The maintenance burden decreases with each task instead of increasing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it looks like in practice:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without skill:
    Five developers prompt agents for date formatting over a month.
    Five implementations exist. Bug found in one. Other four remain broken.

With skill:
    Agent: "Found existing formatDate() in utils/dates.go.
    Using existing implementation instead of generating new one.
    Also found two other formatDate variants in handlers/ and api/.
    Recommend consolidating to the utils/ version. Shall I refactor?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
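&lt;p&gt;The report in step 4 is arithmetic over a diff. A minimal sketch using Python's standard &lt;code&gt;difflib&lt;/code&gt;, assuming the before and after versions of a file are available as strings. In a real repository the same numbers would more likely come from the version control system. The sample contents are placeholders.&lt;/p&gt;

```python
import difflib


def line_delta(before: str, after: str) -> tuple[int, int]:
    """Count added and deleted lines between two versions of a text."""
    added = deleted = 0
    for line in difflib.ndiff(before.splitlines(), after.splitlines()):
        if line.startswith("+ "):
            added += 1
        elif line.startswith("- "):
            deleted += 1
    return added, deleted


# Placeholder file contents.
before = "alpha\nbeta\ngamma\ndelta\nepsilon"
after = "alpha\nepsilon\nzeta\neta"
added, deleted = line_delta(before, after)
report = f"Added {added} lines. Deleted {deleted} lines. Net: {added - deleted:+d} lines."
```

&lt;p&gt;Reporting the net figure on every task makes the deletions-to-additions ratio visible instead of leaving it to be inferred from PR size.&lt;/p&gt;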






&lt;h2&gt;
  
  
  Skill 7: Boundary-Read
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Resolves Fallacy #7:&lt;/strong&gt; Specs already exist; they're not new work&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it fires:&lt;/strong&gt; At the start of every task, before any code generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the agent does:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. [MECHANICAL] Identify which module/package the task targets
2. [MECHANICAL] Read the module's existing boundaries:
                exported interface, import restrictions, API contract, 
                database schema, configuration schema
3. [MECHANICAL] Generate code that satisfies all identified boundaries
4. [MECHANICAL] After generating: run linter/depguard to verify 
                no boundary is violated
5. [SURFACE]    If the task REQUIRES changing a boundary: "This task 
                needs a new exported function in pkg/stave/. Changing 
                the public interface. Please confirm this is intended 
                — it affects all consumers of this package."
6. [MECHANICAL] Report: "Module has N boundaries. Implementation 
                satisfies all N. No cross-boundary imports."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this is a superpower:&lt;/strong&gt; The agent treats existing specifications as first-class constraints instead of ignoring them. Most agents generate code that happens to match the module's style. This agent READS the interface definition, KNOWS what's exported and what isn't, and REFUSES to generate code that violates the boundary. The Parnas boundaries the team already built are finally enforced — by the agent itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it looks like in practice:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without skill:
    Agent generates code that imports an internal package from
    a different module. No linter catches it. The hexagonal
    architecture erodes silently.

With skill:
    Agent: "Module internal/core/ is restricted — depguard rule
    prevents imports from internal/app/. Generating without
    cross-boundary import. Using the public interface in pkg/stave/ instead."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
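&lt;p&gt;In a Go project the check in step 4 would be done by a linter such as depguard. The sketch below only illustrates the shape of the rule, under two assumptions: import paths are simplified to repository-relative prefixes, and the rule is read as "packages under &lt;code&gt;internal/core/&lt;/code&gt; may not import &lt;code&gt;internal/app/&lt;/code&gt;". The rule table and package names are hypothetical.&lt;/p&gt;

```python
# Hypothetical rule table: packages under the key may not import the listed prefixes.
DENY = {"internal/core/": ["internal/app/"]}


def boundary_violations(package: str, imports: list[str]) -> list[str]:
    """Return one message per import that crosses a denied boundary."""
    violations = []
    for prefix, denied in DENY.items():
        if not package.startswith(prefix):
            continue
        for imp in imports:
            if any(imp.startswith(d) for d in denied):
                violations.append(f"{package} must not import {imp}")
    return violations


ok = boundary_violations("internal/core/eval", ["pkg/stave", "fmt"])
bad = boundary_violations("internal/core/eval", ["internal/app/http"])
```

&lt;p&gt;An empty list is what lets the agent report "No cross-boundary imports". A non-empty list is a hard stop, not a warning.&lt;/p&gt;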






&lt;h2&gt;
  
  
  Skill 8: Protocol-Sync
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Resolves Fallacy #8:&lt;/strong&gt; More agents ≠ more productivity&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it fires:&lt;/strong&gt; At the start of every task, when multiple agents are working on the same codebase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the agent does:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. [MECHANICAL] Read the shared specification repo:
                naming conventions, error handling strategy, retry policy,
                API contract versions, architecture decision records
2. [MECHANICAL] Check the specification VERSION — confirm it matches
                what other agents are reading (prevent split-brain)
3. [MECHANICAL] Generate code that conforms to ALL shared specifications
4. [MECHANICAL] After generating: verify conformance using linter rules
                and convention checks
5. [SURFACE]    If a decision isn't covered by any specification: 
                "This task requires choosing a serialization format for 
                the new event type. No convention covers this. Options: 
                JSON (consistent with existing events) or Protobuf 
                (consistent with the gRPC migration plan in ADR-015). 
                Which should I use? Should this become a new convention?"
6. [MECHANICAL] Report: "Conforming to spec version 2.4. All conventions 
                match. One uncovered decision flagged above."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this is a superpower:&lt;/strong&gt; The agent doesn't make invisible architectural decisions. It reads the coordination protocols (shared specifications), follows them, and flags decisions that aren't covered. Multiple agents producing code that follows the same conventions, same error handling, same retry strategy — without any coordination meetings. The specifications are the protocols. The agents are the nodes. The distributed system is consistent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it looks like in practice:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without skill:
    Agent A uses camelCase for JSON fields.
    Agent B uses snake_case.
    Integration breaks silently. Bug takes two days to trace.

With skill:
    Both agents read conventions.md at task start.
    Both generate camelCase JSON fields.
    Integration works on the first try.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
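&lt;p&gt;Steps 2 and 4 both reduce to small deterministic checks. A minimal sketch, assuming the convention is camelCase JSON field names and that each agent records the spec version it read. The agent identifiers, payload, and helper names are hypothetical.&lt;/p&gt;

```python
import re

# Lowercase first letter, then letters and digits only: no underscores or hyphens.
CAMEL_CASE = re.compile(r"^[a-z][a-zA-Z0-9]*$")


def non_conforming_fields(payload: dict) -> list[str]:
    """Field names that break the camelCase convention, sorted for stable output."""
    return sorted(k for k in payload if not CAMEL_CASE.match(k))


def same_spec_version(versions: dict[str, str]) -> bool:
    """True only when every agent reports the same specification version."""
    return len(set(versions.values())) == 1


bad_fields = non_conforming_fields({"userName": "Ada", "created_at": "2026-06-06"})
in_sync = same_spec_version({"agent-a": "2.4", "agent-b": "2.4"})
split_brain = not same_spec_version({"agent-a": "2.4", "agent-b": "2.3"})
```

&lt;p&gt;Neither check decides what the convention should be. They only confirm that the agents are reading the same one and following it.&lt;/p&gt;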






&lt;h2&gt;
  
  
  What makes these different from "just prompting better"
&lt;/h2&gt;

&lt;p&gt;These aren't prompt improvements. They're structural workflow changes with a clear boundary between agent action and human judgment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Prompting:    "Remember to check for existing libraries"
              → The agent might or might not. Probabilistic.

Skill:        compose-first fires automatically before every
              implementation task. The agent MUST search before
              generating. Mandatory workflow, not suggestion.

BUT:          The agent searches [MECHANICAL].
              The human picks which library to use [SURFACE].
              The agent never decides "this library is fine" on its own.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This boundary separates these skills from Fallacy #3 (AI verifying AI). The agent does exhaustive, deterministic legwork: searching, reading, running checks, collecting results. The human makes judgment calls: which library, whether a constraint applies, whether an uncovered decision should become a new specification. The agent that crosses this boundary is automating judgment with correlated failure modes. The agent that respects it is doing the mechanical work that frees human judgment for where it matters.&lt;/p&gt;

&lt;p&gt;The Superpowers framework proved this distinction. Skills aren't suggestions. They're mandatory workflows that fire based on triggers. The agent checks for relevant skills before any task. The skills execute as part of the agent's process, not as afterthoughts.&lt;/p&gt;

&lt;p&gt;Each skill above has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;trigger&lt;/strong&gt; (when it fires)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;process&lt;/strong&gt; (what the agent does, step by step)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;verification&lt;/strong&gt; (how to confirm the skill executed correctly)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;report&lt;/strong&gt; (what the agent tells the human)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the same structure Superpowers uses for brainstorming, TDD, and code review. The Fallacy resolutions fit the same framework — because they're the same kind of thing: structured workflows that prevent a known failure mode.&lt;/p&gt;
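&lt;p&gt;The "trigger" part is what turns a suggestion into a mandatory workflow: the dispatcher, not the model, decides which skills run. A rough sketch, with the process and verification parts omitted for brevity. The &lt;code&gt;Skill&lt;/code&gt; record, the phase names, and the report strings are assumptions made for illustration and do not reflect how Superpowers is implemented.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Skill:
    name: str
    trigger: Callable[[str], bool]  # given a task phase, should the skill fire?
    report: str                     # what the agent tells the human afterwards


# Hypothetical registry covering two of the eight skills.
SKILLS = [
    Skill("compose-first", lambda phase: phase == "before_generation",
          "what was composed or generated, and why"),
    Skill("property-check", lambda phase: phase == "after_generation",
          "properties passed and properties flagged"),
]


def skills_for(phase: str) -> list[str]:
    """Every skill whose trigger matches fires; none is optional."""
    return [s.name for s in SKILLS if s.trigger(phase)]


before = skills_for("before_generation")
after = skills_for("after_generation")
```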

&lt;h2&gt;
  
  
  The compound effect
&lt;/h2&gt;

&lt;p&gt;An agent running all eight skills together:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Searches for existing implementations before generating (Compose-First)&lt;/li&gt;
&lt;li&gt;Reads module boundaries and specifications (Boundary-Read)&lt;/li&gt;
&lt;li&gt;Reads shared conventions (Protocol-Sync)&lt;/li&gt;
&lt;li&gt;Confirms constraints with the human (Spec-Before-Code)&lt;/li&gt;
&lt;li&gt;Generates within all identified constraints&lt;/li&gt;
&lt;li&gt;Runs mechanical verification (Mechanical-Verify)&lt;/li&gt;
&lt;li&gt;Checks output against architectural properties not in context (Output-Audit)&lt;/li&gt;
&lt;li&gt;Evaluates against declared properties (Property-Check)&lt;/li&gt;
&lt;li&gt;Searches for redundant code to delete (Deletion-Aware)&lt;/li&gt;
&lt;li&gt;Reports what was composed, generated, verified, and flagged&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This agent produces less code, more capability, fewer violations, and better architectural coherence than an agent running without these skills — using the SAME model, the SAME context, the SAME prompts. The difference isn't the AI. It's the workflow around the AI.&lt;/p&gt;

&lt;p&gt;The Fallacies identified what breaks. The skills fix it, not by making the human smarter but by making the agent's process better. The engineer who installs these skills gives their agent the architectural judgment that the model doesn't have. The model provides the generation. The skills provide the discipline. The combination is what "AI-assisted development that works" looks like.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The Fallacies of GenAI Development: &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/the-fallacies-of-genai-development-1m54"&gt;Index&lt;/a&gt; · &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-1-faster-code-generation-means-faster-engineering-2jbl"&gt;#1 Faster Generation&lt;/a&gt; · #2-#8 linked from index.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/obra/superpowers" rel="noopener noreferrer"&gt;Superpowers&lt;/a&gt; by Jesse Vincent — the agentic skills framework that proved agents can follow mandatory structured workflows. 209k stars. The skills above follow the same pattern.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/sufield/stave" rel="noopener noreferrer"&gt;Stave&lt;/a&gt; — the specification gate that implements Property-Check and Boundary-Read for cloud infrastructure. 2,662 safety invariants. Deterministic verification.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwaredevelopment</category>
      <category>architecture</category>
      <category>programming</category>
    </item>
    <item>
      <title>Fallacies of GenAI Development #8: More AI Agents Means More Productivity</title>
      <dc:creator>Bala Paranj</dc:creator>
      <pubDate>Sat, 06 Jun 2026 12:31:34 +0000</pubDate>
      <link>https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-8-more-ai-agents-means-more-productivity-2a0m</link>
      <guid>https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-8-more-ai-agents-means-more-productivity-2a0m</guid>
      <description>&lt;p&gt;&lt;em&gt;This is the eighth and final post in a series on the false assumptions teams make when building with generative AI. The series began with the observation that the trough of disillusionment for AI-assisted development has arrived — not because AI is useless, but because eight false assumptions made the trough inevitable. This post covers the last assumption and closes the series.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fallacy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"If one AI agent gives us a 10x boost, ten agents will give us 100x."&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it's tempting
&lt;/h2&gt;

&lt;p&gt;The arithmetic feels irresistible. One agent generates code for the backend. Another generates the frontend. A third writes tests. A fourth handles database migrations. A fifth generates documentation. Each agent works in parallel. No meetings, no waiting, no coordination overhead. Pure throughput.&lt;/p&gt;

&lt;p&gt;Leadership sees the potential: a five-person team with fifty agents has the output of a fifty-person team at the cost of a five-person team plus API credits. The scaling is linear. The economics are transformational.&lt;/p&gt;

&lt;p&gt;And the early results confirm it. Each agent, working on its own, produces impressive output. The backend agent generates Go code. The frontend agent generates React components. The test agent generates test suites. Each agent, in isolation, looks like a 10x developer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it's wrong
&lt;/h2&gt;

&lt;p&gt;You've seen this problem before. It has a name. It's called distributed systems.&lt;/p&gt;

&lt;p&gt;A distributed system is a collection of independent actors that must coordinate to produce a coherent result. Each actor makes decisions locally. The system's correctness depends on those local decisions being compatible globally. When they aren't, you get inconsistency, conflicts, data corruption, and cascading failures.&lt;/p&gt;

&lt;p&gt;AI agents working on the same codebase are a distributed system. Each agent makes decisions — variable names, error handling strategies, retry policies, data formats, abstraction levels, dependency choices. Each decision is made locally, in the context of one prompt, one file, one task. No agent sees the full picture. No agent coordinates with the other agents. Each agent's decisions are invisible to the others.&lt;/p&gt;

&lt;p&gt;Distributed systems engineers spent 40 years learning that you can't scale a distributed system by adding more nodes. You can only scale it by adding protocols — consensus mechanisms, ordering guarantees, conflict resolution rules, interface contracts. Without protocols, more nodes means more conflicts, not more throughput.&lt;/p&gt;

&lt;p&gt;The same applies to AI agents. More agents without specifications means more invisible decisions, more inconsistency, more cognitive fragmentation — not more productivity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The boom
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Month 1-2: The parallel sprint.&lt;/strong&gt; Five agents work simultaneously on different parts of the system. Each produces well-structured code. PRs flow in from every direction. The team merges them rapidly. The system takes shape faster than anyone expected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 3-4: The integration cracks.&lt;/strong&gt; The backend agent chose &lt;code&gt;camelCase&lt;/code&gt; for JSON field names. The frontend agent expected &lt;code&gt;snake_case&lt;/code&gt;. The database migration agent used &lt;code&gt;PascalCase&lt;/code&gt; for column names. None knew about the others' choices. Each was reasonable in isolation. The integration fails silently — data flows through but field mappings are wrong. The bug appears as "the UI shows the wrong value" and takes two days to trace to a naming mismatch across three layers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 5-6: The contradictory patterns.&lt;/strong&gt; The backend agent implemented retries with exponential backoff and jitter. The API gateway agent implemented retries with a fixed 3-second delay. Both are valid retry strategies in isolation. Each was generated from a different training data pattern. When a downstream service is slow, the backend retries with increasing delays while the gateway retries every 3 seconds with no jitter, so its retries arrive in synchronized waves on top of the backend's. The result is a thundering herd that overwhelms the already-slow service. The agents' patterns contradicted each other. Neither knew the other existed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 7-8: The architectural drift.&lt;/strong&gt; Each agent, given different tasks over months, evolved different internal patterns. The backend agent started using result types for error handling. The frontend agent uses exceptions. The test agent mixes both depending on which prompt it received. The codebase has three error handling philosophies, each locally consistent, globally incoherent. A new developer opens the codebase and can't determine which pattern is correct because all three exist in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 9-10: The edit war.&lt;/strong&gt; Agent A refactors a shared utility function for performance. Agent B, in a separate task, refactors the same function for readability. Agent A's change merges first. Agent B's change overwrites A's optimization. Neither agent knows the other touched the file. The team pays the token cost of both refactors and keeps the benefit of only one. Adam Bender of Google has a name for this. In his talk &lt;em&gt;Software Engineering at the Tipping Point&lt;/em&gt;, he calls it the agentic edit war: two agents refactoring the same file toward different goals, livelocking the system while the company pays for the tokens on both sides.&lt;/p&gt;

&lt;p&gt;Worse, without a pattern specification, the cycle repeats. Agent A sees the unoptimized code and refactors it again for performance. Agent B sees the unreadable code and refactors it again for readability. The agents consume tokens indefinitely, bouncing the code back and forth between two valid-but-contradictory goals. In distributed systems, this is called livelock: the system is active but makes no progress. In AI development, it's called your API bill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 11-12: The coordination collapse.&lt;/strong&gt; The team needs to make an architectural change — migrate from REST to gRPC. Each agent needs to be told individually. Each interprets the migration prompt differently. The backend agent generates gRPC server code. The frontend agent keeps generating REST client calls because its prompt wasn't updated. The test agent generates tests for both protocols because it sees both in the codebase. The migration that should take a week takes two months because every agent is working against the others.&lt;/p&gt;

&lt;p&gt;The team discovers they've been managing a distributed system without the protocols that distributed systems require. Each agent is a node. Each node is making decisions. Nobody built the consensus layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What distributed systems already solved
&lt;/h2&gt;

&lt;p&gt;Hold up the two systems side by side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Distributed Computing (1990s):           Distributed AI Development (2020s):
───────────────────────────              ──────────────────────────────────
Multiple processes on multiple nodes     Multiple agents on one codebase
Each makes local decisions               Each makes local decisions
No shared state by default               No shared context by default
Inconsistency is the default outcome     Inconsistency is the default outcome
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Distributed computing solved this with protocols:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Protocol                    What it solves              AI equivalent
────────────────────        ────────────────────        ──────────────────────
Consensus (Paxos, Raft)     Agreement on shared state   Shared specification repo
Ordering (vector clocks)    Event sequencing            Architectural priority rules
Conflict resolution         Concurrent modifications    Specification gate on merge
Interface contracts (IDL)   Cross-service compatibility API contracts + contract tests
Schema evolution            Backward compatibility      Migration specifications
Circuit breakers            Cascading failure prevention Dependency scope limits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every protocol in the left column has an AI development equivalent in the right column. The solutions exist. They're called specifications, contracts, and enforcement gates. They're the same coordination mechanisms — applied to agents instead of processes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The five coordination failures and their fixes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Naming inconsistency → Convention specification
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Problem: Each agent chooses its own naming conventions.
Fix:     A convention specification fed to every agent as context.
         "JSON fields: camelCase. Database columns: snake_case.
          Go structs: PascalCase. Environment variables: UPPER_SNAKE."
         Four lines. Every agent reads them. Every change is checked
         against them by a linter. An inconsistent name fails the
         build instead of reaching production.

         Critical: the specification must be immutable for the duration
         of concurrent agent tasks. In distributed systems, "split brain"
         happens when nodes have different versions of the truth. If
         Agent A has the old naming convention and Agent B has the new
         one, the codebase gets both. Version the specification. Update
         it between task batches, not during them.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
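
&lt;p&gt;The linter half of that fix can be very small. Below is a minimal sketch in Python, assuming the four rules quoted above and a hypothetical input format (a dictionary of the names a change introduces, grouped by kind). The names &lt;code&gt;SPEC_VERSION&lt;/code&gt;, &lt;code&gt;CONVENTIONS&lt;/code&gt;, and &lt;code&gt;find_violations&lt;/code&gt; are illustrative, not part of any existing tool.&lt;/p&gt;

```python
import re

# Hypothetical version tag: a task batch pins one immutable copy of the rules.
SPEC_VERSION = "1"

# Hypothetical machine-readable form of the four conventions quoted above.
CONVENTIONS = {
    "json_field": re.compile(r"^[a-z][a-z0-9]*(?:[A-Z][a-z0-9]*)*$"),  # camelCase
    "db_column": re.compile(r"^[a-z][a-z0-9]*(?:_[a-z0-9]+)*$"),       # snake_case
    "go_struct": re.compile(r"^(?:[A-Z][a-z0-9]*)+$"),                 # PascalCase
    "env_var": re.compile(r"^[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)*$"),         # UPPER_SNAKE
}


def find_violations(names_by_kind):
    """Return (kind, name) pairs that break the convention for their kind."""
    violations = []
    for kind, names in names_by_kind.items():
        pattern = CONVENTIONS[kind]
        for name in names:
            if pattern.match(name) is None:
                violations.append((kind, name))
    return violations


if __name__ == "__main__":
    # Names introduced by one agent's change (hard-coded for illustration).
    change = {
        "json_field": ["userId", "created_at"],
        "db_column": ["user_id", "CreatedAt"],
        "go_struct": ["PolicyDocument"],
        "env_var": ["DATABASE_URL"],
    }
    for kind, name in find_violations(change):
        print(f"{kind}: {name!r} violates convention v{SPEC_VERSION}")
```

&lt;p&gt;A CI job would fail whenever &lt;code&gt;find_violations&lt;/code&gt; returns a non-empty list. Extracting the names from real code is the harder part and is left out of this sketch.&lt;/p&gt;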



&lt;h3&gt;
  
  
  2. Pattern contradictions → Architectural decision records
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Problem: Each agent implements different patterns for the same concern.
Fix:     One ADR per cross-cutting concern.
         "ADR-007: Retry strategy is exponential backoff with jitter,
          base 100ms, max 5 attempts, across ALL services."
         The specification is the protocol. The CI check is the
         enforcement. Any agent that generates fixed-delay retries
         fails the build.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
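
&lt;p&gt;For concreteness, here is roughly what a retry helper that satisfies the quoted ADR could look like in Python. The ADR text fixes only the base delay and the attempt count. The choice of "full jitter" (a uniform draw between zero and the current cap) and the function names are assumptions made for this sketch.&lt;/p&gt;

```python
import random
import time

BASE_DELAY_SECONDS = 0.1   # "base 100ms" from the ADR
MAX_ATTEMPTS = 5           # "max 5 attempts" from the ADR


def backoff_delay(attempt, rng=random):
    """Full-jitter delay to wait after failed attempt number `attempt` (1-based).

    The cap doubles after each failure (0.1s, 0.2s, 0.4s, ...). The actual
    delay is drawn uniformly from [0, cap], so clients that failed together
    do not retry together.
    """
    cap = BASE_DELAY_SECONDS * (2 ** (attempt - 1))
    return rng.uniform(0.0, cap)


def call_with_retries(operation, sleep=time.sleep, rng=random):
    """Run `operation` up to MAX_ATTEMPTS times, backing off between failures."""
    last_error = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return operation()
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
            if attempt != MAX_ATTEMPTS:
                sleep(backoff_delay(attempt, rng))
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

&lt;p&gt;If every service imports one shared helper like this, the CI check reduces to flagging any hand-rolled retry loop or fixed &lt;code&gt;sleep&lt;/code&gt; that appears outside it.&lt;/p&gt;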



&lt;h3&gt;
  
  
  3. Edit conflicts → Ownership boundaries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Problem: Two agents modify the same file with contradictory goals.
Fix:     CODEOWNERS + module boundaries.
         Each module has one owner (human or agent). Cross-module
         changes require the interface contract to be satisfied.
         Agents can't modify modules outside their scope.
         Same principle as microservice boundaries — but for agents.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
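
&lt;p&gt;A scope gate of this kind can be sketched in a few lines of Python. The ownership map, the agent names, and the directory layout below are hypothetical. A real setup would more likely read them from a CODEOWNERS file than hard-code them.&lt;/p&gt;

```python
from pathlib import PurePosixPath

# Hypothetical ownership map: module directory to owning agent or team.
OWNERS = {
    "backend": "backend-agent",
    "frontend": "frontend-agent",
    "shared/utils": "platform-team",
}


def owner_of(path):
    """Return the owner of the deepest matching module directory, or None."""
    parts = PurePosixPath(path).parts
    # Try the longest directory prefix first so nested modules win.
    for depth in range(len(parts) - 1, 0, -1):
        module = "/".join(parts[:depth])
        if module in OWNERS:
            return OWNERS[module]
    return None


def out_of_scope(agent, changed_files):
    """List the changed files that `agent` does not own.

    Files with no owner at all count as out of scope for every agent.
    """
    return [path for path in changed_files if owner_of(path) != agent]


if __name__ == "__main__":
    changed = ["backend/api/handlers.go", "shared/utils/retry.go"]
    print(out_of_scope("backend-agent", changed))
```

&lt;p&gt;Run against the edit-war scenario, a gate like this would stop both refactoring agents from touching the shared utility unless its owner signed off.&lt;/p&gt;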



&lt;h3&gt;
  
  
  4. Error handling divergence → Interface contracts
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Problem: Three error handling philosophies coexist in the codebase.
Fix:     One interface contract: "All public functions return
         (result, error). No exceptions. No panics. No sentinel values."
         The compiler checks declared return types (Go). A linter rule
         checks what the compiler cannot: panics in Go, thrown
         exceptions in TypeScript.
         The contract is the specification. The tool is the enforcement.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Migration incoherence → Change specifications
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Problem: An architectural migration is interpreted differently by each agent.
Fix:     A migration specification: "Phase 1: Add gRPC endpoints alongside
         REST. Phase 2: Migrate clients to gRPC. Phase 3: Remove REST.
         Current phase: 1. Agents must not remove REST endpoints."
         Each agent reads the current phase. The specification prevents
         agents from jumping ahead or falling behind.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
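
&lt;p&gt;One way to make the current phase enforceable rather than advisory is to keep the specification in a machine-readable form and check each proposed change against it. The Python sketch below assumes a hypothetical upstream step that labels a change with action names such as &lt;code&gt;remove_rest_endpoint&lt;/code&gt;. Both the labels and the data layout are illustrative.&lt;/p&gt;

```python
# Hypothetical machine-readable form of the migration specification above.
MIGRATION_SPEC = {
    "current_phase": 1,
    # Actions permitted in each phase. Anything else fails the gate.
    "allowed_actions": {
        1: {"add_grpc_endpoint"},
        2: {"add_grpc_endpoint", "migrate_client_to_grpc"},
        3: {"migrate_client_to_grpc", "remove_rest_endpoint"},
    },
}


def check_change(actions, spec=MIGRATION_SPEC):
    """Return the actions in a proposed change that the current phase forbids."""
    allowed = spec["allowed_actions"][spec["current_phase"]]
    return sorted(set(actions) - allowed)


if __name__ == "__main__":
    # An agent that jumps ahead to Phase 3 while the spec says Phase 1.
    proposed = ["add_grpc_endpoint", "remove_rest_endpoint"]
    print(check_change(proposed))
```

&lt;p&gt;Advancing the migration then becomes a one-line change to &lt;code&gt;current_phase&lt;/code&gt;, made once, between task batches, instead of a prompt update per agent.&lt;/p&gt;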



&lt;p&gt;Each fix is small. Each is a few lines of text. Each is fed to agents as context AND enforced mechanically by CI. The context helps agents make correct decisions. The enforcement catches them when they don't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The principle: specifications are protocols for agents
&lt;/h2&gt;

&lt;p&gt;In distributed computing, protocols enable coordination without requiring every node to understand every other node. Node A doesn't need to know Node B's implementation. It needs to know Node B's interface contract. If both nodes respect the protocol, the system is consistent — regardless of how many nodes you add.&lt;/p&gt;

&lt;p&gt;The same principle applies to AI agents. Agent A doesn't need to know Agent B's prompt. It needs to know the specification that governs the module it's working on. If all agents respect the specifications, the codebase is consistent — regardless of how many agents you add.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Distributed computing:     Protocol enables coordination between nodes
Distributed AI dev:        Specification enables coordination between agents

Distributed computing:     More nodes + same protocol = more throughput
Distributed AI dev:        More agents + same specifications = more productivity

Distributed computing:     More nodes + no protocol = more conflicts
Distributed AI dev:        More agents + no specifications = more inconsistency
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scaling agents is scaling a distributed system. The mechanisms are the same. The lesson is the same. The solution is the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  The series, complete
&lt;/h2&gt;

&lt;p&gt;Eight fallacies. One meta-pattern.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fallacy #1: Faster generation = faster engineering
            → The leading sub-system outran the lagging ones

Fallacy #2: Looks correct = is correct
            → Plausibility is not correctness

Fallacy #3: AI can verify AI
            → Correlated failure modes don't converge

Fallacy #4: Drop review = remove bottleneck
            → Removing a gate without replacing it removes the safety net

Fallacy #5: Better context = correct output
            → Input quality doesn't guarantee output correctness

Fallacy #6: Generated code is an asset
            → Code is a liability; capability is the asset

Fallacy #7: Specs are new work
            → The specifications already exist; the enforcement doesn't

Fallacy #8: More agents = more productivity
            → More actors without protocols = more conflicts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every fallacy stems from one root assumption: &lt;strong&gt;generating the output is the hard part.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This assumption is wrong. Understanding the output, verifying it, maintaining it, coordinating the actors that produce it, and preserving the rationale for why it's shaped this way — those are the hard parts. They always were. AI made the easy part faster. The hard parts didn't change.&lt;/p&gt;

&lt;p&gt;Peter Deutsch's Fallacies of Distributed Computing worked because they named the assumptions that every developer made, later discovered to be wrong, and paid for in production. The network is not reliable. Latency is not zero. Bandwidth is not infinite.&lt;/p&gt;

&lt;p&gt;The Fallacies of GenAI Development work the same way. Generation is not engineering. Plausible is not correct. More agents is not more productivity. Each assumption sounds true. Each leads to failure. Each has already been resolved by a domain that learned the lesson first.&lt;/p&gt;

&lt;h2&gt;
  
  
  The resolution to all eight
&lt;/h2&gt;

&lt;p&gt;The resolution is the same across every fallacy, every domain, every era:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recognize the specifications that already exist&lt;/strong&gt; in your system — types, contracts, schemas, boundaries (Fallacy #7). &lt;strong&gt;Enforce them mechanically&lt;/strong&gt; on every change, at machine speed (Fallacies #1, #3, #4). &lt;strong&gt;Verify the output&lt;/strong&gt; against declared properties, not just the input quality (Fallacies #2, #5). &lt;strong&gt;Measure properties verified&lt;/strong&gt;, not code generated (Fallacy #6). &lt;strong&gt;Use specifications as coordination protocols&lt;/strong&gt; for agents (Fallacy #8).&lt;/p&gt;

&lt;p&gt;One architectural principle. Eight applications. The teams that adopt it first will emerge from the trough of disillusionment ahead of everyone else. The teams that don't will learn the same lessons the hard way — the same way distributed systems developers learned Deutsch's fallacies, one production incident at a time.&lt;/p&gt;

&lt;p&gt;The engineer's role has changed. Not from "writing code" to "writing prompts" — that's the Fallacy #4 trap, the prompt operator with no ownership. The real shift is from writing code to &lt;strong&gt;designing protocols&lt;/strong&gt;. The specifications, contracts, boundaries, and enforcement gates that enable agents to coordinate safely. The engineer becomes the protocol designer for a distributed system of AI actors. That's not a demotion from "programmer." It's the same move the industry made when it went from writing assembly to designing systems. The abstraction level changed. The engineering judgment became more valuable, not less.&lt;/p&gt;

&lt;p&gt;The trough is real. The exit is specifications — recognized, enforced, and verified.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This concludes The Fallacies of GenAI Development. The complete series: &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-1-faster-code-generation-means-faster-engineering-2jbl"&gt;#1 Faster Generation ≠ Faster Engineering&lt;/a&gt; · &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-2-if-the-output-looks-correct-it-is-correct-5gbf"&gt;#2 Plausible ≠ Correct&lt;/a&gt; · &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-3-you-can-verify-ai-output-with-another-ai-4fpk"&gt;#3 AI Can't Verify AI&lt;/a&gt; · &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-4-dropping-human-review-removes-the-bottleneck-1jo"&gt;#4 Dropping Review ≠ Removing Bottleneck&lt;/a&gt; · &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-5-better-context-prevents-hallucination-he0"&gt;#5 Better Context ≠ Correct Output&lt;/a&gt; · #6 Generated Code Is a Liability · &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-7-specifications-are-a-new-artifact-you-have-to-create-29d4"&gt;#7 Specifications Already Exist&lt;/a&gt; · #8 More Agents ≠ More Productivity&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For cloud infrastructure specifically, the specification-first model is implemented in &lt;a href="https://github.com/sufield/stave" rel="noopener noreferrer"&gt;Stave&lt;/a&gt; — an open-source tool that evaluates configuration snapshots against 2,662 safety invariants with deterministic mechanical verification. Apache 2.0.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For a single IAM policy file, try &lt;a href="https://github.com/sufield/iam-explain" rel="noopener noreferrer"&gt;iam-explain&lt;/a&gt; — point it at your policy JSON, see what the math says. The specifications are already in your policy. The tool shows you what they mean.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Fallacies of GenAI Development were inspired by Peter Deutsch's "Fallacies of Distributed Computing" (1994). The resolution draws from Parnas (1972), Altshuller's TRIZ (1946), Byron Cook's automated reasoning at AWS, and evidence from aviation, nuclear operations, financial trading, and Google's monorepo. Each fallacy was discovered independently across domains. The convergence is the evidence.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwaredevelopment</category>
      <category>architecture</category>
      <category>engineering</category>
    </item>
    <item>
      <title>Fallacies of GenAI Development #7: Specifications Are a New Artifact You Have to Create</title>
      <dc:creator>Bala Paranj</dc:creator>
      <pubDate>Fri, 05 Jun 2026 11:25:42 +0000</pubDate>
      <link>https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-7-specifications-are-a-new-artifact-you-have-to-create-29d4</link>
      <guid>https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-7-specifications-are-a-new-artifact-you-have-to-create-29d4</guid>
      <description>&lt;p&gt;&lt;em&gt;This is the seventh in a series of eight posts on the false assumptions teams make when building with generative AI. &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-1-faster-code-generation-means-faster-engineering-2jbl"&gt;Fallacy #1&lt;/a&gt; through &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-5-better-context-prevents-hallucination-he0"&gt;#5&lt;/a&gt; covered why AI-speed development breaks without verification. &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-5-better-context-prevents-hallucination-52fp"&gt;Fallacy #6&lt;/a&gt; covered why generated code is a liability, not an asset. This post covers the assumption that stops most teams from adding verification: the belief that it requires writing new specifications from scratch.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fallacy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"To adopt specification-first development, we need to write specifications for everything. That's too much work."&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it's tempting
&lt;/h2&gt;

&lt;p&gt;The word specification carries baggage. Engineers who've been in the industry long enough remember:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 1980s:&lt;/strong&gt; Formal methods. Z notation. VDM. B method. Full behavioral specifications that were as complex as the implementation. Writing the spec took as long as writing the code. Maintaining both was double the work. When deadline pressure hit, the spec was the first thing dropped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 1990s:&lt;/strong&gt; IEEE 830-style requirements documents. Hundreds of pages. Maintained alongside the code by a dedicated requirements engineer. The document and the code drifted apart within months. The requirements document became fiction that nobody read.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 2000s:&lt;/strong&gt; Model-driven development. UML diagrams generated code. The diagrams were the specification. Maintaining the diagrams was a full-time job. When the generated code needed manual modifications, the diagrams and the code diverged permanently.&lt;/p&gt;

&lt;p&gt;Each attempt failed for the same reason: the specification tried to describe EVERYTHING the system does. A complete behavioral specification is as complex as the implementation. Maintaining both is unsustainable. The spec rots. The team abandons it. The effort was wasted.&lt;/p&gt;

&lt;p&gt;Engineers hear specifications and think of these failures. The reaction is immediate: "We tried that. It didn't work. Too much overhead. We're not doing it again."&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it's wrong
&lt;/h2&gt;

&lt;p&gt;The fallacy is in the assumption that specification means what it meant in the 1980s — a complete behavioral description of the system.&lt;/p&gt;

&lt;p&gt;It doesn't. Look at your codebase right now. You already have specifications. You just don't call them that.&lt;/p&gt;

&lt;h3&gt;
  
  
  The specifications you already maintain
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Type signatures.&lt;/strong&gt; Every function in a statically typed language has a specification: what it accepts and what it returns. &lt;code&gt;func ParsePolicy(path string) (*PolicyDocument, error)&lt;/code&gt; is a specification. It says: this function takes a string and returns a document or an error. Any code that calls it with the wrong type fails at compile time. You already write these. You already maintain them. You already enforce them mechanically: the compiler is the enforcement gate.&lt;/p&gt;

&lt;p&gt;You're already using formal methods. Type checking is arguably the most widely used formal verification technique in history. It proves a mathematical property about your code, type safety, every time you compile. It's fast, deterministic, and catches errors before any human sees them. The task now is to extend the compiler's success to your other boundaries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API contracts.&lt;/strong&gt; Your REST API has an OpenAPI spec. Your gRPC services have Protocol Buffer definitions. Your GraphQL API has a schema. Each one specifies: these are the endpoints, these are the fields, these are the types, these are the required parameters. Any client that violates the contract gets an error. You already write and version them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database schemas.&lt;/strong&gt; Your migration files specify: these are the tables, these are the columns, these are the constraints, these are the foreign keys. &lt;code&gt;NOT NULL&lt;/code&gt;, &lt;code&gt;UNIQUE&lt;/code&gt;, &lt;code&gt;FOREIGN KEY&lt;/code&gt; — each is a specification enforced by the database engine on every write. You already write these.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration schemas.&lt;/strong&gt; Your Kubernetes manifests specify resource limits, health checks, replica counts. Your Terraform files specify infrastructure state. Your CI pipeline files specify build steps and their dependencies. Each is a specification that a machine reads and enforces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interface boundaries.&lt;/strong&gt; Your Go packages have exported and unexported functions. Your Java classes have public and private methods. Your Python modules have &lt;code&gt;__all__&lt;/code&gt; lists. Each boundary specifies: this is the interface, everything else is hidden. Parnas told you to create these in 1972. You did.&lt;/p&gt;
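
&lt;p&gt;The database case is easy to see in action. The Python sketch below uses the standard-library &lt;code&gt;sqlite3&lt;/code&gt; module and a made-up &lt;code&gt;users&lt;/code&gt; table to show the engine rejecting writes that violate &lt;code&gt;NOT NULL&lt;/code&gt; and &lt;code&gt;UNIQUE&lt;/code&gt;, whoever or whatever generated the write. The table and helper names are illustrative only.&lt;/p&gt;

```python
import sqlite3


def make_db():
    """Create an in-memory database whose schema carries the constraints."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users ("
        " id INTEGER PRIMARY KEY,"
        " email TEXT NOT NULL UNIQUE,"
        " name TEXT NOT NULL)"
    )
    return conn


def try_insert(conn, email, name):
    """Attempt one insert; return None on success or the engine's error text."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO users (email, name) VALUES (?, ?)", (email, name)
            )
        return None
    except sqlite3.IntegrityError as error:
        return str(error)


if __name__ == "__main__":
    db = make_db()
    print(try_insert(db, "a@example.com", "Ada"))   # accepted
    print(try_insert(db, "a@example.com", "Bob"))   # rejected by UNIQUE
    print(try_insert(db, "c@example.com", None))    # rejected by NOT NULL
```

&lt;p&gt;Nothing in the application code had to remember the rules. The schema is the specification and the engine is the gate.&lt;/p&gt;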

&lt;h3&gt;
  
  
  What Parnas said
&lt;/h3&gt;

&lt;p&gt;David Parnas's 1972 paper "On the Criteria To Be Used in Decomposing Systems into Modules" argued that humans can't hold entire systems in their heads. The solution: information hiding. Each module exposes an interface (what it does) and hides its implementation (how it does it).&lt;/p&gt;

&lt;p&gt;This paper is the foundation of every modular programming language, every API boundary, every microservice architecture, every package system. When you write a Go package with exported functions, you're implementing Parnas. When you define a Protocol Buffer service, you're implementing Parnas. When you create a database schema with foreign key constraints, you're implementing Parnas.&lt;/p&gt;

&lt;p&gt;You've been writing specifications your entire career. You just called them interfaces and schemas and contracts and types.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's missing
&lt;/h2&gt;

&lt;p&gt;The specifications exist. What's missing is a single thing: &lt;strong&gt;mechanical enforcement across all of them, on every change, at AI speed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your type system enforces type specifications — but only within one language, one compilation unit. It doesn't enforce that the TypeScript client matches the Go server's API contract.&lt;/p&gt;

&lt;p&gt;Your database engine enforces schema constraints — but only at write time. It doesn't enforce that the application code handles all the constraint violations correctly.&lt;/p&gt;

&lt;p&gt;Your API contract exists — but most teams don't have a CI check that fails the build when generated code deviates from the OpenAPI spec.&lt;/p&gt;

&lt;p&gt;Your module boundaries exist — but nothing prevents an AI agent from generating code that reaches across boundaries through reflection, type casting, or dynamic imports.&lt;/p&gt;

&lt;p&gt;The specifications are there. The enforcement is human — code review, manual testing, "I'll check that during review." And as &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-4-dropping-human-review-removes-the-bottleneck-1jo"&gt;Fallacy #4&lt;/a&gt; showed, human enforcement doesn't scale to AI-speed generation.&lt;/p&gt;

&lt;p&gt;There's an irony here that connects to the GenAI context: AI models are remarkably good at following well-defined contracts and remarkably bad at guessing implicit ones. When your OpenAPI spec is enforced mechanically, you give the AI a fixed target. It doesn't have to guess your intent. It just has to satisfy the contract. The mechanical enforcement that catches human errors ALSO catches AI errors — and the AI benefits from the explicit boundary even more than the human does, because the AI has zero institutional knowledge to fall back on when the contract is absent.&lt;/p&gt;
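
&lt;p&gt;As an illustration of how small such a gate can be, the Python sketch below closes one of the gaps listed above: it flags static imports that cross a module boundary. The rule itself (nothing may import from a package named &lt;code&gt;internal&lt;/code&gt;) and the function name are assumptions made for the example. It uses only the standard-library &lt;code&gt;ast&lt;/code&gt; module.&lt;/p&gt;

```python
import ast

# Hypothetical rule: checked code may not import anything under "internal".
FORBIDDEN_PREFIX = "internal"


def forbidden_imports(source, forbidden=FORBIDDEN_PREFIX):
    """Return sorted module names imported by `source` that fall under `forbidden`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level other than 0) are skipped in this sketch.
            modules = [node.module] if node.module and node.level == 0 else []
        else:
            continue
        for module in modules:
            if module == forbidden or module.startswith(forbidden + "."):
                found.add(module)
    return sorted(found)


if __name__ == "__main__":
    generated = "import os\nimport internal.db\nfrom internal.cache import get\n"
    print(forbidden_imports(generated))
```

&lt;p&gt;This catches only static imports. Dynamic imports and reflection, the other routes mentioned above, need a runtime check or a stricter rule that bans those mechanisms outright.&lt;/p&gt;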

&lt;h2&gt;
  
  
  The three-layer model
&lt;/h2&gt;

&lt;p&gt;Every domain that faced the "system too complex for humans to understand" problem arrived at the same three-layer structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer 1 — Boundaries (Parnas, 1972):
    The specifications. Already in your codebase.
    Module interfaces, API contracts, type signatures, schemas.
    Defines WHAT each part promises to the rest.

Layer 2 — Mechanical enforcement (the missing layer):
    A gate that checks EVERY change against EVERY boundary.
    At machine speed. Before deployment.
    The type checker is Layer 2 for type specifications.
    CI contract tests are Layer 2 for API specifications.
    What's missing: Layer 2 for architectural and security specifications.

Layer 3 — Rationale preservation:
    WHY each boundary exists.
    The failure mode it prevents. The incident that motivated it.
    Connected to the boundary so that changing it triggers
    review of the rationale.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Layer 1 you already have.&lt;/strong&gt; Your interfaces, contracts, types, and schemas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2 you partially have.&lt;/strong&gt; The compiler is Layer 2 for types. The database engine is Layer 2 for schemas. What you're missing: Layer 2 for the architectural constraints, security properties, and cross-service contracts that currently live in people's heads or in documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3 you barely have.&lt;/strong&gt; Maybe some Architecture Decision Records. Maybe some comments explaining "why" in the code. Rarely connected to the boundaries they explain. Rarely enforced — nobody checks whether a proposed change to an interface violates the rationale that shaped it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The extreme domains added Layer 2
&lt;/h2&gt;

&lt;p&gt;No safety-critical domain stopped at Layer 1. Every one discovered that boundaries without enforcement don't survive contact with reality at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microprocessor design:&lt;/strong&gt; Interface specifications exist (bus protocols, timing diagrams). Formal verification tools CHECK every circuit against the specifications mechanically. You can't tape out a chip without passing the verification suite.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nuclear operations:&lt;/strong&gt; Operating procedures exist (the specifications). Physical interlocks ENFORCE parameter limits mechanically. The reactor scrams automatically when parameters exceed the envelope — regardless of what the operator does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aviation:&lt;/strong&gt; Flight envelopes exist (the specifications). Fly-by-wire systems with envelope protection ENFORCE them mechanically. In normal operation the pilot can't command a stall: the computer limits the input before it reaches the control surfaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google monorepo:&lt;/strong&gt; API contracts exist (the specifications). CI gates ENFORCE them on every commit. A large-scale change touching millions of lines of code merges only when every affected contract test passes mechanically.&lt;/p&gt;

&lt;p&gt;Each domain: specifications existed (Layer 1). Human enforcement couldn't keep up (Layer 2 missing). They added mechanical enforcement (Layer 2). The system survived scaling.&lt;/p&gt;

&lt;p&gt;Your codebase is in the same position. Layer 1 exists. Layer 2 is partial. AI-speed generation made the gap visible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this isn't the 1980s
&lt;/h2&gt;

&lt;p&gt;The formal methods failures of the 1980s had a specific cause: the specification was a COMPLETE BEHAVIORAL DESCRIPTION — as complex as the code. That's not what this article is proposing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1980s approach (failed):
    Write a full specification of everything the system does.
    Spec complexity ≈ code complexity.
    Maintaining both is double the work.
    Spec rots. Team abandons it.

Current approach (what already exists + enforcement):
    Recognize the specifications you already maintain.
    Add mechanical enforcement for the ones that aren't enforced.
    Spec complexity &amp;lt;&amp;lt; code complexity because these are
    INTERFACE specifications, not BEHAVIORAL specifications.
    Types, contracts, schemas — small, stable, already maintained.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The type signature &lt;code&gt;func(string) (*Document, error)&lt;/code&gt; is not a complete behavioral specification of the function. It's an interface specification — what goes in, what comes out. It's smaller than the code. It changes less often. The compiler already enforces it.&lt;/p&gt;

&lt;p&gt;The OpenAPI spec is not a complete description of your API's behavior. It's an interface specification — endpoints, fields, types, required parameters. It's smaller than the code. It changes less often. Contract testing already enforces it.&lt;/p&gt;

&lt;p&gt;The database schema is not a complete model of your data's semantics. It's a structural specification — tables, columns, constraints. It's smaller than the application code. It changes less often. The database engine already enforces it.&lt;/p&gt;

&lt;p&gt;Each of these succeeded where the 1980s failed because they specify INTERFACES, not BEHAVIOR. Interfaces are small. Behavior is large. Small specifications are maintainable. Large specifications rot.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you can do this week
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Inventory your existing specifications.&lt;/strong&gt; Open your codebase. Count: How many typed function signatures? How many API contracts (OpenAPI, Protobuf, GraphQL schemas)? How many database schemas with constraints? How many CI pipeline definitions? You'll find more specifications than you expected. You've been writing them your entire career.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Identify the unenforced specifications.&lt;/strong&gt; Of the specifications you found, which ones have mechanical enforcement (compiler, contract test, schema validator, CI check)? Which ones exist as documents but aren't checked on every change? The unenforced ones are your Layer 2 gaps. Each gap is a place where AI-generated code can violate an interface without anyone catching it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Enforce one specification mechanically this week.&lt;/strong&gt; Pick the most critical unenforced specification. Your OpenAPI contract that exists but has no contract test? Add one. Your module boundary that exists but nothing prevents cross-boundary imports? Add a linter rule — tools like &lt;code&gt;arch-go&lt;/code&gt;, &lt;code&gt;depguard&lt;/code&gt;, or &lt;code&gt;eslint-plugin-import&lt;/code&gt; can enforce that your &lt;code&gt;internal&lt;/code&gt; packages can't be imported by your &lt;code&gt;public&lt;/code&gt; packages. That's a Layer 2 gate for a Layer 1 boundary, implemented in one config file. Your architecture decision that exists as a document but has no CI check? Write one check for one rule.&lt;/p&gt;

&lt;p&gt;One specification. Already written. One enforcement gate. New this week. That's the entire adoption path. Not "write specifications for everything." Not "adopt formal methods." Not "add a six-month process change." One existing specification. One new CI check. This week.&lt;/p&gt;
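&lt;p&gt;To show how small one enforcement gate can be, here is a minimal sketch of a boundary check. It is not how &lt;code&gt;arch-go&lt;/code&gt;, &lt;code&gt;depguard&lt;/code&gt;, or &lt;code&gt;eslint-plugin-import&lt;/code&gt; work; it is a hypothetical stand-alone Python script, and the package name &lt;code&gt;internal&lt;/code&gt; and the function names are assumptions made up for illustration.&lt;/p&gt;

```python
import ast

# Hypothetical rule: modules in the public package must not import "internal".
FORBIDDEN_PREFIX = "internal"


def imported_modules(source: str) -> list[str]:
    """Return every module name imported by the given Python source."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.append(node.module)
    return names


def boundary_violations(source: str) -> list[str]:
    """Return the imports that cross the forbidden boundary."""
    return [
        name
        for name in imported_modules(source)
        if name == FORBIDDEN_PREFIX or name.startswith(FORBIDDEN_PREFIX + ".")
    ]


sample = "import json\nfrom internal.billing import parse_invoice\n"
print(boundary_violations(sample))
```

&lt;p&gt;A CI job could run a check like this over every file in the public package and fail the build when the returned list is non-empty. A dedicated linter is the better long-term choice; the point of the sketch is only that the gate is a few dozen lines, not a process change.&lt;/p&gt;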

&lt;p&gt;&lt;strong&gt;4. Attach one rationale.&lt;/strong&gt; Take the specification you just enforced. Add one comment or one metadata field: WHY does this boundary exist? What incident or failure mode motivated it? When someone proposes changing it next quarter, the rationale will save the team from reintroducing the problem the boundary was designed to prevent.&lt;/p&gt;

&lt;p&gt;A Layer 3 comment isn't &lt;code&gt;// This parses JSON&lt;/code&gt;. It's:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// We use this specific parser because it handles the non-standard&lt;/span&gt;
&lt;span class="c1"&gt;// date format from the legacy billing system. The standard library&lt;/span&gt;
&lt;span class="c1"&gt;// parser silently truncates these dates, causing invoice mismatches.&lt;/span&gt;
&lt;span class="c1"&gt;// See Incident #402. Do not replace without verifying billing&lt;/span&gt;
&lt;span class="c1"&gt;// integration tests pass with the legacy date format.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without this, an AI agent will optimize the code by replacing the parser with the standard library version. The tests pass — because the test fixtures use standard dates. The billing integration breaks in production — because real billing data uses the legacy format. The rationale prevents the optimization that looks correct but isn't.&lt;/p&gt;

&lt;p&gt;The specifications already exist. Parnas gave you the principle. Your language gave you the type system. Your API framework gave you the contract. Your database gave you the schema. The only new thing is the enforcement gate — and that's a CI configuration, not a cultural transformation.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next and final in the series: &lt;strong&gt;Fallacy #8, "More AI Agents Means More Productivity."&lt;/strong&gt; Why adding agents without adding specifications is like scaling a distributed system without adding protocols, and why the same coordination mechanisms that solved distributed computing solve distributed AI development.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Fallacies of GenAI Development: eight assumptions every team is making. Each one leads to an architectural failure. Each one has already been solved.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwaredevelopment</category>
      <category>architecture</category>
      <category>engineering</category>
    </item>
    <item>
      <title>Fallacies of GenAI Development #6: AI-Generated Code Is an Asset</title>
      <dc:creator>Bala Paranj</dc:creator>
      <pubDate>Wed, 03 Jun 2026 10:28:19 +0000</pubDate>
      <link>https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-5-better-context-prevents-hallucination-52fp</link>
      <guid>https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-5-better-context-prevents-hallucination-52fp</guid>
      <description>&lt;p&gt;&lt;em&gt;This is the sixth in a series of eight posts on the false assumptions teams make when building with generative AI. &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-1-faster-code-generation-means-faster-engineering-2jbl"&gt;Fallacy #1&lt;/a&gt; covered the generation-engineering gap. &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-2-if-the-output-looks-correct-it-is-correct-5gbf"&gt;Fallacy #2&lt;/a&gt; covered plausible vs. correct. &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-3-you-can-verify-ai-output-with-another-ai-4fpk"&gt;Fallacy #3&lt;/a&gt; covered AI verifying AI. &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-4-dropping-human-review-removes-the-bottleneck-1jo"&gt;Fallacy #4&lt;/a&gt; covered removing review. &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-5-better-context-prevents-hallucination-he0"&gt;Fallacy #5&lt;/a&gt; covered context vs. verification. This post covers the assumption that generated code is value — when it's actually cost.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fallacy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"The AI generated 10,000 lines this week. We're 10x more productive."&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it's tempting
&lt;/h2&gt;

&lt;p&gt;Productivity has always been hard to measure in software. Lines of code was a bad metric, but it was a metric. When AI made code generation fast, the metric became irresistible again. The team generated more code. The PRs are larger. The features ship faster. The graphs go up and to the right.&lt;/p&gt;

&lt;p&gt;Leadership loves it. More output. More features. More velocity. The investment in AI tooling is paying off — look at the numbers. The team is producing more than ever.&lt;/p&gt;

&lt;p&gt;And it feels good to the engineers too. You describe a feature. The AI generates the implementation. You ship it. The dopamine hit of productivity is real. You built many things today. The backlog is shrinking. The sprint velocity is through the roof.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it's wrong
&lt;/h2&gt;

&lt;p&gt;Jeff Atwood said it plainly: &lt;strong&gt;code is a liability, not an asset.&lt;/strong&gt; The concept originates in Lean Manufacturing, applied to software by Mary and Tom Poppendieck (2003): unshipped code is Work in Progress waste, and shipped code is a maintenance burden. AI didn't change the economics — it made it faster to create the liability.&lt;/p&gt;

&lt;p&gt;Every line of code is something you have to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compile.&lt;/strong&gt; More code means longer build times. Bigger binaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test.&lt;/strong&gt; More code means more test cases needed. Dependencies grow quadratically — 10x code may mean 100x test compute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debug.&lt;/strong&gt; More code means more potential failure points. More places for bugs to hide.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure.&lt;/strong&gt; More code means more attack surface. More code paths to audit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Understand.&lt;/strong&gt; More code means more cognitive load. More context to hold when making changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintain.&lt;/strong&gt; More code means more things that break when dependencies change. More migration work when frameworks evolve.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Studies by the Consortium for Information &amp;amp; Software Quality (CISQ) put numbers to this: the cost of developing code is typically only 20–30% of its total lifecycle cost. The remaining 70–80% is maintenance. If AI makes development 10x faster but doesn't change the maintenance profile, the savings are capped by the development slice: cutting a 30% slice by 90% saves at most 27% of the lifecycle cost, and the maintenance share is untouched. If the AI also increases code volume, maintenance grows with it, and the total cost of ownership can actually &lt;em&gt;rise&lt;/em&gt;, even if development were free.&lt;/p&gt;
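&lt;p&gt;The lifecycle arithmetic can be sketched in a few lines. This is a toy model, not a CISQ formula: the function name, the parameters, and the assumption that maintenance cost scales linearly with code volume are all illustrative.&lt;/p&gt;

```python
def lifecycle_cost(dev_share: float, dev_speedup: float, volume_factor: float = 1.0) -> float:
    """Relative lifecycle cost after AI adoption; the baseline is 1.0.

    dev_share:     fraction of lifecycle cost spent on development (e.g. 0.3)
    dev_speedup:   how many times cheaper development becomes (e.g. 10)
    volume_factor: how much more code is produced; assumed here to scale
                   maintenance cost linearly (a simplifying assumption)
    """
    maintenance_share = 1.0 - dev_share
    return dev_share / dev_speedup + maintenance_share * volume_factor


# Same code volume, development 10x cheaper.
print(round(lifecycle_cost(0.3, 10), 2))
# Development 10x cheaper, but 1.5x as much code to maintain.
print(round(lifecycle_cost(0.3, 10, volume_factor=1.5), 2))
```

&lt;p&gt;With a 30% development share and a 10x speedup, the model gives about 0.73 of the baseline cost. If code volume grows 1.5x at the same time, it gives about 1.08, which is above the baseline even though development became ten times cheaper.&lt;/p&gt;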

&lt;p&gt;The team that generated 10,000 lines this week didn't create 10,000 lines of value. They created 10,000 lines of ongoing cost. The value was in the feature. The cost is in the code. The feature could have been delivered with 1,000 lines if the right abstractions existed. The 9,000 extra lines are pure liability.&lt;/p&gt;

&lt;h2&gt;
  
  
  The boom
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Month 1-3: The output surge.&lt;/strong&gt; The team generates more code than ever. Features ship. The backlog shrinks. Sprint velocity doubles, then triples. Leadership presents the metrics at the quarterly review. AI is working.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 4-6: The build slows down.&lt;/strong&gt; Compilation time increases. CI pipeline takes longer. Developers wait for builds. The wait time is small at first — 10%, 15% longer. Nobody notices because the generation is so fast. But the trend is upward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 7-9: The test burden compounds.&lt;/strong&gt; New code depends on existing code. Existing code depends on other existing code. The dependency graph grows quadratically. A change to one module triggers tests in 50 other modules. The test suite that ran in 8 minutes now takes 45 minutes. Developers start skipping test runs locally and relying on CI. CI queues back up.&lt;/p&gt;

&lt;p&gt;Winters, Manshreck, and Wright document this in &lt;em&gt;Software Engineering at Google&lt;/em&gt; (2020). Adam Bender names it precisely: "If your codebase is 10 times larger and you're trying to test all the dependencies so that you're sure nothing will break, you may have upwards of 100 times as many tests running. Maybe 1,000 times as many tests. That's going to be a line item in your budget."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 10-12: The maintenance cliff.&lt;/strong&gt; A dependency needs updating. A security patch requires changes. A framework is deprecated. Each change touches thousands of generated lines that nobody fully understands. The team that generated code in minutes spends weeks migrating it. The code that was cheap to produce is expensive to maintain. Lehman's Second Law of Software Evolution (1980) predicts this: as a program is continually changed, its complexity increases unless active work is done to reduce it. AI is purely additive — it adds structure-deteriorating code at high speed. Without deliberate refactoring and deletion, the system reaches the maintenance cliff faster than human teams can recover.&lt;/p&gt;

&lt;p&gt;There's a subtler rot underneath. Five developers each generated their own version of &lt;code&gt;formatDate()&lt;/code&gt;, &lt;code&gt;parseUserInput()&lt;/code&gt;, and &lt;code&gt;retryWithBackoff()&lt;/code&gt;. Each used a different prompt. Each produced a slightly different implementation. A bug is found in one version of &lt;code&gt;parseUserInput()&lt;/code&gt; — but nobody knows four other versions exist scattered across the codebase. The fix patches one. The other four remain broken. This is inconsistency debt — the invisible cost of generation without reuse. A library would have had one implementation. Five generations created five liabilities. Google's "One Version Rule" (documented in &lt;em&gt;Software Engineering at Google&lt;/em&gt;) exists precisely to prevent this: every library has a single version across the monorepo. AI generation is the ultimate violator of the One Version Rule — each prompt produces a "new version" of a common utility, creating dependency hell within a single codebase.&lt;/p&gt;
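&lt;p&gt;Inconsistency debt of this kind can be surfaced mechanically. The sketch below is one hypothetical way to do it for Python sources: it reports function names defined in more than one file. It matches on names only (so it also counts methods and nested functions, and it misses copies that were renamed), and the file paths and function names are made up for illustration.&lt;/p&gt;

```python
import ast
from collections import defaultdict


def duplicate_functions(sources: dict[str, str]) -> dict[str, list[str]]:
    """Map each function name defined in more than one file to those files.

    `sources` maps a file path to its Python source text. Matching is by
    name only, so methods and nested functions are counted as well.
    """
    defined_in = defaultdict(set)
    for path, text in sources.items():
        for node in ast.walk(ast.parse(text)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                defined_in[node.name].add(path)
    return {
        name: sorted(paths)
        for name, paths in defined_in.items()
        if len(paths) > 1
    }


example = {
    "billing/utils.py": "def format_date(d):\n    return str(d)\n",
    "reports/helpers.py": "def format_date(d):\n    return d.isoformat()\n",
}
print(duplicate_functions(example))
```

&lt;p&gt;A report like this does not fix anything by itself, but it turns "nobody knows four other versions exist" into a list someone can consolidate.&lt;/p&gt;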

&lt;p&gt;And then the quiet realization: the team isn't building new features anymore. They're maintaining old code. The AI generates new code faster than ever. But the team spends most of their time on the code that already exists — patching, migrating, debugging, securing. The generation speed is irrelevant when the maintenance burden consumes all available time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The ratio inverts.&lt;/strong&gt; In month 1, the team spent 90% of their time building new features and 10% maintaining existing code. By month 12, it's 30% building and 70% maintaining. The AI made the 30% faster. Nobody made the 70% faster. The overall velocity DECREASES even as the generation speed INCREASES.&lt;/p&gt;
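&lt;p&gt;The shrinking payoff can be sketched with Amdahl's law, which gives the overall speedup when only part of the work gets faster. The assumptions are illustrative: generation speeds up the "building" share by a fixed factor and leaves the maintenance share untouched.&lt;/p&gt;

```python
def overall_speedup(accelerated_share: float, factor: float) -> float:
    """Amdahl's law: overall speedup when only part of the work gets faster."""
    return 1.0 / ((1.0 - accelerated_share) + accelerated_share / factor)


# Month 1: 90% of team time is building, and building is 10x faster.
print(round(overall_speedup(0.9, 10), 2))
# Month 12: only 30% of team time is building.
print(round(overall_speedup(0.3, 10), 2))
```

&lt;p&gt;Under these assumptions, a 10x generation speedup is worth roughly 5.3x overall when 90% of time is spent building, and only about 1.4x once building has fallen to 30% of the team's time.&lt;/p&gt;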

&lt;h2&gt;
  
  
  The unit of progress is wrong
&lt;/h2&gt;

&lt;p&gt;The software industry has been optimizing for the wrong unit of progress for decades. AI made the mistake faster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Code generation:     More lines → more volume → more liability
Function composition: Same functions, composed differently → more capability → zero new liability
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unix proved the right unit 50 years ago. Doug McIlroy, the inventor of Unix pipes, summarized the philosophy in 1978: "Expect the output of every program to become the input to another, as yet unknown, program." &lt;code&gt;grep&lt;/code&gt;, &lt;code&gt;sort&lt;/code&gt;, &lt;code&gt;uniq&lt;/code&gt;, &lt;code&gt;awk&lt;/code&gt;, &lt;code&gt;sed&lt;/code&gt; — each is a function with a stable interface (stdin/stdout). They've survived:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The transition from mainframes to minicomputers to PCs to servers to cloud&lt;/li&gt;
&lt;li&gt;Six generations of operating systems&lt;/li&gt;
&lt;li&gt;Dozens of programming language fashions&lt;/li&gt;
&lt;li&gt;Every deployment paradigm from tape to containers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They survived because the unit of reuse is the &lt;strong&gt;function&lt;/strong&gt;, not the &lt;strong&gt;code&lt;/strong&gt;. Nobody regenerates &lt;code&gt;grep&lt;/code&gt; for each project. Nobody writes a new sorting algorithm for each application. You compose existing functions through stable interfaces. The value is in the composition. The functions are reusable. The liability is near zero because you're not maintaining them; the upstream maintainers are. AI generation is an anti-Unix force: it generates custom, non-standard implementations every time instead of reusing standard tools.&lt;/p&gt;

&lt;p&gt;Google proved the same principle at organizational scale. When a team needs an RSS feed generator, they reuse the existing RSS function. They don't generate new code that reimplements RSS generation. Google's internal codebase has millions of reusable functions with stable interfaces. The productivity comes from knowing which function to use, not from generating a new implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The metric inversion
&lt;/h2&gt;

&lt;p&gt;The fallacy persists because teams measure the wrong thing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;What teams measure:          What they should measure:
─────────────────────        ───────────────────────────
Lines of code generated      Properties verified per module
PRs merged per sprint        Specification coverage (% of behaviors governed)
Features shipped             Functions reused vs. code generated
Sprint velocity points       Maintenance burden (% time on existing vs. new)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The left column goes up when you generate more code. Leadership celebrates. The right column would show that more generated code means LOWER specification coverage (more ungoverned behavior), LOWER reuse ratios (more reimplemented functions), and HIGHER maintenance burden (more time on existing code).&lt;/p&gt;

&lt;p&gt;The team that generated 10,000 lines and verified zero properties is more productive on the left-column metrics and more indebted on the right-column metrics. The left column measures output speed. The right column measures engineering health. When these diverge — output going up, health going down — the team is building debt, not value.&lt;/p&gt;

&lt;h2&gt;
  
  
  The resolution: compose, don't generate
&lt;/h2&gt;

&lt;p&gt;The alternative to generating more code is composing fewer, better-verified units.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GENERATE (current model):
    Need an RSS parser    → AI generates 200 lines
    Need a date formatter → AI generates 80 lines
    Need an HTTP client   → AI generates 150 lines
    Need JSON validation  → AI generates 120 lines

    Total: 550 lines of new code to maintain
    Properties verified: 0
    Reuse potential: 0 (tightly coupled to this project)

COMPOSE (alternative model):
    Need an RSS parser    → import rss-parser (maintained upstream)
    Need a date formatter → import date-fns (maintained upstream)
    Need an HTTP client   → import http-client (maintained upstream)
    Need JSON validation  → validate against JSON Schema (maintained upstream)

    Total: 4 import statements + 20 lines of composition
    Properties verified: each library has its own test suite
    Reuse potential: 100% (same libraries across every project)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The composed version has 96% less code. 96% less to compile, test, debug, secure, understand, and maintain. The libraries are maintained by their upstream communities. The composition is the only new code — and it's 20 lines, reviewable in minutes.&lt;/p&gt;

&lt;p&gt;AI's best role isn't generating the 550 lines. It's helping you FIND the right four libraries and COMPOSE them correctly. The generation is the least valuable part. The selection and composition are the most valuable parts.&lt;/p&gt;

&lt;p&gt;The expert AI-assisted engineer isn't a faster typist. They're a better librarian. Their value isn't in generating a custom implementation of a common pattern. It's in knowing that an optimized, well-tested, actively-maintained library already exists — and using the AI to write the 20 lines of glue code to connect it. The librarian ships less code and more capability. The typist ships more code and more debt. Caldiera and Basili (1991) demonstrated this in their research on software reuse: successful software evolution depends on prioritizing reuse over creation. AI makes creation so cheap that it crowds out the impulse to search for existing libraries — a phenomenon engineers recognize as "Not Invented Here Syndrome," now running at machine speed.&lt;/p&gt;

&lt;h2&gt;
  
  
  When generation is appropriate
&lt;/h2&gt;

&lt;p&gt;Generation is appropriate when:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No reusable function exists.&lt;/strong&gt; Genuinely novel logic with no upstream library. Generate it — but immediately extract it as a reusable function with a stable interface, tests, and a specification. Don't leave it inline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The code is disposable.&lt;/strong&gt; Prototypes, experiments, one-off scripts. Generate freely. But don't let disposable code become production code. Bender's isolation principle: "You don't want that cool prototype code to find its way into production."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The composition IS the code.&lt;/strong&gt; Glue code that connects well-tested components. This is appropriate for generation because the value is in the wiring, the individual components are already verified, and the glue code can be verified against the interface contracts of the components it connects.&lt;/p&gt;

&lt;p&gt;In each case, the generated code should be the MINIMUM necessary — not the maximum the AI can produce. The AI that generates the least code to solve the problem is more valuable than the AI that generates the most. Less code = less liability = less maintenance = more time for new capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Delete Code Test
&lt;/h2&gt;

&lt;p&gt;If your team generated 10,000 lines last week and you deleted 9,000 of them, replacing them with imports and 200 lines of composition — would the system work the same way?&lt;/p&gt;

&lt;p&gt;If yes, you generated 9,000 lines of pure liability.&lt;/p&gt;

&lt;p&gt;If you're not sure, you don't have enough specifications to know, which is Fallacy #2 (plausible ≠ correct) and Fallacy #5 (context ≠ verification) resurfacing.&lt;/p&gt;

&lt;p&gt;The goal isn't more code. It's more CAPABILITY with LESS CODE. Every line you didn't generate is a line you don't maintain. Every function you reused instead of regenerated is a function someone else maintains. Every specification you verified is a property that holds regardless of how much code is behind it.&lt;/p&gt;

&lt;p&gt;Code is a liability. Less of it is better. The most productive team isn't the one that generates the most. It's the one that ships the most capability while maintaining the least code.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you can do this week
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Measure your generation-to-reuse ratio.&lt;/strong&gt; For your last 10 AI-generated files, count: how many imported existing libraries vs. how many reimplemented functionality that libraries already provide? If reimplementation exceeds reuse, the AI is generating liability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Before generating, ask: "Does this function already exist?"&lt;/strong&gt; Better yet, ask the AI: "Is there a popular, well-maintained open-source library that handles [Task X]?" BEFORE asking it: "Write a function that handles [Task X]." If a library exists, the AI's job is to write the integration code, not the logic. The 5-minute search saves weeks of future maintenance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Track maintenance burden monthly.&lt;/strong&gt; What percentage of your team's time goes to maintaining existing code vs. building new capabilities? If that percentage is rising while code generation is also rising, the generation is feeding the maintenance burden. The metric tells you whether your AI investment is producing value or producing debt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Measure your additions-to-deletions ratio.&lt;/strong&gt; A high-performing AI-assisted team should be deleting redundant code and replacing it with better abstractions as often as they add new code. If your ratio is 10:1 additions-to-deletions, the codebase is growing without consolidation. If it's closer to 2:1 or even 1:1, the team is composing: replacing generated sprawl with verified, reusable units. The ratio tells you whether the AI is building capability or building debt.&lt;/p&gt;
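&lt;p&gt;One way to get that ratio is from Git's own line counts. The sketch below assumes you feed it the text produced by &lt;code&gt;git log --numstat --format=&lt;/code&gt; for the period you care about; the function name is hypothetical, and it counts every tracked text file, so you may want to filter out vendored or generated paths first.&lt;/p&gt;

```python
def additions_to_deletions(numstat: str) -> float:
    """Ratio of added to deleted lines in `git log --numstat` output.

    Each relevant line has three tab-separated fields: lines added, lines
    deleted, and the file path. Binary files show "-" instead of counts
    and are skipped, as are blank lines.
    """
    added = deleted = 0
    for line in numstat.splitlines():
        parts = line.split("\t", 2)
        if len(parts) != 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        added += int(parts[0])
        deleted += int(parts[1])
    return added / deleted if deleted else float("inf")


sample = "10\t1\tsrc/app.py\n-\t-\tdocs/logo.png\n\n20\t2\tsrc/util.py\n"
print(additions_to_deletions(sample))
```

&lt;p&gt;Tracked per sprint or per month, the trend matters more than any single value: a ratio that keeps climbing suggests generation without consolidation.&lt;/p&gt;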

&lt;p&gt;The AI is a remarkable code generator. That's the problem. The skill isn't generating more code. It's knowing when NOT to generate — when to compose, when to reuse, when to import, and when to write the 20 lines of composition instead of the 550 lines of reimplementation.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next in the series: &lt;strong&gt;Fallacy #7, "Specifications Are a New Artifact You Have to Create."&lt;/strong&gt; Why the specifications already exist in your codebase, why Parnas told you to create them in 1972, and why the only new thing is mechanical enforcement.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Fallacies of GenAI Development: eight assumptions every team is making. Each one leads to an architectural failure. Each one has already been solved.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Caldiera, G. and Basili, V. (1991). "Identifying and Qualifying Reusable Software Components." &lt;em&gt;IEEE Computer&lt;/em&gt;, 24(2).&lt;/li&gt;
&lt;li&gt;CISQ (Consortium for Information &amp;amp; Software Quality). "The Cost of Poor Software Quality in the US."&lt;/li&gt;
&lt;li&gt;Lehman, M.M. (1980). "Programs, Life Cycles, and Laws of Software Evolution." &lt;em&gt;Proceedings of the IEEE&lt;/em&gt;, 68(9).&lt;/li&gt;
&lt;li&gt;McIlroy, M.D. (1978). "Unix Time-Sharing System: Foreword." &lt;em&gt;The Bell System Technical Journal&lt;/em&gt;, 57(6).&lt;/li&gt;
&lt;li&gt;Poppendieck, M. and Poppendieck, T. (2003). &lt;em&gt;Lean Software Development: An Agile Toolkit&lt;/em&gt;. Addison-Wesley.&lt;/li&gt;
&lt;li&gt;Winters, T., Manshreck, T., and Wright, H. (2020). &lt;em&gt;Software Engineering at Google: Lessons Learned from Programming Over Time&lt;/em&gt;. O'Reilly Media.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>softwaredevelopment</category>
      <category>architecture</category>
      <category>engineering</category>
    </item>
    <item>
      <title>Fallacies of GenAI Development #5: Better Context Prevents Hallucination</title>
      <dc:creator>Bala Paranj</dc:creator>
      <pubDate>Wed, 03 Jun 2026 10:28:19 +0000</pubDate>
      <link>https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-5-better-context-prevents-hallucination-he0</link>
      <guid>https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-5-better-context-prevents-hallucination-he0</guid>
      <description>&lt;p&gt;&lt;em&gt;This is the fifth in a series of eight posts on the false assumptions teams make when building with generative AI. &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-1-faster-code-generation-means-faster-engineering-2jbl"&gt;Fallacy #1&lt;/a&gt; covered the generation-engineering gap. &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-2-if-the-output-looks-correct-it-is-correct-5gbf"&gt;Fallacy #2&lt;/a&gt; covered plausible vs. correct. &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-3-you-can-verify-ai-output-with-another-ai-4fpk"&gt;Fallacy #3&lt;/a&gt; covered AI verifying AI. &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-4-dropping-human-review-removes-the-bottleneck-1jo"&gt;Fallacy #4&lt;/a&gt; covered removing the review gate. This post covers the assumption that better input guarantees correct output.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fallacy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"If we give the AI better documentation, up-to-date APIs, and more context, it won't hallucinate."&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it's tempting
&lt;/h2&gt;

&lt;p&gt;You've seen the problem firsthand. You ask the AI to call an API. It uses a deprecated endpoint. You ask it to implement a library function. It invents a method that doesn't exist. You ask it to follow your team's coding conventions. It generates code that looks like it read the conventions from 2019.&lt;/p&gt;

&lt;p&gt;The diagnosis seems obvious: the AI doesn't have the right information. The fix seems obvious: give it the right information. RAG retrieves relevant documents before the AI generates. Context Hub tools fetch the latest API documentation. System prompts inject your team's conventions. The context window fills with the information the AI needs.&lt;/p&gt;

&lt;p&gt;And it works. The AI generates code that uses the current API. It follows the latest conventions. The hallucination rate drops measurably. The investment in context infrastructure pays off.&lt;/p&gt;

&lt;p&gt;So what's the fallacy?&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it's wrong
&lt;/h2&gt;

&lt;p&gt;Better context reduces hallucination. It does not eliminate it. Production failures live in the gap between reduced and eliminated.&lt;/p&gt;

&lt;p&gt;Three reasons context can't close the gap completely:&lt;/p&gt;

&lt;h3&gt;
  
  
  Reason 1: The AI can ignore the context
&lt;/h3&gt;

&lt;p&gt;Retrieval-Augmented Generation retrieves documents and places them in the context window. The model is SUPPOSED to use them. But the model can — and does — override the retrieved context with patterns from its training data when the training data feels more natural to the generation process.&lt;/p&gt;

&lt;p&gt;Liu et al. (2023) documented this as the "lost in the middle" problem: LLM performance degrades significantly when relevant information is placed in the middle of a long context window. Performance is highest at the beginning or end, and drops in between. This alone proves that "more context" is not a linear path to "more correctness."&lt;/p&gt;

&lt;p&gt;But the problem is deeper than positional bias. Research into parametric versus non-parametric memory (Longpre et al., 2021; Neeman et al., 2022) shows that when an LLM's internal training weights (parametric) conflict with the provided RAG context (non-parametric), the model frequently defaults to its training — especially when the training data was highly reinforced. The model's internal weights represent billions of patterns. The context window represents a few thousand tokens. When the two conflict, the model doesn't reliably choose the context.&lt;/p&gt;

&lt;p&gt;You can test this yourself: provide a document that explicitly contradicts the model's training data. Ask a question about it. The model will sometimes answer from the document and sometimes answer from its training. The percentage depends on the model, the prompt, and the specific conflict. You cannot count on it being 100% from the document. The model isn't reading and following instructions. It's predicting the most likely next token given ALL its inputs: training weights and context window combined.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reason 2: Context is necessary, verification is separate
&lt;/h3&gt;

&lt;p&gt;Context answers: "What information should the AI use?"&lt;br&gt;
Verification answers: "Did the AI use the information correctly?"&lt;/p&gt;

&lt;p&gt;These are different questions with different mechanisms. Having the right ingredients doesn't guarantee the dish is correct. A chef with perfect ingredients can still combine them wrong, overcook the protein, or plate the wrong dish entirely. The ingredients are necessary. The taste test is still required.&lt;/p&gt;

&lt;p&gt;Modern benchmarks confirm this gap. The Retrieval-Augmented Generation Benchmark (RGB) shows that models often fail to reason correctly over retrieved documents even when the retrieval is 100% accurate. Having the right documents in the context window and using them correctly are independent capabilities — and current models fail at the second even when the first is perfect.&lt;/p&gt;

&lt;p&gt;In engineering terms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Context pipeline:       Ensures the AI HAS the right information
                        → vector database, retrieval, ranking, injection
                        → measured by: retrieval accuracy, relevance scores

Verification pipeline:  Ensures the AI USED the information correctly
                        → specification checks, contract tests, property verification
                        → measured by: output correctness against declared properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most teams invest heavily in the first pipeline and have nothing for the second. They measure retrieval quality — "did we give the AI the right documents?" — but not output correctness — "did the output satisfy the properties it should satisfy?"&lt;/p&gt;

&lt;p&gt;This is like a restaurant that invests millions in sourcing the finest ingredients, tracks every supplier relationship, and monitors freshness down to the hour, but has no quality check on the finished dish. The ingredients are world-class. The plate that reaches the customer might still be bland or overcooked.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reason 3: Context doesn't cover properties
&lt;/h3&gt;

&lt;p&gt;Even with perfect context — every document retrieved, every API current, every convention injected — the AI has no mechanism to enforce PROPERTIES that span the generated output.&lt;/p&gt;

&lt;p&gt;"No function in this module may call an external service without a timeout." That's a property. It's not in any API documentation. It's not in any retrieved document. It's an architectural decision the team made. The AI has no way to know about it from context alone, because it was never written as a retrievable document — it lives in the team's shared understanding.&lt;/p&gt;

&lt;p&gt;"This data pipeline must preserve message ordering." That's a constraint. It's implicit in the architecture. Even if someone wrote it down, RAG would have to retrieve THAT SPECIFIC document for THAT SPECIFIC generation. The probability of the right document being retrieved for the right generation at the right time decreases as the number of such properties grows.&lt;/p&gt;

&lt;p&gt;Context can provide facts. It can't provide the complete set of properties that must hold across every generated artifact. Properties are exhaustive ("ALL functions must have timeouts"). Context is sampled ("HERE are some relevant documents").&lt;/p&gt;

&lt;p&gt;There's a deeper structural reason RAG misses properties. RAG retrieves by semantic similarity — vector math that finds documents "close to" the query in meaning space. A security invariant ("never log PII") is semantically FAR from a function that processes user data. They aren't similar in the vector space. The RAG system never retrieves the security rule for that specific generation — because security constraints and implementation code don't look alike in embeddings. Facts live near the code that uses them. Properties live in a different semantic neighborhood entirely.&lt;/p&gt;
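
&lt;p&gt;The ranking mechanic can be sketched with a toy retriever. This is only an illustration under loud assumptions: it scores similarity by raw word overlap instead of a learned embedding model, and the query and both documents are made up. Real embeddings behave differently in detail, but the shape of the failure is the same: the fact that resembles the task ranks first, and the property that shares no vocabulary with it does not.&lt;/p&gt;

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two texts using raw word counts (toy stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    return sorted(documents, key=lambda d: cosine(query, d), reverse=True)[:k]

# Hypothetical query and corpus, invented for this sketch.
QUERY = "write a function to parse the signup form and create a user"
DOCS = [
    "signup form fields and how to parse the form payload",   # a fact
    "never log personally identifiable information",          # a property
]

print(retrieve(QUERY, DOCS))   # ['signup form fields and how to parse the form payload']
```

&lt;p&gt;With a top-1 retrieval budget, the PII rule never reaches the context window for this generation, even though the generated function handles user data.&lt;/p&gt;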

&lt;p&gt;Mishra et al. (2022) demonstrated a related failure: LLMs struggle significantly with negation and constraints in instructions. "Never log PII" is a negative property. Even when the constraint is explicitly in the context, the model's training data is composed almost entirely of positive examples ("do Y"), not negative ones ("never do X"). The context provides the constraint. The model's training biases it against following it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The boom
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Month 1-3: The context investment.&lt;/strong&gt; The team builds a sophisticated RAG pipeline. Vector database, document preprocessing, chunk optimization, relevance scoring. The AI generates better code. Hallucination rate drops from 15% to 5%. The team publishes an internal blog post celebrating the improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 4-6: The 5% that matters.&lt;/strong&gt; The 5% residual hallucination isn't random. It's concentrated in the hard cases — the ones where the context conflicts with the training data, the ones where the property isn't in any retrievable document, the ones where the AI needs to reason about composition rather than pattern-match individual facts. These are disproportionately the cases that cause production incidents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 7-9: The false confidence incident.&lt;/strong&gt; A developer generates code for a new feature. RAG retrieves the correct API documentation. The AI uses the correct API endpoint. The code compiles. The tests pass. But the code violates an architectural constraint — it makes a synchronous call to an external service without a timeout, inside a transaction. Nothing in the retrieved context mentioned the timeout requirement. The constraint was in the team's architecture decision records, which were never indexed in the vector database. The service hangs. The database connection pool exhausts. The outage lasts four hours.&lt;/p&gt;

&lt;p&gt;Here's the cruel irony: better context made the hallucination MORE dangerous, not less. The code used the correct API. It followed the correct conventions. It looked MORE correct than code generated without RAG — which made everyone trust it more. Better context doesn't stop the AI from generating violations. It gives the AI better facts to wrap the violation in. The plausibility goes up. The scrutiny goes down. The damage goes up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Post-mortem:&lt;/strong&gt; "We had the right context. The AI used the right API. But the AI didn't know about the timeout constraint because we didn't retrieve it. And we didn't retrieve it because we didn't index it. And we didn't index it because we didn't know we needed to."&lt;/p&gt;

&lt;p&gt;The post-mortem reveals the structural gap: context improves what the AI generates FROM. It doesn't check what the AI generates AGAINST. The retrieved documents were correct. The generated code was plausible. The property that was violated was never in the context pipeline — and no verification pipeline existed to catch it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 10: The response.&lt;/strong&gt; The team indexes their ADRs in the vector database. They add more documents to the RAG pipeline. The context gets better. The next incident is a different property that wasn't indexed. The whack-a-mole cycle begins: each incident reveals a property that should have been in the context but wasn't. The team can never index EVERYTHING because they don't have a complete list of everything that matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The resolution: verify the output, not just the input
&lt;/h2&gt;

&lt;p&gt;Context and verification are complementary. Both are necessary. Neither is sufficient alone.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INPUT QUALITY (context):
    "Give the AI the right information"
    → RAG, Context Hub, up-to-date docs, system prompts
    → Reduces hallucination
    → Measured by: retrieval accuracy

OUTPUT CORRECTNESS (verification):
    "Check that the AI used the information correctly"
    → Specification gates, contract tests, property checks
    → Catches violations regardless of context quality
    → Measured by: properties satisfied per change
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The team that invests only in context is optimizing input quality and hoping output correctness follows. Sometimes it does. The 5% where it doesn't is where the incidents live.&lt;/p&gt;

&lt;p&gt;The team that adds verification checks the output REGARDLESS of how it was generated. Whether the AI had perfect context or no context at all, the verification catches violations — because the verification checks PROPERTIES, not INPUTS.&lt;/p&gt;

&lt;h3&gt;
  
  
  The kitchen analogy, completed
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Context = Ingredients
    Better ingredients → better chance of a good dish
    But: a chef can still combine them wrong

Verification = Taste test
    Check the finished dish against the standard
    Catches errors regardless of ingredient quality

Great restaurants have BOTH:
    World-class sourcing AND quality checks on every plate

Great engineering has BOTH:
    RAG pipeline AND specification verification on every change
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Investing in RAG without investing in verification is like a restaurant that sources the finest ingredients but has no chef tasting the dish before it reaches the customer. Most dishes will be fine. The ones that aren't will be the ones the customer remembers.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the context pipeline can't replace
&lt;/h2&gt;

&lt;p&gt;Five categories of problems that better context cannot solve:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Compositional correctness.&lt;/strong&gt; The AI generates Function A correctly (right context retrieved) and Function B correctly (right context retrieved). The composition of A and B violates a cross-cutting property. No individual document in the RAG pipeline covers the composition — because the property emerges from the interaction, not from any single component.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Architectural constraints.&lt;/strong&gt; "All inter-service calls must use gRPC with deadline propagation." This is an organizational decision, not a fact in a document. Even if indexed, RAG must retrieve it for EVERY generation that involves a service call. One missed retrieval = one violation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Negative properties.&lt;/strong&gt; "This function must NEVER log PII." Context tells the AI what TO do. Negative properties tell it what NOT to do. The AI can follow the positive context perfectly and still violate a negative property that wasn't in the retrieved documents — and even when the constraint is retrieved, the model's training bias toward positive examples means it struggles to follow negative instructions reliably (Mishra et al., 2022).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Implicit conventions.&lt;/strong&gt; "Error handling in this codebase uses result types, not exceptions." The convention is embedded in 10,000 existing functions. RAG might retrieve a few examples. The AI might still generate exceptions because its training data favors exceptions. The context helps. It doesn't guarantee.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Mathematical properties.&lt;/strong&gt; "The sum of all line items must equal the invoice total." This is arithmetic. The AI can have perfect context about the invoice schema and still generate code where floating-point arithmetic introduces a rounding error. The property requires verification, not context.&lt;/p&gt;
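
&lt;p&gt;The invoice property in item 5 is easy to demonstrate. The sketch below is a minimal illustration, not a billing implementation: the function names are hypothetical, and it assumes amounts are stored as integer cents. It shows the same three line items failing the equality check as floating-point dollars and passing it as integers.&lt;/p&gt;

```python
def total_in_cents(line_items_cents):
    """Sum line items held as integer cents; integer addition is exact."""
    return sum(line_items_cents)

def invoice_is_consistent(line_items_cents, invoice_total_cents):
    """The property: the line items must add up to the invoice total."""
    return total_in_cents(line_items_cents) == invoice_total_cents

# Floating-point dollars: three 10-cent items do not sum to exactly 0.30.
float_items = [0.10, 0.10, 0.10]
print(sum(float_items) == 0.30)                  # False

# Integer cents: the same invoice satisfies the property exactly.
print(invoice_is_consistent([10, 10, 10], 30))   # True
```

&lt;p&gt;No retrieved document about the invoice schema would have flagged the first case. A check like invoice_is_consistent, run against generated code's output, does.&lt;/p&gt;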

&lt;h2&gt;
  
  
  What you can do this week
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. List five properties that must hold in your system.&lt;/strong&gt; Not facts the AI needs to know — PROPERTIES the output must satisfy. "All endpoints require authentication." "No database query uses string concatenation for parameters." "All monetary amounts use integer cents, not floating-point dollars." These are your verification targets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Check: are any of these in your RAG pipeline?&lt;/strong&gt; Probably not. Properties live in people's heads, in architecture decision records, in tribal knowledge. They're not the kind of documents teams typically index for retrieval. This gap is your vulnerability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Add one property as a mechanical check.&lt;/strong&gt; Not in the RAG pipeline — in the CI pipeline. A check that runs on every change, regardless of what context the AI had. The property either holds or it doesn't. The check is the verification layer that context can't provide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Audit your RAG pipeline against your properties.&lt;/strong&gt; Pick 10 recent successful retrievals — cases where RAG provided the right documents and the AI generated correct-looking code. For each one, ask: "If the AI had used this context perfectly but ignored our internal timeout policy (or our PII logging rule, or our authentication requirement), would anything have caught it?" If the answer is no for even one, you have a context-only architecture. The context is doing its job. The verification layer doesn't exist yet.&lt;/p&gt;
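
&lt;p&gt;Step 3 can be sketched as a small static check. This is a minimal sketch under stated assumptions: it treats only requests.get and requests.post as "external service calls" (the EXTERNAL_CALLS set is a hypothetical policy, not a complete one), and it only recognizes the simple module.function call form. It uses Python's standard ast module to flag any such call that passes no timeout keyword.&lt;/p&gt;

```python
import ast

# Hypothetical policy: these call names count as "external service calls".
EXTERNAL_CALLS = {"requests.get", "requests.post"}

def dotted_name(node):
    """Return 'module.attr' for simple attribute calls, else None."""
    if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
        return node.value.id + "." + node.attr
    return None

def calls_without_timeout(source):
    """Return line numbers of external calls that pass no timeout keyword."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        if dotted_name(node.func) not in EXTERNAL_CALLS:
            continue
        keywords = {kw.arg for kw in node.keywords}
        if "timeout" not in keywords:
            violations.append(node.lineno)
    return sorted(violations)

# Made-up sample input; the check parses it and never executes it.
SAMPLE = '''
import requests

def fetch_profile(user_id):
    return requests.get("https://example.invalid/users/" + user_id)

def fetch_orders(user_id):
    return requests.get("https://example.invalid/orders/" + user_id, timeout=5)
'''

print(calls_without_timeout(SAMPLE))   # [5]
```

&lt;p&gt;In CI, the job would fail whenever the returned list is non-empty. The check never asks what context the AI had. It only asks whether the property holds in the code that was produced.&lt;/p&gt;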

&lt;p&gt;Your context pipeline is probably good. Keep investing in it. Better retrieval means fewer hallucinations. But add the verification pipeline alongside it — because the hallucinations that survive good context are the ones that cause the most damage. They're the subtle ones. The plausible ones. The ones that pass every test because nobody wrote a test for that specific property. And they're the ones that a mechanical check catches on every change, every time, regardless of what was or wasn't in the context window.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next in the series: &lt;strong&gt;Fallacy #6: "AI-Generated Code Is an Asset."&lt;/strong&gt; Why every line of generated code is a liability, why the right unit of progress isn't code volume, and what Unix figured out about composable functions 50 years ago.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Fallacies of GenAI Development: eight assumptions every team is making. Each one leads to an architectural failure. Each one has already been solved.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Liu, N.F., et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts." &lt;em&gt;Transactions of the ACL&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Longpre, S., et al. (2021). "Entity-Based Knowledge Conflicts in Question Answering." &lt;em&gt;EMNLP 2021&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Mishra, S., et al. (2022). "Cross-Task Generalization via Natural Language Crowdsourcing Instructions." &lt;em&gt;ACL 2022&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Neeman, E., et al. (2022). "DisentQA: Disentangling Parametric and Contextual Knowledge with Counterfactual Question Answering." arXiv preprint; published in &lt;em&gt;ACL 2023&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>softwaredevelopment</category>
      <category>architecture</category>
      <category>engineering</category>
    </item>
    <item>
      <title>Fallacies of GenAI Development #4: Dropping Human Review Removes the Bottleneck</title>
      <dc:creator>Bala Paranj</dc:creator>
      <pubDate>Tue, 02 Jun 2026 10:18:31 +0000</pubDate>
      <link>https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-4-dropping-human-review-removes-the-bottleneck-1jo</link>
      <guid>https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-4-dropping-human-review-removes-the-bottleneck-1jo</guid>
      <description>&lt;p&gt;&lt;em&gt;This is the fourth in a series of eight posts on the false assumptions teams make when building with generative AI. &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-1-faster-code-generation-means-faster-engineering-2jbl"&gt;Fallacy #1&lt;/a&gt; covered why faster generation doesn't mean faster engineering. &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-2-if-the-output-looks-correct-it-is-correct-5gbf"&gt;Fallacy #2&lt;/a&gt; covered why plausible isn't correct. &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-3-you-can-verify-ai-output-with-another-ai-4fpk"&gt;Fallacy #3&lt;/a&gt; covered why AI can't reliably verify AI. This post covers what happens when the review gate is removed entirely.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fallacy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Human code review is the bottleneck. If we drop it, the pipeline moves at AI speed."&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it's tempting
&lt;/h2&gt;

&lt;p&gt;The math is simple and the frustration is real. AI generates a PR in 3 minutes. Human review takes 3 hours. The human is 60x slower than the machine. If you have five developers each generating AI-assisted PRs, one tech lead reviewing them becomes the constraint on the entire team's output. The pipeline stalls at the review stage.&lt;/p&gt;

&lt;p&gt;The trending solution: drop the review. Let AI write 100% of the code. Trust the tests. Ship faster. Prominent voices in the industry advocate this explicitly — if the review is the bottleneck, remove it. The reviewer's time is better spent on higher-level work.&lt;/p&gt;

&lt;p&gt;The logic is compelling until you ask one question: what replaced the review?&lt;/p&gt;

&lt;p&gt;Research backs the concern. Perry et al.'s 2023 Stanford study ("Do Users Write More Insecure Code with AI Assistants?") found that developers using AI assistants wrote &lt;em&gt;more&lt;/em&gt; insecure code but were &lt;em&gt;more likely to believe it was secure&lt;/em&gt;. The human "reviewer bottleneck" isn't just a speed problem. It's also a cognitive failure caused by a false sense of security. The faster the AI generates, the more confident the developer becomes, and the less they scrutinize.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it's wrong
&lt;/h2&gt;

&lt;p&gt;A gate exists for a reason. It catches things. When you remove a gate, the things it was catching start reaching production. The question isn't "is the review slow?" The question is "what was the review catching that nothing else catches?"&lt;/p&gt;

&lt;p&gt;Human code review catches at least three categories of problems that no other part of the pipeline addresses:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Properties nobody wrote tests for.&lt;/strong&gt; Tests verify what someone THOUGHT to verify. Review catches what nobody thought to test — an architectural violation, a security assumption, a performance implication, a business logic error that doesn't have a test case because nobody anticipated it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Compositional correctness.&lt;/strong&gt; Tests verify individual components. Review is often the only place someone looks at how components interact — does this change break the implicit contract between module A and module B? Does this new endpoint introduce a dependency cycle? Does this database migration interact badly with the concurrent migration from the other team?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Design coherence.&lt;/strong&gt; Tests verify behavior. Review verifies intent — is this the right approach? Does this change align with the architecture? Are we building the right thing, or just building a thing that passes tests? This is judgment, not verification. But it's judgment that prevents the codebase from drifting into incoherence over hundreds of changes.&lt;/p&gt;

&lt;p&gt;Drop the review and all three categories reach production unchecked. Not immediately — the tests still catch the easy bugs. But the hard problems — the architectural drift, the composition errors, the untested security assumptions — accumulate invisibly.&lt;/p&gt;

&lt;p&gt;Manny Lehman described this in 1980. His Second Law of Software Evolution states explicitly: &lt;em&gt;"As an evolving program is continually changed, its complexity, reflecting deteriorating structure, increases unless work is done to maintain or reduce it."&lt;/em&gt; Code review is the primary mechanism for that active work. Removing it doesn't pause the entropy. It guarantees that the entropy accumulates until the system becomes unmanageable. Lehman's law is an empirical observation rather than a theorem, but its direction is consistent: dropping review makes architectural decay the default outcome, not merely a risk.&lt;/p&gt;

&lt;p&gt;Rigby and Bird's 2013 empirical study of thousands of code reviews at Microsoft and in open-source projects ("Convergent Contemporary Software Peer Review Practices") found that the primary value of review isn't bug-finding. It is knowledge transfer, design improvement, and maintaining team-wide standards. Drop the review and you don't just miss bugs. You kill the Theory Building (Naur, 1985) and the Conceptual Integrity (Brooks, 1975) of the system.&lt;/p&gt;

&lt;h2&gt;
  
  
  The boom
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Month 1-3: The velocity spike.&lt;/strong&gt; PRs merge without waiting for reviewers. Feature delivery accelerates visibly. The team ships more in three months than the previous six. Leadership celebrates. Metrics look great.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 4-6: The silent accumulation.&lt;/strong&gt; Each AI-generated PR that skipped review carried a small number of decisions nobody examined. A variable naming convention drifted. An error handling strategy diverged between services. A retry policy was implemented differently in three modules. None triggers a test failure. Each is a micro-crack in the architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 7-9: The first incident nobody can diagnose.&lt;/strong&gt; A production failure in a code path nobody understands. The developer on call opens the file. It was AI-generated. It was never reviewed. Nobody built a mental model of how it works. The developer reads the code — it looks correct. The bug is in the interaction between this function and another function in a different service, also AI-generated, also never reviewed. Debugging takes three days. Writing the code took three minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 10-12: The architecture change that can't be made.&lt;/strong&gt; The team needs to refactor a core module. The module has been modified by AI agents dozens of times since anyone reviewed it. The tests pass. The code reads well. But nobody knows WHY it's structured the way it is. Nobody knows which behaviors are intentional and which are accidental artifacts of AI generation. The team is afraid to change it. The code that was built in months can't be safely modified in months.&lt;/p&gt;

&lt;p&gt;The deeper damage: ownership loss. When no human reviews the code, no human owns the code. The team discovers they aren't software engineers anymore — they're prompt operators, powerless to fix a system they didn't build and don't understand. The prompts generated the code. The code runs the business. Nobody in between can explain how.&lt;/p&gt;

&lt;p&gt;Sarkar et al.'s 2024 study on developer experience with AI coding assistants ("What is it like to use a generative AI coding assistant?") found exactly this pattern: AI-assisted coding leads to "shallow understanding." Developers focus on the immediate fix and fail to build the deep mental model of how the change affects the rest of the system. If no human reviewed the code, the cognitive debt is at its maximum when the incident occurs.&lt;/p&gt;

&lt;p&gt;This is the cognitive debt trajectory from &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-1-faster-code-generation-means-faster-engineering-2jbl"&gt;Fallacy #1&lt;/a&gt;, now at the code level. The review was the only mechanism that built human understanding of AI-generated changes. Without it, the understanding was never formed. The debt compounds silently until it's called in — always during an incident, always at the worst possible time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three models of review
&lt;/h2&gt;

&lt;p&gt;The industry is debating between two models. There are actually three.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model 1 — Human reviews everything:
    AI generates code → Human reads every line → Ship

    ✓ Accurate (human judgment catches subtle issues)
    ✗ Slow (human is the bottleneck at 60x slower than AI)

    This is what teams had. It doesn't scale.

Model 2 — Nobody reviews:
    AI generates code → Tests pass → Ship

    ✓ Fast (pipeline moves at AI speed)
    ✗ Unsafe (properties nobody tested reach production unchecked)

    This is what teams are moving toward. It breaks at Month 7.

    Note: "AI code review" feels like Model 1 but is actually Model 2
    with a false sense of security. As Fallacy #3 showed, the AI
    reviewer has the same failure modes as the AI generator. The
    safety profile is Model 2. The confidence level is Model 1.
    That mismatch is where the damage compounds.

Model 3 — Specification gate reviews:
    AI generates code → Specification gate verifies → Ship
    Human reviews SPECIFICATIONS, not code

    ✓ Fast (gate operates at machine speed)
    ✓ Accurate (properties verified mechanically, exhaustively)
    ✓ Sustainable (human effort scales with specifications, not code volume)

    This is what every safety-critical domain converged on.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Model 2 looks like an upgrade from Model 1. It's actually a downgrade — it removed the safety mechanism without replacing it. The team traded accuracy for speed, calling it optimization.&lt;/p&gt;

&lt;p&gt;Model 3 is the actual resolution. It doesn't compromise. It separates the two requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ACCURACY&lt;/strong&gt; operates on specifications (small, stable, human-authored, reviewed deliberately)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SPEED&lt;/strong&gt; operates on code verification (fast, exhaustive, mechanical, every change)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Specification doesn't mean a 100-page requirements document. It means any machine-verifiable artifact of intent that already exists in your codebase: a TypeScript interface, an OpenAPI definition, a Protocol Buffer schema, a SQL migration, a Semgrep rule, a database constraint. Small. Stable. Human-authored. Machine-enforced.&lt;/p&gt;

&lt;p&gt;The human reviews the rules of the game. The machine reviews every move to ensure the rules weren't broken.&lt;/p&gt;

&lt;h2&gt;
  
  
  How every safety-critical domain got here
&lt;/h2&gt;

&lt;p&gt;No safety-critical domain uses Model 1 or Model 2. Every one uses Model 3. They arrived there independently when their version of "AI-speed generation" exceeded human review capacity:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aviation:&lt;/strong&gt; Jet engines got fast enough that pilots couldn't react to every condition. Model 1 (pilot monitors everything) was too slow. Model 2 (remove the pilot) was too dangerous. Model 3: fly-by-wire envelope protection. The pilot reviews the FLIGHT PLAN (specification). The computer enforces the FLIGHT ENVELOPE (mechanical gate). In normal operation the pilot can't stall the aircraft even if they try, because the specification gate overrides the input. This isn't ad hoc. DO-178C (the standard for flight software certification) requires that &lt;em&gt;requirements&lt;/em&gt; (specifications) be reviewed by humans for intent, while &lt;em&gt;code&lt;/em&gt; is verified against those requirements with systematic, tool-supported evidence: structural coverage analysis, data coupling analysis, formal methods. Certification does not rest on a human's reading of each line of flight code. It rests on the reviewed specification of what the code must do, and on evidence that every line traces back to it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nuclear operations:&lt;/strong&gt; Reactor dynamics happen faster than operators can track every parameter. Model 1 (operator monitors all parameters) was too slow. Model 2 (remove operator oversight) was too dangerous. Model 3: automated protection systems. The operator reviews the PROCEDURES (specification). The interlocks enforce PARAMETER LIMITS (mechanical gate). The reactor scrams automatically if parameters exceed the envelope — regardless of operator input.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Financial trading:&lt;/strong&gt; Algorithmic execution happens in microseconds. Model 1 (human reviews every trade) was too slow. Model 2 (no review) caused flash crashes. Model 3: pre-trade risk checks. The risk manager reviews the RISK LIMITS (specification). The system enforces them on EVERY TRADE (mechanical gate). No order that violates the limits reaches the exchange — regardless of what the algorithm generated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google monorepo:&lt;/strong&gt; 2 billion lines of code. Model 1 (human reviews every change to every dependency) was too slow. Model 2 (merge without review) would break the monorepo. Model 3: automated testing + API contract enforcement. The team reviews INTERFACE CONTRACTS (specification). CI enforces them on EVERY CHANGE (mechanical gate). A large-scale change touching millions of lines merges — because every affected test passes mechanically. As Winters, Manshreck, and Wright document in &lt;em&gt;Software Engineering at Google&lt;/em&gt; (2020), the code review chapter makes this explicit: "mechanical checks" (linters, tests) are automated, but "design and intent" review is the gate that keeps the system coherent. Even the world's largest codebase doesn't drop review — it moves review to the highest level of abstraction.&lt;/p&gt;

&lt;p&gt;The pattern repeats: when generation speed exceeds human review capacity, the review doesn't disappear. It splits into two activities at two speeds. The human does the slow, high-judgment work (reviewing specifications). The machine does the fast, exhaustive work (verifying code against specifications).&lt;/p&gt;

&lt;h2&gt;
  
  
  The TRIZ contradiction that forces Model 3
&lt;/h2&gt;

&lt;p&gt;This isn't a preference. It's a resolution of a physical contradiction.&lt;/p&gt;

&lt;p&gt;The review must be ACCURATE — it requires human domain expertise to judge whether the code is correct, whether the architecture is sound, whether the security properties hold.&lt;/p&gt;

&lt;p&gt;The review must be FAST — human review speed is the bottleneck that prevents the team from capturing AI's productivity gains.&lt;/p&gt;

&lt;p&gt;Same system. Opposite requirements. TRIZ's separation principle: if a system must be simultaneously A and not-A, separate the contradiction across different artifacts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ACCURATE (human speed):
    Human authors specification    → small, slow, requires expertise
    Human reviews specification    → infrequent, high-judgment

FAST (machine speed):
    Machine verifies code          → instant, exhaustive, every change
    Machine blocks violations      → deterministic, no judgment needed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both requirements satisfied. Neither compromised. The accuracy lives in the specification review. The speed lives in the mechanical verification. Different artifacts, different speeds, no contradiction.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you can do this week
&lt;/h2&gt;

&lt;p&gt;If you've already dropped human review (Model 2):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Identify what the review was catching.&lt;/strong&gt; Look at your last 20 review comments before you dropped the process. Categorize: how many were about properties (should always be true), how many about composition (interaction between components), how many about design (is this the right approach)? The specification gate replaces the property-category comments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Convert one review comment into a specification.&lt;/strong&gt; "This endpoint must always require authentication" — that's a specification. Add it as a CI check. Mechanical. Deterministic. Every PR. You just restored one piece of what the review was providing, at machine speed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Keep a "review debt" log.&lt;/strong&gt; Every time an incident occurs in code that was never reviewed, log it. Track the category. After a quarter, the log tells you exactly which specifications you need. The incidents write the specification backlog for you.&lt;/p&gt;
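
&lt;p&gt;Step 2 can be sketched as follows. Everything here is an assumption for illustration: the ROUTES table stands in for whatever your framework's router or OpenAPI document exposes, and PUBLIC_ALLOWLIST is the small, human-reviewed specification of which endpoints may skip authentication. The check itself is mechanical and returns a CI-style exit code.&lt;/p&gt;

```python
# Hypothetical route table; in a real service this would be read from the
# framework's router or from an OpenAPI document.
ROUTES = [
    {"path": "/orders", "method": "GET", "auth_required": True},
    {"path": "/orders", "method": "POST", "auth_required": True},
    {"path": "/health", "method": "GET", "auth_required": False},
]

# The reviewed specification: the only endpoints allowed to skip authentication.
PUBLIC_ALLOWLIST = {("GET", "/health")}

def unauthenticated_routes(routes, allowlist):
    """Return (method, path) pairs that skip auth without being allowlisted."""
    return [
        (r["method"], r["path"])
        for r in routes
        if not r["auth_required"] and (r["method"], r["path"]) not in allowlist
    ]

def check(routes, allowlist):
    """Exit code for CI: 0 if the property holds, 1 otherwise."""
    violations = unauthenticated_routes(routes, allowlist)
    for method, path in violations:
        print("missing auth:", method, path)
    return 1 if violations else 0

print(check(ROUTES, PUBLIC_ALLOWLIST))   # 0
```

&lt;p&gt;The human reviews the allowlist, which changes rarely. The machine runs the check on every PR, which changes constantly.&lt;/p&gt;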

&lt;p&gt;If you still have human review (Model 1):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Identify the bottleneck reviews.&lt;/strong&gt; Which reviews take the longest? Which ones block the most PRs? These are the candidates for Model 3 — convert the reviewer's judgment into specifications and enforce them mechanically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Convert one slow review into a fast gate.&lt;/strong&gt; The reviewer who checks "does this PR conform to our API contract" — replace that with a contract test. The reviewer who checks "does this migration have a rollback" — replace that with a CI check for rollback scripts. Each conversion speeds up the pipeline without removing the safety mechanism.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Measure reviewer time on specifications vs. code.&lt;/strong&gt; If the reviewer spends 80% of their time checking properties (mechanical work) and 20% on design judgment (human work), moving the mechanical work to CI cuts reviewer time to a fifth of what it was: up to a 5x speedup. The reviewer focuses on the 20% that requires human judgment. The machine handles the 80% that doesn't.&lt;/p&gt;
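&lt;p&gt;The "does this migration have a rollback" gate from step 2 can be sketched in a few lines. The &lt;code&gt;.up.sql&lt;/code&gt; / &lt;code&gt;.down.sql&lt;/code&gt; naming convention is an assumption made for this example; adapt the suffixes to whatever your migration tool uses.&lt;/p&gt;

```python
UP_SUFFIX = ".up.sql"      # assumed naming convention
DOWN_SUFFIX = ".down.sql"  # assumed naming convention


def migrations_missing_rollback(filenames):
    """Return the migration names that have an 'up' script but no 'down' script."""
    ups = {n[: -len(UP_SUFFIX)] for n in filenames if n.endswith(UP_SUFFIX)}
    downs = {n[: -len(DOWN_SUFFIX)] for n in filenames if n.endswith(DOWN_SUFFIX)}
    return sorted(ups - downs)


# Example input: in CI this list would come from listing the migrations directory.
example = [
    "001_init.up.sql",
    "001_init.down.sql",
    "002_add_index.up.sql",
]
missing = migrations_missing_rollback(example)
```

&lt;p&gt;A non-empty result fails the build. The reviewer no longer has to remember to look for the rollback script.&lt;/p&gt;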

&lt;p&gt;The bottleneck is real. The review is slow. But removing the review is removing the brakes. The resolution is to move the review to the right level — specifications for humans, code verification for machines — and let each operate at its natural speed.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next in the series: &lt;strong&gt;Fallacy #5: "Better Context Prevents Hallucination."&lt;/strong&gt; Why improving input quality doesn't guarantee output correctness, and why verification must check the output, not just improve the input.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Fallacies of GenAI Development: eight assumptions every team is making. Each one leads to an architectural failure. Each one has already been solved.&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Lehman, M.M. (1980). "Programs, Life Cycles, and Laws of Software Evolution." &lt;em&gt;Proceedings of the IEEE&lt;/em&gt;, 68(9), 1060–1076. Second Law: complexity increases unless active work is done to reduce it.&lt;/li&gt;
&lt;li&gt;Rigby, P.C. &amp;amp; Bird, C. (2013). "Convergent Contemporary Software Peer Review Practices." &lt;em&gt;ESEC/FSE 2013&lt;/em&gt;. Empirical study: review's primary value is knowledge transfer and design improvement, not bug-finding.&lt;/li&gt;
&lt;li&gt;Perry, N. et al. (2023). "Do Users Write More Insecure Code with AI Assistants?" &lt;em&gt;ACM CCS 2023&lt;/em&gt;. Stanford study: AI-assisted developers write more insecure code while believing it is more secure.&lt;/li&gt;
&lt;li&gt;Sarkar, A. et al. (2024). "What is it like to use a generative AI coding assistant? An interpretative phenomenological analysis." Microsoft Research. AI coding leads to shallow understanding and missed system-level effects.&lt;/li&gt;
&lt;li&gt;DO-178C (2011). "Software Considerations in Airborne Systems and Equipment Certification." RTCA/EUROCAE. Aviation software standard: human-reviewed requirements, machine-verified code.&lt;/li&gt;
&lt;li&gt;Winters, T., Manshreck, T. &amp;amp; Wright, H. (2020). &lt;em&gt;Software Engineering at Google.&lt;/em&gt; O'Reilly. Chapter on code review: mechanical checks automated, design and intent review is the coherence gate.&lt;/li&gt;
&lt;li&gt;Naur, P. (1985). "Programming as Theory Building." &lt;em&gt;Microprocessing and Microprogramming&lt;/em&gt;, 15(5), 253–261.&lt;/li&gt;
&lt;li&gt;Brooks, F.P. (1975). &lt;em&gt;The Mythical Man-Month.&lt;/em&gt; Addison-Wesley. Conceptual integrity as the most important consideration in system design.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>softwaredevelopment</category>
      <category>architecture</category>
      <category>engineering</category>
    </item>
    <item>
      <title>Fallacies of GenAI Development #3: You Can Verify AI Output With Another AI</title>
      <dc:creator>Bala Paranj</dc:creator>
      <pubDate>Mon, 01 Jun 2026 11:17:30 +0000</pubDate>
      <link>https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-3-you-can-verify-ai-output-with-another-ai-4fpk</link>
      <guid>https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-3-you-can-verify-ai-output-with-another-ai-4fpk</guid>
      <description>&lt;p&gt;&lt;em&gt;This is the third in a series of eight posts on the false assumptions teams make when building with generative AI. &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-1-faster-code-generation-means-faster-engineering-2jbl"&gt;Fallacy #1&lt;/a&gt; covered why faster generation doesn't mean faster engineering. &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-2-if-the-output-looks-correct-it-is-correct-5gbf"&gt;Fallacy #2&lt;/a&gt; covered why plausible isn't correct. This post covers why using one AI to check another doesn't solve the problem — it doubles it.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fallacy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"If the AI makes mistakes, use another AI to check its work."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://openreview.net/forum?id=IkmD3fKBPQ" rel="noopener noreferrer"&gt;Huang et al.&lt;/a&gt; (ICLR 2024) showed that LLMs cannot reliably self-correct their reasoning without external feedback, and that in some cases self-correction makes the output worse. LLM-as-judge is a special case of this: the same class of system evaluating its own output using the same reasoning that produced the errors. Formal verifiers, schema validators, and dissimilar reasoning engines provide the external feedback the paper says is required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it's tempting
&lt;/h2&gt;

&lt;p&gt;The logic feels airtight. You wouldn't trust a single developer to ship code without review. So you add a reviewer. In AI systems, the reviewer is another AI — a guardrail, a grader, an LLM-as-judge. The first AI generates. The second AI checks. Two opinions are better than one.&lt;/p&gt;

&lt;p&gt;The industry has formalized this into named patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Guardrails on input:&lt;/strong&gt; An LLM checks whether the user's prompt is safe before the main LLM processes it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guardrails on output:&lt;/strong&gt; An LLM checks whether the main LLM's response is appropriate before the user sees it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-as-judge:&lt;/strong&gt; A grader LLM scores the quality of the main LLM's output against a rubric.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI code review:&lt;/strong&gt; An LLM reviews AI-generated code for bugs, security issues, and style.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each pattern adds a layer of verification. Each layer is another LLM call. The architecture looks like defense in depth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it's wrong
&lt;/h2&gt;

&lt;p&gt;The verifier has the same failure modes as the thing it's verifying.&lt;/p&gt;

&lt;p&gt;An LLM checking another LLM's output for hallucination can itself hallucinate. An LLM checking for prompt injection can itself be prompt-injected. An LLM reviewing code for security vulnerabilities can miss the same subtle patterns the generating LLM introduced — because both are doing the same thing: pattern matching on text.&lt;/p&gt;

&lt;p&gt;This is not defense in depth. Defense in depth requires each layer to have DIFFERENT failure modes. A firewall and an intrusion detection system provide defense in depth because they fail differently: the firewall fails on novel protocols, and the IDS fails on encrypted payloads. Neither failure mode overlaps with the other.&lt;/p&gt;

&lt;p&gt;Two LLMs have OVERLAPPING failure modes. Both hallucinate. Both are vulnerable to adversarial inputs. Both miss mathematical properties that are invisible in the text. Both confuse plausibility with correctness. Adding a second LLM doesn't eliminate the failure mode. It adds another instance of it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The math of stacked probabilities
&lt;/h3&gt;

&lt;p&gt;If your generating LLM is 95% reliable and your guardrail LLM is 95% reliable, the combined reliability is NOT 99.75%. That figure assumes independent failure modes, where an error slips through only when both fail at once (5% × 5% = 0.25%). In practice it's somewhere closer to 95%, because the failure modes are correlated. The cases where the generator fails are disproportionately the cases where the guardrail also fails, because both struggle with the same category of hard inputs.&lt;/p&gt;

&lt;p&gt;The inputs that fool one LLM are often the inputs that fool the other. A policy that's mathematically equivalent to &lt;code&gt;Principal: *&lt;/code&gt; through complex condition logic fools the generator (it doesn't understand the math) and fools the guardrail (it also doesn't understand the math). A prompt injection disguised as a legitimate system instruction bypasses the input guardrail for the same reason it would bypass the main model — both process it as plausible text.&lt;/p&gt;
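&lt;p&gt;The arithmetic can be made explicit with a small sketch. It models the chance that a bad output slips past both layers as the generator's error rate multiplied by the guardrail's miss rate on exactly those errors. The 60% miss rate in the middle case is an illustrative assumption, not a measured number.&lt;/p&gt;

```python
from fractions import Fraction


def slip_through(p_generator_wrong, p_guardrail_misses_given_wrong):
    """Probability that an error is produced AND the guardrail fails to catch it."""
    return p_generator_wrong * p_guardrail_misses_given_wrong


p_wrong = Fraction(5, 100)  # generator is 95% reliable

# Independent failures: the guardrail misses 5% of errors, whatever their cause.
independent = slip_through(p_wrong, Fraction(5, 100))

# Fully correlated: the guardrail misses every error the generator makes.
fully_correlated = slip_through(p_wrong, Fraction(1))

# Somewhere in between (assumed value, for illustration only).
partly_correlated = slip_through(p_wrong, Fraction(60, 100))

for label, slip in [
    ("independent", independent),
    ("partly correlated", partly_correlated),
    ("fully correlated", fully_correlated),
]:
    print(f"{label}: combined reliability {float(1 - slip):.2%}")
```

&lt;p&gt;The independent case reproduces the 99.75% figure. As the guardrail's miss rate on the generator's errors approaches 100%, the combined reliability falls back to the generator's own 95%.&lt;/p&gt;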

&lt;h3&gt;
  
  
  The prompt injection example
&lt;/h3&gt;

&lt;p&gt;This makes the problem concrete. The Fowler/Subramaniam GenAI patterns article recommends guardrails on input to prevent prompt injection. The architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User prompt → Guardrail LLM ("is this safe?") → Main LLM → Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The guardrail LLM reads the user's prompt and decides whether it contains an injection attempt. But the guardrail is ALSO an LLM processing text. A sufficiently clever injection that reads as legitimate to one LLM will read as legitimate to the other — because both are doing the same kind of text pattern matching.&lt;/p&gt;

&lt;p&gt;Now consider the alternative architecture. Instead of a free-text prompt surface, expose typed RPC methods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Agent&lt;/span&gt; &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3 buckets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Agent&lt;/span&gt; &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;property&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no_public_access&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:s3:::data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Agent&lt;/span&gt; &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;compliance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;framework&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hipaa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;account&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod-account&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's no prompt to inject into. The agent calls typed methods that accept typed parameters and return structured data. The guard isn't another LLM; it's the type system. The attack surface doesn't exist because the architecture doesn't have a free-text channel.&lt;/p&gt;

&lt;p&gt;Prompt injection isn't mitigated. It's structurally unreachable. The architecture doesn't have the surface for the attack to exist.&lt;/p&gt;
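&lt;p&gt;A minimal sketch of what "the guard is the type system" might look like for the &lt;code&gt;verify&lt;/code&gt; call above. The &lt;code&gt;Property&lt;/code&gt; enum, the &lt;code&gt;VerifyRequest&lt;/code&gt; class, the &lt;code&gt;parse_verify&lt;/code&gt; helper, and the ARN pattern are all hypothetical names invented for this example. The point is that every parameter is drawn from a closed set or a strict format, so there is nowhere for free-form instructions to go.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum
import re


class Property(Enum):
    """Closed set of properties the agent may ask about (hypothetical)."""
    NO_PUBLIC_ACCESS = "no_public_access"
    ENCRYPTION_AT_REST = "encryption_at_rest"


# Deliberately strict, simplified pattern for S3 bucket ARNs (an assumption).
S3_ARN = re.compile(r"arn:aws:s3:::[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


@dataclass(frozen=True)
class VerifyRequest:
    prop: Property
    resource: str

    def __post_init__(self):
        if not isinstance(self.prop, Property):
            raise TypeError("prop must be a Property member")
        if S3_ARN.fullmatch(self.resource) is None:
            raise ValueError("resource is not a well-formed S3 ARN")


def parse_verify(raw):
    """Turn an untrusted dict from the agent into a typed request, or raise."""
    if set(raw) != {"property", "resource"}:
        raise ValueError("unexpected or missing fields")
    # Property(...) raises ValueError for anything outside the enum.
    return VerifyRequest(Property(raw["property"]), raw["resource"])
```

&lt;p&gt;A call such as &lt;code&gt;parse_verify({"property": "ignore previous instructions", "resource": "arn:aws:s3:::data"})&lt;/code&gt; is rejected before any downstream code runs. One caveat worth stating: a parameter that stays a free-text string, such as a search query, only gets this protection if nothing downstream interprets it as instructions.&lt;/p&gt;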

&lt;h2&gt;
  
  
  The pattern across nine Fowler mitigations
&lt;/h2&gt;

&lt;p&gt;This isn't just about prompt injection. Every reactive GenAI pattern shares the same structural limitation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Reactive pattern&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Why it doesn't converge on reliability&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RAG&lt;/td&gt;
&lt;td&gt;Provides context to reduce hallucination&lt;/td&gt;
&lt;td&gt;The LLM can still ignore the context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input guardrails&lt;/td&gt;
&lt;td&gt;LLM checks if the input is safe&lt;/td&gt;
&lt;td&gt;The guardrail LLM has the same gullibility&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output guardrails&lt;/td&gt;
&lt;td&gt;LLM checks if the output is appropriate&lt;/td&gt;
&lt;td&gt;The guardrail LLM has the same blind spots&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM-as-judge&lt;/td&gt;
&lt;td&gt;LLM scores output quality&lt;/td&gt;
&lt;td&gt;The judge LLM has the same hallucination risk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query rewriting&lt;/td&gt;
&lt;td&gt;LLM improves the user's query&lt;/td&gt;
&lt;td&gt;The rewriting LLM can make the query worse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reranking&lt;/td&gt;
&lt;td&gt;LLM rescores retrieved documents&lt;/td&gt;
&lt;td&gt;The reranker LLM has the same relevance errors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fine-tuning&lt;/td&gt;
&lt;td&gt;Retrain the model on domain data&lt;/td&gt;
&lt;td&gt;The fine-tuned model still hallucinates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evals&lt;/td&gt;
&lt;td&gt;LLM scores output against rubrics&lt;/td&gt;
&lt;td&gt;The eval LLM has probabilistic accuracy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI code review&lt;/td&gt;
&lt;td&gt;LLM reviews code for bugs&lt;/td&gt;
&lt;td&gt;The reviewer LLM misses the same edge cases&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each pattern wraps a non-deterministic system with another non-deterministic layer. Each layer adds cost and latency. None makes the underlying unreliability structurally unreachable. They reduce the probability of failure. They don't eliminate the failure mode.&lt;/p&gt;

&lt;h2&gt;
  
  
  The boom
&lt;/h2&gt;

&lt;p&gt;Teams that rely on AI-to-verify-AI hit a specific wall: the recursive hallucination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 1:&lt;/strong&gt; The team deploys an LLM guardrail to check AI-generated IAM policies before deployment. The guardrail catches obvious issues such as wildcard principals and missing conditions. Leadership is satisfied: "We have AI reviewing AI."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 3:&lt;/strong&gt; A subtle policy change passes both the generator and the guardrail. The policy's condition blocks, evaluated together, are mathematically equivalent to &lt;code&gt;Principal: *&lt;/code&gt;. Neither LLM understands the math. Both pattern-matched the text. The text looks restrictive. The math isn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 4:&lt;/strong&gt; The incident. A public access path exists through the policy composition. Data is exposed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 4, the post-mortem:&lt;/strong&gt; "Why did the guardrail miss this?" Because the guardrail reads policy TEXT the same way the generator writes policy TEXT. It checked whether the text LOOKED like it restricted access. It didn't check whether the LOGIC restricted access. You didn't have two independent checks. You had two mirrors reflecting the same error back at each other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 5, the response:&lt;/strong&gt; The team adds a THIRD LLM to review the guardrail's decisions. The recursive hallucination deepens. The cost triples. The latency triples. The correlated failure mode remains. The next incident will be a different policy composition that all three LLMs approve because all three pattern-match text, not logic.&lt;/p&gt;

&lt;p&gt;The recursive hallucination is the AI verification equivalent of an infinite loop. Each iteration adds cost without adding reliability, because each layer fails on the same category of inputs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What deterministic verification looks like
&lt;/h2&gt;

&lt;p&gt;The alternative isn't better AI verification. It's verification that doesn't use AI at all for the properties that can be checked mechanically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Probabilistic (LLM-as-judge):
    "Does this code violate the security policy?"
    → Depends on the grader's interpretation
    → Run it twice, might get different answers
    → Misses mathematical equivalences invisible in text

Deterministic (mechanical check):
    "Does this configuration expose a public endpoint without authentication?"
    → Yes or no. Every time. Same input, same output.
    → Computed over the actual state, not over the text description
    → Catches mathematical equivalences because it evaluates logic, not text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
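&lt;p&gt;To illustrate "evaluates logic, not text", here is a toy sketch. Each policy statement reads as restrictive on its own, yet the two together allow every request. The check enumerates a small, finite request space and returns a concrete witness if an anonymous principal gets through. This is a simplified model invented for the example; it is not real IAM evaluation and it is not an SMT solver.&lt;/p&gt;

```python
from itertools import product

# Toy policy: each statement is an "allow" predicate over a request dict.
POLICY = [
    lambda r: r["from_corp_network"],      # "allow from the corporate network"
    lambda r: not r["from_corp_network"],  # "allow remote staff" (added later)
]

# A policy that actually restricts access, for comparison.
RESTRICTED_POLICY = [lambda r: r["principal"] == "alice"]


def allows(policy, request):
    """A request is allowed if any statement allows it."""
    return any(statement(request) for statement in policy)


def public_access_witness(policy):
    """Return a request from an anonymous principal that is allowed, or None."""
    for principal, from_corp in product(["anonymous", "alice"], [True, False]):
        request = {"principal": principal, "from_corp_network": from_corp}
        if principal == "anonymous" and allows(policy, request):
            return request
    return None


witness = public_access_witness(POLICY)
safe = public_access_witness(RESTRICTED_POLICY)
```

&lt;p&gt;For &lt;code&gt;POLICY&lt;/code&gt; the function returns a witness, because the two conditions cover every case between them. For &lt;code&gt;RESTRICTED_POLICY&lt;/code&gt; it returns &lt;code&gt;None&lt;/code&gt;. Run it twice and the answer is the same both times.&lt;/p&gt;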



&lt;p&gt;Deterministic verification has specific tools at each level of sophistication:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type checking.&lt;/strong&gt; The compiler verifies type contracts. If the AI generates code that violates a type signature, the compiler catches it — instantly, deterministically, with zero LLM calls. This is verification at the speed of code generation with zero probability of false negatives for the properties types can express.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contract testing.&lt;/strong&gt; Tools like Dredd and Schemathesis verify that an API implementation matches its OpenAPI specification; Pact does the same for consumer-driven contracts. Every endpoint, every field, every response code is checked mechanically against the declared contract. The specification is the source of truth. The implementation either matches or it doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Property-based testing.&lt;/strong&gt; Tools like QuickCheck and Hypothesis verify that a property holds across thousands of randomly generated inputs. Not "does this specific input produce this specific output?" but "does this property hold for ALL inputs the tool can generate?" One level of abstraction higher than example-based testing. One level closer to proof.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Static analysis.&lt;/strong&gt; Tools like Semgrep, SonarQube, and Checkov check structural properties of the code without running it. "Does any code path reach a database query with unsanitized user input?" "Does this Terraform plan create a public S3 bucket?" Checked across the entire codebase or infrastructure definition, mechanically, on every commit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formal verification.&lt;/strong&gt; Tools like Z3, Dafny, and Lean prove properties mathematically across ALL possible inputs. "Does there exist any request that this policy allows from a public principal?" If the solver says UNSAT, no such request exists: proved, not tested. AWS uses this approach (Zelkova) to analyze IAM policies, reportedly on the order of a billion solver queries per day.&lt;/p&gt;

&lt;p&gt;Each tool is deterministic. Each produces the same answer for the same input every time. Each catches a category of errors that LLM-as-judge cannot — because the errors are mathematical, not textual.&lt;/p&gt;
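&lt;p&gt;As one illustration from the list above, here is the idea behind property-based testing, written with only the standard library rather than QuickCheck or Hypothesis so that it runs anywhere. The &lt;code&gt;dedupe&lt;/code&gt; function and the &lt;code&gt;check_property&lt;/code&gt; helper are hypothetical names made up for the sketch. Real tools add input shrinking and far better generators.&lt;/p&gt;

```python
import random


def dedupe(items):
    """Function under test: drop repeats, keep first-seen order."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def check_property(prop, trials=1000, seed=0):
    """Return the first generated input that falsifies prop, or None."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    for _ in range(trials):
        size = rng.randint(0, 20)
        data = [rng.randint(-5, 5) for _ in range(size)]
        if not prop(data):
            return data
    return None


def no_duplicates_and_same_members(data):
    out = dedupe(data)
    return len(out) == len(set(out)) and set(out) == set(data)


def idempotent(data):
    return dedupe(dedupe(data)) == dedupe(data)


def length_preserved(data):
    # Deliberately false property, to show a counterexample being found.
    return len(dedupe(data)) == len(data)
```

&lt;p&gt;&lt;code&gt;check_property(idempotent)&lt;/code&gt; returns &lt;code&gt;None&lt;/code&gt; when no generated input breaks the property, while &lt;code&gt;check_property(length_preserved)&lt;/code&gt; returns a concrete list containing a repeated value. The check states the property once and lets the machine search for the input that breaks it.&lt;/p&gt;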

&lt;h2&gt;
  
  
  When to use which
&lt;/h2&gt;

&lt;p&gt;Not every property can be verified deterministically. The decision tree:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Can the property be expressed as a type constraint?
    → YES: Use the type system. Cheapest. Fastest. Already deployed.

Can the property be expressed as a contract (API spec, schema)?
    → YES: Use contract testing. Mechanical. Deterministic.

Can the property be expressed as "for all inputs, X holds"?
    → YES: Use property-based testing. Thousands of random inputs.
           Or formal verification for mathematical proof.

Can the property be expressed as a structural pattern in code?
    → YES: Use static analysis. No runtime needed. Every commit.

Is the property about tone, style, user experience, or subjective quality?
    → YES: Use LLM-as-judge. This is the ONLY category where
           probabilistic verification is appropriate — because the
           property itself is subjective.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LLM-as-judge is appropriate for subjective properties. It's inappropriate for properties that have deterministic answers. Using an LLM to check whether code violates a type contract is like using a language model to check whether 2 + 2 = 4. You could. But the calculator is faster, cheaper, and never wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  The principle: match the verifier to the property
&lt;/h2&gt;

&lt;p&gt;The resolution isn't "never use AI for verification." It's "use the right verification method for each property."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Subjective properties    → LLM-as-judge (probabilistic, appropriate)
    "Is this response helpful?"
    "Is this code well-styled?"
    "Is this documentation clear?"

Structural properties    → Static analysis (deterministic)
    "Does this code path sanitize user input?"
    "Does this function handle the error case?"

Contract properties      → Contract testing (deterministic)
    "Does this API match its specification?"
    "Does this response include all required fields?"

Universal properties     → Property-based testing / formal verification (deterministic)
    "Can any unauthorized principal access this resource?"
    "Does the balance ever go negative for valid transactions?"
    "Does there exist any input that causes this function to hang?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most properties in a security-critical, financially-critical, or data-integrity-critical system are NOT subjective. They have deterministic answers. Verifying them with an LLM is using the wrong tool — because a better tool exists for that specific category.&lt;/p&gt;

&lt;p&gt;The teams that emerge from the trough of disillusionment fastest will be the ones that stop using AI to verify everything and start matching each property to the verification method that works for it. Subjective? LLM. Everything else? Mechanical.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you can do this week
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Audit your current AI verification stack.&lt;/strong&gt; List every place you use an LLM to check another LLM's output. For each one, ask: is the property being checked subjective or deterministic? If it's deterministic, you're using the wrong tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Replace one probabilistic check with a deterministic one.&lt;/strong&gt; If you have an LLM reviewing code for "does it match the API spec" — replace it with a contract test. If you have an LLM checking "is the output valid JSON" — replace it with a schema validator. One replacement. This week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Measure the difference.&lt;/strong&gt; How many violations did the deterministic check catch that the LLM missed? How much faster does it run? How much cheaper is it? The numbers will make the case for replacing the rest.&lt;/p&gt;
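&lt;p&gt;For step 2, the "is the output valid JSON" replacement can be as small as the sketch below. The required fields are hypothetical, and the hand-rolled type check stands in for a full schema validator, which you would likely prefer in production.&lt;/p&gt;

```python
import json

# Hypothetical shape the LLM output is expected to have.
REQUIRED = {"resource": str, "public": bool, "reasons": list}


def validate_output(text):
    """Return a list of problems; an empty list means the output conforms."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    problems = []
    for field, expected in REQUIRED.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            problems.append(f"wrong type for {field}")
    return problems
```

&lt;p&gt;The validator costs no LLM call and gives the same verdict on every run, which also makes the comparison in step 3 easy to measure.&lt;/p&gt;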

&lt;p&gt;The LLM is a remarkable tool. Use it for what it's good at: subjective judgment, creative generation, natural language understanding. Don't use it as a calculator. Don't use it as a type checker. Don't use it as a theorem prover. Those tools already exist. They're faster, cheaper, and they never hallucinate.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next in the series: &lt;strong&gt;Fallacy #4: "Dropping Human Review Removes the Bottleneck."&lt;/strong&gt; Why removing a gate without replacing it isn't optimization; it's removing the brakes. Three models of review, and why the third one is the only one that works at AI speed.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Fallacies of GenAI Development: eight assumptions every team is making. Each one leads to an architectural failure. Each one has already been solved.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwaredevelopment</category>
      <category>architecture</category>
      <category>engineering</category>
    </item>
    <item>
      <title>The Industry Needs an Open Reasoning Spec. Seven Papers Explain What Goes In It.</title>
      <dc:creator>Bala Paranj</dc:creator>
      <pubDate>Sun, 31 May 2026 16:44:20 +0000</pubDate>
      <link>https://dev.to/bala_paranj_059d338e44e7e/the-industry-needs-an-open-reasoning-spec-seven-papers-explain-what-goes-in-it-jn4</link>
      <guid>https://dev.to/bala_paranj_059d338e44e7e/the-industry-needs-an-open-reasoning-spec-seven-papers-explain-what-goes-in-it-jn4</guid>
      <description>&lt;p&gt;The API ecosystem had a coordination problem. Every API was described differently — prose documentation, custom schemas, tribal knowledge. Then a standard format emerged. One format. Every tool reads it. Code generators, documentation engines, test frameworks, mock servers, SDK builders — all consume the same specification. The ecosystem unified around one artifact.&lt;/p&gt;

&lt;p&gt;The AI era has the same coordination problem — but for a different artifact. Not "how does this API behave?" but "what properties must this system preserve?" "What behavioral contracts must hold?" "Why was this boundary placed here?" "What invariants must every change respect?"&lt;/p&gt;

&lt;p&gt;Seven foundational CS papers — Parnas, Naur, Brooks, Knuth, Dijkstra, Liskov, Lehman — converge on the same conclusion: the value of software isn't in the code. It's in the properties, contracts, boundaries, and rationale that make the code safe to modify. AI generates the code. Nothing standardizes the properties. The code is generated at scale. The properties are scattered across READMEs, ADRs, Slack threads, and people's heads.&lt;/p&gt;

&lt;p&gt;The industry needs an Open Reasoning Spec — a machine-readable standard for properties, contracts, and rationale that AI agents consume, enforcement engines verify, and humans read.&lt;/p&gt;

&lt;h2&gt;
  
  
  What exists today (and why it's insufficient)
&lt;/h2&gt;

&lt;p&gt;Several standards describe parts of the problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Standard        What it describes              What's missing
────────        ─────────────────              ──────────────
OpenAPI         API shapes (endpoints, types)  Behavioral contracts (idempotency,
                                               ordering, invariants)

JSON Schema     Data shapes (fields, types)    Semantic constraints (this field
                                               means admin access, not just boolean)

OSCAL           Compliance controls            Mechanical predicates (how to CHECK
                                               the control, not just describe it)

STIX            Threat intelligence            Property definitions (what must be
                                               TRUE, not just what threats exist)

OPA/Rego        Policy rules                   Rationale (WHY this policy exists,
                                               what decision it implements)

CEL             Predicate expressions          Context (what the predicate means
                                               in domain terms, who validated it)

ADRs            Decision rationale             Enforcement link (the ADR says WHY
                                               but nothing connects it to a CHECK)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each standard covers one layer. None connects them. The API shape (OpenAPI) doesn't link to the behavioral contract (Liskov). The compliance control (OSCAL) doesn't link to the mechanical predicate (CEL). The decision rationale (ADR) doesn't link to the enforcement gate (CI check). The standards exist in silos — exactly like API descriptions before OpenAPI.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the seven papers require
&lt;/h2&gt;

&lt;p&gt;Each foundational paper identifies a specific element that must be in the specification:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Paper&lt;/th&gt;
&lt;th&gt;What must be specified&lt;/th&gt;
&lt;th&gt;Current gap&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Parnas (1972)&lt;/td&gt;
&lt;td&gt;Module boundaries — what's inside, what's outside, what the interface promises&lt;/td&gt;
&lt;td&gt;Boundaries exist in code (packages, modules) but aren't declared in a consumable spec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Naur (1985)&lt;/td&gt;
&lt;td&gt;Domain theory — how the software maps to the real world&lt;/td&gt;
&lt;td&gt;Lives in people's heads. No format captures it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Brooks (1986)&lt;/td&gt;
&lt;td&gt;Essential complexity — which parts are irreducibly hard&lt;/td&gt;
&lt;td&gt;Not distinguished from accidental complexity in any specification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knuth (1984)&lt;/td&gt;
&lt;td&gt;Narrative — WHY each decision was made&lt;/td&gt;
&lt;td&gt;ADRs exist but aren't machine-readable or linked to enforcement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dijkstra (1972)&lt;/td&gt;
&lt;td&gt;Verifiable properties — what must be TRUE, stated simply enough to prove&lt;/td&gt;
&lt;td&gt;Properties scattered across tests, types, linter rules, and comments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Liskov (1994)&lt;/td&gt;
&lt;td&gt;Behavioral contracts — promises beyond the type signature&lt;/td&gt;
&lt;td&gt;No standard format. Lives in comments, test names, tribal knowledge.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lehman (1980)&lt;/td&gt;
&lt;td&gt;Structural maintenance rules — what's allowed to grow, what must be consolidated&lt;/td&gt;
&lt;td&gt;No standard. Implicit in code review culture.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No existing standard captures all seven. Most capture zero or one. The reasoning specification must capture all seven in one artifact — because they're interdependent. The boundary (Parnas) is meaningless without the contract it enforces (Liskov). The contract is unverifiable without the property definition (Dijkstra). The property is unmaintainable without the rationale (Knuth). The rationale is lost without the boundary that preserves it (Parnas).&lt;/p&gt;

&lt;h2&gt;
  
  
  The shape of an Open Reasoning Spec
&lt;/h2&gt;

&lt;p&gt;An Open Reasoning Spec (working name — the community will name it) is a machine-readable document that describes what must be TRUE about a system, WHY it must be true, and HOW to check it. Three sections, mapping to the three layers:&lt;/p&gt;

&lt;h3&gt;
  
  
  Section 1: Boundaries (Layer 1 — Parnas)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;boundaries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;boundary.auth.module&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authentication&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;module&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;auth&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;logic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lives&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;here"&lt;/span&gt;
    &lt;span class="na"&gt;interface&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;exports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;function&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authenticate(credentials) → (session, error)&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;function&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ValidateSession(token) → (claims, error)&lt;/span&gt;
      &lt;span class="na"&gt;imports_allowed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;crypto/&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;database/sessions&lt;/span&gt;
      &lt;span class="na"&gt;imports_forbidden&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;business/&lt;/span&gt;  &lt;span class="c1"&gt;# auth must not depend on business logic&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;api/&lt;/span&gt;       &lt;span class="c1"&gt;# auth must not depend on API layer&lt;/span&gt;
    &lt;span class="na"&gt;enforcement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;tool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;depguard&lt;/span&gt;
      &lt;span class="na"&gt;config_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.depguard.yml&lt;/span&gt;
      &lt;span class="na"&gt;ci_gate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The boundary declares: what the module exports (the interface), what it's allowed to import (dependencies), and what it's forbidden to import (architectural boundaries). The enforcement section links the boundary to the tool that checks it. An agent reading this spec KNOWS the boundary before generating code. A CI gate reading this spec ENFORCES the boundary on every commit.&lt;/p&gt;
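&lt;p&gt;A minimal sketch of how a checker could evaluate those import lists, assuming the boundary entry has already been loaded into a dictionary. The name &lt;code&gt;check_imports&lt;/code&gt; and the rule that any import outside &lt;code&gt;imports_allowed&lt;/code&gt; counts as undeclared are illustrative assumptions, not depguard behavior.&lt;/p&gt;

```python
# Hypothetical in-memory copy of the boundary entry shown above.
BOUNDARY = {
    "id": "boundary.auth.module",
    "imports_allowed": ["crypto/", "database/sessions"],
    "imports_forbidden": ["business/", "api/"],
}


def check_imports(boundary, imports):
    """Return one violation message per import the boundary does not permit."""
    violations = []
    for path in imports:
        if any(path.startswith(prefix) for prefix in boundary["imports_forbidden"]):
            violations.append(f"{boundary['id']}: forbidden import {path}")
        elif not any(path.startswith(prefix) for prefix in boundary["imports_allowed"]):
            # Assumption: the allow-list is exhaustive for this module.
            violations.append(f"{boundary['id']}: undeclared import {path}")
    return violations


if __name__ == "__main__":
    for line in check_imports(BOUNDARY, ["crypto/hmac", "business/billing"]):
        print(line)
```

&lt;p&gt;The same dictionary could be handed to an agent as context and to a CI step as configuration, which is the point of declaring the boundary once.&lt;/p&gt;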

&lt;h3&gt;
  
  
  Section 2: Properties and contracts (Layer 2 — Dijkstra + Liskov)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prop.auth.idempotent&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authenticate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;idempotent&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;same&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;credentials&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;produce&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;same&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;session"&lt;/span&gt;
    &lt;span class="na"&gt;scope&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;boundary.auth.module&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;behavioral_contract&lt;/span&gt;
    &lt;span class="na"&gt;predicate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;for all (c: Credentials):&lt;/span&gt;
        &lt;span class="s"&gt;Authenticate(c) == Authenticate(c)&lt;/span&gt;
    &lt;span class="na"&gt;verification&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;property_test&lt;/span&gt;
      &lt;span class="na"&gt;tool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gopter&lt;/span&gt;
      &lt;span class="na"&gt;test_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auth/auth_property_test.go&lt;/span&gt;
    &lt;span class="na"&gt;mitre_attack&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;T1078.004&lt;/span&gt;  &lt;span class="c1"&gt;# relevant threat technique&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prop.iam.no_admin_escalation&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;principal&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;can&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;escalate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;admin&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;through&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;any&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;chain"&lt;/span&gt;
    &lt;span class="na"&gt;scope&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;global&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;safety_invariant&lt;/span&gt;
    &lt;span class="na"&gt;predicate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;for all (p: Principal, r: Role):&lt;/span&gt;
        &lt;span class="s"&gt;can_assume(p, r) AND is_admin(r) implies is_admin(p)&lt;/span&gt;
    &lt;span class="na"&gt;verification&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cel_predicate&lt;/span&gt;
      &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties.identity.escalation.*.present&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;!=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
      &lt;span class="na"&gt;tool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stave&lt;/span&gt;
    &lt;span class="na"&gt;validated_against&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;vendor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;datadog&lt;/span&gt;
        &lt;span class="na"&gt;lab&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;iam-002-to-admin&lt;/span&gt;
        &lt;span class="na"&gt;result&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;detected&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;vendor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bishopfox&lt;/span&gt;
        &lt;span class="na"&gt;lab&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;privesc4-CreateAccessKey&lt;/span&gt;
        &lt;span class="na"&gt;result&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;detected&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each property declares: what must be true (the predicate), how to check it (the verification method and tool), what threat it addresses (MITRE mapping), and what independent oracle validated it (lab traceability). The predicate is machine-readable — an enforcement engine can evaluate it. The description is human-readable — a developer can understand WHY.&lt;/p&gt;
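&lt;p&gt;To make the idempotency predicate concrete, here is a rough property-style check in plain Python. The &lt;code&gt;SessionStore&lt;/code&gt; class is a toy stand-in for the auth module, not its real implementation, and hashing raw credentials is only for illustration. A real setup would use a property-testing framework such as gopter rather than this hand-rolled loop.&lt;/p&gt;

```python
import hashlib
import random
import string


class SessionStore:
    """Toy auth module: sessions are keyed by a hash of the credentials."""

    def __init__(self):
        self._sessions = {}

    def authenticate(self, username, password):
        key = hashlib.sha256(f"{username}:{password}".encode()).hexdigest()
        # A repeated call returns the existing session instead of creating one.
        if key not in self._sessions:
            self._sessions[key] = f"session-{len(self._sessions) + 1}"
        return self._sessions[key]

    def session_count(self):
        return len(self._sessions)


def random_credentials(rng):
    """Generate an arbitrary (username, password) pair for one trial."""
    letters = string.ascii_lowercase
    name = "".join(rng.choice(letters) for _ in range(8))
    secret = "".join(rng.choice(letters) for _ in range(12))
    return name, secret


def check_idempotent(trials=200, seed=0):
    """Property: for all credentials c, Authenticate(c) == Authenticate(c)."""
    rng = random.Random(seed)
    for _ in range(trials):
        store = SessionStore()
        creds = random_credentials(rng)
        first = store.authenticate(*creds)
        second = store.authenticate(*creds)
        if first != second or store.session_count() != 1:
            return False
    return True


if __name__ == "__main__":
    print("prop.auth.idempotent holds:", check_idempotent())
```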

&lt;h3&gt;
  
  
  Section 3: Rationale (Layer 3 — Knuth + Naur)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;rationale&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rationale.auth.idempotency&lt;/span&gt;
    &lt;span class="na"&gt;decision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authenticate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;must&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;be&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;idempotent"&lt;/span&gt;
    &lt;span class="na"&gt;date&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2024-03-15&lt;/span&gt;
    &lt;span class="na"&gt;author&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;engineering-team&lt;/span&gt;
    &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;Incident INC-2024-03-12: retry middleware called Authenticate&lt;/span&gt;
      &lt;span class="s"&gt;twice for the same request. The non-idempotent implementation&lt;/span&gt;
      &lt;span class="s"&gt;created two sessions. The user received two session cookies.&lt;/span&gt;
      &lt;span class="s"&gt;Subsequent requests alternated between sessions, producing&lt;/span&gt;
      &lt;span class="s"&gt;intermittent authorization failures.&lt;/span&gt;
    &lt;span class="na"&gt;consequences&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Sessions are keyed by credential hash, not by call sequence&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Duplicate calls return the existing session, not a new one&lt;/span&gt;
    &lt;span class="na"&gt;properties_enforced&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;prop.auth.idempotent&lt;/span&gt;
    &lt;span class="na"&gt;supersedes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
    &lt;span class="na"&gt;temporal_validity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;active&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rationale records WHY the decision was made (the incident), WHAT consequences it has (implementation constraints), WHICH properties enforce it (linked to Section 2), and WHETHER it's still active (temporal validity). When the decision is superseded, the temporal_validity changes and the linked properties are flagged for review.&lt;/p&gt;

&lt;p&gt;This IS the context graph ThoughtWorks describes — but with enforcement links. The rationale connects to the property. The property connects to the verification. The verification connects to the CI gate. The chain is: WHY → WHAT → HOW → CHECKED.&lt;/p&gt;
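&lt;p&gt;The link from rationale to property can be followed mechanically. The sketch below assumes rationale entries are loaded as dictionaries and that any &lt;code&gt;temporal_validity&lt;/code&gt; value other than &lt;code&gt;active&lt;/code&gt; (for example a hypothetical &lt;code&gt;superseded&lt;/code&gt;) means the linked properties need review. The function name is illustrative.&lt;/p&gt;

```python
# Hypothetical in-memory copy of the rationale entry shown above.
RATIONALE = [
    {
        "id": "rationale.auth.idempotency",
        "properties_enforced": ["prop.auth.idempotent"],
        "temporal_validity": "active",
    },
]


def properties_needing_review(rationale_entries):
    """Collect property ids linked to any rationale entry that is no longer active."""
    flagged = []
    for entry in rationale_entries:
        if entry.get("temporal_validity") != "active":
            for prop_id in entry.get("properties_enforced", []):
                if prop_id not in flagged:
                    flagged.append(prop_id)
    return flagged


if __name__ == "__main__":
    superseded = [dict(RATIONALE[0], temporal_validity="superseded")]
    print(properties_needing_review(superseded))
```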

&lt;h2&gt;
  
  
  Adapted, not copied
&lt;/h2&gt;

&lt;p&gt;The seven papers identified the properties. The Open Reasoning Spec doesn't implement them as the authors originally described — because the authors wrote for human developers, not for AI-assisted development. Each idea was adapted for the new context:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Paper&lt;/th&gt;
&lt;th&gt;Original idea&lt;/th&gt;
&lt;th&gt;Problem with original in AI era&lt;/th&gt;
&lt;th&gt;Adaptation in the spec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Knuth (1984)&lt;/td&gt;
&lt;td&gt;Rationale as comments interwoven with code&lt;/td&gt;
&lt;td&gt;AI overwrites comments on regeneration. Comments go stale silently. Comments are coupled to code that changes.&lt;/td&gt;
&lt;td&gt;Rationale in a SEPARATE artifact (Section 3) that the AI reads but doesn't modify. Linked to enforcement so staleness is detectable.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dijkstra (1972)&lt;/td&gt;
&lt;td&gt;Simplicity for human reasoning about correctness&lt;/td&gt;
&lt;td&gt;AI generates code too complex for humans to reason about. Simplicity can't be enforced by asking.&lt;/td&gt;
&lt;td&gt;Properties as MECHANICAL CHECKS (Section 2) that verify correctness without requiring the human to reason about the implementation. The check replaces the reasoning.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Liskov (1994)&lt;/td&gt;
&lt;td&gt;Behavioral contracts as documentation and convention&lt;/td&gt;
&lt;td&gt;Documentation is ignored. Convention is violated by AI that doesn't know the convention.&lt;/td&gt;
&lt;td&gt;Contracts as EXECUTABLE PREDICATES (Section 2) that CI evaluates on every change. The contract is enforced, not documented.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parnas (1972)&lt;/td&gt;
&lt;td&gt;Module boundaries as design decisions in the developer's head&lt;/td&gt;
&lt;td&gt;AI doesn't hold design decisions. The developer who prompted the AI didn't make the boundary decision.&lt;/td&gt;
&lt;td&gt;Boundaries as DECLARED STRUCTURE (Section 1) with enforcement links. The boundary is in the spec, not in someone's head.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Naur (1985)&lt;/td&gt;
&lt;td&gt;Theory built through the act of writing code&lt;/td&gt;
&lt;td&gt;The act of writing was replaced by prompting. Theory is never built.&lt;/td&gt;
&lt;td&gt;Theory EXTERNALIZED into the spec — boundaries + properties + rationale = the theory in durable form. The spec IS the theory, persisted independently of the people who authored it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lehman (1980)&lt;/td&gt;
&lt;td&gt;Structural maintenance through human discipline&lt;/td&gt;
&lt;td&gt;AI generates faster than humans can maintain. Discipline doesn't scale.&lt;/td&gt;
&lt;td&gt;Maintenance rules as AUTOMATED CHECKS — deletion metrics, dependency constraints, complexity thresholds in Section 2. The maintenance is mechanical, not disciplinary.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Brooks (1986)&lt;/td&gt;
&lt;td&gt;Essential complexity recognized through the struggle of implementation&lt;/td&gt;
&lt;td&gt;AI eliminates the struggle. Essential complexity is hidden under perfect-looking output.&lt;/td&gt;
&lt;td&gt;Essential complexity NAMED EXPLICITLY as properties in Section 2. The spec forces the author to state what's hard — which properties are domain-specific, which constraints are irreducible. Naming it makes it visible.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The common adaptation across all seven: what the original paper located in the developer's MIND (theory, understanding, discipline, convention, struggle), the spec locates in a DURABLE, ENFORCEABLE ARTIFACT. The mind is fragile — people leave, forget, get overridden by AI. The artifact persists, is version-controlled, is mechanically enforced, and survives personnel changes.&lt;/p&gt;

&lt;p&gt;The specific innovation for the AI era: enforcement replaces documentation. Every previous attempt to capture properties, contracts, and rationale produced DOCUMENTATION — comments, ADRs, convention guides, design docs. Documentation goes stale because nothing checks whether it's still accurate. The Open Reasoning Spec connects every rationale entry to a mechanical check. When the check fails, the rationale is reviewed. When the rationale is superseded, the check is updated. The connection between WHY and WHAT is maintained mechanically, not by human memory.&lt;/p&gt;

&lt;p&gt;This is the difference between "writing better comments" (Knuth's original) and "declaring enforceable properties with linked rationale" (the adaptation). The idea is the same — capture WHY. The mechanism is different — enforce it instead of commenting it. The enforcement is what makes it survive in a world where AI generates and regenerates code continuously.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this enables
&lt;/h2&gt;

&lt;h3&gt;
  
  
  For AI agents
&lt;/h3&gt;

&lt;p&gt;The agent reads the spec before generating code. It knows: the auth module can't import from business/. The Authenticate function must be idempotent. The boundary is enforced by depguard. The agent generates code WITHIN these constraints — not because it was prompted to, but because the constraints are declared in a machine-readable format the agent consumes as context.&lt;/p&gt;

&lt;p&gt;The agent doesn't need the full codebase in context. It needs the reasoning spec. The spec is smaller than the code (hundreds of lines vs. thousands). It contains the information the seven papers say matters: boundaries, properties, contracts, rationale. The code is the implementation. The spec is the theory.&lt;/p&gt;

&lt;h3&gt;
  
  
  For enforcement engines
&lt;/h3&gt;

&lt;p&gt;The CI gate reads the spec and runs every verification: depguard checks boundary imports, gopter runs property tests, Stave evaluates CEL predicates, Z3 checks satisfiability. Each property has a declared verification method. The gate runs them all. The exit code is pass or fail. No human triage. No alert queue.&lt;/p&gt;
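&lt;p&gt;One way to picture the gate is as a dispatcher keyed by the declared verification method. In this sketch the runners are placeholder callables. A real gate would shell out to the named tools, and the choice to fail closed when no runner exists is an assumption.&lt;/p&gt;

```python
def run_gate(properties, runners):
    """Run each property's declared verification method; return (exit_code, failures)."""
    failures = []
    for prop in properties:
        method = prop["verification"]["method"]
        runner = runners.get(method)
        # A property with no available runner fails closed rather than being skipped.
        if runner is None or not runner(prop):
            failures.append(prop["id"])
    return (1 if failures else 0), failures


if __name__ == "__main__":
    props = [
        {"id": "prop.auth.idempotent", "verification": {"method": "property_test"}},
        {"id": "prop.iam.no_admin_escalation", "verification": {"method": "cel_predicate"}},
    ]
    # Stand-in runners; these lambdas only simulate a pass and a failure.
    runners = {"property_test": lambda p: True, "cel_predicate": lambda p: False}
    code, failed = run_gate(props, runners)
    print(code, failed)
```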

&lt;p&gt;When a property fails, the gate produces: WHICH property failed, WHAT the predicate says, WHY the property exists (linked rationale), and WHAT to do about it (remediation from the property definition). The developer sees: "prop.auth.idempotent FAILED because Authenticate created a new session for duplicate credentials. This property exists because of INC-2024-03-12 (see rationale). Fix: key sessions by credential hash."&lt;/p&gt;
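&lt;p&gt;A failure message like that can be assembled by joining the failed property with the rationale entries that list it. This is a sketch under the assumption that both are plain dictionaries. The &lt;code&gt;failure_report&lt;/code&gt; name and the output layout are made up for illustration.&lt;/p&gt;

```python
def failure_report(prop, rationale_entries, observed):
    """Join a failed property with the rationale entries that list it."""
    linked = [r for r in rationale_entries if prop["id"] in r.get("properties_enforced", [])]
    lines = [f"{prop['id']} FAILED: {observed}", f"Predicate: {prop['predicate']}"]
    for entry in linked:
        lines.append(f"Why: {entry['decision']} (see {entry['id']})")
        for consequence in entry.get("consequences", []):
            lines.append(f"Constraint: {consequence}")
    return "\n".join(lines)


if __name__ == "__main__":
    prop = {"id": "prop.auth.idempotent", "predicate": "Authenticate(c) == Authenticate(c)"}
    rationale = [{
        "id": "rationale.auth.idempotency",
        "decision": "Authenticate must be idempotent",
        "consequences": ["Sessions are keyed by credential hash, not by call sequence"],
        "properties_enforced": ["prop.auth.idempotent"],
    }]
    print(failure_report(prop, rationale, "new session created for duplicate credentials"))
```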

&lt;h3&gt;
  
  
  For humans
&lt;/h3&gt;

&lt;p&gt;The developer reads the spec and understands the system at the boundary level — Parnas's information hiding. They don't need to read the implementation. They read the interfaces, the properties, and the rationale. The spec IS Naur's theory, externalized into a durable artifact. The theory survives personnel changes because it's in the spec, not in someone's head.&lt;/p&gt;

&lt;h3&gt;
  
  
  For interoperability
&lt;/h3&gt;

&lt;p&gt;Different tools consume different sections. The linter reads boundaries. The property tester reads contracts. The compliance engine reads properties with MITRE mappings. The AI agent reads all three. The Open Reasoning Spec is the lingua franca between tools, in the same way that API description standards unified the API tooling ecosystem.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tool                Consumes from spec
────                ──────────────────
depguard            boundaries.imports_forbidden
gopter              properties.predicate (property tests)
Stave               properties.predicate (CEL expressions)
Z3                  properties.predicate (SMT-LIB export)
Soufflé             properties.predicate (Datalog rules)
CI gate             all verification sections
AI coding agent     boundaries + properties + rationale
Human developer     boundaries + rationale (readable sections)
Compliance auditor  properties + validated_against (evidence)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One spec. Nine consumers. Each reads the section it needs. The ecosystem builds around the spec the same way the API ecosystem built around OpenAPI.&lt;/p&gt;
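&lt;p&gt;Reading only the needed section can be as simple as projecting the top-level keys. The mapping below is a hypothetical subset of the consumers listed above, and &lt;code&gt;view_for&lt;/code&gt; is an illustrative name rather than part of any tool.&lt;/p&gt;

```python
# Hypothetical mapping from consumer to the top-level spec sections it reads.
CONSUMER_SECTIONS = {
    "depguard": ["boundaries"],
    "gopter": ["properties"],
    "ai_agent": ["boundaries", "properties", "rationale"],
    "human_developer": ["boundaries", "rationale"],
}


def view_for(consumer, spec):
    """Return only the top-level sections a given consumer reads."""
    wanted = CONSUMER_SECTIONS.get(consumer, [])
    return {name: spec[name] for name in wanted if name in spec}


if __name__ == "__main__":
    spec = {"boundaries": ["b1"], "properties": ["p1"], "rationale": ["r1"]}
    print(view_for("human_developer", spec))
```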

&lt;h2&gt;
  
  
  What must be true about the standard
&lt;/h2&gt;

&lt;p&gt;For an Open Reasoning Spec to succeed, it needs five properties that successful standards demonstrate:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Machine-readable AND human-readable.&lt;/strong&gt; YAML or JSON with clear field names. The developer reads it. The CI gate parses it. Same document.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Incrementally adoptable.&lt;/strong&gt; A team can start with one boundary definition and one property. They don't need to spec the entire system on day one. Each section is independently useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Tool-agnostic.&lt;/strong&gt; The predicate describes WHAT must hold, not WHICH TOOL checks it. The verification section names the tool a team runs today, but the predicate stays tool-independent, so the tool can be swapped without rewriting the property. A property that says "this function is idempotent" can be checked by gopter, QuickCheck, Hypothesis, or any other property-testing framework.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Extensible.&lt;/strong&gt; New property types, new verification methods, new rationale formats can be added without breaking existing specs. The same way mature standards add capabilities without breaking existing consumers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Versionable.&lt;/strong&gt; The spec lives in version control alongside the code. Changes to the spec are reviewed in PRs. The spec evolves with the system. Temporal validity on rationale entries handles superseded decisions.&lt;/p&gt;
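&lt;p&gt;Incremental adoption and extensibility both constrain how a spec loader should behave: missing sections are fine, and unknown keys are ignored. A small validator sketch under those assumptions follows. The required-field lists are guesses based on the examples in this article, not a published schema.&lt;/p&gt;

```python
# Hypothetical required fields per section, inferred from the examples above.
REQUIRED_FIELDS = {
    "boundaries": ["id"],
    "properties": ["id", "predicate", "verification"],
    "rationale": ["id", "decision"],
}


def validate_spec(spec):
    """Check only the sections that are present; ignore keys this version does not know."""
    errors = []
    for section, required in REQUIRED_FIELDS.items():
        for index, entry in enumerate(spec.get(section, [])):
            for field in required:
                if field not in entry:
                    errors.append(f"{section}[{index}]: missing {field}")
    return errors


if __name__ == "__main__":
    # A day-one spec: a single boundary, plus a section this validator does not know.
    print(validate_spec({"boundaries": [{"id": "boundary.auth.module"}], "future_section": []}))
```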

&lt;h2&gt;
  
  
  The precedent
&lt;/h2&gt;

&lt;p&gt;API description standards didn't invent API descriptions. Multiple competing formats existed. The standard that succeeded unified the fragments into one format that the ecosystem adopted. The key wasn't inventing something new — it was standardizing what already existed into one interoperable format.&lt;/p&gt;

&lt;p&gt;The reasoning spec fragments already exist: ADRs describe rationale. Type signatures describe boundaries. Property tests describe contracts. Linter configs describe structural rules. OSCAL describes compliance controls. CEL and Rego (the policy language of OPA) describe policy predicates.&lt;/p&gt;

&lt;p&gt;The standard that unifies them — connecting rationale to property to enforcement to verification in one document — doesn't exist yet. The seven foundational papers describe what it must contain. The three-layer model describes its structure. The ecosystem (AI agents, enforcement engines, compliance tools, human developers) describes its consumers.&lt;/p&gt;

&lt;p&gt;Every standard emerges when the coordination problem becomes acute enough. The reasoning coordination problem is becoming acute now — because AI generates code at scale without consuming the properties, contracts, and rationale that make the code safe to modify. The Open Reasoning Spec will emerge when the cost of NOT having it exceeds the cost of creating it. That point is approaching fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Wild West to engineering discipline
&lt;/h2&gt;

&lt;p&gt;Right now AI-assisted development is in its Wild West phase. Every team picks different tools. Every tool solves a different fragment. No shared vocabulary describes what "correct AI-assisted output" means. Companies evaluate tools by token throughput and generation speed — the equivalent of evaluating a construction company by how fast they pour concrete, not by whether the building stands.&lt;/p&gt;

&lt;p&gt;The waste is measurable. Tokens spent generating code that fails review. Tokens spent regenerating code that fails CI. Tokens spent debugging AI-generated code that nobody understands. Tokens spent on feedback flywheels that improve surface quality without mechanical verification. The industry is spending billions on code generation and near-zero on code verification. The ratio is inverted.&lt;/p&gt;

&lt;p&gt;An Open Reasoning Spec changes this in three ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Companies know what to evaluate.&lt;/strong&gt; Instead of "which AI coding tool generates code fastest?" the question becomes "which tool consumes the reasoning spec and produces output that satisfies the declared properties?" The spec is the evaluation criteria. A tool that generates code that violates the spec's properties is failing — regardless of how fast it generates. A tool that generates less code but satisfies every property is succeeding. The spec makes "correct" measurable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Tooling and ecosystem unify around the standard.&lt;/strong&gt; Today: fragmented solutions that don't interoperate. The linter checks boundaries but doesn't know about behavioral contracts. The property tester checks contracts but doesn't know about rationale. The AI agent generates code but doesn't know about either. With a standard: every tool reads the same spec. The linter reads boundaries from Section 1. The property tester reads contracts from Section 2. The AI agent reads all three sections before generating. The compliance auditor reads the validated_against entries. One artifact coordinates the entire toolchain. The ecosystem builds around it instead of fragmenting without it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Enforcement becomes automated.&lt;/strong&gt; The spec declares properties. The CI gate checks them. The check is mechanical — deterministic, independent of the model, runs at machine speed. Every token spent generating code that violates a declared property is a wasted token. The enforcement gate catches it before the code reaches review, before it reaches staging, before it reaches production. The waste moves from "discovered in production" to "caught at generation time." The cost of a violation drops from "incident + remediation + post-mortem" to "regenerate."&lt;/p&gt;

&lt;p&gt;This is how every engineering discipline matured. Civil engineering had its Wild West — buildings collapsed, bridges failed, each builder used different standards. Then building codes emerged. The codes declared properties: load-bearing capacity, wind resistance, seismic tolerance. The inspection process verified the properties mechanically. Builders who met the code shipped. Builders who didn't, didn't. The code was the standard. The inspection was the enforcement. The ecosystem (architects, engineers, inspectors, material suppliers) unified around the code.&lt;/p&gt;

&lt;p&gt;Software engineering still lacks a shared standard for declaring and enforcing the properties that matter. The Open Reasoning Spec is a proposal for that standard. The seven foundational papers describe what it must contain. The three-layer model describes its structure. Six safety-critical domains proved the resolution works. The only question is when the industry adopts it, not whether.&lt;/p&gt;

&lt;p&gt;The Wild West ends when enforcement becomes automated. Not when AI gets better at generating code. Not when feedback flywheels improve prompt quality. Not when context graphs capture more rationale. Enforcement. Automated. Mechanical. Deterministic. Independent of the model. That is the transition from code generation to engineering.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/sufield/stave" rel="noopener noreferrer"&gt;Stave&lt;/a&gt; implements a domain-specific version of the Open Reasoning Spec for cloud security: the observation contract (JSON Schema Draft 2020-12) defines boundaries, the control YAML defines properties with CEL predicates, the defect/infection/failure model captures rationale, and the SIR export enables multi-engine verification (SMT-LIB for Z3, Datalog for Soufflé, facts for Prolog). Every property is lab-validated against independent expert oracles. The format is specific to cloud security. The structure — boundaries + properties + rationale, machine-readable, incrementally adoptable, tool-agnostic — is the structure the open standard needs. Apache 2.0.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>standards</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>A 1972 Paper Predicted the AI Coding Crisis. Nobody Listened.</title>
      <dc:creator>Bala Paranj</dc:creator>
      <pubDate>Sun, 31 May 2026 11:40:30 +0000</pubDate>
      <link>https://dev.to/bala_paranj_059d338e44e7e/a-1972-paper-predicted-the-ai-coding-crisis-nobody-listened-1nhc</link>
      <guid>https://dev.to/bala_paranj_059d338e44e7e/a-1972-paper-predicted-the-ai-coding-crisis-nobody-listened-1nhc</guid>
      <description>&lt;p&gt;There's a paper from 1972 that describes — with precision — the crisis AI coding agents are creating right now.&lt;/p&gt;

&lt;p&gt;David Parnas published "On the Criteria To Be Used in Decomposing Systems into Modules" when software ran on mainframes and a large program was 10,000 lines. The paper argued that the conventional way of breaking systems apart (by execution steps: input → process → output) was wrong, and that systems should instead be decomposed by &lt;em&gt;design decisions that are likely to change&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The reason was human cognition. Not compiler efficiency. Not runtime performance. The hard bottleneck of software engineering, Parnas argued, is the developer's ability to understand the system well enough to change it safely. Modular boundaries exist to protect human comprehension. Everything else is secondary.&lt;/p&gt;

&lt;p&gt;More than fifty years later, AI coding agents generate complex logic 5-7x faster than humans can comprehend it. Developers accept code through what can only be called cognitive surrender: they review the output, the tests pass, and they merge it without building a mental model of how or why it works.&lt;/p&gt;

&lt;p&gt;Parnas diagnosed this exact failure mode half a century before it existed. And his cure still works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Firehose Effect
&lt;/h2&gt;

&lt;p&gt;Here's what happens in practice. You prompt an AI agent to implement a retry mechanism with exponential backoff and circuit breaking. The agent returns 300 lines of well-structured code: a &lt;code&gt;RetryPolicy&lt;/code&gt; struct, a &lt;code&gt;CircuitBreaker&lt;/code&gt; with state management, a &lt;code&gt;BackoffCalculator&lt;/code&gt; with jitter, integration tests, and error classification logic that distinguishes transient from terminal failures.&lt;/p&gt;

&lt;p&gt;The code works. The tests pass. You read it, nod at the structure, and merge it.&lt;/p&gt;

&lt;p&gt;Three weeks later, you need to change the retry behavior for one specific endpoint. You open the file. You can't remember how the circuit breaker interacts with the backoff calculator. You're not sure whether changing the max-retries constant affects the circuit breaker's failure threshold. You don't know if the error classifier's transient category includes timeouts or only HTTP 5xx responses.&lt;/p&gt;

&lt;p&gt;You wrote none of this code. You reviewed it once. You understood it well enough to approve it. You didn't understand it well enough to change it safely.&lt;/p&gt;

&lt;p&gt;That gap — between understanding enough to approve and understanding enough to change — is cognitive debt. It lives in your head, not in the code. The code is fine. Your mental model of the code is incomplete. And it became incomplete the moment the AI bypassed the cognitive struggle that would have built it.&lt;/p&gt;

&lt;p&gt;Parnas's term for the cause: the system violated information hiding. Not because the code was poorly written, but because the module boundaries weren't designed to protect your comprehension. The circuit breaker and the backoff calculator and the error classifier are all exposed to you simultaneously. When you need to change one, you have to reason about all three. The code didn't hide its design decisions. It dumped them on you. That's the firehose.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parnas's argument (it's not about code organization)
&lt;/h2&gt;

&lt;p&gt;The 1972 paper is widely cited and widely misunderstood. Most developers think Parnas argued for "clear module boundaries" — a code organization principle. He argued for something deeper: that the &lt;em&gt;criterion&lt;/em&gt; for where to place module boundaries should be based on which design decisions are likely to change.&lt;/p&gt;

&lt;p&gt;The conventional approach decomposed systems by processing steps: an input module, a processing module, an output module. Each step becomes a module. Parnas showed this was wrong — not because it's messy, but because it forces developers to understand design decisions that cross module boundaries. When the input format changes, the processing module has to change too, because it knows about the input format. The design decision (input format) leaked across the boundary. The developer changing the input format now has to understand two modules instead of one.&lt;/p&gt;

&lt;p&gt;Parnas's alternative: hide each volatile design decision inside a single module. The input format lives in one module. The processing algorithm lives in another. The output format lives in a third. When the input format changes, only the input module changes. The developer only needs to understand one module. The interface between modules is stable. The implementation behind the interface is hidden.&lt;/p&gt;

&lt;p&gt;The key word is &lt;em&gt;hidden&lt;/em&gt;. Not "encapsulated." Not "abstracted." Hidden. The developer using the module cannot see the implementation. Cannot depend on it. Cannot be confused by it. The design decision is invisible outside its module. The developer's cognitive load is bounded by the interface, not by the implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AI violates information hiding by default
&lt;/h2&gt;

&lt;p&gt;AI coding agents don't think in Parnas modules. They think in tasks.&lt;/p&gt;

&lt;p&gt;"Implement a retry mechanism with circuit breaking" is a task. The AI produces a solution to the task — often a good one. But the solution is organized around the task's execution logic, not around which design decisions should be hidden from the developer.&lt;/p&gt;

&lt;p&gt;The retry policy, the circuit breaker state machine, and the error classification logic all land in the same module (or in tightly coupled modules) because they all serve the same task. The AI doesn't ask "which of these design decisions might change independently?" It asks "what code solves the prompt?"&lt;/p&gt;

&lt;p&gt;The result is exactly the decomposition Parnas warned against: organized by execution steps (receive request → classify error → calculate backoff → check circuit state → retry or abort) rather than by design decisions (error classification policy, backoff strategy, circuit threshold). The three design decisions are tangled together. Changing one requires understanding all three.&lt;/p&gt;

&lt;p&gt;This isn't an AI quality problem. A skilled human developer writing under time pressure produces the same tangle. The difference is that the human developer at least struggled through the implementation and built a partial mental model of the tangled decisions. The AI generates the tangle instantly, and the developer never builds the model at all.&lt;/p&gt;

&lt;p&gt;When the AI generates at this speed across dozens of modules, the effect compounds. Each module has leaking design decisions. Each leaked decision is a piece of implementation the developer must understand to make safe changes. Multiply this across an AI-accelerated codebase and you get what the cognitive debt analysis calls the firehose effect — the volume of exposed design decisions exceeds human cognitive capacity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cognitive firewalls
&lt;/h2&gt;

&lt;p&gt;Parnas's solution, applied to AI-generated code, creates what amounts to a cognitive firewall:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ AI Agent generates complex implementation ]
                │
                ▼
┌──────────────────────────────────────┐
│      PARNAS MODULE INTERFACE         │  ← Developer reviews THIS
│  (what it provides, requires,        │
│   guarantees — the contract)         │
├──────────────────────────────────────┤
│  Hidden AI-Generated Internals       │  ← Hidden from working memory
│  (complex, fast, possibly opaque)    │
└──────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The developer doesn't need to understand the AI's internal algorithmic choices. They need to understand the interface: what the module provides, what it requires, what it guarantees, and what happens when those guarantees are violated.&lt;/p&gt;

&lt;p&gt;For the retry example, the Parnas decomposition hides each design decision:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Module 1: ErrorClassifier
  Interface: classify(error) → transient | terminal
  Hidden: the classification rules, the regex patterns,
          the HTTP status code mappings

Module 2: BackoffStrategy  
  Interface: nextDelay(attempt) → duration
  Hidden: the backoff curve, the jitter algorithm,
          the maximum delay cap

Module 3: CircuitBreaker
  Interface: allow() → bool; record(outcome)
  Hidden: the state machine, the failure threshold,
          the recovery window, the half-open probe logic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each interface is 1-2 lines. Each hidden implementation might be 50-100 lines of AI-generated code. The developer understands three interfaces (six lines at most). The AI generated 150-300 lines behind them. The developer's cognitive load scales with the interfaces, not with the implementation.&lt;/p&gt;

&lt;p&gt;When the developer needs to change the backoff strategy, they modify Module 2. They don't touch Module 1 or Module 3. They don't need to understand the circuit breaker's state machine to change the backoff curve. The design decisions are hidden. The cognitive firewall holds.&lt;/p&gt;

&lt;h2&gt;
  
  
  The criterion AI agents should follow
&lt;/h2&gt;

&lt;p&gt;Parnas's criterion for decomposition — "what design decisions are likely to change?" — is the instruction that's missing from most AI coding workflows.&lt;/p&gt;

&lt;p&gt;When a developer prompts "implement retry with circuit breaking," the AI should ask (or the developer should specify): "which design decisions in this implementation might change independently?"&lt;/p&gt;

&lt;p&gt;The answer for the retry example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The error classification rules will change when new error types are discovered&lt;/li&gt;
&lt;li&gt;The backoff strategy will change when performance requirements shift
&lt;/li&gt;
&lt;li&gt;The circuit breaker thresholds will change when reliability targets change&lt;/li&gt;
&lt;li&gt;The integration between them (retry loop orchestration) will change rarely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each independently-changeable decision becomes a module. The integration becomes a thin orchestrator that calls the modules through their interfaces. The developer understands the orchestrator (10 lines) and the interfaces (6 lines). The implementations are hidden.&lt;/p&gt;

&lt;p&gt;This is not new. Parnas published it in 1972. What's new is that AI makes the violation of this principle automatic and invisible. The AI generates monolithic implementations because monolithic implementations solve the prompt efficiently. The Parnas decomposition is a constraint the developer must impose on the AI's output — the AI won't impose it on itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unix already proved this
&lt;/h2&gt;

&lt;p&gt;Unix's design is Parnas applied at the operating system level. Every Unix tool hides its implementation behind a universal interface: stdin, stdout, exit codes.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;grep&lt;/code&gt; doesn't expose its regex engine. &lt;code&gt;sort&lt;/code&gt; doesn't expose its sorting algorithm. &lt;code&gt;uniq&lt;/code&gt; doesn't expose its deduplication logic. Each tool is a Parnas module with a hidden implementation and a stable interface (text streams in, text streams out).&lt;/p&gt;

&lt;p&gt;The result: a developer who has never read a line of &lt;code&gt;grep&lt;/code&gt;'s source code can compose &lt;code&gt;grep | sort | uniq&lt;/code&gt; and predict the output with certainty. The cognitive load is bounded by the interface (text stream processing), not by the implementation (regex finite automata, merge sort, adjacent-line comparison).&lt;/p&gt;

&lt;p&gt;AI-generated code should compose the same way. Each module should have a stable interface that a developer can reason about without reading the implementation. The implementation can be AI-generated, opaque, and complex. The interface must be human-readable, stable and simple.&lt;/p&gt;

&lt;p&gt;When the AI generates a module whose interface is as complex as its implementation, it has failed the Parnas test — regardless of whether the code works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Interfaces evolve into contracts
&lt;/h2&gt;

&lt;p&gt;Parnas described interfaces within a single program, in a single language. But the principle doesn't stop at the module boundary. The industry has been walking an interface evolution for decades — and each step applies the same information-hiding insight at a larger scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1: Within a program — language-level interfaces.&lt;/strong&gt; Go's &lt;code&gt;interface&lt;/code&gt; types, Java's &lt;code&gt;interface&lt;/code&gt;, TypeScript's type signatures. The developer defines the contract in the same language as the implementation. &lt;code&gt;type ErrorClassifier interface { Classify(err error) ErrorKind }&lt;/code&gt;. The compiler enforces it. The implementation is hidden behind the type system. Every developer already knows this level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2: Between modules — package boundaries.&lt;/strong&gt; Go's &lt;code&gt;internal/&lt;/code&gt; convention. Java's module system. Package-level exports vs. private internals. The developer sees the exported API. The unexported implementation is hidden by the language's visibility rules. Same Parnas principle, enforced by the language's module system rather than by discipline alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 3: Between systems — API contracts.&lt;/strong&gt; OpenAPI, gRPC/Protocol Buffers, GraphQL schemas. The interface is no longer in the same language as the implementation. The API contract is expressed in a language-independent schema. A Python client and a Go server share the OpenAPI spec — neither needs to understand the other's implementation, or even what language it's written in. This was the first major jump: information hiding across language boundaries. The contract became a standalone artifact, separate from any implementation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 4: Between organizations — standard formats.&lt;/strong&gt; JSONL, SARIF, SMT-LIB, OCSF. The interface is no longer between two specific systems — it's a standard that any system can implement. SARIF doesn't belong to any security scanner; it's a format any scanner can emit and any dashboard can consume. The contract became universal. The implementation became fully interchangeable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 5: Between human and machine — specifications.&lt;/strong&gt; This is the level the cognitive debt crisis demands. The interface is no longer between two pieces of software. It's between a human's intent and a machine's execution. A CEL predicate, a reasoning spec YAML, a typed invariant — these are contracts between the developer who knows what must be true and the engine (or AI agent) that makes it true.&lt;/p&gt;

&lt;p&gt;The evolution at each level:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Developer reasons about&lt;/th&gt;
&lt;th&gt;Interface construct&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Within a program&lt;/td&gt;
&lt;td&gt;A function — inputs, outputs, behavior&lt;/td&gt;
&lt;td&gt;Language-level signatures&lt;/td&gt;
&lt;td&gt;Go &lt;code&gt;interface&lt;/code&gt; types, Java &lt;code&gt;interface&lt;/code&gt;, C function prototypes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Between modules&lt;/td&gt;
&lt;td&gt;A module — responsibilities, boundaries&lt;/td&gt;
&lt;td&gt;Package/module exports&lt;/td&gt;
&lt;td&gt;Go &lt;code&gt;internal/&lt;/code&gt;, Java 9 modules, C header files (&lt;code&gt;.h&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Between systems&lt;/td&gt;
&lt;td&gt;A system — capabilities, failure modes&lt;/td&gt;
&lt;td&gt;API contracts, protocols&lt;/td&gt;
&lt;td&gt;Unix pipes (stdin/stdout), RPC, REST/OpenAPI, gRPC/Protobuf&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Between organizations&lt;/td&gt;
&lt;td&gt;An ecosystem — interoperability, data exchange&lt;/td&gt;
&lt;td&gt;Standard formats&lt;/td&gt;
&lt;td&gt;SARIF, JSONL, EDI, SWIFT messages, HL7/FHIR, OCSF&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Between intent and execution&lt;/td&gt;
&lt;td&gt;Intent — what must always be true&lt;/td&gt;
&lt;td&gt;Specifications, invariants&lt;/td&gt;
&lt;td&gt;CEL predicates, reasoning specs, TLA+ properties&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each level raises the abstraction. At Level 1, the developer reasons about control flow and data types within a single program. At Level 3, Unix proved that processes don't need to share memory or language — &lt;code&gt;grep | sort | uniq&lt;/code&gt; composes through stdin/stdout, the universal system-level interface. At Level 4, organizations that have never communicated can exchange data because SARIF and HL7 define the shape independently of any implementation. At Level 5, the developer reasons about what must be true, regardless of how any program, module, or system implements it.&lt;/p&gt;

&lt;p&gt;At each level, the same thing happens: the interface enables reasoning without understanding the implementation. At Level 1, you reason about the function without reading its body. At Level 3, you reason about the service without knowing its language. At Level 5, you reason about the system's safety without reading any code at all.&lt;/p&gt;

&lt;p&gt;But something deeper shifts alongside the interface: what the developer &lt;em&gt;understands&lt;/em&gt; changes.&lt;/p&gt;

&lt;p&gt;At Level 1, the developer understands &lt;strong&gt;a function&lt;/strong&gt; — its signature, its preconditions, its return type. The unit of understanding is a single operation.&lt;/p&gt;

&lt;p&gt;At Level 2, the developer understands &lt;strong&gt;a module&lt;/strong&gt; — its exported API, its responsibilities, its boundaries. The unit of understanding is a coherent set of operations.&lt;/p&gt;

&lt;p&gt;At Level 3, the developer understands &lt;strong&gt;a system's capability&lt;/strong&gt; — what the service does, not how it does it. "This service classifies errors" is a capability. The developer integrates with the capability through a contract (OpenAPI spec), not by reading source code. Integration becomes easier precisely because the contract specifies the capability at a level above the implementation.&lt;/p&gt;

&lt;p&gt;At Level 4, the developer understands &lt;strong&gt;a data shape&lt;/strong&gt; — what the standard format carries, what fields mean, what consuming tools expect. SARIF findings have a &lt;code&gt;ruleId&lt;/code&gt;, a &lt;code&gt;level&lt;/code&gt;, a &lt;code&gt;message&lt;/code&gt;. Any tool that emits this shape integrates with any dashboard that reads it. Understanding the shape is understanding the integration.&lt;/p&gt;

&lt;p&gt;At Level 5, the developer understands &lt;strong&gt;an invariant&lt;/strong&gt; — what must always be true about the system. Not how it's enforced. Not what code checks it. What property holds. "No unauthorized principal can read sensitive data through any path in the access graph" is a statement about the system's meaning, not its mechanism.&lt;/p&gt;
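
&lt;p&gt;To make the invariant concrete, here is a hedged Go sketch that checks it by breadth-first search over a toy access graph. The &lt;code&gt;Graph&lt;/code&gt; type, the &lt;code&gt;Violations&lt;/code&gt; function, and every node name are assumptions made up for this example; a real system would export its graph from actual IAM data and would likely hand the check to a dedicated reasoning engine.&lt;/p&gt;

```go
package main

import "fmt"

// Graph is a hypothetical access graph: an edge a → b means "a can reach b".
type Graph map[string][]string

// reachable returns every node reachable from start (breadth-first search).
func reachable(g Graph, start string) map[string]bool {
	seen := map[string]bool{start: true}
	queue := []string{start}
	for len(queue) != 0 {
		node := queue[0]
		queue = queue[1:]
		for _, next := range g[node] {
			if !seen[next] {
				seen[next] = true
				queue = append(queue, next)
			}
		}
	}
	return seen
}

// Violations lists principals that can reach the sensitive node without
// being authorized. An empty result means the invariant holds.
func Violations(g Graph, principals []string, sensitive string, authorized map[string]bool) []string {
	var bad []string
	for _, p := range principals {
		if authorized[p] {
			continue
		}
		if reachable(g, p)[sensitive] {
			bad = append(bad, p)
		}
	}
	return bad
}

func main() {
	g := Graph{
		"alice":       {"role-admin"},
		"bob":         {"role-viewer"},
		"role-admin":  {"payroll-db"},
		"role-viewer": {"reports"},
		"reports":     {"payroll-db"}, // an indirect path nobody intended
	}
	authorized := map[string]bool{"alice": true}
	fmt.Println(Violations(g, []string{"alice", "bob"}, "payroll-db", authorized)) // [bob]
}
```

&lt;p&gt;The check says nothing about how roles are implemented. Rewrite the backend and the statement being verified stays the same, which is the durability the invariant is meant to provide.&lt;/p&gt;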

&lt;p&gt;Notice what happens to durability as you climb. Understanding a function's implementation is fragile — it changes with every refactor, every AI regeneration, every optimization. Understanding a module's API is more durable — APIs change less often than implementations. Understanding a system's capability is more durable still — capabilities persist across rewrites. Understanding an invariant is the most durable of all — "no unauthorized access to sensitive data" doesn't change when you switch languages, rewrite the backend, or replace the team.&lt;/p&gt;

&lt;p&gt;Cognitive debt compounds fastest at the lowest levels of understanding, because that's where change is fastest and understanding is most fragile. Level 1 understanding (function internals) erodes the moment AI regenerates the function. Level 5 understanding (safety invariants) persists across every regeneration, because the invariant describes intent, not implementation.&lt;/p&gt;

&lt;p&gt;This is why the specification-first model resolves cognitive debt: it moves the developer's understanding to the level where it's most durable. You stop understanding the code (Level 1-2, fragile, erodes with every AI-generated change) and start understanding the invariants (Level 5, durable, persists across implementations). The code becomes disposable. The understanding doesn't.&lt;/p&gt;

&lt;p&gt;The industry already walked Levels 1 through 4. Most working developers have written a Go interface or its equivalent (Level 1), consumed an OpenAPI spec (Level 3), or emitted SARIF or JSONL (Level 4). The patterns are familiar. The cognitive skill — reasoning about the contract rather than the implementation — is already trained.&lt;/p&gt;

&lt;p&gt;Level 5 is the same skill applied one level higher. Instead of "I don't need to understand the HTTP handler's implementation, I need to understand the OpenAPI spec," it's "I don't need to understand the AI-generated codebase, I need to understand the safety invariants." The reasoning is the same. The level of abstraction shifts. The cognitive load drops — because the specification is always smaller than the implementation, at every level, by definition. That's Parnas's insight, scaled to its logical conclusion.&lt;/p&gt;

&lt;p&gt;What makes Level 5 urgent now is AI. At Levels 1-4, humans wrote the implementations behind the interfaces. The implementations were slow to produce, so cognitive debt accumulated gradually. At Level 5, AI generates the implementations instantly. The implementations are vast, fast-changing, and opaque. The only thing that keeps the developer's understanding intact is the specification — the Level 5 interface. Without it, the developer has no contract to reason against. They're back to reading AI-generated code line by line, which is Stage 2 feedback — the level Parnas proved was insufficient in 1972.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical debt vs. cognitive debt
&lt;/h2&gt;

&lt;p&gt;The Parnas lens sharpens the distinction between technical debt and cognitive debt:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Technical debt&lt;/th&gt;
&lt;th&gt;Cognitive debt&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Where it lives&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;In the repository — messy code, poor formatting, tight coupling&lt;/td&gt;
&lt;td&gt;In human heads — eroded understanding of how and why the system works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;How it's detected&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Linters, static analysis, code smells, compiler warnings&lt;/td&gt;
&lt;td&gt;Crises, outages, fear of changing code, "don't touch that module" folklore&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Parnas diagnosis&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bad code structure that fails the modularity test — design decisions leaked across boundaries&lt;/td&gt;
&lt;td&gt;Breakdown of the interface contract — the internals have overwhelmed the developer's ability to reason about the module&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;How it compounds&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Linearly — each shortcut adds friction&lt;/td&gt;
&lt;td&gt;Exponentially — each un-understood module makes adjacent modules harder to understand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;How AI affects it&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AI can reduce it — AI-generated code is often well-formatted and structurally clean&lt;/td&gt;
&lt;td&gt;AI accelerates it — well-formatted code that nobody understands is still cognitive debt&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The last row is the critical one. AI-generated code often has &lt;em&gt;less&lt;/em&gt; technical debt than human-written code — better naming, consistent formatting, comprehensive error handling. And it simultaneously creates &lt;em&gt;more&lt;/em&gt; cognitive debt — because the developer didn't build the mental model during writing.&lt;/p&gt;

&lt;p&gt;Tackling technical debt (reformatting, renaming, extracting functions) doesn't reduce cognitive debt. The code looks better. The developer's understanding is unchanged. The cognitive debt measures something the linter can't see: does the developer understand this code well enough to change it safely?&lt;/p&gt;

&lt;p&gt;Parnas's answer: they don't need to understand the code. They need to understand the interface. If the module boundary is drawn correctly (hiding volatile design decisions), the developer's understanding of the interface is sufficient for safe changes. The implementation is someone else's problem — whether someone else is another developer, another team, or an AI agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for AI-assisted development
&lt;/h2&gt;

&lt;p&gt;Three concrete practices follow from applying Parnas to AI-generated code:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specify the module boundaries before prompting the AI.&lt;/strong&gt; Don't prompt "implement retry with circuit breaking." Prompt: "implement three modules: an ErrorClassifier with interface &lt;code&gt;classify(error) → transient | terminal&lt;/code&gt;, a BackoffStrategy with interface &lt;code&gt;nextDelay(attempt) → duration&lt;/code&gt;, and a CircuitBreaker with interface &lt;code&gt;allow() → bool, record(outcome)&lt;/code&gt;. Each module hides its implementation. Then implement a RetryOrchestrator that calls them through their interfaces." The Parnas decomposition is the developer's job. The implementation is the AI's job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Review interfaces, not implementations.&lt;/strong&gt; When the AI returns the code, review the interfaces first. Are they stable? Do they hide the volatile design decisions? Can a developer who reads only the interfaces predict the module's behavior? If the interface exposes implementation details (a circuit breaker interface that reveals its internal state enum), send it back. The interface is the cognitive firewall. If the firewall leaks, the module fails regardless of how correct the implementation is.&lt;/p&gt;
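
&lt;p&gt;A sketch of what "send it back" looks like in Go, assuming Go 1.21 or later for the built-in &lt;code&gt;min&lt;/code&gt;. &lt;code&gt;LeakyBreaker&lt;/code&gt;, &lt;code&gt;SealedBreaker&lt;/code&gt;, and &lt;code&gt;consecutiveBreaker&lt;/code&gt; are hypothetical names, and the implementation is simplified: it counts consecutive failures and has no recovery window or half-open probe.&lt;/p&gt;

```go
package main

import "fmt"

// State is the internal enum that a leaky interface drags into the contract.
type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

// LeakyBreaker exposes the state machine: callers can branch on State(),
// so the states can no longer change without breaking them.
type LeakyBreaker interface {
	State() State
	FailureCount() int
	Record(success bool)
}

// SealedBreaker answers the only question a caller actually has.
type SealedBreaker interface {
	Allow() bool
	Record(success bool)
}

// consecutiveBreaker is a hypothetical implementation; its fields and its
// threshold are invisible to anyone holding a SealedBreaker.
type consecutiveBreaker struct{ failures, threshold int }

func newBreaker(threshold int) SealedBreaker {
	b := new(consecutiveBreaker)
	b.threshold = threshold
	return b
}

// Allow is true until the failure count reaches the threshold.
func (b *consecutiveBreaker) Allow() bool { return b.failures != b.threshold }

// Record resets on success and counts failures up to the threshold.
func (b *consecutiveBreaker) Record(success bool) {
	if success {
		b.failures = 0
		return
	}
	b.failures = min(b.failures+1, b.threshold) // built-in min needs Go 1.21+
}

func main() {
	cb := newBreaker(2)
	cb.Record(false)
	fmt.Println(cb.Allow()) // true: one failure, threshold is two
	cb.Record(false)
	fmt.Println(cb.Allow()) // false: threshold reached
	cb.Record(true)
	fmt.Println(cb.Allow()) // true: a success resets the count
}
```

&lt;p&gt;The review question is whether a caller could write code against &lt;code&gt;LeakyBreaker&lt;/code&gt; that breaks when a fourth state is added. With &lt;code&gt;SealedBreaker&lt;/code&gt; there is nothing for the caller to depend on except the answer.&lt;/p&gt;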

&lt;p&gt;&lt;strong&gt;Measure cognitive debt by interface complexity, not code complexity.&lt;/strong&gt; A module with 500 lines of AI-generated implementation behind a 2-line interface has low cognitive debt. A module with 50 lines of AI-generated code and a 20-line interface has high cognitive debt — the developer must understand 20 interface elements to use it safely. The ratio of interface complexity to implementation complexity is a Parnas-style measure of cognitive exposure. When AI generates code, track the interface complexity, not the line count.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 53-year head start
&lt;/h2&gt;

&lt;p&gt;Parnas argued in 1972 that human cognitive capacity is the hard constraint of software engineering. Not compilation speed. Not runtime performance. Not deployment frequency. The developer's ability to understand the system well enough to change it safely — that's the bottleneck that everything else depends on.&lt;/p&gt;

&lt;p&gt;For 53 years, this insight was treated as a code organization principle: define modules, use good interfaces, practice encapsulation. It was right but not urgent. Humans wrote code slowly enough that they built mental models as they went. The cognitive bottleneck was real but manageable.&lt;/p&gt;

&lt;p&gt;AI made it urgent. The cost of writing code dropped to near zero. The cost of understanding code stayed constant. The gap between production speed and comprehension speed — the gap Parnas identified as the fundamental constraint — is now growing exponentially.&lt;/p&gt;

&lt;p&gt;The cure is the same as Parnas prescribed: decompose by design decisions, hide implementations behind stable interfaces, bound the developer's cognitive load to the interface rather than the implementation. The AI generates the implementation. The developer owns the interface. The interface is the understanding.&lt;/p&gt;

&lt;p&gt;Parnas gave us a 53-year head start on solving this crisis. We just didn't know we'd need it until the AI made the constraint visible.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article is part of a series on cognitive debt in AI-assisted development. Related: "Six Contradictions Behind Cognitive Debt" (TRIZ analysis), "Six Industries That Already Solved Cognitive Debt" (cross-domain precedents), and "The Next Platform Won't Track Code. It'll Track Intent." (the specification-first platform).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwaredevelopment</category>
      <category>architecture</category>
      <category>programming</category>
    </item>
    <item>
      <title>Google Has 1,000 Platform Engineers Making Security Invisible. You Have Zero. Here's How Agents Close the Gap.</title>
      <dc:creator>Bala Paranj</dc:creator>
      <pubDate>Sat, 30 May 2026 11:57:33 +0000</pubDate>
      <link>https://dev.to/bala_paranj_059d338e44e7e/google-has-1000-platform-engineers-making-security-invisible-you-have-zero-heres-how-agents-58ed</link>
      <guid>https://dev.to/bala_paranj_059d338e44e7e/google-has-1000-platform-engineers-making-security-invisible-you-have-zero-heres-how-agents-58ed</guid>
      <description>&lt;p&gt;Google's internal infrastructure makes misconfiguration impossible. A platform SYNTHESIZES safe configurations from one-line declarations. Developers write &lt;code&gt;service: order-processor&lt;/code&gt;. The platform produces IAM roles, network policies, TLS certificates, monitoring, secrets management — all from pre-approved templates.&lt;/p&gt;

&lt;p&gt;It works. Near-zero misconfiguration incidents internally. The approach is PROVEN.&lt;/p&gt;

&lt;p&gt;It cost Google a thousand-plus platform engineers, a decade of investment, and organizational authority most companies don't have. Spotify built Backstage. Netflix built the Paved Road. Shopify built its own internal platform. Each invested YEARS and TEAMS to reach the same outcome.&lt;/p&gt;

&lt;p&gt;The rest of the industry, the companies without platform teams, gets told: be more careful.&lt;/p&gt;

&lt;p&gt;There's an alternative to templates or more platform engineers. AGENTS executing reasoning engines against machine-verifiable contracts. The same security guarantees, without the platform team. Because the reasoning is independent of the cloud provider, it works on every cloud simultaneously.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three eras of cloud security scaling
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Era&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;How it scales&lt;/th&gt;
&lt;th&gt;Who can afford it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Era 1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual audits, reviews, training&lt;/td&gt;
&lt;td&gt;Hire more people (linear)&lt;/td&gt;
&lt;td&gt;Anyone — but it doesn't work at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Era 2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Human-coded templates + internal platforms&lt;/td&gt;
&lt;td&gt;Better platform teams (sub-linear)&lt;/td&gt;
&lt;td&gt;Google, Netflix, Spotify, Shopify — top 5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Era 3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent-driven reasoning against formal specs&lt;/td&gt;
&lt;td&gt;Adding agents + reasoning engines (logarithmic)&lt;/td&gt;
&lt;td&gt;Anyone with CI/CD&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Era 1 is where 80% of organizations are today. More training and reviews. More "be more careful." Linear scaling: double the developers, double the security bugs, need double the security engineers.&lt;/p&gt;

&lt;p&gt;Era 2 is where the Big Tech giants are. Templates, Golden Paths, Paved Roads. The platform team absorbs the complexity. Sub-linear scaling: more developers don't produce proportionally more bugs because the platform prevents the bugs structurally. It works. But it costs millions per year in platform engineering salaries.&lt;/p&gt;

&lt;p&gt;Era 3 is what happens when you make the reasoning MACHINE-EXECUTABLE. Instead of human-coded templates that generate safe configurations, you have machine-verifiable contracts that PROVE configurations are safe. The agents don't generate code; they evaluate state against invariants using formal reasoning engines. Logarithmic scaling: adding reasoning capacity (more specs, more engines) covers exponentially more configurations.&lt;/p&gt;

&lt;h2&gt;
  
  
  How each Big Tech model maps — and where agents evolve it
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Spotify's Golden Paths → Reasoning specs as logic gates
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Spotify's model:&lt;/strong&gt; Software Templates in Backstage. Developer clicks "Create Secure Storage." Backstage synthesizes the Terraform, IAM roles, and monitoring automatically. The right way is the easiest way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The limitation:&lt;/strong&gt; Templates are STATIC. Human-authored. Cover pre-approved patterns only. A developer building something custom that doesn't fit a template falls off the Golden Path and is on their own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The agent evolution:&lt;/strong&gt; Reasoning specs act as LOGIC GATES for any path — not just pre-approved templates. While Spotify's templates verify that the CHOSEN PATH is safe, agents executing reasoning specs verify that ANY CONFIGURATION meets the same invariants. Custom paths get the same safety guarantees as Golden Paths.&lt;/p&gt;

&lt;h3&gt;
  
  
  Netflix's Paved Road → Agents as GPS + safety inspector
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Netflix's model:&lt;/strong&gt; Freedom and Responsibility. Stay on the Paved Road (using ConsoleMe for IAM, standard deployment tools) and everything is automated and secure. Go off-road and you're responsible for your own security.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The limitation:&lt;/strong&gt; Off-road is where innovation happens. And off-road is where breaches happen. The model accepts that custom work is inherently riskier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The agent evolution:&lt;/strong&gt; Agents executing reasoning engines (Z3 for satisfiability, Soufflé for reachability, Prolog for logic programs) provide Paved Road-level assurance EVEN FOR OFF-ROAD configurations. The agent doesn't care whether the configuration came from a template or from a developer's custom Terraform. It evaluates the STATE against invariants. Custom configurations get formally verified — not just template-checked.&lt;/p&gt;

&lt;h3&gt;
  
  
  Shopify's abstracted infrastructure → Machine-readable contracts as the API
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Shopify's model:&lt;/strong&gt; Minimize the surface area of choice. Developers interact with an internal platform that hides Kubernetes and cloud provider complexity. The platform synthesizes configuration based on application needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The limitation:&lt;/strong&gt; The platform team must maintain the abstraction layer. Every new cloud feature requires platform-team work to incorporate. The abstraction lags the cloud.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The agent evolution:&lt;/strong&gt; JSON schemas and published contracts ARE the abstraction layer — but machine-readable. The contracts define what safe state looks like for each asset type. Agents can consume these contracts directly, playing the role of the "Platform Team" without the platform team. New cloud features get covered by adding a schema and controls — not by rebuilding the abstraction layer.&lt;/p&gt;
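&lt;p&gt;To make "contract as data" concrete, here is a small sketch using only the Python standard library. The contract format, the &lt;code&gt;BUCKET_CONTRACT&lt;/code&gt; name, and the field names are invented for illustration; a real setup would more likely use JSON Schema and a validator. Under these assumptions, covering a new asset type means adding another contract dictionary, not new checking code.&lt;/p&gt;

```python
# Hypothetical contract for one asset type:
# field name mapped to (expected type, required safe value or None).
BUCKET_CONTRACT = {
    "name": (str, None),
    "encrypted": (bool, True),
    "public_read": (bool, False),
}


def check_contract(asset: dict, contract: dict):
    """Return a list of human-readable contract breaches for one asset."""
    problems = []
    for field, (expected_type, safe_value) in contract.items():
        if field not in asset:
            problems.append(f"{field}: missing")
        elif not isinstance(asset[field], expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}")
        elif safe_value is not None and asset[field] != safe_value:
            problems.append(f"{field}: must be {safe_value}")
    return problems


good = {"name": "logs", "encrypted": True, "public_read": False}
bad = {"name": "tmp", "encrypted": "yes"}

print(check_contract(good, BUCKET_CONTRACT))  # []
print(check_contract(bad, BUCKET_CONTRACT))   # ['encrypted: expected bool', 'public_read: missing']
```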

&lt;h3&gt;
  
  
  Google's formal verification → The same rigor, externalized
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Google's model:&lt;/strong&gt; Internal formal verification of infrastructure invariants. The most rigorous approach — and the most expensive to staff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The limitation:&lt;/strong&gt; Requires formal-methods expertise that's rare and expensive. Google can hire it. Most organizations can't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The agent evolution:&lt;/strong&gt; The formal rigor lives in the REASONING ENGINES (Z3, Soufflé, Clingo, Prolog, PRISM) — not in the operator's head. The operator writes invariants as CEL predicates. The system exports standardized facts (JSONL, SMT-LIB). The reasoning engines consume the facts and produce formally grounded results. The operator gets Google-level formal verification without needing to write TLA+ or hire formal-methods PhDs.&lt;/p&gt;
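&lt;p&gt;The fact-export step can be sketched in a few lines. This is an illustration only: the snapshot contents and field names are hypothetical, and it shows the JSONL half of the export (one JSON object per line), not SMT-LIB and not Stave's actual output format.&lt;/p&gt;

```python
import json

# Hypothetical normalized snapshot; the field names are illustrative only.
SNAPSHOT = [
    {"asset": "bucket/logs", "kind": "bucket", "public_read": False},
    {"asset": "bucket/tmp", "kind": "bucket", "public_read": True},
]


def to_jsonl(facts):
    """Serialize facts as JSON Lines: one compact JSON object per line."""
    return "\n".join(json.dumps(fact, sort_keys=True) for fact in facts)


def from_jsonl(text):
    """Parse JSON Lines back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


exported = to_jsonl(SNAPSHOT)
print(exported)

# A downstream engine adapter can reload the same facts without cloud credentials.
public_assets = [fact["asset"] for fact in from_jsonl(exported) if fact["public_read"]]
print(public_assets)  # ['bucket/tmp']
```

&lt;p&gt;Because the facts are plain text, the reasoning step can run against a file on disk, which is what makes an air-gapped evaluation possible.&lt;/p&gt;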

&lt;h2&gt;
  
  
  The staffing math
&lt;/h2&gt;

&lt;p&gt;The staffing numbers make the argument concrete:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Era 2 staffing (Big Tech IDP model):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Platform team:           5-20 engineers (ongoing)
Security architects:     2-5 (ongoing)
Template authors:        3-8 (ongoing)  
Cloud-specific experts:  1-3 per cloud provider
Total:                   15-40 engineers dedicated to the platform
Cost:                    $3M-$10M/year in salaries alone
Time to value:           12-24 months
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Era 3 staffing (agent-driven reasoning):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Security architect:      1 (writes reasoning specs + catalog controls)
Infrastructure:          Existing CI/CD pipeline
Reasoning engines:       Open source (Z3, Soufflé, Clingo, Prolog)
Cloud-specific work:     Steampipe collectors (community-maintained)
Total:                   1 engineer + existing infrastructure
Cost:                    One salary + open-source tooling
Time to value:           Days to weeks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The shift from 15-40 engineers to 1 engineer isn't a quality trade-off. It's an ARCHITECTURAL trade-off. The platform-team model puts the intelligence in HUMAN-WRITTEN TEMPLATES. The agent-driven model puts the intelligence in MACHINE-EXECUTABLE REASONING SPECS. The specs are reusable, composable, and formally verifiable. The templates are bespoke, cloud-specific, and manually maintained.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A caveat on Era 3 staffing:&lt;/strong&gt; Era 3 still requires policy governance. You don't need 40 engineers to build the platform, but you still need security stakeholders to define the invariants (e.g., "What counts as a 'Production' asset?").&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What "agent" means here:&lt;/strong&gt; AI agents (LLMs) and reasoning agents (symbolic logic) are different things. This article focuses on symbolic AI (Z3, Prolog), which is much more reliable for security than generative AI (LLMs). We are talking about reasoning engines, not chatbots.&lt;/p&gt;

&lt;h2&gt;
  
  
  The multi-cloud problem — and why it kills most approaches
&lt;/h2&gt;

&lt;p&gt;Here the architecture produces an advantage no cloud vendor can match.&lt;/p&gt;

&lt;h3&gt;
  
  
  The vendor lock-in problem
&lt;/h3&gt;

&lt;p&gt;AWS Security Hub, Google Security Command Center, and Azure Defender each offer multi-cloud support. What they mean: "send all your Azure and GCP logs into OUR database."&lt;/p&gt;

&lt;p&gt;The cloud provider wants to be the manager of managers. They use multi-cloud as a way to pull data INTO their ecosystem. They don't want to make it easy for you to leave. Their multi-cloud story is a lock-in strategy wearing an interoperability costume.&lt;/p&gt;

&lt;h3&gt;
  
  
  The black-box problem with security vendors
&lt;/h3&gt;

&lt;p&gt;Wiz, Prisma Cloud, Orca — each is genuinely multi-cloud. Their agents scan AWS, Azure, and GCP. But the logic they use to determine risk is a proprietary black box.&lt;/p&gt;

&lt;p&gt;If you want to define a public bucket differently for Azure than for AWS (because your organization has different policies per cloud), you wait for the vendor to update their product. You can't write your own formal proof. You can't inspect the logic. You can't extend it. You rent the verdict.&lt;/p&gt;

&lt;h3&gt;
  
  
  The universal translation layer
&lt;/h3&gt;

&lt;p&gt;Because the architecture uses an intermediate representation — standardized facts in JSONL and SMT-LIB format — the reasoning is INDEPENDENT of the cloud provider.&lt;/p&gt;

&lt;p&gt;A reasoning spec for transitive reachability is written ONCE. Because the input data is normalized into machine-verifiable contracts through vendor-neutral schemas, the same Z3 or Prolog code works for AWS, Azure, and GCP.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traditional multi-cloud:
    Write "public bucket" rule for AWS    (AWS-specific syntax)
    Write "public bucket" rule for Azure  (Azure-specific syntax)  
    Write "public bucket" rule for GCP    (GCP-specific syntax)
    Maintain 3 versions. Test 3 versions. Debug 3 versions.

Agent-driven reasoning:
    Write reasoning spec ONCE
    Agents apply it to normalized facts from ANY cloud
    One spec. One test. One truth.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
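&lt;p&gt;A sketch of "write the spec once" for transitive reachability. Everything provider-shaped here is an assumption: the three adapter functions and their record field names are invented, and a breadth-first search in Python stands in for what Soufflé or Prolog would compute. The reachability function itself never mentions a cloud.&lt;/p&gt;

```python
from collections import deque


# Hypothetical per-cloud adapters: each maps a provider-shaped record
# to the same neutral edge fact (source, target). Field names are made up.
def from_aws(record):
    return (record["SourceArn"], record["TargetArn"])


def from_azure(record):
    return (record["sourceId"], record["destinationId"])


def from_gcp(record):
    return (record["from"], record["to"])


def reachable(edges, start):
    """Breadth-first search: every node transitively reachable from start."""
    graph = {}
    for source, target in edges:
        graph.setdefault(source, []).append(target)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


edges = [
    from_aws({"SourceArn": "internet", "TargetArn": "aws:lb"}),
    from_aws({"SourceArn": "aws:lb", "TargetArn": "aws:app"}),
    from_azure({"sourceId": "aws:app", "destinationId": "azure:db"}),
    from_gcp({"from": "gcp:batch", "to": "gcp:bucket"}),
]

# The spec is written once and applied to facts from every provider.
print(sorted(reachable(edges, "internet")))  # ['aws:app', 'aws:lb', 'azure:db']
```

&lt;p&gt;Adding a fourth provider in this sketch means writing one more adapter; &lt;code&gt;reachable&lt;/code&gt; stays untouched.&lt;/p&gt;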



&lt;p&gt;The staffing implication in a multi-cloud world:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Traditional multi-cloud&lt;/th&gt;
&lt;th&gt;Agent-driven reasoning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hiring&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Need an AWS expert, an Azure expert, and a GCP expert&lt;/td&gt;
&lt;td&gt;Need ONE security architect who understands the contracts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Logic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Write rules 3 times in 3 syntaxes&lt;/td&gt;
&lt;td&gt;Write the reasoning spec ONCE; agents apply to all clouds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3 different dashboards, 3 different finding formats&lt;/td&gt;
&lt;td&gt;One unified stream of machine-readable facts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;New cloud&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hire another expert, write rules again&lt;/td&gt;
&lt;td&gt;Map new cloud to existing schemas; specs work immediately&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The comparison
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWS Security Hub    = a FEATURE of AWS      (you're locked in)
Wiz                 = a SERVICE you rent     (you can't inspect or extend the logic)
Stave               = a PROTOCOL you own     (the reasoning is yours, runs anywhere)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A feature lives inside a vendor's ecosystem. A service lives inside a vendor's infrastructure. A protocol lives inside YOUR infrastructure — air-gapped, credential-free, extensible, inspectable.&lt;/p&gt;

&lt;p&gt;No cloud vendor can offer provider-independent reasoning because their primary goal is selling their own compute and storage — not making you cloud-agnostic. No security SaaS can offer inspectable reasoning because their business model depends on the logic being proprietary. The architecture that's logic-first and cloud-second can only be built by someone whose business model DOESN'T depend on cloud lock-in or proprietary logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  The democratization argument
&lt;/h2&gt;

&lt;p&gt;Google, Spotify, Netflix, and Shopify discovered that human error is a scaling constant. Double the developers → double the security bugs. Unless you change the RELATIONSHIP between the developer and the infrastructure.&lt;/p&gt;

&lt;p&gt;They changed the relationship through massive platform-team investment. Templates. Golden Paths. Paved Roads. Abstracted infrastructure. It worked. It cost millions per year. It's inaccessible to the 95%.&lt;/p&gt;

&lt;p&gt;The agent-driven reasoning model changes the same relationship through a different mechanism: instead of human-authored templates that generate safe code, machine-executable specs that PROVE code is safe. The mechanism is different. The outcome is the same: developers can't produce unsafe configurations because the system catches them — whether the configuration came from a template or from scratch.&lt;/p&gt;

&lt;p&gt;The secret sauce that let Big Tech scale without security collapse was never the TEMPLATES. It was the INVARIANTS — the knowledge of what safe means, expressed in a form the machine can evaluate. Templates are ONE way to express invariants. Reasoning specs against machine-verifiable contracts are ANOTHER. The second way doesn't require a platform team.&lt;/p&gt;

&lt;p&gt;You can finally have the Google/Netflix security model without hiring a 15-40 person platform team. The invariants are in the catalog. The reasoning is in the engines. The evaluation is in the pipeline. The platform team is replaced by agents executing formal proofs against standardized facts.&lt;/p&gt;

&lt;p&gt;That's not multi-cloud support. That's multi-cloud abstraction. Not a feature of one cloud. Not a service you rent. A protocol you own.&lt;/p&gt;

&lt;h2&gt;
  
  
  How this relates to existing compliance mods
&lt;/h2&gt;

&lt;p&gt;The Era-3 agent model needs two things existing per-resource framework mods can't structurally provide: machine-verifiable &lt;em&gt;compositional&lt;/em&gt; contracts (agents reason across resources, not within them) and an evaluation surface independent of the cloud-provider's SQL schema (so the agent reuses one reasoning vocabulary across AWS, GCP, Azure, K8s). &lt;a href="https://hub.powerpipe.io/mods/turbot/aws_compliance" rel="noopener noreferrer"&gt;&lt;code&gt;turbot/steampipe-mod-aws-compliance&lt;/code&gt;&lt;/a&gt; ships ~540 controls across 16+ frameworks and is the right tool for "render me a CIS dashboard for the auditor" — its SQL is tied to live AWS APIs by design. Stave's CEL predicates + JSON-Schema-anchored snapshot + nine-engine export are the agent-consumable form: authorship-agnostic, provider-independent, composition-aware. Two surfaces, complementary jobs, both render in Powerpipe — see &lt;a href="https://github.com/sufield/stave/blob/main/docs/comparison/aws-compliance-mod.md" rel="noopener noreferrer"&gt;github.com/sufield/stave/blob/main/docs/comparison/aws-compliance-mod.md&lt;/a&gt; for the side-by-side.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Era 3 cloud security: agent-driven reasoning against machine-verifiable contracts. CEL predicates evaluated against air-gapped snapshots. Standardized facts (JSONL, SMT-LIB) exported for nine external reasoning engines (Z3, Soufflé, Clingo, Prolog, PRISM, and more). Provider-independent. Logic-first. Cloud-second. &lt;a href="https://github.com/sufield/stave" rel="noopener noreferrer"&gt;Stave&lt;/a&gt;, an open-source risk reasoning engine. The Google security model without the Google platform team. Try it: &lt;code&gt;bash examples/demo-ai-security/run.sh&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>security</category>
      <category>architecture</category>
      <category>ai</category>
    </item>
    <item>
      <title>Six Domains Already Solved AI's Cognitive Debt Problem. Software Is the Last to Learn.</title>
      <dc:creator>Bala Paranj</dc:creator>
      <pubDate>Fri, 29 May 2026 15:17:28 +0000</pubDate>
      <link>https://dev.to/bala_paranj_059d338e44e7e/six-domains-already-solved-ais-cognitive-debt-problem-software-is-the-last-to-learn-34m3</link>
      <guid>https://dev.to/bala_paranj_059d338e44e7e/six-domains-already-solved-ais-cognitive-debt-problem-software-is-the-last-to-learn-34m3</guid>
      <description>&lt;p&gt;Nuclear submarine crews operate reactors they can't fully comprehend in real-time. Microprocessor engineers design chips with billions of transistors no single person understands. Air traffic controllers manage thousands of aircraft simultaneously. Medical specialists treat patients whose full condition exceeds any one doctor's knowledge. Legal systems maintain millions of laws no lawyer has read entirely. Google engineers work in a monorepo of 2 billion lines nobody holds in their head.&lt;/p&gt;

&lt;p&gt;Each domain faced the same crisis software development is facing now: the system became too complex for human comprehension, and the consequences of misunderstanding became catastrophic. Each domain solved it. Each arrived at the same three-layer structure independently.&lt;/p&gt;

&lt;p&gt;Software development is the last domain to hit this wall — because AI-assisted code generation is the forcing function that made the complexity exceed human comprehension in months rather than decades. The solution already exists. It's been proven across six domains. The only thing missing is the correct vocabulary to carry it over.&lt;/p&gt;

&lt;p&gt;That vocabulary requires separating three debts that the industry currently conflates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three debts, three locations
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Technical debt:
    Location:  The codebase
    What it is: Code that works but is structured poorly — 
                shortcuts, missing abstractions, duplicated logic
    Who named it: Ward Cunningham (1992)
    Key property: DELIBERATE — the team chose the shortcut
                  knowingly, intending to pay it back later

Cognitive debt:
    Location:  The developer's mind
    What it is: Loss of comprehension — the developer cannot
                reason about the system, cannot predict its behavior,
                cannot modify it safely
    Who named it: Margaret-Anne Storey, building on Peter Naur's
                  "Programming as Theory Building" (1985)
    Key property: The debt lives in PEOPLE, not in code.
                  The code can be well-structured. The developer still
                  doesn't understand it.

Intent debt:
    Location:  The system's enforcement layer
    What it is: The gap between what the system SHOULD preserve
                and what is actually expressed as an executable
                constraint that the system CAN enforce
    Who named it: Margaret-Anne Storey — formalized in "From Technical
                  Debt to Cognitive and Intent Debt: Rethinking Software
                  Health in the Age of AI" (arxiv 2603.22106, March 2026).
                  Storey defines intent debt as the absence of externalized
                  rationale that developers and AI agents need to work 
                  safely with code.
    Key property: The debt lives in the ABSENCE of a mechanism.
                  The intent exists (someone decided something)
                  but the system has no way to enforce it.

    Note: Storey's definition emphasizes externalized KNOWLEDGE
    broadly. This article narrows the focus to externalized
    EXECUTABLE CONSTRAINTS specifically — the subset of intent
    that can be mechanically enforced. The distinction matters:
    an ADR that explains WHY is externalized knowledge (reduces
    intent debt in Storey's framing). A CI check that ENFORCES
    the why is an executable constraint (reduces intent debt in
    this article's framing). Both are necessary. The executable
    constraint is what makes the intent durable at machine speed.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are independent. You can have any combination of the three. A system can have zero technical debt (well-structured code), high cognitive debt (nobody understands it), and high intent debt (no constraints are expressed). Or well-structured code, full comprehension, but no mechanical enforcement. Each combination produces different failures and requires different solutions.&lt;/p&gt;
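&lt;p&gt;Independence has a simple arithmetic consequence: if each debt is treated as either LOW or HIGH (a simplification made for this sketch), three debts give 2&lt;sup&gt;3&lt;/sup&gt; = 8 combinations. The snippet below only enumerates them; the variable names are illustrative.&lt;/p&gt;

```python
from itertools import product

DEBTS = ("technical", "cognitive", "intent")
LEVELS = ("LOW", "HIGH")

# Three independent two-valued debts give 2 ** 3 = 8 combinations.
states = [dict(zip(DEBTS, levels)) for levels in product(LEVELS, repeat=len(DEBTS))]

for number, state in enumerate(states, start=1):
    print(number, state)

print(len(states))  # 8
```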

&lt;h2&gt;
  
  
  Why the distinction matters
&lt;/h2&gt;

&lt;p&gt;When the terms are conflated, the solutions are wrong:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If you think cognitive debt IS intent debt:
    You try to fix comprehension by writing specifications.
    But specifications don't make the developer understand the code.
    They make the SYSTEM enforce constraints. The developer's mental
    model is still missing. The specification gate catches violations —
    but the developer can't debug, modify, or extend the code because
    they never built the theory of how it works.

If you think intent debt IS cognitive debt:
    You try to fix enforcement by improving developer understanding.
    Pair programming, code reviews, documentation. The developer
    understands the system — but the understanding lives in their head.
    When they leave, the intent leaves with them. When AI generates
    code at 3 AM, no human is reviewing. The understanding exists.
    The enforcement doesn't.

If you think either IS technical debt:
    You try to fix both by refactoring code.
    But the code might already be well-structured. Cognitive debt doesn't live
    in the code — it lives in the developer's mind. Intent debt doesn't
    live in the code — it lives in the absence of a constraint.
    Refactoring addresses neither.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each debt has a different solution because each lives in a different place.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Parnas root
&lt;/h2&gt;

&lt;p&gt;David Parnas's 1972 paper "On the Criteria To Be Used in Decomposing Systems into Modules" is the common root of both cognitive debt and intent debt — though Parnas didn't use either term.&lt;/p&gt;

&lt;p&gt;His insight: humans can't hold entire systems in their heads. His solution: information hiding. Each module exposes an interface (what it does) and hides its implementation (how it does it).&lt;/p&gt;

&lt;p&gt;The module interface is a single artifact that addresses BOTH debts simultaneously:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;As a comprehension mechanism&lt;/strong&gt; (addresses cognitive debt): the developer only needs to understand what the module promises, not how it works internally. The interface is the boundary of comprehension. The implementation behind the boundary is irrelevant to anyone outside the module. This is how Boeing engineers work on a 787 — each engineer understands their subsystem's interface contracts, not the entire aircraft.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;As an intent mechanism&lt;/strong&gt; (addresses intent debt): the interface IS the declaration of what must be true at this boundary. The function signature, the API contract, the type constraint — each is an expressed intent. The system can enforce it. The compiler checks type signatures. The contract test checks API conformance. The database engine checks schema constraints.&lt;/p&gt;

&lt;p&gt;Parnas didn't need to distinguish cognitive debt from intent debt because in 1972, one act produced both: the developer DESIGNED the module interface, which simultaneously built their mental model (comprehension) and declared the contract (intent). The act of writing code was the mechanism that produced both outputs.&lt;/p&gt;
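&lt;p&gt;One way to picture an interface doing both jobs, as a hedged Python sketch: the &lt;code&gt;Slugger&lt;/code&gt; interface, its contract, and &lt;code&gt;RegexSlugger&lt;/code&gt; are all hypothetical examples, not taken from Parnas. The abstract class is what a reader needs to understand (comprehension); the contract test is what a machine can enforce against any implementation (intent).&lt;/p&gt;

```python
import re
import string
from abc import ABC, abstractmethod

# Characters the contract allows in a slug.
ALLOWED = set(string.ascii_lowercase + string.digits + "-")


class Slugger(ABC):
    """The interface: the promise callers depend on, with no hint of the how."""

    @abstractmethod
    def slugify(self, title: str):
        """Contract: the result uses only a-z, 0-9 and '-', and slugifying
        a slug returns it unchanged."""


class RegexSlugger(Slugger):
    """One hidden implementation; a caller never needs to read it."""

    def slugify(self, title: str):
        return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def check_contract(slugger: Slugger, samples):
    """Contract test: enforce the declared intent against any implementation."""
    for title in samples:
        slug = slugger.slugify(title)
        assert set(slug).issubset(ALLOWED), slug
        assert slugger.slugify(slug) == slug, slug


check_contract(RegexSlugger(), ["Hello, World! 2024", "  Parnas 1972  ", ""])
print(RegexSlugger().slugify("Hello, World! 2024"))  # hello-world-2024
```

&lt;p&gt;Swapping in a different implementation, whether written by a person or generated, changes nothing for callers as long as &lt;code&gt;check_contract&lt;/code&gt; still passes.&lt;/p&gt;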

&lt;h2&gt;
  
  
  What AI broke
&lt;/h2&gt;

&lt;p&gt;AI-assisted development decoupled comprehension from intent expression. The Su-Field (substance-field) model from TRIZ, the theory of inventive problem solving, makes the decoupling visible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before AI: one field produces two useful outputs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Su-Field Model — Manual Development

         F (writing code)
        / \
       /   \
      /     \
    S1       S2
 Developer   Codebase

 Writing code is the FIELD that connects the developer
 to the codebase. This single field produces TWO outputs:

 Output 1: Comprehension (cognitive debt ↓)
   The developer builds a mental model WHILE writing.
   Typing, debugging, rewriting — each forces internalization
   of the logic, the edge cases, the constraints.

 Output 2: Expressed intent (intent debt ↓)
   The code itself captures decisions. The function signature
   IS the interface contract. The module boundary IS the
   architectural intent. The test IS the specification.

 One field. Two outputs. Both debts managed by a single act.
 The system is COMPLETE — no missing interactions.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In TRIZ terms, this is a complete Su-Field system: two substances (developer, codebase) connected by one field (writing code) that produces the desired effects (comprehension + expressed intent). The system works because the field simultaneously acts on both the developer (building understanding) and the codebase (encoding intent).&lt;/p&gt;

&lt;h3&gt;
  
  
  With AI: the field changes, both outputs disappear
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Su-Field Model — AI-Assisted Development

         F' (prompting)
        / \
       /   \
      /     \
    S1       S3
 Developer   AI Agent
                |
                | generates code
                ↓
               S2
            Codebase

 The field changed from WRITING CODE to PROMPTING.
 A new substance (S3: AI agent) was inserted between
 the developer and the codebase.

 Output 1: Comprehension → LOST
   The developer prompts. The AI writes. The developer
   never types the code, never debugs the logic, never
   struggles with the edge cases. The mental model is
   never formed. Cognitive debt accumulates.

 Output 2: Expressed intent → LOST
   The AI generates code from patterns, not from the
   developer's architectural decisions. No explicit
   interface contract was declared. No module boundary
   was designed. The AI's output captures PATTERNS
   from training data, not INTENT from the developer.
   Intent debt accumulates.

 The field (prompting) doesn't produce either output.
 The system is INCOMPLETE — both useful effects are missing.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In TRIZ terms, the field substitution (writing → prompting) eliminated both useful outputs. The system went from complete (one field, two outputs) to incomplete (one field, neither of the original outputs). The inserted substance (AI agent) produces a new output (code, faster) but doesn't produce the two outputs the old field provided (comprehension, expressed intent).&lt;/p&gt;

&lt;h3&gt;
  
  
  The TRIZ resolution: add fields, don't restore the old one
&lt;/h3&gt;

&lt;p&gt;TRIZ says: don't try to restore the original field (that would mean going back to manual coding). Instead, add new fields that independently produce the missing outputs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Su-Field Model — Resolved System

    Read top to bottom. The AI generates code into the codebase.
    Two additional fields act on the same codebase independently.

                S1 Developer
                |
                | F' (prompting)
                ↓
                S3 AI Agent
                |
                | generates code
                ↓
                S2 Codebase
               / \
              /   \
             /     \
            /       \
           ↓         ↓
          F2          F3
    Parnas boundaries  Mechanical enforcement
           |           |
           ↓           ↓
      S1 Developer   S4 Specification gate
      COMPREHENDS    ENFORCES intent
      at interface   on every change
      level          mechanically
    (cognitive       (intent
     debt ↓)          debt ↓)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;F'  produces code (fast, AI-generated)
F2  produces comprehension (developer understands interfaces,
    not the AI-generated implementation behind them)
F3  produces enforced intent (the gate checks constraints
    regardless of who generated the code)

Three fields. One codebase. Two restored outputs.
The system is COMPLETE again.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The key: F' (prompting) still works — you keep the speed gains from AI generation. F2 and F3 are ADDED alongside it, not instead of it. The developer doesn't go back to writing all the code. They understand the system through its interfaces (F2) and the system enforces their intent through mechanical checks (F3). The AI generates freely within the boundaries that F2 and F3 maintain.&lt;/p&gt;
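&lt;p&gt;A specification gate (F3) can be very small. The sketch below uses Python's standard &lt;code&gt;ast&lt;/code&gt; module to enforce one boundary rule on any changed file, regardless of who or what wrote it. The rule, the &lt;code&gt;billing&lt;/code&gt; package, and the file names are hypothetical, and the check is deliberately narrow: it catches only imports that name the forbidden dotted path directly.&lt;/p&gt;

```python
import ast

# Hypothetical rule: code outside the billing package may import its public
# interface (billing.api) but none of its internals.
FORBIDDEN_PREFIX = "billing.internal"


def boundary_violations(source: str, filename: str = "unknown.py"):
    """Return one message per import that names the forbidden package path."""
    found = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if name == FORBIDDEN_PREFIX or name.startswith(FORBIDDEN_PREFIX + "."):
                found.append(f"{filename}:{node.lineno}: imports {name}")
    return found


ok_change = "from billing.api import charge\n"
bad_change = "import os\nfrom billing.internal.ledger import post_entry\n"

print(boundary_violations(ok_change, "checkout.py"))   # []
print(boundary_violations(bad_change, "checkout.py"))
```

&lt;p&gt;For &lt;code&gt;bad_change&lt;/code&gt; the function reports &lt;code&gt;checkout.py:2: imports billing.internal.ledger&lt;/code&gt;. In a CI job, a non-empty result would fail the build, so the boundary holds on every change whether a person or an AI agent produced it.&lt;/p&gt;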

&lt;h3&gt;
  
  
  The same Su-Field structure across all six domains
&lt;/h3&gt;

&lt;p&gt;The reason the solution carries over from nuclear submarines to software is not analogy. It's structural identity. The Su-Field model of the problem is the SAME in every domain. When the problem structure is the same, the solution structure is the same.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Every domain that solved this had the same Su-Field problem:

    One field (F) previously produced two outputs:
        Output 1: Human comprehension of the system
        Output 2: Expressed constraints the system enforces

    The field was replaced or overwhelmed:
        Nuclear:     Reactor dynamics exceed operator comprehension speed
        Chips:       Transistor count exceeds designer comprehension capacity
        Aviation:    Aircraft speed exceeds pilot reaction time
        Medicine:    Patient complexity exceeds single-doctor knowledge
        Law:         Legal corpus exceeds single-lawyer memory
        Google:      Codebase size exceeds single-engineer understanding
        Software/AI: Code generation speed exceeds developer comprehension

    Both outputs were lost:
        Comprehension → operators/designers/pilots/doctors can't hold
                        the full system in their heads anymore
        Enforcement  → the intent isn't mechanically checked because
                        it relied on the human who could no longer keep up

    Every domain added the same two fields:

        F2: Comprehension through bounded interfaces
            Nuclear:   Operator understands PROCEDURES, not reactor physics
            Chips:     Designer understands INTERFACE SPECS, not transistors
            Aviation:  Pilot understands FLIGHT PLAN, not aerodynamics
            Medicine:  Specialist understands THEIR DOMAIN, not all medicine
            Law:       Lawyer understands THEIR JURISDICTION, not all law
            Google:    Engineer understands THEIR MODULE'S API, not 2B lines
            Software:  Developer understands PARNAS BOUNDARIES, not AI code

        F3: Enforcement through mechanical gates
            Nuclear:   Physical interlocks enforce parameter limits
            Chips:     Formal verification tools check every interface
            Aviation:  Fly-by-wire enforces flight envelope
            Medicine:  Lab equipment produces deterministic diagnostics
            Law:       Statutory databases flag conflicts automatically
            Google:    CI gates block contract violations on every commit
            Software:  Specification gate checks constraints on every change
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Seven domains. Same problem structure. Same solution structure. The solution didn't transfer by metaphor. It transferred because the Su-Field model is structurally identical: one field replaced → two outputs lost → two fields added to restore each output independently.&lt;/p&gt;

&lt;p&gt;This is why the three-layer model works for software: it's not an adaptation of what nuclear submarines do. It IS what nuclear submarines do — the same architectural resolution of the same structural problem, expressed in software-native mechanisms (Parnas boundaries instead of operating procedures, CI gates instead of physical interlocks, ADRs instead of basis documents).&lt;/p&gt;

&lt;p&gt;The three-layer model IS this resolved Su-Field system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer 1 (Parnas boundaries)     = F2 — restores comprehension
Layer 2 (Mechanical enforcement) = F3 — restores expressed intent
Layer 3 (Rationale preservation) = connects F2 and F3 — 
                                    the WHY links the boundary
                                    to the constraint so both
                                    persist together
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The act that produced both outputs was replaced. The outputs must now be produced independently. That's why cognitive debt and intent debt require separate solutions — they were coupled through a field that no longer exists.&lt;/p&gt;

&lt;h2&gt;
  
  
  The eight states: every combination of three debts
&lt;/h2&gt;

&lt;p&gt;Each debt is either HIGH or LOW. Three independent debts produce eight possible states. Each state has a different consequence and a different fix.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────┬──────────┬───────────┬────────┬────────────────────────────────────────┐
│  # │Technical │ Cognitive │ Intent │ Consequence                            │
│    │  Debt    │   Debt    │  Debt  │                                        │
├────┼──────────┼───────────┼────────┼────────────────────────────────────────┤
│  1 │  LOW     │   LOW     │  LOW   │ BEST CASE. Code is well-structured.    │
│    │          │           │        │ Team understands the system. Intent     │
│    │          │           │        │ is enforced mechanically. Safe to       │
│    │          │           │        │ change, safe to scale, safe to hand    │
│    │          │           │        │ to AI agents.                          │
├────┼──────────┼───────────┼────────┼────────────────────────────────────────┤
│  2 │  LOW     │   LOW     │  HIGH  │ FRAGILE. Code is well-structured.      │
│    │          │           │        │ Team understands it. But intent lives   │
│    │          │           │        │ in people's heads, not in constraints.  │
│    │          │           │        │ One departure or one AI agent that      │
│    │          │           │        │ doesn't know the rules → violations.   │
│    │          │           │        │ FIX: Add enforcement to existing specs. │
├────┼──────────┼───────────┼────────┼────────────────────────────────────────┤
│  3 │  LOW     │   HIGH    │  LOW   │ GOVERNED DESPITE IGNORANCE. Code is    │
│    │          │           │        │ well-structured. Nobody understands it. │
│    │          │           │        │ But constraints catch violations        │
│    │          │           │        │ mechanically. Safe to run, hard to      │
│    │          │           │        │ modify. The AI-era state: the gate      │
│    │          │           │        │ compensates for missing comprehension.  │
│    │          │           │        │ FIX: Invest in Parnas boundaries and    │
│    │          │           │        │ rationale (Layers 1 + 3).              │
├────┼──────────┼───────────┼────────┼────────────────────────────────────────┤
│  4 │  LOW     │   HIGH    │  HIGH  │ TIME BOMB. Code is well-structured     │
│    │          │           │        │ but nobody understands it and nothing   │
│    │          │           │        │ enforces the intent. Working by         │
│    │          │           │        │ accident. Any change can break it.      │
│    │          │           │        │ Incidents are undiagnosable.            │
│    │          │           │        │ FIX: Add enforcement first (Layer 2),   │
│    │          │           │        │ then boundaries (Layer 1).             │
├────┼──────────┼───────────┼────────┼────────────────────────────────────────┤
│  5 │  HIGH    │   LOW     │  LOW   │ MANAGEABLE MESS. Code has shortcuts    │
│    │          │           │        │ and duplication. But the team           │
│    │          │           │        │ understands it and constraints enforce  │
│    │          │           │        │ safety. The team can refactor safely    │
│    │          │           │        │ because they know what to change and    │
│    │          │           │        │ the gate verifies they didn't break     │
│    │          │           │        │ anything.                              │
│    │          │           │        │ FIX: Refactor the code (standard).     │
├────┼──────────┼───────────┼────────┼────────────────────────────────────────┤
│  6 │  HIGH    │   LOW     │  HIGH  │ LEGACY SYSTEM. Code has shortcuts.     │
│    │          │           │        │ Team understands it (the one person    │
│    │          │           │        │ who's been here 10 years). No          │
│    │          │           │        │ mechanical enforcement. Everything      │
│    │          │           │        │ depends on that person. Classic legacy  │
│    │          │           │        │ pattern.                               │
│    │          │           │        │ FIX: Capture the person's knowledge as │
│    │          │           │        │ constraints before they leave.         │
├────┼──────────┼───────────┼────────┼────────────────────────────────────────┤
│  7 │  HIGH    │   HIGH    │  LOW   │ SAFE BUT FROZEN. Code has shortcuts.   │
│    │          │           │        │ Nobody understands it. But constraints  │
│    │          │           │        │ enforce safety. The system runs but     │
│    │          │           │        │ can't evolve — any refactoring attempt  │
│    │          │           │        │ is blocked by incomprehension.          │
│    │          │           │        │ FIX: Invest in comprehension (Parnas    │
│    │          │           │        │ boundaries + rationale) to enable       │
│    │          │           │        │ safe refactoring.                      │
├────┼──────────┼───────────┼────────┼────────────────────────────────────────┤
│  8 │  HIGH    │   HIGH    │  HIGH  │ WORST CASE. Code has shortcuts.        │
│    │          │           │        │ Nobody understands it. Nothing          │
│    │          │           │        │ enforces the intent. The system is      │
│    │          │           │        │ unmaintainable, unsafe, and cannot be   │
│    │          │           │        │ modified without risk of outage.        │
│    │          │           │        │ This is Month 12 of the Fallacies      │
│    │          │           │        │ timeline without intervention.          │
│    │          │           │        │ FIX: Triage. Add enforcement for the    │
│    │          │           │        │ highest-risk properties first.         │
└────┴──────────┴───────────┴────────┴────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
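
&lt;p&gt;The table's numbering is a three-bit encoding, which a short sketch makes explicit. The type and function names below are hypothetical, chosen only for illustration; the labels are the ones used in the table.&lt;/p&gt;

```go
package main

import "fmt"

// Debts records whether each of the three debts is HIGH (true) or LOW (false).
type Debts struct {
	Technical bool
	Cognitive bool
	Intent    bool
}

// stateNames follows the row numbering of the table (index 0 is unused).
var stateNames = [9]string{
	"", "BEST CASE", "FRAGILE", "GOVERNED DESPITE IGNORANCE", "TIME BOMB",
	"MANAGEABLE MESS", "LEGACY SYSTEM", "SAFE BUT FROZEN", "WORST CASE",
}

// State maps a combination of debts to its row number and label.
// Technical debt is the most significant bit, intent debt the least.
func State(d Debts) (int, string) {
	n := 1
	if d.Technical {
		n += 4
	}
	if d.Cognitive {
		n += 2
	}
	if d.Intent {
		n++
	}
	return n, stateNames[n]
}

func main() {
	// Conventional-looking code that nobody has read and nothing enforces.
	n, name := State(Debts{Technical: false, Cognitive: true, Intent: true})
	fmt.Println(n, name) // prints: 4 TIME BOMB
}
```

&lt;p&gt;Flipping a single field moves a system to a neighbouring state, which is why each debt can be paid down independently of the other two.&lt;/p&gt;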



&lt;p&gt;Many teams building with AI are in State 4 (TIME BOMB) or moving toward State 8 (WORST CASE). The code is well-structured because the AI generates conventional-looking code. But nobody understands it (cognitive debt HIGH: the developer never wrote it) and nothing enforces the intent (intent debt HIGH: no constraints were declared).&lt;/p&gt;

&lt;p&gt;The fastest path from State 4 to State 3 (GOVERNED DESPITE IGNORANCE): add mechanical enforcement (Layer 2). This doesn't restore comprehension — the developer still doesn't understand the AI-generated code. But the gate catches violations regardless. The system is safe to run even when it's hard to modify. Layer 1 (Parnas boundaries) and Layer 3 (rationale) can follow — they make the system modifiable, not just safe.&lt;/p&gt;

&lt;p&gt;The fastest path from State 8 to safety: triage. Identify the three highest-risk properties. Add enforcement for those three. You're not in State 1 (best case). You're in a partial State 7 (safe but frozen for the governed properties). That's dramatically better than State 8. Expand from there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the current mitigations don't resolve the debts
&lt;/h2&gt;

&lt;p&gt;The emerging responses to cognitive debt in AI-assisted development — consolidated by Storey from practitioners including Simon Willison, Martin Fowler, Steve Yegge, and discussions across Hacker News and LinkedIn — cluster around five practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More rigorous code review&lt;/li&gt;
&lt;li&gt;Writing tests that capture intent&lt;/li&gt;
&lt;li&gt;Updating design documents continuously&lt;/li&gt;
&lt;li&gt;Treating prototypes as disposable&lt;/li&gt;
&lt;li&gt;Using AI to support cognitive tracking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each practice is reasonable. None resolves the structural problem. Each is a braking mechanism — it manages the debt by slowing down, not by changing where the debt accumulates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More rigorous review&lt;/strong&gt; is human-speed enforcement. It addresses cognitive debt (the reviewer builds understanding while reviewing) and partially addresses intent debt (the reviewer catches violations of unwritten constraints). But it doesn't scale to AI-speed generation. When the AI produces 10x more changes, the reviewer becomes the bottleneck — or cuts corners, which means the debt isn't being managed at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tests that capture intent&lt;/strong&gt; are the L2 cache (the test layer in the cache hierarchy from Fallacy #1): they catch violations for the cases someone thought to test. But they only cover what was anticipated. The property that wasn't tested is the property that gets violated. Tests address intent debt for KNOWN intents. They don't address the intents nobody wrote tests for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Updating design documents&lt;/strong&gt; is rationale preservation — but only if the documents are connected to enforcement. A design document that says "all external calls must have timeouts" doesn't prevent an AI agent from generating a call without a timeout. The document exists. The enforcement doesn't. This is intent debt: the intent is expressed (in a document) but not executable (no CI check verifies it).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treating prototypes as disposable&lt;/strong&gt; is correct but narrow. It addresses one source of cognitive debt (prototype code that was never meant to be understood). It doesn't address production code that was AI-generated and never understood by anyone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using AI to support cognitive tracking&lt;/strong&gt; introduces the failure mode from Fallacy #3: the AI that tracks cognitive state has the same probabilistic limitations as the AI that generated the code. The tracker can miss the same patterns the generator introduced.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the theory lives matters
&lt;/h2&gt;

&lt;p&gt;Storey correctly identifies that the "theory of the system" is distributed across people, documentation, tests, conversations, tooling, and agents. This is a precise observation. The next step is recognizing that some of these locations are FRAGILE and some are DURABLE:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FRAGILE locations (theory is lost when conditions change):
    People           → leave the company, change roles, forget
    Conversations    → ephemeral, unrecorded, unreproducible
    AI agents        → stateless, no memory across sessions

DURABLE locations (theory persists independently of conditions):
    Type signatures  → enforced by the compiler on every build
    API contracts    → enforced by contract tests on every change
    Database schemas → enforced by the engine on every write
    CI checks        → enforced on every merge, regardless of author
    Tests            → enforced on every run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The resolution isn't to maintain all locations equally. It's to move the critical pieces from fragile locations to durable ones. The intent that lives in a person's head (fragile) must be expressed as a CI check (durable). The rationale that lives in a conversation (fragile) must be recorded as an ADR connected to the constraint it explains (durable).&lt;/p&gt;

&lt;p&gt;This is the answer to Storey's open question: "How will teams externalize intent and sustain shared understanding?" By moving the theory from locations that erode to locations that persist, and by enforcing it mechanically, so that the theory holds regardless of who is on the team, which AI agent is generating code, or how fast the codebase is changing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three debts, the three layers, the three solutions
&lt;/h2&gt;

&lt;p&gt;Because the debts are independent and live in different places, each requires its own intervention — not a braking mechanism, but a structural change that moves the theory from fragile locations to durable ones:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To reduce cognitive debt&lt;/strong&gt; (restore comprehension):&lt;/p&gt;

&lt;p&gt;Parnas boundaries are the primary mechanism. The developer understands the system at the interface level — what each module promises, what it accepts, what it returns. The implementation behind the interface is irrelevant to comprehension. The developer doesn't need to understand 10,000 lines of AI-generated code. They need to understand 50 interface contracts.&lt;/p&gt;
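
&lt;p&gt;A small sketch of what "understanding at the interface level" can look like in Go. The &lt;code&gt;Ledger&lt;/code&gt; interface and its in-memory implementation are hypothetical examples, not taken from any real system. The point is that the doc comments on the interface carry the promises, while the implementation stays replaceable.&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInsufficientFunds is part of the contract: callers may test for it.
var ErrInsufficientFunds = errors.New("insufficient funds")

// Ledger is the module boundary. Callers read this contract,
// not the implementation behind it.
type Ledger interface {
	// Balance returns the balance in cents. Unknown accounts have balance 0.
	Balance(account string) uint64
	// Debit removes cents from the account, or returns ErrInsufficientFunds
	// and leaves the balance unchanged.
	Debit(account string, cents uint64) error
}

// memLedger is one possible implementation. It could be generated and later
// replaced without callers noticing, as long as the contract holds.
type memLedger struct{ balances map[string]uint64 }

func (l memLedger) Balance(account string) uint64 { return l.balances[account] }

func (l memLedger) Debit(account string, cents uint64) error {
	if cents > l.balances[account] {
		return ErrInsufficientFunds
	}
	l.balances[account] -= cents
	return nil
}

func main() {
	var l Ledger = memLedger{balances: map[string]uint64{"alice": 500}}
	fmt.Println(l.Debit("alice", 700), l.Balance("alice"))
}
```

&lt;p&gt;A reviewer who knows the two method contracts can reason about every caller of &lt;code&gt;Ledger&lt;/code&gt; without reading &lt;code&gt;memLedger&lt;/code&gt; at all.&lt;/p&gt;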

&lt;p&gt;Supporting mechanisms: Architecture Decision Records (ADRs) explain WHY the boundaries are shaped the way they are. Worked examples show HOW the modules compose. Both help future developers build the mental model the original developer had.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To reduce intent debt&lt;/strong&gt; (express intent as executable constraints):&lt;/p&gt;

&lt;p&gt;Mechanical enforcement is the primary mechanism. The intent is expressed as a typed predicate, a contract test, a schema constraint, a linter rule, a CI check. The system can evaluate it on every change, at machine speed, deterministically.&lt;/p&gt;

&lt;p&gt;The insight from Fallacy #7: the constraints usually ALREADY EXIST. Type signatures, API contracts, database schemas, module boundaries — these are expressed intent. The debt isn't in the absence of constraints. It's in the absence of ENFORCEMENT of constraints that already exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To reduce both simultaneously:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The module boundary addresses both — as a comprehension mechanism (the developer only needs to understand the interface) and as an intent mechanism (the interface is the enforceable contract). Adding a third layer — rationale preservation — prevents both debts from recurring by recording WHY the boundary exists and connecting the rationale to the constraint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer 1 (Parnas boundaries): Reduces cognitive debt
    Module interfaces make the system comprehensible
    at the interface level.

Layer 2 (Mechanical enforcement): Reduces intent debt
    Constraints make the intent executable by the system.
    Catches violations regardless of developer comprehension.

Layer 3 (Rationale preservation): Reduces both
    Records WHY (addresses cognitive debt for future developers)
    AND connects the why to the constraint (ensures the intent
    survives personnel changes and AI-driven modifications).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
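
&lt;p&gt;One way to picture Layers 2 and 3 working together is a constraint that carries its own rationale. The sketch below is an assumption-heavy illustration: &lt;code&gt;Constraint&lt;/code&gt;, &lt;code&gt;Violations&lt;/code&gt;, &lt;code&gt;ServiceConfig&lt;/code&gt;, and the ADR identifiers are all made up for the example.&lt;/p&gt;

```go
package main

import "fmt"

// Constraint pairs an executable predicate (Layer 2) with the reason
// it exists (Layer 3).
type Constraint[T any] struct {
	Name      string
	Rationale string       // for example, a reference to an ADR
	Holds     func(T) bool // the typed predicate
}

// Violations evaluates every constraint and reports name plus rationale,
// so a failing gate also explains why the rule exists.
func Violations[T any](v T, cs []Constraint[T]) []string {
	var out []string
	for _, c := range cs {
		if !c.Holds(v) {
			out = append(out, c.Name+": "+c.Rationale)
		}
	}
	return out
}

// ServiceConfig is a hypothetical value under check.
type ServiceConfig struct {
	TimeoutSeconds int
	PublicBucket   bool
}

// rules uses invented ADR numbers purely for illustration.
var rules = []Constraint[ServiceConfig]{
	{
		Name:      "timeout-set",
		Rationale: "ADR-0007 (hypothetical): external calls must have timeouts",
		Holds:     func(c ServiceConfig) bool { return c.TimeoutSeconds > 0 },
	},
	{
		Name:      "no-public-bucket",
		Rationale: "ADR-0012 (hypothetical): the bucket holds customer data",
		Holds:     func(c ServiceConfig) bool { return !c.PublicBucket },
	},
}

func main() {
	for _, v := range Violations(ServiceConfig{TimeoutSeconds: 0}, rules) {
		fmt.Println(v)
	}
}
```

&lt;p&gt;Because the rationale travels with the predicate, deleting the rule means deleting its justification in the same diff, which is much harder to do by accident.&lt;/p&gt;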



&lt;h2&gt;
  
  
  Why "specification debt" is not precise
&lt;/h2&gt;

&lt;p&gt;Some discussions use "specification debt" as a catch-all. This term is imprecise because it conflates two different absences:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Absence of a specification&lt;/strong&gt; (the interface or constraint doesn't exist): This is a Parnas boundary problem — the module was never decomposed with a well-defined interface. The solution is to create the boundary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Absence of enforcement&lt;/strong&gt; (the specification exists but isn't checked mechanically): This is an enforcement problem — the API contract exists but no contract test verifies it. The type signature exists but the language allows unsafe casts. The database schema has constraints but the application bypasses them. The solution isn't a new specification. It's a CI check on the existing one.&lt;/p&gt;

&lt;p&gt;Calling both "specification debt" loses the distinction between "we don't have the spec" and "we have the spec but don't enforce it." The second is cheaper and faster to fix — because the artifact already exists. Conflating them makes teams think they need to write new specifications when they actually need to enforce existing ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  The terminology, settled
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Term              Location           What it is                    Solution
──────────        ──────────         ──────────────────            ──────────────────
Technical debt    Codebase           Poor structure, shortcuts     Refactor the code

Cognitive debt    Developer's mind   Loss of comprehension         Parnas boundaries
                                                                   (understand interfaces,
                                                                   not implementations)

Intent debt       System's           Gap between what SHOULD       Mechanical enforcement
                  enforcement        be preserved and what IS      of existing constraints
                  layer              expressed as an executable    (CI checks, contract
                                     constraint                    tests, linter rules)

Specification     Ambiguous —        Either missing spec           Split into:
debt              avoid              OR missing enforcement        → missing boundary
                                                                     (create it)
                                                                   → missing enforcement
                                                                     (enforce it)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Technical debt lives in the code. Cognitive debt lives in people. Intent debt lives in the absence of enforcement. Each has a different location, a different cause, and a different fix. Conflating them leads to refactoring when you should be enforcing, enforcing when you should be teaching, or teaching when you should be building boundaries.&lt;/p&gt;

&lt;p&gt;The terms are not synonyms. They are independent axes. Getting them right changes the diagnosis. The diagnosis changes the architecture. The architecture changes the outcome.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This terminology clarification is relevant to &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/the-fallacies-of-genai-development-1m54"&gt;The Fallacies of GenAI Development&lt;/a&gt;, where several fallacies involve cognitive debt (Fallacies #1, #2, #4) and others involve intent debt (Fallacies #5, #7, #8). Distinguishing them clarifies which fallacy produces which debt and which layer of the resolution addresses it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The three-layer model (Parnas boundaries + mechanical enforcement + rationale preservation) is the architectural pattern that addresses both debts independently through a shared mechanism: the module boundary strengthened with enforcement and rationale. Parnas (1972) is the root. The three layers are the completion.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwaredevelopment</category>
      <category>architecture</category>
      <category>engineering</category>
    </item>
    <item>
      <title>Fallacies of GenAI Development #2: If the Output Looks Correct, It Is Correct</title>
      <dc:creator>Bala Paranj</dc:creator>
      <pubDate>Fri, 29 May 2026 10:25:03 +0000</pubDate>
      <link>https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-2-if-the-output-looks-correct-it-is-correct-5gbf</link>
      <guid>https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-2-if-the-output-looks-correct-it-is-correct-5gbf</guid>
      <description>&lt;p&gt;&lt;em&gt;This is the second in a series of eight posts on the false assumptions teams make when building with generative AI. &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-1-faster-code-generation-means-faster-engineering-2jbl"&gt;Fallacy #1&lt;/a&gt; covered why faster code generation doesn't mean faster engineering. This post covers why code that looks right isn't necessarily right — and what right requires.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fallacy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"The AI-generated code compiles, passes tests, and reads well. Therefore it's correct."&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it's tempting
&lt;/h2&gt;

&lt;p&gt;You prompt an AI agent. It generates a function. The function compiles. The existing tests pass. You read through it — the variable names make sense, the logic follows a recognizable pattern, the error handling looks reasonable. A colleague glances at it during code review. Looks good to me. You merge.&lt;/p&gt;

&lt;p&gt;The code LOOKS like it was written by someone who understood the problem. The structure is good. The naming is conventional. The patterns are familiar. Everything about it signals competence.&lt;/p&gt;

&lt;p&gt;This is more convincing than obviously bad code. Obviously bad code gets caught. Code that looks good gets waved through — by reviewers, by CI pipelines, by the developer's own judgment. The danger isn't code that fails visibly. It's code that fails invisibly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it's wrong
&lt;/h2&gt;

&lt;p&gt;AI-generated code is optimized for plausibility, not correctness. The model produces output that matches patterns from its training data — patterns of code that LOOKED correct in the millions of repositories it learned from. The output inherits the surface appearance of correctness without inheriting the underlying reasoning that made the original code correct. &lt;/p&gt;

&lt;p&gt;Bender, Gebru, et al. (2021) made this argument in "On the Dangers of Stochastic Parrots": LLMs are probabilistic models that stitch together sequences based on statistical likelihood, without reference to meaning or truth. The AI produces plausible code because it's predicting the most likely &lt;em&gt;shape&lt;/em&gt; of a correct answer, not reasoning about correctness.&lt;/p&gt;

&lt;p&gt;A 2023 Stanford study made this concrete: Perry et al. found that developers using AI assistants wrote significantly more insecure code than those without one — and were &lt;em&gt;more likely to believe&lt;/em&gt; their code was secure. The AI made the code look so professional that the developer's critical thinking was bypassed.&lt;/p&gt;

&lt;p&gt;Three specific failure modes:&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure mode 1: Correct for the common case, wrong for the edge case
&lt;/h3&gt;

&lt;p&gt;The AI handles the happy path flawlessly. It handles the most common error cases. It misses the edge case that matters — the one that constitutes your security boundary, your financial calculation precision, or your data integrity guarantee.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// AI-generated: looks correct&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;calculateInterest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;principal&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;days&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;principal&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="m"&gt;365.0&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compiles. Passes the test for a standard 30-day period. Reads well. But: leap years? Negative principal? Rate above 1.0? Days exceeding the term? Overflow on large principals? Each edge case is a potential financial error that won't surface in normal testing.&lt;/p&gt;

&lt;p&gt;The AI didn't think about these cases. It pattern-matched a common interest calculation. A developer who WROTE this function by hand would have run into the edge cases during development: struggling with the leap year question, asking about negative values, checking the spec. Correctness comes from that struggle. The AI skipped it. OpenAI's own Codex evaluation (Chen et al., 2021) documented the limits: Codex solved 28.8% of the HumanEval problems with a single sample and 70.2% only when given 100 samples per problem, and its pass rate fell sharply as a task chained more operations together. A model trained on a large body of public code defaults to the most common implementation, which is often not the most robust one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure mode 2: Locally correct, globally wrong
&lt;/h3&gt;

&lt;p&gt;Each function is correct in isolation. The composition is wrong. Function A correctly parses input. Function B correctly transforms data. Function C correctly writes output. But A's output format doesn't match B's expected input. Or B's transformation assumes an ordering that A doesn't guarantee. Or C writes to a resource that A already locked.&lt;/p&gt;

&lt;p&gt;This is the composition problem from &lt;a href="https://dev.to/bala_paranj_059d338e44e7e/fallacies-of-genai-development-1-faster-code-generation-means-faster-engineering-2jbl"&gt;Fallacy #1&lt;/a&gt; — but at the code level, not the system level. AI generates each piece by pattern-matching against similar pieces in training data. The pieces look correct individually. Nobody checks whether they compose correctly, because each piece passed its own review.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure mode 3: Semantically different from what was intended
&lt;/h3&gt;

&lt;p&gt;The code does something. It does it correctly. It's not what you wanted.&lt;/p&gt;

&lt;p&gt;You asked for a function that validates user input. The AI generates a function that checks string length and character types. You meant a function that checks the input against the business rules for your specific domain: valid account numbers, permitted transaction types, jurisdictional constraints. The AI generated a PLAUSIBLE interpretation of "validates." Not YOUR interpretation.&lt;/p&gt;

&lt;p&gt;The code compiles. The tests pass (because the tests also validate string length and characters). The review looks fine (the function does what it says). But the intended validation — the business rules — was never implemented. The AI solved the how perfectly. It guessed the why. And without a specification that anchors the why, no amount of code review will catch the gap — because the code does something reasonable. It's just not the right something.&lt;/p&gt;

&lt;p&gt;Peter Naur explained why in 1985: software isn't just the code — it's the mental theory the programmer possesses about how the software handles the problem. The AI can generate the artifact, but it doesn't possess the theory. Without the theory, the code is an artifact that might look like the solution but lacks the internal logic of the intended design. The developer who writes the validation function builds a theory of the domain's rules during development. The AI skips that theory-building entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  What correct requires
&lt;/h2&gt;

&lt;p&gt;Byron Cook built Amazon's automated reasoning organization — 300+ scientists, 15+ teams, formal verification embedded across AWS. The insight he discovered over 11 years: &lt;strong&gt;executives don't want bug reports. They want proofs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Newcombe et al. (2015) documented AWS's early experience in "How Amazon Web Services Uses Formal Methods": AWS engineers used TLA+ and model checking to find subtle design bugs that testing and code review had missed. At industrial scale, "tests pass" is an insufficient definition of "correct."&lt;/p&gt;

&lt;p&gt;The distinction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Looks correct" (what teams do now):
    Code compiles           ← syntax and type check
    Tests pass              ← checks tested cases
    Review approves         ← human judgment on appearance

    Gap: what about the cases nobody tested?
    Gap: what about the compositions nobody checked?
    Gap: what about the properties nobody thought to verify?

"IS correct" (what proof provides):
    Property declared       ← "no unauthorized access path exists"
    Property verified       ← checked against ALL possible inputs
    Evidence produced       ← the specific evaluation trace

    No gap: the property either holds for every case or it doesn't.
    The verification is exhaustive, not sampled.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Looks correct" is an opinion. "IS correct" is evidence. The difference between them is the difference between "I looked and didn't see anything wrong" and "I proved nothing wrong exists."&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost of plausible
&lt;/h2&gt;

&lt;p&gt;The cost isn't immediate. Plausible code ships. It works for weeks, months, sometimes years. The cost arrives when:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An incident occurs that nobody can diagnose.&lt;/strong&gt; The code that looked correct has a subtle bug in an edge case. The developer who merged it has no mental model of how it works — they read it, it looked fine, they approved. Debugging AI-generated code you don't understand takes longer than writing the code would have taken, because you're building the mental model during the crisis instead of during development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An auditor asks for evidence.&lt;/strong&gt; "How do you know this function correctly handles PII?" The answer: "It passed code review and the tests pass." The auditor: "Show me the tests." You show them. The tests check the happy path. The auditor: "What about edge cases X, Y, Z?" Silence. The test suite verified what someone thought to test. Nobody thought to test the thing the auditor is asking about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A security researcher finds a path nobody anticipated.&lt;/strong&gt; The AI-generated IAM policy is correct for the intended use case. But the policy's conditions, evaluated together, are mathematically equivalent to &lt;code&gt;Principal: *&lt;/code&gt; — allowing public access through a logical path nobody wrote because the AI pattern-matched the condition blocks from training data without understanding their composition. The policy LOOKS restrictive. The math says it isn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The resolution: properties, not appearances
&lt;/h2&gt;

&lt;p&gt;The cache hierarchy from Fallacy #1 has layers. The first two already exist in most codebases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L1 cache (types):
    The compiler catches type violations instantly.
    AI generates code with wrong types → caught before any human sees it.
    Fast. Deterministic. Already deployed.

L2 cache (tests):
    CI catches behavioral violations before merge.
    AI generates code that breaks existing tests → caught before merge.
    Fast. Deterministic for tested cases. Already deployed.

L3 cache (specification gate — what's missing):
    A mechanical check that verifies properties nobody wrote tests for.
    Security invariants. Architectural boundaries. Cross-service contracts.
    Composition correctness across module boundaries.

    Existing L3-adjacent tools you can adopt today:
    → Property-based testing (QuickCheck, Hypothesis) — tests properties
      across randomly generated inputs, not hand-picked examples
    → Static analysis (Semgrep, SonarQube) — checks structural patterns
      across the codebase without running the code  
    → Contract testing (Pact, Dredd) — verifies API implementations
      match consumer-defined contracts (Pact) or OpenAPI specs (Dredd)
    → Formal verification (Z3, AWS Zelkova) — proves properties
      mathematically across ALL possible inputs

    Each is a step closer to L3. Property-based testing is the
    easiest first step — it moves you from "tested 5 examples"
    to "tested 10,000 random inputs against one property."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;L1 and L2 verify what someone THOUGHT to check — type contracts and test cases. L3 verifies what must ALWAYS be true — properties that hold regardless of implementation.&lt;/p&gt;

&lt;p&gt;The L3 check for the interest calculation: "the result must never be negative for positive principal and positive rate." One property. Verified on every change. Catches the edge case the test missed — not because someone anticipated the specific failure, but because the property is universal.&lt;/p&gt;
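
&lt;p&gt;A minimal sketch of that property check, using Go's standard &lt;code&gt;testing/quick&lt;/code&gt; package as a stand-in for QuickCheck or Hypothesis. &lt;code&gt;calculateInterest&lt;/code&gt; is copied unchanged from the earlier example; &lt;code&gt;nonNegative&lt;/code&gt; and &lt;code&gt;CheckProperty&lt;/code&gt; are hypothetical helper names. The generator is free to produce negative &lt;code&gt;days&lt;/code&gt; values, which is exactly the kind of input a hand-written 30-day test never tries.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math/rand"
	"testing/quick"
)

// calculateInterest is the function from the earlier example, unchanged.
func calculateInterest(principal float64, rate float64, days int) float64 {
	return principal * rate * float64(days) / 365.0
}

// nonNegative is the property: for positive principal and positive rate,
// the result is never negative. Other inputs pass trivially.
func nonNegative(principal, rate float64, days int) bool {
	if !(principal > 0) || !(rate > 0) {
		return true
	}
	return calculateInterest(principal, rate, days) >= 0
}

// CheckProperty runs the property against generated inputs and returns
// a non-nil error describing the first counterexample, if one is found.
func CheckProperty() error {
	cfg := new(quick.Config)
	cfg.MaxCount = 1000
	cfg.Rand = rand.New(rand.NewSource(1)) // fixed seed so failures reproduce
	return quick.Check(nonNegative, cfg)
}

func main() {
	if err := CheckProperty(); err != nil {
		fmt.Println("property violated:", err)
		return
	}
	fmt.Println("property held for all generated inputs")
}
```

&lt;p&gt;The check fails on the original function, and the error includes the offending inputs. That is sampling rather than proof, but it searches a far larger input space than a handful of chosen examples.&lt;/p&gt;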

&lt;p&gt;The L3 check for the IAM policy: "no principal outside the organization can access any resource tagged as sensitive." Not a test case for one specific policy. A property verified across every policy in the snapshot. Catches the mathematical-equivalent-to-star policy — not by pattern-matching the text, but by evaluating the logic.&lt;/p&gt;

&lt;p&gt;The L3 check for the composition: "Function A's output type must be a subset of Function B's input type," so that B accepts everything A can produce. Not verified by testing A and B separately. Verified by checking the interface contract between them, mechanically, on every change that touches either function.&lt;/p&gt;
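
&lt;p&gt;In a typed language, part of that interface contract can be pushed into the compiler. The sketch below takes the earlier "B assumes an ordering that A doesn't guarantee" case and encodes the ordering in a type. &lt;code&gt;Sorted&lt;/code&gt;, &lt;code&gt;Sort&lt;/code&gt;, and &lt;code&gt;Middle&lt;/code&gt; are hypothetical names; in a real module &lt;code&gt;Sorted&lt;/code&gt; would live in its own package so that its unexported field blocks direct construction.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sort"
)

// Sorted wraps a slice that is known to be in ascending order.
// Outside its own package it can only be obtained from Sort.
type Sorted struct{ values []int }

// Sort plays the role of Function A: its return type carries the guarantee.
func Sort(in []int) Sorted {
	out := append([]int(nil), in...) // copy so the caller's slice is untouched
	sort.Ints(out)
	return Sorted{values: out}
}

// Middle plays the role of Function B: its ordering assumption is part of
// the signature, so passing a plain []int does not compile.
func Middle(s Sorted) (int, bool) {
	if len(s.values) == 0 {
		return 0, false
	}
	return s.values[len(s.values)/2], true
}

func main() {
	fmt.Println(Middle(Sort([]int{9, 1, 5}))) // prints: 5 true
}
```

&lt;p&gt;The compiler now rejects the mismatched composition on every build, regardless of who or what wrote either function.&lt;/p&gt;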

&lt;h2&gt;
  
  
  The difference between testing and proving
&lt;/h2&gt;

&lt;p&gt;Testing checks specific inputs. If you test 1,000 inputs and they all pass, you know 1,000 inputs work. You don't know about input 1,001.&lt;/p&gt;

&lt;p&gt;Proving checks ALL inputs. If a property is proved, it holds for every possible input — including the ones nobody thought to test. The verification is mathematical, not sampled.&lt;/p&gt;

&lt;p&gt;Dijkstra stated this in 1970: "Program testing can be used to show the presence of bugs, but never to show their absence." This is the fundamental limit that separates testing from verification. Moving from example-based testing to property-based verification is not an improvement in degree — it's a change in kind.&lt;/p&gt;

&lt;p&gt;AWS learned this distinction at scale. Cook: "You can't go to a customer and say 'Good news, we found 10,000 more bugs.' They say 'Why am I using AWS if you have bugs?' But you CAN say 'We proved this property holds under these assumptions.' That's why they moved their data to the cloud."&lt;/p&gt;

&lt;p&gt;The same distinction applies to AI-generated code. "This code passed 47 tests" is useful but incomplete. "This code satisfies these 12 properties across ALL possible inputs" is evidence. The first is testing. The second is proving. The invisible bugs live in the gaps between them.&lt;/p&gt;

&lt;p&gt;You don't need to prove EVERYTHING — that's the formal methods mistake of the 1980s. You need to prove the PROPERTIES THAT MATTER — security invariants, financial correctness guarantees, data integrity constraints, architectural boundaries. The small set of things that must ALWAYS be true, regardless of how the AI implemented them.&lt;/p&gt;

&lt;p&gt;And you don't have to jump straight to mathematical proofs. There's a practical gradient:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example-based tests:    "This input produces this output"           (5 cases checked)
Property-based tests:   "For ALL inputs, this property holds"       (10,000 random cases)
Contract tests:         "This API matches its specification"         (every endpoint, every field)
Formal verification:    "This property holds for EVERY possible case" (mathematical proof)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each step gives stronger assurance than the one before it. Property-based testing (QuickCheck, Hypothesis) is the most accessible first step: it can often be adopted in an afternoon, and it catches edge cases that example-based tests miss. It samples inputs rather than proving anything, but you don't need Z3 to start. You need one property and one tool that checks it across more inputs than you'd write by hand.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you can do this week
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Identify one property that must always hold in your system.&lt;/strong&gt; Not a test case. A property. "No API endpoint returns PII without authentication." "No database query returns results from a tenant other than the requesting tenant." "No financial calculation produces a negative balance for a credit transaction." One property. Write it down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Ask: would your current tests catch a violation?&lt;/strong&gt; If the AI generated code that subtly violated this property — not obviously, but through an edge case or a composition error — would your test suite catch it? If the answer is "probably" or "I think so," you don't have verification. You have hope.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Add one mechanical check for that property.&lt;/strong&gt; A CI check. A contract test. A schema validation. Something that verifies the property on every change, deterministically, regardless of how the code was generated. The property is the specification. The check is the enforcement. The combination is what "correct" means.&lt;/p&gt;
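
&lt;p&gt;As a rough sketch of step 3, here is the tenant-isolation property written as a script a CI job could run. The schema, &lt;code&gt;fetch_orders&lt;/code&gt;, and the tenant names are assumptions for illustration. In a real system the check would call your own data-access layer against a fixture database.&lt;/p&gt;

```python
import sqlite3


def fetch_orders(conn, tenant_id):
    """Hypothetical data-access function under test."""
    return conn.execute(
        "SELECT id, tenant_id FROM orders WHERE tenant_id = ?", (tenant_id,)
    ).fetchall()


def tenant_leaks(conn, tenant_ids):
    """Rows returned to a tenant that belong to a different tenant."""
    return [
        (tenant, row)
        for tenant in tenant_ids
        for row in fetch_orders(conn, tenant)
        if row[1] != tenant
    ]


# In-memory fixture with two tenants.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, tenant_id TEXT)")
conn.executemany(
    "INSERT INTO orders (tenant_id) VALUES (?)",
    [("acme",), ("acme",), ("globex",)],
)

leaks = tenant_leaks(conn, ["acme", "globex"])
if leaks:
    # A non-zero exit fails the CI job on any violation.
    raise SystemExit(f"tenant isolation violated: {leaks}")
```

&lt;p&gt;The check does not care how &lt;code&gt;fetch_orders&lt;/code&gt; was written or who wrote it. It only cares whether the property holds for every tenant in the fixture.&lt;/p&gt;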

&lt;p&gt;"Looks correct" is how we got here. "IS correct" is how we get out. The difference is one property, mechanically verified, on every change.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next in the series: &lt;strong&gt;Fallacy #3: "You Can Verify AI Output With Another AI."&lt;/strong&gt; Why wrapping a non-deterministic system with another non-deterministic layer doesn't converge on reliability, and what deterministic verification looks like in practice.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Fallacies of GenAI Development: eight assumptions every team is making. Each one leads to an architectural failure. Each one has already been solved.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bender, E.M., Gebru, T., et al. (2021). "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" &lt;em&gt;FAccT '21&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Chen, M., et al. (2021). "Evaluating Large Language Models Trained on Code." &lt;em&gt;arXiv:2107.03374&lt;/em&gt; (OpenAI Codex).&lt;/li&gt;
&lt;li&gt;Dijkstra, E.W. (1970). "Notes on Structured Programming." &lt;em&gt;EWD249&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Naur, P. (1985). "Programming as Theory Building." &lt;em&gt;Microprocessing and Microprogramming&lt;/em&gt;, 15(5).&lt;/li&gt;
&lt;li&gt;Newcombe, C., Rath, T., et al. (2015). "How Amazon Web Services Uses Formal Methods." &lt;em&gt;Communications of the ACM&lt;/em&gt;, 58(4).&lt;/li&gt;
&lt;li&gt;Perry, N., et al. (2023). "Do Users Write More Insecure Code with AI Assistants?" &lt;em&gt;ACM CCS 2023&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>softwaredevelopment</category>
      <category>architecture</category>
      <category>engineering</category>
    </item>
  </channel>
</rss>
