Why AI Doesn't Code What You Designed: The Structural Gap Between Specs and Implementation

You write a detailed design doc. You paste it into your AI assistant. You wait.

The output compiles. Tests pass. And yet — it's not quite what you designed. The auth middleware is in the wrong layer. The error handling pattern differs from the rest of the codebase. The field names don't match the schema.

You fix it. Next task, same thing.

This happens constantly, and it's not a model capability problem. It's a structural problem in how we communicate design intent to AI. And "write better prompts" is not the solution — it's a band-aid.


Why "Better Prompts" Doesn't Scale

The instinct when AI misses the mark: add more detail to the prompt. Describe the pattern more explicitly. Give an example.

This works — sometimes, for that one task.

The problem: design intent isn't a single instruction. It's institutional knowledge accumulated over months. It's the reason two experienced engineers on your team write recognizably similar code without coordinating on every PR. It's the unspoken answer to "what does good look like here?"

You can't fit that into a prompt. And even if you could, you'd need to paste it every single time.

Think about onboarding a new engineer. You don't brief them on your entire codebase in the first conversation. You give them documentation, you pair with them, you let them absorb context over weeks. The conventions transfer gradually because there's a persistent learning mechanism.

With AI, there's no persistent learning mechanism by default. Every conversation starts from scratch. Your prompt is one-time context. Your design intent is institutional knowledge that needs to live somewhere the AI can always access.


Three Structural Gaps

Gap 1: Implicit Conventions Stay Implicit

Every team has them. Rules that are enforced in code review but never written down because "everyone knows that."

  • "We use ActionResult<T> for Server Actions"
  • "Zod schemas live in lib/schemas/, not colocated with components"
  • "Don't use useEffect for data fetching — we have Server Components for that"
  • "Always call auth() before any database access in an API route"

These live in your engineers' heads. When a new hire submits a PR that violates them, a reviewer catches it and leaves a comment. The new hire learns. The knowledge transfers.

AI doesn't get those review comments. It generates code, the review happens, but the AI's context doesn't update. Next task, same violation.

The cost: every AI output needs a manual convention-correction pass. The faster AI can generate code, the more convention-correction work piles up.

Gap 2: Specs Describe WHAT, Not HOW or WHY

Your design doc says "implement user authentication with session management." Good.

What it doesn't say:

  • Auth middleware belongs in the route handler, not a global Next.js middleware
  • Sessions are stored in the database, not in JWTs (because you had a security incident)
  • Use NextAuth's auth() helper, not direct session inspection
  • All session-dependent routes return 401, not redirect to login

These "how" and "why" constraints are the engineering culture. They're the decisions your team made, the tradeoffs you've accepted, the patterns you've settled on. A spec that describes the what gives AI enough rope to generate something plausible that violates all of the above.

An AI-effective spec would include the constraints that shape implementation, not just the behavior.

Gap 3: No Feedback Loop for Rule Violations

Human code review creates a feedback loop. Author submits code → reviewer catches violations → author adjusts → over time, author internalizes the rules.

AI-generated code breaks this loop in two ways.

First, the feedback doesn't persist. You can tell Claude "in this project, Server Actions always return ActionResult<T>" in one conversation. Next conversation, you tell it again.

Second, the volume overwhelms the feedback loop. One engineer might submit 3 PRs a week. With AI assistance, they might submit 20. Reviewers who were previously keeping up now face a backlog. Review quality degrades under volume. Violations start slipping through.

At scale, violations accumulate faster than review can catch them. Plausible-but-wrong patterns spread across the codebase. Six months later, you have a refactoring problem that looks like technical debt but is actually accumulated convention drift.


The Compounding Effect

A single convention violation in one file: minor, easy to fix.

10 AI-generated files with the same violation: a pattern.

50 AI-generated files with diverse convention violations: an architectural consistency problem. Different error handling here, different auth patterns there, schemas mixed into component files, raw Prisma objects returned from APIs.

This is the state many teams are in now — not because AI is bad, but because they're using AI without the infrastructure to give it their institutional knowledge.

In retrospectives I've reviewed and conversations with teams adopting AI tools, teams without explicit guidelines consistently report spending significantly more time — often 2–4× — on "cleanup" and "consistency work" than teams with structured rule systems. A common pattern: the team celebrates 3× faster feature generation, then quietly absorbs the same hours back in review cycles, convention-fixing passes, and cross-file inconsistency cleanup. The productivity gains from AI generation are real; they're just being partially eaten by convention-drift cleanup.


What "Design Intent" Actually Means

When I say "design intent," I mean more than the features described in a spec. I mean:

Naming conventions and why they exist. Not just "use camelCase" but "function names that fetch data start with get, mutations start with update or create, and no function is named handle because it tells you nothing."

Which patterns are preferred and which are deprecated. "We use Server Components for data fetching. useEffect for data fetching was how we did it two years ago and there's still some in the codebase — don't follow it."
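
A contrast sketch of that rule, with a hypothetical component and endpoint:

```tsx
// Current convention: fetch in a Server Component, no client-side effect.
export default async function UserList() {
  const res = await fetch("https://api.example.com/users"); // assumed endpoint
  const users: { id: string; name: string }[] = await res.json();
  return (
    <ul>
      {users.map((u) => (
        <li key={u.id}>{u.name}</li>
      ))}
    </ul>
  );
}

// The deprecated pattern an AI would happily learn from older files:
// a "use client" component running the same fetch in useEffect + useState.
// It still exists in the codebase, which is exactly why the guideline has
// to say "don't follow it" out loud.
```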

Security invariants. "Every external-facing API route must validate the session. This is not a preference, it's a requirement. There are no exceptions."

The tradeoffs you've accepted. "We verbose-type everything even when TypeScript could infer it. We know this is more code. We've decided explicit types are worth it for this team."

These don't belong in a design doc — they're too foundational, too stable. They belong in a persistent context that every AI interaction can access.


Rethinking What a Spec Is For

A traditional specification document is written for humans. It describes desired behavior, UI flows, data models, and edge cases. The reader is assumed to already know your team's conventions.

An AI-effective specification is different. It still describes behavior. But it also includes:

  • The constraints that shape implementation ("use NextAuth, not custom JWT")
  • The non-options ("do not introduce a new state management library")
  • The patterns that must be followed ("Server Actions must follow the ActionResult<T> pattern")
  • The why behind significant decisions ("we use Prisma's typed client because we've been burned by SQL injection in the past")

Consider the difference between these two spec sections:

Traditional:

Implement an API endpoint that allows authenticated users to update their profile information.

AI-effective:

Implement an API route at app/api/profile/route.ts that allows authenticated users to update their profile.

Implementation constraints:

  • Use auth() from @/lib/auth to verify the session
  • Accept only name and image fields — no other fields should be updatable via this endpoint
  • Validate input with a Zod schema before any database operation
  • Return { data: user } on success with only safe fields selected (no passwordHash, emailVerified, or internal IDs)
  • Return { error: "..." } with appropriate HTTP status on failure

The second version leaves less room for AI to infer its own patterns. It specifies the how alongside the what.
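
For illustration, here's roughly what the AI-effective version pins down. This is a sketch, assuming a Prisma client exported from @/lib/db and a session object that carries a user id:

```typescript
// app/api/profile/route.ts — a sketch following the constraints above.
import { NextResponse } from "next/server";
import { z } from "zod";
import { auth } from "@/lib/auth";
import { db } from "@/lib/db"; // assumed Prisma client export

// Only name and image are updatable; .strict() rejects any other field.
const ProfileUpdateSchema = z
  .object({
    name: z.string().min(1).max(100),
    image: z.string().url().optional(),
  })
  .strict();

export async function PATCH(req: Request) {
  const session = await auth();
  if (!session?.user?.id) {
    return NextResponse.json({ error: "Unauthorized" }, { status: 401 });
  }

  // Validate before any database operation.
  const parsed = ProfileUpdateSchema.safeParse(await req.json());
  if (!parsed.success) {
    return NextResponse.json({ error: "Invalid input" }, { status: 400 });
  }

  // select limits the response to safe fields: no passwordHash,
  // emailVerified, or internal IDs leave the route.
  const user = await db.user.update({
    where: { id: session.user.id },
    data: parsed.data,
    select: { name: true, image: true },
  });

  return NextResponse.json({ data: user });
}
```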


The Solution Direction: Externalizing Tacit Knowledge

The fix isn't writing better prompts. It's building infrastructure that makes your team's tacit knowledge persistent and accessible.

The practical shape of this:

1. A persistent context file (CLAUDE.md or equivalent)

Keep it to 3 directive lines — a project header, "follow these guidelines," and "upper sections take priority." The rest is a list of file references pointing to your actual rule files. This loads on every interaction, but because it's just an index, it doesn't dilute Claude's attention on your task.
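
A sketch of what that index-style file might look like (the exact paths are assumptions, mirroring the layout described below):

```markdown
# my-app (Next.js / TypeScript)

Follow the development guidelines referenced below.
When rules conflict, upper sections take priority.

- docs/ai-dev-os/03_guidelines/common/naming.md
- docs/ai-dev-os/03_guidelines/common/security.md
- docs/ai-dev-os/03_guidelines/common/error-handling.md
- docs/ai-dev-os/03_guidelines/typescript/server-actions.md
```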

2. Guideline files for L3 specifics

Naming conventions, security invariants, error handling patterns — these live in dedicated files (docs/ai-dev-os/03_guidelines/common/security.md, etc.) and are loaded contextually when relevant. A rule submodule like ai-dev-os-rules-typescript provides these for TypeScript/Next.js projects out of the box.
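
One of those guideline files might look like this; the rules echo examples from earlier in this post, and the wording is a sketch rather than the submodule's actual content:

```markdown
<!-- docs/ai-dev-os/03_guidelines/common/security.md (sketch) -->
# Security invariants

- Every external-facing API route MUST verify the session via `auth()`
  from `@/lib/auth` before any database access. No exceptions.
- Session-dependent routes return 401; they never redirect to login.
- Sessions are stored in the database, not in JWTs (post-incident decision).
- Never return raw Prisma objects from an API; always `select` safe fields.
```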

3. AI-effective specs for larger tasks

For non-trivial features, include implementation constraints in the spec alongside behavioral requirements — which auth helper to use, which patterns to follow, which approaches to avoid. Write the spec like you're briefing a developer who knows nothing about your codebase's conventions.

This is a systems-design problem, not a prompting problem. You design the system once. The AI operates within it on every task.

The next article in this series walks through building this infrastructure for a Next.js project — step by step, with before/after code.


The structural gap between specs and implementation isn't inherent to AI. It's a gap we created by giving AI our "what" without our "how."

Closing it requires treating your team's conventions as infrastructure, not tribal knowledge.


What's the most persistent spec-drift you've hit with AI coding tools? Drop it in the comments — I'm cataloging these to improve guideline templates.

Next in this series: Codifying Tacit Knowledge: The Missing Layer Between Your Team's Conventions and Your AI Assistant (link will be added on publish)

Top comments (1)

PEACEBINFLOW

The part about the feedback loop being broken in two directions — feedback doesn't persist, and the volume overwhelms review — is what I keep coming back to. Everyone talks about AI generating code faster, but almost nobody talks about what that does to the review pipeline. A human engineer submitting 3 PRs a week creates a manageable review surface. That same engineer with AI assistance submitting 20 PRs a week doesn't just create more work linearly — it changes the nature of review itself. Reviewers stop looking for subtle convention violations because they're triaging. They approve things they would have caught before, not because they're careless, but because the queue is 40 items deep and each one looks plausible at a glance.

What I think this implies, and what your post brushes against but doesn't quite say, is that AI-assisted development might require us to invert the relationship between convention and code. In the old model, conventions were extracted from code — you looked at the codebase, you absorbed the patterns, you internalized them. In the new model, conventions have to exist before the code. They have to be formalized, written down, machine-readable, because the thing generating the code can't absorb them through osmosis. That's a heavier upfront investment than most teams are used to making, and it's not clear to me that most teams even know what all their conventions are until they try to write them down.

The example about "useEffect for data fetching was how we did it two years ago and there's still some in the codebase — don't follow it" is exactly the kind of thing that would never appear in a style guide but is critical context. An AI scraping your existing code would learn the wrong lesson. A human new hire would get warned. The AI needs something equivalent to the warning, and that means someone has to explicitly catalog the deprecated patterns, not just the current ones.

How do you handle the versioning problem when conventions change? If the guideline files say "use ActionResult" today and six months from now the team moves to a different pattern, the AI's persistent context becomes actively harmful until someone updates it. Do you see that as a maintenance burden that teams just have to accept, or is there a lighter-weight way to keep the living document actually living?