<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Stefan van Egmond</title>
    <description>The latest articles on DEV Community by Stefan van Egmond (@stefanve).</description>
    <link>https://dev.to/stefanve</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3737428%2Fd53c40ad-0cd3-4936-941e-ae8a7eec1d70.png</url>
      <title>DEV Community: Stefan van Egmond</title>
      <link>https://dev.to/stefanve</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/stefanve"/>
    <language>en</language>
    <item>
      <title>Structure Beats Prose: Specs for Coding Agents That Actually Work</title>
      <dc:creator>Stefan van Egmond</dc:creator>
      <pubDate>Wed, 11 Feb 2026 18:35:36 +0000</pubDate>
      <link>https://dev.to/stefanve/structure-beats-prose-specs-for-coding-agents-that-actually-work-eln</link>
      <guid>https://dev.to/stefanve/structure-beats-prose-specs-for-coding-agents-that-actually-work-eln</guid>
      <description>&lt;p&gt;&lt;strong&gt;Part 3: From architectural guardrails to deterministic feature implementation&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;In &lt;a href="https://medium.com/@stefanvanegmond/i-built-a-2300-file-codebase-with-ai-heres-the-jig-i-built-to-prevent-architectural-drift-56453fe2d5b4" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt;, I showed that AI-generated code drifts, one "working" commit at a time, and built ArchCodex to surface the right architectural constraints at the right time. In &lt;a href="https://medium.com/@stefanvanegmond/coding-agents-need-more-than-examples-they-need-guardrails-1b3f71bc2c1d" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt;, I dug into the research and explained how boundaries, constraints, and canonical examples create a feedback loop that keeps drift in check.&lt;/p&gt;

&lt;p&gt;But ArchCodex answers &lt;em&gt;how&lt;/em&gt; code should be structured. It doesn't answer &lt;em&gt;what&lt;/em&gt; code should do.&lt;/p&gt;

&lt;p&gt;When I asked 20 AI agents to implement the same feature, "Add the ability to duplicate timeline entries," the ones with ArchCodex produced better-structured code. They still disagreed about what "duplicate" means. Should tags be copied? Should the status reset? Should due dates carry over? Each agent made different assumptions, and each assumption was reasonable. That's the problem.&lt;/p&gt;

&lt;p&gt;I needed a way to make those decisions explicit, readable for both humans and machines, and verifiable. So I built SpecCodex.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem with Natural Language Specs
&lt;/h2&gt;

&lt;p&gt;The instinct is to write a detailed specification in prose. "When duplicating an entry, the system should copy the title with a '(copy)' suffix, reset the status to 'todo', clear tags, remove the due date, and place the duplicate immediately below the original."&lt;/p&gt;

&lt;p&gt;This seems reasonable, and there are tools that take this approach. GitHub's Spec Kit, for example, generates natural language specifications for coding agents. But natural language specs have problems that compound as features grow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Natural language is ambiguous, for everyone.&lt;/strong&gt; "Clear tags": does that mean set to an empty array, or remove the field entirely? "Immediately below": does that mean sort order &lt;code&gt;original + 1&lt;/code&gt;, or insert at &lt;code&gt;original + 0.5&lt;/code&gt;? An LLM picks an interpretation silently and moves on.&lt;/p&gt;
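&lt;p&gt;To make that ambiguity concrete: here are two implementations of "clear tags" that both satisfy the prose (the &lt;code&gt;Entry&lt;/code&gt; shape is a hypothetical stand-in, not the real schema):&lt;/p&gt;

```typescript
// Hypothetical entry shape, for illustration only.
type Entry = { title: string; tags?: string[] };

// Interpretation 1: "clear tags" means set to an empty array.
function duplicateA(original: Entry): Entry {
  return { ...original, title: `${original.title} (copy)`, tags: [] };
}

// Interpretation 2: "clear tags" means remove the field entirely.
function duplicateB(original: Entry): Entry {
  const copy: Entry = { ...original, title: `${original.title} (copy)` };
  delete copy.tags;
  return copy;
}

const original: Entry = { title: "Task", tags: ["urgent"] };
console.log(duplicateA(original).tags);      // []
console.log("tags" in duplicateB(original)); // false
```

&lt;p&gt;Both pass a prose review; a consumer that reads &lt;code&gt;entry.tags.length&lt;/code&gt; crashes on exactly one of them.&lt;/p&gt;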

&lt;p&gt;&lt;strong&gt;Prose specs are hard to scan.&lt;/strong&gt; A well-written natural-language spec for the duplicate feature runs 800 to 1200 tokens. Much of that is connective tissue: "the system should," "in the case where," "it is important to note that." For an LLM, those wasted tokens compete with file contents, architectural constraints, and conversation history in a limited context window. For a developer, that's two to three pages of text where the actual decisions are buried in paragraphs. When a codebase has dozens of features, natural language specs become a documentation mountain that nobody reads end to end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Natural language can't be tested.&lt;/strong&gt; You can review a prose spec. You can't run it. There's no easy deterministic way to verify that the code matches the spec. This is the fundamental limitation: natural language specs look like they solve the "what should this do" problem, but they just move it from "the LLM guesses" to "the LLM interprets." The variance is smaller, but it's still there.&lt;/p&gt;




&lt;h2&gt;
  
  
  What You Spec Is What You Get
&lt;/h2&gt;

&lt;p&gt;The SpecCodex schema draws from patterns LLMs already know deeply from training data: &lt;a href="https://docs.pact.io/" rel="noopener noreferrer"&gt;Pact&lt;/a&gt; contracts, &lt;a href="https://en.wikipedia.org/wiki/Design_by_contract" rel="noopener noreferrer"&gt;Design by Contract&lt;/a&gt; invariants, &lt;a href="https://en.wikipedia.org/wiki/Specification_by_example" rel="noopener noreferrer"&gt;Specification by Example&lt;/a&gt; given/then pairs. This isn't just familiar syntax. LLMs have learned the associations between these formal specification patterns and their implementations. When the model sees &lt;code&gt;invariants&lt;/code&gt; with &lt;code&gt;@length(0)&lt;/code&gt;, it doesn't need instructions on what to produce; the mapping from spec to code is already in the weights. The schema exploits that. It's prompt engineering at the architectural level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pattern isn't the tool.&lt;/strong&gt; What matters is: make decisions explicit in a parseable format, co-author with the LLM, and verify deterministically. If you use OpenAPI, your API spec is already a structured specification; generate contract tests from it. If you use Prisma or Drizzle, your schema is a specification; generate integration tests from it. If you use TypeScript interfaces for component contracts, those are specifications too. SpecCodex provides an opinionated full-stack schema that covers backend, frontend, security, and effects in one place. The benchmarks below prove the pattern works. The tool is one way to implement it.&lt;/p&gt;
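&lt;p&gt;The pattern is small enough to sketch without any tooling. Here a schema-as-data object drives mechanically generated contract checks (the field list and type tags are illustrative, not from a real project):&lt;/p&gt;

```typescript
// A tiny schema-as-data: field name -> expected runtime type tag.
const entrySchema: Record<string, string> = {
  title: "string",
  status: "string",
  tags: "array",
};

// Generate one contract check per schema field -- mechanically, with no
// interpretation step between the schema and the checks.
function contractChecks(schema: Record<string, string>) {
  return Object.entries(schema).map(([field, kind]) => (value: Record<string, unknown>) => {
    const actual = Array.isArray(value[field]) ? "array" : typeof value[field];
    if (actual !== kind) throw new Error(`${field}: expected ${kind}, got ${actual}`);
  });
}

const checks = contractChecks(entrySchema);
checks.forEach((check) => check({ title: "Task (copy)", status: "todo", tags: [] })); // passes
```

&lt;p&gt;An OpenAPI document or a Prisma schema is the same move with a richer vocabulary: the structured artifact already exists, so the tests can be derived rather than written.&lt;/p&gt;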

&lt;p&gt;Here's what the schema looks like in practice (abbreviated; the full spec includes 7 touchpoints and additional examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec.timeline.duplicateEntry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;inherits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spec.mutation&lt;/span&gt;
  &lt;span class="na"&gt;mixins&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;requires_auth&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;logs_audit&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;has_timestamps&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;implementation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;convex/projects/timeline/mutations.ts#duplicateEntry&lt;/span&gt;

  &lt;span class="na"&gt;goal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;duplicate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;an&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;existing&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;timeline&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;entry"&lt;/span&gt;
  &lt;span class="na"&gt;intent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Copy&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;core&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;entry&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;position&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;below&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;original,&lt;/span&gt;
           &lt;span class="s"&gt;provide&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fresh&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;transient&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fields"&lt;/span&gt;

  &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;entryId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Id&amp;lt;"projectTimelineEntries"&amp;gt;&lt;/span&gt;
      &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="na"&gt;invariants&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Title&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;suffixed&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(copy)"&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result.title"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@endsWith('&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(copy)')"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Same&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;entry&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;as&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;original"&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result.entryType"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@equals(original.entryType)"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Status&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;reset&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;todo&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tasks"&lt;/span&gt;
      &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;original.entryType&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;===&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'task'"&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result.status"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;todo"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Duplicate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;has&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;empty&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(fresh&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;start)"&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result.tags"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@length(0)"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mentions&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;reset&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(no&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;re-notifications)"&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result.mentions"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@length(0)"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sort&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;order&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;places&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;duplicate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;below&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;original"&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result.sortOrder"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@gt(original.sortOrder)"&lt;/span&gt;

  &lt;span class="na"&gt;effects&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Creates&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;new&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;timeline&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;entry"&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;projectTimelineEntries"&lt;/span&gt;
      &lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insert"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Creates&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;junction&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;entries&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;linkedResources"&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;projectTimelineEntryAttachments"&lt;/span&gt;
      &lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insert"&lt;/span&gt;
      &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;original.linkedResources.length&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Logs&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;activity&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;new&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;entry"&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;projectTimelineEntryActivity"&lt;/span&gt;
      &lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insert"&lt;/span&gt;
      &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;duplicatedFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@string(original._id)"&lt;/span&gt;

  &lt;span class="na"&gt;hooks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;hook&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;useTimelineEntryMutations&lt;/span&gt;
      &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;src/hooks/projects/useTimelineEntryMutations.ts&lt;/span&gt;
      &lt;span class="na"&gt;change&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Add&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;duplicateEntry&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;mutation&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;binding"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;hook&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;useTimelineEntryHandlers&lt;/span&gt;
      &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;src/components/projects/planning/useTimelineEntryHandlers.ts&lt;/span&gt;
      &lt;span class="na"&gt;change&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Add&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;handleDuplicate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;callback&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;using&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;duplicateEntry&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;mutation"&lt;/span&gt;

  &lt;span class="na"&gt;touchpoints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;component&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TaskArchetype&lt;/span&gt;
      &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;src/components/projects/planning/archetypes/TaskArchetype.tsx&lt;/span&gt;
      &lt;span class="na"&gt;change&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Wire&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;onDuplicate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;handlers&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;getMenuItems"&lt;/span&gt;
      &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TODO&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;component&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NoteArchetype&lt;/span&gt;
      &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;src/components/projects/planning/archetypes/NoteArchetype.tsx&lt;/span&gt;
      &lt;span class="na"&gt;change&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Add&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Duplicate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;menu&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;item&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Copy&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;icon&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;custom&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;menu"&lt;/span&gt;
      &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TODO&lt;/span&gt;
    &lt;span class="c1"&gt;# ... 5 more components&lt;/span&gt;

  &lt;span class="na"&gt;examples&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duplicate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;task"&lt;/span&gt;
        &lt;span class="na"&gt;given&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;entryId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@validEntryId"&lt;/span&gt;
          &lt;span class="na"&gt;original&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Original&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Task"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;entryType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;done"&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
        &lt;span class="na"&gt;then&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;result._id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@exists"&lt;/span&gt;
          &lt;span class="na"&gt;result.title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Original&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Task&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(copy)"&lt;/span&gt;
          &lt;span class="na"&gt;result.status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;todo"&lt;/span&gt;
          &lt;span class="na"&gt;result.tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@length(0)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things to notice.&lt;/p&gt;

&lt;p&gt;Notice the invariants: every decision is explicit (&lt;code&gt;@length(0)&lt;/code&gt;, not "cleared"), conditional logic is visible (&lt;code&gt;condition: entryType === 'task'&lt;/code&gt;), and each assertion maps mechanically to a test. This isn't documentation; it's a test specification that hasn't been compiled yet.&lt;/p&gt;

&lt;p&gt;Notice the touchpoints: exact file paths, not descriptions. This turned out to be the critical difference between specs that worked for backend only and specs that worked end-to-end.&lt;/p&gt;




&lt;h2&gt;
  
  
  Writing Specs with the LLM: The Discovery Loop
&lt;/h2&gt;

&lt;p&gt;The schema is designed to be &lt;em&gt;co-authored&lt;/em&gt;. You don't sit down and fill it out like a form. You describe what you want, and the LLM drafts the spec, drawing on its knowledge of the codebase.&lt;/p&gt;

&lt;p&gt;The workflow looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;You describe the feature&lt;/strong&gt; in natural language. "I want to duplicate timeline entries."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The LLM drafts a spec&lt;/strong&gt; in the SpecCodex schema. It uses ArchCodex's entity context to look up the schema, relationships, and existing patterns in your codebase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You review and refine.&lt;/strong&gt; "Actually, don't copy tags. Users want a fresh start." "Reset status to todo, but only for tasks."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The LLM updates the spec.&lt;/strong&gt; Now the decisions are locked in and visible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The LLM implements from the spec.&lt;/strong&gt; Not from the original prompt. From the agreed-upon specification.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is where the LLM's discovery power actually shines. When drafting the spec, the LLM surfaces questions you haven't thought of yet: "What happens when someone duplicates a task that's in a milestone?" "Should the duplicate inherit the parent's position in the Gantt chart?" "The schema shows a &lt;code&gt;linkedResources&lt;/code&gt; relation; should those be copied or just the references?" These questions come up at spec-writing time, when answering them is free, instead of at code-review time, when the wrong answer is already baked into the implementation.&lt;/p&gt;

&lt;p&gt;Because the spec is structured, you can see exactly what rules the LLM is proposing. If the invariants section doesn't mention sort ordering, you know the LLM hasn't thought about positioning. If there's no conditional on entry type, you know task-specific behavior will be missed. The gaps are visible because the schema defines what a complete spec &lt;em&gt;looks like&lt;/em&gt;. A natural language spec can feel complete while omitting entire categories of decisions. A structured spec with an empty &lt;code&gt;effects&lt;/code&gt; section is obviously incomplete.&lt;/p&gt;
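&lt;p&gt;That visibility is cheap to automate. A minimal completeness check, assuming the spec file has already been parsed into an object (the required-section list is illustrative; a real linter would follow the full schema):&lt;/p&gt;

```typescript
// Sections a complete mutation spec is expected to fill.
const REQUIRED_SECTIONS = ["inputs", "invariants", "effects", "examples"] as const;

// Flag sections that are missing or empty. An empty `effects` section
// becomes a visible gap instead of a silent omission.
function incompleteSections(spec: Record<string, unknown>): string[] {
  return REQUIRED_SECTIONS.filter((section) => {
    const value = spec[section];
    if (value == null) return true;
    if (Array.isArray(value)) return value.length === 0;
    if (typeof value === "object") return Object.keys(value).length === 0;
    return false;
  });
}

console.log(incompleteSections({
  inputs: { entryId: { type: "Id", required: true } },
  invariants: [{ description: "Title suffixed with (copy)" }],
  effects: [],
  examples: { success: [{ name: "duplicate task" }] },
})); // -> [ 'effects' ]
```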




&lt;h2&gt;
  
  
  Deterministic Verification: Testing What Was Built
&lt;/h2&gt;

&lt;p&gt;Here's the payoff of making specs parseable rather than prose: you can &lt;em&gt;mechanically verify&lt;/em&gt; what the agent built. This is the fundamental difference between structured specs and natural language specs. With prose, the only verification is you reading the code (or tests) and comparing it to the document. With the SpecCodex schema, verification can be deterministic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test generation from specs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The spec compiles directly to executable tests.&lt;/strong&gt; No LLM is involved in the translation, so there is no interpretation variance: the same spec always produces the same tests.&lt;/p&gt;

&lt;p&gt;This works because the schema includes a typed placeholder DSL for both generating test inputs and asserting on outputs. In &lt;code&gt;given:&lt;/code&gt; blocks, placeholders like &lt;code&gt;@string(100)&lt;/code&gt;, &lt;code&gt;@authenticated&lt;/code&gt;, and &lt;code&gt;@array(3, { name: '@string(10)' })&lt;/code&gt; generate concrete, deterministic test data. In &lt;code&gt;then:&lt;/code&gt; blocks, &lt;code&gt;@exists&lt;/code&gt;, &lt;code&gt;@length(0)&lt;/code&gt;, &lt;code&gt;@gt(N)&lt;/code&gt;, and &lt;code&gt;@contains('copy')&lt;/code&gt; each compile to exactly one &lt;code&gt;expect()&lt;/code&gt; call. There's no interpretation step. &lt;code&gt;@length(0)&lt;/code&gt; always becomes &lt;code&gt;expect(x).toHaveLength(0)&lt;/code&gt;, every time, in every project.&lt;/p&gt;
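&lt;p&gt;The assertion half of such a DSL needs very little machinery. A sketch covering a few of the placeholders above; it returns runtime predicates where a real pipeline would emit &lt;code&gt;expect()&lt;/code&gt; calls, and the dispatch table is illustrative rather than SpecCodex's actual compiler:&lt;/p&gt;

```typescript
// Compile one `then:` placeholder into a runtime check.
// Each placeholder maps to exactly one predicate -- no interpretation step.
function compileAssertion(placeholder: string): (actual: unknown) => boolean {
  if (placeholder === "@exists") {
    return (a) => a !== undefined && a !== null;
  }
  const length = /^@length\((\d+)\)$/.exec(placeholder);
  if (length) {
    const n = Number(length[1]);
    return (a) => Array.isArray(a) && a.length === n;
  }
  const gt = /^@gt\((-?\d+(?:\.\d+)?)\)$/.exec(placeholder);
  if (gt) {
    const n = Number(gt[1]);
    return (a) => typeof a === "number" && a > n;
  }
  const endsWith = /^@endsWith\('(.*)'\)$/.exec(placeholder);
  if (endsWith) {
    const suffix = endsWith[1];
    return (a) => typeof a === "string" && a.endsWith(suffix);
  }
  // Anything else is treated as a literal expected value.
  return (a) => a === placeholder;
}

console.log(compileAssertion("@length(0)")([]));                      // true
console.log(compileAssertion("@endsWith(' (copy)')")("Task (copy)")); // true
console.log(compileAssertion("todo")("todo"));                        // true
```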

&lt;p&gt;Different sections of the spec feed different kinds of tests. Examples become unit tests, one &lt;code&gt;it()&lt;/code&gt; block per &lt;code&gt;given/then&lt;/code&gt; pair. Invariants become property tests via fast-check, verifying that properties hold for all valid inputs, not just the examples you thought of. Effects become integration tests that verify database writes and audit logs. Touchpoints become UI interaction tests.&lt;/p&gt;
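&lt;p&gt;The property layer can be sketched without fast-check, too. Here the duplicate invariants are checked against randomized inputs; &lt;code&gt;duplicateEntry&lt;/code&gt; is a hypothetical pure stand-in for the real mutation, written to satisfy the spec above:&lt;/p&gt;

```typescript
// Hypothetical pure core of the mutation, for illustration only.
type Entry = {
  title: string;
  entryType: "task" | "note";
  status: string;
  tags: string[];
  sortOrder: number;
};

function duplicateEntry(original: Entry): Entry {
  return {
    ...original,
    title: `${original.title} (copy)`,
    status: original.entryType === "task" ? "todo" : original.status,
    tags: [],
    sortOrder: original.sortOrder + 1,
  };
}

// Property check: the invariants must hold for *all* inputs,
// not just the handful of examples in the spec.
for (let i = 0; i < 500; i++) {
  const original: Entry = {
    title: Math.random().toString(36).slice(2),
    entryType: Math.random() < 0.5 ? "task" : "note",
    status: ["todo", "doing", "done"][Math.floor(Math.random() * 3)],
    tags: ["a", "b"].slice(0, Math.floor(Math.random() * 3)),
    sortOrder: Math.random() * 1000,
  };
  const copy = duplicateEntry(original);
  if (!copy.title.endsWith(" (copy)")) throw new Error("title invariant");
  if (original.entryType === "task" && copy.status !== "todo") throw new Error("status invariant");
  if (copy.tags.length !== 0) throw new Error("tags invariant");
  if (!(copy.sortOrder > original.sortOrder)) throw new Error("sortOrder invariant");
}
console.log("all invariants held for 500 random entries");
```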

&lt;p&gt;Here's a concrete example. This spec fragment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;examples&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duplicate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;task"&lt;/span&gt;
      &lt;span class="na"&gt;given&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;entryId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@validEntryId"&lt;/span&gt;
        &lt;span class="na"&gt;original&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Original&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Task"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;entryType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;done"&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
      &lt;span class="na"&gt;then&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;result._id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@exists"&lt;/span&gt;
        &lt;span class="na"&gt;result.title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Original&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Task&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(copy)"&lt;/span&gt;
        &lt;span class="na"&gt;result.status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;todo"&lt;/span&gt;
        &lt;span class="na"&gt;result.tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@length(0)"&lt;/span&gt;
  &lt;span class="na"&gt;errors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unauthenticated"&lt;/span&gt;
      &lt;span class="na"&gt;given&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
      &lt;span class="na"&gt;then&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NOT_AUTHENTICATED"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compiles to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;duplicate task&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;original&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;createEntry&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Original Task&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;entryType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;task&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;done&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;duplicateEntry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;original&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBeDefined&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Original Task (copy)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;todo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toHaveLength&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;unauthenticated&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;duplicateEntry&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rejects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toThrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;NOT_AUTHENTICATED&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The translation is mechanical. The spec is written collaboratively (with all the benefits of the discovery loop), but the tests are compiled deterministically (with none of the variance of AI-generated test code). This closes the loop: the LLM writes the implementation, the spec generates tests that verify it, and the results are pass/fail.&lt;/p&gt;
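&lt;p&gt;The invariant-to-property-test path follows the same mechanical pattern. Here is a dependency-free sketch of the idea; in the real pipeline fast-check would supply the generators, and &lt;code&gt;duplicateTitle&lt;/code&gt; is a hypothetical pure helper, not actual SpecCodex output:&lt;/p&gt;

```typescript
// Sketch only: how an invariant like "a duplicated title always ends
// with ' (copy)'" might compile to a property test. fast-check would
// supply the generators in practice; a naive random-string generator
// stands in here. `duplicateTitle` is a hypothetical helper.
function duplicateTitle(title: string): string {
  return title + " (copy)";
}

function randomTitle(): string {
  const len = Math.floor(Math.random() * 20);
  let s = "";
  for (let i = 0; i !== len; i++) {
    s += String.fromCharCode(32 + Math.floor(Math.random() * 95));
  }
  return s;
}

// invariant: for ALL titles, the duplicate's title ends with " (copy)"
for (let run = 0; run !== 200; run++) {
  const title = randomTitle();
  if (!duplicateTitle(title).endsWith(" (copy)")) {
    throw new Error(`invariant violated for ${JSON.stringify(title)}`);
  }
}
```

&lt;p&gt;Instead of checking a handful of hand-picked examples, the loop probes hundreds of generated inputs against the invariant.&lt;/p&gt;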

&lt;h3&gt;
  
  
  Static analysis across specs
&lt;/h3&gt;

&lt;p&gt;Because specs are structured, you can also run static analysis across them &lt;em&gt;before any code is written&lt;/em&gt;. SpecCodex's analyzer builds a cross-reference graph across your entire spec registry: which specs write to which database tables, which specs read from which tables, which specs depend on each other, which specs share entities. Then it runs 65 checkers across six categories against this graph.&lt;/p&gt;

&lt;p&gt;For example, a checker sees a spec with &lt;code&gt;authentication: none&lt;/code&gt; combined with a database &lt;code&gt;insert&lt;/code&gt; effect and flags it: you're writing to a table without auth. Another sees two specs that both write to the same table with different field assumptions and flags a potential consistency issue. Another sees a CRUD entity with create, read, and delete specs but no update, flagging incomplete coverage. None of this requires running code. It's static analysis for &lt;em&gt;designs&lt;/em&gt;, not implementations.&lt;/p&gt;
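&lt;p&gt;As a sketch, the unauthenticated-write checker described above might look like this. The &lt;code&gt;Spec&lt;/code&gt; shape is assumed for illustration and may differ from SpecCodex's actual schema:&lt;/p&gt;

```typescript
// Illustrative checker: flag specs that declare no authentication but
// still write to a table. The Spec/Effect shapes are assumptions, not
// SpecCodex's real schema.
type Effect = { kind: "insert" | "update" | "delete" | "read"; table: string };
type Spec = { id: string; authentication: "required" | "none"; effects: Effect[] };

function checkUnauthenticatedWrites(specs: Spec[]): string[] {
  const findings: string[] = [];
  for (const spec of specs) {
    if (spec.authentication !== "none") continue;
    for (const effect of spec.effects) {
      if (effect.kind === "read") continue; // reads are a separate concern
      findings.push(
        `${spec.id}: ${effect.kind} on "${effect.table}" with authentication: none`
      );
    }
  }
  return findings;
}
```

&lt;p&gt;Run over a registry, a spec combining &lt;code&gt;authentication: none&lt;/code&gt; with an &lt;code&gt;insert&lt;/code&gt; effect produces a finding before a line of implementation exists.&lt;/p&gt;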

&lt;h3&gt;
  
  
  Deep mode: verifying code against specs
&lt;/h3&gt;

&lt;p&gt;The base analyzer reasons about specs in isolation. Deep mode goes further: it reads the actual implementation source files and compares them against what the specs claim. The spec says &lt;code&gt;authentication: required&lt;/code&gt;; does the code actually check the user? The spec says &lt;code&gt;permissions: ["bookmark.edit"]&lt;/code&gt;; does the code check that permission, or did it drift to checking &lt;code&gt;"admin"&lt;/code&gt; instead?&lt;/p&gt;

&lt;p&gt;Deep mode uses configurable regex patterns grouped into six categories: auth checks, ownership checks, permission calls, soft-delete filters, database queries, and record fetches. You define these patterns per project because every framework looks different. A Convex project checks for &lt;code&gt;ctx.userId&lt;/code&gt;; an Express project checks for &lt;code&gt;req.user&lt;/code&gt;; a Django project checks for &lt;code&gt;request.user&lt;/code&gt;. The patterns are different, but the security question is the same: does the code verify what the spec requires?&lt;/p&gt;
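&lt;p&gt;A minimal sketch of that per-project configuration, with illustrative pattern values rather than ArchCodex defaults:&lt;/p&gt;

```typescript
// Sketch: per-project auth-check patterns for deep mode. The pattern
// values are illustrative, not ArchCodex's shipped configuration.
const authPatterns: { [framework: string]: RegExp } = {
  convex: /ctx\.userId/,
  express: /req\.user/,
  django: /request\.user/,
};

// A spec that says `authentication: required` passes only if the
// project's auth pattern appears somewhere in the implementation.
function verifyAuthCheck(framework: string, source: string): boolean {
  const pattern = authPatterns[framework];
  if (!pattern) throw new Error(`no auth pattern configured for ${framework}`);
  return pattern.test(source);
}
```

&lt;p&gt;The same structure extends to the other five categories: each is just a named bag of patterns the analyzer tests the source against.&lt;/p&gt;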

&lt;p&gt;This catches a specific class of bugs that are nearly invisible in code review. When a spec says the user can only update their own records, deep mode checks whether the code both fetches the record &lt;em&gt;and&lt;/em&gt; compares ownership. When a spec implies soft-delete semantics, deep mode checks whether queries actually filter out deleted records. When a spec declares a permission, deep mode extracts the permission string from the code and compares it to the spec, catching permission drift.&lt;/p&gt;

&lt;p&gt;The full verification stack is layered intentionally. Test generation catches behavioral drift (does the code do what the spec says?). Static analysis catches design gaps (is the spec itself consistent and complete?). Deep mode catches implementation drift (has the code diverged from the spec?). Together, they turn a structured spec into a continuous verification system rather than a document that goes stale.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Benchmark
&lt;/h2&gt;

&lt;p&gt;To validate this approach, I ran the same feature request (duplicate timeline entries) across 20 AI agents with different configurations.&lt;/p&gt;

&lt;p&gt;The feature was chosen because it's deceptively complex. "Duplicate a timeline entry" sounds like a single mutation, but a complete implementation touches 11 files across four layers: the backend mutation and its barrel export, two hook files for binding and handling, a type contract update, a controller wiring change, and five separate UI archetype components that each need menu updates. Most agents discovered the first six files naturally by following imports. The five archetype files, delegated components that don't appear in the obvious import chain, are where implementations diverged.&lt;/p&gt;

&lt;p&gt;The results support the claims above, but they also revealed things I didn't expect.&lt;/p&gt;

&lt;h3&gt;
  
  
  The assumption problem disappears
&lt;/h3&gt;

&lt;p&gt;Without a spec, every agent made reasonable but different decisions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;A1&lt;/th&gt;
&lt;th&gt;A3&lt;/th&gt;
&lt;th&gt;A5&lt;/th&gt;
&lt;th&gt;A7&lt;/th&gt;
&lt;th&gt;A11&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Copy tags?&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copy due dates?&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copy assignee?&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reset status?&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copy attachments?&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every answer is defensible. None is what we wanted.&lt;/p&gt;

&lt;p&gt;With SpecCodex, backend adherence went to 100%. Not improved. &lt;strong&gt;Identical.&lt;/strong&gt; Every agent with the spec produced the same field handling, the same sort order logic, the same audit logging. The spec didn't guide the agent; it constrained it.&lt;/p&gt;

&lt;p&gt;Silent bugs (wrong data copied, missing features, semantic errors) dropped by 75%:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Group&lt;/th&gt;
&lt;th&gt;Avg Silent Bugs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No tooling&lt;/td&gt;
&lt;td&gt;4.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ArchCodex only&lt;/td&gt;
&lt;td&gt;2.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ArchCodex + SpecCodex&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The remaining silent bugs in the SpecCodex group were all UI-related, which leads to the next finding.&lt;/p&gt;

&lt;h3&gt;
  
  
  File paths matter more than descriptions
&lt;/h3&gt;

&lt;p&gt;This was the most surprising finding, and the most actionable. The spec went through three versions. Versions 1 and 2 had invariants, effects, and hooks but described UI changes vaguely. Both produced 0% UI wiring success. Agents with perfect backends didn't touch the right UI files.&lt;/p&gt;

&lt;p&gt;The breakthrough came with v3, which added explicit file paths to touchpoints:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Spec Version&lt;/th&gt;
&lt;th&gt;Touchpoint Format&lt;/th&gt;
&lt;th&gt;UI Success&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;v1&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;"Update useTimelineEntryHandlers hook"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;"file: src/components/.../NoteArchetype.tsx"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;100% (Opus)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Component names weren't enough. Hook names weren't enough. Only full paths worked. If you take one thing from this post for your own specs: when a feature touches multiple files, give the agent the exact path, not a description of where to look.&lt;/p&gt;

&lt;p&gt;This also revealed a capability ceiling: with v3, Opus achieved 5/5 UI components wired correctly while Haiku produced a perfect backend but 0/5 UI. The spec format works for both models on the backend; UI wiring across multiple files requires a more capable model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lucky outcomes vs. reliable processes
&lt;/h3&gt;

&lt;p&gt;The best agent without specs (Opus + ArchCodex, no spec) scored the same as the best agent with specs on production risk. But the unspecified agent's success was &lt;em&gt;emergent&lt;/em&gt;: it happened to explore the right files and make the right assumptions. Run it again and you might get a different result. The specified agent's success was &lt;em&gt;deterministic&lt;/em&gt;: the spec locked in every decision. Run it ten times and you get the same outcome. The difference between a lucky result and a reliable process.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Arc of the Series
&lt;/h2&gt;

&lt;p&gt;The pattern across the three-part series is clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 1&lt;/strong&gt;: LLMs write code that works but doesn't &lt;em&gt;fit&lt;/em&gt;. Architectural drift is invisible and compounds. ArchCodex makes it visible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2&lt;/strong&gt;: The research confirms this at scale. Structured guardrails (boundaries, constraints, canonical examples) reduce drift systematically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3&lt;/strong&gt;: &lt;em&gt;What&lt;/em&gt; the code does matters as much as &lt;em&gt;how&lt;/em&gt; it's structured. A purpose-built specification schema, co-authored with the LLM and verified deterministically, eliminates assumption variance and makes every decision visible before code is written.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The table saw metaphor still holds. ArchCodex is the fence; it keeps the cut straight. SpecCodex is the blueprint; it defines where the cut goes. Without both, you're measuring twice and still cutting wrong, because the LLM and you have different measurements in mind.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;The practice is: structure your specs, make them parseable, verify deterministically. You can apply that with whatever tools fit your stack.&lt;/p&gt;

&lt;p&gt;If you want an opinionated implementation that covers the full stack, SpecCodex is part of ArchCodex:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ArchCodexOrg/archcodex" rel="noopener noreferrer"&gt;github.com/ArchCodexOrg/archcodex&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Start with one spec for your next feature. Write it with the LLM. See if the implementation matches. I think you'll find it changes how you think about AI-assisted development: not as "generate and review" but as "specify and verify."&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 3 of a series on AI-assisted development. &lt;a href="https://medium.com/@stefanvanegmond/i-built-a-2300-file-codebase-with-ai-heres-the-jig-i-built-to-prevent-architectural-drift-56453fe2d5b4" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt; covered the benchmarks and why I built ArchCodex. &lt;a href="https://medium.com/@stefanvanegmond/coding-agents-need-more-than-examples-they-need-guardrails-1b3f71bc2c1d" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt; explored what the research reveals about AI coding quality.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>architecture</category>
      <category>agents</category>
    </item>
    <item>
<title>The Guardrails Coding Agents Need</title>
      <dc:creator>Stefan van Egmond </dc:creator>
      <pubDate>Thu, 05 Feb 2026 16:26:05 +0000</pubDate>
      <link>https://dev.to/stefanve/the-guardrails-coding-agents-needs-3f0b</link>
      <guid>https://dev.to/stefanve/the-guardrails-coding-agents-needs-3f0b</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 2: What the research reveals&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;In &lt;a href="https://medium.com/@stefanvanegmond/i-built-a-2300-file-codebase-with-ai-heres-the-jig-i-built-to-prevent-architectural-drift-56453fe2d5b4" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt;, I described what 1500 hours of AI-assisted development taught me: LLMs write code that compiles, passes tests, and works for users, but doesn't &lt;em&gt;fit&lt;/em&gt;. The pattern has a name: architectural drift. I built a tool to measure and prevent it. I ran benchmarks that showed the gap between "working code" and "good code" was larger than I expected.&lt;/p&gt;

&lt;p&gt;But I wanted to know: was my experience typical?&lt;/p&gt;

&lt;p&gt;So I dug into the research. The pattern was clearer than I expected.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem at Scale
&lt;/h2&gt;

&lt;p&gt;In Part 1, I measured two things traditional metrics miss: architectural drift (code that works but doesn't fit) and silent bugs (violations that compile, pass tests, and clear review). These became my proxy for production risk, the gap between "code that runs" and "code that belongs."&lt;/p&gt;

&lt;p&gt;The research measures the same gap at organizational scale. Developers report feeling 20-30% faster with AI tools. Yet delivery stability drops, complexity rises, and technical debt compounds. The 2024 DORA report found that a 25% increase in AI adoption correlates with a 7.2% decrease in delivery stability: correlation, not proof of causation, but a pattern worth noticing. The causal evidence is stronger elsewhere: a Carnegie Mellon study used difference-in-differences analysis across 807 repositories after Cursor adoption, finding a 3-5× spike in output during month one, followed by a 30% increase in static-analysis warnings and a 41% increase in complexity. A METR randomized controlled trial found developers using AI took 19% longer on real tasks, despite believing they were faster.&lt;/p&gt;

&lt;p&gt;The tools aren't broken. The feedback loops are.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Evidence&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Duplication exploding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8× increase in duplicate code blocks&lt;/td&gt;
&lt;td&gt;GitClear (Feb 2025)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context gaps&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;65% cite "missing context" as top issue&lt;/td&gt;
&lt;td&gt;Qodo (June 2025)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security vulnerabilities&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;45% contain OWASP Top 10 vulnerabilities&lt;/td&gt;
&lt;td&gt;Veracode (Aug 2025)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Quality degradation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Logic errors up 1.75×, XSS up 2.74×&lt;/td&gt;
&lt;td&gt;CodeRabbit (Dec 2025)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Invisible drift&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;25% more AI → 7.2% less stability&lt;/td&gt;
&lt;td&gt;Google DORA (2024, 2025)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is what it looks like when you measure individual velocity instead of system health. The common thread? Coding agents know what's &lt;em&gt;possible&lt;/em&gt;, not what's &lt;em&gt;right&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The codebase doesn't drift all at once. It drifts one "working" commit at a time.&lt;/p&gt;

&lt;p&gt;ArchCodex is a proof of concept: can we help coding agents get the right context when they need it? The approach combines hints, verifiable rules, and tools to check whether the agent (and the codebase) follows those rules, paired with some prompting techniques.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Approach
&lt;/h2&gt;

&lt;p&gt;The obvious response to "missing context" is to give the LLM more context. But even when your entire codebase fits in a 1M-token window, more tokens alone don't solve the problem. The bottleneck is &lt;em&gt;what kind&lt;/em&gt; of context you provide, and whether it surfaces at the right time.&lt;/p&gt;

&lt;p&gt;RAG is getting better and injects documentation at query time. This helps with API signatures and usage examples. It's less effective for architectural boundaries, team conventions, and security patterns, the stuff that lives in people's heads and Slack threads, not docs. And because RAG retrieves from actual code, it can reintroduce old patterns or copy wrong ones. Research on agile teams found that significant portions of code commits result in undocumented knowledge (Saito et al.). You can't retrieve what was never written down.&lt;/p&gt;

&lt;p&gt;There's active research on structured RAG, graph-based retrieval, and hybrid approaches that blur these lines. What I'm describing isn't a different category; it is retrieval that's structured around architectural concepts, scoped to what's relevant, and enforced rather than suggested. Think of it as architectural metadata—a machine-readable version of the mental model a developer has.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Not Just More Context?
&lt;/h3&gt;

&lt;p&gt;RAG retrieves what exists in the codebase, which may include drifted patterns. Fine-tuning bakes in patterns at training time, which can't adapt to architectural decisions made yesterday.&lt;/p&gt;

&lt;p&gt;Architecture-as-code operates differently:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;When it learns&lt;/th&gt;
&lt;th&gt;What it knows&lt;/th&gt;
&lt;th&gt;Can it enforce?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RAG&lt;/td&gt;
&lt;td&gt;At query time&lt;/td&gt;
&lt;td&gt;What exists in code&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fine-tuning&lt;/td&gt;
&lt;td&gt;At training time&lt;/td&gt;
&lt;td&gt;Patterns frozen in weights&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ArchCodex&lt;/td&gt;
&lt;td&gt;When you update the registry&lt;/td&gt;
&lt;td&gt;What's &lt;em&gt;intended&lt;/em&gt;, not just what &lt;em&gt;exists&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The registry can be updated after a single incident, immediately affecting every subsequent generation. It can be applied to existing code to surface violations. And when drift &lt;em&gt;does&lt;/em&gt; happen, because it will, the health dashboard makes it visible.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Four Layers
&lt;/h3&gt;

&lt;p&gt;The approach has four layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Boundaries&lt;/strong&gt; - Tell the LLM what this file is allowed to touch. Import restrictions, layer violations, forbidden dependencies. Example: "Cannot import &lt;code&gt;express&lt;/code&gt; into domain layer." These prevent drift before it starts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Constraints&lt;/strong&gt; - Encode rules that should rarely be broken. "Always call &lt;code&gt;requireProjectPermission()&lt;/code&gt; before database access." "Never import infrastructure into domain." These catch silent bugs before they ship.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Examples&lt;/strong&gt; - Surface canonical implementations at the right moment. "See &lt;code&gt;UserService.ts&lt;/code&gt; for the pattern." "Use the event system, not direct calls." These guide the LLM toward consistency without requiring it to infer patterns from scattered examples.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Validation&lt;/strong&gt; - Catch what slipped through. Single-file checks before commit. Cross-file analysis for layer violations. Health metrics that surface drift over time.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight: these aren't documentation. They're structured context that surfaces when relevant and can be enforced when violated. The difference between a constraint and a wiki page is that the machine reads the constraint automatically and blocks the PR if it's violated. Documentation gets ignored. Constraints get executed.&lt;/p&gt;

&lt;p&gt;ArchCodex is one implementation of this approach. It's not the only way to solve this, and it's not a silver bullet. But it let me test whether structured guardrails could address the gaps the research identifies. The results from Part 1 suggest they can.&lt;/p&gt;

&lt;p&gt;Here's how it works in practice.&lt;/p&gt;




&lt;h2&gt;
  
  
  How ArchCodex Works
&lt;/h2&gt;

&lt;p&gt;The core mechanism is simple: you tag source files with an &lt;code&gt;@arch&lt;/code&gt; annotation, and ArchCodex injects the relevant constraints when an agent reads the file.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;@arch&lt;/code&gt; tag is just a comment. In TypeScript: &lt;code&gt;/** @arch domain.payment.processor */&lt;/code&gt;. In Python: &lt;code&gt;# @arch domain.payment.processor&lt;/code&gt;. That's it. ArchCodex scans for these tags and does the rest.&lt;/p&gt;
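&lt;p&gt;A hypothetical sketch of the tag lookup (not ArchCodex's actual parser); a single regex accepts both comment forms:&lt;/p&gt;

```typescript
// Sketch of the @arch scan: extract a file's architectural identity
// from its source. Works for both the TypeScript comment form
// (/** @arch ... */) and the Python form (# @arch ...).
function findArchTag(source: string): string | null {
  const match = source.match(/@arch\s+([\w.]+)/);
  return match ? match[1] : null;
}
```

&lt;p&gt;Given &lt;code&gt;/** @arch domain.payment.processor */&lt;/code&gt;, this returns &lt;code&gt;domain.payment.processor&lt;/code&gt;; everything else (constraint resolution, header injection) keys off that identifier.&lt;/p&gt;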

&lt;h3&gt;
  
  
  Boundaries Surface Before Generation
&lt;/h3&gt;

&lt;p&gt;When an LLM agent reads a file through ArchCodex, via &lt;code&gt;archcodex read --format ai&lt;/code&gt; or the MCP server integration, the tool looks up the file's &lt;code&gt;@arch&lt;/code&gt; tag, resolves the full inheritance chain, and prepends a structured header with all applicable constraints, hints, and boundaries. The agent sees this header &lt;em&gt;before&lt;/em&gt; it sees the code. Without ArchCodex, the agent would just see raw source.&lt;/p&gt;

&lt;p&gt;Here's what that looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IMPORT BOUNDARIES

  Can import:
    ✓ src/domain/payments/*
    ✓ src/domain/shared/*
    ✓ src/utils/*

  Cannot import:
    ✗ src/api/* (layer violation)
    ✗ src/infra/* (domain must be infra-agnostic)
    ✗ express, fastify, pg (forbidden frameworks)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A "layer" here is a logical grouping you define—typically mapping to architectural boundaries like &lt;code&gt;domain&lt;/code&gt;, &lt;code&gt;api&lt;/code&gt;, &lt;code&gt;infrastructure&lt;/code&gt;, or &lt;code&gt;utils&lt;/code&gt;. You configure which directories belong to which layer and which layers can import from which. The domain layer shouldn't import from &lt;code&gt;api&lt;/code&gt;; &lt;code&gt;api&lt;/code&gt; shouldn't import from &lt;code&gt;cli&lt;/code&gt;. These aren't folder names, they're conceptual boundaries that ArchCodex enforces.&lt;/p&gt;

&lt;p&gt;The LLM knows what's allowed before it writes a single line. The "missing context" problem, cited by 65% of developers as their top issue, gets addressed at the source.&lt;/p&gt;

&lt;p&gt;In Part 1, I showed Opus 4.5 producing the smallest diff with correct logic, and still ranking 6th because of architectural drift. With boundaries explicit, the drift doesn't happen in the first place.&lt;/p&gt;

&lt;h3&gt;
  
  
  Constraints Encode Conventions
&lt;/h3&gt;

&lt;p&gt;The registry captures what usually lives in people's heads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;myapp.domain.service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;constraints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;rule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;forbid_import&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;express&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;fastify&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;pg&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;error&lt;/span&gt;
      &lt;span class="na"&gt;why&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Domain must be framework-agnostic&lt;/span&gt;
      &lt;span class="na"&gt;alternative&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Inject dependencies via constructor&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;rule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;require_call_before&lt;/span&gt;
      &lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;requireProjectPermission&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;checkOwnership&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;before&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repository.*"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ctx.db.*"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;error&lt;/span&gt;
      &lt;span class="na"&gt;why&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Verify permissions before database access&lt;/span&gt;

  &lt;span class="na"&gt;hints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Use requireProjectPermission() for ownership checks&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;See src/domain/user/UserService.ts for the pattern&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Registry Isn't a Code Map
&lt;/h3&gt;

&lt;p&gt;The registry doesn't mirror your folder structure. &lt;code&gt;domain.payment.processor&lt;/code&gt; doesn't imply a &lt;code&gt;domain/payment/processor.ts&lt;/code&gt; file path—it's a &lt;em&gt;conceptual&lt;/em&gt; hierarchy for inheriting rules.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;domain.payment&lt;/code&gt; inherits from &lt;code&gt;domain&lt;/code&gt;, it means: "payment code follows all domain constraints, plus these extras." The inheritance is about &lt;em&gt;rules&lt;/em&gt;, not code. Your file can live at &lt;code&gt;src/billing/StripeProcessor.ts&lt;/code&gt; and still be tagged &lt;code&gt;@arch domain.payment.processor&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This has a practical implication: &lt;strong&gt;registries are portable&lt;/strong&gt;. You could create a "Next.js + Convex" registry encoding your team's patterns, then reuse it across projects. The architectural knowledge isn't locked to one codebase.&lt;/p&gt;
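&lt;p&gt;As a rough sketch of what that rule inheritance might look like (the registry shape and function names here are my own assumptions for illustration, not ArchCodex's actual API), resolving a conceptual &lt;code&gt;@arch&lt;/code&gt; ID just means walking its dotted prefixes from root to leaf:&lt;/p&gt;

```typescript
// Hypothetical registry: each conceptual architecture ID maps to its own rules.
type Registry = { [archId: string]: string[] };

const registry: Registry = {
  "domain": ["forbid_import:express"],
  "domain.payment": ["require_call_before:requireProjectPermission"],
  "domain.payment.processor": ["max_lines:300"],
};

function resolveRules(archId: string, reg: Registry): string[] {
  const parts = archId.split(".");
  const rules: string[] = [];
  // Each prefix of the ID contributes its rules: domain, domain.payment, ...
  for (let i = 1; parts.length >= i; i += 1) {
    const prefix = parts.slice(0, i).join(".");
    const own = reg[prefix];
    if (own) rules.push(...own);
  }
  return rules;
}

// A file tagged `@arch domain.payment.processor` inherits rules from all
// three conceptual levels, regardless of where it lives on disk:
const inherited = resolveRules("domain.payment.processor", registry);
```

&lt;p&gt;Note that nothing in the lookup touches the filesystem: the same resolution works for &lt;code&gt;src/billing/StripeProcessor.ts&lt;/code&gt; or any other path.&lt;/p&gt;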

&lt;h3&gt;
  
  
  Canonical Implementations Counter the Xerox Effect
&lt;/h3&gt;

&lt;p&gt;Without guidance, coding agents copy from whatever appeared recently in context, which might itself be a copy of a copy, each iteration drifting further from the original intent. Call it the Xerox effect: each copy degrades.&lt;/p&gt;

&lt;p&gt;A canonical implementation is a file you designate as "the authoritative way to do X." Add it to the pattern registry, and ArchCodex surfaces it in hints and error messages. Instead of the agent copying the most recent (possibly drifted) example, it sees: "Use &lt;code&gt;src/domain/user/UserService.ts&lt;/code&gt; as your reference."&lt;/p&gt;

&lt;p&gt;One authoritative example prevents the drift that compounds through successive copies.&lt;/p&gt;

&lt;h3&gt;
  
  
  The GPT 5.1 Problem
&lt;/h3&gt;

&lt;p&gt;Remember the GPT 5.1 result from Part 1? It produced working code with zero critical bugs—and still ranked dead last in my benchmark, because it didn't use &lt;code&gt;requireProjectPermission()&lt;/code&gt;. It did manual ownership checks instead. The code worked. It didn't belong.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;require_call_before&lt;/code&gt; constraint prevents exactly this class of silent bug. The pattern is now explicit, not buried in tribal knowledge.&lt;/p&gt;
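&lt;p&gt;A minimal sketch of how such a check could work, assuming the ordered call names have already been extracted from a function body (a real tool would walk the AST; this simplification is mine, not ArchCodex's implementation):&lt;/p&gt;

```typescript
// Guard calls mirror the registry example above.
const GUARDS = ["requireProjectPermission", "checkOwnership"];

function violatesCallOrder(calls: string[]): boolean {
  let guarded = false;
  for (const call of calls) {
    if (GUARDS.includes(call)) guarded = true;
    // Any repository access before a guard call is a violation.
    if (call.startsWith("repository.")) {
      if (!guarded) return true;
    }
  }
  return false;
}

violatesCallOrder(["repository.save"]);                             // → true
violatesCallOrder(["requireProjectPermission", "repository.save"]); // → false
```

&lt;p&gt;Order matters: a permission check &lt;em&gt;after&lt;/em&gt; the database call still flags, which is exactly the manual-ownership-check failure mode described above.&lt;/p&gt;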

&lt;p&gt;This isn't theoretical. Before ArchCodex, my project NimoNova had files that bypassed &lt;code&gt;sanitizeLLMInput()&lt;/code&gt; entirely, passing raw content to the model. The code compiled. It worked in testing. In production, it would have been a prompt injection vector. A constraint on LLM-facing modules now catches this automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Validation Catches What Slipped Through
&lt;/h3&gt;

&lt;p&gt;Even with good context, mistakes happen. Validation operates at two levels:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-file checks&lt;/strong&gt; catch constraint violations on changed code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/domain/payments/PaymentService.ts

  ✗ ERROR: forbid_import violated
    Line 3: import { Request } from 'express'
    Why: Domain must be framework-agnostic

  ⚠ WARNING: require_call_before not satisfied
    repository.save() called without prior requireProjectPermission()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Errors don't just say "no" - they say what to do instead. Each violation includes a suggestion and, where relevant, a &lt;code&gt;did_you_mean&lt;/code&gt; field with concrete fix guidance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FAIL: src/core/health/analyzer.ts
  forbid_import: chalk
    → Use: src/utils/logger.ts (LoggerService)
    Did you mean: import { logger } from '../../utils/logger.js'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This comes from the constraint definition in the registry. The agent doesn't have to search for the right alternative - it's handed one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-file checks&lt;/strong&gt; catch systemic issues and check the complete project after architecture updates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;archcodex check --project

  Layer violations: 3
  Circular dependencies: 2
  Missing canonical patterns: 7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Feedback Loop
&lt;/h2&gt;

&lt;p&gt;In Part 1, I showed how Haiku 4.5 improved as the registry evolved. The same pattern held when I measured silent bugs specifically:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Registry State&lt;/th&gt;
&lt;th&gt;Silent Bugs&lt;/th&gt;
&lt;th&gt;Reduction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No ArchCodex&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Base registry&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;40%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ Security hints&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ Canonical patterns&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each iteration of the registry, each constraint added from observing mistakes, made the next run better.&lt;/p&gt;

&lt;p&gt;The registry improves through use. I sometimes use these five questions to surface improvements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What information did ArchCodex provide that helped?&lt;/li&gt;
&lt;li&gt;What information was missing?&lt;/li&gt;
&lt;li&gt;What was irrelevant or noisy?&lt;/li&gt;
&lt;li&gt;Did you update any architecture definitions?&lt;/li&gt;
&lt;li&gt;For the next developer, what will ArchCodex help with?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Improvements come from real needs: changing, tightening, or updating the architecture; introducing new patterns, new utilities, and new ways of doing things; capturing common bugs and errors. The registry is a living document. It helps engineers too, not just coding agents. It's architectural governance and mentorship at scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Constraints Aren't Enough
&lt;/h2&gt;

&lt;p&gt;A fair criticism: doesn't this just create rigidity? Codebases evolve. Good architects and engineers make context-dependent trade-offs.&lt;/p&gt;

&lt;p&gt;ArchCodex isn't only constraints. The registry has three layers of flexibility, plus a composition mechanism:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hard constraints&lt;/strong&gt; are rules that should rarely be broken. Import boundaries, security patterns, layer violations. These catch the mistakes that compound silently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hints&lt;/strong&gt; are soft guidance. "Prefer X over Y." "See this file for the pattern." The coding agent sees them, weighs them, and makes a judgment call. No error if it chooses differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intents&lt;/strong&gt; declare known patterns that satisfy constraints in non-obvious ways. For example, your codebase might have a rule: "All database queries must filter soft-deleted records." But what about queries that intentionally need deleted records—like a trash view or audit log? An &lt;code&gt;@intent:includes-deleted&lt;/code&gt; annotation tells ArchCodex this query intentionally skips the filter—and satisfies the constraint that would otherwise require it. An &lt;code&gt;@intent:cli-output&lt;/code&gt; exempts a file from the "no console.log" rule. Intents are decisions, not exceptions. They document valid alternative patterns.&lt;/p&gt;
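&lt;p&gt;Conceptually, an intent check is simple. The annotation names below come from the article; the checker itself is a hypothetical simplification, not ArchCodex's implementation:&lt;/p&gt;

```typescript
// Does the "filter soft-deleted records" rule apply to this file?
function needsSoftDeleteFilter(fileIntents: string[]): boolean {
  // The rule applies unless the file declares that it intentionally
  // includes deleted rows (e.g. a trash view or audit log).
  return !fileIntents.includes("includes-deleted");
}

needsSoftDeleteFilter([]);                   // → true: the rule applies
needsSoftDeleteFilter(["includes-deleted"]); // → false: intent satisfies it
```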

&lt;p&gt;&lt;strong&gt;Mixins&lt;/strong&gt; are reusable constraint bundles. Instead of repeating "must have test file" and "max 300 lines" across ten architectures, you define a &lt;code&gt;tested&lt;/code&gt; mixin once and compose it in the registry: &lt;code&gt;mixins: [tested, srp]&lt;/code&gt;. You can also apply mixins per-file using inline syntax: &lt;code&gt;@arch domain.payment.processor +singleton +pure&lt;/code&gt;. Mixins keep the registry DRY while allowing file-level flexibility.&lt;/p&gt;
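&lt;p&gt;Mixin expansion can be sketched as merging named constraint bundles into an architecture's own rules. The mixin names come from the article; the data shapes and rule strings are my assumptions:&lt;/p&gt;

```typescript
// Reusable constraint bundles, defined once.
const mixinBundles: { [name: string]: string[] } = {
  tested: ["require_test_file"],
  srp: ["max_lines:300"],
  pure: ["forbid_side_effects"],
};

function expandMixins(ownRules: string[], applied: string[]): string[] {
  const expanded = [...ownRules];
  for (const name of applied) {
    const bundle = mixinBundles[name];
    if (bundle) expanded.push(...bundle);
  }
  return expanded;
}

// `mixins: [tested, srp]` in the registry plus an inline `+pure` on the file:
const effective = expandMixins(["forbid_import:express"], ["tested", "srp", "pure"]);
```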

&lt;p&gt;And when you encounter an &lt;em&gt;unanticipated&lt;/em&gt; exception, the override system makes it explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// @override forbid_import:pg&lt;/span&gt;
&lt;span class="c1"&gt;// reason: Legacy migration script, will be removed by Q2&lt;/span&gt;
&lt;span class="c1"&gt;// expires: 2025-06-01&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Client&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pg&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The violation is acknowledged, documented, and time-boxed. Teams can track how much architectural debt they're carrying and whether it's growing or shrinking.&lt;/p&gt;
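&lt;p&gt;Deciding whether an override has lapsed is a small check. The comment format is copied from the example above; the parsing here is a simplification, not ArchCodex's implementation:&lt;/p&gt;

```typescript
function overrideExpired(comment: string, today: string): boolean {
  const match = comment.match(/expires:\s*(\d{4}-\d{2}-\d{2})/);
  if (!match) return false; // no expiry date: the override never lapses
  return today > match[1];  // ISO dates compare correctly as strings
}

const header = "// @override forbid_import:pg\n// expires: 2025-06-01";
overrideExpired(header, "2025-05-01"); // → false: still within its window
overrideExpired(header, "2025-07-01"); // → true: expired debt, now visible
```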

&lt;p&gt;The goal isn't to prevent all deviation. It's to make deviation visible. When a coding agent breaks a pattern, you want to know whether it's drift (bad) or evolution (good).&lt;/p&gt;




&lt;h2&gt;
  
  
  Ongoing Health and Keeping the Registry Up to Date
&lt;/h2&gt;

&lt;p&gt;Codebases drift over time. The CMU study showed complexity accumulating even as velocity gains faded. ArchCodex surfaces this before it compounds.&lt;/p&gt;

&lt;p&gt;Even with hints and constraints, coding agents still tend to "forget" or say things like "for the sake of time, let me do this quickly", resulting in code duplication, violations, and other drift. Three commands address this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;archcodex check&lt;/code&gt;&lt;/strong&gt; - Linter-like validation for architecture. Run on save, commit, or CI. Catches constraint violations, layer boundary crossings, and forbidden patterns. With &lt;code&gt;--project&lt;/code&gt;, it also detects circular dependencies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;archcodex health&lt;/code&gt;&lt;/strong&gt; - Dashboard for architectural debt. Shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Override debt&lt;/strong&gt;: How many overrides exist, which are expiring, which have expired&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage&lt;/strong&gt;: What percentage of files have &lt;code&gt;@arch&lt;/code&gt; tags&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Registry bloat&lt;/strong&gt;: Architectures used by only one file, similar sibling architectures that could be consolidated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type duplicates&lt;/strong&gt;: Identical or near-identical type definitions across files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recommendations&lt;/strong&gt;: Actionable suggestions (e.g., "run &lt;code&gt;archcodex audit --expired&lt;/code&gt;")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;archcodex garden&lt;/code&gt;&lt;/strong&gt; - Index maintenance and pattern detection. Finds naming conventions that aren't yet captured in the registry, inconsistent &lt;code&gt;@arch&lt;/code&gt; usage, and missing keywords for discovery.&lt;/p&gt;

&lt;p&gt;The goal isn't perfection. It's visibility. You can't fix drift you can't see.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Doesn't Solve
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;You don't need a perfect registry on Day 1.&lt;/strong&gt; A common question: "For a brownfield project with 500k lines of code, how do I start?" Start with one architecture definition for your most critical layer. Add constraints as violations surface. The registry grows from real issues, not from trying to document everything upfront. An empty registry doesn't break anything; it just means you're not getting guardrails yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ArchCodex doesn't replace security scanners.&lt;/strong&gt; It catches architectural security issues (missing permission checks, layer violations) but not injection vulnerabilities or cryptographic weaknesses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It doesn't automatically refactor code.&lt;/strong&gt; It surfaces problems. You fix them. Or the coding agent fixes them, with the constraints now visible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It requires investment.&lt;/strong&gt; You write the registry. The LLM helps, and it grows from real issues rather than from scratch. It's not zero-effort, but it might save more time than it costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It doesn't work magic on terrible codebases.&lt;/strong&gt; If your architecture is genuinely confused, ArchCodex will show you the mess. It won't clean it up for you. But it can guide refactoring.&lt;/p&gt;

&lt;p&gt;The debugging overhead is real: 67% of developers spend more time debugging AI-generated code than before (Harness). The security remediation gap is worse: only 21% of serious AI/LLM vulnerabilities are ever fixed (Cobalt).&lt;/p&gt;

&lt;p&gt;ArchCodex doesn't eliminate these problems. It addresses their root cause: AI generating code without knowing the rules.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;The research is clear: AI is making developers faster at writing code that's harder to maintain. Individual velocity is up; system health is down.&lt;/p&gt;

&lt;p&gt;I don't think ArchCodex is the only answer. But I think it points toward &lt;em&gt;an&lt;/em&gt; answer: coding agents need structured context that surfaces at the right time. They need to know what's forbidden, not just what's possible. And the teams that figure out how to capture senior expertise and make it executable, through constraints, through guardrails, through whatever comes next, will ship faster &lt;em&gt;and&lt;/em&gt; more reliably.&lt;/p&gt;

&lt;p&gt;The table saw metaphor from Part 1 still holds. The saw isn't the problem. The missing jig is.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;ArchCodex is open source.&lt;/strong&gt; It's one implementation of these ideas, not the definitive one. If you want to test the approach on your own codebase, or if you find gaps, I'd like to hear about it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ArchCodexOrg/archcodex" rel="noopener noreferrer"&gt;github.com/ArchCodexOrg/archcodex&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Google DORA, &lt;a href="https://dora.dev/research/2024/dora-report/" rel="noopener noreferrer"&gt;"Accelerate State of DevOps Report 2024"&lt;/a&gt; (Oct 2024)&lt;/li&gt;
&lt;li&gt;Google DORA, &lt;a href="https://dora.dev/research/2025/dora-report/" rel="noopener noreferrer"&gt;"State of AI-assisted Software Development 2025"&lt;/a&gt; (Sept 2025)&lt;/li&gt;
&lt;li&gt;He et al., &lt;a href="https://arxiv.org/abs/2511.04427" rel="noopener noreferrer"&gt;"Does AI-Assisted Coding Deliver? A Difference-in-Differences Study of Cursor's Impact on Software Projects,"&lt;/a&gt; Carnegie Mellon University (Nov 2025)&lt;/li&gt;
&lt;li&gt;GitClear, &lt;a href="https://www.gitclear.com/ai_assistant_code_quality_2025_research" rel="noopener noreferrer"&gt;"AI Copilot Code Quality 2025"&lt;/a&gt; (Feb 2025)&lt;/li&gt;
&lt;li&gt;CodeRabbit, &lt;a href="https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report" rel="noopener noreferrer"&gt;"State of AI vs Human Code Generation Report"&lt;/a&gt; (Dec 2025)&lt;/li&gt;
&lt;li&gt;Qodo, &lt;a href="https://www.qodo.ai/reports/state-of-ai-code-quality/" rel="noopener noreferrer"&gt;"State of AI Code Quality in 2025"&lt;/a&gt; (June 2025)&lt;/li&gt;
&lt;li&gt;Veracode, &lt;a href="https://www.veracode.com/resources/analyst-reports/2025-genai-code-security-report/" rel="noopener noreferrer"&gt;"2025 GenAI Code Security Report"&lt;/a&gt; (Aug 2025)&lt;/li&gt;
&lt;li&gt;METR, &lt;a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/" rel="noopener noreferrer"&gt;"Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity"&lt;/a&gt; (July 2025)&lt;/li&gt;
&lt;li&gt;Cobalt, &lt;a href="https://resource.cobalt.io/state-of-pentesting-2025" rel="noopener noreferrer"&gt;"State of Pentesting Report 2025"&lt;/a&gt; (Oct 2025)&lt;/li&gt;
&lt;li&gt;Harness, &lt;a href="https://www.harness.io/state-of-software-delivery" rel="noopener noreferrer"&gt;"State of Software Delivery Report 2025"&lt;/a&gt; (Jan 2025)&lt;/li&gt;
&lt;li&gt;Saito et al., &lt;a href="https://link.springer.com/article/10.1007/s00766-018-0291-4" rel="noopener noreferrer"&gt;"Discovering undocumented knowledge through visualization of agile software development activities,"&lt;/a&gt; Requirements Engineering (2018)&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;This is Part 3 of a series on AI-assisted development. Part 1 covered the benchmarks and why I built it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>architecture</category>
      <category>agents</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Built a 2300-File Codebase with AI. Here’s the Jig I Built to Prevent Architectural Drift.</title>
      <dc:creator>Stefan van Egmond </dc:creator>
      <pubDate>Wed, 28 Jan 2026 14:00:00 +0000</pubDate>
      <link>https://dev.to/stefanve/i-built-a-2300-file-codebase-with-ai-heres-the-jig-i-built-to-prevent-architectural-drift-2dk3</link>
      <guid>https://dev.to/stefanve/i-built-a-2300-file-codebase-with-ai-heres-the-jig-i-built-to-prevent-architectural-drift-2dk3</guid>
<description>&lt;p&gt;What 1500 hours of AI-assisted development taught me about the difference between code that runs and code that belongs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; ArchCodex prevents architectural drift in AI-generated code by surfacing the right constraints at the right time. Benchmarks showed: 36% lower production risk, 70% less drift, and Opus 4.5 achieved zero drift on vague tasks. Top-tier models need it for consistency. Lower-tier models need it to produce working code at all (+55pp).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ArchCodexOrg/archcodex" rel="noopener noreferrer"&gt;GitHub - ArchCodexOrg/archcodex&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is Part 1; deeper dives coming.&lt;/p&gt;




&lt;p&gt;Over 1500 hours and roughly €1200 in API costs, I built NimoNova as a side project in evenings and weekends: a 2300-file research workspace with automatic knowledge graphs, fact and timeline extraction, document analysis, and multi-tier RAG. I built it almost entirely with LLM coding assistants.&lt;/p&gt;

&lt;p&gt;The code compiled. The tests passed. Users could actually use it.&lt;/p&gt;

&lt;p&gt;But I had this nagging feeling: what if it was full of mistakes I couldn't see?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5w1bm46o9dic4e4vh9o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5w1bm46o9dic4e4vh9o.png" alt=" " width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;NimoNova: knowledge graphs extracted automatically from research sources&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With "Working Code"
&lt;/h2&gt;

&lt;p&gt;LLMs are good at writing code that seemingly &lt;em&gt;works&lt;/em&gt;. They can understand APIs, they can follow syntax, they can implement complex algorithms correctly.&lt;/p&gt;

&lt;p&gt;What they're terrible at is writing code that &lt;em&gt;belongs&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This isn't just my experience. Security researchers have identified the same pattern:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"One of the hardest risks to detect is what might be called architectural drift—subtle model-generated design changes that break security invariants without violating syntax. These changes often evade static analysis tools and human reviewers." — &lt;a href="https://www.endorlabs.com/learn/the-most-common-security-vulnerabilities-in-ai-generated-code" rel="noopener noreferrer"&gt;Endor Labs, 2025&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every codebase has patterns. Conventions. An implicit architecture that experienced developers learn by working on it, building mental models and through tribal knowledge. When you ask an LLM to add a feature, it doesn't know that your team uses &lt;code&gt;requireProjectPermission()&lt;/code&gt; instead of manual ownership checks. It doesn't know you have a mutation-per-operation convention, or that barrel exports go in sibling &lt;code&gt;index.ts&lt;/code&gt; files, or that soft-deleted records should be filtered by default (or that soft-delete is a thing).&lt;/p&gt;

&lt;p&gt;The LLM will write something that seemingly works. But it won't write something that fits.&lt;/p&gt;

&lt;p&gt;Careful prompts, multiple runs, manual reviews. All helped counter it. But when you're pumping out code at scale, things slip through. A big application with many modules and functionality will drift. This happens in human-built codebases too. The difference is that with LLMs, it happens faster and more often.&lt;/p&gt;

&lt;p&gt;And here's what made it worse: &lt;strong&gt;drift compounds&lt;/strong&gt;. When there's inconsistency in your codebase: multiple ways of doing the same thing, duplicate utilities, competing patterns, LLMs perform &lt;em&gt;worse&lt;/em&gt;. They can't pick the right approach when several exist. They copy the wrong pattern because it appeared more recently in context. The drift accelerates.&lt;/p&gt;

&lt;p&gt;One function uses the centralized permission system; another does a manual check. One module follows the established error handling pattern; another invents its own. The codebase doesn't drift all at once, it drifts one "working" commit at a time. And each drift makes the next one more likely.&lt;/p&gt;

&lt;p&gt;The analogy I like to use is the table saw. A table saw can cut anything, and that's great. But without a fence, without guides, without jigs, you get cuts that are technically correct but practically useless. Each cut is fine in isolation. Together, nothing fits.&lt;/p&gt;

&lt;p&gt;LLMs needed a jig: something to guide the cut toward what should be done, in this codebase, for this architecture. So I started building one, using LLMs both to code it and as my focus group. I call it ArchCodex.&lt;/p&gt;




&lt;h2&gt;
  
  
  Testing the Hypothesis
&lt;/h2&gt;

&lt;p&gt;The idea behind ArchCodex was simple: LLMs are good at some things and, due to inherent constraints like context windows, quite bad at others. What if I helped them? Give them the right context at the right time. Surface the patterns they should follow, exactly when they need to follow them. Make it easy to check what they've done and see what they didn't do.&lt;/p&gt;

&lt;p&gt;But I wanted to measure whether the effectiveness I thought I was experiencing was real and consistent, not just confirmation bias.&lt;/p&gt;

&lt;p&gt;So I ran multiple benchmarks. Thirty LLM runs across five models (GPT 5.1, Claude Opus 4.5, Claude Haiku 4.5, Gemini Pro 3, GLM 4.7), two different coding tools, with and without ArchCodex. Two different tasks on my actual codebase.&lt;/p&gt;

&lt;p&gt;The baseline wasn't naive. The codebase already had a solid &lt;code&gt;AGENTS.md&lt;/code&gt; with guidelines and conventions. The agents I used were Warp.dev with indexed source code (giving the LLM codebase awareness) and Claude Code. These are reasonable conditions and ArchCodex still produced significant improvements on top of them.&lt;/p&gt;

&lt;p&gt;The benchmarks covered two types of tasks. The first was a detailed prompt with explicit acceptance criteria. This showed that ArchCodex reduced production risk by 36%, dramatically improved architectural drift for top-tier models (zero-drift rates jumped from 17% to 70%), and increased working code rates by 55 percentage points for lower-tier models. But the high-level task revealed something more interesting.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How I defined Production Risk:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Silent Bugs:&lt;/strong&gt; Logic errors that pass unit tests but fail requirements (e.g., semantic drift)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loud Bugs:&lt;/strong&gt; CI failures, lint errors, broken UI or crashes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architectural Drift:&lt;/strong&gt; Violations of project conventions (e.g., not using the right utilities, wrong structure, importing code across boundaries, etc)&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
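&lt;p&gt;To make the ranking logic concrete: the article doesn't publish its exact scoring weights, so the numbers below are hypothetical, chosen only to show how a weighted score can let silent bugs and drift outweigh loud bugs:&lt;/p&gt;

```typescript
// Hypothetical weights: silent bugs cost most because they are hardest
// to catch before production; loud bugs and drift weigh less.
function productionRisk(silentBugs: number, loudBugs: number, drift: number): number {
  return silentBugs * 3 + loudBugs * 2 + drift * 2;
}

// A run with zero loud bugs but six silent failures still scores far worse
// than a run with one loud bug and one drift violation:
productionRisk(6, 0, 1); // → 20
productionRisk(0, 1, 1); // → 4
```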

&lt;h3&gt;
  
  
  The High-Level Task
&lt;/h3&gt;

&lt;p&gt;I gave the models a one-line prompt on NimoNova's actual codebase:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Add the ability to duplicate timeline entries in projects. Users should be able to duplicate an entry and have it appear right below the original."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;No acceptance criteria. No implementation hints. Just a feature request. The catch? Project timelines in NimoNova have five entry types, a chronicle section for completed items, junction tables for linked resources, and UI components across five archetypes.&lt;/p&gt;

&lt;p&gt;This is where it got interesting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opus 4.5 (no ArchCodex)&lt;/strong&gt; produced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Correct sort algorithm&lt;/li&gt;
&lt;li&gt;✅ Smallest diff (41 lines)&lt;/li&gt;
&lt;li&gt;✅ Working code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GPT 5.1 (no ArchCodex)&lt;/strong&gt; produced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Correct sort algorithm&lt;/li&gt;
&lt;li&gt;✅ Zero critical bugs&lt;/li&gt;
&lt;li&gt;✅ Working code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sounds great, right? Here's how they actually ranked:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Algorithm&lt;/th&gt;
&lt;th&gt;Critical Bugs&lt;/th&gt;
&lt;th&gt;Final Rank&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Opus 4.5 (no ArchCodex)&lt;/td&gt;
&lt;td&gt;✅ Correct&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6th&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT 5.1 (no ArchCodex)&lt;/td&gt;
&lt;td&gt;✅ Correct&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8th (LAST)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The model with zero critical (loud) bugs ranked &lt;em&gt;dead last&lt;/em&gt;, because my scoring penalized architectural drift and silent bugs. Drift is a source of bugs and unmaintainable code, and silent bugs are much harder to debug once they land in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why "Zero Bugs" Ranked Last
&lt;/h3&gt;

&lt;p&gt;GPT 5.1's code worked. It would pass QA. Users would never notice a problem.&lt;/p&gt;

&lt;p&gt;But it had six &lt;strong&gt;silent failures&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Copied user mentions to the duplicate (semantically wrong, the duplicate wasn't created by those users)&lt;/li&gt;
&lt;li&gt;Placed completed-task duplicates in the chronicle section (wrong, duplicates should start fresh)&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;inProgressSince: undefined&lt;/code&gt; for in-progress tasks (breaks duration calculations in the timeline)&lt;/li&gt;
&lt;li&gt;Missing UI wiring (the backend existed but no button triggered it across any of the five archetypes)&lt;/li&gt;
&lt;li&gt;Copied source markers (creates false backlinks in the knowledge graph)&lt;/li&gt;
&lt;li&gt;No centralized permissions (inconsistent with &lt;code&gt;requireProjectPermission()&lt;/code&gt; used everywhere else)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these would show up in compilation. Most wouldn't show up in testing. They'd ship to production and cause subtle, hard-to-debug problems weeks later.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;"deceptively correct"&lt;/strong&gt; code, the most dangerous kind, because it passes most checks except the one that matters. Silent failures don't trigger alerts. They erode trust.&lt;/p&gt;

&lt;h3&gt;
  
  
  What ArchCodex Changed
&lt;/h3&gt;

&lt;p&gt;With ArchCodex, the same models produced dramatically different results. The vague task showed where ArchCodex helps most:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric (High Level Task)&lt;/th&gt;
&lt;th&gt;With ArchCodex&lt;/th&gt;
&lt;th&gt;Without&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Architectural drift&lt;/td&gt;
&lt;td&gt;0.75 avg&lt;/td&gt;
&lt;td&gt;2.5 avg&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-70%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Loud bugs&lt;/td&gt;
&lt;td&gt;0.5 avg&lt;/td&gt;
&lt;td&gt;1.5 avg&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-67%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production risk&lt;/td&gt;
&lt;td&gt;7.75&lt;/td&gt;
&lt;td&gt;11.75&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-34%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;But the effect varied by model tier:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model Tier&lt;/th&gt;
&lt;th&gt;Primary Benefit&lt;/th&gt;
&lt;th&gt;Key Metric&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Top-tier (Opus 4.5, GPT 5.1)&lt;/td&gt;
&lt;td&gt;Drift prevention&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;-80% drift&lt;/strong&gt;, Opus 4.5 achieved zero drift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lower-tier (Haiku 4.5, GLM 4.7)&lt;/td&gt;
&lt;td&gt;Fewer crashes&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;-50% loud bugs&lt;/strong&gt;, -23% risk&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: &lt;strong&gt;top-tier models don't need ArchCodex to write working code. They need it to write code that belongs.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What the benchmarks revealed about different models:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The value of ArchCodex depends on what you're working with. Top-tier models (Opus 4.5, GPT 5.1) already produce working code. Their problem is drift. Without ArchCodex, they "creatively" deviate from your architecture. With it, zero-drift rates jumped from 17% to 70%.&lt;/p&gt;

&lt;p&gt;Lower-tier models (Haiku 4.5, Gemini Pro 3, GLM 4.7) have a different problem: they often don't produce working code at all. ArchCodex increased working code rates from 20% to 75%, a 55 percentage point improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The takeaway:&lt;/strong&gt; Top-tier models need ArchCodex for &lt;em&gt;consistency&lt;/em&gt;. Lower-tier models need it for &lt;em&gt;viability&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Opus 4.5 without ArchCodex extended an existing &lt;code&gt;createEntry&lt;/code&gt; function instead of creating a dedicated mutation. Technically clever. Algorithmically correct. But it violated the codebase's mutation-per-operation pattern, a pattern every other operation followed.&lt;/p&gt;

&lt;p&gt;With ArchCodex, the same model created a proper dedicated mutation. Not because it was told to, but because the constraints surfaced the pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  What It Didn't Fix
&lt;/h3&gt;

&lt;p&gt;ArchCodex isn't magic. The benchmarks revealed clear limitations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model capabilities are still model capabilities.&lt;/strong&gt; Haiku still made algorithm mistakes with ArchCodex. No agent (zero out of eight) discovered they needed to wire up UI components across five archetypes. Source marker filtering was a universal blind spot. ArchCodex can surface patterns; it can't upgrade a model's reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hints get ignored—especially by weaker models.&lt;/strong&gt; Only 31% of runs used &lt;code&gt;requireProjectPermission()&lt;/code&gt; even though it was in the hints. The lesson: for weaker models, hints aren't enough. If it matters, make it a constraint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Things not in the registry don't get caught.&lt;/strong&gt; Only 18% checked for deleted projects. Only 36% prevented owners from adding themselves as members. Why? Those rules weren't in the registry yet. The benchmarks became the source for new constraints, which is exactly how the system is supposed to work.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Feedback Loop: Five Questions That Improve the Registry
&lt;/h2&gt;

&lt;p&gt;Before diving into how ArchCodex works, here's the workflow that makes it evolve.&lt;/p&gt;

&lt;p&gt;After a complex session, or when the output feels off, I ask the LLM five questions:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;What information did you need that you DID get from ArchCodex?&lt;/li&gt;
&lt;li&gt;What information did you need that you DID NOT get?&lt;/li&gt;
&lt;li&gt;What information did ArchCodex provide that was irrelevant or noisy?&lt;/li&gt;
&lt;li&gt;Did you create or update any architectural specs? Why or why not?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;For the next agent working on this code, what will ArchCodex help them with?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;I don't do this every session; maybe once a week, or after a particularly gnarly feature. The answers are gold. Question 2 reveals what constraints or hints to add. Question 3 reveals what to trim. And Question 5? That's where the LLM documents patterns for &lt;em&gt;future LLMs&lt;/em&gt;. It leaves breadcrumbs. The system starts to maintain itself.&lt;/p&gt;




&lt;h2&gt;
  
  
  How ArchCodex Works
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Full documentation on &lt;a href="https://github.com/ArchCodexOrg/archcodex" rel="noopener noreferrer"&gt;GitHub - ArchCodexOrg/archcodex&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;ArchCodex is built on three ideas:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Just-In-Time Context.&lt;/strong&gt; When an LLM reads a file, it should see the rules that code should follow. ArchCodex "hydrates" minimal &lt;code&gt;@arch&lt;/code&gt; tags into full architectural context: constraints, hints, reference implementations. The context is triggered by location, not by query. Mutation file gets mutation rules; query file gets query rules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Static Enforcement.&lt;/strong&gt; Constraints are checked automatically: on save, on commit, in CI. Twenty-plus constraint types cover imports, patterns, naming, structure, and cross-file boundaries. When violations occur, error messages are actionable: "here's the alternative, here's why, here's a reference implementation."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Broad Analysis.&lt;/strong&gt; Beyond per-file checks: health metrics (override debt, coverage), garden analysis (duplicate code), type consistency (drifted definitions), and import boundary enforcement.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;@arch&lt;/code&gt; tag, &lt;code&gt;@intent&lt;/code&gt; annotations, and &lt;code&gt;@override&lt;/code&gt; exceptions make the implicit explicit. The registry is a living document that helps software engineers as well as AI agents.&lt;/p&gt;
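&lt;p&gt;To make the idea concrete, here's roughly what an annotated file and its archetype's rules might look like. This is a Python stand-in for the pattern; the archetype id, the helper's signature, and the hydrated context described in the comments are assumptions for the sketch, not ArchCodex's exact syntax:&lt;/p&gt;

```python
# Hypothetical sketch: the archetype id "backend/mutation" and the hydration
# behavior described here are illustrative, not ArchCodex's exact syntax.

# @arch backend/mutation
# On read, ArchCodex hydrates this one-line tag into the archetype's full
# context: its constraints ("call the permission helper before any write"),
# hints, and a pointer to a reference implementation. The context is keyed
# by file location, not by query.

def require_project_permission(ctx, project_id):
    """Placeholder for the centralized permission helper the article names
    (requireProjectPermission); this signature is assumed."""
    if not ctx.get("user_id"):
        raise PermissionError(f"no access to project {project_id}")

def create_entry(ctx, args):
    """One dedicated mutation per operation, the pattern the article says
    every other operation in the codebase follows."""
    require_project_permission(ctx, args["project_id"])  # constraint: check before write
    return {"ok": True}  # the actual write is elided
```

The point of the sketch is the trigger: a mutation file carries a mutation archetype, so mutation rules surface automatically whenever an agent opens it.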

&lt;h3&gt;
  
  
  The Registry as Living Documentation
&lt;/h3&gt;

&lt;p&gt;The registry isn't a one-time setup; it's an evolving artifact that grows with your codebase, codifying common mistakes and their solutions. Most updates come from mundane sources:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Registry Update&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Code review&lt;/td&gt;
&lt;td&gt;"Why did you do a manual ownership check here?"&lt;/td&gt;
&lt;td&gt;Add constraint: &lt;code&gt;require_call_before&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bug in production&lt;/td&gt;
&lt;td&gt;Soft-deleted records appeared in a query&lt;/td&gt;
&lt;td&gt;Add &lt;code&gt;require_pattern&lt;/code&gt; for query files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Onboarding friction&lt;/td&gt;
&lt;td&gt;"Where do barrel exports go?"&lt;/td&gt;
&lt;td&gt;Add hint with example&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM feedback (the 5 questions)&lt;/td&gt;
&lt;td&gt;"I didn't know you had a centralized permission helper"&lt;/td&gt;
&lt;td&gt;Add hint pointing to &lt;code&gt;requireProjectPermission()&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
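&lt;p&gt;To make the "bug in production" row concrete, here is a minimal sketch of the query-side pattern a &lt;code&gt;require_pattern&lt;/code&gt; constraint could pin down. The field and helper names are invented, not taken from the article's codebase:&lt;/p&gt;

```python
# Hypothetical sketch of the pattern a `require_pattern` constraint on query
# files could enforce: every query filters out soft-deleted records.
# Field and helper names are invented for illustration.

def exclude_soft_deleted(rows):
    """The canonical filter the constraint would require in every query file."""
    return [r for r in rows if r.get("deleted_at") is None]

def list_entries(rows):
    """A compliant query; the checker would flag any query file that
    skips the required filter pattern."""
    return exclude_soft_deleted(rows)
```

Once the bug becomes a constraint, the same class of mistake is caught on save rather than in production.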

&lt;p&gt;This compounds over time. One benchmark showed the effect clearly. Haiku 4.5, a lower-tier model, started with a base registry and couldn't produce working code on the specified task. As we added constraints based on what it got wrong:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Registry State&lt;/th&gt;
&lt;th&gt;Working?&lt;/th&gt;
&lt;th&gt;Silent Bugs&lt;/th&gt;
&lt;th&gt;Score vs Baseline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No ArchCodex&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Base Registry&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;+40%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ Security Hints&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;+48%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ Fixed Patterns&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;+68%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each iteration of the registry, each constraint added from observing mistakes, made the next run better. Those same constraints also surface similar issues across the existing codebase whenever &lt;code&gt;archcodex check --project&lt;/code&gt; runs.&lt;/p&gt;

&lt;p&gt;This is fundamentally different from traditional linters, which are typically set up once, maintained by a platform team, binary pass/fail, and focused on syntax rather than architecture. It shares some ideas with semantic linters, but it gives you fine-grained control and, crucially, supplies context alongside enforcement.&lt;/p&gt;

&lt;p&gt;The registry is more like executable architecture decision records: decisions that are &lt;em&gt;enforced&lt;/em&gt;, not just documented. When you decide "all queries must filter soft-deleted records," that decision becomes a constraint scoped to the relevant classes, models, or frontend components. When you decide "use the event system for this module instead of direct database calls," that becomes a pattern with reference implementations. The architecture isn't in a wiki that nobody reads; it's in the tool that LLMs consult on every file.&lt;/p&gt;
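&lt;p&gt;The "event system instead of direct database calls" decision can be sketched like this; the bus API and event names are invented for illustration:&lt;/p&gt;

```python
# Hypothetical sketch: routing a module's writes through an event system
# instead of direct database calls. The bus API and event names are invented.

class EventBus:
    def __init__(self):
        self.handlers = {}

    def on(self, event_type, handler):
        self.handlers.setdefault(event_type, []).append(handler)

    def emit(self, event_type, payload):
        for handler in self.handlers.get(event_type, []):
            handler(payload)

def record_entry_created(bus, entry_id):
    """The module emits an event; a single subscriber owns the database
    write. An import-boundary constraint could then forbid this module
    from importing the database layer directly."""
    bus.emit("entry.created", {"entry_id": entry_id})
```

The enforceable part is the boundary: the pattern only holds if a cross-file constraint actually blocks direct imports of the database layer.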

&lt;p&gt;Arch tags provide the architectural "why" and "what"; the code itself is the specific implementation. If you change something in the architecture (replacing utilities, strengthening constraints, etc.), running &lt;code&gt;check --project&lt;/code&gt; shows the impact of those changes and what code needs to be refactored to be compliant again. It serves as a guide not just for new functionality but also for refactoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  You're Not Starting From Scratch
&lt;/h3&gt;

&lt;p&gt;A reasonable objection: "So I have to define all these rules for my specific codebase?"&lt;/p&gt;

&lt;p&gt;Yes, and that's the point. Every codebase has an architecture: conventions, patterns, boundaries, the implicit "how we do things here." The problem is that this architecture lives in tribal knowledge, in code review comments, in the senior engineer's head, in that onboarding doc nobody updates. LLMs can't read tribal knowledge. But you don't have to write it all down at once; you improve it over time, and built-in commands make setting up an initial registry easy.&lt;/p&gt;

&lt;p&gt;In practice, registries have three layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Universal principles.&lt;/strong&gt; Things like SOLID, separation of concerns, basic hygiene. These ship with ArchCodex or are trivially shared. Inherit them and forget about them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Stack idioms.&lt;/strong&gt; Convex mutation patterns. Next.js App Router conventions. tRPC procedure structure. These can be community-maintained, shared YAML files that capture best practices for your stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: Your architecture.&lt;/strong&gt; The stuff unique to your codebase. Your permission system. Your event patterns. Your module boundaries. This is what you define and what the LLM helps you write.&lt;/p&gt;
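&lt;p&gt;Put together, a layered registry might look something like the sketch below. The key names, archetype ids, and file paths are invented; this is the shape of the three layers, not ArchCodex's actual schema:&lt;/p&gt;

```yaml
# Hypothetical registry sketch: key names, ids, and paths are invented
# for illustration, not ArchCodex's actual schema.
extends:
  - archcodex/universal         # Layer 1: SOLID, separation of concerns, hygiene
  - community/convex-idioms     # Layer 2: shared stack idioms (community YAML)

archetypes:                     # Layer 3: your architecture
  backend/mutation:
    constraints:
      - type: require_call_before        # a constraint type mentioned earlier
        call: requireProjectPermission   # the centralized permission helper
    hints:
      - "Permissions are centralized; never hand-roll ownership checks."
    reference: convex/entries/createEntry.ts   # canonical example (path assumed)
```

Note the hint-versus-constraint split: the permission helper shows up both as a hint and as a hard constraint, since the benchmarks showed weaker models ignore hints alone.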

&lt;p&gt;Your architecture already exists. It's just scattered. ArchCodex gives you a place to put it, and the LLM helps document it. Every rule you add prevents a class of drift.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Happened When I Applied It At Scale
&lt;/h2&gt;

&lt;p&gt;Applying ArchCodex to NimoNova's ~2200 files took a couple of evenings and a weekend. The initial scan was sobering: many hundreds of warnings. Drift everywhere. Duplicate utilities, diverged type definitions, inconsistent permission checks.&lt;/p&gt;

&lt;p&gt;ArchCodex guided major refactoring: event-driven migration for excessive database calls, security hardening for inconsistent permissions, code duplication cleanup via &lt;code&gt;garden&lt;/code&gt; and &lt;code&gt;types&lt;/code&gt; analysis, and target architecture enforcement to show where reality diverged from intent.&lt;/p&gt;

&lt;p&gt;After the benchmarks, the registry got updated based on the common mistakes the agents made: patterns that hadn't been checked for, or that hadn't emerged before. Running it again on the already-refactored codebase:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;archcodex check &lt;span class="nt"&gt;--project&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;15 errors. 225 warnings.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In code that had already been cleaned up. The benchmarks had revealed what to look for, and now a whole new category of issues was visible.&lt;/p&gt;

&lt;p&gt;Now when an LLM adds a feature, it sees the constraints. It follows the patterns. Not because of a longer prompt, but because the architecture is explicit.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;p&gt;Here's what 1500 hours of AI-assisted development taught me:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLMs are power tools. Power tools are dangerous without jigs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ArchCodex is the fence, the guide, the jig. It doesn't limit what the LLM can do; it guides the cut toward what should be done, in this codebase, for this architecture. And it helps software engineers and architects maintain a shared understanding of the architecture, navigate refactoring, and find architectural issues.&lt;/p&gt;

&lt;p&gt;The benchmarks proved something I suspected but wanted to confirm: &lt;strong&gt;the gap between "working code" and "good code" is hard to enforce, or even guide, with traditional tools.&lt;/strong&gt; Compilation, tests, even manual QA: they catch the loud failures. The silent ones compound until your codebase becomes the thing everyone dreads touching. This isn't unique to AI coding, of course; anyone who's worked on large enterprise applications will recognize the pattern.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;ArchCodex is released as open source, for anyone to test, change, fork, benchmark, and use. Let me know the results :)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ArchCodexOrg/archcodex" rel="noopener noreferrer"&gt;GitHub - ArchCodexOrg/archcodex&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
