<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Stefan van Egmond</title>
    <description>The latest articles on DEV Community by Stefan van Egmond (@stefanve).</description>
    <link>https://dev.to/stefanve</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3737428%2Fd53c40ad-0cd3-4936-941e-ae8a7eec1d70.png</url>
      <title>DEV Community: Stefan van Egmond</title>
      <link>https://dev.to/stefanve</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/stefanve"/>
    <language>en</language>
    <item>
      <title>Structure Beats Prose: Specs for Coding Agents That Actually Work</title>
      <dc:creator>Stefan van Egmond</dc:creator>
      <pubDate>Wed, 11 Feb 2026 18:35:36 +0000</pubDate>
      <link>https://dev.to/stefanve/structure-beats-prose-specs-for-coding-agents-that-actually-work-eln</link>
      <guid>https://dev.to/stefanve/structure-beats-prose-specs-for-coding-agents-that-actually-work-eln</guid>
      <description>&lt;p&gt;&lt;strong&gt;Part 3: From architectural guardrails to deterministic feature implementation&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;In &lt;a href="https://medium.com/@stefanvanegmond/i-built-a-2300-file-codebase-with-ai-heres-the-jig-i-built-to-prevent-architectural-drift-56453fe2d5b4" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt;, I showed that AI-generated code drifts, one "working" commit at a time, and built ArchCodex to surface the right architectural constraints at the right time. In &lt;a href="https://medium.com/@stefanvanegmond/coding-agents-need-more-than-examples-they-need-guardrails-1b3f71bc2c1d" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt;, I dug into the research and explained how boundaries, constraints, and canonical examples create a feedback loop that keeps drift in check.&lt;/p&gt;

&lt;p&gt;But ArchCodex answers &lt;em&gt;how&lt;/em&gt; code should be structured. It doesn't answer &lt;em&gt;what&lt;/em&gt; code should do.&lt;/p&gt;

&lt;p&gt;When I asked 20 AI agents to implement the same feature, "Add the ability to duplicate timeline entries," the ones with ArchCodex produced better-structured code. They still disagreed about what "duplicate" means. Should tags be copied? Should the status reset? Should due dates carry over? Each agent made different assumptions, and each assumption was reasonable. That's the problem.&lt;/p&gt;

&lt;p&gt;I needed a way to make those decisions explicit, readable for both humans and machines, and verifiable. So I built SpecCodex.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem with Natural Language Specs
&lt;/h2&gt;

&lt;p&gt;The instinct is to write a detailed specification in prose. "When duplicating an entry, the system should copy the title with a '(copy)' suffix, reset the status to 'todo', clear tags, remove the due date, and place the duplicate immediately below the original."&lt;/p&gt;

&lt;p&gt;This seems reasonable, and there are tools that take this approach. GitHub's Spec Kit, for example, generates natural language specifications for coding agents. But natural language specs have problems that compound as features grow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Natural language is ambiguous, for everyone.&lt;/strong&gt; "Clear tags": does that mean set to an empty array, or remove the field entirely? "Immediately below": does that mean sort order &lt;code&gt;original + 1&lt;/code&gt;, or insert at &lt;code&gt;original + 0.5&lt;/code&gt;? An LLM picks an interpretation silently and moves on.&lt;/p&gt;
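&lt;p&gt;To make that ambiguity concrete: here are two implementations of "clear tags" that both satisfy the prose (the &lt;code&gt;Entry&lt;/code&gt; shape is a hypothetical stand-in, not the real schema):&lt;/p&gt;

```typescript
// Hypothetical entry shape, for illustration only.
type Entry = { title: string; tags?: string[] };

// Interpretation 1: "clear tags" means set to an empty array.
function duplicateA(original: Entry): Entry {
  return { ...original, title: `${original.title} (copy)`, tags: [] };
}

// Interpretation 2: "clear tags" means remove the field entirely.
function duplicateB(original: Entry): Entry {
  const copy: Entry = { ...original, title: `${original.title} (copy)` };
  delete copy.tags;
  return copy;
}

const original: Entry = { title: "Task", tags: ["urgent"] };
console.log(duplicateA(original).tags);      // []
console.log("tags" in duplicateB(original)); // false
```

&lt;p&gt;Both pass a prose review; a consumer that reads &lt;code&gt;entry.tags.length&lt;/code&gt; crashes on exactly one of them.&lt;/p&gt;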

&lt;p&gt;&lt;strong&gt;Prose specs are hard to scan.&lt;/strong&gt; A well-written natural-language spec for the duplicate feature runs 800 to 1200 tokens. Much of that is connective tissue: "the system should," "in the case where," "it is important to note that." For an LLM, those wasted tokens compete with file contents, architectural constraints, and conversation history in a limited context window. For a developer, that's two to three pages of text where the actual decisions are buried in paragraphs. When a codebase has dozens of features, natural language specs become a documentation mountain that nobody reads end to end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Natural language can't be tested.&lt;/strong&gt; You can review a prose spec. You can't run it. There's no easy deterministic way to verify that the code matches the spec. This is the fundamental limitation: natural language specs look like they solve the "what should this do" problem, but they just move it from "the LLM guesses" to "the LLM interprets." The variance is smaller, but it's still there.&lt;/p&gt;




&lt;h2&gt;
  
  
  What You Spec Is What You Get
&lt;/h2&gt;

&lt;p&gt;The SpecCodex schema draws from patterns LLMs already know deeply from training data: &lt;a href="https://docs.pact.io/" rel="noopener noreferrer"&gt;Pact&lt;/a&gt; contracts, &lt;a href="https://en.wikipedia.org/wiki/Design_by_contract" rel="noopener noreferrer"&gt;Design by Contract&lt;/a&gt; invariants, &lt;a href="https://en.wikipedia.org/wiki/Specification_by_example" rel="noopener noreferrer"&gt;Specification by Example&lt;/a&gt; given/then pairs. This isn't just familiar syntax. LLMs have learned the associations between these formal specification patterns and their implementations. When the model sees &lt;code&gt;invariants&lt;/code&gt; with &lt;code&gt;@length(0)&lt;/code&gt;, it doesn't need instructions on what to produce; the mapping from spec to code is already in the weights. The schema exploits that. It's prompt engineering at the architectural level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pattern isn't the tool.&lt;/strong&gt; What matters is: make decisions explicit in a parseable format, co-author with the LLM, and verify deterministically. If you use OpenAPI, your API spec is already a structured specification; generate contract tests from it. If you use Prisma or Drizzle, your schema is a specification; generate integration tests from it. If you use TypeScript interfaces for component contracts, those are specifications too. SpecCodex provides an opinionated full-stack schema that covers backend, frontend, security, and effects in one place. The benchmarks below prove the pattern works. The tool is one way to implement it.&lt;/p&gt;
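&lt;p&gt;The pattern is small enough to sketch without any tooling. Here a schema-as-data object drives mechanically generated contract checks (the field list and type tags are illustrative, not from a real project):&lt;/p&gt;

```typescript
// A tiny schema-as-data: field name -> expected runtime type tag.
const entrySchema: Record<string, string> = {
  title: "string",
  status: "string",
  tags: "array",
};

// Generate one contract check per schema field -- mechanically, with no
// interpretation step between the schema and the checks.
function contractChecks(schema: Record<string, string>) {
  return Object.entries(schema).map(([field, kind]) => (value: Record<string, unknown>) => {
    const actual = Array.isArray(value[field]) ? "array" : typeof value[field];
    if (actual !== kind) throw new Error(`${field}: expected ${kind}, got ${actual}`);
  });
}

const checks = contractChecks(entrySchema);
checks.forEach((check) => check({ title: "Task (copy)", status: "todo", tags: [] })); // passes
```

&lt;p&gt;An OpenAPI document or a Prisma schema is the same move with a richer vocabulary: the structured artifact already exists, so the tests can be derived rather than written.&lt;/p&gt;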

&lt;p&gt;Here's what the schema looks like in practice (abbreviated; the full spec includes 7 touchpoints and additional examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec.timeline.duplicateEntry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;inherits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spec.mutation&lt;/span&gt;
  &lt;span class="na"&gt;mixins&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;requires_auth&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;logs_audit&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;has_timestamps&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;implementation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;convex/projects/timeline/mutations.ts#duplicateEntry&lt;/span&gt;

  &lt;span class="na"&gt;goal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;duplicate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;an&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;existing&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;timeline&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;entry"&lt;/span&gt;
  &lt;span class="na"&gt;intent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Copy&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;core&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;entry&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;position&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;below&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;original,&lt;/span&gt;
           &lt;span class="s"&gt;provide&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fresh&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;transient&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fields"&lt;/span&gt;

  &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;entryId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Id&amp;lt;"projectTimelineEntries"&amp;gt;&lt;/span&gt;
      &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="na"&gt;invariants&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Title&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;suffixed&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(copy)"&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result.title"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@endsWith('&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(copy)')"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Same&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;entry&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;as&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;original"&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result.entryType"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@equals(original.entryType)"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Status&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;reset&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;todo&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tasks"&lt;/span&gt;
      &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;original.entryType&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;===&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'task'"&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result.status"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;todo"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Duplicate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;has&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;empty&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(fresh&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;start)"&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result.tags"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@length(0)"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mentions&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;reset&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(no&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;re-notifications)"&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result.mentions"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@length(0)"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sort&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;order&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;places&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;duplicate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;below&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;original"&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result.sortOrder"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@gt(original.sortOrder)"&lt;/span&gt;

  &lt;span class="na"&gt;effects&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Creates&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;new&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;timeline&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;entry"&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;projectTimelineEntries"&lt;/span&gt;
      &lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insert"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Creates&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;junction&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;entries&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;linkedResources"&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;projectTimelineEntryAttachments"&lt;/span&gt;
      &lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insert"&lt;/span&gt;
      &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;original.linkedResources.length&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Logs&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;activity&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;new&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;entry"&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;projectTimelineEntryActivity"&lt;/span&gt;
      &lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insert"&lt;/span&gt;
      &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;duplicatedFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@string(original._id)"&lt;/span&gt;

  &lt;span class="na"&gt;hooks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;hook&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;useTimelineEntryMutations&lt;/span&gt;
      &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;src/hooks/projects/useTimelineEntryMutations.ts&lt;/span&gt;
      &lt;span class="na"&gt;change&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Add&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;duplicateEntry&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;mutation&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;binding"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;hook&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;useTimelineEntryHandlers&lt;/span&gt;
      &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;src/components/projects/planning/useTimelineEntryHandlers.ts&lt;/span&gt;
      &lt;span class="na"&gt;change&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Add&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;handleDuplicate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;callback&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;using&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;duplicateEntry&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;mutation"&lt;/span&gt;

  &lt;span class="na"&gt;touchpoints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;component&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TaskArchetype&lt;/span&gt;
      &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;src/components/projects/planning/archetypes/TaskArchetype.tsx&lt;/span&gt;
      &lt;span class="na"&gt;change&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Wire&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;onDuplicate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;handlers&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;getMenuItems"&lt;/span&gt;
      &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TODO&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;component&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NoteArchetype&lt;/span&gt;
      &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;src/components/projects/planning/archetypes/NoteArchetype.tsx&lt;/span&gt;
      &lt;span class="na"&gt;change&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Add&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Duplicate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;menu&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;item&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Copy&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;icon&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;custom&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;menu"&lt;/span&gt;
      &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TODO&lt;/span&gt;
    &lt;span class="c1"&gt;# ... 5 more components&lt;/span&gt;

  &lt;span class="na"&gt;examples&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duplicate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;task"&lt;/span&gt;
        &lt;span class="na"&gt;given&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;entryId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@validEntryId"&lt;/span&gt;
          &lt;span class="na"&gt;original&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Original&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Task"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;entryType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;done"&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
        &lt;span class="na"&gt;then&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;result._id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@exists"&lt;/span&gt;
          &lt;span class="na"&gt;result.title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Original&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Task&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(copy)"&lt;/span&gt;
          &lt;span class="na"&gt;result.status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;todo"&lt;/span&gt;
          &lt;span class="na"&gt;result.tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@length(0)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things to notice.&lt;/p&gt;

&lt;p&gt;Notice the invariants: every decision is explicit (&lt;code&gt;@length(0)&lt;/code&gt;, not "cleared"), conditional logic is visible (&lt;code&gt;condition: entryType === 'task'&lt;/code&gt;), and each assertion maps mechanically to a test. This isn't documentation; it's a test specification that hasn't been compiled yet.&lt;/p&gt;

&lt;p&gt;Notice the touchpoints: exact file paths, not descriptions. This turned out to be the critical difference between specs that worked for backend only and specs that worked end-to-end.&lt;/p&gt;




&lt;h2&gt;
  
  
  Writing Specs with the LLM: The Discovery Loop
&lt;/h2&gt;

&lt;p&gt;The schema is designed to be &lt;em&gt;co-authored&lt;/em&gt;. You don't sit down and fill it out like a form. You describe what you want, and the LLM drafts the spec, drawing on its knowledge of the codebase.&lt;/p&gt;

&lt;p&gt;The workflow looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;You describe the feature&lt;/strong&gt; in natural language. "I want to duplicate timeline entries."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The LLM drafts a spec&lt;/strong&gt; in the SpecCodex schema. It uses ArchCodex's entity context to look up the schema, relationships, and existing patterns in your codebase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You review and refine.&lt;/strong&gt; "Actually, don't copy tags. Users want a fresh start." "Reset status to todo, but only for tasks."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The LLM updates the spec.&lt;/strong&gt; Now the decisions are locked in and visible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The LLM implements from the spec.&lt;/strong&gt; Not from the original prompt. From the agreed-upon specification.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is where the LLM's discovery power actually shines. When drafting the spec, the LLM surfaces questions you haven't thought of yet: "What happens when someone duplicates a task that's in a milestone?" "Should the duplicate inherit the parent's position in the Gantt chart?" "The schema shows a &lt;code&gt;linkedResources&lt;/code&gt; relation; should those be copied or just the references?" These questions come up at spec-writing time, when answering them is free, instead of at code-review time, when the wrong answer is already baked into the implementation.&lt;/p&gt;

&lt;p&gt;Because the spec is structured, you can see exactly what rules the LLM is proposing. If the invariants section doesn't mention sort ordering, you know the LLM hasn't thought about positioning. If there's no conditional on entry type, you know task-specific behavior will be missed. The gaps are visible because the schema defines what a complete spec &lt;em&gt;looks like&lt;/em&gt;. A natural language spec can feel complete while omitting entire categories of decisions. A structured spec with an empty &lt;code&gt;effects&lt;/code&gt; section is obviously incomplete.&lt;/p&gt;
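&lt;p&gt;That visibility is cheap to automate. A minimal completeness check, assuming the spec file has already been parsed into an object (the required-section list is illustrative; a real linter would follow the full schema):&lt;/p&gt;

```typescript
// Sections a complete mutation spec is expected to fill.
const REQUIRED_SECTIONS = ["inputs", "invariants", "effects", "examples"] as const;

// Flag sections that are missing or empty. An empty `effects` section
// becomes a visible gap instead of a silent omission.
function incompleteSections(spec: Record<string, unknown>): string[] {
  return REQUIRED_SECTIONS.filter((section) => {
    const value = spec[section];
    if (value == null) return true;
    if (Array.isArray(value)) return value.length === 0;
    if (typeof value === "object") return Object.keys(value).length === 0;
    return false;
  });
}

console.log(incompleteSections({
  inputs: { entryId: { type: "Id", required: true } },
  invariants: [{ description: "Title suffixed with (copy)" }],
  effects: [],
  examples: { success: [{ name: "duplicate task" }] },
})); // -> [ 'effects' ]
```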




&lt;h2&gt;
  
  
  Deterministic Verification: Testing What Was Built
&lt;/h2&gt;

&lt;p&gt;Here's the payoff of making specs parseable rather than prose: you can &lt;em&gt;mechanically verify&lt;/em&gt; what the agent built. This is the fundamental difference between structured specs and natural language specs. With prose, the only verification is you reading the code (or tests) and comparing it to the document. With the SpecCodex schema, verification can be deterministic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test generation from specs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The spec compiles directly to executable tests.&lt;/strong&gt; No LLM is involved in the translation, so there is no interpretation variance: the same spec always produces the same tests.&lt;/p&gt;

&lt;p&gt;This works because the schema includes a typed placeholder DSL for both generating test inputs and asserting on outputs. In &lt;code&gt;given:&lt;/code&gt; blocks, placeholders like &lt;code&gt;@string(100)&lt;/code&gt;, &lt;code&gt;@authenticated&lt;/code&gt;, and &lt;code&gt;@array(3, { name: '@string(10)' })&lt;/code&gt; generate concrete, deterministic test data. In &lt;code&gt;then:&lt;/code&gt; blocks, &lt;code&gt;@exists&lt;/code&gt;, &lt;code&gt;@length(0)&lt;/code&gt;, &lt;code&gt;@gt(N)&lt;/code&gt;, and &lt;code&gt;@contains('copy')&lt;/code&gt; each compile to exactly one &lt;code&gt;expect()&lt;/code&gt; call. There's no interpretation step. &lt;code&gt;@length(0)&lt;/code&gt; always becomes &lt;code&gt;expect(x).toHaveLength(0)&lt;/code&gt;, every time, in every project.&lt;/p&gt;
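&lt;p&gt;The assertion half of such a DSL needs very little machinery. A sketch covering a few of the placeholders above; it returns runtime predicates where a real pipeline would emit &lt;code&gt;expect()&lt;/code&gt; calls, and the dispatch table is illustrative rather than SpecCodex's actual compiler:&lt;/p&gt;

```typescript
// Compile one `then:` placeholder into a runtime check.
// Each placeholder maps to exactly one predicate -- no interpretation step.
function compileAssertion(placeholder: string): (actual: unknown) => boolean {
  if (placeholder === "@exists") {
    return (a) => a !== undefined && a !== null;
  }
  const length = /^@length\((\d+)\)$/.exec(placeholder);
  if (length) {
    const n = Number(length[1]);
    return (a) => Array.isArray(a) && a.length === n;
  }
  const gt = /^@gt\((-?\d+(?:\.\d+)?)\)$/.exec(placeholder);
  if (gt) {
    const n = Number(gt[1]);
    return (a) => typeof a === "number" && a > n;
  }
  const endsWith = /^@endsWith\('(.*)'\)$/.exec(placeholder);
  if (endsWith) {
    const suffix = endsWith[1];
    return (a) => typeof a === "string" && a.endsWith(suffix);
  }
  // Anything else is treated as a literal expected value.
  return (a) => a === placeholder;
}

console.log(compileAssertion("@length(0)")([]));                      // true
console.log(compileAssertion("@endsWith(' (copy)')")("Task (copy)")); // true
console.log(compileAssertion("todo")("todo"));                        // true
```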

&lt;p&gt;Different sections of the spec feed different kinds of tests. Examples become unit tests, one &lt;code&gt;it()&lt;/code&gt; block per &lt;code&gt;given/then&lt;/code&gt; pair. Invariants become property tests via fast-check, verifying that properties hold for all valid inputs, not just the examples you thought of. Effects become integration tests that verify database writes and audit logs. Touchpoints become UI interaction tests.&lt;/p&gt;
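&lt;p&gt;The property layer can be sketched without fast-check, too. Here the duplicate invariants are checked against randomized inputs; &lt;code&gt;duplicateEntry&lt;/code&gt; is a hypothetical pure stand-in for the real mutation, written to satisfy the spec above:&lt;/p&gt;

```typescript
// Hypothetical pure core of the mutation, for illustration only.
type Entry = {
  title: string;
  entryType: "task" | "note";
  status: string;
  tags: string[];
  sortOrder: number;
};

function duplicateEntry(original: Entry): Entry {
  return {
    ...original,
    title: `${original.title} (copy)`,
    status: original.entryType === "task" ? "todo" : original.status,
    tags: [],
    sortOrder: original.sortOrder + 1,
  };
}

// Property check: the invariants must hold for *all* inputs,
// not just the handful of examples in the spec.
for (let i = 0; i < 500; i++) {
  const original: Entry = {
    title: Math.random().toString(36).slice(2),
    entryType: Math.random() < 0.5 ? "task" : "note",
    status: ["todo", "doing", "done"][Math.floor(Math.random() * 3)],
    tags: ["a", "b"].slice(0, Math.floor(Math.random() * 3)),
    sortOrder: Math.random() * 1000,
  };
  const copy = duplicateEntry(original);
  if (!copy.title.endsWith(" (copy)")) throw new Error("title invariant");
  if (original.entryType === "task" && copy.status !== "todo") throw new Error("status invariant");
  if (copy.tags.length !== 0) throw new Error("tags invariant");
  if (!(copy.sortOrder > original.sortOrder)) throw new Error("sortOrder invariant");
}
console.log("all invariants held for 500 random entries");
```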

&lt;p&gt;Here's a concrete example. This spec fragment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;examples&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duplicate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;task"&lt;/span&gt;
      &lt;span class="na"&gt;given&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;entryId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@validEntryId"&lt;/span&gt;
        &lt;span class="na"&gt;original&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Original&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Task"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;entryType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;done"&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
      &lt;span class="na"&gt;then&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;result._id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@exists"&lt;/span&gt;
        &lt;span class="na"&gt;result.title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Original&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Task&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(copy)"&lt;/span&gt;
        &lt;span class="na"&gt;result.status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;todo"&lt;/span&gt;
        &lt;span class="na"&gt;result.tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@length(0)"&lt;/span&gt;
  &lt;span class="na"&gt;errors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unauthenticated"&lt;/span&gt;
      &lt;span class="na"&gt;given&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
      &lt;span class="na"&gt;then&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NOT_AUTHENTICATED"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compiles to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;duplicate task&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;original&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;createEntry&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Original Task&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;entryType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;task&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;done&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;duplicateEntry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;original&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBeDefined&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Original Task (copy)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;todo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toHaveLength&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;unauthenticated&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;duplicateEntry&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rejects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toThrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;NOT_AUTHENTICATED&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The translation is mechanical. The spec is written collaboratively (with all the benefits of the discovery loop), but the tests are compiled deterministically (with none of the variance of AI-generated test code). This closes the loop: the LLM writes the implementation, the spec generates tests that verify it, and the results are pass/fail.&lt;/p&gt;
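&lt;p&gt;The invariant-to-property-test path follows the same mechanical pattern. Here is a dependency-free sketch of the idea; in the real pipeline fast-check would supply the generators, and &lt;code&gt;duplicateTitle&lt;/code&gt; is a hypothetical pure helper, not actual SpecCodex output:&lt;/p&gt;

```typescript
// Sketch only: how an invariant like "a duplicated title always ends
// with ' (copy)'" might compile to a property test. fast-check would
// supply the generators in practice; a naive random-string generator
// stands in here. `duplicateTitle` is a hypothetical helper.
function duplicateTitle(title: string): string {
  return title + " (copy)";
}

function randomTitle(): string {
  const len = Math.floor(Math.random() * 20);
  let s = "";
  for (let i = 0; i !== len; i++) {
    s += String.fromCharCode(32 + Math.floor(Math.random() * 95));
  }
  return s;
}

// invariant: for ALL titles, the duplicate's title ends with " (copy)"
for (let run = 0; run !== 200; run++) {
  const title = randomTitle();
  if (!duplicateTitle(title).endsWith(" (copy)")) {
    throw new Error(`invariant violated for ${JSON.stringify(title)}`);
  }
}
```

&lt;p&gt;Instead of checking a handful of hand-picked examples, the loop probes hundreds of generated inputs against the invariant.&lt;/p&gt;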

&lt;h3&gt;
  
  
  Static analysis across specs
&lt;/h3&gt;

&lt;p&gt;Because specs are structured, you can also run static analysis across them &lt;em&gt;before any code is written&lt;/em&gt;. SpecCodex's analyzer builds a cross-reference graph across your entire spec registry: which specs write to which database tables, which specs read from which tables, which specs depend on each other, which specs share entities. Then it runs 65 checkers across six categories against this graph.&lt;/p&gt;

&lt;p&gt;For example, a checker sees a spec with &lt;code&gt;authentication: none&lt;/code&gt; combined with a database &lt;code&gt;insert&lt;/code&gt; effect and flags it: you're writing to a table without auth. Another sees two specs that both write to the same table with different field assumptions and flags a potential consistency issue. Another sees a CRUD entity with create, read, and delete specs but no update, flagging incomplete coverage. None of this requires running code. It's static analysis for &lt;em&gt;designs&lt;/em&gt;, not implementations.&lt;/p&gt;
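&lt;p&gt;As a sketch, the unauthenticated-write checker described above might look like this. The &lt;code&gt;Spec&lt;/code&gt; shape is assumed for illustration and may differ from SpecCodex's actual schema:&lt;/p&gt;

```typescript
// Illustrative checker: flag specs that declare no authentication but
// still write to a table. The Spec/Effect shapes are assumptions, not
// SpecCodex's real schema.
type Effect = { kind: "insert" | "update" | "delete" | "read"; table: string };
type Spec = { id: string; authentication: "required" | "none"; effects: Effect[] };

function checkUnauthenticatedWrites(specs: Spec[]): string[] {
  const findings: string[] = [];
  for (const spec of specs) {
    if (spec.authentication !== "none") continue;
    for (const effect of spec.effects) {
      if (effect.kind === "read") continue; // reads are a separate concern
      findings.push(
        `${spec.id}: ${effect.kind} on "${effect.table}" with authentication: none`
      );
    }
  }
  return findings;
}
```

&lt;p&gt;Run over a registry, a spec combining &lt;code&gt;authentication: none&lt;/code&gt; with an &lt;code&gt;insert&lt;/code&gt; effect produces a finding before a line of implementation exists.&lt;/p&gt;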

&lt;h3&gt;
  
  
  Deep mode: verifying code against specs
&lt;/h3&gt;

&lt;p&gt;The base analyzer reasons about specs in isolation. Deep mode goes further: it reads the actual implementation source files and compares them against what the specs claim. The spec says &lt;code&gt;authentication: required&lt;/code&gt;; does the code actually check the user? The spec says &lt;code&gt;permissions: ["bookmark.edit"]&lt;/code&gt;; does the code check that permission, or did it drift to checking &lt;code&gt;"admin"&lt;/code&gt; instead?&lt;/p&gt;

&lt;p&gt;Deep mode uses configurable regex patterns grouped into six categories: auth checks, ownership checks, permission calls, soft-delete filters, database queries, and record fetches. You define these patterns per project because every framework looks different. A Convex project checks for &lt;code&gt;ctx.userId&lt;/code&gt;; an Express project checks for &lt;code&gt;req.user&lt;/code&gt;; a Django project checks for &lt;code&gt;request.user&lt;/code&gt;. The patterns are different, but the security question is the same: does the code verify what the spec requires?&lt;/p&gt;
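&lt;p&gt;A minimal sketch of that per-project configuration, with illustrative pattern values rather than ArchCodex defaults:&lt;/p&gt;

```typescript
// Sketch: per-project auth-check patterns for deep mode. The pattern
// values are illustrative, not ArchCodex's shipped configuration.
const authPatterns: { [framework: string]: RegExp } = {
  convex: /ctx\.userId/,
  express: /req\.user/,
  django: /request\.user/,
};

// A spec that says `authentication: required` passes only if the
// project's auth pattern appears somewhere in the implementation.
function verifyAuthCheck(framework: string, source: string): boolean {
  const pattern = authPatterns[framework];
  if (!pattern) throw new Error(`no auth pattern configured for ${framework}`);
  return pattern.test(source);
}
```

&lt;p&gt;The same structure extends to the other five categories: each is just a named bag of patterns the analyzer tests the source against.&lt;/p&gt;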

&lt;p&gt;This catches a specific class of bugs that are nearly invisible in code review. When a spec says the user can only update their own records, deep mode checks whether the code both fetches the record &lt;em&gt;and&lt;/em&gt; compares ownership. When a spec implies soft-delete semantics, deep mode checks whether queries actually filter out deleted records. When a spec declares a permission, deep mode extracts the permission string from the code and compares it to the spec, catching permission drift.&lt;/p&gt;

&lt;p&gt;The full verification stack is layered intentionally. Test generation catches behavioral drift (does the code do what the spec says?). Static analysis catches design gaps (is the spec itself consistent and complete?). Deep mode catches implementation drift (has the code diverged from the spec?). Together, they turn a structured spec into a continuous verification system rather than a document that goes stale.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Benchmark
&lt;/h2&gt;

&lt;p&gt;To validate this approach, I ran the same feature request (duplicate timeline entries) across 20 AI agents with different configurations.&lt;/p&gt;

&lt;p&gt;The feature was chosen because it's deceptively complex. "Duplicate a timeline entry" sounds like a single mutation, but a complete implementation touches 11 files across four layers: the backend mutation and its barrel export, two hook files for binding and handling, a type contract update, a controller wiring change, and five separate UI archetype components that each need menu updates. Most agents discovered the first six files naturally by following imports. The five archetype files, delegated components that don't appear in the obvious import chain, are where implementations diverged.&lt;/p&gt;

&lt;p&gt;The results support the claims above, but they also revealed things I didn't expect.&lt;/p&gt;

&lt;h3&gt;
  
  
  The assumption problem disappears
&lt;/h3&gt;

&lt;p&gt;Without a spec, every agent made reasonable but different decisions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;A1&lt;/th&gt;
&lt;th&gt;A3&lt;/th&gt;
&lt;th&gt;A5&lt;/th&gt;
&lt;th&gt;A7&lt;/th&gt;
&lt;th&gt;A11&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Copy tags?&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copy due dates?&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copy assignee?&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reset status?&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copy attachments?&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every answer is defensible. None is what we wanted.&lt;/p&gt;

&lt;p&gt;With SpecCodex, backend adherence went to 100%. Not improved. &lt;strong&gt;Identical.&lt;/strong&gt; Every agent with the spec produced the same field handling, the same sort order logic, the same audit logging. The spec didn't guide the agent; it constrained it.&lt;/p&gt;

&lt;p&gt;Silent bugs (wrong data copied, missing features, semantic errors) dropped by 75%:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Group&lt;/th&gt;
&lt;th&gt;Avg Silent Bugs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No tooling&lt;/td&gt;
&lt;td&gt;4.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ArchCodex only&lt;/td&gt;
&lt;td&gt;2.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ArchCodex + SpecCodex&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The remaining silent bugs in the SpecCodex group were all UI-related, which leads to the next finding.&lt;/p&gt;

&lt;h3&gt;
  
  
  File paths matter more than descriptions
&lt;/h3&gt;

&lt;p&gt;This was the most surprising finding, and the most actionable. The spec went through three versions. Versions 1 and 2 had invariants, effects, and hooks but described UI changes vaguely. Both produced 0% UI wiring success. Agents with perfect backends didn't touch the right UI files.&lt;/p&gt;

&lt;p&gt;The breakthrough came with v3, which added explicit file paths to touchpoints:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Spec Version&lt;/th&gt;
&lt;th&gt;Touchpoint Format&lt;/th&gt;
&lt;th&gt;UI Success&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;v1&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;"Update useTimelineEntryHandlers hook"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;"file: src/components/.../NoteArchetype.tsx"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;100% (Opus)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Component names weren't enough. Hook names weren't enough. Only full paths worked. If you take one thing from this post for your own specs: when a feature touches multiple files, give the agent the exact path, not a description of where to look.&lt;/p&gt;

&lt;p&gt;This also revealed a capability ceiling: with v3, Opus achieved 5/5 UI components wired correctly while Haiku produced a perfect backend but 0/5 UI. The spec format works for both models on the backend; UI wiring across multiple files requires a more capable model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lucky outcomes vs. reliable processes
&lt;/h3&gt;

&lt;p&gt;The best agent without specs (Opus + ArchCodex, no spec) scored the same as the best agent with specs on production risk. But the unspecified agent's success was &lt;em&gt;emergent&lt;/em&gt;: it happened to explore the right files and make the right assumptions. Run it again and you might get a different result. The specified agent's success was &lt;em&gt;deterministic&lt;/em&gt;: the spec locked in every decision. Run it ten times and you get the same outcome. The difference between a lucky result and a reliable process.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Arc of the Series
&lt;/h2&gt;

&lt;p&gt;The pattern across the three-part series is clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 1&lt;/strong&gt;: LLMs write code that works but doesn't &lt;em&gt;fit&lt;/em&gt;. Architectural drift is invisible and compounds. ArchCodex makes it visible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2&lt;/strong&gt;: The research confirms this at scale. Structured guardrails (boundaries, constraints, canonical examples) reduce drift systematically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3&lt;/strong&gt;: &lt;em&gt;What&lt;/em&gt; the code does matters as much as &lt;em&gt;how&lt;/em&gt; it's structured. A purpose-built specification schema, co-authored with the LLM and verified deterministically, eliminates assumption variance and makes every decision visible before code is written.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The table saw metaphor still holds. ArchCodex is the fence; it keeps the cut straight. SpecCodex is the blueprint; it defines where the cut goes. Without both, you're measuring twice and still cutting wrong, because the LLM and you have different measurements in mind.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;The practice is: structure your specs, make them parseable, verify deterministically. You can apply that with whatever tools fit your stack.&lt;/p&gt;

&lt;p&gt;If you want an opinionated implementation that covers the full stack, SpecCodex is part of ArchCodex:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ArchCodexOrg/archcodex" rel="noopener noreferrer"&gt;github.com/ArchCodexOrg/archcodex&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Start with one spec for your next feature. Write it with the LLM. See if the implementation matches. I think you'll find it changes how you think about AI-assisted development: not as "generate and review" but as "specify and verify."&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 3 of a series on AI-assisted development. &lt;a href="https://medium.com/@stefanvanegmond/i-built-a-2300-file-codebase-with-ai-heres-the-jig-i-built-to-prevent-architectural-drift-56453fe2d5b4" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt; covered the benchmarks and why I built ArchCodex. &lt;a href="https://medium.com/@stefanvanegmond/coding-agents-need-more-than-examples-they-need-guardrails-1b3f71bc2c1d" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt; explored what the research reveals about AI coding quality.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>architecture</category>
      <category>agents</category>
    </item>
    <item>
<title>The Guardrails Coding Agents Need</title>
      <dc:creator>Stefan van Egmond </dc:creator>
      <pubDate>Thu, 05 Feb 2026 16:26:05 +0000</pubDate>
      <link>https://dev.to/stefanve/the-guardrails-coding-agents-needs-3f0b</link>
      <guid>https://dev.to/stefanve/the-guardrails-coding-agents-needs-3f0b</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 2: What the research reveals&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;In &lt;a href="https://medium.com/@stefanvanegmond/i-built-a-2300-file-codebase-with-ai-heres-the-jig-i-built-to-prevent-architectural-drift-56453fe2d5b4" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt;, I described what 1500 hours of AI-assisted development taught me: LLMs write code that compiles, passes tests, and works for users, but doesn't &lt;em&gt;fit&lt;/em&gt;. The pattern has a name: architectural drift. I built a tool to measure and prevent it. I ran benchmarks that showed the gap between "working code" and "good code" was larger than I expected.&lt;/p&gt;

&lt;p&gt;But I wanted to know: was my experience typical?&lt;/p&gt;

&lt;p&gt;So I dug into the research. The pattern was clearer than I expected.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem at Scale
&lt;/h2&gt;

&lt;p&gt;In Part 1, I measured two things traditional metrics miss: architectural drift (code that works but doesn't fit) and silent bugs (violations that compile, pass tests, and clear review). These became my proxy for production risk, the gap between "code that runs" and "code that belongs."&lt;/p&gt;

&lt;p&gt;The research measures the same gap at organizational scale. Developers report feeling 20-30% faster with AI tools. Yet delivery stability drops, complexity rises, and technical debt compounds. The 2024 DORA report found that a 25% increase in AI adoption correlates with a 7.2% decrease in delivery stability: correlation, not proof of causation, but a pattern worth noticing. The causal evidence is stronger elsewhere: a Carnegie Mellon study used difference-in-differences analysis across 807 repositories after Cursor adoption, finding a 3-5× spike in output during month one, followed by a 30% increase in static-analysis warnings and a 41% increase in complexity. A METR randomized controlled trial found developers using AI took 19% longer on real tasks, despite believing they were faster.&lt;/p&gt;

&lt;p&gt;The tools aren't broken. The feedback loops are.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Evidence&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Duplication exploding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8× increase in duplicate code blocks&lt;/td&gt;
&lt;td&gt;GitClear (Feb 2025)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context gaps&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;65% cite "missing context" as top issue&lt;/td&gt;
&lt;td&gt;Qodo (June 2025)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security vulnerabilities&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;45% contain OWASP Top 10 vulnerabilities&lt;/td&gt;
&lt;td&gt;Veracode (Aug 2025)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Quality degradation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Logic errors up 1.75×, XSS up 2.74×&lt;/td&gt;
&lt;td&gt;CodeRabbit (Dec 2025)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Invisible drift&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;25% more AI → 7.2% less stability&lt;/td&gt;
&lt;td&gt;Google DORA (2024, 2025)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is what it looks like when you measure individual velocity instead of system health. The common thread? Coding agents know what's &lt;em&gt;possible&lt;/em&gt;, not what's &lt;em&gt;right&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The codebase doesn't drift all at once. It drifts one "working" commit at a time.&lt;/p&gt;

&lt;p&gt;ArchCodex is a proof of concept: can we help coding agents get the right context when they need it? The approach combines hints, verifiable rules, and tools to check whether the agent (and the codebase) follows those rules, paired with some prompting techniques.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Approach
&lt;/h2&gt;

&lt;p&gt;The obvious response to "missing context" is to give the LLM more context. But even when your entire codebase fits in a 1M-token window, more tokens alone don't solve the problem. The bottleneck is &lt;em&gt;what kind&lt;/em&gt; of context you provide, and whether it surfaces at the right time.&lt;/p&gt;

&lt;p&gt;RAG is getting better and injects documentation at query time. This helps with API signatures and usage examples. It's less effective for architectural boundaries, team conventions, and security patterns, the stuff that lives in people's heads and Slack threads, not docs. And because RAG retrieves from actual code, it can reintroduce old patterns or copy wrong ones. Research on agile teams found that significant portions of code commits result in undocumented knowledge (Saito et al.). You can't retrieve what was never written down.&lt;/p&gt;

&lt;p&gt;There's active research on structured RAG, graph-based retrieval, and hybrid approaches that blur these lines. What I'm describing isn't a different category; it is retrieval that's structured around architectural concepts, scoped to what's relevant, and enforced rather than suggested. Think of it as architectural metadata—a machine-readable version of the mental model a developer has.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Not Just More Context?
&lt;/h3&gt;

&lt;p&gt;RAG retrieves what exists in the codebase, which may include drifted patterns. Fine-tuning bakes in patterns at training time, which can't adapt to architectural decisions made yesterday.&lt;/p&gt;

&lt;p&gt;Architecture-as-code operates differently:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;When it learns&lt;/th&gt;
&lt;th&gt;What it knows&lt;/th&gt;
&lt;th&gt;Can it enforce?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RAG&lt;/td&gt;
&lt;td&gt;At query time&lt;/td&gt;
&lt;td&gt;What exists in code&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fine-tuning&lt;/td&gt;
&lt;td&gt;At training time&lt;/td&gt;
&lt;td&gt;Patterns frozen in weights&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ArchCodex&lt;/td&gt;
&lt;td&gt;When you update the registry&lt;/td&gt;
&lt;td&gt;What's &lt;em&gt;intended&lt;/em&gt;, not just what &lt;em&gt;exists&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The registry can be updated after a single incident, immediately affecting every subsequent generation. It can be applied to existing code to surface violations. And when drift &lt;em&gt;does&lt;/em&gt; happen, because it will, the health dashboard makes it visible.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Four Layers
&lt;/h3&gt;

&lt;p&gt;The approach has four layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Boundaries&lt;/strong&gt; - Tell the LLM what this file is allowed to touch. Import restrictions, layer violations, forbidden dependencies. Example: "Cannot import &lt;code&gt;express&lt;/code&gt; into domain layer." These prevent drift before it starts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Constraints&lt;/strong&gt; - Encode rules that should rarely be broken. "Always call &lt;code&gt;requireProjectPermission()&lt;/code&gt; before database access." "Never import infrastructure into domain." These catch silent bugs before they ship.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Examples&lt;/strong&gt; - Surface canonical implementations at the right moment. "See &lt;code&gt;UserService.ts&lt;/code&gt; for the pattern." "Use the event system, not direct calls." These guide the LLM toward consistency without requiring it to infer patterns from scattered examples.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Validation&lt;/strong&gt; - Catch what slipped through. Single-file checks before commit. Cross-file analysis for layer violations. Health metrics that surface drift over time.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight: these aren't documentation. They're structured context that surfaces when relevant and can be enforced when violated. The difference between a constraint and a wiki page is that the machine reads the constraint automatically and blocks the PR if it's violated. Documentation gets ignored. Constraints get executed.&lt;/p&gt;

&lt;p&gt;ArchCodex is one implementation of this approach. It's not the only way to solve this, and it's not a silver bullet. But it let me test whether structured guardrails could address the gaps the research identifies. The results from Part 1 suggest they can.&lt;/p&gt;

&lt;p&gt;Here's how it works in practice.&lt;/p&gt;




&lt;h2&gt;
  
  
  How ArchCodex Works
&lt;/h2&gt;

&lt;p&gt;The core mechanism is simple: you tag source files with an &lt;code&gt;@arch&lt;/code&gt; annotation, and ArchCodex injects the relevant constraints when an agent reads the file.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;@arch&lt;/code&gt; tag is just a comment. In TypeScript: &lt;code&gt;/** @arch domain.payment.processor */&lt;/code&gt;. In Python: &lt;code&gt;# @arch domain.payment.processor&lt;/code&gt;. That's it. ArchCodex scans for these tags and does the rest.&lt;/p&gt;
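&lt;p&gt;A hypothetical sketch of the tag lookup (not ArchCodex's actual parser); a single regex accepts both comment forms:&lt;/p&gt;

```typescript
// Sketch of the @arch scan: extract a file's architectural identity
// from its source. Works for both the TypeScript comment form
// (/** @arch ... */) and the Python form (# @arch ...).
function findArchTag(source: string): string | null {
  const match = source.match(/@arch\s+([\w.]+)/);
  return match ? match[1] : null;
}
```

&lt;p&gt;Given &lt;code&gt;/** @arch domain.payment.processor */&lt;/code&gt;, this returns &lt;code&gt;domain.payment.processor&lt;/code&gt;; everything else (constraint resolution, header injection) keys off that identifier.&lt;/p&gt;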

&lt;h3&gt;
  
  
  Boundaries Surface Before Generation
&lt;/h3&gt;

&lt;p&gt;When an LLM agent reads a file through ArchCodex, via &lt;code&gt;archcodex read --format ai&lt;/code&gt; or the MCP server integration, the tool looks up the file's &lt;code&gt;@arch&lt;/code&gt; tag, resolves the full inheritance chain, and prepends a structured header with all applicable constraints, hints, and boundaries. The agent sees this header &lt;em&gt;before&lt;/em&gt; it sees the code. Without ArchCodex, the agent would just see raw source.&lt;/p&gt;

&lt;p&gt;Here's what that looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IMPORT BOUNDARIES

  Can import:
    ✓ src/domain/payments/*
    ✓ src/domain/shared/*
    ✓ src/utils/*

  Cannot import:
    ✗ src/api/* (layer violation)
    ✗ src/infra/* (domain must be infra-agnostic)
    ✗ express, fastify, pg (forbidden frameworks)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A "layer" here is a logical grouping you define—typically mapping to architectural boundaries like &lt;code&gt;domain&lt;/code&gt;, &lt;code&gt;api&lt;/code&gt;, &lt;code&gt;infrastructure&lt;/code&gt;, or &lt;code&gt;utils&lt;/code&gt;. You configure which directories belong to which layer and which layers can import from which. The domain layer shouldn't import from &lt;code&gt;api&lt;/code&gt;; &lt;code&gt;api&lt;/code&gt; shouldn't import from &lt;code&gt;cli&lt;/code&gt;. These aren't folder names, they're conceptual boundaries that ArchCodex enforces.&lt;/p&gt;

&lt;p&gt;The LLM knows what's allowed before it writes a single line. The "missing context" problem, cited by 65% of developers as their top issue, gets addressed at the source.&lt;/p&gt;

&lt;p&gt;In Part 1, I showed Opus 4.5 producing the smallest diff with correct logic, and still ranking 6th because of architectural drift. With boundaries explicit, the drift doesn't happen in the first place.&lt;/p&gt;

&lt;h3&gt;
  
  
  Constraints Encode Conventions
&lt;/h3&gt;

&lt;p&gt;The registry captures what usually lives in people's heads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;myapp.domain.service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;constraints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;rule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;forbid_import&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;express&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;fastify&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;pg&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;error&lt;/span&gt;
      &lt;span class="na"&gt;why&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Domain must be framework-agnostic&lt;/span&gt;
      &lt;span class="na"&gt;alternative&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Inject dependencies via constructor&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;rule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;require_call_before&lt;/span&gt;
      &lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;requireProjectPermission&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;checkOwnership&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;before&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repository.*"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ctx.db.*"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;error&lt;/span&gt;
      &lt;span class="na"&gt;why&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Verify permissions before database access&lt;/span&gt;

  &lt;span class="na"&gt;hints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Use requireProjectPermission() for ownership checks&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;See src/domain/user/UserService.ts for the pattern&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Registry Isn't a Code Map
&lt;/h3&gt;

&lt;p&gt;The registry doesn't mirror your folder structure. &lt;code&gt;domain.payment.processor&lt;/code&gt; doesn't imply a &lt;code&gt;domain/payment/processor.ts&lt;/code&gt; file path—it's a &lt;em&gt;conceptual&lt;/em&gt; hierarchy for inheriting rules.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;domain.payment&lt;/code&gt; inherits from &lt;code&gt;domain&lt;/code&gt;, it means: "payment code follows all domain constraints, plus these extras." The inheritance is about &lt;em&gt;rules&lt;/em&gt;, not code. Your file can live at &lt;code&gt;src/billing/StripeProcessor.ts&lt;/code&gt; and still be tagged &lt;code&gt;@arch domain.payment.processor&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This has a practical implication: &lt;strong&gt;registries are portable&lt;/strong&gt;. You could create a "Next.js + Convex" registry encoding your team's patterns, then reuse it across projects. The architectural knowledge isn't locked to one codebase.&lt;/p&gt;
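&lt;p&gt;As a rough sketch of what that rule inheritance might look like (the registry shape and function names here are my own assumptions for illustration, not ArchCodex's actual API), resolving a conceptual &lt;code&gt;@arch&lt;/code&gt; ID just means walking its dotted prefixes from root to leaf:&lt;/p&gt;

```typescript
// Hypothetical registry: each conceptual architecture ID maps to its own rules.
type Registry = { [archId: string]: string[] };

const registry: Registry = {
  "domain": ["forbid_import:express"],
  "domain.payment": ["require_call_before:requireProjectPermission"],
  "domain.payment.processor": ["max_lines:300"],
};

function resolveRules(archId: string, reg: Registry): string[] {
  const parts = archId.split(".");
  const rules: string[] = [];
  // Each prefix of the ID contributes its rules: domain, domain.payment, ...
  for (let i = 1; parts.length >= i; i += 1) {
    const prefix = parts.slice(0, i).join(".");
    const own = reg[prefix];
    if (own) rules.push(...own);
  }
  return rules;
}

// A file tagged `@arch domain.payment.processor` inherits rules from all
// three conceptual levels, regardless of where it lives on disk:
const inherited = resolveRules("domain.payment.processor", registry);
```

&lt;p&gt;Note that nothing in the lookup touches the filesystem: the same resolution works for &lt;code&gt;src/billing/StripeProcessor.ts&lt;/code&gt; or any other path.&lt;/p&gt;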

&lt;h3&gt;
  
  
  Canonical Implementations Counter the Xerox Effect
&lt;/h3&gt;

&lt;p&gt;Without guidance, coding agents copy from whatever appeared recently in context, which might itself be a copy of a copy, each iteration drifting further from the original intent. Call it the Xerox effect: each copy degrades.&lt;/p&gt;

&lt;p&gt;A canonical implementation is a file you designate as "the authoritative way to do X." Add it to the pattern registry, and ArchCodex surfaces it in hints and error messages. Instead of the agent copying the most recent (possibly drifted) example, it sees: "Use &lt;code&gt;src/domain/user/UserService.ts&lt;/code&gt; as your reference."&lt;/p&gt;

&lt;p&gt;One authoritative example prevents the drift that compounds through successive copies.&lt;/p&gt;

&lt;h3&gt;
  
  
  The GPT 5.1 Problem
&lt;/h3&gt;

&lt;p&gt;Remember the GPT 5.1 result from Part 1? It produced working code with zero critical bugs—and still ranked dead last in my benchmark, because it didn't use &lt;code&gt;requireProjectPermission()&lt;/code&gt;. It did manual ownership checks instead. The code worked. It didn't belong.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;require_call_before&lt;/code&gt; constraint prevents exactly this class of silent bug. The pattern is now explicit, not buried in tribal knowledge.&lt;/p&gt;
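&lt;p&gt;A minimal sketch of how such a check could work, assuming the ordered call names have already been extracted from a function body (a real tool would walk the AST; this simplification is mine, not ArchCodex's implementation):&lt;/p&gt;

```typescript
// Guard calls mirror the registry example above.
const GUARDS = ["requireProjectPermission", "checkOwnership"];

function violatesCallOrder(calls: string[]): boolean {
  let guarded = false;
  for (const call of calls) {
    if (GUARDS.includes(call)) guarded = true;
    // Any repository access before a guard call is a violation.
    if (call.startsWith("repository.")) {
      if (!guarded) return true;
    }
  }
  return false;
}

violatesCallOrder(["repository.save"]);                             // → true
violatesCallOrder(["requireProjectPermission", "repository.save"]); // → false
```

&lt;p&gt;Order matters: a permission check &lt;em&gt;after&lt;/em&gt; the database call still flags, which is exactly the manual-ownership-check failure mode described above.&lt;/p&gt;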

&lt;p&gt;This isn't theoretical. Before ArchCodex, my project NimoNova had files that bypassed &lt;code&gt;sanitizeLLMInput()&lt;/code&gt; entirely, passing raw content to the model. The code compiled. It worked in testing. In production, it would have been a prompt injection vector. A constraint on LLM-facing modules now catches this automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Validation Catches What Slipped Through
&lt;/h3&gt;

&lt;p&gt;Even with good context, mistakes happen. Validation operates at two levels:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-file checks&lt;/strong&gt; catch constraint violations on changed code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/domain/payments/PaymentService.ts

  ✗ ERROR: forbid_import violated
    Line 3: import { Request } from 'express'
    Why: Domain must be framework-agnostic

  ⚠ WARNING: require_call_before not satisfied
    repository.save() called without prior requireProjectPermission()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Errors don't just say "no" - they say what to do instead. Each violation includes a suggestion and, where relevant, a &lt;code&gt;did_you_mean&lt;/code&gt; field with concrete fix guidance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FAIL: src/core/health/analyzer.ts
  forbid_import: chalk
    → Use: src/utils/logger.ts (LoggerService)
    Did you mean: import { logger } from '../../utils/logger.js'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This comes from the constraint definition in the registry. The agent doesn't have to search for the right alternative - it's handed one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-file checks&lt;/strong&gt; catch systemic issues and check the complete project after architecture updates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;archcodex check --project

  Layer violations: 3
  Circular dependencies: 2
  Missing canonical patterns: 7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Feedback Loop
&lt;/h2&gt;

&lt;p&gt;In Part 1, I showed how Haiku 4.5 improved as the registry evolved. The same pattern held when I measured silent bugs specifically:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Registry State&lt;/th&gt;
&lt;th&gt;Silent Bugs&lt;/th&gt;
&lt;th&gt;Reduction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No ArchCodex&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Base registry&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;40%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ Security hints&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ Canonical patterns&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each iteration of the registry, each constraint added from observing mistakes, made the next run better.&lt;/p&gt;

&lt;p&gt;The registry improves through use. I sometimes use these five questions to surface improvements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What information did ArchCodex provide that helped?&lt;/li&gt;
&lt;li&gt;What information was missing?&lt;/li&gt;
&lt;li&gt;What was irrelevant or noisy?&lt;/li&gt;
&lt;li&gt;Did you update any architecture definitions?&lt;/li&gt;
&lt;li&gt;For the next developer, what will ArchCodex help with?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Improvements come from real needs: changing, tightening, or updating the architecture; introducing new patterns, new utilities, and new ways of doing things; capturing common bugs and errors. The registry is a living document. It helps engineers too, not just coding agents. It's architectural governance and mentorship at scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Constraints Aren't Enough
&lt;/h2&gt;

&lt;p&gt;A fair criticism: doesn't this just create rigidity? Codebases evolve. Good architects and engineers make context-dependent trade-offs.&lt;/p&gt;

&lt;p&gt;ArchCodex isn't only constraints. The registry has three layers of flexibility, plus a composition mechanism:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hard constraints&lt;/strong&gt; are rules that should rarely be broken. Import boundaries, security patterns, layer violations. These catch the mistakes that compound silently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hints&lt;/strong&gt; are soft guidance. "Prefer X over Y." "See this file for the pattern." The coding agent sees them, weighs them, and makes a judgment call. No error if it chooses differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intents&lt;/strong&gt; declare known patterns that satisfy constraints in non-obvious ways. For example, your codebase might have a rule: "All database queries must filter soft-deleted records." But what about queries that intentionally need deleted records—like a trash view or audit log? An &lt;code&gt;@intent:includes-deleted&lt;/code&gt; annotation tells ArchCodex this query intentionally skips the filter—and satisfies the constraint that would otherwise require it. An &lt;code&gt;@intent:cli-output&lt;/code&gt; exempts a file from the "no console.log" rule. Intents are decisions, not exceptions. They document valid alternative patterns.&lt;/p&gt;
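&lt;p&gt;Conceptually, an intent check is simple. The annotation names below come from the article; the checker itself is a hypothetical simplification, not ArchCodex's implementation:&lt;/p&gt;

```typescript
// Does the "filter soft-deleted records" rule apply to this file?
function needsSoftDeleteFilter(fileIntents: string[]): boolean {
  // The rule applies unless the file declares that it intentionally
  // includes deleted rows (e.g. a trash view or audit log).
  return !fileIntents.includes("includes-deleted");
}

needsSoftDeleteFilter([]);                   // → true: the rule applies
needsSoftDeleteFilter(["includes-deleted"]); // → false: intent satisfies it
```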

&lt;p&gt;&lt;strong&gt;Mixins&lt;/strong&gt; are reusable constraint bundles. Instead of repeating "must have test file" and "max 300 lines" across ten architectures, you define a &lt;code&gt;tested&lt;/code&gt; mixin once and compose it in the registry: &lt;code&gt;mixins: [tested, srp]&lt;/code&gt;. You can also apply mixins per-file using inline syntax: &lt;code&gt;@arch domain.payment.processor +singleton +pure&lt;/code&gt;. Mixins keep the registry DRY while allowing file-level flexibility.&lt;/p&gt;
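&lt;p&gt;Mixin expansion can be sketched as merging named constraint bundles into an architecture's own rules. The mixin names come from the article; the data shapes and rule strings are my assumptions:&lt;/p&gt;

```typescript
// Reusable constraint bundles, defined once.
const mixinBundles: { [name: string]: string[] } = {
  tested: ["require_test_file"],
  srp: ["max_lines:300"],
  pure: ["forbid_side_effects"],
};

function expandMixins(ownRules: string[], applied: string[]): string[] {
  const expanded = [...ownRules];
  for (const name of applied) {
    const bundle = mixinBundles[name];
    if (bundle) expanded.push(...bundle);
  }
  return expanded;
}

// `mixins: [tested, srp]` in the registry plus an inline `+pure` on the file:
const effective = expandMixins(["forbid_import:express"], ["tested", "srp", "pure"]);
```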

&lt;p&gt;And when you encounter an &lt;em&gt;unanticipated&lt;/em&gt; exception, the override system makes it explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// @override forbid_import:pg&lt;/span&gt;
&lt;span class="c1"&gt;// reason: Legacy migration script, will be removed by Q2&lt;/span&gt;
&lt;span class="c1"&gt;// expires: 2025-06-01&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Client&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pg&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The violation is acknowledged, documented, and time-boxed. Teams can track how much architectural debt they're carrying and whether it's growing or shrinking.&lt;/p&gt;
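&lt;p&gt;Deciding whether an override has lapsed is a small check. The comment format is copied from the example above; the parsing here is a simplification, not ArchCodex's implementation:&lt;/p&gt;

```typescript
function overrideExpired(comment: string, today: string): boolean {
  const match = comment.match(/expires:\s*(\d{4}-\d{2}-\d{2})/);
  if (!match) return false; // no expiry date: the override never lapses
  return today > match[1];  // ISO dates compare correctly as strings
}

const header = "// @override forbid_import:pg\n// expires: 2025-06-01";
overrideExpired(header, "2025-05-01"); // → false: still within its window
overrideExpired(header, "2025-07-01"); // → true: expired debt, now visible
```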

&lt;p&gt;The goal isn't to prevent all deviation. It's to make deviation visible. When a coding agent breaks a pattern, you want to know whether it's drift (bad) or evolution (good).&lt;/p&gt;




&lt;h2&gt;
  
  
  Ongoing Health and Keeping the Registry Up to Date
&lt;/h2&gt;

&lt;p&gt;Codebases drift over time. The CMU study showed complexity accumulating even as velocity gains faded. ArchCodex surfaces this before it compounds.&lt;/p&gt;

&lt;p&gt;Even with hints and constraints, coding agents still tend to "forget" or say things like "for the sake of time, let me do this quickly", resulting in code duplication, violations, and other drift. Three commands address this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;archcodex check&lt;/code&gt;&lt;/strong&gt; - Linter-like validation for architecture. Run on save, commit, or CI. Catches constraint violations, layer boundary crossings, and forbidden patterns. With &lt;code&gt;--project&lt;/code&gt;, it also detects circular dependencies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;archcodex health&lt;/code&gt;&lt;/strong&gt; - Dashboard for architectural debt. Shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Override debt&lt;/strong&gt;: How many overrides exist, which are expiring, which have expired&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage&lt;/strong&gt;: What percentage of files have &lt;code&gt;@arch&lt;/code&gt; tags&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Registry bloat&lt;/strong&gt;: Architectures used by only one file, similar sibling architectures that could be consolidated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type duplicates&lt;/strong&gt;: Identical or near-identical type definitions across files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recommendations&lt;/strong&gt;: Actionable suggestions (e.g., "run &lt;code&gt;archcodex audit --expired&lt;/code&gt;")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;archcodex garden&lt;/code&gt;&lt;/strong&gt; - Index maintenance and pattern detection. Finds naming conventions that aren't yet captured in the registry, inconsistent &lt;code&gt;@arch&lt;/code&gt; usage, and missing keywords for discovery.&lt;/p&gt;

&lt;p&gt;The goal isn't perfection. It's visibility. You can't fix drift you can't see.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Doesn't Solve
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;You don't need a perfect registry on Day 1.&lt;/strong&gt; A common question: "For a brownfield project with 500k lines of code, how do I start?" Start with one architecture definition for your most critical layer. Add constraints as violations surface. The registry grows from real issues, not from trying to document everything upfront. An empty registry doesn't break anything; it just means you're not getting guardrails yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ArchCodex doesn't replace security scanners.&lt;/strong&gt; It catches architectural security issues (missing permission checks, layer violations) but not injection vulnerabilities or cryptographic weaknesses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It doesn't automatically refactor code.&lt;/strong&gt; It surfaces problems. You fix them. Or the coding agent fixes them, with the constraints now visible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It requires investment.&lt;/strong&gt; You write the registry. The LLM helps, and it grows from real issues rather than from scratch. It's not zero-effort, but it might save more time than it costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It doesn't work magic on terrible codebases.&lt;/strong&gt; If your architecture is genuinely confused, ArchCodex will show you the mess. It won't clean it up for you. But it can guide refactoring.&lt;/p&gt;

&lt;p&gt;The debugging overhead is real: 67% of developers spend more time debugging AI-generated code than before (Harness). The security remediation gap is worse: only 21% of serious AI/LLM vulnerabilities are ever fixed (Cobalt).&lt;/p&gt;

&lt;p&gt;ArchCodex doesn't eliminate these problems. It addresses their root cause: AI generating code without knowing the rules.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;The research is clear: AI is making developers faster at writing code that's harder to maintain. Individual velocity is up; system health is down.&lt;/p&gt;

&lt;p&gt;I don't think ArchCodex is the only answer. But I think it points toward &lt;em&gt;an&lt;/em&gt; answer: coding agents need structured context that surfaces at the right time. They need to know what's forbidden, not just what's possible. And the teams that figure out how to capture senior expertise and make it executable, through constraints, through guardrails, through whatever comes next, will ship faster &lt;em&gt;and&lt;/em&gt; more reliably.&lt;/p&gt;

&lt;p&gt;The table saw metaphor from Part 1 still holds. The saw isn't the problem. The missing jig is.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;ArchCodex is open source.&lt;/strong&gt; It's one implementation of these ideas, not the definitive one. If you want to test the approach on your own codebase, or if you find gaps, I'd like to hear about it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ArchCodexOrg/archcodex" rel="noopener noreferrer"&gt;github.com/ArchCodexOrg/archcodex&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Google DORA, &lt;a href="https://dora.dev/research/2024/dora-report/" rel="noopener noreferrer"&gt;"Accelerate State of DevOps Report 2024"&lt;/a&gt; (Oct 2024)&lt;/li&gt;
&lt;li&gt;Google DORA, &lt;a href="https://dora.dev/research/2025/dora-report/" rel="noopener noreferrer"&gt;"State of AI-assisted Software Development 2025"&lt;/a&gt; (Sept 2025)&lt;/li&gt;
&lt;li&gt;He et al., &lt;a href="https://arxiv.org/abs/2511.04427" rel="noopener noreferrer"&gt;"Does AI-Assisted Coding Deliver? A Difference-in-Differences Study of Cursor's Impact on Software Projects,"&lt;/a&gt; Carnegie Mellon University (Nov 2025)&lt;/li&gt;
&lt;li&gt;GitClear, &lt;a href="https://www.gitclear.com/ai_assistant_code_quality_2025_research" rel="noopener noreferrer"&gt;"AI Copilot Code Quality 2025"&lt;/a&gt; (Feb 2025)&lt;/li&gt;
&lt;li&gt;CodeRabbit, &lt;a href="https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report" rel="noopener noreferrer"&gt;"State of AI vs Human Code Generation Report"&lt;/a&gt; (Dec 2025)&lt;/li&gt;
&lt;li&gt;Qodo, &lt;a href="https://www.qodo.ai/reports/state-of-ai-code-quality/" rel="noopener noreferrer"&gt;"State of AI Code Quality in 2025"&lt;/a&gt; (June 2025)&lt;/li&gt;
&lt;li&gt;Veracode, &lt;a href="https://www.veracode.com/resources/analyst-reports/2025-genai-code-security-report/" rel="noopener noreferrer"&gt;"2025 GenAI Code Security Report"&lt;/a&gt; (Aug 2025)&lt;/li&gt;
&lt;li&gt;METR, &lt;a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/" rel="noopener noreferrer"&gt;"Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity"&lt;/a&gt; (July 2025)&lt;/li&gt;
&lt;li&gt;Cobalt, &lt;a href="https://resource.cobalt.io/state-of-pentesting-2025" rel="noopener noreferrer"&gt;"State of Pentesting Report 2025"&lt;/a&gt; (Oct 2025)&lt;/li&gt;
&lt;li&gt;Harness, &lt;a href="https://www.harness.io/state-of-software-delivery" rel="noopener noreferrer"&gt;"State of Software Delivery Report 2025"&lt;/a&gt; (Jan 2025)&lt;/li&gt;
&lt;li&gt;Saito et al., &lt;a href="https://link.springer.com/article/10.1007/s00766-018-0291-4" rel="noopener noreferrer"&gt;"Discovering undocumented knowledge through visualization of agile software development activities,"&lt;/a&gt; Requirements Engineering (2018)&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;This is Part 3 of a series on AI-assisted development. Part 1 covered the benchmarks and why I built it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>architecture</category>
      <category>agents</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Built a 2300-File Codebase with AI. Here’s the Jig I Built to Prevent Architectural Drift.</title>
      <dc:creator>Stefan van Egmond </dc:creator>
      <pubDate>Wed, 28 Jan 2026 14:00:00 +0000</pubDate>
      <link>https://dev.to/stefanve/i-built-a-2300-file-codebase-with-ai-heres-the-jig-i-built-to-prevent-architectural-drift-2dk3</link>
      <guid>https://dev.to/stefanve/i-built-a-2300-file-codebase-with-ai-heres-the-jig-i-built-to-prevent-architectural-drift-2dk3</guid>
<description>&lt;p&gt;What 1500 hours of AI-assisted development taught me about the difference between code that runs and code that belongs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; ArchCodex prevents architectural drift in AI-generated code by surfacing the right constraints at the right time. Benchmarks showed: 36% lower production risk, 70% less drift, and Opus 4.5 achieved zero drift on vague tasks. Top-tier models need it for consistency. Lower-tier models need it to produce working code at all (+55pp).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ArchCodexOrg/archcodex" rel="noopener noreferrer"&gt;GitHub - ArchCodexOrg/archcodex&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is Part 1; deeper dives coming.&lt;/p&gt;




&lt;p&gt;Over 1500 hours and roughly €1200 in API costs, I built NimoNova as a side project in evenings and weekends: a 2300-file research workspace with automatic knowledge graphs, fact and timeline extraction, document analysis, and multi-tier RAG. I built it almost entirely with LLM coding assistants.&lt;/p&gt;

&lt;p&gt;The code compiled. The tests passed. Users could actually use it.&lt;/p&gt;

&lt;p&gt;But I had this nagging feeling: what if it was full of mistakes I couldn't see?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5w1bm46o9dic4e4vh9o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5w1bm46o9dic4e4vh9o.png" alt=" " width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;NimoNova: knowledge graphs extracted automatically from research sources&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With "Working Code"
&lt;/h2&gt;

&lt;p&gt;LLMs are good at writing code that seemingly &lt;em&gt;works&lt;/em&gt;. They can understand APIs, they can follow syntax, they can implement complex algorithms correctly.&lt;/p&gt;

&lt;p&gt;What they're terrible at is writing code that &lt;em&gt;belongs&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This isn't just my experience. Security researchers have identified the same pattern:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"One of the hardest risks to detect is what might be called architectural drift—subtle model-generated design changes that break security invariants without violating syntax. These changes often evade static analysis tools and human reviewers." — &lt;a href="https://www.endorlabs.com/learn/the-most-common-security-vulnerabilities-in-ai-generated-code" rel="noopener noreferrer"&gt;Endor Labs, 2025&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every codebase has patterns. Conventions. An implicit architecture that experienced developers learn by working on it, building mental models and through tribal knowledge. When you ask an LLM to add a feature, it doesn't know that your team uses &lt;code&gt;requireProjectPermission()&lt;/code&gt; instead of manual ownership checks. It doesn't know you have a mutation-per-operation convention, or that barrel exports go in sibling &lt;code&gt;index.ts&lt;/code&gt; files, or that soft-deleted records should be filtered by default (or that soft-delete is a thing).&lt;/p&gt;

&lt;p&gt;The LLM will write something that seemingly works. But it won't write something that fits.&lt;/p&gt;

&lt;p&gt;Careful prompts, multiple runs, manual reviews. All helped counter it. But when you're pumping out code at scale, things slip through. A big application with many modules and functionality will drift. This happens in human-built codebases too. The difference is that with LLMs, it happens faster and more often.&lt;/p&gt;

&lt;p&gt;And here's what made it worse: &lt;strong&gt;drift compounds&lt;/strong&gt;. When there's inconsistency in your codebase: multiple ways of doing the same thing, duplicate utilities, competing patterns, LLMs perform &lt;em&gt;worse&lt;/em&gt;. They can't pick the right approach when several exist. They copy the wrong pattern because it appeared more recently in context. The drift accelerates.&lt;/p&gt;

&lt;p&gt;One function uses the centralized permission system; another does a manual check. One module follows the established error handling pattern; another invents its own. The codebase doesn't drift all at once, it drifts one "working" commit at a time. And each drift makes the next one more likely.&lt;/p&gt;

&lt;p&gt;The analogy I like to use is the table saw. A table saw can cut anything, and that's great. But without a fence, without guides, without jigs, you get cuts that are technically correct but practically useless. Each cut is fine in isolation. Together, nothing fits.&lt;/p&gt;

&lt;p&gt;LLMs needed a jig: something to guide the cut toward what should be done, in this codebase, for this architecture. So I started building one, using LLMs both to code it and as my focus group. I call it ArchCodex.&lt;/p&gt;




&lt;h2&gt;
  
  
  Testing the Hypothesis
&lt;/h2&gt;

&lt;p&gt;The idea behind ArchCodex was simple: LLMs are good at some things and, due to inherent constraints like context windows, quite bad at others. What if I helped them? Give them the right context at the right time. Surface the patterns they should follow, exactly when they need to follow them. Make it easy to check what they've done and see what they didn't do.&lt;/p&gt;

&lt;p&gt;But I wanted to measure whether the effectiveness I thought I was experiencing was real and consistent, not just confirmation bias.&lt;/p&gt;

&lt;p&gt;So I ran multiple benchmarks. Thirty LLM runs across five models (GPT 5.1, Claude Opus 4.5, Claude Haiku 4.5, Gemini Pro 3, GLM 4.7), two different coding tools, with and without ArchCodex. Two different tasks on my actual codebase.&lt;/p&gt;

&lt;p&gt;The baseline wasn't naive. The codebase already had a solid &lt;code&gt;AGENTS.md&lt;/code&gt; with guidelines and conventions. The agents I used were Warp.dev with indexed source code (giving the LLM codebase awareness) and Claude Code. These are reasonable conditions and ArchCodex still produced significant improvements on top of them.&lt;/p&gt;

&lt;p&gt;The benchmarks covered two types of tasks. The first was a detailed prompt with explicit acceptance criteria. This showed that ArchCodex reduced production risk by 36%, dramatically improved architectural drift for top-tier models (zero-drift rates jumped from 17% to 70%), and increased working code rates by 55 percentage points for lower-tier models. But the high-level task revealed something more interesting.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How I defined Production Risk:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Silent Bugs:&lt;/strong&gt; Logic errors that pass unit tests but fail requirements (e.g., semantic drift)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loud Bugs:&lt;/strong&gt; CI failures, lint errors, broken UI or crashes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architectural Drift:&lt;/strong&gt; Violations of project conventions (e.g., not using the right utilities, wrong structure, importing code across boundaries, etc)&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
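&lt;p&gt;To make the ranking logic concrete: the article doesn't publish its exact scoring weights, so the numbers below are hypothetical, chosen only to show how a weighted score can let silent bugs and drift outweigh loud bugs:&lt;/p&gt;

```typescript
// Hypothetical weights: silent bugs cost most because they are hardest
// to catch before production; loud bugs and drift weigh less.
function productionRisk(silentBugs: number, loudBugs: number, drift: number): number {
  return silentBugs * 3 + loudBugs * 2 + drift * 2;
}

// A run with zero loud bugs but six silent failures still scores far worse
// than a run with one loud bug and one drift violation:
productionRisk(6, 0, 1); // → 20
productionRisk(0, 1, 1); // → 4
```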

&lt;h3&gt;
  
  
  The High-Level Task
&lt;/h3&gt;

&lt;p&gt;I gave the models a one-line prompt on NimoNova's actual codebase:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Add the ability to duplicate timeline entries in projects. Users should be able to duplicate an entry and have it appear right below the original."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;No acceptance criteria. No implementation hints. Just a feature request. The catch? Project timelines in NimoNova have five entry types, a chronicle section for completed items, junction tables for linked resources, and UI components across five archetypes.&lt;/p&gt;

&lt;p&gt;This is where it got interesting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opus 4.5 (no ArchCodex)&lt;/strong&gt; produced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Correct sort algorithm&lt;/li&gt;
&lt;li&gt;✅ Smallest diff (41 lines)&lt;/li&gt;
&lt;li&gt;✅ Working code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GPT 5.1 (no ArchCodex)&lt;/strong&gt; produced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Correct sort algorithm&lt;/li&gt;
&lt;li&gt;✅ Zero critical bugs&lt;/li&gt;
&lt;li&gt;✅ Working code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sounds great, right? Here's how they actually ranked:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Algorithm&lt;/th&gt;
&lt;th&gt;Critical Bugs&lt;/th&gt;
&lt;th&gt;Final Rank&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Opus 4.5 (no ArchCodex)&lt;/td&gt;
&lt;td&gt;✅ Correct&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6th&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT 5.1 (no ArchCodex)&lt;/td&gt;
&lt;td&gt;✅ Correct&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8th (LAST)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The model with zero critical (loud) bugs ranked &lt;em&gt;dead last&lt;/em&gt;, because my scoring penalized architectural drift and silent bugs. Drift is a source of bugs and unmaintainable code, and silent bugs are much harder to debug once they land in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why "Zero Bugs" Ranked Last
&lt;/h3&gt;

&lt;p&gt;GPT 5.1's code worked. It would pass QA. Users would never notice a problem.&lt;/p&gt;

&lt;p&gt;But it had six &lt;strong&gt;silent failures&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Copied user mentions to the duplicate (semantically wrong, the duplicate wasn't created by those users)&lt;/li&gt;
&lt;li&gt;Placed completed-task duplicates in the chronicle section (wrong, duplicates should start fresh)&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;inProgressSince: undefined&lt;/code&gt; for in-progress tasks (breaks duration calculations in the timeline)&lt;/li&gt;
&lt;li&gt;Missing UI wiring (the backend existed but no button triggered it across any of the five archetypes)&lt;/li&gt;
&lt;li&gt;Copied source markers (creates false backlinks in the knowledge graph)&lt;/li&gt;
&lt;li&gt;No centralized permissions (inconsistent with &lt;code&gt;requireProjectPermission()&lt;/code&gt; used everywhere else)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these would show up in compilation. Most wouldn't show up in testing. They'd ship to production and cause subtle, hard-to-debug problems weeks later.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;"deceptively correct"&lt;/strong&gt; code, the most dangerous kind, because it passes most checks except the one that matters. Silent failures don't trigger alerts. They erode trust.&lt;/p&gt;

&lt;h3&gt;
  
  
  What ArchCodex Changed
&lt;/h3&gt;

&lt;p&gt;With ArchCodex, the same models produced dramatically different results. The vague task showed where ArchCodex helps most:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric (High Level Task)&lt;/th&gt;
&lt;th&gt;With ArchCodex&lt;/th&gt;
&lt;th&gt;Without&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Architectural drift&lt;/td&gt;
&lt;td&gt;0.75 avg&lt;/td&gt;
&lt;td&gt;2.5 avg&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-70%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Loud bugs&lt;/td&gt;
&lt;td&gt;0.5 avg&lt;/td&gt;
&lt;td&gt;1.5 avg&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-67%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production risk&lt;/td&gt;
&lt;td&gt;7.75&lt;/td&gt;
&lt;td&gt;11.75&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-34%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;But the effect varied by model tier:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model Tier&lt;/th&gt;
&lt;th&gt;Primary Benefit&lt;/th&gt;
&lt;th&gt;Key Metric&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Top-tier (Opus 4.5, GPT 5.1)&lt;/td&gt;
&lt;td&gt;Drift prevention&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;-80% drift&lt;/strong&gt;, Opus 4.5 achieved zero drift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lower-tier (Haiku 4.5, GLM 4.7)&lt;/td&gt;
&lt;td&gt;Fewer crashes&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;-50% loud bugs&lt;/strong&gt;, -23% risk&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: &lt;strong&gt;top-tier models don't need ArchCodex to write working code. They need it to write code that belongs.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What the benchmarks revealed about different models:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The value of ArchCodex depends on what you're working with. Top-tier models (Opus 4.5, GPT 5.1) already produce working code. Their problem is drift. Without ArchCodex, they "creatively" deviate from your architecture. With it, zero-drift rates jumped from 17% to 70%.&lt;/p&gt;

&lt;p&gt;Lower-tier models (Haiku 4.5, Gemini Pro 3, GLM 4.7) have a different problem: they often don't produce working code at all. ArchCodex increased working code rates from 20% to 75%, a 55 percentage point improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The takeaway:&lt;/strong&gt; Top-tier models need ArchCodex for &lt;em&gt;consistency&lt;/em&gt;. Lower-tier models need it for &lt;em&gt;viability&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Opus 4.5 without ArchCodex extended an existing &lt;code&gt;createEntry&lt;/code&gt; function instead of creating a dedicated mutation. Technically clever. Algorithmically correct. But it violated the codebase's mutation-per-operation pattern, a pattern every other operation followed.&lt;/p&gt;

&lt;p&gt;With ArchCodex, the same model created a proper dedicated mutation. Not because it was told to, but because the constraints surfaced the pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  What It Didn't Fix
&lt;/h3&gt;

&lt;p&gt;ArchCodex isn't magic. The benchmarks revealed clear limitations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model capabilities are still model capabilities.&lt;/strong&gt; Haiku still made algorithm mistakes with ArchCodex. No agent (zero out of eight) discovered they needed to wire up UI components across five archetypes. Source marker filtering was a universal blind spot. ArchCodex can surface patterns; it can't upgrade a model's reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hints get ignored—especially by weaker models.&lt;/strong&gt; Only 31% of runs used &lt;code&gt;requireProjectPermission()&lt;/code&gt; even though it was in the hints. The lesson: for weaker models, hints aren't enough. If it matters, make it a constraint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Things not in the registry don't get caught.&lt;/strong&gt; Only 18% checked for deleted projects. Only 36% prevented owners from adding themselves as members. Why? Those rules weren't in the registry yet. The benchmarks became the source for new constraints, which is exactly how the system is supposed to work.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Feedback Loop: Five Questions That Improve the Registry
&lt;/h2&gt;

&lt;p&gt;Before diving into how ArchCodex works, here's the workflow that makes it evolve.&lt;/p&gt;

&lt;p&gt;After a complex session, or when the output feels off, I ask the LLM five questions:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;What information did you need that you DID get from ArchCodex?&lt;/li&gt;
&lt;li&gt;What information did you need that you DID NOT get?&lt;/li&gt;
&lt;li&gt;What information did ArchCodex provide that was irrelevant or noisy?&lt;/li&gt;
&lt;li&gt;Did you create or update any architectural specs? Why or why not?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;For the next agent working on this code, what will ArchCodex help them with?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;I don't do this every session; maybe once a week, or after a particularly gnarly feature. The answers are gold. Question 2 reveals what constraints or hints to add. Question 3 reveals what to trim. And Question 5? That's where the LLM documents patterns for &lt;em&gt;future LLMs&lt;/em&gt;. It leaves breadcrumbs. The system starts to maintain itself.&lt;/p&gt;




&lt;h2&gt;
  
  
  How ArchCodex Works
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Full documentation on &lt;a href="https://github.com/ArchCodexOrg/archcodex" rel="noopener noreferrer"&gt;GitHub - ArchCodexOrg/archcodex&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;ArchCodex is built on three ideas:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Just-In-Time Context.&lt;/strong&gt; When an LLM reads a file, it should see the rules that code should follow. ArchCodex "hydrates" minimal &lt;code&gt;@arch&lt;/code&gt; tags into full architectural context: constraints, hints, reference implementations. The context is triggered by location, not by query. Mutation file gets mutation rules; query file gets query rules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Static Enforcement.&lt;/strong&gt; Constraints are checked automatically: on save, on commit, in CI. Twenty-plus constraint types cover imports, patterns, naming, structure, and cross-file boundaries. When violations occur, error messages are actionable: "here's the alternative, here's why, here's a reference implementation."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Broad Analysis.&lt;/strong&gt; Beyond per-file checks: health metrics (override debt, coverage), garden analysis (duplicate code), type consistency (drifted definitions), and import boundary enforcement.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;@arch&lt;/code&gt; tag, &lt;code&gt;@intent&lt;/code&gt; annotations, and &lt;code&gt;@override&lt;/code&gt; exceptions make the implicit explicit. The registry is a living document that helps software engineers as well as AI agents.&lt;/p&gt;
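&lt;p&gt;To make the idea concrete, here's roughly what an annotated file and its archetype's rules might look like. This is a Python stand-in for the pattern; the archetype id, the helper's signature, and the hydrated context described in the comments are assumptions for the sketch, not ArchCodex's exact syntax:&lt;/p&gt;

```python
# Hypothetical sketch: the archetype id "backend/mutation" and the hydration
# behavior described here are illustrative, not ArchCodex's exact syntax.

# @arch backend/mutation
# On read, ArchCodex hydrates this one-line tag into the archetype's full
# context: its constraints ("call the permission helper before any write"),
# hints, and a pointer to a reference implementation. The context is keyed
# by file location, not by query.

def require_project_permission(ctx, project_id):
    """Placeholder for the centralized permission helper the article names
    (requireProjectPermission); this signature is assumed."""
    if not ctx.get("user_id"):
        raise PermissionError(f"no access to project {project_id}")

def create_entry(ctx, args):
    """One dedicated mutation per operation, the pattern the article says
    every other operation in the codebase follows."""
    require_project_permission(ctx, args["project_id"])  # constraint: check before write
    return {"ok": True}  # the actual write is elided
```

The point of the sketch is the trigger: a mutation file carries a mutation archetype, so mutation rules surface automatically whenever an agent opens it.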

&lt;h3&gt;
  
  
  The Registry as Living Documentation
&lt;/h3&gt;

&lt;p&gt;The registry isn't a one-time setup; it's an evolving artifact that grows with your codebase, codifying common mistakes and their solutions. Most updates come from mundane sources:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Registry Update&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Code review&lt;/td&gt;
&lt;td&gt;"Why did you do a manual ownership check here?"&lt;/td&gt;
&lt;td&gt;Add constraint: &lt;code&gt;require_call_before&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bug in production&lt;/td&gt;
&lt;td&gt;Soft-deleted records appeared in a query&lt;/td&gt;
&lt;td&gt;Add &lt;code&gt;require_pattern&lt;/code&gt; for query files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Onboarding friction&lt;/td&gt;
&lt;td&gt;"Where do barrel exports go?"&lt;/td&gt;
&lt;td&gt;Add hint with example&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM feedback (the 5 questions)&lt;/td&gt;
&lt;td&gt;"I didn't know you had a centralized permission helper"&lt;/td&gt;
&lt;td&gt;Add hint pointing to &lt;code&gt;requireProjectPermission()&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
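&lt;p&gt;To make the "bug in production" row concrete, here is a minimal sketch of the query-side pattern a &lt;code&gt;require_pattern&lt;/code&gt; constraint could pin down. The field and helper names are invented, not taken from the article's codebase:&lt;/p&gt;

```python
# Hypothetical sketch of the pattern a `require_pattern` constraint on query
# files could enforce: every query filters out soft-deleted records.
# Field and helper names are invented for illustration.

def exclude_soft_deleted(rows):
    """The canonical filter the constraint would require in every query file."""
    return [r for r in rows if r.get("deleted_at") is None]

def list_entries(rows):
    """A compliant query; the checker would flag any query file that
    skips the required filter pattern."""
    return exclude_soft_deleted(rows)
```

Once the bug becomes a constraint, the same class of mistake is caught on save rather than in production.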

&lt;p&gt;This compounds over time. One benchmark showed the effect clearly. Haiku 4.5, a lower-tier model, started with a base registry and couldn't produce working code on the specified task. As we added constraints based on what it got wrong:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Registry State&lt;/th&gt;
&lt;th&gt;Working?&lt;/th&gt;
&lt;th&gt;Silent Bugs&lt;/th&gt;
&lt;th&gt;Score vs Baseline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No ArchCodex&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Base Registry&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;+40%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ Security Hints&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;+48%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ Fixed Patterns&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;+68%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each iteration of the registry, each constraint added from observing mistakes, made the next run better. Those same constraints also surface similar issues across the existing codebase whenever &lt;code&gt;archcodex check --project&lt;/code&gt; runs.&lt;/p&gt;

&lt;p&gt;This is fundamentally different from traditional linters, which are typically set up once, maintained by a platform team, binary pass/fail, and focused on syntax rather than architecture. It shares some ideas with semantic linters, but it gives you fine-grained control and, crucially, supplies context alongside enforcement.&lt;/p&gt;

&lt;p&gt;The registry is more like executable architecture decision records: decisions that are &lt;em&gt;enforced&lt;/em&gt;, not just documented. When you decide "all queries must filter soft-deleted records," that decision becomes a constraint scoped to the relevant classes, models, or frontend components. When you decide "use the event system for this module instead of direct database calls," that becomes a pattern with reference implementations. The architecture isn't in a wiki that nobody reads; it's in the tool that LLMs consult on every file.&lt;/p&gt;
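&lt;p&gt;The "event system instead of direct database calls" decision can be sketched like this; the bus API and event names are invented for illustration:&lt;/p&gt;

```python
# Hypothetical sketch: routing a module's writes through an event system
# instead of direct database calls. The bus API and event names are invented.

class EventBus:
    def __init__(self):
        self.handlers = {}

    def on(self, event_type, handler):
        self.handlers.setdefault(event_type, []).append(handler)

    def emit(self, event_type, payload):
        for handler in self.handlers.get(event_type, []):
            handler(payload)

def record_entry_created(bus, entry_id):
    """The module emits an event; a single subscriber owns the database
    write. An import-boundary constraint could then forbid this module
    from importing the database layer directly."""
    bus.emit("entry.created", {"entry_id": entry_id})
```

The enforceable part is the boundary: the pattern only holds if a cross-file constraint actually blocks direct imports of the database layer.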

&lt;p&gt;Arch tags provide the architectural "why" and "what"; the code itself is the specific implementation. If you change something in the architecture (replacing utilities, strengthening constraints, etc.), running &lt;code&gt;check --project&lt;/code&gt; shows the impact of those changes and what code needs to be refactored to be compliant again. It serves as a guide not just for new functionality but also for refactoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  You're Not Starting From Scratch
&lt;/h3&gt;

&lt;p&gt;A reasonable objection: "So I have to define all these rules for my specific codebase?"&lt;/p&gt;

&lt;p&gt;Yes, and that's the point. Every codebase has an architecture: conventions, patterns, boundaries, the implicit "how we do things here." The problem is that this architecture lives in tribal knowledge, in code review comments, in the senior engineer's head, in that onboarding doc nobody updates. LLMs can't read tribal knowledge. But you don't have to write it all down at once; you improve it over time, and built-in commands make setting up an initial registry easy.&lt;/p&gt;

&lt;p&gt;In practice, registries have three layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Universal principles.&lt;/strong&gt; Things like SOLID, separation of concerns, basic hygiene. These ship with ArchCodex or are trivially shared. Inherit them and forget about them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Stack idioms.&lt;/strong&gt; Convex mutation patterns. Next.js App Router conventions. tRPC procedure structure. These can be community-maintained, shared YAML files that capture best practices for your stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: Your architecture.&lt;/strong&gt; The stuff unique to your codebase. Your permission system. Your event patterns. Your module boundaries. This is what you define and what the LLM helps you write.&lt;/p&gt;
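&lt;p&gt;Put together, a layered registry might look something like the sketch below. The key names, archetype ids, and file paths are invented; this is the shape of the three layers, not ArchCodex's actual schema:&lt;/p&gt;

```yaml
# Hypothetical registry sketch: key names, ids, and paths are invented
# for illustration, not ArchCodex's actual schema.
extends:
  - archcodex/universal         # Layer 1: SOLID, separation of concerns, hygiene
  - community/convex-idioms     # Layer 2: shared stack idioms (community YAML)

archetypes:                     # Layer 3: your architecture
  backend/mutation:
    constraints:
      - type: require_call_before        # a constraint type mentioned earlier
        call: requireProjectPermission   # the centralized permission helper
    hints:
      - "Permissions are centralized; never hand-roll ownership checks."
    reference: convex/entries/createEntry.ts   # canonical example (path assumed)
```

Note the hint-versus-constraint split: the permission helper shows up both as a hint and as a hard constraint, since the benchmarks showed weaker models ignore hints alone.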

&lt;p&gt;Your architecture already exists. It's just scattered. ArchCodex gives you a place to put it, and the LLM helps document it. Every rule you add prevents a class of drift.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Happened When I Applied It At Scale
&lt;/h2&gt;

&lt;p&gt;Applying ArchCodex to NimoNova's ~2200 files took a couple of evenings and a weekend. The initial scan was sobering: many hundreds of warnings. Drift everywhere. Duplicate utilities, diverged type definitions, inconsistent permission checks.&lt;/p&gt;

&lt;p&gt;ArchCodex guided major refactoring: event-driven migration for excessive database calls, security hardening for inconsistent permissions, code duplication cleanup via &lt;code&gt;garden&lt;/code&gt; and &lt;code&gt;types&lt;/code&gt; analysis, and target architecture enforcement to show where reality diverged from intent.&lt;/p&gt;

&lt;p&gt;After the benchmarks, the registry got updated based on the common mistakes the agents made: patterns that hadn't been checked for, or that hadn't emerged before. Running it again on the already-refactored codebase:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;archcodex check &lt;span class="nt"&gt;--project&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;15 errors. 225 warnings.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In code that had already been cleaned up. The benchmarks had revealed what to look for, and now a whole new category of issues was visible.&lt;/p&gt;

&lt;p&gt;Now when an LLM adds a feature, it sees the constraints. It follows the patterns. Not because of a longer prompt, but because the architecture is explicit.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;p&gt;Here's what 1500 hours of AI-assisted development taught me:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLMs are power tools. Power tools are dangerous without jigs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ArchCodex is the fence, the guide, the jig. It doesn't limit what the LLM can do; it guides the cut toward what should be done, in this codebase, for this architecture. And it helps software engineers and architects maintain a shared understanding of the architecture, navigate refactoring, and find architectural issues.&lt;/p&gt;

&lt;p&gt;The benchmarks proved something I suspected but wanted to confirm: &lt;strong&gt;the gap between "working code" and "good code" is hard to enforce, or even guide, with traditional tools.&lt;/strong&gt; Compilation, tests, even manual QA: they catch the loud failures. The silent ones compound until your codebase becomes the thing everyone dreads touching. This isn't unique to AI coding, of course; anyone who's worked on large enterprise applications will recognize the pattern.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;ArchCodex is released as open source, for anyone to test, change, fork, benchmark, and use. Let me know the results :)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ArchCodexOrg/archcodex" rel="noopener noreferrer"&gt;GitHub - ArchCodexOrg/archcodex&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
