*What 1500 hours of AI-assisted development taught me about the difference between code that runs and code that belongs.*
TL;DR: ArchCodex prevents architectural drift in AI-generated code by surfacing the right constraints at the right time. Benchmarks showed: 36% lower production risk, 70% less drift, and Opus 4.5 achieved zero drift on vague tasks. Top-tier models need it for consistency. Lower-tier models need it to produce working code at all (+55pp).
GitHub - ArchCodexOrg/archcodex
This is Part 1; deeper dives coming.
Over 1500 hours and roughly €1200 in API costs, I built NimoNova as a side project in evenings and weekends: a 2300-file research workspace with automatic knowledge graphs, fact and timeline extraction, document analysis, and multi-tier RAG. I built it almost entirely with LLM coding assistants.
The code compiled. The tests passed. Users could actually use it.
But I had this nagging feeling: what if it was full of mistakes I couldn't see?
NimoNova: knowledge graphs extracted automatically from research sources
The Problem With "Working Code"
LLMs are good at writing code that seemingly works. They can understand APIs, they can follow syntax, they can implement complex algorithms correctly.
What they're terrible at is writing code that belongs.
This isn't just my experience. Security researchers have identified the same pattern:
"One of the hardest risks to detect is what might be called architectural drift—subtle model-generated design changes that break security invariants without violating syntax. These changes often evade static analysis tools and human reviewers." — Endor Labs, 2025
Every codebase has patterns. Conventions. An implicit architecture that experienced developers absorb by working in it: building mental models, picking up tribal knowledge. When you ask an LLM to add a feature, it doesn't know that your team uses requireProjectPermission() instead of manual ownership checks. It doesn't know you have a mutation-per-operation convention, or that barrel exports go in sibling index.ts files, or that soft-deleted records should be filtered by default (or that soft-delete is even a thing).
The LLM will write something that seemingly works. But it won't write something that fits.
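To make that concrete, here's a minimal sketch of the difference. Only the helper name requireProjectPermission() comes from my codebase; the types and function names around it are simplified and hypothetical.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface Project {
  ownerId: string;
}

interface Ctx {
  userId: string;
  db: { getProject(id: string): Promise<Project | null> };
}

// Stand-in declaration for the codebase's real centralized helper;
// the signature here is assumed, not the actual one.
declare function requireProjectPermission(
  ctx: Ctx,
  projectId: string,
  permission: "edit"
): Promise<void>;

// What an LLM tends to write: a manual ownership check. It compiles, it works,
// and it quietly bypasses whatever rules the centralized helper enforces.
async function renameProjectDrifted(ctx: Ctx, projectId: string, newName: string) {
  const project = await ctx.db.getProject(projectId);
  if (!project || project.ownerId !== ctx.userId) {
    throw new Error("Not allowed");
  }
  // ...update the name to newName
}

// What belongs in this codebase: the same helper every other operation uses.
async function renameProject(ctx: Ctx, projectId: string, newName: string) {
  await requireProjectPermission(ctx, projectId, "edit");
  // ...update the name to newName
}
```

Both versions pass review at a glance and both "work". Only one of them fits.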
Careful prompts, multiple runs, manual reviews: all of it helped. But when you're pumping out code at scale, things slip through. A big application with many modules and a lot of functionality will drift. This happens in human-built codebases too. The difference is that with LLMs, it happens faster and more often.
And here's what made it worse: drift compounds. When your codebase is inconsistent (multiple ways of doing the same thing, duplicate utilities, competing patterns), LLMs perform worse. They can't pick the right approach when several exist. They copy the wrong pattern because it appeared more recently in context. The drift accelerates.
One function uses the centralized permission system; another does a manual check. One module follows the established error handling pattern; another invents its own. The codebase doesn't drift all at once; it drifts one "working" commit at a time. And each drift makes the next one more likely.
The analogy I like to use is the table saw. A table saw can cut anything, and that's great. But without a fence, without guides, without jigs, you get cuts that are technically correct but practically useless. Each cut is fine in isolation. Together, nothing fits.
LLMs needed a jig: something to guide the cut toward what should be done, in this codebase, for this architecture. So I started building one, using LLMs both to code it and as my focus group. I call it ArchCodex.
Testing the Hypothesis
The idea behind ArchCodex was simple: LLMs are good at some things and, due to inherent constraints like context windows, quite bad at others. What if I helped them? Give them the right context at the right time. Surface the patterns they should follow, exactly when they need to follow them. Make it easy to check what they've done and see what they didn't do.
But I wanted to measure whether the effectiveness I thought I was experiencing was real and consistent, not just confirmation bias.
So I ran multiple benchmarks. Thirty LLM runs across five models (GPT 5.1, Claude Opus 4.5, Claude Haiku 4.5, Gemini Pro 3, GLM 4.7), two different coding tools, with and without ArchCodex. Two different tasks on my actual codebase.
The baseline wasn't naive. The codebase already had a solid AGENTS.md with guidelines and conventions. The agents I used were Warp.dev with indexed source code (giving the LLM codebase awareness) and Claude Code. These are reasonable conditions and ArchCodex still produced significant improvements on top of them.
The benchmarks covered two types of tasks. The first was a detailed prompt with explicit acceptance criteria. On that task, ArchCodex reduced production risk by 36%, dramatically reduced architectural drift for top-tier models (zero-drift rates jumped from 17% to 70%), and increased working-code rates by 55 percentage points for lower-tier models. The second task, a vague high-level feature request, revealed something more interesting.
How I defined Production Risk:
- Silent Bugs: Logic errors that pass unit tests but fail requirements (e.g., semantic drift)
- Loud Bugs: CI failures, lint errors, broken UI or crashes
- Architectural Drift: Violations of project conventions (e.g., not using the right utilities, wrong structure, importing code across boundaries, etc.)
The High-Level Task
I gave the models a one-line prompt on NimoNova's actual codebase:
"Add the ability to duplicate timeline entries in projects. Users should be able to duplicate an entry and have it appear right below the original."
No acceptance criteria. No implementation hints. Just a feature request. The catch? Project timelines in NimoNova have five entry types, a chronicle section for completed items, junction tables for linked resources, and UI components across five archetypes.
This is where it got interesting.
Opus 4.5 (no ArchCodex) produced:
- ✅ Correct sort algorithm
- ✅ Smallest diff (41 lines)
- ✅ Working code
GPT 5.1 (no ArchCodex) produced:
- ✅ Correct sort algorithm
- ✅ Zero critical bugs
- ✅ Working code
Sounds great, right? Here's how they actually ranked:
| Model | Algorithm | Critical Bugs | Final Rank |
|---|---|---|---|
| Opus 4.5 (no ArchCodex) | ✅ Correct | 1 | 6th |
| GPT 5.1 (no ArchCodex) | ✅ Correct | 0 | 8th (LAST) |
The model with zero critical (loud) bugs ranked dead last, because my scoring penalized architectural drift and silent bugs. Drift is a source of future bugs and unmaintainable code, and silent bugs are far harder to debug once they land in production.
Why "Zero Bugs" Ranked Last
GPT 5.1's code worked. It would pass QA. Users would never notice a problem.
But it had six silent failures:
- Copied user mentions to the duplicate (semantically wrong, the duplicate wasn't created by those users)
- Placed completed-task duplicates in the chronicle section (wrong, duplicates should start fresh)
- Set `inProgressSince: undefined` for in-progress tasks (breaks duration calculations in the timeline)
- Missing UI wiring (the backend existed but no button triggered it across any of the five archetypes)
- Copied source markers (creates false backlinks in the knowledge graph)
- No centralized permissions (inconsistent with `requireProjectPermission()` used everywhere else)
None of these would show up in compilation. Most wouldn't show up in testing. They'd ship to production and cause subtle, hard-to-debug problems weeks later.
This is "deceptively correct" code, the most dangerous kind, because it passes most checks except the one that matters. Silent failures don't trigger alerts. They erode trust.
What ArchCodex Changed
With ArchCodex, the same models produced dramatically different results. The vague task showed where ArchCodex helps most:
| Metric (High Level Task) | With ArchCodex | Without | Delta |
|---|---|---|---|
| Architectural drift | 0.75 avg | 2.5 avg | -70% |
| Loud bugs | 0.5 avg | 1.5 avg | -67% |
| Production risk | 7.75 | 11.75 | -34% |
But the effect varied by model tier:
| Model Tier | Primary Benefit | Key Metric |
|---|---|---|
| Top-tier (Opus 4.5, GPT 5.1) | Drift prevention | -80% drift, Opus 4.5 achieved zero drift |
| Lower-tier (Haiku 4.5, GLM 4.7) | Fewer crashes | -50% loud bugs, -23% risk |
The key insight: top-tier models don't need ArchCodex to write working code. They need it to write code that belongs.
What the benchmarks revealed about different models:
The value of ArchCodex depends on what you're working with. Top-tier models (Opus 4.5, GPT 5.1) already produce working code. Their problem is drift. Without ArchCodex, they "creatively" deviate from your architecture. With it, zero-drift rates jumped from 17% to 70%.
Lower-tier models (Haiku 4.5, Gemini Pro 3, GLM 4.7) have a different problem: they often don't produce working code at all. ArchCodex increased working code rates from 20% to 75%, a 55 percentage point improvement.
The takeaway: Top-tier models need ArchCodex for consistency. Lower-tier models need it for viability.
Opus 4.5 without ArchCodex extended an existing createEntry function instead of creating a dedicated mutation. Technically clever. Algorithmically correct. But it violated the codebase's mutation-per-operation pattern, a pattern every other operation followed.
With ArchCodex, the same model created a proper dedicated mutation. Not because it was told to, but because the constraints surfaced the pattern.
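A hedged sketch of that difference, with hypothetical signatures (only the createEntry name comes from the codebase):

```typescript
// Drifted approach: bolt an optional flag onto the existing create path.
// Algorithmically fine, but now one function does two jobs, and the
// mutation-per-operation convention quietly breaks.
async function createEntry(args: { title: string; duplicateOf?: string }): Promise<string> {
  // ...create a new entry, and if duplicateOf is set, also copy fields from it
  return "new-entry-id";
}

// What the convention expects: duplication gets its own entry point,
// like every other operation in the codebase.
async function duplicateEntry(args: { entryId: string }): Promise<string> {
  // ...load the original, build a fresh copy, insert it right below the original
  return "new-entry-id";
}
```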
What It Didn't Fix
ArchCodex isn't magic. The benchmarks revealed clear limitations:
Model capabilities are still model capabilities. Haiku still made algorithm mistakes with ArchCodex. No agent (zero out of eight) discovered they needed to wire up UI components across five archetypes. Source marker filtering was a universal blind spot. ArchCodex can surface patterns; it can't upgrade a model's reasoning.
Hints get ignored—especially by weaker models. Only 31% of runs used requireProjectPermission() even though it was in the hints. The lesson: for weaker models, hints aren't enough. If it matters, make it a constraint.
Things not in the registry don't get caught. Only 18% checked for deleted projects. Only 36% prevented owners from adding themselves as members. Why? Those rules weren't in the registry yet. The benchmarks became the source for new constraints, which is exactly how the system is supposed to work.
The Feedback Loop: Five Questions That Improve the Registry
Before diving into how ArchCodex works, here's the workflow that makes it evolve.
After a complex session, or when the output feels off, I ask the LLM five questions:
- What information did you need that you DID get from ArchCodex?
- What information did you need that you DID NOT get?
- What information did ArchCodex provide that was irrelevant or noisy?
- Did you create or update any architectural specs? Why or why not?
- For the next agent working on this code, what will ArchCodex help them with?
This isn't every session; maybe once a week, or after a particularly gnarly feature. The answers are gold. Question 2 reveals which constraints or hints to add. Question 3 reveals what to trim. And Question 5? That's where the LLM documents patterns for future LLMs. It leaves breadcrumbs. The system starts to maintain itself.
How ArchCodex Works
*Full documentation on GitHub: ArchCodexOrg/archcodex*
ArchCodex is built on three ideas:
1. Just-In-Time Context. When an LLM reads a file, it should see the rules that code should follow. ArchCodex "hydrates" minimal @arch tags into full architectural context: constraints, hints, reference implementations. The context is triggered by location, not by query. Mutation file gets mutation rules; query file gets query rules.
2. Static Enforcement. Constraints are checked automatically: on save, on commit, in CI. Twenty-plus constraint types cover imports, patterns, naming, structure, and cross-file boundaries. When violations occur, error messages are actionable: "here's the alternative, here's why, here's a reference implementation."
3. Broad Analysis. Beyond per-file checks: health metrics (override debt, coverage), garden analysis (duplicate code), type consistency (drifted definitions), and import boundary enforcement.
The @arch tag, @intent annotations, and @override exceptions make the implicit explicit. The registry is a living document that helps software engineers as well as AI agents.
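Here's a hedged sketch of what this looks like from inside a file. The annotation syntax and the hydrated output in the comments are illustrative, not copied from the tool; see the documentation for the real format.

```typescript
// @arch backend/mutations
// @intent Duplicate a timeline entry directly below the original.
//
// When an agent opens this file, ArchCodex hydrates the minimal tag above into
// architectural context along these (illustrative) lines:
//
//   constraints:
//     - call requireProjectPermission() before any write
//     - one mutation per operation (don't extend createEntry)
//   hints:
//     - reference implementation: an existing mutation that follows the pattern
//
// If a rule genuinely doesn't apply here, the escape hatch is an @override
// exception, which later shows up as override debt in the health metrics.

export async function duplicateTimelineEntry(entryId: string): Promise<void> {
  // The implementation follows the surfaced constraints instead of reinventing them.
}
```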
The Registry as Living Documentation
The registry isn't a one-time setup; it's an evolving artifact that grows with your codebase, codifying common mistakes and their solutions. Most updates come from mundane sources:
| Source | Example | Registry Update |
|---|---|---|
| Code review | "Why did you do a manual ownership check here?" | Add constraint: `require_call_before` |
| Bug in production | Soft-deleted records appeared in a query | Add `require_pattern` for query files |
| Onboarding friction | "Where do barrel exports go?" | Add hint with example |
| LLM feedback (the 5 questions) | "I didn't know you had a centralized permission helper" | Add hint pointing to `requireProjectPermission()` |
This compounds over time. One benchmark showed the effect clearly. Haiku 4.5, a lower-tier model, couldn't produce working code on the task at all without ArchCodex. Starting from a base registry and adding constraints based on what it got wrong, each iteration improved the result:
| Registry State | Working? | Silent Bugs | Score vs Baseline |
|---|---|---|---|
| No ArchCodex | ❌ | 5 | — |
| Base Registry | ✅ | 3 | +40% |
| + Security Hints | ✅ | 2 | +48% |
| + Fixed Patterns | ✅ | 2 | +68% |
Each iteration of the registry, each constraint added from observing mistakes, made the next run better. And those same constraints surface similar issues across the existing codebase whenever `archcodex check --project` is run.
This is fundamentally different from traditional linters, which are typically set up once, maintained by a platform team, binary pass/fail, and focused on syntax rather than architecture. It shares some ideas with semantic linters, but it gives you fine-grained control and, among other things, adds context.
The registry is more like executable architecture decision records, decisions that are enforced, not just documented. When you decide "all queries must filter soft-deleted records for specific types of classes, models, or frontend components," that decision becomes a constraint. When you decide "use the event system for this module instead of direct database calls," that becomes a pattern with reference implementations. The architecture isn't in a wiki that nobody reads; it's in the tool that LLMs consult on every file.
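For example, here's a hedged sketch of the code shape a "filter soft-deleted records by default" decision pushes you toward once it becomes a constraint. The names are hypothetical; the actual rule would live in the registry and be checked against query files.

```typescript
// Hypothetical record shape; illustrative only.
interface ProjectRecord {
  id: string;
  name: string;
  deletedAt?: number; // soft-delete marker: set instead of removing the row
}

// The default path a query file is expected to use:
// soft-deleted records are filtered out unless a caller explicitly opts in.
function listProjects(records: ProjectRecord[]): ProjectRecord[] {
  return records.filter((record) => record.deletedAt === undefined);
}

// The opt-in is a separate, clearly named function, so a reviewer (or a constraint
// scanning query files) can spot deliberate exceptions at a glance.
function listProjectsIncludingDeleted(records: ProjectRecord[]): ProjectRecord[] {
  return [...records];
}
```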
Arch tags provide the architectural "why" and "what"; the code itself is the specific implementation. If you change something in the architecture (replacing utilities, strengthening constraints, etc.), running check --project shows the impact of those changes and what code needs to be refactored to be compliant again. It serves as a guide not just for new functionality but also for refactoring.
You're Not Starting From Scratch
A reasonable objection: "So I have to define all these rules for my specific codebase?"
Yes, and that's the point. Every codebase has an architecture. Conventions, patterns, boundaries, the implicit "how we do things here." The problem is that this architecture lives in tribal knowledge, in code review comments, in the senior engineer's head, in that onboarding doc nobody updates. LLMs can't read tribal knowledge. But you don't have to write it all at once—you improve it over time. In addition, there are commands available that make setting up an initial registry easy.
In practice, registries have three layers:
Layer 1: Universal principles. Things like SOLID, separation of concerns, basic hygiene. These ship with ArchCodex or are trivially shared. Inherit them and forget about them.
Layer 2: Stack idioms. Convex mutation patterns. Next.js App Router conventions. tRPC procedure structure. These can be community-maintained, shared YAML files that capture best practices for your stack.
Layer 3: Your architecture. The stuff unique to your codebase. Your permission system. Your event patterns. Your module boundaries. This is what you define and what the LLM helps you write.
Your architecture already exists. It's just scattered. ArchCodex gives you a place to put it, and the LLM helps document it. Every rule you add prevents a class of drift.
What Happened When I Applied It At Scale
Applying ArchCodex to NimoNova's ~2200 files took a couple of evenings and a weekend. The initial scan was sobering: many hundreds of warnings. Drift everywhere. Duplicate utilities, diverged type definitions, inconsistent permission checks.
ArchCodex guided major refactoring: event-driven migration for excessive database calls, security hardening for inconsistent permissions, code duplication cleanup via garden and types analysis, and target architecture enforcement to show where reality diverged from intent.
After the benchmarks, the registry got updated based on the common mistakes the agents made: patterns that hadn't been checked for, or that hadn't surfaced before. Running it again on the already-refactored codebase:
archcodex check --project
15 errors. 225 warnings.
In code that had already been cleaned up. The benchmarks had revealed what to look for, and now a whole new category of issues was visible.
Now when an LLM adds a feature, it sees the constraints. It follows the patterns. Not because of a longer prompt, but because the architecture is explicit.
The Real Lesson
Here's what 1500 hours of AI-assisted development taught me:
LLMs are power tools. Power tools are dangerous without jigs.
ArchCodex is the fence, the guide, the jig. It doesn't limit what the LLM can do, it guides the cut toward what should be done, in this codebase, for this architecture. And it helps software engineers and architects maintain a shared understanding of the architecture, navigate refactoring, and find architectural issues.
The benchmarks proved something I suspected but wanted to confirm: the gap between "working code" and "good code" is one that traditional tools can't guard. Compilation, tests, even manual QA catch the loud failures. The silent ones compound until your codebase becomes the thing everyone dreads touching. Of course, this isn't unique to AI coding; anyone who's worked on large enterprise applications will recognize the pattern.
Try It Yourself
ArchCodex is released as open source, for anyone to test, change, fork, benchmark, and use. Let me know your results :)
GitHub: ArchCodexOrg/archcodex
