
Theo Valmis

Posted on • Originally published at mnemehq.com

Why RAG Fails for Architectural Governance

Retrieval-augmented generation is an excellent tool for knowledge lookup. It is the wrong tool for enforcing architectural decisions. The distinction matters — and most teams building AI coding workflows haven't confronted it yet.

When teams first encounter the problem of governing AI-generated code, RAG is the intuitive answer. You have a set of architectural decisions — ADRs, style guides, internal wikis, team conventions — and you want your AI coding assistant to respect them. RAG can retrieve relevant documents and inject them into the prompt. Problem solved, apparently.

It isn't. The mismatch between what RAG provides and what architectural governance requires is deep, and fixing it requires a different kind of system entirely.

What RAG is actually good at

RAG excels when the task is: given a query, find the most semantically relevant passages from a corpus and surface them to the model. It works well for:

  • Documentation lookup — "How does our auth middleware work?" retrieves the relevant design doc.
  • FAQ / support — surface the right answer from a knowledge base.
  • Context injection — prime the model with background it wouldn't otherwise have.
  • Summarization — condense a retrieved document for downstream consumption.

The common thread: RAG is a retrieval and suggestion mechanism. It finds relevant information. It does not enforce anything.

What architectural governance actually requires

Architectural governance is a different problem category. When you need to prevent an AI agent from making a decision that violates your service boundaries, your decisions need to be:

  1. Authoritative — not "here's something relevant," but "this is the rule that applies here."
  2. Precedence-aware — when two decisions conflict, the system must resolve the conflict deterministically, not leave it to the model's judgment.
  3. Scope-aware — a decision about the payments service should not fire when the model is editing the analytics pipeline.
  4. Enforcement-capable — the system needs to block or flag violations, not merely suggest alternatives.
  5. Structurally validated — decisions need a schema, not just free-form text, so violations can be detected consistently.

RAG addresses exactly none of these requirements natively.
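Requirement 5 implies that decisions live as typed records rather than prose. A minimal sketch of such a record, assuming hypothetical field names (this is not a fixed schema):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical typed decision record; field names are illustrative only.
@dataclass(frozen=True)
class Decision:
    id: str
    constraint: str          # the rule itself, machine-checkable where possible
    scope: str               # glob the decision applies to, e.g. "services/payments/**"
    precedence: str          # "org" or "project" in this sketch
    status: str = "active"
    supersedes: Optional[str] = None

rule = Decision(
    id="ADR-001",
    constraint="Use Valkey for caching; Redis is disallowed.",
    scope="services/**",
    precedence="org",
)
```

With typed fields, a violation check can compare output against the `constraint` and `scope` fields directly instead of re-interpreting free-form text each time.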

Failure mode 1: Semantic similarity ≠ decision authority

RAG retrieves documents based on embedding similarity to the query. "Which cache library should I use?" might retrieve three documents: a blog post about Redis, an ADR mandating Valkey, and a benchmark comparing both. The ADR is authoritative. The blog post is noise. RAG has no mechanism to distinguish them.

You can try to patch this by tagging documents with metadata and filtering by source type. But you've now built a lightweight decision registry on top of your RAG system — which is a separate architectural layer. And you still haven't solved the next problem.

The core issue: RAG ranks by similarity. Governance requires ranking by authority. These are orthogonal dimensions that need separate systems.
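The orthogonality is easy to see in a toy example. The similarity scores below are invented stand-ins for real embedding scores; nothing in them encodes which document is binding:

```python
# Toy illustration: similarity ranking and authority filtering disagree.
docs = [
    {"title": "Redis tips blog post",     "similarity": 0.91, "authority": None},
    {"title": "ADR-007: use Valkey",      "similarity": 0.84, "authority": "adr"},
    {"title": "Redis vs Valkey benchmark", "similarity": 0.88, "authority": None},
]

by_similarity = sorted(docs, key=lambda d: d["similarity"], reverse=True)
authoritative = [d for d in docs if d["authority"] == "adr"]

print(by_similarity[0]["title"])   # the blog post wins on similarity
print(authoritative[0]["title"])   # the ADR is the only binding document
```

The ADR ranks last by similarity and first by authority. No amount of embedding tuning changes that, because authority is metadata the embedding never sees.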

Failure mode 2: No precedence resolution

Real architectural decision sets contain conflicts. An org-level decision from 2022 says "use PostgreSQL for all relational storage." A project-level decision from 2024 says "this service uses SQLite for simplicity — approved by the platform team." Which wins?

The correct answer depends on scope, recency, and the explicit precedence relationship between the two decisions. A RAG system doesn't model any of this. It retrieves both, injects both, and leaves the model to interpret the contradiction. In practice, the model will pick whichever it finds more convincing in context — which is non-deterministic and ungovernable.

A proper governance system needs a precedence engine: an explicit, deterministic function that takes a set of retrieved decisions and produces a single authoritative answer for a given context.
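A sketch of such an engine, assuming a simple model: more specific precedence levels win, explicitly superseded decisions are dropped, and recency breaks ties. The level ordering and tie-breakers here are assumptions, not a standard:

```python
# Hypothetical deterministic precedence engine.
PRECEDENCE_ORDER = {"org": 0, "team": 1, "project": 2}  # higher = more specific

def resolve(decisions):
    """Return the single authoritative decision from a conflicting set."""
    superseded = {d["supersedes"] for d in decisions if d.get("supersedes")}
    live = [d for d in decisions if d["id"] not in superseded]
    # Most specific level wins; ties broken by most recent year.
    return max(live, key=lambda d: (PRECEDENCE_ORDER[d["level"]], d["year"]))

conflict = [
    {"id": "ADR-001", "level": "org",     "year": 2022, "constraint": "PostgreSQL everywhere"},
    {"id": "ADR-019", "level": "project", "year": 2024, "constraint": "SQLite for this service"},
]
print(resolve(conflict)["id"])  # ADR-019: project-level, more recent
```

The point is not this particular ordering but that the function is deterministic: the same set of decisions always resolves to the same answer, independent of the model's mood.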

Failure mode 3: Retrieval quality degrades at scale

RAG retrieval quality is a function of embedding model quality, corpus size, and query construction. In small corpora (under 100 documents), RAG works reasonably well. As your decision corpus grows to hundreds of ADRs, style guides, runbooks, and policy documents, retrieval precision drops and recall degrades.

More importantly, architectural decisions have precise scope signals — file patterns, service names, module boundaries — that embedding-based retrieval handles poorly. Scope-aware retrieval requires structured matching, not vector similarity.
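Structured scope matching is cheap and exact. A sketch using stdlib glob matching (note that `fnmatch` does not treat path separators specially, which is acceptable for illustration but a real system would want stricter glob semantics):

```python
import fnmatch

# Scope-aware retrieval via glob patterns instead of vector similarity.
decisions = [
    {"id": "ADR-012", "scope": "services/payments/**"},
    {"id": "ADR-020", "scope": "services/analytics/**"},
]

def decisions_for(path):
    """Return the ids of all decisions whose scope pattern matches the path."""
    return [d["id"] for d in decisions if fnmatch.fnmatch(path, d["scope"])]

print(decisions_for("services/payments/models.py"))   # ['ADR-012']
print(decisions_for("services/analytics/etl.py"))     # ['ADR-020']
```

An embedding-based retriever can only approximate this; a glob match either fires or it doesn't.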

Failure mode 4: Suggestion, not enforcement

Even when RAG retrieves the right decision, the model can ignore it. Whether it respects the retrieved constraint depends on prompt construction, model behavior, and context window dynamics — none of which are deterministic.

In practice, models under instruction to complete a coding task will prioritize task completion over constraint adherence when the two are in tension. A suggestion system isn't a governance system.

Enforcement requires a layer that can inspect generated output against structured constraints and block or flag violations before they reach review. This is architecturally separate from the generation layer, and RAG has no role in it.
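A toy version of such a check, assuming constraints carry machine-checkable forbidden patterns (a real system would inspect the AST or use semantic analysis; the pattern list here is hypothetical):

```python
import re

# Hypothetical constraint with regex patterns the output must not contain.
constraint = {
    "id": "ADR-012",
    "forbidden": [r"\bimport\s+sqlite3\b", r"\bsqlite3\.connect\b"],
}

def check(generated_code, constraint):
    """Return the forbidden patterns found in the output; empty list means pass."""
    return [p for p in constraint["forbidden"] if re.search(p, generated_code)]

output = "import sqlite3\nconn = sqlite3.connect('app.db')\n"
violations = check(output, constraint)
if violations:
    print(f"blocked by {constraint['id']}: {len(violations)} violation(s)")
```

Because the check runs against the generated output itself, it holds regardless of whether the model read, understood, or ignored the injected constraint.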

What a proper governance layer looks like

The distinction between RAG-based context injection and governance enforcement:

| Dimension | RAG approach | Governance layer |
| --- | --- | --- |
| Retrieval basis | Embedding similarity | Scope + keyword + recency |
| Authority model | None — all documents equal | Explicit precedence hierarchy |
| Conflict resolution | Left to the model | Deterministic precedence engine |
| Enforcement | Suggestion only | Block / flag at generation time |
| Decision schema | Free-form text | Structured with typed fields |
| Scope handling | Implicit / approximate | Explicit scope patterns per decision |

A governance layer needs a structured decision schema — typed fields for scope, rationale, status, superseded-by, and the constraint itself. An example structured decision record:

```yaml
# ADR-012: Payment service storage backend
id: ADR-012
status: active
scope: services/payments/**
supersedes: ADR-004
precedence: project   # beats org-level if conflict
constraint: Use PostgreSQL with SQLAlchemy ORM. No direct SQL. No SQLite.
rationale: Consistency with audit logging requirements (SOC 2 TR-7).
```

When the model targets a file matching services/payments/**, this decision fires. Its precedence level is checked against any conflicting org-level rules. The resolved constraint is injected authoritatively. If the model's output violates it, the enforcement layer blocks the write.
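That flow can be sketched end to end, assuming the record has been parsed into a dict. The conflicting org-level rule and the two-level precedence model here are hypothetical simplifications:

```python
import fnmatch

# Simplified firing flow: scope match first, then precedence resolution.
LEVEL = {"org": 0, "project": 1}  # higher value wins a conflict

adr_012 = {"id": "ADR-012", "scope": "services/payments/**", "precedence": "project",
           "constraint": "Use PostgreSQL with SQLAlchemy ORM. No direct SQL. No SQLite."}
org_rule = {"id": "ADR-001", "scope": "**", "precedence": "org",
            "constraint": "Use PostgreSQL for all relational storage."}

def resolve_for(path, rules):
    """Fire every rule whose scope matches the path, then pick the most specific."""
    fired = [r for r in rules if fnmatch.fnmatch(path, r["scope"])]
    return max(fired, key=lambda r: LEVEL[r["precedence"]])

winner = resolve_for("services/payments/models.py", [org_rule, adr_012])
print(winner["id"])  # ADR-012: project-level beats org-level
```

Only the winning constraint reaches the model, and only the winning constraint is enforced against its output.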

When RAG is still useful in this context

RAG is useful for surfacing relevant prior art, similar code patterns, and background context during generation. It's appropriate when you want to enrich the model's knowledge without binding it to specific constraints. It pairs well with a governance layer: RAG provides context, the governance layer provides constraints.

The error is conflating the two — assuming that injecting decision documents via RAG is equivalent to enforcing those decisions. It isn't.

The underlying principle

The distinction maps to a simple principle: suggestion systems and enforcement systems are architecturally incompatible. You cannot make a suggestion system into an enforcement system by improving retrieval quality. The enforcement guarantee requires a structural property — a deterministic check against structured constraints — that retrieval-based systems cannot provide by design.

Architectural governance for AI coding is an enforcement problem. It should be solved with enforcement architecture.


Related: Memory Is Not Governance · Why Architectural Governance Needs Precedence Semantics

