How Mneme governs AI-generated code before the model writes a line

#llm #ai #architecture #python

LLMs start every call from zero. They reintroduce a library you dropped six months ago, rebuild a component you chose to keep small, and contradict decisions your team already settled. Each violation reads as reasonable on its own. Stack them across a week of agent sessions and you get architectural drift.

Mneme works at the prompt boundary. It reads the decisions your project already made and checks the task against them before the model generates anything. The repo ships Layer 1: local-repo, single-developer, project-scoped governance. Here is the shape.

The pipeline

Five stages read one corpus file and run locally in under two minutes:

project_memory.json → MemoryStore → Retriever → ContextBuilder → LLMAdapter → Evaluator

project_memory.json holds the corpus: rules, constraints, anti-patterns, and decision records as structured, human-editable JSON. You write it by hand or compile it from ADRs.
MemoryStore loads the file and migrates legacy item shapes so older corpora still parse.
Retriever picks the decisions relevant to the current task. It scores on keyword overlap, tag match, and priority weight. No embeddings, no vector database.
ContextBuilder formats the top matches into a compact context packet.
LLMAdapter injects that packet as the system prompt and calls the model, or dry-runs with no API key.
Evaluator scores the response against the injected decisions and reports an alignment number.
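
The Retriever and ContextBuilder steps above can be sketched in a few lines. This is a minimal illustration, not Mneme's implementation: the item fields (id, kind, text, tags, priority), the stopword list, the weights in score, and the function names are all assumptions made for the example. It shows lexical scoring on keyword overlap, tag match, and priority weight, with a stable tie-break so the order never changes between runs.

```python
import re

# Hypothetical corpus items; field names are assumptions, not Mneme's schema.
CORPUS = [
    {"id": "D-001", "kind": "constraint", "priority": 3, "tags": ["http", "dependencies"],
     "text": "Use httpx for HTTP calls; requests was dropped."},
    {"id": "D-002", "kind": "rule", "priority": 2, "tags": ["scheduler"],
     "text": "Keep the scheduler component small; no plugin system."},
    {"id": "D-003", "kind": "anti-pattern", "priority": 1, "tags": ["search", "storage"],
     "text": "Do not add a vector database for search."},
]

# Small illustrative stopword list so filler words do not count as matches.
STOPWORDS = {"a", "an", "the", "for", "to", "of", "and", "in", "is", "was"}


def tokens(text):
    """Lowercase word set with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def score(item, task_tokens):
    """Keyword overlap + weighted tag match + priority. Weights are arbitrary."""
    keyword_overlap = len(tokens(item["text"]) & task_tokens)
    tag_match = len({tag.lower() for tag in item["tags"]} & task_tokens)
    if keyword_overlap == 0 and tag_match == 0:
        return 0  # priority alone never makes an item relevant
    return keyword_overlap + 2 * tag_match + item["priority"]


def retrieve(corpus, task, k=3):
    """Return the top-k matching items, ordered by score, then by id."""
    task_tokens = tokens(task)
    hits = [(score(item, task_tokens), item) for item in corpus]
    hits = [(s, item) for s, item in hits if s > 0]
    # Sorting on (-score, id) makes the order independent of input order.
    hits.sort(key=lambda pair: (-pair[0], pair[1]["id"]))
    return [item for _, item in hits[:k]]


def build_context(items):
    """Format the matches as a compact packet for the system prompt."""
    lines = ["Project decisions you must respect:"]
    lines += [f"- [{item['id']}] ({item['kind']}) {item['text']}" for item in items]
    return "\n".join(lines)


task = "Write an HTTP client wrapper using requests inside the scheduler"
top = retrieve(CORPUS, task)
print(build_context(top))
```

With this toy corpus the task pulls in D-001 and D-002 and leaves D-003 out, because nothing in the task shares a term or tag with it.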

A second path adds two pieces. conflict_detector scans the response after generation. An ADR compiler (adr_parser, then adr_compiler) turns ADR files with YAML frontmatter into the corpus and resolves precedence between decisions that disagree.
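
To make the ADR path concrete, here is a rough sketch of compiling frontmatter into corpus items. Everything in it is assumed for illustration: the frontmatter keys (id, status, tags, supersedes), the flat key: value parsing in place of a real YAML parser, and the precedence rule that a superseded ADR drops out. Mneme's adr_parser and adr_compiler may work differently.

```python
ADR_007 = """---
id: ADR-007
status: accepted
tags: http
---
Use requests for HTTP calls.
"""

ADR_012 = """---
id: ADR-012
status: accepted
supersedes: ADR-007
tags: http, dependencies
---
Use httpx for HTTP calls; requests was dropped.
"""


def parse_adr(text):
    """Split '---' delimited frontmatter from the body (flat key: value only)."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing frontmatter")
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        raise ValueError("unterminated frontmatter")
    meta = {}
    for line in lines[1:end]:
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    meta["body"] = "\n".join(lines[end + 1:]).strip()
    return meta


def compile_corpus(adr_texts):
    """Keep accepted ADRs that no other ADR supersedes (assumed precedence rule)."""
    adrs = [parse_adr(text) for text in adr_texts]
    superseded = {adr["supersedes"] for adr in adrs if adr.get("supersedes")}
    active = [
        adr for adr in adrs
        if adr.get("status") == "accepted" and adr["id"] not in superseded
    ]
    active.sort(key=lambda adr: adr["id"])  # stable output order
    return [
        {
            "id": adr["id"],
            "text": adr["body"],
            "tags": [t.strip() for t in adr.get("tags", "").split(",") if t.strip()],
        }
        for adr in active
    ]


corpus = compile_corpus([ADR_007, ADR_012])
print(corpus)
```

Here the two ADRs disagree about the HTTP library, and only ADR-012 reaches the corpus because it names ADR-007 in supersedes.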

The demo runs each task twice, once with no governance and once with the corpus enforced, so you read the delta yourself.

Three principles hold the design in place

Deterministic over clever. Same corpus and same query produce byte-identical retrieval order on every run. A simple retriever that returns the same answer twice beats a smart one that does not.
Auditable over autonomous. Every block records which decision matched, which rule fired, and which term in the input triggered it. You can rebuild any verdict from the artifacts.
Prevention before review. The check lands before generation. By the time a reviewer opens the pull request, the drift has already shipped into the branch.
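
As a sketch of what an auditable, pre-generation block could look like: the check below runs on the task text alone, before any model call, and every finding names the decision, the rule, and the triggering term. The blocked_terms field, the rule names, and the verdict layout are hypothetical, chosen only to illustrate the three principles.

```python
import json
import re

# Hypothetical decisions with explicit trigger terms; not Mneme's schema.
DECISIONS = [
    {"id": "D-001", "rule": "no-requests", "blocked_terms": ["requests"]},
    {"id": "D-003", "rule": "no-vector-db", "blocked_terms": ["faiss", "pinecone"]},
]


def check_task(task, decisions):
    """Check a task before generation and record why each block fired."""
    words = set(re.findall(r"[a-z0-9]+", task.lower()))
    findings = []
    # Sorted iteration keeps the findings order identical on every run.
    for decision in sorted(decisions, key=lambda d: d["id"]):
        for term in sorted(decision["blocked_terms"]):
            if term in words:
                findings.append(
                    {"decision": decision["id"], "rule": decision["rule"], "term": term}
                )
    return {
        "task": task,
        "verdict": "block" if findings else "pass",
        "findings": findings,
    }


verdict = check_task("Fetch the feed with requests and cache it", DECISIONS)
# sort_keys gives a byte-stable artifact that can be stored and diffed.
print(json.dumps(verdict, indent=2, sort_keys=True))
```

Because the verdict is plain data derived only from the task and the decisions, re-running the check on the stored inputs reproduces it exactly.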

Why this is not RAG

RAG retrieves documents to inform an answer. Mneme retrieves decisions to constrain one.

	RAG	Mneme
Input	Documents, chunks, embeddings	Rules, constraints, decision records
Goal	Inform the response	Shape the response
Output	The model knows more	The model follows what you decided
Test	"Did it cite the right source?"	"Did it respect the constraint?"

No vector store, no agent loop. The corpus stays small, structured, and yours.

What it is not, by design

Layer 1 is frozen. The freeze pins the retrieval mechanics, enforcement semantics, and benchmark methodology at commit e73ff7d. The open exit criterion is real-world validation with design partners. Several things sit outside the wedge on purpose, not on a backlog:

Not generalized agent memory or a conversation-history store
Not autonomous planning or tool-use orchestration
Not prompt rewriting. Mneme blocks a violating prompt; it does not polish one.
Not auto-fixing. Mneme blocks, and the human or model fixes.

The benchmark carries the same restraint. It is a regression instrument, not a generalization claim: canned model responses, fixed retrieval, and two-layer scoring, currently passing 7/7 scenarios with recall@3 = 1.00. Its job is to make any change to retrieval or enforcement visible, so no regression lands unseen.
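
For readers unfamiliar with the metric, recall@3 is the share of expected decisions that appear in the top three retrieved results. The sketch below computes it over hypothetical scenarios; the scenario data and the helper names are invented for illustration and are not the benchmark's actual fixtures or results.

```python
def recall_at_k(retrieved_ids, expected_ids, k=3):
    """Fraction of expected decisions found in the top-k retrieved ids."""
    if not expected_ids:
        return 1.0  # nothing to find counts as full recall
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(expected_ids)) / len(expected_ids)


def mean_recall(scenarios, k=3):
    """Average recall@k across scenarios."""
    return sum(recall_at_k(s["retrieved"], s["expected"], k) for s in scenarios) / len(scenarios)


# Made-up scenarios: the second one misses D-002, which sits at rank 4.
SCENARIOS = [
    {"retrieved": ["D-001", "D-002", "D-004"], "expected": ["D-001"]},
    {"retrieved": ["D-003", "D-005", "D-001", "D-002"], "expected": ["D-002", "D-003"]},
]

print(mean_recall(SCENARIOS))
```

Used as a regression gate, a test would compare this number against the last recorded value and fail when a retrieval change lowers it.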

Read the code

Layer 1, the benchmark suite, and an example corpus are public at https://github.com/MnemeHQ/mneme. The concepts behind the design (governance before generation, architectural drift, verification contracts) are defined at mnemehq.com/concepts.

Top comments (1)

Mike Czerwinski • Jun 30

The RAG-vs-Mneme framing is the right cut: retrieving decisions to constrain, not documents to inform. I run a close cousin of this in production for a company's ops, a decision ledger with a proposed/accepted/locked lifecycle plus anti-pattern records, so two questions from the trenches.

First, the corpus is the bottleneck, not the retriever. Prevention-before-generation only fires if the decision is already captured. My hardest failure mode isn't bad retrieval, it's a decision that was never written down or has quietly gone stale, and the gate waving through a violation because the corpus didn't know yet. Curation discipline ends up being the whole game. Curious how design partners are handling corpus freshness.

Second, the no-embeddings call. You buy byte-identical auditability, but keyword overlap is blind to vocabulary drift: a task that violates a decision without sharing its terms sails right past. I'm building the opposite bet right now, a semantic index, and the cost is exactly the auditability you're protecting. Have you hit cases where lexical retrieval missed a real conflict, or has tag discipline been enough to cover it?