<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Theo Valmis</title>
    <description>The latest articles on DEV Community by Theo Valmis (@mnemehq).</description>
    <link>https://dev.to/mnemehq</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3920377%2F602224a9-60d2-47cc-83f5-66295d08db90.png</url>
      <title>DEV Community: Theo Valmis</title>
      <link>https://dev.to/mnemehq</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mnemehq"/>
    <language>en</language>
    <item>
      <title>Why I Built Mneme HQ: Preventing AI Agent Architectural Drift</title>
      <dc:creator>Theo Valmis</dc:creator>
      <pubDate>Fri, 22 May 2026 14:18:43 +0000</pubDate>
      <link>https://dev.to/mnemehq/why-i-built-mneme-hq-preventing-ai-agent-architectural-drift-64m</link>
      <guid>https://dev.to/mnemehq/why-i-built-mneme-hq-preventing-ai-agent-architectural-drift-64m</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally published on &lt;a href="https://www.theovalmis.com/writing/why-i-built-mneme.html" rel="noopener noreferrer"&gt;theovalmis.com&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every time you start a new session with an AI coding agent, it has forgotten everything. Not just the small things — the names, the syntax, the last error message. It has forgotten the decisions you made three weeks ago about why you chose Postgres over MongoDB. It has forgotten the auth pattern you locked in after a security review. It has forgotten that you explicitly ruled out a particular library because it caused problems in production.&lt;/p&gt;

&lt;p&gt;The model starts cold. You start explaining. Again.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The problem with AI coding agents isn't intelligence. It's memory. Every session is the first session."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I spent months working on a long-running project with AI assistants — Cursor, Claude Code, GPT-4 — and I noticed the same pattern repeating. The models were brilliant at individual tasks. But they had no continuity. No awareness of the constraints we had already resolved. No sense that this codebase had a history, a set of deliberate architectural choices, things that were decided and should stay decided.&lt;/p&gt;

&lt;h2&gt;The Drift Problem&lt;/h2&gt;

&lt;p&gt;I started calling it architectural drift. It happens gradually. The agent suggests a new dependency — one you already evaluated and rejected for good reason. You either catch it (friction, lost time) or you don't (technical debt, inconsistency). The agent reaches for a different database adapter because it's seen it in training more often than yours. The agent refactors a module using a pattern you explicitly moved away from six months ago.&lt;/p&gt;

&lt;p&gt;None of this is the model's fault. These systems don't have access to your decision history. They can't know what you know. They see the code in front of them, not the conversation that led to it.&lt;/p&gt;

&lt;p&gt;The standard advice is: put everything in your system prompt. Write a long CLAUDE.md. Add comments everywhere. Re-explain at the start of every session.&lt;/p&gt;

&lt;p&gt;That's not a solution. That's manual memory management for a tool that's supposed to reduce cognitive load.&lt;/p&gt;

&lt;h2&gt;What I Wanted&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Store decisions where the code lives.&lt;/strong&gt; Not in a separate wiki. Not in a Notion database that nobody checks. Right in the repository, version-controlled, co-located with the work they govern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inject those decisions automatically.&lt;/strong&gt; Not by copying and pasting into every new chat. By having the tool surface the right constraints at the right moment — before the model has a chance to drift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enforce, not just remind.&lt;/strong&gt; A decision that can be ignored isn't a decision. It's a suggestion. I wanted something that could validate whether the current state of the codebase actually respects the decisions we've recorded — and flag the ones that don't.&lt;/p&gt;

&lt;h2&gt;Building Mneme HQ&lt;/h2&gt;

&lt;p&gt;I built Mneme HQ to solve this. The name comes from the Greek goddess of memory and remembrance — one of the original Muses. The idea is simple: your project has memory. Your AI assistant should too.&lt;/p&gt;

&lt;p&gt;At its core, Mneme HQ stores decisions in structured files that travel with your codebase:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;mneme add &lt;span class="s2"&gt;"Use Postgres — no new databases without ADR"&lt;/span&gt;
Decision recorded. ID: ADR-001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Those decisions get committed with your code. They're in git. They're reviewable. They're auditable. When you start a new session, Mneme HQ injects the relevant decisions into context — not all of them, just the ones that matter for what you're working on now.&lt;/p&gt;

&lt;p&gt;And before you ship, you can run a pre-flight check:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;mneme check &lt;span class="nt"&gt;--mode&lt;/span&gt; strict
Checking decisions against current context...

PASS  Storage decision enforced — Postgres locked, no new DBs
PASS  Auth pattern respected — JWT middleware unchanged
WARN  New dependency introduced — prisma not &lt;span class="k"&gt;in &lt;/span&gt;approved list
FAIL  Violates ADR-004 — Repository pattern bypassed &lt;span class="k"&gt;in &lt;/span&gt;user.service.ts

2 passed · 1 warning · 1 failure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PASS. WARN. FAIL. Clear signals you can act on before the model's suggestion becomes production code.&lt;/p&gt;

&lt;h2&gt;Cursor and Claude Code Integration&lt;/h2&gt;

&lt;p&gt;Because most AI-assisted development happens inside editors, Mneme HQ can generate Cursor rules directly from your stored decisions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;mneme cursor generate
Generated .cursor/rules/mneme.mdc from 7 decisions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Those rules become editor-level guardrails. Every suggestion Cursor or Claude Code makes inside your project gets filtered through the constraints you've already established. You stop explaining. The model already knows.&lt;/p&gt;

&lt;h2&gt;Why Now&lt;/h2&gt;

&lt;p&gt;AI coding agents are getting better at an extraordinary rate. But the gap between what they can do in a single session and what they can maintain across a long-running project is growing, not shrinking. Better models don't fix the memory problem. Longer context windows help at the margins but don't solve continuity.&lt;/p&gt;

&lt;p&gt;The projects that will get the most out of AI-assisted development aren't the ones with the best prompts. They're the ones with the best decision infrastructure — the ones where the codebase itself carries the memory of its own architectural intent.&lt;/p&gt;

&lt;p&gt;That's what Mneme HQ is for. Open source, repo-native, and CI-ready. If you're using Cursor, Claude Code, or any other AI coding tool on a project that has history — decisions made, patterns locked in, constraints established — I think you'll find it useful.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Try it:&lt;/strong&gt; &lt;a href="https://mnemehq.com" rel="noopener noreferrer"&gt;mnemehq.com&lt;/a&gt; · &lt;a href="https://github.com/TheoV823/mneme" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write about AI governance, architectural drift, and the systems that make AI-assisted development sustainable at scale.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>cursor</category>
      <category>architecture</category>
    </item>
    <item>
      <title>AI-Native Engineering Has an Intent Debt Problem</title>
      <dc:creator>Theo Valmis</dc:creator>
      <pubDate>Tue, 19 May 2026 15:44:01 +0000</pubDate>
      <link>https://dev.to/mnemehq/ai-native-engineering-has-an-intent-debt-problem-4jgo</link>
      <guid>https://dev.to/mnemehq/ai-native-engineering-has-an-intent-debt-problem-4jgo</guid>
      <description>&lt;p&gt;AI-native engineering is not failing because agents cannot write code. It is straining because agents can write code faster than organizations can preserve, update, and enforce intent.&lt;/p&gt;

&lt;p&gt;Industry signals are starting to name the pattern. Augment's State of AI-Native Engineering 2026 reports that nearly half of new code is now AI-generated, while confidence, comprehension, role definitions, onboarding, and process changes lag behind. The framing has shifted from "can AI write code" to "what is happening to the work around the code." The survey separates the strain into compounding debt layers — and the layer that matters most is also the least named.&lt;/p&gt;

&lt;p&gt;The problem is not only technical debt. It is not only cognitive debt. It is intent debt — and naming it is the precondition for fixing it.&lt;/p&gt;

&lt;h2&gt;What is intent debt?&lt;/h2&gt;

&lt;p&gt;Intent debt is the gap between what a system is supposed to preserve and what its AI agents are actually constrained to follow.&lt;/p&gt;

&lt;p&gt;It is the residue of every decision that was made but never written down in a form a machine can read. It is the architecture review that happened in a meeting room. The "we tried that and it failed" lesson that lives in one principal engineer's head. The dependency rule everyone follows because they were yelled at once. The spec that was true six months ago and nobody has reread since.&lt;/p&gt;

&lt;p&gt;Intent shows up in many forms: architectural decisions and ADRs; system boundaries and ownership lines; dependency and licensing rules; security and data-handling constraints; product assumptions baked into the design; engineering conventions and patterns; specs, requirements, and acceptance criteria; migration decisions and deprecation paths; "do not do this again" lessons from past incidents; and the tradeoffs behind previous choices.&lt;/p&gt;

&lt;p&gt;Technical debt lives in code. Cognitive debt lives in people. Intent debt lives in the space between decisions and execution.&lt;/p&gt;

&lt;h2&gt;Why implicit intent used to work&lt;/h2&gt;

&lt;p&gt;Before agentic development, intent could survive informally. Senior engineers remembered past decisions and routed work around old landmines. Reviewers caught violations on the way in. Onboarding transmitted conventions over months of pairing and PR feedback. Architecture docs were imperfect, but humans filled the gaps in real time. Implementation speed was slow enough for review to absorb drift.&lt;/p&gt;

&lt;p&gt;Human execution masked weak intent infrastructure. When humans were writing most of the code, they carried context into implementation. The intent lived in their heads, and their hands obeyed it.&lt;/p&gt;

&lt;p&gt;Agents do not carry that context. They execute on whatever is in the prompt window — and the things humans used to fill in silently are now load-bearing absences.&lt;/p&gt;

&lt;h2&gt;Why agents make intent debt visible&lt;/h2&gt;

&lt;p&gt;Agents do not inherit implicit judgment. They inherit whatever context happens to be in scope. So when execution accelerates without a corresponding upgrade in how intent is captured and surfaced, the gap stops being theoretical.&lt;/p&gt;

&lt;p&gt;The same patterns show up across teams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An agent chooses a new dependency because it solves the immediate task — and nobody told it the team already standardized on a different one.&lt;/li&gt;
&lt;li&gt;An agent rewrites a subsystem instead of extending an approved abstraction, because the abstraction was never made discoverable.&lt;/li&gt;
&lt;li&gt;An agent bypasses an ADR because the ADR was not in its context window.&lt;/li&gt;
&lt;li&gt;An agent creates a parallel pattern next to an existing one because the existing convention was undocumented.&lt;/li&gt;
&lt;li&gt;An agent optimizes locally while violating a global architectural direction nobody has written down.&lt;/li&gt;
&lt;li&gt;An agent reads stale docs as current truth and propagates an old decision forward.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each one is the same failure mode: a real decision exists, but it is not enforceable at the moment of generation. The agent does the most reasonable thing it can with the information available, and the result is drift at machine speed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI does not create intent debt. It reveals it by executing faster than the organization can correct it.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;Review is too late&lt;/h2&gt;

&lt;p&gt;Most teams' first instinct is to harden review. Add a PR bot. Train reviewers on AI-generated code. Tighten the checks at the merge gate.&lt;/p&gt;

&lt;p&gt;This helps, but it is structurally reactive. If intent only enters the workflow at PR review, the organization has already allowed the wrong work to be generated. Every flagged violation represents a generation cycle that produced the wrong thing, and a review cycle now spent unwinding it. At agent-velocity output, that arithmetic stops working.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;POST-GENERATION — Code review&lt;/strong&gt;: Asks "Is this change acceptable?" Operates after the code exists. Catches some violations. Costs a full generation cycle per miss.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PRE-GENERATION — Intent governance&lt;/strong&gt;: Asks "Should this change have been generated this way in the first place?" Operates before the diff exists. Prevents the violation from being written.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Memory and orchestration are not governance&lt;/h2&gt;

&lt;p&gt;The other instinct is to throw more context at the agent. Add a memory layer. Add retrieval. Add an orchestrator that routes tasks across tools. All of these are real, useful capabilities. None of them are governance.&lt;/p&gt;

&lt;p&gt;Memory preserves context. Orchestration coordinates work. Observability shows what happened. Review inspects the result. Governance is the layer that constrains what is allowed before and during execution — and that property does not emerge automatically from any of the others.&lt;/p&gt;

&lt;p&gt;Memory can retrieve stale or conflicting information. Orchestration can route a task to the wrong tool with full fidelity. Observability tells you the violation happened. Review catches a fraction of what surfaces. The missing layer is not more context. It is enforceable context.&lt;/p&gt;

&lt;h2&gt;The missing layer: intent governance&lt;/h2&gt;

&lt;p&gt;Intent governance is the practice of turning architectural decisions, constraints, and operating rules into enforceable contracts that agents and humans must respect during software delivery.&lt;/p&gt;

&lt;p&gt;The properties that make it work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Explicit.&lt;/strong&gt; Decisions are written down, not transmitted by osmosis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Versioned.&lt;/strong&gt; Intent changes are tracked the same way code changes are.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repo-native.&lt;/strong&gt; It lives next to the code it governs, not in a separate wiki nobody reads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scoped.&lt;/strong&gt; A rule that applies to one subsystem does not pollute another.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enforceable.&lt;/strong&gt; Violations are surfaced — or blocked — at generation time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auditable.&lt;/strong&gt; Every enforcement decision has a trace.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Available before generation.&lt;/strong&gt; Not just at review.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last property is the one that distinguishes governance from everything adjacent to it.&lt;/p&gt;
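&lt;p&gt;As a rough illustration of the properties listed above, a decision can be kept as plain data that serialises deterministically, so changes to intent show up as ordinary diffs in version control. The sketch below is in Python. The &lt;code&gt;Decision&lt;/code&gt; fields and the JSON layout are assumptions made for this example and are not Mneme HQ's actual schema.&lt;/p&gt;

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Decision:
    """One architectural decision stored as plain data (hypothetical schema)."""
    id: str                   # explicit: every decision has a stable identifier
    title: str
    status: str = "active"    # e.g. "active", "superseded", "deprecated"
    scope: tuple = ("**",)    # scoped: path globs the decision governs
    forbids: tuple = ()       # enforceable: names a checker can match on


def dump(decisions):
    """Serialise decisions to deterministic JSON so diffs stay reviewable."""
    return json.dumps([asdict(d) for d in decisions], indent=2, sort_keys=True)


def load(text):
    """Rebuild Decision objects from the JSON produced by dump()."""
    return [
        Decision(**{k: tuple(v) if isinstance(v, list) else v for k, v in row.items()})
        for row in json.loads(text)
    ]
```

&lt;p&gt;Sorting the keys and fixing the indentation keeps the serialised file stable between runs, which is what makes a change to a decision reviewable in the same way as a change to code.&lt;/p&gt;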

&lt;h2&gt;What intent governance looks like in practice&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Dependency governance.&lt;/strong&gt; Intent: "Do not introduce a second auth library." Without governance, an agent adds one because it solves the immediate task; the violation is caught in review, if it is caught at all. With governance, the constraint is surfaced before the agent writes the import, and the alternative path — the approved library — is part of the context the agent generates against.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural migration.&lt;/strong&gt; Intent: "Extend the current retrieval layer; do not rebuild with a new vector database." Without governance, an agent proposes a rewrite because rewrites are often the locally cleanest answer to a complex change. With governance, the agent sees the active architectural decision and works inside the approved path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generated artifact drift.&lt;/strong&gt; Intent: "All public artifacts follow naming and provenance conventions." Without governance, automation emits branches, PR titles, release notes, and generated docs outside repo standards. With governance, those surfaces are treated as first-class governance targets.&lt;/p&gt;
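&lt;p&gt;The dependency example is the easiest of the three to make mechanical. Here is a minimal sketch, assuming a Python codebase and a hand-maintained set of forbidden package names. Both are assumptions for illustration, not a description of how any particular tool works. It parses a file and reports forbidden top-level imports. A check like this covers only the validation side. Surfacing the constraint before the agent writes the import is a separate step.&lt;/p&gt;

```python
import ast


def forbidden_imports(source, forbidden):
    """Return the sorted forbidden top-level packages imported by source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from package import name"; relative imports are skipped.
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # "authlib.jose" counts as "authlib"
            if root in forbidden:
                found.add(root)
    return sorted(found)
```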

&lt;h2&gt;What engineering leaders should do now&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Inventory where architectural intent currently lives — ADRs, wikis, Slack threads, individual heads.&lt;/li&gt;
&lt;li&gt;Triage which decisions are still active, which are stale, and which have been silently superseded.&lt;/li&gt;
&lt;li&gt;Convert high-risk decisions into explicit, repo-native constraints.&lt;/li&gt;
&lt;li&gt;Apply those constraints before generation, not only during review.&lt;/li&gt;
&lt;li&gt;Treat agent workflows, CI pipelines, PRs, docs, and generated artifacts as governance surfaces — not just the code in &lt;code&gt;src/&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The reframing:&lt;/strong&gt; Do not start by asking "which agent should we use?" Ask: "what intent must every agent preserve?"&lt;/p&gt;

&lt;h2&gt;The future of AI-native engineering is intent enforcement&lt;/h2&gt;

&lt;p&gt;AI-native engineering changes the execution layer. That means governance has to move closer to execution too. The teams that adapt will not be the ones generating the most code. They will be the ones that can preserve intent while code generation accelerates.&lt;/p&gt;

&lt;p&gt;When agents become the execution layer, architectural intent can no longer live only in human memory.&lt;/p&gt;

&lt;p&gt;The next bottleneck in AI-native engineering is not code generation. It is intent enforcement.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mnemehq.com/insights/ai-native-engineering-intent-debt/" rel="noopener noreferrer"&gt;Mneme HQ&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>governance</category>
      <category>agentic</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>Start Here: Why AI-Assisted Development Needs a Governance Layer</title>
      <dc:creator>Theo Valmis</dc:creator>
      <pubDate>Tue, 19 May 2026 15:41:07 +0000</pubDate>
      <link>https://dev.to/mnemehq/start-here-why-ai-assisted-development-needs-a-governance-layer-4mfl</link>
      <guid>https://dev.to/mnemehq/start-here-why-ai-assisted-development-needs-a-governance-layer-4mfl</guid>
      <description>&lt;p&gt;Every major shift in software engineering produces a new infrastructure requirement. The cloud era gave us orchestration. The DevOps era gave us CI/CD. The generative AI era is producing its own requirement — and most teams haven't named it yet.&lt;/p&gt;

&lt;p&gt;That unnamed requirement is governance.&lt;/p&gt;

&lt;p&gt;Not memory. Not context injection. Not longer system prompts. Governance: the layer that enforces architectural constraints, preserves decision provenance, and maintains behavioral boundaries across autonomous agents operating at scale.&lt;/p&gt;

&lt;p&gt;The AI Governance Layer is where Mneme HQ publishes its thinking on this category — what it is, why it's missing, and what it looks like when it's built correctly.&lt;/p&gt;

&lt;p&gt;Read these four pieces in order:&lt;/p&gt;

&lt;h2&gt;1. &lt;a href="https://dev.to/mnemehq/harness-engineering-still-needs-governance-d3i"&gt;The Generative AI Software Engineering Stack&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;The full seven-layer architecture of AI-assisted engineering. Governance lives at Layer 5 — between memory and orchestration — and almost no one is building it.&lt;/p&gt;

&lt;h2&gt;2. &lt;a href="https://dev.to/mnemehq/why-code-review-cannot-scale-with-ai-output-4a1h"&gt;Why Code Review Cannot Scale With AI Output&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Generative AI broke the ratio of code production to human review capacity. Code review was the last gate before code entered a shared codebase. That gate no longer holds.&lt;/p&gt;

&lt;h2&gt;3. &lt;a href="https://dev.to/mnemehq/why-claudemd-stops-scaling-58ol"&gt;Why CLAUDE.md Stops Scaling&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;CLAUDE.md is useful early. It stops working at scale because context injection is not enforcement. A file that tells the model what to do is not the same as a system that verifies it did.&lt;/p&gt;

&lt;h2&gt;4. &lt;a href="https://dev.to/mnemehq/memory-is-not-governance-5end"&gt;Memory Is Not Governance&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;The AI coding category has conflated four distinct systems: memory, context management, retrieval, and governance. Each does something different. Governance is the only one that constrains behavior — and the only one most teams don't have.&lt;/p&gt;




&lt;p&gt;Together, these pieces argue that AI-assisted development needs more than better models or longer context windows. It needs enforceable architectural memory.&lt;/p&gt;

&lt;p&gt;If you are building AI-assisted development infrastructure, or thinking about where the tooling gaps are, this is the right starting point.&lt;/p&gt;

&lt;p&gt;New pieces publish weekly, usually on Tuesdays.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mnemehq.substack.com/p/start-here-why-ai-assisted-development" rel="noopener noreferrer"&gt;Mneme HQ on Substack&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>governance</category>
      <category>agentic</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>Long-running agents need more than memory</title>
      <dc:creator>Theo Valmis</dc:creator>
      <pubDate>Mon, 18 May 2026 15:16:54 +0000</pubDate>
      <link>https://dev.to/mnemehq/long-running-agents-need-more-than-memory-23j3</link>
      <guid>https://dev.to/mnemehq/long-running-agents-need-more-than-memory-23j3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Anthropic's managed-agent harness solves one hard problem: continuity. Progress logs, feature lists, git checkpoints, and startup scripts give each new session a map of what happened. But continuity is not governance. As agents work across more sessions, the question changes from "did the agent remember?" to "did the agent stay within its architectural constraints?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In May 2026, Anthropic published a detailed look at how their internal engineering teams use Claude Code as a long-running managed agent. The infrastructure pattern they describe is worth reading carefully: initializer agents that prepare the workspace, feature lists that define remaining work, progress files that record what happened, git commits that preserve recoverable state, startup checks that orient each new session, and end-to-end tests that stop agents from declaring victory prematurely.&lt;/p&gt;

&lt;p&gt;This is not prompt engineering. It is operational infrastructure for agents working across many sessions on the same codebase. The problems it solves are real, and the solutions are well-reasoned.&lt;/p&gt;

&lt;p&gt;But the pattern solves continuity. It does not solve governance. Those are two different problems, and conflating them is the most expensive mistake a team can make when designing long-running agent workflows.&lt;/p&gt;

&lt;h2&gt;Agents as shift workers&lt;/h2&gt;

&lt;p&gt;The framing that makes the managed-agent pattern click is the shift-worker metaphor. A long-running agent workflow looks less like one developer with a prompt and more like a team of engineers handing work across shifts.&lt;/p&gt;

&lt;p&gt;Each shift worker arrives, reads the handoff notes, picks up where the last person stopped, makes progress, and leaves a record for the next person. The work continues across interruptions. The codebase evolves across sessions. No single session owns the full context.&lt;/p&gt;

&lt;p&gt;That framing makes the continuity infrastructure obvious. You need handoff notes that are authoritative (progress files), a work queue that persists across shifts (feature lists), recoverable state at every checkpoint (git commits), orientation scripts so each shift starts correctly (startup checks), and pass/fail criteria that the work must satisfy (E2E tests).&lt;/p&gt;

&lt;p&gt;Anthropic's harness provides all five. What it does not provide is the architectural contract that defines what kind of work each shift is allowed to do.&lt;/p&gt;

&lt;p&gt;In a real engineering team, that contract exists in ADRs, architecture review boards, code review standards, and the accumulated institutional knowledge of senior engineers. In a long-running agent loop, none of that is automatically present. The harness tells the agent what happened. It does not tell the agent what must remain true.&lt;/p&gt;

&lt;h2&gt;What the harness gets right&lt;/h2&gt;

&lt;p&gt;Before addressing the gap, it is worth being precise about what the harness actually solves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Initializer agent&lt;/strong&gt; — prepares the workspace before the main agent session begins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature list&lt;/strong&gt; — a durable queue of remaining work, written as discrete, completable items.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Progress file&lt;/strong&gt; — a running record of what each session changed, decided, and left incomplete.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git commits as checkpoints&lt;/strong&gt; — every meaningful unit of work lands as a recoverable commit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;E2E tests as the victory condition&lt;/strong&gt; — agents cannot declare a feature complete until the tests pass.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern is good engineering. Each piece of infrastructure corresponds to a real failure mode that long-running agents encounter in practice.&lt;/p&gt;

&lt;h2&gt;The remaining gap: continuity is not governance&lt;/h2&gt;

&lt;p&gt;A progress file can tell the next agent: &lt;em&gt;"Here is what I changed."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It cannot reliably tell the agent: &lt;em&gt;"This architecture boundary must not be crossed. This dependency is forbidden. This ADR supersedes that older decision. This pattern is allowed only in this scope."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That distinction matters in practice because the questions a progress file answers and the questions a governance layer answers are different in kind, not just degree:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Question it answers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Progress log&lt;/td&gt;
&lt;td&gt;What happened?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature list&lt;/td&gt;
&lt;td&gt;What remains?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Git history&lt;/td&gt;
&lt;td&gt;What changed?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test harness&lt;/td&gt;
&lt;td&gt;Does it work?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Governance layer&lt;/td&gt;
&lt;td&gt;Is this allowed?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The managed-agent harness answers the first four questions. It does not answer the fifth. A test suite can verify that the output is functionally correct. It cannot verify that the output is architecturally compliant. Those are different properties, and a codebase can be full of passing tests while also being full of architectural violations.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Agent harnesses preserve continuity. Governance preserves intent.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Why this gets harder as agents run longer&lt;/h2&gt;

&lt;p&gt;Over many sessions, a long-running agent loop may:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Infer outdated patterns from old code.&lt;/strong&gt; If earlier sessions used a deprecated pattern, the new session infers that pattern is correct and continues it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reintroduce forbidden dependencies.&lt;/strong&gt; A dependency was removed for a documented architectural reason. A later session adds it back because it solves the immediate problem and the prohibition is not in any artifact the agent reads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bypass undocumented conventions.&lt;/strong&gt; Architecture that exists in institutional memory but not in enforceable documents is invisible to the agent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize locally while violating system-level constraints.&lt;/strong&gt; Each session makes a locally reasonable change. The cumulative effect crosses an architectural boundary that no single session was responsible for maintaining.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these failures show up in a progress file. None of them cause a test suite to fail. They accumulate silently across sessions and become visible only when the codebase is far enough from its architectural intent that the cost of correction is high.&lt;/p&gt;

&lt;h2&gt;The role of governance&lt;/h2&gt;

&lt;p&gt;Governance sits beside the harness. It does not replace progress logs, tests, or git. It gives the agent a deterministic way to check architectural compatibility at each session boundary and at each commit boundary.&lt;/p&gt;

&lt;p&gt;The managed-agent startup sequence, extended with governance:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;pwd
&lt;/span&gt;git log &lt;span class="nt"&gt;--oneline&lt;/span&gt; &lt;span class="nt"&gt;-20&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;claude-progress.txt
&lt;span class="nb"&gt;cat &lt;/span&gt;feature_list.json
mneme check &lt;span class="nt"&gt;--mode&lt;/span&gt; warn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before commit or PR:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mneme check &lt;span class="nt"&gt;--mode&lt;/span&gt; strict
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In CI, on every push:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mneme check &lt;span class="nt"&gt;--mode&lt;/span&gt; strict &lt;span class="nt"&gt;--ci&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The framing is important: the harness tells the agent where it is. Governance tells it what boundaries it must respect. Both are necessary. Neither substitutes for the other.&lt;/p&gt;

&lt;h2&gt;
  
  
  ADRs as durable intent, not documentation
&lt;/h2&gt;

&lt;p&gt;The governance layer requires a source of architectural authority. In well-run engineering teams, that source is the ADR corpus: Architecture Decision Records that capture not just what was decided, but why, what alternatives were rejected, and what constraints the decision implies.&lt;/p&gt;

&lt;p&gt;For most teams, ADRs sit in &lt;code&gt;/docs/adr&lt;/code&gt; and are read only when someone thinks to look. They are documentation, not enforcement. A long-running agent will not read them at session start. A commit hook will not check against them.&lt;/p&gt;

&lt;p&gt;A governance layer changes this. Rather than reading the ADR folder as a documentation corpus, it compiles the ADR corpus into a decision graph with declared properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which decisions are active, superseded, or deprecated?&lt;/li&gt;
&lt;li&gt;Which decision applies to which file, service, or scope?&lt;/li&gt;
&lt;li&gt;Which decision is newer and overrides an older one?&lt;/li&gt;
&lt;li&gt;Which dependencies or patterns does each decision forbid or require?&lt;/li&gt;
&lt;li&gt;When two decisions conflict on the same scope, which one wins?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A long-running agent operating under that system can answer: &lt;em&gt;which decision applies to this change, and am I compliant with it?&lt;/em&gt; That is a different question from &lt;em&gt;what did the progress file say?&lt;/em&gt; and it requires a different infrastructure to answer.&lt;/p&gt;
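
&lt;p&gt;A minimal sketch of how those declared properties could answer that question. This is not Mneme's implementation: the &lt;code&gt;Decision&lt;/code&gt; fields, the ADR ids, the path-prefix scoping, and the "newest active decision wins" rule are all assumptions made for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    adr_id: str
    status: str                        # "active", "superseded", or "deprecated"
    scope: str                         # path prefix the decision governs (assumed)
    sequence: int                      # higher number means newer decision
    forbids: frozenset = frozenset()   # dependencies the decision prohibits
    requires: frozenset = frozenset()  # dependencies the decision mandates


def applicable(decisions, path):
    """Active decisions whose scope covers the path, newest first."""
    hits = [d for d in decisions
            if d.status == "active" and path.startswith(d.scope)]
    return sorted(hits, key=lambda d: d.sequence, reverse=True)


def verdict(decisions, path, dependency):
    """The newest applicable decision that mentions the dependency wins."""
    for d in applicable(decisions, path):
        if dependency in d.forbids:
            return ("forbidden", d.adr_id)
        if dependency in d.requires:
            return ("required", d.adr_id)
    return ("unconstrained", None)


# Hypothetical corpus: ADR-009 is newer than ADR-005 and reverses it.
DECISIONS = [
    Decision("ADR-003", "superseded", "services/", 3, forbids=frozenset({"requests"})),
    Decision("ADR-005", "active", "services/", 5, requires=frozenset({"mongoengine"})),
    Decision("ADR-009", "active", "services/", 9, forbids=frozenset({"mongoengine"})),
]

print(verdict(DECISIONS, "services/billing/api.py", "mongoengine"))
```

&lt;p&gt;The same inputs always produce the same verdict, and the verdict names the decision responsible. A progress file cannot offer either property.&lt;/p&gt;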

&lt;h2&gt;
  
  
  Where governance checkpoints belong
&lt;/h2&gt;

&lt;p&gt;Governance is not a single check at a single moment. The right enforcement points correspond to the moments of highest leverage:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Session start (warn mode)&lt;/strong&gt; — before any code is written, load constraints and surface existing violations without blocking work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-tool execution&lt;/strong&gt; — block actions that are obviously forbidden before they happen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-commit (strict mode)&lt;/strong&gt; — the primary enforcement gate, catching architectural drift before it becomes branch history.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-PR&lt;/strong&gt; — produces an explainable report of which rules applied, which passed, which failed, and why.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI&lt;/strong&gt; — the backstop that enforces team-level architectural contracts on every push.&lt;/li&gt;
&lt;/ol&gt;
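
&lt;p&gt;The checkpoints above could be modeled as a small table mapping each boundary to a mode. The sketch below is an assumption-laden illustration, not documented tool behavior: the checkpoint keys and the &lt;code&gt;run_checkpoint&lt;/code&gt; helper are hypothetical, and treating the pre-tool and pre-PR checkpoints as strict is a guess based on the list above.&lt;/p&gt;

```python
# Hypothetical mapping from checkpoint to mode. Only the warn/strict
# distinction comes from the article; the keys are invented for this sketch.
CHECKPOINT_MODES = {
    "session_start": "warn",
    "pre_tool": "strict",
    "pre_commit": "strict",
    "pre_pr": "strict",
    "ci": "strict",
}


def run_checkpoint(checkpoint, violations):
    """Warn mode only reports violations; strict mode also blocks on them."""
    mode = CHECKPOINT_MODES[checkpoint]
    blocked = mode == "strict" and len(violations) != 0
    return {
        "checkpoint": checkpoint,
        "mode": mode,
        "blocked": blocked,
        "report": list(violations),
    }


print(run_checkpoint("session_start", ["ADR-009"]))
print(run_checkpoint("pre_commit", ["ADR-009"]))
```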

&lt;p&gt;The harness ensures the agent knows where it is. Governance ensures the agent knows where it must not go. Both are infrastructure. Neither is a nice-to-have for long-running loops.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: memory is not enough
&lt;/h2&gt;

&lt;p&gt;Anthropic's managed-agent harness is well-designed infrastructure for a real problem. Teams building on Claude Code or similar agent systems should study and adopt this pattern.&lt;/p&gt;

&lt;p&gt;But a progress file is descriptive, not prescriptive. It records what happened. It does not enforce what must remain true. And as agent loops grow longer, the gap between those two things grows wider.&lt;/p&gt;

&lt;p&gt;The next phase of agent infrastructure needs a governance layer — one that resolves competing ADRs deterministically, produces explainable audit traces, and enforces architectural contracts at the boundaries where agents make consequential changes.&lt;/p&gt;

&lt;p&gt;Long-running agents need memory to continue work. They need governance to continue work safely. The next generation of agent infrastructure will not just preserve context. It will preserve intent.&lt;/p&gt;

&lt;p&gt;That is the layer &lt;a href="https://github.com/TheoV823/mneme" rel="noopener noreferrer"&gt;Mneme&lt;/a&gt; is built for.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mnemehq.com/insights/long-running-agents-need-governance/" rel="noopener noreferrer"&gt;https://mnemehq.com/insights/long-running-agents-need-governance/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>aigovernance</category>
      <category>claudecode</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Why Code Review Cannot Scale With AI Output</title>
      <dc:creator>Theo Valmis</dc:creator>
      <pubDate>Mon, 18 May 2026 15:15:29 +0000</pubDate>
      <link>https://dev.to/mnemehq/why-code-review-cannot-scale-with-ai-output-4a1h</link>
      <guid>https://dev.to/mnemehq/why-code-review-cannot-scale-with-ai-output-4a1h</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;AI coding assistants have made generation cheap. They haven't made review cheap. The result is a compounding bottleneck that most engineering teams are only beginning to feel — and that no amount of hiring will resolve.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For most of software engineering history, writing code was the bottleneck. Senior engineers reviewed what juniors wrote, and the review burden was proportional to human writing speed. The ratio was manageable.&lt;/p&gt;

&lt;p&gt;AI coding assistants break this assumption. When a single engineer can generate a thousand lines of plausible, compilable code in an hour, the bottleneck shifts. Generation is no longer scarce. Review is.&lt;/p&gt;

&lt;h2&gt;
  
  
  The volume problem is already here
&lt;/h2&gt;

&lt;p&gt;Teams that have adopted AI coding assistants at scale — Claude Code, Cursor, Copilot, Devin — consistently report the same pattern: PR volume increases faster than review capacity. A single AI agent operating on a well-scoped task can produce a multi-file changeset in minutes that would take a human engineer half a day to write.&lt;/p&gt;

&lt;p&gt;The arithmetic is straightforward. If your team generates 10x more code, someone still has to review 10x more code. Hiring 10x more reviewers does not close the gap: even if you could find them, those reviewers are themselves AI-assisted engineers whose output also needs review. The bottleneck tightens.&lt;/p&gt;
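
&lt;p&gt;To make that arithmetic concrete, here is a small sketch with made-up numbers. The weekly PR counts and review capacity are illustrative assumptions, not measurements from any team.&lt;/p&gt;

```python
def review_backlog(prs_per_week, reviews_per_week, weeks):
    """Unreviewed PRs at the end of each week, given fixed review capacity."""
    backlog = 0
    history = []
    for _ in range(weeks):
        # The queue grows by the shortfall and never drops below zero.
        backlog = max(0, backlog + prs_per_week - reviews_per_week)
        history.append(backlog)
    return history


# Hypothetical team that can review 25 PRs per week.
print(review_backlog(20, 25, 4))    # pre-AI volume: [0, 0, 0, 0]
print(review_backlog(200, 25, 4))   # 10x volume:    [175, 350, 525, 700]
```

&lt;p&gt;When generation outpaces review capacity, the queue does not stabilize. It grows by the shortfall every week.&lt;/p&gt;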

&lt;h2&gt;
  
  
  Reviewing AI output is harder than reviewing human output
&lt;/h2&gt;

&lt;p&gt;The volume problem would be difficult enough. But AI-generated code is also &lt;em&gt;harder to review&lt;/em&gt; than human-written code in several specific ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No institutional memory between sessions.&lt;/strong&gt; A human engineer writing a service that interacts with the payments pipeline carries context from prior code reviews, architecture discussions, and incident postmortems. An AI agent starting a new session has none of this unless it's explicitly provided. The result is code that is syntactically correct and passes tests but violates architectural invariants that aren't written down anywhere the model can see.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plausible violations are harder to catch than obvious ones.&lt;/strong&gt; Human engineers tend to violate architectural rules either obviously or not at all. AI agents produce a third category: code that looks architecturally correct but violates a constraint in a subtle way — a service reaching across a boundary via a shared utility function, or a new table added to a database that was supposed to be read-only from this service. These violations pass automated tests and linters. They require a reviewer who understands the architectural intent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Style and convention drift.&lt;/strong&gt; AI coding assistants have their own implicit style conventions, drawn from training data rather than your codebase's evolution. Without explicit constraint injection, they drift toward generic patterns, so reviewers must also police style consistency that the team previously absorbed by osmosis.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "just review more carefully" response doesn't work
&lt;/h2&gt;

&lt;p&gt;The intuitive organizational response is to tighten review requirements: require two reviewers, require an architect sign-off on structural changes, institute more thorough checklists. This approach fails for a predictable reason: it reduces velocity precisely when AI-assisted development is supposed to be increasing it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fundamental tension:&lt;/strong&gt; tighter review processes reduce the speed advantage that AI coding provides. Looser review processes allow architectural violations to compound. There is no review-process solution to a generation-speed problem.&lt;/p&gt;

&lt;p&gt;Teams that try to solve this with review tooling (AI-assisted code review, static analysis, architecture rule checkers) see only partial improvement. These tools can catch mechanical violations: undefined variables, type errors, obvious anti-patterns. They cannot catch violations that require understanding your team's specific architectural decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shift-left argument
&lt;/h2&gt;

&lt;p&gt;Security engineering confronted a structurally identical problem a decade ago. When application development accelerated and security testing was relegated to the end of the pipeline, the volume of vulnerabilities reaching production exceeded the security team's capacity to address them. The response was the shift-left movement: move security checks earlier in the development process, so violations are caught before they accumulate.&lt;/p&gt;

&lt;p&gt;The same logic applies to architectural governance. If you move constraint enforcement to before the AI agent writes the file — rather than after the PR is opened — you eliminate the violation before it needs to be reviewed. No review time consumed. No back-and-forth on the PR. No accumulated drift.&lt;/p&gt;

&lt;h2&gt;
  
  
  What pre-generation enforcement actually requires
&lt;/h2&gt;

&lt;p&gt;Shifting enforcement left is the right direction. Implementing it correctly requires more than telling the AI agent "follow our rules" in the system prompt.&lt;/p&gt;

&lt;p&gt;Effective pre-generation enforcement needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A structured decision corpus&lt;/strong&gt; — architectural decisions captured in a machine-readable schema, not free-form documentation. Decisions must have explicit scope, status, and constraint fields.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope-aware retrieval&lt;/strong&gt; — the ability to retrieve only the decisions relevant to the specific file or module being modified, not a semantic-similarity approximation of what might be relevant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hook-level integration&lt;/strong&gt; — enforcement must happen at the tool-use layer, before the write completes, not in the prompt or post-hoc in review.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A precedence engine&lt;/strong&gt; — when multiple decisions apply, the system must resolve conflicts deterministically rather than leaving the model to interpret contradictions.&lt;/li&gt;
&lt;/ul&gt;
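
&lt;p&gt;The four requirements above can be sketched together in a few lines. Everything here is hypothetical: the decision schema, the ADR ids, the glob-style scopes, and the &lt;code&gt;pre_write_hook&lt;/code&gt; function are assumptions for illustration, not the interface of Mneme or of any agent runtime.&lt;/p&gt;

```python
import re
from fnmatch import fnmatchcase

# Structured decision corpus: explicit scope, status, and constraint fields.
DECISIONS = [
    {"id": "ADR-012", "status": "active", "scope": "services/payments/*",
     "forbid_imports": ["pymongo"]},
    {"id": "ADR-004", "status": "deprecated", "scope": "services/*",
     "forbid_imports": ["requests"]},
]

# Captures the top-level module named in "import x" or "from x import y".
IMPORT_RE = re.compile(r"^\s*(?:import|from)\s+([A-Za-z_]\w*)", re.MULTILINE)


def decisions_for(path):
    """Scope-aware retrieval: only active decisions whose scope matches."""
    return [d for d in DECISIONS
            if d["status"] == "active" and fnmatchcase(path, d["scope"])]


def pre_write_hook(path, content):
    """Runs before the write completes; returns (allowed, reason)."""
    imported = set(IMPORT_RE.findall(content))
    for decision in decisions_for(path):
        for name in decision["forbid_imports"]:
            if name in imported:
                reason = decision["id"] + " forbids importing " + name
                return (False, reason)
    return (True, "no active decision forbids this write")


print(pre_write_hook("services/payments/ledger.py", "from pymongo import MongoClient\n"))
print(pre_write_hook("services/search/index.py", "import pymongo\n"))
```

&lt;p&gt;Retrieval here is an exact scope match, not a similarity score, so the set of decisions consulted for a given path is reproducible. The precedence step is omitted for brevity.&lt;/p&gt;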

&lt;p&gt;None of these requirements are met by a system prompt containing your ADR documents, or by a RAG pipeline that retrieves them. They require a governance architecture that treats decisions as structured, executable constraints rather than advisory text.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost of not solving this
&lt;/h2&gt;

&lt;p&gt;Teams that adopt AI coding assistants without addressing the review bottleneck converge on one of two failure modes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Velocity collapse&lt;/strong&gt; — review requirements tighten to the point that AI-generated PRs queue for days, negating the generation speed advantage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural debt accumulation&lt;/strong&gt; — review is loosened or overwhelmed, violations merge, and the codebase drifts away from its intended architecture over months.&lt;/p&gt;

&lt;p&gt;Both outcomes are predictable. Both are avoidable if the governance problem is addressed at the generation layer rather than the review layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The structural conclusion
&lt;/h2&gt;

&lt;p&gt;Code review is a human-time-bounded process. AI code generation is not. You cannot solve a generation-speed problem with a review-speed solution. The governance layer must operate at generation time, enforcing architectural constraints before the code is written — not after it's merged.&lt;/p&gt;

&lt;p&gt;This is the architectural shift that the current generation of AI coding tools hasn't yet made. It's also the gap that &lt;a href="https://github.com/TheoV823/mneme" rel="noopener noreferrer"&gt;Mneme&lt;/a&gt; is built to close.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mnemehq.com/insights/why-code-review-cannot-scale-with-ai-output/" rel="noopener noreferrer"&gt;https://mnemehq.com/insights/why-code-review-cannot-scale-with-ai-output/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aigovernance</category>
      <category>codereview</category>
      <category>ai</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Harness Engineering Still Needs Governance</title>
      <dc:creator>Theo Valmis</dc:creator>
      <pubDate>Mon, 18 May 2026 15:14:41 +0000</pubDate>
      <link>https://dev.to/mnemehq/harness-engineering-still-needs-governance-d3i</link>
      <guid>https://dev.to/mnemehq/harness-engineering-still-needs-governance-d3i</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;The industry has moved from prompt engineering to harness engineering: execution systems that coordinate models, tools, memory, and retries across long-running agent loops. Harnesses solve how agents act. They do not solve whether the actions stay within architectural boundaries. As autonomous workflows scale, the missing layer is governance infrastructure.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The shift: prompts, harnesses, execution systems
&lt;/h2&gt;

&lt;p&gt;Three years ago, the prevailing question was how to phrase a prompt. The model was the product, and prompt engineering was the surface where teams competed.&lt;/p&gt;

&lt;p&gt;That framing no longer matches what is being built. OpenAI's recent harness work on Codex, Anthropic's pattern for long-running managed agents, Cursor's background-agent runtime, Claude Code's session-aware workflow, and the open-source frameworks — LangGraph, CrewAI, AutoGen — all converge on the same architectural move: the model is wrapped inside an execution system that handles tools, memory, retries, planning, and continuity.&lt;/p&gt;

&lt;p&gt;Single calls have given way to execution loops. Assistants have given way to autonomous workflows. The model is no longer the product boundary. The execution environment is.&lt;/p&gt;

&lt;p&gt;That is the right move. Anyone building agentic systems in 2026 needs a harness. But it is also incomplete. As soon as agents are doing real work across many sessions, on real codebases, the harness reveals what it does not cover: &lt;em&gt;architectural intent&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What harness engineering solves
&lt;/h2&gt;

&lt;p&gt;Harness engineering handles context lifecycle management, tool orchestration, retries and error recovery, planning and execution loops, memory injection and continuity, and observability and execution coordination.&lt;/p&gt;

&lt;p&gt;This is a major architectural advancement over prompt engineering. Treating the execution environment as the product boundary lets autonomous workflows survive contact with real systems: rate limits, partial failures, long horizons, multi-tool coordination, and cross-session continuity.&lt;/p&gt;

&lt;p&gt;None of what follows is an argument against any of that. The argument is that this layer alone is not enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  The governance gap
&lt;/h2&gt;

&lt;p&gt;A harness can make an agent faster, more persistent, more autonomous, and more capable. It cannot, by construction, make the agent &lt;em&gt;architecturally aligned&lt;/em&gt;. Continuity is not constraint. Orchestration is not enforcement. Memory of past decisions is not authority to refuse a future one.&lt;/p&gt;

&lt;p&gt;The failures look like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;An ADR is bypassed.&lt;/strong&gt; The repo has a recorded decision — "do not introduce a runtime ORM" — that the agent does not read at session start. The agent introduces an ORM because it solves the immediate ticket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A forbidden dependency reappears.&lt;/strong&gt; A package was removed for a documented reason. A later session reintroduces it because the prohibition lives only in a stale doc, not in an enforcement hook.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A governed system is rewritten.&lt;/strong&gt; The agent refactors a module that had a specific layering contract. The new version is functionally equivalent and passes tests, but violates the layering rule that was the entire point of the original design.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layering boundaries are crossed.&lt;/strong&gt; A controller starts calling into a data layer that the architecture forbids it from touching directly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Naming conventions drift.&lt;/strong&gt; Each session is internally consistent. Across sessions, the naming gradually changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure patterns mutate.&lt;/strong&gt; A standard for how services are exposed, configured, or deployed is silently replaced by a sensible-looking alternative that the rest of the system does not expect.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these failures are caused by the harness being bad. They are caused by the harness being the wrong place to enforce architectural decisions. The harness's job is to keep work moving. Its incentives are continuity and throughput, not refusal.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Harnesses preserve execution continuity.&lt;/em&gt; They do not preserve architectural intent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why observability is insufficient
&lt;/h2&gt;

&lt;p&gt;The most common response to the governance gap is to lean harder on observability. Trace every tool call. Log every diff. Pipe agent activity into a dashboard. If we can see what the agent did, we can correct it.&lt;/p&gt;

&lt;p&gt;That argument confuses two different questions.&lt;/p&gt;

&lt;p&gt;Observability answers &lt;em&gt;what happened&lt;/em&gt;. Governance answers &lt;em&gt;what should have been allowed&lt;/em&gt;. These are not the same problem.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logs are not policy.&lt;/strong&gt; A log records that a forbidden dependency was added. It does not refuse the add.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traces are not invariants.&lt;/strong&gt; A trace shows the call graph. It does not declare which call graphs are valid.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visibility is not enforcement.&lt;/strong&gt; A dashboard surfaces drift after it occurs. It does not block the change that produced the drift.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Observability is necessary — you cannot govern what you cannot see — but it sits on the wrong side of the action. By the time the trace reaches the dashboard, the commit has already happened. Governance has to sit &lt;em&gt;in front of&lt;/em&gt; the action it constrains, with a deterministic rule about whether to allow it.&lt;/p&gt;
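
&lt;p&gt;The difference between the two sides of the action fits in a short sketch. The function names and the forbidden package below are hypothetical, chosen only to contrast recording a change with refusing it.&lt;/p&gt;

```python
FORBIDDEN = {"legacy-orm"}  # hypothetical package ruled out by an ADR


def add_with_logging(dependencies, name, log):
    """Observability: the action happens, then it is recorded."""
    dependencies.add(name)
    log.append("added " + name)


def add_with_gate(dependencies, name, log):
    """Governance: a deterministic rule runs before the action."""
    if name in FORBIDDEN:
        log.append("refused " + name)
        return False
    dependencies.add(name)
    log.append("added " + name)
    return True


observed, gated, log_a, log_b = set(), set(), [], []
add_with_logging(observed, "legacy-orm", log_a)
add_with_gate(gated, "legacy-orm", log_b)
print(observed, log_a)  # the dependency is present and the log says so
print(gated, log_b)     # the dependency is absent and the refusal is logged
```

&lt;p&gt;Both versions produce a log. Only the gated version leaves the dependency set unchanged.&lt;/p&gt;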

&lt;h2&gt;
  
  
  Governance propagation across execution surfaces
&lt;/h2&gt;

&lt;p&gt;Long-running, autonomous agents do not only write source code. They write everywhere the workflow touches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Branch names and PR titles&lt;/strong&gt; — auto-generated by the harness, often outside the team's branch and title taxonomy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Commit messages and tags&lt;/strong&gt; — workflow-generated commits accumulate in history.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI metadata and pipeline config&lt;/strong&gt; — written by agents the same way they write code, but with stricter governance constraints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment artifacts and release notes&lt;/strong&gt; — manifests, container tags, generated changelogs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generated configuration&lt;/strong&gt; — feature flags, routing rules, scaling policies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent-produced documentation&lt;/strong&gt; — READMEs, ADR drafts, runbooks that become the next agent's training context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Governance must propagate across every surface touched by autonomous execution. A governance layer that enforces ADR compliance in &lt;code&gt;src/&lt;/code&gt; but ignores commit messages, PR titles, CI config, and generated docs is governing a fraction of the agent's output.&lt;/p&gt;
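
&lt;p&gt;As a sketch of what checking non-source surfaces might look like, the example below validates branch names and commit messages against a naming taxonomy. The taxonomy, the regular expressions, and the &lt;code&gt;surface_violations&lt;/code&gt; helper are invented for illustration. A real team would substitute its own rules.&lt;/p&gt;

```python
import re

# Hypothetical team taxonomy for two of the surfaces listed above.
RULES = {
    "branch": re.compile(r"^(feature|fix|chore)/[a-z0-9-]+$"),
    "commit": re.compile(r"^(feat|fix|docs|chore)(\([a-z0-9-]+\))?: .+"),
}


def surface_violations(artifacts):
    """Return the (surface, value) pairs that break the rule for their surface."""
    return [(surface, value) for surface, value in artifacts
            if surface in RULES and RULES[surface].match(value) is None]


agent_output = [
    ("branch", "feature/add-ledger"),
    ("branch", "agent-run-48213"),
    ("commit", "fix(api): handle empty cart"),
    ("commit", "Update files"),
]
print(surface_violations(agent_output))
```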

&lt;h2&gt;
  
  
  The next layer: governance infrastructure
&lt;/h2&gt;

&lt;p&gt;The clean way to think about the emerging stack is to stop treating it as a model-plus-tooling problem and start treating it as a layered system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Models           — produce candidate output
  ↓
Harnesses        — coordinate execution, retries, tools
  ↓
Execution        — long-running loops, sessions, memory
  ↓
Governance       — defines and enforces architectural constraints
  ↓
Verification     — tests, builds, deploy-time checks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer answers a question the layer above it cannot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Harnesses&lt;/strong&gt; answer &lt;em&gt;how does the agent act?&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution systems&lt;/strong&gt; answer &lt;em&gt;how does the agent keep working across time?&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance&lt;/strong&gt; answers &lt;em&gt;which actions are allowed, and according to which decisions?&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification&lt;/strong&gt; answers &lt;em&gt;did the resulting system still pass its objective checks?&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Governance is its own layer because the problem it solves is not solvable inside any of the others. Models produce text. Harnesses coordinate. Memory recalls. None of them can deterministically resolve which ADR governs a given change, or block output that violates the active decision graph.&lt;/p&gt;

&lt;p&gt;This is what makes governance infrastructure a category, not a feature. It cannot be folded into the harness without giving the harness an enforcement responsibility that conflicts with its continuity responsibility. It cannot be folded into observability without losing its blocking authority.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Mneme fits
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/TheoV823/mneme" rel="noopener noreferrer"&gt;Mneme&lt;/a&gt; is a deliberately narrow layer in this stack. It does not orchestrate tools. It does not retry calls. It does not manage memory or context. It does one thing: it compiles the repository's ADR corpus into a deterministic decision graph and enforces it at the boundaries where agents make consequential changes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repo-native.&lt;/strong&gt; ADRs live in the repository. The decision graph is rebuilt from them on every check.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic enforcement.&lt;/strong&gt; Given the same decision graph and the same change, the result is the same every time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance before generation.&lt;/strong&gt; &lt;code&gt;mneme check --mode warn&lt;/code&gt; at session start tells the agent which decisions are active before it writes anything. &lt;code&gt;mneme check --mode strict&lt;/code&gt; at pre-commit and CI is the enforcement gate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Harnesses help agents act.&lt;/strong&gt; Governance ensures they act within architectural boundaries. Verification confirms the result. All three layers are infrastructure. None of them substitute for the others.&lt;/p&gt;

&lt;p&gt;Long-running agents need more than memory and orchestration. They need enforceable architectural boundaries. The next phase of agent infrastructure is the layer that provides them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mnemehq.com/insights/harness-engineering-still-needs-governance/" rel="noopener noreferrer"&gt;https://mnemehq.com/insights/harness-engineering-still-needs-governance/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aigovernance</category>
      <category>claudecode</category>
      <category>ai</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Review Is Not Governance</title>
      <dc:creator>Theo Valmis</dc:creator>
      <pubDate>Mon, 18 May 2026 15:13:25 +0000</pubDate>
      <link>https://dev.to/mnemehq/review-is-not-governance-3hi1</link>
      <guid>https://dev.to/mnemehq/review-is-not-governance-3hi1</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;CodeRabbit helps review AI-generated code. Mneme helps govern what the AI generates in the first place. These are not competing tools. They are different layers of the same problem.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most teams using AI coding assistants have invested in the review layer. Tools like CodeRabbit, PR comment bots, and AI-assisted linters sit at the end of the pipeline, inspecting output after it surfaces in a pull request. That is useful work.&lt;/p&gt;

&lt;p&gt;But it is reactive by design. You are reading the result of a generation process that already happened. The architectural violation, the naming drift, the deprecated dependency — they are already in the diff. Now you are deciding whether to accept them or ask for a rewrite.&lt;/p&gt;

&lt;p&gt;Governance operates differently. It runs before generation. It takes the architectural decisions your team has made — which services own which data, which dependencies are approved, which patterns are preferred — and makes them available to the AI assistant at the moment it is about to write code. Constraints are enforced at the source, not caught in review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The distinction matters at scale.&lt;/strong&gt; When AI multiplies code output by 10x to 50x, catching violations after generation means reviewing the same class of problems over and over. Preventing them before generation means they stop appearing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two layers, one stack
&lt;/h2&gt;

&lt;p&gt;The AI coding stack is evolving into distinct layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generation runtimes&lt;/strong&gt; like Cursor and Claude Code accelerate output. They are the engine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Review platforms&lt;/strong&gt; like CodeRabbit inspect what surfaces in pull requests. They are the quality gate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Governance systems&lt;/strong&gt; enforce architectural constraints before code is generated. They are the upstream layer that most teams are only beginning to build.&lt;/p&gt;

&lt;p&gt;These layers are complementary. A team with strong generation tooling and strong review tooling but no governance layer will ship fast and review often — and still accumulate architectural drift at AI velocity. A team that adds the governance layer stops correcting and starts directing.&lt;/p&gt;

&lt;p&gt;When intent only enters the workflow at PR review, the organization has already allowed the wrong work to be generated. That is the core failure mode of intent debt in AI-native engineering: decisions that exist but are not enforced at the moment agents produce code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why governance is infrastructure, not a feature
&lt;/h2&gt;

&lt;p&gt;Security engineering went through this same evolution. Code security was once a review concern: scan the output, flag the vulnerabilities, ask for fixes. Shift-left security moved those checks earlier — into the IDE, into the CI pipeline, into the generation process itself. The industry now treats security tooling as infrastructure, not an optional review enhancement.&lt;/p&gt;

&lt;p&gt;Architectural governance is at the same inflection point. As AI dramatically increases the volume of code entering a codebase, the constraints that govern that code must be embedded earlier in the process. Review catches what slips through. Governance determines what gets generated in the first place.&lt;/p&gt;

&lt;p&gt;That is the layer Mneme is built for.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mnemehq.com/insights/review-is-not-governance/" rel="noopener noreferrer"&gt;https://mnemehq.com/insights/review-is-not-governance/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aigovernance</category>
      <category>codereview</category>
      <category>claudecode</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Why CLAUDE.md Stops Scaling</title>
      <dc:creator>Theo Valmis</dc:creator>
      <pubDate>Mon, 18 May 2026 15:10:04 +0000</pubDate>
      <link>https://dev.to/mnemehq/why-claudemd-stops-scaling-58ol</link>
      <guid>https://dev.to/mnemehq/why-claudemd-stops-scaling-58ol</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Teams start with a small CLAUDE.md. Then the file grows. Then the team realizes it is no longer maintaining instructions — it is maintaining a governance system, with none of the infrastructure governance requires.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every engineering team that adopts an AI coding assistant goes through the same evolution. The first sessions are inconsistent. Naming conventions get ignored. Service boundaries blur. Approved dependencies get substituted. The team writes down the rules.&lt;/p&gt;

&lt;p&gt;A CLAUDE.md file in the repo root. A few coding conventions. Architecture notes. Testing expectations. The AI reads them. The sessions improve.&lt;/p&gt;

&lt;p&gt;For a solo developer on a six-month-old codebase, this works well enough to feel like a solution. Then the file grows. More rules. More edge cases. More exceptions. More workflows. Anti-patterns. Deployment procedures. Team-specific carve-outs.&lt;/p&gt;

&lt;p&gt;Eventually something shifts. The team is no longer maintaining instructions. It is maintaining a governance system — one built on a text file, with no enforcement layer, no precedence engine, and no decision provenance. &lt;strong&gt;Presence of instructions is not equivalent to enforcement.&lt;/strong&gt; That gap is invisible at small scale. It becomes structural at large scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why CLAUDE.md works — and why that matters
&lt;/h2&gt;

&lt;p&gt;It would be a mistake to dismiss what CLAUDE.md actually does well. The tool has genuine strengths, and the teams using it are solving a real problem correctly — for a while. Acknowledging this is not politeness. It is precision.&lt;/p&gt;

&lt;p&gt;CLAUDE.md is frictionless. It lives in the repo alongside the code, versioned with git, visible to every engineer and every session. It requires no infrastructure, no tooling, no setup beyond writing a file that was already useful before AI was in the picture. It is human-readable and composable: any engineer can open it, update it, and understand it in minutes.&lt;/p&gt;

&lt;p&gt;For behavioral steering, it works. Style conventions, naming patterns, preferred libraries, testing expectations, deployment notes — all of it can be communicated to the model at session start and meaningfully improves output consistency. A well-maintained CLAUDE.md on a small team is a real productivity asset.&lt;/p&gt;

&lt;p&gt;These strengths are why the pattern spread. They are also why the ceiling is invisible until you hit it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The instruction-surface ceiling
&lt;/h2&gt;

&lt;p&gt;The ceiling is not about Claude. It is not about prompt quality or file organization. It is about what static instruction files can and cannot do, regardless of how well they are written or maintained.&lt;/p&gt;

&lt;p&gt;A text document can describe a rule. It cannot enforce one. A CLAUDE.md can say "use the repository pattern for all data access." It cannot prevent a model from bypassing that pattern when the pressure to complete the task is strong enough. The rule is present. The enforcement is not.&lt;/p&gt;

&lt;p&gt;This gap is invisible at small scale because teams compensate for it: code review catches violations, the team is small enough to remember the rules, the file is recent enough to still be accurate. As scale increases, each of those compensating factors erodes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five failure modes
&lt;/h2&gt;

&lt;p&gt;The failure modes are not random. They follow the structure of the tool. Each one is a structural property of static instruction files, not a deficiency fixable by better maintenance or more careful writing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;01 — Context accretion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rules accumulate without prioritization semantics. Every token of injected context competes with the actual task. A 3,000-token context block on a complex refactoring session can degrade output quality, and it adds inference cost on every call. More critically: as contradictions accumulate, the model resolves them by natural language interpretation. The important rule and the outdated footnote have equal weight. Important rules get diluted by the volume around them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;02 — No deterministic enforcement&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The model can ignore, partially follow, reinterpret, or override any instruction in the file. When task completion pressure conflicts with an architectural rule, task completion tends to win. A governance system that depends on probabilistic compliance is not a governance system. The model read the rule. The violation happened anyway. These two facts are not in contradiction — they are the expected behavior of a text-based suggestion system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;03 — No decision provenance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Teams eventually ask: why does this rule exist? Which ADR introduced it? Is it still valid? When was it last reviewed? Was it superseded? A flat text file collapses all provenance into unstructured paragraphs. There is no way to trace a CLAUDE.md rule back to the decision record that created it, the alternatives that were rejected, or the conditions under which it should be superseded. Governance without provenance is institutional knowledge decay in progress.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;04 — Poor scope resolution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Different rules apply globally, per service, per directory, per workflow, per environment. Flat instruction files have no mechanism for precedence, specificity, or conflict resolution. An org-level rule and a team-level exception coexist as equal-weight paragraphs. The model picks one based on proximity and attention in the context window. When two rules genuinely conflict, the outcome is model interpretation, not deterministic resolution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;05 — Autonomous agent drift&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This failure mode matters most as AI workflows shift from single-response generation to iterative execution loops, autonomous retries, and multi-agent orchestration. Context windows fill with task history and tool outputs. Rules injected at session start tend to have less influence by the middle of a long run. A rule read in step 1 does nothing to catch a violation in step 12 of an autonomous workflow. Generation scales faster than governance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real category shift
&lt;/h2&gt;

&lt;p&gt;These failure modes are not surprising once you understand the era they belong to. CLAUDE.md is a context engineering tool. It solves context engineering problems well. The problem teams are actually running into is a governance infrastructure problem — a different category with different requirements.&lt;/p&gt;

&lt;p&gt;Each era solved its problem and revealed the next one. Better prompts improved output quality but could not enforce architectural invariants. Better context improved relevance but added no precedence or provenance. Longer workflows surfaced the drift that short sessions had hidden. The current problem is not a better version of the previous one. It requires different infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The memory misdiagnosis
&lt;/h2&gt;

&lt;p&gt;When teams hit the ceiling, the common misdiagnosis is that the model has a memory problem. The file is too long. The rules are not being retained across sessions. The context window is filling up.&lt;/p&gt;

&lt;p&gt;This leads to the wrong remedies: structured retrieval, semantic search over decision documents, RAG pipelines over architectural notes. These are real tools for real problems. They are not the right tool for this one.&lt;/p&gt;

&lt;p&gt;Architectural integrity cannot rely on probabilistic recall alone. A system where a constraint &lt;em&gt;might&lt;/em&gt; be followed, depending on context window pressure and model interpretation, is not a governance system. It is a soft suggestion that usually works.&lt;/p&gt;

&lt;p&gt;For most outputs, soft suggestions are fine. For architectural invariants that protect service boundaries, dependency policies, or security requirements, "usually works" is not a viable guarantee. The difference between those two categories is the governance boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  The governance stack
&lt;/h2&gt;

&lt;p&gt;The right framing is not that CLAUDE.md is obsolete. It is that CLAUDE.md is one layer in a larger stack, specifically the layer that handles behavioral steering, style, and session context. What it cannot be is the enforcement layer.&lt;/p&gt;

&lt;p&gt;The governance layer above context and retrieval is what enforces constraints before generated output is accepted. It operates on structured decision records (typed, scoped, versioned, with explicit precedence), not on natural language files that the model reads and interprets. It runs before violations reach the codebase, not after a PR is opened.&lt;/p&gt;

&lt;p&gt;What that layer requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scoped governance.&lt;/strong&gt; Rules that apply globally, per service, per directory, or per workflow are stored with scope metadata and resolved deterministically when triggered — not matched by attention weight.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precedence resolution.&lt;/strong&gt; When two decisions conflict, the system resolves the conflict by explicit precedence rules. The outcome is not model interpretation of overlapping paragraphs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enforcement checks.&lt;/strong&gt; Decisions are validated against generated output at the hook level, before the file is written. Violations are blocked or flagged, not discovered in review.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decision provenance.&lt;/strong&gt; Every constraint traces back to the ADR or decision record that created it, with status, rationale, and supersession history maintained.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are infrastructure properties. They cannot be delivered by a better-maintained text file, regardless of how well it is written. They require a system that operates at a different layer of the stack.&lt;/p&gt;
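&lt;p&gt;As a rough illustration of the enforcement-check property, the sketch below validates a proposed file write against scoped decision records before the write happens. The &lt;code&gt;Decision&lt;/code&gt; schema, the &lt;code&gt;check_write&lt;/code&gt; function, the identifier &lt;code&gt;ADR-001&lt;/code&gt;, and the regex standing in for "bypasses the repository pattern" are all assumptions made for this example; they are not Mneme's API.&lt;/p&gt;

```python
import re
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Decision:
    """A typed decision record (hypothetical schema, for illustration only)."""
    adr_id: str      # provenance: the decision record that introduced the rule
    scope: str       # glob over repository paths where the rule applies
    forbidden: str   # regex that must not appear in generated code
    rationale: str


def check_write(path: str, content: str, decisions: list[Decision]) -> list[str]:
    """Return one message per violated decision; an empty list means the write may proceed."""
    violations = []
    for d in decisions:
        # A decision only applies when the target path falls inside its scope.
        if fnmatch(path, d.scope) and re.search(d.forbidden, content):
            violations.append(f"{d.adr_id}: {d.rationale}")
    return violations


# Hypothetical rule: services must not call an ORM session directly.
DECISIONS = [
    Decision(
        adr_id="ADR-001",
        scope="services/*",
        forbidden=r"\bsession\.(query|execute)\(",
        rationale="data access must go through the repository pattern",
    ),
]

generated = "rows = session.query(User).all()"
print(check_write("services/users/api.py", generated, DECISIONS))  # one violation
print(check_write("scripts/migrate.py", generated, DECISIONS))     # out of scope: []
```

&lt;p&gt;A hook would block or flag the write whenever the returned list is non-empty. A regex is a crude detector; the point is only that the check runs as code against the output rather than as text inside the prompt.&lt;/p&gt;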

&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;Teams at the early stages of AI adoption have not hit this problem yet. CLAUDE.md works well, sessions are consistent enough, review catches the violations that slip through. The pattern feels like it is scaling.&lt;/p&gt;

&lt;p&gt;The teams that have hit it recognize the symptoms: a CLAUDE.md that has grown into a maintenance burden, rules that conflict without resolution, enforcement that depends on reviewer attention, architectural violations that accumulate slowly and then become structural, and autonomous agents that followed architectural constraints in session 1 and had drifted by session 50.&lt;/p&gt;

&lt;p&gt;The solution is not a more organized CLAUDE.md. It is governance infrastructure: structured decision records with scope and precedence, deterministic retrieval based on what is being generated, and hook-level enforcement that operates before output reaches the codebase. That infrastructure is what &lt;a href="https://github.com/TheoV823/mneme" rel="noopener noreferrer"&gt;Mneme&lt;/a&gt; is designed to provide — an architectural compiler layer that sits above the context window, not inside it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLAUDE.md keeps your context aligned. Mneme keeps your architecture enforced.&lt;/strong&gt; The two layers are complements, not competitors. What changes is the expectation of which one is responsible for enforcement — and the infrastructure needed to deliver on that responsibility.&lt;/p&gt;

&lt;p&gt;AI-native SDLCs are not failing because models are weak. They are failing because instruction surfaces are being mistaken for governance systems. As agent workflows become longer-lived and more autonomous, architectural integrity becomes an infrastructure problem, not a prompting problem.&lt;/p&gt;

&lt;p&gt;That is the category shift. CLAUDE.md is where it starts to show.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mnemehq.com/insights/why-claude-md-stops-scaling/" rel="noopener noreferrer"&gt;https://mnemehq.com/insights/why-claude-md-stops-scaling/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>claudecode</category>
      <category>aigovernance</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Datadog's State of AI Engineering Report Quietly Confirms the Governance Crisis</title>
      <dc:creator>Theo Valmis</dc:creator>
      <pubDate>Thu, 14 May 2026 21:52:40 +0000</pubDate>
      <link>https://dev.to/mnemehq/datadogs-state-of-ai-engineering-report-quietly-confirms-the-governance-crisis-10ni</link>
      <guid>https://dev.to/mnemehq/datadogs-state-of-ai-engineering-report-quietly-confirms-the-governance-crisis-10ni</guid>
      <description>&lt;p&gt;Datadog surveyed over 1,000 organizations running AI in production. The report is framed around observability and operational maturity. Read carefully, it is also the clearest empirical signal yet that the industry's next unsolved problem is governance.&lt;/p&gt;

&lt;p&gt;Most industry reports on AI engineering measure what is easy to measure: adoption rates, token volumes, model preferences, framework usage. Datadog's State of AI Engineering 2026 does all of that -- and then, in a handful of sentences buried across four findings, says something the AI tooling industry has been reluctant to say directly.&lt;/p&gt;

&lt;p&gt;The report does not use the word "governance" as its organizing frame. It talks about observability, operational discipline, and the maturation of production systems. But the data it surfaces -- model churn rates, context composition, error clustering, agent complexity -- all point to the same structural gap. &lt;strong&gt;The industry has scaled AI execution faster than it has scaled AI constraint enforcement.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What the report actually measures
&lt;/h2&gt;

&lt;p&gt;The 2026 report surveyed over 1,000 organizations and analyzed production telemetry across LLM API calls, agent frameworks, token consumption, error patterns, and model distribution. The scope is deliberately operational -- not "what are teams building" but "what is actually running in production, at what cost, with what failure patterns."&lt;/p&gt;

&lt;p&gt;Key numbers from the data:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Orgs using 3+ models&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Growth in orgs using 6+ models&lt;/td&gt;
&lt;td&gt;Nearly 2x YoY&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input tokens that are system prompts&lt;/td&gt;
&lt;td&gt;69%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token growth at 90th percentile&lt;/td&gt;
&lt;td&gt;4x YoY&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent framework adoption&lt;/td&gt;
&lt;td&gt;9% → 18% YoY&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limit errors, March 2026 (Anthropic API)&lt;/td&gt;
&lt;td&gt;8.4M&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The sentence that changes everything
&lt;/h2&gt;

&lt;p&gt;Buried in the second finding, after the model distribution charts, is the report's most important claim:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"In practice, model churn becomes a governance problem."&lt;/p&gt;

&lt;p&gt;-- Datadog State of AI Engineering 2026, Fact 2&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The logic is direct. When 70% of production organizations run three or more models, and when the share running six or more nearly doubled in a single year, every model swap is also a behavior change. The same prompt does not produce identical output across models. The same architectural constraint is not uniformly respected. The same anti-pattern may be caught by one model and missed by another.&lt;/p&gt;

&lt;p&gt;Teams without a governance layer discover this through violations: in code review, in production incidents, in architectural drift that accumulates over months. Teams with a governance layer -- one that enforces constraints deterministically rather than relying on model behavior -- are insulated from the per-model variance. The enforcement runs outside the model, so which model executes the prompt is irrelevant.&lt;/p&gt;

&lt;p&gt;This is not a problem you solve by picking a better model. It is a problem you solve by adding an enforcement layer that is model-agnostic by design.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context quality is the new limiting factor
&lt;/h2&gt;

&lt;p&gt;The report's fifth finding centers on context quality -- and the data here is striking. Sixty-nine percent of all input tokens are already system prompts. Not user turns, not retrieved documents, not task specifications: the baseline context injected at session start.&lt;/p&gt;

&lt;p&gt;This matters for governance because the most common response to enforcement gaps is to add more context: more rules to CLAUDE.md, more instructions to the system prompt, more documentation retrieved at session start. The data suggests that approach has reached its ceiling. More tokens do not improve constraint compliance if the enforcement surface remains probabilistic.&lt;/p&gt;

&lt;p&gt;The alternative is structured context: constraints that are scoped, typed, and retrieved based on what is actually being generated. Not a flat block of text injected at the top of every session, but a governance layer that surfaces the relevant decision at the moment it matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The observability ceiling
&lt;/h2&gt;

&lt;p&gt;The report quotes Guillermo Rauch, CEO of Vercel:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The next wave of agent failures won't be about what agents can't do. It'll be about what teams can't observe."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is half-right, and the half it misses is revealing. The next wave of agent failures will be about two things: what teams cannot observe, and &lt;strong&gt;what teams cannot enforce&lt;/strong&gt;. Observability tells you a violation happened. Governance prevents the violation from happening in the first place.&lt;/p&gt;

&lt;p&gt;The report's data supports this reading. Five percent of LLM API calls returned errors in February 2026. Sixty percent of those errors were rate limit errors. But errors are the recoverable failure mode. The unrecoverable failure mode is an architectural violation that passes the model, passes the test suite, passes code review, and ships.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disciplined production systems as the next competitive surface
&lt;/h2&gt;

&lt;p&gt;The report's Looking Ahead section:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The next wave of advantage belongs to organizations that can mature their agents into disciplined production systems -- continuously evaluating and improving them to be more observable, governable, resilient, and cost-aware."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Observable. Governable. Resilient. Cost-aware.&lt;/strong&gt; The framing is a four-part maturity model. Observability has tooling. Cost-awareness has tooling. Resilience has tooling. Governability -- the specific ability to enforce architectural constraints deterministically, across models, at generation time -- does not yet have mature tooling at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five governance signals from the data
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-model production is now the default.&lt;/strong&gt; 70% of orgs use three or more models. Every model swap is a behavior change. Governance must be model-agnostic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context is already saturated with system prompts.&lt;/strong&gt; 69% of input tokens are system prompts. Volume has hit its ceiling. Structure is what matters now.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agent framework adoption is accelerating.&lt;/strong&gt; Framework use doubled year-over-year. More orchestration complexity means more opportunities for architectural violations that no single-session review can catch.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prompt caching remains underused.&lt;/strong&gt; Only 28% of calls use prompt caching, despite 69% of tokens being system prompts. Structured governance constraints designed for caching would reduce both cost and latency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The error rate is stable, but errors are the wrong metric.&lt;/strong&gt; A 5% error rate with increasing agent complexity means violations are compounding silently in the non-error path.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What this means for teams building now
&lt;/h2&gt;

&lt;p&gt;The Datadog report is not a roadmap. It is a baseline. But the direction is implied in every finding.&lt;/p&gt;

&lt;p&gt;The era table for AI engineering maturity now has a new row:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Maturity layer&lt;/th&gt;
&lt;th&gt;What it addresses&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model selection&lt;/td&gt;
&lt;td&gt;Capability per task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt engineering&lt;/td&gt;
&lt;td&gt;Output quality per session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Visibility into what ran&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evaluation&lt;/td&gt;
&lt;td&gt;Quality measurement at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Governance infrastructure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Deterministic constraint enforcement across models, agents, and time&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Teams that have observability without governance can see violations after they happen. Teams with governance can prevent them.&lt;/p&gt;

&lt;p&gt;The report's conclusion is worth sitting with: &lt;strong&gt;"actively governing model and context sprawl before it compounds into technical debt."&lt;/strong&gt; Not managing. Not monitoring. Governing.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mnemehq.com/insights/datadog-state-of-ai-engineering-governance-crisis/" rel="noopener noreferrer"&gt;mnemehq.com&lt;/a&gt;. Mneme HQ builds open-source governance infrastructure for AI-native codebases -- typed architectural decisions, a precedence engine, and hook-level enforcement before violations reach the codebase.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>governance</category>
      <category>devops</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Why AI Architectural Governance Needs Precedence Semantics</title>
      <dc:creator>Theo Valmis</dc:creator>
      <pubDate>Thu, 14 May 2026 14:20:12 +0000</pubDate>
      <link>https://dev.to/mnemehq/why-ai-architectural-governance-needs-precedence-semantics-2ob6</link>
      <guid>https://dev.to/mnemehq/why-ai-architectural-governance-needs-precedence-semantics-2ob6</guid>
      <description>&lt;p&gt;Two architectural decisions overlap. An engineer in Cursor follows one. The async PR bot in CI follows the other. A reviewer signs off on a diff that flatly contradicts a decision that landed last week. Nobody is wrong. Nobody has the authority to be right. This is what AI coding governance looks like without a precedence layer — and it is the gap every system in the category is currently failing to close.&lt;/p&gt;

&lt;p&gt;Most of the conversation about "AI coding governance" in 2026 is still about the wrong layer. Prompt rules. CLAUDE.md. &lt;code&gt;.cursor/rules&lt;/code&gt;. RAG over an ADR folder. A reviewer agent on PRs. Policy docs in a wiki. Every one of these answers the question "how do we tell the model what we want?" — and not one of them answers the prior question: "when two of the things we want disagree, which one wins?"&lt;/p&gt;

&lt;p&gt;That second question is not a corner case. It is the central question of any governance system that has to operate at the scale of a real codebase, with real exceptions, written by real teams over real years. And it is the question almost every tool in the category is currently dodging.&lt;/p&gt;

&lt;p&gt;The missing layer has a name. It is precedence semantics. It is what makes governance deterministic, reviewable, and durable. Without it, "AI coding governance" is just a polite name for whichever rule the model happened to attend to this morning.&lt;/p&gt;

&lt;h2&gt;
  
  
  The collision nobody owns
&lt;/h2&gt;

&lt;p&gt;Consider a perfectly ordinary situation. A team's data layer is governed by ADR-014: "All persistent data access goes through the repository pattern. No service may call an ORM session directly." The decision is sound. It has been the law of the codebase for a year.&lt;/p&gt;

&lt;p&gt;Eight months later, the payments team writes ADR-022: "The payments service may invoke the Stripe SDK directly inside an idempotency-key boundary. The repository abstraction does not compose with Stripe's at-most-once semantics, and the ledger requires raw call ordering." Also sound. Also reviewed. Also accepted.&lt;/p&gt;

&lt;p&gt;These two decisions overlap. Four things follow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An engineer in Cursor opens the payments service, sees ADR-014 because it was indexed first, and proposes a repository-based implementation.&lt;/li&gt;
&lt;li&gt;An engineer in Claude Code, with a different CLAUDE.md, sees ADR-022 highlighted and writes a direct-SDK implementation. Both diffs get approved.&lt;/li&gt;
&lt;li&gt;An async PR bot refactors an adjacent file and rewrites the call site to obey ADR-014, because that is what its system prompt encodes.&lt;/li&gt;
&lt;li&gt;Six months later, a new engineer reads the codebase, sees both patterns in production, and asks: "which one are we supposed to follow?" Nobody can answer without re-litigating the conflict from scratch.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of those four actors is misbehaving. The codebase has two correct decisions and no mechanism to say which one wins where. That is the precedence problem, and it is exactly the problem that no prompt rule, no retrieval system, and no review process will ever resolve, because none of them are designed to resolve anything. They are designed to retrieve.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why current systems can't resolve it
&lt;/h2&gt;

&lt;p&gt;It is tempting to think that the conflict above is a content problem — that if the team had written ADR-022 more carefully, or pasted it higher in CLAUDE.md, the right answer would emerge. It is not a content problem. It is a structural one. Every governance substrate currently in widespread use is fundamentally a retrieval substrate, and retrieval has no opinion on conflict.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;01. Prompt rules resolve by attention.&lt;/strong&gt;&lt;br&gt;
CLAUDE.md, &lt;code&gt;.cursor/rules&lt;/code&gt;, &lt;code&gt;.github/copilot-instructions.md&lt;/code&gt; all hand the model a block of text. When two rules disagree, the model picks one based on whichever it attended to more strongly under this temperature, this context length, this surrounding code. Reorder the file and the answer changes. That is not governance — it is a coin flip with extra steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;02. RAG resolves by retrieval score.&lt;/strong&gt;&lt;br&gt;
Indexing the ADR folder and retrieving the top-k chunks per query feels rigorous. It is not. When two ADRs both score highly for "payments writes a charge," whichever the embedder happens to rank higher gets injected. Re-embed the corpus and the resolution can flip silently. The architecture is now a function of the vector index.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;03. PR review resolves by whoever was looking.&lt;/strong&gt;&lt;br&gt;
If the reviewer who happens to be assigned remembers ADR-022, the diff lands correctly. If a different reviewer is assigned next time, the next diff lands the other way. The codebase ends up with both patterns and no record of which decision governed which file. The conflict is resolved by social process, and social process does not scale across async agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;04. Policy docs resolve nothing at all.&lt;/strong&gt;&lt;br&gt;
The wiki page is updated, sometimes. The Notion entry is written, occasionally. Neither of them is on the path between a model and a diff. They are referenced when a human looks them up — which is exactly when the resolution is already most expensive.&lt;/p&gt;

&lt;p&gt;The common failure underneath all four is the same: none of them have an opinion on which rule applies when several do. They surface rules. They do not resolve between rules.&lt;/p&gt;
&lt;h2&gt;
  
  
  What precedence semantics actually is
&lt;/h2&gt;

&lt;p&gt;Precedence semantics is the small body of rules a governance system uses to answer one question, every time, the same way: "given the current task, the current file, the current scope, and the full set of architectural decisions, which decision actually applies here, and which loses?"&lt;/p&gt;

&lt;p&gt;The answer has to be &lt;strong&gt;deterministic.&lt;/strong&gt; Not "the model probably picks the right one." Not "the reviewer usually catches it." Deterministic, in the engineering sense: same inputs in, same answer out, every time, regardless of which agent or model or temperature is on the other end.&lt;/p&gt;

&lt;p&gt;It also has to be &lt;strong&gt;reviewable.&lt;/strong&gt; The resolution has to be explainable as a chain of named facts — this scope was more specific, that decision supersedes the earlier one, this status retired the override — not as an opaque embedding score.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Governance is not memory retrieval. Governance is deterministic conflict resolution over architectural constraints. Retrieval is one input to that resolution — not a substitute for it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  The five axes of resolution
&lt;/h2&gt;

&lt;p&gt;A useful precedence engine resolves over a small, finite set of axes. Five of them carry almost all the weight in real codebases.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;What it answers&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Status&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Is the decision in force at all?&lt;/td&gt;
&lt;td&gt;A &lt;code&gt;deprecated&lt;/code&gt; ADR loses to any &lt;code&gt;accepted&lt;/code&gt; ADR touching the same scope, even if the deprecated one was more specific.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Supersedes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Does this decision explicitly retire an older one?&lt;/td&gt;
&lt;td&gt;ADR-031 carries &lt;code&gt;supersedes: ADR-014&lt;/code&gt;. Wherever they overlap, ADR-014 is treated as deprecated.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scope specificity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Whose scope is narrower?&lt;/td&gt;
&lt;td&gt;ADR-022 (&lt;code&gt;services/payments/**&lt;/code&gt;) beats ADR-014 (&lt;code&gt;services/**&lt;/code&gt;) inside its narrower scope.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Priority&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;When scopes are equal, who is authoritative?&lt;/td&gt;
&lt;td&gt;A security-class ADR (&lt;code&gt;priority: critical&lt;/code&gt;) wins over an ergonomics-class ADR (&lt;code&gt;priority: normal&lt;/code&gt;) at the same scope.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Temporal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;If everything else ties, which decision is newer?&lt;/td&gt;
&lt;td&gt;Two equal-priority, equal-scope, neither-supersedes-the-other decisions resolve by acceptance date. Tiebreaker, not primary driver.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two things matter about that table more than the contents of any one row. First: the axes are evaluated in a declared order, not improvised per query. Second: each axis is a fact carried by the decision itself — &lt;code&gt;status&lt;/code&gt;, &lt;code&gt;supersedes&lt;/code&gt;, &lt;code&gt;scope&lt;/code&gt;, &lt;code&gt;priority&lt;/code&gt;, &lt;code&gt;accepted_at&lt;/code&gt; are properties an ADR declares, not inferences a model has to make. Once declared, resolution is a finite computation, not a guess.&lt;/p&gt;

&lt;p&gt;This is why precedence semantics is the missing layer. Every term in the table is already familiar to anyone who has written ADRs for more than a year. What is new is the claim that resolving over them is a system responsibility, not a reviewer's job to do in their head.&lt;/p&gt;
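&lt;p&gt;To make "a finite computation, not a guess" concrete, here is a minimal sketch of a resolver over the five axes, evaluated in the order the table lists them. The &lt;code&gt;Adr&lt;/code&gt; class, the integer priorities, the acceptance dates, and the prefix-length heuristic for scope specificity are assumptions chosen for illustration, not a description of any existing engine.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import date
from fnmatch import fnmatch
from typing import Optional


@dataclass(frozen=True)
class Adr:
    """Facts an ADR declares about itself (hypothetical schema)."""
    adr_id: str
    status: str                      # "accepted" or "deprecated"
    scope: str                       # glob over repository paths
    priority: int                    # higher wins, e.g. critical=2, normal=1
    accepted_at: date
    supersedes: tuple[str, ...] = ()


def resolve(path: str, adrs: list[Adr]) -> Optional[Adr]:
    """Pick the single governing ADR for a path, axis by axis, in a fixed order."""
    # Axis 1 (status): only accepted decisions whose scope covers the path.
    live = [a for a in adrs if a.status == "accepted" and fnmatch(path, a.scope)]
    # Axis 2 (supersedes): drop anything a live decision explicitly retires.
    retired = {old for a in live for old in a.supersedes}
    live = [a for a in live if a.adr_id not in retired]
    if not live:
        return None

    # Axes 3-5: narrower scope, then higher priority, then newer acceptance date.
    # Specificity is approximated by the length of the glob's literal prefix;
    # adr_id is a last-resort key so the result never depends on input order.
    def rank(a: Adr):
        specificity = len(a.scope.split("*", 1)[0])
        return (specificity, a.priority, a.accepted_at, a.adr_id)

    return max(live, key=rank)


# Dates and priorities below are made up for the example.
adr014 = Adr("ADR-014", "accepted", "services/**", 1, date(2025, 5, 1))
adr022 = Adr("ADR-022", "accepted", "services/payments/**", 1, date(2026, 1, 15))

print(resolve("services/payments/charge.py", [adr014, adr022]).adr_id)  # ADR-022
print(resolve("services/users/api.py", [adr014, adr022]).adr_id)        # ADR-014
```

&lt;p&gt;Because the ranking is a plain tuple comparison over declared fields, the same inputs produce the same winner in every agent, and each step of the comparison can be reported as a named reason.&lt;/p&gt;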
&lt;h2&gt;
  
  
  Governance is deterministic conflict resolution
&lt;/h2&gt;

&lt;p&gt;Once precedence semantics is named, the whole category reframes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The retrieval framing:&lt;/strong&gt; Governance is about getting the relevant constraints into context. The model takes it from there. Conflict resolution is whatever the model produces, hopefully the right thing on most runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The precedence framing:&lt;/strong&gt; Governance is about computing, deterministically, the single constraint that governs this scope. The model receives that constraint, not the conflict. The same inputs produce the same answer in every agent.&lt;/p&gt;

&lt;p&gt;The retrieval framing makes the model the conflict-resolution engine. That is exactly the wrong place to put it. Models are excellent at code synthesis under a constraint and unreliable at choosing between constraints. The precedence framing keeps the model out of the resolution and lets it do what it is good at.&lt;/p&gt;

&lt;p&gt;Said differently: the architectural truth of a codebase is not allowed to depend on which agent ran the query. If it does, the codebase does not have an architecture — it has a sample.&lt;/p&gt;
&lt;h2&gt;
  
  
  Governance is a compiler problem
&lt;/h2&gt;

&lt;p&gt;The frame that follows from all of this is that an AI-era governance system is shaped like a compiler.&lt;/p&gt;

&lt;p&gt;The input is the same kind of thing teams already write: ADRs, design documents, exceptions, and the relationships between them. The output is a single, queryable, scope-aware representation that every agent — interactive, async, in CI — can ask the same question of and get the same answer. Between the two is a pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;01. Normalize  — ADRs to canonical facts
02. Resolve    — precedence over the five axes  [the missing layer]
03. Compile    — one constraint per scope
04. Enforce    — pre-gen inject · post-gen check
05. Trace      — "this diff applied ADR-022 at ..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each stage already has a recognizable analog in software engineering. Normalization is what a parser does. Resolution is what a type checker or constraint solver does. Compilation is what an intermediate representation is for. Enforcement is what a pre-commit hook or CI gate is for. Traceability is what build provenance is for.&lt;/p&gt;

&lt;p&gt;What is new in 2026 is that all five together describe a layer above the agents, not a feature inside any single one of them.&lt;/p&gt;

&lt;p&gt;Compilers are deterministic by construction. Their outputs are reproducible. Their decisions are explainable. Their failures are localizable. Every property AI coding governance currently lacks is a property a compiler-shaped governance layer would have by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this changes for engineering leaders
&lt;/h2&gt;

&lt;p&gt;For an engineering leader looking at the category in 2026, the practical implication is that the question to ask of any "AI governance" pitch is no longer "does it read our rules?" It is: &lt;strong&gt;"what does it do when two of our rules disagree?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the answer is "the model figures it out," the system is a retrieval layer with a governance label on it. If the answer involves embeddings, ranking scores, or "we tune the prompt," the system is doing statistics, not governance. If the answer involves a declared resolution order over status, supersedes, scope, priority, and time — it is doing the actual job.&lt;/p&gt;

&lt;p&gt;That distinction matters more than the surface-level tool category. A team can run Claude Code, Cursor, Copilot, Windsurf, and three in-house SDK agents on the same codebase, as many engineering orgs already do, and still have a coherent architecture, but only if the layer underneath has a deterministic answer to "which decision applies here."&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The engineering teams that win the next cycle will not be the ones that picked the best assistant. They will be the ones whose architecture survived having a portfolio of assistants — because the layer that decided what the rules were did not live inside any of them.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://mnemehq.com/insights/why-rag-fails-for-architectural-governance/" rel="noopener noreferrer"&gt;Why RAG Fails for Architectural Governance&lt;/a&gt; · &lt;a href="https://mnemehq.com/insights/memory-is-not-governance/" rel="noopener noreferrer"&gt;Memory Is Not Governance&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mnemehq.com/insights/why-architectural-governance-needs-precedence-semantics/" rel="noopener noreferrer"&gt;mnemehq.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>programming</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Why RAG Fails for Architectural Governance</title>
      <dc:creator>Theo Valmis</dc:creator>
      <pubDate>Thu, 14 May 2026 14:18:28 +0000</pubDate>
      <link>https://dev.to/mnemehq/why-rag-fails-for-architectural-governance-4h79</link>
      <guid>https://dev.to/mnemehq/why-rag-fails-for-architectural-governance-4h79</guid>
      <description>&lt;p&gt;Retrieval-augmented generation is an excellent tool for knowledge lookup. It is the wrong tool for enforcing architectural decisions. The distinction matters — and most teams building AI coding workflows haven't confronted it yet.&lt;/p&gt;

&lt;p&gt;When teams first encounter the problem of governing AI-generated code, RAG is the intuitive answer. You have a set of architectural decisions — ADRs, style guides, internal wikis, team conventions — and you want your AI coding assistant to respect them. RAG can retrieve relevant documents and inject them into the prompt. Problem solved, apparently.&lt;/p&gt;

&lt;p&gt;It isn't. The mismatch between what RAG provides and what architectural governance requires is deep, and fixing it requires a different kind of system entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  What RAG is actually good at
&lt;/h2&gt;

&lt;p&gt;RAG excels when the task is: given a query, find the most semantically relevant passages from a corpus and surface them to the model. It works well for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Documentation lookup&lt;/strong&gt; — "How does our auth middleware work?" retrieves the relevant design doc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FAQ / support&lt;/strong&gt; — surface the right answer from a knowledge base.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context injection&lt;/strong&gt; — prime the model with background it wouldn't otherwise have.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summarization&lt;/strong&gt; — condense a retrieved document for downstream consumption.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common thread: RAG is a retrieval and suggestion mechanism. It finds relevant information. It does not enforce anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  What architectural governance actually requires
&lt;/h2&gt;

&lt;p&gt;Architectural governance is a different problem category. When you need to prevent an AI agent from making a decision that violates your service boundaries, your decisions need to be:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Authoritative&lt;/strong&gt; — not "here's something relevant," but "this is the rule that applies here."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precedence-aware&lt;/strong&gt; — when two decisions conflict, the system must resolve the conflict deterministically, not leave it to the model's judgment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope-aware&lt;/strong&gt; — a decision about the payments service should not fire when the model is editing the analytics pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enforcement-capable&lt;/strong&gt; — the system needs to block or flag violations, not merely suggest alternatives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structurally validated&lt;/strong&gt; — decisions need a schema, not just free-form text, so violations can be detected consistently.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;RAG addresses exactly none of these requirements natively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure mode 1: Semantic similarity ≠ decision authority
&lt;/h2&gt;

&lt;p&gt;RAG retrieves documents based on embedding similarity to the query. "Which cache library should I use?" might retrieve three documents: a blog post about Redis, an ADR mandating Valkey, and a benchmark comparing both. The ADR is authoritative. The blog post is noise. RAG has no mechanism to distinguish them.&lt;/p&gt;

&lt;p&gt;You can try to patch this by tagging documents with metadata and filtering by source type. But you've now built a lightweight decision registry on top of your RAG system — which is a separate architectural layer. And you still haven't solved the next problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The core issue:&lt;/strong&gt; RAG ranks by similarity. Governance requires ranking by authority. These are orthogonal dimensions that need separate systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure mode 2: No precedence resolution
&lt;/h2&gt;

&lt;p&gt;Real architectural decision sets contain conflicts. An org-level decision from 2022 says "use PostgreSQL for all relational storage." A project-level decision from 2024 says "this service uses SQLite for simplicity — approved by the platform team." Which wins?&lt;/p&gt;

&lt;p&gt;The correct answer depends on scope, recency, and the explicit precedence relationship between the two decisions. A RAG system doesn't model any of this. It retrieves both, injects both, and leaves the model to interpret the contradiction. In practice, the model will pick whichever it finds more convincing in context — which is non-deterministic and ungovernable.&lt;/p&gt;

&lt;p&gt;A proper governance system needs a precedence engine: an explicit, deterministic function that takes a set of retrieved decisions and produces a single authoritative answer for a given context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure mode 3: Retrieval quality degrades at scale
&lt;/h2&gt;

&lt;p&gt;RAG retrieval quality is a function of embedding model quality, corpus size, and query construction. In small corpora (under 100 documents), RAG works reasonably well. As your decision corpus grows to hundreds of ADRs, style guides, runbooks, and policy documents, retrieval precision drops and recall degrades.&lt;/p&gt;

&lt;p&gt;More importantly, architectural decisions have precise scope signals — file patterns, service names, module boundaries — that embedding-based retrieval handles poorly. Scope-aware retrieval requires structured matching, not vector similarity.&lt;/p&gt;
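&lt;p&gt;As a rough illustration of structured matching, the sketch below selects decisions by glob pattern using Python's standard &lt;code&gt;fnmatch&lt;/code&gt; module. The &lt;code&gt;SCOPES&lt;/code&gt; registry, its ADR ids and patterns, and the &lt;code&gt;decisions_in_scope&lt;/code&gt; helper are hypothetical.&lt;/p&gt;

```python
from fnmatch import fnmatchcase

# Hypothetical registry: decision id -> scope pattern.
SCOPES = {
    "ADR-004": "services/**",
    "ADR-012": "services/payments/**",
    "ADR-031": "pipelines/analytics/**",
}


def decisions_in_scope(path: str, scopes: dict[str, str]) -> list[str]:
    """Return the ids of decisions whose scope pattern matches the path, sorted."""
    # fnmatchcase lets "*" match any characters, including "/", so "**"
    # behaves like a recursive match here. That is an fnmatch property,
    # not gitignore semantics.
    return sorted(i for i, pattern in scopes.items() if fnmatchcase(path, pattern))
```

&lt;p&gt;A path either matches a pattern or it does not. There is no similarity threshold to tune, and a file under &lt;code&gt;pipelines/analytics/&lt;/code&gt; can never pull in a payments decision because the wording happened to be close.&lt;/p&gt;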

&lt;h2&gt;
  
  
  Failure mode 4: Suggestion, not enforcement
&lt;/h2&gt;

&lt;p&gt;Even when RAG retrieves the right decision, the model can ignore it. Whether it respects the retrieved constraint depends on prompt construction, model behavior, and context window dynamics — none of which are deterministic.&lt;/p&gt;

&lt;p&gt;In practice, models under instruction to complete a coding task will prioritize task completion over constraint adherence when the two are in tension. A suggestion system isn't a governance system.&lt;/p&gt;

&lt;p&gt;Enforcement requires a layer that can inspect generated output against structured constraints and block or flag violations before they reach review. This is architecturally separate from the generation layer, and RAG has no role in it.&lt;/p&gt;
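&lt;p&gt;A post-generation check of this kind can be small. The sketch below assumes the generated output is Python source and that a constraint such as "No SQLite" has already been translated into a set of banned module names. Both &lt;code&gt;forbidden_imports&lt;/code&gt; and &lt;code&gt;gate&lt;/code&gt; are hypothetical helpers, and a real check would cover far more than imports.&lt;/p&gt;

```python
import ast


def forbidden_imports(source: str, banned: set[str]) -> list[str]:
    """Return the banned top-level modules imported by the generated code."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from x import y" only; relative imports are skipped.
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in banned:
                found.add(root)
    return sorted(found)


def gate(source: str, banned: set[str]) -> bool:
    """True means the write may proceed; False means it is blocked."""
    return not forbidden_imports(source, banned)
```

&lt;p&gt;The check runs on the output, not on the prompt, so its verdict does not depend on whether the model paid attention to the injected constraint.&lt;/p&gt;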

&lt;h2&gt;
  
  
  What a proper governance layer looks like
&lt;/h2&gt;

&lt;p&gt;The distinction between RAG-based context injection and governance enforcement:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;RAG approach&lt;/th&gt;
&lt;th&gt;Governance layer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval basis&lt;/td&gt;
&lt;td&gt;Embedding similarity&lt;/td&gt;
&lt;td&gt;Scope + keyword + recency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Authority model&lt;/td&gt;
&lt;td&gt;None — all documents equal&lt;/td&gt;
&lt;td&gt;Explicit precedence hierarchy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Conflict resolution&lt;/td&gt;
&lt;td&gt;Left to the model&lt;/td&gt;
&lt;td&gt;Deterministic precedence engine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enforcement&lt;/td&gt;
&lt;td&gt;Suggestion only&lt;/td&gt;
&lt;td&gt;Block / flag at generation time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decision schema&lt;/td&gt;
&lt;td&gt;Free-form text&lt;/td&gt;
&lt;td&gt;Structured with typed fields&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope handling&lt;/td&gt;
&lt;td&gt;Implicit / approximate&lt;/td&gt;
&lt;td&gt;Explicit scope patterns per decision&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A governance layer needs a structured decision schema with typed fields for scope, status, supersedes, precedence, rationale, and the constraint itself. An example structured decision record:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ADR-012: Payment service storage backend&lt;/span&gt;
&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ADR-012&lt;/span&gt;
&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;active&lt;/span&gt;
&lt;span class="na"&gt;scope&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;services/payments/**&lt;/span&gt;
&lt;span class="na"&gt;supersedes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ADR-004&lt;/span&gt;
&lt;span class="na"&gt;precedence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;project&lt;/span&gt;   &lt;span class="c1"&gt;# beats org-level if conflict&lt;/span&gt;
&lt;span class="na"&gt;constraint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Use PostgreSQL with SQLAlchemy ORM. No direct SQL. No SQLite.&lt;/span&gt;
&lt;span class="na"&gt;rationale&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Consistency with audit logging requirements (SOC 2 TR-7).&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the model targets a file matching &lt;code&gt;services/payments/**&lt;/code&gt;, this decision fires. Its precedence level is checked against any conflicting org-level rules. The resolved constraint is injected authoritatively. If the model's output violates it, the enforcement layer blocks the write.&lt;/p&gt;

&lt;h2&gt;
  
  
  When RAG is still useful in this context
&lt;/h2&gt;

&lt;p&gt;RAG is useful for surfacing relevant prior art, similar code patterns, and background context during generation. It's appropriate when you want to enrich the model's knowledge without binding it to specific constraints. It pairs well with a governance layer: RAG provides context, the governance layer provides constraints.&lt;/p&gt;

&lt;p&gt;The error is conflating the two — assuming that injecting decision documents via RAG is equivalent to enforcing those decisions. It isn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The underlying principle
&lt;/h2&gt;

&lt;p&gt;The distinction maps to a simple principle: suggestion systems and enforcement systems are architecturally distinct. You cannot turn a suggestion system into an enforcement system by improving retrieval quality. The enforcement guarantee requires a structural property, a deterministic check against structured constraints, that retrieval-based systems by design do not provide.&lt;/p&gt;

&lt;p&gt;Architectural governance for AI coding is an enforcement problem. It should be solved with enforcement architecture.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://mnemehq.com/insights/memory-is-not-governance/" rel="noopener noreferrer"&gt;Memory Is Not Governance&lt;/a&gt; · &lt;a href="https://mnemehq.com/insights/why-architectural-governance-needs-precedence-semantics/" rel="noopener noreferrer"&gt;Why Architectural Governance Needs Precedence Semantics&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mnemehq.com/insights/why-rag-fails-for-architectural-governance/" rel="noopener noreferrer"&gt;mnemehq.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>programming</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Memory Is Not Governance</title>
      <dc:creator>Theo Valmis</dc:creator>
      <pubDate>Thu, 14 May 2026 14:17:14 +0000</pubDate>
      <link>https://dev.to/mnemehq/memory-is-not-governance-5end</link>
      <guid>https://dev.to/mnemehq/memory-is-not-governance-5end</guid>
      <description>&lt;p&gt;The AI coding category has spent two years calling four different systems by the same word. Memory. Context. Retrieval. Governance. They share some primitives. They optimize for different things. And the most expensive mistake an engineering team can currently make is buying a memory system and expecting it to govern.&lt;/p&gt;

&lt;p&gt;The AI coding category is awash in memory products. Letta. Mem0. OpenAI's memory feature. Cursor's per-user context. Claude's projects. Every agent framework ships a "long-term memory" primitive. They are all built on a similar conceptual core — durable storage of past interactions, embedding-based retrieval, opportunistic injection — and they all do recall well.&lt;/p&gt;

&lt;p&gt;None of them governs.&lt;/p&gt;

&lt;p&gt;That sentence sounds polemical and is meant to. The conflation of "memory" and "governance" in the AI coding category is the single biggest source of category confusion in 2026, and it is the reason most engineering teams are paying for tools that promise architectural consistency and shipping codebases that do not have any.&lt;/p&gt;

&lt;h2&gt;
  
  
  One word, four systems
&lt;/h2&gt;

&lt;p&gt;Walk into ten engineering conversations about AI coding and you will hear the same four words used as if they meant the same thing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context.&lt;/strong&gt; The window of tokens the model can see right now. A per-request property.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval.&lt;/strong&gt; The mechanism by which something gets into that window. An index lookup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory.&lt;/strong&gt; The durable store of past interactions, decisions, preferences, and conversations that retrieval reads from.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance.&lt;/strong&gt; The rule system that decides which architectural constraints apply to which code, and enforces them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These four concepts get blurred because three of them are tightly coupled and the fourth happens to use the other three. Governance systems do read from memory. They do retrieve. They do inject into context. So at first glance, governance looks like a flavor of memory.&lt;/p&gt;

&lt;p&gt;It is not. Memory and governance differ on the most important thing a system can differ on: what they are trying to be good at.&lt;/p&gt;

&lt;p&gt;Memory systems optimize for recall. Governance systems optimize for constraint enforcement. Different targets, different math, different failure modes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What memory actually optimizes
&lt;/h2&gt;

&lt;p&gt;A well-designed memory system is judged on questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Given a query, did we surface the relevant past artifact?&lt;/li&gt;
&lt;li&gt;How fuzzy can the query be before recall degrades?&lt;/li&gt;
&lt;li&gt;How long does the system continue to find the right thing as the corpus grows?&lt;/li&gt;
&lt;li&gt;How well does the system tolerate paraphrase, synonyms, near-duplicates?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All four are recall metrics. The optimization target is: given fuzzy input, return relevant material. The corpus is allowed to be redundant. The output is allowed to be ranked, partial, probabilistic. The system is doing well if the right thing is somewhere in the top results.&lt;/p&gt;

&lt;p&gt;That target is the right one for the problems memory systems were built to solve: personal assistants, agent continuity, customer support. In every case, recall is the job, and fuzziness is acceptable because a human (or a reasoning model) is on the other end to filter.&lt;/p&gt;

&lt;p&gt;None of those properties survive the move to governance.&lt;/p&gt;

&lt;h2&gt;
  
  
  What governance actually optimizes
&lt;/h2&gt;

&lt;p&gt;A governance system is judged on a different question entirely:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Given the current task, current file, current scope, and the full set of architectural decisions — which decision applies here, and was the resulting code obedient to it?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The optimization target is constraint enforcement. Output a single resolved rule. Reject code that violates it. Produce an audit trail explaining why. The job is not to surface candidates. The job is to pick.&lt;/p&gt;

&lt;p&gt;That distinction cascades through every property of the system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The output is one value, not a ranking.&lt;/strong&gt; Recall systems return top-k. Governance systems return top-1, by construction. "Here are five possibly-relevant ADRs" is a recall answer. "ADR-022 applies to &lt;code&gt;services/payments/charge.py&lt;/code&gt;, and ADR-014 is overridden in that scope" is a governance answer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The result has to be deterministic.&lt;/strong&gt; Recall can be probabilistic without harm. Governance cannot. The same input must produce the same answer in every agent, every model, every temperature, or the codebase is not actually governed by anything.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Conflict is the central case, not an edge case.&lt;/strong&gt; Recall systems treat overlapping documents as a ranking nuisance. Governance systems treat overlap as the entire point — conflict resolution is what makes governance deterministic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The audit surface is different.&lt;/strong&gt; A memory system's audit answer is "here is what we showed you, ranked by similarity." A governance system's audit answer is "this diff was generated under ADR-022, which won over ADR-014 because its scope is narrower."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The enforcement point exists.&lt;/strong&gt; Memory systems have no enforcement point. They surface and stop. Governance systems have a hook — pre-generation injection, post-generation check, CI gate — where output is rejected if it violates the resolved constraint.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
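&lt;p&gt;Items 1 and 4 can be sketched together: a function that returns one winner plus a sentence explaining what it beat. This models only the scope axis. The &lt;code&gt;Candidate&lt;/code&gt; class, the &lt;code&gt;resolve_with_reason&lt;/code&gt; function, and the rule that a longer path prefix counts as a narrower scope are assumptions for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    id: str
    scope: str  # path prefix; a longer prefix is treated as narrower (assumption)


def resolve_with_reason(path: str, candidates: list[Candidate]) -> tuple[str, str]:
    """Pick the narrowest matching candidate and report which ones it beat."""
    matching = [c for c in candidates if path.startswith(c.scope)]
    if not matching:
        raise LookupError(f"no decision covers {path}")
    # Sort by (scope length, id) so the winner never depends on input order.
    ranked = sorted(matching, key=lambda c: (len(c.scope), c.id), reverse=True)
    winner, losers = ranked[0], ranked[1:]
    if not losers:
        return winner.id, f"{winner.id} applies to {path} (only matching decision)"
    beaten = ", ".join(c.id for c in losers)
    return winner.id, f"{winner.id} applies to {path}; overrides {beaten} (narrower scope)"
```

&lt;p&gt;The return value is a single id rather than a ranked list, and the explanation string is the kind of record an audit trail can store alongside the diff.&lt;/p&gt;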

&lt;h2&gt;
  
  
  The optimization-target table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Memory system&lt;/th&gt;
&lt;th&gt;Governance system&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Optimization target&lt;/td&gt;
&lt;td&gt;Recall under fuzziness&lt;/td&gt;
&lt;td&gt;Constraint enforcement under conflict&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output shape&lt;/td&gt;
&lt;td&gt;Top-k ranked list&lt;/td&gt;
&lt;td&gt;Top-1 resolved rule&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Determinism&lt;/td&gt;
&lt;td&gt;Probabilistic, acceptable&lt;/td&gt;
&lt;td&gt;Required, by construction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Conflict semantics&lt;/td&gt;
&lt;td&gt;Ranking nuisance&lt;/td&gt;
&lt;td&gt;Central concern (precedence)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit surface&lt;/td&gt;
&lt;td&gt;"What we showed you"&lt;/td&gt;
&lt;td&gt;"Which rule won and why"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enforcement point&lt;/td&gt;
&lt;td&gt;None — surfaces and stops&lt;/td&gt;
&lt;td&gt;Hook at file write / commit / PR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure mode&lt;/td&gt;
&lt;td&gt;Missed recall (false negative)&lt;/td&gt;
&lt;td&gt;Silent drift, contradictory diffs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A team that buys the memory column of that table and assumes it got the governance column has bought a recall system and labeled it governance. Six months later, the codebase has both versions of the rule in production, and nobody knows which decision the last bot-generated PR was actually written under.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory is an input to governance, not a substitute
&lt;/h2&gt;

&lt;p&gt;Naming the gap is not the same as saying memory does not belong in the picture. It does — just one layer below where the category currently puts it. Memory is one of the inputs a governance system reads from. It is not the governance system itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The current framing:&lt;/strong&gt; Buy a memory product. Index your ADRs. Hand the agent the top retrieved chunks. Call it AI coding governance. Discover six months in that the same constraint resolves differently across services and nobody can audit why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The correct framing:&lt;/strong&gt; Memory stores decisions and their metadata. Governance queries memory to discover candidates, then resolves between them deterministically over a declared precedence order, then enforces the resolved rule at the file-write or PR boundary.&lt;/p&gt;

&lt;p&gt;Once the layering is drawn this way, the category map snaps into focus. Memory products are real, useful, and almost universally available. The governance layer above them is mostly missing — not because it is impossible to build, but because the conflation of names has let vendors keep selling memory and call it governance, and let buyers keep buying memory and assume the architectural-constraint problem is solved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the conflation persists
&lt;/h2&gt;

&lt;p&gt;Three reasons, roughly in order of weight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The primitives genuinely overlap.&lt;/strong&gt; A governance system that does not read from a durable store of decisions and retrieve relevant ones is not a governance system. So every governance system has a memory inside it. The reverse implication — that every memory system is therefore a governance system — is the false step, but it is an easy one to take when the substrate looks identical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The vendors are incentivized to blur the line.&lt;/strong&gt; Memory is a solved product category with shipped tooling and growing budgets. Governance is a category that is still being defined. The path of least resistance for any incumbent is to relabel its memory product as governance and let the buyer discover the difference in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The buyers do not yet have a sharp ask.&lt;/strong&gt; Engineering teams know they want their codebase to obey its architectural decisions across agents. Most have not yet articulated that as a separate problem from "the agent should remember things." Until the request is sharper than that, vendors will keep answering it with memory products.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;The next time a vendor pitches "AI coding memory" for your architecture, the test is one question: "What happens when two of the rules in your store disagree on the same file?"&lt;/p&gt;

&lt;p&gt;If the answer is about retrieval scores, embedding quality, or chunking strategy — it is a memory system. Useful for some problems. Not the one being solved.&lt;/p&gt;

&lt;p&gt;If the answer is about declared precedence axes, deterministic resolution, and an enforcement point that a generated diff actually has to pass through — it is a governance system. That is the category that matters for codebases governed by architecture, and it is the layer the AI coding ecosystem is still mostly missing.&lt;/p&gt;

&lt;p&gt;Memory systems optimize recall. Governance systems optimize constraint enforcement. Two different jobs. One word. The cost of that conflation is paid in silent drift, contradictory diffs, and codebases that look architected and behave sampled.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mnemehq.com/insights/memory-is-not-governance/" rel="noopener noreferrer"&gt;mnemehq.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>devtools</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
