DEV Community: Theo Valmis

How Mneme governs AI-generated code before the model writes a line

Theo Valmis — Mon, 29 Jun 2026 13:58:30 +0000

LLMs start every call from zero. They reintroduce a library you dropped six months ago, rebuild a component you chose to keep small, and contradict decisions your team already settled. Each violation reads as reasonable on its own. Stack them across a week of agent sessions and you get architectural drift.

Mneme works at the prompt boundary. It reads the decisions your project already made and checks the task against them before the model generates anything. The repo ships Layer 1: local-repo, single-developer, project-scoped governance. Here is the shape.

The pipeline

Five stages over one corpus file, running locally in under two minutes:

project_memory.json → MemoryStore → Retriever → ContextBuilder → LLMAdapter → Evaluator

project_memory.json holds the corpus: rules, constraints, anti-patterns, and decision records as structured, human-editable JSON. You write it by hand or compile it from ADRs.
MemoryStore loads the file and migrates legacy item shapes so older corpora still parse.
Retriever picks the decisions relevant to the current task. It scores on keyword overlap, tag match, and priority weight. No embeddings, no vector database.
ContextBuilder formats the top matches into a compact context packet.
LLMAdapter injects that packet as the system prompt and calls the model, or dry-runs with no API key.
Evaluator scores the response against the injected decisions and reports an alignment number.
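
The retrieval and context-building stages can be sketched in a few lines. This is a minimal illustration, not Mneme's implementation: the item fields (`id`, `text`, `tags`, `priority`), the scoring weights, and the function names `retrieve` and `build_context` are all assumptions made for the example.

```python
import re

# Hypothetical corpus items; the real project_memory.json schema may differ.
CORPUS = [
    {"id": "D-001", "text": "Use the built-in fetch client, do not add axios",
     "tags": ["http", "dependencies"], "priority": 3},
    {"id": "D-002", "text": "Keep the Button component under 100 lines",
     "tags": ["ui"], "priority": 1},
    {"id": "D-003", "text": "All database access goes through the repository layer",
     "tags": ["database", "architecture"], "priority": 2},
]

def tokenize(text):
    """Lowercase word tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(item, query_tokens):
    """Keyword overlap plus tag match plus priority weight (weights are illustrative)."""
    overlap = len(tokenize(item["text"]) & query_tokens)
    tag_hits = len({t.lower() for t in item["tags"]} & query_tokens)
    return overlap + 2 * tag_hits + 0.5 * item["priority"]

def retrieve(corpus, task, k=3):
    """Return the top-k items; ties break on id so the order never varies."""
    q = tokenize(task)
    ranked = sorted(corpus, key=lambda it: (-score(it, q), it["id"]))
    return ranked[:k]

def build_context(items):
    """Format the matches into a compact packet suitable for a system prompt."""
    lines = [f"[{it['id']}] {it['text']}" for it in items]
    return "Project decisions you must follow:\n" + "\n".join(lines)

top = retrieve(CORPUS, "add an http client for the billing page", k=2)
print(build_context(top))
```

Sorting on the pair (negative score, id) is what makes the order repeatable: two items with equal scores always come back in the same sequence.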

A second path adds conflict_detector, which scans the response after generation, and an ADR compiler (adr_parser then adr_compiler) that turns ADR files with YAML frontmatter into the corpus and resolves precedence between decisions that disagree.
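
A rough sketch of the ADR path, under stated assumptions: the frontmatter keys (`id`, `priority`, `supersedes`) and the precedence rule (a superseded ADR drops out, then higher priority wins) are hypothetical, and the parser below handles only flat `key: value` lines rather than full YAML.

```python
def parse_adr(text):
    """Split an ADR into flat frontmatter fields and a body.

    Handles only simple `key: value` lines between two `---` markers;
    a real compiler would use a proper YAML parser.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or lines[0] != "---":
        raise ValueError("missing frontmatter")
    end = lines.index("---", 1)          # closing marker
    meta = {}
    for line in lines[1:end]:
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    meta["body"] = "\n".join(lines[end + 1:]).strip()
    return meta

def resolve(adrs):
    """Drop superseded ADRs, then order by priority (high first) and id."""
    superseded = {a["supersedes"] for a in adrs if a.get("supersedes")}
    active = [a for a in adrs if a["id"] not in superseded]
    return sorted(active, key=lambda a: (-int(a.get("priority", 0)), a["id"]))

ADR_A = """---
id: ADR-007
priority: 2
---
Use REST for public APIs.
"""

ADR_B = """---
id: ADR-012
priority: 2
supersedes: ADR-007
---
Use GraphQL for public APIs.
"""

winners = resolve([parse_adr(ADR_A), parse_adr(ADR_B)])
print([a["id"] for a in winners])
```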

The demo runs each task twice, once with no governance and once with the corpus enforced, so you read the delta yourself.

Three principles hold the design in place

Deterministic over clever. Same corpus and same query produce byte-identical retrieval order on every run. A simple retriever that returns the same answer twice beats a smart one that does not.
Auditable over autonomous. Every block records which decision matched, which rule fired, and which term in the input triggered it. You can rebuild any verdict from the artifacts.
Prevention before review. The check lands before generation. By the time a reviewer opens the pull request, the drift has already shipped into the branch.
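
A minimal sketch of what a pre-generation check with an audit record could look like. The decision fields, the rule names, and `check_task` are hypothetical; the point is only that each finding names the decision, the rule, and the triggering term.

```python
import re

# Hypothetical decision records with the terms that trigger each rule.
DECISIONS = [
    {"id": "D-014", "rule": "forbidden_dependency", "terms": ["moment", "momentjs"],
     "reason": "moment.js was dropped in favor of the native Intl API"},
    {"id": "D-021", "rule": "forbidden_pattern", "terms": ["singleton"],
     "reason": "global singletons are banned in the service layer"},
]

def check_task(task):
    """Check a task prompt before any model call and return an auditable verdict."""
    tokens = set(re.findall(r"[a-z0-9]+", task.lower()))
    findings = []
    for d in sorted(DECISIONS, key=lambda d: d["id"]):   # fixed order, repeatable output
        for term in d["terms"]:
            if term in tokens:
                findings.append({"decision": d["id"], "rule": d["rule"],
                                 "term": term, "reason": d["reason"]})
                break                                     # one finding per decision
    return {"allowed": not findings, "findings": findings}

verdict = check_task("Format dates with moment in the invoice view")
print(verdict)
```

Because the verdict carries the matched decision, rule, and term, anyone can rerun the same input against the same corpus and reproduce it.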

Why this is not RAG

RAG retrieves documents to inform an answer. Mneme retrieves decisions to constrain one.

	RAG	Mneme
Input	Documents, chunks, embeddings	Rules, constraints, decision records
Goal	Inform the response	Shape the response
Output	The model knows more	The model follows what you decided
Test	"Did it cite the right source?"	"Did it respect the constraint?"

No vector store, no agent loop. The corpus stays small, structured, and yours.

What it is not, by design

The freeze pins the retrieval mechanics, enforcement semantics, and benchmark methodology at commit e73ff7d. The open exit criterion is real-world validation with design partners. Several things sit outside the wedge on purpose, not on a backlog:

Not generalized agent memory or a conversation-history store
Not autonomous planning or tool-use orchestration
Not prompt rewriting. Mneme blocks a violating prompt; it does not polish one.
Not auto-fixing. Mneme blocks, and the human or model fixes.

The benchmark carries the same restraint. It is a regression instrument, not a generalization claim: canned model responses, fixed retrieval, two-layer scoring, today at 7/7 scenarios and recall@3 = 1.00. Its job is to make any change to retrieval or enforcement visible, so no regression lands unseen.
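
For readers unfamiliar with the metric, recall@k is the fraction of relevant items that appear in the top k retrieved results. The sketch below uses the standard definition; the scenario data is invented, and averaging per-scenario recall is an assumption about how a suite-level number might be formed.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=3):
    """Fraction of the relevant decisions that appear in the top-k retrieved list."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("scenario has no relevant decisions")
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

# Hypothetical scenarios: fixed retrieval output and the decisions that should surface.
SCENARIOS = [
    {"retrieved": ["D-001", "D-003", "D-002"], "relevant": ["D-001"]},
    {"retrieved": ["D-003", "D-001", "D-002"], "relevant": ["D-001", "D-003"]},
]

mean_recall = sum(recall_at_k(s["retrieved"], s["relevant"]) for s in SCENARIOS) / len(SCENARIOS)
print(f"recall@3 = {mean_recall:.2f}")
```

A retrieval change that pushes a relevant decision out of the top three lowers the number, which is what makes it usable as a regression signal.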

Read the code

Layer 1, the benchmark suite, and an example corpus are public at https://github.com/MnemeHQ/mneme. The concepts behind the design (governance before generation, architectural drift, verification contracts) are defined at mnemehq.com/concepts.

Cursor Developer Habits Report 2026: Why AI Coding Needs Governance Infrastructure

Theo Valmis — Wed, 03 Jun 2026 18:40:09 +0000

Cursor's Developer Habits Report is one of the clearest signals yet that AI coding has crossed from individual productivity into software-delivery infrastructure. The headline numbers read as a story about speed: more code per week, larger PRs, deeper agent sessions, more changes committing without manual review. The deeper implication is governance -- whether teams can preserve architectural intent while generation, review, automation, and commit flows all accelerate at once.

The velocity curve is now measured, not anecdotal. For two years the claim that AI coding is accelerating rested mostly on vibes and vendor decks. Cursor's data turns it into telemetry. And read as an operations document rather than a marketing one, that telemetry describes a structural shift: software delivery is getting harder to govern, not just faster to produce.

This is not a critique of Cursor. The report is strong validation. Cursor proves the velocity curve with numbers most of the industry only gestured at. The point of this essay is what sits on the other side of that curve.

What the Cursor Developer Habits Report Shows

The inaugural Cursor Developer Habits Report (Spring 2026 edition), published by Cursor (Anysphere, Inc.), draws on Cursor usage data rather than survey responses. It captures the transformation across five themes -- developer acceleration, the economics of intelligence, the power user gap, the rise of context, and the shift to automation. The headline figures:

3.6K -> 8.6K lines added per developer per week -- the per-developer code volume rose from 3.6K (Jan 2025) to 8.6K (May 2026), with growth accelerating since the start of 2026.
125.86 -> 345.02 lines per PR at p75 -- lines added per pull request at the 75th percentile rose roughly 2.5x year over year (Jan 2025 to May 2026). Developers are taking on larger units of work in a single PR.
8% -> 13.8% mega PRs -- the share of PRs with at least 1,000 changed lines grew from 8% (Jan 2025) to 13.8% (May 2026).
~30% more tool calls per session in two months -- coding agents are reading and editing files, searching code, running shell commands, and browsing the web more frequently as they take on more complex work.
7% -> 36.3% changes committed without manual diff acceptance -- since the start of 2026, more than 5x as many agent-generated changes are reaching commits without a separate manual diff acceptance step (7% on Jan 1, 2026; 36.3% on May 16, 2026).
~76% -> ~81% AI-generated code survival -- the share of AI-generated code that persists rose, so more agent-authored code is both landing and staying.
46x and 15x at the tail -- p99 developers produce 46x more lines than the median active user and merge 15x more PRs than the median active PR author.
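
The growth multiples implied by these figures are simple endpoint ratios. A quick check, using only the start and end values quoted in the list above:

```python
# Endpoint values quoted in the list above: (start, end).
FIGURES = {
    "lines added per developer per week (K)": (3.6, 8.6),
    "lines per PR at p75": (125.86, 345.02),
    "mega-PR share (%)": (8.0, 13.8),
    "committed without manual diff acceptance (%)": (7.0, 36.3),
}

ratios = {name: end / start for name, (start, end) in FIGURES.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

The last ratio comes out a little above 5, consistent with "more than 5x". The p75 ratio across the full January 2025 to May 2026 window is about 2.7, somewhat above the "roughly 2.5x year over year" figure, likely because a year-over-year comparison covers a 12-month window rather than the full span.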

A line chart shows agent changes reaching commits without a manual diff acceptance step rising from 7% in January 2026 to 36.3% in May 2026 -- more than 5x in 2026.

Pair two of those numbers and the framing changes. More agent-authored code is reaching commits with less manual diff review (7% to 36.3%), and a higher share of it survives in the codebase (~76% to ~81%). More unreviewed AI code is both landing and persisting. That is not just a productivity story. It is a story about where architectural decisions are now being made -- increasingly inside an agent session, not inside a review thread.

Cursor states the destination plainly: AI software development is entering a new era, with AI becoming infrastructure for automating more of the software development lifecycle end to end. When something becomes infrastructure, the relevant question stops being "is it fast" and becomes "how is it governed."

Velocity changed the unit of risk

When AI was autocomplete, governance could reasonably live in review. A human wrote most of the change, accepted small suggestions inline, and a reviewer read a human-sized diff before merge. The unit of risk was a line or a function, and review was a sufficient first governance surface.

The report describes a different unit of risk. PRs at p75 carry 2.5x the lines they did a year ago. Mega PRs -- 1,000+ changed lines -- are now 13.8% of PRs. Agent sessions touch more files and make ~30% more tool calls than they did two months prior. And changes increasingly reach commits without a separate manual diff acceptance step.

Review is not dead. But review can no longer be the first governance surface. By the time a 1,000-line agent-authored change lands in a PR, the architectural decisions inside it have already been made -- which dependencies to import, which boundary to cross, which pattern to follow. A reviewer reading that diff is auditing decisions, not shaping them. The leverage point moved earlier, to the moment of generation.

The unit of risk scaled faster than the unit of review. When a single PR can carry 1,000+ agent-authored lines that were committed without manual diff acceptance, the first place to assert architectural intent is before generation, not after merge.

This is why governance before generation stops being a slogan and becomes an operational requirement. The cheapest place to prevent an architectural violation is before the agent writes the code that contains it. Catching it in review still works, but at agent velocity, review becomes a backlog, not a gate. Verification contracts -- architectural rules expressed as checks an agent's output must satisfy -- move the assertion to where the decisions are actually being made.
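
One way to picture a verification contract is as a function that takes generated source and returns violations. The sketch below checks generated Python for forbidden imports using the standard `ast` module; the rule table and the ADR identifier in it are hypothetical.

```python
import ast

# Hypothetical rule: module name -> the decision that forbids it.
FORBIDDEN_IMPORTS = {"requests": "ADR-009: use the shared http client wrapper"}

def imported_modules(source):
    """Top-level module names imported anywhere in the source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return found

def verify(source):
    """Return a list of violations; an empty list means the contract is satisfied."""
    modules = imported_modules(source)
    return [f"{mod}: {why}" for mod, why in sorted(FORBIDDEN_IMPORTS.items()) if mod in modules]

generated = "import requests\n\ndef fetch(url):\n    return requests.get(url).json()\n"
violations = verify(generated)
print(violations)
```

Parsing the syntax tree rather than matching text means a mention of the module in a comment or string does not trigger the rule.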

Context is not governance

The report's "rise of context" theme is the one most likely to be misread as a solution. Models are reading far more before they write: input now accounts for more than 90% of input-output token volume, making context the dominant part of non-cache model usage. The intuition that follows is comforting -- if the model can see the whole codebase, surely it will respect the codebase.

It will not, not reliably. More context helps an agent understand a codebase. It does not guarantee the agent complies with it. A model can read every file, ingest every convention, hold the entire architecture in its window, and still import a forbidden dependency, cross a layer boundary, violate a naming rule, or contradict a platform decision recorded in an ADR. Reading a constraint is not the same as being bound by it.

This is the difference between memory volume and enforceable intent. Context is probabilistic input to a generation step. Governance is a deterministic check on the output. The two are not substitutes, and scaling the first does not produce the second. A 90%-input token mix means agents are extraordinarily well-informed and still entirely unconstrained.

An agent that reads more is not an agent that complies more. Context tells a model what exists; it does not tell the system what must not ship. Architectural compliance is an enforcement property, not a retrieval property.

This is also the line between retrieval and governance. Feeding architecture into the prompt is retrieval; binding generation to it is governance. We have written about that distinction at length in RAG vs governance: retrieval surfaces relevant text, governance enforces a rule deterministically regardless of what the model chose to attend to.

Automation creates governance surfaces

The report's "shift to automation" theme is where the governance gap becomes concrete. More AI changes are being accepted automatically: agent-generated changes reaching commits without a separate manual diff acceptance step grew more than 5x in 2026, from 7% on Jan 1 to 36.3% on May 16. Cursor also notes that adoption of its Automations is growing quickly, with security review emerging as a strong use case, and that SDK runs show early demand for turning agent infrastructure into a programmable platform customized to how each company builds software.

Every one of those is a new automated surface. And every automated surface is a place where architectural intent can be honored or quietly broken with no human in the loop. Automation does not remove the need for governance; it multiplies the number of points at which governance has to apply.

The implication is that governance can no longer be a single checkpoint. It has to propagate across the lifecycle:

Before code generation -- surface the relevant architectural constraints to the agent.
Before tool execution -- an agent running shell commands and editing files is acting on the system, not just proposing text.
Before commit -- the surface that 36.3% of agent changes now cross with no manual diff review.
Before the PR -- so a mega PR is born compliant rather than audited late.
In CI -- the deterministic backstop that fails the build on violation.
Across generated artifacts -- config, infrastructure, schemas, and migrations, not just application source.

A vertical flow diagram shows governance applying across six automated surfaces in sequence: before generation, before tool execution, before commit, before the PR, CI enforcement, and across generated artifacts.

A programmable agent platform with auto-accepting commit flows needs governance wired into each of those surfaces, not bolted onto one. That is the substance of governance propagation: the same architectural constraints, enforced consistently everywhere code is generated, evaluated, and committed. And it is why governance before generation is the anchor -- the earlier in the chain a constraint is asserted, the fewer downstream surfaces have to catch the failure.
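
As a rough illustration of propagation, the sketch below applies one shared rule set at two surfaces, a pre-commit hook and a CI job, by checking the lines a unified diff adds. The rules, the surface names, and the `gate` function are assumptions for the example, not a description of any shipped integration.

```python
import re

# One shared rule set (hypothetical), reused unchanged at every surface.
RULES = [
    {"id": "R-1", "pattern": re.compile(r"^\s*(import|from)\s+requests\b"),
     "message": "forbidden dependency: requests"},
    {"id": "R-2", "pattern": re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
     "message": "destructive migration"},
]

def check_text(lines):
    """Apply every rule to every line; positions are 1-based within `lines`."""
    return [(r["id"], n, r["message"])
            for n, line in enumerate(lines, start=1)
            for r in RULES if r["pattern"].search(line)]

def added_lines(diff):
    """Lines a unified diff adds, without the leading '+' (file headers skipped)."""
    return [l[1:] for l in diff.splitlines() if l.startswith("+") and not l.startswith("+++")]

def gate(surface, lines):
    """Return a process-style exit code: 0 to proceed, 1 to block."""
    findings = check_text(lines)
    for rule_id, n, msg in findings:
        print(f"[{surface}] {rule_id} at added line {n}: {msg}")
    return 1 if findings else 0

DIFF = """--- a/app/client.py
+++ b/app/client.py
@@ -1,2 +1,3 @@
+import requests
 def fetch(url):
-    pass
+    return requests.get(url)
"""

pre_commit = gate("pre-commit", added_lines(DIFF))
ci = gate("ci", added_lines(DIFF))
```

Both surfaces call the same `check_text`, so a rule added once takes effect everywhere the gate runs.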

The power-user gap becomes a governance gap

The report's "power user gap" theme is, on its face, a story about inequality of output: p99 developers produce 46x more lines than the median active user and merge 15x more PRs than the median active PR author. Activity is heavily concentrated at the tail.

Read as a governance document, this is the sharpest finding in the report. The most productive AI users are adding code at 46x the median developer's volume, which lets them reshape a codebase's architecture far faster than review, documentation, onboarding, and informal team knowledge can keep up. A power user can refactor a subsystem, introduce a dependency, or establish a pattern in an afternoon that the rest of the team discovers weeks later.

Asymmetric implementation velocity is asymmetric architectural influence. When one developer with agents can out-produce a team, the architectural rules that hold the system together cannot live in that team's collective memory or in a reviewer's vigilance. They have to be machine-readable and machine-enforceable, so they apply at the power user's velocity rather than the team's. That is the core argument for architectural governance: encoding the rules of the system so they bind every contributor, human or agent, at any speed.

At 46x output, informal architecture stops scaling. Rules that depend on a human noticing in review cannot keep pace with a developer who reshapes the codebase faster than the team can read the diffs. Machine-enforceable intent is the only thing that scales with the tail.

The missing layer: architectural governance infrastructure

Put the five themes together and the report describes a single transition: AI is becoming infrastructure for execution. Cursor, Claude Code, Copilot, and Devin all increase execution capacity -- they make it cheaper to generate, edit, and ship code. That capacity is real, measured, and accelerating.

What the velocity curve does not include is a layer that preserves architectural intent across all that execution. That is the layer Mneme occupies. It is not memory, not RAG, not PR review. It is a repo-native governance infrastructure layer that compiles architectural decisions -- the ADRs and constraints a team has already agreed on -- into machine-evaluable rules that agents can retrieve, respect, and be checked against.

The division of labor is clean. Execution tools own the velocity curve: more code, larger PRs, deeper sessions, more automatic commits. A governance layer owns the governance curve: the same architectural intent, enforced deterministically before generation and across every automated surface, regardless of which model or agent did the work. Because the enforcement is deterministic and model-agnostic, it does not erode as agents get faster or as the tool mix changes underneath it.

This is also where the layer meets the tools developers already use. Governance that runs at the hook level reaches the agent before generation; governance that runs in CI catches what slips through. Mneme is designed to sit at both -- alongside execution in the Claude Code integration and as a deterministic gate in GitHub Actions -- so the same constraints apply from the first prompt to the merge.

Cursor proves the velocity curve; the governance curve is the open problem. The report makes the case that AI is now SDLC infrastructure. The unanswered half is the infrastructure that keeps architectural intent intact while that execution scales -- and that is the layer worth building toward.

Source: The Cursor Developer Habits Report

Originally published at mnemehq.com. Mneme HQ is open-source architectural governance that enforces decisions at the point of authorship -- view it on GitHub.

Microsoft's Agentic Transformation Playbook Shows Why AI Agent Governance Is Now Infrastructure

Theo Valmis — Wed, 03 Jun 2026 18:39:01 +0000

Microsoft's Agentic Transformation Patterns Playbook is a useful signal because it does not treat AI agents as another productivity tool. It frames agentic AI as an enterprise operating-model shift: agents are moving from assisting humans to executing work across processes, systems, and teams. The implication for software teams is sharper than it looks -- coding agents are on the same trajectory, and architectural governance becomes part of the infrastructure stack the moment agents start executing.

Microsoft's playbook describes six transformation patterns and emphasizes that each pattern requires different ownership, governance, and operating discipline. That is the move worth paying attention to. It reframes agentic AI from a model question into an enterprise operating-model question.

That shift matters for software teams because coding agents are following the same path. They are moving from autocomplete to execution. Once agents edit files, open PRs, modify infrastructure, or coordinate multi-step changes, architectural governance becomes infrastructure.

What is Microsoft's Agentic Transformation Playbook?

Microsoft's playbook is a practical guide for choosing, scaling, and operating AI agents across the enterprise. Public summaries describe it as a 52-slide guide covering six transformation patterns, from employee productivity to core business processes and customer-facing agents. The throughline is that agents are not a single category -- they are a family of patterns with different ownership models, different risk surfaces, and different requirements for governance.

That framing matters because it cuts against the dominant adoption narrative. Most enterprises are still treating AI as a per-team productivity story: this team gets Copilot, that team gets an internal assistant, another team is piloting an agent for support tickets. Microsoft is arguing that the pattern of deployment determines the operating discipline required, and that ad-hoc deployment does not scale into core processes.

The important shift is Assist -> Execute

The meaningful distinction in this playbook is not chatbot vs agent. It is assist vs execute.

Assistive AI supports human work. Agentic AI increasingly performs work. That changes the governance requirement because the agent is no longer merely producing text. It may trigger workflows, access systems, make changes, and coordinate actions across processes that previously required a human signature, a code review, or a change ticket.

An assistant that drafts a paragraph and an agent that opens a pull request, modifies infrastructure, or updates a customer record are not the same risk surface. They look similar from a UX perspective. They are not similar from a governance perspective.

Why agentic transformation becomes an operating-model problem

One of the more useful arguments in the Microsoft material is that the six patterns are design choices, not maturity stages. An enterprise does not graduate from employee-productivity agents to core-process agents. It runs them in parallel, each with its own ownership, escalation, and release discipline.

The governance implication is that enterprises cannot manage every agent with the same lightweight checklist. A productivity assistant in marketing and a coding agent that edits production services need different boundaries, different release gates, and different escalation paths.

The failure mode is not that enterprises lack AI agents. It is that every team starts deploying agents with different assumptions about what agents are allowed to know, change, approve, and escalate.

AI agent governance cannot stay trapped in policy documents

Most organizations already have governance language. They have architecture principles, security standards, ADRs, platform rules, review processes, and release criteria. The problem is not absence of intent -- it is that agents do not reliably inherit that intent unless it is made available as an enforceable part of the workflow.

Policy documents work for humans because humans interpret them inside an institutional culture. They have weak gravitational pull on a model that has never read them, will not retain them across sessions, and is optimizing for the prompt in front of it. Governance that lives only in PDFs becomes invisible the moment work is delegated to a non-human executor.

For software teams, architectural governance is the missing agent infrastructure layer

A coding agent that only receives a task prompt has no durable understanding of the architecture it is operating inside. It may generate correct code that violates local decisions: bypassing a service boundary, introducing a forbidden dependency, duplicating an integration pattern, or placing logic in the wrong layer.

That is why AI coding governance has to move earlier than PR review. Review remains necessary, but it is too late to be the first place architectural intent appears. By the time a non-compliant PR exists, the agent has already done the work, the deviation is already in the diff, and the cost of correction is paid by a human reviewer.

The architectural governance layer answers a question that retrieval and prompting cannot: is this change allowed, given everything this codebase has already decided? That is a binary enforcement question, not a recall question.

From AI agent governance framework to governance infrastructure

Frameworks define what good governance looks like. Infrastructure makes governance executable.

Layer	What it produces	How it fails
Policy document	Stated intent, audit trail	Agents never read it; humans drift
Governance framework	Roles, principles, controls	Compliance theater without enforcement
Governance infrastructure	Checks, constraints, CI gates, context packets	When it leaves an execution surface uncovered

A governance framework says agents need ownership, monitoring, approvals, and release gates. Governance infrastructure turns those rules into checks, constraints, context packets, CI gates, and workflow-level boundaries. The first is necessary. The second is what scales.

What Microsoft's playbook means for engineering leaders

The practical reading for an engineering leader looking at this material:

Treat coding agents as execution systems, not just developer tools. The risk surface is closer to a deploy pipeline than to a code editor.
Move architectural intent closer to generation time. Decisions that exist only in ADRs and Slack threads do not constrain agent output.
Convert ADRs and platform rules into enforceable constraints. If a rule matters, it should produce a verdict, not just a paragraph.
Separate agent experimentation from production governance. Sandboxes can be permissive. Production should not be.
Build governance that travels with the repo, not only with the policy team. The same compiled constraints should reach every agent, IDE, hook, and CI surface -- not depend on which agent happened to run.
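
To make "a verdict, not just a paragraph" concrete, here is a minimal sketch of a layering rule expressed as a function that returns a verdict. The layer names, the `ALLOWED` table, and the `Verdict` shape are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical layering: each layer may import only from the layers listed.
ALLOWED = {"ui": {"services"}, "services": {"repositories"}, "repositories": set()}

@dataclass(frozen=True)
class Verdict:
    allowed: bool
    rule: str
    detail: str

def layer_of(module):
    """First segment of a dotted module name, e.g. 'ui.checkout' -> 'ui'."""
    return module.split(".")[0]

def check_import(importer, imported):
    """Decide whether `importer` may depend on `imported` under the layering rule."""
    src, dst = layer_of(importer), layer_of(imported)
    if src == dst or dst in ALLOWED.get(src, set()):
        return Verdict(True, "layer-boundary", f"{src} -> {dst} is permitted")
    return Verdict(False, "layer-boundary", f"{src} must not import from {dst}")

print(check_import("ui.checkout", "repositories.orders"))
print(check_import("ui.checkout", "services.cart"))
```

The same rule written as a sentence in an ADR relies on someone remembering it; written this way, any agent, hook, or CI job can call it and get the same answer.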

The pattern Microsoft is naming, restated for software

Microsoft's adoption maturity material makes a related point worth quoting in spirit: agents create value when they operate within well-designed processes, and layering agents onto existing workflows without redesign can fail to improve end-to-end outcomes. The lift is not from the agent. It is from the operating model around the agent.

For software, the operating model around the agent is architectural governance. It is the thing that says: this is the boundary, these are the constraints, this is the release gate, this is what counts as a violation, this is what gets escalated. Without that layer, faster agents simply produce architectural drift faster.

Conclusion

Microsoft's Agentic Transformation Playbook makes the enterprise point clear: scaling agents is not only about better models or more pilots. It is about operating discipline.

For software teams, that discipline needs a technical layer. As AI agents begin executing engineering work, architectural governance becomes part of the infrastructure stack -- not a policy artifact, not a review-time backstop, but a layer that the agent must pass through to act.

Agents do not need more memory. They need an enforceable operating model.

Agent Runtime Governance: The Next AI Infrastructure Layer

Theo Valmis — Wed, 03 Jun 2026 18:38:32 +0000

Google's Managed Agents announcement is one of the clearest signals yet that the AI industry is moving beyond stateless tool calling toward persistent execution environments and long-running agent systems. That shift expands what models can do. It also expands the governance surface -- from prompt and PR review into the runtime itself.

We spent two years building brains in jars

For most of the current AI cycle, the system around the model has been thin. Models could reason, propose commands, and orchestrate small tool calls. But they ran in short sessions, against narrow APIs, under human supervision, with ephemeral state. The model was a brain; the body was a few HTTP requests and a JSON tool schema.

That assumption is ending. The frontier is not just better reasoning. It is a body for the brain.

The brain finally has a body. Now it needs governance.

The runtime layer for AI agents is arriving

Google Managed Agents (and the parallel motion across the ecosystem -- OpenAI's containerized execution work, Claude Code's persistent sessions, MCP-based tool ecosystems, hosted agent harnesses) formalizes the runtime as a product:

Sandboxed execution
Persistent state across sessions
Orchestration loops
Infrastructure-native agents
Agent-as-a-service lifecycle
Long-running sessions
Mid-session tool injection
Managed runtime lifecycle

This resembles the transition from scripts -> applications -> cloud platforms. Agents are no longer just calling tools. They are beginning to inhabit programmable environments.

Why persistent agent systems change governance

Once agents can continuously modify filesystems, maintain state across sessions, autonomously remediate, inject tools dynamically, operate against production systems, and coordinate across workflows, governance failures stop being one-off review misses. They compound over time.

What that compounding looks like:

Architectural drift -- small deviations accumulate across long-running sessions
Policy propagation failures -- constraints applied in one tool not enforced in the next
Runtime state divergence -- the world the agent believes it's acting in stops matching production
Autonomous violation loops -- a remediation that itself violates an invariant runs again on the next tick
Inconsistent remediation behavior -- same condition, different fix, no audit of why
Invisible constraint decay -- rules that no longer hold in practice but are never re-checked
Provenance loss across execution chains -- nobody can reconstruct why the system did what it did

Architectural governance becomes an execution-time systems concern, not a review-time coding concern.

Execution environments expand the governance surface

The surface that needs governance is no longer "a diff before merge." It is everything an agent can touch while it runs:

Filesystem mutations
Terminal execution
Deployment actions
Runtime state
Orchestration loops
Remediation chains
Branch and PR generation
Operational metadata
Tool injection
Infrastructure APIs

Every one of those is an execution surface that can carry, or fail to carry, architectural intent. The point of governance propagation is that the same compiled constraints reach all of them -- or the layer is not doing its job.

Why PR review governance stops scaling

Traditional governance assumes a human reviews generated artifacts after execution. That worked when generation was human-paced.

Long-running agents generate continuously:

Branches
Commits
Remediation loops
Infrastructure changes
Deployment actions
Operational metadata
Runtime state mutations

Pushing all of that into PR review turns the review queue into downstream damage control. The agent has already acted. Whatever drifted has already drifted. Review can document it; review cannot prevent it.

Persistent agent runtimes break review-based governance models.

The implication is that governance has to move where the execution is -- before generation, during the run, and at every tool boundary the runtime exposes.

Runtime governance and architectural invariants

The right primitive for this is the invariant: a constraint that must hold continuously across the agent's execution, not just be true at one merge point.

Examples of runtime invariants:

Forbidden dependencies never enter the workspace, even mid-session
Deployment restrictions apply to every action the agent takes against production
Architectural boundaries hold across files the agent visits hours apart
Data access policies are enforced for every query, not just at code review
Remediation constraints prevent the agent from "fixing" a problem by violating another rule
Execution scopes bound what the agent is even allowed to attempt

These are the runtime equivalent of an ADR: a rule the system enforces, not a paragraph the human remembers. They compose with verification contracts -- predefined checks that prove the invariant held across the run.
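
A small sketch of what "must hold continuously" can mean in practice: every proposed action is applied to a copy of the session state, and the change is kept only if all invariants still hold. The state shape, the action fields, and the two invariants are invented for illustration.

```python
# Hypothetical invariants, each a predicate over the whole session state.
def no_forbidden_dependencies(state):
    return not (set(state["dependencies"]) & {"left-pad"})

def production_is_read_only(state):
    return not any(a["target"] == "production" and a["kind"] == "write"
                   for a in state["actions"])

INVARIANTS = {
    "no-forbidden-dependencies": no_forbidden_dependencies,
    "production-read-only": production_is_read_only,
}

def apply_action(state, action):
    """Apply an action to a copy of the state; keep it only if every invariant holds."""
    candidate = {
        "dependencies": state["dependencies"] + action.get("adds_dependencies", []),
        "actions": state["actions"] + [action],
    }
    broken = sorted(name for name, holds in INVARIANTS.items() if not holds(candidate))
    if broken:
        return state, broken      # rejected: the original state is returned unchanged
    return candidate, []

state = {"dependencies": [], "actions": []}
state, broken1 = apply_action(state, {"kind": "write", "target": "workspace",
                                      "adds_dependencies": ["zod"]})
state, broken2 = apply_action(state, {"kind": "write", "target": "workspace",
                                      "adds_dependencies": ["left-pad"]})
state, broken3 = apply_action(state, {"kind": "write", "target": "production"})
print(broken1, broken2, broken3)
```

Checking the candidate state rather than the single action is what lets an invariant span actions taken hours apart in the same session.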

The emerging AI infrastructure stack

The shape that is starting to settle:

Layer	Job
Model layer	Reasoning and generation
Runtime layer	Execution environments, orchestration, persistence
Tool layer	APIs, MCP, integrations, external systems
Governance layer	Architectural invariants, provenance, policy propagation
Verification layer	Runtime validation, enforcement traces, constraint evaluation

The governance and verification layers used to sit downstream of model and runtime, applied at the PR or the deploy. In a persistent-agent world, they have to sit inside the loop -- reachable from every tool call, every orchestration step, every remediation tick.
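
One way to put a check "inside the loop" is to wrap each tool so the check runs before the tool does. The decorator below is a sketch under stated assumptions: `governed`, `GovernanceError`, the protected path prefixes, and the in-memory `WRITES` store are all hypothetical stand-ins.

```python
import functools

class GovernanceError(Exception):
    """Raised when a tool call is blocked before it runs."""

# Hypothetical policy: path prefixes the agent may not write to.
PROTECTED_PREFIXES = ("infra/prod/", ".github/workflows/")

def governed(check):
    """Wrap a tool so `check` runs on its arguments before the tool executes."""
    def decorator(tool):
        @functools.wraps(tool)
        def wrapper(*args, **kwargs):
            problem = check(*args, **kwargs)
            if problem:
                raise GovernanceError(f"{tool.__name__}: {problem}")
            return tool(*args, **kwargs)
        return wrapper
    return decorator

def path_is_writable(path, content):
    """Return a problem description, or None when the write is acceptable."""
    if path.startswith(PROTECTED_PREFIXES):
        return f"writes to {path} are not permitted"
    return None

WRITES = {}  # stand-in for a real filesystem

@governed(path_is_writable)
def write_file(path, content):
    WRITES[path] = content
    return len(content)

print(write_file("src/app.py", "x = 1\n"))
```

A blocked call raises before the tool body runs, so nothing is written and the runtime can surface the reason to the agent or to a human.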

Execution environments need verification layers

Persistent agents introduce continuity, memory, authority, and compounding execution. Those properties are the source of the capability gains. They are also the source of the new failure mode.

Continuity without invariants creates drift. Memory without provenance creates plausible but ungrounded decisions. Authority without verification creates silent state divergence. Compounding execution without enforcement traces creates incidents nobody can reconstruct.
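
An enforcement trace only helps reconstruction if it can be trusted after the fact. One common way to get that property, offered here as an assumption rather than anything the runtimes above prescribe, is to chain entries by hash so that editing an earlier entry is detectable. The event fields are invented.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry's predecessor

def _digest(prev, event):
    payload = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def append_entry(trace, event):
    """Append an event whose hash covers the event and the previous entry's hash."""
    prev = trace[-1]["hash"] if trace else GENESIS
    trace.append({"prev": prev, "event": event, "hash": _digest(prev, event)})
    return trace

def verify_trace(trace):
    """True if each entry's hash matches its contents and links to the previous entry."""
    prev = GENESIS
    for entry in trace:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True

trace = []
append_entry(trace, {"step": 1, "tool": "write_file", "verdict": "block", "rule": "R-1"})
append_entry(trace, {"step": 2, "tool": "write_file", "verdict": "allow", "rule": None})
print(verify_trace(trace))
```

Changing the recorded verdict of step 1 after the fact would alter its hash and break the link from step 2, so `verify_trace` would return False.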

Persistent agent runtimes transform governance from a review-time concern into a runtime systems problem.

Conclusion: the next AI infrastructure battle

The industry solved how agents execute. The next problem is ensuring they continue executing within architectural intent over time.

The first generation of AI systems optimized reasoning. The next generation is optimizing execution. The generation after that will optimize governance across persistent execution environments -- runtime governance, runtime invariants, deterministic enforcement, and provenance that survives across long-running agent workflows.

The next AI infrastructure layer is not more reasoning. It is invariant preservation across execution surfaces. For the conceptual definition, see runtime governance.

Originally published at mnemehq.com. Mneme HQ is open-source architectural governance that enforces decisions at the point of authorship -- view it on GitHub.

What the AI Peer Review Study Reveals About Context Loss and Governance

Theo Valmis — Wed, 03 Jun 2026 18:38:06 +0000

A new AI peer review study found GPT-5.2 outperforming the top-rated human reviewer on Nature-family papers across a composite quality metric. The headline is the easy story. The harder story is in the breakdown: AI reviewers were still less factually correct than the top-rated human, and one of the recurring weaknesses was long-context management across multiple files. The real lesson for enterprise AI is not replacement -- it is context loss, verification, and governance around high-confidence outputs.

AI peer review crossed an important threshold

The paper On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists is a careful piece of evaluation. 45 domain scientists spent 469 hours rating 2,960 review criticisms from human and AI reviewers across 82 Nature-family papers. Each criticism was judged on three dimensions: correctness, significance, and sufficiency of evidence.

The crucial design choice is that the researchers did not evaluate whether AI predicted paper acceptance or matched reviewer scores. They evaluated the actual review criticisms themselves: were they correct, significant, and supported by enough evidence?

That matters because enterprise AI has the same problem. We do not only need to know whether an AI output sounds right. We need to know whether each claim is valid, grounded, and operationally useful.

The important shift is not that AI can now write peer reviews. It is that AI-generated criticisms are becoming good enough to influence expert judgment.

The result is impressive, but the aggregate hides the risk

On the composite fully-positive metric -- the share of criticisms rated correct, significant, and well-evidenced -- GPT-5.2 scored 60.0%, above the top-rated human reviewer at 48.2%. Claude Opus 4.5 and Gemini 3.0 Pro exceeded the lowest-rated human reviewer across every dimension. Where AI criticisms were accurate, they were often more significant and better-evidenced than human ones.

That is the headline number, and it is real.

The aggregate hides what matters most for governance: on factual correctness specifically, AI reviewers were still less correct than the top-rated human reviewer. The composite favoured the model; the per-dimension breakdown did not.

Dimension	What it measures	How AI reviewers compare
Correctness	Is the criticism factually right?	Below top-rated human reviewer
Significance	Does it matter for the paper?	Competitive when correct
Sufficiency of evidence	Is it grounded in the source?	Competitive when correct
Long-context management	Holding state across multiple files	Named as a recurring weakness

The pattern is consistent: AI reviewers are useful when correct, and confidently wrong when context drops out.

The better AI becomes at producing high-value criticism, the more expensive its grounding failures become.

The PM2.5 example is the enterprise failure mode

The cleanest illustration in the paper: Claude Opus 4.5 criticised a paper for missing a PM2.5 calibration procedure that was already described in the methods section.

That is not a dumb-model failure. The class of criticism -- "your calibration is not documented" -- is exactly the kind of thing a serious reviewer should raise. The failure was not capability. It was context management: the model produced high-confidence criticism that contradicted information already present in the source it was reviewing.

That maps almost one-to-one onto enterprise AI failure modes:

False-positive PR reviews flagging code that already complies
Duplicate architectural objections raised against decisions already documented in an ADR
Stale policy enforcement based on superseded guidance
Agents recommending patterns that violate constraints living outside their active context
Assistants criticising decisions already approved elsewhere in the repo
Multi-file workflows losing source provenance for the claims they generate

In every case, the model is capable. The workflow is not governed.

This is not only a model capability problem

The paper is careful to position current AI reviewers as complements, not substitutes, for human reviewers. The authors identify recurring weaknesses: limited subfield knowledge, lack of long-context management over multiple files, and overly critical treatment of minor issues.

That last one is worth dwelling on. An AI reviewer that flags too many low-significance issues, with confidence, is not a neutral tool. It shifts cost onto whoever has to triage the output. The same dynamic shows up in software: an AI agent that produces twenty plausible-looking PR comments creates a queue, not a signal.

Translated into enterprise language: the bottleneck is moving from can the model produce useful analysis? to can the system verify whether that analysis is grounded in the right context?

That requires a different layer than "a better model." It requires:

Source-aware context tracking -- what the model actually read versus what it should have
Provenance for claims -- every criticism traceable to the artifact it references
Verification loops before outputs are trusted -- check the claim against the source before surfacing it
A distinction between valid and already-addressed criticism -- deduplication against existing decisions
Policy and decision memory that survives across tools, agents, and files
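
A minimal sketch of the verification-loop idea, applied to the "X is missing" class of criticism from the PM2.5 example. Everything here is an assumption made for illustration: the `Criticism` shape, the function name, and the section text, which is invented toy data and not a quote from any paper. The check is a naive case-insensitive substring search; a real system would need something stronger.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criticism:
    """A reviewer claim that some topic is absent from the source (toy shape)."""
    text: str
    missing_topic: str  # the term the reviewer says is not covered


def verify_missing_claim(criticism: Criticism, sections: dict) -> dict:
    """Check a 'this is missing' claim against the source before surfacing it.

    `sections` maps section name -> section text. If the topic appears
    anywhere, the claim is marked already-addressed and the verdict carries
    provenance: the name of the section that contradicts it.
    """
    topic = criticism.missing_topic.lower()
    for name, body in sections.items():
        if topic in body.lower():
            return {"verdict": "already-addressed", "source": name}
    # Not found by this naive search: the claim is not contradicted,
    # which is weaker than saying it is confirmed.
    return {"verdict": "not-contradicted", "source": None}


# Invented example source, for illustration only.
paper = {
    "Introduction": "We study urban air quality over two years.",
    "Methods": "Each sensor followed a calibration procedure against a reference monitor.",
}
claim = Criticism("The calibration procedure is not documented.", "calibration procedure")
print(verify_missing_claim(claim, paper))
# {'verdict': 'already-addressed', 'source': 'Methods'}
```

The useful part is the shape of the output: a verdict plus the artifact that supports it, so an already-addressed criticism can be dropped or downgraded before a human has to triage it.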

Peer review is a preview of AI-assisted software governance

Scientific peer review and AI-assisted development share the same structural problem.

Both involve expert judgment over complex artifacts.
Both depend on context spread across many files.
Both require distinguishing real issues from already-addressed issues.
Both become risky when AI outputs are treated as conclusions rather than claims requiring verification.

In software teams, this shows up when AI agents review or generate code without preserving the architectural decisions that should constrain the work.

The same failure mode appears in AI-assisted development. A coding agent can identify a real architectural concern, but apply it to the wrong part of the system. It can flag a missing guardrail that already exists. It can recommend a pattern that violates an ADR because the relevant decision was outside its active context. The model may be capable. The workflow is not governed.

High-confidence output without grounded context is not a model problem. It is an infrastructure problem.

Governance before generation, not review after damage

If AI systems are going to generate, review, and coordinate technical work, they need access to the decisions that define what good looks like before they act, not only after a human reviewer catches the mistake.

In software, that looks like:

Encoding architectural decisions as enforceable constraints, not just documents
Retrieving the relevant decisions before generation or review begins
Validating outputs against repo-native governance
Exposing drift before it reaches the PR queue or production
Making architectural context durable across agent sessions and tools

This is the broader pattern the peer-review study is pointing at, restated for software. Better models do not remove the need for a verification layer. They raise the stakes of not having one.
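
As an illustration of "retrieving the relevant decisions before generation," here is a small deterministic retriever that scores on keyword overlap, tag match, and priority. The decision records, the weights, and the stopword list are all assumptions invented for this sketch; it is not a description of how any particular tool implements retrieval.

```python
import re

# Hypothetical decision records; ids, text, tags and priorities are made up.
DECISIONS = [
    {"id": "ADR-001", "text": "Use PostgreSQL for all persistent storage",
     "tags": ["database"], "priority": 2},
    {"id": "ADR-002", "text": "Do not add new state management libraries",
     "tags": ["frontend", "state"], "priority": 3},
    {"id": "ADR-003", "text": "All HTTP handlers live in the api package",
     "tags": ["api"], "priority": 1},
]

STOPWORDS = frozenset({"a", "the", "to", "in", "for", "all", "do", "not", "of"})


def tokens(text: str) -> set:
    """Lowercase word set with common stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve(task: str, decisions, top_k: int = 2) -> list:
    """Return ids of the decisions most relevant to the task.

    Score = keyword overlap + 2 * tag matches + priority (arbitrary weights).
    Decisions with no overlap at all are skipped. Ties are broken by id, so
    the same corpus and the same task always give the same order.
    """
    task_tokens = tokens(task)
    scored = []
    for decision in decisions:
        overlap = len(task_tokens & tokens(decision["text"]))
        tag_hits = len(task_tokens & {tag.lower() for tag in decision["tags"]})
        if overlap + tag_hits == 0:
            continue
        scored.append((overlap + 2 * tag_hits + decision["priority"], decision["id"]))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [decision_id for _, decision_id in scored[:top_k]]


task = "Add a state management library to the frontend dashboard"
print(retrieve(task, DECISIONS))  # ['ADR-002']
```

The selected decisions would then be placed in front of the model before it generates or reviews anything, rather than being discovered by a human afterwards.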

The future is not AI judgment alone -- it is verified AI judgment

The peer-review study should not be read as a simple replacement story. It is a warning that AI judgment is becoming useful enough to require infrastructure around it.

The next question is not whether AI can produce expert-level criticism. Increasingly, it can. The harder question is whether organisations can verify, preserve, and enforce the context that makes that criticism trustworthy.

As AI moves from assistance to review, approval, and autonomous execution, the governance question changes: how do you verify high-confidence outputs before they become operational decisions?

Originally published at mnemehq.com. Mneme HQ is open-source architectural governance that enforces decisions at the point of authorship -- view it on GitHub.

The Emerging AI Engineering Control Plane: What Anthropic's Claude Marketplace Reveals About the Post-Copilot Stack

Theo Valmis — Wed, 03 Jun 2026 18:37:38 +0000

Anthropic's Claude Marketplace launch is interesting less for the marketplace itself than for the composition of the vendors it surfaces. The launch lineup -- Augment Code, bolt.new, CodeRabbit, Hebbia, Legora -- reads like a layered diagram of the AI engineering stack: generation environments, repo memory, orchestration, verification, workflow coordination. The post-Copilot era is fragmenting into specialized infrastructure. Architectural governance is the layer not yet named.

The marketplace is a signal, not the story

The marketplace mechanics -- apply existing Anthropic spend commitment toward Claude-powered partner products -- matter for procurement. The vendor list matters for category structure. The five names announced map almost cleanly to distinct operational layers:

Vendor	Operational layer it validates
CodeRabbit	PR-stage review and verification
Augment Code	Repository memory and context
bolt.new	AI-native execution environments
Hebbia	Knowledge orchestration and workflows
Legora	Operational workflow coordination

These are not overlapping products. They are infrastructural layers. The shape that emerges when you stack them is closer to a control plane than to a tool catalog.

The first wave was monolithic

Copilot-era tooling assumed a single agent, a single developer, a single session. The mental model was an AI pair programmer sitting beside one human. Most of the assumptions baked into that wave still show up in today's products:

Single-agent execution
Prompt-centric workflows
Per-session context
Suggestion-then-accept interaction

That model breaks the moment work scales out: multi-agent systems, autonomous execution, long-running workflows, organizational scale. The vendor categories now appearing in the Claude Marketplace are exactly what teams have been hand-building to compensate -- review systems, repo-memory layers, sandboxed runtimes, orchestration. The market is productizing the missing layers.

The stack is fragmenting into governance surfaces

The useful frame for what comes next is the governance surface: any boundary where architectural intent has to survive an autonomous handoff. Once agents are doing the work, each of these is a place drift can enter:

Generation
Retrieval
Branch naming and PR metadata
CI pipelines
Deployment artifacts
Runtime execution
Review systems

Architectural drift propagates across all of them. Solving it at one surface and ignoring the rest is how teams end up with code that passes review, runs in production, and still violates the architecture nobody was checking against.

The shift is from "AI in the IDE" to a multi-layer control plane. Each layer needs its own infrastructure. None of them, alone, is the whole job.

Verification alone is not enough

CodeRabbit-style review systems are doing real, valuable work. They scale review throughput in a regime where generation throughput has already outpaced human reading. They are increasingly necessary.

They are also fundamentally post-generation.

By the time a review-stage system sees the change, the agent has already made the architectural choice. The reviewer can flag it, push back, demand a rewrite. What it cannot do is prevent the choice from being made in the first place. As autonomous development scales the volume of generated code, pushing all architectural verification into review turns the queue into incident response.

Review systems scale review. They do not preserve architectural intent upstream. That is a different layer with a different job.

The missing layer is governance before generation: invariant preservation, deterministic enforcement, verification contracts that run before the agent acts -- not after the diff is on the screen.

Why memory systems fail as governance

Repo-memory and context infrastructure -- the layer Augment Code is validating -- is also real, useful work. The agent that has the whole codebase indexed makes fewer obvious mistakes than the one that does not. But memory systems and governance systems solve different problems.

Memory systems	Governance systems
Optimize recall	Optimize invariants
Probabilistic retrieval	Deterministic verdicts
Best-effort ranking	Precedence semantics
"Did the agent see it?"	"Was the agent prevented from violating it?"
Information availability	Constraint enforcement

Context-window dilution, ranking instability, and conflicting decisions are real properties of retrieval pipelines. They are not properties governance can tolerate. RAG fails for architectural governance not because retrieval is broken but because retrieval is the wrong primitive for a binary enforcement question.
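
The difference between best-effort ranking and precedence semantics can be sketched in a few lines. In this hypothetical example, two rules disagree about the same package and the verdict is decided by explicit precedence rather than by which rule happened to rank higher in a retrieval pass. The `Rule` shape, the tie-breaking order (deny beats allow, then rule id), and the default-allow behaviour are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str
    package: str
    action: str      # "allow" or "deny"
    precedence: int  # higher number wins


def verdict(package: str, rules) -> tuple:
    """Return (action, deciding_rule_id) for a package.

    The result depends only on the rule set, never on input order:
    highest precedence wins, then deny beats allow, then lowest rule_id.
    With no matching rule the sketch defaults to allow.
    """
    matching = [rule for rule in rules if rule.package == package]
    if not matching:
        return ("allow", None)
    winner = min(
        matching,
        key=lambda rule: (-rule.precedence, rule.action != "deny", rule.rule_id),
    )
    return (winner.action, winner.rule_id)


rules = [
    Rule("team-guideline-3", "redux", "allow", precedence=1),
    Rule("ADR-014", "redux", "deny", precedence=5),
]
print(verdict("redux", rules))                  # ('deny', 'ADR-014')
print(verdict("redux", list(reversed(rules))))  # ('deny', 'ADR-014')
```

A retrieval pipeline can surface both rules and leave the model to pick one. A governance check has to return one answer and name the rule that produced it.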

The emerging AI engineering control plane

The shape that is settling into place is a layered control plane, much like the ones cloud and CI/CD developed before it:

The control plane stacks six layers:

(01) Generation: Claude, GPT, Gemini, Mistral, the model layer
(02) Execution environments: bolt.new, sandboxes, IDE agents, persistent runtimes
(03) Memory & context: Augment Code, codebase indexes, retrieval pipelines
(04) Orchestration: Hebbia, Legora, multi-agent workflows, knowledge coordination
(05) Governance: architectural invariants, deterministic constraints, verification contracts, provenance, the layer the marketplace does not yet name
(06) Verification & review: CodeRabbit, post-generation checks, observability

The Claude Marketplace announcement names layers 1, 2, 3, 4, and 6. Layer 5 is what sits between them -- the place that says "these are the architectural rules; every generation, every tool call, every CI run has to clear them." That layer is not yet productized at the marketplace level. It is the next category.

Conclusion: the industrialization of AI-assisted development

The important trend is not better coding models. It is the industrialization of AI-assisted software development into specialized operational infrastructure. The first wave was a pair programmer. The second is an engineering organization's worth of infrastructure, decomposed into layers, each with its own vendor category and operational discipline.

The next phase of the market is not better autocomplete. It is coordination, governance, and architectural integrity at agent scale. The Claude Marketplace is one of the clearer signals that the stack has started to look this way for real.

What's missing from the marketplace today is the layer that says no. Generation, memory, orchestration, and review are all about producing and inspecting output. Governance is about constraining what the system is allowed to do in the first place. That is the category Mneme is built around.

Originally published at mnemehq.com. Mneme HQ is open-source architectural governance that enforces decisions at the point of authorship -- view it on GitHub.

The Acceleration Whiplash and the Governance Gap

Theo Valmis — Wed, 03 Jun 2026 18:34:34 +0000

The Faros AI Engineering Report 2026 is not a survey of developer sentiment. It is two years of telemetry from 22,000 developers across 4,000+ teams, measuring what AI adoption actually produces downstream. The findings have a name: the Acceleration Whiplash. The structural explanation has one too.

What the telemetry actually shows

The output numbers in the Faros report are real and worth stating plainly. Epics completed per developer are up 66.2%. Task throughput per developer is up 33.7%. PR merge rate per developer is up 16.2%. These represent genuine delivery acceleration, and dismissing them would be dishonest. AI coding tools are producing real productivity gains at the business level.

The production quality numbers are also real:

Metric	Change
Incidents per PR under high AI adoption	+242.7%
Median time in code review	+441.5%
Code churn (lines deleted to lines added)	+861%
PRs merged with no review at all	31.3%

Source: Faros AI Engineering Report 2026: The Acceleration Whiplash. Telemetry from 22,000 developers across 4,000+ teams. Figures represent metric change from lowest to highest AI adoption periods within each organization.
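
Percent changes this large are easy to misread, so a quick conversion helps: a change of +p% means the metric is (1 + p/100) times its baseline, not p/100 times. The helper below is only arithmetic on the figures in the table; the function name is made up for the example. The 31.3% row is a share of PRs, not a change, so it is left out.

```python
def multiplier(percent_change: float) -> float:
    """Convert a percent change into a before/after multiplier."""
    return round(1 + percent_change / 100, 3)


# Percent changes as reported in the table above.
changes = {
    "incidents per PR": 242.7,
    "median time in code review": 441.5,
    "code churn": 861.0,
}
for metric, pct in changes.items():
    print(f"{metric}: +{pct}% is about {multiplier(pct)}x the baseline")
```

Read this way, a +441.5% change in median review time means reviews take roughly 5.4 times as long as before, and +861% churn is roughly 9.6 times the baseline.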

Both sets of numbers are true simultaneously. That is the whiplash. Throughput accelerated. The downstream systems built to validate that throughput did not. Plotted together, generation throughput rises steeply while control capacity stays nearly flat -- and the gap between the two curves is the governance debt.

Why the systems did not scale

Code review, incident response, and architectural validation were all designed for a world where development velocity was human-paced. A senior engineer could review the meaningful PRs in a sprint. An incident postmortem could trace a failure to a specific change and a specific decision gap. Architectural drift was visible because it moved slowly enough to catch.

AI-generated code broke these assumptions quietly. Not because the code was obviously bad, but because it was often superficially convincing. The Faros report captures this in its description of the senior engineer tax: AI-generated code is idiomatic, well-named, and stylistically consistent with the surrounding codebase. The failures are structural, beneath the surface, requiring the reviewer to reason about intent rather than scan for errors. That is expensive cognitive work. The 441.5% increase in median review time is the cost of doing it at volume.

The 31.3% of PRs merging with no review at all is the cost of not doing it. Reviewers cannot keep pace. The queue backs up. Code ships unexamined. The incident rate rises.

The most important line in the Faros report: "the ability to push quality back to where it belongs, at the point of authorship, before the code ever reaches review." This is not a suggestion. It is the structural conclusion the telemetry points toward.

The governance gap

There is a name for the structural mismatch the Faros data is measuring: the governance gap.

The governance gap is the distance between where AI generates code and where the systems designed to validate it operate. AI generates at the beginning of the workflow. Review operates near the end. Testing and incident detection operate after deployment. As generation speed increases, this gap widens. Code enters the pipeline faster, and the downstream systems have less time and less capacity to catch what should not have been generated in the first place.

This is not a model quality problem. Better AI code generation does not close the governance gap. It can narrow the surface area of obvious errors, but it does not enforce architectural invariants, resolve conflicting decisions, or prevent drift from accumulating across the codebase over time. Those are not generation problems. They are structural problems that require structural solutions.

Review and memory are insufficient as scaling primitives

The two most common responses to the governance gap are harder review and richer context injection. Both are real interventions. Neither is a scaling primitive for the problem the Faros data describes.

Harder review is what the +441.5% median review time represents. Engineering teams did not loosen their standards when AI adoption increased. They tried to maintain them. The cost was reviewer time, and the outcome was still 31.3% of PRs merging unreviewed and monthly incidents up 57.9%. Review can only absorb so much volume before the queue overwhelms it.

Context injection, pasting architectural rules into CLAUDE.md or injecting ADR documents into a system prompt, addresses a real problem: AI agents lack institutional memory. But context injection has a ceiling. It degrades across sessions. It has no enforcement semantics. It cannot resolve conflicts between rules. It cannot be audited after an incident. And it has no effect on the agent that generates plausible-looking code that violates a constraint the prompt did not anticipate.

The Faros data describes a system where generation velocity has outpaced governance velocity. Neither more reviewers nor longer prompts changes the structural relationship between those two rates.

What closing the gap requires

The Faros report's structural conclusion lands in the same place as the architectural governance argument: quality needs to move to the point of authorship. Not downstream in review. Not in the incident postmortem. Before the code is written.

What "before the code is written" requires in practice is specific:

Architectural decisions as structured, machine-readable constraints -- not prose guidelines, not ADR documents in a prompt, but enforcement rules with explicit scope, precedence, and action
Hook-level integration -- enforcement at the agent's tool-use layer, before the write completes, not in the review queue after the PR is opened
Persistence across sessions and agent boundaries -- constraints that survive context rotation, multi-agent handoffs, and the next developer who picks up the work
Explainable enforcement traces -- structured output an agent can act on when blocked, not a pass/fail signal that requires human interpretation

This is not a review improvement. It is a different layer of the stack, operating at a different point in the workflow. The Faros data does not prescribe a specific implementation. But it does name the problem with precision: the systems that validate what AI generates are not scaling with the rate at which AI generates it. Closing that gap is the engineering problem the next phase of AI development has to solve.
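
A rough sketch of what hook-level enforcement with an explainable trace could look like. The `Constraint` fields, the `pre_write_hook` name, and the example rule are hypothetical, and the content check is a plain substring match, far simpler than anything production-grade. The sketch assumes the agent runtime calls the hook before a write completes; it does not model any specific runtime's hook API.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Constraint:
    constraint_id: str
    scope: str       # glob over file paths the constraint applies to
    forbidden: str   # substring that must not appear in written content
    reason: str      # why the rule exists
    suggestion: str  # what the agent should do instead


def pre_write_hook(path: str, content: str, constraints) -> dict:
    """Evaluate a pending write and return a structured enforcement trace.

    The trace is meant to be actionable by the agent itself: it names the
    constraint, the reason, and a suggested alternative, not just pass/fail.
    """
    for constraint in constraints:
        in_scope = fnmatchcase(path, constraint.scope)
        if in_scope and constraint.forbidden in content:
            return {
                "allowed": False,
                "path": path,
                "constraint": constraint.constraint_id,
                "reason": constraint.reason,
                "suggestion": constraint.suggestion,
            }
    return {"allowed": True, "path": path, "constraint": None,
            "reason": None, "suggestion": None}


# Hypothetical rule: UI code must not talk to the database directly.
constraints = [
    Constraint(
        constraint_id="ADR-007",
        scope="src/ui/*",
        forbidden="import sqlite3",
        reason="UI code must not access the database directly",
        suggestion="call the repository layer instead",
    )
]
print(pre_write_hook("src/ui/cart.py", "import sqlite3\n", constraints)["allowed"])   # False
print(pre_write_hook("src/db/repo.py", "import sqlite3\n", constraints)["allowed"])   # True
```

Because the constraint lives in data and not in a prompt, it stays the same across sessions and agents, and the returned trace can be logged for audit after an incident.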

What the data means for engineering organizations now

The Faros report includes a pointed observation about the DORA 2025 finding that strong engineering foundations amplify AI benefits. Two years of telemetry tell a different story. High-performing engineering organizations with mature DevOps practices are experiencing the same downstream deterioration as everyone else. The governance gap is not a maturity problem. It is a structural problem that mature practices do not automatically solve.

For engineering leaders reading the Faros data, the practical implication is this: the throughput gains from AI adoption are real and worth preserving. The increases in incident rate and review burden are also real, and they compound. The interventions that fix the second without giving up the first are the ones that operate upstream, at the governance layer, before code generation rather than after.

The organizations the Faros report notes as "already ahead" are the ones with the observability to see where throughput is real and where review is failing. The next step is the infrastructure to enforce architectural correctness at the source.

Originally published at mnemehq.com. Mneme HQ is open-source architectural governance that enforces decisions at the point of authorship -- view it on GitHub.

The AI ROI Problem Is Not About Models. It Is About Systems.

Theo Valmis — Fri, 29 May 2026 13:28:52 +0000

The recent wave of weak enterprise AI ROI reporting is not evidence that AI fails to create value. It is evidence that organizations matured generation capability faster than the governance and verification infrastructure needed to operationalize it. Generation is rapidly commoditizing. Verification is not.

The findings, read carefully

Several recent enterprise studies are pointing at the same structural pattern: adoption is accelerating, local productivity gains are visible, but measurable financial impact remains inconsistent. Organizations are struggling to operationalize gains at the system level.

The headlines summarize this as "AI ROI is disappointing." That framing is the wrong takeaway. The stronger interpretation is:

AI generation capability matured faster than enterprise operational infrastructure. The result looks like ROI failure. It is actually a transition period.

That distinction matters because it changes the strategic direction. If the problem is "AI does not work," the response is to slow down. If the problem is "the operational layer underneath AI has not been built yet," the response is to build it.

The market misdiagnosed the problem

Most organizations treated AI adoption like a tooling upgrade. New IDE plugin, new copilot, new chat interface. That framing is structurally wrong. AI behaves much less like tooling and much more like an execution layer.

Traditional tooling assists humans. Emerging AI systems increasingly execute on behalf of humans. Once agents write code, modify infrastructure, trigger workflows, coordinate tasks, and interact with production systems, the operational requirement changes completely.

The primary question stops being "is generation quality high enough?" The question becomes:

How do organizations preserve coherence while execution scales? That is fundamentally a governance problem, not a model problem.

Why productivity gains fail to reach the P&L

The productivity gains are real. Teams report faster code generation, faster document production, accelerated research, and less repetitive work. None of that is fictional. The question is what happens to those gains as they propagate through the rest of the system.

Enterprise systems are interconnected. If acceleration in one layer creates instability elsewhere, the organization tends to relocate labor rather than remove it. The shape of the relocation is consistent across teams I have talked to and across the public studies:

developers generate code faster
reviewers spend more time validating it
architectural drift increases as more code lands
downstream bugs and incidents rise
integration complexity compounds
governance overhead expands to compensate

The system gets faster at producing work that still requires human reconciliation. People feel more productive. Leadership struggles to measure durable financial transformation. The gains exist. They are partially consumed by verification costs that nobody is tracking.

The hidden economic layer: verification

The AI industry has been framing generation as the scarce resource. That framing is becoming obsolete. Generation is commoditizing rapidly. Models get cheaper, smaller, more capable, and more numerous every quarter. The cost curve is pointed in one direction.

Verification is not on the same curve.

Generating output is becoming exponentially cheaper. Ensuring correctness, consistency, and alignment is not. That asymmetry is what is actually showing up in the ROI numbers. The new bottleneck is:

Verification — does this output meet the constraint?
Enforcement — can a violation be blocked, not just observed?
Governance — whose decisions does the running system reflect?
Explainability — can the verdict be traced back to a decision?
Provenance — can the lineage of a change be audited?
Architectural integrity — does the system still look like the system we intended?

The faster generation becomes, the more valuable deterministic enforcement becomes. Governance infrastructure becomes more important as agent capability improves, not less.

Governance debt

Software engineering already has a name for one category of accumulated cost: technical debt. AI systems are introducing a second category, related but distinct. Call it governance debt.

Governance debt accumulates when:

organizational decisions fail to propagate consistently across agents and teams
agents make locally valid but globally conflicting decisions
architecture standards drift across sessions or sub-agents
operational constraints become implicit instead of enforceable
review queues absorb coordination failures the system should have caught

The dangerous property of governance debt is the same property that makes it expensive: systems appear productive locally while degrading globally. The organization experiences acceleration and fragmentation at the same time. Leaders feel both effects but cannot reconcile them in the same metric.
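
One way to make "locally valid but globally conflicting" visible is to record each agent's decisions and look for topics where they diverge. The sketch below assumes decisions are logged as simple `(agent, topic, choice)` tuples. That format and the example values are invented for illustration.

```python
from collections import defaultdict


def find_conflicts(decisions) -> dict:
    """Return topics on which agents made different choices.

    `decisions` is an iterable of (agent, topic, choice) tuples. The result
    maps each conflicting topic to its sorted list of distinct choices.
    """
    choices_by_topic = defaultdict(set)
    for _agent, topic, choice in decisions:
        choices_by_topic[topic].add(choice)
    return {
        topic: sorted(choices)
        for topic, choices in sorted(choices_by_topic.items())
        if len(choices) > 1
    }


# Hypothetical decision log from two agent sessions.
log = [
    ("agent-a", "http-client", "requests"),
    ("agent-b", "http-client", "httpx"),
    ("agent-a", "database", "postgres"),
    ("agent-b", "database", "postgres"),
]
print(find_conflicts(log))  # {'http-client': ['httpx', 'requests']}
```

Each choice in the log is defensible on its own. The conflict only shows up when the decisions are compared across agents, which is why a per-session review does not catch it.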

Category	Accumulates as	Pays back as
Technical debt	Shortcuts in implementation	Maintenance cost on the code itself
Governance debt	Constraints that fail to propagate	Coordination cost across teams and agents

Every major computing transition followed this shape

The AI ROI story rhymes with earlier shifts. Each major computing transition has the same two phases:

Phase 1: capability expansion. The new technology shows it can do things the previous stack could not.
Phase 2: operational stabilization. The infrastructure to actually run the new technology in production gets built.

Cloud computing required orchestration. Microservices required observability. Open source required CI/CD governance. None of those transitions paid off until the operational layer caught up. AI systems are now entering the same transition.

The first wave rewarded model capability, prompting, generation quality, and autonomy. The next wave will reward reliability, enforcement, coordination, deterministic governance, operational traceability, and execution controls. That is where the market is heading, and it is where the ROI is going to materialize.

The strategic question is changing

The AI conversation is slowly shifting from one question to another:

Old question: Can AI generate useful output?
New question: Can organizations safely operationalize AI-generated execution at scale?

The first question is essentially answered. The second one is open. And it introduces a different set of requirements: governance systems, verification contracts, policy enforcement, execution boundaries, architectural invariants, provenance tracking. The market is quietly moving from intelligence infrastructure toward operational infrastructure.

Conclusion: what wins the next phase

The organizations that win the next phase of AI adoption may not be the ones with the most autonomous agents or the fastest generation systems. They may be the ones best able to:

constrain execution
preserve architectural coherence
enforce operational decisions
verify outputs deterministically
integrate AI into reliable organizational systems

Because eventually every scaling AI system encounters the same reality:

Intelligence without governance creates acceleration. Governance is what turns acceleration into compounding value.

Originally published at mnemehq.com.

AI Is Becoming the Operating Layer for Software Execution

Theo Valmis — Fri, 29 May 2026 13:28:26 +0000

Operating systems were never about interfaces. They were coordination systems. They scheduled processes, allocated memory, enforced permissions, isolated workloads, and reconciled what software wanted with what hardware could provide. AI is now becoming a coordination layer of the same kind — only the resources being coordinated are intent, tools, repos, memory, and execution chains. Once that shift completes, governance stops being a policy concern and becomes infrastructure.

The wrong mental model

The dominant framing for AI in 2026 is still "AI replaces apps." Better search, better assistants, better interfaces sitting in front of the same software. The frame is incomplete because it inherits a mistake from the consumer era: it treats the operating system as a UI layer.

Operating systems were never fundamentally about interfaces. They were coordination systems. What they actually governed:

memory and address spaces
scheduling and CPU time
permissions and capabilities
process isolation
resource arbitration
execution boundaries

What AI systems are starting to coordinate, in 2026, looks structurally similar:

workflows and multi-step plans
tools and external APIs
repositories and codebases
memory across sessions and agents
execution chains and retries
decision flows and delegation paths
autonomous agents and sub-agents

That list is not interface behavior. It is operating-system behavior.

The evolution of computing layers

The progression is visible if you line up which resources each generation of platform actually coordinated. Each layer abstracts the one beneath it. Each layer also eventually has to grow the same kinds of controls the layer below it grew: scheduling, permissions, isolation, audit. The AI operating layer is in the early-OS era of that pattern: the coordination capabilities exist, but the discipline does not yet.

From interaction to delegation

The other shift this layer makes is what the human is doing on top of it.

In the previous model, humans operated software directly. They navigated UIs, ran commands, wrote prompts. The system did exactly what they typed, then waited.

In the emerging model, humans delegate outcomes. They state an intent, a constraint, and a definition of done. The AI layer decides the sequencing, the tool calls, the retrievals, the implementation path, and the recovery strategy when something fails along the way.

Examples are not hypothetical anymore. IDE agents that own end-to-end feature work. Claude managed agents that run for hours on a goal. OpenAI Operator driving browser sessions. Enterprise copilots executing multi-system tasks. Autonomous CI/CD remediation loops. AI research agents that run experiments unsupervised.

None of those are autocomplete. They are runtime coordination over heterogeneous tools, working against a stated objective.

AI is not just becoming an interface layer. It is becoming an execution coordination layer.

Why memory and orchestration are not enough

Most of the visible investment in AI infrastructure today is in four areas: memory, orchestration, tool calling, and observability. Those are real and useful. They are also the same four capabilities early operating systems had before they grew up.

Operating systems eventually had to add permissions, policy enforcement, execution boundaries, verification, and invariant preservation — not because the early systems were bad, but because as more workloads ran on shared infrastructure, "do what the program asked" stopped being a sufficient guarantee.

The AI operating layer is missing the equivalent set of controls. Specifically, it is missing the layer that handles:

Architectural intent. What the system is allowed to be, not just what the task wants.
Governance propagation. Constraints that travel across agents, sessions, and execution surfaces.
Deterministic constraints. Rules that return the same verdict on the same artifact, every time.
Verification contracts. Pre-registered checks that prove architectural intent survived the run.

As AI becomes an operating layer, governance becomes operating infrastructure. Not a policy doc. Not a review process. Infrastructure — in the same sense that schedulers, permissions, and audit logs are infrastructure.
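A deterministic constraint in this sense can be pictured as a pure function of the artifact. The sketch below is illustrative only: the rule ID, the `Verdict` shape, and the banned package names are hypothetical and do not come from any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    rule_id: str
    passed: bool
    evidence: tuple[str, ...]


# Hypothetical list: packages a past team decision removed from the project.
BANNED_PACKAGES = ("legacy_orm", "old_http")


def check_banned_imports(artifact: str) -> Verdict:
    """Pure function of the artifact text: no clock, no network, no randomness."""
    hits = []
    for number, line in enumerate(artifact.splitlines(), start=1):
        words = line.split()
        if len(words) < 2 or words[0] not in ("import", "from"):
            continue
        # Compare only the top-level package name of the imported module.
        if words[1].split(".")[0] in BANNED_PACKAGES:
            hits.append(f"line {number}: {line.strip()}")
    return Verdict("no-banned-imports", not hits, tuple(hits))
```

Because the function reads nothing but its argument, running it twice on the same artifact yields equal `Verdict` values, which is the property the "same verdict on the same artifact" requirement asks for.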

The emerging AI execution stack

The clearest way to see what is and is not in place is to enumerate the layers of the stack and ask which of them have first-class infrastructure today.

Layer	Purpose
Models	Intelligence generation
Memory	Context continuity across sessions
Tooling	External execution — APIs, file systems, commands
Orchestration	Workflow coordination across steps and sub-agents
Agent Runtime	Long-running execution and recovery
Observability	Monitoring, traces, and post-hoc diagnosis
Governance	Constraint enforcement against architectural intent
Provenance	Intent lineage from decision to artifact
Verification	Reliability guarantees at the moment of merge

Most companies are racing to build the top half: intelligence, memory, orchestration, automation. Very few are building the bottom half: execution governance, architectural verification, intent preservation. That gap is not stylistic. It is the same gap that early operating systems had before scheduling and permissions became non-negotiable.

Operating systems eventually become governance systems

This is the historical pattern worth taking seriously. Early operating systems were thin coordination layers over hardware. They evolved — under pressure from real failure modes — toward access control, sandboxing, process isolation, scheduling guarantees, and audit trails. Not because the original designers wanted more bureaucracy, but because shared, long-running, autonomous workloads forced it.

AI operating layers will follow the same arc. The forcing functions already exist:

Architectural drift as agents make locally plausible but globally inconsistent choices.
Intent divergence between what a team decided and what successive agent runs implement.
Policy inconsistency across heterogeneous agents acting on the same codebase.
Execution inconsistency across sessions, where the same task takes a different path each time.
Provenance loss, where no one can trace a generated artifact back to the decision that authorized it.

These are not abstract risks; they are the failure modes teams are already filing tickets about. The more autonomous the system becomes, the more governance has to be structural rather than aspirational.

The strategic consequence

If AI is becoming an operating layer rather than an interface layer, the competitive question is not who builds the best chatbot or the most fluent assistant. It is who builds the systems that best manage delegation, execution, reliability, governance, continuity, and verification on top of any sufficiently capable model.

Models will get better. Agents will get faster. Orchestration will get cheaper. The thing that decides whether a stack is fit for production work over years is the layer that is hardest to bolt on after the fact: the operating infrastructure for autonomous execution.

The next operating system is an execution system. And every execution system, eventually, becomes a governance system.

Closing

Treating AI as the new operating layer is not a metaphor. It is a re-statement of what an operating system actually does — coordination, isolation, permissioning, audit — applied to the new set of resources AI systems coordinate. The implication is not that this layer is optional. It is that the layer is forming whether or not anyone designs it deliberately, and the teams that take governance seriously now will be building on infrastructure rather than retrofitting it.

AI became an execution layer. Governance is the part that turns it into infrastructure.

Originally published at mnemehq.com.

The AI-Native SDLC: A Methodology for Generation at Machine Speed

Theo Valmis — Fri, 29 May 2026 13:27:53 +0000

Every human software development lifecycle — waterfall, agile, CI/CD, trunk-based development — was designed around human generation speed. AI agents broke that assumption. The AI-native SDLC is a rethinking from first principles.

The software development lifecycle has always been shaped by constraints. Waterfall was shaped by the constraint that requirements were expensive to change. Agile was shaped by the constraint that feedback loops were too long. CI/CD was shaped by the constraint that integration was painful and infrequent. Each methodology is a response to the binding constraint of its era.

The binding constraint of AI-assisted software development is not generation speed — that constraint is gone. The binding constraint is governance at generation velocity: ensuring that what AI agents produce at high speed remains consistent with what the team has decided. Every SDLC practice designed around human generation speed is either irrelevant to this constraint or actively makes it worse.

The AI-native SDLC is the methodological response to this new binding constraint.

What an AI-native SDLC actually means

An SDLC designed for AI agents as first-class actors makes four structural assumptions that differ from every prior methodology:

Generation is cheap and fast. The cost of producing code is near-zero and the speed is orders of magnitude higher than human-paced development. The SDLC does not optimize generation; it constrains it.
Governance is the strategic bottleneck. The rate-limiting step is ensuring architectural coherence across high-volume AI output. The SDLC is designed around this bottleneck, not around generation.
Human oversight concentrates on decisions, not code. Human reviewers are not line-by-line evaluators of AI output — they are architectural decision-makers whose decisions are then encoded as machine-evaluable constraints. The human role is upstream (deciding) and downstream (judging outcomes), not inline (reviewing every line).
CI gates enforce architectural decisions, not just style. In an AI-native SDLC, CI is an enforcement layer for the team's architectural decisions — not a linting step that catches formatting violations. The CI gate is a governance surface.

These assumptions are not aspirational. They are the structural realities that teams face when AI generation volume grows faster than their existing SDLC can handle. The AI-native SDLC is not an ideology — it is the engineering response to a changed constraint set.
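The fourth assumption, CI as an enforcement layer for decisions rather than style, can be sketched as a gate that turns architectural check results into a process exit code. This is a minimal illustration; the `GateResult` shape and the decision IDs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    decision_id: str  # hypothetical ID of the architectural decision checked
    passed: bool
    detail: str


def gate_exit_code(results: list[GateResult]) -> int:
    """Return 0 when every architectural check passed, 1 otherwise."""
    failures = [r for r in results if not r.passed]
    # Sort so the report reads the same on every run.
    for failure in sorted(failures, key=lambda r: r.decision_id):
        print(f"BLOCKED by {failure.decision_id}: {failure.detail}")
    return 1 if failures else 0


results = [
    GateResult("ADR-012", True, "no direct database access from controllers"),
    GateResult("ADR-007", False, "reintroduces a dependency the team removed"),
]
exit_code = gate_exit_code(results)  # 1: one decision was violated
```

In a CI job the script would end with `sys.exit(exit_code)`, so a violated decision blocks the merge the same way a failing test does.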

Why this problem exists in AI-native development

The structure of every pre-AI SDLC encodes an assumption: the bottleneck is at generation. Human developers produce code at a rate that code review can track. Sprint planning, story pointing, and velocity measurement all assume generation is the variable to be managed. Review is sized to generation: roughly one reviewer per N developers, N sized so review capacity can keep up.

That ratio has inverted. A team of 10 engineers using AI coding agents does not produce 10 engineers' worth of code; it can produce 50x or 100x what a single engineer would. Review capacity did not scale with that change. A team that could review 30 PRs per week now receives 200. The traditional SDLC has no answer to this other than "hire more reviewers", which is exactly the wrong scaling axis.

The bottleneck has flipped. In human-paced development, generation is the constraint and review is sized to match it. In AI-native development, generation is unbounded and governance is the constraint. Any SDLC that doesn't redesign around the new bottleneck will fail at scale: not suddenly, but through progressive architectural erosion as review capacity is exceeded.

The failure mode is not dramatic. It is gradual. Review queue depth grows. Reviewers start approving more and scrutinizing less. Architectural violations slip through — not because reviewers are careless, but because they are overwhelmed and optimizing for throughput over quality. The codebase begins to drift. Downstream agents encounter the drifted patterns and treat them as precedent. The drift compounds.
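The arithmetic behind that gradual failure is simple. As a rough sketch using the figures quoted above (200 PRs arriving per week against capacity to review 30), and assuming both rates stay constant, which no real team will match exactly:

```python
def backlog_after(weeks: int, arriving_per_week: int, reviewed_per_week: int) -> int:
    """Unreviewed PRs left in the queue after a number of weeks at constant rates."""
    backlog = 0
    for _ in range(weeks):
        # The queue cannot go negative when capacity exceeds arrivals.
        backlog = max(0, backlog + arriving_per_week - reviewed_per_week)
    return backlog


for week in (1, 4, 8):
    print(week, backlog_after(week, arriving_per_week=200, reviewed_per_week=30))
# prints "1 170", then "4 680", then "8 1360"
```

The queue grows by 170 PRs every week. Nothing breaks on any single day, which is why the pressure shows up as lighter scrutiny per PR rather than as an outage.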

By the time the team notices, the architectural erosion has been compounding for weeks. Remediation requires understanding what the codebase should look like, finding every place where it diverged, and correcting those divergences — in a codebase that has continued to grow throughout the remediation effort. The AI-native SDLC prevents this by making governance the primary engineering investment, not a post-hoc remediation effort.

The common misread: treating AI as better autocomplete

The most common failure in transitioning to AI-assisted development is treating AI coding agents as accelerated autocomplete — tools that speed up existing workflows rather than tools that require new workflows. This produces two distinct failure modes, operating on different timescales.

Failure mode 1: review queue collapse. Teams that apply existing SDLC processes to AI-volume output quickly find their review queues overwhelmed. The PR volume is too high for the review capacity the team budgeted. Reviewers begin to batch-approve, reducing the effective scrutiny per PR. The review gate, which was the primary governance mechanism, becomes porous. This failure mode surfaces quickly — within weeks of full AI adoption — and is highly visible as PR cycle times lengthen and queue depths grow.

Failure mode 2: architectural standard erosion. Less visible, but structurally more serious. The governance layer — the system of conventions, documentation, and architectural norms that the team maintains — was designed for human-paced adoption. Onboarding engineers read the docs, absorb conventions, and apply them. AI agents don't absorb conventions; they need them injected. When the injection mechanism doesn't exist, every AI generation is unconstrained by the team's decisions. Standards erode not through deliberate violation but through absence of constraint. This failure mode surfaces slowly — over months — and is often mistaken for team discipline issues rather than infrastructure gaps.

Both failure modes have the same root cause: the SDLC was designed for human generation speed and was not redesigned for AI generation speed. The governance mechanisms that were adequate for 30 human-authored PRs per week are not adequate for 200 AI-authored PRs per week. The solution is not to slow down AI generation — it is to build governance infrastructure that operates at AI generation speed.

How this fits the AI SDLC

The AI-native SDLC has a specific layer structure. Understanding where each function lives in that structure is the prerequisite for understanding what to build and in what order.

The AI-native SDLC makes layer 5 — governance and architectural control — a first-class engineering concern. In traditional SDLCs, governance was a cultural and process concern: documentation, conventions, architectural review meetings. In the AI-native SDLC, governance is infrastructure: code, tests, enforcement mechanisms, and quality metrics. It requires the same engineering investment as any other layer.

The specific components of the governance layer in an AI-native SDLC:

Decision memory: A persistent, structured corpus of architectural decisions, encoded as machine-evaluable records — not documentation, but enforceable constraints. The source of truth for what the team has decided.
Retrieval system: The mechanism that, given a file path and task, surfaces the relevant decisions from the corpus. At governance scale, retrieval must be deterministic and fast — it runs in the critical path before generation.
Injection layer: The hook or context-injection mechanism that delivers relevant decisions to the agent before generation. The pre-tool-use hook in Claude Code is the canonical example. This is where governance before generation is implemented.
Enforcement verification: The evaluation layer that checks whether the agent's output respected the injected decisions. Produces verdicts (PASS, FAIL, WEAK) and feeds into the CI gate.
Quality measurement: Benchmark suites that measure governance system quality — recall rates, pass rates, WEAK_RETRIEVAL counts. Without measurement, governance quality is unknowable and therefore unimprovable.
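The first three components (decision memory, retrieval, injection) can be sketched together in a few lines. Everything here is an assumption made for illustration: the `Decision` record, the glob-based scoping, the keyword-overlap score, and the sample corpus are hypothetical, not a description of any particular tool's internals.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Decision:
    id: str
    scope: str  # glob over the file paths the decision governs
    keywords: frozenset[str]
    rule: str


# Hypothetical decision memory; in practice this would live in version control.
CORPUS = [
    Decision("ADR-003", "src/api/*", frozenset({"http", "client", "request"}),
             "Use the shared HTTP client; do not add new HTTP libraries."),
    Decision("ADR-009", "src/db/*", frozenset({"query", "migration"}),
             "Every migration has a rollback path."),
    Decision("ADR-011", "src/api/*", frozenset({"auth", "endpoint"}),
             "Every public endpoint has authentication."),
]


def retrieve(path: str, task: str, corpus: list[Decision], limit: int = 2) -> list[Decision]:
    """Rank in-scope decisions by keyword overlap; ties break on id, so order is stable."""
    words = frozenset(task.lower().split())
    in_scope = [d for d in corpus if fnmatch(path, d.scope)]
    ranked = sorted(in_scope, key=lambda d: (-len(d.keywords & words), d.id))
    return ranked[:limit]


def build_context(decisions: list[Decision]) -> str:
    """Format retrieved decisions as text to inject before generation."""
    return "\n".join(f"[{d.id}] {d.rule}" for d in decisions)


hits = retrieve("src/api/users.py", "add auth to the users endpoint", CORPUS)
context = build_context(hits)
```

Sorting on an explicit key with the decision ID as tie-breaker means the result does not depend on the order records happen to appear in the corpus, which is one simple way to get deterministic retrieval.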

Teams that build this infrastructure before scaling AI generation report a consistent outcome: review queue depth stabilizes, architectural violations decrease, and the human review layer can concentrate on the decisions and trade-offs that require human judgment. Teams that skip the infrastructure and scale AI generation first consistently report the opposite sequence.

Originally published at mnemehq.com.

Executable Architectural Intent: The Promotion Path From Docs to Constraints

Theo Valmis — Fri, 29 May 2026 13:27:21 +0000

Most project knowledge wants to be findable. A smaller, more important subset has to be binding. Executable architectural intent is the name for that subset — the slice of architectural knowledge that has been promoted out of documentation and into the layer where it can actually constrain what an AI agent does.

Teams writing for AI coding agents accumulate project knowledge fast: ADRs, PRDs, design notes, AGENTS.md files, NotebookLM corpora, Cursor rules, repo-native wikis. Almost all of it is useful, and almost all of it is reference material. It tells the agent what the system is.

A different question sits behind that body of work: which parts of this knowledge are not optional? Which decisions, if the agent ignored them, would make the result wrong even if it ran, passed tests, and looked locally plausible?

That subset has to live somewhere else. It is no longer documentation. It is executable architectural intent.

The operational definition

Executable architectural intent is the slice of project knowledge that has been promoted from documentation into machine-evaluable, enforceable constraints that bind AI agent behavior at generation time.

Three properties distinguish it from ordinary project knowledge:

It is binding, not informational. The agent does not have discretion to weigh it against task pressure. A run that violates it is a failed run, regardless of whether the rest of the output looks right.
It is machine-evaluable. A hook or a CI gate can produce a deterministic verdict against the resulting artifact. The constraint is not a sentence in a doc; it is something a check can answer with pass or fail.
It is traceable. Every verdict can be linked back to the source decision — the ADR, PRD, or policy it derives from — so enforcement is auditable rather than opaque.

Reference knowledge tells the agent what the system is. Executable architectural intent tells the agent what it is not allowed to change about that system while doing the task.
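The three properties map naturally onto a small data structure. The sketch below is one possible shape, with hypothetical names throughout (`Constraint`, `Verdict`, the ADR path, and the example rule are invented for illustration).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Constraint:
    id: str
    source: str  # traceable: the ADR, PRD, or policy it derives from
    check: Callable[[str], bool]  # machine-evaluable: artifact text -> pass/fail


@dataclass(frozen=True)
class Verdict:
    constraint_id: str
    source: str
    passed: bool


def evaluate(constraint: Constraint, artifact: str) -> Verdict:
    """Binding: a False verdict fails the run; nothing is weighed against task pressure."""
    return Verdict(constraint.id, constraint.source, constraint.check(artifact))


# Hypothetical constraint: code in this repo must not log with print().
no_print = Constraint(
    id="no-print-logging",
    source="docs/adr/0014-structured-logging.md",
    check=lambda artifact: "print(" not in artifact,
)

verdict = evaluate(no_print, "logger.info('charged')\n")
```

Carrying `source` on the verdict itself is what keeps enforcement auditable: anyone reading a failed run can see which decision blocked it.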

Promotion: from documentation to constraint

Executable architectural intent does not appear by default. It exists because a team explicitly promoted a piece of knowledge from the wiki layer into the governance layer. The shape of that promotion:

1. An architectural choice is made, usually in an ADR, PRD, design doc, or policy note.

2. The decision lives in the project wiki or LLM-readable corpus. Agents can read it. Compliance is left to discretion.

3. The decision is encoded as a scoped, retrievable, enforceable constraint that enters the agent loop and produces deterministic verdicts.

Most architectural knowledge stops at stage two and should. Promotion to stage three is a deliberate act, applied selectively to the rules that genuinely have to bind agent behavior.

The five-criteria bar

The signal that a piece of project knowledge is ready to be promoted is that it meets all five of the following criteria.

Explicit. Stated as a checkable rule, not implied across paragraphs of rationale.
Scoped. Tied to the part of the system it actually governs, so it is retrieved when relevant and silent when not.
Retrievable at the right moment. Surfaced into the agent's context before it commits to a candidate change, not after.
Enforceable before code lands. Evaluable by a hook or CI gate that returns a deterministic verdict against the resulting artifact.
Auditable back to the source decision. Every verdict can be traced to the ADR, PRD, or policy it derives from, so enforcement is reviewable rather than opaque.

If any one of these is missing, the rule is still doing useful work in the wiki, but it is not yet executable intent. It is documentation that has aspirations.
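The all-five-or-nothing bar can be written down as a checklist. This is a sketch with hypothetical names; how a team actually decides each boolean is a judgment call the code does not capture.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PromotionChecklist:
    explicit: bool     # stated as a checkable rule
    scoped: bool       # tied to the part of the system it governs
    retrievable: bool  # surfaced before the agent commits to a change
    enforceable: bool  # a hook or CI gate can return a deterministic verdict
    auditable: bool    # every verdict traces to the source decision


def missing_criteria(item: PromotionChecklist) -> list[str]:
    """Names of unmet criteria, in declaration order; empty means ready to promote."""
    return [f.name for f in fields(item) if not getattr(item, f.name)]


def ready_to_promote(item: PromotionChecklist) -> bool:
    return not missing_criteria(item)


# A rule that is written down and scoped, but has no gate and no provenance yet.
candidate = PromotionChecklist(True, True, True, False, False)
gaps = missing_criteria(candidate)  # ["enforceable", "auditable"]
```

Returning the list of gaps rather than a bare boolean tells the team what work remains before the rule can leave the wiki.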

Why this category exists

Without an explicit name for this layer, teams end up putting binding rules into surfaces that were never designed to enforce them: long CLAUDE.md files, system prompts, prose in ADRs, PR review checklists. Each of those is a useful surface for its own purpose. None of them produces a deterministic verdict at the moment of generation.

The result is a familiar failure mode: the rule was documented, the agent read it, the agent generated code that contradicted it, and the violation was caught (or missed) only in review. The cause was not missing knowledge. It was the absence of an enforcement surface for the knowledge that was already there.

See "Your LLM Wiki Is a Library, Not a Law" for the trend-capture version of this argument.

How it differs from adjacent concepts

Layer	What it does	What it does not do
LLM wiki / NotebookLM	Organizes project knowledge for agent retrieval	Cannot return a deterministic verdict at generation time
Prompt files (CLAUDE.md, AGENTS.md)	Reminds the agent what generally matters in the repo	Cannot prevent a session from ignoring or paraphrasing the rule
ADRs & PRDs	Records architectural and product decisions for humans	Cannot, on their own, enforce themselves on a generated artifact
Executable architectural intent	Binds agent behavior at generation and at merge	Does not replace any of the layers above — it sits on top of them

Executable architectural intent is the layer that turns selected pieces of project knowledge into rules the agent cannot route around. It is not a replacement for the wiki or for ADRs. It is the promotion path that some of their contents need to take.

How Mneme implements it

Mneme is the infrastructure layer for executable architectural intent. Architectural decisions are encoded in a version-controlled corpus, retrieved through a deterministic scoping layer before agents generate, and validated at the hook and CI levels after generation. Each verdict carries provenance back to the source decision, so enforcement is reviewable rather than opaque.

The wiki keeps doing its job. ADRs keep doing theirs. Mneme handles the promoted slice: the constraints whose violation would make the run wrong.

Originally published at mnemehq.com.

Agent Verification: Proving Architectural Intent Survived an Autonomous Run

Theo Valmis — Fri, 29 May 2026 13:25:14 +0000

An agent succeeded. The tests passed. The deploy went out. None of those facts tell you whether the system's architectural intent survived. Agent verification is the engineering discipline of answering that question deterministically — not by inspecting logs after the fact, but by evaluating the run's outputs against pre-registered contracts that say what must remain true.

The defining problem of autonomous engineering is not whether agents can complete work. They can. It is whether the work, once completed, preserved the system the team is operating. Execution success is not architectural correctness. A long-running agent can ship features that pass every test, satisfy every reviewer, and deploy without incident — and still leave the codebase incrementally less coherent than it started. Verification is the layer that closes that gap.

Why execution success is not architectural correctness

Tests, build pipelines, deploys, and incident dashboards all answer one question: did the system stay up? That is the question they were designed for, and they answer it well. None of them answer the question that becomes load-bearing when generation is autonomous: was the change architecturally allowed to exist?

An autonomous agent shipping a feature can:

Introduce a forbidden dependency that the team explicitly decided to remove six months ago. The tests still pass. The build still ships. The decision is now silently violated.
Cross a layering boundary — a controller calling directly into a data layer the architecture forbids it from touching. The new call works. The boundary that existed for reasons does not.
Replace a governed pattern with a sensible-looking alternative. The replacement is functionally equivalent and locally cleaner. It also breaks an invariant that downstream systems depend on.
Mutate an infrastructure standard — how services are exposed, configured, or deployed — in a way that drifts away from the team's established pattern without anyone noticing in review.

Every one of those failures is undetectable by the execution gate. They surface, if they surface at all, weeks later as drift telemetry, an incident postmortem, or a senior engineer's complaint that "the codebase doesn't feel right anymore." That delay is the cost of having no verification gate.
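To make the layering case concrete, here is a minimal sketch of a check that inspects structure rather than behavior, using Python's standard `ast` module. The layer names and the `FORBIDDEN` map are hypothetical; a real rule set would come from the team's own decisions.

```python
import ast

# Hypothetical layering rule: code in the "controllers" layer must not import "data".
FORBIDDEN = {"controllers": {"data"}}


def layering_violations(layer: str, source: str) -> list[str]:
    """Return imported top-level packages that the given layer is not allowed to touch."""
    banned = FORBIDDEN.get(layer, set())
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped.
            names = [node.module]
        else:
            continue
        found.update(name.split(".")[0] for name in names)
    return sorted(found & banned)


controller_code = "from data.orders import fetch_order\nimport services.billing\n"
violations = layering_violations("controllers", controller_code)  # ["data"]
```

The controller above would run and pass its tests. Only a check that knows the boundary exists can report that it was crossed.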

What agent verification verifies

Agent verification operates on three categories of property. Each is structurally different from the others; each requires a different kind of contract to evaluate.

1. Architectural intent

The decisions the team has accumulated about how the system is structured — ADRs, layering rules, dependency policies, allowed patterns, deprecated patterns. Verification of architectural intent answers: does this change respect the active architectural decision graph? The contract is the decision graph itself, resolved deterministically against the change's scope.

2. Operational constraints

The constraints the agent is allowed to operate within during the run — rate limits on external APIs, security boundaries on which tools may touch which resources, allowed write surfaces, mandatory approval gates. Verification of operational constraints answers: did the run stay inside the operational envelope the team defined for autonomous work? The contract is the envelope specification.

3. System invariants

Properties that must hold true regardless of the specific change — "every public endpoint has authentication," "no service writes directly to another service's database," "every migration has a rollback path." Verification of invariants answers: did the run preserve every property that must always be true? The contract is the set of invariants, evaluated against the post-change state.

The three categories are independent. A change can satisfy architectural intent and operational constraints while violating an invariant; or satisfy invariants while drifting from intent. Verification has to evaluate each separately.
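Evaluating each category separately might look like the sketch below. The change summary, its keys, and the individual checks are all hypothetical stand-ins; the point is only that the result keeps one verdict per category instead of collapsing them into a single boolean.

```python
from typing import Callable

# A hypothetical change summary that a verification gate might receive.
Change = dict
Check = Callable[[Change], bool]

CONTRACTS: dict[str, list[Check]] = {
    # Architectural intent: a dependency the team decided against stays out.
    "architectural_intent": [lambda c: "legacy_orm" not in c["new_dependencies"]],
    # Operational constraints: the run wrote only inside its allowed surface.
    "operational_constraints": [lambda c: all(p.startswith("src/") for p in c["written_paths"])],
    # System invariants: every public endpoint has authentication.
    "system_invariants": [lambda c: c["public_endpoints_without_auth"] == 0],
}


def verify(change: Change) -> dict[str, bool]:
    """Evaluate each category on its own; one passing category never masks another."""
    return {name: all(check(change) for check in checks) for name, checks in CONTRACTS.items()}


change = {
    "new_dependencies": [],
    "written_paths": ["src/api/users.py"],
    "public_endpoints_without_auth": 1,
}
result = verify(change)
```

Here the change respects intent and stays inside the operational envelope, yet `result["system_invariants"]` is `False`, which is exactly the combination the paragraph above describes.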

The contract is the substrate

Verification is only as good as the artifacts it evaluates against. A verification gate that runs without a structured contract is just opinion-as-CI — a senior engineer's heuristics encoded as a script, fragile, and unable to grow with the team. A verification gate that runs against a verification contract — a pre-registered, machine-evaluable assertion about what must remain true — produces a verdict that has the same shape every time the same conditions hold.

This is the substrate that makes verification an engineering discipline rather than a review style. The contract is committed to the repository alongside the code. The verification gate reads the contract and the change. The verdict is reproducible: same contract, same change, same verdict. That property — deterministic enforcement — is what makes verification something a team can trust at scale.
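One way to make "same contract, same change, same verdict" checkable is to record a fingerprint of the gate's inputs alongside the verdict. The sketch below assumes a contract and a change that are both plain dictionaries with hypothetical keys; it is not a description of any real contract format.

```python
import hashlib
import json


def fingerprint(contract: dict, change: dict) -> str:
    """Stable hash of the gate's inputs; key order does not affect the result."""
    payload = json.dumps({"contract": contract, "change": change}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_gate(contract: dict, change: dict) -> dict:
    """The verdict depends only on the contract and the change, never on time or environment."""
    added = set(change["added_dependencies"])
    violations = sorted(added & set(contract["forbidden_dependencies"]))
    return {
        "passed": not violations,
        "violations": violations,
        "inputs": fingerprint(contract, change),
    }


contract = {"forbidden_dependencies": ["legacy_orm"]}
change = {"added_dependencies": ["legacy_orm", "attrs"]}
first = run_gate(contract, change)
second = run_gate(dict(contract), dict(change))
```

If two runs share an `inputs` fingerprint but disagree on `passed`, the gate itself has a nondeterminism bug, and that is something a team can test for directly.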

Verification across long-running runs

The case for agent verification gets sharper as runs get longer. A single PR from a junior engineer is governed by review, and a missed violation surfaces in the next refactor. A long-running autonomous workflow that touches dozens of files across many sessions does not have that backstop. Each session makes locally reasonable choices. The cumulative effect is drift — and drift is exactly what verification is designed to catch.

The asymmetry matters: as agent autonomy increases, the gap between execution success and architectural correctness widens. Verification is the layer that keeps that gap measurable and closable.

What verification is not

The category boundary is sharp. Verification is not the same as any of the adjacent disciplines it touches.

Discipline	Question it answers	What verification adds
Unit tests	Does the code do what the test asserts?	Was the code allowed to exist at all?
Eval harnesses	Did the model output match a benchmark?	Did the change satisfy the team's architectural contract?
Observability	What ran, when, how long, at what cost?	Was the run's effect on the system permitted?
Code review	Does a human approve the diff?	Does the diff pass a deterministic, scalable check?
Linters & static analysis	Are there obvious bugs or style errors?	Are the team's specific architectural decisions intact?

None of these disciplines compete with verification. Each answers a different question, and a serious team runs several of them in parallel. Verification fills the gap where the other gates do not have a structured answer.

Where verification sits in the runtime stack

Verification is the top layer of the agent infrastructure stack. It runs after the agent has produced output and before that output is treated as canonical — pre-commit, pre-PR, in CI, before deploy. Its inputs are the agent's diff and side effects; its evaluation substrate is the verification contract; its output is a verdict that gates progression of the run.

The companion layer beneath it is governance infrastructure — the layer that defines what must remain true. Governance encodes the team's intent; verification proves whether intent survived. Without governance, verification has nothing to evaluate against. Without verification, governance is documentation.

Governance defines what must remain true. Verification proves that it did. One layer without the other is incomplete.

The discipline, in one sentence

Tests answer whether code works. Eval answers whether output is good. Observability answers what happened. Verification answers whether intent survived — and is what makes "the agent completed the run" mean something the team can trust as architecturally correct, not just operationally green.

Originally published at mnemehq.com.