Harness Engineering Still Needs Governance


The industry has moved from prompt engineering to harness engineering: building the execution systems that coordinate models, tools, memory, and retries across long-running agent loops. Harnesses solve how agents act. They do not solve whether those actions stay within architectural boundaries. As autonomous workflows scale, the missing layer is governance infrastructure.

The shift: prompts, harnesses, execution systems

Three years ago, the prevailing question was how to phrase a prompt. The model was the product, and prompt engineering was the surface where teams competed.

That framing no longer matches what is being built. OpenAI's recent harness work on Codex, Anthropic's pattern for long-running managed agents, Cursor's background-agent runtime, Claude Code's session-aware workflow, and the open-source frameworks — LangGraph, CrewAI, AutoGen — all converge on the same architectural move: the model is wrapped inside an execution system that handles tools, memory, retries, planning, and continuity.

Single calls have given way to execution loops. Assistants have given way to autonomous workflows. The model is no longer the product boundary. The execution environment is.

That is the right move. Anyone building agentic systems in 2026 needs a harness. But it is also incomplete. As soon as agents are doing real work across many sessions, on real codebases, the harness reveals what it does not cover: architectural intent.

What harness engineering solves

Harness engineering handles context lifecycle management, tool orchestration, retries and error recovery, planning and execution loops, memory injection and continuity, and observability and execution coordination.

This is a major architectural advancement over prompt engineering. Treating the execution environment as the product boundary lets autonomous workflows survive contact with real systems: rate limits, partial failures, long horizons, multi-tool coordination, and cross-session continuity.

None of what follows is an argument against any of that. The argument is that this layer alone is not enough.

The governance gap

A harness can make an agent faster, more persistent, more autonomous, and more capable. It cannot, by construction, make the agent architecturally aligned. Continuity is not constraint. Orchestration is not enforcement. Memory of past decisions is not authority to refuse a future one.

The failures look like this:

An ADR is bypassed. The repo has a recorded decision — "do not introduce a runtime ORM" — that the agent does not read at session start. The agent introduces an ORM because it solves the immediate ticket.
A forbidden dependency reappears. A package was removed for a documented reason. A later session reintroduces it because the prohibition lives only in a stale doc, not in an enforcement hook.
A governed system is rewritten. The agent refactors a module that had a specific layering contract. The new version is functionally equivalent and passes tests, but violates the layering rule that was the entire point of the original design.
Layering boundaries are crossed. A controller starts calling into a data layer that the architecture forbids it from touching directly.
Naming conventions drift. Each session is internally consistent. Across sessions, the naming gradually changes.
Infrastructure patterns mutate. A standard for how services are exposed, configured, or deployed is silently replaced by a sensible-looking alternative that the rest of the system does not expect.

None of these failures is caused by a bad harness. They are caused by the harness being the wrong place to enforce architectural decisions. The harness's job is to keep work moving. Its incentives are continuity and throughput, not refusal.

Harnesses preserve execution continuity. They do not preserve architectural intent.
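To make the layering failure above concrete, here is a minimal sketch of a check that tests alone would not perform. It assumes a hypothetical contract in which code in a `controllers` layer must not import from a `data` package. The layer names, the `FORBIDDEN_IMPORTS` table, and both helper functions are illustrative and not taken from any particular tool.

```python
import ast

# Hypothetical layering contract: layer name -> top-level packages it must not import.
FORBIDDEN_IMPORTS = {
    "controllers": {"data"},
}


def imported_top_level_packages(source):
    """Return the set of top-level package names imported by a module's source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped here.
            names.add(node.module.split(".")[0])
    return names


def layering_violations(layer, source):
    """Return the sorted list of forbidden packages that this module imports."""
    forbidden = FORBIDDEN_IMPORTS.get(layer, set())
    return sorted(imported_top_level_packages(source) & forbidden)


# A controller that works and could pass its tests, yet reaches into the data layer.
controller_source = "from data.orders import fetch_order\nimport services.billing\n"
print(layering_violations("controllers", controller_source))  # ['data']
```

The point of the sketch is that the rule is evaluated against the change itself, independently of whether the change is functionally correct.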

Why observability is insufficient

The most common response to the governance gap is to lean harder on observability. Trace every tool call. Log every diff. Pipe agent activity into a dashboard. If we can see what the agent did, we can correct it.

That argument confuses two different questions.

Observability answers what happened. Governance answers what should have been allowed. These are not the same problem.

Logs are not policy. A log records that a forbidden dependency was added. It does not refuse the add.
Traces are not invariants. A trace shows the call graph. It does not declare which call graphs are valid.
Visibility is not enforcement. A dashboard surfaces drift after it occurs. It does not block the change that produced the drift.

Observability is necessary — you cannot govern what you cannot see — but it sits on the wrong side of the action. By the time the trace reaches the dashboard, the commit has already happened. Governance has to sit in front of the action it constrains, with a deterministic rule about whether to allow it.
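The difference between recording an action and refusing it can be sketched in a few lines. The example below is illustrative only: the `Manifest` type, the `legacy-orm` package name, and both functions are hypothetical, and a real gate would read its rules from recorded decisions rather than from a hard-coded set.

```python
from dataclasses import dataclass, field

# Hypothetical prohibition; in practice this would come from a recorded decision.
FORBIDDEN_DEPENDENCIES = frozenset({"legacy-orm"})


@dataclass
class Manifest:
    dependencies: set = field(default_factory=set)
    audit_log: list = field(default_factory=list)


def add_with_logging(manifest, package):
    """Observability only: the change lands first, then it is recorded."""
    manifest.dependencies.add(package)
    manifest.audit_log.append(f"added {package}")


def add_with_gate(manifest, package):
    """Governance: the rule is evaluated before the change is applied."""
    if package in FORBIDDEN_DEPENDENCIES:
        manifest.audit_log.append(f"blocked {package}")
        return False
    manifest.dependencies.add(package)
    manifest.audit_log.append(f"added {package}")
    return True


logged = Manifest()
add_with_logging(logged, "legacy-orm")
print("legacy-orm" in logged.dependencies)  # True: visible in the log, but already added

gated = Manifest()
print(add_with_gate(gated, "legacy-orm"))   # False: refused before it landed
```

Both versions produce a log entry. Only the second one changes what ends up in the manifest.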

Governance propagation across execution surfaces

Long-running, autonomous agents do not only write source code. They write everywhere the workflow touches:

Branch names and PR titles. Auto-generated by the harness, often outside the team's branch and title taxonomy.
Commit messages and tags. Workflow-generated commits accumulate in history.
CI metadata and pipeline config. Written by agents the same way they write code, but usually subject to stricter governance constraints than application code.
Deployment artifacts and release notes. Manifests, container tags, generated changelogs.
Generated configuration. Feature flags, routing rules, scaling policies.
Agent-produced documentation. READMEs, ADR drafts, and runbooks that become the next agent's input context.

Governance must propagate across every surface touched by autonomous execution. A governance layer that enforces ADR compliance in src/ but ignores commit messages, PR titles, CI config, and generated docs is governing a fraction of the agent's output.
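One way to picture propagation across surfaces is a single table of rules keyed by surface rather than a check that only looks at source files. The sketch below is a simplification under stated assumptions: the surface names and the regular expressions are hypothetical conventions invented for illustration, not a recommended taxonomy.

```python
import re

# Hypothetical per-surface conventions; a real team would substitute its own.
SURFACE_RULES = {
    "branch": re.compile(r"^(feature|fix|chore)/[a-z0-9-]+$"),
    "commit_message": re.compile(r"^(feat|fix|docs|chore)(\([a-z0-9-]+\))?: .+"),
    "pr_title": re.compile(r"^\[[A-Z]+-\d+\] .+"),
}


def check_surfaces(outputs):
    """Return (surface, finding) pairs for every output that is not covered or not conforming."""
    findings = []
    for surface, value in sorted(outputs.items()):
        rule = SURFACE_RULES.get(surface)
        if rule is None:
            # A surface the agent writes to but no rule covers is itself a finding.
            findings.append((surface, "no rule defined"))
        elif not rule.match(value):
            findings.append((surface, "violates convention"))
    return findings


agent_outputs = {
    "branch": "agent/tmp-1234_fix",
    "commit_message": "fix(api): handle empty payload",
    "pr_title": "Update stuff",
    "release_notes": "Misc changes",
}
for finding in check_surfaces(agent_outputs):
    print(finding)
```

Reporting surfaces that have no rule at all is a deliberate choice here: it makes the ungoverned fraction of the agent's output visible instead of silently passing it.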

The next layer: governance infrastructure

The clean way to think about the emerging stack is to stop treating it as a model-plus-tooling problem and start treating it as a layered system:

Models           — produce candidate output
  ↓
Harnesses        — coordinate execution, retries, tools
  ↓
Execution        — long-running loops, sessions, memory
  ↓
Governance       — defines and enforces architectural constraints
  ↓
Verification     — tests, builds, deploy-time checks

Each layer answers a question the layer above it cannot:

Harnesses answer how does the agent act?
Execution systems answer how does the agent keep working across time?
Governance answers which actions are allowed, and according to which decisions?
Verification answers did the resulting system still pass its objective checks?

Governance is its own layer because the problem it solves is not solvable inside any of the others. Models produce text. Harnesses coordinate. Memory recalls. None of them can deterministically resolve which ADR governs a given change, or block output that violates the active decision graph.

This is what makes governance infrastructure a category, not a feature. It cannot be folded into the harness without giving the harness an enforcement responsibility that conflicts with its continuity responsibility. It cannot be folded into observability without losing its blocking authority.
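As a rough illustration of what "deterministically resolve which ADR governs a given change" could mean, the sketch below maps file paths to decisions by glob pattern and returns violations in sorted order, so the same input always yields the same answer. The ADR identifiers, scopes, and forbidden package names are all invented for the example, and this is not a description of how any specific tool builds its decision graph.

```python
from fnmatch import fnmatchcase

# Hypothetical decision records: each governs a path scope and forbids some dependencies.
ADRS = [
    {"id": "ADR-0007", "scope": "src/*", "forbids": {"runtime-orm"}},
    {"id": "ADR-0012", "scope": "src/api/*", "forbids": {"legacy-http"}},
    {"id": "ADR-0020", "scope": "infra/*", "forbids": {"manual-deploy"}},
]


def governing_adrs(path):
    """Return the sorted IDs of every ADR whose scope matches the changed path."""
    return sorted(adr["id"] for adr in ADRS if fnmatchcase(path, adr["scope"]))


def violations(path, added_dependencies):
    """Return sorted (adr_id, dependency) pairs that the change would violate."""
    added = set(added_dependencies)
    found = []
    for adr in ADRS:
        if fnmatchcase(path, adr["scope"]):
            found.extend((adr["id"], dep) for dep in adr["forbids"] & added)
    return sorted(found)


print(governing_adrs("src/api/orders.py"))
# ['ADR-0007', 'ADR-0012']
print(violations("src/api/orders.py", ["runtime-orm", "requests"]))
# [('ADR-0007', 'runtime-orm')]
```

Nothing in this resolution step depends on model output, retries, or recalled memory, which is the sense in which it sits in a layer of its own.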

Where Mneme fits

Mneme is a deliberately narrow layer in this stack. It does not orchestrate tools. It does not retry calls. It does not manage memory or context. It does one thing: it compiles the repository's ADR corpus into a deterministic decision graph and enforces it at the boundaries where agents make consequential changes.

Repo-native. ADRs live in the repository. The decision graph is rebuilt from them on every check.
Deterministic enforcement. Given the same decision graph and the same change, the result is the same every time.
Governance before generation. Running mneme check --mode warn at session start tells the agent which decisions are active before it writes anything. Running mneme check --mode strict at pre-commit and in CI is the enforcement gate.
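A pre-commit or CI gate built around the strict check might look roughly like the sketch below. The command line is the one quoted above. Everything else is an assumption: in particular, that the command exits with a non-zero status when a violation is found, and that a missing binary should fail closed. The `gate` helper is hypothetical.

```python
import subprocess
import sys

# Command as quoted in the text. That it exits non-zero on a violation is an assumption.
STRICT_CHECK = ["mneme", "check", "--mode", "strict"]


def gate(command):
    """Return True only if the command runs and exits with status 0."""
    try:
        return subprocess.run(command).returncode == 0
    except FileNotFoundError:
        # Fail closed: a missing checker must not silently allow the change.
        return False


# Demonstration with stand-in commands, so the sketch runs without the real tool.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]
print(gate(passing), gate(failing))  # True False
```

A hook script could then end with `sys.exit(0 if gate(STRICT_CHECK) else 1)`, so that a failed check blocks the commit or fails the pipeline.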

Harnesses help agents act. Governance ensures they act within architectural boundaries. Verification confirms the result. All three layers are infrastructure. None of them substitute for the others.

Long-running agents need more than memory and orchestration. They need enforceable architectural boundaries. The next phase of agent infrastructure is the layer that provides them.

Originally published at https://mnemehq.com/insights/harness-engineering-still-needs-governance/