Disclosure: This article was created with the help of a specialized reasoning system, then reviewed and verified by the author to ensure technical accuracy. #aibotwroteit
The better the model gets, the more expensive the mistakes become.
A weak model fails fast and obviously. You catch it in the first output. A capable model — one that can hold context across a long task, research its way through obstacles, write production-quality code — fails slowly, confidently, and at scale.
By the time you notice, it has built an entire wrong thing. With clean code. Passing tests. Coherent architecture. In exactly the wrong direction.
Better models didn't solve the failure mode. They made it larger.
And the failure isn't in the code. It's in the conversation — or the absence of one — that happened before the first prompt.
New to AI agents? Start with the glossary below—plain-language definitions for every term in this article.
🔸CR-EP(Context Resonance Enforcement Protocol): The Intent Governance framework proposed in this article; it is a full ruleset including WQG, Drift Taxonomy, Hard Lines, and Enforcement Logs to manage agent intent.
🔸AI Agent / Agentic AI: An AI system that doesn't just answer questions but takes actions. It can write files, run code, search the web, call APIs, and string multiple steps together autonomously (e.g., Claude Code, Cursor).
🔸Velocity Bias: The tendency during fast technology cycles to treat governance and documentation as friction to be minimized rather than essential work.
🔸MCP (Model Context Protocol): A standard interface defining how an AI agent talks to external tools and systems (databases, APIs, file systems). Think of it as USB-C for AI tool connections. Standardized in late 2024.
🔸Agent Skills / SKILL.md: Reusable instruction sets that define what an agent can do for specific types of tasks.
🔸Intent Governance / Intent Layer: The rules and structures that define why an agent is performing a task. It halts execution when the implementation ("how") drifts from the objective ("why").
🔸CONTEXT.md: A Markdown file in your project repo holding the Why (goals, metrics), Hard Lines (non-negotiable constraints), and a Change Log. It serves as the agent's briefing document and the team’s architecture record.
🔸Why Quality Gate (WQG): A checkpoint validating three criteria: measurable outcome (WQ-1), specific user (WQ-2), and failure consequence (WQ-3). If any fail, the agent stops for clarification.
🔸Drift / Drift Taxonomy: When implementation diverges from intent. Types include: Scope Creep (DT-1), Direction Flip (DT-2), and Implicit Assumption (DT-3).
🔸Hard Lines: Constraints the agent can never cross without explicit human approval (e.g., security rules, performance floors).
🔸Step-Down Points: Specific moments where an agent must stop and ask a human before continuing (e.g., schema changes, major dependency upgrades).
🔸Cold Start Problem: The deadlock where you need the agent to understand the project, but the agent needs a "Why" before it can start. Solved by a 5-question bootstrap interview.
🔸Enforcement Log: A structured record appended to responses showing which gates ran. Example:
`[CR-EP v2.3 | WQ: PASS/PASS/PASS | Drift: NONE | Status: ✅ RESONANT]`
🔸Deterministic vs Probabilistic: Deterministic: Same input always produces the same output (regex, code gates). Probabilistic: Input may produce different outputs (LLM prompts).
🔸ADR (Architecture Decision Record): A format for recording significant decisions. CONTEXT.md maps directly to this: Why = Decision, Hard Lines = Constraints, Change Log = Status.
1. The Invisible Hole in Your Stack
We have spent the last two years perfecting the "How" and the "What" of agentic workflows.
We have standardized how agents communicate with our databases and file systems via the Model Context Protocol (MCP) and built extensive libraries of Agent Skills to define what these systems can execute.
But as we moved these agents from "cool demos" to "production infrastructure," we discovered a terrifying silence at the top of the stack. We built the muscles and the hands, but we forgot the frontal lobe.
1) How We Visualize the Stack Today
Currently, most developers visualize their agentic stack in two primary layers:
MCP / Tool Calls ← "How does the agent talk to external systems?"
Agent Skills ← "What can the agent do?"
Both layers are well-served. There are established frameworks, emerging standards, and a massive body of community knowledge supporting them.
But there is a missing floor in this building.
When you look at the architecture of a real-world agent failure, the error never happens in the "How" or the "What." It happens in the "Should."
2) The Missing Layer: Intent Governance
[ THE MISSING LAYER ] ← "Should the agent do this? Toward what end?"
MCP / Tool Calls ← "How does the agent talk to external systems?"
Agent Skills ← "What can the agent do?"
We call this missing layer Intent Governance.
This is not a critique of MCP or skill libraries—they are elite at execution. But they operate at the moment of action. They catch execution errors (e.g., "I couldn't connect to the database"). They are, by design, blind to **intent errors**.
3) Execution vs. Intent
An intent error occurs when an agent builds a technically perfect solution that violates core business logic, security models, or architectural invariants simply because that context was never explicitly transmitted.
- The Execution Success: The code is clean, the tests pass, and the architecture is coherent.
- The Intent Failure: The agent built the thing right, but in the wrong direction, for the wrong reason.
Without governance, capable models fail slowly and confidently. By the time drift is detected, the agent has already built a wrong system at scale—the origin of today’s most expensive production failures.
2. Why Is This Gap Invisible to AI Experts?
If the gap is real and the failure mode is common, why hasn't the engineering community filled it? Why are even the most seasoned AI developers missing this?
The honest answer is four things happening simultaneously.
1) The Shipping Trap (Velocity Bias)

The agent boom of 2025–2026 has rewarded speed above every other metric. Every framework, library, and conference talk focuses on "what agents can do now." In this high-velocity environment, governance is often viewed as friction rather than foundational work.
Teams optimizing for shipping speed don't stop to verify the direction—they assume the intent is obvious and focus entirely on execution. The intent layer is deprioritized as a "future problem" that never gets addressed.
2) The "Reasoning" Blind Spot
There is a persistent belief among AI experts that intent misalignment is simply a model quality problem. The assumption is that if a model is "smart" enough (e.g., high reasoning scores), it will eventually infer what you really want.
This feels correct as models improve—right up until the moment a production system fails catastrophically after six months of operation.
The failure is surprising because each individual output was high-quality, but high-quality execution does not equate to mind-reading.
3) The Probabilistic Trap (The Complexity of Rigor)
Most teams attempting intent governance fall into a circular logic: "Ask the LLM to check if the LLM's work is aligned."
This is not enforcement; it is simply adding another probabilistic LLM call subject to the same intent ambiguity.
Building a governance layer that actually holds—deterministic rules that audit behavior and code-level gates that intercept tool calls—requires a different design philosophy than writing a better system prompt.
It is significantly harder, and the gap between "adding instructions" and "building a governance layer" isn't obvious until you are debugging a production incident.
4) The Infrastructure Lag
The necessary tooling simply wasn't ready. The Model Context Protocol (MCP) only standardized in November 2024, and the discipline of "Context Engineering" emerged in the same year.
The vocabulary for discussing intent governance at the developer workflow level is only forming now. You cannot build a community or a standardized solution for a problem that doesn't have shared language yet.
5) The Result
The result, as of early 2026: search GitHub for "agent governance" or "intent alignment," and you will find runtime interceptors and enterprise policy engines.
What you won't find is a dev-workflow-native, repo-level, deterministic intent layer that lives next to the code and versions with it.
The agent ecosystem has been sprinting in one direction for eighteen months, and we haven't yet turned around to ask where we are going.
3. Why 2026 Is the Inflection Point
These blind spots have created a strategic vacuum that is now hitting the reality of production.
We are discussing Intent Governance today because we have reached a tipping point. The "honeymoon phase"—where developers celebrated any working code an AI produced—is over. We have entered the era of Production Incidents.
Agents at Scale: In 2024, agentic work was exploratory. In 2026, tools like Claude Code and Cursor are the primary way code is written on an increasing number of teams. In an experimental era, bad output was a "learning experience." In a production era, bad output is an incident.
The Failure Cycle is Completing: We are now seeing the fallout of eighteen months of un-governed agent usage. Post-mortems are circulating on engineering blogs regarding session hijacking, bloated monorepos, and mismatched architectural decisions. These are the inevitable result of high-velocity execution without an intent layer.
The Competitive Advantage: Right now, intent governance is an "advanced" concept. In twelve months, it will be a standard expectation—just like CI/CD or PR reviews. Teams building these practices now gain institutional knowledge that cannot be "prompt-engineered" later.
4. What Intent Errors Actually Look Like
These aren't hypothetical bugs. They are "successful" executions that result in system-wide failures because the Should was never enforced.
The Fintech Auth Trap: An agent is asked to "add user authentication." It uses JWT and stores tokens in localStorage. The code is clean and passes all tests. Six months later, a security audit reveals a session hijacking vulnerability. The agent made a "reasonable" choice that was fatally wrong for a high-stakes threat model that was never communicated.
The Compounding Scope Creep: While building a recommendation API, an agent "helpfully" adds a notification system and a scheduler. Each addition is technically sound, but the codebase quietly doubles in complexity without a single architectural decision being consciously made.
The Silent Direction Flip: A team chooses Server-Side Rendering (SSR) for SEO. An agent, optimizing for performance, refactors the core rendering to a client-side SPA pattern. The logic is technically sound for latency, but it destroys the foundational SEO strategy that the agent didn't know existed.
In every case, the agent did its job. The governance layer was simply missing.
5. The Core Insight: Why Before How
The failures described in the previous section share a common root: the agent figured out the How, but had to guess the **Why**.
While the 2026 Inflection Point has given us agents with elite execution capabilities, they remain limited by their training data. They understand how to structure APIs or optimize queries, but they do not possess your specific business context, threat models, or non-negotiable constraints.
Without an explicit transfer of intent, the agent interpolates. It fills the gaps with the most statistically probable patterns from its corpora, resulting in code that is technically coherent but contextually wrong.
To prevent these Production Incidents, you must move the "Why" out of ephemeral chat messages and into a structured document that the agent reads before touching a single line of code. However, a "Why" is only effective if it is precise. Most intent statements are too vague to act as a functional Intent Governance layer.
A "Why" that successfully constrains an agent requires three specific components:
| Component | Vague (fails) | Specific (passes) |
|---|---|---|
| Measurable outcome | "improve performance" | "p95 response time under 200ms" |
| Target user | "users" | "mobile users aged 18–30 on iOS" |
| Failure consequence | "it would be bad" | "policy violations trigger app store removal" |
If any of these three are missing, the agent will interpolate the gap. It will default to the most common answer found in its training—which is almost certainly not the correct answer for your specific production environment.
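The table above can be enforced mechanically. Here is a minimal sketch of what a deterministic Why Quality Gate check might look like, assuming the Why has already been parsed into fields. The regex and keyword heuristics are illustrative assumptions, not part of any published CR-EP implementation—a real gate would encode your team's own definitions of "measurable" and "specific."

```python
import re

def check_why(why: dict) -> dict:
    """Return PASS/FAIL for WQ-1..WQ-3 given a parsed Why statement.

    The heuristics below are deliberately crude stand-ins for
    team-specific rules; they exist to show that the check itself
    can be deterministic code rather than another LLM call.
    """
    outcome = why.get("outcome", "")
    user = why.get("user", "")
    consequence = why.get("consequence", "")

    return {
        # WQ-1: a measurable outcome should contain a number with a unit.
        "WQ-1": bool(re.search(r"\d+\s*(ms|s|%|x|req|users?)", outcome)),
        # WQ-2: the target user must be more specific than a bare "users".
        "WQ-2": bool(user.strip()) and user.strip().lower() not in {"users", "everyone"},
        # WQ-3: the failure consequence must name a concrete effect.
        "WQ-3": bool(consequence.strip()) and "bad" not in consequence.lower(),
    }

specific = {
    "outcome": "p95 response time under 200ms",
    "user": "mobile users aged 18-30 on iOS",
    "consequence": "policy violations trigger app store removal",
}
print(check_why(specific))  # all three gates pass for the specific version
```

Run it against the vague column of the table ("improve performance", "users", "it would be bad") and all three gates fail—which is exactly the point at which the agent should stop and ask.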
6. Balancing Rigor with Exploration
A common critique of any governance protocol is that it only works for well-defined tasks. Real software development, however, rarely starts with a clean definition. If a protocol is too rigid, it becomes "security theater"—something developers bypass just to get their work done.
Take a common request: "Explore whether a graph-based recommendation engine outperforms our current filter." At this stage, there is no measurable metric or specific user yet. The "Why" is simply "to find out."
The answer is not to remove governance from exploration, but to grant it a dedicated mode with an explicit resolution contract. This prevents the "Exploration Mode" from becoming a permanent bypass for actual governance.
To maintain velocity without losing direction, discovery work must be:
Explicitly Declared: "Discovery mode" is a conscious choice, not a byproduct of forgetting to write a Why.
Time-Boxed: Every spike must have an expiration date. At a predefined point, discovery must either produce a concrete Why or be flagged as permanent Drift.
Strictly Bounded: Exploration is not a license to touch everything. Even in discovery, certain Hard Lines—like modifying production auth logic or billing schemas—remain non-negotiable and locked.
The critical design question for any Intent Governance system is this: What are the conditions that force a transition from exploration to committed direction?
Without a clear trigger for this transition, exploration becomes a loophole that eventually leads to the same production failures we are trying to avoid.
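One workable trigger is simply counting sessions, with a hard cap. A minimal sketch, assuming a local JSON state file—the file name, storage format, and three-session limit here are illustrative choices, not a prescribed mechanism:

```python
import json
from pathlib import Path

MAX_EXPLORATION_SESSIONS = 3
STATE_FILE = Path(".crep_exploration.json")  # hypothetical state file

def start_exploration_session() -> str:
    """Record one exploration session; force a transition after the cap."""
    state = {"sessions": 0}
    if STATE_FILE.exists():
        state = json.loads(STATE_FILE.read_text())
    state["sessions"] += 1
    STATE_FILE.write_text(json.dumps(state))

    if state["sessions"] > MAX_EXPLORATION_SESSIONS:
        # The resolution contract: exploration must now produce a
        # concrete Why, or be flagged as permanent drift.
        return "HALT: exploration budget exhausted; commit a Why or close the spike"
    return f"EXPLORATION session {state['sessions']}/{MAX_EXPLORATION_SESSIONS}"
```

The mechanism is trivial, and that is the point: the transition out of exploration is enforced by a counter, not by anyone's discipline on a busy Friday.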
7. Operationalizing Intent: Solving Real-World Hurdles
Theory often hits a wall when it meets a complex production codebase.
To move from a "good idea" to a runnable system, we must address the three friction points that cause developers to abandon governance.
1) The Cold Start Problem: Breaking the Deadlock
A major failure mode that governance protocols often ignore is the chicken-and-egg deadlock of the first session.
To write a high-quality "Why," you need a deep understanding of the project. To gain that understanding quickly, you want to use the agent. But the agent, operating under strict governance, requires that "Why" before it can start. This is the Cold Start Problem.
The solution is a bootstrap interview. Instead of requiring a finished document upfront, the agent initiates a five-question dialogue to draft the context for you:
- Problem Definition: What problem are you solving, in one sentence?
- Target Audience: Who specifically has this problem?
- Success Criteria: How will you know you've solved it?
- Risk Analysis: What's the worst thing that could happen if you build this wrong?
- Hard Lines: What decisions must not be made without your direct input?
Five questions. Five minutes.
The agent synthesizes these answers into a draft CONTEXT.md. It flags gaps—usually the measurable metrics, as most people answer question 3 vaguely—and waits for your review. This removes the "blank page" friction without compromising the integrity of the gate.
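To make this concrete, here is what a draft produced by that interview might look like for a hypothetical payment retry service. Every value below is invented for illustration:

```markdown
# CONTEXT.md — Payment Retry Service (draft from bootstrap interview)

## Why
Reduce failed-payment churn: recover 15% of soft-declined transactions
for EU subscription customers within one quarter.

## Hard Lines
- No changes to the billing schema without explicit approval.
- No card data in logs, ever.

## Open Gaps (flagged by the agent)
- "Recovered" needs a precise, measurable definition (WQ-1 is weak).

## Change Log
| Date | Change | Author |
|---|---|---|
| 2026-01-15 | Initial draft from 5-question interview | agent + reviewer |
```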
2) The Truthfulness Problem: Form vs. Reality
This is a harder challenge that most governance frameworks refuse to acknowledge: A "Why" can be perfectly well-formed but factually wrong.
A statement like "Achieving p95 under 200ms will reduce churn by 30%" passes all three criteria: it has a measurable metric, a causal relationship, and a business impact.
However, it is an assumption, not a fact. Whether performance actually drives that churn is a product research question, not a governance one.
Intent Governance validates the form of the Why—it cannot validate the truth of the assumptions inside it. No automated system can do that.
This distinction matters because it clarifies exactly what governance is for:
- It is not a substitute for user research, data analysis, or product judgment.
- It is a structural guarantee that when you make those judgments, they are transmitted to the agent clearly enough to actually constrain its behavior.
The human remains responsible for the quality of the hypothesis; the governance layer is responsible for ensuring the agent understands and follows it.
3) The Hierarchy Problem: Scaling Intent
Governance designed for a single project file breaks down in large, distributed codebases. A monorepo with fifteen microservices has fifteen different definitions of success and fifteen different sets of constraints.
Yet they all share non-negotiable company-wide rules, such as "No PII in logs" or "No credentials in version control."
The pattern that works in 2026 is hierarchical context files, merged by path:
- Root File: Holds immutable, company-wide rules.
- Service-level Files: Define service-specific Whys and constraints.
- Feature-level Files: Hold task-specific context for immediate sprints.
When the agent starts a task, it traverses the directory tree, collects every context file it finds, and merges them.
The rule is absolute: A child can override the "Why," but parent "Hard Lines" are additive, and immutable lines are non-overridable.
The result is a payment service that has a different "Why" than a recommendation engine, but both inherit and must obey the root security constraints.
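The merge rule itself fits in a few lines. This sketch operates on pre-parsed dictionaries rather than real files, and the field names are assumptions—the point is the asymmetry: the Why is last-writer-wins, while Hard Lines only ever accumulate.

```python
def merge_context(chain: list[dict]) -> dict:
    """Merge context files ordered root -> service -> feature."""
    merged = {"why": None, "hard_lines": []}
    for ctx in chain:
        # A child may override the Why...
        if ctx.get("why"):
            merged["why"] = ctx["why"]
        # ...but parent Hard Lines are additive and never dropped.
        merged["hard_lines"].extend(ctx.get("hard_lines", []))
    return merged

root = {"hard_lines": ["No PII in logs", "No credentials in version control"]}
service = {"why": "Recover 15% of soft-declined payments",
           "hard_lines": ["No billing schema changes"]}
feature = {"why": "Add exponential backoff to the retry worker"}

# The feature-level Why wins; all three hard lines survive the merge.
print(merge_context([root, service, feature]))
```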
8. Automation Is the Long-Term Answer
For governance to survive in high-velocity teams, it must be ambient rather than deliberate. If updating context is a manual chore, it will be skipped when you are moving fast—which is exactly when you need it most.
We can change the calculus of intent through three automation points:
Pre-commit hook: Run gate validation before every commit. If the "Why" gate fails, the commit is blocked. This makes "forgetting" to update the context structurally impossible.
PR suggestion: When a Pull Request opens without a corresponding context update, the system analyzes the PR description and suggests a CONTEXT.md update based on the diff.
Gate health report: Weekly aggregations of gate fire rates and "Step-Down" trigger counts turn compliance into a legible metric.
There is a fourth automation that is rarely discussed: staleness detection. An outdated CONTEXT.md is more dangerous than a missing one. A missing file throws an error; an outdated file produces confidently wrong code because the agent follows a "Why" that is no longer true.
The signal to watch is the ratio between code commit frequency and context update frequency. If a codebase has forty commits in three weeks without a change to the context file, something has probably drifted.
A CLI tool that surfaces this automatically—calculating git log density against the last-modified date—is a small project with massive ROI for production safety.
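A minimal sketch of that tool, using the thresholds from the example above (forty commits, three weeks). The git invocations and thresholds are assumptions to tune for your repo's cadence; the decision logic is kept as a pure function so it can be tested without a repository:

```python
import subprocess

def is_stale(commit_count: int, context_updated: bool, max_commits: int = 40) -> bool:
    """Flag probable drift: many commits, no context update in the window."""
    return commit_count >= max_commits and not context_updated

def check_repo(repo: str = ".", window_days: int = 21) -> bool:
    since = f"--since={window_days} days ago"
    # Commit density inside the window.
    count = int(subprocess.run(
        ["git", "-C", repo, "rev-list", "--count", since, "HEAD"],
        capture_output=True, text=True, check=True).stdout.strip())
    # Did CONTEXT.md change at all inside the same window?
    ctx_touched = subprocess.run(
        ["git", "-C", repo, "log", "-1", since, "--", "CONTEXT.md"],
        capture_output=True, text=True, check=True).stdout.strip() != ""
    return is_stale(count, ctx_touched)
```

Wire `check_repo()` into a weekly cron or the gate health report, and a drifting context file becomes a metric instead of a post-mortem finding.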
9. The Mental Model: Communication Over Checklists
At its core, Intent Governance is a communication protocol for an agent that cannot read your mind. To scale AI agents in production, we must change how we perceive their role—moving from "tools that execute" to "collaborators that require intent".
1) Intent is a Task-Level Concern
There is a specific reason this approach belongs in your repository, versioned alongside your code, rather than in an enterprise policy layer or a central wiki.
Intent is not an organizational concern—it is a task-level concern. The "Why" for a payment retry service is fundamentally different from the "Why" for an onboarding flow. Centralizing governance loses the specificity that makes it useful.
By keeping CONTEXT.md next to the code it governs, you ensure that the agent’s constraints are as precise as the logic it is writing.
2) The compounding cost of Interpolation
Every piece of context you provide is a decision you have made consciously. Conversely, every piece of context you omit is an interpolation you have authorized the agent to make on your behalf.
Think about what this means at scale. A team using Claude Code or Cursor for eight hours a day makes hundreds of small architectural decisions every week. Without intent governance, each of those is an interpolation from training data—statistically "average" choices that may not fit your specific architecture.
Over months, these interpolations compound like technical debt: silently at first, then suddenly as a major production incident.
3) Determinism vs. Probability
Understanding the difference between deterministic and probabilistic enforcement is the final piece of the puzzle:
Code-level Gates (Deterministic): A deterministic checkpoint enforced in code/CI that blocks changes unless explicit rules pass (e.g., schema validation, security policy checks, breaking-change detection).
Text-based Prompts (Probabilistic): A lightweight, natural-language instruction layer that shapes model behavior but cannot guarantee compliance; best for fast iteration where mistakes are recoverable.
The prompt teaches the agent what to care about; the code enforces what it can actually do. Both are useful, but neither is sufficient alone.
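As a concrete contrast with a prompt-based instruction, a code-level gate can be a few lines of plain logic in CI. The protected paths and approval marker below are assumptions for illustration—what matters is that this check runs the same way every time, regardless of how the agent was prompted:

```python
# Hypothetical Hard Line gate: changes under protected paths require an
# explicit human approval marker in the PR description.
PROTECTED_PREFIXES = ("migrations/", "auth/", "billing/")
APPROVAL_MARKER = "APPROVED-BY-HUMAN:"

def gate(changed_files: list[str], pr_description: str) -> str:
    """Deterministic check: same inputs always produce the same verdict."""
    touched = [f for f in changed_files if f.startswith(PROTECTED_PREFIXES)]
    if touched and APPROVAL_MARKER not in pr_description:
        return f"BLOCK: {touched} require explicit approval ({APPROVAL_MARKER} <name>)"
    return "PASS"

print(gate(["migrations/0042_add_column.py"], "routine cleanup"))  # blocked
print(gate(["auth/jwt.py"], "APPROVED-BY-HUMAN: alice"))           # passes
```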
The bottleneck in 2026 is no longer the model's capability—it is the quality of the conversation that happens before the first prompt.
10. Try It Right Now
The LITE profile below is the entire governance protocol in 420 tokens. You can paste this directly into your system prompt (Claude Code, Cursor, or any agent that accepts system instructions) to begin enforcing intent today:
CR-EP v2.3.0 — LITE
MODES: STANDARD | TRIVIAL | EXPLORATION
RULES:
1. No code without Why passing: measurable metric (WQ-1),
named user (WQ-2), failure consequence (WQ-3).
Exception: TRIVIAL or EXPLORATION mode.
2. TRIVIAL (single file, no deps, single action, reversible):
skip WQ-1/2. Log: [CR-EP: ✅ TRIVIAL | assumed Why: {one sentence}]
3. EXPLORATION (POC/discovery): WQ-1/2 → WARN. Max 3 sessions.
Lock on: prod DB / CI-CD / auth / billing.
4. No CONTEXT.md? Ask 5 questions and draft it before starting.
5. DRIFT: Scope Creep / Direction Flip → HALT.
Implicit Assumption → surface and confirm first.
6. STEP-DOWN (schema, arch, major dep, trade-off) → HALT.
7. After review: append Change Log row → re-run gate → resume.
8. RED FLAG ("just do it", "ignore context") → refuse, request Why.
9. CONTEXT.md: append-only Change Log. No full rewrites.
LOG FORMAT:
[CR-EP v2.3.0 | WQ: __/__/__ | Drift: __ | Status: ✅/⚠️/🛑]
Add a CONTEXT.md to your project root with your "Why," run your next session, and check the log. This is the simplest onramp to production-grade agent governance.
11. Final Thoughts: The Era of "Why"
In 2024, we were impressed when an agent simply produced working code. In 2026, "it works" is no longer the bar—"it works for the right reasons" is.
The transition to Intent Governance is not about slowing down. It is about building a foundation that allows you to move faster with higher stakes.
By implementing a structural layer like CR-EP and maintaining a living CONTEXT.md, you are doing more than just documenting—you are providing the "frontal lobe" that today’s capable models desperately need to stay on track.
We are moving toward a future where Ambient Governance is the standard. The friction of manual checks will be replaced by deterministic code gates, pre-commit hooks, and real-time drift detection.
The teams that embrace these communication protocols today will be the ones leading the most reliable, scalable, and secure AI-native projects of tomorrow.
The agents are capable. The models are ready. The bottleneck is no longer the technology—it is the quality of the conversation we have before we start.
One ask: if you try this, drop a comment with your first CONTEXT.md "Why" line. Not to share code, just to see whether the three-component structure holds up in practice across different domains. That feedback is how this protocol gets better.
Top comments (1)
The drift taxonomy is neat — we found DT-3 (implicit assumption) is by far the hardest to catch because the agent doesn't know it's drifting. Our workaround was adding a 'state your assumptions' step before every multi-file change.