DEV Community: Vinh Nguyen

Why AI agents keep violating your product rules

Vinh Nguyen — Fri, 15 May 2026 15:28:04 +0000

TL;DR: Agents "violate product rules" mostly because they can't see which behaviors are confirmed decisions vs. temporary implementation details. Modern harnesses (AGENTS.md, memory, specs, monitors) steer code execution, but they still don't provide product truth. A behavior spec adds that missing layer with trust levels and explicit must/must-not behaviors.

I started noticing a pattern with coding agents in my codebase. The agent would make a change that looked reasonable — cleaner code, passing tests — but it broke a product decision I'd already made.

Most of the time, these weren't bugs. The agent was correct based on what it could see. It just didn't know what I knew — product decisions, trade-offs, things I'd explicitly confirmed or rejected that existed nowhere in the code.

The agent saw a Stripe integration. It didn't know I was actively evaluating switching to Polar. It saw a simple email/password auth flow and didn't know that was intentionally minimal because I hadn't decided on the permission model yet. It saw a 24-hour refund window and couldn't tell whether that was a deliberate product decision or a temporary hack.

These aren't coding mistakes. They're product context failures.

The state of the art is good — and it's not enough

The ecosystem for giving agents code-level context has gotten genuinely impressive in 2026:

AGENTS.md / CLAUDE.md — Claude Code, Codex CLI, and others read these for conventions, build commands, architecture patterns
Cline's memory bank — structured markdown files persisted across sessions for project knowledge
Spec-driven development — GitHub's spec-kit, Kiro, and others turning specs into agent instructions
Agent discovery — both Claude Code and Codex CLI now do initial codebase exploration before they start working

All of these solve the "how should I write code in this repo" problem. And they're getting really good at it.

What none of them solve: what should this product do and why.

The harness engineering gap

Both OpenAI and Anthropic published engineering posts in early 2026 saying the same thing: the model isn't the bottleneck — the harness is.

OpenAI's harness engineering series describes building their entire product with zero hand-written code. Their approach relies on structured documentation directories that serve as the single source of truth for agents, with architectural constraints enforced mechanically via custom linters and CI — not just prompt instructions.

Anthropic's post on effective harnesses for long-running agents describes the same pattern — every new agent session starts with no memory, so you need persistent structured context that agents can quickly load and reason about.

Both companies built these harnesses by hand for their own teams. Neither productized it.

Coming from the two biggest AI companies, that validates the problem. It also reveals the gap: both are describing code-level harnesses, and neither addresses the product truth layer.

Four layers of agent context

When you look at the tools available today, there are really four distinct layers of context that agents need:

Layer	What it provides	Example tools
Code conventions	How to write code in this repo	AGENTS.md, .cursorrules
Session memory	What happened in prior sessions	Cline memory bank, claude-progress.txt
Implementation specs	What to build (feature level)	spec-kit, Kiro, SDD
Product truth	What the product promises — what's decided, what's uncertain, and why	Missing

Each layer is useful. None replaces the others. But right now, most repos have layers 1-3 covered and layer 4 is completely absent.

That's the layer where your product decisions live — and it's the layer agents violate most often.

What product violations look like

When an agent has AGENTS.md but no product truth layer, it knows how to work but not what to protect. So it:

Refactors billing logic and silently changes the grace period from 14 days to 30
"Simplifies" an auth check that was handling a deliberate edge case
Auto-fills a field that the product intentionally leaves empty for compliance reasons
Removes a validation rule because it looks redundant — but it was load-bearing
Infers behavior from code patterns and builds confidently on wrong assumptions

The agent isn't being careless. It's following the instructions it has. The problem is that nobody wrote down which behaviors are off-limits.

Product truth as a behavior spec

The fix isn't more documentation. It's a different kind of artifact — one that captures product-level decisions in a format both humans and agents can reason about.

That's what a behavior spec is — formally a Product Behavior Contract (PBC). It's Markdown-first, machine-readable, and designed to sit alongside your existing agent context:

What must happen — required outcomes for each behavior
What must not happen — forbidden actions and states
Edge cases — the exceptions that look like bugs but are deliberate
Trust levels — which behaviors are confirmed vs. provisional vs. still being explored

A PBC doesn't replace AGENTS.md or memory banks or feature specs. It fills the specific gap those tools leave open — the product truth layer.

When your agent can read a behavior spec that says "the 24-hour refund window is a confirmed product decision, not a temporary value," it stops treating that number as refactorable. When it sees "auth is intentionally minimal — permission model is still in exploration," it stops building assumptions on top of an undecided foundation.
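
To make the trust-level idea concrete, here is a minimal Python sketch of how a behavior and its trust level might be represented once a spec has been loaded. The names `Trust`, `Behavior`, and `guidance` are hypothetical and are not taken from the PBC spec; the two sample behaviors restate the refund and auth examples above.

```python
from dataclasses import dataclass
from enum import Enum


class Trust(Enum):
    # Hypothetical trust levels mirroring "confirmed vs. provisional vs. exploring".
    CONFIRMED = "confirmed"
    PROVISIONAL = "provisional"
    EXPLORING = "exploring"


@dataclass(frozen=True)
class Behavior:
    name: str
    statement: str
    trust: Trust


def guidance(behavior: Behavior) -> str:
    """Turn a behavior's trust level into a one-line instruction for an agent."""
    if behavior.trust is Trust.CONFIRMED:
        return f"PROTECT: {behavior.statement}"
    if behavior.trust is Trust.PROVISIONAL:
        return f"ASK FIRST: {behavior.statement}"
    return f"DO NOT BUILD ON: {behavior.statement}"


refund = Behavior(
    "refund-window",
    "The 24-hour refund window is a confirmed product decision",
    Trust.CONFIRMED,
)
auth = Behavior(
    "auth-model",
    "Auth is intentionally minimal; the permission model is still in exploration",
    Trust.EXPLORING,
)

print(guidance(refund))
print(guidance(auth))
```

The point of the sketch is only that trust is data the agent can branch on, rather than something it has to infer from the code.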

Code tells you what is. Product truth tells you what should be.

The distinction matters because code is descriptive — it shows the current state. Product truth is prescriptive — it says what the product promises and where those promises are still uncertain.

An agent that only reads code will always assume the current implementation is intentional. An agent that also reads a behavior spec can distinguish between "this is a deliberate design decision" and "this is a temporary hack that should be replaced."

That's the difference between an agent that writes correct code and an agent that builds the right product.

The PBC spec is open source at github.com/stewie-sh/pbc-spec. You can browse example contracts in the PBC viewer.

If you're using AGENTS.md, memory banks, or spec-driven development and still finding that your agents miss the product layer — this is the artifact that's missing.

Disclosure: this article was drafted with AI assistance and reviewed, edited, and fact-checked by the author before publication.

Behaviors, decisions, execution: three layers of AI-safe engineering memory

Vinh Nguyen — Tue, 28 Apr 2026 14:03:48 +0000

TL;DR: AI-safe engineering needs three durable, machine-readable layers: a behavior layer (what the product must do), a decisions layer (why the team chose it), and an execution layer (what's actually running and whether it matches). Execution has mature tooling. Some domains have decision substrates: authority frozen into deterministic data. Most product teams don't have that kind of oracle, so the behavior layer is the empty one — and it's the layer that decides whether agents stay honest.

Last week OpenAI shipped workspace agents in ChatGPT: Codex-powered agents for teams.

I watched the launch and felt something specific. The labs have decided that workflow orchestration is the next category to annex. Every layer of the AI engineering stack now has a serious player chasing it — except one.

That blank layer is the reason I keep writing about behavior contracts.

I've written before about why agent-context tooling alone misses product truth. This post zooms out one level — from agent operation to the durable memory an AI-native team needs to preserve over time.

The three layers

After a year of shipping with AI assistants, of rebuilding products from scratch and watching teams fight the same fights, I've come to believe AI-native engineering rests on three layers of durable memory:

1. Behavior — what the product must do, regardless of who or what writes the code.
What's settled. What's still being worked out. Which edge cases are intentional. Which rules are hard caps and which are soft warnings.

2. Decisions — why the team chose this and not that.
Trade-offs, assumptions, rejected options, the reasoning that survives long after the people leave.

3. Execution — what's actually running in production, and whether it matches the behavior contract.
Drift detection, runtime verification, and the checks that prove the AI's output is correct rather than just plausible.
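
As a rough illustration of what drift detection against a behavior contract could mean, the Python sketch below compares values a contract requires with values observed from the running system. The function name `find_drift`, the keys, and the numbers are all hypothetical; real drift detection would need a way to collect the observed values.

```python
def find_drift(contract: dict[str, object], observed: dict[str, object]) -> list[str]:
    """Report every contract value that is missing or different at runtime."""
    problems = []
    for key, expected in contract.items():
        if key not in observed:
            problems.append(f"{key}: required {expected!r}, but nothing was observed")
        elif observed[key] != expected:
            problems.append(f"{key}: required {expected!r}, observed {observed[key]!r}")
    return problems


# Hypothetical contract values and a hypothetical snapshot of production.
contract = {"grace_period_days": 14, "delete_data_during_grace": False}
observed = {"grace_period_days": 30, "delete_data_during_grace": False}

drift = find_drift(contract, observed)
print(drift)
```

Without the behavior layer there is nothing to pass as `contract`, which is why the execution layer needs it as a target.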

These aren't three names for one thing. They're three different altitudes. They answer three different questions. They have different failure modes. And right now, in 2026, only the execution side has broadly mature tooling.

The execution layer has mature tooling

Telemetry, observability, eval frameworks, validation harnesses — the runtime side of AI engineering has moved fast.

But Erik Fehn's Project Phoenix helped sharpen the boundary for me. Phoenix is not simply execution-layer tooling. It is closer to a decision substrate: authority frozen into deterministic data, then execution layered on top.

In Erik's PPR_Agent domain, that authority comes from FDA-mandated cardiac device implant records behind a deterministic SQLite layer. Swap the model. Rewrite the interface. The numbers don't change. The invariants survive because they were never only in the code.

His phrase has stuck with me: the substrate is the decisions.

That places Phoenix at the boundary between decisions and execution. The substrate carries authority. Execution checks whether the AI's output stays inside it.

Most teams don't have an external oracle like that. Their product decisions live in Slack, PRs, tickets, stale docs, code-shaped assumptions, and the heads of whoever was there when the feature shipped. That's where the behavior layer matters.

The decisions layer is being prototyped

The middle layer — why we chose this — has historically lived in Slack threads, PR descriptions, and the heads of two or three senior people who'll eventually leave.

Yauheni Kurbayeu's Provenance Manifesto is the most serious attempt I've seen to make decisions a first-class artifact in the SDLC. He calls the pain organizational context amnesia — the slow erosion of "why we built it this way" as people rotate, teams reorganize, and AI agents start touching code without ever meeting the humans who reasoned about it first.

His direction is right. A decision log that captures assumptions, risks, owners, and lineage is the right shape for the layer. The work is early — file-based prototypes, a graph-based long-term direction — but the frame is correct.

The decisions layer is starting to crystallize. It's not solved, but the people working on it know what they're working on.

The behavior layer is empty

This is the gap.

What's missing from most AI-native teams is a durable artifact that says: here is what the product must do, written in a form a human and an agent can both read, consult, and update.

Not a wiki page. Wikis optimize for explanation, not coverage.
Not a ticket backlog. Tickets optimize for coordination, not long-term truth.
Not a test suite. Tests optimize for catching regressions, not communicating intent.
Not a system prompt. System prompts decay and get rewritten by whoever shipped last.

The behavior layer needs an artifact with a few specific properties:

Markdown-first, so humans can read and edit it without tooling.
Structured enough that an agent can resolve "what's the rule for billing on the free tier?" without grepping the codebase.
Versioned, so changes to product intent are reviewable.
Honest about uncertainty — explicit about what's settled, what's still being worked out, and what's unknown.

That artifact is what we've been calling a Product Behavior Contract (PBC). More generally, this is the behavior-spec layer: a durable record of what a system must continue to do. PBC is the product/software version of that idea.

The format is open at github.com/stewie-sh/pbc-spec. The reference tooling is open. The bet is that the behavior layer is too important to be locked inside any single vendor's workspace.

A note on altitude: behavior lives at more than one layer of a company.

There is team behavior: how PRs get reviewed, how incidents are run, what we promise specific accounts.

There is policy behavior: what we never log, where data must stay, what counts as a refundable failure.

And there is the layer this post is about: product behavior — what the running software must continue to do.

All three are durable. All three are mostly implicit. I'm focused on product behavior because it is the most verifiable: you can check whether running code matches a behavior contract.

Why the labs aren't building a portable behavior layer

The labs are building workflow orchestration because that's where the demos look most impressive — and where enterprise budget already exists.

But workflow orchestration without a behavior contract is guardrails without a target. The agent runs somewhere, but it doesn't know what it's supposed to do. So it does what it's been doing for two years: it does something plausible, hopes the human doesn't notice, and quietly pushes what's running in production further from the team's actual product intent.

You feel this when an AI assistant "fixes" a bug by removing an edge case that turns out to have been intentional. You feel it when an agent ships a feature that technically passes review but contradicts a soft rule three people in the room would have caught. You feel it most when a new hire — human or agent — restarts the same investigation a previous teammate already finished, because nothing durable captured the answer.

Workflow gets faster. Without a behavior layer, wrongness gets faster too.

What this means for AI-native teams

You don't need every layer perfect. You need each layer present.

Most teams I talk to have:

Execution layer: partial. Telemetry exists. Eval is sometimes wired up. Drift detection is rare.
Decisions layer: partial. Some teams have ADRs. Most don't. Slack is the substrate by default.
Behavior layer: usually empty. There's no artifact that says what the product must do, in a form an agent can read and a human can argue with.

The cheapest move in 2026 is to claim the layer the labs aren't going to build for you. The behavior layer is small enough that one person can start it in an afternoon. It's also load-bearing: a behavior contract gives the decisions layer something to anchor against, and gives the execution layer a target to verify.

When a team has an external oracle, the substrate can carry authority. When it doesn't, the behavior contract becomes the closest durable artifact: a reviewed, versioned spec of what the system must continue to do.

You don't have to use any specific tool to do this. You can write a markdown file in your repo today.

But I'd argue you should write something. The labs are shipping faster than anyone's product intent can keep up with. The team that has a durable, navigable behavior layer is the team whose AI agents stay honest.

Three layers. Three altitudes. Different builders, different artifacts, same meta-problem: preserving intent, reasoning, and correctness in AI-accelerated engineering.

Which of the three is your team weakest at right now — behavior, decisions, or execution?

I'd guess behavior, but I want to be wrong.

Vibe coding got you here. Now what?

Vinh Nguyen — Tue, 21 Apr 2026 14:00:00 +0000

TL;DR: Vibe coding ships fast, but it often fails to preserve the reasoning behind product behaviors — and agents refactoring later will "reasonably" change things users rely on. A lightweight behavior spec (.pbc.md) makes "what must happen / must not happen" explicit so future you (and future agents) know what to protect, starting with high-risk modules like billing and auth.

Vibe coding works. If you've used it seriously, you know this. You describe a feature, the model drafts it, you push it. Things that used to take a day take an hour. You ship faster than you ever have.

The problem isn't the speed. The problem is what gets left behind.

What "left behind" means

When you build normally — even scrappily — the reasoning behind a decision usually ends up somewhere. In a comment, a commit message, a Slack thread, a ticket. Imperfect, but traceable.

When you vibe code, you're in a flow state. You describe what you want, the model gives you something close, you adjust, you ship. The gap between intention and implementation is so small that writing it down feels redundant.

Weeks later, you have a product that works but you can't fully explain why it works the way it does.

Not the code — you can read the code. The reasoning. Why does the cancellation flow work like this? Why is the grace period five days and not seven? Why does the export fail silently on empty results instead of returning an error?

The code remembers what you built. It doesn't remember why.

Why this compounds with AI coding agents

The original vibe coding problem — reasoning drift — is bad on its own. AI coding agents make it worse in a specific way.

Agents don't just read your code. They make inferences about intent. When you ask Claude Code or Codex to refactor a module, they read the existing implementation and decide what behavior to preserve and what to change. Without explicit constraints, that decision is partly guesswork.

The agent isn't being careless. It's doing exactly what you asked. But it's filling in missing context with reasonable-looking assumptions — and reasonable assumptions about code aren't the same as correct assumptions about product behavior.

The result: a refactor that passes tests and breaks something real. An "improvement" that changes behavior your users depend on. A simplification that removes an edge case that was load-bearing.

You can't blame the agent for this. The rules were never written down.

The missing layer

There's a layer between "what the code does" and "what the product promises" that most teams never formalize. PRDs describe intent. Tests verify implementation. Neither one captures the behavior spec — the durable record of what your product guarantees and why.

For teams that have been building for years, this layer lives in accumulated memory. For vibe-coded products, it often doesn't exist at all.

That's the gap. And the longer you wait to address it, the more expensive it gets.

What to do about it

You don't need to slow down to fix this. You need a format that fits the way you actually work.

That's what behavior specs are — formally Product Behavior Contracts (PBC) — a lightweight Markdown format for capturing what your product promises to do. Not a PRD. Not tests. Not Gherkin. Just the smallest artifact that makes product reasoning explicit and version-controlled.

The pattern is:

PRD explains why. PBC specifies what. Code and tests prove how.

A .pbc.md file sits in your repo. It documents behaviors, rules, edge cases, and the decisions behind them — in plain Markdown your whole team can read. It's structured enough that tools can parse it, lightweight enough that you'll actually maintain it.

The goal isn't to slow down the vibe. It's to leave something behind for future you — and for every agent that touches your codebase next.

Start small

You don't need to spec the whole product at once. Start with the module that would cause the most damage if an agent got it wrong. Billing. Auth. Entitlements.

Write down three things for each behavior: what must happen, what must not happen, and the edge cases that matter. That's it. That's your behavior spec.
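
The "three things" rule is simple enough to check mechanically. The Python sketch below is one possible check, assuming each behavior has already been loaded into a dictionary with the hypothetical keys `must`, `must_not`, and `edge_cases`; the cancellation statements are invented for illustration.

```python
REQUIRED_PARTS = ("must", "must_not", "edge_cases")


def missing_parts(behavior: dict[str, list[str]]) -> list[str]:
    """List the required parts that are absent or empty for one behavior."""
    return [part for part in REQUIRED_PARTS if not behavior.get(part)]


# Hypothetical behavior entry: the edge cases have not been written down yet.
cancellation = {
    "must": ["Access continues until the end of the paid period"],
    "must_not": ["Charge the card again after cancellation"],
    "edge_cases": [],
}

print(missing_parts(cancellation))
```

A check like this could run in review to flag a behavior that is only half specified.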

Vibe coding got you here. A behavior spec is how you stay here.

For what a behavior spec actually looks like, see "What is a Product Behavior Contract?"

The PBC spec is open source at github.com/stewie-sh/pbc-spec. You can browse example behavior contracts in the PBC viewer.

Beyond CLAUDE.md and AGENTS.md: when your coding agent needs a behavior spec

Vinh Nguyen — Wed, 15 Apr 2026 16:31:47 +0000

TL;DR: CLAUDE.md and AGENTS.md are excellent at steering how agents write code. They were never designed to capture what the product promises to do. When agents refactor, extend, or integrate code, they need a behavior spec — not more workflow instructions. That's the layer above instruction files.

Your agent refactors the billing module and changes the grace period from 14 days to 30. The code is cleaner. The tests pass. The product promise is broken.

Your agent simplifies the auth flow and removes an edge case check. It looked redundant in the code — but it was handling a deliberate compliance requirement that existed nowhere except in your head.

You had CLAUDE.md or AGENTS.md in the repo. You had coding conventions written down. The agent followed those perfectly — and still broke the product.

This isn't a model capability problem. The models are better than they've ever been. It's a context architecture problem — instruction files tell agents how to write code, but not what the product promises to do.

What CLAUDE.md and AGENTS.md actually are

These files work. They solve a real problem. But it's worth being precise about which problem.

CLAUDE.md tells Claude Code how to operate in your repo. Use pnpm, not npm. Run tests before committing. Don't modify generated files. Keep responses concise. It's a workflow configuration — process knowledge that changes when your conventions change.

AGENTS.md does the same for Codex and other agents: coding conventions, build commands, architecture patterns, file organization rules. OpenAI's AGENTS.md spec has been adopted across tens of thousands of repos because it solves the "how to work here" problem well.

Both are instruction files. They answer: "How should an agent behave in this repo?"

That's a useful question. But it's not the question that causes production incidents.

The question they don't answer

The question that causes production incidents is: "What does this product promise to do?"

When an agent refactors your billing module, it doesn't need to know whether to use pnpm or npm. It needs to know that the 14-day grace period is a confirmed product decision — not a magic number to clean up. It needs to know that the empty tax_id field is intentionally blank for compliance reasons — not a bug to fix. It needs to know that the auth flow is deliberately minimal because the permission model hasn't been decided yet — not because nobody got around to adding OAuth.

CLAUDE.md doesn't have this information. Neither does AGENTS.md. Not because they're badly designed — because they were designed for a different purpose.

Where instruction files hit their ceiling

The ceiling isn't one thing. It's a pattern that shows up in three ways:

1. No semantic structure. CLAUDE.md and AGENTS.md are freeform prose. A human reads them and infers what matters. An agent reads them and treats every line as equally weighted. "Use TypeScript" and "never change the grace period without product owner approval" have the same format — a bullet point. One is a preference. The other is load-bearing.

2. No trust signal. Everything in an instruction file has the same status: written down. There's no way to distinguish a confirmed product decision from a provisional assumption from an active experiment. An agent treats them all as current truth — and they aren't.

3. No verification path. After an agent runs, there's no way to check whether it honored the product constraints. You can lint code style. You can run tests. But "did the agent preserve the billing contract?" requires a human to review the diff line by line and remember every product decision in their head.

These aren't flaws in CLAUDE.md or AGENTS.md. They're the natural limits of instruction files trying to carry behavior specs they were never built for.

What sits above instruction files

The layer above instruction files is a behavior spec — an artifact that captures what the product promises to do, in a format that's both human-reviewable and machine-readable.

A behavior spec (.pbc.md — formally a Product Behavior Contract) sits in your repo alongside CLAUDE.md and AGENTS.md, but it answers different questions:

	CLAUDE.md / AGENTS.md	.pbc.md
Answers	"How should the agent work here?"	"What does the product promise?"
Contains	Conventions, commands, patterns	Behaviors, rules, states, edge cases
Changes	When your workflow changes	When a product decision changes
Audience	Agents + new contributors	Everyone — product owner, eng, QA, agents
Structure	Freeform prose	Markdown with typed semantic blocks

Here's what the same knowledge looks like in each format:

In CLAUDE.md:

# Billing rules
- Grace period is 14 days
- Don't change billing logic without approval
- Tax ID is required for invoices

In a .pbc.md behavior spec:

## Grace period enforcement

### When
A subscription payment fails

### Then
- System enters a 14-day grace period
- User retains full access during grace period
- Daily retry attempts are made against the payment method
- On day 14, if no successful payment: downgrade to free tier

### Invariants
- Grace period duration must be exactly 14 days — not configurable per plan
- No data deletion occurs during grace period
- Grace period cannot be extended manually by support

### Edge cases
- If user upgrades plan during grace period: a new payment attempt is made immediately
- If payment method is removed during grace period: grace period continues (retries stop)

The CLAUDE.md version tells an agent "don't touch this." The behavior spec tells the agent (and the product owner, and QA, and the next developer) exactly what the product promises — in enough detail to verify whether the promise is still being kept.
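
To show why the structured version is easier for tools to work with, here is a small Python sketch that splits a behavior block like the one above into named sections. It only handles the illustrative layout shown in this post (one `##` title followed by `###` sections); the real PBC format may define more than this, and `parse_behavior` is a hypothetical helper, not part of the reference tooling.

```python
def parse_behavior(text: str) -> dict:
    """Split one behavior block into its title and named sections."""
    behavior = {"title": None, "sections": {}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("### "):
            # A section heading such as "### Invariants".
            current = line[4:].strip().lower()
            behavior["sections"][current] = []
        elif line.startswith("## "):
            # The behavior title.
            behavior["title"] = line[3:].strip()
            current = None
        elif current is not None:
            # Drop a leading bullet marker so items read as plain statements.
            item = line[2:] if line.startswith("- ") else line
            behavior["sections"][current].append(item)
    return behavior


SAMPLE = """\
## Grace period enforcement

### When
A subscription payment fails

### Invariants
- Grace period duration must be exactly 14 days
- No data deletion occurs during grace period
"""

spec = parse_behavior(SAMPLE)
print(spec["title"])
print(spec["sections"]["invariants"])
```

Once invariants are addressable like this, a reviewer or a CI job can ask about one specific promise instead of rereading the whole file.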

The market knows something is missing

This isn't a theoretical gap. The pain is already showing up across the ecosystem:

Anthropic published hooks documentation — deterministic controls that run outside the model, because they recognized that prompt-level instructions aren't reliable enough for enforcement
OpenAI published a harness engineering series describing how they built their entire product with agents — structured documentation directories as the source of truth, architectural constraints enforced mechanically via linters and CI — then didn't productize the approach
Gartner started covering agentic AI governance as a distinct market category

The vocabulary is fragmenting — guardrails, governance, policies, behavior, intent — but the underlying need is converging: teams need a structured way to specify what agents are and aren't allowed to do, above the instruction file layer.

How to start

You don't need to spec your entire product on day one. Start with the module where an agent mistake would hurt most — usually billing, auth, or entitlements.

1. Create billing.pbc.md in your repo
2. Write the 3-5 behaviors that are non-negotiable (grace period, refund window, upgrade logic)
3. For each behavior: what must happen, what must not happen, edge cases
4. Point your CLAUDE.md or AGENTS.md at it: "Read *.pbc.md files before modifying any billing, auth, or entitlement logic"

That last step is the bridge — your existing instruction files become the pointer to the behavior spec. They work together, not against each other.
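
The pointer step can also be backed by a small script. The Python sketch below picks out which spec files are relevant to a set of changed paths. It assumes a hypothetical naming convention in which `billing.pbc.md` covers any path that has a `billing` directory or a `billing` file stem; `specs_to_read` and `touches` are illustrative names, not part of any existing tool.

```python
from pathlib import PurePosixPath

SPEC_SUFFIX = ".pbc.md"


def touches(module: str, path: str) -> bool:
    """True if the path has a directory or a file stem named after the module."""
    p = PurePosixPath(path)
    return module in p.parts or p.stem == module


def specs_to_read(changed_paths: list[str], spec_files: list[str]) -> list[str]:
    """Return the spec files whose module name matches any changed path."""
    needed = []
    for spec in spec_files:
        name = PurePosixPath(spec).name
        if not name.endswith(SPEC_SUFFIX):
            continue
        module = name[: -len(SPEC_SUFFIX)]  # "billing.pbc.md" -> "billing"
        if any(touches(module, path) for path in changed_paths):
            needed.append(spec)
    return needed


specs = ["billing.pbc.md", "auth.pbc.md", "entitlements.pbc.md"]
changed = ["src/billing/grace.py", "README.md"]
print(specs_to_read(changed, specs))
```

A team with a different layout would need a different matching rule; the sketch only shows that the link between code paths and behavior specs can be made explicit.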

The stack, not the replacement

The right mental model isn't "PBC replaces CLAUDE.md." It's a stack:

Layer 4: Behavior specs (.pbc.md)                  — product truth — what it promises
Layer 3: Feature specs / PRDs                      — what we plan to build
Layer 2: Session memory / context                  — what we're doing now
Layer 1: Instruction files (CLAUDE.md, AGENTS.md)  — how to work here

Each layer is useful. None replaces the others. Most repos have layers 1-3 covered. Layer 4 is the one that prevents the production incident where an agent does exactly what it was told — and breaks a product promise nobody wrote down.

If your instruction files are working for code conventions but failing for product decisions — this is the layer that's missing.

AI agent context still misses the product layer

Vinh Nguyen — Sun, 29 Mar 2026 15:35:04 +0000

If you spend time around AI coding tools, the conversation has clearly shifted.

People are talking less about prompts and raw model quality, and more about the surrounding system: repo rules, memory, harnesses, evals, and monitoring.

That shift is correct. But even the better AI agent stacks still miss one important layer: product context.

Now the serious work is happening one layer above the model:

OpenAI is writing about harness engineering and internal monitoring for coding agents.
Anthropic is writing about effective harnesses for long-running agents and even parallel agent teams building a C compiler.
Every tool ecosystem now has some version of workflow rules, repo instructions, memory files, and spec-driven coding.

Taken together, these point to the same conclusion: the model is only part of the system. Reliable agentic coding depends on the surrounding stack.

That's real progress. But it still leaves one missing layer.

The modern agent stack is getting better at telling agents how to work. It still does a poor job telling them what the product must continue to do.

What AI agent context solves today

Most serious AI-native repos are already building some version of the same stack:

Repo instructions like AGENTS.md, CLAUDE.md, or tool-specific rules tell the agent how to work in this codebase.
Memory files preserve what happened in prior sessions so the next run doesn't start cold.
Harnesses manage long-running work, handoffs, tool access, and task decomposition.
Evals and monitors check whether the agent stayed within technical or safety boundaries.

Each layer solves a real problem.

Repo instructions reduce workflow mistakes. Memory reduces repeated exploration. Harnesses help agents make progress across long tasks. Evals and monitors catch bad outputs and suspicious behavior.

If your goal is better software engineering execution, this stack makes sense.

Why better harnesses still don't protect product decisions

Here's the problem: an agent can follow every repo rule, use the right harness, pass the tests, and still break the product.

Not by writing obviously bad code. By changing something that looked reasonable from the code alone.

That happens because most product decisions are not explicit in the repo:

Is the 14-day refund window a confirmed policy or a placeholder?
Is the current permission model intentional or just the minimal thing that shipped first?
Is an empty field a bug, a compliance requirement, or a deliberate product choice?
Is this validation rule load-bearing, or leftover code that should be removed?

The agent sees implementation. It does not automatically see product intent, trust level, or business significance.

That is why teams end up saying the same thing after an agent makes a "wrong" change: the code was plausible, but it violated something the team had already decided.

This is not a prompt quality problem. It is a missing artifact problem.

What is missing from AI agent context today?

The missing layer is product truth.

Not a PRD. Not a sprint spec. Not a memory log. Not a test suite.

Product truth answers a narrower and more durable question:

What does this product actually promise to do right now, and which behaviors are confirmed enough that agents should treat them as protected?

That layer needs to capture things like:

confirmed product behaviors
forbidden states and actions
deliberate edge cases
areas that are still provisional
decisions that are actively being explored and should not be treated as settled

Without that layer, every agent is forced to infer product meaning from implementation details.

Sometimes that works. Sometimes it silently introduces product drift.

Why AGENTS.md and memory banks are not enough

This is where teams get confused, because all of these artifacts look similar from the outside. They're usually text files in the repo. They're all readable by both humans and agents. They all seem like "context."

But they operate at different levels:

AGENTS.md tells the agent how to behave as a contributor.
Memory banks preserve what happened across sessions.
Feature specs describe what the team plans to build.
Tests verify implementation behavior in specific scenarios.
Monitors look for dangerous or misaligned actions.

None of those directly answer: which product behaviors are intentional, protected, and safe to build on top of?

You can have all five and still leave the core product layer implicit.

That's why a repo can feel "well-instrumented" for agents and still be fragile when they touch billing, entitlements, onboarding logic, permissions, or compliance-sensitive flows.

What a Product Behavior Contract adds

A Product Behavior Contract adds the missing product layer without replacing the rest of the stack.

It sits alongside your existing agent context and makes the behavioral contract explicit:

What must happen
What must not happen
Which edge cases are deliberate
Which behaviors are confirmed, provisional, or still being explored
Which source files provide evidence
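To make the list above concrete, here is a minimal sketch of what one contract entry could look like as a data structure. This is not the PBC spec format: the class names, field names, behavior ids, example statements, and the file path are all hypothetical, chosen only to mirror the five items above.

```python
from dataclasses import dataclass
from enum import Enum


class Trust(Enum):
    """How settled a behavior is (hypothetical labels mirroring the list above)."""
    CONFIRMED = "confirmed"
    PROVISIONAL = "provisional"
    EXPLORING = "exploring"


@dataclass(frozen=True)
class Behavior:
    """One entry in an illustrative behavior contract (not the real PBC schema)."""
    id: str
    trust: Trust
    must: tuple[str, ...] = ()        # what must happen
    must_not: tuple[str, ...] = ()    # what must not happen
    edge_cases: tuple[str, ...] = ()  # deliberate edge cases
    evidence: tuple[str, ...] = ()    # source files that back the claim


def protected_ids(behaviors: list[Behavior]) -> list[str]:
    """Return the ids of behaviors an agent should treat as protected."""
    return [b.id for b in behaviors if b.trust is Trust.CONFIRMED]


# Illustrative entries; the statements and the path are made up for this sketch.
refund_window = Behavior(
    id="billing.refund-window",
    trust=Trust.CONFIRMED,
    must=("Accept refund requests made within 14 days of purchase",),
    must_not=("Silently change the length of the refund window",),
    evidence=("src/billing/refunds.py",),  # hypothetical path
)
permission_model = Behavior(
    id="auth.permission-model",
    trust=Trust.EXPLORING,
    edge_cases=("Current roles are the minimal set that shipped first",),
)

print(protected_ids([refund_window, permission_model]))
# prints ['billing.refund-window']
```

The point of the sketch is that the trust level lives next to the behavior itself, so the question "is this number deliberate?" has an answer that does not depend on reading the implementation.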

That changes the quality of agent decisions.

When the contract says a billing limit is confirmed, the agent stops treating it as an arbitrary number it can refactor freely. When the contract says the permission model is still under exploration, the agent stops extending it as if the design were settled. When a behavior is marked provisional, humans and agents both know not to overfit to it.
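The three cases in that paragraph can be sketched as a small lookup an agent harness might run before touching a behavior. This is an illustration under assumptions: the trust labels, the guidance wording, the behavior ids, and the fallback for behaviors with no entry are all hypothetical and are not taken from the PBC spec.

```python
# Guidance per trust level, paraphrasing the paragraph above (wording is illustrative).
GUIDANCE = {
    "confirmed": "protected: preserve this behavior, do not refactor it away",
    "provisional": "unsettled: avoid building new logic that depends on it",
    "exploring": "design not settled: do not extend it as if it were",
}


def guidance_for(contract: dict[str, str], behavior_id: str) -> str:
    """Look up how an agent should treat a behavior.

    `contract` maps a behavior id to a trust label. A behavior with no
    entry gets a conservative fallback, which is an assumption of this sketch.
    """
    trust = contract.get(behavior_id)
    if trust is None:
        return "no contract entry: intent unknown, ask a human before changing"
    if trust not in GUIDANCE:
        raise ValueError(f"unknown trust level: {trust!r}")
    return GUIDANCE[trust]


# Hypothetical contract with one entry per case from the text.
contract = {
    "billing.limit": "confirmed",
    "auth.permission-model": "exploring",
    "onboarding.email-step": "provisional",
}

for behavior_id in ("billing.limit", "auth.permission-model", "search.ranking"):
    print(f"{behavior_id}: {guidance_for(contract, behavior_id)}")
```

Note the fourth case the lookup makes visible: a behavior with no entry at all. Without a contract, every behavior is in that state, and the agent has to guess.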

This is the difference between code context and product context.

Code context tells the agent what exists.
Product context tells the agent what must remain true.

The agent stack is converging. The product layer is next.

My read of the current ecosystem is that OpenAI, Anthropic, and the broader tool market are all converging on the same architecture:

give agents better maps of the repo
help them work across long time horizons
break work into clearer sub-tasks
evaluate and monitor their behavior more rigorously

That is the right direction.

But as agents get better at execution, the cost of missing product truth goes up, not down.

A stronger coding agent can now move faster, touch more files, and refactor more confidently. If the product layer is still implicit, that extra capability just lets it make bigger product mistakes more efficiently.

The next mature AI-native repo will not stop at workflow rules, harnesses, and evals. It will also include a durable product artifact that says what the software is actually supposed to do.

That's the role of a Product Behavior Contract.

The Product Behavior Contract (PBC) format is open source. The PBC viewer lets you browse structured contracts in the browser, and the PBC spec is public if you want to see the model directly.

If you're already using AGENTS.md, memory files, or spec-driven coding, this is the next layer I'd look at: a durable artifact for product truth, not just execution rules.

Disclosure: this article was drafted with AI assistance and reviewed, edited, and fact-checked by the author before publication.