DEV Community: SAURABH SHUKLA

Model Context Protocol Through The Agent Stack Lens: What Broke, What's Fixed July 28, and What to Check Before Your Next mcp.json

SAURABH SHUKLA — Sat, 25 Jul 2026 03:39:11 +0000

If you've added an MCP server to a claude_desktop_config.json or an mcp.json file this year by copy-pasting a connection string, this one's for you. This isn't a "is MCP good or bad" post — it's a breakdown of exactly what broke at the protocol level in 2026, what's shipping in six days to fix part of it, and the specific checks worth adding to your own review process before the next server goes in.

On April 15, 2026, OX Security disclosed a flaw sitting inside every official Model Context Protocol SDK — Python, TypeScript, Java, Rust. All four. Anthropic confirmed the behavior was intentional. Then it declined to change it.

The flaw let anyone who could influence a server's configuration file run arbitrary shell commands on the host machine. OX Security counted more than 200,000 vulnerable instances sitting inside a supply chain of over 150 million downloads.

I connect new MCP servers most weeks. I never once asked whether the protocol itself, the wiring underneath every server I trust, shipped with a design decision nobody was willing to walk back. I do now.

Where MCP Sits On The Agent Stack

I map every AI system I build against the EchoNerve Agent Stack™ — six layers: Models, Tools, Memory, Agents, Workflows, Applications. MCP lives almost entirely in the Tools layer, where a model reaches outside itself and calls a database, a file system, an API.

Before MCP, every one of those connections was custom code. MCP replaced that with one shared interface — any MCP-compatible client can call any compliant server the same way. The Tools layer decides what an agent can touch in the real world. A bad Models-layer output produces a wrong sentence. A bad Tools-layer connection produces a wrong action, against a real system.

Here's what that looks like concretely. A .mcp.json entry like this hands an agent full read/write access to a production database in one config block, with zero review gate in front of it:

{
  "mcpServers": {
    "postgres-prod": {
      "command": "npx",
      "args": ["-y", "@some/postgres-mcp-server"],      "env": {
        "DATABASE_URL": "postgres://user:pass@prod-host:5432/db"      }
    }
  }
}

Before MCP, that access took a custom integration and usually a second engineer's sign-off. MCP compressed the whole process into a copy-pasted block. The convenience is real. So is the fact that the review step got compressed right along with it.

The Flaw Anthropic Won't Fix

MCP's STDIO transport — the local process interface most desktop and dev tools use — passes configuration values straight into shell execution, with no sanitization step in between. If an attacker can influence that configuration file (a malicious npm package, a compromised toolchain, an insider with write access), their command runs on your machine. It runs even if the target MCP server never starts successfully.

The vulnerability now carries a formal identifier: CVE-2026-30623. OX Security found it present across all four officially supported SDKs simultaneously — it's not a bug in one team's implementation, it's a decision baked into the reference architecture every downstream server copied.

A quick sanity check you can run today: audit which of your configured servers actually run over STDIO with environment values you don't fully control, versus a remote transport with a real auth boundary in front of it.

# crude pass at spotting STDIO servers with inline secrets in your config
jq -r '.mcpServers | to_entries[] | select(.value.command != null) | .key' ~/.config/*/mcp.json 2>/dev/null

Adapt the path to wherever your client stores its config — the point is just to stop trusting a server by name and start checking transport + credential exposure per entry.

Tool Poisoning Doesn't Need The CVE

A second attack pattern hit the Tools layer just as hard, and it doesn't need a protocol-level bug. Tool poisoning hides a malicious instruction inside a tool's own description field — the text the model reads to decide how and when to call it. The model follows it. Nothing about the request looks abnormal from the outside.

In March 2026, more than 340 developers installed a server published as mcp-jira-sync before anyone caught it. Its list_issues tool carried a hidden instruction in its description field that caused connected agents to include the contents of the local ~/.aws/credentials file in every API call sent back to the attacker's server. Nobody had to click anything.

Koi Security found a similar pattern in an npm package called postmark-mcp, which shipped fifteen clean releases before version 1.0.16 quietly added one line that BCC'd every email an agent sent through it to an attacker's inbox.

An independent census scored 17,468 indexed MCP servers on documentation, maintenance, and reliability. Only 12.9% cleared a high-trust bar. A separate review of ~1,400 servers found 38.7% shipped with no authentication at all.

The Spec Update Landing July 28

The 2026-07-28 MCP specification release candidate rebuilds authorization around OAuth 2.1 and OpenID Connect. Clients will validate the token issuer on every authorization response per RFC 9207, closing a real mix-up attack class where one auth server's response gets replayed against a different server pretending to be it. The spec also drops mandatory sticky sessions, so a remote MCP server can sit behind an ordinary round-robin load balancer.

What it doesn't do: retroactively patch the 200,000 vulnerable STDIO instances, or stop a maintainer from shipping a poisoned tool description post-update. It gives server authors a standard way to handle authorization. It doesn't do the job for them.

Checklist Before Your Next mcp add

Transport check — STDIO server with config you don't fully control gets extra scrutiny before it touches anything with real credentials nearby.
Read the tool descriptions — treat every field the model reads like a PR from a stranger, not just the code.
Ignore install count as a trust signal — check maintenance activity and auth requirements instead.
Log tool calls — parameters and identity attached, per the NSA's May 2026 guidance, even for servers you trust today.
Post–July 28: verify actual OAuth 2.1 adoption, not just an "MCP-compatible" label on a landing page.

MCP isn't going away, and it solved a real integration problem — dozens of hand-built connections collapsed into one shared interface. But the Tools layer is where an agent's decisions become actions against real systems, and 2026 proved that layer needs its own audit discipline.

Full breakdown with the six-layer framework, the complete numbers, and the mermaid diagrams: Model Context Protocol Through The Agent Stack Lens

Why Your AI Agent's Context Window Isn't Memory (And What to Build Instead)

SAURABH SHUKLA — Sat, 18 Jul 2026 05:35:37 +0000

Originally published at echonerve.com

Canonical URL: https://echonerve.com/why-ai-agents-need-memory/

If you're building agents on top of Claude, GPT, or Gemini and relying on a large context window to carry state across a session, there's a benchmark you should know about before you scale that pattern into production.

The context rot problem

Chroma's July 2025 study ran 18 frontier models — GPT-4.1, Claude 4, Gemini 2.5, Qwen3, and others — through needle-retrieval, distractor, haystack-structure, and conversational QA tests. Performance degraded as input length grew, well before any model hit its hard context limit, even on trivially simple tasks. No errors thrown — just steadily worse output, which is the failure mode that's hardest to catch in production because nothing tells you it's happening.

The stranger result: across all 18 models, performance was better on shuffled documents than on logically coherent ones. If you're piping structured logs, ordered conversation history, or a well-organized knowledge base into a huge context window expecting it to behave like a database, this finding says that structure may be working against you.

Working memory vs. external memory vs. procedural memory

The Agent Stack framework (EchoNerve's model for AI systems: Models -> Tools -> Memory -> Agents -> Workflows -> Applications) treats memory as three distinct components:

Working memory:    the context window itself
                    -> lifetime: one session
                    -> failure mode: context rot as it fills

External memory:    files, vector stores, databases
                    -> lifetime: permanent, retrieved on demand
                    -> failure mode: stale or unfindable entries

Procedural memory:  standing instructions (e.g. a CLAUDE.md /
                     system-prompt-level ruleset)
                    -> lifetime: permanent, loaded every session
                    -> failure mode: never written down at all

Most agent implementations only ever build the first one — and it's the one the benchmark data says degrades hardest under load.

Retrieval beats stuffing - with numbers

LoCoMo (1,540 questions: single-hop, multi-hop, open-domain, temporal) and LongMemEval (500 questions) are the benchmarks purpose-built to test exactly this. Mem0's 2026 published results: 91.6% on LoCoMo while averaging under 7,000 tokens per retrieval, versus a full-context-stuffing baseline that requires ~500,000 tokens on the same benchmark. p95 latency: 1.44s for retrieval vs. 17.12s for stuffing - a 91% reduction. These are vendor-reported numbers (discount accordingly), but they point the same direction as Chroma's independent, adversarial findings: small relevant retrievals outperform large stuffed windows on accuracy, latency, and token cost simultaneously.

What to actually build

Three realistic substrate options as of mid-2026:

Hosted memory services (e.g. Mem0) - fastest to integrate, retrieval quality without owning infra, but a core layer of your stack sits behind a third-party API.
Open-source stateful frameworks (e.g. Letta, formerly MemGPT) - the agent itself is a persistent, stateful object; more control, more infra to operate.
Plain files - markdown/JSON in a versioned store, loaded selectively per task. Least sophisticated at scale, but every memory entry is human-readable, diffable in git, and auditable by opening the file.

The wrong answer is the default: no substrate at all, everything crammed into the context window every time - which is the exact configuration Chroma's study describes, and the one most agents in production still run.

Why this matters beyond output quality

There's a second reason to build this deliberately: auditability. Anthropic's Managed Agents (April 2026) shipped persistent, versioned memory stores with audit trails - memories as files you can export, diff, and inspect. As autonomous agents multiply (Gartner projects 150,000+ per Fortune 500 company by 2028), a memory layer you can actually inspect becomes the closest thing to a flight recorder for what an agent did and why.

Full writeup with sources and the complete framework: https://echonerve.com/why-ai-agents-need-memory/

A 5-Stage Filter for AI News Signal (Or: How to Stop Overreacting to Benchmark Numbers)

SAURABH SHUKLA — Mon, 13 Jul 2026 02:05:54 +0000

If you follow AI releases and benchmarks for work - deciding what to adopt, what to evaluate, what to write internal recommendations about - there's a specific failure mode worth naming: treating the first data point as if it were already confirmed.

Case in point: Anthropic's Mythos model found 10,000+ critical bugs in production code in a single month via its own HackerOne bounty program. Most coverage concluded "AI has surpassed human security researchers" within hours. Three weeks later, a human researcher found CVE-2026-46242, a critical Linux kernel race condition, sitting in the exact subsystem Mythos had already searched and partially cleared. The 10,000 number measured discovery. It said nothing about verification, and verification capacity didn't scale anywhere near the same rate.

I built a small filter to stop making this mistake myself. Five stages, each requiring strictly more evidence than the last:

def classify_signal(data_point, related_signals, adoption_data=None):
    """
    Signal Framework: classify an AI news item by evidence strength,
    not by how confident the headline sounds.
    """
    if not related_signals:
        return "SIGNAL"  # one data point, no conclusion yet - just log it

    independent_sources = {s.source for s in related_signals if s.independent}
    if len(independent_sources) < 2:
        return "SIGNAL"  # repeated mentions of ONE source != a pattern

    if not adoption_data:
        return "PATTERN"  # 2+ independent signals, same direction - still tentative

    if not adoption_data.changes_operating_behavior:
        return "TREND"  # confirmed by real data - write with specific numbers now

    return "MARKET_SHIFT"  # changing how people/companies actually operate

The part that actually matters is the independent_sources check. The Leaderboard Illusion paper (arXiv:2504.20879) found Meta tested 27 private Llama 4 variants before choosing which score to publish, and OpenAI and Google alone hold close to 20% of all Chatbot Arena battle data each, with 83 open-weight models combined holding under 30%. A hundred articles citing the same arena number aren't a hundred signals. They're one signal, laundered through a hundred bylines, and the framework only works if that gets caught before you build an argument on top of it.

Same failure mode, different domain: OpenAI's own Feb 23, 2026 audit of SWE-bench Verified found frontier models were reciting memorized gold-patch diffs, not reasoning - and 59.4% of "unsolved" problems had broken tests unrelated to the actual problem statement. Once OpenAI switched to the contamination-resistant SWE-bench Pro, scores that read ~80% on the original benchmark dropped to ~23%. An entire industry had promoted a Signal to a Trend without the independent-confirmation step.

Practical version if you want to run this yourself: log three raw signals every Monday in a plain file, no analysis, just the data point and source. Don't let yourself draw a conclusion for four weeks. Once you have signals across two-plus weeks, check whether they're actually independent before calling it a Pattern. Only promote to Trend once you have adoption data or a structural change you can point to - not more headlines.

Full framework, plus a worked four-week example (MCP's 97M downloads to CLAUDE.md's 220K stars to NatureBench to the Mythos miss to the "Verification Gap"), is at echonerve.com: https://echonerve.com/the-echonerve-signal-framework-how-to-spot-important-ai-trends-before-everyone-else/

The Knowledge Flywheel: A Retrieval/Synthesis Split for AI-Assisted Research (and a Filter Rule for Your Knowledge Base)

SAURABH SHUKLA — Sat, 04 Jul 2026 03:57:41 +0000

If you're using an LLM as a research assistant — reading docs, summarizing papers, synthesizing findings across a codebase or a stack of PDFs — there's a specific failure mode worth knowing about, and it now has a benchmark attached to it.

NatureBench (published June 23, 2026) ran AI coding agents against 90 tasks pulled directly from peer-reviewed Nature papers across six scientific domains, on a containerized pipeline called NatureGym built to remove the environment fragmentation that made earlier agent benchmarks unreliable. The agents understood the assignments. They still picked the wrong method most of the time — not because they misread the task, but because they defaulted to the nearest approach already present in their training data. The paper's term for this is "methodological translation." Functionally, it's retrieval bias wearing a reasoning costume.

I ran into a developer-scale version of the same thing building a research pipeline on top of Claude. Simple test if you want to replicate it: hand a model N related source documents (I used 12 competitor teardowns) and ask it to identify the single pattern connecting all of them. In my run, it returned N/2 pairwise summaries — accurate, well-organized, and completely disconnected from each other. It had aggregated. It hadn't synthesized. I found the actual cross-document pattern myself in about 20 minutes by rereading two of the twelve side by side.

The practical takeaway: separate your retrieval step from your synthesis step, explicitly, in your pipeline.

Here's the rule I now enforce, roughly:

def research_pipeline(sources):
    retrieved = ai_summarize(sources)          # delegate freely — models are good at this
    insight = write_insight_by_hand(retrieved)  # do NOT delegate this step
    if not passes_filter(insight):
        return None  # doesn't make the next cycle faster — discard
    return insight

def passes_filter(insight):
    # the only question that matters for a knowledge base entry:
    return makes_next_research_cycle_faster(insight)

The write_insight_by_hand step is a hard constraint, not a suggestion — five sentences, max, written before the model touches the material further. That constraint is the entire fix NatureBench's data points to: the agents' failure rate wasn't a reasoning-capacity problem, it was an unfiltered-retrieval problem. A stronger model retrieves the wrong method faster; it doesn't retrieve the right one without a synthesis step in the loop.

I frame the whole pipeline as six stages (Research → Insights → Content → Distribution → Feedback → Knowledge Base) — I call it the Knowledge Flywheel™ — where the "Knowledge Base" stage is just a persistence layer with one filter function attached: an entry survives review only if it demonstrably speeds up a future research cycle. Tracked this over a month of my own output: 11 of 14 pieces started from zero context; 3 built on a prior validated insight. That ratio is the whole argument for building the filter function instead of just accumulating notes.

If you're building any kind of RAG-adjacent research tool or agent pipeline, the actionable version of this is small: add an explicit, human-authored synthesis checkpoint between retrieval and output generation, and don't let anything into persistent storage that hasn't passed a "does this make the next run cheaper" test.

Full framework write-up (six-stage diagram, failure modes, benchmark details) is at echonerve.com

The Cowork Loop: A Software Pattern for AI Workflows That Actually Compound

SAURABH SHUKLA — Sun, 28 Jun 2026 04:08:35 +0000

If you've spent time building with LLMs, you've hit this wall: you get your agent or workflow running, the outputs are decent, and then... they stay decent. Six months later, the same prompts produce roughly the same quality. The model hasn't gotten worse. The workflow hasn't improved.

The reason is almost always the same: you're missing Phase 4.

The pattern most AI workflows skip

Here's the loop most developers run without naming it:

Write a system prompt and user prompt (Brief)
The model generates output (Generate)
You read the output and decide if it's good (Review)
You ship it and close the session

That's phases 1–3. Phase 4 — Refine — is the one that compounds.

Refine is not about modifying the output. It's about updating the system that produced it. Before closing the session, you capture what you learned: what the system prompt was missing, what framing produced better output, what output format made evaluation faster. Two sentences to a shared context file.

This is exactly analogous to writing a retrospective after a sprint. Most solo AI workflows don't have one.

The Cowork Loop™: four phases

Phase 1 — Brief

The quality of your output is determined at this phase, not phase 2. A strong Brief is a complete context transfer: standing context (what's always true), session context (what's true right now), and the task (specific enough to have one reasonable interpretation).

In practice, this means loading a persistent context file at the start of every relevant session. Here's a minimal CLAUDE.md structure:

# Context

## About this project
[project name, goal, constraints]

## Output standards
[what good output looks like for this workflow]

## Audience
[who the output is for, what they need]

## Style rules
[positive: what to do / negative: what to avoid]

## Recent signals
[updated Phase 4 captures — what's working, what to change]

The ## Recent signals section is where Phase 4 writes to. This is the accumulation layer.

Phase 2 — Generate

The model executes within the constraints you've set. Best practices:

Request structured output where possible — it speeds up Phase 3 significantly
Ask the model to flag uncertainty explicitly ("If you're uncertain about X, say so")
Set output scope precisely — over-generation is harder to evaluate than precise generation

Phase 3 — Review

The human evaluation layer. Four questions:

Does it answer the right question (not just the question typed)?
Is the reasoning sound — do conclusions follow from evidence?
Does it meet the quality bar for this workflow?
What's the delta between "good enough" and "excellent"?

Question 4 is what most people skip. Finding that delta is what Phase 4 acts on.

If the output is directionally wrong, go back to Phase 1 with a sharper Brief. Refining a wrong direction produces a more polished wrong direction.

Phase 4 — Refine

Two actions: improve the current output, and update the shared context.

Updating the context is the one that compounds. Add the Phase 3 delta to your context file before closing the session. Not a full rewrite — two sentences:

2026-06-24: Leading with a specific date/event in the hook produces better engagement than leading with a thesis statement. Update default hook template.

Next session, that signal is loaded in the Brief. The next output starts ahead of where today's ended.

Over 90 sessions, the ## Recent signals section becomes a distilled record of everything you've learned about what produces good output for this workflow. It's self-documenting institutional memory.

Why OpenAI just built this into infrastructure

On June 4, 2026, OpenAI shipped Dreaming V3 — a background process that automatically synthesizes ChatGPT conversation history and carries the important context forward into new sessions. Free for every user, compute cost reduced 5x.

That's Phase 4 automated at the platform level.

The engineering insight is correct: Phase 4 is the step most people skip, and automating it removes the friction that causes skipping.

The limitation: automated synthesis is bounded by the quality of what went in. Unstructured conversations produce structured summaries of unstructured thinking. Deliberate Cowork Loop passes — where Phase 3 explicitly named what to capture and Phase 4 wrote it down — produce richer material for the synthesis to work with.

If you're building workflows on top of ChatGPT, Dreaming V3 and the Cowork Loop™ are complementary, not competing. The automation gets better material; you get better synthesis.

Minimum viable implementation

Create a context.md (or CLAUDE.md) file for your most recurring AI workflow
Write the five things you re-explain most often — that's your initial standing context
At the end of your next session, add two sentences: what the Brief was missing, what worked
Load that file at the start of every relevant session going forward

Do this for three weeks. Then read your ## Recent signals section. You've built a Brief calibrated to your actual workflow — not a default template, but a real system refined by real sessions.

That's the Cowork Loop. The compounding takes care of itself after that.

Full framework writeup (with failure modes and the CLAUDE.md structure I actually use) at the canonical version: echonerve.com/the-echonerve-cowork-loop

The Agent Stack™: Why Your AI Agent Breaks in Production (A 5-Layer Debugging Framework)

SAURABH SHUKLA — Fri, 19 Jun 2026 05:03:27 +0000

If you've ever deployed an AI agent that worked perfectly in testing and became unreliable in production, this framework is for you.

The standard debugging instinct is to blame the model or the prompt. After 18 months of building AI-assisted workflows, I've found the failure is almost never there. It's in the stack — and usually in the layers that don't get written about.

Here's the framework I use: the Agent Stack™.

The 5 Layers

Every AI system — from a simple Claude workflow to a multi-agent production deployment — is composed of five layers. Each has its own failure modes. Weakness in any single layer degrades the entire system.

Layer 5: Human Layer     ← strategic oversight checkpoints
Layer 4: Behavior Layer  ← governs how the agent acts
Layer 3: Tools Layer     ← external system access
Layer 2: Memory Layer    ← context persistence
Layer 1: Model Layer     ← underlying LLM capability

Layer 1: Model

The most discussed, least important for most reliability problems.

Frontier model gap on standard benchmarks (MMLU, HumanEval): ~3-5%. That spread is smaller than the behavioral variance you get from inconsistent prompting on the same model.

Production failure mode: Blaming the model when the architecture is broken. A more capable model inside a broken system produces faster, more convincing wrong answers.

Fix: Treat model selection as a replaceable architectural decision, not a foundation. Design the system first.

Layer 2: Memory

Where most deployments fail silently.

LLMs are stateless by default. Every session starts at zero. For single tasks, fine. For ongoing workflows — content pipelines, research programs, team-level operations — statelessness is a fundamental architectural flaw.

Three components to design explicitly:

Working memory: the context window. Finite, active, temporary.
External memory: structured files/databases the agent retrieves from on-demand. This is where organizational knowledge lives.
Procedural memory: persistent instructions (system prompts, CLAUDE.md) encoding how tasks should be done.

Production failure mode: Re-explaining the same background every session. Agents that "forget" decisions made last week. Inconsistent behavior because the agent is operating on different context each time.

Fix for external memory:

# context.md (loaded at session start)
## Organization
- Name: [org name]
- Primary products: [...]
- Key terminology: [...]

## Current project
- Goal: [...]
- Constraints: [...]
- Decisions made: [...]

Load this at the start of relevant sessions. Compound value every day.

Layer 3: Tools

MCP crossed 97M monthly SDK downloads in March 2026. Over 10,000 servers in public registries. This layer is increasingly well-solved at the infrastructure level.

What MCP doesn't solve: which tools to connect, in what sequence, with what authorization scope.

Production failure mode: Connecting 15 MCP servers with no coherent policy. The agent has access to email, Slack, GitHub, a CRM, a database — and no architectural understanding of what it should do with any of them.

Fix: tools policy (one sentence each)

## Tools Policy
- Email (MCP): read and draft only; never send without explicit human approval
- GitHub (MCP): read access; PR comments allowed; never merge autonomously
- Database (MCP): read queries only; write requires explicit task authorization

Layer 4: Behavior

The highest-leverage layer. The most consistently skipped.

This is the Karpathy/CLAUDE.md insight. In January 2026, Andrej Karpathy documented that AI coding agents "make silent wrong assumptions, overcomplicate simple solutions, and edit code without understanding full scope." By April, a developer encoded four behavioral principles in a 65-line markdown file. It hit 100K GitHub stars in days. Combined mirrors: 220K stars.

Every developer who starred it recognized their own agents.

What to specify in a behavior layer:

# Behavior Guidelines

## Task framing
- Ask clarifying questions when scope is ambiguous; don't assume
- Confirm intent before starting tasks with irreversible side effects

## Output standards
- Code changes: minimal scope — touch only what the task requires
- Written output: [format, length, quality criteria]

## Scope limits
- Do not modify files outside the current task scope
- Do not access [X] without explicit authorization

## Behavioral invariants (hold across all tasks)
- Never delete without confirmation
- Never send external messages autonomously
- Flag uncertainty before proceeding on irreversible actions

Start here. One hour of behavior layer design will outperform any model upgrade.

Layer 5: Human

Not everywhere. Not nowhere. At specific designed checkpoints.

Four patterns:

Approval gates: hard stops before irreversible actions (send email, deploy code, delete data)
Review loops: scheduled aggregate review before output is acted on
Escalation triggers: conditions that surface a task to a human rather than completing it
Feedback channels: mechanisms to correct agent behavior and update memory

The calibration heuristic: invisible on routine tasks, unmissable on consequential ones. If a human reviews every output, the agent has too little autonomy. If no human is ever in the loop, the agent has too much.

The Production Failure Pattern

Most teams have 2 of 5 layers: Model + Tools.

Memory: absent. Every session starts from zero.
Behavior: absent or minimal. Agent runs on default training behavior (optimized for generic helpfulness, not your standards).
Human: ad hoc. Someone reviews things sometimes.

Result: decent output in isolation, inconsistent at scale. Conclusion: "AI isn't ready." Real diagnosis: the stack wasn't designed.

A 5-Minute Audit

Ask one question per layer:

Model: Do you know why you chose your current model, and what it handles better/worse than alternatives?
Memory: Does your agent have the context it needs without you re-explaining every session?
Tools: Have you explicitly scoped what each tool can and cannot do?
Behavior: Have you written explicit guidelines — not just a task prompt, but behavioral rules for ambiguity, scope, and quality?
Human: Have you defined exactly when you review output, what triggers escalation, and how corrections feed back into the system?

Can't answer 2+? You have an architectural gap. That's where your reliability problems live.

Full breakdown with framework diagrams and the complete audit on echonerve.com (canonical URL): https://echonerve.com/the-echonerve-agent-stack-a-new-way-to-understand-ai-systems/

What layer is the actual bottleneck in your production deployments?