DEV Community

SAURABH SHUKLA


The Agent Stack™: Why Your AI Agent Breaks in Production (A 5-Layer Debugging Framework)

If you've ever deployed an AI agent that worked perfectly in testing and became unreliable in production, this framework is for you.

The standard debugging instinct is to blame the model or the prompt. After 18 months of building AI-assisted workflows, I've found the failure is almost never there. It's in the stack — and usually in the layers that don't get written about.

Here's the framework I use: the Agent Stack™.


The 5 Layers

Every AI system — from a simple Claude workflow to a multi-agent production deployment — is composed of five layers. Each has its own failure modes. Weakness in any single layer degrades the entire system.

Layer 5: Human Layer     ← strategic oversight checkpoints
Layer 4: Behavior Layer  ← governs how the agent acts
Layer 3: Tools Layer     ← external system access
Layer 2: Memory Layer    ← context persistence
Layer 1: Model Layer     ← underlying LLM capability

Layer 1: Model

The most discussed layer, and the least important for most reliability problems.

The gap between frontier models on standard benchmarks (MMLU, HumanEval) is roughly 3-5%. That spread is smaller than the behavioral variance you get from inconsistent prompting on the same model.

Production failure mode: Blaming the model when the architecture is broken. A more capable model inside a broken system produces faster, more convincing wrong answers.

Fix: Treat model selection as a replaceable architectural decision, not a foundation. Design the system first.
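
One way to keep the model replaceable is to have the agent depend on a small interface rather than on a vendor SDK. The sketch below is illustrative only: `ChatModel`, `EchoModel`, and `Agent` are hypothetical names, and `EchoModel` is a stand-in that echoes its input where a real backend would call a provider.

```python
from dataclasses import dataclass
from typing import Protocol


class ChatModel(Protocol):
    """Anything that turns a system prompt and a user message into text."""

    def complete(self, system: str, user: str) -> str: ...


@dataclass
class EchoModel:
    """Stand-in backend for local testing; a real one would call a vendor SDK."""

    name: str

    def complete(self, system: str, user: str) -> str:
        return f"[{self.name}] {user}"


@dataclass
class Agent:
    """The agent depends on the ChatModel interface, never on a vendor."""

    model: ChatModel
    system_prompt: str

    def run(self, task: str) -> str:
        return self.model.complete(self.system_prompt, task)


agent = Agent(model=EchoModel("model-a"), system_prompt="Follow the behavior guidelines.")
first = agent.run("Summarize the release notes")

# Swapping the model is a one-line change; memory, tools, and behavior stay put.
agent.model = EchoModel("model-b")
second = agent.run("Summarize the release notes")
```

With this shape, a model upgrade touches one constructor call, and the other four layers can be tested against the stand-in.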


Layer 2: Memory

Where most deployments fail silently.

LLMs are stateless by default. Every session starts at zero. For single tasks, fine. For ongoing workflows — content pipelines, research programs, team-level operations — statelessness is a fundamental architectural flaw.

Three components to design explicitly:

  • Working memory: the context window. Finite, active, temporary.
  • External memory: structured files/databases the agent retrieves from on demand. This is where organizational knowledge lives.
  • Procedural memory: persistent instructions (system prompts, CLAUDE.md) encoding how tasks should be done.

Production failure mode: Re-explaining the same background every session. Agents that "forget" decisions made last week. Inconsistent behavior because the agent is operating on different context each time.

Fix for external memory:

# context.md (loaded at session start)
## Organization
- Name: [org name]
- Primary products: [...]
- Key terminology: [...]

## Current project
- Goal: [...]
- Constraints: [...]
- Decisions made: [...]

Load this at the start of relevant sessions. The value compounds as the file accumulates decisions and terminology.
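
Loading the file can be as simple as prepending it to whatever instructions the session already uses. This is a minimal sketch, assuming the context lives in a local `context.md`; `build_system_prompt` is a hypothetical helper, and the project goal in the demo is placeholder text.

```python
import tempfile
from pathlib import Path


def build_system_prompt(base_instructions: str, context_path: Path) -> str:
    """Prepend external memory (context.md) to the procedural instructions.

    If the file is missing, fall back to the base instructions alone so the
    session still starts, just without the shared background.
    """
    if not context_path.exists():
        return base_instructions
    context = context_path.read_text(encoding="utf-8").strip()
    return f"{context}\n\n---\n\n{base_instructions}"


# Demo with a throwaway file; the contents are placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "context.md"
    path.write_text("## Current project\n- Goal: ship v2\n", encoding="utf-8")
    prompt = build_system_prompt("Follow the behavior guidelines.", path)
    missing = build_system_prompt("Follow the behavior guidelines.", Path(tmp) / "nope.md")
```

Putting the context first and the instructions last is a design choice for this sketch, not a rule; the point is that the same file feeds every session.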


Layer 3: Tools

MCP (Model Context Protocol) crossed 97M monthly SDK downloads in March 2026, and public registries list over 10,000 servers. This layer is increasingly well-solved at the infrastructure level.

What MCP doesn't solve: which tools to connect, in what sequence, with what authorization scope.

Production failure mode: Connecting 15 MCP servers with no coherent policy. The agent has access to email, Slack, GitHub, a CRM, a database — and no architectural understanding of what it should do with any of them.

Fix: write a tools policy, one sentence per tool.

## Tools Policy
- Email (MCP): read and draft only; never send without explicit human approval
- GitHub (MCP): read access; PR comments allowed; never merge autonomously
- Database (MCP): read queries only; write requires explicit task authorization
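
A written policy is easier to trust when something checks it before a tool call goes out. The sketch below encodes the three rules above as data. The action names (`read`, `draft`, `send`, and so on) are assumptions rather than MCP identifiers, "write requires explicit task authorization" is treated here as needing approval, and "never merge autonomously" is read strictly as a deny.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolPolicy:
    allowed: frozenset[str]          # actions the agent may take on its own
    needs_approval: frozenset[str]   # actions that require a human sign-off


# Hypothetical action names mirroring the policy text above.
POLICY = {
    "email": ToolPolicy(frozenset({"read", "draft"}), frozenset({"send"})),
    "github": ToolPolicy(frozenset({"read", "comment"}), frozenset()),
    "database": ToolPolicy(frozenset({"read"}), frozenset({"write"})),
}


def check(tool: str, action: str) -> str:
    """Return 'allow', 'needs_approval', or 'deny' for a proposed tool call."""
    policy = POLICY.get(tool)
    if policy is None:
        return "deny"  # unknown tools are denied by default
    if action in policy.allowed:
        return "allow"
    if action in policy.needs_approval:
        return "needs_approval"
    return "deny"  # anything not listed (e.g. a GitHub merge) is denied
```

Denying by default means a newly connected server does nothing until someone writes its one-sentence policy.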

Layer 4: Behavior

The highest-leverage layer. The most consistently skipped.

This is the Karpathy/CLAUDE.md insight. In January 2026, Andrej Karpathy documented that AI coding agents "make silent wrong assumptions, overcomplicate simple solutions, and edit code without understanding full scope." By April, a developer encoded four behavioral principles in a 65-line markdown file. It hit 100K GitHub stars in days. Combined mirrors: 220K stars.

Every developer who starred it recognized their own agents.

What to specify in a behavior layer:

# Behavior Guidelines

## Task framing
- Ask clarifying questions when scope is ambiguous; don't assume
- Confirm intent before starting tasks with irreversible side effects

## Output standards
- Code changes: minimal scope — touch only what the task requires
- Written output: [format, length, quality criteria]

## Scope limits
- Do not modify files outside the current task scope
- Do not access [X] without explicit authorization

## Behavioral invariants (hold across all tasks)
- Never delete without confirmation
- Never send external messages autonomously
- Flag uncertainty before proceeding on irreversible actions

Start here. One hour of behavior layer design will outperform any model upgrade.
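
Some behavior rules can also be checked mechanically after the agent acts. As one hedged example, the scope limit "do not modify files outside the current task scope" could be verified against the list of changed files. `out_of_scope` and the paths in the usage note are hypothetical, and the check compares path components only (it does not resolve `..` or symlinks).

```python
from pathlib import PurePosixPath


def _inside(path: PurePosixPath, root: PurePosixPath) -> bool:
    """True when `path` sits under `root`, compared component by component."""
    return path.parts[: len(root.parts)] == root.parts


def out_of_scope(changed_files: list[str], allowed_dirs: list[str]) -> list[str]:
    """Return the changed files that fall outside every allowed directory."""
    roots = [PurePosixPath(d) for d in allowed_dirs]
    violations = []
    for name in changed_files:
        path = PurePosixPath(name)
        if not any(_inside(path, root) for root in roots):
            violations.append(name)
    return violations
```

For a task scoped to `src/billing`, calling `out_of_scope(["src/billing/invoice.py", "README.md"], ["src/billing"])` would flag only `README.md`, which is the kind of silent scope creep the guidelines are meant to prevent.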


Layer 5: Human

Not everywhere. Not nowhere. At specific designed checkpoints.

Four patterns:

  • Approval gates: hard stops before irreversible actions (send email, deploy code, delete data)
  • Review loops: scheduled aggregate review before output is acted on
  • Escalation triggers: conditions that surface a task to a human rather than completing it
  • Feedback channels: mechanisms to correct agent behavior and update memory

The calibration heuristic: invisible on routine tasks, unmissable on consequential ones. If a human reviews every output, the agent has too little autonomy. If no human is ever in the loop, the agent has too much.
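
The approval-gate and escalation patterns can be sketched as a single routing decision made before any action runs. Everything here is an assumption for illustration: the `Action` fields, the self-reported `confidence` score, and the 0.6 threshold are not from any particular framework.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO = "auto"            # run without interrupting anyone
    APPROVAL = "approval"    # hard stop until a human signs off
    ESCALATE = "escalate"    # hand the whole task to a human


@dataclass
class Action:
    name: str
    irreversible: bool = False
    confidence: float = 1.0   # hypothetical self-reported score in [0, 1]


def route(action: Action, min_confidence: float = 0.6) -> Route:
    """Decide whether an action runs, waits for approval, or escalates."""
    if action.confidence < min_confidence:
        return Route.ESCALATE
    if action.irreversible:
        return Route.APPROVAL
    return Route.AUTO
```

Routine, reversible actions pass through as `Route.AUTO`, which keeps the human invisible on the common path and unmissable on the consequential one.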


The Production Failure Pattern

Most teams have 2 of 5 layers: Model + Tools.

Memory: absent. Every session starts from zero.
Behavior: absent or minimal. Agent runs on default training behavior (optimized for generic helpfulness, not your standards).
Human: ad hoc. Someone reviews things sometimes.

Result: decent output in isolation, inconsistent at scale. Conclusion: "AI isn't ready." Real diagnosis: the stack wasn't designed.


A 5-Minute Audit

Ask one question per layer:

  1. Model: Do you know why you chose your current model, and what it handles better/worse than alternatives?
  2. Memory: Does your agent have the context it needs without you re-explaining every session?
  3. Tools: Have you explicitly scoped what each tool can and cannot do?
  4. Behavior: Have you written explicit guidelines — not just a task prompt, but behavioral rules for ambiguity, scope, and quality?
  5. Human: Have you defined exactly when you review output, what triggers escalation, and how corrections feed back into the system?

Can't answer two or more? You have an architectural gap. That's where your reliability problems live.
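
If you want to run the audit across several agents or teams, the scoring rule fits in a few lines. This is a sketch of the rule as stated above (two or more unanswered layers means a gap); the `audit` function and its input shape are hypothetical.

```python
LAYERS = ["Model", "Memory", "Tools", "Behavior", "Human"]


def audit(answers: dict[str, bool]) -> tuple[list[str], bool]:
    """Return the layers you could not answer for, and whether that is a gap.

    A layer missing from `answers` counts as unanswered.
    """
    unanswered = [layer for layer in LAYERS if not answers.get(layer, False)]
    return unanswered, len(unanswered) >= 2


# A team with only Model and Tools covered, the common case described earlier.
gaps, has_gap = audit({"Model": True, "Tools": True})
```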


Full breakdown with framework diagrams and the complete audit on echonerve.com (canonical URL): https://echonerve.com/the-echonerve-agent-stack-a-new-way-to-understand-ai-systems/

What layer is the actual bottleneck in your production deployments?
