SAURABH SHUKLA

Posted on Jun 19

The Agent Stack™: Why Your AI Agent Breaks in Production (A 5-Layer Debugging Framework)

#ai #machinelearning #productivity #discuss

If you've ever deployed an AI agent that worked perfectly in testing and became unreliable in production, this framework is for you.

The standard debugging instinct is to blame the model or the prompt. After 18 months of building AI-assisted workflows, I've found the failure is almost never there. It's in the stack — and usually in the layers that don't get written about.

Here's the framework I use: the Agent Stack™.

The 5 Layers

Every AI system — from a simple Claude workflow to a multi-agent production deployment — is composed of five layers. Each has its own failure modes. Weakness in any single layer degrades the entire system.

Layer 5: Human Layer     ← strategic oversight checkpoints
Layer 4: Behavior Layer  ← governs how the agent acts
Layer 3: Tools Layer     ← external system access
Layer 2: Memory Layer    ← context persistence
Layer 1: Model Layer     ← underlying LLM capability

Layer 1: Model

The most discussed layer, and the least important for most reliability problems.

The gap between frontier models on standard benchmarks (MMLU, HumanEval) is roughly 3-5%. That spread is smaller than the behavioral variance you get from inconsistent prompting of the same model.

Production failure mode: Blaming the model when the architecture is broken. A more capable model inside a broken system produces faster, more convincing wrong answers.

Fix: Treat model selection as a replaceable architectural decision, not a foundation. Design the system first.
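One way to read "replaceable" in code: the rest of the system talks to a narrow interface, and the model sits behind it. This is a minimal sketch, not a vendor API. `ChatModel`, `EchoModel`, `UpperModel`, and `run_task` are hypothetical names made up for illustration, and the two stand-in classes only mimic a model so the example runs offline.

```python
from typing import Protocol


class ChatModel(Protocol):
    """Hypothetical interface: the only thing the rest of the system sees."""

    def complete(self, system: str, user: str) -> str: ...


class EchoModel:
    """Stand-in model for illustration; a real adapter would call a vendor SDK."""

    def complete(self, system: str, user: str) -> str:
        return f"[echo] {user}"


class UpperModel:
    """A second stand-in, to show that swapping needs no other code changes."""

    def complete(self, system: str, user: str) -> str:
        return user.upper()


def run_task(model: ChatModel, task: str) -> str:
    # The system prompt, memory, and policies live outside the model adapter,
    # so changing models does not touch them.
    system = "Follow the behavior guidelines."
    return model.complete(system=system, user=task)


if __name__ == "__main__":
    print(run_task(EchoModel(), "summarize the report"))
    print(run_task(UpperModel(), "summarize the report"))
```

If swapping the adapter forces changes elsewhere, the model has become a foundation rather than a component.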

Layer 2: Memory

Where most deployments fail silently.

LLMs are stateless by default. Every session starts at zero. For single tasks, fine. For ongoing workflows — content pipelines, research programs, team-level operations — statelessness is a fundamental architectural flaw.

Three components to design explicitly:

Working memory: the context window. Finite, active, temporary.
External memory: structured files/databases the agent retrieves from on demand. This is where organizational knowledge lives.
Procedural memory: persistent instructions (system prompts, CLAUDE.md) encoding how tasks should be done.

Production failure mode: Re-explaining the same background every session. Agents that "forget" decisions made last week. Inconsistent behavior because the agent is operating on different context each time.

Fix for external memory:

# context.md (loaded at session start)
## Organization
- Name: [org name]
- Primary products: [...]
- Key terminology: [...]

## Current project
- Goal: [...]
- Constraints: [...]
- Decisions made: [...]

Load this at the start of every relevant session. The value compounds every day.
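Loading the file can be as small as reading it and prepending it to the session's instructions. A sketch under assumptions: `build_system_prompt` is a hypothetical helper, the `---` separator is an arbitrary choice, and the demo file contents are placeholder data.

```python
import tempfile
from pathlib import Path


def build_system_prompt(base_instructions: str, context_path: Path) -> str:
    """Prepend external memory (a context file) to the session's instructions."""
    if not context_path.exists():
        # Fail loudly: a silently missing context file is the failure mode itself.
        raise FileNotFoundError(f"context file not found: {context_path}")
    context = context_path.read_text(encoding="utf-8").strip()
    return f"{context}\n\n---\n\n{base_instructions.strip()}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "context.md"
        # Placeholder content, only so the demo has something to load.
        path.write_text("## Current project\n- Goal: [...]\n", encoding="utf-8")
        print(build_system_prompt("You are a release assistant.", path))
```

Raising on a missing file is deliberate: an agent that quietly starts without its context looks exactly like an agent that "forgot."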

Layer 3: Tools

MCP crossed 97M monthly SDK downloads in March 2026. Over 10,000 servers in public registries. This layer is increasingly well-solved at the infrastructure level.

What MCP doesn't solve: which tools to connect, in what sequence, with what authorization scope.

Production failure mode: Connecting 15 MCP servers with no coherent policy. The agent has access to email, Slack, GitHub, a CRM, a database — and no architectural understanding of what it should do with any of them.

Fix: write a tools policy, one sentence per tool.

## Tools Policy
- Email (MCP): read and draft only; never send without explicit human approval
- GitHub (MCP): read access; PR comments allowed; never merge autonomously
- Database (MCP): read queries only; write requires explicit task authorization

Layer 4: Behavior

The highest-leverage layer. The most consistently skipped.

This is the Karpathy/CLAUDE.md insight. In January 2026, Andrej Karpathy documented that AI coding agents "make silent wrong assumptions, overcomplicate simple solutions, and edit code without understanding full scope." By April, a developer encoded four behavioral principles in a 65-line markdown file. It hit 100K GitHub stars in days. Combined mirrors: 220K stars.

Every developer who starred it recognized their own agents.

What to specify in a behavior layer:

# Behavior Guidelines

## Task framing
- Ask clarifying questions when scope is ambiguous; don't assume
- Confirm intent before starting tasks with irreversible side effects

## Output standards
- Code changes: minimal scope — touch only what the task requires
- Written output: [format, length, quality criteria]

## Scope limits
- Do not modify files outside the current task scope
- Do not access [X] without explicit authorization

## Behavioral invariants (hold across all tasks)
- Never delete without confirmation
- Never send external messages autonomously
- Flag uncertainty before proceeding on irreversible actions

Start here. One hour of behavior layer design will outperform any model upgrade.
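If the guidelines live in a file, a tiny check can stop a section from quietly going missing. A sketch only: `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names, and the check assumes sections are written as `## ` headings like the template above.

```python
REQUIRED_SECTIONS = (
    "Task framing",
    "Output standards",
    "Scope limits",
    "Behavioral invariants",
)


def missing_sections(guidelines: str) -> list[str]:
    """Return required section names that have no '## ' heading in the text."""
    headings = [
        line[3:].strip().lower()
        for line in guidelines.splitlines()
        if line.startswith("## ")
    ]
    # startswith tolerates suffixes such as "(hold across all tasks)".
    return [
        name
        for name in REQUIRED_SECTIONS
        if not any(h.startswith(name.lower()) for h in headings)
    ]


if __name__ == "__main__":
    draft = "# Behavior Guidelines\n\n## Task framing\n- ...\n\n## Scope limits\n- ...\n"
    print(missing_sections(draft))
```

This only checks that the sections exist, not that what is in them is any good. That part is still the hour of design work.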

Layer 5: Human

Not everywhere. Not nowhere. At specific designed checkpoints.

Four patterns:

Approval gates: hard stops before irreversible actions (send email, deploy code, delete data)
Review loops: scheduled aggregate review before output is acted on
Escalation triggers: conditions that surface a task to a human rather than completing it
Feedback channels: mechanisms to correct agent behavior and update memory

The calibration heuristic: invisible on routine tasks, unmissable on consequential ones. If a human reviews every output, the agent has too little autonomy. If no human is ever in the loop, the agent has too much.
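Approval gates and escalation triggers can be sketched as a single routing decision made before an action runs. Everything here is an assumption for illustration: `Action`, `route`, the `confidence` field, and the `0.6` threshold are hypothetical. Only the three irreversible action types come from the list above.

```python
from dataclasses import dataclass

# The approval-gate examples from the list above.
IRREVERSIBLE = {"send_email", "deploy_code", "delete_data"}


@dataclass
class Action:
    name: str
    confidence: float  # hypothetical self-reported score in [0, 1]


def route(action: Action, escalate_below: float = 0.6) -> str:
    """Decide whether an action runs, waits for approval, or goes to a human."""
    if action.name in IRREVERSIBLE:
        return "approval_gate"  # hard stop, regardless of confidence
    if action.confidence < escalate_below:
        return "escalate"  # surface to a human instead of completing
    return "auto"  # routine work: the human layer stays invisible


if __name__ == "__main__":
    print(route(Action("send_email", 0.99)))
    print(route(Action("summarize_thread", 0.30)))
    print(route(Action("summarize_thread", 0.90)))
```

The order of the checks matters: irreversibility is tested first, so a confident agent still cannot skip the gate.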

The Production Failure Pattern

Most teams have 2 of 5 layers: Model + Tools.

Memory: absent. Every session starts from zero.
Behavior: absent or minimal. Agent runs on default training behavior (optimized for generic helpfulness, not your standards).
Human: ad hoc. Someone reviews things sometimes.

Result: decent output in isolation, inconsistent at scale. Conclusion: "AI isn't ready." Real diagnosis: the stack wasn't designed.

A 5-Minute Audit

Ask one question per layer:

Model: Do you know why you chose your current model, and what it handles better/worse than alternatives?
Memory: Does your agent have the context it needs without you re-explaining every session?
Tools: Have you explicitly scoped what each tool can and cannot do?
Behavior: Have you written explicit guidelines — not just a task prompt, but behavioral rules for ambiguity, scope, and quality?
Human: Have you defined exactly when you review output, what triggers escalation, and how corrections feed back into the system?

Can't answer two or more? You have an architectural gap. That's where your reliability problems live.
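The scoring rule is simple enough to write down. A sketch, with `LAYERS` and `audit` as hypothetical names: a layer counts as a gap if you answered "no" or did not answer at all.

```python
LAYERS = ("model", "memory", "tools", "behavior", "human")


def audit(answers: dict[str, bool]) -> tuple[list[str], bool]:
    """Return (layers answered 'no' or not at all, whether that signals a gap)."""
    gaps = [layer for layer in LAYERS if not answers.get(layer, False)]
    # Two or more unanswered questions is the threshold used in the audit.
    return gaps, len(gaps) >= 2


if __name__ == "__main__":
    # The common "Model + Tools only" stack described earlier.
    gaps, has_gap = audit({"model": True, "tools": True})
    print(gaps, has_gap)
```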

Full breakdown with framework diagrams and the complete audit on echonerve.com (canonical URL): https://echonerve.com/the-echonerve-agent-stack-a-new-way-to-understand-ai-systems/

What layer is the actual bottleneck in your production deployments?

Top comments (2)

Mateo Ruiz • Jun 19

This framework aligns with a pattern I've seen repeatedly: teams spend weeks comparing models while the real reliability issues live in memory design, tool governance, behavior rules, and human review processes.

The most valuable point here is that production failures are often architectural failures masquerading as model failures. Swapping to a stronger model rarely fixes missing context, poorly scoped tools, or undefined escalation paths.

We've encountered similar challenges while helping clients move AI agents from prototype to production at IT Path Solutions. In many cases, introducing structured memory, clearer behavioral constraints, and explicit human-in-the-loop checkpoints delivered bigger reliability gains than changing the model itself.

The 5-layer breakdown is useful because it gives teams a practical debugging lens. When an agent becomes unreliable, the question shouldn't be "Which model should we try next?" but "Which layer of the system is actually failing?"

SAURABH SHUKLA • Jul 4

This tracks with what I keep seeing too, and the reframe at the end is the actual takeaway: "which layer is failing" is a completely different diagnostic question than "which model should we try."

One thing I'd add from watching teams make this mistake repeatedly: the model swap is seductive precisely because it's the easy change. Swapping models is a config update — an afternoon of work, no meetings required. Writing behavioral constraints, designing an actual escalation path, structuring memory so context survives a session — that's real system design, and it forces you to admit nobody ever defined what "correct behavior" means for this agent in the first place. Teams reach for the model swap partly because it lets them skip that harder, more uncomfortable conversation.

Curious which layer you see undefined most often on the client side once you're actually in the prototype-to-production trenches. My guess would be Behavior — it's the layer that feels the least "technical" to write down, so it's the one that gets left as tribal knowledge in someone's head instead of an explicit spec. Memory at least feels like an engineering problem, so people build it. Behavior feels like a documentation problem, so it gets skipped.