Serhii Panchyshyn
88% of AI Agent Failures Have Nothing to Do With the Model

You keep rewriting the prompt. Upgrading the model. Adding more instructions. The agent still gets it wrong. After hundreds of failed traces, I found where the failures actually live.

Everyone in AI is waiting for the next model drop. GPT-5 will fix it. Claude 4 will fix it. Gemini Ultra Pro Max will definitely fix it this time.

I've been building production AI agents for three years. Not weekend projects. Not demos that work when the CEO is watching. Agents that process real transactions, talk to real APIs, and break at 2 AM on a Sunday in ways that make you mass-update your resume on LinkedIn.

And I keep seeing the same pattern. A team ships an agent. It works great in staging. It hits production and starts doing weird things. Wrong answers. Confident wrong answers. The kind where the agent sounds so sure that your support team forwards it to the customer before anyone checks.

The team's first move? Upgrade the model. Rewrite the prompt. Add more instructions. "You are a helpful, accurate assistant that ALWAYS checks the database before responding and NEVER makes assumptions..."

Two weeks later, same failures. Different words.

I started keeping a spreadsheet. Every time an agent failed in production, I'd read the full trace. Not the output. The whole chain. What the user asked. What the agent retrieved. What tools it called. What context it actually had when it made its decision.

After a few hundred traces, the pattern was so obvious I felt stupid for not seeing it earlier.

The model almost never made a bad decision given what it could see. The problem was what it could see.

That realization changed how I build everything.


1. The Context Stack

Here's the mental model I use now. I call it the Context Stack. Every AI agent failure maps to one of these layers, and the layers are ordered by how often they're actually the problem versus how often teams blame them.

| Layer | What it controls | How often it's actually the cause |
| --- | --- | --- |
| Perception | What data can the agent see? | ~38% |
| Retrieval quality | Is the right data surfaced at the right time? | ~22% |
| Boundaries | Is the agent scoped to the right domain? | ~18% |
| Memory | Does the agent retain the right things across turns? | ~8% |
| Prompting | Are the instructions clear and complete? | ~10% |
| Model reasoning | Did the LLM actually make a logic error? | ~12% |

That bottom row. 12%. That's how often the model itself is the problem. Everyone optimizes for the 12% and ignores the 88% above it.

These aren't made-up numbers. They come from actually reading failure traces on a production system and categorizing every single one. Tedious work. Not glamorous. But it changed my entire approach to agent architecture.

Let me walk through each layer.


2. Perception: The Agent Is Looking at the Wrong Wall

Perception is the most fundamental layer and the one teams skip most often. It answers one question: when the agent made its decision, could it physically see the information it needed?

I'll give you a real example. I was working on an agent that answered questions about company operations. Customer asks a straightforward question. Agent gives a confident, detailed, completely wrong answer. The team's reaction: "The model is hallucinating."

I pulled the trace. Here's what actually happened.

The agent's retrieval system was connected to a knowledge base. That knowledge base had two environments: staging and production. The staging environment had test data from six months ago. Production had current data. Due to a config issue, the agent in production was querying the staging knowledge base.

The model didn't hallucinate. It gave a perfect answer to the wrong data. It was looking at the wrong wall and describing what it saw with total accuracy.

No prompt rewrite fixes this. No model upgrade fixes this. No chain-of-thought, no few-shot examples, no "think step by step." The agent literally could not see the correct information.

Here's the thing that makes perception failures so dangerous: they look exactly like reasoning failures from the outside. The output is wrong. The agent sounds confident. If you only look at the input and output, you'd swear the model screwed up. You have to read the full trace to see that the model did fine with what it had.

This is why I'm obsessive about one principle: before you debug the brain, debug the eyes.

Questions I now ask before touching a prompt:

  • Is the agent connected to the right data source?
  • Is that data source current?
  • Is there an environment mismatch between where you tested and where it's running?
  • Can you prove the agent saw the correct document by reading the trace?

Boring questions. They catch 38% of failures.
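The environment-mismatch question is the easiest one to automate. Here's a minimal sketch of a startup guard that refuses to boot a production agent wired to a staging knowledge base. The names (`DEPLOY_ENV`, `KB_URL`, `check_perception`) are illustrative, not from any particular framework:

```python
import os

def check_perception(deploy_env: str, kb_url: str) -> None:
    """Fail fast if a production agent is pointed at a staging data source."""
    if deploy_env == "production" and "staging" in kb_url:
        raise RuntimeError(
            f"Perception mismatch: {deploy_env} agent is wired to {kb_url}"
        )

# Run at startup, before the agent serves a single request.
check_perception(
    os.getenv("DEPLOY_ENV", "production"),
    os.getenv("KB_URL", "https://kb.prod.internal"),
)
```

A one-line string check is crude; the point is that the check runs at boot, not after a customer gets a confident answer sourced from six-month-old test data.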


3. Retrieval: The Silent Killer

Perception asks: can the agent see the right data source? Retrieval asks: given that it can, does it actually pull the right document at the right time?

This is where RAG pipelines go to die.

Everyone's seen the RAG demo. "Look, I uploaded 500 documents and the agent can answer questions about all of them!" Beautiful. Ship it. Then a customer asks about your enterprise refund policy and the agent responds with the SMB refund policy because the embedding similarity between "enterprise refund" and "SMB refund" is 0.94.

The retrieval system grabbed a document that was semantically close but factually wrong. The model, being a good model, answered based on what it was given. From the outside, it looks like the model confused enterprise and SMB tiers. From the inside, the model never had a chance.

I've seen this pattern so many times it has a name in my head: the retrieval costume. A retrieval failure puts on a reasoning costume and walks around your production system pretending to be an intelligence problem.

Here's what retrieval failures actually look like when you track them:

Missing documents. The knowledge base simply doesn't have the document the agent needs. Nobody uploaded the updated policy. Nobody added the new product spec. The agent can't retrieve what doesn't exist.

Wrong chunking. The document exists, but the chunking strategy split a critical paragraph across two chunks. The agent gets the first half of the answer. It confabulates the second half because it doesn't know there's more.

Bad ranking. The right document is in the results, but it's ranked #7 and the agent only sees the top 3. This one is infuriating because the system technically "found" the answer and then buried it.

Query decomposition bugs. The user asks a compound question. The retrieval system decomposes it into sub-queries. One sub-query is malformed. The agent gets great context for half the question and garbage for the other half.

Stale data. The document exists, the chunking is fine, the ranking is fine, but the document itself is outdated. Policy changed last month. Nobody updated the knowledge base. The agent confidently serves the old policy.

None of these are model problems. All of them produce outputs that look like model problems.
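To make one of these concrete: a common mitigation for the wrong-chunking failure is overlapping chunks, so a paragraph boundary never lands exactly on a chunk boundary. A minimal sketch, with illustrative sizes (real pipelines usually split on sentence or section boundaries rather than raw character offsets):

```python
def chunk(text: str, size: int = 800, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks where each chunk repeats the
    last `overlap` characters of the previous one."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step forward less than a full chunk
    return chunks
```

Overlap doesn't eliminate the failure, it just makes it rarer; the golden-set evals below are what tell you whether your chunking actually works.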

The fix is boring and effective: golden sets.

Build a spreadsheet. 50 to 100 questions where you know the exact right answer and which document it should come from. Run them through your retrieval pipeline weekly. Measure two things:

  1. Retrieval recall. Did the correct document appear in the top-K results?
  2. Answer accuracy. Given the retrieved documents, did the agent produce the right answer?

When retrieval recall drops, you have a retrieval problem. When recall is high but accuracy drops, now you have a model or prompting problem. Without this split, you're guessing. And guessing leads to prompt rewrites that don't fix anything.

```python
# pseudocode for the eval loop that actually matters
for query in golden_set:
    retrieved_docs = retrieval_pipeline(query.text)
    recall = query.expected_doc_id in [d.id for d in retrieved_docs[:K]]
    answer = agent.respond(query.text, context=retrieved_docs)
    accuracy = judge(answer, query.expected_answer)

    log(query, recall, accuracy)

    # this is the insight:
    # recall=False, accuracy=False → retrieval problem
    # recall=True,  accuracy=False → model/prompt problem
    # recall=True,  accuracy=True  → working as intended
    # recall=False, accuracy=True  → got lucky, still fix retrieval
```

That last case is the sneaky one. The agent got the right answer despite not retrieving the right document. Maybe it used general knowledge. Maybe a different document happened to contain the same info. It looks like a pass. It's a ticking time bomb.


4. Boundaries: General Agents Are a Seductive Mistake

"Build one agent that can do everything."

I understand the appeal. One interface. One system. One thing to maintain. Every executive loves hearing this pitch. It sounds like the future.

It is the opposite of the future.

Here's what happens when you build a general agent. The system prompt needs to describe every capability. Tool definitions for CRM lookups, order tracking, document search, scheduling, analytics, email drafting, billing inquiries. I've seen system prompts with 40+ tool definitions. That's 8,000 to 15,000 tokens of tool descriptions before the user says a single word.

Now the model has to do two things on every single turn:

  1. Figure out which of 40 tools is relevant
  2. Actually answer the question

The first task is a classification problem that gets harder as you add tools. The model burns reasoning capacity deciding what to do before it starts doing it. Tool selection errors cascade. The agent picks the CRM tool when it should have picked the order tool. Gets back customer metadata instead of order status. Now it either hallucinates an order status from CRM data or makes a second tool call. Latency doubles. Accuracy drops.

The specialist alternative:

Build small agents scoped to specific domains. An order agent with 4 tools: order lookup, shipment tracking, status update, and escalation. A billing agent with 3 tools: invoice lookup, payment history, and dispute creation. A support agent with tools scoped to the knowledge base and ticket system.

Each specialist has a tight context. Few tools. Clear boundaries. Fast tool selection because there's nothing to be confused about.

Then build a router. The router is intentionally dumb. Its only job: look at the user's message and decide which specialist handles it. That's a simple classification task. Models are great at simple classification.

```yaml
router:
  agents:
    - name: orders
      description: "Order status, tracking, shipping, delivery"
      tools: [order_lookup, shipment_track, status_update, escalate]
    - name: billing
      description: "Invoices, payments, refunds, disputes"
      tools: [invoice_lookup, payment_history, dispute_create]
    - name: support
      description: "Product questions, how-to, troubleshooting"
      tools: [kb_search, ticket_create, ticket_update]

  fallback: support
  max_reroutes: 1
```

The router is thin. The specialists are sharp. The system is fast.

I've seen this pattern cut agent latency by 60% and improve accuracy by 15-20%. Not because the model got smarter. Because each agent sees only what it needs.

There's a deeper principle here. In software architecture, we've known for decades that bounded contexts beat monoliths. Microservices. Domain-driven design. Single responsibility. We know this stuff. Then AI comes along and everyone forgets everything and tries to build a God Agent that does it all.

Don't build the God Agent. Build specialists with boundaries.
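The router logic itself is almost embarrassingly small. Here's a sketch; in production the classification step would be a single cheap LLM call against the agent descriptions, but a keyword stand-in keeps the example self-contained. All names are illustrative:

```python
# Thin router over specialist agents. The keyword matcher stands in for
# what would normally be one small LLM classification call.
AGENTS = {
    "orders":  ["order", "tracking", "shipping", "delivery"],
    "billing": ["invoice", "payment", "refund", "dispute"],
    "support": [],  # fallback: product questions, how-to, troubleshooting
}
FALLBACK = "support"

def route(message: str) -> str:
    """Pick the specialist whose domain matches the message, else fall back."""
    text = message.lower()
    for name, keywords in AGENTS.items():
        if any(kw in text for kw in keywords):
            return name
    return FALLBACK
```

The structure is the point: the router owns zero tools and makes zero decisions beyond "who handles this," so each specialist's context stays tight.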


5. Memory Is Usually Overdesigned

Memory is the AI feature that demos best and scales worst.

"Look, the agent remembers that you prefer window seats and you're allergic to shellfish!" Amazing in a demo. Now run 500 concurrent sessions. Each one accumulating conversation history. Each one doing similarity search against a vector store of past interactions. Each one injecting "relevant" memories into an already-stuffed context window.

What happens in practice: the memory system retrieves 15 "relevant" past interactions. Most of them are tangentially related at best. The context window is now 80% memory and 20% actual current task. The model gets confused. It starts referencing things from three conversations ago that have nothing to do with the current question.

The agent got dumber because you made it remember too much.

Here's the memory architecture that actually works in production. Three rules:

Rule 1: Keep recent history short. Last 15 to 20 messages of the current conversation. That's it. Don't load the entire session history.

Rule 2: Consolidate, don't dump. When a tool returns a 2,000-token response, summarize the key findings into 200 tokens before putting it back into context. Raw tool output is context poison. The model doesn't need the full JSON blob. It needs "Found 3 matching orders. Most recent: #4521, shipped March 15, arriving March 22."

Rule 3: Long-term memory is on-demand, not always-on. Don't preload past conversations. If the current query explicitly references something historical ("what did we discuss last week about the billing issue"), fetch it then. Otherwise, leave it out.
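The first two rules fit in a dozen lines. A sketch, where `summarize` stands in for a cheap LLM summarization call (the truncation default is just a placeholder) and all names are illustrative:

```python
MAX_RECENT = 20  # Rule 1: keep only the last N messages

def build_context(history: list, tool_output=None,
                  summarize=lambda s: s[:200]):
    """Assemble the context window from recent history plus consolidated
    tool output. Rule 3 is what's absent: no long-term memory is
    preloaded here; it's fetched only when the query asks for it."""
    context = list(history[-MAX_RECENT:])        # Rule 1: short recent history
    if tool_output is not None:
        context.append(summarize(tool_output))   # Rule 2: consolidate, don't dump
    return context
```

Note what this *doesn't* do: no embedding search over past sessions, no importance scoring, no decay functions. That's the design choice.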

I had a phase where I built an elaborate memory system. Embeddings for every message. Semantic search on every turn. Importance scoring. Decay functions. The whole research paper fantasy.

Then I replaced it with the three rules above and ran evals side by side. The simple approach won on accuracy. It won on latency. It won on cost. The fancy memory system made the agent marginally better at remembering trivia and measurably worse at the actual task.

The hardest lesson in AI engineering: more context is not better context. Relevant context is better context. Every token in the context window is competing for the model's attention. Fill it with noise and the signal gets lost.


6. Agent-Readable Software: The Frontier Nobody's Talking About

We spend enormous energy making agents smarter. We barely think about making our software easier for agents to work with.

This is the equivalent of building faster cars in the 1920s while the roads are still dirt paths. At some point, you have to pave the road.

I've started thinking about this as Agent Experience (AX). Same way we talk about User Experience (UX) for humans and Developer Experience (DX) for programmers. AX is the experience your software provides to the AI agents that interact with it.

Here's what bad AX looks like:

```typescript
// Bad AX: function name tells the agent nothing
async function process(data: any) { ... }

// Bad AX: error message is useless to an agent
throw new Error("Something went wrong");

// Bad AX: 2000-line file where the agent can't find anything
// api-handlers.ts (2,347 lines)
```

Here's what good AX looks like:

```typescript
// Good AX: the function name IS the documentation
async function getActiveEnterpriseCustomersByRegion(
  regionCode: string
): Promise<EnterpriseCustomer[]> { ... }

// Good AX: structured error with machine-readable context
throw new AgentReadableError({
  code: "CUSTOMER_NOT_FOUND",
  entity: "customer",
  id: customerId,
  suggestion: "Verify customer ID format. Expected: CUST-XXXXX"
});

// Good AX: small, focused files
// enterprise-customers.ts (180 lines)
// smb-customers.ts (140 lines)
// customer-utils.ts (90 lines)
```

The difference isn't aesthetic. It's functional. An agent working with good AX makes fewer tool selection errors, fewer parameter errors, and recovers faster from failures because the error messages tell it what to try next.

Specific AX patterns that I've seen make a measurable difference:

Explicit naming over conventions. Agents don't know your naming conventions. They can't infer that getData means "get customer data from the primary database filtered by active status." Say what you mean.

Structured errors over string messages. When an agent encounters an error, it needs to decide what to do next. { code: "RATE_LIMITED", retryAfter: 30 } lets it make that decision. "Error occurred" does not.

Smaller files with clear boundaries. Large files are expensive to load into context and hard for agents to navigate. 200-line files with clear responsibilities are operationally superior to 2,000-line files. This is also just good software design. Agents are forcing the issue.

Semantic commit messages and PR descriptions. If agents are reading your git history to understand code changes, "fix bug" is worthless. "Fix race condition in concurrent order updates where two agents modify the same shipment record" gives the agent enough context to understand the change without reading the diff.

Idempotent operations. Agents retry things. They call the same endpoint twice because the first response was slow. If your API creates duplicate records on duplicate calls, agents will create chaos. Idempotency isn't just good API design. It's agent-safety.
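The standard pattern is an idempotency key: the caller supplies a unique key per logical operation, and duplicate calls replay the original result instead of creating a second record. A minimal sketch, with an in-memory dict standing in for a real store (Redis, a database table) and all names illustrative:

```python
_seen: dict = {}  # stand-in for a durable idempotency-key store

def create_order(idempotency_key: str, payload: dict) -> dict:
    """Create an order once per key; duplicate calls return the original."""
    if idempotency_key in _seen:
        return _seen[idempotency_key]  # replay, don't duplicate
    record = {"id": f"ord-{len(_seen) + 1}", **payload}
    _seen[idempotency_key] = record
    return record
```

When an agent times out and retries with the same key, it gets the same order back instead of creating a duplicate. A real implementation also needs key expiry and a check that the retried payload matches the original.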

This is the unlock. The teams that are getting the most out of AI agents aren't just building better agents. They're rebuilding their software to be legible to agents. The returns compound. Every function you rename, every error you structure, every file you break up makes every future agent interaction more reliable.


7. Traces: Where the Real Work Lives

I saved this for last because it's the least exciting and the most important.

If you can't inspect the full trace of what your agent did, you don't understand your product. Period.

A trace is the complete record of an agent's execution. User input. Retrieval results. Tool calls and responses. Model reasoning. Final output. Every step, in order, with timestamps.

Most teams check accuracy from a dashboard. "We're at 91%." Great. What's in the 9%? "Uh... wrong answers?"

That's not debugging. That's astrology with better graphs.

The 9% is where your product lives. Every failed trace is a specific, fixable problem. And the fix is almost never "make the model smarter." Here's what you actually find when you read traces:

  • The retrieval returned 5 documents but the relevant one was #4 and the agent only used #1 and #2
  • A tool returned a timeout error and the agent interpreted the timeout as "no results found"
  • The user's question was ambiguous and the agent picked the wrong interpretation (this one IS a prompting fix)
  • The knowledge base was missing a document that would have answered the question
  • The agent called the right tool with the wrong parameters because the parameter name was ambiguous

Each of these has a different fix. If you don't read the trace, you apply the wrong fix. You rewrite the prompt when the problem was a missing document. You upgrade the model when the problem was a tool timeout being misinterpreted.

Build evals that split retrieval from reasoning. Run them on a schedule. When something breaks, read the trace before you touch anything. This is the practice that separates production AI teams from demo AI teams.


The Uncomfortable Truth

The AI industry runs on a narrative that's convenient for model providers and expensive for everyone else: the next model will fix your problems.

It won't. I don't say this as speculation. I say this because I've upgraded models on production agents and measured the impact. Going from one generation to the next typically moves accuracy by 2-4 percentage points. Fixing the retrieval pipeline, scoping agent boundaries, and building proper evals? 15-25 points.

The model is the least impactful thing you can change in most agent architectures. It's also the easiest, which is why everyone reaches for it first.

The hard work is in the Context Stack. Can the agent see the right data? Does the retrieval system surface the right documents? Are the agent's boundaries tight enough? Is memory helping or hurting? Is your software readable to agents? Can you inspect what went wrong when it goes wrong?

None of this is glamorous. None of it generates Twitter engagement. Nobody's writing breathless blog posts about how they fixed their chunking strategy and accuracy went up 18%.

But that's the work. That's what production AI actually looks like.

The model is already smart enough. The question is whether you've built the context stack to let it be useful.

Start there. And read the traces.


I'm building AI agents in production and writing about what actually works versus what demos well. If this matched something you've seen in your own work, let's connect. The best conversations I have are with people who've also stared at a failed trace at midnight wondering why the agent confidently told a customer the wrong thing.
