Elizabeth Fuentes L for AWS

Why AI Agents Fail: 3 Failure Modes That Cost You Tokens and Time

AI agents don't fail like traditional software — they don't crash with a stack trace. They fail silently: returning incomplete answers, freezing on slow APIs, or burning tokens calling the same tool over and over. The agent appears to work, but the output is wrong, late, or expensive.

This series covers the three most common failure modes with research-backed solutions. Each technique has a runnable demo that measures the before/after difference.

Working code: github.com/aws-samples/sample-why-agents-fail

The demos use Strands Agents with OpenAI (GPT-4o-mini). The patterns are framework-agnostic — they apply to LangGraph, AutoGen, CrewAI, or any framework that supports tool calling and lifecycle hooks.

This Series: 3 Essential Fixes

  1. Context Window Overflow — Memory Pointer Pattern for large data
  2. MCP Tools That Never Respond — Async handleId pattern for slow external APIs
  3. AI Agent Reasoning Loops — DebounceHook + clear tool states to block repeated calls

What Happens When Tool Outputs Overflow the Context Window?

Context window overflow occurs when a tool returns more data than the LLM can process — server logs, database results, or file contents that exceed the token limit. The agent does not crash with an error. It silently degrades: truncating data, losing context, or producing incomplete answers.

Research from IBM quantifies this: a Materials Science workflow consumed 20M tokens and failed. The same workflow with memory pointers used 1,234 tokens and succeeded.

*Comparison of an AI agent without the Memory Pointer Pattern versus with it, showing how large data stays outside the context window*

The fix — Memory Pointer Pattern: Store large data in agent.state, return a short pointer to the context. The next tool resolves the pointer to access the full data:

from strands import tool, ToolContext

@tool(context=True)
def fetch_application_logs(app_name: str, tool_context: ToolContext, hours: int = 24) -> str:
    """Fetch logs. Stores large data as pointer to avoid context overflow."""
    logs = generate_logs(app_name, hours)  # Could be 200KB+

    if len(str(logs)) > 20_000:
        pointer = f"logs-{app_name}"
        tool_context.agent.state.set(pointer, logs)
        return f"Data stored as pointer '{pointer}'. Use analyze tools to query it."
    return str(logs)

@tool(context=True)
def analyze_error_patterns(data_pointer: str, tool_context: ToolContext) -> str:
    """Analyze errors — resolves pointer from agent.state."""
    data = tool_context.agent.state.get(data_pointer)
    errors = [e for e in data if e["level"] == "ERROR"]
    return f"Found {len(errors)} errors across {len(set(e['service'] for e in errors))} services"

The LLM never sees the 200KB — it only sees "Data stored as pointer 'logs-payment-service'" (52 bytes).

Why Strands Agents? The ToolContext API provides agent.state as a native key-value store scoped to each agent — no global dictionaries, no external infrastructure. For multi-agent workflows, invocation_state shares data across agents in a Swarm with the same API.
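The pattern itself does not depend on Strands. A minimal framework-agnostic sketch, using a plain dict as a stand-in for `agent.state` (the `store_or_return`, `resolve`, and `POINTER_THRESHOLD` names are illustrative, not from the demo repo):

```python
# Hypothetical minimal stand-in for agent.state: any per-agent key-value store works.
STATE = {}
POINTER_THRESHOLD = 20_000  # serialized size (chars) before switching to a pointer

def store_or_return(name: str, data) -> str:
    """Return small payloads inline; store large ones and return a short pointer."""
    serialized = str(data)
    if len(serialized) > POINTER_THRESHOLD:
        pointer = f"logs-{name}"
        STATE[pointer] = data
        return f"Data stored as pointer '{pointer}'."
    return serialized

def resolve(pointer: str):
    """The next tool resolves the pointer to the full data -- the LLM never sees it."""
    return STATE[pointer]

big = [{"level": "ERROR"}] * 5000          # ~100KB once serialized
msg = store_or_return("payment-service", big)
print(msg)                                  # only this short string enters the context
print(len(resolve("logs-payment-service")))  # full data is still available to tools
```

The key design choice is that the pointer string is meaningful to the LLM ("use analyze tools to query it") while the payload never crosses the context boundary.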

| Metric | Without pointers | With Memory Pointers |
| --- | --- | --- |
| Data in context | 214KB (full logs) | 52 bytes (pointer) |
| Agent behavior | Truncates or fails | Processes all data |
| Errors detected | Partial | Complete |

*Bar chart showing token usage across context management strategies*

Full demo: 01-context-overflow-demo — single-agent and multi-agent (Swarm) implementations with notebooks.


Why Do AI Agents Freeze When Calling External APIs?

AI agents freeze when MCP tools call slow or unresponsive external APIs. The agent blocks on the tool call, the user sees no progress, and after 7 seconds many implementations return a 424 error. MCP (Model Context Protocol) enables agents to call external tools, but does not handle timeout or retry by default.

*Synchronous MCP tool call showing the agent blocked while waiting for a slow API*

The fix — Async handleId pattern: The tool returns immediately with a job ID. The agent polls a separate check_status tool:

import asyncio
import uuid

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("timeout-demo")
JOBS = {}

@mcp.tool()
async def start_long_job(task: str) -> str:
    """Return handle immediately — prevents timeout."""
    job_id = str(uuid.uuid4())[:8]
    JOBS[job_id] = {"status": "processing", "task": task}
    asyncio.create_task(_process_job(job_id))  # Background work
    return f"Job started. Handle: {job_id}. Use check_job_status to poll."

@mcp.tool()
async def check_job_status(job_id: str) -> str:
    """Poll job status — returns 'processing' or 'completed' with result."""
    job = JOBS.get(job_id)
    if not job:
        return f"FAILED: Job '{job_id}' not found"
    return f"{job['status'].upper()}: {job.get('result', 'Still processing...')}"
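The background worker referenced above (`_process_job`) is omitted; a minimal sketch of it, with a short `asyncio.sleep` standing in for the real slow API call, shows why the first poll never blocks (the timings and `demo` helper here are illustrative):

```python
import asyncio
import uuid

JOBS = {}

async def _process_job(job_id: str) -> None:
    """Simulated slow external API call, running off the tool-call path."""
    await asyncio.sleep(0.1)  # stands in for a 15s upstream API
    JOBS[job_id]["status"] = "completed"
    JOBS[job_id]["result"] = f"done: {JOBS[job_id]['task']}"

async def demo():
    job_id = str(uuid.uuid4())[:8]
    JOBS[job_id] = {"status": "processing", "task": "generate report"}
    asyncio.create_task(_process_job(job_id))  # schedule work, return immediately
    first = JOBS[job_id]["status"]             # first poll: still processing
    await asyncio.sleep(0.2)                   # the agent polls again later
    return first, JOBS[job_id]["status"]

print(asyncio.run(demo()))
```

Because `create_task` only schedules the coroutine, `start_long_job` returns its handle in microseconds; the 424-after-7s timeout never triggers no matter how slow the upstream API is.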
| Scenario | Response time | UX |
| --- | --- | --- |
| Fast API (1s) | 3s total | OK |
| Slow API (15s) | 18s blocked | Agent frozen |
| Failing API | 424 error after 7s | Agent crashes |
| Async handleId | ~4s (immediate + poll) | Agent responsive |

*Timeline visualization showing four MCP response patterns*

Why Strands Agents? The MCPClient connects to any MCP server with one line. The agent discovers tools at runtime via list_tools_sync() — no hardcoded tool list. When the MCP server implements the async pattern, the agent automatically polls without extra orchestration code.

Full demo: 02-mcp-timeout-demo — local MCP server with all 4 scenarios and notebook.


Why Do AI Agents Repeat the Same Tool Call?

AI agent reasoning loops happen when the agent calls the same tool repeatedly with identical parameters, making no progress. The root cause is ambiguous tool feedback — responses like "more results may be available" make the agent think another call will produce better results. Research shows agents can loop hundreds of times without delivering an answer.

*Diagram showing how ambiguous tool feedback causes loops versus how clear states and DebounceHook prevent them*

Fix 1 — Clear terminal states: Tools return explicit SUCCESS or FAILED instead of ambiguous messages:

# Ambiguous (causes loops)
return f"Found flights: {results}. More results may be available."

# Clear (agent stops)
return f"SUCCESS: Booked flight {conf_id} for {passenger}. Confirmation sent."

Fix 2 — DebounceHook: Detect and block duplicate tool calls at the framework level:

import json

from strands.hooks.registry import HookProvider, HookRegistry
from strands.hooks.events import BeforeToolCallEvent

class DebounceHook(HookProvider):
    """Block duplicate tool calls in a sliding window."""
    def __init__(self, window_size=3):
        self.call_history = []
        self.window_size = window_size

    def register_hooks(self, registry: HookRegistry) -> None:
        registry.add_callback(BeforeToolCallEvent, self.check_duplicate)

    def check_duplicate(self, event: BeforeToolCallEvent) -> None:
        key = (event.tool_use["name"], json.dumps(event.tool_use.get("input", {})))
        if self.call_history.count(key) >= 2:
            event.cancel_tool = f"BLOCKED: Duplicate call to {event.tool_use['name']}"
        self.call_history.append(key)
        self.call_history = self.call_history[-self.window_size:]
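The sliding-window logic is independent of any hook API. A framework-agnostic sketch of the same core (the `SlidingWindowDebounce` class name and `allow` method are illustrative, not Strands API):

```python
import json
from collections import deque

class SlidingWindowDebounce:
    """Block a tool call once the same (tool, input) pair has already
    appeared twice in the recent call window."""
    def __init__(self, window_size: int = 3):
        self.history = deque(maxlen=window_size)  # oldest calls fall off automatically

    def allow(self, tool_name: str, tool_input: dict) -> bool:
        # Canonical key: sorted JSON so {"a":1,"b":2} and {"b":2,"a":1} match.
        key = (tool_name, json.dumps(tool_input, sort_keys=True))
        blocked = self.history.count(key) >= 2
        self.history.append(key)
        return not blocked

d = SlidingWindowDebounce()
print(d.allow("search_flights", {"from": "JFK"}))  # True  -- first call
print(d.allow("search_flights", {"from": "JFK"}))  # True  -- second call
print(d.allow("search_flights", {"from": "JFK"}))  # False -- third identical call blocked
```

Using `deque(maxlen=...)` gives the sliding window for free: once a different call pushes the duplicates out of the window, the same tool call is allowed again.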
| Strategy | Tool calls | Result |
| --- | --- | --- |
| Ambiguous feedback (baseline) | 14 calls | No definitive answer |
| DebounceHook | 12 calls (2 blocked) | Completes with blocks |
| Clear SUCCESS states | 2 calls | Immediate completion |

*Bar chart showing tool calls across different strategies*

Why Strands Agents? The HookProvider API intercepts tool calls via BeforeToolCallEvent before they execute. Setting event.cancel_tool blocks execution at the framework level — the LLM cannot bypass it. This makes hooks composable: stack DebounceHook, LimitToolCounts, and custom validators on the same agent.

Full demo: 03-reasoning-loops-demo — all 4 scenarios with hooks and notebook.


Prerequisites

You need Python 3.9+, uv (a fast Python package manager), and an OpenAI API key.

git clone https://github.com/aws-samples/sample-why-agents-fail
cd sample-why-agents-fail/stop-ai-agents-wasting-tokens

# Pick any demo
cd 01-context-overflow-demo   # or 02-mcp-timeout-demo, 03-reasoning-loops-demo
uv venv && uv pip install -r requirements.txt
export OPENAI_API_KEY="your-key-here"

uv run python test_*.py

Each demo is self-contained with its own dependencies, test script, and Jupyter notebook.


Frequently Asked Questions

What are the most common AI agent failure modes?

The three most common failure modes are context window overflow (tool returns more data than the LLM can process), MCP tool timeouts (external APIs block the agent indefinitely), and reasoning loops (agent repeats the same tool call without progress). Each failure mode causes token waste and degrades response quality.

How do I reduce AI agent token costs?

The two most effective techniques are memory pointers and clear tool states. The Memory Pointer Pattern stores large tool outputs in external state and passes short references to the LLM context — reducing token usage from 200KB+ to under 100 bytes per tool call. Clear terminal states (SUCCESS/FAILED) in tool responses prevent the agent from retrying completed operations, which can reduce tool calls from 14 to 2.

Can I use these patterns with frameworks other than Strands Agents?

Yes. The Memory Pointer Pattern works with any framework that supports tool context (passing state between tools). The async handleId pattern is an MCP server design pattern — it works with any MCP-compatible agent. DebounceHook requires lifecycle hooks, which are available in LangGraph, AutoGen, and CrewAI with different APIs.



Which failure mode have you hit in your agents? Share in the comments.


Thank you!
