Elizabeth Fuentes L for AWS

Why AI Agents Fail: 3 Failure Modes That Cost You Tokens and Time

AI agents don't fail like traditional software — they don't crash with a stack trace. They fail silently: returning incomplete answers, freezing on slow APIs, or burning tokens calling the same tool over and over. The agent appears to work, but the output is wrong, late, or expensive.

This series covers the three most common failure modes with research-backed solutions. Each technique has a runnable demo that measures the before/after difference.

Working code: github.com/aws-samples/sample-why-agents-fail

The demos use Strands Agents with OpenAI (GPT-4o-mini). The patterns are framework-agnostic — they apply to LangGraph, AutoGen, CrewAI, or any framework that supports tool calling and lifecycle hooks.

This Series: 3 Essential Fixes

  1. Context Window Overflow — Memory Pointer Pattern for large data
  2. MCP Tools That Never Respond — Async handleId pattern for slow external APIs
  3. AI Agent Reasoning Loops — DebounceHook + clear tool states to block repeated calls

What Happens When Tool Outputs Overflow the Context Window?

Context window overflow occurs when a tool returns more data than the LLM can process — server logs, database results, or file contents that exceed the token limit. The agent does not crash with an error. It silently degrades: truncating data, losing context, or producing incomplete answers.

Research from IBM quantifies this: a Materials Science workflow consumed 20M tokens and failed. The same workflow with memory pointers used 1,234 tokens and succeeded.

*Comparison of an AI agent without the Memory Pointer Pattern versus with it, showing how large data stays outside the context window*

The fix — Memory Pointer Pattern: Store large data in agent.state, return a short pointer to the context. The next tool resolves the pointer to access the full data:

from strands import tool, ToolContext

@tool(context=True)
def fetch_application_logs(app_name: str, tool_context: ToolContext, hours: int = 24) -> str:
    """Fetch logs. Stores large data as pointer to avoid context overflow."""
    logs = generate_logs(app_name, hours)  # Could be 200KB+

    if len(str(logs)) > 20_000:
        pointer = f"logs-{app_name}"
        tool_context.agent.state.set(pointer, logs)
        return f"Data stored as pointer '{pointer}'. Use analyze tools to query it."
    return str(logs)

@tool(context=True)
def analyze_error_patterns(data_pointer: str, tool_context: ToolContext) -> str:
    """Analyze errors — resolves pointer from agent.state."""
    data = tool_context.agent.state.get(data_pointer)
    errors = [e for e in data if e["level"] == "ERROR"]
    return f"Found {len(errors)} errors across {len(set(e['service'] for e in errors))} services"

The LLM never sees the 200KB — it only sees "Data stored as pointer 'logs-payment-service'" (52 bytes).

Why Strands Agents? The ToolContext API provides agent.state as a native key-value store scoped to each agent — no global dictionaries, no external infrastructure. For multi-agent workflows, invocation_state shares data across agents in a Swarm with the same API.
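The pattern itself does not depend on Strands. A minimal framework-agnostic sketch, using a plain dict as a stand-in for `agent.state` (the `store_or_return`, `resolve`, and `POINTER_THRESHOLD` names are illustrative, not from the demo repo):

```python
# Hypothetical minimal stand-in for agent.state: any per-agent key-value store works.
STATE = {}
POINTER_THRESHOLD = 20_000  # serialized size (chars) before switching to a pointer

def store_or_return(name: str, data) -> str:
    """Return small payloads inline; store large ones and return a short pointer."""
    serialized = str(data)
    if len(serialized) > POINTER_THRESHOLD:
        pointer = f"logs-{name}"
        STATE[pointer] = data
        return f"Data stored as pointer '{pointer}'."
    return serialized

def resolve(pointer: str):
    """The next tool resolves the pointer to the full data -- the LLM never sees it."""
    return STATE[pointer]

big = [{"level": "ERROR"}] * 5000          # ~100KB once serialized
msg = store_or_return("payment-service", big)
print(msg)                                  # only this short string enters the context
print(len(resolve("logs-payment-service")))  # full data is still available to tools
```

The key design choice is that the pointer string is meaningful to the LLM ("use analyze tools to query it") while the payload never crosses the context boundary.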

| Metric | Without pointers | With Memory Pointers |
| --- | --- | --- |
| Data in context | 214KB (full logs) | 52 bytes (pointer) |
| Agent behavior | Truncates or fails | Processes all data |
| Errors detected | Partial | Complete |

*Bar chart showing token usage across context management strategies*

Full demo: 01-context-overflow-demo — single-agent and multi-agent (Swarm) implementations with notebooks.


Why Do AI Agents Freeze When Calling External APIs?

AI agents freeze when MCP tools call slow or unresponsive external APIs. The agent blocks on the tool call, the user sees no progress, and after 7 seconds many implementations return a 424 error. MCP (Model Context Protocol) enables agents to call external tools, but does not handle timeout or retry by default.

*Synchronous MCP tool call showing the agent blocked while waiting for a slow API*

The fix — Async handleId pattern: The tool returns immediately with a job ID. The agent polls a separate check_status tool:

import asyncio
import uuid

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("timeout-demo")
JOBS = {}

@mcp.tool()
async def start_long_job(task: str) -> str:
    """Return handle immediately — prevents timeout."""
    job_id = str(uuid.uuid4())[:8]
    JOBS[job_id] = {"status": "processing", "task": task}
    asyncio.create_task(_process_job(job_id))  # Background work
    return f"Job started. Handle: {job_id}. Use check_job_status to poll."

@mcp.tool()
async def check_job_status(job_id: str) -> str:
    """Poll job status — returns 'processing' or 'completed' with result."""
    job = JOBS.get(job_id)
    if not job:
        return f"FAILED: Job '{job_id}' not found"
    return f"{job['status'].upper()}: {job.get('result', 'Still processing...')}"
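The background worker referenced above (`_process_job`) is omitted; a minimal sketch of it, with a short `asyncio.sleep` standing in for the real slow API call, shows why the first poll never blocks (the timings and `demo` helper here are illustrative):

```python
import asyncio
import uuid

JOBS = {}

async def _process_job(job_id: str) -> None:
    """Simulated slow external API call, running off the tool-call path."""
    await asyncio.sleep(0.1)  # stands in for a 15s upstream API
    JOBS[job_id]["status"] = "completed"
    JOBS[job_id]["result"] = f"done: {JOBS[job_id]['task']}"

async def demo():
    job_id = str(uuid.uuid4())[:8]
    JOBS[job_id] = {"status": "processing", "task": "generate report"}
    asyncio.create_task(_process_job(job_id))  # schedule work, return immediately
    first = JOBS[job_id]["status"]             # first poll: still processing
    await asyncio.sleep(0.2)                   # the agent polls again later
    return first, JOBS[job_id]["status"]

print(asyncio.run(demo()))
```

Because `create_task` only schedules the coroutine, `start_long_job` returns its handle in microseconds; the 424-after-7s timeout never triggers no matter how slow the upstream API is.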
| Scenario | Response time | UX |
| --- | --- | --- |
| Fast API (1s) | 3s total | OK |
| Slow API (15s) | 18s blocked | Agent frozen |
| Failing API | 424 error after 7s | Agent crashes |
| Async handleId | ~4s (immediate + poll) | Agent responsive |

*Timeline visualization showing four MCP response patterns*

Why Strands Agents? The MCPClient connects to any MCP server with one line. The agent discovers tools at runtime via list_tools_sync() — no hardcoded tool list. When the MCP server implements the async pattern, the agent automatically polls without extra orchestration code.

Full demo: 02-mcp-timeout-demo — local MCP server with all 4 scenarios and notebook.


Why Do AI Agents Repeat the Same Tool Call?

AI agent reasoning loops happen when the agent calls the same tool repeatedly with identical parameters, making no progress. The root cause is ambiguous tool feedback — responses like "more results may be available" make the agent think another call will produce better results. Research shows agents can loop hundreds of times without delivering an answer.

*Diagram showing how ambiguous tool feedback causes loops versus how clear states and DebounceHook prevent them*

Fix 1 — Clear terminal states: Tools return explicit SUCCESS or FAILED instead of ambiguous messages:

# Ambiguous (causes loops)
return f"Found flights: {results}. More results may be available."

# Clear (agent stops)
return f"SUCCESS: Booked flight {conf_id} for {passenger}. Confirmation sent."

Fix 2 — DebounceHook: Detect and block duplicate tool calls at the framework level:

import json

from strands.hooks.registry import HookProvider, HookRegistry
from strands.hooks.events import BeforeToolCallEvent

class DebounceHook(HookProvider):
    """Block duplicate tool calls in a sliding window."""
    def __init__(self, window_size=3):
        self.call_history = []
        self.window_size = window_size

    def register_hooks(self, registry: HookRegistry) -> None:
        registry.add_callback(BeforeToolCallEvent, self.check_duplicate)

    def check_duplicate(self, event: BeforeToolCallEvent) -> None:
        key = (event.tool_use["name"], json.dumps(event.tool_use.get("input", {})))
        if self.call_history.count(key) >= 2:
            event.cancel_tool = f"BLOCKED: Duplicate call to {event.tool_use['name']}"
        self.call_history.append(key)
        self.call_history = self.call_history[-self.window_size:]
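The sliding-window logic is independent of any hook API. A framework-agnostic sketch of the same core (the `SlidingWindowDebounce` class name and `allow` method are illustrative, not Strands API):

```python
import json
from collections import deque

class SlidingWindowDebounce:
    """Block a tool call once the same (tool, input) pair has already
    appeared twice in the recent call window."""
    def __init__(self, window_size: int = 3):
        self.history = deque(maxlen=window_size)  # oldest calls fall off automatically

    def allow(self, tool_name: str, tool_input: dict) -> bool:
        # Canonical key: sorted JSON so {"a":1,"b":2} and {"b":2,"a":1} match.
        key = (tool_name, json.dumps(tool_input, sort_keys=True))
        blocked = self.history.count(key) >= 2
        self.history.append(key)
        return not blocked

d = SlidingWindowDebounce()
print(d.allow("search_flights", {"from": "JFK"}))  # True  -- first call
print(d.allow("search_flights", {"from": "JFK"}))  # True  -- second call
print(d.allow("search_flights", {"from": "JFK"}))  # False -- third identical call blocked
```

Using `deque(maxlen=...)` gives the sliding window for free: once a different call pushes the duplicates out of the window, the same tool call is allowed again.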
| Strategy | Tool calls | Result |
| --- | --- | --- |
| Ambiguous feedback (baseline) | 14 calls | No definitive answer |
| DebounceHook | 12 calls (2 blocked) | Completes with blocks |
| Clear SUCCESS states | 2 calls | Immediate completion |

*Bar chart showing tool calls across different strategies*

Why Strands Agents? The HookProvider API intercepts tool calls via BeforeToolCallEvent before they execute. Setting event.cancel_tool blocks execution at the framework level — the LLM cannot bypass it. This makes hooks composable: stack DebounceHook, LimitToolCounts, and custom validators on the same agent.

Full demo: 03-reasoning-loops-demo — all 4 scenarios with hooks and notebook.


Prerequisites

You need Python 3.9+, uv (a fast Python package manager), and an OpenAI API key.

git clone https://github.com/aws-samples/sample-why-agents-fail
cd sample-why-agents-fail/stop-ai-agents-wasting-tokens

# Pick any demo
cd 01-context-overflow-demo   # or 02-mcp-timeout-demo, 03-reasoning-loops-demo
uv venv && uv pip install -r requirements.txt
export OPENAI_API_KEY="your-key-here"

uv run python test_*.py

Each demo is self-contained with its own dependencies, test script, and Jupyter notebook.


Frequently Asked Questions

What are the most common AI agent failure modes?

The three most common failure modes are context window overflow (tool returns more data than the LLM can process), MCP tool timeouts (external APIs block the agent indefinitely), and reasoning loops (agent repeats the same tool call without progress). Each failure mode causes token waste and degrades response quality.

How do I reduce AI agent token costs?

The two most effective techniques are memory pointers and clear tool states. The Memory Pointer Pattern stores large tool outputs in external state and passes short references to the LLM context — reducing token usage from 200KB+ to under 100 bytes per tool call. Clear terminal states (SUCCESS/FAILED) in tool responses prevent the agent from retrying completed operations, which can reduce tool calls from 14 to 2.

Can I use these patterns with frameworks other than Strands Agents?

Yes. The Memory Pointer Pattern works with any framework that supports tool context (passing state between tools). The async handleId pattern is an MCP server design pattern — it works with any MCP-compatible agent. DebounceHook requires lifecycle hooks, which are available in LangGraph, AutoGen, and CrewAI with different APIs.



Which failure mode have you hit in your agents? Share in the comments.


Thank you!
