Elizabeth Fuentes L for AWS

AI Context Window Overflow: Memory Pointer Fix

Context window overflow occurs when an AI agent's tool outputs exceed the token limit the large language model (LLM) can process at once. The agent doesn't crash; it silently truncates data, loses earlier context, or produces incomplete results. This post shows how the Memory Pointer Pattern fixes it: from single-agent to multi-agent coordination where 145KB of data never enters any LLM context.

This demo uses Strands Agents. The Memory Pointer Pattern is framework-agnostic and can be applied with LangGraph, AutoGen, or other agent frameworks that support tool context.

Working code: github.com/aws-samples/sample-why-agents-fail

Series: Why AI Agents Fail

  1. Context Window Overflow (this post) — Memory Pointer Pattern for large data
  2. MCP Tools That Never Respond — Async pattern for slow external APIs
  3. AI Agent Reasoning Loops — Detect and block repeated tool calls

The Problem: Agents Can't Handle Large Tool Outputs

When an AI agent calls a tool that returns large data (server logs, database results, file contents), the response can overflow the LLM's context window. The agent doesn't crash with a clear error. It silently degrades: truncating data, losing context, or failing to complete the task.

Research from IBM (Solving Context Window Overflow in AI Agents, 2025) quantifies this:

  • In Materials Science workflows, tool outputs can reach 2M+ elements
  • Traditional approach consumed 20,822,181 tokens and failed
  • The same workflow with memory pointers used 1,234 tokens and succeeded
  • That's a reduction of over 16,000x in this workflow

Community observation (Context Window Limits Explained, Airbyte 2025) confirms teams discover these limits "the hard way" through silent errors. The agent appears to work but produces incomplete or wrong results.

The concept of passing references instead of raw data has also been validated in multi-agent settings. Research from Amazon (Towards Effective GenAI Multi-Agent Collaboration, 2024) introduces "payload referencing," where agents exchange pointers to shared data instead of embedding large payloads in messages. This improved performance on code-intensive tasks by 23% and achieved 90% end-to-end goal success rates in enterprise benchmarks. This is exactly what we implement below with Strands Swarm.

Why This Happens

The agent loop: User Query flows to LLM, then Tool Call, then Tool Output (214KB), then back to LLM. Large tool output causes context overflow.

When the tool output is small (a few KB), this works fine. But when a tool returns 200KB of server logs:

  1. The full output gets injected into the conversation
  2. The LLM's context window fills up
  3. Older context (including the original question) gets pushed out
  4. The LLM can't reason about the data because it can't see it all
  5. The agent either fails or produces incomplete answers
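The arithmetic behind step 2 is easy to sketch. Assuming the common rough heuristic of ~4 characters per token (real tokenizers vary by model and content), a single 200KB tool output eats a large slice of even a generous context window:

```python
# Back-of-envelope token math for a large tool output.
# Assumes ~4 characters per token (a rough heuristic; real tokenizers vary).
CHARS_PER_TOKEN = 4

def estimate_tokens(num_bytes: int) -> int:
    """Approximate token count for a UTF-8 text payload."""
    return num_bytes // CHARS_PER_TOKEN

log_output = 200 * 1024   # 200KB of server logs
context_window = 128_000  # e.g. a 128K-token model

tokens = estimate_tokens(log_output)
print(f"{tokens:,} tokens")                        # 51,200 tokens
print(f"{tokens / context_window:.0%} of window")  # 40% of window
```

One tool call consuming ~40% of the window leaves little room for the system prompt, conversation history, and reasoning — and agents rarely make just one tool call.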

Solution 1: Single Agent with Strands ToolContext

The first approach uses agent.state, a native key-value store scoped to each agent instance. Tools write large data there via ToolContext and return a short pointer string to the context:

from strands import Agent, tool, ToolContext

# context=True injects ToolContext as the last parameter — required to access agent.state
@tool(context=True)
def fetch_application_logs(app_name: str, tool_context: ToolContext, hours: int = 24) -> str:
    """Fetch application logs. Returns a memory pointer for large datasets."""
    logs = generate_logs(app_name, hours)  # Could be 200KB+

    if len(str(logs)) > 20_000:  # Threshold: store externally above 20KB
        pointer = f"logs-{app_name}"
        # Store the full payload in agent.state — it never enters the LLM context
        tool_context.agent.state.set(pointer, logs)
        # Return only the pointer key (52 bytes) — this is all the LLM sees
        return f"Data stored as pointer '{pointer}'. Use analyze tools to query it."
    return str(logs)  # Small enough to return directly

@tool(context=True)
def analyze_error_patterns(data_pointer: str, tool_context: ToolContext) -> str:
    """Analyze errors — resolves pointer from agent.state."""
    # Retrieve the full dataset from agent.state using the pointer key
    data = tool_context.agent.state.get(data_pointer)
    errors = [e for e in data if e["level"] == "ERROR"]
    # Return a summary (not raw data) — keeps the response small
    return f"Found {len(errors)} errors across {len(set(e['service'] for e in errors))} services"

The LLM never sees the 200KB. It only sees "Data stored as pointer 'logs-payment-service'" (52 bytes). The next tool reads the full data from agent.state and returns a summary. Strands handles this natively, with no global dicts, no hashlib, no external infrastructure.
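Outside Strands, the same mechanics can be sketched with a plain dict standing in for the external store. The names here (`STORE`, `store_or_return`, `resolve`) are illustrative only, not part of any framework:

```python
# Framework-agnostic sketch of the Memory Pointer Pattern.
# STORE stands in for agent.state; a real framework would use its own state API.
STORE: dict[str, object] = {}
THRESHOLD = 20_000  # bytes; same cutoff as the Strands example above

def store_or_return(key: str, payload: object) -> str:
    """Store large payloads externally; return a pointer instead of raw data."""
    text = str(payload)
    if len(text) > THRESHOLD:
        STORE[key] = payload
        return f"pointer:{key}"  # this short string is all the LLM would see
    return text                  # small enough to pass through directly

def resolve(pointer: str) -> object:
    """Resolve a pointer back to the full payload, outside the LLM context."""
    return STORE[pointer.removeprefix("pointer:")]

logs = [{"level": "ERROR", "service": "payments"}] * 5_000  # large payload
ref = store_or_return("logs-payments", logs)
errors = [e for e in resolve(ref) if e["level"] == "ERROR"]
print(ref, len(errors))  # pointer:logs-payments 5000
```

The pointer string costs a handful of tokens regardless of how large the stored payload grows — that is the entire trick.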

Single Agent Results

Metric            Without Pointers     With Memory Pointers
Data in context   214KB (full logs)    52 bytes (pointer)
Agent behavior    Truncates/fails      Processes all data
Errors detected   Partial              Complete (all services)

Bar chart comparing token usage with and without Memory Pointer Pattern across four context management strategies

Solution 2: Multi-Agent with Strands Swarm

Strands Swarm data flow: Collector, Analyzer, and Reporter agents sharing 145KB of data through invocation_state without entering any LLM context window

A single agent works for linear pipelines. But real-world incident response involves specialized roles: someone fetches data, someone analyzes it, someone writes the report. Strands Swarm coordinates multiple agents autonomously: define agents with different tools, and the Swarm handles handoffs.

This is the same "payload referencing" pattern from the Amazon multi-agent collaboration paper. Agents exchange pointers to shared data instead of passing raw payloads. The difference is that Strands Swarm handles the coordination automatically, and provides invocation_state as the official API for sharing data across agents.

import json

from strands import Agent, tool, ToolContext
from strands.multiagent import Swarm

# invocation_state is a dict shared across all agents in the Swarm — the cross-agent store
@tool(context=True)
def fetch_application_logs(app_name: str, tool_context: ToolContext, hours: int = 6) -> str:
    logs = generate_logs(app_name, hours)  # 145KB+
    pointer = f"logs-{app_name}"
    # Store in invocation_state so all downstream agents can access it without re-fetching
    tool_context.invocation_state[pointer] = logs
    # Only the pointer string travels through the LLM context to the next agent
    return f"Stored as '{pointer}'. Hand off to analyzer."

@tool(context=True)
def analyze_error_patterns(logs_pointer: str, tool_context: ToolContext) -> str:
    # Resolve the pointer to the full dataset — no LLM context consumed
    logs = tool_context.invocation_state.get(logs_pointer)
    errors = [l for l in logs if l["level"] == "ERROR"]
    result = {"total_errors": len(errors)}  # additional fields omitted for brevity
    # Store analysis results as another pointer for the reporter agent
    tool_context.invocation_state["error_analysis"] = result
    return json.dumps(result)

# Each agent has a focused role; the Swarm decides the handoff order autonomously
collector = Agent(name="collector", tools=[fetch_application_logs], model=MODEL)
analyzer = Agent(name="analyzer", tools=[analyze_error_patterns, detect_latency_anomalies], model=MODEL)
reporter = Agent(name="reporter", tools=[generate_incident_report], model=MODEL)

swarm = Swarm([collector, analyzer, reporter], entry_point=collector)
result = swarm("Fetch logs, analyze, and generate incident report.")

The Swarm coordinates automatically:

  • The collector fetches 145KB of logs and stores them in invocation_state
  • It hands off to the analyzer with the pointer "logs-payment-service"
  • The analyzer runs error and latency analysis, stores results in invocation_state, and hands off to the reporter
  • The reporter generates the final incident report

No orchestration code or manual handoff logic is needed. Each agent has its own tools, and the Swarm infers the flow from the agent descriptions and the task. All data sharing happens through tool_context.invocation_state, the same ToolContext API used in the single-agent version, backed by a store shared across agents.
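Stripped of the LLM and the Swarm machinery, the cross-agent data flow reduces to three stages sharing one dict. This plain-Python sketch (the `shared` dict stands in for invocation_state; the stage functions are illustrative, not Strands APIs) shows why only pointers ever travel between stages:

```python
# Plain-Python sketch of cross-stage pointer passing (no Swarm, no LLM):
# three stage functions share one dict, standing in for invocation_state.
import json

shared: dict[str, object] = {}

def collector() -> str:
    logs = [{"level": "ERROR" if i % 10 == 0 else "INFO", "service": "payments"}
            for i in range(10_000)]        # stand-in for 145KB of fetched logs
    shared["logs-payments"] = logs
    return "logs-payments"                 # only the pointer is handed off

def analyzer(pointer: str) -> str:
    logs = shared[pointer]                 # full data, zero LLM tokens consumed
    result = {"total_errors": sum(1 for l in logs if l["level"] == "ERROR")}
    shared["error_analysis"] = result
    return "error_analysis"

def reporter(pointer: str) -> str:
    return json.dumps(shared[pointer])     # small summary, safe for any context

print(reporter(analyzer(collector())))  # {"total_errors": 1000}
```

Each stage's return value is the only thing the next stage's LLM would need to see; the bulky payload stays in `shared` the whole time.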

Swarm Results

Status: COMPLETED
Agents: collector → analyzer → reporter
Time: ~14s
Shared store:
  logs-payment-service: 145,310 bytes
  error_analysis: 135 bytes
  latency_analysis: 70 bytes

145KB of logs processed by three agents. None of it ever entered any LLM context window.

Follow-up Investigation

After the swarm completes, the data stays in the shared store. A separate investigator agent can drill into specific services without re-fetching:

# The investigator reuses invocation_state populated by the swarm — no data re-fetch needed
investigator = Agent(
    name="investigator",
    tools=[get_error_details, analyze_error_patterns],
    model=MODEL,
)

# Each question resolves the pointer from invocation_state and runs analysis in-memory
investigator("Which service had the most errors?")
investigator("Show me the error logs for cache-layer")
investigator("What status codes are those errors returning?")
# All queries read from the same 145KB already in invocation_state — no re-fetch, no context overflow

When to Use Each Approach

Single agent + agent.state — linear pipelines where one agent handles fetch + analyze + report. Use ToolContext to access tool_context.agent.state from tools.

Swarm + invocation_state — specialized roles, complex workflows, or when you want autonomous coordination. Use ToolContext to access tool_context.invocation_state — the official Strands API for multi-agent data sharing. The Swarm handles handoffs, timeouts, and repetitive handoff detection.

Both — use SlidingWindowConversationManager as additional protection. It automatically trims conversation history and handles ContextWindowOverflowException with retry.
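The core of that trimming strategy is simple to sketch in a framework-agnostic way. This is a minimal stand-in, not the Strands implementation, which layers retry and overflow-exception handling on top:

```python
# Minimal sliding-window trim: keep only the last N messages.
# A simplified stand-in for what a conversation manager does; real
# implementations also react to context-overflow errors and preserve
# structural invariants (e.g. not splitting a tool call from its result).
def sliding_window(messages: list[dict], window_size: int = 20) -> list[dict]:
    """Drop the oldest messages once the history exceeds window_size."""
    return messages[-window_size:]

history = [{"role": "user", "content": f"msg {i}"} for i in range(50)]
trimmed = sliding_window(history, window_size=20)
print(len(trimmed), trimmed[0]["content"])  # 20 msg 30
```

Note that trimming alone discards information; it is a safety net, not a replacement for memory pointers, which keep the data retrievable.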

These approaches are part of context engineering for AI agents: the practice of deciding what information enters the LLM's context window and when.

Try It Yourself

You need Python 3.9+, uv, and an OpenAI API key.

git clone https://github.com/aws-samples/sample-why-agents-fail
cd sample-why-agents-fail/stop-ai-agents-wasting-tokens/01-context-overflow-demo
uv venv && uv pip install -r requirements.txt
export OPENAI_API_KEY="your-key-here"

uv run python test_context_overflow.py   # Single-agent: 4 scenarios
uv run python swarm_demo.py              # Multi-agent: Collector → Analyzer → Reporter

Or open test_context_overflow.ipynb in Kiro, VS Code, or your preferred notebook environment.

Key Takeaways

  1. Context overflow is silent — agents don't crash, they produce wrong results
  2. Memory pointers solve it — store large data externally, pass references
  3. >16,000x token reduction — validated by IBM Research on the Materials Science benchmark
  4. Single-agent uses agent.state, accessed via @tool(context=True) and ToolContext, to store and retrieve data outside the context window
  5. Multi-agent uses invocation_state — same ToolContext API, shared across all agents in the Swarm. No orchestration code needed
  6. Data persists for follow-up — after the pipeline completes, stored data is available for investigation without re-fetching

Frequently Asked Questions

Why do AI agents run out of context?

AI agents run out of context when tool responses are injected directly into the LLM conversation history. Each response consumes tokens. When cumulative tool outputs exceed the model's context window limit, the LLM loses earlier context, truncates data, or fails entirely. This happens silently: the agent appears to work but produces incomplete or wrong results.

What is the Memory Pointer Pattern for AI agents?

The Memory Pointer Pattern stores large tool outputs (logs, datasets, query results) in external state instead of the LLM context window. Tools return a short reference key (the "pointer") that subsequent tools use to retrieve the full data. IBM Research validated this pattern with a reduction of over 16,000x on the Materials Science benchmark.

How does agent.state differ from invocation_state in Strands Agents?

agent.state is scoped to a single agent instance. Use it for linear pipelines where one agent handles all steps. invocation_state is shared across all agents in a Strands Swarm. Use it when multiple specialized agents need to exchange data without passing large payloads through the LLM context.

Can I use the Memory Pointer Pattern with LangGraph or other frameworks?

Yes. The pattern requires two capabilities: a shared key-value store accessible from tools, and the ability to pass short reference strings through the LLM context. LangGraph provides this through its state management, AutoGen through shared memory, and CrewAI through task context. The Strands implementation uses ToolContext as the native API.


Have you hit context window limits in your agents? What strategies worked for you? Share in the comments.

Next in this series: MCP Tools That Never Respond — async patterns for slow external APIs.


All code in this series is open source under the MIT-0 License. Star the repository to follow updates.


Thank you!

🇻🇪Dev.to - Linkedin - GitHub - Twitter - Instagram - Youtube
