Elizabeth Fuentes L for AWS

AI Context Window Overflow: Memory Pointer Fix

Context window overflow occurs when an AI agent's tool outputs exceed the token limit the large language model (LLM) can process at once. The agent doesn't crash; it silently truncates data, loses earlier context, or produces incomplete results. This post shows how the Memory Pointer Pattern fixes it: from single-agent to multi-agent coordination where 145KB of data never enters any LLM context.

This demo uses Strands Agents. The Memory Pointer Pattern is framework-agnostic and can be applied with LangGraph, AutoGen, or other agent frameworks that support tool context.

Working code: github.com/aws-samples/sample-why-agents-fail

Series: Why AI Agents Fail

  1. Context Window Overflow (this post) — Memory Pointer Pattern for large data
  2. MCP Tools That Never Respond — Async pattern for slow external APIs
  3. AI Agent Reasoning Loops — Detect and block repeated tool calls

The Problem: Agents Can't Handle Large Tool Outputs

When an AI agent calls a tool that returns large data (server logs, database results, file contents), the response can overflow the LLM's context window. The agent doesn't crash with a clear error. It silently degrades: truncating data, losing context, or failing to complete the task.

Research from IBM (Solving Context Window Overflow in AI Agents, 2025) quantifies this:

  • In Materials Science workflows, tool outputs can reach 2M+ elements
  • Traditional approach consumed 20,822,181 tokens and failed
  • The same workflow with memory pointers used 1,234 tokens and succeeded
  • That's a reduction of over 16,000x in this workflow

Community observation (Context Window Limits Explained, Airbyte 2025) confirms teams discover these limits "the hard way" through silent errors. The agent appears to work but produces incomplete or wrong results.

The concept of passing references instead of raw data has also been validated in multi-agent settings. Research from Amazon (Towards Effective GenAI Multi-Agent Collaboration, 2024) introduces "payload referencing," where agents exchange pointers to shared data instead of embedding large payloads in messages. This improved performance on code-intensive tasks by 23% and achieved 90% end-to-end goal success rates in enterprise benchmarks. This is exactly what we implement below with Strands Swarm.

Why This Happens

The agent loop: User Query flows to LLM, then Tool Call, then Tool Output (214KB), then back to LLM. Large tool output causes context overflow.

When the tool output is small (a few KB), this works fine. But when a tool returns 200KB of server logs:

  1. The full output gets injected into the conversation
  2. The LLM's context window fills up
  3. Older context (including the original question) gets pushed out
  4. The LLM can't reason about the data because it can't see it all
  5. The agent either fails or produces incomplete answers
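The arithmetic behind step 2 is easy to sketch. Assuming the common rough heuristic of ~4 characters per token (real tokenizers vary by model and content), a single 200KB tool output eats a large slice of even a generous context window:

```python
# Back-of-envelope token math for a large tool output.
# Assumes ~4 characters per token (a rough heuristic; real tokenizers vary).
CHARS_PER_TOKEN = 4

def estimate_tokens(num_bytes: int) -> int:
    """Approximate token count for a UTF-8 text payload."""
    return num_bytes // CHARS_PER_TOKEN

log_output = 200 * 1024   # 200KB of server logs
context_window = 128_000  # e.g. a 128K-token model

tokens = estimate_tokens(log_output)
print(f"{tokens:,} tokens")                        # 51,200 tokens
print(f"{tokens / context_window:.0%} of window")  # 40% of window
```

One tool call consuming ~40% of the window leaves little room for the system prompt, conversation history, and reasoning — and agents rarely make just one tool call.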

Solution 1: Single Agent with Strands ToolContext

The first approach uses agent.state, a native key-value store scoped to each agent instance. Tools write large data there via ToolContext and return a short pointer string to the context:

from strands import Agent, tool, ToolContext

# context=True injects ToolContext as the last parameter — required to access agent.state
@tool(context=True)
def fetch_application_logs(app_name: str, tool_context: ToolContext, hours: int = 24) -> str:
    """Fetch application logs. Returns a memory pointer for large datasets."""
    logs = generate_logs(app_name, hours)  # Could be 200KB+

    if len(str(logs)) > 20_000:  # Threshold: store externally above 20KB
        pointer = f"logs-{app_name}"
        # Store the full payload in agent.state — it never enters the LLM context
        tool_context.agent.state.set(pointer, logs)
        # Return only the pointer key (52 bytes) — this is all the LLM sees
        return f"Data stored as pointer '{pointer}'. Use analyze tools to query it."
    return str(logs)  # Small enough to return directly

@tool(context=True)
def analyze_error_patterns(data_pointer: str, tool_context: ToolContext) -> str:
    """Analyze errors — resolves pointer from agent.state."""
    # Retrieve the full dataset from agent.state using the pointer key
    data = tool_context.agent.state.get(data_pointer)
    errors = [e for e in data if e["level"] == "ERROR"]
    # Return a summary (not raw data) — keeps the response small
    return f"Found {len(errors)} errors across {len(set(e['service'] for e in errors))} services"

The LLM never sees the 200KB. It only sees "Data stored as pointer 'logs-payment-service'" (52 bytes). The next tool reads the full data from agent.state and returns a summary. Strands handles this natively, with no global dicts, no hashlib, no external infrastructure.
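Outside Strands, the same mechanics can be sketched with a plain dict standing in for the external store. The names here (`STORE`, `store_or_return`, `resolve`) are illustrative only, not part of any framework:

```python
# Framework-agnostic sketch of the Memory Pointer Pattern.
# STORE stands in for agent.state; a real framework would use its own state API.
STORE: dict[str, object] = {}
THRESHOLD = 20_000  # bytes; same cutoff as the Strands example above

def store_or_return(key: str, payload: object) -> str:
    """Store large payloads externally; return a pointer instead of raw data."""
    text = str(payload)
    if len(text) > THRESHOLD:
        STORE[key] = payload
        return f"pointer:{key}"  # this short string is all the LLM would see
    return text                  # small enough to pass through directly

def resolve(pointer: str) -> object:
    """Resolve a pointer back to the full payload, outside the LLM context."""
    return STORE[pointer.removeprefix("pointer:")]

logs = [{"level": "ERROR", "service": "payments"}] * 5_000  # large payload
ref = store_or_return("logs-payments", logs)
errors = [e for e in resolve(ref) if e["level"] == "ERROR"]
print(ref, len(errors))  # pointer:logs-payments 5000
```

The pointer string costs a handful of tokens regardless of how large the stored payload grows — that is the entire trick.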

Single Agent Results

Metric            Without Pointers     With Memory Pointers
Data in context   214KB (full logs)    52 bytes (pointer)
Agent behavior    Truncates/fails      Processes all data
Errors detected   Partial              Complete (all services)

Bar chart comparing token usage with and without Memory Pointer Pattern across four context management strategies

Solution 2: Multi-Agent with Strands Swarm

Strands Swarm data flow: Collector, Analyzer, and Reporter agents sharing 145KB of data through invocation_state without entering any LLM context window

A single agent works for linear pipelines. But real-world incident response involves specialized roles: someone fetches data, someone analyzes it, someone writes the report. Strands Swarm coordinates multiple agents autonomously: define agents with different tools, and the Swarm handles handoffs.

This is the same "payload referencing" pattern from the Amazon multi-agent collaboration paper. Agents exchange pointers to shared data instead of passing raw payloads. The difference is that Strands Swarm handles the coordination automatically, and provides invocation_state as the official API for sharing data across agents.

import json

from strands import Agent, tool, ToolContext
from strands.multiagent import Swarm

# invocation_state is a dict shared across all agents in the Swarm — the cross-agent store
@tool(context=True)
def fetch_application_logs(app_name: str, tool_context: ToolContext, hours: int = 6) -> str:
    logs = generate_logs(app_name, hours)  # 145KB+
    pointer = f"logs-{app_name}"
    # Store in invocation_state so all downstream agents can access it without re-fetching
    tool_context.invocation_state[pointer] = logs
    # Only the pointer string travels through the LLM context to the next agent
    return f"Stored as '{pointer}'. Hand off to analyzer."

@tool(context=True)
def analyze_error_patterns(logs_pointer: str, tool_context: ToolContext) -> str:
    # Resolve the pointer to the full dataset — no LLM context consumed
    logs = tool_context.invocation_state.get(logs_pointer)
    errors = [l for l in logs if l["level"] == "ERROR"]
    result = {"total_errors": len(errors)}  # additional fields omitted for brevity
    # Store analysis results as another pointer for the reporter agent
    tool_context.invocation_state["error_analysis"] = result
    return json.dumps(result)

# Each agent has a focused role; the Swarm decides the handoff order autonomously
collector = Agent(name="collector", tools=[fetch_application_logs], model=MODEL)
analyzer = Agent(name="analyzer", tools=[analyze_error_patterns, detect_latency_anomalies], model=MODEL)
reporter = Agent(name="reporter", tools=[generate_incident_report], model=MODEL)

swarm = Swarm([collector, analyzer, reporter], entry_point=collector)
result = swarm("Fetch logs, analyze, and generate incident report.")

The Swarm coordinates automatically:

  • The collector fetches 145KB of logs and stores them in invocation_state
  • It hands off to the analyzer with the pointer "logs-payment-service"
  • The analyzer runs error and latency analysis, stores results in invocation_state, and hands off to the reporter
  • The reporter generates the final incident report

No orchestration code or manual handoff logic is needed. Each agent has its own tools, and the Swarm infers the flow from the agent descriptions and the task. All data sharing happens through tool_context.invocation_state, the same ToolContext API used in the single-agent version, backed by a store shared across agents.
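Stripped of the LLM and the Swarm machinery, the cross-agent data flow reduces to three stages sharing one dict. This plain-Python sketch (the `shared` dict stands in for invocation_state; the stage functions are illustrative, not Strands APIs) shows why only pointers ever travel between stages:

```python
# Plain-Python sketch of cross-stage pointer passing (no Swarm, no LLM):
# three stage functions share one dict, standing in for invocation_state.
import json

shared: dict[str, object] = {}

def collector() -> str:
    logs = [{"level": "ERROR" if i % 10 == 0 else "INFO", "service": "payments"}
            for i in range(10_000)]        # stand-in for 145KB of fetched logs
    shared["logs-payments"] = logs
    return "logs-payments"                 # only the pointer is handed off

def analyzer(pointer: str) -> str:
    logs = shared[pointer]                 # full data, zero LLM tokens consumed
    result = {"total_errors": sum(1 for l in logs if l["level"] == "ERROR")}
    shared["error_analysis"] = result
    return "error_analysis"

def reporter(pointer: str) -> str:
    return json.dumps(shared[pointer])     # small summary, safe for any context

print(reporter(analyzer(collector())))  # {"total_errors": 1000}
```

Each stage's return value is the only thing the next stage's LLM would need to see; the bulky payload stays in `shared` the whole time.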

Swarm Results

Status: COMPLETED
Agents: collector → analyzer → reporter
Time: ~14s
Shared store:
  logs-payment-service: 145,310 bytes
  error_analysis: 135 bytes
  latency_analysis: 70 bytes

145KB of logs processed by three agents. None of it ever entered any LLM context window.

Follow-up Investigation

After the swarm completes, the data stays in the shared store. A separate investigator agent can drill into specific services without re-fetching:

# The investigator reuses invocation_state populated by the swarm — no data re-fetch needed
investigator = Agent(
    name="investigator",
    tools=[get_error_details, analyze_error_patterns],
    model=MODEL,
)

# Each question resolves the pointer from invocation_state and runs analysis in-memory
investigator("Which service had the most errors?")
investigator("Show me the error logs for cache-layer")
investigator("What status codes are those errors returning?")
# All queries read from the same 145KB already in invocation_state — no re-fetch, no context overflow

When to Use Each Approach

Single agent + agent.state — linear pipelines where one agent handles fetch + analyze + report. Use ToolContext to access tool_context.agent.state from tools.

Swarm + invocation_state — specialized roles, complex workflows, or when you want autonomous coordination. Use ToolContext to access tool_context.invocation_state — the official Strands API for multi-agent data sharing. The Swarm handles handoffs, timeouts, and repetitive handoff detection.

Both — use SlidingWindowConversationManager as additional protection. It automatically trims conversation history and handles ContextWindowOverflowException with retry.
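The core of that trimming strategy is simple to sketch in a framework-agnostic way. This is a minimal stand-in, not the Strands implementation, which layers retry and overflow-exception handling on top:

```python
# Minimal sliding-window trim: keep only the last N messages.
# A simplified stand-in for what a conversation manager does; real
# implementations also react to context-overflow errors and preserve
# structural invariants (e.g. not splitting a tool call from its result).
def sliding_window(messages: list[dict], window_size: int = 20) -> list[dict]:
    """Drop the oldest messages once the history exceeds window_size."""
    return messages[-window_size:]

history = [{"role": "user", "content": f"msg {i}"} for i in range(50)]
trimmed = sliding_window(history, window_size=20)
print(len(trimmed), trimmed[0]["content"])  # 20 msg 30
```

Note that trimming alone discards information; it is a safety net, not a replacement for memory pointers, which keep the data retrievable.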

These approaches are part of context engineering for AI agents: the practice of deciding what information enters the LLM's context window and when.

Try It Yourself

You need Python 3.9+, uv, and an OpenAI API key.

git clone https://github.com/aws-samples/sample-why-agents-fail
cd sample-why-agents-fail/stop-ai-agents-wasting-tokens/01-context-overflow-demo
uv venv && uv pip install -r requirements.txt
export OPENAI_API_KEY="your-key-here"

uv run python test_context_overflow.py   # Single-agent: 4 scenarios
uv run python swarm_demo.py              # Multi-agent: Collector → Analyzer → Reporter

Or open test_context_overflow.ipynb in Kiro, VS Code, or your preferred notebook environment.

Key Takeaways

  1. Context overflow is silent — agents don't crash, they produce wrong results
  2. Memory pointers solve it — store large data externally, pass references
  3. >16,000x token reduction — validated by IBM Research on the Materials Science benchmark
  4. Single-agent uses agent.state, accessed via @tool(context=True) and ToolContext, to store and retrieve data outside the context window
  5. Multi-agent uses invocation_state — same ToolContext API, shared across all agents in the Swarm. No orchestration code needed
  6. Data persists for follow-up — after the pipeline completes, stored data is available for investigation without re-fetching

Frequently Asked Questions

Why do AI agents run out of context?

AI agents run out of context when tool responses are injected directly into the LLM conversation history. Each response consumes tokens. When cumulative tool outputs exceed the model's context window limit, the LLM loses earlier context, truncates data, or fails entirely. This happens silently: the agent appears to work but produces incomplete or wrong results.

What is the Memory Pointer Pattern for AI agents?

The Memory Pointer Pattern stores large tool outputs (logs, datasets, query results) in external state instead of the LLM context window. Tools return a short reference key (the "pointer") that subsequent tools use to retrieve the full data. IBM Research validated this pattern with a reduction of over 16,000x on the Materials Science benchmark.

How does agent.state differ from invocation_state in Strands Agents?

agent.state is scoped to a single agent instance. Use it for linear pipelines where one agent handles all steps. invocation_state is shared across all agents in a Strands Swarm. Use it when multiple specialized agents need to exchange data without passing large payloads through the LLM context.

Can I use the Memory Pointer Pattern with LangGraph or other frameworks?

Yes. The pattern requires two capabilities: a shared key-value store accessible from tools, and the ability to pass short reference strings through the LLM context. LangGraph provides this through its state management, AutoGen through shared memory, and CrewAI through task context. The Strands implementation uses ToolContext as the native API.


Have you hit context window limits in your agents? What strategies worked for you? Share in the comments.

Next in this series: MCP Tools That Never Respond — async patterns for slow external APIs.


All code in this series is open source under the MIT-0 License. Star the repository to follow updates.


Thank you!

🇻🇪Dev.to - Linkedin - GitHub - Twitter - Instagram - Youtube
