Morgan Willis for AWS

Posted on Jun 9

Your Agent Doesn't Need That 10,000-Token API Response: Context Offloading with Strands

#architecture #security #ai #agents

Context engineering matters for two reasons: reliability and cost. If your agent's context window is full of noise, reasoning quality drops and you're paying for tokens that aren't helping anything. And one of the biggest sources of that noise? Tool results.

HTTP requests, file readers, API clients, and database queries can return context-heavy results. When these verbose tool results enter the conversation, they crowd out other context and burn through tokens quickly.

You need a way to truncate tool results and bring the full result back into context only when it's needed. Luckily, Strands Agents just released something that does this for you automatically.

Offloading Noisy Tool Results Automatically

Strands Agents just shipped the ContextOffloader plugin, available in both the TypeScript and Python SDKs. It automatically keeps large tool results from consuming your agent's context window. When a tool returns a result that exceeds a configurable token threshold, the plugin stores each content block individually in an external storage backend and replaces it in the conversation with a truncated preview plus per-block references. Each offloaded result includes inline guidance telling the agent to use its available tools to selectively access the data it needs.

You may already be using Conversation Managers for context management. They keep your overall conversation from exceeding model context limits by trimming or summarizing older messages when the window fills up. That handles the macro problem of making sure you don't blow past the model's total token budget.

The ContextOffloader handles the context engineering task of dealing with individual tool results you don't want to live in the context window for every turn. Instead of waiting for the conversation to overflow and then compressing it reactively, the ContextOffloader externalizes and truncates tool results automatically when they're returned. The agent can still retrieve the full content whenever it needs it, but by default it doesn't keep the whole thing in context at all times. In practice, you should use conversation managers and the context offloading plugin together.

Use SummarizingConversationManager or SlidingWindowConversationManager to safeguard against overall conversation length, and use the ContextOffloader to keep individual tool results from bloating the window in the first place.

What Strands Context Offloader Does

The ContextOffloader sits between your tools and your conversation history. When a tool returns a result, the plugin estimates the token count. If that result exceeds a configurable threshold (default 2,500 tokens), it stores the full content in external storage and replaces it in the conversation with a truncated preview plus a reference ID.
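
To make that flow concrete, here is a minimal sketch of the decision the plugin makes. It is not the plugin's actual implementation: the roughly 4-characters-per-token estimate, the plain Map used as storage, the 1,000-token preview default, and the names estimateTokens, offloadIfLarge, and the ref_N reference format are all assumptions for illustration. Only the 2,500-token threshold comes from the defaults described above.

```typescript
// Illustrative only: not the ContextOffloader source.
type Offloaded = { preview: string; reference: string; estimatedTokens: number };

// Assumption: ~4 characters per token, a common rough heuristic.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function offloadIfLarge(
  result: string,
  store: Map<string, string>,
  maxResultTokens = 2500, // default threshold mentioned in this post
  previewTokens = 1000,   // assumed preview size for this sketch
): string | Offloaded {
  const estimatedTokens = estimateTokens(result);
  // Small enough: the result stays in the conversation untouched.
  if (estimatedTokens <= maxResultTokens) return result;

  const reference = `ref_${store.size + 1}`; // hypothetical reference format
  store.set(reference, result);              // full content lives outside the conversation
  return { preview: result.slice(0, previewTokens * 4), reference, estimatedTokens };
}

const store = new Map<string, string>();
const small = offloadIfLarge("ok", store);
const large = offloadIfLarge("x".repeat(40000), store);
```

Here small comes back unchanged, while large is replaced by a preview plus a reference, and the full 40,000 characters sit in the store until something asks for them.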

Your agent sees something like: "Here's the first ~1,000 tokens of that file, and here's a reference if you need more." The full content stays out of the context window unless the agent explicitly asks for it.

The agent then uses a retrieval tool provided by the plugin, retrieve_offloaded_content, to selectively pull back specific parts of the stored tool results.

It also doesn't need to pull in the whole thing. It can just bring back the parts it needs.

This is a core part of context engineering: bringing in only the information you need, when you need it. The agent sees the gist from the preview, and pulls in more details when the task requires it. With Strands, you let the model decide when that happens.
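
Selective retrieval can be pictured with a sketch like the one below. The real parameters of retrieve_offloaded_content aren't shown in this post, so retrieveLines and searchLines are hypothetical helpers, and the Map standing in for storage is an assumption too.

```typescript
// Hypothetical helpers; the real retrieve_offloaded_content parameters may differ.
function loadContent(store: Map<string, string>, reference: string): string {
  const content = store.get(reference);
  if (content === undefined) throw new Error(`Unknown reference: ${reference}`);
  return content;
}

// Return a 1-based, inclusive line range, similar to `sed -n 'start,endp'`.
function retrieveLines(
  store: Map<string, string>,
  reference: string,
  start: number,
  end: number,
): string[] {
  return loadContent(store, reference).split("\n").slice(start - 1, end);
}

// Return only the lines containing a substring, with their line numbers.
function searchLines(
  store: Map<string, string>,
  reference: string,
  needle: string,
): Array<{ line: number; text: string }> {
  return loadContent(store, reference)
    .split("\n")
    .map((text, i) => ({ line: i + 1, text }))
    .filter((entry) => entry.text.includes(needle));
}

// 100 rows of fake data; only row 42 has the admin role.
const offloaded = new Map<string, string>();
offloaded.set(
  "ref_1",
  Array.from({ length: 100 }, (_, i) => `row ${i + 1}: role=${i === 41 ? "admin" : "user"}`).join("\n"),
);

const slice = retrieveLines(offloaded, "ref_1", 45, 47);
const hits = searchLines(offloaded, "ref_1", "admin");
```

Either call brings back a handful of lines instead of the full 100, which is the point: the agent pays only for the slice it asked for.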

Setting it up

Here's how you set it up with the Strands TypeScript SDK.

Basic setup with defaults:

import { Agent } from '@strands-agents/sdk'
import { ContextOffloader, InMemoryStorage } from '@strands-agents/sdk/vended-plugins/context-offloader'

const storage = new InMemoryStorage();

const agent = new Agent({
  plugins: [
    new ContextOffloader({ storage })
  ]
});

With this in place, every tool result over 2,500 tokens gets offloaded automatically, and the agent gets access to retrieve_offloaded_content to retrieve that data when it needs it.

You can also tune the offload threshold and how many tokens of each result stay in context as a preview:

import { Agent } from '@strands-agents/sdk'
import { ContextOffloader, InMemoryStorage } from '@strands-agents/sdk/vended-plugins/context-offloader'

const agent = new Agent({
  plugins: [
    new ContextOffloader({
      storage: new InMemoryStorage(),
      maxResultTokens: 1500,   // offload earlier for smaller context models
      previewTokens: 300,      // shorter previews in conversation
      includeRetrievalTool: true  // default, but explicit here
    })
  ]
});

maxResultTokens is the token threshold at which offloading kicks in: any tool result estimated above this count gets offloaded to storage. Lower it if you're working with a smaller-context model. previewTokens controls how much of the result stays visible in the conversation as a preview. The agent uses this preview to decide whether it needs to retrieve more, so you want enough context to be useful without defeating the purpose of offloading.
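
For a feel of what these two settings buy you, here is some back-of-the-envelope arithmetic. It assumes the tool result would otherwise stay in the history for every remaining turn, that history is resent on each model call, and that the agent never retrieves the full content. It ignores prompt caching and the small overhead of the offload guidance text. The numbers and the inputTokensSpent helper are illustrative, not measurements.

```typescript
// Rough estimate only; real savings depend on your model, caching, and retrievals.
function inputTokensSpent(tokensInContext: number, remainingTurns: number): number {
  // Whatever sits in the history is paid for again on every subsequent model call.
  return tokensInContext * remainingTurns;
}

const resultTokens = 10000; // size of the raw tool result
const previewTokens = 300;  // what stays in context after offloading
const turns = 8;            // model calls made after the tool returned

const withoutOffloading = inputTokensSpent(resultTokens, turns);
const withOffloading = inputTokensSpent(previewTokens, turns);
const saved = withoutOffloading - withOffloading;

console.log({ withoutOffloading, withOffloading, saved });
```

Under those assumptions a single 10,000-token result costs 80,000 input tokens over eight turns, versus 2,400 for the preview. Every retrieval the agent makes eats into that gap, which is why the preview size deserves tuning.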

What The Agent Sees

When a tool result gets offloaded, the agent sees something like this in the conversation:

[Offloaded: 1 block, ~10,000 tokens]
Tool result was offloaded to external storage due to size.
Use the preview below to answer if possible.
Use retrieve_offloaded_content to fetch the full content by reference.

{"users":[{"id":1,"name":"Alice","role":"admin"},{"id":2,"name":"Bob","role":"user"},{"id":3,"name":"Charlie","rol...

[Stored references:]  mem_1_tool-123_0 (json, 42,000 bytes)

The preview gives the agent enough to understand what came back. The reference ID lets it retrieve more when it needs to. Sometimes the agent can answer from the preview alone and never needs to pull in the full result.
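
If you want to reason about how much overhead the notice itself adds, a small renderer that mimics the format above can help. This is a hypothetical reconstruction based only on the example output shown here. renderOffloadNotice and the StoredBlock shape are made-up names, and the plugin's real wording may differ.

```typescript
// Hypothetical renderer that mimics the notice format shown above; not the plugin's code.
type StoredBlock = { reference: string; kind: string; bytes: number };

function renderOffloadNotice(
  preview: string,
  estimatedTokens: number,
  blocks: StoredBlock[],
): string {
  const fmt = (n: number) => n.toLocaleString("en-US"); // 10000 -> "10,000"
  const blockWord = blocks.length === 1 ? "block" : "blocks";
  const refs = blocks
    .map((b) => `${b.reference} (${b.kind}, ${fmt(b.bytes)} bytes)`)
    .join(", ");

  return [
    `[Offloaded: ${blocks.length} ${blockWord}, ~${fmt(estimatedTokens)} tokens]`,
    "Tool result was offloaded to external storage due to size.",
    "Use the preview below to answer if possible.",
    "Use retrieve_offloaded_content to fetch the full content by reference.",
    "",
    `${preview}...`, // truncated preview, marked with an ellipsis
    "",
    `[Stored references:]  ${refs}`,
  ].join("\n");
}

const notice = renderOffloadNotice('{"users":[{"id":1,"name":"Alice","role":"admin"}', 10000, [
  { reference: "mem_1_tool-123_0", kind: "json", bytes: 42000 },
]);
console.log(notice);
```

The fixed guidance lines are a small, constant cost per offloaded result, so the preview is where nearly all of the remaining tokens go.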

Using existing tools for retrieval

You can store the tool results in memory, in files, or in external storage services like Amazon S3. When you're using FileStorage, the agent can use its existing tools like shell, grep, and cat to access offloaded content directly from the file system. The offloaded guidance includes the full storage path, so the agent knows where to look:

grep -n "admin" ./artifacts/mem_1_tool-123_0
cat ./artifacts/mem_1_tool-123_0 | head -50
sed -n '45,55p' ./artifacts/mem_1_tool-123_0

This is often preferable because the agent already knows these tools well and can chain them together for more complex queries than the built-in retrieval tool supports. You can even disable the built-in tool entirely and let the agent use its own:

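import { Agent } from '@strands-agents/sdk'
// Assumes FileStorage is exported alongside the other storage backends
import { ContextOffloader, FileStorage } from '@strands-agents/sdk/vended-plugins/context-offloader'

// `shell` stands for whichever shell tool your agent already uses; import it from wherever it is defined
const agent = new Agent({
  tools: [shell],
  plugins: [
    new ContextOffloader({
      storage: new FileStorage('./artifacts'),
      includeRetrievalTool: false  // rely on the agent's own shell tooling instead
    })
  ]
});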

With InMemoryStorage, there's no external access path, so keep the built-in retrieval tool enabled. With S3Storage, the agent can use the AWS CLI if it has access to a shell tool.

Storage backends

The offloaded content has to live somewhere. The ContextOffloader supports three storage backends, and which one you pick depends on your use case:

Backend	Use case	Setup
`InMemoryStorage`	Dev, testing, short-lived agents	Zero config, data gone when process exits
`FileStorage`	Local dev, debugging, agents with access to file systems	Writes to disk, human-readable
`S3Storage`	Production, multi-instance, durable	Needs bucket config, handles concurrent access

For production agents that run across multiple invocations or instances, S3Storage is a good choice:

import { Agent } from '@strands-agents/sdk'
import { ContextOffloader, S3Storage } from '@strands-agents/sdk/vended-plugins/context-offloader'

const storage = new S3Storage({
  bucket: "my-agent-context-store",
  prefix: "offloaded/"
});

const agent = new Agent({
  plugins: [
    new ContextOffloader({ storage })
  ]
});

Starting with FileStorage during development makes sense because you can quickly and easily inspect what's being offloaded and verify the previews make sense for your use case. Once you're confident in the settings, swap to S3 or another externalized storage layer for deployment.
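
The reason swapping backends is painless is that the rest of the system only cares about "store this, give me a reference, let me fetch it later." Here is a minimal sketch of that idea. OffloadStorage, MemoryStorage, and InspectingStorage are hypothetical names for illustration, and the SDK's real storage contract may look different. The interface is kept synchronous for brevity, whereas a remote backend such as S3 would be asynchronous.

```typescript
// Hypothetical interface for illustration; the SDK's real storage contract may differ.
interface OffloadStorage {
  put(content: string): string;               // stores content, returns a reference
  get(reference: string): string | undefined; // fetches content by reference
}

class MemoryStorage implements OffloadStorage {
  private items = new Map<string, string>();

  put(content: string): string {
    const reference = `mem_${this.items.size + 1}`;
    this.items.set(reference, content);
    return reference;
  }

  get(reference: string): string | undefined {
    return this.items.get(reference);
  }
}

// Wraps any backend and records what was offloaded, handy while tuning thresholds.
class InspectingStorage implements OffloadStorage {
  readonly log: Array<{ reference: string; chars: number }> = [];
  private inner: OffloadStorage;

  constructor(inner: OffloadStorage) {
    this.inner = inner;
  }

  put(content: string): string {
    const reference = this.inner.put(content);
    this.log.push({ reference, chars: content.length });
    return reference;
  }

  get(reference: string): string | undefined {
    return this.inner.get(reference);
  }
}

const backend = new InspectingStorage(new MemoryStorage());
const ref = backend.put("a".repeat(12000));
```

Because the caller only sees the interface, replacing MemoryStorage with a file-backed or bucket-backed implementation would not change any code that offloads or retrieves.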

Tradeoffs

A few things to keep in mind. The agent reasons over the preview, not the full result. If the answer is buried deep in a large result and the preview doesn't hint at it, the agent might miss it. Tune previewTokens to balance context usage against information loss for your specific tools. Storage isn't free either: S3Storage incurs S3 request and storage charges (PUTs when results are offloaded, GETs when they're retrieved), and FileStorage writes to disk each time.
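
The "buried answer" risk is easy to reproduce. The sketch below builds a fake payload where the one interesting record sits far past the preview cutoff. The data, the headPreview helper, and the 4-characters-per-token estimate are assumptions for illustration.

```typescript
// Illustrative data: 2,000 records, with the only admin at position 1,501.
const records = Array.from({ length: 2000 }, (_, i) => ({
  id: i + 1,
  role: i === 1500 ? "admin" : "user",
}));
const payload = JSON.stringify(records);

// Head-of-string preview, assuming roughly 4 characters per token.
function headPreview(text: string, previewTokens: number): string {
  return text.slice(0, previewTokens * 4);
}

const shortPreview = headPreview(payload, 300);
const previewShowsAdmin = shortPreview.includes('"role":"admin"');
const payloadHasAdmin = payload.includes('"role":"admin"');

console.log({ previewShowsAdmin, payloadHasAdmin });
```

The preview only covers the first few dozen records, so nothing in it suggests an admin exists. An agent asked "is there an admin?" would have to know to search the offloaded content rather than trust the preview, which is why it pays to check previews against the questions your agent typically answers.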

Why this matters

ContextOffloader gives you a sensible default for managing noisy tool results without building a custom context engineering strategy. It's proactive, externalizing content before it bloats your window instead of cleaning up after a failure. It preserves full access to the data so nothing is lost. And it gives the agent the ability to decide what it needs and retrieve just that slice.

If you're hitting context window limits, seeing hallucinations from information overload, or just paying for unnecessarily large model calls, drop ContextOffloader into your agent and see how the behavior changes. You might be surprised how much cleaner your agent's reasoning gets when it's not holding onto data it doesn't need right now.

Check out the full documentation and the Strands Agents GitHub to get started.

Top comments (8)

Ken W Alger • Jun 9

This is an exceptional architectural blueprint. For a long time, the dominant narrative in AI development has been "just buy a larger context window and let the model sort it out." This post cleanly exposes why that approach fails at scale on both a cost and reliability level.

What Strands is doing with ContextOffloader maps beautifully to the write-side exclusion patterns we’ve been tracking in the open-source Sovereign Systems Specification. Externalizing noisy tool dumps to an independent data tier (whether that's local disk or an S3 bucket) preserves the integrity of the active reasoning window while maintaining a verifiable audit trail.

The next fascinating frontier here is the data custody boundary. While the enterprise path defaults to elastic cloud storage like S3, the same exact reference-token pattern can be deployed local-first—binding the offloaded blocks to cryptographically signed, isolated vault segments on local hardware. Keeping the data lineage entirely local until an explicit egress gate clears it means you get all the benefits of this context engineering without a single byte of raw text leaking across a network boundary. Thanks for putting this together.

Morgan Willis AWS • Jun 9

Really appreciate this perspective!

What I like about the ContextOffloader approach is that it preserves the agent's ability to reason over prior tool results without forcing every token to remain resident in active context. The model gets references to information rather than the information itself, which is a much more scalable pattern.

Your point about data custody is interesting. The underlying pattern isn't really tied to S3 or any specific storage backend. The same reference-token architecture could absolutely point to local-first, encrypted, or cryptographically signed storage systems, giving organizations much tighter control over where data lives while still benefiting from context offloading and retrieval-on-demand.

Ken W Alger • Jun 9

Exactly right. Once you treat the context window as an active runtime cache rather than a data dump, the underlying storage medium becomes a pluggable infrastructure choice.

The shift to references changes the game. By moving the heavy data lifting out of the active token budget, we can enforce strict custody policies at the storage tier, whether that means routing to an elastic cloud bucket or keeping it entirely on local silicon behind encrypted, signed boundaries.

It’s great to see Strands building the clean primitives that make this pattern accessible. Looking forward to seeing how the plugin ecosystem around it grows.

Max Quimby • Jun 9

The framing of proactive offloading vs. reactive trimming is the part I wish more people internalized — they solve different problems, and teams keep reaching for conversation summarization when the real bleed is a single 10k-token tool result that never needed to live past one turn.

One thing I've hit running agents over noisy tool calls: the quality of the truncated preview matters more than the threshold. If the preview doesn't carry enough signal for the agent to decide whether the full block is worth retrieving, it just pulls everything back anyway — and you've added a round-trip instead of saving tokens. We ended up generating structured previews (keys + shape + a few representative values) rather than naive head-of-string truncation.

Curious how ContextOffloader decides what goes in the preview — is it configurable per-tool, or a fixed strategy? A paginated API response and a giant file read want very different previews, and getting that wrong is what turns offloading into a net cost.

Morgan Willis AWS • Jun 9

The preview is just based on tokens, so it's the first X tokens of the payload. You can configure that number, though. I would imagine some tuning would be needed to get it right for your specific use case. Also, since this is open source, if you need deeper customization you could always write your own plugin based on this one that builds previews with your specific use case in mind.

Mehmet Can Farsak • Jun 13

Solid breakdown of context offloading — keeping noisy tool results out of the context window until needed is exactly right. It reminded me of a related problem: agents jumping straight to execution when they should be thinking. I put together a small hook-based plugin (Brainstorm-Mode by mehmetcanfarsak on GitHub) that uses PreToolUse hooks to prevent tool calls during ideation, keeping the agent's context window focused on brainstorming instead of code. Same philosophy — manage what goes in and out.

Kernel Cero • Jun 11

I just read your piece on Context Offloading with Strands, and focusing on the FileStorage + Linux CLI workflow really caught my attention.

Delegating text retrieval to native OS tools like grep, cat, or sed instead of bundling a rigid, custom SDK retrieval tool is a massive architectural win. It perfectly respects the UNIX philosophy: let the agent leverage hyper-optimized, low-level binaries for streaming and filtering data rather than bloating the runtime layer with custom parsing logic.

For local development and debugging, being able to tail or grep through the ./artifacts directory directly to see exactly what the agent is offloading is a massive quality-of-life upgrade.

The real production challenge will be balancing this elegant shell-based approach with container sandboxing and security boundaries when scaling out, but the core design pattern is brilliant.

Great breakdown!

Morgan Willis AWS • Jun 12

Exactly! And right, sandboxing is key because you don't want the agent to have access to everything. You're spot on.