Diogo Santos
Your AI agent does not need a bigger context window

Your tool-using agent has dozens of tools, a long conversation history, and a growing pile of tool outputs.

So what happens?

Every LLM call gets the same treatment: shove everything into the prompt and hope the model can sort it out.

That usually leads to three problems:

  • higher cost
  • higher latency
  • worse decisions, because the useful context is buried in noise

The issue is not just context-window size.

It is that different parts of agent execution need different context.

The real problem is not capacity. It is curation.

A common pattern in tool-using agents is to build one giant prompt that includes:

  • the full conversation history
  • the full tool catalog
  • recent tool calls
  • raw tool outputs
  • extra memory just in case

That feels safe, but it is often wasteful.

Most of that context is irrelevant to the step the model is currently performing. And when irrelevant context accumulates, you pay for it twice: once in tokens, and again in model confusion.

Even if a model can technically accept a very large prompt, that does not mean every step should receive one.

Tool-using agents have four phases, and each phase needs different context

In practice, a tool-using agent usually moves through four distinct phases:

| Phase | What it needs | What it usually does not need |
| --- | --- | --- |
| Route | a compact view of available tools | every full schema |
| Call | the selected tool definition + recent relevant turns | unrelated tools and old history |
| Interpret | the tool result + the call that produced it | the full conversation |
| Answer | relevant turns + dependency chain | every raw tool payload |

That difference matters.

A routing step does not need the same information as a final answer step. A result-interpretation step does not benefit from seeing the whole tool catalog again.

Yet many agents still feed roughly the same prompt blob into every stage.

That is the inefficiency.
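To make the phase split concrete, here is a minimal sketch of phase-specific context selection. The phase names follow the table above, but the item kinds and the `PHASE_NEEDS` mapping are illustrative stand-ins, not contextweaver's API:

```python
# Hypothetical mapping from agent phase to the item kinds that phase needs.
PHASE_NEEDS = {
    "route": {"tool_summary", "user_turn"},
    "call": {"tool_schema", "user_turn", "recent_turn"},
    "interpret": {"tool_call", "tool_result"},
    "answer": {"user_turn", "tool_call", "tool_result_summary"},
}

def context_for_phase(phase, items):
    """Keep only the items whose kind this phase actually needs."""
    wanted = PHASE_NEEDS[phase]
    return [it for it in items if it["kind"] in wanted]

items = [
    {"kind": "tool_schema", "text": "db_query(sql) -> rows"},
    {"kind": "user_turn", "text": "List all active users"},
    {"kind": "tool_result", "text": "<large JSON payload>"},
]

# The call phase sees the schema and the request, never the raw payload.
print([it["kind"] for it in context_for_phase("call", items)])
# ['tool_schema', 'user_turn']
```

Even this crude filter already removes most of the "prompt blob": each phase reads a different slice of the same event log.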

The idea: compile context per phase, under a budget

I built contextweaver around that idea.

It is a Python library that treats context assembly as a compilation problem:

given a specific phase, a specific query, and a fixed budget, build the smallest context pack that still preserves the information the model actually needs.

Instead of concatenating everything, it:

  • selects candidate items
  • preserves dependencies between related items
  • filters or compresses oversized payloads
  • deduplicates overlapping context
  • packs the final result into a hard budget

In other words, it tries to answer:

What is the minimum useful context for this exact step?

A concrete example

Suppose a user asks:

“List all active users.”

A naive system might include:

  • the entire recent conversation
  • all tool schemas
  • the full SQL result
  • raw metadata from previous tool calls

A phase-specific system can do better.

For the call phase, it may only need:

  • the selected database tool schema
  • the current request
  • a small amount of recent context

For the answer phase, it may only need:

  • the relevant user turn
  • the tool call that was executed
  • the summarized result
  • the dependency chain connecting them

That is a much smaller problem than “show the model everything.”

How contextweaver approaches it

The core pipeline looks like this:

```
Events
→ generate_candidates
→ dependency_closure
→ sensitivity_filter
→ context_firewall
→ score
→ deduplicate
→ select_and_pack
→ render
```

Three parts matter especially in practice.

1. Dependency closure

If a tool_result is selected, its parent tool_call is automatically included.

That prevents a common failure mode: the model sees an output, but not the action or question that produced it.
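The idea can be sketched in a few lines (the traversal below is illustrative, not contextweaver's internals): starting from the selected items, follow `parent_id` links until the set is closed under dependencies.

```python
def dependency_closure(selected_ids, items_by_id):
    """Return the selected ids plus every ancestor reachable via parent_id."""
    closed = set()
    stack = list(selected_ids)
    while stack:
        item_id = stack.pop()
        if item_id in closed or item_id not in items_by_id:
            continue
        closed.add(item_id)
        parent = items_by_id[item_id].get("parent_id")
        if parent:
            stack.append(parent)  # pull the ancestor in as well
    return closed

items = {
    "u1": {"kind": "user_turn"},
    "tc1": {"kind": "tool_call", "parent_id": "u1"},
    "tr1": {"kind": "tool_result", "parent_id": "tc1"},
}

# Selecting only the result pulls in the call and the originating turn.
print(sorted(dependency_closure({"tr1"}, items)))
# ['tc1', 'tr1', 'u1']
```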

2. Context firewall

Large tool outputs can be stored out of band and replaced with compact summaries or references.

That way, a single oversized payload does not consume most of the budget.
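In sketch form, a firewall is a threshold check plus an out-of-band store. The dict-backed `STORE` and the reference format here are illustrative assumptions, not the library's implementation:

```python
STORE = {}  # stand-in for an out-of-band artifact store

def firewall(item_id, payload, threshold=200):
    """Replace oversized payloads with a compact reference plus a preview."""
    if len(payload) <= threshold:
        return payload
    STORE[item_id] = payload  # full payload kept out of the prompt
    preview = payload[:60].replace("\n", " ")
    return f"[artifact:{item_id}, {len(payload)} chars] preview: {preview}..."

big = "{" + ", ".join(f'"user{i}": "active"' for i in range(100)) + "}"
compact = firewall("tr1", big)
print(compact)  # a short reference line instead of ~2 KB of JSON
```

The full payload stays retrievable by id, so a later step can still fetch it if the summary turns out to be insufficient.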

3. Budget-aware packing

The final context pack is assembled under a per-phase budget.

The budget is enforced by the builder rather than treated as a soft suggestion.
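A simple way to implement that guarantee is a greedy packer: score items, then add them in score order while the estimated cost stays under the budget. The characters-per-token heuristic and the item fields below are assumptions for illustration, not contextweaver's actual scoring:

```python
def estimate_tokens(text):
    # Crude heuristic: roughly 4 characters per token.
    return max(1, len(text) // 4)

def pack(items, budget):
    """Greedily pack highest-scoring items under a hard token budget."""
    packed, used = [], 0
    for item in sorted(items, key=lambda it: it["score"], reverse=True):
        cost = estimate_tokens(item["text"])
        if used + cost <= budget:
            packed.append(item)
            used += cost
    return packed, used

items = [
    {"text": "List all active users", "score": 0.9},
    {"text": "old, unrelated small talk " * 40, "score": 0.1},
    {"text": "db_query('SELECT * FROM users WHERE active = true')", "score": 0.8},
]
packed, used = pack(items, budget=50)
print([it["score"] for it in packed], used)
# [0.9, 0.8] 17
```

The key property is that the budget is a hard constraint in code: low-scoring oversized items are simply dropped, so the prompt can never silently overflow.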

Before and after

The repository includes a simple before/after example:

```bash
$ python examples/before_after.py

WITHOUT contextweaver
Raw prompt tokens: 417
Budget enforcement: none
Large output handling: included verbatim

WITH contextweaver
Final prompt tokens: 126
Budget enforcement: 1500 tokens
Token reduction: 70%
Budget compliance: Yes
```

That example is intentionally small, but it shows the mechanism clearly:

  • less irrelevant context
  • preserved dependencies
  • explicit budget control

The important point is not the exact percentage.

The important point is that prompt size becomes a controlled output of the system, not an accidental byproduct of whatever happened earlier in the agent loop.

Minimal usage

Install:

```bash
pip install contextweaver
```

Then:

```python
from contextweaver.context.manager import ContextManager
from contextweaver.config import ContextBudget
from contextweaver.types import ContextItem, ItemKind, Phase

mgr = ContextManager(budget=ContextBudget(answer=1500))

mgr.ingest(ContextItem(
    id="u1",
    kind=ItemKind.user_turn,
    text="List all active users",
))

mgr.ingest(ContextItem(
    id="tc1",
    kind=ItemKind.tool_call,
    text="db_query('SELECT * FROM users WHERE active = true')",
    parent_id="u1",
))

# large_json: the raw tool output from executing the query (a large JSON string)
mgr.ingest_tool_result(
    tool_call_id="tc1",
    raw_output=large_json,
    tool_name="db_query",
    firewall_threshold=200,
)

pack = mgr.build_sync(phase=Phase.answer, query="active users")

print(pack.prompt)
print(pack.stats)
```

How this differs from simpler approaches

There are already several ways people try to control prompt size.

Bigger context windows

A bigger window gives you more room, but it does not decide what is actually relevant for a specific step.

Manual truncation

This is simple, but it is easy to remove information that another item depends on. For example, keeping a tool result while dropping the tool call that produced it.

Conversation-only memory

Conversation buffers help with turn history, but tool-using agents also have schemas, tool calls, tool results, artifacts, and structured dependencies between them.

RAG

RAG is useful for retrieving external knowledge, but it does not directly solve the problem of assembling the right internal tool context for a particular agent phase.

That is why I think of this as a context compiler, not a memory system and not a retrieval layer.

Design choices

A few implementation choices were deliberate:

  • zero runtime dependencies — stdlib-only, Python 3.10+
  • protocol-based stores — storage backends are swappable via typing.Protocol
  • deterministic output — same input produces the same result
  • debuggable builds — BuildStats explains what was kept, dropped, or deduplicated
  • protocol adapters — support for MCP and A2A-style integrations

The goal was to keep the core small, testable, and independent of any specific model provider or framework.

What this is not

contextweaver is not:

  • a full agent framework
  • a memory product
  • a vector database
  • a replacement for retrieval
  • proof that one context policy is always best

It is a library for one narrower job:

assemble the right context for one agent phase, under a fixed budget.

Where I think this gets interesting

The more tools an agent has, and the more intermediate artifacts it produces, the more expensive naive prompting becomes.

That is where explicit context compilation starts to matter.

If you are building tool-using agents, I think the useful question is no longer:

“How much context can I fit?”

It is:

“What is the minimum context this step actually needs?”

That is the question contextweaver is trying to answer.

Try it

Feedback is very welcome, especially on:

  • the phase split
  • the pipeline design
  • failure cases
  • which framework integrations would be most useful
