Diogo Santos
Your AI agent does not need a bigger context window

Your tool-using agent has dozens of tools, a long conversation history, and a growing pile of tool outputs.

So what happens?

Every LLM call gets the same treatment: shove everything into the prompt and hope the model can sort it out.

That usually leads to three problems:

  • higher cost
  • higher latency
  • worse decisions, because the useful context is buried in noise

The issue is not just context-window size.

It is that different parts of agent execution need different context.

The real problem is not capacity. It is curation.

A common pattern in tool-using agents is to build one giant prompt that includes:

  • the full conversation history
  • the full tool catalog
  • recent tool calls
  • raw tool outputs
  • extra memory just in case

That feels safe, but it is often wasteful.

Most of that context is irrelevant to the step the model is currently performing. And when irrelevant context accumulates, you pay for it twice: once in tokens, and again in model confusion.

Even if a model can technically accept a very large prompt, that does not mean every step should receive one.

Tool-using agents have four phases, and each phase needs different context

In practice, a tool-using agent usually moves through four distinct phases:

| Phase | What it needs | What it usually does not need |
| --- | --- | --- |
| Route | a compact view of available tools | every full schema |
| Call | the selected tool definition + recent relevant turns | unrelated tools and old history |
| Interpret | the tool result + the call that produced it | the full conversation |
| Answer | relevant turns + dependency chain | every raw tool payload |

That difference matters.

A routing step does not need the same information as a final answer step. A result-interpretation step does not benefit from seeing the whole tool catalog again.

Yet many agents still feed roughly the same prompt blob into every stage.

That is the inefficiency.
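To make the phase split concrete, here is a minimal sketch of phase-specific context selection. The phase names follow the table above, but the item kinds and the `PHASE_NEEDS` mapping are illustrative stand-ins, not contextweaver's API:

```python
# Hypothetical mapping from agent phase to the item kinds that phase needs.
PHASE_NEEDS = {
    "route": {"tool_summary", "user_turn"},
    "call": {"tool_schema", "user_turn", "recent_turn"},
    "interpret": {"tool_call", "tool_result"},
    "answer": {"user_turn", "tool_call", "tool_result_summary"},
}

def context_for_phase(phase, items):
    """Keep only the items whose kind this phase actually needs."""
    wanted = PHASE_NEEDS[phase]
    return [it for it in items if it["kind"] in wanted]

items = [
    {"kind": "tool_schema", "text": "db_query(sql) -> rows"},
    {"kind": "user_turn", "text": "List all active users"},
    {"kind": "tool_result", "text": "<large JSON payload>"},
]

# The call phase sees the schema and the request, never the raw payload.
print([it["kind"] for it in context_for_phase("call", items)])
# ['tool_schema', 'user_turn']
```

Even this crude filter already removes most of the "prompt blob": each phase reads a different slice of the same event log.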

The idea: compile context per phase, under a budget

I built contextweaver around that idea.

It is a Python library that treats context assembly as a compilation problem:

given a specific phase, a specific query, and a fixed budget, build the smallest context pack that still preserves the information the model actually needs.

Instead of concatenating everything, it:

  • selects candidate items
  • preserves dependencies between related items
  • filters or compresses oversized payloads
  • deduplicates overlapping context
  • packs the final result into a hard budget

In other words, it tries to answer:

What is the minimum useful context for this exact step?

A concrete example

Suppose a user asks:

“List all active users.”

A naive system might include:

  • the entire recent conversation
  • all tool schemas
  • the full SQL result
  • raw metadata from previous tool calls

A phase-specific system can do better.

For the call phase, it may only need:

  • the selected database tool schema
  • the current request
  • a small amount of recent context

For the answer phase, it may only need:

  • the relevant user turn
  • the tool call that was executed
  • the summarized result
  • the dependency chain connecting them

That is a much smaller problem than “show the model everything.”

How contextweaver approaches it

The core pipeline looks like this:

```
Events
→ generate_candidates
→ dependency_closure
→ sensitivity_filter
→ context_firewall
→ score
→ deduplicate
→ select_and_pack
→ render
```

Three parts matter especially in practice.

1. Dependency closure

If a tool_result is selected, its parent tool_call is automatically included.

That prevents a common failure mode: the model sees an output, but not the action or question that produced it.
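The idea can be sketched in a few lines (the traversal below is illustrative, not contextweaver's internals): starting from the selected items, follow `parent_id` links until the set is closed under dependencies.

```python
def dependency_closure(selected_ids, items_by_id):
    """Return the selected ids plus every ancestor reachable via parent_id."""
    closed = set()
    stack = list(selected_ids)
    while stack:
        item_id = stack.pop()
        if item_id in closed or item_id not in items_by_id:
            continue
        closed.add(item_id)
        parent = items_by_id[item_id].get("parent_id")
        if parent:
            stack.append(parent)  # pull the ancestor in as well
    return closed

items = {
    "u1": {"kind": "user_turn"},
    "tc1": {"kind": "tool_call", "parent_id": "u1"},
    "tr1": {"kind": "tool_result", "parent_id": "tc1"},
}

# Selecting only the result pulls in the call and the originating turn.
print(sorted(dependency_closure({"tr1"}, items)))
# ['tc1', 'tr1', 'u1']
```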

2. Context firewall

Large tool outputs can be stored out of band and replaced with compact summaries or references.

That way, a single oversized payload does not consume most of the budget.
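In sketch form, a firewall is a threshold check plus an out-of-band store. The dict-backed `STORE` and the reference format here are illustrative assumptions, not the library's implementation:

```python
STORE = {}  # stand-in for an out-of-band artifact store

def firewall(item_id, payload, threshold=200):
    """Replace oversized payloads with a compact reference plus a preview."""
    if len(payload) <= threshold:
        return payload
    STORE[item_id] = payload  # full payload kept out of the prompt
    preview = payload[:60].replace("\n", " ")
    return f"[artifact:{item_id}, {len(payload)} chars] preview: {preview}..."

big = "{" + ", ".join(f'"user{i}": "active"' for i in range(100)) + "}"
compact = firewall("tr1", big)
print(compact)  # a short reference line instead of ~2 KB of JSON
```

The full payload stays retrievable by id, so a later step can still fetch it if the summary turns out to be insufficient.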

3. Budget-aware packing

The final context pack is assembled under a per-phase budget.

The budget is enforced by the builder rather than treated as a soft suggestion.
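A simple way to implement that guarantee is a greedy packer: score items, then add them in score order while the estimated cost stays under the budget. The characters-per-token heuristic and the item fields below are assumptions for illustration, not contextweaver's actual scoring:

```python
def estimate_tokens(text):
    # Crude heuristic: roughly 4 characters per token.
    return max(1, len(text) // 4)

def pack(items, budget):
    """Greedily pack highest-scoring items under a hard token budget."""
    packed, used = [], 0
    for item in sorted(items, key=lambda it: it["score"], reverse=True):
        cost = estimate_tokens(item["text"])
        if used + cost <= budget:
            packed.append(item)
            used += cost
    return packed, used

items = [
    {"text": "List all active users", "score": 0.9},
    {"text": "old, unrelated small talk " * 40, "score": 0.1},
    {"text": "db_query('SELECT * FROM users WHERE active = true')", "score": 0.8},
]
packed, used = pack(items, budget=50)
print([it["score"] for it in packed], used)
# [0.9, 0.8] 17
```

The key property is that the budget is a hard constraint in code: low-scoring oversized items are simply dropped, so the prompt can never silently overflow.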

Before and after

The repository includes a simple before/after example:

```bash
$ python examples/before_after.py

WITHOUT contextweaver
Raw prompt tokens: 417
Budget enforcement: none
Large output handling: included verbatim

WITH contextweaver
Final prompt tokens: 126
Budget enforcement: 1500 tokens
Token reduction: 70%
Budget compliance: Yes
```

That example is intentionally small, but it shows the mechanism clearly:

  • less irrelevant context
  • preserved dependencies
  • explicit budget control

The important point is not the exact percentage.

The important point is that prompt size becomes a controlled output of the system, not an accidental byproduct of whatever happened earlier in the agent loop.

Minimal usage

Install:

```bash
pip install contextweaver
```

Then:

```python
from contextweaver.context.manager import ContextManager
from contextweaver.config import ContextBudget
from contextweaver.types import ContextItem, ItemKind, Phase

mgr = ContextManager(budget=ContextBudget(answer=1500))

mgr.ingest(ContextItem(
    id="u1",
    kind=ItemKind.user_turn,
    text="List all active users",
))

mgr.ingest(ContextItem(
    id="tc1",
    kind=ItemKind.tool_call,
    text="db_query('SELECT * FROM users WHERE active = true')",
    parent_id="u1",
))

# large_json: the raw tool output from executing the query (a large JSON string)
mgr.ingest_tool_result(
    tool_call_id="tc1",
    raw_output=large_json,
    tool_name="db_query",
    firewall_threshold=200,
)

pack = mgr.build_sync(phase=Phase.answer, query="active users")

print(pack.prompt)
print(pack.stats)
```

How this differs from simpler approaches

There are already several ways people try to control prompt size.

Bigger context windows

A bigger window gives you more room, but it does not decide what is actually relevant for a specific step.

Manual truncation

This is simple, but it is easy to remove information that another item depends on. For example, keeping a tool result while dropping the tool call that produced it.

Conversation-only memory

Conversation buffers help with turn history, but tool-using agents also have schemas, tool calls, tool results, artifacts, and structured dependencies between them.

RAG

RAG is useful for retrieving external knowledge, but it does not directly solve the problem of assembling the right internal tool context for a particular agent phase.

That is why I think of this as a context compiler, not a memory system and not a retrieval layer.

Design choices

A few implementation choices were deliberate:

  • zero runtime dependencies — stdlib-only, Python 3.10+
  • protocol-based stores — storage backends are swappable via typing.Protocol
  • deterministic output — same input produces the same result
  • debuggable builds — BuildStats explains what was kept, dropped, or deduplicated
  • protocol adapters — support for MCP and A2A-style integrations

The goal was to keep the core small, testable, and independent of any specific model provider or framework.

What this is not

contextweaver is not:

  • a full agent framework
  • a memory product
  • a vector database
  • a replacement for retrieval
  • proof that one context policy is always best

It is a library for one narrower job:

assemble the right context for one agent phase, under a fixed budget.

Where I think this gets interesting

The more tools an agent has, and the more intermediate artifacts it produces, the more expensive naive prompting becomes.

That is where explicit context compilation starts to matter.

If you are building tool-using agents, I think the useful question is no longer:

“How much context can I fit?”

It is:

“What is the minimum context this step actually needs?”

That is the question contextweaver is trying to answer.

Try it

Feedback is very welcome, especially on:

  • the phase split
  • the pipeline design
  • failure cases
  • which framework integrations would be most useful
