Your tool-using agent has dozens of tools, a long conversation history, and a growing pile of tool outputs.
So what happens?
Every LLM call gets the same treatment: shove everything into the prompt and hope the model can sort it out.
That usually leads to three problems:
- higher cost
- higher latency
- worse decisions, because the useful context is buried in noise
The issue is not just context-window size.
It is that different parts of agent execution need different context.
The real problem is not capacity. It is curation.
A common pattern in tool-using agents is to build one giant prompt that includes:
- the full conversation history
- the full tool catalog
- recent tool calls
- raw tool outputs
- extra memory just in case
That feels safe, but it is often wasteful.
Most of that context is irrelevant to the step the model is currently performing. And when irrelevant context accumulates, you pay for it twice: once in tokens, and again in model confusion.
Even if a model can technically accept a very large prompt, that does not mean every step should receive one.
## Tool-using agents have four phases, and each phase needs different context
In practice, a tool-using agent usually moves through four distinct phases:
| Phase | What it needs | What it usually does not need |
|---|---|---|
| Route | a compact view of available tools | every full schema |
| Call | the selected tool definition + recent relevant turns | unrelated tools and old history |
| Interpret | the tool result + the call that produced it | the full conversation |
| Answer | relevant turns + dependency chain | every raw tool payload |
That difference matters.
A routing step does not need the same information as a final answer step. A result-interpretation step does not benefit from seeing the whole tool catalog again.
Yet many agents still feed roughly the same prompt blob into every stage.
That is the inefficiency.
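To make the phase split concrete, here is a minimal sketch of phase-specific candidate selection: a policy table maps each phase to the item kinds it may see. All names here (`Phase`, `PHASE_POLICY`, `candidates`, the item-kind strings) are illustrative, not contextweaver's actual API.

```python
from enum import Enum

class Phase(Enum):
    ROUTE = "route"
    CALL = "call"
    INTERPRET = "interpret"
    ANSWER = "answer"

# Which item kinds each phase is allowed to see (illustrative policy).
PHASE_POLICY = {
    Phase.ROUTE:     {"tool_summary", "user_turn"},
    Phase.CALL:      {"tool_schema", "user_turn", "recent_turn"},
    Phase.INTERPRET: {"tool_result", "tool_call"},
    Phase.ANSWER:    {"user_turn", "tool_call", "result_summary"},
}

def candidates(items, phase):
    """Keep only items whose kind is permitted for this phase."""
    allowed = PHASE_POLICY[phase]
    return [item for item in items if item["kind"] in allowed]

items = [
    {"id": "u1", "kind": "user_turn", "text": "List all active users"},
    {"id": "s1", "kind": "tool_schema", "text": "db_query(sql: str) -> rows"},
    {"id": "r1", "kind": "tool_result", "text": "[...1000 rows...]"},
]
print([i["id"] for i in candidates(items, Phase.CALL)])  # ['u1', 's1']
```

The call phase sees the schema and the request but not the raw result; the interpret phase sees the result but not the schema.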
## The idea: compile context per phase, under a budget
I built contextweaver around that idea.
It is a Python library that treats context assembly as a compilation problem:
given a specific phase, a specific query, and a fixed budget, build the smallest context pack that still preserves the information the model actually needs.
Instead of concatenating everything, it:
- selects candidate items
- preserves dependencies between related items
- filters or compresses oversized payloads
- deduplicates overlapping context
- packs the final result into a hard budget
In other words, it tries to answer:
What is the minimum useful context for this exact step?
## A concrete example
Suppose a user asks:
“List all active users.”
A naive system might include:
- the entire recent conversation
- all tool schemas
- the full SQL result
- raw metadata from previous tool calls
A phase-specific system can do better.
For the call phase, it may only need:
- the selected database tool schema
- the current request
- a small amount of recent context
For the answer phase, it may only need:
- the relevant user turn
- the tool call that was executed
- the summarized result
- the dependency chain connecting them
That is a much smaller problem than “show the model everything.”
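As a rough sketch, the two phase-specific packs might look like this (the labels, shapes, and summarized result are hypothetical, not contextweaver's output format):

```python
# Hypothetical contents of the call-phase and answer-phase packs.
call_pack = [
    ("tool_schema", "db_query(sql: str) -> rows"),
    ("user_turn", "List all active users"),
]
answer_pack = [
    ("user_turn", "List all active users"),
    ("tool_call", "db_query('SELECT * FROM users WHERE active = true')"),
    ("result_summary", "42 rows returned"),
]

def render(pack):
    # Join the labeled items into a compact prompt body.
    return "\n".join(f"[{kind}] {text}" for kind, text in pack)

print(render(call_pack))
```

Neither pack carries the raw SQL result, the full tool catalog, or unrelated history.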
## How contextweaver approaches it

The core pipeline looks like this:

```
Events
  → generate_candidates
  → dependency_closure
  → sensitivity_filter
  → context_firewall
  → score
  → deduplicate
  → select_and_pack
  → render
```
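The diagram can be read as a left-to-right function chain over a shared state. A minimal sketch, with trivial stand-in stages: only `generate_candidates` and `render` do anything here, the names mirror the diagram, and none of the bodies are contextweaver's real implementations.

```python
from functools import reduce

def generate_candidates(state):
    # Turn raw events into candidate items (identity stand-in here).
    return {**state, "items": list(state["events"])}

def render(state):
    # Join the surviving items into the final prompt text.
    return {**state, "prompt": "\n".join(state["items"])}

def passthrough(name):
    # Stand-in for the middle stages: each takes and returns the state.
    def stage(state):
        return state
    stage.__name__ = name
    return stage

PIPELINE = [
    generate_candidates,
    passthrough("dependency_closure"),
    passthrough("sensitivity_filter"),
    passthrough("context_firewall"),
    passthrough("score"),
    passthrough("deduplicate"),
    passthrough("select_and_pack"),
    render,
]

def build(events):
    # Thread the state dict through every stage, left to right.
    return reduce(lambda state, stage: stage(state), PIPELINE, {"events": events})

print(build(["user: hi", "tool_call: db_query(...)"])["prompt"])
```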
Three parts matter especially in practice.
### 1. Dependency closure
If a tool_result is selected, its parent tool_call is automatically included.
That prevents a common failure mode: the model sees an output, but not the action or question that produced it.
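The mechanism is a transitive walk over parent links. A sketch, assuming each item may carry an optional `parent_id` field (field names are illustrative):

```python
def dependency_closure(selected_ids, items_by_id):
    """Expand a selection so every ancestor of a selected item is included."""
    closed = set()
    stack = list(selected_ids)
    while stack:
        item_id = stack.pop()
        if item_id in closed or item_id not in items_by_id:
            continue
        closed.add(item_id)
        parent = items_by_id[item_id].get("parent_id")
        if parent:
            stack.append(parent)
    return closed

items = {
    "u1": {"kind": "user_turn"},
    "tc1": {"kind": "tool_call", "parent_id": "u1"},
    "tr1": {"kind": "tool_result", "parent_id": "tc1"},
}
# Selecting only the result pulls in the call and the originating turn.
print(sorted(dependency_closure({"tr1"}, items)))  # ['tc1', 'tr1', 'u1']
```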
### 2. Context firewall
Large tool outputs can be stored out of band and replaced with compact summaries or references.
That way, a single oversized payload does not consume most of the budget.
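A sketch of the idea, with a toy in-memory store and a truncation-based summary (the real library's storage and summarization will differ):

```python
STORE = {}  # out-of-band storage for full payloads

def firewall(item_id, payload, threshold=200):
    """Replace oversized payloads with a compact reference + preview."""
    if len(payload) <= threshold:
        return payload                 # small enough: pass through verbatim
    STORE[item_id] = payload           # keep the full payload out of band
    head = payload[:80].replace("\n", " ")
    return f"[artifact:{item_id}] {len(payload)} chars, starts: {head}..."

big = "{" + ", ".join(f'"user_{n}": "active"' for n in range(100)) + "}"
compact = firewall("tr1", big, threshold=200)
print(len(big) > 200, len(compact) < 200)  # True True
```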
### 3. Budget-aware packing
The final context pack is assembled under a per-phase budget.
The budget is enforced by the builder rather than treated as a soft suggestion.
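Greedy selection under a hard cap is one simple way to implement this. A sketch, using word count as a stand-in for real token counting (names and shapes are illustrative):

```python
def pack(items, budget_tokens):
    """Take items in score order until the next one would exceed the budget."""
    chosen, used = [], 0
    for item in sorted(items, key=lambda i: i["score"], reverse=True):
        cost = len(item["text"].split())  # crude token estimate
        if used + cost > budget_tokens:
            continue                      # hard budget: skip, don't squeeze
        chosen.append(item["id"])
        used += cost
    return chosen, used

items = [
    {"id": "u1", "score": 0.9, "text": "List all active users"},
    {"id": "tr1", "score": 0.8, "text": "summary: 42 active users found"},
    {"id": "old", "score": 0.1, "text": "an old unrelated exchange " * 50},
]
chosen, used = pack(items, budget_tokens=12)
print(chosen, used)  # ['u1', 'tr1'] 9
```

The low-scoring bulky item is dropped entirely rather than allowed to blow the budget.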
## Before and after

The repository includes a simple before/after example:

```bash
$ python examples/before_after.py

WITHOUT contextweaver
  Raw prompt tokens: 417
  Budget enforcement: none
  Large output handling: included verbatim

WITH contextweaver
  Final prompt tokens: 126
  Budget enforcement: 1500 tokens
  Token reduction: 70%
  Budget compliance: Yes
```
That example is intentionally small, but it shows the mechanism clearly:
- less irrelevant context
- preserved dependencies
- explicit budget control
The important point is not the exact percentage.
The important point is that prompt size becomes a controlled output of the system, not an accidental byproduct of whatever happened earlier in the agent loop.
## Minimal usage

Install:

```bash
pip install contextweaver
```
Then:
```python
from contextweaver.context.manager import ContextManager
from contextweaver.config import ContextBudget
from contextweaver.types import ContextItem, ItemKind, Phase

mgr = ContextManager(budget=ContextBudget(answer=1500))

mgr.ingest(ContextItem(
    id="u1",
    kind=ItemKind.user_turn,
    text="List all active users",
))

mgr.ingest(ContextItem(
    id="tc1",
    kind=ItemKind.tool_call,
    text="db_query('SELECT * FROM users WHERE active = true')",
    parent_id="u1",
))

# large_json: the raw tool output, e.g. a big JSON string of rows
mgr.ingest_tool_result(
    tool_call_id="tc1",
    raw_output=large_json,
    tool_name="db_query",
    firewall_threshold=200,
)

pack = mgr.build_sync(phase=Phase.answer, query="active users")
print(pack.prompt)
print(pack.stats)
```
## How this differs from simpler approaches
There are already several ways people try to control prompt size.
### Bigger context windows
A bigger window gives you more room, but it does not decide what is actually relevant for a specific step.
### Manual truncation
This is simple, but it is easy to remove information that another item depends on. For example, keeping a tool result while dropping the tool call that produced it.
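A tiny example of the failure: tail truncation keeps the most recent items, so a tool_result can survive while the tool_call that produced it is cut. Item shapes here are illustrative.

```python
history = [
    {"id": "u1", "kind": "user_turn", "text": "List all active users"},
    {"id": "tc1", "kind": "tool_call", "text": "db_query('SELECT ...')"},
    {"id": "a1", "kind": "assistant_turn", "text": "Running the query..."},
    {"id": "tr1", "kind": "tool_result", "text": "42 active users"},
]

kept = history[-2:]  # naive truncation: keep only the last 2 items
kept_ids = {item["id"] for item in kept}
print("tr1" in kept_ids, "tc1" in kept_ids)  # True False: orphaned result
```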
### Conversation-only memory
Conversation buffers help with turn history, but tool-using agents also have schemas, tool calls, tool results, artifacts, and structured dependencies between them.
### RAG
RAG is useful for retrieving external knowledge, but it does not directly solve the problem of assembling the right internal tool context for a particular agent phase.
That is why I think of this as a context compiler, not a memory system and not a retrieval layer.
## Design choices

A few implementation choices were deliberate:

- zero runtime dependencies — stdlib-only, Python 3.10+
- protocol-based stores — storage backends are swappable via `typing.Protocol`
- deterministic output — same input produces the same result
- debuggable builds — `BuildStats` explains what was kept, dropped, or deduplicated
- protocol adapters — support for MCP and A2A-style integrations
The goal was to keep the core small, testable, and independent of any specific model provider or framework.
## What this is not
contextweaver is not:
- a full agent framework
- a memory product
- a vector database
- a replacement for retrieval
- proof that one context policy is always best
It is a library for one narrower job:
assemble the right context for one agent phase, under a fixed budget.
## Where I think this gets interesting
The more tools an agent has, and the more intermediate artifacts it produces, the more expensive naive prompting becomes.
That is where explicit context compilation starts to matter.
If you are building tool-using agents, I think the useful question is no longer:
“How much context can I fit?”
It is:
“What is the minimum context this step actually needs?”
That is the question contextweaver is trying to answer.
## Try it

- GitHub: dgenio/contextweaver
- PyPI: `pip install contextweaver`
- Docs: Quickstart guide
- Examples: Runnable examples
Feedback is very welcome, especially on:
- the phase split
- the pipeline design
- failure cases
- which framework integrations would be most useful