<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Diogo Santos</title>
    <description>The latest articles on DEV Community by Diogo Santos (@dgenio).</description>
    <link>https://dev.to/dgenio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3878745%2Fe98ad390-0d49-441b-aa31-3a0c4dc2869a.jpeg</url>
      <title>DEV Community: Diogo Santos</title>
      <link>https://dev.to/dgenio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dgenio"/>
    <language>en</language>
    <item>
      <title>Your AI agent does not need a bigger context window</title>
      <dc:creator>Diogo Santos</dc:creator>
      <pubDate>Tue, 14 Apr 2026 13:51:02 +0000</pubDate>
      <link>https://dev.to/dgenio/your-ai-agent-does-not-need-a-bigger-context-window-44ob</link>
      <guid>https://dev.to/dgenio/your-ai-agent-does-not-need-a-bigger-context-window-44ob</guid>
      <description>&lt;p&gt;Your tool-using agent has dozens of tools, a long conversation history, and a growing pile of tool outputs.&lt;/p&gt;

&lt;p&gt;So what happens?&lt;/p&gt;

&lt;p&gt;Every LLM call gets the same treatment: shove everything into the prompt and hope the model can sort it out.&lt;/p&gt;

&lt;p&gt;That usually leads to three problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;higher cost&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;higher latency&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;worse decisions&lt;/strong&gt;, because the useful context is buried in noise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The issue is not just context-window size.&lt;/p&gt;

&lt;p&gt;It is that &lt;strong&gt;different parts of agent execution need different context&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real problem is not capacity. It is curation.
&lt;/h2&gt;

&lt;p&gt;A common pattern in tool-using agents is to build one giant prompt that includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the full conversation history&lt;/li&gt;
&lt;li&gt;the full tool catalog&lt;/li&gt;
&lt;li&gt;recent tool calls&lt;/li&gt;
&lt;li&gt;raw tool outputs&lt;/li&gt;
&lt;li&gt;extra memory just in case&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That feels safe, but it is often wasteful.&lt;/p&gt;

&lt;p&gt;Most of that context is irrelevant to the step the model is currently performing. And when irrelevant context accumulates, you pay for it twice: once in tokens, and again in model confusion.&lt;/p&gt;

&lt;p&gt;Even if a model can technically accept a very large prompt, that does not mean every step should receive one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tool-using agents have four phases, and each phase needs different context
&lt;/h2&gt;

&lt;p&gt;In practice, a tool-using agent usually moves through four distinct phases:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;What it needs&lt;/th&gt;
&lt;th&gt;What it usually does &lt;strong&gt;not&lt;/strong&gt; need&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Route&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;a compact view of available tools&lt;/td&gt;
&lt;td&gt;every full schema&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Call&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;the selected tool definition + recent relevant turns&lt;/td&gt;
&lt;td&gt;unrelated tools and old history&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Interpret&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;the tool result + the call that produced it&lt;/td&gt;
&lt;td&gt;the full conversation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Answer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;relevant turns + dependency chain&lt;/td&gt;
&lt;td&gt;every raw tool payload&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That difference matters.&lt;/p&gt;

&lt;p&gt;A routing step does not need the same information as a final answer step. A result-interpretation step does not benefit from seeing the whole tool catalog again.&lt;/p&gt;

&lt;p&gt;Yet many agents still feed roughly the same prompt blob into every stage.&lt;/p&gt;

&lt;p&gt;That is the inefficiency.&lt;/p&gt;
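&lt;p&gt;One way to make that table concrete is a per-phase allow-list of item kinds. The sketch below is purely illustrative; the kind names and the &lt;code&gt;allowed&lt;/code&gt; helper are my own shorthand, not contextweaver's API.&lt;/p&gt;

```python
# Hypothetical per-phase allow-list. The kind names are illustrative,
# not contextweaver's actual vocabulary.
PHASE_POLICY = {
    "route":     {"user_turn", "tool_summary"},
    "call":      {"user_turn", "recent_turn", "tool_schema"},
    "interpret": {"tool_call", "tool_result"},
    "answer":    {"user_turn", "tool_call", "result_summary"},
}

def allowed(phase, items):
    """Drop every item whose kind the current phase does not need."""
    kinds = PHASE_POLICY[phase]
    return [item for item in items if item["kind"] in kinds]
```

&lt;p&gt;Even a filter this crude changes what each stage sees: the interpret step never receives schemas, and the route step never receives raw results.&lt;/p&gt;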

&lt;h2&gt;
  
  
  The idea: compile context per phase, under a budget
&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://github.com/dgenio/contextweaver" rel="noopener noreferrer"&gt;contextweaver&lt;/a&gt; around that idea.&lt;/p&gt;

&lt;p&gt;It is a Python library that treats context assembly as a compilation problem:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;given a specific phase, a specific query, and a fixed budget, build the smallest context pack that still preserves the information the model actually needs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of concatenating everything, it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;selects candidate items&lt;/li&gt;
&lt;li&gt;preserves dependencies between related items&lt;/li&gt;
&lt;li&gt;filters or compresses oversized payloads&lt;/li&gt;
&lt;li&gt;deduplicates overlapping context&lt;/li&gt;
&lt;li&gt;packs the final result into a hard budget&lt;/li&gt;
&lt;/ul&gt;
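&lt;p&gt;The deduplication step, for instance, can be as simple as hashing normalized text and keeping only the first occurrence. This is my own minimal sketch of the idea, not the library's implementation:&lt;/p&gt;

```python
import hashlib

def deduplicate(items):
    """Keep the first occurrence of each normalized text; drop later copies."""
    seen, kept = set(), []
    for item in items:
        # Normalize whitespace and case so near-identical copies collapse.
        normalized = " ".join(item["text"].split()).lower()
        key = hashlib.sha256(normalized.encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(item)
    return kept
```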

&lt;p&gt;In other words, it tries to answer:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the minimum useful context for this exact step?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A concrete example
&lt;/h2&gt;

&lt;p&gt;Suppose a user asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“List all active users.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A naive system might include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the entire recent conversation&lt;/li&gt;
&lt;li&gt;all tool schemas&lt;/li&gt;
&lt;li&gt;the full SQL result&lt;/li&gt;
&lt;li&gt;raw metadata from previous tool calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A phase-specific system can do better.&lt;/p&gt;

&lt;p&gt;For the &lt;strong&gt;call&lt;/strong&gt; phase, it may only need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the selected database tool schema&lt;/li&gt;
&lt;li&gt;the current request&lt;/li&gt;
&lt;li&gt;a small amount of recent context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the &lt;strong&gt;answer&lt;/strong&gt; phase, it may only need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the relevant user turn&lt;/li&gt;
&lt;li&gt;the tool call that was executed&lt;/li&gt;
&lt;li&gt;the summarized result&lt;/li&gt;
&lt;li&gt;the dependency chain connecting them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a much smaller problem than “show the model everything.”&lt;/p&gt;

&lt;h2&gt;
  
  
  How contextweaver approaches it
&lt;/h2&gt;

&lt;p&gt;The core pipeline looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Events
→ generate_candidates
→ dependency_closure
→ sensitivity_filter
→ context_firewall
→ score
→ deduplicate
→ select_and_pack
→ render
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Three parts matter especially in practice.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Dependency closure
&lt;/h3&gt;

&lt;p&gt;If a &lt;code&gt;tool_result&lt;/code&gt; is selected, its parent &lt;code&gt;tool_call&lt;/code&gt; is automatically included.&lt;/p&gt;

&lt;p&gt;That prevents a common failure mode: the model sees an output, but not the action or question that produced it.&lt;/p&gt;
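&lt;p&gt;A minimal version of that closure just follows parent links until they run out. The sketch below is an illustration under the assumption that each item carries an optional &lt;code&gt;parent_id&lt;/code&gt;; it is not contextweaver's internal code:&lt;/p&gt;

```python
def dependency_closure(selected_ids, items_by_id):
    """Expand a selection so every ancestor (via parent_id) is included."""
    closed = set()
    stack = list(selected_ids)
    while stack:
        item_id = stack.pop()
        if item_id in closed:
            continue
        closed.add(item_id)
        parent = items_by_id[item_id].get("parent_id")
        if parent:
            stack.append(parent)
    return closed
```

&lt;p&gt;Selecting only a &lt;code&gt;tool_result&lt;/code&gt; therefore pulls in the &lt;code&gt;tool_call&lt;/code&gt; that produced it, and transitively the user turn that triggered the call.&lt;/p&gt;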

&lt;h3&gt;
  
  
  2. Context firewall
&lt;/h3&gt;

&lt;p&gt;Large tool outputs can be stored out of band and replaced with compact summaries or references.&lt;/p&gt;

&lt;p&gt;That way, a single oversized payload does not consume most of the budget.&lt;/p&gt;
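&lt;p&gt;The mechanism can be sketched in a few lines: payloads past a size threshold go to an out-of-band store and are replaced by a short reference with a preview. The store shape and the reference format here are my own assumptions, not contextweaver's actual interface:&lt;/p&gt;

```python
def firewall(item, store, threshold=200):
    """Replace an oversized payload with a compact reference into `store`."""
    text = item["text"]
    if len(text) <= threshold:
        return item  # small payloads pass through untouched
    store[item["id"]] = text  # full payload kept out of band
    preview = " ".join(text[:80].split())
    return {**item, "text": f"[stored: {item['id']}] preview: {preview}..."}
```

&lt;p&gt;The full payload stays retrievable by id if a later step really needs it, but it no longer competes for prompt tokens.&lt;/p&gt;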

&lt;h3&gt;
  
  
  3. Budget-aware packing
&lt;/h3&gt;

&lt;p&gt;The final context pack is assembled under a per-phase budget.&lt;/p&gt;

&lt;p&gt;The budget is enforced by the builder rather than treated as a soft suggestion.&lt;/p&gt;
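&lt;p&gt;A greedy version of hard-budget packing: sort by score, add items while they fit, and never exceed the limit. The &lt;code&gt;len // 4&lt;/code&gt; token estimate is a deliberate placeholder, not how contextweaver counts tokens:&lt;/p&gt;

```python
def pack(items, budget_tokens):
    """Greedily pack highest-scoring items under a hard token budget."""
    def estimate(item):
        # Crude stand-in for a real tokenizer: ~4 characters per token.
        return max(1, len(item["text"]) // 4)

    packed, used = [], 0
    for item in sorted(items, key=lambda i: i["score"], reverse=True):
        cost = estimate(item)
        if used + cost <= budget_tokens:
            packed.append(item)
            used += cost
    return packed, used
```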

&lt;h2&gt;
  
  
  Before and after
&lt;/h2&gt;

&lt;p&gt;The repository includes a simple before/after example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;$ python examples/before_after.py

WITHOUT contextweaver
  Raw prompt tokens: 417
  Budget enforcement: none
  Large output handling: included verbatim

WITH contextweaver
  Final prompt tokens: 126
  Budget enforcement: 1500 tokens
  Token reduction: 70%
  Budget compliance: Yes
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;That example is intentionally small, but it shows the mechanism clearly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;less irrelevant context&lt;/li&gt;
&lt;li&gt;preserved dependencies&lt;/li&gt;
&lt;li&gt;explicit budget control&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The important point is not the exact percentage.&lt;/p&gt;

&lt;p&gt;The important point is that &lt;strong&gt;prompt size becomes a controlled output of the system&lt;/strong&gt;, not an accidental byproduct of whatever happened earlier in the agent loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Minimal usage
&lt;/h2&gt;

&lt;p&gt;Install:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip install contextweaver
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Then:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from contextweaver.context.manager import ContextManager
from contextweaver.config import ContextBudget
from contextweaver.types import ContextItem, ItemKind, Phase

mgr = ContextManager(budget=ContextBudget(answer=1500))

mgr.ingest(ContextItem(
    id="u1",
    kind=ItemKind.user_turn,
    text="List all active users",
))

mgr.ingest(ContextItem(
    id="tc1",
    kind=ItemKind.tool_call,
    text="db_query('SELECT * FROM users WHERE active = true')",
    parent_id="u1",
))

# large_json holds the raw db_query output produced earlier in the run
mgr.ingest_tool_result(
    tool_call_id="tc1",
    raw_output=large_json,
    tool_name="db_query",
    firewall_threshold=200,
)

pack = mgr.build_sync(phase=Phase.answer, query="active users")

print(pack.prompt)
print(pack.stats)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;h2&gt;
  
  
  How this differs from simpler approaches
&lt;/h2&gt;

&lt;p&gt;There are already several ways people try to control prompt size.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bigger context windows
&lt;/h3&gt;

&lt;p&gt;A bigger window gives you more room, but it does not decide what is actually relevant for a specific step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Manual truncation
&lt;/h3&gt;

&lt;p&gt;This is simple, but it is easy to remove information that another item depends on, such as keeping a tool result while dropping the tool call that produced it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conversation-only memory
&lt;/h3&gt;

&lt;p&gt;Conversation buffers help with turn history, but tool-using agents also have schemas, tool calls, tool results, artifacts, and structured dependencies between them.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG
&lt;/h3&gt;

&lt;p&gt;RAG is useful for retrieving external knowledge, but it does not directly solve the problem of assembling the right internal tool context for a particular agent phase.&lt;/p&gt;

&lt;p&gt;That is why I think of this as a &lt;strong&gt;context compiler&lt;/strong&gt;, not a memory system and not a retrieval layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Design choices
&lt;/h2&gt;

&lt;p&gt;A few implementation choices were deliberate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;zero runtime dependencies&lt;/strong&gt; — stdlib-only, Python 3.10+&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;protocol-based stores&lt;/strong&gt; — storage backends are swappable via &lt;code&gt;typing.Protocol&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;deterministic output&lt;/strong&gt; — same input produces the same result&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;debuggable builds&lt;/strong&gt; — &lt;code&gt;BuildStats&lt;/code&gt; explains what was kept, dropped, or deduplicated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;protocol adapters&lt;/strong&gt; — support for MCP and A2A-style integrations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal was to keep the core small, testable, and independent of any specific model provider or framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is not
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;contextweaver&lt;/code&gt; is not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a full agent framework&lt;/li&gt;
&lt;li&gt;a memory product&lt;/li&gt;
&lt;li&gt;a vector database&lt;/li&gt;
&lt;li&gt;a replacement for retrieval&lt;/li&gt;
&lt;li&gt;proof that one context policy is always best&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is a library for one narrower job:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;assemble the right context for one agent phase, under a fixed budget.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I think this gets interesting
&lt;/h2&gt;

&lt;p&gt;The more tools an agent has, and the more intermediate artifacts it produces, the more expensive naive prompting becomes.&lt;/p&gt;

&lt;p&gt;That is where explicit context compilation starts to matter.&lt;/p&gt;

&lt;p&gt;If you are building tool-using agents, I think the useful question is no longer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How much context can I fit?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What is the minimum context this step actually needs?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the question &lt;code&gt;contextweaver&lt;/code&gt; is trying to answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/dgenio/contextweaver" rel="noopener noreferrer"&gt;dgenio/contextweaver&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;code&gt;pip install contextweaver&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs&lt;/strong&gt;: &lt;a href="https://github.com/dgenio/contextweaver/blob/main/docs/quickstart.md" rel="noopener noreferrer"&gt;Quickstart guide&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Examples&lt;/strong&gt;: &lt;a href="https://github.com/dgenio/contextweaver/tree/main/examples" rel="noopener noreferrer"&gt;Runnable examples&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feedback is very welcome, especially on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the phase split&lt;/li&gt;
&lt;li&gt;the pipeline design&lt;/li&gt;
&lt;li&gt;failure cases&lt;/li&gt;
&lt;li&gt;which framework integrations would be most useful&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
