<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Azard Tennant-Hosein</title>
    <description>The latest articles on DEV Community by Azard Tennant-Hosein (@azard_tennant-hosein).</description>
    <link>https://dev.to/azard_tennant-hosein</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3985716%2F4c4f327c-7e2c-4705-9aaf-8177f8fddecd.png</url>
      <title>DEV Community: Azard Tennant-Hosein</title>
      <link>https://dev.to/azard_tennant-hosein</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/azard_tennant-hosein"/>
    <language>en</language>
    <item>
      <title>The two causes of your token bill</title>
      <dc:creator>Azard Tennant-Hosein</dc:creator>
      <pubDate>Wed, 17 Jun 2026 09:49:27 +0000</pubDate>
      <link>https://dev.to/azard_tennant-hosein/the-two-causes-of-your-token-bill-402e</link>
      <guid>https://dev.to/azard_tennant-hosein/the-two-causes-of-your-token-bill-402e</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://llmsieve.dev/blog/2026/06/15/the-two-causes-of-your-token-bill/" rel="noopener noreferrer"&gt;the Sieve blog&lt;/a&gt;. Sieve is an open-source (Apache 2.0) context-reduction proxy — I work on it, and I've tried to keep this post about the problem rather than the tool.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you run an LLM agent for real work, the bill is the part nobody warned you about. It starts small, it grows with use, and the worst of it is invisible — most of what you pay for on any given turn is text the model has already seen, or text you never meant to send.&lt;/p&gt;

&lt;p&gt;There's a temptation to treat this as one problem with one fix. It isn't. An agent's token bill has two distinct causes, and they need two genuinely different kinds of tool. This post is about telling them apart — because once you can, the question stops being "which tool wins" and becomes "which of my two problems am I looking at right now."&lt;/p&gt;

&lt;h2&gt;The bill is mostly things you didn't choose&lt;/h2&gt;

&lt;p&gt;Start with where the tokens actually go, because it's rarely where people assume.&lt;/p&gt;

&lt;p&gt;When your agent calls a tool, you don't just pay for your request; you pay for the machinery of asking. Anthropic's own &lt;a href="https://platform.claude.com/docs/en/about-claude/pricing" rel="noopener noreferrer"&gt;pricing documentation&lt;/a&gt; spells this out: the &lt;code&gt;tools&lt;/code&gt; parameter alone adds hundreds of tokens of schema to every request, the bash tool adds a fixed overhead, and a single web fetch pulls the fetched page straight into your context. The documentation's own estimates are "Average web page (10 kB): ~2,500 tokens... Research paper PDF (500 kB): ~125,000 tokens". A tool result you glance at once and never need again can cost more than the entire conversation around it.&lt;/p&gt;
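
&lt;p&gt;The quoted figures imply roughly four bytes per token (10,000 / 2,500 and 500,000 / 125,000 both come out at 4, taking 1 kB as 1,000 bytes). A minimal sketch of that arithmetic follows. &lt;code&gt;BYTES_PER_TOKEN&lt;/code&gt; and &lt;code&gt;estimate_tokens&lt;/code&gt; are illustrative names, not part of any API, and real tokenizers will vary with the content:&lt;/p&gt;

```python
# Rough estimate only: assumes about 4 bytes per token, the ratio implied by
# the quoted figures (10 kB ~ 2,500 tokens; 500 kB ~ 125,000 tokens).
BYTES_PER_TOKEN = 4  # assumption derived from the quote, not a documented constant


def estimate_tokens(size_bytes):
    """Approximate token count for a fetched payload of the given size."""
    return size_bytes // BYTES_PER_TOKEN


web_page = estimate_tokens(10_000)       # "average web page" in the quote
research_pdf = estimate_tokens(500_000)  # "research paper PDF" in the quote
print(web_page, research_pdf)
```

&lt;p&gt;Treat the result as a budgeting aid for deciding whether a fetch is worth it, not as a substitute for the token counts your provider reports.&lt;/p&gt;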

&lt;p&gt;Now add the part that repeats. On every turn, a typical agent re-sends its system prompt, its full tool catalogue, its persona, and the conversation so far. The variable part of the request — what you actually typed — is often the smallest thing in the payload. The fixed overhead, multiplied across every turn of a long session, &lt;em&gt;is&lt;/em&gt; the bill.&lt;/p&gt;
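
&lt;p&gt;To make the multiplication concrete, here is a small sketch of how input tokens accumulate when every turn re-sends a fixed overhead plus the full history. All numbers are made up for illustration, the function name is hypothetical, and the model ignores any prompt-caching discount your provider may apply:&lt;/p&gt;

```python
# Hypothetical numbers for illustration; measure your own agent's payloads.
def session_input_tokens(turns, fixed_overhead, user_tokens, reply_tokens):
    """Total input tokens over a session in which every turn re-sends the
    fixed overhead (system prompt, tool schemas, persona) plus the whole
    conversation so far, followed by the new user message."""
    total = 0
    history = 0
    for _ in range(turns):
        total += fixed_overhead + history + user_tokens
        # The user message and the model's reply both join the history.
        history += user_tokens + reply_tokens
    return total


total = session_input_tokens(
    turns=20, fixed_overhead=3000, user_tokens=50, reply_tokens=200
)
typed = 20 * 50
print(total, typed)  # 108500 1000
```

&lt;p&gt;With these assumed inputs, the 1,000 tokens actually typed are under 1% of the 108,500 input tokens sent. The re-sent overhead accounts for 60,000 of them and the growing history for 47,500.&lt;/p&gt;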

&lt;p&gt;So the cost has two shapes, and they're not the same shape:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Verbose machine output&lt;/strong&gt; — JSON tool results, logs, search dumps, fetched pages, code listings. Big, one-off, and mostly structural noise around a small signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repeated standing context&lt;/strong&gt; — the system prompt, tool schemas, persona, and history that ride along on every single turn, plus the absence of any memory that would let the agent &lt;em&gt;not&lt;/em&gt; re-send it all.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These call for different interventions, and conflating them is why "just reduce my tokens" never quite works.&lt;/p&gt;

&lt;h2&gt;Two different jobs&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Compressing verbose output&lt;/strong&gt; is a content problem. You have a 10,000-token JSON blob; you want the model to get its meaning at a fraction of the size without losing the parts that matter. This is hard in an interesting way: it's about understanding the &lt;em&gt;shape&lt;/em&gt; of the content (a deeply nested object, an AST, a log stream) and squeezing it with so little loss that the answer doesn't change.&lt;/p&gt;
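
&lt;p&gt;As a deliberately naive illustration of shape-aware shrinking, the sketch below drops empty fields from a JSON tool result and re-serializes it without whitespace. It is an assumption-laden toy, not how any particular compression tool works, and &lt;code&gt;prune&lt;/code&gt; and &lt;code&gt;compact&lt;/code&gt; are hypothetical helper names:&lt;/p&gt;

```python
import json


def prune(value):
    """Recursively drop dict entries whose value is None, an empty string,
    or an empty container. Lists keep their length and order."""
    if isinstance(value, dict):
        cleaned = {k: prune(v) for k, v in value.items()}
        return {k: v for k, v in cleaned.items() if v not in (None, "", [], {})}
    if isinstance(value, list):
        return [prune(v) for v in value]
    return value


def compact(raw_json):
    """Parse, prune, and re-serialize with no insignificant whitespace."""
    return json.dumps(prune(json.loads(raw_json)), separators=(",", ":"))


raw = json.dumps(
    {"id": 7, "name": "build", "error": None, "tags": [],
     "meta": {"note": ""}, "ok": False},
    indent=2,
)
small = compact(raw)
print(small)  # {"id":7,"name":"build","ok":false}
print(len(raw), len(small))
```

&lt;p&gt;Dropping empty fields is lossy whenever an explicit null carries meaning, which is exactly the kind of judgment that makes real compression harder than minification.&lt;/p&gt;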

&lt;p&gt;&lt;strong&gt;Reducing repeated context&lt;/strong&gt; is a traffic problem. The model has already seen your tool schemas and your standing instructions; the fix is to stop re-sending what it's seen, and to remember durable facts so they can be supplied on demand instead of permanently parked in the prompt. This isn't about any single payload's shape — it's about what crosses the wire, turn after turn, and what gets remembered between turns.&lt;/p&gt;
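
&lt;p&gt;One way to see how much of your traffic is repetition is to fingerprint each block of every outbound request and count what has already been sent. The sketch below is a measurement aid under stated assumptions: requests are modelled as plain lists of text blocks, characters stand in for tokens, and &lt;code&gt;repeated_share&lt;/code&gt; is a hypothetical name. It does not describe how Sieve or any other proxy decides what to strip:&lt;/p&gt;

```python
import hashlib


def repeated_share(requests):
    """requests: one list of text blocks per turn (system prompt, tool
    schemas, messages). Returns the fraction of characters sent that had
    already appeared in an earlier request."""
    seen = set()
    repeated = 0
    total = 0
    for blocks in requests:
        current = set()
        for block in blocks:
            digest = hashlib.sha256(block.encode("utf-8")).hexdigest()
            total += len(block)
            if digest in seen:
                repeated += len(block)
            current.add(digest)
        # Only count a block as "seen" from the next request onward.
        seen |= current
    return repeated / total if total else 0.0


system = "S" * 400  # stand-in for a system prompt
tools = "T" * 500   # stand-in for tool schemas
requests = [
    [system, tools, "hi"],
    [system, tools, "hi", "hello", "next"],
]
share = repeated_share(requests)
print(round(share, 3))
```

&lt;p&gt;A high share points at the repeated-context problem; a low share with large individual blocks points at verbose output instead.&lt;/p&gt;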

&lt;p&gt;You can have either problem without the other. An agent that does a lot of web research and tool-calling has a &lt;em&gt;verbose-output&lt;/em&gt; problem even in a short session. A long-running personal assistant that mostly chats has a &lt;em&gt;repeated-context&lt;/em&gt; problem even though no individual message is large. Most real agents have both, in different proportions — which is exactly why one tool rarely covers the whole bill.&lt;/p&gt;

&lt;h2&gt;Two tools, two halves&lt;/h2&gt;

&lt;p&gt;This is where it's worth being concrete, and fair to the projects doing this work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/chopratejas/headroom" rel="noopener noreferrer"&gt;&lt;strong&gt;Headroom&lt;/strong&gt;&lt;/a&gt; is, in its own words, "the context compression layer for AI agents", and it targets the first problem. Its job is to take verbose content (JSON, code, logs, the bulky machine output that coding agents generate constantly) and make it smaller while preserving accuracy on standard benchmarks. It's Apache 2.0, runs locally, and offers library, proxy, agent-wrap, and MCP modes. If your bill is dominated by tool outputs and search results, that's the shape of problem it's built for.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/llmsieve/llm-sieve" rel="noopener noreferrer"&gt;&lt;strong&gt;Sieve&lt;/strong&gt;&lt;/a&gt; — the project I work on — targets the second. It's a proxy that strips the context the model has already seen from every outbound turn, and backs that with an encrypted local store of durable facts it can inject only when a turn needs them, rather than keeping everything in the prompt forever. It also refuses to invent answers about things it was never told. If your bill is dominated by the same standing apparatus re-sent on every turn, and by an agent that forgets you between sessions, that's its half.&lt;/p&gt;

&lt;p&gt;Notice these are &lt;em&gt;different halves&lt;/em&gt;. One makes a big payload smaller; the other stops a payload from being re-sent and gives the agent a memory so it doesn't have to be. They're not competing for the same job — they're addressing the two causes named above. In principle they compose: compression handling the verbose one-off content, a reduction-and-memory layer handling the repeated standing content.&lt;/p&gt;

&lt;h2&gt;What this is worth to you&lt;/h2&gt;

&lt;p&gt;Set the percentages aside for a moment — every tool in this space quotes a big reduction number, and the numbers depend entirely on your workload. The value to you as a user is more concrete than any headline figure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sessions that don't fall over.&lt;/strong&gt; The most common real complaint isn't the monthly invoice — it's hitting a limit or a context wall in the middle of work. Spending fewer tokens per turn is, before anything else, &lt;em&gt;more room to keep going&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A bill you can reason about.&lt;/strong&gt; Both kinds of tool are observable: you can see what was sent before and after. A cost you can inspect is a cost you can manage, instead of a number that arrives at month-end.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Less re-explaining yourself.&lt;/strong&gt; For the repeated-context half specifically, the payoff isn't only tokens — it's an agent that remembers your preferences and your project across sessions, so you stop re-establishing the same ground every time you open it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy you don't have to trade for savings.&lt;/strong&gt; Both Headroom and Sieve run locally; Sieve additionally keeps its memory store encrypted on your own disk with no telemetry. Cutting your token bill shouldn't mean shipping your context to one more third party.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Honest limits&lt;/h2&gt;

&lt;p&gt;A few things I won't claim, because they aren't mine to claim yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I haven't run the two together.&lt;/strong&gt; The "they compose" argument above is architectural — it follows from what each tool does, not from a tested pipeline I've measured. Treat it as a sound hypothesis, not a benchmarked result. If you stack them, I'd genuinely like to hear how it goes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reduction has a warm-up.&lt;/strong&gt; A memory-and-reduction layer with an empty store can't save you much on day one; the savings arrive as it learns. Compression, by contrast, helps on the very first verbose payload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The numbers are yours, not mine.&lt;/strong&gt; Whatever either project's headline percentage, the only figure that means anything is the one you measure on your own workload. Both projects expose per-request stats precisely so you can check rather than trust. Use them.&lt;/p&gt;
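
&lt;p&gt;A minimal sketch of that check, assuming you can export a before and after token count for each request from whatever stats your tooling reports. The input format and the name &lt;code&gt;reduction_percent&lt;/code&gt; are hypothetical, and the sample counts are invented:&lt;/p&gt;

```python
def reduction_percent(pairs):
    """pairs: (tokens_before, tokens_after) for each request.
    Returns the overall percentage of tokens saved across all requests."""
    before = sum(b for b, _ in pairs)
    after = sum(a for _, a in pairs)
    if before == 0:
        return 0.0
    return 100.0 * (before - after) / before


# Invented sample: two requests, one cut by 25% and one by 50%.
sample = [(4000, 3000), (6000, 3000)]
print(reduction_percent(sample))  # 40.0
```

&lt;p&gt;Aggregating totals weights large requests more heavily than averaging the per-request percentages would (37.5 for this sample), which is usually closer to what the invoice reflects.&lt;/p&gt;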

&lt;h2&gt;The takeaway&lt;/h2&gt;

&lt;p&gt;The next time the bill jumps, the useful first question isn't "what cuts tokens" — it's "which of my two problems is this." If it's verbose tool output drowning a small signal, you want compression. If it's the same standing context re-sent every turn and an agent with no memory, you want reduction. Most agents need both, and the good news is the tooling for each now exists, is open source, and runs on your own machine. Knowing which half you're looking at is most of the battle.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Sieve is open source under Apache 2.0: &lt;a href="https://github.com/llmsieve/llm-sieve" rel="noopener noreferrer"&gt;github.com/llmsieve/llm-sieve&lt;/a&gt;. If I've misrepresented Headroom, &lt;a href="https://github.com/llmsieve/llm-sieve/issues" rel="noopener noreferrer"&gt;open an issue&lt;/a&gt; and I'll correct it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>agents</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
