<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: François Kiene</title>
    <description>The latest articles on DEV Community by François Kiene (@fkiene).</description>
    <link>https://dev.to/fkiene</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3987586%2Fc8a48bde-26b4-4820-b975-8b33282dcfb5.png</url>
      <title>DEV Community: François Kiene</title>
      <link>https://dev.to/fkiene</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fkiene"/>
    <language>en</language>
    <item>
      <title>How I cut my Claude Code bill 67% with a local proxy</title>
      <dc:creator>François Kiene</dc:creator>
      <pubDate>Tue, 16 Jun 2026 16:29:44 +0000</pubDate>
      <link>https://dev.to/fkiene/how-i-cut-my-claude-code-bill-67-with-a-local-proxy-2cj9</link>
      <guid>https://dev.to/fkiene/how-i-cut-my-claude-code-bill-67-with-a-local-proxy-2cj9</guid>
      <description>&lt;p&gt;I opened my Claude Code bill, didn't like the number, and went looking for why.&lt;/p&gt;

&lt;h2&gt;
  
  
  Caching saves less than you think
&lt;/h2&gt;

&lt;p&gt;Prompt caching only discounts the stable prefix you mark. New content each turn, the latest messages and fresh tool output, is full price. And that new-content surface is most of the bill.&lt;/p&gt;

&lt;p&gt;The agent runs a command, reads a 400-line git log or a test dump, and that whole wall of text gets re-sent at full price on the next turn. The model's own replies cost too.&lt;/p&gt;

&lt;p&gt;I tried a couple of prompt-shrinking tools first. The problem: anything that rewrites the cached prefix forfeits the cache discount, so you shrink the text and pay the uncached rate on what's left. You can come out behind.&lt;/p&gt;

&lt;p&gt;So the rule I needed was narrow. Shrink the stuff that's already full price, never touch the cached prefix.&lt;/p&gt;
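
&lt;p&gt;A back-of-the-envelope sketch of why that rule matters. Every number here is an assumption for illustration (a $3 per million token input rate, cache reads billed at 10% of it, made-up token counts), and it ignores cache-write premiums and output cost:&lt;/p&gt;

```python
# Illustrative numbers only: assumed prices and token counts, not measured ones.
BASE_INPUT_PER_MTOK = 3.00    # assumed full input price, USD per million tokens
CACHE_READ_MULTIPLIER = 0.10  # assumed: cached prefix reads billed at 10% of base


def turn_cost(prefix_tokens, fresh_tokens, prefix_cached):
    """Input cost in USD for one turn, given whether the prefix hits the cache."""
    prefix_rate = CACHE_READ_MULTIPLIER if prefix_cached else 1.0
    billable = prefix_tokens * prefix_rate + fresh_tokens
    return billable * BASE_INPUT_PER_MTOK / 1_000_000


# Baseline: 50k-token cached prefix plus 20k tokens of fresh tool output.
baseline = turn_cost(50_000, 20_000, prefix_cached=True)

# A tool that shrinks the prefix by 40% but loses the cache hit.
rewritten_prefix = turn_cost(30_000, 20_000, prefix_cached=False)

# Leave the prefix alone and halve only the fresh, full-price content.
trimmed_fresh = turn_cost(50_000, 10_000, prefix_cached=True)

print(round(baseline, 4), round(rewritten_prefix, 4), round(trimmed_fresh, 4))
```

&lt;p&gt;With these assumed numbers, shrinking the prefix by 40% while losing the cache hit doubles the input cost of the turn, and halving only the fresh content cuts it by 40%.&lt;/p&gt;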

&lt;h2&gt;
  
  
  What I built
&lt;/h2&gt;

&lt;p&gt;A small binary that runs between Claude Code and the API. Your request goes through it on the way out, the junk gets stripped, and the reply comes back untouched.&lt;/p&gt;

&lt;p&gt;I run it on Claude Code, but it isn't Claude-specific. Anything that routes through &lt;code&gt;HTTPS_PROXY&lt;/code&gt; gets the same treatment: Codex, Cursor, Aider, whatever you've got.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @llmtrim/cli &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; llmtrim setup
&lt;span class="c"&gt;# then open a NEW terminal for Claude Code&lt;/span&gt;
llmtrim status &lt;span class="nt"&gt;--watch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's my own dashboard right now, from real Claude Code use, not the benchmark:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx01tsinbgmkcxlf1txxz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx01tsinbgmkcxlf1txxz.png" alt="llmtrim status --watch: $198.05 off the real bill, $50.36 saved today, 123.4M tokens trimmed across 19,621 requests; input down 67% (cache excluded), per-model breakdown for opus 4.8, fable 5, and sonnet 4.6" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The $198.05 is what actually came off the bill, after caching. I'd read the input number (-67%, cache excluded) as the honest one. Output savings are an estimate, because the proxy can't A/B your live traffic.&lt;/p&gt;

&lt;p&gt;Two things make it safe to leave running.&lt;/p&gt;

&lt;p&gt;It never rewrites anything under a &lt;code&gt;cache_control&lt;/code&gt; marker, so the cache discount survives. The cache benefit only shows up on repeated-prefix workloads, but on diverse one-shot traffic there's little to cache anyway.&lt;/p&gt;

&lt;p&gt;It can't make your bill bigger. It re-measures every step before the request goes out and reverts anything that doesn't net out on cost. Provider rejects the compressed request? The original goes out verbatim. Worst case it does nothing. The tokenizer is exact on OpenAI. On Anthropic and Gemini there's no public exact tokenizer, so it falls back to a BPE approximation, and &lt;code&gt;status&lt;/code&gt; tells you when.&lt;/p&gt;
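
&lt;p&gt;The re-measure-and-revert idea can be sketched in a few lines. This is not llmtrim's code: &lt;code&gt;count_tokens&lt;/code&gt;, &lt;code&gt;apply_guarded&lt;/code&gt;, and both stages are hypothetical names, and a whitespace split stands in for a real tokenizer.&lt;/p&gt;

```python
def count_tokens(text):
    """Stand-in token counter (whitespace split), not a real tokenizer."""
    return len(text.split())


def apply_guarded(text, stages):
    """Run each stage, keeping its output only when the token count drops."""
    current = text
    for stage in stages:
        candidate = stage(current)
        # min() returns the first argument on ties, so a stage that saves
        # nothing (or makes things worse) is reverted.
        current = min(current, candidate, key=count_tokens)
    return current


def collapse_repeats(text):
    """Hypothetical stage: drop consecutive duplicate lines."""
    kept = []
    for line in text.splitlines():
        if not kept or kept[-1] != line:
            kept.append(line)
    return "\n".join(kept)


def annotate(text):
    """Hypothetical bad stage: it makes the text longer."""
    return "NOTE compressed by stage two\n" + text


raw = "ok\nok\nok\nerror: disk full"
result = apply_guarded(raw, [collapse_repeats, annotate])
print(result)
```

&lt;p&gt;The first stage is kept because it lowers the count. The second is thrown away because it raises it, so the worst a stage can do is nothing.&lt;/p&gt;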

&lt;p&gt;There's no second model in the loop. It's deterministic text cleanup, and your prompts never leave the machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tool output is where the money goes
&lt;/h2&gt;

&lt;p&gt;Most of the waste is command output. The agent runs a build, gets 200 lines back, and 2 of them are the errors that matter. The other 198 are noise you're paying full freight to re-send.&lt;/p&gt;

&lt;p&gt;A real example, a build log the &lt;code&gt;bash&lt;/code&gt; tool returned.&lt;/p&gt;

&lt;p&gt;Before, 58 lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[2026-06-13T10:02:00Z] INFO  compiling module core::worker::task_0 (incremental)
[2026-06-13T10:02:01Z] INFO  compiling module core::worker::task_1 (incremental)
... 28 more near-identical INFO lines ...
[2026-06-13T10:02:31Z] ERROR src/worker/pool.rs:214: mismatched types: expected `usize`, found `i64`
... 25 more INFO lines ...
[2026-06-13T10:03:01Z] ERROR src/net/conn.rs:88: cannot borrow `buf` as mutable more than once
[2026-06-13T10:03:02Z] INFO  build failed, 2 errors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After, 5 lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[{}] INFO compiling module core::worker::task_{} (incremental) [×30: 0..29]
[2026-06-13T10:02:31Z] ERROR src/worker/pool.rs:214: mismatched types: expected `usize`, found `i64`
[{}] INFO compiling module core::net::conn_{} (incremental) [×25: 0..24]
[2026-06-13T10:03:01Z] ERROR src/net/conn.rs:88: cannot borrow `buf` as mutable more than once
[2026-06-13T10:03:02Z] INFO  build failed, 2 errors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both errors and the summary survive word for word. The repetitive INFO lines fold into a template plus their values, losslessly, because the range is regular. The model sees what happened, at a fifth the cost.&lt;/p&gt;
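
&lt;p&gt;A minimal sketch of the folding idea, under stated assumptions: it mirrors the shape of the output above but is not llmtrim's implementation. &lt;code&gt;template_of&lt;/code&gt; and &lt;code&gt;fold_log&lt;/code&gt; are hypothetical names, and the sketch only folds runs whose numeric suffixes form a regular range.&lt;/p&gt;

```python
import re
from itertools import groupby

TIMESTAMP = re.compile(r"^\[[^\]]+\]")  # leading [2026-06-13T10:02:00Z]
INDEX = re.compile(r"_(\d+)\b")         # numeric suffix such as task_17


def template_of(line):
    """Replace the timestamp and numeric suffixes with {} placeholders."""
    return INDEX.sub("_{}", TIMESTAMP.sub("[{}]", line))


def fold_log(lines, min_run=3):
    """Fold consecutive lines sharing a template; keep everything else verbatim."""
    folded = []
    for template, group in groupby(lines, key=template_of):
        run = list(group)
        if len(run) in range(min_run):  # fewer than min_run lines: not worth folding
            folded.extend(run)
            continue
        indices = [int(m.group(1)) for m in map(INDEX.search, run) if m]
        start = indices[0] if indices else 0
        if indices == list(range(start, start + len(run))):
            # Regular range: the template plus first..last describes every suffix.
            folded.append(f"{template} [×{len(run)}: {indices[0]}..{indices[-1]}]")
        else:
            folded.extend(run)  # irregular values: leave the run untouched
    return folded


log = [
    f"[2026-06-13T10:02:{i:02d}Z] INFO  compiling module core::worker::task_{i} (incremental)"
    for i in range(4)
]
log.append("[2026-06-13T10:02:31Z] ERROR src/worker/pool.rs:214: mismatched types")

folded = fold_log(log)
print("\n".join(folded))
```

&lt;p&gt;A one-off line like the ERROR never forms a run, so it passes through word for word.&lt;/p&gt;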

&lt;p&gt;On that tool-output layer, the layer the closest tool (Headroom) targets, llmtrim removed about 84% of the input tokens against Headroom's 36%, same &lt;code&gt;o200k_base&lt;/code&gt; tokenizer. Headroom only touches input, so this is the tool-output slice, not whole traffic.&lt;/p&gt;

&lt;p&gt;Log-folding is one of ten compressors. Another re-encodes a JSON array into a compact table, with the same rows at a third of the tokens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;before:  [{"id":1,"city":"Paris","ok":true},{"id":2,"city":"Lyon","ok":false}, … 200 rows]
after:   [200]{id,city,ok}: 1,Paris,true; 2,Lyon,false; …          (lossless)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
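
&lt;p&gt;A rough sketch of that re-encoding with a round trip to check nothing is lost. &lt;code&gt;to_table&lt;/code&gt; and &lt;code&gt;from_table&lt;/code&gt; are hypothetical names, not llmtrim's API, and the sketch assumes flat rows with identical keys and string values that contain no commas or semicolons and don't look like JSON literals.&lt;/p&gt;

```python
import json


def cell(value):
    """Strings go in bare; everything else keeps its JSON spelling (true, 1, null)."""
    return value if isinstance(value, str) else json.dumps(value)


def to_table(rows):
    """Encode a list of flat, same-keyed dicts as one header plus delimited rows."""
    columns = list(rows[0])
    if any(list(row) != columns for row in rows):
        raise ValueError("rows must share the same keys in the same order")
    records = [",".join(cell(row[c]) for c in columns) for row in rows]
    return "[%d]{%s}: %s" % (len(rows), ",".join(columns), "; ".join(records))


def from_table(text):
    """Invert to_table under the same assumptions."""
    head, body = text.split(": ", 1)
    columns = head[head.index("{") + 1 : -1].split(",")
    rows = []
    for record in body.split("; "):
        values = []
        for raw in record.split(","):
            try:
                values.append(json.loads(raw))
            except json.JSONDecodeError:
                values.append(raw)  # not a JSON literal, so treat it as a plain string
        rows.append(dict(zip(columns, values)))
    return rows


rows = json.loads('[{"id":1,"city":"Paris","ok":true},{"id":2,"city":"Lyon","ok":false}]')
table = to_table(rows)
print(table)  # [2]{id,city,ok}: 1,Paris,true; 2,Lyon,false
print(from_table(table) == rows)
```

&lt;p&gt;The saving comes from writing the keys once in the header instead of once per row.&lt;/p&gt;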



&lt;h2&gt;
  
  
  The numbers, and the honest caveats
&lt;/h2&gt;

&lt;p&gt;Every case is sent twice, once original and once compressed, both answers scored and billed at real rates. Cost and quality measured together, not estimated. 112 paired A/B cases across 11 corpora (5 to 12 each), all in the repo.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Input tokens&lt;/td&gt;
&lt;td&gt;-31%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output tokens&lt;/td&gt;
&lt;td&gt;-74%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Round-trip cost (qwen3-next-80b)&lt;/td&gt;
&lt;td&gt;-66%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answer quality (aggregate)&lt;/td&gt;
&lt;td&gt;78.9% -&amp;gt; 82.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Read the caveats before you quote that 66%:&lt;/p&gt;

&lt;p&gt;The token cuts (-31% input, -74% output) are model-independent. The dollar figure tracks each model's output-to-input price ratio, so it's -66% on qwen3-next-80b (non-reasoning) and lands around -59% at Opus and Sonnet rates. Run it on your model.&lt;/p&gt;

&lt;p&gt;Quality held in aggregate (+3.3pp), but per workload it ranges from -8pp on grade-school math to +21pp on multi-hop RAG, and several per-corpus deltas sit inside their confidence interval. One lossy code stage measured -21.6pp and got dropped from the default. So the aggregate is the headline; treat the per-corpus cells as directional.&lt;/p&gt;

&lt;p&gt;Reproduce it from the repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 crates/llmtrim-cli/bench/scripts/download.py 40
cargo run &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="nt"&gt;--features&lt;/span&gt; live &lt;span class="nt"&gt;--&lt;/span&gt; bench suite   &lt;span class="c"&gt;# needs OPENROUTER_API_KEY&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Without the proxy
&lt;/h2&gt;

&lt;p&gt;If you don't want to route traffic through a proxy, the same engine is available as an MCP server, a CLI, an embeddable Rust crate, or through bindings for Python, Ruby, Swift, and Kotlin.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;llmtrim&lt;/span&gt;
&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llmtrim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llmtrim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OPEN_AI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aggressive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_tokens_before&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_tokens_after&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  It's early, and I want your numbers
&lt;/h2&gt;

&lt;p&gt;This is rough in places and it won't help every workload. Chat with short prompts has nothing to trim.&lt;/p&gt;

&lt;p&gt;The run I most want from you is the opposite of a win: a workload where llmtrim saves close to nothing. Those are the ones that turn up bugs. Point it at a session and tell me what you see.&lt;/p&gt;

&lt;p&gt;Repo, AGPL-3.0: &lt;a href="https://github.com/fkiene/llmtrim" rel="noopener noreferrer"&gt;https://github.com/fkiene/llmtrim&lt;/a&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>ai</category>
      <category>showdev</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
