How I cut my Claude Code bill 65% with a local proxy

François Kiene — Sun, 21 Jun 2026 16:36:12 +0000

I opened my Claude Code bill, didn't like the number, and went looking for why.

Caching saves less than you think

Prompt caching only discounts the stable prefix you mark. New content each turn, the latest messages and fresh tool output, is full price. And that new-content surface is most of the bill.

The agent runs a command, reads a 400-line git log or a test dump, and that whole wall of text gets re-sent at full price on the next turn. The model's own replies cost too.

I tried a couple of prompt-shrinking tools first. The problem: anything that rewrites the cached prefix forfeits the cache discount, so you shrink the text and pay the uncached rate on what's left. You can come out behind.

So the rule I needed was narrow. Shrink the stuff that's already full price, never touch the cached prefix.
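
To make that trade-off concrete, here is a back-of-the-envelope sketch. The price multipliers and token counts are illustrative assumptions, not measurements from my bill and not any provider's guaranteed rates; plug in your own price sheet.

```python
# Illustrative price multipliers relative to the base input rate.
# Assumptions for this sketch only; check your provider's pricing.
CACHE_READ = 0.10   # cached prefix re-read on a cache hit
CACHE_WRITE = 1.25  # prefix re-written to the cache after a miss
UNCACHED = 1.00     # new content: latest messages, fresh tool output


def turn_cost(prefix_tokens, new_tokens, prefix_rate, new_rate=UNCACHED):
    """Input cost of one turn, in units of base-rate tokens."""
    return prefix_tokens * prefix_rate + new_tokens * new_rate


# Hypothetical turn: a large stable prefix plus fresh tool output.
prefix, new = 40_000, 6_000
keep = 0.6  # a shrinker that removes 40% of whatever it touches

baseline = turn_cost(prefix, new, CACHE_READ)
# Shrinks everything, but rewriting the prefix turns the cache hit into a miss.
rewrite_prefix = turn_cost(prefix * keep, new * keep, CACHE_WRITE)
# Shrinks only the new content; the prefix stays byte-identical and cached.
trim_new_only = turn_cost(prefix, new * keep, CACHE_READ)

print(f"baseline        {baseline:>8.0f}")
print(f"rewrite prefix  {rewrite_prefix:>8.0f}")
print(f"trim new only   {trim_new_only:>8.0f}")
```

Under these assumptions the prefix-rewriting shrinker costs more than doing nothing, while trimming only the new content comes out ahead.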

What I built

A small binary that runs between Claude Code and the API. Your request goes through it on the way out, the junk gets stripped, and the reply comes back untouched.

I run it on Claude Code, but it isn't Claude-specific. Anything that routes through HTTPS_PROXY gets the same treatment: Codex, Cursor, Aider, whatever you've got.

npm install -g @llmtrim/cli && llmtrim setup
# then open a NEW terminal for Claude Code
llmtrim status --watch

Here's my own dashboard right now, from real Claude Code use, not the benchmark:

The $305.82 is what actually came off the bill, after caching. I'd read the input number (-65%, cache excluded) as the honest one. Output savings are an estimate, because the proxy can't A/B your live traffic.

Two things make it safe to leave running.

It never rewrites anything under a cache_control marker, so the cache discount survives. That benefit only shows up on repeated-prefix workloads, but on diverse one-shot traffic there's little to cache anyway.

It can't make your bill bigger. It re-measures every step before the request goes out and reverts anything that doesn't net out on cost. Provider rejects the compressed request? The original goes out verbatim. Worst case it does nothing. Token counts are exact on OpenAI; Anthropic and Gemini publish no exact tokenizer, so there the count is a BPE approximation, and llmtrim status tells you when.

There's no second model in the loop. It's deterministic text cleanup, and your prompts never leave the machine.
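
The "measure, then revert" guard can be sketched in a few lines. This is a minimal illustration of the idea, not llmtrim's actual code: `count_tokens`, `apply_guarded`, and both stages are hypothetical names, and the whitespace token counter is a crude stand-in for a real tokenizer and a real cost model.

```python
from typing import Callable, Iterable


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated pieces."""
    return len(text.split())


def apply_guarded(text: str, stages: Iterable[Callable[[str], str]]) -> str:
    """Run each stage, keeping its output only if it measures strictly smaller."""
    current = text
    for stage in stages:
        candidate = stage(current)
        if count_tokens(candidate) < count_tokens(current):
            current = candidate  # the stage paid for itself
        # otherwise revert: `current` is left unchanged
    return current


def dedupe_consecutive(text: str) -> str:
    """Drop lines that repeat the line directly above them."""
    out = []
    for line in text.splitlines():
        if not out or out[-1] != line:
            out.append(line)
    return "\n".join(out)


def add_banner(text: str) -> str:
    """A deliberately bad stage that makes the text longer."""
    return "### processed by sketch ###\n" + text


sample = "ok\nok\nok\nerror: boom\n"
result = apply_guarded(sample, [dedupe_consecutive, add_banner])
print(result)
```

The useful stage is kept and the one that grows the text is thrown away, so the worst case is the original text going out unchanged.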

Tool output is where the money goes

Most of the waste is command output. The agent runs a build, gets 200 lines back, and 2 of them are the errors that matter. The other 198 are noise you're paying full freight to re-send.

A real example, a build log the bash tool returned.

Before, 58 lines:

[2026-06-13T10:02:00Z] INFO  compiling module core::worker::task_0 (incremental)
[2026-06-13T10:02:01Z] INFO  compiling module core::worker::task_1 (incremental)
... 28 more near-identical INFO lines ...
[2026-06-13T10:02:31Z] ERROR src/worker/pool.rs:214: mismatched types: expected `usize`, found `i64`
... 25 more INFO lines ...
[2026-06-13T10:03:01Z] ERROR src/net/conn.rs:88: cannot borrow `buf` as mutable more than once
[2026-06-13T10:03:02Z] INFO  build failed, 2 errors

After, 5 lines:

[{}] INFO compiling module core::worker::task_{} (incremental) [×30: 0..29]
[2026-06-13T10:02:31Z] ERROR src/worker/pool.rs:214: mismatched types: expected `usize`, found `i64`
[{}] INFO compiling module core::net::conn_{} (incremental) [×25: 0..24]
[2026-06-13T10:03:01Z] ERROR src/net/conn.rs:88: cannot borrow `buf` as mutable more than once
[2026-06-13T10:03:02Z] INFO  build failed, 2 errors

Both errors and the summary survive word for word. The repetitive INFO lines fold into a template plus their values, losslessly, because the range is regular. The model sees what happened, at a fifth the cost.
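
For a feel of how this kind of folding can work, here is a simplified sketch. It is not the compressor from the repo: `template_of` and `fold` are hypothetical helpers, the log is synthetic (modeled on the example above), and unlike the real thing this version simply discards the timestamps of folded lines instead of encoding them.

```python
import re
from itertools import groupby

TS = re.compile(r"^\[[^\]]*\]")        # leading [timestamp]
LAST_INT = re.compile(r"\d+(?=\D*$)")  # last run of digits on the line


def template_of(line):
    """Return (template, value): timestamp and last integer replaced by {}."""
    body = TS.sub("[{}]", line, count=1)
    m = LAST_INT.search(body)
    if not m:
        return body, None
    return body[:m.start()] + "{}" + body[m.end():], int(m.group())


def fold(lines, min_run=3):
    """Collapse runs of consecutive lines that share a template and count up by 1."""
    out = []
    keyed = [(template_of(line), line) for line in lines]
    for tmpl, group in groupby(keyed, key=lambda kv: kv[0][0]):
        group = list(group)
        values = [kv[0][1] for kv in group]
        regular = (
            len(group) >= min_run
            and None not in values
            and values == list(range(values[0], values[0] + len(values)))
        )
        if regular:
            out.append(f"{tmpl} [×{len(values)}: {values[0]}..{values[-1]}]")
        else:
            out.extend(kv[1] for kv in group)  # irregular or short run: keep verbatim
    return out


# Synthetic log modeled on the example above.
log = (
    [f"[2026-06-13T10:02:{i:02d}Z] INFO compiling module core::worker::task_{i} (incremental)"
     for i in range(30)]
    + ["[2026-06-13T10:02:31Z] ERROR src/worker/pool.rs:214: mismatched types: expected `usize`, found `i64`"]
    + ["[2026-06-13T10:03:02Z] INFO build failed, 2 errors"]
)
folded = fold(log)
print("\n".join(folded))
```

Lines that do not form a regular run, such as the error and the summary, pass through untouched.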

I ran llmtrim and Headroom head-to-head at matched aggressiveness, same o200k_base tokenizer, harness in the repo. On raw input tokens Headroom usually strips more: on tool output it cut 74% to my 58% at the aggressive setting. But raw input is half the story. In the same live A/B (small sample, directional), Headroom's answer accuracy fell to 50% on the faithful cases where llmtrim held 100%, and its output ballooned to about 4,000 tokens against llmtrim's 900. Cutting more input doesn't help if the model then answers wrong and writes four times as much back. The bill is round-trip, and that's the half Headroom leaves on the table.

Log-folding is one of ten compressors. Another re-encodes a JSON array into a compact table, keeping the same rows at a third of the tokens:

before:  [{"id":1,"city":"Paris","ok":true},{"id":2,"city":"Lyon","ok":false}, … 200 rows]
after:   [200]{id,city,ok}: 1,Paris,true; 2,Lyon,false; …          (lossless)
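
A minimal sketch of that re-encoding, assuming flat rows that all share the same keys in the same order. `to_table` and `cell` are hypothetical names, and the quoting rule (bare letters-only strings, JSON quoting for everything else) is my simplification for the sketch rather than the format's real specification.

```python
import json
import re

BARE = re.compile(r"[A-Za-z]+")          # strings safe to emit without quotes
RESERVED = {"true", "false", "null"}     # would be confused with JSON literals


def cell(value):
    """Render one value: simple strings bare, everything else as JSON."""
    if isinstance(value, str) and BARE.fullmatch(value) and value not in RESERVED:
        return value
    return json.dumps(value)


def to_table(rows):
    """Encode a list of flat, same-shaped dicts as a single compact line."""
    if not rows:
        return "[0]{}:"
    keys = list(rows[0])
    if any(list(row) != keys for row in rows):
        raise ValueError("rows must share the same keys in the same order")
    header = "[" + str(len(rows)) + "]{" + ",".join(keys) + "}: "
    body = "; ".join(",".join(cell(row[k]) for k in keys) for row in rows)
    return header + body


rows = [
    {"id": 1, "city": "Paris", "ok": True},
    {"id": 2, "city": "Lyon", "ok": False},
]
table = to_table(rows)
print(table)
```

The saving comes from stating the keys once in the header instead of repeating them in every row.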

The numbers, and the honest caveats

Every case is sent twice, once original and once compressed, both answers scored and billed at real rates. Cost and quality measured together, not estimated. 112 paired A/B cases across 11 corpora (5 to 12 each), all in the repo.

Metric                              Result
Input tokens                        -31%
Output tokens                       -74%
Round-trip cost (qwen3-next-80b)    -66%
Answer quality (aggregate)          78.9% -> 82.2%

Read the caveats before you quote that 66%:

The token cuts (-31% input, -74% output) are model-independent. The dollar figure tracks each model's output-to-input price ratio, so it's -66% on qwen3-next-80b (non-reasoning) and lands around -59% at Opus and Sonnet rates. Run it on your model.

Quality held in aggregate (+3.3pp), but per workload it ranges from -8pp on grade-school math to +21pp on multi-hop RAG, and several per-corpus deltas sit inside their confidence interval. One lossy code stage measured -21.6pp and got dropped from the default. So the aggregate is the headline; treat the per-corpus cells as directional.
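
To see why the dollar figure moves with the price ratio, here is a small sketch of the blending arithmetic. The token volumes and prices are hypothetical, and the formula assumes the -31% and -74% cuts apply uniformly to your traffic and ignores caching, which is a simplification of how the benchmark actually bills each case.

```python
def round_trip_saving(in_tokens, out_tokens, price_in, price_out,
                      in_cut=0.31, out_cut=0.74):
    """Fraction of round-trip cost saved, given token cuts and per-token prices."""
    before = in_tokens * price_in + out_tokens * price_out
    after = (in_tokens * (1 - in_cut) * price_in
             + out_tokens * (1 - out_cut) * price_out)
    return 1 - after / before


# Hypothetical traffic mix: 10,000 input tokens, 2,000 output tokens.
# Only the output-to-input price ratio matters, so price_in is fixed at 1.
savings = {
    ratio: round_trip_saving(10_000, 2_000, price_in=1.0, price_out=ratio)
    for ratio in (2, 5, 10)
}
for ratio, saving in savings.items():
    print(f"output priced at {ratio}x input: saving {saving:.1%}")
```

The pricier output is relative to input, the closer the blended saving gets to the output cut; the cheaper it is, the closer it gets to the input cut.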

Reproduce it from the repo:

python3 crates/llmtrim-cli/bench/scripts/download.py 40
cargo run -q --features live -- bench suite   # needs OPENROUTER_API_KEY

Without the proxy

If you don't want to route traffic through a proxy, the same engine is available as an MCP server, a CLI, an embeddable Rust crate, and bindings for Python, Ruby, Swift, and Kotlin.

import llmtrim

# request_json is assumed to already hold the provider request body you would otherwise send
out = llmtrim.compress(request_json, llmtrim.Provider.OPEN_AI, "aggressive")
print(out.input_tokens_before, "->", out.input_tokens_after)

It's early, and I want your numbers

This is rough in places and it won't help every workload. Chat with short prompts has nothing to trim.

The run I most want from you is the opposite of a win: a workload where llmtrim saves close to nothing. Those are the ones that turn up bugs. Point it at a session and tell me what you see.

Repo, MPL-2.0: https://github.com/fkiene/llmtrim