François Kiene

Posted on Jun 16 • Originally published at github.com

How I cut my Claude Code bill 67% with a local proxy

#claude #ai #showdev #programming

I opened my Claude Code bill, didn't like the number, and went looking for why.

Caching saves less than you think

Prompt caching only discounts the stable prefix you mark. Whatever is new in each turn, meaning the latest messages and fresh tool output, is billed at full price. And that new-content surface is most of the bill.

The agent runs a command, reads a 400-line git log or a test dump, and that whole wall of text gets re-sent at full price on the next turn. The model's own replies cost too.

I tried a couple of prompt-shrinking tools first. The problem: anything that rewrites the cached prefix forfeits the cache discount, so you shrink the text and pay the uncached rate on what's left. You can come out behind.

So the rule I needed was narrow. Shrink the stuff that's already full price, never touch the cached prefix.
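To make that trade-off concrete, here is a back-of-the-envelope sketch in Python. The token counts are hypothetical, and the 0.1 multiplier assumes Anthropic's published cache-read rate of one tenth of the base input price (check current pricing). It ignores the cache-write premium and output tokens, so read it as an illustration, not a billing model.

```python
# Assumption: cache reads are billed at 0.1x the base input price.
CACHE_READ_MULTIPLIER = 0.1

def input_cost(prefix_tokens, new_tokens, prefix_cached):
    """Input cost of one turn, in units of 'one full-price input token'."""
    prefix_rate = CACHE_READ_MULTIPLIER if prefix_cached else 1.0
    return prefix_tokens * prefix_rate + new_tokens

# Hypothetical turn: 50k-token cached prefix, 5k tokens of fresh tool output.
baseline = input_cost(50_000, 5_000, prefix_cached=True)

# A tool that shrinks the prefix by 40% but invalidates the cache.
rewritten_prefix = input_cost(30_000, 5_000, prefix_cached=False)

# A tool that leaves the prefix alone and shrinks only the new content.
trimmed_tail = input_cost(50_000, 2_000, prefix_cached=True)

print(baseline, rewritten_prefix, trimmed_tail)
```

With these made-up numbers, rewriting the prefix costs several times the baseline even though fewer tokens are sent, while trimming only the uncached tail is the one option that comes out ahead.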

What I built

A small binary that runs between Claude Code and the API. Your request goes through it on the way out, the junk gets stripped, and the reply comes back untouched.

I run it on Claude Code, but it isn't Claude-specific. Anything that routes through HTTPS_PROXY gets the same treatment: Codex, Cursor, Aider, whatever you've got.

npm install -g @llmtrim/cli && llmtrim setup
# then open a NEW terminal for Claude Code
llmtrim status --watch

Here's my own dashboard right now, from real Claude Code use, not the benchmark:

The $198.05 is what actually came off the bill, after caching. I'd read the input number (-67%, cache excluded) as the honest one. Output savings are an estimate, because the proxy can't A/B your live traffic.

Two things make it safe to leave running.

It never rewrites anything under a cache_control marker, so the cache discount survives. The cache benefit only shows up on repeated-prefix workloads, but on diverse one-shot traffic there's little to cache anyway.

It can't make your bill bigger. It re-measures every step before the request goes out and reverts anything that doesn't net out on cost. Provider rejects the compressed request? The original goes out verbatim. Worst case it does nothing. The tokenizer is exact on OpenAI. On Anthropic and Gemini there's no public exact tokenizer, so the count is a BPE approximation, and llmtrim status tells you when that's the case.
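Those two guarantees can be sketched together in a few lines of Python. This is not llmtrim's code: it assumes a flat list of Anthropic-style content blocks, treats everything up to the last block carrying cache_control as the protected prefix, and uses a whitespace word count as a stand-in tokenizer. The names `count_tokens`, `last_cached_index`, and `trim_tail` are made up for the illustration, and the guard compares token counts where the real tool is described as comparing cost.

```python
def count_tokens(text):
    # Stand-in tokenizer (assumption): whitespace-separated words.
    return len(text.split())

def last_cached_index(blocks):
    """Index of the last block carrying a cache_control marker, or -1."""
    last = -1
    for i, block in enumerate(blocks):
        if "cache_control" in block:
            last = i
    return last

def trim_tail(blocks, compress):
    """Compress only text blocks after the cached prefix, and only when it pays."""
    boundary = last_cached_index(blocks)
    result = []
    for i, block in enumerate(blocks):
        if i <= boundary or block.get("type") != "text":
            result.append(block)  # cached prefix and non-text blocks stay untouched
            continue
        candidate = compress(block["text"])
        if count_tokens(candidate) < count_tokens(block["text"]):
            result.append({**block, "text": candidate})
        else:
            result.append(block)  # no net saving: keep the original block
    return result
```

Because the untouched blocks are passed through as the same objects, the worst case for this sketch is a request identical to the one that came in.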

There's no second model in the loop. It's deterministic text cleanup, and your prompts never leave the machine.

Tool output is where the money goes

Most of the waste is command output. The agent runs a build, gets 200 lines back, and 2 of them are the errors that matter. The other 198 are noise you're paying full freight to re-send.

A real example, a build log the bash tool returned.

Before, 58 lines:

[2026-06-13T10:02:00Z] INFO  compiling module core::worker::task_0 (incremental)
[2026-06-13T10:02:01Z] INFO  compiling module core::worker::task_1 (incremental)
... 28 more near-identical INFO lines ...
[2026-06-13T10:02:31Z] ERROR src/worker/pool.rs:214: mismatched types: expected `usize`, found `i64`
... 25 more INFO lines ...
[2026-06-13T10:03:01Z] ERROR src/net/conn.rs:88: cannot borrow `buf` as mutable more than once
[2026-06-13T10:03:02Z] INFO  build failed, 2 errors

After, 5 lines:

[{}] INFO compiling module core::worker::task_{} (incremental) [×30: 0..29]
[2026-06-13T10:02:31Z] ERROR src/worker/pool.rs:214: mismatched types: expected `usize`, found `i64`
[{}] INFO compiling module core::net::conn_{} (incremental) [×25: 0..24]
[2026-06-13T10:03:01Z] ERROR src/net/conn.rs:88: cannot borrow `buf` as mutable more than once
[2026-06-13T10:03:02Z] INFO  build failed, 2 errors

Both errors and the summary survive word for word. The repetitive INFO lines fold into a template plus their values, losslessly, because the range is regular. The model sees what happened, at a fifth the cost.
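The folding idea can be sketched in Python as below. It is a simplified illustration, not the compressor in the repo: it assumes ISO timestamps and a single `_N` counter per line, and it folds a run only when the counters increase by exactly one. Like the example output above, it masks timestamps with `{}`, so in this sketch the counters are recoverable but the individual timestamps are not.

```python
import re

TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")
COUNTER = re.compile(r"(?<=_)\d+\b")  # numeric suffix such as task_17

def template_of(line):
    """Return (template, counter), or (None, None) if the line has no single counter."""
    counters = COUNTER.findall(line)
    if len(counters) != 1:
        return None, None
    masked = COUNTER.sub("{}", TIMESTAMP.sub("{}", line))
    return masked, int(counters[0])

def fold_log(lines, min_run=3):
    """Fold runs of same-template lines whose counters form a consecutive range."""
    out = []
    i = 0
    while i < len(lines):
        tmpl, start = template_of(lines[i])
        j = i + 1
        if tmpl is not None:
            # Extend the run while the template matches and the counter steps by one.
            while j < len(lines):
                t, n = template_of(lines[j])
                if t != tmpl or n != start + (j - i):
                    break
                j += 1
        if tmpl is not None and j - i >= min_run:
            out.append(f"{tmpl} [×{j - i}: {start}..{start + j - i - 1}]")
        else:
            out.extend(lines[i:j])  # errors, summaries, and short runs pass through verbatim
        i = j
    return out
```

Lines without a counter, which here includes the ERROR lines and the final summary, never match a template and so are never rewritten.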

On that tool-output layer, the layer the closest tool (Headroom) targets, llmtrim removed about 84% of the input tokens against Headroom's 36%, same o200k_base tokenizer. Headroom only touches input, so this is the tool-output slice, not whole traffic.

Log-folding is one of ten compressors. Another re-encodes a JSON array into a compact table, with the same rows at about a third of the tokens:

before:  [{"id":1,"city":"Paris","ok":true},{"id":2,"city":"Lyon","ok":false}, … 200 rows]
after:   [200]{id,city,ok}: 1,Paris,true; 2,Lyon,false; …          (lossless)
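A minimal Python sketch of that kind of re-encoding follows. The output format is copied from the example above; the rules for when to bail out are my assumptions about what it takes to stay lossless (uniform keys, scalar values, no delimiter characters in strings), not a description of llmtrim's encoder. `to_table` and `encode_cell` are illustrative names.

```python
import json

UNSAFE_CHARS = set(',;:{}[]"\n')

def encode_cell(value):
    """Return the cell text, or None if the value can't be written unambiguously."""
    if value is None or isinstance(value, (bool, int, float)):
        return json.dumps(value)  # JSON literals such as 1, 2.5, true, null
    if isinstance(value, str):
        looks_like_literal = value in ("", "true", "false", "null")
        try:
            float(value)
            looks_like_literal = True  # "42" would be confused with the number 42
        except ValueError:
            pass
        if looks_like_literal or value != value.strip() or UNSAFE_CHARS & set(value):
            return None
        return value
    return None  # nested objects and arrays are out of scope for this sketch

def to_table(rows):
    """Encode a list of same-shaped dicts as '[N]{keys}: row; row', or None to skip."""
    if not rows or not all(isinstance(r, dict) for r in rows):
        return None
    keys = list(rows[0])
    if not keys or any(UNSAFE_CHARS & set(k) for k in keys):
        return None
    if any(list(r) != keys for r in rows):
        return None
    encoded = []
    for row in rows:
        cells = [encode_cell(row[k]) for k in keys]
        if any(c is None for c in cells):
            return None
        encoded.append(",".join(cells))
    return f"[{len(rows)}]{{{','.join(keys)}}}: " + "; ".join(encoded)
```

Returning None whenever a row is irregular or a value is ambiguous means the caller keeps the original JSON, which is the conservative choice if the table has to decode back to the same rows.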

The numbers, and the honest caveats

Every case is sent twice, once original and once compressed, and both answers are scored and billed at real rates. Cost and quality are measured together, not estimated. There are 112 paired A/B cases across 11 corpora (5 to 12 each), all in the repo.
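The shape of a paired run like that can be sketched as follows. This is an illustration of the method, not the harness in the repo: `compress`, `call_model`, and `score` are placeholders you would supply, `Reply` is a made-up container, and prices are per token.

```python
from dataclasses import dataclass

@dataclass
class Reply:
    text: str
    input_tokens: int
    output_tokens: int

def paired_run(cases, compress, call_model, score, price_in, price_out):
    """Send each (prompt, reference) case twice and compare cost and quality.

    Returns (relative cost change, mean score original, mean score compressed).
    """
    totals = {"orig": [0.0, 0.0], "comp": [0.0, 0.0]}  # [cost, summed score]
    for prompt, reference in cases:
        for arm, text in (("orig", prompt), ("comp", compress(prompt))):
            reply = call_model(text)
            totals[arm][0] += reply.input_tokens * price_in + reply.output_tokens * price_out
            totals[arm][1] += score(reply.text, reference)
    n = len(cases)
    cost_change = totals["comp"][0] / totals["orig"][0] - 1.0
    return cost_change, totals["orig"][1] / n, totals["comp"][1] / n
```

Billing both arms with the same prices and scoring both with the same function is what lets the cost delta and the quality delta be read side by side.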

Metric	Result
Input tokens	-31%
Output tokens	-74%
Round-trip cost (qwen3-next-80b)	-66%
Answer quality (aggregate)	78.9% -> 82.2%

Read the caveats before you quote that 66%:

The token cuts (-31% input, -74% output) are model-independent. The dollar figure tracks each model's output-to-input price ratio, so it's -66% on qwen3-next-80b (non-reasoning) and lands around -59% at Opus and Sonnet rates. Run it on your model.

Quality held in aggregate (+3.3pp), but per workload it ranges from -8pp on grade-school math to +21pp on multi-hop RAG, and several per-corpus deltas sit inside their confidence interval. One lossy code stage measured -21.6pp and got dropped from the default. So the aggregate is the headline; treat the per-corpus cells as directional.
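To see why the dollar figure in the first caveat moves with the price ratio, here is a small Python sketch. The -31% and -74% cuts come from the table above; the token volumes and price ratios in the loop are hypothetical, and applying the aggregate cuts to a single traffic mix is a simplification. The function name is mine.

```python
def round_trip_saving(input_tokens, output_tokens, price_ratio,
                      input_cut=0.31, output_cut=0.74):
    """Fractional cost reduction when an output token costs price_ratio input tokens."""
    before = input_tokens + output_tokens * price_ratio
    after = (input_tokens * (1 - input_cut)
             + output_tokens * (1 - output_cut) * price_ratio)
    return 1 - after / before

# Hypothetical mix: 1,000 input tokens and 500 output tokens per request.
for ratio in (2, 5, 10):
    print(ratio, round(round_trip_saving(1_000, 500, ratio), 3))
```

The blended saving always sits between the input cut and the output cut, and it drifts toward the output cut as output tokens get relatively more expensive, which is why the same token reductions give different dollar results per model.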

Reproduce it from the repo:

python3 crates/llmtrim-cli/bench/scripts/download.py 40
cargo run -q --features live -- bench suite   # needs OPENROUTER_API_KEY

Without the proxy

If you don't want to route traffic through a proxy, the same engine runs as an MCP server, a CLI, or an embeddable Rust crate, and there are bindings for Python, Ruby, Swift, and Kotlin.

import llmtrim

# request_json: the request body you'd otherwise send to the provider (built by your own code)
out = llmtrim.compress(request_json, llmtrim.Provider.OPEN_AI, "aggressive")
print(out.input_tokens_before, "->", out.input_tokens_after)

It's early, and I want your numbers

This is rough in places and it won't help every workload. Chat with short prompts has nothing to trim.

The run I most want from you is the opposite of a win: a workload where llmtrim saves close to nothing. Those are the ones that turn up bugs. Point it at a session and tell me what you see.

Repo, AGPL-3.0: https://github.com/fkiene/llmtrim

Top comments (1)

François Kiene • Jun 16

Author here. A couple of footnotes on the numbers.

The 67% in the title is my own Claude Code use: input down 67% (cache excluded), about $198 off the bill so far, the screenshot above. That's one workload. So the repo has a real benchmark too: every case runs twice, original then compressed, both answers scored and billed at real rates. 112 paired runs. The harness is in the repo (crates/llmtrim-cli/bench). On that set round-trip cost drops 66% on qwen3-next-80b. The dollar saving tracks each model's output:input price ratio, so it lands around -59% at Opus and Sonnet rates.

What I actually want back is the opposite: a workload where it saves close to nothing. Short prompts have nothing to trim, and one lossy code stage measured -21.6pp and got cut from the default. Point it at a session and tell me where it does nothing, or worse. That's where the bugs are.