How I Cut My AI Coding Agent's Token Bill by 85% - Without Losing Answer Quality

#ai #llm #opensource #python

If you're using AI coding agents like Claude, Cursor, Codex, or Aider on a real codebase, you've probably noticed: your bill is mostly input tokens.

The agent stuffs as much of your repo as it can into every request. You pay for all of it, and the model still misses the files that never made it into the context.

I built Entroly to fix both problems - locally, on your machine.

The Two Wastes

AI coding agents waste context in two different ways:

They send too much - large repos mean 100K+ tokens per request
They send the wrong stuff - irrelevant files crowd out the ones that matter

Compression alone fixes only the first problem. You can shrink 100K tokens to 30K, but if those 30K come from the wrong files, the answer is still bad.

What Entroly Does

Entroly is a local verified-context layer that sits between your agent and the LLM provider. It selects the relevant repo evidence first, compresses noisy context second, and can then audit the model's answer against the evidence it supplied.

Key mechanisms:

Context selection - ranks your whole repo using BM25 + entropy scoring + dependency graph analysis, then packs the most answer-relevant files under a token budget using knapsack optimization
Reversible compression - everything compressed is fully recoverable via CCR handles. Nothing is lost.
Cache alignment - keeps the injected prefix byte-stable so provider cache discounts apply (Anthropic: up to 90% off, OpenAI: 50% off)
WITNESS hallucination guard - checks the model's answer against the evidence it was given. $0, ~3ms, no extra API call.
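
To make the selection and cache-alignment ideas concrete, here is a minimal sketch of packing scored files under a token budget with a 0/1 knapsack, then rendering them in a fixed order so the same selection always produces the same bytes. This is not Entroly's implementation: the `Candidate` type, the `pack` and `stable_prefix` functions, and the sample scores are all hypothetical, and the relevance scores are assumed to come from an earlier ranking pass.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    path: str
    text: str
    tokens: int    # estimated token cost of including this file
    score: float   # relevance score from a prior ranking step (assumed given)


def pack(candidates: list[Candidate], budget: int) -> list[Candidate]:
    """0/1 knapsack: maximize total score with total tokens <= budget."""
    # best[b] = (total score, indices chosen) using at most b tokens
    best: list[tuple[float, tuple[int, ...]]] = [(0.0, ())] * (budget + 1)
    for i, c in enumerate(candidates):
        # Walk budgets downward so each file is used at most once.
        for b in range(budget, c.tokens - 1, -1):
            score, chosen = best[b - c.tokens]
            if score + c.score > best[b][0]:
                best[b] = (score + c.score, chosen + (i,))
    return [candidates[i] for i in best[budget][1]]


def stable_prefix(selected: list[Candidate]) -> bytes:
    """Render files sorted by path so identical selections yield identical bytes."""
    ordered = sorted(selected, key=lambda c: c.path)
    return "".join(f"### {c.path}\n{c.text}\n" for c in ordered).encode("utf-8")


# Hypothetical candidates; scores and token counts are made up for illustration.
files = [
    Candidate("auth/session.py", "def login(): ...", tokens=40, score=0.9),
    Candidate("docs/changelog.md", "v1.0 ...", tokens=70, score=0.2),
    Candidate("auth/tokens.py", "def refresh(): ...", tokens=50, score=0.7),
    Candidate("tests/fixtures.json", "{}", tokens=30, score=0.1),
]

chosen = pack(files, budget=100)
print(sorted(c.path for c in chosen))  # ['auth/session.py', 'auth/tokens.py']
print(hashlib.sha256(stable_prefix(chosen)).hexdigest())
```

The two high-scoring files fit together in 90 tokens, so the low-value changelog is left out even though it would fit on its own. Sorting by path before rendering is one simple way to keep the prefix stable across requests. A table indexed by raw token count gets large for real budgets, so a practical version would likely bucket token counts or use a greedy approximation.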

Getting Started (60 seconds)

pip install entroly
cd /your/repo
entroly go

Or test the package first:

entroly verify-claims

No API key required.

Use It However You Work

Wrap mode: entroly wrap cursor / entroly wrap aider / entroly wrap claude
Proxy mode: entroly proxy -> point your tool at localhost:9377
MCP server: entroly serve -> works with any MCP-compatible client
Library: from entroly import compress_context

Works with Claude, Cursor, Codex, Aider, Continue, Windsurf, Cline, and 30+ more tools.

Real Numbers

Benchmark	Result
Token reduction (large repos)	70-95% fewer input tokens
Accuracy (NeedleInAHaystack)	100% retained
Hallucination detection (HaluEval-QA)	0.844 AUROC
WITNESS latency	~3ms, $0

Small prompts and tiny repos may show little or no savings. Always measure on your own repository.
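
One way to act on that advice is to compare input token counts for the same task with and without the layer. The sketch below assumes you already have both counts (for instance from your provider's usage reporting); `reduction` is a hypothetical helper, not part of Entroly, and the two figures are example inputs rather than measurements.

```python
def reduction(before_tokens: int, after_tokens: int) -> float:
    """Fraction of input tokens saved, e.g. 0.85 means 85% fewer tokens."""
    if before_tokens <= 0:
        raise ValueError("before_tokens must be positive")
    return 1.0 - after_tokens / before_tokens


baseline = 100_000   # input tokens without selection (example figure)
optimized = 15_000   # input tokens with selection and compression (example figure)

saved = reduction(baseline, optimized)
print(f"{saved:.0%} fewer input tokens")  # prints "85% fewer input tokens"
```

A result near zero, or a negative one, is the signal that your prompts or repo are too small to benefit.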

Why I Built This

I was spending $200+/month on AI coding tools, and most of that cost was the agent re-reading the same files over and over. Entroly combines query-conditioned context selection, recoverable compression, and WITNESS proof certificates in one local layer.

Apache-2.0, local-first, no outbound analytics by default.

GitHub: github.com/juyterman1000/entroly