AI coding agents are expensive — not because models cost too much per token, but because they send too many of them. An SRE debugging session with a raw agent: 65,694 tokens in. With Headroom in the middle: 5,118. Same bug found.
Headroom is a new open-source context compression layer that intercepts everything your agent reads — tool outputs, log dumps, RAG chunks, files, conversation history — and compresses it before the LLM ever sees it. It's local, reversible, and available as a drop-in proxy, a library, or an MCP server.
The numbers that matter
Savings on real agent workloads:
- Code search (100 results): 17,765 → 1,408 tokens (92% reduction)
- SRE incident debugging: 65,694 → 5,118 tokens (92%)
- GitHub issue triage: 54,174 → 14,761 tokens (73%)
- Codebase exploration: 78,502 → 41,254 tokens (47%)
Accuracy on standard benchmarks (GSM8K, TruthfulQA, SQuAD v2, BFCL) is preserved — some scores actually improve slightly, likely because the model sees cleaner signal.
What's doing the compression
Under the hood, Headroom routes content through a stack of specialised compressors:
- SmartCrusher — JSON, nested objects, arrays of dicts
- CodeCompressor — AST-aware for Python, JS, Go, Rust, Java, C++
- Kompress-base — a custom HuggingFace model trained on agentic traces, for prose and mixed content
- CacheAligner — stabilises prompt prefixes so Anthropic/OpenAI KV caches actually hit
It also does CCR (reversible compression) — originals are cached locally and the LLM can retrieve them on demand if it needs them. Nothing is destroyed.
Why the proxy mode matters
The most interesting deployment path: headroom proxy --port 8787, then point your existing tool at localhost. Zero code changes. Works with any language.
Or even simpler: headroom wrap claude wraps Claude Code, routes its traffic through Headroom automatically. One command, savings start immediately. Same for Codex, Cursor, Aider, Copilot CLI.
"Library — compress(messages) in Python or TypeScript, inline in any app. Proxy — headroom proxy --port 8787, zero code changes, any language."
There's also a cross-agent memory store — shared context across Claude, Codex, and Gemini sessions with auto-dedup — and a headroom learn feature that mines past failed sessions and writes corrections back to your CLAUDE.md / AGENTS.md.
What to do
-
Running Claude Code or Codex daily?
pip install "headroom-ai[all]"thenheadroom wrap claude. See the savings in five minutes. -
Using any OpenAI-compatible client?
headroom proxy --port 8787and point your client at localhost. No code changes needed. - On LangChain, Agno, or Vercel AI SDK? Native middleware integrations are available — no proxy required.
-
On Opus-class models? Also enable
HEADROOM_OUTPUT_SHAPER=1— it trims verbose model output too, and on 5× output pricing that adds up fast. - Not burning tokens on agent context yet? Bookmark it. You will be.
Source: github.com/chopratejas/headroom
✏️ Drafted with KewBot (AI), edited and approved by Drew.
Top comments (0)