60–95% fewer tokens in your agent loops, same answers. Meet Headroom.

#ai #api #llm #agents

AI coding agents are expensive — not because models cost too much per token, but because they send too many of them. An SRE debugging session with a raw agent: 65,694 tokens in. With Headroom in the middle: 5,118. Same bug found.

Headroom is a new open-source context compression layer that intercepts everything your agent reads — tool outputs, log dumps, RAG chunks, files, conversation history — and compresses it before the LLM ever sees it. It's local, reversible, and available as a drop-in proxy, a library, or an MCP server.

The numbers that matter

Savings on real agent workloads:

Code search (100 results): 17,765 → 1,408 tokens (92% reduction)
SRE incident debugging: 65,694 → 5,118 tokens (92%)
GitHub issue triage: 54,174 → 14,761 tokens (73%)
Codebase exploration: 78,502 → 41,254 tokens (47%)

Accuracy on standard benchmarks (GSM8K, TruthfulQA, SQuAD v2, BFCL) is preserved — some scores actually improve slightly, likely because the model sees cleaner signal.

What's doing the compression

Under the hood, Headroom routes content through a stack of specialised compressors:

SmartCrusher — JSON, nested objects, arrays of dicts
CodeCompressor — AST-aware for Python, JS, Go, Rust, Java, C++
Kompress-base — a custom HuggingFace model trained on agentic traces, for prose and mixed content
CacheAligner — stabilises prompt prefixes so Anthropic/OpenAI KV caches actually hit

It also does CCR (reversible compression) — originals are cached locally and the LLM can retrieve them on demand if it needs them. Nothing is destroyed.

Why the proxy mode matters

The most interesting deployment path: headroom proxy --port 8787, then point your existing tool at localhost. Zero code changes. Works with any language.

Or even simpler: headroom wrap claude wraps Claude Code, routes its traffic through Headroom automatically. One command, savings start immediately. Same for Codex, Cursor, Aider, Copilot CLI.

"Library — compress(messages) in Python or TypeScript, inline in any app. Proxy — headroom proxy --port 8787, zero code changes, any language."

There's also a cross-agent memory store — shared context across Claude, Codex, and Gemini sessions with auto-dedup — and a headroom learn feature that mines past failed sessions and writes corrections back to your CLAUDE.md / AGENTS.md.

What to do

Running Claude Code or Codex daily? pip install "headroom-ai[all]" then headroom wrap claude. See the savings in five minutes.
Using any OpenAI-compatible client? headroom proxy --port 8787 and point your client at localhost. No code changes needed.
On LangChain, Agno, or Vercel AI SDK? Native middleware integrations are available — no proxy required.
On Opus-class models? Also enable HEADROOM_OUTPUT_SHAPER=1 — it trims verbose model output too, and on 5× output pricing that adds up fast.
Not burning tokens on agent context yet? Bookmark it. You will be.

Source: github.com/chopratejas/headroom

✏️ Drafted with KewBot (AI), edited and approved by Drew.

Top comments (1)

Luis Cruz • Jun 22

This hits a real bottleneck in agent systems: context, not model intelligence, is often the dominant cost driver.

If these compression ratios hold in real workloads, the impact is less about raw token savings and more about enabling longer agent loops without degradation — which is what actually breaks most SRE/debug workflows today.

The key question isn’t just “do we reduce tokens,” but “does compression preserve the failure signal without collapsing edge-case detail?” That’s where most approaches usually trade cost for lost debugging fidelity.