If you're building AI agents or running LLM pipelines in production, you already know the pain: tool outputs, logs, RAG chunks, and conversation history pile up fast. Before you know it, you're burning through tokens at a rate that makes your billing dashboard uncomfortable to look at.
Headroom is an open-source project that tackles this problem directly. It compresses everything your AI agent reads — before it ever reaches the LLM — and claims 60–95% token reduction on real workloads, with accuracy preserved.
The Core Idea
Headroom sits as a layer between your application and the LLM provider. It takes whatever your agent was about to send — a stack of tool call results, a long log file, a RAG retrieval dump — and compresses it using one of several strategies depending on the content type:
- SmartCrusher handles JSON (arrays, nested objects, mixed types)
- CodeCompressor uses AST-aware compression for Python, JS, Go, Rust, Java, and C++
- Kompress-base is a HuggingFace model trained on agentic traces, for prose and text
- CacheAligner stabilizes prompt prefixes so provider KV caches actually hit consistently
- CCR (Content-Compressed Retrieval) stores originals locally and lets the LLM fetch them on demand — so compression is fully reversible
A ContentRouter figures out what kind of content it's looking at and picks the right compressor automatically. You don't have to think about it.
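To make the routing idea concrete, here is a minimal sketch of content-type detection. It is not Headroom's ContentRouter: the function name `guess_content_type` and the heuristics (try JSON first, then try parsing as Python, otherwise fall back to prose) are assumptions for illustration, and of the languages listed above it only recognizes Python.

```python
import ast
import json


def guess_content_type(text: str) -> str:
    """Classify text as 'json', 'code', or 'prose' (hypothetical heuristic)."""
    stripped = text.strip()
    if not stripped:
        return "prose"
    # JSON objects and arrays start with a brace or bracket and must parse cleanly.
    if stripped[0] in "{[":
        try:
            json.loads(stripped)
            return "json"
        except json.JSONDecodeError:
            pass
    # Call it code only if it parses as Python AND contains a definition or import.
    try:
        tree = ast.parse(stripped)
    except (SyntaxError, ValueError):
        return "prose"
    code_nodes = (
        ast.FunctionDef,
        ast.AsyncFunctionDef,
        ast.ClassDef,
        ast.Import,
        ast.ImportFrom,
    )
    if any(isinstance(node, code_nodes) for node in ast.walk(tree)):
        return "code"
    return "prose"
```

A real router would then hand each label to a matching compressor; the point here is only that cheap structural checks are often enough to tell the content types apart.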
The key thing: originals are never deleted. If the LLM needs the full version of something, it can retrieve it. The compressed prompt itself may drop detail, so the compression is reversible rather than lossless in the strict sense.
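The "originals are never deleted" guarantee can be pictured as a content-addressed store: the prompt carries a short stub, and the stub's key maps back to the full text. The sketch below is an assumption about the general pattern, not Headroom's CCR code. `OriginalStore`, `stash`, and `retrieve` are hypothetical names, and plain truncation stands in for a real compressor.

```python
import hashlib


class OriginalStore:
    """Keeps full originals in memory, keyed by a content hash (hypothetical)."""

    def __init__(self) -> None:
        self._originals: dict[str, str] = {}

    def stash(self, text: str, keep_chars: int = 40) -> tuple[str, str]:
        """Store the original and return (key, stub to place in the prompt)."""
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self._originals[key] = text
        # Truncation is a placeholder for a real compression strategy.
        stub = f"{text[:keep_chars]}... [truncated, retrieve with key {key}]"
        return key, stub

    def retrieve(self, key: str) -> str:
        """Return the untouched original for a key handed out by stash()."""
        return self._originals[key]
```

The stub is what the model sees by default. If it decides it needs the rest, a retrieval tool call with the key returns the original byte for byte.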
Real Numbers
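These are the token counts from the project's benchmarks on real agent workloads. Three of the four rows land inside the claimed 60–95% range; codebase exploration, at 47%, falls below it.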
| Workload | Before | After | Savings |
|---|---|---|---|
| Code search (100 results) | 17,765 | 1,408 | 92% |
| SRE incident debugging | 65,694 | 5,118 | 92% |
| GitHub issue triage | 54,174 | 14,761 | 73% |
| Codebase exploration | 78,502 | 41,254 | 47% |
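
The Savings column is one minus the ratio of the two token counts, rounded to a whole percent. A quick check using only the figures from the table (the helper name `savings_pct` is made up for this check):

```python
def savings_pct(before: int, after: int) -> int:
    """Percentage of tokens removed, rounded to the nearest whole percent."""
    return round(100 * (1 - after / before))


# (before, after) token counts copied from the benchmark table.
workloads = {
    "Code search (100 results)": (17_765, 1_408),
    "SRE incident debugging": (65_694, 5_118),
    "GitHub issue triage": (54_174, 14_761),
    "Codebase exploration": (78_502, 41_254),
}

for name, (before, after) in workloads.items():
    print(f"{name}: {savings_pct(before, after)}%")
```
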
On accuracy benchmarks (GSM8K math, TruthfulQA, SQuAD v2, BFCL tool-use), the project reports that scores hold steady or improve slightly after compression. The suggested intuition is that stripping noise helps the model focus on the signal.
You can reproduce these yourself with:
python -m headroom.evals suite --tier 1
Setup: Three Ways to Use It
Headroom gives you three integration modes. Pick whichever fits how you work.
Option 1: Wrap an existing agent (zero code changes)
pip install "headroom-ai[all]"
headroom wrap claude
That's it. Headroom intercepts traffic from Claude Code, Codex, Cursor, Aider, or Copilot CLI automatically. You don't touch your existing code at all.
Option 2: Drop-in proxy
Run Headroom as a local proxy on any port:
headroom proxy --port 8787
Then point your existing OpenAI/Anthropic SDK calls at localhost:8787 instead of the provider URL. Any language, any framework — no code changes needed beyond updating the base URL.
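If you'd rather not edit call sites, the official OpenAI and Anthropic Python SDKs also read their base URL from an environment variable (`OPENAI_BASE_URL` and `ANTHROPIC_BASE_URL`), so the switch can be made once at startup. A minimal sketch, assuming the proxy is listening on port 8787 and mirrors the providers' URL paths; the helper name `point_sdks_at_proxy` is hypothetical:

```python
import os


def point_sdks_at_proxy(host: str = "localhost", port: int = 8787) -> dict[str, str]:
    """Set the base-URL environment variables that the official SDKs read."""
    base = f"http://{host}:{port}"
    overrides = {
        # Assumption: the proxy mirrors the provider paths, so the OpenAI
        # override keeps the /v1 suffix that its default base URL has.
        "OPENAI_BASE_URL": f"{base}/v1",
        "ANTHROPIC_BASE_URL": base,
    }
    os.environ.update(overrides)
    return overrides
```

Call it before constructing any client, since the SDKs read these variables when the client object is created.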
Option 3: Inline library
For finer control, use it directly in Python or TypeScript:
Python:
from headroom import compress
# your_giant_tool_output is a placeholder for the text your agent collected
messages = [{"role": "user", "content": your_giant_tool_output}]
compressed = compress(messages, model="claude-opus-4-6")
# compressed has the same structure, far fewer tokens
TypeScript:
import { compress } from "headroom-ai";
const compressed = await compress(messages, { model: "claude-opus-4-6" });
With the Anthropic SDK directly:
from anthropic import Anthropic
from headroom import withHeadroom
client = withHeadroom(Anthropic())
# Use client exactly like normal — compression happens automatically
With LangChain:
from headroom.integrations.langchain import HeadroomChatModel
llm = HeadroomChatModel(your_existing_llm)
With Vercel AI SDK:
import { wrapLanguageModel } from "ai";
import { headroomMiddleware } from "headroom-ai";
const model = wrapLanguageModel({
model: yourModel,
middleware: headroomMiddleware(),
});
Requires Python 3.10+. For Node/TypeScript: npm install headroom-ai.
MCP Server Mode
If you're using an MCP client (Claude Desktop, etc.), you can install Headroom as an MCP server:
headroom mcp install
This exposes three MCP tools: headroom_compress, headroom_retrieve, and headroom_stats. Your AI agent can call them directly as part of its tool loop.
Cross-Agent Memory
One underrated feature: shared memory across agents. If you're running Claude and Codex side by side, Headroom can give them a common compressed context store with automatic deduplication.
from headroom.memory import SharedContext
ctx = SharedContext()
ctx.put("current_task", task_description)
# In a different agent's session
task = ctx.get("current_task")
This is useful in multi-agent pipelines where you'd otherwise be passing the same context repeatedly.
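Automatic deduplication in a shared store usually comes down to content addressing: identical payloads hash to the same key and are stored once. The sketch below illustrates that pattern only. It is not Headroom's `SharedContext`; the class `DedupStore` and its methods are hypothetical.

```python
import hashlib


class DedupStore:
    """Name -> content store that keeps one copy of each distinct payload."""

    def __init__(self) -> None:
        self._blobs: dict[str, str] = {}  # content hash -> payload
        self._names: dict[str, str] = {}  # logical name -> content hash

    def put(self, name: str, payload: str) -> None:
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        # setdefault stores the payload only if this hash is new.
        self._blobs.setdefault(digest, payload)
        self._names[name] = digest

    def get(self, name: str) -> str:
        return self._blobs[self._names[name]]

    def unique_payloads(self) -> int:
        return len(self._blobs)
```

Two agents that write the same task description under different names end up sharing one stored copy, which is the saving the feature is after.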
headroom learn
There's also a headroom learn command that mines failed agent sessions and writes corrections back to CLAUDE.md, AGENTS.md, or GEMINI.md. The idea is that your agent accumulates a record of what went wrong and avoids repeating the same mistakes.
headroom learn
It parses session logs, extracts failure patterns, and appends structured learnings to your project's agent config files.
Check Your Savings
After using Headroom for a while:
headroom stats
This shows you cumulative compression ratios, tokens saved, and per-content-type breakdowns.
Is It Worth Trying?
Yes, if you:
- Run AI coding agents (Claude Code, Cursor, Codex, Aider) regularly and pay for tokens
- Build pipelines where tool outputs and RAG chunks are large and repetitive
- Want cross-agent shared memory without building it yourself
- Need reversible compression — Headroom never discards originals
Skip it, or approach carefully, if you:
- Only use a single provider's built-in context management and don't need more
- Work in sandboxed or restricted environments where running a local process is an issue
- Are on a very simple single-turn setup where context bloat isn't a real problem yet
Quick Reference
# Install
pip install "headroom-ai[all]"
npm install headroom-ai
# Wrap an agent
headroom wrap claude
# Run as proxy
headroom proxy --port 8787
# Install as MCP server
headroom mcp install
# Check savings
headroom stats
# Learn from failures
headroom learn
GitHub: chopratejas/headroom