ArshTechPro

Posted on Jun 4

Headroom: Cut Your LLM Token Usage by Up to 95% Without Changing Your Answers

#ai #agents #agentskills #programming

If you're building AI agents or running LLM pipelines in production, you already know the pain: tool outputs, logs, RAG chunks, and conversation history pile up fast. Before you know it, you're burning through tokens at a rate that makes your billing dashboard uncomfortable to look at.

Headroom is an open-source project that tackles this problem directly. It compresses everything your AI agent reads — before it ever reaches the LLM — and claims 60–95% token reduction on real workloads, with accuracy preserved.

The Core Idea

Headroom sits as a layer between your application and the LLM provider. It takes whatever your agent was about to send — a stack of tool call results, a long log file, a RAG retrieval dump — and compresses it using one of several strategies depending on the content type:

SmartCrusher handles JSON (arrays, nested objects, mixed types)
CodeCompressor uses AST-aware compression for Python, JS, Go, Rust, Java, and C++
Kompress-base is a HuggingFace model trained on agentic traces, for prose and text
CacheAligner stabilizes prompt prefixes so provider KV caches actually hit consistently
CCR (Content-Compressed Retrieval) stores originals locally and lets the LLM fetch them on demand — so compression is fully reversible
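
To see why prefix stability matters: provider prompt caches generally match on an exact prefix, so two requests that differ only in tool order or JSON key order can miss the cache. Here is a minimal sketch of the general technique, not Headroom's implementation. The `stable_prefix` helper is a made-up name for illustration.

```python
import json


def stable_prefix(system_prompt: str, tools: list[dict]) -> str:
    """Serialize the cacheable part of a prompt deterministically.

    Hypothetical helper, not part of Headroom: tools are sorted by name and
    JSON keys are sorted, so equivalent inputs yield identical bytes.
    """
    ordered_tools = sorted(tools, key=lambda tool: tool["name"])
    tools_json = json.dumps(ordered_tools, sort_keys=True, separators=(",", ":"))
    return system_prompt.strip() + "\n" + tools_json


tools_a = [
    {"name": "search", "description": "Search code"},
    {"name": "read", "description": "Read a file"},
]
# Same tools, but with a different list order and key order.
tools_b = [
    {"description": "Read a file", "name": "read"},
    {"description": "Search code", "name": "search"},
]

prefix_a = stable_prefix("You are a coding agent.", tools_a)
prefix_b = stable_prefix("You are a coding agent. ", tools_b)
print(prefix_a == prefix_b)
```

Anything volatile (timestamps, request IDs) would go after this prefix rather than inside it.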

A ContentRouter figures out what kind of content it's looking at and picks the right compressor automatically. You don't have to think about it.
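
As a rough illustration of what content routing could look like, here is a sketch that classifies text with simple heuristics and maps it to the compressor names listed above. The detection logic and the `detect_content_type` and `route` names are assumptions for illustration; the project's real router is likely more sophisticated.

```python
import ast
import json

# Compressor names come from the list above; the mapping itself is illustrative.
COMPRESSOR_FOR = {
    "json": "SmartCrusher",
    "python": "CodeCompressor",
    "prose": "Kompress-base",
}


def detect_content_type(text: str) -> str:
    """Classify text as 'json', 'python', or 'prose' (hypothetical heuristic)."""
    stripped = text.strip()
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except json.JSONDecodeError:
            pass
    try:
        tree = ast.parse(stripped)
    except SyntaxError:
        return "prose"
    # Require a real statement so a bare word like "hello" is not treated as code.
    code_nodes = (ast.FunctionDef, ast.ClassDef, ast.Import, ast.ImportFrom, ast.Assign)
    if any(isinstance(node, code_nodes) for node in tree.body):
        return "python"
    return "prose"


def route(text: str) -> str:
    """Return the name of the compressor that would handle this text."""
    return COMPRESSOR_FOR[detect_content_type(text)]


print(route('{"results": [1, 2, 3]}'))
print(route("def add(a, b):\n    return a + b\n"))
print(route("The server restarted at noon."))
```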

The key thing: originals are never deleted. If the LLM needs the full version of something, it can retrieve it. The compressed text itself drops detail, so compression is lossless only in the sense that nothing is unrecoverable.
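
The store-and-retrieve pattern behind this can be sketched in a few lines. This is a toy stand-in under stated assumptions: the `OriginalStore` class is hypothetical, it keeps originals in memory rather than on disk, and plain truncation stands in for real compression.

```python
import hashlib


class OriginalStore:
    """In-memory stand-in for a local store of uncompressed originals (hypothetical)."""

    def __init__(self) -> None:
        self._originals: dict[str, str] = {}

    def compress(self, text: str, keep_chars: int = 40) -> str:
        """Return a truncated preview plus a retrieval key, keeping the original."""
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self._originals[key] = text
        return f"{text[:keep_chars]}... [original: {key}]"

    def retrieve(self, key: str) -> str:
        """Return the full original text for a key issued by compress()."""
        return self._originals[key]


store = OriginalStore()
log = "ERROR timeout contacting db-01\n" * 50
summary = store.compress(log)

# The model sees only `summary`; a retrieval tool call would pass the key back.
key = summary.split("[original: ")[1].rstrip("]")
print(len(log), len(summary))
print(store.retrieve(key) == log)
```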

Real Numbers

These are the token counts from the project's benchmarks on real agent workloads. Note that the codebase exploration row, at 47%, falls below the headline 60–95% range:

Workload	Before	After	Savings
Code search (100 results)	17,765	1,408	92%
SRE incident debugging	65,694	5,118	92%
GitHub issue triage	54,174	14,761	73%
Codebase exploration	78,502	41,254	47%
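
The Savings column is just one minus the ratio of tokens after to tokens before. A quick check of the arithmetic, using the numbers from the table:

```python
# (before, after) token counts copied from the table above.
rows = {
    "Code search (100 results)": (17_765, 1_408),
    "SRE incident debugging": (65_694, 5_118),
    "GitHub issue triage": (54_174, 14_761),
    "Codebase exploration": (78_502, 41_254),
}


def savings_percent(before: int, after: int) -> int:
    """Percentage of tokens removed, rounded to the nearest whole number."""
    return round((1 - after / before) * 100)


for name, (before, after) in rows.items():
    print(f"{name}: {savings_percent(before, after)}%")
```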

On accuracy benchmarks (GSM8K math, TruthfulQA, SQuAD v2, BFCL tool-use), the project reports that scores hold steady or improve slightly after compression. The intuition is that stripping noise helps the model focus on the signal.

You can reproduce these yourself with:

python -m headroom.evals suite --tier 1

Setup: Three Ways to Use It

Headroom gives you three integration modes. Pick whichever fits how you work.

Option 1: Wrap an existing agent (zero code changes)

pip install "headroom-ai[all]"
headroom wrap claude

That's it. The wrapper intercepts the agent's traffic for you, and the same mode is advertised for Claude Code, Codex, Cursor, Aider, and Copilot CLI. You don't touch your existing code at all.

Option 2: Drop-in proxy

Run Headroom as a local proxy on any port:

headroom proxy --port 8787

Then point your existing OpenAI/Anthropic SDK calls at localhost:8787 instead of the provider URL. Any language, any framework — no code changes needed beyond updating the base URL.
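
With the official Python SDKs this usually means passing `base_url` when you construct the client. To show the idea without any SDK installed, here is a standard-library sketch that builds, but does not send, an OpenAI-style request aimed at the proxy. The `/v1/chat/completions` path assumes the proxy mirrors the provider's own routes, which this post doesn't confirm. The model name and API key are placeholders.

```python
import json
import urllib.request

PROXY_BASE_URL = "http://localhost:8787"  # port from the proxy command above


def build_chat_request(base_url: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {
        "model": "gpt-4o-mini",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",  # assumed path, mirroring OpenAI's API
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(PROXY_BASE_URL, "sk-placeholder", "Summarize this log.")
print(request.get_method(), request.full_url)
# To actually send it, call urllib.request.urlopen(request) with the proxy running.
```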

Option 3: Inline library

For finer control, use it directly in Python or TypeScript:

Python:

from headroom import compress

messages = [{"role": "user", "content": your_giant_tool_output}]
compressed = compress(messages, model="claude-opus-4-6")
# compressed has the same structure, far fewer tokens

TypeScript:

import { compress } from "headroom-ai";

const compressed = await compress(messages, { model: "claude-opus-4-6" });

With the Anthropic SDK directly:

from anthropic import Anthropic
from headroom import withHeadroom

client = withHeadroom(Anthropic())
# Use client exactly like normal — compression happens automatically

With LangChain:

from headroom.integrations.langchain import HeadroomChatModel

llm = HeadroomChatModel(your_existing_llm)

With Vercel AI SDK:

import { wrapLanguageModel } from "ai";
import { headroomMiddleware } from "headroom-ai";

const model = wrapLanguageModel({
  model: yourModel,
  middleware: headroomMiddleware(),
});

Requires Python 3.10+. For Node/TypeScript: npm install headroom-ai.

MCP Server Mode

If you're using an MCP client (Claude Desktop, etc.), you can install Headroom as an MCP server:

headroom mcp install

This exposes three MCP tools: headroom_compress, headroom_retrieve, and headroom_stats. Your AI agent can call them directly as part of its tool loop.
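
Under the hood, an MCP client invokes a tool by sending a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a message for `headroom_compress`. The `content` argument name is a guess for illustration; the tool's real input schema is whatever the server advertises.

```python
import json


def mcp_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# The "content" argument name is hypothetical; check the tool's published schema.
payload = mcp_tool_call(1, "headroom_compress", {"content": "...long tool output..."})
print(payload)
```

In practice your MCP client builds these messages for you; the point is only that the three tools are ordinary tool calls from the agent's perspective.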

Cross-Agent Memory

One underrated feature: shared memory across agents. If you're running Claude and Codex side by side, Headroom can give them a common compressed context store with automatic deduplication.

from headroom.memory import SharedContext

ctx = SharedContext()
ctx.put("current_task", task_description)

# In a different agent's session
task = ctx.get("current_task")

This is useful in multi-agent pipelines where you'd otherwise be passing the same context repeatedly.
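
To make the deduplication idea concrete, here is a toy content-addressed store: keys point at content hashes, so identical values are kept once no matter how many agents write them. The `DedupContext` class is hypothetical and lives in one process. A real cross-agent store would need a shared backend such as a file or database.

```python
import hashlib


class DedupContext:
    """Toy shared store: keys point at content hashes, blobs are stored once."""

    def __init__(self) -> None:
        self._keys: dict[str, str] = {}
        self._blobs: dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
        self._blobs.setdefault(digest, value)  # store the content only if new
        self._keys[key] = digest

    def get(self, key: str) -> str:
        return self._blobs[self._keys[key]]

    def blob_count(self) -> int:
        return len(self._blobs)


ctx = DedupContext()
ctx.put("claude:current_task", "Fix the flaky login test")
ctx.put("codex:current_task", "Fix the flaky login test")
print(ctx.blob_count())  # two keys, one stored blob
```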

headroom learn

There's also a headroom learn command that mines failed agent sessions and writes corrections back to CLAUDE.md, AGENTS.md, or GEMINI.md. The idea is that your agent accumulates a record of what went wrong and avoids repeating the same mistakes.

headroom learn

It parses session logs, extracts failure patterns, and appends structured learnings to your project's agent config files.
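
A stripped-down version of that idea: count error messages that recur across a session and turn them into Markdown bullets that could be appended to an agent config file. The log format and the `extract_learnings` function are invented for illustration and say nothing about how `headroom learn` actually parses sessions.

```python
from collections import Counter


def extract_learnings(log_lines: list[str], min_count: int = 2) -> str:
    """Turn repeated ERROR lines into a Markdown list (hypothetical log format)."""
    errors = Counter(
        line.split("ERROR:", 1)[1].strip()
        for line in log_lines
        if "ERROR:" in line
    )
    repeated = [(msg, n) for msg, n in errors.most_common() if n >= min_count]
    return "\n".join(f"- Seen {n} times: {msg}" for msg, n in repeated)


session = [
    "run 1 ERROR: pytest not found, use `uv run pytest`",
    "run 2 ok",
    "run 3 ERROR: pytest not found, use `uv run pytest`",
    "run 4 ERROR: network unreachable",
]
print(extract_learnings(session))
```

One-off errors are dropped by the `min_count` threshold, so only patterns that repeat make it into the file.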

Check Your Savings

After using Headroom for a while:

headroom stats

This shows you cumulative compression ratios, tokens saved, and per-content-type breakdowns.

Is It Worth Trying?

Yes, if you:

Run AI coding agents (Claude Code, Cursor, Codex, Aider) regularly and pay for tokens
Build pipelines where tool outputs and RAG chunks are large and repetitive
Want cross-agent shared memory without building it yourself
Need reversible compression — Headroom never discards originals

Skip it, or approach carefully, if you:

Only use a single provider's built-in context management and don't need more
Work in sandboxed or restricted environments where running a local process is an issue
Are on a very simple single-turn setup where context bloat isn't a real problem yet

Quick Reference

# Install
pip install "headroom-ai[all]"
npm install headroom-ai

# Wrap an agent
headroom wrap claude

# Run as proxy
headroom proxy --port 8787

# Install as MCP server
headroom mcp install

# Check savings
headroom stats

# Learn from failures
headroom learn

GitHub: chopratejas/headroom

Top comments (4)

ArshTechPro • Jun 7

If you're running Claude and Codex side by side, Headroom can give them a common compressed context store with automatic deduplication.

François Kiene • Jun 21

Good writeup, and the reversible-retrieval (CCR) design is the part I wish more tools copied. Keeping the original and letting the model fetch it on demand is the honest way to call something lossless, and the cache-prefix stabilizing is underrated too.

Disclosure: I build a competing tool (llmtrim), so read this as a competitor's note rather than a neutral one. One calibration for readers, not really a criticism: the 60-95% is very content-type-dependent. The 92% rows are structured content (SmartCrusher on JSON, CodeCompressor on code and logs). On plain prose the gains are much smaller. So whether you land near the top or the bottom of that range comes down to whether your traffic is mostly structured tool output or mostly prose. Worth measuring on your own mix before you anchor on 95%.

Shivsantosh Singh • Jun 15

How to use with GitHub Copilot?