Introduction
"Running out of context isn't always about a small window — it's usually about a window full of noise."
This is article #86 in the Open Source Project of the Day series. Today's project is headroom — a context compression layer purpose-built for AI agents.
As agents grow more capable, their context windows fill up faster: tool responses return thousands of lines of JSON, RAG retrieval produces heavily redundant documents, search results include irrelevant content, and logs are full of noise. The result: token consumption skyrockets, costs spiral, and agents occasionally hit their context limit mid-task.
headroom's answer is straightforward: compress content before it reaches the LLM. Not truncation (permanent loss), not summarization (possible distortion) — but semantically-aware compression that retains the signal and discards the noise, with the original always retrievable on demand.
What You'll Learn
- Why agent context "inflates" and what it costs
- headroom's four integration modes: Library, Proxy, Agent Wrap, MCP Server
- How the three compression engines work: SmartCrusher, CodeCompressor, Kompress-base
- CCR (Compressed Context Retrieval): reversible compression that keeps originals accessible
- Real benchmark data: compression ratios and accuracy retention across workload types
Prerequisites
- Basic Python experience
- Familiarity with LLM API usage (knowing tokens are the billing unit is enough)
- Experience with at least one AI Agent framework (optional)
Project Background
What Is headroom?
headroom positions itself as "the context compression layer for AI agents" — sitting between your application and the LLM. Its core insight is that tool outputs, logs, code snippets, and RAG retrieval results are packed with content irrelevant to the current question. Removing that irrelevant content doesn't hurt the LLM; it often helps, because the model is no longer distracted.
Unlike naive truncation or LLM-based summarization, headroom's compression is reversible: originals are stored locally, and the LLM can retrieve them on demand via the headroom_retrieve tool. Nothing is truly discarded.
Author / Team
- Author: Tejas Chopra (chopratejas)
- Language breakdown: Python 76.9% + Rust 18.3% + TypeScript 2.7%
- License: Apache 2.0
Project Stats
- ⭐ GitHub Stars: 12,800+
- 🍴 Forks: 823
- 📦 Latest version: v0.23.0
- 📄 License: Apache 2.0
- 🌐 Requirements: Python 3.10+, Node.js, or Docker
Core Features
What It Does
headroom intercepts agent tool outputs before they enter the LLM context, strips noise, and reduces token counts by 60–95%. It operates as a transparent compression middleware — it doesn't change your agent's logic, only the density of the content flowing through it.
Use Cases
-
Code search and analysis
- 100 search results at 17,765 tokens compress to 1,408 (92% reduction). The agent sees only the code snippets relevant to its current query.
-
SRE incident debugging
- System logs, stack traces, and metrics mixed in one context: 65,694 tokens → 5,118 (92% reduction). Critical anomalies become more visible, not less.
-
GitHub issue triage
- Processing issues in bulk: 54,174 → 14,761 tokens (73% reduction), with no drop in classification accuracy.
-
RAG-augmented generation
- Retrieval chunks de-duplicated and filtered before reaching the LLM. The model answers from relevant content instead of wading through noise.
-
Multi-agent coordination
- Compressed, deduplicated memory shared across agents. The same information doesn't get re-consumed by every agent in the pipeline.
Quick Start
# Full install (all extensions)
pip install "headroom-ai[all]"
# Or install only what you need
pip install "headroom-ai[proxy,mcp]"
Mode 1: Library (inline in your code)
from headroom import Headroom
hr = Headroom()
# Compress messages before sending to the LLM
compressed = hr.compress(messages)
response = client.messages.create(
model="claude-opus-4-5",
messages=compressed.messages,
)
print(f"Compression ratio: {compressed.compression_ratio:.1%}")
print(f"Tokens saved: {compressed.tokens_saved}")
Mode 2: Proxy (zero code changes)
headroom proxy --port 8787
import anthropic
# One change: point base_url at the headroom proxy
client = anthropic.Anthropic(base_url="http://localhost:8787")
# Everything else stays identical
response = client.messages.create(...)
Mode 3: Agent Wrap (one command, existing agent)
headroom wrap claude # Wrap Claude Code
headroom wrap aider # Wrap Aider
headroom wrap cursor # Wrap Cursor
headroom wrap codex # Wrap Codex CLI
Check compression stats
headroom perf
# Today's savings: 48,392 tokens | Cumulative savings: $12.40
Key Properties
-
Four integration modes
- Library / Proxy / Agent Wrap / MCP Server — pick the one that fits your architecture without touching the rest
-
CCR — Compressed Context Retrieval
- Originals are indexed locally. The LLM can call
headroom_retrieveto get any compressed content back on demand. Compression ≠ deletion.
- Originals are indexed locally. The LLM can call
-
Cross-agent shared memory
- Multiple agents read from and write to the same memory store, with automatic deduplication
-
headroom learn— automatic learning from failures- Analyzes failed agent sessions and writes derived rules directly into
CLAUDE.md/AGENTS.md
- Analyzes failed agent sessions and writes derived rules directly into
-
All content types covered
- JSON (SmartCrusher), code (AST-level CodeCompressor), prose (Kompress-base), images
-
Local-first, privacy-safe
- All compression runs locally. No data leaves your machine.
Comparison with Alternatives
| Dimension | headroom | Simple truncation | LLM summarization | Manual filtering |
|---|---|---|---|---|
| Information retention | ✅ Reversible, originals kept | ❌ Permanent loss | ⚠️ Possible distortion | ⚠️ Rule-dependent |
| Integration effort | ✅ 1 line / 0 lines | ✅ Trivial | ❌ Extra LLM call | ❌ High maintenance |
| Compression quality | ✅ Structure-aware | ❌ Blind | ⚠️ Generic | ⚠️ Brittle |
| Cost | ✅ Saves API spend | ✅ No extra cost | ❌ Extra tokens | ✅ No extra cost |
Deep Dive
Compression Pipeline Architecture
headroom's internal processing has three stages:
User input / Tool output
↓
CacheAligner ← Skip re-processing of already-compressed content
↓
ContentRouter ← Detect content type, route to the right engine
├── SmartCrusher (JSON / structured data)
├── CodeCompressor (source code, AST-level)
└── Kompress-base (prose text, HuggingFace model)
↓
Compressed content → LLM
↓
Original stored locally (CCR index) ← Retrievable on demand
CacheAligner: Detects content that has already been processed this session and skips it, avoiding redundant computation.
ContentRouter: Classifies the incoming content — JSON structure, code syntax, or plain text — and routes it to the most effective engine. Content-type-aware compression outperforms a single general-purpose approach by a wide margin.
Three Compression Engines
① SmartCrusher (JSON / structured data)
Tool call responses are commonly JSON with dozens of fields, most irrelevant to the current task. SmartCrusher analyzes the preceding LLM query and extracts only the fields that matter:
# Raw tool response: ~1,200 tokens
{
"results": [
{
"id": "abc123",
"title": "...",
"content": "...", # Relevant to the query
"metadata": { # Irrelevant fields
"created_at": "...",
"updated_at": "...",
"author_id": 42,
"tags": ["...", "..."],
"internal_score": 0.87,
# ... 20+ more irrelevant fields
}
}
# × 99 more results
]
}
# After SmartCrusher: ~80 tokens (93% reduction)
# Only title + content retained, all metadata dropped
② CodeCompressor (source code, AST-level)
Code can't be compressed by truncation — cutting mid-function breaks syntax and confuses the LLM. CodeCompressor parses the AST, keeps function signatures, class definitions, and key docstrings, and compresses away implementation bodies:
# Original code: ~800 tokens
def process_payment(
user_id: int,
amount: float,
currency: str = "USD",
retry_count: int = 3
) -> PaymentResult:
"""Process a payment request with retry logic."""
for attempt in range(retry_count):
try:
balance = get_user_balance(user_id)
if balance < amount:
raise InsufficientFunds(...)
# ... 200 lines of implementation ...
except NetworkError:
if attempt == retry_count - 1:
raise
time.sleep(2 ** attempt)
# After CodeCompressor: ~60 tokens
def process_payment(user_id: int, amount: float,
currency: str = "USD", retry_count: int = 3) -> PaymentResult:
"""Process a payment request with retry logic."""
... # [Body compressed; retrieve via headroom_retrieve]
③ Kompress-base (prose text)
For documents, logs, and free-form text, headroom uses the Kompress-base model (HuggingFace) to perform semantic compression: sentences most relevant to the current query are kept, redundant and off-topic sentences are dropped.
CCR: Why Reversible Compression Matters
Standard truncation is one-way. headroom's CCR (Compressed Context Retrieval) makes compression reversible:
Original ────── compress ──────→ LLM
│ │
└── stored in local CCR index │
(indexed by trace_id) │
│ When more detail is needed:
↓
headroom_retrieve("trace_id", "retry logic")
│
↓
Returns the relevant original snippet
In MCP Server mode, the LLM drives retrieval itself:
# When the LLM decides it needs the full implementation:
headroom_retrieve(
trace_id="abc123",
query="process_payment retry logic implementation"
)
# → Returns the original code section on demand
MCP Server Mode
In MCP mode, headroom exposes three tools to the LLM:
headroom_compress(content, content_type="auto")
# → Compress content; returns compressed text + trace_id
headroom_retrieve(trace_id, query)
# → Retrieve an original content fragment by semantic query
headroom_stats()
# → Return compression statistics for the current session
Setup (claude_desktop_config.json):
{
"mcpServers": {
"headroom": {
"command": "headroom",
"args": ["mcp"]
}
}
}
headroom learn: Automatically Learning from Failures
When an agent session fails — the task wasn't completed, the LLM retried multiple times, or the context overflowed — headroom learn analyzes the session log and writes derived rules into CLAUDE.md or AGENTS.md:
headroom learn --session-log ./logs/session_2026-06-05.jsonl
# Example output:
# Found 3 recurring patterns:
# 1. GitHub API calls consistently carry excessive metadata fields
# → Added SmartCrusher filter rule to CLAUDE.md
# 2. Context overflow occurred 3× during large JSON processing
# → Added compression hint to AGENTS.md
# 3. CodeCompressor improved success rate by 40% on code analysis tasks
# → Configuration recorded
Benchmark Results
Real workload benchmarks (not synthetic):
| Workload | Before | After | Saved |
|---|---|---|---|
| Code search (100 results) | 17,765 | 1,408 | 92% |
| SRE incident debugging | 65,694 | 5,118 | 92% |
| GitHub issue triage | 54,174 | 14,761 | 73% |
| Codebase exploration | 78,502 | 41,254 | 47% |
Accuracy retention (the critical question: does compression hurt answer quality?):
| Benchmark | Compression | Accuracy change |
|---|---|---|
| GSM8K (math reasoning) | — | delta = ±0.000 (zero change) |
| SQuAD v2 (reading comprehension) | 19% | 97% retained |
The key finding: for agent workloads, removing irrelevant content doesn't degrade answer quality — it often improves it. The LLM is no longer distracted by noise.
Links and Resources
Official Resources
- 🌟 GitHub: chopratejas/headroom
- 📦 PyPI:
pip install headroom-ai - 🐛 Issue Tracker: github.com/chopratejas/headroom/issues
Related Resources
Conclusion
headroom addresses a problem that's been chronically underestimated: noise in the context window. We spend hours tuning prompt wording, but rarely think about the bloat coming from tool responses. One code search returning 17,765 tokens, with 16,357 of them noise — that's happening dozens of times per minute in a complex agent.
Four integration modes (Library / Proxy / Wrap / MCP) mean headroom can fit into almost any existing agent architecture without surgery. CCR reversible compression means compression isn't deletion. And headroom learn means agents automatically get smarter from their own failures.
If your agent is facing cost overruns or context overflow issues, headroom is the first thing worth trying.
Check out PrimeSkills — a curated marketplace of AI agents and skills that have been validated in real-world, enterprise-grade workflows. No fluff, just what actually works.
Find more useful knowledge and interesting products on my Homepage
Top comments (0)