WonderLab

Posted on Jun 5

Open Source Project of the Day (#86): headroom - A Context Compression Layer for AI Agents, Up to 95% Token Savings

#ai #opensource #agents #llm

Introduction

"Running out of context isn't always about a small window — it's usually about a window full of noise."

This is article #86 in the Open Source Project of the Day series. Today's project is headroom — a context compression layer purpose-built for AI agents.

As agents grow more capable, their context windows fill up faster: tool responses return thousands of lines of JSON, RAG retrieval produces heavily redundant documents, search results include irrelevant content, and logs are full of noise. The result: token consumption skyrockets, costs spiral, and agents occasionally hit their context limit mid-task.

headroom's answer is straightforward: compress content before it reaches the LLM. Not truncation (permanent loss), not summarization (possible distortion) — but semantically-aware compression that retains the signal and discards the noise, with the original always retrievable on demand.

What You'll Learn

Why agent context "inflates" and what it costs
headroom's four integration modes: Library, Proxy, Agent Wrap, MCP Server
How the three compression engines work: SmartCrusher, CodeCompressor, Kompress-base
CCR (Compressed Context Retrieval): reversible compression that keeps originals accessible
Real benchmark data: compression ratios and accuracy retention across workload types

Prerequisites

Basic Python experience
Familiarity with LLM API usage (knowing tokens are the billing unit is enough)
Experience with at least one AI Agent framework (optional)

Project Background

What Is headroom?

headroom positions itself as "the context compression layer for AI agents" — sitting between your application and the LLM. Its core insight is that tool outputs, logs, code snippets, and RAG retrieval results are packed with content irrelevant to the current question. Removing that irrelevant content doesn't hurt the LLM; it often helps, because the model is no longer distracted.

Unlike naive truncation or LLM-based summarization, headroom's compression is reversible: originals are stored locally, and the LLM can retrieve them on demand via the headroom_retrieve tool. Nothing is truly discarded.

Author / Team

Author: Tejas Chopra (chopratejas)
Language breakdown: Python 76.9% + Rust 18.3% + TypeScript 2.7%
License: Apache 2.0

Project Stats

⭐ GitHub Stars: 12,800+
🍴 Forks: 823
📦 Latest version: v0.23.0
📄 License: Apache 2.0
🌐 Requirements: Python 3.10+, Node.js, or Docker

Core Features

What It Does

headroom intercepts agent tool outputs before they enter the LLM context, strips noise, and reduces token counts by 60–95%. It operates as a transparent compression middleware — it doesn't change your agent's logic, only the density of the content flowing through it.

Use Cases

Code search and analysis
- 100 search results at 17,765 tokens compress to 1,408 (92% reduction). The agent sees only the code snippets relevant to its current query.
SRE incident debugging
- System logs, stack traces, and metrics mixed in one context: 65,694 tokens → 5,118 (92% reduction). Critical anomalies become more visible, not less.
GitHub issue triage
- Processing issues in bulk: 54,174 → 14,761 tokens (73% reduction), with no drop in classification accuracy.
RAG-augmented generation
- Retrieval chunks de-duplicated and filtered before reaching the LLM. The model answers from relevant content instead of wading through noise.
Multi-agent coordination
- Compressed, deduplicated memory shared across agents. The same information doesn't get re-consumed by every agent in the pipeline.

Quick Start

# Full install (all extensions)
pip install "headroom-ai[all]"

# Or install only what you need
pip install "headroom-ai[proxy,mcp]"

Mode 1: Library (inline in your code)

from headroom import Headroom

hr = Headroom()

# Compress messages before sending to the LLM
compressed = hr.compress(messages)

response = client.messages.create(
    model="claude-opus-4-5",
    messages=compressed.messages,
)

print(f"Compression ratio: {compressed.compression_ratio:.1%}")
print(f"Tokens saved: {compressed.tokens_saved}")

Mode 2: Proxy (zero code changes)

headroom proxy --port 8787

import anthropic

# One change: point base_url at the headroom proxy
client = anthropic.Anthropic(base_url="http://localhost:8787")

# Everything else stays identical
response = client.messages.create(...)

Mode 3: Agent Wrap (one command, existing agent)

headroom wrap claude    # Wrap Claude Code
headroom wrap aider     # Wrap Aider
headroom wrap cursor    # Wrap Cursor
headroom wrap codex     # Wrap Codex CLI

Check compression stats

headroom perf
# Today's savings: 48,392 tokens | Cumulative savings: $12.40

Key Properties

Four integration modes
- Library / Proxy / Agent Wrap / MCP Server — pick the one that fits your architecture without touching the rest
CCR — Compressed Context Retrieval
- Originals are indexed locally. The LLM can call headroom_retrieve to get any compressed content back on demand. Compression ≠ deletion.
Cross-agent shared memory
- Multiple agents read from and write to the same memory store, with automatic deduplication
headroom learn — automatic learning from failures
- Analyzes failed agent sessions and writes derived rules directly into CLAUDE.md / AGENTS.md
All content types covered
- JSON (SmartCrusher), code (AST-level CodeCompressor), prose (Kompress-base), images
Local-first, privacy-safe
- All compression runs locally. No data leaves your machine.

Comparison with Alternatives

Dimension	headroom	Simple truncation	LLM summarization	Manual filtering
Information retention	✅ Reversible, originals kept	❌ Permanent loss	⚠️ Possible distortion	⚠️ Rule-dependent
Integration effort	✅ 1 line / 0 lines	✅ Trivial	❌ Extra LLM call	❌ High maintenance
Compression quality	✅ Structure-aware	❌ Blind	⚠️ Generic	⚠️ Brittle
Cost	✅ Saves API spend	✅ No extra cost	❌ Extra tokens	✅ No extra cost

Deep Dive

Compression Pipeline Architecture

headroom's internal processing has three stages:

User input / Tool output
         ↓
    CacheAligner        ← Skip re-processing of already-compressed content
         ↓
    ContentRouter       ← Detect content type, route to the right engine
      ├── SmartCrusher         (JSON / structured data)
      ├── CodeCompressor       (source code, AST-level)
      └── Kompress-base        (prose text, HuggingFace model)
         ↓
    Compressed content → LLM
         ↓
    Original stored locally (CCR index) ← Retrievable on demand

CacheAligner: Detects content that has already been processed this session and skips it, avoiding redundant computation.

ContentRouter: Classifies the incoming content — JSON structure, code syntax, or plain text — and routes it to the most effective engine. Content-type-aware compression outperforms a single general-purpose approach by a wide margin.

Three Compression Engines

① SmartCrusher (JSON / structured data)

Tool call responses are commonly JSON with dozens of fields, most irrelevant to the current task. SmartCrusher analyzes the preceding LLM query and extracts only the fields that matter:

# Raw tool response: ~1,200 tokens
{
  "results": [
    {
      "id": "abc123",
      "title": "...",
      "content": "...",          # Relevant to the query
      "metadata": {              # Irrelevant fields
        "created_at": "...",
        "updated_at": "...",
        "author_id": 42,
        "tags": ["...", "..."],
        "internal_score": 0.87,
        # ... 20+ more irrelevant fields
      }
    }
    # × 99 more results
  ]
}

# After SmartCrusher: ~80 tokens (93% reduction)
# Only title + content retained, all metadata dropped

② CodeCompressor (source code, AST-level)

Code can't be compressed by truncation — cutting mid-function breaks syntax and confuses the LLM. CodeCompressor parses the AST, keeps function signatures, class definitions, and key docstrings, and compresses away implementation bodies:

# Original code: ~800 tokens
def process_payment(
    user_id: int,
    amount: float,
    currency: str = "USD",
    retry_count: int = 3
) -> PaymentResult:
    """Process a payment request with retry logic."""
    for attempt in range(retry_count):
        try:
            balance = get_user_balance(user_id)
            if balance < amount:
                raise InsufficientFunds(...)
            # ... 200 lines of implementation ...
        except NetworkError:
            if attempt == retry_count - 1:
                raise
            time.sleep(2 ** attempt)

# After CodeCompressor: ~60 tokens
def process_payment(user_id: int, amount: float,
                    currency: str = "USD", retry_count: int = 3) -> PaymentResult:
    """Process a payment request with retry logic."""
    ...  # [Body compressed; retrieve via headroom_retrieve]

③ Kompress-base (prose text)

For documents, logs, and free-form text, headroom uses the Kompress-base model (HuggingFace) to perform semantic compression: sentences most relevant to the current query are kept, redundant and off-topic sentences are dropped.

CCR: Why Reversible Compression Matters

Standard truncation is one-way. headroom's CCR (Compressed Context Retrieval) makes compression reversible:

Original ────── compress ──────→ LLM
   │                                │
   └── stored in local CCR index    │
        (indexed by trace_id)       │
                                    │  When more detail is needed:
                                    ↓
                        headroom_retrieve("trace_id", "retry logic")
                                    │
                                    ↓
                        Returns the relevant original snippet

In MCP Server mode, the LLM drives retrieval itself:

# When the LLM decides it needs the full implementation:
headroom_retrieve(
    trace_id="abc123",
    query="process_payment retry logic implementation"
)
# → Returns the original code section on demand

MCP Server Mode

In MCP mode, headroom exposes three tools to the LLM:

headroom_compress(content, content_type="auto")
# → Compress content; returns compressed text + trace_id

headroom_retrieve(trace_id, query)
# → Retrieve an original content fragment by semantic query

headroom_stats()
# → Return compression statistics for the current session

Setup (claude_desktop_config.json):

{
  "mcpServers": {
    "headroom": {
      "command": "headroom",
      "args": ["mcp"]
    }
  }
}

`headroom learn`: Automatically Learning from Failures

When an agent session fails — the task wasn't completed, the LLM retried multiple times, or the context overflowed — headroom learn analyzes the session log and writes derived rules into CLAUDE.md or AGENTS.md:

headroom learn --session-log ./logs/session_2026-06-05.jsonl

# Example output:
# Found 3 recurring patterns:
#   1. GitHub API calls consistently carry excessive metadata fields
#      → Added SmartCrusher filter rule to CLAUDE.md
#   2. Context overflow occurred 3× during large JSON processing
#      → Added compression hint to AGENTS.md
#   3. CodeCompressor improved success rate by 40% on code analysis tasks
#      → Configuration recorded

Benchmark Results

Real workload benchmarks (not synthetic):

Workload	Before	After	Saved
Code search (100 results)	17,765	1,408	92%
SRE incident debugging	65,694	5,118	92%
GitHub issue triage	54,174	14,761	73%
Codebase exploration	78,502	41,254	47%

Accuracy retention (the critical question: does compression hurt answer quality?):

Benchmark	Compression	Accuracy change
GSM8K (math reasoning)	—	delta = ±0.000 (zero change)
SQuAD v2 (reading comprehension)	19%	97% retained

The key finding: for agent workloads, removing irrelevant content doesn't degrade answer quality — it often improves it. The LLM is no longer distracted by noise.

Links and Resources

Official Resources

🌟 GitHub: chopratejas/headroom
📦 PyPI: pip install headroom-ai
🐛 Issue Tracker: github.com/chopratejas/headroom/issues

Related Resources

Conclusion

headroom addresses a problem that's been chronically underestimated: noise in the context window. We spend hours tuning prompt wording, but rarely think about the bloat coming from tool responses. One code search returning 17,765 tokens, with 16,357 of them noise — that's happening dozens of times per minute in a complex agent.

Four integration modes (Library / Proxy / Wrap / MCP) mean headroom can fit into almost any existing agent architecture without surgery. CCR reversible compression means compression isn't deletion. And headroom learn means agents automatically get smarter from their own failures.

If your agent is facing cost overruns or context overflow issues, headroom is the first thing worth trying.

Check out PrimeSkills — a curated marketplace of AI agents and skills that have been validated in real-world, enterprise-grade workflows. No fluff, just what actually works.

Find more useful knowledge and interesting products on my Homepage

DEV Community

Open Source Project of the Day (#86): headroom - A Context Compression Layer for AI Agents, Up to 95% Token Savings

Introduction

What You'll Learn

Prerequisites

Project Background

What Is headroom?

Author / Team

Project Stats

Core Features

What It Does

Use Cases

Quick Start

Key Properties

Comparison with Alternatives

Deep Dive

Compression Pipeline Architecture

Three Compression Engines

CCR: Why Reversible Compression Matters

MCP Server Mode

`headroom learn`: Automatically Learning from Failures

Benchmark Results

Links and Resources

Official Resources

Related Resources

Conclusion

Top comments (0)

Introduction

What You'll Learn

Prerequisites

Project Background

What Is headroom?

Author / Team

Project Stats

Core Features

What It Does

Use Cases

Quick Start

Key Properties

Comparison with Alternatives

Deep Dive

Compression Pipeline Architecture

Three Compression Engines

CCR: Why Reversible Compression Matters

MCP Server Mode

headroom learn: Automatically Learning from Failures

Benchmark Results

Links and Resources

Official Resources

Related Resources

Conclusion

`headroom learn`: Automatically Learning from Failures