I recently built an agent to handle some SRE tasks—fetching logs, querying databases, searching code. It worked, but when I looked at the traces, I was annoyed.
It wasn't just that it was expensive (though the bill was climbing). It was the sheer inefficiency.
I looked at a single tool output—a search for Python files. It was 40,000 tokens.
About 35,000 of those tokens were just "type": "file" and "language": "python" repeated 2,000 times.
We are paying premium compute prices to force state-of-the-art models to read standard JSON boilerplate.
I couldn't find a tool that solved this without breaking the agent, so I wrote one. It's called Headroom. It's a context optimization layer that sits between your app and your LLM. It compresses context by ~85% without losing semantic meaning.
It's open source (Apache-2.0). If you just want the code:
github.com/chopratejas/headroom
Why Truncation and Summarization Don't Work
When your context window fills up, the standard industry solution is truncation (chopping off the oldest messages or the middle of the document).
But for an agent, truncation is dangerous.
- If you chop the middle of a log file, you might lose the one error line that explains the crash.
- If you chop a file list, you might lose the exact config file the user asked for.
I tried summarization (using a cheaper model to summarize the data first), but that introduced hallucination. I had a summarizer tell me a deployment "looked fine" because it ignored specific error codes in the raw log.
I needed a third option: Lossless compression. Or at least, "intent-lossless."
The Core Idea: Statistical Analysis, Not Blind Truncation
I realized that 90% of the data in a tool output is just schema scaffolding. The LLM doesn't need to see status: active repeated a thousand times. It needs the anomalies.
Headroom's SmartCrusher runs statistical analysis before touching your data:
1. Constant Factoring
If every item in an array has "type": "file", it doesn't repeat that 2,000 times. It extracts constants once.
2. Outlier Detection
It calculates standard deviation of numerical fields. It preserves the spikes—the values that are >2σ from the mean. Those are usually what matters.
3. Error Preservation
Hard rule: never discard strings that look like stack traces, error messages, or failures. Errors are sacred.
4. Relevance Scoring
If you searched for "auth", items containing "auth" get preserved. Uses BM25 + semantic embeddings (hybrid scoring) to match items against the user's query context.
5. First/Last Retention
Always keeps first few and last few items. The LLM expects to see some examples, and recency matters.
The result: 40,000 tokens → 4,000 tokens, with the signal intact. And because no summarizer model is in the loop, there's no hallucination risk.
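To make that concrete, here's a minimal, illustrative sketch of the idea. It is not Headroom's actual SmartCrusher (which also does relevance scoring and smarter selection); the function and field names are just for illustration.

```python
from statistics import mean, stdev

def crush(items: list[dict], keep_edges: int = 3) -> dict:
    """Toy version of constant factoring + outlier/error/edge retention."""
    # 1. Constant factoring: fields whose value never changes get stated once.
    constants = {
        key: items[0][key]
        for key in items[0]
        if all(item.get(key) == items[0][key] for item in items)
    }

    # 2. Outlier detection: keep rows with numeric values > 2 sigma from the mean.
    numeric_keys = [k for k, v in items[0].items() if isinstance(v, (int, float))]
    outliers = []
    for key in numeric_keys:
        values = [item.get(key, 0) for item in items]
        mu, sigma = mean(values), stdev(values)
        outliers += [it for it in items if sigma and abs(it.get(key, 0) - mu) > 2 * sigma]

    # 3. Error preservation: never drop anything that looks like a failure.
    errors = [it for it in items if any("error" in str(v).lower() for v in it.values())]

    # 4. First/last retention, deduplicated while preserving order.
    seen, kept = set(), []
    for it in items[:keep_edges] + outliers + errors + items[-keep_edges:]:
        if id(it) not in seen:
            seen.add(id(it))
            kept.append(it)

    return {"constants": constants, "items": kept}
```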
CCR: Making Compression Reversible
Here's the insight that changed everything: compression should be reversible.
I call the architecture CCR (Compress-Cache-Retrieve):
1. Compress
SmartCrusher compresses the tool output from 2,000 items to 20.
2. Cache
The original 2,000 items are cached locally (5-minute TTL, LRU eviction).
3. Retrieve
Headroom injects a tool called headroom_retrieve() into the LLM's context. If the model looks at the compressed summary and decides it needs more data—maybe the user asked a follow-up question—it can call that tool. Headroom fetches from the cache and returns the relevant items.
This changes the risk calculus. You can compress aggressively (90%+) because nothing is ever truly lost. The model can always "unzip" what it needs.
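Conceptually, the loop looks something like this. It's a sketch under the assumptions above (5-minute TTL, a compressed view of the top items), not Headroom's internals; the real retrieval path runs through the injected headroom_retrieve() tool, and the real cache also does LRU eviction.

```python
import time

# call_id -> (timestamp, original items)
_CACHE: dict[str, tuple[float, list[dict]]] = {}
TTL_SECONDS = 300  # the 5-minute TTL described above

def compress_and_cache(call_id: str, items: list[dict], keep: int = 20) -> list[dict]:
    _CACHE[call_id] = (time.monotonic(), items)  # stash the full tool output
    return items[:keep]                          # hand the model a compressed view

def retrieve(call_id: str, query: str) -> list[dict]:
    entry = _CACHE.get(call_id)
    if entry is None or time.monotonic() - entry[0] > TTL_SECONDS:
        return []                                # expired or never cached
    query = query.lower()
    return [item for item in entry[1] if query in str(item).lower()]
```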
I've had conversations like this:
Turn 1: "Search for all Python files"
→ 1000 files returned, compressed to 15
Turn 5: "Actually, what was that file handling JWT tokens?"
→ LLM calls headroom_retrieve("jwt")
→ Returns jwt_handler.py from cached data
No extra API calls. No "sorry, I don't have that information anymore."
TOIN: The Network Effect
Here's where it gets interesting. Headroom learns from compression patterns.
TOIN (Tool Output Intelligence Network) tracks—anonymously—what happens after compression:
- Which fields get retrieved most often?
- Which tool types have high retrieval rates?
- What query patterns trigger retrievals?
This data feeds back into compression recommendations. If TOIN learns that users frequently retrieve error_code fields after compression, it tells SmartCrusher to preserve error_code more aggressively next time.
Privacy is built in:
- No actual data values stored
- Tool names are structure hashes
- Field names are SHA256[:8] hashes
- No user identifiers
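The field-name anonymization is about as simple as it sounds; here's a sketch of the scheme described above (not TOIN's exact code):

```python
import hashlib

def anonymize_field(name: str) -> str:
    # Only an 8-character hash of the field name ever leaves the machine,
    # never the name itself and never any value.
    return hashlib.sha256(name.encode()).hexdigest()[:8]

anonymize_field("error_code")  # -> an opaque 8-char hex string
```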
The network effect: more users → more compression events → better recommendations for everyone.
Memory: Cross-Conversation Learning
Agents often need to remember things across conversations. "I prefer dark mode." "My timezone is PST." "I'm working on the auth refactor."
Headroom has a memory system that extracts and stores these facts automatically.
Two approaches:
Fast Memory (Recommended)
Zero extra latency. The LLM outputs a <memory> block inline with its response. Headroom parses it out and stores the memory.
```python
from openai import OpenAI
from headroom.memory import with_fast_memory

client = with_fast_memory(OpenAI(), user_id="alice")
# Memories extracted automatically from responses
# Injected automatically into future requests
```
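Under the hood, the fast path is just parse-and-strip. Here's a rough sketch assuming a simple `<memory>...</memory>` tag format; the exact format Headroom expects may differ.

```python
import re

# Hypothetical tag format for illustration only.
MEMORY_RE = re.compile(r"<memory>(.*?)</memory>", re.DOTALL)

def extract_memories(reply: str) -> tuple[str, list[str]]:
    memories = [m.strip() for m in MEMORY_RE.findall(reply)]
    cleaned = MEMORY_RE.sub("", reply).strip()   # what the user actually sees
    return cleaned, memories

text, facts = extract_memories(
    "Done, switching the theme. <memory>Alice prefers dark mode</memory>"
)
# facts == ["Alice prefers dark mode"]
```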
Background Memory
Separate LLM call extracts memories asynchronously. More accurate but adds latency.
```python
from openai import OpenAI
from headroom import with_memory

client = with_memory(OpenAI(), user_id="alice")
```
Memories are stored locally (SQLite) and injected into future conversations. The model remembers that Alice prefers dark mode without you managing state.
The Transform Pipeline
Headroom runs four transforms on each request:
1. CacheAligner
LLM providers offer cached token pricing (Anthropic: 90% off, OpenAI: 50% off). But caching only works if your prompt prefix is stable.
Problem: your system prompt probably contains a timestamp like `Current time: 2024-01-15 10:32:45`. That breaks caching.
CacheAligner extracts dynamic content and moves it to the end, stabilizing the prefix. Same information, better cache hits.
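A toy version of the idea (not CacheAligner itself, which handles more kinds of dynamic content than timestamps):

```python
import re

# Hypothetical pattern, for illustration only.
TIMESTAMP_RE = re.compile(r"Current time: [\d\- :]+")

def split_dynamic(system_prompt: str) -> tuple[str, str]:
    """Return (stable_prefix, dynamic_suffix) so the prefix can be cached."""
    dynamic = " ".join(TIMESTAMP_RE.findall(system_prompt))
    stable = TIMESTAMP_RE.sub("", system_prompt).strip()
    return stable, dynamic

stable, dynamic = split_dynamic(
    "You are an SRE assistant. Current time: 2024-01-15 10:32:45"
)
# `stable` stays byte-identical across requests; `dynamic` gets appended at the end.
```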
2. SmartCrusher
The statistical compression engine. Analyzes arrays, detects patterns, preserves anomalies, factors constants.
3. ContentRouter
Different content needs different compression. Code isn't JSON isn't logs isn't prose.
ContentRouter uses ML-based content detection to route data to specialized compressors:
- Code → AST-aware compression (tree-sitter)
- JSON → SmartCrusher
- Logs → LogCompressor (clusters similar messages)
- Text → Optional LLMLingua integration (20x compression, adds latency)
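In spirit, the routing is a dispatch like the sketch below. Headroom's actual router is ML-based; this hypothetical version just pattern-matches to show the shape.

```python
import json

def route(payload: str) -> str:
    # Crude stand-in for ML-based content detection.
    try:
        json.loads(payload)
        return "smart_crusher"                       # structured JSON
    except ValueError:
        pass
    if "Traceback (most recent call last)" in payload or " ERROR " in payload:
        return "log_compressor"                      # log-like text
    if payload.lstrip().startswith(("def ", "class ", "import ")):
        return "ast_compressor"                      # source code
    return "llmlingua"                               # prose
```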
4. RollingWindow
When context exceeds the model limit, something has to go. RollingWindow drops oldest tool calls + responses together (never orphans data), preserves system prompt and recent turns.
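A minimal sketch of that policy, assuming OpenAI-style message dicts (not the actual transform):

```python
def roll_window(messages: list[dict], count_tokens, limit: int) -> list[dict]:
    """Drop the oldest tool call/result pairs together until the context fits."""
    msgs = list(messages)
    while count_tokens(msgs) > limit:
        # Find the oldest assistant message that issued tool calls.
        idx = next((i for i, m in enumerate(msgs)
                    if m["role"] == "assistant" and m.get("tool_calls")), None)
        if idx is None:
            break  # nothing left to drop without touching the system prompt or recent turns
        # Drop the call and every tool result that answers it, so neither is orphaned.
        call_ids = {call["id"] for call in msgs[idx]["tool_calls"]}
        msgs = [m for i, m in enumerate(msgs)
                if i != idx and m.get("tool_call_id") not in call_ids]
    return msgs
```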
Three Ways to Use It
Option 1: Proxy Server (Zero Code Changes)
```bash
pip install headroom-ai
headroom proxy --port 8787
```
Point your OpenAI client to http://localhost:8787/v1. Done.
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8787/v1")
# No other changes
```
Works with Claude Code, Cursor, any OpenAI-compatible client.
Option 2: SDK Wrapper
```python
from headroom import HeadroomClient
from openai import OpenAI

client = HeadroomClient(OpenAI())
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    headroom_mode="optimize",  # or "audit" or "simulate"
)
```
Three modes:
- audit: Observe only. Logs what would be optimized, doesn't change anything.
- optimize: Apply compression. This is what saves tokens.
- simulate: Dry run. Returns the optimized messages without calling the API.
Start with audit to see potential savings, then flip to optimize when you're confident.
Option 3: Framework Integrations
LangChain:
```python
from langchain_openai import ChatOpenAI
from headroom.integrations.langchain import HeadroomChatModel

base_model = ChatOpenAI(model="gpt-4o")
model = HeadroomChatModel(base_model, mode="optimize")

# Use in any chain or agent
chain = prompt | model | parser
```
Agno:
```python
from agno.agent import Agent
from headroom.integrations.agno import HeadroomAgnoModel

model = HeadroomAgnoModel(original_model, mode="optimize")
agent = Agent(model=model, tools=[...])
```
MCP (Model Context Protocol):
```python
from headroom.integrations.mcp import compress_tool_result

# Compress any tool result before returning to LLM
compressed = compress_tool_result(tool_name, result_data)
```
Real Numbers
I've been running this in production for months. Here's what the token reduction looks like:
| Workload | Before (tokens) | After (tokens) | Savings |
|---|---|---|---|
| Log Analysis | 22,000 | 3,300 | 85% |
| Code Search | 45,000 | 4,500 | 90% |
| Database Queries | 18,000 | 2,700 | 85% |
| Long Conversations | 80,000 | 32,000 | 60% |
What's Coming Next
This is actively maintained. On the roadmap:
More Frameworks
- CrewAI integration
- AutoGen integration
- Semantic Kernel integration
Managed Storage
- Cloud-hosted TOIN backend (opt-in)
- Cross-device memory sync
- Team-shared compression patterns
Better Compression
- Domain-specific profiles (SRE, coding, data analysis)
- Custom compressor plugins
- Streaming compression for real-time tools
Why I Built This
I'm a believer that we're in the "optimization phase" of the AI hype cycle. Getting things to work is table stakes; getting them to work cheaply and reliably is the actual engineering work.
Headroom is my attempt to fix the "context bloat" problem properly. Not with heuristics or truncation, but with statistical analysis and reversible compression.
It runs entirely locally. No data leaves your machine (except to OpenAI/Anthropic as usual). Apache-2.0 licensed.
Repo: github.com/chopratejas/headroom
If you find bugs or have ideas, open an issue. I'm actively maintaining this.