Tejas Chopra

Stop Feeding "Junk" Tokens to Your LLM. (I Built a Proxy to Fix It)

I recently built an agent to handle some SRE tasks—fetching logs, querying databases, searching code. It worked, but when I looked at the traces, I was annoyed.

It wasn't just that it was expensive (though the bill was climbing). It was the sheer inefficiency.

I looked at a single tool output—a search for Python files. It was 40,000 tokens.
About 35,000 of those tokens were just "type": "file" and "language": "python" repeated 2,000 times.

We are paying premium compute prices to force state-of-the-art models to read standard JSON boilerplate.

I couldn't find a tool that solved this without breaking the agent, so I wrote one. It's called Headroom. It's a context optimization layer that sits between your app and your LLM. It compresses context by ~85% without losing semantic meaning.

It's open source (Apache-2.0). If you just want the code:
github.com/chopratejas/headroom


Why Truncation and Summarization Don't Work

When your context window fills up, the standard industry solution is truncation (chopping off the oldest messages or the middle of the document).

But for an agent, truncation is dangerous.

  • If you chop the middle of a log file, you might lose the one error line that explains the crash.
  • If you chop a file list, you might lose the exact config file the user asked for.

I tried summarization (using a cheaper model to summarize the data first), but that introduced hallucination. I had a summarizer tell me a deployment "looked fine" because it ignored specific error codes in the raw log.

I needed a third option: Lossless compression. Or at least, "intent-lossless."


The Core Idea: Statistical Analysis, Not Blind Truncation

I realized that 90% of the data in a tool output is just schema scaffolding. The LLM doesn't need to see status: active repeated a thousand times. It needs the anomalies.

Headroom's SmartCrusher runs statistical analysis before touching your data:

1. Constant Factoring
If every item in an array has "type": "file", SmartCrusher doesn't repeat that 2,000 times; it extracts the constant once.

2. Outlier Detection
It calculates standard deviation of numerical fields. It preserves the spikes—the values that are >2σ from the mean. Those are usually what matters.

3. Error Preservation
Hard rule: never discard strings that look like stack traces, error messages, or failures. Errors are sacred.

4. Relevance Scoring
If you searched for "auth", items containing "auth" get preserved. Headroom matches items against the user's query context with hybrid scoring: BM25 plus semantic embeddings.

5. First/Last Retention
It always keeps the first few and last few items. The LLM expects to see some examples, and recency matters.

The result: 40,000 tokens → 4,000 tokens. The same information in a tenth of the tokens, and no hallucination risk, because nothing is rewritten, only dropped or factored out.
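
To make the idea concrete, here is a minimal sketch of constant factoring, outlier detection, and error preservation over a list of dicts. It illustrates the technique, not Headroom's internal code; the function name and return shape are mine.

import statistics

def crush(items, keep_edges=3, sigma=2.0):
    """Sketch: factor out constant fields, keep numeric outliers and error-like items."""
    if not items:
        return {"constants": {}, "kept": [], "omitted": 0}

    # Constant factoring: fields whose value never changes across the array.
    constants = {
        key: items[0][key]
        for key in items[0]
        if all(item.get(key) == items[0][key] for item in items)
    }

    # Always keep the first/last few items, plus anything flagged below.
    kept = set(range(min(keep_edges, len(items))))
    kept |= set(range(max(0, len(items) - keep_edges), len(items)))

    # Outlier detection: numeric values more than `sigma` standard deviations from the mean.
    numeric_keys = [k for k, v in items[0].items() if isinstance(v, (int, float))]
    for key in numeric_keys:
        values = [item[key] for item in items if isinstance(item.get(key), (int, float))]
        if len(values) < 2:
            continue
        mean, stdev = statistics.mean(values), statistics.pstdev(values)
        for i, item in enumerate(items):
            value = item.get(key)
            if stdev and isinstance(value, (int, float)) and abs(value - mean) > sigma * stdev:
                kept.add(i)

    # Error preservation: never drop anything that looks like a failure.
    for i, item in enumerate(items):
        if any("error" in str(v).lower() or "traceback" in str(v).lower() for v in item.values()):
            kept.add(i)

    survivors = [
        {k: v for k, v in items[i].items() if k not in constants}
        for i in sorted(kept)
    ]
    return {"constants": constants, "kept": survivors, "omitted": len(items) - len(survivors)}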


CCR: Making Compression Reversible

Here's the insight that changed everything: compression should be reversible.

I call the architecture CCR (Compress-Cache-Retrieve):

1. Compress

SmartCrusher compresses the tool output from 2,000 items to 20.

2. Cache

The original 2,000 items are cached locally (5-minute TTL, LRU eviction).

3. Retrieve

Headroom injects a tool called headroom_retrieve() into the LLM's context. If the model looks at the compressed summary and decides it needs more data—maybe the user asked a follow-up question—it can call that tool. Headroom fetches from the cache and returns the relevant items.

This changes the risk calculus. You can compress aggressively (90%+) because nothing is ever truly lost. The model can always "unzip" what it needs.

I've had conversations like this:

Turn 1: "Search for all Python files"
        → 1000 files returned, compressed to 15

Turn 5: "Actually, what was that file handling JWT tokens?"
        → LLM calls headroom_retrieve("jwt")
        → Returns jwt_handler.py from cached data

No extra API calls. No "sorry, I don't have that information anymore."
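
The cache side of CCR is conceptually simple. Here is a rough sketch of the idea, assuming one cache key per tool call and plain substring matching on retrieval; the class and method names are mine, not Headroom's internal API.

import time
from collections import OrderedDict

class ToolOutputCache:
    """Sketch of a local cache with a 5-minute TTL and LRU eviction."""

    def __init__(self, max_entries=128, ttl_seconds=300):
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self._store = OrderedDict()  # key -> (stored_at, original_items)

    def put(self, key, items):
        self._store[key] = (time.monotonic(), items)
        self._store.move_to_end(key)
        while len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the least recently used entry

    def retrieve(self, key, query):
        """Roughly what a headroom_retrieve("jwt") call resolves to on the proxy side."""
        entry = self._store.get(key)
        if entry is None or time.monotonic() - entry[0] > self.ttl:
            return []  # expired or never cached
        self._store.move_to_end(key)
        stored_at, items = entry
        return [item for item in items if query.lower() in str(item).lower()]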


TOIN: The Network Effect

Here's where it gets interesting. Headroom learns from compression patterns.

TOIN (Tool Output Intelligence Network) tracks—anonymously—what happens after compression:

  • Which fields get retrieved most often?
  • Which tool types have high retrieval rates?
  • What query patterns trigger retrievals?

This data feeds back into compression recommendations. If TOIN learns that users frequently retrieve error_code fields after compression, it tells SmartCrusher to preserve error_code more aggressively next time.

Privacy is built in:

  • No actual data values stored
  • Tool names are structure hashes
  • Field names are SHA256[:8] hashes
  • No user identifiers
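
For example, the SHA256[:8] scheme boils down to something like this (an illustration of the described hashing, not the exact TOIN code):

import hashlib

def anonymize_field(field_name: str) -> str:
    # First 8 hex characters: stable enough to correlate patterns across users,
    # but the original field name is never transmitted.
    return hashlib.sha256(field_name.encode("utf-8")).hexdigest()[:8]

print(anonymize_field("error_code"))  # prints an 8-character hex identifier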

The network effect: more users → more compression events → better recommendations for everyone.


Memory: Cross-Conversation Learning

Agents often need to remember things across conversations. "I prefer dark mode." "My timezone is PST." "I'm working on the auth refactor."

Headroom has a memory system that extracts and stores these facts automatically.

Two approaches:

Fast Memory (Recommended)
Zero extra latency. The LLM outputs a <memory> block inline with its response. Headroom parses it out and stores the memory.

from headroom.memory import with_fast_memory
client = with_fast_memory(OpenAI(), user_id="alice")

# Memories extracted automatically from responses
# Injected automatically into future requests
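
To make the inline mechanism concrete: the model appends a memory block to its normal answer, and the wrapper strips it out before anything is shown or stored. A rough sketch of that parsing step, assuming a <memory>...</memory> tag format (the exact syntax Headroom expects may differ):

import re

MEMORY_BLOCK = re.compile(r"<memory>(.*?)</memory>", re.DOTALL)

def split_memories(response_text):
    """Return (visible_text, extracted_memories) from a raw model response."""
    memories = [m.strip() for m in MEMORY_BLOCK.findall(response_text)]
    visible = MEMORY_BLOCK.sub("", response_text).strip()
    return visible, memories

visible, memories = split_memories(
    "Done, dark mode is now your default.<memory>alice prefers dark mode</memory>"
)
# visible  -> "Done, dark mode is now your default."
# memories -> ["alice prefers dark mode"]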

Background Memory
Separate LLM call extracts memories asynchronously. More accurate but adds latency.

from headroom import with_memory
client = with_memory(OpenAI(), user_id="alice")

Memories are stored locally (SQLite) and injected into future conversations. The model remembers that Alice prefers dark mode without you managing any state.


The Transform Pipeline

Headroom runs four transforms on each request:

1. CacheAligner

LLM providers offer cached token pricing (Anthropic: 90% off, OpenAI: 50% off). But caching only works if your prompt prefix is stable.

Problem: your system prompt probably has a timestamp, e.g. "Current time: 2024-01-15 10:32:45". That breaks caching on every request.

CacheAligner extracts dynamic content and moves it to the end, stabilizing the prefix. Same information, better cache hits.
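
A before/after sketch of the idea, assuming an OpenAI-style message list (the exact rewrite Headroom performs may differ):

# Before: a timestamp inside the system prompt changes on every request,
# so the prefix never matches the provider's cache.
system_before = "You are an SRE assistant. Current time: 2024-01-15 10:32:45."

# After: the system prompt is stable (cacheable); dynamic content rides at the end.
messages = [
    {"role": "system", "content": "You are an SRE assistant."},
    # ... stable tools, examples, conversation history ...
    {"role": "user", "content": "Current time: 2024-01-15 10:32:45.\n\nWhy did the deploy fail?"},
]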

2. SmartCrusher

The statistical compression engine. Analyzes arrays, detects patterns, preserves anomalies, factors constants.

3. ContentRouter

Different content needs different compression. Code isn't JSON isn't logs isn't prose.

ContentRouter uses ML-based content detection to route data to specialized compressors:

  • Code → AST-aware compression (tree-sitter)
  • JSON → SmartCrusher
  • Logs → LogCompressor (clusters similar messages)
  • Text → Optional LLMLingua integration (20x compression, adds latency)
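
A simplified version of that dispatch might look like the following; the real router uses ML-based content detection rather than these keyword checks, and the compressor names here are just labels.

def route(content: str) -> str:
    """Crude sketch of routing content to a specialized compressor."""
    stripped = content.lstrip()
    if stripped.startswith(("{", "[")):
        return "smart_crusher"        # JSON arrays/objects -> statistical compression
    if "def " in content or "class " in content or "import " in content:
        return "ast_compressor"       # code -> AST-aware compression (tree-sitter)
    if any(level in content for level in ("ERROR", "WARN", "INFO", "DEBUG")):
        return "log_compressor"       # logs -> cluster similar messages
    return "text_compressor"          # prose -> optional LLMLingua pass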

4. RollingWindow

When context exceeds the model limit, something has to go. RollingWindow drops the oldest tool calls and their responses together (never orphaning one from the other) while preserving the system prompt and recent turns.
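
A sketch of that eviction rule, using a simple message-count budget instead of tokens and OpenAI-style message shapes (both simplifications):

def roll_window(messages, max_messages, protect_recent=6):
    """Drop the oldest tool call together with its tool responses until the history fits."""
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]

    while len(system) + len(history) > max_messages:
        droppable = history[: max(0, len(history) - protect_recent)]
        # Find the oldest assistant message that issued tool calls.
        idx = next(
            (i for i, m in enumerate(droppable)
             if m["role"] == "assistant" and m.get("tool_calls")),
            None,
        )
        if idx is None:
            break  # nothing left that is safe to drop
        # Remove the call and every tool response that immediately follows it.
        end = idx + 1
        while end < len(history) and history[end]["role"] == "tool":
            end += 1
        del history[idx:end]

    return system + history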


Three Ways to Use It

Option 1: Proxy Server (Zero Code Changes)

pip install headroom-ai
headroom proxy --port 8787

Point your OpenAI client to http://localhost:8787/v1. Done.

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8787/v1")
# No other changes

Works with Claude Code, Cursor, any OpenAI-compatible client.

Option 2: SDK Wrapper

from headroom import HeadroomClient
from openai import OpenAI

client = HeadroomClient(OpenAI())

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    headroom_mode="optimize"  # or "audit" or "simulate"
)

Three modes:

  • audit: Observe only. Logs what would be optimized, doesn't change anything.
  • optimize: Apply compression. This is what saves tokens.
  • simulate: Dry run. Returns the optimized messages without calling the API.

Start with audit to see potential savings, then flip to optimize when you're confident.

Option 3: Framework Integrations

LangChain:

from langchain_openai import ChatOpenAI
from headroom.integrations.langchain import HeadroomChatModel

base_model = ChatOpenAI(model="gpt-4o")
model = HeadroomChatModel(base_model, mode="optimize")

# Use in any chain or agent
chain = prompt | model | parser

Agno:

from agno.agent import Agent
from headroom.integrations.agno import HeadroomAgnoModel

model = HeadroomAgnoModel(original_model, mode="optimize")
agent = Agent(model=model, tools=[...])

MCP (Model Context Protocol):

from headroom.integrations.mcp import compress_tool_result

# Compress any tool result before returning to LLM
compressed = compress_tool_result(tool_name, result_data)

Real Numbers

I've been running this in production for months. Here's what the token reduction looks like:

Workload             Before (tokens)   After (tokens)   Savings
Log Analysis         22,000            3,300            85%
Code Search          45,000            4,500            90%
Database Queries     18,000            2,700            85%
Long Conversations   80,000            32,000           60%

What's Coming Next

This is actively maintained. On the roadmap:

More Frameworks

  • CrewAI integration
  • AutoGen integration
  • Semantic Kernel integration

Managed Storage

  • Cloud-hosted TOIN backend (opt-in)
  • Cross-device memory sync
  • Team-shared compression patterns

Better Compression

  • Domain-specific profiles (SRE, coding, data analysis)
  • Custom compressor plugins
  • Streaming compression for real-time tools

Why I Built This

I'm a believer that we're in the "optimization phase" of the AI hype cycle. Getting things to work is table stakes; getting them to work cheaply and reliably is the actual engineering work.

Headroom is my attempt to fix the "context bloat" problem properly. Not with heuristics or truncation, but with statistical analysis and reversible compression.

It runs entirely locally. No data leaves your machine (except to OpenAI/Anthropic as usual). Apache-2.0 licensed.

Repo: github.com/chopratejas/headroom

If you find bugs or have ideas, open an issue. I'm actively maintaining this.
