How I Built a Prompt Compressor That Saves 65% on LLM Costs
Every time you call an LLM, tokens that never needed to be processed burn GPU cycles, waste money, and strain the grid. The problem gets worse with every agent loop, every long-context RAG query, every multi-turn conversation.
I built SuperCompress — a tiny ~5K parameter CPU policy that scores every line of context for relevance before inference, keeping only what the model needs.
The results? 65% fewer tokens, 100% oracle recall, ~60ms latency. Open source. MIT licensed.
The Problem: LLMs Are Wasteful
Modern LLMs process every token you give them. On long contexts (think agent logs, RAG results, codebases), most of those tokens are padding — irrelevant boilerplate that consumes KV cache space without contributing to the answer.
The standard approaches don't work well:
| Approach | Tokens Saved | Answer Quality |
|---|---|---|
| Truncation (keep head/tail) | ~65% | ~25% recall |
| FIFO eviction | ~65% | ~25% recall |
| H2O | ~65% | ~98% recall |
| SuperCompress | ~65% | 100% recall |
At the same ~65% token savings, SuperCompress keeps recall at 100%, compared with ~25% for truncation and FIFO eviction and ~98% for H2O.
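For reference, the two simplest baselines in the table can be sketched at line granularity. This is an illustrative sketch, not the benchmark code: the function names, the whitespace-based token count, and the even head/tail split are all assumptions.

```python
def count_tokens(line: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokenizer tokens.
    return len(line.split())


def fifo_evict(lines: list[str], budget_ratio: float) -> list[str]:
    """Keep the most recent lines that fit the budget; the oldest lines go first."""
    budget = int(sum(count_tokens(line) for line in lines) * budget_ratio)
    kept: list[str] = []
    used = 0
    for line in reversed(lines):
        cost = count_tokens(line)
        if used + cost > budget:
            break
        kept.append(line)
        used += cost
    return kept[::-1]  # restore original order


def head_tail_truncate(lines: list[str], budget_ratio: float) -> list[str]:
    """Keep lines from the start and the end, giving each side half the budget."""
    total = sum(count_tokens(line) for line in lines)
    half = int(total * budget_ratio) // 2

    head: list[str] = []
    used = 0
    for line in lines:
        cost = count_tokens(line)
        if used + cost > half:
            break
        head.append(line)
        used += cost

    tail: list[str] = []
    used = 0
    for line in reversed(lines[len(head):]):
        cost = count_tokens(line)
        if used + cost > half:
            break
        tail.append(line)
        used += cost
    return head + tail[::-1]
```

Neither function looks at the question, so a line that carries the answer is kept only if it happens to sit in the retained region.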
The Architecture: CPU-First Eviction
The key insight: you don't need a GPU to decide what a GPU should process.
```
┌─────────────┐     ┌──────────────┐     ┌───────────┐
│ Context In  │ ──→ │  CPU Policy  │ ──→ │  GPU LLM  │
│ (1,247 tok) │     │ (5K params)  │     │ (437 tok) │
└─────────────┘     └──────────────┘     └───────────┘
                           │
                           ↓
                    Score each line
                    Drop low-relevance
                    Keep answer-critical
```
The policy is a lightweight neural network (~5,000 parameters) that runs entirely on CPU. It takes each line of context together with the user's question and scores how relevant that line is to answering it. Lines that score below a threshold are evicted.
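The score-then-evict loop can be sketched in a few lines. This is a minimal illustration, not the SuperCompress implementation: `score_line` is a hypothetical stand-in that uses word overlap instead of the trained network, and the function names and the `0.25` threshold are assumptions.

```python
import re


def tokenize(text: str) -> set[str]:
    # Lowercased alphanumeric words; a crude stand-in for real features.
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def score_line(line: str, question: str) -> float:
    """Hypothetical scorer: fraction of question words that appear in the line.

    The real policy is a learned ~5K-parameter model; this only mimics its interface.
    """
    question_words = tokenize(question)
    if not question_words:
        return 0.0
    return len(question_words & tokenize(line)) / len(question_words)


def evict_below_threshold(context: str, question: str, threshold: float = 0.25) -> str:
    """Keep only the lines whose relevance score reaches the threshold."""
    kept = [
        line
        for line in context.splitlines()
        if score_line(line, question) >= threshold
    ]
    return "\n".join(kept)
```

Scoring each line independently keeps the work linear in the number of lines, which is one reason this step fits comfortably on a CPU.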
Training Approach
The policy was trained on a dataset of:
- Long-form text passages (books, documentation, code)
- Realistic user questions paired with each passage
- Ground-truth relevance labels from oracle LLM judgments
The training objective balances:
- Token savings — maximize KV reduction
- Recall — preserve lines needed for correct answers
- Latency — keep inference under 100ms on CPU
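One way such a trade-off could be written as a loss is sketched below. This is an assumption for illustration only: the post does not specify the actual loss, and `policy_loss`, `recall_weight`, and `budget_weight` are hypothetical names and values. Latency is not part of this loss; in this sketch it would be controlled by keeping the model small.

```python
import math


def policy_loss(
    keep_probs: list[float],
    needed: list[bool],
    budget_ratio: float = 0.35,
    recall_weight: float = 5.0,
    budget_weight: float = 1.0,
) -> float:
    """Weighted cross-entropy on per-line keep decisions plus an over-budget penalty."""
    eps = 1e-9
    bce = 0.0
    for prob, is_needed in zip(keep_probs, needed):
        if is_needed:
            # Dropping a needed line hurts recall, so it is penalized more heavily.
            bce -= recall_weight * math.log(prob + eps)
        else:
            bce -= math.log(1.0 - prob + eps)
    bce /= len(keep_probs)

    # Penalize keeping more than the target share of lines on average.
    expected_keep = sum(keep_probs) / len(keep_probs)
    over_budget = max(0.0, expected_keep - budget_ratio)
    return bce + budget_weight * over_budget
```

Weighting needed lines more heavily reflects the asymmetry in the goals above: a wasted token costs a little compute, while a dropped answer line can cost the whole answer.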
Benchmarks
At a fixed 35% budget (keep 35% of tokens):
| Policy | Oracle Recall | Entity Recall | Latency |
|---|---|---|---|
| FIFO/Truncation | 25% | 73% | ~57ms |
| Summarization | 61% | 65% | ~63ms |
| H2O | 98% | 73% | ~56ms |
| SuperCompress | 100% | 73% | ~60ms |
100% oracle recall means the policy never dropped a line that the answer depended on, while keeping the same 35% token budget as the baselines.
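Read that way, oracle recall is the share of answer-critical lines that survive compression. A minimal sketch of the metric follows, assuming lines are identified by index; the function names are hypothetical and the benchmark's exact definition may differ.

```python
def oracle_recall(kept_lines: set[int], oracle_lines: set[int]) -> float:
    """Fraction of oracle-labeled (answer-critical) lines that were kept."""
    if not oracle_lines:
        return 1.0  # nothing was required, so nothing was lost
    return len(kept_lines & oracle_lines) / len(oracle_lines)


def kept_token_ratio(token_counts: list[int], kept_lines: set[int]) -> float:
    """Share of the original tokens that remain after eviction."""
    total = sum(token_counts)
    if total == 0:
        return 0.0
    return sum(token_counts[i] for i in kept_lines) / total
```

Reporting both numbers together matters: recall is trivially 100% if nothing is evicted, so it is only meaningful at a stated budget.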
Environmental Impact
Per 1 million compressions:
- 800M tokens avoided — that's real GPU time
- 29 kWh saved — enough to power a home for a day
- 12 kg CO₂ avoided — tiny but it adds up
- 52 L water saved — datacenter cooling is thirsty
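The per-unit factors implied by these totals can be backed out with simple arithmetic. The sketch below only divides the figures listed above; it does not say how they were measured, and the variable names are illustrative.

```python
# Totals quoted above, per 1 million compressions.
COMPRESSIONS = 1_000_000
TOKENS_AVOIDED = 800_000_000
KWH_SAVED = 29
CO2_KG_AVOIDED = 12
WATER_L_SAVED = 52

# Average tokens removed per compression call.
tokens_per_compression = TOKENS_AVOIDED / COMPRESSIONS

# Implied energy cost per 1,000 tokens, in watt-hours.
wh_per_1k_tokens = KWH_SAVED * 1000 / (TOKENS_AVOIDED / 1000)

# Implied grid carbon intensity and cooling water use per kWh.
co2_kg_per_kwh = CO2_KG_AVOIDED / KWH_SAVED
water_l_per_kwh = WATER_L_SAVED / KWH_SAVED

print(f"{tokens_per_compression:.0f} tokens avoided per compression")
print(f"{wh_per_1k_tokens:.5f} Wh per 1k tokens")
print(f"{co2_kg_per_kwh:.3f} kg CO2 per kWh")
print(f"{water_l_per_kwh:.2f} L water per kWh")
```

The 800 tokens per compression is in line with the architecture example, where 1,247 tokens shrink to 437, a reduction of 810.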
Getting Started
Python (in-process)
```bash
pip install git+https://github.com/arjunkshah/supercompress.git
```

```python
from supercompress import compress_context

result = compress_context(
    "Your long context text here...",
    "What does this code do?",
    budget_ratio=0.35,
)
print(result.compressed_text)
print(f"{result.kv_savings_pct:.1f}% KV saved")
```
Hosted API (no local ML deps)
```bash
curl -X POST https://supercompress.vercel.app/api/v1/compress \
  -H "X-API-Key: sc_live_YOUR_KEY" \
  -d '{"context":"...","query":"Summarize this"}'
```
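The same call can be made from Python with only the standard library. This is a sketch under assumptions: the URL, the `X-API-Key` header, and the `context`/`query` fields come from the curl command, while the `Content-Type` header, the JSON response, and the helper names are assumptions rather than documented behavior.

```python
import json
import urllib.request

API_URL = "https://supercompress.vercel.app/api/v1/compress"


def build_compress_request(context: str, query: str, api_key: str) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps({"context": context, "query": query}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "X-API-Key": api_key,
            # Assumption: the endpoint expects a JSON content type.
            "Content-Type": "application/json",
        },
        method="POST",
    )


def compress_remote(context: str, query: str, api_key: str, timeout: float = 10.0) -> dict:
    """Send the request and return the parsed response."""
    request = build_compress_request(context, query, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Assumption: the response body is JSON; its fields are not specified here.
        return json.load(response)
```

Splitting request construction from sending makes the payload easy to inspect or unit test without touching the network.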
Browser demo (no setup needed)
Just visit supercompress.vercel.app and try the live demo.
What's Next
- Adaptive compression ratios (not fixed budget)
- Integration with LangChain/LlamaIndex as a built-in compressor
- Quantized policy for even lower latency
The code is open source under MIT. Contributions welcome!
GitHub: https://github.com/arjunkshah/supercompress
Live demo: https://supercompress.vercel.app
Docs: https://arjunkshah-supercompress-55.mintlify.app