Arjun Shah

How I Built a Prompt Compressor That Saves 65% on LLM Costs

Every time you call an LLM, tokens that never needed to be processed burn GPU cycles, waste money, and strain the grid. The problem gets worse with every agent loop, every long-context RAG query, every multi-turn conversation.

I built SuperCompress, a tiny (~5K-parameter) policy that runs on CPU and scores every line of context for relevance before inference, keeping only what the model needs.

The results? 65% fewer tokens, 100% oracle recall, ~60ms latency. Open source. MIT licensed.

The Problem: LLMs Are Wasteful

Modern LLMs process every token you give them. On long contexts (think agent logs, RAG results, codebases), most of those tokens are irrelevant boilerplate that consumes KV cache space without contributing to the answer.

The standard approaches don't work well:

Approach                    | Tokens Saved | Answer Quality
────────────────────────────┼──────────────┼───────────────
Truncation (keep head/tail) |        ~65%  | ~25% recall
FIFO eviction               |        ~65%  | ~25% recall
H2O                         |        ~65%  | ~98% recall
SuperCompress               |        ~65%  | 100% recall

At the same KV savings, SuperCompress preserves far more answer quality than truncation or FIFO eviction, and edges out H2O (100% vs. ~98% recall).

The Architecture: CPU-First Eviction

The key insight: you don't need a GPU to decide what a GPU should process.

┌──────────────┐     ┌──────────────┐     ┌───────────┐
│  Context In  │ ──→ │  CPU Policy  │ ──→ │  GPU LLM  │
│ (1,247 tok)  │     │  (5K params) │     │ (437 tok) │
└──────────────┘     └──────────────┘     └───────────┘
                          │
                          ↓
                    Score each line
                    Drop low-relevance
                    Keep answer-critical

The policy is a lightweight neural network (~5,000 parameters) that runs entirely on CPU. It takes each line of context together with the user's question and scores how relevant that line is to answering it. Lines that score below a threshold get evicted.
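
To make the score-then-evict flow concrete, here is a minimal sketch. The real policy is a learned model; `score_line` below is a hypothetical stand-in that uses plain word overlap with the question, and the names `evict_low_relevance` and `threshold` are illustrative, not part of the SuperCompress API.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for real model features."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def score_line(line: str, question: str) -> float:
    """Hypothetical relevance score in [0, 1].

    Returns the fraction of question words that also appear in the line.
    The actual policy is a ~5K-parameter network; this heuristic only
    illustrates the (line, question) -> score interface.
    """
    q_words = tokenize(question)
    if not q_words:
        return 0.0
    return len(q_words & tokenize(line)) / len(q_words)


def evict_low_relevance(context: str, question: str, threshold: float = 0.25) -> str:
    """Keep lines whose score meets the threshold, preserving original order."""
    kept = [
        line
        for line in context.splitlines()
        if score_line(line, question) >= threshold
    ]
    return "\n".join(kept)


context = "\n".join([
    "INFO server started on port 8080",
    "DEBUG heartbeat ok",
    "ERROR database connection timeout after 30s",
    "DEBUG heartbeat ok",
])
print(evict_low_relevance(context, "Why did the database connection fail?"))
```

With this toy scorer, only the `ERROR` line shares words with the question, so it is the only line that survives. A learned scorer replaces the overlap heuristic but keeps the same filter step.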

Training Approach

The policy was trained on a dataset of:

  • Long-form text passages (books, documentation, code)
  • Paired with realistic user questions
  • Ground-truth relevance labels from oracle LLM judgments

The training objective balances:

  1. Token savings — maximize KV reduction
  2. Recall — preserve lines needed for correct answers
  3. Latency — keep inference under 100ms on CPU
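
The post does not give the loss function, so the following is only one plausible way to combine the first two goals, under stated assumptions: a cross-entropy term that penalizes dropping relevant lines more heavily than keeping irrelevant ones (recall), plus a penalty when the expected number of kept tokens exceeds the budget (token savings). The names `pos_weight`, `lam`, and `budget_ratio` are hypothetical. Latency is assumed to be handled by keeping the model small rather than by a loss term.

```python
import math


def weighted_bce(scores, labels, pos_weight=5.0):
    """Binary cross-entropy where missing a relevant line costs pos_weight times more.

    scores: predicted keep-probabilities in (0, 1)
    labels: 1 if the oracle marked the line as needed, else 0
    """
    eps = 1e-7
    total = 0.0
    for p, y in zip(scores, labels):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(scores)


def budget_penalty(scores, token_counts, budget_ratio=0.35):
    """Penalize expected kept tokens above the budget, as a fraction of all tokens."""
    total_tokens = sum(token_counts)
    expected_kept = sum(p * n for p, n in zip(scores, token_counts))
    return max(0.0, expected_kept - budget_ratio * total_tokens) / total_tokens


def training_loss(scores, labels, token_counts, lam=1.0):
    """Assumed combined objective: recall term plus weighted savings term."""
    return weighted_bce(scores, labels) + lam * budget_penalty(scores, token_counts)


labels = [1, 0, 0, 0]
token_counts = [10, 10, 10, 10]
selective = training_loss([0.9, 0.1, 0.1, 0.1], labels, token_counts)
keep_all = training_loss([0.9, 0.9, 0.9, 0.9], labels, token_counts)
drop_needed = training_loss([0.1, 0.1, 0.1, 0.1], labels, token_counts)
print(selective < keep_all < drop_needed)
```

Under these assumed weights, keeping only the needed line scores best, keeping everything is worse, and dropping the needed line is worst, which is the ordering a recall-first compressor would want.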

Benchmarks

At a fixed 35% budget (keep 35% of tokens):

Policy          | Oracle Recall | Entity Recall | Latency
────────────────┼───────────────┼───────────────┼────────
FIFO/Truncation |         25%  |         73%   | ~57ms
Summarization   |         61%  |         65%   | ~63ms
H2O             |         98%  |         73%   | ~56ms
SuperCompress   |        100%  |         73%   | ~60ms

100% oracle recall means the policy never dropped a line that the answer depended on, while working within the same 35% token budget as the baselines.
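
For readers who want to reproduce this kind of measurement, here is a minimal sketch of selecting lines under a fixed budget and then computing oracle recall. It assumes per-line scores are already available and approximates token counts by whitespace splitting; the function names are illustrative and not taken from the SuperCompress codebase.

```python
def select_within_budget(lines, scores, budget_ratio=0.35):
    """Greedily keep the highest-scoring lines until the token budget is spent.

    Token counts are approximated by whitespace splitting (an assumption;
    a real tokenizer would give different counts).
    Returns the kept line indices in original document order.
    """
    token_counts = [len(line.split()) for line in lines]
    budget = budget_ratio * sum(token_counts)
    kept, used = [], 0
    by_score = sorted(range(len(lines)), key=lambda i: scores[i], reverse=True)
    for i in by_score:
        if used + token_counts[i] <= budget:
            kept.append(i)
            used += token_counts[i]
    return sorted(kept)


def oracle_recall(kept, oracle):
    """Fraction of oracle-required line indices that survived compression."""
    oracle = set(oracle)
    if not oracle:
        return 1.0
    return len(oracle & set(kept)) / len(oracle)


lines = ["a b c d", "e f", "g h i j", "k l", "m n o p q r s t"]
scores = [0.9, 0.1, 0.2, 0.8, 0.3]
kept = select_within_budget(lines, scores)
print(kept, oracle_recall(kept, oracle=[0, 3]))
```

Here lines 0 and 3 fit the budget, and both are in the oracle set, so recall is 1.0. A policy that kept only one of them would score 0.5.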

Environmental Impact

Per 1 million compressions:

  • 800M tokens avoided — that's real GPU time
  • 29 kWh saved — enough to power a home for a day
  • 12 kg CO₂ avoided — tiny but it adds up
  • 52 L water saved — datacenter cooling is thirsty
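
The per-million totals can be broken down into per-request and per-unit figures. The sketch below only divides the numbers quoted above; it makes no claim about how those totals were measured.

```python
# Totals quoted in the list above, per 1 million compressions.
COMPRESSIONS = 1_000_000
TOKENS_AVOIDED = 800_000_000
ENERGY_KWH = 29
CO2_KG = 12
WATER_L = 52

tokens_per_compression = TOKENS_AVOIDED / COMPRESSIONS         # 800 tokens
wh_per_compression = ENERGY_KWH * 1000 / COMPRESSIONS          # 0.029 Wh
wh_per_1k_tokens = ENERGY_KWH * 1000 / (TOKENS_AVOIDED / 1000) # 0.03625 Wh
co2_kg_per_kwh = CO2_KG / ENERGY_KWH                           # about 0.41 kg/kWh
water_l_per_kwh = WATER_L / ENERGY_KWH                         # about 1.79 L/kWh

print(f"{tokens_per_compression:.0f} tokens avoided per compression")
print(f"{wh_per_compression:.3f} Wh saved per compression")
print(f"{wh_per_1k_tokens:.5f} Wh per 1,000 tokens avoided")
print(f"{co2_kg_per_kwh:.2f} kg CO2 per kWh, {water_l_per_kwh:.2f} L water per kWh")
```

The 800 tokens avoided per compression is close to the 810 tokens dropped in the diagram's example (1,247 in, 437 out), so the totals assume contexts of roughly that size.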

Getting Started

Python (in-process)

pip install git+https://github.com/arjunkshah/supercompress.git

from supercompress import compress_context

result = compress_context(
    "Your long context text here...",
    "What does this code do?",
    budget_ratio=0.35,
)
print(result.compressed_text)
print(f"{result.kv_savings_pct:.1f}% KV saved")

Hosted API (no local ML deps)

curl -X POST https://supercompress.vercel.app/api/v1/compress \
  -H "X-API-Key: sc_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"context":"...","query":"Summarize this"}'
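
If you would rather call the hosted API from Python without extra dependencies, the same request can be sketched with the standard library. The URL, header name, and body fields come from the curl example; the `Content-Type` header and the helper names `build_compress_request` and `compress_remote` are assumptions, and the response schema is not shown in this post, so the raw JSON is returned as-is.

```python
import json
import urllib.request

API_URL = "https://supercompress.vercel.app/api/v1/compress"


def build_compress_request(context: str, query: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl example."""
    body = json.dumps({"context": context, "query": query}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "X-API-Key": api_key,
            # Assumption: the endpoint expects a JSON body.
            "Content-Type": "application/json",
        },
    )


def compress_remote(context: str, query: str, api_key: str, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response.

    The response fields are not documented here, so the dict is returned unmodified.
    """
    req = build_compress_request(context, query, api_key)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Splitting request construction from sending keeps the network call in one place, which makes it easy to add retries or swap in another HTTP client later.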

Browser demo (no setup needed)

Just visit supercompress.vercel.app and try the live demo.

What's Next

  • Adaptive compression ratios (not fixed budget)
  • Integration with LangChain/LlamaIndex as a built-in compressor
  • Quantized policy for even lower latency

The code is open source under MIT. Contributions welcome!

GitHub: https://github.com/arjunkshah/supercompress
Live demo: https://supercompress.vercel.app
Docs: https://arjunkshah-supercompress-55.mintlify.app
