Arjun Shah

Posted on Jun 26

How I Built a Prompt Compressor That Saves 65% on LLM Costs

#llm #ai #python #opensource

Every time you call an LLM, tokens that never needed to be processed burn GPU cycles, waste money, and strain the grid. The problem gets worse with every agent loop, every long-context RAG query, every multi-turn conversation.

I built SuperCompress — a tiny ~5K parameter CPU policy that scores every line of context for relevance before inference, keeping only what the model needs.

The results? 65% fewer tokens, 100% oracle recall, ~60ms latency. Open source. MIT licensed.

The Problem: LLMs Are Wasteful

Modern LLMs process every token you give them. On long contexts (think agent logs, RAG results, codebases), most of those tokens are filler: boilerplate that is irrelevant to the question but still consumes KV cache space without contributing to the answer.

The standard approaches give up recall to reach the same savings:

Approach                    | Tokens Saved | Oracle Recall
────────────────────────────┼──────────────┼──────────────
Truncation (keep head/tail) | ~65%         | ~25%
FIFO eviction               | ~65%         | ~25%
H2O                         | ~65%         | ~98%
SuperCompress               | ~65%         | 100%

At the same ~65% token savings, SuperCompress keeps every answer-critical line, while truncation and FIFO keep about a quarter of them and H2O still drops about 2%.
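To make the two simplest baselines concrete, here is a rough sketch of position-based eviction. It is not the benchmark code: it assumes eviction happens per line, it counts tokens by whitespace splitting, and the names `count_tokens`, `fifo_evict`, and `truncate_head_tail` are made up for illustration.

```python
def count_tokens(line: str) -> int:
    # Crude proxy: whitespace-separated words. A real tokenizer would differ.
    return len(line.split())


def fifo_evict(lines: list[str], budget_ratio: float) -> list[str]:
    """Evict the oldest lines first, keeping the most recent ones that fit."""
    budget = budget_ratio * sum(count_tokens(line) for line in lines)
    kept, used = [], 0
    for line in reversed(lines):
        cost = count_tokens(line)
        if used + cost > budget:
            break
        kept.append(line)
        used += cost
    return kept[::-1]  # restore original order


def truncate_head_tail(lines: list[str], budget_ratio: float) -> list[str]:
    """Keep lines alternately from the start and the end until the budget is spent."""
    budget = budget_ratio * sum(count_tokens(line) for line in lines)
    head, tail = [], []
    lo, hi = 0, len(lines) - 1
    used, take_head = 0, True
    while lo <= hi:
        idx = lo if take_head else hi
        cost = count_tokens(lines[idx])
        if used + cost > budget:
            break
        if take_head:
            head.append(lines[idx])
            lo += 1
        else:
            tail.append(lines[idx])
            hi -= 1
        used += cost
        take_head = not take_head
    return head + tail[::-1]
```

Neither function ever looks at the question, so whether an answer-critical line survives depends only on where it sits in the context.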

The Architecture: CPU-First Eviction

The key insight: you don't need a GPU to decide what a GPU should process.

┌──────────────┐     ┌──────────────┐     ┌───────────┐
│  Context In  │ ──→ │  CPU Policy  │ ──→ │  GPU LLM  │
│ (1,247 tok)  │     │  (5K params) │     │ (437 tok) │
└──────────────┘     └──────────────┘     └───────────┘
                            │
                            ↓
                     Score each line
                     Drop low-relevance
                     Keep answer-critical

The policy is a lightweight neural network (~5,000 parameters) that runs entirely on CPU. It takes each line of context + the user's question, and scores how relevant that line is to answering the question. Lines below a threshold get evicted.
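The score-then-evict flow can be sketched as follows. This is only an illustration under stated assumptions: the real policy is a learned network, while `score_line` here is a stand-in that measures word overlap with the question. The function names are hypothetical, and I am assuming the cutoff is chosen so that the kept lines fit a token budget.

```python
import re


def tokenize(text: str) -> list[str]:
    # Lowercased word-like chunks; a placeholder for a real tokenizer.
    return re.findall(r"[a-z0-9_]+", text.lower())


def score_line(line: str, question: str) -> float:
    """Stand-in relevance score: fraction of question words found in the line.

    The actual policy is a learned model; lexical overlap is only a placeholder.
    """
    q_words = set(tokenize(question))
    if not q_words:
        return 0.0
    return len(q_words & set(tokenize(line))) / len(q_words)


def evict(context: str, question: str, budget_ratio: float = 0.35) -> str:
    """Keep the highest-scoring lines that fit the budget, in original order."""
    lines = context.splitlines()
    costs = [len(tokenize(line)) for line in lines]
    budget = budget_ratio * sum(costs)
    # Highest score first; ties go to the earlier line.
    ranked = sorted(range(len(lines)),
                    key=lambda i: (-score_line(lines[i], question), i))
    keep, used = set(), 0
    for i in ranked:
        if used + costs[i] <= budget:
            keep.add(i)
            used += costs[i]
    return "\n".join(lines[i] for i in sorted(keep))
```

Everything here is plain string processing, which is the point of the CPU-first design: the decision about what to drop is made before any GPU work starts.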

Training Approach

The policy was trained on a dataset of:

Long-form text passages (books, documentation, code)
Paired with realistic user questions
Ground-truth relevance labels from oracle LLM judgments

The training objective balances:

Token savings — maximize KV reduction
Recall — preserve lines needed for correct answers
Latency — keep inference under 100ms on CPU
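One way such a trade-off could be written down is a recall-weighted cross-entropy plus a penalty for exceeding the token budget. To be clear, this is a hypothetical sketch and not the loss actually used to train SuperCompress: the function `policy_loss`, the weights, and the penalty form are all assumptions. Latency does not appear as a loss term here; I am assuming it is handled by keeping the model small.

```python
import math


def policy_loss(keep_probs, labels, token_counts,
                budget_ratio=0.35, recall_weight=5.0, budget_weight=1.0):
    """Hypothetical objective: recall-weighted cross-entropy + budget penalty.

    keep_probs   -- predicted probability of keeping each line
    labels       -- 1 if the oracle says the answer needs the line, else 0
    token_counts -- token cost of each line
    """
    eps = 1e-9
    ce = 0.0
    for p, y in zip(keep_probs, labels):
        if y:
            # Dropping a needed line hurts recall, so weight it more heavily.
            ce -= recall_weight * math.log(p + eps)
        else:
            ce -= math.log(1.0 - p + eps)
    ce /= len(labels)

    # Expected fraction of tokens kept under the current probabilities.
    expected_kept = (sum(p * c for p, c in zip(keep_probs, token_counts))
                     / sum(token_counts))
    over_budget = max(0.0, expected_kept - budget_ratio)
    return ce + budget_weight * over_budget
```

With these illustrative weights, dropping a needed line costs more than keeping a few unneeded ones, which matches the priority the benchmarks put on recall.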

Benchmarks

At a fixed 35% budget (keep 35% of tokens):

Policy          | Oracle Recall | Entity Recall | Latency
────────────────┼───────────────┼───────────────┼────────
FIFO/Truncation |         25%  |         73%   | ~57ms
Summarization   |         61%  |         65%   | ~63ms
H2O             |         98%  |         73%   | ~56ms
SuperCompress   |        100%  |         73%   | ~60ms

100% oracle recall means the policy never dropped a line that the answer depended on, at the same 35% token budget as every baseline.
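For readers who want to compute the same kind of numbers on their own data, here is a minimal sketch of the two metrics. It assumes oracle recall is the fraction of answer-critical lines that survive compression, which is how the sentence above describes it; the function names are mine, not the project's.

```python
def oracle_recall(kept: set[int], oracle: set[int]) -> float:
    """Fraction of answer-critical line indices that survived compression."""
    if not oracle:
        return 1.0  # nothing was required, so nothing was lost
    return len(kept & oracle) / len(oracle)


def token_savings(original_tokens: int, kept_tokens: int) -> float:
    """Fraction of tokens removed before inference."""
    return 1.0 - kept_tokens / original_tokens


# Token counts from the architecture diagram: 1,247 in, 437 out.
print(f"{token_savings(1247, 437):.1%}")  # 65.0%
```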

Environmental Impact

Per 1 million compressions:

800M tokens avoided — that's real GPU time
29 kWh saved — enough to power a home for a day
12 kg CO₂ avoided — tiny but it adds up
52 L water saved — datacenter cooling is thirsty
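The conversion factors behind these figures are not spelled out above, so the snippet below only back-derives the per-unit rates that the four numbers imply. Nothing here is new data; the variable names are just for illustration.

```python
# Figures quoted above, per 1 million compressions.
COMPRESSIONS = 1_000_000
TOKENS_AVOIDED = 800_000_000
ENERGY_KWH = 29
CO2_KG = 12
WATER_L = 52

# Implied per-unit rates (derived, not measured).
tokens_per_compression = TOKENS_AVOIDED / COMPRESSIONS
wh_per_1k_tokens = ENERGY_KWH * 1000 / TOKENS_AVOIDED * 1000
co2_kg_per_kwh = CO2_KG / ENERGY_KWH
water_l_per_kwh = WATER_L / ENERGY_KWH

print(f"tokens avoided per compression: {tokens_per_compression:.0f}")
print(f"energy per 1K tokens avoided:   {wh_per_1k_tokens:.4f} Wh")
print(f"implied carbon intensity:       {co2_kg_per_kwh:.2f} kg CO2/kWh")
print(f"implied water intensity:        {water_l_per_kwh:.2f} L/kWh")
```

The first line is a useful sanity check: 800 tokens avoided per call is close to the 1,247 to 437 reduction shown in the architecture diagram.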

Getting Started

Python (in-process)

pip install git+https://github.com/arjunkshah/supercompress.git

from supercompress import compress_context

result = compress_context(
    "Your long context text here...",
    "What does this code do?",
    budget_ratio=0.35,
)
print(result.compressed_text)
print(f"{result.kv_savings_pct:.1f}% KV saved")

Hosted API (no local ML deps)

curl -X POST https://supercompress.vercel.app/api/v1/compress \
  -H "X-API-Key: sc_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"context":"...","query":"Summarize this"}'
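If you would rather call the hosted endpoint from Python without installing anything, the same request can be built with the standard library. This is a sketch: the URL, header name, and payload keys come from the curl command above, while the `Content-Type` header and the helper names `build_request` and `send` are my assumptions. The response schema is not shown in this post, so `send` just returns the raw body.

```python
import json
import urllib.request

API_URL = "https://supercompress.vercel.app/api/v1/compress"


def build_request(context: str, query: str, api_key: str) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps({"context": context, "query": query}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "X-API-Key": api_key,
            # Assumption: the endpoint expects a JSON content type.
            "Content-Type": "application/json",
        },
    )


def send(req: urllib.request.Request, timeout: float = 10.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


req = build_request("...", "Summarize this", "sc_live_YOUR_KEY")
print(req.get_method(), req.full_url)
# To actually call the API, swap in a real key and run: print(send(req))
```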

Browser demo (no setup needed)

Just visit supercompress.vercel.app and try the live demo.

What's Next

Adaptive compression ratios (not fixed budget)
Integration with LangChain/LlamaIndex as a built-in compressor
Quantized policy for even lower latency

The code is open source under MIT. Contributions welcome!

GitHub: https://github.com/arjunkshah/supercompress
Live demo: https://supercompress.vercel.app
Docs: https://arjunkshah-supercompress-55.mintlify.app
