DEV Community: Arjun Shah

SuperCompress is now on PyPI! pip install supercompress in 1 line

Arjun Shah — Fri, 26 Jun 2026 19:55:27 +0000

I just published SuperCompress to PyPI! 🎉

pip install supercompress — that's all it takes.

What is it?

A tiny ~5K-parameter CPU policy that scores every line of context for relevance before it is sent to the LLM. It keeps only what matters for the answer.

The Numbers

65% fewer tokens → same answers
100% oracle recall → never drops the answer line
~60ms CPU latency → no GPU needed
Open source → MIT with non-commercial clause

Quick Start

pip install supercompress

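from supercompress import compress

# Placeholder inputs; swap in your own context and question.
context = "Your long context text here..."
question = "What does this code do?"
result = compress(context, question)
print(f"Saved {result['kv_savings_pct']}% tokens")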

Live Demo

Try the interactive comparison tool: https://supercompress.vercel.app/compare

Or read the technical deep-dive: https://dev.to/arjunkshah/how-i-built-a-prompt-compressor-that-saves-65-on-llm-costs-3m80

GitHub: https://github.com/arjunkshah/supercompress
PyPI: https://pypi.org/project/supercompress/

I Built a Prompt Compressor That Saves 65% on LLM Costs — Here's the Story

Arjun Shah — Fri, 26 Jun 2026 19:45:49 +0000

I've been working on a side project called SuperCompress — an intelligent prompt compression system for LLMs. The idea is simple: most tokens you send to an LLM never need to be processed. They're padding, boilerplate, irrelevant context. But they still burn GPU cycles.

I wanted to fix that.

The Problem

Working with LLM agents, I noticed something: every agent loop was sending massive context through the GPU. 10K tokens. 50K tokens. Sometimes more. Most of it was irrelevant to the specific task.

Truncation (keeping head + tail) was the standard approach, but it regularly dropped critical information from the middle of the context.
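
To make the failure mode concrete, here is a minimal sketch of head-plus-tail truncation. The function name `truncate_head_tail` and the sample log lines are made up for illustration and are not part of SuperCompress.

```python
def truncate_head_tail(lines, keep):
    """Keep about `keep` lines: half from the head, the rest from the tail."""
    if keep >= len(lines):
        return list(lines)
    head = keep // 2
    tail = keep - head
    # Guard against tail == 0, because lines[len(lines):] would be fine but
    # lines[-0:] would return everything.
    return lines[:head] + (lines[len(lines) - tail:] if tail else [])


context = [
    "INFO service started",
    "INFO loading config",
    "ERROR db password expired",  # the line the answer depends on
    "INFO retrying connection",
    "INFO health check ok",
    "INFO shutting down",
]

kept = truncate_head_tail(context, keep=2)
print(kept)
print("answer line kept:", "ERROR db password expired" in kept)
```

With `keep=2` only the first and last lines survive, so the error in the middle of the log is gone before the model ever sees it.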

I thought: what if we could score each line of context for relevance BEFORE sending it to the GPU? A tiny CPU model that decides what matters.

The Build

The technical challenges were:

Train a lightweight policy (~5K params) that runs on CPU in under 60ms
Score each line of context relative to the user's question
Evict low-relevance lines while keeping answer-critical ones
Ensure the compressed output preserves correct answers
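
The score-then-evict loop described above can be sketched in a few lines. This is an illustration under loud assumptions: `overlap_score` is a crude word-overlap stand-in for the learned ~5K-parameter policy, and `compress_lines`, `tokens`, and the sample text are hypothetical names, not the SuperCompress API.

```python
import re


def tokens(text):
    """Lowercased word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def overlap_score(line, question):
    """Stand-in relevance score: share of question words found in the line.
    The real policy is a learned model; this is NOT it."""
    q = set(tokens(question))
    if not q:
        return 0.0
    return len(q & set(tokens(line))) / len(q)


def compress_lines(context, question, budget_ratio=0.35):
    """Greedily keep the highest-scoring lines that fit the token budget,
    then return the survivors in their original order."""
    lines = context.splitlines()
    total = sum(len(tokens(line)) for line in lines)
    budget = budget_ratio * total
    ranked = sorted(
        range(len(lines)),
        key=lambda i: overlap_score(lines[i], question),
        reverse=True,
    )
    kept, used = set(), 0
    for i in ranked:
        cost = len(tokens(lines[i]))
        if used + cost <= budget:  # lines that do not fit are skipped
            kept.add(i)
            used += cost
    return "\n".join(lines[i] for i in sorted(kept))


context = "\n".join([
    "The deploy script lives in tools/deploy.sh",
    "Lunch is served at noon in the cafeteria",
    "The database timeout is set to 30 seconds",
    "Parking passes renew every January",
    "Remember to water the office plants",
])
question = "What is the database timeout?"
print(compress_lines(context, question))
```

On this toy input only the database-timeout line fits the 35% budget. A learned scorer replaces `overlap_score`; the budgeted selection around it would look much the same.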

After a lot of iteration, the results surprised even me:

Policy	KV Saved	Oracle Recall
Truncation	65%	25%
H2O	65%	98%
SuperCompress	65%	100%

100% oracle recall at the same token savings. The policy never dropped a line the answer depended on.

The Environmental Angle

Here's what hit me hardest: at 50M agent turns per day (a conservative estimate for the industry), we're wasting 100B tokens daily. That's 24K GPU hours, 1,526 tons of CO₂, 6.5M liters of cooling water. Every day.

Per 1 million compressions, SuperCompress saves:

800M tokens avoided
29 kWh energy
12 kg CO₂
52 L cooling water

It's tiny per call. It's enormous at scale.
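
To show just how tiny, here is the per-million list above divided down to a single compression. The dictionary keys and the `per_call` helper are illustrative names; the figures are the ones quoted in this post.

```python
# Per-million-compressions figures quoted above.
PER_MILLION = {
    "tokens": 800_000_000,
    "energy_kwh": 29,
    "co2_kg": 12,
    "water_l": 52,
}


def per_call(per_million):
    """Divide each per-million figure by one million compressions."""
    return {name: value / 1_000_000 for name, value in per_million.items()}


one = per_call(PER_MILLION)
print(f"{one['tokens']:.0f} tokens per call")
print(f"{one['energy_kwh'] * 1000:.3f} Wh per call")
print(f"{one['co2_kg'] * 1000:.3f} g CO2 per call")
print(f"{one['water_l'] * 1000:.3f} mL water per call")
```

That works out to about 800 tokens, 0.029 Wh, 0.012 g of CO₂ and 0.052 mL of water per compression.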

Current Status

✅ Working policy with 100% oracle recall
✅ Benchmarks and tests (65 passing)
✅ Hosted API with free tier
✅ Browser demo (compresses in-browser)
✅ Python client library
✅ Integration guides (OpenAI, LangChain, LlamaIndex)
✅ Open source (MIT)

Currently looking for:

First real users and feedback
Integration partners
Contributors to the open-source codebase

Try It

Live demo: https://supercompress.vercel.app
GitHub: https://github.com/arjunkshah/supercompress
Docs: https://arjunkshah-supercompress-55.mintlify.app

The ask: If you're building with LLMs, try compressing your next prompt. See if the answers stay the same. I'd love to hear what you think.

Now available on PyPI! pip install supercompress


SuperCompress: Cut LLM Costs by 65% Without Losing Answers

Arjun Shah — Fri, 26 Jun 2026 19:23:33 +0000

Tweet 1

Every LLM call burns GPU cycles on tokens that never needed to run.

Padding. Boilerplate. Irrelevant context.

I built SuperCompress — a tiny CPU policy that cuts 65% of tokens before inference.

Open source. MIT. Free tier.

supercompress.vercel.app

Tweet 2

The problem is worse than most people realize.

At ~50M agent turns/day:

→ 100B tokens wasted daily

→ 24K GPU hours

→ 1,526 tons CO₂

→ 6.5M L cooling water

We're burning through resources on tokens that don't matter.

Tweet 3

How it works:

1️⃣ Context + question → CPU policy (5K params)

2️⃣ Every line scored for relevance to the question

3️⃣ Low-scoring lines evicted

4️⃣ Only essential tokens reach the GPU

CPU first. GPU for what matters.

Tweet 4

The numbers at 35% budget:

• 65% KV cache saved

• 100% oracle recall (vs 25% for truncation)

• ~60ms CPU latency

Same answers. ⅓ the compute.

Tweet 5

Per 1 million compressions:

→ 800M tokens avoided

→ 29 kWh saved

→ 12 kg CO₂ avoided

→ 52 L cooling water saved

Scale that across the industry and it's enormous.

Tweet 6

SuperCompress is:

✅ Open source (MIT)

✅ Free API tier

✅ Python library

✅ Browser demo (no install)

✅ Integration guides for OpenAI/LangChain

Try it: supercompress.vercel.app

GitHub: github.com/arjunkshah/supercompress

Tweet 7

Built this because I believe we can't scale AI by burning through what we have left.

Smarter compute means more AI for everyone — without the environmental cost.

Would love feedback from the community 🙏

#LLM #AI #OpenSource #MachineLearning


How I Built a Prompt Compressor That Saves 65% on LLM Costs

Arjun Shah — Fri, 26 Jun 2026 19:15:11 +0000


Every time you call an LLM, tokens that never needed to be processed burn GPU cycles, waste money, and strain the grid. The problem gets worse with every agent loop, every long-context RAG query, every multi-turn conversation.

I built SuperCompress — a tiny ~5K parameter CPU policy that scores every line of context for relevance before inference, keeping only what the model needs.

The results? 65% fewer tokens, 100% oracle recall, ~60ms latency. Open source. MIT licensed.

The Problem: LLMs Are Wasteful

Modern LLMs process every token you give them. On long contexts (think agent logs, RAG results, codebases), most of those tokens are padding — irrelevant boilerplate that consumes KV cache space without contributing to the answer.

The standard approaches don't work well:

Approach	Tokens Saved	Answer Quality
Truncation (keep head/tail)	~65%	~25% recall
FIFO eviction	~65%	~25% recall
H2O	~65%	~98% recall
SuperCompress	~65%	100% recall

At the same KV savings, SuperCompress keeps every answer-critical line: 100% recall, versus ~98% for H2O and ~25% for truncation and FIFO eviction.

The Architecture: CPU-First Eviction

The key insight: you don't need a GPU to decide what a GPU should process.

┌──────────────┐     ┌──────────────┐     ┌───────────┐
│  Context In  │ ──→ │  CPU Policy  │ ──→ │  GPU LLM  │
│ (1,247 tok)  │     │  (5K params) │     │ (437 tok) │
└──────────────┘     └──────────────┘     └───────────┘
                            │
                            ↓
                     Score each line
                     Drop low-relevance
                     Keep answer-critical

The policy is a lightweight neural network (~5,000 parameters) that runs entirely on CPU. It takes each line of context together with the user's question and scores how relevant that line is to answering the question. Lines that score below a threshold get evicted.
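
The thresholding step on its own is simple once scores exist. A minimal sketch, assuming one score per line in the range 0 to 1; the function name `evict_below_threshold`, the 0.5 cutoff, and the scores are made up for illustration and do not come from the real policy.

```python
def evict_below_threshold(lines, scores, threshold=0.5):
    """Return the lines scoring at or above `threshold`, in original order."""
    if len(lines) != len(scores):
        raise ValueError("need exactly one score per line")
    return [line for line, score in zip(lines, scores) if score >= threshold]


lines = ["def add(a, b):", "    # TODO: tidy up", "    return a + b"]
scores = [0.91, 0.08, 0.87]  # made-up values standing in for the policy output
print(evict_below_threshold(lines, scores))
```

Here the comment line is evicted and the two lines that define the function are kept in order.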

Training Approach

The policy was trained on a dataset of:

Long-form text passages (books, documentation, code)
Paired with realistic user questions
Ground-truth relevance labels from oracle LLM judgments

The training objective balances:

Token savings — maximize KV reduction
Recall — preserve lines needed for correct answers
Latency — keep inference under 100ms on CPU

Benchmarks

At a fixed 35% budget (keep 35% of tokens):

Policy          | Oracle Recall | Entity Recall | Latency
────────────────┼───────────────┼───────────────┼────────
FIFO/Truncation |         25%  |         73%   | ~57ms
Summarization   |         61%  |         65%   | ~63ms
H2O             |         98%  |         73%   | ~56ms
SuperCompress   |        100%  |         73%   | ~60ms

100% oracle recall means the policy never dropped a line that the answer depended on, at the same compute savings as the other policies.
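
For readers who want to reproduce the two headline metrics on their own data, here is one plausible way to compute them. The post does not spell out the exact formulas, so treat `kv_savings_pct` and `oracle_recall` below as assumptions: savings as the share of input tokens removed, and recall as the share of answer-critical line indices that survive.

```python
def kv_savings_pct(tokens_in, tokens_out):
    """Percentage of input tokens that never reach the model."""
    if tokens_in <= 0:
        raise ValueError("tokens_in must be positive")
    return 100.0 * (1 - tokens_out / tokens_in)


def oracle_recall(kept_lines, oracle_lines):
    """Fraction of answer-critical (oracle) lines that survived compression."""
    oracle = set(oracle_lines)
    if not oracle:
        return 1.0  # nothing the answer depends on, so nothing was lost
    return len(oracle & set(kept_lines)) / len(oracle)


# Token counts taken from the architecture diagram in this post.
print(round(kv_savings_pct(1247, 437), 1))
# Line indices below are invented for the example.
print(oracle_recall(kept_lines=[2, 5, 9], oracle_lines=[5, 9]))
print(oracle_recall(kept_lines=[0, 1, 9], oracle_lines=[5, 9]))
```

The diagram's 1,247 to 437 tokens comes out at 65.0% saved. The two recall calls print 1.0 and 0.5: the second run dropped one of the two lines the answer needed.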

Environmental Impact

Per 1 million compressions:

800M tokens avoided — that's real GPU time
29 kWh saved — enough to power a home for a day
12 kg CO₂ avoided — tiny but it adds up
52 L water saved — datacenter cooling is thirsty

Getting Started

Python (in-process)

pip install git+https://github.com/arjunkshah/supercompress.git

from supercompress import compress_context

result = compress_context(
    "Your long context text here...",
    "What does this code do?",
    budget_ratio=0.35,
)
print(result.compressed_text)
print(f"{result.kv_savings_pct:.1f}% KV saved")

Hosted API (no local ML deps)

curl -X POST https://supercompress.vercel.app/api/v1/compress \
  -H "X-API-Key: sc_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"context":"...","query":"Summarize this"}'
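
If you would rather call the API from Python without extra dependencies, the same request can be built with the standard library. This is a sketch that mirrors the curl call above; the helper name `build_compress_request` is made up, the JSON content type is an assumption, and the response schema is not documented in this post.

```python
import json
import urllib.request

API_URL = "https://supercompress.vercel.app/api/v1/compress"


def build_compress_request(context, query, api_key):
    """Build (but do not send) a POST request mirroring the curl call."""
    body = json.dumps({"context": context, "query": query}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "X-API-Key": api_key,
            # Assumption: the API expects a JSON content type.
            "Content-Type": "application/json",
        },
    )


req = build_compress_request("...", "Summarize this", "sc_live_YOUR_KEY")
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req). Inspect the raw response body
# before relying on any particular field name.
```

Building the request separately from sending it keeps the example runnable offline and makes the payload easy to inspect.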

Browser demo (no setup needed)

Just visit supercompress.vercel.app and try the live demo.

What's Next

Adaptive compression ratios (not fixed budget)
Integration with LangChain/LlamaIndex as a built-in compressor
Quantized policy for even lower latency

The code is open source under MIT. Contributions welcome!

GitHub: https://github.com/arjunkshah/supercompress
Live demo: https://supercompress.vercel.app
Docs: https://arjunkshah-supercompress-55.mintlify.app