Arjun Shah

SuperCompress: Cut LLM Costs by 65% Without Losing Answers

Tweet 1

Every LLM call burns GPU cycles on tokens that never needed to run.

Padding. Boilerplate. Irrelevant context.

I built SuperCompress — a tiny CPU policy that cuts 65% of tokens before inference.

Open source. MIT. Free tier.

supercompress.vercel.app

Tweet 2

The problem is worse than most people realize.

At ~50M agent turns/day:

→ 100B tokens wasted daily

→ 24K GPU hours

→ 1,526 tons CO₂

→ 6.5M L cooling water

We're burning through resources on tokens that don't matter.

Tweet 3

How it works:

1️⃣ Context + question → CPU policy (5K params)

2️⃣ Every line scored for relevance to the question

3️⃣ Low-scoring lines evicted

4️⃣ Only essential tokens reach the GPU

CPU first. GPU for what matters.
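The four steps above can be sketched in a few lines of Python. This is only an illustration, not SuperCompress's actual code: the real scorer is a trained 5K-parameter policy, while this sketch swaps in plain word overlap with the question. The names `tokenize`, `score_line`, and `compress` are hypothetical, and the regex word splitter is a crude stand-in for a real tokenizer.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercased word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_line(line: str, question_terms: set[str]) -> float:
    """Fraction of the line's tokens that also appear in the question.

    Assumption: word overlap replaces the trained relevance policy.
    """
    tokens = tokenize(line)
    if not tokens:
        return 0.0
    return sum(t in question_terms for t in tokens) / len(tokens)


def compress(context: str, question: str, budget: float = 0.35) -> str:
    """Keep the highest-scoring lines that fit within `budget` of the tokens."""
    lines = context.splitlines()
    question_terms = set(tokenize(question))
    costs = [len(tokenize(line)) for line in lines]
    limit = budget * sum(costs)

    # Rank line indices by relevance, highest first (ties keep document order).
    ranked = sorted(
        range(len(lines)),
        key=lambda i: -score_line(lines[i], question_terms),
    )

    kept: set[int] = set()
    used = 0
    for i in ranked:
        # Evict empty lines and any line that would push us over the budget.
        if costs[i] == 0 or used + costs[i] > limit:
            continue
        kept.add(i)
        used += costs[i]

    # Emit the surviving lines in their original order.
    return "\n".join(lines[i] for i in sorted(kept))
```

Calling `compress(context, question)` returns only the lines that scored well and fit the 35% budget, so that shorter string is what would be sent to the model.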

Tweet 4

The numbers at 35% budget:

• 65% KV cache saved

• 100% oracle recall (vs 25% for truncation)

• ~60ms CPU latency

Same answers. About ⅓ the compute.

Tweet 5

Per 1 million compressions:

→ 800M tokens avoided

→ 29 kWh saved

→ 12 kg CO₂ avoided

→ 52 L cooling water saved

Scale that across the industry and it's enormous.
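To get a feel for the per-million figures, here is a small sketch that rescales them to any number of compressions. It assumes savings grow linearly with volume, which the thread implies but doesn't state. The `savings` helper and the `PER_MILLION` table are hypothetical names for illustration, not part of the library.

```python
# Savings per 1 million compressions, copied from the figures above.
PER_MILLION = {
    "tokens": 800_000_000,
    "kwh": 29,
    "kg_co2": 12,
    "litres_water": 52,
}


def savings(compressions: int) -> dict[str, float]:
    """Scale the per-million figures linearly (assumption: linear scaling)."""
    return {
        name: value * compressions / 1_000_000
        for name, value in PER_MILLION.items()
    }


if __name__ == "__main__":
    for name, value in savings(1).items():
        print(f"{name}: {value}")
```

On these figures a single compression avoids 800 tokens.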

Tweet 6

SuperCompress is:

✅ Open source (MIT)

✅ Free API tier

✅ Python library

✅ Browser demo (no install)

✅ Integration guides for OpenAI/LangChain

Try it: supercompress.vercel.app

GitHub: github.com/arjunkshah/supercompress

Tweet 7

Built this because I believe we can't scale AI by burning through what we have left.

Smarter compute means more AI for everyone — without the environmental cost.

Would love feedback from the community 🙏

#LLM #AI #OpenSource #MachineLearning


Links: GitHub | Live Demo | Interactive Tool
