Tweet 1
Every LLM call burns GPU cycles on tokens that never needed to run.
Padding. Boilerplate. Irrelevant context.
I built SuperCompress — a tiny CPU policy that cuts 65% of tokens before inference.
Open source. MIT. Free tier.
supercompress.vercel.app
Tweet 2
The problem is worse than most people realize.
At ~50M agent turns/day:
→ 100B tokens wasted daily
→ 24K GPU hours
→ 1,526 kg CO₂
→ 6.5K L cooling water
We're burning through resources on tokens that don't matter.
Tweet 3
How it works:
1️⃣ Context + question → CPU policy (5K params)
2️⃣ Every line scored for relevance to the question
3️⃣ Low-scoring lines evicted
4️⃣ Only essential tokens reach the GPU
CPU first. GPU for what matters.
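A minimal sketch of the score-and-evict steps above. This is not the SuperCompress policy or its API: the word-overlap scorer is a stand-in for the learned 5K-param policy, regex word counts stand in for real token counts, and the names `tokenize`, `score_line`, and `compress` are made up for illustration.

```python
import re


def tokenize(text):
    # Crude word-level split; an assumption standing in for a real tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())


def score_line(line, question_terms):
    # Stand-in scorer: fraction of question terms that appear in the line.
    if not question_terms:
        return 0.0
    return len(set(tokenize(line)) & question_terms) / len(question_terms)


def compress(context, question, budget=0.35):
    """Keep the highest-scoring lines that fit within `budget` of the tokens."""
    lines = [ln for ln in context.splitlines() if ln.strip()]
    q_terms = set(tokenize(question))
    costs = [len(tokenize(ln)) for ln in lines]
    limit = budget * sum(costs)

    # Rank line indices by relevance, best first (ties keep document order).
    ranked = sorted(range(len(lines)),
                    key=lambda i: -score_line(lines[i], q_terms))

    kept, used = [], 0
    for i in ranked:
        if used + costs[i] <= limit:
            kept.append(i)
            used += costs[i]

    # Evicted lines are dropped; survivors are emitted in original order.
    return "\n".join(lines[i] for i in sorted(kept))


context = """The server listens on port 8080.
Copyright 2024 Example Corp. All rights reserved.
Logging is written to a rotating file each night.
Thanks for reading this far into the document."""

print(compress(context, "What port does the server listen on?"))
```

On this toy context, only the line about the port fits inside the 35% budget; the other three lines are evicted before anything would reach the GPU.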
Tweet 4
The numbers at 35% budget:
• 65% KV cache saved
• 100% oracle recall (vs 25% for truncation)
• ~60ms CPU latency
Same answers. ⅓ the compute.
Tweet 5
Per 1 million compressions:
→ 800M tokens avoided
→ 29 kWh saved
→ 12 kg CO₂ avoided
→ 52 L cooling water saved
Scale that across the industry and it's enormous.
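The per-unit rates implied by those figures, as a back-of-envelope sketch. It assumes savings scale linearly with the number of compressions; the constant names and the `scale` helper are illustrative, not part of SuperCompress.

```python
# Figures quoted per 1 million compressions.
COMPRESSIONS = 1_000_000
TOKENS_AVOIDED = 800_000_000
KWH_SAVED = 29
KG_CO2_AVOIDED = 12
LITERS_WATER_SAVED = 52

# Implied per-unit rates.
tokens_per_compression = TOKENS_AVOIDED / COMPRESSIONS       # 800 tokens
kwh_per_million_tokens = KWH_SAVED / (TOKENS_AVOIDED / 1e6)  # 0.03625 kWh
kg_co2_per_kwh = KG_CO2_AVOIDED / KWH_SAVED                  # about 0.41 kg
liters_per_kwh = LITERS_WATER_SAVED / KWH_SAVED              # about 1.8 L


def scale(compressions):
    # Linear extrapolation (an assumption) to any number of compressions.
    factor = compressions / COMPRESSIONS
    return {
        "tokens": TOKENS_AVOIDED * factor,
        "kwh": KWH_SAVED * factor,
        "kg_co2": KG_CO2_AVOIDED * factor,
        "liters_water": LITERS_WATER_SAVED * factor,
    }


print(scale(10_000_000))
```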
Tweet 6
SuperCompress is:
✅ Open source (MIT)
✅ Free API tier
✅ Python library
✅ Browser demo (no install)
✅ Integration guides for OpenAI/LangChain
Try it: supercompress.vercel.app
GitHub: github.com/arjunkshah/supercompress
Tweet 7
Built this because I believe we can't scale AI by burning through what we have left.
Smarter compute means more AI for everyone — without the environmental cost.
Would love feedback from the community 🙏
#LLM #AI #OpenSource #MachineLearning
Links: GitHub | Live Demo | Interactive Tool