DEV Community

Arjun Shah


I Built a Prompt Compressor That Saves 65% on LLM Costs — Here's the Story

I've been working on a side project called SuperCompress — an intelligent prompt compression system for LLMs. The idea is simple: most tokens you send to an LLM never need to be processed. They're padding, boilerplate, irrelevant context. But they still burn GPU cycles.

I wanted to fix that.

The Problem

Working with LLM agents, I noticed something: every agent loop was sending massive context through the GPU. 10K tokens. 50K tokens. Sometimes more. Most of it was irrelevant to the specific task.

Truncation (keeping the head and tail of the context and cutting everything in between) was the standard approach, but it regularly dropped critical information from the middle.
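To make the failure mode concrete, here is a minimal sketch of head + tail truncation. The function name and the toy log lines are made up for illustration; this is not code from SuperCompress.

```python
def truncate_head_tail(lines, keep):
    """Keep the first and last lines of the context and drop the middle."""
    if keep >= len(lines):
        return list(lines)
    head = (keep + 1) // 2  # the head gets the extra line when keep is odd
    tail = keep - head
    return lines[:head] + (lines[len(lines) - tail:] if tail else [])


context = [f"log line {i}" for i in range(10)]
context[5] = "ERROR: database password expired"  # the line the answer depends on

kept = truncate_head_tail(context, keep=4)
print(kept)
print("critical line survived:", context[5] in kept)
```

The critical line sits in the middle, so it is cut no matter how relevant it is to the question. Truncation never looks at content, only at position.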

I thought: what if we could score each line of context for relevance BEFORE sending it to the GPU? A tiny CPU model that decides what matters.

The Build

The technical challenge had four parts:

  1. Train a lightweight policy (~5K params) that runs on CPU in under 60ms
  2. Score each line of context relative to the user's question
  3. Evict low-relevance lines while keeping answer-critical ones
  4. Ensure the compressed output preserves correct answers
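The score-then-evict loop in steps 2 and 3 can be sketched roughly like this. Big caveat: the real policy is a trained model, and this sketch swaps it for plain word overlap with the question so the example runs on its own. The names `tokenize`, `score_line`, `compress`, and `keep_fraction` are hypothetical and not the SuperCompress API.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a crude stand-in for learned features."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score_line(line, question_tokens):
    """Stand-in relevance score: fraction of question words found in the line."""
    if not question_tokens:
        return 0.0
    return len(tokenize(line) & question_tokens) / len(question_tokens)


def compress(lines, question, keep_fraction=0.35):
    """Keep the top-scoring lines and preserve their original order."""
    q = tokenize(question)
    budget = max(1, round(len(lines) * keep_fraction))
    # Rank line indices by score; ties go to the earlier line.
    ranked = sorted(range(len(lines)), key=lambda i: (-score_line(lines[i], q), i))
    keep = sorted(ranked[:budget])
    return [lines[i] for i in keep]


context = [
    "Service started on port 8080",
    "Cache warmed in 120 ms",
    "Payment failed: card declined for order 7731",
    "Heartbeat ok",
    "Heartbeat ok",
    "User logged out",
]
question = "Why did the payment for order 7731 fail?"
print(compress(context, question))
```

Keeping 35% of the lines corresponds to the 65% savings target. Unlike truncation, the line in the middle survives because it is scored on content rather than position.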

After a lot of iteration, the results surprised even me:

Policy          KV Saved    Oracle Recall
Truncation      65%         25%
H2O             65%         98%
SuperCompress   65%         100%

SuperCompress hit 100% oracle recall at the same 65% savings as the baselines. The policy never dropped a line the answer depended on.
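For readers who want to reproduce this kind of comparison, here is one way the two metrics could be computed. The definitions are my assumptions for this sketch: "saved" is the fraction of whitespace-separated tokens removed (a real benchmark would use the model's tokenizer), and oracle recall is the share of answer-critical lines that survive compression.

```python
def count_tokens(lines):
    """Approximate token count using whitespace splitting."""
    return sum(len(line.split()) for line in lines)


def kv_saved(original_lines, kept_lines):
    """Fraction of (approximate) tokens removed by compression."""
    total = count_tokens(original_lines)
    return 1 - count_tokens(kept_lines) / total if total else 0.0


def oracle_recall(kept_lines, critical_lines):
    """Fraction of answer-critical lines that survived compression."""
    if not critical_lines:
        return 1.0
    kept = set(kept_lines)
    return sum(line in kept for line in critical_lines) / len(critical_lines)


original = ["a b", "c d", "e f", "g h"]
kept = ["c d"]
critical = ["c d", "g h"]
print(kv_saved(original, kept), oracle_recall(kept, critical))
```

Reporting both numbers together matters: high savings mean little if recall drops, which is the pattern the truncation row shows.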

The Environmental Angle

Here's what hit me hardest: at 50M agent turns per day (a conservative estimate for the industry), we're wasting 100B tokens daily. That's 24K GPU hours, 1,526 tons of CO₂, 6.5M liters of cooling water. Every day.

Per 1 million compressions, SuperCompress saves:

  • 800M tokens avoided
  • 29 kWh energy
  • 12 kg CO₂
  • 52 L cooling water

It's tiny per call. It's enormous at scale.
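A quick way to see the "tiny per call" part is to scale the per-million figures above down to a single compression. This sketch assumes the savings scale linearly with the number of compressions, which is a simplification; the helper name `savings_for` is made up for illustration.

```python
# Per-million-compression figures quoted above.
PER_MILLION = {
    "tokens": 800_000_000,
    "energy_kwh": 29,
    "co2_kg": 12,
    "water_l": 52,
}


def savings_for(compressions):
    """Scale the per-million figures linearly to any number of compressions."""
    factor = compressions / 1_000_000
    return {name: value * factor for name, value in PER_MILLION.items()}


per_call = savings_for(1)
print(f"{per_call['tokens']:.0f} tokens per call")
print(f"{per_call['co2_kg'] * 1000:.3f} g CO2 per call")
```

That works out to roughly 800 tokens and about a hundredth of a gram of CO₂ per compression.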

Current Status

  • ✅ Working policy with 100% oracle recall
  • ✅ Benchmarks and tests (65 passing)
  • ✅ Hosted API with free tier
  • ✅ Browser demo (compresses in-browser)
  • ✅ Python client library
  • ✅ Integration guides (OpenAI, LangChain, LlamaIndex)
  • ✅ Open source (MIT)

Currently looking for:

  • First real users and feedback
  • Integration partners
  • Contributors to the open-source codebase

Try It

Live demo: https://supercompress.vercel.app
GitHub: https://github.com/arjunkshah/supercompress
Docs: https://arjunkshah-supercompress-55.mintlify.app

The ask: If you're building with LLMs, try compressing your next prompt. See if the answers stay the same. I'd love to hear what you think.


Now available on PyPI! Install it with: pip install supercompress

