DEV Community

Rishav E. Kejriwal

I Cut My LLM API Bill in Half with a Single Python Library

Last month I was debugging why our agent pipeline was burning through $400/day in OpenAI tokens. Turns out 60% of what we were feeding GPT-4 was redundant — repeated JSON schemas, duplicate log blocks, unchanged diff context, verbose imports.

I tried prompt trimming by hand. Tedious. I tried LLMLingua. Better, but it needs a GPU and the fidelity wasn't great at high compression.

Then I found claw-compactor and honestly I'm a bit mad I didn't find it sooner.

What It Actually Does

It's a 14-stage compression pipeline that sits between your data and the LLM. No neural network, no inference cost — pure deterministic transforms. You feed it code, JSON, logs, diffs, whatever, and it spits out a compressed version that preserves meaning but costs way fewer tokens.
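
To make "pure deterministic transforms" concrete, here's a toy sketch of the staged-pipeline idea (the stage names and behavior are mine, not claw-compactor's internals):

```python
# Toy sketch of a staged, deterministic text-compression pipeline.
# The stages here are illustrative, not claw-compactor's actual internals.
from typing import Callable, List

Stage = Callable[[str], str]

def strip_trailing_whitespace(text: str) -> str:
    return "\n".join(line.rstrip() for line in text.splitlines())

def collapse_blank_lines(text: str) -> str:
    # Squeeze runs of blank lines down to a single blank line.
    out, prev_blank = [], False
    for line in text.splitlines():
        if line.strip():
            out.append(line)
            prev_blank = False
        elif not prev_blank:
            out.append("")
            prev_blank = True
    return "\n".join(out)

def run_pipeline(text: str, stages: List[Stage]) -> str:
    # Pure functions only: same input always yields the same output,
    # so there's no model inference and results are reproducible.
    for stage in stages:
        text = stage(text)
    return text
```

Because every stage is a pure function, this kind of compression can run inline on every request with no GPU and no output variance.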

The compression rates are kind of nuts:

  • JSON payloads: 82% reduction
  • Build logs: 76% reduction
  • Python source: 25% reduction
  • Git diffs: 40%+ reduction

Weighted average across real workloads: ~54% fewer tokens.
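
For context on how a ~54% blended number falls out of those per-type rates, here's the arithmetic under a hypothetical workload mix (the weights below are my guess, not the author's measured traffic):

```python
# Per-type reductions are from the post; the workload weights are
# hypothetical -- your own traffic mix determines the blended figure.
reductions = {"json": 0.82, "logs": 0.76, "python": 0.25, "diffs": 0.40}
weights = {"json": 0.30, "logs": 0.20, "python": 0.35, "diffs": 0.15}

blended = sum(reductions[k] * weights[k] for k in reductions)
# With this mix, blended lands in the ~54% ballpark.
```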

Why I Actually Switched From LLMLingua

I was using LLMLingua-2 before. It works, but:

  1. It needs a model to run (GPU or slow CPU inference)
  2. At 0.3 compression rate, ROUGE-L fidelity was 0.346 — basically mangling the content
  3. Can't reverse the compression

claw-compactor at the same 0.3 rate? ROUGE-L of 0.653. Almost twice the fidelity. And zero inference cost because it's all deterministic.

Plus it has this RewindStore feature where you can actually get the original content back from a compressed marker. Try doing that with a neural compressor.
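
I haven't dug into RewindStore's actual API, but the underlying idea (swap a block for a short marker, keep the original on the side so you can restore it later) can be sketched like this; the class and method names below are invented for illustration:

```python
import hashlib

class MarkerStore:
    """Toy sketch of reversible compression: replace a block with a short
    marker and keep the original keyed by a content hash. This is the
    concept, not the real RewindStore API."""

    def __init__(self) -> None:
        self._originals: dict[str, str] = {}

    def compress(self, block: str) -> str:
        # Derive a short, stable key from the content itself.
        key = hashlib.sha1(block.encode()).hexdigest()[:8]
        self._originals[key] = block
        return f"<<compressed:{key}>>"

    def rewind(self, marker: str) -> str:
        # Strip the marker wrapper and look the original back up.
        key = marker.removeprefix("<<compressed:").removesuffix(">>")
        return self._originals[key]
```

A neural compressor throws the original away; a marker scheme like this keeps it one lookup away.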

How I'm Using It

We have an agent that processes GitHub issues — fetches the issue, relevant code, CI logs, and prior conversations, then asks the LLM to triage.

Before compression, a typical context was ~12K tokens. After piping everything through FusionEngine:

from scripts.lib.fusion.engine import FusionEngine

engine = FusionEngine()
result = engine.compress_messages(messages)
# 12K tokens → 5.5K tokens, zero information loss on the stuff that matters

The 14 stages each handle a different content type. The cool part is it auto-detects what's code, what's JSON, what's a log — you don't need to tell it.
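
The auto-detection claim is plausible because content types have strong surface signals. Here's a rough sketch of how such a detector could work; these heuristics are mine, not the library's:

```python
# Crude content-type heuristics, illustrative only -- a real detector
# (claw-compactor's included) will be considerably more sophisticated.
import json
import re

def detect_content_type(text: str) -> str:
    stripped = text.strip()
    # JSON: starts like JSON and parses cleanly.
    if stripped[:1] in "{[":
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass
    # Logs: most lines open with a timestamp or a level tag.
    lines = stripped.splitlines()
    log_pat = re.compile(r"^\d{4}-\d{2}-\d{2}|^(DEBUG|INFO|WARN|ERROR)\b")
    if lines and sum(bool(log_pat.match(l)) for l in lines) / len(lines) > 0.5:
        return "log"
    # Code: presence of definition/import keywords at line starts.
    if re.search(r"^\s*(def |class |import |from )", stripped, re.M):
        return "code"
    return "text"
```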

Some stages that impressed me:

  • SemanticDedup — SimHash fingerprinting to find near-duplicate blocks across your entire conversation. Killed about 20% of our tokens right there.
  • Ionizer — Sees 100 JSON objects with the same schema? Samples a representative subset and summarizes the rest. Brutal efficiency.
  • LogCrunch — "This line repeated 847 times" instead of sending 847 lines.
  • Neurosyntax — Actual AST-aware code compression. Knows the difference between meaningful code and boilerplate.
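
Of these, SemanticDedup is the one worth seeing in miniature. Here's a textbook SimHash sketch (not the library's code) showing why near-duplicate blocks end up with nearby fingerprints:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    # Classic SimHash: every token votes on every bit, so texts that
    # share most tokens get fingerprints with a small Hamming distance.
    v = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

near_a = simhash("connection timed out after 30 seconds, retrying")
near_b = simhash("connection timed out after 31 seconds, retrying")
far_c = simhash("user logged in from a new device")
# near_a and near_b share most tokens, so their distance is small;
# far_c shares none, so its distance sits near the random baseline.
```

Blocks whose fingerprints fall within a small distance threshold can be treated as duplicates and collapsed, without ever calling a model.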

The Numbers

For our pipeline specifically:

Before                 After                  Savings
~$400/day              ~$185/day              $6,450/month
12K avg tokens/call    5.5K avg tokens/call   54% reduction
N/A                    Zero added latency     No GPU needed
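
The dollar math in that table checks out, assuming a 30-day month and cost scaling linearly with tokens:

```python
# Sanity-checking the table: savings scale with the tokens removed.
cost_before, cost_after = 400, 185          # $/day, figures from the post
daily_savings = cost_before - cost_after    # $215/day
monthly_savings = daily_savings * 30        # $6,450 over a 30-day month

token_reduction = 1 - 5.5 / 12              # 12K -> 5.5K tokens per call
# token_reduction comes out just over 0.54, i.e. the quoted 54%.
```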

The library itself is zero-dependency on Python 3.9+. You can optionally add tiktoken for exact token counts and tree-sitter-language-pack for AST-level code analysis.

What It's Not

This isn't magic. It won't compress a well-written 500-word prompt that's already tight. It shines when you're feeding the LLM structured data, code, logs, or conversations — the kind of bloated context that agent systems generate.

If you're running a chatbot with short user messages, you probably don't need this. If you're building an AI agent that processes real-world data, you probably do.

Try It

git clone https://github.com/open-compress/claw-compactor.git
cd claw-compactor
python3 scripts/mem_compress.py /your/workspace benchmark

The benchmark command does a dry run — shows you exactly how much each stage would compress without changing anything. Start there.

1,676 tests passing, MIT licensed, zero dependencies. Not sure what else you'd want.

GitHub repo: https://github.com/open-compress/claw-compactor


Curious if anyone else is running token compression in production. What's your setup?

Top comments (14)

Ling Yu 煜灵境:
Learned something new today. Thanks for putting this together!

Jian T.:
Solid article. The practical examples really help illustrate the concepts.

Freni Stefano:
Really well written. Bookmarked for future reference.

C. X:
Great insights! This is really helpful for developers working in this space.

A. AI:
54% compression with zero deps? Take my star.

TDM (e/λ) (L8 vibe coder):
Great insights! This is really helpful for developers working in this space.

Blake B. Heron:
Love the practical approach. Theory is nice, but hands-on examples are better.

EchoGhostLabs K.:
Interesting perspective. I've had similar experiences in my projects.

ConcernedCitizen H.:
Learned something new today. Thanks for putting this together!

Dennis M.:
Learned something new today. Thanks for putting this together!