<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rishav E. Kejriwal</title>
    <description>The latest articles on DEV Community by Rishav E. Kejriwal (@78_bola11605).</description>
    <link>https://dev.to/78_bola11605</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3830494%2Fbd2c154a-d123-49ed-b3bf-68e4da2a0433.jpg</url>
      <title>DEV Community: Rishav E. Kejriwal</title>
      <link>https://dev.to/78_bola11605</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/78_bola11605"/>
    <language>en</language>
    <item>
      <title>I Cut My LLM API Bill in Half with a Single Python Library</title>
      <dc:creator>Rishav E. Kejriwal</dc:creator>
      <pubDate>Wed, 18 Mar 2026 08:39:23 +0000</pubDate>
      <link>https://dev.to/78_bola11605/i-cut-my-llm-api-bill-in-half-with-a-single-python-library-57lo</link>
      <guid>https://dev.to/78_bola11605/i-cut-my-llm-api-bill-in-half-with-a-single-python-library-57lo</guid>
      <description>&lt;p&gt;Last month I was debugging why our agent pipeline was burning through $400/day in OpenAI tokens. Turns out 60% of what we were feeding GPT-4 was redundant — repeated JSON schemas, duplicate log blocks, unchanged diff context, verbose imports.&lt;/p&gt;

&lt;p&gt;I tried prompt trimming by hand. Tedious. I tried LLMLingua. Better, but it needs a GPU and the fidelity wasn't great at high compression.&lt;/p&gt;

&lt;p&gt;Then I found &lt;a href="https://github.com/open-compress/claw-compactor" rel="noopener noreferrer"&gt;claw-compactor&lt;/a&gt; and honestly I'm a bit mad I didn't find it sooner.&lt;/p&gt;

&lt;h2&gt;What It Actually Does&lt;/h2&gt;

&lt;p&gt;It's a 14-stage compression pipeline that sits between your data and the LLM. No neural network, no inference cost — pure deterministic transforms. You feed it code, JSON, logs, diffs, whatever, and it spits out a compressed version that preserves meaning but costs way fewer tokens.&lt;/p&gt;
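
&lt;p&gt;The post doesn't show what those deterministic transforms look like inside, but the general pattern is easy to sketch. Here's a toy two-stage pipeline of my own — these stage names and functions are mine for illustration, not claw-compactor's code:&lt;/p&gt;

```python
# Illustrative sketch of a staged, deterministic compression pipeline.
# Stage names and logic are hypothetical, not claw-compactor's internals.

def strip_blank_runs(text: str) -> str:
    """Collapse runs of blank lines down to a single blank line."""
    out, was_blank = [], False
    for line in text.splitlines():
        if line.strip():
            out.append(line)
            was_blank = False
        elif not was_blank:
            out.append("")
            was_blank = True
    return "\n".join(out)

def dedupe_exact_lines(text: str) -> str:
    """Drop exact repeats of long lines, keeping the first occurrence."""
    seen, out = set(), []
    for line in text.splitlines():
        key = line.strip()
        if len(key) > 40 and key in seen:
            continue  # already sent this line once; skip the duplicate
        seen.add(key)
        out.append(line)
    return "\n".join(out)

STAGES = [strip_blank_runs, dedupe_exact_lines]

def compress(text: str) -> str:
    """Run each stage in order -- pure string transforms, no inference."""
    for stage in STAGES:
        text = stage(text)
    return text
```

&lt;p&gt;Because every stage is a pure function, the output is reproducible and costs nothing to run — that's the whole appeal over a neural compressor.&lt;/p&gt;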

&lt;p&gt;The compression rates are kind of nuts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JSON payloads&lt;/strong&gt;: 82% reduction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build logs&lt;/strong&gt;: 76% reduction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python source&lt;/strong&gt;: 25% reduction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git diffs&lt;/strong&gt;: 40%+ reduction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Weighted average across real workloads: &lt;strong&gt;~54% fewer tokens&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;Why I Actually Switched From LLMLingua&lt;/h2&gt;

&lt;p&gt;I was using LLMLingua-2 before. It works, but:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It needs a model to run (GPU or slow CPU inference)&lt;/li&gt;
&lt;li&gt;At 0.3 compression rate, ROUGE-L fidelity was 0.346 — basically mangling the content&lt;/li&gt;
&lt;li&gt;Can't reverse the compression&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;claw-compactor at the same 0.3 rate? ROUGE-L of &lt;strong&gt;0.653&lt;/strong&gt;. Almost twice the fidelity. And zero inference cost because it's all deterministic.&lt;/p&gt;

&lt;p&gt;Plus it has this &lt;code&gt;RewindStore&lt;/code&gt; feature where you can actually get the original content back from a compressed marker. Try doing that with a neural compressor.&lt;/p&gt;
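
&lt;p&gt;To be concrete about what "reversible" means here: conceptually, a rewind store swaps bulky content for a short marker and keeps the original on the side. A toy sketch of that idea — my names and format, not the library's actual &lt;code&gt;RewindStore&lt;/code&gt; API:&lt;/p&gt;

```python
import hashlib

class RewindStoreSketch:
    """Toy reversible store: replace bulky content with a short marker
    and keep the original so it can be expanded again on demand.
    (Hypothetical sketch -- not claw-compactor's actual RewindStore.)"""

    def __init__(self):
        self._originals = {}

    def compress(self, content: str) -> str:
        # short content-addressed key for the marker
        key = hashlib.sha256(content.encode()).hexdigest()[:8]
        self._originals[key] = content
        return f"[compressed:{key} {len(content)} chars]"

    def rewind(self, marker: str) -> str:
        # recover the original content from its marker
        key = marker.split(":")[1].split()[0]
        return self._originals[key]
```

&lt;p&gt;A neural compressor throws the original away by construction; a marker-based scheme keeps it recoverable, which matters when the model later asks for the full detail.&lt;/p&gt;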

&lt;h2&gt;How I'm Using It&lt;/h2&gt;

&lt;p&gt;We have an agent that processes GitHub issues — fetches the issue, relevant code, CI logs, and prior conversations, then asks the LLM to triage.&lt;/p&gt;

&lt;p&gt;Before compression, a typical context was ~12K tokens. After piping everything through &lt;code&gt;FusionEngine&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scripts.lib.fusion.engine&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FusionEngine&lt;/span&gt;

&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FusionEngine&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compress_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# 12K tokens → 5.5K tokens, zero information loss on the stuff that matters
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 14 stages each handle a different content type. The cool part is it auto-detects what's code, what's JSON, what's a log — you don't need to tell it.&lt;/p&gt;
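
&lt;p&gt;For a sense of how that kind of auto-detection can work, here's a crude content sniffer I wrote for illustration — claw-compactor's real detector is presumably smarter than this:&lt;/p&gt;

```python
import json

def detect_content_type(text: str) -> str:
    """Crude content-type sniffing, in the spirit of routing content to
    per-type compression stages. Heuristics are illustrative only."""
    stripped = text.strip()
    # JSON: starts like an object/array and actually parses
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass
    # Unified diff headers
    if stripped.startswith(("diff --git", "--- ", "+++ ")):
        return "diff"
    # Logs: most lines start with a timestamp digit or carry a level tag
    lines = stripped.splitlines()
    loggy = sum(
        1 for l in lines
        if l[:1].isdigit() or "ERROR" in l or "WARN" in l or "INFO" in l
    )
    if lines and loggy > len(lines) // 2:
        return "log"
    return "code"
```
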

&lt;p&gt;Some stages that impressed me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SemanticDedup&lt;/strong&gt; — SimHash fingerprinting to find near-duplicate blocks across your entire conversation. Killed about 20% of our tokens right there.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ionizer&lt;/strong&gt; — Sees 100 JSON objects with the same schema? Samples a representative subset and summarizes the rest. Brutal efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LogCrunch&lt;/strong&gt; — "This line repeated 847 times" instead of sending 847 lines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neurosyntax&lt;/strong&gt; — Actual AST-aware code compression. Knows the difference between meaningful code and boilerplate.&lt;/li&gt;
&lt;/ul&gt;
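
&lt;p&gt;The LogCrunch idea in particular is simple enough to show in a few lines. Here's an illustrative run-length collapse — my own sketch, with a made-up output format:&lt;/p&gt;

```python
from itertools import groupby

def crunch_repeats(log: str, threshold: int = 3) -> str:
    """Collapse consecutive duplicate log lines into one line plus a
    repeat count, in the spirit of the LogCrunch stage described above.
    (Sketch only -- output format invented for illustration.)"""
    out = []
    for line, run in groupby(log.splitlines()):
        n = len(list(run))
        if n >= threshold:
            out.append(f"{line}  [repeated {n} times]")
        else:
            out.extend([line] * n)  # short runs pass through unchanged
    return "\n".join(out)
```

&lt;p&gt;Sending "repeated 847 times" instead of 847 lines is exactly the kind of win that's invisible in a short prompt but huge in a CI-log-heavy agent context.&lt;/p&gt;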

&lt;h2&gt;The Numbers&lt;/h2&gt;

&lt;p&gt;For our pipeline specifically:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;~$400/day&lt;/td&gt;
&lt;td&gt;~$185/day&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$6,450/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12K avg tokens/call&lt;/td&gt;
&lt;td&gt;5.5K avg tokens/call&lt;/td&gt;
&lt;td&gt;54% reduction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Zero added latency&lt;/td&gt;
&lt;td&gt;No GPU needed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The library itself has zero dependencies and runs on Python 3.9+. You can optionally add &lt;code&gt;tiktoken&lt;/code&gt; for exact token counts and &lt;code&gt;tree-sitter-language-pack&lt;/code&gt; for AST-level code analysis.&lt;/p&gt;
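
&lt;p&gt;If you want that optional-dependency pattern in your own tooling, the usual shape is a try/except import with a rough fallback — the function name here is mine, not the library's:&lt;/p&gt;

```python
def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Exact token count when tiktoken is installed, rough chars/4
    estimate otherwise. (Illustrative helper, not claw-compactor's API.)"""
    try:
        import tiktoken
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))
    except ImportError:
        # ~4 characters per token is a common rough estimate for English
        return max(1, len(text) // 4)
```
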

&lt;h2&gt;What It's Not&lt;/h2&gt;

&lt;p&gt;This isn't magic. It won't compress a well-written 500-word prompt that's already tight. It shines when you're feeding the LLM structured data, code, logs, or conversations — the kind of bloated context that agent systems generate.&lt;/p&gt;

&lt;p&gt;If you're running a chatbot with short user messages, you probably don't need this. If you're building an AI agent that processes real-world data, you probably do.&lt;/p&gt;

&lt;h2&gt;Try It&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/open-compress/claw-compactor.git
&lt;span class="nb"&gt;cd &lt;/span&gt;claw-compactor
python3 scripts/mem_compress.py /your/workspace benchmark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The benchmark command does a dry run — shows you exactly how much each stage would compress without changing anything. Start there.&lt;/p&gt;

&lt;p&gt;1,676 tests passing, MIT licensed, zero dependencies. Not sure what else you'd want.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/open-compress/claw-compactor" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Curious if anyone else is running token compression in production. What's your setup?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
