Last month I was debugging why our agent pipeline was burning through $400/day in OpenAI tokens. Turns out 60% of what we were feeding GPT-4 was redundant — repeated JSON schemas, duplicate log blocks, unchanged diff context, verbose imports.
I tried prompt trimming by hand. Tedious. I tried LLMLingua. Better, but it needs a GPU and the fidelity wasn't great at high compression.
Then I found claw-compactor and honestly I'm a bit mad I didn't find it sooner.
## What It Actually Does
It's a 14-stage compression pipeline that sits between your data and the LLM. No neural network, no inference cost — pure deterministic transforms. You feed it code, JSON, logs, diffs, whatever, and it spits out a compressed version that preserves meaning but costs way fewer tokens.
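To make "pure deterministic transforms" concrete: each stage is just a function from text to text, chained in order. Here's a toy version with stage names I made up, nothing like the real internals, just the shape of the idea:

```python
import re
from typing import Callable, List

Stage = Callable[[str], str]  # each stage: text in, smaller text out

def strip_trailing_whitespace(text: str) -> str:
    return "\n".join(line.rstrip() for line in text.splitlines())

def collapse_blank_runs(text: str) -> str:
    # Two or more consecutive blank lines become one.
    return re.sub(r"\n{3,}", "\n\n", text)

def run_pipeline(text: str, stages: List[Stage]) -> str:
    for stage in stages:
        text = stage(text)
    return text

raw = "def f():   \n    pass\n\n\n\n# next block\n"
print(run_pipeline(raw, [strip_trailing_whitespace, collapse_blank_runs]))
```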
The compression rates are kind of nuts:
- JSON payloads: 82% reduction
- Build logs: 76% reduction
- Python source: 25% reduction
- Git diffs: 40%+ reduction
Weighted average across real workloads: ~54% fewer tokens.
## Why I Actually Switched From LLMLingua
I was using LLMLingua-2 before. It works, but:
- It needs a model to run (GPU or slow CPU inference)
- At 0.3 compression rate, ROUGE-L fidelity was 0.346 — basically mangling the content
- Can't reverse the compression
claw-compactor at the same 0.3 rate? ROUGE-L of 0.653. Almost twice the fidelity. And zero inference cost because it's all deterministic.
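If you want to sanity-check fidelity numbers on your own data, the rouge-score package makes it a few lines. I'm assuming here that the figures are ROUGE-L F1 with the original text as the reference:

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
original = "fetch the issue, the relevant code, and the CI logs"
compressed = "fetch issue, relevant code, CI logs"
# F1 of longest-common-subsequence overlap, original as reference
print(scorer.score(original, compressed)["rougeL"].fmeasure)
```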
Plus it has this RewindStore feature where you can actually get the original content back from a compressed marker. Try doing that with a neural compressor.
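I haven't read RewindStore's internals, so take this as a concept sketch only: stash the original under a content hash, emit a short marker, dereference the marker later. Everything below is my own made-up code, not the library's API:

```python
import hashlib

class ToyRewindStore:
    """Concept sketch with made-up names, not claw-compactor's code:
    stash originals under a short content hash, emit a marker,
    and dereference the marker to get the exact bytes back."""

    def __init__(self):
        self._originals = {}

    def stash(self, text: str) -> str:
        key = hashlib.sha256(text.encode()).hexdigest()[:12]
        self._originals[key] = text
        return f"<<rewind:{key}>>"

    def rewind(self, marker: str) -> str:
        key = marker.removeprefix("<<rewind:").removesuffix(">>")
        return self._originals[key]

store = ToyRewindStore()
marker = store.stash("very long log block ...")
assert store.rewind(marker) == "very long log block ..."
```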
## How I'm Using It
We have an agent that processes GitHub issues — fetches the issue, relevant code, CI logs, and prior conversations, then asks the LLM to triage.
Before compression, a typical context was ~12K tokens. After piping everything through FusionEngine:
```python
from scripts.lib.fusion.engine import FusionEngine

engine = FusionEngine()
# messages: the list of context chunks we'd otherwise send straight to the LLM
result = engine.compress_messages(messages)
# 12K tokens → 5.5K tokens, zero information loss on the stuff that matters
```
The 14 stages each handle a different content type. The cool part is it auto-detects what's code, what's JSON, what's a log — you don't need to tell it.
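I don't know exactly how its detection works under the hood, but you can get surprisingly far with cheap structural probes. Something in this spirit (pure guesswork on my part):

```python
import json

def detect_block_type(block: str) -> str:
    # Hypothetical heuristic, tried in rough order of specificity.
    s = block.strip()
    if s.startswith(("{", "[")):
        try:
            json.loads(s)
            return "json"
        except ValueError:
            pass
    if s.startswith(("diff --git", "--- a/", "+++ b/", "@@")):
        return "diff"
    if any(line.startswith(("def ", "class ", "import ", "from "))
           for line in s.splitlines()):
        return "python"
    return "log"

print(detect_block_type('{"a": 1}'))          # json
print(detect_block_type("def f():\n  pass"))  # python
```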
Some stages that impressed me:
- SemanticDedup — SimHash fingerprinting to find near-duplicate blocks across your entire conversation (see the sketch after this list). Killed about 20% of our tokens right there.
- Ionizer — Sees 100 JSON objects with the same schema? Samples a representative subset and summarizes the rest. Brutal efficiency.
- LogCrunch — "This line repeated 847 times" instead of sending 847 lines.
- Neurosyntax — Actual AST-aware code compression. Knows the difference between meaningful code and boilerplate.
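SimHash itself is a standard near-duplicate trick, so here's a self-contained illustration of why it works. This is my own toy version, not the library's implementation:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    # Each token votes on each bit; near-identical texts end up
    # within a small Hamming distance of each other.
    v = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

a = simhash("ERROR connection to db-prod-3 timed out after 30s")
b = simhash("ERROR connection to db-prod-7 timed out after 30s")
print(hamming(a, b))  # small distance => near-duplicates, safe to dedup
```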
## The Numbers
For our pipeline specifically:
| Metric | Before | After | Savings |
|---|---|---|---|
| OpenAI spend | ~$400/day | ~$185/day | ~$6,450/month |
| Avg tokens per call | 12K | 5.5K | 54% |
| Added latency | n/a | zero (no GPU needed) | n/a |
The library itself has zero dependencies and runs on Python 3.9+. You can optionally add tiktoken for exact token counts and tree-sitter-language-pack for AST-level code analysis.
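With tiktoken installed, measuring the savings on your own context takes a few lines. The file names below are placeholders for whatever you actually feed the model:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
# Placeholder files: your context before and after compression.
before = len(enc.encode(open("context_before.txt").read()))
after = len(enc.encode(open("context_after.txt").read()))
print(f"{before} -> {after} tokens ({1 - after / before:.0%} saved)")
```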
## What It's Not
This isn't magic. It won't compress a well-written 500-word prompt that's already tight. It shines when you're feeding the LLM structured data, code, logs, or conversations — the kind of bloated context that agent systems generate.
If you're running a chatbot with short user messages, you probably don't need this. If you're building an AI agent that processes real-world data, you probably do.
## Try It
```bash
git clone https://github.com/open-compress/claw-compactor.git
cd claw-compactor
python3 scripts/mem_compress.py /your/workspace benchmark
```
The benchmark command does a dry run — shows you exactly how much each stage would compress without changing anything. Start there.
1,676 tests passing, MIT licensed, zero dependencies. Not sure what else you'd want.
Curious if anyone else is running token compression in production. What's your setup?