Google published TurboQuant at ICLR 2026 — a compression algorithm making AI models 6x more memory efficient with no quality loss. The internet is calling it the Pied Piper moment for AI.
Here's the problem: TurboQuant runs inside the model's inference engine, on the provider's servers. You can't touch it.
But the core idea? Portable to any AI system today.
The Problem TurboQuant Solves
Every time you send a message to an AI, the full conversation history goes with it. On a 30-turn session, that's 3,000+ tokens of history sent on every single request. Tokens cost money. Long sessions get expensive fast.
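To see why this compounds, here's a back-of-the-envelope model. The ~105 tokens-per-turn figure is an illustrative assumption, not something measured by any plugin:

```javascript
// Rough cost model: each request resends the full history so far.
// TOKENS_PER_TURN is an assumed average, purely for illustration.
const TOKENS_PER_TURN = 105;

function tokensForRequest(turn) {
  // History grows by one turn per request, so request size grows linearly.
  return turn * TOKENS_PER_TURN;
}

function totalTokensForSession(turns) {
  // Summing a linearly growing series: total grows quadratically.
  let total = 0;
  for (let t = 1; t <= turns; t++) total += tokensForRequest(t);
  return total;
}

console.log(tokensForRequest(30));      // 3150, the "3,000+ per request" regime
console.log(totalTokensForSession(30)); // 48825 tokens billed across the session
```

The quadratic total is the real killer: doubling session length roughly quadruples what you pay.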
The Hot/Cold Cache Split
TurboQuant's core insight: not all context is equally important. Recent turns matter most. Old turns can be compressed aggressively without hurting response quality.
We built this as an OpenClaw plugin using before_prompt_build — a hook that fires before every inference call:
```javascript
function applyCompression(messages, config) {
  const sys = messages.filter(m => m.role === 'system');
  const conv = messages.filter(m => m.role !== 'system');
  if (conv.length < config.minTurnsBeforeCompression) return messages;

  // Everything before the last keepRecentTurns turns is "cold".
  const hotStart = Math.max(0, conv.length - config.keepRecentTurns);
  const cold = conv.slice(0, hotStart); // compressed
  const hot = conv.slice(hotStart);     // verbatim

  return [
    ...sys,
    ...cold.map(m => ({ ...m, content: compressText(m.content, config.compressionRatio) })),
    ...hot,
  ];
}
```
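To make the slicing concrete, here's where the split lands on a hypothetical 28-turn conversation with keepRecentTurns set to 6:

```javascript
// Illustrative: a fake 28-turn conversation, just to show the index math.
const conv = Array.from({ length: 28 }, (_, i) => ({
  role: i % 2 === 0 ? 'user' : 'assistant',
  content: `turn ${i}`,
}));

const keepRecentTurns = 6;
const hotStart = Math.max(0, conv.length - keepRecentTurns); // 22

console.log(conv.slice(0, hotStart).length); // 22 cold turns, compressed
console.log(conv.slice(hotStart).length);    // 6 hot turns, kept verbatim
```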
The compression is extractive — scores every sentence by information density, keeps the highest-value ones. No AI calls, no extra API cost:
```javascript
function compressText(text, ratio) {
  const target = Math.ceil(text.length * ratio);
  // scoreSentences returns [{ s: sentence, i: originalIndex, score }]
  const scored = scoreSentences(text).sort((a, b) => b.score - a.score);
  const picked = [];
  let len = 0;
  for (const item of scored) {
    if (len >= target && picked.length >= 1) break; // always keep at least one sentence
    picked.push(item);
    len += item.s.length;
  }
  picked.sort((a, b) => a.i - b.i); // restore original sentence order
  return picked.map(p => p.s.trim()).join(' ');
}
```
Results
On a 30-turn conversation:
| | Without plugin | With plugin |
|---|---|---|
| Tokens per request | ~3,200 | ~1,100 |
| Cost ratio | 1x | ~0.34x |
| Recent context quality | Full | Full |
| Old context quality | Full | ~75% |
~66% token reduction per request.
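The two headline numbers are consistent with each other; the cost ratio is just the token ratio:

```javascript
const before = 3200, after = 1100;
console.log((after / before).toFixed(2));                  // "0.34", the cost ratio row
console.log(`${Math.round((1 - after / before) * 100)}%`); // "66%" reduction
```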
OpenClaw Plugin Config
```json
{
  "plugins": {
    "allow": ["turboquant"],
    "entries": {
      "turboquant": {
        "enabled": true,
        "config": {
          "keepRecentTurns": 6,
          "compressionRatio": 0.25,
          "minTurnsBeforeCompression": 10
        }
      }
    }
  }
}
```
After restarting the gateway you see logs like:
```
[turboquant] 28 turns → compressed 22 cold: ~3100→~980 tokens (saved ~2120, 68%)
```
The Real Point
We read the TurboQuant paper in the morning and shipped this the same afternoon. That's AI-assisted development in 2026 — read the research, understand the principle, ship something real before dinner.
Source code: github.com/Boehner/openclaw-turboquant
Drop your token savings in the comments if you try it.