This week Google Research published TurboQuant — a two-stage KV-cache quantization algorithm that achieves 6x memory reduction and 8x attention speedup with zero accuracy loss at 3 bits. No training required.
It's genuinely impressive engineering. But it's worth being precise about what problem it solves.
The two AI memory problems
Most people conflate two distinct problems:
Problem A: memory within a session
As context grows, the KV-cache grows. It becomes expensive in RAM and slow in attention computation. TurboQuant solves this — brilliantly.
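To see why this matters, here is a back-of-envelope size calculation. The model dimensions are illustrative assumptions (roughly 7B-scale defaults), and the function is mine, not from the paper:

```python
# Rough KV-cache size for a ~7B-parameter transformer.
# All defaults are illustrative assumptions, not measurements.
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128,
                   bytes_per_elem=2):  # fp16
    # 2x because both keys AND values are cached, per layer, per head
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem

gb = kv_cache_bytes(seq_len=32_000) / 1e9
print(f"{gb:.1f} GB")  # prints 16.8 GB for a single 32k-token context
```

At these assumed dimensions, one 32k-token context already occupies ~17 GB in fp16, which is why compressing the cache pays off so quickly.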
Problem B: memory between sessions
When the session ends, the KV-cache is gone. The model starts from zero next time. No memory of past interactions, no accumulated patterns, no structured experience. TurboQuant doesn't touch this.
What TurboQuant actually does
TurboQuant is a two-stage pipeline:
PolarQuant — rotates vectors randomly, converts to polar coordinates, quantizes components without needing per-block normalization constants. This eliminates the 1–2 bit overhead that traditional quantization methods carry.
QJL (Quantized Johnson-Lindenstrauss) — encodes residual error with a single sign bit. Zero memory overhead.
Result: 3-bit KV-cache, 6x compression, 8x speedup, zero accuracy degradation on LongBench, Needle-in-a-Haystack, RULER, and ZeroSCROLLS benchmarks.
This makes long-context inference significantly cheaper and faster. Real value.
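For intuition, the two-stage idea above can be sketched in NumPy: a random rotation, pairwise polar conversion with quantized angles, then a sign-only residual. This is a toy illustration under my own simplifications (dense QR rotation, full-precision magnitudes, made-up function names), not the paper's actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Random orthogonal matrix via QR (illustrative; the paper uses
    # structured rotations that are much cheaper to apply).
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def polar_quantize(v, R, angle_bits=3):
    """Toy PolarQuant sketch: rotate, pair coordinates, quantize angles.
    Magnitudes are kept in full precision here for simplicity."""
    x = R @ v
    pairs = x.reshape(-1, 2)                       # (d/2, 2)
    r = np.linalg.norm(pairs, axis=1)              # pair magnitudes
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])   # angles in [-pi, pi]
    levels = 2 ** angle_bits
    q = np.round((theta + np.pi) / (2 * np.pi) * (levels - 1)).astype(np.int8)
    return r, q

def dequantize(r, q, R, angle_bits=3):
    levels = 2 ** angle_bits
    theta = q / (levels - 1) * 2 * np.pi - np.pi
    pairs = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    return R.T @ pairs.reshape(-1)                 # undo the rotation

d = 8
v = rng.normal(size=d)
R = random_rotation(d)
r, q = polar_quantize(v, R)
v_hat = dequantize(r, q, R)
# QJL-style residual: keep only the sign of the error, one bit per coordinate
residual_sign = np.sign(v - v_hat).astype(np.int8)
```

The sketch shows the shape of the pipeline: the rotation spreads information evenly across coordinates, the angles carry the bulk of the signal at a few bits each, and the residual costs one sign bit per coordinate.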
The gap it leaves open
The moment the session ends, the KV-cache is gone.
Week 1 with any model: average answers.
Week 4 with any model: still average answers. It forgot everything.
Fine-tuning costs thousands of dollars and weeks. RAG gives you retrieval, not cognition. Context windows bill per token and still reset.
What we built for Problem B
AuraSDK is a local cognitive substrate that sits outside model weights.
It accumulates structured experience across sessions through a 5-layer pipeline:
Record → Belief → Concept → Causal → Policy
Each layer is derived deterministically from the one below — no LLM, no embeddings. Policy hints like "deploy to staging first" aren't written by anyone. They emerge from repeated causal patterns in stored experience.
```python
from aura import Aura

brain = Aura("./agent_memory")
brain.store("Staging deploy prevented 3 production incidents", tags=["workflow"])
brain.store("User always deploys to staging first", tags=["workflow"])

brain.run_maintenance()  # consolidation pass: derives the cognitive stack
hints = brain.get_surfaced_policy_hints()
# → [{"action": "Prefer", "domain": "workflow", "description": "deploy to staging first"}]
```
What v1.5.4 adds:
- Autonomous cognitive plasticity — the substrate observes model output and updates itself. No fine-tuning. Full audit trail.
- Salience weighting — what matters persists longer, decays slower
- Contradiction governance — conflicting evidence surfaced explicitly, not averaged silently
Performance (1,000 records, Ryzen 7, v1.5.4):
- Store: 0.91ms
- Recall: 0.076ms (~2,600× faster than Mem0)
- Recall (cached): 1.4µs
- Maintenance cycle: 15ms median
No API keys. No cloud. No LLM dependency. ~3MB binary. Fully offline. MIT license.
The full picture
| | TurboQuant | AuraSDK |
|---|---|---|
| Problem | KV-cache overhead within session | No memory between sessions |
| Approach | Quantization of attention keys/values | Persistent cognitive substrate |
| Scope | Single inference pass | Cross-session accumulation |
| Requires LLM | Yes (runs inside it) | No |
| Works offline | N/A | Fully |
| Open source | Research paper | MIT, ships today |
These are complementary. TurboQuant makes inference cheaper in the moment. AuraSDK makes the model smarter over time.
The field needs both.
pip install aura-memory