Oleksander
Google's TurboQuant solves half the AI memory problem. Here's the other half.

This week Google Research published TurboQuant — a two-stage KV-cache quantization algorithm that achieves 6x memory reduction and 8x attention speedup with zero accuracy loss at 3 bits. No training required.

It's genuinely impressive engineering. But it's worth being precise about what problem it solves.

The two AI memory problems

Most people conflate two distinct problems:

Problem A: memory within a session
As context grows, the KV-cache grows with it: more RAM consumed, slower attention computation. TurboQuant solves this, and solves it brilliantly.

Problem B: memory between sessions
When the session ends, the KV-cache is gone. The model starts from zero next time. No memory of past interactions, no accumulated patterns, no structured experience. TurboQuant doesn't touch this.

What TurboQuant actually does

TurboQuant is a two-stage pipeline:

  1. PolarQuant — rotates vectors randomly, converts them to polar coordinates, and quantizes the components without per-block normalization constants. This eliminates the 1–2 bits of overhead that traditional quantization schemes spend storing those constants.

  2. QJL (Quantized Johnson-Lindenstrauss) — encodes residual error with a single sign bit. Zero memory overhead.
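The paper's exact construction isn't reproduced in this post, but the intuition is sketchable. Below is a toy NumPy version, not the real PolarQuant/QJL algorithms: randomly rotate a vector so a single global scale suffices, uniform-quantize to 3 bits, then correct the residual with per-component sign bits. All the constants and choices here are illustrative assumptions.

```python
import numpy as np

# Toy sketch only: random rotation + 3-bit uniform quantization + sign-bit
# residual correction. The real PolarQuant/QJL stages differ substantially.

rng = np.random.default_rng(0)
d = 64

# Random rotation (here via QR; the paper uses faster structured transforms).
# After rotation, coordinates look roughly i.i.d. Gaussian, so one global
# scale covers everything -- no per-block normalization constants to store.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
v = rng.standard_normal(d)
r = Q @ v

LEVELS, SCALE = 8, 3.0  # 3 bits -> 8 levels spanning roughly +/-3 sigma

q = np.clip(np.round((r / SCALE + 1) / 2 * (LEVELS - 1)), 0, LEVELS - 1)
r_hat = (q / (LEVELS - 1) * 2 - 1) * SCALE  # dequantize

# QJL-flavored second stage (toy): keep only the *sign* of each residual
# component and add back an average-magnitude correction in that direction.
resid = r - r_hat
r_hat2 = r_hat + np.sign(resid) * np.mean(np.abs(resid))

err_stage1 = np.linalg.norm(r - r_hat) / np.linalg.norm(r)
err_stage2 = np.linalg.norm(r - r_hat2) / np.linalg.norm(r)
```

The point of the sketch: the rotation is what buys you the cheap quantizer, and even a 1-bit residual code recovers a meaningful chunk of the remaining error.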

Result: a 3-bit KV-cache, 6x compression, 8x speedup, and zero accuracy degradation on the LongBench, Needle-in-a-Haystack, RULER, and ZeroSCROLLS benchmarks.

This makes long-context inference significantly cheaper and faster. Real value.

The gap it leaves open

The moment the session ends, the KV-cache is gone.

Week 1 with any model: average answers.
Week 4 with any model: still average answers. It forgot everything.

Fine-tuning costs thousands of dollars and takes weeks. RAG gives you retrieval, not cognition. Long context windows bill per token and still reset every session.

What we built for Problem B

AuraSDK is a local cognitive substrate that sits outside model weights.

It accumulates structured experience across sessions through a 5-layer pipeline:

Record → Belief → Concept → Causal → Policy

Each layer is derived deterministically from the one below — no LLM, no embeddings. Policy hints like "deploy to staging first" aren't written by anyone. They emerge from repeated causal patterns in stored experience.

from aura import Aura

brain = Aura("./agent_memory")

brain.store("Staging deploy prevented 3 production incidents", tags=["workflow"])
brain.store("User always deploys to staging first", tags=["workflow"])

# after run_maintenance(), the cognitive stack derives:
hints = brain.get_surfaced_policy_hints()
# → [{"action": "Prefer", "domain": "workflow", "description": "deploy to staging first"}]
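AuraSDK's internals aren't published line by line in this post, but the "no LLM, no embeddings" claim is easy to picture: hints can fall out of plain support counting. Here is a toy, fully hypothetical sketch; none of these function names, thresholds, or data shapes are the real API.

```python
from collections import Counter

def derive_policy_hints(records, min_support=2):
    # Toy rule (hypothetical, not AuraSDK's actual derivation): a tagged
    # observation repeated across sessions becomes a surfaced hint once it
    # clears a support threshold. Pure counting -- no LLM, no embeddings.
    counts = Counter((tag, text) for text, tags in records for tag in tags)
    return [
        {"action": "Prefer", "domain": tag, "description": text}
        for (tag, text), n in counts.items()
        if n >= min_support
    ]

records = [
    ("deploy to staging first", ["workflow"]),
    ("deploy to staging first", ["workflow"]),
    ("restart worker on OOM", ["ops"]),   # seen once: below threshold
]
hints = derive_policy_hints(records)
# Only the repeated "workflow" pattern survives the support threshold.
```

The determinism is the point: the same stored experience always yields the same hints, which is what makes a full audit trail possible.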

What v1.5.4 adds:

  • Autonomous cognitive plasticity — the substrate observes model output and updates itself. No fine-tuning. Full audit trail.
  • Salience weighting — what matters persists longer, decays slower
  • Contradiction governance — conflicting evidence surfaced explicitly, not averaged silently
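Salience weighting can be pictured as a decay half-life stretched by importance. A minimal sketch under assumed mechanics; the function name, the [0, 1] salience scale, and the "up to 4x half-life" rule are all invented for illustration:

```python
def retention(age_days: float, salience: float, base_half_life: float = 7.0) -> float:
    # Hypothetical rule (not AuraSDK's actual formula): salience in [0, 1]
    # stretches the base half-life up to 4x, so what matters decays slower.
    half_life = base_half_life * (1 + 3 * salience)
    return 0.5 ** (age_days / half_life)

# A mundane record vs. a high-salience one, both 14 days old:
low = retention(14, salience=0.1)
high = retention(14, salience=0.9)
```

Whatever the real curve, the observable behavior is the bullet above: at equal age, the high-salience record retains far more weight.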

Performance (1,000 records, Ryzen 7, v1.5.4):

  • Store: 0.91ms
  • Recall: 0.076ms (~2,600× faster than Mem0)
  • Recall (cached): 1.4µs
  • Maintenance cycle: 15ms median

No API keys. No cloud. No LLM dependency. ~3MB binary. Fully offline. MIT license.

The full picture

|              | TurboQuant                             | AuraSDK                       |
|--------------|----------------------------------------|-------------------------------|
| Problem      | KV-cache overhead within a session     | No memory between sessions    |
| Approach     | Quantization of attention keys/values  | Persistent cognitive substrate|
| Scope        | Single inference pass                  | Cross-session accumulation    |
| Requires LLM | Yes (runs inside it)                   | No                            |
| Works offline| N/A                                    | Fully                         |
| Open source  | Research paper                         | MIT, ships today              |

These are complementary. TurboQuant makes inference cheaper in the moment. AuraSDK makes the model smarter over time.

The field needs both.

pip install aura-memory

GitHub ·
