DEV Community

Oleksander

Google's TurboQuant solves half the AI memory problem. Here's the other half.

This week Google Research published TurboQuant — a two-stage KV-cache quantization algorithm that achieves 6x memory reduction and 8x attention speedup with zero accuracy loss at 3 bits. No training required.

It's genuinely impressive engineering. But it's worth being precise about what problem it solves.

The two AI memory problems

Most people conflate two distinct problems:

Problem A: memory within a session
As context grows, the KV-cache grows. It becomes expensive in RAM and slow in attention computation. TurboQuant solves this — brilliantly.

Problem B: memory between sessions
When the session ends, the KV-cache is gone. The model starts from zero next time. No memory of past interactions, no accumulated patterns, no structured experience. TurboQuant doesn't touch this.

What TurboQuant actually does

TurboQuant is a two-stage pipeline:

  1. PolarQuant — applies a random rotation to key/value vectors, converts them to polar coordinates, and quantizes the components without per-block normalization constants. This eliminates the 1–2 bits of overhead that traditional quantization methods carry.

  2. QJL (Quantized Johnson-Lindenstrauss) — encodes residual error with a single sign bit. Zero memory overhead.

Result: 3-bit KV-cache, 6x compression, 8x speedup, zero accuracy degradation on LongBench, Needle-in-a-Haystack, RULER, and ZeroSCROLLS benchmarks.
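The two-stage idea can be sketched in miniature with NumPy. This is illustrative only: the scale heuristic, level mapping, and sign correction below are my simplifications, not the paper's algorithm.

```python
import numpy as np

def toy_quantize(v, bits=3, seed=0):
    """Toy 2-stage sketch: random rotation + uniform quantization + sign-bit residual."""
    d = v.size
    rng = np.random.default_rng(seed)
    # Stage 1: a random rotation makes coordinates near-Gaussian, so one
    # shared scale works instead of per-block normalization constants.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    r = Q @ v
    levels = 2 ** bits
    scale = 3.0 * np.linalg.norm(v) / np.sqrt(d)   # ~3 sigma of the rotated coords
    step = 2.0 * scale / (levels - 1)
    q = np.clip(np.round((r + scale) / step), 0, levels - 1)
    # Stage 2: one sign bit per coordinate encodes the residual's direction.
    sign = np.sign(r - (q * step - scale))
    return q.astype(np.uint8), sign, Q, scale

def toy_dequantize(q, sign, Q, scale, bits=3):
    levels = 2 ** bits
    step = 2.0 * scale / (levels - 1)
    r_hat = q * step - scale + sign * step / 4  # sign bit shrinks the error
    return Q.T @ r_hat  # undo the rotation
```

The point of the rotation is that a single global scale then suffices; the sign bit buys back roughly half of the remaining quantization error at a cost of one bit per coordinate.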

This makes long-context inference significantly cheaper and faster. Real value.

The gap it leaves open

The moment the session ends, the KV-cache is gone.

Week 1 with any model: average answers.
Week 4 with any model: still average answers. It forgot everything.

Fine-tuning costs thousands of dollars and weeks. RAG gives you retrieval, not cognition. Context windows bill per token and still reset.

What we built for Problem B

AuraSDK is a local cognitive substrate that sits outside model weights.

It accumulates structured experience across sessions through a 5-layer pipeline:

Record → Belief → Concept → Causal → Policy

Each layer is derived deterministically from the one below — no LLM, no embeddings. Policy hints like "deploy to staging first" aren't written by anyone. They emerge from repeated causal patterns in stored experience.

from aura import Aura

brain = Aura("./agent_memory")

brain.store("Staging deploy prevented 3 production incidents", tags=["workflow"])
brain.store("User always deploys to staging first", tags=["workflow"])

brain.run_maintenance()

# the cognitive stack now derives:
hints = brain.get_surfaced_policy_hints()
# → [{"action": "Prefer", "domain": "workflow", "description": "deploy to staging first"}]
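The deterministic-derivation claim can be sketched in miniature. This is a toy illustration of the principle (a layer as a pure function of the layer below), not AuraSDK's internals:

```python
from collections import Counter

def derive_beliefs(records, min_support=2):
    """Deterministically promote corroborated records to beliefs.

    No LLM, no embeddings: just counting independent observations
    that agree on the same claim within the same tag.
    """
    support = Counter((r["tag"], r["claim"]) for r in records)
    return {
        claim: {"tag": tag, "support": n}
        for (tag, claim), n in support.items()
        if n >= min_support  # one-off records stay weak singletons
    }

records = [
    {"tag": "workflow", "claim": "deploy to staging first"},
    {"tag": "workflow", "claim": "deploy to staging first"},
    {"tag": "workflow", "claim": "skip code review"},  # seen only once
]
beliefs = derive_beliefs(records)
```

Because the derivation is a pure function of the records, the same stored experience always yields the same beliefs, which is what makes the stack auditable.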

What v1.5.4 adds:

  • Autonomous cognitive plasticity — the substrate observes model output and updates itself. No fine-tuning. Full audit trail.
  • Salience weighting — what matters persists longer, decays slower
  • Contradiction governance — conflicting evidence surfaced explicitly, not averaged silently
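Salience weighting, as described, could look something like the sketch below. These are assumed semantics for illustration, not the shipped formula:

```python
# Assumed semantics, not AuraSDK's actual decay model: higher-salience
# memories get a longer effective half-life, so they fade more slowly.
def retention(salience, age_days, base_half_life_days=30.0):
    half_life = base_half_life_days * (1.0 + salience)
    return 0.5 ** (age_days / half_life)

# At the same age, a salient memory keeps more weight than a trivial one:
# retention(0.0, 30) = 0.5, while retention(1.0, 30) ≈ 0.71
```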

Performance (1,000 records, Ryzen 7, v1.5.4):

  • Store: 0.91ms
  • Recall: 0.076ms (~2,600× faster than Mem0)
  • Recall (cached): 1.4µs
  • Maintenance cycle: 15ms median

No API keys. No cloud. No LLM dependency. ~3MB binary. Fully offline. MIT license.

The full picture

|               | TurboQuant                             | AuraSDK                        |
|---------------|----------------------------------------|--------------------------------|
| Problem       | KV-cache overhead within a session     | No memory between sessions     |
| Approach      | Quantization of attention keys/values  | Persistent cognitive substrate |
| Scope         | Single inference pass                  | Cross-session accumulation     |
| Requires LLM  | Yes (runs inside it)                   | No                             |
| Works offline | N/A                                    | Fully                          |
| Open source   | Research paper                         | MIT, ships today               |

These are complementary. TurboQuant makes inference cheaper in the moment. AuraSDK makes the model smarter over time.

The field needs both.

pip install aura-memory

GitHub

Top comments (4)

Pixeliro

Really sharp breakdown — especially the distinction between in-session vs cross-session memory. TurboQuant clearly pushes the efficiency frontier for long-context inference.

That said, I think the “Problem B” side (persistent memory) is more nuanced than it looks.

Building a per-user cognitive layer isn’t just a storage problem — it starts to resemble training a personalized model. Each user has different mental models, biases, and even contradictions over time. Unlike global model training (which is curated and validated), personal memory lacks a clear ground truth.

That creates a few hard issues:

Validation: if incorrect patterns are stored, the system may reinforce them
Contradictions: users don’t behave consistently, so derived “policies” can drift or conflict
Context decay: what was true a month ago may no longer be relevant

Because of this, I’m not fully convinced that turning memory into a deterministic cognitive pipeline (Record → Belief → Policy) is always the right abstraction. It risks overfitting to noisy personal data.

In practice, a simpler model often works better: treat memory as a retrieval layer, not a reasoning layer

Something like: store history → retrieve relevant slices based on intent → inject into context → let the model reason statelessly each time.
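That stateless loop is simple to sketch. Pure illustration: a naive keyword-overlap retriever stands in for a real one, and all names here are invented.

```python
# Stateless memory loop: store history, retrieve relevant slices by
# (naive) keyword overlap, inject into context, reason fresh each time.
history = []

def store(text):
    history.append(text)

def retrieve(intent, k=2):
    words = set(intent.lower().split())
    ranked = sorted(history, key=lambda h: -len(words & set(h.lower().split())))
    return ranked[:k]

def assemble_context(intent):
    # The model would reason statelessly over this assembled string.
    return "\n".join(retrieve(intent)) + "\n---\n" + intent

store("deployed to staging before prod, caught a config bug")
store("user prefers tabs over spaces")
ctx = assemble_context("deploy to staging or straight to prod?")
```

The memory layer here never reasons; it only decides what gets assembled into context, which is exactly the "improving context assembly over time" framing.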

This is closer to how humans actually operate too — we don’t perfectly internalize everything; we externalize (notes, logs, tools) and recall selectively when needed.

So I’d frame it less as “making the model smarter over time” and more as: “improving context assembly over time”

Curious how you’re thinking about validation and contradiction handling at scale — that feels like the real bottleneck for any persistent cognitive system.

Oleksander

This is the right pushback, and you've identified the real hard problem.

You're correct that "deterministic pipeline" sounds overconfident. But the architecture doesn't assume ground truth — it's built around epistemic uncertainty:

Records carry confidence, support_mass, and conflict_mass — one-off signals stay weak singletons
Beliefs only stabilize when corroborating evidence converges across independent observations
Contradicting patterns suppress each other explicitly — they don't silently average
Everything decays. Volatile signals don't survive consolidation
So "user said X once" never becomes a policy hint. "User did X consistently across 15 independent observations, with no contradicting signals" might.
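A stripped-down sketch of that gating (toy thresholds and names, not the real fields' math):

```python
def surfaces_as_hint(support_mass, conflict_mass, confidence,
                     min_net=5.0, min_conf=0.7):
    # Contradicting evidence subtracts explicitly instead of averaging in,
    # so conflicted beliefs are suppressed rather than blended.
    net = support_mass - conflict_mass
    return net >= min_net and confidence >= min_conf

# One-off observation: never a policy hint
assert not surfaces_as_hint(1.0, 0.0, 0.9)
# 15 consistent observations, no contradictions: eligible
assert surfaces_as_hint(15.0, 0.0, 0.9)
# Same support but heavy contradiction: suppressed, not averaged
assert not surfaces_as_hint(15.0, 12.0, 0.9)
```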

On your "retrieval layer, not reasoning layer" model — I actually think we agree more than it looks. AuraSDK doesn't replace model reasoning. It improves what gets assembled into context: instead of raw history, you get beliefs with evidence weight, causal patterns with support counts, and policy hints with provenance chains.

The real bottleneck you named — validation at scale — is open. We don't have a good answer for "user's beliefs are systematically wrong." That's why everything stays advisory and auditable, not authoritative.

Curious whether you see a path where the retrieval layer itself could become epistemically aware, or whether you think that's always better left to the model.

Pixeliro

I think this also highlights a deeper separation that might explain why systems like ChatGPT or Gemini don’t prioritize this layer.

What we’re discussing here belongs to the user cognition layer, not the model layer.

Most general-purpose models are optimized for:

correctness
consistency
shared knowledge across users

So their memory design naturally focuses on:

problem → solution → reusable context

That works well at scale, because the goal is to serve everyone with stable, generalizable answers.

But the kind of memory we’re talking about here is different.

It’s not about what is universally correct —
it’s about how a specific user understands, resolves, and evolves their thinking over time.

And that introduces a major challenge:

human cognition is not stable.

A user’s beliefs, understanding, and even problem framing can change over time.
So any system that tries to “lock in” user-specific memory as ground truth risks becoming outdated or misleading.

This is likely why most systems avoid building deeply personalized cognitive memory:

it’s dynamic
it drifts
it lacks clear validation signals

Instead, they keep memory shallow and retrieval-based, and let the model handle reasoning.

But that also means they miss an opportunity.

Because if we treat memory not as static truth, but as a structured, evolving cognitive graph — with:

open / closed propositions
resolution events
branching when understanding deepens
decay or superseding of outdated beliefs

Then the system doesn’t need to assume correctness.

It only needs to track:

how the user’s understanding changes over time.

So the goal shifts from:

“store what is true”

to:

“track what has been resolved, what is still open, and how the user’s thinking evolves”

That’s a fundamentally different problem from building a general-purpose model —
and probably why it hasn’t been a focus so far.

Pixeliro

I think this is where the discussion becomes really interesting.

The question isn’t just how we stabilize beliefs, but who or what actually validates them. And I’m starting to think repetition alone is a weak signal.

A user can repeat the same unresolved problem 15 times without contradiction, and the system will converge on a strong belief — but nothing has actually been resolved or understood. It’s just stable confusion.

So what seems missing is not just probabilistic weighting, but a notion of resolution as a first-class signal.

Instead of modeling memory as accumulated observations, we could think of it as state transitions:

a problem appears
the user explores / struggles
a resolution event happens (explicit confirmation, demonstrated understanding, or successful outcome)
memory consolidates around the resolved state
prior exploratory context is downgraded or pruned

But I think it goes one step further.

Each problem can be treated as a proposition with state:

open → still being explored
closed → resolved and confirmed

When the user revisits or deepens the topic, the system shouldn’t just overwrite or accumulate more signals. It should either:

reopen the proposition (if new uncertainty appears), or
branch into a new proposition derived from the resolved one

At that point, memory is no longer a log or even a belief system — it becomes a structured cognitive graph.

So instead of optimizing for “what happened” or even “what is likely true,” memory starts optimizing for:

what has been resolved, what is still open, and how understanding evolves over time.

One important distinction here is that this layer represents the user’s cognition, not the model’s.

The model encodes general knowledge about the world
This memory encodes how a specific user understands the world

If we mix the two, we either:

reinforce user bias as if it were truth, or
fail to capture how the user actually thinks

So maybe the direction isn’t just epistemically-aware or even resolution-aware retrieval, but:

proposition-aware memory with the ability to open, close, and branch

Curious how you think about incorporating something like this without overcomplicating the system.