This week Google Research published TurboQuant — a two-stage KV-cache quantization algorithm that achieves 6x memory reduction and 8x attention speedup with zero accuracy loss at 3 bits. No training required.
It's genuinely impressive engineering. But it's worth being precise about what problem it solves.
The two AI memory problems
Most people conflate two distinct problems:
Problem A: memory within a session
As context grows, the KV-cache grows. It becomes expensive in RAM and slow in attention computation. TurboQuant solves this — brilliantly.
Problem B: memory between sessions
When the session ends, the KV-cache is gone. The model starts from zero next time. No memory of past interactions, no accumulated patterns, no structured experience. TurboQuant doesn't touch this.
What TurboQuant actually does
TurboQuant is a two-stage pipeline:
PolarQuant — applies a random rotation to each vector, converts it to polar coordinates, and quantizes the components without needing per-block normalization constants. This eliminates the 1–2 bits of overhead that traditional quantization methods carry.
QJL (Quantized Johnson-Lindenstrauss) — encodes residual error with a single sign bit. Zero memory overhead.
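To make the PolarQuant stage concrete, here is a toy NumPy sketch of angle-only polar quantization after a random rotation. It is illustrative only: it keeps the pair radii in full precision and omits the QJL residual stage, so it should not be read as the paper's actual encoding.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Random orthogonal matrix via QR decomposition: the "random rotation" step
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def polar_quantize(x, bits=3):
    # Pair up coordinates, treat each pair as a 2-D point, and quantize its
    # angle on a uniform grid over [0, 2*pi). No per-block scale constant needed.
    pairs = x.reshape(-1, 2)
    radii = np.linalg.norm(pairs, axis=1)
    angles = np.arctan2(pairs[:, 1], pairs[:, 0]) % (2 * np.pi)
    levels = 2 ** bits
    codes = np.round(angles / (2 * np.pi) * levels).astype(int) % levels
    return codes, radii

def polar_dequantize(codes, radii, bits=3):
    angles = codes * 2 * np.pi / 2 ** bits
    return np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1).ravel()

d = 64
R = random_rotation(d)
v = rng.standard_normal(d)

codes, radii = polar_quantize(R @ v)          # quantize in the rotated basis
v_hat = R.T @ polar_dequantize(codes, radii)  # rotate back to reconstruct
rel_err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
print(f"relative error with 3-bit angles: {rel_err:.3f}")
```

Even this naive version lands in the right ballpark for reconstruction error; the paper's contribution is doing the full pipeline, radii included, at 3 bits total with no accuracy loss.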
Result: 3-bit KV-cache, 6x compression, 8x speedup, zero accuracy degradation on LongBench, Needle-in-a-Haystack, RULER, and ZeroSCROLLS benchmarks.
This makes long-context inference significantly cheaper and faster. Real value.
The gap it leaves open
The moment the session ends — the KV-cache is gone.
Week 1 with any model: average answers.
Week 4 with any model: still average answers. It forgot everything.
Fine-tuning costs thousands of dollars and weeks. RAG gives you retrieval, not cognition. Context windows bill per token and still reset.
What we built for Problem B
AuraSDK is a local cognitive substrate that sits outside model weights.
It accumulates structured experience across sessions through a 5-layer pipeline:
Record → Belief → Concept → Causal → Policy
Each layer is derived deterministically from the one below — no LLM, no embeddings. Policy hints like "deploy to staging first" aren't written by anyone. They emerge from repeated causal patterns in stored experience.
```python
from aura import Aura

brain = Aura("./agent_memory")
brain.store("Staging deploy prevented 3 production incidents", tags=["workflow"])
brain.store("User always deploys to staging first", tags=["workflow"])

brain.run_maintenance()  # consolidation pass: the cognitive stack derives hints

hints = brain.get_surfaced_policy_hints()
# → [{"action": "Prefer", "domain": "workflow", "description": "deploy to staging first"}]
```
What v1.5.4 adds:
- Autonomous cognitive plasticity — the substrate observes model output and updates itself. No fine-tuning. Full audit trail.
- Salience weighting — what matters persists longer, decays slower
- Contradiction governance — conflicting evidence surfaced explicitly, not averaged silently
Performance (1,000 records, Ryzen 7, v1.5.4):
- Store: 0.91ms
- Recall: 0.076ms (~2,600× faster than Mem0)
- Recall (cached): 1.4µs
- Maintenance cycle: 15ms median
No API keys. No cloud. No LLM dependency. ~3MB binary. Fully offline. MIT license.
The full picture
| | TurboQuant | AuraSDK |
|---|---|---|
| Problem | KV-cache overhead within session | No memory between sessions |
| Approach | Quantization of attention keys/values | Persistent cognitive substrate |
| Scope | Single inference pass | Cross-session accumulation |
| Requires LLM | Yes (runs inside it) | No |
| Works offline | N/A | Fully |
| Open source | Research paper | MIT, ships today |
These are complementary. TurboQuant makes inference cheaper in the moment. AuraSDK makes the model smarter over time.
The field needs both.
```shell
pip install aura-memory
```
Top comments (4)
Really sharp breakdown — especially the distinction between in-session vs cross-session memory. TurboQuant clearly pushes the efficiency frontier for long-context inference.
That said, I think the “Problem B” side (persistent memory) is more nuanced than it looks.
Building a per-user cognitive layer isn’t just a storage problem — it starts to resemble training a personalized model. Each user has different mental models, biases, and even contradictions over time. Unlike global model training (which is curated and validated), personal memory lacks a clear ground truth.
That creates a few hard issues:
- Validation: if incorrect patterns are stored, the system may reinforce them
- Contradictions: users don't behave consistently, so derived "policies" can drift or conflict
- Context decay: what was true a month ago may no longer be relevant
Because of this, I’m not fully convinced that turning memory into a deterministic cognitive pipeline (Record → Belief → Policy) is always the right abstraction. It risks overfitting to noisy personal data.
In practice, a simpler model often works better: treat memory as a retrieval layer, not a reasoning layer.
Something like: store history → retrieve relevant slices based on intent → inject into context → let the model reason statelessly each time.
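That retrieval-only loop is small enough to sketch. A naive word-overlap scorer stands in here for real intent-based retrieval; the point is the shape of the pipeline, not the scoring:

```python
def assemble_context(history, intent, k=2):
    # Score stored notes by word overlap with the current intent and
    # inject only the top-k; reasoning stays stateless, inside the model.
    intent_words = set(intent.lower().split())
    scored = sorted(
        history,
        key=lambda note: len(intent_words & set(note.lower().split())),
        reverse=True,
    )
    return scored[:k]

history = [
    "deployed to staging before prod, caught a bad migration",
    "user prefers dark mode",
    "staging deploy prevented an incident last week",
]
print(assemble_context(history, "how should I deploy this change?"))
```

Notice there is no derivation step at all: nothing is promoted to a belief or policy, and the model re-reasons from the injected slices on every call.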
This is closer to how humans actually operate too — we don’t perfectly internalize everything; we externalize (notes, logs, tools) and recall selectively when needed.
So I’d frame it less as “making the model smarter over time” and more as: “improving context assembly over time”
Curious how you’re thinking about validation and contradiction handling at scale — that feels like the real bottleneck for any persistent cognitive system.
This is the right pushback, and you've identified the real hard problem.
You're correct that "deterministic pipeline" sounds overconfident. But the architecture doesn't assume ground truth — it's built around epistemic uncertainty:
- Records carry confidence, support_mass, and conflict_mass — one-off signals stay weak singletons
- Beliefs only stabilize when corroborating evidence converges across independent observations
- Contradicting patterns suppress each other explicitly — they don't silently average
- Everything decays; volatile signals don't survive consolidation
So "user said X once" never becomes a policy hint. "User did X consistently across 15 independent observations, with no contradicting signals" might.
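To show how "said once" stays weak while repeated consistent observation stabilizes, here is a minimal accumulator. The class, threshold, and dominance rule are illustrative, not AuraSDK internals:

```python
class Belief:
    def __init__(self, statement, stabilize_at=5):
        self.statement = statement
        self.support_mass = 0.0
        self.conflict_mass = 0.0
        self.stabilize_at = stabilize_at

    def observe(self, supports, weight=1.0):
        if supports:
            self.support_mass += weight
        else:
            self.conflict_mass += weight

    @property
    def stable(self):
        # Stabilizes only when enough independent observations have accumulated
        # AND support clearly dominates conflict.
        return (self.support_mass >= self.stabilize_at
                and self.support_mass > 3 * self.conflict_mass)

b = Belief("user deploys to staging first")
b.observe(True)
print(b.stable)   # a single observation stays a weak singleton -> False

for _ in range(14):
    b.observe(True)
print(b.stable)   # 15 consistent observations, no contradictions -> True

b.observe(False, weight=6.0)
print(b.stable)   # heavy contradicting evidence suppresses it again -> False
```

The last step is the governance part: contradiction does not average the belief down gradually, it knocks it out of "stable" entirely until support re-dominates.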
On your "retrieval layer, not reasoning layer" model — I actually think we agree more than it looks. AuraSDK doesn't replace model reasoning. It improves what gets assembled into context: instead of raw history, you get beliefs with evidence weight, causal patterns with support counts, and policy hints with provenance chains.
The real bottleneck you named — validation at scale — is open. We don't have a good answer for "user's beliefs are systematically wrong." That's why everything stays advisory and auditable, not authoritative.
Curious whether you see a path where the retrieval layer itself could become epistemically aware, or whether you think that's always better left to the model.
I think this also highlights a deeper separation that might explain why systems like ChatGPT or Gemini don’t prioritize this layer.
What we’re discussing here belongs to the user cognition layer, not the model layer.
Most general-purpose models are optimized for:
- correctness
- consistency
- shared knowledge across users
So their memory design naturally focuses on:
problem → solution → reusable context
That works well at scale, because the goal is to serve everyone with stable, generalizable answers.
But the kind of memory we’re talking about here is different.
It’s not about what is universally correct —
it’s about how a specific user understands, resolves, and evolves their thinking over time.
And that introduces a major challenge:
human cognition is not stable.
A user’s beliefs, understanding, and even problem framing can change over time.
So any system that tries to “lock in” user-specific memory as ground truth risks becoming outdated or misleading.
This is likely why most systems avoid building deeply personalized cognitive memory:
- it's dynamic
- it drifts
- it lacks clear validation signals
Instead, they keep memory shallow and retrieval-based, and let the model handle reasoning.
But that also means they miss an opportunity.
Because if we treat memory not as static truth, but as a structured, evolving cognitive graph — with:
- open / closed propositions
- resolution events
- branching when understanding deepens
- decay or superseding of outdated beliefs
Then the system doesn’t need to assume correctness.
It only needs to track:
how the user’s understanding changes over time.
So the goal shifts from:
“store what is true”
to:
“track what has been resolved, what is still open, and how the user’s thinking evolves”
That’s a fundamentally different problem from building a general-purpose model —
and probably why it hasn’t been a focus so far.
I think this is where the discussion becomes really interesting.
The question isn’t just how we stabilize beliefs, but who or what actually validates them. And I’m starting to think repetition alone is a weak signal.
A user can repeat the same unresolved problem 15 times without contradiction, and the system will converge on a strong belief — but nothing has actually been resolved or understood. It’s just stable confusion.
So what seems missing is not just probabilistic weighting, but a notion of resolution as a first-class signal.
Instead of modeling memory as accumulated observations, we could think of it as state transitions:
- a problem appears
- the user explores / struggles
- a resolution event happens (explicit confirmation, demonstrated understanding, or a successful outcome)
- memory consolidates around the resolved state
- prior exploratory context is downgraded or pruned
But I think it goes one step further.
Each problem can be treated as a proposition with state:
- open → still being explored
- closed → resolved and confirmed
When the user revisits or deepens the topic, the system shouldn't just overwrite or accumulate more signals. It should either:
- reopen the proposition (if new uncertainty appears), or
- branch into a new proposition derived from the resolved one
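The open/close/reopen/branch lifecycle described here is small enough to state directly. This is a hypothetical shape for the idea, not an existing API:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class State(Enum):
    OPEN = "open"      # still being explored
    CLOSED = "closed"  # resolved and confirmed

@dataclass
class Proposition:
    topic: str
    state: State = State.OPEN
    parent: Optional["Proposition"] = None

    def resolve(self):
        # A resolution event: explicit confirmation, demonstrated
        # understanding, or a successful outcome.
        self.state = State.CLOSED

    def reopen(self):
        # New uncertainty appears: don't overwrite history, reopen.
        self.state = State.OPEN

    def branch(self, subtopic):
        # Deepening a resolved topic spawns a child rather than mutating it,
        # so the graph records how understanding evolved.
        return Proposition(subtopic, parent=self)

deploys = Proposition("why do staging deploys catch more bugs?")
deploys.resolve()
canary = deploys.branch("do canary releases add anything on top of staging?")
```

Because branches keep a parent pointer, "how understanding evolves" falls out of the structure for free: the lineage from confusion to resolution to deeper questions is the graph itself.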
At that point, memory is no longer a log or even a belief system — it becomes a structured cognitive graph.
So instead of optimizing for “what happened” or even “what is likely true,” memory starts optimizing for:
what has been resolved, what is still open, and how understanding evolves over time.
One important distinction here is that this layer represents the user’s cognition, not the model’s.
- The model encodes general knowledge about the world
- This memory encodes how a specific user understands the world
If we mix the two, we either:
- reinforce user bias as if it were truth, or
- fail to capture how the user actually thinks
So maybe the direction isn’t just epistemically-aware or even resolution-aware retrieval, but:
proposition-aware memory with the ability to open, close, and branch
Curious how you think about incorporating something like this without overcomplicating the system.