When Google published TurboQuant in March, and Milla Jovovich's team dropped MemPalace two weeks ago, a specific question stuck with me: could you stitch them together into something that runs on a free Colab T4 and actually improves agent memory?
Seven days later, I have an answer. Two answers, actually. Neither is the one I expected, and at my sample size, neither is statistically decisive either.
This is the short version of what happened, what broke, and what the numbers say. All code is on GitHub; every result in this piece comes from 50 stratified questions on LongMemEval_S, judged by Gemma 3 12B with 8/9 raw agreement against my manual labels (Cohen's κ = 0.73, substantial).
The setup
The hypothesis was simple. Flat retrieval over long agent histories has two known problems: you pay prefill cost on everything you retrieve, and relevant information is often diluted by topically adjacent noise. MemPalace proposes spatial hierarchy as a fix: organize memories into "rooms" by topic, route queries to rooms first, then retrieve within. TurboQuant proposes ultra-low-bit KV quantization as a way to pay the prefill cost cheaply. Combined, you get persistent memory that fits on modest hardware.
I couldn't implement TurboQuant's reference path on a T4; the kernels assume FlashAttention 2 as shipped in the mainline flash-attn package, which supports Ampere and later. Turing GPUs like the T4 have a separate flash-attention-turing fork with a subset of features, but TurboQuant's reference implementation doesn't target it. So I used HQQ INT4 KV caching as a stand-in: same 4-bit compression target, a different quantizer, and one that Hugging Face's QuantizedCache supports out of the box on any GPU.
Important caveat: this substitutes one KV-cache compression method for another, so any finding here tests the higher-level idea of "cheap KV on small hardware," not TurboQuant itself.
For MemPalace, I built the minimum viable version: k-means over chunk embeddings to form 16 "rooms," two-stage retrieval routing queries to the top 2 rooms, then top-5 within. Same retrieval depth as flat, same LLM, same judge, only the retrieval architecture changes.
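The two-stage scheme can be sketched in a few lines. This is a toy version with a from-scratch k-means over unit-normalized embeddings; the actual run used MiniLM 384-d vectors and ChromaDB, and the function names here are mine, not MemPalace's.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Toy k-means over L2-normalized chunk embeddings (cosine via dot product)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmax(X @ centroids.T, axis=1)  # nearest centroid per chunk
        for j in range(k):
            if (labels == j).any():
                c = X[labels == j].mean(axis=0)
                centroids[j] = c / np.linalg.norm(c)
    return centroids, labels

def scoped_retrieve(query, X, centroids, labels, n_rooms=2, top_k=5):
    """Route to the top-n rooms by centroid similarity, then top-k within the pool."""
    rooms = np.argsort(query @ centroids.T)[::-1][:n_rooms]
    pool = np.flatnonzero(np.isin(labels, rooms))
    sims = X[pool] @ query
    return pool[np.argsort(sims)[::-1][:top_k]]  # global chunk indices
```

The failure mode discussed later falls straight out of `pool`: anything the top-2 rooms don't contain is invisible to the final top-5, no matter how similar it is to the query.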
The stack:
- Qwen3-4B-Instruct-2507, FP16 weights, ~8 GB VRAM
- MiniLM-L6-v2 embeddings (384d), ChromaDB cosine
- HQQ INT4 KV cache with `optimize=False` (the default `optimize=True` OOMs at 30k+ tokens on T4)
- Chunked prefill at 1024 tokens per pass (bypasses T4's attention-matrix wall at ~5k tokens in plain FP16)
- Gemma 3 12B via Google AI Studio for judging (free tier)
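Wiring up the INT4 cache is mostly configuration. A sketch, assuming a transformers version with QuantizedCache support and the hqq package installed; exact kwargs vary by version, and the `optimize` tweak lives in hqq's quantizer settings, whose hook point also depends on your versions, so it's omitted here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantizedCacheConfig

model_id = "Qwen/Qwen3-4B-Instruct-2507"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)

# 4-bit HQQ quantization of the live KV cache during generation
cache_config = QuantizedCacheConfig(backend="HQQ", nbits=4)
inputs = tok("How many projects have I led?", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=64,
    cache_implementation="quantized",
    cache_config=cache_config,
)
print(tok.decode(out[0], skip_special_tokens=True))
```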
The benchmark: LongMemEval_S, 500 questions across six types (single-session user/assistant/preference, multi-session, knowledge-update, temporal-reasoning), with ~115k-token haystacks each. I ran on a stratified sample of 50 (seed=42, reproducible) to stay within Colab T4 time budgets.
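The stratified draw itself is simple enough to show. A sketch, assuming the cleaned JSON exposes a `question_type` field; the quotas are the per-category Ns from my run:

```python
import random
from collections import defaultdict

QUOTAS = {
    "single-session-user": 10, "single-session-assistant": 8,
    "single-session-preference": 5, "multi-session": 12,
    "knowledge-update": 8, "temporal-reasoning": 7,
}  # sums to 50

def stratified_sample(questions, quotas=QUOTAS, seed=42):
    """Draw a fixed quota per question_type, reproducibly for a given seed."""
    by_type = defaultdict(list)
    for q in questions:
        by_type[q["question_type"]].append(q)
    rng = random.Random(seed)
    sample = []
    for qtype, quota in quotas.items():
        sample.extend(rng.sample(by_type[qtype], quota))
    return sample
```

Same seed, same manifest: that's what makes the 50-question sample reproducible across the three architecture runs.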
Before building anything: retrieval looked mostly fine
Before running architectures, I measured retrieval quality in isolation. On Day 5, flat RAG ran across all 50 evaluation questions; Hit@5 (the fraction of questions where at least one gold chunk made the top 5) was 47/50 = 94%. A smaller 10-question diagnostic earlier showed 9/10 questions with at least one gold chunk in the top 5 (Recall@5 mean = 0.17; the low mean reflects that some questions have 40+ gold chunks and only 5 slots).
That result reframed everything. If retrieval finds at least one gold chunk 94% of the time at K=5, any architecture that keeps retrieval quality constant but fails to improve accuracy is telling us the bottleneck is downstream. Keep this in mind; it's going to matter.
But note the softer phrasing: "retrieval looked mostly fine," not "retrieval is not the problem." I didn't run an experiment that strictly isolates retrieval from reasoning I'm inferring from behavior across architectures.
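For concreteness, here's what I mean by the two metrics. A minimal sketch; chunk ids are placeholders:

```python
def hit_at_k(ranked_ids, gold_ids, k=5):
    """1 if any gold chunk appears in the top-k retrieved, else 0."""
    return int(any(cid in gold_ids for cid in ranked_ids[:k]))

def recall_at_k(ranked_ids, gold_ids, k=5):
    """Fraction of gold chunks recovered in the top-k.
    With 40+ gold chunks and only 5 slots, this is capped near 0.125,
    which is why a low Recall@5 mean can coexist with a high Hit@5."""
    return len(set(ranked_ids[:k]) & set(gold_ids)) / len(gold_ids)
```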
Baseline: flat RAG gets 26/50 = 52%
The naive baseline: retrieve top-5 by cosine similarity, stuff into Qwen3's context, generate an answer. 50 questions, stratified, judged by Gemma.
| Question type | N | Correct | Parse-fail | Accuracy |
|---|---|---|---|---|
| single-session-user | 10 | 9 | 0 | 90.0% |
| single-session-assistant | 8 | 7 | 0 | 87.5% |
| single-session-preference | 5 | 2 | 0 | 40.0% |
| multi-session | 12 | 2 | 1 | 16.7% |
| knowledge-update | 8 | 4 | 0 | 50.0% |
| temporal-reasoning | 7 | 2 | 0 | 28.6% |
| Overall | 50 | 26 | 1 | 52.0% |
The one parse failure was an unparseable judge verdict on a multi-session question. It's counted as wrong in the "Accuracy" column; if you instead exclude parse failures from the denominator, the overall is 26/49 = 53.1%.
A few things fall out immediately. Single-session categories are near-ceiling: 90% on user questions, 87.5% on assistant questions. Any improvement from a fancier architecture would have to come from the hard categories: multi-session, knowledge-update, and temporal-reasoning. On those hard categories, retrieval Hit@5 was 0.88–1.00 across the sample. Retrieval looks fine. The model still fails.
52% is not state-of-the-art, but it's honest: ~2k input tokens per question, 3.7s latency, 9.8 GB peak VRAM, runs on free-tier Colab. This is the number the next two experiments have to beat.
Experiment 1: spatial routing regressed in the easy categories (−2 questions)
The MemPalace hypothesis is specific: spatial clustering of memories should help queries that require synthesizing across topically related sessions. Build 16 rooms via k-means, route each query to its top-2 rooms by centroid similarity, and retrieve top-5 chunks within that ~40-chunk pool.
Same 50 questions, same judge:
| Question type | Flat (N=50) | Scoped (N=50) | Δ |
|---|---|---|---|
| single-session-user | 9/10 (90%) | 8/10 (80%) | −1 Q |
| single-session-assistant | 7/8 (87%) | 6/8 (75%) | −1 Q |
| single-session-preference | 2/5 (40%) | 2/5 (40%) | 0 |
| multi-session | 2/12 (17%) | 2/12 (17%) | 0 |
| knowledge-update | 4/8 (50%) | 4/8 (50%) | 0 |
| temporal-reasoning | 2/7 (29%) | 2/7 (29%) | 0 |
| Overall | 26/50 (52%) | 24/50 (48%) | −2 Q |
A 2-question overall delta on N=50 is well within noise. A McNemar exact test on the discordant pairs (flat-correct-scoped-wrong vs flat-wrong-scoped-correct = 3 vs 1) gives p = 0.625 two-sided. This is not a statistically significant difference, and I shouldn't have called it "falsification" in my earlier drafts. The honest statement is: spatial routing did not improve accuracy in this setup, and if anything, the direction was negative in the easy categories.
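The p-value comes from a two-sided exact binomial test on the discordant pairs. A minimal implementation, so you can check the arithmetic:

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar test on discordant pair counts (b, c):
    the tail probability P(X >= max(b, c)) under Binomial(b + c, 0.5),
    doubled and capped at 1."""
    n, k = b + c, max(b, c)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

`mcnemar_exact(3, 1)` gives the 0.625 quoted here; `mcnemar_exact(8, 4)` gives the ≈0.388 for the reranking comparison later in the piece.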
The per-question flip analysis tells a more specific story than the aggregate. Out of 50: 23 both-correct, 22 both-wrong, 3 flat-won-scoped-lost, 1 scoped-won-flat-lost, 1 excluded for parse failure. That's 4 discordant pairs total. The direction suggests scoped hurts more than it helps in the easy categories, but with only 4 disagreeing pairs, I can't call that anything stronger than "suggestive."
What I can safely share is the qualitative shape of where they disagree.
The one scoped win:
Q478: "Can you remind me of the name of the last venue you recommended in the list of popular venues in Portland for indie music shows?"
Gold: Revolution Hall
Flat: "The last venue mentioned in the Portland indie music list is not specified in the excerpts."
Scoped: "Revolution Hall"
Scoped won here because its Portland-venues room was topically tight, so gold chunks ranked higher within the 40-chunk pool than against the full 290-chunk haystack. This is what the theory predicts. It happened once in fifty tries.
The three scoped losses:
Q13: "Where did I attend for my study abroad program?"
Gold: University of Melbourne in Australia
Flat: "University of Melbourne."
Scoped: "The excerpts do not contain information about where you attended your study abroad program."

Q445: "Can you remind me of the name of that restaurant in Cihampelas Walk that serves a great Nasi Goreng?"
Gold: Miss Bee Providore
Flat: "Miss Bee Providore."
Scoped: "The restaurant in Cihampelas Walk that serves great nasi goreng is Miss Bee's Nasi Goreng."
Q445 is the worst failure mode visible here: not only did scoped routing miss the answer, it hallucinated a plausible-sounding wrong one. The top-2 rooms didn't contain the gold session, so the model answered from adjacent evidence and invented "Miss Bee's Nasi Goreng": confident, wrong, and unrecoverable.
Why might spatial routing struggle here? A plausible explanation is that LongMemEval haystacks are adversarial by design. Each 115k-token haystack is 50+ unrelated ShareGPT sessions plus 1–2 gold sessions on the actual topic. When you k-means the chunk embeddings, you find generic topical clusters ("coding," "recipes," "travel") that match ShareGPT themes, not the specific question. The query might semantically land in a "shopping" room, while the gold evidence got sorted into a "discounts" room three clusters over. Flat retrieval scans all 290 chunks and finds it. Scoped scans 40 and can miss.
MemPalace's own benchmark page is careful to distinguish retrieval recall from QA accuracy; a fair comparison would need MemPalace's full routing and reranking stack, not the MVP I built. What I tested is the core clustering-plus-scoped-retrieval idea in isolation, and it didn't help here.
Experiment 2: LLM reranking regressed more (−5 questions)
Okay, scoped routing didn't help because k-means doesn't match query intent. What if we replaced the clustering step with something that understands what the query is asking? Retrieve top-20 flat, then ask Gemma 3 12B to rerank: "which 5 of these 20 most directly answer the question?"
This is the oldest trick in the RAG playbook. I expected a positive result.
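The rerank step is two small pieces: a prompt over the numbered top-20, and a parser for the judge's reply. A sketch; the prompt wording and reply format here are mine, not verbatim from the run:

```python
import re

def build_rerank_prompt(question, chunks):
    """Ask the judge model to pick the 5 chunks that most directly answer the question."""
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    return (
        f"Question: {question}\n\n"
        f"Here are {len(chunks)} conversation excerpts:\n\n{numbered}\n\n"
        "Which 5 excerpts most directly answer the question? "
        "Reply with exactly 5 indices as a comma-separated list, e.g. 3,0,17,4,9."
    )

def parse_rerank_reply(reply, n_chunks, top_k=5):
    """Pull the first top_k valid, distinct indices out of a free-form reply."""
    picked = []
    for tok in re.findall(r"\d+", reply):
        i = int(tok)
        if i < n_chunks and i not in picked:
            picked.append(i)
        if len(picked) == top_k:
            break
    return picked
```

The lenient parser matters in practice: free-tier judge models rarely return a clean comma-separated list on every call.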
| Question type | Flat (N=50) | Rerank (N=50) | Δ |
|---|---|---|---|
| single-session-user | 9/10 (90%) | 6/10 (60%) | −3 Q |
| single-session-assistant | 7/8 (87%) | 5/8 (62%, 1 pf) | −2 Q |
| single-session-preference | 2/5 (40%) | 1/5 (20%) | −1 Q |
| multi-session | 2/12 (17%) | 2/12 (17%) | 0 |
| knowledge-update | 4/8 (50%) | 5/8 (62%) | +1 Q |
| temporal-reasoning | 2/7 (29%) | 2/7 (29%) | 0 |
| Overall | 26/50 (52%) | 21/50 (42%) | −5 Q |
Reranking regressed 5 questions relative to flat. McNemar's exact test on discordant pairs (8 vs 4) gives p = 0.388 two-sided, also not statistically significant at N=50, but larger in magnitude than scoped and with a cleaner qualitative pattern.
Here's the diagnostic that makes this result interpretable: out of 50 questions, Gemma kept gold chunks in the top-5 after reranking 47 times. In the remaining 3: gold wasn't even in the top-20 for 1 question (2% of the sample), and Gemma dropped gold from top-5 in 2. Reranking didn't throw away evidence. Retrieval quality after rerank is nearly identical to flat's original retrieval.
So the model saw broadly the same information as in flat RAG and answered worse anyway.
My read: embedding retrieval's top-5 gives Qwen3 a diverse window into the session (adjacent chunks, context around the answer, and some semantically similar noise). Reranked top-5 is over-focused: Gemma confidently ranks together chunks that "look like they contain the answer," and they're often redundant or merely plausible rather than correct. Qwen3 sees 5 chunks all pointing at one thing and can't self-correct by comparing against a diverse context. The "noise" in embedding retrieval may have been a useful signal.
The one category that benefited was knowledge-update (+1 Q, from 4/8 to 5/8). That makes sense: knowledge-update questions have multiple competing candidate chunks ("my old job," "my current job," irrelevant jobs), and reranking can pick the most recent. But that specific structure (ranking among conflicting versions) is rare on this benchmark.
What the three accuracy numbers mean together
Two architectures, both failing to improve over flat in this specific setup, on the same 50 questions, with the same model, same retrieval budget, same judge. Three datapoints paint a consistent (though individually non-significant) picture:
Retrieval quality looked adequate. Hit@5 on the flat baseline was 47/50 = 94%. Hit@20 before rerank was 49/50 = 98%, with 47/50 gold chunks surviving rerank to top-5. Retrieval surfaces at least one gold chunk in roughly 94% of questions across all three architectures.
Candidate quality didn't obviously help. Reranking produces "better" top-5 chunks by Gemma's semantic judgment, and accuracy went down, not up.
Downstream answer synthesis appears to be the main observed weakness. Specifically: synthesis across chunks (multi-session stuck at 2/12 = 17% across all three architectures), selecting the right piece from similar-looking candidates (single-session drops sharply under reranking), and computing answers that require arithmetic over timestamps (temporal-reasoning stuck at 2/7 = 29%).
The hardest category, multi-session, sat at 2/12 across all three architectures. Hit@5 on multi-session was 0.92 in the 10-question diagnostic. The model can see the gold. It just can't add up "how many projects have I led" when the evidence is split across four different sessions, each mentioning one project.
These observations are consistent across my three runs. But with N=50 and none of the pairwise deltas reaching conventional significance, "consistent" is the right word, not "proved."
Limitations
Every claim in this piece should be read with this list attached:
N=50 stratified, seed=42. Overall accuracy has roughly ±10pp sampling uncertainty at this size, and per-category N as small as 5 has much wider bounds. None of the pairwise architecture deltas are statistically significant (all McNemar p > 0.3). I did not test seed sensitivity.
Retrieval quality was only directly probed on a 10-question diagnostic and the 50-question Hit@5 observation. Calling retrieval "not the bottleneck" would require a controlled counterfactual I didn't run.
Judge validation was against 9 clear manual labels (N=10 minus one "partial" case). Raw agreement was 8/9, Cohen's κ = 0.73 ("substantial"). This is decent for a sanity check, but far from definitive judge calibration, and the LLM-as-judge literature increasingly recommends multi-judge or ensemble calibration rather than single-model agreement checks.
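For reference, the agreement statistic is straightforward to compute. A generic helper; the 0.73 figure above comes from my manual labels, which aren't reproduced here, so the example below is synthetic:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' labels over the same items:
    observed agreement corrected for chance agreement from the marginals."""
    assert len(a) == len(b) and a
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[l] * cb[l] for l in set(a) | set(b)) / n ** 2  # chance agreement
    return (po - pe) / (1 - pe)
```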
I used HQQ INT4 KV as a stand-in for TurboQuant-style KV quantization, and only on the live cache during generation, not as pre-computed persistent blocks. So the original combined TurboQuant + MemPalace composition went untested; what I ran uses a different quantizer, applied only at inference time.
I did not test larger models. Qwen3-8B-AWQ or similar would fit on the T4 at INT4 weights and might eliminate the observed synthesis weakness. I chose not to run this to keep the scope focused, but the results here say nothing about what a bigger model would do.
The free-tier rate limits I encountered on the Gemini API varied by account; they shouldn't be read as general Google AI Studio limits. They were simply what my account and models had at the time.
Given these limitations, the revised headline conclusion is:
On a 50-question stratified LongMemEval_S sample, neither k-means room routing nor LLM reranking improved over a flat top-5 RAG baseline for Qwen3-4B on a free T4. In this setup, retrieval quality looked adequate, while downstream answer synthesis remained the main observed weakness.
The free-tier takeaways
With those limitations in mind, some practical suggestions if you're building in this space:
Start with flat RAG at top-5. It got 26/50 on LongMemEval_S with a 4B model in my run. That's your floor.
Don't add spatial routing unless your conversation history has a natural topical structure that matches query intent. Adversarial haystacks like LongMemEval may punish k-means.
Don't add LLM reranking without running an ablation on your specific benchmark. It can reduce context diversity in ways that hurt downstream answering.
If accuracy matters more than compute, a reasonable next experiment is scaling model size, not retrieval sophistication. INT4 weight quantization lets an 8B model fit in the VRAM footprint of a 4B FP16 model.
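The VRAM claim is back-of-envelope arithmetic, ignoring quantization-scale metadata and the KV cache itself:

```python
def weight_gib(n_params, bits_per_weight):
    """Approximate weight memory in GiB, ignoring quantization metadata overhead."""
    return n_params * bits_per_weight / 8 / 2**30

fp16_4b = weight_gib(4e9, 16)   # ~7.5 GiB
int4_8b = weight_gib(8e9, 4)    # ~3.7 GiB
```

So an INT4 8B model leaves headroom even on a 16 GB T4 where the FP16 4B model already fits.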
If you're writing a paper, framework, or blog post claiming a memory architecture works: publish the ablation. Test flat RAG on the same benchmark with the same judge. If your architecture doesn't beat flat by a meaningful margin with appropriate significance testing, say so. The field has a surplus of "novel memory systems" with no ablation against the boring baseline.
Practical artifacts
Everything is reproducible on a free Colab T4 in one afternoon:
- GitHub repo: rooms-of-kv full notebooks, requirements, sample manifest
- LongMemEval_S cleaned JSON: xiaowu0162/longmemeval-cleaned on HuggingFace
- Judge: Gemma 3 12B via Google AI Studio (free tier)
- Sample manifest: 50 questions stratified across 6 types, seed=42
- Total compute: ~2 hours of T4 time across 7 days, plus ~150 Gemma API calls
One thing I didn't implement, which the original TurboQuant + MemPalace combination would enable: persistent per-chunk quantized KV blocks that can be spliced into the cache without reprefill. That's a week of low-level transformers surgery and probably won't fit within Hugging Face's QuantizedCache API as it exists today. If you're building toward it, the key blocker I ran into is that QuantizedCache doesn't natively support splicing pre-computed blocks with their own position encodings. That's a real open problem, and solving it would be a paper, not a post.
Closing thought
The starting question was whether you could combine two 2026 papers into something that runs on a free GPU. The honest answer is: you can combine them, but the specific combinations I could actually build on free hardware with the substitutions and simplifications forced by a T4 didn't help on LongMemEval_S. The hard categories stayed hard. The easy categories sometimes got worse.
That's not a criticism of MemPalace or TurboQuant. Both papers are solid. It's a statement about my specific setup: a 4B model, a small sample, and a benchmark built to stress exactly this class of system. What I can't tell you from this work is what a bigger model, a bigger sample, or the unmodified architectures would do. If you run any of those, I'd genuinely like to hear about it.
Until then: on a free T4 with a 4B model, flat RAG at top-5 is 26/50 on LongMemEval_S. That's where I'd start.
*I built this on a free Colab T4. The GitHub repo has everything needed to reproduce the numbers.*