DEV Community

Abhinav Goyal

MemPalace Benchmark Claims Don't Hold Up - A Technical Breakdown

This past week, MemPalace went viral on GitHub — an open-source AI memory system fronted by actress Milla Jovovich, claiming 100% on LongMemEval and 100% on LoCoMo. I was evaluating it for a production agentic AI pipeline and decided to dig into the actual code and community audits before integrating anything. Here's what I found.

To be fair, the core idea is solid. MemPalace stores your LLM conversation history locally using ChromaDB, organized into a spatial hierarchy:

Wings — people or projects

Halls — memory types

Rooms — conversation threads

Tunnels — cross-connections between memories

Instead of dumping your entire memory store to the LLM (the naive approach), it sends only the top 15 semantically relevant memories (~800 tokens). That's a claimed 250x token reduction vs. brute-force context stuffing. Fully offline, MIT-licensed, costs ~$0.70/year to run.
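The route-then-rank idea is easy to sketch. This is a toy stand-in, not MemPalace's actual code: it uses word overlap instead of real embeddings, and the `wing`/`hall`/`room` field names are illustrative, not the project's actual schema.

```python
# Toy sketch of spatial top-k retrieval: route to one wing of the
# palace, rank candidates by relevance, return only the top k.
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    wing: str   # person or project
    hall: str   # memory type
    room: str   # conversation thread

def score(query: str, mem: Memory) -> int:
    # Toy relevance: count of shared lowercase words. A real system
    # would use embedding cosine similarity (ChromaDB's default).
    return len(set(query.lower().split()) & set(mem.text.lower().split()))

def retrieve(store: list[Memory], query: str, wing: str, k: int = 15) -> list[str]:
    # Spatial routing first: restrict the search to one wing,
    # then send only the top-k hits instead of the whole store.
    candidates = [m for m in store if m.wing == wing]
    ranked = sorted(candidates, key=lambda m: score(query, m), reverse=True)
    return [m.text for m in ranked[:k]]

store = [
    Memory("alice prefers dark mode", "alice", "preferences", "ui"),
    Memory("q3 deadline moved to october", "project-q3", "facts", "plan"),
    Memory("alice timezone is utc+2", "alice", "facts", "profile"),
]

print(retrieve(store, "what mode does alice prefer", wing="alice", k=2))
```

The token savings come from the `ranked[:k]` cut: the LLM sees k short snippets instead of the full history, regardless of how large the store grows.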

The spatial retrieval does measurably outperform flat ChromaDB search. The privacy-first architecture is real. This part is genuinely good work.

The Benchmark Problem
Here's where it breaks down.

LongMemEval: 100% → 96.6%
The team identified exactly which questions were failing, engineered fixes targeting those specific questions, then retested on the same dataset. That is classic overfitting to a benchmark. After GitHub Issue #29 surfaced this publicly, they quietly revised the score to 96.6%; the community only caught the change via the commit history.

LoCoMo: 100% (trivially gamed)
They ran the evaluation with top_k=50 on a dataset containing only 19–32 items. When the retrieval window exceeds the entire dataset, every query returns everything by default. This isn't a memory-system benchmark result — it's a retrieval window that swallows the whole test set.
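The arithmetic behind the gaming is simple to demonstrate. In this sketch the "retriever" ranks items completely at random, yet recall still comes out perfect, because k exceeds the store size:

```python
# Why top_k=50 trivially "solves" a 19-32 item dataset: once the
# retrieval window covers the whole store, every query returns every
# item, and recall is 100% regardless of ranking quality.
import random

def retrieve_top_k(store: list[str], k: int) -> list[str]:
    # A deliberately useless ranker: random order.
    ranked = random.sample(store, len(store))
    return ranked[:k]

store = [f"memory {i}" for i in range(32)]   # a LoCoMo-sized store
hits = retrieve_top_k(store, k=50)           # window larger than the store

recall = len(set(hits) & set(store)) / len(store)
print(recall)  # 1.0 — perfect recall, zero retrieval skill demonstrated
```

Any meaningful evaluation needs k well below the dataset size, so that ranking quality actually determines the score.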

Real-World Performance
One developer ran manual end-to-end tests by actually asking questions through an LLM connected to MemPalace. Correct answer rate: approximately 17%. Three independent audits reached the same conclusion: solid ChromaDB wrapper, broken marketing claims.

README vs. Codebase Table

| README claim | Code reality |
| --- | --- |
| Contradiction detection | `knowledge_graph.py` has zero contradiction logic |
| Palace structure drives benchmark scores | LongMemEval scores are ChromaDB's default embedding performance; palace routing sits above this |
| MCP Claude Desktop integration | stdout bug corrupts the JSON stream, breaking Claude Desktop on first use |
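For context on that MCP bug: the MCP stdio transport reserves stdout for JSON-RPC messages, so any stray print lands in the middle of the stream and breaks the client's parser. A minimal illustration (hypothetical code, not MemPalace's actual server):

```python
# In an MCP stdio server, stdout carries only JSON-RPC; diagnostics
# must go to stderr. A single debug print() on stdout corrupts the
# stream the client is parsing.
import json
import sys

def send(msg: dict) -> None:
    # Correct: protocol messages, one JSON object per line, to stdout.
    sys.stdout.write(json.dumps(msg) + "\n")

def log(text: str) -> None:
    # Correct: human-readable diagnostics to stderr, never stdout.
    sys.stderr.write(text + "\n")

send({"jsonrpc": "2.0", "id": 1, "result": {"ok": True}})
log("loaded 42 memories")        # safe
# print("loaded 42 memories")    # BUG: this would reach the client's
#                                # JSON-RPC parser and fail on first use
```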

The Crypto Context
The primary author is Ben Sigman, a crypto CEO. Milla Jovovich had 7 commits across 2 days at launch. A memecoin spawned within days of the GitHub release. Celebrity face + inflated benchmarks + viral launch + token = a pattern the community rightly recognizes. The MIT license means no software rug-pull, but the marketing playbook is straight from crypto launch culture.
How It Compares to Obsidian / Logseq
Worth noting for anyone using PKM tools: these aren't competitors, they solve different problems.

| | MemPalace | Obsidian | Logseq |
| --- | --- | --- | --- |
| Storage format | ChromaDB binary vectors | Plain Markdown | Plain Markdown |
| Human readable | No | Yes | Yes |
| Portability | Low (Python API only) | Very high | Moderate |
| Best for | LLM agent memory | Human PKM | Journaling/outlining |

The practical hybrid: use Obsidian/Logseq as your human knowledge layer, feed structured data into a vector store only for agent retrieval. Don't get locked into a binary format.
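A minimal sketch of that hybrid, assuming a vault of Markdown notes and any vector store that accepts (id, text, metadata) records; the file layout and helper names here are hypothetical:

```python
# Keep Markdown as the human-readable source of truth; emit chunked
# records for a vector store (ChromaDB or similar) only for agent
# retrieval. Re-running this re-derives the index, so nothing is
# locked into a binary format.
from pathlib import Path

def chunk_note(path: Path, max_chars: int = 500) -> list[str]:
    # Split a note on blank lines, then pack paragraphs into chunks
    # small enough to embed individually.
    paragraphs = [p.strip() for p in path.read_text().split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) > max_chars:
            chunks.append(current)
            current = ""
        current = (current + "\n\n" + p).strip()
    if current:
        chunks.append(current)
    return chunks

def index_vault(vault: Path) -> list[tuple[str, str, dict]]:
    # (id, text, metadata) triples ready for vector_store.add(...);
    # metadata points back at the source note for provenance.
    records = []
    for note in sorted(vault.glob("**/*.md")):
        for i, chunk in enumerate(chunk_note(note)):
            records.append((f"{note.stem}-{i}", chunk, {"source": str(note)}))
    return records
```

Because the vector index is derived data, you can switch stores (or delete the index entirely) without losing anything: the Markdown vault remains the canonical copy.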

Verdict
MemPalace has a genuinely interesting spatial memory architecture. The local-first, privacy-respecting design is real. But benchmarks were manipulated, multiple advertised features don't exist in the codebase, and the launch was engineered around a celebrity and a memecoin.

It's a v0.1 ChromaDB wrapper with good ideas and dishonest marketing. Revisit in 3–6 months, once independent benchmark reproductions exist and the known bugs are fixed.
