DEV Community

Abhinav Goyal

MemPalace Benchmark Claims Don't Hold Up - A Technical Breakdown

This past week, MemPalace went viral on GitHub — an open-source AI memory system fronted by actress Milla Jovovich, claiming 100% on LongMemEval and 100% on LoCoMo. I was evaluating it for a production agentic AI pipeline and decided to dig into the actual code and community audits before integrating anything. Here's what I found.

To be fair, the core idea is solid. MemPalace stores your LLM conversation history locally using ChromaDB, organized into a spatial hierarchy:

Wings — people or projects

Halls — memory types

Rooms — conversation threads

Tunnels — cross-connections between memories

Instead of dumping your entire memory store to the LLM (the naive approach), it sends only the top 15 semantically relevant memories (~800 tokens). That's a claimed 250x token reduction vs. brute-force context stuffing. Fully offline, MIT-licensed, costs ~$0.70/year to run.
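The route-then-rank idea is easy to sketch. This is a toy stand-in, not MemPalace's actual code: it uses word overlap instead of real embeddings, and the `wing`/`hall`/`room` field names are illustrative, not the project's actual schema.

```python
# Toy sketch of spatial top-k retrieval: route to one wing of the
# palace, rank candidates by relevance, return only the top k.
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    wing: str   # person or project
    hall: str   # memory type
    room: str   # conversation thread

def score(query: str, mem: Memory) -> int:
    # Toy relevance: count of shared lowercase words. A real system
    # would use embedding cosine similarity (ChromaDB's default).
    return len(set(query.lower().split()) & set(mem.text.lower().split()))

def retrieve(store: list[Memory], query: str, wing: str, k: int = 15) -> list[str]:
    # Spatial routing first: restrict the search to one wing,
    # then send only the top-k hits instead of the whole store.
    candidates = [m for m in store if m.wing == wing]
    ranked = sorted(candidates, key=lambda m: score(query, m), reverse=True)
    return [m.text for m in ranked[:k]]

store = [
    Memory("alice prefers dark mode", "alice", "preferences", "ui"),
    Memory("q3 deadline moved to october", "project-q3", "facts", "plan"),
    Memory("alice timezone is utc+2", "alice", "facts", "profile"),
]

print(retrieve(store, "what mode does alice prefer", wing="alice", k=2))
```

The token savings come from the `ranked[:k]` cut: the LLM sees k short snippets instead of the full history, regardless of how large the store grows.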

The spatial retrieval does measurably outperform flat ChromaDB search. The privacy-first architecture is real. This part is genuinely good work.

The Benchmark Problem
Here's where it breaks down.

LongMemEval: 100% → 96.6%
The team identified exactly which questions were failing, engineered fixes targeting those specific questions, then retested on the same dataset. That is classic overfitting to a benchmark. After GitHub Issue #29 surfaced this publicly, they quietly revised the score to 96.6%; the community only caught the change via the commit history.

LoCoMo: 100% (trivially gamed)
They ran the evaluation with top_k=50 on a dataset containing only 19–32 items. When the retrieval window exceeds the entire dataset, every query returns everything by default. This isn't a memory-system benchmark result — it's a retrieval window that swallows the whole test set.
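The arithmetic behind the gaming is simple to demonstrate. In this sketch the "retriever" ranks items completely at random, yet recall still comes out perfect, because k exceeds the store size:

```python
# Why top_k=50 trivially "solves" a 19-32 item dataset: once the
# retrieval window covers the whole store, every query returns every
# item, and recall is 100% regardless of ranking quality.
import random

def retrieve_top_k(store: list[str], k: int) -> list[str]:
    # A deliberately useless ranker: random order.
    ranked = random.sample(store, len(store))
    return ranked[:k]

store = [f"memory {i}" for i in range(32)]   # a LoCoMo-sized store
hits = retrieve_top_k(store, k=50)           # window larger than the store

recall = len(set(hits) & set(store)) / len(store)
print(recall)  # 1.0 — perfect recall, zero retrieval skill demonstrated
```

Any meaningful evaluation needs k well below the dataset size, so that ranking quality actually determines the score.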

Real-World Performance
One developer ran manual end-to-end tests by actually asking questions through an LLM connected to MemPalace. Correct answer rate: approximately 17%. Three independent audits reached the same conclusion: solid ChromaDB wrapper, broken marketing claims.

README vs. Codebase Table

| README claim | Code reality |
| --- | --- |
| Contradiction detection | `knowledge_graph.py` has zero contradiction logic |
| Palace structure drives benchmark scores | LongMemEval scores are ChromaDB's default embedding performance; palace routing sits above this |
| MCP Claude Desktop integration | stdout bug corrupts the JSON stream, breaking Claude Desktop on first use |
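For context on that MCP bug: the MCP stdio transport reserves stdout for JSON-RPC messages, so any stray print lands in the middle of the stream and breaks the client's parser. A minimal illustration (hypothetical code, not MemPalace's actual server):

```python
# In an MCP stdio server, stdout carries only JSON-RPC; diagnostics
# must go to stderr. A single debug print() on stdout corrupts the
# stream the client is parsing.
import json
import sys

def send(msg: dict) -> None:
    # Correct: protocol messages, one JSON object per line, to stdout.
    sys.stdout.write(json.dumps(msg) + "\n")

def log(text: str) -> None:
    # Correct: human-readable diagnostics to stderr, never stdout.
    sys.stderr.write(text + "\n")

send({"jsonrpc": "2.0", "id": 1, "result": {"ok": True}})
log("loaded 42 memories")        # safe
# print("loaded 42 memories")    # BUG: this would reach the client's
#                                # JSON-RPC parser and fail on first use
```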

The Crypto Context
The primary author is Ben Sigman, a crypto CEO. Milla Jovovich had 7 commits across 2 days at launch. A memecoin spawned within days of the GitHub release. Celebrity face + inflated benchmarks + viral launch + token = a pattern the community rightly recognizes. The MIT license means no software rug-pull, but the marketing playbook is straight from crypto launch culture.
How It Compares to Obsidian / Logseq
Worth noting for anyone using PKM tools: these aren't competitors, they solve different problems.

| | MemPalace | Obsidian | Logseq |
| --- | --- | --- | --- |
| Storage format | ChromaDB binary vectors | Plain Markdown | Plain Markdown |
| Human readable | No | Yes | Yes |
| Portability | Low (Python API only) | Very high | Moderate |
| Best for | LLM agent memory | Human PKM | Journaling/outlining |

The practical hybrid: use Obsidian/Logseq as your human knowledge layer, feed structured data into a vector store only for agent retrieval. Don't get locked into a binary format.
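A minimal sketch of that hybrid, assuming a vault of Markdown notes and any vector store that accepts (id, text, metadata) records; the file layout and helper names here are hypothetical:

```python
# Keep Markdown as the human-readable source of truth; emit chunked
# records for a vector store (ChromaDB or similar) only for agent
# retrieval. Re-running this re-derives the index, so nothing is
# locked into a binary format.
from pathlib import Path

def chunk_note(path: Path, max_chars: int = 500) -> list[str]:
    # Split a note on blank lines, then pack paragraphs into chunks
    # small enough to embed individually.
    paragraphs = [p.strip() for p in path.read_text().split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) > max_chars:
            chunks.append(current)
            current = ""
        current = (current + "\n\n" + p).strip()
    if current:
        chunks.append(current)
    return chunks

def index_vault(vault: Path) -> list[tuple[str, str, dict]]:
    # (id, text, metadata) triples ready for vector_store.add(...);
    # metadata points back at the source note for provenance.
    records = []
    for note in sorted(vault.glob("**/*.md")):
        for i, chunk in enumerate(chunk_note(note)):
            records.append((f"{note.stem}-{i}", chunk, {"source": str(note)}))
    return records
```

Because the vector index is derived data, you can switch stores (or delete the index entirely) without losing anything: the Markdown vault remains the canonical copy.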

Verdict
MemPalace has a genuinely interesting spatial memory architecture. The local-first, privacy-respecting design is real. But benchmarks were manipulated, multiple advertised features don't exist in the codebase, and the launch was engineered around a celebrity and a memecoin.

It's a v0.1 ChromaDB wrapper with good ideas and dishonest marketing. Revisit in 3–6 months, once independent benchmark reproductions exist and the known bugs are fixed.
