Problem: None of the benchmark scores are real.
Yesterday an X account belonging to a developer named Ben Sigman posted the launch of an open-source AI memory project called MemPalace. The post claimed "100% on LoCoMo" and "the first perfect score ever recorded on LongMemEval. 500/500 questions, every category at 100%." It credited the actress Milla Jovovich as a co-author. The GitHub account hosting the repository is named milla-jovovich/mempalace. The first commit to the repository is dated April 5. As of this writing, less than 24 hours after the launch post, the repository has approximately 5,400 stars and over 1.5 million views on the launch tweet.
For comparison: open-source memory projects with similar architectures and similarly honest baseline numbers typically receive a handful of stars in their first week. The variable producing the orders-of-magnitude difference in engagement is not the engineering. The engineering, as we'll demonstrate, is in some respects interesting and in most respects unexceptional. The variable is the celebrity name on the GitHub account and the celebrity attribution in the launch post, which described her as a co-author. Whatever the underlying collaboration looked like, the practical effect of attaching the name was that a repository created two days ago reached over 1.5 million people on a single tweet, and that reach carried the methodology errors documented below to an audience most of which will never open the BENCHMARKS.md file.
We work on a different memory project at Penfield, and a couple of months ago we published an audit of LoCoMo's ground truth documenting roughly 99 wrong, hallucinated, or misattributed answers across the dataset's ten conversations. A 100% score on the published version of LoCoMo is therefore mathematically impossible: the answer key contains errors any honest system would have to disagree with. So when the launch post showed up in our timeline, we set out to understand how the impossible number was produced.
What we found is a methodology stack that contains, in one repository created two days ago, almost every failure mode the AI memory benchmark layer suffers from right now. The interesting thing — the thing that made this worth writing about rather than ignoring — is that the project's own internal documentation discloses most of its failure modes honestly. The launch post strips every caveat. The methodology errors are common across the field. The honesty gap between the repository and the marketing is arguably the bigger story. The celebrity name is the reason anyone heard about it.
The LoCoMo bypass
LoCoMo is a conversational memory benchmark with ten long conversations and 1,986 question-answer pairs. The standard convention in published evaluations is to report on the 1,540-question non-adversarial subset; the launch post reports on all 1,986. The ten conversations contain 19, 19, 32, 29, 29, 28, 31, 30, 25, and 30 sessions respectively. Every conversation has fewer than fifty sessions.
The MemPalace LoCoMo runner produces its 100% number with top_k=50. Their own BENCHMARKS.md says this verbatim:
The LoCoMo 100% result with top-k=50 has a structural issue: each of the 10 conversations has 19–32 sessions, but top-k=50 exceeds that count. This means the ground-truth session is always in the candidate pool regardless of the embedding model's ranking. The Sonnet rerank is essentially doing reading comprehension over all sessions — the embedding retrieval step is bypassed entirely.
Setting top_k=50 against a candidate pool that maxes out at 32 retrieves the entire conversation. At that setting the pipeline reduces to: dump every session into Claude Sonnet, ask Sonnet which one matches. That is cat *.txt | claude. It is not retrieval and it is not memory. The "memory architecture" contributes nothing to the score.
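The arithmetic can be made concrete. In the sketch below, the session counts are the ones LoCoMo actually has, but the scoring function is deliberately random — and it still achieves 100% recall, because top_k exceeds every conversation's session count. This is a minimal illustration, not MemPalace's code:

```python
import random

# Session counts per LoCoMo conversation (from the dataset itself).
SESSIONS_PER_CONVERSATION = [19, 19, 32, 29, 29, 28, 31, 30, 25, 30]
TOP_K = 50

def retrieve(session_ids, score_fn, top_k):
    """Rank sessions by score_fn and keep the top_k candidates."""
    ranked = sorted(session_ids, key=score_fn, reverse=True)
    return set(ranked[:top_k])

# Even a random scoring function achieves perfect recall here, because
# top_k exceeds the session count of every conversation: the "retrieval"
# step returns the whole conversation unconditionally.
hits = total = 0
for n_sessions in SESSIONS_PER_CONVERSATION:
    session_ids = list(range(n_sessions))
    gold = random.choice(session_ids)
    candidates = retrieve(session_ids, lambda s: random.random(), TOP_K)
    hits += gold in candidates
    total += 1

assert hits == total  # 10/10, regardless of ranking quality
```

Any embedding model, or no embedding model at all, produces the same number at this setting; the score measures the reranker's reading comprehension, not the memory system.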
The honest LoCoMo numbers, from the same file, are 60.3% R@10 with no rerank and 88.9% R@10 with the project's hybrid scoring and no LLM. Those are real and unremarkable. The 100% should not be cited at all. It cannot be 100% in any case, because the published ground truth is wrong on roughly 99 questions. It is also worth noting that the LoCoMo judge scores up to 63% of intentionally wrong answers correct.
The LongMemEval metric error
LongMemEval as published is an end-to-end question-answering benchmark. A system has to retrieve from a haystack of prior chat sessions, generate an answer, and have that answer marked correct by a GPT-4 judge. Every score on the published LongMemEval leaderboard is the percentage of questions where the generated answer was judged correct.
The MemPalace LongMemEval runner does the retrieval step only. It never generates an answer and never invokes a judge. For each of the 500 questions it builds one document per session by concatenating only the user turns (assistant turns are not indexed at all), embeds with default ChromaDB embeddings, returns the top five sessions by cosine distance, and checks set membership against the gold session IDs labeled by the LongMemEval authors. If any one of the gold session IDs appears in the top five, the question scores 1.0. This metric is recall_any@5. The runner also computes recall_all@5 (the stricter version that requires every gold session to be retrieved) and the project reports the softer one.
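For clarity, here is what the two metrics compute — illustrative implementations matching the definitions above, not the runner's actual code:

```python
def recall_any_at_k(retrieved, gold):
    """1.0 if ANY gold session id appears among the retrieved ones."""
    return 1.0 if set(retrieved) & set(gold) else 0.0

def recall_all_at_k(retrieved, gold):
    """Stricter: 1.0 only if EVERY gold session id was retrieved."""
    return 1.0 if set(gold) <= set(retrieved) else 0.0

# A multi-session question with two gold sessions, only one retrieved:
retrieved = ["s12", "s40", "s07", "s33", "s91"]
gold = ["s12", "s58"]

assert recall_any_at_k(retrieved, gold) == 1.0  # counted as a full hit
assert recall_all_at_k(retrieved, gold) == 0.0  # the stricter score
```

On questions with a single gold session the two metrics coincide; the gap opens exactly on the multi-session questions, which is where the softer metric flatters the system.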
So the system never reads what is in the retrieved sessions, never produces an answer, and never demonstrates that the sessions it returned actually answer the question. The dataset author labeled them, the runner checks the labels, and credit is awarded on label-set overlap. None of the LongMemEval numbers in this repository — not the 100%, not the 98.4% "held-out" number, not the 96.6% raw baseline — are LongMemEval scores in the sense the published leaderboard means. They are retrieval recall numbers on the same dataset, which is a substantially easier task. Calling any of them a "perfect score on LongMemEval" is a metric category error.
The 100% number has a second, separate problem. The project's hybrid v4 mode was built by inspecting the three remaining wrong answers in their dev set and writing targeted code for each one: a quoted-phrase boost for a question containing a specific phrase in single quotes, a person-name boost for a question about someone named Rachel, and "I still remember" / "when I was in high school" patterns for a question about a high school reunion. Three patches for three specific questions. Then the same five hundred questions are rerun and the result is reported as a perfect score. The project's own BENCHMARKS.md calls this what it is, on line 461, verbatim:
This is teaching to the test. The fixes were designed around the exact failure cases, not discovered by analyzing general failure patterns.
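To make "teaching to the test" concrete, here is a hypothetical sketch of question-specific boosting in the style the benchmark notes describe. The regexes, names, and weights are invented for illustration and are not taken from the repository:

```python
import re

def overfit_boost(question, session_text, base_score):
    """Hypothetical per-question patches layered onto a retrieval score."""
    score = base_score
    # Patch 1: boost sessions containing a phrase single-quoted in the question.
    for phrase in re.findall(r"'([^']+)'", question):
        if phrase in session_text:
            score += 0.5
    # Patch 2: boost sessions mentioning one specific person's name.
    if "Rachel" in question and "Rachel" in session_text:
        score += 0.5
    # Patch 3: boost reminiscence phrasing for one reunion question.
    if "reunion" in question and ("I still remember" in session_text
                                  or "when I was in high school" in session_text):
        score += 0.5
    return score
```

Each branch exists to flip one known-wrong answer in the dev set. Rerunning the same 500 questions after adding branches like these measures how well the patches memorized those questions, not how the system generalizes.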
The features that don't exist in the code
The launch post lists "contradiction detection catches wrong names, wrong pronouns, wrong ages before you ever see them" as a feature. The file mempalace/knowledge_graph.py contains zero occurrences of the word "contradict." The only deduplication logic in that file is an exact-match check on (subject, predicate, object) triples — it blocks identical triples from being added twice and does nothing else. Conflicting facts about the same subject can accumulate indefinitely. The marketed feature does not exist in the code. Credit to the developer Leonard Lin (lhl), who documented this independently in issue #27 on the same repository within hours of the launch.
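Exact-match deduplication on triples can be sketched in a few lines. This is an illustrative reconstruction of the behavior described, assuming a set keyed on (subject, predicate, object) — not the project's actual class:

```python
class TripleStore:
    """Illustrative store with exact-match dedup only, no contradiction checks."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        triple = (subject, predicate, obj)
        if triple in self.triples:
            return False  # blocks only byte-identical repeats
        self.triples.add(triple)
        return True

store = TripleStore()
store.add("Alice", "age", "34")
store.add("Alice", "age", "41")  # a direct contradiction, stored happily
store.add("Alice", "age", "34")  # only this exact repeat is rejected
assert len(store.triples) == 2   # both conflicting ages now coexist
```

Contradiction detection would need, at minimum, a check for an existing triple with the same subject and predicate but a different object; per the repository's own code, nothing like that is present.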
AAAK is not lossless
The launch post claims "AAAK compression fits your entire life context into 120 tokens — 30x lossless compression any LLM reads natively." The project's compression module, mempalace/dialect.py, truncates sentences at 55 characters (if len(best) > 55: best = best[:52] + "..."), filters by keyword frequency, and provides a decode() function that splits the compressed string into a header dictionary without reconstructing the original text. There is no round-trip.
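The truncation alone rules out losslessness. Running the quoted line in isolation shows why — everything past character 52 is discarded before the ellipsis is appended, so no decode() could recover it:

```python
def truncate(best):
    # The quoted behavior: sentences over 55 characters are cut at 52
    # and an ellipsis is appended. The tail is simply discarded.
    if len(best) > 55:
        best = best[:52] + "..."
    return best

original = "My daughter was born in Lisbon in 2011 and we moved to Toronto in 2014."
compressed = truncate(original)

assert len(compressed) == 55
assert compressed.endswith("...")
assert compressed != original  # the tail is unrecoverable: not lossless
```

A lossless scheme must satisfy decode(encode(x)) == x for every input; a function that maps distinct long inputs onto 55-character prefixes has no inverse.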
There is also a measurement. The same BENCHMARKS.md reports results_raw_full500.jsonl at 96.6% R@5 and results_aaak_full500.jsonl at 84.2% R@5 — a 12.4 percentage point quality drop on the same dataset and the same metric, run by the project itself. Lossless compression cannot cause a measured quality drop. The project measured the loss, recorded it in the benchmark file, and then published "30x lossless" anyway.
The broken layer underneath
None of these failure modes are unique to MemPalace. LoCoMo's ground truth has been broken since the dataset was published. The benchmark wars in the AI memory space already involve documented methodology disputes that go well beyond normal disagreement: Zep published a detailed article in 2025 titled "Lies, Damn Lies, and Statistics: Is Mem0 Really SOTA in Agent Memory?" arguing that Mem0's published LoCoMo numbers depend on a flawed evaluation harness and on Mem0 having run a misconfigured version of Zep when benchmarking against it. Mem0's CTO replied on Zep's own issue tracker in "Revisiting Zep's 84% LoCoMo Claim: Corrected Evaluation & 58.44% Accuracy" claiming that Zep's real score is 58.44% rather than 84%. Letta has separately published "Benchmarking AI Agent Memory: Is a Filesystem All You Need?" reaching similar conclusions about reproducibility on the same benchmark. The MemPalace launch fits into a pattern the field is already arguing about. What's new is the scale of the honesty gap between a single repository and its own marketing.
What's unusual about MemPalace is not that one project did all of these things at once. What's unusual is that the project's own internal documentation discloses the issues honestly while the launch communication strips every caveat. BENCHMARKS.md is over 5,000 words of careful, self-aware methodology notes that contradict the launch tweet point by point. Whoever reviewed that file knew; it's all clearly documented. They published the inflated numbers anyway.
Over five thousand stars in less than twenty-four hours
The repository was created on April 5. The launch post went up on April 6. By the morning of April 7, the launch tweet had over 1.5 million views and the repository had over 5,400 stars. Open-source memory projects with similar architectures and similarly honest baseline numbers do not get 5,400 stars in twenty-four hours; they get fifty stars in their first week if they're lucky. The variable is the celebrity name. Strip the celebrity attribution out of the launch post and the project is a Python repository with a regex-based abbreviation scheme, default ChromaDB embeddings, a knowledge-graph file that doesn't implement the feature its README claims, and a benchmark folder whose own internal notes contradict the headline numbers. That repository gets fifty stars at best and dies in a week. The same code with an actress's name on the GitHub account gets 5,400 stars in less than a day and reaches over 1.5 million people on a single tweet.
The engineering result underneath all of this is genuinely interesting in one specific way: it appears that raw verbatim text plus default embeddings does, in fact, beat a number of LLM-extraction approaches at session retrieval on LongMemEval-s. That suggests the field is over-engineering the memory extraction step. It is a useful negative finding. It does not require a perfect score on a benchmark whose ground truth makes a perfect score impossible. It does not require a metric category error. It does not require hand-coded patches against three specific dev questions. It does not require a celebrity attribution. The honest version of this story would have been more interesting than the hyped version, and it would likely have survived more than 24 hours of community scrutiny instead of collapsing under it.
What we're doing about it
We maintain a public LoCoMo ground-truth audit at github.com/dial481/locomo-audit, with per-conversation error files documenting hallucinations, attribution errors, ambiguous questions, and incomplete answers across all ten conversations. The audit is open for contribution. We believe a new and improved version of LoCoMo would benefit every group working on conversational memory, including the MemPalace maintainers and including ourselves. The goal is better benchmarks, not a kill shot on any individual project.
Two other independent technical critiques of MemPalace landed within the same 24-hour window: Leonard Lin's README-versus-code teardown in issue #27, and a Chinese-language warning post for the simplified Chinese developer community in issue #37. If you're evaluating any AI memory system right now, the right thing to do is read the benchmark code yourself before trusting the headline number. If that feels like a lot to ask — and it is — that's the problem this article is about. The celebrity name on the GitHub account is what made the problem visible. The problem itself was already there.