<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Abhinav Goyal</title>
    <description>The latest articles on DEV Community by Abhinav Goyal (@trippyengineer).</description>
    <link>https://dev.to/trippyengineer</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3838558%2Fe0e004a5-f468-4112-a4ed-505b9072de82.jpg</url>
      <title>DEV Community: Abhinav Goyal</title>
      <link>https://dev.to/trippyengineer</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/trippyengineer"/>
    <language>en</language>
    <item>
      <title>MemPalace Benchmark Claims Don't Hold Up - A Technical Breakdown</title>
      <dc:creator>Abhinav Goyal</dc:creator>
      <pubDate>Sun, 12 Apr 2026 17:39:00 +0000</pubDate>
      <link>https://dev.to/trippyengineer/mempalace-benchmark-claims-dont-hold-up-a-technical-breakdown-3l91</link>
      <guid>https://dev.to/trippyengineer/mempalace-benchmark-claims-dont-hold-up-a-technical-breakdown-3l91</guid>
      <description>&lt;p&gt;This past week, MemPalace went viral on GitHub — an open-source AI memory system fronted by actress Milla Jovovich, claiming 100% on LongMemEval and 100% on LoCoMo. I was evaluating it for a production agentic AI pipeline and decided to dig into the actual code and community audits before integrating anything. Here's what I found.&lt;br&gt;
To be fair, the core idea is solid. MemPalace stores your LLM conversation history locally using ChromaDB, organized into a spatial hierarchy:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wings&lt;/strong&gt; — people or projects&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Halls&lt;/strong&gt; — memory types&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rooms&lt;/strong&gt; — conversation threads&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tunnels&lt;/strong&gt; — cross-connections between memories&lt;/p&gt;

&lt;p&gt;Instead of dumping your entire memory store to the LLM (the naive approach), it sends only the top 15 semantically relevant memories (~800 tokens). That's a claimed 250x token reduction vs. brute-force context stuffing. Fully offline, MIT-licensed, costs ~$0.70/year to run.&lt;/p&gt;

&lt;p&gt;The spatial retrieval does measurably outperform flat ChromaDB search. The privacy-first architecture is real. This part is genuinely good work.&lt;/p&gt;
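&lt;p&gt;To make the idea concrete, here's a minimal toy sketch of wing-first routing. The names mirror the article, the word-overlap score is a crude stand-in for real vector similarity, and none of this is MemPalace's actual code. The point it illustrates: routing to a wing first shrinks the candidate pool before ranking, which is where the edge over flat search comes from.&lt;/p&gt;

```python
# Toy sketch of hierarchical ("palace") routing. Wing/Hall names follow
# the article; the scoring function is a naive word-overlap proxy for
# real embedding similarity.

def score(query, text):
    """Crude relevance proxy: number of shared lowercase words."""
    return len(set(query.lower().split()).intersection(text.lower().split()))

# Memory store organised as wing, then hall, then a list of memories.
palace = {
    "project-alpha": {
        "decisions": ["chose chromadb for vector storage"],
        "people": ["alice owns the retrieval layer"],
    },
    "project-beta": {
        "decisions": ["deferred the mobile client"],
    },
}

def spatial_retrieve(query, wing, top_k):
    """Route to one wing first, then rank only that wing's memories."""
    candidates = []
    for hall_memories in palace[wing].values():
        candidates.extend(hall_memories)
    ranked = sorted(candidates, key=lambda m: score(query, m), reverse=True)
    return ranked[:top_k]
```

&lt;p&gt;Flat search would rank every memory in every wing; routing means the similarity comparison only ever sees one wing's contents.&lt;/p&gt;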

&lt;p&gt;&lt;strong&gt;The Benchmark Problem&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Here's where it breaks down.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LongMemEval: 100% → 96.6%&lt;/strong&gt;&lt;br&gt;
The team identified exactly which questions were failing, engineered fixes targeting those specific questions, then retested on the same dataset. Classic overfitting to a benchmark. After GitHub Issue #29 surfaced this publicly, they revised the score to 96.6% without announcement. The community caught it via commit history.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LoCoMo: 100% (trivially gamed)&lt;/strong&gt;&lt;br&gt;
They ran evaluation with top_k=50 on a dataset containing only 19–32 items. When your retrieval window exceeds the entire dataset size, you retrieve everything by default. This isn't a memory-system benchmark — it's a retrieval window that swallows the entire test set.&lt;/p&gt;
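&lt;p&gt;A toy sketch (not MemPalace's evaluation code) makes the problem concrete: with any scoring function at all, a top-k window at least as large as the dataset returns every item, so "100% recall" is guaranteed before retrieval quality ever enters the picture.&lt;/p&gt;

```python
# Why top_k=50 on a 19-32 item dataset is meaningless: once the
# retrieval window covers the whole dataset, "retrieval" returns
# everything and recall is 100% by construction.

def retrieve(scored_items, top_k):
    """Return the items from the top_k highest-scored (score, item) pairs."""
    ranked = sorted(scored_items, key=lambda pair: pair[0], reverse=True)
    return [item for _, item in ranked[:top_k]]

# 25 items; the scores are irrelevant, because with top_k=50
# every single item comes back regardless of ranking quality.
dataset = [(i * 0.01, f"memory-{i}") for i in range(25)]
hits = retrieve(dataset, top_k=50)
assert len(hits) == len(dataset)  # the window swallows the test set
```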

&lt;p&gt;&lt;strong&gt;Real-World Performance&lt;/strong&gt;&lt;br&gt;
One developer ran manual end-to-end tests by actually asking questions through an LLM connected to MemPalace. Correct answer rate: approximately 17%. Three independent audits reached the same conclusion: solid ChromaDB wrapper, broken marketing claims.&lt;br&gt;
&lt;strong&gt;README vs. Codebase Table&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;README Claim&lt;/th&gt;
    &lt;th&gt;Code Reality&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;Contradiction detection&lt;/td&gt;
    &lt;td&gt;knowledge_graph.py has zero contradiction logic&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;Palace structure drives benchmark scores&lt;/td&gt;
    &lt;td&gt;LongMemEval scores are ChromaDB's default embedding performance; palace routing sits above this&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;MCP Claude Desktop integration&lt;/td&gt;
    &lt;td&gt;stdout bug corrupts JSON stream, breaks Claude Desktop on first use&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The Crypto Context&lt;/strong&gt;&lt;br&gt;
The primary author is Ben Sigman, a crypto CEO. Milla Jovovich had 7 commits across 2 days at launch. A memecoin spawned within days of the GitHub release. Celebrity face + inflated benchmarks + viral launch + token = a pattern the community rightly recognizes. The MIT license means no software rug-pull, but the marketing playbook is straight from crypto launch culture.&lt;br&gt;
&lt;strong&gt;How It Compares to Obsidian / Logseq&lt;/strong&gt;&lt;br&gt;
Worth noting for anyone using PKM tools: these aren't competitors; they solve different problems.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;&lt;/th&gt;
    &lt;th&gt;MemPalace&lt;/th&gt;
    &lt;th&gt;Obsidian&lt;/th&gt;
    &lt;th&gt;Logseq&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;Storage format&lt;/td&gt;
    &lt;td&gt;ChromaDB binary vectors&lt;/td&gt;
    &lt;td&gt;Plain Markdown&lt;/td&gt;
    &lt;td&gt;Plain Markdown&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;Human readable&lt;/td&gt;
    &lt;td&gt;No&lt;/td&gt;
    &lt;td&gt;Yes&lt;/td&gt;
    &lt;td&gt;Yes&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;Portability&lt;/td&gt;
    &lt;td&gt;Low (Python API only)&lt;/td&gt;
    &lt;td&gt;Very high&lt;/td&gt;
    &lt;td&gt;Moderate&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;Best for&lt;/td&gt;
    &lt;td&gt;LLM agent memory&lt;/td&gt;
    &lt;td&gt;Human PKM&lt;/td&gt;
    &lt;td&gt;Journaling/outlining&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The practical hybrid: use Obsidian/Logseq as your human knowledge layer, feed structured data into a vector store only for agent retrieval. Don't get locked into a binary format.&lt;/p&gt;
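&lt;p&gt;The hybrid wiring can be as simple as a chunker that walks the Markdown vault and keeps the source path attached, so the agent layer can always point back to a human-readable file. This is my own illustrative sketch (the function name and the paragraph-splitting rule are assumptions, not any particular tool's API); the actual embedding and indexing stay in whatever vector store you plug in.&lt;/p&gt;

```python
import pathlib

def chunk_markdown(vault_dir):
    """Split each Markdown note into paragraph chunks, keeping the
    source filename so agent retrieval results always trace back to
    the plain-text note. Embedding/indexing is left to the vector
    store of your choice."""
    chunks = []
    for path in sorted(pathlib.Path(vault_dir).glob("*.md")):
        text = path.read_text(encoding="utf-8")
        for para in text.split("\n\n"):
            para = para.strip()
            if para:
                chunks.append({"source": path.name, "text": para})
    return chunks
```

&lt;p&gt;The Markdown stays the source of truth; the vector store is a disposable index you can rebuild at any time.&lt;/p&gt;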

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;&lt;br&gt;
MemPalace has a genuinely interesting spatial memory architecture. The local-first, privacy-respecting design is real. But benchmarks were manipulated, multiple advertised features don't exist in the codebase, and the launch was engineered around a celebrity and a memecoin.&lt;/p&gt;

&lt;p&gt;It's a version-0.1 ChromaDB wrapper with good ideas and dishonest marketing. Revisit in 3–6 months once independent benchmark reproductions exist and the known bugs are fixed.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The reason your live AI demo spins has nothing to do with your model</title>
      <dc:creator>Abhinav Goyal</dc:creator>
      <pubDate>Mon, 23 Mar 2026 12:28:37 +0000</pubDate>
      <link>https://dev.to/trippyengineer/the-reason-your-live-ai-demo-spins-has-nothing-to-do-with-your-model-2hil</link>
      <guid>https://dev.to/trippyengineer/the-reason-your-live-ai-demo-spins-has-nothing-to-do-with-your-model-2hil</guid>
      <description>&lt;p&gt;There's a specific kind of fear before a live demo.&lt;/p&gt;

&lt;p&gt;Not general anxiety. The 30-seconds-before-you-hit-run kind. Where &lt;br&gt;
you're suddenly aware of every API call, every network hop, every &lt;br&gt;
dependency you didn't stress-test. You smile. You keep talking. &lt;br&gt;
Somewhere in the background, something is computing.&lt;/p&gt;

&lt;p&gt;And you're just hoping it finishes before the silence gets awkward.&lt;/p&gt;

&lt;p&gt;I've been in that position more times than I want to admit. I work &lt;br&gt;
at a global health non-profit. My audience is usually program managers, &lt;br&gt;
M&amp;amp;E specialists, researchers. People who are genuinely skeptical about &lt;br&gt;
whether AI belongs in their workflows. People who need to &lt;em&gt;see&lt;/em&gt; it work.&lt;/p&gt;

&lt;p&gt;One spinner and you've set back AI adoption in that room by six months.&lt;/p&gt;

&lt;p&gt;So I started asking a question I should have asked much earlier:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What actually has to happen live — and what am I running live out of habit?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That question rewired everything.&lt;/p&gt;


&lt;h2&gt;
  
  
  The split
&lt;/h2&gt;

&lt;p&gt;Most live AI demos fail for the same reason. Not bad models. Not flaky &lt;br&gt;
APIs. The architecture itself — generation and performance running in &lt;br&gt;
the same process, at the same time, in front of people.&lt;/p&gt;

&lt;p&gt;Audio synthesis. Script generation. API calls. All happening live, &lt;br&gt;
while a room waits.&lt;/p&gt;

&lt;p&gt;The fix isn't faster tools. It's removing the dependency entirely.&lt;/p&gt;

&lt;p&gt;Split it into two phases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1 — before anyone arrives:&lt;/strong&gt; narration scripts, audio synthesis, &lt;br&gt;
slide timing, the entire orchestration layer. All of it. Pre-computed, &lt;br&gt;
cached, done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2 — during the talk:&lt;/strong&gt; only the things that genuinely have to &lt;br&gt;
happen live. Real workflows. Real data. Real outputs.&lt;/p&gt;

&lt;p&gt;By the time I walk into the room, the orchestration overhead is gone. &lt;br&gt;
What's left is only the actual work — and it runs with nothing in its way.&lt;/p&gt;


&lt;h2&gt;
  
  
  Phase 1 — Before the room fills (15–20 min)
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; core.pre_generate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Per slide, in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;slide_reader.py&lt;/code&gt; pulls text and structure from the PPTX&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;script_agent.py&lt;/code&gt; sends it to Claude → 3–4 sentence narration back&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;voice_engine.py&lt;/code&gt; synthesises audio via Edge TTS (free, no API key)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ffprobe&lt;/code&gt; measures the actual duration of each generated file&lt;/li&gt;
&lt;li&gt;Everything lands in &lt;code&gt;cache/manifest.json&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That last step matters more than it sounds. The first version used word-count estimates for timing, and it was constantly wrong: natural pauses and variation in delivery pace threw it off. Switching to &lt;code&gt;ffprobe&lt;/code&gt; measuring actual media duration fixed it permanently. Now timing is a property of the content, not a guess about it.&lt;/p&gt;

&lt;p&gt;The pipeline is resumable. Re-run it and completed slides skip.&lt;/p&gt;
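&lt;p&gt;The duration measurement and the resume check can be sketched like this. The ffprobe flags are the standard incantation for printing only the media duration; the function names and manifest shape are my own illustration, not the repo's exact code.&lt;/p&gt;

```python
import subprocess

def ffprobe_cmd(audio_path):
    """ffprobe invocation that prints only the duration in seconds."""
    return ["ffprobe", "-v", "error", "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1", str(audio_path)]

def probe_duration(audio_path):
    """Measure the actual media duration, the single timing source."""
    out = subprocess.run(ffprobe_cmd(audio_path), capture_output=True,
                         text=True, check=True).stdout
    return float(out.strip())

def needs_generation(manifest, slide_id):
    """Resumability: skip a slide when the manifest already holds a
    completed entry for it (i.e. its duration has been measured)."""
    entry = manifest.get(slide_id)
    return entry is None or "duration" not in entry
```

&lt;p&gt;Re-running the pipeline then just walks the slides and regenerates only the ones where &lt;code&gt;needs_generation&lt;/code&gt; is true.&lt;/p&gt;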


&lt;h2&gt;
  
  
  Phase 2 — During the talk
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; core.orchestrator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;manifest → play audio (pygame)
         → wait actual duration
         → PyAutoGUI advances slide
         → fire n8n webhook (if demo slide)
         → next
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;No API calls. No generation. Reads the manifest, plays files, fires &lt;br&gt;
webhooks at the right moments.&lt;/p&gt;

&lt;p&gt;One thing that took longer than expected: PyAutoGUI focus management. &lt;br&gt;
Controlling another application's window from a separate Python process &lt;br&gt;
sounds trivial. Window focus, keypress timing, different PowerPoint &lt;br&gt;
versions — all of it needed explicit settle delays and focus checking. &lt;br&gt;
Budget an afternoon for this.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pynput&lt;/code&gt; handles keyboard controls globally without stealing PowerPoint &lt;br&gt;
focus. SPACE pauses, D skips a demo countdown, Q quits.&lt;/p&gt;


&lt;h2&gt;
  
  
  Phase 3 — Q&amp;amp;A
&lt;/h2&gt;

&lt;p&gt;Final slide, mic opens automatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SpeechRecognition → Google STT
→ Claude (full conversation history maintained across turns)
→ Edge TTS → plays answer aloud → loops
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The decision to maintain the full &lt;code&gt;messages&lt;/code&gt; array across Q&amp;amp;A turns &lt;br&gt;
was the smallest change with the biggest effect. Follow-up questions &lt;br&gt;
get answered in full session context. Claude remembers what it said &lt;br&gt;
two questions ago. That made the Q&amp;amp;A feel like a completely different &lt;br&gt;
product from the single-turn version.&lt;/p&gt;


&lt;h2&gt;
  
  
  The three live workflows
&lt;/h2&gt;

&lt;p&gt;These are the parts that actually run live. They fire via webhook at configured slide numbers. All three ship as importable JSON — connect credentials and they run immediately.&lt;/p&gt;
&lt;h3&gt;
  
  
  Email Pipeline
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Webhook (triggered by orchestrator)
→ Gmail: fetch latest email
→ Claude: classify intent + draft reply
→ Escalation Router
→ Gmail Draft + CC logic + Sheets log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This is the slide where the phone went face down.&lt;/p&gt;

&lt;p&gt;n8n fired. Gmail opened on the secondary screen. Claude read an &lt;br&gt;
incoming email, classified the intent, drafted a reply, routed it &lt;br&gt;
for escalation. 4 seconds. Nobody touched a keyboard.&lt;/p&gt;
&lt;h3&gt;
  
  
  Meeting Pipeline
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Webhook (paste any transcript)
→ Claude: action items, decisions, risks, follow-up email
→ Pull attendees from Sheets
→ Gmail + Slack + Sheets log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;I pasted a transcript from a meeting that happened that morning. By the time I finished talking over the slide — action items extracted, follow-ups written to 6 attendees, Slack notified, row in Sheets.&lt;/p&gt;

&lt;p&gt;Someone in the back: &lt;em&gt;"wait, that just actually happened?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Yeah.&lt;/p&gt;
&lt;h3&gt;
  
  
  Evidence Intelligence Engine
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Webhook (research question)
→ Claude: decompose into sub-queries
→ Perplexity Web Search  ─┐
→ Perplexity Academic    ─┘ parallel
→ Claude: is this evidence strong enough?
→ [yes] → Google Doc brief
→ [no]  → rewrite queries → retry (max 2 rounds)
→ Slack + Sheets log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The quality gate is what gets the most questions. Claude evaluates &lt;br&gt;
whether it actually has enough evidence to synthesise before writing &lt;br&gt;
the brief. If not, it rewrites the queries and runs again. Caps at &lt;br&gt;
two retries then forces synthesis anyway.&lt;/p&gt;

&lt;p&gt;Live. In the room. Under 90 seconds.&lt;/p&gt;

&lt;p&gt;The question afterward wasn't "how did you build this?"&lt;/p&gt;

&lt;p&gt;It was: &lt;strong&gt;"can I have this?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's the only metric that matters.&lt;/p&gt;


&lt;h2&gt;
  
  
  One thing I'd change immediately
&lt;/h2&gt;

&lt;p&gt;Built with ElevenLabs initially. Swapped to Microsoft Edge TTS when I realised the quality was close enough for narration and Edge TTS costs literally $0. If you're building voice into an AI pipeline for a resource-constrained organisation, start there. Upgrade only if you actually hit a ceiling.&lt;/p&gt;


&lt;h2&gt;
  
  
  Structure
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ai-presentation-orchestrator/
├── core/
│   ├── orchestrator.py      main runtime controller
│   ├── pre_generate.py      pre-generation pipeline
│   ├── regenerate.py        selective slide regeneration
│   ├── diagnose.py          pre-flight checks
│   └── logger.py
│
├── agents/
│   ├── script_agent.py
│   ├── slide_controller.py
│   └── slide_reader.py
│
├── integrations/
│   ├── voice_engine.py      Edge TTS + pygame
│   ├── heygen_engine.py     avatar video (optional)
│   ├── n8n_trigger.py
│   └── slack_notifier.py
│
├── n8n/                     3 importable workflow JSONs
├── docs/
│   ├── RUNBOOK.md           day-of checklist
│   └── architecture.svg
└── cache/                   auto-created, gitignored
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Run it
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/TrippyEngineer/ai-presentation-orchestrator
&lt;span class="nb"&gt;cd &lt;/span&gt;ai-presentation-orchestrator
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="nb"&gt;cp &lt;/span&gt;env.example .env
&lt;span class="c"&gt;# ANTHROPIC_API_KEY is the only required key&lt;/span&gt;

python &lt;span class="nt"&gt;-m&lt;/span&gt; core.diagnose      &lt;span class="c"&gt;# before anything else&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; core.pre_generate  &lt;span class="c"&gt;# 15–20 min before the talk&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; core.orchestrator  &lt;span class="c"&gt;# when the room fills up&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Install ffmpeg separately (it isn't a pip package):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Windows: &lt;a href="https://gyan.dev/ffmpeg/builds" rel="noopener noreferrer"&gt;gyan.dev/ffmpeg/builds&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Mac: &lt;code&gt;brew install ffmpeg&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Linux: &lt;code&gt;sudo apt install ffmpeg&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/TrippyEngineer" rel="noopener noreferrer"&gt;
        TrippyEngineer
      &lt;/a&gt; / &lt;a href="https://github.com/TrippyEngineer/ai-presentation-orchestrator" rel="noopener noreferrer"&gt;
        ai-presentation-orchestrator
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Fully automated AI presentation system — pre-generates narration, fires live n8n demo workflows, and answers audience questions via voice Q&amp;amp;A. Zero manual input during the talk.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;AI Presentation Orchestrator&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Separate the compute from the performance.&lt;/strong&gt;&lt;br&gt;
Pre-generate everything. Cache it. Walk in and press Enter.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://python.org" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fba7e48fe18b6c58e553ffa88922b1d239e140dfb80911ff12a5a70753cd0dd9/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f507974686f6e2d332e31302532422d3337373661623f6c6f676f3d707974686f6e266c6f676f436f6c6f723d7768697465" alt="Python 3.10+"&gt;&lt;/a&gt;
&lt;a href="https://github.com/TrippyEngineer/ai-presentation-orchestrator/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/6c290d3fa30f4a51454757590f2beec29a83cccdfcd9945e2c0d387af01477f3/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d323263353565" alt="License: MIT"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;What This Project Demonstrates&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Two-phase system design&lt;/strong&gt; — generation and runtime are fully decoupled. All LLM calls, TTS synthesis, and video rendering happen before the presentation. The runtime reads a manifest and executes deterministically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache-first pipeline&lt;/strong&gt; — every output is stored with a manifest. Partial runs resume from where they stopped. Nothing is regenerated unless explicitly forced.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-modal orchestration&lt;/strong&gt; — synchronises audio playback, slide advancement, and live webhook triggers from a single timing source (actual media duration).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modular integration layer&lt;/strong&gt; — TTS, avatar video, n8n, Slack, and Google APIs are all isolated modules. Swapping any one of them does not affect the core pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live voice Q&amp;amp;A&lt;/strong&gt; — final slide activates a mic listener. Audience questions are captured via speech recognition, answered by Claude in persona, and spoken aloud via TTS. Conversation…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/TrippyEngineer/ai-presentation-orchestrator" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;





&lt;p&gt;Six talks since I rebuilt around this principle. Zero spinners.&lt;/p&gt;

&lt;p&gt;The audience doesn't know there's a manifest. They don't know about &lt;br&gt;
the cache. They see real workflows execute in 4 seconds. They hear &lt;br&gt;
spoken answers before they've finished processing that it worked.&lt;/p&gt;

&lt;p&gt;Nothing hesitates — because by the time the lights go down, the only &lt;br&gt;
thing left to compute is the actual work.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Abhinav Goyal — building AI infrastructure for global health programs, &lt;br&gt;
where reliability isn't optional. Drop questions below, especially if you've hit the live demo &lt;br&gt;
reliability problem from a different angle.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>automation</category>
      <category>n8n</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
