MemPalace is an open-source AI memory system scoring 96.6% on LongMemEval, the highest ever published
It stores raw conversations in ChromaDB locally with zero cloud dependency or API costs
The "palace" architecture (wings, rooms, halls, tunnels) improves retrieval recall by 34 points over flat search
AAAK compression is experimental and currently regresses accuracy, but the raw mode benchmark is legit
19 MCP tools let Claude, GPT, and Gemini search your entire conversation history automatically
Specialist agents keep domain-specific diaries across sessions for code review, architecture, and ops
The creators published honest corrections within 48 hours of launch after community scrutiny
MemPalace Scored 96.6% on LongMemEval. Here Is What That Actually Means.
Every AI conversation I have disappears when the session ends. Six months of debugging sessions, architecture decisions, and "we tried X and it failed because Y" reasoning. Gone. I start from scratch every Monday morning.
MemPalace tries to fix that. Based on the benchmarks, it might actually work.
19.5 Million Tokens Per Year, All Gone
If you use AI daily for real development work, you generate roughly 19.5 million tokens per year. Decisions, debugging sessions, trade-off discussions. The usual approaches to keeping that context alive either blow past every context window (paste everything) or cost 500+ EUR per year in LLM summarization that strips out the nuance.
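The 19.5 million figure is easy to sanity-check. A back-of-envelope version, assuming roughly 75,000 tokens per working day (my assumption for illustration, not a number from the project):

```python
# Back-of-envelope estimate of yearly AI conversation volume.
# Both inputs are assumptions for illustration, not MemPalace figures.
tokens_per_workday = 75_000   # heavy daily AI-assisted development
workdays_per_year = 260       # 52 weeks x 5 working days

yearly_tokens = tokens_per_workday * workdays_per_year
print(f"{yearly_tokens:,} tokens/year")  # 19,500,000 tokens/year
```

Scale either input down by half and you are still throwing away millions of tokens of reasoning every year.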
MemPalace flips the model: store everything raw, then make it findable. No AI decides what matters. You keep every word. The structure handles retrieval.
170 tokens loaded on wake-up. Semantic search on demand. About 10 EUR per year in compute. All local.
What MemPalace Actually Is
MemPalace is an open-source Python library that turns your AI conversation history into a searchable, structured memory system. Raw verbatim text goes into ChromaDB (a local vector database), organized using a spatial metaphor borrowed from ancient Greek rhetoric.
Greek orators memorized entire speeches by placing ideas in rooms of an imaginary building. Walk through the building, find the idea. Same principle here, applied to AI memory.
Install it, point it at your conversation exports, search anything you have ever discussed.
pip install mempalace
mempalace init ~/projects/myapp
mempalace mine ~/chats/ --mode convos
mempalace search "why did we switch to GraphQL"
Three commands. Your entire AI history, searchable.
The Palace Architecture
MemPalace organizes memories into a navigable hierarchy:
Wings are top-level containers. One per project, teammate, or topic.
Rooms are specific subjects within a wing. Auth decisions in one room, deployment configs in another, billing logic in a third. The system auto-detects these during mining.
Halls connect related rooms within the same wing. They represent memory types: facts (locked decisions), events (milestones), discoveries (breakthroughs), preferences (habits), and advice (solutions).
Tunnels connect rooms across different wings. When "auth-migration" appears in both a person's wing and a project's wing, a tunnel cross-references them automatically.
Closets hold summaries pointing to the original content. Drawers hold the raw verbatim files. Original words, never summarized.
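The hierarchy above can be sketched as plain data. This is my own illustrative model of the concepts, not MemPalace's internal schema:

```python
from dataclasses import dataclass, field

@dataclass
class Room:
    name: str                                            # e.g. "auth-decisions"
    drawers: list[str] = field(default_factory=list)     # raw verbatim files
    closets: list[str] = field(default_factory=list)     # summaries pointing at drawers

@dataclass
class Wing:
    name: str                                            # one per project, teammate, or topic
    halls: dict[str, list[str]] = field(default_factory=dict)  # memory type -> room names
    rooms: dict[str, Room] = field(default_factory=dict)

@dataclass
class Tunnel:
    topic: str            # shared subject, e.g. "auth-migration"
    wings: tuple[str, str]  # the two wings it cross-references

# One wing, one room, and a "facts" hall pointing at that room.
myapp = Wing(name="myapp")
myapp.rooms["auth-decisions"] = Room(name="auth-decisions")
myapp.halls["facts"] = ["auth-decisions"]
```

The point of the model: every query can be scoped to a wing, a hall, or a single room before any vector search happens.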
The spatial structure is not cosmetic. Tested on 22,000+ real conversation memories:
Searching all closets: 60.9% recall
Searching within a wing: 73.1% (+12 points)
Searching wing + hall: 84.8% (+24 points)
Searching wing + room: 94.8% (+34 points)
That is a 34-point improvement from structure alone. That is the product.
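Mechanically, the narrowing is metadata filtering: each memory carries wing and room tags, and a query restricted to matching tags searches a far smaller candidate pool (ChromaDB exposes this as a `where` clause on `collection.query`). A dependency-free sketch of the principle, with hypothetical tags standing in for vector search:

```python
# Toy memory store; in MemPalace these tags live as ChromaDB metadata.
memories = [
    {"text": "switched to GraphQL for batching", "wing": "myapp", "room": "api-design"},
    {"text": "Postgres over SQLite: concurrent writes", "wing": "myapp", "room": "db-decisions"},
    {"text": "standup moved to 10am", "wing": "team", "room": "logistics"},
]

def search(query_word, wing=None, room=None):
    # Filter by tags first, then match text (a stand-in for vector search).
    pool = [m for m in memories
            if (wing is None or m["wing"] == wing)
            and (room is None or m["room"] == room)]
    return [m["text"] for m in pool if query_word in m["text"]]

# Unfiltered search scans the whole palace; wing+room search scans one room.
print(search("GraphQL"))
print(search("GraphQL", wing="myapp", room="api-design"))
```

Smaller pool, fewer near-miss distractors, higher recall at the same top-5 cutoff.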
The 96.6% Benchmark
MemPalace scored 96.6% R@5 on LongMemEval, the standard benchmark for long-term memory systems. Highest score ever published, free or paid. 500 questions, independently reproduced on an M2 Ultra in under 5 minutes.
Important context: that score comes from raw verbatim mode. Not the experimental AAAK compression. Not rooms-filtered mode. Raw text in ChromaDB with semantic search.
The creators were upfront about this after the community poked holes in their initial claims. Honest breakdown:
Raw mode: 96.6% R@5 (the headline number)
AAAK compressed mode: 84.2% R@5 (12.4 point regression)
The "+34% palace boost" compares unfiltered to metadata-filtered search, which is standard ChromaDB functionality
More on the honesty part later.
AAAK: The Experimental Compression Layer
AAAK is a lossy abbreviation dialect designed to pack repeated entities into fewer tokens. Entity codes, structural markers, sentence truncation. Any LLM can read it without a decoder because it is essentially truncated English.
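I have not seen AAAK's actual spec, but the entity-code idea can be illustrated in a few lines. Short codes only pay off once names recur often enough to amortize the legend that has to travel with the text:

```python
def compress(text, entities):
    """Replace entity names with short codes (illustrative sketch, not AAAK itself)."""
    codes = {name: f"E{i}" for i, name in enumerate(entities)}
    out = text
    for name, code in codes.items():
        out = out.replace(name, code)
    # The legend is the overhead that makes this a net loss at small scale.
    legend = " ".join(f"{c}={n}" for n, c in codes.items())
    return f"[{legend}] {out}"

msg = "auth-service calls billing-service"
packed = compress(msg, ["auth-service", "billing-service"])
print(packed)
# Each entity appears only once here, so the legend outweighs the savings,
# matching the creators' correction that AAAK does not save tokens at small scale.
print(len(packed) > len(msg))  # True
```

Across thousands of sessions where the same entities repeat constantly, the arithmetic flips, which is exactly the regime the creators now claim for it.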
The honest status, straight from the creators:
AAAK is lossy, not lossless (their README originally said "30x lossless," which was wrong)
Does not save tokens at small scales
Can save tokens at scale with many repeated entities across thousands of sessions
Currently regresses LongMemEval accuracy versus raw mode
They corrected these claims within 48 hours of launch after community scrutiny. The original token count example used a rough heuristic instead of an actual tokenizer. Real counts showed the AAAK example used more tokens than plain English at small scale.
The benchmark correction impressed me more than the benchmark itself.
19 MCP Tools for Any AI
MemPalace ships as an MCP server with 19 tools. One-time setup:
claude mcp add mempalace -- python -m mempalace.mcp_server
After that, your AI can search, add, delete, traverse, and query your entire memory palace. Ask Claude "what did we decide about auth last month?" and it calls mempalace_search on its own.
Four tool categories: palace read/write (search, list, deduplicate, manage drawers), knowledge graph (temporal facts stored in local SQLite rather than Neo4j, at no cost), navigation (traverse rooms, find cross-wing tunnels), and agent diary (AAAK-encoded domain-specific journals).
The knowledge graph part deserves a closer look. Facts get validity windows. When something stops being true, you invalidate it. Historical queries still return it. Current queries skip it. That solves one of the worst problems with AI memory: stale information presented as current truth.
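The validity-window idea is simple to implement in SQLite. A minimal sketch using my own schema (assumed to be similar in spirit to the project's, not copied from it):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE facts (
        subject TEXT, fact TEXT,
        valid_from TEXT,   -- ISO date the fact became true
        valid_to TEXT      -- NULL while still current
    )
""")
# Invalidating a fact means closing its window, not deleting it.
db.execute("INSERT INTO facts VALUES ('db', 'we use SQLite', '2024-01-01', '2024-11-03')")
db.execute("INSERT INTO facts VALUES ('db', 'we use Postgres', '2024-11-03', NULL)")

# Current query: only facts whose window is still open.
current = db.execute(
    "SELECT fact FROM facts WHERE subject='db' AND valid_to IS NULL").fetchall()

# Historical query: what was true on a given date.
asof = db.execute("""
    SELECT fact FROM facts
    WHERE subject='db' AND valid_from <= ? AND (valid_to IS NULL OR valid_to > ?)
""", ("2024-06-01", "2024-06-01")).fetchall()

print(current)  # [('we use Postgres',)]
print(asof)     # [('we use SQLite',)]
```

The superseded fact never disappears; it just stops answering "now" questions.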
Specialist Agents
This is the feature I keep thinking about for real-world use. MemPalace lets you create agents that focus on specific domains. A code reviewer, an architecture advisor, an ops tracker. Each one gets its own wing and diary in the palace.
Your CLAUDE.md stays the same size no matter how many agents you add. They live in the palace, not in your config. Each maintains its own memory, reads its own history, builds domain expertise over time.
A code reviewer that remembers every bug pattern across six months. An architecture advisor that recalls every design trade-off and why you picked one over the other. An ops tracker with a log of every incident and what actually fixed it.
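The per-agent diary pattern needs nothing exotic. A toy version, assuming each agent appends dated entries to its own journal and reads only its own history back (hypothetical API, not MemPalace's):

```python
from datetime import date
from collections import defaultdict

class AgentDiary:
    """One append-only journal per specialist agent."""
    def __init__(self):
        self._entries = defaultdict(list)  # agent name -> [(date, note), ...]

    def log(self, agent, note):
        self._entries[agent].append((date.today().isoformat(), note))

    def recall(self, agent, keyword):
        # Each agent sees only its own domain history.
        return [n for _, n in self._entries[agent] if keyword in n]

diary = AgentDiary()
diary.log("code-reviewer", "off-by-one in pagination cursor, third time this quarter")
diary.log("ops-tracker", "incident: redis eviction storm, fixed by maxmemory-policy")
print(diary.recall("code-reviewer", "pagination"))
```

The isolation is the feature: the code reviewer's bug patterns never crowd the ops tracker's incident log, and neither bloats your config.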
Letta (formerly MemGPT) charges 20 to 200 EUR per month for agent-managed memory. MemPalace does it with a wing and a diary file.
The Honest Launch
What made me actually write about this project is not the benchmark. It is what happened after launch.
Within hours, the community found real problems:
The AAAK token example used an inaccurate counting heuristic instead of a real tokenizer
"30x lossless compression" was overstated (AAAK is lossy)
"+34% palace boost" was standard ChromaDB metadata filtering, not a novel mechanism
"Contradiction detection" existed as a utility but was not wired into the actual system
"100% with Haiku rerank" was real but the pipeline was not in the public scripts
Instead of defending or spinning, creators Milla Jovovich and Ben Sigman published a detailed correction directly in the README. Acknowledged every issue. Explained what was still true and reproducible. Listed exactly what they were fixing.
That almost never happens in open source. Especially not for a project that just launched with a headline benchmark number. The instinct is to protect the number. They chose to protect trust.
How It Compares
Claude's built-in memory remembers preferences and key facts between sessions. Good for "I prefer Postgres." MemPalace is for "why did we choose Postgres over SQLite on November 3rd, and what was the specific concern about concurrent writes." Different scope.
Zep/Graphiti takes a similar knowledge graph approach with temporal validity, but runs on Neo4j in the cloud (25+ EUR per month). MemPalace uses SQLite locally. Free.
Letta/MemGPT offers agent-managed memory with subscription pricing (20-200 EUR per month). MemPalace is free and local.
CLAUDE.md works great for rules and preferences. It does not scale to 19.5 million tokens of conversation history. MemPalace is the deep archive. CLAUDE.md is the quick reference card.
Why Local-First Matters Here
Every other AI memory solution (Zep, Letta, Mem0) sends your conversation data to someone else's server. Your debugging sessions, architecture decisions, code review discussions. All of it, hosted externally.
MemPalace stores everything in ChromaDB on your disk. Knowledge graph in SQLite on your disk. Zero API calls for storage or retrieval. The only external call is optional cloud LLM reranking.
For solo devs, that is convenient. For teams on proprietary code, it is a hard requirement. For regulated industries (finance, healthcare, legal), local-only memory is the only option that passes compliance.
The cost gap is real too. Zep: 25 EUR per month. Letta: 20 to 200 EUR per month. MemPalace: your laptop's electricity bill. Over multiple projects and years, that gap compounds.
Auto-Save Hooks for Claude Code
MemPalace ships with two hooks for Claude Code users. A save hook triggers every 15 messages and captures topics, decisions, direct quotes, and code changes. It also regenerates the critical facts layer so your AI wakes up current.
A shutdown hook fires at session end. It writes a structured summary: what happened, what got decided, what is still pending. An automatic session debrief that feeds back into the palace.
You do not have to remember to save anything. Your workflow stays the same. Memory accumulates in the background.
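The trigger logic for a save hook like this is trivial to sketch. A counter-based version, my illustration of the described behavior rather than the shipped hook:

```python
SAVE_EVERY = 15  # the article's described cadence

class SaveHook:
    def __init__(self, save_fn):
        self.count = 0
        self.save_fn = save_fn  # would write topics/decisions/quotes to the palace

    def on_message(self, message):
        self.count += 1
        if self.count % SAVE_EVERY == 0:
            # Checkpoint; the real hook also regenerates the critical facts layer.
            self.save_fn(message)

saved = []
hook = SaveHook(saved.append)
for i in range(31):
    hook.on_message(f"msg {i}")
print(len(saved))  # fires at the 15th and 30th messages -> 2
```

The shutdown hook is the same idea fired once, on session end, with a structured summary instead of a rolling checkpoint.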
For teams running Claude Code across multiple projects, this is where MemPalace goes from experiment to infrastructure. Every team member's AI sessions feed into a shared palace. Cross-reference who decided what, when, and why, without manual documentation.
What to Watch
The 96.6% raw mode benchmark is legit and independently reproduced. Strong foundation. But a few things to track:
AAAK needs to prove itself. The compression layer currently hurts accuracy. If accuracy stays stuck around 84.2% while tokens are not actually being saved at normal scales, it might work better as an optional export format than a core feature.
ChromaDB stability is a concern. Several open issues mention segfaults on macOS ARM64 and version pinning problems. Your memory system is only as reliable as its vector store.
Scale is still early. 22,000+ memories works. What about 200,000? Two million? The palace architecture should help, but nobody has tested it yet.
Community momentum matters. The project got serious attention at launch. Maintaining that through honest corrections takes guts, but it also risks losing the hype-driven crowd. I think the honest crowd is more valuable long-term.
Bottom Line
MemPalace is the first open-source AI memory system I have used that feels like it could actually stick. The 96.6% benchmark is real. The architecture makes sense. The creators fix things instead of defending them.
If you use AI for development, you are generating thousands of decisions per year that vanish between sessions. MemPalace is the best attempt I have seen at making those decisions persist.
Code is on GitHub. pip install mempalace. Point it at your conversation exports. See what you have been forgetting.