When multiple AI agents serve the same user, they lie to each other.
Not intentionally. But Agent A hears "I switched to Vue" while Agent B still has "prefers React" in memory. When the user asks Agent B for a framework recommendation, they get React. The user already told the system they switched. The system forgot — or rather, it never resolved the contradiction.
I built Mnemos, an open-source memory engine that fixes this. And I tested it on the hardest memory benchmark available — MemoryAgentBench from ICLR 2026. The results surprised me.
The published ceiling is 7%. Mnemos hits 12%.
MemoryAgentBench's Conflict Resolution split tests whether a system can handle contradictory facts. The multi-hop variant is the hardest — it requires chaining 2-3 reasoning steps to detect that a contradiction exists.
The paper's own conclusion: "In multi-hop conflict resolution scenarios, all methods achieve single-digit accuracy rates (at most 7%), highlighting this as a critical bottleneck."
Every system they tested — Mem0, MemGPT, Zep, HippoRAG, Self-RAG, even GPT-4o with full 128K context — scored 7% or below.
Mnemos scored 12%.
| System | Multi-Hop Accuracy |
|---|---|
| Mnemos | 12.0% |
| Dense RAG (top-10) | 7.0% |
| HippoRAG-v2 | 6.0% |
| Self-RAG | 5.0% |
| Zep | 4.0% |
| MemGPT | 3.0% |
| Mem0 | 2.0% |
| Cognee | 1.0% |
On single-hop with short context, Mnemos reached 90%. But I'll be honest about the full picture later — there are splits where it struggles.
The insight: not all contradictions are the same
Here's what existing memory systems do when "Lisa Patel was appointed CEO" arrives after "The CEO is John Smith" was already stored:
They keep both.
When the user asks "Who is the CEO?", the retrieval system finds both facts (they're both highly relevant), sends them to the LLM, and the LLM guesses. Sometimes it picks the old one. On multi-hop questions where the contradiction is indirect, it picks wrong most of the time.
Mnemos takes a different approach. When new information arrives, it runs a conflict detection pipeline:
```text
New fact: "Lisa Patel was appointed CEO"
    |
[1] Embed with sentence-transformers
    |
[2] Find similar memories (cosine > 0.55)
    → Finds: "The CEO is John Smith" (similarity: 0.82)
    |
[3] Verify same topic (entity overlap: "CEO" in both)
    |
[4] Detect contradiction (transition language: "appointed")
    |
[5] Classify: FACTUAL_CORRECTION
    → Strategy: REPLACE (delete old fact)
    |
Result: Only "Lisa Patel was appointed CEO" survives
```
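Steps 2 through 4 can be sketched in plain Python. This is a minimal illustration of the idea, not Mnemos's actual implementation: cosine similarity over precomputed embeddings, a crude capitalized-word overlap standing in for entity extraction, and a keyword list standing in for transition-language detection. All names and values here are hypothetical.

```python
import math

# Hypothetical stand-in for real transition-language detection
TRANSITION_WORDS = {"appointed", "switched", "now", "replaced", "moved", "changed"}

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def detect_conflict(new_fact, new_emb, old_fact, old_emb, threshold=0.55):
    """Steps 2-4: similarity gate, shared-topic check, transition language."""
    if cosine(new_emb, old_emb) < threshold:      # [2] not similar enough
        return False
    new_ents = {w for w in new_fact.split() if w[0].isupper()}
    old_ents = {w for w in old_fact.split() if w[0].isupper()}
    if not (new_ents & old_ents):                 # [3] no shared topic
        return False
    words = {w.strip(".,").lower() for w in new_fact.split()}
    return bool(words & TRANSITION_WORDS)         # [4] transition language

# Toy embeddings for illustration only
print(detect_conflict("Lisa Patel was appointed CEO", [0.9, 0.1, 0.3],
                      "The CEO is John Smith",        [0.8, 0.2, 0.4]))
# → True
```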
The classification step is the key differentiator. Mnemos recognizes three types of conflicts:
PREFERENCE_EVOLUTION — "Now prefers Vue" vs "Prefers React." The old preference is archived with a full history trail. You can still query what they used to prefer.
FACTUAL_CORRECTION — "Deadline is April 30" vs "Deadline is March 15." The old fact is deleted. There's one truth.
CONTEXT_DEPENDENT — "Uses Python at work" and "Uses JavaScript for personal projects." Both stay active, scoped to their context. This isn't a contradiction at all.
The reason this matters: Mem0 and MemGPT don't distinguish between these cases. They either keep everything (contradiction persists) or do naive last-write-wins (context-dependent facts get destroyed).
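Here is one way the type-to-strategy mapping could look. This is a sketch under my own naming, not Mnemos's real API: `ConflictType`, `STRATEGY`, and `resolve` are all hypothetical.

```python
from enum import Enum

class ConflictType(Enum):
    PREFERENCE_EVOLUTION = "preference_evolution"
    FACTUAL_CORRECTION = "factual_correction"
    CONTEXT_DEPENDENT = "context_dependent"

# Each conflict type gets a distinct resolution strategy
STRATEGY = {
    ConflictType.PREFERENCE_EVOLUTION: "ARCHIVE",    # keep a history trail
    ConflictType.FACTUAL_CORRECTION:   "REPLACE",    # delete the stale fact
    ConflictType.CONTEXT_DEPENDENT:    "KEEP_BOTH",  # both stay active, scoped
}

def resolve(conflict_type, old_fact, new_fact, store):
    """Apply the strategy for a detected conflict to a toy fact store."""
    strategy = STRATEGY[conflict_type]
    if strategy == "REPLACE":
        store["active"].remove(old_fact)             # one truth survives
    elif strategy == "ARCHIVE":
        store["active"].remove(old_fact)
        store["archive"].append(old_fact)            # still queryable later
    store["active"].append(new_fact)
    return strategy

store = {"active": ["User prefers React"], "archive": []}
resolve(ConflictType.PREFERENCE_EVOLUTION,
        "User prefers React", "User switched to Vue", store)
print(store)
# → {'active': ['User switched to Vue'], 'archive': ['User prefers React']}
```

The point of the mapping is exactly the distinction described above: last-write-wins is only correct for one of the three cases.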
A real example from the benchmark
Question 2 in the benchmark asked: "In which location did the spouse of Igor of Kiev pass away?"
The context contained two conflicting facts about Olga of Kiev's death location — an old fact saying Kyiv and an updated fact saying Rodez.
Naive system (same LLM, same embeddings, same retrieval): Retrieved both facts. The LLM picked Kyiv. Wrong.
Mnemos: Detected the contradiction during ingestion. Removed the stale "Kyiv" fact. When the question came, only "Rodez" was in memory. The LLM had no choice but to answer correctly. Right.
This kind of stale-fact removal happened 15 more times across the 100 multi-hop questions in that example. In total, Mnemos got 35 right; the naive baseline got 3.
How the benchmark works
I want to be transparent about methodology because reproducibility is what makes benchmark results credible.
The MemoryAgentBench dataset has a Conflict_Resolution split with 8 examples, each containing ~100 questions. The contexts range from 6K to 262K tokens. Each context is packed with facts, some of which contradict earlier facts.
Memory construction phase: The context is split into sentences. Each sentence is embedded with all-MiniLM-L6-v2 and stored as a semantic memory. Mnemos runs conflict detection at this stage — the naive baseline skips it.
Query phase: Each question is embedded. The top-15 most similar memories are retrieved by cosine similarity. GPT-4.1-mini generates the answer.
Scoring: Substring Exact Match against gold answers, same as the paper's protocol.
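Substring Exact Match is simple to state. Here is a sketch of the scorer as I understand the protocol; the case-insensitive normalization is my assumption, not a quote of the paper's code.

```python
def substring_exact_match(prediction: str, gold_answers: list[str]) -> int:
    """Score 1 if any gold answer appears verbatim (case-insensitive)
    as a substring of the model's prediction, else 0."""
    pred = prediction.lower().strip()
    return int(any(g.lower().strip() in pred for g in gold_answers))

print(substring_exact_match("She passed away in Rodez, France.", ["Rodez"]))  # → 1
print(substring_exact_match("She died in Kyiv.", ["Rodez"]))                  # → 0
```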
The critical detail: The naive baseline uses the exact same LLM, the exact same embeddings, and the exact same retrieval. The ONLY difference is that Mnemos runs conflict resolution during ingestion. So every percentage point above the baseline is purely the value of the conflict engine.
The full results — including where it fails
Here's the complete breakdown:
| Split | Context | Mnemos | Naive | Delta |
|---|---|---|---|---|
| Multi-hop | 6K | 27.0% | 9.0% | +18pp |
| Multi-hop | 32K | 11.0% | 3.0% | +8pp |
| Multi-hop | 64K | 8.0% | 6.0% | +2pp |
| Multi-hop | 262K | 2.0% | 2.0% | tied |
| Single-hop | 6K | 90.0% | 69.0% | +21pp |
| Single-hop | 32K | 65.0% | 80.0% | -15pp |
| Single-hop | 64K | 55.0% | 76.0% | -21pp |
| Single-hop | 262K | 28.0% | 76.0% | -48pp |
The multi-hop results are strong across the board. But look at single-hop on long contexts — Naive wins, and by a lot.
Why? At 262K tokens, the context contains thousands of facts about hundreds of different entities. Even unrelated pairs like "David works at Google" and "Sarah works at Microsoft" score around 0.35, since both are about someone working somewhere, and structurally closer pairs about different people routinely clear the fixed 0.55 threshold. Those pairs get flagged as conflicts, and one fact in each pair gets deleted. Multiply that across thousands of facts and you get massive over-deletion.
The fix is adaptive thresholds — higher threshold for longer contexts where there are more entities. This is the #1 item on the roadmap.
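One possible shape for that fix: interpolate the threshold between a floor and a ceiling as context length grows. The breakpoints and ceiling below are made up for illustration; only the 0.55 base and the 6K/262K context sizes come from the benchmark setup above.

```python
def adaptive_threshold(context_tokens: int, base: float = 0.55,
                       ceiling: float = 0.80,
                       short: int = 6_000, long: int = 262_000) -> float:
    """Raise the conflict-similarity threshold as context grows, so
    unrelated facts in huge contexts stop getting flagged as conflicts."""
    if context_tokens <= short:
        return base
    if context_tokens >= long:
        return ceiling
    frac = (context_tokens - short) / (long - short)
    return base + frac * (ceiling - base)

print(round(adaptive_threshold(6_000), 2))    # → 0.55
print(round(adaptive_threshold(262_000), 2))  # → 0.8
```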
The code
Mnemos is ~2000 lines of Python with no heavy dependencies beyond sentence-transformers.
```python
from mnemos import MemoryHub, Agent

hub = MemoryHub(similarity_threshold=0.55)
coder = Agent("coding_assistant", hub, write=True, read=True)
planner = Agent("project_planner", hub, write=True, read=True)

# Coder stores a fact
coder.remember("user_123", "User prefers React",
               category="preference", tags=["framework"])

# Planner stores contradictory info, resolved automatically
mem, conflicts = planner.remember(
    "user_123", "User switched to Vue",
    category="preference", tags=["framework"]
)
for c in conflicts:
    print(c.summary())
# [supersede] preference_evolution:
#   Archived 'User prefers React'
#   Active:   'User switched to Vue'
```
The memory system has two layers:
Episodic memory decays over time — session events, conversation fragments. These use an exponential decay formula: score = relevance * e^(-rate * days) + frequency_boost. When they fade below a threshold, they're archived.
Semantic memory persists forever — facts about the user. These are never decayed. They're only updated through the conflict resolution engine.
When the same pattern appears in 3+ episodic sessions (configurable), it gets promoted to semantic memory.
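A sketch of the decay score from the formula above, plus the promotion rule. The decay rate and the numbers in the example are illustrative, not Mnemos's defaults.

```python
import math

def episodic_score(relevance: float, days_old: float,
                   frequency_boost: float = 0.0, rate: float = 0.1) -> float:
    """score = relevance * e^(-rate * days) + frequency_boost"""
    return relevance * math.exp(-rate * days_old) + frequency_boost

def should_promote(session_hits: int, min_sessions: int = 3) -> bool:
    """A pattern seen in 3+ episodic sessions becomes semantic memory."""
    return session_hits >= min_sessions

fresh = episodic_score(relevance=0.9, days_old=1)
stale = episodic_score(relevance=0.9, days_old=30)
print(round(fresh, 3), round(stale, 3))  # → 0.814 0.045
```

With rate 0.1, a month-old fragment has decayed to roughly 5% of its original relevance, which is the point: episodic noise fades on its own unless frequency keeps it alive.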
Reproduce it yourself
```bash
git clone https://github.com/Sohamp2809/mnemos.git
cd mnemos
pip install -e ".[dev]"
pip install datasets sentence-transformers openai
export OPENAI_API_KEY="sk-..."

# Full benchmark: ~$1.50, ~3 hours
python Benchmark_sample/run_MemoryAgentBench.py \
    --llm openai --model gpt-4.1-mini --verbose
```
Machine-readable results are in results/mabench_cr_full.json.
What's next
Three priorities:
Adaptive thresholds — Scale similarity threshold with context length to fix long-context over-deletion. This is the biggest accuracy gap right now.
LLM-based conflict classification — The current heuristic classifier uses transition language detection and negation matching. A GPT-4o-mini call for borderline cases would catch the contradictions that heuristics miss.
Framework adapters — LangChain, CrewAI, and AutoGen integrations so you can drop Mnemos into existing agent pipelines.
Why this matters beyond benchmarks
Every production multi-agent system has this problem. When your customer support bot, your sales assistant, and your onboarding agent all talk to the same user, they need shared memory that stays accurate. Today, developers either build custom memory management (expensive) or accept that agents will contradict each other (bad UX).
Mnemos is the open-source answer. MIT licensed. One pip install away. And now — benchmarked.
If you're working on agent memory systems, I'd love to hear what conflict patterns you've encountered that the current three types don't cover. Drop a comment or open an issue on the repo.