We Hit 99.1% on the LOCOMO Benchmark. Here's How.

Phillip Neho

Last week, we hit 99.1% accuracy on the LOCOMO benchmark.

For context:

  • Mem0: 26%
  • Engram: 79.6%
  • Muninn: 99.1%

That's a 73.1-point gap over Mem0, and a 19.5-point gap over Engram.

The breakthrough wasn't a new model or complex architecture. It was removing a single assumption.


What is the LOCOMO Benchmark?

LOCOMO (Long Conversational Memory) tests whether AI agents can answer multi-hop reasoning questions using stored memories.

Example:

You tell the agent:

"James works at TechCorp. Sarah and Mike also work at TechCorp. James plays tennis on weekends."

Then you ask:

"Who does James work with?"

The agent must:

  1. Find James → works_at → TechCorp
  2. Find TechCorp → employees → [Sarah, Mike]
  3. Return: "Sarah and Mike"

This requires multi-hop reasoning — traversing relationships between entities.
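A toy version of that traversal, over a hypothetical in-memory fact store (not any real system's API), might look like:

```python
# Tiny in-memory triple store; a stand-in for an agent's memory.
facts = [
    ("James", "works_at", "TechCorp"),
    ("Sarah", "works_at", "TechCorp"),
    ("Mike", "works_at", "TechCorp"),
    ("James", "plays", "tennis"),
]

def coworkers(person):
    # Hop 1: person -> works_at -> employer(s)
    employers = {o for s, p, o in facts if s == person and p == "works_at"}
    # Hop 2: employer -> employees, excluding the person themselves
    return sorted(s for s, p, o in facts
                  if p == "works_at" and o in employers and s != person)

print(coworkers("James"))  # ['Mike', 'Sarah']
```

The two hops are just two passes over the store; the hard part in practice is making the second pass find the right facts, which is where the predicate problem below comes in.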


Why Existing Systems Fail

Most memory systems use predicate filtering:

# Find all 'works_at' facts
works_at_facts = memory.search(predicate="works_at")

The problem: Predicates rarely match exactly. Some systems store works_at, others employed_by, others job_title.

When you filter by predicate, you miss facts stored with different predicates.

Result: Multi-hop reasoning fails because the path breaks.
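To make the failure concrete, here's a toy reproduction (assumed storage format, not any particular system's): the same relation stored under two predicate names defeats an exact-match filter, so the employee lookup never finds Sarah.

```python
facts = [
    ("James", "works_at", "TechCorp"),
    ("Sarah", "employed_by", "TechCorp"),  # same relation, different predicate
]

def search(predicate):
    # Exact-match predicate filter, like the snippet above.
    return [f for f in facts if f[1] == predicate]

print(search("works_at"))  # only James's fact; Sarah's is invisible
```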


The Breakthrough: Remove Predicate Filtering

We tried a counterintuitive approach: Stop filtering by predicate entirely.

# OLD: filter by predicate at query time -- silently misses facts
# stored under a different predicate name
facts = memory.search(predicate="works_at", entity="James")

# NEW: retrieve ALL facts for the entity, then post-filter with a
# synonym list
facts = memory.search(entity="James")
works_at_facts = [f for f in facts if f.predicate in ["works_at", "employed_by"]]

Latency: ~50ms on Cloudflare Workers.
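Putting it together, here's a minimal end-to-end sketch of the entity-first approach (hypothetical in-memory store and synonym set, not Muninn's actual code). The two-hop path stays intact even when the same relation was stored under different predicate names:

```python
facts = [
    ("James", "works_at", "TechCorp"),
    ("Sarah", "employed_by", "TechCorp"),  # different predicate, same relation
    ("Mike", "works_at", "TechCorp"),
]

# Assumed synonym set, applied AFTER retrieval rather than at query time.
WORK = {"works_at", "employed_by", "job_title"}

def search(entity):
    # Retrieve ALL facts touching the entity; no predicate filter here.
    return [f for f in facts if entity in (f[0], f[2])]

def coworkers(person):
    employers = {o for s, p, o in search(person) if s == person and p in WORK}
    return sorted({s for e in employers for s, p, o in search(e)
                   if p in WORK and s != person})

print(coworkers("James"))  # ['Mike', 'Sarah'] -- Sarah is no longer lost
```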


The Numbers

System        LOCOMO Score    Gap to Muninn
Muninn        99.1%           —
MemMachine    88%             -11.1%
Engram        79.6%           -19.5%
Mem0          26%             -73.1%

The jump from our previous 87% to 99.1% came from removing predicate filtering.




The Lesson

Sometimes the best optimization is removing complexity, not adding it.

We spent months trying to improve predicate filtering. Better NLP, more synonyms, fuzzy matching.

None of it worked.

Removing predicate filtering entirely? That was a 12-point accuracy jump.


Phillip is building memory infrastructure for AI agents.
