I shipped a partial solution to MEME's Absence task 3 days before the paper. By accident.

The MEME benchmark (arXiv:2605.12477, dropped May 12) put six production memory systems — Mem0, Graphiti, BM25, text-embedding-3-small, MD-flat, Karpathy Wiki — through 100 controlled episodes across six tasks. Three of those tasks are new: Cascade, Absence, Deletion. The first two probe whether a memory layer can reason about dependencies between facts. None could.

Average accuracy: Cascade 3%, Absence 1%.

The only system that closes the gap is MD-flat paired with Claude Opus 4.7, which walks dependencies at ingest time and writes propagated values into a flat markdown file. Cost: ~70× the cheapest baseline.

I run a small MIT-licensed memory layer called nautilus-compass, used by ~10 internal agents in a research stack. On May 9 — three days before the paper went up — I shipped a feature called numeric_claims that turns out to be a partial solution to MEME's Absence task, restricted to the numeric subset of facts.

## The mechanism

I didn't know MEME was coming. I built numeric_claims because I kept getting burned by stale metrics being recalled as fresh. A weeks-old session memory would say "ingested 56 entries" and a current one would say "ingested 9999 entries", and when I asked the agent about entry counts, it'd cite the old one with full confidence.

The fix is dumb:

```python
import re

# at ingest
PATTERNS = [
    (re.compile(r"(\d[\d,]*)\s*entries\b", re.I), "entries"),
    (re.compile(r"(\d+(?:\.\d+)?)\s*%\s*(?:recall|accuracy|drop)", re.I), "percentage"),
    (re.compile(r"(\d+)\s*agents?\b", re.I), "agents"),
    (re.compile(r"(\d+)\s*tools?\b(?!kit)", re.I), "tools"),
    (re.compile(r"port\s*(\d{4,5})\b", re.I), "port"),
    # ...10 patterns total · forward + reverse
]

claims = []
for pat, entity in PATTERNS:
    for m in pat.finditer(text):
        # float, not int: the percentage pattern captures values like "85.5"
        claims.append({"entity": entity, "value": float(m.group(1).replace(",", ""))})

append_to_jsonl(claims)  # one line per (entity, value, source_file, ts)
```
```python
# at any subsequent query
alerts = []
for c in extract_from_text(query):  # the same PATTERNS pass, run over the incoming query
    past = history_lookup(c.entity, within=14 * 86_400)  # 14-day window, in seconds
    if past and past.value != c.value:
        alerts.append(
            f"[!] {c.entity} conflict · {time_ago(past.ts)} ago you said {past.value} · "
            f"now {c.value} · source: {past.source}"
        )
```

That's it. ~10 patterns. ~150 lines of Python. No LLM call. Cost is one regex pass per ingest plus one jsonl append.
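The two snippets lean on a few helpers. Here's a minimal sketch of what they could look like; `CLAIMS_FILE`, the record layout, and the function bodies are illustrative guesses, not necessarily what compass ships:

```python
import json
import time
from pathlib import Path
from types import SimpleNamespace

CLAIMS_FILE = Path("claims.jsonl")  # hypothetical location

def append_to_jsonl(claims, source_file="unknown"):
    # one jsonl line per (entity, value, source_file, ts)
    with CLAIMS_FILE.open("a") as f:
        for c in claims:
            f.write(json.dumps({**c, "source": source_file, "ts": time.time()}) + "\n")

def extract_from_text(text):
    # the same PATTERNS pass as at ingest, wrapped for the query path
    return [SimpleNamespace(entity=entity, value=float(m.group(1).replace(",", "")))
            for pat, entity in PATTERNS for m in pat.finditer(text)]

def history_lookup(entity, within):
    # newest prior claim about this entity inside the window, else None
    if not CLAIMS_FILE.exists():
        return None
    cutoff = time.time() - within
    recs = [json.loads(line) for line in CLAIMS_FILE.read_text().splitlines()]
    recs = [r for r in recs if r["entity"] == entity and r["ts"] >= cutoff]
    return SimpleNamespace(**max(recs, key=lambda r: r["ts"])) if recs else None

def time_ago(ts):
    # coarse "N days" formatting for the alert string
    return f"{int((time.time() - ts) // 86_400)} days"
```

Everything stays stdlib-only, which is consistent with the zero-dependencies claim at the end of this post.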

It only covers numerics. Locations ("lives in Beijing → moved to Shanghai"), roles ("Party B → Party A"), time-bound status ("on call → off duty") — the data structure generalizes, but the patterns don't yet.
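If the patterns ever do generalize, the change is mostly mechanical: store the raw string instead of a parsed number and reuse the same conflict check. A hypothetical sketch, not in compass today:

```python
# hypothetical: non-numeric claims reuse the same record shape,
# just with a (normalized) string value instead of a number
STRING_PATTERNS = [
    (re.compile(r"(?:lives? in|moved to|based in)\s+(\w[\w-]*)", re.I), "location"),
    (re.compile(r"\b(on call|off duty)\b", re.I), "duty_status"),
]

for pat, entity in STRING_PATTERNS:
    for m in pat.finditer(text):
        claims.append({"entity": entity, "value": m.group(1).lower()})
# cross_ref needs no change: `past.value != c.value` works for strings too
```

The hard part isn't the storage, it's the recall of patterns like these against free text, which is why I'm predicting 0% on non-numeric Absence below.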

## Why this is the same task as MEME's Absence

The MEME paper defines Absence as:

> Recognizing when prior answers become uncertain post-update.

Their average across six systems is 1%. The reason isn't that the systems can't retrieve — retrieval is fine. The reason is that none of them flag contradictions between old and new facts at write time, so at read time they cheerfully return whichever fact won the embedding similarity contest. Concretely: if memory holds both "ingested 56 entries" (stale) and "ingested 9999 entries" (fresh), the query "how many entries?" sits roughly equidistant from both in embedding space, and nothing in the ranking signal says one supersedes the other.

numeric_claims.cross_ref is exactly the missing mechanism — for the numeric subset. When a new claim about an entity differs from a known claim, surface the conflict. Don't try to decide which is right; just refuse to assert silently.

I didn't read the paper before building this. I just got tired of getting bitten by stale numbers.

## What numeric_claims doesn't solve

Cascade. I cannot solve Cascade with regex.

Cascade is the harder task: when A changes, propagate to B where B logically depends on A. "User moved from Beijing to Shanghai" → "User's commute time is now unknown" — that requires a dependency graph, and the regex approach fundamentally can't generate one.

The MEME paper's working solution (MD-flat + Opus 4.7) generates the graph at ingest by running an LLM over the new fact and walking dependencies. This is fashionable but expensive. There's another path that's been out of fashion since the rise of RAG: have the writer declare dependencies explicitly. In compass v2 I'm planning to add:

```yaml
---
type: location
fact: "address = Shanghai"
cascades:
  - commute_time: invalidate
  - timezone: derive(GMT+8)
  - tax_jurisdiction: invalidate
---
```

cascades: is just metadata. The ingest engine reads it and invalidates / derives dependents. Zero LLM calls. The cost is that someone (the writing agent, or the human, or an LLM at write time using a cheap model) has to think about which deps matter. That cost is low compared to running Opus on every ingest.
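A minimal sketch of that ingest step, under loud assumptions: `ingest_fact` and the `store` interface are hypothetical (compass v2 doesn't exist yet), and I'm reaching for PyYAML to parse the frontmatter:

```python
import yaml  # PyYAML; a real v2 could hand-roll the tiny subset it needs

def ingest_fact(raw: str, store) -> None:
    # split the "--- ... ---" frontmatter off the markdown body
    _, meta_block, _body = raw.split("---", 2)
    meta = yaml.safe_load(meta_block)
    store.write(meta["type"], meta["fact"])       # the fact itself
    for rule in meta.get("cascades", []):         # writer-declared dependents
        (dependent, action), = rule.items()
        if action == "invalidate":
            store.mark_unknown(dependent)         # Absence: stop asserting the stale value
        elif action.startswith("derive("):
            store.write(dependent, action[7:-1])  # Cascade: propagate the declared value
```

Still zero LLM calls at read time; the dependency graph is exactly as good as the declarations, which is the trade.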

This is unfashionable because for the past five years the entire field has been chasing better retrieval. MEME's contribution is showing that retrieval alone can never close this gap — the dependency reasoning has to happen at write time, not read time. Once you accept that, "let the writer declare deps" becomes obvious again.

## What I'd love to see

The MEME 100-episode dataset is at meme-benchmark/MEME and the runner at SeokwonJung-Jay/MEME-public. I'm prepping a compass adapter — happy to submit a PR.

My hypotheses before running:

- Absence on numeric facts: 60–80%. (Mechanism matches the task; coverage is the limit.)
- Absence on non-numeric facts: 0%. (No coverage yet.)
- Cascade: 0%. (No solver.)
- Deletion: 0%. (No tombstones.)
- Static tasks (Recall, Aggregation, Tracking): similar to other BGE-m3-based systems · these are read-side and well-trodden.

If the numbers come out worse, I learn something. If they come out roughly here, the paper's framing extends to one more system the authors didn't test, and I've got data to point at.

## Honest acknowledgments

  1. compass is a small project. ~1 GitHub star at time of writing. Not Mem0 / Graphiti / Letta. I'm not claiming we beat anything — we shipped one mechanism, by accident, that's adjacent to one of MEME's axes.

2. The "we shipped three days before the paper" framing is timing trivia. The interesting claim is the mechanism choice: write-time contradiction detection on a structured subset of facts, for cheap. That choice was right · MEME's results back it · they also show the same gap exists on non-numeric axes I haven't built yet.

3. The "70× cost" number from the paper is a specific configuration (every ingest goes through Opus 4.7). Real users don't update every fact equally — perhaps 90% of facts are isolated and 10% relational. Selective propagation — Opus only on declared-relational facts — should drop the cost ratio from 70× to maybe 5–7×. That's a follow-up worth running.
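Back-of-envelope on that last point, assuming the 70× figure scales linearly with the fraction of ingests routed through the big model: 0.9 × 1 + 0.1 × 70 ≈ 8×. Landing at 5–7× additionally assumes the relational share is a bit under 10%, or that a cheap model handles part of it. Either way, an order of magnitude below the paper's every-ingest configuration.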

## Code

github.com/chunxiaoxx/nautilus-compass · MIT licensed · v1.5.3 just shipped.

The relevant file is numeric_claims.py · 200 lines · zero dependencies beyond stdlib.

Replies and benchmark suggestions welcome.


Disclosure: I run nautilus-compass as a personal project. No commercial interest. This post is dated 2026-05-17, five days after MEME's release.
