Three days ago I wrote about accidentally shipping what looked like a partial solution to MEME's Absence task. After I ran the full 100-episode benchmark, that framing turned out to be wrong. The hypothesis was that nautilus-compass's numeric_claims regex/cross-ref would score 60–80% on the numeric subset of Absence questions. Actual result: 0 out of 7 numeric Absence questions across 100 episodes, in both the pl and sw domains.
Not a bug. A scope error. Worth writing down honestly because the real niche is now clearer.
The hypothesis was wrong because the regex set is dev-metric-shaped
nautilus-compass's 10 regex patterns target phrases that show up in dev logs, ML papers, and agent CLI output:
`entries`, `agents`, `tools`, `port`, `% recall`, `threshold`, `latency_ms`, `top_k`, `max_workers`, `batch_size`
These match "we shipped 56 entries" and "recall went to 84%". They do not match "how many siblings do you have?" or "what was my weight last year?" — which is what MEME's Absence subset is testing. Different vocabulary, different scope. The regex never fires.
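To make the scope mismatch concrete, here's a hedged sketch of the extraction step. The pattern shapes are my reconstruction from the entity names listed above; the actual regexes in numeric_claims may differ.

```python
import re

# Minimal sketch of dev-metric-shaped extraction (entity names from the
# post; the exact regexes in nautilus-compass may differ).
METRIC_PATTERNS = [
    # "56 entries", "12 tools", "3 agents"
    re.compile(r"(?P<value>\d+(?:\.\d+)?)\s*(?P<entity>entries|agents|tools)\b"),
    # "port=8080", "latency_ms: 120", "batch_size 32"
    re.compile(r"\b(?P<entity>port|threshold|latency_ms|top_k|max_workers|batch_size)\b\s*[=:]?\s*(?P<value>\d+(?:\.\d+)?)"),
    # "84% recall" and "recall went to 84%"
    re.compile(r"(?P<value>\d+(?:\.\d+)?)\s*%\s*(?P<entity>recall)"),
    re.compile(r"(?P<entity>recall)\D{1,20}(?P<value>\d+(?:\.\d+)?)\s*%"),
]

def extract_claims(text: str) -> dict[str, float]:
    """Return {entity: value} for every dev-metric phrase the patterns hit."""
    claims: dict[str, float] = {}
    for pat in METRIC_PATTERNS:
        for m in pat.finditer(text):
            claims[m.group("entity")] = float(m.group("value"))
    return claims

extract_claims("we shipped 56 entries, recall went to 84%")
# -> {'entries': 56.0, 'recall': 84.0}
extract_claims("how many siblings do you have? my weight was 72kg last year")
# -> {}  (life-fact vocabulary: no pattern fires)
```

The second call is the whole story: MEME's Absence vocabulary never intersects the pattern set, so extraction yields nothing and the conflict check never runs.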
This is the version I should have written before running the bench. The bench just verified it.
What it does catch — by design
The mechanism is: every observation that gets written to memory has its numeric claims extracted (e.g., entries=56, tools=12). When a new observation comes in, conflicting values for the same entity get flagged. Real example from our own agent fleet last week:
session_20260512-2042 · ingest:
"V5 cron pool: 56 entries, 12 tools, 84% recall"
session_20260513-0143 · query:
"How many entries are in the V5 pool?"
→ retrieved + alert:
[!] numeric conflict · 'entries' · current=9999 · previous=56 (1d ago)
The "9999" was a bug in our own daemon's stats counter. The conflict detector caught it. Not because it's smart — because it has a tight scope (entries) and remembers what it last saw.
This is the dev-memory conflict detector niche. Concretely it helps when:
- An agent emits structured metrics in its observations (latency, count, recall%, port number, queue depth) and the metrics drift / regress over time
- A development session generates contradictory numeric claims about the same artifact (e.g., "the index has 12 tools" → an hour later "the index has 8 tools")
- You want a cheap pre-LLM filter that says "this numeric claim contradicts a recent one — investigate before retrieving"
It does not help when:
- The numeric claims are about general life facts (ages, weights, money amounts, dates) — regex doesn't match
- The conflict is about non-numeric entities (location changes, role changes, status changes) — that's a different problem requiring entity-typed slots
- The agent emits free-form prose without consistent metric naming — regex still doesn't match
Why I'm posting this — and the broader frame
Three reasons:
1. MEME's knew_but_failed finding holds. Across the 100 episodes I ran:
- Cascade: 134/164 questions (82%) — gold fact was retrieved, LLM failed to reason over it
- Absence: 123/130 (95%) — same
- Deletion: 77/100 (77%) — same
Retrieval is not the bottleneck for cross-entity reasoning. The paper's argument that "isolated facts trap the LLM" reproduces cleanly. nautilus-compass doesn't escape that trap by adding a numeric regex; that was the wrong axis.
2. Niches > general solvers, at least for now. Memory layers that win benchmarks by writing a paper-tuned solver tend to lose to dumb retrieval as soon as the input distribution shifts. A small, well-scoped detector that catches one specific kind of agent self-contradiction is more honest. It might also be more useful — the per-episode cost is essentially zero (regex + dict lookup), so it can sit in the retrieval path for free.
3. Honest framing matters. Calling our regex set a MEME Absence partial solution was misleading even when the math worked on a synthetic subset. Calling it a dev-memory conflict detector is closer to what it actually does.
What we'd genuinely like opinions on
- If you've shipped an agent that emits structured metrics in its observations, would a "numeric conflict flag" in the retrieval path actually surface useful regressions? Or is the noise rate too high in practice?
- What's the obvious next entity-type extension?
  `port`, `latency_ms`, and `queue_depth` are easy regex; `version` (semver), `commit_hash`, and `path` need slot-typing, not regex.
- For people who've built memory systems for dev/agent contexts (not life-context benchmarks): what conflict patterns do you actually see in your production logs that are not caught by retrieval similarity?
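A concrete illustration of why `version` resists the numeric-regex approach (a sketch, not repo code): semver has to be parsed and compared component-wise, and naive string comparison orders "1.10.0" before "1.9.0".

```python
# Sketch: why `version` needs a typed slot rather than numeric extraction.
# Handles only plain MAJOR.MINOR.PATCH; real semver (pre-release tags,
# build metadata) needs more than this.

def parse_semver(v: str) -> tuple[int, int, int]:
    major, minor, patch = (int(x) for x in v.split("."))
    return (major, minor, patch)

parse_semver("1.10.0") > parse_semver("1.9.0")  # True  (tuple comparison)
"1.10.0" > "1.9.0"                              # False (lexicographic)
```

The same argument applies to `commit_hash` and `path`: equality is meaningful, ordering and numeric deltas are not, so the slot type has to carry its own comparison semantics.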
Bench harness, full 100-episode judge output, and the numeric_claims source are at github.com/chunxiaoxx/nautilus-compass (rename pending). Adapter for MEME-public at the same repo under code/agents/compass_memory.py.
— Chunxiao