Three days ago I wrote about accidentally shipping what looked like a partial solution to MEME's Absence task. After I ran the full 100-episode benchmark, that framing turned out to be wrong. The hypothesis was that nautilus-compass's numeric_claims regex/cross-ref would score 60–80% on the numeric subset of Absence questions. Actual result: 0 out of 7 numeric Absence questions across 100 episodes, in both the pl and sw domains.
Not a bug. A scope error. Worth writing down honestly because the real niche is now clearer.
The hypothesis was wrong because the regex set is dev-metric-shaped
nautilus-compass's 10 regex patterns target phrases that show up in dev logs, ML papers, and agent CLI output:
`entries`, `agents`, `tools`, `port`, `% recall`, `threshold`, `latency_ms`, `top_k`, `max_workers`, `batch_size`
These match "we shipped 56 entries" and "recall went to 84%". They do not match "how many siblings do you have?" or "what was my weight last year?" — which is what MEME's Absence subset is testing. Different vocabulary, different scope. The regex never fires.
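To make the scope mismatch concrete, here's a hedged sketch of the extraction step. The pattern shapes are my reconstruction from the entity names listed above; the actual regexes in numeric_claims may differ.

```python
import re

# Minimal sketch of dev-metric-shaped extraction (entity names from the
# post; the exact regexes in nautilus-compass may differ).
METRIC_PATTERNS = [
    # "56 entries", "12 tools", "3 agents"
    re.compile(r"(?P<value>\d+(?:\.\d+)?)\s*(?P<entity>entries|agents|tools)\b"),
    # "port=8080", "latency_ms: 120", "batch_size 32"
    re.compile(r"\b(?P<entity>port|threshold|latency_ms|top_k|max_workers|batch_size)\b\s*[=:]?\s*(?P<value>\d+(?:\.\d+)?)"),
    # "84% recall" and "recall went to 84%"
    re.compile(r"(?P<value>\d+(?:\.\d+)?)\s*%\s*(?P<entity>recall)"),
    re.compile(r"(?P<entity>recall)\D{1,20}(?P<value>\d+(?:\.\d+)?)\s*%"),
]

def extract_claims(text: str) -> dict[str, float]:
    """Return {entity: value} for every dev-metric phrase the patterns hit."""
    claims: dict[str, float] = {}
    for pat in METRIC_PATTERNS:
        for m in pat.finditer(text):
            claims[m.group("entity")] = float(m.group("value"))
    return claims

extract_claims("we shipped 56 entries, recall went to 84%")
# -> {'entries': 56.0, 'recall': 84.0}
extract_claims("how many siblings do you have? my weight was 72kg last year")
# -> {}  (life-fact vocabulary: no pattern fires)
```

The second call is the whole story: MEME's Absence vocabulary never intersects the pattern set, so extraction yields nothing and the conflict check never runs.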
This is the version I should have written before running the bench. The bench just verified it.
What it does catch — by design
The mechanism is: every observation that gets written to memory has its numeric claims extracted (e.g., entries=56, tools=12). When a new observation comes in, conflicting values for the same entity get flagged. Real example from our own agent fleet last week:
session_20260512-2042 · ingest:
"V5 cron pool: 56 entries, 12 tools, 84% recall"
session_20260513-0143 · query:
"How many entries are in the V5 pool?"
→ retrieved + alert:
[!] numeric conflict · 'entries' · current=9999 · previous=56 (1d ago)
The "9999" was a bug in our own daemon's stats counter. The conflict detector caught it. Not because it's smart — because it has a tight scope (entries) and remembers what it last saw.
This is the dev-memory conflict detector niche. Concretely it helps when:
- An agent emits structured metrics in its observations (latency, count, recall%, port number, queue depth) and the metrics drift / regress over time
- A development session generates contradictory numeric claims about the same artifact (e.g., "the index has 12 tools" → an hour later "the index has 8 tools")
- You want a cheap pre-LLM filter that says "this numeric claim contradicts a recent one — investigate before retrieving"
It does not help when:
- The numeric claims are about general life facts (ages, weights, money amounts, dates) — regex doesn't match
- The conflict is about non-numeric entities (location changes, role changes, status changes) — that's a different problem requiring entity-typed slots
- The agent emits free-form prose without consistent metric naming — regex still doesn't match
Why I'm posting this — and the broader frame
Three reasons:
1. MEME's knew_but_failed finding holds. Across the 100 episodes I ran:
- Cascade: 134/164 questions (82%) — gold fact was retrieved, LLM failed to reason over it
- Absence: 123/130 (95%) — same
- Deletion: 77/100 (77%) — same
Retrieval is not the bottleneck for cross-entity reasoning. The paper's argument that "isolated facts trap the LLM" reproduces cleanly. nautilus-compass doesn't escape that trap by adding a numeric regex; that was the wrong axis.
2. Niches > general solvers, at least for now. Memory layers that win benchmarks by writing a paper-tuned solver tend to lose to dumb retrieval as soon as the input distribution shifts. A small, well-scoped detector that catches one specific kind of agent self-contradiction is more honest. It might also be more useful — the per-episode cost is essentially zero (regex + dict lookup), so it can sit in the retrieval path for free.
3. Honest framing matters. Calling our regex set a MEME Absence partial solution was misleading even when the math worked on a synthetic subset. Calling it a dev-memory conflict detector is closer to what it actually does.
What we'd genuinely like opinions on
- If you've shipped an agent that emits structured metrics in its observations, would a "numeric conflict flag" in the retrieval path actually surface useful regressions? Or is the noise rate too high in practice?
- What's the obvious next entity-type extension?
  `port`, `latency_ms`, and `queue_depth` are easy regex; `version` (semver), `commit_hash`, and `path` need slot-typing, not regex.
- For people who've built memory systems for dev/agent contexts (not life-context benchmarks): what conflict patterns do you actually see in your production logs that are not caught by retrieval similarity?
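A concrete illustration of why `version` resists the numeric-regex approach (a sketch, not repo code): semver has to be parsed and compared component-wise, and naive string comparison orders "1.10.0" before "1.9.0".

```python
# Sketch: why `version` needs a typed slot rather than numeric extraction.
# Handles only plain MAJOR.MINOR.PATCH; real semver (pre-release tags,
# build metadata) needs more than this.

def parse_semver(v: str) -> tuple[int, int, int]:
    major, minor, patch = (int(x) for x in v.split("."))
    return (major, minor, patch)

parse_semver("1.10.0") > parse_semver("1.9.0")  # True  (tuple comparison)
"1.10.0" > "1.9.0"                              # False (lexicographic)
```

The same argument applies to `commit_hash` and `path`: equality is meaningful, ordering and numeric deltas are not, so the slot type has to carry its own comparison semantics.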
Bench harness, full 100-episode judge output, and the numeric_claims source are at github.com/chunxiaoxx/nautilus-compass (rename pending). Adapter for MEME-public at the same repo under code/agents/compass_memory.py.
— Chunxiao