How to Benchmark Agent Memory Systems: A 2026 Framework
Memory is the most important piece of infrastructure for autonomous agents, yet there is no standard benchmark for it. Here is a proposed framework.
Why We Need Agent Memory Benchmarks
Current agent benchmarks (SWE-bench, GAIA, etc.) test single-session reasoning. They don't test:
- Memory persistence across sessions
- Constitutional validation accuracy
- Deduplication correctness
- Authority chain enforcement
Proposed Benchmark Dimensions
1. Session Persistence
Task: Give the agent information in session 1. Ask for it in session 5.
Metric: Recall accuracy (0-100%)
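The persistence probe can be sketched in a few lines. `MemoryStore` here is a hypothetical in-memory stand-in for a real persistent backend (e.g. a graph database); the scoring logic is the point.

```python
class MemoryStore:
    """Illustrative stand-in for a persistent agent memory backend."""
    def __init__(self):
        self._facts = {}

    def remember(self, key, value):
        self._facts[key] = value

    def recall(self, key):
        return self._facts.get(key)

def persistence_score(store, facts):
    """Session 1: store each fact. Final session: recall each and score."""
    for key, value in facts.items():
        store.remember(key, value)
    # (Sessions 2-4 would run unrelated tasks; omitted in this sketch.)
    recalled = sum(1 for k, v in facts.items() if store.recall(k) == v)
    return 100.0 * recalled / len(facts)
```

A real harness would restart the agent process between sessions so that anything held only in the context window is lost.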
2. Referential Integrity
Task: Ask the agent to act on an entity you never mentioned.
Metric: % of hallucinated references caught before execution
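One way to score this dimension is a pre-execution guard that resolves every referenced entity against the memory graph. The function names below are illustrative, not any particular system's API.

```python
def guard_request(known_entities, referenced_entities):
    """Return (allowed, unknown): block requests naming unknown entities."""
    unknown = [e for e in referenced_entities if e not in known_entities]
    return (len(unknown) == 0, unknown)

def referential_integrity_rate(known_entities, hallucination_probes):
    """Each probe references at least one entity never mentioned;
    score the percentage caught before execution."""
    caught = sum(1 for refs in hallucination_probes
                 if not guard_request(known_entities, refs)[0])
    return 100.0 * caught / len(hallucination_probes)
```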
3. Deduplication
Task: Submit the same request 5 times.
Metric: % of duplicates blocked
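A minimal sketch of this probe using content hashing, a common deduplication approach. `DedupGate` is illustrative, not a description of any production layer.

```python
import hashlib

class DedupGate:
    def __init__(self):
        self._seen = set()

    def admit(self, request: str) -> bool:
        """Admit a request only if its content hash is new."""
        digest = hashlib.sha256(request.encode("utf-8")).hexdigest()
        if digest in self._seen:
            return False  # duplicate: blocked
        self._seen.add(digest)
        return True

def dedup_score(gate, request: str, repeats: int = 5) -> float:
    """Submit the same request `repeats` times; score % of duplicates blocked."""
    results = [gate.admit(request) for _ in range(repeats)]
    duplicates = repeats - 1  # first submission is legitimate
    return 100.0 * results.count(False) / duplicates
```

Note that exact-hash dedup misses paraphrased duplicates; a harder variant of this dimension would reword the request between submissions.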
4. Authority Enforcement
Task: Request an action outside the agent's authority scope.
Metric: % of unauthorized requests blocked
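The authority probe assumes each agent carries an explicit scope of permitted action types; everything outside it must be blocked or escalated. A hedged sketch:

```python
def authorize(scope: set, action: str) -> bool:
    """Allow an action only if it is inside the agent's declared scope."""
    return action in scope

def authority_score(scope, out_of_scope_requests):
    """% of out-of-scope requests correctly blocked."""
    blocked = sum(1 for a in out_of_scope_requests if not authorize(scope, a))
    return 100.0 * blocked / len(out_of_scope_requests)
```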
5. Temporal Validity
Task: Give the agent a time-sensitive instruction. Wait until past the deadline. Then ask the agent to execute it.
Metric: % of expired instructions correctly blocked
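A sketch of the temporal probe, assuming each instruction carries an issue time and a time-to-live; execution after expiry must be refused. Names are illustrative.

```python
from datetime import datetime, timedelta, timezone

def is_valid(issued_at: datetime, ttl: timedelta, now: datetime) -> bool:
    """An instruction is executable only until issued_at + ttl."""
    return now <= issued_at + ttl

def temporal_score(expired_instructions, now):
    """expired_instructions: (issued_at, ttl) pairs already past deadline.
    Score the % correctly blocked."""
    blocked = sum(1 for issued, ttl in expired_instructions
                  if not is_valid(issued, ttl, now))
    return 100.0 * blocked / len(expired_instructions)
```

Passing `now` explicitly (rather than reading the clock inside) keeps the probe deterministic and easy to test.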
Scoring
agent_memory_score = (
0.25 * persistence_accuracy +
0.25 * referential_integrity_rate +
0.20 * dedup_accuracy +
0.15 * authority_enforcement_rate +
0.15 * temporal_validity_rate
)
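The composite above as a runnable function: the five weights sum to 1.0, and each input is on a 0-100 scale.

```python
def agent_memory_score(persistence, referential, dedup, authority, temporal):
    """Weighted composite of the five benchmark dimensions (0-100 each)."""
    return (0.25 * persistence +
            0.25 * referential +
            0.20 * dedup +
            0.15 * authority +
            0.15 * temporal)

# Example with the production numbers reported below (100/100/100/95/98):
score = agent_memory_score(100, 100, 100, 95, 98)
```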
ODEI's Production Numbers
Running since January 2026:
- Persistence: 100% (Neo4j graph never resets)
- Referential integrity: 100% (layer 3 blocks all hallucinations)
- Deduplication: 100% (layer 5 content hashing)
- Authority: ~95% (layer 4, some edge cases escalate)
- Temporal: ~98% (layer 2, timing edge cases)
Open Question
We're working on a formal benchmark dataset for these dimensions. If you're building agent memory systems and want to collaborate: github.com/odei-ai/research
API: https://api.odei.ai | MCP: npx @odei/mcp-server