The Problem
If you've built LLM-based agents, you've probably noticed: they forget things.
A lot.
Your agent remembers the user's name in message 1, forgets it by message 5, and then hallucinates a completely different name by message 10.
But why do agents forget? Is it:
- Memory capacity issues?
- Information getting overwritten?
- The LLM hallucinating?
- Something else?

Nobody had data. Just anecdotes and frustration.

So I built MemTrace to answer this question with actual statistics.
What I Built
**MemTrace** is a testing framework that tracks every single memory operation an agent makes and diagnoses why recalls fail.
Think of it like a "black box recorder" for agent memory.
Core idea:
- Track every WRITE, READ, UPDATE, and EVICT operation
- Compare what the agent returns vs. what was originally stored
- Diagnose failures with evidence from the event log

I tested 1000 random scenarios with 3030 memory operations to find patterns.
Key Findings
Finding 1: Agents Forget 60% of the Time
**Valid Recall Rate: 39.6%**
That means when an agent tries to recall information it previously stored, it fails 6 out of 10 times.
(This excludes "invalid reads", where the agent tries to read something that was never written. Those are test artifacts, not real failures.)
Finding 2: Evictions Dominate
**Memory Evicted: 46.2% of all failures**
Nearly half of all memory failures happen because the agent ran out of space and had to evict old information.
Breakdown:
- **Memory Evicted**: 994 failures (46.2%)
- **Invalid Read**: 815 failures (37.9%)
- **Memory Overwritten**: 343 failures (15.9%)
- **LLM Hallucination**: 0 failures (0.0%)

*(Note: No hallucinations appear in this test because I used a deterministic agent. Real LLMs would show hallucinations too.)*
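As a quick sanity check, the percentages in this breakdown can be recomputed from the raw counts. The snippet below is only arithmetic on the numbers reported above; the dictionary keys are illustrative labels, not MemTrace identifiers.

```python
# Failure counts as reported in the breakdown above.
failure_counts = {
    "memory_evicted": 994,
    "invalid_read": 815,
    "memory_overwritten": 343,
    "llm_hallucination": 0,
}

total_failures = sum(failure_counts.values())

# Share of each cause among all failures, as a percentage with one decimal.
shares = {
    cause: round(100 * count / total_failures, 1)
    for cause, count in failure_counts.items()
}

print(f"total failures: {total_failures}")
for cause, share in shares.items():
    print(f"{cause}: {share}%")
```

The three non-zero causes add up to 2152 failures, and the shares come out to the 46.2%, 37.9%, and 15.9% listed above.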
Finding 3: Capacity Matters (Proven)
Validated Invariant: Capacity ↑ → Eviction ↓
I tested different memory capacities and found a clear pattern:
| Capacity | Evictions | Pass Rate |
|---|---|---|
| Low (1-5) | 1354 | 21.3% |
| Medium (10-15) | 1150 | 25.6% |
| High (20-30) | 994 | 29.0% |
Higher capacity = fewer evictions = better recall.
Seems obvious, but now we have data to prove it.
Finding 4: Overwrites Are Independent
**Overwrites stay constant regardless of capacity**
Whether you have a capacity of 5 or 30, you get ~340 overwrites per 1000 scenarios.
Why? Overwrites depend on how often you reuse the same keys, not how much memory you have.
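To see why the two effects separate, here is a small simulation sketch. It is not MemTrace's test harness: it assumes an LRU eviction policy and counts an overwrite whenever a key that was written earlier is written again with a different value, whether or not the key is still in memory. `run_workload` and the workload parameters are hypothetical names chosen for this example.

```python
import random
from collections import OrderedDict


def run_workload(capacity, writes):
    """Replay (key, value) writes against an LRU store; count evictions and overwrites."""
    store = OrderedDict()  # keys currently held, least recently written first
    last_written = {}      # last value ever written per key, even if later evicted
    evictions = 0
    overwrites = 0
    for key, value in writes:
        # Overwrite: the key was written before with a different value.
        if key in last_written and last_written[key] != value:
            overwrites += 1
        last_written[key] = value

        if key in store:
            store.move_to_end(key)     # refresh recency of an existing key
        elif len(store) >= capacity:
            store.popitem(last=False)  # evict the least recently written key
            evictions += 1
        store[key] = value
    return evictions, overwrites


# Same seeded workload for every capacity, so only the capacity changes.
rng = random.Random(0)
writes = [(f"key_{rng.randrange(40)}", rng.randrange(1000)) for _ in range(2000)]

results = {capacity: run_workload(capacity, writes) for capacity in (5, 15, 30)}
for capacity, (evictions, overwrites) in results.items():
    print(f"capacity={capacity:>2}  evictions={evictions}  overwrites={overwrites}")
```

Under these assumptions the overwrite count is identical for every capacity, because it depends only on the sequence of keys and values, while the eviction count shrinks as capacity grows. A different overwrite definition or eviction policy could behave differently.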
Event Sourcing
Every memory operation creates an immutable event:
```python
MemoryEvent(
    event_id="uuid-1234",
    event_type=MemoryEventType.WRITE,
    memory_layer=MemoryLayer.STM,
    step=1,
    timestamp=1706345678.123,
    key="user_name",
    value="Alice",
    metadata={},
)
```
The event log becomes the single source of truth.
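A minimal sketch of what "single source of truth" can mean in practice: the current memory state is never stored directly, it is rebuilt by replaying the log. The classes below are simplified stand-ins written for this example (they omit `memory_layer`, and `EventLog` and `replay` are hypothetical names), not MemTrace's actual classes.

```python
import time
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class MemoryEventType(Enum):
    WRITE = "write"
    READ = "read"
    UPDATE = "update"
    EVICT = "evict"


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class MemoryEvent:
    event_type: MemoryEventType
    step: int
    key: str
    value: Any
    metadata: dict = field(default_factory=dict)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)


class EventLog:
    """Append-only log; the current memory state is derived by replaying it."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def events(self):
        return tuple(self._events)  # read-only snapshot for callers

    def replay(self):
        """Rebuild the key -> value state implied by the events so far."""
        state = {}
        for event in self._events:
            if event.event_type in (MemoryEventType.WRITE, MemoryEventType.UPDATE):
                state[event.key] = event.value
            elif event.event_type is MemoryEventType.EVICT:
                state.pop(event.key, None)
        return state


log = EventLog()
log.append(MemoryEvent(MemoryEventType.WRITE, step=1, key="user_name", value="Alice"))
log.append(MemoryEvent(MemoryEventType.WRITE, step=2, key="deadline", value="Friday"))
log.append(MemoryEvent(MemoryEventType.EVICT, step=5, key="deadline", value="Friday",
                       metadata={"reason": "capacity_overflow"}))
print(log.replay())
```

Replaying these three events leaves only `user_name` in memory, and the log still records that `deadline` existed and why it disappeared.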
Automated Diagnosis
When a read fails, MemTrace analyzes the event history:
```
Scenario: Agent tries to read "deadline" but gets None

Event log shows:
  WRITE deadline="Friday" (step 2)
  EVICT deadline="Friday" (step 5, reason: capacity_overflow)
  READ  deadline=None (step 7)

Diagnosis: "memory_evicted"
Evidence: "Key was written at step 2, evicted at step 5, recall attempted at step 7"
```
4 Failure Types
- Memory Evicted - Removed due to capacity constraints
- Memory Overwritten - Updated with a different value
- Invalid Read - Never written in the first place
- LLM Hallucination - Agent returns wrong value despite correct memory
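The four categories above can be told apart with a few checks against the event log. The function below is a rough sketch of that decision order, not MemTrace's real diagnosis code. It assumes events are plain `(event_type, step, key, value)` tuples and that the test scenario knows which value it expects; `diagnose_read` is a hypothetical name.

```python
def diagnose_read(events, key, expected, returned):
    """Classify a read of `key`, given the events logged before it.

    `events` is an ordered list of (event_type, step, key, value) tuples.
    Returns a (diagnosis, evidence) pair.
    """
    if returned == expected:
        return "pass", "Returned value matches the expected value"

    # Only events that change memory for this key matter for the diagnosis.
    history = [e for e in events if e[2] == key and e[0] != "READ"]
    writes = [e for e in history if e[0] in ("WRITE", "UPDATE")]
    if not writes:
        return "invalid_read", f"Key {key!r} was never written"

    last_type, last_step, _, last_value = history[-1]
    if last_type == "EVICT":
        return "memory_evicted", (
            f"Key was written at step {writes[-1][1]}, evicted at step {last_step}"
        )
    if last_value != expected:
        return "memory_overwritten", (
            f"Value changed to {last_value!r} at step {last_step}"
        )
    # Memory still holds the expected value, so the wrong answer came from the agent.
    return "llm_hallucination", (
        f"Memory holds {last_value!r} but the agent returned {returned!r}"
    )


events = [
    ("WRITE", 2, "deadline", "Friday"),
    ("EVICT", 5, "deadline", "Friday"),
]
print(diagnose_read(events, "deadline", expected="Friday", returned=None))
```

For the "deadline" scenario this returns `memory_evicted` with the write and eviction steps as evidence. Because every failed read falls into exactly one branch, nothing is left uncategorized in this sketch.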
System Invariants (Validated)
After testing 1000 scenarios, these patterns hold:
- Capacity ↑ → Eviction ↓
- Overwrite ~ independent of capacity
- Invalid Read = scenario artifact
- Unknown = 0 always (100% failure categorization)
What I Learned
1. Event Sourcing Is Powerful
Having a complete history of every operation makes debugging so much easier.
Instead of guessing why something failed, you can trace back through the exact sequence of events.
2. Capacity Is Critical
If your agent has limited memory, evictions will dominate your failures.
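The data shows a clear downward trend with diminishing returns rather than a strictly linear one: going from low (1-5) to medium (10-15) capacity cut evictions by about 15% (1354 to 1150), and roughly doubling again to high (20-30) cut them by about 14% (1150 to 994).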
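The step-to-step reductions can be worked out directly from the eviction counts reported under Finding 3. This is plain arithmetic on those three numbers; the tier labels are just dictionary keys for this example.

```python
# Eviction counts per capacity tier, as reported in Finding 3.
evictions = {"low (1-5)": 1354, "medium (10-15)": 1150, "high (20-30)": 994}

tiers = list(evictions)
reductions = {}
for previous, current in zip(tiers, tiers[1:]):
    # Relative drop in evictions when moving up one capacity tier.
    drop = 100 * (evictions[previous] - evictions[current]) / evictions[previous]
    reductions[f"{previous} -> {current}"] = round(drop, 1)

for step, drop in reductions.items():
    print(f"{step}: {drop}% fewer evictions")
```

That works out to about 15.1% for the first step and 13.6% for the second. With only three tiers, this is enough to show the direction of the effect but not the exact shape of the curve.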
3. Overwrites Are Sneaky
Overwrites happen when you reuse keys. They're independent of capacity, which means you can't solve them by just adding more memory.
You need better key management or versioning.
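One way to approach versioning is to append every write to a per-key history instead of replacing the old value. The class below is an illustrative sketch, not a MemTrace feature. `VersionedMemory` and `as_of_step` are hypothetical names, and it assumes writes arrive in increasing step order.

```python
from collections import defaultdict


class VersionedMemory:
    """Keeps every value written to a key instead of replacing the old one."""

    def __init__(self):
        self._versions = defaultdict(list)  # key -> [(step, value), ...]

    def write(self, key, value, step):
        # Append rather than overwrite; assumes steps are non-decreasing.
        self._versions[key].append((step, value))

    def read(self, key, as_of_step=None):
        """Return the latest value, or the value as it stood at `as_of_step`."""
        versions = self._versions.get(key, [])
        if as_of_step is not None:
            versions = [v for v in versions if v[0] <= as_of_step]
        return versions[-1][1] if versions else None


memory = VersionedMemory()
memory.write("deadline", "Friday", step=2)
memory.write("deadline", "Monday", step=6)

print(memory.read("deadline"))                # latest value
print(memory.read("deadline", as_of_step=4))  # value before the overwrite
```

The trade-off is that old versions take up space, so in a capacity-limited memory this swaps some overwrite failures for extra eviction pressure.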
4. Testing Reveals Patterns
Before building MemTrace, I thought hallucinations would be the main issue.
Nope. In my tests with a deterministic agent, hallucinations never showed up at all, and evictions were the top cause, roughly 3x more common than overwrites (994 vs. 343).
Real LLMs would show more hallucinations, but capacity is still a huge factor.
Feedback Welcome
This is my first open-source project and I'm still learning!
If you:
- Have ideas for improvements
- Found a bug
- Have questions

Please reach out!