Mahendra Gurjar

Posted on Jan 29

I Tested 3000+ LLM Agent Memory Operations - Here's What I Found

#ai #testing #agents #programming

🤔 The Problem

If you've built LLM-based agents, you've probably noticed: they forget things.

A lot.

Your agent remembers the user's name in message 1, forgets it by message 5, and then hallucinates a completely different name by message 10.

But why do agents forget? Is it:

Memory capacity issues?
Information getting overwritten?
The LLM hallucinating?
Something else?

Nobody had data. Just anecdotes and frustration. So I built MemTrace to answer this question with actual statistics.

What I Built

MemTrace is a testing framework that tracks every single memory operation an agent makes and diagnoses why recalls fail.
Think of it like a "black box recorder" for agent memory.
Core idea:

Track every WRITE, READ, UPDATE, and EVICT operation
Compare what the agent returns vs. what was originally stored
Diagnose failures with evidence from the event log

I tested 1000 random scenarios with 3030 memory operations to find patterns.
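To make the core idea concrete, here is a minimal sketch of a bounded memory that logs each operation. `TracedMemory`, its methods, and the oldest-first eviction rule are assumptions for illustration, not MemTrace's actual API.

```python
from collections import OrderedDict


class TracedMemory:
    """Bounded key-value memory that logs every operation (hypothetical, not MemTrace's API)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.store = OrderedDict()   # insertion order doubles as eviction order
        self.log = []                # (step, operation, key, value) tuples
        self.step = 0

    def _record(self, operation, key, value):
        self.step += 1
        self.log.append((self.step, operation, key, value))

    def write(self, key, value):
        if key in self.store:
            self.store[key] = value
            self._record("UPDATE", key, value)
            return
        if len(self.store) >= self.capacity:
            # Assumed policy: evict the oldest entry when the memory is full.
            old_key, old_value = self.store.popitem(last=False)
            self._record("EVICT", old_key, old_value)
        self.store[key] = value
        self._record("WRITE", key, value)

    def read(self, key):
        value = self.store.get(key)   # None when the key is not in memory
        self._record("READ", key, value)
        return value


memory = TracedMemory(capacity=2)
memory.write("user_name", "Alice")
memory.write("deadline", "Friday")
memory.write("city", "Pune")          # memory is full, so "user_name" is evicted
recalled = memory.read("user_name")   # None: the stored value is gone

for entry in memory.log:
    print(entry)
```

Comparing `recalled` with the originally written `"Alice"` flags the failure, and the log shows the eviction that caused it.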

Key Findings

Finding 1: Agents Forget 60% of the Time

Valid Recall Rate: 39.6%
That means when an agent tries to recall information it previously stored, it fails 6 out of 10 times.
(This excludes "invalid reads" where the agent tries to read something that was never written - those are test artifacts, not real failures)

Finding 2: Evictions Dominate

Memory Evicted: 46.2% of all failures
Nearly half of all memory failures happen because the agent ran out of space and had to evict old information.

Breakdown:

Memory Evicted: 994 failures (46.2%)
Invalid Read: 815 failures (37.9%)
Memory Overwritten: 343 failures (15.9%)
LLM Hallucination: 0 failures (0.0%)

Note: No hallucinations appeared in this test because I used a deterministic agent. Real LLMs would show hallucinations too.
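The percentages above can be reproduced from the raw counts. The sketch below assumes that all 3030 recorded operations are read attempts. The post does not state this, but under that assumption the numbers line up with both the 39.6% valid recall rate and the 29.0% pass rate reported for high capacity.

```python
# Failure counts from the breakdown above.
evicted = 994
invalid_read = 815
overwritten = 343
hallucinated = 0

total_failures = evicted + invalid_read + overwritten + hallucinated  # 2152

# ASSUMPTION: every one of the 3030 operations is a read attempt.
total_reads = 3030

successful_reads = total_reads - total_failures   # 878
valid_reads = total_reads - invalid_read          # 2215 (reads of keys that were written)

valid_recall_rate = successful_reads / valid_reads
overall_pass_rate = successful_reads / total_reads
eviction_share = evicted / total_failures

print(f"Valid recall rate:          {valid_recall_rate:.1%}")   # 39.6%
print(f"Overall pass rate:          {overall_pass_rate:.1%}")   # 29.0%
print(f"Eviction share of failures: {eviction_share:.1%}")      # 46.2%
```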

Finding 3: Capacity Matters (Proven)

Validated Invariant: Capacity ↑ → Eviction ↓
I tested different memory capacities and found a clear pattern:

| Capacity | Evictions | Pass Rate |
| --- | --- | --- |
| Low (1-5) | 1354 | 21.3% |
| Medium (10-15) | 1150 | 25.6% |
| High (20-30) | 994 | 29.0% |

Higher capacity = fewer evictions = better recall.
Seems obvious, but now we have data to prove it.

Finding 4: Overwrites Are Independent

Overwrites stay constant regardless of capacity.
Whether you have a capacity of 5 or 30, you get ~340 overwrites per 1000 scenarios.
Why? Overwrites depend on how often you reuse the same keys, not how much memory you have.
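A small simulation illustrates why the two failure types behave differently. This is a sketch under stated assumptions, not the MemTrace test harness: the store evicts the least recently written key, an overwrite is counted as any write that changes a key's earlier value (tracked from the full write history), and the workload of 500 writes over 40 keys is made up.

```python
import random
from collections import OrderedDict


def run_workload(capacity, writes):
    """Replay writes against a bounded store; return (evictions, overwrites)."""
    store = OrderedDict()   # keys currently in memory, least recently written first
    last_written = {}       # latest value ever written per key, never evicted
    evictions = 0
    overwrites = 0
    for key, value in writes:
        # An overwrite is a write that changes a value the key held earlier.
        if key in last_written and last_written[key] != value:
            overwrites += 1
        last_written[key] = value

        if key in store:
            store.move_to_end(key)        # refresh recency on rewrite
        elif len(store) >= capacity:
            store.popitem(last=False)     # evict the least recently written key
            evictions += 1
        store[key] = value
    return evictions, overwrites


rng = random.Random(0)  # fixed seed so every capacity sees the same workload
writes = [(f"key_{rng.randrange(40)}", rng.randrange(1000)) for _ in range(500)]

results = {capacity: run_workload(capacity, writes) for capacity in (5, 15, 30)}
for capacity, (evictions, overwrites) in results.items():
    print(f"capacity={capacity:>2}  evictions={evictions}  overwrites={overwrites}")
```

The eviction count falls as capacity grows, while the overwrite count is identical in every run because it depends only on which keys the workload reuses.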

Event Sourcing

Every memory operation creates an immutable event:

```python
MemoryEvent(
    event_id="uuid-1234",
    event_type=MemoryEventType.WRITE,
    memory_layer=MemoryLayer.STM,
    step=1,
    timestamp=1706345678.123,
    key="user_name",
    value="Alice",
    metadata={}
)
```

The event log becomes the single source of truth.
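For readers who want to try the pattern, here is a self-contained sketch of an immutable event plus an append-only log. The field names mirror the example above, but the class definitions, enum values, and the `EventLog` helper are assumptions, not MemTrace's real code.

```python
import time
import uuid
from dataclasses import dataclass, field, FrozenInstanceError
from enum import Enum


class MemoryEventType(Enum):
    WRITE = "write"
    READ = "read"
    UPDATE = "update"
    EVICT = "evict"


class MemoryLayer(Enum):
    STM = "stm"   # the only layer that appears in the post's example


@dataclass(frozen=True)   # frozen=True blocks attribute assignment after creation
class MemoryEvent:
    event_type: MemoryEventType
    memory_layer: MemoryLayer
    step: int
    key: str
    value: object
    metadata: dict = field(default_factory=dict)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)


class EventLog:
    """Append-only log: events can be added and queried, never edited or removed."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def history(self, key):
        """Return every event recorded for `key`, in the order it happened."""
        return [event for event in self._events if event.key == key]


log = EventLog()
log.append(MemoryEvent(MemoryEventType.WRITE, MemoryLayer.STM, step=1, key="user_name", value="Alice"))
log.append(MemoryEvent(MemoryEventType.READ, MemoryLayer.STM, step=2, key="user_name", value="Alice"))

try:
    log.history("user_name")[0].value = "Bob"   # rejected by the frozen dataclass
except FrozenInstanceError:
    print("events are immutable")
```

Note that `frozen=True` only blocks reassigning fields. The `metadata` dict itself can still be mutated, so a stricter version would store a read-only mapping.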

Automated Diagnosis

When a read fails, MemTrace analyzes the event history:

Scenario: Agent tries to read "deadline" but gets None

Event log shows:

WRITE deadline="Friday" (step 2)
EVICT deadline="Friday" (step 5, reason: capacity_overflow)
READ deadline=None (step 7)

Diagnosis: "memory_evicted"

Evidence: "Key was written at step 2, evicted at step 5, recall attempted at step 7"

4 Failure Types

Memory Evicted - Removed due to capacity constraints
Memory Overwritten - Updated with different value
Invalid Read - Never written in the first place
LLM Hallucination - Agent returns wrong value despite correct memory
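The four failure types map naturally onto a rule-based classifier over the event log. The sketch below is one possible reading of those rules, not MemTrace's actual implementation. The `Event` tuple, the `diagnose` signature, and every label except `memory_evicted` are assumptions.

```python
from typing import NamedTuple, Optional


class Event(NamedTuple):
    step: int
    op: str                # "WRITE", "UPDATE", "EVICT" or "READ"
    key: str
    value: Optional[str]


def diagnose(events, key, expected, returned, read_step):
    """Classify a failed recall of `key` from the events recorded before `read_step`."""
    history = [
        e for e in events
        if e.key == key and e.op != "READ" and e.step < read_step
    ]
    if not history:
        return "invalid_read", "Key was never written"

    last = history[-1]
    if last.op == "EVICT":
        return "memory_evicted", (
            f"Key was written at step {history[0].step}, evicted at step {last.step}, "
            f"recall attempted at step {read_step}"
        )
    if last.value != expected:
        return "memory_overwritten", (
            f"Value changed to {last.value!r} at step {last.step}"
        )
    if returned != last.value:
        return "llm_hallucination", (
            f"Memory holds {last.value!r} but the agent returned {returned!r}"
        )
    return "unknown", "No failure pattern matched"   # reached only if the read succeeded


events = [
    Event(2, "WRITE", "deadline", "Friday"),
    Event(5, "EVICT", "deadline", "Friday"),
    Event(7, "READ", "deadline", None),
]
label, evidence = diagnose(events, "deadline", expected="Friday", returned=None, read_step=7)
print(label)      # memory_evicted
print(evidence)   # Key was written at step 2, evicted at step 5, recall attempted at step 7
```

The rules are checked in order, so each failed read gets exactly one label.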

System Invariants (Validated)

After testing 1000 scenarios, these patterns hold:

Capacity ↑ → Eviction ↓
Overwrite ~ independent of capacity
Invalid Read = scenario artifact
Unknown = 0 always (100% failure categorization)

What I Learned

1. Event Sourcing Is Powerful

Having a complete history of every operation makes debugging so much easier.

Instead of guessing why something failed, you can trace back through the exact sequence of events.

2. Capacity Is Critical

If your agent has limited memory, evictions will dominate your failures.

The data shows a steady decline: each step up to the next capacity tier cut evictions by roughly 14-15%.
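The size of the reduction can be checked directly against the eviction counts reported for the three capacity tiers in Finding 3:

```python
# Eviction counts per capacity tier, as reported in Finding 3.
evictions_by_tier = [
    ("Low (1-5)", 1354),
    ("Medium (10-15)", 1150),
    ("High (20-30)", 994),
]

drops = []
for (prev_name, prev_count), (name, count) in zip(evictions_by_tier, evictions_by_tier[1:]):
    drop = (prev_count - count) / prev_count   # relative reduction between adjacent tiers
    drops.append(drop)
    print(f"{prev_name} -> {name}: evictions down {drop:.1%}")

# Low (1-5) -> Medium (10-15): evictions down 15.1%
# Medium (10-15) -> High (20-30): evictions down 13.6%
```

With only three tiers the exact shape of the curve is uncertain, but the reduction per tier is consistent.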

3. Overwrites Are Sneaky

Overwrites happen when you reuse keys. They're independent of capacity, which means you can't solve them by just adding more memory.

You need better key management or versioning.
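As one possible take on versioning, the sketch below keeps every value written to a key instead of replacing the previous one. `VersionedMemory` and its `as_of_step` parameter are hypothetical names for illustration and are not part of MemTrace.

```python
from collections import defaultdict


class VersionedMemory:
    """Keeps every value written to a key instead of replacing the previous one."""

    def __init__(self):
        self._versions = defaultdict(list)   # key -> [(step, value), ...] in write order

    def write(self, key, value, step):
        self._versions[key].append((step, value))

    def read(self, key, as_of_step=None):
        """Return the latest value, or the value that was current at `as_of_step`."""
        versions = self._versions.get(key, [])
        if as_of_step is not None:
            versions = [(s, v) for s, v in versions if s <= as_of_step]
        return versions[-1][1] if versions else None


memory = VersionedMemory()
memory.write("deadline", "Friday", step=2)
memory.write("deadline", "Monday", step=6)   # a plain dict would lose "Friday" here

print(memory.read("deadline"))                 # Monday
print(memory.read("deadline", as_of_step=4))   # Friday
```

The trade-off is that history grows without bound, so a real agent would still need a retention policy to keep versioning from making the capacity problem worse.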

4. Testing Reveals Patterns

Before building MemTrace, I thought hallucinations would be the main issue.
Nope. In my tests with a deterministic agent, hallucinations never appeared, and evictions were the largest failure category, roughly 3x more common than overwrites.

Real LLMs would show more hallucinations, but capacity is still a huge factor.

Feedback Welcome

This is my first open-source project and I'm still learning!

If you:

Have ideas for improvements
Found a bug
Have questions

Please reach out!

GitHub
