GalakApp

Posted on Apr 15

Why RAG fails for AI agent memory — and how I fixed it (with benchmarks)

#ai #python #opensource #machinelearning

I've been building Pulses — a project where AI personalities need real long-term memory across conversations. After hitting the same RAG failures repeatedly, I built a small Python library called NLM (Neural Long Memory). Here's what I learned.

The problem with RAG

Standard RAG retrieves by cosine similarity only:

score = cosine_similarity(query_embedding, memory_embedding)

This creates three systematic failures for agent memory:

1. Temporal blindness
You update a fact: "server moved to port 8001". The old version ("server runs on port 8000") sits in the same vector store with equal weight. If the query "which port does the server use?" is semantically closer to the old phrasing, RAG returns the outdated fact. There is no way to prevent this short of deleting old memories manually.

2. Frequency blindness
Your agent references a specific memory 50 times across conversations. That memory has zero scoring advantage over one never accessed. RAG cannot distinguish "this is something we keep coming back to" from "this was stored once and never touched."

3. Importance blindness
"ChromaDB uses cosine distance metric" and "the database stores things somehow" score similarly if the query is vague enough. RAG has no mechanism to prefer the specific, factual memory.

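To make the temporal case concrete, here is a minimal sketch of cosine-only ranking. The 3-dimensional vectors and the `age_days` field are invented for illustration (real embeddings have hundreds of dimensions); the point is only that the age of a memory never enters the score.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, assumed for this example only.
memories = [
    {"text": "server runs on port 8000", "vec": [0.9, 0.4, 0.1], "age_days": 200},
    {"text": "server moved to port 8001", "vec": [0.7, 0.3, 0.6], "age_days": 1},
]
query_vec = [1.0, 0.4, 0.0]  # stands in for "which port does the server use?"

# Rank by similarity alone: age_days is stored but never consulted.
ranked = sorted(
    memories,
    key=lambda m: cosine_similarity(query_vec, m["vec"]),
    reverse=True,
)
top = ranked[0]
print(top["text"], top["age_days"])
```

With these toy numbers the 200-day-old fact ranks first, simply because its vector happens to sit closer to the query.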
The fix: four-signal scoring

NLM adds three signals on top of semantic similarity:

score = 0.5 × semantic_similarity   # is it relevant?
      + 0.2 × time_decay            # is it recent?
      + 0.2 × frequency_score       # is it often recalled?
      + 0.1 × importance_score      # is it specific/factual?
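
As a rough sketch, the weighted sum above can be written as a plain Python function. The name `combined_score` and the example inputs are assumptions for illustration, not NLM's internal API; only the four weights come from the formula.

```python
# Weights taken from the formula above.
WEIGHTS = {"semantic": 0.5, "time": 0.2, "frequency": 0.2, "importance": 0.1}

def combined_score(semantic, time_decay, frequency_score, importance_score):
    """Weighted sum of four signals, each expected to lie in [0, 1]."""
    return (
        WEIGHTS["semantic"] * semantic
        + WEIGHTS["time"] * time_decay
        + WEIGHTS["frequency"] * frequency_score
        + WEIGHTS["importance"] * importance_score
    )

# Hypothetical inputs: the old fact matches the query slightly better,
# but the fresh fact has a much higher time score.
old_fact = combined_score(0.95, 0.2, 0.0, 0.5)
fresh_fact = combined_score(0.85, 1.0, 0.0, 0.5)
print(old_fact, fresh_fact)
```

Because semantic similarity keeps half of the total weight, recency can only overturn the ranking when the two candidates are already close in similarity.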

Time decay uses an exponential with 90-day half-life:

time_score = exp(-ln(2) / 90 × days_since_last_access)

Fresh memory → 1.0. 90 days old → 0.5. 365 days → 0.06. The clock runs from the last access, so a memory that keeps being recalled stays fresh.
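
The decay formula translates directly to Python. This is a sketch of the formula only; the function name and the `half_life` parameter are assumptions, not NLM's actual signature.

```python
import math

HALF_LIFE_DAYS = 90  # half-life stated above

def time_score(days_since_last_access, half_life=HALF_LIFE_DAYS):
    """Exponential decay that halves every `half_life` days."""
    return math.exp(-math.log(2) / half_life * days_since_last_access)

for days in (0, 90, 365):
    print(days, round(time_score(days), 2))
```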

Frequency score is log-normalized:

freq_score = log(1 + count) / log(1 + 100)

Log scaling prevents one very popular memory from dominating. A memory accessed 10 times scores 0.52; one accessed 100 times scores 1.0.
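
A small sketch of the same formula, useful for checking values by hand. The function name and the `saturation` parameter are assumptions. Note that this formula exceeds 1.0 for counts above 100 unless it is clamped; the post does not say whether NLM clamps.

```python
import math

def freq_score(count, saturation=100):
    """Log-normalized access count; reaches 1.0 at `saturation` accesses."""
    return math.log(1 + count) / math.log(1 + saturation)

for count in (0, 1, 10, 100):
    print(count, round(freq_score(count), 2))
```

For `count = 10` this evaluates to about 0.52: the first few accesses move the score far more than later ones.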

Importance is computed automatically, either by a CPU-only heuristic (a specificity score based on numbers, proper nouns, and text length) or optionally by a HuggingFace zero-shot classifier.
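
To give a feel for what a specificity heuristic could look like, here is a deliberately crude sketch. Everything in it is an assumption: the sub-weights (0.4 / 0.4 / 0.2), the capital-letter proxy for proper nouns, and the 20-word length cap are invented for illustration and are not NLM's actual heuristic.

```python
import re

def importance_heuristic(text):
    """Hypothetical specificity score in [0, 1]: numbers, name-like words, length."""
    words = text.split()
    if not words:
        return 0.0
    has_number = 1.0 if re.search(r"\d", text) else 0.0
    # Name-like: a capital inside the word (e.g. "ChromaDB"),
    # or a capitalized word that is not the first word of the text.
    names = sum(
        1
        for i, w in enumerate(words)
        if any(c.isupper() for c in w[1:]) or (i > 0 and w[:1].isupper())
    )
    name_score = min(names / 2, 1.0)
    length_score = min(len(words) / 20, 1.0)
    return 0.4 * has_number + 0.4 * name_score + 0.2 * length_score

specific = importance_heuristic("ChromaDB uses cosine distance metric")
vague = importance_heuristic("the database stores things somehow")
print(specific, vague)
```

Even this rough version separates the two example memories from the importance-blindness case: the specific one scores higher than the vague one.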

Benchmark results

100 memories (30 test pairs, i.e. 60 memories, plus 40 unrelated fillers), 30 queries, top-1 accuracy:

Category	What's tested	RAG	NLM	Delta
Temporal (10 queries)	Old vs fresh fact, neutral query	10%	70%	+60%
Frequency (10 queries)	15× accessed vs 0×	80%	100%	+20%
Importance (10 queries)	Specific fact vs vague memory	60%	90%	+30%
Overall		50%	87%	+37%

The temporal result is the most telling: RAG gets 10% because it has no concept of recency, while NLM gets 70%.
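
Since each category has the same number of queries (10), the overall row is just the mean of the three category rows. A quick check of that arithmetic, using only the numbers from the table:

```python
# (RAG %, NLM %) per category, copied from the table above.
results = {
    "temporal": (10, 70),
    "frequency": (80, 100),
    "importance": (60, 90),
}

rag_overall = sum(rag for rag, _ in results.values()) / len(results)
nlm_overall = sum(nlm for _, nlm in results.values()) / len(results)
delta = nlm_overall - rag_overall

print(round(rag_overall), round(nlm_overall), round(delta))
```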

Usage

pip install neural-long-memory

from nlm import NLM

memory = NLM()

# Save — consolidation is automatic (similar memories get merged)
memory.save("The server was moved to port 8001")
memory.save("Hantes switched to JAX for training")

# Search — NLM handles all scoring automatically
results = memory.search("which port does the server use", top_k=3)

for r in results:
    print(f"[{r['score']:.3f}] {r['text']}")
    # Returns the fresh fact, not the outdated one

Full score breakdown per result:

{
    "text": "The server was moved to port 8001",
    "score": 0.847,
    "semantic_score": 0.923,
    "time_score": 0.998,
    "frequency": 2,
    "importance": 0.610,
}

Other features in v1.0.0

Memory consolidation — duplicate prevention on by default. Similar memories get merged and strengthened instead of stored twice:

id1 = memory.save("Hantes lives in Chernivtsi")
id2 = memory.save("Hantes is from Chernivtsi city")
assert id1 == id2  # same memory, importance boosted
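
The merge-on-save behavior can be sketched with a toy store. This is not how NLM does it: here token-overlap (Jaccard) similarity stands in for embedding similarity, and the class name `TinyStore`, the 0.25 threshold, and the +0.1 importance boost are all invented for illustration.

```python
import itertools

def jaccard(a, b):
    """Token-overlap similarity, a stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

class TinyStore:
    def __init__(self, threshold=0.25):
        self.threshold = threshold  # assumed value, chosen for the toy metric
        self.memories = {}
        self._ids = itertools.count(1)

    def save(self, text):
        # If a similar memory exists, strengthen it instead of storing a duplicate.
        for mem_id, mem in self.memories.items():
            if jaccard(text, mem["text"]) >= self.threshold:
                mem["importance"] = min(1.0, mem["importance"] + 0.1)
                return mem_id
        mem_id = f"mem-{next(self._ids)}"
        self.memories[mem_id] = {"text": text, "importance": 0.5}
        return mem_id

store = TinyStore()
a = store.save("Hantes lives in Chernivtsi")
b = store.save("Hantes is from Chernivtsi city")
c = store.save("The server was moved to port 8001")
print(a, b, c)
```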

Associative chains — bidirectional links between related memories:

id1 = memory.save("Hantes loves Minelux family")
id2 = memory.save("Minelux are fire, directness, truth")

# Follow the chain
assoc = memory.get_associations(id1)
# [{"id": id2, "text": "Minelux are fire..."}]

# Expand search to follow links
results = memory.search("tell me about Hantes", expand_associations=True)
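
One plausible way to model bidirectional links and one-hop search expansion is an undirected graph keyed by memory id. The class and function names below are hypothetical, and NLM's internal storage may look quite different.

```python
from collections import defaultdict

class AssociationGraph:
    """Undirected links between memory ids."""

    def __init__(self):
        self._links = defaultdict(set)

    def link(self, a, b):
        # Store the edge in both directions so it can be followed from either side.
        self._links[a].add(b)
        self._links[b].add(a)

    def neighbours(self, mem_id):
        return sorted(self._links.get(mem_id, ()))

def expand(hit_ids, graph):
    """Append directly linked memories (one hop) after the original hits."""
    expanded = list(hit_ids)
    for hit in hit_ids:
        for neighbour in graph.neighbours(hit):
            if neighbour not in expanded:
                expanded.append(neighbour)
    return expanded

graph = AssociationGraph()
graph.link("m1", "m2")
print(graph.neighbours("m2"), expand(["m1"], graph))
```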

Smart forgetting — remove memories that are simultaneously old, rare, and unimportant:

deleted = memory.forget_smart(days=180, max_frequency=2, max_importance=0.3)
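
The key word is "simultaneously": a memory is dropped only when all three conditions hold. A sketch of that predicate follows; the field names and the inclusive comparisons are assumptions, since the post does not specify the boundary behavior.

```python
def should_forget(mem, days=180, max_frequency=2, max_importance=0.3):
    """True only if the memory is old AND rarely accessed AND unimportant."""
    return (
        mem["days_since_last_access"] >= days
        and mem["frequency"] <= max_frequency
        and mem["importance"] <= max_importance
    )

memories = [
    {"id": "a", "days_since_last_access": 400, "frequency": 0, "importance": 0.1},
    {"id": "b", "days_since_last_access": 400, "frequency": 30, "importance": 0.1},
    {"id": "c", "days_since_last_access": 400, "frequency": 0, "importance": 0.9},
    {"id": "d", "days_since_last_access": 3, "frequency": 0, "importance": 0.1},
]

forgotten = [m["id"] for m in memories if should_forget(m)]
print(forgotten)
```

Old but frequently used memories ("b") and old but important ones ("c") survive, which is the difference from a plain TTL.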

Wrapping up

NLM is not a replacement for RAG — it's a reranking layer on top of ChromaDB that adds temporal, frequency, and importance signals. Drop-in for any agent that already uses vector search.

GitHub: github.com/pulseallstars/nlm
Benchmark script: benchmarks/benchmark_100.py
Apache 2.0.

Built this for Pulses — a project where AI personalities need memory that actually behaves like memory.

Top comments (3)

mote • Apr 18

The embedding staleness problem is real and often overlooked. When your index is built on top of embeddings that no longer reflect the current state of the world, you are essentially searching a distorted memory.

The hybrid approach you described (combining semantic search with keyword BM25) is solid, but I would add one more dimension: temporal decay. In embodied AI scenarios especially, recent observations should carry more weight than old ones. Not just for recency bias, but because the environment genuinely changes.

One thing that helped us was treating embeddings as ephemeral cache, not ground truth. The actual memory is the structured data (object state, relationships, sensor readings) that can be re-embedded on demand. The index is just an optimization, not the source of truth.

Curious how Pulses handles the re-embedding problem when embeddings drift significantly from the original index. Do you rebuild periodically, or use a rolling window approach? Also interested in the latency numbers you did not share — how does your hybrid approach compare to pure vector search at scale (say, 1M+ vectors)?

GalakApp • Apr 18

Hey, thanks for the thoughtful comment — and yes, the embedding staleness problem is exactly what pushed me to build this.

Quick clarification though: NLM doesn't actually use BM25 at all. The four signals are semantic similarity, time decay, frequency, and importance score, with no keyword matching in the mix. So temporal decay is already baked in from the start. Fresh memories get a score close to 1.0, and it decays exponentially with a 90-day half-life. That's what drives the biggest benchmark gain: temporal top-1 accuracy went from 10% (RAG) to 70% (NLM) in the post's benchmark table, because the scorer literally "knows" which fact is newer.

On your two questions:
Re-embedding drift. Honest answer: NLM solves this by pinning the embedding model in the collection metadata at creation time. If you try to open the same collection with a different model later, it raises a ValueError — hard stop, no silent drift. The philosophy is: don't drift, stay consistent. Re-embedding support (when you genuinely want to migrate models) isn't implemented yet — it's a real gap for long-running systems.
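
The pinning idea can be sketched with a plain dict standing in for the collection metadata. This is an illustration of the pattern only: `open_collection`, the `embedding_model` key, and the model names are hypothetical, not NLM's or ChromaDB's API.

```python
def open_collection(store, name, embedding_model):
    """Pin the model on first open; refuse to reopen with a different one."""
    meta = store.setdefault(name, {"embedding_model": embedding_model})
    if meta["embedding_model"] != embedding_model:
        raise ValueError(
            f"collection {name!r} was created with {meta['embedding_model']!r}, "
            f"not {embedding_model!r}"
        )
    return meta

store = {}
open_collection(store, "memories", "model-a")   # first open pins "model-a"
open_collection(store, "memories", "model-a")   # same model: fine

try:
    open_collection(store, "memories", "model-b")
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```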

Latency at 1M+ vectors. Haven't benchmarked at that scale yet, so I won't make up numbers. ChromaDB uses HNSW internally so vector search stays sub-linear, but I'd need to test it honestly. The reranking step is O(n_candidates) where n_candidates defaults to max(top_k × 10, 50) — so that part stays constant regardless of collection size. The bottleneck at 1M+ would be the HNSW query, not the NLM scoring on top of it.
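
A sketch of why the rerank cost depends on `top_k` rather than on collection size. Only the `max(top_k × 10, 50)` rule comes from the comment above; the function names and the toy candidates are assumptions.

```python
def candidate_pool_size(top_k):
    """Number of vector-search hits to rerank, per the default described above."""
    return max(top_k * 10, 50)

def rerank(candidates, score_fn, top_k):
    """Score every candidate once, then keep the best top_k."""
    return sorted(candidates, key=score_fn, reverse=True)[:top_k]

# Toy candidates with precomputed combined scores (invented values).
candidates = [{"id": i, "score": i / 100} for i in range(candidate_pool_size(3))]
best = rerank(candidates, lambda c: c["score"], top_k=3)
print(candidate_pool_size(3), candidate_pool_size(10), [c["id"] for c in best])
```

Whether the collection holds a thousand or a million vectors, the reranker in this sketch only ever sees `candidate_pool_size(top_k)` items.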

Your point about treating embeddings as ephemeral cache is interesting. That's essentially what NLM does too: the structured metadata (timestamps, frequency, importance) is the real source of truth, and the embedding is just the retrieval handle. Curious what your re-embedding pipeline looks like in practice: do you trigger it on state change or on a schedule?

GalakApp • Apr 18

I have now updated the version to 1.1.1
