I've been building Pulses — a project where AI personalities need real long-term memory across conversations. After hitting the same RAG failures repeatedly, I built a small Python library called NLM (Neural Long Memory). Here's what I learned.
## The problem with RAG
Standard RAG retrieves by cosine similarity alone:

```python
score = cosine_similarity(query_embedding, memory_embedding)
```
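To make the failure mode concrete, here is a minimal sketch (toy 3-dimensional vectors, not real embeddings): with similarity as the only signal, metadata like age or access count never enters the ranking, so a stale fact phrased close to the query wins.

```python
import math

def cosine_similarity(a, b):
    # Plain cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.0]
old_fact = [0.95, 0.05, 0.0]   # stale, but phrased close to the query
new_fact = [0.6, 0.4, 0.0]     # fresh, but phrased differently

# The stale memory outranks the fresh one, and nothing in the
# score can correct for that.
assert cosine_similarity(query, old_fact) > cosine_similarity(query, new_fact)
```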
This creates three systematic failures for agent memory:
1. Temporal blindness
You update a fact — "server moved to port 8001". The old version ("server runs on port 8000") sits in the same vector store with equal weight. If the query "which port does the server use?" is semantically closer to the old phrasing, RAG returns the outdated fact. No way to prevent this without deleting old memories manually.
2. Frequency blindness
Your agent references a specific memory 50 times across conversations. That memory has zero scoring advantage over one never accessed. RAG cannot distinguish "this is something we keep coming back to" from "this was stored once and never touched."
3. Importance blindness
"ChromaDB uses cosine distance metric" and "the database stores things somehow" score similarly if the query is vague enough. RAG has no mechanism to prefer the specific, factual memory.
## The fix: four-signal scoring
NLM adds three signals on top of semantic similarity:
```
score = 0.5 × semantic_similarity   # is it relevant?
      + 0.2 × time_decay            # is it recent?
      + 0.2 × frequency_score       # is it often recalled?
      + 0.1 × importance_score      # is it specific/factual?
```
Time decay uses an exponential with a 90-day half-life:

```
time_score = exp(-ln(2) / 90 × days_since_last_access)
```
Fresh memory → 1.0. 90 days old → 0.5. 365 days → 0.06 (unless frequently accessed).
Frequency score is log-normalized:

```
freq_score = log(1 + count) / log(1 + 100)
```

This prevents one very popular memory from dominating: a memory accessed 10 times scores ≈0.52, and 100 times scores 1.0.
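Putting the pieces together, here is a sketch of the four-signal blend; the weights are the ones stated above, but the helper names are mine, not NLM's internals.

```python
import math

def time_decay(days, half_life=90):
    # 1.0 when fresh, 0.5 at the half-life, ~0.06 after a year.
    return math.exp(-math.log(2) / half_life * days)

def frequency_score(count, cap=100):
    # Log-normalized access count, saturating at `cap` accesses.
    return min(1.0, math.log(1 + count) / math.log(1 + cap))

def combined_score(semantic, days_since_access, access_count, importance):
    return (0.5 * semantic
            + 0.2 * time_decay(days_since_access)
            + 0.2 * frequency_score(access_count)
            + 0.1 * importance)

# A slightly weaker semantic match that is fresh and often recalled
# outranks a closer match that has gone stale and untouched.
stale = combined_score(0.95, days_since_access=365, access_count=0, importance=0.3)
fresh = combined_score(0.85, days_since_access=2, access_count=15, importance=0.6)
assert fresh > stale
```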
Importance is computed automatically, either by a CPU heuristic (a specificity score based on numbers, proper nouns, and text length) or, optionally, by a HuggingFace zero-shot classifier.
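A hypothetical version of such a CPU-only specificity heuristic might look like this; the exact features and weights here are my illustration, not NLM's actual implementation:

```python
import re

def importance_score(text):
    # Toy specificity heuristic: numbers, mid-sentence capitalized
    # tokens, and length all nudge the score up. Weights are arbitrary.
    score = 0.2
    if re.search(r"\d", text):
        score += 0.3                        # concrete numbers
    words = text.split()
    proper = sum(1 for w in words[1:] if w[:1].isupper())
    score += min(0.3, 0.1 * proper)         # likely proper nouns
    score += min(0.2, len(text) / 400)      # longer, detailed statements
    return min(1.0, score)

# The specific, factual memory scores higher than the vague one.
assert importance_score("ChromaDB uses cosine distance, default port 8000") > \
       importance_score("the database stores things somehow")
```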
## Benchmark results
100 memories (60 paired test memories plus 40 unrelated fillers), 30 queries, top-1 accuracy:
| Category | What's tested | RAG | NLM | Delta |
|---|---|---|---|---|
| Temporal (10 queries) | Old vs fresh fact, neutral query | 10% | 70% | +60% |
| Frequency (10 queries) | 15× accessed vs 0× | 80% | 100% | +20% |
| Importance (10 queries) | Specific fact vs vague memory | 60% | 90% | +30% |
| Overall (30 queries) | all categories | 50% | 87% | +37% |
The temporal result is the most telling — RAG gets 10% (basically random) because it has zero concept of recency. NLM gets 70%.
## Usage
```bash
pip install neural-long-memory
```
```python
from nlm import NLM

memory = NLM()

# Save — consolidation is automatic (similar memories get merged)
memory.save("The server was moved to port 8001")
memory.save("Hantes switched to JAX for training")

# Search — NLM handles all scoring automatically
results = memory.search("which port does the server use", top_k=3)
for r in results:
    print(f"[{r['score']:.3f}] {r['text']}")
# Returns the fresh fact, not the outdated one
```
Each result carries a full score breakdown:

```python
{
    "text": "The server was moved to port 8001",
    "score": 0.847,
    "semantic_score": 0.923,
    "time_score": 0.998,
    "frequency": 2,
    "importance": 0.610,
}
```
## Other features in v1.0.0
Memory consolidation — duplicate prevention is on by default. Similar memories get merged and strengthened instead of stored twice:

```python
id1 = memory.save("Hantes lives in Chernivtsi")
id2 = memory.save("Hantes is from Chernivtsi city")
assert id1 == id2  # same memory, importance boosted
```
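The merge-on-save idea can be sketched in a few lines; the similarity threshold and importance boost below are assumptions for illustration, not NLM's real consolidation logic:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(x * x for x in b)))

class TinyStore:
    # Toy save-time consolidation over raw embedding vectors.
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.memories = []  # list of (embedding, text, importance)

    def save(self, embedding, text):
        for i, (emb, _, imp) in enumerate(self.memories):
            if cosine(embedding, emb) >= self.threshold:
                # Near-duplicate: strengthen instead of storing twice.
                self.memories[i] = (emb, text, min(1.0, imp + 0.1))
                return i
        self.memories.append((embedding, text, 0.5))
        return len(self.memories) - 1

store = TinyStore()
id1 = store.save([1.0, 0.0], "Hantes lives in Chernivtsi")
id2 = store.save([0.98, 0.05], "Hantes is from Chernivtsi city")
assert id1 == id2  # merged, not duplicated
```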
Associative chains — bidirectional links between related memories:

```python
id1 = memory.save("Hantes loves Minelux family")
id2 = memory.save("Minelux are fire, directness, truth")

# Follow the chain
assoc = memory.get_associations(id1)
# [{"id": id2, "text": "Minelux are fire..."}]

# Expand search to follow links
results = memory.search("tell me about Hantes", expand_associations=True)
```
Smart forgetting — remove memories that are simultaneously old, rare, and unimportant:

```python
deleted = memory.forget_smart(days=180, max_frequency=2, max_importance=0.3)
```
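The key property is the conjunction: all three conditions must hold before anything is deleted. A hypothetical version of the predicate (my sketch of the behavior implied by the parameters above, not NLM's source):

```python
def should_forget(days_since_access, frequency, importance,
                  days=180, max_frequency=2, max_importance=0.3):
    # A memory is dropped only when it is old AND rarely
    # used AND unimportant; any one signal can save it.
    return (days_since_access > days
            and frequency <= max_frequency
            and importance <= max_importance)

assert should_forget(200, 1, 0.1)       # old, rare, unimportant -> gone
assert not should_forget(200, 50, 0.1)  # old but often recalled -> kept
assert not should_forget(10, 0, 0.1)    # recently accessed -> kept
```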
## Wrapping up
NLM is not a replacement for RAG — it's a reranking layer on top of ChromaDB that adds temporal, frequency, and importance signals. Drop-in for any agent that already uses vector search.
GitHub: github.com/pulseallstars/nlm
Benchmark script: benchmarks/benchmark_100.py
Apache 2.0.
Built this for Pulses — a project where AI personalities need memory that actually behaves like memory.