[Update] VAC: A Memory Layer That Makes LLMs Remember You

Introduction

What if your LLM could actually remember who you are — across sessions, projects, and time? Existing systems either rely entirely on input context (limited length) or suffer from issues like hallucinations and relevance loss.

The VAC Memory System is a unique Retrieval-Augmented Generation (RAG) architecture that provides persistent memory for LLMs.


The Problem and the Solution

LLMs only carry static, statistical "memory" baked into their parameters. VAC's goal is dynamic memory retrieval: surfacing accurate, user-specific facts at query time without modifying the model.

Key VAC advantages:

  1. MCA (Candidate Filtering): The Multi-Candidate Assessment addresses the false-positive problem of traditional vector search (such as FAISS), applying entity-level precision filtering before the expensive stages. A minimal usage sketch follows this list.
def calculate_query_coverage(query_keywords: set, memory_keywords: set) -> float:
    # Fraction of the query's keywords that also appear in a stored memory.
    if not query_keywords:
        return 0.0  # guard against division by zero on empty queries
    intersection = len(query_keywords & memory_keywords)
    return intersection / len(query_keywords)
  2. Physics-Inspired Ranking: By conceptualizing text documents as "planets" with "mass" and "gravity," VAC introduces a new ranking mechanism:
def calculate_force(query_mass, memory_mass, distance):
    # G (scaling factor) and DELTA (zero-distance guard) are tunable constants;
    # the values below are illustrative, not necessarily those used in VAC.
    G, DELTA = 1.0, 1e-6
    force = G * (query_mass * memory_mass) / (distance ** 2 + DELTA)
    return force
  3. Orchestration: VAC operates modularly, keeping LLM involvement to a minimum outside the answer-generation phase.
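
To make the MCA filter concrete, here is a small, hypothetical usage sketch: it reuses calculate_query_coverage() from point 1 and the coverage ≥ 0.1 threshold from step [3] of the pipeline below. The data is made up for illustration.

# Hypothetical example: keep only memories whose keyword coverage of the
# query reaches 0.1, before any embedding or reranking work happens.
query_keywords = {"meet", "alice"}
memories = [
    {"id": 1, "keywords": {"met", "alice", "cafe"}},
    {"id": 2, "keywords": {"project", "deadline"}},
]

candidates = [
    m for m in memories
    if calculate_query_coverage(query_keywords, m["keywords"]) >= 0.1
]
# Only memory 1 survives (coverage 0.5); memory 2 never reaches FAISS or the reranker.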

Why LLMs Need Real Memory

Even the best LLMs today:

  • cannot retain long-term context
  • cannot remember past conversations
  • cannot update their “understanding”
  • cannot store evolving user profiles
  • cannot track long projects or goals

They operate in a stateless bubble.
Everything outside a single prompt window is forgotten forever.
This is why RAG exploded in popularity — but RAG itself has major flaws:

  • Vector search retrieves semantically similar docs, not logically correct ones
  • Important memories get buried
  • Retrieval is non-deterministic
  • Noise increases with dataset growth
  • There is no notion of priority or recency

So I decided to build a memory architecture that fixes these issues.


System Architecture

The VAC Memory System pipeline consists of the following steps:

  1. MCA-PreFilter: Filtering candidates by entity coverage to reduce computational costs.
  2. Vector Processing with FAISS: Embedding and semantic search through 1024D vectors (BGE-Large).
  3. BM25 Search: Traditional lexical (keyword) matching that complements the semantic search.
  4. Cross-Encoder Reranking: Precision optimization for the top N candidates.

⚙️ Full Architecture (8 Steps):

Query: "Where did I meet Alice?"

[1] Query Classification (factual/temporal/conceptual)

[2] LLM Synonym Expansion (Qwen 14B via Ollama)
"alice" → ["alice", "alicia", "her"]
"meet" → ["meet", "met", "encountered", "ran into"]

[3] MCA-FIRST FILTER (coverage ≥ 0.1)
1000 memories → ~30 candidates

[4] FAISS (BGE-large, 1024D)
Adds semantic matches: "visited Alice", "saw her"
→ 100 candidates

[5] BM25 (Okapi with custom tokenization)
Catches keyword variations FAISS missed
→ 40 more candidates

[6] Union + Deduplication → ~120 unique

[7] Cross-Encoder Reranking (bge-reranker-v2-m3, 278M params)
120 → 15 best

[8] GPT-4o-mini (T=0.0, max_tokens=150)
→ Final answer
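
As a hedged sketch of steps [4]-[7] (not VAC's actual code): dense retrieval with a BGE embedding model plus FAISS, lexical retrieval with Okapi BM25, a deduplicated union, and a cross-encoder rerank. The exact checkpoints ("BAAI/bge-large-en-v1.5", "BAAI/bge-reranker-v2-m3") and the rank_bm25 / sentence-transformers libraries are assumptions; the post only states "BGE-large, 1024D", "Okapi with custom tokenization", and "bge-reranker-v2-m3".

import faiss
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer

query = "Where did I meet Alice?"
memories = [  # in VAC these would be the ~30 MCA-filtered candidates
    {"text": "I met Alice at the Berlin cafe in March."},
    {"text": "Finished the quarterly report for work."},
]
texts = [m["text"] for m in memories]

# [4] Dense retrieval: BGE-large embeddings (1024-dim) in an inner-product FAISS index.
encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
embs = encoder.encode(texts, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(embs.shape[1])
index.add(embs)
q = encoder.encode([query], normalize_embeddings=True).astype("float32")
_, faiss_ids = index.search(q, min(100, len(texts)))

# [5] Lexical retrieval: Okapi BM25 over simple whitespace tokens
# (the real system uses custom tokenization).
bm25 = BM25Okapi([t.lower().split() for t in texts])
bm25_ids = np.argsort(bm25.get_scores(query.lower().split()))[::-1][:40]

# [6] Union + deduplication of candidate ids from both retrievers.
candidate_ids = list(dict.fromkeys(list(faiss_ids[0]) + list(bm25_ids)))

# [7] Cross-encoder rerank: score each (query, memory) pair jointly, keep the top 15.
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")
scores = reranker.predict([(query, texts[i]) for i in candidate_ids])
top = sorted(zip(candidate_ids, scores), key=lambda p: p[1], reverse=True)[:15]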

Here is the ranking pipeline code:

def rank_memories(query, memories):
    # Extract the query's keywords once, then score every memory against them.
    query_keywords = extract_keywords_simple(query)
    # calculate_mass() returns a dict per memory that includes a 'force' score
    # (the gravity-style attraction between the query and that memory).
    scored_mem = [
        calculate_mass(mem, query_keywords)
        for mem in memories
    ]
    # Strongest attraction first.
    return sorted(scored_mem, key=lambda x: x['force'], reverse=True)
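
The post doesn't show calculate_mass() itself, so here is a purely hypothetical sketch of a helper with that name, built on the coverage and force functions above. The weighting and the distance formula are placeholders, not VAC's actual values.

def calculate_mass(memory, query_keywords):
    # Hypothetical: a memory's "mass" grows with how much of the query it covers.
    memory_keywords = set(memory["keywords"])
    coverage = calculate_query_coverage(query_keywords, memory_keywords)
    query_mass = len(query_keywords)
    memory_mass = 1.0 + coverage * len(memory_keywords)
    # "Distance" shrinks as coverage rises, so well-covered memories attract strongly.
    distance = 1.0 - coverage + 1e-3
    return {
        "memory": memory,
        "force": calculate_force(query_mass, memory_mass, distance),
    }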

📊 Head-to-Head Comparison

| Aspect | VAC Memory | Mem0 | Letta/MemGPT | Zep |
| --- | --- | --- | --- | --- |
| LoCoMo Accuracy | 80.1% | 66.9% | 74.0% | 75.1% |
| Architecture | MCA + FAISS + BM25 + Cross-Encoder | LLM extraction + Graph | OS-like paging + Archive search | Summarize + Vector |
| Entity Protection | ✅ MCA pre-filter | ❌ Semantic only | ❌ Semantic only | ❌ Semantic only |
| Latency | 2.5 sec/query | ~3-5 sec | ~2-4 sec | ~2-3 sec |
| Cost per 1M tokens | <$0.10 | ~$0.50+ | ~$0.30+ | ~$0.20+ |
| Reproducibility | 100% (seed-locked) | Variable | Variable | Variable |
| Conversation Isolation | 100% | Partial | Partial | Partial |

Results

  • VAC Memory: 80.1%
  • Zep: 75.14%
  • Mem0: 66.9%

Validated across:

  • 10 conversations × 10 seeds = 100 runs
  • 1,540 total questions
  • 4 question types: Single-hop (87%), Multi-hop (78%), Temporal (72%), Commonsense (87%).

Component Recall (ground truth coverage):

  • MCA alone: 40-50%
  • FAISS alone: 65-70%
  • BM25 alone: 50%
  • Union Recall (MCA + FAISS + BM25): 85-95%

Key insight: No single retrieval method is sufficient. The union catches what each individual method misses.
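
To make that concrete, here is a tiny, hypothetical recall calculation over made-up candidate id sets; it only illustrates why the union of retrievers covers more of the ground truth than any single method.

def recall(retrieved_ids, ground_truth_ids):
    # Fraction of ground-truth memories that appear among the retrieved candidates.
    return len(set(retrieved_ids) & set(ground_truth_ids)) / len(ground_truth_ids)

mca_ids, faiss_ids, bm25_ids = {1, 4}, {1, 2, 7}, {2, 9}
ground_truth = {1, 2, 3}

print(recall(mca_ids, ground_truth))                         # 0.33: MCA alone misses 2 and 3
print(recall(mca_ids | faiss_ids | bm25_ids, ground_truth))  # 0.67: the union also recovers 2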


Open Source

Experience it yourself:


🧪 Reproducibility

Every result is verifiable:

# Run with seed
SEED=2001 LOCOMO_CONV_INDEX=0 python orchestrator.py

# Same seed = same results
# 100 runs validated
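
The post doesn't show how orchestrator.py consumes the SEED variable, but seed-locking in Python typically looks like the sketch below; treat it as an assumption about the mechanism, not VAC's actual code.

import os
import random

import numpy as np

seed = int(os.environ.get("SEED", "2001"))  # read the seed set in the command above
random.seed(seed)
np.random.seed(seed)
# If any component samples via torch, torch.manual_seed(seed) would be pinned the same way.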

🙏 I'd love feedback from anyone building memory systems for LLMs or experimenting with the LoCoMo benchmark.

What do you think about combining MCA + BM25 + FAISS? Any ideas for further improvements?

Let’s connect!

Top comments (1)

Viktor Kuznetsov | VAC Memory System CEO

Which memory systems are the best in your opinion right now? What’s still missing in VAC? Any recommendations?