Vector search alone isn't memory. Real AI memory needs structured extraction, multi-strategy retrieval, and separation of concerns. Here's how I built it.
The Problem Nobody Talks About
You've built a RAG system. Congratulations! You can now retrieve documents based on semantic similarity. But here's the uncomfortable truth:
Vector search ≠ Memory
When a user asks "What did I order last week?", your RAG system needs to:
- Understand that "last week" is a temporal filter, not a search term
- Know that "orders" live in a specific logical bucket
- Recognize the intent is direct lookup, not semantic search
A simple "cosine_similarity(query_embedding, document_embeddings)" won't cut it.
The Three Paths
When building memory for AI agents, developers typically face three choices. I didn't like either of the first two.
1. The "Do It Yourself" RAG
The Promise: Full control over every component.
The Reality: Weeks of infrastructure work. You're building ingestion pipelines, vector stores, and retrieval logic from scratch. You end up maintaining glue code instead of building your product.
2. Black-box Memory APIs
The Promise: Quick start, "just works".
The Reality: Zero control over the schema. You dump text in, you get text out. You can't define structured fields or custom extraction logic.
3. The Middle Ground: memorymodel.dev
The Approach: You define the schema (the "Memory Nodes") and the intent; memorymodel.dev handles the infrastructure (embedding, storage, retrieval strategies).
Result: The flexibility of DIY with the speed of a managed service.
The Code Reality
DIY RAG Setup (simplified — real implementations are worse):
import json
import os

import openai
from langchain_openai import OpenAIEmbeddings
from pinecone import Pinecone, ServerlessSpec

# Vector store setup
pc = Pinecone(api_key=os.environ["PINECONE_KEY"])
pc.create_index(name="memories", dimension=1536,
                spec=ServerlessSpec(cloud="aws", region="us-east-1"))
index = pc.Index("memories")

# Embedding pipeline
embedder = OpenAIEmbeddings(model="text-embedding-3-small")

# Extraction logic (you write this)
def extract_fields(text):
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"Extract order_id, date, items as JSON from: {text}"}],
    )
    return json.loads(response.choices[0].message.content)

# Ingestion (you write this)
def ingest(text):
    fields = extract_fields(text)
    embedding = embedder.embed_query(text)
    index.upsert([(fields["order_id"], embedding, fields)])

# Retrieval with temporal logic (you write this)
def search(query):
    # TODO: Parse dates, detect intent, handle entity lookups...
    pass  # Good luck.
memorymodel.dev:
from memory_model import MemoryClient
memory = MemoryClient(api_key="your-key", cluster_id="your-cluster")
memory.add("Customer ordered 3 units of SKU-789 on 2025-01-15")
results = memory.search("orders from last week") # Just works.
The Architecture
Here's what's actually needed for production-grade AI memory. Let's break down the key components.
1. Memory Nodes: Beyond Flat Vector Stores
The core abstraction is the Memory Node - a logical classification with its own extraction schema.
Instead of dumping everything into one vector collection, you define nodes like:
UserProfile
├── extraction_prompt: "Extract user_id, plan, balance..."
├── embedding_template: "User {{user_id}} on {{plan}} plan"
└── fields: [user_id, plan, balance, activity]
OrderHistory
├── extraction_prompt: "Extract order_id, items, total..."
├── embedding_template: "Order {{order_id}}: {{items}}"
└── fields: [order_id, items, total, date]
Why this matters:
- Each node has its own LLM extraction prompt - structured data, not raw text
- The embedding template controls what gets vectorized
- Retrieval can target specific nodes or let the system decide
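In code-ish terms, a node definition boils down to something like this. It's an illustrative Python dict mirroring the OrderHistory tree above, not the actual console or SDK format.
order_history_node = {
    "name": "OrderHistory",
    "extraction_prompt": "Extract order_id, items, total and date from the text.",
    "embedding_template": "Order {{order_id}}: {{items}}",
    "fields": ["order_id", "items", "total", "date"],
}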
2. The Relevance Router: Intent-Aware Retrieval
When a query comes in, we don't just embed and search. First, the Relevance Router determines which Memory Nodes are semantically relevant.
The router returns relevance scores (0-1) for each available node. Only high-scoring nodes are queried, reducing noise and improving precision.
Key insight: This is zero-shot routing - no training required, just node names that semantically describe their content.
Performance note: The router uses a fast model (Gemini Flash) with aggressive in-memory caching. Repeated or similar queries hit the cache, keeping latency low.
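Here's a minimal sketch of that idea. It's my own illustration, not the production router: it assumes an OpenAI-compatible client as a stand-in for the fast model, and a hand-rolled scoring prompt.
import json
from functools import lru_cache
from openai import OpenAI

client = OpenAI()
NODE_NAMES = ["UserProfile", "OrderHistory", "AppKnowledge"]

@lru_cache(maxsize=1024)  # aggressive in-memory caching of repeated queries
def route(query: str) -> dict:
    prompt = (
        "Score each memory node from 0 to 1 for relevance to the query.\n"
        f"Nodes: {NODE_NAMES}\n"
        f"Query: {query}\n"
        'Return JSON, e.g. {"UserProfile": 0.1, "OrderHistory": 0.9}.'
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for a fast model such as Gemini Flash
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    scores = json.loads(response.choices[0].message.content)
    # Only high-scoring nodes get queried downstream
    return {node: score for node, score in scores.items() if score >= 0.5}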
3. Multi-Strategy Retrieval
Here's where it gets interesting. MemoryModel doesn't rely on a single retrieval method. The system detects query intent and selects the appropriate strategy:
Semantic Intent: "Tell me about the refund policy"
Uses standard Vector similarity search.
Direct Lookup: "Show order #12345"
Uses Exact match on the order_id field.
Temporal Query: "What happened last week?"
Uses Date range filters combined with vector search.
Entity Anchor: "Everything about Company X"
Uses Entity filtering + expansion.
Visual Search: [Image input]
Uses Multimodal embedding search.
The Resonator orchestrates these strategies.
Intent Detection Examples
// "Show me order ORD-12345" → Direct Lookup
detectIntent("Show me order ORD-12345")
// → { type: 'direct', key: 'order_id', value: 'ORD-12345' }
// "What happened before January 15th?" → Temporal
detectIntent("What happened before January 15th?")
// → { type: 'temporal', range: { lte: '2025-01-15' } }
// "Tell me about Acme Corp" → Entity Anchor
detectIntent("Tell me about Acme Corp")
// → { type: 'anchor', entity: 'Acme Corp' }
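Once an intent is detected, strategy selection is essentially a dispatch table. Below is a rough Python equivalent of the examples above, purely illustrative and not the actual Resonator code; it assumes a hypothetical node handle exposing filter() and vector_search() helpers.
def retrieve(node, query, intent):
    # 'node' is a hypothetical handle exposing filter() / vector_search() helpers
    if intent["type"] == "direct":
        # Exact match on a structured field, no embeddings involved
        return node.filter(**{intent["key"]: intent["value"]})
    if intent["type"] == "temporal":
        # Date-range filter first, vector similarity to rank inside the window
        return node.vector_search(query, date_range=intent["range"])
    if intent["type"] == "anchor":
        # Entity filter, then expansion to memories linked to that entity
        return node.filter(entity_anchors=intent["entity"])
    # Default: plain semantic similarity
    return node.vector_search(query)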
4. Automatic Field Injection
When you ingest data, the Extraction Engine doesn't just run your custom prompt. It automatically injects system fields that power advanced retrieval:
- "entity_anchors[]" — Business entities, structural references
- "happened_at" — Temporal context (resolved from relative dates)
- "context_ref_id" — Links to parent documents
You define what you need. Memorymodel.dev adds what retrieval needs.
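For the order sentence from earlier, the stored record might end up looking roughly like this. It's an illustrative Python dict; only the first two fields come from your own schema.
record = {
    # Fields you defined in the OrderHistory extraction schema
    "items": "3 units of SKU-789",
    "date": "2025-01-15",
    # System fields injected automatically by the Extraction Engine
    "entity_anchors": ["SKU-789"],
    "happened_at": "2025-01-15T00:00:00Z",
    "context_ref_id": None,  # would link to a parent document if there were one
}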
5. The Developer Experience
Console: Visual Schema Design
Configure your memory architecture visually:
- Create Projects and Clusters (logical environments)
- Define Memory Nodes with extraction schemas
- Configure which nodes are active for ingestion vs retrieval
- Monitor memory usage and analytics
SDK: 4 Lines to Production
import { MemoryClient } from 'memory-model';
const memory = new MemoryClient({
  apiKey: 'your-key',
  clusterId: 'your-cluster'
});
// Ingest
await memory.add("Customer ordered 3 units of SKU-789 on 2025-01-15");
// Retrieve
const results = await memory.search("recent orders");
Python SDK also available:
from memory_model import MemoryClient
memory = MemoryClient(api_key="your-key", cluster_id="your-cluster")
memory.add("Customer ordered 3 units of SKU-789")
results = memory.search("recent orders")
Real-World Pattern: Asymmetric Clusters
Here's a powerful pattern I've seen in production: the same Memory Node can behave differently across clusters.
Use Case: Customer Care + Sales Intelligence
What's happening:
- The Customer Care agent ingests conversations into all three nodes
- It only retrieves from AppKnowledge and UserProfile to answer questions
- The SalesInsight node is ingestion-only in this cluster
- A separate Sales Ops cluster has SalesInsight as extraction-only
Result: Customer support automatically generates sales leads without any additional code. The sales team sees a real-time dashboard of opportunities with structured fields:
{
  "target_user_id": "user001",
  "implied_need": "High withdrawal limits for travel",
  "life_event_trigger": "Traveling to Japan",
  "suggested_product": "Metal Plan",
  "conversion_probability": "High"
}
What Runs Behind the Scenes
This is where memorymodel.dev stops being "just a managed RAG" and becomes an autonomous memory system.
The Architect: Self-Tuning Retrieval
Every 24 hours, the Architect analyzes your retrieval logs and automatically adjusts system parameters using Control Theory principles (PID-like dampening):
- "meta_threshold" — How aggressive to filter low-relevance results
- "cluster_margin" — Tolerance for centroid-based matching
- "top_k" limits — How many results to return per strategy
It adapts based on your actual usage patterns. High-precision workload? It tightens thresholds. Recall-heavy queries? It loosens them. You don't configure this. It learns.
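For intuition, here is a toy version of that feedback loop. It's my simplification of "PID-like dampening" (a proportional term with a damping factor), not the actual Architect, and the parameter names are assumptions.
def adjust_threshold(meta_threshold, observed_precision,
                     target_precision=0.8, kp=0.1, damping=0.5):
    # Proportional term: how far yesterday's precision was from the target
    error = target_precision - observed_precision
    # Dampened update so one noisy day of logs can't swing the threshold wildly
    new_threshold = meta_threshold + damping * kp * error
    # Clamp to a sane range
    return min(max(new_threshold, 0.1), 0.95)

# Precision came in below target, so filtering gets slightly more aggressive
print(adjust_threshold(0.6, observed_precision=0.7))  # -> 0.605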
The Dreamer: Meta-Insight Synthesis
The Dreamer periodically scans recent memories and generates higher-order insights that weren't explicitly stated:
Input memories:
- "Bought organic vegetables at Whole Foods"
- "Searched for plant-based protein recipes"
- "Cancelled steakhouse reservation"
Generated insight:
→ "User is likely transitioning to a vegetarian lifestyle"
(confidence: 0.85)
These synthesized insights become searchable memories themselves, enabling queries like "What are my behavioral patterns?" to return meaningful results.
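Conceptually, the Dreamer is "LLM over a window of recent memories, output re-ingested as a memory." A hand-wavy sketch, not the real implementation, assuming the OpenAI Python client:
from openai import OpenAI

client = OpenAI()

def dream(recent_memories: list[str]) -> str:
    prompt = (
        "From these recent memories, state one higher-order insight that is "
        "implied but never said outright, and give a confidence between 0 and 1:\n- "
        + "\n- ".join(recent_memories)
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    insight = response.choices[0].message.content
    # memory.add(insight)  # the synthesized insight becomes a searchable memory itself
    return insight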
Maintenance Workers
Deduplication: Detects semantically similar memories and merges them to avoid redundancy.
Consolidation: Compacts old memories into summaries to prevent database bloat.
Cleanup: Prunes stale or low-confidence data to maintain high quality.
Centroid Calculation: Pre-computes cluster centroids to enable ultra-fast retrieval.
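The centroid trick in particular is simple to picture: average the stored embeddings per node once, then compare incoming queries against one vector per node instead of every memory. A sketch with numpy, not the actual worker:
import numpy as np

def compute_centroids(embeddings_by_node):
    # One pre-computed mean vector per Memory Node
    return {node: vectors.mean(axis=0) for node, vectors in embeddings_by_node.items()}

def closest_node(query_vec, centroids):
    # Compare the query against a handful of centroids instead of every stored memory
    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(centroids, key=lambda node: cosine(query_vec, centroids[node]))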
The result: A memory system that doesn't just store — it evolves.
When to Use This
✅ AI Agents that need persistent, structured memory across conversations
✅ Document Intelligence — contracts, manuals, knowledge bases
✅ Customer Support Bots with user context and history
✅ Personal Assistants that remember preferences and events
❌ Simple Q&A over static documents (vanilla RAG is fine)
❌ Real-time streaming use cases (we're optimized for persistence)
💡 The memorymodel.dev approach: I optimize for retrieval quality and developer experience over raw throughput. If you need sub-100ms ingestion for IoT streams, use a time-series DB. If you need AI agents that actually remember context correctly, I've got you.
📊 See memorymodel's LoCoMo benchmark results for a detailed accuracy comparison with other memory systems.
Getting Started
- Sign up at memorymodel.dev
- Create a Project and Cluster via the Console
- Define your first Memory Node
- Install the SDK: "npm install memory-model" or "pip install memory-model"
- Start ingesting and retrieving
Conclusion
Building AI memory isn't about finding the nearest vector. It's about:
- Structured extraction — LLM-powered field parsing
- Intent-aware retrieval — knowing how to search, not just what to search
- Schema control — you define the shape of your memories
- Managed complexity — workers, deduplication, and optimization handled for you
I built memorymodel.dev because I needed this myself. Now it's available for everyone.