
varun pratap Bhardwaj

Posted on • Originally published at superlocalmemory.com

Building a Universal Memory Layer for AI Agents: Architecture and Patterns

AI agents without memory are stateless functions. They process a prompt, return a response, and forget everything. That works for one-shot tasks, but it fails the moment an agent needs to recall a previous conversation, learn from a mistake, or coordinate with another agent. The moment you need continuity — across sessions, across tools, across a team of agents — you need a memory layer.

This post teaches you how to build one. Not a toy demo, but an architecture you can reason about and extend. We will cover how memory gets stored, how it gets retrieved (and why naive vector search is not enough), and how multiple agents can share and trust each other's memories. Every pattern comes with runnable code.

What You Will Learn

  • The three types of agent memory (episodic, semantic, procedural) and when to use each
  • How to implement hybrid retrieval combining semantic search and BM25 keyword search
  • How Reciprocal Rank Fusion merges results from different retrieval strategies
  • State management patterns for multi-agent systems with shared memory
  • Trust scoring: how agents decide which memories to rely on
  • Practical Python code for each component

Conceptual Foundation: What Is Agent Memory?

Human memory is not a single system. You have short-term working memory (what you are thinking about right now), long-term episodic memory (what happened at lunch yesterday), and procedural memory (how to ride a bike). Agent memory works the same way — different types serve different purposes.

Three Types of Agent Memory

Episodic memory stores specific interactions. "The user asked me to summarize Q3 revenue on Tuesday. I retrieved data from the finance API and the user corrected my interpretation of 'net revenue'." This is the agent's autobiography — a timestamped log of what happened, what it did, and what feedback it received.

Semantic memory stores facts and knowledge extracted from interactions. "Net revenue for this company means revenue after returns and discounts, not after all expenses." These are distilled truths the agent can reference without replaying entire episodes.

Procedural memory stores learned behaviors and strategies. "When the user asks about revenue, always clarify whether they mean gross or net before querying." These are the agent's acquired skills.

Most memory layer implementations today conflate these three types into a single vector store. That works at small scale, but it creates retrieval noise as the memory grows. A well-designed memory layer distinguishes between them.

graph TD
    A[Agent Interaction] --> B{Memory Classification}
    B -->|Raw interaction log| C[Episodic Memory Store]
    B -->|Extracted facts| D[Semantic Memory Store]
    B -->|Learned strategies| E[Procedural Memory Store]

    C --> F[Embedding + BM25 Index]
    D --> F
    E --> F

    F --> G[Hybrid Retrieval Engine]
    G --> H[Reciprocal Rank Fusion]
    H --> I[Ranked Memory Results]
    I --> J[Agent Context Window]
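To make the classification step concrete, here is a minimal routing sketch. The per-type stores and the route_memory helper are my own illustration (plain dicts rather than real databases), keyed on the same memory_type values used throughout this post:

```python
# Illustrative only: route each incoming record to a per-type store.
# In a real system these would be separate tables or collections.
stores: dict[str, list[dict]] = {"episodic": [], "semantic": [], "procedural": []}

def route_memory(record: dict) -> str:
    """Append the record to the store matching its memory_type."""
    mtype = record["memory_type"]
    if mtype not in stores:
        raise ValueError(f"Unknown memory type: {mtype}")
    stores[mtype].append(record)
    return mtype

route_memory({"memory_type": "semantic", "content": "Net revenue excludes returns."})
route_memory({"memory_type": "episodic", "content": "User corrected my Q3 summary."})
```

Keeping the stores separate means an episodic query never has to wade through procedural strategies, which is exactly the retrieval-noise problem the diagram addresses.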

How It Works: The Memory Write/Read Pipeline

Every memory layer has two fundamental operations: writing memories and reading them back. The design of each operation determines the quality of everything downstream; a weak ingestion pipeline cannot be rescued by clever retrieval.

Writing Memories: Ingestion Pipeline

When an agent completes an interaction, the memory layer needs to process and store that interaction. This is not as simple as appending text to a database.

1. Capture the Raw Interaction

The agent sends the full interaction context — the user's input, the agent's reasoning trace, tool calls made, the final response, and any feedback received. Store this as structured data, not a flat string.

from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class MemoryRecord:
    content: str
    memory_type: str  # "episodic", "semantic", "procedural"
    agent_id: str
    session_id: str
    timestamp: datetime = field(default_factory=datetime.utcnow)
    metadata: dict = field(default_factory=dict)
    trust_score: float = 1.0  # 0.0 to 1.0
    source_agent: Optional[str] = None
    embedding: Optional[list[float]] = None

2. Generate Embeddings

Convert the text content into a dense vector representation. This enables semantic search — finding memories that are conceptually similar even when they use different words.

import numpy as np
from openai import OpenAI

client = OpenAI()

def generate_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    """Generate a dense vector embedding for the given text."""
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

# Example usage
record = MemoryRecord(
    content="User prefers revenue figures in EUR, not USD.",
    memory_type="semantic",
    agent_id="finance-agent-01",
    session_id="sess-abc-123",
    metadata={"domain": "finance", "confidence": 0.95}
)
record.embedding = generate_embedding(record.content)
# Embedding is a 1536-dimensional float vector for text-embedding-3-small
print(f"Embedding dimensions: {len(record.embedding)}")
# Output: Embedding dimensions: 1536

3. Build the BM25 Index

Embeddings capture meaning, but they can miss exact keyword matches. BM25 (Best Matching 25) is a term-frequency ranking function that excels at finding documents containing specific terms. You need both.

import math
from collections import Counter

class BM25Index:
    """Simple BM25 index for keyword-based retrieval."""

    def __init__(self, k1: float = 1.5, b: float = 0.75):
        self.k1 = k1
        self.b = b
        self.docs: list[dict] = []      # [{id, tokens}]
        self.avg_dl: float = 0.0        # average document length
        self.doc_freqs: dict = {}        # term -> number of docs containing term
        self.n_docs: int = 0

    def add_document(self, doc_id: str, text: str):
        tokens = text.lower().split()
        self.docs.append({"id": doc_id, "tokens": tokens})
        self.n_docs += 1

        # Update document frequencies
        unique_terms = set(tokens)
        for term in unique_terms:
            self.doc_freqs[term] = self.doc_freqs.get(term, 0) + 1

        # Recalculate average document length
        self.avg_dl = sum(len(d["tokens"]) for d in self.docs) / self.n_docs

    def score(self, query: str) -> list[tuple[str, float]]:
        """Return (doc_id, score) pairs sorted by BM25 relevance."""
        query_tokens = query.lower().split()
        scores = []

        for doc in self.docs:
            doc_score = 0.0
            doc_len = len(doc["tokens"])
            term_counts = Counter(doc["tokens"])

            for term in query_tokens:
                if term not in self.doc_freqs:
                    continue

                df = self.doc_freqs[term]
                # Inverse document frequency
                idf = math.log((self.n_docs - df + 0.5) / (df + 0.5) + 1.0)

                tf = term_counts.get(term, 0)
                # BM25 term frequency normalization
                numerator = tf * (self.k1 + 1)
                denominator = tf + self.k1 * (1 - self.b + self.b * doc_len / self.avg_dl)

                doc_score += idf * (numerator / denominator)

            scores.append((doc["id"], doc_score))

        return sorted(scores, key=lambda x: x[1], reverse=True)

4. Persist to Storage

Store the record with its embedding in a vector database and index the text in BM25. In production, you would use something like PostgreSQL with pgvector, Qdrant, Weaviate, or ChromaDB for vector storage.

import json

class MemoryStore:
    """Unified memory store with both vector and BM25 indexing."""

    def __init__(self):
        self.records: dict[str, MemoryRecord] = {}
        self.bm25_index = BM25Index()
        self.embeddings: dict[str, list[float]] = {}  # doc_id -> embedding

    def write(self, record: MemoryRecord) -> str:
        doc_id = f"{record.agent_id}:{record.timestamp.isoformat()}"

        # Store the record
        self.records[doc_id] = record

        # Index for BM25 keyword search
        self.bm25_index.add_document(doc_id, record.content)

        # Store embedding for vector search
        if record.embedding is None:
            record.embedding = generate_embedding(record.content)
        self.embeddings[doc_id] = record.embedding

        return doc_id

Reading Memories: Hybrid Retrieval

Here is where most implementations get it wrong. They use only vector search. Pure vector search retrieves memories that are semantically similar, but it can miss results that contain exact terms the query specifies. Pure BM25 finds keyword matches but misses conceptually related memories.

The solution is hybrid retrieval: run both searches, then merge the results.

Why Pure Vector Search Is Not Enough

Suppose your agent stores the memory: "Customer account ID is CX-7742-B." A later query for "CX-7742-B" will likely fail with pure semantic search because the embedding of an alphanumeric ID carries almost no semantic meaning. BM25 handles this trivially because it matches the exact token. Always combine both retrieval strategies.
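You can verify the failure mode with a deliberately naive experiment. The token_match helper below is not a real retriever, just an exact-term filter, but it shows why keyword matching handles alphanumeric IDs trivially where an embedding has almost nothing to latch onto:

```python
import re

memories = [
    "Customer account ID is CX-7742-B.",
    "The customer prefers email over phone contact.",
]

def token_match(query: str, docs: list[str]) -> list[str]:
    """Return docs containing the query as an exact token (punctuation-insensitive)."""
    q = query.lower()
    return [d for d in docs if q in re.findall(r"[\w\-]+", d.lower())]

print(token_match("CX-7742-B", memories))
# Only the first memory matches: the ID is found as an exact token.
# A BM25 index scores it the same way, while the embedding of "CX-7742-B"
# alone gives a semantic retriever almost no signal to work with.
```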

Reciprocal Rank Fusion (RRF)

Reciprocal Rank Fusion is a simple, effective algorithm for merging ranked lists from different retrieval methods. For each document, its RRF score is calculated as:

RRF(d) = sum over all rankers of 1 / (k + rank(d))

Where k is a constant (typically 60) that dampens the influence of high-ranking outliers.

def reciprocal_rank_fusion(
    ranked_lists: list[list[tuple[str, float]]],
    k: int = 60,
    top_n: int = 10
) -> list[tuple[str, float]]:
    """
    Merge multiple ranked result lists using Reciprocal Rank Fusion.

    Args:
        ranked_lists: List of [(doc_id, score)] lists, each sorted by relevance
        k: Damping constant (default 60, from the original Cormack et al. paper)
        top_n: Number of results to return

    Returns:
        Merged [(doc_id, rrf_score)] list sorted by fused score
    """
    rrf_scores: dict[str, float] = {}

    for ranked_list in ranked_lists:
        for rank, (doc_id, _original_score) in enumerate(ranked_list, start=1):
            if doc_id not in rrf_scores:
                rrf_scores[doc_id] = 0.0
            rrf_scores[doc_id] += 1.0 / (k + rank)

    # Sort by fused score descending
    fused = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    return fused[:top_n]

Now let's wire it all together in the MemoryStore:

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a_arr, b_arr = np.array(a), np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

class MemoryStore:
    # ... (previous methods remain)

    def search_vector(self, query_embedding: list[float], top_n: int = 20) -> list[tuple[str, float]]:
        """Brute-force cosine similarity search. Replace with ANN in production."""
        scores = []
        for doc_id, emb in self.embeddings.items():
            sim = cosine_similarity(query_embedding, emb)
            scores.append((doc_id, sim))
        return sorted(scores, key=lambda x: x[1], reverse=True)[:top_n]

    def search_hybrid(self, query: str, top_n: int = 10) -> list[tuple[str, float]]:
        """Run both BM25 and vector search, fuse with RRF."""
        query_embedding = generate_embedding(query)

        bm25_results = self.bm25_index.score(query)
        vector_results = self.search_vector(query_embedding, top_n=20)

        fused = reciprocal_rank_fusion(
            [bm25_results, vector_results],
            k=60,
            top_n=top_n
        )
        return fused

| Feature | BM25 (Keyword) | Vector Search (Semantic) | Hybrid (RRF) |
| --- | --- | --- | --- |
| Exact term matching | Excellent | Poor | Excellent |
| Semantic similarity | None | Excellent | Excellent |
| Handles typos | Poor | Moderate | Moderate |
| Alphanumeric IDs | Excellent | Very poor | Excellent |
| Latency (10K docs) | ~1ms | ~5ms (brute force) / ~1ms (ANN) | ~6ms combined |
| Index storage | Inverted index (small) | Vectors (~6KB per doc at 1536d) | Both |

Multi-Agent State Management

When multiple agents share a memory layer, two new problems emerge: state consistency and trust.

State Consistency

If Agent A writes a memory and Agent B reads it simultaneously, you need to decide on a consistency model. For most agent workloads, eventual consistency is fine — agents are not database transactions. But you need a clear ownership model.

@dataclass
class AgentMemoryNamespace:
    """Each agent gets its own namespace. Shared memories are explicitly published."""
    agent_id: str
    private_store: MemoryStore
    shared_store: MemoryStore  # reference to the shared memory pool

    def remember(self, content: str, memory_type: str, share: bool = False):
        record = MemoryRecord(
            content=content,
            memory_type=memory_type,
            agent_id=self.agent_id,
            session_id="current",
            source_agent=self.agent_id,
        )
        # Always write to private store
        self.private_store.write(record)

        # Optionally publish to shared store
        if share:
            self.shared_store.write(record)

    def recall(self, query: str, include_shared: bool = True, top_n: int = 5):
        """Search private memories, optionally include shared pool."""
        private_results = self.private_store.search_hybrid(query, top_n=top_n)

        if not include_shared:
            return private_results

        shared_results = self.shared_store.search_hybrid(query, top_n=top_n)

        # Fuse private and shared results, giving private a slight boost
        # by prepending them in the ranked list
        return reciprocal_rank_fusion(
            [private_results, shared_results],
            k=60,
            top_n=top_n
        )

Trust Scoring

Not all memories deserve equal weight. A memory written by a well-tested agent with human-confirmed feedback is more trustworthy than one written by an experimental agent's first run. Trust scoring assigns a confidence weight to each memory.

def compute_trust_score(record: MemoryRecord, agent_registry: dict) -> float:
    """
    Compute a trust score for a memory record based on:
    - Source agent's historical accuracy
    - Recency of the memory
    - Whether a human confirmed it
    - Number of times other agents corroborated it
    """
    base_score = agent_registry.get(record.agent_id, {}).get("accuracy", 0.5)

    # Recency decay: trust falls linearly with age, bottoming out at 0.5
    age_days = (datetime.utcnow() - record.timestamp).days
    recency_factor = max(0.5, 1.0 - (age_days / 365))  # reaches the 0.5 floor after ~6 months

    # Human confirmation boost
    human_confirmed = record.metadata.get("human_confirmed", False)
    confirmation_boost = 1.3 if human_confirmed else 1.0

    # Corroboration: how many other agents wrote similar memories
    corroboration_count = record.metadata.get("corroboration_count", 0)
    corroboration_factor = min(1.5, 1.0 + corroboration_count * 0.1)

    score = base_score * recency_factor * confirmation_boost * corroboration_factor
    return min(1.0, score)  # Cap at 1.0

Why Trust Scoring Matters for Multi-Agent Systems

In a system with 10+ agents, one misconfigured agent can pollute the shared memory pool with incorrect facts. Trust scoring acts as an immune system — it lets the memory layer deprioritize memories from unreliable sources without deleting them outright. This is especially important in domains like finance or healthcare where incorrect context leads to real harm.
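One way to act on these scores, sketched here under my own naming (rerank_with_trust is not from any particular library), is to weight each fused retrieval score by its memory's trust score before final ranking:

```python
# Illustrative sketch: combine a retrieval score (e.g. RRF) with trust.
def rerank_with_trust(
    fused_results: list[tuple[str, float]],
    trust_scores: dict[str, float],
    default_trust: float = 0.5,
) -> list[tuple[str, float]]:
    """Multiply each retrieval score by the memory's trust score and re-sort."""
    weighted = [
        (doc_id, score * trust_scores.get(doc_id, default_trust))
        for doc_id, score in fused_results
    ]
    return sorted(weighted, key=lambda x: x[1], reverse=True)

# A low-trust memory drops below a slightly lower-ranked but trusted one
results = [("mem-a", 0.032), ("mem-b", 0.030)]
trust = {"mem-a": 0.4, "mem-b": 0.9}
print(rerank_with_trust(results, trust))
```

A memory from a flaky source can still surface when nothing better exists, but it no longer outranks a corroborated fact, which is exactly the deprioritize-without-deleting behavior described above.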

Seeing This in Practice

The patterns described above — hybrid search combining vector and BM25 retrieval, Reciprocal Rank Fusion for result merging, and trust-scored multi-agent memory — are the building blocks of production agent memory systems. Memonto's agent memory system implements these patterns as a working reference. Their architecture exposes memory namespaces per agent, provides hybrid retrieval out of the box, and includes trust scoring for cross-agent memory sharing.

If you want to see how the ingestion-to-retrieval pipeline looks in a deployed system rather than in isolated code snippets, the Memonto GitHub repository is worth studying as a reference implementation of these concepts.

# Conceptual usage pattern (based on the architecture described above)
# This illustrates how a production memory layer surfaces relevant context

agent_memory = AgentMemoryNamespace(
    agent_id="research-agent",
    private_store=MemoryStore(),
    shared_store=shared_memory_pool  # Shared across all agents
)

# Agent stores a learned fact
agent_memory.remember(
    content="The client's fiscal year ends in March, not December.",
    memory_type="semantic",
    share=True  # Make available to other agents
)

# Later, any agent can retrieve it
results = agent_memory.recall(
    query="When does the client's fiscal year end?",
    include_shared=True
)
# Returns the stored fact ranked by hybrid retrieval score

Real-World Considerations

When NOT to Build a Memory Layer

Not every agent system needs persistent memory. If your agent handles fully self-contained tasks (e.g., "convert this CSV to JSON"), adding a memory layer adds complexity without benefit. Memory layers pay off when:

  • Agents interact with the same users or datasets repeatedly
  • Multiple agents need to coordinate and share context
  • The cost of re-deriving information is high (expensive API calls, slow computations)

Failure Modes to Watch For

Memory bloat. Without a pruning strategy, your memory store grows indefinitely. Implement TTLs (time-to-live) for episodic memories and periodic deduplication for semantic memories.

Stale memories. Facts change. The client's preferred currency might switch from EUR to GBP. Your memory layer needs an update-or-supersede mechanism, not just append.
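One possible update-or-supersede mechanism, shown as a sketch with assumed field names (superseded_by is my own convention, not a standard), keeps the old fact for audit while letting retrieval filter to current facts:

```python
def supersede(store: dict[str, dict], old_id: str, new_id: str) -> None:
    """Mark an old fact as replaced by a newer one instead of deleting it."""
    store[old_id]["superseded_by"] = new_id

def active_facts(store: dict[str, dict]) -> dict[str, dict]:
    """Return only facts that have not been superseded."""
    return {k: v for k, v in store.items() if v.get("superseded_by") is None}

store = {
    "m1": {"content": "Preferred currency is EUR", "superseded_by": None},
    "m2": {"content": "Preferred currency is GBP", "superseded_by": None},
}
supersede(store, "m1", "m2")
```

The superseded record stays searchable for debugging ("why did the agent believe EUR in March?") without polluting routine retrieval.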

Context window overflow. Retrieving 50 relevant memories and stuffing them all into the agent's prompt defeats the purpose. Rank aggressively and inject only the top 3-5 most relevant memories.
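The injection step can be budgeted explicitly. This sketch uses a crude four-characters-per-token estimate (swap in a real tokenizer in practice) and assumes the memories arrive already ranked best-first:

```python
def select_for_context(
    memories: list[str],
    token_budget: int = 500,
    max_items: int = 5,
) -> list[str]:
    """Take ranked memories until the token budget or item cap is hit."""
    selected: list[str] = []
    used = 0
    for text in memories:  # assumed sorted by relevance, best first
        est_tokens = max(1, len(text) // 4)  # rough heuristic, not a tokenizer
        if used + est_tokens > token_budget or len(selected) >= max_items:
            break
        selected.append(text)
        used += est_tokens
    return selected
```

Because the list is consumed in rank order, a tight budget naturally keeps only the top few memories, which is the behavior you want.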

Security: Memory Injection Attacks

If an agent's memory can be written to by external input (e.g., user messages get stored as semantic memories), an adversary can inject false facts: "The admin password is 'letmein'." Always sanitize memories before storage, and never store user input as trusted procedural memory without explicit human review.
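A first line of defense could be a deny-pattern check before any user-derived text is promoted to stored memory. The patterns below are purely illustrative; regexes catch only the clumsiest attacks, so treat this as a filter in front of human review, not a replacement for it:

```python
import re

# Assumed patterns: credentials and instruction-like phrasing. Extend per domain.
SUSPICIOUS_PATTERNS = [
    r"(?i)\bpassword\b",
    r"(?i)\bapi[ _-]?key\b",
    r"(?i)\bignore (all )?(previous|prior) instructions\b",
    r"(?i)\balways (respond|reply|answer)\b",
]

def is_suspicious(text: str) -> bool:
    """Flag text that looks like a credential leak or an injected instruction."""
    return any(re.search(p, text) for p in SUSPICIOUS_PATTERNS)
```

Flagged candidates should be quarantined for review rather than silently dropped, so legitimate memories that trip a pattern are not lost.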

Choosing Your Storage Backend

| Backend | Vector Search | BM25/Keyword | Multi-tenancy | Operational Complexity |
| --- | --- | --- | --- | --- |
| PostgreSQL + pgvector | Good (IVFFlat, HNSW) | Via tsvector | Native (schemas/RLS) | Low (if you already run Postgres) |
| Qdrant | Excellent (HNSW) | Built-in sparse vectors | Collection-level | Medium (separate service) |
| Weaviate | Excellent | Built-in BM25 | Native | Medium |
| ChromaDB | Good (HNSW) | Limited | Limited | Low (embedded mode) |
| SQLite + sqlite-vec | Adequate for <100K docs | Via FTS5 | Manual | Very low |

For prototyping, start with ChromaDB or SQLite. For production multi-agent systems, PostgreSQL with pgvector gives you the best balance of features and operational simplicity — you get vector search, full-text search via tsvector, row-level security for multi-agent namespaces, and ACID transactions, all in a database you probably already run.

Key Takeaways

  • Agent memory is not one thing. Distinguish episodic (what happened), semantic (what is true), and procedural (what to do) memory for cleaner retrieval.
  • Always use hybrid retrieval. Combine BM25 keyword search with vector semantic search. Pure vector search will miss exact matches; pure keyword search will miss conceptual connections.
  • Reciprocal Rank Fusion is your friend. It is simple to implement, its only tuning knob is the damping constant k (the standard k=60 works well in practice), and it consistently outperforms single-method retrieval.
  • Multi-agent memory needs namespaces and trust. Give each agent its own private store, use a shared pool for cross-agent knowledge, and weight memories by source reliability.
  • Start simple, add complexity when retrieval quality degrades. A PostgreSQL database with pgvector and tsvector covers most production use cases without introducing additional infrastructure.
