DEV Community

Varun Pratap Bhardwaj

Posted on • Originally published at superlocalmemory.com

Building a Universal Memory Layer for AI Agents: Architecture Patterns and Implementation

AI agents are stateless by default. Every time you invoke an LLM, it has no recollection of what it did five minutes ago unless you explicitly provide that context. This is fine for single-turn interactions, but it falls apart the moment you need agents that learn from past tasks, coordinate with other agents, or accumulate knowledge over time. The problem compounds in multi-agent systems where Agent A's discoveries need to be accessible to Agent B without piping everything through a shared prompt.

A universal memory layer solves this by abstracting persistent storage, retrieval, and state management behind a single interface that any agent — regardless of the underlying LLM provider — can read from and write to. This post teaches you how to build one.

What You Will Learn

  • Why AI agents need a dedicated memory layer separate from the LLM context window
  • How to design a memory storage schema that supports episodic, semantic, and procedural memory types
  • Three retrieval strategies (semantic search, keyword search, hybrid ranking) and when to use each
  • How to manage shared state across multi-agent systems with trust scoring and conflict resolution
  • Runnable Python code for a minimal but functional universal memory layer

Why Agents Need a Memory Layer

An LLM's context window is not memory. It is a scratchpad. When a context window fills up, older content gets truncated or summarized, and information is permanently lost. Even with 128k or 200k token windows, you cannot fit weeks of accumulated agent interactions into a single prompt.

Consider an agent that triages customer support tickets. On day one, it learns that "Project Atlas" is an internal code name for a database migration. On day thirty, a new ticket references "Atlas" without explanation. Without persistent memory, the agent has to re-learn this every session or rely on an engineer to hardcode it into the system prompt.

A memory layer provides three capabilities that context windows cannot:

  1. Persistence — information survives beyond a single session or context window.
  2. Selective retrieval — the agent fetches only the memories relevant to the current task instead of stuffing everything into the prompt.
  3. Shared access — multiple agents can read and write to the same memory store, enabling coordination without direct message passing.

As Nowaczyk argues in Architectures for Building Agentic AI, reliability in agentic systems is "chiefly an architectural property" that emerges from principled componentization. Memory is one of those core components — alongside planners, tool routers, and safety monitors — that must be explicitly designed rather than bolted on.

Conceptual Foundation: Types of Agent Memory

Before writing any code, you need a mental model for what "memory" means in the context of AI agents. Cognitive science gives us a useful taxonomy that maps surprisingly well to software architecture.

| Memory Type | Cognitive Analogy | Agent Example | Storage Pattern |
| --- | --- | --- | --- |
| Episodic | Personal experiences | "Last time I called the weather API, it returned a 429 error" | Timestamped event log |
| Semantic | General knowledge | "The capital of France is Paris" | Key-value or document store |
| Procedural | Skills and how-to | "To deploy to staging, run make deploy-staging" | Structured instructions |
| Working | Short-term scratch | Current task context and intermediate results | In-memory / context window |

A universal memory layer must handle the first three types. Working memory is typically managed within the agent's execution loop and the LLM context window itself — it does not need to persist.

The critical insight is that these memory types have different access patterns. Episodic memory is usually queried by time range or similarity to a current situation. Semantic memory is queried by topic or concept. Procedural memory is queried by task type. Your storage schema and retrieval strategy must account for these differences.
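The differences are easy to see as concrete queries. Here is a minimal sketch against a trimmed-down table — the `task_type` column is illustrative (the full schema later in this post would carry it inside `metadata` instead):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE memories (
        content TEXT, memory_type TEXT, timestamp REAL, task_type TEXT
    )
""")
now = time.time()
conn.executemany("INSERT INTO memories VALUES (?, ?, ?, ?)", [
    ("Weather API returned a 429 error", "episodic", now - 3600, None),
    ("Project Atlas is the Q3 DB migration", "semantic", now - 86400, None),
    ("Run make deploy-staging to deploy", "procedural", now - 500, "deploy"),
])

# Episodic: queried by time range (here, the last two hours)
episodic = conn.execute(
    "SELECT content FROM memories WHERE memory_type = 'episodic' AND timestamp > ?",
    (now - 7200,),
).fetchall()

# Semantic: queried by topic or concept
semantic = conn.execute(
    "SELECT content FROM memories WHERE memory_type = 'semantic' AND content LIKE ?",
    ("%Atlas%",),
).fetchall()

# Procedural: queried by task type
procedural = conn.execute(
    "SELECT content FROM memories WHERE memory_type = 'procedural' AND task_type = ?",
    ("deploy",),
).fetchall()

print(len(episodic), len(semantic), len(procedural))  # 1 1 1
```

Each memory type is reached through a different filter, which is exactly why one retrieval strategy cannot serve all three.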

Architecture: How the Memory Layer Works

Here is the high-level architecture. An agent sends a query to the memory layer, which fans out across multiple storage backends, merges the results using hybrid ranking, and returns a unified set of relevant memories.

graph TD
    A["Agent Request<br/>(query + metadata)"] --> B["Memory Layer API"]
    B --> C["Embedding Service"]
    B --> D["Query Parser"]
    C --> E["Vector Store<br/>(Semantic Search)"]
    D --> F["Keyword Index<br/>(BM25 / Full-Text)"]
    D --> G["Knowledge Graph<br/>(Entity Relationships)"]
    E --> H["Hybrid Ranker<br/>(RRF / Weighted Fusion)"]
    F --> H
    G --> H
    H --> I["Ranked Memory Results"]
    I --> A

    style B fill:#4a90d9,stroke:#2c5f8a,color:#fff
    style H fill:#d4944a,stroke:#8a5f2c,color:#fff

Let us walk through each component.

1. Memory Layer API

This is the single interface every agent interacts with. It exposes two primary operations: store(memory) and retrieve(query, filters). By standardizing this interface, you decouple agents from storage implementation details. An agent using OpenAI's GPT-4 and an agent using Claude both call the same API.

The API also enforces a common memory schema — every memory object carries metadata like agent_id, timestamp, memory_type, and trust_score alongside the content itself.
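That interface can be captured as a structural type. A minimal sketch using `typing.Protocol` — the two method names mirror the operations above, but nothing else here is prescribed:

```python
from typing import Optional, Protocol, runtime_checkable

@runtime_checkable
class MemoryLayerAPI(Protocol):
    """The two-operation surface every agent codes against."""

    def store(self, memory) -> str:
        """Persist a memory object; returns its memory_id."""
        ...

    def retrieve(self, query: str, filters: Optional[dict] = None) -> list:
        """Fetch memories relevant to the query, optionally filtered
        by metadata such as memory_type or agent_id."""
        ...

# Any backend satisfying the protocol is interchangeable to the agents.
class InMemoryLayer:
    def __init__(self):
        self._items = {}

    def store(self, memory) -> str:
        mid = str(len(self._items))
        self._items[mid] = memory
        return mid

    def retrieve(self, query, filters=None) -> list:
        return [m for m in self._items.values() if query in str(m)]

layer = InMemoryLayer()
layer.store("Project Atlas is a DB migration")
print(isinstance(layer, MemoryLayerAPI))  # True (structural check)
```

Because the check is structural, agents never need to import or subclass anything from a specific storage backend.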

2. Parallel Retrieval Backends

When a retrieval request comes in, the memory layer queries multiple backends in parallel:

  • Vector Store — converts the query into an embedding and performs approximate nearest neighbor (ANN) search. This excels at finding semantically similar memories even when the exact words differ.
  • Keyword Index — performs BM25 or full-text search. This catches exact matches that semantic search might rank lower, such as specific error codes, proper nouns, or configuration values.
  • Knowledge Graph (optional) — traverses entity relationships. If the query mentions "Project Atlas," the graph can surface all memories connected to Atlas entities — team members, related services, deployment dates.
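The fan-out itself can be as simple as a thread pool, assuming each backend is a callable from query to a ranked candidate list. The stub backends below are placeholders standing in for the real stores:

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(query: str, backends: list) -> list:
    """Run every backend's search concurrently; return one candidate
    list per backend, ready for the hybrid ranker."""
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        futures = [pool.submit(backend, query) for backend in backends]
        return [f.result() for f in futures]

# Stubs for the vector store, keyword index, and knowledge graph
vector_search = lambda q: [("m1", 0.91), ("m2", 0.74)]
keyword_search = lambda q: [("m2", 4.2)]
graph_search = lambda q: [("m3", 1.0)]

candidate_lists = fan_out(
    "Project Atlas", [vector_search, keyword_search, graph_search]
)
print(len(candidate_lists))  # 3 ranked lists
```

Since each backend is typically I/O-bound (a database or index query), threads are enough; there is no need for multiprocessing here.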

3. Hybrid Ranking and Fusion

Each backend returns a ranked list of candidates. The hybrid ranker merges these lists into a single ordering. The most common approach is Reciprocal Rank Fusion (RRF), which we will implement below. RRF does not require score normalization across backends — it only needs the rank positions, making it robust when combining search systems that produce incomparable score scales.

Practical Implementation

Let us build a minimal but functional memory layer in Python. We will use SQLite for metadata storage, numpy for vector operations, and a simple BM25 implementation for keyword search.

Memory Schema

First, define the data model:

import uuid
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Memory:
    content: str
    memory_type: str  # "episodic", "semantic", "procedural"
    agent_id: str
    metadata: dict = field(default_factory=dict)
    memory_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)
    trust_score: float = 1.0  # 0.0 to 1.0, explained below
    embedding: Optional[list] = None

The trust_score field deserves explanation. In multi-agent systems, not all memories are equally reliable. An agent that hallucinates or operates on stale data should have its memories weighted lower during retrieval. Trust scores provide a mechanism for this — they can be updated based on verification outcomes, agent reputation, or human feedback.

Storage Layer

import sqlite3
import json
import numpy as np

class MemoryStore:
    def __init__(self, db_path: str = ":memory:"):
        self.conn = sqlite3.connect(db_path)
        self._create_tables()
        self._memories_vectors = {}  # memory_id -> np.array

    def _create_tables(self):
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS memories (
                memory_id TEXT PRIMARY KEY,
                content TEXT NOT NULL,
                memory_type TEXT NOT NULL,
                agent_id TEXT NOT NULL,
                timestamp REAL NOT NULL,
                trust_score REAL DEFAULT 1.0,
                metadata TEXT DEFAULT '{}'
            )
        """)
        # Full-text search index for keyword retrieval
        self.conn.execute("""
            CREATE VIRTUAL TABLE IF NOT EXISTS memories_fts
            USING fts5(memory_id, content)
        """)
        self.conn.commit()

    def store(self, memory: Memory):
        self.conn.execute(
            "INSERT INTO memories VALUES (?, ?, ?, ?, ?, ?, ?)",
            (memory.memory_id, memory.content, memory.memory_type,
             memory.agent_id, memory.timestamp, memory.trust_score,
             json.dumps(memory.metadata))
        )
        self.conn.execute(
            "INSERT INTO memories_fts VALUES (?, ?)",
            (memory.memory_id, memory.content)
        )
        self.conn.commit()

        if memory.embedding is not None:
            self._memories_vectors[memory.memory_id] = np.array(
                memory.embedding, dtype=np.float32
            )

SQLite FTS5 Is Not Production-Grade Vector Search

This implementation uses SQLite's FTS5 for keyword search and an in-memory dictionary for vectors. This works for learning and prototyping. For production systems, replace the vector store with a dedicated ANN index like FAISS, pgvector, Qdrant, or Weaviate. The keyword index can stay as FTS5 for moderate-scale workloads or move to Elasticsearch/Tantivy for larger datasets.

Retrieval: Semantic Search

Semantic search computes cosine similarity between the query embedding and all stored memory embeddings. For small datasets (under 100k memories), brute-force search is fast enough. Beyond that, use an ANN index.

def retrieve_semantic(
    self, query_embedding: list, top_k: int = 10
) -> list[tuple[str, float]]:
    """Returns list of (memory_id, similarity_score) tuples."""
    query_vec = np.array(query_embedding, dtype=np.float32)
    # Normalize for cosine similarity
    query_norm = query_vec / (np.linalg.norm(query_vec) + 1e-10)

    scores = []
    for mid, vec in self._memories_vectors.items():
        vec_norm = vec / (np.linalg.norm(vec) + 1e-10)
        similarity = float(np.dot(query_norm, vec_norm))
        scores.append((mid, similarity))

    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]

Retrieval: Keyword Search (BM25 via FTS5)

SQLite's FTS5 provides a bm25() ranking function out of the box:

import re  # add alongside the other module-level imports

def retrieve_keyword(
    self, query: str, top_k: int = 10
) -> list[tuple[str, float]]:
    """Returns list of (memory_id, bm25_score) tuples."""
    # FTS5 treats characters like ? and " as query syntax, so a raw
    # natural-language query such as "What is Project Atlas?" raises a
    # syntax error. Quote each word and OR the tokens together to
    # search for the literal terms instead.
    tokens = re.findall(r"\w+", query)
    if not tokens:
        return []
    fts_query = " OR ".join(f'"{t}"' for t in tokens)
    cursor = self.conn.execute(
        """
        SELECT memory_id, bm25(memories_fts) AS score
        FROM memories_fts
        WHERE memories_fts MATCH ?
        ORDER BY score
        LIMIT ?
        """,
        (fts_query, top_k)
    )
    # FTS5 bm25() returns negative scores; lower = better match
    return [(row[0], -row[1]) for row in cursor.fetchall()]

Hybrid Ranking with Reciprocal Rank Fusion

RRF merges ranked lists without needing to normalize scores across different retrieval systems. The formula for each document's fused score is:

RRF(d) = sum over all lists of 1 / (k + rank(d))

where k is a constant (typically 60) that dampens the impact of high-ranking positions.

def reciprocal_rank_fusion(
    *ranked_lists: list[tuple[str, float]],
    k: int = 60
) -> list[tuple[str, float]]:
    """
    Merges multiple ranked lists using RRF.

    Each ranked_list is [(memory_id, score), ...] sorted by relevance.
    Returns a unified ranked list of (memory_id, rrf_score).
    """
    rrf_scores = {}

    for ranked_list in ranked_lists:
        for rank, (memory_id, _score) in enumerate(ranked_list):
            if memory_id not in rrf_scores:
                rrf_scores[memory_id] = 0.0
            # rank is 0-indexed, so rank + 1 gives 1-indexed position
            rrf_scores[memory_id] += 1.0 / (k + rank + 1)

    fused = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    return fused
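A quick worked run shows why RRF rewards agreement between backends. The function is repeated here so the snippet executes on its own; the scores are deliberately on incomparable scales (cosine vs. BM25) to mimic real backends:

```python
def reciprocal_rank_fusion(*ranked_lists, k=60):
    rrf_scores = {}
    for ranked_list in ranked_lists:
        for rank, (memory_id, _score) in enumerate(ranked_list):
            rrf_scores[memory_id] = (
                rrf_scores.get(memory_id, 0.0) + 1.0 / (k + rank + 1)
            )
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)

semantic = [("m1", 0.91), ("m2", 0.74)]  # cosine similarities
keyword = [("m2", 4.20), ("m3", 1.10)]   # BM25 scores, different scale

fused = reciprocal_rank_fusion(semantic, keyword)
print([mid for mid, _ in fused])  # ['m2', 'm1', 'm3']
```

m2 wins overall because both backends surfaced it (1/62 + 1/61), even though the semantic list preferred m1. The raw scores never had to be compared — only the rank positions mattered.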

Let us see the full retrieval pipeline:

def retrieve(
    self, query: str, query_embedding: list, top_k: int = 5
) -> list[Memory]:
    """
    Hybrid retrieval: combines semantic and keyword search via RRF.
    """
    semantic_results = self.retrieve_semantic(query_embedding, top_k=20)
    keyword_results = self.retrieve_keyword(query, top_k=20)

    fused = reciprocal_rank_fusion(semantic_results, keyword_results)

    # Fetch full memory objects for top-k results
    memories = []
    for memory_id, rrf_score in fused[:top_k]:
        cursor = self.conn.execute(
            "SELECT * FROM memories WHERE memory_id = ?", (memory_id,)
        )
        row = cursor.fetchone()
        if row:
            mem = Memory(
                memory_id=row[0], content=row[1], memory_type=row[2],
                agent_id=row[3], timestamp=row[4], trust_score=row[5],
                metadata=json.loads(row[6])
            )
            memories.append(mem)

    return memories

Putting It Together

# Example usage
store = MemoryStore(db_path="agent_memory.db")

# Simulate storing a memory with a pre-computed embedding
# In practice, you'd call an embedding API (OpenAI, Cohere, etc.)
fake_embedding = np.random.randn(384).tolist()  # 384-dim for all-MiniLM-L6-v2

mem = Memory(
    content="Project Atlas refers to the Q3 database migration from PostgreSQL to CockroachDB.",
    memory_type="semantic",
    agent_id="support-triage-agent",
    embedding=fake_embedding,
    metadata={"source": "engineering-slack", "verified": True}
)
store.store(mem)

# Retrieve later
query = "What is Project Atlas?"
query_emb = np.random.randn(384).tolist()  # Would be real embedding in practice
results = store.retrieve(query, query_emb, top_k=3)

for r in results:
    print(f"[{r.memory_type}] {r.content[:80]}... (trust: {r.trust_score})")

Expected output (content will match since we only stored one memory):

[semantic] Project Atlas refers to the Q3 database migration from PostgreSQL to C... (trust: 1.0)

Multi-Agent State Management

When multiple agents share a memory layer, you face two challenges: conflicts and trust.

Conflict Resolution

Two agents might store contradictory memories. Agent A writes "deployment window is 2-4 AM UTC" and Agent B writes "deployment window is 3-5 AM UTC." A naive last-write-wins policy is dangerous.

Instead, use a versioned approach:

@dataclass
class VersionedMemory(Memory):
    version: int = 1
    parent_id: Optional[str] = None  # Points to the memory this updates
    superseded: bool = False  # True if a newer version exists

When an agent updates existing knowledge, it creates a new VersionedMemory with parent_id pointing to the original. The retrieval layer can then present the latest version while maintaining a full audit trail.
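Here is a minimal sketch of that update flow, using trimmed-down versions of the dataclasses above and a hypothetical `supersede()` helper (not part of the `MemoryStore` shown earlier):

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Memory:
    content: str
    memory_type: str
    agent_id: str
    memory_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)

@dataclass
class VersionedMemory(Memory):
    version: int = 1
    parent_id: Optional[str] = None
    superseded: bool = False

def supersede(old: VersionedMemory, new_content: str,
              agent_id: str) -> VersionedMemory:
    """Create the next version and flag the original as superseded.
    The caller is responsible for persisting both objects."""
    old.superseded = True
    return VersionedMemory(
        content=new_content,
        memory_type=old.memory_type,
        agent_id=agent_id,
        version=old.version + 1,
        parent_id=old.memory_id,
    )

v1 = VersionedMemory("deployment window is 2-4 AM UTC", "semantic", "agent-a")
v2 = supersede(v1, "deployment window is 3-5 AM UTC", "agent-b")
print(v2.version, v1.superseded)  # 2 True
```

Retrieval then filters on `superseded = False` by default, while the `parent_id` chain preserves the audit trail for anyone who needs to trace how a fact changed.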

Trust Scoring

Trust scores allow the system to weight memories by reliability. Here is a straightforward scoring model:

def compute_trust_score(memory: Memory, store: MemoryStore) -> float:
    score = 0.5  # Base score

    # Verified by a human or authoritative source
    if memory.metadata.get("verified"):
        score += 0.3

    # Recency boost: memories less than 7 days old get a bump
    age_days = (time.time() - memory.timestamp) / 86400
    if age_days < 7:
        score += 0.1

    # Corroboration: other agents stored similar information
    # (simplified — in practice, check semantic similarity)
    cursor = store.conn.execute(
        "SELECT COUNT(*) FROM memories WHERE agent_id != ? AND content LIKE ?",
        (memory.agent_id, f"%{memory.content[:50]}%")
    )
    corroboration_count = cursor.fetchone()[0]
    if corroboration_count > 0:
        score += min(0.1 * corroboration_count, 0.2)  # Cap at 0.2

    return min(score, 1.0)

Trust Is Not Binary

A memory with a trust score of 0.3 is not "untrustworthy" — it is unverified. The retrieval layer should still return low-trust memories but can flag them for the consuming agent. This mirrors how humans treat information: you might act on a rumor while still seeking confirmation.
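One way to implement the flagging is to partition rather than filter — a sketch, assuming retrieval results expose a `trust_score` attribute:

```python
from dataclasses import dataclass

@dataclass
class ScoredMemory:
    content: str
    trust_score: float

def partition_by_trust(memories, threshold: float = 0.5):
    """Split results so the consuming agent sees both groups, with the
    unverified ones explicitly flagged rather than silently dropped."""
    trusted = [m for m in memories if m.trust_score >= threshold]
    unverified = [m for m in memories if m.trust_score < threshold]
    return trusted, unverified

results = [
    ScoredMemory("verified: Atlas targets CockroachDB", 0.9),
    ScoredMemory("unconfirmed: migration delayed to Q4", 0.3),
]
trusted, unverified = partition_by_trust(results)
print(len(trusted), len(unverified))  # 1 1
```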

Seeing This in Practice

The patterns described above — hybrid retrieval, trust scoring, and shared memory across different AI providers — are implemented in SuperLocalMemory, an open-source project that runs these operations entirely on your local machine with zero cloud dependency.

The architecture maps directly to what we covered. Memories are stored with per-agent trust scores. Retrieval uses a hybrid approach combining semantic embeddings with keyword matching. Multiple agents — whether backed by OpenAI, Claude, or Gemini — read from and write to the same local memory store through a unified API.

You can inspect the implementation to see how these patterns translate to a working system. The repository is a useful reference if you are building your own memory layer and want to see how trust propagation, memory versioning, and multi-provider support interact in practice.

Real-World Considerations

When Not to Use a Universal Memory Layer

Not every agent system needs one. If your agent is stateless by design (a single-turn classifier, a one-shot code generator), adding persistent memory introduces complexity without benefit. The maintenance cost of a memory layer — schema migrations, index tuning, garbage collection of stale memories — is only justified when agents genuinely need to learn and coordinate over time.

Storage Scaling

The in-memory vector approach shown above works for up to roughly 100k memories with 384-dimensional embeddings (about 150MB of RAM). Beyond that, you need a proper ANN index. FAISS with an IVF index reduces search from O(n) to approximately O(sqrt(n)) at the cost of some recall. For most agent workloads, 95% recall at 10x speed is an acceptable tradeoff.
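The 150MB figure is straightforward arithmetic, assuming float32 embeddings:

```python
n_memories = 100_000
dims = 384
bytes_per_float = 4  # float32

ram_mb = n_memories * dims * bytes_per_float / 1_000_000
print(f"{ram_mb:.1f} MB")  # 153.6 MB
```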

Embedding Model Choice

Your embedding model determines the quality of semantic retrieval. Smaller models like all-MiniLM-L6-v2 (384 dimensions, 80MB) are fast and good enough for many use cases. Larger models like OpenAI's text-embedding-3-large (3072 dimensions) provide better recall but increase storage and latency proportionally.

Never Mix Embedding Models in the Same Vector Space

If you compute memory embeddings with all-MiniLM-L6-v2 and query embeddings with text-embedding-3-small, the similarity scores will be meaningless. Every vector in a given store must be produced by the same model. If you need to change models, you must re-embed all existing memories.
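A cheap guard is to record the model name with the store and reject mismatches at write (and query) time. A sketch, with model names as plain strings for comparison:

```python
class GuardedVectorStore:
    """Illustrative wrapper: refuses vectors produced by a different
    embedding model than the one the store was created with."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        self.vectors = {}  # memory_id -> list[float]

    def add(self, memory_id: str, vec: list, model_name: str):
        if model_name != self.model_name:
            raise ValueError(
                f"store built with {self.model_name}, got {model_name}"
            )
        self.vectors[memory_id] = vec

store = GuardedVectorStore("all-MiniLM-L6-v2")
store.add("m1", [0.1] * 384, "all-MiniLM-L6-v2")  # accepted
try:
    store.add("m2", [0.1] * 3072, "text-embedding-3-small")
except ValueError as e:
    print("rejected:", e)
```

The same tag makes re-embedding tractable: when you switch models, you can find every vector stamped with the old name and regenerate it.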

Memory Garbage Collection

Without cleanup, a memory store grows indefinitely. Implement time-based expiration for episodic memories (do you really need a record of every API call from six months ago?) and access-frequency-based pruning for semantic memories. A simple heuristic: if a memory has not been retrieved in 90 days and its trust score is below 0.5, archive or delete it.
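That heuristic translates into a single SQL statement. The sketch below assumes a `last_accessed` column, which the schema shown earlier does not yet track — you would add it and update it on every retrieval:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE memories (
        memory_id TEXT PRIMARY KEY,
        trust_score REAL,
        last_accessed REAL
    )
""")
now = time.time()
conn.executemany("INSERT INTO memories VALUES (?, ?, ?)", [
    ("stale-low-trust", 0.3, now - 120 * 86400),   # should be pruned
    ("stale-high-trust", 0.9, now - 120 * 86400),  # kept: trusted
    ("fresh-low-trust", 0.3, now - 5 * 86400),     # kept: recent
])

# Prune anything unread for 90 days with trust below 0.5
cutoff = now - 90 * 86400
deleted = conn.execute(
    "DELETE FROM memories WHERE last_accessed < ? AND trust_score < 0.5",
    (cutoff,),
).rowcount
print(deleted)  # 1
```

In production you would likely archive rows to cold storage before deleting, so a wrongly pruned memory can still be recovered.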

Further Reading and Sources

  • Architectures for Building Agentic AI by Nowaczyk (2025) — argues that reliability in agent systems is an architectural property and details the role of memory as a core component.
  • Foundations of GenIR by Ai, Zhan, and Liu (2025) — covers how generative models interact with information retrieval systems, relevant to understanding semantic search in the agent context.
  • Safe, Untrusted, "Proof-Carrying" AI Agents by Tagliabue and Greco (2025) — explores trust and governance in agentic workflows, with ideas applicable to memory trust scoring.
  • Reciprocal Rank Fusion by Cormack, Clarke, and Butt (2009) — the original paper on RRF, a simple but effective method for merging ranked lists.
  • FAISS documentation — the standard library for efficient similarity search and clustering of dense vectors.
  • SQLite FTS5 documentation — full-text search extension used in our keyword retrieval implementation.

Key Takeaways

  • Context windows are not memory. Persistent agent memory requires explicit architecture — a storage schema, retrieval pipeline, and state management strategy.
  • Use three memory types: episodic (events), semantic (facts), and procedural (instructions). Each has different storage and access patterns.
  • Hybrid retrieval outperforms any single strategy. Combine semantic search (for meaning) with keyword search (for exact terms) using Reciprocal Rank Fusion to merge results without score normalization.
  • Trust scoring is essential in multi-agent systems. Not all memories are equally reliable. Weight them by verification status, recency, and corroboration from multiple agents.
  • Start simple. SQLite + FTS5 + in-memory vectors will carry you surprisingly far. Move to FAISS/pgvector and Elasticsearch when you actually hit scale limits, not before.
