Most AI agents today are stateless. They receive a prompt, generate a response, and forget everything. If you have built anything with an LLM, you have felt this limitation firsthand: your agent cannot recall what happened two conversations ago, cannot learn from its mistakes, and cannot share context with other agents in your system. The context window is not memory. It is a scratchpad that gets wiped clean. To build agents that genuinely improve over time, you need a dedicated memory layer — one that persists knowledge, organizes it for fast retrieval, and works across multiple agents without coupling them together.
This post walks you through the architecture of such a memory layer from first principles. No handwaving, no black boxes. By the end, you will understand the design decisions behind memory persistence for AI agents and have runnable code you can adapt for your own systems.
What You Will Learn
- The difference between episodic and semantic memory in agent systems, and when to use each.
- How to design an interoperable memory schema that works across multi-agent architectures.
- How hybrid search (vector + keyword + graph) enables efficient memory recall at scale.
- Concrete implementation patterns in Python with working code examples.
- Trade-offs and failure modes you will encounter in production.
Why Agents Need a Memory Layer
An LLM's context window is finite. GPT-4 Turbo offers 128K tokens; Claude gives you 200K. That sounds like a lot until your agent has been running for days, processing hundreds of tasks, accumulating observations, and collaborating with other agents. You cannot stuff all of that into a single prompt.
More importantly, a context window is ephemeral. When the session ends, the context is gone. Your agent starts from zero next time. This is the equivalent of a developer who loses all their notes every time they close their laptop.
A memory layer solves this by acting as persistent, queryable storage that sits outside the LLM. The agent writes to it during execution and reads from it when it needs context. This is not a new idea — cognitive architectures like SOAR and ACT-R modeled human memory as distinct subsystems decades ago. What is new is applying these patterns to LLM-based agents at scale.
As Krishnan (2025) notes in AI Agents: Evolution, Architecture, and Real-World Applications, modern agent architectures integrate large language models with dedicated modules for perception, planning, and tool use. Memory is the connective tissue between those modules.
Conceptual Foundation: Types of Agent Memory
Human cognitive science distinguishes between several types of memory. Two are particularly useful for agent design: episodic memory and semantic memory.
Episodic memory stores specific events and experiences. For an agent, this means: "At 2:30 PM, I called the weather API and got a timeout error." Episodic memories are timestamped, ordered, and tied to a specific context.
Semantic memory stores general knowledge and facts distilled from experience. For an agent, this means: "The weather API tends to time out during peak hours; use the cached endpoint instead." Semantic memories are abstracted, deduplicated, and context-independent.
There is a third type worth mentioning: procedural memory, which stores how to do things. In agent systems, this maps to learned tool-use patterns, prompt templates that worked well, or refined chain-of-thought strategies. We will not cover procedural memory in depth here, but know that it exists in the design space.
| Property | Episodic Memory | Semantic Memory |
|---|---|---|
| Content | Specific events, observations | Distilled facts, generalizations |
| Structure | Timestamped, sequential | Key-value, graph-linked |
| Retention | Can be pruned over time | Long-lived, updated in place |
| Retrieval | By time range, similarity to current context | By concept, relationship, keyword |
| Example | "User asked about refund policy on March 3" | "Refund policy requires order ID and proof of purchase" |
A well-designed memory layer supports both types. Episodic memory gives your agent a detailed history to reason over. Semantic memory gives it compressed, reliable knowledge to act on quickly.
Architecture Overview
Here is the core architecture for a universal memory layer. Study the flow from agent action through to retrieval.
```mermaid
graph TD
    A["Agent Action / Observation"] --> B["Episodic Buffer"]
    B --> C["Memory Processor"]
    C --> D["Long-Term Memory Store"]
    D --> E["Vector Index"]
    D --> F["Keyword Index"]
    D --> G["Graph Index"]
    H["Agent Context Window"] -->|"retrieval query"| I["Hybrid Retrieval Engine"]
    E --> I
    F --> I
    G --> I
    I -->|"ranked memories"| H
    C -->|"semantic extraction"| J["Semantic Memory Store"]
    J --> G
    J --> E
```
Let us walk through each component.
1. Agent Action Produces an Observation
Every time your agent does something — calls a tool, receives user input, generates a response — it produces an observation. This is the raw material for memory. The observation includes the content itself, a timestamp, the agent's identifier, and any metadata (tool name, user ID, task ID).
2. Episodic Buffer Stages the Memory
Before writing to long-term storage, observations land in an episodic buffer. This is a short-lived queue that allows batching, deduplication, and importance scoring. Not every observation deserves to be a long-term memory. The buffer gives you a place to apply filters.
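A minimal buffer sketch illustrates the idea. The importance heuristic, flush size, and threshold below are assumptions for illustration, not a prescribed policy:

```python
from collections import deque

class EpisodicBuffer:
    """Stages observations before long-term storage: dedupes, filters, batches."""

    def __init__(self, flush_size: int = 10, min_importance: float = 0.3):
        self.queue: deque = deque()
        self.seen_contents: set[str] = set()
        self.flush_size = flush_size
        self.min_importance = min_importance

    def score_importance(self, obs: dict) -> float:
        # Naive heuristic (an assumption): errors and direct user input
        # matter more than routine tool calls.
        if obs.get("metadata", {}).get("status_code", 200) >= 400:
            return 1.0
        if obs.get("metadata", {}).get("tool") is None:
            return 0.8  # no tool involved -> direct user interaction
        return 0.2

    def add(self, obs: dict) -> None:
        if obs["content"] in self.seen_contents:
            return  # drop exact duplicates
        if self.score_importance(obs) < self.min_importance:
            return  # not worth remembering
        self.seen_contents.add(obs["content"])
        self.queue.append(obs)

    def flush_ready(self) -> bool:
        return len(self.queue) >= self.flush_size

    def drain(self) -> list[dict]:
        batch = list(self.queue)
        self.queue.clear()
        return batch
```

The dedup here is exact-match on content; in practice you would also dedupe near-duplicates by embedding similarity before they reach long-term storage.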
3. Memory Processor Writes to Long-Term Storage
The processor takes buffered observations, generates embeddings (for vector search), extracts entities and relationships (for graph search), and indexes keywords (for BM25/keyword search). It also runs a semantic extraction step: distilling episodic memories into semantic memories when patterns emerge.
4. Hybrid Retrieval Fetches Relevant Memories
When the agent needs context, it sends a retrieval query to the hybrid retrieval engine. This engine queries all three indexes — vector, keyword, and graph — then merges and re-ranks the results. The top-ranked memories are injected into the agent's context window as part of the system prompt or as retrieved context.
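One way to perform that injection is a renderer that packs ranked memories into a context block until a budget is exhausted — a sketch, with a crude character-based budget standing in for real token counting:

```python
def render_memory_context(memories: list[dict], token_budget_chars: int = 2000) -> str:
    """Format ranked memories (most relevant first) into a context block
    for the system prompt. Stops once the rough character budget is spent."""
    lines = ["Relevant memories (most relevant first):"]
    used = len(lines[0])
    for mem in memories:
        line = f"- [{mem.get('memory_type', 'episodic')}] {mem['content']}"
        if used + len(line) > token_budget_chars:
            break  # budget exhausted; lower-ranked memories are dropped
        lines.append(line)
        used += len(line)
    return "\n".join(lines)
```

Because the input is already ranked by the retrieval engine, truncating at the budget boundary drops the least relevant memories first.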
Designing an Interoperable Memory Schema
If you are building a multi-agent system, your memory schema needs to work across agents that may have different roles, different tools, and different LLM backends. This means the schema must be self-describing and loosely coupled.
Here is a practical schema in JSON that covers both episodic and semantic memories:
```python
from dataclasses import dataclass, field
from typing import Optional
from datetime import datetime
import uuid
import json

@dataclass
class MemoryRecord:
    """Universal memory record usable across any agent in the system."""
    content: str       # The actual memory content
    memory_type: str   # "episodic" or "semantic"
    agent_id: str      # Which agent created this memory
    timestamp: str = field(
        default_factory=lambda: datetime.utcnow().isoformat()
    )
    memory_id: str = field(
        default_factory=lambda: str(uuid.uuid4())
    )
    embedding: Optional[list[float]] = None  # Vector embedding, set during processing
    entities: list[str] = field(default_factory=list)        # Extracted entities for graph index
    relationships: list[dict] = field(default_factory=list)  # Entity-to-entity links
    metadata: dict = field(default_factory=dict)  # Flexible metadata (tool, task_id, etc.)
    trust_score: float = 1.0   # How much other agents should trust this memory
    access_count: int = 0      # How often this memory has been retrieved
    ttl: Optional[int] = None  # Time-to-live in seconds; None = permanent

    def to_dict(self) -> dict:
        return {k: v for k, v in self.__dict__.items() if v is not None}

# Example: creating an episodic memory
episode = MemoryRecord(
    content="Called /api/weather for zip 94103. Received 504 Gateway Timeout after 30s.",
    memory_type="episodic",
    agent_id="weather-agent-01",
    entities=["weather_api", "zip_94103"],
    relationships=[{"from": "weather_api", "relation": "timed_out_for", "to": "zip_94103"}],
    metadata={"tool": "http_get", "status_code": 504, "task_id": "task-abc-123"}
)
print(json.dumps(episode.to_dict(), indent=2))
```
Expected output:
```json
{
  "content": "Called /api/weather for zip 94103. Received 504 Gateway Timeout after 30s.",
  "memory_type": "episodic",
  "agent_id": "weather-agent-01",
  "timestamp": "2026-03-05T14:22:01.123456",
  "memory_id": "a1b2c3d4-...",
  "entities": ["weather_api", "zip_94103"],
  "relationships": [{"from": "weather_api", "relation": "timed_out_for", "to": "zip_94103"}],
  "metadata": {"tool": "http_get", "status_code": 504, "task_id": "task-abc-123"},
  "trust_score": 1.0,
  "access_count": 0
}
```
The trust_score field is critical in multi-agent systems. When Agent B reads a memory written by Agent A, it needs to assess reliability. Trust scores can be updated based on whether memories led to successful outcomes. The access_count field enables least-recently-used pruning for storage management.
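One simple way to update trust from outcomes is an exponential moving average — a sketch; the learning rate alpha is an assumption you would tune:

```python
def update_trust(current: float, outcome_success: bool, alpha: float = 0.1) -> float:
    """Nudge a memory's trust score toward 1.0 after a successful outcome,
    toward 0.0 after a failure. Alpha controls how fast trust moves."""
    target = 1.0 if outcome_success else 0.0
    return (1 - alpha) * current + alpha * target
```

Repeated failures decay trust geometrically, so one bad outcome never zeroes out a long-trusted memory.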
Schema Versioning Matters
Once multiple agents share a memory store, changing the schema becomes a coordination problem. Include a schema_version field in your metadata from day one. Migrations in multi-agent systems are significantly harder than in single-service architectures because you cannot take the memory layer offline while agents are running.
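A version-tolerant read path keeps old and new records usable side by side. The version numbers and the v1-to-v2 upgrade rule below are illustrative assumptions:

```python
CURRENT_SCHEMA_VERSION = 2

def read_memory(raw: dict) -> dict:
    """Normalize a record written under any known schema version.
    Returns a new dict; the stored record is left untouched."""
    version = raw.get("metadata", {}).get("schema_version", 1)
    record = dict(raw)
    if version < 2:
        # Hypothetical rule: v1 records predate trust scoring,
        # so treat them as fully trusted.
        record.setdefault("trust_score", 1.0)
    # Stamp the normalized record without mutating the original metadata.
    record["metadata"] = {**raw.get("metadata", {}),
                          "schema_version": CURRENT_SCHEMA_VERSION}
    return record
```

Upgrading on read means old records never need a stop-the-world migration; they are normalized lazily as agents touch them.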
Hybrid Search: Vector + Keyword + Graph
No single search method handles all memory retrieval scenarios well. Here is why you need all three.
Vector search (using embeddings) excels at semantic similarity. "The API returned an error" will match "HTTP request failed" even though they share no keywords. But vector search struggles with precise lookups: searching for "task-abc-123" by embedding similarity is unreliable.
Keyword search (BM25 or similar) excels at exact and partial matches. Searching for a specific task ID, agent name, or error code is fast and deterministic. But keyword search misses semantic relationships entirely.
Graph search excels at relationship traversal. "What do we know about all entities connected to the weather API?" is a graph query. Neither vector nor keyword search can answer this efficiently.
Here is a practical implementation of hybrid retrieval using Python:
```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class SearchResult:
    memory_id: str
    content: str
    score: float
    source: str  # "vector", "keyword", "graph", or "hybrid"

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Compute cosine similarity between two vectors."""
    a_arr, b_arr = np.array(a), np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr) + 1e-10))

def vector_search(
    query_embedding: list[float],
    memory_store: list[dict],
    top_k: int = 5
) -> list[SearchResult]:
    """Search memories by embedding similarity."""
    scored = []
    for mem in memory_store:
        if mem.get("embedding"):
            sim = cosine_similarity(query_embedding, mem["embedding"])
            scored.append(SearchResult(
                memory_id=mem["memory_id"],
                content=mem["content"],
                score=sim,
                source="vector"
            ))
    scored.sort(key=lambda r: r.score, reverse=True)
    return scored[:top_k]

def keyword_search(
    query_terms: list[str],
    memory_store: list[dict],
    top_k: int = 5
) -> list[SearchResult]:
    """Simplified term-overlap scoring (a stand-in for true BM25)."""
    scored = []
    for mem in memory_store:
        content_lower = mem["content"].lower()
        # Count how many query terms appear in the content
        hits = sum(1 for term in query_terms if term.lower() in content_lower)
        if hits > 0:
            # Normalize by number of query terms
            score = hits / len(query_terms)
            scored.append(SearchResult(
                memory_id=mem["memory_id"],
                content=mem["content"],
                score=score,
                source="keyword"
            ))
    scored.sort(key=lambda r: r.score, reverse=True)
    return scored[:top_k]

def graph_search(
    entity: str,
    memory_store: list[dict],
    max_hops: int = 2
) -> list[SearchResult]:
    """Find memories connected to a given entity through relationships."""
    results = []
    seen_memory_ids: set[str] = set()  # Avoid returning the same memory twice
    visited_entities = {entity}
    frontier = {entity}
    for hop in range(max_hops):
        next_frontier: set[str] = set()
        for mem in memory_store:
            for rel in mem.get("relationships", []):
                if rel["from"] in frontier or rel["to"] in frontier:
                    if mem["memory_id"] not in seen_memory_ids:
                        seen_memory_ids.add(mem["memory_id"])
                        results.append(SearchResult(
                            memory_id=mem["memory_id"],
                            content=mem["content"],
                            score=1.0 / (hop + 1),  # Closer hops score higher
                            source="graph"
                        ))
                    # Add connected entities to the next hop
                    for key in ("from", "to"):
                        if rel[key] not in visited_entities:
                            next_frontier.add(rel[key])
                            visited_entities.add(rel[key])
        frontier = next_frontier
        if not frontier:
            break
    return results

def hybrid_search(
    query_embedding: list[float],
    query_terms: list[str],
    entity: Optional[str],
    memory_store: list[dict],
    weights: Optional[dict] = None,
    top_k: int = 5
) -> list[SearchResult]:
    """
    Merge results from all three search methods using weighted reciprocal rank fusion.
    """
    if weights is None:
        weights = {"vector": 0.5, "keyword": 0.3, "graph": 0.2}
    all_results: dict[str, float] = {}  # memory_id -> fused score
    result_content: dict[str, str] = {}
    for search_fn, args, source_key in [
        (vector_search, (query_embedding, memory_store), "vector"),
        (keyword_search, (query_terms, memory_store), "keyword"),
    ]:
        results = search_fn(*args)
        for rank, result in enumerate(results):
            # Reciprocal rank fusion: 1/(rank+1) * weight
            rrf_score = (1.0 / (rank + 1)) * weights[source_key]
            all_results[result.memory_id] = all_results.get(result.memory_id, 0.0) + rrf_score
            result_content[result.memory_id] = result.content
    if entity:
        graph_results = graph_search(entity, memory_store)
        for rank, result in enumerate(graph_results):
            rrf_score = (1.0 / (rank + 1)) * weights["graph"]
            all_results[result.memory_id] = all_results.get(result.memory_id, 0.0) + rrf_score
            result_content[result.memory_id] = result.content
    # Sort by fused score
    sorted_ids = sorted(all_results.keys(), key=lambda mid: all_results[mid], reverse=True)
    return [
        SearchResult(
            memory_id=mid,
            content=result_content[mid],
            score=all_results[mid],
            source="hybrid"
        )
        for mid in sorted_ids[:top_k]
    ]
```
The hybrid_search function uses reciprocal rank fusion (RRF) to combine results. RRF is simple and effective: it assigns each result a score of 1/(rank+1), weighted by the search method's importance. Memories that appear in multiple search results get boosted naturally. This approach outperforms naive score averaging because scores from different search methods are not on the same scale.
Tuning Hybrid Search Weights
The default weights (0.5 vector, 0.3 keyword, 0.2 graph) are a reasonable starting point. In practice, tune these based on your retrieval evaluation set. If your agent frequently needs exact ID lookups, increase the keyword weight. If your domain is heavily relational (e.g., knowledge graphs, org charts), increase the graph weight.
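Tuning requires a small labeled evaluation set: queries paired with the memory IDs a human judged relevant. A minimal sketch of the metric and a brute-force sweep — the grid step and the `run_retrieval` callable (assumed to wrap hybrid search and return ranked memory IDs) are illustrative assumptions:

```python
from itertools import product

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of relevant memories that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

def grid_search_weights(eval_set, run_retrieval, step: float = 0.1):
    """Try weight combinations summing to 1.0 and keep the best average recall.
    eval_set is a list of (query, relevant_id_set) pairs."""
    best_weights, best_score = None, -1.0
    steps = [round(i * step, 2) for i in range(int(1 / step) + 1)]
    for wv, wk in product(steps, steps):
        wg = round(1.0 - wv - wk, 2)
        if wg < 0:
            continue  # weights must sum to 1.0
        weights = {"vector": wv, "keyword": wk, "graph": wg}
        avg = sum(
            recall_at_k(run_retrieval(query, weights), relevant)
            for query, relevant in eval_set
        ) / len(eval_set)
        if avg > best_score:
            best_weights, best_score = weights, avg
    return best_weights, best_score
```

Even a few dozen labeled queries are enough to tell whether your default weights are badly miscalibrated for your workload.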
Semantic Extraction: Turning Episodes into Knowledge
Episodic memories accumulate fast. If your agent runs 1,000 tasks per day, you will have 1,000+ episodic records within hours. Retrieval degrades as the store grows. Semantic extraction addresses this by periodically distilling episodic memories into compact semantic memories.
Here is a simplified extraction pipeline:
```python
def extract_semantic_memories(
    episodic_memories: list[dict],
    similarity_threshold: float = 0.85
) -> list[dict]:
    """
    Group similar episodic memories and distill them into semantic memories.
    In production, you would use an LLM to generate the summary.
    """
    clusters: list[list[dict]] = []
    for mem in episodic_memories:
        placed = False
        for cluster in clusters:
            # Compare against the first memory in each cluster (simplified)
            sim = cosine_similarity(mem["embedding"], cluster[0]["embedding"])
            if sim >= similarity_threshold:
                cluster.append(mem)
                placed = True
                break
        if not placed:
            clusters.append([mem])
    semantic_memories = []
    for cluster in clusters:
        if len(cluster) >= 3:  # Only distill if we have enough evidence
            # In production, pass cluster contents to an LLM for summarization
            combined_content = " | ".join(m["content"] for m in cluster)
            semantic_memories.append({
                "content": f"[Distilled from {len(cluster)} episodes] {combined_content[:200]}...",
                "memory_type": "semantic",
                "agent_id": "memory-processor",
                "source_count": len(cluster),
                "source_ids": [m["memory_id"] for m in cluster],
                "trust_score": min(m.get("trust_score", 1.0) for m in cluster),
            })
    return semantic_memories
```
In a production system, you would replace the simple concatenation with an LLM call that generates a proper summary. The key design choice is the similarity_threshold: too low and you merge unrelated memories; too high and nothing gets distilled.
Do Not Delete Episodic Memories After Extraction
It is tempting to delete episodic memories once they have been distilled into semantic memories. Do not do this in your first iteration. Semantic extraction is lossy — the summary may miss critical details. Keep episodic memories with a TTL (e.g., 30 days) so you can fall back to them during retrieval. Archive rather than delete.
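A pruning pass that archives rather than deletes might look like the sketch below. The created_at epoch-seconds field is an assumption for brevity; adapt it to the ISO timestamp used in the schema above:

```python
import time
from typing import Optional

def prune_episodic(
    memory_store: list[dict],
    archive: list[dict],
    now: Optional[float] = None
) -> list[dict]:
    """Move expired episodic memories into an archive instead of deleting them.
    Assumes each record carries a created_at epoch float and an optional ttl
    in seconds (None = permanent). Returns the surviving records."""
    now = time.time() if now is None else now
    kept = []
    for mem in memory_store:
        ttl = mem.get("ttl")
        expired = (
            mem.get("memory_type") == "episodic"
            and ttl is not None
            and now - mem.get("created_at", now) > ttl
        )
        if expired:
            archive.append(mem)  # archived, not destroyed
        else:
            kept.append(mem)
    return kept
```

Semantic memories and permanent (ttl=None) episodes pass through untouched, so the archive only ever accumulates expired episodic records.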
Real-World Considerations
Storage Costs and Pruning
Embeddings are not small. A 1536-dimensional float32 embedding consumes about 6 KB. At 10,000 memories, that is 60 MB of embeddings alone. At 10 million, it is 60 GB. Plan your pruning strategy early: TTL-based expiry for episodic memories, access-count-based eviction for infrequently-used semantic memories, and compression for archived records.
Latency Budgets
Your agent is waiting for memories before it can generate a response. If hybrid retrieval takes 500ms, that 500ms is added to every agent turn. For real-time applications, consider a two-tier cache: a fast in-memory cache of recently accessed memories (hits in under 5ms) backed by the full indexed store (hits in 50-200ms).
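The fast tier can be as simple as an LRU dict in front of whatever fetch function hits the indexed store — a sketch:

```python
from collections import OrderedDict
from typing import Optional

class TwoTierMemoryCache:
    """LRU in-memory cache in front of a slower indexed store."""

    def __init__(self, backing_fetch, capacity: int = 256):
        self.backing_fetch = backing_fetch  # callable: memory_id -> dict | None
        self.capacity = capacity
        self.cache: OrderedDict[str, dict] = OrderedDict()

    def get(self, memory_id: str) -> Optional[dict]:
        if memory_id in self.cache:
            self.cache.move_to_end(memory_id)  # fast path: mark recently used
            return self.cache[memory_id]
        mem = self.backing_fetch(memory_id)  # slow path: indexed store
        if mem is not None:
            self.cache[memory_id] = mem
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
        return mem
```

Pair this with a write-through on memory creation so freshly written memories are immediately hot.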
Multi-Agent Trust and Conflict Resolution
When Agent A writes "the refund policy requires a receipt" and Agent B writes "the refund policy does not require a receipt," your memory layer has a conflict. Trust scores help here — you can weight memories by the trust score of the authoring agent — but they do not eliminate the problem. Consider adding a conflict detection step during retrieval that flags contradictory memories for the consuming agent to resolve.
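A retrieval-time conflict check can be crude and still useful. The negation heuristic below is deliberately naive — an assumption standing in for an LLM or NLI-based contradiction judge:

```python
from itertools import combinations

def detect_conflicts(memories: list[dict]) -> list[tuple[str, str]]:
    """Flag pairs of memories about the same entities whose contents disagree.
    The 'one statement negates the other' check is a placeholder heuristic."""
    conflicts = []
    for a, b in combinations(memories, 2):
        shared = set(a.get("entities", [])) & set(b.get("entities", []))
        if not shared:
            continue  # no common subject, no possible conflict
        a_neg = " not " in f" {a['content'].lower()} "
        b_neg = " not " in f" {b['content'].lower()} "
        if a_neg != b_neg:
            conflicts.append((a["memory_id"], b["memory_id"]))
    return conflicts
```

Flagged pairs can be surfaced to the consuming agent alongside both trust scores, leaving the resolution decision where the context lives.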
When Not to Use a Memory Layer
Not every agent needs persistent memory. If your agent handles isolated, stateless tasks (e.g., a code formatter, a one-shot classifier), the overhead of a memory layer adds complexity without benefit. The memory layer pays off when agents run over extended periods, collaborate with other agents, or need to improve based on past experience.
Seeing This in Practice
The patterns described above — interoperable memory schemas, trust scoring across agents, and hybrid retrieval from a shared memory store — are implemented in SuperLocalMemory, an open-source memory layer that runs entirely on your local machine with no cloud dependency.
Its multi-agent shared memory architecture assigns trust scores to memories based on the authoring agent's track record and uses a schema compatible with 16+ tools including Claude and Cursor. You can inspect how the memory write/read flow works by cloning the repository:
```bash
git clone https://github.com/varun369/SuperLocalMemoryV2.git
cd SuperLocalMemoryV2
# Examine the memory schema and retrieval logic
cat src/memory/schema.py
cat src/memory/retrieval.py
```
This gives you a concrete reference implementation to compare against the architectural patterns discussed here. Reviewing working code is often more instructive than reading about patterns in the abstract.
Further Reading and Sources
- AI Agents: Evolution, Architecture, and Real-World Applications by Naveen Krishnan (2025). A comprehensive survey of modern agent architectures including memory and planning modules.
- Foundations of GenIR by Ai, Zhan, and Liu (2025). Covers how generative AI models interact with information retrieval systems — directly relevant to memory retrieval in agent systems.
- SOAR Cognitive Architecture. The foundational work on cognitive architectures that inspired episodic/semantic memory separation in AI systems.
- Reciprocal Rank Fusion by Cormack, Clarke, and Butt. The original paper on RRF, the fusion method used in our hybrid search implementation.
- LangChain Memory Documentation. Practical reference for memory implementations in LLM agent frameworks, useful for comparing approaches.
Key Takeaways
- Separate episodic and semantic memory. Episodic memory captures raw events; semantic memory distills them into reusable knowledge. Both are necessary for agents that learn over time.
- Design your schema for interoperability. Include agent IDs, trust scores, timestamps, and flexible metadata from the start. Multi-agent systems need schemas that no single agent owns.
- Use hybrid search, not just vector search. Combining vector similarity, keyword matching, and graph traversal through reciprocal rank fusion gives you coverage across all retrieval scenarios.
- Prune aggressively but archive carefully. Memory stores grow fast. Use TTLs and access counts for eviction, but do not permanently delete episodic memories until semantic extraction is mature.
- Not every agent needs memory. Add a memory layer when your agents run long-lived tasks, collaborate, or need to improve from experience. For stateless, one-shot tasks, skip it.