AI agents without memory are stateless functions. They process a prompt, return a response, and forget everything. That works for one-shot tasks, but it fails the moment an agent needs to recall a previous conversation, learn from a mistake, or coordinate with another agent. The moment you need continuity — across sessions, across tools, across a team of agents — you need a memory layer.
This post teaches you how to build one. Not a toy demo, but an architecture you can reason about and extend. We will cover how memory gets stored, how it gets retrieved (and why naive vector search is not enough), and how multiple agents can share and trust each other's memories. Every pattern comes with runnable code.
What You Will Learn
- The three types of agent memory (episodic, semantic, procedural) and when to use each
- How to implement hybrid retrieval combining semantic search and BM25 keyword search
- How Reciprocal Rank Fusion merges results from different retrieval strategies
- State management patterns for multi-agent systems with shared memory
- Trust scoring: how agents decide which memories to rely on
- Practical Python code for each component
Conceptual Foundation: What Is Agent Memory?
Human memory is not a single system. You have short-term working memory (what you are thinking about right now), long-term episodic memory (what happened at lunch yesterday), and procedural memory (how to ride a bike). Agent memory works the same way — different types serve different purposes.
Three Types of Agent Memory
Episodic memory stores specific interactions. "The user asked me to summarize Q3 revenue on Tuesday. I retrieved data from the finance API and the user corrected my interpretation of 'net revenue'." This is the agent's autobiography — a timestamped log of what happened, what it did, and what feedback it received.
Semantic memory stores facts and knowledge extracted from interactions. "Net revenue for this company means revenue after returns and discounts, not after all expenses." These are distilled truths the agent can reference without replaying entire episodes.
Procedural memory stores learned behaviors and strategies. "When the user asks about revenue, always clarify whether they mean gross or net before querying." These are the agent's acquired skills.
Most memory layer implementations today conflate these three types into a single vector store. That works at small scale, but it creates retrieval noise as the memory grows. A well-designed memory layer distinguishes between them.
graph TD
A[Agent Interaction] --> B{Memory Classification}
B -->|Raw interaction log| C[Episodic Memory Store]
B -->|Extracted facts| D[Semantic Memory Store]
B -->|Learned strategies| E[Procedural Memory Store]
C --> F[Embedding + BM25 Index]
D --> F
E --> F
F --> G[Hybrid Retrieval Engine]
G --> H[Reciprocal Rank Fusion]
H --> I[Ranked Memory Results]
I --> J[Agent Context Window]
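The classification step at the top of that diagram does not need to be sophisticated to be useful. A rules-first router with an LLM fallback for ambiguous cases is a reasonable starting point; the keyword heuristics below are illustrative assumptions, not a production classifier:

```python
def classify_memory(content: str) -> str:
    """Naive router: assign an interaction to one of the three stores.

    The keyword lists are placeholders; a real system would use an LLM
    call or a trained classifier for anything these rules miss.
    """
    lowered = content.lower()
    # Instructions and strategies read like rules -> procedural
    if any(marker in lowered for marker in ("always ", "never ", "when the user")):
        return "procedural"
    # Narrative of specific events -> episodic
    if any(marker in lowered for marker in ("asked", "responded", "yesterday")):
        return "episodic"
    # Default: treat distilled statements as facts
    return "semantic"

print(classify_memory("Always clarify gross vs net revenue before querying."))
# → procedural
```

In practice the episodic store usually receives the raw interaction log regardless, with semantic and procedural records extracted from it asynchronously.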
How It Works: The Memory Write/Read Pipeline
Every memory layer has two fundamental operations: writing memories and reading them back. The architecture of each determines everything about your system's quality.
Writing Memories: Ingestion Pipeline
When an agent completes an interaction, the memory layer needs to process and store that interaction. This is not as simple as appending text to a database.
1. Capture the Raw Interaction
The agent sends the full interaction context — the user's input, the agent's reasoning trace, tool calls made, the final response, and any feedback received. Store this as structured data, not a flat string.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional
@dataclass
class MemoryRecord:
content: str
memory_type: str # "episodic", "semantic", "procedural"
agent_id: str
session_id: str
timestamp: datetime = field(default_factory=datetime.utcnow)
metadata: dict = field(default_factory=dict)
trust_score: float = 1.0 # 0.0 to 1.0
source_agent: Optional[str] = None
embedding: Optional[list[float]] = None
2. Generate Embeddings
Convert the text content into a dense vector representation. This enables semantic search — finding memories that are conceptually similar even when they use different words.
import numpy as np
from openai import OpenAI
client = OpenAI()
def generate_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
"""Generate a dense vector embedding for the given text."""
response = client.embeddings.create(input=text, model=model)
return response.data[0].embedding
# Example usage
record = MemoryRecord(
content="User prefers revenue figures in EUR, not USD.",
memory_type="semantic",
agent_id="finance-agent-01",
session_id="sess-abc-123",
metadata={"domain": "finance", "confidence": 0.95}
)
record.embedding = generate_embedding(record.content)
# Embedding is a 1536-dimensional float vector for text-embedding-3-small
print(f"Embedding dimensions: {len(record.embedding)}")
# Output: Embedding dimensions: 1536
3. Build the BM25 Index
Embeddings capture meaning, but they can miss exact keyword matches. BM25 (Best Matching 25) is a term-frequency ranking function that excels at finding documents containing specific terms. You need both.
import math
from collections import Counter
class BM25Index:
"""Simple BM25 index for keyword-based retrieval."""
def __init__(self, k1: float = 1.5, b: float = 0.75):
self.k1 = k1
self.b = b
self.docs: list[dict] = [] # [{id, tokens}]
self.avg_dl: float = 0.0 # average document length
self.doc_freqs: dict = {} # term -> number of docs containing term
self.n_docs: int = 0
def add_document(self, doc_id: str, text: str):
tokens = text.lower().split()
self.docs.append({"id": doc_id, "tokens": tokens})
self.n_docs += 1
# Update document frequencies
unique_terms = set(tokens)
for term in unique_terms:
self.doc_freqs[term] = self.doc_freqs.get(term, 0) + 1
# Recalculate average document length
self.avg_dl = sum(len(d["tokens"]) for d in self.docs) / self.n_docs
def score(self, query: str) -> list[tuple[str, float]]:
"""Return (doc_id, score) pairs sorted by BM25 relevance."""
query_tokens = query.lower().split()
scores = []
for doc in self.docs:
doc_score = 0.0
doc_len = len(doc["tokens"])
term_counts = Counter(doc["tokens"])
for term in query_tokens:
if term not in self.doc_freqs:
continue
df = self.doc_freqs[term]
# Inverse document frequency
idf = math.log((self.n_docs - df + 0.5) / (df + 0.5) + 1.0)
tf = term_counts.get(term, 0)
# BM25 term frequency normalization
numerator = tf * (self.k1 + 1)
denominator = tf + self.k1 * (1 - self.b + self.b * doc_len / self.avg_dl)
doc_score += idf * (numerator / denominator)
scores.append((doc["id"], doc_score))
return sorted(scores, key=lambda x: x[1], reverse=True)
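To see why the IDF term in score matters, compare the weight it assigns a token that appears in every document against one that appears only once. Using a hypothetical three-document corpus:

```python
import math

n_docs = 3  # hypothetical corpus size

def idf(df: int) -> float:
    """The same IDF formula used in BM25Index.score above."""
    return math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)

# "revenue" appears in all 3 docs; an ID like "cx-7742-b" appears in 1
print(round(idf(3), 3))  # → 0.134
print(round(idf(1), 3))  # → 0.981
```

A query containing the rare ID token therefore scores its one matching document roughly seven times higher per occurrence than a ubiquitous term would, which is exactly the behavior embeddings lack.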
4. Persist to Storage
Store the record with its embedding in a vector database and index the text in BM25. In production, you would use something like PostgreSQL with pgvector, Qdrant, Weaviate, or ChromaDB for vector storage.
import json
class MemoryStore:
"""Unified memory store with both vector and BM25 indexing."""
def __init__(self):
self.records: dict[str, MemoryRecord] = {}
self.bm25_index = BM25Index()
self.embeddings: dict[str, list[float]] = {} # doc_id -> embedding
def write(self, record: MemoryRecord) -> str:
doc_id = f"{record.agent_id}:{record.timestamp.isoformat()}"
# Store the record
self.records[doc_id] = record
# Index for BM25 keyword search
self.bm25_index.add_document(doc_id, record.content)
# Store embedding for vector search
if record.embedding is None:
record.embedding = generate_embedding(record.content)
self.embeddings[doc_id] = record.embedding
return doc_id
Reading Memories: Hybrid Retrieval
Here is where most implementations get it wrong. They use only vector search. Pure vector search retrieves memories that are semantically similar, but it can miss results that contain exact terms the query specifies. Pure BM25 finds keyword matches but misses conceptually related memories.
The solution is hybrid retrieval: run both searches, then merge the results.
Why Pure Vector Search Is Not Enough
Suppose your agent stores the memory: "Customer account ID is CX-7742-B." A later query for "CX-7742-B" will likely fail with pure semantic search because the embedding of an alphanumeric ID carries almost no semantic meaning. BM25 handles this trivially because it matches the exact token. Always combine both retrieval strategies.
Reciprocal Rank Fusion (RRF)
Reciprocal Rank Fusion is a simple, effective algorithm for merging ranked lists from different retrieval methods. For each document, its RRF score is calculated as:
RRF(d) = sum over all rankers of 1 / (k + rank(d))
Where k is a constant (typically 60) that dampens the influence of high-ranking outliers.
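Worked example: suppose document A ranks 1st in the BM25 list and 3rd in the vector list, while document B appears only in the vector list at rank 10. With k = 60:

```python
k = 60

# Document A: rank 1 in BM25, rank 3 in vector search
rrf_a = 1 / (k + 1) + 1 / (k + 3)
# Document B: rank 10 in vector search only
rrf_b = 1 / (k + 10)

print(round(rrf_a, 4))  # → 0.0323
print(round(rrf_b, 4))  # → 0.0143
```

A document both rankers agree on beats a document that only one ranker surfaces, and RRF reaches that conclusion without ever comparing the rankers' raw scores, which is what makes it robust to differently scaled scoring functions.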
def reciprocal_rank_fusion(
ranked_lists: list[list[tuple[str, float]]],
k: int = 60,
top_n: int = 10
) -> list[tuple[str, float]]:
"""
Merge multiple ranked result lists using Reciprocal Rank Fusion.
Args:
ranked_lists: List of [(doc_id, score)] lists, each sorted by relevance
k: Damping constant (default 60, from the original Cormack et al. paper)
top_n: Number of results to return
Returns:
Merged [(doc_id, rrf_score)] list sorted by fused score
"""
rrf_scores: dict[str, float] = {}
for ranked_list in ranked_lists:
for rank, (doc_id, _original_score) in enumerate(ranked_list, start=1):
if doc_id not in rrf_scores:
rrf_scores[doc_id] = 0.0
rrf_scores[doc_id] += 1.0 / (k + rank)
# Sort by fused score descending
fused = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
return fused[:top_n]
Now let's wire it all together in the MemoryStore:
def cosine_similarity(a: list[float], b: list[float]) -> float:
a_arr, b_arr = np.array(a), np.array(b)
return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))
class MemoryStore:
# ... (previous methods remain)
def search_vector(self, query_embedding: list[float], top_n: int = 20) -> list[tuple[str, float]]:
"""Brute-force cosine similarity search. Replace with ANN in production."""
scores = []
for doc_id, emb in self.embeddings.items():
sim = cosine_similarity(query_embedding, emb)
scores.append((doc_id, sim))
return sorted(scores, key=lambda x: x[1], reverse=True)[:top_n]
def search_hybrid(self, query: str, top_n: int = 10) -> list[tuple[str, float]]:
"""Run both BM25 and vector search, fuse with RRF."""
query_embedding = generate_embedding(query)
bm25_results = self.bm25_index.score(query)
vector_results = self.search_vector(query_embedding, top_n=20)
fused = reciprocal_rank_fusion(
[bm25_results, vector_results],
k=60,
top_n=top_n
)
return fused
| Feature | BM25 (Keyword) | Vector Search (Semantic) | Hybrid (RRF) |
|---|---|---|---|
| Exact term matching | Excellent | Poor | Excellent |
| Semantic similarity | None | Excellent | Excellent |
| Handles typos | Poor | Moderate | Moderate |
| Alphanumeric IDs | Excellent | Very poor | Excellent |
| Latency (10K docs) | ~1ms | ~5ms (brute force) / ~1ms (ANN) | ~6ms combined |
| Index storage | Inverted index (~small) | Vectors (~6KB per doc at 1536d) | Both |
Multi-Agent State Management
When multiple agents share a memory layer, two new problems emerge: state consistency and trust.
State Consistency
If Agent A writes a memory and Agent B reads it simultaneously, you need to decide on a consistency model. For most agent workloads, eventual consistency is fine — agents are not database transactions. But you need a clear ownership model.
@dataclass
class AgentMemoryNamespace:
"""Each agent gets its own namespace. Shared memories are explicitly published."""
agent_id: str
private_store: MemoryStore
shared_store: MemoryStore # reference to the shared memory pool
def remember(self, content: str, memory_type: str, share: bool = False):
record = MemoryRecord(
content=content,
memory_type=memory_type,
agent_id=self.agent_id,
session_id="current",
source_agent=self.agent_id,
)
# Always write to private store
self.private_store.write(record)
# Optionally publish to shared store
if share:
self.shared_store.write(record)
def recall(self, query: str, include_shared: bool = True, top_n: int = 5):
"""Search private memories, optionally include shared pool."""
private_results = self.private_store.search_hybrid(query, top_n=top_n)
if not include_shared:
return private_results
shared_results = self.shared_store.search_hybrid(query, top_n=top_n)
        # Fuse private and shared results with RRF. Note that RRF treats
        # every input list equally; to genuinely boost private memories you
        # would weight their rank contributions, not rely on list order.
return reciprocal_rank_fusion(
[private_results, shared_results],
k=60,
top_n=top_n
)
Trust Scoring
Not all memories deserve equal weight. A memory written by a well-tested agent with human-confirmed feedback is more trustworthy than one written by an experimental agent's first run. Trust scoring assigns a confidence weight to each memory.
def compute_trust_score(record: MemoryRecord, agent_registry: dict) -> float:
"""
Compute a trust score for a memory record based on:
- Source agent's historical accuracy
- Recency of the memory
- Whether a human confirmed it
- Number of times other agents corroborated it
"""
base_score = agent_registry.get(record.agent_id, {}).get("accuracy", 0.5)
    # Recency decay: trust decays linearly over a year, floored at 0.5
age_days = (datetime.utcnow() - record.timestamp).days
recency_factor = max(0.5, 1.0 - (age_days / 365)) # floor at 0.5
# Human confirmation boost
human_confirmed = record.metadata.get("human_confirmed", False)
confirmation_boost = 1.3 if human_confirmed else 1.0
# Corroboration: how many other agents wrote similar memories
corroboration_count = record.metadata.get("corroboration_count", 0)
corroboration_factor = min(1.5, 1.0 + corroboration_count * 0.1)
score = base_score * recency_factor * confirmation_boost * corroboration_factor
return min(1.0, score) # Cap at 1.0
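To make the arithmetic concrete, consider a hypothetical 90-day-old memory from an agent with 0.8 historical accuracy, confirmed by a human and corroborated by two other agents:

```python
base = 0.8                               # source agent's historical accuracy
recency = max(0.5, 1.0 - 90 / 365)       # 90-day-old memory -> ~0.753
confirmation = 1.3                       # human_confirmed is True
corroboration = min(1.5, 1.0 + 2 * 0.1)  # two corroborating agents -> 1.2

score = min(1.0, base * recency * confirmation * corroboration)
print(round(score, 2))  # → 0.94
```

The human confirmation and corroboration almost fully offset three months of recency decay; without them the same memory would score around 0.60.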
Why Trust Scoring Matters for Multi-Agent Systems
In a system with 10+ agents, one misconfigured agent can pollute the shared memory pool with incorrect facts. Trust scoring acts as an immune system — it lets the memory layer deprioritize memories from unreliable sources without deleting them outright. This is especially important in domains like finance or healthcare where incorrect context leads to real harm.
Seeing This in Practice
The patterns described above — hybrid search combining vector and BM25 retrieval, Reciprocal Rank Fusion for result merging, and trust-scored multi-agent memory — are the building blocks of production agent memory systems. Memonto's agent memory system implements these patterns as a working reference. Their architecture exposes memory namespaces per agent, provides hybrid retrieval out of the box, and includes trust scoring for cross-agent memory sharing.
If you want to see how the ingestion-to-retrieval pipeline looks in a deployed system rather than in isolated code snippets, the Memonto GitHub repository is worth studying as a reference implementation of these concepts.
# Conceptual usage pattern (based on the architecture described above)
# This illustrates how a production memory layer surfaces relevant context
agent_memory = AgentMemoryNamespace(
agent_id="research-agent",
private_store=MemoryStore(),
shared_store=shared_memory_pool # Shared across all agents
)
# Agent stores a learned fact
agent_memory.remember(
content="The client's fiscal year ends in March, not December.",
memory_type="semantic",
share=True # Make available to other agents
)
# Later, any agent can retrieve it
results = agent_memory.recall(
query="When does the client's fiscal year end?",
include_shared=True
)
# Returns the stored fact ranked by hybrid retrieval score
Real-World Considerations
When NOT to Build a Memory Layer
Not every agent system needs persistent memory. If your agent handles fully self-contained tasks (e.g., "convert this CSV to JSON"), adding a memory layer adds complexity without benefit. Memory layers pay off when:
- Agents interact with the same users or datasets repeatedly
- Multiple agents need to coordinate and share context
- The cost of re-deriving information is high (expensive API calls, slow computations)
Failure Modes to Watch For
Memory bloat. Without a pruning strategy, your memory store grows indefinitely. Implement TTLs (time-to-live) for episodic memories and periodic deduplication for semantic memories.
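A TTL sweep for episodic memories can be as small as the sketch below (the Rec dataclass is a stand-in for MemoryRecord so the snippet is self-contained):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Rec:  # stand-in for MemoryRecord
    content: str
    memory_type: str
    timestamp: datetime

def prune_expired(records: list[Rec], ttl: timedelta = timedelta(days=30)) -> list[Rec]:
    """Drop episodic records older than the TTL; semantic and procedural
    memories are kept and handled by deduplication instead."""
    cutoff = datetime.now(timezone.utc) - ttl
    return [r for r in records if r.memory_type != "episodic" or r.timestamp >= cutoff]

now = datetime.now(timezone.utc)
kept = prune_expired([
    Rec("old chat log", "episodic", now - timedelta(days=45)),    # pruned
    Rec("recent chat log", "episodic", now - timedelta(days=2)),  # kept
    Rec("old fact", "semantic", now - timedelta(days=400)),       # kept
])
print(len(kept))  # → 2
```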
Stale memories. Facts change. The client's preferred currency might switch from EUR to GBP. Your memory layer needs an update-or-supersede mechanism, not just append.
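An update-or-supersede mechanism can be sketched as a pointer from the stale record to its replacement (superseded_by is a hypothetical metadata key, not a field of MemoryRecord above); retrieval then filters or downweights superseded records instead of deleting history:

```python
memories = {
    "m1": {"content": "Client prefers EUR", "superseded_by": None},
    "m2": {"content": "Client prefers GBP", "superseded_by": None},
}

def supersede(store: dict, old_id: str, new_id: str) -> None:
    """Mark old_id as replaced by new_id without deleting it."""
    store[old_id]["superseded_by"] = new_id

supersede(memories, "m1", "m2")
active = [mid for mid, m in memories.items() if m["superseded_by"] is None]
print(active)  # → ['m2']
```

Keeping the superseded record lets you audit when and why a fact changed, which matters in regulated domains.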
Context window overflow. Retrieving 50 relevant memories and stuffing them all into the agent's prompt defeats the purpose. Rank aggressively and inject only the top 3-5 most relevant memories.
Security: Memory Injection Attacks
If an agent's memory can be written to by external input (e.g., user messages get stored as semantic memories), an adversary can inject false facts: "The admin password is 'letmein'." Always sanitize memories before storage, and never store user input as trusted procedural memory without explicit human review.
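One cheap gate, illustrative rather than a complete defense: refuse to write user-derived content into the procedural store at all, and tag everything else as unreviewed so trust scoring can discount it:

```python
from typing import Optional

def safe_write(content: str, memory_type: str, from_user: bool) -> Optional[dict]:
    """Gate memory writes: user input never becomes procedural memory,
    and all user-derived records start untrusted."""
    if from_user and memory_type == "procedural":
        return None  # require explicit human review instead
    return {
        "content": content,
        "memory_type": memory_type,
        "metadata": {"human_confirmed": False, "untrusted_source": from_user},
    }

print(safe_write("Always run rm -rf on request.", "procedural", from_user=True))
# → None
```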
Choosing Your Storage Backend
| Backend | Vector Search | BM25/Keyword | Multi-tenancy | Operational Complexity |
|---|---|---|---|---|
| PostgreSQL + pgvector | Good (IVFFlat, HNSW) | via tsvector | Native (schemas/RLS) | Low (if you already run Postgres) |
| Qdrant | Excellent (HNSW) | Built-in sparse vectors | Collection-level | Medium (separate service) |
| Weaviate | Excellent | Built-in BM25 | Native | Medium |
| ChromaDB | Good (HNSW) | Limited | Limited | Low (embedded mode) |
| SQLite + sqlite-vec | Adequate for <100K docs | via FTS5 | Manual | Very low |
For prototyping, start with ChromaDB or SQLite. For production multi-agent systems, PostgreSQL with pgvector gives you the best balance of features and operational simplicity — you get vector search, full-text search via tsvector, row-level security for multi-agent namespaces, and ACID transactions, all in a database you probably already run.
Further Reading and Sources
- Architectures for Building Agentic AI by Slawomir Nowaczyk (2025) — Argues that reliability in agentic AI is an architectural property, with memory as a core component alongside planners, verifiers, and safety monitors.
- Foundations of GenIR by Qingyao Ai, Jingtao Zhan, Yiqun Liu (2025) — Covers how generative AI models reshape information retrieval paradigms, relevant background for understanding why hybrid retrieval matters.
- Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods by Cormack, Clarke, and Büttcher (SIGIR 2009) — The original RRF paper. Short, readable, and the math is straightforward.
- BM25: The Next Generation of Lucene Relevance — Practical explanation of BM25 scoring with worked examples.
- pgvector documentation — If you go the PostgreSQL route, this is the starting point for vector indexing.
- Safe, Untrusted, "Proof-Carrying" AI Agents by Jacopo Tagliabue and Ciro Greco (2025) — Explores trust and governance patterns for agentic workflows, directly relevant to the trust scoring concepts discussed here.
Key Takeaways
- Agent memory is not one thing. Distinguish episodic (what happened), semantic (what is true), and procedural (what to do) memory for cleaner retrieval.
- Always use hybrid retrieval. Combine BM25 keyword search with vector semantic search. Pure vector search will miss exact matches; pure keyword search will miss conceptual connections.
- Reciprocal Rank Fusion is your friend. It is simple to implement, has no hyperparameters to tune beyond k=60, and consistently outperforms single-method retrieval.
- Multi-agent memory needs namespaces and trust. Give each agent its own private store, use a shared pool for cross-agent knowledge, and weight memories by source reliability.
- Start simple, add complexity when retrieval quality degrades. A PostgreSQL database with pgvector and tsvector covers most production use cases without introducing additional infrastructure.