MongoDB Guests

Posted on May 19

Designing the Agent Memory Schema: Document Shapes for Short-Term, Episodic, and Semantic Memory in MongoDB

#mongodb #ai

This tutorial was written by Yunying Karen Zhang.

If you've been following the recent wave of writing on AI agent memory, you've probably read about the taxonomy of memory types, or how frameworks like LangGraph expose a Memory Store API. All of that is a valuable foundation. But there's a gap: none of it shows you what the documents actually look like.

The taxonomy tells you what the memory types are. The framework docs tell you how to call the API. This post fills the space in between — it's about how to design the collections when you are the one sitting in front of MongoDB Compass with a blank schema.

By the end, you'll have three concrete document shapes, the indexes that back them, and a clear picture of how they wire together. Let's build it from the ground up.

Short-Term Memory: The Conversation Window

What it stores

Short-term memory is the live message log for an active session. It answers the question: what has been said in this conversation so far? It also holds a rolling summary—a compressed version of older turns that gets substituted in when the raw message log would overflow the model's context window.

This is the most transient of the three stores. Once a session ends, this data is largely superseded by the episodic record covered in the next section, which makes it a natural candidate for TTL expiration.

Document shape

Below is a representative document for a live conversation session. We'll walk through each field afterward:

{
  _id: ObjectId("..."),
  thread_id: "thread_abc123",
  user_id: "user_9921",
  messages: [
    {
      role: "user",
      content: "What's the status of my refund?",
      timestamp: ISODate("2026-04-29T14:15:00Z"),
      token_count: 12
    },
    {
      role: "assistant",
      content: "Let me look that up for you.",
      timestamp: ISODate("2026-04-29T14:15:03Z"),
      token_count: 9
    }
  ],
  summary: "User initiated chat at 14:00 regarding general policy; now specifically asking for refund status on order #8842.",
  summary_updated_at: ISODate("2026-04-29T14:10:00Z"),
  created_at: ISODate("2026-04-29T14:00:00Z"),
  expires_at: ISODate("2026-04-30T14:00:00Z")
}

Field-by-field walkthrough

thread_id is the primary lookup key. Every read against this collection is "give me everything for this thread," so it's always a point query on this field.
user_id is on every document in every collection in this schema. We always scope queries to a user first. Cross-user memory leakage is an easy mistake and a serious one: never retrieve a thread without a user_id filter unless you have an explicit reason to.
messages[] is an embedded array. Each element carries role, content, timestamp, and token_count. Tracking token count per message lets us calculate how close we are to context window limits without re-tokenizing the whole conversation on every turn.
summary and summary_updated_at work together. When the cumulative token count of messages[] exceeds a threshold (say, 80% of the model's context window), we compress the oldest turns into summary, drop those messages from the array, and update summary_updated_at. This timestamp tells us how stale the summary is: if several new messages have arrived since the last summary, we may want to regenerate it before injecting it into the prompt.
expires_at pairs with a TTL index. Sessions shouldn't persist forever; 24 to 48 hours is a reasonable default for most applications.
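The threshold logic from the summary bullet can be sketched as two pure functions: one picks the oldest turns to fold into summary, the other builds the update that stores the new summary and trims the array. This is a minimal sketch, not the author's implementation: the names selectTurnsToCompress and buildCompactionUpdate, the keepRecent floor, and the choice to ignore the existing summary's own tokens are assumptions, and producing the summary text (typically an LLM call) is left to the caller.

```javascript
// Hypothetical helper: pick the oldest turns to fold into `summary`.
// Relies on the per-message token_count field from the document shape above.
// Simplification: the tokens of the existing summary are not counted.
function selectTurnsToCompress(messages, contextWindow, { ratio = 0.8, keepRecent = 4 } = {}) {
  const budget = Math.floor(contextWindow * ratio);
  let total = messages.reduce((sum, m) => sum + m.token_count, 0);
  if (total <= budget) {
    return { toCompress: [], toKeep: messages }; // under the threshold: nothing to do
  }
  // Walk forward from the oldest turn, never touching the newest `keepRecent` turns.
  const maxCompressible = Math.max(0, messages.length - keepRecent);
  let cut = 0;
  while (cut < maxCompressible && total > budget) {
    total -= messages[cut].token_count;
    cut += 1;
  }
  return { toCompress: messages.slice(0, cut), toKeep: messages.slice(cut) };
}

// Hypothetical helper: store the new summary and keep only the last `keepCount` messages.
// `$push` with an empty `$each` plus a negative `$slice` trims the array without appending.
// Assumes no other writer appends to this thread between the read and this update.
function buildCompactionUpdate(threadId, userId, newSummary, keepCount, now = new Date()) {
  return {
    filter: { thread_id: threadId, user_id: userId },
    update: {
      $set: { summary: newSummary, summary_updated_at: now },
      $push: { messages: { $each: [], $slice: -keepCount } },
    },
  };
}
```

A caller would summarize `toCompress` (merged with any existing summary), then pass the resulting filter and update to `db.short_term_memory.updateOne`.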

Embed vs. reference

We embed messages inside the document rather than storing them in a separate collection because the read pattern always fetches the whole thread; there is no use case for fetching a single message in isolation. Embedding keeps the read to a single document fetch and lets us atomically append a new message with a simple $push. The one trade-off is document size (BSON documents are capped at 16 MB), but the summary-and-truncate mechanism keeps it bounded in practice.
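The atomic append described above might look like the following. It is a sketch, assuming a hypothetical helper named buildAppendMessage and token counts supplied by whatever tokenizer your model uses; the filter includes user_id to follow the scoping rule from the field walkthrough.

```javascript
// Hypothetical helper: filter/update pair that appends one turn to a thread.
// tokenCount is computed by the caller's tokenizer (not shown).
function buildAppendMessage(threadId, userId, { role, content, tokenCount }, now = new Date()) {
  return {
    // Scope by user as well as thread, so a wrong thread_id can never touch another user's session.
    filter: { thread_id: threadId, user_id: userId },
    update: {
      $push: {
        messages: { role, content, timestamp: now, token_count: tokenCount },
      },
    },
  };
}

// Example (mongosh):
// const { filter, update } = buildAppendMessage("thread_abc123", "user_9921",
//   { role: "user", content: "What's the status of my refund?", tokenCount: 12 });
// db.short_term_memory.updateOne(filter, update);
```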

Index

db.short_term_memory.createIndex({ thread_id: 1 }, { unique: true })
db.short_term_memory.createIndex({ expires_at: 1 }, { expireAfterSeconds: 0 })

That's all we need here. No vector index — we are never searching this collection by meaning. We always know the thread_id before we read. For more on createIndex options, see the MongoDB documentation.
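For the TTL index to do its job, each session document needs an expires_at value when it is created. A minimal sketch, assuming a hypothetical newSessionDoc helper and a 24-hour default taken from the low end of the range suggested above:

```javascript
const SESSION_TTL_HOURS = 24; // hypothetical default; the article suggests 24 to 48 hours

// Hypothetical helper: initial short_term_memory document for a new thread.
function newSessionDoc(threadId, userId, now = new Date(), ttlHours = SESSION_TTL_HOURS) {
  return {
    thread_id: threadId,
    user_id: userId,
    messages: [],
    summary: null,
    summary_updated_at: null,
    created_at: now,
    // With expireAfterSeconds: 0, MongoDB's TTL monitor deletes the document on a
    // pass after this time; the monitor runs periodically, so removal is not instant.
    expires_at: new Date(now.getTime() + ttlHours * 60 * 60 * 1000),
  };
}

// Example (mongosh): db.short_term_memory.insertOne(newSessionDoc("thread_abc123", "user_9921"));
```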

Episodic Memory: What the Agent Actually Did

What it stores

Episodic memory is a record of a completed agent run. Not what was said — that's the chat log. What was done: which tools were called, what their inputs and outputs were, and what the overall outcome was.

Think of it as the agent's after-action report. When a future session asks "Did I already try to book that flight?" or "What happened the last time this user asked about a refund?", episodic memory is where we look.

Document shape

{
  _id: ObjectId("..."),
  episode_id: "ep_20260429_xk92",
  thread_id: "thread_abc123",
  user_id: "user_9921",
  summary: "Looked up refund status for order #8842. Confirmed refund is processing, ETA 3-5 business days.",
  tool_calls: [
    {
      tool: "lookup_order",
      inputs: { order_id: "8842" },
      output: { status: "refund_processing", eta_days: 5 },
      success: true,
      called_at: ISODate("2026-04-29T14:01:05Z")
    }
  ],
  outcome: "succeeded",
  tags: ["refund", "order-lookup"],
  started_at: ISODate("2026-04-29T14:01:00Z"),
  ended_at: ISODate("2026-04-29T14:01:08Z")
}

Field-by-field walkthrough

episode_id uniquely identifies this run. It becomes the source_episode_id on any semantic memories that get extracted from this episode, giving us a clear provenance trail.
thread_id and user_id link the episode back to the session it came from. thread_id lets us find all episodes that share a conversation context; user_id lets us find all episodes for a given user across all sessions.
summary is a human-readable description of what happened, written or generated at episode close. We keep it factual and outcome-oriented: what was the goal, what was done, what was the result. This is the field we search when a future agent needs to recall prior work.
tool_calls[] is the detailed trace. We store inputs and outputs here so we can diagnose failures, avoid repeating failed approaches, and give the agent evidence to reason from. For example, "last time I called lookup_order with this ID, it returned status not_found." Note that, to avoid hitting the 16MB BSON document limit, you should implement a truncation policy for large tool outputs or cap the maximum number of tool calls per episode. If your agents perform hundreds of steps with heavy payloads, consider moving these traces to a separate tool_call_logs collection referenced by episode_id. For this reference schema, we assume bounded episodes where the action history remains within the document limit.
outcome is a controlled vocabulary field: succeeded, failed, or abandoned. It's a fast filter; if a future agent is looking for precedent, it might specifically want the last successful episode of a given type.
tags[] are optional but valuable for retrieval. They let us filter episodic recall to a domain (for example, "show me recent refund episodes for this user") without running a full-text search on summary.
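The truncation policy mentioned under tool_calls[] could be as simple as the following sketch. The limits, the helper names, and the extra tool_calls_dropped counter are hypothetical; serialized length is measured in JavaScript string characters, which only approximates BSON size.

```javascript
// Hypothetical limits; tune to your payloads. These are not MongoDB settings.
const MAX_TOOL_CALLS = 50;
const MAX_OUTPUT_CHARS = 4000;

// Replace an oversized output with a truncated preview plus its original size.
function truncateToolCall(call, maxChars = MAX_OUTPUT_CHARS) {
  const serialized = JSON.stringify(call.output ?? null);
  if (serialized.length <= maxChars) return call;
  return {
    ...call,
    output: { truncated: true, preview: serialized.slice(0, maxChars), original_chars: serialized.length },
  };
}

// Keep at most `maxCalls` of the most recent calls and record how many were dropped.
function boundToolCalls(calls, maxCalls = MAX_TOOL_CALLS) {
  const kept = calls.slice(Math.max(0, calls.length - maxCalls)).map((c) => truncateToolCall(c));
  return { tool_calls: kept, tool_calls_dropped: calls.length - kept.length };
}
```

Keeping the most recent calls favors the end of the run, which is often where the outcome is decided. If early steps matter more for your agents, keep the head instead, or move full traces to the separate tool_call_logs collection suggested above.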

Why this is different from the chat log

The distinction matters in practice. The chat log captures conversation: turns, tone, and clarifications. The episode captures action: what the agent committed to and what happened. We need both, and they serve different retrieval purposes. We pull the chat log to reconstruct conversational context; we pull the episode to reconstruct operational history.

Index

db.episodic_memory.createIndex({ episode_id: 1 }, { unique: true })
db.episodic_memory.createIndex({ user_id: 1, started_at: -1 })
db.episodic_memory.createIndex({ user_id: 1, tags: 1, started_at: -1 })
db.episodic_memory.createIndex(
  { summary: "text" },
  { default_language: "english" }
)

The unique index on episode_id is our primary integrity guard, ensuring that provenance links from other collections remain stable. The compound index on user_id + started_at serves double duty: it powers the most common "most recent N episodes" queries and also supports background retention sweeps (e.g., deleting records older than 90 days), as long as each sweep is scoped to a user_id. A bare started_at range with no user_id equality can't use this index's prefix efficiently.

Adding tags to the compound index supports filtered recall by domain, allowing the agent to retrieve history specific to a topic like "refunds" or "billing." Because tags is an array, this becomes a multikey index, and MongoDB enforces an important constraint here: in a compound multikey index, each indexed document can have at most one array-valued field. If you later extend this schema with another array (like tools_used[]), you cannot index it together with tags in one compound index; the index build fails if existing documents have both fields as arrays, and later inserts of such documents fail. Finally, the text index on summary handles keyword-based discovery, such as exact order numbers or tool names, that vector search often misses.
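The episodic read paths and the per-user retention sweep described above translate into query shapes like these. This is a sketch under assumptions: the builder names are hypothetical, the 90-day window comes from the example above, and a real sweep job would iterate over user_ids rather than issue one global delete.

```javascript
// Hypothetical builders for the episodic read paths. Each returns plain objects
// to pass to find()/sort()/limit() in mongosh or a driver.
function recentEpisodes(userId, limit = 10) {
  // Intended to be served by { user_id: 1, started_at: -1 }.
  return { filter: { user_id: userId }, sort: { started_at: -1 }, limit };
}

function recentEpisodesByTag(userId, tag, limit = 10) {
  // Equality on an array field matches documents whose tags contain `tag`;
  // intended to be served by { user_id: 1, tags: 1, started_at: -1 }.
  return { filter: { user_id: userId, tags: tag }, sort: { started_at: -1 }, limit };
}

function keywordEpisodes(userId, terms, limit = 10) {
  // $text uses the text index on summary; results are ordered by relevance score.
  return {
    filter: { user_id: userId, $text: { $search: terms } },
    projection: { score: { $meta: "textScore" } },
    sort: { score: { $meta: "textScore" } },
    limit,
  };
}

// Retention sweep for one user: scoping by user_id lets the { user_id, started_at } prefix apply.
function retentionSweepFilter(userId, now = new Date(), retentionDays = 90) {
  const cutoff = new Date(now.getTime() - retentionDays * 24 * 60 * 60 * 1000);
  return { user_id: userId, started_at: { $lt: cutoff } };
}

// Example (mongosh):
// const q = recentEpisodesByTag("user_9921", "refund", 5);
// db.episodic_memory.find(q.filter).sort(q.sort).limit(q.limit);
// db.episodic_memory.deleteMany(retentionSweepFilter("user_9921"));
```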

Semantic Memory: What the Agent Knows About the User

What it stores

Semantic memory is the long-term knowledge store: persistent facts and preferences that should survive across sessions indefinitely. This is where the agent remembers that a user prefers metric units, dislikes upsells, or has a standing instruction to always confirm before booking.

Unlike short-term and episodic memory, we retrieve semantic memories by meaning, not by session or timestamp. This is the collection that genuinely needs a vector index.

Document shape

{
  _id: ObjectId("..."),
  memory_id: "mem_u9921_0047",
  user_id: "user_9921",
  type: "preference",
  content: "User prefers distances and weights in metric units.",
  embedding: [0.023, -0.117, 0.204, ...],
  source_episode_id: "ep_20260429_xk92",
  strength: 0.87,
  last_accessed_at: ISODate("2026-04-29T14:01:00Z"),
  created_at: ISODate("2026-04-01T09:15:00Z")
}

Field-by-field walkthrough

user_id is critical here. We should always filter by user_id before running vector search. MongoDB Atlas Vector Search supports pre-filters on indexed fields — we use them. Running a pure ANN search across all users and then filtering the results is both slower and a potential data isolation bug waiting to happen.
type is a controlled vocabulary: preference (the user wants something a certain way), fact (something true about the user or their context), or instruction (an explicit standing directive). This lets us selectively inject memories by type depending on what the current task needs.
content is the plain-text statement that gets injected into the prompt. We keep it short and declarative, with one fact per document. Chunking multiple facts into a single document makes both retrieval and decay harder to reason about.
embedding is the vector representation of content, generated at write time. It is often prototyped as a BSON array of doubles, but for production systems you should store it as BSON BinData with the vector subtype. BinData vectors take roughly three times less disk space than arrays of doubles, and the vector subtype also supports quantized int8 and binary (int1) representations, which Atlas Vector Search can use to reduce RAM requirements by up to 24x with binary quantization. That is an important optimization for large agent memory stores, where the cost of keeping high-dimensional vectors in memory can otherwise become prohibitive. On retrieval, we embed the current query and run a nearest-neighbor search filtered by user_id.
strength is a float between 0 and 1 that decays over time and moves back toward 1 on access. We recommend exponential decay with a half-life (e.g., 30 days), combined with a reset on access that closes a fixed fraction of the remaining gap to 1.0. For example, every time a memory is retrieved, you might close 30% of that gap: $strength_{new} = strength_{old} + (1.0 - strength_{old}) \times 0.3$. This keeps frequently used preferences near 1.0, while unreferenced "facts" sink toward the bottom of strength-ranked results over time, allowing the agent's "personality" to evolve with the user. This field is our alternative to TTL expiration for long-term memory: we don't want to hard-delete a preference just because it hasn't been triggered in 90 days, but we do want to deprioritize it in retrieval ranking. To manage this, we use a periodic background sweep (e.g., a daily cron job) that uses the last_accessed_at field to select memories and apply decay, rather than calculating decay at read time, which would add latency to every retrieval.
source_episode_id is the provenance link: it tells us which agent run produced this memory. This matters for auditability and for bulk-invalidating memories that came from a faulty episode (e.g., if a tool malfunctioned). We don't index this field by default, to save on write overhead, since bulk invalidation is typically an infrequent administrative task. However, if your pipeline requires high-frequency memory rollbacks, add a standard index: db.semantic_memory.createIndex({ source_episode_id: 1 }).
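The decay-and-reinforce scheme in the strength bullet can be sketched as follows, assuming the 30-day half-life and 30% reinforcement rate from the example. The helper names are hypothetical, and two simplifications are deliberate: the sweep multiplies strength by a fixed per-interval factor (so it must run on a fixed schedule, or the factor must be computed from the real elapsed time), and memories accessed since the last sweep skip that interval's decay. The reinforce update uses an aggregation-pipeline update (MongoDB 4.2+) so the read-modify-write happens atomically on the server.

```javascript
const HALF_LIFE_DAYS = 30; // example value from the text
const REINFORCE_RATE = 0.3; // close 30% of the gap to 1.0 on each access

// Per-sweep multiplier: applying it once per day for HALF_LIFE_DAYS days halves strength.
function decayFactor(intervalDays, halfLifeDays = HALF_LIFE_DAYS) {
  return Math.pow(2, -intervalDays / halfLifeDays);
}

// Client-side version of the reset formula, useful for tests and offline scoring.
function reinforce(strength, rate = REINFORCE_RATE) {
  return strength + (1 - strength) * rate;
}

// Hypothetical sweep for one user: decay memories not accessed since `since`.
// The filter matches the { user_id, last_accessed_at } index; $mul scales in place.
function buildDecaySweep(userId, since, intervalDays = 1) {
  return {
    filter: { user_id: userId, last_accessed_at: { $lt: since } },
    update: { $mul: { strength: decayFactor(intervalDays) } },
  };
}

// Hypothetical on-access update: a pipeline-style update applies the reset formula
// server-side and stamps last_accessed_at with the server clock ($$NOW).
function buildReinforceUpdate(memoryId, userId, rate = REINFORCE_RATE) {
  return {
    filter: { memory_id: memoryId, user_id: userId },
    update: [
      {
        $set: {
          strength: { $add: ["$strength", { $multiply: [{ $subtract: [1, "$strength"] }, rate] }] },
          last_accessed_at: "$$NOW",
        },
      },
    ],
  };
}
```

In mongosh, the sweep's filter and update would go to `db.semantic_memory.updateMany`, and the reinforce pair to `db.semantic_memory.updateOne`.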

Embed vs. reference

Each semantic memory is its own document. We retrieve these semantically and independently — there is no "give me all memories for this thread" read pattern. References would add indirection without benefit. The document-per-memory model also makes it straightforward to update strength and last_accessed_at atomically without touching sibling memories.

Index

// Primary lookup and uniqueness constraint
db.semantic_memory.createIndex({ memory_id: 1 }, { unique: true })

// Compound index for non-vector queries (retrieval ranking)
db.semantic_memory.createIndex({ user_id: 1, type: 1, strength: -1 })

// Optional: provenance lookups and bulk invalidation (add only if rollbacks are frequent)
// db.semantic_memory.createIndex({ source_episode_id: 1 })

// Index for background decay sweeps
db.semantic_memory.createIndex({ user_id: 1, last_accessed_at: 1 })

// Atlas Vector Search Index Definition
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1024,
      "similarity": "cosine"
    },
    {
      "type": "filter",
      "path": "user_id"
    },
    {
      "type": "filter",
      "path": "type"
    }
  ]
}

The vector index definition includes user_id and type as filter fields; this is what enables the $vectorSearch pre-filter. numDimensions must match the output size of your embedding model; 1024 is the default output dimension of voyage-3-large. Always filter by user_id before vector search to prevent cross-user leakage.
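Putting the filter fields to work, a semantic read might be assembled as the aggregation pipeline below. It is a sketch, assuming the index was saved under the hypothetical name semantic_memory_vector_index and that the query vector comes from the same embedding model as the stored vectors. The last two stages rerank the top hits by similarity times strength; that multiplicative blend is one possible choice, not a prescribed formula.

```javascript
// Hypothetical builder for the semantic read path.
function buildSemanticSearch(userId, queryVector, { type, limit = 5, numCandidates = 100 } = {}) {
  // Pre-filter on fields declared as "filter" in the index: user scoping is mandatory, type optional.
  const clauses = [{ user_id: { $eq: userId } }];
  if (type) clauses.push({ type: { $eq: type } });
  const filter = clauses.length === 1 ? clauses[0] : { $and: clauses };
  return [
    {
      $vectorSearch: {
        index: "semantic_memory_vector_index", // assumed index name
        path: "embedding",
        queryVector,
        numCandidates: Math.max(numCandidates, limit), // must be >= limit
        limit,
        filter,
      },
    },
    // Expose the similarity score and drop the large embedding from the results.
    { $addFields: { score: { $meta: "vectorSearchScore" } } },
    { $project: { embedding: 0 } },
    // Rerank the top hits so faded memories fall behind fresher ones with similar scores.
    { $addFields: { rank: { $multiply: ["$score", "$strength"] } } },
    { $sort: { rank: -1 } },
  ];
}

// Example (mongosh):
// db.semantic_memory.aggregate(buildSemanticSearch("user_9921", queryVector, { type: "preference" }));
```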

The Referential Model: How the Three Collections Wire Together

The three join keys are user_id, thread_id, and episode_id. Here is how they move through a session lifecycle:

Session starts
  → short_term_memory document created
      { thread_id: "thread_abc123", user_id: "user_9921", messages: [] }

Turns accumulate
  → messages[] grows via $push
  → summary updated when token budget is reached

Session ends / agent run completes
  → episodic_memory record written
      { episode_id: "ep_...", thread_id: "thread_abc123", user_id: "user_9921" }

Semantic extraction runs
  → facts and preferences identified from the episode
  → semantic_memory documents written
      { source_episode_id: "ep_...", user_id: "user_9921", ... }
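The handoffs in this lifecycle are mostly about copying join keys forward. The sketch below shows that propagation. The ID formats only imitate the article's examples (ep_20260429_xk92, mem_u9921_0047); makeEpisodeId, the per-episode memory_id scheme, and the initial strength of 1.0 are hypothetical, and embeddings must still be generated for each semantic document before insertion.

```javascript
// Hypothetical episode ID: "ep_" + UTC date + short random suffix.
function makeEpisodeId(now, rand = Math.random) {
  const ymd = now.toISOString().slice(0, 10).replace(/-/g, "");
  const suffix = rand().toString(36).slice(2, 6).padEnd(4, "0");
  return `ep_${ymd}_${suffix}`;
}

// Session ends: the episode inherits thread_id and user_id from the short-term document.
function buildEpisode(session, { summary, toolCalls, outcome, tags = [], startedAt, endedAt }) {
  return {
    episode_id: makeEpisodeId(endedAt),
    thread_id: session.thread_id,
    user_id: session.user_id,
    summary,
    tool_calls: toolCalls,
    outcome,
    tags,
    started_at: startedAt,
    ended_at: endedAt,
  };
}

// Extraction: one document per fact, each pointing back at its source episode.
// memory_id has a unique index, so it is derived from episode_id plus position.
function buildSemanticMemories(episode, facts, now = new Date()) {
  return facts.map((fact, i) => ({
    memory_id: `mem_${episode.episode_id}_${i}`,
    user_id: episode.user_id,
    type: fact.type,
    content: fact.content,
    // embedding: generate with your embedding model before inserting (not shown)
    source_episode_id: episode.episode_id,
    strength: 1.0,
    last_accessed_at: now,
    created_at: now,
  }));
}
```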

The read pattern per memory type

Short-term: fetch by thread_id. This is a single point query on one document. We always know the thread before we read.
Episodic: query by user_id sorted by started_at descending for recency, or combine with a tags filter for domain-specific recall. Use text search on summary for keyword-based retrieval.
Semantic: $vectorSearch with a user_id pre-filter, optionally narrowed by type. Rank results by vector similarity, then optionally break ties or rerank by strength.

What we never do

Join across these collections at query time. Each retrieval path is independent by design. The thread_id and episode_id fields are provenance links, useful for audit, debugging, and bulk operations, not foreign keys that we join on in the hot path.
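Because the three retrieval paths never join, they can run concurrently when assembling a prompt. A minimal sketch, assuming a hypothetical readers object whose three functions wrap your actual driver calls (findOne on short_term_memory, find on episodic_memory, aggregate with $vectorSearch on semantic_memory):

```javascript
// Hypothetical: each reader wraps one collection's query and returns a Promise.
// No reader depends on another's result, so Promise.all can issue them together.
// Every reader receives user_id, keeping each path scoped to the current user.
async function assembleContext(readers, { userId, threadId, queryVector }) {
  const [session, episodes, memories] = await Promise.all([
    readers.shortTerm({ thread_id: threadId, user_id: userId }),
    readers.episodic({ user_id: userId }),
    readers.semantic({ user_id: userId, queryVector }),
  ]);
  return { session, episodes, memories };
}
```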

Index Strategy: What Actually Needs a Vector Index (and What Doesn't)

The most common mistake in early agent memory implementations is adding vector indexes everywhere. Here is where we actually need them.

Short-term memory: no vector index. We always retrieve by thread_id, so we know the session before we read. A vector index here would never be used and would only add indexing overhead on every write.
Episodic memory: maybe, later. A text index on summary covers the majority of episodic recall needs, such as keyword-based retrieval like "find episodes involving refunds for this user." A vector index is only justified when we need semantic episode recall ("when did I last help this user with X?"), and most applications don't need this at launch. Start without it, and add it only when we have a concrete retrieval requirement that text search can't satisfy.
Semantic memory: yes, this is the one. Semantic memory is our knowledge retrieval layer. Retrieval by meaning is the whole point. This collection needs a vector index from day one.

TTL strategy

Short-term memory is the natural TTL candidate. We wire up the TTL index on expires_at and let MongoDB handle the cleanup. A 24- to 48-hour expiry is reasonable for most applications.
Semantic memory should not use TTL expiration. A preference learned six months ago and not recently triggered should be deprioritized in retrieval, not deleted entirely. We use the strength field for soft decay and reserve hard deletion for explicit user requests or compliance requirements.

Summary

Here are the three schemas as a single reference:

Attribute	short_term_memory	episodic_memory	semantic_memory
Unique lookup key	`thread_id`	`episode_id`	`memory_id`
Scope key	`user_id`	`user_id`	`user_id`
Retrieved by	`thread_id`	`user_id` + recency/tags	vector search + `user_id`
Vector index	No	Optional	Yes
TTL	TTL index	No	No
Expiry mechanism	`expires_at` TTL	`started_at` policy	strength decay

The taxonomy (short-term, episodic, semantic) maps cleanly onto three MongoDB collections with distinct retrieval patterns and distinct index strategies. Each collection is optimized for how it is actually read, not for a generalized "memory store" abstraction.

Short-term memory is optimized for high-speed session lookups and uses native TTL indexes for automatic cleanup.
Episodic memory serves as the immutable audit trail. We reuse the { user_id, started_at } index for both recency-based retrieval and per-user retention sweeps, avoiding the overhead of a separate age-based index.
Semantic memory acts as the long-term knowledge base. It is the only collection requiring a vector index and uses a background "strength decay" sweep based on last_accessed_at to keep retrieval results fresh and relevant.

For the conceptual background on agent memory types, the LangGraph Memory Store documentation and Richmond's taxonomy post are the right starting points. This schema is what you build once you've read those and are ready to sit down with a database.
