No messages table! The data model behind my own Claude-based chatbot

#mongodb #ai #claude #tutorial

This tutorial was written by Néstor Daza.

This is the second article in a series about building Claudius, my own Claude-based chatbot (GitHub). The prologue made the case for building it, and for choosing MongoDB as its foundation.

Open the conversations collection in Claudius’ database and you find the usual fields of a thread header but nothing else: a userId, a title, some timestamps, and so on, but no array of messages, no messages collection sitting beside it either! The text of every conversation lives somewhere else entirely, in the LangGraph checkpointer, which I wire up later in this series. This absence is a modeling decision, and how I came up with the database schema for my chatbot is the theme of this article.

If you come from a relational background, you're used to modeling the data first when designing a database. For a project like this, you would start by finding the entities and normalizing them, and the final schema would come out of the data's structure: a conversations table and a messages table with a foreign key between them, because that is what the data looks like.

Document modeling runs the other way. You start from how the application reads and writes, and the shape of the document follows the access patterns. Claudius never reads conversation messages without the agent's full working state wrapped around them, and that state is persisted using the LangGraph checkpointer. A separate messages table would add nothing, since the app would have to join it back to that state on every read. The access pattern says the messages belong with the agent state, so that is where they go, and conversations are left as the lightweight header the list view actually needs.

That inversion, modeling around use rather than around the data, runs through everything below.

Schema-flexible is not schemaless

This is a misconception many people carry, and it is worth killing on the way in. A document database does not mean no schema; it means a flexible, use-case-based data model. A relational engine holds the schema for you through column types and constraints; MongoDB will store documents in whatever shape your application gives it, and can validate that shape with a collection validator if you want. Claudius keeps the schema in code, as one Zod schema per collection, because that single definition gives both the runtime check and the TypeScript type.

import { z } from "zod";
import { zObjectId } from "./common";

export const ConversationSchema = z.object({
  _id: zObjectId.optional(),
  userId: zObjectId,
  title: z.string(),
  modelId: z.string(),
  createdAt: z.date(),
  updatedAt: z.date(),
  archived: z.boolean(),
  expiresAt: z.date().optional(),
});

export type Conversation = z.infer<typeof ConversationSchema>;

The reason is drift. Declare a TypeScript interface for a document and a separate validator for the data coming in, and the two will disagree eventually, while the type system will lie to you with complete confidence. Deriving the type from the validator collapses them into one artifact. One definition gives you the runtime check at the boundary and the compile-time type everywhere else. For a relational reader, this is your DDL (Data Definition Language) and your ORM (Object-Relational Mapping) models together in the same structure, which is something relational stacks rarely pull off.

Another advantage is that every external input, whether an incoming HTTP body or a model's structured output, is parsed through Zod at the boundary, so by the time data reaches the logic, it is already validated and typed.

A quick tour of the model

There are eight collections in this phase: users, conversations, memories, documents, chunks, usage_events, settings, and, arriving later, jobs. Two more, checkpoints and checkpoint_writes, belong entirely to the checkpointer and are never written to directly, so I do not model them. Most of these schemas are unremarkable once you see the access-pattern framing, but two modeling moves are worth pausing on.

The first is settings. It holds a handful of global singletons, each a different shape, each identified by a string _id: the member allowlist, the model catalog, the per-role tiers, and the guest spend circuit breaker. They share one collection, so the Zod schema is a discriminated union based on _id, and parsing a settings document narrows it to the correct shape from its _id literal.

export const SettingsSchema = z.discriminatedUnion("_id", [
  AllowlistSettingsSchema,        // _id: "allowlist"
  ModelCatalogSettingsSchema,     // _id: "modelCatalog"
  TiersSettingsSchema,            // _id: "tiers"
  GuestCircuitBreakerSettingsSchema, // _id: "guestCircuitBreaker"
]);
export type Settings = z.infer<typeof SettingsSchema>;

In relational terms, this is a single typed configuration table whose rows legitimately carry different columns, which is usually awkward to model relationally without resorting to a JSON column, and here comes out clean and readable.
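To make the narrowing concrete, here is a minimal sketch in plain TypeScript of what the inferred Settings type lets the compiler do once a document has been parsed. It shows only two of the four singletons, and the fields inside each shape (emails, models) are assumptions for illustration, not Claudius' real schemas.

```typescript
// Hypothetical shapes: the field names are illustrative assumptions,
// not the real Claudius schemas.
type AllowlistSettings = { _id: "allowlist"; emails: string[] };
type ModelCatalogSettings = {
  _id: "modelCatalog";
  models: { id: string; label: string }[];
};

// The same kind of union that z.infer yields for a union discriminated on _id.
type SettingsSketch = AllowlistSettings | ModelCatalogSettings;

// Switching on the _id literal narrows the document to one shape,
// so each branch can only touch the fields that shape declares.
function describeSettings(doc: SettingsSketch): string {
  switch (doc._id) {
    case "allowlist":
      return `allowlist with ${doc.emails.length} member(s)`;
    case "modelCatalog":
      return `catalog with ${doc.models.length} model(s)`;
  }
}
```

Reading doc.models inside the "allowlist" branch would be a compile error, which is the practical benefit of keying the union on _id.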

The second move is to use document embedding as a data modeling technique. The users document carries its dailyMessageCount as an embedded object holding the count and its reset time together, rather than as two flat fields or a side table, because the application always reads and writes them together as a unit. That is the document modeling mantra: data used together is stored together.
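As a rough sketch of why the pair travels together, the function below computes the next counter state for one more message. The property names (count, resetAt) and the 24-hour window are assumptions; the article only says the embedded object holds the count and its reset time.

```typescript
// Hypothetical shape: the source only says the object holds a count
// and its reset time.
type DailyMessageCount = { count: number; resetAt: Date };

// Returns the next counter state for one more message. Because count and
// resetAt live in one embedded object, both are read and replaced together.
function nextDailyMessageCount(
  current: DailyMessageCount,
  now: Date,
): DailyMessageCount {
  if (now.getTime() >= current.resetAt.getTime()) {
    // The window has elapsed: start a new one that resets 24 hours from now
    // (the 24-hour window is an assumption).
    const resetAt = new Date(now.getTime() + 24 * 60 * 60 * 1000);
    return { count: 1, resetAt };
  }
  return { count: current.count + 1, resetAt: current.resetAt };
}
```

The returned object can be written back as a single field on the user document, so the count and its reset time never drift apart.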

Let the database keep the rule

The most satisfying part of this phase is how little policy I had to write. Three product rules that I would normally enforce in application code are carried by the storage engine instead, because I shaped the data so the engine could hold them.

Retention is a TTL index

Guest data in Claudius is ephemeral. Guest-created conversations and memories carry an expiresAt date; member and admin documents leave the field off entirely. A TTL (time-to-live) index with expireAfterSeconds set to 0 tells MongoDB to treat each document as expired at the moment stored in its own expiresAt field. A background task that runs every 60 seconds then deletes expired documents, so removal follows shortly after that moment rather than at the exact instant. Documents without this field are never touched.

// TTL on the guest-only expiresAt field. expireAfterSeconds:0 
// means "expire exactly at the date stored in the field"; 
// documents without the field are never touched, so member/admin
// data is permanent.

await db
  .collection(COLLECTIONS.conversations)
  .createIndex({ expiresAt: 1 }, { name: "expiresAt_ttl", expireAfterSeconds: 0 });

await db
  .collection(COLLECTIONS.memories)
  .createIndex({ expiresAt: 1 }, { name: "expiresAt_ttl", expireAfterSeconds: 0 });

So, guest ephemerality, a product rule, is enforced by the engine. There is no cron job and no cleanup endpoint to remember to run. The relational equivalent is a scheduled sweep; here, the policy lives in the index.

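One detail ties the TypeScript compiler to the database behavior. With exactOptionalPropertyTypes turned on in tsconfig.json, an optional field means the key is absent, not present and set to undefined, so code cannot write expiresAt: undefined by accident. A member or admin document simply omits the key, which is the shape the TTL index ignores. The compiler setting and the storage rule reinforce each other.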
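A small sketch of how that plays out when a document is built: the expiry key is added only for guests and left out entirely for everyone else. The helper name, the reduced field set, and the 24-hour guest window are assumptions for illustration.

```typescript
type Role = "guest" | "member" | "admin";

// Only the fields this sketch needs; the real schema has more.
type RetentionFields = { createdAt: Date; expiresAt?: Date };

// Hypothetical retention window for guest data.
const GUEST_TTL_MS = 24 * 60 * 60 * 1000;

function retentionFields(role: Role, now: Date): RetentionFields {
  const base: RetentionFields = { createdAt: now };
  if (role === "guest") {
    // Guests get a concrete expiry date for the TTL index to act on.
    return { ...base, expiresAt: new Date(now.getTime() + GUEST_TTL_MS) };
  }
  // Members and admins: the key is left out entirely rather than
  // written as `expiresAt: undefined`.
  return base;
}
```

With exactOptionalPropertyTypes enabled, returning { ...base, expiresAt: undefined } from the member branch would be rejected by the compiler.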

Telemetry is a time-series collection

Every billable model call writes one usage_events document, recording the token counts, the latency, and what the call was for. That data is append-only and gets queried along a few low-cardinality dimensions over a time window, which is the textbook shape for a MongoDB time-series collection. It has a timeField, timestamp, and a single metaField, a subdocument that holds the grouping dimensions (userId, modelId, purpose) and sits alongside the measurements.

Two things follow from this choice. A time-series collection has to be created explicitly with a timeseries spec before the first insert (it cannot be created implicitly on first write, the way an ordinary collection can), so the index script creates it behind an existence check. Once it exists, the engine's own machinery becomes visible: list the collections and a system.buckets.usage_events sits next to usage_events, which is where the bucketed storage lives:

[
  {"name":"memories","type":"collection"},
  {"name":"usage_events","type":"timeseries"},
  {"name":"system.buckets.usage_events","type":"collection"},
  {"name":"settings","type":"collection"},
  {"name":"users","type":"collection"},
  {"name":"conversations","type":"collection"},
  {"name":"accounts","type":"collection"},
  {"name":"chunks","type":"collection"},
  {"name":"system.views","type":"collection"}
]

The relational analogy is a metrics or rollup table you would otherwise hand-roll and maintain, except the bucketing is automatic and the query you write against it stays simple.
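The existence check mentioned above can be sketched as follows. This is not the Claudius script: DbLike is a structural stand-in for the two driver methods involved (listCollections and createCollection), and the metaField name "meta" and the "seconds" granularity are assumed values.

```typescript
// Minimal structural stand-ins for the driver methods used here,
// so the sketch compiles without importing the driver.
type TimeSeriesOptions = {
  timeseries: {
    timeField: string;
    metaField: string;
    granularity: "seconds" | "minutes" | "hours";
  };
};

interface DbLike {
  listCollections(filter: { name: string }): { toArray(): Promise<unknown[]> };
  createCollection(name: string, options: TimeSeriesOptions): Promise<unknown>;
}

// Creates usage_events as a time-series collection only if it is missing.
// "meta" and "seconds" are assumptions, not values taken from Claudius.
async function ensureUsageEvents(db: DbLike): Promise<"created" | "skipped"> {
  const existing = await db.listCollections({ name: "usage_events" }).toArray();
  if (existing.length > 0) return "skipped";

  await db.createCollection("usage_events", {
    timeseries: { timeField: "timestamp", metaField: "meta", granularity: "seconds" },
  });
  return "created";
}
```

Running it twice creates the collection once and skips the second time, which is what makes it safe to call on every deploy.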

A pre-filter field is a security control

The project's hardest rule is that a vector search pre-filters by userId and can never return another user's data. That rule is not bolted onto the search results in application code. It is built into the index: userId is declared as a filter field, which is what lets a query constrain the search itself.

Both memories and chunks get an Atlas Vector Search index on the embedding field, with a filter on userId (and on chunks, a second filter on documentId).

await ensureVectorIndex(db, COLLECTIONS.memories, "memories_vector", [
  { type: "vector", path: "embedding", numDimensions: 1024, similarity: "cosine" },
  { type: "filter", path: "userId" },
], created, skipped);

await ensureVectorIndex(db, COLLECTIONS.chunks, "chunks_vector", [
  { type: "vector", path: "embedding", numDimensions: 1024, similarity: "cosine" },
  { type: "filter", path: "userId" },
  { type: "filter", path: "documentId" },
], created, skipped);

Pre-filtering means the userId condition is applied during the indexed search on the same documents being scored, not afterward to whatever the search returns. Post-filtering is not only slower, since you score documents you then throw away, but also a correctness and security risk, since the wrong documents can slip through ranking and limits before your filter ever runs.
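On the query side, the pre-filter is the filter option of the $vectorSearch aggregation stage. The index only declares userId as filterable; each query still has to pass the condition. The builder below is a hedged sketch: the function name and the numCandidates factor are assumptions, and TId stands in for the driver's ObjectId so the block needs no imports.

```typescript
// Builds a $vectorSearch stage that pre-filters by owner.
// TId stands in for the driver's ObjectId so the sketch has no imports.
function memoriesVectorSearchStage<TId>(
  userId: TId,
  queryVector: number[],
  limit = 5,
) {
  return {
    $vectorSearch: {
      index: "memories_vector", // index name used in the article
      path: "embedding",
      queryVector,
      numCandidates: limit * 20, // assumed oversampling factor
      limit,
      // Applied during the search itself, on a field the index
      // declares as a filter.
      filter: { userId: { $eq: userId } },
    },
  };
}
```

The stage would go first in the pipeline passed to the collection's aggregate method, since $vectorSearch has to be the first stage of any pipeline it appears in.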

This is also why chunks denormalizes userId. A vector search has no notion of a join, so the filter has to sit on the very documents being searched. A chunk belongs to a document that belongs to a user, but the chunk has to carry the owner itself, or the index cannot enforce the boundary. That is the embed-versus-reference tension made concrete: a chunk references its parent document by id and duplicates the one field the security boundary depends on.

Two gotchas surfaced while creating these indexes, and both are worth saying out loud. A search index can only be built on a collection that already exists, and chunks had no classic index, so the collection did not exist yet the first time the script reached for its vector index:

MongoServerError: Error retrieving collection UUID and view info ::
caused by :: Collection 'claudius.chunks' does not exist. (NamespaceNotFound)

The fix is to add an existence check and create the collection first only if it is missing. The second gotcha is that Atlas builds search indexes asynchronously, so createSearchIndex returns when the build is queued, not when the index is queryable! A rerun during the build must not try to create the index again, so the script also needs a name check that keeps reruns safe.

This is the final function to build the indexes from code using the driver's createSearchIndex, with the checks that keep the script safe to run on every deploy:

import type { Db } from "mongodb";

async function ensureVectorIndex(
  db: Db,
  collectionName: string,
  indexName: string,
  fields: Array<Record<string, unknown>>,
  created: string[],
  skipped: string[],
): Promise<void> {
  // A vector search index can only be built on an existing
  // collection, so create it first if nothing else has (e.g.
  // chunks has no regular index).
  const collections = await db
    .listCollections({ name: collectionName })
    .toArray();
  if (collections.length === 0) {
    await db.createCollection(collectionName);
  }

  const collection = db.collection(collectionName);
  const present = await collection.listSearchIndexes().toArray();
  if (present.some((idx) => idx.name === indexName)) {
    skipped.push(`${collectionName}.${indexName}`);
    return;
  }

  // Atlas builds the index asynchronously; createSearchIndex
  // returns once the build is queued. The name check above 
  // keeps re-runs idempotent.
  await collection.createSearchIndex({
    name: indexName,
    type: "vectorSearch",
    definition: { fields },
  });
  created.push(`${collectionName}.${indexName} (vector)`);
}
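The function above returns as soon as the build is queued. If a script or test needs to wait until the index can actually serve queries, one option is to poll listSearchIndexes and watch the queryable flag it reports. The helper below is a sketch, not part of Claudius: SearchIndexLister is a structural stand-in for the collection, and the attempt count and delay are arbitrary defaults.

```typescript
// Structural stand-in for the one collection method used here.
interface SearchIndexLister {
  listSearchIndexes(name: string): {
    toArray(): Promise<Array<{ name?: string; queryable?: boolean }>>;
  };
}

// Polls until the named search index reports queryable: true.
// Returns false if the index is still not queryable after all attempts.
async function waitUntilQueryable(
  collection: SearchIndexLister,
  indexName: string,
  options: { attempts?: number; delayMs?: number } = {},
): Promise<boolean> {
  const { attempts = 60, delayMs = 5000 } = options;
  for (let i = 0; i < attempts; i++) {
    const [index] = await collection.listSearchIndexes(indexName).toArray();
    if (index?.queryable === true) return true;
    // Not ready yet: pause before asking again.
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }
  return false;
}
```

A deploy script would usually skip this wait, while an integration test that searches immediately after creating the index would not.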

The plumbing that keeps the code honest

Two small things hold the userId rule together in practice. First, userId is an ObjectId everywhere, matching the user’s _id, so filters and lookups line up without conversion. Second, no collection is ever queried through a raw driver call. Every access goes through a typed accessor, conversationsCol(), memoriesCol(), and so on, each returning a typed collection for the right document shape. Centralizing the collection names means a rename happens in one place, and typing each accessor keeps the userId filter visible to the compiler at every call site, which is how a rule that lives in prose stays honest in code.

The whole application and the Auth.js adapter share a single, serverless-safe MongoDB client bound to a single fixed database name, so there is exactly one connection pool. How that database name quietly caused two separate failures is a good story, and it belongs to the next article, where I stand the foundation up.

What this article bought

No chatbot yet, but the shape of the thing is already decided. Retention, the telemetry layout, and the tenant-isolation boundary are properties of the schema and its indexes now, not code I have to remember to run later. That is the payoff of modeling around the access patterns first: the database carries more of the load because you shaped it to. The data model is the foundational piece of the whole app.

Next, we'll stand this foundation up, with identity the client never gets a vote in, a health check that proves the app can reach Atlas and Bedrock, and a look at the unglamorous errors that cost me real time. After that comes the streaming chat backbone, which is where the conversation finally gets a place to live, and I open a real checkpoint document to show you the messages that were never in their own collection.