Ryan Carter

Originally published at stormcloudy.com

Building a Context-Aware AI Chat Without a Vector Database

You can ground an AI chat in your own data without a vector database by assembling the relevant documents directly into the system prompt before each request. No embedding pipeline, no similarity search, no separate infrastructure — just your structured data, formatted cleanly, injected as system context. It works well when your dataset is modest (hundreds of documents, not millions) and naturally segmented into logical groups.

This is the pattern I used building Wiskr, a multi-model chat app that grounds conversations in documents from a connected document store. The rest of this post walks through how to implement it, where it breaks down, and how to upgrade to full RAG when you outgrow it.

TL;DR

  • The pattern: Group documents into named contexts, load active contexts on each request, format them into a system prompt, prepend it to every API call.
  • No vector DB needed: For modest datasets, the model reads structured JSON directly — embeddings and similarity search are unnecessary overhead.
  • Token-limit guardrails: Cap documents per context, summarize long ones, let users pin important ones, then add vector search only when those run out of room.
  • Upgrade path: When you need real RAG later, the context-assembly layer stays put — you just add smarter document selection in front of it.
  • Best fit: Personal assistants, support tools, document Q&A, and any AI feature that needs to reason about a bounded, structured user-specific dataset.

The core idea

A standard LLM chat call looks like this:

const response = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4",
  messages: [
    { role: "user", content: "What's my copay for metformin?" }
  ]
});

The model has no idea who you are or what documents you have. It can only work with what's in the messages array.

The context assembly pattern adds a system message that packages your relevant data before the conversation begins:

const response = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4",
  messages: [
    { role: "system", content: assembledContext },
    { role: "user", content: "What's my copay for metformin?" }
  ]
});

Now the model has your data and can reason against it. The question is how to build assembledContext well.

Step 1: Organize data into contexts

The first thing you need is a way to group related documents. In Wiskr these are called contexts — named buckets like "Medical," "Vehicle," "Insurance," or "House." Each conversation has a set of active contexts the user selects before chatting.

In the database this is a simple structure:

CREATE TABLE contexts (
  id uuid PRIMARY KEY,
  user_id uuid,
  name text,
  created_at timestamptz
);

CREATE TABLE documents (
  id uuid PRIMARY KEY,
  context_id uuid REFERENCES contexts(id),
  title text,
  content jsonb,
  created_at timestamptz
);

Documents belong to contexts. Contexts belong to users. When a chat starts, the user picks which contexts are active — and only those get assembled into the prompt.

Step 2: Load active context documents

When a conversation starts, load the documents for each active context:

async function loadContextDocuments(db, contextIds) {
  const result = await db.query(
    `SELECT c.name as context_name, d.title, d.content
     FROM documents d
     JOIN contexts c ON c.id = d.context_id
     WHERE d.context_id = ANY($1)
     ORDER BY c.name, d.created_at DESC`,
    [contextIds]
  );
  return result.rows;
}

Step 3: Assemble the system prompt

With the documents loaded, format them into a readable system prompt:

function assembleSystemPrompt(documents) {
  // Group documents by context name
  const byContext = documents.reduce((acc, doc) => {
    if (!acc[doc.context_name]) acc[doc.context_name] = [];
    acc[doc.context_name].push(doc);
    return acc;
  }, {});

  const contextBlocks = Object.entries(byContext).map(([contextName, docs]) => {
    const docBlocks = docs.map(doc => `
### ${doc.title}
${JSON.stringify(doc.content, null, 2)}
    `).join('\n');

    return `## ${contextName}\n${docBlocks}`;
  }).join('\n\n');

  return `You are a helpful assistant with access to the user's personal documents.
Use the information below to give accurate, personalized responses.
If the answer isn't in the documents, say so — don't guess.

${contextBlocks}`;
}

Raw JSON is fine for the document content. Current models read it well, and it preserves the structure of your data without you having to write custom serializers for every document type.
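
As a concrete example, a single "Medical" context holding one hypothetical pharmacy-benefits document would render to something like this:

You are a helpful assistant with access to the user's personal documents.
Use the information below to give accurate, personalized responses.
If the answer isn't in the documents, say so — don't guess.

## Medical

### Pharmacy benefits
{
  "plan": "HSA 3000",
  "generic_copay": "$10",
  "brand_copay": "$40"
}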

Step 4: Inject into every request

Pass the assembled context as the system message on every API call, alongside the full conversation history:

async function chat(db, conversationId, userMessage) {
  // Load conversation state
  const conversation = await getConversation(db, conversationId);
  const documents = await loadContextDocuments(db, conversation.activeContextIds);
  const history = await getMessageHistory(db, conversationId);

  // Assemble context
  const systemPrompt = assembleSystemPrompt(documents);

  // Build messages array
  const messages = [
    { role: "system", content: systemPrompt },
    ...history,
    { role: "user", content: userMessage }
  ];

  // Call the model
  const response = await client.chat.completions.create({
    model: conversation.model,
    messages,
  });

  const assistantMessage = response.choices[0].message.content;

  // Save to history
  await saveMessage(db, conversationId, "user", userMessage);
  await saveMessage(db, conversationId, "assistant", assistantMessage);

  return assistantMessage;
}

Handling token limits

The obvious risk with this approach is bloated prompts. If a user has 50 documents in their active contexts, you'll hit token limits fast.

A few practical strategies:

Cap documents per context. The simplest option — include only the N most recent documents per context. For most use cases, the newest 10-15 documents per context are the most relevant anyway.

const result = await db.query(
  `SELECT context_name, title, content
   FROM (
     SELECT c.name AS context_name, d.title, d.content,
            ROW_NUMBER() OVER (
              PARTITION BY d.context_id
              ORDER BY d.created_at DESC
            ) AS rn
     FROM documents d
     JOIN contexts c ON c.id = d.context_id
     WHERE d.context_id = ANY($1)
   ) ranked
   WHERE rn <= 15  -- cap per context; a plain LIMIT would cap the total
   ORDER BY context_name, rn`,
  [contextIds]
);

Summarize large documents. If individual documents are long, run them through a cheap, fast model first to produce a condensed version before assembling the prompt.
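
A minimal sketch of that step, reusing the same OpenAI-compatible client. The helper name and model id are illustrative, not from Wiskr:

async function summarizeDocument(doc, maxWords = 150) {
  // Condense one long document with a cheap model before prompt assembly.
  const response = await client.chat.completions.create({
    model: "openai/gpt-4o-mini", // any cheap, fast model will do
    messages: [
      {
        role: "system",
        content: `Summarize the following document in at most ${maxWords} words. Keep concrete facts, numbers, and dates.`
      },
      { role: "user", content: JSON.stringify(doc.content) }
    ]
  });

  // Keep the title; only the content is condensed.
  return { ...doc, content: response.choices[0].message.content };
}

In practice you'd store the summary on the document row when it's saved, rather than paying for a summarization call on every chat request.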

Let users pin documents. Give users control — a pinned document always gets included, everything else is capped or summarized. This is often more useful than trying to guess relevance automatically.
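
A sketch of the storage side, assuming a pinned flag added to the documents table (the column is my addition, not part of the schema above):

ALTER TABLE documents ADD COLUMN pinned boolean DEFAULT false;

-- Pinned documents sort first, so they survive any per-context cap
SELECT c.name AS context_name, d.title, d.content
FROM documents d
JOIN contexts c ON c.id = d.context_id
WHERE d.context_id = ANY($1)
ORDER BY c.name, d.pinned DESC, d.created_at DESC;

Because pins sort ahead of everything else, they survive the per-context cap as long as users don't pin more documents than the cap allows.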

Add vector search later. When your data grows large enough that capping and pinning don't cut it, vector search is the right next step. You add an embedding column, generate embeddings on save, and query by cosine similarity to find the most relevant documents for each conversation. The context assembly step stays the same — you just get smarter document selection feeding into it.
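
A sketch of what that upgrade can look like in Postgres with the pgvector extension. The 1536 dimension matches OpenAI's text-embedding-3-small; swap in whatever your embedding model produces:

CREATE EXTENSION IF NOT EXISTS vector;

ALTER TABLE documents ADD COLUMN embedding vector(1536);

-- $2 is the embedding of the user's question; <=> is cosine distance
SELECT c.name AS context_name, d.title, d.content
FROM documents d
JOIN contexts c ON c.id = d.context_id
WHERE d.context_id = ANY($1)
ORDER BY d.embedding <=> $2
LIMIT 15;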

When this is the right approach

This pattern works well when:

  • Your data is structured (JSON, not unstructured text blobs)
  • Your dataset is modest (hundreds of documents, not millions)
  • Users naturally segment their data into logical groups
  • You want something working fast without infrastructure overhead

It's a good starting point for any AI feature that needs to reason about user-specific data — support tools, personal assistants, document Q&A, anything where the dataset is bounded and the structure is known.

When you outgrow it, the upgrade path to full RAG is incremental rather than a rewrite. The context assembly layer stays. You just add smarter selection in front of it.

FAQ

When should I use context assembly instead of full RAG with a vector database?

Use context assembly when your dataset is bounded (a few hundred documents per user, max), the documents are already structured (JSON, key-value, or short prose), and users have a natural way to scope which subset is relevant for a conversation. Switch to vector-database RAG when you can't fit the relevant slice in a system prompt, when relevance ranking actually matters, or when content is long-form unstructured text.

How big can the system prompt get before this falls apart?

Modern frontier models accept 200K+ token context windows, but cost and latency both scale with prompt size. As a practical rule, keep the assembled context under ~20K tokens for most consumer use cases — beyond that you'll feel the latency in chat, and the per-request cost adds up fast.
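
If you want a guardrail in code, a rough character-based heuristic (about four characters per token for English text) avoids pulling in a tokenizer. A sketch:

const MAX_CONTEXT_TOKENS = 20000;

function estimateTokens(text) {
  // Rough heuristic: ~4 characters per token for English text.
  return Math.ceil(text.length / 4);
}

// Drop documents from the tail of the list until the prompt fits.
function trimToBudget(documents) {
  const docs = [...documents];
  while (docs.length > 1 && estimateTokens(assembleSystemPrompt(docs)) > MAX_CONTEXT_TOKENS) {
    docs.pop();
  }
  return docs;
}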

Does this work with any LLM provider?

Yes. The pattern is just a system message — every chat-completions API supports it. I've used the same code unchanged across OpenAI, Anthropic, and OpenRouter.

How do I migrate to full RAG later without rewriting everything?

Keep the context-assembly function as-is. Add an embedding column to the documents table, generate embeddings on save, and replace the "load all documents in active contexts" query with "load top N documents in active contexts ranked by cosine similarity to the user's question." Everything downstream of that — the prompt formatting, the chat call — stays identical.

What about prompt caching?

This pattern composes well with prompt caching. The system prompt changes only when documents are added/edited, so providers that support prompt caching (Anthropic, OpenAI) can cache the assembled context across turns and dramatically cut input-token cost on follow-up messages.
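
As a sketch, here's the Anthropic-style cache annotation as exposed through OpenAI-compatible gateways like OpenRouter. OpenAI's own prompt caching is automatic, so no annotation is needed there; check your provider's docs before relying on this shape:

const messages = [
  {
    role: "system",
    // cache_control marks this block as cacheable across turns
    // (Anthropic models via OpenRouter; ignored by providers without it)
    content: [
      { type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } }
    ]
  },
  ...history,
  { role: "user", content: userMessage }
];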

Is it safe to dump raw user data into the system prompt?

For a single-tenant app where the user owns the data, yes — that's the whole point. For multi-tenant apps, be strict about which user's contexts get loaded, and never assemble across users. A bug in context selection becomes a data-leak bug.
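
The cheapest enforcement is at the query layer: scope every context lookup by the authenticated user, so a forged or stale context id simply loads nothing. A sketch extending loadContextDocuments, where userId comes from your auth layer:

async function loadContextDocuments(db, userId, contextIds) {
  const result = await db.query(
    `SELECT c.name AS context_name, d.title, d.content
     FROM documents d
     JOIN contexts c ON c.id = d.context_id
     WHERE d.context_id = ANY($1)
       AND c.user_id = $2  -- forged context ids return zero rows
     ORDER BY c.name, d.created_at DESC`,
    [contextIds, userId]
  );
  return result.rows;
}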
