Prajwol Shrestha

Posted on Jul 5

Building a Document Chat App: How RAG Actually Works

#webdev #programming #ai #rag

When I first came across RAG, the terminology made it sound more intimidating than it is - embeddings, vector search, cosine similarity.

But once you strip away the jargon, the core idea is straightforward. And building it yourself is genuinely the fastest way to understand it.

This post walks through how I built Khoj - a document chat app where you upload any PDF and ask it questions, with answers grounded in the document's actual content. I'll use it as a reference throughout, but the focus here is on how RAG works and why each piece exists.

The Problem RAG Solves

Large language models are trained on broad data, not your specific documents. When you ask an LLM about an internal report, a contract, or a policy document it has never seen, it will often produce a confident but fabricated answer — filling gaps with plausible-sounding information it doesn't actually know.

RAG addresses this by separating two concerns: retrieval and generation.

Instead of asking the model to recall information it may not have, you first retrieve the relevant parts of your document, then hand those to the model as context. The instruction becomes: "answer based only on this." The model stops guessing and starts reasoning over real information.

That is the entire idea. Everything else in this post is implementation.

Two Phases

A RAG system has two distinct phases that are worth keeping separate in your mental model.

Ingestion runs once when a document is uploaded. It prepares the document for search.

Query runs on every user question. It retrieves relevant content and generates an answer.
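
As a rough sketch of how the two phases fit together, here is the overall shape with every external piece passed in as a function. All of the names below (`RagDeps`, `ingestDocument`, `answerQuestion`) are hypothetical and exist only for this illustration; they are not Khoj's actual code, and the real chunking, embedding, storage, and generation steps are covered in the sections that follow.

```typescript
// Hypothetical interface: each field stands in for a real component
// (chunker, embedding model, vector store, language model).
interface RagDeps {
  chunk: (text: string) => string[]
  embed: (text: string) => Promise<number[]>
  save: (content: string, embedding: number[]) => Promise<void>
  search: (queryEmbedding: number[], limit: number) => Promise<string[]>
  generate: (prompt: string) => Promise<string>
}

// Phase 1, ingestion: runs once per uploaded document.
async function ingestDocument(text: string, deps: RagDeps): Promise<number> {
  const pieces = deps.chunk(text)
  for (const piece of pieces) {
    await deps.save(piece, await deps.embed(piece))
  }
  return pieces.length // number of chunks stored
}

// Phase 2, query: runs on every user question.
async function answerQuestion(question: string, deps: RagDeps): Promise<string> {
  const queryEmbedding = await deps.embed(question)
  const context = await deps.search(queryEmbedding, 3)
  return deps.generate(`Context:\n${context.join('\n\n')}\n\nQuestion: ${question}`)
}
```

Keeping the two functions separate mirrors the mental model: ingestion cost is paid once per document, while the query path is the one that has to stay fast.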

Phase 1 — Ingestion

Why Chunking?

Embedding an entire document as a single vector works poorly. Embedding models accept a limited amount of input, and even when a document fits, one vector captures only a vague, high-level representation, so searching against it returns imprecise results.

Instead, the document is sliced into smaller pieces of around 500 characters each. Every chunk gets its own embedding vector, stored as its own row in the database. When a question arrives, the system finds the three most relevant chunks — not the whole document, just the pieces that matter.

The overlap between chunks (50 characters in this case) is important. Without it, a sentence that lands at a chunk boundary gets split in half and loses its meaning. Overlap ensures that the next chunk starts just before the previous one ended, preserving that context.

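export function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  // a step of zero or less would make the loop below run forever
  if (overlap >= chunkSize) {
    throw new Error('overlap must be smaller than chunkSize')
  }

  const cleaned = text
    // drop null bytes (they cause Postgres errors) and other control characters,
    // but leave tabs and newlines alone so words on separate lines are not glued together
    .replace(/[\u0000-\u0008\u000E-\u001F\u007F]/g, '')
    .replace(/\s+/g, ' ') // collapse every whitespace run into a single space
    .trim()

  const chunks: string[] = []
  let start = 0

  while (start < cleaned.length) {
    const end = start + chunkSize
    chunks.push(cleaned.slice(start, end))
    if (end >= cleaned.length) break // the tail is already covered by this chunk
    start = end - overlap
  }

  // discard fragments too short to carry meaning
  return chunks.filter(c => c.trim().length > 20)
}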
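
To see what the overlap does in practice, the sketch below computes only the start and end offsets each chunk would cover, using the same 500/50 defaults. `chunkWindows` is a hypothetical helper written for this illustration; it does not touch the text itself.

```typescript
// Hypothetical helper: returns the [start, end) character range of each chunk.
function chunkWindows(length: number, chunkSize = 500, overlap = 50): Array<[number, number]> {
  if (overlap >= chunkSize) throw new Error('overlap must be smaller than chunkSize')
  const windows: Array<[number, number]> = []
  const step = chunkSize - overlap // each chunk starts this far after the previous one
  for (let start = 0; start < length; start += step) {
    const end = Math.min(start + chunkSize, length)
    windows.push([start, end])
    if (end === length) break // the text is fully covered
  }
  return windows
}

console.log(chunkWindows(1200))
// windows: [0, 500], [450, 950], [900, 1200]
```

Characters 450 to 499 appear at the end of the first chunk and again at the start of the second, so a short sentence that straddles offset 500 is likely to appear whole in the second chunk.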

What Is an Embedding?

An embedding is a list of numbers that represents the meaning of a piece of text. Text with similar meanings produces vectors that are numerically close to each other in space.

This is what makes semantic search possible. You are not searching for exact keyword matches — you are finding chunks that are about the same thing as the question, even if they use entirely different words.

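// assumes the @google/genai SDK; the environment variable name is an assumption
import { GoogleGenAI } from '@google/genai'

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY })

// for document chunks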
export async function embedText(text: string): Promise<number[]> {
  const result = await ai.models.embedContent({
    model: 'gemini-embedding-001',
    contents: text,
    config: {
      outputDimensionality: 768,
      taskType: 'RETRIEVAL_DOCUMENT',  // optimised for being retrieved
    },
  })
  return result.embeddings?.[0]?.values ?? []
}

// for user questions
export async function embedQuery(text: string): Promise<number[]> {
  const result = await ai.models.embedContent({
    model: 'gemini-embedding-001',
    contents: text,
    config: {
      outputDimensionality: 768,
      taskType: 'RETRIEVAL_QUERY',     // optimised for retrieval search
    },
  })
  return result.embeddings?.[0]?.values ?? []
}

The taskType distinction is worth paying attention to. RETRIEVAL_DOCUMENT and RETRIEVAL_QUERY tell the model to optimise the output vector for different roles: one for being found, the other for finding. Matching the task type to each case can improve search quality at no additional cost.

Storing Vectors in Postgres

Each chunk is inserted into a Postgres table with a vector(768) column, powered by the pgvector extension:

create extension if not exists vector;

create table public.chunks (
  id bigserial primary key,
  document_id uuid references public.documents(id) on delete cascade,
  content text not null,
  chunk_index int not null,
  embedding vector(768)
);

create index on public.chunks
  using ivfflat (embedding vector_cosine_ops)
  with (lists = 100);

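The ivfflat index keeps similarity search fast as the table grows. Without it, each query compares the question against every candidate row. With it, the search uses an approximate nearest-neighbor algorithm that is significantly faster at scale, at the cost of some recall: it can occasionally miss a chunk that an exact scan would have found. How much depends on the lists setting and on how many lists each query probes. pgvector's documentation also recommends building an ivfflat index after the table contains data, since the lists are derived from the rows present at build time.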
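
The lists = 100 value above is a tuning knob rather than a constant. The pgvector README suggests starting points for it and for the number of lists probed per query: rows / 1000 lists for up to 1M rows, sqrt(rows) beyond that, and sqrt(lists) probes. The helper below just encodes that rule of thumb; `suggestIvfflatParams` is a hypothetical name, and the numbers are starting points to tune against your own data, not guarantees.

```typescript
// Rule-of-thumb starting values from the pgvector README (not hard rules).
function suggestIvfflatParams(rowCount: number): { lists: number; probes: number } {
  const rawLists = rowCount <= 1_000_000 ? rowCount / 1000 : Math.sqrt(rowCount)
  const lists = Math.max(1, Math.round(rawLists))
  const probes = Math.max(1, Math.round(Math.sqrt(lists)))
  return { lists, probes }
}

console.log(suggestIvfflatParams(100_000)) // { lists: 100, probes: 10 }
```

The lists value goes into the index definition, while probes is a query-time setting in pgvector (`SET ivfflat.probes = 10;`). Higher probes improves recall and lowers speed.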

Phase 2 — Query

Embedding the Question

The user's question is embedded using the same model, and the same output dimensionality, that was used for the document chunks. This is a requirement: embeddings from different models live in different vector spaces, so comparing them is like comparing GPS coordinates with Minecraft coordinates. They describe entirely different worlds.

Vector Search

pgvector finds the chunks whose embeddings are closest to the question's embedding:

create or replace function match_chunks(
  query_embedding vector(768),
  match_document_id uuid,
  match_count int default 3
)
returns table (id bigint, content text, chunk_index int, similarity float)
language sql stable
as $$
  select
    id,
    content,
    chunk_index,
    1 - (embedding <=> query_embedding) as similarity
  from public.chunks
  where document_id = match_document_id
  order by embedding <=> query_embedding
  limit match_count;
$$;

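The <=> operator is pgvector's cosine distance, defined as 1 minus cosine similarity, so 1 - distance recovers the similarity score. Strictly, cosine similarity ranges from -1.0 (opposite direction) to 1.0 (same direction), with 0.0 meaning unrelated. For text embeddings, scores in practice tend to fall between 0.0 and 1.0, and higher means closer in meaning.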
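
A minimal sketch of the arithmetic behind <=>, using toy 2-dimensional vectors. `cosineSimilarity` and `cosineDistance` are illustrative helpers written for this post, not pgvector's implementation.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Depends on direction, not length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vectors must have the same length')
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0
    const y = b[i] ?? 0
    dot += x * y
    normA += x * x
    normB += y * y
  }
  if (normA === 0 || normB === 0) throw new Error('cosine similarity is undefined for a zero vector')
  return dot / Math.sqrt(normA * normB)
}

// Cosine distance is the complement of cosine similarity.
function cosineDistance(a: number[], b: number[]): number {
  return 1 - cosineSimilarity(a, b)
}

console.log(cosineDistance([1, 0], [2, 0]))  // 0: same direction
console.log(cosineDistance([1, 0], [0, 3]))  // 1: orthogonal
console.log(cosineDistance([1, 0], [-1, 0])) // 2: opposite direction
```

Because only direction matters, [1, 0] and [2, 0] count as identical even though their lengths differ.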

Imagine every sentence as a point floating in a huge 768-dimensional space. Similar ideas naturally end up close together, while unrelated topics drift farther apart. Vector search simply asks, "Which stored points are nearest to this new point?"
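
The "nearest points" question can be answered by brute force: score every stored point against the query and keep the best few. The sketch below does that in memory with made-up 3-dimensional vectors and labels; `Point`, `normalize`, and `nearest` are hypothetical names. It scales each vector to length 1 first, at which point a plain dot product equals the cosine similarity.

```typescript
interface Point {
  label: string
  vector: number[]
}

// Scale a vector to length 1 so that the dot product equals cosine similarity.
function normalize(v: number[]): number[] {
  const len = Math.hypot(...v)
  if (len === 0) throw new Error('cannot normalize a zero vector')
  return v.map((x) => x / len)
}

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0)
}

// Brute-force search: score every stored point, keep the k most similar.
function nearest(query: number[], points: Point[], k = 3): Array<{ label: string; similarity: number }> {
  const q = normalize(query)
  return points
    .map((p) => ({ label: p.label, similarity: dot(q, normalize(p.vector)) }))
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, k)
}

// Made-up toy data, purely for illustration.
const stored: Point[] = [
  { label: 'refund policy', vector: [0.9, 0.1, 0.0] },
  { label: 'shipping times', vector: [0.1, 0.9, 0.1] },
  { label: 'office dog', vector: [0.0, 0.1, 0.9] },
]
console.log(nearest([0.8, 0.2, 0.0], stored, 2).map((r) => r.label))
// labels, nearest first: 'refund policy', 'shipping times'
```

This is roughly the work an unindexed table scan has to do for every query, which is why an approximate index starts to matter as the number of chunks grows.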

The Similarity Threshold

pgvector always returns results — even when the question has nothing to do with the document. It simply returns the least-distant chunks it can find. Without a threshold, you will attach irrelevant "sources" to answers like "I couldn't find that information," which is misleading and confusing for users.

const SIMILARITY_THRESHOLD = 0.5
const relevantChunks = chunks.filter(c => c.similarity >= SIMILARITY_THRESHOLD)

if (relevantChunks.length === 0) {
  return "I couldn't find that information in the document."
}

A threshold of 0.5 is a reasonable starting point. You may need to tune it based on your specific content and embedding model.

Building the Prompt

The retrieved chunks are assembled into context and passed to the language model:

const context = relevantChunks
  .map((c, i) => `[Chunk ${i + 1}]:\n${c.content}`)
  .join('\n\n')

const prompt = `Answer based strictly on the context provided below.
If the answer is not present in the context, respond with "I couldn't find that information in the document."

Context:
${context}

Question: ${question}

Answer:`

The explicit instruction to answer only from the provided context is what keeps hallucination in check. It does not eliminate it, but the model is now reading and reasoning over the content you have given it rather than filling gaps from its training data.

Streaming Responses

Without streaming, the user submits a question and waits several seconds for the full response to appear at once. With streaming, words appear as they are generated. The total response time is the same, but the experience is fundamentally different.

The approach here uses Server-Sent Events (SSE). The server sends a stream of events, each one a data: line carrying a JSON payload and terminated by a blank line:

data: {"type":"sources","sources":[...]}

data: {"type":"token","token":"The"}

data: {"type":"token","token":" document"}

data: {"type":"done"}

There are three event types: sources arrives first so the UI can display references immediately, token carries each piece of text (a word or word fragment) as it is generated, and done signals the end of the stream.
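
One way to keep these events consistent between server and client is to describe them as a TypeScript union and serialise them through a single helper. This is a sketch, not Khoj's code: `StreamEvent` and `encodeEvent` are hypothetical names, and the fields of `Source` are an assumption, since the shape of the sources payload is not shown here.

```typescript
// Assumed shape for one source reference; the real fields may differ.
interface Source {
  chunkIndex: number
  content: string
}

// The three event types, as a discriminated union on `type`.
type StreamEvent =
  | { type: 'sources'; sources: Source[] }
  | { type: 'token'; token: string }
  | { type: 'done' }

// One SSE message: a "data: " line followed by a blank line.
function encodeEvent(event: StreamEvent): string {
  return `data: ${JSON.stringify(event)}\n\n`
}

const doneMessage = encodeEvent({ type: 'done' })
// doneMessage is 'data: {"type":"done"}' followed by two newline characters
```

JSON.stringify escapes any newlines inside string values, so each payload stays on a single data: line.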

// server — create the stream
const stream = new ReadableStream({
  async start(controller) {
    const encoder = new TextEncoder()

    // send sources before any tokens arrive
    controller.enqueue(
      encoder.encode(`data: ${JSON.stringify({ type: 'sources', sources })}\n\n`)
    )

    const completion = await groq.chat.completions.create({
      model: 'llama-3.1-8b-instant',
      messages: [{ role: 'user', content: prompt }],
      stream: true,
    })

    let fullAnswer = ''
    for await (const chunk of completion) {
      const token = chunk.choices[0]?.delta?.content ?? ''
      if (token) {
        fullAnswer += token
        controller.enqueue(
          encoder.encode(`data: ${JSON.stringify({ type: 'token', token })}\n\n`)
        )
      }
    }

    controller.enqueue(encoder.encode(`data: ${JSON.stringify({ type: 'done' })}\n\n`))
    controller.close()
  },
})

return new Response(stream, {
  headers: { 'Content-Type': 'text/event-stream', 'Cache-Control': 'no-cache' },
})

// client — read the stream
const reader = res.body!.getReader()
const decoder = new TextDecoder()
let buffer = ''

while (true) {
  const { done, value } = await reader.read()
  if (done) break

  buffer += decoder.decode(value, { stream: true })
  const lines = buffer.split('\n\n')
  buffer = lines.pop() ?? ''

  for (const line of lines) {
    if (!line.startsWith('data: ')) continue
    const event = JSON.parse(line.slice(6))

    if (event.type === 'token') {
      setMessages(m => m.map(msg =>
        msg.id === pendingId
          ? { ...msg, content: msg.content + event.token }
          : msg
      ))
    }
  }
}

reader.read() is part of the browser's native Web Streams API and returns { done, value } on each call. No additional library is required.
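
The buffer in the client loop exists because a network chunk can end in the middle of an event. The sketch below isolates that parsing step as a pure function so the behaviour is easy to see; `drainEvents` is a hypothetical helper that mirrors the split, pop, and parse logic of the loop.

```typescript
// Splits buffered SSE text into complete JSON events plus the unfinished remainder.
function drainEvents(buffer: string): { events: unknown[]; rest: string } {
  const parts = buffer.split('\n\n')
  const rest = parts.pop() ?? '' // the last piece may be an incomplete event
  const events: unknown[] = []
  for (const part of parts) {
    if (!part.startsWith('data: ')) continue
    events.push(JSON.parse(part.slice(6)))
  }
  return { events, rest }
}

// First network chunk: one complete event, then the start of another.
const first = drainEvents('data: {"type":"token","token":"The"}\n\ndata: {"type":"tok')
// first.events holds the "The" token; first.rest is 'data: {"type":"tok'

// Second chunk completes the event that was cut off.
const second = drainEvents(first.rest + 'en","token":" document"}\n\n')
// second.events holds the " document" token; second.rest is ''
```

Parsing each chunk on its own, without carrying the remainder forward, would throw on the truncated JSON in the first call.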

Closing Thoughts

RAG is not a new category of technology — it is a pattern. Retrieve relevant information, then generate a response grounded in that information. The sophistication lives in how well you handle each step: chunk size, overlap, embedding quality, similarity thresholds.

Get those right, and the language model handles the rest.

If you want to explore the full implementation, the source code for Khoj is available on GitHub. A live demo is running at khoj.shresthaprajwol.com.np.
