Gaurav Kumar Singh

Build a Local RAG AI App with Ollama, Mistral, and Node.js

Most people start using Large Language Models by asking questions directly:

Question -> LLM -> Answer

This works well for general questions.

But what happens when you ask about your own documents, company policy, product FAQ, or internal notes?

The model may not know the answer. Even worse, it may guess confidently. This is called a hallucination.

That is the problem RAG solves.

In this article, we will build a simple local RAG app using Ollama, Mistral, and Node.js.

Complete code is available here:

https://github.com/gaurav101/ai-experiment/tree/main/rag

What Is RAG?

RAG stands for Retrieval-Augmented Generation.

That sounds complex, but the idea is simple:

Before asking the LLM to answer, first search your own documents and give the most useful parts to the model.

So instead of this:

User question -> LLM -> Answer

we do this:

User question
   -> Search local documents
   -> Find relevant text
   -> Send that text to the LLM
   -> Generate an answer

Think of it like an open-book exam.

The LLM is still doing the writing, but now it has the right page open before it answers.

Why RAG Matters

RAG is important because many real AI apps need private or up-to-date information that the model never saw during training.

For example:

  • A chatbot that answers questions from company documents
  • A support assistant that reads product FAQs
  • A legal assistant that searches contracts
  • A coding assistant that understands project docs
  • A personal assistant that uses your own notes

Without RAG, the model only uses what it already learned during training.

With RAG, we can give the model fresh information at runtime.

This means:

  • No need to retrain the model
  • Documents can stay private
  • Answers are based on your data
  • The system is easier to update

If your refund policy changes, you update the document and rebuild the index. You do not retrain the LLM.

What We Will Build

We will build a local RAG app that can answer questions from files stored on your machine.

The app uses:

  • Node.js for the code
  • Ollama to run models locally
  • Mistral to generate answers
  • nomic-embed-text to create embeddings
  • A local data/docs folder for documents
  • A local data/index.json file as a simple vector index

The project flow is:

Documents -> Chunks -> Embeddings -> Search -> Context -> Mistral -> Answer

Do not worry if words like "embeddings" or "vector index" feel new. We will walk through them step by step.

Step 1: Install Ollama

First, install Ollama:

https://ollama.com/download

On Linux or macOS, you can also install it with:

curl -fsSL https://ollama.com/install.sh | sh

Start the Ollama server (skip this step if the desktop app or a background service already has it running):

ollama serve

Now pull the two models we need.

Mistral will generate the final answer:

ollama pull mistral

nomic-embed-text will convert text into embeddings:

ollama pull nomic-embed-text

You can test Mistral with:

ollama run mistral

Step 2: Create the Node.js Project

Create a new project:

mkdir local-rag
cd local-rag
npm init -y

Use ES modules by adding this to package.json:

{
  "type": "module"
}

Install dotenv:

npm install dotenv

In this project, we use these scripts:

{
  "scripts": {
    "index": "node src/index-docs.js",
    "ask:ollama": "node src/ask-ollama.js"
  }
}

npm run index builds the searchable document index.

npm run ask:ollama asks a question using Ollama and Mistral.

Step 3: Add Local Documents

Create a folder for your documents:

mkdir -p data/docs

Add a file:

data/docs/company-faq.txt

Example content:

Refunds are allowed within 14 days of purchase.
Enterprise customers get priority email support.
The product supports SSO on the Business and Enterprise plans.

These documents are the knowledge base for our RAG app.

Later, when the user asks a question, the app will search these files first.

Step 4: Add Configuration

Create src/config.js.

This file keeps all important settings in one place:

export const DOCS_DIR = process.env.DOCS_DIR || "data/docs";
export const INDEX_FILE = process.env.INDEX_FILE || "data/index.json";

export const OLLAMA_BASE =
  process.env.OLLAMA_BASE_URL || "http://localhost:11434";

export const OLLAMA_EMBED_ENDPOINT = "/api/embeddings";
export const OLLAMA_EMBED_MODEL = "nomic-embed-text";

export const OLLAMA_GEN_ENDPOINT = "/api/generate";
export const OLLAMA_GEN_MODEL = process.env.OLLAMA_MODEL || "mistral";

This tells the app:

  • Where to read documents from
  • Where to save the index
  • Which model to use for embeddings
  • Which model to use for answers
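
The base URL and the endpoint constants are meant to be combined when the app calls Ollama. Here is a minimal sketch of one way to join them with the standard URL class. The helper name ollamaUrl is hypothetical and is not part of the original project:

```javascript
// Hypothetical helper: joins a base URL and an endpoint path into one URL string.
function ollamaUrl(
  endpoint,
  base = process.env.OLLAMA_BASE_URL || "http://localhost:11434"
) {
  // new URL(path, base) resolves an absolute path against the base's origin,
  // so a trailing slash on the base does not produce a double slash.
  return new URL(endpoint, base).href;
}

console.log(ollamaUrl("/api/embeddings", "http://localhost:11434"));
// http://localhost:11434/api/embeddings
console.log(ollamaUrl("/api/generate", "http://localhost:11434/"));
// http://localhost:11434/api/generate
```

Building the URL in one place means a change to OLLAMA_BASE_URL takes effect for both the embedding and the generation calls.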

Step 5: Read and Split Documents

LLMs work better when we give them small, focused pieces of text.

So we split long documents into smaller parts called chunks.

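export function chunkText(text, size = 900, overlap = 150) {
  // Guard: the window must move forward, or the loop below never ends.
  if (overlap >= size) {
    throw new Error("overlap must be smaller than size");
  }

  const chunks = [];
  let start = 0;

  while (start < text.length) {
    const end = start + size;
    chunks.push(text.slice(start, end)); // slice stops at the end of the text
    start += size - overlap; // step forward, keeping `overlap` characters
  }

  // Trim whitespace and drop chunks that end up empty.
  return chunks.map((chunk) => chunk.trim()).filter(Boolean);
}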

Here:

  • size = 900 means each chunk is at most 900 characters (the last chunk of a document is usually shorter)
  • overlap = 150 means each chunk starts with the last 150 characters of the previous one

The overlap is useful because important meaning can sit between two chunks.

Example:

Chunk 1: characters 0 to 900
Chunk 2: characters 750 to 1650
Chunk 3: characters 1500 to 2400
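
To see where the boundaries fall for a given text length, the same loop can be run on offsets alone. This is an illustrative sketch. chunkRanges is a hypothetical helper, not part of the project:

```javascript
// Hypothetical helper: returns the [start, end) character ranges that
// a sliding window of `size` with `overlap` produces over `length` characters.
function chunkRanges(length, size = 900, overlap = 150) {
  if (overlap >= size) {
    throw new Error("overlap must be smaller than size");
  }

  const ranges = [];
  for (let start = 0; start < length; start += size - overlap) {
    ranges.push([start, Math.min(start + size, length)]);
  }
  return ranges;
}

console.log(chunkRanges(2000));
// [ [ 0, 900 ], [ 750, 1650 ], [ 1500, 2000 ] ]
```

Each window starts size - overlap = 750 characters after the previous one, and the last range is cut short because the text ends before the window does.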

Step 6: Create Embeddings

An embedding is a list of numbers that represents the meaning of text.

For example, this sentence:

The product supports SSO on Business and Enterprise plans.

is converted into a vector:

[0.12, -0.04, 0.89, ...]

The exact numbers are not important for us.

What matters is this:

Similar text gets similar embeddings.

Why Do We Need nomic-embed-text?

Mistral is good at generating answers, but we also need a way to search our documents by meaning.

That is what nomic-embed-text does.

It converts text into embeddings so our app can compare:

  • The user's question
  • The chunks from our documents

Without embeddings, our app would only do simple keyword matching.

For example, if the document says:

The product supports SSO on the Business and Enterprise plans.

and the user asks:

Which subscription includes single sign-on?

keyword search may miss the connection because the words are different.

But an embedding model usually places "SSO" and "single sign-on" close together, so the connection is still found.

So in this project:

  • nomic-embed-text is used for search
  • mistral is used for answering

That means a question like:

Which plans have SSO?

should be close to the document sentence:

The product supports SSO on the Business and Enterprise plans.

Here is the embedding function:

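export async function embed(text) {
  // Ask Ollama's embeddings endpoint to turn the text into a vector.
  const response = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "nomic-embed-text",
      prompt: text
    })
  });

  // Fail loudly if Ollama is not running or the model has not been pulled.
  if (!response.ok) {
    throw new Error(`Embedding request failed: ${response.status}`);
  }

  const data = await response.json();
  return data.embedding; // array of numbers
}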

We use this same function for both:

  • Document chunks
  • User questions

That is how we compare a question with our documents.
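
A successful HTTP status does not guarantee a usable vector in the body. As an extra safety net, the parsed JSON can be checked before the embedding is stored or compared. This is a minimal sketch. readEmbedding is a hypothetical helper, and the error field it reports is an assumption about what a failed request might return:

```javascript
// Hypothetical helper: validates the parsed JSON body of an embeddings response.
function readEmbedding(data) {
  const vector = data?.embedding;

  if (!Array.isArray(vector) || vector.length === 0) {
    // Assumption: a failed request may carry an `error` string instead.
    throw new Error(`No embedding in response: ${data?.error ?? "unknown error"}`);
  }

  return vector;
}

console.log(readEmbedding({ embedding: [0.12, -0.04, 0.89] }).length); // 3
```

Inside embed, this would replace the plain return of data.embedding, so a bad response stops the indexing run instead of writing an empty vector into the index.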

Step 7: Build the Index

Now we create the index.

The index is a JSON file that stores:

  • The document name
  • The chunk text
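  • The embedding for that chunk

import fs from "node:fs/promises";

// chunkText and embed come from the earlier steps. readDocuments is expected
// to return one { source, text } object per file in data/docs.
export async function buildIndex() {
  const docs = await readDocuments();
  const records = [];

  for (const doc of docs) {
    const chunks = chunkText(doc.text);

    for (const [i, chunk] of chunks.entries()) {
      records.push({
        id: `${doc.source}#${i}`, // e.g. "company-faq.txt#0"
        source: doc.source,
        text: chunk,
        embedding: await embed(chunk)
      });
    }
  }

  await fs.writeFile("data/index.json", JSON.stringify(records, null, 2));
  return records.length; // number of chunks indexed
}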

Run:

npm run index

This creates:

data/index.json

Now your documents are searchable by meaning, not just by exact words.
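
buildIndex relies on a readDocuments helper that this article does not show. Here is a minimal sketch of what such a helper might look like. It assumes the helper returns objects with the source and text fields that buildIndex reads, and that only .txt and .md files sitting directly in the folder are indexed. Both are assumptions, and the repository version may differ:

```javascript
import { readdir, readFile } from "node:fs/promises";
import path from "node:path";

// Sketch only: reads every .txt / .md file directly inside `dir`
// and returns [{ source, text }] sorted by file name.
async function readDocuments(dir = "data/docs") {
  const entries = await readdir(dir, { withFileTypes: true });
  const docs = [];

  for (const entry of entries) {
    // Skip sub-folders and file types this sketch does not handle.
    if (!entry.isFile() || !/\.(txt|md)$/i.test(entry.name)) continue;

    const text = await readFile(path.join(dir, entry.name), "utf8");
    docs.push({ source: entry.name, text });
  }

  return docs.sort((a, b) => a.source.localeCompare(b.source));
}
```

Sorting by file name keeps the chunk ids stable between runs, which makes two versions of data/index.json easier to compare.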

Step 8: Search the Best Chunks

When the user asks a question, we need to find the document chunks that are closest to that question.

To do that, we:

  1. Convert the question into an embedding
  2. Compare it with every saved document embedding
  3. Sort the results
  4. Keep the best matches

The comparison uses cosine similarity:

function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;

  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }

  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
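
A few hand-made vectors make the score easier to read: vectors pointing the same way score 1, perpendicular (unrelated) vectors score 0, and opposite vectors score -1. The sketch below repeats the function with one addition that is not in the original, a guard that returns 0 when either vector has zero length so the division never produces NaN:

```javascript
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;

  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }

  // Added guard (not in the original): avoid 0 / 0 for all-zero vectors.
  if (normA === 0 || normB === 0) return 0;

  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0]));  // 1
console.log(cosineSimilarity([1, 0], [0, 1]));  // 0
console.log(cosineSimilarity([1, 0], [-1, 0])); // -1
console.log(cosineSimilarity([0, 0], [1, 0]));  // 0 (guard)
```

Real embeddings have hundreds of dimensions rather than two, but the score is read the same way: the closer to 1, the more similar the two texts.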

Then we search the index:

export async function search(query, limit = 4) {
  const raw = await fs.readFile("data/index.json", "utf8");
  const index = JSON.parse(raw);

  const queryEmbedding = await embed(query);

  return index
    .map((item) => ({
      ...item,
      score: cosineSimilarity(queryEmbedding, item.embedding)
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

The result is a small set of document chunks that are most likely to contain the answer.
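
One consequence of slice(0, limit) is that it always returns chunks, even when nothing in the documents relates to the question. A possible refinement is to drop low-scoring matches before building the prompt. The helper name, the sample scores, and the 0.5 threshold below are illustrative assumptions. A sensible cutoff depends on the embedding model and the documents:

```javascript
// Hypothetical helper: keeps at most `limit` results whose score reaches `minScore`.
function keepRelevant(results, limit = 4, minScore = 0.5) {
  return results
    .filter((item) => item.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

// Made-up scores, only to show the filtering.
const scored = [
  { id: "company-faq.txt#0", score: 0.82 },
  { id: "company-faq.txt#1", score: 0.31 },
  { id: "notes.txt#0", score: 0.64 }
];

console.log(keepRelevant(scored).map((item) => item.id));
// [ 'company-faq.txt#0', 'notes.txt#0' ]
```

If the filtered list comes back empty, the app can answer "I do not know" right away without calling the model.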

Step 9: Give Context to Mistral

Now we have the useful document chunks.

Next, we send them to Mistral with the user's question.

The prompt looks like this:

const prompt = `
Answer using only the context below.
If the answer is missing, say you do not know.

Context:
${context}

Question:
${question}
`;

This line is very important:

Answer using only the context below.

It tells the model to stay within the retrieved context. This reduces guessing, although no prompt can rule it out completely.
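
The prompt template refers to a context variable that the article does not show being built. One straightforward way, sketched here as an assumption and not as the repository's actual code, is to join the retrieved chunks and label each with its source file:

```javascript
// Hypothetical helper: turns search results into one context string.
function buildContext(results) {
  return results
    .map((item) => `[${item.source}]\n${item.text}`)
    .join("\n\n");
}

// Same template as above, wrapped in a function.
function buildPrompt(context, question) {
  return `
Answer using only the context below.
If the answer is missing, say you do not know.

Context:
${context}

Question:
${question}
`;
}

const context = buildContext([
  { source: "company-faq.txt", text: "Refunds are allowed within 14 days of purchase." }
]);

console.log(buildPrompt(context, "What is the refund window?"));
```

Labeling each chunk with its file name also lets the model mention where an answer came from.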

Then we call Ollama's generation API:

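const resp = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "mistral",
    prompt,
    stream: false, // return one JSON object instead of a stream of fragments
    options: {
      num_predict: 512, // upper limit on generated tokens
      temperature: 0.2
    }
  })
});

const data = await resp.json();
console.log(data.response); // the generated answer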

temperature: 0.2 makes the answer more focused and less random.
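
Unless the request body sets stream: false, Ollama's /api/generate endpoint replies with newline-delimited JSON: one object per line, each carrying a response fragment and a done flag. Here is a minimal sketch of assembling the full answer from such a body once it has been read as text. joinStreamedResponse is a hypothetical helper name, and the sample lines are simplified to the two fields used here:

```javascript
// Hypothetical helper: concatenates the `response` fragments from an NDJSON body.
function joinStreamedResponse(ndjson) {
  return ndjson
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line).response ?? "")
    .join("");
}

// Simplified sample body, as it might come back from `await resp.text()`.
const body = [
  '{"response":"Business and ","done":false}',
  '{"response":"Enterprise plans.","done":false}',
  '{"response":"","done":true}'
].join("\n");

console.log(joinStreamedResponse(body)); // Business and Enterprise plans.
```

For a command-line tool, waiting for the whole body is usually fine. A chat interface would more likely print each fragment as it arrives.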

Step 10: Ask a Question

Now run:

npm run ask:ollama -- "What plans support SSO?"

Example answer:

The product supports SSO on the Business and Enterprise plans.

This answer came from the local document.

Mistral did not need to already know your product FAQ. The RAG pipeline found the right context and gave it to the model.
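
In the command above, the text after -- is passed to the script as command-line arguments. Here is a minimal sketch of how src/ask-ollama.js might read it. readQuestion is a hypothetical name, and the repository's script may do this differently:

```javascript
// Hypothetical helper: joins everything after the script path into one question.
function readQuestion(argv = process.argv) {
  // argv[0] is the node binary and argv[1] is the script path.
  const question = argv.slice(2).join(" ").trim();

  if (!question) {
    throw new Error('Usage: npm run ask:ollama -- "your question"');
  }

  return question;
}

console.log(readQuestion(["node", "src/ask-ollama.js", "What plans support SSO?"]));
// What plans support SSO?
```

Joining the arguments means the question still works if the quotes are left out and the shell splits it into separate words.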

The Full RAG Flow

Here is the complete flow again:

1. Put documents in data/docs
2. Split documents into chunks
3. Convert chunks into embeddings
4. Save embeddings in data/index.json
5. User asks a question
6. Convert the question into an embedding
7. Find the most similar chunks
8. Add those chunks to the prompt
9. Ask Mistral to answer using that context

That is RAG.

Search first. Generate second.

Why This Local Version Is Useful

This project is intentionally simple.

It uses a JSON file instead of a vector database. That makes it easier to understand what is happening.

For learning, this is perfect.

For production, you may later replace data/index.json with a vector database such as:

  • Chroma
  • Qdrant
  • Weaviate
  • pgvector

But the core idea stays the same:

Store embeddings -> Search similar chunks -> Send context to the LLM

Conclusion

RAG is one of the most useful patterns for building practical AI apps.

It helps LLMs answer using your data without retraining the model.

In this article, we built a local RAG app with:

  • Ollama
  • Mistral
  • Node.js
  • nomic-embed-text
  • Local documents
  • A JSON-based vector index

The main idea is simple:

Search your documents first, then let the LLM answer with that context.

Complete implementation:

https://github.com/gaurav101/ai-experiment/tree/main/rag
