Most people start using Large Language Models by asking questions directly:
Question -> LLM -> Answer
This works well for general questions.
But what happens when you ask about your own documents, company policy, product FAQ, or internal notes?
The model may not know the answer. Even worse, it may guess confidently. This is called a hallucination.
That is the problem RAG solves.
In this article, we will build a simple local RAG app using Ollama, Mistral, and Node.js.
Complete code is available here:
https://github.com/gaurav101/ai-experiment/tree/main/rag
What Is RAG?
RAG stands for Retrieval-Augmented Generation.
That sounds complex, but the idea is simple:
Before asking the LLM to answer, first search your own documents and give the most useful parts to the model.
So instead of this:
User question -> LLM -> Answer
we do this:
User question
-> Search local documents
-> Find relevant text
-> Send that text to the LLM
-> Generate an answer
Think of it like an open-book exam.
The LLM is still doing the writing, but now it has the right page open before it answers.
Why RAG Matters
RAG matters because many real AI apps need private or up-to-date information.
For example:
- A chatbot that answers questions from company documents
- A support assistant that reads product FAQs
- A legal assistant that searches contracts
- A coding assistant that understands project docs
- A personal assistant that uses your own notes
Without RAG, the model only uses what it already learned during training.
With RAG, we can give the model fresh information at runtime.
This means:
- No need to retrain the model
- Documents can stay private
- Answers are based on your data
- The system is easier to update
If your refund policy changes, you update the document and rebuild the index. You do not retrain the LLM.
What We Will Build
We will build a local RAG app that can answer questions from files stored on your machine.
The app uses:
- Node.js for the code
- Ollama to run models locally
- Mistral to generate answers
- nomic-embed-text to create embeddings
- A local data/docs folder for documents
- A local data/index.json file as a simple vector index
The project flow is:
Documents -> Chunks -> Embeddings -> Search -> Context -> Mistral -> Answer
Do not worry if words like "embeddings" or "vector index" feel new. We will walk through them step by step.
Step 1: Install Ollama
First, install Ollama using the installer for your platform.
On Linux or macOS, you can also install it with:
curl -fsSL https://ollama.com/install.sh | sh
Start Ollama:
ollama serve
Now pull the two models we need.
Mistral will generate the final answer:
ollama pull mistral
nomic-embed-text will convert text into embeddings:
ollama pull nomic-embed-text
You can test Mistral with:
ollama run mistral
Step 2: Create the Node.js Project
Create a new project:
mkdir local-rag
cd local-rag
npm init -y
Use ES modules by adding this to package.json:
{
  "type": "module"
}
Install dotenv:
npm install dotenv
In this project, we use these scripts:
{
  "scripts": {
    "index": "node src/index-docs.js",
    "ask:ollama": "node src/ask-ollama.js"
  }
}
npm run index builds the searchable document index.
npm run ask:ollama asks a question using Ollama and Mistral.
Step 3: Add Local Documents
Create a folder for your documents:
mkdir -p data/docs
Add a file:
data/docs/company-faq.txt
Example content:
Refunds are allowed within 14 days of purchase.
Enterprise customers get priority email support.
The product supports SSO on the Business and Enterprise plans.
These documents are the knowledge base for our RAG app.
Later, when the user asks a question, the app will search these files first.
Step 4: Add Configuration
Create src/config.js.
This file keeps all important settings in one place:
export const DOCS_DIR = process.env.DOCS_DIR || "data/docs";
export const INDEX_FILE = process.env.INDEX_FILE || "data/index.json";
export const OLLAMA_BASE =
  process.env.OLLAMA_BASE_URL || "http://localhost:11434";
export const OLLAMA_EMBED_ENDPOINT = "/api/embeddings";
export const OLLAMA_EMBED_MODEL = "nomic-embed-text";
export const OLLAMA_GEN_ENDPOINT = "/api/generate";
export const OLLAMA_GEN_MODEL = process.env.OLLAMA_MODEL || "mistral";
This tells the app:
- Where to read documents from
- Where to save the index
- Which model to use for embeddings
- Which model to use for answers
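To see how these settings fit together, here is a minimal sketch that joins the base URL and an endpoint path using the standard URL class. The helper name ollamaUrl is hypothetical and is not part of the original project; the constant values are copied from the config above.

```javascript
// Values copied from src/config.js above (environment overrides left out for brevity).
const OLLAMA_BASE = "http://localhost:11434";
const OLLAMA_EMBED_ENDPOINT = "/api/embeddings";
const OLLAMA_GEN_ENDPOINT = "/api/generate";

// Hypothetical helper: resolve an endpoint path against the base URL.
function ollamaUrl(endpoint, base = OLLAMA_BASE) {
  return new URL(endpoint, base).href;
}

console.log(ollamaUrl(OLLAMA_EMBED_ENDPOINT)); // http://localhost:11434/api/embeddings
console.log(ollamaUrl(OLLAMA_GEN_ENDPOINT));   // http://localhost:11434/api/generate
```

One thing to keep in mind: because the endpoint paths start with a slash, they replace any path that is part of the base URL. That is fine for the default base, which has no path.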
Step 5: Read and Split Documents
LLMs work better when we give them small, focused pieces of text.
So we split long documents into smaller parts called chunks.
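export function chunkText(text, size = 900, overlap = 150) {
  // Guard: the window must move forward, or the loop below never ends.
  if (overlap >= size) {
    throw new Error("overlap must be smaller than size");
  }
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    const end = start + size;
    chunks.push(text.slice(start, end));
    // Step forward by less than a full chunk so neighbours share `overlap` characters.
    start += size - overlap;
  }
  // Trim whitespace and discard chunks that end up empty.
  return chunks.map((chunk) => chunk.trim()).filter(Boolean);
}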
Here:
- size = 900 means each chunk is at most 900 characters
- overlap = 150 means the next chunk repeats the last 150 characters of the previous one
The overlap is useful because a sentence that straddles a chunk boundary would otherwise be cut in half, and neither chunk would carry its full meaning.
Example:
Chunk 1: characters 0 to 900
Chunk 2: characters 750 to 1650
Chunk 3: characters 1500 to 2400
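The ranges above can be checked with a small sketch. The function chunkRanges is a hypothetical helper, not part of the project; it only repeats the stepping rule from chunkText (move forward by size minus overlap) and reports where each chunk starts and ends.

```javascript
// Compute the [start, end) character range of each chunk, using the same
// stepping rule as chunkText: advance by (size - overlap) each time.
function chunkRanges(length, size = 900, overlap = 150) {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const ranges = [];
  for (let start = 0; start < length; start += size - overlap) {
    ranges.push([start, Math.min(start + size, length)]);
  }
  return ranges;
}

// For a 2400-character document the ranges are:
// [0, 900], [750, 1650], [1500, 2400], [2250, 2400]
console.log(chunkRanges(2400));
```

Notice the fourth range. With this stepping rule, a document can end with a short chunk that only repeats the tail of the previous one. For a learning project that small duplicate is harmless.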
Step 6: Create Embeddings
An embedding is a list of numbers that represents the meaning of text.
For example, this sentence:
The product supports SSO on Business and Enterprise plans.
is converted into a vector:
[0.12, -0.04, 0.89, ...]
The exact numbers are not important for us.
What matters is this:
Similar text gets similar embeddings.
Why Do We Need nomic-embed-text?
Mistral is good at generating answers, but we also need a way to search our documents by meaning.
That is what nomic-embed-text does.
It converts text into embeddings so our app can compare:
- The user's question
- The chunks from our documents
Without embeddings, our app would only do simple keyword matching.
For example, if the document says:
The product supports SSO on the Business and Enterprise plans.
and the user asks:
Which subscription includes single sign-on?
keyword search may miss the connection because the words are different.
But embeddings can understand that SSO and single sign-on are related.
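To make the keyword problem concrete, here is a minimal sketch of a naive keyword matcher. The functions tokenize and keywordOverlap are hypothetical and written only for this illustration; real keyword search engines are more sophisticated, but they share the same blind spot for synonyms.

```javascript
// Lowercase, split on whitespace, and strip punctuation from word edges.
function tokenize(text) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .map((word) => word.replace(/^[^a-z0-9]+|[^a-z0-9]+$/g, ""))
    .filter(Boolean);
}

// Count how many distinct query words also appear in the document.
function keywordOverlap(query, doc) {
  const docWords = new Set(tokenize(doc));
  return [...new Set(tokenize(query))].filter((w) => docWords.has(w)).length;
}

const doc = "The product supports SSO on the Business and Enterprise plans.";
console.log(keywordOverlap("Which plans have SSO?", doc)); // 2 ("plans" and "sso")
console.log(keywordOverlap("Which subscription includes single sign-on?", doc)); // 0
```

Both questions ask the same thing, but the second one shares no words with the document, so a keyword matcher scores it as zero.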
So in this project:
- nomic-embed-text is used for search
- mistral is used for answering
That means a question like:
Which plans have SSO?
should be close to the document sentence:
The product supports SSO on the Business and Enterprise plans.
Here is the embedding function:
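export async function embed(text) {
  const response = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "nomic-embed-text",
      prompt: text
    })
  });
  // Fail loudly if Ollama is not running or the model has not been pulled.
  if (!response.ok) {
    throw new Error(`Embedding request failed: ${response.status}`);
  }
  const data = await response.json();
  // The response body has the shape { embedding: [number, ...] }.
  return data.embedding;
}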
We use this same function for both:
- Document chunks
- User questions
That is how we compare a question with our documents.
Step 7: Build the Index
Now we create the index.
The index is a JSON file that stores:
- The document name
- The chunk text
- The embedding for that chunk
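import fs from "node:fs/promises";

export async function buildIndex() {
  // readDocuments() is expected to return [{ source, text }, ...] for the files in data/docs.
  const docs = await readDocuments();
  const records = [];
  for (const doc of docs) {
    const chunks = chunkText(doc.text);
    for (const [i, chunk] of chunks.entries()) {
      records.push({
        // The id combines the file name and the chunk position, e.g. "company-faq.txt#0".
        id: `${doc.source}#${i}`,
        source: doc.source,
        text: chunk,
        embedding: await embed(chunk)
      });
    }
  }
  await fs.writeFile("data/index.json", JSON.stringify(records, null, 2));
  return records.length;
}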
Run:
npm run index
This creates:
data/index.json
Now your documents are searchable by meaning, not just by exact words.
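The buildIndex function calls readDocuments, which is not shown in this article. Here is a minimal sketch of what such a helper could look like, assuming it only needs to return the source and text fields that buildIndex uses. The file extensions, the non-recursive scan, and the sorting are assumptions made for this example; the version in the repository may differ.

```javascript
import fs from "node:fs/promises";
import path from "node:path";

// Hypothetical helper: read every .txt and .md file in a folder (non-recursive)
// and return the { source, text } objects that buildIndex() consumes.
async function readDocuments(dir = "data/docs") {
  const entries = await fs.readdir(dir, { withFileTypes: true });
  const docs = [];
  for (const entry of entries) {
    const isText = [".txt", ".md"].includes(path.extname(entry.name).toLowerCase());
    if (!entry.isFile() || !isText) continue;
    const text = await fs.readFile(path.join(dir, entry.name), "utf8");
    docs.push({ source: entry.name, text });
  }
  // Sort by file name so chunk ids stay stable between runs.
  return docs.sort((a, b) => a.source.localeCompare(b.source));
}
```

Formats such as PDF or Word would need a text extraction step first, which is outside the scope of this sketch.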
Step 8: Search the Best Chunks
When the user asks a question, we need to find the document chunks that are closest to that question.
To do that, we:
- Convert the question into an embedding
- Compare it with every saved document embedding
- Sort the results
- Keep the best matches
The comparison uses cosine similarity:
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
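Cosine similarity is easier to trust after trying it on tiny vectors. The sketch below repeats the same formula and runs it on made-up 2-dimensional vectors, chosen only because the arithmetic is easy to check by hand.

```javascript
// Same formula as above: dot product divided by the product of the vector lengths.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [2, 0]));  // 1    (same direction, different length)
console.log(cosineSimilarity([1, 0], [0, 1]));  // 0    (perpendicular)
console.log(cosineSimilarity([1, 0], [-1, 0])); // -1   (opposite direction)
console.log(cosineSimilarity([3, 4], [4, 3]));  // 0.96 (24 / (5 * 5))
```

The score depends only on the direction of the vectors, not on their length. A score close to 1 means the two texts point the same way in embedding space. If either vector is all zeros the function divides by zero and returns NaN, so a production version would want a guard for that case.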
Then we search the index:
export async function search(query, limit = 4) {
  const raw = await fs.readFile("data/index.json", "utf8");
  const index = JSON.parse(raw);
  const queryEmbedding = await embed(query);
  return index
    .map((item) => ({
      ...item,
      score: cosineSimilarity(queryEmbedding, item.embedding)
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
The result is a small set of document chunks that are most likely to contain the answer.
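The ranking step can also be tried without Ollama or an index file. In this sketch, rankChunks is a hypothetical helper that does only the in-memory part of search, and the 3-dimensional embeddings are made up for illustration. Real embeddings from nomic-embed-text are much longer.

```javascript
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every record against the query vector and keep the `limit` best.
function rankChunks(queryEmbedding, index, limit = 4) {
  return index
    .map((item) => ({ ...item, score: cosineSimilarity(queryEmbedding, item.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

// Made-up embeddings: each sentence "points" mostly along a different axis.
const toyIndex = [
  { id: "faq#0", text: "Refunds are allowed within 14 days of purchase.", embedding: [0.9, 0.1, 0.0] },
  { id: "faq#1", text: "Enterprise customers get priority email support.", embedding: [0.1, 0.9, 0.1] },
  { id: "faq#2", text: "The product supports SSO on the Business and Enterprise plans.", embedding: [0.0, 0.2, 0.9] }
];

// A query vector that points mostly along the third axis, like the SSO sentence.
const top = rankChunks([0.1, 0.1, 0.9], toyIndex, 2);
console.log(top.map((r) => r.id)); // [ 'faq#2', 'faq#1' ]
```

Keeping the ranking separate from file reading and embedding calls also makes it easy to test on its own.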
Step 9: Give Context to Mistral
Now we have the useful document chunks.
Next, we send them to Mistral with the user's question.
The prompt looks like this:
const prompt = `
Answer using only the context below.
If the answer is missing, say you do not know.
Context:
${context}
Question:
${question}
`;
This line is very important:
Answer using only the context below.
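It tells the model to answer from the supplied context instead of guessing from its training data. The second line gives it a safe way out when the context does not contain the answer.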
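The prompt template uses a context variable that has to be built from the search results. Here is a minimal sketch of one way to do that. The helper name buildPrompt and the numbered, source-tagged layout of each chunk are assumptions made for this example, not something the project prescribes.

```javascript
// Hypothetical helper: join retrieved chunks into one context string and
// wrap it in the same prompt template as above.
function buildPrompt(question, chunks) {
  const context = chunks
    .map((chunk, i) => `[${i + 1}] (${chunk.source}) ${chunk.text}`)
    .join("\n\n");

  return `
Answer using only the context below.
If the answer is missing, say you do not know.
Context:
${context}
Question:
${question}
`;
}

const example = buildPrompt("What plans support SSO?", [
  { source: "company-faq.txt", text: "The product supports SSO on the Business and Enterprise plans." }
]);
console.log(example);
```

Tagging each chunk with its source file is optional, but it makes it easier to see later which document an answer came from.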
Then we call Ollama's generation API:
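const resp = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "mistral",
    prompt,
    // Return one JSON object instead of a stream of partial responses.
    stream: false,
    // Ollama reads sampling settings from `options`, not from the top level.
    // num_predict caps the answer length in tokens.
    options: {
      num_predict: 512,
      temperature: 0.2
    }
  })
});
// The generated text is in the `response` field.
const { response: answer } = await resp.json();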
temperature: 0.2 makes the answer more focused and less random.
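Here is a minimal sketch of how the generation call could be wrapped in a reusable function that also reads the answer. The function name generate and the injectable fetchImpl parameter are assumptions made for this example. The sketch passes sampling settings under options and sets stream to false so that Ollama replies with a single JSON object whose response field holds the text.

```javascript
// Hypothetical wrapper around POST /api/generate. `fetchImpl` is injectable
// so the function can be exercised without a running Ollama server.
async function generate(
  prompt,
  { fetchImpl = fetch, baseUrl = "http://localhost:11434", model = "mistral" } = {}
) {
  const resp = await fetchImpl(`${baseUrl}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model,
      prompt,
      stream: false, // ask for a single JSON object rather than a stream
      options: { temperature: 0.2, num_predict: 512 }
    })
  });
  if (!resp.ok) {
    throw new Error(`Ollama returned HTTP ${resp.status}`);
  }
  const data = await resp.json();
  return data.response.trim(); // the generated text lives in the `response` field
}
```

With Ollama running, calling generate(prompt) with the prompt from the previous step should resolve to the answer text.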
Step 10: Ask a Question
Now run:
npm run ask:ollama -- "What plans support SSO?"
Example answer:
The product supports SSO on the Business and Enterprise plans.
This answer came from the local document.
Mistral did not need to already know your product FAQ. The RAG pipeline found the right context and gave it to the model.
The Full RAG Flow
Here is the complete flow again:
1. Put documents in data/docs
2. Split documents into chunks
3. Convert chunks into embeddings
4. Save embeddings in data/index.json
5. User asks a question
6. Convert the question into an embedding
7. Find the most similar chunks
8. Add those chunks to the prompt
9. Ask Mistral to answer using that context
That is RAG.
Search first. Generate second.
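The steps above can be sketched as one small function. This is only an illustration under stated assumptions: answerQuestion is a hypothetical name, and the search and generate functions are passed in as parameters so the flow can be tried with stand-ins before connecting the real ones.

```javascript
// Hypothetical glue function. `search` and `generate` are passed in so the
// flow can be tried with stand-ins before wiring up the real functions.
async function answerQuestion(question, { search, generate, limit = 4 }) {
  // Search first: fetch the chunks most similar to the question.
  const chunks = await search(question, limit);
  const context = chunks.map((c) => c.text).join("\n\n");

  const prompt = `
Answer using only the context below.
If the answer is missing, say you do not know.
Context:
${context}
Question:
${question}
`;

  // Generate second: the model sees the question plus the retrieved context.
  const answer = await generate(prompt);
  return { answer, sources: [...new Set(chunks.map((c) => c.source))] };
}

// Stand-ins for the real search() and generate(), for illustration only.
const fakeSearch = async () => [
  { source: "company-faq.txt", text: "The product supports SSO on the Business and Enterprise plans." }
];
const fakeGenerate = async (prompt) => `(model output for a ${prompt.length}-character prompt)`;

answerQuestion("What plans support SSO?", { search: fakeSearch, generate: fakeGenerate })
  .then((result) => console.log(result.sources)); // [ 'company-faq.txt' ]
```

Returning the sources next to the answer is a design choice for this sketch. It lets the caller show which files the answer was based on.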
Why This Local Version Is Useful
This project is intentionally simple.
It uses a JSON file instead of a vector database. That makes it easier to understand what is happening.
For learning, this is perfect.
For production, you may later replace data/index.json with a vector database such as:
- Chroma
- Qdrant
- Weaviate
- pgvector
But the core idea stays the same:
Store embeddings -> Search similar chunks -> Send context to the LLM
Conclusion
RAG is one of the most useful patterns for building practical AI apps.
It helps LLMs answer using your data without retraining the model.
In this article, we built a local RAG app with:
- Ollama
- Mistral
- Node.js
- nomic-embed-text
- Local documents
- A JSON-based vector index
The main idea is simple:
Search your documents first, then let the LLM answer with that context.
Complete implementation:
https://github.com/gaurav101/ai-experiment/tree/main/rag