In today’s tech industry, there’s a growing trend: every system is expected to be AI-powered to stay competitive. This shift has created a rising demand for engineers who not only understand artificial intelligence but also know how to build systems around it.
In this blog, I’ll walk you through how to build a Retrieval-Augmented Generation (RAG) system using various language models. This is based on a project I created and the knowledge I’ve gained while learning about RAG. I’m still learning, so consider this more of a shared journey than a tutorial from an expert.
RAG
So, let’s try to understand what RAG is.
Most language models (like GPT-4, DeepSeek, etc.) are trained on a massive amount of data. Based on that training, they can generate responses to a wide variety of questions. However, that knowledge is static — they can’t access new or private information unless we give it to them at the time of the request.
This is where Retrieval-Augmented Generation (RAG) comes in. RAG improves language models by retrieving relevant documents from an external knowledge base (like your own files or company data), and then using the language model to generate an answer based on that retrieved content.
To put it in simple terms:
In traditional systems, you design a flow — like “click here, go there” — to access specific information. With RAG, you feed all your private or custom information into a searchable system, and the language model fetches and presents it to users in natural, human-like responses.
It’s like giving the language model a smart memory extension — one that you fully control.
Now that we’ve covered the basics, let’s explore how to build a personalized chatbot—one that can give information specifically about you, something ChatGPT or other general models can’t do, since they don’t recognize or store your personal context.
To access large language models (LLMs), we first need their API keys, so grab an API key for your favorite model. For this, I am using Google's Gemini, since its free tier is enough for our learning purposes.
Secondly, we need to upload our knowledge to a vector database.
What is a vector database?
It is like a regular database, but instead of storing data as plain text, it stores it as long sequences of numbers called vectors. Each number represents a dimension. These vectors capture the semantics of the data, that is, what the data means, rather than the raw text itself.
So why do we need our information stored in a vector database?
Since the amount of information we want to share with an LLM can be quite large, passing all of it with every query is inefficient. It increases token usage (and thus cost) and makes it harder for the model to focus on the most relevant parts of the data for that specific query. So, when giving our data to the LLM, we want it to be as small and specific as possible.
Can't we just pass that specific data using a plain database?
If you store your knowledge base as plain text (e.g., in a SQL table or a file) and want to retrieve relevant passages, your only options are:
- Keyword search (like SQL LIKE, or basic string matching),
- Or using a full-text search engine (e.g., Elasticsearch, Meilisearch).
These approaches:
- Depend heavily on exact words,
- Can miss semantically similar but differently worded queries,
- Are not great at understanding context or meaning.
For example, if a user asks: “How to treat a cold?”—a keyword-based system might miss a passage that says: “Remedies for common flu symptoms,” because the words don’t match exactly. However, a vector-based system can recognize that both queries mean the same thing.
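To make this concrete, here is a minimal sketch (assuming a GOOGLE_API_KEY is set in your shell and that you use the same Gemini embedding model we'll use later in this post) that embeds those sentences and compares them with cosine similarity:

```ts
import { GoogleGenerativeAIEmbeddings } from "@langchain/google-genai";

const embeddings = new GoogleGenerativeAIEmbeddings({ modelName: "text-embedding-004" });

// Cosine similarity: close to 1 means very similar meaning, close to 0 means unrelated.
const cosine = (a: number[], b: number[]) => {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
};

const compare = async () => {
  const [cold, flu, pasta] = await Promise.all([
    embeddings.embedQuery("How to treat a cold?"),
    embeddings.embedQuery("Remedies for common flu symptoms"),
    embeddings.embedQuery("Best pasta recipes for dinner"),
  ]);

  console.log("cold vs flu:", cosine(cold, flu));     // expected: relatively high
  console.log("cold vs pasta:", cosine(cold, pasta)); // expected: noticeably lower
};

compare();
```

The exact numbers will vary, but the two related sentences should score clearly higher than the unrelated one, which is exactly what keyword matching cannot capture.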
So what we do is upload our information to a vector database using the LLM provider's vector embedding model.
Why is a vector database so good at understanding context?
The vector database itself isn’t what “understands” context. It’s the embedding model that creates context-aware vectors. The vector database is simply optimized to search those vectors efficiently.
Let's understand this with an example:
A vector is just a list of numbers like:
[0.12, -0.88, 0.44, 0.31, ..., 0.05] ← a vector with 5 numbers would be a 5-dimensional (5D) vector
Suppose we have a text: "Apple is a fruit."
Its equivalent vector embedding could be: [0.1, 0.9, 0.2]
So it is a 3-dimensional vector, where each dimension could mean something. For example, in the vector above:

| Dimension | Hypothetical meaning |
| --- | --- |
| D1 | Is it about technology or food? (tech → +1, food → -1) |
| D2 | Is it about a company or a general concept? (company → +1) |
| D3 | Is it about a product or a description? (product → +1) |
This is a very simple example. In reality, the embeddings produced by these models have many dimensions, such as 512 or 1536, and the meaning of each dimension can be very complex, certainly not human-readable.
The specific structure of the embedding—and what each dimension represents—is designed by the engineers building the language model. That’s why each model (e.g., OpenAI, Google, Cohere) has its own unique embedding system and vector dimensions. So, if you're using OpenAI’s models, you need to embed your data using their embedding API.
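For example, here is a small sketch (assuming GOOGLE_API_KEY is set) that checks how many dimensions Gemini's text-embedding-004 model produces; this is the number your database's vector column will need to match:

```ts
import { GoogleGenerativeAIEmbeddings } from "@langchain/google-genai";

const embeddings = new GoogleGenerativeAIEmbeddings({ modelName: "text-embedding-004" });

const checkDimensions = async () => {
  const vector = await embeddings.embedQuery("Apple is a fruit.");
  console.log(vector.length); // 768 for text-embedding-004; other providers/models differ
};

checkDimensions();
```

Keep that number in mind; we will need it when defining the vector column in the database below.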
So let’s revise the whole flow one last time:
1. Upload Knowledge: First, we upload our knowledge base to a vector database using a vector embedding model.
2. User Query: When an end-user submits a prompt, our system queries the vector database to retrieve the most relevant pieces of information related to that prompt.
3. Retrieve Context: We don't send the vector embeddings themselves to the language model. Instead, we retrieve the original text chunks (whose embeddings matched the query) and use them as context.
4. Generate Response: We send the user's query along with the retrieved context to the LLM and stream the generated response back to the client.
Note: The system queries the vector database using the user's prompt (converted into an embedding), finds the most similar embeddings in the database, and uses the associated text chunks—not the embeddings themselves—as input to the LLM.
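Put together, the request-time part of the flow looks roughly like the sketch below. This is only a compact preview of what we will build step by step in the rest of the post; the table, the match_documents function, and the real client setup come next, and the URL/key here are placeholders.

```ts
import { GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { createClient } from "@supabase/supabase-js";

// Placeholder clients; we'll configure the real ones in the next sections.
const embeddings = new GoogleGenerativeAIEmbeddings({ modelName: "text-embedding-004" });
const llm = new ChatGoogleGenerativeAI({ model: "gemini-1.5-pro" });
const supabase = createClient("your-project-url", "your-secret-key");

async function answer(query: string) {
  // 1. Embed the user's prompt with the same model used for the knowledge base.
  const queryEmbedding = await embeddings.embedQuery(query);

  // 2. Ask the vector database for the most similar chunks.
  const { data } = await supabase.rpc("match_documents", {
    query_embedding: queryEmbedding,
    match_count: 4,
  });

  // 3. Use the matched chunks' original text (not their vectors) as context.
  const context = (data ?? []).map((d: { content: string }) => d.content).join("\n\n");

  // 4. Generate the final answer with that context.
  const response = await llm.invoke(`${context}\n\nQuestion: ${query}`);
  return response.content;
}
```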
Let's see how to implement it practically.
Firstly, we need vector storage. There are many specialized service providers for storing vectors. Since Postgres itself supports vector data through the pgvector extension, I am going to use Supabase (which is a Postgres-based database service provider).
So create an account in Supabase and run these statements one by one in the SQL editor:
-- Enable the pgvector extension to work with embedding vectors
create extension vector;
-- Create a table to store your documents
create table documents (
id bigserial primary key,
content text, -- corresponds to Document.pageContent
metadata jsonb, -- corresponds to Document.metadata
embedding vector(768) -- 768 works for Gemini embeddings, change if needed
);
-- Create a function to search for documents
create function match_documents (
query_embedding vector(768), -- must match the dimension of the documents.embedding column
match_count int DEFAULT null,
filter jsonb DEFAULT '{}'
) returns table (
id bigint,
content text,
metadata jsonb,
embedding jsonb,
similarity float
)
language plpgsql
as $$
#variable_conflict use_column
begin
return query
select
id,
content,
metadata,
(embedding::text)::jsonb as embedding,
1 - (documents.embedding <=> query_embedding) as similarity
from documents
where metadata @> filter
order by documents.embedding <=> query_embedding
limit match_count;
end;
$$;
This SQL snippet sets up a semantic search system using PostgreSQL and the pgvector extension. It enables storing and querying text documents based on their vector embeddings.
- The pgvector extension is enabled to store and compare high-dimensional vectors.
- A documents table is created to store:
  - content: the actual text,
  - metadata: optional structured data (e.g., tags, source),
  - embedding: a 768-dimensional vector representation of the text.
- A search function match_documents is defined to:
  - take a query embedding and an optional metadata filter,
  - return the most similar documents based on cosine distance,
  - include a similarity score and metadata in the result.
This setup enables efficient semantic search with filtering directly inside PostgreSQL.
Secondly, we need to prepare the knowledge to upload.
Ideally, a single piece of information should not get split across chunks when it is uploaded. Since I will be setting each chunk size to 500 characters, each self-contained piece of information should fit within roughly 500 characters. Keeping this in mind, I created a .txt file with my information.
Let's see the knowledge upload code:
const path = require("path")
const fs = require("fs")
const { RecursiveCharacterTextSplitter } = require("@langchain/textsplitters");
const { createClient } = require("@supabase/supabase-js")
const { GoogleGenerativeAIEmbeddings } = require("@langchain/google-genai")
const { TaskType } = require("@google/generative-ai")
const filePath = path.join(__dirname, "knowledge.txt");
const data = fs.readFileSync(filePath, { encoding: "utf-8" });
const embedSentence = async () => {
try {
// Initialize Supabase client
const supabase = createClient(
"project_url",
"project_secret"
);
// Initialize Google Generative AI Embeddings
const embeddings = new GoogleGenerativeAIEmbeddings({
modelName: "text-embedding-004",
taskType: TaskType.RETRIEVAL_DOCUMENT, // we are indexing documents here, not querying
});
// Split the data into chunks
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 500 });
const chunks = await splitter.createDocuments([data]);
// Generate embeddings and prepare data for insertion
const rows = await Promise.all(
chunks.map(async (chunk) => {
const vector = await embeddings.embedQuery(chunk.pageContent);
return {
content: chunk.pageContent,
embedding: vector,
};
})
);
console.log(rows)
// Insert the data into Supabase
const { error } = await supabase.from('documents').insert(rows);
if (error) throw error;
console.log('Data uploaded successfully');
} catch (error) {
console.error('Error uploading data:', error);
}
}
embedSentence()
So, to briefly explain the code:
- We are using LangChain.js, a framework for building LLM applications. It gives you reusable components and clean abstractions. Each LLM provider also offers its own SDK, which we could use instead; there is no significant difference for our purposes.
- I am reading the text from the knowledge.txt file, where all my information resides.
- I am splitting that text into chunks of 500 characters each using the text splitter provided by LangChain.
- Then I loop through every chunk, convert the raw English text to a vector using Gemini's embedding model, and upload both the vector and the chunk text to Supabase using the Supabase client.
- Finally, run the code using:
node yourfilename.js
Note: You need to provide your Gemini API key to your shell before running the upload:
export GOOGLE_API_KEY="your-gemini-api-key-here"
After the information has been uploaded, you can check your Supabase dashboard, where all the uploaded chunks will be stored.
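Optionally, before wiring up the chatbot, you can sanity-check retrieval with a quick script like the sketch below (the test question and the placeholder URL/key are just examples; adjust them to your own data):

```ts
import { createClient } from "@supabase/supabase-js";
import { GoogleGenerativeAIEmbeddings } from "@langchain/google-genai";
import { TaskType } from "@google/generative-ai";

const supabase = createClient("project_url", "project_secret");
const embeddings = new GoogleGenerativeAIEmbeddings({
  modelName: "text-embedding-004",
  taskType: TaskType.RETRIEVAL_QUERY, // we're querying now, not indexing
});

const testSearch = async () => {
  // Replace with a question your knowledge.txt can actually answer.
  const queryEmbedding = await embeddings.embedQuery("Where did I go to school?");
  const { data, error } = await supabase.rpc("match_documents", {
    query_embedding: queryEmbedding,
    match_count: 2,
  });
  if (error) throw error;
  for (const row of data ?? []) {
    console.log(row.similarity.toFixed(3), row.content.slice(0, 80));
  }
};

testSearch();
```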
Now let's use this information when a user queries our chatbot.
Code:
import { GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI } from "@langchain/google-genai"
import { createClient } from "@supabase/supabase-js"
import { TaskType } from "@google/generative-ai"
import { ChatPromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate } from "@langchain/core/prompts";
const embeddings = new GoogleGenerativeAIEmbeddings({
modelName: "text-embedding-004",
taskType: TaskType.RETRIEVAL_QUERY,
});
const llm = new ChatGoogleGenerativeAI({
model: "gemini-1.5-pro",
temperature: 0.5,
maxRetries: 2,
streaming: true
})
const supabaseClient = createClient(
"supabase-url",
"secret-key"
);
export async function POST(request: Request) {
const requestBody = await request.json()
const query = requestBody?.query
const inputembedding = await embeddings.embedQuery(query)
const { data, error } = await supabaseClient.rpc("match_documents", {
match_count: 4,
query_embedding: inputembedding,
})
const systemPrompt = `You are a very nice and helpful chatbot. Always respond politely.`;
const systemMessagePrompt = SystemMessagePromptTemplate.fromTemplate(systemPrompt);
const humanMessagePrompt = HumanMessagePromptTemplate.fromTemplate("{context}\n\nQuestion: {question}");
const chatPrompt = ChatPromptTemplate.fromMessages([
systemMessagePrompt,
humanMessagePrompt
]);
const pipeline = chatPrompt.pipe(llm)
const context = data?.map((item: any) => {
return item?.content
}).join('\n\n')
const encoder = new TextEncoder();
const stream = new ReadableStream({
async start(controller) {
const streamIterator = await pipeline.stream({ context, question: query })
for await (const chunk of streamIterator) {
controller.enqueue(encoder.encode(chunk?.content as string));
}
controller.close();
}
})
return new Response(stream, {
headers: {
"Content-Type": "text/event-stream",
"Cache-Control": "no-cache",
Connection: "keep-alive",
"Access-Control-Allow-Origin": "*",
},
});
}
Explanation:
Note: Here I'm using a Next.js server, but you can choose any Node.js environment; the packages and the code will be almost the same.
The function takes the user's query and first embeds it using the same embedding model that was used to embed our knowledge.
Then it calls the "match_documents" Postgres function we created earlier while setting up Supabase, passing the embedded user query and the number of matching documents to return; for now, 4 should be enough.
Then the system prompt and the human message prompt are created. In the system prompt you can customize how the LLM responds; this is where you can apply all of your prompt engineering knowledge.
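For example, for a personal chatbot you might tighten the system prompt so the model sticks to the retrieved context (the wording below is just a suggestion, not part of the original project):

```ts
// A stricter system prompt for a personal RAG chatbot (example wording only).
const systemPrompt = `You are a friendly assistant that answers questions about me,
using ONLY the provided context. If the answer is not in the context,
politely say you don't have that information instead of guessing.`;
```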
Then the context is assembled as plain text, which is provided to the LLM along with the system prompt and the user's question.
Finally, the LLM streams its response to our backend, and we stream that response on to the frontend.
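On the frontend, consuming that stream could look like this minimal sketch (it assumes the route above is served at /api/chat; adjust the path to wherever your handler lives):

```ts
// Read the streamed response chunk by chunk and hand each piece to the UI.
async function streamAnswer(query: string, onToken: (text: string) => void) {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query }),
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    onToken(decoder.decode(value, { stream: true })); // append each chunk as it arrives
  }
}

// Example usage: streamAnswer("Who are you?", (text) => console.log(text));
```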
This is it.