In the previous episode, we built an interactive chatbot that can respond to user questions, and we secured its API key. Now we want to enhance it by implementing Retrieval-Augmented Generation (RAG).
RAG allows us to specialize our chatbot by enabling it to use our own documents as a knowledge source. These documents can belong to specific domains such as finance, law, science, mathematics, or any other specialized field. Instead of relying only on the model’s pre-trained knowledge, we provide it with domain-specific information so it can generate more accurate and relevant responses.
The first step in implementing RAG is preparing the source data. The source consists of the documents we want the chatbot to learn from, such as PDFs, text files, reports, or databases.
However, a language model cannot efficiently process large raw documents directly every time a user asks a question. Large documents may exceed the model’s context limit and would be inefficient to pass into the prompt repeatedly. Therefore, we need to preprocess these documents.
The preprocessing steps typically include:
- Splitting the documents into smaller chunks. Each document is divided into manageable pieces that fit within the model's context window.
- Generating embeddings for each chunk. Using an embedding model (for example, from OpenAI), each text chunk is converted into a numerical vector representation. These vectors capture the semantic meaning of the text.
- Storing embeddings in a vector database. The generated embeddings are stored in a vector database such as Pinecone, Weaviate, Chroma, or FAISS, which lets us perform similarity searches efficiently.

When a user submits a question:

- The question is converted into an embedding.
- The system searches the vector database for the most relevant document chunks.
- These retrieved chunks are then provided to the language model as additional context.
- Finally, the model generates an answer grounded in the retrieved information.

By applying RAG, our chatbot becomes more accurate, domain-specific, and capable of providing responses based on our own knowledge base rather than relying solely on general training data.
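To make the retrieval idea concrete, here is a minimal sketch of the similarity search that happens under the hood. The vectors below are tiny made-up toy embeddings, not real model output; in the actual app, the question and chunks would each be embedded by a model such as sentence-transformers, and the search would be done by the vector database:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: higher means more semantically similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real ones have hundreds of dimensions)
chunk_vectors = {
    "BCA's net profit grew in 2023": [0.9, 0.1, 0.2],
    "The bank opened new branches":  [0.2, 0.8, 0.1],
}

# Pretend this is the embedding of the question "How was BCA's profit?"
question_vector = [0.85, 0.15, 0.25]

# Retrieve the chunk whose embedding is closest to the question's embedding
best_chunk = max(chunk_vectors,
                 key=lambda c: cosine_similarity(question_vector, chunk_vectors[c]))
print(best_chunk)  # the profit-related chunk is the closest match
```

Vector databases like Chroma do exactly this kind of nearest-neighbor search, just at scale and with efficient indexing.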
Now, for the source, I'm using the Annual Report from BCA (Bank Central Asia).

This file is about 600 pages and mostly full of text. Now we're moving to the next step.
And make sure the structure of our app looks like this:

- Nebula/ (project folder)
  - Source/ (the Annual Report from BCA)
  - .env
  - nebula.py (our main program)
So, our program workflow should look like this:
- Load PDF
- Split into chunks
- Create embeddings
- Store in ChromaDB
- Search relevant chunks
- Send context + question to Nebula API
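The "split into chunks" step above is worth seeing in code. Here is a minimal sketch of a character-based chunker with overlap (the overlap keeps sentences that straddle a boundary from being cut off entirely). Libraries like langchain provide more sophisticated splitters, but the idea is the same; the chunk sizes here are illustrative, not tuned values:

```python
def split_into_chunks(text, chunk_size=500, overlap=50):
    """Split text into overlapping character chunks.

    Each chunk is at most `chunk_size` characters, and consecutive
    chunks share `overlap` characters so context isn't lost at the edges.
    """
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

# Example: a 1000-character document with 500-char chunks and 50-char overlap
doc = "x" * 1000
chunks = split_into_chunks(doc)
print(len(chunks))  # 3 chunks: positions 0-500, 450-950, 900-1000
```

In the real pipeline, each of these chunks is then embedded and stored in ChromaDB.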
Preparing Requirements:
Before we can run our program, we need to install several libraries to support it:
- chromadb
- langchain (optional, but better for chunking)
- pypdf
- sentence-transformers
- tiktoken
- python-dotenv (We have installed it before)
- openai (We have installed it before)

Do we need to set up a virtual environment for this project? Well, the answer depends on the scale of our app. If we want this chatbot to handle many more sources and gain more capabilities, and we're worried about the environment breaking because of inconsistent library versions across updates, the answer is YES. But for the kind of simple RAG chatbot I'm building here to show you how to use the Nebula API, it might not be necessary.
Okay, now we need to install all the required libraries in our environment / local device. The command is very simple: `pip install <package name>`. So if I want to install ChromaDB, I just type `pip install chromadb` in my CLI and hit Enter.

If you look at my CLI, all the output says "Requirement already satisfied". That happens because I already installed all of those packages in the past, so pip tells me and skips the process.
But there are several ways to do the installation:

- One by one (like `pip install chromadb`)
- As a list (`pip install chromadb langchain` etc.)
- From a file (put all the libraries in a requirements.txt, then in the CLI just type `pip install -r requirements.txt` and hit Enter)

Whichever you use is fine; they just serve different purposes.
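For reference, a requirements.txt for this project could simply list the libraries above (versions are omitted here; pin them, e.g. `chromadb==0.5.0`, if you want reproducible installs):

```
chromadb
langchain
pypdf
sentence-transformers
tiktoken
python-dotenv
openai
```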
OKAY, in the next episode we will try to ingest the PDF into our database (Chroma DB). It will be a bit difficult at first, but I know you guys reading this article will master it easily.
And for those of you who want to build a chatbot, or other things with AI, you can check out NEBULA LAB here:
https://nebula-data.ai/
There's much more than just an API key for all models; they also have a ton of other features, such as marketing tools and more.