NEBULA DATA

Specialized Chatbot using RAG (Retrieval-Augmented Generation) — Part II

In the previous episode, we discussed the concept of Retrieval-Augmented Generation (RAG) and prepared our project structure, requirements, and source data.

Now we are moving to the most important step of the RAG pipeline, which is ingesting our knowledge source into the vector database.

This process includes several stages:

  1. Reading the PDF
  2. Splitting the document into chunks
  3. Creating embeddings
  4. Storing them inside ChromaDB

Once this process is completed, our chatbot will have a searchable knowledge base.

Later, when a user asks a question, our system will retrieve the most relevant parts of the document and use them as context for the model.

Understanding the Ingestion Pipeline

Before jumping into the code, let's understand the overall workflow.

Our program will perform the following steps:

  1. Load the PDF document
  2. Extract text from the PDF
  3. Split the text into smaller chunks
  4. Convert each chunk into embeddings
  5. Store the embeddings and text into ChromaDB

The result will be a vector database that represents the entire BCA Annual Report in semantic form.

Importing Required Libraries

First, we import all the libraries used in our program.

```python
import os
import chromadb
from pypdf import PdfReader
from openai import OpenAI
from dotenv import load_dotenv
```

Explanation of Each Library

os
Used to interact with our system environment.

chromadb
Our vector database that will store embeddings.

pypdf
Used to read and extract text from PDF documents.

OpenAI client (Nebula API)
Used to generate embeddings.

dotenv
Used to securely load our API key from the .env file.

Loading Environment Variables

Next, we load the environment variables.

```python
load_dotenv()
```

This allows our program to read the NEBULA_API_KEY stored inside the .env file.

Example .env file:


```
NEBULA_API_KEY=your_api_key_here
```

This method is important because API keys should never be hardcoded directly inside the program.


Initializing Nebula API Client

Now we initialize the API client that will communicate with Nebula.

```python
client = OpenAI(
    api_key=os.getenv("NEBULA_API_KEY"),
    base_url="https://llm.ai-nebula.com/v1"
)
```

Here we use an OpenAI-compatible client, but the request is actually sent to the Nebula API endpoint.

This allows us to use Nebula infrastructure while keeping the OpenAI SDK interface.

Initializing ChromaDB

Now we prepare our vector database.

```python
db_path = "./chroma_db"

chroma_client = chromadb.PersistentClient(path=db_path)

collection = chroma_client.get_or_create_collection(
    name="bca_annual_report_2025"
)
```

Explanation

PersistentClient
Creates a persistent database stored locally.

db_path
The folder where our vector database will be stored.

Collection
Similar to a table in a traditional database.

In this case we create a collection called `bca_annual_report_2025`.

This collection will contain:

  • document chunks
  • embeddings
  • document IDs

Step 1 — Reading the PDF

Now we create a function to read the entire PDF document.

```python
def read_pdf(path):
    reader = PdfReader(path)
    text = ""

    for page in reader.pages:
        # extract_text() can return None for pages without a text layer
        text += (page.extract_text() or "") + "\n"

    return text
```

What This Function Does

  1. Opens the PDF file
  2. Iterates through every page
  3. Extracts the text
  4. Combines all text into a single string

Since the BCA Annual Report contains around 600 pages, this step may take a few seconds depending on your machine.

Step 2 — Splitting the Document into Chunks

Next we split the document into smaller pieces.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    words = text.split()
    chunks = []

    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)

    return chunks
```

Why Do We Need Chunking?

Because:

  • Language models have context limits
  • Sending the entire document into the prompt would be extremely inefficient

Instead, we break the document into smaller segments.

In this code:

  • `chunk_size` = 500 words
  • `overlap` = 50 words

The overlap helps preserve context continuity between chunks, which improves retrieval quality.

Example Structure

```
Chunk 1 : word 1   → word 500
Chunk 2 : word 451 → word 950
Chunk 3 : word 901 → word 1400
```

This technique helps prevent information loss between chunks.
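To see the overlap concretely, here is a small self-contained check using the same splitting logic as above, with 10 toy words, a chunk size of 4, and an overlap of 1:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    words = text.split()
    chunks = []

    # Step forward by chunk_size - overlap, so neighbouring chunks share words
    for i in range(0, len(words), chunk_size - overlap):
        chunks.append(" ".join(words[i:i + chunk_size]))

    return chunks

# Toy example: 10 "words", chunks of 4 words with an overlap of 1
toy = " ".join(f"w{n}" for n in range(1, 11))
print(chunk_text(toy, chunk_size=4, overlap=1))
# → ['w1 w2 w3 w4', 'w4 w5 w6 w7', 'w7 w8 w9 w10', 'w10']
```

Note the short trailing chunk (`'w10'`); with the real settings (500/50), the last chunk simply contains whatever words remain.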

Step 3 — Creating Embeddings

Now we convert each chunk into embeddings.

```python
def embed(texts):
    # Note: very large batches may exceed the API's per-request input limit,
    # so for thousands of chunks you may need to split `texts` into batches.
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts
    )

    return [e.embedding for e in response.data]
```

Embeddings are numerical vector representations of text.

Instead of storing raw text only, we convert each chunk into vectors so the database can perform semantic similarity search.

Example Concept


```
"bank revenue growth" → [0.021, -0.771, 0.144, ...]
```

Texts with similar meaning will have vectors close to each other in vector space.

This is what allows RAG systems to find relevant knowledge quickly.
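"Close to each other" is typically measured with cosine similarity, one of the distance metrics vector databases such as ChromaDB support. A minimal sketch with made-up 3-dimensional vectors (real embeddings have thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for illustration only
revenue = [0.9, 0.1, 0.2]
income  = [0.8, 0.2, 0.3]
weather = [0.1, 0.9, 0.1]

print(cosine_similarity(revenue, income))   # high: similar meaning
print(cosine_similarity(revenue, weather))  # low: unrelated
```

Retrieval then amounts to finding the stored vectors with the highest similarity to the query vector.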

Step 4 — Running the Ingestion Process

Now we run the entire ingestion pipeline.

```python
pdf_path = "source/20260212-BCA-AR-2025-ID.pdf"

if os.path.exists(pdf_path):
    print("⏳ Reading PDF...")
    pdf_text = read_pdf(pdf_path)

    print("⏳ Creating chunks...")
    chunks = chunk_text(pdf_text)

    print(f"⏳ Creating embeddings for {len(chunks)} chunks...")
    embeddings = embed(chunks)

    collection.add(
        documents=chunks,
        embeddings=embeddings,
        ids=[str(i) for i in range(len(chunks))]
    )

    print(f"✅ Success! Database saved to: {db_path}")
else:
    print(f"❌ File not found: {pdf_path}")
```

Step-by-Step Explanation

Step 1 — Check if the PDF exists

```python
os.path.exists(pdf_path)
```

This prevents errors if the file is missing.


Step 2 — Extract the text

```python
pdf_text = read_pdf(pdf_path)
```

The entire PDF is converted into raw text.

Step 3 — Create chunks

```python
chunks = chunk_text(pdf_text)
```

The document is split into smaller pieces.

For a 600-page report, this may generate hundreds or even thousands of chunks.

Step 4 — Generate embeddings

```python
embeddings = embed(chunks)
```

Each chunk is converted into a vector representation using the embedding model.


Step 5 — Store everything in ChromaDB

```python
collection.add(
    documents=chunks,
    embeddings=embeddings,
    ids=[str(i) for i in range(len(chunks))]
)
```

We store three components:

  • documents → the text chunks
  • embeddings → vector representations
  • ids → unique identifiers

Now our vector database is ready.

After the Ingestion


Once the ingestion process is finished, our database will contain semantic vectors for every chunk of the BCA Annual Report.


This means our system can now perform semantic search instead of traditional keyword search.

What Happens Next?

Now that our knowledge base has been stored inside ChromaDB, the next step is building the retrieval pipeline.

In the next part we will implement:

  1. Convert user question into embedding
  2. Search the vector database
  3. Retrieve the most relevant chunks
  4. Send them as context to Nebula API
  5. Generate a grounded response

This is where the actual RAG magic happens.

Nebula Lab

For those who want to build chatbots or other AI applications, you can check Nebula Lab here:

https://nebula-data.ai/

They offer more than just API access to multiple models, including various tools and features for AI development.

See you in the next part.
