NEBULA DATA

Specialized Chatbot using RAG (Retrieval-Augmented Generation) — Part II

In the previous episode, we discussed the concept of Retrieval-Augmented Generation (RAG) and prepared our project structure, requirements, and source data.

Now we are moving to the most important step of the RAG pipeline, which is ingesting our knowledge source into the vector database.

This process includes several stages:

  1. Reading the PDF
  2. Splitting the document into chunks
  3. Creating embeddings
  4. Storing them inside ChromaDB

Once this process is completed, our chatbot will have a searchable knowledge base.

Later, when a user asks a question, our system will retrieve the most relevant parts of the document and use them as context for the model.

Understanding the Ingestion Pipeline

Before jumping into the code, let's understand the overall workflow.

Our program will perform the following steps:

  1. Load the PDF document
  2. Extract text from the PDF
  3. Split the text into smaller chunks
  4. Convert each chunk into embeddings
  5. Store the embeddings and text into ChromaDB

The result will be a vector database that represents the entire BCA Annual Report in semantic form.

Importing Required Libraries

First, we import all the libraries used in our program.

```python
import os
import chromadb
from pypdf import PdfReader
from openai import OpenAI
from dotenv import load_dotenv
```

Explanation of Each Library

os
Used to interact with our system environment.

chromadb
Our vector database that will store embeddings.

pypdf
Used to read and extract text from PDF documents.

OpenAI client (Nebula API)
Used to generate embeddings.

dotenv
Used to securely load our API key from the .env file.

Loading Environment Variables

Next, we load the environment variables.

```python
load_dotenv()
```

This allows our program to read the NEBULA_API_KEY stored inside the .env file.

Example .env file:


```
NEBULA_API_KEY=your_api_key_here
```

This method is important because API keys should never be hardcoded directly inside the program.


Initializing Nebula API Client

Now we initialize the API client that will communicate with Nebula.

```python
client = OpenAI(
    api_key=os.getenv("NEBULA_API_KEY"),
    base_url="https://llm.ai-nebula.com/v1"
)
```

Here we use an OpenAI-compatible client, but the request is actually sent to the Nebula API endpoint.

This allows us to use Nebula infrastructure while keeping the OpenAI SDK interface.

Initializing ChromaDB

Now we prepare our vector database.

```python
db_path = "./chroma_db"

chroma_client = chromadb.PersistentClient(path=db_path)

collection = chroma_client.get_or_create_collection(
    name="bca_annual_report_2025"
)
```

Explanation

PersistentClient
Creates a persistent database stored locally.

db_path
The folder where our vector database will be stored.

Collection
Similar to a table in a traditional database.

In this case we create a collection called `bca_annual_report_2025`.

This collection will contain:

  • document chunks
  • embeddings
  • document IDs

Step 1 — Reading the PDF

Now we create a function to read the entire PDF document.

```python
def read_pdf(path):
    reader = PdfReader(path)
    text = ""

    for page in reader.pages:
        # extract_text() can return None for pages without a text layer
        text += (page.extract_text() or "") + "\n"

    return text
```

What This Function Does

  1. Opens the PDF file
  2. Iterates through every page
  3. Extracts the text
  4. Combines all text into a single string

Since the BCA Annual Report contains around 600 pages, this step may take a few seconds depending on your machine.

Step 2 — Splitting the Document into Chunks

Next we split the document into smaller pieces.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    words = text.split()
    chunks = []

    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)

    return chunks
```

Why Do We Need Chunking?

Because:

  • Language models have context limits
  • Sending the entire document into the prompt would be extremely inefficient

Instead, we break the document into smaller segments.

In this code:

  • `chunk_size` = 500 words
  • `overlap` = 50 words

The overlap helps preserve context continuity between chunks, which improves retrieval quality.

Example Structure

```
Chunk 1 : word 1   → word 500
Chunk 2 : word 451 → word 950
Chunk 3 : word 901 → word 1400
```

This technique helps prevent information loss between chunks.
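To see the overlap concretely, here is a small self-contained check using the same splitting logic as above, with 10 toy words, a chunk size of 4, and an overlap of 1:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    words = text.split()
    chunks = []

    # Step forward by chunk_size - overlap, so neighbouring chunks share words
    for i in range(0, len(words), chunk_size - overlap):
        chunks.append(" ".join(words[i:i + chunk_size]))

    return chunks

# Toy example: 10 "words", chunks of 4 words with an overlap of 1
toy = " ".join(f"w{n}" for n in range(1, 11))
print(chunk_text(toy, chunk_size=4, overlap=1))
# → ['w1 w2 w3 w4', 'w4 w5 w6 w7', 'w7 w8 w9 w10', 'w10']
```

Note the short trailing chunk (`'w10'`); with the real settings (500/50), the last chunk simply contains whatever words remain.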

Step 3 — Creating Embeddings

Now we convert each chunk into embeddings.

```python
def embed(texts):
    # Note: very large batches may exceed the API's per-request input limit,
    # so for thousands of chunks you may need to split `texts` into batches.
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts
    )

    return [e.embedding for e in response.data]
```

Embeddings are numerical vector representations of text.

Instead of storing raw text only, we convert each chunk into vectors so the database can perform semantic similarity search.

Example Concept


```
"bank revenue growth" → [0.021, -0.771, 0.144, ...]
```

Texts with similar meaning will have vectors close to each other in vector space.

This is what allows RAG systems to find relevant knowledge quickly.
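"Close to each other" is typically measured with cosine similarity, one of the distance metrics vector databases such as ChromaDB support. A minimal sketch with made-up 3-dimensional vectors (real embeddings have thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for illustration only
revenue = [0.9, 0.1, 0.2]
income  = [0.8, 0.2, 0.3]
weather = [0.1, 0.9, 0.1]

print(cosine_similarity(revenue, income))   # high: similar meaning
print(cosine_similarity(revenue, weather))  # low: unrelated
```

Retrieval then amounts to finding the stored vectors with the highest similarity to the query vector.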

Step 4 — Running the Ingestion Process

Now we run the entire ingestion pipeline.

```python
pdf_path = "source/20260212-BCA-AR-2025-ID.pdf"

if os.path.exists(pdf_path):
    print("⏳ Reading PDF...")
    pdf_text = read_pdf(pdf_path)

    print("⏳ Creating chunks...")
    chunks = chunk_text(pdf_text)

    print(f"⏳ Creating embeddings for {len(chunks)} chunks...")
    embeddings = embed(chunks)

    collection.add(
        documents=chunks,
        embeddings=embeddings,
        ids=[str(i) for i in range(len(chunks))]
    )

    print(f"✅ Success! Database saved to: {db_path}")
else:
    print(f"❌ File not found: {pdf_path}")
```

Step-by-Step Explanation

Step 1 — Check if the PDF exists

```python
os.path.exists(pdf_path)
```

This prevents errors if the file is missing.


Step 2 — Extract the text

```python
pdf_text = read_pdf(pdf_path)
```

The entire PDF is converted into raw text.

Step 3 — Create chunks

```python
chunks = chunk_text(pdf_text)
```

The document is split into smaller pieces.

For a 600-page report, this may generate hundreds or even thousands of chunks.

Step 4 — Generate embeddings

```python
embeddings = embed(chunks)
```

Each chunk is converted into a vector representation using the embedding model.


Step 5 — Store everything in ChromaDB

```python
collection.add(
    documents=chunks,
    embeddings=embeddings,
    ids=[str(i) for i in range(len(chunks))]
)
```

We store three components:

  • documents → the text chunks
  • embeddings → vector representations
  • ids → unique identifiers

Now our vector database is ready.

After the Ingestion


Once the ingestion process is finished, our database will contain semantic vectors for every chunk of the BCA Annual Report.


This means our system can now perform semantic search instead of traditional keyword search.

What Happens Next?

Now that our knowledge base has been stored inside ChromaDB, the next step is building the retrieval pipeline.

In the next part we will implement:

  1. Convert user question into embedding
  2. Search the vector database
  3. Retrieve the most relevant chunks
  4. Send them as context to Nebula API
  5. Generate a grounded response

This is where the actual RAG magic happens.

Nebula Lab

For those who want to build chatbots or other AI applications, you can check Nebula Lab here:

https://nebula-data.ai/

They offer more than just API access to multiple models, including various tools and features for AI development.

See you in the next part.
