Oswin Heman-Ackah

Building a chatbot with Python (Backend)

Documentation: Backend (backend.py)
This script processes PDF documents into vector embeddings and builds a FAISS index for semantic search.
It is the offline preprocessing pipeline for the Indaba RAG chatbot.

Key Responsibilities

  1. Load PDFs from a folder.
  2. Extract raw text using PyPDF2.
  3. Chunk large documents into smaller overlapping text segments.
  4. Convert chunks into embeddings using SentenceTransformers.
  5. Build and persist a FAISS index for similarity search.
  6. Save the raw chunks for later retrieval.

Step-by-Step Breakdown

Imports and Setup

import os
import pickle
import numpy as np
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer
import faiss


  • os → file system operations.
  • pickle → save preprocessed chunks.
  • numpy → numerical array handling.
  • PyPDF2 → extract text from PDF files.
  • SentenceTransformer → embedding model (all-MiniLM-L6-v2).
  • faiss → efficient similarity search.
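
If you are setting this up from scratch, these map to the following PyPI packages (faiss-cpu is assumed here; there is also a faiss-gpu build for CUDA machines):

pip install PyPDF2 sentence-transformers faiss-cpu numpy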

Model and constants:

embedder = SentenceTransformer("all-MiniLM-L6-v2")
INDEX_FILE = "faiss_index.bin"
CHUNKS_FILE = "chunks.pkl"
  • embedder is the model instance; loading it downloads the model weights, so the first run may take a while.
  • INDEX_FILE and CHUNKS_FILE define where the FAISS index and the chunks are saved.
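
As a sanity check, you can ask the model for its output dimensionality; all-MiniLM-L6-v2 produces 384-dimensional vectors, and the FAISS index built later must match this value:

# The index dimension must match the embedding dimension
dim = embedder.get_sentence_embedding_dimension()
print(dim)  # 384 for all-MiniLM-L6-v2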

Function to Load PDF

def load_pdf(file_path):
    pdf = PdfReader(file_path)
    text = ""
    for page in pdf.pages:
        # extract_text() can return None (e.g. image-only pages), so guard it
        text += (page.extract_text() or "") + "\n"
    return text
  • Reads a PDF file with PyPDF2.
  • Extracts text page by page.
  • Returns the full document text as a string.
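
One caveat: PdfReader cannot read password-protected files without decrypting them first. A minimal defensive sketch (load_pdf_safe is a hypothetical helper, not part of the original script):

def load_pdf_safe(file_path):
    # Hypothetical variant of load_pdf that tolerates encrypted files
    pdf = PdfReader(file_path)
    if pdf.is_encrypted:
        try:
            pdf.decrypt("")  # many "encrypted" PDFs only set an owner password
        except Exception:
            return ""  # skip files that cannot be opened
    return "\n".join((page.extract_text() or "") for page in pdf.pages)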

Function for Text Chunking

def chunk_text(text, chunk_size=500, overlap=100):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks
  • Splits the text into chunks of chunk_size characters, shifting by chunk_size - overlap each time (so consecutive chunks overlap by overlap characters).

  • Representation (with the defaults chunk_size=500, overlap=100); see the demo below:
    Chunk 1 = characters 0–500
    Chunk 2 = characters 400–900 (100-character overlap)
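
To see the sliding window concretely, here is a small demo with shrunken sizes (note that the final chunk can be shorter than chunk_size):

sample = "abcdefghij" * 10  # 100 characters
for i, piece in enumerate(chunk_text(sample, chunk_size=40, overlap=10)):
    print(i, len(piece))
# Starts advance by 30 (= 40 - 10), so consecutive chunks share
# 10 characters: [0:40], [30:70], [60:100], [90:100]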

Full Pipeline
Walkthrough of the main script:

1. Collect chunks for all PDFs:

pdf_folder = "vault"   
 #This is the folder/path pdfs are stored in.

all_chunks = []
for filename in os.listdir(pdf_folder):
    if filename.endswith(".pdf"):
        text = load_pdf(os.path.join(pdf_folder, filename))
        chunks = chunk_text(text)
        all_chunks.extend(chunks)

  • Extracts and chunks the text of each PDF, accumulating all chunks (as strings) in the all_chunks list.

Note: order matters, since FAISS index ids align with the order in which chunks are added.

2. Embed chunks:

vectors = embedder.encode(all_chunks)
vectors = np.array(vectors)
  • embedder.encode(list_of_texts) returns a list/array of vectors. Depending on the library version the dtype may not be float32, and FAISS expects float32, so in practice it is safest to force the dtype explicitly.

  • Important: embedding all chunks at once can OOM if you have many chunks. Use batching:

vectors = embedder.encode(all_chunks, batch_size=32, show_progress_bar=True)
vectors = np.array(vectors).astype('float32')

3. Create a FAISS index:

dim = vectors.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(vectors)

(Basic)

  • Creates a FAISS index and adds all chunk vectors into the index.

(Technical)
IndexFlatL2 = exact (brute-force) nearest neighbor search using L2 distance. Works for small-to-medium collections.

  • Pros: simple and exact.
  • Cons: slow on large collections.

The index.add(vectors) adds vectors in the same order as all_chunks. FAISS internal ids = 0..N-1 in that order — that’s how you map back to chunks.
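
For collections that outgrow brute-force search, FAISS provides approximate indexes. A minimal sketch of swapping in an IVF index (the nlist and nprobe values are illustrative; IVF indexes must be trained before vectors are added):

nlist = 100  # number of clusters to partition the vectors into (illustrative)
quantizer = faiss.IndexFlatL2(dim)  # coarse quantizer that assigns vectors to cells
index = faiss.IndexIVFFlat(quantizer, dim, nlist)
index.train(vectors)  # learn the cluster centroids
index.add(vectors)
index.nprobe = 10  # cells searched per query: higher means better recall, slower search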

4. Save index and chunks:

faiss.write_index(index, INDEX_FILE)
with open(CHUNKS_FILE, "wb") as f:
    pickle.dump(all_chunks, f)

Saves the FAISS index to faiss_index.bin.
Saves the raw text chunks to chunks.pkl.

These files are later loaded by the Streamlit frontend at runtime.
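
For context, a minimal sketch of what that retrieval step might look like (the names and example query are assumptions, not the actual frontend code):

import pickle
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
index = faiss.read_index("faiss_index.bin")
with open("chunks.pkl", "rb") as f:
    all_chunks = pickle.load(f)

query = "What is the Indaba?"
qvec = np.array(embedder.encode([query])).astype("float32")
distances, ids = index.search(qvec, k=3)  # top-3 nearest chunks
results = [all_chunks[i] for i in ids[0]]  # FAISS ids map back to chunks by position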

How to run this script

Make sure your PDFs are in vault/ (the folder set as pdf_folder in the script).

python -m backend

Output: faiss_index.bin and chunks.pkl in the current directory.
