Documentation: Backend (backend.py)
This script processes PDF documents into vector embeddings and builds a FAISS index for semantic search.
It is the offline preprocessing pipeline for the Indaba RAG chatbot.
Key Responsibilities
- Load PDFs from a folder.
- Extract raw text using PyPDF2.
- Chunk large documents into smaller overlapping text segments.
- Convert chunks into embeddings using SentenceTransformers.
- Build and persist a FAISS index for similarity search.
- Save the raw chunks for later retrieval.
Step-by-Step Breakdown
Imports and Setup
import os
import pickle
import numpy as np
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer
import faiss
- os → file system operations.
- pickle → save preprocessed chunks.
- numpy → numerical array handling.
- PyPDF2 → extract text from PDF files.
- SentenceTransformer → embedding model (all-MiniLM-L6-v2).
- faiss → efficient similarity search.
Constants:
embedder = SentenceTransformer("all-MiniLM-L6-v2")
INDEX_FILE = "faiss_index.bin"
CHUNKS_FILE = "chunks.pkl"
- embedder is the model instance; loading it downloads the model weights, so the first run may take some time.
- INDEX_FILE and CHUNKS_FILE define where the FAISS index and the chunks are saved.
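As a quick sanity check after the model loads, you can print its embedding dimension; all-MiniLM-L6-v2 produces 384-dimensional vectors:
print(embedder.get_sentence_embedding_dimension())  # prints 384 for all-MiniLM-L6-v2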
Function to Load PDF
def load_pdf(file_path):
    pdf = PdfReader(file_path)
    text = ""
    for page in pdf.pages:
        # extract_text() can return None for image-only pages, so guard with "or"
        text += (page.extract_text() or "") + "\n"
    return text
- Reads a PDF file with PyPDF2.
- Extracts text page by page.
- Returns the full document text as a string.
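A minimal usage sketch (the file path here is only a hypothetical example):
text = load_pdf("vault/schedule.pdf")  # hypothetical file; any PDF path works
print(f"Extracted {len(text)} characters")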
Function for Text Chunking
def chunk_text(text, chunk_size=500, overlap=100):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks
Splits the text into chunks of chunk_size characters, shifting by chunk_size - overlap each time (so consecutive chunks overlap by overlap characters).
Representation:
Chunk 1 = 0–500
Chunk 2 = 400–900 (100 overlap)
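To make the overlap concrete, here is a small sketch using a short string and smaller parameters than the defaults:
sample = "abcdefghijklmnopqrstuvwxyz"
for i, chunk in enumerate(chunk_text(sample, chunk_size=10, overlap=3)):
    print(i, repr(chunk))
# 0 'abcdefghij'   (starts at 0)
# 1 'hijklmnopq'   (starts at 7, overlapping the previous chunk by 3 characters)
# 2 'opqrstuvwx'   (starts at 14)
# 3 'vwxyz'        (starts at 21)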
Full Pipeline Info
Walkthrough of the main pipeline:
1. Collect chunks for all PDFs:
pdf_folder = "vault"
# This is the folder where the PDFs are stored.
all_chunks = []
for filename in os.listdir(pdf_folder):
    if filename.endswith(".pdf"):
        text = load_pdf(os.path.join(pdf_folder, filename))
        chunks = chunk_text(text)
        all_chunks.extend(chunks)
- Extracts text from each PDF, chunks it, and keeps every chunk as a string in the all_chunks list.
Note: Order matters (FAISS index ids align with this order).
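Since os.listdir() returns files in arbitrary order, one optional tweak (not in the original script) is to sort the filenames so the chunk order, and therefore the FAISS ids, stay stable across runs:
for filename in sorted(os.listdir(pdf_folder)):  # sorted → deterministic chunk order
    if filename.endswith(".pdf"):
        text = load_pdf(os.path.join(pdf_folder, filename))
        all_chunks.extend(chunk_text(text))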
2. Embed chunks:
vectors = embedder.encode(all_chunks)
vectors = np.array(vectors)
embedder.encode(list_of_texts) returns one embedding vector per chunk as a NumPy array. FAISS expects float32, so it is safest to cast the array to float32 explicitly rather than rely on the default dtype.
Important: embedding all chunks at once can run out of memory if you have many chunks. Use batching:
vectors = embedder.encode(all_chunks, batch_size=32, show_progress_bar=True)
vectors = np.array(vectors).astype('float32')
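A quick sanity check before indexing (the 384 assumes all-MiniLM-L6-v2):
assert vectors.dtype == np.float32
assert vectors.shape == (len(all_chunks), 384)  # 384 = embedding dimension of all-MiniLM-L6-v2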
3. Create a FAISS index:
dim = vectors.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(vectors)
(Basic)
- Creates a FAISS index and adds all chunk vectors into the index.
(Technical)
IndexFlatL2 = exact (brute-force) nearest neighbor search using L2 distance. Works for small-to-medium collections.
- Pros: simple and exact.
- Cons: slow on large collections.
The index.add(vectors) adds vectors in the same order as all_chunks. FAISS internal ids = 0..N-1 in that order — that’s how you map back to chunks.
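To illustrate that mapping, here is a hedged sketch of answering a query against this index (the query text is made up):
query_vec = embedder.encode(["What is the Indaba schedule?"])  # hypothetical query
query_vec = np.array(query_vec).astype("float32")

k = 3  # number of nearest chunks to retrieve
distances, ids = index.search(query_vec, k)
for i in ids[0]:
    print(all_chunks[i][:200])  # FAISS id i maps straight back to all_chunks[i]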
4. Save the index and chunks:
faiss.write_index(index, INDEX_FILE)
with open(CHUNKS_FILE, "wb") as f:
    pickle.dump(all_chunks, f)
- Saves the FAISS index to faiss_index.bin.
- Saves the chunks (raw text) to chunks.pkl.
These files are later loaded by the Streamlit frontend at runtime.
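For reference, a minimal sketch of how the frontend could load these artifacts back at startup (the actual frontend code is not covered here):
import pickle
import faiss

index = faiss.read_index("faiss_index.bin")
with open("chunks.pkl", "rb") as f:
    chunks = pickle.load(f)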
How to run this script
Make sure your PDFs are in vault/ (the folder set by pdf_folder above), then run:
python -m backend