Backend — backend.py
(Full explanation)
1) One-line summary
This script:
- reads every PDF in `docs/`,
- extracts text and splits it into fixed-size chunks (with overlap),
- converts those chunks to vector embeddings using `sentence-transformers`,
- builds a FAISS index from those vectors, and
- saves the FAISS index (`faiss_index.bin`) and the chunk list (`chunks.pkl`) to disk so your Streamlit app can load them for retrieval.
2) Imports & config (what each dependency does)
```python
import os
import pickle
import numpy as np
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer
import faiss
```

- `os` — filesystem operations (listing files, joining paths).
- `pickle` — persist Python objects (we use it to save the `chunks` list).
- `numpy` — numeric arrays used by FAISS.
- `PyPDF2.PdfReader` — extract text from PDF pages. (Note: `extract_text()` can return `None` for some pages; more on that in pitfalls.)
- `SentenceTransformer` — embedding model (`all-MiniLM-L6-v2`) to convert text → dense vector.
- `faiss` — Facebook AI Similarity Search library for efficient vector indexing & nearest-neighbor search.
Config:
```python
embedder = SentenceTransformer("all-MiniLM-L6-v2")
INDEX_FILE = "faiss_index.bin"
CHUNKS_FILE = "chunks.pkl"
```

- `embedder` is the model instance; loading this downloads model weights (first run may take time).
- `INDEX_FILE` and `CHUNKS_FILE` are filenames for saving outputs.
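As a quick sanity check after the model loads, you can encode a couple of strings and look at the output shape. The snippet below is illustrative and not part of backend.py, but for all-MiniLM-L6-v2 the embedding dimension should come out as 384.

```python
# Quick sanity check (not part of backend.py): encode two strings and
# inspect the result. all-MiniLM-L6-v2 produces 384-dimensional vectors.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
vecs = np.array(embedder.encode(["hello world", "vector search"]))
print(vecs.shape)   # expected: (2, 384)
print(vecs.dtype)   # usually float32
```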
3) load_pdf(file_path) — read a PDF and return full text

```python
def load_pdf(file_path):
    pdf = PdfReader(file_path)
    text = ""
    for page in pdf.pages:
        text += page.extract_text() + "\n"
    return text
```

- Iterates pages and concatenates `page.extract_text()` results.
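The pitfall flagged in the imports section matters here: if `page.extract_text()` returns `None` for a scanned or image-only page, the string concatenation raises a `TypeError`. A defensive variant might look like the sketch below (`load_pdf_safe` is a hypothetical name, not what backend.py uses):

```python
def load_pdf_safe(file_path):
    # Same idea as load_pdf, but tolerates pages where extract_text() returns None
    pdf = PdfReader(file_path)
    pages = []
    for page in pdf.pages:
        pages.append(page.extract_text() or "")  # fall back to an empty string
    return "\n".join(pages)
```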
4) chunk_text(text, chunk_size=500, overlap=100) — naive character-based chunking

```python
def chunk_text(text, chunk_size=500, overlap=100):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks
```

- Splits the text into chunks of `chunk_size` characters, shifting by `chunk_size - overlap` each time (so consecutive chunks overlap by `overlap` characters).
- Rationale: overlap ensures context continuity, so a chunk boundary rarely cuts important information off completely.
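To make the arithmetic concrete, here is a tiny worked example with toy sizes instead of the 500/100 defaults:

```python
text = "abcdefghijklmnopqrstuvwxyz"              # 26 characters
chunks = chunk_text(text, chunk_size=10, overlap=3)
# start advances in steps of 10 - 3 = 7, so chunks begin at 0, 7, 14, 21
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```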
5) build_index(pdf_folder="docs") — full pipeline

Walkthrough of the function:

1. Collect chunks for all PDFs:

```python
all_chunks = []
for filename in os.listdir(pdf_folder):
    if filename.endswith(".pdf"):
        text = load_pdf(os.path.join(pdf_folder, filename))
        chunks = chunk_text(text)
        all_chunks.extend(chunks)
```

- `all_chunks` becomes a list of strings, and order matters: index ids align with this order (a small ordering tweak is sketched just below).
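Because the id-to-chunk mapping depends entirely on iteration order, note that `os.listdir()` returns files in arbitrary, filesystem-dependent order, so rebuilding the index on another machine can shuffle the ids. A small tweak (not in the original script) is to sort the listing:

```python
# Deterministic file order keeps index ids stable across rebuilds
for filename in sorted(os.listdir(pdf_folder)):
    if filename.endswith(".pdf"):
        text = load_pdf(os.path.join(pdf_folder, filename))
        all_chunks.extend(chunk_text(text))
```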
2. Embed chunks:

```python
vectors = embedder.encode(all_chunks)
vectors = np.array(vectors)
```

- `embedder.encode(list_of_texts)` returns a list/array of vectors. By default it returns `float32` or `float64` depending on the version — FAISS expects `float32`, so in practice it's safer to force dtype `float32`.
- Important: embedding all chunks at once can OOM if you have many chunks. Use batching:

```python
vectors = embedder.encode(all_chunks, batch_size=32, show_progress_bar=True)
vectors = np.array(vectors).astype("float32")
```
3. Create FAISS index:

```python
dim = vectors.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(vectors)
```

- `IndexFlatL2` = exact (brute-force) nearest neighbor search using L2 distance. Works for small-to-medium collections. Pros: simple and exact. Cons: slow on large collections.
- `index.add(vectors)` adds vectors in the same order as `all_chunks`. FAISS internal ids = 0..N-1 in that order — that's how you map back to chunks (see the retrieval sketch below).
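That id mapping is exactly what retrieval relies on at query time. The lookup itself lives in the Streamlit frontend, but a minimal sketch (variable names here are illustrative) looks like this:

```python
# Illustrative retrieval: embed the query, search the index, map ids back to chunks
query = "What does the document say about refunds?"
query_vec = np.array(embedder.encode([query])).astype("float32")

k = 3                                    # number of nearest chunks to fetch
distances, ids = index.search(query_vec, k)
top_chunks = [all_chunks[i] for i in ids[0]]
```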
4. Save index and chunks:

```python
faiss.write_index(index, INDEX_FILE)
with open(CHUNKS_FILE, "wb") as f:
    pickle.dump(all_chunks, f)
```

- Saves persistent files which your Streamlit frontend loads at runtime.
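Loading these artifacts back on the frontend is symmetric; assuming the same filenames, the Streamlit app would do something like:

```python
import pickle
import faiss

index = faiss.read_index("faiss_index.bin")
with open("chunks.pkl", "rb") as f:
    all_chunks = pickle.load(f)
# index and all_chunks are now ready for query-time search
```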
6) How to run this script
```bash
# make sure your PDFs are in docs/
python -m backend
# output: faiss_index.bin and chunks.pkl in the current directory
```
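If the imports fail, install the dependencies first; these are the usual PyPI package names:

```bash
pip install PyPDF2 sentence-transformers faiss-cpu numpy
```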