Backend — backend.py
(Full explanation)
1) One-line summary
This script:
- reads every PDF in `docs/`,
- extracts text and splits it into fixed-size chunks (with overlap),
- converts those chunks to vector embeddings using `sentence-transformers`,
- builds a FAISS index from those vectors, and
- saves the FAISS index (`faiss_index.bin`) and the chunk list (`chunks.pkl`) to disk so your Streamlit app can load them for retrieval.
2) Imports & config (what each dependency does)
```python
import os
import pickle
import numpy as np
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer
import faiss
```

- `os` — filesystem operations (listing files, joining paths).
- `pickle` — persist Python objects (we use it to save the `chunks` list).
- `numpy` — numeric arrays used by FAISS.
- `PyPDF2.PdfReader` — extract text from PDF pages. (Note: `extract_text()` can return `None` for some pages; more on that in pitfalls.)
- `SentenceTransformer` — embedding model (`all-MiniLM-L6-v2`) to convert text → dense vector.
- `faiss` — Facebook AI Similarity Search library for efficient vector indexing & nearest-neighbor search.
Config:
```python
embedder = SentenceTransformer("all-MiniLM-L6-v2")
INDEX_FILE = "faiss_index.bin"
CHUNKS_FILE = "chunks.pkl"
```

- `embedder` is the model instance; loading this downloads model weights (first run may take time).
- `INDEX_FILE` and `CHUNKS_FILE` are filenames for saving outputs.
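As a quick sanity check after the model loads, you can encode a couple of strings and look at the output shape. The snippet below is illustrative and not part of backend.py, but for all-MiniLM-L6-v2 the embedding dimension should come out as 384.

```python
# Quick sanity check (not part of backend.py): encode two strings and
# inspect the result. all-MiniLM-L6-v2 produces 384-dimensional vectors.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
vecs = np.array(embedder.encode(["hello world", "vector search"]))
print(vecs.shape)   # expected: (2, 384)
print(vecs.dtype)   # usually float32
```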
3) load_pdf(file_path) — read a PDF and return full text

```python
def load_pdf(file_path):
    pdf = PdfReader(file_path)
    text = ""
    for page in pdf.pages:
        text += page.extract_text() + "\n"
    return text
```

- Iterates pages and concatenates `page.extract_text()` results.
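The pitfall flagged in the imports section matters here: if `page.extract_text()` returns `None` for a scanned or image-only page, the string concatenation raises a `TypeError`. A defensive variant might look like the sketch below (`load_pdf_safe` is a hypothetical name, not what backend.py uses):

```python
def load_pdf_safe(file_path):
    # Same idea as load_pdf, but tolerates pages where extract_text() returns None
    pdf = PdfReader(file_path)
    pages = []
    for page in pdf.pages:
        pages.append(page.extract_text() or "")  # fall back to an empty string
    return "\n".join(pages)
```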
4) chunk_text(text, chunk_size=500, overlap=100) — naive character-based chunking

```python
def chunk_text(text, chunk_size=500, overlap=100):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks
```

- Splits the text into chunks of `chunk_size` characters, shifting by `chunk_size - overlap` each time (so consecutive chunks overlap by `overlap` characters).
- Rationale: overlap ensures context continuity, so a chunk boundary rarely cuts important information off completely.
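To make the arithmetic concrete, here is a tiny worked example with toy sizes instead of the 500/100 defaults:

```python
text = "abcdefghijklmnopqrstuvwxyz"              # 26 characters
chunks = chunk_text(text, chunk_size=10, overlap=3)
# start advances in steps of 10 - 3 = 7, so chunks begin at 0, 7, 14, 21
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```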
5) build_index(pdf_folder="docs") — full pipeline

Walkthrough of the function:

1. Collect chunks for all PDFs:

```python
all_chunks = []
for filename in os.listdir(pdf_folder):
    if filename.endswith(".pdf"):
        text = load_pdf(os.path.join(pdf_folder, filename))
        chunks = chunk_text(text)
        all_chunks.extend(chunks)
```

- `all_chunks` becomes a list of strings, and order matters: index ids align with this order (a small ordering tweak is sketched just below).
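Because the id-to-chunk mapping depends entirely on iteration order, note that `os.listdir()` returns files in arbitrary, filesystem-dependent order, so rebuilding the index on another machine can shuffle the ids. A small tweak (not in the original script) is to sort the listing:

```python
# Deterministic file order keeps index ids stable across rebuilds
for filename in sorted(os.listdir(pdf_folder)):
    if filename.endswith(".pdf"):
        text = load_pdf(os.path.join(pdf_folder, filename))
        all_chunks.extend(chunk_text(text))
```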
2. Embed chunks:

```python
vectors = embedder.encode(all_chunks)
vectors = np.array(vectors)
```

- `embedder.encode(list_of_texts)` returns a list/array of vectors. By default it returns `float32` or `float64` depending on the version — FAISS expects `float32`, so in practice it's safer to force dtype `float32`.
- Important: embedding all chunks at once can OOM if you have many chunks. Use batching:

```python
vectors = embedder.encode(all_chunks, batch_size=32, show_progress_bar=True)
vectors = np.array(vectors).astype("float32")
```
3. Create FAISS index:

```python
dim = vectors.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(vectors)
```

- `IndexFlatL2` = exact (brute-force) nearest neighbor search using L2 distance. Works for small-to-medium collections. Pros: simple and exact. Cons: slow on large collections.
- `index.add(vectors)` adds vectors in the same order as `all_chunks`. FAISS internal ids = 0..N-1 in that order — that's how you map back to chunks (see the retrieval sketch below).
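That id mapping is exactly what retrieval relies on at query time. The lookup itself lives in the Streamlit frontend, but a minimal sketch (variable names here are illustrative) looks like this:

```python
# Illustrative retrieval: embed the query, search the index, map ids back to chunks
query = "What does the document say about refunds?"
query_vec = np.array(embedder.encode([query])).astype("float32")

k = 3                                    # number of nearest chunks to fetch
distances, ids = index.search(query_vec, k)
top_chunks = [all_chunks[i] for i in ids[0]]
```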
4. Save index and chunks:

```python
faiss.write_index(index, INDEX_FILE)
with open(CHUNKS_FILE, "wb") as f:
    pickle.dump(all_chunks, f)
```

- Saves persistent files which your Streamlit frontend loads at runtime.
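Loading these artifacts back on the frontend is symmetric; assuming the same filenames, the Streamlit app would do something like:

```python
import pickle
import faiss

index = faiss.read_index("faiss_index.bin")
with open("chunks.pkl", "rb") as f:
    all_chunks = pickle.load(f)
# index and all_chunks are now ready for query-time search
```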
6) How to run this script
```bash
# make sure your PDFs are in docs/
python -m backend
# output: faiss_index.bin and chunks.pkl in the current directory
```
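If the imports fail, install the dependencies first; these are the usual PyPI package names:

```bash
pip install PyPDF2 sentence-transformers faiss-cpu numpy
```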