JeffMint

Chatbot with Python (Customizing the AI Chatbot with your Docs)

Backend — backend.py (Full explanation)

1) One-line summary

This script:

  1. reads every PDF in docs/,
  2. extracts text and splits it into fixed-size chunks (with overlap),
  3. converts those chunks to vector embeddings using sentence-transformers,
  4. builds a FAISS index from those vectors, and
  5. saves the FAISS index (faiss_index.bin) and the chunk list (chunks.pkl) to disk so your Streamlit app can load them for retrieval.

2) Imports & config (what each dependency does)

import os
import pickle
import numpy as np
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer
import faiss
  • os — filesystem operations (listing files, joining paths).
  • pickle — persist Python objects (we use it to save chunks list).
  • numpy — numeric arrays used by FAISS.
  • PyPDF2.PdfReader — extract text from PDF pages. (Note: extract_text() can return None for image-only pages; load_pdf below guards against that.)
  • SentenceTransformer — embedding model (all-MiniLM-L6-v2) to convert text → dense vector.
  • faiss — Facebook AI Similarity Search library for efficient vector indexing & nearest-neighbor search.
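
If you are starting from scratch, the dependencies install with pip. A minimal sketch, assuming the CPU build of FAISS (use faiss-gpu instead if you have CUDA):

pip install PyPDF2 sentence-transformers faiss-cpu numpy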

Config:

embedder = SentenceTransformer("all-MiniLM-L6-v2")
INDEX_FILE = "faiss_index.bin"
CHUNKS_FILE = "chunks.pkl"
  • embedder is the model instance; loading this downloads model weights (first run may take time).
  • INDEX_FILE and CHUNKS_FILE are filenames for saving outputs.
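
As a quick sanity check once the model has loaded, you can ask it for its output dimensionality; all-MiniLM-L6-v2 produces 384-dimensional embeddings:

print(embedder.get_sentence_embedding_dimension())  # 384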

3) load_pdf(file_path) — read a PDF and return full text

def load_pdf(file_path):
    pdf = PdfReader(file_path)
    text = ""
    for page in pdf.pages:
        # extract_text() returns None for image-only pages, so fall back to ""
        text += (page.extract_text() or "") + "\n"
    return text
  • Iterates over the pages and concatenates the page.extract_text() results, substituting an empty string when a page has no extractable text.
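
For a quick check, call it on a single file (the path here is just a hypothetical example):

text = load_pdf("docs/sample.pdf")  # placeholder path
print(len(text), text[:200])        # total characters plus a preview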

4) chunk_text(text, chunk_size=500, overlap=100) — naive character-based chunking

def chunk_text(text, chunk_size=500, overlap=100):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks
  • Splits the text into chunks of chunk_size characters, shifting by chunk_size - overlap each time, so consecutive chunks overlap by overlap characters. (Keep overlap smaller than chunk_size; otherwise start never advances and the loop runs forever.)
  • Rationale:

    • Overlap ensures context continuity: boundaries rarely cut important info completely.
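
A tiny worked example makes the sliding window concrete. With chunk_size=10 and overlap=3, the start position advances 7 characters per step:

demo = "abcdefghijklmnopqrst"  # 20 characters
print(chunk_text(demo, chunk_size=10, overlap=3))
# ['abcdefghij', 'hijklmnopq', 'opqrst']
# consecutive chunks share 3 characters: 'hij', then 'opq'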

5) build_index(pdf_folder="docs") — full pipeline

Walkthrough of the function:

1. Collect chunks for all PDFs:

   all_chunks = []
   for filename in os.listdir(pdf_folder):
       if filename.endswith(".pdf"):
           text = load_pdf(os.path.join(pdf_folder, filename))
           chunks = chunk_text(text)
           all_chunks.extend(chunks)
  • all_chunks becomes a list of strings; order matters, because FAISS index ids align with positions in this list.
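
If you later want to show which PDF a retrieved passage came from, one optional extension (not part of the original script; sources is a hypothetical name) is a parallel list kept aligned with all_chunks:

   all_chunks, sources = [], []
   for filename in os.listdir(pdf_folder):
       if filename.endswith(".pdf"):
           text = load_pdf(os.path.join(pdf_folder, filename))
           for chunk in chunk_text(text):
               all_chunks.append(chunk)
               sources.append(filename)  # sources[i] names the PDF behind all_chunks[i]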

2. Embed chunks:

   vectors = embedder.encode(all_chunks)
   vectors = np.array(vectors)
  • embedder.encode(list_of_texts) returns an array of vectors (float32 in recent sentence-transformers releases, though this has varied across versions). FAISS requires float32 input, so in practice it is safest to force the dtype explicitly.
  • Important: embedding all chunks at once can OOM if you have many chunks. Use batching:

     vectors = embedder.encode(all_chunks, batch_size=32, show_progress_bar=True)
     vectors = np.array(vectors).astype('float32')
    

3. Create FAISS index:

   dim = vectors.shape[1]
   index = faiss.IndexFlatL2(dim)
   index.add(vectors)
  • IndexFlatL2 = exact (brute-force) nearest neighbor search using L2 distance. Works for small-to-medium collections. Pros: simple and exact. Cons: slow on large collections.
  • index.add(vectors) adds vectors in the same order as all_chunks. FAISS internal ids are 0..N-1 in that order; that is how you map results back to chunks, as the retrieval sketch below shows.
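
Here is a minimal retrieval sketch to make that mapping concrete (the query text and k=3 are arbitrary examples):

   query_vec = np.array(embedder.encode(["What does the warranty cover?"])).astype("float32")
   distances, ids = index.search(query_vec, 3)  # top-3 nearest chunks
   for i in ids[0]:
       print(all_chunks[i][:80])  # FAISS id i maps directly to all_chunks[i]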

4. Save index and chunks:

   faiss.write_index(index, INDEX_FILE)
   with open(CHUNKS_FILE, "wb") as f:
       pickle.dump(all_chunks, f)
  • Saves persistent files which your Streamlit frontend loads at runtime.
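
On the Streamlit side, loading them back is the mirror image (a minimal sketch of just the loading step, not the full frontend):

   index = faiss.read_index(INDEX_FILE)
   with open(CHUNKS_FILE, "rb") as f:
       all_chunks = pickle.load(f)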

6) How to run this script

# make sure your PDFs are in docs/
python -m backend
# output: faiss_index.bin and chunks.pkl in the current directory
