datatoinfinity

Optimized PDF Q&A Assistant with Streamlit, LangChain, Hugging Face, and Supabase

Live Demo: PDFSUMMARIZATION Site

GitHub: CODE

When working on AI projects, you might notice that code runs fast on Google Colab but slows down on a local machine. The solution is to optimize the pipeline: avoid reprocessing work you've already done, batch the expensive steps, and use fast libraries.

In this blog, I’ll walk you through building a PDF Q&A Assistant with the following pipeline:

Upload a PDF → hash & check if already stored → extract, embed, and save chunks in Supabase → take user’s question → retrieve relevant chunks → refine with LLM → display answer.

Tech Stack Used

  1. Streamlit → Front-end UI and deployment
  2. LangChain → Orchestrates the LLM calls, connecting the AI “brain” to the pipeline
  3. Hugging Face → Provides powerful pre-trained models
  4. Supabase → Vector database for storing and retrieving PDF data

Configuration

import streamlit as st  # needed for st.secrets below
from sentence_transformers import SentenceTransformer
from supabase import create_client
from huggingface_hub import InferenceClient

SUPABASE_URL = st.secrets["SUPABASE_URL"]
SUPABASE_KEY = st.secrets["SUPABASE_KEY"]
HF_TOKEN = st.secrets["HF_TOKEN"]  # Hugging Face token

supabase = create_client(SUPABASE_URL, SUPABASE_KEY)
model = SentenceTransformer('all-MiniLM-L6-v2')
hf_client = InferenceClient(api_key=HF_TOKEN)

Here, Supabase is used for storage, a SentenceTransformer model handles embeddings, and Hugging Face provides an LLM client for inference.
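
For local development, Streamlit reads these secrets from .streamlit/secrets.toml; on Streamlit Community Cloud you set the same keys in the app's Secrets settings. A minimal sketch with placeholder values:

# .streamlit/secrets.toml (placeholders, replace with your own keys)
SUPABASE_URL = "https://your-project.supabase.co"
SUPABASE_KEY = "your-supabase-anon-or-service-key"
HF_TOKEN = "hf_your_token_here"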

Hash and Extract PDF Data

import fitz  # PyMuPDF (faster alternative to pdfplumber)
import hashlib

def hash_pdf(pdf_path):
    with open(pdf_path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def extract_and_chunk(pdf_path, chunk_size=500):
    doc = fitz.open(pdf_path)
    text = " ".join([page.get_text() for page in doc])
    words = text.split()
    chunks = [' '.join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]
    return chunks

hashlib → creates a unique fingerprint (MD5 hash) of the PDF’s bytes, so the same file is never processed twice.
fitz → efficiently extracts text from the PDF; the text is then split into fixed-size word chunks.
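
As a quick sanity check, you can run the two helpers on any local PDF (sample.pdf is a placeholder name):

# Hypothetical usage; assumes sample.pdf exists in the working directory
pdf_id = hash_pdf("sample.pdf")
chunks = extract_and_chunk("sample.pdf", chunk_size=500)
print(pdf_id, len(chunks))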

Embed, Store, and Retrieve

def embed_chunks(chunks):
    # Batch-encode all chunks at once; far faster than encoding one by one
    return model.encode(chunks, batch_size=16, show_progress_bar=True).tolist()

def store_to_supabase(chunks, embeddings, pdf_id):
    data = [{
        "id": f"{pdf_id}_chunk{i+1}",   # namespaced by pdf_id so different PDFs can't overwrite each other's rows
        "pdf_id": pdf_id,
        "text": chunk,
        "embedding": embedding
    } for i, (chunk, embedding) in enumerate(zip(chunks, embeddings))]
    supabase.table("documents1").upsert(data).execute()

def retrieve_chunks(query, pdf_id, top_k=10):
    query_embedding = model.encode(query).tolist()
    response = supabase.rpc("match_documents", {
        "query_embedding": query_embedding,
        "match_count": top_k,
        "pdf_id_filter": pdf_id
    }).execute()
    relevant_chunks = [row["text"] for row in response.data] if response.data else []
    return relevant_chunks

Embed Chunks → Convert text chunks into embeddings (vectors).
Store in Supabase → Save text + embeddings for future queries.
Retrieve Chunks → Find the most relevant text chunks with semantic similarity search.
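
The match_documents call is a Postgres function (RPC) that must already exist in your Supabase project; this post doesn't show its definition. Here is a minimal sketch following Supabase's standard pgvector pattern, using the table and argument names from the Python code above and 384 dimensions to match all-MiniLM-L6-v2's output size. Adjust it to your own schema:

-- Assumed schema and RPC; adjust names and dimensions to your setup
create extension if not exists vector;

create table documents1 (
  id text primary key,
  pdf_id text,
  text text,
  embedding vector(384)
);

create or replace function match_documents(
  query_embedding vector(384),
  match_count int,
  pdf_id_filter text
) returns table (id text, text text, similarity float)
language sql stable as $$
  select d.id, d.text, 1 - (d.embedding <=> query_embedding) as similarity
  from documents1 d
  where d.pdf_id = pdf_id_filter
  order by d.embedding <=> query_embedding
  limit match_count;
$$;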

Refine with Hugging Face LLM

def refine_with_llm(relevant_chunks, question):
    refinement_input = "\n\n---\n\n".join(relevant_chunks)
    prompt = f"""
    Refine the following extracted text chunks for clarity, conciseness, and improved readability.
    Keep the technical meaning accurate and explain any complex terms simply if needed.
    Text to refine:
    {refinement_input}
    Question:
    {question}"""

    response = hf_client.chat.completions.create(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",
        messages=[
            {"role": "system", "content": "You are an expert technical editor and writer."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
        max_tokens=500
    )
    refined_text = response.choices[0].message.content
    return refined_text

This step ensures that even if retrieved chunks are messy or incomplete, the AI agent refines them into clear, concise, and context-aware answers.
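
Outside Streamlit, you can exercise the retrieval-plus-refinement path on its own (the question string and pdf_id below are placeholders):

# Hypothetical standalone test of the retrieve → refine loop
question = "What problem does this document solve?"
chunks = retrieve_chunks(question, pdf_id, top_k=10)
if chunks:
    print(refine_with_llm(chunks, question))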

Streamlit Front-End

import uuid
import os
import streamlit as st

st.set_page_config(page_title="PDF Q&A Assistant")
st.title("📄 Ask Questions About Your PDF")

uploaded_file = st.file_uploader("Upload a PDF", type="pdf")

if uploaded_file:
    with st.spinner("Processing PDF..."):
        pdf_path = f"temp_{uuid.uuid4().hex}.pdf"
        with open(pdf_path, "wb") as f:
            f.write(uploaded_file.read())
        pdf_id = hash_pdf(pdf_path)

        existing = supabase.table("documents1").select("id").eq("pdf_id", pdf_id).execute()
        if existing.data:
            st.warning("⚠️ This PDF has already been processed. You can still ask questions.")
        else:
            chunks = extract_and_chunk(pdf_path)
            embeddings = embed_chunks(chunks)
            store_to_supabase(chunks, embeddings, pdf_id)
        os.remove(pdf_path)
    st.success("PDF ready for Q&A.")

    question = st.text_input("Ask a question about the uploaded PDF:")
    if question:
        with st.spinner("Generating answer..."):
            results = retrieve_chunks(question, pdf_id)
            if not results:
                st.error("No relevant chunks found.")
            else:
                answer = refine_with_llm(results, question)
                st.markdown("### Answer:")
                st.write(answer)

Explanation:

  1. UI Setup → Streamlit sets page config, title, and PDF uploader.
  2. Temporary Save → Uploaded PDF is saved locally with a unique name.
  3. Hashing → Generate an MD5 hash to uniquely identify the PDF.
  4. Check Supabase → Skip processing if the PDF was already stored.
  5. Extract & Chunk → Pull text from the PDF and split it into word chunks.
  6. Embed Chunks → Convert chunks into vector embeddings for semantic search.
  7. Store in Supabase → Save chunks, embeddings, and PDF ID in the database.
  8. Clean Up → Remove the temporary PDF file after processing.
  9. Ask Question → User inputs a question about the uploaded PDF.
  10. Retrieve Chunks → Fetch most relevant chunks from Supabase via similarity search.
  11. Refine Answer → LLM polishes the retrieved text into a clear, concise response.
  12. Display Result → Show the AI-generated answer in the Streamlit app.
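
To run the app locally, install the dependencies and launch Streamlit (assuming the code above is saved as app.py; pymupdf is the package that provides the fitz import):

pip install streamlit supabase sentence-transformers huggingface_hub pymupdf
streamlit run app.py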

Related post: From PDF to Summary: Building an AI Agent with Python & Vector Databases - Basic
