From PDF to Summary: Building an AI Agent with Python & Vector Databases - Basic

Live Demo: PDFSUMMARIZATION Site

Sample PDF: Download Here

GitHub: Code

The PDF Summarization AI Agent summarizes lengthy PDFs and answers questions based only on their content. It's useful when you need a quick overview without reading the entire document.

  • Summarizes large PDF files into concise overviews.
  • Answers user questions only from the uploaded PDF.
  • Formats responses clearly and preserves technical accuracy.

Used By

Researchers → Extract key findings from academic papers.
Lawyers → Summarize contracts & compliance documents.
Business Analysts → Turn meeting transcripts into quick insights.
Finance Teams → Condense invoices & financial statements.
Students → Create study notes from textbooks.

Tech Used

Streamlit → Front-end & deployment.
LangChain → LLM integration & chaining workflows.
Hugging Face → Pre-trained AI models (e.g., Mixtral-8x7B).
Supabase → Vector database for storing PDF embeddings.

How It Works

  1. Extract text from the PDF.
  2. Chunk the text into smaller segments (for large PDFs).
  3. Embed each chunk into vector form using a transformer model.
  4. Store the embeddings in the Supabase vector DB.
  5. Perform a similarity search to find the most relevant chunks for a query.
  6. Use a Hugging Face model to refine and format the answer.

Key Concepts

Chaining

A method of breaking a complex task into sequential steps, where the output of one step feeds into the next.
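
A minimal sketch of the idea in plain Python; the three step functions here are hypothetical stand-ins, just to show each output feeding the next step:

def extract(pdf_path):
    return "raw text from the PDF"    # step 1: get the text

def summarize(text):
    return text[:50]                  # step 2: condense it

def format_answer(summary):
    return f"Summary: {summary}"      # step 3: present it

# Each step's output becomes the next step's input - a simple chain.
print(format_answer(summarize(extract("Sample.pdf"))))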

Embedding

A representation of text, images, or audio as points in a semantic vector space.
Similar items (e.g., mobile, smartphone, cell phone) are stored close together in this space.
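
A quick sketch of this in code, using the same sentence-transformers model as the rest of this post; near-synonyms should score noticeably higher than unrelated words:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dimensional vectors

embeddings = model.encode(["mobile", "smartphone", "banana"])
print(util.cos_sim(embeddings[0], embeddings[1]))  # high: near-synonyms
print(util.cos_sim(embeddings[0], embeddings[2]))  # much lower: unrelated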

Installation

pip install pdfplumber sentence-transformers supabase
  • pdfplumber → Extract text from PDF.
  • sentence-transformers → Convert text into embeddings.
  • supabase → Store and search embeddings.

Supabase Setup

  1. Create a Supabase account.
  2. Start a new project and copy:
    • Project URL
    • API Key
  3. Enable vector extension:
CREATE EXTENSION IF NOT EXISTS vector SCHEMA extensions;
  4. Create a documents1 table:
CREATE TABLE documents1 (
    id TEXT PRIMARY KEY,
    text TEXT,
    pdf_id TEXT,
    embedding VECTOR(384)
);
  5. Create a similarity search function:
CREATE FUNCTION match_documents(
    query_embedding VECTOR(384),
    match_count INT
) RETURNS TABLE (
    id TEXT,
    text TEXT
) LANGUAGE plpgsql STABLE AS $$
BEGIN
    RETURN QUERY
    SELECT documents1.id, documents1.text
    FROM documents1
    ORDER BY documents1.embedding <-> query_embedding
    LIMIT match_count;
END;
$$;
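
The <-> in the ORDER BY above is pgvector's Euclidean (L2) distance operator, so the closest embeddings come back first; pgvector also offers <=> for cosine distance and <#> for negative inner product. A tiny example to try in the SQL editor:

SELECT '[1,2,3]'::vector <-> '[4,5,6]'::vector;  -- L2 distance, about 5.196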

PDF Processing

1. Upload PDF (Google Colab)

from google.colab import files
uploaded = files.upload()

2. Extract & Chunk

import pdfplumber

def extract_and_chunk(pdf_path, chunk_size=500):
    # extract_text() can return None on empty pages, hence the "or ''".
    with pdfplumber.open(pdf_path) as pdf:
        text = "".join(page.extract_text() or "" for page in pdf.pages)
    # Split into fixed-size character chunks.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return chunks
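
A quick sanity check before storing anything (Sample.pdf is the file used in the next step):

chunks = extract_and_chunk("Sample.pdf")
print(len(chunks), "chunks; first 80 chars:", chunks[0][:80])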

3. Store in Supabase

from supabase import create_client
from sentence_transformers import SentenceTransformer

supabase_url = "YOUR_SUPABASE_URL"
supabase_key = "YOUR_API_KEY"
supabase = create_client(supabase_url, supabase_key)

model = SentenceTransformer('all-MiniLM-L6-v2')  # must match VECTOR(384) in the table

pdf_path = "Sample.pdf"
chunks = extract_and_chunk(pdf_path)
embeddings = model.encode(chunks).tolist()

data = [
    {"id": f"chunk_{i}", "text": chunk, "embedding": embedding, "pdf_id": "doc1"}
    for i, (chunk, embedding) in enumerate(zip(chunks, embeddings))
]

supabase.table("documents1").insert(data).execute()

4. Query Search

query = "What is the topic?"
query_embedding = model.encode(query).tolist()

# Call the match_documents function defined earlier in Supabase.
response = supabase.rpc(
    "match_documents",
    {"query_embedding": query_embedding, "match_count": 3}
).execute()

# response.data is a list of {"id": ..., "text": ...} rows.
relevant_chunks = [row["text"] for row in response.data]
print("\n---\n".join(relevant_chunks))

Hugging Face Integration

  1. Create a Hugging Face account.
  2. Generate a READ API token.

from huggingface_hub import InferenceClient
import os

# Read the token from the environment, with a placeholder fallback.
client = InferenceClient(
    api_key=os.getenv("HUGGINGFACEHUB_API_TOKEN", "YOUR_HF_API_KEY")
)

Refinement with Mixtral-8x7B

# Join the chunks outside the f-string; backslash escapes inside
# f-string expressions are a SyntaxError before Python 3.12.
joined_chunks = "\n\n---\n\n".join(relevant_chunks)

prompt = f"""
Refine the following extracted text chunks for clarity, conciseness, and improved readability.
Keep the technical meaning accurate.

Text to refine:
{joined_chunks}
"""

response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[
        {"role": "system", "content": "You are an expert technical editor."},
        {"role": "user", "content": prompt}
    ],
    temperature=0.7,
    max_tokens=500
)

print("\n Refined Output:\n")
print(response.choices[0].message.content)
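
The refinement step above polishes retrieved chunks; to get the full-document summary the title promises, the same client can be pointed at the stored chunks directly. A minimal sketch, assuming the chunks list from the extraction step is still in scope (long PDFs would need batching to fit the model's context window):

summary_prompt = (
    "Summarize the following document excerpts into a concise overview:\n\n"
    + "\n\n---\n\n".join(chunks[:10])  # first few chunks only, as a demo
)

summary = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user", "content": summary_prompt}],
    temperature=0.3,
    max_tokens=300
)
print(summary.choices[0].message.content)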

Notes

  • Delete old data before inserting chunks from a new PDF to avoid duplicate ID errors (see the sketch after this list).
  • Hugging Face request cost & speed depend on the chosen model.
  • Supabase vector size (384) must match your embedding model output.
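
For the first note, a minimal sketch of the cleanup, assuming the pdf_id value ("doc1") used earlier:

# Clear rows from the previous PDF before inserting new chunks.
supabase.table("documents1").delete().eq("pdf_id", "doc1").execute()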

Pipeline recap: PDF upload → chunking → storing → querying → refining
