From PDF to Summary: Building an AI Agent with Python & Vector Databases - Basic

Live Demo: PDFSUMMARIZATION Site

Sample PDF: Download Here

GitHub: Code

The PDF Summarization AI Agent summarizes lengthy PDFs and answers questions based only on their content. It's useful when you need a quick overview without reading the entire document.

  • Summarizes large PDF files into concise overviews.
  • Answers user questions only from the uploaded PDF.
  • Formats responses clearly and preserves technical accuracy.

Used By

Researchers → Extract key findings from academic papers.
Lawyers → Summarize contracts & compliance documents.
Business Analysts → Turn meeting transcripts into quick insights.
Finance Teams → Condense invoices & financial statements.
Students → Create study notes from textbooks.

Tech Used

Streamlit → Front-end & deployment.
LangChain → LLM integration & chaining workflows.
Hugging Face → Pre-trained AI models (e.g., Mixtral-8x7B).
Supabase → Vector database for storing PDF embeddings.

How It Works

  1. Extract text from the PDF.
  2. Chunk the text into smaller segments (for large PDFs).
  3. Embed each chunk into vector form using a transformer model.
  4. Store the embeddings in the Supabase vector DB.
  5. Perform a similarity search to find the most relevant chunks for a query.
  6. Use a Hugging Face model to refine and format the answer.

Key Concepts

Chaining

A method of breaking a complex task into sequential steps, where the output of one step feeds into the next.
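
A minimal sketch of the idea in plain Python; the three step functions here are hypothetical stand-ins, just to show each output feeding the next step:

def extract(pdf_path):
    return "raw text from the PDF"    # step 1: get the text

def summarize(text):
    return text[:50]                  # step 2: condense it

def format_answer(summary):
    return f"Summary: {summary}"      # step 3: present it

# Each step's output becomes the next step's input - a simple chain.
print(format_answer(summarize(extract("Sample.pdf"))))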

Embedding

A representation of text, images, or audio as points in a semantic vector space.
Similar items (e.g., mobile, smartphone, cell phone) are stored close together in this space.
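
A quick sketch of this in code, using the same sentence-transformers model as the rest of this post; near-synonyms should score noticeably higher than unrelated words:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dimensional vectors

embeddings = model.encode(["mobile", "smartphone", "banana"])
print(util.cos_sim(embeddings[0], embeddings[1]))  # high: near-synonyms
print(util.cos_sim(embeddings[0], embeddings[2]))  # much lower: unrelated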

Installation

pip install pdfplumber sentence-transformers supabase
  • pdfplumber → Extract text from PDF.
  • sentence-transformers → Convert text into embeddings.
  • supabase → Store and search embeddings.

Supabase Setup

  1. Create a Supabase account.
  2. Start a new project and copy:
    • Project URL
    • API Key
  3. Enable vector extension:
CREATE EXTENSION IF NOT EXISTS vector SCHEMA extensions;
  4. Create a documents1 table:
CREATE TABLE documents1 (
    id TEXT PRIMARY KEY,
    text TEXT,
    pdf_id TEXT,
    embedding VECTOR(384)
);
  5. Create a similarity search function:
CREATE FUNCTION match_documents(
    query_embedding VECTOR(384),
    match_count INT
) RETURNS TABLE (
    id TEXT,
    text TEXT
) LANGUAGE plpgsql STABLE AS $$
BEGIN
    RETURN QUERY
    SELECT documents1.id, documents1.text
    FROM documents1
    ORDER BY documents1.embedding <-> query_embedding
    LIMIT match_count;
END;
$$;
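
The <-> in the ORDER BY above is pgvector's Euclidean (L2) distance operator, so the closest embeddings come back first; pgvector also offers <=> for cosine distance and <#> for negative inner product. A tiny example to try in the SQL editor:

SELECT '[1,2,3]'::vector <-> '[4,5,6]'::vector;  -- L2 distance, about 5.196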

PDF Processing

1. Upload PDF (Google Colab)

from google.colab import files
uploaded = files.upload()

2. Extract & Chunk

import pdfplumber

def extract_and_chunk(pdf_path, chunk_size=500):
    # extract_text() can return None on empty pages, hence the "or ''".
    with pdfplumber.open(pdf_path) as pdf:
        text = "".join(page.extract_text() or "" for page in pdf.pages)
    # Split into fixed-size character chunks.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return chunks
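
A quick sanity check before storing anything (Sample.pdf is the file used in the next step):

chunks = extract_and_chunk("Sample.pdf")
print(len(chunks), "chunks; first 80 chars:", chunks[0][:80])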

3. Store in Supabase

from supabase import create_client
from sentence_transformers import SentenceTransformer

supabase_url = "YOUR_SUPABASE_URL"
supabase_key = "YOUR_API_KEY"
supabase = create_client(supabase_url, supabase_key)

model = SentenceTransformer('all-MiniLM-L6-v2')  # must match VECTOR(384) in the table

pdf_path = "Sample.pdf"
chunks = extract_and_chunk(pdf_path)
embeddings = model.encode(chunks).tolist()

data = [
    {"id": f"chunk_{i}", "text": chunk, "embedding": embedding, "pdf_id": "doc1"}
    for i, (chunk, embedding) in enumerate(zip(chunks, embeddings))
]

supabase.table("documents1").insert(data).execute()

4. Query Search

query = "What is the topic?"
query_embedding = model.encode(query).tolist()

# Call the match_documents function defined earlier in Supabase.
response = supabase.rpc(
    "match_documents",
    {"query_embedding": query_embedding, "match_count": 3}
).execute()

# response.data is a list of {"id": ..., "text": ...} rows.
relevant_chunks = [row["text"] for row in response.data]
print("\n---\n".join(relevant_chunks))

Hugging Face Integration

  1. Create a Hugging Face account.
  2. Generate a READ API token.

from huggingface_hub import InferenceClient
import os

# Read the token from the environment, with a placeholder fallback.
client = InferenceClient(
    api_key=os.getenv("HUGGINGFACEHUB_API_TOKEN", "YOUR_HF_API_KEY")
)

Refinement with Mixtral-8x7B

# Join the chunks outside the f-string; backslash escapes inside
# f-string expressions are a SyntaxError before Python 3.12.
joined_chunks = "\n\n---\n\n".join(relevant_chunks)

prompt = f"""
Refine the following extracted text chunks for clarity, conciseness, and improved readability.
Keep the technical meaning accurate.

Text to refine:
{joined_chunks}
"""

response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[
        {"role": "system", "content": "You are an expert technical editor."},
        {"role": "user", "content": prompt}
    ],
    temperature=0.7,
    max_tokens=500
)

print("\n Refined Output:\n")
print(response.choices[0].message.content)
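
The refinement step above polishes retrieved chunks; to get the full-document summary the title promises, the same client can be pointed at the stored chunks directly. A minimal sketch, assuming the chunks list from the extraction step is still in scope (long PDFs would need batching to fit the model's context window):

summary_prompt = (
    "Summarize the following document excerpts into a concise overview:\n\n"
    + "\n\n---\n\n".join(chunks[:10])  # first few chunks only, as a demo
)

summary = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user", "content": summary_prompt}],
    temperature=0.3,
    max_tokens=300
)
print(summary.choices[0].message.content)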

Notes

  • Delete old data before inserting chunks from a new PDF to avoid duplicate ID errors (see the sketch after this list).
  • Hugging Face request cost & speed depend on the chosen model.
  • Supabase vector size (384) must match your embedding model output.
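
For the first note, a minimal sketch of the cleanup, assuming the pdf_id value ("doc1") used earlier:

# Clear rows from the previous PDF before inserting new chunks.
supabase.table("documents1").delete().eq("pdf_id", "doc1").execute()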

Pipeline recap: PDF upload → chunking → storing → querying → refining
