Live Demo: PDFSUMMARIZATION Site
Sample PDF: Download Here
GitHub Code
The PDF Summarization AI Agent is an AI-powered tool that summarizes lengthy PDFs and answers questions based only on their content. It's useful when you need a quick overview without reading the entire document.
- Summarizes large PDF files into concise overviews.
- Answers user questions only from the uploaded PDF.
- Formats responses clearly and preserves technical accuracy.
Used By
Researchers → Extract key findings from academic papers.
Lawyers → Summarize contracts & compliance documents.
Business Analysts → Turn meeting transcripts into quick insights.
Finance Teams → Condense invoices & financial statements.
Students → Create study notes from textbooks.
Tech Used
Streamlit → Front-end & deployment.
LangChain → LLM integration & chaining workflows.
Hugging Face → Pre-trained AI models (e.g., Mixtral-8x7B).
Supabase → Vector database for storing PDF embeddings.
How It Works
- Extract text from the PDF.
- Chunk the text into smaller segments (for large PDFs).
- Embed each chunk into vector form using a transformer model.
- Store embeddings in Supabase Vector DB.
- Perform similarity search to find the most relevant chunks for a query.
- Use a Hugging Face model to refine and format the answer.
Key Concepts
Chaining
A method of breaking a complex task into sequential steps, where the output of one step feeds into the next.
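As a toy illustration of the idea (the helper and the stub steps here are hypothetical, just to show the shape):
def chain(value, *steps):
    # Each step feeds its output into the next — the essence of chaining
    for step in steps:
        value = step(value)
    return value

result = chain(
    "Sample.pdf",
    lambda p: f"extracted text from {p}",  # extract (stub)
    lambda t: t.split(),                   # chunk (stub)
    lambda c: f"{len(c)} chunks",          # summarize (stub)
)
print(result)  # -> 4 chunks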
Embedding
A representation of text, images, or audio as points in a semantic vector space.
Similar items (e.g., mobile, smartphone, cell phone) are stored close together in this space.
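To make the "close together" idea concrete, here is a small sketch using the same embedding model this project uses; exact scores will vary, but the ordering holds:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(["mobile", "smartphone", "banana"])

# Near-synonyms score high; unrelated words score much lower
print(util.cos_sim(emb[0], emb[1]))  # mobile vs. smartphone -> high
print(util.cos_sim(emb[0], emb[2]))  # mobile vs. banana -> low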
Installation
pip install pdfplumber sentence-transformers supabase
- pdfplumber → Extract text from PDFs.
- sentence-transformers → Convert text into embeddings.
- supabase → Store and search embeddings.
Supabase Setup
- Create a Supabase account.
- Start a new project and copy:
- Project URL
- API Key
- Enable vector extension:
CREATE EXTENSION IF NOT EXISTS vector SCHEMA extensions;
- Create documents1 table:
CREATE TABLE documents1 (
    id TEXT PRIMARY KEY,
    text TEXT,
    pdf_id TEXT,
    embedding VECTOR(384)
);
- Create similarity search function:
CREATE FUNCTION match_documents(
    query_embedding VECTOR(384),
    match_count INT
) RETURNS TABLE (
    id TEXT,
    text TEXT
) LANGUAGE plpgsql STABLE AS $$
BEGIN
    RETURN QUERY
    SELECT documents1.id, documents1.text
    FROM documents1
    ORDER BY documents1.embedding <-> query_embedding
    LIMIT match_count;
END;
$$;
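Two optional extras, both assumptions on my part rather than steps from the original setup: a pgvector ivfflat index speeds up similarity search on large tables (vector_l2_ops matches the <-> operator used above), and a dummy zero vector gives a quick smoke test:
-- Optional: approximate-nearest-neighbour index for larger tables
CREATE INDEX ON documents1 USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);

-- Smoke test with a dummy 384-dimensional zero vector
SELECT * FROM match_documents(array_fill(0, ARRAY[384])::vector, 3);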
PDF Processing
1. Upload PDF (Google Colab)
from google.colab import files
uploaded = files.upload()
2. Extract & Chunk
import pdfplumber

def extract_and_chunk(pdf_path, chunk_size=500):
    # Pull text from every page, skipping pages with no extractable text
    with pdfplumber.open(pdf_path) as pdf:
        text = "".join(page.extract_text() or "" for page in pdf.pages)
    # Split into fixed-size character chunks
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return chunks
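A quick sanity check on the sample PDF (the chunk count will depend on the document):
chunks = extract_and_chunk("Sample.pdf")
print(f"{len(chunks)} chunks; first chunk starts: {chunks[0][:80]!r}")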
3. Store in Supabase
from supabase import create_client
from sentence_transformers import SentenceTransformer

supabase_url = "YOUR_SUPABASE_URL"
supabase_key = "YOUR_API_KEY"
supabase = create_client(supabase_url, supabase_key)

# Produces 384-dimensional embeddings, matching the VECTOR(384) column
model = SentenceTransformer("all-MiniLM-L6-v2")

pdf_path = "Sample.pdf"
chunks = extract_and_chunk(pdf_path)
embeddings = model.encode(chunks).tolist()

data = [
    {"id": f"chunk_{i}", "text": chunk, "embedding": embedding, "pdf_id": "doc1"}
    for i, (chunk, embedding) in enumerate(zip(chunks, embeddings))
]
supabase.table("documents1").insert(data).execute()
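If you re-run this cell, the fixed "chunk_{i}" IDs will collide with rows already stored (see Notes below). A small cleanup call before the insert avoids that:
# Clear previously stored chunks for this document before re-inserting
supabase.table("documents1").delete().eq("pdf_id", "doc1").execute()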
4. Similarity Search
query = "What is the topic?"
query_embedding = model.encode(query).tolist()

response = supabase.rpc(
    "match_documents",
    {"query_embedding": query_embedding, "match_count": 3}
).execute()

relevant_chunks = [row["text"] for row in response.data]
print("\n---\n".join(relevant_chunks))
Hugging Face Integration
- Create a Hugging Face account.
- Generate a READ API token.
from huggingface_hub import InferenceClient
import os

client = InferenceClient(
    api_key=os.getenv("HUGGINGFACEHUB_API_TOKEN", "YOUR_HF_API_KEY")
)
Refinement with Mixtral-8x7B
# Join chunks outside the f-string (backslashes inside f-string
# expressions are a SyntaxError on Python < 3.12)
joined_chunks = "\n\n---\n\n".join(relevant_chunks)

prompt = f"""
Refine the following extracted text chunks for clarity, conciseness, and improved readability.
Keep the technical meaning accurate.

Text to refine:
{joined_chunks}
"""

response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[
        {"role": "system", "content": "You are an expert technical editor."},
        {"role": "user", "content": prompt}
    ],
    temperature=0.7,
    max_tokens=500
)

print("\nRefined Output:\n")
print(response.choices[0].message.content)
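The same retrieval-plus-generation pattern covers the question-answering feature mentioned at the top. This is a sketch rather than code from the original project; the prompt wording is my own, and it reuses joined_chunks and client from above:
question = "What is the topic?"

qa_prompt = f"""Answer the question using ONLY the context below.
If the answer is not in the context, say you don't know.

Context:
{joined_chunks}

Question: {question}
"""

answer = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user", "content": qa_prompt}],
    temperature=0.2,
    max_tokens=300
)
print(answer.choices[0].message.content)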
Notes
- Delete old data before inserting chunks from a new PDF to avoid duplicate ID errors.
- Hugging Face request cost and speed depend on the chosen model.
- The Supabase vector size (384) must match your embedding model's output dimension.
Full pipeline: PDF upload → chunking → storing → querying → refining
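As a wrap-up, here is a minimal sketch that chains all of the steps above into one function. It assumes the objects defined earlier in this post (extract_and_chunk, model, supabase, client) are in scope; the function name and prompt wording are mine, not from the original code:
def ask_pdf(pdf_path, question, pdf_id="doc1", top_k=3):
    # 1. Extract and chunk
    chunks = extract_and_chunk(pdf_path)
    # 2. Embed and (re)store, clearing stale rows first
    supabase.table("documents1").delete().eq("pdf_id", pdf_id).execute()
    embeddings = model.encode(chunks).tolist()
    supabase.table("documents1").insert([
        {"id": f"chunk_{i}", "text": c, "embedding": e, "pdf_id": pdf_id}
        for i, (c, e) in enumerate(zip(chunks, embeddings))
    ]).execute()
    # 3. Retrieve the most relevant chunks for the question
    hits = supabase.rpc(
        "match_documents",
        {"query_embedding": model.encode(question).tolist(), "match_count": top_k}
    ).execute()
    context = "\n\n---\n\n".join(row["text"] for row in hits.data)
    # 4. Ask the model to answer from the retrieved context only
    reply = client.chat.completions.create(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",
        messages=[{"role": "user", "content": f"Answer only from this context:\n{context}\n\nQuestion: {question}"}],
        max_tokens=400
    )
    return reply.choices[0].message.content

print(ask_pdf("Sample.pdf", "What is the topic?"))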