Build an AI Agent to Auto-Summarize PDFs with LangChain, Hugging Face, and Supabase
This idea came to me while working on a project with extensive documentation. It was time-consuming and overwhelming to extract only the important details.
I thought — what if I had an assistant that could read the entire document, summarize it, and even answer my questions? That's when the PDF Summarization AI Agent was born.
Live Demo: PDFSUMMARIZATION Site
Sample PDF: Download Here
Why This Matters
PDFs are everywhere — academic papers, contracts, reports, manuals — but manually skimming hundreds of pages isn’t scalable. This is especially painful for:
- Researchers: Extract key findings from long papers.
- Lawyers: Summarize contracts & compliance docs.
- Business Analysts: Turn meeting transcripts into quick insights.
- Finance Teams: Condense invoices & statements.
- Students: Turn textbooks into study notes.
Tech Stack
| Tool | Purpose |
|---|---|
| Streamlit | Easy Python web app frontend |
| LangChain | Handles LLM workflows & chaining |
| Hugging Face | Provides pre-trained AI models |
| Supabase | Vector DB for semantic search |
| PyPDF2 | Extracts text from PDFs |
How It Works (High-Level Flow)
- Upload PDF(s)
- Extract Text → using PyPDF2
- Chunk & Embed → LangChain breaks text into smaller parts
- Store in Supabase → for semantic search
- Query AI → Hugging Face / Gemini answers based on context
- Return Summary or Q&A Answer
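The "Chunk & Embed" step above is worth a closer look. In the full app, LangChain's text splitters handle this, but the core idea is simple: slice the extracted text into fixed-size windows with a small overlap so sentences at chunk boundaries aren't lost. Here's a minimal plain-Python sketch of that idea (the `chunk_size` and `overlap` values are illustrative, not the ones the app necessarily uses):

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks so context isn't lost at boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Step forward by less than chunk_size so consecutive chunks overlap
        start += chunk_size - overlap
    return chunks

sample = "word " * 300  # ~1500 characters of dummy text
chunks = chunk_text(sample)
print(f"{len(chunks)} chunks, first chunk is {len(chunks[0])} chars")
```

Each chunk would then be embedded and stored as a row in Supabase for semantic search. LangChain's `RecursiveCharacterTextSplitter` does the same job more carefully, preferring to split on paragraph and sentence boundaries.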
Setup Instructions
1. Get a Google AI Studio API Key
   - Visit Google AI Studio API Key
   - Click Create API Key (new project)
   - Copy your key.
2. Install Required Libraries

```bash
pip install langchain langchain-core langchain-google-genai PyPDF2
```
Let's Start With the Basics
We'll start with a simple AI agent that answers questions about PDF text using the Gemini API.
```python
import warnings
warnings.filterwarnings("ignore")

import os

import PyPDF2
from langchain.prompts import PromptTemplate
from langchain_core.runnables import RunnableSequence
from langchain_google_genai import ChatGoogleGenerativeAI

# Set your Gemini API key
os.environ["GOOGLE_API_KEY"] = "API"

# Extract text from multiple PDFs
def extract_text_from_pdf(pdf_paths):
    text = ""
    for pdf_path in pdf_paths:  # Iterate over the list of PDF paths
        try:
            with open(pdf_path, "rb") as file:
                reader = PyPDF2.PdfReader(file)
                for page in reader.pages:
                    page_text = page.extract_text()
                    if page_text:
                        text += page_text + "\n"  # Newline separates pages/PDFs
        except FileNotFoundError:
            text += f"Error: The file '{pdf_path}' was not found.\n"
    return text

# Define prompt template
template = """
You are an expert AI assistant. Use the information provided for answering the question
Context: {context}
Question: {question}
Answer:
"""
prompt = PromptTemplate(input_variables=["context", "question"], template=template)

# Initialize Gemini LLM and chain
llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-flash",
    google_api_key=os.environ["GOOGLE_API_KEY"],
)
qa_chain = RunnableSequence(prompt | llm)  # Updated to use RunnableSequence

# Function to answer questions
def answer_question(pdf_text, question):
    if not pdf_text:
        return "Error: No text extracted from the PDFs."
    answer = qa_chain.invoke({"context": pdf_text, "question": question})  # Updated to use invoke
    return answer.content if hasattr(answer, "content") else answer  # Handle response content

# Example usage
if __name__ == "__main__":
    pdf_paths = ["sample3.pdf"]  # Replace with your list of PDF file paths
    pdf_text = extract_text_from_pdf(pdf_paths)  # Pass the list of PDF paths
    question = input("Enter text: ")
    answer = answer_question(pdf_text, question)
    print(f"Question: {question}\nAnswer: {answer}")
    # print(len(pdf_text))  # Uncomment to print the length of extracted text
```
Full Code Walkthrough
Here’s a detailed explanation of every part of the code for those who want the deep dive.
extract_text_from_pdf
- Loops through PDF file paths
- Uses PyPDF2 to read & extract text page-by-page
- Adds newlines to separate pages
Prompt Template
- {context} = extracted PDF text
- {question} = user’s query
- AI responds only based on the provided context
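Under the hood, `PromptTemplate` performs essentially the same substitution as Python's built-in `str.format`. This plain-Python sketch (with made-up invoice text as the context) shows what the model actually receives:

```python
template = """You are an expert AI assistant. Use the information provided for answering the question
Context: {context}
Question: {question}
Answer:"""

# Hypothetical context/question pair for illustration
filled = template.format(
    context="The invoice total is $120, due March 1.",
    question="When is the invoice due?",
)
print(filled)
```

Because the whole extracted PDF text is injected into `{context}`, the model is grounded in your document rather than answering from its general training data.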
LLM Initialization
- Uses Gemini 1.5 Flash (fast, cost-effective)
- RunnableSequence pipes the prompt output into the AI model
This code is intentionally basic — it only hints at how we'll extract data from a PDF. Try it yourself with larger files and you'll quickly see its drawbacks; we'll address those next.