Automated PDF Summarization Using AI Agents: LangChain + Hugging Face + Supabase + Streamlit - Basic

Build an AI Agent to Auto-Summarize PDFs with LangChain, Hugging Face, and Supabase

This idea came to me while working on a project with extensive documentation. It was time-consuming and overwhelming to extract only the important details.

I thought: what if I had an assistant that could read the entire document, summarize it, and even answer my questions? That's when the PDF Summarization AI Agent was born.

Live Demo: PDFSUMMARIZATION Site

Sample PDF: Download Here


Why This Matters

PDFs are everywhere — academic papers, contracts, reports, manuals — but manually skimming hundreds of pages isn’t scalable. This is especially painful for:

  • Researchers: Extract key findings from long papers.
  • Lawyers: Summarize contracts & compliance docs.
  • Business Analysts: Turn meeting transcripts into quick insights.
  • Finance Teams: Condense invoices & statements.
  • Students: Turn textbooks into study notes.

Tech Stack

| Tool | Purpose |
| --- | --- |
| Streamlit | Easy Python web app frontend |
| LangChain | Handles LLM workflows & chaining |
| Hugging Face | Provides pre-trained AI models |
| Supabase | Vector DB for semantic search |
| PyPDF2 | Extracts text from PDFs |

How It Works (High-Level Flow)

  1. Upload PDF(s)
  2. Extract Text → using PyPDF2
  3. Chunk & Embed → LangChain breaks text into smaller parts
  4. Store in Supabase → for semantic search
  5. Query AI → Hugging Face / Gemini answers based on context
  6. Return Summary or Q&A Answer
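
The basic script later in this post covers steps 1, 2, 5 and 6 only, passing the whole extracted text straight to the model. For completeness, here is a minimal sketch of what steps 3 and 4 could look like. It assumes a Supabase project with a documents table and a match_documents function set up as described in the LangChain SupabaseVectorStore docs; the table name, function name, embedding model, and chunk sizes here are illustrative choices, not part of the basic code below.

import os
from supabase import create_client
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import SupabaseVectorStore

# Step 3: split the extracted PDF text into overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.create_documents([pdf_text])  # pdf_text comes from extract_text_from_pdf

# Embed each chunk with a Hugging Face sentence-transformer model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Step 4: store the chunks and their embeddings in Supabase for semantic search
supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])
vector_store = SupabaseVectorStore.from_documents(
    chunks,
    embeddings,
    client=supabase,
    table_name="documents",
    query_name="match_documents",
)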

Setup Instructions

  1. Get a Google AI Studio API Key

  2. Install Required Libraries

     
pip install langchain langchain-core langchain-google-genai PyPDF2
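
The command above is enough for the basic script in this post. If you also want to try the Supabase and Hugging Face sketches further down, you will likely need a few extra packages as well (exact package names can vary with your LangChain version):

pip install langchain-community langchain-text-splitters supabase sentence-transformers streamlit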
    

Let's Start With the Basics

We'll start by building a simple AI agent that uses a Gemini API key.

import warnings
warnings.filterwarnings("ignore")
from langchain.prompts import PromptTemplate
from langchain_core.runnables import RunnableSequence
from langchain_google_genai import ChatGoogleGenerativeAI
import PyPDF2
import os

# Set your Gemini API key
os.environ["GOOGLE_API_KEY"] = "YOUR_API_KEY"  # Replace with your Google AI Studio key

# Extract text from multiple PDFs
def extract_text_from_pdf(pdf_paths):
    text = ""
    for pdf_path in pdf_paths:  # Iterate over the list of PDF paths
        try:
            with open(pdf_path, "rb") as file:
                reader = PyPDF2.PdfReader(file)
                for page in reader.pages:
                    page_text = page.extract_text()
                    if page_text:
                        text += page_text + "\n"  # Add newline to separate pages
        except FileNotFoundError:
            text += f"Error: The file '{pdf_path}' was not found.\n"
    return text

# Define prompt template
template = """
You are an expert AI assistant. Use the information provided for answering the question
Context: {context}
Question: {question}
Answer:
"""
prompt = PromptTemplate(input_variables=["context", "question"], template=template)

# Initialize Gemini LLM and chain
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", google_api_key=os.environ["GOOGLE_API_KEY"])
qa_chain = RunnableSequence(prompt | llm)  # Updated to use RunnableSequence

# Function to answer questions
def answer_question(pdf_text, question):
    if not pdf_text:
        return "Error: No text extracted from the PDFs."
    answer = qa_chain.invoke({"context": pdf_text, "question": question})  # Updated to use invoke
    return answer.content if hasattr(answer, 'content') else answer  # Handle response content

# Example usage
if __name__ == "__main__":
    pdf_paths = ["sample3.pdf"]  # Replace with your list of PDF file paths
    pdf_text = extract_text_from_pdf(pdf_paths)  # Pass the list of PDF paths
    question = input("Enter your question: ")
    answer = answer_question(pdf_text, question)
    print(f"Question: {question}\nAnswer: {answer}")
    # print(len(pdf_text))  # Uncomment to print the length of extracted text

Full Code Walkthrough

Here’s a detailed explanation of every part of the code for those who want the deep dive.

extract_text_from_pdf

  • Loops through PDF file paths
  • Uses PyPDF2 to read & extract text page-by-page
  • Adds newlines to separate pages

Prompt Template

  • {context} = extracted PDF text
  • {question} = user’s query
  • AI responds only based on the provided context
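
A quick way to sanity-check the template is to render it yourself before wiring it to the model; the context and question below are made-up values, purely for illustration:

# Render the template with illustrative values to see exactly what is sent to the LLM
filled = prompt.format(
    context="Invoice INV-042 totals $1,200 and is due on 30 September.",
    question="When is the invoice due?",
)
print(filled)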

LLM Initialization

  • Uses Gemini 1.5 Flash (fast, cost-effective)
  • RunnableSequence pipes the prompt output into the AI model

This code is intentionally basic; it just gives you a hint of how we are going to extract data from PDFs. Try it yourself with larger files and you will quickly run into its drawbacks, which we will cover next.
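
For example, the biggest drawback with large PDFs is that the entire document is stuffed into a single prompt, which can blow past the model's context window. The usual fix, and the reason Supabase is in the stack, is to retrieve only the chunks relevant to the question and pass those as context instead. A rough sketch, assuming the vector_store from the earlier Supabase sketch and the qa_chain defined above:

# Step 5: fetch only the chunks most relevant to the question
relevant_docs = vector_store.similarity_search(question, k=4)
context = "\n\n".join(doc.page_content for doc in relevant_docs)

# Ask the same chain, but with a much smaller, focused context
answer = qa_chain.invoke({"context": context, "question": question})
print(answer.content)

This keeps the prompt small no matter how big the PDF is.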
