Automated PDF Summarization Using AI Agents: LangChain + Hugging Face + Supabase + Streamlit - Basic

Build an AI Agent to Auto-Summarize PDFs with LangChain, Hugging Face, and Supabase

This idea came to me while working on a project with extensive documentation. It was time-consuming and overwhelming to extract only the important details.

I thought: what if I had an assistant that could read the entire document, summarize it, and even answer my questions? That's when the PDF Summarization AI Agent was born.

Live Demo: PDFSUMMARIZATION Site

Sample PDF: Download Here


Why This Matters

PDFs are everywhere — academic papers, contracts, reports, manuals — but manually skimming hundreds of pages isn’t scalable. This is especially painful for:

  • Researchers: Extract key findings from long papers.
  • Lawyers: Summarize contracts & compliance docs.
  • Business Analysts: Turn meeting transcripts into quick insights.
  • Finance Teams: Condense invoices & statements.
  • Students: Turn textbooks into study notes.

Tech Stack

| Tool | Purpose |
| --- | --- |
| Streamlit | Easy Python web app frontend |
| LangChain | Handles LLM workflows & chaining |
| Hugging Face | Provides pre-trained AI models |
| Supabase | Vector DB for semantic search |
| PyPDF2 | Extracts text from PDFs |

How It Works (High-Level Flow)

  1. Upload PDF(s)
  2. Extract Text → using PyPDF2
  3. Chunk & Embed → LangChain breaks text into smaller parts
  4. Store in Supabase → for semantic search
  5. Query AI → Hugging Face / Gemini answers based on context
  6. Return Summary or Q&A Answer
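
The basic script later in this post covers steps 1, 2, 5 and 6 only, passing the whole extracted text straight to the model. For completeness, here is a minimal sketch of what steps 3 and 4 could look like. It assumes a Supabase project with a documents table and a match_documents function set up as described in the LangChain SupabaseVectorStore docs; the table name, function name, embedding model, and chunk sizes here are illustrative choices, not part of the basic code below.

import os
from supabase import create_client
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import SupabaseVectorStore

# Step 3: split the extracted PDF text into overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.create_documents([pdf_text])  # pdf_text comes from extract_text_from_pdf

# Embed each chunk with a Hugging Face sentence-transformer model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Step 4: store the chunks and their embeddings in Supabase for semantic search
supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])
vector_store = SupabaseVectorStore.from_documents(
    chunks,
    embeddings,
    client=supabase,
    table_name="documents",
    query_name="match_documents",
)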

Setup Instructions

  1. Get a Google AI Studio API Key

  2. Install Required Libraries

     
pip install langchain langchain-core langchain-google-genai PyPDF2
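
The command above is enough for the basic script in this post. If you also want to try the Supabase and Hugging Face sketches further down, you will likely need a few extra packages as well (exact package names can vary with your LangChain version):

pip install langchain-community langchain-text-splitters supabase sentence-transformers streamlit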
    

Let's Start With the Basics

We'll start by building a simple AI agent that uses a Gemini API key.

import warnings
warnings.filterwarnings("ignore")
from langchain.prompts import PromptTemplate
from langchain_core.runnables import RunnableSequence
from langchain_google_genai import ChatGoogleGenerativeAI
import PyPDF2
import os

# Set your Gemini API key
os.environ["GOOGLE_API_KEY"] = "YOUR_API_KEY"  # Replace with your Google AI Studio key

# Extract text from multiple PDFs
def extract_text_from_pdf(pdf_paths):
    text = ""
    for pdf_path in pdf_paths:  # Iterate over the list of PDF paths
        try:
            with open(pdf_path, "rb") as file:
                reader = PyPDF2.PdfReader(file)
                for page in reader.pages:
                    page_text = page.extract_text()
                    if page_text:
                        text += page_text + "\n"  # Add newline to separate pages
        except FileNotFoundError:
            text += f"Error: The file '{pdf_path}' was not found.\n"
    return text

# Define prompt template
template = """
You are an expert AI assistant. Use the information provided for answering the question
Context: {context}
Question: {question}
Answer:
"""
prompt = PromptTemplate(input_variables=["context", "question"], template=template)

# Initialize Gemini LLM and chain
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", google_api_key=os.environ["GOOGLE_API_KEY"])
qa_chain = RunnableSequence(prompt | llm)  # Updated to use RunnableSequence

# Function to answer questions
def answer_question(pdf_text, question):
    if not pdf_text:
        return "Error: No text extracted from the PDFs."
    answer = qa_chain.invoke({"context": pdf_text, "question": question})  # Updated to use invoke
    return answer.content if hasattr(answer, 'content') else answer  # Handle response content

# Example usage
if __name__ == "__main__":
    pdf_paths = ["sample3.pdf"]  # Replace with your list of PDF file paths
    pdf_text = extract_text_from_pdf(pdf_paths)  # Pass the list of PDF paths
    question = input("Enter your question: ")
    answer = answer_question(pdf_text, question)
    print(f"Question: {question}\nAnswer: {answer}")
    # print(len(pdf_text))  # Uncomment to print the length of extracted text

Full Code Walkthrough

Here’s a detailed explanation of every part of the code for those who want the deep dive.

extract_text_from_pdf

  • Loops through PDF file paths
  • Uses PyPDF2 to read & extract text page-by-page
  • Adds newlines to separate pages

Prompt Template

  • {context} = extracted PDF text
  • {question} = user’s query
  • AI responds only based on the provided context
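
A quick way to sanity-check the template is to render it yourself before wiring it to the model; the context and question below are made-up values, purely for illustration:

# Render the template with illustrative values to see exactly what is sent to the LLM
filled = prompt.format(
    context="Invoice INV-042 totals $1,200 and is due on 30 September.",
    question="When is the invoice due?",
)
print(filled)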

LLM Initialization

  • Uses Gemini 1.5 Flash (fast, cost-effective)
  • RunnableSequence pipes the prompt output into the AI model

This code is intentionally basic; it just gives you a hint of how we are going to extract data from PDFs. Try it yourself with larger files and you will quickly run into its drawbacks, which we will cover next.
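
For example, the biggest drawback with large PDFs is that the entire document is stuffed into a single prompt, which can blow past the model's context window. The usual fix, and the reason Supabase is in the stack, is to retrieve only the chunks relevant to the question and pass those as context instead. A rough sketch, assuming the vector_store from the earlier Supabase sketch and the qa_chain defined above:

# Step 5: fetch only the chunks most relevant to the question
relevant_docs = vector_store.similarity_search(question, k=4)
context = "\n\n".join(doc.page_content for doc in relevant_docs)

# Ask the same chain, but with a much smaller, focused context
answer = qa_chain.invoke({"context": context, "question": question})
print(answer.content)

This keeps the prompt small no matter how big the PDF is.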
