No internet. No APIs. Just Python, PDFs, and pure privacy.
Overview
In this project, I built a lightweight PDF Summarization Chatbot that runs entirely offline using a local LLM via LM Studio. The tool reads user-uploaded PDF files, lets you select specific page ranges, and summarizes them into structured JSON or plain text, all through a sleek Streamlit interface.
Ideal for researchers, students, or anyone who wants AI-powered summarization without compromising data privacy.
Project Goal
The primary aim was to build a fully local, responsive, and user-friendly summarization tool in just under a day. No cloud dependencies, no external APIs. A tool you can run on your own machine with full control over your data.
System Architecture
The app consists of modular components, each serving a distinct responsibility:
1. File Upload & Page Selection
- Upload any .pdf file via the sidebar.
- Choose a page range (e.g., pages 3 to 8).
- Page count validation and file integrity checks are built in via PyPDF2 (sketched below).
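Here is a minimal sketch of how that upload-and-validate step can look in Streamlit. The widget labels and the slider defaults are illustrative, not taken from the repo:

import streamlit as st
from PyPDF2 import PdfReader

uploaded = st.sidebar.file_uploader("Upload a PDF", type="pdf")
if uploaded is not None:
    # PdfReader raises an error on a corrupt file, which doubles as an integrity check
    reader = PdfReader(uploaded)
    num_pages = len(reader.pages)
    st.sidebar.write(f"Document has {num_pages} pages")
    # Clamp the selectable range to the actual page count
    start, end = st.sidebar.slider("Page range", 1, num_pages, (1, min(3, num_pages)))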
2. Text Extraction
- The selected PDF pages are split off into a temporary file.
- Raw text is extracted and sanitized before being sent to the model (see the sketch below).
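A sketch of the splitting-and-extraction logic using PyPDF2's PdfReader and PdfWriter. The function name extract_pages and the whitespace-collapsing sanitization step are my own illustration, not the repo's exact code:

import tempfile
from PyPDF2 import PdfReader, PdfWriter

def extract_pages(src_path: str, start: int, end: int) -> str:
    """Copy pages start..end (1-based, inclusive) into a temp PDF, then extract their text."""
    reader = PdfReader(src_path)
    writer = PdfWriter()
    for i in range(start - 1, end):
        writer.add_page(reader.pages[i])
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        writer.write(tmp)
    # extract_text() can return None on image-only pages, hence the "or ''"
    text = "\n".join(page.extract_text() or "" for page in PdfReader(tmp.name).pages)
    return " ".join(text.split())  # collapse whitespace as a basic sanitization pass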
3. Local LLM Summarization
- Text is sent to an LM Studio-powered local model (deepseek-r1 in this case).
- A custom prompt guides the model to extract:
- Title
- Summary
- Keywords
- Section-wise breakdown
The prompt enforces English-only output, even if the original document is in another language.
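LM Studio exposes an OpenAI-compatible HTTP server, by default on localhost:1234, so the inference call can be a plain POST. A sketch assuming that default port; the message contents and sampling parameters here are placeholders, not the project's actual prompt:

import requests

def summarize(document_text: str) -> str:
    resp = requests.post(
        "http://localhost:1234/v1/chat/completions",
        json={
            "model": "deepseek-r1",  # whichever model is loaded in LM Studio
            "messages": [
                {"role": "system", "content": "Reply with valid JSON only, in English."},
                {"role": "user", "content": f"Summarize this document:\n\n{document_text}"},
            ],
            "temperature": 0.2,
        },
        timeout=600,  # local inference on long documents can be slow
    )
    resp.raise_for_status()
    # Returns the model's raw JSON string, to be parsed by the caller
    return resp.json()["choices"][0]["message"]["content"]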
4. Chat-style Interface
- The user sees real-time status updates (PDF splitting, extraction, inference).
- Final results are shown in an expandable JSON viewer.
- Outputs can be exported in several ways (sketched below):
  - a .json file
  - a plain .txt summary
  - or saved locally with one click
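A sketch of that status-and-export flow in Streamlit (st.status requires Streamlit 1.26+; the placeholder result dict stands in for the real pipeline output):

import json
import streamlit as st

with st.status("Processing PDF...", expanded=True) as status:
    st.write("Splitting selected pages...")
    st.write("Extracting text...")
    st.write("Running local inference...")
    result = {"title": "...", "summary": "...", "keywords": [], "sections": {}}  # placeholder
    status.update(label="Done!", state="complete")

st.json(result, expanded=False)  # expandable JSON viewer
st.download_button("Export .json", json.dumps(result, indent=2), file_name="summary.json")
st.download_button("Export .txt", result["summary"], file_name="summary.txt")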
Example Output
Here's a trimmed sample of what the AI produces:
{
"title": "History of Computer Vision",
"summary": "This document covers the evolution of computer vision...",
"keywords": ["computer vision", "image processing", "AI", ...],
"sections": {
"Introduction": "Overview of early image analysis techniques",
"Deep Learning": "Transition to CNN-based systems post-2012"
}
}
All content, no matter the original language, is translated and summarized in English.
Tech Stack
- Python
- Streamlit for the UI
- LM Studio to run LLMs locally (no internet required)
- PyPDF2 for PDF parsing and splitting
- A custom prompt engine for structured AI output
- Logging: each action (splitting, parsing, inference) is tracked (see below)
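The logging setup can be as simple as Python's standard library; the file name, format, and example messages here are illustrative:

import logging

logging.basicConfig(
    filename="chatbot.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

logging.info("Split pages %d-%d from uploaded PDF", 3, 8)  # example values
logging.info("Extracted %d characters of text", 12480)     # example values
logging.info("Inference complete")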
Prompt Design
Here's a snippet from the actual prompt used for the LLM:
"Carefully read the PDF document below. Analyze the text and provide the following information in JSON format... all responses must be in English..."
The model then returns structured JSON based on this schema:
- title
- summary
- keywords
- sections
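Based on that snippet and the schema above, the prompt template plausibly looks something like this (a reconstruction, not the verbatim prompt from the repo):

PROMPT_TEMPLATE = """Carefully read the PDF document below. Analyze the text and provide
the following information in JSON format, using the keys "title", "summary",
"keywords", and "sections". All responses must be in English, regardless of the
document's original language.

Document:
{document}
"""

def build_prompt(document_text: str) -> str:
    # Inject the extracted PDF text into the template
    return PROMPT_TEMPLATE.format(document=document_text)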
Prompts are easily extendable to support Q&A, translations, or sentiment analysis later.
Why Local?
- No risk of uploading sensitive files to the cloud
- Works even offline or in air-gapped environments
- Fast iterations without API rate limits
Timeline
This was built in under 24 hours, including UI design, backend logic, and testing. From idea to execution to results in one sitting.
Try It Yourself!
The project is fully open-source:
GitHub Repository
Requires LM Studio installed and a model downloaded locally.
Clone the repo, run streamlit run main.py, and you're good to go!
Wrap-Up
This project proved how powerful and accessible local LLMs have become. In a world full of cloud-heavy tools, sometimes all you need is a smart offline assistant that just works.
If you found this useful, feel free to star the repo and share it around!