No internet. No APIs. Just Python, PDFs, and pure privacy.
Overview
In this project, I built a lightweight PDF Summarization Chatbot that runs entirely offline using a local LLM via LM Studio. The tool reads user-uploaded PDF files, lets you select specific page ranges, and summarizes them into structured JSON or plain text, all through a sleek Streamlit interface.
Ideal for researchers, students, or anyone who wants AI-powered summarization without compromising data privacy.
Project Goal
The primary aim was to build a fully local, responsive, and user-friendly summarization tool in just under a day. No cloud dependencies, no external APIs. A tool you can run on your own machine with full control over your data.
System Architecture
The app consists of modular components, each serving a distinct responsibility:
1. File Upload & Page Selection
- Upload any .pdf file via the sidebar.
- Choose a page range (e.g., pages 3 to 8).
- Page count validation and file integrity checks are built in via PyPDF2 (sketched below).
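Here is a minimal sketch of how that upload-and-validate step can look in Streamlit. The widget labels and the slider defaults are illustrative, not taken from the repo:

import streamlit as st
from PyPDF2 import PdfReader

uploaded = st.sidebar.file_uploader("Upload a PDF", type="pdf")
if uploaded is not None:
    # PdfReader raises an error on a corrupt file, which doubles as an integrity check
    reader = PdfReader(uploaded)
    num_pages = len(reader.pages)
    st.sidebar.write(f"Document has {num_pages} pages")
    # Clamp the selectable range to the actual page count
    start, end = st.sidebar.slider("Page range", 1, num_pages, (1, min(3, num_pages)))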
2. Text Extraction
- The selected PDF pages are split off into a temporary file.
- Raw text is extracted and sanitized before being sent to the model (see the sketch below).
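A sketch of the splitting-and-extraction logic using PyPDF2's PdfReader and PdfWriter. The function name extract_pages and the whitespace-collapsing sanitization step are my own illustration, not the repo's exact code:

import tempfile
from PyPDF2 import PdfReader, PdfWriter

def extract_pages(src_path: str, start: int, end: int) -> str:
    """Copy pages start..end (1-based, inclusive) into a temp PDF, then extract their text."""
    reader = PdfReader(src_path)
    writer = PdfWriter()
    for i in range(start - 1, end):
        writer.add_page(reader.pages[i])
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        writer.write(tmp)
    # extract_text() can return None on image-only pages, hence the "or ''"
    text = "\n".join(page.extract_text() or "" for page in PdfReader(tmp.name).pages)
    return " ".join(text.split())  # collapse whitespace as a basic sanitization pass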
3. Local LLM Summarization
- Text is sent to an LM Studio-powered local model (deepseek-r1 in this case).
- A custom prompt guides the model to extract:
- Title
- Summary
- Keywords
- Section-wise breakdown
The prompt enforces English-only output, even if the original document is in another language.
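LM Studio exposes an OpenAI-compatible HTTP server, by default on localhost:1234, so the inference call can be a plain POST. A sketch assuming that default port; the message contents and sampling parameters here are placeholders, not the project's actual prompt:

import requests

def summarize(document_text: str) -> str:
    resp = requests.post(
        "http://localhost:1234/v1/chat/completions",
        json={
            "model": "deepseek-r1",  # whichever model is loaded in LM Studio
            "messages": [
                {"role": "system", "content": "Reply with valid JSON only, in English."},
                {"role": "user", "content": f"Summarize this document:\n\n{document_text}"},
            ],
            "temperature": 0.2,
        },
        timeout=600,  # local inference on long documents can be slow
    )
    resp.raise_for_status()
    # Returns the model's raw JSON string, to be parsed by the caller
    return resp.json()["choices"][0]["message"]["content"]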
4. Chat-style Interface
- The user sees real-time status updates (PDF splitting, extraction, inference).
- Final results are shown in an expandable JSON viewer.
- Outputs can be exported in several ways (sketched below):
  - a .json file
  - a plain .txt summary
  - or saved locally with one click
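A sketch of that status-and-export flow in Streamlit (st.status requires Streamlit 1.26+; the placeholder result dict stands in for the real pipeline output):

import json
import streamlit as st

with st.status("Processing PDF...", expanded=True) as status:
    st.write("Splitting selected pages...")
    st.write("Extracting text...")
    st.write("Running local inference...")
    result = {"title": "...", "summary": "...", "keywords": [], "sections": {}}  # placeholder
    status.update(label="Done!", state="complete")

st.json(result, expanded=False)  # expandable JSON viewer
st.download_button("Export .json", json.dumps(result, indent=2), file_name="summary.json")
st.download_button("Export .txt", result["summary"], file_name="summary.txt")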
Example Output
Here's a trimmed sample of what the AI produces:
{
"title": "History of Computer Vision",
"summary": "This document covers the evolution of computer vision...",
"keywords": ["computer vision", "image processing", "AI", ...],
"sections": {
"Introduction": "Overview of early image analysis techniques",
"Deep Learning": "Transition to CNN-based systems post-2012"
}
}
All content, no matter the original language, is translated and summarized in English.
Tech Stack
- Python
- Streamlit for the UI
- LM Studio to run LLMs locally (no internet required)
- PyPDF2 for PDF parsing and splitting
- A custom prompt engine for structured AI output
- Logging: each action (splitting, parsing, inference) is tracked (see below)
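The logging setup can be as simple as Python's standard library; the file name, format, and example messages here are illustrative:

import logging

logging.basicConfig(
    filename="chatbot.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

logging.info("Split pages %d-%d from uploaded PDF", 3, 8)  # example values
logging.info("Extracted %d characters of text", 12480)     # example values
logging.info("Inference complete")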
Prompt Design
Here's a snippet from the actual prompt used for the LLM:
"Carefully read the PDF document below. Analyze the text and provide the following information in JSON format... all responses must be in English..."
The model then returns structured JSON based on this schema:
- title
- summary
- keywords
- sections
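Based on that snippet and the schema above, the prompt template plausibly looks something like this (a reconstruction, not the verbatim prompt from the repo):

PROMPT_TEMPLATE = """Carefully read the PDF document below. Analyze the text and provide
the following information in JSON format, using the keys "title", "summary",
"keywords", and "sections". All responses must be in English, regardless of the
document's original language.

Document:
{document}
"""

def build_prompt(document_text: str) -> str:
    # Inject the extracted PDF text into the template
    return PROMPT_TEMPLATE.format(document=document_text)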
Prompts are easily extendable to support Q&A, translations, or sentiment analysis later.
Why Local?
- No risk of uploading sensitive files to the cloud
- Works even offline or in air-gapped environments
- Fast iterations without API rate limits
Timeline
This was built in under 24 hours, including UI design, backend logic, and testing. From idea to execution to results in one sitting.
Try It Yourself!
The project is fully open-source:
GitHub Repository
Requires LM Studio installed and a model downloaded locally.
Clone the repo, run streamlit run main.py, and you're good to go!
Wrap-Up
This project proved how powerful and accessible local LLMs have become. In a world full of cloud-heavy tools, sometimes all you need is a smart offline assistant that just works.
If you found this useful, feel free to star the repo and share it around!