We all know the feeling.
Exams are approaching, notes are scattered across PDFs, handwritten notebooks, lecture slides, and screenshots, and tools like ChatGPT, Gemini, and NotebookLM suddenly become indispensable.
I was using these tools extensively during my own exam preparation when a different question started bothering me:
How are these systems actually built?
Not from a user's perspective.
From an engineer's perspective.
How does an uploaded PDF become searchable?
How does an AI know which paragraph from a 200-page textbook contains the answer?
How does NotebookLM generate responses grounded in your notes instead of hallucinating information?
And perhaps the most practical question:
Could I build something similar that continues working when the internet doesn't?
Living in a PG (paying-guest accommodation) with unreliable Wi-Fi made that challenge particularly interesting.
That curiosity eventually became NotesGPT: a hybrid cloud and local AI study companion that processes PDFs and handwritten notes, generates revision material, flashcards, and mock exams, and answers questions using Retrieval-Augmented Generation (RAG).
The Problem
Most AI-powered study tools today are heavily dependent on cloud infrastructure.
The moment your internet becomes unstable:
- Uploads fail
- Responses slow down
- Features become unusable
- Productivity drops
For students, this often happens at the worst possible moment.
I wanted to explore a different approach:
Instead of choosing between cloud and local AI, why not support both?
Project Goals
The project had four major goals:
1. Document Understanding
Accept:
- PDFs
- Lecture notes
- Handwritten notes
- Scanned textbooks
and convert them into searchable knowledge.
2. Context-Grounded Answers
Prevent generic LLM responses.
Answers should come from the uploaded material itself.
3. Offline Capability
Allow the system to continue functioning without cloud access.
4. Multiple Study Outputs
Generate:
- Revision notes
- Flashcards
- Question banks
- Mock examinations
- Interactive Q&A
from the same knowledge source.
High-Level Architecture
Documents
│
▼
Text Extraction
(PDF.js / OCR)
│
▼
Chunking
│
▼
Embeddings
│
▼
Vector Storage
│
▼
Similarity Search
│
▼
Retrieved Context
│
▼
LLM Generation
│
▼
Notes / Flashcards / Chat / Exams
The architecture follows a classic Retrieval-Augmented Generation pipeline, but with support for both cloud and local execution.
Why RAG Instead of Just Sending the PDF to an LLM?
One common beginner approach is:
Upload PDF
↓
Send PDF to LLM
↓
Get Response
This works for small documents.
It breaks down quickly when:
- Documents become large
- Token costs increase
- Context windows are exceeded
- Answer quality degrades as the relevant passage gets buried in a long prompt
Instead, NotesGPT uses Retrieval-Augmented Generation.
The workflow:
- Extract text
- Split into chunks
- Generate embeddings
- Store embeddings
- Retrieve relevant chunks
- Generate answers using retrieved context
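The "split into chunks" step can be sketched as a sliding window over the extracted text. This is a minimal illustration, not the NotesGPT implementation: the function name `chunkText`, the character-based window, and the default sizes are all assumptions chosen for readability.

```typescript
// One slice of the source document, with its position for later citation.
interface Chunk {
  index: number;
  start: number; // character offset into the original text
  text: string;
}

// Split text into overlapping fixed-size character windows.
// chunkSize and overlap are illustrative defaults, not tuned values.
function chunkText(text: string, chunkSize = 800, overlap = 100): Chunk[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("need chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const chunks: Chunk[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push({
      index: chunks.length,
      start,
      text: text.slice(start, start + chunkSize),
    });
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk. Splitting on paragraphs or headings instead of raw character counts is a common refinement.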
This provides:
- Lower token usage
- Better accuracy
- Faster responses
- Grounded answers
- Source traceability
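Grounding and source traceability mostly come down to how the final prompt is assembled. The sketch below shows one way to do it, assuming each retrieved chunk carries a `source` label; the `RetrievedChunk` type, the label format, and the prompt wording are hypothetical, not taken from the NotesGPT codebase.

```typescript
interface RetrievedChunk {
  source: string; // hypothetical label, e.g. file name plus page
  text: string;
}

// Build a prompt that restricts the model to numbered context passages
// and asks it to cite them, so each claim can be traced back to a source.
function buildGroundedPrompt(question: string, chunks: RetrievedChunk[]): string {
  const context = chunks
    .map((c, i) => `[${i + 1}] (${c.source})\n${c.text}`)
    .join("\n\n");
  return [
    "Answer the question using only the numbered context below.",
    "Cite the passages you used as [n].",
    "If the context does not contain the answer, say so.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

Because only the retrieved passages are sent, the prompt stays small regardless of how long the original document is.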
Building the Offline Layer
This became the most interesting part of the project.
Most AI applications support a single inference engine.
I wanted flexibility.
NotesGPT currently supports three local execution modes.
Ollama
For users with stronger hardware.
Benefits:
- Full local privacy
- Better model quality
- No cloud dependency
Example models:
deepseek-r1:8b
gemma2:2b
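For reference, talking to a local Ollama server is a plain HTTP call. The sketch below builds a request for Ollama's `/api/generate` endpoint and reads the non-streaming reply. The helper names are hypothetical, the default port `11434` is assumed, and how NotesGPT actually wires this up may differ.

```typescript
// Ollama's default local endpoint; adjust host and port for your setup.
const OLLAMA_URL = "http://localhost:11434/api/generate";

interface OllamaRequest {
  url: string;
  init: { method: "POST"; headers: { "Content-Type": string }; body: string };
}

// Build the URL and fetch options for a single, non-streaming generation.
function buildOllamaRequest(model: string, prompt: string): OllamaRequest {
  return {
    url: OLLAMA_URL,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // stream: false asks for one JSON object instead of a token stream.
      body: JSON.stringify({ model, prompt, stream: false }),
    },
  };
}

// Pull the generated text out of a non-streaming /api/generate reply.
function readOllamaReply(json: unknown): string {
  if (typeof json === "object" && json !== null && "response" in json) {
    const value = (json as { response: unknown }).response;
    if (typeof value === "string") return value;
  }
  throw new Error("unexpected Ollama response shape");
}

// Usage sketch (not executed here):
//   const { url, init } = buildOllamaRequest("gemma2:2b", prompt);
//   const reply = readOllamaReply(await (await fetch(url, init)).json());
```

Keeping request construction separate from the network call makes the logic easy to test without a running model.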
WebLLM
This was fascinating.
WebLLM allows LLMs to run entirely inside the browser using WebGPU.
No external application.
No backend.
No cloud calls.
Just:
Browser + WebGPU + Local Model = Offline AI
This makes deployment dramatically simpler.
Gemini Nano (window.ai)
Modern browsers are slowly introducing built-in AI capabilities.
Supporting Gemini Nano was an experiment in understanding what local browser-native AI could look like in the future.
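With a cloud model and three local modes available, something has to decide which one handles a request. The sketch below ranks backends from detected capabilities. The `Capabilities` flags, the backend names, and the preference order are assumptions made for illustration; the post does not describe the actual selection logic.

```typescript
type Backend = "cloud" | "ollama" | "webllm" | "gemini-nano";

// Capability flags an app might detect at startup (hypothetical names).
interface Capabilities {
  online: boolean;          // e.g. navigator.onLine plus a health check
  ollamaReachable: boolean; // a local Ollama server answered a ping
  hasWebGPU: boolean;       // e.g. "gpu" in navigator
  hasBuiltInAI: boolean;    // the browser exposes a built-in model
}

// Return the backends worth trying, most preferred first.
// An empty result means no inference path is currently available.
function rankBackends(caps: Capabilities): Backend[] {
  const order: Backend[] = [];
  if (caps.online) order.push("cloud");
  if (caps.ollamaReachable) order.push("ollama");
  if (caps.hasWebGPU) order.push("webllm");
  if (caps.hasBuiltInAI) order.push("gemini-nano");
  return order;
}
```

A caller would try each backend in turn and move to the next on failure, which is what lets the app keep working when the connection drops mid-session.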
OCR Pipeline
Students don't only upload PDFs.
They upload:
- Notebook photos
- Whiteboard images
- Scanned assignments
Supporting these required OCR.
I implemented two OCR paths.
Local OCR
Using:
Tesseract.js
Benefits:
- Privacy
- Offline support
- Zero API cost
Tradeoff:
- Lower accuracy
Cloud OCR
Using:
Gemini Vision
Benefits:
- Higher accuracy
- Better handwriting recognition
Tradeoff:
- Requires internet
This dual-mode approach gave users flexibility depending on their situation.
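One way to combine the two paths is to run local OCR first and escalate to the cloud only when the result looks weak and a connection is available. This policy is an assumption for illustration, not something the project is confirmed to do; the `decideOcrPath` name and the threshold of 70 are hypothetical, and the confidence value is assumed to be on a 0 to 100 scale.

```typescript
interface OcrResult {
  text: string;
  confidence: number; // assumed 0 to 100, higher is better
}

type OcrDecision = "use-local" | "retry-with-cloud";

// Decide whether a local OCR result is good enough to keep.
// minConfidence is an illustrative threshold, not a tuned value.
function decideOcrPath(
  local: OcrResult,
  online: boolean,
  minConfidence = 70
): OcrDecision {
  const looksWeak =
    local.confidence < minConfidence || local.text.trim().length === 0;
  // Offline, the local result is the only option, however weak it is.
  return looksWeak && online ? "retry-with-cloud" : "use-local";
}
```

This keeps clean scans fully private and free, and spends API calls only on the handwriting that local OCR struggles with.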
One Optimization That Reduced Latency by Over 70%
The original study-kit generation pipeline looked something like this:
Generate Notes
↓
Wait
↓
Generate Flashcards
↓
Wait
↓
Generate Questions
↓
Wait
↓
Generate Mock Exam
This required at least four sequential LLM calls, each waiting for the previous one to finish.
Consequences:
- Slow generation
- Increased token usage
- Higher failure probability
- API rate limits
I redesigned the workflow into a single structured generation request that returns the notes, flashcards, questions, and mock exam in one response.
Results:
| Metric | Before | After |
|---|---|---|
| Generation Time | ~60 sec | <15 sec |
| API Calls | 4+ | 1 |
| Token Usage | High | Reduced |
| User Experience | Slow | Fast |
The lesson:
System architecture often matters more than model selection.
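A single structured request usually means asking the model for one JSON object and validating it before use. The sketch below shows that shape under stated assumptions: the `StudyKit` fields, the prompt wording, and the function names are hypothetical, and the real schema in NotesGPT may look different.

```typescript
interface Flashcard {
  front: string;
  back: string;
}

interface StudyKit {
  notes: string;
  flashcards: Flashcard[];
  questions: string[];
  mockExam: string[];
}

// One prompt that asks for every study output at once.
function buildStudyKitPrompt(context: string): string {
  return [
    "Using only the material below, return a single JSON object with the keys:",
    '"notes" (string), "flashcards" (array of {"front", "back"}),',
    '"questions" (array of strings), "mockExam" (array of strings).',
    "Return JSON only, with no surrounding prose.",
    "",
    context,
  ].join("\n");
}

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((x) => typeof x === "string");
}

function isFlashcard(v: unknown): v is Flashcard {
  if (typeof v !== "object" || v === null) return false;
  const c = v as Record<string, unknown>;
  return typeof c["front"] === "string" && typeof c["back"] === "string";
}

// Parse and validate the model's reply; throws on malformed output
// so the caller can retry instead of rendering a broken study kit.
function parseStudyKit(raw: string): StudyKit {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("study kit must be a JSON object");
  }
  const d = data as Record<string, unknown>;
  const notes = d["notes"];
  const flashcards = d["flashcards"];
  const questions = d["questions"];
  const mockExam = d["mockExam"];
  if (typeof notes !== "string") throw new Error('"notes" must be a string');
  if (!Array.isArray(flashcards) || !flashcards.every(isFlashcard)) {
    throw new Error('"flashcards" must be an array of {front, back}');
  }
  if (!isStringArray(questions)) throw new Error('"questions" must be strings');
  if (!isStringArray(mockExam)) throw new Error('"mockExam" must be strings');
  return { notes, flashcards, questions, mockExam };
}
```

The trade-off is that one malformed response now costs the whole kit rather than one section, which is why the validation step matters more here than in the four-call version.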
Optimizing Vector Search
Another challenge appeared during retrieval.
The naive approach:
Fetch everything
Compute similarity
Return results
This quickly becomes inefficient.
Instead:
- Fetch only embeddings and lightweight metadata, not the full chunk text
- Compute similarity in memory
- Retrieve the full content of only the top-ranked chunks
Benefits:
- Lower bandwidth usage
- Faster retrieval
- Reduced database reads
- Better scalability
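The in-memory ranking step is small enough to show in full. The sketch below scores chunks by cosine similarity and keeps the top `k`; the type and function names are hypothetical, and it assumes embeddings are plain number arrays of equal length.

```typescript
interface IndexedChunk {
  id: string;
  embedding: number[];
}

interface ScoredChunk {
  id: string;
  score: number;
}

// Cosine similarity: dot product divided by the product of the norms.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // treat zero vectors as unrelated
}

// Score every chunk against the query and keep the k best matches.
function topK(query: number[], chunks: IndexedChunk[], k: number): ScoredChunk[] {
  return chunks
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Only ids and scores come back, so a caller can load the full text for just those few chunks. A linear scan like this is fine for one student's notes; much larger collections would typically need an approximate nearest-neighbour index.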
Tech Stack
Frontend
- Next.js 16
- React 19
- Tailwind CSS 4
- Framer Motion
AI
- Gemini 2.0 Flash
- Ollama
- WebLLM
- Gemini Nano
Storage
- Firestore Vector Collections
- IndexedDB
- TF-IDF Local Search
OCR
- Tesseract.js
- Gemini Vision
Authentication
- Firebase Authentication
What I Learned
Before building this project, I assumed AI applications were mostly about prompts and models.
After building it, I realized the opposite.
The hardest parts were:
- Retrieval quality
- Latency optimization
- Storage architecture
- Offline execution
- OCR reliability
- Error handling
- Cost efficiency
The LLM itself was only one component.
Everything around the model turned out to be equally important.
Future Improvements
A few areas I would like to explore next:
- Hybrid vector search
- Incremental indexing
- Better citation grounding
- Multi-document reasoning
- Voice-based study sessions
- Mobile-first offline deployment
- On-device embedding generation
Final Thoughts
I originally started this project because I was curious about how tools like NotebookLM worked behind the scenes.
What began as an experiment eventually became one of the most educational engineering projects I've built.
It taught me far more about AI systems, retrieval pipelines, optimization, and software architecture than simply consuming AI tools ever could.
If you're interested in AI engineering, RAG systems, local LLMs, or offline-first applications, I'd love to hear your thoughts.
GitHub Repository: https://github.com/di0206-innovator/Notes-GPT