Divyanshu Sinha

Posted on Jun 10

Building NotesGPT: An Offline-Capable AI Study Assistant with RAG, Local LLMs, and WebGPU

#ai #rag #opensource #machinelearning

We all know the feeling.

Exams are approaching, notes are scattered across PDFs, handwritten notebooks, lecture slides, and screenshots, and tools like ChatGPT, Gemini, and NotebookLM suddenly become indispensable.

I was using these tools extensively during my own exam preparation when a different question started bothering me:

How are these systems actually built?

Not from a user's perspective.

From an engineer's perspective.

How does an uploaded PDF become searchable?

How does an AI know which paragraph from a 200-page textbook contains the answer?

How does NotebookLM generate responses grounded in your notes instead of hallucinating information?

And perhaps the most practical question:

Could I build something similar that continues working when the internet doesn't?

Living in a PG with unreliable Wi-Fi made that challenge particularly interesting.

That curiosity eventually became NotesGPT.

It is a hybrid cloud and local AI study companion that processes PDFs and handwritten notes, generates revision material, creates flashcards and mock exams, and answers questions using Retrieval-Augmented Generation (RAG).

The Problem

Most AI-powered study tools today are heavily dependent on cloud infrastructure.

The moment your internet becomes unstable:

Uploads fail
Responses slow down
Features become unusable
Productivity drops

For students, this often happens at the worst possible moment.

I wanted to explore a different approach:

Instead of choosing between cloud and local AI, why not support both?

Project Goals

The project had four major goals:

1. Document Understanding

Accept:

PDFs
Lecture notes
Handwritten notes
Scanned textbooks

and convert them into searchable knowledge.

2. Context-Grounded Answers

Prevent generic LLM responses.
Answers should come from the uploaded material itself.

3. Offline Capability

Allow the system to continue functioning without cloud access.

4. Multiple Study Outputs

Generate:

Revision notes
Flashcards
Question banks
Mock examinations
Interactive Q&A

from the same knowledge source.

High-Level Architecture

Documents
      │
      ▼
Text Extraction
(PDF.js / OCR)
      │
      ▼
Chunking
      │
      ▼
Embeddings
      │
      ▼
Vector Storage
      │
      ▼
Similarity Search
      │
      ▼
Retrieved Context
      │
      ▼
LLM Generation
      │
      ▼
Notes / Flashcards / Chat / Exams

The architecture follows a classic Retrieval-Augmented Generation pipeline, but with support for both cloud and local execution.

Why RAG Instead of Just Sending the PDF to an LLM?

One common beginner approach is:

Upload PDF
↓
Send PDF to LLM
↓
Get Response

This works for small documents.

It breaks down quickly when:

Documents become large
Token costs increase
Context windows are exceeded
Answer quality degrades as the relevant passage gets buried in a long context

Instead, NotesGPT uses Retrieval-Augmented Generation.

The workflow:

Extract text
Split into chunks
Generate embeddings
Store embeddings
Retrieve relevant chunks
Generate answers using retrieved context

This provides:

Lower token usage
Better accuracy
Faster responses
Grounded answers
Source traceability
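
The "split into chunks" and "generate answers using retrieved context" steps can be sketched as follows. This is a minimal illustration and not the actual NotesGPT code: the `Chunk` shape, the function names, the character-based window, and the default sizes are all assumptions chosen for readability.

```typescript
// Hypothetical chunk shape; field names are illustrative, not taken from NotesGPT.
interface Chunk {
  id: number;
  text: string;
}

// Split text into fixed-size character windows that overlap, so a sentence
// cut at one boundary still appears whole in the neighbouring chunk.
function chunkText(text: string, size = 800, overlap = 100): Chunk[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("size must be positive and overlap must be in [0, size)");
  }
  const chunks: Chunk[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push({ id: chunks.length, text: text.slice(start, start + size) });
    if (start + size >= text.length) break; // this window already reached the end
  }
  return chunks;
}

// Build a prompt that restricts the model to the retrieved chunks and asks it
// to cite chunk ids, which is what makes an answer traceable to its source.
function buildGroundedPrompt(question: string, retrieved: Chunk[]): string {
  const context = retrieved.map((c) => `[${c.id}] ${c.text}`).join("\n\n");
  return [
    "Answer using only the context below. Cite chunk ids in square brackets.",
    "If the context does not contain the answer, say so.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

Only the handful of retrieved chunks is sent to the model, which is where the lower token usage comes from. The bracketed ids give the answer something to point back to.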

Building the Offline Layer

This became the most interesting part of the project.

Most AI applications support a single inference engine.

I wanted flexibility.

NotesGPT currently supports three different local execution modes.

Ollama

For users with stronger hardware.

Benefits:

Full local privacy
Better model quality
No cloud dependency

Example models:

deepseek-r1:8b
gemma2:2b
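
For illustration, here is a minimal sketch of calling a locally running Ollama server over its HTTP API (`POST /api/generate` on the default port 11434). It is not taken from the NotesGPT codebase: the function names and the `FetchLike` type are hypothetical, and the fetch function is passed in so the sketch can be exercised without a running server.

```typescript
// Minimal fetch-like signature (an assumption made for testability);
// the global fetch in browsers and Node 18+ satisfies it.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

interface OllamaRequest {
  model: string;
  prompt: string;
  stream: boolean;
}

// stream: false asks Ollama for one JSON object instead of a token stream.
function buildOllamaRequest(prompt: string, model = "gemma2:2b"): OllamaRequest {
  return { model, prompt, stream: false };
}

async function askOllama(
  fetchImpl: FetchLike,
  prompt: string,
  model = "gemma2:2b",
  baseUrl = "http://localhost:11434"
): Promise<string> {
  const res = await fetchImpl(`${baseUrl}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildOllamaRequest(prompt, model)),
  });
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  // The non-streaming reply carries the generated text in the "response" field.
  const data = (await res.json()) as { response?: string };
  if (typeof data.response !== "string") {
    throw new Error("Unexpected Ollama response shape");
  }
  return data.response;
}
```

In an app this would be called as `askOllama(fetch, prompt, "deepseek-r1:8b")`. Depending on the setup, a browser page may also need the Ollama server configured to accept requests from its origin.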

WebLLM

This was fascinating.

WebLLM allows LLMs to run entirely inside the browser using WebGPU.

No external application.
No backend.
No cloud calls.

Just:

Browser
+
WebGPU
+
Local Model
=
Offline AI

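This makes deployment simpler: there is nothing to install beyond a WebGPU-capable browser, although the model weights still have to be downloaded once before the app can run offline.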

Gemini Nano (window.ai)

Modern browsers are slowly introducing built-in AI capabilities.
Supporting Gemini Nano was an experiment in understanding what local browser-native AI could look like in the future.
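
With a cloud model and three local modes available, something has to decide which one handles a request. The sketch below shows one possible selection rule. It is an assumption, not the logic NotesGPT actually uses: the `Capabilities` fields, the `pickBackend` name, and the preference order (Ollama, then WebLLM, then Gemini Nano) are all hypothetical. Capability detection is reduced to booleans because the browser's built-in AI API has changed shape across releases.

```typescript
type Backend = "cloud" | "ollama" | "webllm" | "gemini-nano" | "none";

// Hypothetical capability snapshot. A real app would fill it in by probing
// connectivity, the local Ollama port, WebGPU support, and the built-in model.
interface Capabilities {
  online: boolean;
  ollamaReachable: boolean;
  hasWebGPU: boolean;
  hasBuiltInModel: boolean;
}

// Assumed preference among local modes: Ollama first (larger models on
// stronger hardware), then WebLLM, then the browser's built-in model.
function pickBackend(caps: Capabilities, preferLocal = false): Backend {
  const local: Backend = caps.ollamaReachable
    ? "ollama"
    : caps.hasWebGPU
    ? "webllm"
    : caps.hasBuiltInModel
    ? "gemini-nano"
    : "none";

  if (preferLocal && local !== "none") return local; // privacy-first users
  if (caps.online) return "cloud";
  return local; // offline: use whatever local mode exists, or "none"
}
```

Keeping the decision in one pure function makes the fallback order easy to test and to change later.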

OCR Pipeline

Students don't only upload PDFs.

They upload:

Notebook photos
Whiteboard images
Scanned assignments

Supporting these required OCR.

I implemented two OCR paths.

Local OCR

Using:

Tesseract.js

Benefits:

Privacy
Offline support
Zero API cost

Tradeoff:

Lower accuracy, especially on handwriting

Cloud OCR

Using:

Gemini Vision

Benefits:

Higher accuracy
Better handwriting recognition

Tradeoff:

Requires internet

This dual-mode approach gave users flexibility depending on their situation.
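
One way to wire the two paths together is to try the cloud path when online and fall back to the local one on failure. This is a hedged sketch rather than the project's real code: `runOcr`, `OcrFn`, and `OcrResult` are hypothetical names, and the two OCR functions are injected so that, for example, a Tesseract.js wrapper and a Gemini Vision wrapper could be plugged in.

```typescript
// Any function that turns image bytes into text. Real implementations would
// wrap Tesseract.js (local) or a Gemini Vision request (cloud).
type OcrFn = (image: Uint8Array) => Promise<string>;

interface OcrResult {
  text: string;
  source: "cloud" | "local";
}

async function runOcr(
  image: Uint8Array,
  online: boolean,
  cloudOcr: OcrFn,
  localOcr: OcrFn
): Promise<OcrResult> {
  if (online) {
    try {
      const text = await cloudOcr(image);
      // Treat an empty result as a failure so the local path gets a chance.
      if (text.trim().length > 0) return { text, source: "cloud" };
    } catch {
      // Network or API failure: fall through to local OCR.
    }
  }
  return { text: await localOcr(image), source: "local" };
}
```

Returning the `source` alongside the text lets the UI tell the user when a result came from the lower-accuracy local path.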

One Optimization That Reduced Latency by Over 70%

The original study-kit generation pipeline looked something like this:

Generate Notes
     ↓
Wait
     ↓
Generate Flashcards
     ↓
Wait
     ↓
Generate Questions
     ↓
Wait
     ↓
Generate Mock Exam

This required at least four sequential LLM calls, each one waiting for the previous response.

Consequences:

Slow generation
Increased token usage
Higher failure probability
API rate limits

I redesigned the workflow into a single structured generation request.

Results:

Metric	Before	After
Generation Time	~60 sec	<15 sec
API Calls	4+	1
Token Usage	High	Reduced
User Experience	Slow	Fast

The lesson:

System architecture often matters more than model selection.
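
A single structured generation request could look roughly like this: one prompt that asks for every output as a single JSON object, and a parser that validates the reply. The `StudyKit` shape, key names, and prompt wording are assumptions for illustration, not the schema NotesGPT actually uses.

```typescript
interface Flashcard {
  front: string;
  back: string;
}

// Hypothetical combined output: everything the four separate calls produced.
interface StudyKit {
  notes: string;
  flashcards: Flashcard[];
  questions: string[];
  mockExam: string[];
}

function buildStudyKitPrompt(context: string): string {
  return [
    "From the study material below, produce ONE JSON object with exactly these keys:",
    '"notes" (string), "flashcards" (array of {"front", "back"}),',
    '"questions" (array of strings), "mockExam" (array of strings).',
    "Return JSON only, with no surrounding prose.",
    "",
    "Material:",
    context,
  ].join("\n");
}

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((x) => typeof x === "string");
}

function parseStudyKit(raw: string): StudyKit {
  // Models sometimes wrap JSON in extra text, so keep only the outermost braces.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) throw new Error("No JSON object found in model output");

  const data = JSON.parse(raw.slice(start, end + 1)) as Record<string, unknown>;
  const { notes, flashcards, questions, mockExam } = data;
  const cardsOk =
    Array.isArray(flashcards) &&
    flashcards.every(
      (c) => typeof c === "object" && c !== null &&
        typeof c.front === "string" && typeof c.back === "string"
    );
  if (typeof notes !== "string" || !cardsOk || !isStringArray(questions) || !isStringArray(mockExam)) {
    throw new Error("Model output does not match the study-kit shape");
  }
  return { notes, flashcards: flashcards as Flashcard[], questions, mockExam };
}
```

The validation step matters because a single call is also a single point of failure: if the reply is malformed, the app can retry once instead of rendering a half-built study kit.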

Optimizing Vector Search

Another challenge appeared during retrieval.

The naive approach:

Fetch every stored chunk, full text included
Compute similarity
Return results

This quickly becomes inefficient as the number of chunks grows.

Instead:

Fetch only embeddings and metadata
Compute similarity in memory
Retrieve full text only for the top-ranked chunks

Benefits:

Lower bandwidth usage
Faster retrieval
Reduced database reads
Better scalability
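
The in-memory ranking step can be sketched with plain cosine similarity over lightweight records. This is an illustrative sketch under stated assumptions: `EmbeddingRecord`, `cosineSimilarity`, and `topK` are hypothetical names, and the record is assumed to hold only the embedding and ids, with the chunk text stored elsewhere.

```typescript
// Hypothetical lightweight record: embedding plus metadata, no chunk text.
interface EmbeddingRecord {
  chunkId: string;
  docId: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // a zero vector is similar to nothing
}

// Score every record in memory and keep the k best chunk ids.
function topK(
  query: number[],
  records: EmbeddingRecord[],
  k = 5
): { chunkId: string; score: number }[] {
  return records
    .map((r) => ({ chunkId: r.chunkId, score: cosineSimilarity(query, r.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Only the ids returned by `topK` then need their full text fetched, so the number of large reads is bounded by `k` rather than by the size of the collection.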

Tech Stack

Frontend

Next.js 16
React 19
Tailwind CSS 4
Framer Motion

AI

Gemini 2.0 Flash
Ollama
WebLLM
Gemini Nano

Storage

Firestore Vector Collections
IndexedDB
TF-IDF Local Search

OCR

Tesseract.js
Gemini Vision

Authentication

Firebase Authentication

What I Learned

Before building this project, I assumed AI applications were mostly about prompts and models.

After building it, I realized the opposite.

The hardest parts were:

Retrieval quality
Latency optimization
Storage architecture
Offline execution
OCR reliability
Error handling
Cost efficiency

The LLM itself was only one component.

Everything around the model turned out to be equally important.

Future Improvements

A few areas I would like to explore next:

Hybrid vector search
Incremental indexing
Better citation grounding
Multi-document reasoning
Voice-based study sessions
Mobile-first offline deployment
On-device embedding generation

Final Thoughts

I originally started this project because I was curious about how tools like NotebookLM worked behind the scenes.

What began as an experiment eventually became one of the most educational engineering projects I've built.

It taught me far more about AI systems, retrieval pipelines, optimization, and software architecture than simply consuming AI tools ever could.

If you're interested in AI engineering, RAG systems, local LLMs, or offline-first applications, I'd love to hear your thoughts.

GitHub Repository: https://github.com/di0206-innovator/Notes-GPT
