DEV Community

Divyanshu Sinha

Building NotesGPT: An Offline-Capable AI Study Assistant with RAG, Local LLMs, and WebGPU

We all know the feeling.

Exams are approaching, notes are scattered across PDFs, handwritten notebooks, lecture slides, and screenshots, and tools like ChatGPT, Gemini, and NotebookLM suddenly become indispensable.

I was using these tools extensively during my own exam preparation when a different question started bothering me:

How are these systems actually built?

Not from a user's perspective.

From an engineer's perspective.

How does an uploaded PDF become searchable?

How does an AI know which paragraph from a 200-page textbook contains the answer?

How does NotebookLM generate responses grounded in your notes instead of hallucinating information?

And perhaps the most practical question:

Could I build something similar that continues working when the internet doesn't?

Living in a PG with unreliable Wi-Fi made that challenge particularly interesting.

That curiosity eventually became NotesGPT.

NotesGPT is a hybrid cloud and local AI study companion. It processes PDFs and handwritten notes, generates revision material, flashcards, and mock exams, and answers questions using Retrieval-Augmented Generation (RAG).


The Problem

Most AI-powered study tools today are heavily dependent on cloud infrastructure.

The moment your internet becomes unstable:

  • Uploads fail
  • Responses slow down
  • Features become unusable
  • Productivity drops

For students, this often happens at the worst possible moment.

I wanted to explore a different approach:

Instead of choosing between cloud and local AI, why not support both?

Project Goals

The project had four major goals:

1. Document Understanding

Accept:

  • PDFs
  • Lecture notes
  • Handwritten notes
  • Scanned textbooks

and convert them into searchable knowledge.

2. Context-Grounded Answers

Prevent generic LLM responses.
Answers should come from the uploaded material itself.

3. Offline Capability

Allow the system to continue functioning without cloud access.

4. Multiple Study Outputs

Generate:

  • Revision notes
  • Flashcards
  • Question banks
  • Mock examinations
  • Interactive Q&A

from the same knowledge source.


High-Level Architecture

Documents
      │
      ▼
Text Extraction
(PDF.js / OCR)
      │
      ▼
Chunking
      │
      ▼
Embeddings
      │
      ▼
Vector Storage
      │
      ▼
Similarity Search
      │
      ▼
Retrieved Context
      │
      ▼
LLM Generation
      │
      ▼
Notes / Flashcards / Chat / Exams

The architecture follows a classic Retrieval-Augmented Generation pipeline, but with support for both cloud and local execution.
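
To make the query-time half of the diagram concrete, here is a minimal sketch, assuming the embedding, search, and generation stages are injected as functions. The names `RagStages`, `buildGroundedPrompt`, and `answerQuestion` are hypothetical and not taken from the NotesGPT repository; the point is only that the same flow can sit on top of a cloud model or a local one.

```typescript
// Hypothetical types: none of these names come from the NotesGPT codebase.
interface RetrievedChunk {
  id: string;    // identifier used for source citations
  text: string;  // chunk content extracted from the document
  score: number; // similarity score from the search step
}

// The three query-time stages are injected so the same flow can run
// against a cloud model or a local one.
interface RagStages {
  embed(text: string): Promise<number[]>;
  search(queryEmbedding: number[], topK: number): Promise<RetrievedChunk[]>;
  generate(prompt: string): Promise<string>;
}

// Build a prompt that restricts the model to the retrieved context.
function buildGroundedPrompt(question: string, chunks: RetrievedChunk[]): string {
  const context = chunks
    .map((chunk, i) => `[${i + 1}] (source: ${chunk.id})\n${chunk.text}`)
    .join("\n\n");
  return [
    "Answer using only the numbered context below.",
    "Cite sources by number. If the context is insufficient, say so.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}

// Query-time flow: embed the question, retrieve chunks, generate an answer.
async function answerQuestion(
  question: string,
  stages: RagStages,
  topK = 4, // illustrative default
): Promise<{ answer: string; sources: string[] }> {
  const queryEmbedding = await stages.embed(question);
  const chunks = await stages.search(queryEmbedding, topK);
  const answer = await stages.generate(buildGroundedPrompt(question, chunks));
  return { answer, sources: chunks.map((c) => c.id) };
}
```

Returning the chunk IDs next to the answer is one simple way to keep responses traceable to the uploaded material.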

Why RAG Instead of Just Sending the PDF to an LLM?

One common beginner approach is:

Upload PDF
↓
Send PDF to LLM
↓
Get Response

This works for small documents.

It breaks down quickly when:

  • Documents become large
  • Token costs increase
  • Context windows are exceeded
  • Answer quality degrades as relevant details get buried in a long context

Instead, NotesGPT uses Retrieval-Augmented Generation.

The workflow:

  1. Extract text
  2. Split into chunks
  3. Generate embeddings
  4. Store embeddings
  5. Retrieve relevant chunks
  6. Generate answers using retrieved context

This provides:

  • Lower token usage
  • Better accuracy
  • Faster responses
  • Grounded answers
  • Source traceability
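
Step 2 of the workflow (splitting into chunks) is easy to get subtly wrong, so here is a minimal sketch of fixed-size chunking with overlap. This is one common strategy, not necessarily the one NotesGPT uses; `chunkText` and its default sizes are assumptions chosen for illustration.

```typescript
interface Chunk {
  index: number;
  text: string;
}

// Split text into fixed-size character windows that overlap, so text near
// a boundary appears in both neighboring chunks.
// chunkSize and overlap are illustrative defaults, not NotesGPT's settings.
function chunkText(text: string, chunkSize = 800, overlap = 100): Chunk[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const chunks: Chunk[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push({ index: chunks.length, text: text.slice(start, start + chunkSize) });
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

Each chunk is then embedded and stored on its own, which is what keeps token usage low at question time: only a handful of chunks reach the model instead of the whole document.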

Building the Offline Layer

This became the most interesting part of the project.

Most AI applications support a single inference engine.

I wanted flexibility.

NotesGPT currently supports three different local execution modes.
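
With several engines available, something has to decide which one handles a request. The sketch below shows one possible policy, assuming the app has already probed what is reachable. The `Capabilities` shape, the `chooseEngine` name, and the priority order are all assumptions for illustration, not NotesGPT's actual logic.

```typescript
type Engine = "gemini-cloud" | "ollama" | "webllm" | "gemini-nano";

// Hypothetical capability snapshot, gathered by probes elsewhere in the app.
interface Capabilities {
  online: boolean;          // cloud API reachable
  ollamaReachable: boolean; // local Ollama server responded
  webgpu: boolean;          // browser exposes WebGPU
  nanoAvailable: boolean;   // built-in browser model is ready
}

// Pick an engine. The priority order here is an assumption for illustration.
function chooseEngine(caps: Capabilities, preferLocal = false): Engine | null {
  const local: Array<[boolean, Engine]> = [
    [caps.ollamaReachable, "ollama"],
    [caps.webgpu, "webllm"],
    [caps.nanoAvailable, "gemini-nano"],
  ];
  const firstLocal = local.find(([available]) => available)?.[1] ?? null;
  if (preferLocal && firstLocal) return firstLocal; // privacy-first mode
  if (caps.online) return "gemini-cloud";
  return firstLocal; // offline: fall back to whatever runs locally, or null
}
```

Returning `null` when nothing is available lets the UI show a clear "no engine available" state instead of failing mid-request.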

Ollama

For users with stronger hardware.

Benefits:

  • Full local privacy
  • Better model quality
  • No cloud dependency

Example models:

deepseek-r1:8b
gemma2:2b
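
For reference, talking to a local Ollama server is a single HTTP call. The sketch below uses Ollama's `/api/generate` endpoint on its default port with streaming turned off. The helper names `buildOllamaRequest` and `askOllama`, and the injected `fetchFn` parameter, are hypothetical and exist only to keep the example self-contained.

```typescript
// Minimal fetch-compatible signature so the sketch needs no DOM or Node typings.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

interface OllamaRequest {
  model: string;
  prompt: string;
  stream: boolean;
}

function buildOllamaRequest(model: string, prompt: string): OllamaRequest {
  // stream: false asks Ollama for one JSON object instead of a token stream.
  return { model, prompt, stream: false };
}

async function askOllama(
  fetchFn: FetchLike,
  model: string,
  prompt: string,
  baseUrl = "http://localhost:11434", // Ollama's default local address
): Promise<string> {
  const res = await fetchFn(`${baseUrl}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildOllamaRequest(model, prompt)),
  });
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  // The generated text is returned in the "response" field.
  const data = (await res.json()) as { response?: string };
  return data.response ?? "";
}
```

In practice you would pass the global `fetch`, for example `askOllama(fetch, "gemma2:2b", prompt)`, with the model already pulled locally.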

WebLLM

This was fascinating.

WebLLM allows LLMs to run entirely inside the browser using WebGPU.

No external application.
No backend.
No cloud calls once the model weights have been downloaded and cached.

Just:

Browser
+
WebGPU
+
Local Model
=
Offline AI

This makes deployment dramatically simpler.
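
The catch is that WebGPU is not available everywhere, so it is worth checking before downloading a model. Here is a minimal detection sketch. The `NavigatorLike` type and the function names are assumptions made so the example compiles without WebGPU typings; the underlying calls (`navigator.gpu` and `requestAdapter()`) are the standard WebGPU entry points.

```typescript
// Minimal structural types so the sketch compiles without WebGPU typings.
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

// Cheap synchronous check: does the browser expose the WebGPU entry point?
function exposesWebGPU(nav: NavigatorLike | undefined): boolean {
  return nav !== undefined && nav.gpu !== undefined;
}

// Stronger check: requestAdapter() resolves to null when no suitable GPU exists.
async function canRunWebLLM(nav: NavigatorLike | undefined): Promise<boolean> {
  if (nav === undefined || nav.gpu === undefined) return false;
  try {
    return (await nav.gpu.requestAdapter()) !== null;
  } catch {
    return false; // treat any probing failure as "not supported"
  }
}
```

In a browser you would pass `navigator` (a cast may be needed depending on your TypeScript DOM typings) and only offer the WebLLM option when the check succeeds.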


Gemini Nano (window.ai)

Browsers such as Chrome are slowly introducing built-in AI capabilities.
Supporting Gemini Nano was an experiment in understanding what local browser-native AI could look like in the future.


OCR Pipeline

Students don't only upload PDFs.

They upload:

  • Notebook photos
  • Whiteboard images
  • Scanned assignments

Supporting these required OCR.

I implemented two OCR paths.

Local OCR

Using:

Tesseract.js

Benefits:

  • Privacy
  • Offline support
  • Zero API cost

Tradeoff:

  • Lower accuracy

Cloud OCR

Using:

Gemini Vision

Benefits:

  • Higher accuracy
  • Better handwriting recognition

Tradeoff:

  • Requires internet

This dual-mode approach gave users flexibility depending on their situation.
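
The two paths can also be combined automatically. The sketch below shows one possible policy: run local OCR first and escalate to the cloud only when the device is online and the local confidence is low. It assumes a 0-100 confidence score like the one Tesseract.js reports; the function names, the injected OCR adapters, and the threshold of 70 are hypothetical.

```typescript
interface OcrResult {
  text: string;
  confidence: number; // assumed 0-100 scale
}

// Hypothetical adapter signature: wrap the local and cloud OCR calls behind
// the same shape so the caller does not care which one ran.
type OcrFn<TImage> = (image: TImage) => Promise<OcrResult>;

// Decide whether a local result should be retried in the cloud.
// The threshold of 70 is an illustrative guess, not a tuned value.
function shouldEscalateToCloud(
  local: OcrResult,
  online: boolean,
  minConfidence = 70,
): boolean {
  return online && local.confidence < minConfidence;
}

async function extractText<TImage>(
  image: TImage,
  localOcr: OcrFn<TImage>,
  cloudOcr: OcrFn<TImage>,
  online: boolean,
): Promise<OcrResult> {
  const local = await localOcr(image);
  if (!shouldEscalateToCloud(local, online)) return local;
  try {
    return await cloudOcr(image);
  } catch {
    return local; // network failed mid-request: keep the offline result
  }
}
```

A privacy-focused user could still force the local path by passing `online` as `false`, which keeps the image on the device.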


One Optimization That Reduced Latency by Over 70%

The original study-kit generation pipeline looked something like this:

Generate Notes
     ↓
Wait
     ↓
Generate Flashcards
     ↓
Wait
     ↓
Generate Questions
     ↓
Wait
     ↓
Generate Mock Exam

This required at least four sequential LLM calls, each waiting for the previous one to finish.

Consequences:

  • Slow generation
  • Increased token usage
  • Higher failure probability
  • API rate limits

I redesigned the workflow into a single structured generation request.
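
A single structured request could look roughly like the sketch below: one prompt that asks for every study output as a JSON object, plus a validator for the reply. The `StudyKit` shape, the key names, and both function names are assumptions for illustration, not the schema NotesGPT actually uses.

```typescript
interface Flashcard {
  front: string;
  back: string;
}
interface StudyKit {
  notes: string;
  flashcards: Flashcard[];
  questions: string[];
  mockExam: string[];
}

// One prompt that requests every output at once (hypothetical schema).
function buildStudyKitPrompt(context: string): string {
  return [
    "From the study material below, produce a single JSON object with exactly these keys:",
    '"notes" (string), "flashcards" (array of {"front","back"}),',
    '"questions" (array of strings), "mockExam" (array of strings).',
    "Return JSON only, with no surrounding prose.",
    "",
    "Material:",
    context,
  ].join("\n");
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((v) => typeof v === "string");
}

// Parse and validate the model's reply. Throws if the shape is wrong so the
// caller can retry instead of rendering a half-built study kit.
function parseStudyKit(raw: string): StudyKit {
  const { notes, flashcards, questions, mockExam } = JSON.parse(raw) as Record<string, unknown>;
  const cardsOk =
    Array.isArray(flashcards) &&
    flashcards.every((c) => typeof c?.front === "string" && typeof c?.back === "string");
  if (typeof notes !== "string" || !cardsOk || !isStringArray(questions) || !isStringArray(mockExam)) {
    throw new Error("study kit response did not match the expected shape");
  }
  return { notes, flashcards: flashcards as Flashcard[], questions, mockExam };
}
```

Validating the reply matters more with one big request than with four small ones, because a single malformed response now affects every output at once.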

Results:

Metric            Before    After
Generation Time   ~60 sec   <15 sec
API Calls         4+        1
Token Usage       High      Reduced
User Experience   Slow      Fast

The lesson:

System architecture often matters more than model selection.


Optimizing Vector Search

Another challenge appeared during retrieval.

The naive approach:

Fetch everything
Compute similarity
Return results

This quickly becomes inefficient.

Instead:

  1. Fetch only embeddings and lightweight metadata, not the full chunk text
  2. Compute similarity in memory
  3. Retrieve the full text of only the top-ranked chunks

Benefits:

  • Lower bandwidth usage
  • Faster retrieval
  • Reduced database reads
  • Better scalability
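
The in-memory ranking step can be sketched in a few lines, assuming cosine similarity as the metric (the post does not say which metric NotesGPT uses). `IndexedChunk`, `cosineSimilarity`, and `topK` are hypothetical names; the function returns only IDs and scores, so the full chunk text can be fetched afterwards for just those IDs.

```typescript
interface IndexedChunk {
  id: string;
  embedding: number[];
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // zero vectors score 0 instead of NaN
}

// Score every chunk against the query and keep the k best, highest first.
function topK(
  query: number[],
  index: IndexedChunk[],
  k: number,
): Array<{ id: string; score: number }> {
  return index
    .map((chunk) => ({ id: chunk.id, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

This is a brute-force scan, which is fine for a student's notes; a much larger corpus would call for an approximate nearest-neighbor index instead.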

Tech Stack

Frontend

  • Next.js 16
  • React 19
  • Tailwind CSS 4
  • Framer Motion

AI

  • Gemini 2.0 Flash
  • Ollama
  • WebLLM
  • Gemini Nano

Storage

  • Firestore Vector Collections
  • IndexedDB
  • TF-IDF Local Search

OCR

  • Tesseract.js
  • Gemini Vision

Authentication

  • Firebase Authentication

What I Learned

Before building this project, I assumed AI applications were mostly about prompts and models.

After building it, I realized the opposite.

The hardest parts were:

  • Retrieval quality
  • Latency optimization
  • Storage architecture
  • Offline execution
  • OCR reliability
  • Error handling
  • Cost efficiency

The LLM itself was only one component.

Everything around the model turned out to be equally important.


Future Improvements

A few areas I would like to explore next:

  • Hybrid vector search
  • Incremental indexing
  • Better citation grounding
  • Multi-document reasoning
  • Voice-based study sessions
  • Mobile-first offline deployment
  • On-device embedding generation

Final Thoughts

I originally started this project because I was curious about how tools like NotebookLM worked behind the scenes.

What began as an experiment eventually became one of the most educational engineering projects I've built.

It taught me far more about AI systems, retrieval pipelines, optimization, and software architecture than simply consuming AI tools ever could.

If you're interested in AI engineering, RAG systems, local LLMs, or offline-first applications, I'd love to hear your thoughts.

GitHub Repository: https://github.com/di0206-innovator/Notes-GPT
