<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Divyanshu Sinha</title>
    <description>The latest articles on DEV Community by Divyanshu Sinha (@divyanshu_sinha_72e579e28).</description>
    <link>https://dev.to/divyanshu_sinha_72e579e28</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3976831%2Fa72a5101-cca2-47d4-917c-42ff25794f69.jpg</url>
      <title>DEV Community: Divyanshu Sinha</title>
      <link>https://dev.to/divyanshu_sinha_72e579e28</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/divyanshu_sinha_72e579e28"/>
    <language>en</language>
    <item>
      <title>Building NotesGPT: An Offline-Capable AI Study Assistant with RAG, Local LLMs, and WebGPU</title>
      <dc:creator>Divyanshu Sinha</dc:creator>
      <pubDate>Wed, 10 Jun 2026 03:27:25 +0000</pubDate>
      <link>https://dev.to/divyanshu_sinha_72e579e28/building-notesgpt-an-offline-capable-ai-study-assistant-with-rag-local-llms-and-webgpu-3l22</link>
      <guid>https://dev.to/divyanshu_sinha_72e579e28/building-notesgpt-an-offline-capable-ai-study-assistant-with-rag-local-llms-and-webgpu-3l22</guid>
      <description>&lt;p&gt;We all know the feeling.&lt;/p&gt;

&lt;p&gt;Exams are approaching, notes are scattered across PDFs, handwritten notebooks, lecture slides, and screenshots, and tools like ChatGPT, Gemini, and NotebookLM suddenly become indispensable.&lt;/p&gt;

&lt;p&gt;I was using these tools extensively during my own exam preparation when a different question started bothering me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How are these systems actually built?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not from a user's perspective.&lt;/p&gt;

&lt;p&gt;From an engineer's perspective.&lt;/p&gt;

&lt;p&gt;How does an uploaded PDF become searchable?&lt;/p&gt;

&lt;p&gt;How does an AI know which paragraph from a 200-page textbook contains the answer?&lt;/p&gt;

&lt;p&gt;How does NotebookLM generate responses grounded in your notes instead of hallucinating information?&lt;/p&gt;

&lt;p&gt;And perhaps the most practical question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Could I build something similar that continues working when the internet doesn't?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Living in a PG with unreliable Wi-Fi made that challenge particularly interesting.&lt;/p&gt;

&lt;p&gt;That curiosity eventually became &lt;strong&gt;NotesGPT&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It is a hybrid cloud and local AI study companion that processes PDFs and handwritten notes, generates revision material, flashcards, and mock exams, and answers questions using Retrieval-Augmented Generation (RAG).&lt;/p&gt;




&lt;h1&gt;
  
  
  The Problem
&lt;/h1&gt;

&lt;p&gt;Most AI-powered study tools today are heavily dependent on cloud infrastructure.&lt;/p&gt;

&lt;p&gt;The moment your internet becomes unstable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uploads fail&lt;/li&gt;
&lt;li&gt;Responses slow down&lt;/li&gt;
&lt;li&gt;Features become unusable&lt;/li&gt;
&lt;li&gt;Productivity drops&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For students, this often happens at the worst possible moment.&lt;/p&gt;

&lt;p&gt;I wanted to explore a different approach:&lt;/p&gt;

&lt;p&gt;Instead of choosing between cloud and local AI, why not support both?&lt;/p&gt;

&lt;h1&gt;
  
  
  Project Goals
&lt;/h1&gt;

&lt;p&gt;The project had four major goals:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Document Understanding
&lt;/h3&gt;

&lt;p&gt;Accept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PDFs&lt;/li&gt;
&lt;li&gt;Lecture notes&lt;/li&gt;
&lt;li&gt;Handwritten notes&lt;/li&gt;
&lt;li&gt;Scanned textbooks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and convert them into searchable knowledge.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Context-Grounded Answers
&lt;/h3&gt;

&lt;p&gt;Prevent generic LLM responses.&lt;br&gt;
Answers should come from the uploaded material itself.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Offline Capability
&lt;/h3&gt;

&lt;p&gt;Allow the system to continue functioning without cloud access.&lt;/p&gt;
&lt;h3&gt;
  
  
  4. Multiple Study Outputs
&lt;/h3&gt;

&lt;p&gt;Generate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Revision notes&lt;/li&gt;
&lt;li&gt;Flashcards&lt;/li&gt;
&lt;li&gt;Question banks&lt;/li&gt;
&lt;li&gt;Mock examinations&lt;/li&gt;
&lt;li&gt;Interactive Q&amp;amp;A&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;from the same knowledge source.&lt;/p&gt;


&lt;h1&gt;
  
  
  High-Level Architecture
&lt;/h1&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Documents
      │
      ▼
Text Extraction
(PDF.js / OCR)
      │
      ▼
Chunking
      │
      ▼
Embeddings
      │
      ▼
Vector Storage
      │
      ▼
Similarity Search
      │
      ▼
Retrieved Context
      │
      ▼
LLM Generation
      │
      ▼
Notes / Flashcards / Chat / Exams
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The architecture follows a classic Retrieval-Augmented Generation pipeline, but with support for both cloud and local execution.&lt;/p&gt;
&lt;h1&gt;
  
  
  Why RAG Instead of Just Sending the PDF to an LLM?
&lt;/h1&gt;

&lt;p&gt;One common beginner approach is:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Upload PDF
↓
Send PDF to LLM
↓
Get Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works for small documents.&lt;/p&gt;

&lt;p&gt;It breaks down quickly when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Documents become large&lt;/li&gt;
&lt;li&gt;Token costs increase&lt;/li&gt;
&lt;li&gt;Context windows are exceeded&lt;/li&gt;
&lt;li&gt;Answer quality degrades as the relevant passage gets buried in unrelated text&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, NotesGPT uses Retrieval-Augmented Generation.&lt;/p&gt;

&lt;p&gt;The workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract text&lt;/li&gt;
&lt;li&gt;Split into chunks&lt;/li&gt;
&lt;li&gt;Generate embeddings&lt;/li&gt;
&lt;li&gt;Store embeddings&lt;/li&gt;
&lt;li&gt;Retrieve relevant chunks&lt;/li&gt;
&lt;li&gt;Generate answers using retrieved context&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lower token usage&lt;/li&gt;
&lt;li&gt;Better accuracy&lt;/li&gt;
&lt;li&gt;Faster responses&lt;/li&gt;
&lt;li&gt;Grounded answers&lt;/li&gt;
&lt;li&gt;Source traceability&lt;/li&gt;
&lt;/ul&gt;
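
&lt;p&gt;To make the "split into chunks" step concrete, here is a minimal sketch of overlapping, word-based chunking. It is written in Python for brevity and is not NotesGPT's actual code: the function name &lt;code&gt;chunk_text&lt;/code&gt; and the default sizes are assumptions chosen for illustration. The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into word-based chunks that share `overlap` words."""
    # Reject a negative overlap or one that would stall the sliding window.
    if overlap not in range(chunk_size):
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    if not words:
        return []

    step = chunk_size - overlap
    # Stop once the previous window has already covered the last word.
    last_start = max(len(words) - overlap, 1)
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, last_start, step)
    ]


if __name__ == "__main__":
    for chunk in chunk_text("a b c d e f g h i j", chunk_size=4, overlap=1):
        print(chunk)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each chunk would then be embedded and stored separately, so a question only pulls in the few chunks that match it.&lt;/p&gt;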




&lt;h1&gt;
  
  
  Building the Offline Layer
&lt;/h1&gt;

&lt;p&gt;This became the most interesting part of the project.&lt;/p&gt;

&lt;p&gt;Most AI applications support a single inference engine.&lt;/p&gt;

&lt;p&gt;I wanted flexibility.&lt;/p&gt;

&lt;p&gt;NotesGPT currently supports three different local execution modes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ollama
&lt;/h2&gt;

&lt;p&gt;For users with stronger hardware.&lt;/p&gt;

&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full local privacy&lt;/li&gt;
&lt;li&gt;Better model quality&lt;/li&gt;
&lt;li&gt;No cloud dependency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example models:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;deepseek-r1:8b
gemma2:2b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
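
&lt;p&gt;Ollama serves models over a local HTTP API, by default on port 11434. The sketch below shows how a client might send a prompt to one of the models listed above through the &lt;code&gt;/api/generate&lt;/code&gt; endpoint. It is in Python for brevity and is not taken from the NotesGPT codebase; the helper names &lt;code&gt;build_ollama_request&lt;/code&gt; and &lt;code&gt;ask_ollama&lt;/code&gt; are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import urllib.request

# Default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_ollama_request(prompt, model="gemma2:2b"):
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(prompt, model="gemma2:2b", timeout=120):
    """Send the prompt and return the generated text (needs Ollama running)."""
    request = build_ollama_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # With "stream": False the reply is one JSON object with a "response" field.
    return body["response"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the request never leaves &lt;code&gt;localhost&lt;/code&gt;, this path keeps working with the Wi-Fi down, provided the model was pulled beforehand.&lt;/p&gt;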






&lt;h2&gt;
  
  
  WebLLM
&lt;/h2&gt;

&lt;p&gt;This was fascinating.&lt;/p&gt;

&lt;p&gt;WebLLM allows LLMs to run entirely inside the browser using WebGPU.&lt;/p&gt;

&lt;p&gt;No external application.&lt;br&gt;
No backend.&lt;br&gt;
No cloud calls.&lt;/p&gt;

&lt;p&gt;Just:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser
+
WebGPU
+
Local Model
=
Offline AI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes deployment dramatically simpler: there is nothing to install and no inference server to run.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gemini Nano (window.ai)
&lt;/h2&gt;

&lt;p&gt;Modern browsers are slowly introducing built-in AI capabilities.&lt;br&gt;
Supporting Gemini Nano was an experiment in understanding what local browser-native AI could look like in the future.&lt;/p&gt;


&lt;h1&gt;
  
  
  OCR Pipeline
&lt;/h1&gt;

&lt;p&gt;Students don't only upload PDFs.&lt;/p&gt;

&lt;p&gt;They upload:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Notebook photos&lt;/li&gt;
&lt;li&gt;Whiteboard images&lt;/li&gt;
&lt;li&gt;Scanned assignments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Supporting these required OCR.&lt;/p&gt;

&lt;p&gt;I implemented two OCR paths.&lt;/p&gt;
&lt;h2&gt;
  
  
  Local OCR
&lt;/h2&gt;

&lt;p&gt;Using:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tesseract.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Privacy&lt;/li&gt;
&lt;li&gt;Offline support&lt;/li&gt;
&lt;li&gt;Zero API cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tradeoff:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lower accuracy&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Cloud OCR
&lt;/h2&gt;

&lt;p&gt;Using:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Gemini Vision
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher accuracy&lt;/li&gt;
&lt;li&gt;Better handwriting recognition&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tradeoff:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires internet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This dual-mode approach gave users flexibility depending on their situation.&lt;/p&gt;
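
&lt;p&gt;One way to wire the two OCR paths together is a simple preference-with-fallback rule. The sketch below is an illustration in Python, not NotesGPT's actual logic: &lt;code&gt;cloud_ocr&lt;/code&gt; and &lt;code&gt;local_ocr&lt;/code&gt; are hypothetical callables standing in for the Gemini Vision and Tesseract.js paths, and automatic fallback on error is an assumption (the app may instead leave the choice to the user).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def extract_text(image_bytes, cloud_ocr, local_ocr, online):
    """Return (text, engine_used), preferring cloud OCR when online."""
    if online:
        try:
            return cloud_ocr(image_bytes), "cloud"
        except Exception:
            # Assumed policy: any cloud failure (network drop, API error)
            # degrades to the local engine instead of failing the upload.
            pass
    return local_ocr(image_bytes), "local"


if __name__ == "__main__":
    def fake_cloud(image):
        raise ConnectionError("Wi-Fi dropped")

    def fake_local(image):
        return "text recognized on-device"

    print(extract_text(b"...", fake_cloud, fake_local, online=True))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Returning the engine name alongside the text lets the UI tell the user when a lower-accuracy local result was used.&lt;/p&gt;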




&lt;h1&gt;
  
  
  One Optimization That Reduced Latency by About 75%
&lt;/h1&gt;

&lt;p&gt;The original study-kit generation pipeline looked something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Generate Notes
     ↓
Wait
     ↓
Generate Flashcards
     ↓
Wait
     ↓
Generate Questions
     ↓
Wait
     ↓
Generate Mock Exam
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This required four or more sequential LLM calls, each one waiting for the previous call to finish.&lt;/p&gt;

&lt;p&gt;Consequences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slow generation&lt;/li&gt;
&lt;li&gt;Increased token usage&lt;/li&gt;
&lt;li&gt;Higher failure probability&lt;/li&gt;
&lt;li&gt;API rate limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I redesigned the workflow into a single structured generation request that returns the notes, flashcards, questions, and mock exam in one response.&lt;/p&gt;

&lt;p&gt;Results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Generation Time&lt;/td&gt;
&lt;td&gt;~60 sec&lt;/td&gt;
&lt;td&gt;&amp;lt;15 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Calls&lt;/td&gt;
&lt;td&gt;4+&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token Usage&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Reduced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User Experience&lt;/td&gt;
&lt;td&gt;Slow&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;System architecture often matters more than model selection.&lt;/p&gt;
&lt;/blockquote&gt;
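
&lt;p&gt;As a rough illustration of the single-request idea, the sketch below asks for every study output as one JSON object and validates the reply before use. It is in Python for brevity; the key names in &lt;code&gt;REQUIRED_KEYS&lt;/code&gt;, the prompt wording, and both function names are assumptions, since the exact schema NotesGPT uses is not shown here.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

# Hypothetical section names for the combined study kit.
REQUIRED_KEYS = ("notes", "flashcards", "questions", "mock_exam")


def build_study_kit_prompt(context):
    """Build one prompt that requests every output in a single JSON object."""
    return (
        "Using only the study material below, return one JSON object "
        "with the keys " + ", ".join(REQUIRED_KEYS) + ".\n\n"
        "Material:\n" + context
    )


def parse_study_kit(raw_response):
    """Parse the model reply and fail loudly if a section is missing."""
    kit = json.loads(raw_response)
    if not isinstance(kit, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in kit]
    if missing:
        raise ValueError("study kit is missing sections: " + ", ".join(missing))
    return kit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Validating the reply matters more with one big call than with four small ones, because a single malformed response now affects every output at once.&lt;/p&gt;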




&lt;h1&gt;
  
  
  Optimizing Vector Search
&lt;/h1&gt;

&lt;p&gt;Another challenge appeared during retrieval.&lt;/p&gt;

&lt;p&gt;The naive approach:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fetch everything
Compute similarity
Return results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This quickly becomes inefficient as the number of chunks grows.&lt;/p&gt;

&lt;p&gt;Instead:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetch only embeddings and metadata&lt;/li&gt;
&lt;li&gt;Compute similarity in memory&lt;/li&gt;
&lt;li&gt;Retrieve the full content of only the top-ranked chunks&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lower bandwidth usage&lt;/li&gt;
&lt;li&gt;Faster retrieval&lt;/li&gt;
&lt;li&gt;Reduced database reads&lt;/li&gt;
&lt;li&gt;Better scalability&lt;/li&gt;
&lt;/ul&gt;
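
&lt;p&gt;The retrieval steps above can be sketched as a two-phase lookup: rank chunks by cosine similarity using embeddings alone, then load content for the winners. This is a minimal Python illustration, not the project's implementation; &lt;code&gt;index&lt;/code&gt; (a list of id and embedding pairs) and &lt;code&gt;fetch_text&lt;/code&gt; (a callable that loads one chunk by id) are hypothetical stand-ins for the storage layer.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunk_ids(query_embedding, index, k=3):
    """Rank (chunk_id, embedding) pairs in memory; no chunk text is loaded."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]


def retrieve(query_embedding, index, fetch_text, k=3):
    """Load the content of only the k best-matching chunks."""
    return [fetch_text(cid) for cid in top_k_chunk_ids(query_embedding, index, k)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A linear scan like this is fine for one student's notes; a much larger corpus would call for an approximate nearest-neighbour index instead.&lt;/p&gt;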




&lt;h1&gt;
  
  
  Tech Stack
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Frontend
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Next.js 16&lt;/li&gt;
&lt;li&gt;React 19&lt;/li&gt;
&lt;li&gt;Tailwind CSS 4&lt;/li&gt;
&lt;li&gt;Framer Motion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  AI
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Gemini 2.0 Flash&lt;/li&gt;
&lt;li&gt;Ollama&lt;/li&gt;
&lt;li&gt;WebLLM&lt;/li&gt;
&lt;li&gt;Gemini Nano&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Storage
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Firestore Vector Collections&lt;/li&gt;
&lt;li&gt;IndexedDB&lt;/li&gt;
&lt;li&gt;TF-IDF Local Search&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  OCR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Tesseract.js&lt;/li&gt;
&lt;li&gt;Gemini Vision&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Authentication
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Firebase Authentication&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  What I Learned
&lt;/h1&gt;

&lt;p&gt;Before building this project, I assumed AI applications were mostly about prompts and models.&lt;/p&gt;

&lt;p&gt;After building it, I realized the opposite.&lt;/p&gt;

&lt;p&gt;The hardest parts were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval quality&lt;/li&gt;
&lt;li&gt;Latency optimization&lt;/li&gt;
&lt;li&gt;Storage architecture&lt;/li&gt;
&lt;li&gt;Offline execution&lt;/li&gt;
&lt;li&gt;OCR reliability&lt;/li&gt;
&lt;li&gt;Error handling&lt;/li&gt;
&lt;li&gt;Cost efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The LLM itself was only one component.&lt;/p&gt;

&lt;p&gt;Everything around the model turned out to be equally important.&lt;/p&gt;




&lt;h1&gt;
  
  
  Future Improvements
&lt;/h1&gt;

&lt;p&gt;A few areas I would like to explore next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hybrid vector search&lt;/li&gt;
&lt;li&gt;Incremental indexing&lt;/li&gt;
&lt;li&gt;Better citation grounding&lt;/li&gt;
&lt;li&gt;Multi-document reasoning&lt;/li&gt;
&lt;li&gt;Voice-based study sessions&lt;/li&gt;
&lt;li&gt;Mobile-first offline deployment&lt;/li&gt;
&lt;li&gt;On-device embedding generation&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;I originally started this project because I was curious about how tools like NotebookLM worked behind the scenes.&lt;/p&gt;

&lt;p&gt;What began as an experiment eventually became one of the most educational engineering projects I've built.&lt;/p&gt;

&lt;p&gt;It taught me far more about AI systems, retrieval pipelines, optimization, and software architecture than simply consuming AI tools ever could.&lt;/p&gt;

&lt;p&gt;If you're interested in AI engineering, RAG systems, local LLMs, or offline-first applications, I'd love to hear your thoughts.&lt;/p&gt;

&lt;p&gt;GitHub Repository: &lt;a href="https://github.com/di0206-innovator/Notes-GPT" rel="noopener noreferrer"&gt;https://github.com/di0206-innovator/Notes-GPT&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
