<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rahul Talreja</title>
    <description>The latest articles on DEV Community by Rahul Talreja (@rahul_talreja_946a8621542).</description>
    <link>https://dev.to/rahul_talreja_946a8621542</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1532066%2Fb158c26e-4ea4-4003-bb15-208a738da190.png</url>
      <title>DEV Community: Rahul Talreja</title>
      <link>https://dev.to/rahul_talreja_946a8621542</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rahul_talreja_946a8621542"/>
    <language>en</language>
    <item>
      <title>Building a Private RAG System: Lessons from a Local-First AI Journal</title>
      <dc:creator>Rahul Talreja</dc:creator>
      <pubDate>Sat, 23 May 2026 10:19:15 +0000</pubDate>
      <link>https://dev.to/rahul_talreja_946a8621542/building-a-private-rag-system-lessons-from-a-local-first-ai-journal-2dol</link>
      <guid>https://dev.to/rahul_talreja_946a8621542/building-a-private-rag-system-lessons-from-a-local-first-ai-journal-2dol</guid>
      <description>&lt;p&gt;&lt;em&gt;Most AI apps quietly send your data to the cloud. DiaryGPT does the opposite — and this is the full technical story.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With AI + Private Data
&lt;/h2&gt;

&lt;p&gt;When you write in a journal, you write the things you'd never say out loud. The last thing you want is that text sitting on someone else's server, used to train a model, or exposed in a breach.&lt;/p&gt;

&lt;p&gt;But AI is genuinely useful for journaling. It can find patterns you miss, reflect things back to you, ask questions a blank page never would. The tension is real: &lt;strong&gt;you want AI insight without sacrificing privacy.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most apps solve this by trusting a privacy policy. I wanted a technical guarantee.&lt;/p&gt;

&lt;p&gt;So I built DiaryGPT — an AI-powered personal journal where, by default, &lt;strong&gt;zero data leaves your machine.&lt;/strong&gt; Here's exactly how it works.&lt;/p&gt;




&lt;h2&gt;
  
  
  What DiaryGPT Does
&lt;/h2&gt;

&lt;p&gt;Before the architecture, here's what the app gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI mood analysis&lt;/strong&gt; on every entry — mood, themes, a reflective response, and a follow-up question&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG-powered chat&lt;/strong&gt; — ask "when was I most anxious?" and get answers grounded in your actual entries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic search&lt;/strong&gt; — find entries by meaning, not keywords ("times I felt lonely" finds entries with "isolated", "disconnected", "blue")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weekly reflection&lt;/strong&gt; — AI summary of your emotional arc across the week&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personalized journaling prompts&lt;/strong&gt; — generated from your recent writing patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writing streaks and memories&lt;/strong&gt; — "on this day last year you wrote…"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI companion mode&lt;/strong&gt; — CBT/DBT-grounded reflection with built-in crisis detection (not a replacement for a licensed therapist)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mood check-ins&lt;/strong&gt; — 1–10 logging with history chart&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice dictation and voice chat&lt;/strong&gt; — speak entries, hear responses read back&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full AES-256-GCM encryption&lt;/strong&gt; at rest — every diary entry, chat message, and note&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Privacy Architecture
&lt;/h2&gt;

&lt;p&gt;DiaryGPT has two modes. You choose in Settings.&lt;/p&gt;

&lt;h3&gt;
  
  
  🟢 Local Mode (Default)
&lt;/h3&gt;

&lt;p&gt;Everything runs on your machine. The AI model, the search, the analysis — all local via &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your diary entry
      ↓
Ollama (nomic-embed-text) → converts to numbers → saved in SQLite
      ↓
Ollama (llama3.2 / qwen2.5) → analyzes mood → saved encrypted

Zero data leaves your machine.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🟡 Cloud Mode (Opt-in)
&lt;/h3&gt;

&lt;p&gt;Cloud mode is for users who want higher reasoning quality and are comfortable with diary excerpts passing through a third-party API. You bring your own API key (Groq, OpenAI, Anthropic, or Gemini), and the key is stored locally.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your diary entry
      ↓
Ollama (embeddings) → still local, nothing sent
      ↓
Top 5 relevant excerpts → your provider's API → answer streams back

Only a small slice of your diary transits. Never the full thing.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The RAG Pipeline — How the AI "Remembers" Your Life
&lt;/h2&gt;

&lt;p&gt;RAG stands for &lt;strong&gt;Retrieval-Augmented Generation&lt;/strong&gt;. It's the technique that makes the AI feel like it actually knows you — without sending everything you've ever written to a language model on every request.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is an Embedding?
&lt;/h3&gt;

&lt;p&gt;Every diary entry gets converted into a list of numbers — like GPS coordinates for meaning.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"I felt anxious today"    → [0.21, 0.83, 0.12, 0.74, ...]
"I was really stressed"   → [0.22, 0.81, 0.14, 0.71, ...]  ← very similar
"I love hiking"           → [0.91, 0.12, 0.67, 0.23, ...]  ← very different
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similar meaning = similar numbers. This is what makes semantic search work — you search by concept, not exact words.&lt;/p&gt;
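&lt;p&gt;To make "similar numbers" concrete, here is a minimal JavaScript sketch of cosine similarity applied to the three toy vectors above, truncated to four dimensions. The function name &lt;code&gt;cosineSimilarity&lt;/code&gt; is illustrative; in DiaryGPT the comparison runs inside sqlite-vec or pgvector, not in application code.&lt;/p&gt;

```javascript
// Cosine similarity between two equal-length numeric vectors.
// Returns a value in [-1, 1]; closer to 1 means closer in direction.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    dot += x * b[i];
    normA += x * x;
    normB += b[i] * b[i];
  });
  if (normA === 0 || normB === 0) return 0; // zero vector: define similarity as 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 4-dimensional vectors from the illustration above.
// Real embeddings have hundreds of dimensions.
const anxious = [0.21, 0.83, 0.12, 0.74];
const stressed = [0.22, 0.81, 0.14, 0.71];
const hiking = [0.91, 0.12, 0.67, 0.23];

console.log(cosineSimilarity(anxious, stressed).toFixed(2));
console.log(cosineSimilarity(anxious, hiking).toFixed(2));
```

&lt;p&gt;The first score comes out close to 1 and the second far lower, which is the property semantic search relies on.&lt;/p&gt;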

&lt;h3&gt;
  
  
  Phase 1 — Writing an Entry
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You write: "Today was rough. Felt anxious about the deadline."
                    ↓
       Ollama (nomic-embed-text)
       converts text → [0.21, 0.83, 0.12, 0.74, ...]
                    ↓
       Saved in SQLite / PostgreSQL:
         entry text    → AES-256-GCM encrypted
         embedding     → stored raw (math requires it)
         mood/themes   → analyzed by LLM, stored encrypted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This happens asynchronously — the entry saves immediately, analysis runs in the background.&lt;/p&gt;
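&lt;p&gt;A rough sketch of the "save now, analyze later" pattern follows. It is not DiaryGPT's actual code: the in-memory &lt;code&gt;entries&lt;/code&gt; map stands in for the database, and &lt;code&gt;saveEntry&lt;/code&gt; and the injected &lt;code&gt;analyze&lt;/code&gt; function are hypothetical names.&lt;/p&gt;

```javascript
// Hypothetical in-memory stand-in for the entries table.
const entries = new Map();
let nextId = 1;

// Stores the entry and returns right away; analysis is started but not awaited.
// `analyze` is an injected async function, for example a call to a local LLM.
function saveEntry(text, analyze) {
  const id = nextId;
  nextId += 1;
  entries.set(id, { text, status: "pending", analysis: null });

  // Run analysis in the background. Failures are recorded on the row,
  // so a broken model call never loses the entry itself.
  const background = Promise.resolve()
    .then(() => analyze(text))
    .then((analysis) => {
      Object.assign(entries.get(id), { status: "analyzed", analysis });
    })
    .catch((err) => {
      Object.assign(entries.get(id), { status: "failed", error: String(err) });
    });

  // `background` is returned only so callers (and tests) can wait if they choose to.
  return { id, background };
}
```

&lt;p&gt;The caller gets the entry id back while the row is still marked &lt;code&gt;pending&lt;/code&gt;; the status flips once the analysis promise settles.&lt;/p&gt;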

&lt;h3&gt;
  
  
  Phase 2 — Asking a Question
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You ask: "When did I feel anxious about work?"
                    ↓
       Ollama converts question → numbers
                    ↓
       Cosine similarity search runs in YOUR database
       (sqlite-vec or pgvector — pure math, no external call)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;entry A: 0.91 match ✓
entry B: 0.87 match ✓
entry C: 0.79 match ✓
entry D: 0.31 match ✗  (skipped)
             ↓
Top matches (up to 5) decrypted in memory
             ↓
LLM receives: system prompt + diary excerpts + your question
             ↓
Streams answer word by word (SSE)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The key insight: embeddings find &lt;em&gt;what&lt;/em&gt; to read. The LLM decides &lt;em&gt;what to say&lt;/em&gt; about it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The LLM never sees your full diary — only the 5 most relevant entries. Cosine similarity runs entirely on your server. Nothing goes to an external service unless you've opted into cloud mode.&lt;/p&gt;
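&lt;p&gt;The selection step can be sketched in a few lines. This assumes similarity scores have already been computed by the database; the 0.5 cutoff and the names &lt;code&gt;topMatches&lt;/code&gt; and &lt;code&gt;buildContext&lt;/code&gt; are illustrative assumptions, not DiaryGPT's settings.&lt;/p&gt;

```javascript
// Rank scored entries and keep the best matches.
// `scored` is an array of { id, score }, where score is a cosine similarity.
// k = 5 follows the article; the 0.5 cutoff is an assumption for this sketch.
function topMatches(scored, k = 5, minScore = 0.5) {
  return scored
    .filter((e) => e.score >= minScore) // filter() copies, so the input is not mutated
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Build the prompt context from decrypted excerpts only.
// `decryptById` is a hypothetical lookup that returns plaintext for one entry.
function buildContext(matches, decryptById) {
  return matches.map((m) => decryptById(m.id)).join("\n---\n");
}

const scored = [
  { id: "D", score: 0.31 },
  { id: "B", score: 0.87 },
  { id: "A", score: 0.91 },
  { id: "C", score: 0.79 },
];
const matches = topMatches(scored);
console.log(matches.map((m) => m.id).join(", ")); // prints "A, B, C"
```

&lt;p&gt;Only the entries that survive this step are ever decrypted and placed in the prompt.&lt;/p&gt;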

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6pvhmtlf7uqvgliyeyk6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6pvhmtlf7uqvgliyeyk6.png" alt=" " width="800" height="1222"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjwbv9z1r2s4smyko0lo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjwbv9z1r2s4smyko0lo.png" alt=" " width="800" height="1311"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Companion Pipeline — Safety First
&lt;/h2&gt;

&lt;p&gt;The companion mode is built around one rule: &lt;strong&gt;if someone is in crisis, the LLM never runs.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You type a message
        ↓
Crisis detection (keyword matching, server-side)
"suicide", "hurt myself", "want to die", etc.
        ↓
    CRISIS?          SAFE?
      ↓                 ↓
Hardcoded response   LLM runs with CBT/DBT prompt
988 + Crisis Text    Acknowledges → reflects → one question
Line + findahelpline
LLM never called     Saves encrypted to companion_messages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The crisis response is hardcoded. Because the LLM is never called for a flagged message, the response cannot be hallucinated or altered by a clever prompt. The companion banner (&lt;em&gt;"This is an AI companion, not a licensed therapist"&lt;/em&gt;) is also hardcoded in the UI, never AI-generated.&lt;/p&gt;

&lt;p&gt;The companion system uses a distinct system prompt built around CBT thought-reframing, DBT skills, and reflective listening. Sessions are saved and resumable.&lt;/p&gt;

&lt;p&gt;A real limitation worth naming: keyword detection catches explicit phrases like "I want to die" but will miss oblique crisis language like "I just want it to stop" or "everyone would be better off without me." A small local classifier as a second layer is on the roadmap — keyword filter as the fast, auditable first line, classifier as the safety net for implicit signals.&lt;/p&gt;
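&lt;p&gt;As a minimal sketch of the gate described in this section, the routing could look like the code below. The phrase list is the short sample from the diagram, the response wording is a placeholder, and &lt;code&gt;isCrisis&lt;/code&gt; and &lt;code&gt;companionReply&lt;/code&gt; are hypothetical names; a real deployment would need a much more complete, reviewed list.&lt;/p&gt;

```javascript
// Illustrative phrase list only (taken from the diagram above).
const CRISIS_PHRASES = ["suicide", "hurt myself", "want to die"];

// Fixed text, never generated by a model. The wording is a placeholder.
const CRISIS_RESPONSE =
  "It sounds like you may be in crisis. In the US you can call or text 988, " +
  "or find a local line at findahelpline.com.";

// Case-insensitive substring match against the phrase list.
function isCrisis(message) {
  const text = message.toLowerCase();
  return CRISIS_PHRASES.some((phrase) => text.includes(phrase));
}

// `callLlm` is an injected async function; it is only invoked for safe messages.
async function companionReply(message, callLlm) {
  if (isCrisis(message)) {
    return { source: "hardcoded", text: CRISIS_RESPONSE };
  }
  return { source: "llm", text: await callLlm(message) };
}
```

&lt;p&gt;Note that &lt;code&gt;isCrisis("I just want it to stop")&lt;/code&gt; returns &lt;code&gt;false&lt;/code&gt; here, which is exactly the gap the planned classifier layer is meant to cover.&lt;/p&gt;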

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4m0bw8vsxrhlbk7wcyp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4m0bw8vsxrhlbk7wcyp.png" alt=" " width="800" height="1186"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Encryption Layer
&lt;/h2&gt;

&lt;p&gt;Every piece of user content goes through AES-256-GCM encryption before hitting the database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Every diary entry, chat message, companion note goes through this&lt;/span&gt;
&lt;span class="nf"&gt;encrypt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;// before DB insert&lt;/span&gt;
&lt;span class="nf"&gt;decrypt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;// after DB read, before sending to LLM or browser&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The encryption key is yours: a 64-character hex string (32 bytes, the 256-bit key that AES-256 requires) that you generate and store in your &lt;code&gt;.env&lt;/code&gt;. Without it, the encrypted content in the database is unreadable. The server never transmits the key.&lt;/p&gt;

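&lt;p&gt;The one exception: &lt;strong&gt;embedding vectors are stored unencrypted.&lt;/strong&gt; Cosine similarity requires the raw numbers. The chunk text that generated the embedding is stored separately, encrypted. The security boundary lives at the source text, not the derived vector. A vector is not a one-way hash, though: it still carries information about the text it was computed from, so the database file should be treated as sensitive even with the text encrypted.&lt;/p&gt;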
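&lt;p&gt;For readers who have not used AES-256-GCM from Node.js, here is one way &lt;code&gt;encrypt&lt;/code&gt; and &lt;code&gt;decrypt&lt;/code&gt; could be written with the built-in &lt;code&gt;crypto&lt;/code&gt; module. This is a sketch under stated assumptions, not the project's code: the key is generated inline instead of read from &lt;code&gt;.env&lt;/code&gt;, and the dot-separated payload layout is invented for the example.&lt;/p&gt;

```javascript
const crypto = require("node:crypto");

// Key: 64 hex characters = 32 bytes, as AES-256 requires.
// Generated here for the demo; the article keeps it in .env.
const key = Buffer.from(crypto.randomBytes(32).toString("hex"), "hex");

// Payload layout (an assumption for this sketch):
// base64(iv) . base64(authTag) . base64(ciphertext)
function encrypt(plaintext) {
  const iv = crypto.randomBytes(12); // 96-bit nonce, must be unique per message
  const cipher = crypto.createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  return [iv, tag, ciphertext].map((b) => b.toString("base64")).join(".");
}

function decrypt(payload) {
  const [iv, tag, ciphertext] = payload.split(".").map((p) => Buffer.from(p, "base64"));
  const decipher = crypto.createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  // final() throws if the ciphertext or the tag has been tampered with.
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}

console.log(decrypt(encrypt("Today was rough.")));
```

&lt;p&gt;GCM is an authenticated mode, so a modified row fails loudly at decryption time instead of yielding garbage text.&lt;/p&gt;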




&lt;h2&gt;
  
  
  The Technical Stack
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="err"&gt;Runtime&lt;/span&gt;        &lt;span class="err"&gt;Node.js&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="err"&gt;Express&lt;/span&gt;
&lt;span class="err"&gt;Frontend&lt;/span&gt;       &lt;span class="err"&gt;Vanilla&lt;/span&gt; &lt;span class="err"&gt;JS&lt;/span&gt; &lt;span class="err"&gt;SPA&lt;/span&gt; &lt;span class="err"&gt;(no&lt;/span&gt; &lt;span class="err"&gt;build&lt;/span&gt; &lt;span class="err"&gt;step,&lt;/span&gt; &lt;span class="err"&gt;no&lt;/span&gt; &lt;span class="err"&gt;framework)&lt;/span&gt;
&lt;span class="err"&gt;Auth&lt;/span&gt;           &lt;span class="err"&gt;JWT&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="err"&gt;Argon2id&lt;/span&gt; &lt;span class="err"&gt;password&lt;/span&gt; &lt;span class="err"&gt;hashing&lt;/span&gt;
&lt;span class="err"&gt;Encryption&lt;/span&gt;     &lt;span class="err"&gt;AES-256-GCM&lt;/span&gt; &lt;span class="err"&gt;(Node.js&lt;/span&gt; &lt;span class="err"&gt;crypto&lt;/span&gt; &lt;span class="err"&gt;module)&lt;/span&gt;
&lt;span class="err"&gt;Storage&lt;/span&gt;        &lt;span class="err"&gt;SQLite&lt;/span&gt; &lt;span class="err"&gt;(local&lt;/span&gt; &lt;span class="err"&gt;default)&lt;/span&gt; &lt;span class="err"&gt;or&lt;/span&gt; &lt;span class="err"&gt;PostgreSQL&lt;/span&gt; &lt;span class="err"&gt;(multi-device)&lt;/span&gt;
&lt;span class="err"&gt;Vector&lt;/span&gt; &lt;span class="err"&gt;search&lt;/span&gt;  &lt;span class="err"&gt;sqlite-vec&lt;/span&gt; &lt;span class="err"&gt;(local)&lt;/span&gt; &lt;span class="err"&gt;or&lt;/span&gt; &lt;span class="err"&gt;pgvector&lt;/span&gt; &lt;span class="err"&gt;(Postgres)&lt;/span&gt;
&lt;span class="err"&gt;Embeddings&lt;/span&gt;     &lt;span class="err"&gt;Ollama&lt;/span&gt; &lt;span class="err"&gt;nomic-embed-text&lt;/span&gt; &lt;span class="err"&gt;(local&lt;/span&gt; &lt;span class="err"&gt;default)&lt;/span&gt;
&lt;span class="err"&gt;LLM&lt;/span&gt;            &lt;span class="err"&gt;Ollama&lt;/span&gt; &lt;span class="err"&gt;(local&lt;/span&gt; &lt;span class="err"&gt;default)&lt;/span&gt; &lt;span class="err"&gt;/&lt;/span&gt; &lt;span class="err"&gt;Groq&lt;/span&gt; &lt;span class="err"&gt;/&lt;/span&gt; &lt;span class="err"&gt;OpenAI&lt;/span&gt; &lt;span class="err"&gt;/&lt;/span&gt; &lt;span class="err"&gt;Gemini&lt;/span&gt; &lt;span class="err"&gt;/&lt;/span&gt; &lt;span class="err"&gt;Anthropic&lt;/span&gt;
&lt;span class="err"&gt;Streaming&lt;/span&gt;      &lt;span class="err"&gt;SSE&lt;/span&gt; &lt;span class="err"&gt;(Server-Sent&lt;/span&gt; &lt;span class="err"&gt;Events)&lt;/span&gt; &lt;span class="err"&gt;over&lt;/span&gt; &lt;span class="err"&gt;POST&lt;/span&gt; &lt;span class="err"&gt;with&lt;/span&gt; &lt;span class="err"&gt;ReadableStream&lt;/span&gt;
&lt;span class="err"&gt;Voice&lt;/span&gt;          &lt;span class="err"&gt;Browser&lt;/span&gt; &lt;span class="err"&gt;SpeechRecognition&lt;/span&gt; &lt;span class="err"&gt;API&lt;/span&gt; &lt;span class="err"&gt;(free)&lt;/span&gt; &lt;span class="err"&gt;or&lt;/span&gt; &lt;span class="err"&gt;Whisper&lt;/span&gt; &lt;span class="err"&gt;(premium)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The frontend is deliberately framework-free: no React, no build pipeline, no bundled dependencies shipped to the browser. It loads quickly and works without an internet connection (except for cloud LLM calls).&lt;/p&gt;




&lt;h2&gt;
  
  
  LLM Provider Architecture
&lt;/h2&gt;

&lt;p&gt;The LLM layer is a thin factory that routes every call to whatever provider is active:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// services/llm.js&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;PROVIDERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;gemini&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;groq&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;streamChat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;onDelta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
  &lt;span class="nx"&gt;PROVIDERS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;getConfig&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;streamChat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;onDelta&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Switching providers happens at runtime — no restart needed. Every provider implements the same three-function contract:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nf"&gt;analyzeEntry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                              &lt;span class="c1"&gt;// → { mood, themes, reflection, followUpQuestion }&lt;/span&gt;
&lt;span class="nf"&gt;generateText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;userMessage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;         &lt;span class="c1"&gt;// → string&lt;/span&gt;
&lt;span class="nf"&gt;streamChat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;onDelta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// → full string, streams via onDelta&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
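&lt;p&gt;Because &lt;code&gt;analyzeEntry&lt;/code&gt; depends on the model returning well-formed JSON, it can help to validate the reply before storing it. The sketch below is one possible guard; the function name &lt;code&gt;parseAnalysis&lt;/code&gt; is hypothetical, and the field types (strings, plus an array of strings for &lt;code&gt;themes&lt;/code&gt;) are assumed from the shape shown above.&lt;/p&gt;

```javascript
// Parse and validate a model's JSON reply against the analyzeEntry shape.
// Returns null instead of throwing, so callers must handle failure explicitly.
function parseAnalysis(raw) {
  let data;
  try {
    data = JSON.parse(raw);
  } catch (err) {
    return null; // malformed JSON
  }
  if (data === null || typeof data !== "object") return null;

  // Assumed types: three string fields and an array of strings.
  const stringFields = ["mood", "reflection", "followUpQuestion"];
  const stringsOk = stringFields.every((f) => typeof data[f] === "string");
  const themesOk = Array.isArray(data.themes)
    ? data.themes.every((t) => typeof t === "string")
    : false;
  if (stringsOk === false || themesOk === false) return null;

  // Copy only the known fields so stray keys never reach the database.
  return {
    mood: data.mood,
    themes: data.themes,
    reflection: data.reflection,
    followUpQuestion: data.followUpQuestion,
  };
}
```

&lt;p&gt;A &lt;code&gt;null&lt;/code&gt; result gives the caller a clear point to retry or mark the entry as unanalyzed, instead of failing somewhere downstream.&lt;/p&gt;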



&lt;p&gt;Groq uses the OpenAI SDK pointed at &lt;code&gt;https://api.groq.com/openai/v1&lt;/code&gt;. Ollama uses the same SDK pointed at &lt;code&gt;http://localhost:11434/v1&lt;/code&gt;. Identical interface, completely different privacy properties.&lt;/p&gt;
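&lt;p&gt;One way to picture the runtime switch is a small registry of client options that is consulted on every call. The two base URLs come from the paragraph above; the registry, the &lt;code&gt;local&lt;/code&gt; flag, and the function names are assumptions made for this sketch.&lt;/p&gt;

```javascript
// Base URLs as given in the article; everything else here is illustrative.
const ENDPOINTS = new Map([
  ["ollama", { baseURL: "http://localhost:11434/v1", local: true }],
  ["groq", { baseURL: "https://api.groq.com/openai/v1", local: false }],
]);

let activeProvider = "ollama"; // local mode is the default

function setProvider(name) {
  if (ENDPOINTS.has(name) === false) {
    throw new Error("unknown provider: " + name);
  }
  activeProvider = name;
}

// Resolved on every call, so a settings change takes effect without a restart.
function getClientOptions() {
  return ENDPOINTS.get(activeProvider);
}

console.log(getClientOptions().baseURL); // prints "http://localhost:11434/v1"
```

&lt;p&gt;The options object is what would be handed to an OpenAI-compatible client; only the URL decides whether text stays on the machine.&lt;/p&gt;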




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Embeddings and LLMs are completely separate concerns.&lt;/strong&gt; The model that converts text to numbers has nothing to do with the model that generates answers. You can run Ollama for embeddings and Groq for chat simultaneously. The two are easy to conflate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. 7B–8B models are good enough for structured diary tasks.&lt;/strong&gt; A well-prompted &lt;code&gt;qwen2.5:7b&lt;/code&gt; handles mood detection, theme extraction, and journaling prompts reliably. The quality gap versus 70B only shows up in long-form weekly summaries. Set &lt;code&gt;format: "json"&lt;/code&gt; in the Ollama request for structured output; without it, small models will eventually return malformed JSON and break your pipeline silently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Cosine similarity belongs in your database, not a vector database.&lt;/strong&gt; For a personal app with thousands (not millions) of entries, &lt;code&gt;sqlite-vec&lt;/code&gt; and &lt;code&gt;pgvector&lt;/code&gt; are more than sufficient. No Pinecone, no Weaviate, no extra infra. The math is simple and fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. SSE over POST is the right call for streaming.&lt;/strong&gt; The standard advice is to use &lt;code&gt;EventSource&lt;/code&gt;, but &lt;code&gt;EventSource&lt;/code&gt; is GET-only. Chat requires POST (to send the message body). The fix is &lt;code&gt;fetch&lt;/code&gt; + &lt;code&gt;ReadableStream&lt;/code&gt; on the client — full control over the stream lifecycle, no awkward query-string payloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Crisis detection must run before the LLM, not inside it.&lt;/strong&gt; You cannot rely on an LLM to consistently detect crisis language and respond safely. Keyword matching before the LLM call is not elegant, but it is reliable and auditable. An LLM should never be the first line of defense for someone in crisis — it should never even get the message.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. The hardest engineering decisions in a privacy-first app are about what &lt;em&gt;not&lt;/em&gt; to do.&lt;/strong&gt; No analytics. No telemetry. No "anonymized" usage data. Every one of those is a useful product feature you give up — and giving them up is the point.&lt;/p&gt;
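&lt;p&gt;To illustrate lesson 4, here is a minimal sketch of reading an SSE-formatted stream from a POST request with &lt;code&gt;fetch&lt;/code&gt; and a stream reader. The URL, the request body shape, and the function names are hypothetical; the parser handles only &lt;code&gt;data:&lt;/code&gt; lines with LF line endings, which is a simplification of the full SSE format.&lt;/p&gt;

```javascript
// Split an SSE text buffer into complete events plus a leftover partial tail.
// Events are separated by a blank line; only `data:` lines are handled.
function parseSseChunk(buffer) {
  const parts = buffer.split("\n\n");
  const rest = parts.pop(); // possibly incomplete last event
  const events = parts
    .map((block) =>
      block
        .split("\n")
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice(5).replace(/^ /, "")) // drop one optional space
        .join("\n")
    )
    .filter((data) => data.length > 0);
  return { events, rest };
}

// POST a message and call onDelta once per streamed event.
// `fetchImpl` is injectable so the function can be exercised without a server.
async function streamChat(url, message, onDelta, fetchImpl = fetch) {
  const res = await fetchImpl(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message }),
  });
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    // Network chunks can end mid-event, so keep the tail for the next read.
    buffer += decoder.decode(value, { stream: true });
    const parsed = parseSseChunk(buffer);
    parsed.events.forEach((e) => onDelta(e));
    buffer = parsed.rest;
  }
}
```

&lt;p&gt;Buffering the incomplete tail is the part that is easy to miss: a chunk boundary can fall in the middle of an event, and parsing each chunk in isolation would drop or corrupt text.&lt;/p&gt;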




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;DiaryGPT is open source. Self-host it, read every line, verify the privacy claims.&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/rahul70-code/diarygpt" rel="noopener noreferrer"&gt;https://github.com/rahul70-code/diarygpt&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Your diary is yours. The AI should work for you, not harvest from you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Stack: Node.js · Ollama · SQLite · AES-256-GCM · Vanilla JS&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tags: #LLM #RAG #Privacy #LocalFirst #OpenSource&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>privacy</category>
      <category>ollama</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
