Rahul Talreja

Posted on May 23

Building a Private RAG System: Lessons from a Local-First AI Journal

#llm #ai #ollama #privacy

Most AI apps quietly send your data to the cloud. DiaryGPT does the opposite — and this is the full technical story.

The Problem With AI + Private Data

When you write in a journal, you write the things you'd never say out loud. The last thing you want is that text sitting on someone else's server, used to train a model, or exposed in a breach.

But AI is genuinely useful for journaling. It can find patterns you miss, reflect things back to you, ask questions a blank page never would. The tension is real: you want AI insight without sacrificing privacy.

Most apps solve this by trusting a privacy policy. I wanted a technical guarantee.

So I built DiaryGPT — an AI-powered personal journal where, by default, zero data leaves your machine. Here's exactly how it works.

What DiaryGPT Does

Before the architecture, here's what the app gives you:

AI mood analysis on every entry — mood, themes, a reflective response, and a follow-up question
RAG-powered chat — ask "when was I most anxious?" and get answers grounded in your actual entries
Semantic search — find entries by meaning, not keywords ("times I felt lonely" finds entries with "isolated", "disconnected", "blue")
Weekly reflection — AI summary of your emotional arc across the week
Personalized journaling prompts — generated from your recent writing patterns
Writing streaks and memories — "on this day last year you wrote…"
AI companion mode — CBT/DBT-grounded reflection with built-in crisis detection (not a replacement for a licensed therapist)
Mood check-ins — 1–10 logging with history chart
Voice dictation and voice chat — speak entries, hear responses read back
Full AES-256-GCM encryption at rest — every diary entry, chat message, and note

The Privacy Architecture

DiaryGPT has two modes. You choose in Settings.

🟢 Local Mode (Default)

Everything runs on your machine. The AI model, the search, the analysis — all local via Ollama.

Your diary entry
      ↓
Ollama (nomic-embed-text) → converts to numbers → saved in SQLite
      ↓
Ollama (llama3.2 / qwen2.5) → analyzes mood → saved encrypted

Zero data leaves your machine.

🟡 Cloud Mode (Opt-in)

For users who want stronger reasoning and are comfortable with diary excerpts passing through a third-party API. You bring your own API key (Groq, OpenAI, Anthropic, or Gemini), and the key is stored locally.

Your diary entry
      ↓
Ollama (embeddings) → still local, nothing sent
      ↓
Top 5 relevant excerpts → your provider's API → answer streams back

Only the retrieved excerpts and your question transit to the provider. Never the full diary.

The RAG Pipeline — How the AI "Remembers" Your Life

RAG stands for Retrieval-Augmented Generation. It's the technique that makes the AI feel like it actually knows you — without sending everything you've ever written to a language model on every request.

What is an Embedding?

Every diary entry gets converted into a list of numbers — like GPS coordinates for meaning.

"I felt anxious today"    → [0.21, 0.83, 0.12, 0.74, ...]
"I was really stressed"   → [0.22, 0.81, 0.14, 0.71, ...]  ← very similar
"I love hiking"           → [0.91, 0.12, 0.67, 0.23, ...]  ← very different

Similar meaning = similar numbers. This is what makes semantic search work — you search by concept, not exact words.
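To make "similar numbers" concrete, here is a minimal sketch of cosine similarity, the usual closeness measure for embeddings. It reuses the toy 4-number vectors from the example above; real nomic-embed-text vectors have hundreds of dimensions, and the function name is my own illustration rather than code from the DiaryGPT repo.

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|). 1 means same direction, 0 means unrelated.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same number of dimensions");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors copied from the example above (illustrative values only).
const anxious = [0.21, 0.83, 0.12, 0.74];
const stressed = [0.22, 0.81, 0.14, 0.71];
const hiking = [0.91, 0.12, 0.67, 0.23];

const close = cosineSimilarity(anxious, stressed);
const far = cosineSimilarity(anxious, hiking);
console.log(close > far); // prints true
```

The "anxious" and "stressed" vectors point in nearly the same direction, so their score lands close to 1, while the "hiking" vector scores much lower.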

Phase 1 — Writing an Entry

You write: "Today was rough. Felt anxious about the deadline."
                    ↓
       Ollama (nomic-embed-text)
       converts text → [0.21, 0.83, 0.12, 0.74, ...]
                    ↓
       Saved in SQLite / PostgreSQL:
         entry text    → AES-256-GCM encrypted
         embedding     → stored raw (math requires it)
         mood/themes   → analyzed by LLM, stored encrypted

This happens asynchronously — the entry saves immediately, analysis runs in the background.
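The save-first, analyze-later flow could be sketched roughly like this. Everything here is an assumption for illustration: the in-memory Map stands in for SQLite/PostgreSQL, the function and field names are hypothetical, and the encryption step is left out. The embedding call targets Ollama's /api/embed endpoint on its default local port.

```javascript
// Hypothetical in-memory store standing in for the real database.
const entries = new Map();
let nextId = 1;

// Asks a local Ollama server for an embedding of the given text.
async function embed(text, fetchImpl = fetch) {
  const res = await fetchImpl("http://localhost:11434/api/embed", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", input: text }),
  });
  if (!res.ok) throw new Error(`Embedding request failed: HTTP ${res.status}`);
  const data = await res.json();
  return data.embeddings[0]; // one input string gives one vector
}

// Stores the entry synchronously, then fills in the embedding in the background.
// Note: a real implementation would encrypt the text before storing it.
function saveEntry(text, embedFn = embed) {
  const id = nextId++;
  const entry = { text, embedding: null, status: "pending" };
  entries.set(id, entry);

  const background = embedFn(text)
    .then((embedding) => {
      entry.embedding = embedding;
      entry.status = "ready";
    })
    .catch((err) => {
      // The entry itself is already saved; only the analysis is marked failed.
      entry.status = "failed";
      entry.error = err.message;
    });

  return { id, background };
}
```

The caller gets the id back right away and can ignore the background promise, so a slow or offline model never blocks writing.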

Phase 2 — Asking a Question

You ask: "When did I feel anxious about work?"
                    ↓
       Ollama converts question → numbers
                    ↓
       Cosine similarity search runs in YOUR database
       (sqlite-vec or pgvector: pure math, no external call)
                    ↓
       entry A: 0.91 match ✓
       entry B: 0.87 match ✓
       entry C: 0.79 match ✓
       entry D: 0.31 match ✗  (skipped)
                    ↓
       Top 5 entries decrypted in memory
                    ↓
       LLM receives: system prompt + diary excerpts + your question
                    ↓
       Streams answer word by word (SSE)

The key insight: embeddings find what to read. The LLM decides what to say about it.

The LLM never sees your full diary — only the 5 most relevant entries. Cosine similarity runs entirely on your server. Nothing goes to an external service unless you've opted into cloud mode.
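As a rough sketch of the last two steps (choosing which entries to read, then assembling what the LLM receives), assuming the similarity scores have already been computed by the database. The cutoff value, the field names, and the prompt layout are my assumptions, not values taken from the project.

```javascript
const TOP_K = 5; // matches the "5 most relevant entries" described above
const MIN_SCORE = 0.5; // hypothetical cutoff; the real threshold is not stated

// Keeps the best-scoring entries above the cutoff, highest score first.
function selectExcerpts(scored, k = TOP_K, minScore = MIN_SCORE) {
  return scored
    .filter((entry) => entry.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Builds the chat messages: system prompt + diary excerpts + the user's question.
function buildMessages(systemPrompt, excerpts, question) {
  const context = excerpts
    .map((entry, i) => `[${i + 1}] ${entry.date}: ${entry.text}`)
    .join("\n");
  return [
    { role: "system", content: `${systemPrompt}\n\nDiary excerpts:\n${context}` },
    { role: "user", content: question },
  ];
}

// Made-up sample data mirroring the scores in the diagram above.
const scored = [
  { date: "2024-03-09", text: "Went on a hiking trip.", score: 0.31 },
  { date: "2024-03-04", text: "Deadline panic all afternoon.", score: 0.91 },
  { date: "2024-03-06", text: "Could not sleep before the review.", score: 0.87 },
  { date: "2024-03-07", text: "Tense standup again.", score: 0.79 },
];

const picked = selectExcerpts(scored);
const messages = buildMessages(
  "Answer only from the excerpts provided.",
  picked,
  "When did I feel anxious about work?"
);
console.log(picked.length); // prints 3
```

Only the text in the messages array ever reaches the model, which is why the size of the retrieved slice, not the size of the diary, determines what can leave the machine in cloud mode.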

The Companion Pipeline — Safety First

The companion mode is built around one rule: if someone is in crisis, the LLM never runs.

You type a message
        ↓
Crisis detection (keyword matching, server-side)
"suicide", "hurt myself", "want to die", etc.
        ↓
    CRISIS?          SAFE?
      ↓                 ↓
Hardcoded response   LLM runs with CBT/DBT prompt
988 + Crisis Text    Acknowledges → reflects → one question
Line + findahelpline
LLM never called     Saves encrypted to companion_messages

The crisis response is hardcoded. It cannot be hallucinated, modified, or bypassed by a clever prompt. The companion banner — "This is an AI companion, not a licensed therapist" — is also hardcoded in the UI, never AI-generated.

The companion system uses a distinct system prompt built around CBT thought-reframing, DBT skills, and reflective listening. Sessions are saved and resumable.

A real limitation worth naming: keyword detection catches explicit phrases like "I want to die" but will miss oblique crisis language like "I just want it to stop" or "everyone would be better off without me." A small local classifier as a second layer is on the roadmap — keyword filter as the fast, auditable first line, classifier as the safety net for implicit signals.
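A keyword gate of this kind might look roughly like the sketch below. The phrase list is shortened to the three examples quoted above, the response wording is a placeholder, and the function names are hypothetical; the point is only the ordering, where the check runs first and the LLM callback is never invoked on a match.

```javascript
// Shortened phrase list for illustration; a real list would be longer.
const CRISIS_PHRASES = ["suicide", "hurt myself", "want to die"];

// Placeholder wording. The real response is fixed text, never model output.
const CRISIS_RESPONSE =
  "It sounds like you may be in crisis. Please reach out now: 988, " +
  "Crisis Text Line, or findahelpline.";

// Case-insensitive substring match against the phrase list.
function isCrisis(message) {
  const normalized = message.toLowerCase();
  return CRISIS_PHRASES.some((phrase) => normalized.includes(phrase));
}

// Runs the gate before anything else; callLlm is only reached for safe messages.
async function handleCompanionMessage(message, callLlm) {
  if (isCrisis(message)) {
    return { source: "hardcoded", text: CRISIS_RESPONSE };
  }
  return { source: "llm", text: await callLlm(message) };
}

console.log(isCrisis("I want to die")); // prints true
console.log(isCrisis("I just want it to stop")); // prints false
```

The second call shows the limitation named above: oblique phrasing passes straight through a substring match, which is the gap a second-layer classifier would need to cover.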

The Encryption Layer

Every piece of user content goes through AES-256-GCM encryption before hitting the database.

// Every diary entry, chat message, companion note goes through this
encrypt(text)   // before DB insert
decrypt(text)   // after DB read, before sending to LLM or browser

The encryption key is yours: a 64-character hex string (32 bytes, the 256 bits that AES-256 requires) that you generate and store in your .env. Without it, the encrypted content in the database is unreadable. The server never transmits the key.

The one exception: embedding vectors are stored unencrypted. Cosine similarity requires the raw numbers. The chunk text that generated the embedding is stored separately, encrypted. The security boundary lives at the source text, not the derived vector.
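For readers who have not used GCM from Node.js, here is a minimal sketch of what an encrypt/decrypt pair can look like with the built-in crypto module. The payload layout (iv.tag.ciphertext, base64 joined with dots), the explicit key parameter, and the loadKey helper are my assumptions; the project's own functions may store things differently.

```javascript
const crypto = require("node:crypto");

// Assumption: the key is the 64-character hex string from .env (32 bytes).
function loadKey(hex) {
  if (!/^[0-9a-fA-F]{64}$/.test(hex)) {
    throw new Error("Key must be exactly 64 hex characters");
  }
  return Buffer.from(hex, "hex");
}

// Encrypts UTF-8 text and returns "iv.tag.ciphertext", each part base64-encoded.
function encrypt(plaintext, key) {
  const iv = crypto.randomBytes(12); // fresh 96-bit nonce for every message
  const cipher = crypto.createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag(); // authentication tag, checked on decrypt
  return [iv, tag, ciphertext].map((part) => part.toString("base64")).join(".");
}

// Reverses encrypt(); throws if the key is wrong or the payload was altered.
function decrypt(payload, key) {
  const [iv, tag, ciphertext] = payload
    .split(".")
    .map((part) => Buffer.from(part, "base64"));
  const decipher = crypto.createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}

// Usage with a throwaway key generated on the spot.
const demoKey = loadKey(crypto.randomBytes(32).toString("hex"));
const sealed = encrypt("Today was rough.", demoKey);
console.log(decrypt(sealed, demoKey)); // prints "Today was rough."
```

GCM is an authenticated mode, so a tampered row fails loudly at decrypt time instead of yielding garbage text, and the random nonce means the same entry never encrypts to the same bytes twice.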

The Technical Stack

Runtime        Node.js + Express
Frontend       Vanilla JS SPA (no build step, no framework)
Auth           JWT + Argon2id password hashing
Encryption     AES-256-GCM (Node.js crypto module)
Storage        SQLite (local default) or PostgreSQL (multi-device)
Vector search  sqlite-vec (local) or pgvector (Postgres)
Embeddings     Ollama nomic-embed-text (local default)
LLM            Ollama (local default) / Groq / OpenAI / Gemini / Anthropic
Streaming      SSE (Server-Sent Events) over POST with ReadableStream
Voice          Browser SpeechRecognition API (free) or Whisper (premium)

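The frontend is deliberately no-framework: no React, no build pipeline, no bundled npm dependencies shipped to the browser. With no bundle to download and parse it loads quickly, and in local mode it works offline (cloud LLM calls are the only part that needs a network).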

LLM Provider Architecture

The LLM layer is a thin factory that routes every call to whatever provider is active:

// services/llm.js
const PROVIDERS = { ollama, anthropic, openai, gemini, groq };

export const streamChat = (history, message, context, onDelta) =>
  PROVIDERS[getConfig().provider].streamChat(history, message, context, onDelta);

Switching providers happens at runtime — no restart needed. Every provider implements the same three-function contract:

analyzeEntry(text)                              // → { mood, themes, reflection, followUpQuestion }
generateText(systemPrompt, userMessage)         // → string
streamChat(history, message, context, onDelta)  // → full string, streams via onDelta
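Because analyzeEntry promises a fixed shape, it helps to validate what the model returns before anything is stored. The sketch below is one way to do that against Ollama's /api/chat endpoint with its format: "json" option. The system prompt wording and the field types (themes as an array of strings, the rest as strings) are my assumptions about that shape.

```javascript
// Builds a non-streaming /api/chat request body with JSON mode switched on.
function buildAnalysisRequest(entryText, model = "qwen2.5:7b") {
  return {
    model,
    stream: false,
    format: "json", // asks Ollama to constrain the output to valid JSON
    messages: [
      {
        role: "system",
        content: "Return JSON with keys mood, themes, reflection, followUpQuestion.",
      },
      { role: "user", content: entryText },
    ],
  };
}

// Parses the model's reply and fails loudly if the shape is wrong.
function parseAnalysis(raw) {
  let data;
  try {
    data = JSON.parse(raw);
  } catch (err) {
    throw new Error(`Analysis was not valid JSON: ${err.message}`);
  }
  const { mood, themes, reflection, followUpQuestion } = data ?? {};
  const valid =
    typeof mood === "string" &&
    Array.isArray(themes) &&
    themes.every((theme) => typeof theme === "string") &&
    typeof reflection === "string" &&
    typeof followUpQuestion === "string";
  if (!valid) throw new Error("Analysis JSON is missing required fields");
  return { mood, themes, reflection, followUpQuestion };
}
```

JSON mode guarantees syntax, not schema, so the explicit field check is what turns a silently broken pipeline into a visible error.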

Groq uses the OpenAI SDK pointed at https://api.groq.com/openai/v1. Ollama uses the same SDK pointed at http://localhost:11434/v1. Identical interface, completely different privacy properties.
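To show why only the base URL needs to change, here is a small sketch that talks to the OpenAI-compatible /chat/completions route directly with fetch instead of the OpenAI SDK the project uses. The factory name, the placeholder API key for Ollama, and the Groq model name are illustrative assumptions.

```javascript
// Creates a provider for any server exposing an OpenAI-compatible chat endpoint.
function createOpenAiCompatibleProvider({ baseURL, apiKey, model, fetchImpl = fetch }) {
  return {
    async generateText(systemPrompt, userMessage) {
      const res = await fetchImpl(`${baseURL}/chat/completions`, {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: `Bearer ${apiKey}`,
        },
        body: JSON.stringify({
          model,
          messages: [
            { role: "system", content: systemPrompt },
            { role: "user", content: userMessage },
          ],
        }),
      });
      if (!res.ok) throw new Error(`Provider returned HTTP ${res.status}`);
      const data = await res.json();
      return data.choices[0].message.content;
    },
  };
}

// Same code path, different destination: one stays on localhost, one leaves the machine.
const providers = {
  ollama: createOpenAiCompatibleProvider({
    baseURL: "http://localhost:11434/v1",
    apiKey: "ollama", // placeholder; a local Ollama server does not check it
    model: "llama3.2",
  }),
  groq: createOpenAiCompatibleProvider({
    baseURL: "https://api.groq.com/openai/v1",
    apiKey: process.env.GROQ_API_KEY,
    model: "llama-3.1-8b-instant", // example model name, swap in your own
  }),
};
```

Since the request shape is identical, the privacy difference comes down entirely to which base URL is configured, which makes that one setting worth surfacing clearly in the UI.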

What I Learned

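1. Embeddings and LLMs are completely separate concerns. The model that converts text to numbers has nothing to do with the model that generates answers, so you can run Ollama for embeddings and Groq for chat simultaneously. The two are easy to conflate. The one constraint is consistency: questions must be embedded with the same model as the stored entries, or the similarity scores are meaningless.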

2. 7B–8B models are good enough for structured diary tasks. Mood detection, theme extraction, journaling prompts: a well-prompted qwen2.5:7b handles all of these reliably. The quality gap versus 70B only shows up in long-form weekly summaries. Set format: "json" on Ollama requests for structured output; without it, small models will eventually return malformed JSON, and unless you validate the response, your pipeline breaks silently.

3. Cosine similarity belongs in the database you already run, not in a separate vector database service. For a personal app with thousands (not millions) of entries, sqlite-vec and pgvector are more than sufficient. No Pinecone, no Weaviate, no extra infra. The math is simple and fast.

4. SSE over POST is the right call for streaming. The standard advice is to use EventSource, but EventSource is GET-only. Chat requires POST (to send the message body). The fix is fetch + ReadableStream on the client — full control over the stream lifecycle, no awkward query-string payloads.

5. Crisis detection must run before the LLM, not inside it. You cannot rely on an LLM to consistently detect crisis language and respond safely. Keyword matching before the LLM call is not elegant, but it is reliable and auditable. An LLM should never be the first line of defense for someone in crisis — it should never even get the message.

6. The hardest engineering decisions in a privacy-first app are about what not to do. No analytics. No telemetry. No "anonymized" usage data. Every one of those is a useful product feature you give up — and giving them up is the point.
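To illustrate lesson 4, here is a minimal client-side sketch of reading an SSE stream from a POST request with fetch and ReadableStream. The parser and function names are my own, it assumes the server separates events with a blank line ("\n\n") and sends each text delta in a data: field, and it is not the project's actual client code.

```javascript
// Splits incoming text into SSE events, coping with events cut across chunks.
function createSseParser(onData) {
  let buffer = "";
  return function feed(chunk) {
    buffer += chunk;
    let boundary;
    while ((boundary = buffer.indexOf("\n\n")) !== -1) {
      const rawEvent = buffer.slice(0, boundary);
      buffer = buffer.slice(boundary + 2);
      const data = rawEvent
        .split("\n")
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice(5).replace(/^ /, "")) // drop one optional space
        .join("\n");
      if (data) onData(data);
    }
  };
}

// POSTs the chat payload and calls onDelta for every event as it arrives.
async function streamChat(url, body, onDelta, fetchImpl = fetch) {
  const res = await fetchImpl(url, {
    method: "POST",
    headers: { "Content-Type": "application/json", Accept: "text/event-stream" },
    body: JSON.stringify(body),
  });
  if (!res.ok || !res.body) throw new Error(`Stream request failed: HTTP ${res.status}`);

  const feed = createSseParser(onDelta);
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries
    feed(decoder.decode(value, { stream: true }));
  }
}
```

A call might look like streamChat("/api/chat", { message: "How was my week?" }, (delta) => render(delta)), where the endpoint path and the render function are placeholders. Buffering until the blank-line separator matters because network chunks do not line up with event boundaries.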

Try It

DiaryGPT is open source. Self-host it, read every line, verify the privacy claims.

🔗 GitHub: https://github.com/rahul70-code/diarygpt

Your diary is yours. The AI should work for you, not harvest from you.

Stack: Node.js · Ollama · SQLite · AES-256-GCM · Vanilla JS

Tags: #LLM #RAG #Privacy #LocalFirst #OpenSource

Top comments (1)

Harjot Singh • May 31

A journal is the perfect forcing function for private RAG, because it's the one dataset where "send my most personal thoughts to a third-party API" is an obvious non-starter - so local-first isn't a nice-to-have, it's the requirement. That constraint makes you solve the genuinely hard parts: local embeddings good enough to retrieve well, a vector store that stays fast as entries pile up, and chunking that respects the fact that journal entries are temporal and personal (today's entry relates to last month's, not just semantically-nearest text). Most RAG demos dodge all of that with a cloud API and a clean corpus.

The lesson I'd guess bit hardest: retrieval quality on personal, messy, time-ordered text is way harder than on clean docs, and a wrong retrieval here doesn't just give a bad answer, it surfaces the wrong memory back to you. That "be right about what you surface, and abstain when unsure" discipline is core to how I build Moonshift, the thing I work on - a multi-agent pipeline that takes a prompt to a deployed SaaS, where context is verified, not just nearest-match. Multi-model routing keeps a build ~$3 flat, first run free no card. Genuinely like this project. What's the embedding model you settled on for local, and did recency/time-weighting beat pure semantic similarity for journal retrieval? I'd bet time-aware retrieval matters more here than almost any other RAG use case.