Gunjan Tailor

Posted on May 30 • Edited on Jun 8

I Built a Local-First AI Desktop Knowledge Base — Here's What I Learned

#ai #rag #python #opensource

# I Built a Local-First AI Desktop Knowledge Base — Here's What I Learned

After building docnest-ai — a hybrid RAG engine for Python — the next logical question was: what does a great end-user app built on top of it actually look like?

That question led me to build Knovex: a local-first, AI-powered desktop knowledge base that runs entirely on your machine. No cloud uploads. No subscriptions. No data leakage. Just drop in your documents, ask questions, and learn.

This post covers the architecture decisions, the problems I hit, and the interesting technical bits. If you want to skip straight to the app: tailorgunjan93.github.io/knovex

Why build a desktop app in 2026?

Every AI knowledge tool I tried had the same deal: your documents leave your machine. Legal contracts, research notes, personal journals — all uploaded to some company's inference server. The privacy trade-off felt wrong.

The local-first principle changes the threat model entirely:

Your files never leave your machine unless you choose to enable cloud features
The app works fully offline (use Ollama for a zero-network setup)
API keys are encrypted at rest with Fernet AES-128, readable only by your OS account

The constraint also forced better engineering. When you can't lean on a cloud backend, you have to make the local stack actually fast.

Architecture overview

Knovex is a fully decoupled tri-layer app:

┌─────────────────────────────────────────┐
│  Electron 33 (desktop shell)            │
│  ┌─────────────────────────────────┐    │
│  │  React 18 + MUI v6 + TypeScript │    │
│  │  TanStack Query v5 + Zustand    │    │
│  └──────────────┬──────────────────┘    │
└─────────────────│───────────────────────┘
                  │  REST + SSE  (localhost:8765)
┌─────────────────▼───────────────────────┐
│  FastAPI + Python 3.11                  │
│  docnest-ai (hybrid RAG engine)         │
│  SQLite WAL + FTS5                      │
│  LiteLLM (multi-provider LLM bridge)    │
└─────────────────────────────────────────┘

The frontend is a pure API consumer — it knows nothing about RAG, embeddings, or LLMs. All intelligence lives in the Python backend. This made it very easy to swap out components independently.

Why Electron?

Electron gets a bad reputation, but for a privacy-first desktop app it's the right call:

Single installer ships backend binary (PyInstaller) + frontend + Electron in one .exe/.dmg/.AppImage
The backend process is spawned as a child process, communicates over localhost
Window state, tray, native OS file dialogs — all handled properly
Cross-platform with one codebase

The binary is ~85-92 MB depending on platform. Not tiny, but users get zero setup — no Python, no Node, no CLI gymnastics.

The RAG engine: docnest-ai

Rather than naive chunking (split every 512 chars → embed → hope), docnest-ai runs a 6-stage normalization pipeline:

Structure extraction — reads heading hierarchy, tables, lists (Docling or PyMuPDF)
Section assignment — every heading becomes a navigable §section
Table normalization — { caption, headers, rows[] } JSON, never loses column context
Section summarization — LLM called once per document
Document intelligence — summary, key numbers, insights
Embedding + quantize — BM25 keywords + float16 vectors

Stages 1–3 and 6 run locally at zero LLM cost. Stages 4–5 call an LLM once per document at ingest time. Every future query benefits from that upfront investment for free.

Query resolution: five layers

The query engine tries cheaper layers first before escalating:

Layer	Mechanism	Tokens	Latency
L0	Pre-computed summary/insights	0	< 1ms
L1	BM25 + cosine → navigate to §section	0	< 20ms
L2	Section-scoped LLM	~300	1–3s
L3	Multi-section synthesis	~900	2–5s
L4	Full-document fallback	~4000+	5–15s

In practice, L0+L1 answer ~70% of real-world questions at zero LLM cost. You only pay when you genuinely need the model.

Semantic search (v0.7.0+)

For Knovex v0.7.0 I added hybrid semantic search on top:

# ONNX-based local embeddings (all-MiniLM-L6-v2, ~45 MB, one-time download)
# OR OpenAI text-embedding-3-small via API

# Results fused with Reciprocal Rank Fusion (RRF):
# score = 1/(k + rank_fts5) + 1/(k + rank_ann)

RRF fusion handles the case where BM25 ranks a document high on keyword match but the semantic model ranks it high on conceptual similarity. The union tends to beat either individually.

Average query latency on a typical KB is still sub-millisecond for the FTS5 path and ~0.9s end-to-end including the LLM call on an M-series Mac.

Learn Mode: turning documents into learning sessions

This was the most fun feature to build. The idea: instead of just answering questions, the app can generate structured learning content from any document or topic.

Nine formats, all streaming via SSE:

Quiz — interactive MCQ with XP rewards per question
Flashcards — spaced repetition with interval scheduling
Mind Map — collapsible JSON tree rendered with D3
Timeline — chronological events extracted from the text
Guided — step-by-step walkthrough via GuidedViewer
Story — narrative markdown retelling of the content
ELI5 — explain like I'm five
Brainstorm — creative connections and lateral ideas
Speed Learn — bullet-point summary for fast review

The JSON formats (Quiz, Flashcards, Mind Map, Timeline) use a two-phase approach: LLM generates structured JSON → parse → re-stream the parsed results. Text formats (Story, ELI5, etc.) stream in real-time token by token.

Gamification

I added XP, level progression (10 tiers), daily streaks, and achievement badges. This was partly experimental — does adding game mechanics to a local productivity tool actually improve usage? Anecdotally yes: the streak counter creates a small daily habit pull.

The Progress Page (v0.8.0) shows:

26-week activity heatmap (sessions per day, colour-coded)
Learning velocity chart (sessions/week + active days/week dual-axis)
XP level with badge
Week-over-week session delta

Design patterns used throughout

Adapter pattern (anti-corruption layer)

Every third-party dependency sits behind a swappable interface:

# backend/adapters/llm_client.py
class ILLMClient(Protocol):
    async def complete(self, messages: list[dict], **kwargs) -> str: ...
    async def stream(self, messages: list[dict], **kwargs) -> AsyncIterator[str]: ...

class LiteLLMAdapter(ILLMClient):
    """Wraps litellm — the only place litellm is imported"""
    ...

class StubLLMClient(ILLMClient):
    """Used in tests — zero network calls"""
    ...

Same pattern for: HTTP client (httpx), PDF parser (PyMuPDF / Docling), web search (DuckDuckGo / Serper / Brave), paragraph parser (python-docx).

This made testing painless — all 61 E2E tests mock at the adapter boundary.

Strategy + plugin registration for parsers

_PARSERS: dict[str, type[IFileParser]] = {}

def register_parser(ext: str):
    def decorator(cls):
        _PARSERS[ext] = cls
        return cls
    return decorator

@register_parser(".pdf")
class PDFParser(IFileParser): ...

@register_parser(".docx")
class DocxParser(IFileParser): ...

Adding a new file format means writing one class and adding one decorator. Zero changes to the orchestration layer.

EventBus for decoupled notifications

# In-process typed EventBus — no external dependencies
bus = EventBus()

@dataclass
class FileIngested:
    file_id: str
    kb_id: str
    chunk_count: int

bus.emit_typed(FileIngested(file_id=..., kb_id=..., chunk_count=42))

The watcher service (which detects stale/missing files) communicates with the KB service through events rather than direct calls. This kept the service layer clean.

Challenges worth noting

SQLite WAL mode + concurrent async writes — FastAPI runs async, and SQLite's WAL mode handles readers well but writers queue. I had to add retry logic with exponential backoff for the ingestion pipeline, which can run as a background task while chat is active.

PyInstaller + Python 3.11 + ONNX — packaging the ONNX runtime into a PyInstaller binary was the most painful part of the v0.7.0 release. The model weights need to be bundled correctly, paths resolved at runtime via sys._MEIPASS. Worth documenting if you're going down this path.

SSE streaming through Electron's IPC — Electron's fetch API handles SSE properly, but the preload script needed explicit keep-alive handling to prevent the renderer from killing long-running streams during Learn Mode generation (which can take 10–30 seconds for complex documents).

Windows SmartScreen — unsigned NSIS installers get flagged. Adding instructions to the download page for "More info → Run anyway" reduced support questions significantly.

What's next

Phase 2 of Knovex moves toward cloud + organisation features:

Cloud Portal — web admin for org key management and user management
3 deployment modes — Personal (own keys) / Organisation (portal-managed) / Self-hosted (Docker)
LangGraph agent orchestration — beyond single-turn Q&A
Visual workflow builder — chain operations on your KB
Mobile app — React Native, same backend API
Plugin/connector marketplace — Notion, Confluence, GitHub, etc.

Try it

App: tailorgunjan93.github.io/knovex — free one-click installer for Windows, macOS, Linux

GitHub: github.com/tailorgunjan93/knovex

RAG engine: pip install docnest-ai

MIT licensed. v0.10.0 is stable with 61 E2E tests passing.

Happy to answer questions about any part of the stack in the comments.

Top comments (2)

Harjot Singh • May 31

Local-first AI knowledge base hits two things people increasingly want: privacy (your notes never leave) and zero per-query cost. The hard part you probably ran into is retrieval quality on local, a small embedding model plus a keyword fallback often beats trying to push everything through one local LLM. And the unsexy truth is the value is mostly in ingestion and chunking, not the model, garbage in means useless recall out. I think about that same local-vs-hosted, retrieval-first tradeoff in Moonshift. What did you learn the hard way, the embedding/retrieval quality or just keeping it fast enough running locally?

Gunjan Tailor • Jun 2

You're describing Knovex's thesis almost exactly — retrieval-first, model second.

Honest answer to your question: retrieval quality was the hard-won lesson; speed mostly took care of itself. I expected local performance to be the wall, but BM25 over SQLite FTS5 + a dense ANN pass lands around ~1 ms/query — keeping it fast was never really the problem.

Quality was the grind. A small local embedding model alone (MiniLM-class) gave mediocre recall on its own; what actually moved the needle was fusing keyword + dense with RRF and then reranking the top-k with a cross-encoder. The keyword leg quietly saves you on exact terms, IDs, and rare jargon that a tiny embedding model fumbles.

And +1 hard on your "garbage in" point — most of my effort went into ingestion: normalization, section assignment, table structure. Better chunks beat a better model every time at this scale.

How are you splitting it in Moonshift — single embedding model, or also running a keyword/hybrid leg?