DEV Community

Cover image for I Built a Local-First AI Desktop Knowledge Base — Here's What I Learned
Gunjan Tailor
Gunjan Tailor

Posted on

I Built a Local-First AI Desktop Knowledge Base — Here's What I Learned

 # I Built a Local-First AI Desktop Knowledge Base — Here's What I Learned

After building docnest-ai — a hybrid RAG engine for Python — the next logical question was: what does a great end-user app built on top of it actually look like?

That question led me to build Knovex: a local-first, AI-powered desktop knowledge base that runs entirely on your machine. No cloud uploads. No subscriptions. No data leakage. Just drop in your documents, ask questions, and learn.

This post covers the architecture decisions, the problems I hit, and the interesting technical bits. If you want to skip straight to the app: tailorgunjan93.github.io/knovex


Why build a desktop app in 2026?

Every AI knowledge tool I tried had the same deal: your documents leave your machine. Legal contracts, research notes, personal journals — all uploaded to some company's inference server. The privacy trade-off felt wrong.

The local-first principle changes the threat model entirely:

  • Your files never leave your machine unless you choose to enable cloud features
  • The app works fully offline (use Ollama for a zero-network setup)
  • API keys are encrypted at rest with Fernet AES-128, readable only by your OS account

The constraint also forced better engineering. When you can't lean on a cloud backend, you have to make the local stack actually fast.


Architecture overview

Knovex is a fully decoupled tri-layer app:

┌─────────────────────────────────────────┐
│  Electron 33 (desktop shell)            │
│  ┌─────────────────────────────────┐    │
│  │  React 18 + MUI v6 + TypeScript │    │
│  │  TanStack Query v5 + Zustand    │    │
│  └──────────────┬──────────────────┘    │
└─────────────────│───────────────────────┘
                  │  REST + SSE  (localhost:8765)
┌─────────────────▼───────────────────────┐
│  FastAPI + Python 3.11                  │
│  docnest-ai (hybrid RAG engine)         │
│  SQLite WAL + FTS5                      │
│  LiteLLM (multi-provider LLM bridge)    │
└─────────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

The frontend is a pure API consumer — it knows nothing about RAG, embeddings, or LLMs. All intelligence lives in the Python backend. This made it very easy to swap out components independently.

Why Electron?

Electron gets a bad reputation, but for a privacy-first desktop app it's the right call:

  • Single installer ships backend binary (PyInstaller) + frontend + Electron in one .exe/.dmg/.AppImage
  • The backend process is spawned as a child process, communicates over localhost
  • Window state, tray, native OS file dialogs — all handled properly
  • Cross-platform with one codebase

The binary is ~85-92 MB depending on platform. Not tiny, but users get zero setup — no Python, no Node, no CLI gymnastics.


The RAG engine: docnest-ai

Rather than naive chunking (split every 512 chars → embed → hope), docnest-ai runs a 6-stage normalization pipeline:

  1. Structure extraction — reads heading hierarchy, tables, lists (Docling or PyMuPDF)
  2. Section assignment — every heading becomes a navigable §section
  3. Table normalization{ caption, headers, rows[] } JSON, never loses column context
  4. Section summarization — LLM called once per document
  5. Document intelligence — summary, key numbers, insights
  6. Embedding + quantize — BM25 keywords + float16 vectors

Stages 1–3 and 6 run locally at zero LLM cost. Stages 4–5 call an LLM once per document at ingest time. Every future query benefits from that upfront investment for free.

Query resolution: five layers

The query engine tries cheaper layers first before escalating:

Layer Mechanism Tokens Latency
L0 Pre-computed summary/insights 0 < 1ms
L1 BM25 + cosine → navigate to §section 0 < 20ms
L2 Section-scoped LLM ~300 1–3s
L3 Multi-section synthesis ~900 2–5s
L4 Full-document fallback ~4000+ 5–15s

In practice, L0+L1 answer ~70% of real-world questions at zero LLM cost. You only pay when you genuinely need the model.

Semantic search (v0.7.0+)

For Knovex v0.7.0 I added hybrid semantic search on top:

# ONNX-based local embeddings (all-MiniLM-L6-v2, ~45 MB, one-time download)
# OR OpenAI text-embedding-3-small via API

# Results fused with Reciprocal Rank Fusion (RRF):
# score = 1/(k + rank_fts5) + 1/(k + rank_ann)
Enter fullscreen mode Exit fullscreen mode

RRF fusion handles the case where BM25 ranks a document high on keyword match but the semantic model ranks it high on conceptual similarity. The union tends to beat either individually.

Average query latency on a typical KB is still sub-millisecond for the FTS5 path and ~0.9s end-to-end including the LLM call on an M-series Mac.


Learn Mode: turning documents into learning sessions

This was the most fun feature to build. The idea: instead of just answering questions, the app can generate structured learning content from any document or topic.

Nine formats, all streaming via SSE:

  • Quiz — interactive MCQ with XP rewards per question
  • Flashcards — spaced repetition with interval scheduling
  • Mind Map — collapsible JSON tree rendered with D3
  • Timeline — chronological events extracted from the text
  • Guided — step-by-step walkthrough via GuidedViewer
  • Story — narrative markdown retelling of the content
  • ELI5 — explain like I'm five
  • Brainstorm — creative connections and lateral ideas
  • Speed Learn — bullet-point summary for fast review

The JSON formats (Quiz, Flashcards, Mind Map, Timeline) use a two-phase approach: LLM generates structured JSON → parse → re-stream the parsed results. Text formats (Story, ELI5, etc.) stream in real-time token by token.

Gamification

I added XP, level progression (10 tiers), daily streaks, and achievement badges. This was partly experimental — does adding game mechanics to a local productivity tool actually improve usage? Anecdotally yes: the streak counter creates a small daily habit pull.

The Progress Page (v0.8.0) shows:

  • 26-week activity heatmap (sessions per day, colour-coded)
  • Learning velocity chart (sessions/week + active days/week dual-axis)
  • XP level with badge
  • Week-over-week session delta

Design patterns used throughout

Adapter pattern (anti-corruption layer)

Every third-party dependency sits behind a swappable interface:

# backend/adapters/llm_client.py
class ILLMClient(Protocol):
    async def complete(self, messages: list[dict], **kwargs) -> str: ...
    async def stream(self, messages: list[dict], **kwargs) -> AsyncIterator[str]: ...

class LiteLLMAdapter(ILLMClient):
    """Wraps litellm — the only place litellm is imported"""
    ...

class StubLLMClient(ILLMClient):
    """Used in tests — zero network calls"""
    ...
Enter fullscreen mode Exit fullscreen mode

Same pattern for: HTTP client (httpx), PDF parser (PyMuPDF / Docling), web search (DuckDuckGo / Serper / Brave), paragraph parser (python-docx).

This made testing painless — all 61 E2E tests mock at the adapter boundary.

Strategy + plugin registration for parsers

_PARSERS: dict[str, type[IFileParser]] = {}

def register_parser(ext: str):
    def decorator(cls):
        _PARSERS[ext] = cls
        return cls
    return decorator

@register_parser(".pdf")
class PDFParser(IFileParser): ...

@register_parser(".docx")
class DocxParser(IFileParser): ...
Enter fullscreen mode Exit fullscreen mode

Adding a new file format means writing one class and adding one decorator. Zero changes to the orchestration layer.

EventBus for decoupled notifications

# In-process typed EventBus — no external dependencies
bus = EventBus()

@dataclass
class FileIngested:
    file_id: str
    kb_id: str
    chunk_count: int

bus.emit_typed(FileIngested(file_id=..., kb_id=..., chunk_count=42))
Enter fullscreen mode Exit fullscreen mode

The watcher service (which detects stale/missing files) communicates with the KB service through events rather than direct calls. This kept the service layer clean.


Challenges worth noting

SQLite WAL mode + concurrent async writes — FastAPI runs async, and SQLite's WAL mode handles readers well but writers queue. I had to add retry logic with exponential backoff for the ingestion pipeline, which can run as a background task while chat is active.

PyInstaller + Python 3.11 + ONNX — packaging the ONNX runtime into a PyInstaller binary was the most painful part of the v0.7.0 release. The model weights need to be bundled correctly, paths resolved at runtime via sys._MEIPASS. Worth documenting if you're going down this path.

SSE streaming through Electron's IPC — Electron's fetch API handles SSE properly, but the preload script needed explicit keep-alive handling to prevent the renderer from killing long-running streams during Learn Mode generation (which can take 10–30 seconds for complex documents).

Windows SmartScreen — unsigned NSIS installers get flagged. Adding instructions to the download page for "More info → Run anyway" reduced support questions significantly.


What's next

Phase 2 of Knovex moves toward cloud + organisation features:

  • Cloud Portal — web admin for org key management and user management
  • 3 deployment modes — Personal (own keys) / Organisation (portal-managed) / Self-hosted (Docker)
  • LangGraph agent orchestration — beyond single-turn Q&A
  • Visual workflow builder — chain operations on your KB
  • Mobile app — React Native, same backend API
  • Plugin/connector marketplace — Notion, Confluence, GitHub, etc.

Try it

App: tailorgunjan93.github.io/knovex — free one-click installer for Windows, macOS, Linux

GitHub: github.com/tailorgunjan93/knovex

RAG engine: pip install docnest-ai

MIT licensed. v0.10.0 is stable with 61 E2E tests passing.

Happy to answer questions about any part of the stack in the comments.

Top comments (0)