Edvard

How I Built a Desktop App for Local RAG: Why I Ditched Full Cloud Processing

Hey Dev.to!

I’m an "old-school" dev (56 y.o.), and frankly, the way modern AI services handle data has always rubbed me the wrong way. You’re usually stuck between two extremes: either you upload your PDFs entirely to the cloud (hello, data leaks and storage fees), or you install a local Ollama instance that turns your laptop into a space heater and eats up all your RAM.

I decided to build Loomind — a hybrid tool that sits right in the middle. I needed to index hundreds of files (docs, research papers, specs, legacy source code) locally, but I still wanted top-tier answers without paying for millions of context tokens.

I want to share a bit about the architecture and the hurdles I faced while building the local vectorization engine.

The Goal: A "Chat with your Knowledge Base" that doesn't lag

The specs were simple:

  1. Omnivorous: Must eat PDF, DOCX, MD, TXT, and Code files.
  2. Local-First: The index and the files themselves stay on the user's disk (SQLite/Local DB). No vectors stored in the cloud.
  3. Privacy: Only the relevant context (user query + found snippets) gets sent to the LLM, not the whole document.

Hurdle #1: Local Vectorization & Parsing

The hardest part wasn't hooking up an API, but teaching a desktop app (Electron) to chew through different file formats quickly.

When you try to load 50 PDFs with complex layouts into memory, "naive" parsing freezes the UI instantly. I had to mess around with workers and queues to keep things smooth.

The second headache was the vectorization engine itself. I didn't want to bundle heavy Python libraries into the distribution. I had to find a balance between embedding quality (so the search is semantic, not just keyword-based) and performance on a standard office laptop.
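The worker-and-queue part is simple enough to sketch. This is not Loomind's actual code — it's a minimal TypeScript illustration of the pattern, assuming Node's worker_threads and a hypothetical parser-worker.js that does the real PDF/DOCX extraction:

```typescript
// indexing-queue.ts — a minimal sketch of the pattern, not the shipped code.
// Assumption: the heavy parsing happens in parser-worker.js so the UI thread never blocks.
import { Worker } from "node:worker_threads";

const MAX_PARALLEL = 2;        // keep an office laptop usable while indexing
const pending: string[] = [];  // file paths waiting to be parsed
let active = 0;

export function enqueue(filePath: string): void {
  pending.push(filePath);
  pump();
}

function pump(): void {
  while (active < MAX_PARALLEL && pending.length > 0) {
    const filePath = pending.shift()!;
    active++;

    // Each file gets its own short-lived worker; a worker pool would also do.
    const worker = new Worker(new URL("./parser-worker.js", import.meta.url), {
      workerData: { filePath },
    });

    worker.on("message", (chunks: string[]) => {
      // Hand the extracted text chunks to the embedding stage (next step).
      console.log(`${filePath}: ${chunks.length} chunks ready`);
    });

    worker.on("exit", () => {
      active--;
      pump(); // start the next file in the queue
    });

    worker.on("error", (err) => console.error(`Failed to parse ${filePath}`, err));
  }
}
```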

How it works now:

  1. File is chunked locally.
  2. Vectors are built by a lightweight model right on your machine.
  3. Everything is written to a local database.

Result: Searching the database is instant, and you don't need internet for the retrieval part.
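To make steps 1–3 concrete, here's an illustrative sketch using transformers.js (all-MiniLM-L6-v2) for the on-device embeddings and better-sqlite3 for the index. Treat those specific libraries as one possible choice, not necessarily what ships in Loomind:

```typescript
// local-index.ts — illustrative only; the model and DB choices are assumptions.
import { pipeline } from "@xenova/transformers"; // runs the embedding model in-process
import Database from "better-sqlite3";            // single-file local database

const db = new Database("loomind-index.db");
db.exec(`CREATE TABLE IF NOT EXISTS chunks (
  id INTEGER PRIMARY KEY, file TEXT, text TEXT, vector BLOB
)`);

// Lightweight sentence-embedding model, small enough to ship with a desktop app.
const embed = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

export async function indexChunks(file: string, chunks: string[]): Promise<void> {
  const insert = db.prepare("INSERT INTO chunks (file, text, vector) VALUES (?, ?, ?)");
  for (const text of chunks) {
    const out = await embed(text, { pooling: "mean", normalize: true });
    insert.run(file, text, Buffer.from(new Float32Array(out.data).buffer));
  }
}

export async function search(query: string, topK = 5) {
  const out = await embed(query, { pooling: "mean", normalize: true });
  const q = new Float32Array(out.data);

  // Brute-force cosine similarity; vectors are normalized, so a dot product is enough.
  const rows = db.prepare("SELECT file, text, vector FROM chunks").all() as
    { file: string; text: string; vector: Buffer }[];

  return rows
    .map((r) => {
      const v = new Float32Array(r.vector.buffer, r.vector.byteOffset, r.vector.length / 4);
      let score = 0;
      for (let i = 0; i < q.length; i++) score += q[i] * v[i];
      return { file: r.file, text: r.text, score };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```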

Hurdle #2: The Hybrid Architecture (Privacy vs. Quality)

A lot of folks root for the "Full Local" approach (running Llama 3 on-device). But let’s be honest: on a standard ultrabook, local models can be sluggish and lose context.

I went with a Hybrid Approach:

  • Storage & Search: 100% local. Your files sit on your drive.
  • Inference: Only a small fragment of text (found via the local vector search) and your specific question are sent to the cloud model.

This keeps hallucinations in check (the model answers strictly based on the text provided) and limits privacy exposure (we aren't dumping the whole document into the cloud, just the retrieved snippets at the moment of the query).
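To make the split concrete, a query looks roughly like this end to end. The endpoint and model name below are placeholders, not necessarily the provider Loomind calls — the point is that only the retrieved snippets ever leave the machine:

```typescript
// ask.ts — a sketch of the hybrid flow; the API endpoint and model are placeholders.
import { search } from "./local-index"; // the local vector search from the previous sketch

export async function ask(question: string, apiKey: string): Promise<string> {
  // 1. Retrieval happens entirely on-device.
  const hits = await search(question, 5);
  const context = hits.map((h, i) => `[${i + 1}] ${h.text}`).join("\n\n");

  // 2. Only the question plus a few kilobytes of retrieved text go to the cloud model.
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [
        { role: "system", content: "Answer strictly from the provided context. If the answer is not there, say so." },
        { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
      ],
    }),
  });

  const data = await res.json();
  return data.choices[0].message.content;
}
```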

Features for Geeks

Once I realized how powerful RAG is, I added the stuff I was missing in standard chat interfaces:

  • Roles (System Prompts): You can switch modes on the fly. "You are a Senior C++ Developer, find bugs in this code" or "You are a Science Editor, summarize this abstract."
  • Permanent Storage: Projects don't get deleted. You can come back 6 months later, and your context is still there.
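Under the hood, a role is nothing fancy: it's just the system message that gets prepended to the hybrid prompt above. Something like this (the preset texts here are illustrative, not the ones shipped in the app):

```typescript
// roles.ts — illustrative presets; the real app ships its own.
const ROLES: Record<string, string> = {
  "cpp-reviewer": "You are a Senior C++ Developer. Find bugs and undefined behaviour in the provided code.",
  "science-editor": "You are a Science Editor. Summarize the provided abstract in plain language.",
};

export function systemPromptFor(role: string): string {
  return ROLES[role] ?? "Answer strictly from the provided context.";
}
```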

Wrap up

Loomind is my attempt to make RAG usable on desktop, without needing the command line or Docker containers.
It is a paid tool (gotta pay those API bills), but there is a Free Tier that is more than enough for testing.

If you’re curious about how it works under the hood, or want to critique my approach to local vectorization — let me know in the comments.

🔗 You can give it a spin here: https://loomind.me/
