DEV Community

Jason Shotwell

Chat with your own docs using local LLMs — no cloud, no API keys

Every time you paste a contract, a client note, or a financial doc into ChatGPT to ask a question, that content hits OpenAI's servers. For most people that's fine. For a lot of workflows it isn't.

I wanted something that worked like that — drop in a document, ask questions — but where nothing left my machine. I looked at the existing options and none of them felt right. Too heavy, too opinionated, or they still phoned home somewhere. So I built VaultMind.


What it is

VaultMind is a local RAG system. You drop in PDFs, Word docs, CSVs, markdown files, or paste URLs, and then chat with them using open-source LLMs running through Ollama. The entire stack — inference, vector storage, embeddings, frontend — runs on your computer.

git clone https://github.com/airblackbox/VaultMind
cd VaultMind
bash start.sh

That's it. start.sh pulls the embedding model and default LLM, starts the FastAPI backend, and opens the UI. First run takes a few minutes while Ollama downloads the models. After that it's instant.


How it works under the hood

The architecture is deliberately simple:

Document → extract text → chunk (150 words, 20-word overlap)
         → nomic-embed-text embeddings → ChromaDB

Query → embed query → cosine similarity search
      → relevance filter (distance < 0.75)
      → inject context into Mistral prompt
      → stream response via SSE
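The ingestion side of that diagram is simple enough to sketch. Here's a minimal version of fixed-size word chunking with overlap, using the numbers from the post (150 words, 20-word overlap) — the function name is illustrative, not VaultMind's actual code:

```python
def chunk_text(text: str, size: int = 150, overlap: int = 20) -> list[str]:
    """Split text into overlapping word-window chunks."""
    words = text.split()
    step = size - overlap  # advance by 130 words, so consecutive chunks share 20
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + size >= len(words):
            break  # last window already covered the tail
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which matters for retrieval quality.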

Embedding: nomic-embed-text via Ollama. 274MB one-time download, fast, good quality for mixed document types.

Vector store: ChromaDB with persistent storage. Each named workspace gets its own collection so your work documents and personal documents never bleed into each other.

Relevance filtering: This was the part that took the most tuning. Without it, every query pulls vault context even when it's irrelevant — ask "what's the weather today?" and it would inject chunks from your tax returns. ChromaDB returns cosine distances alongside documents, so I added a threshold:

RELEVANCE_THRESHOLD = 0.75

relevant_docs = []
for doc, meta, dist in zip(documents, metadatas, distances):
    if dist < RELEVANCE_THRESHOLD:
        relevant_docs.append(doc)

Lower distance = more similar. Only chunks below the threshold get injected. Unrelated queries fall through to a clean "I don't have that indexed" response instead of hallucinating from vaguely similar chunks.

Streaming: FastAPI SSE endpoint streams tokens as they arrive from Ollama. The frontend renders tokens one at a time with a blinking cursor. Markdown is stripped client-side so responses stay clean regardless of what the model outputs.
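The SSE framing itself is just `data: <payload>\n\n` per event. Here's a sketch of the generator side, with a fake token stream standing in for Ollama's streaming API — FastAPI's `StreamingResponse` with `media_type="text/event-stream"` would wrap a generator like this; the function names are illustrative:

```python
import asyncio

async def sse_events(tokens):
    """Wrap each token in a Server-Sent Events frame."""
    async for token in tokens:
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"  # sentinel so the client knows the stream ended

async def fake_ollama_stream():
    # Stand-in for tokens arriving from Ollama.
    for token in ["Hel", "lo"]:
        yield token

async def main():
    return [frame async for frame in sse_events(fake_ollama_stream())]

frames = asyncio.run(main())
# frames == ["data: Hel\n\n", "data: lo\n\n", "data: [DONE]\n\n"]
```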


Agent mode

Beyond the vault, there's an agent mode that combines your private documents with live DuckDuckGo search results in one streamed response. Useful when a question needs both your internal context and current public information. The same relevance filter applies — vault context is only included if it's actually relevant to the query.

Status updates stream to the UI while the agent is working so it doesn't feel like a black box.
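Prompt assembly for agent mode might look something like this. `vault_chunks` and `web_results` are hypothetical stand-ins for the output of the relevance-filtered vault lookup and the DuckDuckGo search; the real wiring lives in VaultMind's backend:

```python
def build_agent_prompt(question: str,
                       vault_chunks: list[str],
                       web_results: list[str]) -> str:
    """Combine private and public context into one prompt."""
    sections = []
    if vault_chunks:  # only present when the relevance filter passed
        sections.append("Private context:\n" + "\n".join(vault_chunks))
    if web_results:
        sections.append("Web results:\n" + "\n".join(web_results))
    context = "\n\n".join(sections)
    return f"{context}\n\nAnswer using the context above.\nQuestion: {question}"

prompt = build_agent_prompt(
    "When does our contract renew?",
    vault_chunks=["Renewal: 12-month term, auto-renews."],
    web_results=["Search snippet about renewal notice periods"],
)
```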


Model switching

Default is Mistral 7B but you can switch from a dropdown in the UI — no config changes, no restart. Supported models: Mistral, Llama 3.2, Phi-3 Mini, Gemma 2, Qwen 2.5, DeepSeek R1. The model choice is passed per request so different workspaces can use different models.

Pull whichever ones you want:

ollama pull phi3
ollama pull gemma2
ollama pull deepseek-r1

Programmatic access

There's a synchronous /query endpoint for scripts and integrations:

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"message": "summarize the renewal terms", "mode": "vault", "workspace": "Work"}'

Returns plain JSON — answer, sources, mode. I use this with an OpenClaw skill to query my vault from WhatsApp.
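The same call from Python, stdlib only — equivalent to the curl above, assuming VaultMind is listening on localhost:8000 (the wrapper function is mine):

```python
import json
import urllib.request

def query_vault(message: str, mode: str = "vault", workspace: str = "Work",
                url: str = "http://localhost:8000/query") -> dict:
    """POST to VaultMind's synchronous /query endpoint and return its JSON."""
    payload = json.dumps(
        {"message": message, "mode": mode, "workspace": workspace}
    ).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # {"answer": ..., "sources": ..., "mode": ...}

# Usage (with VaultMind running):
# result = query_vault("summarize the renewal terms")
# print(result["answer"], result["sources"])
```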


Docker

docker compose up

Three services: Ollama, a one-shot model puller on first start, and VaultMind. Models and ChromaDB persist across restarts via named volumes.
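For orientation, a compose file with that shape could look roughly like this — service and volume names here are illustrative, not copied from the repo:

```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama-models:/root/.ollama   # models survive restarts
  model-puller:
    image: ollama/ollama
    environment:
      - OLLAMA_HOST=http://ollama:11434  # pull into the ollama service
    entrypoint: ["ollama", "pull", "mistral"]  # one-shot on first start
    depends_on:
      - ollama
  vaultmind:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - vault-data:/app/vault_db       # ChromaDB persists here
    depends_on:
      - ollama
volumes:
  ollama-models:
  vault-data:
```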


What I'm still figuring out

The 0.75 relevance threshold is tuned by feel. I don't have a principled eval set for it. If anyone has done systematic work on relevance thresholds for mixed-type RAG I'd genuinely like to read it.

Chunk size is 150 words with 20-word overlap. Works well for prose, less ideal for CSVs. Planning to add document-type-aware chunking.

Agent mode scrapes URLs synchronously which adds latency. Async scraping is on the list.


Try it

git clone https://github.com/airblackbox/VaultMind
cd VaultMind && bash start.sh

Or with Docker:

docker compose up

Open http://localhost:8000. Drop in a document. Ask it something.

Apache 2.0. PRs welcome. Happy to answer questions about the architecture in the comments.

github.com/airblackbox/VaultMind
