This article was originally published on aifoss.dev
TL;DR: LocalGPT is a self-hosted RAG tool that runs entirely on your own hardware — no telemetry, no cloud fallback, zero data leaving your machine. The v2 rewrite shifted from raw llama.cpp to an Ollama-first architecture with hybrid search, which makes setup much cleaner. Trade-off: it currently only ingests PDFs and has no multi-user support, so it's a single-user privacy tool, not a team platform.
| | LocalGPT | AnythingLLM | PrivateGPT |
|---|---|---|---|
| Best for | Maximum privacy, single user | Team RAG with a GUI | Developer API-first RAG |
| Setup complexity | Medium (Python + Node + Ollama) | Low (Docker / desktop app) | High (Python 3.11 + Poetry) |
| Document types | PDF only (currently) | PDF, DOCX, XLSX, PPTX, HTML, audio, 50+ types | PDF, DOCX, TXT, HTML, PPTX |
| Multi-user | No | Yes (Docker) | No |
| License | MIT | MIT | Apache-2.0 |
| Privacy guarantee | 100% local | 100% local (self-hosted) | 100% local (self-hosted) |
Honest take: If your use case is "I have sensitive PDFs and I want to query them without any data leaving my laptop," LocalGPT does exactly that and nothing more. For anything involving multiple document types or a second person on the team, AnythingLLM is the better tool.
What LocalGPT actually is
LocalGPT is an open-source private RAG system by PromtEngineer, currently at 22.2k GitHub stars and licensed MIT. The concept is straightforward: upload your documents, run a local LLM against them, ask questions. Every part of that pipeline stays on your hardware.
The v2 architecture replaced the original llama.cpp/ChromaDB stack with something more practical: Ollama handles model serving, LanceDB handles vector storage (embedded, no separate database server needed), and a new hybrid search layer blends semantic similarity, keyword matching, and Late Chunking for long-context retrieval. An independent verification pass cross-checks answers before returning them.
Worth noting: LocalGPT has no formal versioned releases. Development happens on the localgpt-v2 branch. If you're the kind of person who needs a changelog before deploying something, the lack of release tags is a genuine friction point.
Who should use this
LocalGPT is built for a specific type of user: someone with sensitive documents — legal contracts, medical records, internal business data — who cannot accept those files passing through third-party infrastructure, even transiently.
If you've ever hesitated before uploading a PDF to ChatGPT or Claude, LocalGPT solves that problem. Every model call, every embedding, every retrieval step runs on your CPU or GPU with no outbound connections to external APIs.
It's not for teams. It's not for people who want a polished UI with workspace management and user permissions. It's not for users who work with Excel spreadsheets, Word documents, or PowerPoint slides — at least not yet.
Setting it up
Prerequisites before you start:
- Python 3.8+ (tested on 3.11.5)
- Node.js 16+ and npm (tested on v23)
- Ollama installed and running
- 8GB RAM minimum; 16GB recommended
That's a heavier dependency list than it first appears. You're not just running a Python script — the v2 stack has a frontend layer that requires Node, and Ollama needs to be running as a separate service before LocalGPT starts.
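Before cloning anything, it can save time to confirm the tools are actually on your PATH. The following is a minimal sketch, not part of LocalGPT: the function name `check_prerequisites` is made up for illustration, and it only checks that executables exist. It does not verify the Node version or that the Ollama service is running.

```python
import shutil
import sys

# Executable names assumed for the tools in the prerequisite list.
REQUIRED_TOOLS = ("node", "npm", "ollama")
MIN_PYTHON = (3, 8)


def check_prerequisites(which=shutil.which, version=None):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    version = tuple(version) if version is not None else sys.version_info[:2]
    problems = []
    if version < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, found {version[0]}.{version[1]}"
        )
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        if which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    return problems
```

Calling `check_prerequisites()` with no arguments inspects the current interpreter and PATH. The `which` and `version` parameters exist only so the check can be exercised without touching the real system.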
Clone and run:
```shell
git clone https://github.com/PromtEngineer/localGPT.git
cd localGPT

# Pull the default models via Ollama first
ollama pull qwen3:8b
ollama pull qwen3:0.6b

# Start the system
python run_system.py
```
Or via Docker if you prefer containers:
```shell
./start-docker.sh
```
The Docker path is simpler if you already have Docker configured. The manual path gives you more control, but if you start each component yourself rather than through run_system.py, the full stack needs four separate terminal processes.
Once running, you get a web UI for document upload and chat, plus an API endpoint for programmatic access.
Default models: LocalGPT ships with Qwen3:0.6b for fast responses and Qwen3:8b for higher-quality answers. Embeddings use Qwen/Qwen3-Embedding-0.6B, which runs comfortably on CPU — no GPU required for the embedding layer. You can swap to any model available in Ollama by editing the config.
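If startup fails, a missing model is a common culprit. Here is a small sketch that asks the local Ollama server which models are pulled, using Ollama's `/api/tags` endpoint on its default port 11434. The helper names `missing_models` and `fetch_tags` are illustrative and not part of LocalGPT, and the sketch assumes you kept Ollama's default address.

```python
import json
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # Ollama's default local address
WANTED = ("qwen3:8b", "qwen3:0.6b")  # the default models named above


def missing_models(tags_response, wanted=WANTED):
    """Given the parsed JSON from /api/tags, return the wanted models not yet pulled."""
    available = {m.get("name") for m in tags_response.get("models", [])}
    return [name for name in wanted if name not in available]


def fetch_tags(url=OLLAMA_TAGS_URL, timeout=5):
    """Query the local Ollama server; raises URLError if it is not reachable."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With Ollama running, `missing_models(fetch_tags())` returns an empty list once both defaults are pulled. Anything it does return is what you still need to `ollama pull`.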
Ingesting documents
You drop a PDF into the upload interface, LocalGPT chunks it, generates embeddings via the Qwen embedding model, and writes everything to LanceDB on disk. From that point forward, every query against that workspace searches the embedded chunks.
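To make the chunk, embed, write sequence concrete, here is a deliberately simple sketch. It is not LocalGPT's code: it uses plain overlapping character windows, the `embed` argument is any function you supply that turns text into a vector, and the returned list of dicts stands in for rows you would write to a vector store such as LanceDB.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character windows."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def build_records(doc_id, text, embed, chunk_size=500, overlap=50):
    """Chunk a document and pair each chunk with its embedding vector."""
    return [
        {"id": f"{doc_id}-{i}", "text": chunk, "vector": embed(chunk)}
        for i, chunk in enumerate(chunk_text(text, chunk_size, overlap))
    ]
```

The overlap is there so a sentence that straddles a window boundary still appears whole in at least one chunk.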
The hybrid search is the v2 addition worth paying attention to. Rather than pure cosine similarity on dense vectors, it blends:
- Semantic similarity: standard vector search
- Keyword matching: BM25-style sparse retrieval for exact terms
- Late Chunking: embeds a long passage first and only then splits it into chunks, so each chunk's vector carries context from the surrounding text instead of being embedded as an isolated fixed-length piece
In practice, this handles two common RAG failure modes better than simple vector search: documents with lots of proper nouns (names, codes, IDs) that don't embed distinctively, and documents where the answer context spans a section boundary.
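One common way to blend a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF). The source does not say which fusion method LocalGPT uses, so treat this as an illustration of the general idea rather than its implementation. The constant `k=60` is the conventional default for RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); top ranks contribute the most.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))
```

A chunk that ranks well in both lists beats one that tops only a single list, which is why an exact-match term like a contract ID can pull a chunk up even when its embedding is unremarkable.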
The smart router is also worth noting. It decides per-query whether to use RAG (retrieve chunks, augment the prompt) or answer directly from the LLM's weights without retrieval. For questions clearly outside the documents, it skips retrieval entirely rather than fetching irrelevant chunks and hallucinating on top of them.
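The routing decision can be pictured with a toy heuristic. The source does not describe how LocalGPT's router actually decides, so everything here is a stand-in: `route_query`, the stopword list, and the idea of comparing the query against short per-document overviews are all assumptions made for illustration.

```python
import re

# A tiny stopword list for the sketch; a real one would be longer.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "what", "how", "and", "to", "for", "on"}


def keywords(text):
    """Lowercase alphanumeric tokens with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def route_query(query, doc_overviews, min_overlap=2):
    """Return 'rag' if the query shares enough keywords with any overview, else 'direct'."""
    query_terms = keywords(query)
    for overview in doc_overviews:
        if len(query_terms & keywords(overview)) >= min_overlap:
            return "rag"
    return "direct"
```

The point of the design is that the decision happens before any chunk retrieval, using only cheap metadata, so an off-topic question never touches the vector index.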
Hardware requirements
LocalGPT itself is lightweight. The RAM floor of 8GB covers the application layer. The real constraint is Ollama and the models you run through it.
Qwen3:8b requires approximately 6–7GB VRAM when loaded in 4-bit quantization. An RTX 3060 with 12GB VRAM handles it comfortably. An RTX 4060 Ti with 8GB can fit it too, but with only 1–2GB of headroom left over for longer contexts.
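The VRAM figure follows from simple arithmetic: 8 billion weights at 4 bits each is about 4GB, and the rest is runtime overhead such as the KV cache. Here is a rough estimator. The `overhead_gb=2.5` default is an assumption picked to land inside the 6–7GB range quoted above; the real number depends on context length and the runtime.

```python
def estimate_vram_gb(params_billion, bits_per_weight, overhead_gb=2.5):
    """Rough VRAM need in GB (10^9 bytes): weight storage plus a flat overhead."""
    # 1e9 params * (bits / 8) bytes per param = that many GB of weights.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


def fits(vram_gb, params_billion, bits_per_weight, overhead_gb=2.5):
    """True if the estimate fits in the given amount of VRAM."""
    return estimate_vram_gb(params_billion, bits_per_weight, overhead_gb) <= vram_gb
```

Under these assumptions `estimate_vram_gb(8, 4)` gives 6.5, which fits both a 12GB and an 8GB card, while the same model at 16 bits would need 16GB for weights alone.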
CPU-only (no GPU) is fully supported and the main use case for privacy-sensitive environments that don't have a gaming GPU handy. Qwen3:8b on a modern CPU with 16GB RAM runs at roughly 3–6 tokens/second depending on the chip — slow for interactive chat but workable if you're running batch queries or can tolerate 30-second response times.
Qwen3:0.6b is the fast mode — it runs on essentially any hardware, including older laptops with no dedicated GPU, at 15–25 tokens/second on CPU. Quality suffers significantly at that model size, especially for complex multi-document questions, but it answers fast enough to feel interactive.
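To turn tokens per second into wait time, divide the answer length by the throughput. The sketch below uses the CPU ranges quoted above and ignores prompt processing, so real waits will be somewhat longer. The 150-token answer length in the usage note is an assumed example, not a measured figure.

```python
# CPU-only throughput ranges quoted in this section, in tokens/second.
CPU_SPEEDS = {"qwen3:8b": (3, 6), "qwen3:0.6b": (15, 25)}


def response_seconds(answer_tokens, tokens_per_second):
    """Time to generate an answer, ignoring prompt processing."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return answer_tokens / tokens_per_second


def latency_range(model, answer_tokens):
    """Return (best_case_s, worst_case_s) for a model at the quoted CPU speeds."""
    slow, fast = CPU_SPEEDS[model]
    return (response_seconds(answer_tokens, fast), response_seconds(answer_tokens, slow))
```

For a 150-token answer, `latency_range("qwen3:8b", 150)` works out to 25 to 50 seconds, while the 0.6b model finishes the same answer in 6 to 10 seconds.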
If you want GPU-accelerated inference without owning a GPU, RunPod gives you on-demand RTX 4090 access for testing — useful if you want to benchmark model quality before committing to a hardware purchase.
The privacy story
This is the point LocalGPT is built around. When you're running it correctly:
- The Ollama model server is local
- LanceDB stores embeddings on local disk
- No API calls leave your machine
- No telemetry, no analytics, no "phone home" behavior in the codebase
Contrast this with tools that offer a "local" mode as an afterthought while their primary workflow routes through cloud APIs. LocalGPT's architecture has no cloud path — there's nothing to accidentally misconfigure.
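If you change any endpoint in the config, a quick sanity check is to confirm it still points at your own machine. This is a generic sketch using only the Python standard library. `is_loopback_url` is an illustrative helper, not something LocalGPT ships, and it checks the URL text only; it does not inspect actual network traffic.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_url(url):
    """True if the URL's host is localhost or a loopback IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False  # no parsable host, e.g. a URL missing its scheme
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # A DNS name other than localhost; treat it as non-local.
        return False
```

Note that a LAN address like 192.168.1.5 fails this check on purpose: traffic to another machine on your network has still left your laptop.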
The verification pass (where the system independently checks its own answer) also happens locally. It uses the same Qwen3 model to run a second pass on the generated response before returning it, which catches some hallucinations. Not all of them, but it's a meaningful improvement over single-pass RAG.
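The generate-then-verify pattern is easy to sketch. The prompts and the `answer_with_verification` function below are hypothetical and are not LocalGPT's actual prompts or code. `llm` is any callable you provide that takes a prompt string and returns the model's text, for example a thin wrapper around your local Ollama model.

```python
def answer_with_verification(question, context, llm):
    """Generate an answer, then ask the same model whether the context supports it."""
    draft = llm(
        f"Context:\n{context}\n\nQuestion: {question}\n"
        "Answer using only the context."
    )
    verdict = llm(
        f"Context:\n{context}\n\nProposed answer: {draft}\n"
        "Is every claim in the answer supported by the context? Reply YES or NO."
    )
    # Anything other than a reply starting with YES counts as unverified.
    supported = verdict.strip().upper().startswith("YES")
    return {"answer": draft, "verified": supported}
```

Because the verifier is the same model, it shares the generator's blind spots, which is one reason a second pass like this catches some hallucinations but not all of them.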
When NOT to use LocalGPT
Your documents aren't PDFs. If you need to query Word documents, spreadsheets, PowerPoint decks, or email archives, LocalGPT doesn't support that yet. The README lists DOCX and other formats as planned — but planned is not the same as working. AnythingLLM handles 50+ document types includin