This article was originally published on aifoss.dev
RAG — retrieval-augmented generation — is the answer to the obvious problem with local LLMs: they don't know your documents. AnythingLLM is the easiest way to fix that without sending your files to OpenAI.
This review covers v1.12.1 (released April 22, 2026) across both the desktop app and Docker deployment. Short verdict: if you want a working RAG setup in an afternoon rather than a week, AnythingLLM gets you there. It's not perfect — the chunking control is limited and retrieval debugging is opaque — but for the majority of "I want to chat with my own documents privately" use cases, nothing else is as fast to set up.
What AnythingLLM actually is
AnythingLLM is a local RAG and AI agent platform built by Mintplex Labs. It wraps document ingestion, vector storage, and LLM connections into a single web UI — no LangChain boilerplate, no manual ChromaDB setup, no gluing together Python packages.
You point it at your documents (PDFs, text files, Word docs, web pages, YouTube transcripts). It chunks them, embeds them into a vector database, and makes them available for retrieval when you chat. The LLM backend can be anything: Ollama running locally, an OpenAI API key, Anthropic, Mistral, or 30+ other providers.
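To make the chunk, embed, retrieve loop concrete, here is a minimal sketch in Python. It is not AnythingLLM's implementation: the `chunk_text`, `embed`, and `retrieve` helpers are hypothetical names, the chunk sizes are arbitrary, and the "embedding" is a toy bag-of-words vector standing in for a real embedding model such as `nomic-embed-text`.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_words: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (sizes are illustrative)."""
    words = text.split()
    step = chunk_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real setup calls a neural embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, index: list[tuple[str, Counter]], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


documents = [
    "Invoices are due within thirty days of the billing date.",
    "The office coffee machine is cleaned every Friday afternoon.",
]
# "Vector database": a list of (chunk, vector) pairs built at ingestion time.
index = [(chunk, embed(chunk)) for doc in documents for chunk in chunk_text(doc)]

question = "When are invoices due?"
context = retrieve(question, index, top_k=1)
# The retrieved chunk is prepended to the prompt that goes to the LLM.
prompt = f"Answer using this context:\n{context[0]}\n\nQuestion: {question}"
```

The point of the sketch is the shape of the pipeline: documents are processed once at ingestion, and each chat message only triggers a similarity search plus a prompt assembly.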
Multi-user support and workspaces are built in. You can have one workspace for your codebase docs and another for your research notes, with different models and RAG settings per workspace.
The license is MIT. Free to self-host. Cloud-managed plans start at $50/month, but you almost certainly don't need them.
Who it's for
AnythingLLM sits in a specific lane: technical users who want document-aware AI without writing Python. If you're comfortable with Docker or running a desktop app, and you want something that works without gluing libraries together yourself, this is the tool.
It's not for enterprise document management — no audit trails, no SSO, no fine-grained permissions at scale. It's not for researchers who need programmatic control over chunk size, embedding architecture, or retrieval strategy. For those use cases, you'd build with LlamaIndex or LangChain directly.
Install: desktop vs. Docker
The desktop app is the fastest path. Download it from the official site, install, done. It uses an embedded LanceDB instance, so no separate vector database server is required, and it stores everything in a local app directory.
The Docker route gives you a server deployment accessible from other machines:
docker pull mintplexlabs/anythingllm:latest
docker run -d -p 3001:3001 \
-v ${PWD}/storage:/app/server/storage \
--name anythingllm \
mintplexlabs/anythingllm:latest
Then open http://localhost:3001 and run the setup wizard. You'll configure your LLM provider, your embedding model, and your vector database.
For a fully offline setup with Ollama:
- Install Ollama, pull a model: ollama pull llama3.2
- Pull an embedding model: ollama pull nomic-embed-text
- In AnythingLLM, set LLM provider to Ollama at http://localhost:11434
- Set embedding provider to Ollama, select nomic-embed-text
- Upload a document, embed it, start chatting
That's the entire path. Fifteen minutes if Ollama is already running. Nothing else requires configuration unless you want it to.
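If the setup wizard can't see your models, it can help to confirm that Ollama actually has them before debugging AnythingLLM. The sketch below assumes Ollama's `/api/tags` endpoint returns a JSON object with a `models` list whose entries carry a `name` field such as `llama3.2:latest`; `fetch_tags` and `missing_models` are hypothetical helper names, not part of either tool.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # same address used in the steps above


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """GET /api/tags, which lists the models Ollama has pulled locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


def missing_models(tags: dict, required: list[str]) -> list[str]:
    """Return required models absent from the payload, ignoring ':tag' suffixes."""
    installed = {m["name"].split(":")[0] for m in tags.get("models", [])}
    return [name for name in required if name.split(":")[0] not in installed]


# Usage (needs a running Ollama server):
#   print(missing_models(fetch_tags(), ["llama3.2", "nomic-embed-text"]))
```

An empty list means both the chat model and the embedding model are available to select in AnythingLLM.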
Hardware requirements
The desktop app itself is lightweight — Mintplex Labs lists 2GB RAM minimum. Docker is similar but 4GB RAM is the comfortable floor for stable operation under document-embedding load.
The actual constraint is your LLM backend:
- Cloud APIs only (OpenAI, Anthropic): no local GPU required
- Ollama locally: need enough VRAM for your model. A 7B quantized model needs ~5–6GB VRAM; a 14B needs ~10–12GB
- Embedding: nomic-embed-text runs comfortably on CPU, so no GPU is needed for the embedding layer
So AnythingLLM's own footprint is minimal. If you're GPU-constrained and your document corpus is large, embedding jobs via CPU will run slowly. Worth knowing before you try to index 10,000 PDFs.
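The VRAM figures above can be sanity-checked with back-of-the-envelope arithmetic. This is a rough rule of thumb, not a measurement: the 4.5 bits per weight, the 20% runtime overhead, and the 1 GB fixed allowance for context are all assumptions chosen for illustration, and real usage varies with quantization format and context length.

```python
def estimate_vram_gb(
    params_billion: float,
    bits_per_weight: float = 4.5,   # assumption: typical 4-bit quantization with metadata
    overhead_factor: float = 1.2,   # assumption: runtime buffers on top of the weights
    fixed_gb: float = 1.0,          # assumption: allowance for context (KV cache)
) -> float:
    """Rough VRAM estimate in GB for a quantized model."""
    weights_gb = params_billion * bits_per_weight / 8  # billions of params * bytes per param
    return weights_gb * overhead_factor + fixed_gb


for size in (7, 14):
    print(f"{size}B: roughly {estimate_vram_gb(size):.1f} GB")
```

Under these assumptions both estimates land inside the ranges quoted above, which is a useful check before buying hardware or picking a model size.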
Core features
Workspaces
Workspaces are AnythingLLM's structural backbone. Each workspace has its own document collection, its own RAG settings, and its own chat history. You can assign different LLM providers and models to different workspaces.
In practice: your "legal contracts" workspace uses a slow, careful model with conservative temperature. Your "daily notes" workspace uses a fast 7B model for quick lookups. Neither interferes with the other.
Document embedding: two modes
When you add a document to a workspace, you choose between:
- Embed (RAG mode): chunks the document, vectorizes it, and stores it in the vector database. All chats in the workspace can retrieve relevant chunks. This persists across sessions. Use this for anything you'll reference repeatedly.
- Attach: sends the full document text into the current chat's context window. Useful for one-off questions on a short document, but it burns tokens on every message and disappears when the conversation ends.
Default to embed. The attach mode exists for edge cases — don't use it as your primary document strategy.
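The trade-off between the two modes comes down to reuse and context size. Here is a hedged sketch of that decision; the four-characters-per-token estimate is a common rough heuristic rather than a real tokenizer, and `suggest_mode` and its `budget_fraction` parameter are hypothetical, not an AnythingLLM feature.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; assumes roughly four characters per token."""
    return int(len(text) / chars_per_token) + 1


def suggest_mode(text: str, context_window: int, reused: bool,
                 budget_fraction: float = 0.5) -> str:
    """Suggest 'embed' or 'attach' for a document.

    budget_fraction is the assumed share of the context window the document
    may occupy, leaving the rest for the conversation and the answer.
    """
    if reused:
        return "embed"  # anything referenced repeatedly should persist
    fits = estimate_tokens(text) <= context_window * budget_fraction
    return "attach" if fits else "embed"


short_memo = "x" * 1_000      # about 250 tokens
long_report = "x" * 100_000   # about 25,000 tokens
print(suggest_mode(short_memo, context_window=8192, reused=False))
print(suggest_mode(long_report, context_window=8192, reused=False))
```

Attach only wins for a short, one-off document; anything long or reused falls back to embedding.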
Vector database options
The default embedded LanceDB works well for the desktop app and small deployments. For larger setups or when you need persistence independent of AnythingLLM, you can swap in:
| Vector DB | Type | Notes |
|---|---|---|
| LanceDB | Embedded | Default. No server. Good for single-user local. |
| Chroma | Local server | Familiar to developers, easy to self-host |
| Qdrant | Local or cloud | Fast, strong filtering support |
| Milvus | Local or cloud | Better at scale, more ops overhead |
| Pinecone | Cloud only | Managed — not private |
| Weaviate | Local or cloud | Solid for semantic search workflows |
For most people running locally: stick with LanceDB unless you have a specific reason to switch. It's embedded, requires zero ops, and performs fine for collections in the thousands of documents.
AI agents
As of v1.12.1, AnythingLLM's agent mode uses the native tool-calling capabilities of your LLM provider when available. The agent can search the web, execute code, cite document sources during retrieval, and call external APIs.
Practical scope: useful for "summarize this document and look up related current information" workflows. Not a replacement for a proper agent framework — the tool-chaining and orchestration depth isn't there. But for augmenting document chat with live data, it works.
The v1.12.1 release also added a Telegram bot integration so you can query your AnythingLLM instance from anywhere — a niche feature but a useful one for mobile access to your private knowledge base.
LLM provider flexibility
This is where AnythingLLM beats most comparable tools. Supported providers include Ollama, LM Studio, LocalAI, OpenAI, Anthropic, Mistral, Groq, Cohere, and custom OpenAI-compatible endpoints. You swap providers per workspace without touching your document embeddings.
When a better local model ships, you update the model selection in settings. Your indexed documents stay intact.
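The reason this works is that stored vectors depend only on the embedding model, not on the chat model that reads the retrieved chunks. A minimal sketch of that rule, using a hypothetical `WorkspaceConfig` type that is not AnythingLLM's actual data model:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class WorkspaceConfig:
    """Hypothetical per-workspace settings, for illustration only."""
    chat_model: str
    embedding_model: str


def needs_reembedding(old: WorkspaceConfig, new: WorkspaceConfig) -> bool:
    """Vectors are only invalidated when the embedding model changes."""
    return old.embedding_model != new.embedding_model


current = WorkspaceConfig(chat_model="llama3.2", embedding_model="nomic-embed-text")
new_chat = replace(current, chat_model="a-newer-chat-model")        # placeholder name
new_embedder = replace(current, embedding_model="another-embedder")  # placeholder name

print(needs_reembedding(current, new_chat))      # chat model swap: index stays valid
print(needs_reembedding(current, new_embedder))  # embedder swap: documents need re-embedding
```

The flip side is worth remembering: changing the embedding model is the one swap that does require re-indexing, because vectors from different embedding models are not comparable.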
AnythingLLM vs. the alternatives
| Feature | AnythingLLM | Open WebUI | PrivateGPT |
|---|---|---|---|
| Primary focus | RAG + agents | Chat UI + RAG | Document Q&A |
| Setup complexity | Low | Medium | Low–Medium |
| Multi-user | Yes (built-in) | Yes | No |
| Vector DB options | 9+ | Limited | ChromaDB |
| Agent support | Yes (v1.12+) | Minimal | No |
| Local model support | Via Ollama/LM Studio | Via Ollama | Via llama.cpp |
| License | MIT | MIT | Apache 2.0 |
| Best for | Doc-first workflows | Chat-first UI | Simple doc Q&A |
Open WebUI is the right pick if your primary use case is a ChatGPT-style interface for local model chat, with documents as a secondary layer. AnythingLLM inverts those priorities: documents and retrieval are the core product, chat is how you access them.
PrivateGPT is simpler and lighter, but it hasn't kept pace with the broader ecosystem — no