I'll be honest: I spent weeks trying to make existing RAG tools work for my use case. AnythingLLM kept needing cloud APIs. RAGFlow was hard to self-host cleanly. Perplexity-style tools were completely off the table for anything with sensitive documents.
So I built my own.
RAG Enterprise is a 100% local RAG system — no data leaves your server, no external APIs, no hidden telemetry. It runs on your hardware with a single setup script. Here's how to get it running.
Why another RAG tool?
Because my clients have real constraints:
- Legal documents that can't touch US servers (hello, GDPR)
- IT departments that won't approve "just use OpenAI"
- Budgets that don't include $500/month SaaS subscriptions
I needed something that runs on-prem, handles PDFs and DOCX files well, supports multiple users with proper roles, and doesn't require a PhD to install.
After building and iterating on this for a few months, it now handles 10,000+ documents comfortably, supports 29 languages, and the whole stack is containerized.
What's under the hood
The architecture is pretty standard but well-wired:
React Frontend (Port 3000)
          │
          │ REST API
          ▼
FastAPI Backend (Port 8000)
  - LangChain RAG pipeline
  - JWT auth + RBAC
  - Apache Tika + Tesseract OCR
  - BAAI/bge-m3 embeddings
          │
     ┌────┴────┐
     ▼         ▼
  Qdrant     Ollama
 (vectors)  (LLM inference)
The LLM runs via Ollama locally — by default Mistral 7B Q4 or Qwen2.5:14b depending on your VRAM. Embeddings use BAAI/bge-m3 which is multilingual and genuinely good.
Everything is Docker containers. No dependency hell.
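If you're wondering what "semantic search" concretely means here: retrieval boils down to comparing the bge-m3 embedding of your query against the stored document vectors, usually by cosine similarity. A toy sketch of that comparison (the vectors below are made up and 3-dimensional; real bge-m3 dense vectors have 1024 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- in the real system these come from the bge-m3 model.
query = [0.2, 0.8, 0.1]
chunk_close = [0.25, 0.75, 0.05]  # semantically similar chunk
chunk_far = [0.9, 0.1, 0.0]      # unrelated chunk

assert cosine_similarity(query, chunk_close) > cosine_similarity(query, chunk_far)
```

Qdrant does exactly this comparison at scale, with an index so it doesn't have to scan all vectors linearly.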
Prerequisites
Before you start, make sure you have:
- Ubuntu 20.04+ (22.04 recommended)
- NVIDIA GPU with 8-16GB VRAM, drivers installed
- 16GB RAM minimum (32GB recommended)
- 50GB+ free disk space
- A decent internet connection for the initial download (~80 Mbit/s or faster)
The setup downloads Docker images, the LLM model, and the embedding model. On a fast connection it takes 15-20 minutes. On a slower one, about an hour. You do it once.
Installation
# 1. Clone the repo
git clone https://github.com/I3K-IT/RAG-Enterprise.git
cd RAG-Enterprise/rag-enterprise-structure
# 2. Run the setup script
./setup.sh standard
The script handles everything:
- Docker Engine + Docker Compose
- NVIDIA Container Toolkit
- Ollama with your chosen LLM
- Qdrant vector database
- Backend + frontend services
At one point during setup it'll ask you to log out and back in (for Docker group permissions). Just do it and re-run the script — it picks up where it left off.
First startup
After setup completes, the backend downloads the embedding model on first run. This takes a few minutes. Check progress with:
docker compose logs backend -f
When you see Application startup complete, open your browser at http://localhost:3000.
Get your admin password from the logs:
docker compose logs backend | grep "Password:"
Log in with admin and that password.
Uploading documents
The role system works like this:
- User → can query, can't upload
- Super User → can upload and delete documents
- Admin → full access including user management
Log in as Admin, open the admin panel, and create a Super User account. Then upload your documents.
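The permission model is simple enough to sketch in a few lines. This is my own illustration of the three roles listed above, not the project's actual code (the role names and action names here are assumptions):

```python
# Hypothetical permission table mirroring the three roles described above.
PERMISSIONS = {
    "user":       {"query"},
    "super_user": {"query", "upload", "delete"},
    "admin":      {"query", "upload", "delete", "manage_users"},
}

def is_allowed(role: str, action: str) -> bool:
    """Return True if the given role may perform the given action."""
    return action in PERMISSIONS.get(role, set())

assert is_allowed("user", "query")
assert not is_allowed("user", "upload")
assert is_allowed("admin", "manage_users")
```

In the real backend this kind of check sits behind the JWT auth layer, so the role comes from the verified token rather than a plain string.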
Supported formats: PDF (with OCR), DOCX, PPTX, XLSX, TXT, MD, ODT, RTF, HTML, XML.
Processing takes 1-2 minutes per document. After that, you can start querying.
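Most of that processing time goes into extraction (Tika/OCR), splitting the text into overlapping chunks, and embedding each chunk. The project uses LangChain's splitters; here is a generic sliding-window sketch of the chunking idea, with illustrative parameters that are not the project's actual settings:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters, each overlapping the
    previous one by `overlap` characters, so a sentence cut at a boundary
    still appears intact in one of the neighbouring chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("x" * 1200, size=500, overlap=50)
assert len(chunks) == 3                      # windows at 0, 450, 900
assert chunks[0][-50:] == chunks[1][:50]     # consecutive chunks share the overlap
```

Each chunk is then embedded with bge-m3 and written to Qdrant along with its source metadata.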
Querying your documents
Just type your question in plain language. The RAG pipeline:
- Embeds your query with bge-m3
- Searches Qdrant for semantically similar chunks
- Passes relevant context to the LLM
- Returns an answer grounded in your documents
End-to-end response time is 2-4 seconds, with generation running at around 80-100 tokens/second on an RTX 4070.
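The "grounded in your documents" part mostly comes down to how the retrieved chunks are stuffed into the prompt before it reaches the LLM. A sketch of what that assembly step might look like (the template wording is mine; the real pipeline goes through LangChain's prompt machinery):

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded prompt: retrieved context first, then the
    question, with an instruction to answer only from that context."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What is the notice period?",
    ["Clause 7: notice period is 30 days.",
     "Clause 9: payment due within 14 days."],
)
assert "[1] Clause 7" in prompt
assert prompt.strip().endswith("Answer:")
```

The numbered chunk markers also make it easy to have the model cite which passage an answer came from.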
Switching the LLM model
Edit docker-compose.yml:
environment:
  LLM_MODEL: qwen2.5:14b-instruct-q4_K_M  # or mistral:7b-instruct-q4_K_M
  EMBEDDING_MODEL: BAAI/bge-m3
  RELEVANCE_THRESHOLD: "0.35"
Then restart the backend:
docker compose restart backend
If you're getting too few results, lower RELEVANCE_THRESHOLD to 0.3 or even 0.25.
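To see why lowering the threshold helps: RELEVANCE_THRESHOLD is a cut-off on the similarity scores Qdrant returns, and chunks scoring below it never reach the LLM. A minimal sketch of the effect (the scores here are made up for illustration):

```python
def filter_hits(hits: list[tuple[float, str]], threshold: float) -> list[str]:
    """Keep only the chunks whose similarity score meets the threshold."""
    return [text for score, text in hits if score >= threshold]

hits = [
    (0.62, "very relevant chunk"),
    (0.33, "borderline chunk"),
    (0.18, "noise"),
]

assert filter_hits(hits, 0.35) == ["very relevant chunk"]
# Lowering the threshold lets the borderline chunk through:
assert filter_hits(hits, 0.30) == ["very relevant chunk", "borderline chunk"]
```

Lower thresholds mean more recall but also more noise in the context window, so nudge it down in small steps.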
Useful commands
# Check all services
docker compose ps
# Follow logs
docker compose logs -f
# Restart everything
docker compose restart
# Stop
docker compose down
# Health check
curl http://localhost:8000/health
If the backend shows "unhealthy" on first start, just wait — it's still downloading the embedding model.
What I'm working on next
The community edition uses Qdrant for vector search. The Pro version I'm building adds a hybrid SQL-Vector engine — combining traditional keyword search with semantic search for better precision on structured documents like contracts and regulatory texts. It also adds a 6-stage retrieval pipeline (query expansion → retrieval → reranking → fusion → filtering → generation).
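To make the hybrid idea concrete: a common way to combine a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). This is a generic illustration of the technique, not the Pro engine's actual code:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each document scores sum(1 / (k + rank))
    over every ranking it appears in; higher total score ranks first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["contract.pdf", "policy.docx", "memo.txt"]   # BM25-style ranking
vector_hits = ["policy.docx", "contract.pdf", "notes.md"]    # semantic ranking

fused = rrf_fuse([keyword_hits, vector_hits])
# Documents ranked highly by BOTH lists float to the top:
assert fused.index("memo.txt") > 1
assert fused.index("notes.md") > 1
```

The appeal of RRF is that it needs no score calibration between the two retrievers, only their rank orders, which is exactly what you want when mixing BM25-style and cosine scores.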
But for most use cases, the community edition is more than enough.
Try it, break it, contribute
The repo is at github.com/I3K-IT/RAG-Enterprise. It's AGPL-3.0 — free to use, modify, and self-host. If you offer it as a service you need to share modifications, which I think is fair.
If you're building something on top of this, or hit issues during setup, open an issue or drop a comment here. Happy to help.
And if you're interested in the EU sovereignty angle — keeping AI infrastructure inside European jurisdiction — check out EuLLM, a project I'm building in parallel: a Rust-based alternative to Ollama with an EU-hosted model registry and built-in AI Act compliance. RAG Enterprise will integrate with it natively.
Built by Francesco Marchetti @ I3K Technologies, Milan.