Francesco Marchetti
I built a self-hosted RAG system that actually works — here's how to run it in one command

I'll be honest: I spent weeks trying to make existing RAG tools work for my use case. AnythingLLM kept needing cloud APIs. RAGFlow was hard to self-host cleanly. Perplexity-style tools were completely off the table for anything with sensitive documents.

So I built my own.

RAG Enterprise is a 100% local RAG system — no data leaves your server, no external APIs, no hidden telemetry. It runs on your hardware with a single setup script. Here's how to get it running.


Why another RAG tool?

Because my clients have real constraints:

  • Legal documents that can't touch US servers (hello, GDPR)
  • IT departments that won't approve "just use OpenAI"
  • Budgets that don't include $500/month SaaS subscriptions

I needed something that runs on-prem, handles PDFs and DOCX files well, supports multiple users with proper roles, and doesn't require a PhD to install.

After building and iterating on this for a few months, it now handles 10,000+ documents comfortably, supports 29 languages, and the whole stack is containerized.


What's under the hood

The architecture is pretty standard but well-wired:

React Frontend (Port 3000)
        │
        │ REST API
        ▼
FastAPI Backend (Port 8000)
   - LangChain RAG pipeline
   - JWT auth + RBAC
   - Apache Tika + Tesseract OCR
   - BAAI/bge-m3 embeddings
        │
   ┌────┴────┐
   ▼         ▼
Qdrant    Ollama
(vectors) (LLM inference)

The LLM runs via Ollama locally — by default Mistral 7B Q4 or Qwen2.5:14b depending on your VRAM. Embeddings use BAAI/bge-m3 which is multilingual and genuinely good.

Everything is Docker containers. No dependency hell.
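In compose terms, the stack looks roughly like this. This is an illustrative sketch only — service names and options are my reconstruction from the architecture diagram above, and the repo's actual docker-compose.yml is the source of truth:

```yaml
# Illustrative only - see the repo's docker-compose.yml for the real wiring
services:
  frontend:
    ports: ["3000:3000"]
    depends_on: [backend]
  backend:
    ports: ["8000:8000"]
    environment:
      LLM_MODEL: mistral:7b-instruct-q4_K_M
      EMBEDDING_MODEL: BAAI/bge-m3
    depends_on: [qdrant, ollama]
  qdrant:
    image: qdrant/qdrant
  ollama:
    image: ollama/ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```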


Prerequisites

Before you start, make sure you have:

  • Ubuntu 20.04+ (22.04 recommended)
  • NVIDIA GPU with 8-16GB VRAM, drivers installed
  • 16GB RAM minimum (32GB recommended)
  • 50GB+ free disk space
  • A decent internet connection for the initial download (~80 Mbit/s or faster)

The setup downloads Docker images, the LLM model, and the embedding model. On a fast connection it takes 15-20 minutes. On a slower one, about an hour. You do it once.
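If you want to sanity-check a box before kicking off the download, a quick script like this covers the basics. It's a hypothetical helper, not part of the repo — it just checks for the tools and disk space listed above:

```python
#!/usr/bin/env python3
"""Preflight check before running setup.sh (illustrative, not part of the repo)."""
import shutil

# git/curl are needed up front; docker is installed by setup.sh if missing,
# and nvidia-smi confirms the GPU drivers are in place.
for cmd in ("git", "curl", "docker", "nvidia-smi"):
    status = "found" if shutil.which(cmd) else "missing"
    print(f"{status}: {cmd}")

# 50GB+ free disk is recommended for the images plus models
free_gb = shutil.disk_usage("/").free / 1e9
print(f"free disk: {free_gb:.0f} GB")
if free_gb < 50:
    print("warning: less than 50 GB free")
```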


Installation

# 1. Clone the repo
git clone https://github.com/I3K-IT/RAG-Enterprise.git
cd RAG-Enterprise/rag-enterprise-structure

# 2. Run the setup script
./setup.sh standard

The script handles everything:

  • Docker Engine + Docker Compose
  • NVIDIA Container Toolkit
  • Ollama with your chosen LLM
  • Qdrant vector database
  • Backend + frontend services

At one point during setup it'll ask you to log out and back in (for Docker group permissions). Just do it and re-run the script — it picks up where it left off.


First startup

After setup completes, the backend downloads the embedding model on first run. This takes a few minutes. Check progress with:

docker compose logs backend -f

When you see Application startup complete, open your browser at http://localhost:3000.

Get your admin password from the logs:

docker compose logs backend | grep "Password:"

Log in with admin and that password.


Uploading documents

The role system works like this:

  • User → can query, can't upload
  • Super User → can upload and delete documents
  • Admin → full access including user management

Log in as Admin, go to the admin panel, and create a Super User account. Then upload your documents.

Supported formats: PDF (with OCR), DOCX, PPTX, XLSX, TXT, MD, ODT, RTF, HTML, XML.

Processing takes 1-2 minutes per document. After that, you can start querying.


Querying your documents

Just type your question in plain language. The RAG pipeline:

  1. Embeds your query with bge-m3
  2. Searches Qdrant for semantically similar chunks
  3. Passes relevant context to the LLM
  4. Returns an answer grounded in your documents

Response time is 2-4 seconds, with generation at around 80-100 tokens/second on an RTX 4070.
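Steps 1-3 boil down to "rank chunks by similarity, feed the best ones to the model." Here's a toy, self-contained sketch of that idea — it swaps the real bge-m3 embeddings and Qdrant for a bag-of-words vector and brute-force cosine similarity, so the chunks and scores are illustrative only:

```python
"""Toy sketch of the retrieval steps: embed, rank by similarity, build context."""
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for bge-m3: a bag-of-words "embedding"
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "The contract terminates after twelve months unless renewed.",
    "Invoices are due within 30 days of receipt.",
    "The office cafeteria opens at 8am.",
]

query = "when does the contract terminate"
q = embed(query)
# Step 2: rank chunks by similarity to the query (Qdrant does this at scale)
ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
# Step 3: the top chunks become the context the LLM is grounded in
context = "\n".join(ranked[:2])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```

The real pipeline then sends that prompt to Ollama (step 4), which is why answers stay grounded in your documents.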


Switching the LLM model

Edit docker-compose.yml:

environment:
  LLM_MODEL: qwen2.5:14b-instruct-q4_K_M  # or mistral:7b-instruct-q4_K_M
  EMBEDDING_MODEL: BAAI/bge-m3
  RELEVANCE_THRESHOLD: "0.35"

Then restart the backend:

docker compose restart backend

If you're getting too few results, lower RELEVANCE_THRESHOLD to 0.3 or even 0.25.
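The threshold's effect is easy to picture: each retrieved chunk comes back with a similarity score, and everything below RELEVANCE_THRESHOLD is dropped before the LLM sees it. A minimal sketch, with made-up scores:

```python
# Illustrative similarity scores, as a vector search might return them
hits = [("chunk A", 0.62), ("chunk B", 0.34), ("chunk C", 0.27)]

def keep(hits, threshold):
    """Drop every chunk scoring below the relevance threshold."""
    return [text for text, score in hits if score >= threshold]

print(keep(hits, 0.35))  # only chunk A survives the default
print(keep(hits, 0.25))  # a lower threshold lets all three through
```

Lowering the threshold trades precision for recall: more chunks reach the LLM, including marginal ones.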


Useful commands

# Check all services
docker compose ps

# Follow logs
docker compose logs -f

# Restart everything
docker compose restart

# Stop
docker compose down

# Health check
curl http://localhost:8000/health

If the backend shows "unhealthy" on first start, just wait — it's still downloading the embedding model.
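If you'd rather not poll by hand, compose can do the waiting for you. A healthcheck along these lines (illustrative — check whether the repo's compose file already defines one) marks the backend healthy once /health responds, with retries generous enough to cover the first-start model download:

```yaml
# Illustrative healthcheck for the backend service
backend:
  healthcheck:
    test: ["CMD", "curl", "-sf", "http://localhost:8000/health"]
    interval: 15s
    timeout: 5s
    retries: 40
    start_period: 120s
```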


What I'm working on next

The community edition uses Qdrant for vector search. The Pro version I'm building adds a hybrid SQL-Vector engine — combining traditional keyword search with semantic search for better precision on structured documents like contracts and regulatory texts. It also adds a 6-stage retrieval pipeline (query expansion → retrieval → reranking → fusion → filtering → generation).

But for most use cases, the community edition is more than enough.


Try it, break it, contribute

The repo is at github.com/I3K-IT/RAG-Enterprise. It's AGPL-3.0 — free to use, modify, and self-host. If you offer it as a service you need to share modifications, which I think is fair.

If you're building something on top of this, or hit issues during setup, open an issue or drop a comment here. Happy to help.

And if you're interested in the EU sovereignty angle — keeping AI infrastructure inside European jurisdiction — check out EuLLM, a project I'm building in parallel: a Rust-based alternative to Ollama with an EU-hosted model registry and built-in AI Act compliance. RAG Enterprise will integrate with it natively.


Built by Francesco Marchetti @ I3K Technologies, Milan.
