Francesco Marchetti
I built a self-hosted RAG system that actually works — here's how to run it in one command

I'll be honest: I spent weeks trying to make existing RAG tools work for my use case. AnythingLLM kept needing cloud APIs. RAGFlow was hard to self-host cleanly. Perplexity-style tools were completely off the table for anything with sensitive documents.

So I built my own.

RAG Enterprise is a 100% local RAG system — no data leaves your server, no external APIs, no hidden telemetry. It runs on your hardware with a single setup script. Here's how to get it running.


Why another RAG tool?

Because my clients have real constraints:

  • Legal documents that can't touch US servers (hello, GDPR)
  • IT departments that won't approve "just use OpenAI"
  • Budgets that don't include $500/month SaaS subscriptions

I needed something that runs on-prem, handles PDFs and DOCX files well, supports multiple users with proper roles, and doesn't require a PhD to install.

After building and iterating on this for a few months, it now handles 10,000+ documents comfortably, supports 29 languages, and the whole stack is containerized.


What's under the hood

The architecture is pretty standard but well-wired:

React Frontend (Port 3000)
        │
        │ REST API
        ▼
FastAPI Backend (Port 8000)
   - LangChain RAG pipeline
   - JWT auth + RBAC
   - Apache Tika + Tesseract OCR
   - BAAI/bge-m3 embeddings
        │
   ┌────┴────┐
   ▼         ▼
Qdrant    Ollama
(vectors) (LLM inference)

The LLM runs via Ollama locally — by default Mistral 7B Q4 or Qwen2.5:14b depending on your VRAM. Embeddings use BAAI/bge-m3 which is multilingual and genuinely good.

Everything is Docker containers. No dependency hell.
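In compose terms, the stack looks roughly like this. This is an illustrative sketch only — service names and options are my reconstruction from the architecture diagram above, and the repo's actual docker-compose.yml is the source of truth:

```yaml
# Illustrative only - see the repo's docker-compose.yml for the real wiring
services:
  frontend:
    ports: ["3000:3000"]
    depends_on: [backend]
  backend:
    ports: ["8000:8000"]
    environment:
      LLM_MODEL: mistral:7b-instruct-q4_K_M
      EMBEDDING_MODEL: BAAI/bge-m3
    depends_on: [qdrant, ollama]
  qdrant:
    image: qdrant/qdrant
  ollama:
    image: ollama/ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```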


Prerequisites

Before you start, make sure you have:

  • Ubuntu 20.04+ (22.04 recommended)
  • NVIDIA GPU with 8-16GB VRAM, drivers installed
  • 16GB RAM minimum (32GB recommended)
  • 50GB+ free disk space
  • A decent internet connection for the initial download (~80 Mbit/s or faster)

The setup downloads Docker images, the LLM model, and the embedding model. On a fast connection it takes 15-20 minutes. On a slower one, about an hour. You do it once.
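If you want to sanity-check a box before kicking off the download, a quick script like this covers the basics. It's a hypothetical helper, not part of the repo — it just checks for the tools and disk space listed above:

```python
#!/usr/bin/env python3
"""Preflight check before running setup.sh (illustrative, not part of the repo)."""
import shutil

# git/curl are needed up front; docker is installed by setup.sh if missing,
# and nvidia-smi confirms the GPU drivers are in place.
for cmd in ("git", "curl", "docker", "nvidia-smi"):
    status = "found" if shutil.which(cmd) else "missing"
    print(f"{status}: {cmd}")

# 50GB+ free disk is recommended for the images plus models
free_gb = shutil.disk_usage("/").free / 1e9
print(f"free disk: {free_gb:.0f} GB")
if free_gb < 50:
    print("warning: less than 50 GB free")
```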


Installation

# 1. Clone the repo
git clone https://github.com/I3K-IT/RAG-Enterprise.git
cd RAG-Enterprise/rag-enterprise-structure

# 2. Run the setup script
./setup.sh standard

The script handles everything:

  • Docker Engine + Docker Compose
  • NVIDIA Container Toolkit
  • Ollama with your chosen LLM
  • Qdrant vector database
  • Backend + frontend services

At one point during setup it'll ask you to log out and back in (for Docker group permissions). Just do it and re-run the script — it picks up where it left off.


First startup

After setup completes, the backend downloads the embedding model on first run. This takes a few minutes. Check progress with:

docker compose logs backend -f

When you see Application startup complete, open your browser at http://localhost:3000.

Get your admin password from the logs:

docker compose logs backend | grep "Password:"

Log in with admin and that password.


Uploading documents

The role system works like this:

  • User → can query, can't upload
  • Super User → can upload and delete documents
  • Admin → full access including user management

Log in as Admin, go to the admin panel, and create a Super User account. Then upload your documents.

Supported formats: PDF (with OCR), DOCX, PPTX, XLSX, TXT, MD, ODT, RTF, HTML, XML.

Processing takes 1-2 minutes per document. After that, you can start querying.


Querying your documents

Just type your question in plain language. The RAG pipeline:

  1. Embeds your query with bge-m3
  2. Searches Qdrant for semantically similar chunks
  3. Passes relevant context to the LLM
  4. Returns an answer grounded in your documents

Response time is 2-4 seconds, with generation at around 80-100 tokens/second on an RTX 4070.
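Steps 1-3 boil down to "rank chunks by similarity, feed the best ones to the model." Here's a toy, self-contained sketch of that idea — it swaps the real bge-m3 embeddings and Qdrant for a bag-of-words vector and brute-force cosine similarity, so the chunks and scores are illustrative only:

```python
"""Toy sketch of the retrieval steps: embed, rank by similarity, build context."""
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for bge-m3: a bag-of-words "embedding"
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "The contract terminates after twelve months unless renewed.",
    "Invoices are due within 30 days of receipt.",
    "The office cafeteria opens at 8am.",
]

query = "when does the contract terminate"
q = embed(query)
# Step 2: rank chunks by similarity to the query (Qdrant does this at scale)
ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
# Step 3: the top chunks become the context the LLM is grounded in
context = "\n".join(ranked[:2])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```

The real pipeline then sends that prompt to Ollama (step 4), which is why answers stay grounded in your documents.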


Switching the LLM model

Edit docker-compose.yml:

environment:
  LLM_MODEL: qwen2.5:14b-instruct-q4_K_M  # or mistral:7b-instruct-q4_K_M
  EMBEDDING_MODEL: BAAI/bge-m3
  RELEVANCE_THRESHOLD: "0.35"

Then restart the backend:

docker compose restart backend

If you're getting too few results, lower RELEVANCE_THRESHOLD to 0.3 or even 0.25.
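The threshold's effect is easy to picture: each retrieved chunk comes back with a similarity score, and everything below RELEVANCE_THRESHOLD is dropped before the LLM sees it. A minimal sketch, with made-up scores:

```python
# Illustrative similarity scores, as a vector search might return them
hits = [("chunk A", 0.62), ("chunk B", 0.34), ("chunk C", 0.27)]

def keep(hits, threshold):
    """Drop every chunk scoring below the relevance threshold."""
    return [text for text, score in hits if score >= threshold]

print(keep(hits, 0.35))  # only chunk A survives the default
print(keep(hits, 0.25))  # a lower threshold lets all three through
```

Lowering the threshold trades precision for recall: more chunks reach the LLM, including marginal ones.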


Useful commands

# Check all services
docker compose ps

# Follow logs
docker compose logs -f

# Restart everything
docker compose restart

# Stop
docker compose down

# Health check
curl http://localhost:8000/health

If the backend shows "unhealthy" on first start, just wait — it's still downloading the embedding model.
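If you'd rather not poll by hand, compose can do the waiting for you. A healthcheck along these lines (illustrative — check whether the repo's compose file already defines one) marks the backend healthy once /health responds, with retries generous enough to cover the first-start model download:

```yaml
# Illustrative healthcheck for the backend service
backend:
  healthcheck:
    test: ["CMD", "curl", "-sf", "http://localhost:8000/health"]
    interval: 15s
    timeout: 5s
    retries: 40
    start_period: 120s
```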


What I'm working on next

The community edition uses Qdrant for vector search. The Pro version I'm building adds a hybrid SQL-Vector engine — combining traditional keyword search with semantic search for better precision on structured documents like contracts and regulatory texts. It also adds a 6-stage retrieval pipeline (query expansion → retrieval → reranking → fusion → filtering → generation).

But for most use cases, the community edition is more than enough.


Try it, break it, contribute

The repo is at github.com/I3K-IT/RAG-Enterprise. It's AGPL-3.0 — free to use, modify, and self-host. If you offer it as a service you need to share modifications, which I think is fair.

If you're building something on top of this, or hit issues during setup, open an issue or drop a comment here. Happy to help.

And if you're interested in the EU sovereignty angle — keeping AI infrastructure inside European jurisdiction — check out EuLLM, a project I'm building in parallel: a Rust-based alternative to Ollama with an EU-hosted model registry and built-in AI Act compliance. RAG Enterprise will integrate with it natively.


Built by Francesco Marchetti @ I3K Technologies, Milan.
