<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Francesco Marchetti</title>
    <description>The latest articles on DEV Community by Francesco Marchetti (@primoco).</description>
    <link>https://dev.to/primoco</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3855438%2F05886499-c8fc-4aff-b54a-6684927ade98.jpg</url>
      <title>DEV Community: Francesco Marchetti</title>
      <link>https://dev.to/primoco</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/primoco"/>
    <language>en</language>
    <item>
      <title>I built a self-hosted RAG system that actually works — here's how to run it in one command</title>
      <dc:creator>Francesco Marchetti</dc:creator>
      <pubDate>Wed, 01 Apr 2026 11:30:57 +0000</pubDate>
      <link>https://dev.to/primoco/i-built-a-self-hosted-rag-system-that-actually-works-heres-how-to-run-it-in-one-command-38p2</link>
      <guid>https://dev.to/primoco/i-built-a-self-hosted-rag-system-that-actually-works-heres-how-to-run-it-in-one-command-38p2</guid>
      <description>&lt;p&gt;I'll be honest: I spent weeks trying to make existing RAG tools work for my use case. AnythingLLM kept needing cloud APIs. RAGFlow was hard to self-host cleanly. Perplexity-style tools were completely off the table for anything with sensitive documents.&lt;/p&gt;

&lt;p&gt;So I built my own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG Enterprise&lt;/strong&gt; is a 100% local RAG system — no data leaves your server, no external APIs, no hidden telemetry. It runs on your hardware with a single setup script. Here's how to get it running.&lt;/p&gt;




&lt;h2&gt;Why another RAG tool?&lt;/h2&gt;

&lt;p&gt;Because my clients have real constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Legal documents that can't touch US servers (hello, GDPR)&lt;/li&gt;
&lt;li&gt;IT departments that won't approve "just use OpenAI"&lt;/li&gt;
&lt;li&gt;Budgets that don't include $500/month SaaS subscriptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I needed something that runs on-prem, handles PDFs and DOCX files well, supports multiple users with proper roles, and doesn't require a PhD to install.&lt;/p&gt;

&lt;p&gt;After building and iterating on this for a few months, it now handles 10,000+ documents comfortably, supports 29 languages, and the whole stack is containerized.&lt;/p&gt;




&lt;h2&gt;What's under the hood&lt;/h2&gt;

&lt;p&gt;The architecture is pretty standard but well-wired:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;React Frontend (Port 3000)
        │
        │ REST API
        ▼
FastAPI Backend (Port 8000)
   - LangChain RAG pipeline
   - JWT auth + RBAC
   - Apache Tika + Tesseract OCR
   - BAAI/bge-m3 embeddings
        │
   ┌────┴────┐
   ▼         ▼
Qdrant    Ollama
(vectors) (LLM inference)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM runs via &lt;strong&gt;Ollama&lt;/strong&gt; locally — by default Mistral 7B Q4 or Qwen2.5:14b depending on your VRAM. Embeddings use &lt;code&gt;BAAI/bge-m3&lt;/code&gt; which is multilingual and genuinely good.&lt;/p&gt;
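
&lt;p&gt;If you want to inspect or pre-pull models before first use, you can talk to Ollama directly inside its container. A minimal sketch, assuming the Compose service is named &lt;code&gt;ollama&lt;/code&gt; (check your &lt;code&gt;docker-compose.yml&lt;/code&gt; if yours differs):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List the models Ollama has already downloaded
docker compose exec ollama ollama list

# Pre-pull an alternative model so the first query doesn't wait on a download
docker compose exec ollama ollama pull qwen2.5:14b-instruct-q4_K_M
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;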

&lt;p&gt;Everything is Docker containers. No dependency hell.&lt;/p&gt;




&lt;h2&gt;Prerequisites&lt;/h2&gt;

&lt;p&gt;Before you start, make sure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ubuntu 20.04+ (22.04 recommended)&lt;/li&gt;
&lt;li&gt;NVIDIA GPU with 8-16GB VRAM, drivers installed&lt;/li&gt;
&lt;li&gt;16GB RAM minimum (32GB recommended)&lt;/li&gt;
&lt;li&gt;50GB+ free disk space&lt;/li&gt;
&lt;li&gt;A decent internet connection for the initial download (~80 Mbit/s or faster)&lt;/li&gt;
&lt;/ul&gt;
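
&lt;p&gt;A quick hardware sanity check before you start (standard Linux commands, nothing specific to this project):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# GPU and driver visible?
nvidia-smi

# RAM and free disk space
free -h
df -h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;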

&lt;p&gt;The setup downloads Docker images, the LLM model, and the embedding model. On a fast connection it takes 15-20 minutes. On a slower one, about an hour. You do it once.&lt;/p&gt;




&lt;h2&gt;Installation&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Clone the repo&lt;/span&gt;
git clone https://github.com/I3K-IT/RAG-Enterprise.git
&lt;span class="nb"&gt;cd &lt;/span&gt;RAG-Enterprise/rag-enterprise-structure

&lt;span class="c"&gt;# 2. Run the setup script&lt;/span&gt;
./setup.sh standard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script handles everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker Engine + Docker Compose&lt;/li&gt;
&lt;li&gt;NVIDIA Container Toolkit&lt;/li&gt;
&lt;li&gt;Ollama with your chosen LLM&lt;/li&gt;
&lt;li&gt;Qdrant vector database&lt;/li&gt;
&lt;li&gt;Backend + frontend services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At one point during setup it'll ask you to log out and back in (for Docker group permissions). Just do it and re-run the script — it picks up where it left off.&lt;/p&gt;
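
&lt;p&gt;To confirm the group change took effect after logging back in, before re-running the script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# "docker" should now appear in your groups
groups

# And Docker should work without sudo
docker ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;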




&lt;h2&gt;First startup&lt;/h2&gt;

&lt;p&gt;After setup completes, the backend downloads the embedding model on first run. This takes a few minutes. Check progress with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose logs backend &lt;span class="nt"&gt;-f&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you see &lt;code&gt;Application startup complete&lt;/code&gt;, open your browser at &lt;code&gt;http://localhost:3000&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Get your admin password from the logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose logs backend | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"Password:"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Log in with &lt;code&gt;admin&lt;/code&gt; and that password.&lt;/p&gt;




&lt;h2&gt;Uploading documents&lt;/h2&gt;

&lt;p&gt;The role system works like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;User&lt;/strong&gt; → can query, can't upload&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Super User&lt;/strong&gt; → can upload and delete documents&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Admin&lt;/strong&gt; → full access including user management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Log in as Admin, go to the admin panel, and create a Super User account. Then upload your documents.&lt;/p&gt;

&lt;p&gt;Supported formats: PDF (with OCR), DOCX, PPTX, XLSX, TXT, MD, ODT, RTF, HTML, XML.&lt;/p&gt;

&lt;p&gt;Processing takes 1-2 minutes per document. After that, you can start querying.&lt;/p&gt;
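
&lt;p&gt;If you're loading thousands of documents, you'll probably want to script uploads against the REST API rather than click through the UI. The route and field names below are hypothetical placeholders that only illustrate the shape of such a call; check the repo's API docs for the real endpoints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical route and field names -- verify against the actual API
TOKEN="your-jwt-here"
curl -X POST http://localhost:8000/api/documents \
  -H "Authorization: Bearer $TOKEN" \
  -F "file=@contract.pdf"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;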




&lt;h2&gt;Querying your documents&lt;/h2&gt;

&lt;p&gt;Just type your question in plain language. The RAG pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Embeds your query with bge-m3&lt;/li&gt;
&lt;li&gt;Searches Qdrant for semantically similar chunks&lt;/li&gt;
&lt;li&gt;Passes relevant context to the LLM&lt;/li&gt;
&lt;li&gt;Returns an answer grounded in your documents&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Response time is 2-4 seconds, and generation speed is around 80-100 tokens/second on an RTX 4070.&lt;/p&gt;
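
&lt;p&gt;The same caveat as in the upload example applies if you want to query from a script: the route and payload below are a hypothetical sketch of a call against the FastAPI backend, not the documented API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical route and payload -- verify against the actual API
curl -X POST http://localhost:8000/api/query \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"question": "What does the termination clause say?"}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;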




&lt;h2&gt;Switching the LLM model&lt;/h2&gt;

&lt;p&gt;Edit &lt;code&gt;docker-compose.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;LLM_MODEL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;qwen2.5:14b-instruct-q4_K_M&lt;/span&gt;  &lt;span class="c1"&gt;# or mistral:7b-instruct-q4_K_M&lt;/span&gt;
  &lt;span class="na"&gt;EMBEDDING_MODEL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BAAI/bge-m3&lt;/span&gt;
  &lt;span class="na"&gt;RELEVANCE_THRESHOLD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.35"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then restart the backend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose restart backend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're getting too few results, lower &lt;code&gt;RELEVANCE_THRESHOLD&lt;/code&gt; to &lt;code&gt;0.3&lt;/code&gt; or even &lt;code&gt;0.25&lt;/code&gt;.&lt;/p&gt;
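
&lt;p&gt;To confirm the new values actually reached the container after the restart (assuming a standard Linux base image):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print the relevant environment variables inside the backend container
docker compose exec backend env | grep -E 'LLM_MODEL|RELEVANCE'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;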




&lt;h2&gt;Useful commands&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check all services&lt;/span&gt;
docker compose ps

&lt;span class="c"&gt;# Follow logs&lt;/span&gt;
docker compose logs &lt;span class="nt"&gt;-f&lt;/span&gt;

&lt;span class="c"&gt;# Restart everything&lt;/span&gt;
docker compose restart

&lt;span class="c"&gt;# Stop&lt;/span&gt;
docker compose down

&lt;span class="c"&gt;# Health check&lt;/span&gt;
curl http://localhost:8000/health
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the backend shows "unhealthy" on first start, just wait — it's still downloading the embedding model.&lt;/p&gt;
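
&lt;p&gt;If you'd rather poll than watch logs, the health endpoint works well in a loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Re-run the health check every 5 seconds until the backend reports healthy
watch -n 5 curl -s http://localhost:8000/health
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;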




&lt;h2&gt;What I'm working on next&lt;/h2&gt;

&lt;p&gt;The community edition uses Qdrant for vector search. The Pro version I'm building adds a &lt;strong&gt;hybrid SQL-Vector engine&lt;/strong&gt; — combining traditional keyword search with semantic search for better precision on structured documents like contracts and regulatory texts. It also adds a 6-stage retrieval pipeline (query expansion → retrieval → reranking → fusion → filtering → generation).&lt;/p&gt;

&lt;p&gt;But for most use cases, the community edition is more than enough.&lt;/p&gt;




&lt;h2&gt;Try it, break it, contribute&lt;/h2&gt;

&lt;p&gt;The repo is at &lt;a href="https://github.com/I3K-IT/RAG-Enterprise" rel="noopener noreferrer"&gt;github.com/I3K-IT/RAG-Enterprise&lt;/a&gt;. It's AGPL-3.0 — free to use, modify, and self-host. If you offer it as a service, you need to share your modifications, which I think is fair.&lt;/p&gt;

&lt;p&gt;If you're building something on top of this, or hit issues during setup, open an issue or drop a comment here. Happy to help.&lt;/p&gt;

&lt;p&gt;And if you're interested in the EU sovereignty angle — keeping AI infrastructure inside European jurisdiction — check out &lt;strong&gt;&lt;a href="https://github.com/eullm/eullm" rel="noopener noreferrer"&gt;EuLLM&lt;/a&gt;&lt;/strong&gt;, a project I'm building in parallel: a Rust-based alternative to Ollama with an EU-hosted model registry and built-in AI Act compliance. RAG Enterprise will integrate with it natively.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://www.linkedin.com/in/francesco-marchetti-4a7b8149/" rel="noopener noreferrer"&gt;Francesco Marchetti&lt;/a&gt; @ &lt;a href="https://www.i3k.eu" rel="noopener noreferrer"&gt;I3K Technologies&lt;/a&gt;, Milan.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
