<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alex Bogle</title>
    <description>The latest articles on DEV Community by Alex Bogle (@saintchris_21).</description>
    <link>https://dev.to/saintchris_21</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3972649%2F80b3c688-b326-41b9-9a9b-9d5e74a1c575.jpg</url>
      <title>DEV Community: Alex Bogle</title>
      <link>https://dev.to/saintchris_21</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/saintchris_21"/>
    <language>en</language>
    <item>
      <title>I Built a Production RAG System on My M1 Mac for $0</title>
      <dc:creator>Alex Bogle</dc:creator>
      <pubDate>Wed, 10 Jun 2026 04:09:35 +0000</pubDate>
      <link>https://dev.to/saintchris_21/i-built-a-production-rag-system-on-my-m1-mac-for-0-5cn8</link>
      <guid>https://dev.to/saintchris_21/i-built-a-production-rag-system-on-my-m1-mac-for-0-5cn8</guid>
      <description>

&lt;p&gt;Most RAG tutorials stop at "it answers questions." But answering questions is table stakes. The real question is: &lt;strong&gt;are the answers actually correct?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I built a RAG pipeline that doesn't just retrieve and generate — it &lt;strong&gt;evaluates&lt;/strong&gt; whether its own answers are faithful to the source material and relevant to the question. All running locally on an M1 Mac with 16GB RAM. Zero cloud cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;A full RAG system with three layers. Most tutorials stop after the second:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval&lt;/strong&gt;: Upload PDFs or paste text; the content is chunked and embedded into a local vector store&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generation&lt;/strong&gt;: Ask questions and get streamed answers grounded in the retrieved context, with source citations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation&lt;/strong&gt;: Run an automated test suite that scores every answer for faithfulness and relevance using an LLM-as-judge, then logs all metrics to MLflow for experiment tracking&lt;/li&gt;
&lt;/ol&gt;
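
&lt;p&gt;To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only: the &lt;code&gt;chunk_text&lt;/code&gt; name, the character-based windows, and the default sizes are assumptions, not necessarily what the repo does.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into character windows where neighbours share `overlap` characters."""
    # range(chunk_size) is empty for non-positive sizes, so this also rejects those
    if overlap not in range(chunk_size):
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop before a window that would only repeat the tail of the previous one
    stop = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


print(chunk_text("0123456789", chunk_size=5, overlap=2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a chunk size of 5 and an overlap of 2, the sample string splits into &lt;code&gt;01234&lt;/code&gt;, &lt;code&gt;34567&lt;/code&gt;, and &lt;code&gt;6789&lt;/code&gt;. Each chunk repeats the last two characters of the one before it, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.&lt;/p&gt;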

&lt;h2&gt;
  
  
  Why Evaluation Matters
&lt;/h2&gt;

&lt;p&gt;Anyone can build a RAG that produces plausible-sounding answers. The hard part is knowing whether those answers are &lt;strong&gt;grounded&lt;/strong&gt; in the source material or just confident hallucinations.&lt;/p&gt;

&lt;p&gt;My evaluation pipeline measures three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Faithfulness&lt;/strong&gt; (1-5): Is the answer actually supported by the retrieved chunks? Or is the LLM making stuff up?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relevance&lt;/strong&gt; (1-5): Does the answer match the ground truth reference for the given question?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval Accuracy&lt;/strong&gt; (%): What percentage of key terms from the reference answer actually appear in the retrieved chunks?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three metrics get logged to MLflow, so you can compare different chunk sizes, overlap values, and model choices across runs.&lt;/p&gt;
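
&lt;p&gt;The retrieval accuracy metric is the easiest of the three to show in code, because it needs no LLM. The sketch below is one possible reading of it: "key terms" are taken to be lowercased alphanumeric tokens minus a tiny stopword list. That definition, the &lt;code&gt;STOPWORDS&lt;/code&gt; set, and both function names are assumptions made for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# Hypothetical stopword list; a real one would be longer
STOPWORDS = {"a", "an", "and", "are", "as", "in", "is", "of", "the", "to"}


def key_terms(text):
    """Lowercased alphanumeric tokens, minus stopwords and single characters."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOPWORDS and len(w) != 1}


def retrieval_accuracy(reference_answer, retrieved_chunks):
    """Percentage of the reference answer's key terms found in the retrieved chunks."""
    terms = key_terms(reference_answer)
    if not terms:
        return 0.0
    found = key_terms(" ".join(retrieved_chunks))
    return 100.0 * len(terms.intersection(found)) / len(terms)


score = retrieval_accuracy(
    "ChromaDB vector store index",
    ["We use ChromaDB as an embedded store."],
)
print(score)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here two of the four key terms (&lt;code&gt;chromadb&lt;/code&gt; and &lt;code&gt;store&lt;/code&gt;) appear in the retrieved chunk, so the score is 50.0. A low score on a question points at retrieval rather than generation as the thing to fix.&lt;/p&gt;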

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The whole stack runs locally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI&lt;/strong&gt; backend with async streaming responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ChromaDB&lt;/strong&gt; as the vector store (embedded, no separate server)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;all-MiniLM-L6-v2&lt;/strong&gt; (sentence-transformers) for embeddings — 80MB, fast on CPU&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;qwen3:4b&lt;/strong&gt; (Ollama) for generation — 2.5GB, no API keys needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLflow&lt;/strong&gt; for experiment tracking — params, metrics, all logged&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vanilla HTML/CSS/JS&lt;/strong&gt; frontend — dark theme, chat UI, PDF upload, eval controls&lt;/li&gt;
&lt;/ul&gt;
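
&lt;p&gt;As a rough picture of the streaming path, the sketch below wraps an async token stream in Server-Sent Events frames. The payload shape (&lt;code&gt;token&lt;/code&gt;, &lt;code&gt;done&lt;/code&gt;) and the function names are hypothetical, and &lt;code&gt;fake_tokens&lt;/code&gt; stands in for the real LLM client.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio
import json


def sse_event(payload):
    """Serialize one payload as a Server-Sent Events frame."""
    return "data: " + json.dumps(payload) + "\n\n"


async def stream_answer(token_source):
    """Wrap an async token iterator in SSE frames, ending with a done marker."""
    async for token in token_source:
        yield sse_event({"token": token})
    yield sse_event({"done": True})


async def fake_tokens():
    # Stand-in for tokens arriving from the LLM
    for token in ["Hello", " world"]:
        await asyncio.sleep(0)
        yield token


async def collect(frames):
    return [frame async for frame in frames]


for frame in asyncio.run(collect(stream_answer(fake_tokens()))):
    print(repr(frame))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In a FastAPI route, a generator like &lt;code&gt;stream_answer&lt;/code&gt; can be returned through &lt;code&gt;StreamingResponse&lt;/code&gt; with &lt;code&gt;media_type="text/event-stream"&lt;/code&gt;, and the browser can render each token as it arrives.&lt;/p&gt;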

&lt;h2&gt;
  
  
  Why Not Ollama Embeddings?
&lt;/h2&gt;

&lt;p&gt;I started with Ollama's &lt;code&gt;/api/embeddings&lt;/code&gt; endpoint using nomic-embed-text. On my M1 Mac it was painfully slow: 30+ seconds per embedding call. I switched to sentence-transformers, running all-MiniLM-L6-v2 locally in-process instead. The first call loads the model in ~2 seconds; subsequent embeddings take ~18ms each. Night and day.&lt;/p&gt;
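
&lt;p&gt;If you want to reproduce this kind of comparison on your own machine, a small timing helper is enough. The sketch below is an assumption about how one might measure it, not the benchmark behind the numbers above. &lt;code&gt;time_embeddings&lt;/code&gt; is a hypothetical name, and it accepts any callable that embeds one text.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time


def time_embeddings(embed, texts):
    """Return (seconds for the first call, mean seconds for the remaining calls)."""
    if not texts:
        raise ValueError("need at least one text to time")
    durations = []
    for text in texts:
        start = time.perf_counter()
        embed(text)
        durations.append(time.perf_counter() - start)
    first, rest = durations[0], durations[1:]
    mean_rest = sum(rest) / len(rest) if rest else 0.0
    return first, mean_rest


# Assumed usage with sentence-transformers (needs the package and a model download):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2")
# first, mean_rest = time_embeddings(model.encode, ["chunk one", "chunk two", "chunk three"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Reporting the first call separately keeps one-time warm-up cost from skewing the steady-state number.&lt;/p&gt;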

&lt;h2&gt;
  
  
  The Code
&lt;/h2&gt;

&lt;p&gt;The entire project is open source. Key files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;rag_engine.py&lt;/code&gt; — Chunking, embedding, vector storage, retrieval&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llm_client.py&lt;/code&gt; — Ollama client with sync and streaming generation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;evaluator.py&lt;/code&gt; — LLM-as-judge scoring for faithfulness and relevance&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tracker.py&lt;/code&gt; — MLflow experiment tracking wrapper&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;app.py&lt;/code&gt; — FastAPI routes tying it all together&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;seed.py&lt;/code&gt; — Sample data seeding script&lt;/li&gt;
&lt;/ul&gt;
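
&lt;p&gt;To give a feel for the LLM-as-judge step, here is a minimal sketch of a faithfulness prompt and a defensive score parser. The prompt wording, the function names, and the fallback score are all assumptions for illustration; the actual &lt;code&gt;evaluator.py&lt;/code&gt; may look quite different.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# Hypothetical judge prompt; the wording is illustrative only
JUDGE_PROMPT = (
    "Rate from 1 to 5 how well the ANSWER is supported by the CONTEXT.\n"
    "Reply with a single integer.\n\n"
    "CONTEXT:\n{context}\n\nANSWER:\n{answer}\n\nSCORE:"
)


def build_faithfulness_prompt(context_chunks, answer):
    """Fill the judge prompt with the retrieved chunks and the generated answer."""
    return JUDGE_PROMPT.format(context="\n---\n".join(context_chunks), answer=answer)


def parse_score(judge_reply, default=1):
    """Take the first standalone digit 1-5 from the reply, else fall back to default."""
    match = re.search(r"\b([1-5])\b", judge_reply)
    return int(match.group(1)) if match else default


print(parse_score("Score: 4"))
print(parse_score("I cannot tell."))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Small local models do not always reply with a bare number, so the parser falls back to the lowest score instead of raising. That choice is pessimistic on purpose: an unparseable judgment should not inflate the average.&lt;/p&gt;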

&lt;p&gt;Tests are fully mocked — no Ollama or model files required. Full suite runs in ~30 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation is not optional.&lt;/strong&gt; If you can't measure answer quality, you're shipping a chatbot, not a system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local-first is viable.&lt;/strong&gt; Everything runs on a $1,200 laptop. No cloud credits, no API keys, no vendor lock-in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming matters for UX.&lt;/strong&gt; SSE streaming makes the UI feel responsive even when the LLM takes a few seconds to generate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mock your external dependencies.&lt;/strong&gt; Tests that call real LLMs are slow and flaky. Mock the HTTP layer, test your logic.&lt;/li&gt;
&lt;/ol&gt;
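
&lt;p&gt;Here is roughly what mocking the HTTP layer can look like, using only the standard library. The sketch assumes a client that posts to Ollama's &lt;code&gt;/api/generate&lt;/code&gt; endpoint on the default port with streaming disabled. The &lt;code&gt;generate&lt;/code&gt; function is a hypothetical stand-in, not the repo's &lt;code&gt;llm_client.py&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import urllib.request
from unittest.mock import MagicMock, patch

OLLAMA_URL = "http://localhost:11434/api/generate"


def generate(prompt, model="qwen3:4b"):
    """Send one non-streaming generation request and return the model's text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def test_generate_uses_mocked_http():
    # Fake HTTP response object that also works as a context manager
    fake = MagicMock()
    fake.read.return_value = json.dumps({"response": "Paris"}).encode("utf-8")
    fake.__enter__.return_value = fake
    with patch("urllib.request.urlopen", return_value=fake) as mocked:
        assert generate("Capital of France?") == "Paris"
    # Inspect the request our code built, without any network traffic
    sent = json.loads(mocked.call_args.args[0].data)
    assert sent["model"] == "qwen3:4b"
    assert sent["stream"] is False


test_generate_uses_mocked_http()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The test checks both directions: that the code parses the response correctly and that it sends the payload you expect. Neither check needs Ollama to be running.&lt;/p&gt;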

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The repo includes a seed script with sample data, so you can go from clone to running in under 5 minutes. You just need Python 3.9+ and Ollama with qwen3:4b.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/SaintChris/rag-eval-system.git
&lt;span class="nb"&gt;cd &lt;/span&gt;rag-eval-system/backend
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
python3 seed.py
python3 run.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:8001/docs&lt;/code&gt; for the interactive API docs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This is Project 1 of 5 in my portfolio rebuild. Next up: a Hermes MLOps case study showing how I run a 6-agent AI system with 22K+ requests, 52 tests, and $0 cloud spend — all on this same M1 Mac.&lt;/p&gt;

&lt;p&gt;If you're building RAG systems and not evaluating them, you're flying blind. Build the eval pipeline first. Everything else follows.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>mlops</category>
      <category>ai</category>
      <category>python</category>
    </item>
  </channel>
</rss>
