<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: shaikhadibbb</title>
    <description>The latest articles on DEV Community by shaikhadibbb (@shaikhadibbb).</description>
    <link>https://dev.to/shaikhadibbb</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3957218%2F1c75dc00-a2c5-4d37-87f6-f8f6b319ab3a.png</url>
      <title>DEV Community: shaikhadibbb</title>
      <link>https://dev.to/shaikhadibbb</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shaikhadibbb"/>
    <language>en</language>
    <item>
      <title>How I rescued a RAG assistant from memory leaks and got it running on a 512MB RAM free tier</title>
      <dc:creator>shaikhadibbb</dc:creator>
      <pubDate>Fri, 29 May 2026 09:02:07 +0000</pubDate>
      <link>https://dev.to/shaikhadibbb/how-i-rescued-a-rag-assistant-from-memory-leaks-and-got-it-running-on-a-512mb-ram-free-tier-4co9</link>
      <guid>https://dev.to/shaikhadibbb/how-i-rescued-a-rag-assistant-from-memory-leaks-and-got-it-running-on-a-512mb-ram-free-tier-4co9</guid>
      <description>&lt;p&gt;A few weeks ago, I had a classic "works on my machine" moment. I had built a nice RAG prototype locally using Ollama and PyTorch. But when I tried to deploy it for staging on a Render free-tier instance (which has a brutal 512MB RAM limit), the server instantly crashed with out-of-memory (OOM) errors. This post is a step-by-step breakdown of how I re-engineered the pipeline to get a production-ready RAG assistant live: moving from heavy PyTorch models to FastEmbed, baking models into Docker images, implementing hybrid search, and setting up automated evaluations with MLflow.&lt;/p&gt;

&lt;p&gt;In the industrial domain, AI holds massive promise. In Germany's heavy manufacturing sector, home to giants like Siemens, Bosch, and BMW, accessing the right maintenance instructions quickly can mean the difference between a minor schedule adjustment and a multi-million-euro line stoppage. However, applying a standard academic Retrieval-Augmented Generation (RAG) pipeline directly to complex technical manuals typically fails.&lt;/p&gt;

&lt;p&gt;This article details how I transformed a broken, slow RAG prototype into a hardened, high-performance, production-grade assistant specifically optimized for German manufacturing compliance and speed requirements.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Challenge: Why Standard RAG Fails on Technical Manuals
&lt;/h2&gt;

&lt;p&gt;Standard RAG pipelines follow a basic procedure: chunk a document, run a vector search, pass the top chunks to an LLM, and output the result.&lt;/p&gt;

&lt;p&gt;When applied to a &lt;strong&gt;200-page compressor manual&lt;/strong&gt;, this naive approach collapses due to three factors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Domain-Specific Terminology:&lt;/strong&gt; Heavy equipment manuals contain dense technical terminology (e.g., "star-delta starters", "high-pressure warning transducers", "LOTO procedures"). Dense embeddings alone struggle to align generic search queries with highly technical, localized instructions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Fragmentation &amp;amp; Truncation:&lt;/strong&gt; Technical instructions are highly structured, featuring tables, lists, and reference sections. Standard fixed-size chunking slices tables in half, leading to &lt;strong&gt;low context recall&lt;/strong&gt; and &lt;strong&gt;hallucinations&lt;/strong&gt; (low faithfulness).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rigorous Compliance Requirements:&lt;/strong&gt; Under European frameworks like the &lt;strong&gt;EU AI Act&lt;/strong&gt;, safety-critical systems must offer transparency. A RAG assistant giving advice without exact, page-level citation tracing is legally unviable in a German manufacturing workspace.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  System Architecture: The Multi-Stage Retrieval Engine
&lt;/h2&gt;

&lt;p&gt;To solve these challenges, I built a multi-stage retrieval and generation architecture using &lt;strong&gt;LlamaIndex&lt;/strong&gt;, &lt;strong&gt;Qdrant&lt;/strong&gt;, and &lt;strong&gt;Mistral-7B&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph TD
    Query[User Query] --&amp;gt;|HyDE Transformation| HyDE[Hypothetical Doc]
    HyDE --&amp;gt;|Dense Search| VectorStore[(Qdrant Vector Store)]
    Query --&amp;gt;|Keyword Search| BM25[BM25 Retriever]
    VectorStore --&amp;gt;|Top K Chunks| RRF[RRF Hybrid Fusion]
    BM25 --&amp;gt;|Top K Chunks| RRF
    RRF --&amp;gt;|Combined Chunks| Reranker[Cross-Encoder Reranker]
    Reranker --&amp;gt;|Top 3 Chunks| Deduplicator[SHA-256 Deduplication]
    Deduplicator --&amp;gt;|Ground Truth Chunks| LLM[Mistral-7B Generator]
    LLM --&amp;gt;|Stream Response| Response[SSE Stream Client]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  1. Query Expansion via HyDE
&lt;/h3&gt;

&lt;p&gt;Technical queries can be highly variable. A technician might ask &lt;em&gt;"What should be done if the compressor's high-pressure warning transducer value approaches the limit?"&lt;/em&gt; while the manual describes the issue using passive engineering specifications.&lt;br&gt;
To bridge that gap, I implemented &lt;strong&gt;Hypothetical Document Embeddings (HyDE)&lt;/strong&gt;. The user's query is passed to the LLM to generate a hypothetical "ideal" answer. This hypothetical answer, written in the manual's own technical vocabulary, is then embedded and used for dense vector search, which improves retrieval recall.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Reciprocal Rank Fusion (RRF) Hybrid Search
&lt;/h3&gt;

&lt;p&gt;Vector search (dense retrieval) is excellent for conceptual matching but struggles with specific numbers or parts (e.g., "5 kW", "Model-X"). &lt;br&gt;
I built a &lt;strong&gt;Hybrid Retriever&lt;/strong&gt; combining dense vector search (via Qdrant) and sparse keyword retrieval (BM25). The results from both retrievers are merged using &lt;strong&gt;Reciprocal Rank Fusion (RRF)&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;$$RRF(d) = \sum_{m \in M} \frac{1}{k + r_m(d)}$$&lt;/p&gt;

&lt;p&gt;where $M$ is the set of retrievers (here dense search and BM25), $k = 60$ is a smoothing constant, and $r_m(d)$ is the rank of document $d$ in retriever $m$. This fuses semantic alignment with exact keyword precision.&lt;/p&gt;
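
&lt;p&gt;The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual retriever code: the function name &lt;code&gt;reciprocal_rank_fusion&lt;/code&gt; and the chunk IDs are hypothetical, and ranks are assumed to start at 1.&lt;/p&gt;

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: list of lists, each ordered best-first (index 0 is rank 1).
    Returns document IDs sorted by descending RRF score.
    """
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score (descending); break ties by ID for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["p12", "p40", "p07"]   # hypothetical IDs from vector search
sparse_hits = ["p40", "p99", "p12"]  # hypothetical IDs from BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

&lt;p&gt;A chunk that appears in both lists accumulates two terms, so it outranks chunks that only one retriever found, even if that retriever ranked them first.&lt;/p&gt;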

&lt;h3&gt;
  
  
  3. Cross-Encoder Reranking
&lt;/h3&gt;

&lt;p&gt;Retrieving 6-10 chunks covers the necessary context but introduces noise and consumes precious context window tokens, slowing down LLM generation. &lt;br&gt;
I integrated a custom &lt;strong&gt;Cross-Encoder Reranker&lt;/strong&gt; (&lt;code&gt;ms-marco-MiniLM-L-6-v2&lt;/code&gt;). While bi-encoders (like BGE) embed queries and documents separately, a cross-encoder feeds the query and the chunk through the model together, so self-attention spans both and the score reflects their precise relationship. This allows us to reduce our context from 6 chunks down to the 3 most relevant ones without losing critical facts.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. SHA-256 Content Deduplication
&lt;/h3&gt;

&lt;p&gt;In manuals, certain tables or notices (such as safety warnings) repeat on multiple pages. Passing duplicate chunks to the LLM wastes context capacity and creates repetitive answers.&lt;br&gt;
I implemented a postprocessor that normalizes chunk text, drops exact duplicates by comparing &lt;strong&gt;SHA-256 hashes&lt;/strong&gt; of the normalized text, and drops near-duplicates using Jaccard text similarity (threshold = 0.85).&lt;/p&gt;
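
&lt;p&gt;One way such a postprocessor might look is sketched below. The helper names (&lt;code&gt;normalize&lt;/code&gt;, &lt;code&gt;jaccard&lt;/code&gt;, &lt;code&gt;deduplicate&lt;/code&gt;) are hypothetical, and the choices of lowercasing, whitespace collapsing, and word-level sets are assumptions; the original normalization rules may differ.&lt;/p&gt;

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so layout differences vanish."""
    return re.sub(r"\s+", " ", text).strip().lower()


def jaccard(a, b):
    """Jaccard similarity between the word sets of two normalized strings."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a.intersection(set_b)) / len(set_a.union(set_b))


def deduplicate(chunks, threshold=0.85):
    """Keep the first occurrence of each chunk; drop exact and near duplicates."""
    seen_hashes = set()
    kept_normalized = []
    kept = []
    for chunk in chunks:
        norm = normalize(chunk)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        if any(jaccard(norm, other) >= threshold for other in kept_normalized):
            continue  # near duplicate of a chunk we already kept
        seen_hashes.add(digest)
        kept_normalized.append(norm)
        kept.append(chunk)
    return kept
```

&lt;p&gt;The hash check is a cheap first pass for verbatim repeats; the Jaccard pass only runs for chunks that survive it and catches copies that differ by a word or two.&lt;/p&gt;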




&lt;h2&gt;
  
  
  Quality Engineering: MLOps and the RAGAS Loop
&lt;/h2&gt;

&lt;p&gt;You cannot optimize what you do not measure. Rather than relying on sporadic manual "vibe checks," I established a rigorous, automated &lt;strong&gt;LLM-as-a-Judge&lt;/strong&gt; evaluation loop using &lt;strong&gt;RAGAS&lt;/strong&gt; and &lt;strong&gt;MLflow&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The 50+ Q&amp;amp;A Evaluation Dataset
&lt;/h3&gt;

&lt;p&gt;I curated a production-grade evaluation dataset of &lt;strong&gt;50+ Q&amp;amp;A pairs&lt;/strong&gt; directly from real industrial manuals, distributed across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Troubleshooting (40%)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Safety Procedures (25%)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part Identification (20%)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Maintenance Schedules (15%)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Context Window Tuning (&lt;code&gt;num_ctx&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;During baseline evaluations, I noticed a critical bottleneck: the local Mistral model was hallucinating safety regulations because of context window truncation.&lt;/p&gt;

&lt;p&gt;I designed an experiment comparing &lt;code&gt;num_ctx&lt;/code&gt; window sizes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context Window (&lt;code&gt;num_ctx&lt;/code&gt;)&lt;/th&gt;
&lt;th&gt;Faithfulness&lt;/th&gt;
&lt;th&gt;Context Recall&lt;/th&gt;
&lt;th&gt;p95 Latency&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;512 (Baseline)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.583&lt;/td&gt;
&lt;td&gt;0.554&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~1.9s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚠️ High context truncation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2048 (Optimal)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.724&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.712&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~3.2s&lt;/td&gt;
&lt;td&gt;✅ Low truncation, high accuracy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4096 (Wasteful)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.731&lt;/td&gt;
&lt;td&gt;0.718&lt;/td&gt;
&lt;td&gt;~5.9s&lt;/td&gt;
&lt;td&gt;❌ Too slow for production&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;By moving to &lt;strong&gt;&lt;code&gt;num_ctx: 2048&lt;/code&gt;&lt;/strong&gt;, the retrieved context fit without heavy truncation, raising &lt;strong&gt;Faithfulness from 0.583 to 0.724&lt;/strong&gt; (above our 0.70 threshold) and &lt;strong&gt;Context Recall from 0.554 to 0.712&lt;/strong&gt;. Doubling the window again to 4096 added less than 0.01 to either metric while nearly doubling p95 latency.&lt;/p&gt;




&lt;h2&gt;
  
  
  Software Engineering: Production Hardening and Performance
&lt;/h2&gt;

&lt;p&gt;To transition from a developer script to a production service, I re-engineered the FastAPI web service to support high concurrency, real-time streaming, and robust security.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Fully Asynchronous Pipeline &amp;amp; Connection Pooling
&lt;/h3&gt;

&lt;p&gt;Synchronous Python endpoints block a worker on every I/O call, so I rewrote all FastAPI endpoints to be fully async. The remote &lt;code&gt;QdrantClient&lt;/code&gt; is shared as a thread-safe global singleton, and a pooled &lt;code&gt;AsyncQdrantClient&lt;/code&gt; serves the async paths, so concurrent requests reuse database connections instead of opening new ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. High-Performance Caching
&lt;/h3&gt;

&lt;p&gt;To achieve a p95 latency under the strict &lt;strong&gt;2.0-second limit&lt;/strong&gt;, I implemented two layers of caching:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Embedding Cache:&lt;/strong&gt; Monkeypatched the Hugging Face &lt;code&gt;BGEEmbedder&lt;/code&gt; to cache calculated query embedding vectors in a local LRU cache, preventing repetitive tensor computations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LRU-TTL Query Cache:&lt;/strong&gt; Built a thread-safe in-memory cache with a 1-hour Time-To-Live (TTL) that intercepts duplicate queries and returns them in under &lt;strong&gt;10ms&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
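
&lt;p&gt;The second layer could be built along the following lines. This is a sketch under assumptions: the class name &lt;code&gt;LruTtlCache&lt;/code&gt;, the default &lt;code&gt;max_size&lt;/code&gt;, and the injectable &lt;code&gt;clock&lt;/code&gt; parameter (added to make expiry testable) are not taken from the original code.&lt;/p&gt;

```python
import threading
import time
from collections import OrderedDict


class LruTtlCache:
    """Thread-safe in-memory cache with LRU eviction and per-entry expiry."""

    def __init__(self, max_size=256, ttl_seconds=3600.0, clock=time.monotonic):
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._entries = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        """Return the cached value, or None if it is missing or expired."""
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                return None
            expires_at, value = entry
            if self._clock() >= expires_at:
                del self._entries[key]  # expired: drop it eagerly
                return None
            self._entries.move_to_end(key)  # mark as most recently used
            return value

    def put(self, key, value):
        """Store a value and evict least recently used entries if over capacity."""
        with self._lock:
            self._entries[key] = (self._clock() + self._ttl, value)
            self._entries.move_to_end(key)
            while len(self._entries) > self._max_size:
                self._entries.popitem(last=False)
```

&lt;p&gt;Keying the cache on a normalized form of the query (for example, lowercased and stripped) would raise the hit rate, at the cost of treating slightly different phrasings as the same question.&lt;/p&gt;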

&lt;h3&gt;
  
  
  3. Server-Sent Events (SSE) Streaming
&lt;/h3&gt;

&lt;p&gt;For long-running generations, keeping a user waiting for a full payload ruins the experience. I created the &lt;code&gt;/query/stream&lt;/code&gt; endpoint returning a real-time token stream using &lt;strong&gt;Server-Sent Events (SSE)&lt;/strong&gt;. The UI immediately renders the text delta as it generates.&lt;/p&gt;
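
&lt;p&gt;The wire format behind such an endpoint is simple enough to sketch with the standard library. The helper names &lt;code&gt;sse_event&lt;/code&gt; and &lt;code&gt;stream_tokens&lt;/code&gt;, the JSON payload shape, and the closing &lt;code&gt;end&lt;/code&gt; event are assumptions made for illustration, not the project's actual protocol.&lt;/p&gt;

```python
import json


def sse_event(data, event=None):
    """Format one Server-Sent Events frame as text."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # JSON-encode the payload so newlines inside a token cannot break the frame.
    lines.append(f"data: {json.dumps(data)}")
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"


def stream_tokens(tokens):
    """Yield one SSE frame per generated token, then a final end marker."""
    for token in tokens:
        yield sse_event({"delta": token})
    yield sse_event({"done": True}, event="end")
```

&lt;p&gt;In FastAPI, a generator like this can be wrapped in a &lt;code&gt;StreamingResponse&lt;/code&gt; with &lt;code&gt;media_type="text/event-stream"&lt;/code&gt;, with the token iterable coming from the LLM client's streaming call.&lt;/p&gt;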

&lt;h3&gt;
  
  
  4. Sliding-Window Rate Limiter &amp;amp; X-API-Key Security
&lt;/h3&gt;

&lt;p&gt;To secure the public endpoint, I built:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Key Verification:&lt;/strong&gt; An &lt;code&gt;X-API-Key&lt;/code&gt; validation check on all sensitive endpoints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sliding-Window Rate Limiter:&lt;/strong&gt; A thread-safe, in-memory sliding-window limiter that restricts requests to &lt;strong&gt;10 requests per minute per IP&lt;/strong&gt;, returning HTTP 429 and &lt;code&gt;Retry-After&lt;/code&gt; headers to prevent resource exhaustion.&lt;/li&gt;
&lt;/ul&gt;
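
&lt;p&gt;A sliding-window limiter of this kind might be sketched as follows. The class name &lt;code&gt;SlidingWindowLimiter&lt;/code&gt;, the &lt;code&gt;check&lt;/code&gt; method, and the injectable &lt;code&gt;clock&lt;/code&gt; are hypothetical; the returned second value is what a handler would place in the &lt;code&gt;Retry-After&lt;/code&gt; header alongside HTTP 429.&lt;/p&gt;

```python
import math
import threading
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within a sliding time window."""

    def __init__(self, limit=10, window_seconds=60.0, clock=time.monotonic):
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._hits = defaultdict(deque)  # client IP -> timestamps, oldest first

    def check(self, client_ip):
        """Return (allowed, retry_after_seconds) for one incoming request."""
        now = self._clock()
        with self._lock:
            hits = self._hits[client_ip]
            # Drop timestamps that have slid out of the window.
            while hits and now - hits[0] >= self._window:
                hits.popleft()
            if len(hits) >= self._limit:
                # The oldest hit leaves the window after this many seconds.
                retry_after = math.ceil(self._window - (now - hits[0]))
                return False, max(1, retry_after)
            hits.append(now)
            return True, 0
```

&lt;p&gt;Because the state lives in process memory, the limit applies per worker process; a multi-worker deployment would need a shared store to enforce a single global limit.&lt;/p&gt;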

&lt;h3&gt;
  
  
  5. Edge-Case Resiliency: Exponential Backoff &amp;amp; OCR Fallback
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Exponential Backoff:&lt;/strong&gt; If the remote Qdrant database experiences a network blip, the connection manager retries up to 5 times with exponential delays ($1\text{s}, 2\text{s}, 4\text{s}, 8\text{s}, 16\text{s}$) before falling back to local SQLite/disk storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OCR Parser Fallback:&lt;/strong&gt; For scanned, image-only manuals, if PyMuPDF text extraction returns no text for a page, the parser falls back to rendering the page to PNG and running &lt;strong&gt;Tesseract OCR&lt;/strong&gt;, so scanned pages still contribute text to the index.&lt;/li&gt;
&lt;/ul&gt;
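
&lt;p&gt;The retry schedule described above can be sketched generically. The function &lt;code&gt;retry_with_backoff&lt;/code&gt; and its parameters are hypothetical, and the sketch assumes "up to 5 retries" means one initial attempt followed by five retries, with failures surfacing as &lt;code&gt;ConnectionError&lt;/code&gt;.&lt;/p&gt;

```python
import time


def retry_with_backoff(operation, fallback, retries=5, base_delay=1.0,
                       sleep=time.sleep):
    """Call operation(); on ConnectionError retry with doubling delays.

    After the final retry fails, fallback() is called instead.
    """
    delay = base_delay
    for attempt in range(retries + 1):  # one initial try plus `retries` retries
        try:
            return operation()
        except ConnectionError:
            if attempt == retries:
                break  # out of retries: fall through to the fallback
            sleep(delay)
            delay *= 2  # 1s, 2s, 4s, 8s, 16s with the defaults
    return fallback()
```

&lt;p&gt;Here &lt;code&gt;operation&lt;/code&gt; would wrap the remote Qdrant call and &lt;code&gt;fallback&lt;/code&gt; would open the local store. With the defaults, the worst case waits 31 seconds in total before falling back, which matters for any request-level timeout in front of it.&lt;/p&gt;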




&lt;h2&gt;
  
  
  Containerization &amp;amp; Deployment Orchestration
&lt;/h2&gt;

&lt;p&gt;To make sure "it works on my machine" carries over to a cloud environment, I containerized the entire pipeline.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dockerfile:&lt;/strong&gt; A multi-stage, slim Python-based image that runs as a dedicated non-root execution user (&lt;code&gt;UID=1000&lt;/code&gt;) and includes a strict health check monitoring local API latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nginx Reverse Proxy:&lt;/strong&gt; Placed an Nginx container in front of the FastAPI app to manage HTTP security headers (X-Frame-Options, Content-Security-Policy, X-XSS-Protection), limit maximum uploads to 50MB, and buffer streams.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;docker-compose.prod.yml:&lt;/strong&gt; Runs the app, Nginx proxy, Qdrant cluster, and Ollama server together on a bridge network with shared persistent volumes.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Achievements &amp;amp; Lessons Learned
&lt;/h2&gt;

&lt;p&gt;This project demonstrates the transition from a simple machine learning model to a robust, compliant enterprise-grade system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Honest Engineering:&lt;/strong&gt; Moving from an unverified claim of 99% accuracy in the README to measuring quality via RAGAS, documenting experiments, and reporting the measured &lt;strong&gt;72.4% Faithfulness&lt;/strong&gt; and &lt;strong&gt;71.2% Context Recall&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design for Compliance:&lt;/strong&gt; Building exact, page-level citation tracing into the generation prompts, satisfying the European standard for human-in-the-loop explainability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLOps First:&lt;/strong&gt; Grounding all optimizations in DVC data tracking and MLflow metrics stored in SQLite, proving that production AI is a discipline of measurement.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>devops</category>
      <category>python</category>
    </item>
  </channel>
</rss>
