<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Damian</title>
    <description>The latest articles on DEV Community by Damian (@damiansiuta).</description>
    <link>https://dev.to/damiansiuta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3672023%2F6b6209c7-e48c-4c30-9c99-7fa20d270592.png</url>
      <title>DEV Community: Damian</title>
      <link>https://dev.to/damiansiuta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/damiansiuta"/>
    <language>en</language>
    <item>
      <title>The Missing Step in RAG: Why Your Vector DB is Bloated (and how to fix it locally)</title>
      <dc:creator>Damian</dc:creator>
      <pubDate>Sat, 20 Dec 2025 16:07:59 +0000</pubDate>
      <link>https://dev.to/damiansiuta/the-missing-step-in-rag-why-your-vector-db-is-bloated-and-how-to-fix-it-locally-2fjg</link>
      <guid>https://dev.to/damiansiuta/the-missing-step-in-rag-why-your-vector-db-is-bloated-and-how-to-fix-it-locally-2fjg</guid>
      <description>&lt;p&gt;We spend countless hours optimizing LLM prompts, tweaking retrieval parameters (k-NN), and choosing the best embedding models. But we often ignore the elephant in the room: The Data Quality.&lt;/p&gt;

&lt;p&gt;If you are building a RAG (Retrieval-Augmented Generation) pipeline on internal company data (logs, tickets, documentation, or emails), you have likely encountered the Semantic Duplicate Problem.&lt;/p&gt;

&lt;p&gt;The Problem: Different Words, Same Meaning&lt;/p&gt;

&lt;p&gt;Standard deduplication tools (such as pandas' drop_duplicates() or SQL's DISTINCT) work at the string level: they look for exact matches.&lt;/p&gt;

&lt;p&gt;Consider these two log entries:&lt;/p&gt;

&lt;p&gt;Error: Connection to database timed out after 3000ms.&lt;/p&gt;

&lt;p&gt;DB Connection Failure: Timeout limit reached (3s).&lt;/p&gt;

&lt;p&gt;To a standard script, these are two unique rows.&lt;br&gt;
To an LLM (and to a human), they are identical.&lt;/p&gt;
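&lt;p&gt;The failure mode is easy to reproduce in a few lines of plain Python (the same exact-match logic that drop_duplicates() or DISTINCT applies):&lt;/p&gt;

```python
# Two log lines that mean the same thing but differ as strings.
logs = [
    "Error: Connection to database timed out after 3000ms.",
    "DB Connection Failure: Timeout limit reached (3s).",
]

# Exact-match deduplication: the string-level equivalent of
# pandas drop_duplicates() or SQL DISTINCT.
unique = list(dict.fromkeys(logs))

print(len(unique))  # 2 -- both rows survive, because the strings differ
```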

&lt;p&gt;If you ingest 10,000 such rows into your Vector Database (Pinecone, Milvus, Weaviate):&lt;/p&gt;

&lt;p&gt;💸 You waste money: Storing vectors isn't free.&lt;/p&gt;

&lt;p&gt;📉 You hurt retrieval: When a user asks "Why did the DB fail?", the retriever fills the context window with 5 variations of the same error, crowding out other relevant information.&lt;/p&gt;

&lt;p&gt;😵 You invite hallucinations: repetitive context degrades the quality of the LLM's answers.&lt;/p&gt;

&lt;p&gt;The Solution: Semantic Deduplication&lt;/p&gt;

&lt;p&gt;To fix this, we need to deduplicate based on meaning (vectors), not just syntax (text).&lt;/p&gt;

&lt;p&gt;I couldn't find a lightweight tool that does this efficiently on a local machine (Privacy First!) without spinning up a Spark cluster or sending sensitive data to OpenAI APIs. So, I engineered one.&lt;/p&gt;

&lt;p&gt;Meet EntropyGuard.&lt;/p&gt;

&lt;p&gt;🛡️ EntropyGuard: A Local-First ETL Engine&lt;/p&gt;

&lt;p&gt;EntropyGuard is an open-source CLI tool written in Python. It acts as a sanitization layer before your data hits the Vector Database.&lt;/p&gt;

&lt;p&gt;It solves three critical problems:&lt;/p&gt;

&lt;p&gt;Semantic Deduplication: Uses sentence-transformers and FAISS to find duplicates by cosine similarity.&lt;/p&gt;

&lt;p&gt;Sanitization: Strips PII (emails, phones) and HTML noise.&lt;/p&gt;

&lt;p&gt;Privacy: Runs 100% locally on CPU. No data exfiltration.&lt;/p&gt;
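&lt;p&gt;As a rough illustration, PII stripping can be sketched with two regexes. These patterns are my own simplification for the example, not necessarily the rules EntropyGuard ships with:&lt;/p&gt;

```python
import re

# Illustrative-only patterns; a production tool would use stricter rules.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def sanitize(text):
    """Replace emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(sanitize("Contact john.doe@corp.com or +1 555 123 4567 for access."))
# Contact [EMAIL] or [PHONE] for access.
```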

&lt;p&gt;The Tech Stack (Hard Tech)&lt;/p&gt;

&lt;p&gt;I wanted this tool to be robust enough for enterprise data but light enough to run on a laptop.&lt;/p&gt;

&lt;p&gt;Engine: Built on Polars LazyFrame. This allows streaming execution. You can process a 10GB CSV on a laptop with 16GB RAM because the data isn't loaded into memory all at once.&lt;/p&gt;

&lt;p&gt;Vector Search: Uses FAISS (Facebook AI Similarity Search) for blazing-fast vector comparisons on CPU.&lt;/p&gt;

&lt;p&gt;Chunking: Implemented a native recursive chunker (paragraphs -&amp;gt; sentences) to prepare documents for embedding, avoiding the bloat of heavy frameworks like LangChain.&lt;/p&gt;

&lt;p&gt;Ingestion: Supports Excel (.xlsx), Parquet, CSV, and JSONL natively.&lt;/p&gt;
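&lt;p&gt;The recursive chunking idea (paragraphs first, sentences as a fallback for oversized paragraphs) can be sketched in a few lines. This is a simplified stand-in for the example, not EntropyGuard's actual implementation:&lt;/p&gt;

```python
import re

def recursive_chunk(text, max_len=200):
    """Split on paragraph breaks first; re-split oversized paragraphs into sentences."""
    chunks = []
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_len:
            chunks.append(para)
        else:
            # Fall back to sentence-level splits for long paragraphs.
            for sent in re.split(r"(?<=[.!?])\s+", para):
                if sent:
                    chunks.append(sent.strip())
    return chunks

doc = "Short intro.\n\n" + "A long paragraph. " * 20
chunks = recursive_chunk(doc)
print(len(chunks), chunks[0])  # 21 Short intro.
```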

&lt;p&gt;How it works (The Code)&lt;/p&gt;

&lt;p&gt;Using it is as simple as running a CLI command.&lt;/p&gt;

&lt;p&gt;Installation:&lt;/p&gt;

&lt;p&gt;pip install "git+&lt;a href="https://github.com/DamianSiuta/entropyguard.git" rel="noopener noreferrer"&gt;https://github.com/DamianSiuta/entropyguard.git&lt;/a&gt;"&lt;/p&gt;

&lt;p&gt;Running the Audit:&lt;br&gt;
You can run a "Dry Run" to generate a JSON audit log showing exactly which rows would be dropped and why. This is crucial for compliance teams.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;entropyguard \
  --input raw_data.jsonl \
  --output clean_data.jsonl \
  --dedup-threshold 0.85 \
  --audit-log audit_report.json
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The Result:&lt;br&gt;
The tool generates embeddings locally (using a small model like all-MiniLM-L6-v2), indexes them with FAISS, and removes neighbors whose cosine similarity exceeds the threshold (e.g., 0.85).&lt;/p&gt;
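&lt;p&gt;The thresholding step boils down to one rule: keep a row only if nothing already kept is too similar to it. Here it is on toy 2-D vectors with plain cosine similarity; the real tool embeds text and searches with FAISS instead of this O(n²) loop:&lt;/p&gt;

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def dedup_by_similarity(vectors, threshold=0.85):
    """Return indices of vectors to keep: each kept vector is below the
    similarity threshold against every previously kept one."""
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# The first two vectors point almost the same way; the third is orthogonal.
print(dedup_by_similarity([[1, 0], [0.99, 0.14], [0, 1]]))  # [0, 2]
```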

&lt;p&gt;Benchmark: 99.5% Noise Reduction&lt;/p&gt;

&lt;p&gt;I ran a stress test using a synthetic dataset of 10,000 rows generated by multiplying 50 unique signals with heavy noise (HTML tags, rephrasing, typos).&lt;/p&gt;

&lt;p&gt;Raw Data: 10,000 rows.&lt;/p&gt;

&lt;p&gt;Cleaned Data: ~50 rows.&lt;/p&gt;

&lt;p&gt;Execution Time: &amp;lt; 2 minutes on a standard laptop CPU.&lt;/p&gt;

&lt;p&gt;The tool successfully identified that "System Error" and "Critical Failure: System" were the same event, collapsing them into one canonical entry.&lt;/p&gt;

&lt;p&gt;Why Open Source?&lt;/p&gt;

&lt;p&gt;I believe data hygiene is a fundamental problem that shouldn't require expensive SaaS subscriptions. I released EntropyGuard under the MIT License so anyone can use it—even in commercial/air-gapped environments.&lt;/p&gt;

&lt;p&gt;Check out the repo here:&lt;br&gt;
👉 github.com/DamianSiuta/entropyguard&lt;/p&gt;

&lt;p&gt;I’m actively looking for feedback from the Data Engineering community. If you are struggling with dirty RAG datasets, give it a spin and let me know if it helps!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>rag</category>
      <category>python</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
