We spend countless hours optimizing LLM prompts, tweaking retrieval parameters (k-NN), and choosing the best embedding models. But we often ignore the elephant in the room: data quality.
If you are building a RAG (Retrieval-Augmented Generation) pipeline on internal company data (logs, tickets, documentation, or emails), you have likely encountered the Semantic Duplicate Problem.
The Problem: Different Words, Same Meaning
Standard deduplication tools (like Pandas' drop_duplicates() or SQL DISTINCT) work at the string level: they only catch exact matches.
Consider these two log entries:
Error: Connection to database timed out after 3000ms.
DB Connection Failure: Timeout limit reached (3s).
To a standard script, these are two unique rows.
To an LLM (and to a human), they are identical.
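To make this concrete, here is a minimal sketch (assuming sentence-transformers is installed; it uses the same all-MiniLM-L6-v2 model mentioned later in this post). The string-level check keeps both rows, while the cosine similarity of their embeddings flags them as near-duplicates:

```python
from sentence_transformers import SentenceTransformer, util

logs = [
    "Error: Connection to database timed out after 3000ms.",
    "DB Connection Failure: Timeout limit reached (3s).",
]

# String-level dedup: both rows survive because the texts differ byte-for-byte.
print(len(set(logs)))  # -> 2

# Semantic comparison: embed both rows and measure cosine similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(logs, normalize_embeddings=True)
print(util.cos_sim(emb[0], emb[1]).item())  # high score for paraphrases like these
```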
If you ingest 10,000 such rows into your Vector Database (Pinecone, Milvus, Weaviate):
💸 You waste money: Storing vectors isn't free.
📉 You hurt retrieval: When a user asks "Why did the DB fail?", the retriever fills the context window with 5 variations of the same error, crowding out other relevant information.
😵 You invite hallucinations: The LLM gets fed repetitive context, and answer quality degrades.
The Solution: Semantic Deduplication
To fix this, we need to deduplicate based on meaning (vectors), not just syntax (text).
I couldn't find a lightweight tool that does this efficiently on a local machine (Privacy First!) without spinning up a Spark cluster or sending sensitive data to OpenAI APIs. So, I engineered one.
Meet EntropyGuard.
🛡️ EntropyGuard: A Local-First ETL Engine
EntropyGuard is an open-source CLI tool written in Python. It acts as a sanitation layer before your data hits the Vector Database.
It solves three critical problems:
Semantic Deduplication: Uses sentence-transformers and FAISS to find duplicates by cosine similarity.
Sanitization: Strips PII (emails, phone numbers) and HTML noise (see the sketch after this list).
Privacy: Runs 100% locally on CPU. No data exfiltration.
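For the sanitization step, the idea looks roughly like this (the regex patterns below are illustrative placeholders, not EntropyGuard's actual rules):

```python
import re

# Illustrative patterns only; the real sanitization rules may differ.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
HTML_RE = re.compile(r"<[^>]+>")

def sanitize(text: str) -> str:
    text = HTML_RE.sub(" ", text)             # strip HTML tags
    text = EMAIL_RE.sub("[EMAIL]", text)      # mask email addresses
    text = PHONE_RE.sub("[PHONE]", text)      # mask phone numbers
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

print(sanitize("<p>Contact john.doe@acme.com or +1 (555) 123-4567</p>"))
# -> "Contact [EMAIL] or [PHONE]"
```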
The Tech Stack (Hard Tech)
I wanted this tool to be robust enough for Enterprise data but light enough to run on a laptop.
Engine: Built on Polars LazyFrame. This allows streaming execution: you can process a 10GB CSV on a laptop with 16GB RAM because the data isn't loaded into memory all at once (see the sketch after this list).
Vector Search: Uses FAISS (Facebook AI Similarity Search) for blazing-fast vector comparisons on CPU.
Chunking: Implemented a native recursive chunker (paragraphs -> sentences) to prepare documents for embedding, avoiding the bloat of heavy frameworks like LangChain.
Ingestion: Supports Excel (.xlsx), Parquet, CSV, and JSONL natively.
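Here is a rough sketch of the lazy, streaming pattern the engine is built around (not EntropyGuard's internal code; the file and column names are made up):

```python
import polars as pl

# Lazily scan the CSV: nothing is loaded yet, Polars only builds a query plan.
lazy = (
    pl.scan_csv("raw_data.csv")
    .filter(pl.col("text").is_not_null())            # drop empty rows
    .with_columns(pl.col("text").str.strip_chars())  # basic normalization
)

# Sinking executes the plan in a streaming fashion, processing the file in
# batches instead of materializing all 10GB in RAM.
lazy.sink_parquet("normalized.parquet")
```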
How it works (The Code)
Using it is as simple as running a CLI command.
Installation:
pip install "git+https://github.com/DamianSiuta/entropyguard.git"
Running the Audit:
You can run a "Dry Run" to generate a JSON audit log showing exactly which rows would be dropped and why. This is crucial for compliance teams.
entropyguard \
--input raw_data.jsonl \
--output clean_data.jsonl \
--dedup-threshold 0.85 \
--audit-log audit_report.json
The Result:
The tool generates embeddings locally (using a small model like all-MiniLM-L6-v2), clusters them using FAISS, and removes neighbors that are closer than the threshold (e.g., 0.85 similarity).
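In spirit, the deduplication step looks something like the sketch below: a simplified, greedy version using normalized embeddings and a flat inner-product index (inner product equals cosine similarity for unit vectors), not the library's exact implementation:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_dedup(texts: list[str], threshold: float = 0.85) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = np.asarray(
        model.encode(texts, normalize_embeddings=True), dtype="float32"
    )

    # With unit-length vectors, inner product == cosine similarity.
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(emb)

    keep, dropped = [], set()
    for i in range(len(texts)):
        if i in dropped:
            continue
        keep.append(texts[i])
        # range_search returns every stored vector whose similarity to row i
        # is above the threshold; all of them are marked as duplicates.
        lims, _, ids = index.range_search(emb[i : i + 1], threshold)
        for j in ids[lims[0] : lims[1]]:
            if int(j) != i:
                dropped.add(int(j))
    return keep

clean = semantic_dedup([
    "Error: Connection to database timed out after 3000ms.",
    "DB Connection Failure: Timeout limit reached (3s).",
    "User login succeeded.",
])
print(clean)  # the two timeout variants will likely collapse into one entry
```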
Benchmark: 99.5% Noise Reduction
I ran a stress test on a synthetic dataset of 10,000 rows, generated by expanding 50 unique signals with heavy noise (HTML tags, rephrasing, typos).
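As a rough illustration, the noise-injection step can look like this (a hypothetical reconstruction, not the actual benchmark script; the signal strings and variants are made up):

```python
import random

# Hypothetical base signals; the real benchmark used 50 unique ones.
BASE_SIGNALS = [
    "Error: Connection to database timed out after 3000ms.",
    "Disk usage exceeded 90% on the primary node.",
    # ... more unique signals
]

def add_noise(text: str) -> str:
    """Wrap, rephrase, or misspell a signal to create a noisy variant."""
    variants = [
        f"<p>{text}</p>",                       # HTML wrapper
        text.replace("timed out", "timd out"),  # injected typo
        f"CRITICAL - {text.lower()}",           # rephrasing / re-casing
        text,                                   # untouched copy
    ]
    return random.choice(variants)

rows = [add_noise(random.choice(BASE_SIGNALS)) for _ in range(10_000)]
print(f"{len(rows)} noisy rows from {len(BASE_SIGNALS)} base signals")
```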
Raw Data: 10,000 rows.
Cleaned Data: ~50 rows.
Execution Time: < 2 minutes on a standard laptop CPU.
The tool successfully identified that "System Error" and "Critical Failure: System" were the same event, collapsing them into one canonical entry.
Why Open Source?
I believe data hygiene is a fundamental problem that shouldn't require expensive SaaS subscriptions. I released EntropyGuard under the MIT License so anyone can use it—even in commercial/air-gapped environments.
Check out the repo here:
👉 github.com/DamianSiuta/entropyguard
I’m actively looking for feedback from the Data Engineering community. If you are struggling with dirty RAG datasets, give it a spin and let me know if it helps!