I built ClawRAG because I was tired of losing arguments to my sister, who is always right ;) Not just any arguments: the ones where she is convinced she's right about some old receipt, warranty, or contract clause.
I needed my entire paper trail with me at all times. But more importantly, I needed a way to query it from WhatsApp while on the move.
"Standard" RAG Fails
Most RAG setups work fine for simple text. But real-world documents are messy. They have legacy tables, complex layouts, and exact terms that get lost in vector-only search.
If I'm arguing about a warranty, I need the exact page and section.
Pillar 1: Structure-First Ingestion with Docling
Most PDF parsers treat a page like a flat bag of words. I chose Docling because it understands document structure. If there's a table on page 12, Docling extracts it as Markdown, preserving the rows and columns that a standard parser would turn into junk.
How we configure the pipeline:
# Extract from backend/src/core/docling_loader.py
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
# We force table and figure extraction to preserve structure
pipeline_options = PdfPipelineOptions()
pipeline_options.generate_picture_images = True
pipeline_options.generate_table_images = True
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
# file_path is the path to the PDF being ingested
result = converter.convert(file_path)
markdown_content = result.document.export_to_markdown()
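Once a document is Markdown, its headings can carry the "exact section" into every chunk. Here is a minimal sketch of that idea, not the ClawRAG implementation: `split_markdown_by_heading` and the sample text are hypothetical, and it uses only the standard library rather than any Docling API.

```python
import re


def split_markdown_by_heading(markdown_text):
    """Split Markdown into (heading, body) pairs so each chunk keeps its section title."""
    sections = []
    current_heading = "(no heading)"
    current_lines = []
    for line in markdown_text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            # Close the previous section before starting a new one
            if current_lines:
                sections.append((current_heading, "\n".join(current_lines).strip()))
            current_heading = match.group(1).strip()
            current_lines = []
        else:
            current_lines.append(line)
    if current_lines:
        sections.append((current_heading, "\n".join(current_lines).strip()))
    # Drop sections that contain no text of their own
    return [(heading, body) for heading, body in sections if body]


# Hypothetical sample standing in for exported Markdown
sample = """# Warranty Terms

## Article 4.2 Coverage

| Item   | Months |
|--------|--------|
| Fridge | 36     |

## Article 4.3 Exclusions

Damage caused by misuse is not covered.
"""

chunks = split_markdown_by_heading(sample)
for heading, body in chunks:
    print(heading, "->", len(body), "characters")
```

The table stays intact inside the "Article 4.2 Coverage" chunk, so an answer can cite the section it came from. This sketch flattens the heading hierarchy; a real pipeline would probably keep parent headings and page numbers as metadata too.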
Pillar 2: Hybrid Search Accuracy
Vector search (semantic) is the hype, but keyword search (BM25) is the truth. If I search for "Article 4.2", a vector model might give me "something similar". BM25 finds exactly "Article 4.2".
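To make the "Article 4.2" point concrete, here is a small, self-contained BM25 scorer. It is a sketch under assumptions, not ClawRAG's retriever: the tokenizer, the sample documents, and the parameters `k1=1.5` and `b=0.75` are illustrative choices, and the IDF uses the common `log(1 + ...)` variant.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Keep tokens like "4.2" intact so clause numbers can match exactly
    return re.findall(r"[a-z0-9]+(?:\.[0-9]+)*", text.lower())


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: longer documents need more hits to score the same
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical sample documents
docs = [
    "Article 4.2: The extended warranty covers the compressor for 36 months.",
    "Article 4.3: Damage caused by misuse is excluded from coverage.",
    "The warranty section explains what is covered and for how long.",
]
scores = bm25_scores("Article 4.2", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])
```

The first document wins because it is the only one containing the literal token `4.2`. The third document is semantically about warranties but shares no query term, so its score is zero.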
ClawRAG uses Reciprocal Rank Fusion (RRF) to combine both worlds.
The Fusion Logic:
# Extract from backend/src/core/retrievers/hybrid_retriever.py
def _fuse_results(self, results_per_retriever):
    fused_scores = {}
    for results in results_per_retriever:
        for rank, node_with_score in enumerate(results):
            node_id = node_with_score.node.id_
            if node_id not in fused_scores:
                fused_scores[node_id] = {'node': node_with_score.node, 'score': 0.0}
            # Reciprocal Rank Fusion formula (default k=60)
            fused_scores[node_id]['score'] += 1.0 / (rank + 60)
    # Sort candidates by combined score
    return sorted(fused_scores.values(), key=lambda x: x['score'], reverse=True)
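A worked example makes the effect of the formula easier to see. The standalone function below is a simplified sketch of the same idea that works on plain document IDs instead of nodes; the function name, the IDs, and the two result lists are made up for illustration.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document IDs; rank is 0-based, as in the extract above."""
    fused = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (rank + k)
    # Highest combined score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical results from the two retrievers, best hit first
vector_hits = ["warranty_overview", "article_4_2", "return_policy"]
bm25_hits = ["article_4_2", "invoice_2023"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Here `article_4_2` scores 1/61 + 1/60 because both retrievers returned it, so it ends up ahead of `warranty_overview` (1/60), which only the vector search found. Only ranks are used, so the raw scores of the two retrievers never need to be on the same scale.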
Pillar 3: Remote Access via MCP
The reason my sister hates it? I have it on my phone. ClawRAG implements the Model Context Protocol (MCP), which lets me connect it to OpenClaw (ClawBot).
My agent "sees" the knowledge base as a tool. When I ask a question on WhatsApp, the agent calls ClawRAG, retrieves the context, and answers me in seconds.
Example Conversation:
- Me: "Hey, is the fridge still covered? Sis says warranty is over."
- ClawRAG: "Actually, checking Invoice_Kitchen_2023.pdf: You have a 3-year 'Premium' extension. It expires June 2026. Your sister is likely thinking of the standard 12-month manufacturer warranty."
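Under the hood, an agent invoking an MCP tool sends a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds such a message with the standard library only. The tool name `query_knowledge_base` and its arguments are hypothetical; the tools ClawRAG actually exposes may be named differently.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build an MCP tools/call request (MCP messages are JSON-RPC 2.0)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# "query_knowledge_base", "query", and "top_k" are hypothetical names for illustration
request = build_tool_call(
    request_id=1,
    tool_name="query_knowledge_base",
    arguments={"query": "Is the fridge still under warranty?", "top_k": 3},
)
payload = json.dumps(request)
print(payload)
```

The server answers with a result carrying the retrieved context, which the agent then turns into the chat reply.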
Why I Open Sourced It
ClawRAG isn't just a toy; it's the core extracted from our professional RAG platform. It's built for developers who need:
- Production Readiness: No mocks, no fakes
- Privacy: Runs 100% local with Ollama and ChromaDB
- Extensibility: MIT licensed and easy to hook into existing agents
Check it out, star it, and tell me where it breaks:
github.com/2dogsandanerd/ClawRag
P.S. If you're reading this, Sis: I'm still waiting for that new flatscreen you bet me.