DEV Community

2 dogs and a nerd

My Sister Hates My New OpenClaw (Because I'm Finally Right)

I built ClawRAG because I was tired of losing arguments with my sister, who is "always right" ;) Not just any arguments: the ones where she is convinced she's right about some old receipt, warranty, or contract clause.

I needed my entire paper trail with me at all times. More importantly, I needed a way to query it from WhatsApp while on the move.

"Standard" RAG Fails
Most RAG setups work fine for simple text. But real-world documents are messy. They have legacy tables, complex layouts, and exact terms that get lost in vector-only search.

If I'm arguing about a warranty, I need the exact page and section.

Pillar 1: Structure-First Ingestion with Docling

Most PDF parsers treat a page like a flat bag of words. I chose Docling because it understands document structure. If there's a table on page 12, Docling extracts it as Markdown, preserving the rows and columns that a standard parser would turn into junk.

How we configure the pipeline:

# Extract from backend/src/core/docling_loader.py
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# We force table and figure extraction to preserve structure
pipeline_options = PdfPipelineOptions()
pipeline_options.generate_picture_images = True
pipeline_options.generate_table_images = True

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert(file_path)
markdown_content = result.document.export_to_markdown()
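
Once the document is Markdown, the structure can travel with each chunk. The following is a minimal sketch of that idea, not ClawRAG's actual chunker: `split_markdown_sections` is a hypothetical helper that splits on heading lines and keeps each table inside the section it belongs to, so a retrieved chunk can be traced back to its heading.

```python
def split_markdown_sections(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per heading (hypothetical helper).

    Assumes headings are lines starting with '#'. Table rows are ordinary
    lines, so a table always stays inside the section it appears in.
    """
    sections = []
    current = {"heading": "", "lines": []}
    for line in markdown.splitlines():
        if line.startswith("#"):
            # Close the previous section before starting a new one
            if current["heading"] or current["lines"]:
                sections.append(current)
            current = {"heading": line.lstrip("#").strip(), "lines": []}
        else:
            current["lines"].append(line)
    if current["heading"] or current["lines"]:
        sections.append(current)
    return [
        {"heading": s["heading"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]


# Illustrative input only, not taken from a real document
sample = """# Warranty Terms
Coverage starts on delivery.

## Article 4.2
| Part | Months |
|------|--------|
| Compressor | 36 |
"""

chunks = split_markdown_sections(sample)
for chunk in chunks:
    print(chunk["heading"], "->", len(chunk["text"]), "characters")
```

Storing the heading next to the text is what makes an answer like "see Article 4.2" possible later on.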

Pillar 2: Hybrid Search Accuracy

Vector search (semantic) is the hype, but keyword search (BM25) is the truth. If I search for "Article 4.2", a vector model might give me "something similar". BM25 finds exactly "Article 4.2".
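
To make the keyword side concrete, here is a small pure-Python sketch of BM25 scoring. It is an illustration under assumptions, not the retriever ClawRAG ships: the tokenizer, the `k1` and `b` defaults, and the three sample clauses are all made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Keep dotted numbers such as "4.2" together as one token
    return re.findall(r"\d+(?:\.\d+)+|\w+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up clauses for illustration
docs = [
    "Article 4.2: The warranty covers compressor defects for 36 months.",
    "Article 4.3: The warranty excludes cosmetic damage.",
    "Section on coverage terms and similar clauses.",
]
scores = bm25_scores("Article 4.2", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

The rare token "4.2" carries most of the score, so the exact clause wins over its near neighbor "Article 4.3", which only matches on the common word "article".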

ClawRAG uses Reciprocal Rank Fusion (RRF) to combine both worlds.

The Fusion Logic:

# Extract from backend/src/core/retrievers/hybrid_retriever.py
def _fuse_results(self, results_per_retriever):
    fused_scores = {}
    for results in results_per_retriever:
        for rank, node_with_score in enumerate(results):
            node_id = node_with_score.node.id_
            if node_id not in fused_scores:
                fused_scores[node_id] = {'node': node_with_score.node, 'score': 0.0}

            # Reciprocal Rank Fusion: add 1 / (k + rank) per retriever, with k=60.
            # rank is 0-based here; the textbook form counts ranks from 1.
            fused_scores[node_id]['score'] += 1.0 / (rank + 60)

    # Sort candidates by combined score
    return sorted(fused_scores.values(), key=lambda x: x['score'], reverse=True)
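
A worked example helps show why fusion rewards agreement between retrievers. This is a standalone sketch using plain string IDs instead of node objects; the function name `rrf` and the chunk IDs are hypothetical, and it counts ranks from 1 as in the textbook formula, whereas the excerpt above counts from 0.

```python
from collections import defaultdict


def rrf(rankings, k=60):
    """Fuse several ranked ID lists with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical result lists from the two retrievers
vector_hits = ["chunk_similar", "chunk_article_4_2", "chunk_other"]
bm25_hits = ["chunk_article_4_2", "chunk_footnote"]

fused = rrf([vector_hits, bm25_hits])
print(fused)
```

`chunk_article_4_2` comes out on top because it collects a contribution from both lists (1/62 + 1/61), while `chunk_similar` only gets 1/61 from the vector retriever.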

Pillar 3: Remote Access via MCP

The reason my sister hates it? I have it on my phone. ClawRAG implements the Model Context Protocol (MCP), which lets me connect it to OpenClaw (ClawBot).

My agent "sees" the knowledge base as a tool. When I ask a question on WhatsApp, the agent calls ClawRAG, retrieves the context, and answers me in seconds.
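
Under the hood, an MCP tool call is a JSON-RPC 2.0 message with the method `tools/call`. The sketch below shows roughly what the agent sends and how it might read the reply. The tool name `query_knowledge_base` and its `query` argument are assumptions for illustration, not ClawRAG's documented tool schema, and the reply is a hand-written stand-in rather than real server output.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build an MCP 'tools/call' request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def extract_text(response):
    """Join the text items of a tool result, raising if the tool reported an error."""
    result = response["result"]
    if result.get("isError"):
        raise RuntimeError("tool call failed")
    return "\n".join(
        item["text"] for item in result["content"] if item["type"] == "text"
    )


# Hypothetical tool name and argument, chosen for this example
request = build_tool_call(
    1, "query_knowledge_base", {"query": "Is the fridge still under warranty?"}
)
wire_message = json.dumps(request)

# Hand-written stand-in for a server reply
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [
            {"type": "text", "text": "Invoice_Kitchen_2023.pdf: 3-year 'Premium' extension"}
        ],
        "isError": False,
    },
}
print(extract_text(response))
```

The agent never touches the vector store directly; it only needs the tool's name and argument schema, which is what makes the knowledge base portable across MCP-capable clients.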

Example Conversation:

  • Me: "Hey, is the fridge still covered? Sis says warranty is over."
  • ClawRAG: "Actually, checking Invoice_Kitchen_2023.pdf: You have a 3-year 'Premium' extension. It expires June 2026. Your sister is likely thinking of the standard 12-month manufacturer warranty."

Why I Open Sourced It

ClawRAG isn't just a toy; it's the core extracted from our professional RAG platform. It's built for developers who need:

  1. Production Readiness: No mocks, no fakes
  2. Privacy: Runs 100% local with Ollama and ChromaDB
  3. Extensibility: MIT licensed and easy to hook into existing agents

Check it out, star it, and tell me where it breaks:
github.com/2dogsandanerd/ClawRag

P.S. If you're reading this, Sis: I'm still waiting for that new flatscreen you bet me.
