Haji Rufai

Posted on May 24

Building a RAG Document Q&A System with Hybrid Retrieval (No Embeddings API Needed)

#ai #python #rag #machinelearning

Building a production-quality RAG (Retrieval-Augmented Generation) system taught me one thing: the retrieval step matters more than the LLM you pick. In this post, I'll walk through how I built DocuMind — a document Q&A system that uses hybrid retrieval (TF-IDF + BM25) to find the right context before generating answers.

No GPUs required. No paid embedding APIs. Just scikit-learn, numpy, and a free LLM tier.

GitHub: github.com/hajirufai/documind

The Problem with Naive RAG

Most RAG tutorials follow this pattern:

Chunk documents
Embed chunks with OpenAI/Cohere
Store in Pinecone/ChromaDB
Retrieve top-K by cosine similarity
Feed to GPT-4

This works — but it has real weaknesses:

Embedding APIs cost money at scale (and add latency)
Pure semantic search misses exact keywords — ask "What is the ROI?" and semantic search might return chunks about "return on investment" but miss the one that literally says "ROI is 45%"
Vector databases add infrastructure you need to manage

DocuMind takes a different approach: hybrid retrieval that pairs TF-IDF cosine similarity (a lightweight stand-in for semantic search) with BM25 keyword search, using only free, local libraries.

Architecture Overview

Document → Parse → Chunk → Index (TF-IDF + BM25)
                                    ↓
Question → Hybrid Search → Top-K Chunks → LLM → Cited Answer

The pipeline has five stages:

Parse — Extract text from PDF, Markdown, TXT, or CSV
Chunk — Recursively split into overlapping pieces
Index — Build dual indices (TF-IDF vectors + BM25 token index)
Retrieve — Score chunks with both methods, combine with weighted fusion
Generate — Send context + question to any OpenAI-compatible LLM

Let me break down each piece with actual code.
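
The parse stage is not shown in this post, so here is a minimal sketch of what it could look like for the text-based formats. The function name parse_document and the "column: value" layout for CSV rows are assumptions made for illustration, not DocuMind's actual code, and PDF extraction is left out because it needs a third-party library.

```python
import csv
from pathlib import Path


def parse_document(path: str) -> str:
    """Return the plain text of a .txt, .md, or .csv file (hypothetical helper)."""
    file = Path(path)
    suffix = file.suffix.lower()
    if suffix in {".txt", ".md"}:
        return file.read_text(encoding="utf-8")
    if suffix == ".csv":
        with file.open(newline="", encoding="utf-8") as handle:
            rows = csv.DictReader(handle)
            # One line per row as "column: value" pairs, so headers travel with values
            return "\n".join(
                ", ".join(f"{key}: {value}" for key, value in row.items())
                for row in rows
            )
    raise ValueError(f"Unsupported file type: {suffix}")
```

Flattening each CSV row into labeled pairs keeps the column names next to the values, which gives a keyword index something to match on.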

Smart Chunking: Not Just Fixed-Size Splits

Most tutorials split text every N characters. That breaks mid-sentence, loses context, and produces bad retrieval results. DocuMind uses recursive splitting — it tries paragraph breaks first, then sentences, then words:

def recursive_split(
    text: str,
    chunk_size: int = 800,
    chunk_overlap: int = 200,
    separators: list[str] | None = None,
) -> list[str]:
    if separators is None:
        separators = ["\n\n", "\n", ". ", "! ", "? ", "; ", ", ", " "]

    if len(text) <= chunk_size:
        return [text.strip()] if text.strip() else []

    for sep in separators:
        parts = text.split(sep)
        if len(parts) <= 1:
            continue

        chunks = []
        current = ""
        for part in parts:
            candidate = (current + sep + part) if current else part
            if len(candidate) <= chunk_size:
                current = candidate
            else:
                if current:
                    chunks.append(current.strip())
                if len(part) > chunk_size:
                    # Recurse with finer-grained separators
                    remaining = separators[separators.index(sep) + 1:]
                    sub_chunks = recursive_split(part, chunk_size, chunk_overlap, remaining)
                    chunks.extend(sub_chunks)
                    current = ""
                else:
                    current = part
        if current.strip():
            chunks.append(current.strip())
        if chunks:
            return _add_overlap(chunks, chunk_overlap, text)

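    # Last resort: hard split on a fixed stride (guard against a non-positive step)
    step = max(1, chunk_size - chunk_overlap)
    pieces = (text[i:i + chunk_size].strip() for i in range(0, len(text), step))
    return [piece for piece in pieces if piece]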

The overlap between chunks (200 chars by default) ensures context isn't lost at boundaries. And by splitting on natural boundaries first, each chunk is more semantically coherent.

The Hybrid Retrieval Engine

This is the core innovation. Instead of picking one retrieval method, DocuMind uses both:

TF-IDF (Semantic-ish Search)

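TF-IDF with bigrams captures term co-occurrence patterns. It's not "true" semantic search like dense embeddings: it can only match terms that the query and a chunk share, so it will not connect genuine synonyms. With sublinear_tf=True and ngram_range=(1,2), though, it rewards shared phrases and dampens heavily repeated terms, which surfaces topically related chunks surprisingly well: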

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

self.tfidf_vectorizer = TfidfVectorizer(
    max_features=10000,
    stop_words="english",
    ngram_range=(1, 2),   # Unigrams + bigrams
    sublinear_tf=True,     # Logarithmic TF scaling
)
self.tfidf_matrix = self.tfidf_vectorizer.fit_transform(texts)

# At query time:
query_vec = self.tfidf_vectorizer.transform([query])
scores = cosine_similarity(query_vec, self.tfidf_matrix).flatten()
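
To make those two vectorizer settings concrete, here is a small standard-library sketch of what they do. With sublinear_tf=True, scikit-learn replaces a raw term count tf with 1 + ln(tf), and ngram_range=(1, 2) adds adjacent word pairs to the vocabulary. The helper names are mine, and the real vectorizer also applies stop-word removal, IDF weighting, and normalization.

```python
import math


def sublinear_tf(count: int) -> float:
    """Scaled term frequency under sublinear_tf=True: 1 + ln(count)."""
    return 1.0 + math.log(count) if count > 0 else 0.0


def unigrams_and_bigrams(tokens: list[str]) -> list[str]:
    """Features produced for ngram_range=(1, 2): single words plus adjacent pairs."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return list(tokens) + bigrams


for count in (1, 10, 100):
    print(count, round(sublinear_tf(count), 2))

print(unigrams_and_bigrams(["machine", "learning", "model"]))
```

A term that appears 100 times ends up weighted about 5.6 times a single occurrence rather than 100 times, so one repetitive chunk cannot dominate the ranking.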

BM25 (Keyword Search)

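BM25 is the default ranking function in Elasticsearch (and in Lucene, which underlies it). It excels at exact keyword matching and normalizes for document length, so long chunks are not favored just for containing more words: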

import re

from rank_bm25 import BM25Okapi

tokenized = [re.findall(r"\w+", text.lower()) for text in texts]
self.bm25 = BM25Okapi(tokenized)

# At query time:
tokens = re.findall(r"\w+", query.lower())
scores = self.bm25.get_scores(tokens)
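
For readers who want to see what happens inside that call, here is a standard-library sketch of the textbook Okapi BM25 formula. It is not rank_bm25's exact implementation (the IDF smoothing in particular may differ), bm25_scores is a name I made up, and it assumes the corpus contains at least one non-empty document.

```python
import math
import re
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with the textbook Okapi BM25 formula."""
    tokenized = [re.findall(r"\w+", doc.lower()) for doc in docs]
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    query_terms = re.findall(r"\w+", query.lower())
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            containing = sum(1 for other in tokenized if term in other)
            # Smoothed IDF: rare terms weigh more; the +1 keeps the value non-negative
            idf = math.log(1 + (len(tokenized) - containing + 0.5) / (containing + 0.5))
            tf = counts[term]
            # Length normalization: longer documents need more hits for the same score
            norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(score)
    return scores


docs = [
    "ROI is 45% for the Q3 campaign",
    "Return on investment improved across all marketing channels this year",
    "The campaign launched in July",
]
print(bm25_scores("ROI", docs))
```

Only the first document gets a non-zero score for the query "ROI", which is the exact-match behavior the keyword side of the hybrid search relies on.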

Combining Both: Weighted Fusion

The hybrid search normalizes both score sets to [0, 1] and combines them:

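import numpy as np

def search(self, query: str, top_k: int = 5, alpha: float = 0.6) -> list[RetrievalResult]:
    # Score every chunk with both methods; both arrays are aligned with self.chunks
    query_vec = self.tfidf_vectorizer.transform([query])
    semantic = cosine_similarity(query_vec, self.tfidf_matrix).flatten()
    keyword = np.asarray(self.bm25.get_scores(re.findall(r"\w+", query.lower())))

    # Normalize scores to [0, 1] (min-max scaling assumed here)
    def normalize(scores):
        span = scores.max() - scores.min()
        return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

    # Weighted combination: alpha weights TF-IDF, (1 - alpha) weights BM25
    combined = alpha * normalize(semantic) + (1 - alpha) * normalize(keyword)

    # Highest combined score first; the RetrievalResult(chunk, score) fields are assumed
    best = np.argsort(combined)[::-1][:top_k]
    return [RetrievalResult(chunk=self.chunks[i], score=float(combined[i])) for i in best]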

With alpha=0.6, retrieval is 60% semantic and 40% keyword. This is configurable — bump up keyword weight for technical docs with lots of jargon, or increase semantic weight for conversational documents.
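
To see how much alpha matters, here is a small worked example with made-up scores for three chunks. It assumes min-max normalization, and the helper names are illustrative rather than taken from DocuMind.

```python
def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0 for _ in scores]
    return [(s - low) / (high - low) for s in scores]


def fuse(semantic: list[float], keyword: list[float], alpha: float = 0.6) -> list[float]:
    """Weighted sum of the two normalized score lists."""
    sem, kw = min_max(semantic), min_max(keyword)
    return [alpha * s + (1 - alpha) * k for s, k in zip(sem, kw)]


def best_chunk(semantic: list[float], keyword: list[float], alpha: float) -> int:
    """Index of the chunk with the highest fused score."""
    fused = fuse(semantic, keyword, alpha)
    return max(range(len(fused)), key=fused.__getitem__)


# Made-up scores for three chunks (illustrative only)
semantic = [0.9, 0.3, 0.1]  # chunk 0 is the closest TF-IDF match
keyword = [0.8, 8.0, 0.0]   # chunk 1 is the strongest BM25 match

print(best_chunk(semantic, keyword, alpha=0.6))  # chunk 0 wins: about 0.64 vs 0.55
print(best_chunk(semantic, keyword, alpha=0.3))  # chunk 1 wins: about 0.775 vs 0.37
```

Lowering alpha from 0.6 to 0.3 flips the top result from the TF-IDF favorite to the BM25 favorite, which is the trade-off to tune for jargon-heavy documents.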

Why Does This Work?

Query	TF-IDF Finds	BM25 Finds	Hybrid Finds
"machine learning performance"	Chunks about ML accuracy, model evaluation	Chunks literally containing "performance"	Both — best coverage
"ROI of the Q3 campaign"	General marketing chunks	Exact ROI mention	The specific ROI chunk + context
"How do I test Python code?"	Testing methodology chunks	Chunks with "pytest", "unittest"	Complete testing guidance

Pluggable LLM Generation

DocuMind works with any OpenAI-compatible API. The default is Groq's free tier (Llama 3.3 70B at 300+ tokens/sec):

import httpx

def generate_answer(question, results, conversation, config):
    # Label each retrieved chunk so the model can cite it as [Source N]
    context = "\n\n".join(
        f"[Source {i+1}] {r.chunk.text}"
        for i, r in enumerate(results)
    )

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        *conversation[-6:],  # Keep only the six most recent messages
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
    ]

    response = httpx.post(
        f"{config.api_base}/chat/completions",
        headers={"Authorization": f"Bearer {config.api_key}"},
        json={"model": config.model, "messages": messages, "temperature": 0.1},
        timeout=30.0,
    )
    response.raise_for_status()  # Fail loudly on HTTP errors (bad key, rate limit)
    return response.json()["choices"][0]["message"]["content"]

Zero-cost mode: When no API key is set, DocuMind returns the most relevant chunks directly as an extractive answer. Still useful — and completely free.
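
The extractive fallback can be very small. The sketch below is an assumption about how it might work rather than DocuMind's code: Chunk and RetrievalResult are stand-in dataclasses, and extractive_answer and max_chunks are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Chunk:  # stand-in for the project's chunk type
    text: str
    source: str


@dataclass
class RetrievalResult:  # stand-in with only the fields used here
    chunk: Chunk
    score: float


def extractive_answer(results: list[RetrievalResult], max_chunks: int = 3) -> str:
    """Build an answer from the top-ranked chunks without calling an LLM."""
    if not results:
        return "No relevant passages found."
    top = sorted(results, key=lambda r: r.score, reverse=True)[:max_chunks]
    return "\n\n".join(
        f"[Source {i + 1}: {r.chunk.source}] {r.chunk.text}" for i, r in enumerate(top)
    )
```

Because the output keeps the same [Source N] labels as the LLM prompt, the rest of the pipeline can treat both modes the same way.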

The CLI Experience

I wanted DocuMind to feel professional from the terminal:

# Ingest documents
$ documind ingest report.pdf notes.md data.csv
📄 Ingested report.pdf → 23 chunks (4,521 words) in 89ms
📄 Ingested notes.md → 8 chunks (1,203 words) in 12ms
📄 Ingested data.csv → 45 chunks (2,890 words) in 34ms

# Ask questions
$ documind ask "What were the key findings?"
🔍 Retrieved 5 relevant chunks (hybrid search, 14ms)

The key findings include:
1. Revenue grew 23% YoY driven by...
2. Customer retention improved to 94%...

Sources:
  [1] report.pdf (p.3, score: 0.89)
  [2] report.pdf (p.7, score: 0.76)
  [3] notes.md (score: 0.61)

# Interactive chat with memory
$ documind chat

Built with Rich for tables, progress bars, and colored output.

Web UI

The web interface uses Tailwind CSS + Alpine.js — no build step, no npm, just HTML:

Drag-and-drop document upload
Real-time chat with streaming responses
Source cards showing which chunks were used
Dark mode
Mobile responsive

All served from a single Python file (web.py) using the built-in http.server module. Zero extra dependencies for the frontend.
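
For a sense of how little code a standard-library server needs, here is a minimal sketch of a JSON endpoint built on http.server. The /api/ask route, the handler name, and the stubbed answer are assumptions for illustration. The real web.py also serves the HTML page and streams responses, which this sketch leaves out.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def answer_payload(question: str) -> bytes:
    """Encode a placeholder answer as JSON; a real handler would call the pipeline."""
    body = {"question": question, "answer": f"(stub) You asked: {question}", "sources": []}
    return json.dumps(body).encode("utf-8")


class AskHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/api/ask":  # hypothetical route
            self.send_error(404)
            return
        # Read exactly the number of bytes the client announced
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")
        payload = answer_payload(request.get("question", ""))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(port: int = 8080) -> ThreadingHTTPServer:
    """Bind the handler to localhost; call serve_forever() on the result to run it."""
    return ThreadingHTTPServer(("127.0.0.1", port), AskHandler)
```

Running make_server().serve_forever() starts the server on port 8080. ThreadingHTTPServer handles each request in its own thread, so one slow LLM call does not block other users.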

Testing Without API Keys

Every test runs without any API key. The test suite uses extractive mode:

@pytest.fixture
def pipeline(tmp_path):
    config = Config(data_dir=str(tmp_path), api_key="")  # No LLM
    return DocuMindPipeline(config)

def test_ingest_and_query(pipeline, sample_doc):
    result = pipeline.ingest(sample_doc)
    assert result.chunks_created > 0

    answer = pipeline.query("What is this about?")
    assert len(answer.sources) > 0
    assert answer.answer  # Extractive answer from chunks

20 tests covering chunking, ingestion, retrieval, and the full pipeline — all passing in under 2 seconds.

What I Learned

Retrieval quality > LLM quality. A mediocre LLM with great context beats a powerful LLM with bad context. Spend your optimization budget on retrieval.
Hybrid search is worth the complexity. The code is only ~50 lines more than pure semantic search, but retrieval quality improves noticeably on mixed queries.
You don't need embeddings APIs. By my rough estimate, TF-IDF with bigrams handles 90% of use cases for document Q&A. Save the embedding APIs for when you genuinely need cross-lingual or deep semantic matching.
Chunking strategy matters. Recursive splitting with overlap produces dramatically better results than naive fixed-size splits. The extra code is worth it.
Make it work without the LLM. The extractive fallback means anyone can clone and immediately use DocuMind. No signup, no API key, no cost. That lowers the barrier to trying it — and trying it is what gets stars.

Try It

git clone https://github.com/hajirufai/documind.git
cd documind
pip install -r requirements.txt
documind ingest sample_docs/*.md sample_docs/*.csv
documind ask "What are Python testing best practices?"

Or with Docker:

docker compose up
# Open http://localhost:8080

The full source is on GitHub: hajirufai/documind

Building projects that actually work > collecting tutorials. If you're learning RAG, build one from scratch — you'll understand every tradeoff.

DEV Community