Building a production-quality RAG (Retrieval-Augmented Generation) system taught me one thing: the retrieval step matters more than the LLM you pick. In this post, I'll walk through how I built DocuMind — a document Q&A system that uses hybrid retrieval (TF-IDF + BM25) to find the right context before generating answers.
No GPUs required. No paid embedding APIs. Just scikit-learn, numpy, and a free LLM tier.
GitHub: github.com/hajirufai/documind
The Problem with Naive RAG
Most RAG tutorials follow this pattern:
- Chunk documents
- Embed chunks with OpenAI/Cohere
- Store in Pinecone/ChromaDB
- Retrieve top-K by cosine similarity
- Feed to GPT-4
This works — but it has real weaknesses:
- Embedding APIs cost money at scale (and add latency)
- Pure semantic search misses exact keywords — ask "What is the ROI?" and semantic search might return chunks about "return on investment" but miss the one that literally says "ROI is 45%"
- Vector databases add infrastructure you need to manage
DocuMind takes a different approach: hybrid retrieval that combines the strengths of both semantic and keyword search, using only free, local libraries.
Architecture Overview
Document → Parse → Chunk → Index (TF-IDF + BM25)
↓
Question → Hybrid Search → Top-K Chunks → LLM → Cited Answer
The pipeline has five stages:
- Parse — Extract text from PDF, Markdown, TXT, or CSV
- Chunk — Recursively split into overlapping pieces
- Index — Build dual indices (TF-IDF vectors + BM25 token index)
- Retrieve — Score chunks with both methods, combine with weighted fusion
- Generate — Send context + question to any OpenAI-compatible LLM
Let me break down each piece with actual code.
Smart Chunking: Not Just Fixed-Size Splits
Most tutorials split text every N characters. That breaks mid-sentence, loses context, and produces bad retrieval results. DocuMind uses recursive splitting — it tries paragraph breaks first, then sentences, then words:
def recursive_split(
    text: str,
    chunk_size: int = 800,
    chunk_overlap: int = 200,
    separators: list[str] | None = None,
) -> list[str]:
    if separators is None:
        separators = ["\n\n", "\n", ". ", "! ", "? ", "; ", ", ", " "]
    if len(text) <= chunk_size:
        return [text.strip()] if text.strip() else []
    for sep in separators:
        parts = text.split(sep)
        if len(parts) <= 1:
            continue
        chunks = []
        current = ""
        for part in parts:
            candidate = (current + sep + part) if current else part
            if len(candidate) <= chunk_size:
                current = candidate
            else:
                if current:
                    chunks.append(current.strip())
                if len(part) > chunk_size:
                    # Recurse with finer-grained separators
                    remaining = separators[separators.index(sep) + 1:]
                    sub_chunks = recursive_split(part, chunk_size, chunk_overlap, remaining)
                    chunks.extend(sub_chunks)
                    current = ""
                else:
                    current = part
        if current.strip():
            chunks.append(current.strip())
        if chunks:
            return _add_overlap(chunks, chunk_overlap, text)
    # Last resort: hard split
    return [text[i:i+chunk_size].strip()
            for i in range(0, len(text), chunk_size - chunk_overlap)]
The overlap between chunks (200 chars by default) ensures context isn't lost at boundaries. And by splitting on natural boundaries first, each chunk is more semantically coherent.
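The `_add_overlap` helper called above isn't shown in this post. Here's a minimal sketch of what such a helper might do, assuming it simply prefixes each chunk with the tail of the previous one. The name `add_overlap_sketch` and its two-argument signature are hypothetical; the real helper also receives the original text.

```python
def add_overlap_sketch(chunks: list[str], chunk_overlap: int) -> list[str]:
    """Prefix each chunk (except the first) with the tail of the previous one.

    Hypothetical stand-in for DocuMind's _add_overlap, which is not shown here.
    """
    if chunk_overlap <= 0 or len(chunks) < 2:
        return list(chunks)

    overlapped = [chunks[0]]
    for prev, cur in zip(chunks, chunks[1:]):
        start = max(len(prev) - chunk_overlap, 0)
        # If the cut lands mid-word, move forward to the next word boundary.
        if start > 0 and not prev[start - 1].isspace():
            next_space = prev.find(" ", start)
            if next_space != -1:
                start = next_space + 1
        tail = prev[start:]
        overlapped.append(f"{tail} {cur}" if tail else cur)
    return overlapped


chunks = ["alpha beta gamma delta", "epsilon zeta", "eta theta"]
print(add_overlap_sketch(chunks, chunk_overlap=11))
```

Note that the overlap is taken from the original previous chunk, not the already-extended one, so overlaps don't snowball from chunk to chunk.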
The Hybrid Retrieval Engine
This is the core of DocuMind. Instead of picking one retrieval method, it uses both:
TF-IDF (Semantic-ish Search)
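TF-IDF with bigrams captures short phrases as well as single terms. It's not "true" semantic search like dense embeddings: it only gives credit for terms that literally appear in both the query and the chunk, so it won't connect real synonyms. But with sublinear_tf=True (which dampens heavily repeated terms) and ngram_range=(1,2), cosine similarity over the whole query vocabulary ranks topically related chunks surprisingly well: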
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
self.tfidf_vectorizer = TfidfVectorizer(
    max_features=10000,
    stop_words="english",
    ngram_range=(1, 2),  # Unigrams + bigrams
    sublinear_tf=True,   # Logarithmic TF scaling
)
self.tfidf_matrix = self.tfidf_vectorizer.fit_transform(texts)
# At query time:
query_vec = self.tfidf_vectorizer.transform([query])
scores = cosine_similarity(query_vec, self.tfidf_matrix).flatten()
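To make that scoring concrete, here's a small pure-Python sketch of the same idea: sublinear term frequency, smoothed IDF, L2 normalization, and cosine similarity. It uses unigrams only and skips stop-word removal for brevity, and the helper names (`tokenize`, `tfidf_vectors`, `weigh`, `cosine`) are made up for this example. It approximates scikit-learn's weighting rather than reproducing its output. It also shows a limit worth knowing: a chunk that shares no term with the query scores zero, however related it is.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def weigh(counts: Counter, idf: dict[str, float]) -> dict[str, float]:
    """Turn raw term counts into an L2-normalized TF-IDF vector."""
    # Sublinear TF: 1 + ln(tf). Terms never seen at fit time are dropped.
    vec = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def tfidf_vectors(texts: list[str]) -> tuple[list[dict[str, float]], dict[str, float]]:
    """Return one TF-IDF vector per text, plus the IDF table."""
    docs = [Counter(tokenize(t)) for t in texts]
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    return [weigh(doc, idf) for doc in docs], idf


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    # Both vectors are unit length, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


texts = ["ROI is 45% for the Q3 campaign", "Return on investment improved last quarter"]
vectors, idf = tfidf_vectors(texts)
query = weigh(Counter(tokenize("What is the ROI?")), idf)
scores = [cosine(query, v) for v in vectors]
print(scores)  # the second chunk scores 0.0: it shares no term with the query
```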
BM25 (Keyword Search)
BM25 is the default ranking function in Elasticsearch. It excels at exact keyword matching, and it normalizes for document length so long chunks don't win just by containing more words:
import re
from rank_bm25 import BM25Okapi
tokenized = [re.findall(r"\w+", text.lower()) for text in texts]
self.bm25 = BM25Okapi(tokenized)
# At query time:
tokens = re.findall(r"\w+", query.lower())
scores = self.bm25.get_scores(tokens)
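If you're curious what a BM25 scorer computes, the sketch below implements the formula in plain Python. It's an illustration, not rank_bm25's code: it assumes the typical values k1=1.5 and b=0.75 and a non-negative IDF variant, `ln(1 + (N - n + 0.5) / (n + 0.5))`, so the numbers won't match `BM25Okapi` exactly. The function name `bm25_scores` is made up for this example.

```python
import math
import re
from collections import Counter


def bm25_scores(query: str, texts: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every text against the query with the BM25 formula."""
    docs = [re.findall(r"\w+", t.lower()) for t in texts]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many texts each term appears.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        # Longer-than-average documents get a larger denominator.
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in re.findall(r"\w+", query.lower()):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


texts = [
    "ROI is 45% for the Q3 campaign",
    "The campaign ran across several channels and regions during the third quarter",
    "Quarterly marketing notes",
]
print(bm25_scores("ROI campaign", texts))
```

The first text matches both query terms and is short, so it ranks first. The second matches only "campaign" and is longer than average. The third matches nothing and scores zero.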
Combining Both: Weighted Fusion
The hybrid search normalizes both score sets to [0, 1] and combines them:
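def search(self, query: str, top_k: int = 5, alpha: float = 0.6) -> list[RetrievalResult]:
    # Simplified; assumes results expose .chunk.id and .score, and that
    # normalize() maps a {chunk_id: score} dict onto [0, 1]
    semantic_results = self.search_semantic(query, top_k=len(self.chunks))
    keyword_results = self.search_keyword(query, top_k=len(self.chunks))
    # Normalize scores
    norm_semantic = normalize({r.chunk.id: r.score for r in semantic_results})
    norm_keyword = normalize({r.chunk.id: r.score for r in keyword_results})
    # Weighted combination
    combined = []
    for chunk in self.chunks:
        sem = norm_semantic.get(chunk.id, 0.0)
        kw = norm_keyword.get(chunk.id, 0.0)
        score = alpha * sem + (1 - alpha) * kw  # alpha=0.6 default
        combined.append(RetrievalResult(chunk=chunk, score=score))
    # Sort by combined score, best first
    return sorted(combined, key=lambda r: r.score, reverse=True)[:top_k]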
With alpha=0.6, the combined score is 60% TF-IDF (the semantic-ish side) and 40% BM25 (the keyword side). This is configurable: lower alpha to give keywords more weight for technical docs with lots of jargon, or raise it for conversational documents.
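The snippet above leaves out the `normalize` helper. Here's a minimal, self-contained sketch of min-max normalization plus weighted fusion, using plain lists of scores aligned by chunk index. The names `min_max` and `fuse` are illustrative, not DocuMind's API.

```python
def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(
    semantic: list[float],
    keyword: list[float],
    alpha: float = 0.6,
    top_k: int = 5,
) -> list[tuple[int, float]]:
    """Return (chunk_index, combined_score) pairs, best first."""
    sem, kw = min_max(semantic), min_max(keyword)
    combined = [alpha * s + (1 - alpha) * k for s, k in zip(sem, kw)]
    ranked = sorted(enumerate(combined), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


tfidf_scores = [0.0, 0.8, 0.4]  # cosine similarities, already within [0, 1]
bm25_scores = [8.0, 0.0, 4.0]   # BM25 scores are unbounded, so they need rescaling
print(fuse(tfidf_scores, bm25_scores, alpha=0.6, top_k=2))
```

Normalizing before mixing matters because raw BM25 scores can be far larger than cosine similarities. Without it, the keyword side would dominate regardless of alpha.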
Why Does This Work?
| Query | TF-IDF Finds | BM25 Finds | Hybrid Finds |
|---|---|---|---|
| "machine learning performance" | Chunks about ML accuracy, model evaluation | Chunks literally containing "performance" | Both — best coverage |
| "ROI of the Q3 campaign" | General marketing chunks | Exact ROI mention | The specific ROI chunk + context |
| "How do I test Python code?" | Testing methodology chunks | Chunks with "pytest", "unittest" | Complete testing guidance |
Pluggable LLM Generation
DocuMind works with any OpenAI-compatible API. The default is Groq's free tier (Llama 3.3 70B at 300+ tokens/sec):
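import httpx

def generate_answer(question, results, conversation, config):
    # Number each retrieved chunk so the model can cite it as [Source N]
    context = "\n\n".join(
        f"[Source {i+1}] {r.chunk.text}"
        for i, r in enumerate(results)
    )
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        *conversation[-6:],  # Keep only the last six conversation messages
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    response = httpx.post(
        f"{config.api_base}/chat/completions",
        headers={"Authorization": f"Bearer {config.api_key}"},
        json={"model": config.model, "messages": messages, "temperature": 0.1},
        timeout=30.0,  # httpx defaults to 5 seconds, which is short for LLM calls
    )
    response.raise_for_status()  # Surface HTTP errors instead of a KeyError below
    return response.json()["choices"][0]["message"]["content"]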
Zero-cost mode: When no API key is set, DocuMind returns the most relevant chunks directly as an extractive answer. Still useful — and completely free.
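The fallback logic itself isn't shown in this post. Here's a rough sketch of how an extractive answer could be assembled from the retrieved chunks when no key is configured. The `Hit` dataclass and the `extractive_answer` and `answer` functions are hypothetical stand-ins, not DocuMind's actual types.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical stand-in for a retrieval result."""
    text: str
    source: str
    score: float


def extractive_answer(hits: list[Hit], max_chunks: int = 3, max_chars: int = 300) -> str:
    """Build an answer by quoting the top-scoring chunks with their sources."""
    if not hits:
        return "No relevant passages found."
    best = sorted(hits, key=lambda h: h.score, reverse=True)[:max_chunks]
    lines = []
    for i, hit in enumerate(best, start=1):
        snippet = hit.text.strip()
        if len(snippet) > max_chars:
            snippet = snippet[:max_chars].rstrip() + "..."
        lines.append(f"[{i}] {snippet} ({hit.source}, score: {hit.score:.2f})")
    return "\n".join(lines)


def answer(question: str, hits: list[Hit], api_key: str) -> str:
    # With no API key, skip the LLM call entirely and return the passages.
    if not api_key:
        return extractive_answer(hits)
    raise NotImplementedError("call the LLM here, e.g. generate_answer(...)")


hits = [
    Hit("Revenue grew 23% YoY.", "report.pdf", 0.89),
    Hit("Meeting notes.", "notes.md", 0.61),
]
print(answer("What were the key findings?", hits, api_key=""))
```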
The CLI Experience
I wanted DocuMind to feel professional from the terminal:
# Ingest documents
$ documind ingest report.pdf notes.md data.csv
📄 Ingested report.pdf → 23 chunks (4,521 words) in 89ms
📄 Ingested notes.md → 8 chunks (1,203 words) in 12ms
📄 Ingested data.csv → 45 chunks (2,890 words) in 34ms
# Ask questions
$ documind ask "What were the key findings?"
🔍 Retrieved 5 relevant chunks (hybrid search, 14ms)
The key findings include:
1. Revenue grew 23% YoY driven by...
2. Customer retention improved to 94%...
Sources:
[1] report.pdf (p.3, score: 0.89)
[2] report.pdf (p.7, score: 0.76)
[3] notes.md (score: 0.61)
# Interactive chat with memory
$ documind chat
Built with Rich for tables, progress bars, and colored output.
Web UI
The web interface uses Tailwind CSS + Alpine.js — no build step, no npm, just HTML:
- Drag-and-drop document upload
- Real-time chat with streaming responses
- Source cards showing which chunks were used
- Dark mode
- Mobile responsive
All served from a single Python file (web.py) using the built-in http.server module. Zero extra dependencies for the frontend.
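For a sense of how little code http.server needs, here's a stripped-down sketch of a single-file server with one HTML page and one JSON endpoint. The `/api/ask` route, the `answer_question` placeholder, and the `make_server` helper are assumptions for illustration, not the contents of web.py, and the sketch leaves out streaming and file upload.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

INDEX_HTML = b"<!doctype html><title>DocuMind</title><div id='app'>...</div>"


def answer_question(question: str) -> dict:
    # Placeholder: a real handler would call the retrieval pipeline here.
    return {"answer": f"You asked: {question}", "sources": []}


class Handler(BaseHTTPRequestHandler):
    def _send(self, status: int, body: bytes, content_type: str) -> None:
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self) -> None:
        if self.path == "/":
            self._send(200, INDEX_HTML, "text/html; charset=utf-8")
        else:
            self._send(404, b"Not found", "text/plain; charset=utf-8")

    def do_POST(self) -> None:
        if self.path != "/api/ask":
            self._send(404, b"Not found", "text/plain; charset=utf-8")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
            question = str(payload["question"])
        except (ValueError, KeyError, TypeError):
            self._send(400, b'{"error": "bad request"}', "application/json")
            return
        body = json.dumps(answer_question(question)).encode("utf-8")
        self._send(200, body, "application/json")

    def log_message(self, format, *args) -> None:
        pass  # Silence the default per-request logging.


def make_server(port: int = 8080) -> ThreadingHTTPServer:
    return ThreadingHTTPServer(("127.0.0.1", port), Handler)
```

Calling `make_server().serve_forever()` starts it on port 8080. http.server is fine for a local tool like this, though the standard library documentation does not recommend it for production use.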
Testing Without API Keys
Every test runs without any API key. The test suite uses extractive mode:
import pytest

@pytest.fixture
def pipeline(tmp_path):
    config = Config(data_dir=str(tmp_path), api_key="")  # No LLM
    return DocuMindPipeline(config)

def test_ingest_and_query(pipeline, sample_doc):
    result = pipeline.ingest(sample_doc)
    assert result.chunks_created > 0
    answer = pipeline.query("What is this about?")
    assert len(answer.sources) > 0
    assert answer.answer  # Extractive answer from chunks
20 tests covering chunking, ingestion, retrieval, and the full pipeline — all passing in under 2 seconds.
What I Learned
Retrieval quality > LLM quality. A mediocre LLM with great context beats a powerful LLM with bad context. Spend your optimization budget on retrieval.
Hybrid search is worth the complexity. The code is only ~50 lines more than pure semantic search, but retrieval quality improves noticeably on mixed queries.
You don't need embeddings APIs. TF-IDF with bigrams handles 90% of use cases for document Q&A. Save the embedding APIs for when you genuinely need cross-lingual or deep semantic matching.
Chunking strategy matters. Recursive splitting with overlap produces dramatically better results than naive fixed-size splits. The extra code is worth it.
Make it work without the LLM. The extractive fallback means anyone can clone and immediately use DocuMind. No signup, no API key, no cost. That lowers the barrier to trying it — and trying it is what gets stars.
Try It
git clone https://github.com/hajirufai/documind.git
cd documind
pip install -r requirements.txt
documind ingest sample_docs/*.md sample_docs/*.csv
documind ask "What are Python testing best practices?"
Or with Docker:
docker compose up
# Open http://localhost:8080
The full source is on GitHub: hajirufai/documind
Building projects that actually work > collecting tutorials. If you're learning RAG, build one from scratch — you'll understand every tradeoff.