Everyone reaching for a vector database when building RAG is solving the wrong problem first. For most domain-specific corpora — technical documentation, company knowledge bases, article archives — BM25 retrieval is competitive with semantic search, costs a fraction of the compute, and is dramatically simpler to operate. This tutorial shows you how to build a full RAG pipeline using Meilisearch as the retrieval backend, stream responses from an LLM API, and evaluate hit rate without a single embedding model.
Why RAG, and why not a vector database
Retrieval-Augmented Generation solves a fundamental problem: LLMs have a knowledge cutoff and a finite context window. You want answers grounded in your documents, not hallucinated from pre-training.
The standard advice is to use a vector database (Pinecone, Weaviate, Chroma). Vector search is powerful for open-domain retrieval where semantic similarity matters. But on a domain-specific corpus with consistent terminology — think a cybersecurity knowledge base or a medical reference — BM25 with typo tolerance typically achieves 85–95% of the recall you'd get from embeddings, with zero GPU cost, sub-10ms latency, and no embedding pipeline to maintain.
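Meilisearch gives you fast keyword search out of the box, plus typo tolerance, faceted filtering, and a simple REST API. Strictly speaking, its ranking is not a BM25 score: results are ordered by a sequence of ranking rules (words, typo, proximity, attribute, sort, exactness). It fills the same lexical-retrieval role in this pipeline, though. It's what I use to power the search across 1,600+ articles at AYI NEDJIMI Consultants.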
Setup
pip install meilisearch openai
Run Meilisearch locally:
docker run -d -p 7700:7700 getmeili/meilisearch:latest
Step 1: Index your documents
Your documents need an id, searchable content, and any filter attributes you want to use at query time.
import hashlib
import json

import meilisearch

MEILI_URL = "http://127.0.0.1:7700"
MEILI_KEY = "your_master_key"  # or "" for local dev
INDEX_NAME = "knowledge_base"

client = meilisearch.Client(MEILI_URL, MEILI_KEY)

def get_or_create_index():
    try:
        index = client.get_index(INDEX_NAME)
    except meilisearch.errors.MeilisearchApiError:
        # Index not found: create it and wait for the task to complete
        task = client.create_index(INDEX_NAME, {"primaryKey": "id"})
        client.wait_for_task(task.task_uid)
        index = client.get_index(INDEX_NAME)
    # Configure searchable attributes and filters
    index.update_settings({
        "searchableAttributes": ["title", "content", "tags"],
        "filterableAttributes": ["category", "doc_type"],
        "rankingRules": [
            "words", "typo", "proximity", "attribute", "sort", "exactness"
        ],
        "typoTolerance": {
            "enabled": True,
            "minWordSizeForTypos": {"oneTypo": 4, "twoTypos": 8}
        }
    })
    return index

def index_documents(documents: list[dict]):
    """
    Each document: {"id": str, "title": str, "content": str,
                    "tags": list[str], "category": str, "doc_type": str}
    """
    index = get_or_create_index()
    # Add stable IDs if not present (derived from a hash of the content)
    for doc in documents:
        if "id" not in doc:
            doc["id"] = hashlib.sha256(doc["content"].encode()).hexdigest()[:16]
    task = index.add_documents(documents, primary_key="id")
    client.wait_for_task(task.task_uid)
    print(f"Indexed {len(documents)} documents.")

# Example: load from a JSONL file
def load_and_index(filepath: str):
    docs = []
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines instead of failing in json.loads
            docs.append(json.loads(line))
    index_documents(docs)
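For reference, here is a minimal sketch of the JSONL input that load_and_index expects: one JSON object per line. The file name, the helper names (write_jsonl, read_jsonl) and the two sample records are made up for illustration; only the field names come from the docstring above.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample records; only the field names come from the tutorial.
sample_docs = [
    {
        "id": "nis2-guide-001",
        "title": "NIS 2 for SMEs",
        "content": "Overview of NIS 2 obligations for small and medium enterprises.",
        "tags": ["nis2", "compliance"],
        "category": "security",
        "doc_type": "guide",
    },
    {
        # No "id" here: index_documents() would derive one from the content hash.
        "title": "ISO 27001 checklist",
        "content": "Steps to prepare for an ISO 27001 certification audit.",
        "tags": ["iso27001"],
        "category": "security",
        "doc_type": "checklist",
    },
]

def write_jsonl(docs: list[dict], filepath: Path) -> None:
    """Write one JSON object per line."""
    with open(filepath, "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")

def read_jsonl(filepath: Path) -> list[dict]:
    """Read the file back, skipping blank lines."""
    with open(filepath, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

path = Path(tempfile.gettempdir()) / "sample_docs.jsonl"
write_jsonl(sample_docs, path)
round_trip = read_jsonl(path)
```

Passing a file like this to load_and_index should index both records, with the second one receiving a generated ID.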
Step 2: Retrieve top-k documents
def retrieve(query: str, top_k: int = 5, filters: str = "") -> list[dict]:
    """
    Returns top_k documents matching the query.
    filters example: "category = 'security' AND doc_type = 'guide'"
    """
    index = client.get_index(INDEX_NAME)
    search_params = {
        "limit": top_k,
        "attributesToRetrieve": ["id", "title", "content", "category"],
        "attributesToHighlight": ["content"],
        "highlightPreTag": "**",
        "highlightPostTag": "**",
    }
    if filters:
        search_params["filter"] = filters
    results = index.search(query, search_params)
    return results["hits"]
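The highlight parameters only pay off if something reads the highlighted text. Meilisearch returns it in a `_formatted` object on each hit when attributesToHighlight is set. A small sketch of a helper that prefers the highlighted version and falls back to the raw field follows; the name snippet_for and the max_chars default are my own choices, not part of any API.

```python
def snippet_for(hit: dict, field: str = "content", max_chars: int = 300) -> str:
    """Prefer the highlighted version of a field, falling back to the raw value."""
    formatted = hit.get("_formatted") or {}
    text = formatted.get(field) or hit.get(field, "")
    return text[:max_chars]
```

This is mostly useful for showing the user why a source matched; the prompt in the next step still uses the plain content field.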
Step 3: Construct the prompt
The prompt structure is critical. You want the model to be explicitly grounded — it should cite only what's in the retrieved chunks, not hallucinate.
def build_prompt(query: str, retrieved_docs: list[dict]) -> list[dict]:
    context_blocks = []
    for i, doc in enumerate(retrieved_docs, 1):
        # Truncate each document so the total context stays bounded
        context_blocks.append(
            f"[Source {i}] {doc['title']}\n{doc['content'][:1200]}"
        )
    context = "\n\n---\n\n".join(context_blocks)
    system_prompt = (
        "You are a technical assistant. Answer the user's question using ONLY "
        "the provided sources. If the answer is not in the sources, say so explicitly. "
        "Cite sources by number, e.g. [Source 1]."
    )
    user_message = f"Sources:\n{context}\n---\nQuestion: {query}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]
Step 4: Stream the LLM response
Never buffer the full response before sending it to the user. Streaming is essential for UX on long answers.
from openai import OpenAI  # generic llm_client; swap for any compatible SDK

llm_client = OpenAI(
    api_key="your_api_key",
    base_url="https://api.your-llm-provider.com/v1",  # adjust per provider
)

def rag_stream(query: str, category_filter: str = ""):
    """Generator that yields text chunks as they arrive from the LLM."""
    filters = f"category = '{category_filter}'" if category_filter else ""
    docs = retrieve(query, top_k=5, filters=filters)
    if not docs:
        yield "No relevant documents found in the knowledge base."
        return
    messages = build_prompt(query, docs)
    stream = llm_client.chat.completions.create(
        model="gpt-4o-mini",  # or your preferred model
        messages=messages,
        stream=True,
        temperature=0.2,  # lower temp for factual retrieval tasks
        max_tokens=800,
    )
    for chunk in stream:
        # Some providers emit chunks with an empty choices list (e.g. a usage chunk)
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        if delta.content:
            yield delta.content
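One caveat in rag_stream: category_filter is interpolated straight into the filter expression, so a value containing a quote or an operator could change the filter's meaning. A simple way to rule that out is to accept only known categories. The sketch below assumes a hypothetical allow-list (ALLOWED_CATEGORIES) and helper name (build_category_filter); substitute the categories that actually exist in your index.

```python
# Hypothetical allow-list; in practice this would mirror the categories in your index.
ALLOWED_CATEGORIES = {"security", "compliance", "network"}

def build_category_filter(category: str, allowed: set[str] = ALLOWED_CATEGORIES) -> str:
    """Return a filter expression for one category, or "" when none is given.

    Raises ValueError for values outside the allow-list, so user input
    containing quotes or operators never reaches the filter string.
    """
    if not category:
        return ""
    if category not in allowed:
        raise ValueError(f"Unknown category: {category!r}")
    return f"category = '{category}'"
```

Inside rag_stream, the result of build_category_filter(category_filter) could be passed as the filters argument in place of the inline f-string.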
Step 5: Wire it together — a minimal CLI
import sys

def main():
    query = " ".join(sys.argv[1:]) if len(sys.argv) > 1 else input("Query: ")
    print(f"\nQuery: {query}\n{'='*60}\n")
    for token in rag_stream(query):
        print(token, end="", flush=True)
    print("\n")

if __name__ == "__main__":
    main()
Usage:
python rag.py "What are the key requirements of NIS 2 for SMEs?"
Step 6: Evaluate hit rate
Before deploying, measure whether your retrieval is actually finding the right documents. You need a small golden dataset: query → expected document ID.
def evaluate_hit_rate(golden_set: list[dict], top_k: int = 5) -> float:
    """
    golden_set: [{"query": "...", "expected_id": "doc_id"}, ...]
    Returns hit rate @ top_k.
    """
    if not golden_set:
        raise ValueError("golden_set is empty")  # avoid dividing by zero below
    hits = 0
    for item in golden_set:
        results = retrieve(item["query"], top_k=top_k)
        retrieved_ids = {r["id"] for r in results}
        if item["expected_id"] in retrieved_ids:
            hits += 1
    hit_rate = hits / len(golden_set)
    print(f"Hit rate @{top_k}: {hit_rate:.2%} ({hits}/{len(golden_set)})")
    return hit_rate

# Example usage
golden = [
    {"query": "NIS 2 SME requirements", "expected_id": "nis2-guide-001"},
    {"query": "ISO 27001 certification steps", "expected_id": "iso27001-checklist"},
    {"query": "penetration testing methodology", "expected_id": "pentest-guide-002"},
]
evaluate_hit_rate(golden, top_k=5)
On a 1,600-article cybersecurity corpus, this setup achieves roughly 91% hit rate at k=5 — without a single embedding model call.
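Hit rate tells you whether the expected document appears in the top k, but not how high it ranks. Mean reciprocal rank (MRR) is a common complement: each query scores 1/rank of the expected document, or 0 if it is missing. In the sketch below the retriever is passed in as a parameter (my assumption, so the function can be exercised without a running Meilisearch instance); passing the retrieve function from Step 2 should work.

```python
from typing import Callable

def mean_reciprocal_rank(
    golden_set: list[dict],
    retrieve_fn: Callable[..., list[dict]],
    top_k: int = 5,
) -> float:
    """Average of 1/rank of the expected document; a miss contributes 0."""
    if not golden_set:
        return 0.0
    total = 0.0
    for item in golden_set:
        results = retrieve_fn(item["query"], top_k=top_k)
        for rank, hit in enumerate(results, start=1):
            if hit["id"] == item["expected_id"]:
                total += 1.0 / rank
                break
    return total / len(golden_set)
```

A hit rate that stays high while MRR drops suggests the right documents are still found but are sliding down the list, which matters once you truncate the context.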
Production considerations
Chunking strategy: For long documents, chunk at 512–800 tokens with 10% overlap. Store doc_id and chunk_index so you can reconstruct the full document if needed.
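A minimal sketch of that chunking scheme follows. It counts whitespace-separated words as a stand-in for tokens, which is an approximation (a real tokenizer will give different counts), and the function name chunk_document and its defaults are illustrative choices.

```python
def chunk_document(doc: dict, chunk_size: int = 600, overlap_ratio: float = 0.10) -> list[dict]:
    """Split doc["content"] into overlapping chunks, measured in words.

    Each chunk copies the parent's fields (title, category, ...) and adds
    doc_id and chunk_index so the full document can be reassembled later.
    """
    if chunk_size < 1 or not 0 <= overlap_ratio < 1:
        raise ValueError("chunk_size must be >= 1 and overlap_ratio in [0, 1)")
    words = doc["content"].split()
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # always >= 1 given the checks above
    chunks = []
    for chunk_index, start in enumerate(range(0, len(words), step)):
        window = words[start:start + chunk_size]
        chunks.append({
            **doc,
            "id": f"{doc['id']}-{chunk_index}",
            "doc_id": doc["id"],
            "chunk_index": chunk_index,
            "content": " ".join(window),
        })
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the document
    return chunks
```

The chunks can be passed to index_documents in place of whole documents; because each one keeps the parent's category and doc_type, the existing filters continue to work.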
Re-ranking: If your hit rate plateaus below 85%, add a lightweight cross-encoder re-ranker as a second stage. cross-encoder/ms-marco-MiniLM-L-6-v2 from Sentence Transformers works locally and adds ~30ms latency.
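A sketch of what that second stage could look like. The rerank function takes any scoring callable, so it can be tried without downloading a model; load_cross_encoder_scorer wraps the CrossEncoder class from the sentence-transformers package and is imported lazily. Both function names are my own, and the model name is the one mentioned above.

```python
from typing import Callable, Sequence

def rerank(
    query: str,
    docs: list[dict],
    score_fn: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_n: int = 5,
) -> list[dict]:
    """Re-order docs by the relevance score of each (query, content) pair."""
    if not docs:
        return []
    pairs = [(query, doc["content"]) for doc in docs]
    scores = score_fn(pairs)
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

def load_cross_encoder_scorer(model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Return a scoring callable backed by a Sentence Transformers cross-encoder.

    Requires `pip install sentence-transformers`; the model is downloaded on first use.
    """
    from sentence_transformers import CrossEncoder  # lazy import: optional dependency
    model = CrossEncoder(model_name)
    return model.predict  # accepts a list of (query, passage) pairs
```

A typical arrangement is to retrieve a wider candidate set, for example retrieve(query, top_k=20), and let rerank cut it back down to the 5 documents that go into the prompt.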
Context window budget: At 5 docs × 1,200 chars, you're using roughly 1,500 tokens of context, assuming about 4 characters per token for English text. Adjust top_k and content truncation to stay within your model's window while leaving room for the answer.
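The budget arithmetic can be made explicit with a small helper. This sketch uses a 4-characters-per-token heuristic, which is only a rough estimate for English text and not a substitute for your model's tokenizer; the names estimate_tokens and fit_to_budget are illustrative.

```python
CHARS_PER_TOKEN = 4  # rough heuristic for English text, not a real tokenizer

def estimate_tokens(text: str) -> int:
    """Approximate token count, rounded up."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN

def fit_to_budget(
    docs: list[dict],
    max_context_tokens: int = 1500,
    max_chars_per_doc: int = 1200,
) -> list[dict]:
    """Keep documents in rank order until the estimated token budget is used up."""
    selected = []
    used = 0
    for doc in docs:
        snippet = doc["content"][:max_chars_per_doc]
        cost = estimate_tokens(snippet)
        if used + cost > max_context_tokens:
            break  # the next document would overflow the budget
        selected.append({**doc, "content": snippet})
        used += cost
    return selected
```

With the defaults, five documents of 1,200 characters each come to an estimated 1,500 tokens and all fit; lowering max_context_tokens drops documents from the bottom of the ranking first.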
Caching: Cache retrieval results for identical queries with a TTL of 5–15 minutes using Redis or even a simple in-memory dict. LLM call results can be cached longer for factual queries.
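For the in-memory variant, a minimal sketch of a TTL cache is shown below. The class name TTLCache and its interface are my own; it has no size limit or eviction, so treat it as a starting point rather than a replacement for Redis. The clock is injectable so expiry can be tested without sleeping.

```python
import time
from typing import Any, Callable, Hashable

class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 600.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[Hashable, tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        """Return the cached value for key, or call compute() and cache the result."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        value = compute()
        self._store[key] = (now, value)
        return value
```

For retrieval, the key would need to include everything that affects the result, for example cache.get_or_compute((query, top_k, filters), lambda: retrieve(query, top_k, filters)).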
This pipeline — retrieval with Meilisearch, prompt construction, streaming output — is what I run in production. No embedding pipeline, no vector database operational overhead. For domain-specific retrieval, BM25 is frequently the pragmatic choice. Reach for semantic search when your query vocabulary genuinely diverges from your document vocabulary; otherwise, ship the simpler thing.