Ayi NEDJIMI

Posted on May 22

How to build a production RAG pipeline in Python (without a vector database)

#ai #llm #python #tutorial

Everyone reaching for a vector database when building RAG is solving the wrong problem first. For most domain-specific corpora — technical documentation, company knowledge bases, article archives — BM25 retrieval is competitive with semantic search, costs a fraction of the compute, and is dramatically simpler to operate. This tutorial shows you how to build a full RAG pipeline using Meilisearch as the retrieval backend, stream responses from an LLM API, and evaluate hit rate without a single embedding model.

Why RAG, and why not a vector database

Retrieval-Augmented Generation solves a fundamental problem: LLMs have a knowledge cutoff and a finite context window. You want answers grounded in your documents, not hallucinated from pre-training.

The standard advice is to use a vector database (Pinecone, Weaviate, Chroma). Vector search is powerful for open-domain retrieval where semantic similarity matters. But on a domain-specific corpus with consistent terminology — think a cybersecurity knowledge base or a medical reference — BM25 with typo tolerance typically achieves 85–95% of the recall you'd get from embeddings, with zero GPU cost, sub-10ms latency, and no embedding pipeline to maintain.

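Meilisearch gives you fast lexical (keyword) retrieval out of the box, plus typo tolerance, faceted filtering, and a simple REST API. Strictly speaking, its ranking is not textbook BM25: it sorts results through an ordered list of ranking rules (words, typo, proximity, attribute, sort, exactness). It fills the same lexical-retrieval role in this pipeline, though, and the rest of the argument carries over. It's what I use to power the search across 1,600+ articles at AYI NEDJIMI Consultants.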

Setup

pip install meilisearch openai httpx

Run Meilisearch locally:

docker run -d -p 7700:7700 getmeili/meilisearch:latest

Step 1: Index your documents

Your documents need an id, searchable content, and any filter attributes you want to use at query time.

import meilisearch
import hashlib
import json

MEILI_URL = "http://127.0.0.1:7700"
MEILI_KEY = "your_master_key"  # or "" for local dev
INDEX_NAME = "knowledge_base"

client = meilisearch.Client(MEILI_URL, MEILI_KEY)

def get_or_create_index():
    try:
        index = client.get_index(INDEX_NAME)
    except meilisearch.errors.MeilisearchApiError:
        task = client.create_index(INDEX_NAME, {"primaryKey": "id"})
        client.wait_for_task(task.task_uid)
        index = client.get_index(INDEX_NAME)

    # Configure searchable attributes and filters
    index.update_settings({
        "searchableAttributes": ["title", "content", "tags"],
        "filterableAttributes": ["category", "doc_type"],
        "rankingRules": [
            "words", "typo", "proximity", "attribute", "sort", "exactness"
        ],
        "typoTolerance": {
            "enabled": True,
            "minWordSizeForTypos": {"oneTypo": 4, "twoTypos": 8}
        }
    })
    return index

def index_documents(documents: list[dict]):
    """
    Each document: {"id": str, "title": str, "content": str,
                    "tags": list[str], "category": str, "doc_type": str}
    """
    index = get_or_create_index()

    # Add stable IDs if not present
    for doc in documents:
        if "id" not in doc:
            doc["id"] = hashlib.sha256(doc["content"].encode()).hexdigest()[:16]

    task = index.add_documents(documents, primary_key="id")
    client.wait_for_task(task.task_uid)
    print(f"Indexed {len(documents)} documents.")

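# Example: load from a JSONL file (one JSON object per line)
def load_and_index(filepath: str):
    docs = []
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines, which json.loads would reject
            docs.append(json.loads(line))
    index_documents(docs)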
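
Before indexing, it can help to check that each record matches the document shape described in the docstring above. The sketch below is one possible validator, not something the Meilisearch SDK provides: `REQUIRED_FIELDS` and `validate_document` are names I made up for illustration, and the id pattern reflects the documented rule that string ids may only contain alphanumerics, hyphens, and underscores.

```python
import re

# Hypothetical helper; the field list mirrors the document shape used above.
REQUIRED_FIELDS = {"title": str, "content": str, "tags": list,
                   "category": str, "doc_type": str}

# Meilisearch accepts integer ids, or strings made of alphanumerics, "-" and "_".
_ID_PATTERN = re.compile(r"^[A-Za-z0-9_-]+$")


def validate_document(doc: dict) -> list[str]:
    """Return a list of problems found in one document (empty list means OK)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    # "id" is optional here because index_documents() derives one from the content.
    if "id" in doc and not _ID_PATTERN.match(str(doc["id"])):
        problems.append("id contains characters Meilisearch does not accept")
    if isinstance(doc.get("content"), str) and not doc["content"].strip():
        problems.append("content is empty")
    return problems
```

Running this over the parsed JSONL lines and logging any non-empty result is usually cheaper than debugging a failed indexing task afterwards.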

Step 2: Retrieve top-k documents

def retrieve(query: str, top_k: int = 5, filters: str = "") -> list[dict]:
    """
    Returns top_k documents matching the query.
    filters example: "category = 'security' AND doc_type = 'guide'"
    """
    index = client.get_index(INDEX_NAME)

    search_params = {
        "limit": top_k,
        "attributesToRetrieve": ["id", "title", "content", "category"],
        "attributesToHighlight": ["content"],
        "highlightPreTag": "**",
        "highlightPostTag": "**",
    }

    if filters:
        search_params["filter"] = filters

    results = index.search(query, search_params)
    return results["hits"]
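
Filter strings are easy to get wrong once values come from user input, because a stray quote breaks the expression. Here is a minimal sketch of a filter builder. `quote_filter_value` and `build_filter` are hypothetical helpers, and the backslash escaping is my assumption about Meilisearch's filter syntax, so check it against the filter expression reference for your version.

```python
def quote_filter_value(value: str) -> str:
    """Wrap a value in single quotes, escaping backslashes and single quotes."""
    # Assumption: Meilisearch filter strings accept backslash-escaped quotes.
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def build_filter(**conditions: str) -> str:
    """Join attribute = value pairs with AND, skipping empty values."""
    parts = [
        f"{attr} = {quote_filter_value(val)}"
        for attr, val in conditions.items()
        if val
    ]
    return " AND ".join(parts)
```

With this, `retrieve(query, filters=build_filter(category="security", doc_type="guide"))` produces the same expression as the docstring example, and an empty result simply means no filter is applied.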

Step 3: Construct the prompt

The prompt structure is critical. You want the model to be explicitly grounded — it should cite only what's in the retrieved chunks, not hallucinate.

def build_prompt(query: str, retrieved_docs: list[dict]) -> list[dict]:
    context_blocks = []
    for i, doc in enumerate(retrieved_docs, 1):
        context_blocks.append(
            f"[Source {i}] {doc['title']}\n{doc['content'][:1200]}"
        )

    context = "\n\n---\n\n".join(context_blocks)

    system_prompt = (
        "You are a technical assistant. Answer the user's question using ONLY "
        "the provided sources. If the answer is not in the sources, say so explicitly. "
        "Cite sources by number, e.g. [Source 1]."
    )

    user_message = f"""Sources:
{context}

---

Question: {query}"""

    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]
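
Because the system prompt asks for `[Source N]` citations, you can check after the fact which retrieved documents the answer actually relied on. A rough sketch, assuming the model follows the citation format exactly (`extract_cited_sources` is an illustrative name, not part of any SDK):

```python
import re


def extract_cited_sources(answer: str, retrieved_docs: list[dict]) -> list[dict]:
    """Map [Source N] markers in the answer back to the retrieved documents."""
    cited = []
    seen = set()
    for match in re.finditer(r"\[Source (\d+)\]", answer):
        number = int(match.group(1))
        # Source numbers are 1-based; ignore markers that point outside the list.
        if 1 <= number <= len(retrieved_docs) and number not in seen:
            seen.add(number)
            cited.append(retrieved_docs[number - 1])
    return cited
```

An answer that cites nothing, or cites a source number that was never provided, is a cheap signal that the model may have drifted away from the context.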

Step 4: Stream the LLM response

Never buffer the full response before sending it to the user. Streaming is essential for UX on long answers.

from openai import OpenAI  # generic llm_client — swap for any compatible SDK

llm_client = OpenAI(
    api_key="your_api_key",
    base_url="https://api.your-llm-provider.com/v1",  # adjust per provider
)

def rag_stream(query: str, category_filter: str = ""):
    """Generator that yields text chunks as they arrive from the LLM."""
    filters = f"category = '{category_filter}'" if category_filter else ""
    docs = retrieve(query, top_k=5, filters=filters)

    if not docs:
        yield "No relevant documents found in the knowledge base."
        return

    messages = build_prompt(query, docs)

    stream = llm_client.chat.completions.create(
        model="gpt-4o-mini",  # or your preferred model
        messages=messages,
        stream=True,
        temperature=0.2,  # lower temp for factual retrieval tasks
        max_tokens=800,
    )

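    for chunk in stream:
        # Some providers emit chunks with an empty choices list
        # (for example a trailing usage chunk); skip those.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        if delta.content:
            yield delta.content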
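
If the consumer is a browser rather than a terminal, the generator's chunks need some framing on the wire. One common option is Server-Sent Events. The sketch below assumes your web framework can stream an iterator of strings; `to_sse` and the `[DONE]` sentinel are conventions I picked for illustration, not something the SSE format requires.

```python
import json
from typing import Iterable, Iterator


def to_sse(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap each text chunk in a Server-Sent Events frame as it arrives."""
    for chunk in chunks:
        # JSON-encode so newlines inside a chunk cannot break the SSE framing.
        payload = json.dumps({"text": chunk})
        yield f"data: {payload}\n\n"
    # Sentinel frame so the client knows the stream is complete.
    yield "data: [DONE]\n\n"
```

Usage would look like `to_sse(rag_stream(query))`, served with the `text/event-stream` content type. Nothing is buffered: each frame is produced as soon as the underlying generator yields.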

Step 5: Wire it together — a minimal CLI

import sys

def main():
    query = " ".join(sys.argv[1:]) if len(sys.argv) > 1 else input("Query: ")
    print(f"\nQuery: {query}\n{'='*60}\n")

    for token in rag_stream(query):
        print(token, end="", flush=True)

    print("\n")

if __name__ == "__main__":
    main()

Usage:

python rag.py "What are the key requirements of NIS 2 for SMEs?"

Step 6: Evaluate hit rate

Before deploying, measure whether your retrieval is actually finding the right documents. You need a small golden dataset: query → expected document ID.

def evaluate_hit_rate(golden_set: list[dict], top_k: int = 5) -> float:
    """
    golden_set: [{"query": "...", "expected_id": "doc_id"}, ...]
    Returns hit rate @ top_k.
    """
    if not golden_set:
        raise ValueError("golden_set must contain at least one query")

    hits = 0
    for item in golden_set:
        results = retrieve(item["query"], top_k=top_k)
        retrieved_ids = {r["id"] for r in results}
        if item["expected_id"] in retrieved_ids:
            hits += 1

    hit_rate = hits / len(golden_set)
    print(f"Hit rate @{top_k}: {hit_rate:.2%} ({hits}/{len(golden_set)})")
    return hit_rate

# Example usage
golden = [
    {"query": "NIS 2 SME requirements", "expected_id": "nis2-guide-001"},
    {"query": "ISO 27001 certification steps", "expected_id": "iso27001-checklist"},
    {"query": "penetration testing methodology", "expected_id": "pentest-guide-002"},
]

evaluate_hit_rate(golden, top_k=5)

On a 1,600-article cybersecurity corpus, this setup achieves roughly 91% hit rate at k=5 — without a single embedding model call.
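
Hit rate tells you whether the right document appears in the top k, but not how high it ranks. Mean reciprocal rank (MRR) is a common complement. The sketch below takes the retrieval function as a parameter so it can be tested without a running Meilisearch instance; `mean_reciprocal_rank` and `retrieve_fn` are illustrative names.

```python
from typing import Callable


def mean_reciprocal_rank(
    golden_set: list[dict],
    retrieve_fn: Callable[..., list[dict]],
    top_k: int = 5,
) -> float:
    """Average of 1/rank of the expected document; a miss contributes 0."""
    if not golden_set:
        raise ValueError("golden_set must contain at least one query")
    total = 0.0
    for item in golden_set:
        results = retrieve_fn(item["query"], top_k=top_k)
        for rank, doc in enumerate(results, start=1):
            if doc["id"] == item["expected_id"]:
                total += 1.0 / rank
                break
    return total / len(golden_set)
```

Calling `mean_reciprocal_rank(golden, retrieve)` reuses the golden set and `retrieve` function from above. A high hit rate combined with a low MRR suggests the right documents are found but ranked near the bottom, which matters if you later reduce top_k.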

Production considerations

Chunking strategy: For long documents, chunk at 512–800 tokens with 10% overlap. Store doc_id and chunk_index so you can reconstruct the full document if needed.
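
A minimal chunking sketch along these lines. It counts whitespace-separated words as a rough stand-in for tokens, which is an approximation (swap in your model's tokenizer if you need exact counts), and it assumes each document already has an `id`. `chunk_document` and the `_N` id suffix are my own conventions.

```python
def chunk_document(doc: dict, chunk_size: int = 400, overlap_ratio: float = 0.1) -> list[dict]:
    """Split doc["content"] into overlapping chunks, using words as a rough token proxy."""
    words = doc["content"].split()
    overlap = int(chunk_size * overlap_ratio)
    step = max(chunk_size - overlap, 1)
    chunks = []
    for chunk_index, start in enumerate(range(0, len(words), step)):
        window = words[start:start + chunk_size]
        chunks.append({
            **doc,
            "id": f"{doc['id']}_{chunk_index}",  # unique id per chunk
            "doc_id": doc["id"],                 # link back to the parent document
            "chunk_index": chunk_index,
            "content": " ".join(window),
        })
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the document
    return chunks
```

Each chunk keeps the parent's title, tags, and filter attributes, so it can be passed to `index_documents` unchanged. If you filter or group by `doc_id`, remember to add it to `filterableAttributes`.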

Re-ranking: If your hit rate plateaus below 85%, add a lightweight cross-encoder re-ranker as a second stage. cross-encoder/ms-marco-MiniLM-L-6-v2 from Sentence Transformers works locally and adds ~30ms latency.
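
A sketch of what that second stage could look like. The scoring function is injected so the re-ordering logic stays testable without downloading a model; `rerank` and `load_cross_encoder_scorer` are hypothetical helpers, and the second one assumes `sentence-transformers` is installed.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    docs: list[dict],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_n: int = 5,
) -> list[dict]:
    """Re-order retrieved docs by a relevance score for each (query, content) pair."""
    if not docs:
        return []
    pairs = [(query, doc["content"]) for doc in docs]
    scores = score_pairs(pairs)
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]


def load_cross_encoder_scorer(model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Return the model's predict method (requires `pip install sentence-transformers`)."""
    from sentence_transformers import CrossEncoder  # imported lazily: heavy dependency

    return CrossEncoder(model_name).predict
```

A typical pattern is to over-retrieve and then trim: `rerank(query, retrieve(query, top_k=20), load_cross_encoder_scorer(), top_n=5)`. Load the scorer once at startup rather than per request.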

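Context window budget: At 5 docs × 1,200 chars (about 6,000 characters), you're using roughly 1,500 tokens of context, assuming the common rule of thumb of about 4 characters per token for English text. Adjust top_k and content truncation to stay within your model's window while leaving room for the answer.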
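
One way to make that budget explicit instead of relying on a fixed per-document cut-off. This is a rough sketch: it estimates tokens from character counts using an assumed ratio of about 4 characters per token, and `fit_to_budget` is an illustrative helper, not part of the pipeline above.

```python
def fit_to_budget(docs: list[dict], max_context_tokens: int = 1500,
                  chars_per_token: float = 4.0) -> list[dict]:
    """Keep docs in rank order, truncating content so the total stays within the budget."""
    remaining_chars = int(max_context_tokens * chars_per_token)
    fitted = []
    for doc in docs:
        if remaining_chars <= 0:
            break  # budget exhausted: drop the lower-ranked documents
        content = doc["content"][:remaining_chars]
        fitted.append({**doc, "content": content})  # copy, so the input is not mutated
        remaining_chars -= len(content)
    return fitted
```

Calling `build_prompt(query, fit_to_budget(docs))` spends the budget on the highest-ranked documents first. Note that `build_prompt` still applies its own 1,200-character cut per document, so raise that limit if you want the budget to be the only constraint.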

Caching: Cache retrieval results for identical queries with a TTL of 5–15 minutes using Redis or even a simple in-memory dict. LLM call results can be cached longer for factual queries.
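
For the in-memory variant, something as small as the following is often enough for a single process. `TTLCache` is a hypothetical class written for this example; the injectable clock exists only to make expiry testable.

```python
import time
from typing import Any, Callable, Hashable


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._store: dict[Hashable, tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # fresh hit
        value = compute()  # miss or expired: recompute and store
        self._store[key] = (now, value)
        return value
```

Usage could look like `cache.get_or_compute((query, top_k, filters), lambda: retrieve(query, top_k, filters))`. This version never evicts expired keys and is not shared across processes, which is where Redis becomes the better fit.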

This pipeline — retrieval with Meilisearch, prompt construction, streaming output — is what I run in production. No embedding pipeline, no vector database operational overhead. For domain-specific retrieval, BM25 is frequently the pragmatic choice. Reach for semantic search when your query vocabulary genuinely diverges from your document vocabulary; otherwise, ship the simpler thing.
