Most RAG (Retrieval-Augmented Generation) tutorials show you how to throw documents into a vector store, retrieve the top-K results, and send them to an LLM.
That works for a demo. It falls apart in production.
Here's what actually matters when you're building a RAG system that real users depend on — and the patterns I've settled on after 20 years of building ML systems.
## 1. Chunking Strategy Is Everything
The default approach — splitting text into fixed-size chunks — is the worst option for most use cases.
Why it fails: Fixed chunking splits sentences mid-thought, breaks paragraphs apart, and creates chunks with no semantic coherence. Your retrieval quality tanks because the embeddings represent fragments, not ideas.
What to do instead:
Use recursive chunking — split on natural boundaries first (double newlines → single newlines → sentences → words), and only fall back to fixed splitting when a section is still too large.
```python
def recursive_chunks(text, chunk_size=512, overlap=50):
    separators = ["\n\n\n", "\n\n", "\n", ". ", " "]

    def fixed_chunks(text, chunk_size):
        # Last resort: sliding window over words, with overlap between windows.
        words = text.split()
        step = max(chunk_size - overlap, 1)
        return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

    def split(text, sep_index=0):
        if len(text.split()) <= chunk_size:
            return [text] if text.strip() else []
        if sep_index >= len(separators):
            return fixed_chunks(text, chunk_size)
        parts = text.split(separators[sep_index])
        chunks, current = [], ""
        for part in parts:
            candidate = f"{current}{separators[sep_index]}{part}" if current else part
            if len(candidate.split()) <= chunk_size:
                current = candidate
            else:
                if current:
                    chunks.extend(split(current, sep_index + 1))
                current = part
        if current:
            chunks.extend(split(current, sep_index + 1))
        return chunks

    return split(text)
```
For documents with clear structure (legal docs, technical manuals), semantic chunking — splitting on paragraph boundaries and merging small paragraphs — works even better.
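A minimal sketch of that paragraph-merging approach, assuming paragraphs are separated by blank lines (the function name, `min_words`, and `max_words` are illustrative, not from any library):

```python
def semantic_chunks(text, min_words=50, max_words=512):
    # Split on blank lines (paragraph boundaries), then merge paragraphs
    # that are too short to stand alone as a chunk.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate.split()) <= max_words:
            current = candidate
            # Emit the chunk once it has accumulated a useful amount of text.
            if len(current.split()) >= min_words:
                chunks.append(current)
                current = ""
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Note that a single oversized paragraph is kept whole here; in practice you'd hand it off to the recursive splitter above.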
## 2. Similarity Threshold > Top-K
Most implementations just return the top-K results. This is a mistake.
If a user asks a question that has nothing to do with your documents, top-K will still return K results — they'll just be irrelevant. Your LLM then hallucinates an answer based on unrelated context.
Fix: Apply a similarity threshold. Only return results above a minimum score.
```python
def search(self, query, top_k=5, threshold=0.7):
    results = self.vector_store.search(query, top_k=top_k)
    return [r for r in results if r["score"] >= threshold]
```
This one change dramatically reduces hallucinations. If nothing passes the threshold, tell the user you don't have enough information — that's a better outcome than a confident wrong answer.
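Wiring that refusal into the answer path might look like this sketch, where `retriever` and `generate_answer` are placeholders for your own retrieval and LLM calls:

```python
NO_CONTEXT_REPLY = (
    "I don't have enough information in my documents to answer that. "
    "Try rephrasing, or ask about a topic the knowledge base covers."
)

def answer(query, retriever, generate_answer, threshold=0.7):
    results = retriever(query)
    relevant = [r for r in results if r["score"] >= threshold]
    if not relevant:
        # Refuse rather than let the LLM improvise from unrelated chunks.
        return NO_CONTEXT_REPLY
    context = "\n\n".join(r["content"] for r in relevant)
    return generate_answer(query, context)
```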
## 3. Re-Ranking Is a Force Multiplier
Embedding-based retrieval is fast but imprecise. It finds semantically similar content, but "similar" doesn't always mean "relevant to the specific question."
The pattern: Over-fetch (3x your target), then re-rank.
A full cross-encoder re-ranker (like ms-marco-MiniLM) gives the best results, but even a simple term-overlap re-ranker helps:
```python
def rerank(query, results, top_k=3):
    query_terms = set(query.lower().split())
    for r in results:
        content_terms = set(r["content"].lower().split())
        overlap = len(query_terms & content_terms)
        density = overlap / max(len(content_terms), 1)
        # Blend the original embedding score with term-overlap density.
        r["rerank_score"] = (r["score"] * 0.7) + (density * 0.3)
    results.sort(key=lambda x: x["rerank_score"], reverse=True)
    return results[:top_k]
```
Retrieve 15, re-rank to 5. Your answer quality jumps significantly.
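Putting the over-fetch and re-rank steps together, a sketch of the full pattern, where `vector_search` stands in for your store's query method and the score blend mirrors the term-overlap re-ranker above:

```python
def overfetch_and_rerank(query, vector_search, final_k=5, multiplier=3):
    # Pull multiplier x more candidates than needed, then let the
    # re-ranker pick the best final_k.
    candidates = vector_search(query, top_k=final_k * multiplier)
    query_terms = set(query.lower().split())
    for r in candidates:
        content_terms = set(r["content"].lower().split())
        density = len(query_terms & content_terms) / max(len(content_terms), 1)
        r["rerank_score"] = r["score"] * 0.7 + density * 0.3
    candidates.sort(key=lambda r: r["rerank_score"], reverse=True)
    return candidates[:final_k]
```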
## 4. Stream Your Responses
If your RAG system takes 3-5 seconds to respond (embedding + retrieval + LLM generation), users will think it's broken.
Streaming sends tokens as they're generated. The user sees the answer forming in real-time, which feels fast even if the total time is the same.
```python
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def stream_response(query, context):
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            # Server-sent events format: one event per token delta.
            yield f"data: {chunk.choices[0].delta.content}\n\n"
```
## 5. Configuration via Environment, Not Code
Hardcoding your chunk size, model name, similarity threshold, and vector store choice is fine for a prototype. In production, you need to tune these without redeploying.
Use Pydantic Settings:
```python
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    chunk_size: int = 512
    chunk_overlap: int = 50
    embedding_model: str = "BAAI/bge-small-en-v1.5"
    vector_store_type: str = "chroma"  # or "faiss"
    top_k: int = 5
    similarity_threshold: float = 0.7
    rerank: bool = True
    llm_model: str = "gpt-4o-mini"

    model_config = {"env_file": ".env", "env_prefix": "RAG_"}
```
Change any parameter by editing .env or setting an environment variable: `RAG_TOP_K=10` overrides `top_k`, thanks to the `env_prefix`. No code changes, no redeployment.
## Putting It All Together
A production RAG system isn't much more code than a tutorial one — it's just better code in the right places:
- Recursive chunking instead of fixed splitting
- Similarity thresholds to prevent hallucinations
- Re-ranking to improve relevance
- Streaming for perceived performance
- Environment-based configuration for operational flexibility
I packaged all of these patterns (and more — Docker configs, file upload endpoints, multiple vector store backends) into a ready-to-use template.
If you want to skip the boilerplate and start with production-quality code: Production AI/ML Toolkit — 4 Ready-to-Ship Templates
It includes a complete RAG system plus templates for LLM fine-tuning, model serving with A/B testing, and an AI agent framework. $39.
What patterns have you found essential for production RAG? Drop them in the comments.