
Serhii Kalyna

Posted on • Originally published at kalyna.pro

RAG Tutorial with Python: Build a Retrieval-Augmented Generation System

LLMs are powerful, but they only know what was in their training data. RAG (Retrieval-Augmented Generation) solves this: before generating an answer, retrieve the most relevant documents from your own knowledge base and include them in the prompt. The result is an LLM that answers from your data instead of guessing.

This tutorial builds a complete RAG system from scratch — embeddings, vector search, and Claude for generation.

How RAG Works

  • Index — convert your documents into embedding vectors and store them
  • Retrieve — embed the user's query and find the most similar documents by cosine similarity
  • Generate — pass the retrieved documents as context to the LLM and ask it to answer

The LLM never sees your entire knowledge base — only the top-K most relevant chunks. This keeps prompts focused and costs predictable.

Prerequisites

pip install anthropic sentence-transformers numpy

sentence-transformers provides a local embedding model (~80MB) — no external API key needed for the retrieval step. anthropic handles generation.

Step 1: Define Your Knowledge Base

In production, this comes from chunked PDFs, database exports, or wiki pages. For this tutorial we use a list of strings:

# Your knowledge base — in production this comes from PDFs, DBs, wikis, etc.
documents = [
    "Claude is an AI assistant made by Anthropic. It supports tool use, prompt caching, and vision.",
    "The Claude API uses token-based pricing. Sonnet costs $3/MTok input and $15/MTok output.",
    "Prompt caching caches prompt prefixes for 5 minutes, saving up to 90% on repeated calls.",
    "The Batch API processes requests asynchronously within 24 hours at a 50% discount.",
    "Claude Haiku is the fastest and cheapest model, ideal for classification and extraction.",
    "Model Context Protocol (MCP) is an open standard by Anthropic for connecting AI to tools.",
]

Chunking strategy matters: chunks should be semantically coherent (one idea per chunk) and roughly 300–500 tokens each.
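
As a rough illustration, here is one way to produce such chunks. This helper is not part of the tutorial's pipeline; it splits on words as a stand-in for real token counting, and the sizes are just the defaults suggested above:

# Illustrative chunker: overlapping word-based chunks as a rough proxy for 300-500 token chunks.
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

# Example (assuming a hypothetical raw_pages list of page texts):
# documents = [c for page in raw_pages for c in chunk_text(page)]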

Step 2: Build the Vector Store

Embed every document once at startup and store the vectors as a NumPy array:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # ~80MB, runs locally, no API key needed

# Build the vector store once at startup
doc_embeddings = model.encode(documents, normalize_embeddings=True)
# Shape: (6, 384) — one 384-dim vector per document

all-MiniLM-L6-v2 is fast and compact (384-dim embeddings). For production, consider bge-large-en-v1.5 or OpenAI's text-embedding-3-small for higher quality.
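
If you want to try a stronger open model, the swap is a one-line change. A sketch, assuming the BAAI/bge-large-en-v1.5 checkpoint from the Hugging Face hub; only the vector size changes, the rest of the pipeline stays the same:

# Hypothetical swap to a larger embedding model (roughly 1.3 GB on disk, 1024-dim vectors)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
doc_embeddings = model.encode(documents, normalize_embeddings=True)  # shape becomes (6, 1024)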

Step 3: Retrieve Relevant Documents

Cosine similarity between normalized vectors is just a dot product — fast and dependency-free with NumPy:

def retrieve(query: str, top_k: int = 3) -> list[str]:
    query_vec = model.encode([query], normalize_embeddings=True)
    # Cosine similarity = dot product on L2-normalized vectors
    scores = (doc_embeddings @ query_vec.T).flatten()
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [documents[i] for i in top_indices]

# Test retrieval
results = retrieve("How much does Claude Sonnet cost?")
for doc in results:
    print("-", doc)

Step 4: Generate the Answer with Claude

Pass the retrieved documents as context and let Claude synthesize the answer:

import anthropic

client = anthropic.Anthropic()

def answer(question: str) -> str:
    context_docs = retrieve(question, top_k=3)
    context = "\n\n".join(f"- {doc}" for doc in context_docs)
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system="Answer questions using ONLY the provided context. If the answer is not in the context, say so.",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }]
    )
    return response.content[0].text

The system prompt instruction — "answer ONLY from the provided context" — is what keeps the model grounded. Without it, Claude may mix retrieval context with its own training knowledge.

Step 5: Run It

questions = [
    "What is prompt caching and how much can it save?",
    "Which Claude model is best for classification?",
    "What is MCP?",
]

for q in questions:
    print(f"Q: {q}")
    print(f"A: {answer(q)}")
    print()

Optimize with Prompt Caching

If your system prompt is long and repeated across many queries, cache it to save up to 90% on input costs:

def answer_with_cache(question: str) -> str:
    context_docs = retrieve(question, top_k=3)
    context = "\n\n".join(f"- {doc}" for doc in context_docs)
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=[{
            "type": "text",
            "text": "Answer questions using ONLY the provided context. If the answer is not in the context, say so.",
            "cache_control": {"type": "ephemeral"}  # cache the system prompt prefix
        }],
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }]
    )
    return response.content[0].text

Production Considerations

  • Vector database — replace NumPy with ChromaDB, Pinecone, or pgvector for persistence and scale
  • Chunking — use overlapping chunks (e.g., 400 tokens with 50-token overlap) to avoid splitting context at boundaries
  • Reranking — after top-K retrieval, apply a cross-encoder reranker for higher precision (see the sketch after this list)
  • Metadata filtering — filter by date, category, or source before semantic search
  • Evaluation — measure retrieval recall and answer faithfulness with frameworks like RAGAS
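
To make the reranking bullet concrete, here is a minimal sketch using the CrossEncoder class from sentence-transformers. The checkpoint name and the over-retrieve-then-keep-3 numbers are illustrative choices, not requirements:

from sentence_transformers import CrossEncoder

# Illustrative two-stage retrieval: a cheap bi-encoder pass (retrieve from Step 3),
# then a cross-encoder that scores each (query, document) pair more precisely.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, top_k: int = 3, candidates: int = 6) -> list[str]:
    docs = retrieve(query, top_k=candidates)                # fast first pass
    scores = reranker.predict([(query, d) for d in docs])   # precise second pass
    ranked = sorted(zip(scores, docs), key=lambda p: p[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]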

Summary

RAG in its simplest form is three steps: embed → retrieve → generate. The implementation above is production-ready for small to medium corpora. As your knowledge base grows, swap NumPy for a proper vector database — the retrieval and generation logic stays identical.


