
Serhii Kalyna

Posted on • Originally published at kalyna.pro

RAG Tutorial with Python: Build a Retrieval-Augmented Generation System

LLMs are powerful, but they only know what was in their training data. RAG (Retrieval-Augmented Generation) solves this: before generating an answer, retrieve the most relevant documents from your own knowledge base and include them in the prompt. The result is an LLM that answers from your data instead of guessing.

This tutorial builds a complete RAG system from scratch — embeddings, vector search, and Claude for generation.

How RAG Works

  • Index — convert your documents into embedding vectors and store them
  • Retrieve — embed the user's query and find the most similar documents by cosine similarity
  • Generate — pass the retrieved documents as context to the LLM and ask it to answer

The LLM never sees your entire knowledge base — only the top-K most relevant chunks. This keeps prompts focused and costs predictable.

Prerequisites

pip install anthropic sentence-transformers numpy

sentence-transformers provides a local embedding model (~80MB) — no external API key needed for the retrieval step. anthropic handles generation.

Step 1: Define Your Knowledge Base

In production, this comes from chunked PDFs, database exports, or wiki pages. For this tutorial we use a list of strings:

# Your knowledge base — in production this comes from PDFs, DBs, wikis, etc.
documents = [
    "Claude is an AI assistant made by Anthropic. It supports tool use, prompt caching, and vision.",
    "The Claude API uses token-based pricing. Sonnet costs $3/MTok input and $15/MTok output.",
    "Prompt caching caches prompt prefixes for 5 minutes, saving up to 90% on repeated calls.",
    "The Batch API processes requests asynchronously within 24 hours at a 50% discount.",
    "Claude Haiku is the fastest and cheapest model, ideal for classification and extraction.",
    "Model Context Protocol (MCP) is an open standard by Anthropic for connecting AI to tools.",
]

Chunking strategy matters: chunks should be semantically coherent (one idea per chunk) and roughly 300–500 tokens each.
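
As a rough illustration, here is one way to produce such chunks. This helper is not part of the tutorial's pipeline; it splits on words as a stand-in for real token counting, and the sizes are just the defaults suggested above:

# Illustrative chunker: overlapping word-based chunks as a rough proxy for 300-500 token chunks.
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

# Example (assuming a hypothetical raw_pages list of page texts):
# documents = [c for page in raw_pages for c in chunk_text(page)]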

Step 2: Build the Vector Store

Embed every document once at startup and store the vectors as a NumPy array:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # ~80MB, runs locally, no API key needed

# Build the vector store once at startup
doc_embeddings = model.encode(documents, normalize_embeddings=True)
# Shape: (6, 384) — one 384-dim vector per document

all-MiniLM-L6-v2 is fast and compact (384-dim embeddings). For production, consider bge-large-en-v1.5 or OpenAI's text-embedding-3-small for higher quality.
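
If you want to try a stronger open model, the swap is a one-line change. A sketch, assuming the BAAI/bge-large-en-v1.5 checkpoint from the Hugging Face hub; only the vector size changes, the rest of the pipeline stays the same:

# Hypothetical swap to a larger embedding model (roughly 1.3 GB on disk, 1024-dim vectors)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
doc_embeddings = model.encode(documents, normalize_embeddings=True)  # shape becomes (6, 1024)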

Step 3: Retrieve Relevant Documents

Cosine similarity between normalized vectors is just a dot product — fast and dependency-free with NumPy:

def retrieve(query: str, top_k: int = 3) -> list[str]:
    query_vec = model.encode([query], normalize_embeddings=True)
    # Cosine similarity = dot product on L2-normalized vectors
    scores = (doc_embeddings @ query_vec.T).flatten()
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [documents[i] for i in top_indices]

# Test retrieval
results = retrieve("How much does Claude Sonnet cost?")
for doc in results:
    print("-", doc)

Step 4: Generate the Answer with Claude

Pass the retrieved documents as context and let Claude synthesize the answer:

import anthropic

client = anthropic.Anthropic()

def answer(question: str) -> str:
    context_docs = retrieve(question, top_k=3)
    context = "\n\n".join(f"- {doc}" for doc in context_docs)
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system="Answer questions using ONLY the provided context. If the answer is not in the context, say so.",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }]
    )
    return response.content[0].text

The system prompt instruction — "answer ONLY from the provided context" — is what keeps the model grounded. Without it, Claude may mix retrieval context with its own training knowledge.

Step 5: Run It

questions = [
    "What is prompt caching and how much can it save?",
    "Which Claude model is best for classification?",
    "What is MCP?",
]

for q in questions:
    print(f"Q: {q}")
    print(f"A: {answer(q)}")
    print()

Optimize with Prompt Caching

If your system prompt is long and repeated across many queries, cache it to save up to 90% on input costs:

def answer_with_cache(question: str) -> str:
    context_docs = retrieve(question, top_k=3)
    context = "\n\n".join(f"- {doc}" for doc in context_docs)
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=[{
            "type": "text",
            "text": "Answer questions using ONLY the provided context. If the answer is not in the context, say so.",
            "cache_control": {"type": "ephemeral"}  # cache the system prompt prefix
        }],
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }]
    )
    return response.content[0].text

Production Considerations

  • Vector database — replace NumPy with ChromaDB, Pinecone, or pgvector for persistence and scale
  • Chunking — use overlapping chunks (e.g., 400 tokens with 50-token overlap) to avoid splitting context at boundaries
  • Reranking — after top-K retrieval, apply a cross-encoder reranker for higher precision (see the sketch after this list)
  • Metadata filtering — filter by date, category, or source before semantic search
  • Evaluation — measure retrieval recall and answer faithfulness with frameworks like RAGAS
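
To make the reranking bullet concrete, here is a minimal sketch using the CrossEncoder class from sentence-transformers. The checkpoint name and the over-retrieve-then-keep-3 numbers are illustrative choices, not requirements:

from sentence_transformers import CrossEncoder

# Illustrative two-stage retrieval: a cheap bi-encoder pass (retrieve from Step 3),
# then a cross-encoder that scores each (query, document) pair more precisely.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, top_k: int = 3, candidates: int = 6) -> list[str]:
    docs = retrieve(query, top_k=candidates)                # fast first pass
    scores = reranker.predict([(query, d) for d in docs])   # precise second pass
    ranked = sorted(zip(scores, docs), key=lambda p: p[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]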

Summary

RAG in its simplest form is three steps: embed → retrieve → generate. The implementation above is production-ready for small to medium corpora. As your knowledge base grows, swap NumPy for a proper vector database — the retrieval and generation logic stays identical.


