RAG Implementation Patterns with Claude Code: Building Retrieval-Augmented Generation Systems

What is RAG and Why It Matters Now

RAG (Retrieval-Augmented Generation) is an architectural pattern that compensates for an LLM's fixed training cutoff and missing domain knowledge by retrieving external documents at query time and injecting them into the prompt. When building production systems with Claude Code, it's an essential technique for grounding answers in internal documentation, codebases, and real-time data.

The basic flow (a minimal end-to-end sketch follows the list):

  1. Split documents into chunks
  2. Convert each chunk into vector embeddings
  3. Store in a vector database
  4. At query time, retrieve similar chunks
  5. Pass retrieved chunks as context to the LLM
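
To make the flow concrete, here is a minimal orchestration sketch wiring together the functions defined in the sections below (create_chunks, embed_texts, hybrid_search, rag_query). The documents table and the assumption that pgvector's register_vector has been called on the asyncpg connection are illustrative, not prescriptive.

import json

async def index_documents(conn, docs: list[dict]) -> None:
    # Steps 1-3. Assumes register_vector(conn) from the pgvector package has
    # been called so Python lists bind to the vector column.
    for doc in docs:
        chunks = create_chunks(doc["text"])         # 1. split into chunks
        embeddings = await embed_texts(chunks)      # 2. vector embeddings
        for chunk, emb in zip(chunks, embeddings):  # 3. store in the vector DB
            await conn.execute(
                "INSERT INTO documents (content, metadata, embedding)"
                " VALUES ($1, $2::jsonb, $3)",
                chunk, json.dumps(doc.get("metadata", {})), emb,
            )

async def answer(conn, query: str) -> str:
    # Steps 4-5: retrieve similar chunks, then generate with that context.
    query_embedding = (await embed_texts([query]))[0]
    docs = await hybrid_search(conn, query, query_embedding)
    return await rag_query(query, docs)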

Document Chunking Strategy

A recursive character splitter is a reasonable default for prose and Markdown:

from langchain.text_splitter import RecursiveCharacterTextSplitter

def create_chunks(text: str) -> list[str]:
    # Try paragraph, line, sentence, then word boundaries, in that order,
    # falling back to character-level splits only when nothing else fits.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,    # max characters per chunk
        chunk_overlap=64,  # overlap preserves context across chunk boundaries
        separators=["\n\n", "\n", ".", " ", ""],
    )
    return splitter.split_text(text)

For code files, splitting by function/class boundaries gives more semantically accurate results:

import ast

def split_python_file(source: str) -> list[dict]:
    # Chunk at top-level function/class boundaries. Iterating tree.body
    # (rather than ast.walk) avoids emitting nested defs as duplicate,
    # overlapping chunks; methods stay inside their class's chunk.
    tree = ast.parse(source)
    source_lines = source.split("\n")
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno - 1  # lineno is 1-indexed
            end = node.end_lineno    # requires Python 3.8+
            chunks.append({
                "content": "\n".join(source_lines[start:end]),
                "type": type(node).__name__,
                "name": node.name,
            })
    return chunks

Embedding Model Selection

OpenAI's text-embedding-3-small is a solid hosted default, balancing cost and retrieval quality:

from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def embed_texts(texts: list[str]) -> list[list[float]]:
    # Embeds the whole batch in one request; results come back
    # in the same order as the input texts.
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
        encoding_format="float",
    )
    return [item.embedding for item in response.data]

For zero-cost local embedding, a small multilingual E5 model works well. Note that E5 models expect a role prefix on every input:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")

def embed_local(texts: list[str], is_query: bool = False):
    # E5 models are trained with role prefixes: "query: " for search
    # queries, "passage: " for the documents being indexed.
    prefix = "query: " if is_query else "passage: "
    prefixed = [f"{prefix}{t}" for t in texts]
    return model.encode(prefixed, normalize_embeddings=True)

Hybrid Search with pgvector

Pure vector search can miss exact identifiers and rare keywords, so combining cosine similarity with Postgres full-text search usually retrieves better candidates. The query below blends both scores with a tunable weight; the schema it assumes is sketched after the function.

async def hybrid_search(conn, query_text, query_embedding, limit=5, semantic_weight=0.7):
    # Assumes pgvector's register_vector(conn) has been called so the
    # embedding binds to the vector parameter. ts_rank is not on the same
    # 0-1 scale as cosine similarity, so the weighted sum is a heuristic;
    # tune semantic_weight empirically.
    rows = await conn.fetch("""
        WITH semantic AS (
            SELECT id, content, metadata,
                   1 - (embedding <=> $1::vector) AS sem_score
            FROM documents
            ORDER BY embedding <=> $1::vector
            LIMIT 20
        ),
        keyword AS (
            SELECT id, content, metadata,
                   ts_rank(to_tsvector('english', content),
                           plainto_tsquery('english', $2)) AS kw_score
            FROM documents
            WHERE to_tsvector('english', content) @@ plainto_tsquery('english', $2)
            ORDER BY kw_score DESC
            LIMIT 20
        )
        SELECT id,
               COALESCE(s.content, k.content) AS content,
               COALESCE(s.metadata, k.metadata) AS metadata,
               ($3 * COALESCE(s.sem_score, 0) + (1 - $3) * COALESCE(k.kw_score, 0)) AS score
        FROM semantic s
        FULL OUTER JOIN keyword k USING (id)
        ORDER BY score DESC
        LIMIT $4
    """, query_embedding, query_text, semantic_weight, limit)
    return [dict(row) for row in rows]
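
The function above assumes a documents table with a vector column, an ANN index, and a full-text index. A minimal setup sketch using the pgvector Python package's asyncpg integration (the 1536 dimension matches text-embedding-3-small; index names are illustrative):

from pgvector.asyncpg import register_vector

async def init_schema(conn):
    await conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    await conn.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id BIGSERIAL PRIMARY KEY,
            content TEXT NOT NULL,
            metadata JSONB DEFAULT '{}',
            embedding vector(1536)
        )
    """)
    # Approximate nearest-neighbor index for the semantic CTE (pgvector >= 0.5).
    await conn.execute("""
        CREATE INDEX IF NOT EXISTS documents_embedding_idx
        ON documents USING hnsw (embedding vector_cosine_ops)
    """)
    # GIN index backing the keyword CTE's full-text search.
    await conn.execute("""
        CREATE INDEX IF NOT EXISTS documents_content_fts_idx
        ON documents USING gin (to_tsvector('english', content))
    """)
    # Lets asyncpg bind Python lists directly to vector parameters.
    await register_vector(conn)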

Context Injection into the LLM

Retrieved chunks are numbered and wrapped in XML-style tags so the model can distinguish sources and admit when the answer is not in the context:

import anthropic

async def rag_query(query: str, retrieved_docs: list[dict]) -> str:
    client = anthropic.AsyncAnthropic()
    context_parts = []
    for i, doc in enumerate(retrieved_docs, 1):
        source = doc.get("metadata", {}).get("source", "unknown")
        context_parts.append(f"[Document {i}] (source: {source})\n{doc['content']}")
    context = "\n\n---\n\n".join(context_parts)

    response = await client.messages.create(
        model="claude-opus-4-5",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"""Answer based on the reference documents below.

<context>
{context}
</context>

<question>{query}</question>

If information is not in the documents, please state so clearly."""
        }],
    )
    return response.content[0].text

Evaluating RAG Quality

The ragas library scores a pipeline along four axes: faithfulness (is the answer grounded in the retrieved context?), answer relevancy, context precision, and context recall.

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

def evaluate_rag(test_cases: list[dict]) -> dict:
    # Each test case needs the columns ragas expects: "question", "answer",
    # "contexts" (list of retrieved strings), and "ground_truth" (reference
    # answer, required by the context_precision/context_recall metrics).
    dataset = Dataset.from_list(test_cases)
    return evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )
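
A sketch of what one test case might look like (the contents are illustrative):

test_cases = [
    {
        "question": "How do I split Python files into chunks?",
        "answer": "Split on top-level function and class boundaries using ast.",
        "contexts": ["def split_python_file(source: str) -> list[dict]: ..."],
        "ground_truth": "Chunk by function/class definitions via the ast module.",
    },
]
scores = evaluate_rag(test_cases)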

RAG systems require continuous monitoring of retrieval logs, score distributions, and user feedback to keep improving chunking strategy and model selection.
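
As a starting point, logging each retrieval as one structured record makes score distributions easy to analyze later. A minimal sketch (the logger name and field choices are assumptions):

import json
import logging
import time

retrieval_logger = logging.getLogger("rag.retrieval")

def log_retrieval(query: str, results: list[dict]) -> None:
    # One JSON line per query, ready for any log pipeline.
    retrieval_logger.info(json.dumps({
        "ts": time.time(),
        "query": query,
        "results": [{"id": r["id"], "score": float(r["score"])} for r in results],
    }))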


This article is from the Claude Code Complete Guide (7 chapters) on note.com.
myouga (@myougatheaxo) - VTuber axolotl. Sharing practical AI development tips.
