WonderLab

RAG Series (24): Code RAG — Teaching AI to Understand Your Codebase

The Difference Between Code and Documents

Split a Python file into 1000-character chunks with RecursiveCharacterTextSplitter, embed them, and run vector search: this is the most common "code RAG" implementation. The problem is that it treats code as plain text:

def evaluate_rag(questions, answers, contexts):
    """Evaluate RAG system quality"""
    ...  # about 50 lines of code

Character-based chunking will:

  • Split functions in half (first half in chunk A, second half in chunk B)
  • Lose function boundary information (this IS evaluate_rag, not random text)
  • Ignore call relationships (what this function calls, who calls it)
  • Destroy structural hierarchy (this is a method of RAGPipeline)

Code carries three layers of information: semantics (what it does), structure (function/class/module boundaries), and call relationships (who calls whom). Good code RAG models all three.
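
To make the first failure mode concrete, here is a small sketch that cuts a source string into fixed-size character chunks and compares the result with the line span reported by the ast module. The chunker is a plain-Python stand-in, not RecursiveCharacterTextSplitter itself, and the function body and the 80-character chunk size are invented for illustration.

```python
import ast

# Invented sample: a short body stands in for the "50 lines of code".
SOURCE = '''\
def evaluate_rag(questions, answers, contexts):
    """Evaluate RAG system quality"""
    scores = []
    for q, a, c in zip(questions, answers, contexts):
        scores.append(len(set(a.split()) & set(" ".join(c).split())))
    return sum(scores) / max(len(scores), 1)
'''


def char_chunks(text: str, size: int) -> list[str]:
    """Naive fixed-size chunking: cut every `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


SIZE = 80
chunks = char_chunks(SOURCE, SIZE)

# Which chunk holds the signature, and which holds the return statement?
head_idx = SOURCE.index("def evaluate_rag") // SIZE
tail_idx = SOURCE.index("return") // SIZE
print(f"signature in chunk {head_idx}, return in chunk {tail_idx}")

# AST boundaries: one unit covering the whole function.
func = ast.parse(SOURCE).body[0]
unit = "\n".join(SOURCE.splitlines()[func.lineno - 1 : func.end_lineno])
print(f"AST unit spans lines {func.lineno}-{func.end_lineno}")
```

The signature and the return statement land in different chunks, so neither chunk on its own says what evaluate_rag returns, while the AST unit keeps both together.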

This article uses llm-in-action as the target and builds a code RAG system capable of answering "how is this function used?" and "show me all call chains through this function."


Parse Code with AST, Not Character Offsets

Python's ast module parses source files into syntax trees. A function definition is a node (ast.FunctionDef) with its exact start line, end line, and decorator list. Chunking at AST boundaries guarantees splits at function edges:

import ast


class _FuncExtractor(ast.NodeVisitor):
    """Collect one CodeUnit per function or method, tracking the enclosing class."""
    def __init__(self, source: str, rel_path: str):
        self._lines       = source.splitlines()
        self._rel_path    = rel_path
        self._class_stack: list[str] = []
        self.units:        list[CodeUnit] = []

    def visit_ClassDef(self, node: ast.ClassDef):
        # Track current class so methods know their parent_class
        self._class_stack.append(node.name)
        self.generic_visit(node)
        self._class_stack.pop()

    def _visit_func(self, node):
        # Extract source by line number, not character offset
        src = "\n".join(self._lines[node.lineno - 1 : node.end_lineno])
        unit = CodeUnit(
            name         = node.name,
            kind         = "method" if self._class_stack else "function",
            file         = self._rel_path,
            start_line   = node.lineno,
            end_line     = node.end_lineno,
            source       = src,
            docstring    = ast.get_docstring(node) or "",
            parent_class = self._class_stack[-1] if self._class_stack else "",
            calls        = self._extract_calls(node),
        )
        self.units.append(unit)
        self.generic_visit(node)

    visit_FunctionDef      = _visit_func
    visit_AsyncFunctionDef = _visit_func
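
The extractor builds CodeUnit records, but the excerpt does not show that type. A minimal sketch follows, assuming a plain dataclass whose fields mirror the keyword arguments used in _visit_func; the default values are assumptions. It also illustrates one detail worth knowing: on Python 3.8 and later, node.lineno points at the def line, so decorators fall outside the extracted source unless the start line is widened. The helper first_line_with_decorators is hypothetical and not part of the article's code.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class CodeUnit:
    # Field names follow the keyword arguments in _visit_func; defaults are assumed.
    name: str
    kind: str
    file: str
    start_line: int
    end_line: int
    source: str
    docstring: str = ""
    parent_class: str = ""
    calls: list[str] = field(default_factory=list)


def first_line_with_decorators(node: ast.AST) -> int:
    """Hypothetical helper: earliest line among the def and its decorators."""
    return min([node.lineno] + [d.lineno for d in node.decorator_list])


SAMPLE = '''\
import functools

@functools.lru_cache(maxsize=None)
def load(path):
    """Load a file."""
    return path.upper()
'''

tree = ast.parse(SAMPLE)
func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
lines = SAMPLE.splitlines()
start = first_line_with_decorators(func)

unit = CodeUnit(
    name="load",
    kind="function",
    file="sample.py",
    start_line=start,
    end_line=func.end_lineno,
    source="\n".join(lines[start - 1 : func.end_lineno]),
    docstring=ast.get_docstring(func) or "",
)
print(func.lineno, unit.start_line)
```

Whether decorators belong in the unit is a design choice: they often carry meaning (caching, routing, retries) that a Q&A prompt benefits from, at the cost of slightly longer source.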

Call relationships are extracted from ast.Call nodes:

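# Method of _FuncExtractor (shown dedented for readability)
def _extract_calls(self, node) -> list[str]:
    # Collect the bare names of everything called inside this function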
    calls: set[str] = set()
    for child in ast.walk(node):
        if isinstance(child, ast.Call):
            if isinstance(child.func, ast.Name):
                calls.add(child.func.id)           # direct call: foo()
            elif isinstance(child.func, ast.Attribute):
                calls.add(child.func.attr)          # attribute call: self.foo()
    return sorted(calls)
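
To see what this extraction records in practice, the sketch below runs the same logic as a standalone function on an invented sample. Note how two unrelated .get() calls on different receivers end up as a single name.

```python
import ast


def extract_calls(node: ast.AST) -> list[str]:
    """Standalone copy of the extraction logic shown above."""
    calls: set[str] = set()
    for child in ast.walk(node):
        if isinstance(child, ast.Call):
            if isinstance(child.func, ast.Name):
                calls.add(child.func.id)       # direct call: foo()
            elif isinstance(child.func, ast.Attribute):
                calls.add(child.func.attr)     # attribute call: obj.foo()
    return sorted(calls)


# Invented sample function, used only to exercise the extractor.
SAMPLE = '''\
def answer(self, question, cache, settings):
    hit = cache.get(question)
    top_k = settings.get("top_k", 4)
    docs = self.retrieve(question, top_k)
    return hit or format_answer(docs)
'''

func = ast.parse(SAMPLE).body[0]
names = extract_calls(func)
print(names)  # ['format_answer', 'get', 'retrieve']
```

The receiver (cache, settings, self) is discarded, which keeps the extractor simple but means the call graph cannot tell a cache lookup from a dictionary lookup.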

Extraction results on llm-in-action

Scanned: /mnt/hdd/Database/03_Projects/LLM/llm-in-action
Time: 0.13 seconds

Python files:   22
Functions:      188 (top-level)
Methods:         37 (class methods)
Total units:    225
Article dirs:    18

0.13 seconds to scan the entire codebase. AST parsing doesn't execute code, so there are zero side effects.
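
A minimal sketch of such a repository scan, assuming a hypothetical scan_repo helper (the article's own extract_repo is not shown here and may differ). It demonstrates that a module-level statement which would abort the process when run is only parsed, never executed, and that a file with a syntax error can be skipped rather than stopping the scan.

```python
import ast
import tempfile
from pathlib import Path


def scan_repo(repo_dir: Path) -> tuple[list[tuple[str, str]], list[str]]:
    """Return (file, function name) pairs plus the files that failed to parse."""
    found: list[tuple[str, str]] = []
    skipped: list[str] = []
    for path in sorted(repo_dir.rglob("*.py")):
        rel = path.relative_to(repo_dir).as_posix()
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"), filename=rel)
        except SyntaxError:
            skipped.append(rel)  # unparsable file: record it and move on
            continue
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                found.append((rel, node.name))
    return found, skipped


# Build a throwaway two-file repo to exercise the scan.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pkg").mkdir()
    (root / "pkg" / "ok.py").write_text(
        "raise SystemExit('never runs while parsing')\n"
        "def build_index():\n"
        "    pass\n",
        encoding="utf-8",
    )
    (root / "broken.py").write_text("def oops(:\n", encoding="utf-8")
    found, skipped = scan_repo(root)

print(found)    # [('pkg/ok.py', 'build_index')]
print(skipped)  # ['broken.py']
```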


Call Graph: Understanding Who Calls Whom

Once function call relationships are extracted, build a bidirectional adjacency map — queryable in both directions:

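from collections import defaultdict, deque
from typing import Optional


class CallGraph:
    """Bidirectional call graph keyed by bare function name.

    The _bfs helper used by downstream/upstream is not shown in this excerpt.
    """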
    def __init__(self, units: list[CodeUnit]):
        self.callees: dict[str, set[str]] = defaultdict(set)  # caller → called
        self.callers: dict[str, set[str]] = defaultdict(set)  # callee → caller

        known = {u.name for u in units}
        for u in units:
            for callee in u.calls:
                if callee in known:           # intra-repo edges only
                    self.callees[u.name].add(callee)
                    self.callers[callee].add(u.name)

    def downstream(self, name: str, depth: int = 4) -> list[str]:
        """All functions transitively called by name (BFS)."""
        return self._bfs(name, self.callees, depth)

    def upstream(self, name: str, depth: int = 4) -> list[str]:
        """All functions that transitively call name (BFS)."""
        return self._bfs(name, self.callers, depth)

    def shortest_path(self, start: str, end: str) -> Optional[list[str]]:
        """Shortest call path from start → end."""
        queue: deque[list[str]] = deque([[start]])
        visited: set[str] = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == end:
                return path
            for nxt in self.callees.get(path[-1], set()):
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append(path + [nxt])
        return None
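
The _bfs helper that downstream and upstream rely on is not part of the excerpt. One plausible shape for it, written here as a standalone function under the assumption that it returns reachable names in breadth-first order, excludes the start node, and stops at the given depth:

```python
from collections import deque


def bfs(start: str, adjacency: dict[str, set[str]], depth: int = 4) -> list[str]:
    """Names reachable from `start` within `depth` hops, in BFS order."""
    seen = {start}
    order: list[str] = []
    queue = deque([(start, 0)])
    while queue:
        name, dist = queue.popleft()
        if dist == depth:
            continue  # depth budget used up on this branch
        # .get() avoids creating empty entries if adjacency is a defaultdict;
        # sorting makes the output deterministic.
        for nxt in sorted(adjacency.get(name, ())):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, dist + 1))
    return order


# Invented toy graph, including a cycle (load_documents -> build_index).
callees = {
    "main": {"build_index", "query"},
    "build_index": {"load_documents", "get"},
    "query": {"get"},
    "load_documents": {"build_index"},
}

full = bfs("main", callees)
shallow = bfs("main", callees, depth=1)
print(full)     # ['build_index', 'query', 'get', 'load_documents']
print(shallow)  # ['build_index', 'query']
```

The seen set is what keeps recursive or mutually recursive functions from looping forever; the depth limit bounds the output on large graphs.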

Call graph analysis results

Call graph statistics:
  Functions with outgoing edges:  78  (they call others)
  Functions with incoming edges:  92  (they are called)
  Total edges:                   168

Most-called functions (the codebase's core utilities):

get               ← called from 48 places  (cache reads throughout all articles)
set               ← called from 10 places  (cache writes)
split_documents   ← called from  5 places  (shared chunking helper)
build_embeddings  ← called from  4 places
query             ← called from  4 places

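get appearing 48 times is an artifact of name-based matching combined with Python duck typing. The extractor records only the attribute name of a call, so .get() on SemanticCache, InMemoryCache, and any other object collapses to the single name get; without type information, static analysis cannot tell the receivers apart.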

Functions with the most outgoing calls (orchestrators):

main                → 54 direct calls
build_self_rag_graph →  6 direct calls
build_index          →  5 direct calls
build_ragas_dataset  →  5 direct calls

main calling 54 functions is the signature of an entry point — it orchestrates the full pipeline by calling every sub-step.
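
Both rankings fall out of the two adjacency maps by sorting on set size. A minimal sketch, using an invented edge list rather than the real graph; rank_by_degree is a hypothetical helper name:

```python
from collections import defaultdict

# Invented edges (caller, callee) for illustration only.
edges = [
    ("main", "build_index"),
    ("main", "query"),
    ("main", "get"),
    ("build_index", "get"),
    ("build_index", "split_documents"),
    ("query", "get"),
]

callees: dict[str, set[str]] = defaultdict(set)  # caller -> called
callers: dict[str, set[str]] = defaultdict(set)  # callee -> callers
for caller, callee in edges:
    callees[caller].add(callee)
    callers[callee].add(caller)


def rank_by_degree(adjacency: dict[str, set[str]], top: int = 3) -> list[tuple[str, int]]:
    """Top names by neighbour count; ties broken alphabetically."""
    ranked = sorted(adjacency.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(name, len(neighbours)) for name, neighbours in ranked[:top]]


most_called = rank_by_degree(callers)    # in-degree: shared utilities
orchestrators = rank_by_degree(callees)  # out-degree: entry points
print(most_called)    # [('get', 3), ('build_index', 1), ('query', 1)]
print(orchestrators)  # [('main', 3), ('build_index', 2), ('query', 1)]
```

Because the maps hold sets, each count is the number of distinct neighbours, not the number of call sites.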

Call chain traversal

build_self_rag_graph (14-self-rag/self_rag.py) full downstream:

build_self_rag_graph
  ├── make_retrieve_node
  ├── make_filter_node
  ├── make_decide_node
  ├── make_support_node
  ├── make_rag_generate_node
  └── make_direct_generate_node

This is exactly the Self-RAG StateGraph builder pattern: one factory function assembles all graph nodes, each node is an independent small function. The call graph makes this structure immediately visible.

build_index (08-ragas-eval/rag_pipeline.py) downstream chain:

build_index
  → load_documents
  → build_llm
  → build_embeddings
  → split_documents
  → get  (cache)

A canonical RAG initialization sequence: load docs → build LLM → build embeddings → chunk → cache.


Vector Store: Semantic Code Search

Code vectorization has one engineering constraint: function source can be long (50–200 lines), but many embedding models, BERT-based ones in particular, cap input at 512 tokens.

Solution: separate the retrieval unit from the Q&A context.

  • Embedding content: function name + docstring (short, semantically precise, fits in token budget)
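  • Metadata: the function's source code (stored in Chroma's metadata field and read at Q&A time for LLM context; the snippet below caps it at 2,000 characters)

# Qualified name such as "RAGPipeline.build_index" (assumed form, matching the search results shown later)
full_name     = f"{u.parent_class}.{u.name}" if u.parent_class else u.name
sig_line      = u.source.splitlines()[0]
embed_content = f"{full_name}: {u.docstring or sig_line}"[:400]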

Document(
    page_content = embed_content,         # vectorized — used for retrieval
    metadata = {
        "name":        u.name,
        "file":        u.file,
        "start_line":  u.start_line,
        "source_code": u.source[:2000],   # not vectorized — used for LLM context
    },
)

At Q&A time, retrieval finds the relevant functions, then their source is read back from metadata (the example truncates each function to 600 characters):

docs    = vs.similarity_search(question, k=4)
context = "\n\n---\n\n".join(
    d.metadata.get("source_code", d.page_content)[:600] for d in docs
)
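
The split between "what gets embedded" and "what gets handed to the LLM" can be tried without a vector database. The sketch below is a toy under loud assumptions: a bag-of-words counter stands in for a real embedding model, a list stands in for Chroma, and the two records (QuotaGuard.allow, chunk_files) are invented.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Doc:
    page_content: str                       # short text that gets "embedded"
    metadata: dict = field(default_factory=dict)  # carries the source code


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


docs = [
    Doc(
        "QuotaGuard.allow: Return True while the user is under the rate limit",
        {"file": "guard.py",
         "source_code": "def allow(self, user):\n    return self.counts[user] < self.limit"},
    ),
    Doc(
        "chunk_files: Split loaded files into overlapping chunks",
        {"file": "chunking.py",
         "source_code": "def chunk_files(files):\n    return [c for f in files for c in cut(f)]"},
    ),
]
index = [(embed(d.page_content), d) for d in docs]


def similarity_search(question: str, k: int = 1) -> list[Doc]:
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [d for _, d in ranked[:k]]


hits = similarity_search("rate limiting per user", k=1)
# Retrieval matched on the docstring; the prompt context comes from metadata.
context = "\n\n---\n\n".join(
    h.metadata.get("source_code", h.page_content)[:600] for h in hits
)
print(context)
```

The query never touches the source code: it matches the short docstring text, and only afterwards is the stored source pulled out of metadata.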

Semantic search results

Query: "RAGAS evaluation metrics calculation"
  0.488  RAGPipeline.build_index   (08-ragas-eval/rag_pipeline.py:95)
  0.476  create_ragas_embeddings   (08-ragas-eval/evaluate.py:50)
  0.467  RAGPipeline.query         (08-ragas-eval/rag_pipeline.py:141)

Query: "rate limiting and access control in enterprise RAG"
  0.504  RAGPipeline.__init__      (08-ragas-eval/rag_pipeline.py:78)
  0.497  RateLimiter.__init__      (20-enterprise-rag/enterprise_rag.py:118)

Query: "incremental document indexing with record manager"
  0.296  generate_testset          (08-ragas-eval/generate_qa.py:51)

Query: "conversational history aware retriever"
  0.400  make_ds                   (18-conversational-rag/conversational_rag.py:428)

The RAGAS query landed in the right files, and the rate-limiting query surfaced RateLimiter.__init__ as its second hit. The incremental-update query missed: the functions in 19-incremental-update/ don't mention "record manager" in their docstrings, only in their source code bodies. This is the core limitation of docstring-only embedding: search quality is bounded by docstring quality.
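
One possible mitigation, which the article's code does not implement, is to append the extracted call names to the embedded text so that identifiers from the body become searchable. The sketch below assumes a hypothetical build_embed_text helper; the function name, docstring, and call list in the example are invented, and the approach only helps when the missing term actually appears as a called name.

```python
def build_embed_text(
    full_name: str,
    docstring: str,
    signature: str,
    calls: list[str],
    limit: int = 400,
) -> str:
    """Name + docstring (or signature as fallback) + called names, truncated."""
    summary = docstring.strip() or signature.strip()
    text = f"{full_name}: {summary}"
    if calls:
        text += " | calls: " + ", ".join(calls)
    return text[:limit]


# Invented example: the docstring never mentions a record manager,
# but a called name does.
with_calls = build_embed_text(
    "sync_index",
    "Sync new documents into the store",
    "def sync_index(docs):",
    ["SQLRecordManager", "index"],
)
no_docstring = build_embed_text("sync_index", "", "def sync_index(docs):", [])
print(with_calls)
print(no_docstring)
```

The 400-character cap matches the truncation used for embed_content above; with long call lists the tail is what gets cut, so the name and summary always survive.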


Choosing a Code Embedding Model

General-purpose text embedding models (BGE, OpenAI's text-embedding-3 models) are adequate but not great for code. They can retrieve by docstring, but may not recognize that a loop such as for i in range(n): acc += arr[i] is an accumulation.

Specialized code embedding models:

| Model | Characteristics |
| --- | --- |
| microsoft/codebert-base | Bimodal encoder pre-trained on paired code and documentation; understands variable names and signatures |
| Salesforce/codet5-base | Generative (encoder-decoder) model; suited for code completion + retrieval |
| nomic-ai/nomic-embed-text-v1.5 | General model with strong code performance; 8192-token limit |
| voyage-code-2 | Voyage AI's code-specialized model; among the stronger options for code retrieval |

Recommended: if token limits aren't a concern (e.g., nomic-embed-text-v1.5 supports 8192 tokens), embed the complete function source directly — no need to split docstrings from source.


The Complete Code RAG Pipeline

# Build a code RAG system

# 1. AST extraction: all functions and methods
units = extract_repo(repo_dir)

# 2. Call graph: bidirectional adjacency
cg = CallGraph(units)

# 3. Vector store: docstrings for retrieval, source_code in metadata for Q&A
vs = build_vectorstore(units, embeddings)

# Three query modes

# A: Semantic search — find functions by meaning
hits = vs.similarity_search("embedding caching", k=5)

# B: Call chain — given a function name, find all upstream/downstream
callers  = cg.upstream("build_embeddings")    # → who calls it
callees  = cg.downstream("main")              # → what it calls
path     = cg.shortest_path("main", "get")    # → how main reaches get

# C: LLM Q&A — retrieve context, generate answer
answer = llm_code_qa("How is incremental update implemented?", vs, llm)
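
The body of llm_code_qa is not shown here. Its retrieval-to-prompt step might look roughly like the sketch below, which assumes a hypothetical build_code_qa_prompt helper operating on the metadata dictionaries described earlier (name, file, start_line, source_code). The prompt wording and the sample source strings are invented; the actual LLM call is left out.

```python
def build_code_qa_prompt(question: str, hits: list[dict], max_chars: int = 600) -> str:
    """Assemble an LLM prompt from retrieved function metadata."""
    blocks = []
    for meta in hits:
        # A location header lets the model cite where each snippet lives.
        header = f"# {meta['file']}:{meta['start_line']} {meta['name']}"
        blocks.append(header + "\n" + meta["source_code"][:max_chars])
    context = "\n\n---\n\n".join(blocks)
    return (
        "Answer the question using only the code below. "
        "If the code does not contain the answer, say so.\n\n"
        f"Code:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Invented hits shaped like the metadata stored in the vector store.
sample_hits = [
    {"name": "build_index", "file": "rag_pipeline.py", "start_line": 95,
     "source_code": "def build_index(self):\n    docs = load_documents()"},
    {"name": "query", "file": "rag_pipeline.py", "start_line": 141,
     "source_code": "def query(self, question):\n    return self.chain.invoke(question)"},
]

prompt = build_code_qa_prompt("How is the index built?", sample_hits)
print(prompt)
```

Keeping file and line in the header costs a few tokens per snippet and makes answers easier to verify against the repository.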

Results Summary

| Metric | Value |
| --- | --- |
| Python files | 22 |
| Code units extracted | 225 (188 functions + 37 methods) |
| AST parse time | 0.13 seconds |
| Call graph edges | 168 |
| Vectorization time | 5.8 seconds |
| Most-called function | get (48 places) |
| Widest caller | main (54 direct calls) |

Full Code

Complete code is open-sourced at:

https://github.com/chendongqi/llm-in-action/tree/main/24-code-rag

Key file:

  • code_rag.py — AST extraction, call graph, vectorization, search, report

How to run:

git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/24-code-rag
cp .env.example .env
pip install -r requirements.txt
python code_rag.py

Summary

The core difference between code RAG and document RAG:

| Aspect | Document RAG | Code RAG |
| --- | --- | --- |
| Chunk unit | Fixed-size text blocks | Functions/methods (AST boundaries) |
| Structure | None | Class hierarchy, module hierarchy |
| Call relationships | None | Call graph (bidirectional) |
| Embedding content | Full text | Docstring + signature (or full source) |
| Query types | Semantic search | Semantic search + call chain traversal |

Three key tradeoffs:

  1. AST vs text chunking: AST cuts at function boundaries and preserves complete units. Text chunking is faster but destroys structure. For production code RAG, use AST — there's no reason not to.
  2. Docstring vs full source embedding: Under token constraints, embed docstrings (short and semantically focused) — but quality depends on docstring completeness. With a long-context embedding model, embed the full source directly.
  3. Call graph vs pure vector retrieval: Vector retrieval finds semantically similar functions; the call graph answers "what does X call?" and "who uses X?" — they're complementary, not interchangeable.

This is the final article in the RAG series. Twenty-four articles covering the complete path from "what is RAG?" to "how do you teach AI to understand a codebase?" All code is open-sourced at llm-in-action — every article has a runnable demo and a real benchmark report.

