The Difference Between Code and Documents
Split a Python file into 1000-character chunks with RecursiveCharacterTextSplitter, embed them, run vector search — this is the most common "code RAG" implementation. The problem is that it treats code as text:
def evaluate_rag(questions, answers, contexts):
    """Evaluate RAG system quality"""
    ...(50 lines of code)...
Character-based chunking will:
- Split functions in half (first half in chunk A, second half in chunk B)
- Lose function boundary information (this IS evaluate_rag, not random text)
- Ignore call relationships (what this function calls, who calls it)
- Destroy structural hierarchy (this is a method of RAGPipeline)
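To make the first failure mode concrete, here is a minimal sketch that compares fixed-size character chunks with the function spans reported by `ast`. The sample source, the 120-character chunk size, and the helper names (`char_chunks`, `function_spans`, `split_functions`) are illustrative assumptions, not code from the repository.

```python
import ast

# Hypothetical sample source: two small top-level functions.
SOURCE = '''def load(path):
    """Read a file and return its text."""
    with open(path) as fh:
        return fh.read()


def evaluate(questions, answers):
    """Score each answer against its question."""
    scores = []
    for q, a in zip(questions, answers):
        scores.append(len(set(q.split()) & set(a.split())))
    return scores
'''

CHUNK_SIZE = 120  # deliberately small so the effect shows on a short sample


def char_chunks(text: str, size: int) -> list[str]:
    """Naive fixed-size chunking by character offset."""
    return [text[i : i + size] for i in range(0, len(text), size)]


def function_spans(text: str) -> list[tuple[str, int, int]]:
    """(name, start_offset, end_offset) for each top-level function."""
    line_start = [0]
    for line in text.splitlines(keepends=True):
        line_start.append(line_start[-1] + len(line))
    spans = []
    for node in ast.parse(text).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            spans.append(
                (node.name, line_start[node.lineno - 1], line_start[node.end_lineno])
            )
    return spans


def split_functions(text: str, size: int) -> list[str]:
    """Names of functions that straddle at least one chunk boundary."""
    boundaries = range(size, len(text), size)
    return [
        name
        for name, start, end in function_spans(text)
        if any(start < b < end for b in boundaries)
    ]


print(len(char_chunks(SOURCE, CHUNK_SIZE)), "character chunks")
print("functions cut by a chunk boundary:", split_functions(SOURCE, CHUNK_SIZE))
```

In this sample `evaluate` is longer than one chunk, so at least one boundary must land inside it. Chunking at the spans returned by `function_spans` avoids that by construction.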
Code carries three layers of information: semantics (what it does), structure (function/class/module boundaries), call relationships (who calls whom). Good code RAG models all three.
This article uses llm-in-action as the target and builds a code RAG system capable of answering "how is this function used?" and "show me all call chains through this function."
Parse Code with AST, Not Character Offsets
Python's ast module parses source files into syntax trees. A function definition is a node (ast.FunctionDef) with its exact start line, end line, and decorator list. Chunking at AST boundaries guarantees splits at function edges:
class _FuncExtractor(ast.NodeVisitor):
    def __init__(self, source: str, rel_path: str):
        self._lines = source.splitlines()
        self._rel_path = rel_path
        self._class_stack: list[str] = []
        self.units: list[CodeUnit] = []

    def visit_ClassDef(self, node: ast.ClassDef):
        # Track current class so methods know their parent_class
        self._class_stack.append(node.name)
        self.generic_visit(node)
        self._class_stack.pop()

    def _visit_func(self, node):
        # Extract source by line number, not character offset.
        # On Python 3.8+, node.lineno is the `def` line, so decorator
        # lines are not part of the extracted source.
        src = "\n".join(self._lines[node.lineno - 1 : node.end_lineno])
        unit = CodeUnit(
            name = node.name,
            kind = "method" if self._class_stack else "function",
            file = self._rel_path,
            start_line = node.lineno,
            end_line = node.end_lineno,
            source = src,
            docstring = ast.get_docstring(node) or "",
            parent_class = self._class_stack[-1] if self._class_stack else "",
            calls = self._extract_calls(node),
        )
        self.units.append(unit)
        self.generic_visit(node)

    visit_FunctionDef = _visit_func
    visit_AsyncFunctionDef = _visit_func
Call relationships are extracted from ast.Call nodes:
def _extract_calls(self, node) -> list[str]:
    calls: set[str] = set()
    for child in ast.walk(node):
        if isinstance(child, ast.Call):
            if isinstance(child.func, ast.Name):
                calls.add(child.func.id)        # direct call: foo()
            elif isinstance(child.func, ast.Attribute):
                calls.add(child.func.attr)      # attribute call: self.foo()
    return sorted(calls)
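The excerpts above do not show `CodeUnit` or how the visitor is driven. The following is a self-contained sketch under assumptions: `CodeUnit` is modeled as a plain dataclass whose fields mirror the keyword arguments used above, the extractor is condensed into a single class named `FuncExtractor`, and `SAMPLE` is an invented snippet. The repository's actual definitions may differ.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class CodeUnit:
    # Hypothetical container: fields mirror the keyword arguments used above.
    name: str
    kind: str
    file: str
    start_line: int
    end_line: int
    source: str
    docstring: str = ""
    parent_class: str = ""
    calls: list[str] = field(default_factory=list)


class FuncExtractor(ast.NodeVisitor):
    """Condensed variant of the extractor, kept short for the demo."""

    def __init__(self, source: str, rel_path: str):
        self._lines = source.splitlines()
        self._rel_path = rel_path
        self._class_stack: list[str] = []
        self.units: list[CodeUnit] = []

    def visit_ClassDef(self, node: ast.ClassDef):
        self._class_stack.append(node.name)
        self.generic_visit(node)
        self._class_stack.pop()

    def _visit_func(self, node):
        calls = set()
        for child in ast.walk(node):
            if isinstance(child, ast.Call):
                if isinstance(child.func, ast.Name):
                    calls.add(child.func.id)
                elif isinstance(child.func, ast.Attribute):
                    calls.add(child.func.attr)
        self.units.append(CodeUnit(
            name=node.name,
            kind="method" if self._class_stack else "function",
            file=self._rel_path,
            start_line=node.lineno,
            end_line=node.end_lineno,
            source="\n".join(self._lines[node.lineno - 1 : node.end_lineno]),
            docstring=ast.get_docstring(node) or "",
            parent_class=self._class_stack[-1] if self._class_stack else "",
            calls=sorted(calls),
        ))
        self.generic_visit(node)

    visit_FunctionDef = _visit_func
    visit_AsyncFunctionDef = _visit_func


# Invented sample: one class with two methods, plus a top-level function.
SAMPLE = '''\
class Pipeline:
    def run(self, text):
        """Run the pipeline."""
        return self.clean(text)

    def clean(self, text):
        return text.strip()


def main():
    Pipeline().run(" hi ")
'''

extractor = FuncExtractor(SAMPLE, "sample.py")
extractor.visit(ast.parse(SAMPLE))
for u in extractor.units:
    print(u.kind, u.parent_class or "-", u.name, u.calls)
```

Note that `text.strip()` is recorded as a call to `strip` even though no such function exists in the sample. Filtering those out is the job of the call graph's "known names" check in the next section.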
Extraction results on llm-in-action
Scanned: /mnt/hdd/Database/03_Projects/LLM/llm-in-action
Time: 0.13 seconds
Python files: 22
Functions: 188 (top-level)
Methods: 37 (class methods)
Total units: 225
Article dirs: 18
0.13 seconds to scan the entire codebase. AST parsing doesn't execute code, so there are zero side effects.
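A small sketch of why parsing is side-effect free: the walker below reads each file as text and hands it to `ast.parse`, so module-level statements are never run. The function name `parse_repo` and the two throwaway files are assumptions made for the demo, not the repository's scanner.

```python
import ast
import tempfile
from pathlib import Path


def parse_repo(repo_dir: str) -> dict[str, ast.Module]:
    """Parse every .py file under repo_dir; unparsable files are skipped."""
    root = Path(repo_dir)
    trees: dict[str, ast.Module] = {}
    for path in sorted(root.rglob("*.py")):
        try:
            source = path.read_text(encoding="utf-8")
            trees[path.relative_to(root).as_posix()] = ast.parse(source)
        except (SyntaxError, ValueError):
            # ValueError also covers UnicodeDecodeError from read_text
            continue
    return trees


with tempfile.TemporaryDirectory() as tmp:
    # Importing this file would raise; parsing it does not.
    Path(tmp, "boom.py").write_text(
        "raise RuntimeError('executed!')\n\ndef ok():\n    return 1\n",
        encoding="utf-8",
    )
    # A file with a syntax error is skipped instead of aborting the scan.
    Path(tmp, "broken.py").write_text("def oops(:\n", encoding="utf-8")
    parsed_files = sorted(parse_repo(tmp))

print(parsed_files)
```

Skipping files that fail to parse is a design choice for the sketch. A production scanner might prefer to log them.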
Call Graph: Understanding Who Calls Whom
Once function call relationships are extracted, build a bidirectional adjacency map — queryable in both directions:
from collections import defaultdict, deque
from typing import Optional


class CallGraph:
    def __init__(self, units: list[CodeUnit]):
        self.callees: dict[str, set[str]] = defaultdict(set)  # caller → called
        self.callers: dict[str, set[str]] = defaultdict(set)  # callee → caller
        known = {u.name for u in units}
        for u in units:
            for callee in u.calls:
                if callee in known:  # intra-repo edges only
                    self.callees[u.name].add(callee)
                    self.callers[callee].add(u.name)

    # _bfs (a depth-limited breadth-first search) is not shown in this excerpt.
    def downstream(self, name: str, depth: int = 4) -> list[str]:
        """All functions transitively called by name (BFS)."""
        return self._bfs(name, self.callees, depth)

    def upstream(self, name: str, depth: int = 4) -> list[str]:
        """All functions that transitively call name (BFS)."""
        return self._bfs(name, self.callers, depth)

    def shortest_path(self, start: str, end: str) -> Optional[list[str]]:
        """Shortest call path from start → end."""
        queue: deque[list[str]] = deque([[start]])
        visited: set[str] = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == end:
                return path
            for nxt in self.callees.get(path[-1], set()):
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append(path + [nxt])
        return None
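`downstream` and `upstream` both delegate to a `_bfs` helper that the excerpt does not include. One plausible shape for it, written here as a standalone function named `bfs_reachable` (an assumed name and signature), is a depth-limited breadth-first search over whichever adjacency map is passed in:

```python
from collections import deque


def bfs_reachable(
    start: str, adjacency: dict[str, set[str]], depth: int = 4
) -> list[str]:
    """Names reachable from start within `depth` hops, excluding start itself."""
    seen = {start}
    order: list[str] = []
    queue = deque([(start, 0)])
    while queue:
        name, dist = queue.popleft()
        if dist == depth:
            continue  # depth budget spent; do not expand this node
        # .get() avoids inserting empty entries when adjacency is a defaultdict
        for nxt in sorted(adjacency.get(name, ())):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, dist + 1))
    return order


# Toy graph with a cycle (split_documents → main) to show termination.
callees = {
    "main": {"build_index"},
    "build_index": {"load_documents", "split_documents"},
    "split_documents": {"main"},
}
print(bfs_reachable("main", callees))           # ['build_index', 'load_documents', 'split_documents']
print(bfs_reachable("main", callees, depth=1))  # ['build_index']
```

The `seen` set keeps recursive or mutually recursive functions from looping forever, and sorting the neighbors makes the output order deterministic.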
Call graph analysis results
Call graph statistics:
Functions with outgoing edges: 78 (they call others)
Functions with incoming edges: 92 (they are called)
Total edges: 168
Most-called functions (the codebase's core utilities):
get ← called from 48 places (cache reads throughout all articles)
set ← called from 10 places (cache writes)
split_documents ← called from 5 places (shared chunking helper)
build_embeddings ← called from 4 places
query ← called from 4 places
get appearing 48 times reflects Python duck typing — cache .get() calls across SemanticCache, InMemoryCache, and similar types all collapse to the same name in static analysis.
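The collapse is easy to reproduce. In the sketch below, `bare_calls` mimics the attribute-name extraction shown earlier, while `qualified_calls` is a hypothetical alternative that keeps the dotted receiver text via `ast.unparse` (Python 3.9+). The sample function is invented for the demo.

```python
import ast

SAMPLE = '''
def lookup(self, key, config):
    hit = self.cache.get(key)
    timeout = config.get("timeout", 30)
    return hit, timeout
'''


def bare_calls(source: str) -> list[str]:
    """Call names as extracted above: only the final attribute or name."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                names.add(node.func.id)
            elif isinstance(node.func, ast.Attribute):
                names.add(node.func.attr)
    return sorted(names)


def qualified_calls(source: str) -> list[str]:
    """Call targets with their receiver expression kept as written."""
    return sorted(
        ast.unparse(node.func)
        for node in set(ast.walk(ast.parse(source)))
        if isinstance(node, ast.Call)
    )


print(bare_calls(SAMPLE))       # ['get']
print(qualified_calls(SAMPLE))  # ['config.get', 'self.cache.get']
```

Keeping the receiver text separates the two call sites, but it still does not say which class `self.cache` belongs to. That would need type information that static name matching does not have.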
Functions with the most outgoing calls (orchestrators):
main → 54 direct calls
build_self_rag_graph → 6 direct calls
build_index → 5 direct calls
build_ragas_dataset → 5 direct calls
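main calling 54 functions is the signature of an entry point: it orchestrates the full pipeline by calling every sub-step. Because the graph keys nodes by bare function name, functions that share the name main in different scripts are merged into one node, so this count may aggregate several entry points.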
Call chain traversal
build_self_rag_graph (14-self-rag/self_rag.py) full downstream:
build_self_rag_graph
├── make_retrieve_node
├── make_filter_node
├── make_decide_node
├── make_support_node
├── make_rag_generate_node
└── make_direct_generate_node
This is exactly the Self-RAG StateGraph builder pattern: one factory function assembles all graph nodes, each node is an independent small function. The call graph makes this structure immediately visible.
build_index (08-ragas-eval/rag_pipeline.py) downstream chain:
build_index
→ load_documents
→ build_llm
→ build_embeddings
→ split_documents
→ get (cache)
A canonical RAG initialization sequence: load docs → build LLM → build embeddings → chunk → cache.
Vector Store: Semantic Code Search
Code vectorization has one engineering constraint: function source can be long (50–200 lines), but many embedding models cap input at around 512 tokens.
Solution: separate the retrieval unit from the Q&A context.
- Embedding content: function name + docstring (short, semantically precise, fits in token budget)
- Metadata: function source code (stored in Chroma's metadata field, capped at 2,000 characters in this demo, and read at Q&A time for LLM context)
sig_line = u.source.splitlines()[0]
embed_content = f"{full_name}: {u.docstring or sig_line}"[:400]
Document(
    page_content = embed_content,            # vectorized — used for retrieval
    metadata = {
        "name": u.name,
        "file": u.file,
        "start_line": u.start_line,
        "source_code": u.source[:2000],      # not vectorized — used for LLM context
    },
)
At Q&A time, retrieval finds relevant functions, then the stored source is read from metadata (here trimmed to the first 600 characters per function):
docs = vs.similarity_search(question, k=4)
context = "\n\n---\n\n".join(
    d.metadata.get("source_code", d.page_content)[:600] for d in docs
)
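A raw character slice such as `[:600]` can end in the middle of a line, which is the same boundary problem the article raises about character chunking. As a hedged alternative, the sketch below trims on line boundaries and labels each snippet with its origin. `truncate_on_lines` and `build_context` are hypothetical helpers that take plain metadata dicts shaped like the ones stored above, with no vector store involved.

```python
def truncate_on_lines(source: str, max_chars: int) -> str:
    """Keep whole lines from the top of `source` until the budget is spent."""
    lines = source.splitlines()
    kept: list[str] = []
    used = 0
    for line in lines:
        cost = len(line) + 1  # +1 for the newline that joins the lines
        if used + cost > max_chars:
            break
        kept.append(line)
        used += cost
    if len(kept) < len(lines):
        kept.append("# ... truncated ...")
    return "\n".join(kept)


def build_context(hits: list[dict], max_chars_per_hit: int = 600) -> str:
    """Join retrieved snippets, each prefixed with file, line, and name."""
    parts = []
    for meta in hits:
        header = f"# {meta['file']}:{meta['start_line']} {meta['name']}"
        body = truncate_on_lines(meta["source_code"], max_chars_per_hit)
        parts.append(header + "\n" + body)
    return "\n\n---\n\n".join(parts)


# Invented metadata record for the demo.
hits = [
    {
        "name": "f",
        "file": "demo.py",
        "start_line": 1,
        "source_code": "def f(x):\n    y = x + 1\n    return y * 2\n",
    }
]
print(build_context(hits, max_chars_per_hit=25))
```

The header line gives the LLM the file and line number to cite, and the explicit truncation marker signals that the function body continues beyond what is shown.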
Semantic search results
Query: "RAGAS evaluation metrics calculation"
0.488 RAGPipeline.build_index (08-ragas-eval/rag_pipeline.py:95)
0.476 create_ragas_embeddings (08-ragas-eval/evaluate.py:50)
0.467 RAGPipeline.query (08-ragas-eval/rag_pipeline.py:141)
Query: "rate limiting and access control in enterprise RAG"
0.504 RAGPipeline.__init__ (08-ragas-eval/rag_pipeline.py:78)
0.497 RateLimiter.__init__ (20-enterprise-rag/enterprise_rag.py:118)
Query: "incremental document indexing with record manager"
0.296 generate_testset (08-ragas-eval/generate_qa.py:51)
Query: "conversational history aware retriever"
0.400 make_ds (18-conversational-rag/conversational_rag.py:428)
The RAGAS and enterprise rate-limiting queries surfaced the right files in their top results (for rate limiting, RateLimiter.__init__ ranks second). The incremental-indexing query did not: the functions in 19-incremental-update/ mention "record manager" only in their source code bodies, not in their docstrings. This is the core limitation of docstring-only embedding: search quality is bounded by docstring quality.
Choosing a Code Embedding Model
General-purpose text embedding models (BGE, text-embedding-3) are "adequate but not great" for code. They can retrieve by docstring, but don't understand that for i in range(n): acc += arr[i] is an accumulation.
Specialized code embedding models:
| Model | Characteristics |
|---|---|
| microsoft/codebert-base | Encoder pre-trained on paired code and documentation; picks up variable names and signatures |
| Salesforce/codet5-base | Generative model; suited for code completion + retrieval |
| nomic-ai/nomic-embed-text-v1.5 | General model with strong code performance; 8192-token limit |
| voyage-code-2 | Voyage AI's code-specialized model; among the best available |
Recommended: if token limits aren't a concern (e.g., nomic-embed-text-v1.5 supports 8192 tokens), embed the complete function source directly — no need to split docstrings from source.
The Complete Code RAG Pipeline
# Build a code RAG system
# 1. AST extraction: all functions and methods
units = extract_repo(repo_dir)
# 2. Call graph: bidirectional adjacency
cg = CallGraph(units)
# 3. Vector store: docstrings for retrieval, source_code in metadata for Q&A
vs = build_vectorstore(units, embeddings)
# Three query modes
# A: Semantic search — find functions by meaning
hits = vs.similarity_search("embedding caching", k=5)
# B: Call chain — given a function name, find all upstream/downstream
callers = cg.upstream("build_embeddings") # → who calls it
callees = cg.downstream("main") # → what it calls
path = cg.shortest_path("main", "get") # → how main reaches get
# C: LLM Q&A — retrieve context, generate answer
answer = llm_code_qa("How is incremental update implemented?", vs, llm)
Results Summary
| Metric | Value |
|---|---|
| Python files | 22 |
| Code units extracted | 225 (188 functions + 37 methods) |
| AST parse time | 0.13 seconds |
| Call graph edges | 168 |
| Vectorization time | 5.8 seconds |
| Most-called function | get (48 places) |
| Widest caller | main (54 direct calls) |
Full Code
Complete code is open-sourced at:
https://github.com/chendongqi/llm-in-action/tree/main/24-code-rag
Key file:
- code_rag.py: AST extraction, call graph, vectorization, search, report
How to run:
git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/24-code-rag
cp .env.example .env
pip install -r requirements.txt
python code_rag.py
Summary
The core difference between code RAG and document RAG:
| | Document RAG | Code RAG |
|---|---|---|
| Chunk unit | Fixed-size text blocks | Functions/methods (AST boundaries) |
| Structure | None | Class hierarchy, module hierarchy |
| Call relationships | None | Call graph (bidirectional) |
| Embedding content | Full text | Docstring + signature (or full source) |
| Query types | Semantic search | Semantic search + call chain traversal |
Three key tradeoffs:
- AST vs text chunking: AST cuts at function boundaries and preserves complete units. Text chunking is faster but destroys structure. For production code RAG, use AST — there's no reason not to.
- Docstring vs full source embedding: Under token constraints, embed docstrings (short and semantically focused) — but quality depends on docstring completeness. With a long-context embedding model, embed the full source directly.
- Call graph vs pure vector retrieval: Vector retrieval finds semantically similar functions; the call graph answers "what does X call?" and "who uses X?" — they're complementary, not interchangeable.
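One way to read "complementary" in the last tradeoff: use vector hits as seeds, then pull in their direct callers and callees from the graph. The sketch below works on plain adjacency dicts shaped like `CallGraph.callers` and `CallGraph.callees`. The helper name `expand_with_neighbors` and the `limit` parameter are assumptions for illustration, not part of the article's code.

```python
def expand_with_neighbors(
    hit_names: list[str],
    callers: dict[str, set[str]],
    callees: dict[str, set[str]],
    limit: int = 10,
) -> list[str]:
    """Vector hits first, then their direct callers and callees, deduplicated."""
    ordered: list[str] = []

    def add(name: str) -> None:
        if name not in ordered and len(ordered) < limit:
            ordered.append(name)

    for name in hit_names:  # keep the retrieval ranking for the seeds
        add(name)
    for name in hit_names:  # then one hop out in both directions
        for neighbor in sorted(callers.get(name, ())) + sorted(callees.get(name, ())):
            add(neighbor)
    return ordered


# Toy adjacency maps using function names that appear in the article.
callees = {"build_index": {"load_documents", "build_embeddings", "split_documents"}}
callers = {
    "load_documents": {"build_index"},
    "build_embeddings": {"build_index"},
    "split_documents": {"build_index"},
}
print(expand_with_neighbors(["build_embeddings"], callers, callees))
# ['build_embeddings', 'build_index']
```

A semantic hit on `build_embeddings` alone would not tell the LLM where it is used. Adding its caller `build_index` supplies that context without a second vector query.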
This is the final article in the RAG series. Twenty-four articles covering the complete path from "what is RAG?" to "how do you teach AI to understand a codebase?" All code is open-sourced at llm-in-action — every article has a runnable demo and a real benchmark report.