## The AI Code Assistant Dream
You've seen the demos: paste a GitHub URL, ask a question in plain English, and get an intelligent answer about the codebase. Tools like GitHub Copilot Chat and the viral "Google Maps for Codebases" concept promise to revolutionize how we understand unfamiliar code. But how do these systems actually work under the hood? More importantly, how could you build a simplified version yourself?
This isn't just about calling an API. It's about understanding the architecture that makes code-aware AI possible. Today, we'll deconstruct the problem and build a practical, local-first code query engine using open-source tools. By the end, you'll have a working prototype and the architectural knowledge to adapt these patterns to your own projects.
## Deconstructing the Problem: It's Not Just One AI Call
At first glance, "ask a question about a codebase" seems like a job for a large language model (LLM). But raw codebases are too large for most LLM context windows, and LLMs lack inherent knowledge of your specific project. The magic happens in the retrieval-augmented generation (RAG) pattern, adapted for code.
The system needs to:
- Parse & Index: Break down the codebase into searchable chunks.
- Retrieve: Find the most relevant code snippets for a given question.
- Reason & Generate: Use an LLM to synthesize an answer from those snippets.
## Building Our Engine: A Three-Tier Architecture
Let's build a system called CodeExplainer. We'll use Python and focus on local, open-source components where possible.
### Tier 1: The Code Indexer
We need to parse code into meaningful chunks. A simple file splitter isn't enough—we should chunk by logical structures (functions, classes).
```python
import ast
import hashlib
from pathlib import Path
from typing import Dict, List

class CodeIndexer:
    def __init__(self, repo_path: str):
        self.repo_path = Path(repo_path)
        self.chunks = []

    def parse_python_file(self, file_path: Path) -> List[Dict]:
        """Parse a Python file into function/class chunks."""
        source = file_path.read_text(encoding="utf-8", errors="ignore")
        try:
            tree = ast.parse(source)
        except SyntaxError:
            return []  # Skip files that don't parse (e.g. Python 2 leftovers)

        lines = source.splitlines(keepends=True)
        chunks = []
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                start_line = node.lineno
                end_line = node.end_lineno  # Requires Python 3.8+
                code_snippet = ''.join(lines[start_line - 1:end_line])
                chunk_id = hashlib.md5(
                    f"{file_path}:{node.name}:{start_line}".encode()
                ).hexdigest()[:8]
                chunks.append({
                    'id': chunk_id,
                    'file_path': str(file_path.relative_to(self.repo_path)),
                    'name': node.name,
                    'type': type(node).__name__,
                    'code': code_snippet,
                    'start_line': start_line,
                    'metadata': {
                        'repo_path': str(self.repo_path),
                        'language': 'python',
                    },
                })
        return chunks

    def index_repository(self):
        """Walk the repo and index all Python files."""
        for py_file in self.repo_path.rglob('*.py'):
            if '.git' in py_file.parts:
                continue
            self.chunks.extend(self.parse_python_file(py_file))
        print(f"Indexed {len(self.chunks)} code chunks from {self.repo_path}")
        return self.chunks
```
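A quick sanity check shows the shape of what the indexer produces. The path and the printed values here are placeholders; point it at any Python project:

```python
indexer = CodeIndexer("./my_project")  # placeholder path
chunks = indexer.index_repository()

# Each chunk is a plain dict, roughly:
# {'id': '3f2a9c1b', 'file_path': 'app/db.py', 'name': 'get_connection',
#  'type': 'FunctionDef', 'code': 'def get_connection():\n    ...',
#  'start_line': 12, 'metadata': {'repo_path': '...', 'language': 'python'}}
print(chunks[0]['file_path'], chunks[0]['name'])
```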
### Tier 2: The Semantic Retriever
Now we need to find relevant chunks for a question. We'll use sentence embeddings and vector search.
```python
from typing import Dict, List

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class CodeRetriever:
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        self.embedder = SentenceTransformer(model_name)
        self.chunks = []
        self.embeddings = None

    def index_chunks(self, chunks: List[Dict]):
        """Create embeddings for all code chunks."""
        self.chunks = chunks
        # Combine metadata and code so the embedding captures both
        texts = [
            f"{chunk['type']} {chunk['name']} in {chunk['file_path']}:\n{chunk['code']}"
            for chunk in chunks
        ]
        self.embeddings = self.embedder.encode(texts, show_progress_bar=True)

    def retrieve(self, query: str, top_k: int = 5) -> List[Dict]:
        """Find the most relevant code chunks for a query."""
        query_embedding = self.embedder.encode([query])
        similarities = cosine_similarity(query_embedding, self.embeddings)[0]
        # Indices of the top_k highest similarities, best first
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        results = []
        for idx in top_indices:
            chunk = self.chunks[idx].copy()
            chunk['similarity'] = float(similarities[idx])
            results.append(chunk)
        return results
```
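Before wiring in the LLM, it's worth querying the retriever directly and eyeballing the hits. The query below is just an illustration, reusing the `chunks` from the indexer step:

```python
retriever = CodeRetriever()
retriever.index_chunks(chunks)

# Inspect what the retriever surfaces before any LLM is involved
for hit in retriever.retrieve("where are database connections opened?", top_k=3):
    print(f"{hit['similarity']:.3f}  {hit['file_path']}::{hit['name']}")
```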
### Tier 3: The Reasoning Engine
Finally, we use an LLM to generate answers from retrieved chunks. We'll use Ollama to run local models.
````python
import subprocess

class CodeExplainer:
    def __init__(self, retriever: CodeRetriever, model: str = "llama3.2"):
        self.retriever = retriever
        self.model = model

    def generate_answer(self, question: str) -> str:
        # Retrieve relevant code
        relevant_chunks = self.retriever.retrieve(question, top_k=3)

        # Build context for the LLM
        context_parts = []
        for i, chunk in enumerate(relevant_chunks):
            context_parts.append(f"[Chunk {i + 1} from {chunk['file_path']}]")
            context_parts.append(f"```python\n{chunk['code']}\n```")
            context_parts.append("")
        context = "\n".join(context_parts)

        # Create prompt
        prompt = f"""You are a helpful code assistant. Answer the question based only on the provided code context.

Code Context:
{context}

Question: {question}

Answer the question clearly and concisely. If the context doesn't contain relevant information, say so.
If referring to code, mention the file and function/class name.
"""
        # Call the local LLM via Ollama
        return self._query_ollama(prompt)

    def _query_ollama(self, prompt: str) -> str:
        """Query the local Ollama CLI."""
        cmd = ["ollama", "run", self.model, prompt]
        try:
            result = subprocess.run(
                cmd,
                capture_output=True,
                text=True,
                timeout=120,  # Local models can take a while on CPU
            )
            return result.stdout.strip()
        except Exception as e:
            return f"Error querying LLM: {str(e)}"
````
## Putting It All Together
Here's the complete workflow:
```python
def main():
    # 1. Index the repository
    indexer = CodeIndexer("/path/to/your/repo")
    chunks = indexer.index_repository()

    # 2. Set up the retriever
    retriever = CodeRetriever()
    retriever.index_chunks(chunks)

    # 3. Create the explainer
    explainer = CodeExplainer(retriever)

    # 4. Ask questions!
    questions = [
        "How does this project handle database connections?",
        "Show me the main entry point function",
        "What authentication system is used?",
    ]
    for question in questions:
        print(f"\n{'=' * 60}")
        print(f"Question: {question}")
        print(f"{'=' * 60}")
        answer = explainer.generate_answer(question)
        print(f"Answer:\n{answer}")

if __name__ == "__main__":
    main()
```
## Production Considerations & Enhancements
Our prototype works, but production systems need more:
- Multi-language Support: Use Tree-sitter for robust parsing across languages (see the JavaScript sketch after this list)
- Hierarchical Chunking: Index at multiple levels (file, class, function)
- Hybrid Search: Combine semantic search with keyword matching for better recall
- Caching: Store embeddings and common queries
- Cross-reference Analysis: Build call graphs to understand relationships
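To give a taste of the first item, here is a minimal sketch of function-level chunking for JavaScript. It assumes the `tree-sitter-languages` package, which bundles prebuilt grammars; `function_declaration` and `class_declaration` are node types from the Tree-sitter JavaScript grammar:

```python
# pip install tree-sitter-languages
from tree_sitter_languages import get_parser

def chunk_javascript(source: str) -> list:
    """Split a JS file into top-level function/class chunks (sketch)."""
    parser = get_parser("javascript")
    tree = parser.parse(source.encode("utf-8"))
    chunks = []
    for node in tree.root_node.children:  # Top level only; recurse for nested defs
        if node.type in ("function_declaration", "class_declaration"):
            chunks.append({
                "code": source[node.start_byte:node.end_byte],
                "type": node.type,
                "start_line": node.start_point[0] + 1,  # Points are 0-based
            })
    return chunks
```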
```python
# Example enhancement: Hybrid search
class HybridRetriever(CodeRetriever):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.bm25_index = None  # Traditional keyword index, built during indexing

    def hybrid_retrieve(self, query: str, top_k: int = 5, alpha: float = 0.7):
        # Over-fetch from both retrievers, then merge the two score lists
        semantic_results = super().retrieve(query, top_k * 2)
        keyword_results = self.keyword_retrieve(query, top_k * 2)
        # Combine scores (simplified); alpha weights semantic vs. keyword.
        # keyword_retrieve and _merge_results are filled in below.
        combined = self._merge_results(semantic_results, keyword_results, alpha)
        return combined[:top_k]
```
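The `keyword_retrieve` and `_merge_results` calls above are placeholders. One way to fill them in is with the `rank_bm25` package (my choice here; any lexical index would do). One caveat: BM25 scores are unbounded, unlike cosine similarity, so a production version should normalize both score ranges before mixing them:

```python
import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25

class BM25HybridRetriever(HybridRetriever):
    def index_chunks(self, chunks):
        super().index_chunks(chunks)
        # Naive whitespace tokenization; a code-aware tokenizer would do better
        self._tokens = [c['code'].lower().split() for c in chunks]
        self.bm25_index = BM25Okapi(self._tokens)

    def keyword_retrieve(self, query: str, top_k: int):
        scores = self.bm25_index.get_scores(query.lower().split())
        results = []
        for idx in np.argsort(scores)[-top_k:][::-1]:
            chunk = self.chunks[idx].copy()
            chunk['similarity'] = float(scores[idx])
            results.append(chunk)
        return results

    def _merge_results(self, semantic, keyword, alpha):
        # Late fusion: weighted sum of scores, keyed by chunk id
        merged = {}
        for c in semantic:
            merged[c['id']] = {**c, 'similarity': alpha * c['similarity']}
        for c in keyword:
            bonus = (1 - alpha) * c['similarity']
            if c['id'] in merged:
                merged[c['id']]['similarity'] += bonus
            else:
                merged[c['id']] = {**c, 'similarity': bonus}
        return sorted(merged.values(), key=lambda c: c['similarity'], reverse=True)
```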
## The Takeaway: AI as Your Code Compass
Building a code query engine demystifies the "AI magic" and reveals a tractable engineering problem. The real value isn't in having an oracle that knows everything—it's in creating a system that can quickly surface the right context for an LLM to reason about.
This architecture pattern extends beyond code. Any domain with structured text—documentation, logs, research papers—can benefit from the RAG approach.
Your Challenge: Fork the example implementation and extend it. Add support for JavaScript, implement call graph analysis, or create a web interface. The tools are now in your hands.
What will you build when you can ask your codebase anything?