Midas126
Building Your Own "Google Maps for Codebases": A Guide to Codebase Q&A with LLMs

From Overwhelm to Insight: Navigating the Modern Code Jungle

You clone a promising repository, ready to contribute or understand its magic. You’re greeted by dozens of directories, hundreds of files, and a README that ends at "Getting Started." Sound familiar? Navigating a complex, unfamiliar codebase remains one of software development's most universal and time-consuming challenges. The recent popularity of tools that act as a "Google Maps for codebases"—where you paste a GitHub URL and ask questions—highlights our collective desire to cut through this complexity.

But what if you could build the core of this capability yourself? In this guide, we'll move from being a user of these AI-powered navigators to an architect. We'll build a foundational system that ingests a codebase, understands its structure and content, and answers your questions in plain English. We'll use open-source tools, focusing on the practical pipeline rather than black-box APIs.

The Architectural Blueprint: RAG for Code

The magic behind these tools isn't a single monolithic model that somehow understands everything. It's a clever application of the Retrieval-Augmented Generation (RAG) pattern, which is well suited to code.

  1. Indexing: The codebase is broken down, analyzed, and stored in a queryable format.
  2. Retrieval: When you ask a question, the system finds the most relevant code snippets and documentation.
  3. Augmentation & Generation: These snippets are fed to a Large Language Model (LLM) with your question, instructing it to synthesize an answer.

We'll implement this pipeline using tree-sitter for robust parsing, Chroma for vector storage/retrieval, and Ollama with the codellama model for local, offline generation.
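Before wiring up the real tools, it helps to see the whole loop in miniature. The sketch below fakes every component — word overlap stands in for embeddings and a template string stands in for the LLM — purely to show how the three steps connect. None of these functions are part of the real pipeline we build next.

```python
# Toy RAG loop. Every function here is a deliberate stand-in for the
# real components (tree-sitter, Chroma, Ollama) built later in this guide.

def embed(text: str) -> set[str]:
    """Stand-in 'embedding': the set of lowercased words."""
    return set(text.lower().split())

def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Steps 1-2: rank indexed snippets by word overlap with the question."""
    q = embed(question)
    ranked = sorted(corpus, key=lambda s: len(q & embed(s)), reverse=True)
    return ranked[:k]

def generate(question: str, context: list[str]) -> str:
    """Step 3: a real system would send this augmented prompt to an LLM."""
    return f"Q: {question}\nContext:\n" + "\n".join(f"- {c}" for c in context)

corpus = ["def main(): start the app", "class Config: load settings"]
print(generate("where does the app start?", retrieve("where does the app start?", corpus)))
```

The real system replaces each stand-in: parsing and embedding replace `embed`, a vector database replaces the `sorted` call, and an LLM replaces the template in `generate`.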

Prerequisites

Ensure you have Python 3.10+ and pip ready. We'll install everything in a virtual environment.

# Create and activate a virtual environment
python -m venv code_rag_env
source code_rag_env/bin/activate  # On Windows: `code_rag_env\Scripts\activate`

# Install core dependencies
pip install tree-sitter chromadb pydantic ollama

Phase 1: Parsing and Indexing the Codebase

We can't just dump raw code files into the LLM. We need structure. tree-sitter allows us to parse source code into syntax trees, letting us extract functions, classes, imports, and comments with high accuracy.

First, let's create a robust document model and parser.

# document_model.py
from pydantic import BaseModel
from pathlib import Path
from typing import Optional

class CodeDocument(BaseModel):
    """Represents a chunk of code with its metadata."""
    id: str
    text: str
    filepath: str
    language: Optional[str] = None
    # Metadata for richer context
    symbol_name: Optional[str] = None  # e.g., function or class name
    symbol_type: Optional[str] = None  # e.g., 'function', 'class', 'module'
    line_start: Optional[int] = None

# code_parser.py
from pathlib import Path

from tree_sitter import Parser

from document_model import CodeDocument

# Loading a Tree-sitter grammar requires a compiled language library
# (older releases expect a built .so; newer ones ship per-language
# packages such as tree-sitter-python). We'll focus on Python here and
# fall back to plain-text chunking when no grammar is loaded.

class CodebaseParser:
    def __init__(self):
        self.parser = Parser()
        # Point to your built tree-sitter language library for Python
        # Example: Language('build/my-languages.so', 'python')
        # For simplicity, we'll use a placeholder and fallback text extraction.
        self.language = None

    def parse_file(self, filepath: Path) -> list[CodeDocument]:
        """Parse a single file into a list of logical CodeDocuments."""
        documents = []
        try:
            with open(filepath, 'r', encoding='utf-8') as f:
                content = f.read()
        except (OSError, UnicodeDecodeError):
            return documents

        # Fallback: If parsing fails, chunk by lines for demonstration.
        # A real implementation would use tree-sitter queries.
        lines = content.split('\n')
        chunk_size = 20
        for i in range(0, len(lines), chunk_size):
            chunk_lines = lines[i:i+chunk_size]
            doc = CodeDocument(
                id=f"{filepath}:{i}",
                text='\n'.join(chunk_lines),
                filepath=str(filepath),
                language=filepath.suffix.lstrip('.'),
                line_start=i+1
            )
            documents.append(doc)
        return documents

    def walk_directory(self, root_path: str) -> list[CodeDocument]:
        """Recursively parse all files in a directory."""
        all_docs = []
        for ext in ['.py', '.js', '.ts', '.java', '.go', '.rs', '.md', '.txt']: # Add as needed
            for filepath in Path(root_path).rglob(f"*{ext}"):
                if any(part.startswith('.') for part in filepath.parts):
                    continue  # Skip hidden directories
                all_docs.extend(self.parse_file(filepath))
        return all_docs
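The line-based fallback above cuts blindly through functions. Tree-sitter solves this for any language; just to show the idea for Python specifically, here's a sketch using the stdlib ast module to chunk at function/class boundaries. This is an illustration of symbol-level chunking, not a replacement for tree-sitter:

```python
import ast

def chunk_python_source(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function or class."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "symbol_name": node.name,
                "symbol_type": type(node).__name__,
                "line_start": node.lineno,
                # get_source_segment recovers the exact text of the node
                "text": ast.get_source_segment(source, node),
            })
    return chunks

sample = "def add(a, b):\n    return a + b\n\nclass Greeter:\n    pass\n"
for c in chunk_python_source(sample):
    print(c["symbol_name"], c["symbol_type"], c["line_start"])
```

Each chunk now carries a `symbol_name` and `symbol_type`, which is exactly the metadata our `CodeDocument` model reserves fields for.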

Phase 2: Storing and Retrieving with Vector Search

Parsed documents are converted into numerical vectors (embeddings) and stored. When we query, we convert the question into a vector and find the most similar code snippets.

We'll use Chroma, a lightweight, open-source vector database.
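"Most similar" here means nearest in embedding space, usually measured by cosine similarity — the metric we configure Chroma to use below. A stdlib-only sketch of the underlying math:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Pretend these tiny vectors are embeddings of a question and two chunks.
question = [0.9, 0.1, 0.3]
chunk_related = [0.8, 0.2, 0.4]    # points roughly the same way
chunk_unrelated = [0.1, 0.9, 0.2]  # points elsewhere

print(round(cosine_similarity(question, chunk_related), 3))
print(round(cosine_similarity(question, chunk_unrelated), 3))
```

Real embeddings have hundreds of dimensions, but the ranking logic is identical: the chunk whose vector points most nearly in the question's direction wins.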

# vector_store.py
import hashlib

import chromadb
from chromadb.config import Settings

from document_model import CodeDocument

class CodeVectorStore:
    def __init__(self, persist_directory: str = "./code_rag_chroma_db"):
        self.client = chromadb.PersistentClient(
            path=persist_directory,
            settings=Settings(anonymized_telemetry=False)
        )
        self.collection = self.client.get_or_create_collection(
            name="codebase_documents",
            metadata={"hnsw:space": "cosine"} # Cosine similarity for text
        )

    def generate_id(self, doc_text: str, filepath: str) -> str:
        """Create a deterministic ID for a document."""
        unique_string = f"{filepath}:{doc_text[:50]}"
        return hashlib.md5(unique_string.encode()).hexdigest()

    def add_documents(self, documents: list[CodeDocument]):
        """Add a list of CodeDocuments to the vector store."""
        if not documents:
            return  # Chroma rejects an empty add() call
        ids = []
        texts = []
        metadatas = []

        for doc in documents:
            ids.append(self.generate_id(doc.text, doc.filepath))
            texts.append(doc.text)
            metadatas.append({
                "filepath": doc.filepath,
                "language": doc.language or "",
                "symbol_name": doc.symbol_name or "",
                "symbol_type": doc.symbol_type or "",
                "line_start": str(doc.line_start) if doc.line_start else ""
            })

        # In a production system, you would generate embeddings here.
        # For simplicity, Chroma will use its default embedding function.
        self.collection.add(
            ids=ids,
            documents=texts,
            metadatas=metadatas
        )
        print(f"Added {len(documents)} documents to the vector store.")

    def query(self, question: str, n_results: int = 5) -> list[dict]:
        """Query the vector store for relevant code snippets."""
        results = self.collection.query(
            query_texts=[question],
            n_results=n_results
        )
        # Format results
        retrieved_docs = []
        if results['documents']:
            for i in range(len(results['documents'][0])):
                retrieved_docs.append({
                    'text': results['documents'][0][i],
                    'filepath': results['metadatas'][0][i]['filepath'],
                    'line_start': results['metadatas'][0][i].get('line_start', 'N/A')
                })
        return retrieved_docs

Phase 3: Generating Answers with a Local LLM

We'll use Ollama to run the codellama model locally: a Llama variant fine-tuned for code, so generation works entirely offline.

# First, pull the model (ensure Ollama is installed and running)
ollama pull codellama:7b-instruct
# query_engine.py
import ollama
from vector_store import CodeVectorStore

class CodebaseQAEngine:
    def __init__(self, vector_store: CodeVectorStore):
        self.vector_store = vector_store

    def ask(self, question: str) -> str:
        # 1. RETRIEVE relevant context
        context_docs = self.vector_store.query(question, n_results=5)

        if not context_docs:
            return "I couldn't find any relevant code to answer your question."

        # 2. FORMAT context for the LLM prompt
        context_text = "\n\n---\n\n".join(
            f"File: {doc['filepath']} (around line {doc['line_start']})\n"
            f"```\n{doc['text']}\n```"
            for doc in context_docs
        )

        # 3. GENERATE answer using the augmented prompt
        prompt = f"""You are an expert software engineer analyzing a codebase.
Use the following retrieved code snippets to answer the user's question.
If the answer cannot be found in the context, say so.

Context from codebase:
{context_text}

Question: {question}

Provide a concise, accurate answer based only on the context above. Mention relevant file names.
Answer:"""

        response = ollama.chat(model='codellama:7b-instruct', messages=[
            {'role': 'user', 'content': prompt}
        ])
        return response['message']['content']
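One practical gotcha: with a higher `n_results`, the retrieved snippets can overflow the model's context window. Since retrieval returns snippets in relevance order, a simple guard is to drop the tail once a budget is exceeded. This is a rough sketch — the character budget is an arbitrary stand-in for proper token counting:

```python
def trim_context(docs: list[dict], budget_chars: int = 6000) -> list[dict]:
    """Keep docs in retrieval (relevance) order until the budget is spent."""
    kept, used = [], 0
    for doc in docs:
        cost = len(doc["text"])
        if used + cost > budget_chars:
            break  # everything after this point is less relevant anyway
        kept.append(doc)
        used += cost
    return kept

docs = [{"text": "a" * 3000}, {"text": "b" * 2500}, {"text": "c" * 2000}]
print(len(trim_context(docs)))  # budget of 6000 fits only the first two
```

Calling `trim_context` on the result of `vector_store.query(...)` before building the prompt keeps the augmented prompt within bounds at the cost of some context.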

Bringing It All Together: The Main Pipeline

Let's create a simple script to index a codebase and start a Q&A loop.

# main.py
from code_parser import CodebaseParser
from vector_store import CodeVectorStore
from query_engine import CodebaseQAEngine
import sys

def main():
    if len(sys.argv) < 2:
        print("Usage: python main.py <path_to_codebase>")
        sys.exit(1)

    codebase_path = sys.argv[1]

    print("🧠 Parsing codebase...")
    parser = CodebaseParser()
    documents = parser.walk_directory(codebase_path)
    print(f"   Parsed {len(documents)} code chunks.")

    print("📦 Indexing into vector database...")
    vector_store = CodeVectorStore()
    vector_store.add_documents(documents)

    print("✅ Setup complete! Starting Q&A session.")
    print("Type 'exit' to quit.\n")

    qa_engine = CodebaseQAEngine(vector_store)

    while True:
        try:
            question = input("\n🔍 Your question about the codebase: ").strip()
            if question.lower() in ['exit', 'quit']:
                break
            if not question:
                continue

            print("\n🤖 Thinking...")
            answer = qa_engine.ask(question)
            print(f"\n{answer}")

        except KeyboardInterrupt:
            break
        except Exception as e:
            print(f"\n⚠️  An error occurred: {e}")

if __name__ == "__main__":
    main()

Run it against a local project:

python main.py /path/to/your/github/clone

Leveling Up: Next Steps for a Production System

Our prototype works, but a robust system needs more:

  1. Better Chunking: Use tree-sitter queries to chunk at the function/method/class level, preserving logical boundaries.
  2. Specialized Embeddings: Use an embedding model trained on code (e.g., a code-oriented SentenceTransformers model) instead of a generic text embedder.
  3. Cross-Reference Graph: Build a graph of imports, function calls, and class hierarchies. This allows answering questions like "What functions call this method?"
  4. Git Awareness: Index commit messages and git blame data to answer "why" questions about code changes.
  5. Web Interface: Wrap the engine in a simple FastAPI server and a React frontend to mimic the "paste a URL" experience.
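Item 3 above is more approachable than it sounds. For Python code, the stdlib ast module can already extract caller relationships; here is a minimal sketch that handles direct function calls only (no methods, attributes, or cross-file imports):

```python
import ast

def extract_calls(source: str) -> dict[str, list[str]]:
    """Map each top-level function to the plain names it calls directly."""
    graph = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            graph[node.name] = [
                n.func.id
                for n in ast.walk(node)  # visit every node in the body
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
            ]
    return graph

sample = """
def load():
    return parse(read())

def parse(data):
    return data
"""
print(extract_calls(sample))
```

Inverting this mapping answers "what functions call `parse`?"; stored as extra metadata alongside each chunk, it lets the retriever pull in callers and callees, not just textually similar snippets.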

The Takeaway: Demystifying the AI Navigator

Building the core of a codebase Q&A tool is an accessible and enlightening project. It demystifies the "AI magic" into a tangible pipeline of parsing, retrieval, and augmentation. By understanding and implementing this RAG pattern for code, you gain a powerful framework that can be extended to documentation, internal wikis, or any structured knowledge base.

Your challenge this week: Clone a moderately complex open-source repository and run our prototype against it. Start by asking, "What is the entry point of this application?" and "How does the main data flow work?" Then, try extending the parser to properly extract functions using tree-sitter. Share what you discover.

The future of developer tools isn't just about using AI—it's about building and shaping it for our specific needs. Start building.
