From Lost to Found: Navigating the Modern Code Jungle
You clone a repository. It’s 50,000 lines of code you didn't write. You need to add a feature, but first, you have to answer a simple question: "Where does the user authentication logic live, and how does it integrate with the payment module?" Your options are grep, frantic file hopping, or bothering a busy senior dev. What if you could just ask?
This is the promise behind tools like the one described in "Google Maps for Codebases"—paste a repo URL, ask in plain English, and get an answer. It’s a killer application of modern AI, moving beyond simple chat to grounded, context-aware assistance. In this guide, we’ll deconstruct this concept and build a functional, local version from scratch using open-source tools. You'll learn the core techniques of retrieval-augmented generation (RAG) as applied to code, turning a sprawling codebase into a queryable knowledge base.
The Core Architecture: It's All About RAG
The magic isn't just a large language model (LLM) with a good memory. A raw LLM trained on general code would hallucinate details about your specific project. The solution is Retrieval-Augmented Generation (RAG).
- Indexing: Your codebase is broken into meaningful chunks, converted into numerical vectors (embeddings), and stored for fast search.
- Retrieval: When you ask a question, the system finds the code chunks most semantically similar to your query.
- Augmentation & Generation: These relevant chunks are fed as context to an LLM, which synthesizes an answer grounded in the actual code.
This ensures answers are factual and specific to the project at hand.
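To make "semantically similar" concrete, here's a toy sketch of the retrieval step using cosine similarity. The 3-dimensional vectors and chunk names are illustrative only; real embedding models like all-MiniLM-L6-v2 produce 384-dimensional vectors, and production systems use approximate nearest-neighbor indexes rather than a linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for two code chunks and a query about authentication
query = [0.9, 0.1, 0.0]
chunks = {
    "auth_module": [0.8, 0.2, 0.1],   # points in nearly the same direction
    "payment_flow": [0.1, 0.1, 0.9],  # points somewhere else entirely
}

# Rank chunks by similarity to the query, highest first
ranked = sorted(chunks, key=lambda name: cosine_similarity(query, chunks[name]),
                reverse=True)
print(ranked[0])  # → auth_module
```

The vector database performs exactly this ranking, just at scale and with smarter index structures.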
Building the Engine: A Step-by-Step Implementation
We'll use Python, the chromadb vector database, sentence-transformers for embeddings, and the llama-cpp-python library to run a local, open-source LLM.
Step 1: Setting Up the Environment
```bash
pip install chromadb sentence-transformers llama-cpp-python
```
We'll use the all-MiniLM-L6-v2 model for embeddings (lightweight and effective) and the Mistral-7B-Instruct model for the LLM, quantized to run efficiently on a developer's machine.
Step 2: Chunking the Code Intelligently
Unlike text, code has structure. A naive split by lines or characters would break functions and classes. We need a code-aware chunker.
```python
import ast
from pathlib import Path

def chunk_code_file(file_path: Path, max_chunk_size: int = 1000):
    """Parse a Python file and chunk by function/class definitions."""
    chunks = []
    source = file_path.read_text(errors="ignore")
    try:
        tree = ast.parse(source, filename=str(file_path))
        lines = source.splitlines(keepends=True)
        for node in ast.walk(tree):
            # Capture functions and classes as natural chunks
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                start_lineno = node.lineno - 1  # ast lines are 1-indexed
                end_lineno = getattr(node, 'end_lineno', node.lineno + 10)
                chunk_text = ''.join(lines[start_lineno:end_lineno])
                # Prepend file context so the LLM knows where the chunk lives
                chunks.append(f"File: {file_path}\n\n{chunk_text}")
    except SyntaxError:
        # Fallback for files that fail to parse: fixed-size chunking.
        # Simple split for demonstration; consider a more robust fallback for production
        for i in range(0, len(source), max_chunk_size):
            chunks.append(f"File: {file_path}\n\n{source[i:i + max_chunk_size]}")
    return chunks
```
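To see what the AST walk actually captures, here is a self-contained snippet (with a toy two-definition module) printing the line spans the chunker would cut. Note that `ast.walk` also visits methods nested inside classes, so a method becomes a chunk of its own in addition to appearing inside its class's chunk:

```python
import ast

source = '''\
def login(user, password):
    return user == "admin"

class PaymentGateway:
    def charge(self, amount):
        pass
'''

tree = ast.parse(source)
# Collect (name, start line, end line) for every def/class the walk visits
defs = [(node.name, node.lineno, node.end_lineno)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))]
print(defs)
# → [('login', 1, 2), ('PaymentGateway', 4, 6), ('charge', 5, 6)]
```

Whether you want that duplication (method inside class chunk *and* as its own chunk) is a design choice; it often helps retrieval, since a question may match either granularity.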
Step 3: Generating Embeddings and Storing Them in a Vector Database
We create embeddings for each chunk and store them in ChromaDB.
```python
from sentence_transformers import SentenceTransformer
import chromadb

# Initialize our embedding model and a persistent vector DB client
embedder = SentenceTransformer('all-MiniLM-L6-v2')
chroma_client = chromadb.PersistentClient(path="./code_db")
collection = chroma_client.get_or_create_collection(name="codebase")
```
```python
def index_repository(repo_path: str):
    repo_path = Path(repo_path)
    all_chunks = []
    ids = []
    # Recursively find all .py files (extend for other languages)
    for file_path in repo_path.rglob("*.py"):
        chunks = chunk_code_file(file_path)
        all_chunks.extend(chunks)
        # Generate IDs like `path/to/file.py::chunk_0`
        ids.extend(f"{file_path}::chunk_{i}" for i in range(len(chunks)))
    # Generate embeddings in batches for efficiency
    embeddings = embedder.encode(all_chunks, show_progress_bar=True).tolist()
    # Add to ChromaDB collection
    collection.add(
        embeddings=embeddings,
        documents=all_chunks,
        ids=ids
    )
    print(f"Indexed {len(all_chunks)} chunks from {repo_path}")

# Index your project
index_repository("/path/to/your/project")
```
Step 4: The Retrieval and Query Pipeline
When a question is asked, we retrieve relevant chunks and format a prompt for the LLM.
```python
from llama_cpp import Llama

# Load the local LLM (download the GGUF model file first)
llm = Llama(model_path="./models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
            n_ctx=2048, verbose=False)

def ask_codebase(question: str, k_results: int = 5):
    # 1. Retrieve relevant code chunks
    query_embedding = embedder.encode([question]).tolist()[0]
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=k_results
    )
    retrieved_docs = results['documents'][0]

    # 2. Construct a precise prompt for the LLM
    context_block = "\n\n---\n\n".join(retrieved_docs)
    prompt = f"""You are an expert software engineer analyzing a codebase. Use the provided code context to answer the question.

Code Context:
{context_block}

Question: {question}

Answer based only on the code context provided. If the context does not contain enough information to answer fully, state what you can infer and what is unclear. Be concise and specific.

Answer:"""

    # 3. Generate the answer
    response = llm(
        prompt,
        max_tokens=512,
        stop=["Question:"],  # avoid "\n\n" as a stop token: it truncates multi-paragraph answers
        echo=False,
        temperature=0.1  # Low temperature for factual, deterministic answers
    )
    return response['choices'][0]['text'].strip()

# Ask a question!
answer = ask_codebase("Where is the main entry point of the application, and what arguments does it accept?")
print(answer)
```
Leveling Up: Practical Considerations for Production
Our basic prototype works, but a robust system needs more:
- Multi-Language Support: Integrate tree-sitter for robust, language-aware parsing of Java, JavaScript, Go, etc.
- Metadata Enrichment: Store chunk type (function, class, config), file path, and even derived relationships (e.g., "function X calls function Y") in the vector DB metadata for better filtering.
- Hierarchical Retrieval: First find relevant files, then drill down into specific chunks within them. This can improve accuracy for broad questions.
- Cross-Reference Awareness: Use static analysis to build a graph of how chunks relate. When answering "How does authentication work?", the system could retrieve the `auth()` function and all of its call sites.
- Caching & Performance: Cache embeddings for unchanged files and implement batch processing for large repos.
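The caching point is the cheapest win. Here is a minimal sketch (the `file_fingerprint` and `files_to_reindex` helpers are hypothetical, not part of the pipeline above): hash each file's contents and re-embed only the files whose hash changed since the last run.

```python
import hashlib
from pathlib import Path

def file_fingerprint(path: Path) -> str:
    """Stable content hash; unchanged files keep the same fingerprint."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def files_to_reindex(repo_path: Path, seen: dict) -> list:
    """Return only the files whose content changed since the last run.

    `seen` maps str(path) -> fingerprint from the previous indexing run;
    it is updated in place so it can be persisted (e.g. as JSON) afterwards.
    """
    changed = []
    for path in repo_path.rglob("*.py"):
        fp = file_fingerprint(path)
        if seen.get(str(path)) != fp:
            changed.append(path)
            seen[str(path)] = fp
    return changed
```

On a second run with an up-to-date `seen` dict, `files_to_reindex` returns an empty list, so the expensive embedding step is skipped entirely for an unchanged repo.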
The Takeaway: Your Codebase as a Queryable Interface
The "Google Maps for Codebases" concept is more than a cool demo; it represents a shift in how we interact with software. By implementing a local RAG system, you gain a powerful, private, and customizable tool for navigating complex projects. This approach isn't limited to code—imagine applying it to internal documentation, commit histories, or log files.
Start exploring today. Clone a moderately complex open-source project and point your prototype at it. Ask questions. See where it fails, and iterate. The core building blocks are now in your hands. How will you adapt them to map your own development workflow?
What's the first codebase you'll query? Share your ideas or prototype improvements in the comments below.