Why Your Next Teammate Might Be an AI
This week, 22 articles about AI flooded dev.to, with the top post—a "Google Maps for Codebases" concept—racking up 117 reactions. It’s clear: developers are captivated by AI’s potential to navigate complex code. But most discussions remain at the conceptual level. What if you could build a core component of this yourself?
In this guide, we’ll move from concept to code. We’ll construct a practical, local AI assistant that can answer questions about your codebase. No opaque APIs, no monthly subscriptions—just Python, some clever libraries, and a clear understanding of the moving parts. By the end, you'll have a working prototype and the foundational knowledge to adapt it to your own projects.
Deconstructing the "Code Maps" Analogy
The "Google Maps for Codebases" idea is powerful. Think about it:
- Search (The Address Bar): You ask a question in plain English.
- Indexing (The Map Data): The system needs a pre-built, searchable representation of all roads (files, functions, classes).
- Routing (Directions): The AI finds the most relevant "paths" (code snippets) to answer your query.
- Presentation (The Map UI): It presents the answer, often citing its sources.
Our build will focus on the core engine: Indexing and Routing. The UI can be a simple CLI for now.
Our Tech Stack: Lean and Local
We're prioritizing transparency and control. Here’s our stack:
- LangChain: The Swiss Army knife for chaining AI components. It will orchestrate our workflow.
- ChromaDB: A lightweight, embeddings-native database to store and search our code index.
- Sentence Transformers (all-MiniLM-L6-v2): A local model to convert code/text into numerical vectors (embeddings). This is cheaper and faster than calling an API for indexing.
- Ollama (with Codellama or DeepSeek Coder): To run a capable, local code-aware LLM for the final answer generation. (Alternative: Use OpenAI's API for simplicity if local compute is limited).
- Python (3.10+): Our glue.
Phase 1: Building the Index (The Map Data)
An AI doesn't "understand" code like we do. It needs numbers. We convert code chunks into vector embeddings—dense numerical representations where semantically similar text has similar vectors.
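To make "semantically similar text has similar vectors" concrete: similarity between embeddings is usually measured as cosine similarity — the dot product of two vectors divided by the product of their lengths. Here's a dependency-free sketch using toy 3-dimensional vectors (real all-MiniLM-L6-v2 embeddings have 384 dimensions; the vectors and labels below are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|): 1.0 = same direction, ~0.0 = unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings of three code chunks
db_connect = [0.9, 0.1, 0.2]   # "create database engine"
css_button = [0.1, 0.9, 0.1]   # "style the submit button"

query = [0.85, 0.15, 0.25]     # "where is the DB configured?"

print(round(cosine_similarity(query, db_connect), 3))  # high: relevant chunk
print(round(cosine_similarity(query, css_button), 3))  # low: irrelevant chunk
```

The "search" in Phase 2 is exactly this: embed the question, then rank stored chunks by cosine similarity.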
First, install the essentials:
pip install langchain langchain-community chromadb sentence-transformers
Now, let's write our indexer (indexer.py):
from pathlib import Path
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.schema import Document

class CodebaseIndexer:
    def __init__(self, repo_path, persist_directory="./codebase_db"):
        self.repo_path = Path(repo_path)
        self.persist_dir = persist_directory
        # Use a local embedding model
        self.embeddings = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-MiniLM-L6-v2"
        )
        # Fall back from blank lines to lines to words to keep chunks coherent.
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n\n", "\n", " ", ""]
        )

    def _should_ignore(self, path: Path) -> bool:
        # Skip VCS internals, caches, build output, and compiled files.
        # A glob like "*.pyc" never matches as a plain substring,
        # so compiled files are checked by suffix instead.
        if path.suffix == ".pyc":
            return True
        ignore_dirs = {".git", "__pycache__", "node_modules", ".env", "dist", "build"}
        return any(part in ignore_dirs for part in path.parts)

    def load_documents(self):
        """Walk the repo and create LangChain Documents from code files."""
        documents = []
        for file_path in self.repo_path.rglob("*"):
            if file_path.is_file() and not self._should_ignore(file_path):
                # Basic support for common text-based source files
                if file_path.suffix not in ['.py', '.js', '.ts', '.java', '.cpp', '.md', '.txt', '.rs', '.go']:
                    continue
                try:
                    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
                        content = f.read()
                    # Create a document with content and metadata
                    doc = Document(
                        page_content=content,
                        metadata={
                            "source": str(file_path.relative_to(self.repo_path)),
                            "filepath": str(file_path),
                            "extension": file_path.suffix
                        }
                    )
                    documents.append(doc)
                except Exception as e:
                    print(f"Error reading {file_path}: {e}")
        return documents

    def create_vectorstore(self):
        """Split docs, generate embeddings, and persist to ChromaDB."""
        print("Loading documents...")
        raw_docs = self.load_documents()
        print(f"Loaded {len(raw_docs)} files.")

        print("Splitting documents into chunks...")
        split_docs = self.text_splitter.split_documents(raw_docs)
        print(f"Created {len(split_docs)} chunks.")

        print("Creating and persisting vector database...")
        # This step computes embeddings and stores them
        vectordb = Chroma.from_documents(
            documents=split_docs,
            embedding=self.embeddings,
            persist_directory=self.persist_dir
        )
        vectordb.persist()
        print(f"Index created and saved to '{self.persist_dir}'.")
        return vectordb

if __name__ == "__main__":
    # Point this to your project's root directory
    indexer = CodebaseIndexer("/path/to/your/codebase")
    vectordb = indexer.create_vectorstore()
Run this once to build your searchable map. It creates a codebase_db folder containing all vectorized code chunks.
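One setting worth understanding before moving on: chunk_overlap means consecutive chunks share text at their boundaries, so a function split across a boundary still appears whole in at least one chunk. A simplified, pure-Python version of what the splitter does (the real RecursiveCharacterTextSplitter also recurses through the separator list to avoid cutting mid-line):

```python
def naive_chunks(text, chunk_size=20, chunk_overlap=5):
    """Fixed-size windows that step forward by (chunk_size - chunk_overlap)."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

code = "def get_db_engine():\n    return create_engine(url)\n"
for chunk in naive_chunks(code):
    print(repr(chunk))  # note the repeated characters at each boundary
```

With our settings (1000-character chunks, 200-character overlap), each chunk repeats the last 200 characters of its predecessor.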
Phase 2: The Query Engine (The Routing System)
Now we need to retrieve relevant chunks and ask an LLM to synthesize an answer. This is called Retrieval-Augmented Generation (RAG).
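Stripped of libraries, the RAG pattern fits in a few lines. Here is a toy version — keyword-overlap scoring stands in for vector search, and the prompt would go to a stub LLM — just to make the retrieve → augment → generate flow visible before the real implementation below:

```python
def retrieve(question, chunks, k=2):
    """Rank chunks by shared words with the question (toy stand-in for vector search)."""
    q_words = set(question.lower().split())
    return sorted(chunks,
                  key=lambda c: len(q_words & set(c.lower().split())),
                  reverse=True)[:k]

def augment(question, context_chunks):
    """Stuff the retrieved chunks into the prompt."""
    context = "\n---\n".join(context_chunks)
    return f"Use this context:\n{context}\n\nQuestion: {question}\nAnswer:"

chunks = [
    "def get_db_engine(): reads DATABASE_URL and builds the engine",
    "def render_button(): returns styled HTML for the submit button",
]
question = "where is the database engine created"
prompt = augment(question, retrieve(question, chunks, k=1))
print(prompt)  # "generate" = sending this prompt to the LLM
```

The real engine swaps keyword overlap for ChromaDB similarity search and the print for an Ollama call, but the shape is identical.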
Create query_engine.py:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.prompts import PromptTemplate
from langchain_community.llms import Ollama  # Or: from langchain_openai import ChatOpenAI

class CodebaseQA:
    def __init__(self, persist_directory="./codebase_db"):
        # Must match the embedding model used at indexing time
        self.embeddings = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-MiniLM-L6-v2"
        )
        self.vectordb = Chroma(
            persist_directory=persist_directory,
            embedding_function=self.embeddings
        )

        # Initialize the LLM. Choose ONE option below.
        # OPTION A: Local via Ollama (Recommended for privacy/offline)
        self.llm = Ollama(model="codellama:7b")  # Or "deepseek-coder:6.7b"

        # OPTION B: OpenAI API (Easier, but costs $ and requires internet)
        # from langchain_openai import ChatOpenAI
        # self.llm = ChatOpenAI(model="gpt-4-turbo-preview")

        # Create a prompt template to guide the LLM
        self.qa_prompt = PromptTemplate(
            input_variables=["context", "question"],
            template="""
You are an expert software engineer analyzing a codebase.
Use the following retrieved code snippets (context) to answer the question.
If the context doesn't contain enough information, say so. Be precise and cite the source file names.

Context from codebase:
{context}

Question: {question}

Answer:"""
        )

    def ask(self, question: str, k=4):
        """Retrieve relevant code and generate an answer."""
        # 1. RETRIEVE: Find the k most similar code chunks to the question
        relevant_docs = self.vectordb.similarity_search(question, k=k)
        context_text = "\n\n---\n\n".join([
            f"Source: {doc.metadata['source']}\nContent:\n{doc.page_content[:500]}..."
            for doc in relevant_docs
        ])

        # 2. AUGMENT & GENERATE: Format the prompt and call the LLM
        formatted_prompt = self.qa_prompt.format(context=context_text, question=question)
        answer = self.llm.invoke(formatted_prompt)

        # 3. Return answer and sources for transparency
        sources = [doc.metadata["source"] for doc in relevant_docs]
        return {
            "answer": answer,
            "sources": sources,
            "context_preview": context_text[:500]  # For debugging
        }

if __name__ == "__main__":
    qa = CodebaseQA()
    while True:
        user_question = input("\nAsk a question about the codebase (or 'quit'): ")
        if user_question.lower() == 'quit':
            break
        result = qa.ask(user_question)
        print(f"\n🤖 Answer:\n{result['answer']}\n")
        print(f"📁 Referenced Sources: {', '.join(result['sources'])}")
Running Your Assistant
- Index your code:
  python indexer.py
- Install Ollama (if going local) and pull a model:
  ollama pull codellama:7b
- Start querying:
  python query_engine.py
Example Interaction:
Ask a question about the codebase: Where is the main database connection configured?
🤖 Answer:
Based on the context, the main database connection is configured in the file `src/config/database.py`. The code shows a function `get_db_engine()` that reads the `DATABASE_URL` environment variable and creates a SQLAlchemy engine with connection pooling settings.
📁 Referenced Sources: src/config/database.py, src/utils/helpers.py
Leveling Up: Practical Enhancements
Your prototype works. Now, make it production-worthy:
- Metadata Filtering: Enhance similarity_search to filter by file extension (e.g., "only search in .py files").
- Code-Aware Splitting: Use a library like tree-sitter to split at actual function/class boundaries instead of arbitrary characters.
- Hybrid Search: Combine vector similarity with traditional keyword matching (BM25) for better recall using LangChain's EnsembleRetriever with a BM25Retriever.
- Caching: Cache embeddings for files that haven't changed between indexing runs.
- Web Interface: Wrap the engine in a FastAPI server and build a simple React frontend.
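As a starting point for the caching idea, a content-hash check is simpler than parsing a git diff and works outside git repos too. This is a sketch, not part of the build above — the cache file location and function names are invented for illustration:

```python
import hashlib
import json
from pathlib import Path

def file_hash(path: Path) -> str:
    # Hash the raw bytes; any edit changes the digest
    return hashlib.sha256(path.read_bytes()).hexdigest()

def changed_files(repo_path, cache_file="./codebase_db/hashes.json"):
    """Return files whose content hash differs from the last indexing run."""
    cache_path = Path(cache_file)
    old = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    new, changed = {}, []
    for p in Path(repo_path).rglob("*.py"):
        h = file_hash(p)
        new[str(p)] = h
        if old.get(str(p)) != h:
            changed.append(p)  # re-embed only these files
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(new))
    return changed
```

Hook this into CodebaseIndexer and you only pay the embedding cost for files that actually changed.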
The Takeaway: You Built the Brain
The "AI Code Maps" trend highlights a real need: navigating complexity. By building this, you've demystified its core—RAG over your code. You now own a tool you can debug, modify, and scale without relying on a third-party's black box.
Your challenge this week: Clone this prototype, run it on one of your own GitHub repos, and try to extend it. Add one feature from the "Leveling Up" section. Share what you build.
The future of development isn't just about using AI tools; it's about understanding and shaping them. You've just taken a significant step.