# From Keyword Chaos to Semantic Understanding
You’ve just cloned a massive, unfamiliar repository. Your mission: find the function that handles user authentication errors. You grep for "auth," but get 500 results. You search for "error," and the world collapses. This is the classic "needle in a haystack" problem in codebases, and it’s a massive productivity sink.
The recent article "Google Maps for Codebases: Paste a GitHub URL, Ask Anything" sparked excitement by showcasing this future. But what if you could build the core of this yourself? Not a full-scale production tool, but a working prototype that understands code semantically? Instead of matching "error," it finds "the function that validates JWT tokens and throws an exception on expiry."
This guide will walk you through building a local, semantic code search engine using open-source Large Language Models (LLMs). We'll move beyond simple keyword matching to create a system that understands the meaning behind the code.
## The Core Idea: From Text Chunks to Vector Search
The magic behind semantic search is embeddings. An embedding is a numerical representation (a high-dimensional vector) of a piece of text that captures its semantic meaning. Sentences with similar meanings have similar vectors.
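"Similar vectors" has a concrete meaning: the standard metric is cosine similarity, which is a few lines of plain Python. The tiny 3-dimensional vectors below are made up purely for illustration; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 = same direction, near 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (real ones from all-MiniLM-L6-v2 have 384 dimensions)
auth_error = [0.9, 0.1, 0.3]
token_check = [0.8, 0.2, 0.4]   # semantically close to auth_error
csv_parser = [0.1, 0.9, 0.0]    # unrelated

print(cosine_similarity(auth_error, token_check))  # high, ~0.98
print(cosine_similarity(auth_error, csv_parser))   # low, ~0.21
```

Semantic search is exactly this comparison, just performed over thousands of stored vectors at once, which is what the vector database accelerates.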
Our plan:
- Chunk: Break a codebase into logical, searchable pieces (functions, classes, blocks).
- Embed: Convert each chunk into a vector using an embedding model.
- Store: Place these vectors in a specialized database for fast similarity search.
- Query: Convert a natural language question ("find auth error handlers") into a vector and find the most similar code chunks.
## Building the Engine: A Step-by-Step Implementation
We'll use Python, `sentence-transformers` for embeddings, and ChromaDB as our vector database.
### Step 1: Setting Up the Environment
```bash
pip install sentence-transformers chromadb tree-sitter-languages
```
### Step 2: The Code Chunker
We need an intelligent way to split code. A naive line-based splitter would destroy function definitions. We'll use tree-sitter, a robust parser generator, to understand the code's Abstract Syntax Tree (AST) and chunk it logically.
```python
import os
from tree_sitter import Parser
from tree_sitter_languages import get_language


class CodeChunker:
    def __init__(self, language='python'):
        self.language = get_language(language)
        self.parser = Parser()
        self.parser.set_language(self.language)

    def chunk_file(self, file_path):
        """Parse a file and return logical chunks (function/class definitions)."""
        with open(file_path, 'r', encoding='utf-8') as f:
            source_code = f.read()
        source_bytes = source_code.encode('utf-8')

        tree = self.parser.parse(source_bytes)
        root_node = tree.root_node

        # Query for function/class definitions (simplified for Python)
        query = self.language.query("""
        (function_definition
          name: (identifier) @name
          body: (block) @body) @function
        (class_definition
          name: (identifier) @name
          body: (block) @body) @class
        """)

        chunks = []
        for node, tag in query.captures(root_node):
            if tag not in ('function', 'class'):
                continue
            # tree-sitter offsets are byte offsets, so slice the bytes, not the str
            chunk_text = source_bytes[node.start_byte:node.end_byte].decode('utf-8')
            name_node = node.child_by_field_name('name')
            if name_node is None:
                continue
            chunk_name = source_bytes[name_node.start_byte:name_node.end_byte].decode('utf-8')
            chunks.append({
                "text": chunk_text,
                "name": chunk_name,
                "type": tag,
                "file": file_path
            })
        return chunks
```
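If you only ever need to chunk Python, the standard library's `ast` module can produce the same chunk dictionaries without the `tree-sitter` dependency. A minimal sketch (the `chunk_source` helper name is ours, not part of the chunker above; note `ast.walk` also picks up nested definitions):

```python
import ast

def chunk_source(source_code, file_path="<memory>"):
    """Chunk Python source into function/class definitions using the stdlib ast module."""
    tree = ast.parse(source_code)
    lines = source_code.splitlines()
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno (Python 3.8+) delimit the full definition
            text = "\n".join(lines[node.lineno - 1:node.end_lineno])
            chunks.append({
                "text": text,
                "name": node.name,
                "type": "class" if isinstance(node, ast.ClassDef) else "function",
                "file": file_path,
            })
    return chunks

sample = "def login(user):\n    return user\n\nclass Session:\n    pass\n"
for c in chunk_source(sample):
    print(c["type"], c["name"])
```

The trade-off is that `ast` only understands Python, while the `tree-sitter` approach extends to any language with a grammar.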
### Step 3: Generating & Storing Embeddings
Now, we convert each chunk into a vector and store it in ChromaDB.
```python
from sentence_transformers import SentenceTransformer
import chromadb


class SemanticCodeIndex:
    def __init__(self, persist_directory="./code_db"):
        # A good general model; see "Leveling Up" below for code-specific options
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        # PersistentClient replaces the deprecated Settings(chroma_db_impl=...) API
        self.client = chromadb.PersistentClient(path=persist_directory)
        self.collection = self.client.get_or_create_collection(name="code_chunks")

    def index_codebase(self, root_path):
        """Walk through the directory, chunk files, and add them to the index."""
        chunker = CodeChunker('python')
        for dirpath, _, filenames in os.walk(root_path):
            for fname in filenames:
                if not fname.endswith('.py'):
                    continue
                full_path = os.path.join(dirpath, fname)
                for chunk in chunker.chunk_file(full_path):
                    # File path + name makes a unique-enough ID for this prototype
                    chunk_id = f"{full_path}:{chunk['name']}"
                    embedding = self.embedder.encode(chunk['text']).tolist()
                    self.collection.add(
                        embeddings=[embedding],
                        documents=[chunk['text']],
                        metadatas=[{
                            "name": chunk['name'],
                            "type": chunk['type'],
                            "file": chunk['file']
                        }],
                        ids=[chunk_id]
                    )
        print("Indexing complete. Added chunks to collection.")

    def search(self, query_text, n_results=5):
        """Search the index for code semantically similar to the query."""
        query_embedding = self.embedder.encode(query_text).tolist()
        return self.collection.query(
            query_embeddings=[query_embedding],
            n_results=n_results
        )
```
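It helps to know the shape of what `collection.query` returns: parallel lists, nested once per query embedding, which is why the results get unpacked with an extra `[0]`. A mocked example of the structure (the IDs, snippets, and distance values here are purely illustrative):

```python
# Illustrative shape of a ChromaDB query() result for a single query
results = {
    "ids": [["app/auth.py:validate_token"]],
    "documents": [["def validate_token(jwt):\n    ..."]],
    "metadatas": [[{"name": "validate_token", "type": "function", "file": "app/auth.py"}]],
    "distances": [[0.31]],
}

# The outer list has one entry per query embedding; index [0] for our single query
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    print(meta["name"], "in", meta["file"])
```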
### Step 4: Putting It All Together
Let's create a simple command-line interface to test our system.
```python
# main.py
import sys
from semantic_index import SemanticCodeIndex


def main():
    if len(sys.argv) < 3:
        print("Usage: python main.py <index|search> <path_or_query>")
        return

    command = sys.argv[1]
    index = SemanticCodeIndex()

    if command == "index":
        path = sys.argv[2]
        index.index_codebase(path)
        print(f"Indexed {path}")
    elif command == "search":
        query = " ".join(sys.argv[2:])
        print(f"Query: '{query}'\n")
        results = index.search(query)
        for i, (doc, meta) in enumerate(zip(results['documents'][0], results['metadatas'][0])):
            print(f"\n--- Result {i+1} | {meta['name']} ({meta['type']}) in {meta['file']} ---")
            print(doc[:500] + "..." if len(doc) > 500 else doc)  # Preview long chunks
            print("-" * 50)


if __name__ == "__main__":
    main()
```
Run it:
```bash
# Index a project
python main.py index /path/to/your/python/project

# Search semantically
python main.py search "function that reads a CSV file and returns a dictionary"
```
## Leveling Up: Practical Enhancements
Our prototype works, but here’s how to make it robust:
- Multi-Language Support: Extend the `CodeChunker` to use different `tree-sitter` grammars for JavaScript, Go, Java, etc.
- Better Chunking: Chunk at the level of individual statements or logical blocks inside large functions for finer-grained search.
- Hybrid Search: Combine semantic search with traditional keyword (BM25) search, merging the two ranked result lists. This ensures you still find an exact `getUserById` function when you search for that exact name.
- Use a Code-Specific Embedding Model: Swap the general `all-MiniLM-L6-v2` for a model trained on code, like `microsoft/codebert-base` or `Salesforce/codet5-base`. This dramatically improves the system's grasp of code semantics.
- Add an LLM for Summarization/Answering: This is the final step toward a true "ask anything" tool. Use the top search results as context and feed them into a local LLM (like Llama 3.1 or Phi-3 via Ollama) to generate a concise, natural-language answer.
```python
# Pseudocode for the LLM answering step
context = "\n---\n".join(top_5_code_chunks)

prompt = f"""
Based on the following code context, answer the question.
If the answer cannot be found, say so.

Context:
{context}

Question: {user_query}

Answer:
"""

answer = query_llm(prompt)  # Using Ollama or LiteLLM
```
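For the hybrid-search enhancement, a common way to merge the keyword and semantic result lists is reciprocal rank fusion (RRF). A minimal sketch, assuming you already have the two ranked lists of chunk IDs (the sample IDs below are hypothetical):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of IDs; items ranked high in any list float to the top."""
    scores = {}
    for ranking in ranked_lists:
        for rank, item_id in enumerate(ranking):
            # k dampens the influence of top ranks; 60 is the commonly used default
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic_hits = ["auth.py:validate_token", "auth.py:refresh", "db.py:connect"]
keyword_hits = ["auth.py:get_user_by_id", "auth.py:validate_token"]

print(reciprocal_rank_fusion([semantic_hits, keyword_hits]))
```

Because `auth.py:validate_token` appears in both lists, it accumulates score from each and ranks first; IDs unique to one list still survive into the merged ranking.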
## The Takeaway: You Can Build the Future of Dev Tools
Semantic code search isn't just for big tech companies. With open-source LLMs and vector databases, you can build powerful, context-aware tools that understand your codebase's intent. Start by indexing your most complex project. Experiment with different embedding models and chunking strategies.
The goal isn't to replicate a commercial product feature-for-feature, but to understand the principles and gain the ability to create custom, intelligent tooling tailored to your specific workflow. This is the true power of the AI shift: democratizing the capability to build the next generation of developer experience.
Your Challenge: Clone a mid-sized open-source repo and index it with this script. Try to find something obscure using a plain-language query. Then, fork the script and implement one of the enhancements above. Share what you build!
The full code for this guide is available as a starter template on GitHub.