# From Keyword Chaos to Semantic Understanding
You’ve just cloned a massive, unfamiliar repository. Your mission: find the function that handles user authentication errors. You grep for "auth," but get 500 results. You search for "error," and the world collapses. This is the classic "needle in a haystack" problem in codebases, and it’s a massive productivity sink.
The recent article "Google Maps for Codebases: Paste a GitHub URL, Ask Anything" sparked excitement by showcasing this future. But what if you could build the core of this yourself? Not a full-scale production tool, but a working prototype that understands code semantically? Instead of matching "error," it finds "the function that validates JWT tokens and throws an exception on expiry."
This guide will walk you through building a local, semantic code search engine using open-source Large Language Models (LLMs). We'll move beyond simple keyword matching to create a system that understands the meaning behind the code.
## The Core Idea: From Text Chunks to Vector Search
The magic behind semantic search is embeddings. An embedding is a numerical representation (a high-dimensional vector) of a piece of text that captures its semantic meaning. Sentences with similar meanings have similar vectors.
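"Similar vectors" has a concrete meaning: the standard metric is cosine similarity, which is a few lines of plain Python. The tiny 3-dimensional vectors below are made up purely for illustration; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 = same direction, near 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (real ones from all-MiniLM-L6-v2 have 384 dimensions)
auth_error = [0.9, 0.1, 0.3]
token_check = [0.8, 0.2, 0.4]   # semantically close to auth_error
csv_parser = [0.1, 0.9, 0.0]    # unrelated

print(cosine_similarity(auth_error, token_check))  # high, ~0.98
print(cosine_similarity(auth_error, csv_parser))   # low, ~0.21
```

Semantic search is exactly this comparison, just performed over thousands of stored vectors at once, which is what the vector database accelerates.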
Our plan:
- Chunk: Break a codebase into logical, searchable pieces (functions, classes, blocks).
- Embed: Convert each chunk into a vector using an embedding model.
- Store: Place these vectors in a specialized database for fast similarity search.
- Query: Convert a natural language question ("find auth error handlers") into a vector and find the most similar code chunks.
## Building the Engine: A Step-by-Step Implementation
We'll use Python, `sentence-transformers` for embeddings, and ChromaDB as our vector database.
### Step 1: Setting Up the Environment
```bash
pip install sentence-transformers chromadb tree-sitter-languages
```
### Step 2: The Code Chunker
We need an intelligent way to split code. A naive line-based splitter would destroy function definitions. We'll use tree-sitter, a robust parser generator, to understand the code's Abstract Syntax Tree (AST) and chunk it logically.
```python
import os
from tree_sitter import Parser
from tree_sitter_languages import get_language


class CodeChunker:
    def __init__(self, language='python'):
        self.language = get_language(language)
        self.parser = Parser()
        self.parser.set_language(self.language)

    def chunk_file(self, file_path):
        """Parse a file and return logical chunks (function/class definitions)."""
        with open(file_path, 'r', encoding='utf-8') as f:
            source_code = f.read()
        source_bytes = source_code.encode('utf-8')

        tree = self.parser.parse(source_bytes)
        root_node = tree.root_node

        # Query for function/class definitions (simplified for Python)
        query = self.language.query("""
        (function_definition
          name: (identifier) @name
          body: (block) @body) @function
        (class_definition
          name: (identifier) @name
          body: (block) @body) @class
        """)

        chunks = []
        for node, tag in query.captures(root_node):
            if tag not in ('function', 'class'):
                continue
            # tree-sitter offsets are byte offsets, so slice the bytes, not the str
            chunk_text = source_bytes[node.start_byte:node.end_byte].decode('utf-8')
            name_node = node.child_by_field_name('name')
            if name_node is None:
                continue
            chunk_name = source_bytes[name_node.start_byte:name_node.end_byte].decode('utf-8')
            chunks.append({
                "text": chunk_text,
                "name": chunk_name,
                "type": tag,
                "file": file_path
            })
        return chunks
```
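If you only ever need to chunk Python, the standard library's `ast` module can produce the same chunk dictionaries without the `tree-sitter` dependency. A minimal sketch (the `chunk_source` helper name is ours, not part of the chunker above; note `ast.walk` also picks up nested definitions):

```python
import ast

def chunk_source(source_code, file_path="<memory>"):
    """Chunk Python source into function/class definitions using the stdlib ast module."""
    tree = ast.parse(source_code)
    lines = source_code.splitlines()
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno (Python 3.8+) delimit the full definition
            text = "\n".join(lines[node.lineno - 1:node.end_lineno])
            chunks.append({
                "text": text,
                "name": node.name,
                "type": "class" if isinstance(node, ast.ClassDef) else "function",
                "file": file_path,
            })
    return chunks

sample = "def login(user):\n    return user\n\nclass Session:\n    pass\n"
for c in chunk_source(sample):
    print(c["type"], c["name"])
```

The trade-off is that `ast` only understands Python, while the `tree-sitter` approach extends to any language with a grammar.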
### Step 3: Generating & Storing Embeddings
Now, we convert each chunk into a vector and store it in ChromaDB.
```python
from sentence_transformers import SentenceTransformer
import chromadb


class SemanticCodeIndex:
    def __init__(self, persist_directory="./code_db"):
        # A good general model; see "Leveling Up" below for code-specific options
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        # PersistentClient replaces the deprecated Settings(chroma_db_impl=...) API
        self.client = chromadb.PersistentClient(path=persist_directory)
        self.collection = self.client.get_or_create_collection(name="code_chunks")

    def index_codebase(self, root_path):
        """Walk through the directory, chunk files, and add them to the index."""
        chunker = CodeChunker('python')
        for dirpath, _, filenames in os.walk(root_path):
            for fname in filenames:
                if not fname.endswith('.py'):
                    continue
                full_path = os.path.join(dirpath, fname)
                for chunk in chunker.chunk_file(full_path):
                    # File path + name makes a unique-enough ID for this prototype
                    chunk_id = f"{full_path}:{chunk['name']}"
                    embedding = self.embedder.encode(chunk['text']).tolist()
                    self.collection.add(
                        embeddings=[embedding],
                        documents=[chunk['text']],
                        metadatas=[{
                            "name": chunk['name'],
                            "type": chunk['type'],
                            "file": chunk['file']
                        }],
                        ids=[chunk_id]
                    )
        print("Indexing complete. Added chunks to collection.")

    def search(self, query_text, n_results=5):
        """Search the index for code semantically similar to the query."""
        query_embedding = self.embedder.encode(query_text).tolist()
        return self.collection.query(
            query_embeddings=[query_embedding],
            n_results=n_results
        )
```
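It helps to know the shape of what `collection.query` returns: parallel lists, nested once per query embedding, which is why the results get unpacked with an extra `[0]`. A mocked example of the structure (the IDs, snippets, and distance values here are purely illustrative):

```python
# Illustrative shape of a ChromaDB query() result for a single query
results = {
    "ids": [["app/auth.py:validate_token"]],
    "documents": [["def validate_token(jwt):\n    ..."]],
    "metadatas": [[{"name": "validate_token", "type": "function", "file": "app/auth.py"}]],
    "distances": [[0.31]],
}

# The outer list has one entry per query embedding; index [0] for our single query
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    print(meta["name"], "in", meta["file"])
```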
### Step 4: Putting It All Together
Let's create a simple command-line interface to test our system.
```python
# main.py
import sys
from semantic_index import SemanticCodeIndex


def main():
    if len(sys.argv) < 3:
        print("Usage: python main.py <index|search> <path_or_query>")
        return

    command = sys.argv[1]
    index = SemanticCodeIndex()

    if command == "index":
        path = sys.argv[2]
        index.index_codebase(path)
        print(f"Indexed {path}")
    elif command == "search":
        query = " ".join(sys.argv[2:])
        print(f"Query: '{query}'\n")
        results = index.search(query)
        for i, (doc, meta) in enumerate(zip(results['documents'][0], results['metadatas'][0])):
            print(f"\n--- Result {i+1} | {meta['name']} ({meta['type']}) in {meta['file']} ---")
            print(doc[:500] + "..." if len(doc) > 500 else doc)  # Preview long chunks
            print("-" * 50)


if __name__ == "__main__":
    main()
```
Run it:
```bash
# Index a project
python main.py index /path/to/your/python/project

# Search semantically
python main.py search "function that reads a CSV file and returns a dictionary"
```
## Leveling Up: Practical Enhancements
Our prototype works, but here’s how to make it robust:
- Multi-Language Support: Extend the `CodeChunker` to use different `tree-sitter` grammars for JavaScript, Go, Java, etc.
- Better Chunking: Chunk at the level of individual statements or logical blocks inside large functions for finer-grained search.
- Hybrid Search: Combine semantic search with traditional keyword (BM25) search, merging the two ranked result lists. This ensures you still find an exact `getUserById` function when you search for that exact name.
- Use a Code-Specific Embedding Model: Swap the general `all-MiniLM-L6-v2` for a model trained on code, like `microsoft/codebert-base` or `Salesforce/codet5-base`. This dramatically improves the system's grasp of code semantics.
- Add an LLM for Summarization/Answering: This is the final step toward a true "ask anything" tool. Use the top search results as context and feed them into a local LLM (like Llama 3.1 or Phi-3 via Ollama) to generate a concise, natural-language answer.
```python
# Pseudocode for the LLM answering step
context = "\n---\n".join(top_5_code_chunks)

prompt = f"""
Based on the following code context, answer the question.
If the answer cannot be found, say so.

Context:
{context}

Question: {user_query}

Answer:
"""

answer = query_llm(prompt)  # Using Ollama or LiteLLM
```
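For the hybrid-search enhancement, a common way to merge the keyword and semantic result lists is reciprocal rank fusion (RRF). A minimal sketch, assuming you already have the two ranked lists of chunk IDs (the sample IDs below are hypothetical):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of IDs; items ranked high in any list float to the top."""
    scores = {}
    for ranking in ranked_lists:
        for rank, item_id in enumerate(ranking):
            # k dampens the influence of top ranks; 60 is the commonly used default
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic_hits = ["auth.py:validate_token", "auth.py:refresh", "db.py:connect"]
keyword_hits = ["auth.py:get_user_by_id", "auth.py:validate_token"]

print(reciprocal_rank_fusion([semantic_hits, keyword_hits]))
```

Because `auth.py:validate_token` appears in both lists, it accumulates score from each and ranks first; IDs unique to one list still survive into the merged ranking.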
## The Takeaway: You Can Build the Future of Dev Tools
Semantic code search isn't just for big tech companies. With open-source LLMs and vector databases, you can build powerful, context-aware tools that understand your codebase's intent. Start by indexing your most complex project. Experiment with different embedding models and chunking strategies.
The goal isn't to replicate a commercial product feature-for-feature, but to understand the principles and gain the ability to create custom, intelligent tooling tailored to your specific workflow. This is the true power of the AI shift: democratizing the capability to build the next generation of developer experience.
Your Challenge: Clone a mid-sized open-source repo and index it with this script. Try to find something obscure using a plain-language query. Then, fork the script and implement one of the enhancements above. Share what you build!
The full code for this guide is available as a starter template on GitHub.