## The AI Code Assistant Dream
You've seen the demos: paste a GitHub URL, ask a question in plain English, and get an intelligent answer about the codebase. Tools like GitHub Copilot Chat and the viral "Google Maps for Codebases" concept promise to revolutionize how we understand unfamiliar code. But how do these systems actually work under the hood? More importantly, how could you build a simplified version yourself?
This isn't just about calling an API. It's about understanding the architecture that makes code-aware AI possible. Today, we'll deconstruct the problem and build a practical, local-first code query engine using open-source tools. By the end, you'll have a working prototype and the architectural knowledge to adapt these patterns to your own projects.
## Deconstructing the Problem: It's Not Just One AI Call
At first glance, "ask a question about a codebase" seems like a job for a large language model (LLM). But raw codebases are too large for most LLM context windows, and LLMs lack inherent knowledge of your specific project. The magic happens in the retrieval-augmented generation (RAG) pattern, adapted for code.
The system needs to:
- Parse & Index: Break down the codebase into searchable chunks.
- Retrieve: Find the most relevant code snippets for a given question.
- Reason & Generate: Use an LLM to synthesize an answer from those snippets.
## Building Our Engine: A Three-Tier Architecture
Let's build a system called CodeExplainer. We'll use Python and focus on local, open-source components where possible.
### Tier 1: The Code Indexer
We need to parse code into meaningful chunks. A simple file splitter isn't enough—we should chunk by logical structures (functions, classes).
```python
import ast
import hashlib
from pathlib import Path
from typing import Dict, List

class CodeIndexer:
    def __init__(self, repo_path: str):
        self.repo_path = Path(repo_path)
        self.chunks = []

    def parse_python_file(self, file_path: Path) -> List[Dict]:
        """Parse a Python file into function/class chunks."""
        source = file_path.read_text(encoding="utf-8", errors="ignore")
        try:
            tree = ast.parse(source)
        except SyntaxError:
            return []  # Skip files that don't parse (e.g. Python 2 leftovers)

        lines = source.splitlines(keepends=True)
        chunks = []
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                start_line = node.lineno
                end_line = node.end_lineno  # Requires Python 3.8+
                code_snippet = ''.join(lines[start_line - 1:end_line])
                chunk_id = hashlib.md5(
                    f"{file_path}:{node.name}:{start_line}".encode()
                ).hexdigest()[:8]
                chunks.append({
                    'id': chunk_id,
                    'file_path': str(file_path.relative_to(self.repo_path)),
                    'name': node.name,
                    'type': type(node).__name__,
                    'code': code_snippet,
                    'start_line': start_line,
                    'metadata': {
                        'repo_path': str(self.repo_path),
                        'language': 'python',
                    },
                })
        return chunks

    def index_repository(self):
        """Walk the repo and index all Python files."""
        for py_file in self.repo_path.rglob('*.py'):
            if '.git' in py_file.parts:
                continue
            self.chunks.extend(self.parse_python_file(py_file))
        print(f"Indexed {len(self.chunks)} code chunks from {self.repo_path}")
        return self.chunks
```
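A quick sanity check shows the shape of what the indexer produces. The path and the printed values here are placeholders; point it at any Python project:

```python
indexer = CodeIndexer("./my_project")  # placeholder path
chunks = indexer.index_repository()

# Each chunk is a plain dict, roughly:
# {'id': '3f2a9c1b', 'file_path': 'app/db.py', 'name': 'get_connection',
#  'type': 'FunctionDef', 'code': 'def get_connection():\n    ...',
#  'start_line': 12, 'metadata': {'repo_path': '...', 'language': 'python'}}
print(chunks[0]['file_path'], chunks[0]['name'])
```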
### Tier 2: The Semantic Retriever
Now we need to find relevant chunks for a question. We'll use sentence embeddings and vector search.
```python
from typing import Dict, List

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class CodeRetriever:
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        self.embedder = SentenceTransformer(model_name)
        self.chunks = []
        self.embeddings = None

    def index_chunks(self, chunks: List[Dict]):
        """Create embeddings for all code chunks."""
        self.chunks = chunks
        # Combine metadata and code so the embedding captures both
        texts = [
            f"{chunk['type']} {chunk['name']} in {chunk['file_path']}:\n{chunk['code']}"
            for chunk in chunks
        ]
        self.embeddings = self.embedder.encode(texts, show_progress_bar=True)

    def retrieve(self, query: str, top_k: int = 5) -> List[Dict]:
        """Find the most relevant code chunks for a query."""
        query_embedding = self.embedder.encode([query])
        similarities = cosine_similarity(query_embedding, self.embeddings)[0]
        # Indices of the top_k highest similarities, best first
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        results = []
        for idx in top_indices:
            chunk = self.chunks[idx].copy()
            chunk['similarity'] = float(similarities[idx])
            results.append(chunk)
        return results
```
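Before wiring in the LLM, it's worth querying the retriever directly and eyeballing the hits. The query below is just an illustration, reusing the `chunks` from the indexer step:

```python
retriever = CodeRetriever()
retriever.index_chunks(chunks)

# Inspect what the retriever surfaces before any LLM is involved
for hit in retriever.retrieve("where are database connections opened?", top_k=3):
    print(f"{hit['similarity']:.3f}  {hit['file_path']}::{hit['name']}")
```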
### Tier 3: The Reasoning Engine
Finally, we use an LLM to generate answers from retrieved chunks. We'll use Ollama to run local models.
````python
import subprocess

class CodeExplainer:
    def __init__(self, retriever: CodeRetriever, model: str = "llama3.2"):
        self.retriever = retriever
        self.model = model

    def generate_answer(self, question: str) -> str:
        # Retrieve relevant code
        relevant_chunks = self.retriever.retrieve(question, top_k=3)

        # Build context for the LLM
        context_parts = []
        for i, chunk in enumerate(relevant_chunks):
            context_parts.append(f"[Chunk {i + 1} from {chunk['file_path']}]")
            context_parts.append(f"```python\n{chunk['code']}\n```")
            context_parts.append("")
        context = "\n".join(context_parts)

        # Create prompt
        prompt = f"""You are a helpful code assistant. Answer the question based only on the provided code context.

Code Context:
{context}

Question: {question}

Answer the question clearly and concisely. If the context doesn't contain relevant information, say so.
If referring to code, mention the file and function/class name.
"""
        # Call the local LLM via Ollama
        return self._query_ollama(prompt)

    def _query_ollama(self, prompt: str) -> str:
        """Query the local Ollama CLI."""
        cmd = ["ollama", "run", self.model, prompt]
        try:
            result = subprocess.run(
                cmd,
                capture_output=True,
                text=True,
                timeout=120,  # Local models can take a while on CPU
            )
            return result.stdout.strip()
        except Exception as e:
            return f"Error querying LLM: {str(e)}"
````
## Putting It All Together
Here's the complete workflow:
```python
def main():
    # 1. Index the repository
    indexer = CodeIndexer("/path/to/your/repo")
    chunks = indexer.index_repository()

    # 2. Set up the retriever
    retriever = CodeRetriever()
    retriever.index_chunks(chunks)

    # 3. Create the explainer
    explainer = CodeExplainer(retriever)

    # 4. Ask questions!
    questions = [
        "How does this project handle database connections?",
        "Show me the main entry point function",
        "What authentication system is used?",
    ]
    for question in questions:
        print(f"\n{'=' * 60}")
        print(f"Question: {question}")
        print(f"{'=' * 60}")
        answer = explainer.generate_answer(question)
        print(f"Answer:\n{answer}")

if __name__ == "__main__":
    main()
```
## Production Considerations & Enhancements
Our prototype works, but production systems need more:
- Multi-language Support: Use Tree-sitter for robust parsing across languages (see the JavaScript sketch after this list)
- Hierarchical Chunking: Index at multiple levels (file, class, function)
- Hybrid Search: Combine semantic search with keyword matching for better recall
- Caching: Store embeddings and common queries
- Cross-reference Analysis: Build call graphs to understand relationships
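To give a taste of the first item, here is a minimal sketch of function-level chunking for JavaScript. It assumes the `tree-sitter-languages` package, which bundles prebuilt grammars; `function_declaration` and `class_declaration` are node types from the Tree-sitter JavaScript grammar:

```python
# pip install tree-sitter-languages
from tree_sitter_languages import get_parser

def chunk_javascript(source: str) -> list:
    """Split a JS file into top-level function/class chunks (sketch)."""
    parser = get_parser("javascript")
    tree = parser.parse(source.encode("utf-8"))
    chunks = []
    for node in tree.root_node.children:  # Top level only; recurse for nested defs
        if node.type in ("function_declaration", "class_declaration"):
            chunks.append({
                "code": source[node.start_byte:node.end_byte],
                "type": node.type,
                "start_line": node.start_point[0] + 1,  # Points are 0-based
            })
    return chunks
```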
```python
# Example enhancement: Hybrid search
class HybridRetriever(CodeRetriever):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.bm25_index = None  # Traditional keyword index, built during indexing

    def hybrid_retrieve(self, query: str, top_k: int = 5, alpha: float = 0.7):
        # Over-fetch from both retrievers, then merge the two score lists
        semantic_results = super().retrieve(query, top_k * 2)
        keyword_results = self.keyword_retrieve(query, top_k * 2)
        # Combine scores (simplified); alpha weights semantic vs. keyword.
        # keyword_retrieve and _merge_results are filled in below.
        combined = self._merge_results(semantic_results, keyword_results, alpha)
        return combined[:top_k]
```
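The `keyword_retrieve` and `_merge_results` calls above are placeholders. One way to fill them in is with the `rank_bm25` package (my choice here; any lexical index would do). One caveat: BM25 scores are unbounded, unlike cosine similarity, so a production version should normalize both score ranges before mixing them:

```python
import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25

class BM25HybridRetriever(HybridRetriever):
    def index_chunks(self, chunks):
        super().index_chunks(chunks)
        # Naive whitespace tokenization; a code-aware tokenizer would do better
        self._tokens = [c['code'].lower().split() for c in chunks]
        self.bm25_index = BM25Okapi(self._tokens)

    def keyword_retrieve(self, query: str, top_k: int):
        scores = self.bm25_index.get_scores(query.lower().split())
        results = []
        for idx in np.argsort(scores)[-top_k:][::-1]:
            chunk = self.chunks[idx].copy()
            chunk['similarity'] = float(scores[idx])
            results.append(chunk)
        return results

    def _merge_results(self, semantic, keyword, alpha):
        # Late fusion: weighted sum of scores, keyed by chunk id
        merged = {}
        for c in semantic:
            merged[c['id']] = {**c, 'similarity': alpha * c['similarity']}
        for c in keyword:
            bonus = (1 - alpha) * c['similarity']
            if c['id'] in merged:
                merged[c['id']]['similarity'] += bonus
            else:
                merged[c['id']] = {**c, 'similarity': bonus}
        return sorted(merged.values(), key=lambda c: c['similarity'], reverse=True)
```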
## The Takeaway: AI as Your Code Compass
Building a code query engine demystifies the "AI magic" and reveals a tractable engineering problem. The real value isn't in having an oracle that knows everything—it's in creating a system that can quickly surface the right context for an LLM to reason about.
This architecture pattern extends beyond code. Any domain with structured text—documentation, logs, research papers—can benefit from the RAG approach.
Your Challenge: Fork the example implementation and extend it. Add support for JavaScript, implement call graph analysis, or create a web interface. The tools are now in your hands.
What will you build when you can ask your codebase anything?