DEV Community

Helain

Building a Local Code Search System with Ollama and AST-Aware RAG

Most code search tools treat your codebase like a bag of words. They find files containing your query terms. What developers actually need is semantic search: "show me where we validate user tokens" should surface the right middleware even if it never uses the word "validate."

The combination of local LLMs and syntax-aware RAG makes this possible without sending your code to any external API. This post walks through how to wire it together.

Why Standard RAG Fails on Code

The temptation is to throw your source files into a text splitter, embed the chunks, and call it done. This works poorly. Code has structure that character-based chunking destroys: functions get split mid-body, decorators get separated from the functions they modify, class methods float loose from their definitions.

The fix is AST-based chunking. Instead of splitting on character counts, you parse the code into its syntax tree and create chunks at meaningful boundaries. A function stays whole. A class definition includes its methods. The chunk knows its name, its parent class, its signature, and its imports.

The full approach, with a PythonASTChunker implementation and the CodeChunk dataclass, is in this writeup on code retrieval systems. The short version: once you switch from text chunking to AST chunking, retrieval quality improves dramatically because you stop querying against fragments that have lost all semantic context.
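To make the idea concrete without reproducing the full PythonASTChunker, here is a minimal sketch built on the stdlib `ast` module. The `Chunk` dataclass and field names here are simplified stand-ins for the real `CodeChunk`; the actual implementation also tracks decorators, imports, and nesting:

```python
import ast
from dataclasses import dataclass

@dataclass
class Chunk:
    name: str
    chunk_type: str
    start_line: int
    end_line: int
    source: str

def chunk_python(source: str) -> list[Chunk]:
    """Split source into one chunk per top-level function or class.
    Each chunk stays whole: a class keeps its methods, a function its body."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(Chunk(
                name=node.name,
                chunk_type=type(node).__name__,
                start_line=node.lineno,
                end_line=node.end_lineno,
                source=ast.get_source_segment(source, node),
            ))
    return chunks

code = '''
def greet(name):
    """Say hello."""
    return f"hello {name}"

class Greeter:
    def run(self):
        return greet("world")
'''
for c in chunk_python(code):
    print(c.name, c.chunk_type, c.start_line, c.end_line)
```

Note that `Greeter.run` never becomes a loose fragment: it travels inside the `Greeter` chunk, which is exactly the property character-based splitting destroys.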

Running the Embedding Model Locally

Before indexing anything, you need an embedding model. Ollama handles this cleanly.

```shell
ollama pull nomic-embed-text
```

Nomic Embed Text is fast, produces 768-dimensional embeddings, and runs fine on CPU for batch indexing. For larger codebases where you want GPU acceleration, mxbai-embed-large is worth the extra memory.

The Ollama practical guide covers the full setup, but the embedding API is just one endpoint:

```python
import requests

def embed_chunk(text: str, model: str = "nomic-embed-text") -> list[float]:
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
    )
    response.raise_for_status()  # surface model-not-found and server errors early
    return response.json()["embedding"]
```

For batch indexing, use the ollama Python library directly, which is more ergonomic than raw HTTP:

```python
import ollama

def embed_batch(texts: list[str], model: str = "nomic-embed-text") -> list[list[float]]:
    return [
        ollama.embeddings(model=model, prompt=text)["embedding"]
        for text in texts
    ]
```

The Indexing Pipeline

With AST chunking and local embeddings, the indexing pipeline looks like this:

```python
import chromadb
from pathlib import Path

client = chromadb.Client()
collection = client.create_collection("codebase")

def index_python_file(file_path: str, source_code: str):
    chunks = PythonASTChunker().chunk_file(source_code, file_path)

    for chunk in chunks:
        search_text = chunk.to_search_text()
        embedding = embed_chunk(search_text)

        collection.add(
            ids=[f"{file_path}:{chunk.start_line}"],
            embeddings=[embedding],
            documents=[chunk.content],
            metadatas=[{
                "file_path": chunk.file_path,
                "chunk_type": chunk.chunk_type,
                "name": chunk.name,
                "start_line": chunk.start_line,
                "end_line": chunk.end_line,
                "parent_class": chunk.parent_class or "",
            }]
        )

# Walk your project
for path in Path("src").rglob("*.py"):
    index_python_file(str(path), path.read_text())
```

The to_search_text() method on CodeChunk is doing real work here. It concatenates the docstring, signature, and code body before embedding. This means a query like "parse JSON configuration files" can match a function whose docstring says "parse configuration" even if the function body just calls json.loads.

Query and Generation

Retrieval is straightforward. Embed the query, find nearest neighbors, pass the chunks to a local LLM:

```python
def search_code(query: str, n_results: int = 5) -> list[dict]:
    query_embedding = embed_chunk(query)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results
    )
    return [
        {
            "content": doc,
            "metadata": meta
        }
        for doc, meta in zip(results["documents"][0], results["metadatas"][0])
    ]

def answer_code_question(question: str) -> str:
    chunks = search_code(question)

    context = "\n\n".join([
        f"# {c['metadata']['file_path']} (line {c['metadata']['start_line']})\n{c['content']}"
        for c in chunks
    ])

    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "llama3.1:8b",
            "messages": [
                {
                    "role": "system",
                    "content": "You are a code assistant. Answer questions about the codebase using only the provided code snippets."
                },
                {
                    "role": "user",
                    "content": f"Code context:\n{context}\n\nQuestion: {question}"
                }
            ],
            "stream": False,
        }
    )
    response.raise_for_status()

    return response.json()["message"]["content"]
```

Where This Breaks Down

Single-function retrieval works well. The failure mode is cross-file reasoning: "trace what happens when a user submits an order" requires understanding call chains that span multiple files. The retrieved chunks might include submit_order, validate_cart, and charge_payment, but the LLM has to infer the connections from context that was never retrieved.

This is the same class of problem that makes vector search insufficient for relational queries in general. When your queries involve multi-hop relationships, the graph structure of your code (function calls, class inheritance, module imports) carries information that embeddings cannot capture. The approach described in GraphRAG for relational retrieval applies here: augment vector search with a graph that explicitly encodes call relationships, then traverse the graph to expand retrieved context with the functions that the matched function calls and is called by.

For most code search use cases, pure vector search with AST chunking is enough. For "explain this entire subsystem" queries, you need the graph layer.
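A minimal version of that graph layer can also come from the AST. This sketch only follows bare-name calls at the top level (no methods, no imports across modules — a real call graph needs far more), but it shows the shape of the expansion step:

```python
import ast
from collections import defaultdict

def build_call_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function to the bare function names it calls."""
    tree = ast.parse(source)
    graph = defaultdict(set)
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            for call in ast.walk(node):
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
                    graph[node.name].add(call.func.id)
    return dict(graph)

def expand(hits: set[str], graph: dict[str, set[str]]) -> set[str]:
    """One-hop expansion: add each retrieved function's callees and callers."""
    expanded = set(hits)
    for fn in hits:
        expanded |= graph.get(fn, set())  # functions it calls
        expanded |= {caller for caller, callees in graph.items() if fn in callees}
    return expanded

src = '''
def submit_order(cart):
    validate_cart(cart)
    charge_payment(cart)

def validate_cart(cart):
    pass

def charge_payment(cart):
    pass
'''
g = build_call_graph(src)
print(sorted(expand({"submit_order"}, g)))
# ['charge_payment', 'submit_order', 'validate_cart']
```

After expansion, you fetch the chunks for every function in the expanded set, so the LLM sees the whole call chain instead of isolated fragments.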

Practical Notes

A few things that bite people when first setting this up:

Chunk size matters more than you think. Large functions that exceed your embedding model's context window get truncated. For functions over ~300 lines, consider creating a summary chunk (signature + docstring + first few lines) in addition to the full chunk.

Re-indexing is fast enough to be automatic. Embedding a 50k-line codebase with nomic-embed-text takes a few minutes on CPU. Set up a file watcher and re-index changed files on save. Stale indexes produce confusing results.
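A stdlib-only polling sketch of that watcher idea (a real setup might use a library like watchdog instead): track each file's mtime and re-embed only what changed since the last pass.

```python
import tempfile
from pathlib import Path

def changed_files(root: str, last_index: dict[str, float]) -> list[Path]:
    """Return .py files modified since the last indexing pass,
    updating the mtime record as a side effect."""
    stale = []
    for path in Path(root).rglob("*.py"):
        mtime = path.stat().st_mtime
        if mtime > last_index.get(str(path), 0.0):
            stale.append(path)
            last_index[str(path)] = mtime
    return stale

# Demo: a fresh file shows up once, then the index is considered current.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "a.py").write_text("x = 1")
    seen: dict[str, float] = {}
    first = changed_files(root, seen)
    second = changed_files(root, seen)
    print([p.name for p in first], second)  # ['a.py'] []
```

Call this on a timer (or on editor save) and feed the stale list straight into `index_python_file`, deleting the old IDs for those files first.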

Use metadata filters for scoped search. Chroma and most vector stores support filtering by metadata. "Find authentication logic" becomes more precise when you can restrict to chunk_type == "function" and exclude test files.

Model choice for generation. llama3.1:8b is a reasonable default. For code-specific tasks, deepseek-coder and codestral are worth testing. They have stronger code understanding and tend to give more precise answers about what a function does.

The full setup takes an afternoon and runs entirely on your machine. No API keys, no data leaving your network, no costs per query. For internal tooling and developer productivity, that tradeoff is usually worth it.
