Swapnanil Saha

Posted on Jun 9 • Originally published at swapnanilsaha.com

Building Vectr, Part 1: Why grep Fails When You Don't Know the Keywords

#mcp #semanticsearch #developertools #codeindexer

This is Part 1 of the Building Vectr series (1 of 3).

You get dropped into an unfamiliar codebase. Not a toy project — real production code, 8,000 files, three years of accumulated complexity and clever abstractions. Your job is to fix a bug in the request validation pipeline. What does an AI code editor do next?

This post is about a problem I kept running into, a tax I kept paying, and the indexing system I built to eliminate it. It covers the technical decisions behind Vectr's search layer: why naive chunking produces bad embeddings, how tree-sitter solves the code-parsing problem, what BM25 does that vector search can't, and why you need a symbol graph for questions that text search cannot answer at all.

Part 1 — The Problem

The Re-discovery Tax

If you're a human engineer navigating an unfamiliar codebase, here's what you probably do: you ask someone who knows it, or you grep for the error message, or you open the entry point and follow imports until you find the thing. Your brain does semantic compression the whole way — building a model of the system, discarding noise, following intuitions about where complexity tends to live. By the time you've read 20 files, you have a rough map that persists across days and sessions.

An AI code editor has the same tools — read files, run shell commands, grep — but completely different economics. Every Read call costs tokens. Every Bash call for grep costs a turn. Unlike a human who can skim-read at 1,000 words per minute and discard irrelevant content almost for free, an AI editor pays full price for every character it reads: it sits in the context window whether or not it was useful. Read the wrong 500-line file and you've burned context that could have held the answer.

The result, on unfamiliar codebases, is what I started calling the re-discovery tax: a cluster of navigation calls at the start of every session, before any actual implementation begins, spent on figuring out where things are. And because AI editors have no persistent memory between sessions, they pay this tax again and again — every session, on the same codebase.

In benchmarks I ran against real open-source codebases (more detail in Part 3), the re-discovery tax on CPython internals ranged from 6 to 23 tool calls per task before the first file write. Some sessions spent more turns navigating than implementing.

Key observation: The re-discovery tax is paid every session, not once. A human engineer's mental map of a codebase accumulates and compounds. An AI editor's map is fully rebuilt from scratch at the start of each session. The economic gap widens as the codebase grows.

Why grep Fails at the Boundary of Your Knowledge

Before explaining what I built, I want to be precise about where grep breaks down — because "just use grep" is the natural reaction, and it's not obviously wrong until you try to use it systematically on unfamiliar code.

grep is a brilliant tool for confirming hypotheses you already have. If you know what you're looking for, it's nearly perfect. The problem is the case that isn't really an edge case: you don't know what you're looking for.

Say you're trying to understand how a Django application validates incoming JSON payloads before they hit the ORM layer. You might grep for validate. You'll get 200 results across 40 files — field validators, form validators, configuration validators, test fixtures. None of them are obviously the thing you want. You grep for json.loads. You get 30 results. You grep for request.data. That gets you closer, maybe. But you spent four greps and 15 minutes before you found the right file.

The deeper problem: grep requires you to already have a mental model of the codebase's naming conventions. An AI editor running on an unfamiliar codebase doesn't know whether payload validation is called validate_payload, check_request, parse_input, or _pre_process.

Analogy: Think of keyword search as asking for directions by street name in a city you've never visited. "Where is Maple Street?" gets a precise answer. But "where is the street with the good coffee shop near the park?" — keyword search has nothing to offer. You need a different kind of index: one that understands what places are for, not just what they're called.

Semantic search inverts this. It maps your query and every code chunk into the same high-dimensional vector space, then finds the chunks closest to your query by meaning — regardless of whether they share any words. "JWT validation logic" finds verify_token even if neither of those words appears in the function body.

Part 2 — Building the Index

The Chunking Problem: Why Line Windows Break on Code

Prose text has a natural unit of meaning: the paragraph. You can split a Wikipedia article into 200-word chunks, embed each one, and get a reasonable search system. Code doesn't work this way.

The standard naive approach for code indexing is the same line-window strategy borrowed from document search: take a sliding window of N lines with M lines of overlap, create a chunk, embed it, move the window. A common default might be 150-line windows with 50 lines of overlap. Simple, language-agnostic, works on any file format.

The problem is what happens at the window boundaries. Consider this function:

def process_workspace_changes(
    path: Path, db: Database, *, force: bool = False
) -> list[ChangeResult]:
    """Process all pending changes in a workspace, optionally forcing re-indexing."""
    pending = db.get_pending_changes(path)
    if not pending and not force:
        return []

    results = []
    for change in pending:
        if change.kind == ChangeKind.DELETED:
            db.remove_chunks_for_file(change.file)
            results.append(ChangeResult(file=change.file, status="removed"))
        elif change.kind in (ChangeKind.MODIFIED, ChangeKind.CREATED):
            chunks = chunk_file(change.file, db.language_for(change.file))
            db.upsert_chunks(chunks)
            results.append(ChangeResult(
                file=change.file, status="indexed", chunk_count=len(chunks)
            ))

    db.mark_changes_processed(path)
    return results

If a 150-line window happens to cut through this function, neither resulting chunk is independently meaningful. The chunk with just the body is missing the parameter names and return type. The chunk with just the signature has no implementation context. The embedding of a half-function is significantly worse than the embedding of the complete thing.

The fix: split at semantic boundaries. Functions should be complete units. Classes should contain their methods, or each method should be its own chunk with the class header prepended for context.

Why completeness matters: An embedding model compresses everything in its context into a single fixed-size vector. A complete function gives the model everything it needs to capture the function's purpose, parameters, return behavior, and side effects in that vector. A half-function forces the model to compress an ambiguous fragment — the resulting vector is a blurred average of possible interpretations.

Parsing Code with tree-sitter

tree-sitter is a parser library that produces concrete syntax trees for source code — every construct in the language has a named node with exact byte boundaries in the source. Unlike a regex-based approach, tree-sitter actually parses the grammar and handles edge cases correctly: nested functions, decorators on multiple lines, multiline function signatures, arrow functions in JavaScript, generic bounds in Rust.

For Python, the tree-sitter query:

(function_definition
  name: (identifier) @name
  parameters: (parameters) @params
  body: (block) @body) @function

(class_definition
  name: (identifier) @name
  superclasses: (argument_list)? @bases
  body: (block) @body) @class

This matches any function or class definition anywhere in the file and captures the name, parameters, and body as named nodes with precise byte-range positions. You can then slice the original source file at those byte positions to extract complete, syntactically valid chunks.

For classes, Vectr attaches the full class signature — including the base class list captured by @bases — as a header to each method chunk. So the chunk for WorkspaceLock.acquire() includes its inheritance context. A method of AuthenticatedView(LoginRequiredMixin, View) has a meaningfully different semantic context than a method of a plain View.

A subtlety: very large functions. AST-aware chunking breaks down for functions that are genuinely enormous — 500+ lines. Vectr handles this by further splitting large functions at their major control-flow boundaries (default threshold: 200 lines). The resulting sub-chunks each include the function signature as a header to preserve context. Their embedding quality is better than one giant embedding, though still lower than a naturally small function.

Code-Specific Embeddings Running Locally

Not all embedding models are equally good at code. Models trained primarily on prose text have learned representations of natural language semantics. Code has different regularities: symbol names, type signatures, control flow patterns, API call chains. Code-aware models routinely outperform general-purpose models by 10–20% on tasks like "find the function that handles X."

Vectr uses Snowflake/snowflake-arctic-embed-m-v1.5, a 110-million-parameter model that produces 768-dimensional embedding vectors and runs in under 100ms per batch on a modern laptop CPU.

Why local inference instead of an API? Two practical constraints:

Cost: a tool that fires 20–50 search calls per session would accumulate non-trivial API costs quickly. Local inference is free at query time after the one-time model download.
Data privacy: many codebases cannot be sent to third-party APIs. Internal tools, proprietary algorithms, customer data models — many organizations have policies or contractual obligations that prohibit sending source code to external services.

The tradeoff: the model weighs roughly 440MB and needs to be downloaded on first run. This is a real friction point.

One critical detail: queries and chunks are embedded with different input prefixes. Queries use Represent this query for searching relevant code:, chunks use Represent this code snippet:. arctic-embed-m is a single encoder, but it was trained with different prefixes for query-side and document-side inputs. Using the wrong prefix reduces the cosine similarity between semantically related query-chunk pairs — the vectors for "user authentication" and verify_token end up further apart in embedding space than they should be. Getting this wrong costs 5–15% in retrieval quality.

Part 3 — The Search Layer

Hybrid Search: Why BM25 and Vector Search Need Each Other

Vector search handles concept queries well. But if you search for _handle_workspace_lock_conflict — an exact function name — a vector search might not rank it first. The embedding is just one point in a crowded neighborhood of similar-looking function names. BM25, on the other hand, will find it immediately: exact string matches get the highest possible score.

The inverse is also true: BM25 cannot find "retry logic with exponential backoff" if the function is called _schedule_attempt_with_delay and its docstring says nothing about backoff. Zero keyword overlap means zero BM25 score. Vector search finds it because the semantic cluster it belongs to is close to the query in embedding space.

The right system uses both. Every query in Vectr runs both a vector search and a BM25 search in parallel, then combines the two ranked lists using a weighted formula.

BM25 scoring formula:

score(D, Q) = Σᵢ IDF(qᵢ) · [ tf(qᵢ, D) · (k₁ + 1) ] / [ tf(qᵢ, D) + k₁ · (1 − b + b · |D| / avgdl) ]

IDF(qᵢ) = log( (N − nᵢ + 0.5) / (nᵢ + 0.5) )

Where:

tf(qᵢ, D) — term frequency of qᵢ in document D
N — total documents; nᵢ — documents containing qᵢ
|D| — document length in tokens; avgdl — average document length
k₁ = 1.5 (term-frequency saturation), b = 0.75 (length normalization)

This is the Robertson–Sparck Jones variant. Some implementations add +1 inside the IDF log to prevent negative values for very common terms.

The weight assigned to each approach depends on codebase familiarity:

Situation	BM25 weight	Vector weight
Large unfamiliar codebase	0.2	0.8
Small familiar codebase	0.7	0.3
Explicit symbol name in query	0.8	0.2
Natural language concept query	0.2	0.8

These weights are the actual values used in Vectr's implementation, tuned against the benchmark dataset.

The benchmark on Apache Camel (58,000+ Java files) showed a 73% reduction in Read+Bash navigation calls compared to the baseline AI editor with no index.

The Symbol Graph: What Text Search Cannot Answer

Semantic search and BM25 handle "find me the code for this concept" well. But there's a different navigation pattern that neither handles: "find me everything that calls this function."

Vectr builds a symbol graph during indexing. For each file, tree-sitter extracts:

Definitions — every function, class, method, and module-level constant with name and line number
Call edges — every call site, mapping callee name to the calling function's context
Import edges — every import statement, mapping the imported symbol to its likely source module
HTTP routes — Flask/FastAPI @router.get(), Express app.post(), Spring @GetMapping — extracted as named symbols

The resulting graph enables exact lookups. vectr_locate("WorkspaceLock") returns a file path and line number in under 10ms — no embedding, no ranking, pure symbol table lookup. vectr_trace("acquire_lock") returns all callers and all callees in one round-trip. These are not search results — they are graph traversals, and they produce exact answers rather than relevance rankings.

Text search vs. graph traversal: These are not competing approaches — they answer different questions. "Find code that does X" is a search problem. "Find who calls Y" or "find where Z is defined" is a graph traversal problem. Relying only on text search for definition lookups is like looking up a phone number by describing the person rather than looking them up by name.

Six Fallback Strategies in vectr_locate

vectr_locate runs six fallback strategies in sequence, stopping at the first match:

Exact match — direct lookup in the symbol table. Sub-millisecond. Highest confidence.
Suffix match — Lock matches WorkspaceLock, AcquireLock, LockManager.
Same-module priority — if a caller file is provided, search definitions within the same module first.
Unique name — if there is exactly one symbol across the entire codebase whose name contains your query string, return it.
Import chain follow — follow import statements from a given file to find where the name likely comes from.
Fuzzy (Levenshtein ≤ 2) — edit distance ≤ 2 across all symbol names. Catches typos. Lowest confidence.

Each strategy produces a LocateResult with a resolution_strategy field. An exact match means you can act on the result immediately. A fuzzy match with edit distance 2 means you should verify before relying on it. A silent wrong navigation is worse than no navigation at all.

Part 4 — The Runtime Layer

mtime Cache and Incremental Re-indexing

The first time you run vectr start on a large codebase, indexing takes time. CPython's 4,000+ files: about 8 minutes. Django's ~1,800 Python files: about 2 minutes. Apache Camel's 58,000+ Java files: closer to 45 minutes.

During initial indexing, Vectr writes a file at ~/.cache/vectr/{hash}/index_cache.json that stores the modification timestamp of every indexed file. The {hash} is a short SHA-256 hash of the absolute workspace root path. On subsequent runs, only files whose mtime has changed are re-indexed. On a typical active session where you've modified 5–10 files, subsequent re-indexing takes under 5 seconds.

Handling deletions: Vectr also stores the complete set of indexed file paths. At startup, it diffs this set against the current file tree and removes all chunks belonging to deleted files before re-indexing modified ones. Process deletions first, then updates, then new files — this prevents a renamed file from leaving orphaned chunks in the index.

The watchdog listener: During an active session, Vectr runs a watchdog filesystem listener on the workspace root. When a file is saved, the listener queues it for re-indexing in the background. Events are debounced at 300ms — only the last write in a burst counts. Without debouncing, a single save in a project using aggressive auto-formatting would trigger 3–5 redundant re-index operations.

.vectrignore: Keeping the Index Clean

Vectr reads a .vectrignore file from the workspace root using glob patterns. The syntax follows .gitignore conventions — trailing slash for directories, * for single-level wildcard, ** for recursive match (via Python's pathlib.Path.match()) — but Vectr does not implement the full gitignore specification: the ! negation prefix is not supported.

vendor/
node_modules/
dist/
*.pb.go        # generated protobuf Go files
*.min.js       # minified JavaScript
__pycache__/
.venv/
coverage/
*.snap         # Jest snapshots
migrations/    # Django database migrations

A codebase with node_modules/ will typically contain 5–20x more code from installed packages than from the project itself. Excluding vendor directories before the initial index run is the single most impactful configuration change most users can make.

What Actually Happens When You Call vectr_search

1. Query string is embedded using arctic-embed-m with query prefix
   → 768-dimensional float vector, ~15ms on CPU

2. Vector similarity search against ChromaDB store
   → Top-20 chunks by cosine similarity, with scores

3. Same query runs through BM25 index (rank-bm25, in-memory)
   → Top-20 chunks by BM25 score, with scores

4. Two ranked lists are merged
   → Weight BM25/vector based on codebase characterization
   → Normalized scores combined; top-N results selected (default N=5)

5. Symbol names in the query are detected (camelCase, snake_case, PascalCase)
   → If found: also run vectr_locate as a side channel
   → Merge symbol lookup results into final output if relevant

6. Final top-N chunks returned with:
   file path, start line, end line, matched text, search method

Result for vectr_search("workspace lock acquisition and release"):

[1] resolver.rs:214 — WorkspaceLock::acquire()
    Acquires the workspace-scoped lock. Blocks if another process holds it.

[2] resolver.rs:267 — WorkspaceLock::release()
    Releases the workspace-scoped lock. Validates that the current process
    holds the lock before releasing (returns Err if not held).

[3] workspace.py:89 — _acquire_workspace_lock(path)
    Context manager: acquires, yields, releases on exit.

Instead of reading 15 files to find these three functions, the AI editor reads one search result.

Part 5 — Design Decisions I'd Make Differently

The Python 3.14 requirement. The codebase uses match/case pattern matching extensively and some asyncio patterns that behave differently in earlier versions. In retrospect, 3.11 would probably work with a few hours of refactoring. The 3.14 requirement has been the single biggest adoption friction.

ChromaDB as the vector store. A vector store handles embedding persistence and similarity search. ChromaDB works, but the full HNSW index with persistence, the Python client layer, and the inter-process communication overhead add about 200ms specifically to ChromaDB's startup contribution — not total Vectr startup (~280ms including mtime diffing and watchdog initialization). For v2, I'd consider a lighter in-process option.

The BM25 implementation. The rank-bm25 library is pure Python and fast enough for codebases under 50,000 chunks. Beyond that, it starts to show latency. The right long-term solution is integrating BM25 scoring directly into the vector store query pipeline. For current use cases (most codebases are under 20K chunks), it's fine.

Conclusion

The indexing layer is the foundation, not the product. What it enables is an AI code editor that can navigate a large unfamiliar codebase as efficiently as a human engineer who has worked in it for months — finding the right functions in one or two calls instead of fifteen.

But the index tells you where things are. It doesn't tell you why things are the way they are — the non-obvious invariants, the patterns that emerge from reading 50 files, the bugs that were fixed by changing two lines in a place that looks completely unrelated.

That's what Part 2 addresses: a note store where an AI editor can save findings in structured, tagged form — "the lock acquisition logic is at resolver.rs:214, and it acquires an exclusive file lock using fcntl.flock, not a threading primitive" — and retrieve them in under 50ms at the start of any future session. When /compact runs and replaces the conversation with a summary, exact signatures and line numbers evaporate — but notes don't. The indexer tells you where to look. The working memory layer tells you what you already know about what you found.

Summary of core decisions

Decision	Rationale
AST-aware chunking via tree-sitter	Complete functions as the unit of meaning. Biggest quality improvement over naive line windows.
Local embeddings (arctic-embed-m)	No API cost, no data leaving the machine. One-time 440MB download.
Hybrid BM25 + vector search	Concept queries route to vector. Exact symbol names route to BM25.
Symbol graph	Definitions, call edges, import edges, HTTP routes — exact graph traversal for questions text search cannot answer.
Six fallback strategies in vectr_locate	Exact → suffix → same_module → unique_name → import_chain → fuzzy. Each result carries its resolution strategy.
mtime cache + watchdog	Sub-5-second re-indexing on subsequent runs. In-session saves trigger background re-indexing automatically.

DEV Community