
Tim Uy
How we built a hybrid FTS5 + embedding search for code — and why you need both

srclight is a deep code indexing MCP server — it gives AI agents understanding of your codebase (symbol search, call graphs, git blame, semantic search) in a single pip install.

When you're building AI coding assistants, you need search that works two ways:

  1. Keyword search — I know the function name, find it now
  2. Semantic search — find code that "handles authentication" without knowing the exact term

Most tools pick one. We built both.

The problem with pure keyword search

FTS5 is great for exact matches. But code has naming conventions: calculateTotalPrice, calculate_total_price, CalculateTotalPrice. A single FTS5 index can't handle all of these well.

And sometimes you don't know the name at all. You want to find "code that validates user input" — that's a concept, not a keyword.

The problem with pure embedding search

Embeddings are great for meaning. But they struggle with:

  • Exact symbol names (searching for handleAuth should find handleAuth)
  • Substring matches (searching for parse should find parseJSON)
  • Short queries (embeddings need context)
  • Naming conventions

Our solution: 4 indexes + RRF fusion

We built three FTS5 indexes, each tuned differently, plus one embedding index:

1. Symbol names index (unicode61 tokenizer)

Splits on case changes and underscores:

calculateTotalPrice → calculate, Total, Price
handle_user_auth → handle, user, auth

This catches CamelCase, snake_case, and any convention developers throw at it.
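The splitting described above can be sketched in a few lines of Python. This is a hedged illustration of the tokenization idea, not srclight's actual tokenizer (`split_identifier` is a hypothetical name):

```python
import re

def split_identifier(name: str) -> list[str]:
    """Split a code identifier on underscores and camelCase boundaries."""
    parts = []
    for chunk in name.split("_"):  # snake_case first
        # Then case transitions: runs of capitals, Capitalized words, digits.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return parts

print(split_identifier("calculateTotalPrice"))  # ['calculate', 'Total', 'Price']
print(split_identifier("handle_user_auth"))     # ['handle', 'user', 'auth']
print(split_identifier("parseJSONResponse"))    # ['parse', 'JSON', 'Response']
```

Once identifiers are broken into these pieces, a query for `total` or `auth` hits the right symbols regardless of the convention the author used.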

2. Source content index (trigram tokenizer)

Indexes every 3-character substring. This catches substring matches even inside words.
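To see why trigrams catch substrings, here is a minimal sketch of the idea in plain Python (the real work happens inside SQLite's trigram tokenizer, which also lowercases by default):

```python
def trigrams(text: str) -> set[str]:
    """Every 3-character substring -- the unit a trigram index stores."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

# A query matches when all of its trigrams appear in the indexed text.
doc = trigrams("parseJSON")  # {'par', 'ars', 'rse', 'sej', 'ejs', 'jso', 'son'}
query = trigrams("parse")    # {'par', 'ars', 'rse'}
print(query <= doc)          # True: 'parse' matches inside 'parseJSON'
```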

3. Docstrings index (porter stemmer)

Stems words to their roots: "running" and "runs" both reduce to "run". This makes docstring search actually useful.
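A toy suffix-stripper shows the flavor of what stemming does. This is NOT the real Porter algorithm (which has many more rules and conditions); it only illustrates why "running" and "runs" end up in the same index entry:

```python
def toy_stem(word: str) -> str:
    """Toy suffix-stripper -- illustration only, not the Porter algorithm."""
    for suffix in ("ing", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: len(word) - len(suffix)]
            break
    # Collapse a doubled final consonant left behind ("runn" -> "run").
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

print(toy_stem("running"))  # run
print(toy_stem("runs"))     # run
print(toy_stem("runner"))   # run
```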

4. Embeddings (via Ollama)

Semantic vectors for meaning-based matching. We use qwen3-embedding (4096 dims) or nomic-embed-text (768 dims).
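At query time, an embedding index ranks candidates by vector similarity, typically cosine. A minimal sketch with toy 4-dim vectors standing in for the 768/4096-dim model output (the names and numbers here are made up for illustration):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "embeddings" for two indexed symbols.
index = {
    "validate_user_input": [0.9, 0.1, 0.0, 0.2],
    "render_html_table":   [0.0, 0.8, 0.3, 0.1],
}
query = [0.8, 0.2, 0.1, 0.3]  # e.g. embedding of "code that checks user input"

ranked = sorted(index, key=lambda name: cosine(query, index[name]), reverse=True)
print(ranked[0])  # validate_user_input
```

This is what lets a conceptual query land on the right function even when no keyword overlaps.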

The secret sauce: Reciprocal Rank Fusion

Here's how we combine them. We run each query against all 4 indexes, get ranked results, then merge using RRF:

RRF_score(d) = Σ_i 1 / (k + rank_i(d))

summed over each index i in which document d appears,

where k = 60 (standard constant).

A result appearing at rank 1 in FTS5 and rank 2 in embeddings gets:

  • FTS5: 1 / (60 + 1) = 0.0164
  • Embeddings: 1 / (60 + 2) = 0.0161
  • Total: 0.0325

A result at rank 10 in embeddings only gets: 1 / (60 + 10) = 0.0143

This means exact matches can still win even if embeddings also match — and vice versa. You get the best of both worlds.
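The fusion step above fits in a few lines. A minimal sketch of RRF, assuming each index returns a ranked list of document IDs (the file names here are hypothetical):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Merge several ranked result lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

fts5_hits      = ["auth.py:handleAuth", "login.py:login", "user.py:validate"]
embedding_hits = ["session.py:check_token", "auth.py:handleAuth"]

for doc, score in rrf_fuse([fts5_hits, embedding_hits]):
    print(f"{score:.4f}  {doc}")
```

`auth.py:handleAuth` wins here with 1/61 + 1/62 ≈ 0.0325, exactly the worked example above: appearing in both lists beats a top spot in just one.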

But wait, there's more

We also built:

  • GPU vector cache: Embeddings loaded to VRAM once (~300ms cold), then ~3ms per query via CuPy
  • Incremental indexing: Only re-index changed symbols (tracked via content hash)
  • Git intelligence: Query "what changed recently?" → git blame, hotspots, uncommitted WIP
  • Multi-repo workspaces: SQLite ATTACH+UNION across 10+ repos
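The ATTACH+UNION pattern is plain SQLite and easy to demo. A hedged sketch with a made-up `symbols` schema (srclight's real schema is richer):

```python
import os
import sqlite3
import tempfile

# Build two standalone per-repo index files (hypothetical schema).
tmp = tempfile.mkdtemp()
for repo in ("frontend", "backend"):
    con = sqlite3.connect(os.path.join(tmp, f"{repo}.db"))
    con.execute("CREATE TABLE symbols (name TEXT, repo TEXT)")
    con.execute("INSERT INTO symbols VALUES (?, ?)", (f"{repo}_main", repo))
    con.commit()
    con.close()

# One connection, many databases: ATTACH each extra repo, UNION across all.
con = sqlite3.connect(os.path.join(tmp, "frontend.db"))
con.execute("ATTACH DATABASE ? AS backend", (os.path.join(tmp, "backend.db"),))
rows = con.execute(
    "SELECT name, repo FROM symbols "
    "UNION ALL SELECT name, repo FROM backend.symbols"
).fetchall()
print(rows)  # one row per repo
con.close()
```

One query plan, no cross-process coordination, and each repo keeps its own index file on disk.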

Why not just use Elasticsearch?

We wanted something that installs in one command:

pip install srclight
srclight index --embed qwen3-embedding
srclight serve

No JVM, no Docker, no Redis, no cloud. Your code never leaves your machine.

Results

We index 13 repos (45K symbols) in a workspace. Claude Code goes from ~20 tool calls per task to about 6 — because it can just ask "who calls this?" instead of grepping 10 times.

The hybrid search is the key. Keyword matches for precision, embeddings for recall. RRF fusion brings them together.


What search challenges are you running into with AI coding assistants? Drop a comment — I'd love to hear what's blocking you.
