sift is a local Rust CLI for document retrieval. Point it at a directory, ask a question, and it runs a full hybrid search pipeline: BM25, dense vector retrieval, fusion, and optional reranking, returning ranked results. No daemon. No background indexer. No cloud. One binary.
It's built for agents and developers who need reliable, repeatable search over raw codebases, docs, and mixed-format corpora without spinning up infrastructure to get there.
You can install it now on macOS, Windows, and Linux.
The retrieval pipeline
Every query runs through four stages:
- Expansion — query variants are generated to broaden recall before retrieval begins.
- Retrieval — BM25 (keyword), phrase match, and dense vector retrieval run against the corpus. Each method captures different signal.
- Fusion — results are merged using Reciprocal Rank Fusion (RRF), balancing signal across retrieval methods without manual weight tuning.
- Reranking — optional local LLM reranking via Qwen applies semantic disambiguation on the fused candidate set.
Each stage is independently tunable. Skip the vector pass if you only need BM25 speed. Run the full stack for best precision.
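The fusion stage is simple enough to show in full. Here is a minimal, self-contained sketch of Reciprocal Rank Fusion in Rust; the function name and signature are illustrative, not sift's actual API. Each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in, with k (commonly 60) damping the influence of top ranks:

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: merge several ranked lists into one,
/// with no per-method weight tuning. Hypothetical helper, not sift's API.
fn rrf_fuse(ranked_lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in ranked_lists {
        for (rank, doc) in list.iter().enumerate() {
            // The RRF formula uses 1-based ranks.
            *scores.entry(doc.to_string()).or_insert(0.0) += 1.0 / (k + (rank + 1) as f64);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}
```

Because scores depend only on ranks, a document that places modestly in both the BM25 list and the vector list can outrank one that tops a single list, which is exactly the cross-signal balancing described above.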
Architecture
The implementation is split into domain and adapters: domain objects model search plans, candidates, and scoring outputs; adapters implement the concrete BM25, phrase, vector, and reranking backends. A shared search service executes the
same strategy model for CLI, benchmark, and eval flows — nothing changes between a dev run and a CI eval pass.
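The domain/adapter split described above can be sketched as a port trait the domain depends on, with concrete backends plugged in behind it. All names here (`Retriever`, `SearchService`, the stub adapter) are hypothetical illustrations of the pattern, not sift's real types:

```rust
/// A candidate result produced by any retrieval backend.
struct Candidate { doc_id: String, score: f64 }

/// Port: the domain only sees this trait. BM25, phrase, vector,
/// and reranking backends are adapters implementing it.
trait Retriever {
    fn retrieve(&self, query: &str, limit: usize) -> Vec<Candidate>;
}

/// Shared service: the same execution path serves CLI, benchmark,
/// and eval flows, so a dev run and a CI eval pass behave identically.
struct SearchService { retrievers: Vec<Box<dyn Retriever>> }

impl SearchService {
    fn run(&self, query: &str, limit: usize) -> Vec<Candidate> {
        let mut all = Vec::new();
        for r in &self.retrievers {
            all.extend(r.retrieve(query, limit));
        }
        all.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap());
        all.truncate(limit);
        all
    }
}

/// Stub adapter standing in for a real backend.
struct StubRetriever;
impl Retriever for StubRetriever {
    fn retrieve(&self, _query: &str, limit: usize) -> Vec<Candidate> {
        (0..limit)
            .map(|i| Candidate { doc_id: format!("d{i}"), score: 1.0 / (i as f64 + 1.0) })
            .collect()
    }
}
```

Swapping a backend means implementing one trait; the service, benchmarks, and evals never change.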
Performance is local-first by design:
- SIMD-accelerated dot-product for vector scoring on CPU-heavy workloads.
- Zig-inspired incremental cache — a two-layer design borrowed from Zig's build system. A manifest store tracks filesystem metadata (inode, mtime, size) mapped to BLAKE3 content hashes, so sift knows exactly which files have changed without re-reading them. A content-addressable blob store holds pre-extracted text, pre-computed BM25 term frequencies, and pre-embedded dense vectors — meaning repeat queries never touch the neural network at all. Identical files across different projects share a single blob entry. The result: search performance bounded by dot-product speed, not inference latency.
- Per-query embedding reuse across multi-stage pipelines.
- Mapped I/O and tight tokenization hot loops to keep latency low on large corpora.
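The two-layer cache idea is worth a sketch. sift uses BLAKE3; to keep this example dependency-free, std's `DefaultHasher` stands in for it, and a plain `String` stands in for the real blob payload (extracted text, term frequencies, vectors). All names here are illustrative:

```rust
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for BLAKE3 (sift hashes content with BLAKE3; std's hasher
// keeps this sketch self-contained).
fn content_hash(text: &str) -> u64 {
    let mut h = DefaultHasher::new();
    text.hash(&mut h);
    h.finish()
}

/// Filesystem metadata tracked per path (sift also tracks inode).
#[derive(Clone, Copy, PartialEq)]
struct Meta { mtime: u64, size: u64 }

#[derive(Default)]
struct Cache {
    /// Layer 1: manifest maps path -> (metadata, content hash), so
    /// unchanged files are detected without re-reading them.
    manifest: HashMap<String, (Meta, u64)>,
    /// Layer 2: content-addressable blobs; identical files anywhere
    /// share a single entry.
    blobs: HashMap<u64, String>,
}

impl Cache {
    /// Returns (was_cache_hit, content_hash). `read` runs only when
    /// the manifest says the file changed.
    fn get_or_extract(&mut self, path: &str, meta: Meta, read: impl Fn() -> String) -> (bool, u64) {
        if let Some((m, h)) = self.manifest.get(path) {
            if *m == meta {
                return (true, *h); // hit: no file I/O, no inference
            }
        }
        let text = read();
        let h = content_hash(&text);
        self.blobs.entry(h).or_insert(text); // dedupe by content
        self.manifest.insert(path.to_string(), (meta, h));
        (false, h)
    }
}
```

Under this scheme a repeat query over an unchanged corpus never re-extracts or re-embeds anything, which is what bounds query latency by dot-product speed rather than inference.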
One concrete tradeoff during development: lowering embedding max_length from 48 to 40 recovered latency budget while keeping quality above the BM25 baseline — a good example of how evidence-driven tuning beats guesswork.
Full internals are documented in ARCHITECTURE.md.
Evaluation
Comparative strategy run over 5,185 SciFact documents (~7.8 MB) on an AMD Ryzen Threadripper 3960X:
| Strategy | nDCG@10 | MRR@10 | Recall@10 | p50 (ms) |
|---|---|---|---|---|
| bm25 | 0.7262 | 0.7000 | 0.8000 | 5.41 |
| legacy-hybrid | 0.7893 | 0.7250 | 1.0000 | 50.29 |
| page-index | 0.7000 | 0.6667 | 0.8000 | 16.79 |
| page-index-hybrid | 0.5701 | 0.4367 | 1.0000 | 41.09 |
| page-index-llm | 0.7893 | 0.7250 | 1.0000 | 41.28 |
| page-index-qwen | 0.7893 | 0.7250 | 1.0000 | 41.18 |
| vector | 0.8262 | 0.7667 | 1.0000 | 25.94 |
A few things worth noting:
- BM25 at 5.41ms p50 is the right default for latency-constrained cases where keyword recall is sufficient.
- Vector achieves the best nDCG@10 (0.8262) and perfect recall at 25.94ms — the most balanced strategy for most workloads.
- LLM reranking (page-index-llm, page-index-qwen) matches legacy-hybrid quality at comparable speed, validating the local Qwen path as a practical alternative to heavier hybrid pipelines.
- page-index-hybrid is the only strategy that underperforms BM25 on nDCG — a useful reminder that adding complexity doesn't always improve quality.
Cache hit rates (100/0/100%) confirm the caching layer is working correctly across all strategies. Verbose output (-v, -vv) surfaces cache hit rates, phase timings, and ranking metadata directly in the CLI.
Why this matters for agents
For agents, latency and reliability are requirements, not nice-to-haves. Tooling loops fail hard when search is slow, drops context, or depends on services that may be unavailable.
sift removes that friction: retrieval is local, deterministic, and cheap to repeat. No daemon to health-check. No embedding service to rate-limit against. No cloud dependency to manage. The binary ships with Homebrew and static Linux
artifact support, so agents can rely on a pinned version without environment drift.
How it was built
This shipped in a focused, nearly uninterrupted 24-hour push — implementation, eval design, benchmarking, performance tightening, packaging, and release prep in one sustained flow. Every major unit had acceptance criteria and measurable
evidence attached before it was marked done.
What made that pace possible is something I'm not ready to talk about in detail yet. But sift is the first real proof that it works at speed, under real constraints, without cutting corners. More on that soon.
Get started
- README — installation and basic usage
- CONFIGURATION — strategy and model settings
- EVALUATION — running your own corpus evals
- ARCHITECTURE — internals deep dive
Code: github.com/rupurt/sift