DEV Community

Rafael Ferres

Building a Production-Grade Vector Database in Rust: What We Shipped

A deep-dive into the latest FerresDB updates — from HNSW auto-tuning and PolarQuant compression to Point-in-Time Recovery, cross-encoder reranking, and a distributed Raft foundation.


Over the past few months, FerresDB has grown from a focused vector search PoC into something that increasingly resembles a production system. This post walks through everything we've shipped recently — the architectural decisions, the tradeoffs, and the honest "here's why we did it this way" behind each feature.

If you're building RAG pipelines, recommendation systems, or any kind of semantic search on top of Rust, this is the kind of update post you'd want to read before picking your stack.


The Baseline: What FerresDB Already Had

Before diving into the new stuff, a quick recap of the foundation:

  • HNSW index with Cosine, Euclidean, and Dot Product metrics
  • WAL (Write-Ahead Log) with periodic snapshots and crash recovery
  • Hybrid search combining vector similarity with BM25 full-text scoring
  • Scalar Quantization (SQ8) — compressing f32 vectors to u8 for ~4× memory reduction
  • Tiered storage (Hot/Warm/Cold) backed by memory-mapped files
  • WebSocket streaming for real-time upserts and subscriptions
  • OpenTelemetry tracing with per-span attributes for every search operation
  • RBAC with API keys, roles, and granular per-collection permissions

That's the baseline. Here's what came next.


PolarQuant: A Different Approach to Vector Compression

SQ8 works by calibrating per-dimension min/max/scale parameters and mapping each f32 value to a u8. It's effective, but it has overhead: 3 × dim × 4 bytes of calibration data per index, and a calibration step that samples up to 10K vectors before the index is usable.
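For reference, SQ8-style scalar quantization can be sketched in a few lines. This is a simplified illustration, not FerresDB's actual SQ8 code — `Sq8Dim`, `calibrate`, and `quantize` are made-up names for this sketch:

```rust
// Simplified SQ8-style scalar quantization: per-dimension min/scale
// calibration over a sample set, then mapping each f32 to a u8.
// A sketch under assumed conventions, not FerresDB's implementation.
struct Sq8Dim {
    min: f32,
    scale: f32, // (max - min) / 255
}

fn calibrate(samples: &[Vec<f32>]) -> Vec<Sq8Dim> {
    let dim = samples[0].len();
    (0..dim)
        .map(|d| {
            let mut min = f32::INFINITY;
            let mut max = f32::NEG_INFINITY;
            for s in samples {
                min = min.min(s[d]);
                max = max.max(s[d]);
            }
            // Guard against zero-range dimensions.
            Sq8Dim { min, scale: ((max - min) / 255.0).max(f32::EPSILON) }
        })
        .collect()
}

fn quantize(v: &[f32], params: &[Sq8Dim]) -> Vec<u8> {
    v.iter()
        .zip(params)
        .map(|(x, p)| (((x - p.min) / p.scale).round().clamp(0.0, 255.0)) as u8)
        .collect()
}
```

The calibration pass is exactly the overhead described above: before any vector can be quantized, the per-dimension `min`/`scale` parameters must be learned from sampled data.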

PolarQuant takes a different approach. Instead of per-dimension calibration, it encodes each vector as a polar coordinate decomposition — a final radius f32 plus a sequence of angles, each quantized to u8 using fixed angular boundaries [0, 2π].

polar_encode(v) → (radius: f32, angles: Vec<u8>)
polar_decode(r, angles) → Vec<f32>   // approximate reconstruction

The key insight: angular boundaries are mathematically fixed. There's no calibration step, no per-block parameters, and no "warm-up" phase before the index is ready. You feed points in, and the index is immediately usable.

Search uses asymmetric distance: the query stays in f32, while each candidate is decoded on the fly via polar_distance_asymmetric. This preserves precision on the query side while keeping the stored data compact.
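Hyperspherical encoding with fixed u8 angle boundaries can be sketched as follows. This is a simplified illustration, not FerresDB's actual polar_encode/polar_decode — the exact angle conventions here are an assumption:

```rust
use std::f32::consts::PI;

// Sketch of polar (hyperspherical) quantization: radius stays f32,
// each angle is quantized to u8 over the fixed range [0, 2π].
// Assumed conventions, not FerresDB's actual implementation.
fn polar_encode(v: &[f32]) -> (f32, Vec<u8>) {
    let n = v.len();
    let radius = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    let mut angles = Vec::with_capacity(n.saturating_sub(1));
    for i in 0..n - 1 {
        let theta = if i == n - 2 {
            // Final angle spans the full circle: atan2 in [-π, π], shifted to [0, 2π].
            let mut t = v[n - 1].atan2(v[n - 2]);
            if t < 0.0 {
                t += 2.0 * PI;
            }
            t
        } else {
            // Remaining-tail norm against the current component, in [0, π].
            let tail: f32 = v[i + 1..].iter().map(|x| x * x).sum::<f32>().sqrt();
            tail.atan2(v[i])
        };
        // Fixed boundaries [0, 2π] quantized to u8 — no data-dependent calibration.
        angles.push(((theta / (2.0 * PI)) * 255.0).round() as u8);
    }
    (radius, angles)
}

fn polar_decode(radius: f32, angles: &[u8]) -> Vec<f32> {
    let n = angles.len() + 1;
    let thetas: Vec<f32> = angles.iter().map(|&q| q as f32 / 255.0 * 2.0 * PI).collect();
    let mut out = vec![0.0f32; n];
    // Standard hyperspherical reconstruction: each coordinate is the running
    // product of sines times the cosine of the current angle.
    let mut sin_prod = radius;
    for i in 0..n - 1 {
        out[i] = sin_prod * thetas[i].cos();
        sin_prod *= thetas[i].sin();
    }
    out[n - 1] = sin_prod;
    out
}
```

Because the boundaries are constants, encoding is a pure function of the input vector — which is exactly why there is no warm-up phase.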

The HNSW graph is built with reconstructed (polar_decode) vectors for high-quality navigation. This is the same pattern used in QuantizedHnswIndex (SQ8) — the graph navigates via approximate reconstructions, and the final re-ranking uses the asymmetric distance.

We added a Criterion benchmark, quantization_comparison, that measures build time, search latency, recall@10, and memory footprint for both SQ8 and PolarQuant at dimensions 128 and 384. The short version: PolarQuant is faster to initialize, while SQ8 tends to be slightly more accurate at high dimensions, because per-dimension calibration adapts to the actual data distribution.


Dynamic HNSW Auto-Tuning (FerresEngine)

ef_search is the main runtime knob for HNSW: higher values mean better recall at the cost of more computation. The traditional approach is to set it once at startup and leave it.

FerresEngine changes that. Every 60 seconds, the server reads the P95 latency from query_stats per collection and applies a simple feedback loop:

  • If P95 latency is low (the index has headroom), increase ef_search to improve recall
  • If P95 latency is high (CPU is under pressure), decrease ef_search to reduce load

// In collection.rs
pub fn apply_hnsw_auto_tune(&mut self, p95_ms: f64) {
    let current = self.index.current_ef_search();
    let next = if p95_ms < LOW_LATENCY_THRESHOLD_MS {
        (current + EF_STEP).min(EF_MAX)
    } else if p95_ms > HIGH_LATENCY_THRESHOLD_MS {
        (current.saturating_sub(EF_STEP)).max(EF_MIN)
    } else {
        current
    };
    if next != current {
        self.index.set_ef_search(next);
    }
}

The current ef_search_current value is exposed via GET /api/v1/collections/{name}/stats, so you can observe the tuner in action. The dashboard shows an "Optimized by FerresEngine" badge when auto-tuning is enabled.

This is a simple proportional controller, not a full PID loop. It's intentionally conservative — we'd rather miss a few percent of optimal recall than cause oscillation under variable load.


Native Cross-Encoder Re-ranking via ONNX Runtime

Two-stage retrieval is a well-established pattern in information retrieval: first retrieve a broad candidate set with a fast bi-encoder (HNSW + embeddings), then re-rank with a heavier cross-encoder that scores query-document pairs directly.

FerresDB now supports this natively via the optional rerank feature, backed by the ort crate (ONNX Runtime):

cargo build --features rerank

When a cross-encoder model is loaded, search_with_rerank works as follows:

  1. Retrieve limit × 5 candidates from HNSW (intentionally over-fetching)
  2. Score each candidate against the query using the Cross-Encoder
  3. Sort by cross-encoder score and return the top limit

The API response includes rerank_ms when re-ranking was applied, so you can measure the overhead. Models like BGE-Reranker work out of the box if exported to ONNX.

The tradeoff is obvious: cross-encoders are slower (each query-document pair requires a full model forward pass), so re-ranking adds latency proportional to limit × 5. For most RAG use cases, the quality improvement is worth it. For high-throughput, latency-sensitive workloads, you'd keep re-ranking off.
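The over-fetch-then-rescore flow can be sketched independently of the ONNX runtime. Here `score` is a hypothetical closure standing in for the cross-encoder forward pass, and `rerank_top_k` is a made-up name, not FerresDB's API:

```rust
// Two-stage retrieval sketch: take the over-fetched HNSW candidates,
// rescore each with a cross-encoder stand-in, keep the top `limit`.
fn rerank_top_k<F>(
    candidates: Vec<(String, f32)>, // (id, bi-encoder score) from stage one
    limit: usize,
    score: F, // stand-in for the cross-encoder: scores a (query, doc) pair by doc id
) -> Vec<(String, f32)>
where
    F: Fn(&str) -> f32,
{
    let mut rescored: Vec<(String, f32)> = candidates
        .into_iter()
        .map(|(id, _)| {
            let s = score(&id); // one forward pass per candidate — the latency cost
            (id, s)
        })
        .collect();
    // Sort by cross-encoder score, descending, then truncate to the requested limit.
    rescored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    rescored.truncate(limit);
    rescored
}
```

Note that the bi-encoder scores are discarded after stage one; only the cross-encoder ordering determines the final result.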


Point-in-Time Recovery (PITR)

Every WAL entry has always included a Unix timestamp. What was missing was the ability to use those timestamps for recovery.

PITR adds that:

  • On each snapshot, the server persists last_snapshot_timestamp to the collection directory
  • A new endpoint POST /api/v1/admin/restore accepts { "timestamp": <unix_sec>, "collection": "<name>?" }
  • The recovery logic loads the most recent snapshot before the target timestamp, then replays WAL entries up to (but not past) the timestamp
Timeline:  [snapshot@T1] ... [WAL entries T1→T2] ... [snapshot@T2] ... [WAL entries T2→now]
                                                                             ^
                                                                    restore target = T_target
Recovery:  load snapshot@T2 → replay WAL from T2 to T_target

GET /api/v1/admin/restore/points lists available recovery points (snapshot timestamps + WAL range) per collection, so you can choose an exact target before triggering recovery. The dashboard has a PITR UI with a datetime picker and a confirmation modal that warns that the operation restores the database state in place.

The most common use case is accidental bulk deletes — if someone upserts the wrong data or drops a namespace, you can recover to just before the bad operation without a full backup restore.
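The recovery rule — newest snapshot at or before the target, then WAL replay up to the target — reduces to a filter over timestamps. A sketch with a simplified entry type (`WalEntry` and `plan_recovery` are illustrative names, not FerresDB's types):

```rust
// Simplified WAL entry: a Unix timestamp plus an opaque operation payload.
struct WalEntry {
    ts: u64,
    op: String,
}

// PITR planning sketch: pick the newest snapshot at or before the target,
// then collect the WAL entries to replay on top of it — up to, but not
// past, the target timestamp.
fn plan_recovery(snapshots: &[u64], wal: &[WalEntry], target: u64) -> (u64, Vec<String>) {
    let base = snapshots
        .iter()
        .copied()
        .filter(|&s| s <= target)
        .max()
        .expect("no snapshot at or before target");
    let replay = wal
        .iter()
        .filter(|e| e.ts > base && e.ts <= target)
        .map(|e| e.op.clone())
        .collect();
    (base, replay)
}
```

Everything past `target` is simply never replayed — that is the whole mechanism.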


Namespace-Level Access Control

The previous RBAC model worked at the collection level: an API key could have Read/Write/Create permissions on a named collection. But in multi-tenant deployments, you often want isolation at the namespace level — a tenant should be able to search within their data partition without touching another tenant's.

NamespaceAllowance adds exactly that: API keys can now be restricted to one or more namespaces. The middleware validates the namespace from the ?namespace= query param or the X-Namespace header, and handlers enforce the allowance before any HNSW operation runs.

// PUT /api/v1/keys/:id
{
  "allowed_namespaces": ["tenant-a", "tenant-b"]
}

Keys without allowed_namespaces behave as before (full access based on role). This is backward-compatible — existing keys continue to work without modification.
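The enforcement logic itself is small. A sketch of the check the middleware performs (function name and signature assumed for illustration):

```rust
// Namespace allowance check sketch, mirroring the behavior described above:
// `allowed` is None for legacy keys (full role-based access) and
// Some(list) for keys restricted to specific namespaces.
fn namespace_allowed(allowed: Option<&[String]>, requested: &str) -> bool {
    match allowed {
        None => true, // keys without allowed_namespaces keep full access
        Some(list) => list.iter().any(|ns| ns == requested),
    }
}
```

The backward compatibility falls out of the `None` arm: an absent field means the old behavior, with no migration needed.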


Physical Namespace Isolation

Namespace-level access control handles authentication. Physical namespace isolation handles storage.

With namespace_physical_isolation enabled (via config.toml or FERRESDB_NAMESPACE_PHYSICAL_ISOLATION), points for each namespace are stored in separate directories:

data/collections/<name>/namespaces/<namespace>/points.bin
data/collections/<name>/namespaces/<namespace>/index.bin  (optional)

This means:

  • You can snapshot a single tenant's data independently
  • You can delete a tenant's physical files without touching other namespaces
  • Indexes are loaded independently per namespace

The tradeoff is storage amplification — you get multiple smaller indexes instead of one large one, which has implications for HNSW graph quality at low point counts. For most multi-tenant use cases, the isolation benefit outweighs the graph quality cost.
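The per-namespace layout shown above is just path composition. A sketch (the helper name is made up; the directory structure matches the listing above):

```rust
use std::path::PathBuf;

// Build the per-namespace points file path under physical isolation.
// Illustrative helper, not FerresDB's actual storage code.
fn points_file(data_dir: &str, collection: &str, namespace: &str) -> PathBuf {
    PathBuf::from(data_dir)
        .join("collections")
        .join(collection)
        .join("namespaces")
        .join(namespace)
        .join("points.bin")
}
```

Because each tenant lives under its own directory, per-tenant snapshot and deletion become plain filesystem operations on that subtree.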


Graph Traversal: Connecting Points as a Graph

This one is a bit different from the other features — it's less about performance and more about a new query primitive.

Each Point now has an optional relations: Vec<String> field — a list of related point IDs. Relations are bidirectional and persisted in both JSONL and WAL (Operation::Link { from, to }).

POST /api/v1/collections/{name}/points/link
{ "from": "doc-1", "to": "doc-2" }

On top of relations, there's a BFS traversal (traverse_bfs) and a new search method Collection::search_connected(query_vector, center_point_id, hops, k) that restricts the candidate set to the subgraph reachable from center_point_id within hops steps, then returns the top K by vector similarity.
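The hop-bounded BFS at the core of search_connected can be sketched with relations as an adjacency map (both directions stored, as in the bidirectional links above). Simplified names and types, not FerresDB's code:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// BFS sketch: collect every point id reachable from `center` within
// `hops` edges. The resulting set is the candidate pool that
// search_connected-style queries then rank by vector similarity.
fn traverse_bfs(
    relations: &HashMap<String, Vec<String>>,
    center: &str,
    hops: usize,
) -> HashSet<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();
    seen.insert(center.to_string());
    queue.push_back((center.to_string(), 0));
    while let Some((id, depth)) = queue.pop_front() {
        if depth == hops {
            continue; // hop budget exhausted along this path
        }
        if let Some(neighbors) = relations.get(&id) {
            for n in neighbors {
                if seen.insert(n.clone()) {
                    queue.push_back((n.clone(), depth + 1));
                }
            }
        }
    }
    seen
}
```

Restricting HNSW candidates to this set is what turns "search everything" into "search within this subgraph."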

The GET /api/v1/collections/{name}/graph/subgraph endpoint returns { nodes: [...], edges: [...] } for visualization. The dashboard has a Graph Explorer page using react-force-graph-2d, with force-directed layout, click-to-expand, and a JSON sidebar for the selected node.

The practical use cases are document-to-document links (citations, related articles), entity relationships (knowledge graphs over embeddings), and hierarchical data where you want to constrain search to a subtree.


S3 Backup and Retention Policy

Two operational features that belong together:

S3 Backup (POST /api/v1/admin/backup) generates a tar.gz snapshot of the storage directory and uploads it to a configured S3 bucket. Credentials can come from config.toml, environment variables (FERRESDB_S3_*), or the standard AWS_* variables — the aws-config crate handles credential resolution in the usual priority order.

Retention Policy adds retention_days to CollectionConfig. A background worker runs hourly and compacts the WAL per collection, removing entries older than the configured period. Setting retention_days: null keeps data indefinitely (the default). The dashboard has a Settings section where you can configure retention per collection via PATCH /api/v1/collections/{name}.

These two features are complementary: S3 backup handles disaster recovery ("restore from last known good state"), while retention policy handles data lifecycle ("we only need the last 90 days of embeddings").
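The retention compaction described above reduces to a cutoff filter over WAL timestamps. A sketch with simplified types (`compact` is an illustrative name):

```rust
// Retention sketch: keep only WAL entries at or after the cutoff.
// `retention_days: None` models `retention_days: null` — keep everything.
fn compact(
    entries: Vec<(u64, String)>, // (unix timestamp, operation payload)
    now: u64,
    retention_days: Option<u64>,
) -> Vec<(u64, String)> {
    match retention_days {
        None => entries, // no retention configured: keep data indefinitely
        Some(days) => {
            let cutoff = now.saturating_sub(days * 86_400);
            entries.into_iter().filter(|(ts, _)| *ts >= cutoff).collect()
        }
    }
}
```

Run hourly per collection, this is the whole lifecycle mechanism: old entries fall off the back of the WAL, nothing else changes.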


Cache Warmup on Startup

Cold starts are a real issue for HNSW indexes: the first few queries after a restart are slow because the index graph has to be paged into RAM and the search cache is empty.

The cache warmup feature addresses this: on startup, the server reads the last 50 queries from queries.log and replays them in a background task. This pre-loads the HNSW graph nodes that were most recently accessed and populates search_cache with likely-to-be-repeated queries.
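Selecting "the last 50 queries" is just a tail over the query log. A trivial sketch (the real worker also needs the stored query vectors to actually replay anything):

```rust
// Warmup batch sketch: take the last `n` logged queries for replay,
// handling logs shorter than `n`. Illustrative helper, not FerresDB's code.
fn warmup_batch<T: Clone>(log: &[T], n: usize) -> Vec<T> {
    log[log.len().saturating_sub(n)..].to_vec()
}
```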

The query log was extended to optionally store the full query vector (needed for replay). Tracing logs show the warmup progress: warmup: starting cache warmup, warmup: ran query {n}/{total}, warmup: cache warmup completed in {ms}ms.

The effect is visible in the first minute after restart — P95 latency returns to steady-state much faster than without warmup.


Distributed Foundation: Raft and Read Replicas

Two experimental features that lay the groundwork for horizontal scaling:

Raft consensus (--features raft, backed by openraft) adds types and a cluster status API for multi-node operation. The WAL has a replicate_then_confirm path that can be routed through Raft before confirming writes to the client. The dashboard has a Cluster page showing active nodes, the current leader, and replication status from GET /api/v1/cluster.

Read replicas (--replica-of <ADDR>) start the server in replica mode. Write endpoints (POST/PUT/DELETE on collections, points, etc.) return 405 Method Not Allowed. A WAL streaming worker (via gRPC StreamWal, enabled with --features grpc) consumes the leader's WAL and applies it locally, keeping the replica in sync. The dashboard Overview shows Role: Leader or Role: Replica.

Both features are clearly marked experimental. The Raft implementation is not battle-tested, and replica lag handling is basic. The value right now is architectural: the code paths for distributed operation exist, and they're wired up correctly. Running a single-node deployment with either feature is safe; relying on them for production multi-node deployments is not yet advised.


What's Next

A few things we're actively thinking about:

  • Product Quantization (PQ): SQ8 and PolarQuant both reduce memory, but PQ achieves better compression ratios at high dimensions by splitting vectors into subvectors and quantizing each subspace independently. The ANNIndex trait makes this straightforward to add.
  • Hybrid search improvements: Reciprocal Rank Fusion (RRF) as an alternative to the current linear combination of vector and BM25 scores.
  • Stable Raft: Getting from "foundation exists" to "actually reliable" requires a lot of failure injection testing. This is on the roadmap but not imminent.
  • Python and TypeScript SDKs: The REST API is stable; the SDK surface needs to catch up with recent features like PITR, graph traversal, and namespace isolation.
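For the curious, standard Reciprocal Rank Fusion — one of the roadmap items above — scores each document as Σᵢ 1/(k + rankᵢ), typically with k = 60. A sketch of how it would fuse a vector result list with a BM25 result list (this is the textbook formula, not FerresDB code):

```rust
use std::collections::HashMap;

// Reciprocal Rank Fusion sketch: combine several ranked id lists into
// one list ordered by summed reciprocal-rank score. Ranks are 1-based.
fn rrf(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (rank, id) in list.iter().enumerate() {
            // Contribution 1 / (k + rank), with rank starting at 1.
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + rank as f64 + 1.0);
        }
    }
    let mut out: Vec<(String, f64)> = scores.into_iter().collect();
    out.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    out
}
```

Unlike a linear combination, RRF needs no score normalization across the two retrievers — only their rank orders matter, which is its main appeal for hybrid search.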

A Note on the ANNIndex Trait

Almost every feature in this post was made easier by one early decision: the ANNIndex trait.

pub trait ANNIndex: Send + Sync {
    fn add_point(&mut self, point: &Point) -> Result<(), FerresError>;
    fn remove_point(&mut self, id: &str);
    fn search(&self, query: &[f32], k: usize, predicate: Option<&(dyn Fn(&str) -> bool + Send + Sync)>) -> Result<Vec<(String, f32)>, FerresError>;
    fn search_explain(&self, ...) -> Result<Vec<(String, f32, ExplainMeta)>, FerresError>;
    fn current_ef_search(&self) -> usize;
    fn set_ef_search(&self, v: usize);
    // ...
}

HnswIndex, QuantizedHnswIndex, and PolarQuantHnswIndex all implement this trait. The factory function create_ann_index() selects the right one based on CollectionConfig. The server, storage layer, and PITR code all work with Box<dyn ANNIndex> — they don't know or care which backend is running.
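The dispatch shape, reduced to a toy — everything here except the pattern itself is simplified (the trait, structs, and enum are illustrative stand-ins, not FerresDB's types):

```rust
// Toy version of the factory-over-trait-object pattern: callers hold a
// Box<dyn Index> and never learn which backend was constructed.
trait Index {
    fn name(&self) -> &'static str;
}

struct Flat;
struct Sq8;
struct Polar;

impl Index for Flat {
    fn name(&self) -> &'static str { "hnsw" }
}
impl Index for Sq8 {
    fn name(&self) -> &'static str { "hnsw+sq8" }
}
impl Index for Polar {
    fn name(&self) -> &'static str { "hnsw+polarquant" }
}

enum Quantization {
    None,
    Sq8,
    Polar,
}

// Stand-in for create_ann_index(): config in, boxed backend out.
fn create_index(q: Quantization) -> Box<dyn Index> {
    match q {
        Quantization::None => Box::new(Flat),
        Quantization::Sq8 => Box::new(Sq8),
        Quantization::Polar => Box::new(Polar),
    }
}
```

Adding a new backend means one new struct, one impl block, and one match arm — the rest of the system is untouched.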

When we added HNSW auto-tuning, we added set_ef_search to the trait. When we added explain, we added search_explain. Each new backend picks up the interface automatically. The #[serde(default)] on QuantizationConfig in CollectionConfig means old serialized collections deserialize correctly without migration.

If you're building a vector database or any system with pluggable backends, this is the pattern worth copying.


FerresDB is open source and built in Rust. If any of this is interesting to you — whether you want to use it, contribute, or just steal the ideas — the code is there to read.

FerresDB
