<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rafael Ferres</title>
    <description>The latest articles on DEV Community by Rafael Ferres (@rafael_ferres_0904f2af810).</description>
    <link>https://dev.to/rafael_ferres_0904f2af810</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3760694%2Fb1c45d7e-7474-427d-91a7-377c4b48b917.png</url>
      <title>DEV Community: Rafael Ferres</title>
      <link>https://dev.to/rafael_ferres_0904f2af810</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rafael_ferres_0904f2af810"/>
    <language>en</language>
    <item>
      <title>Building a Production-Grade Vector Database in Rust: What We Shipped</title>
      <dc:creator>Rafael Ferres</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:16:17 +0000</pubDate>
      <link>https://dev.to/rafael_ferres_0904f2af810/building-a-production-grade-vector-database-in-rust-what-we-shipped-1hnb</link>
      <guid>https://dev.to/rafael_ferres_0904f2af810/building-a-production-grade-vector-database-in-rust-what-we-shipped-1hnb</guid>
      <description>&lt;p&gt;&lt;em&gt;A deep-dive into the latest FerresDB updates — from HNSW auto-tuning and PolarQuant compression to Point-in-Time Recovery, cross-encoder reranking, and a distributed Raft foundation.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Over the past few months, FerresDB has grown from a focused vector search PoC into something that increasingly resembles a production system. This post walks through everything we've shipped recently — the architectural decisions, the tradeoffs, and the honest "here's why we did it this way" behind each feature.&lt;/p&gt;

&lt;p&gt;If you're building RAG pipelines, recommendation systems, or any kind of semantic search on top of Rust, this is the kind of update post you'd want to read before picking your stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Baseline: What FerresDB Already Had
&lt;/h2&gt;

&lt;p&gt;Before diving into the new stuff, a quick recap of the foundation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HNSW index&lt;/strong&gt; with Cosine, Euclidean, and Dot Product metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WAL (Write-Ahead Log)&lt;/strong&gt; with periodic snapshots and crash recovery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid search&lt;/strong&gt; combining vector similarity with BM25 full-text scoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalar Quantization (SQ8)&lt;/strong&gt; — compressing &lt;code&gt;f32&lt;/code&gt; vectors to &lt;code&gt;u8&lt;/code&gt; for ~4× memory reduction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tiered storage&lt;/strong&gt; (Hot/Warm/Cold) backed by memory-mapped files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebSocket streaming&lt;/strong&gt; for real-time upserts and subscriptions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry tracing&lt;/strong&gt; with per-span attributes for every search operation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RBAC&lt;/strong&gt; with API keys, roles, and granular per-collection permissions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the baseline. Here's what came next.&lt;/p&gt;




&lt;h2&gt;
  
  
  PolarQuant: A Different Approach to Vector Compression
&lt;/h2&gt;

&lt;p&gt;SQ8 works by calibrating per-dimension min/max/scale parameters and mapping each &lt;code&gt;f32&lt;/code&gt; value to a &lt;code&gt;u8&lt;/code&gt;. It's effective, but it has overhead: &lt;code&gt;3 × dim × 4&lt;/code&gt; bytes of calibration data per index, and a calibration step that samples up to 10K vectors before the index is usable.&lt;/p&gt;
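&lt;p&gt;As a minimal sketch (not FerresDB's actual code), per-dimension SQ8 calibration, encoding, and approximate reconstruction can look like this:&lt;/p&gt;

```rust
// Minimal per-dimension scalar quantization (SQ8) sketch.
// Calibrates min/scale per dimension from sample vectors, then maps f32 -> u8.

pub struct Sq8Params {
    mins: Vec<f32>,
    scales: Vec<f32>, // (max - min) / 255 per dimension
}

pub fn calibrate(samples: &[Vec<f32>], dim: usize) -> Sq8Params {
    let mut mins = vec![f32::INFINITY; dim];
    let mut maxs = vec![f32::NEG_INFINITY; dim];
    for v in samples {
        for d in 0..dim {
            mins[d] = mins[d].min(v[d]);
            maxs[d] = maxs[d].max(v[d]);
        }
    }
    let scales = mins
        .iter()
        .zip(&maxs)
        .map(|(lo, hi)| ((hi - lo) / 255.0).max(f32::EPSILON))
        .collect();
    Sq8Params { mins, scales }
}

pub fn encode(p: &Sq8Params, v: &[f32]) -> Vec<u8> {
    v.iter()
        .enumerate()
        .map(|(d, x)| ((x - p.mins[d]) / p.scales[d]).round().clamp(0.0, 255.0) as u8)
        .collect()
}

pub fn decode(p: &Sq8Params, q: &[u8]) -> Vec<f32> {
    q.iter()
        .enumerate()
        .map(|(d, &c)| p.mins[d] + c as f32 * p.scales[d])
        .collect()
}
```

&lt;p&gt;The per-index overhead mentioned above is visible here: &lt;code&gt;mins&lt;/code&gt; and &lt;code&gt;scales&lt;/code&gt; are data-dependent and must be computed from samples before any point can be encoded.&lt;/p&gt;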

&lt;p&gt;PolarQuant takes a different approach. Instead of per-dimension calibration, it encodes each vector as a &lt;strong&gt;polar coordinate decomposition&lt;/strong&gt; — a final radius &lt;code&gt;f32&lt;/code&gt; plus a sequence of angles, each quantized to &lt;code&gt;u8&lt;/code&gt; using fixed angular boundaries &lt;code&gt;[0, 2π]&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nf"&gt;polar_encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;radius&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;angles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;polar_decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;angles&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;  &lt;span class="c1"&gt;// approximate reconstruction&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: angular boundaries are mathematically fixed. There's no calibration step, no per-block parameters, and no "warm-up" phase before the index is ready. You feed points in, and the index is immediately usable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Search uses asymmetric distance&lt;/strong&gt;: the query stays in &lt;code&gt;f32&lt;/code&gt;, while each candidate is decoded on the fly via &lt;code&gt;polar_distance_asymmetric&lt;/code&gt;. This preserves precision on the query side while keeping the stored data compact.&lt;/p&gt;
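&lt;p&gt;To make the idea concrete, here is a deliberately simplified 2-D toy (radius plus a single angle quantized to &lt;code&gt;u8&lt;/code&gt; over fixed boundaries), not the real n-dimensional PolarQuant code:&lt;/p&gt;

```rust
// Toy 2-D polar quantization: the radius stays f32, the angle is quantized
// to u8 over the fixed range [0, 2π] -- no data-dependent calibration step.

use std::f32::consts::TAU; // 2π

pub fn polar_encode_2d(v: [f32; 2]) -> (f32, u8) {
    let radius = (v[0] * v[0] + v[1] * v[1]).sqrt();
    let angle = v[1].atan2(v[0]).rem_euclid(TAU); // normalize into [0, 2π)
    let code = (angle / TAU * 255.0).round() as u8;
    (radius, code)
}

pub fn polar_decode_2d(radius: f32, code: u8) -> [f32; 2] {
    let angle = code as f32 / 255.0 * TAU;
    [radius * angle.cos(), radius * angle.sin()] // approximate reconstruction
}

// Asymmetric distance: the query stays in full f32 precision,
// only the stored candidate is reconstructed from its compact code.
pub fn polar_distance_asymmetric_2d(query: [f32; 2], radius: f32, code: u8) -> f32 {
    let c = polar_decode_2d(radius, code);
    ((query[0] - c[0]).powi(2) + (query[1] - c[1]).powi(2)).sqrt()
}
```

&lt;p&gt;Because the angular boundaries are fixed constants, any vector can be encoded immediately, which is the "no warm-up phase" property described above.&lt;/p&gt;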

&lt;p&gt;The HNSW graph is built with reconstructed (&lt;code&gt;polar_decode&lt;/code&gt;) vectors for high-quality navigation. This is the same pattern used in &lt;code&gt;QuantizedHnswIndex&lt;/code&gt; (SQ8) — the graph navigates via approximate reconstructions, and the final re-ranking uses the asymmetric distance.&lt;/p&gt;

&lt;p&gt;We added a Criterion benchmark &lt;code&gt;quantization_comparison&lt;/code&gt; that measures build time, search latency, recall@10, and memory footprint for both SQ8 and PolarQuant at dim 128 and 384. The short version: PolarQuant is faster to initialize, while SQ8 tends to be slightly more accurate at high dimensions because its per-dimension calibration adapts to the actual data distribution.&lt;/p&gt;




&lt;h2&gt;
  
  
  Dynamic HNSW Auto-Tuning (FerresEngine)
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;ef_search&lt;/code&gt; is the main runtime knob for HNSW: higher values mean better recall at the cost of more computation. The traditional approach is to set it once at startup and leave it.&lt;/p&gt;

&lt;p&gt;FerresEngine changes that. Every 60 seconds, the server reads the P95 latency from &lt;code&gt;query_stats&lt;/code&gt; per collection and applies a simple feedback loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If P95 latency is &lt;strong&gt;low&lt;/strong&gt; (the index has headroom), increase &lt;code&gt;ef_search&lt;/code&gt; to improve recall&lt;/li&gt;
&lt;li&gt;If P95 latency is &lt;strong&gt;high&lt;/strong&gt; (CPU is under pressure), decrease &lt;code&gt;ef_search&lt;/code&gt; to reduce load
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// In collection.rs&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;apply_hnsw_auto_tune&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p95_ms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.index&lt;/span&gt;&lt;span class="nf"&gt;.current_ef_search&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p95_ms&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;LOW_LATENCY_THRESHOLD_MS&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;EF_STEP&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EF_MAX&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p95_ms&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;HIGH_LATENCY_THRESHOLD_MS&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="nf"&gt;.saturating_sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EF_STEP&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="nf"&gt;.max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EF_MIN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;current&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.index&lt;/span&gt;&lt;span class="nf"&gt;.set_ef_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The current value is exposed as &lt;code&gt;ef_search_current&lt;/code&gt; via &lt;code&gt;GET /api/v1/collections/{name}/stats&lt;/code&gt;, so you can observe the tuner in action. The dashboard shows an "Optimized by FerresEngine" badge when auto-tuning is enabled.&lt;/p&gt;

&lt;p&gt;This is a simple fixed-step feedback controller, not a full PID loop. It's intentionally conservative — we'd rather miss a few percent of optimal recall than cause oscillation under variable load.&lt;/p&gt;




&lt;h2&gt;
  
  
  Native Cross-Encoder Re-ranking via ONNX Runtime
&lt;/h2&gt;

&lt;p&gt;Two-stage retrieval is a well-established pattern in information retrieval: first retrieve a broad candidate set with a fast bi-encoder (HNSW + embeddings), then re-rank with a heavier cross-encoder that scores query-document pairs directly.&lt;/p&gt;

&lt;p&gt;FerresDB now supports this natively via the optional &lt;code&gt;rerank&lt;/code&gt; feature, backed by the &lt;code&gt;ort&lt;/code&gt; crate (ONNX Runtime):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo build &lt;span class="nt"&gt;--features&lt;/span&gt; rerank
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a cross-encoder model is loaded, &lt;code&gt;search_with_rerank&lt;/code&gt; works as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Retrieve &lt;code&gt;limit × 5&lt;/code&gt; candidates from HNSW (intentionally over-fetching)&lt;/li&gt;
&lt;li&gt;Score each candidate against the query using the Cross-Encoder&lt;/li&gt;
&lt;li&gt;Sort by cross-encoder score and return the top &lt;code&gt;limit&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
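&lt;p&gt;The three steps above can be sketched as follows. The ANN search and cross-encoder are stubbed as closures here, since the real ones call HNSW and the ONNX model:&lt;/p&gt;

```rust
// Two-stage retrieval sketch: over-fetch from the ANN index, re-score each
// candidate with a (stubbed) cross-encoder, keep the top `limit`.
// `ann_search` and `cross_encode` stand in for HNSW and the ONNX model.

pub fn search_with_rerank(
    ann_search: impl Fn(usize) -> Vec<(String, f32)>, // (id, ann_score)
    cross_encode: impl Fn(&str) -> f32,               // query-document pair score
    limit: usize,
) -> Vec<(String, f32)> {
    // Stage 1: intentionally over-fetch limit × 5 candidates.
    let candidates = ann_search(limit * 5);
    // Stage 2: re-score every candidate with the cross-encoder.
    let mut scored: Vec<(String, f32)> = candidates
        .into_iter()
        .map(|(id, _)| {
            let s = cross_encode(&id);
            (id, s)
        })
        .collect();
    // Stage 3: sort by cross-encoder score, descending, and truncate.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(limit);
    scored
}
```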

&lt;p&gt;The API response includes &lt;code&gt;rerank_ms&lt;/code&gt; when re-ranking was applied, so you can measure the overhead. Models like BGE-Reranker work out of the box if exported to ONNX.&lt;/p&gt;

&lt;p&gt;The tradeoff is obvious: cross-encoders are slower (each query-document pair requires a full model forward pass), so this adds latency proportional to the number of candidates scored (&lt;code&gt;limit × 5&lt;/code&gt;). For most RAG use cases, the quality improvement is worth it. For high-throughput, latency-sensitive workloads, you'd keep re-ranking off.&lt;/p&gt;




&lt;h2&gt;
  
  
  Point-in-Time Recovery (PITR)
&lt;/h2&gt;

&lt;p&gt;Every WAL entry has always included a Unix timestamp. What was missing was the ability to &lt;em&gt;use&lt;/em&gt; those timestamps for recovery.&lt;/p&gt;

&lt;p&gt;PITR adds that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On each snapshot, the server persists &lt;code&gt;last_snapshot_timestamp&lt;/code&gt; to the collection directory&lt;/li&gt;
&lt;li&gt;A new endpoint &lt;code&gt;POST /api/v1/admin/restore&lt;/code&gt; accepts &lt;code&gt;{ "timestamp": &amp;lt;unix_sec&amp;gt;, "collection": "&amp;lt;name&amp;gt;?" }&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The recovery logic loads the most recent snapshot &lt;em&gt;before&lt;/em&gt; the target timestamp, then replays WAL entries up to (but not past) the timestamp
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Timeline:  [snapshot@T1] ... [WAL entries T1→T2] ... [snapshot@T2] ... [WAL entries T2→now]
                                                                             ^
                                                                    restore target = T_target
Recovery:  load snapshot@T2 → replay WAL from T2 to T_target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;GET /api/v1/admin/restore/points&lt;/code&gt; lists available recovery points (snapshot timestamps + WAL range) per collection, so you can choose an exact target before triggering recovery. The dashboard has a PITR UI with a datetime picker and a confirmation modal that warns you that the operation restores the database state in place.&lt;/p&gt;

&lt;p&gt;The most common use case is accidental bulk deletes — if someone upserts the wrong data or drops a namespace, you can recover to just before the bad operation without a full backup restore.&lt;/p&gt;
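&lt;p&gt;The selection logic described above — newest snapshot at or before the target, then WAL replay up to but not past the target — can be sketched like this (the structs are illustrative stand-ins, not the real FerresDB types):&lt;/p&gt;

```rust
// PITR planning sketch: pick the newest snapshot not after the target
// timestamp, then replay only WAL entries in (snapshot_ts, target_ts].

#[derive(Clone, Copy)]
pub struct Snapshot {
    pub ts: u64,
}

#[derive(Clone, Copy)]
pub struct WalEntry {
    pub ts: u64,
    pub seq: u64,
}

pub fn plan_restore(
    snapshots: &[Snapshot], // assumed sorted by ts ascending
    wal: &[WalEntry],       // assumed sorted by ts ascending
    target_ts: u64,
) -> (Option<Snapshot>, Vec<WalEntry>) {
    // Newest snapshot at or before the target.
    let base = snapshots.iter().rev().find(|s| s.ts <= target_ts).copied();
    let floor = base.map_or(0, |s| s.ts);
    // Replay entries after the snapshot, up to (but not past) the target.
    let replay = wal
        .iter()
        .filter(|e| e.ts > floor && e.ts <= target_ts)
        .copied()
        .collect();
    (base, replay)
}
```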




&lt;h2&gt;
  
  
  Namespace-Level Access Control
&lt;/h2&gt;

&lt;p&gt;The previous RBAC model worked at the collection level: an API key could have Read/Write/Create permissions on a named collection. But in multi-tenant deployments, you often want isolation at the namespace level — a tenant should be able to search within their data partition without touching another tenant's.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NamespaceAllowance&lt;/code&gt; adds exactly that: API keys can now be restricted to one or more namespaces. The middleware validates the namespace from the &lt;code&gt;?namespace=&lt;/code&gt; query param or the &lt;code&gt;X-Namespace&lt;/code&gt; header, and handlers enforce the allowance before any HNSW operation runs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;PUT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/api/v&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;/keys/:id&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"allowed_namespaces"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"tenant-a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tenant-b"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keys without &lt;code&gt;allowed_namespaces&lt;/code&gt; behave as before (full access based on role). This is backward-compatible — existing keys continue to work without modification.&lt;/p&gt;
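&lt;p&gt;The enforcement check is conceptually tiny. A sketch of the allow-list semantics (the function name and shape are illustrative, not the actual middleware):&lt;/p&gt;

```rust
// Namespace allowance sketch: a key with no `allowed_namespaces` keeps its
// full role-based access (backward-compatible); otherwise the requested
// namespace must appear in the allow-list.

pub fn namespace_allowed(allowed: Option<&[String]>, requested: &str) -> bool {
    match allowed {
        None => true, // legacy key: behaves exactly as before
        Some(list) => list.iter().any(|ns| ns == requested),
    }
}
```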




&lt;h2&gt;
  
  
  Physical Namespace Isolation
&lt;/h2&gt;

&lt;p&gt;Namespace-level access control handles &lt;em&gt;authentication&lt;/em&gt;. Physical namespace isolation handles &lt;em&gt;storage&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;namespace_physical_isolation&lt;/code&gt; enabled (via &lt;code&gt;config.toml&lt;/code&gt; or &lt;code&gt;FERRESDB_NAMESPACE_PHYSICAL_ISOLATION&lt;/code&gt;), points for each namespace are stored in separate directories:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;data/collections/&amp;lt;name&amp;gt;/namespaces/&amp;lt;namespace&amp;gt;/points.bin
data/collections/&amp;lt;name&amp;gt;/namespaces/&amp;lt;namespace&amp;gt;/index.bin  &lt;span class="o"&gt;(&lt;/span&gt;optional&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can snapshot a single tenant's data independently&lt;/li&gt;
&lt;li&gt;You can delete a tenant's physical files without touching other namespaces&lt;/li&gt;
&lt;li&gt;Indexes are loaded independently per namespace&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tradeoff is storage amplification — you get multiple smaller indexes instead of one large one, which has implications for HNSW graph quality at low point counts. For most multi-tenant use cases, the isolation benefit outweighs the graph quality cost.&lt;/p&gt;




&lt;h2&gt;
  
  
  Graph Traversal: Connecting Points as a Graph
&lt;/h2&gt;

&lt;p&gt;This one is a bit different from the other features — it's less about performance and more about a new query primitive.&lt;/p&gt;

&lt;p&gt;Each &lt;code&gt;Point&lt;/code&gt; now has an optional &lt;code&gt;relations: Vec&amp;lt;String&amp;gt;&lt;/code&gt; field — a list of related point IDs. Relations are bidirectional and persisted in both JSONL and WAL (&lt;code&gt;Operation::Link { from, to }&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST /api/v1/collections/&lt;span class="o"&gt;{&lt;/span&gt;name&lt;span class="o"&gt;}&lt;/span&gt;/points/link
&lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"from"&lt;/span&gt;: &lt;span class="s2"&gt;"doc-1"&lt;/span&gt;, &lt;span class="s2"&gt;"to"&lt;/span&gt;: &lt;span class="s2"&gt;"doc-2"&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On top of relations, there's a BFS traversal (&lt;code&gt;traverse_bfs&lt;/code&gt;) and a new search method &lt;code&gt;Collection::search_connected(query_vector, center_point_id, hops, k)&lt;/code&gt; that restricts the candidate set to the subgraph reachable from &lt;code&gt;center_point_id&lt;/code&gt; within &lt;code&gt;hops&lt;/code&gt; steps, then returns the top K by vector similarity.&lt;/p&gt;
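&lt;p&gt;A rough sketch of that flow — hop-limited BFS to build the candidate set, then rank by similarity. The adjacency map and dot-product scoring are illustrative simplifications, not the actual &lt;code&gt;Collection&lt;/code&gt; internals:&lt;/p&gt;

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Hop-limited BFS over point relations.
pub fn traverse_bfs(
    edges: &HashMap<String, Vec<String>>,
    center: &str,
    hops: usize,
) -> HashSet<String> {
    let mut seen: HashSet<String> = HashSet::from([center.to_string()]);
    let mut queue = VecDeque::from([(center.to_string(), 0usize)]);
    while let Some((node, depth)) = queue.pop_front() {
        if depth == hops {
            continue; // do not expand past the hop limit
        }
        for next in edges.get(&node).into_iter().flatten() {
            if seen.insert(next.clone()) {
                queue.push_back((next.clone(), depth + 1));
            }
        }
    }
    seen
}

// Restrict search to the reachable subgraph, then return top-k by similarity.
pub fn search_connected(
    edges: &HashMap<String, Vec<String>>,
    vectors: &HashMap<String, Vec<f32>>,
    query: &[f32],
    center: &str,
    hops: usize,
    k: usize,
) -> Vec<(String, f32)> {
    let reachable = traverse_bfs(edges, center, hops);
    let mut scored: Vec<(String, f32)> = reachable
        .into_iter()
        .filter_map(|id| {
            let v = vectors.get(&id)?;
            let dot: f32 = v.iter().zip(query).map(|(a, b)| a * b).sum();
            Some((id, dot))
        })
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(k);
    scored
}
```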

&lt;p&gt;The &lt;code&gt;GET /api/v1/collections/{name}/graph/subgraph&lt;/code&gt; endpoint returns &lt;code&gt;{ nodes: [...], edges: [...] }&lt;/code&gt; for visualization. The dashboard has a Graph Explorer page using &lt;code&gt;react-force-graph-2d&lt;/code&gt;, with force-directed layout, click-to-expand, and a JSON sidebar for the selected node.&lt;/p&gt;

&lt;p&gt;The practical use cases are document-to-document links (citations, related articles), entity relationships (knowledge graphs over embeddings), and hierarchical data where you want to constrain search to a subtree.&lt;/p&gt;




&lt;h2&gt;
  
  
  S3 Backup and Retention Policy
&lt;/h2&gt;

&lt;p&gt;Two operational features that belong together:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;S3 Backup&lt;/strong&gt; (&lt;code&gt;POST /api/v1/admin/backup&lt;/code&gt;) generates a &lt;code&gt;tar.gz&lt;/code&gt; snapshot of the storage directory and uploads it to a configured S3 bucket. Credentials can come from &lt;code&gt;config.toml&lt;/code&gt;, environment variables (&lt;code&gt;FERRESDB_S3_*&lt;/code&gt;), or the standard &lt;code&gt;AWS_*&lt;/code&gt; variables — the &lt;code&gt;aws-config&lt;/code&gt; crate handles credential resolution in the usual priority order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retention Policy&lt;/strong&gt; adds &lt;code&gt;retention_days&lt;/code&gt; to &lt;code&gt;CollectionConfig&lt;/code&gt;. A background worker runs hourly and compacts the WAL per collection, removing entries older than the configured period. Setting &lt;code&gt;retention_days: null&lt;/code&gt; keeps data indefinitely (the default). The dashboard has a Settings section where you can configure retention per collection via &lt;code&gt;PATCH /api/v1/collections/{name}&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;These two features are complementary: S3 backup handles disaster recovery ("restore from last known good state"), while retention policy handles data lifecycle ("we only need the last 90 days of embeddings").&lt;/p&gt;




&lt;h2&gt;
  
  
  Cache Warmup on Startup
&lt;/h2&gt;

&lt;p&gt;Cold starts are a real issue for HNSW indexes: the first few queries after a restart are slow because the index graph has to be paged into RAM and the search cache is empty.&lt;/p&gt;

&lt;p&gt;The cache warmup feature addresses this: on startup, the server reads the last 50 queries from &lt;code&gt;queries.log&lt;/code&gt; and replays them in a background task. This pre-loads the HNSW graph nodes that were most recently accessed and populates &lt;code&gt;search_cache&lt;/code&gt; with likely-to-be-repeated queries.&lt;/p&gt;
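&lt;p&gt;In miniature, the replay loop looks like this (&lt;code&gt;run_search&lt;/code&gt; stands in for the real search path; the function shape is illustrative):&lt;/p&gt;

```rust
// Warmup sketch: replay the tail of the query log through the search path
// so index pages and the search cache are hot before real traffic arrives.

pub fn warm_up(
    logged_queries: &[Vec<f32>],
    n: usize,
    mut run_search: impl FnMut(&[f32]),
) -> usize {
    // Take the last `n` logged queries (or fewer, if the log is short).
    let start = logged_queries.len().saturating_sub(n);
    let tail = &logged_queries[start..];
    for q in tail {
        run_search(q);
    }
    tail.len()
}
```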

&lt;p&gt;The query log was extended to optionally store the full query vector (needed for replay). Tracing logs show the warmup progress: &lt;code&gt;warmup: starting cache warmup&lt;/code&gt;, &lt;code&gt;warmup: ran query {n}/{total}&lt;/code&gt;, &lt;code&gt;warmup: cache warmup completed in {ms}ms&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The effect is visible in the first minute after restart — P95 latency returns to steady-state much faster than without warmup.&lt;/p&gt;




&lt;h2&gt;
  
  
  Distributed Foundation: Raft and Read Replicas
&lt;/h2&gt;

&lt;p&gt;Two experimental features that lay the groundwork for horizontal scaling:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raft consensus&lt;/strong&gt; (&lt;code&gt;--features raft&lt;/code&gt;, backed by &lt;code&gt;openraft&lt;/code&gt;) adds types and a cluster status API for multi-node operation. The WAL has a &lt;code&gt;replicate_then_confirm&lt;/code&gt; path that can be routed through Raft before confirming writes to the client. The dashboard has a Cluster page showing active nodes, the current leader, and replication status from &lt;code&gt;GET /api/v1/cluster&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Read replicas&lt;/strong&gt; (&lt;code&gt;--replica-of &amp;lt;ADDR&amp;gt;&lt;/code&gt;) start the server in replica mode. Write endpoints (POST/PUT/DELETE on collections, points, etc.) return &lt;code&gt;405 Method Not Allowed&lt;/code&gt;. A WAL streaming worker (via gRPC &lt;code&gt;StreamWal&lt;/code&gt;, enabled with &lt;code&gt;--features grpc&lt;/code&gt;) consumes the leader's WAL and applies it locally, keeping the replica in sync. The dashboard Overview shows &lt;code&gt;Role: Leader&lt;/code&gt; or &lt;code&gt;Role: Replica&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Both features are clearly marked experimental. The Raft implementation is not battle-tested, and replica lag handling is basic. The value right now is architectural: the code paths for distributed operation exist, and they're wired up correctly. Running a single-node deployment with either feature is safe; relying on them for production multi-node deployments is not yet advised.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;A few things we're actively thinking about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Product Quantization (PQ)&lt;/strong&gt;: SQ8 and PolarQuant both reduce memory, but PQ achieves better compression ratios at high dimensions by splitting vectors into subvectors and quantizing each subspace independently. The &lt;code&gt;ANNIndex&lt;/code&gt; trait makes this straightforward to add.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid search improvements&lt;/strong&gt;: Reciprocal Rank Fusion (RRF) as an alternative to the current linear combination of vector and BM25 scores.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stable Raft&lt;/strong&gt;: Getting from "foundation exists" to "actually reliable" requires a lot of failure injection testing. This is on the roadmap but not imminent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python and TypeScript SDKs&lt;/strong&gt;: The REST API is stable; the SDK surface needs to catch up with recent features like PITR, graph traversal, and namespace isolation.&lt;/li&gt;
&lt;/ul&gt;
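&lt;p&gt;Since RRF came up in the list above, here is the idea in miniature. Nothing here is FerresDB code yet; &lt;code&gt;k = 60&lt;/code&gt; is the conventional constant from the RRF literature:&lt;/p&gt;

```rust
use std::collections::HashMap;

// Reciprocal Rank Fusion sketch: each ranked list contributes
// 1 / (k + rank) per document; the fused score is the sum across lists.

pub fn rrf(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (rank, id) in list.iter().enumerate() {
            // Ranks are 1-based in the usual RRF formulation.
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + (rank + 1) as f64);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}
```

&lt;p&gt;The appeal over a linear combination is that RRF only needs ranks, so the vector and BM25 scores never have to be normalized onto a common scale.&lt;/p&gt;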




&lt;h2&gt;
  
  
  A Note on the &lt;code&gt;ANNIndex&lt;/code&gt; Trait
&lt;/h2&gt;

&lt;p&gt;Almost every feature in this post was made easier by one early decision: the &lt;code&gt;ANNIndex&lt;/code&gt; trait.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="n"&gt;ANNIndex&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Sync&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;add_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;point&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Point&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;FerresError&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;remove_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predicate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;dyn&lt;/span&gt; &lt;span class="nf"&gt;Fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Sync&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FerresError&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;search_explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ExplainMeta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FerresError&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;current_ef_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;set_ef_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;HnswIndex&lt;/code&gt;, &lt;code&gt;QuantizedHnswIndex&lt;/code&gt;, and &lt;code&gt;PolarQuantHnswIndex&lt;/code&gt; all implement this trait. The factory function &lt;code&gt;create_ann_index()&lt;/code&gt; selects the right one based on &lt;code&gt;CollectionConfig&lt;/code&gt;. The server, storage layer, and PITR code all work with &lt;code&gt;Box&amp;lt;dyn ANNIndex&amp;gt;&lt;/code&gt; — they don't know or care which backend is running.&lt;/p&gt;

&lt;p&gt;When we added HNSW auto-tuning, we added &lt;code&gt;set_ef_search&lt;/code&gt; to the trait. When we added explain, we added &lt;code&gt;search_explain&lt;/code&gt;. Each new backend picks up the interface automatically. The &lt;code&gt;#[serde(default)]&lt;/code&gt; on &lt;code&gt;QuantizationConfig&lt;/code&gt; in &lt;code&gt;CollectionConfig&lt;/code&gt; means old serialized collections deserialize correctly without migration.&lt;/p&gt;

&lt;p&gt;If you're building a vector database or any system with pluggable backends, this is the pattern worth copying.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;FerresDB is open source and built in Rust. If any of this is interesting to you — whether you want to use it, contribute, or just steal the ideas — the code is there to read.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.ferres.io/" rel="noopener noreferrer"&gt;FerresDB&lt;/a&gt;&lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>architecture</category>
      <category>database</category>
      <category>rust</category>
    </item>
    <item>
      <title>FerresDB update!</title>
      <dc:creator>Rafael Ferres</dc:creator>
      <pubDate>Tue, 10 Feb 2026 11:54:41 +0000</pubDate>
      <link>https://dev.to/rafael_ferres_0904f2af810/ferresdb-update-143p</link>
      <guid>https://dev.to/rafael_ferres_0904f2af810/ferresdb-update-143p</guid>
      <description>&lt;p&gt;I’ve just released a series of fundamental improvements to FerresDB, focused on low-level performance and native integration with AI ecosystems.&lt;/p&gt;

&lt;p&gt;What’s new:&lt;/p&gt;

&lt;p&gt;🔌 Embedded MCP (Model Context Protocol): Native support via STDIO. It’s now possible to connect the database directly to Claude Desktop or Cursor IDE.&lt;/p&gt;

&lt;p&gt;⚡ SIMD-Accelerated Kernels: Implementation of distance kernels (Euclidean/Dot Product) in Rust using AVX2 and SSE4.1 instructions, with runtime detection.&lt;/p&gt;

&lt;p&gt;🔍 Native HNSW Pre-filtering: Metadata filtering integrated directly into graph traversal, so filtered searches stay accurate and return exactly the requested number of results.&lt;/p&gt;

&lt;p&gt;🏢 Logical Namespaces: Native multitenancy support, allowing data from multiple clients to be isolated efficiently within the same physical collection.&lt;/p&gt;

&lt;p&gt;📊 Real-time Analytics: Updated dashboard with time-series charts for P95 latency and ingestion throughput, plus a hardware acceleration indicator.&lt;/p&gt;

&lt;p&gt;📦 Storage Optimization: Added Zstd compression for the WAL and support for binary snapshots via bincode for ultra-fast loading.&lt;/p&gt;

&lt;p&gt;🔄 Auto-Reindex &amp;amp; TTL: New background worker for automatic index compaction and support for Time-to-Live data expiration.&lt;/p&gt;

&lt;p&gt;The project continues to evolve as a lightweight and resilient solution for vector search infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.ferres.io/" rel="noopener noreferrer"&gt;FerresDB&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>mcp</category>
      <category>performance</category>
      <category>rust</category>
    </item>
    <item>
      <title>Building a High-Performance Vector Database in Rust from Scratch 🦀</title>
      <dc:creator>Rafael Ferres</dc:creator>
      <pubDate>Sun, 08 Feb 2026 22:21:10 +0000</pubDate>
      <link>https://dev.to/rafael_ferres_0904f2af810/building-a-high-performance-vector-database-in-rust-from-scratch-1kdm</link>
      <guid>https://dev.to/rafael_ferres_0904f2af810/building-a-high-performance-vector-database-in-rust-from-scratch-1kdm</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Recently, I’ve been heads-down developing &lt;strong&gt;FerresDB Core&lt;/strong&gt;, a high-performance vector search engine designed specifically for semantic search and RAG (Retrieval-Augmented Generation) applications. The goal was to build a tool that balances raw speed with the reliability and visibility that developers need in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Rust?
&lt;/h2&gt;

&lt;p&gt;Choosing &lt;strong&gt;Rust&lt;/strong&gt; was a deliberate decision for this project. It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sub-millisecond performance&lt;/strong&gt; even with large vector collections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thread-safety&lt;/strong&gt; and memory management without a garbage collector, which is critical for a multi-threaded database server.&lt;/li&gt;
&lt;li&gt;A robust ecosystem for implementing complex algorithms like HNSW.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Core Features &amp;amp; Architecture
&lt;/h2&gt;

&lt;p&gt;The project is structured as a modular ecosystem, including the core engine, a REST/gRPC server, and a management dashboard:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vector Engine (HNSW):&lt;/strong&gt; Supports sub-millisecond searches using the HNSW algorithm with Cosine, Euclidean, and Dot Product metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistence &amp;amp; Durability:&lt;/strong&gt; To ensure data integrity, I implemented a &lt;strong&gt;Write-Ahead Log (WAL)&lt;/strong&gt; and a periodic snapshot system. If the system crashes, it can recover automatically to a consistent state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid Search:&lt;/strong&gt; FerresDB isn't limited to vectors; it supports hybrid search with &lt;strong&gt;BM25&lt;/strong&gt; to improve accuracy in RAG pipelines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; Built-in support for &lt;strong&gt;OpenTelemetry&lt;/strong&gt; (OTLP) allows for distributed tracing, giving you a hierarchical view of every search request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxrl6pahsdjea3oenur61.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxrl6pahsdjea3oenur61.png" alt=" " width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fblbr4bj0e52wiem6jrmf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fblbr4bj0e52wiem6jrmf.png" alt=" " width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5z6oysi23zwvn7q7668.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5z6oysi23zwvn7q7668.png" alt=" " width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Developer Experience (DX)
&lt;/h2&gt;

&lt;p&gt;I believe that infrastructure shouldn't be a "black box." That’s why FerresDB includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Integrated Dashboard:&lt;/strong&gt; A modern UI built with React and Tailwind CSS to manage collections, API keys, and test queries visually.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modern Connectivity:&lt;/strong&gt; Full support for &lt;strong&gt;REST APIs&lt;/strong&gt;, low-latency &lt;strong&gt;gRPC&lt;/strong&gt;, and &lt;strong&gt;WebSockets&lt;/strong&gt; for real-time log streaming.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Ready:&lt;/strong&gt; You can spin up the entire stack with a single &lt;code&gt;docker-compose up&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Current Status
&lt;/h2&gt;

&lt;p&gt;I am evolving the project step by step. While I plan to make it fully &lt;strong&gt;Open Source&lt;/strong&gt; very soon, it is already at a stage where it can be used for development and testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Check it out!
&lt;/h2&gt;

&lt;p&gt;I'd love to get feedback from the community on the performance and the interface. If you're building RAG applications or interested in database internals, let's connect!&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://www.ferres.io/" rel="noopener noreferrer"&gt;FerresDB&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>rust</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
