WonderLab

Posted on Jun 9

Open Source Project of the Day (#90): turbovec - The Vector Index That Shrinks 10M Docs from 31 GB to 4 GB

#rag #vectordatabase #opensource #embedding

Introduction

"A 10 million document corpus takes 31 GB of RAM as float32. turbovec fits it in 4 GB."

This is article #90 in the Open Source Project of the Day series. Today's project is turbovec — a vector index library that shrinks memory by 8× and still runs faster than FAISS.

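The memory cost of vector indexes is one of the most underestimated infrastructure problems in RAG. A 1536-dimensional OpenAI embedding is 6,144 bytes (about 6 KB) of float32 data. One million documents: about 6 GB. Ten million: about 61 GB, more than most machines can hold, let alone search efficiently. (The 31 GB figure in the quote above matches 768-dimensional vectors: 3,072 bytes × 10 million ≈ 30.7 GB.)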

turbovec's answer comes from a Google Research paper at ICLR 2026: the TurboQuant algorithm. It compresses each 1536-dimensional vector from 6,144 bytes to 768 bytes with 4-bit quantization (8×), or to 384 bytes with 2-bit quantization (16×). It uses statistical calibration to keep recall from dropping, and a Rust + SIMD engine to push search speed past FAISS.

What You'll Learn

Why vector quantization is hard, and how TurboQuant solves the recall degradation problem
turbovec's 6-step algorithm: normalization → random rotation → calibration → Lloyd-Max quantization → bit-packing → length renormalization scoring
The Python API: TurboQuantIndex, IdMapIndex, and filtered search
Real benchmarks: memory and recall numbers vs. FAISS PQ
Drop-in replacement for LangChain / LlamaIndex / Haystack with one line of code

Prerequisites

Basic understanding of embedding vectors
Experience with RAG pipelines or vector databases
Basic Python experience; Rust experience optional

Project Background

What Is turbovec?

turbovec is a high-performance vector index library with a Rust core exposed through PyO3/maturin Python bindings. Its technical foundation is Google Research's TurboQuant algorithm — a low-bit-width vector compression scheme that beats FAISS Product Quantization on both compression ratio and search recall.

Its positioning is clear: a local-first, zero-dependency vector index engine that embeds directly into RAG stacks.

Author / Team

Author: RyanCodrai
Algorithm source: Google Research (TurboQuant, ICLR 2026)
License: MIT

Project Stats

⭐ GitHub Stars: 8,900+
🍴 Forks: 813
📦 Install: pip install turbovec / cargo add turbovec
📄 License: MIT
🌐 Language: Python 55.7% + Rust 44.3%

Core Features

What It Does

turbovec attacks two fundamental tensions in vector indexing:

Memory vs. scale: float32 storage makes large corpora require tens of GB; quantization typically trades recall for compression
Speed vs. precision: faster ANN search usually means less accurate results; turbovec improves both at once

Core numbers:

Metric	float32	turbovec 4-bit	Gain
10M doc index memory	31 GB	4 GB	8× compression
Per-vector storage (1536-dim)	6,144 bytes	768 bytes (384 bytes at 2-bit)	8× compression (16× at 2-bit)
ARM search speed vs FAISS FastScan	baseline	+12–20%	Faster

Use Cases

Memory-constrained local RAG systems
- Run semantic search over millions of documents on consumer hardware (16–32 GB RAM), without a cloud vector database
Drop-in replacement for existing framework vector stores
- Swap InMemoryVectorStore in LangChain with zero changes to business logic
RAG with filtered search
- Filter by user permissions, document source, or time range — filtering and search execute together inside the SIMD kernel
Native Rust vector retrieval
- Use as a Rust crate in a Rust service without a Python runtime dependency
Cost-sensitive vector retrieval
- 8× memory compression means the same hardware serves 8× the corpus, or lets you downsize your cloud instance tier

Quick Start

pip install turbovec

# Framework-integrated variants (quoted so shells like zsh do not expand the brackets)
pip install "turbovec[langchain]"     # Replace InMemoryVectorStore
pip install "turbovec[llama-index]"   # Replace SimpleVectorStore
pip install "turbovec[haystack]"      # Replace InMemoryDocumentStore
pip install "turbovec[agno]"          # Replace LanceDb

Basic indexing:

import numpy as np
from turbovec import TurboQuantIndex

# Create index: 1536-dimensional, 4-bit quantization
index = TurboQuantIndex(dim=1536, bit_width=4)

# Add vectors — no training phase needed
vectors = np.random.randn(10000, 1536).astype(np.float32)
index.add(vectors)

# Search
query = np.random.randn(1, 1536).astype(np.float32)
scores, indices = index.search(query, k=10)

# Persist to disk
index.write("my_index.tq")
index2 = TurboQuantIndex.read("my_index.tq")

With ID mapping (supports deletion):

import numpy as np
from turbovec import IdMapIndex

index = IdMapIndex(dim=1536, bit_width=4)

# Reuses `vectors` from the basic indexing example above
doc_ids = np.array([1001, 1002, 1003, 1004], dtype=np.uint64)
index.add_with_ids(vectors[:4], doc_ids)

index.remove(1002)  # O(1) removal

Filtered search:

# Search only within a specific set of document IDs
allowed_ids = np.array([1001, 1003, 1004], dtype=np.uint64)
scores, ids = index.search(query, k=5, allowlist=allowed_ids)

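# Or use a bitmask for larger-scale filtering.
# create_bitmask() and allowed_positions are placeholders that this snippet does not define;
# build the bitmask as described in the turbovec documentation.
bitmask = create_bitmask(allowed_positions)
scores, ids = index.search(query, k=5, slot_bitmask=bitmask)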

LangChain drop-in replacement:

# Before
from langchain_core.vectorstores import InMemoryVectorStore
store = InMemoryVectorStore(embeddings)

# After: fully API-compatible
from turbovec.langchain import TurboVecStore
store = TurboVecStore(embeddings)

Key Properties

Online ingestion, no training phase
- FAISS PQ requires training a codebook on representative data first. turbovec derives its quantization boundaries from theoretical distributions — just add and go.
SIMD-accelerated search kernels
- ARM: NEON instructions — beats FAISS FastScan by 12–20% on Apple Silicon and equivalent
- x86: AVX-512BW primary path, AVX2 fallback — outperforms FAISS 4-bit across the board
Recall that exceeds FAISS PQ
- +0.4 to +3.4 percentage points on R@1 at 1536/3072 dimensions
Filtered search built into the kernel
- allowlist / bitmask filtering runs inside the SIMD kernel — not a post-search filter
Local-first, fully offline
- No network calls, no managed service dependency, suitable for air-gapped environments
Rust-native with Python bindings
- Use as a Rust crate (cargo add turbovec) or Python package (pip install turbovec)
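
To see why filtering inside the search matters, here is a minimal brute-force sketch in plain NumPy. It is not turbovec's kernel; the function names `search_post_filter` and `search_pre_filter` are made up for illustration. It only shows that filtering after a top-k search can return fewer than k results, while filtering during scoring does not.

```python
import numpy as np

def search_post_filter(scores: np.ndarray, k: int, allowed: set) -> list:
    """Take the global top-k first, then drop results that are not allowed."""
    top = np.argsort(-scores)[:k]
    return [int(i) for i in top if int(i) in allowed]

def search_pre_filter(scores: np.ndarray, k: int, allowed_mask: np.ndarray) -> list:
    """Exclude disallowed slots before ranking, so top-k is taken among allowed ones."""
    masked = np.where(allowed_mask, scores, -np.inf)
    top = np.argsort(-masked)[:k]
    return [int(i) for i in top if allowed_mask[i]]

# Six documents with precomputed similarity scores; only 3, 4 and 5 are visible.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
allowed = {3, 4, 5}
allowed_mask = np.zeros(len(scores), dtype=bool)
allowed_mask[list(allowed)] = True

post = search_post_filter(scores, k=2, allowed=allowed)
pre = search_pre_filter(scores, k=2, allowed_mask=allowed_mask)
print("post-filter:", post)
print("pre-filter: ", pre)
```

With a restrictive allowlist, the post-filter variant loses every hit because the global top-k contains only disallowed documents. A filter applied during scoring avoids that failure mode.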

Deep Dive

TurboQuant: 6 Steps to Quantize Without Losing Recall

Traditional quantization methods like FAISS Product Quantization learn their codebook from data — which means they need a training pass, can degrade on out-of-distribution data, and their buckets are never provably optimal. TurboQuant's key insight: derive optimal quantization boundaries from theory, not data.

Step 1: Normalization

v_norm = v / ‖v‖
r      = ‖v‖   (stored separately)

Decompose every vector into direction (unit vector) and magnitude (scalar). Cosine similarity and dot product comparisons depend only on direction; the magnitude is saved for the scoring correction later.
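
A minimal NumPy sketch of this decomposition follows. The helper name `split_direction_and_norm` and the `eps` guard for zero vectors are assumptions for illustration, not part of turbovec's API.

```python
import numpy as np

def split_direction_and_norm(vectors: np.ndarray, eps: float = 1e-12):
    """Return (unit_vectors, norms) for a 2-D array of row vectors."""
    norms = np.linalg.norm(vectors, axis=1)
    # Guard against zero vectors so the division never produces NaN.
    safe_norms = np.maximum(norms, eps)
    unit = vectors / safe_norms[:, None]
    return unit, norms

rng = np.random.default_rng(0)
vectors = rng.standard_normal((4, 8))
unit, norms = split_direction_and_norm(vectors)

# Each row of `unit` has length 1; multiplying by the norm recovers the input.
print(np.linalg.norm(unit, axis=1))
```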

Step 2: Random Rotation

v_rotated = R × v_norm
(R is a random orthogonal matrix)

Multiply by a random orthogonal matrix. Rotation preserves inner products (it is an isometry) while spreading the vector's energy evenly across coordinates. After rotation, each coordinate follows a known Beta-type distribution, which for large dimension d is close to a Gaussian with variance 1/d. This is the theoretical assumption that enables Step 4.
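
One common way to draw a random orthogonal matrix is the QR decomposition of a Gaussian matrix. The sketch below assumes that construction (the source does not say how turbovec generates R) and checks that inner products survive the rotation.

```python
import numpy as np

def random_orthogonal(dim: int, seed: int = 0) -> np.ndarray:
    """Draw a random orthogonal matrix via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    gaussian = rng.standard_normal((dim, dim))
    q, r = np.linalg.qr(gaussian)
    # Flip column signs using diag(r) so the result is uniformly distributed.
    return q * np.sign(np.diag(r))

dim = 64
R = random_orthogonal(dim)

rng = np.random.default_rng(1)
a = rng.standard_normal(dim)
b = rng.standard_normal(dim)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

before = float(a @ b)
after = float((R @ a) @ (R @ b))
print(before, after)  # equal up to floating-point error

# A vector with all its energy in one coordinate gets spread across all of them.
spike = np.zeros(dim)
spike[0] = 1.0
spread = R @ spike
```

A dense d × d matrix costs O(d²) per vector. Production systems often use structured rotations instead, but the dense version is the simplest way to see the property.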

Step 3: Per-Coordinate Calibration (TQ+)

v_calibrated[i] = (v_rotated[i] - shift[i]) / scale[i]

The rotated coordinates follow a Beta distribution in theory, but real data may have small shifts. TQ+ fits a shift and scale per coordinate to align the empirical distribution with the theoretical one, improving quantization accuracy.
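
The source does not spell out how TQ+ fits shift and scale. The sketch below assumes the simplest option: match each coordinate's empirical mean and standard deviation to the theoretical values for a random unit vector (mean 0, standard deviation 1/√d). The function names are hypothetical.

```python
import numpy as np

def fit_calibration(rotated: np.ndarray, eps: float = 1e-12):
    """Fit a per-coordinate shift and scale (assumed: moment matching)."""
    dim = rotated.shape[1]
    # One coordinate of a uniformly random unit vector has std 1/sqrt(dim).
    target_std = 1.0 / np.sqrt(dim)
    shift = rotated.mean(axis=0)
    scale = np.maximum(rotated.std(axis=0), eps) / target_std
    return shift, scale

def apply_calibration(rotated: np.ndarray, shift: np.ndarray, scale: np.ndarray):
    """v_calibrated[i] = (v_rotated[i] - shift[i]) / scale[i]"""
    return (rotated - shift) / scale

# Synthetic "rotated" data that is slightly too wide and slightly off-center.
rng = np.random.default_rng(0)
dim = 32
rotated = 1.3 * rng.standard_normal((5000, dim)) / np.sqrt(dim) + 0.02

shift, scale = fit_calibration(rotated)
calibrated = apply_calibration(rotated, shift, scale)
print(calibrated.mean(axis=0)[:3], calibrated.std(axis=0)[:3])
```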

Step 4: Lloyd-Max Quantization

2-bit → 4 buckets  (optimal boundaries b₁, b₂, b₃)
4-bit → 16 buckets (optimal boundaries b₁...b₁₅)

Because the coordinate distribution is now known (calibrated Beta), the optimal Lloyd-Max bucket boundaries can be precomputed before deployment — not learned from data. This is the fundamental reason turbovec needs no training phase.
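
As an illustration of precomputing boundaries from a known distribution, here is a sketch of the classic Lloyd-Max iteration on a discretized density. It uses the density of one coordinate of a uniformly random unit vector, proportional to (1 − x²)^((d−3)/2). The grid resolution, iteration count, and function names are assumptions; turbovec's actual tables may be computed differently.

```python
import numpy as np

def sphere_coordinate_density(grid: np.ndarray, dim: int) -> np.ndarray:
    """Unnormalized density of one coordinate of a uniformly random unit vector."""
    return np.exp(0.5 * (dim - 3) * np.log1p(-grid**2))

def lloyd_max(grid: np.ndarray, density: np.ndarray, levels: int, iters: int = 200):
    """Alternate between midpoint boundaries and conditional-mean centroids."""
    weights = density / density.sum()
    std = np.sqrt(np.sum(weights * grid**2))
    centroids = np.linspace(-3.0, 3.0, levels) * std  # initial guess
    for _ in range(iters):
        boundaries = 0.5 * (centroids[:-1] + centroids[1:])
        cell = np.searchsorted(boundaries, grid)
        mass = np.bincount(cell, weights=weights, minlength=levels)
        moment = np.bincount(cell, weights=weights * grid, minlength=levels)
        filled = mass > 0
        centroids[filled] = moment[filled] / mass[filled]
    boundaries = 0.5 * (centroids[:-1] + centroids[1:])
    return boundaries, centroids

dim = 128
# Even number of interior points keeps the grid symmetric without a point at 0.
grid = np.linspace(-1.0, 1.0, 20002)[1:-1]
density = sphere_coordinate_density(grid, dim)
boundaries, centroids = lloyd_max(grid, density, levels=16)  # 4-bit

# Quantizing a coordinate is then a lookup against the precomputed boundaries.
codes = np.searchsorted(boundaries, np.array([-0.2, 0.0, 0.2]))
print(len(boundaries), codes)
```

Nothing in this computation touches the corpus: the table depends only on the dimension and the bit width.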

Step 5: Bit-Packing

1536-dim × 4-bit = 6,144 bits = 768 bytes (vs 6,144 bytes for float32 → 8× compression)
1536-dim × 2-bit = 3,072 bits = 384 bytes (→ 16× compression)
+ the norm r from Step 1 and minimal metadata per vector

Quantized integers are packed tightly into bit arrays for maximum memory efficiency.
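
A minimal sketch of 4-bit packing in NumPy: two codes share one byte, so 1536 codes occupy 768 bytes. The nibble order (even index in the low nibble) and the helper names are assumptions; turbovec's on-disk layout may differ.

```python
import numpy as np

def pack_4bit(codes: np.ndarray) -> np.ndarray:
    """Pack values in [0, 15] two per byte along the last axis."""
    codes = np.asarray(codes, dtype=np.uint8)
    if codes.shape[-1] % 2 != 0:
        raise ValueError("last axis must have even length")
    if codes.size and codes.max() > 15:
        raise ValueError("codes must fit in 4 bits")
    low = codes[..., 0::2]   # even positions go to the low nibble
    high = codes[..., 1::2]  # odd positions go to the high nibble
    return (low | (high << 4)).astype(np.uint8)

def unpack_4bit(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_4bit."""
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.uint8)
    out[..., 0::2] = packed & 0x0F
    out[..., 1::2] = packed >> 4
    return out

rng = np.random.default_rng(0)
codes = rng.integers(0, 16, size=(3, 1536), dtype=np.uint8)
packed = pack_4bit(codes)
restored = unpack_4bit(packed)
print(packed.shape, packed.nbytes // len(packed), "bytes per vector")
```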

Step 6: Length Renormalization Scoring

score_corrected = score_raw × correction(r_query, r_doc)

Quantization systematically underestimates inner products, because quantization error shrinks the reconstructed magnitudes. Using the norms r saved in Step 1, the raw score is multiplied by a per-pair correction factor at scoring time. The correction is applied outside the SIMD kernel, so it adds no work to the inner scan loop, yet it meaningfully lifts recall.
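
The source does not define the correction function. The sketch below assumes one plausible form: divide by the length of the reconstructed (shrunken) direction and multiply by the stored document norm. Treat `corrected_score` as a hypothetical illustration of the idea, not turbovec's formula.

```python
import numpy as np

def corrected_score(query: np.ndarray, doc_hat: np.ndarray, doc_norm: float,
                    eps: float = 1e-12) -> float:
    """Raw score times an assumed correction factor doc_norm / ||doc_hat||."""
    raw = float(query @ doc_hat)
    # Assumed correction: undo the shrinkage of the reconstructed direction,
    # then restore the original document length saved in Step 1.
    shrink = max(float(np.linalg.norm(doc_hat)), eps)
    return raw * doc_norm / shrink

# Toy case: the true document is [1.2, 1.6], i.e. norm 2 and direction [0.6, 0.8].
# Suppose quantization reconstructs the direction as [0.3, 0.4] (length 0.5).
doc = np.array([1.2, 1.6])
doc_norm = float(np.linalg.norm(doc))
doc_hat = np.array([0.3, 0.4])
query = np.array([1.0, 0.0])

raw = float(query @ doc_hat)
fixed = corrected_score(query, doc_hat, doc_norm)
exact = float(query @ doc)
print(raw, fixed, exact)
```

In this toy case the reconstruction keeps the right direction and only loses length, so the correction recovers the exact score. Real quantization error also perturbs direction, so the correction reduces bias rather than removing all error.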

Memory Math: Real Numbers

Scenario: OpenAI text-embedding-3-small (1536 dimensions)
Corpus:   10 million documents

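float32 storage:
  1536 dims × 4 bytes × 10,000,000 = 61,440,000,000 bytes ≈ 61 GB (57 GiB)

turbovec 4-bit:
  768 bytes × 10,000,000 = 7,680,000,000 bytes ≈ 7.7 GB (7.2 GiB)

turbovec 2-bit:
  384 bytes × 10,000,000 = 3,840,000,000 bytes ≈ 3.8 GB (3.6 GiB)

Compression: ~8× at 4-bit, ~16× at 2-bit (packed codes only, before norms and index metadata)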

What this means in practice: At 4-bit, an index that needed a 64 GB instance as float32 fits in roughly 8 GB, so the same hardware can serve about 8× the data, or you can move to a smaller instance tier.
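
A short sketch for checking figures like these yourself. It counts only the packed vector payload and ignores stored norms and index metadata (an assumption), so real indexes will be slightly larger. The helper name `packed_index_bytes` is made up for this example.

```python
def packed_index_bytes(num_vectors: int, dim: int, bits_per_dim: int) -> int:
    """Bytes needed for the vector payload alone (no norms, no metadata)."""
    total_bits = num_vectors * dim * bits_per_dim
    return total_bits // 8

NUM_DOCS = 10_000_000
for dim in (768, 1536):
    fp32 = packed_index_bytes(NUM_DOCS, dim, 32)
    for bits in (4, 2):
        packed = packed_index_bytes(NUM_DOCS, dim, bits)
        print(f"dim={dim} bits={bits}: "
              f"{fp32 / 1e9:.2f} GB -> {packed / 1e9:.2f} GB "
              f"({fp32 / packed:.0f}x)")
```

Running it for both 768 and 1536 dimensions makes it easy to see which embedding size a quoted memory figure corresponds to.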

Recall Benchmarks (100K vectors, k=64)

Dataset	Dimensions	Bit width	turbovec vs FAISS PQ (R@1)
text-embedding-3-small	1536	4-bit	+3.4 pp
text-embedding-3-large	3072	4-bit	+0.4 pp
GloVe	200	4-bit	+0.3 pp
GloVe	200	2-bit	-1.2 pp (extreme compression penalty)

Takeaway: At high dimensionality, exactly where OpenAI embeddings live, turbovec wins on both memory and recall. The only regression in this table is 2-bit quantization on 200-dimensional GloVe vectors, where the extreme compression pushes past the algorithm's sweet spot.
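
The source does not say exactly how recall was computed. One common definition, assumed here, is the fraction of queries whose exact nearest neighbor appears among the first k returned results. A minimal sketch for evaluating your own index:

```python
import numpy as np

def recall_at_k(true_nn: np.ndarray, retrieved: np.ndarray, k: int) -> float:
    """Fraction of queries whose exact nearest neighbor is in the top-k results.

    true_nn:   shape (num_queries,), exact nearest-neighbor id per query
    retrieved: shape (num_queries, depth), ids returned by the index, best first
    """
    hits = (retrieved[:, :k] == true_nn[:, None]).any(axis=1)
    return float(hits.mean())

# Three queries; the index returned two candidates for each.
true_nn = np.array([0, 5, 2])
retrieved = np.array([[0, 1],
                      [3, 4],
                      [9, 2]])

r1 = recall_at_k(true_nn, retrieved, k=1)
r2 = recall_at_k(true_nn, retrieved, k=2)
print(r1, r2)
```

The ground-truth ids would normally come from an exact float32 brute-force search over the same corpus.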

Framework Integration

turbovec provides drop-in replacements with fully compatible APIs:

LangChain:

from turbovec.langchain import TurboVecStore
store = TurboVecStore.from_documents(docs, embeddings)
results = store.similarity_search("query", k=5)

LlamaIndex:

from llama_index.core import StorageContext
from turbovec.llama_index import TurboVecVectorStore

vector_store = TurboVecVectorStore(dim=1536)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

Haystack:

from turbovec.haystack import TurboVecDocumentStore
document_store = TurboVecDocumentStore()

Rust (native):

use turbovec::TurboQuantIndex;

let mut index = TurboQuantIndex::new(1536, 4);
index.add(&vectors);
let results = index.search(&queries, 10);

Links and Resources

Official Resources

🌟 GitHub: RyanCodrai/turbovec
📦 PyPI: pip install turbovec
📄 Core paper: TurboQuant (ICLR 2026)
📄 Reference paper: RaBitQ (SIGMOD 2024)

Related Resources

FAISS documentation — for direct comparison
PyO3 documentation — the Rust-Python binding mechanism behind turbovec

Conclusion

turbovec isn't "another vector database." It's a direct attack on the memory cost of vector indexes, and in the benchmarks above it wins on memory and recall together. The TurboQuant algorithm derives its quantization boundaries from theory rather than learning them from data, which is why it needs no training phase. The Rust + SIMD engine converts that theoretical advantage into a measurable speed lead over FAISS.

An 8× memory reduction changes what hardware a RAG system requires. For developers running local semantic search at scale, or teams trying to cut cloud vector retrieval costs, turbovec is a single-library swap well worth evaluating.
