WonderLab

Open Source Project of the Day (#90): turbovec - The Vector Index That Shrinks 10M Docs from 31 GB to 4 GB

Introduction

"A 10 million document corpus takes 31 GB of RAM as float32. turbovec fits it in 4 GB."

This is article #90 in the Open Source Project of the Day series. Today's project is turbovec — a vector index library that shrinks memory by 8× and still runs faster than FAISS.

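The memory cost of vector indexes is one of the most underestimated infrastructure problems in RAG. A 1536-dimensional OpenAI embedding is 6,144 bytes (6 KiB) of float32 data. One million documents: about 6.1 GB. Ten million: about 61 GB, more than most machines can hold, let alone search efficiently. (The 31 GB in the quote above would correspond to 768-dimensional vectors: 768 × 4 bytes × 10 million ≈ 30.7 GB.)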

turbovec's answer comes from a Google Research paper at ICLR 2026: the TurboQuant algorithm. It stores each coordinate in a few bits instead of 32, so a 1536-dimensional vector drops from 6,144 bytes to 768 bytes at 4 bits, or 384 bytes at 2 bits. Statistical calibration keeps recall from dropping, and a Rust + SIMD engine pushes search speed past FAISS.

What You'll Learn

  • Why vector quantization is hard, and how TurboQuant solves the recall degradation problem
  • turbovec's 6-step algorithm: normalization → random rotation → calibration → Lloyd-Max quantization → bit-packing → length renormalization scoring
  • The Python API: TurboQuantIndex, IdMapIndex, and filtered search
  • Real benchmarks: memory and recall numbers vs. FAISS PQ
  • Drop-in replacement for LangChain / LlamaIndex / Haystack with one line of code

Prerequisites

  • Basic understanding of embedding vectors
  • Experience with RAG pipelines or vector databases
  • Basic Python experience; Rust experience optional

Project Background

What Is turbovec?

turbovec is a high-performance vector index library with a Rust core exposed through PyO3/maturin Python bindings. Its technical foundation is Google Research's TurboQuant algorithm — a low-bit-width vector compression scheme that beats FAISS Product Quantization on both compression ratio and search recall.

Its positioning is clear: a local-first, zero-dependency vector index engine that embeds directly into RAG stacks.

Author / Team

  • Author: RyanCodrai
  • Algorithm source: Google Research (TurboQuant, ICLR 2026)
  • License: MIT

Project Stats

  • ⭐ GitHub Stars: 8,900+
  • 🍴 Forks: 813
  • 📦 Install: pip install turbovec / cargo add turbovec
  • 📄 License: MIT
  • 🌐 Language: Python 55.7% + Rust 44.3%

Core Features

What It Does

turbovec attacks two fundamental tensions in vector indexing:

  1. Memory vs. scale: float32 storage makes large corpora require tens of GB; quantization typically trades recall for compression
  2. Speed vs. precision: faster ANN search usually means lower accuracy; turbovec improves both at once

Core numbers:

Metric | float32 | turbovec | Gain
--- | --- | --- | ---
10M doc index memory | 31 GB | 4 GB (4-bit) | 8× compression
Per-vector storage (1536-dim) | 6,144 bytes | 768 bytes (4-bit); 384 bytes (2-bit) | 8× (4-bit); 16× (2-bit)
ARM search speed vs FAISS FastScan | baseline | +12–20% | Faster

Use Cases

  1. Memory-constrained local RAG systems

    • Run semantic search over millions of documents on consumer hardware (16–32 GB RAM), without a cloud vector database
  2. Drop-in replacement for existing framework vector stores

    • Swap InMemoryVectorStore in LangChain with zero changes to business logic
  3. RAG with filtered search

    • Filter by user permissions, document source, or time range — filtering and search execute together inside the SIMD kernel
  4. Native Rust vector retrieval

    • Use as a Rust crate in a Rust service without a Python runtime dependency
  5. Cost-sensitive vector retrieval

    • 8× memory compression means the same hardware serves 8× the corpus, or lets you downsize your cloud instance tier

Quick Start

pip install turbovec

# Framework-integrated variants
# Quote the extras so shells such as zsh do not expand the brackets
pip install "turbovec[langchain]"     # Replace InMemoryVectorStore
pip install "turbovec[llama-index]"   # Replace SimpleVectorStore
pip install "turbovec[haystack]"      # Replace InMemoryDocumentStore
pip install "turbovec[agno]"          # Replace LanceDb

Basic indexing:

import numpy as np
from turbovec import TurboQuantIndex

# Create index: 1536-dimensional, 4-bit quantization
index = TurboQuantIndex(dim=1536, bit_width=4)

# Add vectors — no training phase needed
vectors = np.random.randn(10000, 1536).astype(np.float32)
index.add(vectors)

# Search
query = np.random.randn(1, 1536).astype(np.float32)
scores, indices = index.search(query, k=10)

# Persist to disk
index.write("my_index.tq")
index2 = TurboQuantIndex.read("my_index.tq")

With ID mapping (supports deletion):

import numpy as np
from turbovec import IdMapIndex

index = IdMapIndex(dim=1536, bit_width=4)

# Reuses `vectors` from the basic indexing example above
doc_ids = np.array([1001, 1002, 1003, 1004], dtype=np.uint64)
index.add_with_ids(vectors[:4], doc_ids)

index.remove(1002)  # O(1) removal

Filtered search:

# Search only within a specific set of document IDs
allowed_ids = np.array([1001, 1003, 1004], dtype=np.uint64)
scores, ids = index.search(query, k=5, allowlist=allowed_ids)

# Or use a bitmask for larger-scale filtering.
# create_bitmask and allowed_positions are placeholders; this article does not define them.
bitmask = create_bitmask(allowed_positions)
scores, ids = index.search(query, k=5, slot_bitmask=bitmask)

LangChain drop-in replacement:

# Before (`embeddings` is any LangChain Embeddings instance)
from langchain_core.vectorstores import InMemoryVectorStore
store = InMemoryVectorStore(embeddings)

# After: same vector-store interface
from turbovec.langchain import TurboVecStore
store = TurboVecStore(embeddings)

Key Properties

  1. Online ingestion, no training phase

    • FAISS PQ requires training a codebook on representative data first. turbovec derives its quantization boundaries from theoretical distributions — just add and go.
  2. SIMD-accelerated search kernels

    • ARM: NEON instructions — beats FAISS FastScan by 12–20% on Apple Silicon and equivalent
    • x86: AVX-512BW primary path, AVX2 fallback — outperforms FAISS 4-bit across the board
  3. Recall that exceeds FAISS PQ

    • +0.4 to +3.4 percentage points on R@1 at 1536/3072 dimensions
  4. Filtered search built into the kernel

    • allowlist / bitmask filtering runs inside the SIMD kernel — not a post-search filter
  5. Local-first, fully offline

    • No network calls, no managed service dependency, suitable for air-gapped environments
  6. Rust-native with Python bindings

    • Use as a Rust crate (cargo add turbovec) or Python package (pip install turbovec)
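
To make the difference between in-scan filtering and post-search filtering concrete, here is a brute-force NumPy sketch. It does not use turbovec: `search_postfilter` and `search_prefilter` are hypothetical helper names, and exact dot products stand in for the quantized SIMD kernel.

```python
import numpy as np

def search_postfilter(db, query, k, allowed):
    """Take the global top-k first, then drop ids that are not allowed."""
    scores = db @ query
    top = np.argsort(-scores)[:k]
    return [int(i) for i in top if int(i) in allowed]

def search_prefilter(db, query, k, allowed):
    """Mask disallowed rows before ranking, so all k slots go to allowed ids."""
    scores = db @ query
    mask = np.zeros(len(db), dtype=bool)
    mask[list(allowed)] = True
    scores = np.where(mask, scores, -np.inf)
    top = np.argsort(-scores)[:k]
    return [int(i) for i in top if mask[i]]

rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 64)).astype(np.float32)
query = rng.standard_normal(64).astype(np.float32)
allowed = set(range(0, 1000, 50))   # 20 permitted rows out of 1000

post = search_postfilter(db, query, k=5, allowed=allowed)
pre = search_prefilter(db, query, k=5, allowed=allowed)
print(f"post-filter returned {len(post)} hits, pre-filter returned {len(pre)}")
```

With a restrictive allowlist, the post-filter variant can return fewer than k results because most of the global top-k is discarded, while the pre-filter variant always fills k slots when enough allowed rows exist.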

Deep Dive

TurboQuant: 6 Steps to Quantize Without Losing Recall

Traditional quantization methods like FAISS Product Quantization learn their codebook from data — which means they need a training pass, can degrade on out-of-distribution data, and their buckets are never provably optimal. TurboQuant's key insight: derive optimal quantization boundaries from theory, not data.

Step 1: Normalization

v_norm = v / ‖v‖
r      = ‖v‖   (stored separately)

Decompose every vector into direction (unit vector) and magnitude (scalar). Cosine similarity depends only on direction; the dot product also scales with magnitude, so the norm is saved for the scoring correction later.
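
A small NumPy sketch of this decomposition follows. The helper name `split_direction_and_norm` is made up for illustration and is not part of turbovec.

```python
import numpy as np

def split_direction_and_norm(v):
    """Return (unit vector, length) for a 1-D vector v."""
    r = float(np.linalg.norm(v))
    if r == 0.0:
        raise ValueError("zero vector has no direction")
    return v / r, r

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])   # same direction as a, twice as long
q = np.array([1.0, 0.0])

a_dir, a_len = split_direction_and_norm(a)
b_dir, b_len = split_direction_and_norm(b)

cos_a = float(a_dir @ q)   # cosine with the unit-length query
cos_b = float(b_dir @ q)
dot_a = float(a @ q)       # raw dot products still differ by the length ratio
dot_b = float(b @ q)
```

The two vectors get the same cosine score but different dot products, and `a_len * cos_a` recovers `dot_a`, which is why storing the norm separately loses nothing.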

Step 2: Random Rotation

v_rotated = R × v_norm
(R is a random orthogonal matrix)

Multiply by a random orthogonal matrix. Rotation preserves inner products (it's an isometry) while spreading the vector's energy evenly across coordinates. After rotation, each coordinate follows a Beta distribution (close to a Gaussian with variance 1/d when the dimension d is large), which is the theoretical assumption that enables Step 4.
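
The sketch below illustrates both properties with plain NumPy. It builds the rotation from a QR decomposition of a Gaussian matrix, which is one standard way to sample a random orthogonal matrix; how turbovec actually generates its rotation is not stated here, so treat `random_orthogonal` as an assumption.

```python
import numpy as np

def random_orthogonal(dim, seed=0):
    """Sample a random orthogonal matrix via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Fix column signs so the result is uniformly (Haar) distributed.
    return q * np.sign(np.diag(r))

dim = 256
R = random_orthogonal(dim)

# A deliberately "spiky" unit vector: all energy in one coordinate.
x = np.zeros(dim)
x[0] = 1.0
y = np.random.default_rng(1).standard_normal(dim)
y /= np.linalg.norm(y)

x_rot, y_rot = R @ x, R @ y
inner_before = float(x @ y)
inner_after = float(x_rot @ y_rot)      # unchanged by the rotation
max_coord = float(np.abs(x_rot).max())  # energy is now spread out
```

The inner product is the same before and after, while the largest coordinate of the spiky vector falls from 1.0 to a small value on the order of a few times 1/sqrt(dim).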

Step 3: Per-Coordinate Calibration (TQ+)

v_calibrated[i] = (v_rotated[i] - shift[i]) / scale[i]

The rotated coordinates follow a Beta distribution in theory, but real data may have small shifts. TQ+ fits a shift and scale per coordinate to align the empirical distribution with the theoretical one, improving quantization accuracy.
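
The fitting procedure itself is not described above, so the following is only a sketch under a stated assumption: shift and scale are chosen by moment matching, so that each calibrated coordinate has mean 0 and standard deviation 1/sqrt(dim), the moments of one coordinate of a uniformly random unit vector. `fit_calibration` and `apply_calibration` are hypothetical names.

```python
import numpy as np

def fit_calibration(rotated, eps=1e-12):
    """Fit per-coordinate shift/scale by matching mean and standard deviation.

    Assumption: the target is mean 0 and std 1/sqrt(dim).
    """
    dim = rotated.shape[1]
    shift = rotated.mean(axis=0)
    scale = rotated.std(axis=0) * np.sqrt(dim) + eps
    return shift, scale

def apply_calibration(rotated, shift, scale):
    """v_calibrated[i] = (v_rotated[i] - shift[i]) / scale[i]"""
    return (rotated - shift) / scale

rng = np.random.default_rng(0)
dim = 64
# Synthetic "rotated" data: coordinates slightly off-centre and unevenly scaled.
offsets = rng.normal(0.0, 0.02, size=dim)
spreads = rng.uniform(0.5, 1.5, size=dim) / np.sqrt(dim)
rotated = offsets + spreads * rng.standard_normal((5000, dim))

shift, scale = fit_calibration(rotated)
calibrated = apply_calibration(rotated, shift, scale)
```

After calibration every coordinate has the same centre and spread, so a single set of precomputed quantization boundaries fits all of them.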

Step 4: Lloyd-Max Quantization

2-bit → 4 buckets  (optimal boundaries b₁, b₂, b₃)
4-bit → 16 buckets (optimal boundaries b₁...b₁₅)

Because the coordinate distribution is now known (calibrated Beta), the optimal Lloyd-Max bucket boundaries can be precomputed before deployment — not learned from data. This is the fundamental reason turbovec needs no training phase.
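
As an illustration of how such boundaries can be computed offline, here is a minimal Lloyd-Max sketch. It runs on samples drawn from N(0, 1/dim) as a stand-in for the theoretical coordinate distribution (an assumption; a real implementation could work from the density directly), and `lloyd_max` and `quantize` are illustrative names.

```python
import numpy as np

def lloyd_max(samples, bits, iters=50):
    """1-D Lloyd-Max: alternate midpoint boundaries and bucket-mean centroids."""
    levels = 2 ** bits
    # Start from evenly spaced quantiles so every bucket is non-empty.
    centroids = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        boundaries = (centroids[:-1] + centroids[1:]) / 2
        codes = np.searchsorted(boundaries, samples)
        for j in range(levels):
            members = samples[codes == j]
            if members.size:
                centroids[j] = members.mean()
    boundaries = (centroids[:-1] + centroids[1:]) / 2
    return boundaries, centroids

def quantize(x, boundaries):
    """Map each value to the index of the bucket it falls in."""
    return np.searchsorted(boundaries, x).astype(np.uint8)

rng = np.random.default_rng(0)
dim = 1536
samples = rng.standard_normal(100_000) / np.sqrt(dim)

b2, c2 = lloyd_max(samples, bits=2)   # 3 boundaries, 4 centroids
b4, c4 = lloyd_max(samples, bits=4)   # 15 boundaries, 16 centroids

mse2 = float(np.mean((samples - c2[quantize(samples, b2)]) ** 2))
mse4 = float(np.mean((samples - c4[quantize(samples, b4)]) ** 2))
```

Because the boundaries depend only on the assumed distribution and the bit width, they can be computed once and shipped as constants; nothing here looks at the corpus being indexed.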

Step 5: Bit-Packing

1536 dims × 4 bits = 6,144 bits = 768 bytes (vs 6,144 bytes for float32 → 8× compression)
1536 dims × 2 bits = 3,072 bits = 384 bytes (→ 16× compression)
+ a small amount of per-vector metadata (for example, the stored norm)

Quantized integers are packed tightly into bit arrays for maximum memory efficiency.
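
A minimal NumPy sketch of 4-bit packing is shown below. The nibble order (first code in the high four bits) is an assumption made for illustration; turbovec's actual memory layout is not described here and is likely arranged for SIMD access.

```python
import numpy as np

def pack_4bit(codes):
    """Pack an even-length array of values in [0, 15] into half as many bytes."""
    codes = np.asarray(codes, dtype=np.uint8)
    if codes.size % 2 or codes.max(initial=0) > 15:
        raise ValueError("need an even number of codes, each in 0..15")
    return (codes[0::2] << 4) | codes[1::2]

def unpack_4bit(packed):
    """Inverse of pack_4bit."""
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed >> 4
    out[1::2] = packed & 0x0F
    return out

codes = np.random.default_rng(0).integers(0, 16, size=1536, dtype=np.uint8)
packed = pack_4bit(codes)
print(packed.nbytes, "bytes for", codes.size, "4-bit codes")
```

Two codes share each byte, so 1536 codes occupy 768 bytes, and unpacking returns the original codes exactly.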

Step 6: Length Renormalization Scoring

score_corrected = score_raw × correction(r_query, r_doc)

Quantization systematically underestimates inner products (quantization error compresses magnitudes). Using the norms r saved in Step 1, multiply by a per-pair correction factor at scoring time. This correction happens outside the SIMD kernel, so the inner scan loop is unchanged, yet it meaningfully lifts recall.
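
The exact form of `correction` is not given above. The sketch below shows one plausible reading, labeled as an assumption: when each bucket decodes to the mean of its members, the decoded vector is shorter than the original, and dividing the raw score by the decoded vector's length undoes that shrinkage. `bucket_mean_quantize` is a simplified stand-in quantizer, not turbovec's.

```python
import numpy as np

def bucket_mean_quantize(x, bits):
    """Quantize x into 2**bits equal-population buckets; decode each to its bucket mean."""
    levels = 2 ** bits
    edges = np.quantile(x, np.arange(1, levels) / levels)
    codes = np.searchsorted(edges, x)
    means = np.array([x[codes == j].mean() for j in range(levels)])
    return means[codes]

rng = np.random.default_rng(0)
x = rng.standard_normal(1536)
x /= np.linalg.norm(x)            # unit-length document vector

x_hat = bucket_mean_quantize(x, bits=2)
shrunk_norm = float(np.linalg.norm(x_hat))   # shorter than the original length of 1

raw_score = float(x @ x_hat)                 # the exact value would be x @ x = 1
corrected_score = raw_score / shrunk_norm    # rescale x_hat back to unit length
```

Scoring the vector against itself makes the bias easy to see: the raw score falls short of 1, and the length-corrected score lands closer to it.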


Memory Math: Real Numbers

Scenario: OpenAI text-embedding-3-small (1536 dimensions)
Corpus:   10 million documents

float32 storage:
  1536 dims × 4 bytes × 10,000,000 = 61,440,000,000 bytes ≈ 61.4 GB (57.2 GiB)

turbovec 4-bit (768 bytes per vector):
  768 bytes × 10,000,000 = 7,680,000,000 bytes ≈ 7.7 GB

turbovec 2-bit (384 bytes per vector):
  384 bytes × 10,000,000 = 3,840,000,000 bytes ≈ 3.8 GB

Compression of the packed codes: 8× at 4-bit, 16× at 2-bit (before per-vector metadata and index structure)

What this means in practice: at 8× compression, a deployment that required a 64 GB instance fits in roughly 8 GB of index memory. You can move to a smaller cloud instance tier, or serve about 8× the data on the same hardware.
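
The arithmetic is easy to check for other corpus sizes and dimensions. The helper below, `index_bytes` (a made-up name), counts only the packed codes and ignores norms, IDs, and any index overhead.

```python
def index_bytes(num_vectors, dim, bits_per_coord):
    """Bytes for the packed codes alone (no norms, ids, or index overhead)."""
    bits = num_vectors * dim * bits_per_coord
    return (bits + 7) // 8   # round up to whole bytes

n, dim = 10_000_000, 1536
for label, bits in [("float32", 32), ("4-bit", 4), ("2-bit", 2)]:
    size = index_bytes(n, dim, bits)
    print(f"{label:>8}: {size / 1e9:6.2f} GB  ({size / 2**30:6.2f} GiB)")
```

Note that 384 bytes per vector corresponds to 2 bits per coordinate at 1536 dimensions, or to 4 bits per coordinate at 768 dimensions.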


Recall Benchmarks (100K vectors, k=64)

Dataset | Dimensions | Bit width | turbovec vs FAISS PQ (R@1)
--- | --- | --- | ---
text-embedding-3-small | 1536 | 4-bit | +3.4 pp
text-embedding-3-large | 3072 | 4-bit | +0.4 pp
GloVe | 200 | 4-bit | +0.3 pp
GloVe | 200 | 2-bit | -1.2 pp (extreme compression penalty)

Takeaway: At high dimensionality — exactly where OpenAI embeddings live — turbovec wins on both memory and recall. The only exception is 2-bit quantization on low-dimensional (≤200-dim) vectors, where the extreme compression pushes past the algorithm's sweet spot.
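
For readers who want to reproduce this kind of comparison on their own data, here is a small recall helper. It assumes the common convention that recall at k is the fraction of queries whose true nearest neighbour appears among the first k results; the benchmark above may define R@1 slightly differently, and `recall_at_k` is an illustrative name.

```python
import numpy as np

def recall_at_k(true_nn, retrieved, k):
    """Fraction of queries whose true nearest neighbour is in the first k results."""
    true_nn = np.asarray(true_nn).reshape(-1, 1)
    hits = (np.asarray(retrieved)[:, :k] == true_nn).any(axis=1)
    return float(hits.mean())

# Four toy queries: the exact nearest neighbour id, and what an index returned.
true_nn = [7, 3, 9, 1]
retrieved = [[7, 2, 5], [4, 3, 8], [0, 6, 2], [1, 9, 7]]

r1 = recall_at_k(true_nn, retrieved, k=1)
r3 = recall_at_k(true_nn, retrieved, k=3)
```

In practice `true_nn` would come from an exact float32 search over the same corpus, and `retrieved` from the quantized index under test.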


Framework Integration

turbovec provides drop-in replacements with fully compatible APIs:

LangChain:

from turbovec.langchain import TurboVecStore
store = TurboVecStore.from_documents(docs, embeddings)
results = store.similarity_search("query", k=5)

LlamaIndex:

from llama_index.core import StorageContext
from turbovec.llama_index import TurboVecVectorStore

vector_store = TurboVecVectorStore(dim=1536)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

Haystack:

from turbovec.haystack import TurboVecDocumentStore
document_store = TurboVecDocumentStore()

Rust (native):

use turbovec::TurboQuantIndex;

let mut index = TurboQuantIndex::new(1536, 4);
index.add(&vectors);
let results = index.search(&queries, 10);

Conclusion

turbovec isn't "another vector database." It's a direct attack on the memory cost of vector indexes, and in the benchmarks above it wins on memory and recall at the same time. The TurboQuant algorithm derives provably optimal quantization from theory rather than learning from data, which is why it needs no training phase and generalizes better. The Rust + SIMD engine converts that theoretical advantage into a measurable speed lead over FAISS.

An 8× memory reduction changes what hardware a RAG system requires. For developers running local semantic search at scale, or teams trying to cut cloud vector retrieval costs, turbovec is the most impactful single-library swap available right now.

