Lars

Posted on • Originally published at larsroettig.me

turbovec: Local RAG Without the 60 GB Tax

#ai

A 1536-dimensional float32 embedding is 6 KB. A corpus of 10 million documents is roughly 60 GB of raw vectors before any index overhead. That doesn't fit in laptop RAM, and even on a machine with 64 GB you've left yourself no headroom for anything else.

I kept reaching for FAISS. It works, but its product-quantization indexes gave me two friction points: training requires a representative sample of your corpus upfront, and compression quality depends on how well that sample matches the real distribution. If your data distribution shifts, you're rebuilding.

turbovec solves both, and the TurboQuant paper (arXiv April 2025, Google Research + NYU) explains the math behind why it can skip the training step entirely.

What TurboQuant actually does

The core idea is a mathematical trick: apply a random rotation to your vectors before compressing them.

After rotation, each coordinate follows a scaled Beta distribution that converges to Gaussian N(0, 1/d) in high dimensions. The coordinates also become nearly independent. That combination is what makes training-free quantization possible: you can precompute optimal bucket boundaries from pure math, with no data required upfront.
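To see the effect for yourself, here's a minimal NumPy sketch. It builds a random orthogonal matrix from the QR decomposition of a Gaussian matrix (one common construction; I'm not claiming this is how turbovec generates its rotation), rotates a deliberately spiky unit vector, and checks that the mass gets spread evenly, with a mean squared coordinate of 1/d.

```python
import numpy as np

def random_rotation(d: int, seed: int = 0) -> np.ndarray:
    """Random orthogonal d x d matrix from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Fix column signs so the result is uniformly distributed over orthogonal matrices.
    return q * np.sign(np.diag(r))

d = 1536
rotation = random_rotation(d)

# A worst-case input for naive per-coordinate quantization: all mass in one coordinate.
x = np.zeros(d)
x[0] = 1.0

y = rotation @ x
coord_var = float(np.mean(y ** 2))   # equals 1/d because a rotation preserves the norm
max_abs = float(np.max(np.abs(y)))   # no single coordinate dominates after rotation

print(f"norm after rotation:     {np.linalg.norm(y):.6f}")
print(f"mean squared coordinate: {coord_var:.3e} (1/d = {1 / d:.3e})")
print(f"largest |coordinate|:    {max_abs:.4f}")
```

Before rotation the largest coordinate is 1.0; after it, every coordinate sits in the narrow range a N(0, 1/d) distribution would predict, which is what lets one fixed set of bucket boundaries serve every vector.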

The algorithm in four steps:

  1. Normalize each vector to unit length; store the norm as a float
  2. Apply a fixed random rotation matrix (same matrix for the whole index, computed once at setup)
  3. Quantize each rotated coordinate against precomputed bucket boundaries; at 4-bit that's 16 buckets per coordinate
  4. Pack the integers: at 4-bit, a 1536-dim vector goes from 6,144 bytes (float32) to 768 bytes, plus the stored norm
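The four steps above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not turbovec's implementation: the bucket boundaries here are equal-probability quantiles of N(0, 1/d), a simple stand-in for the paper's MSE-optimal boundaries, and the names `make_rotation`, `make_boundaries`, and `encode` are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def make_rotation(d, seed=0):
    """One fixed random orthogonal matrix, shared by every vector in the index."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def make_boundaries(d, bits=4):
    """Interior bucket edges for N(0, 1/d) at equal-probability quantiles.
    Assumption: a stand-in for the paper's MSE-optimal boundaries."""
    levels = 2 ** bits
    gauss = NormalDist(mu=0.0, sigma=d ** -0.5)
    return np.array([gauss.inv_cdf(i / levels) for i in range(1, levels)])

def encode(x, rotation, boundaries):
    norm = float(np.linalg.norm(x))                 # step 1: store the norm separately
    rotated = rotation @ (x / norm)                 # step 2: rotate the unit vector
    codes = np.searchsorted(boundaries, rotated).astype(np.uint8)  # step 3: bucket ids 0..15
    packed = (codes[0::2] << 4) | codes[1::2]       # step 4: two 4-bit codes per byte
    return norm, codes, packed

d = 1536
rotation = make_rotation(d)
boundaries = make_boundaries(d)
x = np.random.default_rng(1).standard_normal(d).astype(np.float32)

norm, codes, packed = encode(x, rotation, boundaries)
print(x.nbytes, "->", packed.nbytes, "bytes of codes, plus one float for the norm")
```

Note that nothing in `make_boundaries` looks at the data. The boundaries depend only on the dimension and the bit-width, which is the whole point of the rotation. The packing line assumes 4-bit codes; other bit-widths need a different layout.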

A 10M-doc corpus: ~60 GB float32 becomes ~7.5 GB at 4-bit, an 8x reduction. The paper proves the MSE distortion lands within a factor of √3·π/2 ≈ 2.7 of the information-theoretic lower bound at any bit-width, which is tight for a training-free method. At 4-bit specifically, MSE is approximately 0.009.
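The storage arithmetic is easy to check. The helper names below are hypothetical, and I'm assuming the per-vector norm is kept as a 4-byte float32; any extra index overhead is ignored.

```python
import math

def float32_size_gb(num_vectors: int, dim: int) -> float:
    """Raw float32 storage in GB (10^9 bytes)."""
    return num_vectors * dim * 4 / 1e9

def quantized_size_gb(num_vectors: int, dim: int, bits: int, norm_bytes: int = 4) -> float:
    """Packed codes plus one stored norm per vector (norm size is an assumption)."""
    code_bytes = dim * bits // 8
    return num_vectors * (code_bytes + norm_bytes) / 1e9

n, d = 10_000_000, 1536
raw = float32_size_gb(n, d)
packed = quantized_size_gb(n, d, bits=4)

print(f"float32: {raw:.2f} GB")          # 61.44 GB
print(f"4-bit:   {packed:.2f} GB")       # 7.72 GB
print(f"ratio:   {raw / packed:.2f}x")   # 7.96x
print(f"bound constant: {math.sqrt(3) * math.pi / 2:.2f}")  # 2.72
```

The stored norm is why the ratio lands just under 8x rather than exactly on it.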

Search doesn't decompress vectors. It rotates the query once into the same domain and scores against codebook centroids using SIMD kernels (NEON on ARM, AVX-512 on x86). Per turbovec's own benchmarks, on ARM it beats FAISS IndexPQFastScan by 12-20%.
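A rough sketch of what scoring without decompression means, in plain NumPy rather than SIMD. The codebook values and sizes here are made up for illustration, and the codes are left unpacked for readability; turbovec's actual kernels will look nothing like this.

```python
import numpy as np

rng = np.random.default_rng(0)
d, levels, n = 64, 16, 1000

# Hypothetical codebook: 16 reconstruction values shared by every coordinate.
centroids = np.linspace(-0.3, 0.3, levels)
# Stored index: one 4-bit code per coordinate.
codes = rng.integers(0, levels, size=(n, d), dtype=np.uint8)

rotation, _ = np.linalg.qr(rng.standard_normal((d, d)))
query = rng.standard_normal(d)
q_rot = rotation @ query                      # rotate the query once

# table[j, c] = contribution of coordinate j when its stored code is c
table = q_rot[:, None] * centroids[None, :]   # shape (d, 16)

# Score every stored vector by summing table lookups; no vector is reconstructed.
scores = table[np.arange(d), codes].sum(axis=1)
top10 = np.argsort(-scores)[:10]
print(top10)
```

The per-query work is one rotation and one small table; after that, each stored vector costs d lookups and adds, which is the pattern fast-scan kernels vectorize.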

The part I initially glossed over: MSE and inner product are different problems

For RAG, what matters is preserving similarity scores, and MSE-optimized quantizers don't do that.

When you search a vector index, you're finding stored vectors with the highest dot product against your query. The TurboQuant paper proves that quantizers optimized purely for reconstruction accuracy introduce bias into inner product estimates. The compressed vectors rebuild accurately, but their similarity scores with a query vector are systematically off, so the index can return the wrong nearest neighbors.

TurboQuant fixes this with a two-stage approach. Stage one applies MSE quantization at one fewer bit than your target budget (so 3 bits if you want 4-bit total), which minimizes reconstruction error and shrinks the residual as much as possible. Stage two takes that residual and applies a 1-bit random projection transform called QJL (Quantized Johnson-Lindenstrauss). QJL is an optimal 1-bit inner product quantizer: it reduces the residual to a single bit per dimension using sign(random_matrix · vector), and the paper proves this makes the combined estimator unbiased.
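Here's a small numerical check of the stage-two idea, as I understand it: keep only the sign of each Gaussian random projection of the residual, plus the residual's norm, and rescale by sqrt(π/2) when estimating the inner product. The function names are hypothetical and this only covers the QJL step, not the full two-stage quantizer.

```python
import numpy as np

def qjl_encode(residual, proj):
    """Keep one sign bit per random projection, plus the residual norm."""
    return np.sign(proj @ residual), float(np.linalg.norm(residual))

def qjl_inner_product(query, signs, norm, proj):
    """Estimate <query, residual> from the sign bits (Gaussian projections assumed)."""
    m = proj.shape[0]
    return float(np.sqrt(np.pi / 2) * norm / m * ((proj @ query) @ signs))

rng = np.random.default_rng(0)
d = 64
residual = rng.standard_normal(d)
residual *= 0.3 / np.linalg.norm(residual)   # a small leftover from stage one
query = rng.standard_normal(d)
query /= np.linalg.norm(query)
true_ip = float(query @ residual)

# Average over many independent projection matrices to expose any bias.
estimates = []
for _ in range(2000):
    proj = rng.standard_normal((d, d))
    signs, norm = qjl_encode(residual, proj)
    estimates.append(qjl_inner_product(query, signs, norm, proj))

mean_est, std_est = float(np.mean(estimates)), float(np.std(estimates))
print(f"true: {true_ip:+.4f}  mean estimate: {mean_est:+.4f}  std: {std_est:.4f}")
```

The mean converges on the true inner product, which is the unbiasedness claim. The standard deviation of a single estimate is not zero, though, and that variance is what shows up later as shuffled rankings.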

The whole thing is data-oblivious. It works on the first vector you add to the index. The result is near-optimal reconstruction accuracy and unbiased similarity scores at your target bit-width.

For KV cache compression in long-context LLMs (storing attention keys and values), the paper tests Llama-3.1-8B on LongBench-E: 3.5 bits per channel matches unquantized quality, 2.5 bits shows only marginal degradation, while compressing the cache by more than 5x. The inner product unbiasedness property is what makes it work for attention computation.

The practical part: one import swap

turbovec ships drop-in replacements for the in-memory vector stores in LangChain, LlamaIndex, Haystack, and Agno. For LangChain:

pip install "turbovec[langchain]"

# Before
from langchain_core.vectorstores import InMemoryVectorStore

# After — same API, smaller footprint, faster search
from turbovec.integrations.langchain import TurboVecVectorStore as InMemoryVectorStore

Everything else in the pipeline stays the same. I swapped this into an existing LangChain project in a few minutes. Memory dropped by roughly 8x and retrieval got a bit faster.

When you need stable IDs that survive deletes, use IdMapIndex:

import numpy as np
from turbovec import IdMapIndex

# Placeholder data; substitute your own float32 embeddings of shape (n, dim).
vectors = np.random.randn(3, 1536).astype(np.float32)
query = vectors[0]

index = IdMapIndex(dim=1536, bit_width=4)
index.add_with_ids(vectors, np.array([1001, 1002, 1003], dtype=np.uint64))

scores, ids = index.search(query, k=2)
index.remove(1002)  # O(1) by id

What the pgvector benchmarks actually show

I've been exploring TurboQuant for use with pgvector. To evaluate it, I ran the RAG benchmarks created by Johann-Peter Hartmann.

The storage and index scan wins are real. At 4-bit, your vector column shrinks by around 8x, and index scans run faster because you're moving far less data through memory. On a large corpus, that gap is meaningful.

The retrieval quality story is less clean. Quantizing inside pgvector degrades recall measurably compared to full float32 search. You can lose real top candidates from your top-k window. The TurboQuant unbiasedness proof is mathematically correct, but unbiased inner product estimates still carry variance at 4 bits, and in dense retrieval that variance pushes results around. The second-best document in float32 might not appear in your top-10 at 4-bit.

Two cases where the trade-off still makes sense: storage-constrained deployments where approximate retrieval is acceptable, or pipelines that rerank with a cross-encoder anyway (the reranker recovers from retrieval noise). If you're running semantic search where missing the true top result matters, measure your recall on a held-out set before committing.
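Measuring recall on a held-out set takes very little code. In the sketch below, exact float32 search is the ground truth and a crude uniform 4-bit quantizer plays the approximate index. That quantizer is only a stand-in so the example runs anywhere; it is NOT TurboQuant, and its recall number says nothing about turbovec.

```python
import numpy as np

def top_k(scores, k):
    """Indices of the k highest scores in each query row."""
    return np.argsort(-scores, axis=1)[:, :k]

def recall_at_k(exact_ids, approx_ids):
    """Average fraction of the exact top-k that the approximate top-k also returned."""
    hits = [len(set(e) & set(a)) for e, a in zip(exact_ids, approx_ids)]
    return sum(hits) / (len(hits) * exact_ids.shape[1])

def toy_quantize(x, bits=4):
    """Uniform scalar quantizer, used purely as a stand-in for a real index."""
    levels = 2 ** bits
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (levels - 1)
    return np.round((x - lo) / step) * step + lo

rng = np.random.default_rng(0)
corpus = rng.standard_normal((5000, 128)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
queries = rng.standard_normal((50, 128)).astype(np.float32)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

exact = top_k(queries @ corpus.T, k=10)
approx = top_k(queries @ toy_quantize(corpus).T, k=10)
recall = recall_at_k(exact, approx)
print(f"recall@10: {recall:.3f}")
```

To evaluate a real setup, keep `exact` as is and build `approx` from the IDs your quantized index returns for the same queries, using your own embeddings rather than random ones.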

If you want to run this comparison yourself against your own corpus, here's the benchmark setup I used:

import numpy as np
import time
from turbovec import IdMapIndex

dim = 1536
num_vectors = 1_000_000

embeddings = np.random.randn(num_vectors, dim).astype("float32")
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
ids = np.arange(1, num_vectors + 1, dtype=np.uint64)

queries = np.random.randn(100, dim).astype("float32")
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

t0 = time.perf_counter()
index = IdMapIndex(dim=dim, bit_width=4)
index.add_with_ids(embeddings, ids)
print(f"build: {time.perf_counter() - t0:.2f}s")

latencies = []
for q in queries:
    t0 = time.perf_counter()
    index.search(q, k=10)
    latencies.append(time.perf_counter() - t0)

avg_ms = np.mean(latencies) * 1000
p99_ms = np.percentile(latencies, 99) * 1000
print(f"avg: {avg_ms:.2f}ms  p99: {p99_ms:.2f}ms")

No training pass, no codebook warmup. The index is ready to search after the first add_with_ids call. Swap in your real embeddings and IDs, then run the same timing loop against FAISS IndexPQFastScan at the same bit-width to get a direct comparison.

When FAISS is still the right tool

turbovec is an in-memory flat index: it searches all vectors on every query. For a few million vectors on a single machine, that's fine. At hundreds of millions you need IVF partitioning to reduce the search scope, and FAISS handles that.
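For anyone who hasn't used IVF, here's a toy NumPy version of the idea: split the corpus into partitions around coarse centroids, then scan only the few partitions closest to the query. The centroids below are a random sample of the data, which is a shortcut I'm taking for brevity (FAISS trains them with k-means), and none of this reflects FAISS internals.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, nlist, nprobe, k = 20_000, 64, 100, 10, 10

vectors = rng.standard_normal((n, d)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Coarse centroids: a random sample of the data (real IVF trains these with k-means).
centroids = vectors[rng.choice(n, nlist, replace=False)]
assignment = np.argmax(vectors @ centroids.T, axis=1)
lists = [np.flatnonzero(assignment == c) for c in range(nlist)]

def ivf_search(query, nprobe):
    """Scan only the nprobe partitions whose centroids score highest against the query."""
    probe = np.argsort(-(centroids @ query))[:nprobe]
    candidates = np.concatenate([lists[c] for c in probe])
    scores = vectors[candidates] @ query
    order = np.argsort(-scores)[:k]
    return candidates[order], len(candidates)

query = vectors[0]
ids, scanned = ivf_search(query, nprobe)
print(f"scanned {scanned} of {n} vectors")
```

A flat index is the `nprobe == nlist` case: every partition gets scanned. IVF trades a little recall for scanning a fraction of the corpus, and that trade only starts paying off once the corpus is large.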

The ARM picture is clean: turbovec beats FAISS IndexPQFastScan by 12-20% across typical configurations. x86 is more conditional. At 4-bit, turbovec wins by 1-6% due to tighter cache lines and faster bit-unpacking. At 2-bit single-threaded, they run within 1% of each other. At 2-bit multi-threaded on AVX-512 hardware, FAISS pulls ahead by 2-4%; it exploits AVX-512 VBMI for bit manipulation during concurrent sweeps, an instruction path turbovec doesn't yet use. On enterprise x86 with high thread counts at 2-bit, that edge is real.

At high dimensions (d=1536, d=3072), turbovec matches or beats FAISS at R@1; both converge to 1.0 recall by k=4-8. At d=200 (GloVe territory), turbovec trails at R@1 because the near-Gaussian approximation from the random rotation weakens at low dimensions.

The rule: turbovec for local RAG with modern embedding dimensions, FAISS for very large corpora, GPU-accelerated search, or multi-threaded 2-bit lookups on AVX-512 servers.

What I'm using it for

I'm running turbovec in ThoughtForge for per-space semantic search. The nomic-embed-text-v1.5 model produces 768-dimensional embeddings; at 4-bit compression the full index is small enough that loading at app startup takes under a second. Local embeddings, local index, no data leaves the machine.

If you're building local RAG and hitting the float32 memory wall, this is the first thing I'd try.
