Binary Quantized Embeddings: 32x Smaller Vectors, Recall Intact

A single text-embedding-3-large vector is 3072 float32 numbers: 12,288 bytes. Store 100 million of them and you are holding 1.2 TB of raw vectors in RAM before you index anything. Binary quantized embeddings turn each of those vectors into 384 bytes — a 32x cut — and if you do the second half of the trick right, your recall barely moves.

Most teams reach for Matryoshka truncation (chop dimensions) or product quantization (learned codebooks) first. Binary quantization is cruder, faster, and in high dimensions it works absurdly well. Here is the mechanism, the failure modes, and the code.

TL;DR

Binary quantized embeddings store 1 bit per dimension (sign of each component) instead of 32, a 32x memory reduction versus float32 and 8x smaller than int8.
Search runs on Hamming distance = popcount(a XOR b): integer XOR plus a hardware POPCNT, no floating-point math, with reported speedups often in the 10-40x range over float dot products.
Naive binary search loses recall. Oversample (retrieve 3-5x candidates by Hamming), then rescore the shortlist with int8 or float vectors — this recovers most of the lost recall.
It works because in high dimensions the sign pattern of an embedding already encodes most of its direction; you are doing SimHash/random-projection LSH implicitly.
It breaks on low-dimensional (<256), un-centered, or heavily skewed embedding spaces. Center (subtract the mean) before you binarize.

What are binary quantized embeddings?

Binary quantized embeddings replace each float dimension with a single bit: 1 if the component is positive, 0 if it is negative (or above/below a learned threshold). A 1024-dim vector collapses from 4096 bytes to 128 bytes — you pack 8 dimensions per byte.

import numpy as np

def binarize(vecs: np.ndarray) -> np.ndarray:
    # vecs: (N, D) float32, ideally mean-centered first
    bits = (vecs > 0.0).astype(np.uint8)      # (N, D) of 0/1
    return np.packbits(bits, axis=1)          # (N, D/8) uint8

That is the entire encoder. No training, no codebook, no calibration set. Compare that to product quantization, which needs k-means over your corpus and a codebook lookup at query time. Binary quantization is a threshold.

The storage math is blunt: float32 is 32 bits/dim, int8 is 8 bits/dim, binary is 1 bit/dim. Against float32 you save 32x; against a well-tuned int8 pipeline you still save 8x. For a billion-vector index that is the difference between a rack of RAM and a single box.
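To make that arithmetic concrete, here is a small sketch that computes raw vector storage for the three precisions. The helper name index_bytes is made up for illustration, and the numbers ignore index overhead, padding, and metadata.

```python
def index_bytes(num_vectors: int, dim: int, bits_per_dim: int) -> int:
    """Raw vector storage in bytes, ignoring index overhead and padding."""
    return num_vectors * dim * bits_per_dim // 8

# Corpus size and dimensionality taken from the intro example
N, D = 100_000_000, 3072

for name, bits in [("float32", 32), ("int8", 8), ("binary", 1)]:
    total = index_bytes(N, D, bits)
    print(f"{name:8s} {total / 1e9:8.1f} GB")
```

For 100 million 3072-dim vectors this prints 1228.8 GB for float32, 307.2 GB for int8, and 38.4 GB for binary.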

Why does 1 bit per dimension keep recall?

Because in high dimensions, the sign pattern of an embedding carries most of its angular information, and semantic search only cares about angle (cosine), not magnitude.

Think of each bit as the answer to "which side of hyperplane i does this vector fall on?" where hyperplane i is an axis of your embedding space. Two vectors pointing in nearly the same direction land on the same side of almost every axis, so their bit strings differ in few positions. Two unrelated vectors disagree on roughly half the bits. That is the same idea as SimHash / random-projection locality-sensitive hashing, with the coordinate axes standing in for random hyperplanes: the Hamming distance between sign vectors is an estimator of the angle between the original vectors, and it grows monotonically with that angle.

The key property is dimensionality. With 1024 or 3072 dimensions you have 1024 or 3072 independent-ish sign bits voting on similarity. The law of large numbers does the denoising for you: individual bit flips from quantization noise average out, and the aggregate Hamming distance tracks cosine closely. This is why binary quantization is fine at 1536 dims and a disaster at 128 — fewer bits, more variance per comparison, and the sign of a small-magnitude component is essentially a coin flip that you have now baked into the index.
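A quick way to see the dimensionality effect is to simulate it. The sketch below is an idealized setup, not a benchmark: it builds pairs of unit vectors at a fixed angle in a random orientation (so the axes behave like random hyperplanes) and measures the fraction of sign bits that disagree. Under that assumption the expected fraction is theta / pi, and the spread around it shrinks as the dimension grows. Real embeddings are not perfectly isotropic, so treat this as an illustration of the mechanism only. All function names here are made up for the example.

```python
import numpy as np

def pair_at_angle(theta: float, dim: int, rng: np.random.Generator):
    """Return two unit vectors separated by angle theta, randomly oriented."""
    u = rng.standard_normal(dim)
    u /= np.linalg.norm(u)
    z = rng.standard_normal(dim)
    w = z - (z @ u) * u              # part of z orthogonal to u
    w /= np.linalg.norm(w)
    return u, np.cos(theta) * u + np.sin(theta) * w

def sign_disagreement(x: np.ndarray, y: np.ndarray) -> float:
    """Fraction of dimensions whose sign bits differ (Hamming distance / D)."""
    return float(np.mean((x > 0) != (y > 0)))

def spread(theta: float, dim: int, trials: int = 200, seed: int = 0):
    """Mean and standard deviation of the disagreement over random pairs."""
    rng = np.random.default_rng(seed)
    vals = [sign_disagreement(*pair_at_angle(theta, dim, rng))
            for _ in range(trials)]
    return float(np.mean(vals)), float(np.std(vals))

theta = 0.8  # radians, an arbitrary example angle
for dim in (128, 1024, 3072):
    mean, std = spread(theta, dim)
    print(f"dim={dim:5d}  mean={mean:.3f}  std={std:.3f}  "
          f"theta/pi={theta / np.pi:.3f}")
```

The mean should sit near theta / pi at every dimension; what changes is the standard deviation, which is the per-comparison noise that makes 128-dim binary vectors unreliable.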

Modern text embedding models (OpenAI text-embedding-3-*, Cohere embed-v3, the top open MTEB models) are trained at 768–3072 dims specifically in the regime where this holds. Cohere and several vector DBs ship binary output as a first-class option for exactly this reason.

How do you search binary vectors fast?

You compute Hamming distance, which is popcount(a XOR b) — XOR the two bit strings, count the set bits. No multiplications, no floating point.

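# Popcount of every possible byte value, built once instead of on every call
_POPCOUNT_LUT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming_topk(query_bits: np.ndarray, db_bits: np.ndarray, k: int):
    # query_bits: (D/8,), db_bits: (N, D/8), both uint8
    xor = np.bitwise_xor(db_bits, query_bits)          # (N, D/8), query broadcasts over rows
    # popcount per byte via lookup table, then sum across the row
    dist = _POPCOUNT_LUT[xor].sum(axis=1, dtype=np.int64)   # (N,) Hamming distances
    k = min(k, dist.shape[0])                          # argpartition requires kth < N
    idx = np.argpartition(dist, k - 1)[:k]             # k smallest distances, unordered
    return idx[np.argsort(dist[idx])]                  # order the shortlist by distance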

On real hardware you would not use a numpy lookup table. You would use the CPU POPCNT instruction (roughly one 64-bit word per cycle of throughput on current x86 cores) or the AVX-512 VPOPCNTQ variant, and the whole 128-byte comparison for a 1024-dim vector is a handful of SIMD instructions. A float32 dot product over 1024 dims is 1024 multiply-adds and touches 4 KB of memory per vector; the binary version touches 128 bytes and does bitwise ops. The memory-bandwidth win alone (32x fewer bytes streamed) usually matters more than the ALU win, because brute-force vector search is memory-bound.
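If you want to stay in numpy but get closer to the word-at-a-time behavior described above, one option is to view the packed bytes as 64-bit words and use np.bitwise_count, which was added in NumPy 2.0. The sketch below assumes the packed row length is a multiple of 8 bytes (true for 1024 or 3072 dims) and falls back to a byte lookup table on older NumPy. The function name hamming_words is made up for this example.

```python
import numpy as np

# Byte-wise popcount table, used only as a fallback on NumPy < 2.0
_LUT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming_words(query_bits: np.ndarray, db_bits: np.ndarray) -> np.ndarray:
    """Hamming distance from one packed query to every packed row of db_bits.

    Assumes both arrays are uint8 and each row is a multiple of 8 bytes long.
    """
    q = np.ascontiguousarray(query_bits).view(np.uint64)    # (D/64,)
    db = np.ascontiguousarray(db_bits).view(np.uint64)      # (N, D/64)
    x = np.bitwise_xor(db, q)                                # differing bits
    if hasattr(np, "bitwise_count"):                         # NumPy 2.0+
        return np.bitwise_count(x).sum(axis=1, dtype=np.int64)
    return _LUT[x.view(np.uint8)].sum(axis=1, dtype=np.int64)
```

This returns the same distances as the lookup-table version; whether it is faster on your machine depends on the NumPy build, so time it before relying on it.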

This is the underrated part: binary quantization lets you do an exact brute-force scan over tens of millions of vectors on one machine, with no ANN graph and no HNSW index to build or keep warm. Simplicity has real operational value.

Why must you oversample and rescore?

Because raw binary Hamming ranking is a coarse filter, not a final ranking. The fix that makes binary quantization production-viable is a two-stage retrieve: use Hamming to fetch a shortlist much larger than k, then re-rank that shortlist with higher-precision vectors.

def search(query_f32, db_bits, db_int8, k=10, oversample=4):
    # Stage 1: cheap Hamming scan over the whole corpus.
    # If the corpus was mean-centered before binarizing, subtract the same mean from the query first.
    q_bits = binarize(query_f32[None, :])[0]
    cand = hamming_topk(q_bits, db_bits, k * oversample)   # e.g. 40 candidates

    # Stage 2: rescore ONLY the shortlist with int8 (or float) vectors.
    # Assumes db_int8 was quantized with one global scale, so raw dot products keep the ranking.
    q = query_f32.astype(np.float32)
    scores = db_int8[cand].astype(np.float32) @ q          # 40 dot products, not N
    order = np.argsort(-scores)[:k]
    return cand[order]

The economics: stage 1 scans all N vectors but each comparison is trivial and cache-friendly. Stage 2 does higher-precision math on only k * oversample vectors: 40, not 40 million. You keep the int8 (or even float) vectors on disk or in cheaper memory and only fault in the shortlist. This "search on binary, rank on higher precision" pattern is what vector databases such as Qdrant, Vespa, and Weaviate support in one form or another.
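The search function above takes db_int8 as given. One simple way to produce it is symmetric scalar quantization with a single global scale, sketched below. This is an assumption for illustration, not the only scheme: per-dimension or per-vector scales are also common, but a single scale has the convenient property that raw int8 dot products rank candidates the same way as dequantized ones. The names quantize_int8 and dequantize_int8 are made up for this example.

```python
import numpy as np

def quantize_int8(vecs: np.ndarray):
    """Symmetric int8 quantization with one global scale for the whole corpus."""
    scale = float(np.abs(vecs).max()) / 127.0
    if scale == 0.0:                      # all-zero input: avoid dividing by zero
        scale = 1.0
    q = np.clip(np.round(vecs / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original float vectors."""
    return q.astype(np.float32) * scale
```

The reconstruction error per component is at most half a quantization step (scale / 2), which is why int8 rescoring usually costs very little recall.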

How much oversampling? It is a recall/latency dial. Oversample of 1 (no rescoring) leaves real recall on the table. Push oversample to 3-5x and reports consistently show recall climbing back to within a hair of the full-float baseline. I would not quote you an exact percentage — it depends on your model and corpus — but the shape is reliable: the recall-vs-oversample curve rises steeply then flattens, and by ~4x you have paid a small latency cost to buy back nearly all the recall you lost to 1-bit quantization. Measure it on your own data with a labeled query set; do not trust a blog's number, including this one.
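Measuring that curve takes very little code. The sketch below sweeps the oversample factor and reports recall@k against an exact float search used as ground truth. It is self-contained, so it re-implements a compact two-stage search instead of reusing the functions above, and it rescores with float vectors for simplicity. The names recall_at_k, two_stage, and recall_sweep are made up for this example; with a labeled query set you would swap the exact-search ground truth for your relevance labels.

```python
import numpy as np

def recall_at_k(approx_ids: np.ndarray, exact_ids: np.ndarray) -> float:
    """Fraction of the exact top-k that the approximate search also returned."""
    return len(set(approx_ids.tolist()) & set(exact_ids.tolist())) / len(exact_ids)

def two_stage(query, db, db_bits, k, oversample):
    """Hamming shortlist of k * oversample candidates, then float rescoring."""
    q_bits = np.packbits(query > 0)
    dist = np.unpackbits(np.bitwise_xor(db_bits, q_bits), axis=1).sum(axis=1)
    n_cand = min(k * oversample, len(db))
    cand = np.argpartition(dist, n_cand - 1)[:n_cand]
    scores = db[cand] @ query
    return cand[np.argsort(-scores)[:k]]

def recall_sweep(queries, db, k=10, factors=(1, 2, 4, 8)):
    """Mean recall@k for each oversample factor, versus exact float search."""
    db_bits = np.packbits(db > 0, axis=1)
    results = {}
    for f in factors:
        recalls = []
        for q in queries:
            exact = np.argsort(-(db @ q))[:k]
            recalls.append(recall_at_k(two_stage(q, db, db_bits, k, f), exact))
        results[f] = float(np.mean(recalls))
    return results
```

Run recall_sweep on a sample of real queries and your real corpus, plot the result, and pick the smallest factor where the curve has flattened.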

When does binary quantization fail?

Two concrete failure modes, both of which show up as a recall cliff you cannot rescore your way out of, plus one query-side technique that wins some precision back:

Low dimensionality. Below ~256 dims there are too few bits to average out quantization noise. If your model outputs 384-dim vectors you are close to that cliff and binary is risky; int8 is the safer floor. This is also why you should be careful stacking binarization on top of aggressive Matryoshka truncation: truncating to 256 dims then binarizing gives you 256 noisy bits.

Un-centered embeddings. If a dimension is almost always positive across your corpus, its sign bit is almost always 1 and carries almost no discriminative information: you wasted a bit. Subtract the corpus mean before thresholding so each bit splits the data roughly in half. Some pipelines go further and apply a rotation (an orthogonal transform) to spread information evenly across dimensions before binarizing. A random rotation is the cheap version; ITQ (iterative quantization) learns the rotation from the data.
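Centering and rotation are both a few lines of numpy. The sketch below fits a mean on the corpus, optionally applies a random orthogonal rotation, and then binarizes. It shows the random-rotation variant only, not ITQ's learned rotation, and the function names are made up for this example. Whatever mean and rotation you fit on the corpus must also be applied to every query.

```python
import numpy as np

def fit_center(corpus: np.ndarray) -> np.ndarray:
    """Per-dimension mean of the corpus, stored and reused for queries."""
    return corpus.mean(axis=0)

def random_rotation(dim: int, seed: int = 0) -> np.ndarray:
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))    # sign fix so the rotation is uniformly distributed

def binarize_centered(vecs, mean, rotation=None) -> np.ndarray:
    """Subtract the corpus mean, optionally rotate, then keep only the signs."""
    x = vecs - mean
    if rotation is not None:
        x = x @ rotation
    return np.packbits(x > 0, axis=1)
```

A useful sanity check after centering is the fraction of ones per bit position across the corpus: values far from 0.5 mark bits that carry little information.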

Asymmetric query handling. A subtle win: you do not have to binarize the query. Keep the query in float or int8 and compare it against the binary database vectors with an asymmetric distance (this is asymmetric distance computation, borrowed from PQ). You lose no query-side precision and the database stays 32x smaller. If your library supports asymmetric binary search, use it.
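A minimal sketch of the asymmetric idea: treat each database bit as +1 or -1 and take the dot product with the unquantized query. This is one simple form of asymmetric scoring, written for clarity. It unpacks the whole database to floats, which throws away the memory advantage, so a real implementation would work on the packed bits with lookup tables or SIMD. The function name asymmetric_scores is made up for this example.

```python
import numpy as np

def asymmetric_scores(query_f32: np.ndarray, db_bits: np.ndarray, dim: int) -> np.ndarray:
    """Score a float query against packed sign bits, one score per database row.

    Each stored bit b is read as a sign (2b - 1), so the score is the sum of
    query components, added where the bit is 1 and subtracted where it is 0.
    """
    signs = np.unpackbits(db_bits, axis=1, count=dim).astype(np.float32) * 2.0 - 1.0
    return signs @ query_f32.astype(np.float32)
```

Higher is better here, so you would take the top k * oversample scores as the shortlist instead of the smallest Hamming distances.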

Binary vs int8 vs Matryoshka: which do you pick?

They compose; they are not rivals. Int8 (8 bits/dim, 4x smaller) is the conservative choice with negligible recall loss and no rescoring needed — start here if you are unsure. Matryoshka truncation cuts dimensions and is orthogonal: you can Matryoshka-truncate 3072→1024 dims then binarize those 1024 dims for a combined ~96x reduction, as long as you stay above the low-dim cliff. Binary is the aggressive tier for when memory is the binding constraint and you have the query set to tune oversampling.
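The combination of Matryoshka truncation and binarization can be sketched as follows. It assumes the model was trained with Matryoshka-style objectives so that the leading dimensions are meaningful on their own; truncating an arbitrary embedding this way is not safe. The function name truncate_and_binarize and its mean parameter are made up for this example.

```python
import numpy as np

def truncate_and_binarize(vecs: np.ndarray, keep_dims: int, mean=None) -> np.ndarray:
    """Keep the leading keep_dims dimensions, center them, and pack the sign bits."""
    head = vecs[:, :keep_dims]
    if mean is None:
        mean = head.mean(axis=0)       # in production, use a stored corpus mean
    return np.packbits((head - mean) > 0, axis=1)

full_dim, keep_dims = 3072, 1024
# float32 at full width versus 1 bit per kept dimension
ratio = (full_dim * 32) / (keep_dims * 1)
print(f"combined reduction: {ratio:.0f}x")
```

For 3072 float32 dimensions cut to 1024 bits the ratio is 96x, which matches the figure above.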

The production stack that actually ships at scale is usually a hierarchy: binary vectors in RAM for the fast coarse scan, int8 vectors on SSD for rescoring the shortlist, and full float32 kept only for the handful you finally return or feed to a cross-encoder reranker. Each tier does the precision the previous tier could not afford.

So do binary quantized embeddings hurt recall?

Not if you rescore. Binary quantized embeddings store 1 bit per dimension — the sign of each component — which shrinks vectors 32x versus float32 and lets you search with popcount(XOR) Hamming distance instead of float dot products. On its own, that coarse ranking loses recall. The fix is a two-stage retrieve: scan the whole corpus in binary to build a shortlist 3-5x larger than k, then re-rank that shortlist with int8 or float vectors. This recovers nearly all the lost recall while keeping the 32x memory win on the bulk of your index. It works because high-dimensional embeddings encode most of their direction in their sign pattern, so Hamming distance tracks cosine similarity closely — but only above ~256 dimensions, and only if you center your vectors before thresholding. Measure the recall/oversample curve on your own labeled queries; the mechanism is reliable, the exact numbers are yours.