Abhishek Gautam

Binary Quantization: the 1-bit trick that turns terabytes of vectors into pocket-sized fingerprints

“If you can’t explain it with a single sign bit, you probably don’t understand it yet.” — a very anonymous engineer 😜

🧭 1. Why you’re here – the memory wall

You already pip install pgvector, CREATE EXTENSION vector, and happily insert 1024-D OpenAI embeddings as vector(1024) rows.
At 32-bit float precision, 1 M vectors × 1024 dims × 4 B ≈ 4 GB.
At 100 M vectors that's 400 GB – far more than a single m7g.8xlarge instance (128 GiB of RAM) can hold, let alone the index built on top of it.
Binary Quantization keeps only the sign of every dimension (+1 or –1) plus the original L2 norm.
The same 100 M vectors shrink to ≈ 12.8 GB of sign bits + 0.4 GB of norms – roughly 30× smaller overall (32× on the vectors themselves) – while recall drops only 2–4 % after a cheap re-ranking step.

In this article, we'll:

  1. Ground ourselves in the distance measures we will use.
  2. Unpack the angular (geometric) intuition behind the binary codes.
  3. Show how to implement binary quantized indexes in PostgreSQL's pgvector.
  4. Walk through full precision vs binary quantized search both with and without re-ranking.

Quick detour: Hamming distance & L2 distance

1️⃣ Hamming Distance

  • What are we comparing? Two equal‑length bitstrings, e.g.
  u = 1 0 1 1 0 1  
  v = 1 1 0 1 0 0
  • Game rule: Count how many positions have different bits.

    • Position 1: 1 vs 1 → same
    • Position 2: 0 vs 1 → different
    • Position 3: 1 vs 0 → different
    • Position 4: 1 vs 1 → same
    • Position 5: 0 vs 0 → same
    • Position 6: 1 vs 0 → different
  • Hamming distance = total “spots” that differ.

    • Here: differences at positions 2, 3, 6 ⇒ Hamming = 3.
  • Why It Matters: Once vectors are bits, Hamming distance (computed via XOR+popcount) gives a lightning‑fast proxy for angular closeness.
    Analogy: Spot‑the‑Difference in two pictures—each mismatch is a “hit” on your scorecard.
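You can try this directly in PostgreSQL on bit strings; a minimal sketch, assuming PostgreSQL 14+ for bit_count (# is bitwise XOR):

   -- XOR flags the differing positions, popcount totals them
   SELECT bit_count(B'101101' # B'110100') AS hamming;  -- returns 3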

2️⃣ L₂ Distance (Euclidean Distance)

  • What are we comparing? Two real‑valued vectors, e.g.
  x = [3, –1, 2]  
  y = [0,  2, 1]
  • Game rule: Imagine each vector as a point in 3‑D space. The L₂ distance is the length of the straight line joining x and y.

$$
d_2(x,y) = \sqrt{(3-0)^2 + (-1-2)^2 + (2-1)^2}
         = \sqrt{3^2 + (-3)^2 + 1^2}
         = \sqrt{9 + 9 + 1}
         = \sqrt{19} \approx 4.36.
$$

Analogy: The shortest path between two cities on a flat map.
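If pgvector is installed, the same number falls out of the <-> (L2 distance) operator; a quick sanity check:

   SELECT '[3,-1,2]'::vector <-> '[0,2,1]'::vector AS l2;  -- ≈ 4.3589 = sqrt(19)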

🔗 Why Both Matter for Quantization

  • Hamming distance gives a binary proxy for “angle” or “similarity” when you compress vectors to bits.
  • L₂ distance (and its cousin, cosine similarity) is the gold standard for comparing the original float vectors.

In our binary‑quantization workflow, we’ll use Hamming as the fast filter, then L₂ (or cosine) on the original floats to refine the final result.


📚 2. A gentle-to-deep walkthrough of Binary Quantization

✅ 2.1 Beginner View (What’s the trick?)

Let’s say you have a vector:

x = [3.2, -0.4, 7.1, 0.0, -2.5, ...]

This is a vector of real numbers (say, length 1024), like you'd get from an embedding (e.g., OpenAI, BERT, etc.).

Now imagine you want to store millions of these — the memory adds up FAST. So here’s a storage trick:

👉 Step-by-step idea:

  1. Throw away the exact values, just keep the sign:
  • Positive → +
  • Negative → –
  • (Usually zero is treated as positive.)

So we get:

   sign(x) = [+, –, +, +, –, ...]

  2. Encode the signs as bits:
  • + → 1
  • – → 0

So now this vector becomes a bit string:

   10110...

  3. Store the original magnitude (optional):
  • Compute the length of the original vector, called its L2 norm: ||x||₂
  • Store this as a single float (4 bytes).

This means instead of storing 1024 floats (1024 × 4 bytes = 4 KB), you store:

  • 1024 bits = 128 bytes
  • Plus one float (magnitude) = 4 bytes

Total: ~132 bytes instead of 4096 bytes! 🎉
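pgvector (0.7.0 and later) can produce both pieces for you; a minimal sketch on the toy vector above (exactly how the 0.0 element is rounded depends on the implementation):

   SELECT binary_quantize('[3.2,-0.4,7.1,0,-2.5]'::vector) AS sign_bits,  -- one bit per dimension
          vector_norm('[3.2,-0.4,7.1,0,-2.5]'::vector)     AS l2_norm;    -- ≈ 8.19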


🧠 2.2 Intermediate View (Why is this useful?)

Even though you’ve thrown away the actual values, you still want to do things like compare vectors (e.g., using cosine similarity or dot products).

So how does comparing just the signs work?

Let’s define:

  • b(x) = binary version of x, where each element is +1 or –1 depending on the sign
  x = [ 3.2, -0.4, 7.1 ] → b(x) = [ +1, –1, +1 ]

Now if you take two binary vectors b(x) and b(y), their dot product (i.e. sum of element-wise products) can be expressed in terms of Hamming distance:

Formula:

b(x) · b(y) = (# of matching signs) – (# of differing signs)
            = d – 2 × Hamming(b(x), b(y))

Where:

  • d is the number of elements (e.g., 1024)
  • Hamming distance = number of positions where the bits differ
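A quick check on the 3-dimensional example above, inventing b(y) = [+1, +1, +1] (bits 111) as the second vector:

   -- d = 3, Hamming = 1, so the signed dot product should be 3 - 2*1 = 1
   SELECT 3 - 2 * bit_count(B'101' # B'111') AS signed_dot;  -- returns 1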

What does this give us?

It gives you an approximate similarity score:

  • Small Hamming distance → more similar
  • Large Hamming distance → more different

And what about cosine similarity?

Cosine similarity is defined as:

cos(θ) = (x · y) / (||x|| * ||y||)

Since we stored the signs (b(x)), and separately stored the magnitude (||x||), we can roughly approximate cosine similarity by:

  1. Using Hamming distance to filter similar vectors
  2. Optionally recovering a more accurate similarity in a second step (the re-ranking pass)

🔬 2.3 Advanced View (Why this approximation works surprisingly well)

Let’s assume x and y are random unit vectors (i.e., their length is 1). Then some deep math shows:

The expected per-dimension agreement of their sign vectors is:

E[b(x) · b(y)] / d = (2/π) × arcsin(cos(θ))

What this means:

  • Even though we only stored the signs (i.e. +1/–1), the dot product still tracks the original cosine similarity quite well.
  • So binary dot product ≈ arcsin of cosine similarity
  • We can even invert this if needed (sketched below).
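Combining the two identities (b(x)·b(y) = d − 2h and the arcsin relation) gives cos θ ≈ sin(π/2 × (d − 2h)/d). A sketch of that inversion, with a made-up Hamming distance of h = 300 over d = 1024 bits:

   SELECT sin(pi()/2 * (1024 - 2*300) / 1024.0) AS est_cosine;  -- ≈ 0.61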

In practice:

  • People often skip the arcsin (for speed), and use Hamming distance as a fast approximation.
  • Then, on the top k closest vectors (say, top 1000), we compute the exact cosine using original vectors.

This is called a two-stage retrieval:

  1. Fast filter: binary Hamming distance
  2. Slow rerank: exact cosine similarity

Every metric you will ever need, explained

Metric      | What it really means                        | How to measure (pgvector 0.8.0) | Good target
Index size  | RAM needed to keep the HNSW graph in memory | pg_relation_size('idx_name')    | < 30 % of the float index
Build time  | CREATE INDEX wall-clock time                | psql \timing                    | roughly linear in ef_construction
QPS         | Queries per second under steady load        | pgbench -P 1 -T 60              | goes up with smaller vectors
p99 latency | 99 % of queries finish below this           | EXPLAIN (ANALYZE, BUFFERS)      | < 5 ms for chat UX
Recall@k    | % of the true top-k returned                | ANN-Benchmarks                  | ≥ 90 % for RAG
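For instance, the first two rows can be checked straight from psql against whatever idx_name you built (docs_vec_hnsw below is the index defined later in this article):

   \timing on                                                  -- report wall-clock time for CREATE INDEX / queries
   SELECT pg_size_pretty(pg_relation_size('docs_vec_hnsw'));   -- on-disk size of the HNSW index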

Which Index to Use with Which Metric

Choosing the right index is like picking the right vehicle for your road trip. 🚗🏍️🚐

Index type  | Best metric | When to use
FLAT        | L₂ / Cosine | Brute-force exact search. Ideal for small datasets or one-off analyses where speed isn't critical.
IVF         | L₂ / Cosine | "Partition your vectors into Voronoi cells": good for medium-to-large data. Tweak nlist (clusters) & nprobe (cells to search) for speed vs. accuracy.
HNSW        | L₂ / Cosine | Graph-based, super low-latency, high-recall. Go-to for real-time apps (recommendations, search engines).
Binary-HNSW | Hamming     | Compressed graph on bit codes: lightweight, blazing Hamming ops for the initial filter; rerank with full floats.

IVF is like speed‑dating your vectors—quickly cluster into small groups. HNSW? More like a LinkedIn network—you hop graph‑links. FLAT? Well, that’s a group hug: you compare everyone to everyone. 🙃
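In pgvector, the IVF knobs go by lists (set at index build time) and ivfflat.probes (set per session); a minimal sketch, assuming an items table with an embedding vector column:

   -- Build an IVFFlat index with 100 clusters (≈ nlist)
   CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

   -- Search 10 of those clusters per query (≈ nprobe)
   SET ivfflat.probes = 10;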


SQL Schema

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE docs (
    id        bigserial PRIMARY KEY,
    title     text,
    body      text,
    full_vec  vector(1536),          -- original for re-rank
    sign_bits bit(1536),             -- 1-bit signature (192 bytes)
    vec_norm  real                   -- 4-byte scalar
);

-- Binary HNSW on the sign bits for fast Hamming-distance filtering (pgvector >= 0.7)
CREATE INDEX docs_sign_hnsw ON docs USING hnsw (sign_bits bit_hamming_ops);

-- HNSW on original vector (for ANN baseline)
CREATE INDEX docs_vec_hnsw ON docs USING hnsw (full_vec vector_cosine_ops)
WITH (m=24, ef_construction=256);
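To populate sign_bits and vec_norm, derive them from the full vector at insert time; a sketch, with $1 standing in for the 1536-dimensional embedding:

   INSERT INTO docs (title, body, full_vec, sign_bits, vec_norm)
   VALUES (
       'some title',
       'some body',
       $1,
       binary_quantize($1::vector(1536)),   -- 1 bit per dimension
       vector_norm($1::vector(1536))        -- original L2 norm, kept for later refinement
   );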

Two‑Phase Search: ANN + KNN Hybrid in SQL

Imagine a bouncer at a club who first does a quick glance (ANN filter) and then a proper ID check (exact KNN). In SQL:

-- Phase 1: coarse ANN filter via Hamming distance on the stored sign bits
WITH candidates AS (
  SELECT id
  FROM docs
  ORDER BY sign_bits <~> binary_quantize($1)  -- Hamming distance, served by the bit HNSW index
  LIMIT 200                                   -- oversample (ef_search-style), then rerank
)
-- Phase 2: precise rerank via exact KNN (cosine here; use <-> for L2)
SELECT d.id
FROM candidates c
JOIN docs d ON d.id = c.id
ORDER BY d.full_vec <=> $1                    -- exact distance on the original floats
LIMIT 10;                                     -- final k
  1. candidates: fast bit‑ops over compressed codes.
  2. rerank: join back to full embeddings for exact sorting.

This hybrid gives the best of both worlds: speed + accuracy.
The coarse pass works only on the tiny sign-bit signatures (cheap even across 1 M rows); the rerank then touches only on the order of 1000 surviving full vectors.


When to Use Binary Quantization in RAG

Retrieval‑Augmented Generation (RAG) pipelines often embed documents and user queries, then fetch top‑k for context injection. When should you slice those embeddings down to bits?

Scenario                                   | Use binary quantization?
Massive document corpora (100M+ vectors)   | ✅ Yes: storage & memory are at a premium. Use BQ + rerank.
Low-latency chatbots (sub-100 ms targets)  | ✅ Yes: a Hamming comparison costs nanoseconds.
Small knowledge bases (< 100k docs)        | ❌ Probably not: FLAT or IVF with scalar quantization suffices.
Fine-grained accuracy (e.g., legal texts)  | ❌ No: a single bit per dimension may lose nuance.

Pro Tip: Azure AI Search reports up to 96 % index size savings and 40 % query latency reduction, while regaining recall via oversampling + reranking.


Experimental Highlights

  • Space Savings: Up to 19× smaller index for 960‑D vectors; 96 % smaller on 1536‑D (Azure AI Search benchmarks).
  • Build‑Time Speedup: 2×–4.5× faster indexing on large dimensions.
  • Recall Trade‑off: Without rerank, recall can plunge to single digits on low‑diversity sets; rerank recovers >90 % on high‑diversity corpora.
  • Throughput Gains: 1.3×–2× QPS boost; 25–30 % p99 latency drop when reranking at moderate ef_search.

Key Takeaways & Recommendations

  1. Pick your index wisely: FLAT for small data; IVF for balanced scale; HNSW for real-time; Binary-HNSW for an ultra-light filter.
  2. Always rerank: binary quantization without rerank is like firing rubber bullets, fast but often inaccurate.
  3. Measure bit diversity: high-dimensional, varied vectors fare best. If recall lags, raise ef_search (see the one-liner below) or bump the rerank size.
  4. Optimize costs: smaller indexes fit on cheaper instances, a big win for RAG at scale.
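In pgvector the relevant knob for item 3 is the hnsw.ef_search setting (default 40):

   SET hnsw.ef_search = 200;  -- more candidates examined: better recall, slower queries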

Future Directions

  • Half-Precision + Binary: quantize floats to 16-bit, then to 1-bit; dual compression!
  • SIMD & AVX-512: PostgreSQL 17 adds AVX-512 popcount support, speeding up the bit counting behind Hamming distance; a speed geek's dream.
  • Jaccard vs. Hamming: Evaluate bitset vs. set‑based distances in pgvector.
  • Billion‑Scale Benchmarks: How does recall hold up at 1 B vectors?
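For the first bullet, pgvector ≥ 0.7 already ships a 16-bit float type; a sketch of what the half-precision leg could look like (docs_half is a hypothetical table):

   CREATE TABLE docs_half (
       id       bigserial PRIMARY KEY,
       half_vec halfvec(1536)              -- 2 bytes per dimension instead of 4
   );

   CREATE INDEX docs_half_hnsw ON docs_half USING hnsw (half_vec halfvec_cosine_ops);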

Conclusion

Binary quantization is not a silver bullet, but in the hands of a discerning engineer it's a sports car on the right road. Pair it with the right index, a two-phase hybrid search, and reranking, and you'll tame even the wildest embedding herds without sacrificing recall or breaking the bank.

Happy vector hunting! 🎯


References
  • Qdrant: Binary Quantization - Vector Search, 40x Faster. https://qdrant.tech/articles/binary-quantization/
  • Pinecone: Nearest Neighbor Indexes for Similarity Search. https://www.pinecone.io/learn/series/faiss/vector-indexes/
  • Macrometa: Index Type. https://www.macrometa.com/docs/search-views/semantic-search/concepts/index-type
  • Razvi Noorul (Medium): Understanding Index Types in Vector Databases: When and Why to Use Them. https://medium.com/%40noorulrazvi/understanding-index-types-in-vector-databases-when-and-why-to-use-them-46ac9a559994
  • Microsoft Tech Community: Binary quantization in Azure AI Search: optimized storage and faster search. https://techcommunity.microsoft.com/blog/azure-ai-services-blog/binary-quantization-in-azure-ai-search-optimized-storage-and-faster-search/4221918
